Remove Outliers

Function to remove outliers using the interquartile range (IQR):

def remove_outliers_iqr(data):
    # Calculate the first and third quartiles
    q1, q3 = data.quantile([0.25, 0.75])

    # Calculate the interquartile range
    iqr = q3 - q1

    # Calculate the lower and upper bounds
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)

    # Return the data without the outliers (values on the bounds are kept)
    return data[(data >= lower_bound) & (data <= upper_bound)]

This function takes a pandas Series data and calculates its first and third quartiles, which are the 25th and 75th percentiles, respectively. It then calculates the interquartile range (IQR), which is the difference between the two quartiles. The lower bound is 1.5 times the IQR below the first quartile, and the upper bound is 1.5 times the IQR above the third quartile. Finally, it uses boolean indexing to return only the values within those bounds, dropping the outliers that fall outside them. Because the two quartiles are unpacked directly from the result of quantile, the function works on a Series only; for a DataFrame, apply it to one numeric column at a time.
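To make the arithmetic concrete, here is a minimal sketch on a small made-up Series. The helper name iqr_bounds, its multiplier parameter k, and the sample values are assumptions for illustration and are not part of the snippet above.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds for a numeric Series (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up sample: six typical values and one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 100])

# Here q1 = 11.0 and q3 = 12.5, so the IQR is 1.5
lower, upper = iqr_bounds(values)
print(lower, upper)  # 8.75 14.75

# Keep only the values inside the bounds
filtered = values[(values >= lower) & (values <= upper)]
print(filtered.tolist())  # [10, 12, 11, 13, 12, 11]
```

Only the value 100 lies outside the range from 8.75 to 14.75, so it is the only value dropped.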

Function to remove outliers using the Z-score:

def remove_outliers_zscore(data, threshold=3):
    # Calculate the mean and the (sample) standard deviation
    mean = data.mean()
    std = data.std()

    # Calculate the Z-score of each value
    z_scores = (data - mean) / std

    # Return the data without the outliers
    return data[z_scores.abs() < threshold]

This function takes a pandas Series data and calculates the Z-score of each value, which is the number of standard deviations that value lies from the mean. It then uses boolean indexing to return the data without the outliers, that is, the values whose absolute Z-score is greater than or equal to the threshold (which is set to 3 by default).
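One practical caveat is worth illustrating: the Z-score approach depends on sample size. The sketch below assumes the sample standard deviation (the pandas default, ddof=1); the function name zscore_filter and both sample Series are made up for the example.

```python
import pandas as pd


def zscore_filter(series, threshold=3.0):
    """Keep values whose absolute Z-score is below the threshold (hypothetical helper)."""
    # pandas' std() uses the sample standard deviation (ddof=1) by default
    z_scores = (series - series.mean()) / series.std()
    return series[z_scores.abs() < threshold]


# 20 made-up readings: 19 typical values and one extreme value
large = pd.Series([10, 12, 11, 13, 12, 11] * 3 + [12, 100])
print(zscore_filter(large).max())  # 13

# Only 7 readings: the extreme value is NOT removed
small = pd.Series([10, 12, 11, 13, 12, 11, 100])
print(len(zscore_filter(small)))  # 7
```

With the sample standard deviation, no value in a sample of n points can have an absolute Z-score above (n - 1) / sqrt(n). For n = 7 that limit is about 2.27, so a threshold of 3 never removes anything; the limit first exceeds 3 at n = 11. For small samples, the IQR method or a lower threshold may be a better fit.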