Removing Outliers: Refining the Academic Performance Dataset

Introduction

Welcome back to our data wrangling journey with the academic performance dataset. In this blog post, our goal is to deal with outliers so that the data gives a more accurate picture of student performance. Using Python and the pandas, numpy, matplotlib, and scipy libraries, we will identify outliers with the z-score method, remove them, and compare boxplots before and after. The result is a cleaner dataset that later analyses can rely on.

1. Loading the Academic Performance Dataset

To kickstart our journey, let's load the academic performance dataset using the following code:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore

# Load the dataset
df = pd.read_csv('Academic_Performance.csv')
```

2. Tackling Outliers

Outliers can significantly impact our analysis by distorting results and affecting statistical measures. Let's begin by examining the dataset for any missing values using the `isnull().sum()` function in pandas: 

```
# Count the missing values in each column
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)
```

To handle missing values, we'll employ mean imputation, as shown in the previous post. Now, let's move on to tackling outliers.
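The mean imputation mentioned above is not repeated in this post. As a minimal sketch, assuming a small made-up DataFrame (`sample` and its values are hypothetical stand-ins for the real file), it could look like the code below. Doing this before the z-score step matters: by default `scipy.stats.zscore` propagates NaN, so a column that still contains a missing value produces NaN z-scores throughout and none of its values would be flagged.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for Academic_Performance.csv
sample = pd.DataFrame({
    "Sem1": [7.0, np.nan, 8.0, 9.0],
    "Sem2": [6.0, 7.0, np.nan, 8.0],
})

# Mean of each column, computed while skipping missing values
column_means = sample.mean()

# Replace each missing value with the mean of its own column
imputed = sample.fillna(column_means)

print(imputed)
print("Missing values left:", int(imputed.isnull().sum().sum()))
```

On the real dataset, the same two calls would be applied to the numeric columns only.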

3. Identifying and Removing Outliers

We'll focus on the numeric columns: 'Sem1', 'Sem2', 'Sem3', 'Sem4', 'Sem5', 'Sem6', 'Sem7', 'Sem8', and 'Average'. Before identifying outliers, let's visualize the boxplot of these variables:

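```
# Numeric columns to check for outliers
numeric_columns = ['Sem1', 'Sem2', 'Sem3', 'Sem4', 'Sem5',
                   'Sem6', 'Sem7', 'Sem8', 'Average']

# Boxplot of every numeric column before any rows are removed
plt.figure(figsize=(8, 6))
df[numeric_columns].boxplot()
plt.title("Boxplot Before Handling Outliers")
plt.xlabel("Variable")
plt.ylabel("Value")
plt.show()
```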

The output will look like this:

[Figure: Boxplot Before Handling Outliers]


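Using the z-score method, we can identify and remove outliers. A z-score measures how many standard deviations a value lies from its column's mean; here we drop every row in which at least one numeric column has an absolute z-score above 3. The following code accomplishes this:

```
# Z-score of every value, computed column by column
z_scores = zscore(df[numeric_columns])

# Flag rows where any column is more than 3 standard deviations from its mean
outliers = (np.abs(z_scores) > 3).any(axis=1)

# Keep only the rows that were not flagged
df = df[~outliers]
```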
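To make the z-score rule concrete, here is a small sketch on made-up numbers (the `scores` DataFrame is hypothetical, not taken from the dataset). It computes the z-scores by hand and checks them against `scipy.stats.zscore`, which uses the population standard deviation (`ddof=0`) by default, whereas pandas' `std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical scores: 19 typical values plus one extreme value (1.0)
scores = pd.DataFrame({"Sem1": [7.0, 7.5, 8.0, 7.2, 7.8, 7.4, 7.6, 7.1, 7.9, 7.3,
                                7.7, 7.5, 7.2, 7.8, 7.4, 7.6, 7.0, 8.0, 7.5, 1.0]})

# Manual z-score: distance from the column mean, in units of the
# population standard deviation (ddof=0 matches scipy's default)
manual_z = (scores - scores.mean()) / scores.std(ddof=0)

# The same quantity computed by scipy
scipy_z = zscore(scores)
print("Manual and scipy agree:", np.allclose(manual_z.to_numpy(), np.asarray(scipy_z)))

# Apply the same rule as in the post: flag rows with any |z| > 3
flagged = (np.abs(manual_z) > 3).any(axis=1)
cleaned = scores[~flagged]

print("Rows flagged:", int(flagged.sum()))
print("Rows kept:", len(cleaned))
```

One design note: with this formula, the largest possible absolute z-score in a column of n values is sqrt(n - 1), so with ten or fewer rows nothing can exceed the threshold of 3. The rule only becomes meaningful on reasonably sized datasets.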

After removing the outliers, let's visualize the boxplot again to observe the impact: 

```
# Boxplot of the same columns after the flagged rows were dropped
plt.figure(figsize=(8, 6))
df[numeric_columns].boxplot()
plt.title("Boxplot After Removing Outliers")
plt.xlabel("Variable")
plt.ylabel("Value")
plt.show()
```

The output will look like this:

[Figure: Boxplot After Removing Outliers]
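A boxplot may still show a few individual points after the z-score filter. That is expected: by default the boxplot draws as separate points anything beyond 1.5 times the interquartile range from the box, which is a different rule from |z| > 3. The sketch below, on hypothetical values (`values` is made up for illustration), compares the two rules on the same column.

```python
import pandas as pd

# Hypothetical column: nine typical scores and one low score
values = pd.Series([7.0, 7.2, 7.4, 7.5, 7.5, 7.6, 7.8, 8.0, 7.3, 4.0])

# Boxplot rule: points beyond 1.5 * IQR from the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
beyond_whiskers = (values < lower) | (values > upper)

# Z-score rule used in this post: |z| > 3 (population std, ddof=0)
z = (values - values.mean()) / values.std(ddof=0)
beyond_z = z.abs() > 3

print("Flagged by the boxplot rule:", int(beyond_whiskers.sum()))
print("Flagged by the z-score rule:", int(beyond_z.sum()))
```

Here the low score is drawn as a point by the boxplot but survives the z-score filter, so the two views do not have to agree.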


4. Saving the Modified Dataset

To preserve our efforts and access the refined dataset later, let's save it to a CSV file using the following code: 
```
df.to_csv('Modified_Academic_Performance.csv', index=False) 
```
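As a quick sanity check, the saved file can be read back and compared with what was written. The sketch below uses a tiny hypothetical DataFrame (`cleaned`) and a temporary directory so it does not touch the real file; only the file name is taken from the post.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned dataset
cleaned = pd.DataFrame({"Sem1": [7.0, 7.5], "Average": [7.25, 7.75]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Modified_Academic_Performance.csv")

    # index=False keeps the row index out of the file
    cleaned.to_csv(path, index=False)

    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(path)

print("Columns:", list(reloaded.columns))
print("Shape preserved:", reloaded.shape == cleaned.shape)
```

Without `index=False`, the row index would be written as an extra first column, which shows up as `Unnamed: 0` when the file is loaded again.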

Summary

In this blog post, we embarked on a mission to conquer outliers lurking within the academic performance dataset. By leveraging the power of Python and essential libraries, we successfully identified and removed outliers, ensuring a more reliable and accurate representation of the data. Through insightful visualization and robust outlier detection techniques, we refined the dataset and set the stage for more trustworthy analyses. 

Stay tuned for more data wrangling adventures, where we unravel the mysteries hidden within datasets and extract valuable insights! 

Happy data exploration, and may your outliers never stand in the way of truth!

Click here to access the dataset: Academic_Performance.csv

