Empty Cells
Empty cells, also known as missing values, can lead to incorrect results or errors during data analysis, so it is important to deal with them before moving forward.
Remove Rows with Empty Cells
A simple and common way to handle missing data is by removing rows that contain any empty cells.
This approach is acceptable when:
- The dataset is large, and
- The number of missing entries is relatively small.
In such cases, dropping a few rows won’t significantly affect the overall analysis.
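Before dropping anything, it can help to check how many cells are actually missing. The following is a minimal sketch, assuming a small made-up DataFrame in place of data.csv; the column names and values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "Duration": [60, 60, 45, 30, 60],
    "Calories": [409.1, np.nan, 282.4, np.nan, 300.0],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Fraction of rows that contain at least one missing value
fraction_rows_affected = df.isna().any(axis=1).mean()

print(missing_per_column)
print(f"Rows with missing data: {fraction_rows_affected:.0%}")
```

If the reported fraction is small compared to the size of the dataset, removing those rows is usually a reasonable choice.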
Program
Remove Rows with Missing Data
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
This will return a new DataFrame (new_df) that excludes any rows with missing values.
By default, the dropna() method does not modify the original DataFrame. Instead, it returns a new one.
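To see that the original DataFrame stays untouched, here is a small sketch that uses made-up data instead of data.csv (the values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, np.nan, 195.0],
})

# dropna() returns a new DataFrame without the incomplete row
new_df = df.dropna()

print(len(df))      # 3, the original still has every row
print(len(new_df))  # 2, the row with the empty cell is gone
```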
If you want to remove rows with missing values directly from the original DataFrame, you need to use the inplace=True argument.
Program
Drop Missing Values In-Place
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace=True)
print(df.to_string())
Now, dropna(inplace=True) updates the original df and returns None instead of a new DataFrame.
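The return value is easy to overlook, so the sketch below makes it visible. It assumes a small made-up DataFrame in place of data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, np.nan, 195.0],
})

# With inplace=True the method changes df itself and returns None
result = df.dropna(inplace=True)

print(result)   # None
print(len(df))  # 2
```

For this reason, avoid writing df = df.dropna(inplace=True), because that would replace df with None.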
Replacing Empty Values
Instead of removing rows with missing data, you can replace empty cells with a specific value. This approach helps preserve the rest of the dataset, which might still be valuable.
Pandas provides the fillna() method to replace missing values with a chosen value.
Example
Replace all missing (NaN) values with the number 130
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace=True)
This keeps every row in the dataset. Note, however, that fillna(130) puts the same value into the empty cells of every column, which may not be meaningful for all of them.
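To illustrate what a single fill value does across a whole DataFrame, here is a minimal sketch with made-up data (the "Pulse" and "Calories" values are assumptions, not taken from data.csv). It uses the non-inplace form so the original and the filled result can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with empty cells in two different columns
df = pd.DataFrame({
    "Pulse": [110, np.nan, 103],
    "Calories": [409.1, 282.4, np.nan],
})

# A single value is written into every empty cell, whatever the column
filled = df.fillna(130)

print(filled)
```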
Replace Missing Values in Specific Columns
If you only want to replace empty cells in a particular column, you don’t need to apply changes to the entire DataFrame. Instead, you can target specific columns by providing the column name.
Pandas supports this by passing fillna() a dictionary that maps each column name to its replacement value.
Example
Replace missing values in the "Calories" column with the number 130
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna({"Calories": 130}, inplace=True)
This method ensures that only the "Calories" column is affected, leaving the rest of the DataFrame unchanged.
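The effect of the dictionary form can be checked on a small example. This is a sketch with made-up data (the values are assumptions for illustration), using the non-inplace form so the result is returned as a new DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with empty cells in two different columns
df = pd.DataFrame({
    "Pulse": [110, np.nan, 103],
    "Calories": [409.1, 282.4, np.nan],
})

# Only the columns named in the dictionary are filled
filled = df.fillna({"Calories": 130})

print(filled["Calories"].isna().sum())  # 0, the column is complete
print(filled["Pulse"].isna().sum())     # 1, this column was left alone
```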
Replace Missing Values with Mean, Median, or Mode
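A common technique to handle missing data is to replace empty cells with the mean, median, or mode of the column. Because the replacement is calculated from the column's own values, it stays within the range of the existing data, although filling many cells this way makes the column look less varied than it really is.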
Pandas provides built-in methods to calculate these statistical values:
- mean() – average value
- median() – middle value
- mode() – most frequent value
Example
Replace missing values in the "Calories" column with the mean
import pandas as pd
df = pd.read_csv('data.csv')
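# mean() ignores the empty cells when calculating the average
x = df["Calories"].mean()
# Fill only the "Calories" column with that average
df.fillna({"Calories": x}, inplace=True)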
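You can follow the same approach with median() or mode()[0], depending on your data. The [0] is needed because mode() returns a Series (several values can tie for most frequent), and [0] picks the first of them.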
- Mean = the average value (the sum of all values divided by number of values).
- Median = the value in the middle, after you have sorted all values ascending.
- Mode = the value that appears most frequently.
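The three statistics can give quite different replacement values. The sketch below compares them on a small made-up "Calories" column (the numbers are assumptions chosen so the results differ) and then fills the empty cell with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one empty cell
df = pd.DataFrame({"Calories": [100.0, 100.0, np.nan, 300.0, 500.0]})

# All three methods skip the empty cell by default
mean_value = df["Calories"].mean()       # 250.0
median_value = df["Calories"].median()   # 200.0
mode_value = df["Calories"].mode()[0]    # 100.0

# Fill the empty cell with the median, returning a new DataFrame
filled = df.fillna({"Calories": median_value})

print(mean_value, median_value, mode_value)
print(filled)
```

As a rough guide, the median is less affected by extreme values than the mean, while the mode is also available for non-numeric columns.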