Pandas - Cleaning Empty Cells

Dhanapriya D

Empty Cells

Empty cells, also known as missing values, can lead to incorrect results or errors during data analysis, so it is important to deal with them before moving forward.

Remove Rows with Empty Cells

A simple and common way to handle missing data is by removing rows that contain any empty cells.

This approach is acceptable when:
  • The dataset is large, and
  • The number of missing entries is relatively small

In such cases, dropping a few rows won’t significantly affect the overall analysis.
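Before dropping anything, it can help to check how many rows would actually be lost. The sketch below is only an illustration: it builds a small made-up DataFrame instead of reading data.csv, and the "Duration" column name is an assumption.

```python
import pandas as pd

# Made-up sample data; None marks an empty cell.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 30, 60],
    "Calories": [409.1, None, 340.0, 282.4, 300.0],
})

# Fraction of rows that contain at least one missing value.
missing_fraction = df.isna().any(axis=1).mean()
print(f"{missing_fraction:.0%} of rows have an empty cell")  # 20% of rows have an empty cell
```

If that fraction is small relative to the size of the dataset, removing those rows is usually a reasonable choice.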

Program

Remove Rows with Missing Data

import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())

This will return a new DataFrame (new_df) that excludes any rows with missing values.

By default, the dropna() method does not modify the original DataFrame. Instead, it returns a new one.
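A quick way to see this is to compare the row counts of both DataFrames. The example uses a small made-up DataFrame (the "Duration" column is an assumption) so it can run without data.csv.

```python
import pandas as pd

# Made-up sample data with one empty cell in "Calories".
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, None, 282.4],
})

new_df = df.dropna()

print(len(df))      # 3, the original is unchanged
print(len(new_df))  # 2, the row with the empty cell is gone
```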

If you want to remove rows with missing values directly from the original DataFrame, you need to use the inplace=True argument.

Program

Drop Missing Values In-Place

import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace=True)
print(df.to_string())
 
Now, dropna(inplace=True) updates the original df and does not return a new DataFrame.
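The sketch below shows the in-place behaviour on a small made-up DataFrame (the "Duration" column is an assumption), along with the reassignment style, which gives the same result without inplace.

```python
import pandas as pd

# Made-up sample data with one empty cell in "Calories".
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, None, 282.4],
})

rows_before = len(df)
df.dropna(inplace=True)  # modifies df itself
rows_after = len(df)

print(rows_before, rows_after)  # 3 2

# Same result without inplace: reassign the DataFrame that dropna() returns.
other = pd.DataFrame({"Calories": [409.1, None, 282.4]})
other = other.dropna()
```

Which style to use is mostly a matter of preference. Reassignment makes it more obvious that the variable now holds different data.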

Replacing Empty Values

Instead of removing rows with missing data, you can replace empty cells with a specific value. This approach helps preserve the rest of the dataset, which might still be valuable.

Pandas provides the fillna() method to replace missing values with a chosen value.

Example

Replace all missing (NULL) values with the number 130

import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(130, inplace=True)

This keeps every row in the dataset. Note that the same value is written into every empty cell in every column, so it only makes sense when that value is a reasonable stand-in for all of them.
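To see the effect on more than one column, here is a small sketch with made-up data (the "Duration" column is an assumption). Both columns have an empty cell, and both cells are filled with 130.

```python
import pandas as pd

# Made-up sample data with an empty cell in each column.
df = pd.DataFrame({
    "Duration": [60, None, 45],
    "Calories": [409.1, 479.0, None],
})

# A single value passed to fillna() is used for every column.
df.fillna(130, inplace=True)
print(df.to_string())
```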

Replace Missing Values in Specific Columns

If you only want to replace empty cells in a particular column, you don’t need to apply changes to the entire DataFrame. Instead, you can target specific columns by providing the column name.

Pandas allows this using the fillna() method with a dictionary.

Example

Replace missing values in the "Calories" column with the number 130

import pandas as pd
df = pd.read_csv('data.csv')
df.fillna({"Calories": 130}, inplace=True)

This method ensures that only the "Calories" column is affected, leaving the rest of the DataFrame unchanged.
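The following sketch checks that claim on a small made-up DataFrame (the "Duration" column is an assumption): after the call, "Calories" has no empty cells, while the empty cell in "Duration" is still there.

```python
import pandas as pd

# Made-up sample data with an empty cell in each column.
df = pd.DataFrame({
    "Duration": [60, None, 45],
    "Calories": [409.1, 479.0, None],
})

# The dictionary maps column names to their replacement values.
df.fillna({"Calories": 130}, inplace=True)

print(df["Calories"].isna().sum())  # 0
print(df["Duration"].isna().sum())  # 1
```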

Replace Missing Values with Mean, Median, or Mode

A common technique to handle missing data is to replace empty cells with the mean, median, or mode of the column. This keeps every row and fills the gaps with a value that is typical for the column, although it can understate how much the data actually varies.

Pandas provides built-in methods to calculate these statistical values:

  • mean() – average value
  • median() – middle value
  • mode() – most frequent value

Example

Replace missing values in the "Calories" column with the mean

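import pandas as pd
df = pd.read_csv('data.csv')
# Compute the column mean (missing values are skipped by default)
mean_calories = df["Calories"].mean()
# Fill only the "Calories" column with that mean
df.fillna({"Calories": mean_calories}, inplace=True)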


You can follow the same approach with median() or mode()[0], depending on your data. mode() returns a Series rather than a single number, because several values can tie for most frequent, so [0] picks the first of them.
  • Mean = the average value (the sum of all values divided by the number of values).
  • Median = the value in the middle, after you have sorted all values in ascending order.
  • Mode = the value that appears most frequently.
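As a rough sketch of the median and mode variants, the example below uses a small made-up "Calories" column instead of data.csv, and returns new DataFrames rather than filling in place so the two results can be compared.

```python
import pandas as pd

# Made-up sample data with one empty cell.
df = pd.DataFrame({"Calories": [300.0, 300.0, 340.0, None, 410.0]})

median_value = df["Calories"].median()  # 320.0, middle of the four known values
mode_value = df["Calories"].mode()[0]   # 300.0, the most frequent value

# fillna() without inplace returns a new DataFrame each time.
filled_median = df.fillna({"Calories": median_value})
filled_mode = df.fillna({"Calories": mode_value})

print(filled_median.to_string())
print(filled_mode.to_string())
```

The median is often preferred when a column has extreme values, and the mode is the usual choice for columns that hold categories rather than numbers.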














