Pandas - Cleaning Data of Wrong Format

Dhanapriya D

Dealing with Data of the Wrong Format

When working with datasets, it's common to encounter cells with data in the wrong format. These inconsistencies can make data analysis challenging or sometimes impossible.

How to Handle Incorrect Formats

You typically have two choices to fix this issue:

  • Remove the problematic rows, or
  • Convert all the values in a column to a consistent format.

In this example, we'll focus on converting all the values in the 'Date' column into proper date objects.

Converting Strings to Dates in Pandas

Pandas provides a handy function called to_datetime() for converting date strings into datetime objects.

Program

import pandas as pd

df = pd.read_csv('data.csv')

df['Date'] = pd.to_datetime(df['Date'], format='mixed')

print(df.to_string())

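Using format='mixed' (available in pandas 2.0 and later) tells Pandas to infer the format of each value individually, so different date formats within the same column can be parsed. Be careful with ambiguous dates: a value such as 01/02/2020 can be read as either month-first or day-first.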
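
To try the conversion without a CSV file, here is a minimal sketch that assumes pandas 2.0 or later. The small DataFrame is made-up sample data standing in for data.csv, and errors='coerce' is an optional extra that the program above does not use: it turns values that cannot be parsed into NaT instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "Duration": [60, 45, 30, 50],
    "Date": ["2020/12/01", "2020-12-02", None, "not a date"],
})

# format='mixed' infers the format of each value individually.
# errors='coerce' (an assumption, not part of the original program)
# replaces values that cannot be parsed with NaT.
df["Date"] = pd.to_datetime(df["Date"], format="mixed", errors="coerce")

print(df.to_string())
print("Rows without a valid date:", df["Date"].isna().sum())
```

Without errors='coerce', the string "not a date" would make to_datetime() raise an error, so which behavior you want depends on whether bad values should stop the script or be cleaned up later.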

Removing Rows with Invalid Data

In the previous example, empty cells in the 'Date' column are converted to NaT (Not a Time), which Pandas treats as a missing (NULL) value. Values that cannot be parsed at all raise an error by default; passing errors='coerce' to to_datetime() turns them into NaT as well.

To clean up the dataset, you can remove these rows using the dropna() method.

Program

Remove Rows with Missing Dates

# Remove rows where the 'Date' column has NULL (NaT) values

df.dropna(subset=['Date'], inplace=True)

This drops every row from the DataFrame where the 'Date' column is NaT, that is, where the date was missing or could not be converted.
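
The following self-contained sketch shows the same cleanup on a small made-up DataFrame (the values are illustrative, not from data.csv). As an alternative to inplace=True, it assigns the result to a new variable, here called cleaned, so the original DataFrame stays available for comparison.

```python
import pandas as pd

# Hypothetical sample data: the second row has no date (NaT)
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Date": pd.to_datetime(["2020-12-01", None, "2020-12-03"]),
})

# Keep only the rows where 'Date' is present, then renumber the index
cleaned = df.dropna(subset=["Date"]).reset_index(drop=True)

print("Rows before:", len(df))
print("Rows after:", len(cleaned))
print(cleaned.to_string())
```

Because subset=['Date'] is given, only the 'Date' column is checked; rows with missing values in other columns are left in place.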


