Pandas - Fixing Wrong Data

Dhanapriya D

Wrong data

In Pandas, fixing wrong data refers to the process of identifying and correcting values in a DataFrame that are inaccurate, inconsistent, or not in the expected format. This is a key step in data cleaning, as incorrect data can lead to misleading analysis or errors in computation.

Replacing Values

One common way to correct wrong or inaccurate values in a DataFrame is by replacing them with the correct ones.

Manual Replacement

If you're working with a small dataset, you can manually fix incorrect values.

Program

If you spot a typo, such as 450 where the value should be 45, you can update that specific cell directly with df.loc, which selects by row label and column name:

# Set 'Duration' to 45 in row 7
df.loc[7, 'Duration'] = 45
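
To make the single-cell fix easier to try out, here is a minimal, self-contained sketch. The sample DataFrame is hypothetical (the values and the Pulse column are assumptions made up for illustration); only the df.loc assignment mirrors the line above.

```python
import pandas as pd

# Hypothetical sample data: row 7 holds the suspected typo (450 instead of 45).
df = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30, 60],
    "Pulse": [110, 117, 103, 109, 117, 102, 110, 104, 109, 98],
})

# Look at the cell before changing it, to confirm it is the one you mean.
before = df.loc[7, "Duration"]

# df.loc selects by index label (row 7) and column name ("Duration").
df.loc[7, "Duration"] = 45
after = df.loc[7, "Duration"]

print(before, "->", after)
```

Checking the value first is a cheap safeguard: if the index is not the default 0, 1, 2, ... numbering, label 7 may not be the row you expect.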

Replacing Values in Large Datasets

When dealing with large datasets, manual fixes are not practical. Instead, you can define rules or conditions to correct incorrect values.

Program

For instance, to cap every value in the Duration column at 120, you can loop through the DataFrame and adjust any value that exceeds the limit:

# Replace values greater than 120 with 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120

This is useful when you're confident that values above a certain threshold are not realistic and need correction.
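
On larger DataFrames, a row-by-row loop can be slow. As an alternative sketch (not part of the original program), the same cap can be expressed without a loop, either with a boolean mask or with Series.clip. The sample data and the names capped_mask and capped_clip are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data with two values above the 120 limit.
df = pd.DataFrame({"Duration": [60, 450, 45, 200, 120]})

# Option 1: select the offending rows with a boolean mask and overwrite them.
capped_mask = df.copy()
capped_mask.loc[capped_mask["Duration"] > 120, "Duration"] = 120

# Option 2: clip() limits every value to the given upper bound.
capped_clip = df.copy()
capped_clip["Duration"] = capped_clip["Duration"].clip(upper=120)

print(capped_mask["Duration"].tolist())
print(capped_clip["Duration"].tolist())
```

Both options work on copies here so the original df stays available for comparison; in practice you could assign back to df directly.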

Removing Rows with Wrong Data

Another approach is to simply remove rows that contain invalid data, especially when you're unsure what the correct value should be. This can help keep your analysis clean and accurate.

Program

Here’s how you can delete rows where Duration is greater than 120:

# Delete rows where 'Duration' is greater than 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace=True)

This method is especially effective when the affected rows aren’t critical for your analysis, or when it is safer to discard a value than to guess what it should have been.
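
The same removal can also be sketched without a loop by collecting the index labels of the offending rows and dropping them in one call. This is an alternative to the program above, not a replacement for it; the sample data and the name cleaned are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data with two rows above the 120 limit.
df = pd.DataFrame({
    "Duration": [60, 450, 45, 200, 120],
    "Pulse": [110, 104, 109, 98, 117],
})

# Index labels of the rows whose Duration exceeds 120.
bad_rows = df[df["Duration"] > 120].index

# drop() returns a new DataFrame without those rows; df itself is unchanged.
cleaned = df.drop(bad_rows)

print(cleaned)
```

Note that the remaining rows keep their original index labels (0, 2 and 4 in this sample). If you prefer consecutive numbering afterwards, cleaned.reset_index(drop=True) renumbers the rows.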
