Wrong data
In Pandas, fixing wrong data refers to the process of identifying and correcting values in a DataFrame that are inaccurate, inconsistent, or not in the expected format. This is a key step in data cleaning, as incorrect data can lead to misleading analysis or errors in computation.
Replacing Values
One common way to correct wrong or inaccurate values in a DataFrame is by replacing them with the correct ones.
Manual Replacement
If you're working with a small dataset, you can manually fix incorrect values.
Program
If you spot a typo, such as a Duration of 450 in row 7 where 45 was intended, you can update that specific cell directly with loc:
# Set 'Duration' to 45 in row 7
df.loc[7, 'Duration'] = 45
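The line above assumes a DataFrame named df already exists. As a minimal self-contained sketch, the following builds a small made-up DataFrame (the values and the Pulse column are illustrative assumptions, not a real dataset) and applies the same fix:

```python
import pandas as pd

# Hypothetical workout log; row 7 holds the typo 450
df = pd.DataFrame({
    "Duration": [60, 60, 60, 45, 45, 60, 60, 450, 30, 60],
    "Pulse": [110, 117, 103, 109, 117, 102, 110, 104, 109, 98],
})

# loc selects by index label (7) and column name ('Duration')
df.loc[7, "Duration"] = 45

print(df.loc[7, "Duration"])
```

Note that loc works with index labels, so 7 refers to the row labeled 7, which matches the eighth row only when the DataFrame uses the default integer index.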
Replacing Values in Large Datasets
When dealing with large datasets, manual fixes are not practical. Instead, you can define rules or conditions that correct invalid values automatically.
Program
For instance, if you want to cap all values in the Duration column at a maximum of 120, you can loop through the DataFrame and adjust the values accordingly:
# Replace values greater than 120 with 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120
This is useful when you're confident that values above a certain threshold are not realistic and need correction.
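On large DataFrames, a row-by-row loop can be slow. One possible alternative is the Series.clip method, which caps every value in a single vectorized step. The sketch below uses a small made-up DataFrame for illustration:

```python
import pandas as pd

# Hypothetical data with two unrealistic durations
df = pd.DataFrame({"Duration": [60, 450, 45, 300, 120]})

# Cap every value above 120 at 120; values at or below 120 are unchanged
df["Duration"] = df["Duration"].clip(upper=120)

print(df["Duration"].tolist())  # [60, 120, 45, 120, 120]
```

This produces the same result as the loop. An equivalent form using a condition is df.loc[df["Duration"] > 120, "Duration"] = 120.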
Removing Rows with Wrong Data
Another approach is to simply remove rows that contain invalid data, especially when you're unsure what the correct value should be. This can help keep your analysis clean and accurate.
Program
Here’s how you can delete rows where Duration is greater than 120:
# Delete rows where 'Duration' is greater than 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.drop(x, inplace=True)
This method is especially effective when the incorrect data isn’t critical for your analysis or when it's safer to discard rather than guess.
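The same removal can also be written without a loop by filtering with a boolean mask. This is a minimal sketch on made-up data, and the name cleaned is an illustrative choice:

```python
import pandas as pd

# Hypothetical data; rows 1 and 3 have unrealistic durations
df = pd.DataFrame({
    "Duration": [60, 450, 45, 300, 120],
    "Pulse": [110, 104, 109, 98, 117],
})

# Build a mask of invalid rows, then keep everything else.
# Negating the "> 120" mask mirrors the loop: rows with a missing
# Duration are kept, because NaN > 120 evaluates to False.
invalid = df["Duration"] > 120
cleaned = df[~invalid]

print(cleaned["Duration"].tolist())  # [60, 45, 120]
```

The remaining rows keep their original index labels (0, 2, and 4 here). If you want a fresh consecutive index, you can call cleaned.reset_index(drop=True).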