Pandas - Cleaning Data

Dhanapriya D

Data Cleaning in Pandas

Data cleaning is a crucial step in the data analysis process. It involves identifying and correcting issues in your dataset to ensure accuracy, consistency, and reliability.

What is Bad Data?

Bad or "dirty" data can take many forms. Common issues include:

  • Empty cells (missing values)
  • Incorrect data formats
  • Inaccurate or illogical values
  • Duplicate entries 
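
Each of these issues can usually be detected with a few pandas calls. The following is a minimal sketch on a small, hypothetical DataFrame invented for illustration (it is not the sample dataset used in this tutorial). The 120-minute limit for a plausible duration is an assumption, not a rule from pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "Duration": [60, 60, 450, 45, 45],
    "Date": ["2020/12/01", "2020/12/01", "2020/12/03", None, "2020/12/05"],
    "Calories": [409.1, 409.1, 253.3, 282.0, np.nan],
})

# Empty cells: count missing values in each column.
missing_per_column = df.isna().sum()

# Incorrect formats: dates stored as text are not a datetime column yet.
date_is_datetime = pd.api.types.is_datetime64_any_dtype(df["Date"])

# Illogical values: flag durations above an assumed limit of 120 minutes.
suspicious = df[df["Duration"] > 120]

# Duplicate entries: mark rows that exactly repeat an earlier row.
duplicate_mask = df.duplicated()

print(missing_per_column)
print("Date column is datetime:", date_is_datetime)
print(suspicious)
print(duplicate_mask)
```

Checks like these only point at candidate problems. Whether a flagged value is really wrong still depends on what the data is supposed to represent.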

Sample Dataset

    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
3         45  '2020/12/04'    109       175     282.4
...
7        450  '2020/12/08'    104       134     253.3  ← wrong value
...
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7  ← duplicate
...
18        45  '2020/12/18'     90       112       NaN  ← missing value
...
22        45           NaN    100       119     282.0  ← missing date
...
26        60    2020/12/26    100       120     250.0  ← wrong date format
...
28        60  '2020/12/28'    103       132       NaN  ← missing value

  • The data set contains empty cells ("Date" in row 22, and "Calories" in rows 18 and 28).
  • The data set contains a value in the wrong format ("Date" in row 26).
  • The data set contains an incorrect value ("Duration" in row 7).
  • The data set contains duplicates (rows 11 and 12).
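
One way to address these four problems is sketched below, using only the flagged rows from the sample dataset. Several choices here are assumptions made for illustration rather than fixes prescribed by the dataset: the quote characters are treated as literal parts of the stored date strings, the row with the missing date is dropped, missing "Calories" values are filled with the column mean, and 45 is assumed to be the intended duration for row 7.

```python
import numpy as np
import pandas as pd

# The flagged rows from the sample dataset, keeping their original row labels.
# Assumption: the quotes shown in the table are stored inside the date strings.
df = pd.DataFrame(
    {
        "Duration": [450, 60, 60, 45, 45, 60, 60],
        "Date": ["'2020/12/08'", "'2020/12/12'", "'2020/12/12'",
                 "'2020/12/18'", np.nan, "2020/12/26", "'2020/12/28'"],
        "Pulse": [104, 100, 100, 90, 100, 100, 103],
        "Maxpulse": [134, 120, 120, 112, 119, 120, 132],
        "Calories": [253.3, 250.7, 250.7, np.nan, 282.0, 250.0, np.nan],
    },
    index=[7, 11, 12, 18, 22, 26, 28],
)

# Duplicates: keep the first occurrence, which drops row 12.
cleaned = df.drop_duplicates()

# Empty "Date" cell: drop the row, since a date cannot be guessed (drops row 22).
cleaned = cleaned.dropna(subset=["Date"]).copy()

# Wrong format: strip the stray quotes, then parse every date the same way.
cleaned["Date"] = pd.to_datetime(
    cleaned["Date"].str.strip("'"), format="%Y/%m/%d"
)

# Empty "Calories" cells: fill with the column mean (one of several options).
cleaned["Calories"] = cleaned["Calories"].fillna(cleaned["Calories"].mean())

# Wrong value: replace the implausible duration (45 is an assumed correction).
cleaned.loc[7, "Duration"] = 45

print(cleaned)
```

Dropping rows loses information and filling with a mean hides it, so the right choice depends on how the cleaned data will be used.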

