Data Cleaning in Pandas
Data cleaning is a crucial step in the data analysis process. It involves identifying and correcting issues in your dataset to ensure accuracy, consistency, and reliability.
What is Bad Data?
Bad or "dirty" data can take many forms. Common issues include:
- Empty cells (missing values)
- Incorrect data formats
- Inaccurate or illogical values
- Duplicate entries
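Each of these issues has a quick first check in pandas. The sketch below is only an illustration: the tiny DataFrame is made up (it is not the sample dataset), the expected date format `%Y/%m/%d` is an assumption, and the 200-minute cutoff for an "illogical" duration is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
df = pd.DataFrame({
    "Duration": [60, 60, 450, 45, 45],
    "Date": ["2020/12/01", "2020/12/01", "2020/12/03", None, "04-12-2020"],
    "Calories": [409.1, 409.1, 253.3, 282.0, np.nan],
})

# Empty cells: count missing values in each column.
missing_per_column = df.isna().sum()

# Incorrect formats: dates that do not match the assumed format become NaT.
parsed = pd.to_datetime(df["Date"], format="%Y/%m/%d", errors="coerce")
bad_format = df["Date"].notna() & parsed.isna()

# Illogical values: assumed rule that a session over 200 minutes is suspect.
suspicious = df["Duration"] > 200

# Duplicate entries: rows that are identical to an earlier row.
duplicates = df.duplicated()

print(missing_per_column)
print(df[bad_format | suspicious | duplicates])
```

These checks only flag candidate rows. Whether a flagged value is truly wrong still depends on what the data is supposed to represent.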
Sample Dataset
    Duration  Date          Pulse  Maxpulse  Calories
0   60        '2020/12/01'  110    130       409.1
1   60        '2020/12/02'  117    145       479.0
2   60        '2020/12/03'  103    135       340.0
3   45        '2020/12/04'  109    175       282.4
...
7   450       '2020/12/08'  104    134       253.3     ← wrong value
...
11  60        '2020/12/12'  100    120       250.7
12  60        '2020/12/12'  100    120       250.7     ← duplicate
...
18  45        '2020/12/18'  90     112       NaN       ← missing value
22  45        NaN           100    119       282.0     ← missing date
26  60        2020/12/26    100    120       250.0     ← wrong date format
28  60        '2020/12/28'  103    132       NaN       ← missing value
- The dataset contains empty cells ("Date" in row 22, and "Calories" in rows 18 and 28).
- The dataset contains a value in the wrong format ("Date" in row 26).
- The dataset contains a value that is almost certainly wrong ("Duration" in row 7 is 450, while the other sessions shown last 45 or 60 minutes).
- The dataset contains duplicate rows (rows 11 and 12).
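As a rough sketch, most of these findings can be reproduced in pandas. The DataFrame below contains only the rows printed in the listing (the elided rows are left out), and the 120-minute limit on "Duration" is an assumed threshold, not a rule from the dataset. The format problem in row 26 is not checked here, because the listing does not show exactly how the stored value differs from the others.

```python
import numpy as np
import pandas as pd

# Only the rows visible in the listing; rows hidden behind "..." are omitted.
records = [
    (60, "2020/12/01", 110, 130, 409.1),
    (60, "2020/12/02", 117, 145, 479.0),
    (60, "2020/12/03", 103, 135, 340.0),
    (45, "2020/12/04", 109, 175, 282.4),
    (450, "2020/12/08", 104, 134, 253.3),
    (60, "2020/12/12", 100, 120, 250.7),
    (60, "2020/12/12", 100, 120, 250.7),
    (45, "2020/12/18", 90, 112, np.nan),
    (45, np.nan, 100, 119, 282.0),
    (60, "2020/12/26", 100, 120, 250.0),
    (60, "2020/12/28", 103, 132, np.nan),
]
df = pd.DataFrame(
    records,
    columns=["Duration", "Date", "Pulse", "Maxpulse", "Calories"],
    index=[0, 1, 2, 3, 7, 11, 12, 18, 22, 26, 28],  # original row labels
)

# Rows with at least one empty cell.
missing_rows = df.index[df.isna().any(axis=1)].tolist()

# Rows that repeat an earlier row exactly (the first occurrence is not flagged).
duplicate_rows = df.index[df.duplicated()].tolist()

# Rows whose duration exceeds the assumed 120-minute limit.
long_rows = df.index[df["Duration"] > 120].tolist()

print("Missing values in rows:", missing_rows)
print("Duplicate rows:", duplicate_rows)
print("Suspicious durations in rows:", long_rows)
```

Note that `duplicated()` marks only the second copy by default, so row 12 is reported while row 11 is treated as the original.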