Relationships in Data with Pandas
One of the most useful features of Pandas is its ability to help you understand relationships between different columns in your dataset. This is valuable for both exploratory data analysis and predictive modeling.
Pandas provides the corr() method to calculate correlation coefficients (Pearson by default), which quantify the strength and direction of a linear relationship between numerical columns.
Using corr() to Analyze Relationships
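The corr() method returns a correlation matrix that shows how strongly each pair of columns is related. Correlation is only defined for numeric columns. Older pandas versions silently skipped non-numeric columns, but since pandas 2.0, corr() raises an error when the DataFrame contains values that cannot be converted to numbers, unless you pass numeric_only=True.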
Program
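Let’s load a dataset called data.csv and analyze the relationships between its columns:
import pandas as pd
# Load the dataset into a DataFrame
df = pd.read_csv("data.csv")
# Compute the pairwise correlation between every pair of columns
print(df.corr())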
Output
          Duration     Pulse  Maxpulse  Calories
Duration  1.000000 -0.155408  0.009403  0.922721
Pulse    -0.155408  1.000000  0.786535  0.025120
Maxpulse  0.009403  0.786535  1.000000  0.203814
Calories  0.922721  0.025120  0.203814  1.000000
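If a dataset also contains text columns, corr() may raise an error in recent pandas versions instead of skipping them. The sketch below uses a small made-up DataFrame (the Workout column and all of the values are hypothetical and are not taken from data.csv) to show one way of restricting the calculation to numeric columns with the numeric_only parameter.

```python
import pandas as pd

# Hypothetical data: three numeric columns plus one text column.
df = pd.DataFrame(
    {
        "Workout": ["run", "bike", "row", "swim", "walk"],
        "Duration": [60, 45, 30, 45, 20],
        "Pulse": [110, 117, 103, 109, 98],
        "Calories": [409.1, 282.4, 195.0, 300.5, 110.0],
    }
)

# numeric_only=True tells corr() to leave out the non-numeric "Workout" column.
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix)
```

Selecting the numeric columns first with df.select_dtypes(include="number").corr() should give the same matrix and may read more explicitly in longer scripts.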
Understanding the Correlation Matrix
The values in the matrix range from -1 to 1:
1.0 → Perfect positive correlation
-1.0 → Perfect negative correlation
0.0 → No linear correlation
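These three reference points can be reproduced with a few hand-built series. The following is a minimal sketch; the names x, up, down, and curve are made up for illustration and do not come from data.csv.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

# up rises in a straight line with x: perfect positive correlation.
up = 2 * x + 1
# down falls in a straight line as x rises: perfect negative correlation.
down = 10 - 3 * x
# curve depends on x, but not linearly (it is symmetric around x = 3).
curve = (x - 3) ** 2

r_up = x.corr(up)
r_down = x.corr(down)
r_curve = x.corr(curve)

print(r_up, r_down, r_curve)
```

The last case shows why a coefficient of 0 is best read as "no linear relationship": curve is fully determined by x, yet the two have essentially zero correlation.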
What Do These Numbers Mean?
When interpreting correlation values, context matters. As a common rule of thumb, a coefficient with an absolute value of at least 0.6 (that is, 0.6 or higher, or -0.6 or lower) indicates a moderate to strong relationship between variables.
- A positive value (e.g., 0.6) suggests that as one variable increases, the other tends to increase as well.
- A negative value (e.g., -0.6) indicates that as one variable increases, the other tends to decrease.
The diagonal of the matrix is always 1.0. For example, Duration versus Duration is 1.0 because every column is perfectly correlated with itself.
Strong Positive Correlation
Duration and Calories have a correlation of about 0.92. Longer workouts tend to burn more calories, which is the relationship you would expect.
Weak or No Correlation
Duration and Maxpulse have a correlation of 0.009, indicating almost no relationship between workout length and maximum pulse rate.
Negative Correlation
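A value like -0.9 would indicate a strong inverse relationship: as one variable increases, the other tends to decrease. In this dataset the only negative value is between Duration and Pulse (about -0.16), which is too close to 0 to indicate a meaningful inverse relationship.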
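When only one pair of columns is of interest, the full matrix is not required. The sketch below shows two ways to get a single coefficient, using a small made-up DataFrame (the values are invented for illustration and will not reproduce the numbers from data.csv).

```python
import pandas as pd

# Hypothetical workout data; the values are invented for illustration.
df = pd.DataFrame(
    {
        "Duration": [60, 60, 45, 45, 30, 20],
        "Maxpulse": [130, 145, 135, 175, 148, 130],
        "Calories": [409.1, 479.0, 340.0, 282.4, 195.0, 110.0],
    }
)

# Option 1: correlate two columns directly with Series.corr().
r_direct = df["Duration"].corr(df["Calories"])

# Option 2: compute the full matrix once, then look up the pair by label.
r_lookup = df.corr().loc["Duration", "Calories"]

print(f"Duration vs Calories: {r_direct:.3f}")
```

Both options should return the same number. Series.corr() is convenient for a one-off check, while the matrix lookup avoids recomputation when several pairs are inspected.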
What Counts as a Good Correlation?
There's no universal cutoff, but as a general guide:
- An absolute value of 0.6 or higher (0.6 and above, or -0.6 and below) indicates a moderate to strong relationship.
- Values near 0 suggest a weak or no meaningful linear correlation.
The right threshold ultimately depends on your domain, dataset, and use case.
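A rule-of-thumb threshold like this can also be applied programmatically. The sketch below rebuilds the correlation matrix from the output shown earlier (so it runs without data.csv) and lists each pair of different columns whose absolute correlation is at least 0.6. The names threshold, pairs, and strong_pairs are illustrative, and 0.6 is only the guideline value from the text, not a fixed standard.

```python
import numpy as np
import pandas as pd

cols = ["Duration", "Pulse", "Maxpulse", "Calories"]

# Correlation matrix copied from the output shown earlier.
corr = pd.DataFrame(
    [
        [1.000000, -0.155408, 0.009403, 0.922721],
        [-0.155408, 1.000000, 0.786535, 0.025120],
        [0.009403, 0.786535, 1.000000, 0.203814],
        [0.922721, 0.025120, 0.203814, 1.000000],
    ],
    index=cols,
    columns=cols,
)

threshold = 0.6  # rule-of-thumb cutoff; adjust for your domain

# Keep only the upper triangle (k=1 skips the diagonal), so each pair
# appears once and the trivial self-correlations are excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Select pairs whose absolute correlation meets the threshold.
strong_pairs = pairs[pairs.abs() >= threshold]
print(strong_pairs)
```

For this matrix, the selection keeps Duration/Calories and Pulse/Maxpulse, the two pairs whose coefficients exceed 0.6.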