Class Comparison Methods in Data Mining
In data mining, users do not always want to study just one group (or class). Instead, they often want to compare one class with another to understand the differences between them.
This process is called class comparison (or class discrimination). It helps to find patterns that distinguish a target class from other similar classes.
Important point: The classes being compared must be similar in structure (i.e., they should have the same type of attributes).
For example:
- Valid comparison: Computer Science students vs Physics students
- Invalid comparison: Person vs Address vs Item (not comparable)
Class Comparison vs Class Characterization
- Class Characterization → Describes a single class
- Class Comparison → Compares two or more classes
Example:
Comparing sales in 2003 with sales in 2004 is a class comparison.
Synchronous Generalization
To make fair comparisons, the target class and the contrasting classes must be generalized to the same level of detail.
Example:
When comparing sales data, both datasets should be at the same location level:
- City level OR
- State level OR
- Country level
Wrong comparison:
- Sales in Vancouver (city) vs Sales in USA (country)
Correct comparison:
- Both should be at the same level (e.g., both at country level)
Users can also manually adjust these levels if needed.
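A minimal Python sketch of this idea, assuming a hypothetical city-to-country hierarchy. The CITY_TO_COUNTRY mapping, the generalize helper, and the sales figures are all illustrative and do not come from the notes above:

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country (illustrative values only).
CITY_TO_COUNTRY = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Seattle": "USA",
    "Chicago": "USA",
}

def generalize(sales, level):
    """Aggregate (city, amount) records at the requested location level."""
    totals = Counter()
    for city, amount in sales:
        key = city if level == "city" else CITY_TO_COUNTRY[city]
        totals[key] += amount
    return dict(totals)

# Two classes to compare, both recorded at city level (made-up numbers).
sales_2003 = [("Vancouver", 100), ("Toronto", 50), ("Seattle", 70)]
sales_2004 = [("Vancouver", 120), ("Chicago", 90), ("Seattle", 60)]

# Apply the SAME level to both classes before comparing them.
level = "country"
generalized_2003 = generalize(sales_2003, level)
generalized_2004 = generalize(sales_2004, level)
```

Because one level variable drives both calls, the two classes cannot end up at different levels by accident.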
Steps in Class Comparison
1. Data Collection
Collect relevant data using queries.
Divide data into:
- Target class
- Contrasting class(es)
2. Dimension Relevance Analysis
- If there are many attributes, select only the important ones.
- This improves accuracy and reduces complexity.
3. Synchronous Generalization
Generalize the target class data to a chosen level.
Apply the same level of generalization to the contrasting classes.
4. Presentation of Results
Show the comparison using:
- Tables
- Charts (bar, pie, etc.)
- Rules
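A common measure is count%, the percentage of tuples in a class that a given generalized tuple covers. Because it is relative to class size, it lets classes of different sizes be compared directly.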
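The count% measure can be sketched in Python as follows. The count_percent helper and the generalized tuples are hypothetical, chosen only to show the arithmetic:

```python
from collections import Counter

def count_percent(generalized_tuples):
    """Return {tuple: percentage of the class's tuples that it covers}."""
    counts = Counter(generalized_tuples)
    total = sum(counts.values())
    return {t: 100.0 * c / total for t, c in counts.items()}

# Hypothetical generalized tuples (program area, GPA range) for each class.
graduate = [("Science", "good")] * 3 + [("Science", "excellent")] * 1
undergraduate = [("Science", "good")] * 2 + [("Arts", "good")] * 6

target_pct = count_percent(graduate)         # target class
contrast_pct = count_percent(undergraduate)  # contrasting class
```

Here ("Science", "good") covers 75% of the graduate tuples but only 25% of the undergraduate tuples, even though the two classes differ in size.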
Example (DMQL Query)
Comparing graduate vs undergraduate students:
use University_Database
mine comparison as "graduate_students_vs_undergraduate_students"
in relevance to name, gender, program, birth_place, birth_date,
residence, phone_no, GPA
for "graduate_students"
where status in "graduate"
versus "undergraduate_students"
where status in "undergraduate"
analyze count%
from student
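A rough Python analogue of the data collection this query performs, assuming the student relation is available as a list of dictionaries. The rows, the collect helper, and the shortened attribute list are illustrative assumptions, not part of DMQL:

```python
# Hypothetical rows standing in for the student relation (made-up data).
students = [
    {"name": "Ana", "status": "graduate", "program": "CS", "GPA": 3.8},
    {"name": "Ben", "status": "undergraduate", "program": "Physics", "GPA": 3.1},
    {"name": "Chen", "status": "graduate", "program": "Physics", "GPA": 3.5},
]

# Subset of the attributes listed after "in relevance to" (kept short here).
RELEVANT = ("name", "program", "GPA")

def collect(rows, status):
    """Select rows with the given status and keep only the relevant attributes."""
    return [{a: row[a] for a in RELEVANT} for row in rows if row["status"] == status]

target_class = collect(students, "graduate")            # the "for" clause
contrasting_class = collect(students, "undergraduate")  # the "versus" clause
```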
Key Terms Explained
- Attributes → Data fields (e.g., name, gender, GPA)
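- Concept Hierarchy (Gen(a_i)) → Concept hierarchies or generalization operators for attribute a_i, which define its levels of abstraction
- Thresholds (U_i, T_i) → U_i is the analytical threshold and T_i the generalization threshold for attribute a_i
- Relevance Threshold (R) → Cut-off used to decide which attributes are relevant enough to keep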
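The notes do not say which relevance measure is compared against R. The sketch below assumes information gain, one common choice, and uses made-up rows and a made-up threshold value; the function names are hypothetical:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, class_key="class"):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    labels = [r[class_key] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[class_key])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Made-up tuples labelled with their class.
rows = [
    {"class": "graduate", "program": "MSc", "gender": "F"},
    {"class": "graduate", "program": "MSc", "gender": "M"},
    {"class": "undergraduate", "program": "BSc", "gender": "F"},
    {"class": "undergraduate", "program": "BSc", "gender": "M"},
]

R = 0.1  # assumed relevance threshold
relevant = [a for a in ("program", "gender") if information_gain(rows, a) >= R]
```

In this toy data, program separates the two classes perfectly while gender carries no information, so only program passes the threshold.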
Presentation of Class Comparison
Results can be shown using:
- Tables
- Charts (bar chart, pie chart, curves)
- Cross-tabulations
- Rules
Discriminant Rules
Class comparison results are often expressed using discriminant rules.
These rules:
- Highlight differences between classes
- Use a measure called d-weight
- Show how strongly a feature distinguishes one class from another
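The d-weight of a generalized tuple for the target class is usually defined as its count in the target class divided by its total count across all compared classes. A minimal sketch of that ratio, with a hypothetical d_weight helper and made-up counts:

```python
def d_weight(q, target_counts, all_class_counts):
    """Share of the tuples covered by q that belong to the target class."""
    total = sum(counts.get(q, 0) for counts in all_class_counts)
    return target_counts.get(q, 0) / total if total else 0.0

# Made-up counts of generalized tuples (program area, GPA range) per class.
graduate_counts = {("Science", "good"): 90, ("Science", "excellent"): 30}
undergraduate_counts = {("Science", "good"): 210, ("Arts", "good"): 60}
classes = [graduate_counts, undergraduate_counts]

q = ("Science", "good")
dw_graduate = d_weight(q, graduate_counts, classes)            # 90 / 300
dw_undergraduate = d_weight(q, undergraduate_counts, classes)  # 210 / 300
```

A d-weight near 1 means the tuple occurs almost only in the target class, so it discriminates well. A value near 0 means it mostly describes the contrasting classes.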