Entity Identification Problem in Data Mining
What is Data Integration?
In data mining, large amounts of data come from different sources. Data
Integration means combining this data into one unified dataset.
For example, data can come from:
- Banking systems
- Customer records
- Social media (Twitter, blogs)
- Sensors
- Images, audio, and videos
This combined data is then used for:
- Machine learning
- Data analysis
- Predicting trends
Why is Data Integration Important?
Proper data integration helps:
- Remove duplicate data
- Reduce errors and inconsistencies
- Improve accuracy of analysis
- Speed up data mining processes
However, combining data from different sources is not easy because:
- Data formats differ
- Names of attributes vary
- Meanings may not match
What is the Entity Identification Problem?
When integrating data from multiple sources, the same real-world
object may appear in different forms.
The Entity Identification Problem is the task of deciding whether two
records from different datasets refer to the same real-world entity.
Example:
Table A: cust_id
Table B: cust_number
Even though the attribute names are different, both may identify the
same customer.
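As a minimal sketch of this example, the two attribute names can be mapped onto one shared name before records are compared. The row contents and the ATTRIBUTE_MAP dictionary are hypothetical. The sketch also assumes both sources use the same ID numbering, which would have to be checked in practice.

```python
# Hypothetical rows; only the attribute names cust_id and cust_number
# come from the example above.
table_a = [{"cust_id": 101, "name": "Asha Rao"}]
table_b = [{"cust_number": 101, "city": "Pune"}]

# Assumed mapping from each source-specific attribute name to a shared name.
ATTRIBUTE_MAP = {"cust_id": "customer_id", "cust_number": "customer_id"}

def to_shared_schema(row):
    """Rename known attributes; leave all other attributes unchanged."""
    return {ATTRIBUTE_MAP.get(key, key): value for key, value in row.items()}

rows_a = [to_shared_schema(r) for r in table_a]
rows_b = [to_shared_schema(r) for r in table_b]

# After renaming, both sources expose the same attribute and can be compared.
same_customer = rows_a[0]["customer_id"] == rows_b[0]["customer_id"]
print(same_customer)
```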
Common Issues in Data Integration
1. Data Redundancy
The same data appears more than once.
Example: The same customer is stored in several databases.
Causes:
- Different names for the same attribute
- Derived data (e.g., annual income calculated from monthly income)
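One simple way to spot derived data is to test whether one attribute can be computed from another in every record. The sketch below checks the annual-income example. The field names and values are hypothetical, and an exact factor of 12 is an assumption.

```python
import math

# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"monthly_income": 4000.0, "annual_income": 48000.0},
    {"monthly_income": 3500.0, "annual_income": 42000.0},
    {"monthly_income": 5000.0, "annual_income": 61000.0},  # includes a bonus
]

def is_derived(rows, source, target, factor):
    """Return True if target equals source * factor in every row."""
    return all(
        math.isclose(row[target], row[source] * factor, rel_tol=1e-9)
        for row in rows
    )

# The first two rows satisfy annual = 12 * monthly; the third row does not.
print(is_derived(records[:2], "monthly_income", "annual_income", 12))  # True
print(is_derived(records, "monthly_income", "annual_income", 12))      # False
```

If the check holds for all records, the derived attribute is redundant and can be dropped or recomputed when needed.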
2. Duplicate Attributes
The same information is stored in more than one column.
3. Irrelevant Attributes
Some attributes are not useful for the task. Example: A student ID is
not needed to predict GPA.
4. Entity Identification Problem
Matching the same real-world entity across different data
sources.
Types of Data Integration
1. Virtual Integration
- Data stays in original databases
- A unified view is created
- No physical merging
2. Actual Integration
- Data is physically merged into one database
- Old databases may be removed
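The difference between the two types can be sketched with in-memory lists standing in for databases. Everything here (the list names, the rows, the unified_view function) is hypothetical. A real system would query the live databases instead.

```python
import copy

# Hypothetical source databases, modelled as in-memory lists of rows.
bank_db = [{"customer_id": 1, "name": "Asha Rao"}]
crm_db = [{"customer_id": 2, "name": "Ben Ortiz"}]

def unified_view():
    """Virtual integration: read the sources at query time, store nothing."""
    yield from bank_db
    yield from crm_db

# Actual integration: copy every row into one new store.
merged_db = copy.deepcopy(bank_db + crm_db)

# A later change in a source is visible through the view but not in the copy.
crm_db.append({"customer_id": 3, "name": "Chen Li"})

print(len(list(unified_view())))  # 3: the view reads the sources again
print(len(merged_db))             # 2: the copy was taken before the change
```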
Why Does This Problem Occur?
Different databases are:
- Created at different times
- Designed by different people
- Built for different purposes
So, the same real-world entity may be stored differently.
Levels of Data Integration Problems
1. Schema Level Problems (Structure Issues)
These occur when database structures differ.
a. Domain Mismatch
- Same data, expressed in different value domains or units
- Example: An amount stored in USD in one database and in yen in another
b. Schema Mismatch
- Different table structures
- Example: One database has a single "Employee" table, another has separate "Part-time" and "Full-time" tables
c. Constraint Mismatch
- Different rules apply to the same data
- Example: One database requires a minimum GPA of 3.0, another requires 3.5
2. Instance Level Problems (Data Issues)
These occur when actual data values differ.
a. Entity Identification
Finding matching records for the same entity.
b. Attribute Value Conflict
The same attribute has different values because of:
- Different units (kg vs pounds)
- Missing data
- Incorrect data
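For the unit case, a common fix is to convert every value to one unit before comparing. The sketch below does this for kg vs pounds and treats a missing value as "cannot confirm". The function names and the tolerance of 0.05 kg are assumptions.

```python
# 1 pound is defined as exactly 0.45359237 kilograms.
KG_PER_POUND = 0.45359237

def weight_in_kg(value, unit):
    """Convert a weight to kilograms; return None when the value is missing."""
    if value is None:
        return None
    if unit == "kg":
        return float(value)
    if unit == "lb":
        return value * KG_PER_POUND
    raise ValueError(f"unknown unit: {unit!r}")

def values_agree(a_kg, b_kg, tolerance_kg=0.05):
    """Values agree only if both are present and within the tolerance."""
    if a_kg is None or b_kg is None:
        return False  # a missing value cannot confirm agreement
    return abs(a_kg - b_kg) <= tolerance_kg

# Hypothetical values for the same item from two sources.
source_a = weight_in_kg(70.0, "kg")
source_b = weight_in_kg(154.3, "lb")
print(values_agree(source_a, source_b))  # True: about 69.99 kg vs 70.0 kg
```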
Approaches to Solve Entity Identification
1. Key Matching
- Use a common key (such as an ID)
- Works only if a common key exists
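A minimal sketch of key matching, assuming both sources share a key attribute called customer_id (a hypothetical name) and that its values are unique within each source:

```python
# Hypothetical sources that share a key attribute named customer_id.
source_a = [
    {"customer_id": 101, "name": "Asha Rao"},
    {"customer_id": 102, "name": "Ben Ortiz"},
]
source_b = [
    {"customer_id": 102, "city": "Lima"},
    {"customer_id": 103, "city": "Oslo"},
]

def match_on_key(rows_a, rows_b, key):
    """Pair rows whose key values are equal; unmatched rows are left out."""
    index_b = {row[key]: row for row in rows_b}  # assumes unique keys in rows_b
    return [
        (row, index_b[row[key]])
        for row in rows_a
        if row[key] in index_b
    ]

matches = match_on_key(source_a, source_b, "customer_id")
print(len(matches))  # 1: only customer 102 appears in both sources
```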
2. User-Specified Matching
- The user defines the matches manually
- Accurate but time-consuming
3. Probabilistic Key Matching
- Match records when their keys are similar rather than identical
- Example: Matching names that differ slightly in spelling
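One possible sketch of name matching uses the similarity ratio from Python's difflib module. This ratio is a string-similarity score, not a true probability. The threshold of 0.85 is an assumption that would normally be tuned on known matching and non-matching pairs.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and extra spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()

THRESHOLD = 0.85  # assumed cut-off, not a recommended value

# Hypothetical name pairs from two sources.
pairs = [
    ("Jonathan Smith", "Jonathon Smith"),
    ("Jonathan Smith", "Maria Gomez"),
]
for left, right in pairs:
    score = name_similarity(left, right)
    print(left, "|", right, "|", score >= THRESHOLD)
```

The first pair differs by one letter and passes the threshold; the second pair does not.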
4. Attribute Matching
- Compare several attributes instead of a single key
- Use a probability or score to decide whether the records match
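A rough sketch of attribute matching combines per-attribute similarities into one weighted score. The attribute names, the weights, and the use of a weighted average in place of a calibrated probability are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Assumed weights: how much each attribute contributes to the overall score.
WEIGHTS = {"name": 0.5, "birth_date": 0.3, "city": 0.2}

def similarity(a, b):
    """String similarity in [0, 1]; a missing value contributes 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a, rec_b):
    """Weighted average of per-attribute similarities (weights sum to 1)."""
    return sum(
        weight * similarity(rec_a.get(attr), rec_b.get(attr))
        for attr, weight in WEIGHTS.items()
    )

# Hypothetical records from two sources.
a = {"name": "Asha Rao", "birth_date": "1990-04-12", "city": "Pune"}
b = {"name": "Asha Rao", "birth_date": "1990-04-12", "city": "Mumbai"}
c = {"name": "Ben Ortiz", "birth_date": "1985-11-02", "city": "Lima"}

# a and b agree on name and birth date, so they score higher than a and c.
print(match_score(a, b) > match_score(a, c))
```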
5. Heuristic Rules
- Apply hand-written rules and domain logic
- May not always be accurate
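The sketch below shows two hypothetical rules and a case where a rule gives the wrong answer. The attribute names (name, postcode, email) and the records are assumptions.

```python
def normalize(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def same_entity(rec_a, rec_b):
    """Hypothetical rule set for deciding whether two records match."""
    # Rule 1: an identical e-mail address is treated as the same person.
    if rec_a.get("email") and rec_a.get("email") == rec_b.get("email"):
        return True
    # Rule 2: same normalized name and same postcode.
    return (
        normalize(rec_a["name"]) == normalize(rec_b["name"])
        and rec_a["postcode"] == rec_b["postcode"]
    )

a = {"name": "Asha Rao", "postcode": "411001", "email": "asha@example.com"}
b = {"name": "asha  rao", "postcode": "411001", "email": None}
# Two family members sharing one e-mail address: Rule 1 wrongly matches them.
c = {"name": "Vikram Rao", "postcode": "411001", "email": "asha@example.com"}

print(same_entity(a, b))  # True: matched by Rule 2
print(same_entity(a, c))  # True: matched by Rule 1, although the people differ
```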
Better Solution Approach
A more reliable method:
- Uses strict matching rules (not just probability)
- Does not depend only on common keys
- Ensures that matches are accurate and correct
- Allows user input if needed
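One way to combine these points is a three-way decision: match, non-match, or send to a person for review. The sketch below is only an illustration. The attribute names (national_id, name, birth_date) and the rules themselves are hypothetical.

```python
def decide(rec_a, rec_b):
    """Return 'match', 'non-match', or 'review' for a pair of records."""
    # Strict rule: if both records carry an ID, the IDs settle the question.
    id_a, id_b = rec_a.get("national_id"), rec_b.get("national_id")
    if id_a and id_b:
        return "match" if id_a == id_b else "non-match"

    # No common key: fall back to other attributes that must all agree.
    same_name = rec_a["name"].casefold() == rec_b["name"].casefold()
    same_birth = rec_a["birth_date"] == rec_b["birth_date"]
    if same_name and same_birth:
        return "match"
    if not same_name and not same_birth:
        return "non-match"

    # Partial agreement is not trusted automatically; a person decides.
    return "review"

# Hypothetical records.
a = {"national_id": None, "name": "Asha Rao", "birth_date": "1990-04-12"}
b = {"national_id": "X1", "name": "ASHA RAO", "birth_date": "1990-04-12"}
c = {"national_id": None, "name": "Asha R.", "birth_date": "1990-04-12"}

print(decide(a, b))  # match: no shared ID, but name and birth date agree
print(decide(a, c))  # review: birth date agrees, name does not
```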
Why is this important?
Incorrect matching can lead to serious problems.
Example: If the records of two different employees are merged, decisions
based on them may be wrong.
Key Takeaways
- Data integration combines multiple data sources
- Entity identification ensures correct matching of records
- It is a major challenge in data mining
- Solving it improves data quality and decision-making