Entity Identification Problem in Data Mining

Jeevadharshan

What is Data Integration?

In data mining, large amounts of data come from different sources. Data Integration means combining this data into one unified dataset.

For example, data can come from:

Banking systems
Customer records
Social media (Twitter, blogs)
Sensors
Images, audio, and videos

This combined data is then used for:

Machine learning
Data analysis
Predicting trends

Why is Data Integration Important?

Proper data integration helps:

Remove duplicate data
Reduce errors and inconsistencies
Improve accuracy of analysis
Speed up data mining processes

However, combining data from different sources is not easy because:

Data formats differ
Names of attributes vary
Meanings may not match

What is the Entity Identification Problem?

When integrating data from multiple sources, the same real-world object may appear in different forms.

Entity Identification Problem means:

Identifying whether two records from different datasets refer to the same real-world entity.

Example:

Table A: cust_id

Table B: cust_number

Even though the attribute names differ, both columns may identify the same customer.
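A minimal sketch of this example, assuming both tables are held as lists of dictionaries. Only the column names cust_id and cust_number come from the example; the sample rows, COLUMN_MAP, and rename_columns are hypothetical names used for illustration.

```python
# Hypothetical sample rows; only the column names cust_id and cust_number
# come from the example above.
table_a = [{"cust_id": 101, "name": "Asha Rao"}, {"cust_id": 102, "name": "Ben Lee"}]
table_b = [{"cust_number": 102, "city": "Chennai"}, {"cust_number": 103, "city": "Pune"}]

# Assumed schema mapping: attribute name in Table B -> attribute name in Table A.
COLUMN_MAP = {"cust_number": "cust_id"}

def rename_columns(rows, column_map):
    """Return copies of rows with attribute names translated via column_map."""
    return [{column_map.get(key, key): value for key, value in row.items()}
            for row in rows]

# After renaming, both tables use cust_id and can be compared directly.
table_b_aligned = rename_columns(table_b, COLUMN_MAP)
```

Someone still has to confirm that the two columns really carry the same customer identifier; renaming only aligns the schemas.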

Common Issues in Data Integration

1. Data Redundancy

The same data appears more than once.

Example: The same customer is stored in several databases

Causes:

Different names for same attribute
Derived data (e.g., annual income calculated from monthly income)
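One way to spot derived data is to test whether one attribute can be computed from another. The sketch below checks the annual income example, assuming hypothetical column names monthly_income and annual_income and a hypothetical helper is_derived.

```python
def is_derived(rows, source, target, factor, tolerance=0.01):
    """Return True if target equals factor * source in every row.

    The comparison allows a relative tolerance to absorb rounding.
    """
    for row in rows:
        expected = factor * row[source]
        if abs(row[target] - expected) > tolerance * max(abs(expected), 1.0):
            return False
    return True

# Hypothetical records in which annual income is 12 times monthly income.
records = [
    {"monthly_income": 4000.0, "annual_income": 48000.0},
    {"monthly_income": 5500.0, "annual_income": 66000.0},
]
redundant = is_derived(records, "monthly_income", "annual_income", factor=12)
```

If the check succeeds, one of the two attributes can be dropped without losing information.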

2. Duplicate Attributes

Same information stored in different columns.

3. Irrelevant Attributes

Some attributes are not useful for the task. Example: A student ID is not needed to predict GPA

4. Entity Identification Problem

Matching the same real-world entity across different data sources.

Types of Data Integration

1. Virtual Integration

Data stays in original databases
A unified view is created
No physical merging

2. Actual Integration

Data is physically merged into one database
Old databases may be removed

Why Does This Problem Occur?

Different databases are:

Created at different times
Designed by different people
Built for different purposes

So, the same real-world entity may be stored differently.

Levels of Data Integration Problems

1. Schema Level Problems (Structure Issues)

These occur when database structures differ.

a. Domain Mismatch

Same attribute, different units or value formats
Example: Currency stored in USD in one source and in yen in another

b. Schema Mismatch

Different table structures
Example: One database has a single "Employee" table, another splits employees into "Part-time" and "Full-time" tables

c. Constraint Mismatch

Different rules on the same attribute
Example: One source requires a minimum GPA of 3.0, another requires 3.5

2. Instance Level Problems (Data Issues)

These occur when actual data values differ.

a. Entity Identification

Finding matching records for the same entity.

b. Attribute Value Conflict

Same attribute has different values due to:

Different units (kg vs pounds)
Missing data
Incorrect data
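Unit differences are the easiest of these conflicts to resolve: convert both values to one unit before comparing. A minimal sketch, assuming each source states its unit as "kg" or "lb"; the function names and the 0.05 kg tolerance are illustrative choices.

```python
LB_TO_KG = 0.45359237  # one international pound in kilograms (exact by definition)

def weight_in_kg(value, unit):
    """Normalize a weight to kilograms; None (missing data) is passed through."""
    if value is None:
        return None
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown unit: {unit}")

def values_conflict(a_kg, b_kg, tolerance=0.05):
    """True if both values are present and differ by more than tolerance (kg)."""
    if a_kg is None or b_kg is None:
        return False  # a missing value cannot be compared
    return abs(a_kg - b_kg) > tolerance

# Hypothetical values for the same person from two sources.
source_a = weight_in_kg(70.0, "kg")
source_b = weight_in_kg(154.3, "lb")
conflict = values_conflict(source_a, source_b)
```

A conflict that remains after conversion points to incorrect data in one of the sources and needs a separate decision about which value to trust.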

Approaches to Solve Entity Identification

1. Key Matching

Use a common key (such as an ID)

Works only if a common key exists in both sources
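Key matching can be sketched as a simple join on the shared key. The example assumes both sources use the same hypothetical key name "id" and that each key value appears at most once in the second source.

```python
def key_match(rows_a, rows_b, key):
    """Pair up rows from two sources that share the same value for key."""
    index_b = {row[key]: row for row in rows_b}  # key value -> row in source B
    return [(row, index_b[row[key]]) for row in rows_a if row[key] in index_b]

# Hypothetical customer records from two systems.
crm = [{"id": 1, "name": "Asha Rao"}, {"id": 2, "name": "Ben Lee"}]
billing = [{"id": 2, "balance": 250}, {"id": 3, "balance": 90}]

matches = key_match(crm, billing, "id")
```

Only the customer with id 2 appears in both sources, so only that pair is returned.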

2. User-Specified Matching

User manually defines matches

Accurate but time-consuming

3. Probabilistic Key Matching

Match records whose keys are similar but not identical

Example: Matching names that are spelled slightly differently
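A small sketch of name matching using the standard-library difflib.SequenceMatcher. Its ratio is a string similarity score, not a true probability, and the 0.85 threshold is an assumed value that would need tuning on real data.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def probable_matches(names_a, names_b, threshold=0.85):
    """Return (name_a, name_b, score) for every pair scoring at or above threshold."""
    results = []
    for a in names_a:
        for b in names_b:
            score = name_similarity(a, b)
            if score >= threshold:
                results.append((a, b, score))
    return results

# Hypothetical names from two sources.
pairs = probable_matches(["Jon Smith"], ["John Smith", "Mary Jones"])
```

Here "Jon Smith" is paired with "John Smith" but not with "Mary Jones". Comparing every pair is quadratic in the number of records, so this form only suits small inputs.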

4. Attribute Matching

Compare multiple attributes

Uses a probability (or similarity score) to decide whether two records match
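A minimal sketch of attribute matching, using a weighted similarity score as a stand-in for a match probability. The attribute names, the weights, and the helper functions are assumptions for illustration; in practice the weights would be tuned or estimated from labelled pairs.

```python
from difflib import SequenceMatcher

# Assumed weights per attribute; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}

def field_similarity(a, b):
    """Similarity in [0, 1] between two values; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted sum of per-attribute similarities."""
    return sum(weight * field_similarity(rec_a.get(field), rec_b.get(field))
               for field, weight in weights.items())

# Hypothetical records.
a = {"name": "Asha Rao", "email": "asha@example.com", "city": "Chennai"}
b = {"name": "Asha R. Rao", "email": "asha@example.com", "city": "chennai"}
c = {"name": "Ben Lee", "email": "ben@example.org", "city": "Pune"}

score_same = match_score(a, b)   # high: likely the same person
score_other = match_score(a, c)  # low: likely different people
```

Two records are declared a match when the score exceeds a chosen cut-off. Using several attributes makes the decision less sensitive to an error in any single field.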

5. Heuristic Rules

Use rules and logic

May not always be accurate
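Heuristic rules can be written as plain conditions. The two rules below (same e-mail, or same phone number plus same surname) are assumed examples, not rules from any particular system.

```python
def normalize_phone(phone):
    """Keep digits only and compare on the last 10 of them."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return digits[-10:]

def surname(name):
    """Last word of a name in lower case, or '' if the name is empty."""
    parts = name.split()
    return parts[-1].lower() if parts else ""

def same_entity(rec_a, rec_b):
    """Apply the assumed rules in order; return True on the first rule that fires."""
    # Rule 1: identical e-mail addresses (case-insensitive).
    email_a = (rec_a.get("email") or "").lower()
    email_b = (rec_b.get("email") or "").lower()
    if email_a and email_a == email_b:
        return True
    # Rule 2: same phone number and same non-empty surname.
    phone_a = normalize_phone(rec_a.get("phone") or "")
    phone_b = normalize_phone(rec_b.get("phone") or "")
    surname_a = surname(rec_a.get("name") or "")
    surname_b = surname(rec_b.get("name") or "")
    if phone_a and phone_a == phone_b and surname_a and surname_a == surname_b:
        return True
    return False
```

Rules like these are easy to explain, but they can be wrong: two family members may share a phone number and a surname, which is why heuristic matches may not always be accurate.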

Better Solution Approach

A more reliable method should:

Use strict matching rules (not just probability)
Not depend only on common keys
Declare a match only when the evidence is sufficient
Allow user input when the rules cannot decide
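One possible reading of these points, as a sketch: use the common key when both records have it, fall back to a strict rule otherwise, and hand unclear cases to a user. The field names (id, name, dob) and the strict rule itself are assumptions for illustration.

```python
def decide(rec_a, rec_b):
    """Return 'match', 'no_match', or 'needs_review' for a pair of records."""
    # Use the common key when both sources provide it, but do not depend on it.
    id_a, id_b = rec_a.get("id"), rec_b.get("id")
    if id_a is not None and id_b is not None:
        return "match" if id_a == id_b else "no_match"

    name_a, name_b = rec_a.get("name"), rec_b.get("name")
    dob_a, dob_b = rec_a.get("dob"), rec_b.get("dob")
    if None in (name_a, name_b, dob_a, dob_b):
        return "needs_review"  # not enough evidence; leave it to a user

    # Strict rule (assumed): name and date of birth must both agree exactly.
    same_name = name_a.strip().lower() == name_b.strip().lower()
    same_dob = dob_a == dob_b
    if same_name and same_dob:
        return "match"
    if not same_name and not same_dob:
        return "no_match"
    return "needs_review"  # the two attributes disagree with each other
```

Pairs returned as needs_review would go to a person for manual confirmation, so the automatic part never guesses.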

Why is this important?

Incorrect matching can lead to serious problems.

Example: If the records of two different employees are matched as one, decisions based on the merged record will be wrong.

Key Takeaways

Data integration combines multiple data sources
Entity identification ensures correct matching of records
It is a major challenge in data mining
Solving it improves data quality and decision-making
