Entity Identification Problem in Data Mining

Jeevadharshan

What is Data Integration? 

In data mining, large amounts of data come from different sources. Data Integration means combining this data into one unified dataset. 

For example, data can come from:

  • Banking systems 
  • Customer records 
  • Social media (Twitter, blogs) 
  • Sensors 
  • Images, audio, and videos

This combined data is then used for:

  • Machine learning 
  • Data analysis 
  • Predicting trends 

Why is Data Integration Important? 

Proper data integration helps: 

  • Remove duplicate data 
  • Reduce errors and inconsistencies 
  • Improve accuracy of analysis 
  • Speed up data mining processes

However, combining data from different sources is not easy because:

  • Data formats differ 
  • Names of attributes vary 
  • Meanings may not match

What is the Entity Identification Problem? 

When integrating data from multiple sources, the same real-world object may appear in different forms.

The Entity Identification Problem is the task of deciding whether two records from different datasets refer to the same real-world entity.

Example:

Table A stores the customer identifier as cust_id.

Table B stores it as cust_number.

Even though the attribute names differ, both may identify the same customer.
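
The renaming step in this example can be sketched in Python. Only the attribute names cust_id and cust_number come from the example above; the sample rows, the shared name customer_id, and the helper to_shared_schema are assumptions made for illustration.

```python
# Hypothetical rows from two sources; only the attribute names
# cust_id and cust_number come from the example above.
table_a = [{"cust_id": 101, "name": "Asha Rao"}]
table_b = [{"cust_number": 101, "city": "Chennai"}]

# Assumed mapping from each source's attribute name to a shared name.
ATTRIBUTE_MAP = {"cust_id": "customer_id", "cust_number": "customer_id"}

def to_shared_schema(row):
    """Rename source-specific attributes to the shared names."""
    return {ATTRIBUTE_MAP.get(key, key): value for key, value in row.items()}

unified_a = [to_shared_schema(row) for row in table_a]
unified_b = [to_shared_schema(row) for row in table_b]

# Both rows now expose the identifier under the same attribute name.
print(unified_a[0]["customer_id"] == unified_b[0]["customer_id"])
```

Renaming only aligns the attributes. Whether equal values really denote the same customer still has to be decided by one of the matching approaches described later.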

Common Issues in Data Integration

1. Data Redundancy 

The same data appears more than once.
Example: The same customer is stored in different databases.
Causes:
  • Different names for the same attribute
  • Derived data (e.g., annual income calculated from monthly income)
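
A derived attribute like the annual income above can be detected with a simple check. This is a minimal sketch: the column names, the sample rows, and the tolerance are assumptions, and it only tests the one relationship "annual = 12 × monthly".

```python
def is_derived_annual(rows, tolerance=0.01):
    """Return True if annual_income equals 12 * monthly_income in every row."""
    return all(
        abs(row["annual_income"] - 12 * row["monthly_income"]) <= tolerance
        for row in rows
    )

# Hypothetical sample data.
rows = [
    {"monthly_income": 4000.0, "annual_income": 48000.0},
    {"monthly_income": 5500.0, "annual_income": 66000.0},
]

if is_derived_annual(rows):
    # The annual column carries no new information, so drop it.
    rows = [
        {key: value for key, value in row.items() if key != "annual_income"}
        for row in rows
    ]
```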

2. Duplicate Attributes 

The same information is stored in different columns.

3. Irrelevant Attributes

Some data is not useful for the task at hand. Example: A student ID is not needed to predict GPA.

4. Entity Identification Problem

Matching the same real-world entity across different data sources.

Types of Data Integration

1. Virtual Integration

  • Data stays in original databases 
  • A unified view is created 
  • No physical merging

2. Actual Integration

  • Data is physically merged into one database 
  • Old databases may be removed
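
The difference between the two types can be sketched with plain Python lists standing in for databases. The source lists, the records, and the function name unified_view are all hypothetical.

```python
# Two hypothetical source "databases", already in a shared schema.
source_a = [{"customer_id": 1, "name": "Asha"}]
source_b = [{"customer_id": 2, "name": "Ben"}]

def unified_view():
    """Virtual integration: read from the sources each time, store nothing."""
    yield from source_a
    yield from source_b

# Actual integration: copy everything into one new store.
merged_store = list(source_a) + list(source_b)

# A later change to a source is visible in the view but not in the copy.
source_a.append({"customer_id": 3, "name": "Chen"})
print(len(list(unified_view())), len(merged_store))
```

The trade-off this illustrates: the view always reflects the current sources but has to read them on every query, while the merged store is self-contained but must be refreshed when the sources change.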

Why Does This Problem Occur?

Different databases are:

  • Created at different times 
  • Designed by different people 
  • Built for different purposes

So, the same real-world entity may be stored differently.

Levels of Data Integration Problems 

1. Schema Level Problems (Structure Issues)

These occur when database structures differ.

a. Domain Mismatch 

  • Same attribute, different units or value formats
  • Example: Prices stored in USD in one source and in yen in another
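
One way to handle such a mismatch is to convert every value to a single unit before merging. The sketch below is illustrative only: the exchange rate is a placeholder, not a real rate, and the function name to_usd is an assumption.

```python
from decimal import Decimal

# Placeholder rate for illustration only; a real pipeline would look up
# the rate that applies to each record's date.
JPY_PER_USD = Decimal("150")

def to_usd(amount, currency):
    """Convert an amount given in USD or JPY to USD."""
    if currency == "USD":
        return Decimal(amount)
    if currency == "JPY":
        return Decimal(amount) / JPY_PER_USD
    raise ValueError(f"unsupported currency: {currency}")

# Hypothetical prices from two sources.
prices = [("20", "USD"), ("3000", "JPY")]
prices_in_usd = [to_usd(amount, currency) for amount, currency in prices]
```

Decimal is used instead of float so that money values are not affected by binary rounding.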

b. Schema Mismatch

  • Different table structures 
  • Example: One database has a single "Employee" table, another splits employees into "Part-time" and "Full-time"

c. Constraint Mismatch

  • Different rules on the same data
  • Example: One source requires a minimum GPA of 3.0, another requires 3.5

2. Instance Level Problems (Data Issues)

These occur when actual data values differ.

a. Entity Identification

Finding matching records for the same entity.

b. Attribute Value Conflict

The same attribute has different values in different sources due to:
  • Different units (kg vs pounds)
  • Missing data
  • Incorrect data
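
The first two causes can be handled mechanically, as the sketch below suggests; the third usually cannot. The function names, the tolerance of 0.5 kg, and the sample weights are assumptions for illustration.

```python
KG_PER_POUND = 0.45359237  # exact by definition of the international pound

def weight_in_kg(value, unit):
    """Normalize a weight to kilograms; None stands for a missing value."""
    if value is None:
        return None
    if unit == "kg":
        return value
    if unit == "lb":
        return value * KG_PER_POUND
    raise ValueError(f"unknown unit: {unit}")

def resolve(a, b, tolerance=0.5):
    """Pick one value when two sources report the same attribute."""
    if a is None or b is None:
        return a if a is not None else b  # fill a gap from the other source
    if abs(a - b) <= tolerance:
        return a                          # values agree after conversion
    return None                           # real conflict: leave for review

# Hypothetical readings for the same person from two sources.
from_source_a = weight_in_kg(70.0, "kg")
from_source_b = weight_in_kg(154.3, "lb")
resolved = resolve(from_source_a, from_source_b)
```

When the values still disagree after conversion, the sketch returns None rather than guessing, because code alone cannot tell which source holds the incorrect value.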

Approaches to Solve Entity Identification

1. Key Matching 

  • Uses a common key (such as an ID)
  • Works only if a common key exists
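
Key matching amounts to a join on the shared key. A minimal sketch, assuming both sources use the same key name and that key values are unique within the second source; the records and the function name match_by_key are hypothetical.

```python
def match_by_key(rows_a, rows_b, key):
    """Pair records from two sources that share the same key value."""
    # Assumes key values are unique within rows_b.
    index_b = {row[key]: row for row in rows_b}
    return [
        (row_a, index_b[row_a[key]])
        for row_a in rows_a
        if row_a[key] in index_b
    ]

# Hypothetical records from two sources.
customers = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ben"}]
accounts = [{"id": 2, "balance": 250}, {"id": 3, "balance": 90}]
pairs = match_by_key(customers, accounts, "id")
```

Records whose key appears in only one source (ids 1 and 3 here) are left unmatched, which is the limitation noted above.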

2. User-Specified Matching

  • The user manually defines which records match
  • Accurate but time-consuming

3. Probabilistic Key Matching

  • Matches records based on partial key similarity
  • Example: Matching similar names
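
As a rough illustration of name matching, the sketch below scores string similarity with SequenceMatcher from Python's standard difflib module. This is one possible similarity measure, not a method prescribed by the article; the threshold of 0.85 and the function names are assumptions.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.85  # assumed cut-off; would need tuning on real data

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def probably_same(a, b):
    """Treat two names as the same entity when they are similar enough."""
    return name_similarity(a, b) >= THRESHOLD

print(probably_same("Jon Smith", "John Smith"))
print(probably_same("Jon Smith", "Mary Jones"))
```

A lower threshold catches more spelling variants but also pairs more records that belong to different people.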

4. Attribute Matching

  • Compares multiple attributes
  • Uses probability to decide whether records match
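
A simplified sketch of the idea: each attribute that agrees adds to a score, and the pair is accepted when the score is high enough. The integer weights and the threshold below are hand-picked stand-ins for probabilities that would normally be estimated from data, and the attribute names are hypothetical.

```python
# Assumed weights out of 10: how much agreement on each attribute counts.
WEIGHTS = {"name": 5, "birth_date": 3, "city": 2}
MATCH_THRESHOLD = 7  # assumed cut-off

def match_score(rec_a, rec_b):
    """Sum the weights of attributes that are present and equal in both records."""
    score = 0
    for attribute, weight in WEIGHTS.items():
        value_a, value_b = rec_a.get(attribute), rec_b.get(attribute)
        if value_a is not None and value_a == value_b:
            score += weight
    return score

# Hypothetical records: same name and birth date, different city.
record_a = {"name": "asha rao", "birth_date": "1990-04-12", "city": "Chennai"}
record_b = {"name": "asha rao", "birth_date": "1990-04-12", "city": "Madurai"}
is_match = match_score(record_a, record_b) >= MATCH_THRESHOLD
```

Because several attributes contribute, a single disagreement (such as a changed city) does not by itself prevent a match.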

5. Heuristic Rules

  • Uses rules and logic
  • May not always be accurate
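
The sketch below shows what such rules might look like and how they can go wrong. Both rules, the field names, and the sample records are invented for illustration.

```python
def same_person(rec_a, rec_b):
    """Apply hand-written rules in order; the first rule that fires decides."""
    # Rule 1: identical, non-empty e-mail addresses are treated as a match.
    if rec_a.get("email") and rec_a.get("email") == rec_b.get("email"):
        return True
    # Rule 2: same phone number and same family name also count as a match.
    if (rec_a.get("phone") and rec_a.get("phone") == rec_b.get("phone")
            and rec_a.get("last_name") == rec_b.get("last_name")):
        return True
    return False

# Two different family members who share a home phone number.
asha = {"first_name": "Asha", "last_name": "Rao", "phone": "555-0100"}
vikram = {"first_name": "Vikram", "last_name": "Rao", "phone": "555-0100"}
print(same_person(asha, vikram))
```

Rule 2 reports these two people as the same entity, which is the kind of inaccuracy noted above.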

Better Solution Approach

A more reliable method:

  • Uses strict matching rules (not just probability)
  • Does not depend only on common keys
  • Ensures accurate and correct matching
  • Allows user input if needed
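
One possible reading of these points, sketched under assumptions: a pair is accepted only when every identifying attribute agrees, incomplete records are sent for review instead of being guessed, and a decision entered by a user overrides the rule. The field names, the source_id attribute, and the three outcome labels are all hypothetical.

```python
IDENTIFYING = ("name", "birth_date", "address")  # assumed identifying attributes

def decide(rec_a, rec_b, user_decisions):
    """Return 'match', 'no_match', or 'needs_review' for a pair of records."""
    pair = (rec_a["source_id"], rec_b["source_id"])
    # A decision already made by a user always wins.
    if pair in user_decisions:
        return "match" if user_decisions[pair] else "no_match"
    # Incomplete records cannot satisfy the strict rule; ask a user.
    if any(rec_a.get(f) is None or rec_b.get(f) is None for f in IDENTIFYING):
        return "needs_review"
    # Strict rule: every identifying attribute must be equal.
    if all(rec_a[f] == rec_b[f] for f in IDENTIFYING):
        return "match"
    return "no_match"

# Hypothetical records from two sources with no common key.
rec_a = {"source_id": "A1", "name": "Asha Rao",
         "birth_date": "1990-04-12", "address": "12 Lake Rd"}
rec_b = {"source_id": "B7", "name": "Asha Rao",
         "birth_date": "1990-04-12", "address": "12 Lake Rd"}
result = decide(rec_a, rec_b, user_decisions={})
```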

Why is this important?

Incorrect matching can lead to serious problems.
Example: Wrongly matched employee records may lead to incorrect decisions.

Key Takeaways

  • Data integration combines multiple data sources 
  • Entity identification ensures correct matching of records 
  • It is a major challenge in data mining 
  • Solving it improves data quality and decision-making
