Data Mining Task Primitives
A data mining task can be described using a data mining query that is given to a data mining system.
This query is defined using data mining task primitives.
These primitives help users interact with the data mining system while discovering patterns in the data. They allow users to guide the mining process and analyze the results from different perspectives or levels.
Data mining task primitives define the following:
- The data that needs to be mined
- The type of knowledge or patterns to discover
- The background knowledge used during mining
- The measures used to evaluate interesting patterns
- The format used to display the discovered results
A data mining query language can include these primitives so that users can communicate easily with data mining systems. Such a language also provides a foundation for building user-friendly graphical interfaces.
Designing a complete data mining language is difficult because data mining includes many tasks such as data description, classification, clustering, and evolution analysis. Each task requires different techniques and processing methods.
Therefore, designing an effective data mining query language requires a strong understanding of the capabilities and limitations of different data mining techniques. A well-designed query language also makes it easier to integrate the data mining system with other information systems.
List of Data Mining Task Primitives
A data mining query mainly includes the following primitives.
1. Task-Relevant Data to be Mined
This specifies which part of the database will be used for mining.
It includes:
- Selected attributes (columns)
- Selected dimensions from a data warehouse
- Specific records that match certain conditions
In a relational database, this data is collected using queries that involve operations such as:
- Selection
- Projection
- Join
- Aggregation
After the required data is retrieved, a new dataset called the initial data relation is created. This dataset can also be sorted or grouped based on the query conditions.
Sometimes, this dataset does not physically exist in the database. Instead, it is a virtual table, also called a view. In data mining, this view is known as a minable view because it contains the data used for mining.
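The four relational operations and the idea of a minable view can be sketched with Python's built-in sqlite3 module. The schema, table names, and rows below are hypothetical and were invented only for this illustration. They are not part of any real database.

```python
import sqlite3

# Hypothetical schema and rows, invented only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, age INTEGER, income INTEGER);
    CREATE TABLE purchase (cust_id INTEGER, item TEXT, price REAL);
    INSERT INTO customer VALUES (1, 34, 52000), (2, 45, 38000), (3, 29, 61000);
    INSERT INTO purchase VALUES
        (1, 'laptop', 900.0), (1, 'phone', 400.0),
        (2, 'tv', 700.0),
        (3, 'camera', 350.0), (3, 'cable', 15.0);

    -- The minable view: no rows are copied, the query runs when the view is read.
    CREATE VIEW minable_view AS
    SELECT c.cust_id, c.age, c.income,          -- projection
           SUM(p.price) AS total_spent          -- aggregation
    FROM customer AS c
    JOIN purchase AS p ON p.cust_id = c.cust_id -- join
    WHERE c.income >= 40000                     -- selection
    GROUP BY c.cust_id, c.age, c.income;
""")

# Reading the view yields the initial data relation handed to the mining step.
initial_relation = conn.execute(
    "SELECT cust_id, age, income, total_spent FROM minable_view ORDER BY cust_id"
).fetchall()
print(initial_relation)
```

Because the view is virtual, later changes to the customer or purchase tables are reflected the next time the view is queried.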
2. Kind of Knowledge to be Mined
This primitive defines what type of patterns or knowledge we want to discover.
Some common data mining tasks include:
- Data characterization
- Discrimination
- Association rule mining
- Correlation analysis
- Classification
- Prediction
- Clustering
- Outlier detection
- Evolution analysis
Each task finds different kinds of patterns in the data.
3. Background Knowledge
Background knowledge is domain knowledge that helps guide the data mining process and improves the quality of discovered patterns.
A common form of background knowledge is a Concept Hierarchy.
Concept Hierarchy
A concept hierarchy organizes data from low-level detailed concepts to higher-level general concepts.
For example:
City → State → Country
Concept hierarchies allow mining at different levels of abstraction.
Operations in Concept Hierarchy
Rolling Up (Generalization)
- Converts detailed data into more general information
- Makes the data easier to understand
- Reduces the amount of data
Example:
Chennai → Tamil Nadu → India
Drilling Down (Specialization)
- Moves from general concepts to more detailed information
Example:
Country → State → City
Different users may create different hierarchies for the same attribute depending on their needs.
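Rolling up and drilling down along the City → State → Country hierarchy can be sketched as follows. This is a minimal illustration that assumes the hierarchy is stored as child-to-parent dictionaries. The function names, the extra cities, and the sales figures are hypothetical.

```python
from collections import Counter

# Hypothetical hierarchy fragments: each level maps a value to its parent.
CITY_TO_STATE = {
    "Chennai": "Tamil Nadu",
    "Coimbatore": "Tamil Nadu",
    "Mumbai": "Maharashtra",
}
STATE_TO_COUNTRY = {"Tamil Nadu": "India", "Maharashtra": "India"}


def roll_up(counts, parent_of):
    """Generalize one level by summing each child's count into its parent."""
    rolled = Counter()
    for value, n in counts.items():
        rolled[parent_of[value]] += n
    return dict(rolled)


def drill_down(parent, counts, parent_of):
    """Specialize: return the child-level counts that belong to one parent."""
    return {v: n for v, n in counts.items() if parent_of[v] == parent}


# Made-up sales figures at the most detailed level.
sales_by_city = {"Chennai": 120, "Coimbatore": 30, "Mumbai": 200}

sales_by_state = roll_up(sales_by_city, CITY_TO_STATE)
sales_by_country = roll_up(sales_by_state, STATE_TO_COUNTRY)
tamil_nadu_detail = drill_down("Tamil Nadu", sales_by_city, CITY_TO_STATE)

print(sales_by_state)
print(sales_by_country)
print(tamil_nadu_detail)
```

Each roll-up produces fewer rows than its input, which is why generalization reduces the amount of data.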
4. Interestingness Measures and Thresholds
Not all discovered patterns are useful. Therefore, interestingness measures are used to evaluate patterns.
Users can also define threshold values to filter out unimportant patterns.
Common Measures
1. Simplicity
A pattern should be simple and easy to understand.
- Complex rules are harder to interpret.
- Simpler rules are usually more useful.
Simplicity can be measured based on:
- Number of attributes
- Number of conditions in the rule
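Both counts are easy to compute once a rule has a concrete representation. The sketch below assumes a rule is a dictionary whose "if" part is a list of (attribute, operator, value) conditions. This representation and the function name are hypothetical.

```python
def simplicity(rule):
    """Return the two size counts of a rule; smaller values mean a simpler rule."""
    conditions = rule["if"]
    return {
        "conditions": len(conditions),
        "attributes": len({attribute for attribute, _, _ in conditions}),
    }


# Hypothetical rule: three conditions that mention two distinct attributes.
rule = {
    "if": [("age", ">=", 30), ("age", "<", 40), ("income", ">=", 40000)],
    "then": ("class", "=", "promising"),
}

print(simplicity(rule))
```

A user-defined threshold could then discard rules whose condition count exceeds a chosen maximum.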
2. Certainty (Confidence)
Confidence measures how reliable a rule is.
For an association rule:
A → B
Confidence is calculated as:
Confidence(A → B) = (Number of records containing both A and B) / (Number of records containing A)
Higher confidence means the rule is more reliable.
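The confidence formula can be checked on a small example. The sketch below assumes each record is a set of items. The transactions and the function name are hypothetical, and returning 0.0 when no record contains A is a convention chosen here, not something the formula defines.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of the records containing A that also contain B."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_a = [t for t in transactions if antecedent <= t]
    if not with_a:
        return 0.0  # assumption: treat an unmatched antecedent as zero confidence
    with_both = sum(1 for t in with_a if consequent <= t)
    return with_both / len(with_a)


# Hypothetical records: four contain "computer", three of those also contain "software".
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_confidence)
```

Here the rule computer → software has a confidence of 3 / 4 = 0.75.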
3. Utility (Support)
Support measures how frequently a pattern occurs in the dataset.
Support(A → B) = (Number of records containing both A and B) / (Total number of records)
High support means the pattern appears frequently.
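Support, combined with a user-defined threshold, can be sketched as follows. The records, the candidate rules, and the min_support value are hypothetical and were chosen only to show the filtering step.

```python
def support(transactions, items):
    """Fraction of all records that contain every item in `items`."""
    if not transactions:
        return 0.0  # assumption: an empty dataset supports nothing
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


# Hypothetical records, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

# Candidate rules written as (A, B); Support(A -> B) counts records with both A and B.
candidates = [({"computer"}, {"software"}), ({"printer"}, {"software"})]
min_support = 0.5  # hypothetical user-defined threshold

frequent = [
    (a, b) for a, b in candidates
    if support(transactions, a | b) >= min_support
]
print(frequent)
```

With these records, computer → software has a support of 3 / 5 = 0.6 and passes the threshold, while printer → software has a support of 1 / 5 = 0.2 and is filtered out.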
4. Novelty
A pattern is interesting if it provides new information.
For example:
- Unexpected patterns
- Rare exceptions
- Non-redundant rules
Removing duplicate or similar patterns can help highlight novel patterns.
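One possible way to remove duplicate and redundant rules is sketched below. It assumes a particular definition of redundancy: a rule is dropped when a more general rule (a strict subset of its conditions) predicts the same result with confidence at least as high. This definition, the rule representation, and the confidence values are assumptions made for illustration. Other definitions of novelty are equally valid.

```python
def drop_redundant(rules):
    """Keep rules that are neither exact duplicates nor covered by a more general rule."""
    kept, seen = [], set()
    for rule in rules:
        key = (rule["if"], rule["then"])
        if key in seen:
            continue  # exact duplicate of a rule already kept
        covered = any(
            other["then"] == rule["then"]
            and other["if"] < rule["if"]  # strict subset: other is more general
            and other["confidence"] >= rule["confidence"]
            for other in rules
        )
        if not covered:
            kept.append(rule)
            seen.add(key)
    return kept


# Hypothetical rules with made-up confidence values.
rules = [
    {"if": frozenset({"computer"}), "then": "software", "confidence": 0.75},
    {"if": frozenset({"computer", "printer"}), "then": "software", "confidence": 0.70},
    {"if": frozenset({"computer", "printer"}), "then": "scanner", "confidence": 0.90},
    {"if": frozenset({"computer"}), "then": "software", "confidence": 0.75},
]

novel = drop_redundant(rules)
for rule in novel:
    print(sorted(rule["if"]), "->", rule["then"])
```

The second rule is dropped because the first rule is more general and at least as confident. The fourth is dropped as an exact duplicate of the first.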
5. Representation of Discovered Patterns
This primitive specifies how the discovered patterns should be displayed to the user.
Some common representation formats include:
- Rules
- Tables
- Cross-tabulations
- Charts (bar chart, pie chart)
- Graphs
- Decision trees
- Data cubes
Users should be able to choose the most suitable visualization method for their analysis.
For example:
- Charts and tables are useful for descriptive analysis
- Decision trees are commonly used for classification
Example of Data Mining Task Primitives
Suppose you are a marketing manager at AllElectronics and want to classify customers based on their buying behavior.
You are interested in customers who:
- Have an income of at least $40,000
- Purchased more than $1,000 worth of items
- Bought items that each cost at least $100
You want to analyze the following customer information:
- Age
- Income
- Type of items purchased
- Purchase location
- Place where the items were manufactured
Finally, you want to display the results in the form of classification rules.
This type of request can be written as a Data Mining Query Language (DMQL) query like this:
use database AllElectronics_db
use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
mine classification as promising_customers
in relevance to C.age, C.income, I.type, I.place_made, T.branch
from customer C, item I, transaction T
where I.item_ID = T.item_ID
and C.cust_ID = T.cust_ID
and C.income ≥ 40,000
and I.price ≥ 100
group by T.cust_ID
having sum(I.price) > 1,000
display as rules
This query tells the data mining system to classify customers based on their purchasing patterns.
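To make the link between this query and the five primitives explicit, the sketch below maps the query onto a small Python structure with one field per primitive. The MiningQuery class and its field names are hypothetical and do not belong to DMQL or to any library. The values are taken from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class MiningQuery:
    """Hypothetical container with one field per data mining task primitive."""
    task_relevant_data: dict
    knowledge_kind: str
    background_knowledge: dict = field(default_factory=dict)
    interestingness: dict = field(default_factory=dict)
    presentation: str = ""

    def unspecified_primitives(self):
        """Names of the primitives that the query leaves empty."""
        return [name for name, value in vars(self).items() if not value]


query = MiningQuery(
    task_relevant_data={
        "database": "AllElectronics_db",
        "attributes": ["C.age", "C.income", "I.type", "I.place_made", "T.branch"],
        "relations": ["customer C", "item I", "transaction T"],
        "conditions": [
            "I.item_ID = T.item_ID",
            "C.cust_ID = T.cust_ID",
            "C.income >= 40000",
            "I.price >= 100",
        ],
        "group_by": "T.cust_ID",
    },
    knowledge_kind="classification",
    background_knowledge={"T.branch": "location_hierarchy", "C.age": "age_hierarchy"},
    presentation="rules",  # the manager asked for classification rules
)

print(query.unspecified_primitives())
```

The example request names no interestingness measure or threshold, so that field stays empty.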