Data Selection in Data Mining

Balaji. K

Data selection is the process of deciding on the appropriate type of data, data source, and collection tools before data collection begins.

It is an important step because the quality of your results depends on the data you choose.

Main Goal of Data Selection

The goal is to select data that can properly answer the research question. This depends on:

The purpose of the study
Previous research (literature)
Availability of data

If data is chosen only because it is cheap or convenient, the accuracy and reliability of the results may suffer.

Common Issues in Data Selection

While selecting data, researchers should consider:

Choosing the correct type and source of data
Ensuring the data can answer the research question
Collecting a representative sample
Using proper tools/instruments for data collection
Making sure the data and tools are compatible
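
One way to collect a representative sample is stratified sampling, where the same fraction is drawn from every group. The sketch below is only an illustration under assumed data: the `region` field, the 80/20 population split, and the helper name `stratified_sample` are hypothetical and do not come from the article.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every group so each group stays represented."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    sample = []
    for members in groups.values():
        # Keep at least one record per group, even for very small groups.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80 urban and 20 rural respondents.
population = [{"id": i, "region": "urban" if i < 80 else "rural"} for i in range(100)]
sample = stratified_sample(population, key="region", fraction=0.1)

counts = defaultdict(int)
for rec in sample:
    counts[rec["region"]] += 1
print(dict(counts))  # {'urban': 8, 'rural': 2}
```

The sample keeps the 80/20 proportion of the population, which a convenience sample taken from one region would not.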

Types of Data

There are two main types of data:

1. Quantitative Data

Numerical data
Examples: temperature, height, voltage
Used for calculations and analysis

2. Qualitative Data

Non-numerical data
Examples: text, images, videos, observations
Used for understanding behavior or patterns

Many projects use both types to get better insights.

Sources of Data

Data can come from:

Human observations
Experiments
Field notes or journals
Lab results
Direct observation of people, animals, or plants

Questions to Ask Before Selecting Data

What is the research question?
What is the scope of the study?
What type of data is needed (quantitative, qualitative, or both)?
What do previous studies suggest?
Is the required data available?

Feature Selection in Data Mining

Feature selection means choosing only the most important variables (features) from a dataset and removing unnecessary ones.

It helps improve model performance and reduces complexity.

Simple Example

A doctor uses only important medical test results (features) to decide whether surgery is needed.

Types of Feature Selection

1. Supervised Feature Selection

Uses labeled data
Goal: keep the features that improve prediction accuracy

2. Unsupervised Feature Selection

Uses unlabeled data
Goal: keep the features that lead to better grouping (clustering)
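
The difference between the two settings can be sketched in a few lines. The scoring rules below are assumptions chosen for brevity (a class-mean gap for the supervised case, plain variance for the unsupervised case); they are not the only options and are not taken from the article.

```python
from statistics import mean, pvariance

def supervised_score(values, labels):
    """Uses labels: gap between the feature's mean in each of two classes."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return abs(mean(pos) - mean(neg))

def unsupervised_score(values):
    """Uses no labels: how much the feature varies on its own."""
    return pvariance(values)

# Hypothetical feature and binary labels.
feature = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
labels = [0, 0, 0, 1, 1, 1]

print(supervised_score(feature, labels))
print(unsupervised_score(feature))
```

The supervised score cannot be computed without `labels`, while the unsupervised score looks only at the feature values.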

Why Feature Selection is Important

Feature selection helps to:

Reduce extra or useless data
Improve model accuracy
Make models easier to understand
Save time, memory, and computing power

Problems Without Feature Selection

Too much data → slows down processing
Noisy or duplicate data → reduces accuracy
High-dimensional data → needs more training data
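
As a small illustration of the duplicate-data problem, the sketch below drops columns whose values exactly repeat an earlier column. The column names are hypothetical, and exact matching is an assumption; real data often needs a looser notion of "duplicate", such as very high correlation.

```python
def drop_duplicate_columns(columns):
    """Keep the first column for each distinct sequence of values."""
    seen = set()
    kept = {}
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:
            continue  # an identical column was already kept
        seen.add(key)
        kept[name] = values
    return kept

# Hypothetical dataset: "temp_copy" repeats "temp_c" exactly.
columns = {
    "temp_c": [20, 22, 19],
    "temp_copy": [20, 22, 19],
    "voltage": [5, 5, 12],
}
print(list(drop_duplicate_columns(columns)))  # ['temp_c', 'voltage']
```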

How Feature Selection Works

Each feature is given a score
Important features are selected
Less useful features are removed

This can be done:

Manually by analysts
Automatically by algorithms
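
The score, select, and remove steps above can be sketched as follows. This is a minimal example, assuming the score is the absolute Pearson correlation between each feature and a numeric target; the feature names and values are hypothetical, and other scoring rules would fit the same pattern.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no signal
    return cov / (sx * sy)

def select_features(columns, target, keep):
    # Step 1: give each feature a score.
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    # Steps 2 and 3: keep the highest-scoring features, drop the rest.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores

# Hypothetical features and target.
columns = {
    "dose_mg": [10, 20, 30, 40, 50],
    "room_number": [3, 1, 4, 1, 5],
    "constant_flag": [1, 1, 1, 1, 1],
}
target = [1.1, 2.0, 2.9, 4.2, 5.0]

selected, scores = select_features(columns, target, keep=1)
print(selected)  # ['dose_mg']
```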

Feature Selection in SQL Server

In SQL Server Data Mining:

Feature selection happens before training the model
The system calculates a score for each feature
Only high-scoring features are used

Factors affecting selection:

Algorithm used
Type of data
Model settings

Feature Selection Scoring Methods

1. Interestingness Score

Measures how much useful information a feature provides

Based on entropy (information content)

Lower entropy (less randomness) means a more useful feature

2. Shannon’s Entropy

Measures uncertainty in data

Formula:

H(X) = -Σ P(x) log(P(x))
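
The formula can be checked with a short computation. The sketch below estimates P(x) from observed frequencies and uses a base-2 logarithm, so the result is in bits; the base is an assumption, since the formula above does not fix one.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """H(X) = -sum of P(x) * log2(P(x)), with P(x) taken from observed counts."""
    total = len(values)
    counts = Counter(values)
    return 0.0 - sum((c / total) * math.log2(c / total) for c in counts.values())

# Two equally likely values give exactly one bit of uncertainty.
print(shannon_entropy(["H", "T", "H", "T"]))  # 1.0
```

A column that always holds the same value has an entropy of 0, and a column spread evenly over four values has an entropy of 2 bits.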

3. Bayesian with K2 Prior

Scores features with a Bayesian network, which models the probabilistic relationships between variables

Scales to many variables, but requires the input variables to be ordered

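4. Bayesian Dirichlet Equivalent with Uniform Prior (BDEU)

Also scores features with Bayesian analysis, using a Dirichlet distribution for the prior

Assumes a uniform prior, so all prior states are weighted equally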

Feature Selection Parameters

These control how many features are selected:

MAXIMUM_INPUT_ATTRIBUTES

→ Limits number of input features

MAXIMUM_OUTPUT_ATTRIBUTES

→ Limits number of output features

MAXIMUM_STATES

→ Limits number of values in a feature

If any of these parameters is set to 0, feature selection is turned off.
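
To make the effect of these limits concrete, here is a rough Python analogy. It is not SQL Server's implementation: the helper names `limit_inputs` and `limit_states`, the example scores, and the choice to mark dropped states as `None` (missing) are all assumptions made for illustration.

```python
from collections import Counter

def limit_inputs(scores, maximum_input_attributes):
    """Keep only the highest-scoring features; 0 means no limit."""
    if maximum_input_attributes == 0:
        return list(scores)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:maximum_input_attributes]

def limit_states(values, maximum_states):
    """Keep the most frequent states and mark the rest as missing; 0 means no limit."""
    if maximum_states == 0:
        return list(values)
    common = {state for state, _ in Counter(values).most_common(maximum_states)}
    return [v if v in common else None for v in values]

# Hypothetical feature scores and a hypothetical categorical column.
scores = {"age": 0.9, "zip": 0.1, "income": 0.7}
colors = ["red", "red", "blue", "blue", "blue", "green"]

print(limit_inputs(scores, 2))   # ['age', 'income']
print(limit_states(colors, 2))   # ['red', 'red', 'blue', 'blue', 'blue', None]
```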

Together, they improve:

Accuracy
Efficiency
Model performance
