Data Selection in Data Mining

Balaji. K

Data selection is the process of choosing the right type of data, data source, and tools to collect data before starting a project.
 
It is an important step because the quality of your results depends on the data you choose.

Main Goal of Data Selection 

The goal is to select data that can properly answer the research question. This depends on:
  •  The purpose of the study
  •  Previous research (literature)
  •  Availability of data

Choosing data only because it is cheap or convenient can reduce the accuracy and reliability of the results.

Common Issues in Data Selection

While selecting data, researchers should consider:
  •  Choosing the correct type and source of data
  •  Ensuring the data can answer the research question
  •  Collecting a representative sample
  •  Using proper tools/instruments for data collection
  •  Making sure the data and tools are compatible
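
One way to picture "collecting a representative sample" is stratified sampling, where the same fraction is drawn from every group so the sample keeps the proportions of the full dataset. The sketch below is only an illustration: the function name `stratified_sample`, the `region` field, and the 10% fraction are assumptions made for this example, not something the article prescribes.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for members in groups.values():
        # Take at least one record per group so no group disappears.
        count = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, count))
    return sample


# Hypothetical population: 80 records from the north, 20 from the south.
population = [
    {"id": i, "region": "north" if i < 80 else "south"} for i in range(100)
]
sample = stratified_sample(population, key="region", fraction=0.1)
```

Here the sample keeps the 80/20 split of the population, which a purely convenient selection (for example, the first ten records) would not.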

Types of Data

There are two main types of data:
1. Quantitative Data
  • Numerical data
  • Example: temperature, height, voltage
  • Used for calculations and analysis
2. Qualitative Data
  •  Non-numerical data
  •  Example: text, images, videos, observations
  •  Used for understanding behavior or patterns

Many projects use both types to get better insights.

Sources of Data

Data can come from:
  •  Human observations
  •  Experiments
  •  Field notes or journals
  •  Lab results
  • Direct observation of people, animals, or plants

Questions to Ask Before Selecting Data

  •  What is the research question?
  •  What is the scope of the study?
  •  What type of data is needed (quantitative, qualitative, or both)?
  •  What do previous studies suggest?
  •  Is the required data available?

Feature Selection in Data Mining

Feature selection means choosing only the most important variables (features) from a dataset and removing unnecessary ones.

It helps improve model performance and reduces complexity.

Simple Example

  •  A doctor deciding whether surgery is needed looks only at the relevant medical test results (the features), not at every detail in the patient's record.

Types of Feature Selection

1. Supervised Feature Selection
  •  Uses labeled data
  •  Goal: keep the features that improve prediction accuracy
2. Unsupervised Feature Selection
  •  Uses no labeled data
  •  Goal: keep the features that produce better grouping (clustering)
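
When no labels are available, features have to be judged from their own values. A minimal sketch of one simple label-free criterion is shown below: a feature that never varies cannot help separate records into groups, so it is dropped. Using variance as the score, the function name `variance_scores`, and the sample columns are all assumptions chosen for illustration.

```python
from statistics import pvariance


def variance_scores(rows, feature_names):
    """Score each numeric feature by its population variance (no labels needed)."""
    return {
        name: pvariance([row[name] for row in rows]) for name in feature_names
    }


# Hypothetical records: height varies, sensor_flag is always 1.
rows = [
    {"height_cm": 150, "sensor_flag": 1},
    {"height_cm": 165, "sensor_flag": 1},
    {"height_cm": 180, "sensor_flag": 1},
]
scores = variance_scores(rows, ["height_cm", "sensor_flag"])

# Keep only the features that actually vary across records.
kept = [name for name, score in scores.items() if score > 0]
```

A supervised method would instead score each feature against the known labels.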

Why Feature Selection is Important

Feature selection helps to:
  •  Reduce extra or useless data
  •  Improve model accuracy
  •  Make models easier to understand
  •  Save time, memory, and computing power

Problems Without Feature Selection

  •  Too much data → slows down processing
  •  Noisy or duplicate data → reduces accuracy
  •  High-dimensional data → needs more training data

How Feature Selection Works

  •  Each feature is given a score
  •  Important features are selected
  •  Less useful features are removed

This can be done:
  •  Manually by analysts
  •  Automatically by algorithms
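
The score-then-select steps above can be sketched in a few lines. This example assumes information gain (the drop in label entropy after splitting on a feature) as the score and keeps the top `k` features. The names `information_gain` and `select_top_k` and the toy medical table are hypothetical, and real tools may use a different score.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(values).values()
    )


def information_gain(feature_values, labels):
    """How much knowing the feature reduces uncertainty about the label."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


def select_top_k(table, labels, k):
    """Score every feature, then keep the k highest-scoring ones."""
    scores = {name: information_gain(col, labels) for name, col in table.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical data: fever tracks the label, shoe size does not.
table = {
    "fever": ["yes", "yes", "no", "no"],
    "shoe_size": ["big", "small", "big", "small"],
}
labels = ["sick", "sick", "healthy", "healthy"]
selected, scores = select_top_k(table, labels, k=1)
```

In this toy table `fever` separates the labels perfectly and is selected, while `shoe_size` tells nothing about the label and is removed.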

Feature Selection in SQL Server

In SQL Server Data Mining:
  •  Feature selection happens before training the model
  •  The system calculates a score for each feature
  •  Only high-scoring features are used

Factors affecting selection:
  •  Algorithm used
  •  Type of data
  •  Model settings

Feature Selection Scoring Methods

1. Interestingness Score
  •  Measures how much useful information a feature carries
  •  Based on entropy (information content)
  •  A feature whose values are close to random has high entropy and is scored as less interesting

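2. Shannon’s Entropy
  •  Measures the uncertainty in a variable's values
  •  Formula: H(X) = -Σ P(x) log(P(x))
  •  The sum runs over every value x that X can take, and P(x) is the probability of that value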
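
The entropy formula can be evaluated directly from a list of probabilities. The sketch below assumes a base-2 logarithm, which gives the result in bits. The article's formula does not fix the base, so the `base` parameter and the function name `shannon_entropy` are choices made for this example.

```python
import math


def shannon_entropy(probabilities, base=2):
    """H(X) = -sum of P(x) * log(P(x)) over all outcomes with P(x) > 0."""
    # Outcomes with zero probability contribute nothing and are skipped,
    # which also avoids taking log(0).
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


fair_coin = shannon_entropy([0.5, 0.5])   # two equally likely outcomes
certain = shannon_entropy([1.0])          # one outcome that always happens
four_way = shannon_entropy([0.25] * 4)    # four equally likely outcomes
```

A fair coin carries about 1 bit of uncertainty, a certain outcome carries none, and four equally likely outcomes carry about 2 bits: the more evenly spread the values, the higher the entropy.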

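3. Bayesian with K2 Prior
  •  Scores features using a Bayesian network, which models the probabilistic relationships between variables
  •  Works well with structured data

4. Bayesian Dirichlet Equivalent (BDE) with Uniform Prior
  •  Also scores features using probability distributions
  •  Assumes a uniform prior, so every state is treated as equally likely before the data is seen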

Feature Selection Parameters

These control how many features are selected:

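MAXIMUM_INPUT_ATTRIBUTES
→ Limits the number of input features; when a model has more, the lowest-scoring ones are ignored

MAXIMUM_OUTPUT_ATTRIBUTES
→ Limits the number of output features in the same way

MAXIMUM_STATES
→ Limits the number of distinct values (states) a feature can have

If any of these parameters is set to 0, feature selection is turned off.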
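
To make the effect of these limits concrete, here is a small Python sketch that imitates two of them. It is not SQL Server code and not how the product is implemented. The helper names `cap_inputs` and `cap_states` are hypothetical, and the sketch assumes that states beyond the limit are the least frequent ones and are treated as missing.

```python
from collections import Counter


def cap_inputs(scores, maximum_input_attributes):
    """Keep only the highest-scoring features, up to the limit."""
    if maximum_input_attributes == 0:  # 0 turns the limit off in this sketch
        return list(scores)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:maximum_input_attributes]


def cap_states(values, maximum_states):
    """Keep the most frequent states; replace the rest with None (missing)."""
    if maximum_states == 0:  # 0 turns the limit off in this sketch
        return list(values)
    keep = {state for state, _ in Counter(values).most_common(maximum_states)}
    return [value if value in keep else None for value in values]


# Hypothetical feature scores and a hypothetical categorical column.
feature_scores = {"age": 0.9, "zip_code": 0.1, "income": 0.6}
top_features = cap_inputs(feature_scores, 2)
all_features = cap_inputs(feature_scores, 0)

colors = ["red", "red", "red", "blue", "blue", "green"]
capped_colors = cap_states(colors, 2)
```

With a limit of 2, only `age` and `income` remain as inputs, and the rare state `green` is replaced by a missing value. With a limit of 0, nothing is filtered.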

Conclusion

Data selection ensures that you use the right data for your problem.
Feature selection ensures that you use only the most important features of that data.

Together, they improve:
  •  Accuracy
  •  Efficiency
  •  Model performance