What is an Outlier in Data Mining?
In data analysis, we often come across unusual data values called outliers.
An outlier is a data
point that is very different from the rest of the data in a dataset. It lies
far away from the normal
pattern or expected range of values.
Outliers can occur due to measurement errors, data entry mistakes, or
natural variations in the
data. During data analysis, it is important to identify these values because
they may affect the
accuracy of the results. In some cases, outliers are removed, while in other
cases they are
carefully analyzed because they may provide useful insights.
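A widely cited definition comes from Frank E. Grubbs (1969): an outlying
observation, or outlier, is one that appears to deviate markedly from other
members of the sample in which it occurs.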
Difference Between Outliers and Noise
- Noise refers to random errors or unwanted variations in measured data. It usually occurs due to problems in measurement, data collection, or transmission.
- Outliers are extreme data points that significantly differ from the rest of the dataset.
Before detecting outliers, it is usually recommended to remove noise from
the dataset, because
noise can make outlier detection more difficult.
Types of Outliers
Outliers in data mining are generally classified into three types:
- Global (Point) Outliers
- Collective Outliers
- Contextual (Conditional) Outliers
1. Global Outliers (Point Outliers)
Global outliers are the simplest type of outliers.
A global outlier occurs when a single data point is very different
from all other data points in the
dataset.
Most outlier detection methods in data mining focus on identifying
this type of outlier.
Example:
If most students score between 60 and 80 on an exam, but one
student scores 10, that
value can be considered a global outlier.
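The exam-score example can be checked with a simple statistical rule. The sketch below uses the interquartile range (IQR) rule, one common choice among several and not the only way to find global outliers. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample scores are illustrative assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical exam scores: most fall between 60 and 80, one student scores 10.
scores = [62, 65, 68, 70, 71, 73, 75, 78, 80, 10]
print(iqr_outliers(scores))  # [10]
```

The multiplier 1.5 is a convention rather than a law; a larger value flags fewer points.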
2. Collective Outliers
A collective outlier occurs when a group of data points together behaves
abnormally compared
to the rest of the dataset.
In this case, individual data points may appear normal, but when
considered as a group, they
show unusual behavior.
Example:
In a network intrusion detection system, a few data packets sent
from one computer may be
normal. However, if many computers send a large number of packets at
the same time, it may
indicate a Distributed Denial-of-Service (DDoS) attack. No single packet
is unusual, but the group of packets taken together is a collective
outlier.
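A minimal sketch of the packet example, assuming each packet is recorded as a (timestamp, source) pair. It groups packets into fixed time windows and flags windows whose combined count is too high, even though each source sends only a few packets. The function name, the 10-second window, and the fixed threshold `max_total` are hypothetical; a real intrusion detection system would use far richer features.

```python
from collections import Counter

def collective_outlier_windows(events, window=10, max_total=50):
    """Return start times of windows whose total packet count exceeds max_total.

    events: iterable of (timestamp_seconds, source) pairs, one per packet.
    """
    totals = Counter()
    for ts, _source in events:
        start = int(ts // window) * window  # start of the window containing ts
        totals[start] += 1
    return sorted(start for start, count in totals.items() if count > max_total)

# Hypothetical traffic: 3 hosts send 5 packets each in the first window,
# then 20 hosts send 5 packets each in the second window.
normal = [(t, f"host{h}") for h in range(3) for t in range(5)]
burst = [(10 + t, f"host{h}") for h in range(20) for t in range(5)]
print(collective_outlier_windows(normal + burst))  # [10]
```

Every host sends the same five packets in both windows, so no single host looks abnormal; only the aggregate over the window does.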
3. Contextual Outliers (Conditional Outliers)
A contextual outlier occurs when a data point is considered unusual
only within a specific
context or condition.
These outliers depend on two types of attributes:
- Contextual attributes – define the context (e.g., time, location)
- Behavioral attributes – define the behavior of the data
Example:
A temperature of 45°C may be normal during summer, but it would be
unusual during the rainy
season or winter. Therefore, the same value can be normal or an
outlier depending on the
context.
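The temperature example can be sketched by scoring each value only against other values that share its context. Here the season is the contextual attribute and the temperature is the behavioral attribute. The function name, the z-score threshold of 2.0, and the readings are illustrative assumptions; z-scores are also unreliable on very small groups, so this is a sketch rather than a recommended method.

```python
import statistics
from collections import defaultdict

def contextual_outliers(records, threshold=2.0):
    """Flag (context, value) pairs whose value is far from its own context's mean."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        group = by_context[context]
        if len(group) < 2:
            continue  # a sample standard deviation needs at least two values
        mean = statistics.mean(group)
        stdev = statistics.stdev(group)
        if stdev > 0 and abs(value - mean) / stdev > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical readings in degrees Celsius; 45 appears in both seasons.
readings = (
    [("summer", t) for t in [42, 44, 45, 43, 46, 44]]
    + [("winter", t) for t in [18, 20, 17, 19, 21, 18, 20, 45]]
)
print(contextual_outliers(readings))  # [('winter', 45)]
```

The same value, 45, is accepted in the summer group and flagged in the winter group, which is exactly what distinguishes a contextual outlier from a global one.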
Outlier Analysis
The process of identifying and studying unusual data points in a
dataset is called Outlier
Analysis or Outlier Mining. It is an important task in data mining
because rare events often
provide valuable information.
Although outliers are sometimes removed from datasets, they are very
useful in many real-world
applications.
Applications of Outlier Detection
Outlier detection is widely used in several fields, such as:
- Fraud detection in banking, credit cards, and insurance
- Telecommunication fraud detection
- Medical diagnosis and treatment analysis
- Market analysis to understand unusual customer behavior
- Network intrusion detection systems
- Financial data monitoring
For example, in medical analysis, unusual patient responses to a
treatment can be identified
through outlier analysis.