Cosine Similarity in Data Mining
In data mining, many tasks, such as document clustering, recommendation, and
search, need a way to measure how similar two data items are. One of the most
commonly used measures for this is cosine similarity.
What is Cosine Similarity?
Cosine similarity measures how similar two vectors (data points) are by
calculating the cosine of the angle between them. It depends only on the
direction of the vectors, not on their length.
If the angle is small → vectors are similar
If the angle is large → vectors are different
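The cosine value ranges from -1 to 1:
1 → Same direction (maximum similarity)
0 → No similarity (perpendicular vectors)
-1 → Exactly opposite directions
For vectors with only non-negative values, such as word counts or TF-IDF
weights, the value stays between 0 and 1.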
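As a quick check of these three reference values, here is a minimal Python sketch. The helper name `cosine_similarity` and the sample vectors are illustrative assumptions, not taken from any particular library.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Illustrative vectors chosen to hit the three reference cases.
same_direction = cosine_similarity([1, 2], [2, 4])   # close to 1
perpendicular = cosine_similarity([1, 0], [0, 3])    # 0
opposite = cosine_similarity([1, 2], [-1, -2])       # close to -1

print(same_direction, perpendicular, opposite)
```

Because of floating-point rounding, the first and last results may differ from 1 and -1 in the last decimal places.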
Formula of Cosine Similarity
cos(θ) = (x ⋅ y) / (∥x∥ ∥y∥)
Where:
x ⋅ y = dot product of the vectors x and y
∥x∥, ∥y∥ = magnitudes (lengths) of the vectors
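As a worked example of the formula, take two illustrative vectors x = (3, 4) and y = (4, 3). These values are chosen here only because the arithmetic is easy to follow.

```latex
\begin{aligned}
x \cdot y &= 3 \cdot 4 + 4 \cdot 3 = 24 \\
\|x\| &= \sqrt{3^2 + 4^2} = 5, \qquad \|y\| = \sqrt{4^2 + 3^2} = 5 \\
\cos(\theta) &= \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{24}{5 \cdot 5} = 0.96
\end{aligned}
```

A value of 0.96 corresponds to an angle of about 16°, so the two vectors point in nearly the same direction.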
Applications of Cosine Similarity
1. Document Similarity
Used in text mining and NLP to compare documents.
Example: Comparing the TF-IDF or word-embedding vectors of two documents.
2. Recommender Systems
Helps suggest products or content by comparing vectors of user preferences
or item features.
Example: Recommendations of the kind shown on Netflix or Amazon.
3. Information Retrieval
Used in search engines to find documents that best match a query.
4. Image Similarity
Used in image recognition and retrieval to find similar images by comparing
their feature vectors.
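The document similarity and information retrieval cases can be sketched together: weight the terms with TF-IDF, then rank documents by their cosine similarity to a query. This is a simplified illustration using only the Python standard library. The function names, the sample texts, the whitespace tokenization, and the smoothed IDF weighting are all assumptions made for this sketch, not a description of how any real search engine works.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one sparse TF-IDF dict per text (whitespace tokens, smoothed IDF)."""
    tokenized = [text.lower().split() for text in texts]
    n_texts = len(tokenized)
    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_texts) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical mini-corpus and query.
documents = [
    "cosine similarity compares documents as vectors",
    "recommender systems suggest movies to users",
    "search engines rank documents for a query",
]
query = "rank documents for a search query"

# Simplification: the query is weighted together with the documents.
*doc_vectors, query_vector = tfidf_vectors(documents + [query])
scores = [cosine_similarity(query_vector, vec) for vec in doc_vectors]
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)

for i in ranking:
    print(f"{scores[i]:.3f}  {documents[i]}")
```

The third document shares the most terms with the query and is ranked first. The second shares none, so its score is 0.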
Steps to Compute Cosine Similarity
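1. Convert Data into Vectors
Example: Bag-of-words, TF-IDF, Word2Vec, etc. Both vectors must have the
same number of dimensions.
2. Normalize the Vectors (Optional)
Divide each vector by its magnitude so that it has unit length. For unit
vectors, the dot product alone is the cosine similarity.
3. Calculate Dot Product
Multiply corresponding values and sum them.
4. Apply Formula
Divide the dot product by the product of the magnitudes. If the vectors were
already normalized, this product is 1 and the division can be skipped.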
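The steps above can be sketched in a few lines of Python. The two sample texts and the helper names `dot` and `magnitude` are illustrative assumptions. The sketch computes the result twice, once from the raw vectors and once from the normalized vectors, to show that both routes give the same value.

```python
import math
from collections import Counter

text_a = "data mining finds patterns in data"
text_b = "text mining finds patterns in documents"

# Convert data into vectors: word counts over a shared vocabulary.
counts_a = Counter(text_a.split())
counts_b = Counter(text_b.split())
vocabulary = sorted(set(counts_a) | set(counts_b))
x = [counts_a[term] for term in vocabulary]
y = [counts_b[term] for term in vocabulary]

def dot(u, v):
    """Multiply corresponding values and sum them."""
    return sum(a * b for a, b in zip(u, v))

def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))

# Route 1: apply the formula directly to the raw count vectors.
similarity = dot(x, y) / (magnitude(x) * magnitude(y))

# Route 2: normalize to unit length first, then take only the dot product.
x_unit = [a / magnitude(x) for a in x]
y_unit = [b / magnitude(y) for b in y]
similarity_from_units = dot(x_unit, y_unit)

print(round(similarity, 4), round(similarity_from_units, 4))
```

Normalizing first is mainly useful when the same vectors are compared many times, since each later comparison then needs only a dot product.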
Advantages of Cosine Similarity
- Scale Independent
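- Depends only on direction, so it works even if the vectors have very
  different magnitudes (for example, a short and a long document with the
  same word proportions).
- Works Well for High-Dimensional Data
- Commonly used with high-dimensional text representations such as
  bag-of-words and TF-IDF.
- Easy to Understand
- The value is bounded between -1 and 1, so similarity levels are easy to
  compare.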
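Scale independence can be checked directly. In this sketch, the term-count vectors are hypothetical. The second document has the same word proportions as the first but is ten times longer. Euclidean distance is included only as a contrast.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

short_doc = [2, 1, 0]     # hypothetical term counts
long_doc = [20, 10, 0]    # same proportions, ten times longer

cosine = cosine_similarity(short_doc, long_doc)
euclidean = math.dist(short_doc, long_doc)  # requires Python 3.8+

print(f"cosine similarity:  {cosine:.3f}")
print(f"Euclidean distance: {euclidean:.3f}")
```

The cosine similarity stays at about 1 because the direction is unchanged, while the Euclidean distance grows with the difference in length.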
Limitations of Cosine Similarity
- Ignores Meaning (Context)
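- Compares only the numeric vector values. Meaning is captured only to the
  extent that the vector representation encodes it.
- Sensitive to Data Representation
- Results depend on how the vectors are created (for example, raw counts,
  TF-IDF weights, or embeddings).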
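- Can Be Less Informative for Very Sparse Data
- When most values are zero, two vectors may share few or no non-zero
  dimensions, so the similarity is at or near 0 even for related items. The
  formula is also undefined for an all-zero vector.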
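These limitations are easy to reproduce with a plain bag-of-words representation. The sketch below uses hypothetical sentences: two with related meaning but no shared words, and two with different meaning but several shared words. Returning 0.0 for an empty vector is a convention assumed here, since the formula itself would divide by zero.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: the formula is undefined here
    return dot / (norm_a * norm_b)

def bag_of_words(text):
    return Counter(text.lower().split())

fast_car = bag_of_words("the car is fast")
quick_auto = bag_of_words("an automobile moves quickly")
red_car = bag_of_words("the car is red")

related_meaning = cosine_similarity(fast_car, quick_auto)  # no shared words
shared_words = cosine_similarity(fast_car, red_car)        # three shared words
empty_case = cosine_similarity(Counter(), fast_car)        # all-zero vector

print(related_meaning, shared_words, empty_case)
```

The two sentences with related meaning score 0 because they share no words, while the pair that differs in meaning scores much higher. A representation that encodes meaning, such as word embeddings, would be expected to narrow this gap.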