Cosine Similarity in Data Mining

Vinithra

In data mining, many tasks like document clustering, recommendation systems, and search engines need a way to measure how similar two data items are. One of the most commonly used methods for this is cosine similarity. 

What is Cosine Similarity? 

Cosine similarity measures how similar two vectors (data points) are by computing the cosine of the angle between them, so it compares their direction rather than their magnitude.

If the angle is small → vectors are similar
If the angle is large → vectors are different 

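The cosine value ranges from -1 to 1:

1 → Same direction (maximum similarity, even if the lengths differ)
0 → No similarity (perpendicular vectors)
-1 → Opposite directions

When all vector components are non-negative, as with word counts or TF-IDF weights, the value stays between 0 and 1.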

Formula of Cosine Similarity
cos(θ) = (x · y) / (‖x‖ ‖y‖)

Where:
x · y = dot product of the vectors x and y
‖x‖, ‖y‖ = magnitudes (lengths) of the vectors
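
As a minimal sketch, the formula can be written directly in plain Python. The function name `cosine_similarity` and the sample vectors are illustrative assumptions, not part of any particular library.

```python
import math

def cosine_similarity(x, y):
    """Return cos(theta) = (x . y) / (||x|| * ||y||) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Illustrative vectors covering the three reference values.
same_direction = cosine_similarity([1, 2], [2, 4])    # approximately 1
perpendicular = cosine_similarity([1, 0], [0, 1])     # 0
opposite = cosine_similarity([1, 2], [-1, -2])        # approximately -1
```

Because of floating-point rounding, results such as 1 and -1 are best compared with `math.isclose` rather than `==`.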

Applications of Cosine Similarity

1. Document Similarity

Used in text mining and NLP to compare documents.
Example: Checking similarity using TF-IDF or word vectors.
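
The sketch below illustrates the idea with raw word counts instead of TF-IDF, to stay within the standard library. The helper name `text_cosine`, the whitespace tokenization, and the sample sentences are assumptions made for illustration.

```python
import math
from collections import Counter

def text_cosine(doc_a, doc_b):
    """Cosine similarity of two texts represented as word-count vectors."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention here: an empty text matches nothing
    return dot / (norm_a * norm_b)

score_related = text_cosine("data mining finds patterns in data",
                            "data mining finds useful patterns")
score_unrelated = text_cosine("data mining finds patterns in data",
                              "the cat sat on the mat")
```

The first pair shares several words and scores well above zero, while the second pair shares none and scores 0.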

2. Recommender Systems

Helps suggest products or content based on user preferences.
Example: Product and content recommendations of the kind seen on Netflix or Amazon.
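
A rough sketch of item-to-item recommendation follows. The ratings table, the movie names, and the helper `most_similar_item` are hypothetical and only illustrate the idea. Real recommender systems are considerably more involved.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention: an unrated item matches nothing
    return dot / (norm_x * norm_y)

# Hypothetical data: each item is a vector of ratings given by the same
# four users, where 0 means "not rated".
item_ratings = {
    "Movie A": [5, 4, 0, 1],
    "Movie B": [4, 5, 0, 2],
    "Movie C": [0, 1, 5, 4],
}

def most_similar_item(target, ratings):
    """Return the item whose rating vector is closest in direction to the target's."""
    scores = {
        name: cosine(ratings[target], vector)
        for name, vector in ratings.items()
        if name != target
    }
    return max(scores, key=scores.get)

recommended = most_similar_item("Movie A", item_ratings)
```

With this data, users who liked "Movie A" rated "Movie B" in a similar pattern, so "Movie B" is the suggested item.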

3. Information Retrieval 

Used in search engines to find documents that best match a query.
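
As a simplified illustration, a query can be turned into a vector over the same vocabulary as the documents, and the documents sorted by their cosine score. The vocabulary `VOCAB`, the sample documents, and the helper names below are assumptions for this sketch.

```python
import math

# Hypothetical fixed vocabulary; each text becomes a vector of term counts.
VOCAB = ["cosine", "similarity", "vector", "cooking", "recipe"]

def to_vector(text):
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention: no vocabulary terms means no match
    return dot / (norm_x * norm_y)

documents = {
    "doc1": "cosine similarity compares vector direction",
    "doc2": "a recipe for cooking pasta",
    "doc3": "vector length and vector direction",
}

def rank(query, docs):
    """Return document names ordered from best to worst match for the query."""
    query_vector = to_vector(query)
    scored = [(cosine(query_vector, to_vector(text)), name)
              for name, text in docs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]

ranking = rank("cosine similarity vector", documents)
```

Here "doc1" contains all three query terms, "doc3" contains one of them, and "doc2" contains none, so they are ranked in that order.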

4. Image Similarity 

Used in image recognition and retrieval to find similar images. 

Steps to Compute Cosine Similarity

Convert Data into Vectors 
Example: Bag-of-words, TF-IDF, Word2Vec, etc.

Normalize the Vectors (Optional)
Scale each vector to unit length to remove size differences. The dot product of two unit-length vectors is then already the cosine similarity.

Calculate Dot Product 
Multiply corresponding values and sum them.

Apply Formula 
Divide the dot product by the product of the two magnitudes. For vectors already scaled to unit length, this product is 1.
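
The steps above can be sketched as follows, using two small made-up vectors. The sketch computes the score in two ways: normalizing first and taking the dot product, or applying the formula to the raw vectors. Both routes give the same value, which is why normalizing and dividing by the magnitudes are alternatives rather than steps that must both be applied.

```python
import math

def dot(x, y):
    """Multiply corresponding values and sum them."""
    return sum(a * b for a, b in zip(x, y))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [a / length for a in v]

# Step 1: data already converted to vectors (illustrative values).
x = [3, 0, 4]   # length 5
y = [6, 8, 0]   # length 10

# Route A: normalize first, then take the dot product.
via_unit_vectors = dot(normalize(x), normalize(y))

# Route B: take the dot product, then divide by the product of magnitudes.
via_formula = dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
```

Both routes give 18 / (5 × 10) = 0.36, up to floating-point rounding.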

Advantages of Cosine Similarity

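  • Scale Independent: the score depends only on direction, so it works even when the vectors have very different magnitudes (for example, a short and a long document).
  • Works Well for High-Dimensional Data: useful in text and big data applications.
  • Easy to Understand: the value directly indicates the level of similarity.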
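
To illustrate scale independence, the sketch below compares a vector with a copy of itself scaled by ten, and contrasts the cosine score with Euclidean distance. The vectors and variable names are made up for this example.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

short_doc = [1, 2, 0]      # e.g. term counts in a short text
long_doc = [10, 20, 0]     # same proportions, ten times as many words

cos_score = cosine(short_doc, long_doc)    # approximately 1: same direction
distance = euclidean(short_doc, long_doc)  # large: the magnitudes differ
```

Cosine similarity treats the two vectors as essentially identical, while Euclidean distance reports them as far apart.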

Limitations of Cosine Similarity

  • Ignores Meaning (Context): only considers the numbers, not the actual meaning of the data.
  • Sensitive to Data Representation: results depend on how the vectors are created.
  • Less Informative for Very Sparse Data: when most values are zero, two vectors may share few or no non-zero dimensions, so even related items can score at or near 0.
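
The sparse-data issue can be seen in a small sketch where each user is stored as a dictionary of only the items they rated. The user data, item names, and the helper `sparse_cosine` are hypothetical.

```python
import math

def sparse_cosine(x, y):
    """Cosine similarity for sparse vectors stored as {dimension: non-zero value}."""
    dot = sum(value * y[key] for key, value in x.items() if key in y)
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_x * norm_y)

# Hypothetical users who each rated only two items out of a large catalogue.
user_a = {"item_3": 5, "item_17": 4}
user_b = {"item_42": 5, "item_99": 3}
user_c = {"item_3": 4, "item_99": 1}

no_overlap = sparse_cosine(user_a, user_b)   # 0.0: no item in common
one_overlap = sparse_cosine(user_a, user_c)  # driven entirely by item_3
```

Users A and B score 0 simply because they rated different items, which says little about their actual tastes. The score for A and C rests on a single shared item.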