Zipf Distribution in Python
The Zipf distribution models data that follow Zipf’s Law: in many datasets, the frequency of an item is inversely proportional to its rank in the frequency table. In simple terms, the nth most common item appears roughly 1/n as often as the most common item.
For example, in English, the 2nd most common word is used about half as often as the most common word, the 3rd about a third as often, and so on.
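The 1/n rule can be turned into a few lines of arithmetic. The sketch below is only an illustration: the count of 1000 for the top-ranked item and the helper name expected_count are made up for this example, not taken from any real word-frequency table.

```python
# Idealized Zipf's Law: the item at rank n appears about 1/n as often
# as the item at rank 1.
TOP_COUNT = 1000  # hypothetical count for the most common item


def expected_count(rank, top_count=TOP_COUNT):
    """Expected count of the item at `rank` under an ideal 1/n law."""
    return top_count / rank


for rank in range(1, 6):
    print(f"rank {rank}: about {expected_count(rank):.0f} occurrences")
```

Real datasets only follow this pattern approximately, so the numbers should be read as a rough guide rather than a prediction.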
Key Parameters:
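- a → Distribution parameter; must be greater than 1. The probability of drawing the value k is proportional to k^(-a), so higher values of a concentrate the samples more strongly on small numbers.
- size → The shape of the output array, for example 1000 for 1000 samples or (2, 3) for a 2x3 array.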
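To see how a changes the shape of the distribution, the probabilities can be computed directly. NumPy documents the probability of the value k as k^(-a) / zeta(a), where zeta is the Riemann zeta function. The sketch below approximates zeta(a) with a truncated sum so that it needs only the standard library; the function name zipf_pmf and the number of terms are assumptions made for this example.

```python
import math


def zipf_pmf(k, a, terms=100_000):
    """Approximate P(X = k) = k**(-a) / zeta(a) for a > 1.

    zeta(a) is approximated by summing the first `terms` terms of its series.
    """
    if a <= 1:
        raise ValueError("a must be greater than 1")
    zeta_a = sum(n ** -a for n in range(1, terms + 1))
    return k ** -a / zeta_a


# For a = 2 the exact value of P(X = 1) is 6 / pi**2, which gives a sanity check.
print(zipf_pmf(1, a=2), 6 / math.pi ** 2)

# A larger a puts even more probability on the value 1.
print(zipf_pmf(1, a=4))
```

The truncated sum is accurate enough for illustration; scipy.special.zeta would be the more precise choice if SciPy is available.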
Sampling Data from Zipf Distribution:
Here’s a simple example of drawing random numbers from a Zipf distribution using NumPy:
Program:
from numpy import random
# Draw samples from Zipf distribution
x = random.zipf(a=2, size=(2, 3))
print(x)
This generates a 2x3 array of random integers drawn from a Zipf distribution with a=2. Because the samples are random, your output will differ from run to run.
Output:
[[1 2 4]
 [1 1 1]]
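If you need the same numbers on every run, one option is to draw from a seeded generator. This is a minimal sketch using NumPy's Generator API (np.random.default_rng) instead of the legacy random.zipf call shown above; the seed value 42 is an arbitrary choice.

```python
import numpy as np

# Seed value 42 is arbitrary; any fixed integer makes the draw repeatable.
rng = np.random.default_rng(42)
x = rng.zipf(a=2, size=(2, 3))
print(x)

# A second generator created with the same seed reproduces the same array.
rng_again = np.random.default_rng(42)
x_again = rng_again.zipf(a=2, size=(2, 3))
print(np.array_equal(x, x_again))  # True
```

The exact values depend on the seed and the NumPy version, so they are not listed here.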
Visualizing Zipf Distribution:
Let’s generate 1000 samples and plot only the values below 10. The distribution has a heavy tail, so a few samples can be very large; including them would stretch the x-axis and squash the informative part of the chart.
Program:
from numpy import random
import matplotlib.pyplot as plt
import seaborn as sns
# Generate Zipf-distributed samples
x = random.zipf(a=2, size=1000)
# Plot only values less than 10 for better visualization;
# discrete=True gives each integer value its own bar
sns.displot(x[x < 10], discrete=True)
plt.show()
This creates a histogram showing how often each value appears. The bar at 1 is by far the tallest and the counts fall off quickly, which illustrates how the Zipf distribution favors small numbers.
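The shape of the histogram can also be checked numerically by comparing observed frequencies with the theoretical probabilities k^(-a) / zeta(a). The sketch below is one way to do that under a few assumptions: a seeded generator, a larger sample of 100,000 draws to reduce noise, and a truncated sum standing in for zeta(a).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
a = 2.0
samples = rng.zipf(a, size=100_000)

# Count only values below 10. Rare huge samples would otherwise force
# bincount to allocate an enormous array.
counts = np.bincount(samples[samples < 10], minlength=10)
empirical = counts[1:] / samples.size  # observed frequencies of 1..9

# Theoretical probabilities, with zeta(a) approximated by a truncated sum.
ks = np.arange(1, 10, dtype=float)
zeta_a = np.sum(np.arange(1, 1_000_001, dtype=float) ** -a)
theoretical = ks ** -a / zeta_a

for k, emp, theo in zip(ks, empirical, theoretical):
    print(f"k={int(k)}  observed={emp:.4f}  expected={theo:.4f}")
```

With a=2 roughly 61% of the samples should equal 1, and the observed column should track the expected column closely.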