What is Data Distribution?
- Data distribution refers to the way data values are spread out across a dataset. It shows all the possible values in the dataset and how frequently each value appears.
- Understanding data distribution is crucial when working with statistics and data science, as it helps analyze patterns and make predictions.
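As a rough illustration of "which values occur and how often", the sketch below counts the values in a small array. The sample data is made up for this example, and np.unique is just one convenient way to tabulate frequencies.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
data = np.array([3, 5, 5, 7, 7, 7, 5, 7, 7, 3])

# np.unique with return_counts=True gives each distinct value and its frequency
values, counts = np.unique(data, return_counts=True)

# Relative frequency: the share of the dataset taken by each value
relative = counts / data.size

for v, c, r in zip(values, counts, relative):
    print(f"value={v} count={c} share={r:.1f}")
```

The relative frequencies always add up to 1, which is the same constraint that applies to probabilities.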
Random Data Distribution
- A random distribution is a collection of random numbers that follow a particular probability density function (PDF).
- Probability Density Function (PDF)
- A PDF describes the relative likelihood of the values a continuous random variable can take. The probability that the variable falls within a given range is the area under the PDF over that range; the probability of any single exact value is zero.
- Generating Random Distributions with Python
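- Python’s NumPy library provides methods to generate random data distributions. One such method is random.choice(), which draws values from a fixed set of candidates. Because the candidates are discrete, each one is assigned a plain probability rather than a density. This method allows you to: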
- Specify a list of possible values.
- Define the probability for each value.
- Probability Settings:
- Each probability value must be between 0 and 1.
- The sum of all probability values must equal 1.
- A probability of 0 means the value will never appear, and a probability of 1 means it will always appear.
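The probability rules above can be checked before sampling. The sketch below is only an illustration: check_probabilities is a hypothetical helper written for this example, not part of NumPy. NumPy also validates the p argument itself and raises a ValueError when the probabilities are invalid.

```python
import numpy as np

def check_probabilities(values, p, tol=1e-8):
    """Hypothetical helper: True if p is a valid probability list for values."""
    p = np.asarray(p, dtype=float)
    return (
        len(p) == len(values)                          # one probability per value
        and bool(np.all((p >= 0) & (p <= 1)))          # each lies between 0 and 1
        and bool(np.isclose(p.sum(), 1.0, atol=tol))   # together they sum to 1
    )

print(check_probabilities([3, 5, 7, 9], [0.1, 0.3, 0.6, 0.0]))  # valid list
print(check_probabilities([3, 5, 7, 9], [0.1, 0.3, 0.7, 0.0]))  # sums to 1.1

# NumPy performs its own check and raises ValueError for invalid probabilities
try:
    np.random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.7, 0.0], size=5)
    raised = False
except ValueError:
    raised = True
print(raised)
```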
Program:
Generating a 1-D Random Distribution
Let’s generate an array of 100 random values, where each value is 3, 5, 7, or 9, drawn with the following probabilities:
Probability for 3: 0.1
Probability for 5: 0.3
Probability for 7: 0.6
Probability for 9: 0.0
from numpy import random

# Draw 100 values from [3, 5, 7, 9]; p lists the probability of each value, in the same order
x = random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=100)
print(x)
Output (the values differ from run to run):
[7 7 7 5 7 5 7 5 7 7 7 7 5 5 7 7 5 7 5 7 7 5 7 7 7 7 5 7 3 7 7 7 7 7 5 7 7 5 7 7 7 7 5 5 7 5 7 5 7 7 7 7 7 7 7 5 7 7 7 5 7 7 7 7 7 7 7 7 7 7 5 7 5 7 7 7 7 5 5 5 5 5 7 7 7 7 5 7 7 7 7 5 7 7 7 7 7 7 7 5 5 7 7 7]
Note: In this case, 9 will never appear since its probability is 0.
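One way to see the probabilities at work is to draw a large sample and compare the observed share of each value with the probability it was given. The sketch below makes two assumptions that are not in the text above: it uses NumPy's newer Generator interface (np.random.default_rng) instead of random.choice so that the run can be seeded, and the seed and sample size are arbitrary.

```python
import numpy as np

values = [3, 5, 7, 9]
probs = [0.1, 0.3, 0.6, 0.0]

# Seeded generator so the run is repeatable; the seed 0 is arbitrary
rng = np.random.default_rng(0)
x = rng.choice(values, p=probs, size=100_000)

# Observed share of each value in the sample
observed = [float(np.mean(x == v)) for v in values]

for v, p, o in zip(values, probs, observed):
    print(f"value={v} expected={p:.2f} observed={o:.3f}")
```

With a large sample the observed shares should sit close to 0.1, 0.3, and 0.6, and the share for 9 stays at exactly 0.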
Program:
Generating a 2-D Random Distribution
We can also generate multi-dimensional arrays by specifying the desired shape using the size parameter. Here’s an example of a 2-D array with 3 rows and 5 columns using the same probabilities:
from numpy import random
x = random.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=(3, 5))
print(x)
Output (the values differ from run to run):
[[7 7 5 7 7]
 [7 7 7 5 7]
 [7 5 7 7 7]]
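To confirm how the size tuple maps to the result, the sketch below inspects the shape of the returned array. As an assumption for repeatability, it uses a seeded generator from np.random.default_rng rather than random.choice; the seed is arbitrary and the name grid is chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, used only for repeatability
grid = rng.choice([3, 5, 7, 9], p=[0.1, 0.3, 0.6, 0.0], size=(3, 5))

print(grid.shape)                # (rows, columns), here (3, 5)
print(grid.size)                 # total number of drawn values, here 15
print((grid == 7).sum(axis=1))   # how many 7s landed in each row
```

Each of the 15 cells is drawn independently with the same probabilities, so the shape only affects how the values are arranged.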