Statistics Part 3: Data Distribution

shorya sharma

5 min read · Dec 26, 2022

“Data are just summaries of thousands of stories—tell a few of those stories to help make the data meaningful.”

~ Dan Heath, bestselling author

This is part 3 of our statistics series.

What will you learn by the end of this part?

Variance.
Standard deviation.
Types of distribution.
Empirical rule.
Z score.
Chebyshev’s Theorem.

Without wasting any more time, let’s begin.

Variance

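Variance measures how far the data is spread from the mean: it is the average of the squared deviations from the mean. Population variance is denoted by σ² (sigma squared). Sample variance divides by n - 1 instead of n and is denoted by s².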

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for the examples
spark = SparkSession.builder.appName("Stats").enableHiveSupport().getOrCreate()
simpleData = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
schema = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)

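# Aggregate functions used in this section and the next one
from pyspark.sql.functions import (
    variance, var_samp, var_pop, stddev, stddev_samp, stddev_pop
)

# variance is an alias for var_samp (sample variance); var_pop is the population variance
df.select(variance("salary"), var_samp("salary"), var_pop("salary")) \
  .show(truncate=False)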
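To make the difference between the sample and population versions concrete, here is a minimal plain-Python sketch that recomputes both by hand for the same ten salaries. The helper names `population_variance` and `sample_variance` are made up for this illustration and are not part of PySpark.

```python
# Salaries from the example DataFrame above
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

def population_variance(values):
    """Average squared deviation from the mean (divide by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations divided by N - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

print(population_variance(salaries))        # 528000.0
print(round(sample_variance(salaries), 2))  # 586666.67
```

The sample variance is always a little larger than the population variance for the same data, because it divides by a smaller number.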

Standard Deviation

Standard deviation measures how much the data in a set varies from the mean. It is the square root of the variance, so it is expressed in the same units as the data.

# stddev is an alias for stddev_samp (sample standard deviation)
df.select(stddev("salary"), stddev_samp("salary"),
    stddev_pop("salary")).show(truncate=False)
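As a cross-check outside Spark, the sketch below uses Python's standard `statistics` module on the same salaries. It assumes nothing beyond the standard library, and it shows that the standard deviation is the square root of the variance.

```python
import math
import statistics

# Salaries from the example DataFrame above
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

pop_std = statistics.pstdev(salaries)    # divides by N, like stddev_pop
sample_std = statistics.stdev(salaries)  # divides by N - 1, like stddev_samp

# Standard deviation is the square root of the variance
pop_std_from_variance = math.sqrt(statistics.pvariance(salaries))

print(round(pop_std, 2), round(sample_std, 2))
print(math.isclose(pop_std, pop_std_from_variance))
```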

Symmetric and Skewed Distribution

1. Symmetric Distribution

When a density curve is perfectly symmetric, the mean and the median are both at the center of the distribution.

2. Skewed Distribution

A non-symmetric distribution whose longer tail extends to the right (positively skewed) or to the left (negatively skewed).
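A quick way to get a feel for skew is to compare the mean and the median. The sketch below uses three small data sets invented purely for illustration. In these examples a long right tail pulls the mean above the median and a long left tail pulls it below; this is a common tendency rather than a strict rule.

```python
import statistics

# Hypothetical data sets, invented for illustration only
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 2, 2, 3, 3, 4, 20]   # one large value creates a long right tail
left_skewed = [-20, 4, 5, 5, 6, 6, 7]   # one small value creates a long left tail

datasets = [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]
for name, data in datasets:
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))
```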

The Empirical Rule

The normal distribution follows the empirical rule, which states that about 68% of observations fall within one standard deviation of the mean (µ ± σ), about 95% within two standard deviations (µ ± 2σ), and about 99.7% within three standard deviations (µ ± 3σ).

Assume that an animal in the zoo lives to an average of 10 years of age, with a standard deviation of 1.4 years, and that lifespans are normally distributed. The zookeeper wants to find the probability of an animal living for more than 7.2 years. The distribution looks as follows:

One standard deviation (µ ± σ): 8.6 to 11.4 years
Two standard deviations (µ ± 2σ): 7.2 to 12.8 years
Three standard deviations (µ ± 3σ): 5.8 to 14.2 years

The empirical rule states that 95% of the distribution lies within two standard deviations. The remaining 5% lies outside two standard deviations: half above 12.8 years and half below 7.2 years. The probability of living for more than 7.2 years is therefore:

95% + (5% / 2) = 97.5%
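The 97.5% figure can be checked against the exact normal distribution. This is a minimal sketch using `statistics.NormalDist` from the Python standard library, assuming the lifespans really are normally distributed with the mean and standard deviation given above.

```python
from statistics import NormalDist

lifespan = NormalDist(mu=10, sigma=1.4)

# Share of the distribution within k standard deviations of the mean
for k in (1, 2, 3):
    lower, upper = 10 - k * 1.4, 10 + k * 1.4
    share = lifespan.cdf(upper) - lifespan.cdf(lower)
    print(f"within {k} sd ({lower:.1f} to {upper:.1f} years): {share:.4f}")

# Probability of living longer than 7.2 years
p_longer = 1 - lifespan.cdf(7.2)
print(f"P(lifespan > 7.2) = {p_longer:.4f}")
```

The exact probability comes out at roughly 97.7%, slightly above the 97.5% from the empirical rule, because the rule rounds the two-standard-deviation share (about 95.45%) down to 95%.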

Percentile

The nth percentile is the value below which n percent of the values lie. In other words, a value at the 95th percentile is greater than 95% of the data.
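The definition can be sketched in a few lines of plain Python. The function name `percent_below` and the list of scores are hypothetical, chosen only to illustrate the idea.

```python
def percent_below(values, x):
    """Percentage of values that are strictly below x."""
    return 100 * sum(v < x for v in values) / len(values)

scores = list(range(1, 101))  # hypothetical scores 1, 2, ..., 100

# 95 of the 100 scores are below 96, so 96 sits at the 95th percentile
print(percent_below(scores, 96))  # 95.0
```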

Z-Scores

The z-score tells us how many standard deviations a data point is from the mean:

z = (x - μ) / σ

where x = data point, μ = mean, and σ = standard deviation.

The z-score is always expressed in units of standard deviation.

We use a z-table, which takes a z-score and gives the proportion of the area under the standard normal curve up to that point.

Table with negative Z score looks like:

Table with Positive Z score looks like:

Example problems

Example 1:

Example 2:

The scores on a certain college entrance exam are normally distributed with mean μ = 82 and standard deviation σ = 8. Approximately what percentage of students score less than 84 on the exam?
First, we find the z-score associated with an exam score of 84:
z-score = (x - μ) / σ = (84 - 82) / 8 = 2 / 8 = 0.25
Next, we look up the value 0.25 in the z-table, which gives 0.5987.
Approximately 59.87% of students score less than 84 on this exam.
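If a z-table is not at hand, the same lookup can be done in code. This sketch uses the cumulative distribution function of the standard normal distribution from Python's `statistics` module to reproduce the exam example.

```python
from statistics import NormalDist

mu, sigma, x = 82, 8, 84

z = (x - mu) / sigma             # number of standard deviations above the mean
p_below = NormalDist().cdf(z)    # area under the standard normal curve up to z

print(z, round(p_below, 4))  # 0.25 0.5987
```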

Threshold
If the absolute value of a data point’s z-score is greater than 3 (that is, z < -3 or z > 3), the data point is quite different from the other data points and may be an outlier.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Stats").enableHiveSupport().getOrCreate()
simpleData = [["Patty O’Furniture",5.9],
        ["Paddy O’Furniture",5.2],
        ["Olive Yew",5.1],
        ["Aida Bugg",5.5],
        ["Maureen Biologist",4.9],
        ["Teri Dacty",5.4],
        ["Peg Legge",6.2],
        ["Allie Grate",6.5],
        ["Liz Erd",7.1], 
        ["A. Mused",14.5],
        ["Constance Noring",6.1],
        ["Lois Di Nominator",5.6],
        ["Minnie Van Ryder",1.2],
        ["Lynn O’Leeum",5.5]]
schema = ["student_name", "height"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
from pyspark.sql import functions as F

lower_limit = -3
upper_limit = 3
# Compute z-scores for every non-string column and keep only the rows outside the limits
for col_name, col_type in df.dtypes:
    if col_type == "string":
        continue
    # Mean and sample standard deviation of the column
    stats = df.select(F.mean(col_name), F.stddev(col_name)).first()
    col_mean, col_stddev = stats[0], stats[1]
    df = df.withColumn("z_score", (F.col(col_name) - col_mean) / col_stddev)
    df = df.filter((F.col("z_score") < lower_limit) | (F.col("z_score") > upper_limit))
# Show only the rows flagged as outliers
df.show()
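The same check can be sketched without Spark, which makes it easy to see which rows are flagged. The version below uses the sample standard deviation, which is what PySpark's `stddev` computes, on the heights from the example above.

```python
import statistics

# Heights from the example DataFrame above
heights = [5.9, 5.2, 5.1, 5.5, 4.9, 5.4, 6.2, 6.5, 7.1, 14.5, 6.1, 5.6, 1.2, 5.5]

mean = statistics.mean(heights)
std = statistics.stdev(heights)  # sample standard deviation

z_scores = [(h - mean) / std for h in heights]
outliers = [h for h, z in zip(heights, z_scores) if abs(z) > 3]

print(outliers)  # [14.5]
```

Only 14.5 crosses the threshold. The value 1.2 looks unusual too, but its z-score is only about -1.7, because the extreme values themselves inflate the standard deviation. This is one reason a z-score threshold should be treated as a rough screen rather than a definitive test.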

Chebyshev’s Theorem

A limitation of the empirical rule is that it only applies to normally distributed data. Chebyshev’s Theorem tells us that at least 100 × (1 − 1/k^2)% of our data must be within k standard deviations of the mean, for any k > 1, regardless of the shape of the data’s distribution.

For instance, here’s what the theorem can conclude for any distribution when k = 2, 3, and 4:

At least 3/4 (75%) of the data must be within k = 2 standard deviations of the mean.
At least 8/9 (about 88.9%) of the data must be within k = 3 standard deviations of the mean.
At least 15/16 (93.75%) of the data must be within k = 4 standard deviations of the mean.

Keep in mind that k doesn’t have to be an integer, but it does have to be greater than 1. So we could use Chebyshev’s Theorem for k = 1.32, k = 2, or k = 2.14, but not for k = 1 or for k = 0.46.
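The bound is simple enough to compute directly. In this small sketch the function name `chebyshev_lower_bound` is made up for illustration, and values of k that are not greater than 1 are rejected, as described above.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k ** 2

for k in (1.32, 2, 2.14, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2%}")
```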

Because Chebyshev’s Theorem has to hold for distributions of any shape, unlike the Empirical Rule, which applies only to the normal distribution, its bounds are more conservative. That is why the percentages from Chebyshev’s Theorem are smaller than those from the Empirical Rule.