Statistics Part 2: Central Tendency and Measures of Spread

shorya sharma
4 min read · Dec 24, 2022


In this part, we are going to learn how to analyze data using measures of central tendency and measures of spread. I have demonstrated the concepts using PySpark; feel free to use pandas or NumPy instead, as the concepts remain the same.

Measures of central tendency

A measure of central tendency (also referred to as measures of centre or central location) is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution.

1. Mean

It is the sum of all the datapoints divided by the number of datapoints.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Stats").enableHiveSupport().getOrCreate()

simpleData = [("James", "Sales", 3000),
              ("Michael", "Sales", 4600),
              ("Robert", "Sales", 4100),
              ("Maria", "Finance", 3000),
              ("James", "Sales", 3000),
              ("Scott", "Finance", 3300),
              ("Jen", "Finance", 3900),
              ("Jeff", "Marketing", 3000),
              ("Kumar", "Marketing", 2000),
              ("Saif", "Sales", 4100)]

schema = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=simpleData, schema=schema)
df.printSchema()
df.show(truncate=False)
root
|-- employee_name: string (nullable = true)
|-- department: string (nullable = true)
|-- salary: long (nullable = true)

+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James |Sales |3000 |
|Michael |Sales |4600 |
|Robert |Sales |4100 |
|Maria |Finance |3000 |
|James |Sales |3000 |
|Scott |Finance |3300 |
|Jen |Finance |3900 |
|Jeff |Marketing |3000 |
|Kumar |Marketing |2000 |
|Saif |Sales |4100 |
+-------------+----------+------+
from pyspark.sql.functions import *

# mean salary per department
df.groupby("department").agg({"salary": "mean"}).show(truncate=False)
# mean salary over the whole dataset
df.select(mean("salary")).show()
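Since the concepts carry over to plain Python, here is a quick sanity check of the mean, a sketch assuming the same ten salaries as in the DataFrame above:

```python
# The ten salaries from the DataFrame above
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

# Mean: sum of all datapoints divided by their count
mean_salary = sum(salaries) / len(salaries)
print(mean_salary)  # 3400.0
```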

2. Median

It is the value in the middle of the dataset when we line up all the datapoints in order from least to greatest. If there is an even number of datapoints, we take the mean of the two middle values.

# median salary per department
df.groupby("department").agg(percentile_approx("salary", 0.5).alias("median")).show(truncate=False)
# median salary over the whole dataset
df.select(percentile_approx("salary", 0.5).alias("median")).show()
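Note that `percentile_approx` is, as the name suggests, an approximation at its default accuracy. As a cross-check, the definition above can be applied directly in plain Python (a sketch assuming the same ten salaries):

```python
# The ten salaries from the DataFrame above
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

ordered = sorted(salaries)
n = len(ordered)
if n % 2 == 1:
    median = ordered[n // 2]  # odd count: the single middle value
else:
    mid = n // 2              # even count: mean of the two middle values
    median = (ordered[mid - 1] + ordered[mid]) / 2
print(median)  # 3150.0
```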

3. Mode

The mode of a dataset is the value that occurs most often, more than any other value.

# order by frequency, not by salary, to get the most common value
df.groupby("salary").count().orderBy("count", ascending=False).first()[0]
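The same result can be verified with the standard library's `collections.Counter` (a sketch assuming the same ten salaries):

```python
from collections import Counter

salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

# most_common(1) returns [(value, count)] for the most frequent value
mode, count = Counter(salaries).most_common(1)[0]
print(mode, count)  # 3000 4
```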

Measures of Spread

Spread tells us how, and by how much, our data is dispersed around the centre. Measures of spread are also called measures of dispersion or scatter.

1. Range

The range of the dataset is the distance between the largest and the smallest value.

df.select((max("salary") - min("salary")).alias("range")).show()
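In plain Python this is simply the difference of the built-in `max` and `min` (assuming the same ten salaries):

```python
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

# Range: distance between the largest and smallest values
print(max(salaries) - min(salaries))  # 2600
```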

2. Interquartile Range

We can divide a dataset into four quarters using medians. First we cut the data in half at the median, then we find the median of each half and split the data at those points. The resulting cut points are the quartiles of the data.

The median of the lower half is called the first quartile Q1, the median of the upper half is called the third quartile Q3, and the median of the entire dataset is the second quartile Q2.

The interquartile range is the difference between the median of the upper half and the median of the lower half, Q3-Q1.

# relativeError=0 makes approxQuantile return exact quantiles
df.approxQuantile("salary", [0.25, 0.75], 0)

Any observations that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3 are considered outliers.

def calculate_bounds(df):
    # compute Q1 and Q3 for every bigint column
    bounds = {
        c: dict(
            zip(["q1", "q3"], df.approxQuantile(c, [0.25, 0.75], 0))
        )
        for c, d in zip(df.columns, df.dtypes) if d[1] == "bigint"
    }

    # extend each pair of quartiles to the 1.5 * IQR fences
    for c in bounds:
        iqr = bounds[c]['q3'] - bounds[c]['q1']
        bounds[c]['min'] = bounds[c]['q1'] - (iqr * 1.5)
        bounds[c]['max'] = bounds[c]['q3'] + (iqr * 1.5)

    return bounds

def flag_outliers(df, c):
    bounds = calculate_bounds(df)

    # flag rows of column c that fall outside the 1.5 * IQR fences
    return df.select(
        c,
        when(
            ~col(c).between(bounds[c]['min'], bounds[c]['max']),
            "yes"
        ).otherwise("no").alias(c + '_outlier')
    )

flag_outliers(df, "salary").show()
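The same fences can be checked with the standard library's `statistics.quantiles`. Note that quartile conventions differ between implementations, so on small datasets the `"inclusive"` method below may not match Spark's `approxQuantile` exactly; this is a sketch assuming the same ten salaries:

```python
import statistics

salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

# quantiles(n=4) returns [Q1, Q2, Q3]
q1, _q2, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

# observations beyond 1.5 * IQR from the quartiles are outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [s for s in salaries if s < lower or s > upper]
print(q1, q3, outliers)  # 3000.0 4050.0 []
```

No salary lies outside the fences here, which matches the all-"no" column produced by `flag_outliers` above.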
