Statistics Part 2: Central Tendency and Measures of Spread
In this part we will learn how to analyze data using measures of central tendency and measures of spread. The examples use PySpark, but feel free to use pandas or NumPy instead, as the concepts remain the same.
Measures of Central Tendency
A measure of central tendency (also referred to as a measure of centre or central location) is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or centre of its distribution.
1. Mean
It is the sum of all the datapoints divided by the number of datapoints.
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Stats").enableHiveSupport().getOrCreate()
simpleData = [("James", "Sales", 3000),
("Michael", "Sales", 4600),
("Robert", "Sales", 4100),
("Maria", "Finance", 3000),
("James", "Sales", 3000),
("Scott", "Finance", 3300),
("Jen", "Finance", 3900),
("Jeff", "Marketing", 3000),
("Kumar", "Marketing", 2000),
("Saif", "Sales", 4100)
]
schema = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)
root
|-- employee_name: string (nullable = true)
|-- department: string (nullable = true)
|-- salary: long (nullable = true)
+-------------+----------+------+
|employee_name|department|salary|
+-------------+----------+------+
|James        |Sales     |3000  |
|Michael      |Sales     |4600  |
|Robert       |Sales     |4100  |
|Maria        |Finance   |3000  |
|James        |Sales     |3000  |
|Scott        |Finance   |3300  |
|Jen          |Finance   |3900  |
|Jeff         |Marketing |3000  |
|Kumar        |Marketing |2000  |
|Saif         |Sales     |4100  |
+-------------+----------+------+
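# Import only the functions used in this article.
# Note: max and min imported here shadow Python's built-ins of the same name.
from pyspark.sql.functions import col, max, mean, min, percentile_approx, when
# Mean salary per department
df.groupby("department").agg({"salary": "mean"}).show(truncate=False)
# Mean salary over the whole column
df.select(mean("salary")).show()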
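To make the definition concrete, here is a minimal plain-Python sketch of the same calculation, using only the standard library. The list `salaries`, the list `rows`, and the helper `arithmetic_mean` are names introduced for illustration; the values are copied from the example DataFrame.

```python
from collections import defaultdict

# (department, salary) pairs copied from the example DataFrame
rows = [
    ("Sales", 3000), ("Sales", 4600), ("Sales", 4100), ("Finance", 3000),
    ("Sales", 3000), ("Finance", 3300), ("Finance", 3900),
    ("Marketing", 3000), ("Marketing", 2000), ("Sales", 4100),
]
salaries = [salary for _, salary in rows]

def arithmetic_mean(values):
    """Sum of all datapoints divided by the number of datapoints."""
    return sum(values) / len(values)

# Mean over the whole column
overall_mean = arithmetic_mean(salaries)
print(overall_mean)  # 3400.0

# Mean per department: collect the salaries of each group, then average them
by_department = defaultdict(list)
for department, salary in rows:
    by_department[department].append(salary)
department_means = {d: arithmetic_mean(v) for d, v in by_department.items()}
print(department_means)
```

The ten salaries add up to 34000, so the overall mean is 34000 / 10 = 3400.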
2. Median
It is the value at the middle of the dataset when we line up all the datapoints in order from least to greatest. If there is an even number of datapoints, we take the mean of the two values in the middle.
# percentile_approx returns an approximate percentile; with an even number of rows
# it may return one of the two middle values rather than their mean.
# Median salary per department
df.groupby("department").agg(percentile_approx("salary", 0.5).alias("median")).show(truncate=False)
# Median salary over the whole column
df.select(percentile_approx("salary", 0.5).alias("median")).show()
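The textbook definition, including the even-count case, can be sketched in plain Python as follows. The helper name `median_of` is hypothetical, and the salary values are copied from the example data. The result is cross-checked against `statistics.median` from the standard library, which uses the same averaging rule.

```python
import statistics

# Salary column from the example DataFrame
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

def median_of(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

exact_median = median_of(salaries)
print(exact_median)  # 3150.0

# The standard library agrees with the hand-written version
assert exact_median == statistics.median(salaries)
```

Sorted, the two middle salaries are 3000 and 3300, so the exact median is 3150. An approximate percentile function that picks an actual element of the column may report a slightly different number for the same data.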
3. Mode
The mode of the dataset is the value that occurs most often.
# Count how often each salary occurs, sort by that count (not by salary), and take the top value
df.groupby("salary").count().orderBy("count", ascending=False).first()[0]
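The same count-then-pick-the-largest idea can be sketched in plain Python with `collections.Counter`. The variable names are illustrative, and the salary values are copied from the example data.

```python
from collections import Counter
import statistics

# Salary column from the example DataFrame
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

# Count occurrences of each value, then take the most frequent one
counts = Counter(salaries)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 3000 4

# If several values tie for the highest count, multimode returns all of them
all_modes = statistics.multimode(salaries)
print(all_modes)  # [3000]
```

Note that a dataset can have more than one mode. Taking only the first row of a sorted count table, as in the PySpark line above, returns just one of the tied values.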
Measures of Spread
Spread describes how far the data is dispersed around the center. Measures of spread are also called measures of dispersion or scatter.
1. Range
The range of the dataset is the distance between the largest and the smallest value.
df.select((max("salary")-min("salary")).alias("range")).show()
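As a quick check of the arithmetic, here is a plain-Python sketch of the range for the same salary values. The helper name `value_range` is introduced for illustration only.

```python
# Salary column from the example DataFrame
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

def value_range(values):
    """Distance between the largest and the smallest value."""
    return max(values) - min(values)

salary_range = value_range(salaries)
print(salary_range)  # 2600
```

The largest salary is 4600 and the smallest is 2000, giving a range of 2600. Because the range depends only on the two extreme values, a single unusual datapoint can change it a lot.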
2. Interquartile Range
We can divide a dataset into four quarters by using medians. We cut the sorted data in half at the median, then find the median of each half and split the halves at those points. The values that bound the resulting quarters are called quartiles.
The median of the lower half is called the first quartile Q1, the median of the upper half is called the third quartile Q3, and the median of the entire dataset is called the second quartile Q2.
The interquartile range (IQR) is the difference between the median of the upper half and the median of the lower half, Q3-Q1.
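# Returns [Q1, Q3]; a relative error of 0 requests exact quantiles, which can be expensive on large data
df.approxQuantile("salary",[0.25, 0.75], 0)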
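The median-of-halves procedure described above can be sketched in plain Python as follows. The helper name `quartiles` is hypothetical. The sketch also assumes one particular convention: for an odd number of datapoints, the middle value is left out of both halves. Other quartile conventions exist and can give slightly different numbers.

```python
import statistics

# Salary column from the example DataFrame
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumption: with an odd count, the middle value belongs to neither half.
    Needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)

q1, q2, q3 = quartiles(salaries)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 3000 3150.0 4100 1100
```

Here the lower half is 2000, 3000, 3000, 3000, 3000 (median 3000) and the upper half is 3300, 3900, 4100, 4100, 4600 (median 4100), so the IQR is 1100.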
Any observation that lies more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3 is considered an outlier.
def calculate_bounds(df):
    # Q1 and Q3 for every bigint (long) column; a relative error of 0 requests exact quantiles
    bounds = {
        c: dict(zip(["q1", "q3"], df.approxQuantile(c, [0.25, 0.75], 0)))
        for c, dtype in df.dtypes
        if dtype == "bigint"
    }
    for c in bounds:
        iqr = bounds[c]["q3"] - bounds[c]["q1"]
        # Lower and upper fences sit 1.5 * IQR beyond the quartiles
        bounds[c]["min"] = bounds[c]["q1"] - (iqr * 1.5)
        bounds[c]["max"] = bounds[c]["q3"] + (iqr * 1.5)
    return bounds
def flag_outliers(df, c):
    # Add a "<column>_outlier" column marking values that fall outside the fences
    bounds = calculate_bounds(df)
    return df.select(
        c,
        when(
            ~col(c).between(bounds[c]["min"], bounds[c]["max"]),
            "yes"
        ).otherwise("no").alias(c + "_outlier")
    )
flag_outliers(df, "salary").show()
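To see the 1.5 × IQR rule with concrete numbers, here is a plain-Python sketch of the same idea. The helper names `iqr_fences` and `find_outliers` are hypothetical, the quartiles use the median-of-halves convention (middle value excluded from both halves when the count is odd), and the 9000 salary is an invented value added only to show a flagged outlier.

```python
import statistics

# Salary column from the example DataFrame
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

def iqr_fences(values):
    """Return (lower_fence, upper_fence) = (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Values that fall outside the fences."""
    low, high = iqr_fences(values)
    return [v for v in values if v < low or v > high]

fences = iqr_fences(salaries)
print(fences)                   # (1350.0, 5750.0)
print(find_outliers(salaries))  # []

# Hypothetical extra datapoint, not part of the original data
with_extreme = salaries + [9000]
print(find_outliers(with_extreme))  # [9000]
```

With Q1 = 3000 and Q3 = 4100, the fences are 3000 - 1650 = 1350 and 4100 + 1650 = 5750. Every salary in the example data lies inside them, so under this convention none is flagged.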
Follow me on LinkedIn: https://www.linkedin.com/in/shorya-sharma-b94161121/