Statistics Part 1: Visualising data in tables with PySpark/Python

shorya sharma
4 min read · Dec 10, 2022

I recently moved into a role where I encounter statistics in my day-to-day work more than ever before. I know this isn't the first time you are hearing the term "statistics", but since I faced my own challenges with it, I thought I could put together a basic tutorial series that simplifies using stats with PySpark.

If you are already a pro, you may want to skip this tutorial, but if you are not, or you have been coding in other languages like SAS or R, this can be a good starting point.

Individuals and Variables

The individuals are the items in the dataset and can be cases, things, people, etc.

Consider a table where each row is a person and one column records their age: the individuals are the people and Age is the variable, because Age is a property of each individual. Together, individuals and variables are called data. When we organize data into a table like this, it is called a data table.
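As a minimal sketch (the names and ages are hypothetical, just for illustration), such a data table could be built in PySpark like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Stats").getOrCreate()

# Hypothetical individuals (people), each with one variable (Age)
people = [("Alice", 34), ("Bob", 29), ("Charlie", 41)]
df_people = spark.createDataFrame(people, ["Name", "Age"])
df_people.show()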

The variables can be categorical or quantitative.

  1. Categorical variables (also called qualitative variables) are non-numerical variables.
  2. Quantitative variables are numerical variables.

We can divide quantitative variables into discrete variables and continuous variables.

  1. Discrete variables are those we obtain by counting; therefore, they can take only certain numerical values.
  2. Continuous variables may include data as decimals, fractions, or irrational numbers; the schema sketch below shows how both kinds of variable appear in PySpark.
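In PySpark this distinction maps onto column types: string columns hold categorical variables, numeric columns hold quantitative ones. Continuing with the hypothetical df_people from the sketch above:

from pyspark.sql.functions import lit

# Add a hypothetical continuous variable (Height in metres) for contrast
df_people2 = df_people.withColumn("Height", lit(1.75))
df_people2.printSchema()
# root
#  |-- Name: string (nullable = true)     <- categorical (nominal)
#  |-- Age: long (nullable = true)        <- quantitative, discrete
#  |-- Height: double (nullable = false)  <- quantitative, continuous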

Levels of Measurement

  1. Nominal: things like favourite food, names, colours, and "yes" or "no" responses have a nominal scale of measurement. Only categorical data can be measured on a nominal scale.
  2. Ordinal: categorical data can also be ordinal, meaning it can be ordered. For example, the experience of a movie can be "Awesome", "Good", "Bad", or "Terrible", which follows an order from best to worst.
  3. Interval: data measured on an interval scale can be ordered, and it also gives us meaningful intervals between measurements. For example, temperature is measured on an interval scale.
  4. Ratio: similar to the interval scale, but with a true starting point (absolute zero), which means ratio-scale measurements can never be negative.

One-Way Tables

A one-way table presents data organised by a single variable. One way to confirm we are dealing with a one-way table is to count how many questions we must ask to locate a value: in a table of ice-cream scoops sold per flavour, if someone asks how many scoops were sold, we only need to ask one question, for which flavour? One question means a one-way table; a minimal sketch of such a table follows below.
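Here is a minimal sketch of such a one-way table; the flavours and counts are made up for illustration:

# Reusing the SparkSession from the earlier sketch.
# One variable (Flavour) indexes the counts, so one question locates a value.
icecream = [("Vanilla", 120), ("Chocolate", 150), ("Strawberry", 90)]
df_icecream = spark.createDataFrame(icecream, ["Flavour", "Scoops_Sold"])
df_icecream.show()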

Two-Way Tables

We talk about this type of data in terms of independent variables and dependent variables. In a two-way table, there are two independent categories on which the values depend.

Consider a table that represents the number of transactions in a bank per month, per year: the data depends on two independent things. If someone asks for the number of transactions, we must ask two questions, for which year and for which month? A minimal sketch of such a table follows below.
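A minimal sketch of a hypothetical two-way table, built by pivoting made-up transaction counts on year and month:

# Two independent categories (Year, Month) index each count.
tx = [(2021, "Jan", 500), (2021, "Feb", 450),
      (2022, "Jan", 620), (2022, "Feb", 580)]
df_tx = spark.createDataFrame(tx, ["Year", "Month", "Transactions"])

# Pivot into a two-way table: rows are years, columns are months
df_tx.groupBy("Year").pivot("Month").sum("Transactions").show()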

Frequency Tables

A frequency table is a table that displays how frequently or infrequently something occurs.

from pyspark.sql import SparkSession

# getOrCreate reuses the session if one already exists
spark = SparkSession.builder.appName("Stats").enableHiveSupport().getOrCreate()

# Sample grocery data: Item_Group and Item_name are categorical, Price is quantitative
Grocery = [("Dairy", "Milk", 40), ("Fruit", "Banana", 80), ("Fruit", "Orange", 80),
           ("Vegetable", "Potato", 60), ("Dairy", "curd", 40), ("Vegetable", "Brinjal", 70),
           ("Dairy", "cheese", 50), ("Fruit", "Mango", 90), ("Vegetable", "Carrot", 90),
           ("Fruit", "Apple", 100)]
GroceryColumns = ["Item_Group", "Item_name", "Price"]
df = spark.createDataFrame(data=Grocery, schema=GroceryColumns)
df.show()

# Frequency: how often each Item_Group (and each Item_Group/Price pair) occurs
df.groupBy("Item_Group").count().show()
df.groupBy("Item_Group", "Price").count().show()

Relative Frequency Tables

from pyspark.sql.functions import col

# Relative frequency = group count divided by the total number of rows
total = df.count()
df1 = df.groupBy("Item_Group").count() \
        .withColumn("relativeFrequency", col("count") / total)
df1.show()
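On this sample of 10 rows, the relative frequencies come out to 0.3 for Dairy, 0.4 for Fruit, and 0.3 for Vegetable; relative frequencies always sum to 1.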

Joint Distribution Tables

A joint distribution is a table of percentages, similar to a relative frequency table. The difference is that a joint distribution shows the distribution of one set of data against the distribution of another set of data.

from pyspark.sql.functions import col, expr, coalesce, round, lit

# Cross-tabulate Item_Group against Item_name, then normalise each row by its total
mycrosstab = df.crosstab('Item_Group', 'Item_name')
cols = [i for i in mycrosstab.columns if not i == 'Item_Group_Item_name']
out = (mycrosstab
       .withColumn("SumCols", expr('+'.join(cols)))
       .select("Item_Group_Item_name",
               *[coalesce(round(col(i) / col("SumCols"), 2), lit(0)).alias(i)
                 for i in cols]))
out.show()
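On the grocery data, each Item_Group row is divided by its row total, so for example each of Dairy's three items (Milk, curd, cheese) shows 0.33, and each row's entries sum to 1 up to rounding.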

I hope you developed some, if not much, interest in the world of statistics. We will continue our journey in the next part.

Follow me on LinkedIn: https://www.linkedin.com/in/shorya-sharma-b94161121/
