Variable Selection In Pyspark: Part 1
Variable selection, also known as feature selection, is the process of selecting the best features in your dataset, that is, choosing the appropriate variables from the complete list and removing those that are irrelevant or redundant.
To put it in simple words, suppose you want to build a cricket team for the World Cup: you want the best players available (the best features), and you don't want several players doing the same job at the same time (multicollinearity).
Download the dataset bank-full.csv from the below link
https://archive.ics.uci.edu/ml/datasets/bank+marketing
I will use the Databricks environment for the demo; feel free to use any Spark environment of your choice.
Let's load the dataset.
# location of the uploaded dataset and the name of the target column
filename = "/FileStore/tables/bank_full.csv"
target_variable_name = "y"
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# the bank marketing file is semicolon-separated
df = spark.read.csv(filename, header=True, inferSchema=True, sep=';')
df.show()
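Optionally, a quick look at the schema and the row count helps confirm the load worked as expected (a small check, not part of the original walkthrough):
# inspect the inferred column types and the number of rows
df.printSchema()
print(df.count())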
Exploratory Data Analysis
As per IBM's definition, exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.
Cardinality
"Cardinality" refers to the number of possible values that a feature can take, i.e., the number of unique values of a variable. Although cardinality can be calculated for any variable, it is most useful for categorical variables.
Missing Values
Missing values refer to pieces of information that are absent from the data. Data can be missing for three reasons:
- Missing at random (MAR): the probability of a value being missing is related to the observed data, but not to the missing value itself.
- Missing completely at random (MCAR): there is no relationship between the missingness and either the observed or the unobserved data.
- Missing not at random (MNAR): the probability of a value being missing depends on the unobserved value itself, so the missingness carries information.
# code to check cardinality
from pyspark.sql.functions import approx_count_distinct
def cardinality_calculation(df, cut_off=1):
    # approximate distinct count for every column
    cardinality = df.select(*[approx_count_distinct(c).alias(c) for c in df.columns])
    # convert to pandas for easy inspection
    final_cardinality_df = cardinality.toPandas().transpose()
    final_cardinality_df.reset_index(inplace=True)
    final_cardinality_df.rename(columns={0: 'Cardinality'}, inplace=True)
    # variables at or below the cut-off (near-constant columns) are flagged for removal
    vars_selected = final_cardinality_df['index'][final_cardinality_df['Cardinality'] <= cut_off]
    return final_cardinality_df, vars_selected
cardinality_df, cardinality_vars_selected = cardinality_calculation(df)
print(cardinality_df)
# missing values code
from pyspark.sql.functions import count, when, isnan, col
def missing_calculation(df, miss_percentage=0.80):
    # count nulls and NaNs for every column
    missing = df.select(*[count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])
    length_df = df.count()
    # convert to pandas and compute the missing percentage per column
    final_missing_df = missing.toPandas().transpose()
    final_missing_df.reset_index(inplace=True)
    final_missing_df.rename(columns={0: 'Missing_Count'}, inplace=True)
    final_missing_df['missing_percentage'] = final_missing_df['Missing_Count'] / length_df
    # variables with too many missing values are flagged for removal
    vars_selected = final_missing_df['index'][final_missing_df['missing_percentage'] >= miss_percentage]
    return final_missing_df, vars_selected
missing_df, missing_vars_selected = missing_calculation(df)
For our dataset, no columns are rejected by these checks, but this isn't always the case with real-world datasets.
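If either check had flagged any variables, they could be dropped before moving on; a minimal sketch, assuming you simply want to remove them from the DataFrame:
# drop near-constant and mostly-missing variables flagged by the two checks (sketch)
vars_to_drop = set(cardinality_vars_selected) | set(missing_vars_selected)
if vars_to_drop:
    df = df.drop(*vars_to_drop)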
Making the Data Compatible
Moving ahead, we convert all string columns to numbers, because the variable selection techniques used here require every column to be numeric.
# check variable type
def var_type(df):
    # split columns into character (string) and numeric variables based on their dtype
    vars_list = df.dtypes
    char_vars = []
    num_vars = []
    for name, dtype in vars_list:
        if dtype == 'string':
            char_vars.append(name)
        else:
            num_vars.append(name)
    return char_vars, num_vars
char_vars, num_vars = var_type(df)
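As a quick sanity check (not part of the original code), you can print the two lists to see which columns will need indexing:
print('character variables:', char_vars)
print('numeric variables:', num_vars)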
We will use StringIndexer from pyspark.ml.feature.
What is StringIndexer?
A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0. The ordering behavior is controlled by the stringOrderType parameter, whose default value is 'frequencyDesc'.
Let's apply StringIndexer to the character variables identified by the var_type function, wrapping one indexer per string column in a small helper (named char_to_index here) so each column gets an '_index' counterpart.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
def char_to_index(df, char_vars):
    # one StringIndexer per character column; each output column gets an '_index' suffix
    indexers = [StringIndexer(inputCol=c, outputCol=c + '_index') for c in char_vars]
    pipeline = Pipeline(stages=indexers)
    char_labels = pipeline.fit(df)
    df = char_labels.transform(df)
    return df
df = char_to_index(df, char_vars)
# drop the original string columns, keeping only the indexed versions
df = df.drop(*char_vars)
Now the DataFrame is entirely numeric; the indexed columns carry an '_index' suffix, so let's rename them back to their original names.
def rename_columns(df, char_vars):
    # map each '<col>_index' column back to its original '<col>' name
    mapping = dict(zip([i + '_index' for i in char_vars], char_vars))
    df = df.select([col(c).alias(mapping.get(c, c)) for c in df.columns])
    return df
df = rename_columns(df, char_vars)
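To confirm that every column is now numeric, you can inspect the dtypes (a quick check, not in the original):
# every column should now be an int or double type
print(df.dtypes)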
Now let's move to the last step: assembling the features into a single vector (and scaling them) before we apply the actual techniques.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
def scaled_assemble_vectors(df, features_list, target_variable_name):
    # combine all input columns into a single vector, then standardize it
    assembler = VectorAssembler(inputCols=features_list, outputCol='features')
    scaler = StandardScaler(inputCol=assembler.getOutputCol(), outputCol='features2')
    stages = [assembler, scaler]
    # keep the raw assembled vector too, so its column metadata can be inspected later
    selectedCols = [target_variable_name, 'features', 'features2'] + features_list
    pipeline = Pipeline(stages=stages)
    scaleAssembleModel = pipeline.fit(df)
    df = scaleAssembleModel.transform(df).select(selectedCols)
    return df
features_list = df.columns
features_list.remove(target_variable_name)
df = scaled_assemble_vectors(df, features_list, target_variable_name)
To look at the metadata of the assembled vector (the column names and their positions inside the vector), use the code below.
df.schema['features'].metadata["ml_attr"]["attrs"]
Now we have the data in a format that is compatible with the variable selection techniques we will discuss.
Built-in Variable Selection Techniques: Without a Target
Principal Component Analysis
Principal Component Analysis (PCA) is an unsupervised learning algorithm used for dimensionality reduction in machine learning.
A detailed explanation of PCA is beyond the scope of this article; if you want to dive deeper into PCA, read this blog post: https://www.sartorius.com/en/knowledge/science-snippets/what-is-principal-component-analysis-pca-and-how-it-is-used-507186
from pyspark.ml.feature import PCA
# reduce the scaled feature vector to 3 principal components
no_of_components = 3
pca = PCA(k=no_of_components, inputCol="features2", outputCol="pcaFeatures")
model = pca.fit(df)
result = model.transform(df).select("pcaFeatures")
result.show(truncate=False)
Notice how we reduced the dimensionality to just 3 components. To get the loading scores for each variable, use the code below.
# matrix of loading scores: one row per input variable, one column per principal component
model.pc.toArray()
import pandas as pd
pd.options.display.float_format = '{:.5f}'.format
loading_scores = pd.DataFrame(model.pc.toArray(), columns=[f"PCA{i}" for i in range(1, len(model.pc.toArray()[0]) + 1)])
loading_scores['Variable'] = features_list
print(loading_scores)
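To help decide how many components to keep, you can also look at the proportion of variance each component explains (a short sketch; explainedVariance is a property of the fitted PCA model):
# proportion of variance explained by each of the k principal components
print(model.explainedVariance)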
Singular Value Decomposition
The Singular Value Decomposition, or SVD for short, is a matrix decomposition method that reduces a matrix to its constituent parts in order to make certain subsequent matrix calculations simpler.
from pyspark.mllib.linalg.distributed import RowMatrix
# convert the scaled ml vectors into an RDD of arrays for the mllib RowMatrix
df_svd_vector = df.rdd.map(lambda x: x['features2'].toArray())
mat = RowMatrix(df_svd_vector)
# compute the top 5 singular values/vectors and keep the U factor
svd = mat.computeSVD(5, computeU=True)
U = svd.U  # distributed RowMatrix of left singular vectors
s = svd.s  # singular values (local dense vector)
V = svd.V  # right singular vectors (local dense matrix)
collected = U.rows.collect()
print("U factor is:")
for vector in collected:
    print(vector)
print("Singular values are: %s" % s)
print("V factor is:\n%s" % V)
Here we end Part 1 of this tutorial series. In the next parts, we will cover supervised, model-based, and custom techniques for variable selection.
Follow me on LinkedIn: https://www.linkedin.com/in/shorya-sharma-b94161121/