Variable Selection In Pyspark: Part 3

shorya sharma
4 min read · Feb 25, 2023


This is part 3 of our variable selection series, where we will learn about a custom-built technique for variable selection: Information Value computed from the Weight of Evidence.

To look at part 1 of this series, use the link:

To look at part 2 of this series, use the link:

Information Value using Weight Of Evidence

The Weight of Evidence (WOE) measures the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers.

Information Value (IV) analysis is a data exploration technique that helps determine which columns in a data set have predictive power or influence on the value of a specified dependent variable.

Steps of Calculating WOE

  1. For a continuous variable, split the data into 10 parts (or fewer, depending on the distribution).
  2. Calculate the number of events and non-events in each group (bin).
  3. Calculate the % of events and % of non-events in each group.
  4. Calculate the WOE for each bin by taking the natural log of the ratio of the % of non-events to the % of events.
  5. Calculate the Information Value as the sum over all bins of (% of non-events − % of events) × WOE; a small worked sketch follows this list.
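To make the arithmetic concrete, here is a minimal pandas sketch of steps 2 to 5 on a hypothetical two-bin split. The bin counts are made up for illustration, and the column names deliberately mirror the full PySpark code further down:

import numpy as np
import pandas as pd

# Two hypothetical bins of 100 rows each; 'event' counts rows where the target is 1
bins = pd.DataFrame({'count': [100, 100], 'event': [20, 60]})
bins['non_event'] = bins['count'] - bins['event']

# Share of all events / non-events that falls in each bin
bins['dist_event'] = bins['event'] / bins['event'].sum()
bins['dist_nonevent'] = bins['non_event'] / bins['non_event'].sum()

# WOE per bin; flipping the ratio only flips the sign of WOE, the IV is unchanged
bins['woe'] = np.log(bins['dist_event'] / bins['dist_nonevent'])

# Information Value: sum of (distribution gap x WOE) across bins
iv = ((bins['dist_event'] - bins['dist_nonevent']) * bins['woe']).sum()
print(bins)
print('IV =', iv)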

Creating custom bins for a large dataset with hundreds of continuous variables is a tedious task; that's where monotonic binning comes into play.

Monotonic Binning using Spearman Correlation

A monotonic relationship is a relationship that does one of the following:

  1. As the value of one variable increases, so does the value of the other variable; or
  2. As the value of one variable increases, the value of the other variable decreases.

Spearman’s correlation measures the strength and direction of monotonic association between two variables.
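As a quick illustration with made-up numbers, a perfectly monotonic relationship yields a Spearman coefficient of +1 or −1 even when the relationship is not linear, which is exactly the property the binning loop in the code below checks for:

import scipy.stats as stats

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]   # increases monotonically with x, but not linearly
z = [9, 7, 4, 2, 1]     # decreases monotonically with x

print(stats.spearmanr(x, y))  # correlation = 1.0
print(stats.spearmanr(x, z))  # correlation = -1.0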

Code

Below is the code that can be used to calculate the IV for any dataset.

We have used the Melbourne Housing Market dataset, available here:

https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market

# Imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import QuantileDiscretizer
import pandas as pd
import numpy as np
import scipy.stats as stats

spark = SparkSession.builder.getOrCreate()

dataset = 2  # 1 = bank marketing data, 2 = Melbourne housing data

# Load the chosen dataset and binarize the target variable
if dataset == 1:
    filename = "/FileStore/tables/bank_full.csv"
    target_variable_name = "y"
    df = spark.read.csv(filename, header=True, inferSchema=True, sep=';')
    df = df.withColumn(target_variable_name, F.when(df[target_variable_name] == 'no', 0).otherwise(1))
else:
    filename = '/FileStore/tables/melb_data.csv'
    target_variable_name = "type"
    df = spark.read.csv(filename, header=True, inferSchema=True, sep=',')
    df = df.withColumn(target_variable_name, F.when(df[target_variable_name] == 'h', 0).otherwise(1))

df.show()
df.groupby(target_variable_name).count().show()
# Split columns into character and numeric variables based on their Spark types
def var_type(df):
    vars_list = df.dtypes
    char_vars = []
    num_vars = []
    for i in vars_list:
        if i[1] in ('string', 'date'):
            char_vars.append(i[0])
        else:
            num_vars.append(i[0])
    return char_vars, num_vars

char_vars, num_vars = var_type(df)

# Drop identifier-like / high-cardinality columns for the Melbourne dataset
if dataset != 1:
    char_vars.remove('Address')
    char_vars.remove('SellerG')
    char_vars.remove('Date')
    char_vars.remove('Suburb')

num_vars.remove(target_variable_name)
final_vars = char_vars + num_vars

custom_rho = 1   # Spearman correlation threshold for monotonic binning
max_bin = 20     # maximum number of bins to try for continuous variables

# Compute the WOE / IV table for one feature from its per-bin counts and event counts
def calculate_woe(count_df, event_df, min_value, max_value, feature):
    woe_df = pd.merge(left=count_df, right=event_df)
    woe_df['min_value'] = min_value
    woe_df['max_value'] = max_value
    woe_df['non_event'] = woe_df['count'] - woe_df['event']
    woe_df['event_rate'] = woe_df['event'] / woe_df['count']
    woe_df['nonevent_rate'] = woe_df['non_event'] / woe_df['count']
    woe_df['dist_event'] = woe_df['event'] / woe_df['event'].sum()
    woe_df['dist_nonevent'] = woe_df['non_event'] / woe_df['non_event'].sum()
    woe_df['woe'] = np.log(woe_df['dist_event'] / woe_df['dist_nonevent'])
    woe_df['iv'] = (woe_df['dist_event'] - woe_df['dist_nonevent']) * woe_df['woe']
    woe_df['varname'] = [feature] * len(woe_df)
    woe_df = woe_df[['varname', 'min_value', 'max_value', 'count', 'event', 'non_event',
                     'event_rate', 'nonevent_rate', 'dist_event', 'dist_nonevent', 'woe', 'iv']]
    woe_df = woe_df.replace([np.inf, -np.inf], 0)
    woe_df['iv'] = woe_df['iv'].sum()   # total IV of the variable, repeated on every row
    return woe_df

# Monotonic binning: keep reducing the number of quantile buckets until the
# bucket-level averages of the feature and the target are monotonically related
# (|Spearman rho| reaches custom_rho) or only one bucket is left
def mono_bin(temp_df, feature, target, n=max_bin):
    r = 0
    binned_df = temp_df
    while np.abs(r) < custom_rho and n > 1:
        try:
            qds = QuantileDiscretizer(numBuckets=n, inputCol=feature, outputCol="buckets", relativeError=0.01)
            binned_df = qds.fit(temp_df).transform(temp_df)
            corr_df = binned_df.groupby('buckets').agg({feature: 'avg', target: 'avg'}).toPandas()
            corr_df.columns = ['buckets', feature, target]
            r, p = stats.spearmanr(corr_df[feature], corr_df[target])
            n = n - 1
        except Exception as e:
            n = n - 1
    return binned_df

# Loop over all candidate variables, bin them, and stack the WOE / IV tables
def execute_woe(df, target):
    count = -1
    for feature in final_vars:
        if feature != target:
            count = count + 1
            temp_df = df.select([feature, target])
            if feature in num_vars:
                # Continuous variable: apply monotonic binning first
                temp_df = mono_bin(temp_df, feature, target, n=max_bin)
                grouped = temp_df.groupby('buckets')
            else:
                # Categorical variable: each category is its own bin
                grouped = temp_df.groupby(feature)
            count_df = grouped.agg(F.count(target).alias('count')).toPandas()
            event_df = grouped.agg(F.sum(target).alias('event')).toPandas()
            if feature in num_vars:
                min_value = grouped.agg(F.min(feature).alias('min')).toPandas()['min']
                max_value = grouped.agg(F.max(feature).alias('max')).toPandas()['max']
            else:
                min_value = count_df[feature]
                max_value = count_df[feature]
            temp_woe_df = calculate_woe(count_df, event_df, min_value, max_value, feature)
            if count == 0:
                final_woe_df = temp_woe_df
            else:
                final_woe_df = pd.concat([final_woe_df, temp_woe_df], ignore_index=True)
    # One IV value per variable (the per-bin 'iv' column already holds the variable total)
    iv = pd.DataFrame({'IV': final_woe_df.groupby('varname').iv.max()})
    iv = iv.reset_index()
    return final_woe_df, iv

output, iv = execute_woe(df, target_variable_name)
print(output)
print(iv)
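With the iv DataFrame returned above (one row per variable), shortlisting candidates is just a sort. Below is a minimal sketch; the cut-off of 10 variables is an arbitrary choice for illustration, not something prescribed by the method:

# Rank variables by Information Value, highest first
iv_ranked = iv.sort_values('IV', ascending=False).reset_index(drop=True)
print(iv_ranked)

# Keep the top 10 variables as the shortlist for further modelling (arbitrary cut-off)
selected_vars = iv_ranked['varname'].head(10).tolist()
print(selected_vars)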

Here we conclude part 3 of this tutorial series.

Follow me on LinkedIn:

LinkedIn: https://www.linkedin.com/in/shorya-sharma-b94161121
