Variable Selection In Pyspark: Part 3
This is Part 3 of our variable selection series, where we will learn about a custom-built technique for variable selection: Information Value using Weight of Evidence.
To read Part 1 of this series, use this link:
To read Part 2 of this series, use this link:
Information Value using Weight Of Evidence
The weight of evidence (WOE) tells us the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers.
Information Value analysis is a data exploration technique that helps determine which columns in a data set have predictive power or influence on the value of a specified dependent variable.
Steps of Calculating WOE
- For a continuous variable, split the data into 10 parts (or fewer, depending on the distribution).
- Calculate the number of events and non-events in each group (bin).
- Calculate the % of total events and % of total non-events that fall in each group.
- Calculate the WOE for each bin as the natural log of the ratio of those two percentages (% of events divided by % of non-events, which is the convention used in the code below; some credit-scoring texts use the inverse ratio, but the resulting IV is the same). A short worked example follows this list.
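To make the calculation concrete, here is a minimal pandas sketch for a single, hypothetical two-bin variable; the counts are invented purely for illustration and are not taken from the datasets used later.

import numpy as np
import pandas as pd

# Hypothetical bin counts for one variable (illustration only)
bins = pd.DataFrame({
    'bin': ['bin_1', 'bin_2'],
    'event': [20, 80],        # rows where the target = 1
    'non_event': [180, 120],  # rows where the target = 0
})

# Share of all events / non-events that fall in each bin
bins['dist_event'] = bins['event'] / bins['event'].sum()
bins['dist_nonevent'] = bins['non_event'] / bins['non_event'].sum()

# WOE per bin and the total Information Value of the variable
bins['woe'] = np.log(bins['dist_event'] / bins['dist_nonevent'])
iv = ((bins['dist_event'] - bins['dist_nonevent']) * bins['woe']).sum()
print(bins[['bin', 'woe']])
print('IV:', round(iv, 4))  # roughly 0.72 for these made-up counts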
Creating custom bins for a large dataset with hundreds of continuous variables is a tedious task; that's where monotonic binning comes into play.
Monotonic Binning using Spearman Correlation
A monotonic relationship is a relationship that does one of the following:
- As the value of one variable increases, so does the value of the other variable; or
- As the value of one variable increases, the value of the other variable decreases.
Spearman’s correlation measures the strength and direction of monotonic association between two variables.
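As a quick, stand-alone illustration (separate from the main code further down), Spearman's rho equals 1 for a perfectly monotonic but non-linear pair of series; the numbers here are invented:

from scipy import stats

# Invented values: y always increases when x increases, but not linearly
x = [1, 2, 3, 4, 5]
y = [2, 4, 8, 16, 32]

rho, p_value = stats.spearmanr(x, y)
print(rho)  # 1.0 -> perfectly monotonic increasing relationship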
Code
Below is the code that can be used to calculate the IV for any dataset.
We have used the Melbourne Housing Market dataset, available here:
https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.ml.feature import VectorAssembler
import pandas as pd
import numpy as np
from scipy import stats  # scipy.stats.stats is deprecated; use the top-level stats module

spark = SparkSession.builder.getOrCreate()

dataset = 2  # 1 = bank marketing data, 2 = Melbourne housing data

# Load the chosen dataset and binarise the target variable
if dataset == 1:
    filename = "/FileStore/tables/bank_full.csv"
    target_variable_name = "y"
    df = spark.read.csv(filename, header=True, inferSchema=True, sep=';')
    df = df.withColumn(target_variable_name, F.when(df[target_variable_name] == 'no', 0).otherwise(1))
else:
    filename = '/FileStore/tables/melb_data.csv'
    target_variable_name = "type"
    df = spark.read.csv(filename, header=True, inferSchema=True, sep=',')
    df = df.withColumn(target_variable_name, F.when(df[target_variable_name] == 'h', 0).otherwise(1))
df.show()
df.groupby(target_variable_name).count().show()
def var_type(df):
    vars_list = df.dtypes
    char_vars = []
    num_vars = []
    for i in vars_list:
        if i[1] in ('string', 'date'):
            char_vars.append(i[0])
        else:
            num_vars.append(i[0])
    return char_vars, num_vars
char_vars, num_vars = var_type(df)

# Drop high-cardinality text and date columns from the Melbourne data
if dataset != 1:
    char_vars.remove('Address')
    char_vars.remove('SellerG')
    char_vars.remove('Date')
    char_vars.remove('Suburb')

# The (binarised) target should not be scored against itself
num_vars.remove(target_variable_name)

final_vars = char_vars + num_vars

custom_rho = 1   # required |Spearman rho| for a binning to count as monotonic
max_bin = 20     # maximum number of quantile buckets to try per numeric variable
def calculate_woe(count_df, event_df, min_value, max_value, feature):
    # Combine bin counts and event counts on the common group column
    woe_df = pd.merge(left=count_df, right=event_df)
    woe_df['min_value'] = min_value
    woe_df['max_value'] = max_value
    woe_df['non_event'] = woe_df['count'] - woe_df['event']
    woe_df['event_rate'] = woe_df['event'] / woe_df['count']
    woe_df['nonevent_rate'] = woe_df['non_event'] / woe_df['count']
    woe_df['dist_event'] = woe_df['event'] / woe_df['event'].sum()
    woe_df['dist_nonevent'] = woe_df['non_event'] / woe_df['non_event'].sum()
    woe_df['woe'] = np.log(woe_df['dist_event'] / woe_df['dist_nonevent'])
    woe_df['iv'] = (woe_df['dist_event'] - woe_df['dist_nonevent']) * woe_df['woe']
    woe_df['varname'] = [feature] * len(woe_df)
    woe_df = woe_df[['varname', 'min_value', 'max_value', 'count', 'event', 'non_event', 'event_rate', 'nonevent_rate', 'dist_event', 'dist_nonevent', 'woe', 'iv']]
    # Bins with zero events or zero non-events produce +/- infinity; treat those as 0
    woe_df = woe_df.replace([np.inf, -np.inf], 0)
    # Report the total IV of the variable on every row
    woe_df['iv'] = woe_df['iv'].sum()
    return woe_df
def mono_bin(temp_df, feature, target, n=max_bin):
    # Reduce the number of buckets until the bucket-level averages of the feature
    # and the target are monotonically related (|Spearman rho| >= custom_rho)
    r = 0
    while np.abs(r) < custom_rho and n > 1:
        try:
            # Drop the buckets from the previous iteration, if any, before re-binning
            if 'buckets' in temp_df.columns:
                temp_df = temp_df.drop('buckets')
            qds = QuantileDiscretizer(numBuckets=n, inputCol=feature, outputCol="buckets", relativeError=0.01)
            bucketizer = qds.fit(temp_df)
            temp_df = bucketizer.transform(temp_df)
            corr_df = temp_df.groupby('buckets').agg({feature: 'avg', target: 'avg'}).toPandas()
            corr_df.columns = ['buckets', feature, target]
            r, p = stats.spearmanr(corr_df[feature], corr_df[target])
            n = n - 1
        except Exception as e:
            n = n - 1
    return temp_df
def execute_woe(df, target):
    count = -1
    for feature in final_vars:
        if feature != target:
            count = count + 1
            temp_df = df.select([feature, target])
            if feature in num_vars:
                # Numeric features are bucketed monotonically before the WOE calculation
                temp_df = mono_bin(temp_df, feature, target, n=max_bin)
                grouped = temp_df.groupby('buckets')
            else:
                # Categorical features are grouped on their raw values
                grouped = temp_df.groupby(feature)
            count_df = grouped.agg(F.count(target).alias('count')).toPandas()
            event_df = grouped.agg(F.sum(target).alias('event')).toPandas()
            if feature in num_vars:
                min_value = grouped.agg(F.min(feature).alias('min')).toPandas()['min']
                max_value = grouped.agg(F.max(feature).alias('max')).toPandas()['max']
            else:
                min_value = count_df[feature]
                max_value = count_df[feature]
            temp_woe_df = calculate_woe(count_df, event_df, min_value, max_value, feature)
            if count == 0:
                final_woe_df = temp_woe_df
            else:
                # DataFrame.append is deprecated in newer pandas; use pd.concat instead
                final_woe_df = pd.concat([final_woe_df, temp_woe_df], ignore_index=True)
    iv = pd.DataFrame({'IV': final_woe_df.groupby('varname').iv.max()})
    iv = iv.reset_index()
    return final_woe_df, iv
output, iv = execute_woe(df, target_variable_name)
print(output)
print(iv)
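With the IV table in hand, you can shortlist variables by applying a cutoff. The snippet below is only a sketch: the 0.1 threshold is an assumed rule-of-thumb value, not something fixed by the method, so adjust it for your own problem.

# Keep variables whose IV exceeds a chosen cutoff (0.1 is an assumed rule of thumb)
iv_threshold = 0.1
selected_vars = iv[iv['IV'] > iv_threshold]['varname'].tolist()
print(selected_vars)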
Here we conclude Part 3 of this tutorial series.
Follow me on LinkedIn:
LinkedIn: https://www.linkedin.com/in/shorya-sharma-b94161121