Variable Selection In Pyspark: Part 3

shorya sharma
4 min read · Feb 25, 2023


This is part 3 of our variable selection series, where we will learn about a custom-built technique for variable selection: Information Value computed from the Weight of Evidence.

To look at part 1 of this series, use the link:

To look at part 2 of this series, use the link:

Information Value using Weight Of Evidence

The Weight of Evidence (WOE) measures the predictive power of an independent variable in relation to the dependent variable. Since it evolved from the credit scoring world, it is generally described as a measure of the separation of good and bad customers.

Information Value (IV) analysis is a data exploration technique that helps determine which columns in a data set have predictive power or influence on the value of a specified dependent variable.

Steps of Calculating WOE

  1. For a continuous variable, split the data into 10 parts (or fewer, depending on the distribution).
  2. Calculate the number of events and non-events in each group (bin).
  3. Calculate the % of events and % of non-events in each group.
  4. Calculate the WOE for each bin by taking the natural log of the ratio of the % of non-events to the % of events.
  5. Calculate the Information Value as the sum over all bins of (% of non-events − % of events) × WOE; a small worked sketch follows this list.
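To make the arithmetic concrete, here is a minimal pandas sketch of steps 2 to 5 on a hypothetical two-bin split. The bin counts are made up for illustration, and the column names deliberately mirror the full PySpark code further down:

import numpy as np
import pandas as pd

# Two hypothetical bins of 100 rows each; 'event' counts rows where the target is 1
bins = pd.DataFrame({'count': [100, 100], 'event': [20, 60]})
bins['non_event'] = bins['count'] - bins['event']

# Share of all events / non-events that falls in each bin
bins['dist_event'] = bins['event'] / bins['event'].sum()
bins['dist_nonevent'] = bins['non_event'] / bins['non_event'].sum()

# WOE per bin; flipping the ratio only flips the sign of WOE, the IV is unchanged
bins['woe'] = np.log(bins['dist_event'] / bins['dist_nonevent'])

# Information Value: sum of (distribution gap x WOE) across bins
iv = ((bins['dist_event'] - bins['dist_nonevent']) * bins['woe']).sum()
print(bins)
print('IV =', iv)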

Creating custom bins for a large dataset with hundreds of continuous variables is a tedious task; that's where monotonic binning comes into play.

Monotonic Binning using Spearman Correlation

A monotonic relationship is a relationship that does one of the following:

  1. As the value of one variable increases, so does the value of the other variable; or
  2. As the value of one variable increases, the value of the other variable decreases.

Spearman’s correlation measures the strength and direction of monotonic association between two variables.
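As a quick illustration with made-up numbers, a perfectly monotonic relationship yields a Spearman coefficient of +1 or −1 even when the relationship is not linear, which is exactly the property the binning loop in the code below checks for:

import scipy.stats as stats

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]   # increases monotonically with x, but not linearly
z = [9, 7, 4, 2, 1]     # decreases monotonically with x

print(stats.spearmanr(x, y))  # correlation = 1.0
print(stats.spearmanr(x, z))  # correlation = -1.0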

Code

Below is the code that can be used to calculate the IV for any dataset.

We have used the Melbourne Housing Market dataset, available here:

https://www.kaggle.com/datasets/anthonypino/melbourne-housing-market

# Imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import QuantileDiscretizer
import pandas as pd
import numpy as np
import scipy.stats as stats

spark = SparkSession.builder.getOrCreate()

dataset = 2  # 1 = bank marketing data, 2 = Melbourne housing data

# Load the chosen dataset and binarize the target variable
if dataset == 1:
    filename = "/FileStore/tables/bank_full.csv"
    target_variable_name = "y"
    df = spark.read.csv(filename, header=True, inferSchema=True, sep=';')
    df = df.withColumn(target_variable_name, F.when(df[target_variable_name] == 'no', 0).otherwise(1))
else:
    filename = '/FileStore/tables/melb_data.csv'
    target_variable_name = "type"
    df = spark.read.csv(filename, header=True, inferSchema=True, sep=',')
    df = df.withColumn(target_variable_name, F.when(df[target_variable_name] == 'h', 0).otherwise(1))

df.show()
df.groupby(target_variable_name).count().show()
# Split columns into character and numeric variables based on their Spark types
def var_type(df):
    vars_list = df.dtypes
    char_vars = []
    num_vars = []
    for i in vars_list:
        if i[1] in ('string', 'date'):
            char_vars.append(i[0])
        else:
            num_vars.append(i[0])
    return char_vars, num_vars

char_vars, num_vars = var_type(df)

# Drop identifier-like / high-cardinality columns for the Melbourne dataset
if dataset != 1:
    char_vars.remove('Address')
    char_vars.remove('SellerG')
    char_vars.remove('Date')
    char_vars.remove('Suburb')

num_vars.remove(target_variable_name)
final_vars = char_vars + num_vars

custom_rho = 1   # Spearman correlation threshold for monotonic binning
max_bin = 20     # maximum number of bins to try for continuous variables

# Compute the WOE / IV table for one feature from its per-bin counts and event counts
def calculate_woe(count_df, event_df, min_value, max_value, feature):
    woe_df = pd.merge(left=count_df, right=event_df)
    woe_df['min_value'] = min_value
    woe_df['max_value'] = max_value
    woe_df['non_event'] = woe_df['count'] - woe_df['event']
    woe_df['event_rate'] = woe_df['event'] / woe_df['count']
    woe_df['nonevent_rate'] = woe_df['non_event'] / woe_df['count']
    woe_df['dist_event'] = woe_df['event'] / woe_df['event'].sum()
    woe_df['dist_nonevent'] = woe_df['non_event'] / woe_df['non_event'].sum()
    woe_df['woe'] = np.log(woe_df['dist_event'] / woe_df['dist_nonevent'])
    woe_df['iv'] = (woe_df['dist_event'] - woe_df['dist_nonevent']) * woe_df['woe']
    woe_df['varname'] = [feature] * len(woe_df)
    woe_df = woe_df[['varname', 'min_value', 'max_value', 'count', 'event', 'non_event',
                     'event_rate', 'nonevent_rate', 'dist_event', 'dist_nonevent', 'woe', 'iv']]
    woe_df = woe_df.replace([np.inf, -np.inf], 0)
    woe_df['iv'] = woe_df['iv'].sum()   # total IV of the variable, repeated on every row
    return woe_df

# Monotonic binning: keep reducing the number of quantile buckets until the
# bucket-level averages of the feature and the target are monotonically related
# (|Spearman rho| reaches custom_rho) or only one bucket is left
def mono_bin(temp_df, feature, target, n=max_bin):
    r = 0
    binned_df = temp_df
    while np.abs(r) < custom_rho and n > 1:
        try:
            qds = QuantileDiscretizer(numBuckets=n, inputCol=feature, outputCol="buckets", relativeError=0.01)
            binned_df = qds.fit(temp_df).transform(temp_df)
            corr_df = binned_df.groupby('buckets').agg({feature: 'avg', target: 'avg'}).toPandas()
            corr_df.columns = ['buckets', feature, target]
            r, p = stats.spearmanr(corr_df[feature], corr_df[target])
            n = n - 1
        except Exception as e:
            n = n - 1
    return binned_df

# Loop over all candidate variables, bin them, and stack the WOE / IV tables
def execute_woe(df, target):
    count = -1
    for feature in final_vars:
        if feature != target:
            count = count + 1
            temp_df = df.select([feature, target])
            if feature in num_vars:
                # Continuous variable: apply monotonic binning first
                temp_df = mono_bin(temp_df, feature, target, n=max_bin)
                grouped = temp_df.groupby('buckets')
            else:
                # Categorical variable: each category is its own bin
                grouped = temp_df.groupby(feature)
            count_df = grouped.agg(F.count(target).alias('count')).toPandas()
            event_df = grouped.agg(F.sum(target).alias('event')).toPandas()
            if feature in num_vars:
                min_value = grouped.agg(F.min(feature).alias('min')).toPandas()['min']
                max_value = grouped.agg(F.max(feature).alias('max')).toPandas()['max']
            else:
                min_value = count_df[feature]
                max_value = count_df[feature]
            temp_woe_df = calculate_woe(count_df, event_df, min_value, max_value, feature)
            if count == 0:
                final_woe_df = temp_woe_df
            else:
                final_woe_df = pd.concat([final_woe_df, temp_woe_df], ignore_index=True)
    # One IV value per variable (the per-bin 'iv' column already holds the variable total)
    iv = pd.DataFrame({'IV': final_woe_df.groupby('varname').iv.max()})
    iv = iv.reset_index()
    return final_woe_df, iv

output, iv = execute_woe(df, target_variable_name)
print(output)
print(iv)
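With the iv DataFrame returned above (one row per variable), shortlisting candidates is just a sort. Below is a minimal sketch; the cut-off of 10 variables is an arbitrary choice for illustration, not something prescribed by the method:

# Rank variables by Information Value, highest first
iv_ranked = iv.sort_values('IV', ascending=False).reset_index(drop=True)
print(iv_ranked)

# Keep the top 10 variables as the shortlist for further modelling (arbitrary cut-off)
selected_vars = iv_ranked['varname'].head(10).tolist()
print(selected_vars)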

Here we conclude part 3 of this tutorial series.

Follow me on LinkedIn:

LinkedIn: https://www.linkedin.com/in/shorya-sharma-b94161121
