Credit card fraud detection with Snap ML and scikit-learn
What is scikit-learn?
scikit-learn, also known as sklearn, is a popular open-source machine learning library for Python. It provides a wide range of tools and algorithms for machine learning tasks, including classification, regression, clustering, dimensionality reduction, and model selection.
What is Snap ML?
Snap ML is a machine learning library developed by IBM Research that focuses on providing scalable and efficient solutions for training and deploying machine learning models. It is designed to leverage parallel computing architectures, such as multi-core CPUs and GPUs, to accelerate the training process and handle large datasets.
Snap ML accelerates generalized linear models as well as tree-based models such as decision trees, random forests, and gradient-boosted trees. This post uses its decision tree classifier.
Snap ML provides an API that is compatible with popular machine learning frameworks like scikit-learn, making it easy to integrate into existing machine learning pipelines. It also offers GPU acceleration, which can significantly speed up the training process on compatible hardware.
What is Credit Card Fraud?
Credit card fraud refers to unauthorized or fraudulent use of someone else’s credit card information to make purchases or conduct transactions without their knowledge or consent. It is a form of identity theft and a prevalent type of financial fraud.
Credit card fraud can occur through various methods, including:
- Stolen or Lost Cards: If a credit card is stolen or lost, the thief can use it to make purchases before it is reported as missing.
- Skimming: This involves using a device to illegally collect credit card information, usually by tampering with card readers at ATMs, gas pumps, or point-of-sale terminals. The stolen data is then used to create counterfeit cards or make fraudulent online transactions.
- Phishing and Online Scams: Fraudsters may trick individuals into providing their credit card information through fake websites, emails, or phone calls, posing as legitimate organizations or financial institutions.
- Data Breaches: When a company’s database containing credit card information is hacked or compromised, criminals can gain access to large amounts of credit card data, which can be used for fraudulent purposes.
- Card-Not-Present Fraud: This occurs when credit card information is used for online or phone purchases where the physical card is not present. Fraudsters may obtain card details through various means and use them to make unauthorized transactions.
Let’s begin
Suppose you work at a financial institution and are asked to build a model that predicts whether a credit card transaction is fraudulent. This is a binary classification problem: a transaction is labeled positive (1) if it is fraud and negative (0) otherwise.
In this blog we will write our code in a Google Colab notebook, because most of the modules we use are already installed there. We only need to install Snap ML:
!pip install snapml
Download the dataset from the below link:
# Import the libraries we need to use in this lab
import warnings
warnings.filterwarnings('ignore')
# from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import roc_auc_score
import time
import gc, sys
# read the input data
raw_data = pd.read_csv('creditcard.csv')
raw_data = raw_data.dropna()
print("There are " + str(len(raw_data)) + " observations in the credit card fraud dataset.")
print("There are " + str(len(raw_data.columns)) + " variables in the dataset.")
# display the first rows in the dataset
raw_data.head()
As a data scientist, you have access to transactions that occurred over a certain period of time. Most of these transactions are legitimate and only a small fraction are fraudulent, so the dataset is typically highly unbalanced.
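One practical consequence of this imbalance is that plain accuracy can look excellent for a model that never flags fraud. The sketch below uses made-up numbers (1,000 transactions, 2 of them fraudulent) purely for illustration; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical labels: 1,000 transactions, 2 of them fraudulent (illustrative numbers only).
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1

# A "model" that labels every transaction as legitimate.
y_pred = np.zeros_like(y_true)

# Accuracy: share of transactions labeled correctly.
accuracy = (y_pred == y_true).mean()

# Recall: share of the actual fraud cases that were caught.
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.998
print(f"recall   = {recall:.3f}")    # recall   = 0.000
```

This is one reason the post evaluates the models with ROC-AUC later on rather than with accuracy.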
In practice, a financial institution may have access to a much larger dataset of transactions. To simulate such a case, we will inflate the original one 10 times.
n_replicas = 10
# inflate the original dataset
big_raw_data = pd.DataFrame(np.repeat(raw_data.values, n_replicas, axis=0), columns=raw_data.columns)
print("There are " + str(len(big_raw_data)) + " observations in the inflated credit card fraud dataset.")
print("There are " + str(len(big_raw_data.columns)) + " variables in the dataset.")
# display first rows in the new dataset
big_raw_data.head()
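To see what the `np.repeat(..., axis=0)` call above does, here is a minimal sketch on a toy DataFrame. The column values and the name `toy` are invented for illustration; only the replication pattern mirrors the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_data: three transactions, one of them fraudulent.
toy = pd.DataFrame({"Amount": [10.0, 25.5, 99.0], "Class": [0, 0, 1]})

n_replicas = 3
# Each row is repeated n_replicas times in place (row 0, row 0, row 0, row 1, ...).
big_toy = pd.DataFrame(np.repeat(toy.values, n_replicas, axis=0), columns=toy.columns)

print(big_toy.shape)                                  # (9, 2)
print(toy["Class"].mean(), big_toy["Class"].mean())   # the fraud ratio is unchanged
print(int(big_toy.duplicated().sum()))                # 6 rows are exact copies of an earlier row
```

Two things are worth noting. Going through `.values` converts every column to a common dtype, so the integer `Class` column comes back as float. And because the copies are exact, a later random split will usually place copies of the same transaction in both the training and the test set, so the inflation is best read as a way to stress training time rather than to improve evaluation.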
# count the rows in each class; value_counts() keeps labels and counts aligned
class_counts = big_raw_data.Class.value_counts()
labels = class_counts.index.tolist()
sizes = class_counts.values
# plot the class value counts
fig, ax = plt.subplots()
ax.pie(sizes,labels=labels, autopct='%1.3f%%')
ax.set_title('Target Variable Value Counts')
plt.show()
In the dataset, every row corresponds to a credit card transaction and, as shown earlier, contains 31 variables. One of them, “Class,” is the target variable.
To preserve confidentiality, most of the original features have been transformed with Principal Component Analysis (PCA) and renamed V1, V2, …, V28, so all of their values are numerical.
Preprocessing such as scaling or normalization typically helps linear models converge faster during training. We standardize the features by removing the mean and scaling to unit variance.
big_raw_data.iloc[:, 1:30] = StandardScaler().fit_transform(big_raw_data.iloc[:, 1:30])
data_matrix = big_raw_data.values
# X: feature matrix (for this analysis, we exclude the Time variable from the dataset)
X = data_matrix[:, 1:30]
# y: labels vector
y = data_matrix[:, 30]
# data normalization
X = normalize(X, norm="l1")
# print the shape of the features matrix and the labels vector
print('X.shape=', X.shape, 'y.shape=', y.shape)
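The two preprocessing steps above act along different axes: `StandardScaler` works per feature (column), while `normalize(..., norm="l1")` works per sample (row). A minimal sketch on a made-up 4×2 matrix, assuming nothing about the real data beyond the two calls used above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Toy feature matrix: 4 samples, 2 features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

# Column-wise: subtract each feature's mean, divide by its standard deviation.
X_std = StandardScaler().fit_transform(X_toy)
print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature

# Row-wise: divide each sample by the sum of the absolute values of its entries.
X_l1 = normalize(X_std, norm="l1")
print(np.abs(X_l1).sum(axis=1))  # approximately 1 for every sample
```

After the row-wise L1 step, the columns are in general no longer exactly zero-mean and unit-variance, so the order of the two steps matters.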
Dataset Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print('X_train.shape=', X_train.shape, 'y_train.shape=', y_train.shape)
print('X_test.shape=', X_test.shape, 'y_test.shape=', y_test.shape)
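The `stratify=y` argument in the split above keeps the fraud ratio the same in both subsets, which matters when positives are rare. A small sketch with invented proportions (2% positives) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 1,000 samples with 2% positives (illustrative proportions).
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Fraction of positives overall, in the training set, and in the test set.
print(y_toy.mean(), y_tr.mean(), y_te.mean())  # all close to 0.02
```

Without `stratify`, a random 30% sample of so few positives could end up with noticeably more or fewer fraud cases than the training set.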
Scikit-Learn
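# weight each sample inversely to its class frequency to offset the class imbalance
w_train = compute_sample_weight('balanced', y_train)
# train a Decision Tree Classifier using scikit-learn
from sklearn.tree import DecisionTreeClassifier
sklearn_dt = DecisionTreeClassifier(max_depth=4, random_state=35)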
t0 = time.time()
sklearn_dt.fit(X_train, y_train, sample_weight=w_train)
sklearn_time = time.time()-t0
print("[Scikit-Learn] Training time (s): {0:.5f}".format(sklearn_time))
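The `compute_sample_weight('balanced', y_train)` call above gives each sample a weight of `n_samples / (n_classes * count of that sample's class)`, so both classes contribute the same total weight during training. The sketch below checks this by hand on ten made-up labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels: 8 legitimate (0) and 2 fraudulent (1) transactions.
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

w_sklearn = compute_sample_weight("balanced", y_toy)

# Same weights by hand: n_samples / (n_classes * count of the sample's class).
counts = np.bincount(y_toy)                          # [8, 2]
class_weight = len(y_toy) / (len(counts) * counts)   # 0.625 for class 0, 2.5 for class 1
w_manual = class_weight[y_toy]

print(np.allclose(w_sklearn, w_manual))                      # True
print(w_manual[y_toy == 0].sum(), w_manual[y_toy == 1].sum())  # 5.0 5.0
```

Each fraud sample here counts four times as much as a legitimate one, which is exactly the inverse of the 8:2 class ratio.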
SnapML
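# Snap ML's DecisionTreeClassifier follows the scikit-learn API; note that this import rebinds the name
from snapml import DecisionTreeClassifier
# n_jobs=4 trains with 4 CPU threads
snapml_dt = DecisionTreeClassifier(max_depth=4, random_state=45, n_jobs=4)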
# train a Decision Tree Classifier model using Snap ML
t0 = time.time()
snapml_dt.fit(X_train, y_train, sample_weight=w_train)
snapml_time = time.time()-t0
print("[Snap ML] Training time (s): {0:.5f}".format(snapml_time))
# Snap ML vs Scikit-Learn training speedup
training_speedup = sklearn_time/snapml_time
print('[Decision Tree Classifier] Snap ML vs. Scikit-Learn speedup : {0:.2f}x '.format(training_speedup))
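A single `time.time()` measurement like the ones above can vary from run to run, especially on a shared Colab machine. One possible way to make the comparison steadier is to repeat each fit a few times and keep the shortest run. The helper `best_of` below is a hypothetical name introduced for this sketch, and the workload is a stand-in rather than an actual model fit.

```python
import time

def best_of(fn, repeats=5):
    """Run fn() `repeats` times; return (shortest elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        t0 = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - t0)
    return best, result

# Stand-in workload; in the notebook this could be, for example,
# lambda: sklearn_dt.fit(X_train, y_train, sample_weight=w_train)
elapsed, total = best_of(lambda: sum(i * i for i in range(100_000)))
print(f"best of 5: {elapsed:.5f} s")
```

Computing the speedup from two such best-of timings is a judgment call; a median over the runs would be an equally reasonable choice.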
# predicted probability of the fraud class (column 1) for each test transaction
sklearn_pred = sklearn_dt.predict_proba(X_test)[:,1]
snapml_pred = snapml_dt.predict_proba(X_test)[:,1]
sklearn_roc_auc = roc_auc_score(y_test, sklearn_pred)
print('[Scikit-Learn] ROC-AUC score : {0:.3f}'.format(sklearn_roc_auc))
snapml_roc_auc = roc_auc_score(y_test, snapml_pred)
print('[Snap ML] ROC-AUC score : {0:.3f}'.format(snapml_roc_auc))
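ROC-AUC can be read as the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one, with ties counted as one half. The sketch below verifies this against `roc_auc_score` on six made-up scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted fraud probabilities (made-up values).
y_toy = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.90])

# Compare every (fraud, legitimate) pair: 1 if the fraud scores higher, 0.5 on ties.
pos = scores[y_toy == 1]
neg = scores[y_toy == 0]
diff = pos[:, None] - neg[None, :]
auc_pairs = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc_pairs)                      # 0.875 (7 of the 8 pairs are ranked correctly)
print(roc_auc_score(y_toy, scores))   # 0.875
```

Because a tree with `max_depth=4` has at most 16 leaves, it can output at most 16 distinct probabilities, so ties between transactions are common in the notebook's predictions.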
Visualization of the scikit-learn tree model
To see which rules the model learned, we can plot the tree with the code below.
!pip install graphviz
from sklearn import tree
import graphviz
dot_data = tree.export_graphviz(sklearn_dt,
                                feature_names=list(big_raw_data.columns[1:30]),
                                class_names=["No Fraud", "Fraud"],
                                filled=True)
graph = graphviz.Source(dot_data, format="png")
graph.render("decision_tree_graphviz")
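If Graphviz is not available, scikit-learn's `export_text` offers a plain-text view of the same kind of rules. The sketch below is an alternative to the plot above, not a replacement for it. It trains a small tree on synthetic data, and the feature names `V1` to `V3` are hypothetical stand-ins for the notebook's columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic problem: the label depends only on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

toy_dt = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# feature_names are hypothetical stand-ins for the notebook's V1, V2, ... columns.
rules = export_text(toy_dt, feature_names=["V1", "V2", "V3"])
print(rules)
```

In the notebook, the same call could be applied to `sklearn_dt` with `feature_names=list(big_raw_data.columns[1:30])`.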