Payment Risk and Fraud: Part 3 — A Decision Tree Case Study for Enhanced Business Performance

shorya sharma
4 min read · Nov 28, 2023


This is the final part of our series. Here we work through a case study and develop a decision tree to reduce fraud.

Case Study: Unveiling Subscription Success

Introduction:

In this case study, we delve into the journey of a subscription service launched two years ago at our thriving social media company. As Vice President of Analytics, we have been tasked by our CEO with a critical mission: evaluate the performance of the subscription model and provide actionable insights for future growth.

We need to deliver not only an overall performance summary but also solutions and strategies to make the product even better. Our report should cover the business performance, what we are doing well, where we are falling short, and how we can improve.

Let’s Create our tables

Use any SQL editor online to run the following commands.

-- create Revenue and Cost tables
create table Revenue (
    Year int,
    Revenue_type varchar(50),
    Yearly_revenue int
);

create table Cost (
    Year int,
    Cost_type varchar(50),
    Yearly_cost int
);

-- add values into the Revenue table
insert into Revenue values
(2022,'advertisement',16800000),
(2022,'subscription',8960000),
(2023,'advertisement',17304000),
(2023,'subscription',8998080);

-- add values into the Cost table
insert into Cost values
(2022,'marketing',10500000),
(2022,'fraud',8500000),
(2022,'others',1600000),
(2023,'marketing',10800000),
(2023,'fraud',7000000),
(2023,'others',1500000);

Questions to answer:

- YoY net revenue performance: Net Revenue = Total Revenue - Total Cost

- The key driver(s)

Let’s Perform the analysis

-- calculate total Revenue year by year  
select Year, sum(Yearly_revenue) as total_revenue
from Revenue
group by Year order by Year;

-- calculate total Cost year by year
select Year, sum(Yearly_cost) as total_cost
from Cost
group by Year order by Year;

-- calculate total net revenue year by year
select
    r.Year,
    r.total_revenue,
    c.total_cost,
    r.total_revenue - c.total_cost as net_revenue
from (select Year, sum(Yearly_revenue) as total_revenue
      from Revenue
      group by Year) as r
join (select Year, sum(Yearly_cost) as total_cost
      from Cost
      group by Year) as c
  on r.Year = c.Year
order by r.Year;

-- calculate cost segment percentage
-- (multiply by 100.0 to avoid integer truncation)
select
    r.Year,
    r.Cost_type,
    100.0 * r.Yearly_cost / c.total_cost as cost_seg_percentage
from Cost as r
join (select Year, sum(Yearly_cost) as total_cost
      from Cost
      group by Year) as c
  on r.Year = c.Year
order by r.Year, r.Cost_type;

From the last query we can see that cost reduction, in particular the drop in fraud cost, is what drives the net revenue increase.
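As a quick cross-check of the SQL above, the same figures can be reproduced in plain Python using the values inserted into the Revenue and Cost tables. Net revenue grows from 5,160,000 in 2022 to 7,002,080 in 2023, roughly a 36% increase:

```python
# Cross-check the SQL results using the same figures that were
# inserted into the Revenue and Cost tables above.
revenue = {
    2022: {"advertisement": 16_800_000, "subscription": 8_960_000},
    2023: {"advertisement": 17_304_000, "subscription": 8_998_080},
}
cost = {
    2022: {"marketing": 10_500_000, "fraud": 8_500_000, "others": 1_600_000},
    2023: {"marketing": 10_800_000, "fraud": 7_000_000, "others": 1_500_000},
}

for year in (2022, 2023):
    total_rev = sum(revenue[year].values())
    total_cost = sum(cost[year].values())
    net = total_rev - total_cost
    fraud_pct = 100 * cost[year]["fraud"] / total_cost
    print(f"{year}: net revenue = {net:,}, fraud share of cost = {fraud_pct:.1f}%")
```

The fraud share of total cost falls from about 41% in 2022 to about 36% in 2023, consistent with the cost segment percentage query.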

let’s go deep into the data and build models to further improve our fraud reduction.

I am using Google Colab to perform the analysis and build a decision tree model; you may choose any platform that supports Python.

Use the dataset available here:

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

## load data into dataframe
df = pd.read_csv('subscription.csv')
df.head()

In the above dataset, a default value of 1 marks a fraudulent account and 0 a valid one.

Let's run some basic exploratory data analysis.

1. Number of accounts per year.

df2 = df.groupby(['Subscription Year'])['Account ID'].count()
print(df2)

2. Number of Fraud and valid accounts per year.

df2 = df.groupby(['Subscription Year','default'])['Account ID'].count()
print(df2)

3. How Many defaults as per Housing category

df2 = df.groupby(['housing','default'])['Account ID'].count()
print(df2)
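The same counts read more easily as a pivot, with one row per housing category and one column per default flag. A minimal sketch with pd.crosstab, using a tiny hypothetical sample in place of the real subscription.csv frame:

```python
import pandas as pd

# Tiny hypothetical sample standing in for the real subscription.csv data;
# in practice, use the df loaded above.
df = pd.DataFrame({
    "housing": ["own", "rent", "own", "free", "rent"],
    "default": [0, 1, 0, 1, 0],
    "Account ID": [101, 102, 103, 104, 105],
})

# crosstab pivots the groupby counts: rows = housing, columns = default flag.
print(pd.crosstab(df["housing"], df["default"]))
```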
## dataframe for decision tree
df_dc = df.loc[:, ['user age', 'Account Age', 'Count of linking accounts', 'default']]
df_dc.head()

## decision tree
feature_name = ['user age', 'Account Age', 'Count of linking accounts']
X = df_dc[feature_name]
y = df_dc['default']
dt = DecisionTreeClassifier(max_depth=2, random_state=0)
dt = dt.fit(X, y)
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), dpi=500)
tree.plot_tree(dt, feature_names=feature_name)
plt.show()
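Besides the plot, scikit-learn's export_text renders the same tree as plain-text rules, which is handy when translating splits into decline logic. The sketch below uses synthetic data with invented distributions (the real dataset is not bundled with this post) but the same three features:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for subscription.csv (hypothetical distributions).
rng = np.random.default_rng(0)
n = 1000
df_demo = pd.DataFrame({
    "user age": rng.integers(18, 70, n),
    "Account Age": rng.integers(0, 60, n),
    "Count of linking accounts": rng.integers(0, 10, n),
})
# Assumed pattern: young users with many linked accounts default more often.
df_demo["default"] = (
    (df_demo["user age"] < 25) & (df_demo["Count of linking accounts"] > 5)
).astype(int)

features = ["user age", "Account Age", "Count of linking accounts"]
dt = DecisionTreeClassifier(max_depth=2, random_state=0)
dt.fit(df_demo[features], df_demo["default"])

# export_text prints the same tree as plot_tree, but as indented text rules.
print(export_text(dt, feature_names=features))
```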

Interpreting the tree

Decision Tree Summary

Using the table above, we can select the leaf logics with the highest precision and recall rates and develop strategies to decline users who meet those criteria.
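One way to build such a leaf-level summary is to map each row to its leaf with dt.apply, then compute per-leaf precision (the fraud share within the leaf) and recall (the share of all fraud captured by that leaf). A sketch on synthetic stand-in data, with the same hypothetical features as before; swap in the real frame in practice:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data; replace with the real subscription.csv columns.
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "user age": rng.integers(18, 70, n),
    "Account Age": rng.integers(0, 60, n),
    "Count of linking accounts": rng.integers(0, 10, n),
})
y = ((X["user age"] < 25) & (X["Count of linking accounts"] > 5)).astype(int)

dt = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# dt.apply maps each row to the id of the leaf node it lands in.
leaves = dt.apply(X)
total_fraud = y.sum()
for leaf in np.unique(leaves):
    in_leaf = leaves == leaf
    frauds = y[in_leaf].sum()
    precision = frauds / in_leaf.sum()   # share of this leaf that is fraud
    recall = frauds / total_fraud        # share of all fraud caught by this leaf
    print(f"leaf {leaf}: n={in_leaf.sum()}, "
          f"precision={precision:.2f}, recall={recall:.2f}")
```

Leaves with both high precision and meaningful recall are the natural candidates for a decline rule.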
