Unlocking Insights: Interpreting Clustering Results through Decision Trees
In the world of data analysis and pattern recognition, clustering stands as a powerful technique to uncover underlying structures within complex datasets. By grouping similar data points together, clustering algorithms offer a bird’s-eye view of data, facilitating the discovery of valuable insights and meaningful patterns. However, understanding the significance of each cluster can be a challenging task. That’s where decision trees come to the rescue! Decision trees are known for their ability to provide clear insights into decision-making processes.
In this blog, we’ll explore how decision trees can help us interpret clustering results, making it easier to comprehend and act on the valuable information hidden within the clusters. Let’s see how decision trees turn clustering outcomes into actionable knowledge.
Prerequisites
- Google account: I will be using Google Colab throughout this blog, since the required libraries come pre-installed there.
- Familiarity with Python and the scikit-learn library
- Data: Download the data from https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
Let’s Begin
I will be using K-means to cluster the data. K-means is a popular unsupervised machine learning algorithm for partitioning data into ‘K’ clusters. The goal is to group similar data points together and separate dissimilar ones based on their features. The algorithm iteratively assigns data points to the nearest cluster centroid and updates the centroids to minimize the sum of squared distances between data points and their assigned centroids.
1. Upload the dataset you downloaded to Google Colab (I named it creditcard.csv) and create a DataFrame from it. In the dataset, every row corresponds to a credit card transaction and contains 31 variables. Among these variables, there is one called “Class,” which serves as the target variable.
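If you are working in Colab, one way to get the file into the runtime is the files.upload helper from the google.colab package; the snippet below is a minimal sketch and assumes you are running it inside a Colab notebook.
from google.colab import files

# Opens a file picker in the browser; choose the creditcard.csv you downloaded
uploaded = files.upload()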
# Import the libraries we need to use in this blog
import warnings
warnings.filterwarnings('ignore')

import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.utils.class_weight import compute_sample_weight
# read the input data
raw_data = pd.read_csv('creditcard.csv')
raw_data = raw_data.dropna()
print("There are " + str(len(raw_data)) + " observations in the credit card fraud dataset.")
print("There are " + str(len(raw_data.columns)) + " variables in the dataset.")
# display the first rows in the dataset
raw_data.head()
2. Separate the features from the target and standardize them

y = raw_data[["Class"]]
X_df = raw_data.iloc[:, 1:-1]
from sklearn.pipeline import Pipeline

# Standardize the features so that each one has zero mean and unit variance
pipeline = Pipeline(steps=[
    ('scaler', StandardScaler())
])
X_scaled = pipeline.fit_transform(X_df)
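A quick aside on choosing the number of clusters: a common heuristic is the elbow method, where we fit KMeans for several values of K and look for the point where the inertia (the sum of squared distances that K-means minimizes) stops dropping sharply. The snippet below is a minimal sketch, not part of the original analysis; it assumes the X_scaled matrix from the step above, and the subsample size and range of K values are illustrative choices.
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Work on a random subsample to keep the loop fast (illustrative choice)
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(X_scaled), size=20000, replace=False)
X_sample = X_scaled[sample_idx]

inertias = []
k_values = range(2, 9)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_sample)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow method for choosing K')
plt.show()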
3. Cluster the data using the KMeans class from scikit-learn

from sklearn.cluster import KMeans

kmeans_model = KMeans(n_clusters=4, random_state=42)
y_cluster = kmeans_model.fit_predict(X_scaled)
raw_data['Cluster'] = y_cluster
raw_data.groupby('Cluster')['Class'].mean().reset_index()
So, we created 4 clusters for our dataset; the result above shows the mean of Class (i.e., the fraud rate) grouped by cluster.
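To get a slightly fuller picture of each cluster, it can also help to look at cluster sizes and average transaction amounts alongside the fraud rate. The snippet below is a minimal sketch, not part of the original analysis; it assumes the raw_data DataFrame with the Cluster column added above.
cluster_profile = raw_data.groupby('Cluster').agg(
    n_transactions=('Class', 'size'),   # number of transactions in the cluster
    fraud_rate=('Class', 'mean'),       # proportion of fraudulent transactions
    avg_amount=('Amount', 'mean')       # average transaction amount
).reset_index()
print(cluster_profile)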
But now comes the challenging part: interpreting what each cluster means. This is where the decision tree comes into the picture.
We will reuse our feature matrix X_df, but this time the target variable will be the cluster label.
Decision trees can aid in understanding clusters by providing a transparent and intuitive representation of the decision-making process within each cluster. By analyzing the splits and conditions in the tree, we can identify the most influential features and their thresholds that contribute to the formation of distinct clusters. Decision trees also allow us to explore the hierarchy of features, enabling a deeper comprehension of the cluster’s structure and its distinguishing characteristics. As a powerful tool for visualizing and explaining the data-driven decisions, decision trees play a crucial role in transforming complex cluster results into interpretable insights, fostering better data understanding and informed decision-making.
4. Import Decision Tree Classifier
from sklearn.tree import _tree, DecisionTreeClassifier
5. Create the train and test split
# L1-normalize each row of the feature matrix; the KMeans cluster labels become the target
X = normalize(X_df, norm="l1")
y = y_cluster
# print the shape of the features matrix and the labels vector
print('X.shape=', X.shape, 'y.shape=', y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print('X_train.shape=', X_train.shape, 'Y_train.shape=', y_train.shape)
print('X_test.shape=', X_test.shape, 'Y_test.shape=', y_test.shape)
6. Train the decision tree

# Weight samples inversely to their cluster frequency so that small clusters are not ignored
w_train = compute_sample_weight('balanced', y_train)

sklearn_dt = DecisionTreeClassifier(max_depth=4, random_state=35)
t0 = time.time()
sklearn_dt.fit(X_train, y_train, sample_weight=w_train)
sklearn_time = time.time()-t0
print("[Scikit-Learn] Training time (s): {0:.5f}".format(sklearn_time))
7. Plot the decision tree to interpret the rules it learned in order to classify each segment.

from sklearn import tree
import graphviz

dot_data = tree.export_graphviz(sklearn_dt,
                                feature_names=list(X_df.columns),
                                class_names=True,
                                filled=True)
graph = graphviz.Source(dot_data, format="png")
graph.render("decision_tree_graphviz")
8. Extract the rules from the tree
def extract_rules(tree, feature_names, class_names, rule_list=None, node_id=0, indent=""):
    if rule_list is None:
        rule_list = []

    # Majority class at the current node (tree.value holds the per-class sample distribution)
    class_distribution = tree.value[node_id][0]
    class_index = class_distribution.argmax()
    class_name = class_names[class_index]

    # Leaf nodes have no split; just report the predicted class
    if tree.feature[node_id] == _tree.TREE_UNDEFINED:
        rule_list.append(f"{indent}predict {class_name} [class distribution: {class_distribution}]")
        return rule_list

    # Get the splitting feature and threshold for the current node
    feature_name = feature_names[tree.feature[node_id]]
    threshold = tree.threshold[node_id]
    rule_list.append(f"{indent}if {feature_name} <= {threshold:.2f} then {class_name} [class distribution: {class_distribution}]")

    # Recurse into the left and right subtrees
    extract_rules(tree, feature_names, class_names, rule_list, tree.children_left[node_id], indent + "  ")
    extract_rules(tree, feature_names, class_names, rule_list, tree.children_right[node_id], indent + "  ")
    return rule_list
# Extract rules from the trained decision tree; the class names here are the four cluster labels
tree_rules = extract_rules(sklearn_dt.tree_, list(X_df.columns), [0, 1, 2, 3])
# Print the extracted rules
for rule in tree_rules:
    print(rule)
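scikit-learn also ships a built-in export_text helper that prints a textual version of the tree, which can serve as a sanity check for the hand-rolled extractor above; a minimal sketch:
from sklearn.tree import export_text

print(export_text(sklearn_dt, feature_names=list(X_df.columns)))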
In conclusion, using decision trees to interpret clustering results offers valuable insights into the underlying patterns within the data. By transforming clusters into classes, we can understand the defining features of each group. This approach bridges the gap between unsupervised and supervised learning, enabling data-driven decisions that drive innovation and solve complex problems in various domains. Embrace the power of interpreting clustering results using decision trees to uncover hidden knowledge and unleash the full potential of your datasets.
Follow me on LinkedIn: https://www.linkedin.com/in/shorya-sharma-b94161121