Fraudulent Transactions
(Classification of Imbalanced Dataset)
Introduction
- FRAUDULENT TRANSACTIONS is a case of an imbalanced dataset. An imbalanced dataset is one where the presence of one class is much higher than that of the other class, for example spam data for email.
- The issue with imbalanced data is that most classification algorithms are designed for approximately equal numbers of samples from both classes.
- For example, in the case of credit card transactions, at most 2-3 out of 100 transactions will be fraud. In this case, even doing nothing and predicting every transaction as non-fraud gives an accuracy of 97-98%, which is very high (see the sketch after this list).
- In this project, several classifiers are created for FRAUDULENT TRANSACTIONS and compared. The dataset used for this project is downloaded from Kaggle.
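To make the imbalance point concrete, here is a minimal sketch (not part of the project code) using a hypothetical 100-transaction sample: a classifier that always predicts the majority class still reports very high accuracy.
# A "do nothing" baseline on a hypothetical sample of 100 transactions
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0] * 98 + [1] * 2)   # 98 non-fraud, 2 fraud
y_pred = np.zeros_like(y_true)          # always predict "non-fraud"

print(accuracy_score(y_true, y_pred))   # 0.98, even though no fraud is caught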
Outline of the Project is
- General overview of the data
- Visualization
- Issues with the Default classifier
- Classification using resampling
- Resampling
- Visualization
- Model pipeline
- Classification
- Results
- Conclusion
General overview of the data
# Importing the required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# A quick info
df = pd.read_csv("Fraud.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
# Column Dtype
--- ------ -----
0 step int64
1 type object
2 amount float64
3 nameOrig object
4 oldbalanceOrg float64
5 newbalanceOrig float64
6 nameDest object
7 oldbalanceDest float64
8 newbalanceDest float64
9 isFraud int64
10 isFlaggedFraud int64
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB
# general view
df.head(5)
|   | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.00 | C1305486145 | 181.0 | 0.00 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.00 | C840083671 | 181.0 | 0.00 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |
plt.style.use('dark_background')
fig = plt.gcf()
fig.set_size_inches(10,6)
sns.scatterplot(x = 'oldbalanceOrg', y = 'newbalanceDest', data = df[df['isFraud'] == 1], color = 'red', label = 'Fraud', marker = '*');
sns.scatterplot(x = 'oldbalanceOrg', y = 'newbalanceDest', data = df[df['isFraud'] == 0], color = 'green', label = 'non-Fraud', marker = '+');
plt.legend()
plt.show()
# Fraud vs non-fraud transactions count
print('Number of fraud Transactions = ', df[df['isFraud'] == 1].shape[0])
print('Number of non-fraud Transactions = ', df[df['isFraud'] == 0].shape[0])
sns.countplot(x = 'isFraud', data = df);
Number of fraud Transactions = 8213
Number of non-fraud Transactions = 6354407
- Here we can see that out of 6362620 transactions, only 8213 (around 0.13%) are fraud. In the count plot, the fraud transactions are barely visible.
- This is clearly an example of an imbalanced dataset.
- The data has 11 columns, of which isFraud is the target column. Among the remaining columns, nameOrig and nameDest are not relevant features, as they are only the IDs of the customers.
- The column isFlaggedFraud is 1 for transactions the system flags as illegal attempts (large transfers above a fixed threshold) and 0 otherwise. Again, this is not relevant for us.
Visualization
# Getting a sample of 5000 and saving it for future reference
df_vis = df.sample(5000, random_state = 0)
df_vis.to_csv('df_vis.csv')
df_vis = pd.read_csv('df_vis.csv')
Do fraud transactions happen for high amounts? If yes, then which transaction mode is being used?
sns.swarmplot(x = 'type', y = 'amount', hue = 'isFraud', data = df_vis);
- It is hard to comment on the fraud transactions from this plot alone.
- Another thing to notice is that many transactions, especially the high-amount ones, are of the TRANSFER type.
- It is also intuitive that the fraud transactions are mostly done by cashing out the money.
Does the balance in the account lead towards fraud transactions? If yes, then for what amounts?
sns.scatterplot(x = 'oldbalanceOrg', y = 'amount', hue = 'isFraud', data = df_vis);
- From the above graph it is clear that fraud does not happen only for customers who have a high balance in their account.
Skewness in the data
for col in ['amount','oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']:
    print(col, '\n')
    sns.boxenplot(x = col, data = df_vis)
    plt.show()
amount
oldbalanceOrg
newbalanceOrig
oldbalanceDest
newbalanceDest
- All the numeric features are highly right skewed, so a log transformation of the data will be required (quantified in the sketch below).
- It is also a fact that most of the accounts hold relatively small amounts of money, as can be seen from the plots.
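As a quick numerical check of the skew, the sketch below (not part of the original notebook, and assuming the df_vis sample from above) compares the skewness before and after a log(1 + x) transform.
# Skewness before and after log transformation (sketch, uses df_vis from above)
skewed_cols = ['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']

print(df_vis[skewed_cols].skew())             # large positive values -> strong right skew
print(np.log1p(df_vis[skewed_cols]).skew())   # much closer to zero after the transform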
Issues with default classifier for imbalanced data
Let us create some classifiers to see their performance without changing anything.
Default Classifiers
# Copy of data to check the performance
df_def = df.copy()

# Dropping non-required features
df_def = df_def.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

# Encoding categorical variables
df_def = pd.get_dummies(df_def)

# Log transformation of skewed features
sfeature = ['amount','oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
df_def[sfeature] = np.log(df_def[sfeature] + 1)
df_def.head()
|   | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | type_CASH_IN | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9.194276 | 12.044359 | 11.984786 | 0.000000 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 7.531166 | 9.964112 | 9.872292 | 0.000000 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 5.204007 | 5.204007 | 0.000000 | 0.000000 | 0.0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 5.204007 | 5.204007 | 0.000000 | 9.960954 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 9.364703 | 10.634773 | 10.305174 | 0.000000 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 |
Model Pipeline
# Features and target
features = df_def.drop('isFraud', axis = 1)
target = df_def['isFraud']

# Splitting the train and test data
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(features, target, test_size = 0.3, random_state = 0)
# Importing the classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
# Performance metrics
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
def performance(xtrain, ytrain, xtest, ytest, classifier):
    ypred = classifier.predict(xtest)
    report = classification_report(ytest, ypred)
    cm = confusion_matrix(ytest, ypred)
    print(report)
    disp = ConfusionMatrixDisplay(confusion_matrix = cm)
    disp.plot()
    plt.show()

# Result
for classifier in [LogisticRegression(random_state = 0), RandomForestClassifier(random_state = 0), GaussianNB()]:
    classifier.fit(xtrain, ytrain)
    print(classifier)
    performance(xtrain, ytrain, xtest, ytest, classifier)
LogisticRegression(random_state=0)
precision recall f1-score support
0 1.00 1.00 1.00 1906367
1 0.86 0.48 0.62 2419
accuracy 1.00 1908786
macro avg 0.93 0.74 0.81 1908786
weighted avg 1.00 1.00 1.00 1908786
RandomForestClassifier(random_state=0)
precision recall f1-score support
0 1.00 1.00 1.00 1906367
1 0.98 0.79 0.88 2419
accuracy 1.00 1908786
macro avg 0.99 0.90 0.94 1908786
weighted avg 1.00 1.00 1.00 1908786
GaussianNB()
precision recall f1-score support
0 1.00 0.82 0.90 1906367
1 0.01 0.97 0.01 2419
accuracy 0.82 1908786
macro avg 0.50 0.89 0.46 1908786
weighted avg 1.00 0.82 0.90 1908786
It is quite clear that for every classifier used above, the classification of the 0 label (non-fraud transactions) is very good. In contrast, the classification of class label 1 (fraud transactions) is very poor, which is not acceptable.
For example, in the case of the logistic regression classifier (f1-score = 0.62), 2419 samples are actually labelled 1, but out of those 2419 samples, 1253 (more than half) are misclassified (reproduced in the short check below). This does not happen for class zero.
The Gaussian Naive Bayes classifier has classified class 1 with very high recall, but it has also labelled many samples as 1 that were actually 0. Because of this, its f1-score drops to 0.01, which is far lower than that of the other two classifiers.
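As a sanity check on the logistic regression numbers quoted above, the short calculation below (a sketch using only the figures already reported) reproduces the class 1 f1-score of 0.62.
# Worked check of the quoted logistic regression figures for class 1
support = 2419                                 # actual fraud samples in the test split
missed = 1253                                  # frauds predicted as non-fraud
recall = (support - missed) / support          # ~0.48, as in the report
precision = 0.86                               # from the classification report
f1 = 2 * precision * recall / (precision + recall)
print(round(recall, 2), round(f1, 2))          # 0.48 0.62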
Classification with resampling
Resampling
Resampling can be done in two ways:
Over sampling - creating duplicates of class 1 samples to make the number of samples in both classes equal. The total number of rows therefore becomes twice the number of samples supporting class 0 (i.e., more than the original number of samples).
Under sampling - dropping random samples supporting class 0 to make the number of samples in both classes equal. The total number of rows therefore becomes twice the number of samples supporting class 1 (i.e., much less than the original number of samples).
As the dataframe has a lot of rows, undersampling will be a better choice (a sketch of the oversampling alternative is shown below for comparison).
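For comparison, here is a hedged sketch of what random oversampling would look like on this data (duplicating class 1 rows with replacement until the counts match); it is not used in this project because it would roughly double an already large dataframe.
# Random oversampling sketch (assumes the full dataframe df loaded earlier)
fraud = df[df['isFraud'] == 1]
nonfraud = df[df['isFraud'] == 0]

fraud_over = fraud.sample(nonfraud.shape[0], replace = True, random_state = 0)
df_over = pd.concat([nonfraud, fraud_over])

print(df_over['isFraud'].value_counts())   # both classes now equal nonfraud.shape[0]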
Undersampling (Resampling)
# Selecting the class 1 samples
df1 = df[df['isFraud'] == 1]

# Selecting an equal number of random class 0 samples
df0 = df[df['isFraud'] == 0].sample(df1.shape[0])

# Merging to get the final data frame
df_und_sam = pd.concat([df1, df0])

# Saving it for future reference
df_und_sam.to_csv('Under_sampled_data.csv')

df = pd.read_csv('Under_sampled_data.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16426 entries, 0 to 16425
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 16426 non-null int64
1 step 16426 non-null int64
2 type 16426 non-null object
3 amount 16426 non-null float64
4 nameOrig 16426 non-null object
5 oldbalanceOrg 16426 non-null float64
6 newbalanceOrig 16426 non-null float64
7 nameDest 16426 non-null object
8 oldbalanceDest 16426 non-null float64
9 newbalanceDest 16426 non-null float64
10 isFraud 16426 non-null int64
11 isFlaggedFraud 16426 non-null int64
dtypes: float64(5), int64(4), object(3)
memory usage: 1.5+ MB
- The total number of samples in the resampled dataframe is 16426, which is twice the number of samples supporting class 1 (confirmed by the quick check below).
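A quick sanity check (a small sketch, assuming the re-loaded dataframe from above) confirms that the two classes are now balanced.
# Class counts after undersampling: both labels should show 8213
print(df['isFraud'].value_counts())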
# General view of the data
plt.style.use('dark_background')
fig = plt.gcf()
fig.set_size_inches(5, 5)
sns.scatterplot(x = 'oldbalanceOrg', y = 'newbalanceDest', data = df[df['isFraud'] == 1], color = 'red', label = 'Fraud', marker = '*');
sns.scatterplot(x = 'oldbalanceOrg', y = 'newbalanceDest', data = df[df['isFraud'] == 0], color = 'green', label = 'non-Fraud', marker = '+');
plt.ylim(-0.01*(10**8), 0.5*(10**8))
plt.legend()
plt.show()
- Now the fraud transactions are much more visible compared to earlier.
Visualization
# Amount vs transaction type
sns.swarmplot(x = 'type', y = 'amount', data = df.sample(1000), hue = 'isFraud');
- We can clearly see that most of the fraud transactions are done either by cashing out or by transferring money to another account.
# Skewness in the numerical features
for col in ['amount','oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']:
    print(col, '\n')
    sns.boxenplot(x = col, data = df)
    plt.show()
amount
oldbalanceOrg
newbalanceOrig
oldbalanceDest
newbalanceDest
Data processing
# Dropping non-required features
df = df.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

# Encoding categorical variables
df = pd.get_dummies(df)

# Log transformation of skewed features
sfeature = ['amount','oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
df[sfeature] = np.log(df[sfeature] + 1)
Model Pipeline
# Features and target
features = df.drop('isFraud', axis = 1)
target = df['isFraud']

# Splitting the train and test data
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(features, target, test_size = 0.3, random_state = 0)
# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
# Performance metrics
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
def performance(xtrain, ytrain, xtest, ytest, classifier):
    ypred = classifier.predict(xtest)
    report = classification_report(ytest, ypred)
    cm = confusion_matrix(ytest, ypred)
    print(report)
    disp = ConfusionMatrixDisplay(confusion_matrix = cm)
    disp.plot()
    plt.show()

# Result
for classifier in [LogisticRegression(random_state = 0), RandomForestClassifier(random_state = 0), GaussianNB()]:
    classifier.fit(xtrain, ytrain)
    print(classifier)
    performance(xtrain, ytrain, xtest, ytest, classifier)
LogisticRegression(random_state=0)
precision recall f1-score support
0 0.00 0.00 0.00 2490
1 0.49 1.00 0.66 2438
accuracy 0.49 4928
macro avg 0.25 0.50 0.33 4928
weighted avg 0.24 0.49 0.33 4928
RandomForestClassifier(random_state=0)
precision recall f1-score support
0 0.99 0.99 0.99 2490
1 0.99 0.99 0.99 2438
accuracy 0.99 4928
macro avg 0.99 0.99 0.99 4928
weighted avg 0.99 0.99 0.99 4928
GaussianNB()
precision recall f1-score support
0 0.63 0.74 0.68 2490
1 0.68 0.56 0.62 2438
accuracy 0.65 4928
macro avg 0.66 0.65 0.65 4928
weighted avg 0.66 0.65 0.65 4928
Conclusion
Out of the three classifiers, the Random Forest classifier works best both before and after resampling. It is also clear that it works much better after resampling.
The logistic regression classifier was biased towards the 0 label before resampling and became biased towards label 1 after resampling (undersampling leads to losing a lot of valuable information).
The Gaussian Naive Bayes classifier also performs better after resampling.
The f1-score for the fraud class (label 1) increased for every classifier after resampling.
As this is a case of fraud detection, classifying a fraud transaction as non-fraud is worse than classifying a non-fraud transaction as fraud.
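One way to reflect that asymmetry when comparing classifiers is a recall-weighted metric such as the F-beta score with beta > 1; the sketch below (not part of the original notebook, using hypothetical labels) shows how it penalises missed frauds more than plain F1 does.
# F-beta sketch: recall is weighted more heavily than precision when beta > 1
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical test labels: 10 frauds, the model misses 4 and raises 2 false alarms
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 88

print(f1_score(y_true, y_pred))               # ~0.67
print(fbeta_score(y_true, y_pred, beta = 2))  # ~0.63, lower because the missed frauds weigh more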