In this section, we present the methods and techniques used to predict credit card fraud. We used the “Credit Card Fraud Detection” dataset from kaggle.com, which contains anonymized credit card transaction data and is more than 200 MB in size. We will build four training and prediction models: Logistic Regression, Decision Trees, Random Forests, and XGBoost.
Logistic Regression is a statistical model often used for classification. Instead of predicting the exact value of a variable, as in linear regression, logistic regression estimates the probability of an outcome based on one or more independent variables. It is well suited to binary classification, such as predicting whether a transaction is fraudulent.
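As a minimal sketch of this idea (not taken from the paper), a logistic regression model fitted on a toy one-dimensional dataset returns a probability for each class rather than a raw value; the feature values and labels below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature (e.g., a transaction amount) and binary labels (1 = fraud).
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns [P(class 0), P(class 1)] for each sample,
# which is the probability estimate described above.
probs = model.predict_proba([[2.5], [10.5]])
print(probs)  # low fraud probability for 2.5, high for 10.5
```

The threshold of 0.5 on the second column is what turns these probabilities into a binary fraud/not-fraud decision.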
Decision Trees are a widely used algorithm in machine learning that utilizes a tree-like structure for decision-making. Each node in the tree represents a question or property, and each branch is a possible answer to that question. Decision Trees are easy to understand and interpret, but they tend to overfit the training data and can therefore perform poorly on new, unseen data.
Random Forest is a machine learning algorithm that combines the predictions of multiple models, most notably Decision Trees, to improve performance and reduce overfitting. Each tree in the forest is trained on a random subset of the data, and the predictions from all the trees are combined to give one final result. This method is very effective for both classification and regression.
XGBoost (eXtreme Gradient Boosting) is an optimized gradient-boosting algorithm built on decision trees. It uses a series of models built sequentially, each new model focusing on correcting errors made by the previous ones. XGBoost is known for its high performance, its ability to handle large datasets effectively, and its flexibility in parameter tuning.
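The sequential-correction idea can be sketched in code. XGBoost itself requires the separate xgboost package, so as a hedged illustration the sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in, since both build trees one after another, each shrunk by a learning rate so that later trees gradually correct the residual errors of earlier ones; the synthetic dataset is invented for the example:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for real transactions.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of sequential trees; learning_rate shrinks each
# tree's contribution so later trees correct remaining errors gradually.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=0)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))
```

With the xgboost package installed, `xgboost.XGBClassifier` exposes essentially the same fit/predict interface with additional tuning parameters.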
3.2. Dataset Pre-Processing
We continue with the pre-processing of our data. First, it is important to check whether any feature in our dataset has missing values. We do this by running data.isna().sum(). The result is shown below (Table 5).
As the table shows, no feature contains any missing values, so we can continue with the pre-processing of our data. It is also important to check whether there are duplicate entries. This is achieved by using the following command: data.duplicated().any(). The execution returns False, which means that there are no duplicate records, as shown below.
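Had the check returned True instead, the duplicates could have been dropped before continuing. A hypothetical sketch on a small invented DataFrame (not the paper's data) shows how:

```python
import pandas as pd

# Toy DataFrame whose first two rows are identical on purpose.
df = pd.DataFrame({'Amount': [10.0, 10.0, 25.0], 'Class': [0, 0, 1]})

print(df.duplicated().any())  # True: a duplicate row exists
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))                # 2 rows remain after removal
```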
→In [12]: data.duplicated().any()
→Out[12]: False
Next, we want to investigate how the features correlate with one another. We do this by creating a heatmap graph, specifically by executing the following commands:
→heatmap = plt.figure(figsize=[20,10])
→sns.heatmap(data.corr(), cmap='crest', annot=True)
→plt.show()
The result of the command is the heatmap below.
The information deduced from the heatmap indicates that there are several notable correlations among different characteristics. V17 and V18 display a high correlation with each other, as do V16 and V17. V14 shows a negative correlation with V4, while V12 is negatively correlated with both V10 and V11. Additionally, V11 is negatively correlated with V10 but positively correlated with V4. V3 is positively correlated with both V10 and V12, and V9 and V10 also show a positive correlation.
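Pairs like these can also be extracted programmatically instead of being read off the heatmap. In the sketch below, a small synthetic DataFrame with one deliberately correlated pair stands in for the real `data`; the column names and threshold are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: V18 is V17 plus small noise, V4 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({'V17': a,
                   'V18': a + rng.normal(scale=0.1, size=200),
                   'V4': rng.normal(size=200)})

corr = df.corr()
# Keep one copy of each pair (upper triangle) and filter by |r| >= 0.8.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
strong = pairs[pairs.abs() >= 0.8]
print(strong)  # only the (V17, V18) pair survives the threshold
```

Applied to the full dataset, the same filter would list exactly the feature pairs described in the paragraph above.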
Next, we want to investigate the transaction amounts in order to draw conclusions. By executing data['Amount'].plot.box(), we can produce a box plot that shows roughly the range in which the transaction amounts fall. The result of running the above command is shown in the graph below.
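The quantities a box plot draws (median, quartiles, whisker bounds, outliers) can also be read off numerically. A sketch with a few invented amounts standing in for data['Amount']:

```python
import pandas as pd

# Toy transaction amounts; the last value is an obvious outlier.
amount = pd.Series([5.0, 12.5, 20.0, 45.0, 80.0, 150.0, 2000.0])

q1, median, q3 = amount.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
upper_whisker = q3 + 1.5 * iqr  # points above this are drawn as outliers
print(median, iqr, upper_whisker)
print(amount[amount > upper_whisker].tolist())  # [2000.0]
```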
Next, we want to check whether the distribution of the Amount feature is normal. By executing the following command, we create a kernel density estimate (KDE) plot, which shows the shape of the distribution:
→sns.kdeplot(data=data['Amount'], shade=True)
→plt.show()
The result of running the above command is shown in the graph below (Figure 5).
Figure 5.
Kdeplot of transaction amount distribution to examine data curvature and normality.
As the graph shows, our data is fairly well distributed. Next, we want to study some features which, as the heatmap showed earlier, exhibit high correlation. These features are V1, V10, V12, and V23. To examine the distribution of each of these features, we create histograms using the following commands:
→fig, axes = plt.subplots(2, 2, figsize=(10,6))
→data['V1'].plot(kind='hist', ax=axes[0,0], title='Distribution of V1')
→data['V10'].plot(kind='hist', ax=axes[0,1], title='Distribution of V10')
→data['V12'].plot(kind='hist', ax=axes[1,0], title='Distribution of V12')
→data['V23'].plot(kind='hist', ax=axes[1,1], title='Distribution of V23')
→plt.suptitle('Distribution of V1, V10, V12 and V23', size=14)
→plt.tight_layout()
→plt.show()
The result of running the above commands is four histograms showing the distributions of features V1, V10, V12, and V23 (Figure 6).
Figure 6.
Printing histograms of features V1, V10, V12, and V23.
As the figure shows, we have successfully plotted the distributions of the highly correlated features. Based on these features, we now want to study what percentage of our data constitutes credit card fraud (class 1) and what percentage does not (class 0). To achieve this, we execute the command:
→data['Class'].value_counts().plot.pie(explode=[0,0.1], autopct='%3.1f%%', shadow=True, legend=True, startangle=45)
→plt.title('Distribution of Class', size=14)
→plt.show()
The result of executing the above command is the following pie chart, which shows that 50% of the transactions are fraudulent while the other 50% are not (Figure 7).
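The same 50/50 split can be verified directly, without a chart, via value_counts(normalize=True). A sketch with invented labels standing in for data['Class']:

```python
import pandas as pd

# Toy class labels: equally many fraudulent (1) and normal (0) rows.
cls = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

ratios = cls.value_counts(normalize=True)
print(ratios)  # 0.5 for each class, matching the pie chart
```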
Figure 7.
Data pie representing the fraud rate in credit card transactions.
Next, we need to separate our data into the independent variables (features) and the dependent variable (target). So, we run the following commands:
→x = data.drop(['id','Class'], axis=1)
→y = data.Class
The above command removes the column 'Class', which indicates whether a transaction is valid or fraudulent. Note that this column is the one we are trying to predict, which is why we “drop” it from the features. Next, we print the head (first rows) of the x DataFrame we created, i.e., the dataset that no longer contains the fraud class. The result is shown below.
x.head()
Next, we want to check the shapes of our x and y datasets. Note that x and y must contain the same number of rows.
→print('Shape of x', x.shape)
→print('Shape of y', y.shape)
→Shape of x (568630, 29)
→Shape of y (568630,)
As shown above, x and y contain the same number of rows. Next, in order to train our models, we need to scale the data, a step that is important for correct prediction. We accomplish this by running the following commands:
→sc = StandardScaler()
→x_scaled = sc.fit_transform(x)
→x_scaled_df = pd.DataFrame(x_scaled, columns=x.columns)
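Beyond eyeballing the head of the transformed frame, standardization can be checked numerically: after StandardScaler, every column should have mean approximately 0 and standard deviation approximately 1. A sketch on a tiny invented frame standing in for x:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the feature DataFrame x.
x = pd.DataFrame({'V1': [1.0, 2.0, 3.0, 4.0],
                  'Amount': [10.0, 50.0, 90.0, 130.0]})

sc = StandardScaler()
x_scaled_df = pd.DataFrame(sc.fit_transform(x), columns=x.columns)

# Each column is now centered at 0 with unit (population) variance.
print(x_scaled_df.mean().round(6).tolist())
print(x_scaled_df.std(ddof=0).round(6).tolist())
```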
We then study our data again to see if the transformation has been successful. This is accomplished with x_scaled_df.head(). The result is shown below.
Figure 8.
Checking transformed data after scaling application.
As the figure shows, our data appears to have been transformed correctly. We then split our data into train/test subsets. This is accomplished with the following command:
→x_train, x_test, y_train, y_test = train_test_split(x_scaled_df, y, test_size=0.25, random_state=15, stratify=y)
Next, we want to examine whether the subset splitting has been done successfully and the subset shapes are correct. This is achieved with the following commands:
→print(x_train.shape)
→print(x_test.shape)
→print(y_train.shape)
→print(y_test.shape)
The output shows the shape of each subset:
→ (426472, 29)
→ (142158, 29)
→ (426472,)
→ (142158,)
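The stratify=y argument used above deserves a note: it keeps the class ratio identical in the train and test subsets, which matters for a classification target. A sketch on invented labels (not the paper's data) confirms the behavior:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 80 toy samples, exactly half of them labeled 1 (fraud).
X = np.arange(80).reshape(-1, 1)
y = np.array([0] * 40 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=15, stratify=y)

print(y_tr.mean(), y_te.mean())  # both 0.5: the class ratio is preserved
```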
Based on these subsets, we start training our first model, logistic regression. We execute the command:
→lr = LogisticRegression()
Next, we run the command:
→lr.fit(x_train, y_train)
The first of the two commands above creates the logistic regression model, and the second fits it to our training data. Next, we need to create a function that evaluates the quality of the prediction model we are building.
→from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
→def model_eval(actual, predicted):
→    acc_score = accuracy_score(actual, predicted)
→    conf_matrix = confusion_matrix(actual, predicted)
→    clas_rep = classification_report(actual, predicted)
→    print('Model Accuracy is: ', round(acc_score, 2))
→    print(conf_matrix)
→    print(clas_rep)
With the above function we evaluate the model: we compare the actual class values with the ones the model predicted. The classification report contrasts actual with predicted values, while the accuracy score measures how well the model classified transactions as either normal or fraudulent. Finally, the function prints the confusion matrix and the classification report.
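A small sketch of how such an evaluation behaves, on invented actual/predicted labels rather than the paper's results:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels: one of the six predictions is wrong.
actual    = [0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0]

acc = accuracy_score(actual, predicted)
cm = confusion_matrix(actual, predicted)
print(round(acc, 2))  # 0.83: five of six labels match
print(cm)             # rows = actual class, columns = predicted class
```

In the confusion matrix, the diagonal holds correct classifications; the single off-diagonal entry is the fraudulent transaction the toy predictions missed.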