In our survey, the primary focus was on classifying texts as either human-written or machine-generated using an existing corpus. To differentiate between human-written and machine-generated text, several AI text classifiers were developed.
3.1. Dataset
The corpus for this study consists of multiple datasets with comparable text lengths, including both machine-generated and human-written content. Experiments were conducted iteratively across all datasets to provide a comprehensive overview.
To ensure that the detector generalizes well across various domains and writing styles, the human dataset includes texts from diverse domains, specifically:
M4 dataset (https://paperswithcode.com/datasets, accessed on 22 July 2024): Contains human-written text from sources such as Wikipedia, Wiki-How [34], Reddit (ELI5), arXiv, and PeerRead [35] for Chinese, as well as news articles for Urdu, RuATD [36] for Russian, and Indonesian news articles. Machine-generated text is sourced from multilingual LLMs such as ChatGPT, text-davinci-003, LLaMA [37], FlanT5 [38], Cohere, Dolly-v2, and BLOOMz [39];
Indonesian Hoax News Detection Dataset (Mendeley Data, accessed on 22 July 2024) [40]: Contains valid and hoax news articles in Indonesian. It has a simple structure, with CSV files consisting of two columns: text and label.
The M4 input data is organized as JavaScript Object Notation (JSON) records stored in JSON Lines (JSONL) files. Each record has a straightforward, intuitive structure.
Figure 2 and Figure 3 present the structure of the datasets (monolingual/multilingual) for training and development testing.
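A minimal sketch of reading one such record is given below; the file name and the field names ("text", "label", "model", "source") are illustrative assumptions, and the authoritative schema is the one shown in Figure 2 and Figure 3.

```python
# A minimal sketch of reading one M4-style JSONL record. File and field names
# are assumptions; see Figure 2 and Figure 3 for the actual record structure.
import json

with open("subtaskA_train_monolingual.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())

# e.g. {"text": "...", "label": 1, "model": "chatgpt", "source": "wikihow", ...}
print(record.get("text", "")[:80], record.get("label"))
```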
There are three major differences between the datasets used for training the models and the dataset used for the final evaluation:
The task formulation is different;
Human text was upsampled to balance the data;
New and unseen domains, generators, and languages appear in the test sets; the real test sets do not include information about the generator, domain, or language.
Nevertheless, the test dataset includes BLOOMZ outputs (for the monolingual subtask) that are not included in the training set. In this way, the model is prepared for real-world application scenarios.
3.2. System Overview
The architecture (Figure 4) is based on BERT-based transformers (BERT, RoBERTa, DistilBERT) using the HuggingFace library.
The model was pretrained [9,41,42,43] on large generic datasets and fine-tuned for specific tasks like text classification, named entity recognition, and sentiment analysis [44].
As a baseline, we chose the RoBERTa-base pretrained model, fine-tuned with a sequence classification/regression head on top.
The model was trained and evaluated on the same dataset mentioned before.
Table 2 contains the baselines’ hyperparameters. Additionally, we used Cross-Entropy loss as the loss function, as we are dealing with a binary classification task. The model ends in a Sigmoid function, so it outputs a probability between 0 (no AI-generated text) and 1 (AI-generated text). As the optimizer we used AdamW, an improved variant of Adaptive Moment Estimation (Adam) that is widely used for training deep learning (DL) models. The learning rate was set to 2e-5.
The average results for the baseline monolingual setup across three runs with the RoBERTa-base pretrained model are 0.74, and 0.72 for the multilingual setup, based on the xlm-roberta-base pretrained model.
This model has 12 layers, 768 hidden units, 12 attention heads, and 125 million parameters.
Table 3. Baseline model: Hyperparameter Optimization.
Hyperparameter | Value
Learning rate | 2e-5
Batch Size | 16
Epochs | 3
Weight decay | 0.01
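As a concrete illustration, a minimal sketch of this baseline setup is given below, assuming the JSONL files described in Section 3.1 (file names are placeholders) and the Table 3 hyperparameters; the Trainer uses AdamW by default, and the single-sigmoid output described above is approximated here by the standard two-label cross-entropy head.

```python
# A sketch of the baseline: roberta-base with a sequence classification head,
# lr 2e-5, batch size 16, 3 epochs, weight decay 0.01 (Table 3). File names and
# the "text"/"label" column names are assumptions about the JSONL data.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

raw = load_dataset("json", data_files={"train": "subtaskA_train_monolingual.jsonl",
                                       "dev": "subtaskA_dev_monolingual.jsonl"})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

tokenized = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="baseline-roberta",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

Trainer(model=model, args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["dev"]).train()
```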
Fine-tuned models. The main objective of the experiment was to obtain a fine-tuned model that could outperform the baseline model.
Various model variants and approaches were trained, and ultimately we decided to combine Hugging Face’s Transformers library with the PyTorch and Scikit-Learn libraries.
Additionally, a custom classifier class was applied on top of pretrained models to identify the correct label for our texts. The classifier consists of 2 dense layers: the first layer with 768 neurons (for “base” versions) / 1024 neurons (for “large” versions), and the second layer with 32 neurons (for “base” versions) / 8 neurons (for “large” versions).
Since we have a binary classification task, we use one neuron in the output layer and the sigmoid function (which returns values between 0 and 1) as its activation. The number of neurons in the first layer matches the output (hidden) size of the pretrained model: 768 for the “base” versions and 1024 for the “large” versions.
For the neural-network-based classification, we used the Rectified Linear Unit (ReLU) as the activation function for the hidden layers and the sigmoid function at the output, together with the Cross-Entropy loss function and the AdamW optimizer. The learning rate was set to 1e-5.
Table 4. Fine-tuned model: Hyperparameter Optimization.
Hyperparameter | Value
Learning rate | 1e-5
Batch Size | 8
Epochs | 5
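A sketch of this custom head is shown below, assuming a “base”-size encoder (hidden size 768) and the layer sizes described above; the class and variable names are illustrative.

```python
# A sketch of the custom classifier head for "base"-size models: dense 768 -> 32
# with ReLU, then a single output neuron with sigmoid. For "large" models the
# sizes would be 1024 -> 8. Names are illustrative, not the authors' code.
import torch
import torch.nn as nn
from transformers import AutoModel

class TransformerClassifier(nn.Module):
    def __init__(self, model_name="roberta-base", hidden_size=768, mid_size=32):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Sequential(
            nn.Linear(hidden_size, mid_size),
            nn.ReLU(),
            nn.Linear(mid_size, 1),
            nn.Sigmoid(),                      # probability that the text is AI-generated
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # first-token ([CLS]/<s>) representation
        return self.head(cls).squeeze(-1)
```

Training this head would then pair a binary cross-entropy loss (e.g., torch.nn.BCELoss) with torch.optim.AdamW at the 1e-5 learning rate from Table 4.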
As pretrained models, we tested BERT-base, RoBERTa-base, RoBERTa-large, and DistilBERT-base-uncased for the monolingual setup, and XLM-RoBERTa-base, BERT-base-multilingual-cased, and DistilBERT-base-multilingual-cased for the multilingual setup, all provided by the Transformers library.
For monolingual experiments, as expected, RoBERTa-large provided the best results with an accuracy of 0.83, but the training process took approximately 10 hours.
Using the DistilBERT-base-multilingual-cased model for monolingual experiments also yielded promising results, with lower power consumption and a training time of approximately 3 hours; thus, it can be considered a very good alternative to RoBERTa or BERT. It is important to note that different pretrained models are needed for each subtask (monolingual and multilingual), as separate models are optimized for multilingual tasks.
In order to reduce training time, GPUs were used for model training and inference. All experiments were conducted on a Mac Studio machine, as detailed in the results section.
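Since the experiments ran on a Mac Studio, GPU acceleration in PyTorch typically goes through the Metal Performance Shaders (MPS) backend; a minimal device-selection sketch, assuming a PyTorch build with MPS support, is shown below.

```python
# Minimal device selection for a Mac Studio (Apple silicon GPU via MPS),
# falling back to CUDA or CPU on other machines.
import torch

if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print(f"Running on {device}")
```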
3.3. Experiments
The experimental setup involved preprocessing the dataset, feature engineering, and modeling using different transformer architectures.
We created a custom PyTorch Dataset class for loading the data and performing basic preprocessing steps (a sketch is given after the list below):
(1) Text Cleanup: Removing HTML tags, special characters such as # and @, punctuation, and multiple spaces.
(2) Basic preprocessing: Tokenization
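A minimal sketch of such a Dataset class, covering the cleanup and tokenization steps above, follows; the regular expressions, field handling, and 512-token limit are assumptions rather than the authors’ exact implementation.

```python
# A sketch of the custom PyTorch Dataset with the cleanup and tokenization steps
# described above; the regexes and the max_length value are assumptions.
import re
import torch
from torch.utils.data import Dataset

class DetectionDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts, self.labels = texts, labels
        self.tokenizer, self.max_length = tokenizer, max_length

    @staticmethod
    def clean(text):
        text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
        text = re.sub(r"[#@]", " ", text)          # special characters
        text = re.sub(r"[^\w\s]", " ", text)       # punctuation
        return re.sub(r"\s+", " ", text).strip()   # multiple spaces

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(self.clean(self.texts[idx]), truncation=True,
                             padding="max_length", max_length=self.max_length,
                             return_tensors="pt")
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "label": torch.tensor(self.labels[idx], dtype=torch.float)}
```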
For this survey, Bag of Words (BoW) and word-to-vector (word2vec) models were used.
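A short sketch of these two feature extractors, using scikit-learn and gensim with illustrative parameters and toy texts, could look as follows.

```python
# A sketch of Bag of Words and word2vec feature extraction; the toy texts and
# the gensim parameters (vector_size, window, min_count) are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

texts = ["an example human written sentence", "an example machine generated sentence"]

bow_matrix = CountVectorizer().fit_transform(texts)   # sparse BoW count matrix
w2v = Word2Vec([t.split() for t in texts],
               vector_size=100, window=5, min_count=1)
example_vector = w2v.wv["example"]                    # 100-dimensional word embedding
```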
For the pretrained setup, transformers (BERT-base, RoBERTa-base, RoBERTa-large, and DistilBERT-base-uncased for the monolingual subtask; XLM-RoBERTa-base, BERT-base-multilingual-cased, and DistilBERT-base-multilingual-cased for the multilingual subtask), combined with a custom classifier consisting of three layers with varying numbers of neurons, produced promising results.
To adapt the learning rate for different parameters, the Adaptive Moment Estimation (Adam) optimizer, in its AdamW variant, was chosen.
Since the model returns probabilities between 0 and 1, we use a 50% threshold for target classification. Predictions are generated for the given test dataset, which includes test IDs and sample targets, and a prediction file is produced from the model’s outputs. For the evaluation of both subtasks, we employ sklearn.metrics, calculating Accuracy (Acc), Precision (P), Recall (R), and F-score (also known as the F1 score or F-measure).
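A minimal sketch of this thresholding and scoring step is given below; the probability and gold-label values are placeholders, not real experiment outputs.

```python
# A sketch of the 50% thresholding and sklearn.metrics evaluation described
# above; `probs` and `gold` are placeholder values.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

probs = [0.91, 0.12, 0.67, 0.35]            # sigmoid outputs of the model
gold = [1, 0, 1, 1]                         # reference labels

preds = [int(p >= 0.5) for p in probs]      # 50% decision threshold
acc = accuracy_score(gold, preds)
p, r, f1, _ = precision_recall_fscore_support(gold, preds, average="binary")
print(f"Acc={acc:.2f}  P={p:.2f}  R={r:.2f}  F1={f1:.2f}")
```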
For the multilingual subtask, we use different pretrained models, selecting ones optimized for multilingual tasks.