In this section, we present the experimental process and discuss the results obtained in this study. The experiments were conducted on the proposed scenarios using a dataset of YouTube comments, in line with the goal of the proposed method: a self-learning, automated annotation process that achieves optimal accuracy with a minimal amount of labeled data. Model performance was assessed using a confusion matrix.
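The accuracy derived from a confusion matrix is the fraction of correctly labeled comments among all comments. A minimal sketch of this computation, using illustrative counts rather than values from the experiments, is:

```python
def confusion_matrix_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from 2x2 confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Illustrative counts (not taken from the paper's experiments)
acc, prec, rec = confusion_matrix_metrics(tp=45, fp=5, fn=5, tn=45)
print(round(acc, 2))  # 0.9
```

The same counts also yield precision and recall, which the confusion matrix makes available at no extra cost.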
3.2. Experimental Results of the Machine Learning-Based Approach
This section presents the results and an analysis of the performance of machine learning methods combined with the meta-learner concept, based on meta-vectorization feature extraction. The experiments followed the processing scenarios shown in Table 1, grouped by the percentage of labeled training data: 20%, 10%, and 5%.
The experiments aimed to determine the meta-learners' performance in automatic annotation using the least possible amount of labeled training data, thereby optimizing the annotation process. With only a small amount of labeled data, the annotation process can then label the remaining data itself through self-learning. In this self-learning setting, we employ a threshold parameter to identify similarities with the training dataset, ensuring that the generated annotations align closely with the training and testing phases. Accuracy was used to measure the correctness of the annotation process.
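One way to realize the threshold step described above is a confidence filter: a pseudo-label is accepted into the labeled pool only when its score reaches the threshold. The sketch below is a simplified illustration; the toy keyword predictor stands in for the actual trained model, and the data are invented:

```python
def self_learning_round(labeled, unlabeled, predict_with_confidence, threshold=0.9):
    """One self-learning round: pseudo-label each unlabeled text and keep
    only predictions whose confidence reaches the threshold."""
    still_unlabeled = []
    for text in unlabeled:
        label, confidence = predict_with_confidence(text)
        if confidence >= threshold:
            labeled.append((text, label))     # accepted pseudo-label
        else:
            still_unlabeled.append(text)      # retry in a later round
    return labeled, still_unlabeled

# Toy predictor: texts containing "hate" are flagged with high confidence.
toy = lambda t: ("hate_speech", 0.95) if "hate" in t else ("neutral", 0.5)
labeled, rest = self_learning_round([], ["i hate you", "nice video"], toy)
print(labeled, rest)  # [('i hate you', 'hate_speech')] ['nice video']
```

Raising the threshold trades annotation coverage per round for higher confidence in the accepted labels, which is why the experiments sweep thresholds from 0.6 to 0.9.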
Figure 3 presents the accuracy of each combination of a machine learning algorithm (SVM, DT, KNN, or NB) with a feature extraction method (TF-IDF or Word2Vec). The evaluation uses manual annotations of 20% of the 13,169 comments; the automatically generated annotations are then used for the initial training.
Figure 3 shows that, in the 20% initially labeled data scenario, the machine learning models can perform automatic annotation with varying accuracy; the highest value, 90%, was obtained by the SVM-Word2Vec method at a threshold of 0.9. The accuracy of each individual method is visible in the chart in Figure 3. The evaluation showed that the worst-performing method in all scenarios was KNN, combined with either feature extraction algorithm (TF-IDF or Word2Vec); all other methods achieved higher accuracies. The accuracy of the KNN-TF-IDF method increased from 59.68% to 61.3% (+1.62 p.p.) compared to Cahyana et al. [5]. This improvement reflects the refined preprocessing procedures relative to the prior research and was validated through the experimental scenarios shown in Table 1. The Indonesian-language preprocessing used the Sastrawi library for Python for stemming. In addition, we added several stop-words used when calculating the average vector features; note, however, that these stop-words are not employed in the Word2Vec method, owing to the fundamentally different way it computes its representations. Furthermore, the SVM-TF-IDF and SVM-Word2Vec methods, as implemented by Saifullah et al. [54], were assessed under the same conditions and achieved accuracies of 63.3% and 89%, respectively. As a result, feature extraction using TF-IDF increased the accuracy of automatic annotation by 19.5%. This is also influenced by the application of the ensemble concept and the additional stop-word data in the preprocessing. The Word2Vec method achieved near-optimal results; adding a threshold of 0.9 raises its accuracy to 90%, a slight 1 p.p. increase.
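The average-vector features mentioned above can be illustrated by averaging per-token vectors while skipping stop-words and out-of-vocabulary tokens. The sketch below uses toy 2-dimensional vectors and an invented stop-word list; the real experiments use trained embeddings and the curated Indonesian stop-word set:

```python
def average_vector(tokens, vectors, stop_words):
    """Average the per-token vectors of a comment, skipping stop-words
    and tokens absent from the vocabulary."""
    kept = [vectors[t] for t in tokens if t not in stop_words and t in vectors]
    if not kept:  # no usable tokens: return a zero vector of the right size
        return [0.0] * len(next(iter(vectors.values())))
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

# Toy 2-d vectors and an illustrative Indonesian stop-word ("yang").
vecs = {"benci": [1.0, 0.0], "kamu": [0.0, 1.0], "yang": [0.5, 0.5]}
print(average_vector(["benci", "yang", "kamu"], vecs, {"yang"}))  # [0.5, 0.5]
```

Filtering stop-words before averaging keeps frequent function words from dragging every comment's vector toward the same region of the feature space.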
Based on these results, the 20% labeled data scenario had not yet reached optimal accuracy, given that several hate speech detection methods report high accuracies [55], even above 95% [45]. We therefore conducted trials that reduced the amount of labeled data to 10% (Figure 4) and 5% (Figure 5) of the total YouTube comment data, aiming to optimize automatic annotation with a minimal amount of labeled data. The text annotation results improved for several methods, such as DT-TF-IDF and KNN-Word2Vec, which obtained accuracies of 90% or more across all threshold experiments (DT-TF-IDF) and at thresholds of 0.7, 0.8, and 0.9 (KNN-Word2Vec). In addition, the accuracy of SVM-Word2Vec increased to 91.9% in the scenario with 10% labeled data, 90% unlabeled data, and a threshold of 0.7. The notable accuracy improvement when reducing the labeled data from 20% to 10% can be seen by comparing Figure 3 and Figure 4. This increase stems from the way the weights of the available data are calculated: the smaller the amount of labeled data, the higher the accuracy of the process. Moreover, the self-learning performance of the meta-learner and meta-vectorizer relies on an ensemble approach to optimize the calculated weights, resulting in enhanced performance.
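The ensemble weighting described above can be illustrated with a simple accuracy-weighted vote among the base learners. The weights and predictions below are invented for illustration; they are not the values produced by the actual meta-learner:

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine base-learner predictions using per-learner weights
    (e.g., validation accuracies) and return the winning label."""
    scores = defaultdict(float)
    for learner, label in predictions.items():
        scores[label] += weights[learner]
    return max(scores, key=scores.get)

# Illustrative per-learner weights and predictions for one comment.
w = {"SVM": 0.90, "DT": 0.85, "KNN": 0.61, "NB": 0.70}
p = {"SVM": "hate_speech", "DT": "hate_speech", "KNN": "neutral", "NB": "neutral"}
print(weighted_vote(p, w))  # hate_speech
```

Weighting by validation accuracy lets strong learners such as SVM dominate ties, which matches the observation that the ensemble outperforms its weakest members.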
The improvements of the selected methods over previous studies [5,54] served as a reference for this research. Building on the methodologies proposed in the previous study [5], this work explores various scenarios: reducing the labeled data to 5%, 10%, and 20%, in conjunction with threshold values of 0.6, 0.7, 0.8, and 0.9. The annotation process is enhanced through meta-vectorization and meta-learning methods. Across the threshold variations, the highest accuracy was consistently obtained at the largest threshold, 0.9, for all percentages of labeled data except the 10% scenario. This is attributed to the robust performance of SVM in various contexts, as documented in references [56,57] and in the present research, as evidenced by Figure 3, Figure 4, and Figure 5. These comparisons are readily evident in Figure A1 and Table A1. Notably, SVM consistently exhibits increasing accuracy. However, while the overall accuracy results demonstrate an increase, certain methods see a reduction of up to 8% in specific scenarios. A detailed comparison of the increases and decreases in method accuracy per scenario is given in Table A2.
Based on the results of the 10% labeled data scenario, the final experiment applied a 5% labeled data scenario for training, with each of the existing thresholds. Remarkably, each scenario shows a systematic progression across thresholds from the smallest to the largest, and the accuracy of this scenario is best at a threshold of 0.9. The highest accuracy was obtained by DT-TF-IDF at 97.1%, followed by KNN-Word2Vec (96.9%), SVM-Word2Vec (96.8%), SVM-TF-IDF, and DT-Word2Vec (93.4%), with the other methods below 90%. These findings indicate that varying the quantity of labeled data and adjusting the threshold value had a discernible impact on the methods' accuracy, leading to improved performance. In Figure 5, DT-TF-IDF consistently achieves an accuracy exceeding 94%, surpassing both KNN-Word2Vec and SVM-Word2Vec (exact accuracy values are given in Table A1).
Based on the presented scenarios and outcomes, this study adopts a semi-supervised learning approach, as illustrated in Figure 1. It involves annotating the unlabeled data and substantially refining the sample dataset, ultimately leading to automatic hate speech annotation. The process implements a self-learning mechanism that leverages prior learning experience to address the limitations of manual annotation. In terms of algorithm design, our proposed approach is shown in Listing 1. It delineates an approach that uses minimal training data to maximize accuracy, achieved through a threshold parameter designed to align the training dataset with the annotation process.
Listing 1. Text Auto-Annotation Based on Semi-Supervised and Self-Learning Approach

STEPS:
1. Dataset input:
   a. Annotated Dataset (20%), with training scenarios of 5%:10%:20%
   b. UnAnnotated Dataset (80%)
2. Preprocessing:
   a. Annotated Dataset -> Semi Structured Annotated Dataset
   b. UnAnnotated Dataset -> Semi Structured UnAnnotated Dataset
3. Vectorization using the merged Semi Structured Annotated Dataset + UnAnnotated Dataset: TF-IDF, Word2Vec
4. Put 10% of the Semi Structured Annotated Dataset aside as the sample dataset
5. Create classifier models with 80% training data and 20% validation data, using SVM, DT, KNN, and NB
6. Get the best model MTA: vectorizer + machine learning (with best parameters)
========== VALIDATE SEMI-SUPERVISED ANNOTATION BEST MODEL USING SSADF ==========
7. Convert Semi Structured Annotated Dataset -> Semi Structured Annotated Dataset FEATURE (SSADF)
8. Split SSADF into 80% training data (SSADFT) and 20% validation data (SSADFV)
9. Create the MTA model using SSADFT (as training data)
10. Validate the MTA model using SSADFV (as validation data)
========== SEMI-SUPERVISED ANNOTATION USING THE BEST MODEL ==========
11. Convert Semi Structured UnAnnotated Dataset -> Semi Structured UnAnnotated Dataset FEATURE (SSUDF)
12. AnnotationResult = annotate SSUDF using the MTA model with SSADFT
13. percentage = validate AnnotationResult using SSADFV
14. Save the valid AnnotationResult entries and merge them into SSADF
15. IF percentage < threshold THEN go to step 7
16. Output: SSADF
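The iterative loop of Listing 1 (steps 7-15) can be sketched in Python. The sketch below replaces the actual vectorizer and MTA model with toy keyword-based stand-ins, and uses a deterministic 80/20 split for brevity; only the control flow mirrors the listing:

```python
def auto_annotate(annotated, unannotated, train, predict, validate,
                  threshold=0.9, max_rounds=10):
    """Sketch of Listing 1's loop: train the best model (MTA) on the
    annotated pool (SSADF), validate it on a held-out 20% split (SSADFV),
    pseudo-label the unannotated pool (SSUDF), merge the results back,
    and repeat until the validation percentage reaches the threshold."""
    for _ in range(max_rounds):
        split = max(1, int(0.8 * len(annotated)))
        train_set, val_set = annotated[:split], annotated[split:]  # SSADFT / SSADFV
        model = train(train_set)
        percentage = validate(model, val_set)
        results = [(text, predict(model, text)) for text in unannotated]
        annotated, unannotated = annotated + results, []  # merge into SSADF
        if percentage >= threshold:
            break
    return annotated

# Toy components: a keyword "model" standing in for vectorizer + classifier.
train = lambda data: "hate"
predict = lambda m, t: "hate_speech" if m in t else "neutral"
validate = lambda m, val: (sum(predict(m, t) == y for t, y in val) / len(val)
                           if val else 1.0)

seed = [("i hate you", "hate_speech"), ("nice song", "neutral")]
out = auto_annotate(seed, ["they hate us", "love it"], train, predict, validate)
print(len(out))  # 4
```

Each round regenerates the SSADF features from the grown annotated pool, which is why Listing 1 loops back to step 7 rather than to the initial input steps.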