4.2. Results
As mentioned in Chapter 3, this project performed four tests, with at least 60 attempts made in testing the model. The results are shown in Table 4.1, Table 4.2, Table 4.3, and Table 4.4. The reported scores are the detection scores calculated by the model, which measure how successfully it recognizes the given sign. Tables 4.1 through 4.4 present the results of Test 1 through Test 4 respectively, broken down by the orientation of the subject and object towards the device's camera.
Table 4.1. Test 1 Results.

| Class | Left | Straight | Right | Average Score |
|-------|-------|----------|-------|---------------|
| hello | 87.00 | 95.20 | 49.60 | 77.27 |
| no | 83.40 | 88.60 | 61.00 | 77.67 |
| yes | 72.20 | 93.80 | 54.00 | 73.33 |
| sorry | 45.80 | 92.40 | 71.20 | 69.80 |
| thanks | 52.40 | 96.00 | 73.60 | 74.00 |
| Total Average Score | | | | 74.41 |
Table 4.2. Test 2 Results.

| Class | Left | Straight | Right | Average Score |
|-------|-------|----------|-------|---------------|
| hello | 66.40 | 83.00 | 67.20 | 72.20 |
| no | 0.00 | 73.00 | 13.00 | 28.67 |
| yes | 59.80 | 91.80 | 59.80 | 70.47 |
| sorry | 15.00 | 16.60 | 12.20 | 14.60 |
| thanks | 24.60 | 96.00 | 84.40 | 68.33 |
| Total Average Score | | | | 50.85 |
Table 4.3. Test 3 Results.

| Class | Left | Straight | Right | Average Score |
|-------|-------|----------|-------|---------------|
| hello | 52.60 | 60.40 | 12.20 | 41.73 |
| no | 64.20 | 87.60 | 12.00 | 54.60 |
| yes | 0.00 | 92.00 | 18.40 | 36.80 |
| sorry | 64.80 | 86.80 | 44.80 | 65.47 |
| thanks | 0.00 | 93.00 | 27.40 | 40.13 |
| Total Average Score | | | | 47.75 |
Table 4.4. Test 4 Results.

| Class | Left | Straight | Right | Average Score |
|-------|-------|----------|-------|---------------|
| hello | 78.40 | 76.00 | 0.00 | 51.47 |
| no | 26.60 | 50.60 | 0.00 | 25.73 |
| yes | 0.00 | 80.20 | 14.80 | 31.67 |
| sorry | 27.20 | 86.60 | 0.00 | 37.93 |
| thanks | 90.20 | 96.00 | 82.80 | 89.67 |
| Total Average Score | | | | 47.29 |
As described in the previous chapter, Test 1 (Table 4.1) tested the model on five classes in an environment where both the environment and subject backgrounds are similar to those in the dataset, with the object presented at various orientations towards the camera. Table 4.2 shows the results of Test 2, which uses the same environment background as the dataset but a different subject background, again across the various object orientations. Table 4.3 and Table 4.4 show the results of Test 3 and Test 4 respectively, where the model is tested against a different environment background in various orientations. As discussed earlier, each score is the average of the model's detection scores over five attempts per class.
Multiple tests were conducted and the results recorded in the tables above. They show that Test 1 achieved the highest average score among all the tests, a success rate of 74.41 percent. Test 2, Test 3, and Test 4 scored lower, at 50.85, 47.75, and 47.29 percent respectively. The class "thanks" with a straight orientation produced the highest score across all the tests, which suggests that the model performed particularly well when recognizing the "thanks" class and that this orientation had a positive impact on performance. The highest score achieved across all results is 96 percent, indicating that under the right conditions the model can accurately recognize the given sign.
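For clarity, the sketch below illustrates how the tabulated averages follow from the raw per-attempt detection scores. This is a minimal reconstruction, assuming the scores are collected into a nested mapping; the variable names and the two example entries (values taken from Table 4.1) are illustrative, not the project's actual code.

```python
# Minimal sketch: averaging raw detection scores into the tabulated values.
# Each inner list would hold the detection scores from the five attempts
# for that class and orientation; single values are shown for brevity.
scores = {
    "hello":  {"left": [87.0], "straight": [95.2], "right": [49.6]},
    "thanks": {"left": [52.4], "straight": [96.0], "right": [73.6]},
}

def class_average(per_orientation):
    """Average a class's scores across the three orientations."""
    means = [sum(v) / len(v) for v in per_orientation.values()]
    return sum(means) / len(means)

class_avgs = {cls: class_average(o) for cls, o in scores.items()}
total_avg = sum(class_avgs.values()) / len(class_avgs)

for cls, avg in class_avgs.items():
    print(f"{cls}: {avg:.2f}")          # hello: 77.27, thanks: 74.00
print(f"Total average score: {total_avg:.2f}")
```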
4.3. Evaluation
Test 1 and Test 2 indicate that the model can still perform well when the environment and the orientation of the object are suitable. Across all the tests, however, the straight orientation outperforms the other orientations. The worst performance was observed in Test 4, particularly when the test was conducted with a right orientation. This implies that the model had difficulty recognizing objects in the right orientation, or that the training data for this orientation was not sufficiently diverse or representative. According to Rosebrock (2018) and Bengio et al. (2016), multiple factors drive the performance of deep learning models, including the quality and quantity of the data, the architecture of the model, data preprocessing and augmentation, and optimization.
To elaborate further, this suggests that the collected dataset is largely homogeneous in terms of orientation: most of the images were taken with the subject facing straight towards the camera, so the dataset depicts the objects in a similar way. This homogeneity can have a significant impact on the performance of a deep learning model trained on the data. A diverse training dataset provides more descriptive and distinct information for the model to learn from, allowing it to make more accurate predictions on new, unseen data (Gong et al., 2019). If the majority of the images depict the objects in a straight orientation, as the results tables indicate, the model may perform well when recognizing objects in that orientation but struggle to recognize the same objects in other orientations. Beyond improving the accuracy of the model, there is also potential to bring this technology into the field of cybersecurity (Hussain et al., 2021).
Shorten and Khoshgoftaar (2019) highlight the importance of data diversity in deep learning, specifically for computer vision tasks, and show how data augmentation can improve the performance of these models by increasing the diversity of the training data. They also note that a significant challenge in applying deep learning to computer vision is enhancing the generalization capability of the models, that is, their ability to produce accurate outcomes on new and unseen data (Mallick et al., 2023). Data augmentation is presented as an effective solution to this challenge because it reduces the difference between the training and validation sets and any future testing sets (Anwesa et al., 2015).
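As a sketch of the augmentation strategy these authors describe, the example below uses Keras preprocessing layers to inject orientation and lighting variety into a training pipeline. The layer choices and parameter values are assumptions for illustration, not the configuration used in this project; horizontal flips are deliberately omitted because some signs depend on handedness.

```python
import tensorflow as tf

# Illustrative augmentation pipeline (Keras preprocessing layers).
# Parameter values are assumed for demonstration, not taken from this project.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.08),         # rotate up to ~30 degrees
    tf.keras.layers.RandomTranslation(0.1, 0.1),  # shift the subject off-center
    tf.keras.layers.RandomZoom(0.1),              # vary apparent distance
    tf.keras.layers.RandomBrightness(0.2),        # vary lighting conditions
])

# Applied on the fly during training, e.g. on a tf.data pipeline:
# train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```

Augmenting in this way exposes the model to orientations and conditions that are underrepresented in the raw dataset without requiring additional data collection.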
Accordingly, the trained model's performance depends on the diversity of the dataset. For instance, as the test results show, when the model is trained on a dataset where most images are taken with a straight orientation, it sees few examples of objects in a right orientation and therefore does not learn to recognize them in that orientation. Consequently, the model may struggle to identify objects in unfamiliar orientations in real-world scenarios, leading to less accurate recognition. Such a model may also need a larger dataset for training.
As previously mentioned, multiple factors impact the performance of the model, including the quantity, quality, and diversity of the data, as well as the model architecture, optimization, preprocessing, and hyperparameter tuning. However, this paper only discusses the impact of the data and the architecture on model performance. MobileNetV1 is considered a lightweight model with a low number of parameters, making it suitable for applications that require low computational cost while still maintaining high efficiency and performance. However, as highlighted by Sandler et al. (2018), the model has certain limitations: its separable convolution architecture has difficulty handling high-dimensional feature maps, which hurts accuracy and performance, and it tends to lose spatial information, leading to lower performance.
In concurrence with previous studies, Ma et al. (2018) and Kasera et al. (2019) have both emphasized that the use of separable convolution can result in memory constraints and negatively impact the computational efficiency of the model, reducing its overall performance. To address this limitation, the authors of these studies proposed optimizations to the pointwise convolution component of the model. The efficient nature of MobileNetV1 presents a compelling advantage; however, the trade-off between efficiency, characterized by reduced use of computational resources, and performance is a prevalent issue that can be addressed through modifications to the model in future work.
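To make the architectural point concrete, the sketch below contrasts a standard 3x3 convolution with the depthwise separable block used throughout MobileNetV1 and compares their parameter counts. It is a generic Keras reconstruction for illustration, not this project's code; the 32x32 input size is arbitrary and does not affect the counts.

```python
import tensorflow as tf

def standard_block(c_in, c_out):
    """Standard 3x3 convolution: one dense kernel across all input channels."""
    m = tf.keras.Sequential([tf.keras.layers.Conv2D(c_out, 3, padding="same")])
    m.build((None, 32, 32, c_in))
    return m

def separable_block(c_in, c_out):
    """MobileNetV1-style block: 3x3 depthwise conv, then 1x1 pointwise conv."""
    m = tf.keras.Sequential([
        tf.keras.layers.DepthwiseConv2D(3, padding="same"),  # spatial filtering, per channel
        tf.keras.layers.Conv2D(c_out, 1),                    # pointwise: mixes channels
    ])
    m.build((None, 32, 32, c_in))
    return m

print(standard_block(64, 128).count_params())   # 73,856 parameters
print(separable_block(64, 128).count_params())  # 8,960 parameters (~8x fewer)
```

The pointwise (1x1) convolution dominates the parameter count and memory traffic of the separable block, which is why it is the component that the optimizations proposed by Ma et al. (2018) and Kasera et al. (2019) target.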