2.3. Related Work
Several papers have already discussed Myanmar ASR. Wunna Soe presented a syllable-based speech recognition system for Myanmar (Fatima et al.; Wunna Soe & Yadana Thein, 2015). Syllable segmentation and a syllable-based n-gram approach were used to build the language model, while a Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM) were used to construct the acoustic model. The speech was recorded in the news domain using recording software such as WaveSurfer.
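To make the syllable-based n-gram idea concrete, the following minimal Python sketch counts syllable bigrams and computes an add-alpha smoothed probability. The toy corpus, placeholder syllables, and smoothing constant are illustrative assumptions, not details from the cited system.

```python
from collections import Counter

# Hypothetical syllable-segmented corpus; real systems segment Myanmar
# text with rule-based syllable breakers before counting n-grams.
corpus = [
    ["my", "an", "mar"],          # placeholder syllable sequences
    ["my", "an", "mar", "asr"],
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    syllables = ["<s>"] + sent + ["</s>"]
    unigrams.update(syllables)
    bigrams.update(zip(syllables, syllables[1:]))

def bigram_prob(prev, cur, alpha=1.0):
    """Add-alpha smoothed P(cur | prev) over the syllable vocabulary."""
    vocab = len(unigrams)
    return (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * vocab)

print(bigram_prob("my", "an"))
```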
Thin Thin Nwe showed how to recognize Myanmar speech using a hybrid Artificial Neural Network (ANN) and HMM (Gopi et al., 2022; Gouda et al., 2022; Thin Thin Nwe & Theingi Myint, 2015). This technique employed syllable-based segmentation. Feature extraction techniques included the Mel Frequency Cepstral Coefficient (MFCC), Linear Predictive Cepstral Coding (LPCC), and Perceptual Linear Prediction (PLP). Words were recognized using the hybrid ANN-HMM.
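As an illustration of the MFCC front end mentioned above, the sketch below extracts 13 MFCCs plus delta and delta-delta features with the librosa library. The file name, sampling rate, and feature dimensions are assumptions made for illustration; the cited work does not specify this toolchain.

```python
import librosa
import numpy as np

# Load an utterance (path is illustrative); librosa resamples to 16 kHz here.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame, the classic front end the cited systems compare
# against LPCC and PLP features.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# First and second temporal derivatives are usually appended so the
# frame vectors carry local dynamics into the ANN/HMM stages.
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])  # shape: (39, n_frames)
```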
Hay Mar Soe Naing presented a large-vocabulary continuous speech recognition system for Myanmar in the travel domain (Humayun et al., 2022; Hay Mar Soe Naing et al., 2015). Deep Neural Networks (DNN) were employed for acoustic modeling, and the acoustic model was extended with tonal features. For DNN training, the Cross-Entropy (CE) criterion and sequence-discriminative criteria such as State-level Minimum Bayes Risk (sMBR) were applied.
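A rough PyTorch sketch of frame-level CE training for a DNN acoustic model is given below. The layer sizes, the hypothetical 500 tied-state output layer, and the random stand-in minibatch are assumptions; sMBR training, being sequence-level, is not shown.

```python
import torch
from torch import nn

# Toy frame classifier: 39-dim feature frames -> hypothetical 500 tied
# HMM states. Real systems splice neighboring frames and use more layers.
model = nn.Sequential(
    nn.Linear(39, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 500),
)
criterion = nn.CrossEntropyLoss()          # frame-level CE criterion
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

frames = torch.randn(32, 39)               # stand-in minibatch of frames
states = torch.randint(0, 500, (32,))      # forced-alignment state labels

loss = criterion(model(frames), states)    # CE against aligned states
optimizer.zero_grad()
loss.backward()
optimizer.step()
```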
Ingyin Khaing presented continuous speech recognition for Myanmar using Dynamic Time Warping (DTW) and HMM (Lim et al., 2019; Ingyin Khaing, 2013). Feature extraction in this system used Linear Prediction Coefficient (LPC), MFCC, and Gammatone Cepstral Coefficient (GTCC) approaches. Additionally, DTW was applied to feature clustering to address the Markov model's lack of discrimination, and HMM was used for the recognition procedure.
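Since DTW is central to this approach, a compact Python implementation of the standard DTW distance between two feature sequences is sketched below. It is a textbook formulation with Euclidean frame distances, not necessarily the cited system's exact variant.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two feature sequences
    (each row is one frame), using Euclidean frame distances."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```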
Aye Nyein Mon, Win Pa Pa, and Ye Kyaw Thu tuned Convolutional Neural Network (CNN) hyperparameters for Myanmar ASR and improved its performance (Majid et al., 2021; Aye Nyein Mon et al., 2018). The localization property of the CNN reduces the number of neural network weights that must be learned, thereby decreasing overfitting. Additionally, the pooling operation is highly helpful in handling the small frequency shifts that speech signals frequently exhibit. Consequently, CNN hyperparameters such as the number of feature maps and the pooling sizes were varied to improve ASR accuracy for the Burmese language. Because Myanmar is a tonal and syllable-timed language, a large data set was used to construct the syllable-based Myanmar ASR. Two open test sets, web data and recorded data, were then used to assess the effectiveness of the word-based and syllable-based ASRs.
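To illustrate the kind of hyperparameters being tuned, the following PyTorch sketch defines a small CNN over spliced log-mel feature patches. The input shape, the numbers of feature maps, the kernel sizes, and the pooling size are assumed values, chosen only to show where such hyperparameters enter the architecture.

```python
import torch
from torch import nn

# Minimal CNN acoustic front end over log-mel "images"
# (1 channel, 40 frequency bands, 11 spliced frames). The feature map
# counts (32, 64) and the pooling size (2 along frequency) are exactly
# the kinds of hyperparameters the cited work varied.
cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(5, 3)), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 1)),     # pool along frequency only
    nn.Conv2d(32, 64, kernel_size=(3, 3)), nn.ReLU(),
    nn.Flatten(),
)

x = torch.randn(8, 1, 40, 11)             # batch of spliced feature patches
print(cnn(x).shape)                       # (8, 7168) feature vectors
```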
(Saeed et al., 2019; Shah et al., 2019; Yong et al., 2023) discuss the implementation of efficient smart street lights capable of monitoring crime and accidents. It can be inferred that the system likely uses some form of recognition technology, potentially image or pattern recognition, to detect and report incidents such as accidents or criminal activity. (Mallick et al., 2023) present a solution to the transportation problem of delivering drugs from drug factories to different warehouses, aiming to minimize both delivery time and transportation cost. The paper uses the Stepping Stone method for cost optimization and compares it with Vogel's method; this involves mathematical and computational recognition of the most efficient routes and schedules. In another paper, (Tabbakh et al., 2021; Mallick et al., 2023) discuss the minimization of costs in airline crew scheduling using an assignment technique. The paper presents a case study in which the airline schedule and crew schedule are optimized to minimize transportation cost, which amounts to recognizing the optimal assignment of crew members to flights.
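As a programmatic illustration of the assignment idea, the sketch below solves a toy crew-to-flight cost matrix with SciPy's Hungarian-method solver. The cost values are hypothetical, and linear_sum_assignment is a standard stand-in rather than the Stepping Stone or Vogel procedure used in the cited papers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative crew-to-flight cost matrix (entries are hypothetical);
# the Hungarian-method solver returns the assignment minimizing total cost.
cost = np.array([
    [4, 2, 8],
    [4, 3, 7],
    [3, 1, 6],
])
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)), cost[rows, cols].sum())
```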
(Zaman et al., 2021) present an ontological framework for information extraction from diverse scientific sources, which involves recognizing relevant information across sources and extracting it in a meaningful way. The paper uses deep learning to evolve numerous algorithms, suggesting that even low-resource languages can be given speech-to-text systems with a small quantity of data. (Hussain et al., 2021) discuss performance enhancement in wireless body area networks with secure communication; their solution to the secure-communication problem likely involves recognition systems that identify and mitigate potential security threats. The Hierarchically Aggregated Graph Neural Networks (HAGNN) proposed by (V. Singhal et al., 2020; Xu et al., 2023) capture different granularities of high-level representations of text content and fuse them with the rumor propagation structure. This approach could potentially be adapted for speech-to-text systems to capture different levels of linguistic features in the speech data and use them for more accurate transcription.
Lawrence Rabiner demonstrated that whole-word-reference-based connected word recognition algorithms have advanced to the point where they can achieve high recognition performance for small vocabularies, or syntax-restricted, moderate-sized vocabularies, in a speaker-trained mode (Rabiner et al., 1990, p. xx). In particular, it has been shown that very high string accuracy for a digit vocabulary can be achieved in a speaker-trained mode using either HMMs or templates as the digit reference patterns.
A voice processing and identification method for individually spoken Urdu words was introduced by S K Hasnain (Beg & Hasnain, 2008, p. xx). A collection of 150 unique samples obtained from 15 distinct speakers served as the foundation for speech feature extraction, and feed-forward neural networks for voice recognition were created in Matrix Laboratory (MATLAB). In this study, the author attempted to recognize spoken Urdu words using a neural network: the Discrete Fourier Transform (DFT) of the recorded data was used to train and test the voice recognition network, and the network's predictions were quite accurate.
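A rough Python analogue of this pipeline is sketched below: DFT magnitude features are computed from fixed-length recordings and fed to a feed-forward network. The random stand-in waveforms, the ten-class label set, and the scikit-learn network are assumptions for illustration; the original study used MATLAB.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical data: 150 fixed-length word recordings, as in the study;
# random noise stands in for real waveforms and labels here.
rng = np.random.default_rng(0)
waveforms = rng.standard_normal((150, 8000))
labels = rng.integers(0, 10, size=150)             # toy word identities

features = np.abs(np.fft.rfft(waveforms, axis=1))  # DFT magnitude features

# Small feed-forward network trained on the spectral features.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
clf.fit(features, labels)
print(clf.score(features, labels))
```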
According to Vikhe, accurate endpoint detection is crucial for isolated word recognition systems for two reasons: it is essential for reliable word recognition, and it reduces the amount of computation required to analyze the speech once the endpoints are accurately located (Vikhe, 2011). The experimental database consisted of the English digits zero through nine.
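A minimal sketch of energy-based endpoint detection is given below, assuming a simple short-time-energy threshold. The frame length and threshold ratio are illustrative assumptions; the cited work's actual detector may differ.

```python
import numpy as np

def find_endpoints(signal, frame_len=256, threshold_ratio=0.1):
    """Crude short-time-energy endpoint detector: returns the sample
    indices of the first and last frames whose energy exceeds a
    fraction of the peak frame energy."""
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        return None
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    active = np.nonzero(energy > threshold_ratio * energy.max())[0]
    if active.size == 0:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```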
This project aims to build a prototype that can conveniently convert Burmese speech into text. The project is an early attempt because it is based on a deep learning model and a small amount of data; the more data fed to the model, the better the prediction results. Almost all existing Burmese speech-to-text models were left half-finished, with no attempt to maintain them or improve the system. Models of this kind need a huge amount of data and must be maintained over time, and some are built on very old designs that no longer perform well. This project will be built with up-to-date techniques and continuously maintained based on user feedback. Likely challenges include collecting the data, storing and pre-processing a large amount of data, modeling with up-to-date techniques, and finally implementing the system on the website; the biggest challenge, however, is running the prototype and obtaining a satisfactory result. The prototype can be improved in the future with more high-quality data and with Natural Language Processing (NLP) techniques for more accurate prediction results.