Automatic Detection of Stop Words for Texts in the Uzbek Language

Khabibulla Madatov; Shukurla Bekchanov; Jernej Vičič

doi:10.20944/preprints202204.0234.v1

Submitted: 21 April 2022

Posted: 26 April 2022


Abstract

Stop words play an important role in information retrieval and text analysis. This study aimed to automatically analyze and detect stop words in Uzbek texts. Because few methods exist for the automatic detection of stop words in Uzbek texts, we analyzed a newly prepared corpus. Uzbek is an agglutinative language. As in all agglutinative languages, detecting stop words in Uzbek texts is more complex than in inflected languages. In inflected languages, auxiliary words, articles, and prepositions can be assigned to the stop word group, whereas in agglutinative languages the meanings of such words are hidden in the text. It is therefore not appropriate to apply every known stop word detection method for inflected languages directly to agglutinative languages. In this work, we investigated the “School corpus”, which contains 731,156 Uzbek words. We applied the bigram method of analysis to the corpus, proposed a collocation method for detecting its stop words, and proposed a method for automatically detecting stop words in Uzbek texts. The results show that the collocation method is 6 times better than the bigram method.
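To make the idea of bigram analysis concrete, the following minimal Python sketch counts adjacent word pairs in a token sequence and ranks words by how many distinct neighbours they co-occur with. The ranking heuristic, the function names `bigram_counts` and `stop_word_candidates`, and the toy token sequence are assumptions chosen for illustration; they are not taken from the paper and should not be read as the authors' bigram or collocation method.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def stop_word_candidates(tokens, top_n):
    """Rank words by their number of distinct bigram neighbours.

    Assumption (illustrative heuristic, not the paper's method):
    a word that combines with many different neighbours behaves
    like a function word and is a plausible stop word candidate.
    """
    neighbours = defaultdict(set)
    for left, right in bigram_counts(tokens):
        neighbours[left].add(right)
        neighbours[right].add(left)
    # Sort by descending neighbour count; break ties alphabetically
    # so the result is deterministic.
    ranked = sorted(neighbours, key=lambda w: (-len(neighbours[w]), w))
    return ranked[:top_n]


# Toy token sequence (hypothetical, not drawn from the School corpus).
tokens = "bu kitob va bu daftar va bu qalam".split()
print(bigram_counts(tokens).most_common(1))
print(stop_word_candidates(tokens, 2))
```

On a real corpus the token list would come from a tokenizer suited to Uzbek orthography, and any frequency or neighbour threshold would need to be tuned against a reference stop word list.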

Keywords: stop word detection; Uzbek language; agglutinative language; algorithm

Subject: Computer Science and Mathematics - Computer Science

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
