The mBERT Model for Restoring Punctuation in Uzbek-Language Texts

Hushnudbek S. Adinaev

doi:10.29013/ESR-25-7.8-28-33


Rubric: Informatics

Journal: European Science Review №7-8 (2025)


Abstract

This study proposes an approach based on multilingual BERT (mBERT) for restoring punctuation in Uzbek-language texts. The main objective is to preserve the structural coherence of Uzbek texts by accurately reinserting punctuation marks. The mBERT model first predicts a punctuation mark for each token; each prediction is then compared with any punctuation already present in the text to determine whether the existing mark is correctly placed. For this purpose, we construct a dedicated Uzbek corpus in which the relationship between every word and its surrounding punctuation is explicitly annotated, and each text is labelled according to its morphological and syntactic features. A dataset derived from this corpus is then prepared for training the model.
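The predict-then-compare step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the label set (`COMMA`, `PERIOD`, `QUESTION`, `O`), the function names, and the `toy_predict` stand-in are all assumptions made for illustration. In a real system the `predict` callable would presumably wrap a fine-tuned mBERT token classifier; here a trivial rule takes its place so the example runs without any model.

```python
from typing import Callable, List, Tuple

# Assumed label set; the article does not list the labels it uses.
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
NONE = "O"  # label for "no punctuation after this word"


def split_words_and_labels(text: str) -> Tuple[List[str], List[str]]:
    """Strip one trailing punctuation mark from each whitespace token
    and record it as that word's existing label."""
    words, labels = [], []
    for token in text.split():
        label = NONE
        if len(token) > 1 and token[-1] in PUNCT:
            label = PUNCT[token[-1]]
            token = token[:-1]
        words.append(token)
        labels.append(label)
    return words, labels


def check_punctuation(
    text: str, predict: Callable[[List[str]], List[str]]
) -> List[Tuple[str, str, str, bool]]:
    """Compare existing punctuation with predicted labels.

    Returns (word, existing_label, predicted_label, agrees) per word.
    """
    words, existing = split_words_and_labels(text)
    predicted = predict(words)
    if len(predicted) != len(words):
        raise ValueError("predictor must return one label per word")
    return [(w, e, p, e == p) for w, e, p in zip(words, existing, predicted)]


def toy_predict(words: List[str]) -> List[str]:
    """Hypothetical stand-in for the model: a period after the last word,
    no punctuation anywhere else."""
    return [NONE] * (len(words) - 1) + ["PERIOD"] if words else []


if __name__ == "__main__":
    for row in check_punctuation("Men kitob, o'qidim.", toy_predict):
        print(row)
```

With the toy predictor, the comma after "kitob" is flagged as disagreeing with the prediction, while the other two positions agree. Passing the predictor in as a callable keeps the comparison logic independent of whichever model produces the labels.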

Keywords

punctuation marks; NLP; mBERT model; F1 metrics.


