Text classification software system based on machine learning and recurrent neural network

Authors: Глибовець, Андрій; Дубовик, Андрій; Афонін, Андрій
Record date: 2026-02-04
Published: 2025
Citation: Глибовець А. М. Програмна система класифікації текстів на основі машинного навчання та рекурентної нейронної мережі / Глибовець А. М., Дубовик А. В., Афонін А. О. // Наукові записки НаУКМА. Комп'ютерні науки. - 2025. - Т. 8. - С. 15-27. - https://doi.org/10.18523/2617-3808.2025.8.15-27
ISSN: 2617-3808; 2617-7323
DOI: https://doi.org/10.18523/2617-3808.2025.8.15-27
URI: https://ekmair.ukma.edu.ua/handle/123456789/38255

In the context of rapid advances in information technology and the exponential growth of accessible data, managing data analysis efficiently has become increasingly critical. One of the key challenges in this domain is automatic text classification, the process of assigning texts to specific categories. This task is complicated by the diversity of formats, structural variability, and the inherent semantic complexity of natural language. Addressing these challenges requires robust algorithms and effective natural language processing (NLP) techniques. This paper describes the development and testing of a software system for automatic text classification, which categorizes texts into predefined classes and supports texts written in Ukrainian. Our application is based on three models: Naive Bayes, Support Vector Machine (SVM), and an LSTM-based Recurrent Neural Network (RNN) architecture, as well as their combinations. It classifies texts quickly and accurately, gives users a convenient way to train the system on their own datasets, and makes it easy to configure parameters for optimal results. To process input data efficiently and implement the classification algorithm, we chose the Python programming language. The core libraries behind the application's functionality are TensorFlow, scikit-learn (for its simple and intuitive interface), Natural Language Toolkit (nltk), NumPy, and Pandas. Matplotlib and Seaborn were used for data visualization and plotting.
The developed graphical application can recognize texts (in English or Ukrainian) in four categories ("World", "Sports", "Science/Technology", "Business") with an accuracy of approximately 92%. For model training, we used the "AG News Classification Dataset" from Kaggle.com. Testing confirmed the hypothesis that specialized models, in addition to being significantly more resource-efficient, can also outperform large language models (LLMs) in text classification tasks. The system can also be quickly adapted for spam filtering: within seconds, it is possible to obtain an SVM model that identifies typical spam messages with about 99% accuracy. We also tested the system's ability to detect emotional tone in texts, achieving an accuracy of 87.75%.

This work was supported by a grant from the Simons Foundation (SFI-PD-Ukraine-00014577; T.S.).

Abstract (translated from Ukrainian): This paper describes the construction and testing results of a software system for automatic text classification, that is, the assignment of texts to given categories, including texts in Ukrainian. Our application is built on three models: Naive Bayes, Support Vector Machine, and LSTM, a Recurrent Neural Network (RNN) architecture, as well as their combinations. It classifies texts fairly quickly and accurately, lets users conveniently train the system on their own data, and makes it fairly simple to tune parameters for optimal results. To process input data efficiently and implement the classification algorithm, we chose the Python programming language. The main libraries implementing the application's functionality are TensorFlow, scikit-learn (for its simple and clear interface), Natural Language Toolkit (nltk), NumPy, and Pandas. Matplotlib and seaborn were used for data visualization and plotting.
The developed graphical application can recognize texts (in English or Ukrainian) in four categories (World, Sports, Science / Technology, Business) with an accuracy of about 92%. To train the models we used the AG News Classification Dataset from kaggle.com. Testing of the application confirmed the assumption that specialized models, besides being considerably more resource-efficient, can also show better results in text classification than LLMs. The system can also be quickly adapted to spam filtering. Within a few seconds one can obtain an SVM model that recognizes typical spam messages with an accuracy of about 99%. The system's ability to recognize the emotional tone of text was tested as well, reaching an accuracy of 87.75%.

Language: uk
Keywords: automatic text classification; Naive Bayes; Support Vector Machine; LSTM; Recurrent Neural Network; machine learning; Python; TensorFlow; AG News Classification Dataset
Title (Ukrainian): Програмна система класифікації текстів на основі машинного навчання та рекурентної нейронної мережі
Title (English): Text classification software system based on machine learning and recurrent neural network
Type: Article
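
Of the three models named in the abstract, Naive Bayes is simple enough to sketch without any libraries. The following is a minimal, dependency-free illustration of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not the authors' implementation, which according to the abstract builds on scikit-learn and TensorFlow. The class name `NaiveBayesTextClassifier`, the `tokenize` helper, and the toy training sentences are all assumptions invented for this example; only the category names "Sports" and "Business" come from the record.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # Number of training documents per class, used for the class priors.
        self.class_counts = Counter(labels)
        # Per-class word frequencies, used for the word likelihoods.
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Work in log space to avoid underflow on long texts.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                smoothed = self.word_counts[label][token] + 1
                score += math.log(smoothed / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for illustration (not taken from AG News).
train_texts = [
    "team wins the championship match",
    "striker scores late goal in final",
    "stocks fall as markets react to rates",
    "company reports record quarterly profit",
]
train_labels = ["Sports", "Sports", "Business", "Business"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("goal in the final match"))  # Sports
print(model.predict("markets and profit"))       # Business
```

A real system would replace the whitespace tokenizer with proper preprocessing (for example via nltk) and train on a full dataset. The sketch only shows how class priors and smoothed word likelihoods combine into a prediction.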