Using the LSA Algorithm for Clustering Geometry Problems
Date
2020
Authors
Жежерун, Олександр
Борозенний, Сергій
Ніверовський, Микита
Abstract
The paper examines the LSA (latent semantic analysis) method, in particular its most widespread variant, which is based on the singular value decomposition (SVD) of a matrix. On this basis, a clustering algorithm has been implemented and applied to the example of clustering geometry problems.
A great many clustering algorithms exist today. The basic idea of most of them is to group similar objects into one class, or cluster, based on a similarity measure. As a rule, the choice of algorithm is determined by the task. For textual data, the components being compared are sequences of words and their attributes (for example, the weight of a word in the text, the type of named entity, sentiment, etc.). The texts are therefore first transformed into vectors, on which the subsequent operations are performed. This usually raises a number of problems: selecting the initial clusters, the dependence of clustering quality on text length, determining the total number of clusters, and so on. The most difficult problem, however, is the lack of connection between similar texts that use different vocabulary. In such cases, texts should be grouped not only on the basis of similarity but also on the basis of semantic contiguity, or associativity.
One method that helps to solve such problems is latent semantic analysis (LSA). LSA is an information-processing method that analyzes a set of documents, finds the terms that occur in them, and on this basis identifies the characteristic factors (topics) that describe the content of each document. Three types of correlation are distinguished: "word-word", "word-paragraph", and "paragraph-paragraph". These are the three kinds of comparison a person makes when relating parts of a text to its content. LSA takes into account not only how frequently words are used in a text but also latent (deep) connections. The article on automatic document classification [4], published in the Journal of the ACM in early 1963, was the first to describe factor analysis as a means of information retrieval. Factor analysis is a method that determines the relationships between the values of variables.
In this paper, the possibility of using latent semantic analysis for clustering texts (geometry problems) has been investigated, and an algorithm and the necessary software have been developed for this purpose.
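The pipeline outlined in the abstract (texts turned into vectors, reduced with SVD, then grouped by similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy problem texts, the raw term-frequency weighting, the choice of k = 2, and the threshold-based grouping step are all assumptions made for illustration. Only NumPy's `numpy.linalg.svd` is used from outside the standard library.

```python
import numpy as np

def term_document_matrix(docs):
    """Build a raw term-frequency matrix (rows: terms, columns: documents)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, vocab

def lsa_document_vectors(A, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Row j of the result holds the coordinates of document j: (S_k V_k^T)^T.
    return (s[:k, None] * Vt[:k, :]).T

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    Xn = X / norms
    return Xn @ Xn.T

def threshold_clusters(sim, threshold):
    """Assumed grouping step: connected components of the graph that links
    two documents whenever their similarity is at least `threshold`."""
    n = sim.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i, j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Hypothetical toy "geometry problems", reduced to keywords for brevity.
docs = [
    "triangle area",
    "triangle side area",
    "circle radius",
    "circle radius chord tangent",
]

A, vocab = term_document_matrix(docs)
doc_vectors = lsa_document_vectors(A, k=2)
labels = threshold_clusters(cosine_similarity_matrix(doc_vectors), threshold=0.5)
print(labels)  # [0, 0, 1, 1]
```

In this toy corpus the two groups share no vocabulary, so the two retained latent dimensions separate them exactly. On real problem texts one would typically add stop-word removal, stemming, and a weighting scheme such as TF-IDF before the decomposition, and the values of k and the threshold would need tuning.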
Keywords
LSA, LSI, SVD, clustering, article
Citation
Жежерун О. П. Використання алгоритму LSA для кластеризації задач із геометрії / Жежерун О. П., Борозенний С. О., Ніверовський М. М. // Наукові записки НаУКМА. Комп'ютерні науки. - 2020. - Т. 3. - С. 107-113.