Digit. Scholarsh. Humanit. | 2021

UDAT: Compound quantitative analysis of text using machine learning

 

Abstract


\n Computing machines allow quantitative analysis of large databases of text, providing knowledge that is difficult to obtain without using automation. This article describes Universal Data Analysis of Text (UDAT) —a text analysis method that extracts a large set of numerical text content descriptors from text files and performs various pattern recognition tasks such as classification, similarity between classes, correlation between text and numerical values, and query by example. Unlike several previously proposed methods, UDAT is not based on frequency of words and links between certain key words and topics. The method is implemented as an open-source software tool that can provide detailed reports about the quantitative analysis of sets of text files, as well as exporting the numerical text content descriptors in the form of comma-separated values files to allow statistical or pattern recognition analysis with external tools. It also allows the identification of specific text descriptors that differentiate between classes or correlate with numerical values and can be applied to problems related to knowledge discovery in domains such as literature and social media. UDAT is implemented as a command-line tool that runs in Windows, and the open source is available and can be compiled in Linux systems. UDAT can be downloaded from http://people.cs.ksu.edu/∼lshamir/downloads/udat.

Volume 36
Pages 187-208
DOI 10.1093/llc/fqaa007
Language English
Journal Digit. Scholarsh. Humanit.

Full Text