Alexey O. Shigarov
Russian Academy of Sciences
Publication
Featured research published by Alexey O. Shigarov.
Expert Systems With Applications | 2015
Alexey O. Shigarov
We propose an approach to table understanding using a rule engine. It is restricted by tasks of table analysis and interpretation. Spatial, style, and text information of tables is used for table understanding. Experimental results show the applicability of approach to a wide range of tables. The approach is designed for unstructured tabular data integration. The paper discusses issues on the conversion of tabular data from unstructured to structured form. Particularly, we propose an approach to table understanding (i.e. recovering semantic relationships in a table), which is designed for unstructured tabular data integration. Our approach is based on using a rule engine. It is assumed that spatial, style (typographical), and natural language information can be used for table analysis and interpretation. The CELLS system based on the approach has been developed for integrating unstructured tabular data presented in Excel spreadsheet format. Experimental results show that the approach and system can be applied to a wide range of tables from statistical and financial reports.
document engineering | 2016
Alexey O. Shigarov; Andrey Mikhailov; Andrey Altaev
Today, PDF is one of the most popular document formats on the web. Many PDF documents are not images, but remain untagged. They have no tags for identifying the logical reading order, paragraphs, figures, and tables. One of the challenges with these documents is how to extract tables from them. The paper discusses a new system for table structure recognition in untagged PDF documents. It is formulated as a set of configurable parameters and ad-hoc heuristics for recovering table cells. We consider two different configurations for the system and demonstrate experimental results based on the existing competition dataset for both of them.
Information Systems | 2017
Alexey O. Shigarov; Andrey Mikhailov
The paper discusses issues of rule-based data transformation from arbitrary spreadsheet tables to a canonical (relational) form. We present a novel table object model and rule-based language for table analysis and interpretation. The model is intended to represent a physical (cellular) and logical (semantic) structure of an arbitrary table in the transformation process. The language allows drawing up this process as consecutive steps of table understanding, i.e. recovering implicit semantics. Both are implemented in our tool for spreadsheet data canonicalization. The presented case study demonstrates the use of the tool for developing a task-specific rule-set to convert data from arbitrary tables of the same genre (government statistical websites) to flat file databases. The performance evaluation confirms the applicability of the implemented rule-set in accomplishing the stated objectives of the application.
international conference on information and software technologies | 2016
Alexey O. Shigarov; Viacheslav V. Paramonov; Polina V. Belykh; Alexander I. Bondarev
Arbitrary tables presented in spreadsheets can be an important data source in business intelligence. However, many of them have complex layouts that hinder the process of extracting, transforming, and loading their data in a database. The paper is devoted to the issues of rule-based data transformation from arbitrary tables presented in spreadsheets to a structured canonical form that can be loaded into a database by regular ETL-tools. We propose a system for canonicalization of arbitrary tables presented in spreadsheets as an implementation of our methodology for rule-based table analysis and interpretation. It enables the execution of rules expressed in our specialized rule language called CRL to recover implicit relationships in a table. Our experimental results show that particular CRL-programs can be developed for different sets of tables with similar features to automate table canonicalization with high accuracy.
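To make the notion of a canonical form concrete, the following is a minimal sketch under stated assumptions, not the CRL language or the system described in the abstract. It assumes the simplest possible layout (the first row holds column labels, the first column holds row labels) and flattens it into one record per data cell. The function name `canonicalize` and the record keys `row`, `column`, and `value` are hypothetical.

```python
from typing import Dict, List


def canonicalize(grid: List[List[str]]) -> List[Dict[str, str]]:
    """Flatten a simple cross-tab into flat records.

    Assumption (illustrative only): the first row contains column labels
    and the first cell of every following row contains the row label.
    Real spreadsheet tables are far less regular than this.
    """
    if not grid:
        return []
    header, *body = grid
    records = []
    for row in body:
        if not row:
            continue  # ignore completely empty rows
        row_label, *entries = row
        # Pair each data cell with the column label above it.
        for col_label, entry in zip(header[1:], entries):
            if entry == "":
                continue  # empty cells carry no value
            records.append(
                {"row": row_label, "column": col_label, "value": entry}
            )
    return records


if __name__ == "__main__":
    table = [
        ["", "2014", "2015"],
        ["Export", "10", "12"],
        ["Import", "7", ""],
    ]
    for record in canonicalize(table):
        print(record)
```

Each resulting record is a flat row that a regular ETL tool could load into a database table. Tables with merged cells, multi-level headers, or footnotes would need the kind of analysis rules the abstract refers to.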
international conference on information and software technologies | 2015
Alexey O. Shigarov
Today, a huge number of tables are presented in web pages, Word documents, and spreadsheets. Many of them are unstructured tabular data. They are intended to be understood by humans but not to be interpreted by machines. At the same time, we often need to have that information in a structured form, e.g. relational databases. We propose a rule-based approach to table analysis and interpretation and demonstrate how it can be applied to transform tabular data from unstructured (spreadsheets) to structured (relational databases) form. The paper discusses representing tabular data as facts in the working memory of a rule engine, a formal language for defining rules of table analysis and interpretation, and its implementation.
Pattern Recognition and Image Analysis | 2011
Alexey O. Shigarov; R. K. Fedorov
An algorithm for page layout analysis (segmentation) is suggested in the paper. It allows whitespace between text blocks to be detected on a document page. The algorithm could be used in document analysis and recognition problems. In particular, it can be used for column recognition in multicolumn text and tables. The suggested algorithm is quite simple for implementation.
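The idea of finding whitespace between text blocks for column recognition can be illustrated with a small sketch. This is not the algorithm from the paper; it is a simple projection-style approach chosen for illustration. It assumes text blocks are given as bounding boxes `(x0, y0, x1, y1)` and reports the horizontal intervals that no box covers. The function name `find_column_gaps` and the `min_gap` threshold are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed format


def find_column_gaps(
    boxes: List[Box], min_gap: float = 5.0
) -> List[Tuple[float, float]]:
    """Return horizontal intervals (start, end) not covered by any box.

    Only gaps at least `min_gap` wide are reported, so narrow spaces
    between words do not count as column separators.
    """
    if not boxes:
        return []
    # Project every box onto the x axis and sort by left edge.
    intervals = sorted((box[0], box[2]) for box in boxes)
    gaps = []
    covered_until = intervals[0][1]
    for left, right in intervals[1:]:
        if left - covered_until >= min_gap:
            gaps.append((covered_until, left))
        covered_until = max(covered_until, right)
    return gaps


if __name__ == "__main__":
    blocks = [
        (0, 0, 40, 10), (0, 12, 45, 22),      # left column
        (60, 0, 100, 10), (62, 12, 98, 22),   # right column
    ]
    print(find_column_gaps(blocks))
```

A gap returned by this function spans the full height of the given blocks, so on a real page it would typically be applied to one region (for example, a single table) at a time.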
Pattern Recognition and Image Analysis | 2009
Alexey O. Shigarov; I. V. Bychkov; G. M. Ruzhnikov; A. E. Khmel’nov
A method is proposed for detecting statistical tables. It takes metafiles as input data, which allows one to apply it to documents of different formats. In this method, the table detection process is viewed as a bottom-up segmentation of a document page, i.e., segmentation from simple elements of a page to more complicated ones. The experimental evaluation of the method shows that it is efficient as applied to a wide class of statistical tables.
international conference on information and software technologies | 2016
Viacheslav V. Paramonov; Alexey O. Shigarov; Gennady M. Ruzhnikov; Polina V. Belykh
Data cleansing is a crucial matter in business intelligence. We propose a new phonetic algorithm for string matching in the Russian language without transliteration from Cyrillic to Latin characters. It is based on the rules of sound formation in the Russian language. Additionally, we consider an extended algorithm for matching Cyrillic strings in which phonetic code letters are represented as primes, and the code of a string is the sum of these numbers. Experimental results show that our algorithms accurately match phonetically similar strings in the Russian language.
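The prime-based coding mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the published algorithm: the letter merges in `MERGE` and the ignored letters in `IGNORED` are a deliberately tiny, hypothetical stand-in for the paper's phonetic rules, and the assignment of primes to code letters is likewise an assumption. Only the final step (summing the primes of the code letters) follows the description above.

```python
from itertools import count

# Hypothetical merges of similar-sounding letters (NOT the paper's rules).
MERGE = {"б": "п", "в": "ф", "г": "к", "д": "т",
         "ж": "ш", "з": "с", "о": "а", "е": "и"}
# Hypothetical choice: soft and hard signs do not contribute to the code.
IGNORED = {"ь", "ъ"}

ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


def primes():
    """Yield 2, 3, 5, ... using trial division by earlier primes."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n


# Assign one prime to every phonetic code letter that survives merging.
CODE_LETTERS = sorted({MERGE.get(ch, ch) for ch in ALPHABET if ch not in IGNORED})
PRIME_OF = dict(zip(CODE_LETTERS, primes()))


def phonetic_sum(word: str) -> int:
    """Return the sum of the primes of the word's phonetic code letters."""
    total = 0
    for ch in word.lower():
        if ch in IGNORED:
            continue
        # Characters outside the Russian alphabet contribute nothing.
        total += PRIME_OF.get(MERGE.get(ch, ch), 0)
    return total


if __name__ == "__main__":
    # Under the assumed merges, these two words get the same code.
    print(phonetic_sum("код") == phonetic_sum("кот"))
```

Because addition ignores order, two words with the same code letters in a different order receive the same code, and unrelated letter sets can also share a sum, so a code match is best treated as a candidate for further comparison.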
international conference on information systems | 2018
Viacheslav V. Paramonov; Alexey O. Shigarov; Gennady Ruzhnikov; Evgeny Cherkashin
The use of phonetic similarity for comparing text strings and eliminating misprints is a significant issue in philology. It is widely used in automatic text checking. Currently, most phonetic algorithms are designed for processing English words. The quality of comparison may decrease for non-English languages, especially those that have rich morphology and use non-Latin alphabets, e.g. East Slavic languages with Cyrillic letters. We propose an approach to the phonetic comparison of Russian words. It is based on detecting letters and letter sequences that have similar pronunciation according to the rules of the language. The resulting phonetic representations of the words are coded by prime numbers. The paper considers the efficiency of the algorithm. The algorithm was adapted for phonetic processing of the Mongolian language.
international conference on information and software technologies | 2018
Evgeny A. Cherkashin; Alexey Kopaygorodsky; Ljubica Kazi; Alexey O. Shigarov; Viacheslav V. Paramonov
We consider tools for developing information systems using Model Driven Architecture (MDA) and Linked Open Data (LOD) technologies. The original idea of LOD is to allow software designers to develop software systems integrated by means of common ontologies and web protocols. The MDA Platform Independent Model (PIM) is expressed as a set of UML diagrams. The PIM forms a LOD graph and its namespace. All PIM entities are defined as ontology resources, i.e. with URI references to LOD terms. This allows us to translate the PIM UML model into a set of triples and store them in an ontology warehouse for further transformation into a Platform Specific Model (PSM). The ClioPatria ontology server and the SWI-Prolog language are used as tools for PIM and PSM storage, querying, and processing. The tools will allow us to combine the static MDA means of code generation and configuration at the development stage with techniques for flexible data structure processing at run time, thus yielding more productive techniques for information system development and maintenance. This research is in line with the current direction of Semantic Web Software Engineering.