Xiaofan Lin | Researchain

Publication

Featured researches published by Xiaofan Lin.

international conference on document analysis and recognition | 2011

Mathematical Formula Identification in PDF Documents

Xiaoyan Lin; Liangcai Gao; Zhi Tang; Xiaofan Lin; Xuan Hu

Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions in image-based documents. In this paper, we propose a novel method by combining rule-based and learning-based methods to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas, including geometric layout, character and context content, are used to adapt to a wide range of formula types. Experimental results show satisfactory performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.

acm/ieee joint conference on digital libraries | 2011

Structure extraction from PDF-based book documents

Liangcai Gao; Zhi Tang; Xiaofan Lin; Ying Liu; Ruiheng Qiu; Yongtao Wang

Nowadays PDF documents have become a dominant knowledge repository for both academia and industry, largely because they are very convenient to print and exchange. However, methods for automated structure information extraction are yet to be fully explored, and the lack of effective methods hinders information reuse from PDF documents. To enhance the usability of PDF-formatted electronic books, we propose a novel computational framework to analyze the underlying physical structure and logical structure. The analysis is conducted at both page level and document level, including global typographies, reading order, logical elements, chapter/section hierarchy and metadata. Moreover, two characteristics of PDF-based books, i.e., style consistency in the whole book document and natural rendering order of PDF files, are fully exploited in this paper to improve the conventional image-based structure extraction methods. This paper employs the bipartite graph as a common structure for modeling various tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on the graph representation, the optimal matching (OM) method is utilized to find the global optima in those tasks. Extensive benchmarking using real-world data validates the high efficiency and discrimination ability of the proposed method.
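The bipartite-graph formulation mentioned in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the vertical positions, the distance-based cost, and the function name `optimal_matching` are all assumptions chosen for illustration, and the exhaustive search stands in for a proper optimal-matching algorithm such as Kuhn-Munkres.

```python
from itertools import permutations


def optimal_matching(cost):
    """Return (total_cost, assignment) minimizing sum(cost[i][assignment[i]]).

    Exhaustive search over permutations of a square cost matrix. This is only
    practical for a handful of page objects; it is a stand-in for a real
    optimal-matching algorithm.
    """
    n = len(cost)
    best_total, best_assignment = None, None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_total is None or total < best_total:
            best_total, best_assignment = total, perm
    return best_total, list(best_assignment)


# Hypothetical vertical centers (in points) of figures and captions on a page.
figure_y = [120.0, 400.0, 650.0]
caption_y = [430.0, 150.0, 690.0]

# Assumed edge weight: vertical distance between a figure and a caption.
cost = [[abs(f - c) for c in caption_y] for f in figure_y]

total, assignment = optimal_matching(cost)
print(assignment)  # → [1, 0, 2]: figure 0 pairs with caption 1, and so on
print(total)       # → 100.0
```

The global optimum matters here because a greedy nearest-neighbour pairing can lock an early figure onto a caption that a later figure needs; minimizing the total cost over the whole graph avoids that.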

acm/ieee joint conference on digital libraries | 2013

WikiMirs: a mathematical information retrieval system for Wikipedia

Xuan Hu; Liangcai Gao; Xiaoyan Lin; Zhi Tang; Xiaofan Lin; Josef B. Baker

Mathematical formulae in structural formats such as MathML and LaTeX are becoming increasingly available. Moreover, repositories and websites, including arXiv and Wikipedia, and growing numbers of digital libraries use these structural formats to present mathematical formulae. This presents an important, new, and challenging area of research, namely Mathematical Information Retrieval (MIR). In this paper, we propose WikiMirs, a tool to facilitate mathematical formula retrieval in Wikipedia. WikiMirs is aimed at searching for similar mathematical formulae based upon both textual and spatial similarities, using a new indexing and matching model developed for layout structures. A hierarchical generalization technique is proposed to generate sub-trees from presentation trees of mathematical formulae, and similarity is calculated based upon matching at different levels of these trees. Experimental results show that WikiMirs can efficiently support sub-structure matching and similarity matching of mathematical formulae. Moreover, WikiMirs obtains both higher accuracy and better-ranked results on Wikipedia than Wikipedia Search and Egomath. We conclude that WikiMirs provides a new, alternative, and hopefully better service for users to search mathematical expressions within Wikipedia.

international conference on document analysis and recognition | 2009

Analysis of Book Documents' Table of Content Based on Clustering

Liangcai Gao; Zhi Tang; Xiaofan Lin; Xin Tao; Yimin Chu

Table of contents (TOC) recognition has attracted a great deal of attention in recent years. After reviewing the merits and drawbacks of the existing TOC recognition methods, we have observed that book documents are multi-page documents with intrinsic local format consistency. Based on this finding we introduce an automatic TOC analysis method through clustering. This method first detects the decorative elements in TOC pages. Then it learns a layout model used in the TOC pages through clustering. Finally, it generates TOC entries and extracts their hierarchical structure under the guidance of the model. More specifically, broken lines are taken into account in the method. Experimental results show that this method achieves high accuracy and efficiency. In addition, this method has been successfully applied in a commercial E-book production software package.
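As a rough illustration of learning a layout model through clustering, the sketch below groups TOC lines by their left indent and reads the heading level off the cluster rank. The abstract does not specify the features or the clustering algorithm, so the indent feature, the gap `tolerance`, and the function names are assumptions rather than the paper's method.

```python
def cluster_indents(indents, tolerance=5.0):
    """Cluster 1-D indent values.

    Values are sorted; a value more than `tolerance` away from the previous
    one starts a new cluster. Returns the cluster centers in ascending order.
    """
    centers = []
    group = []
    for x in sorted(indents):
        if group and x - group[-1] > tolerance:
            centers.append(sum(group) / len(group))
            group = []
        group.append(x)
    if group:
        centers.append(sum(group) / len(group))
    return centers


def assign_levels(entries, tolerance=5.0):
    """Map each (text, indent) entry to (text, level), with level 1 outermost."""
    centers = cluster_indents([indent for _, indent in entries], tolerance)
    levels = []
    for text, indent in entries:
        nearest = min(range(len(centers)), key=lambda k: abs(centers[k] - indent))
        levels.append((text, nearest + 1))
    return levels


# Hypothetical TOC lines with left indents in points.
toc = [
    ("Chapter 1", 72.0),
    ("1.1 Background", 90.5),
    ("1.2 Scope", 89.5),
    ("Chapter 2", 71.0),
    ("2.1 Method", 90.0),
    ("2.1.1 Details", 108.0),
]

for text, level in assign_levels(toc):
    print(level, text)
```

Because the clusters are learned from the book itself, small scanning or typesetting jitter in the indents does not need a hand-tuned rule per publisher; only the tolerance is fixed in advance.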

acm/ieee joint conference on digital libraries | 2009

CEBBIP: a parser of bibliographic information in Chinese electronic books

Liangcai Gao; Zhi Tang; Xiaofan Lin

Bibliographic information is essential for many digital library applications, such as citation analysis, academic searching and topic discovery, and bibliographic data extraction has attracted a great deal of attention in recent years. In this paper, we address the problem of automatic extraction of bibliographic data in Chinese electronic books and propose a tool called CEBBIP for the task, which includes three main systems: data preprocessing, data parsing and data postprocessing. In the data preprocessing system, the tool adopts a rule-based method to locate citation data in a book and to segment the citation data into citation strings for individual references. A learning-based approach, Conditional Random Fields (CRF), is employed to parse citation strings in the data parsing system. Finally, the tool takes advantage of the document's intrinsic local format consistency to enhance citation data segmentation and parsing through clustering techniques. CEBBIP has been used in a commercial E-book production system. Experimental results show that CEBBIP's precision rate is very high. More specifically, exploiting the document's intrinsic local format consistency clearly improves the accuracy of citation data segmentation and parsing.

document recognition and retrieval | 2012

Identification of embedded mathematical formulas in PDF documents using SVM

Xiaoyan Lin; Liangcai Gao; Zhi Tang; Xuan Hu; Xiaofan Lin

With the tremendous popularity of the PDF format, recognizing mathematical formulas in PDF documents has become a new and important problem in the document analysis field. In this paper, we present a method of embedded mathematical formula identification in PDF documents, based on Support Vector Machine (SVM). The method first segments text lines into words, and then classifies each word into one of two classes, namely formula or ordinary text. Various features of embedded formulas, including geometric layout, character and context content, are utilized to build a robust and adaptable SVM classifier. Embedded formulas are then extracted by merging the words labeled as formulas. Experimental results show good performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale e-Book production.
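The final merging step described in this abstract can be sketched as follows. The per-word labels are hard-coded here as a stand-in for the SVM classifier's output, and the function name `merge_formula_words` and the rule of joining only directly adjacent words are assumptions; the paper's actual merging rule is not given in the abstract.

```python
def merge_formula_words(words, labels):
    """Merge runs of consecutive words labeled True (formula) into one string."""
    formulas = []
    current = []
    for word, is_formula in zip(words, labels):
        if is_formula:
            current.append(word)
        elif current:
            # An ordinary-text word closes the current formula run.
            formulas.append(" ".join(current))
            current = []
    if current:
        formulas.append(" ".join(current))
    return formulas


# Hypothetical text line, already segmented into words.
words = ["where", "x", "+", "y", "is", "positive", "and", "z^2", "holds"]
# Hypothetical classifier output: True means "formula", False means "ordinary text".
labels = [False, True, True, True, False, False, False, True, False]

print(merge_formula_words(words, labels))  # → ['x + y', 'z^2']
```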

document analysis systems | 2012

Performance Evaluation of Mathematical Formula Identification

Xiaoyan Lin; Liangcai Gao; Zhi Tang; Xiaofan Lin; Xuan Hu

This paper presents a performance evaluation system for mathematical formula identification. First, a ground-truth dataset is constructed to facilitate the performance comparison of different mathematical formula identification algorithms. Statistical analysis shows that the dataset is diverse enough to reflect real-world documents. Second, a performance evaluation metric for mathematical formula identification is proposed, including the error type definitions and the scenario-adjustable scoring. The proposed metric enables in-depth analysis of mathematical formula identification systems in different scenarios. Finally, based on the proposed evaluation metric, a tool is developed to automatically evaluate mathematical formula identification results. It is worth noting that the ground-truth dataset and the evaluation tool are freely available for academic purposes.

Applied Soft Computing | 2013

Newspaper article reconstruction using ant colony optimization and bipartite graph

Liangcai Gao; Yongtao Wang; Zhi Tang; Xiaofan Lin

The primary information units in a newspaper are the articles. How to segment a newspaper page into individual articles and to recover the reading order of each article, namely newspaper article reconstruction, is known to be challenging due to the complexity of the multi-article page layout. In this paper, we propose a novel article reconstruction approach by solving a series of subtasks: grouping the article bodies, detecting the reading order, associating the title-body pairs and linking article parts scattered in multiple pages. We formulate reading order detection as a traveling salesman problem (TSP), and employ the Max-Min Ant System (MMAS) to solve it. Furthermore, a level-based pheromone mechanism is introduced to improve the efficiency of standard MMAS. Moreover, in sharp contrast to the existing methods, we perform the first two subtasks of article reconstruction in reverse order, that is, we detect the reading order of the text blocks first and then use the content continuity implicitly specified in the reading order to aggregate text blocks of the same article. In this way, we can effectively overcome the limitation of content similarity on article body aggregation. The other two subtasks (associating the title-body pairs, linking article parts scattered in multiple pages), are solved under a unified bipartite graph framework, which models the complex relationships between page objects as one-to-one correspondences, and accomplishes the two subtasks by finding the optimal matching on this graph. During the optimization process, various information sources, including geometric layout, linguistic and semantic content, are deeply mined in MMAS and bipartite graph model to deal with the wide range of complex newspaper layouts. Experimental results on real-world data have demonstrated the effectiveness of our proposed method. It has also been adopted in several large-scale newspaper digitalization projects.
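To make the TSP formulation of reading order concrete, here is a minimal Max-Min Ant System sketch over a handful of text blocks. It is an illustration under stated assumptions, not the paper's algorithm: the block coordinates, the distance-based cost, the fixed start block, and all parameter values are hypothetical, and the level-based pheromone mechanism described in the abstract is not implemented.

```python
import math
import random


def path_cost(path, cost):
    """Sum of transition costs along an open path (no return to the start)."""
    return sum(cost[a][b] for a, b in zip(path, path[1:]))


def mmas_order(cost, start=0, n_ants=8, n_iters=50, rho=0.1,
               alpha=1.0, beta=2.0, tau_min=0.01, tau_max=1.0, seed=0):
    """Search for a low-cost visiting order with a basic Max-Min Ant System."""
    rng = random.Random(seed)
    n = len(cost)
    tau = [[tau_max] * n for _ in range(n)]  # pheromone starts at the upper bound
    best_path, best_cost = None, float("inf")
    for _ in range(n_iters):
        for _ in range(n_ants):
            path = [start]
            unvisited = [j for j in range(n) if j != start]
            while unvisited:
                i = path[-1]
                # Attractiveness combines pheromone and inverse transition cost.
                weights = [(tau[i][j] ** alpha) * ((1.0 / cost[i][j]) ** beta)
                           for j in unvisited]
                nxt = rng.choices(unvisited, weights=weights)[0]
                path.append(nxt)
                unvisited.remove(nxt)
            c = path_cost(path, cost)
            if c < best_cost:
                best_path, best_cost = path, c
        # Evaporate on every edge, let only the best-so-far path deposit,
        # then clamp pheromone to [tau_min, tau_max].
        for i in range(n):
            for j in range(n):
                tau[i][j] *= 1.0 - rho
        for a, b in zip(best_path, best_path[1:]):
            tau[a][b] += 1.0 / best_cost
        for i in range(n):
            for j in range(n):
                tau[i][j] = min(tau_max, max(tau_min, tau[i][j]))
    return best_path, best_cost


# Hypothetical top-left corners (x, y) of text blocks in a two-column layout.
blocks = [(0, 0), (0, 100), (0, 200), (200, 0), (200, 100)]
# Assumed cost: Euclidean distance plus 1, so every transition cost is positive.
cost = [[1.0 + math.hypot(bx - ax, by - ay) for (bx, by) in blocks]
        for (ax, ay) in blocks]

order, total = mmas_order(cost)
print(order, total)
```

A purely geometric cost is the weakest part of this sketch; the abstract indicates that linguistic and semantic content also feed the optimization, which would enter here as extra terms in `cost`.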

acm/ieee joint conference on digital libraries | 2012

Web-based citation parsing, correction and augmentation

Liangcai Gao; Xixi Qi; Zhi Tang; Xiaofan Lin; Ying Liu

Considering the tremendous value of citation metadata, many methods have been proposed to automate Citation Metadata Extraction (CME). The existing methods primarily rely on the content analysis of citation text. However, the results from such content-based methods are often unreliable. Moreover, the extracted citation metadata is only a small part of the relevant metadata that is spread across the Internet. As opposed to the content-based CME methods, this paper proposes a Web-based CME approach and a citation enriching system, called BibAll, which is capable of correcting the parsing results of content-based CME methods and augmenting citation metadata by leveraging relevant bibliographic data from digital repositories and cited-by publications on the Web. BibAll consists of four main components: citation parsing, Web-based bibliographic data retrieval, irrelevant bibliographic data filtering, and relevant bibliographic data integration. The system has been tested on the publicly available FLUX-CIM dataset. Experimental results show that BibAll significantly improves the citation parsing accuracy and augments the metadata of the original citation.

document analysis systems | 2008

Comprehensive Global Typography Extraction System for Electronic Book Documents

Liangcai Gao; Zhi Tang; Xiaofan Lin; Ruiheng Qiu

Book documents usually have consistent typographies throughout the whole book, including headers, footers, columns, text line directions, and fonts used in each level of headings. Such document-level typography information is of great value for downstream document processing applications. This paper presents a document analysis system that can extract a comprehensive set of typographies used in book documents. The system consists of several components: recognition of fonts used in the body text and chapter headings; detection of page body area, headers and footers; detection of columns, text line direction and line spacing of body text. Page association is employed in the system. The preliminary experimental results demonstrate the effectiveness of the system.
