Liangcai Gao | Researchain

Publication

Featured research published by Liangcai Gao.

International Conference on Document Analysis and Recognition | 2011

A Table Detection Method for Multipage PDF Documents via Visual Seperators and Tabular Structures

Jing Fang; Liangcai Gao; Kun Bai; Ruiheng Qiu; Xin Tao; Zhi Tang

Table detection has always been an important task in document analysis and recognition. In this paper, we propose a novel and effective table detection method via visual separators and geometric content layout information, targeting PDF documents. The visual separators include not only graphic ruling lines but also white spaces, so that tables with or without ruling lines can be handled. Furthermore, we detect page columns to assist table region delimitation in pages with complex layouts. Evaluations of our algorithm on an e-Book dataset and a scientific document dataset show competitive performance. It is noteworthy that the proposed method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.
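
As a rough illustration of how white space can act as a visual separator, the sketch below finds horizontal gaps between text chunks on one line. It is not the paper's algorithm: the function name `whitespace_gaps`, the `(x_start, x_end)` box representation, and the `min_gap` threshold are all assumptions made for this example.

```python
def whitespace_gaps(boxes, min_gap):
    """Return horizontal gaps of at least `min_gap` between text chunks.

    boxes: list of (x_start, x_end) extents of text chunks on one line
           (hypothetical input format, not taken from the paper).
    """
    gaps = []
    covered_until = None  # right edge of the text seen so far
    for x0, x1 in sorted(boxes):
        if covered_until is not None and x0 - covered_until >= min_gap:
            gaps.append((covered_until, x0))
        covered_until = x1 if covered_until is None else max(covered_until, x1)
    return gaps


if __name__ == "__main__":
    # Three chunks; only the first gap is wide enough to count as a separator.
    print(whitespace_gaps([(0, 40), (60, 100), (102, 150)], min_gap=10))
```

Gaps that recur at similar positions across consecutive lines would be candidate column separators of a table without ruling lines; that aggregation step is left out here.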

International Conference on Document Analysis and Recognition | 2011

Mathematical Formula Identification in PDF Documents

Xiaoyan Lin; Liangcai Gao; Zhi Tang; Xiaofan Lin; Xuan Hu

Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions in image-based documents. In this paper, we propose a novel method by combining rule-based and learning-based methods to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas, including geometric layout, character and context content, are used to adapt to a wide range of formula types. Experimental results show satisfactory performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.

ACM/IEEE Joint Conference on Digital Libraries | 2014

Full-text based context-rich heterogeneous network mining approach for citation recommendation

Xiaozhong Liu; Yingying Yu; Chun Guo; Yizhou Sun; Liangcai Gao

Citation relationships between scientific publications have been successfully used for scholarly bibliometrics, information retrieval and data mining tasks, and citation-based recommendation algorithms are well documented. While previous studies investigated citation relations from various viewpoints, most of them share the same assumption that, if paper1 cites paper2 (or author1 cites author2), they are connected, regardless of citation importance, sentiment, reason, topic, or motivation. However, this assumption is oversimplified. In this study, we employ an innovative “context-rich heterogeneous network” approach, which paves a new way for the citation recommendation task. In the network, we characterize (1) the importance of citation relationships between citing and cited papers, and (2) the topical citation motivation. Unlike earlier studies, the citation information in this paper is characterized by citation textual contexts extracted from the full text of the citing paper. We also propose an algorithm to cope with the situation in which a large portion of the full text is missing from the bibliographic repository. Evaluation results show that the context-rich heterogeneous network can significantly enhance citation recommendation performance.

ACM/IEEE Joint Conference on Digital Libraries | 2011

Structure extraction from PDF-based book documents

Liangcai Gao; Zhi Tang; Xiaofan Lin; Ying Liu; Ruiheng Qiu; Yongtao Wang

PDF documents have become a dominant knowledge repository for both academia and industry, largely because they are very convenient to print and exchange. However, methods for automated structure information extraction are yet to be fully explored, and the lack of effective methods hinders the reuse of information in PDF documents. To enhance the usability of PDF-formatted electronic books, we propose a novel computational framework to analyze the underlying physical structure and logical structure. The analysis is conducted at both page level and document level, including global typographies, reading order, logical elements, chapter/section hierarchy and metadata. Moreover, two characteristics of PDF-based books, i.e., style consistency across the whole book document and the natural rendering order of PDF files, are fully exploited in this paper to improve on conventional image-based structure extraction methods. This paper employs the bipartite graph as a common structure for modeling various tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on the graph representation, the optimal matching (OM) method is utilized to find the global optima in those tasks. Extensive benchmarking using real-world data validates the high efficiency and discrimination ability of the proposed method.
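
To make the bipartite-graph idea concrete, the sketch below pairs figures with captions so that the total distance is globally minimal. It is only an illustration: the centre-point representation, the Euclidean cost, and the function name `match_figures_to_captions` are assumptions, and brute-force enumeration stands in for a proper optimal matching algorithm, so it is only practical for the handful of figures found on one page.

```python
from itertools import permutations
from math import hypot


def match_figures_to_captions(figures, captions):
    """Assign one caption to each figure, minimising total distance.

    figures, captions: lists of (x, y) centre points (hypothetical inputs).
    Returns (assignment, total_cost), where assignment[i] is the index of
    the caption matched to figure i.
    """
    if len(figures) > len(captions):
        raise ValueError("need at least as many captions as figures")

    def cost(fig, cap):
        return hypot(fig[0] - cap[0], fig[1] - cap[1])

    best_total, best = float("inf"), ()
    # Enumerate every one-to-one assignment and keep the cheapest one.
    for perm in permutations(range(len(captions)), len(figures)):
        total = sum(cost(figures[i], captions[j]) for i, j in enumerate(perm))
        if total < best_total:
            best_total, best = total, perm
    return list(best), best_total


if __name__ == "__main__":
    figures = [(0, 0), (0, 3)]
    captions = [(0, 2), (0, -10)]
    print(match_figures_to_captions(figures, captions))
```

In this example both figures are nearest to the first caption, so a per-figure nearest-neighbour rule would produce a conflict; the global matching resolves it by minimising the summed cost.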

ACM/IEEE Joint Conference on Digital Libraries | 2013

WikiMirs: a mathematical information retrieval system for Wikipedia

Xuan Hu; Liangcai Gao; Xiaoyan Lin; Zhi Tang; Xiaofan Lin; Josef B. Baker

Mathematical formulae in structural formats such as MathML and LaTeX are becoming increasingly available. Moreover, repositories and websites, including arXiv and Wikipedia, and growing numbers of digital libraries use these structural formats to present mathematical formulae. This presents an important and challenging new area of research, namely Mathematical Information Retrieval (MIR). In this paper, we propose WikiMirs, a tool to facilitate mathematical formula retrieval in Wikipedia. WikiMirs is aimed at searching for similar mathematical formulae based upon both textual and spatial similarities, using a new indexing and matching model developed for layout structures. A hierarchical generalization technique is proposed to generate sub-trees from presentation trees of mathematical formulae, and similarity is calculated based upon matching at different levels of these trees. Experimental results show that WikiMirs can efficiently support sub-structure matching and similarity matching of mathematical formulae. Moreover, WikiMirs obtains both higher accuracy and better ranked results on Wikipedia in comparison with Wikipedia Search and Egomath. We conclude that WikiMirs provides a new, alternative, and hopefully better service for users to search mathematical expressions within Wikipedia.
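
The sub-tree idea can be sketched as follows. This is a simplified illustration, not the WikiMirs indexing model: the nested-tuple tree encoding, the `"*"` wildcard, and the function names `subtrees` and `generalize` are assumptions made for this example.

```python
def subtrees(tree, level=0):
    """Yield (level, subtree) for every node of a formula tree.

    A tree is either a leaf string or a tuple (operator, child, ...).
    This encoding is a hypothetical stand-in for a presentation tree.
    """
    yield level, tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child, level + 1)


def generalize(tree):
    """Replace every leaf with the wildcard "*", keeping the structure."""
    if isinstance(tree, tuple):
        return (tree[0],) + tuple(generalize(child) for child in tree[1:])
    return "*"


if __name__ == "__main__":
    # Roughly x^2 / y, written as a nested tuple.
    formula = ("frac", ("sup", "x", "2"), "y")
    for level, sub in subtrees(formula):
        print(level, sub, generalize(sub))
```

Indexing both the exact and the generalized form of each sub-tree, together with its level, would let a query such as `("sup", "a", "b")` match `("sup", "x", "2")` structurally while still ranking exact matches higher.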

Multimedia Tools and Applications | 2014

Automatic comic page segmentation based on polygon detection

Luyuan Li; Yongtao Wang; Zhi Tang; Liangcai Gao

Comic page segmentation aims to automatically decompose scanned comic images into storyboards (frames), which is the key technique for producing digital comic documents that are suitable for reading on mobile devices. In this paper, we propose a novel method for comic page segmentation by finding the quadrilateral enclosing box of each storyboard. We first acquire the edge image of the input comic image, and then extract line segments with a heuristic line segment detection algorithm. We perform line clustering to merge the overlapping line segments and remove the redundant ones. Finally, we perform another round of line clustering and post-processing to compose the obtained line segments into complete quadrilateral enclosing boxes of the storyboards. The proposed method is tested on 2,237 comic images from 12 different printed comic series, and the experimental results demonstrate that our method is effective for comic image segmentation and outperforms the existing methods.
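
The merging of overlapping line segments can be illustrated with a much-simplified sketch that handles horizontal segments only. The paper's clustering works on detected segments of arbitrary direction; the `(y, x_start, x_end)` representation, the row bucketing by `y_tol`, and the `x_gap` parameter below are assumptions made for this example.

```python
from collections import defaultdict


def merge_horizontal_segments(segments, y_tol=2.0, x_gap=0.0):
    """Merge overlapping horizontal segments that lie on the same row.

    segments: list of (y, x_start, x_end) tuples (hypothetical format).
    Segments fall into the same row when round(y / y_tol) is equal, which
    is a coarse bucketing: two close segments can still straddle a bucket
    boundary and stay separate.
    """
    rows = defaultdict(list)
    for y, x0, x1 in segments:
        rows[round(y / y_tol)].append((min(x0, x1), max(x0, x1), y))

    merged = []
    for key in sorted(rows):
        run = None  # current merged segment as [y, x_start, x_end]
        for x0, x1, y in sorted(rows[key]):
            if run is not None and x0 <= run[2] + x_gap:
                run[2] = max(run[2], x1)  # extend the current run
            else:
                if run is not None:
                    merged.append(tuple(run))
                run = [y, x0, x1]
        merged.append(tuple(run))
    return merged


if __name__ == "__main__":
    segs = [(10, 0, 5), (10, 4, 9), (10, 20, 30), (50, 0, 3)]
    print(merge_horizontal_segments(segs))
```

Each merged run keeps the `y` value of its leftmost segment; averaging the `y` values of the members would be a reasonable alternative.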

Document Analysis Systems | 2014

Plane Geometry Figure Retrieval with Bag of Shapes

Lu Liu; Xiaoqing Lu; Keqiang Li; Jingwei Qu; Liangcai Gao; Zhi Tang

Digital education is serving an increasingly important function in most educational institutions, resulting in the production of a large number of online digital documents for educational purposes. However, convenient ways to retrieve mathematical geometry questions are lacking because current retrieval systems largely rely on keywords instead of geometry figure images. This study focuses on plane geometry figure (PGF) image retrieval with the aim of retrieving relevant geometry images that contain more structural information than a question text stem. To fully use geometrical properties, a Bag-of-shapes (BoS) method is proposed to build the feature descriptor of an image. The BoS method contains either basic geometric primitives or dual-primitive structures along with several specific geometrical features for shape description. Based on the BoS feature descriptor, we apply cosine similarity with group feature weights as the vector similarity measure for ranking to achieve high efficiency. For a PGF image query, the retrieval results are provided in an appropriate ranking order, which has high visual similarity with respect to human perception. Retrieval experiments and evaluation results show the effectiveness and efficiency of the proposed BoS shape descriptor.
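
One plausible reading of "cosine similarity with group feature weight" is a cosine similarity in which every feature carries the weight of the group it belongs to. The sketch below follows that reading; the exact weighting used in the paper is not given in the abstract, so the formula, the function names, and the sample vectors are assumptions.

```python
from math import sqrt


def expand_group_weights(group_sizes, group_weights):
    """Repeat each group's weight once per feature in that group."""
    return [w for size, w in zip(group_sizes, group_weights) for _ in range(size)]


def weighted_cosine(a, b, weights):
    """Cosine similarity with a per-feature weight (assumed formulation)."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    dot = sum(w * x * y for w, x, y in zip(weights, a, b))
    norm_a = sqrt(sum(w * x * x for w, x in zip(weights, a)))
    norm_b = sqrt(sum(w * y * y for w, y in zip(weights, b)))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, database, weights):
    """Return database keys ordered from most to least similar."""
    scored = [(weighted_cosine(query, vec, weights), name)
              for name, vec in database.items()]
    return [name for _, name in sorted(scored, reverse=True)]


if __name__ == "__main__":
    # Two hypothetical feature groups: 2 primitive counts, 1 structure count.
    weights = expand_group_weights([2, 1], [1.0, 1.0])
    database = {"a": [2, 1, 0], "b": [0, 0, 3], "c": [2, 0, 0]}
    print(rank_by_similarity([2, 1, 0], database, weights))
```

Raising the weight of one group makes agreement on that group's features count more in the ranking, which is the effect a group weight is meant to have.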

International Conference on Document Analysis and Recognition | 2009

Analysis of Book Documents' Table of Content Based on Clustering

Liangcai Gao; Zhi Tang; Xiaofan Lin; Xin Tao; Yimin Chu

Table of contents (TOC) recognition has attracted a great deal of attention in recent years. After reviewing the merits and drawbacks of the existing TOC recognition methods, we have observed that book documents are multi-page documents with intrinsic local format consistency. Based on this finding we introduce an automatic TOC analysis method through clustering. This method first detects the decorative elements in TOC pages. Then it learns a layout model used in the TOC pages through clustering. Finally, it generates TOC entries and extracts their hierarchical structure under the guidance of the model. More specifically, broken lines are taken into account in the method. Experimental results show that this method achieves high accuracy and efficiency. In addition, this method has been successfully applied in a commercial E-book production software package.

Proceedings of SPIE | 2010

A Novel XML-Based Document Format with Printing Quality for Web Publishing

Ruiheng Qiu; Zhi Tang; Liangcai Gao; Yinyan Yu

Although many XML-based document formats are available for printing or publishing on the Internet, none of them is well designed to support both high quality printing and web publishing. Therefore, we propose a novel XML-based document format for web publishing, called CEBX, in this paper. The proposed format is a fixed-layout document format supporting high quality printing, with a document content organization, physical structure and protection scheme optimized to support web publishing. There are four noteworthy features of CEBX documents: (1) CEBX provides the original fixed layout by graphic units for printing quality. (2) The content in a CEBX document can be reflowed to fit the display device based on the content blocks and additional fluid information. (3) The XML Document Archiving model (XDA), the packaging model used in CEBX, supports document linearization and incremental editing well. (4) By introducing a segment-based content protection scheme into CEBX, part of a document can be previewed directly while the remaining part is protected effectively, such that readers only need to purchase the parts of a book that they are interested in. This will be very helpful to document distribution and will support flexible business models such as try-before-buy, on-demand reading, superdistribution, etc.

ACM/IEEE Joint Conference on Digital Libraries | 2009

CEBBIP: a parser of bibliographic information in Chinese electronic books

Liangcai Gao; Zhi Tang; Xiaofan Lin

Bibliographic information is essential for many digital library applications, such as citation analysis, academic searching and topic discovery, and bibliographic data extraction has attracted a great deal of attention in recent years. In this paper, we address the problem of automatic extraction of bibliographic data in Chinese electronic books and propose a tool called CEBBIP for the task, which includes three main systems: data preprocessing, data parsing and data postprocessing. In the data preprocessing system, the tool adopts a rule-based method to locate citation data in a book and to segment the citation data into citation strings of individual references. A learning-based approach, Conditional Random Fields (CRF), is employed to parse citation strings in the data parsing system. Finally, the tool takes advantage of the document's intrinsic local format consistency to enhance citation data segmentation and parsing through clustering techniques. CEBBIP has been used in a commercial E-book production system. Experimental results show that CEBBIP's precision rate is very high. More specifically, exploiting the document's intrinsic local format consistency clearly improves citation data segmentation and parsing accuracy.