Sagnik Ray Choudhury | Researchain

Publication

Featured research published by Sagnik Ray Choudhury.

international conference on document analysis and recognition | 2013

Figure Metadata Extraction from Digital Documents

Sagnik Ray Choudhury; Prasenjit Mitra; Andi Kirk; Silvia Szep; Donald A. Pellegrino; Sue Jones; C. Lee Giles

Academic papers contain multiple figures (information graphics) representing important findings and experimental results. Automatic data extraction from such figures and classification of information graphics is not straightforward and is a well-studied problem in document analysis \cite{4275059}. Also, very few digital library search engines index figures and/or associated metadata (figure captions) from PDF documents. We describe the very first step in indexing, classification and data extraction from figures in PDF documents: accurate automatic extraction of figures and associated metadata, a nontrivial task. Document layout, font information, and lexical and linguistic features are considered for figure caption extraction from PDF documents, using both rule-based and machine-learning-based approaches. We also describe a digital library search engine that indexes figure captions and mentions from 150K documents, extracted by our custom-built extractor.

acm/ieee joint conference on digital libraries | 2014

Towards building a scholarly big data platform: challenges, lessons and opportunities

Zhaohui Wu; Jian Wu; Madian Khabsa; Kyle Williams; Hung-Hsuan Chen; Wenyi Huang; Suppawong Tuarob; Sagnik Ray Choudhury; Alexander G. Ororbia; Prasenjit Mitra; C. Lee Giles

We introduce a Big Data platform that provides various services for harvesting scholarly information and enabling efficient scholarly applications. The core architecture of the platform is built on a secured private cloud; it crawls data using a scholarly focused crawler that leverages a dynamic scheduler, processes the data with a MapReduce-based crawl-extraction-ingestion (CEI) workflow, and stores it in distributed repositories and databases. Services such as scholarly data harvesting, information extraction, and user information and log data analytics are integrated into the platform and provided by an OAI and RESTful API. We also introduce a set of scholarly applications built on top of this platform, including citation recommendation and collaborator discovery.

acm/ieee joint conference on digital libraries | 2013

A figure search engine architecture for a chemistry digital library

Sagnik Ray Choudhury; Suppawong Tuarob; Prasenjit Mitra; Lior Rokach; Andi Kirk; Silvia Szep; Donald A. Pellegrino; Sue Jones; C.L. Giles

Academic papers contain multiple figures representing important findings and experimental results; we present a search engine specifically focused on figures in academic documents. This search engine allows users to search on figures in approximately 150,000 chemistry journal articles, though the method is easily extended to other domains. Our system indexes figure captions and mentions extracted from the PDF documents using a custom-built extractor. Recall and precision for extracted figures are in the 80 to 90% range. We give the framework for the extraction algorithm, architecture and ranking function.

international world wide web conferences | 2015

An Architecture for Information Extraction from Figures in Digital Libraries

Sagnik Ray Choudhury; C.L. Giles

Scholarly documents contain multiple figures representing experimental findings. These figures are generated from data which is not reported anywhere else in the paper. We propose a modular architecture for analyzing such figures. Our architecture consists of the following modules: 1. An extractor for figures and associated metadata (figure captions and mentions) from PDF documents; 2. A search engine on the extracted figures and metadata; 3. An image processing module for automated data extraction from the figures; and 4. A natural language processing module to understand the semantics of the figure. We discuss the challenges in each step, and report an extractor algorithm to extract vector graphics from scholarly documents and a classification algorithm for figures. Our extractor algorithm improves the state of the art by more than 10%, and the classification process is very scalable yet achieves 85% accuracy. We also describe a semi-automatic system for data extraction from figures which is integrated with our search engine to improve user experience.

international conference on knowledge capture | 2015

PDFMEF: A Multi-Entity Knowledge Extraction Framework for Scholarly Documents and Semantic Search

Jian Wu; Jason Killian; Huaiyu Yang; Kyle Williams; Sagnik Ray Choudhury; Suppawong Tuarob; Cornelia Caragea; C. Lee Giles

We introduce PDFMEF, a multi-entity knowledge extraction framework for scholarly documents in the PDF format. It is implemented with a framework that encapsulates open-source extraction tools. Currently, it leverages PDFBox and TET for full text extraction, the scholarly document filter described in [5] for document classification, GROBID for header extraction, ParsCit for citation extraction, PDFFigures for figure and table extraction, and algorithm extraction [27]. While it can be run as a whole, the extraction tool in each module is highly customizable. Users can substitute default extractors with other extraction tools they prefer by writing a thin wrapper to implement the abstracts. The framework is designed to be scalable and is capable of running in parallel using a multi-processing technique in Python. Experiments indicate that the system with default setups is CPU-bound and has a small memory footprint, which makes it best suited to a multi-core machine. At best, on a dedicated 16-core server, it takes 1.3 seconds on average to process one PDF document. It is used to index extracted information, to help users quickly locate relevant results in published scholarly documents, and to efficiently construct a large knowledge base in order to build a semantic scholarly search engine. Part of it runs on the CiteSeerX digital library search engine.
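
The thin-wrapper and multi-processing ideas in this abstract can be illustrated with a minimal sketch. Everything named here (`Extractor`, `DummyHeaderExtractor`, `DummyCitationExtractor`, `process_document`, `process_corpus`) is hypothetical and is not PDFMEF's actual API; the extractors return placeholder values instead of calling real tools such as GROBID or ParsCit.

```python
from abc import ABC, abstractmethod
from multiprocessing import Pool


class Extractor(ABC):
    """Hypothetical abstract interface that a thin wrapper would implement."""

    name = "extractor"

    @abstractmethod
    def extract(self, pdf_path: str) -> dict:
        """Return the extracted fields for one PDF."""


class DummyHeaderExtractor(Extractor):
    """Stand-in for a header extraction wrapper; returns placeholder data."""

    name = "header"

    def extract(self, pdf_path: str) -> dict:
        return {"title": f"title of {pdf_path}"}


class DummyCitationExtractor(Extractor):
    """Stand-in for a citation extraction wrapper; returns placeholder data."""

    name = "citations"

    def extract(self, pdf_path: str) -> dict:
        return {"count": 0}


# Swapping a default extractor means replacing one entry in this list.
EXTRACTORS = [DummyHeaderExtractor(), DummyCitationExtractor()]


def process_document(pdf_path: str):
    """Run every configured extractor on one document."""
    return pdf_path, {e.name: e.extract(pdf_path) for e in EXTRACTORS}


def process_corpus(pdf_paths, workers: int = 2) -> dict:
    """Distribute documents across worker processes (assumed design)."""
    with Pool(processes=workers) as pool:
        return dict(pool.map(process_document, pdf_paths))


if __name__ == "__main__":
    print(process_corpus(["a.pdf", "b.pdf"]))
```

Because each document is processed independently, a CPU-bound workload of this shape should scale with the number of cores, which is consistent with the abstract's recommendation of a multi-core machine.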

international conference on data engineering | 2014

Scholarly big data information extraction and integration in the CiteSeerχ digital library

Kyle Williams; Jian Wu; Sagnik Ray Choudhury; Madian Khabsa; C. Lee Giles

CiteSeerχ is a digital library that contains approximately 3.5 million scholarly documents and receives between 2 and 4 million requests per day. In addition to making documents available via a public Website, the data is also used to facilitate research in areas like citation analysis, co-author network analysis, scalability evaluation and information extraction. The papers in CiteSeerχ are gathered from the Web by means of continuous automatic focused crawling and go through a series of automatic processing steps as part of the ingestion process. Given the size of the collection, the fact that it is constantly expanding, and the multiple ways in which it is used both by the public to access scholarly documents and for research, there are several big data challenges. In this paper, we provide a case study description of how we address these challenges when it comes to information extraction, data integration and entity linking in CiteSeerχ. We describe how we: aggregate data from multiple sources on the Web; store and manage data; process data as part of an automatic ingestion pipeline that includes automatic metadata and information extraction; perform document and citation clustering; perform entity linking and name disambiguation; and make our data and source code available to enable research and collaboration.

international conference on management of data | 2016

Scalable algorithms for scholarly figure mining and semantics

Sagnik Ray Choudhury; Shuting Wang; C. Lee Giles

Most scholarly papers contain one or more figures. Often these figures show experimental results, e.g., line graphs are used to compare various methods. Compared to the text of the paper, figures and their semantics have received relatively little attention. This has significantly limited semantic search capabilities in scholarly search engines. Here, we report scalable algorithms for generating semantic metadata for figures. Our system has four sequential modules: 1. Extraction of figure, caption and mention; 2. Binary classification of figures as compound (contains sub-figures) or not; 3. Three-class classification of non-compound figures as line graph, bar graph or others; and 4. Automatic processing of line graphs to generate a textual summary. In each step a metadata file is generated, each having richer information than the previous one. The algorithms are scalable, yet each individual step has an accuracy greater than 80%.
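
The four sequential modules described above can be sketched as a chain of stages that each add fields to a metadata record. This is only an illustration of the control flow: the function names, field names, and placeholder values are assumptions, and the real classifiers and summarizer are replaced by stubs.

```python
def extract_figure(record: dict) -> dict:
    # Stage 1: attach caption and mentions (placeholder values).
    return {**record, "caption": "Figure 1: accuracy vs. epochs", "mentions": []}


def classify_compound(record: dict) -> dict:
    # Stage 2: binary decision, compound figure or not (stubbed as False).
    return {**record, "is_compound": False}


def classify_type(record: dict) -> dict:
    # Stage 3: only non-compound figures receive a type label.
    if record["is_compound"]:
        return record
    return {**record, "figure_type": "line graph"}


def summarize_line_graph(record: dict) -> dict:
    # Stage 4: only line graphs receive a textual summary.
    if record.get("figure_type") != "line graph":
        return record
    return {**record, "summary": "placeholder summary"}


STAGES = [extract_figure, classify_compound, classify_type, summarize_line_graph]


def run_pipeline(record: dict) -> list:
    """Return the metadata snapshot produced after each stage."""
    snapshots = []
    for stage in STAGES:
        record = stage(record)
        snapshots.append(dict(record))
    return snapshots


snapshots = run_pipeline({"figure_id": "fig-1"})
for step, snapshot in enumerate(snapshots, start=1):
    print(step, sorted(snapshot))
```

For a non-compound line graph, each snapshot carries more keys than the previous one, mirroring the statement that every step's metadata file is richer than the last.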

acm/ieee joint conference on digital libraries | 2016

Curve Separation for Line Graphs in Scholarly Documents

Sagnik Ray Choudhury; Shuting Wang; C. Lee Giles

Line graphs are abundant in scholarly papers. They are usually generated from a data table, but that data cannot be accessed. One important step in an automated data extraction pipeline is the curve separation problem: segmenting the pixels into separate curves. Previous work in this domain has focused on raster graphics extracted from scholarly PDFs, whereas most scholarly plots are embedded as vector graphics. We report a system to extract these plots as SVG images and show how that can improve both the accuracy (90%) and the scalability (5-8 seconds) of the curve separation problem.
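
To illustrate why a vector representation can make curve separation easier, here is a small sketch that groups SVG path elements by stroke color. Grouping by stroke is an assumed heuristic chosen for illustration, not the method reported in the paper, and `group_paths_by_stroke` and `SAMPLE_SVG` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SVG_NS = "{http://www.w3.org/2000/svg}"

SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <path d="M0 0 L10 5" stroke="#ff0000" fill="none"/>
  <path d="M10 5 L20 7" stroke="#ff0000" fill="none"/>
  <path d="M0 9 L20 2" stroke="#0000ff" fill="none"/>
</svg>"""


def group_paths_by_stroke(svg_text: str) -> dict:
    """Map each stroke color to the path data strings drawn in that color."""
    root = ET.fromstring(svg_text)
    groups = defaultdict(list)
    for path in root.iter(SVG_NS + "path"):
        stroke = path.get("stroke")
        if stroke and stroke != "none":
            groups[stroke].append(path.get("d", ""))
    return dict(groups)


curves = group_paths_by_stroke(SAMPLE_SVG)
for color, segments in curves.items():
    print(color, len(segments))
```

In a raster image the same separation would require pixel-level segmentation, whereas here each drawing primitive already carries its own style attributes. Real plots would also need handling for axes, grid lines, markers, and curves that share a color.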

acm/ieee joint conference on digital libraries | 2017

HESDK: a hybrid approach to extracting scientific domain knowledge entities

Jian Wu; Sagnik Ray Choudhury; Agnese Chiatti; Chen Liang; C. Lee Giles

We investigate a variant of the problem of automatic keyphrase extraction from scientific documents, which we define as Scientific Domain Knowledge Entity (SDKE) extraction. Keyphrases are noun phrases important to the documents themselves. In contrast, an SDKE is text that refers to a concept and can be classified as a process, material, task, dataset, etc. An SDKE represents domain knowledge, but is not necessarily important to the document it is in. Supervised keyphrase extraction algorithms using non-sequential classifiers and global measures of informativeness (PMI, tf-idf) have been used for this task. Another approach is to use sequential labeling algorithms with local context from a sentence, as done in named entity recognition. We show that these two methods can complement each other and a simple merging can improve the extraction accuracy by 5-7 percentiles. We further propose several heuristics to improve the extraction accuracy. Our preliminary experiments suggest that it is possible to improve the accuracy of the sequential learner itself by utilizing the predictions of the non-sequential model.
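
The abstract does not specify how the two models' outputs are merged. As one possible reading, the sketch below takes the union of predicted spans and, when both models label the same span, keeps the sequential model's label. The function name, the span representation, and the tie-breaking rule are all assumptions made for illustration.

```python
def merge_predictions(non_sequential: dict, sequential: dict) -> dict:
    """Union two span-to-label maps; the sequential label wins on conflicts.

    Keys are (start, end) character offsets; values are entity labels.
    """
    merged = dict(non_sequential)
    merged.update(sequential)
    return merged


# Hypothetical predictions from the two kinds of model.
non_seq = {(0, 12): "process", (20, 28): "material"}
seq = {(20, 28): "task", (35, 42): "dataset"}

merged = merge_predictions(non_seq, seq)
for span in sorted(merged):
    print(span, merged[span])
```

A union of this kind can raise recall when the two models miss different entities, which is one way two methods could complement each other; whether it helps precision depends on how often their errors overlap.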

arXiv: Computation and Language | 2014