Pucktada Treeratpituk
Pennsylvania State University
Publication
Featured research published by Pucktada Treeratpituk.
acm/ieee joint conference on digital libraries | 2009
Pucktada Treeratpituk; C. Lee Giles
Users of digital libraries usually want to know the exact author or authors of an article. But different authors may share the same name, either as full names or as initials and last names (complete name changes are not considered here). In such cases, the user would like the digital library to differentiate among these authors. Name disambiguation helps in many cases, for example when a user searches for all articles written by a particular author. Disambiguation also enables better bibliometric analysis by allowing more accurate counting and grouping of publications and citations. In this paper, we describe an algorithm for pair-wise disambiguation of author names based on a machine learning classification algorithm, random forests. We define a set of similarity profile features to assist in author disambiguation. Our experiments on the Medline database show that the random forest model outperforms other previously proposed techniques such as those using support-vector machines (SVM). In addition, we demonstrate that the variable importance produced by the random forest model can be used in feature selection with little degradation in the disambiguation accuracy. In particular, the inverse document frequency of the author's last name and the middle-name similarity alone achieve an accuracy of almost 90%.
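As a rough illustration of the two features singled out at the end of this abstract, the sketch below computes an inverse document frequency for last names and a simple middle-name similarity score. The function names, the record fields (`last`, `middle`), and the three-level scoring scheme are assumptions made for illustration only; the abstract does not give the paper's actual feature definitions.

```python
import math
from collections import Counter


def last_name_idf(last_names):
    """Map each lowercased last name to log(N / count); rarer names score higher."""
    counts = Counter(name.lower() for name in last_names)
    total = len(last_names)
    return {name: math.log(total / count) for name, count in counts.items()}


def middle_name_similarity(a, b):
    """Hypothetical score: 1.0 identical, 0.5 compatible or missing, 0.0 conflicting."""
    a, b = a.strip(". ").lower(), b.strip(". ").lower()
    if not a or not b:
        return 0.5  # one side has no middle name, so there is no evidence either way
    if a == b:
        return 1.0
    if a[0] == b[0] and (len(a) == 1 or len(b) == 1):
        return 0.5  # an initial that is compatible with a full middle name
    return 0.0


def similarity_profile(rec_a, rec_b, idf):
    """Two-feature profile for a pair of records assumed to share a last name."""
    return [
        idf.get(rec_a["last"].lower(), 0.0),
        middle_name_similarity(rec_a["middle"], rec_b["middle"]),
    ]
```

A vector like this would then be passed to a pairwise classifier (a random forest in the paper) that decides whether the two records refer to the same author.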
acm/ieee joint conference on digital libraries | 2012
Madian Khabsa; Pucktada Treeratpituk; C. Lee Giles
Acknowledgments are widely used in scientific articles to express gratitude and credit collaborators. Despite suggestions that indexing acknowledgments automatically will give interesting insights, there is currently, to the best of our knowledge, no such system to track acknowledgments and index them. In this paper we introduce AckSeer, a search engine and a repository for automatically extracted acknowledgments in digital libraries. AckSeer is a fully automated system that scans items in digital libraries, including conference papers, journals, and books, extracting acknowledgment sections and identifying the acknowledged entities mentioned within. We describe the architecture of AckSeer and discuss the extraction algorithms, which achieve an F1 measure above 83%. We use multiple Named Entity Recognition (NER) tools and propose a method for merging the outcomes from different recognizers. The resulting entities are stored in a database and then made searchable by adding them to the AckSeer index along with the metadata of the containing paper or book. We build AckSeer on top of the documents in the CiteSeerX digital library, yielding more than 500,000 acknowledgments and more than 4 million mentioned entities.
international conference on big data | 2014
Madian Khabsa; Pucktada Treeratpituk; C. Lee Giles
Person name disambiguation is essential to distinguish between persons who share the same name where unique identifiers are not present. This is a common problem in many domains, including digital libraries, where the same name can refer to multiple unique authors. Correctly attributing work and citations requires the digital library's database to be disambiguated. In this work we describe a large-scale framework for disambiguating author names efficiently and effectively. The framework uses a density-based clustering algorithm with a random-forest-based distance function to cluster unique authors. Effective use of blocking functions allows the clustering algorithm to be run in parallel. In our experiments we show that the framework disambiguates authors of more than 4 million papers in 24 hours.
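The role of blocking can be sketched as follows: records are grouped by a cheap key so that each group can be clustered independently, and therefore in parallel. The key used here (last name plus first initial) and the record fields are assumptions; the abstract does not state which blocking functions the framework uses.

```python
from collections import defaultdict


def block_key(record):
    """Hypothetical blocking function: lowercased last name plus first initial."""
    return (record["last"].lower(), record["first"][:1].lower())


def build_blocks(records):
    """Group records by block key; records in different blocks are never compared."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    return dict(blocks)
```

Because no comparison crosses a block boundary, each list in the returned dictionary could be handed to a separate worker running the clustering step.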
acm/ieee joint conference on digital libraries | 2015
Madian Khabsa; Pucktada Treeratpituk; C. Lee Giles
While many clustering techniques have been successfully applied to the person name disambiguation problem, most do not address two main practical issues: allowing constraints to be added to the clustering process, and allowing the data to be added incrementally without clustering the entire database. Constraints can be particularly useful in a system such as a digital library, where users are allowed to make corrections to the disambiguated result. For example, a user correction on a disambiguation result specifying that a record does not belong to an author could be kept as a cannot-link constraint to be used in any future disambiguation (such as when new documents are added). Besides such user corrections, constraints also allow background heuristics to be encoded into the disambiguation process. We propose a constraint-based clustering algorithm for person name disambiguation, based on DBSCAN combined with a pairwise distance based on random forests. We further propose an extension to the density-based clustering algorithm (DBSCAN) to handle online clustering so that the disambiguation process can be done iteratively as new data points are added. Our algorithm utilizes similarity features based on both metadata information and citation similarity. We implement two types of clustering constraints to demonstrate the concept. Experiments on the CiteSeer data show that our model can achieve 0.95 pairwise F1 and 0.79 cluster F1. The presence of constraints also consistently improves the disambiguation result across different combinations of features.
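One simple way to picture a cannot-link constraint is as an override on the pairwise distance, as in the sketch below. This is an illustrative assumption, not the paper's algorithm: the abstract names cannot-link constraints only as an example, the must-link case is added here for symmetry, and overriding distances alone does not stop a density-based method from chaining two records together through intermediate points.

```python
def constrained_distance(base_distance, cannot_link, must_link=frozenset()):
    """Wrap a pairwise distance so that stored constraints override the learned value.

    cannot_link and must_link are sets of frozenset({id_a, id_b}) pairs.
    Distances are assumed to lie in [0, 1].
    """
    def distance(id_a, id_b):
        pair = frozenset((id_a, id_b))
        if pair in cannot_link:
            return 1.0  # maximum distance: never direct neighbours
        if pair in must_link:
            return 0.0  # minimum distance: always direct neighbours
        return base_distance(id_a, id_b)

    return distance
```

A user correction stating that record `r2` does not belong with record `r1` would be stored as `frozenset(("r1", "r2"))` in `cannot_link` and reused whenever new documents trigger another round of clustering.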
D-lib Magazine | 2012
Sumit Bhatia; Cornelia Caragea; Hung-Hsuan Chen; Jian Wu; Pucktada Treeratpituk; Zhaohui Wu; Madian Khabsa; Prasenjit Mitra; C. Lee Giles
This article provides an overview of some of the specialized datasets that were created for various projects related to the CiteSeerX digital library.
conference on information and knowledge management | 2012
Madian Khabsa; Pucktada Treeratpituk; C. Lee Giles
Given a set of automatically extracted entities E of size n, we would like to cluster all the various names referring to the same canonical entity together. The variations of each entity include acronyms, full names, and informal naming conventions. We propose using search engine results to cluster variations of each entity based on the URLs appearing in those results. We create a cluster C for each top search result returned by querying for the entity e ∈ E, assigning e to the cluster C. Our experiments on a manually created dataset show that our approach achieves higher precision and recall than string-matching and hierarchical-clustering-based disambiguation methods.
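The clustering rule described here can be sketched as below, assuming the search results have already been collected. The input format (a mapping from each entity string to its ranked list of result URLs) and the function name are hypothetical; no real search engine API is called.

```python
from collections import defaultdict


def cluster_by_top_url(top_results):
    """Assign each entity to the cluster keyed by its top-ranked result URL.

    top_results maps an entity string to a ranked list of URLs (assumed format).
    Entities with no results are left unclustered.
    """
    clusters = defaultdict(set)
    for entity, urls in top_results.items():
        if urls:
            clusters[urls[0]].add(entity)
    return dict(clusters)
```

Under this rule an acronym and a full name end up in the same cluster whenever a query for each returns the same top URL.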
acm/ieee joint conference on digital libraries | 2015
Pucktada Treeratpituk; Madian Khabsa; C. Lee Giles
We propose a novel graph-based approach for constructing a concept hierarchy from a large text corpus. Our algorithm incorporates both statistical co-occurrences and lexical similarity in optimizing the structure of the taxonomy. To automatically generate topic-dependent taxonomies from a large text corpus, we first extract topical terms and their relationships from the corpus. The algorithm then constructs a weighted graph representing topics and their associations. A graph partitioning algorithm is then used to recursively partition the topic graph into a taxonomy. For evaluation, we apply our approach to articles, primarily in computer science, in the CiteSeerX digital library and search engine.
meeting of the association for computational linguistics | 2010
Pucktada Treeratpituk; Pradeep B. Teregowda; Jian Huang; C. Lee Giles
national conference on artificial intelligence | 2012
Pucktada Treeratpituk; C. Lee Giles
acm/ieee joint conference on digital libraries | 2013
Hung-Hsuan Chen; Pucktada Treeratpituk; Prasenjit Mitra; C. Lee Giles