Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Pengtao Xie is active.

Publication


Featured researches published by Pengtao Xie.


Knowledge Discovery and Data Mining | 2015

Diversifying Restricted Boltzmann Machine for Document Modeling

Pengtao Xie; Yuntian Deng; Eric P. Xing

Restricted Boltzmann Machine (RBM) has shown great effectiveness in document modeling. It utilizes hidden units to discover latent topics and can learn compact semantic representations for documents, which greatly facilitate document retrieval, clustering and classification. The popularity (or frequency) of topics in text corpora usually follows a power-law distribution, where a few dominant topics occur very frequently while most topics (in the long-tail region) have low probabilities. Due to this imbalance, RBM tends to learn multiple redundant hidden units to best represent dominant topics and to ignore those in the long-tail region, which renders the learned representations redundant and non-informative. To solve this problem, we propose the Diversified RBM (DRBM), which diversifies the hidden units so that they cover not only the dominant topics but also those in the long-tail region. We define a diversity metric and use it as a regularizer to encourage the hidden units to be diverse. Since the diversity metric is hard to optimize directly, we instead optimize its lower bound and prove that maximizing the lower bound with projected gradient ascent increases the diversity metric itself. Experiments on document retrieval and clustering demonstrate that with diversification, the document modeling power of DRBM can be greatly improved.
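
The paper's exact diversity metric is not reproduced here, but the core idea, pushing hidden-unit weight vectors apart with a projected-gradient scheme, can be sketched as follows. This is a minimal illustration using a simple pairwise-orthogonality penalty (the function names and the penalty itself are assumptions, not the paper's definitions):

```python
import numpy as np

def diversity_penalty(W):
    """Redundancy of hidden-unit weight vectors (rows of W, assumed
    unit-norm): ||W W^T - I||_F^2, zero when the rows are orthonormal."""
    G = W @ W.T
    return np.sum((G - np.eye(W.shape[0])) ** 2)

def project_rows(W):
    """Project each hidden unit's weight vector back onto the unit sphere."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def diversify_step(W, lr=0.01):
    """One projected-gradient step that pushes hidden units apart."""
    G = W @ W.T - np.eye(W.shape[0])
    grad = 4.0 * G @ W  # gradient of the Frobenius penalty above
    return project_rows(W - lr * grad)

rng = np.random.default_rng(0)
W = project_rows(rng.normal(size=(8, 100)))  # 8 hidden units, 100 visible
for _ in range(200):
    W = diversify_step(W)
print(diversity_penalty(W))  # decreases toward 0 as units diversify
```

In the paper, a term of this kind is added to the RBM training objective as a regularizer rather than optimized in isolation.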


North American Chapter of the Association for Computational Linguistics | 2015

Incorporating Word Correlation Knowledge into Topic Modeling

Pengtao Xie; Diyi Yang; Eric P. Xing

This paper studies how to incorporate external word correlation knowledge to improve the coherence of topic modeling. Existing topic models assume words are generated independently and lack a mechanism to utilize the rich similarity relationships among words to learn coherent topics. To solve this problem, we build a Markov Random Field (MRF) regularized Latent Dirichlet Allocation (LDA) model, which defines an MRF on the latent topic layer of LDA to encourage words labeled as similar to share the same topic label. Under our model, the topic assignment of each word is not independent, but rather affected by the topic labels of its correlated words. Similar words have a better chance of being put into the same topic due to the regularization of the MRF, so the coherence of topics can be boosted. In addition, our model can accommodate the subtlety that whether two words are similar depends on which topic they appear in, which allows words with multiple senses to be put into different topics properly. We derive a variational inference method to infer the posterior probabilities and learn model parameters, and present techniques to deal with the hard-to-compute partition function in the MRF. Experiments on two datasets demonstrate the effectiveness of our model.
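
As a rough illustration of how an MRF over the topic layer changes topic assignment (the paper itself uses variational inference; this Gibbs-style conditional and all names in it are simplifications for exposition):

```python
import numpy as np

def topic_conditional(w, d, neighbor_topics, n_wk, n_dk, n_k,
                      alpha=0.1, beta=0.01, lam=1.0):
    """Unnormalized topic distribution for one word token: the standard
    collapsed-LDA factor multiplied by an MRF factor that rewards agreeing
    with the topic labels of the word's correlated neighbors.

    n_wk: (V, K) word-topic counts, n_dk: (D, K) doc-topic counts,
    n_k: (K,) topic totals, neighbor_topics: topic labels of words
    deemed similar to w."""
    V, K = n_wk.shape
    lda = (n_wk[w] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
    agree = np.bincount(neighbor_topics, minlength=K)  # neighbor "votes"
    p = lda * np.exp(lam * agree)  # lam controls the MRF strength
    return p / p.sum()
```

With lam = 0 this reduces to plain LDA; larger lam makes correlated words increasingly likely to share a topic.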


Engineering | 2016

Strategies and Principles of Distributed Machine Learning on Big Data

Eric P. Xing; Qirong Ho; Pengtao Xie; Wei Dai

The rise of Big Data has led to new demands for Machine Learning (ML) systems to learn complex models with millions to billions of parameters that promise adequate capacity to digest massive datasets and offer powerful predictive analytics thereupon. Running ML algorithms at such scales, on a distributed cluster with tens to thousands of machines, often requires significant engineering effort, and one might fairly ask whether such engineering truly falls within the domain of ML research. Taking the view that Big ML systems can benefit greatly from ML-rooted statistical and algorithmic insights, and that ML researchers should therefore not shy away from such systems design, we discuss a series of principles and strategies distilled from our recent efforts on industrial-scale ML solutions. These principles and strategies span a continuum from application, to engineering, to theoretical research and development of Big ML systems and architectures, with the goal of understanding how to make them efficient, generally applicable, and supported with convergence and scaling guarantees. They concern four key questions which traditionally receive little attention in ML research: How to distribute an ML program over a cluster? How to bridge ML computation with inter-machine communication? How to perform such communication? What should be communicated between machines? By exposing underlying statistical and algorithmic characteristics unique to ML programs but not typically seen in traditional computer programs, and by dissecting successful cases to reveal how we have harnessed these principles to design and develop both high-performance distributed ML software and general-purpose ML frameworks, we present opportunities for ML researchers and practitioners to further shape and grow the area that lies between ML and systems.
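
One concrete answer this line of work gives to the "how to bridge computation with communication" question is bounded-staleness synchronization (stale synchronous parallel). A toy sketch of the coordination rule, with no real networking and purely illustrative names:

```python
class SSPClock:
    """Toy stale-synchronous-parallel clock: a worker may run ahead of
    the slowest worker by at most `staleness` iterations."""

    def __init__(self, n_workers, staleness):
        self.clocks = [0] * n_workers
        self.staleness = staleness

    def can_proceed(self, worker):
        # Block fast workers until stragglers catch up to within the bound.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def tick(self, worker):
        # Called when `worker` finishes an iteration.
        self.clocks[worker] += 1
```

The bound trades parameter freshness for fewer synchronization stalls while still permitting convergence guarantees.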


European Conference on Machine Learning | 2015

Learning Compact and Effective Distance Metrics with Diversity Regularization

Pengtao Xie

Learning a proper distance metric is of vital importance for many distance-based applications. Distance metric learning aims to learn a set of latent factors based on which the distances between data points can be effectively measured. The number of latent factors incurs a tradeoff: a small number of factors is not powerful or expressive enough to measure distances, while a large number of factors causes high computational overhead. In this paper, we aim to achieve two seemingly conflicting goals: keeping the number of latent factors small for the sake of computational efficiency, while making them as effective as a large set of factors. The approach we take is to impose a diversity regularizer over the latent factors to encourage them to be uncorrelated, such that each factor captures some unique information that is hard for the other factors to capture. In this way, a small number of latent factors can suffice to capture a large proportion of the information, which retains computational efficiency while preserving effectiveness in measuring distances. Experiments on retrieval, clustering and classification demonstrate that a small number of factors learned with diversity regularization can achieve comparable or even better performance than a large factor set learned without regularization.
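
A minimal sketch of the two ingredients, a factor-based distance and a decorrelation-style diversity penalty (the penalty here is an illustrative stand-in, not the paper's regularizer):

```python
import numpy as np

def factor_distance(L, x, y):
    """Distance under latent factors L (k x d): ||L (x - y)||_2."""
    return np.linalg.norm(L @ (x - y))

def factor_redundancy(L):
    """Mean squared cosine similarity between distinct factors (rows of
    L); driving this toward 0 encourages uncorrelated factors."""
    Ln = L / np.linalg.norm(L, axis=1, keepdims=True)
    C = Ln @ Ln.T
    k = L.shape[0]
    off_diag = C - np.diag(np.diag(C))
    return np.sum(off_diag ** 2) / (k * (k - 1))
```

The training objective would combine a pairwise metric-learning loss with factor_redundancy(L) weighted by a tradeoff coefficient.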


Heredity | 2017

Inference of multiple-wave population admixture by modeling decay of linkage disequilibrium with polynomial functions

Ying Zhou; Kai Yuan; Yaoliang Yu; Xumin Ni; Pengtao Xie; Eric P. Xing; Shuhua Xu

To infer the histories of population admixture, one important challenge for methods based on admixture linkage disequilibrium (ALD) is to remove the effect of source LD (SLD), which is directly inherited from the source populations. In previous methods, only the decay curve of weighted LD between pairs of sites whose genetic distances were larger than a certain starting distance was fitted by single or multiple exponential functions, for the inference of recent single- or multiple-wave admixture. However, the effect of SLD has not been well defined, and no tool has been developed to estimate the effect of SLD on weighted LD decay. In this study, we defined the SLD in the formularized weighted LD statistic under the two-way admixture model and proposed a polynomial spectrum (p-spectrum) to study the weighted SLD and weighted LD. We also found that reference populations could be used to reduce the SLD in weighted LD statistics. We further developed a method, iMAAPs, to infer multiple-wave admixture by fitting ALD using a p-spectrum. We evaluated the performance of iMAAPs under various admixture models in simulated data and applied iMAAPs to the analysis of genome-wide single nucleotide polymorphism data from the Human Genome Diversity Project and the HapMap Project. We showed that iMAAPs is a considerable improvement over other current methods and further facilitates the inference of the histories of complex population admixtures.
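
The p-spectrum construction is specific to the paper, but the general flavor, explaining a weighted-LD decay curve as a non-negative combination of decay terms over a grid of candidate rates, can be sketched like this (an assumed simplification, not iMAAPs itself):

```python
import numpy as np
from scipy.optimize import nnls

def fit_decay_spectrum(d, weighted_ld, rates):
    """Fit weighted LD at genetic distances d (in Morgans) with a
    non-negative mixture of exponentials exp(-r * d) over candidate
    decay rates r (roughly, admixture times in generations)."""
    A = np.exp(-np.outer(d, rates))  # design matrix, one column per rate
    coeffs, residual = nnls(A, weighted_ld)
    return coeffs, residual

# Peaks in `coeffs` over the rate grid suggest distinct admixture waves.
```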


Symposium on Cloud Computing | 2018

Orpheus: Efficient Distributed Machine Learning via System and Algorithm Co-design

Pengtao Xie; Jin Kyu Kim; Qirong Ho; Yaoliang Yu; Eric P. Xing

Numerous existing works have shown that key to the efficiency of distributed machine learning (ML) is proper system and algorithm co-design: system design should be tailored to the unique mathematical properties of ML algorithms, and algorithms can be re-designed to better exploit the system architecture. While existing research has made attempts along this direction, many algorithmic and system properties that are characteristic of ML problems remain to be explored. Through an exploration of system-algorithm co-design, we build a new decentralized system, Orpheus, to support distributed training of a general class of ML models whose parameters are represented with large matrices. Training such models at scale is challenging: transmitting and checkpointing large matrices incur substantial network traffic and disk IO, which aggravates the inconsistency among parameter replicas. To cope with these challenges, Orpheus jointly exploits system and algorithm designs which (1) reduce the size and number of network messages for efficient communication, (2) incrementally checkpoint vectors for lightweight and fine-grained fault tolerance without blocking computation, and (3) improve the consistency among parameter copies via periodic centralized synchronization and parameter-replica rotation. As a result of these co-designs, communication and fault tolerance costs are linear in both the matrix dimension and the number of machines in the network, as opposed to being quadratic in existing systems. And the improved parameter consistency accelerates algorithmic convergence. Empirically, we show that our system outperforms several existing baseline systems on training several representative large-scale ML models.
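
For instance, when a layer's gradient is a rank-1 outer product, the update can be shipped as two vectors instead of a full matrix, which is the kind of message-size reduction the paper describes; a minimal sketch with illustrative names (not Orpheus's actual API):

```python
import numpy as np

def encode_update(grad_out, activation):
    """Represent the update Delta W = u v^T by its two factors:
    O(m + n) floats on the wire instead of O(m * n)."""
    return grad_out, activation

def apply_update(W, u, v, lr=0.1):
    """Reconstruct and apply the rank-1 update on the receiving machine."""
    W -= lr * np.outer(u, v)
    return W
```

Checkpointing the vectors instead of the matrices yields the same linear (rather than quadratic) cost for fault tolerance.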


Meeting of the Association for Computational Linguistics | 2017

A Constituent-Centric Neural Architecture for Reading Comprehension

Pengtao Xie; Eric P. Xing

Reading comprehension (RC), which aims to understand natural texts and answer questions about them, is a challenging task. In this paper, we study the RC problem on the Stanford Question Answering Dataset (SQuAD). Observing from the training set that most correct answers are centered around constituents in the parse tree, we design a constituent-centric neural architecture where the generation of candidate answers and their representation learning are both based on constituents and guided by the parse tree. Under this architecture, the search space of candidate answers can be greatly reduced without sacrificing the coverage of correct answers, and the syntactic, hierarchical and compositional structure among constituents can be well captured, which contributes to better representation learning of the candidate answers. On SQuAD, our method achieves state-of-the-art performance, and an ablation study corroborates the effectiveness of the individual modules.
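
To see why constituents shrink the search space, compare the set of parse-tree constituents with all O(n^2) contiguous substrings of a sentence; a toy illustration of the candidate-generation idea (not the paper's neural architecture):

```python
import nltk

# A hand-written parse; in practice this would come from a parser.
tree = nltk.Tree.fromstring(
    "(S (NP (DT the) (NN cat)) "
    "(VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))")

# Candidate answers = constituent spans of the parse tree.
candidates = {" ".join(t.leaves()) for t in tree.subtrees()}
print(candidates)
# Far fewer spans than all contiguous substrings, yet plausible answers
# such as "the mat" and "on the mat" are still covered.
```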


International Joint Conference on Artificial Intelligence | 2017

Improving the Generalization Performance of Multi-class SVM via Angular Regularization

Jianxin Li; Haoyi Zhou; Pengtao Xie; Yingchun Zhang

In multi-class support vector machines (MSVM) for classification, one core issue is to regularize the coefficient vectors to reduce overfitting. Various regularizers have been proposed, such as ℓ2, ℓ1, and the trace norm. In this paper, we introduce a new type of regularization approach, angular regularization, which encourages the coefficient vectors to have larger angles between them so that class regions can be widened to flexibly accommodate unseen samples. We propose a novel angular regularizer based on the singular values of the coefficient matrix, where the uniformity of singular values reduces the correlation among different classes and drives the angles between coefficient vectors to increase. In a generalization error analysis, we show that decreasing this regularizer effectively reduces the generalization error bound. On various datasets, we demonstrate the efficacy of the regularizer in reducing overfitting.
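
A simple way to see the singular-value idea: measure how far the spectrum of the coefficient matrix is from uniform (the exact regularizer in the paper differs; this variance-based stand-in is an assumption for illustration):

```python
import numpy as np

def spectrum_nonuniformity(W):
    """Variance of the normalized singular values of the coefficient
    matrix W (classes x features). A near-uniform spectrum corresponds
    to less-correlated class coefficient vectors, i.e. larger angles."""
    s = np.linalg.svd(W, compute_uv=False)
    s = s / s.sum()
    return np.var(s)
```

Adding a term like this to the MSVM objective penalizes coefficient matrices whose energy concentrates in a few directions.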


bioRxiv | 2015

Inference of multiple-wave population admixture by modeling decay of linkage disequilibrium with multiple exponential functions

Ying Zhou; Kai Yuan; Yaoliang Yu; Xumin Ni; Pengtao Xie; Eric P. Xing; Shuhua Xu

Admixture-induced linkage disequilibrium (LD) has recently been introduced into the inference of the histories of complex admixtures. However, the influence of ancestral source populations on the LD pattern in admixed populations is not properly taken into consideration by currently available methods, which affects the estimation of several gene flow parameters from empirical data. We first illustrated the dynamic changes of LD in admixed populations and mathematically formulated the LD under a generalized admixture model with finite population size. We next developed a new method, MALDmef, which fits LD with multiple exponential functions for inferring and dating multiple-wave admixtures. MALDmef takes into account the effects of source populations, which substantially affect the modeling of LD in admixed populations, rendering it capable of efficiently detecting and dating multiple-wave admixture events. The performance of MALDmef was evaluated by simulation, and it was shown to be more accurate than MALDER, a state-of-the-art method recently developed for similar purposes, under various admixture models. We further applied MALDmef to analyzing genome-wide data from the Human Genome Diversity Project (HGDP) and the HapMap Project. Interestingly, we were able to identify more than one admixture event in several populations, which had yet to be reported. For example, two major admixture events were identified in the Xinjiang Uyghur, occurring around 27–30 generations ago and 182–195 generations ago, respectively. In an African population (MKK), three recent major admixtures occurring 13–16, 50–67, and 107–139 generations ago were detected. Our method is a considerable improvement over other current methods and further facilitates the inference of the histories of complex population admixtures.
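
A minimal sketch of fitting a two-wave decay, echoing the Uyghur example above, with a sum of two exponentials (the synthetic data and all names are illustrative; MALDmef's actual estimation procedure is more involved):

```python
import numpy as np
from scipy.optimize import curve_fit

def two_wave_decay(d, a1, g1, a2, g2, c):
    """Weighted LD decay under two admixture waves g1 and g2 generations
    ago: a sum of two exponentials in genetic distance d, plus a floor."""
    return a1 * np.exp(-g1 * d) + a2 * np.exp(-g2 * d) + c

# Synthetic decay curve with waves at ~30 and ~190 generations.
d = np.linspace(0.001, 0.3, 300)
y = two_wave_decay(d, 0.5, 30.0, 0.2, 190.0, 0.01)
y += np.random.default_rng(1).normal(0, 1e-3, d.size)

params, _ = curve_fit(two_wave_decay, d, y, p0=[0.4, 20.0, 0.1, 150.0, 0.0])
print(params[[1, 3]])  # recovered admixture times (generations)
```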


Uncertainty in Artificial Intelligence | 2013

Integrating Document Clustering and Topic Modeling

Pengtao Xie; Eric P. Xing

Collaboration


Dive into Pengtao Xie's collaborations.

Top Co-Authors

Eric P. Xing, Carnegie Mellon University
Qirong Ho, Carnegie Mellon University
Yaoliang Yu, Carnegie Mellon University
Abhimanu Kumar, Carnegie Mellon University
Jin Kyu Kim, Carnegie Mellon University
Yuntian Deng, Carnegie Mellon University
Jinliang Wei, Carnegie Mellon University