David M. Mimno | Researchain

Publication

Featured research published by David M. Mimno.

International Conference on Machine Learning | 2009

Evaluation methods for topic models

Hanna M. Wallach; Iain Murray; Ruslan Salakhutdinov; David M. Mimno

A natural evaluation metric for statistical topic models is the probability of held-out documents given a trained model. While exact computation of this probability is intractable, several estimators have been used in the topic modeling literature, including the harmonic mean method and the empirical likelihood method. In this paper, we demonstrate experimentally that commonly used methods are unlikely to accurately estimate the probability of held-out documents, and propose two alternative methods that are both accurate and efficient.

Knowledge Discovery and Data Mining | 2009

Efficient methods for topic model inference on streaming document collections

Limin Yao; David M. Mimno; Andrew McCallum

Topic models provide a powerful tool for analyzing large text collections by representing high-dimensional data in a low-dimensional subspace. Fitting a topic model given a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it is useful to be able to infer topic distributions for new documents without retraining the model. In this paper, we empirically evaluate the performance of several methods for topic inference in previously unseen documents, including methods based on Gibbs sampling, variational inference, and a new method inspired by text classification. The classification-based inference method produces results similar to iterative inference methods, but requires only a single matrix multiplication. In addition to these inference methods, we present SparseLDA, an algorithm and data structure for evaluating Gibbs sampling distributions. Empirical results indicate that SparseLDA can be approximately 20 times faster than traditional LDA and provide twice the speedup of previously published fast sampling methods, while also using substantially less memory.

Empirical Methods in Natural Language Processing | 2009

Polylingual Topic Models

David M. Mimno; Hanna M. Wallach; Jason Naradowsky; David A. Smith; Andrew McCallum

Topic models are a useful tool for analyzing large text collections, but have previously been applied in only monolingual, or at most bilingual, contexts. Meanwhile, massive collections of interlinked documents in dozens of languages, such as Wikipedia, are now widely available, calling for tools that can characterize content in many languages. We introduce a polylingual topic model that discovers topics aligned across multiple languages. We explore the model's characteristics using two large corpora, each with over ten different languages, and demonstrate its usefulness in supporting machine translation and tracking topic trends across languages.

International Conference on Machine Learning | 2007

Mixtures of hierarchical topics with Pachinko allocation

David M. Mimno; Wei Li; Andrew McCallum

The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM, an enhancement that explicitly represents a topic hierarchy. This model can be seen as combining the advantages of hLDA's topical hierarchy representation with PAM's ability to mix multiple leaves of the topic hierarchy. Experimental results show improvements in likelihood of held-out documents, as well as mutual information between automatically discovered topics and human-generated categories such as journals.

Empirical Methods in Natural Language Processing | 2015

Evaluation methods for unsupervised word embeddings

Tobias Schnabel; Igor Labutov; David M. Mimno

We present a comprehensive study of evaluation methods for unsupervised embedding techniques that obtain meaningful representations of words from text. Different evaluations result in different orderings of embedding methods, calling into question the common assumption that there is one single optimal vector representation. We present new evaluation techniques that directly compare embeddings with respect to specific queries. These methods reduce bias, provide greater insight, and allow us to solicit data-driven relevance judgments rapidly and accurately through crowdsourcing.

ACM/IEEE Joint Conference on Digital Libraries | 2006

Bibliometric impact measures leveraging topic analysis

Gideon S. Mann; David M. Mimno; Andrew McCallum

Measurements of the impact and history of research literature provide a useful complement to scientific digital library collections. Bibliometric indicators have been extensively studied, mostly in the context of journals. However, journal-based metrics poorly capture topical distinctions in fast-moving fields, and are increasingly problematic with the rise of open-access publishing. Recent developments in latent topic models have produced promising results for automatic sub-field discovery. The fine-grained, faceted topics produced by such models provide a clearer view of the topical divisions of a body of research literature and the interactions between those divisions. We demonstrate the usefulness of topic models in measuring impact by applying a new phrase-based topic discovery model to a collection of 300,000 computer science publications, collected by the Rexa automatic citation indexing system.

ACM Journal on Computing and Cultural Heritage | 2012

Computational historiography: Data mining in a century of classics journals

David M. Mimno

More than a century of modern Classical scholarship has created a vast archive of journal publications that is now becoming available online. Most of this work currently receives little, if any, attention. The collection is too large to be read by any single person and mostly not of sufficient interest to warrant traditional close reading. This article presents computational methods for identifying patterns and testing hypotheses about Classics as a field. Such tools can help organize large collections, introduce younger scholars to the history of the field, and act as a “survey,” identifying anomalies that can be explored using more traditional methods.

Nature Methods | 2011

Database of NIH grants using machine-learned categories and graphical clustering

Edmund M. Talley; David Newman; David M. Mimno; Bruce William Herr; Hanna M. Wallach; Gully A. P. C. Burns; A G Miriam Leenders; Andrew McCallum

[...] framework that is based on scientific research rather than NIH administrative and categorical designations. We found that topic-based categories are not strictly associated with the missions of individual Institutes but instead cut across the NIH, albeit in varying proportions consistent with each Institute’s distinct mission (Supplementary Table 1). The graphical map layout (Fig. 1) shows a global research structure that is logically coherent but only loosely related to Institute organization (Supplementary Table 1). We describe four example use cases (Supplementary Data). First, we show a query using an algorithm-derived category relevant to angiogenesis (Supplementary Fig. 1). Unlike standard keyword-based searches, this type of query allows retrieval of grants that are truly focused on a particular research area. In addition, the resulting graphical clusters reveal clear patterns in the relationships between the retrieved grants and the multiple Institutes funding this research. Second, we examine an NIH peer review study section. The database categories and clusters clarify the complex relationship between the NIH Institutes and the centralized NIH peer review system, which is distinct and independent from the Institutes. Third, we show an analysis of the NIH RCDC category ‘sleep research’ in conjunction with the database topics, the latter [...]

European Conference on Research and Advanced Technology for Digital Libraries | 2006

Beyond digital incunabula: modeling the next generation of digital libraries

Gregory R. Crane; David Bamman; Lisa Cerrato; Alison Jones; David M. Mimno; Adrian Packel; D. Sculley; Gabriel Weaver

This paper describes several incunabular assumptions that impose upon early digital libraries the limitations drawn from print, and argues for a design strategy aimed at providing customization and personalization services that go beyond the limiting models of print distribution, based on services and experiments developed for the Greco-Roman collections in the Perseus Digital Library. Three features fundamentally characterize a successful digital library design: finer granularity of collection objects, automated processes, and decentralized community contributions.

Archive | 2014

Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements

Edoardo M. Airoldi; David M. Blei; Elena A. Erosheva; Stephen E. Fienberg; Jordan L. Boyd-Graber; David M. Mimno; David Newman

@inbook{Boyd-Graber:Mimno:Newman-2014,
  Author = {Jordan Boyd-Graber and David Mimno and David Newman},
  Title = {Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements},
  Booktitle = {Handbook of Mixed Membership Models and Their Applications},
  Editor = {Edoardo M. Airoldi and David Blei and Elena A. Erosheva and Stephen E. Fienberg},
  Series = {CRC Handbooks of Modern Statistical Methods},
  Publisher = {CRC Press},
  Address = {Boca Raton, Florida},
  Year = {2014},
  Url = {docs/2014_book_chapter_care_and_feeding.pdf}
}