David Bamman | Researchain

Publication

Featured researches published by David Bamman.

Journal of Sociolinguistics | 2014

Gender identity and lexical variation in social media

David Bamman; Jacob Eisenstein; Tyler Schnoebelen

We present a study of the relationship between gender, linguistic style, and social networks, using a novel corpus of 14,000 Twitter users. Prior quantitative work on gender often treats this social variable as a female/male binary; we argue for a more nuanced approach. By clustering Twitter users, we find a natural decomposition of the dataset into various styles and topical interests. Many clusters have strong gender orientations, but their use of linguistic resources sometimes directly conflicts with the population-level language statistics. We view these clusters as a more accurate reflection of the multifaceted nature of gendered language styles. Previous corpus-based work has also had little to say about individuals whose linguistic styles defy population-level gender patterns. To identify such individuals, we train a statistical classifier, and measure the classifier confidence for each individual in the dataset. Examining individuals whose language does not match the classifier's model for their gender, we find that they have social networks that include significantly fewer same-gender social connections and that, in general, social network homophily is correlated with the use of same-gender language markers. Pairing computational methods and social theory thus offers a new perspective on how gender emerges as individuals position themselves relative to audiences, topics, and mainstream gender norms.
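
The abstract above does not say which statistical classifier is used, so the following is only a minimal sketch of the general idea of "classifier confidence": the posterior probability a model assigns to an individual's own label. It assumes a toy multinomial Naive Bayes model with add-one smoothing; the labels "A" and "B", the tokens, and the function names `train` and `posterior` are all hypothetical.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label). Returns per-label token counts and label counts."""
    token_counts = {}
    label_counts = Counter()
    for tokens, label in docs:
        label_counts[label] += 1
        token_counts.setdefault(label, Counter()).update(tokens)
    return token_counts, label_counts

def posterior(tokens, token_counts, label_counts):
    """Posterior P(label | tokens) under Naive Bayes with add-one smoothing."""
    vocab = set()
    for counts in token_counts.values():
        vocab.update(counts)
    n_docs = sum(label_counts.values())
    log_scores = {}
    for label, counts in token_counts.items():
        total = sum(counts.values())
        score = math.log(label_counts[label] / n_docs)
        for t in tokens:
            score += math.log((counts[t] + 1) / (total + len(vocab)))
        log_scores[label] = score
    # Normalize in log space to avoid underflow on long token lists.
    top = max(log_scores.values())
    unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
    z = sum(unnorm.values())
    return {label: v / z for label, v in unnorm.items()}

# Hypothetical training data: two labels with different characteristic tokens.
docs = [
    (["x", "x", "y"], "A"),
    (["x", "z"], "A"),
    (["w", "w", "y"], "B"),
    (["w", "z"], "B"),
]
token_counts, label_counts = train(docs)

# An individual labeled "A" whose tokens look more like label "B".
own_label = "A"
probs = posterior(["w", "w"], token_counts, label_counts)
confidence = probs[own_label]
```

A low `confidence` for an individual's own label marks a case where their language departs from the model's expectations for that label, which is the kind of individual the abstract singles out for closer study.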

Meeting of the Association for Computational Linguistics | 2014

A Bayesian Mixed Effects Model of Literary Character

David Bamman; Ted Underwood; Noah A. Smith

We consider the problem of automatically inferring latent character types in a collection of 15,099 English novels published between 1700 and 1899. Unlike prior work in which character types are assumed responsible for probabilistically generating all text associated with a character, we introduce a model that employs multiple effects to account for the influence of extra-linguistic information (such as author). In an empirical evaluation, we find that this method leads to improved agreement with the preregistered judgments of a literary scholar, complementing the results of alternative models.

European Conference on Research and Advanced Technology for Digital Libraries | 2009

Improving OCR accuracy for classical critical editions

Federico Boschetti; Matteo Romanello; Alison Babeu; David Bamman; Gregory R. Crane

This paper describes a workflow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR-scanned text. While the most recently available OCR engines are now able, after suitable training, to deal with the polytonic Greek fonts used in 19th- and 20th-century editions, further improvements can also be achieved with postprocessing. In particular, the progressive multiple alignment method applied to different OCR outputs based on the same images is discussed in this paper.

European Conference on Research and Advanced Technology for Digital Libraries | 2006

Beyond digital incunabula: modeling the next generation of digital libraries

Gregory R. Crane; David Bamman; Lisa Cerrato; Alison Jones; David M. Mimno; Adrian Packel; D. Sculley; Gabriel Weaver

This paper describes several incunabular assumptions that impose upon early digital libraries the limitations drawn from print, and argues for a design strategy aimed at providing customization and personalization services that go beyond the limiting models of print distribution, based on services and experiments developed for the Greco-Roman collections in the Perseus Digital Library. Three features fundamentally characterize a successful digital library design: finer granularity of collection objects, automated processes, and decentralized community contributions.

ACM/IEEE Joint Conference on Digital Libraries | 2008

Building a dynamic lexicon from a digital library

David Bamman; Gregory R. Crane

We describe here in detail our work toward creating a dynamic lexicon from the texts in a large digital library. By leveraging a small structured knowledge source (a 30,457-word treebank), we are able to extract selectional preferences for words from a 3.5-million-word Latin corpus. This is promising news for low-resource languages and digital collections seeking to leverage a small human investment into a much larger gain. The library architecture in which this work is developed allows us to query customized subcorpora to report on lexical usage by author, genre or era and allows us to continually update the lexicon as new texts are added to the collection.
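
The abstract does not spell out how selectional preferences are computed. As a rough illustration only, one simple way to summarize them is to count which dependents occur with each head word under a given syntactic relation and convert the counts to relative frequencies. The sketch below assumes the corpus has already been parsed into (head, relation, dependent) lemma triples; the triples, the relation label "OBJ", and the function name `selectional_preferences` are hypothetical.

```python
from collections import Counter, defaultdict

def selectional_preferences(triples):
    """Estimate P(dependent | head, relation) by relative frequency."""
    counts = defaultdict(Counter)
    for head, relation, dependent in triples:
        counts[(head, relation)][dependent] += 1
    prefs = {}
    for key, deps in counts.items():
        total = sum(deps.values())
        prefs[key] = {dep: n / total for dep, n in deps.items()}
    return prefs

# Hypothetical lemma triples from a parsed Latin corpus.
triples = [
    ("bibo", "OBJ", "vinum"),
    ("bibo", "OBJ", "vinum"),
    ("bibo", "OBJ", "aqua"),
    ("lego", "OBJ", "liber"),
]
prefs = selectional_preferences(triples)
bibo_objects = prefs[("bibo", "OBJ")]
```

Restricting `triples` to one author, genre, or era before counting would give the kind of subcorpus-specific report the abstract mentions.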

Meeting of the Association for Computational Linguistics | 2014

Distributed Representations of Geographically Situated Language

David Bamman; Chris Dyer; Noah A. Smith

We introduce a model for incorporating contextual information (such as geography) in learning vector-space representations of situated language. In contrast to approaches to multimodal representation learning that have used properties of the object being described (such as its color), our model includes information about the subject (i.e., the speaker), allowing us to learn the contours of a word’s meaning that are shaped by the context in which it is uttered. In a quantitative evaluation on the task of judging geographically informed semantic similarity between representations learned from 1.1 billion words of geo-located tweets, our joint model outperforms comparable independent models that learn meaning in isolation.

Language Technology for Cultural Heritage | 2011

The Ancient Greek and Latin Dependency Treebanks

David Bamman; Gregory R. Crane

This paper describes the development, composition, and several uses of the Ancient Greek and Latin Dependency Treebanks, large collections of Classical texts in which the syntactic, morphological and lexical information for each word is made explicit. To date, over 200 individuals from around the world have collaborated to annotate over 350,000 words, including the entirety of Homer’s Iliad and Odyssey, Sophocles’ Ajax, all of the extant works of Hesiod and Aeschylus, and selections from Caesar, Cicero, Jerome, Ovid, Petronius, Propertius, Sallust and Vergil. While perhaps the most straightforward value of such an annotated corpus for Classical philology is the morphosyntactic searching it makes possible, it also enables a large number of downstream tasks, such as inducing the syntactic behavior of lexemes and automatically identifying similar passages between texts.

International Journal on Digital Libraries | 2007

eScience and the humanities

Gregory R. Crane; Alison Babeu; David Bamman

Humanists face problems that are comparable to those of their colleagues in the sciences. Like scientists, humanists have electronic sources and datasets that are too large for traditional labor-intensive analysis. They also need to work with materials that presuppose more background knowledge than any one researcher can master: no one can, for example, know all the languages needed for subjects that cross multiple disciplines. Unlike their colleagues in the sciences, however, humanists have relatively few resources with which to develop this new infrastructure. They must therefore systematically cultivate alliances with better-funded disciplines, learning how to build on emerging infrastructure from other disciplines and, where possible, contributing to the design of a cyberinfrastructure that serves all of academia, including the humanities.

ACM/IEEE Joint Conference on Digital Libraries | 2011

Measuring historical word sense variation

David Bamman; Gregory R. Crane

We describe here a method for automatically identifying word sense variation in a dated collection of historical books in a large digital library. By leveraging a small set of known translation book pairs to induce a bilingual sense inventory and labeled training data for a WSD classifier, we are able to automatically classify the Latin word senses in a 389-million-word corpus and track the rise and fall of those senses over a span of two thousand years. We evaluate the performance of seven different classifiers both in a tenfold test on 83,892 words from the aligned parallel corpus and on a smaller, manually annotated sample of 525 words, measuring both the overall accuracy of each system and how well that accuracy correlates (via mean square error) to the observed historical variation.
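
The abstract names mean square error as the measure of agreement with observed historical variation but gives no further detail. A minimal sketch of that measure, assuming it compares a classifier's predicted relative frequency of one word sense per time period against the observed frequency, might look like this; the period values and the function name `mean_squared_error` are hypothetical.

```python
def mean_squared_error(predicted, observed):
    """Mean of squared differences between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must be non-empty")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)

# Hypothetical relative frequency of one word sense in four time periods.
observed = [0.10, 0.30, 0.60, 0.40]
predicted = [0.20, 0.30, 0.50, 0.40]
mse = mean_squared_error(predicted, observed)
```

A lower value means the predicted trajectory of the sense tracks the observed one more closely across periods.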

ACM Journal on Computing and Cultural Heritage | 2012

Extracting two thousand years of Latin from a million book library

David Bamman; David A. Smith

With the rise of large open digitization projects such as the Internet Archive and Google Books, we are witnessing an explosive growth in the number of source texts becoming available to researchers in historical languages. The Internet Archive alone contains over 27,014 texts catalogued as Latin, including classical prose and poetry written under the Roman Empire, ecclesiastical treatises from the Middle Ages, and dissertations from 19th-century Germany written—in Latin—on the philosophy of Hegel. At one billion words, this collection eclipses the extant corpus of Classical Latin by several orders of magnitude. In addition, the much larger collection of books in English, German, French, and other languages already scanned contains unknown numbers of translations for many Latin books, or parts of books. The sheer scale of this collection offers a broad vista of new research questions, and we focus here on both the opportunities and challenges of computing over such a large space of heterogeneous texts. The works in this massive collection do not constitute a finely curated (or much less balanced) corpus of Latin; it is, instead, simply all the Latin that can be extracted, and in its reach of twenty-one centuries (from approximately 200 BCE to 1922 CE) arguably spans the greatest historical distance of any major textual collection today. While we might hope that the size and historical reach of this collection can eventually offer insight into grand questions such as the evolution of a language over both time and space, we must contend as well with the noise inherent in a corpus that has been assembled with minimal human intervention.
