Harald Hammarström | Researchain

Publication

Featured research published by Harald Hammarström.

Computational Linguistics | 2011

Unsupervised learning of morphology

Harald Hammarström; Lars Borin

This article surveys work on Unsupervised Learning of Morphology. We define Unsupervised Learning of Morphology as the problem of inducing a description (of some kind, even if only morpheme segmentation) of how orthographic words are built up, given only raw text data of a language. We briefly go through the history and motivation of this problem. Next, over 200 items of work are listed with a brief characterization, and the most important ideas in the field are critically discussed. We summarize the achievements so far and give pointers for future developments.

Current Anthropology | 2011

Automated dating of the world’s language families based on lexical similarity

Eric W. Holman; Cecil H. Brown; Søren Wichmann; A. Müller; Viveka Velupillai; Harald Hammarström; Sebastian Sauppe; Hagen Jung; D. Bakker; Pamela Brown; Oleg Belyaev; Matthias Urban; Robert Mailhammer; Johann-Mattis List; Dmitry Egorov

This paper describes a computerized alternative to glottochronology for estimating elapsed time since parent languages diverged into daughter languages. The method, developed by the Automated Similarity Judgment Program (ASJP) consortium, is different from glottochronology in four major respects: (1) it is automated and thus is more objective, (2) it applies a uniform analytical approach to a single database of worldwide languages, (3) it is based on lexical similarity as determined from Levenshtein (edit) distances rather than on cognate percentages, and (4) it provides a formula for date calculation that mathematically recognizes the lexical heterogeneity of individual languages, including parent languages just before their breakup into daughter languages. Automated judgments of lexical similarity for groups of related languages are calibrated with historical, epigraphic, and archaeological divergence dates for 52 language groups. The discrepancies between estimated and calibration dates are found to be on average 29% as large as the estimated dates themselves, a figure that does not differ significantly among language families. As a resource for further research that may require dates of known level of accuracy, we offer a list of ASJP time depths for nearly all the world’s recognized language families and for many subfamilies.
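The lexical-similarity step described above can be illustrated with a minimal sketch. It assumes plain orthographic strings and a simple length normalization; the function names and the toy word lists are hypothetical, and the paper's own transcription scheme and calibrated date formula are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def mean_similarity(list_a: dict, list_b: dict) -> float:
    """Average of 1 - normalized distance over meanings present in both lists."""
    shared = sorted(set(list_a) & set(list_b))
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(1.0 - normalized_distance(list_a[m], list_b[m]) for m in shared)
    return total / len(shared)


# Toy word lists (illustrative only, not taken from the ASJP database).
variety_1 = {"hand": "hand", "water": "water"}
variety_2 = {"hand": "hand", "water": "wasser"}
similarity = mean_similarity(variety_1, variety_2)
```

A similarity averaged this way is the kind of quantity that a calibration against known divergence dates would then convert into a time depth.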

Proceedings of the National Academy of Sciences of the United States of America | 2016

Sound–meaning association biases evidenced across thousands of languages

Damián E. Blasi; Søren Wichmann; Harald Hammarström; Peter F. Stadler; Morten H. Christiansen

Significance: The independence between sound and meaning is believed to be a crucial property of language: across languages, sequences of different sounds are used to express similar concepts (e.g., Russian “ptitsa,” Swahili “ndege,” and Japanese “tori” all mean “bird”). However, a careful statistical examination of words from nearly two-thirds of the world’s languages reveals that unrelated languages very often use (or avoid) the same sounds for specific referents. For instance, words for tongue tend to have l or u, “round” often appears with r, and “small” with i. These striking similarities call for a reexamination of the fundamental assumption of the arbitrariness of the sign.

It is widely assumed that one of the fundamental properties of spoken language is the arbitrary relation between sound and meaning. Some exceptions in the form of nonarbitrary associations have been documented in linguistics, cognitive science, and anthropology, but these studies only involved small subsets of the 6,000+ languages spoken in the world today. By analyzing word lists covering nearly two-thirds of the world’s languages, we demonstrate that a considerable proportion of 100 basic vocabulary items carry strong associations with specific kinds of human speech sounds, occurring persistently across continents and linguistic lineages (linguistic families or isolates). Prominently among these relations, we find property words (“small” and i, “full” and p or b) and body part terms (“tongue” and l, “nose” and n). The areal and historical distribution of these associations suggests that they often emerge independently rather than being inherited or borrowed. Our results therefore have important implications for the language sciences, given that nonarbitrary associations have been proposed to play a critical role in the emergence of cross-modal mappings, the acquisition of language, and the evolution of our species’ unique communication system.

International Conference on Natural Language Processing | 2006

Morphological lexicon extraction from raw text data

Markus Forsberg; Harald Hammarström; Aarne Ranta

The tool extract enables the automatic extraction of lemma-paradigm pairs from raw text data. The tool uses search patterns that consist of regular expressions and propositional logic. These search patterns define sufficient conditions for including lemma-paradigm pairs in the lexicon, on the basis of word forms occurring in the data. This paper explains the search pattern syntax of extract as well as the search algorithm, and discusses the design of search patterns from the point of view of recall and precision. The extract tool was developed for morphologies defined in the Functional Morphology tool [1], but it is usable for all systems that implement a word-and-paradigm description of a morphology. The usefulness of the tool is demonstrated by a case study on the Canadian Hansards Corpus of French. The result is evaluated in terms of precision of the extracted lemmas and statistics on coverage and rule productiveness. Competitive extraction figures show that human-written rules in a tailored tool are a time-efficient approach to the task at hand.
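The idea of a search pattern that combines a regular expression with a logical condition over attested word forms can be sketched as follows. This is not the syntax of extract itself: the function `extract_lemmas`, its parameters, the paradigm name, and the toy word forms are all hypothetical, and the only logical connective shown is a conjunction.

```python
import re


def extract_lemmas(word_forms, stem_pattern, required_suffixes,
                   lemma_suffix, paradigm):
    """Propose (lemma, paradigm) pairs for stems that match `stem_pattern`
    and for which stem + suffix is attested for EVERY required suffix."""
    forms = set(word_forms)
    stem_re = re.compile(stem_pattern)
    first = required_suffixes[0]
    found = set()
    for form in forms:
        if not form.endswith(first):
            continue
        stem = form[:len(form) - len(first)]
        if not stem_re.fullmatch(stem):
            continue
        # Conjunction: all required forms must occur in the data.
        if all(stem + suffix in forms for suffix in required_suffixes):
            found.add((stem + lemma_suffix, paradigm))
    return sorted(found)


# Toy data: "donner" lacks an attested "donnons", so the rule does not fire.
tokens = ["parler", "parle", "parlons", "donner", "donne", "table"]
pairs = extract_lemmas(tokens,
                       stem_pattern=r"[a-zé]{2,}",
                       required_suffixes=["er", "e", "ons"],
                       lemma_suffix="er",
                       paradigm="regular_er_verb")  # hypothetical paradigm name
```

Requiring more forms raises precision at the cost of recall, which is the trade-off the abstract refers to when it discusses pattern design.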

Asia Information Retrieval Symposium | 2006

Poor man’s stemming: unsupervised recognition of same-stem words

Harald Hammarström

We present a new, fully unsupervised, human-intervention-free algorithm for stemming for an open class of languages. Since it does not rely on existing large data collections or on linguistic resources other than raw text, it is especially attractive for low-density languages. The stemming problem is formulated as a decision whether two given words are variants of the same stem, and requires that, if so, there is a concatenative relation between the two. The underlying theory makes no assumptions on whether the language uses a lot of morphology or not, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes have to be frequent, (2) words are essentially variable-length sequences of random characters, and (3) a heuristic on what constitutes a systematic affix alteration is valid. Tested on four typologically distant languages, the stemmer shows very promising results in an evaluation against a human-made gold standard.

Meeting of the Association for Computational Linguistics | 2006

A naive theory of affixation and an algorithm for extraction

Harald Hammarström

We present a novel approach to the unsupervised detection of affixes, that is, to extracting a set of salient prefixes and suffixes from an unlabeled corpus of a language. The underlying theory makes no assumptions on whether the language uses a lot of morphology or not, whether it is prefixing or suffixing, or whether affixes are long or short. It does, however, assume that (1) salient affixes have to be frequent, i.e., occur much more often than random segments of the same length, and that (2) words are essentially variable-length sequences of random characters, e.g., a character should not occur in far more words than chance would predict without a reason, such as being part of a very frequent affix. The affix extraction algorithm uses only information from fluctuation of frequencies, runs in linear time, and is free from thresholds and non-transparent iterations. We demonstrate the usefulness of the approach with example case studies on typologically distant languages.
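The first assumption above (a salient affix occurs much more often than a random segment of the same length) can be made concrete with a small sketch. This is a swapped-in illustration, not the paper's algorithm: it scores each word-final segment by the ratio of its observed count to the count expected if characters were drawn independently, and the function name `suffix_scores` and the toy vocabulary are hypothetical.

```python
from collections import Counter
from math import prod


def suffix_scores(words, max_len=3):
    """Observed / expected frequency for each word-final segment,
    where 'expected' assumes independently drawn characters."""
    vocab = set(words)
    char_counts = Counter(ch for w in vocab for ch in w)
    total_chars = sum(char_counts.values())
    # Count segments only in words strictly longer than the segment.
    suffix_counts = Counter(
        w[-k:] for w in vocab for k in range(1, max_len + 1) if len(w) > k
    )
    scores = {}
    for suffix, observed in suffix_counts.items():
        k = len(suffix)
        eligible = sum(1 for w in vocab if len(w) > k)
        probability = prod(char_counts[ch] / total_chars for ch in suffix)
        scores[suffix] = observed / (eligible * probability)
    return scores


toy_vocab = ["walking", "talking", "jumping",
             "walked", "talked", "jumped",
             "walk", "talk", "jump"]
scores = suffix_scores(toy_vocab)
```

On this toy vocabulary the true suffix "ing" scores higher than the stem-final segment "alk", but a sample this small only illustrates the ratio; it says nothing about how the paper's threshold-free, linear-time procedure behaves.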

Linguistic Typology | 2009

Whence the Kanum base-6 numeral system?

Harald Hammarström

Abstract Base-6-36 numeral systems, a typological rarity, are found in Kanum languages of New Guinea as testified by Donohue (Linguistic Typology 12: 423–429, 2008). We look at the probable relatives of the Kanum languages and show that the base-6 system must have emerged in the Tonda group specifically. Since there is no evidence of body-part terms in the base-6 forms attested, we speculate that these systems have a different origin. Specifically, we suggest that the base-6 systems arose for counting yams. The ethnographic data for Kanum and other relevant languages are in concord with such a scenario. Whether there is a historical connection with base-6 systems of the Kolopom languages, near, but not adjacent, to the west, remains an open question. If there is a connection, it is areal rather than genetic, but sufficient evidence for a pre-historic areal connection remains to be collected. Equally, if not more, puzzling would be the conclusion that there is no historical connection, given the rarity of base-6 in the world as a whole.

Journal of Quantitative Linguistics | 2008

Counting Languages in Dialect Continua Using the Criterion of Mutual Intelligibility

Harald Hammarström

Abstract This paper shows how it is possible to count languages vs. dialects if, for every pair of varieties, we are given whether they are mutually intelligible or not. The method is to divide the varieties into a minimum number of internally mutually intelligible groups, where each group counts as one language. Expressed in terms of graphs (as in discrete mathematics), the method is even more easily understood as applying graph colouring to a graph over varieties with the intelligibility interrelationships as edges. Graph colouring is already mathematically well understood, and we can easily prove properties intuitively associated with the concepts language and dialect, and remove any fears that these concepts should lead to inconsistencies. The presentation requires only a minimal acquaintance with sets, combinatorics and graphs.
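The counting criterion can be illustrated with a brute-force sketch for small inputs. It assumes the equivalent formulation in which two varieties may share a group (a colour) only if they are mutually intelligible, so the minimum number of groups is the language count. The function name `count_languages` and the three-variety chain are hypothetical, and the exhaustive search is exponential, so it is suitable for toy examples only.

```python
from itertools import product


def count_languages(varieties, intelligible_pairs):
    """Return (minimum number of groups, one optimal grouping) such that
    every two varieties in the same group are mutually intelligible."""
    intelligible = {frozenset(pair) for pair in intelligible_pairs}
    n = len(varieties)

    def valid(assignment):
        for i in range(n):
            for j in range(i + 1, n):
                same_group = assignment[i] == assignment[j]
                pair = frozenset((varieties[i], varieties[j]))
                if same_group and pair not in intelligible:
                    return False
        return True

    # Try 1 group, then 2, ... ; the first feasible k is the minimum.
    for k in range(1, n + 1):
        for assignment in product(range(k), repeat=n):
            if valid(assignment):
                groups = {}
                for variety, group in zip(varieties, assignment):
                    groups.setdefault(group, []).append(variety)
                return k, list(groups.values())
    return 0, []


# A dialect chain: A~B and B~C are intelligible, but A and C are not.
chain_count, chain_groups = count_languages(
    ["A", "B", "C"], [("A", "B"), ("B", "C")]
)
```

For the chain, one group is impossible because A and C are not mutually intelligible, so the count is two languages even though the middle variety B could belong with either neighbour.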

International Conference on Computational Linguistics | 2008

Automatic annotation of bibliographical references with target language

Harald Hammarström

In a large-scale project to list bibliographical references to all of the ca. 7,000 languages of the world, the need arises to automatically annotate the bibliographical entries with ISO-639-3 language identifiers. The task can be seen as a special case of a more general Information Extraction problem: to classify short text snippets in various languages into a large number of classes. We will explore supervised and unsupervised approaches motivated by distributional characteristics of the specific domain and availability of data sets. In all cases, we make use of a database with language names and identifiers. The suggested methods are rigorously evaluated on a fresh representative data set.

Cross Language Evaluation Forum | 2011

Automatic annotation of bibliographical references for descriptive language materials

Harald Hammarström

The present paper considers the problem of annotating bibliographical references with labels/classes, given training data of references already annotated with labels. The problem is an instance of document categorization where the documents are short and written in a wide variety of languages. The skewed distributions of title words and labels call for special care when choosing a Machine Learning approach. The present paper describes how to induce Disjunctive Normal Form formulae (DNFs), which have several advantages over Decision Trees. The approach is evaluated on a large real-world collection of bibliographical references.
