Anders Johannsen
University of Copenhagen
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Anders Johannsen.
international joint conference on natural language processing | 2015
Anders Søgaard; Żeljko Agić; Héctor Martínez Alonso; Barbara Plank; Bernd Bohnet; Anders Johannsen
We present a novel, count-based approach to obtaining inter-lingual word representations based on inverted indexing of Wikipedia. We present experiments applying these representations to 17 datasets in document classification, POS tagging, dependency parsing, and word alignment. Our approach has the advantage that it is simple, computationally efficient and almost parameter-free, and, more importantly, it enables multi-source crosslingual learning. In 14/17 cases, we improve over using state-of-the-art bilingual embeddings.
conference on computational natural language learning | 2015
Anders Johannsen; Dirk Hovy; Anders Søgaard
Most computational sociolinguistics studies have focused on phonological and lexical variation. We present the first large-scale study of syntactic variation among demographic groups (age and gender) across several languages. We harvest data from online user-review sites and parse it with universal dependencies. We show that several age and gender-specific variations hold across languages, for example that women are more likely to use VP conjunctions.
international world wide web conferences | 2015
Dirk Hovy; Anders Johannsen; Anders Søgaard
Sociolinguistic studies investigate the relation between language and extra-linguistic variables. This requires both representative text data and the associated socio-economic meta-data of the subjects. Traditionally, sociolinguistic studies use small samples of hand-curated data and meta-data. This can lead to exaggerated or false conclusions. Using social media data offers a large-scale source of language data, but usually lacks reliable socio-economic meta-data. Our research aims to remedy both problems by exploring a large new data source, international review websites with user profiles. They provide more text data than manually collected studies, and more meta-data than most available social media text. We describe the data and present various pilot studies, illustrating the usefulness of this resource for sociolinguistic studies. Our approach can help generate new research hypotheses based on data-driven findings across several countries and languages.
joint conference on lexical and computational semantics | 2014
Anders Johannsen; Dirk Hovy; Héctor Martínez Alonso; Barbara Plank; Anders Søgaard
We present two Twitter datasets annotated with coarse-grained word senses (supersenses), as well as a series of experiments with three learning scenarios for supersense tagging: weakly supervised learning, as well as unsupervised and supervised domain adaptation. We show that (a) off-the-shelf tools perform poorly on Twitter, (b) models augmented with embeddings learned from Twitter data perform much better, and (c) errors can be reduced using type-constrained inference with distant supervision from WordNet.
conference on computational natural language learning | 2014
Anders Søgaard; Anders Johannsen; Barbara Plank; Dirk Hovy; Héctor Martínez Alonso
In NLP, we need to document that our proposed methods perform significantly better with respect to standard metrics than previous approaches, typically by reporting p-values obtained by rank- or randomization-based tests. We show that significance results following current research standards are unreliable and, in addition, very sensitive to sample size, covariates such as sentence length, as well as to the existence of multiple metrics. We estimate that under the assumption of perfect metrics and unbiased data, we need a significance cut-off at ⇠0.0025 to reduce the risk of false positive results to <5%. Since in practice we often have considerable selection bias and poor metrics, this, however, will not do alone.
north american chapter of the association for computational linguistics | 2016
Nathan Schneider; Dirk Hovy; Anders Johannsen; Marine Carpuat
This task combines the labeling of multiword expressions and supersenses (coarse-grained classes) in an explicit, yet broad-coverage paradigm for lexical semantics. Nine systems participated; the best scored 57.7% F1 in a multi-domain evaluation setting, indicating that the task remains largely unresolved. An error analysis reveals that a large number of instances in the data set are either hard cases, which no systems get right, or easy cases, which all systems correctly solve.
empirical methods in natural language processing | 2015
Anders Johannsen; Héctor Martínez Alonso; Anders Søgaard
We present a multilingual corpus of Wikipedia and Twitter texts annotated with FRAMENET 1.5 semantic frames in nine different languages, as well as a novel technique for weakly supervised cross-lingual frame-semantic parsing. Our approach only assumes the existence of linked, comparable source and target language corpora (e.g., Wikipedia) and a bilingual dictionary (e.g., Wiktionary or BABELNET). Our approach uses a truly interlingual representation, enabling us to use the same model across all nine languages. We present average error reductions over running a state-of-the-art parser on word-to-word translations of 46% for target identification, 37% for frame identification, and 14% for argument identification.
meeting of the association for computational linguistics | 2016
Anders Johannsen; Željko Agić; Anders Søgaard
Most previous work on annotation projection has been limited to a subset of IndoEuropean languages, using only a single source language, and projecting annotation for one task at a time. In contrast, we present an Integer Linear Programming (ILP) algorithm that simultaneously projects annotation for multiple tasks from multiple source languages, relying on parallel corpora available for hundreds of languages. When training POS taggers and dependency parsers on jointly projected POS tags and syntactic dependencies using our algorithm, we obtain better performance than a standard approach on 20/23 languages using one parallel corpus; and 18/27 languages using another.
empirical methods in natural language processing | 2014
Barbara Plank; Anders Johannsen; Anders Søgaard
Importance weighting is a generalization of various statistical bias correction techniques. While our labeled data in NLP is heavily biased, importance weighting has seen only few applications in NLP, most of them relying on a small amount of labeled target data. The publication bias toward reporting positive results makes it hard to say whether researchers have tried. This paper presents a negative result on unsupervised domain adaptation for POS tagging. In this setup, we only have unlabeled data and thus only indirect access to the bias in emission and transition probabilities. Moreover, most errors in POS tagging are due to unseen words, and there, importance weighting cannot help. We present experiments with a wide variety of weight functions, quantilizations, as well as with randomly generated weights, to support these claims.
international conference natural language processing | 2010
Anders Søgaard; Anders Johannsen
Mihalcea [1] discusses self-training and co-training in the context of word sense disambiguation and shows that parameter optimization on individual words was important to obtain good results. Using smoothed co-training of a naive Bayes classifier she obtains a 9.8% error reduction on Senseval-2 data with a fixed parameter setting. In this paper we test a semi-supervised learning algorithm with no parameters, namely tri-training [2]. We also test the random subspace method [3] for building committees out of stable learners. Both techniques lead to significant error reductions with different learning algorithms, but improvements do not accumulate. Our best error reduction is 7.4%, and our best absolute average over Senseval-2 data, though not directly comparable, is 12% higher than the results reported in Mihalcea [1].