Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Anders Johannsen is active.

Publication


Featured research published by Anders Johannsen.


International Joint Conference on Natural Language Processing | 2015

Inverted indexing for cross-lingual NLP

Anders Søgaard; Željko Agić; Héctor Martínez Alonso; Barbara Plank; Bernd Bohnet; Anders Johannsen

We present a novel, count-based approach to obtaining inter-lingual word representations based on inverted indexing of Wikipedia. We present experiments applying these representations to 17 datasets in document classification, POS tagging, dependency parsing, and word alignment. Our approach has the advantage that it is simple, computationally efficient and almost parameter-free, and, more importantly, it enables multi-source cross-lingual learning. In 14/17 cases, we improve over using state-of-the-art bilingual embeddings.
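The core idea lends itself to a small sketch. Assuming a toy collection of interlinked articles (the concept IDs, article texts, and similarity helper below are invented for illustration), each word is represented by a binary vector over the shared concept space, so translation pairs occurring in linked articles end up with similar vectors:

```python
from collections import defaultdict

# Toy "interlinked Wikipedia": each concept ID indexes one short article
# per language. (All data here is invented for illustration.)
articles = {
    "Dog":   {"en": "the dog barks", "da": "hunden gør"},
    "Cat":   {"en": "the cat sleeps", "da": "katten sover"},
    "Music": {"en": "the dog likes music", "da": "hunden kan lide musik"},
}
concepts = sorted(articles)  # fixed concept ordering defines the vector space

def inverted_index_vectors(articles, concepts):
    """Represent each (language, word) pair as a binary vector over
    concepts: 1 if the word occurs in that concept's article."""
    vecs = defaultdict(lambda: [0] * len(concepts))
    for j, concept in enumerate(concepts):
        for lang, text in articles[concept].items():
            for word in text.split():
                vecs[(lang, word)][j] = 1
    return dict(vecs)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) * sum(b * b for b in v)) ** 0.5
    return dot / norm if norm else 0.0

vecs = inverted_index_vectors(articles, concepts)

# "dog" and its Danish translation "hunden" occur in the same concepts,
# so their vectors match; "dog" and "katten" ("the cat") do not.
sim_translation = cosine(vecs[("en", "dog")], vecs[("da", "hunden")])
sim_unrelated = cosine(vecs[("en", "dog")], vecs[("da", "katten")])
```

Because the concept space is shared across languages, the same model can consume words from any language with a linked edition, which is what makes the representation multi-source.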


Conference on Computational Natural Language Learning | 2015

Cross-lingual syntactic variation over age and gender

Anders Johannsen; Dirk Hovy; Anders Søgaard

Most computational sociolinguistics studies have focused on phonological and lexical variation. We present the first large-scale study of syntactic variation among demographic groups (age and gender) across several languages. We harvest data from online user-review sites and parse it with universal dependencies. We show that several age and gender-specific variations hold across languages, for example that women are more likely to use VP conjunctions.


International World Wide Web Conference | 2015

User Review Sites as a Resource for Large-Scale Sociolinguistic Studies

Dirk Hovy; Anders Johannsen; Anders Søgaard

Sociolinguistic studies investigate the relation between language and extra-linguistic variables. This requires both representative text data and the associated socio-economic meta-data of the subjects. Traditionally, sociolinguistic studies use small samples of hand-curated data and meta-data. This can lead to exaggerated or false conclusions. Using social media data offers a large-scale source of language data, but usually lacks reliable socio-economic meta-data. Our research aims to remedy both problems by exploring a large new data source, international review websites with user profiles. They provide more text data than manually collected studies, and more meta-data than most available social media text. We describe the data and present various pilot studies, illustrating the usefulness of this resource for sociolinguistic studies. Our approach can help generate new research hypotheses based on data-driven findings across several countries and languages.


Joint Conference on Lexical and Computational Semantics | 2014

More or less supervised supersense tagging of Twitter

Anders Johannsen; Dirk Hovy; Héctor Martínez Alonso; Barbara Plank; Anders Søgaard

We present two Twitter datasets annotated with coarse-grained word senses (supersenses), as well as a series of experiments with three learning scenarios for supersense tagging: weakly supervised learning, as well as unsupervised and supervised domain adaptation. We show that (a) off-the-shelf tools perform poorly on Twitter, (b) models augmented with embeddings learned from Twitter data perform much better, and (c) errors can be reduced using type-constrained inference with distant supervision from WordNet.
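Point (c) can be sketched minimally. The stand-in lexicon, tag set, and model scores below are all invented (the paper uses WordNet itself): at decoding time a known word may only receive a supersense that the lexicon licenses for its type, which can override a bad model score.

```python
# Toy type constraints: for each known word, the supersenses licensed by
# a stand-in for WordNet. (Lexicon, tag set, and scores are invented.)
LEXICON = {
    "bank": {"noun.group", "noun.object"},
    "run":  {"verb.motion", "verb.social"},
}
ALL_TAGS = {"noun.group", "noun.object", "noun.person",
            "verb.motion", "verb.social"}

def constrained_decode(word, scores):
    """Pick the best-scoring tag among those the lexicon allows for this
    word type; unknown words fall back to the full tag set."""
    allowed = LEXICON.get(word, ALL_TAGS)
    return max(allowed, key=lambda t: scores.get(t, float("-inf")))

# The (invented) model scores prefer noun.person, but the type constraint
# rules it out for "bank", so decoding recovers a licensed supersense.
scores = {"noun.person": 2.0, "noun.group": 1.5, "noun.object": 0.3}
tag = constrained_decode("bank", scores)
```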


Conference on Computational Natural Language Learning | 2014

What's in a p-value in NLP?

Anders Søgaard; Anders Johannsen; Barbara Plank; Dirk Hovy; Héctor Martínez Alonso

In NLP, we need to document that our proposed methods perform significantly better with respect to standard metrics than previous approaches, typically by reporting p-values obtained by rank- or randomization-based tests. We show that significance results following current research standards are unreliable and, in addition, very sensitive to sample size, covariates such as sentence length, as well as to the existence of multiple metrics. We estimate that under the assumption of perfect metrics and unbiased data, we need a significance cut-off at ~0.0025 to reduce the risk of false positive results to <5%. Since in practice we often have considerable selection bias and poor metrics, this, however, will not do alone.
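For reference, the randomization-based tests the abstract mentions can be sketched as a paired sign-flipping test over per-item scores (the per-sentence scores below are invented):

```python
import random

def paired_randomization_test(scores_a, scores_b, trials=10_000, seed=0):
    """Approximate randomization test for paired per-item scores: randomly
    flip the sign of each paired difference and count how often the
    shuffled total difference is at least as extreme as the observed one."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one smoothing on the estimate

# Invented per-sentence scores for two systems; system A is consistently
# slightly better, so the test should return a small p-value.
a = [0.81, 0.77, 0.85, 0.80, 0.79, 0.83, 0.82, 0.78, 0.84, 0.80]
b = [0.78, 0.74, 0.80, 0.77, 0.76, 0.79, 0.80, 0.75, 0.81, 0.77]
p = paired_randomization_test(a, b)
```

The paper's point is about how such p-values are used, not how they are computed: even a correct test like this one becomes unreliable under multiple metrics, covariates, and selection bias.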


North American Chapter of the Association for Computational Linguistics | 2016

SemEval-2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM)

Nathan Schneider; Dirk Hovy; Anders Johannsen; Marine Carpuat

This task combines the labeling of multiword expressions and supersenses (coarse-grained classes) in an explicit, yet broad-coverage paradigm for lexical semantics. Nine systems participated; the best scored 57.7% F1 in a multi-domain evaluation setting, indicating that the task remains largely unresolved. An error analysis reveals that a large number of instances in the data set are either hard cases, which no systems get right, or easy cases, which all systems correctly solve.


Empirical Methods in Natural Language Processing | 2015

Any-language frame-semantic parsing

Anders Johannsen; Héctor Martínez Alonso; Anders Søgaard

We present a multilingual corpus of Wikipedia and Twitter texts annotated with FrameNet 1.5 semantic frames in nine different languages, as well as a novel technique for weakly supervised cross-lingual frame-semantic parsing. Our approach only assumes the existence of linked, comparable source and target language corpora (e.g., Wikipedia) and a bilingual dictionary (e.g., Wiktionary or BabelNet). Our approach uses a truly interlingual representation, enabling us to use the same model across all nine languages. We present average error reductions over running a state-of-the-art parser on word-to-word translations of 46% for target identification, 37% for frame identification, and 14% for argument identification.


Meeting of the Association for Computational Linguistics | 2016

Joint part-of-speech and dependency projection from multiple sources

Anders Johannsen; Željko Agić; Anders Søgaard

Most previous work on annotation projection has been limited to a subset of Indo-European languages, using only a single source language, and projecting annotation for one task at a time. In contrast, we present an Integer Linear Programming (ILP) algorithm that simultaneously projects annotation for multiple tasks from multiple source languages, relying on parallel corpora available for hundreds of languages. When training POS taggers and dependency parsers on jointly projected POS tags and syntactic dependencies using our algorithm, we obtain better performance than a standard approach on 20/23 languages using one parallel corpus; and 18/27 languages using another.
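The projection step can be illustrated in a much-reduced form. The sketch below replaces the paper's joint ILP with a simple per-token majority vote over word alignments from multiple sources (all data invented); it conveys multi-source projection but none of the joint consistency constraints between POS tags and dependencies:

```python
from collections import Counter

def project_pos(target_len, alignments, source_tags):
    """Project POS tags onto a target sentence by per-token majority vote
    over word alignments from several source sentences."""
    votes = [Counter() for _ in range(target_len)]
    for src, pairs in alignments.items():
        for s_idx, t_idx in pairs:
            votes[t_idx][source_tags[src][s_idx]] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

# Two invented "source languages" aligned to a 3-token target sentence;
# the second source has a different word order.
source_tags = {
    "src1": ["DET", "NOUN", "VERB"],
    "src2": ["DET", "VERB", "NOUN"],
}
alignments = {
    "src1": [(0, 0), (1, 1), (2, 2)],
    "src2": [(0, 0), (2, 1), (1, 2)],  # pairs of (source index, target index)
}
tags = project_pos(3, alignments, source_tags)
```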


Empirical Methods in Natural Language Processing | 2014

Importance weighting and unsupervised domain adaptation of POS taggers: a negative result

Barbara Plank; Anders Johannsen; Anders Søgaard

Importance weighting is a generalization of various statistical bias correction techniques. While our labeled data in NLP is heavily biased, importance weighting has seen only a few applications in NLP, most of them relying on a small amount of labeled target data. The publication bias toward reporting positive results makes it hard to say whether researchers have tried. This paper presents a negative result on unsupervised domain adaptation for POS tagging. In this setup, we only have unlabeled data and thus only indirect access to the bias in emission and transition probabilities. Moreover, most errors in POS tagging are due to unseen words, and there, importance weighting cannot help. We present experiments with a wide variety of weight functions, quantilizations, as well as with randomly generated weights, to support these claims.
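Part of the argument can be illustrated directly: when the importance weight depends only on the observed word, reweighting rescales every tag count for that word by the same factor, so the most likely tag per word cannot change. A deliberately simplified sketch (invented counts, not the paper's experimental setup):

```python
from collections import defaultdict

def importance_weights(source_items, target_counts, source_counts):
    """Weight each labeled source instance by an estimate of
    p_target(x) / p_source(x), here a ratio of relative frequencies of
    the observed word in unlabeled target vs. source data."""
    n_t = sum(target_counts.values())
    n_s = sum(source_counts.values())
    return [(target_counts.get(word, 0) / n_t) / (source_counts[word] / n_s)
            for word, _ in source_items]

def weighted_tag_model(source_items, weights):
    """Weighted MLE of the most likely tag per word: each instance
    contributes its weight instead of a count of 1."""
    counts = defaultdict(lambda: defaultdict(float))
    for (word, tag), w in zip(source_items, weights):
        counts[word][tag] += w
    return {word: max(tags, key=tags.get) for word, tags in counts.items()}

# Invented toy data: "can" is relatively more frequent in the target
# domain, so its instances get upweighted, but all of them equally.
source = [("can", "VERB")] * 6 + [("can", "NOUN")] * 4 + [("dog", "NOUN")] * 5
target_freq = {"can": 10, "dog": 2}   # counts from unlabeled target text
source_freq = {"can": 10, "dog": 5}

weights = importance_weights(source, target_freq, source_freq)
model = weighted_tag_model(source, weights)
baseline = weighted_tag_model(source, [1.0] * len(source))  # unweighted
```

Here `model` and `baseline` pick the same tag for every word, a toy analogue of the indirect access to emission bias the abstract describes.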


International Conference on Natural Language Processing | 2010

Robust semi-supervised and ensemble-based methods in word sense disambiguation

Anders Søgaard; Anders Johannsen

Mihalcea [1] discusses self-training and co-training in the context of word sense disambiguation and shows that parameter optimization on individual words was important to obtain good results. Using smoothed co-training of a naive Bayes classifier she obtains a 9.8% error reduction on Senseval-2 data with a fixed parameter setting. In this paper we test a semi-supervised learning algorithm with no parameters, namely tri-training [2]. We also test the random subspace method [3] for building committees out of stable learners. Both techniques lead to significant error reductions with different learning algorithms, but improvements do not accumulate. Our best error reduction is 7.4%, and our best absolute average over Senseval-2 data, though not directly comparable, is 12% higher than the results reported in Mihalcea [1].
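Tri-training, the parameter-free algorithm of reference [2], can be sketched on a toy 1-D task. The base learner, data, and round count below are invented for illustration; the essential move is that when two classifiers agree on an unlabeled point, that point becomes extra training data for the third:

```python
import random

def bootstrap(data, rng):
    """Resample with replacement; retry until both classes are present,
    so the simple base learner below can always fit."""
    sample = [rng.choice(data) for _ in data]
    if len({y for _, y in sample}) < 2:
        return bootstrap(data, rng)
    return sample

class ThresholdClassifier:
    """Deliberately simple base learner for 1-D inputs: predicts 1 when
    x lies above the midpoint between the two class means."""
    def fit(self, data):
        pos = [x for x, y in data if y == 1]
        neg = [x for x, y in data if y == 0]
        self.t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self
    def predict(self, x):
        return 1 if x > self.t else 0

def tri_train(labeled, unlabeled, rounds=3, seed=0):
    """Tri-training, minimally: three classifiers start from bootstrap
    samples; whenever two agree on an unlabeled point, that point (with
    the agreed label) is added to the third classifier's training data."""
    rng = random.Random(seed)
    data = [bootstrap(labeled, rng) for _ in range(3)]
    clfs = [ThresholdClassifier().fit(d) for d in data]
    for _ in range(rounds):
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            extra = [(x, clfs[j].predict(x)) for x in unlabeled
                     if clfs[j].predict(x) == clfs[k].predict(x)]
            clfs[i] = ThresholdClassifier().fit(data[i] + extra)
    return clfs

def vote(clfs, x):
    """Final prediction by majority vote of the three classifiers."""
    return 1 if sum(c.predict(x) for c in clfs) >= 2 else 0

# Invented 1-D task: class 0 clusters near 0.0, class 1 near 1.0.
labeled = [(0.1, 0), (0.2, 0), (0.9, 1), (1.1, 1)]
unlabeled = [0.0, 0.15, 0.3, 0.8, 0.95, 1.2]
clfs = tri_train(labeled, unlabeled)
```

The majority vote at the end is also where ensemble methods such as the random subspace method plug in, which is why the paper tests the two techniques together.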

Collaboration


Dive into Anders Johannsen's collaborations.

Top Co-Authors

Barbara Plank (University of Copenhagen)
Dirk Hovy (University of Southern California)
Sussi Olsen (University of Copenhagen)
Anna Braasch (University of Copenhagen)
Jakob Elming (Copenhagen Business School)
Sanni Nimb (University of Copenhagen)