Lars Borin
University of Gothenburg
Publications
Featured research published by Lars Borin.
Computational Linguistics | 2011
Harald Hammarström; Lars Borin
This article surveys work on Unsupervised Learning of Morphology. We define Unsupervised Learning of Morphology as the problem of inducing a description (of some kind, even if only morpheme-segmentation) of how orthographic words are built up given only raw text data of a language. We briefly go through the history and motivation of this problem. Next, over 200 items of work are listed with a brief characterization, and the most important ideas in the field are critically discussed. We summarize the achievements so far and give pointers for future developments.
Language Resources and Evaluation | 2013
Lars Borin; Markus Forsberg; Lennart Lönngren
The English-language Princeton WordNet (PWN) and some wordnets for other languages have been extensively used as lexical–semantic knowledge sources in language technology applications, due to their free availability and their size. The ubiquitousness of PWN-type wordnets tends to overshadow the fact that they represent one out of many possible choices for structuring a lexical–semantic resource, and it could be enlightening to look at a differently structured resource both from the point of view of theoretical–methodological considerations and from the point of view of practical text processing requirements. The resource described here—SALDO—is such a lexical–semantic resource, intended primarily for use in language technology applications, and offering an alternative organization to PWN-style wordnets. We present our work on SALDO, compare it with PWN, and discuss some implications of the differences. We also describe an integrated infrastructure for computational lexical resources where SALDO forms the central component.
International Conference on Computational Linguistics | 2000
Lars Borin
While language-independent sentence alignment programs typically achieve a recall in the 90 percent range, the same cannot be said about word alignment systems, where normal recall figures tend to fall somewhere between 20 and 40 percent, in the language-independent case. As words (and phrases) for various reasons are more interesting to align than sentences, we need methods to increase word alignment recall, preferably without sacrificing precision. This paper reports on a series of experiments with pivot alignment, which is the use of one or more additional languages to improve bilingual word alignment. The conclusion is that in a multilingual parallel corpus, pivot alignment is a safe way to increase word alignment recall without lowering the precision.
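The idea of pivot alignment described in this abstract can be illustrated with a minimal sketch. It assumes that word alignments are represented as sets of index pairs and that pivot-derived links are simply added to the direct links; the names `compose` and `pivot_align` are hypothetical, and this is not necessarily the procedure used in the paper.

```python
from typing import Dict, Set, Tuple

# A link pairs a word position in one text with a word position in another.
Link = Tuple[int, int]


def compose(src_to_pivot: Set[Link], pivot_to_tgt: Set[Link]) -> Set[Link]:
    """Link source word i to target word k when both align to the same pivot word j."""
    by_pivot: Dict[int, Set[int]] = {}
    for j, k in pivot_to_tgt:
        by_pivot.setdefault(j, set()).add(k)
    return {(i, k) for i, j in src_to_pivot for k in by_pivot.get(j, set())}


def pivot_align(direct: Set[Link],
                src_to_pivot: Set[Link],
                pivot_to_tgt: Set[Link]) -> Set[Link]:
    """Union of the direct links and the links recovered through the pivot language."""
    return direct | compose(src_to_pivot, pivot_to_tgt)


# Toy data (invented for illustration): the direct aligner found one link,
# and the pivot language contributes a second one.
direct = {(0, 0)}
src_to_pivot = {(0, 0), (1, 1)}
pivot_to_tgt = {(0, 0), (1, 2)}
combined = pivot_align(direct, src_to_pivot, pivot_to_tgt)
```

In this toy case the combined alignment contains the direct link plus the link from source word 1 to target word 2, which only the pivot path supplies. Taking the union can only add links, so recall cannot drop; whether precision holds depends on the quality of the two pivot alignments.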
Archive | 2002
Lars Borin
It is sometimes said that part of speech (POS) tags are likely to be the same for translation equivalent words. If this is correct, we could formulate the following hypothesis: It should be possible to use POS tagging for one language in combination with a word alignment system, in order to obtain a (partial) POS tagging for another language. This hypothesis is investigated both empirically — an experiment is described where POS tags were transferred from a POS tagged German text to a parallel Swedish text by automatic word alignment — and theoretically, in the form of a review of relevant linguistic work on the typology of POS systems. The conclusions are that the hypothesis seems to hold at least for closely related languages, that the findings of typological research do not contradict it (or a slightly modified form of it), but that further empirical research is needed.
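The hypothesis in this abstract, that tags can be carried across word-alignment links, can be sketched as follows. The function name `project_tags`, the tag labels, and the tie-breaking rule (the lowest source index wins when several source words align to one target word) are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Optional, Set, Tuple


def project_tags(source_tags: List[str],
                 links: Set[Tuple[int, int]],
                 target_len: int) -> List[Optional[str]]:
    """Copy each source word's POS tag to the target word it is aligned with.

    Unaligned target words keep None, so the result is a partial tagging.
    """
    projected: List[Optional[str]] = [None] * target_len
    # Sorting makes the outcome deterministic when a target word has several links.
    for src_index, tgt_index in sorted(links):
        if projected[tgt_index] is None:
            projected[tgt_index] = source_tags[src_index]
    return projected


# Toy data (invented): three tagged source words, two alignment links.
source_tags = ["DET", "NOUN", "VERB"]
links = {(0, 0), (1, 1)}
target_tags = project_tags(source_tags, links, target_len=3)
```

The third target word has no link here and therefore stays untagged, which corresponds to the "(partial)" tagging the abstract mentions.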
Shall We Play the Festschrift Game? | 2012
Lars Borin
This paper is a theoretical and empirical investigation into the use of the notion “core vocabulary” in some areas of linguistics and related disciplines, originally prompted by the concrete task of compiling core vocabularies in two research projects growing out of two quite different research traditions: (1) lexicostatistics, where “core vocabularies” are used to measure the linguistic distance among languages in order to establish genetic and typological language groupings; and (2) computer-assisted language learning—a long-standing research interest of Lauri Carlson—where the “core vocabulary” is the most central vocabulary, to which language learners should be exposed first. In linguistics we also find a more theoretically motivated notion of “core vocabulary”, as so-called “semantic primitives”. In the paper, I compare the three kinds of “core vocabulary” and discuss their relationship to the formal knowledge-representation systems called “ontologies” (currently among Lauri Carlson’s research interests)—especially “core” ontologies such as SUMO—and the notion of “concept” central to the latter work: What is the relationship—if any—between concepts in such ontologies and lexical items in languages?
Language Technology for Cultural Heritage | 2011
Lars Borin; Markus Forsberg
The goal of the work presented in this chapter is to create a set of computational lexical resources, interlinked on the lexical sense level using the persistent sense identifiers designed for the Present-Day Swedish lexical resource SALDO. In this way, all the diverse linguistic information available in our individual lexical resources – modern and historical – becomes available for all resources where the interlinking has been completed. Using this mechanism, we have been able to devise a semantic search application for 19th century fiction which combines a morphological component for this language variety – Late Modern Swedish – and a lexical-semantic resource for Present-Day Swedish through the lexical sense identifiers. At present, we are extending and cross-linking the modern lexical resources, as well as steadily integrating the 19th century resource into this emerging diachronic lexical resource. We are also working on the intricate and challenging problem of incorporating the most deviant historical language variety – Old Swedish – in this resource, which we clearly have to do before we can truthfully refer to it as a diachronic computational lexical resource for 800 years of Swedish.
Archive | 2006
Anju Saxena; Lars Borin
We do not at the present time know how the language situation in a multilingual region such as South Asia will be affected by modern information and communication technologies: Will linguistic diversity be strengthened or weakened as they become increasingly prevalent in all walks of life? This volume brings together articles on South Asian descriptive linguistics and sociolinguistics, documentary linguistics, issues of intellectual and cultural property and fieldwork ethics, and language technology. The book provides the reader with some basic knowledge of the problems concerned and some directions from which solutions could be forthcoming.
Conference on Information and Knowledge Management | 2013
Lars Borin; Devdatt P. Dubhashi; Markus Forsberg; Richard Johansson; Dimitrios Kokkinakis; Pierre Nugues
The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars over recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper provides some ideas, thoughts and first steps towards a new culturomics initiative, based this time on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily, together with older texts being digitized in cultural heritage projects, grows at an accelerating rate. The volume of text available in digital form has grown far beyond the capacity of human readers, leaving automated semantic processing of the texts as the only realistic option for accessing and using the information contained in them. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for semantic processing of Big Swedish text, focusing on the theoretical and methodological advancement of extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods.
International Journal on Digital Libraries | 2015
Nina Tahmasebi; Lars Borin; Gabriele Capannini; Devdatt P. Dubhashi; Peter Exner; Markus Forsberg; Gerhard Gossen; Fredrik D. Johansson; Richard Johansson; Mikael Kågebäck; Olof Mogren; Pierre Nugues; Thomas Risse
The concept of culturomics was born out of the availability of massive amounts of textual data and the interest to make sense of cultural and language phenomena over time. Thus far, however, culturomics has only made use of, and shown the great potential of, statistical methods. In this paper, we present a vision for a knowledge-based culturomics that complements traditional culturomics. We discuss the possibilities and challenges of combining knowledge-based methods with statistical methods and address major challenges that arise due to the nature of the data: diversity of sources, changes in language over time, and the temporal dynamics of information in general. We address all layers needed for knowledge-based culturomics, from natural language processing and relations to summaries and opinions.
Archive | 2013
Lars Borin; Anju Saxena
The present volume collects contributions addressing different aspects of the measurement of linguistic differences, a topic which probably is as old as language itself but at the same time has acquired renewed interest over the last decade or so, reflecting a rapid development of data-intensive computing in all fields of research, including linguistics.