Harald Trost
University of Vienna
Publication
Featured research published by Harald Trost.
Meeting of the Association for Computational Linguistics | 2002
Marco Baroni; Johannes Matiasek; Harald Trost
We present an algorithm that takes an unannotated corpus as its input, and returns a ranked list of probable morphologically related pairs as its output. The algorithm tries to discover morphologically related pairs by looking for pairs that are both orthographically and semantically similar, where orthographic similarity is measured in terms of minimum edit distance, and semantic similarity is measured in terms of mutual information. The procedure does not rely on a morpheme concatenation model, nor on distributional properties of word substrings (such as affix frequency). Experiments with German and English input give encouraging results, both in terms of precision (proportion of good pairs found at various cutoff points of the ranked list), and in terms of a qualitative analysis of the types of morphological patterns discovered by the algorithm.
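A minimal sketch of the orthographic half of such a procedure, assuming a toy word list and ranking candidate pairs by plain minimum edit distance; the mutual-information component and the corpus-scale filtering described above are omitted:

```python
from itertools import combinations

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein (minimum edit) distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def candidate_pairs(words, max_dist=3):
    """Rank word pairs by orthographic similarity (smaller distance first)."""
    pairs = []
    for w1, w2 in combinations(sorted(set(words)), 2):
        d = edit_distance(w1, w2)
        if d <= max_dist:
            pairs.append((d, w1, w2))
    return sorted(pairs)

if __name__ == "__main__":
    toy = ["sing", "singer", "singing", "walk", "walked", "walking"]
    for d, w1, w2 in candidate_pairs(toy):
        print(d, w1, w2)
```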
Applied Artificial Intelligence | 2005
Harald Trost; Johannes Matiasek; Marco Baroni
This paper describes the language component of FASTY, a text prediction system designed to improve text input efficiency for disabled users. The FASTY language component is based on state-of-the-art n-gram-based word-level and part-of-speech-level prediction and on a number of innovative modules (morphological analysis, collocation-based prediction, compound prediction) that are meant to enhance performance in languages other than English. Together with its modular architecture, these novel techniques make it adaptable to a wide range of languages without sacrificing performance. Currently, versions for Dutch, German, French, Italian, and Swedish are supported. The system can be parameterized to be used with different user interfaces and for a range of different applications. In this paper, we discuss each of the modules in detail and we present a series of experimental evaluations of the system.
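To illustrate the word-level n-gram prediction such a component builds on, here is a minimal bigram-based completion sketch; the toy training text, prefix filtering, and list size are illustrative assumptions, not the FASTY models:

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count word bigrams from a token stream."""
    model = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev][nxt] += 1
    return model

def predict(model, prev_word, prefix="", k=5):
    """Suggest up to k completions for the current prefix, given the previous word."""
    candidates = model.get(prev_word, Counter())
    ranked = [w for w, _ in candidates.most_common() if w.startswith(prefix)]
    return ranked[:k]

if __name__ == "__main__":
    text = "the cat sat on the mat and the cat ate the fish".split()
    model = train_bigrams(text)
    print(predict(model, "the", prefix="c"))   # -> ['cat']
```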
International Conference on Computers Helping People with Special Needs | 2002
Johannes Matiasek; Marco Baroni; Harald Trost
Communication and information exchange are vital factors in human society, and communication disorders severely affect quality of life. Whereas experienced typists produce some 300 keystrokes per minute, persons with motor impairments achieve much lower rates. Predictive typing systems for English-speaking areas have proven useful and efficient, but for the other European languages no predictive typing programs exist that are powerful enough to substantially improve the communication rate and IT access for disabled persons. FASTY aims at offering a communication support system that significantly increases typing speed and is adaptable to users with different languages and strongly varying needs. In this way the large group of non-English-speaking disabled citizens will be supported in living a more independent and self-determined life.
International Conference on Computational Linguistics | 2002
Marco Baroni; Johannes Matiasek; Harald Trost
In word prediction systems for augmentative and alternative communication (AAC), productive word-formation processes such as compounding pose a serious problem. We present a model that predicts German nominal compounds by splitting them into their modifier and head components, instead of trying to predict them as a whole. The model is improved further by the use of class-based modifier-head bigrams constructed using semantic classes automatically extracted from a corpus. The evaluation shows that the split compound model with class bigrams leads to an improvement in keystroke savings of more than 15% over a no split compound baseline model. We also present preliminary results obtained with a word prediction model integrating compound and simple word prediction.
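Keystroke savings, the measure reported above, can be computed for any prediction function; the following sketch assumes a hypothetical prefix-matching predictor and a fixed prediction-list size, purely for illustration:

```python
def keystrokes_with_prediction(word, predictor, list_size=5):
    """Simulate typing a word letter by letter; selecting a suggestion costs one keystroke."""
    for typed in range(len(word) + 1):
        prefix = word[:typed]
        if word in predictor(prefix)[:list_size]:
            return typed + 1           # one extra keystroke to pick the suggestion
    return len(word)                   # never predicted: type the word in full

def keystroke_savings(words, predictor):
    """KS = 1 - (keystrokes with prediction / keystrokes without prediction)."""
    baseline = sum(len(w) for w in words)
    assisted = sum(keystrokes_with_prediction(w, predictor) for w in words)
    return 1.0 - assisted / baseline

if __name__ == "__main__":
    lexicon = ["Haus", "Haustür", "Tür", "Türschloss"]
    predictor = lambda prefix: [w for w in lexicon if w.startswith(prefix)]
    print(f"{keystroke_savings(['Haustür', 'Türschloss'], predictor):.2%}")
```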
Conference on Recommender Systems | 2010
Jeremy Jancsary; Friedrich Neubarth; Harald Trost
We analyze preferences and the reading flow of users of a popular Austrian online newspaper. Unlike traditional news filtering approaches, we postulate that a user's preference for particular articles depends not only on the topic and on propositional contents, but also on the user's current context and on more subtle attributes. Our assumption is motivated by the observation that many people read newspapers because they actually enjoy the process. Such sentiments depend on a complex variety of factors. The present study is part of an ongoing effort to bring more advanced personalization to online media. Towards this end, we present a systematic evaluation of the merit of contextual and non-propositional features based on real-life clickstream and postings data. Furthermore, we assess the impact of different recommendation strategies on the learning performance of our system.
Applied Artificial Intelligence | 1991
Harald Trost
A language-independent morphological component for the recognition and generation of word forms is presented. Based on a lexicon of morphs, the approach combines two-level morphology and a feature-based unification grammar describing word formation. To overcome the heavy use of diacritics, feature structures are associated with the two-level rules. These feature structures function as filters for the application of the rules. That way information contained in the lexicon and the morphological grammar can guide the application of the two-level rules. Moreover, information can be transmitted from the two-level part to the grammar part. This approach allows for a natural description of some nonconcatenative morphological phenomena as well as morphonological phenomena that are restricted to certain word classes in their applicability. The approach is applied to German inflectional and derivational morphology. The component may easily be incorporated into natural language understanding systems and can be espe...
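As a rough illustration of feature structures acting as filters on two-level rules, the following sketch uses naive unification of flat feature structures; the rule, features, and filter are invented toy values, not the paper's German grammar:

```python
def unify(fs1, fs2):
    """Naive unification of flat feature structures (dicts); None on clash."""
    result = dict(fs1)
    for attr, val in fs2.items():
        if attr in result and result[attr] != val:
            return None        # feature clash: the structures do not unify
        result[attr] = val
    return result

# A two-level rule (say, umlauting the stem vowel) restricted by a feature filter:
rule_filter = {"umlaut": True}

def rule_applies(morph_features):
    """The rule may apply only if its filter unifies with the morph's lexical features."""
    return unify(rule_filter, morph_features) is not None

print(rule_applies({"cat": "noun", "umlaut": True}))   # True: rule may apply
print(rule_applies({"cat": "noun", "umlaut": False}))  # False: blocked by the filter
```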
Empirical Methods in Natural Language Processing | 2008
Jeremy Jancsary; Johannes Matiasek; Harald Trost
Automatic processing of medical dictations poses a significant challenge. We approach the problem by introducing a statistical framework capable of identifying types and boundaries of sections, lists and other structures occurring in a dictation, thereby gaining explicit knowledge about the function of such elements. Training data is created semi-automatically by aligning a parallel corpus of corrected medical reports and corresponding transcripts generated via automatic speech recognition. We highlight the properties of our statistical framework, which is based on conditional random fields (CRFs) and implemented as an efficient, publicly available toolkit. Finally, we show that our approach is effective both under ideal conditions and for real-life dictation involving speech recognition errors and speech-related phenomena such as hesitations and repetitions.
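A minimal sketch of the kind of CRF-based sequence labelling involved, using the third-party sklearn-crfsuite package rather than the authors' own toolkit; the report tokens, section labels, and features are toy assumptions:

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(tokens, i):
    """Simple per-token features: surface form, casing, and the preceding word."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
    else:
        feats["BOS"] = True
    return feats

# One dictated report, tokenised, with BIO-style section labels (toy data).
tokens = ["HISTORY", "patient", "reports", "chest", "pain", "MEDICATIONS", "aspirin"]
labels = ["B-History", "I-History", "I-History", "I-History", "I-History",
          "B-Medications", "I-Medications"]

X = [[token_features(tokens, i) for i in range(len(tokens))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])
```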
International Joint Conference on Artificial Intelligence | 1991
Harald Trost
X2MORF is a language independent morphological component for the recognition and generation of word forms based on a lexicon of morphs. The approach is based on two-level morphology. Extensions are motivated by linguistic data which call into question an underlying assumption of standard two-level morphology, namely the independence of morphophonology and morphology as exemplified by two-level rules and continuation classes. Accordingly, I propose a model which allows for interaction between these two parts. Instead of using continuation classes, word formation is described in a feature-based unification grammar. Two-level rules are provided with a morphological context in the form of feature structures. Information contained in the lexicon and the word formation grammar guides the application of two-level rules by matching the morphological context against the morphs. I present an efficient implementation of that model where rules are compiled into automata (as in the standard model) and where processing of the feature-based grammar is enhanced using an automaton derived from that grammar as a filter.
Computer Speech & Language | 2011
Stefan Petrik; Christina Drexel; Leo Fessler; Jeremy Jancsary; Alexandra Klein; Gernot Kubin; Johannes Matiasek; Franz Pernkopf; Harald Trost
Automatic speech recognition (ASR) has become a valuable tool in large document production environments like medical dictation. While manual post-processing is still needed for correcting speech recognition errors and for creating documents which adhere to various stylistic and formatting conventions, a large part of the document production process is carried out by the ASR system. For improving the quality of the system output, knowledge about the multi-layered relationship between the dictated texts and the final documents is required. Thus, typical speech-recognition errors can be avoided, and proper style and formatting can be anticipated in the ASR part of the document production process. Yet - while vast amounts of recognition results and manually edited final reports are constantly being produced - the error-free literal transcripts of the actually dictated texts are a scarce and costly resource because they have to be created by manually transcribing the audio files. To obtain large corpora of literal transcripts for medical dictation, we propose a method for automatically reconstructing them from draft speech-recognition transcripts plus the corresponding final medical reports. The main innovative aspect of our method is the combination of two independent knowledge sources: phonetic information for the identification of speech-recognition errors and semantic information for detecting post-editing concerning format and style. Speech recognition results and final reports are first aligned, then properly matched based on semantic and phonetic similarity, and finally categorised and selectively combined into a reconstruction hypothesis. This method can be used for various applications in language technology, e.g., adaptation for ASR, document production, or generally for the development of parallel text corpora of non-literal text resources. In an experimental evaluation, which also includes an assessment of the quality of the reconstructed transcripts compared to manual transcriptions, the described method results in a relative word error rate reduction of 7.74% after retraining the standard language model with reconstructed transcripts.
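The first step of such a reconstruction, aligning the recognizer draft with the edited final report, can be sketched with Python's difflib; the example sentences are invented, and the phonetic and semantic matching described above is not modelled here:

```python
import difflib

# Toy draft recognition result and corresponding edited final report.
draft = "patient complains of chest pane since two days".split()
final = "The patient complains of chest pain for two days .".split()

# Align the two token sequences and inspect matched and edited regions.
matcher = difflib.SequenceMatcher(a=draft, b=final, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:8s} draft={draft[i1:i2]!r:40s} final={final[j1:j2]!r}")
```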
International Conference on Computational Linguistics | 1986
Harald Trost; Ernst Buchberger
Creating a knowledge base has always been a bottleneck in the implementation of AI systems. This is also true for Natural Language Understanding (NLU) systems, particularly for data-driven ones. While a perfect system for automatic acquisition of all sorts of knowledge is still far from being realized, partial solutions are possible. This holds especially for lexical data. Nevertheless, the task is not trivial, in particular when dealing with languages rich in inflectional forms like German. Our system is to be used by persons with no specific linguistic knowledge; thus, linguistic expertise has been put into the system to ensure correct classification of words. Classification is done by means of a small rule-based system with lexical knowledge and language-specific heuristics. The key idea is the identification of three sorts of knowledge which are processed distinctly, and the optimal use of knowledge already contained in the existing lexicon.