Jan Hajic
Charles University in Prague
Publications
Featured research published by Jan Hajic.
empirical methods in natural language processing | 2005
Ryan T. McDonald; Fernando Pereira; Kiril Ribarov; Jan Hajic
We formalize weighted dependency parsing as searching for maximum spanning trees (MSTs) in directed graphs. Using this representation, the parsing algorithm of Eisner (1996) is sufficient for searching over all projective trees in O(n³) time. More surprisingly, the representation extends naturally to non-projective parsing using the Chu-Liu-Edmonds (Chu and Liu, 1965; Edmonds, 1967) MST algorithm, yielding an O(n²) parsing algorithm. We evaluate these methods on the Prague Dependency Treebank using online large-margin learning techniques (Crammer et al., 2003; McDonald et al., 2005) and show that MST parsing increases efficiency and accuracy for languages with non-projective dependencies.
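As a concrete illustration of the MST formulation (an editorial sketch, not code from the paper), the following Python fragment scores every possible head-dependent arc and recovers the highest-scoring non-projective tree with networkx's implementation of the Chu-Liu-Edmonds algorithm; the sentence and arc scores are invented placeholders.

import networkx as nx

def mst_parse(words, score):
    # words[0] is an artificial ROOT token; score(h, d) is the model's
    # score for an arc from head h to dependent d.
    g = nx.DiGraph()
    for h in range(len(words)):
        for d in range(1, len(words)):  # ROOT is never a dependent
            if h != d:
                g.add_edge(h, d, weight=score(h, d))
    tree = nx.maximum_spanning_arborescence(g)  # Chu-Liu/Edmonds
    return {d: h for h, d in tree.edges()}      # head index per token

# Invented arc scores favouring ROOT -> saw, saw -> {John, dog}, dog -> a.
toy = {(0, 2): 10, (2, 1): 9, (2, 4): 8, (4, 3): 7}
print(mst_parse(["<ROOT>", "John", "saw", "a", "dog"],
                lambda h, d: toy.get((h, d), 1.0)))

In a real parser the arc scores come from a learned model (here, the online large-margin learner the abstract cites); the tree search itself is exactly this maximum spanning arborescence problem.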
Archive | 2003
Alena Böhmová; Jan Hajic; Eva Hajičová; Barbora Hladká
The availability of annotated data (with as rich and “deep” annotation as possible) is desirable in any new development. Textual data are used in the so-called training phase of various empirical methods that solve problems in the field of computational linguistics. While many methods use texts in their plain (or raw) form (in most cases for so-called unsupervised training), more accurate results may be obtained if annotated corpora are available. The data annotation itself is a complex task. While morphologically annotated corpora (pioneered by Henry Kucera in the 1960s) are now available for English and other languages, syntactically annotated corpora are rare. Inspired by the Penn Treebank, the most widely used syntactically annotated corpus of English, we decided to develop a similarly sized corpus of Czech with a rich annotation scheme.
meeting of the association for computational linguistics | 1999
Michael Collins; Jan Hajic; Lance A. Ramshaw; Christoph Tillmann
This paper considers statistical parsing of Czech, which differs radically from English in at least two respects: (1) it is a highly inflected language, and (2) it has relatively free word order. These differences are likely to pose new problems for techniques that have been developed on English. We describe our experience in building on the parsing model of Collins (1997). Our final result of 80% dependency accuracy represents good progress towards the 91% accuracy of the parser on English (Wall Street Journal) text.
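The dependency accuracy figure reported above is simply the fraction of words that receive the correct head. A tiny illustration (the trees are made up):

def dependency_accuracy(gold_heads, pred_heads):
    # Parallel lists: entry i is the index of word i's head.
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

print(dependency_accuracy([2, 0, 2], [2, 0, 1]))  # 2 of 3 correct -> 0.67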
meeting of the association for computational linguistics | 1998
Jan Hajic; Barbora Hladká
[Figure excerpt: sample text tagged by the tagger, with errors printed underlined and corrections shown, e.g. Hlavním/AAIS7 and problémem/NNIS7.]
IEEE Transactions on Speech and Audio Processing | 2004
William Byrne; David S. Doermann; Martin Franz; Samuel Gustman; Jan Hajic; Douglas W. Oard; Michael Picheny; Josef Psutka; Bhuvana Ramabhadran; Dagobert Soergel; Todd Ward; Wei-Jing Zhu
Much is known about the design of automated systems to search broadcast news, but it has only recently become possible to apply similar techniques to large collections of spontaneous speech. This paper presents initial results from experiments with speech recognition, topic segmentation, topic categorization, and named entity detection using a large collection of recorded oral histories. The work leverages a massive manual annotation effort on 10 000 h of spontaneous speech to evaluate the degree to which automatic speech recognition (ASR)-based segmentation and categorization techniques can be adapted to approximate decisions made by human annotators. ASR word error rates near 40% were achieved for both English and Czech for heavily accented, emotional and elderly spontaneous speech based on 65-84 h of transcribed speech. Topical segmentation based on shifts in the recognized English vocabulary resulted in 80% agreement with manually annotated boundary positions at a 0.35 false alarm rate. Categorization was considerably more challenging, with a nearest-neighbor technique yielding F=0.3. This is less than half the value obtained by the same technique on a standard newswire categorization benchmark, but replication on human-transcribed interviews showed that ASR errors explain little of that difference. The paper concludes with a description of how these capabilities could be used together to search large collections of recorded oral histories.
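As a rough sketch of the nearest-neighbor categorization technique evaluated here (the bag-of-words representation, cosine similarity, and 1-nearest-neighbor choice are our assumptions, not necessarily the paper's exact configuration):

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(c * b[w] for w, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def categorize(segment_text, training):
    # training: list of (Counter over words, set of thesaurus labels);
    # the new segment inherits the labels of its nearest neighbor.
    vec = Counter(segment_text.lower().split())
    best = max(training, key=lambda ex: cosine(vec, ex[0]))
    return best[1]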
meeting of the association for computational linguistics | 2001
Jan Hajic; Pavel Krbec; Pavel Kveton; Karel Oliva; Vladimír Petkevič
A hybrid system is described which combines the strengths of manual rule-writing and statistical learning, obtaining results superior to either method applied separately. The combination of the rule-based and statistical systems is not parallel but serial: the rule-based system, performing partial disambiguation with recall close to 100%, is applied first, and a trigram HMM tagger runs on its results. An experiment in Czech tagging has been performed with encouraging results.
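A minimal sketch of this serial combination (the rules and probabilities are invented, and the paper's trigram HMM is simplified to a bigram here for brevity):

def rule_prune(word, tags):
    # Illustrative hand-written rule: a capitalized token is assumed not
    # to be a verb. High recall means rules remove only tags they are
    # sure about, and never prune the candidate set to empty.
    pruned = {t for t in tags if not (word[:1].isupper() and t == 'VERB')}
    return pruned or tags

def viterbi(words, lexicon, trans, emit):
    # lexicon[w]: candidate tag set; trans/emit: log-probabilities.
    paths = {'<s>': (0.0, [])}
    for w in words:
        new = {}
        for t in rule_prune(w, lexicon[w]):
            score, hist = max(
                (p + trans.get((pt, t), -1e9) + emit.get((t, w), -1e9), h)
                for pt, (p, h) in paths.items())
            new[t] = (score, hist + [t])
        paths = new
    return max(paths.values())[1]

The point of the serial design is visible in the structure: the statistical search only ever considers tags that survive the rule-based pruning.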
conference on applied natural language processing | 2000
Jan Hajic
Using the transfer-based Czech-to-Russian MT system RUSLAN and the word-for-word Czech-to-Slovak MT system with morphological disambiguation CESILKO as examples, we argue that for really close languages it is possible to obtain better translation quality by means of simpler methods. The problem of translating into a group of typologically similar languages using a pivot language is also discussed.
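A skeletal view of the word-for-word architecture (every component here, the analyzer, the bilingual lemma dictionary, and the generator, is a hypothetical stand-in for CESILKO's actual modules):

def translate_word_for_word(tokens, analyze, lemma_dict, generate):
    # analyze(w) -> (lemma, morphological tag) after disambiguation;
    # lemma_dict maps source lemmas to target lemmas; generate re-inflects.
    out = []
    for word in tokens:
        lemma, tag = analyze(word)
        out.append(generate(lemma_dict.get(lemma, lemma), tag))
    return ' '.join(out)

For closely related languages the morphological tags largely carry over between source and target, which is why such a simple pipeline can compete with full transfer-based systems.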
north american chapter of the association for computational linguistics | 2015
Stephan Oepen; Marco Kuhlmann; Yusuke Miyao; Daniel Zeman; Silvie Cinková; Dan Flickinger; Jan Hajic; Zdenka Uresová
Task 18 at SemEval 2015 defines Broad-Coverage Semantic Dependency Parsing (SDP) as the problem of recovering sentence-internal predicate–argument relationships for all content words, i.e. the semantic structure constituting the relational core of sentence meaning.
international acm sigir conference on research and development in information retrieval | 2005
J. Scott Olsson; Douglas W. Oard; Jan Hajic
Our goal in cross-language text classification (CLTC) is to use English training data to classify Czech documents (although the concepts presented here are applicable to any language pair). CLTC is an off-line problem, and the authors are unaware of any previous work in this area. CLTC is motivated by both the non-availability of Czech training data (the case, presently, in our dataset) and the possibility of leveraging different topic distributions in different languages to improve overall classification for information retrieval. Consider, for example, that English speakers tend to contribute more to some topics than their Czech counterparts (e.g., to discuss London more than Prague), so that, having only documents in English, we may expect to do poorly at identifying topics like Prague. Czech speakers, on the other hand, often talk about Prague, so that by leveraging Czech data, we might expect to improve on detecting the topic Prague in English speakers; and Prague in English speakers is exactly the sort of thesaurus label which information seekers are most interested in, because it is rare. Accordingly, while a lack of Czech training data presently necessitates CLTC, we would have no reason to abandon the method if such data were to suddenly become available. Our dataset is a collection of manually transcribed, spontaneous, conversational speech in English and Czech. English transcripts have human-assigned labels from a hierarchical thesaurus of approximately 40,000 labels. Presently, labeled Czech data is not available for classifier training. The hierarchy may be divided into two principal branches, containing 1) concept labels (e.g., education) and 2) precoordinated place-date labels (e.g., Germany, 1914 – 1918).
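A skeletal sketch of the cross-language setup (the dictionary gloss and centroid classifier below are placeholders for illustration; the paper's actual components may differ):

from collections import Counter

def gloss(doc, cz_en_dict):
    # Word-for-word gloss of a Czech document into English via a
    # (hypothetical) bilingual dictionary; unknown words pass through.
    return [cz_en_dict.get(w, w) for w in doc.lower().split()]

def train_centroids(labeled_english_docs):
    # labeled_english_docs: list of (token list, label); one bag-of-words
    # centroid per label stands in for a real trained classifier.
    centroids = {}
    for tokens, label in labeled_english_docs:
        centroids.setdefault(label, Counter()).update(tokens)
    return centroids

def classify_czech(doc, cz_en_dict, centroids):
    # Train once on English, classify Czech after glossing into English.
    tokens = Counter(gloss(doc, cz_en_dict))
    overlap = lambda c: sum(min(tokens[w], c[w]) for w in tokens)
    return max(centroids, key=lambda lab: overlap(centroids[lab]))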
conference on applied natural language processing | 1997
Jan Hajic; Barbora Hladká
We present results of probabilistic tagging of Czech texts in order to show how these techniques work for one of the highly morphologically ambiguous inflective languages. After a description of the tag system used, we show the results of four experiments using a simple probabilistic model to tag Czech texts (unigram, two bigram experiments, and a trigram one). For comparison, we have applied the same code and settings to tag an English text (another four experiments), using the same size of training and test data in order to avoid any doubt concerning the validity of the comparison. The experiments use the source-channel model and maximum likelihood training on a Czech hand-tagged corpus and on the tagged Wall Street Journal (WSJ) from the LDC collection. The experiments show (not surprisingly) that the more training data, the better the success rate. The results also indicate that for inflective languages with 1000+ tags we have to develop a more sophisticated approach in order to get closer to an acceptable error rate. In order to compare two different approaches to text tagging, statistical and rule-based, we modified Eric Brill's rule-based part-of-speech tagger and carried out two more experiments on the Czech data, obtaining similar results in terms of the error rate. We have also run three more experiments with a greatly reduced tagset to get another comparison based on a similar tagset size.
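For reference, the source-channel model these experiments instantiate can be written as follows (a textbook formulation in our notation, shown for the trigram case; it is not copied from the paper):

\hat{T} = \arg\max_{T} P(T \mid W) = \arg\max_{T} P(T)\,P(W \mid T) \approx \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})\, P(w_i \mid t_i)

with all probabilities estimated by maximum likelihood (relative frequencies) from the hand-tagged training corpus.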