Petr Pajas
Charles University in Prague
Publication
Featured research published by Petr Pajas.
workshop on statistical machine translation | 2008
Zdenek Zabokrtsky; Jan Ptáček; Petr Pajas
We present a new English→Czech machine translation system combining linguistically motivated layers of language description (as defined in the Prague Dependency Treebank annotation scenario) with statistical NLP approaches.
international conference on computational linguistics | 2008
Petr Pajas; Jan Štěpánek
This paper presents recent advances in an established treebank annotation framework comprising an abstract XML-based data format, a fully customizable editor of tree-based annotations, a toolkit for all kinds of automated data processing with support for cluster computing, and a work-in-progress database-driven search engine with a graphical user interface built into the tree editor.
text speech and dialogue | 2001
Eva Hajičová; Jan Hajic; Barbora Hladká; Martin Holub; Petr Pajas; Veronika Reznícková; Petr Sgall
The Prague Dependency Treebank (PDT) project is conceived of as a many-layered scenario: from the point of view of the stratal annotation scheme, from the division-of-labor point of view, and with regard to the level of detail captured at the highest, tectogrammatical layer. The following aspects of the present status of the PDT are discussed in detail: the now-available PDT version 1.0, annotated manually at the morphemic and analytic layers, including the recent experience with post-annotation checking; the ongoing effort of tectogrammatical layer annotation, with specific attention to the so-called model collection; and two different areas of exploitation of the PDT, for linguistic research purposes and for information retrieval application purposes.
Journal of Quantitative Linguistics | 2010
Radek Čech; Petr Pajas; Ján Macutek
The aim of the article is to introduce a new approach to verb valency analysis. This approach – full valency – observes properties of verbs which occur solely in actual language usage. The term “full valency” means that all arguments, without distinguishing complements (obligatory arguments governed by the verb) and adjuncts (optional arguments directly dependent on the predicate verb), are taken into account. Because of an expectation that full valency reflects some mechanism which governs verb behaviour in a language, hypotheses concerning (1) the distribution of full valency frames, (2) the relationship between the number of valency frames and the frequency of the verb, and (3) the relationship between the number of valency frames and verb length were tested empirically. To test the hypotheses, a Czech syntactically annotated corpus – the Prague Dependency Treebank – was used.
linguistic annotation workshop | 2009
Anna Nedoluzhko; Jiří Mírovský; Petr Pajas
The present paper outlines an ongoing project of annotation of the extended nominal coreference and the bridging anaphora in the Prague Dependency Treebank. We describe the annotation scheme with respect to the linguistic classification of coreferential and bridging relations and focus also on details of the annotation process from the technical point of view. We present methods of helping the annotators: pre-annotation and several useful features implemented in the annotation tool. Our method of measuring inter-annotator agreement is focused on the improvement of the annotation guidelines; we present results of three subsequent measurements of the agreement.
spoken language technology workshop | 2008
Jan Hajic; Silvie Cinková; Marie Mikulová; Petr Pajas; Jan Ptáček; Josef Toman; Zdenka Uresová
We present a description of a new resource (Prague Dependency Treebank of Spoken Language) being created for English and Czech to be used for the task of speech understanding, broad natural language analysis for dialog systems and other speech-related tasks, including speech editing. The resources we have created so far contain audio and a standard transcription of spontaneous speech, but as a novel layer, we add an edited (“reconstructed”) version of the spoken utterances. These edits go beyond the scope of current speech reconstruction efforts in that we allow, on top of the usual deletions of speech artifacts, fillers, etc., also for word modifications, insertions and word order changes. We have used both monologue and dialogue recordings in English and Czech to verify the feasibility of such transcription. We have also assessed the quality of the resulting annotation since the relative freedom of the editing raises an issue of what a “correct” annotation is.
Glottotheory | 2009
Radek Čech; Petr Pajas
The aim of the article is to test empirically predictions formulated in the Transitivity Hypothesis framework. Methodological problems of the original approach are discussed and some solutions are offered. For the testing of the hypotheses two corpora of Czech were used (Prague Spoken Corpus and Prague Dependency Treebank). The results question both the predicted impact of the language form on transitivity and, more importantly, the concept of the Transitivity Hypothesis in general.
text speech and dialogue | 2017
Marie Mikulová; Jiří Mírovský; Anja Nedoluzhko; Petr Pajas; Jan Štěpánek; Jan Hajic
We present a richly annotated spoken language resource, the Prague Dependency Treebank of Spoken Czech 2.0, the primary purpose of which is to serve for speech-related NLP tasks. The treebank features several novel annotation schemas close to the audio and transcript, and the morphological, syntactic and semantic annotation corresponds to the family of Prague Dependency Treebanks; it could thus be used also for linguistic studies, including comparative studies regarding text and speech. The most novel feature is our approach to syntactic annotation, which differs from other similar corpora such as Treebank-3 [8] in that it does not attempt to impose syntactic structure over the input, but includes one more layer which edits the literal transcript to fluent Czech while keeping the original transcript explicitly aligned with the edited version. This allows the morphological, syntactic and semantic annotation to be deterministically and fully mapped back to the transcript and audio. It brings new possibilities for modeling morphology, syntax and semantics in spoken language – either at the original transcript with mapped annotation, or at the new layer after (automatic) editing. The corpus is publicly and freely available.
text speech and dialogue | 2000
Eva Hajičová; Petr Pajas
Two phases of an evaluation of annotating a Czech text corpus on an underlying syntactic level are described and the results are compared and analysed.
language resources and evaluation | 2006
Saso Dzeroski; Tomaz Erjavec; Nina Ledinek; Petr Pajas; Anreja Zele