Agata Savary
François Rabelais University
Publication
Featured research published by Agata Savary.
Lecture Notes in Computer Science | 2003
Agata Savary; Christian Jacquemin
We discuss the nature and scope of linguistic (morphological, syntactic and semantic) variation of terms and its impact on two information retrieval tasks: term acquisition and automatic indexing. We review the natural language processing techniques used in these two areas and give an in-depth presentation of FASTR, a corpus processor for the recognition, normalization, and acquisition of multi-word terms.
International Database Engineering and Applications Symposium | 2006
Béatrice Bouchou; Ahmed Cheriat; Myrian Halfeld Ferrari; Agata Savary
Updating XML documents subject to schema constraints requires incremental validation, i.e. checking only the parts of the document affected by the updates. We propose to correct subtrees for which re-validation fails: if the validator fails at node p, a correction routine is called to compute corrections of the subtree rooted at p within a given threshold, after which validation continues. In the correction process, we limit ourselves to single-type tree languages. The correction routine uses tree edit distance matrices. Alternative corrections are proposed to the user.
International Conference on Implementation and Application of Automata | 2001
Agata Savary
A method of error-tolerant lookup in a finite-state lexicon is described, as well as its application to automatic spelling correction. We compare our method to the algorithm by K. Oflazer [14]. While Oflazer's algorithm searches for all possible corrections of a misspelled word that are within a given similarity threshold, our approach is to retain only the most similar corrections (nearest neighbours), dynamically reducing the search space in the lexicon, and to reach the first correction as soon as possible.
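The nearest-neighbour idea in this abstract can be illustrated with a minimal sketch. It is not the published algorithm: it assumes a plain dictionary-based trie in place of a finite-state lexicon, uses unit-cost Levenshtein distance, and makes no attempt to reach the first correction early. The names `build_trie`, `nearest_neighbours` and `END` are hypothetical. What it does show is the dynamic shrinking of the threshold: once a correction at distance d is found, every branch that cannot do at least as well is pruned.

```python
END = None  # hypothetical end-of-word key; assumes None is never a character


def build_trie(words):
    """Store the lexicon as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def nearest_neighbours(trie, query, max_dist):
    """Return (distance, words) for the lexicon entries closest to `query`.

    Only the most similar entries are kept. Returns (None, []) when no
    entry lies within `max_dist` edits.
    """
    best = max_dist  # current threshold, tightened as corrections are found
    found = []

    def visit(node, row):
        nonlocal best, found
        # row[j] = edit distance between the trie prefix and query[:j]
        if END in node and row[-1] <= best:
            if row[-1] < best:  # strictly closer: discard earlier candidates
                best, found = row[-1], []
            found.append(node[END])
        for ch, child in node.items():
            if ch is END:
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                new_row.append(min(
                    new_row[j - 1] + 1,                # insert query[j-1]
                    row[j] + 1,                        # delete ch
                    row[j - 1] + (query[j - 1] != ch)  # match or substitute
                ))
            # min(new_row) is a lower bound for every word below this node
            if min(new_row) <= best:
                visit(child, new_row)

    visit(trie, list(range(len(query) + 1)))
    return (best, sorted(found)) if found else (None, [])


lexicon = build_trie(["cat", "cart", "car", "dog", "coat"])
print(nearest_neighbours(lexicon, "caat", 2))
```

Here "car" lies within the initial threshold of 2 but is not returned, because closer entries at distance 1 tighten the threshold during the search.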
International Multiconference on Computer Science and Information Technology | 2010
Jakub Waszczuk; Katarzyna Głowińska; Agata Savary; Adam Przepiórkowski
The ongoing project aiming at the creation of the National Corpus of Polish assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups, and named entities. We show how the knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus, and we discuss some particular problems faced during the elaboration of the syntactic grammar, which contains over 800 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customized for manual post-editing of annotations, and for further revision of discrepancies. Our XML format converters and customized archiving repository ensure automatic data flow and efficient corpus file management. We believe that this environment, or substantial parts of it, can be reused in or adapted for other corpus annotation tasks.
International Conference on Implementation and Application of Automata | 2009
Agata Savary
Multi-word units are linguistic objects whose idiosyncrasy calls for a lexicalized approach capable of rendering their orthographic, inflectional and syntactic flexibility. Multiflex is a graph-based formalism that answers this need by conflating different surface realizations of the same underlying concept. Its implementation relies on finite-state machinery with unification. It can be applied to the creation of linguistic resources for high-quality natural language processing tasks.
Language and Technology Conference | 2013
Maciej Ogrodniczuk; Katarzyna Głowińska; Mateusz Kopeć; Agata Savary; Magdalena Zawisławska
The Polish Coreference Corpus (PCC) is a large corpus of Polish general nominal coreference built upon the National Corpus of Polish. With its 1900 documents from 14 text genres, containing about 540,000 tokens, 180,000 mentions and 128,000 coreference clusters, the PCC is among the largest coreference corpora in the international community. It has some novel features, such as the annotation of the quasi-identity relation, inspired by Recasens’ near-identity, as well as the mark-up of semantic heads and dominant expressions. It shows a good inter-annotator agreement and is distributed in three formats under an open license. Its by-products include freely available annotation tools with custom features such as file distribution management and annotation adjudication.
The Computer Journal | 2014
Joshua Amavi; Béatrice Bouchou; Agata Savary
We present an algorithm for the correction of an XML document with respect to schema constraints expressed as a document type definition. Given a well-formed XML document t seen as a tree, a schema S and a non-negative threshold th, the algorithm finds every tree t′ valid with respect to S such that the edit distance between t and t′ is no higher than th. The algorithm is based on a recursive exploration of the finite-state automata representing structural constraints imposed by the schema, as well as on the construction of an edit distance matrix storing edit sequences leading to correction trees. We prove the termination, correctness and completeness of the algorithm, as well as its exponential time complexity. We also perform experimental tests on real-life XML data showing the influence of various input parameters on the execution time and on the number of solutions found. In these tests, the algorithm's implementation demonstrates polynomial rather than exponential behavior. It has been made public under the GNU LGPL v3 license. As we show in our in-depth discussion of the related work, this is the first full-fledged study of the document-to-schema correction problem.
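A much simplified, word-level analogue may help make the threshold-bounded correction concrete. The sketch below is not the algorithm of the paper, which works on whole trees and enumerates every valid tree within th. It assumes a single flat sequence of child labels and a content model already compiled into a deterministic finite-state automaton, and it returns only the minimum number of unit-cost edits, using a shortest-path search over (position, state) pairs. The function name `min_edits_to_dfa`, the `delta` encoding and the toy content model `title author+ year?` are hypothetical.

```python
import heapq


def min_edits_to_dfa(word, start, accepting, delta, threshold):
    """Fewest insertions, deletions and substitutions that turn `word`
    into a sequence accepted by the automaton, or None if more than
    `threshold` edits are needed. `delta` maps (state, symbol) to state.
    """
    n = len(word)
    best = {(0, start): 0}
    heap = [(0, 0, start)]  # (cost so far, symbols consumed, state)
    while heap:
        cost, i, q = heapq.heappop(heap)
        if cost > best[(i, q)]:
            continue  # stale queue entry
        if i == n and q in accepting:
            return cost  # first accepting pop is optimal (Dijkstra)
        moves = []
        if i < n:
            moves.append((cost + 1, i + 1, q))  # delete word[i]
        for (state, symbol), target in delta.items():
            if state != q:
                continue
            moves.append((cost + 1, i, target))  # insert `symbol`
            if i < n:
                step = 0 if symbol == word[i] else 1  # match or substitute
                moves.append((cost + step, i + 1, target))
        for new_cost, j, r in moves:
            if new_cost <= threshold and new_cost < best.get((j, r), threshold + 1):
                best[(j, r)] = new_cost
                heapq.heappush(heap, (new_cost, j, r))
    return None


# Toy content model (assumed for illustration): title author+ year?
delta = {
    (0, "title"): 1,
    (1, "author"): 2,
    (2, "author"): 2,
    (2, "year"): 3,
}
accepting = {2, 3}

print(min_edits_to_dfa(["title", "year"], 0, accepting, delta, threshold=2))
```

In the example, the sequence `title year` lacks the mandatory `author` child, so a single edit (inserting `author`, or relabelling `year` as `author`) makes it valid.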
International Conference on Computational Linguistics | 2013
Maciej Ogrodniczuk; Magdalena Zawisławska; Katarzyna Głowińska; Agata Savary
Creating a coreference corpus for an inflectional and free-word-order language is a challenging task due to specific syntactic features largely ignored by existing annotation guidelines, such as the absence of definite/indefinite articles (making quasi-anaphoricity very common), frequent use of zero subjects, or discrepancies between syntactic and semantic heads. This paper comments on the experience gained in preparing such a resource for an ongoing project (CORE) aimed at creating tools for coreference resolution. Starting with a clarification of the relation between noun groups and mentions, through the definition of the annotation scope and strategies, up to actual decisions for borderline cases, we present the process of building what is, to the best of our knowledge, the first corpus of general coreference of Polish.
International Journal of Data Mining, Modelling and Management | 2013
Jakub Waszczuk; Katarzyna Głowińska; Agata Savary; Adam Przepiórkowski; Michał Lenart
The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.
CCL | 2013
Maciej Ogrodniczuk; Katarzyna Głowińska; Mateusz Kopeć; Agata Savary; Magdalena Zawisławska
This paper reports on linguistic features and decisions that we find vital in the process of annotation and resolution of coreference for highly inflectional languages. The presented results have been collected during preparation of a corpus of general direct nominal coreference of Polish. Starting from the notion of a mention, its borders and potential vs. actual referentiality, we discuss the problem of complete and near-identity, zero subjects and dominant expressions. We also present interesting linguistic cases influencing the coreference resolution such as the difference between semantic and syntactic heads or the phenomenon of coreference chains made of indefinite pronouns.