Agata Savary | Researchain

Publication

Featured research published by Agata Savary.

Lecture Notes in Computer Science | 2003

Reducing Information Variation in Text

Agata Savary; Christian Jacquemin

We discuss the nature and scope of linguistic (morphological, syntactic and semantic) variation of terms and its impact on two information retrieval tasks: term acquisition and automatic indexing. We review the natural language processing techniques used in these two areas and give an in-depth presentation of FASTR, a corpus processor for the recognition, normalization, and acquisition of multi-word terms.

International Database Engineering and Applications Symposium | 2006

XML Document Correction: Incremental Approach Activated by Schema Validation

Béatrice Bouchou; Ahmed Cheriat; Myrian Halfeld Ferrari; Agata Savary

Updating XML documents subject to schema constraints requires incremental validation, i.e. checking only the parts of the document affected by the updates. We propose to correct subtrees for which re-validation fails: if the validator fails at node p, a correction routine is called to compute corrections of the subtree rooted at p, within a given threshold. Validation then continues. In the correction process, we limit ourselves to single-type tree languages. The correction routine uses tree edit distance matrices. Different correction versions are proposed to the user.

International Conference on Implementation and Application of Automata | 2001

Typographical Nearest-Neighbor Search in a Finite-State Lexicon and Its Application to Spelling Correction

Agata Savary

A method of error-tolerant lookup in a finite-state lexicon is described, as well as its application to automatic spelling correction. We compare our method to the algorithm by K. Oflazer [14]. While Oflazer's algorithm searches for all possible corrections of a misspelled word that are within a given similarity threshold, our approach is to retain only the most similar corrections (nearest neighbours), dynamically reducing the search space in the lexicon, and to reach the first correction as soon as possible.
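The nearest-neighbour idea in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a plain trie instead of a minimized finite-state lexicon and plain Levenshtein distance as the similarity measure, and the names `TrieNode`, `build_trie` and `nearest_neighbours` are hypothetical. The point it shows is the dynamically shrinking threshold: once a correction at distance d is found, branches that cannot do at least as well are no longer explored.

```python
from typing import Dict, List


class TrieNode:
    """One state of a (hypothetical) trie-shaped lexicon."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


def build_trie(words: List[str]) -> TrieNode:
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def nearest_neighbours(root: TrieNode, query: str, max_dist: int) -> List[str]:
    """Return the lexicon words closest to `query`, within `max_dist` edits."""
    best = max_dist          # current threshold; shrinks when a closer word is found
    results: List[str] = []  # words at distance `best` found so far

    def visit(node: TrieNode, prefix: str, prev_row: List[int]) -> None:
        nonlocal best, results
        dist = prev_row[-1]  # edit distance between `prefix` and the whole query
        if node.is_word and dist <= best:
            if dist < best:  # strictly closer: drop the earlier, worse candidates
                best = dist
                results = []
            results.append(prefix)
        for ch, child in sorted(node.children.items()):
            # Extend the Levenshtein table by one row for the prefix + ch.
            row = [prev_row[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
            # min(row) is a lower bound for every word below this branch.
            if min(row) <= best:
                visit(child, prefix + ch, row)

    visit(root, "", list(range(len(query) + 1)))
    return results
```

For example, with a lexicon containing "hello", "help", "hell", "yellow" and "world", the query "helo" with `max_dist=2` yields the three words at distance 1, and the threshold drops from 2 to 1 as soon as the first of them is reached.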

International Multiconference on Computer Science and Information Technology | 2010

Tools and methodologies for annotating syntax and named entities in the National Corpus of Polish

Jakub Waszczuk; Katarzyna Głowińska; Agata Savary; Adam Przepiórkowski

The ongoing project aiming at the creation of the National Corpus of Polish assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups, and named entities. We show how the knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus, and we discuss some particular problems faced during the elaboration of the syntactic grammar, which contains over 800 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customized for manual post-editing of annotations, and for further revision of discrepancies. Our XML format converters and customized archiving repository ensure an automatic data flow and efficient corpus file management. We believe that this environment, or substantial parts of it, can be reused in or adapted for other corpus annotation tasks.

International Conference on Implementation and Application of Automata | 2009

Multiflex: A Multilingual Finite-State Tool for Multi-Word Units

Agata Savary

Multi-word units are linguistic objects whose idiosyncrasy calls for a lexicalized approach that can render their orthographic, inflectional and syntactic flexibility. Multiflex is a graph-based formalism answering this need by conflating different surface realizations of the same underlying concept. Its implementation relies on finite-state machinery with unification. It can be applied to the creation of linguistic resources for high-quality natural language processing tasks.

Language and Technology Conference | 2013

Polish Coreference Corpus

Maciej Ogrodniczuk; Katarzyna Głowińska; Mateusz Kopeć; Agata Savary; Magdalena Zawisławska

The Polish Coreference Corpus (PCC) is a large corpus of Polish general nominal coreference built upon the National Corpus of Polish. With its 1900 documents from 14 text genres, containing about 540,000 tokens, 180,000 mentions and 128,000 coreference clusters, the PCC is among the largest coreference corpora in the international community. It has some novel features, such as the annotation of the quasi-identity relation, inspired by Recasens’ near-identity, as well as the mark-up of semantic heads and dominant expressions. It shows a good inter-annotator agreement and is distributed in three formats under an open license. Its by-products include freely available annotation tools with custom features such as file distribution management and annotation adjudication.

The Computer Journal | 2014

On Correcting XML Documents with Respect to a Schema

Joshua Amavi; Béatrice Bouchou; Agata Savary

We present an algorithm for the correction of an XML document with respect to schema constraints expressed as a document type definition. Given a well-formed XML document t seen as a tree, a schema S and a non-negative threshold th, the algorithm finds every tree t′ valid with respect to S such that the edit distance between t and t′ is no higher than th. The algorithm is based on a recursive exploration of the finite-state automata representing structural constraints imposed by the schema, as well as on the construction of an edit distance matrix storing edit sequences leading to correction trees. We prove the termination, correctness and completeness of the algorithm, as well as its exponential time complexity. We also perform experimental tests on real-life XML data showing the influence of various input parameters on the execution time and on the number of solutions found. The algorithm's implementation demonstrates polynomial rather than exponential behavior. It has been made public under the GNU LGPL v3 license. As we show in our in-depth discussion of the related work, this is the first full-fledged study of the document-to-schema correction problem.
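The threshold-bounded search described in this abstract can be pictured with a much simpler word-level analogue. The sketch below is not the published tree algorithm: it assumes a single sequence of child labels instead of a tree, a hand-written deterministic automaton instead of a DTD, and unit costs for insertion, deletion and relabelling. The function name `corrections` and its parameters are hypothetical. It enumerates every label sequence accepted by the automaton whose edit distance to the input is at most `th`.

```python
from typing import Dict, Set, Tuple

Word = Tuple[str, ...]
Transitions = Dict[Tuple[int, str], int]  # (state, label) -> next state


def corrections(word: Word, delta: Transitions, start: int,
                accepting: Set[int], th: int) -> Dict[Word, int]:
    """Map each accepted word within distance `th` of `word` to its distance."""
    found: Dict[Word, int] = {}
    seen: Dict[Tuple[int, int, Word], int] = {}  # cheapest cost per search state

    def explore(state: int, i: int, cost: int, out: Word) -> None:
        if cost > th:
            return  # threshold exceeded: abandon this edit sequence
        key = (state, i, out)
        if seen.get(key, th + 1) <= cost:
            return  # already reached at least as cheaply
        seen[key] = cost
        if i == len(word) and state in accepting:
            if cost < found.get(out, th + 1):
                found[out] = cost
        for (src, label), dst in delta.items():
            if src != state:
                continue
            # Insert `label` without consuming input.
            explore(dst, i, cost + 1, out + (label,))
            if i < len(word):
                # Keep the input label (cost 0) or relabel it (cost 1).
                step = 0 if label == word[i] else 1
                explore(dst, i + 1, cost + step, out + (label,))
        if i < len(word):
            # Delete the current input label.
            explore(state, i + 1, cost + 1, out)

    explore(start, 0, 0, ())
    return found
```

With an automaton for the content model "one title followed by one or more authors" (states 0, 1, 2, with 2 accepting), the input `("author",)` and `th=1` give the single correction `("title", "author")` at distance 1. The search terminates because every step either consumes an input label or increases the cost, which is bounded by `th`.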

International Conference on Computational Linguistics | 2013

Coreference annotation schema for an inflectional language

Maciej Ogrodniczuk; Magdalena Zawisławska; Katarzyna Głowińska; Agata Savary

Creating a coreference corpus for an inflectional and free-word-order language is a challenging task due to specific syntactic features largely ignored by existing annotation guidelines, such as the absence of definite/indefinite articles (making quasi-anaphoricity very common), frequent use of zero subjects, or discrepancies between syntactic and semantic heads. This paper comments on the experience gained in the preparation of such a resource for an ongoing project (CORE), aiming at creating tools for coreference resolution. Starting with a clarification of the relation between noun groups and mentions, through the definition of the annotation scope and strategies, up to actual decisions for borderline cases, we present the process of building what is, to the best of our knowledge, the first corpus of general coreference of Polish.

International Journal of Data Mining, Modelling and Management | 2013

Annotation tools for syntax and named entities in the National Corpus of Polish

Jakub Waszczuk; Katarzyna Głowińska; Agata Savary; Adam Przepiórkowski; Michał Lenart

The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.

CCL | 2013

Interesting Linguistic Features in Coreference Annotation of an Inflectional Language

Maciej Ogrodniczuk; Katarzyna Głowińska; Mateusz Kopeć; Agata Savary; Magdalena Zawisławska

This paper reports on linguistic features and decisions that we find vital in the process of annotation and resolution of coreference for highly inflectional languages. The presented results have been collected during preparation of a corpus of general direct nominal coreference of Polish. Starting from the notion of a mention, its borders and potential vs. actual referentiality, we discuss the problem of complete and near-identity, zero subjects and dominant expressions. We also present interesting linguistic cases influencing the coreference resolution such as the difference between semantic and syntactic heads or the phenomenon of coreference chains made of indefinite pronouns.
