Maciej Ogrodniczuk
Polish Academy of Sciences
Publication
Featured research published by Maciej Ogrodniczuk.
language and technology conference | 2013
Maciej Ogrodniczuk; Katarzyna Głowińska; Mateusz Kopeć; Agata Savary; Magdalena Zawisławska
The Polish Coreference Corpus (PCC) is a large corpus of Polish general nominal coreference built upon the National Corpus of Polish. With its 1900 documents from 14 text genres, containing about 540,000 tokens, 180,000 mentions and 128,000 coreference clusters, the PCC is among the largest coreference corpora in the international community. It has some novel features, such as the annotation of the quasi-identity relation, inspired by Recasens’ near-identity, as well as the mark-up of semantic heads and dominant expressions. It shows good inter-annotator agreement and is distributed in three formats under an open license. Its by-products include freely available annotation tools with custom features such as file distribution management and annotation adjudication.
international conference on computational linguistics | 2013
Maciej Ogrodniczuk; Magdalena Zawisławska; Katarzyna Głowińska; Agata Savary
Creating a coreference corpus for an inflectional and free-word-order language is a challenging task due to specific syntactic features largely ignored by existing annotation guidelines, such as the absence of definite/indefinite articles (making quasi-anaphoricity very common), frequent use of zero subjects, or discrepancies between syntactic and semantic heads. This paper comments on the experience gained in preparing such a resource for an ongoing project (CORE), which aims at creating tools for coreference resolution. Starting with a clarification of the relation between noun groups and mentions, through the definition of the annotation scope and strategies, up to actual decisions for borderline cases, we present the process of building what is, to the best of our knowledge, the first corpus of general coreference of Polish.
applications of natural language to data bases | 2013
Maciej Ogrodniczuk; Michał Lenart
This paper presents a new implementation of the multi-purpose set of NLP tools for Polish, made available online in a common web service framework. The tool set comprises a morphological analyzer, a tagger, a named entity recognizer, a dependency parser, a constituency parser and a coreference resolver. Additionally, a web application offering chaining capabilities and a common BRAT-based presentation framework is presented.
CCL | 2013
Maciej Ogrodniczuk; Katarzyna Głowińska; Mateusz Kopeć; Agata Savary; Magdalena Zawisławska
This paper reports on linguistic features and decisions that we find vital in the process of annotation and resolution of coreference for highly inflectional languages. The presented results were collected during the preparation of a corpus of general direct nominal coreference of Polish. Starting from the notion of a mention, its borders and potential vs. actual referentiality, we discuss the problems of complete and near-identity, zero subjects and dominant expressions. We also present interesting linguistic cases influencing coreference resolution, such as the difference between semantic and syntactic heads or the phenomenon of coreference chains made of indefinite pronouns.
Studies in Polish Linguistics | 2016
Renata Bronikowska; Włodzimierz Gruszczyński; Maciej Ogrodniczuk; Marcin Woliński
The History of the 17th and 18th c. Polish Language Laboratory, Institute of Polish Language, Polish Academy of Sciences, is in the process of creating two large databases: The Electronic Dictionary of the 17th−18th c. Polish and The Electronic Corpus of the 17th and 18th c. Polish Texts (up to 1772), the latter in cooperation with the Institute of Computer Science, Polish Academy of Sciences. It is expected that combining these two sets of data will help to achieve the objectives established for both database projects. The present article shows the benefits that the Corpus creators can get from the data gathered in the dictionary, with special emphasis put on the use of grammatical information included in the dictionary entries to design tools for automatic text annotation in the Corpus.
international conference natural language processing | 2014
Maciej Ogrodniczuk; Alicja Wójcicka; Katarzyna Głowińska; Mateusz Kopeć
This paper describes the results of creating a shallow grammar of Polish capable of detecting multi-level nested nominal phrases, intended to be used as mentions in coreference resolution tasks. The work is based on an existing grammar developed for the National Corpus of Polish and is evaluated on the manually annotated Polish Coreference Corpus.
international conference on mining intelligence and knowledge exploration | 2013
Maciej Ogrodniczuk
This paper reports on a preliminary experiment aimed at verifying whether extraction of nominal facts corresponding to world knowledge from both structured and unstructured data can be performed effectively and its results used as a source of pragmatic knowledge for coreference resolution in Polish. Being a proof of concept only, this approach is work in progress and is intended to be further validated in a full-scale project.
intelligent information systems | 2013
Maciej Ogrodniczuk
Creating a coreference resolution tool for a new language is a challenging task due to the substantial effort required to develop the associated linguistic data, regardless of whether the approach is rule-based or statistical. In this paper, we test a translation- and projection-based method for an inflectional language, evaluate the result on a corpus of general coreference and compare the results with state-of-the-art solutions of this type for other languages.
language data and knowledge | 2017
Bartłomiej Nitoń; Maciej Ogrodniczuk
This paper examines the portability of Stanford’s multi-pass rule-based sieve coreference resolution system to an inflectional language (Polish) with a different annotation scheme. The presented system is implemented in BART, a modular toolkit later adapted to the sieve architecture by Baumann et al. The sieves for Polish include processing of zero subjects and an experimental knowledge-intensive sieve using a newly created database of periphrastic expressions. Evaluation shows that the results for Polish are higher than those seen on the CoNLL-2011/2012 data.
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature | 2017
Maciej Ogrodniczuk; Mateusz Kopeć
Language processing architectures are often evaluated in near-perfect conditions with respect to the processed content. Tools which perform sufficiently well on electronic press, books and other types of non-interactive content may poorly handle the littered, colloquial and multilingual textual data which make up the majority of communication today. This paper investigates how Polish Twitter data (in a slightly controlled ‘political’ flavour) differs from what linguistic tools expect and how it could be corrected to be ready for processing by the standard language processing chains available for Polish. The setting includes specialised components for spelling correction of tweets as well as hashtag and username decoding.