Agnieszka Wołk
Military University of Technology in Warsaw
Publications
Featured research published by Agnieszka Wołk.
MISSI | 2017
Agnieszka Wołk; Krzysztof Wołk; Krzysztof Marasek
The multilingual nature of the world makes translation a crucial requirement today. In this research, we apply state-of-the-art statistical machine translation techniques to the West Slavic language group. We classify the West Slavic languages and choose Polish as a representative candidate for our research. The experiments are conducted on written and spoken texts, whose characteristics we also define. The machine translation systems are trained both within the West Slavic group and into English. Translation systems and data sets are analyzed, prepared, and adapted for the needs of West Slavic translation. To evaluate the effects of different preparations on translation results, we conducted experiments using the BLEU, NIST, and TER metrics. By defining translation parameters suited to morphologically rich languages, we improve translation quality and draw conclusions.
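For readers unfamiliar with the metrics named above, the following is a minimal sketch (not the authors' code) of corpus-level BLEU and NIST scoring using NLTK; the sentences are invented and whitespace-tokenized purely for illustration.

```python
# Minimal sketch of corpus-level BLEU and NIST scoring with NLTK.
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.nist_score import corpus_nist

# Hypothetical system outputs and their reference translations (tokenized).
hypotheses = ["the systems are trained on written and spoken texts".split(),
              "translation quality improves with proper parameters".split()]
references = [["the systems are trained on written and spoken texts".split()],
              ["translation quality improves with well chosen parameters".split()]]

print("BLEU:", corpus_bleu(references, hypotheses))
print("NIST:", corpus_nist(references, hypotheses, n=4))
```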
Federated Conference on Computer Science and Information Systems | 2016
Krzysztof Wołk; Krzysztof Marasek; Agnieszka Wołk
In the contemporary world, translation has become a critical need. Parallel dictionaries are now among the most accessible resources, but they have their limits: they do not offer a good-quality translation function because of neologisms and out-of-vocabulary words. To overcome this problem, the use of statistical translation systems is becoming more and more important, as is maintaining the quality and quantity of their training data. Due to their limitations, however, such systems are available only for a few languages and very narrow text domains. The purpose of this research is to speed up calculation through GPU acceleration, to introduce tuning scripts, and to enhance and improve contemporary comparable corpora mining methodologies by re-implementing analogous algorithms around the Needleman-Wunsch algorithm. Experiments were conducted on multilingual data extracted from numerous Wikipedia domains, for which multiple cross-lingual comparisons were established. These changes and adaptations had a positive impact on both the quantity and quality of the mined data. The solution is language independent and highly practical, especially for under-resourced languages.
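As an illustration of the Needleman-Wunsch algorithm the mining pipeline is rebuilt around, here is a compact global-alignment sketch over token sequences; the scoring values are assumptions for demonstration, not the parameters used in the paper.

```python
# Illustrative Needleman-Wunsch global alignment over token sequences.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # DP table: score[i][j] = best score aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

# Example: aligning token sequences of two candidate parallel sentences.
print(needleman_wunsch("a b c d".split(), "a c d".split()))  # -> 2
```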
World Conference on Information Systems and Technologies | 2018
Krzysztof Wołk; Emilia Zawadzka; Agnieszka Wołk
Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems and other text processing tasks requiring bilingual data. In this study, we propose a language-independent bi-sentence filtering approach based on Polish-to-English translation. This approach was developed on a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation tasks. Minimization of data loss was ensured by parameter adaptation. An improvement in MT system scores obtained with text processed by our tool is discussed, and in-domain data adaptation results are presented. We also discuss measures to improve performance, such as bootstrapping and comparison model pruning. The results show significant improvement in filtering in terms of MT quality.
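The exact statistical measures are not listed in the abstract, so the sketch below uses two common stand-ins, a length-ratio test and a bilingual-lexicon overlap test, to show the general shape of such a bi-sentence filter; both the measures and the thresholds are illustrative assumptions, not the authors' tool.

```python
# Hedged sketch of bi-sentence filtering: keep a (source, target) pair
# only if simple statistical measures pass thresholds.
def keep_pair(src, tgt, lexicon, min_ratio=0.6, min_overlap=0.3):
    s, t = src.split(), tgt.split()
    # Length-ratio test: parallel sentences have comparable lengths.
    ratio = min(len(s), len(t)) / max(len(s), len(t))
    if ratio < min_ratio:
        return False
    # Lexical test: fraction of source words with a known translation
    # appearing in the target sentence (lexicon: src word -> tgt words).
    hits = sum(1 for w in s if lexicon.get(w, set()) & set(t))
    return hits / len(s) >= min_overlap

# Tiny hypothetical Polish-English lexicon for demonstration.
lexicon = {"ala": {"ala"}, "ma": {"has"}, "kota": {"cat"}}
print(keep_pair("ala ma kota", "ala has a cat", lexicon))  # -> True
```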
World Conference on Information Systems and Technologies | 2018
Krzysztof Wołk; Agnieszka Wołk
Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. The manual creation of parallel corpora by humans is rather expensive and time consuming, while the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to be used to initiate the research process. On the other hand, applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is very complicated and provides rather noisy output, which later needs to be further processed and requires in-domain adaptation. To optimize the performance of comparable corpora mining algorithms, it is essential to use a quality parallel corpus for training of a good data classifier. In this research, we have developed a methodology for generating an accurate parallel corpus (Czech-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We have created translations of large, single-language resources by applying multiple translation systems and strictly measuring translation compatibility using rules based on the Levenshtein distance. The results produced by this approach were very favorable. The generated corpora successfully improved the quality of SMT systems and seem to be useful for many other natural language processing tasks.
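A minimal sketch of the core idea, under the assumption that agreement is checked pairwise: the same monolingual sentence is translated by several MT systems, and it is accepted for the synthetic parallel corpus only when the outputs stay close under normalized Levenshtein distance. The threshold here is invented, not the paper's rule.

```python
# Character-level Levenshtein distance (two-row dynamic programming).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def compatible(outputs, max_norm_dist=0.3):
    # Accept only if every pair of system outputs is close enough.
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            a, b = outputs[i], outputs[j]
            if levenshtein(a, b) / max(len(a), len(b)) > max_norm_dist:
                return False
    return True

# Hypothetical outputs of three MT systems for one Czech sentence.
outs = ["the cat sat on the mat", "the cat sat on a mat", "a cat sat on the mat"]
print(compatible(outs))  # -> True: the outputs agree closely
```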
World Conference on Information Systems and Technologies | 2017
Krzysztof Wołk; Agnieszka Wołk
Several natural languages have received a great deal of processing, but the problem of limited linguistic resources remains. Manual creation of parallel corpora by humans is rather expensive and very time consuming. In addition, the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to initiate the research process. On the other hand, applying approaches that build parallel resources from sources such as comparable or quasi-comparable corpora is very complicated and provides rather noisy output, which later needs to be reprocessed and adapted in-domain. To optimize the performance of these algorithms, it is essential to use a quality parallel corpus for training the end-to-end procedure. In the present research, we have developed a methodology to generate an accurate parallel corpus from monolingual resources by calculating the compatibility between the results of machine translation systems. We created translations of large, single-language resources by applying multiple translation systems and strictly measuring translation compatibility with rules based on the Levenshtein distance. The results produced by this approach are very favorable. All the monolingual resources were taken from the WMT16 conference for Czech to generate the parallel corpus, which improved translation performance.
World Conference on Information Systems and Technologies | 2017
Krzysztof Wołk; Agnieszka Wołk
It has become essential to have precise translations of texts from different parts of the world, but it is often difficult to fill the translation gaps as quickly as might be needed. Undoubtedly, there are multiple dictionaries that can help in this regard, and various online translators exist to help cross this lingual bridge in many cases, but even these resources can fall short of serving their true purpose. The translators can provide a very accurate meaning of given words in a phrase, but they often miss the true essence of the language. The research presented here describes a method that can help close this lingual gap by extending certain aspects of the alignment task for WMT16. We achieve this goal by utilizing different classifiers and algorithms and by using advanced computation. We carried out various experiments that allowed us to extract parallel data at the sentence level, and these data proved capable of improving overall machine translation quality.
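The abstract mentions classifiers without naming the features, so the following sketch shows one plausible shape of the approach: a binary classifier scoring candidate sentence pairs on simple surface features. The features, the logistic-regression model, and the tiny training set are all assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of classifier-based parallel-sentence extraction.
from sklearn.linear_model import LogisticRegression

def features(src, tgt):
    s, t = src.split(), tgt.split()
    ratio = min(len(s), len(t)) / max(len(s), len(t))
    # Tokens identical on both sides (numbers, names) are strong clues.
    shared = len(set(s) & set(t)) / max(len(s), len(t))
    return [ratio, shared]

# Tiny hypothetical training set: 1 = parallel, 0 = not parallel.
X = [features("ala ma 3 koty", "ala has 3 cats"),
     features("ala ma 3 koty", "the weather was fine in 1998")]
y = [1, 0]
clf = LogisticRegression().fit(X, y)

# Score a new candidate pair; higher second column = more likely parallel.
print(clf.predict_proba([features("on ma 7 psów", "he has 7 dogs")]))
```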
Federated Conference on Computer Science and Information Systems | 2017
Krzysztof Wołk; Agnieszka Wołk; Krzysztof Marasek
This study aimed to aid the enormous effort required to analyze phraseological writing competence by developing an automatic evaluation tool for texts. We attempted to measure both second language (L2) writing proficiency and text quality. In our research, we adapted the CollGram technique, which searches a reference corpus to determine the frequency of each pair of tokens (bi-grams) and calculates the t-score and related information. We used the Level 3 Corpus of Contemporary American English as a reference corpus. Our solution performed well in writing evaluation and is freely available as a web service or as source code for other researchers.
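A short worked example of the bigram t-score that CollGram computes: the observed bigram frequency is compared with the frequency expected if the two words occurred independently. The counts below are invented for illustration.

```python
# Bigram t-score: (observed - expected) / sqrt(observed), where the
# expected count assumes the two words are independent.
from math import sqrt

def t_score(bigram_count, w1_count, w2_count, corpus_size):
    expected = w1_count * w2_count / corpus_size
    return (bigram_count - expected) / sqrt(bigram_count)

# Hypothetical reference-corpus counts for the bigram "strong tea".
print(t_score(bigram_count=150, w1_count=8000, w2_count=3000,
              corpus_size=10_000_000))  # -> approx. 12.05
```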
Federated Conference on Computer Science and Information Systems | 2017
Krzysztof Wołk; Agnieszka Wołk; Krzysztof Marasek
We provide 5-gram language models of contemporary Polish trained on big data, based on the Common Crawl corpus (a compilation of more than 9,000,000,000 web pages) and other resources. We show that our model outperforms the Google WEB1T n-gram counts, yielding better quality in terms of both perplexity and machine translation. The model includes low-count entries and applies de-duplication to reduce boilerplate. We also provide a POS-tagged version of the raw corpus, the raw corpus itself, and a dictionary of contemporary Polish. The big-data language models were built with Kneser-Ney smoothing in the SRILM toolkit, retaining singletons. We detail exactly how the corpus was obtained and pre-processed, with emphasis on the issues that surface when working with data at this scale. Finally, we present the improvements in MT BLEU scores and perplexity values obtained with our model.
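For reference, perplexity, the metric by which the models are compared, is the exponential of the negative mean log-probability per token. The sketch below computes it from hypothetical per-token probabilities; the numbers are invented for illustration.

```python
# Perplexity = exp(-mean log-probability per token).
from math import exp, log

def perplexity(token_log_probs):
    # token_log_probs: natural-log probability of each token under the LM.
    return exp(-sum(token_log_probs) / len(token_log_probs))

probs = [0.2, 0.1, 0.05, 0.3]  # hypothetical per-token probabilities
print(perplexity([log(p) for p in probs]))  # -> approx. 7.60
```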
Computational and Mathematical Methods in Medicine | 2017
Krzysztof Wołk; Agnieszka Wołk; Wojciech Glinkowski
People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they depend on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition depend on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as for foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer.
Procedia Computer Science | 2017
Krzysztof Wołk; Agnieszka Wołk; Krzysztof Marasek; Wojciech Glinkowski