Agnieszka Wołk
Military University of Technology in Warsaw
Publications
Featured research published by Agnieszka Wołk.
MISSI | 2017
Agnieszka Wołk; Krzysztof Wołk; Krzysztof Marasek
The multilingual nature of the world makes translation a crucial requirement today. In this research, we apply state-of-the-art statistical machine translation techniques to the West Slavic language group. We classify the West Slavic languages and choose Polish as a representative candidate for our research. The experiments are conducted on written and spoken texts, whose characteristics we also define. The machine translation systems are trained both within the West Slavic group and into English. Translation systems and data sets are analyzed, prepared, and adapted for the needs of West Slavic translation. To evaluate the effects of different preparations on translation results, we conducted experiments using the BLEU, NIST, and TER metrics. By defining translation parameters suited to morphologically rich languages, we improve translation quality and draw conclusions.
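For readers unfamiliar with the metrics named above, the following is a minimal sketch (not the authors' code) of corpus-level BLEU and NIST scoring using NLTK; the sentences are invented and whitespace-tokenized purely for illustration.

```python
# Minimal sketch of corpus-level BLEU and NIST scoring with NLTK.
from nltk.translate.bleu_score import corpus_bleu
from nltk.translate.nist_score import corpus_nist

# Hypothetical system outputs and their reference translations (tokenized).
hypotheses = ["the systems are trained on written and spoken texts".split(),
              "translation quality improves with proper parameters".split()]
references = [["the systems are trained on written and spoken texts".split()],
              ["translation quality improves with well chosen parameters".split()]]

print("BLEU:", corpus_bleu(references, hypotheses))
print("NIST:", corpus_nist(references, hypotheses, n=4))
```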
Federated Conference on Computer Science and Information Systems | 2016
Krzysztof Wołk; Krzysztof Marasek; Agnieszka Wołk
In the contemporary world, translation has become a critical need. Parallel dictionaries are now among the most accessible resources, but they have their limits: they do not offer a good-quality translation function because of neologisms and out-of-vocabulary words. To overcome this problem, the use of statistical translation systems is becoming more and more important, as is maintaining the quality and quantity of their training data. Due to their limitations, however, such systems are available only for a few languages and very narrow text domains. The purpose of this research is to speed up calculation through GPU acceleration, to introduce tuning scripts, and to enhance and improve contemporary comparable corpora mining methodologies by re-implementing analogous algorithms around the Needleman-Wunsch algorithm. Experiments were conducted on multilingual data extracted from numerous Wikipedia domains, for which multiple cross-lingual comparisons were established. These changes and adaptations had a positive impact on both the quantity and quality of the mined data. The solution is language independent and highly practical, especially for under-resourced languages.
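As an illustration of the Needleman-Wunsch algorithm the mining pipeline is rebuilt around, here is a compact global-alignment sketch over token sequences; the scoring values are assumptions for demonstration, not the parameters used in the paper.

```python
# Illustrative Needleman-Wunsch global alignment over token sequences.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # DP table: score[i][j] = best score aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]

# Example: aligning token sequences of two candidate parallel sentences.
print(needleman_wunsch("a b c d".split(), "a c d".split()))  # -> 2
```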
World Conference on Information Systems and Technologies | 2018
Krzysztof Wołk; Emilia Zawadzka; Agnieszka Wołk
Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems and other text processing tasks requiring bilingual data. In this study, we propose a language-independent bi-sentence filtering approach based on Polish-to-English translation. This approach was developed on a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation tasks. Minimization of data loss was ensured by parameter adaptation. An improvement in MT system scores obtained with text processed by our tool is discussed, and in-domain data adaptation results are presented. We also discuss measures to improve performance, such as bootstrapping and comparison model pruning. The results show significant improvement in filtering in terms of MT quality.
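The exact statistical measures are not listed in the abstract, so the sketch below uses two common stand-ins, a length-ratio test and a bilingual-lexicon overlap test, to show the general shape of such a bi-sentence filter; both the measures and the thresholds are illustrative assumptions, not the authors' tool.

```python
# Hedged sketch of bi-sentence filtering: keep a (source, target) pair
# only if simple statistical measures pass thresholds.
def keep_pair(src, tgt, lexicon, min_ratio=0.6, min_overlap=0.3):
    s, t = src.split(), tgt.split()
    # Length-ratio test: parallel sentences have comparable lengths.
    ratio = min(len(s), len(t)) / max(len(s), len(t))
    if ratio < min_ratio:
        return False
    # Lexical test: fraction of source words with a known translation
    # appearing in the target sentence (lexicon: src word -> tgt words).
    hits = sum(1 for w in s if lexicon.get(w, set()) & set(t))
    return hits / len(s) >= min_overlap

# Tiny hypothetical Polish-English lexicon for demonstration.
lexicon = {"ala": {"ala"}, "ma": {"has"}, "kota": {"cat"}}
print(keep_pair("ala ma kota", "ala has a cat", lexicon))  # -> True
```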
World Conference on Information Systems and Technologies | 2018
Krzysztof Wołk; Agnieszka Wołk
Several natural languages have undergone a great deal of processing, but the problem of limited textual linguistic resources remains. The manual creation of parallel corpora by humans is rather expensive and time consuming, while the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to be used to initiate the research process. On the other hand, applying known approaches to build parallel resources from multiple sources, such as comparable or quasi-comparable corpora, is very complicated and provides rather noisy output, which later needs to be further processed and requires in-domain adaptation. To optimize the performance of comparable corpora mining algorithms, it is essential to use a quality parallel corpus for training of a good data classifier. In this research, we have developed a methodology for generating an accurate parallel corpus (Czech-English) from monolingual resources by calculating the compatibility between the results of three machine translation systems. We have created translations of large, single-language resources by applying multiple translation systems and strictly measuring translation compatibility using rules based on the Levenshtein distance. The results produced by this approach were very favorable. The generated corpora successfully improved the quality of SMT systems and seem to be useful for many other natural language processing tasks.
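A minimal sketch of the core idea, under the assumption that agreement is checked pairwise: the same monolingual sentence is translated by several MT systems, and it is accepted for the synthetic parallel corpus only when the outputs stay close under normalized Levenshtein distance. The threshold here is invented, not the paper's rule.

```python
# Character-level Levenshtein distance (two-row dynamic programming).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def compatible(outputs, max_norm_dist=0.3):
    # Accept only if every pair of system outputs is close enough.
    for i in range(len(outputs)):
        for j in range(i + 1, len(outputs)):
            a, b = outputs[i], outputs[j]
            if levenshtein(a, b) / max(len(a), len(b)) > max_norm_dist:
                return False
    return True

# Hypothetical outputs of three MT systems for one Czech sentence.
outs = ["the cat sat on the mat", "the cat sat on a mat", "a cat sat on the mat"]
print(compatible(outs))  # -> True: the outputs agree closely
```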
World Conference on Information Systems and Technologies | 2017
Krzysztof Wołk; Agnieszka Wołk
Several natural languages have received a great deal of processing, but the problem of limited linguistic resources remains. Manual creation of parallel corpora by humans is rather expensive and very time consuming. In addition, the language data required for statistical machine translation (SMT) do not exist in adequate quantities for their statistical information to initiate the research process. On the other hand, applying approaches that build parallel resources from sources such as comparable or quasi-comparable corpora is very complicated and provides rather noisy output, which later needs to be reprocessed and adapted in-domain. To optimize the performance of these algorithms, it is essential to use a quality parallel corpus for training the end-to-end procedure. In the present research, we have developed a methodology to generate an accurate parallel corpus from monolingual resources by calculating the compatibility between the results of machine translation systems. We created translations of large, single-language resources by applying multiple translation systems and strictly measuring translation compatibility with rules based on the Levenshtein distance. The results produced by this approach are very favorable. All the monolingual resources were taken from the WMT16 conference for Czech to generate the parallel corpus, which improved translation performance.
World Conference on Information Systems and Technologies | 2017
Krzysztof Wołk; Agnieszka Wołk
It has become essential to have precise translations of texts from different parts of the world, but it is often difficult to fill the translation gaps as quickly as might be needed. Undoubtedly, there are multiple dictionaries that can help in this regard, and various online translators exist to help cross this lingual bridge in many cases, but even these resources can fall short of serving their true purpose. The translators can provide a very accurate meaning of given words in a phrase, but they often miss the true essence of the language. The research presented here describes a method that can help close this lingual gap by extending certain aspects of the alignment task for WMT16. We achieve this goal by utilizing different classifiers and algorithms and by using advanced computation. We carried out various experiments that allowed us to extract parallel data at the sentence level, and these data proved capable of improving overall machine translation quality.
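The abstract mentions classifiers without naming the features, so the following sketch shows one plausible shape of the approach: a binary classifier scoring candidate sentence pairs on simple surface features. The features, the logistic-regression model, and the tiny training set are all assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch of classifier-based parallel-sentence extraction.
from sklearn.linear_model import LogisticRegression

def features(src, tgt):
    s, t = src.split(), tgt.split()
    ratio = min(len(s), len(t)) / max(len(s), len(t))
    # Tokens identical on both sides (numbers, names) are strong clues.
    shared = len(set(s) & set(t)) / max(len(s), len(t))
    return [ratio, shared]

# Tiny hypothetical training set: 1 = parallel, 0 = not parallel.
X = [features("ala ma 3 koty", "ala has 3 cats"),
     features("ala ma 3 koty", "the weather was fine in 1998")]
y = [1, 0]
clf = LogisticRegression().fit(X, y)

# Score a new candidate pair; higher second column = more likely parallel.
print(clf.predict_proba([features("on ma 7 psów", "he has 7 dogs")]))
```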
Federated Conference on Computer Science and Information Systems | 2017
Krzysztof Wołk; Agnieszka Wołk; Krzysztof Marasek
This study aimed to aid the enormous effort required to analyze phraseological writing competence by developing an automatic evaluation tool for texts. We attempted to measure both second language (L2) writing proficiency and text quality. In our research, we adapted the CollGram technique, which searches a reference corpus to determine the frequency of each pair of tokens (bi-grams) and calculates the t-score and related information. We used the Level 3 Corpus of Contemporary American English as a reference corpus. Our solution performed well in writing evaluation and is freely available as a web service or as source code for other researchers.
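A short worked example of the bigram t-score that CollGram computes: the observed bigram frequency is compared with the frequency expected if the two words occurred independently. The counts below are invented for illustration.

```python
# Bigram t-score: (observed - expected) / sqrt(observed), where the
# expected count assumes the two words are independent.
from math import sqrt

def t_score(bigram_count, w1_count, w2_count, corpus_size):
    expected = w1_count * w2_count / corpus_size
    return (bigram_count - expected) / sqrt(bigram_count)

# Hypothetical reference-corpus counts for the bigram "strong tea".
print(t_score(bigram_count=150, w1_count=8000, w2_count=3000,
              corpus_size=10_000_000))  # -> approx. 12.05
```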
Federated Conference on Computer Science and Information Systems | 2017
Krzysztof Wołk; Agnieszka Wołk; Krzysztof Marasek
We provide 5-gram language models of contemporary Polish trained on big data, based on the Common Crawl corpus (a compilation of more than 9,000,000,000 web pages) and other resources. We show that our model outperforms the Google WEB1T n-gram counts, yielding better quality in terms of both perplexity and machine translation. The model includes low-count entries and applies de-duplication to reduce boilerplate. We also provide a POS-tagged version of the raw corpus, the raw corpus itself, and a dictionary of contemporary Polish. The big-data language models were built with Kneser-Ney smoothing in the SRILM toolkit, retaining singletons. We detail exactly how the corpus was obtained and pre-processed, with emphasis on the issues that surface when working with data at this scale. Finally, we present the improvements in MT BLEU scores and perplexity values obtained with our model.
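For reference, perplexity, the metric by which the models are compared, is the exponential of the negative mean log-probability per token. The sketch below computes it from hypothetical per-token probabilities; the numbers are invented for illustration.

```python
# Perplexity = exp(-mean log-probability per token).
from math import exp, log

def perplexity(token_log_probs):
    # token_log_probs: natural-log probability of each token under the LM.
    return exp(-sum(token_log_probs) / len(token_log_probs))

probs = [0.2, 0.1, 0.05, 0.3]  # hypothetical per-token probabilities
print(perplexity([log(p) for p in probs]))  # -> approx. 7.60
```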
Computational and Mathematical Methods in Medicine | 2017
Krzysztof Wołk; Agnieszka Wołk; Wojciech Glinkowski
People with speech, hearing, or mental impairment require special communication assistance, especially for medical purposes. Automatic solutions for speech recognition and voice synthesis from text are poor fits for communication in the medical domain because they depend on error-prone statistical models. Systems dependent on manual text input are insufficient. Recently introduced systems for automatic sign language recognition depend on statistical models as well as on image and gesture quality. Such systems remain in early development and are based mostly on minimal hand gestures unsuitable for medical purposes. Furthermore, solutions that rely on the Internet cannot be used after disasters that require humanitarian aid. We propose a high-speed, intuitive, Internet-free, voice-free, and text-free tool suited for emergency medical communication. Our solution is a pictogram-based application that provides easy communication for individuals who have speech or hearing impairment or mental health issues that impair communication, as well as for foreigners who do not speak the local language. It provides support and clarification in communication by using intuitive icons and interactive symbols that are easy to use on a mobile device. Such pictogram-based communication can be quite effective and ultimately make people's lives happier, easier, and safer.
Procedia Computer Science | 2017
Krzysztof Wołk; Agnieszka Wołk; Krzysztof Marasek; Wojciech Glinkowski