Piotr Przybyła
University of Manchester
Publications
Featured research published by Piotr Przybyła.
Database | 2016
Piotr Przybyła; Matthew Shardlow; Sophie Aubin; Robert Bossy; Richard Eckart de Castilho; Stelios Piperidis; John McNaught; Sophia Ananiadou
Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable—those that have the crucial ability to share information, enabling smooth integration and reusability.
Journal of Biomedical Informatics | 2017
Georgios Kontonatsios; Austin J. Brockmeier; Piotr Przybyła; John McNaught; Tingting Mu; John Yannis Goulermas; Sophia Ananiadou
Applications of Natural Language to Data Bases | 2015
Piotr Przybyła
This paper presents an entity recognition (ER) module for RAFAEL, a question answering system for Polish. Two ER techniques are compared: a traditional one, based on named entity categories (e.g. person), and a novel one, Deep Entity Recognition (DeepER), based on WordNet synsets (e.g. impressionist). The latter is made possible by a previously assembled entity library, gathered by analysing encyclopaedia definitions. Evaluation based on over 500 questions answered on the grounds of Wikipedia suggests that the strength of the DeepER approach lies in its ability to tackle questions that demand answers beyond the categories of named entities.
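To make the DeepER idea concrete, here is a minimal sketch (not RAFAEL's actual code; the entity library contents and the helper function are invented) of the matching step: a candidate answer is accepted when one of the synsets stored for it in the entity library equals the question-focus synset or is one of its hyponyms in WordNet.

```python
# Illustrative sketch of DeepER-style matching; library entries are invented.
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# Hypothetical entity library built from encyclopaedia definitions,
# e.g. "Claude Monet was a founder of French Impressionist painting."
ENTITY_LIBRARY = {
    "Claude Monet": [wn.synset("impressionist.n.01")],
    "Stanisław Lem": [wn.synset("writer.n.01")],
}

def matches_focus(entity, focus):
    """True if any library synset of the entity is the focus or its hyponym."""
    for syn in ENTITY_LIBRARY.get(entity, []):
        if syn == focus or focus in syn.closure(lambda s: s.hypernyms()):
            return True
    return False

# Question: "Which impressionist painted the Rouen Cathedral series?"
print(matches_focus("Claude Monet", wn.synset("impressionist.n.01")))   # True
print(matches_focus("Stanisław Lem", wn.synset("impressionist.n.01")))  # False
```

A traditional NER-based check could only confirm that both candidates are persons, which is too coarse for a question asking specifically for an impressionist.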
Journal of Quantitative Linguistics | 2014
Piotr Przybyła; Paweł Teisseyre
In this study we use transcripts of the Sejm (the Polish parliament) to predict a speaker's background: gender, education, party affiliation and birth year. We create learning cases consisting of 100 utterances by the same author and, using rich multi-level annotations of the source corpus, extract a variety of features from them. They are either text-based (e.g. mean sentence length, percentage of long words or frequency of named entities of certain types) or word-based (unigrams and bigrams of surface forms, lemmas and interpretations). Next, we apply general-purpose feature selection, regression and classification algorithms and obtain results well over the baseline (97% accuracy for gender, 95% for education, 76–88% for party). A comparative study shows that random forest and k-nearest-neighbours classifiers usually outperform other methods commonly used in text mining, such as support vector machines and the naïve Bayes classifier. The evaluation experiments help to understand how these solutions deal with such sparse and high-dimensional data and which of the considered traits influence the language the most. We also address difficulties caused by properties of Polish that are typical of other Slavonic languages as well.
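A simplified sketch of this setup is given below (the feature choices, thresholds and hyper-parameters are illustrative, not the paper's): word n-grams and a few hand-crafted text statistics are combined and fed to a random forest predicting one trait, e.g. gender.

```python
# Illustrative profiling pipeline; features and parameters are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

def text_statistics(docs):
    """Mean sentence length (in tokens) and share of long words per document."""
    rows = []
    for doc in docs:
        sentences = [s for s in doc.split(".") if s.strip()]
        tokens = doc.split()
        rows.append([
            len(tokens) / max(len(sentences), 1),                    # mean sentence length
            sum(len(t) > 8 for t in tokens) / max(len(tokens), 1),   # share of long words
        ])
    return np.array(rows)

model = Pipeline([
    ("features", FeatureUnion([
        ("ngrams", CountVectorizer(ngram_range=(1, 2), min_df=2)),
        ("stats", FunctionTransformer(text_statistics, validate=False)),
    ])),
    ("clf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
# model.fit(training_texts, training_labels), where each training text is a
# block of 100 utterances by one speaker and the label is e.g. the gender.
```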
North American Chapter of the Association for Computational Linguistics | 2016
Piotr Przybyła; Nhung T. H. Nguyen; Matthew Shardlow; Georgios Kontonatsios; Sophia Ananiadou
We present a description of the system submitted to the Semantic Textual Similarity (STS) shared task at SemEval 2016. The task is to assess the degree to which two sentences carry the same meaning. We have designed two different methods to automatically compute a similarity score between sentences. The first method combines a variety of semantic similarity measures as features in a machine learning model. In our second approach, we employ training data from the Interpretable Similarity subtask to create a combined word-similarity measure and assess the importance of both aligned and unaligned words. Finally, we combine the two methods into a single hybrid model. Our best-performing run attains a score of 0.7732 on the 2015 STS evaluation data and 0.7488 on the 2016 STS evaluation data.
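The first method can be illustrated with a minimal sketch (the two similarity measures and the regressor below are placeholders, not the submitted system): each measure becomes one feature of a regression model trained on gold similarity scores.

```python
# Illustrative feature-combination approach to STS; measures are placeholders.
from sklearn.ensemble import RandomForestRegressor

def jaccard(s1, s2):
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / max(len(a | b), 1)

def length_ratio(s1, s2):
    return min(len(s1), len(s2)) / max(len(s1), len(s2), 1)

def features(pairs):
    # Each similarity measure contributes one feature column.
    return [[jaccard(a, b), length_ratio(a, b)] for a, b in pairs]

train_pairs = [("A man is playing a guitar.", "A man plays the guitar."),
               ("A dog runs in the park.", "Stock markets fell sharply.")]
train_scores = [4.8, 0.0]  # gold 0-5 STS scores, invented for illustration

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(features(train_pairs), train_scores)
print(model.predict(features([("A girl is singing.", "A woman sings.")])))
```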
Intelligent Information Systems | 2013
Piotr Przybyła
This paper deals with the problem of question type classification for Polish Question Answering (QA). The goal of this task is to determine both the general type and the class of the entity expected as an answer. Three approaches, namely pattern matching, WordNet-aided focus analysis and machine learning, are presented and evaluated using a test set of 1137 manually classified questions from a Polish quiz TV show. Quantitative results, supported with an analysis of error sources, help to identify possible improvements.
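The pattern-matching approach can be sketched as follows (the patterns and class labels are invented for illustration, not taken from the paper): regular expressions over the question string assign a general type and, where applicable, an expected entity class.

```python
# Illustrative question-type classifier based on pattern matching.
import re

PATTERNS = [
    (r"^(kto|who)\b", ("NAMED_ENTITY", "PERSON")),
    (r"^(kiedy|when)\b", ("NAMED_ENTITY", "DATE")),
    (r"^(gdzie|where)\b", ("NAMED_ENTITY", "PLACE")),
    (r"^(ile|how many|how much)\b", ("NAMED_ENTITY", "QUANTITY")),
    (r"^(czy|is|are|does)\b", ("VERIFICATION", None)),
]

def classify_question(question):
    q = question.strip().lower()
    for pattern, label in PATTERNS:
        if re.match(pattern, q):
            return label
    return ("UNKNOWN", None)

print(classify_question("Kto napisał Lalkę?"))              # ('NAMED_ENTITY', 'PERSON')
print(classify_question("Czy Wisła płynie przez Kraków?"))  # ('VERIFICATION', None)
```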
bioRxiv | 2018
Alexandra Bannach-Brown; Piotr Przybyła; James Thomas; Andrew S.C. Rice; Sophia Ananiadou; Jing Liao; Malcolm R. Macleod
Background: Here we outline a method of applying existing machine learning (ML) approaches to aid citation screening in an ongoing broad and shallow systematic review of preclinical animal studies, with the aim of achieving a high-performing algorithm comparable to human screening. Methods: We applied ML approaches to a broad systematic review of animal models of depression at the citation screening stage. We tested two independently developed ML approaches which used different classification models and feature sets. We recorded the performance of the ML approaches on an unseen validation set of papers using sensitivity, specificity and accuracy. We aimed to achieve 95% sensitivity and to maximise specificity. The classification model providing the most accurate predictions was applied to the remaining unseen records in the dataset and will be used in the next stage of the preclinical biomedical sciences systematic review. We used a cross-validation technique to assign ML inclusion likelihood scores to the human-screened records, to identify potential errors made during the human screening process (error analysis). Results: ML approaches reached 98.7% sensitivity based on learning from a training set of 5749 records, with an inclusion prevalence of 13.2%. The highest level of specificity reached was 86%. Performance was assessed on an independent validation dataset. Human errors in the training and validation sets were successfully identified using the inclusion likelihood assigned by the ML model to highlight discrepancies. Training the ML algorithm on the corrected dataset improved the specificity of the algorithm without compromising sensitivity. Error-analysis correction led to a 3% improvement in sensitivity and specificity, which increases the precision and accuracy of the ML algorithm. Conclusions: This work has confirmed the performance and application of ML algorithms for screening in systematic reviews of preclinical animal studies. It has highlighted the novel use of ML algorithms to identify human error. This needs to be confirmed in other reviews, but represents a promising approach to integrating human decisions and automation in systematic review methodology.
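The error-analysis step can be sketched as follows (the feature extraction and classifier are placeholders, not the systems used in the review): out-of-fold inclusion probabilities are assigned to every human-screened record, and records whose score strongly disagrees with the human decision are flagged for re-checking.

```python
# Illustrative error analysis via cross-validated inclusion likelihoods.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

def flag_potential_errors(abstracts, human_labels, threshold=0.9):
    """Return indices of records whose ML score contradicts the human decision."""
    model = make_pipeline(TfidfVectorizer(min_df=2),
                          LogisticRegression(max_iter=1000))
    # Out-of-fold probability of inclusion for every human-screened record.
    proba = cross_val_predict(model, abstracts, human_labels,
                              cv=5, method="predict_proba")[:, 1]
    labels = np.asarray(human_labels)
    # Confidently included-by-ML but excluded-by-human, and vice versa.
    suspicious = ((labels == 0) & (proba > threshold)) | \
                 ((labels == 1) & (proba < 1 - threshold))
    return np.where(suspicious)[0], proba
```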
Research Synthesis Methods | 2018
Piotr Przybyła; Austin J. Brockmeier; Georgios Kontonatsios; Marie-Annick Le Pogam; John McNaught; Erik von Elm; Kay Nolan; Sophia Ananiadou
Screening references is a time-consuming step necessary for systematic reviews and guideline development. Previous studies have shown that human effort can be reduced by using machine learning software to prioritise large reference collections such that most of the relevant references are identified before screening is completed. We describe and evaluate RobotAnalyst, a Web-based software system that combines text-mining and machine learning algorithms for organising references by their content and actively prioritising them based on a relevancy classification model trained and updated throughout the process. We report an evaluation over 22 reference collections (most are related to public health topics) screened using RobotAnalyst with a total of 43 610 abstract-level decisions. The number of references that needed to be screened to identify 95% of the abstract-level inclusions for the evidence review was reduced on 19 of the 22 collections. Significant gains over random sampling were achieved for all reviews conducted with active prioritisation, as compared with only two of five when prioritisation was not used. RobotAnalyst's descriptive clustering and topic modelling functionalities were also evaluated by public health analysts. Descriptive clustering provided more coherent organisation than topic modelling, and the content of the clusters was apparent to the users across a varying number of clusters. This is the first large-scale study using technology-assisted screening to perform new reviews, and the positive results provide empirical evidence that RobotAnalyst can accelerate the identification of relevant studies. The results also highlight the issue of user complacency and the need for a stopping criterion to realise the work savings.
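The active-prioritisation loop can be sketched as follows (a schematic illustration, not RobotAnalyst's implementation; the classifier, features and batch size are arbitrary choices): after each batch of human decisions the relevancy model is retrained and the remaining references are re-ranked so that likely inclusions surface first.

```python
# Schematic active prioritisation of references for screening.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def prioritised_screening(abstracts, screen, seed_idx, batch_size=50):
    """`screen(i)` returns the human include (1) / exclude (0) decision for record i.
    `seed_idx` must contain at least one included and one excluded record."""
    X = TfidfVectorizer(min_df=2).fit_transform(abstracts)
    labelled = {i: screen(i) for i in seed_idx}
    while len(labelled) < len(abstracts):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[list(labelled)], [labelled[i] for i in labelled])
        remaining = [i for i in range(len(abstracts)) if i not in labelled]
        scores = clf.predict_proba(X[remaining])[:, 1]
        # Screen the highest-scoring unscreened references next, then retrain.
        ranked = [i for _, i in sorted(zip(scores, remaining), reverse=True)]
        for i in ranked[:batch_size]:
            labelled[i] = screen(i)
    return labelled
```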
European Physical Journal C | 2018
M. Maćkowiak-Pawłowska; Piotr Przybyła
Incomplete particle identification limits the experimentally available phase-space region for identified-particle analyses. This problem affects ongoing fluctuation and correlation studies, including the search for the critical point of strongly interacting matter performed at the SPS and RHIC accelerators. In this paper we provide a procedure to obtain nth-order moments of the multiplicity distribution using the identity method, generalising previously published solutions for
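As a hedged reminder of the quantities involved (standard identity-method definitions from the literature, not this paper's derivation): for each track with identification variable m (e.g. dE/dx), an identity w_j(m) is computed per particle type j and summed over the event; the measurable moments of the W_j are then related to the true multiplicity moments by a linear system that the method inverts.

```latex
% Standard identity-method quantities (hedged sketch, not this paper's result):
\[
  w_j(m) = \frac{\rho_j(m)}{\rho(m)}, \qquad
  \rho(m) = \sum_k \rho_k(m), \qquad
  W_j = \sum_{i=1}^{N} w_j(m_i),
\]
% where \rho_j is the inclusive distribution of m for species j and the sum runs
% over the N tracks of an event; the measurable moments
% \langle W_{j_1}\cdots W_{j_n}\rangle are expressed in terms of the true
% multiplicity moments \langle N_{j_1}\cdots N_{j_n}\rangle and the relation inverted.
```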
Bioinformatics | 2018
Axel Soto; Piotr Przybyła; Sophia Ananiadou