Julian Zubek | Researchain

Publication

Featured research published by Julian Zubek.

Molecular BioSystems | 2014

Ensemble learning prediction of protein–protein interactions using proteins functional annotations

Indrajit Saha; Julian Zubek; Tomas Klingström; Simon K. G. Forsberg; Johan Wikander; Marcin Kierczak; Ujjwal Maulik; Dariusz Plewczynski

Protein-protein interactions are important for the majority of biological processes. A significant number of computational methods have been developed to predict protein-protein interactions using protein sequence, structural and genomic data. Vast experimental data is publicly available on the Internet, but it is scattered across numerous databases. This fact motivated us to create and evaluate new high-throughput datasets of interacting proteins. We extracted interaction data from the DIP, MINT, BioGRID and IntAct databases. Then we constructed descriptive features for machine learning purposes based on data from Gene Ontology and DOMINE. Thereafter, four well-established machine learning methods (Support Vector Machine, Random Forest, Decision Tree and Naïve Bayes) were used on these datasets to build an Ensemble Learning method based on majority voting. In a cross-validation experiment, sensitivity exceeded 80% and classification/prediction accuracy reached 90% for the Ensemble Learning method. We extended the experiment to a bigger and more realistic dataset, maintaining sensitivity over 70%. These results confirmed that our datasets are suitable for performing PPI prediction and that the Ensemble Learning method is well suited for this task. Both the processed PPI datasets and the software are available at .
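
The majority-voting step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy prediction lists, and the tie-breaking rule (an even split among four classifiers falls back to the negative class) are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(votes, tie_label=0):
    """Return the most common label; fall back to tie_label on a tie (assumed rule)."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


def ensemble_predict(per_classifier_predictions, tie_label=0):
    """Combine equal-length prediction lists (one per classifier), sample by sample."""
    return [majority_vote(sample_votes, tie_label)
            for sample_votes in zip(*per_classifier_predictions)]


def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN), with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy outputs of four hypothetical base classifiers (1 = interacting pair).
svm = [1, 0, 1, 1, 0]
forest = [1, 0, 0, 1, 0]
tree = [1, 1, 0, 1, 0]
bayes = [0, 0, 1, 1, 1]
y_true = [1, 0, 1, 1, 0]

y_pred = ensemble_predict([svm, forest, tree, bayes])
print(y_pred, sensitivity(y_true, y_pred), accuracy(y_true, y_pred))
```

With an even number of base classifiers, ties are possible, so any real pipeline needs some explicit tie-breaking policy; the one used here is only a placeholder.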

PeerJ | 2015

Multi-level machine learning prediction of protein–protein interactions in Saccharomyces cerevisiae

Julian Zubek; Marcin Tatjewski; Adam Boniecki; Maciej Mnich; Subhadip Basu; Dariusz Plewczynski

Accurate identification of protein–protein interactions (PPI) is the key step in understanding proteins’ biological functions, which are typically context-dependent. Many existing PPI predictors rely on aggregated features from protein sequences; however, only a few methods exploit local information about specific residue contacts. In this work we present a two-stage machine learning approach for prediction of protein–protein interactions. We start with the carefully filtered data on protein complexes available for Saccharomyces cerevisiae in the Protein Data Bank (PDB) database. First, we build linear descriptions of interacting and non-interacting sequence segment pairs based on their inter-residue distances. Secondly, we train machine learning classifiers to predict binary segment interactions for any two short sequence fragments. The final prediction of the protein–protein interaction is done using the 2D matrix representation of all-against-all possible interacting sequence segments of both analysed proteins. The level-I predictor achieves 0.88 AUC for micro-scale, i.e., residue-level prediction. The level-II predictor improves the results further by a more complex learning paradigm. We perform a 30-fold macro-scale, i.e., protein-level cross-validation experiment. The level-II predictor using PSIPRED-predicted secondary structure reaches 0.70 precision, 0.68 recall, and 0.70 AUC, whereas other popular methods provide results below the 0.6 threshold (recall, precision, AUC). Our results demonstrate that the multi-scale sequence feature aggregation procedure is able to improve the machine learning results by more than 10% as compared to other sequence representations. Prepared datasets and source code for our experimental pipeline are freely available for download from: http://zubekj.github.io/mlppi/ (open source Python implementation, OS independent).

Journal of Molecular Modeling | 2016

PDP-CON: prediction of domain/linker residues in protein sequences using a consensus approach

Piyali Chatterjee; Subhadip Basu; Julian Zubek; Mahantapas Kundu; Mita Nasipuri; Dariusz Plewczynski

The prediction of domain/linker residues in protein sequences is a crucial task in the functional classification of proteins, homology-based protein structure prediction, and high-throughput structural genomics. In this work, a novel consensus-based machine-learning technique was applied for residue-level prediction of the domain/linker annotations in protein sequences using ordered/disordered regions along protein chains and a set of physicochemical properties. Six different classifiers—decision tree, Gaussian naïve Bayes, linear discriminant analysis, support vector machine, random forest, and multilayer perceptron—were exhaustively explored for the residue-level prediction of domain/linker regions. The protein sequences from the curated CATH database were used for training and cross-validation experiments. Test results obtained by applying the developed PDP-CON tool to the mutually exclusive, independent proteins of the CASP-8, CASP-9, and CASP-10 databases are reported. An n-star quality consensus approach was used to combine the results yielded by different classifiers. The average PDP-CON accuracy and F-measure values for the CASP targets were found to be 0.86 and 0.91, respectively. The dataset, source code, and all supplementary materials for this work are available at https://cmaterju.org/cmaterbioinfo/ for noncommercial use.
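
As a rough illustration of the consensus idea, the sketch below assumes that an "n-star" consensus marks a residue as positive when at least n of the base classifiers predict the positive class. That reading, the 0/1 label encoding, the function names, and the toy data are assumptions and are not taken from the PDP-CON source code.

```python
def n_star_consensus(per_classifier_predictions, n):
    """Label a residue 1 when at least n classifiers predict 1 for it (assumed rule)."""
    return [1 if sum(votes) >= n else 0
            for votes in zip(*per_classifier_predictions)]


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy per-residue outputs of three hypothetical classifiers for one sequence.
clf_a = [1, 1, 0, 1, 0, 1]
clf_b = [1, 0, 0, 1, 1, 1]
clf_c = [1, 1, 0, 0, 0, 1]
y_true = [1, 1, 0, 1, 0, 1]

for n in (1, 2, 3):
    consensus = n_star_consensus([clf_a, clf_b, clf_c], n)
    print(n, consensus, round(f_measure(y_true, consensus), 3))
```

Raising n makes the consensus stricter: fewer residues are labelled positive, which tends to trade recall for precision.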

PeerJ | 2016

Computational inference of H3K4me3 and H3K27ac domain length

Julian Zubek; Michael L. Stitzel; Duygu Ucar; Dariusz Plewczynski

Background. Recent epigenomic studies have shown that the length of a DNA region covered by an epigenetic mark is not just a byproduct of the assaying technologies and has functional implications for that locus. For example, expanded regions of DNA sequences that are marked by enhancer-specific histone modifications, such as acetylation of histone H3 lysine 27 (H3K27ac) domains coincide with cell-specific enhancers, known as super or stretch enhancers. Similarly, promoters of genes critical for cell-specific functions are marked by expanded H3K4me3 domains in the cognate cell type, and these can span DNA regions from 4–5kb up to 40–50kb in length. These expanded H3K4me3 domains are known as buffer domains or super promoters. Methods. To ask what correlates with—and potentially regulates—the length of loci marked with these two important histone marks, H3K4me3 and H3K27ac, we built Random Forest regression models. With these models, we computationally identified genomic and epigenomic patterns that are predictive for the length of these marks in seven ENCODE cell lines. Results. We found that certain epigenetic marks and transcription factors explain the variability of the length of H3K4me3 and H3K27ac marks across different cell types, which implies that the lengths of these two epigenetic marks are tightly regulated in a given cell type. Our source code for the regression models and data can be found at our GitHub page: https://github.com/zubekj/broad_peaks. Discussion. Our Random Forest based regression models enabled us to estimate the individual contribution of different epigenetic marks and protein binding patterns to the length of H3K4me3 and H3K27ac deposition patterns, therefore potentially revealing genomic signatures at cell specific regulatory elements.

Pattern Recognition and Machine Intelligence | 2015

PDP-RF: Protein Domain Boundary Prediction Using Random Forest Classifier

Piyali Chatterjee; Subhadip Basu; Julian Zubek; Mahantapas Kundu; Mita Nasipuri; Dariusz Plewczynski

Domain boundary prediction is a crucial task for functional classification of proteins, homology-based protein structure prediction and high-throughput structural genomics. Each amino acid is represented using a set of physico-chemical properties. A Random Forest classifier is explored for accurate prediction of domain regions by training on the curated dataset obtained from the CATH database. The software is tested on proteins of CASP-6, CASP-8, CASP-9 and CASP-10 targets in order to evaluate its prediction accuracy using three-fold cross-validation experiments. Finally, a consensus approach is used to combine results of the classifiers obtained through the cross-validation experiments. The average recall and precision scores achieved by the developed consensus-based Random Forest classifiers (PDP-RF) are 0.98 and 0.88 respectively for prediction of CASP targets. The overall accuracy and F-score of PDP-RF are observed as 0.87 and 0.91 respectively.

International Conference on Man-Machine Interactions 2013 | 2014

Generic Framework for Simulation of Cognitive Systems: A Case Study of Color Category Boundaries

Dariusz Plewczynski; Michał Łukasik; Konrad Kurdej; Julian Zubek; Franciszek Rakowski; Joanna Rączaszek-Leonardi

We present a generic model of a cognitive system, which is based on a population of communicating agents. Following the earlier models (Steels and Belpaeme, 2005) we give communication an important role in shaping the cognitive categories of individual agents. Yet in this paper we underscore the importance of other constraints on cognition: the structure of the environment in which a system evolves and learns, and the learning capacities of individual agents. Thus our agent-based model of cultural emergence of colour categories shows that boundaries might be seen as a product of agents’ communication in a given environment. We discuss the methodological issues related to real data characterization, as well as to the process of modeling the emergence of perceptual categories in human subjects.

PeerJ | 2016

Complexity curve: a graphical measure of data complexity and classifier performance

Julian Zubek; Dariusz Plewczynski

We describe a method for assessing data set complexity based on the estimation of the underlying probability distribution and Hellinger distance. In contrast to some popular complexity measures, it is not focused on the shape of a decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed as a graphical plot, which we call the complexity curve. It demonstrates the relative increase of available information with the growth of sample size. We perform theoretical and experimental examination of properties of the introduced complexity measure and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), making it possible to significantly speed up the learning process without compromising classification accuracy. The associated code is available to download at: https://github.com/zubekj/complexity_curve (open source Python implementation). Subjects: Algorithms and Analysis of Algorithms, Artificial Intelligence, Data Mining and Machine Learning
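
To make the Hellinger-distance idea concrete, here is a simplified sketch for a single discrete attribute. It compares the empirical distribution of random subsamples with that of the full data set and averages the distance per subsample size. The function names, the single-attribute simplification, the number of repeats, and the toy data are assumptions; the authors' actual implementation is in the repository linked above and may differ.

```python
import math
import random
from collections import Counter


def empirical_distribution(values):
    """Relative frequency of each distinct value."""
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)
    squared = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                  for k in support)
    return math.sqrt(squared / 2.0)


def complexity_curve(values, sizes, repeats=20, seed=0):
    """Mean Hellinger distance between subsample and full-data distributions."""
    rng = random.Random(seed)
    full = empirical_distribution(values)
    curve = []
    for n in sizes:
        distances = [hellinger(empirical_distribution(rng.sample(values, n)), full)
                     for _ in range(repeats)]
        curve.append((n, sum(distances) / repeats))
    return curve


# Toy data set: one categorical attribute with three values.
values = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
curve = complexity_curve(values, sizes=[5, 20, 100])
for n, distance in curve:
    print(n, round(distance, 3))
```

The distance lies between 0 and 1 and reaches 0 when the subsample is the whole data set; a curve that drops quickly suggests that a small sample already captures most of the distribution.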

Frontiers in Psychology | 2016

Performance of Language-Coordinated Collective Systems: A Study of Wine Recognition and Description.

Julian Zubek; Michał Denkiewicz; Agnieszka Dębska; Alicja Radkowska; Joanna Komorowska-Mach; Piotr Litwin; Magdalena Stępień; Adrianna Kucińska; Ewa Sitarska; Krystyna Komorowska; Riccardo Fusaroli; Kristian Tylén; Joanna Rączaszek-Leonardi

Most of our perceptions of and engagements with the world are shaped by our immersion in social interactions, cultural traditions, tools and linguistic categories. In this study we experimentally investigate the impact of two types of language-based coordination on the recognition and description of a complex sensory stimulus: red wine. Participants were asked to taste, remember and successively recognize samples of wines within a larger set in a two-by-two experimental design: (1) either individually or in pairs, and (2) with or without the support of a sommelier card, a cultural linguistic tool designed for wine description. Both effectiveness of recognition and the kinds of errors in the four conditions were analyzed. While our experimental manipulations did not impact recognition accuracy, bias-variance decomposition of error revealed non-trivial differences in how participants solved the task. Pairs generally displayed reduced bias and increased variance compared to individuals; however, the variance dropped significantly when they used the sommelier card. The variance-reducing effect of the sommelier card was observed only in pairs; individuals did not seem to benefit from the cultural linguistic tool. Analysis of descriptions generated with the aid of sommelier cards shows that pairs were more coherent and discriminative than individuals. The findings are discussed in terms of global properties and dynamics of collective systems when constrained by different types of cultural practices.

PLOS ONE | 2017

Social adaptation in multi-agent model of linguistic categorization is affected by network information flow

Julian Zubek; Michał Denkiewicz; Juliusz Barański; Przemysław Wróblewski; Joanna Rączaszek-Leonardi; Dariusz Plewczynski

This paper explores how information flow properties of a network affect the formation of categories shared between individuals who are communicating through that network. Our work is based on the established multi-agent model of the emergence of linguistic categories grounded in an external environment. We study how network information propagation efficiency and the direction of information flow affect categorization by performing simulations with idealized network topologies optimizing certain network centrality measures. We measure dynamic social adaptation when either network topology or environment is subject to change during the experiment, and the system has to adapt to new conditions. We find that both a decentralized network topology efficient in information propagation and the presence of a central authority (information flow from the center to peripheries) are beneficial for the formation of global agreement between agents. Systems with a central authority cope well with network topology change, but are less robust in the case of environment change. These findings help to understand which network properties affect processes of social adaptation. They are important to inform the debate on the advantages and disadvantages of centralized systems.

International Conference on Man-Machine Interactions 2013 | 2014