Marek Śmieja | Researchain

Publication

Featured research published by Marek Śmieja.

Advanced Data Analysis and Classification | 2017

Constrained clustering with a complex cluster structure

Marek Śmieja; Magdalena Wiercioch

In this contribution we present a novel constrained clustering method, Constrained clustering with a complex cluster structure (C4s), which incorporates equivalence constraints, both positive and negative, as background information. C4s is capable of discovering groups of arbitrary structure, e.g. with a multi-modal distribution, since at the initial stage the equivalence classes of elements generated by the positive constraints are split into smaller parts. This provides a detailed description of the elements that are in a positive equivalence relation. To enable automatic detection of the number of groups, cross-entropy clustering is applied in each partitioning process. Experiments show that the proposed method achieves significantly better results than previous constrained clustering approaches. The advantage of our algorithm increases when the goal is to find partitions with a complex cluster structure.

PLOS ONE | 2016

Average Information Content Maximization—A New Approach for Fingerprint Hybridization and Reduction

Marek Śmieja; Dawid Warszycki

Fingerprints, bit representations of compound chemical structure, have been widely used in cheminformatics for many years. Although fingerprints with the highest resolution display satisfactory performance in virtual screening campaigns, the presence of a relatively high number of irrelevant bits introduces noise into data and makes their application more time-consuming. In this study, we present a new method of hybrid reduced fingerprint construction, the Average Information Content Maximization algorithm (AIC-Max algorithm), which selects the most informative bits from a collection of fingerprints. This methodology, applied to the ligands of five cognate serotonin receptors (5-HT2A, 5-HT2B, 5-HT2C, 5-HT5A, 5-HT6), proved that 100 bits selected from four non-hashed fingerprints reflect almost all structural information required for a successful in silico discrimination test. A classification experiment indicated that a reduced representation is able to achieve even slightly better performance than the state-of-the-art 10-times-longer fingerprints and in a significantly shorter time.

PLOS ONE | 2014

Asymmetric clustering index in a case study of 5-HT1A receptor ligands

Marek Śmieja; Dawid Warszycki; Jacek Tabor; Andrzej J. Bojarski

The automatic clustering of chemical compounds is an important branch of chemoinformatics. In this paper, the Asymmetric Clustering Index (Aci) is proposed to assess how well an automatically created partition reflects the reference. The asymmetry allows for a distinction between the fixed reference and the numerically constructed partition. The introduced index is applied to evaluate the quality of hierarchical clustering procedures for 5-HT1A receptor ligands. We find that the most appropriate combination of parameters for the hierarchical clustering of compounds with a determined activity for this biological target is the Klekota-Roth fingerprint combined with the complete linkage function and the Buser similarity metric.

IMA Journal of Mathematical Control and Information | 2015

Weighted approach to general entropy function

Marek Śmieja

The definition of weighted entropy allows for easy calculation of the entropy of a mixture of measures. In this paper we investigate the problem of an equivalent definition of the general entropy function in weighted form. We show that, under a reasonable condition that is satisfied by the well-known Shannon, Renyi and Tsallis entropies, every entropy function can be defined equivalently in the weighted way. As a corollary, we show how to use the weighted form to compute the Tsallis entropy of a mixture of measures.
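For orientation, the three entropies named in this abstract can be written down for a discrete probability vector. The sketch below uses the standard textbook definitions only; it does not reproduce the paper's weighted form, and the function names are illustrative assumptions.

```python
import math
from typing import Sequence


def shannon_entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits: -sum_i p_i * log2(p_i)."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def renyi_entropy(p: Sequence[float], alpha: float) -> float:
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in bits."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1 - alpha)


def tsallis_entropy(p: Sequence[float], q: float) -> float:
    """Tsallis entropy of order q (q != 1): (1 - sum_i p_i^q) / (q - 1)."""
    if q == 1:
        raise ValueError("q must be different from 1")
    return (1 - sum(x ** q for x in p if x > 0)) / (q - 1)


# Hypothetical example: the uniform distribution on four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]
print(shannon_entropy(uniform), renyi_entropy(uniform, 2), tsallis_entropy(uniform, 2))
```

For a uniform distribution the Shannon and Renyi values coincide, while the Tsallis value differs because it is not logarithmic.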

Computer Recognition Systems | 2013

Image Segmentation with Use of Cross-Entropy Clustering

Marek Śmieja; Jacek Tabor

We present an image segmentation approach that is invariant to affine transformations: the result after rescaling the picture remains almost the same as before. Moreover, the algorithm automatically detects the correct number of groups. We show that the method is capable of discovering general shapes as well as small details through an appropriate choice of only two input parameters.

Neurocomputing | 2017

R Package CEC

Przemysław Spurek; Konrad Kamieniecki; Jacek Tabor; Krzysztof Misztal; Marek Śmieja

Cross-Entropy Clustering (CEC) is a model-based clustering method which divides data into Gaussian-like clusters. The main advantage of CEC is that it combines the speed and simplicity of k-means with the ability to use various Gaussian models, similarly to EM. Moreover, the method is capable of automatically removing unnecessary clusters. In this paper we present the R Package CEC, which implements the CEC method.

Information Sciences | 2017

Semi-supervised cross-entropy clustering with information bottleneck constraint

Marek Śmieja; Bernhard C. Geiger

In this paper, we propose a semi-supervised clustering method, CEC-IB, that models data with a set of Gaussian distributions and that retrieves clusters based on a partial labeling provided by the user (partition-level side information). By combining the ideas from cross-entropy clustering (CEC) with those from the information bottleneck method (IB), our method trades off three conflicting goals: the accuracy with which the data set is modeled, the simplicity of the model, and the consistency of the clustering with side information. Experiments demonstrate that CEC-IB performs similarly to Gaussian mixture models in a classical semi-supervised scenario, but is faster, more robust to noisy labels, automatically determines the optimal number of clusters, and performs well when not all classes are present in the side information. Moreover, in contrast to many other semi-supervised models, it can be successfully applied in discovering natural subgroups if the partition-level side information is derived from the top levels of a hierarchical clustering.

Entropy | 2015

Entropy Approximation in Lossy Source Coding Problem

Marek Śmieja; Jacek Tabor

In this paper, we investigate a lossy source coding problem, where an upper limit on the permitted distortion is defined for every dataset element. It can be seen as an alternative approach to rate distortion theory, where a bound on the allowed average error is specified. In order to find the entropy, which gives a statistical length of source code compatible with a fixed distortion bound, a corresponding optimization problem has to be solved. First, we show how to simplify this general optimization by reducing the number of coding partitions that are irrelevant for the entropy calculation. In our main result, we present a fast greedy algorithm that is feasible to implement and allows one to approximate the entropy within an additive error term of log2 e. The proof is based on the minimum entropy set cover problem, for which a similar bound was obtained.
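The abstract refers to the minimum entropy set cover problem. As a rough illustration of that related problem only, the sketch below applies the usual greedy rule (repeatedly take the subset that covers the most still-uncovered elements) and reports the entropy of the resulting partition. It is not the paper's algorithm; the function names and the toy data are assumptions made for the example.

```python
import math
from typing import Hashable, List, Sequence, Set


def greedy_cover_partition(
    universe: Set[Hashable], subsets: Sequence[Set[Hashable]]
) -> List[Set[Hashable]]:
    """Assign every element to a subset using the greedy largest-gain rule."""
    uncovered = set(universe)
    parts: List[Set[Hashable]] = []
    while uncovered:
        # Pick the subset that covers the most uncovered elements.
        best = max(subsets, key=lambda s: len(s & uncovered))
        gained = best & uncovered
        if not gained:
            raise ValueError("the subsets do not cover the universe")
        parts.append(gained)
        uncovered -= gained
    return parts


def partition_entropy(parts: Sequence[Set[Hashable]]) -> float:
    """Entropy (in bits) of the part sizes, treated as a distribution."""
    total = sum(len(p) for p in parts)
    return -sum(len(p) / total * math.log2(len(p) / total) for p in parts)


# Hypothetical toy instance: six elements, three candidate subsets.
universe = {1, 2, 3, 4, 5, 6}
subsets = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6}]
parts = greedy_cover_partition(universe, subsets)
print(parts, partition_entropy(parts))
```

On this toy instance the greedy rule first takes the four-element subset and then the subset containing the two remaining elements.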

Expert Systems With Applications | 2017

Semi-supervised model-based clustering with controlled clusters leakage

Marek Śmieja; Łukasz Struski; Jacek Tabor

In this paper, we focus on finding clusters in partially categorized data sets. We propose a semi-supervised version of the Gaussian mixture model, called C3L, which retrieves natural subgroups of given categories. In contrast to other semi-supervised models, C3L is parametrized by a user-defined leakage level, which controls the maximal inconsistency between the initial categorization and the resulting clustering. Our method can be implemented as a module in practical expert systems to detect clusters that combine expert knowledge with the true distribution of data. Moreover, it can be used for improving the results of less flexible clustering techniques, such as projection pursuit clustering. The paper presents an extensive theoretical analysis of the model and a fast algorithm for its efficient optimization. Experimental results show that C3L finds a high-quality clustering model, which can be applied in discovering meaningful groups in partially classified data.

Computer Information Systems and Industrial Management Applications | 2014

Subspaces clustering approach to lossy image compression

Przemysław Spurek; Marek Śmieja; Krzysztof Misztal

In this contribution, lossy image compression based on subspaces clustering is considered. Given a PCA factorization of each cluster into subspaces and a maximal compression error, we show that the selection of those subspaces that provide the optimal lossy image compression is equivalent to the 0-1 Knapsack Problem. We present a theoretical and an experimental comparison between exact and approximate algorithms for solving the 0-1 Knapsack Problem in the case of lossy image compression.
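For readers unfamiliar with the 0-1 Knapsack Problem named in this abstract, a minimal exact dynamic-programming sketch follows. It illustrates the generic problem only, not the authors' subspace selection; the item values, weights and capacity are hypothetical, and integer weights are assumed.

```python
from typing import List, Tuple


def knapsack_01(
    values: List[int], weights: List[int], capacity: int
) -> Tuple[int, List[int]]:
    """Return the best total value and the indices of the chosen items."""
    n = len(values)
    # best[i][c] = best value using only the first i items within capacity c.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = values[i - 1], weights[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip item i-1
            if weight <= c:
                # option 2: take item i-1 if that improves the value
                best[i][c] = max(best[i][c], best[i - 1][c - weight] + value)
    # Walk back through the table to recover which items were taken.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


# Hypothetical instance: three items and a capacity of 50.
print(knapsack_01([60, 100, 120], [10, 20, 30], 50))
```

The table has (n + 1) x (capacity + 1) entries, so the running time grows with the capacity, which is one reason approximate algorithms are of interest for large instances.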
