Lena Pietruczuk
Częstochowa University of Technology
Publication
Featured research published by Lena Pietruczuk.
IEEE Transactions on Knowledge and Data Engineering | 2013
Leszek Rutkowski; Lena Pietruczuk; Piotr Duda; Maciej Jaworski
In mining data streams the most popular tool is the Hoeffding tree algorithm. It uses Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting attribute. In the literature the same Hoeffding's bound has been used for any evaluation function (heuristic measure), e.g., information gain or the Gini index. In this paper, it is shown that Hoeffding's inequality is not appropriate for solving the underlying problem. We prove two theorems presenting McDiarmid's bound for both the information gain, used in the ID3 algorithm, and the Gini index, used in the Classification and Regression Trees (CART) algorithm. The results of the paper guarantee that a decision tree learning system, applied to data streams and based on McDiarmid's bound, has the property that its output is nearly identical to that of a conventional learner. The results of the paper have a great impact on the state of the art of mining data streams, and the various methods and algorithms developed so far should be reconsidered.
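For orientation, the classic Hoeffding-tree split test that this abstract criticizes can be sketched as follows. This is a minimal illustration of the textbook bound epsilon = sqrt(R^2 ln(1/delta) / (2n)), not the McDiarmid-based bound derived in the paper; the function names and the numbers in the usage lines are assumptions chosen for the example.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Textbook Hoeffding epsilon for the mean of n values spanning value_range.

    With probability at least 1 - delta, the sample mean lies within
    epsilon of the true mean.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def min_examples(value_range: float, delta: float, gap: float) -> int:
    """Smallest n for which the Hoeffding epsilon is at most `gap`."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * gap ** 2))


def should_split(best: float, second_best: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Classic test: split once the gap between the two best attributes
    exceeds epsilon (hypothetical helper, for illustration only)."""
    return (best - second_best) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Assumed values: heuristic range R = 1, delta = 1e-7.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(min_examples(1.0, 1e-7, 0.1))
    print(should_split(0.30, 0.15, 1.0, 1e-7, 1000))
```

The abstract's point is that information gain and the Gini index are not simple averages of independent observations, so applying this bound to them is not justified.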
IEEE Transactions on Knowledge and Data Engineering | 2014
Leszek Rutkowski; Maciej Jaworski; Lena Pietruczuk; Piotr Duda
Since the Hoeffding tree algorithm was proposed in the literature, decision trees have become one of the most popular tools for mining data streams. The key point in constructing a decision tree is to determine the best attribute on which to split the considered node. Several methods for solving this problem have been presented so far. However, they are either incorrectly justified mathematically (e.g., in the Hoeffding tree algorithm) or time-consuming (e.g., in the McDiarmid tree algorithm). In this paper, we propose a new method which significantly outperforms the McDiarmid tree algorithm and has a solid mathematical basis. Our method ensures, with a high probability set by the user, that the best attribute chosen in the considered node using a finite data sample is the same as it would be in the case of the whole data stream.
Information Sciences | 2014
Leszek Rutkowski; Maciej Jaworski; Lena Pietruczuk; Piotr Duda
Decision trees are one of the most popular tools for mining data streams. In this paper we propose a new algorithm based on the commonly known CART algorithm. The most important task in constructing decision trees for data streams is to determine the best attribute on which to split the considered node. To solve this problem we apply the Gaussian approximation. The presented algorithm achieves high classification accuracy with a short processing time. The main result of this paper is a theorem showing that the best attribute computed in the considered node according to the available data sample is the same, with some high probability, as the attribute derived from the whole data stream.
IEEE Transactions on Neural Networks | 2015
Leszek Rutkowski; Maciej Jaworski; Lena Pietruczuk; Piotr Duda
In this paper, a new method for constructing decision trees for stream data is proposed. First, a new splitting criterion based on the misclassification error is derived. A theorem is proven showing that the best attribute computed in the considered node according to the available data sample is the same, with some high probability, as the attribute derived from the whole infinite data stream. Next, this result is combined with the splitting criterion based on the Gini index. It is shown that this combination provides the highest accuracy among all studied algorithms.
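The two impurity measures named in this abstract can be compared on a toy split, as sketched below. This uses only the standard textbook definitions (misclassification error 1 - max p_k, Gini index 1 - sum p_k^2); it does not reproduce the paper's stream-based criterion or its probabilistic guarantee, and the function names and data are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini index: 1 - sum of squared class proportions."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def misclassification_error(labels: Sequence[int]) -> float:
    """Misclassification error: 1 - proportion of the majority class."""
    if not labels:
        return 0.0
    return 1.0 - max(Counter(labels).values()) / len(labels)


def split_gain(parent: Sequence[int], left: Sequence[int], right: Sequence[int],
               impurity: Callable[[Sequence[int]], float]) -> float:
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


if __name__ == "__main__":
    parent = [0, 0, 0, 0, 1, 1, 1, 1]
    # Two hypothetical candidate splits of the same node.
    split_a = ([0, 0, 0, 1], [0, 1, 1, 1])
    split_b = ([0, 0, 1, 1, 1, 1], [0, 0])
    for name, (left, right) in (("A", split_a), ("B", split_b)):
        print(name,
              split_gain(parent, left, right, misclassification_error),
              split_gain(parent, left, right, gini))
```

On this toy data the misclassification error rates both splits equally, while the Gini index prefers split B, which shows that the two measures can rank candidate splits differently.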
Information Sciences | 2017
Lena Pietruczuk; Leszek Rutkowski; Maciej Jaworski; Piotr Duda
In this paper we propose a new approach for designing an ensemble applied to stream data classification. Our approach is supported by two theorems showing how to decide whether a new component should be added to the ensemble or not, based on the assumption that such an action should increase the accuracy of the ensemble not only for the current portion of observations but also for the whole (infinite) data stream. The conclusions of these theorems hold with a certain probability (confidence) set by the user. Among other results, computer simulations show that decreasing the confidence that the decision based on a finite portion of the stream is the same as the one based on the whole (infinite) data stream improves the accuracy only slightly, at the expense of significant memory consumption. Moreover, we introduce a novel procedure for weighting ensemble components, i.e., decision trees, by assigning a weight to each leaf of the tree. In previous approaches a weight was assigned to the whole ensemble component. The new approach is based on the observation that the probability of a correct tree outcome differs across sections of the tree.
International Conference on Artificial Intelligence and Soft Computing | 2013
Lena Pietruczuk; Piotr Duda; Maciej Jaworski
The problem of data stream mining is widely studied in the literature. Mining data subject to concept drift is especially difficult. The most commonly used algorithms are those based on decision trees. In this article we investigate the performance of several algorithms for constructing decision trees for data stream classification that are not explicitly designed to deal with a changing data distribution. We show how to adapt these methods to deal with concept drift, and we compare the obtained results.
International Conference on Artificial Intelligence and Soft Computing | 2012
Lena Pietruczuk; Piotr Duda; Maciej Jaworski
Along with technological developments we observe an increasing amount of stored and processed data. It is not possible to store all incoming data and analyze it on the fly. Therefore many researchers are working on new algorithms for data stream mining. A new algorithm should be fast and should use a small amount of memory. We consider the problem of data stream classification. To increase the accuracy we propose to use an ensemble of classifiers based on a modified FID3 algorithm. The experimental results show that this algorithm is fast and accurate. Therefore it is an adequate tool for data stream classification.
International Conference on Artificial Intelligence and Soft Computing | 2012
Maciej Jaworski; Meng Joo Er; Lena Pietruczuk
The problem of learning in a non-stationary environment is solved by making use of order statistics in combination with the Parzen kernel-type regression neural network. Probabilistic properties of the algorithm are investigated and weak convergence is established. Experimental results are presented.
International Conference on Artificial Intelligence and Soft Computing | 2012
Piotr Duda; Maciej Jaworski; Lena Pietruczuk
Clustering is one of the most important tasks of data mining. Algorithms like Fuzzy C-Means and Possibilistic C-Means provide good results both for static data and for data streams. All clustering algorithms compute centers from a chunk of data, which requires a lot of time. If the rate of incoming data is faster than the speed of the algorithm, part of the data will be lost. To prevent such a situation, some pre-processing algorithms should be used. The purpose of this paper is to propose a pre-processing method for clustering algorithms. Experimental results show that the proposed method is appropriate for handling noisy data and can reduce processing time.
International Conference on Artificial Intelligence and Soft Computing | 2012
Maciej Jaworski; Lena Pietruczuk; Piotr Duda
In this paper the resource consumption of fuzzy clustering algorithms for data streams is studied. As examples, the wFCM and the wPCM algorithms are examined. It is shown that partitioning a data stream into chunks significantly reduces the processing time of the considered algorithms. The partitioning procedure is accompanied by a reduction in the accuracy of the results; however, the change is acceptable. The problems arising from high-speed data streams are presented as well. The uncontrollable growth of subsequent data chunk sizes, which leads to an overflow of the available memory, is demonstrated for both the wFCM and wPCM algorithms. A maximum chunk size limit is introduced as a solution to this problem. This modification ensures that the available memory is never exceeded, as shown in the simulations. The considered modification decreases the quality of clustering results only slightly.
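The idea of capping the chunk size can be illustrated with a small generator. This is only a sketch under stated assumptions: the abstract does not say how chunk sizes are chosen or how the wFCM and wPCM algorithms consume them, so `bounded_chunks` and `max_chunk_size` are hypothetical names, and the clustering step itself is omitted.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def bounded_chunks(stream: Iterable[T], max_chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of the stream, each holding at most
    max_chunk_size items, so a single chunk never outgrows a fixed budget."""
    if max_chunk_size < 1:
        raise ValueError("max_chunk_size must be at least 1")
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, max_chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Each chunk would be handed to a clustering step (not shown here).
    for chunk in bounded_chunks(range(7), 3):
        print(chunk)
```

Because at most one chunk of bounded size is held at a time, the memory needed for buffering stays fixed regardless of how fast the stream arrives.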