Stéphane Lallich
University of Lyon
Publications
Featured research published by Stéphane Lallich.
European Journal of Operational Research | 2008
Philippe Lenca; Patrick Meyer; Benoît Vaillant; Stéphane Lallich
Data mining algorithms, especially those used for unsupervised learning, generate a large quantity of rules. In particular, this applies to the Apriori family of algorithms for the determination of association rules. It is hence impossible for an expert in the field being mined to assess all these rules. To help carry out the task, many measures which evaluate the interestingness of rules have been developed. They make it possible to automatically filter and sort a set of rules with respect to given goals. Since these measures may produce different results, and as experts have different understandings of what a good rule is, we propose in this article a new direction to select the best rules: a two-step solution to the problem of recommending one or more user-adapted interestingness measures. First, a description of interestingness measures, based on meaningful classical properties, is given. Second, a multicriteria decision aid process is applied to this analysis and illustrates the benefit that a user who is not a data mining expert can achieve with such methods.
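To make concrete the kind of quantities such measures compute, here is a minimal sketch, not taken from the paper, of four classical interestingness measures for a rule A → B evaluated from its 2×2 contingency counts (the counts and function name are illustrative):

```python
# Illustrative only: a few classical interestingness measures for an
# association rule A -> B, computed from contingency counts. The paper
# studies many more measures and selects among them with a multicriteria
# decision aid process; this sketch just shows the quantities compared.

def rule_measures(n_ab, n_a, n_b, n):
    """n_ab: transactions with A and B; n_a, n_b: marginal counts; n: total."""
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)              # > 1: positive association
    leverage = support - (n_a / n) * (n_b / n)  # difference from independence
    return {"support": support, "confidence": confidence,
            "lift": lift, "leverage": leverage}

print(rule_measures(n_ab=40, n_a=50, n_b=60, n=100))
# support 0.4, confidence 0.8, lift 1.33..., leverage 0.1
```

Different measures can rank the same set of rules differently, which is precisely why the paper's recommendation step is needed.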
Intelligent Information Systems | 2004
Fabrice Muhlenbach; Stéphane Lallich; Djamel Abdelkader Zighed
Data mining and knowledge discovery aim at producing useful and reliable models from the data. Unfortunately, some databases contain noisy data which perturb the generalization of the models. An important source of noise consists of mislabelled training instances. We offer a new approach which improves classification accuracy by using a preliminary filtering procedure. An example is suspect when, in its neighbourhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the database itself. Such suspect examples in the training data can be removed or relabelled. The filtered training set is then provided as input to learning algorithms. Our experiments on ten benchmarks of the UCI Machine Learning Repository, using 1-NN as the final algorithm, show that removal gives better results than relabelling. Removal maintains the generalization error rate when we introduce from 0 to 20% of noise on the class, especially when classes are well separable. The proposed filtering method is finally compared to the relaxation relabelling schema.
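The filtering idea can be sketched as follows, with two loud assumptions: a k-nearest-neighbour graph stands in for the paper's geometrical graph, and a plain comparison with the global class proportion replaces the paper's significance test. All names and data are illustrative:

```python
# Sketch of suspect-example detection. Assumptions (not the paper's exact
# method): k-NN neighbourhoods instead of a geometrical graph, and a simple
# "local proportion does not exceed global proportion" rule instead of a
# statistical significance test.
import math

def suspects(points, labels, k=3):
    """Indices of points whose k-NN neighbourhood is not dominated by their class."""
    n = len(points)
    global_prop = {c: labels.count(c) / n for c in set(labels)}
    flagged = []
    for i in range(n):
        # k nearest neighbours by Euclidean distance, excluding the point itself
        order = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        neigh = [j for j in order if j != i][:k]
        same = sum(labels[j] == labels[i] for j in neigh) / k
        if same <= global_prop[labels[i]]:
            flagged.append(i)   # candidate for removal or relabelling
    return flagged

pts = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
labs = ["a", "a", "a", "b", "b", "b", "b"]   # last point looks mislabelled
print(suspects(pts, labs, k=3))  # flags index 6
```

Flagged examples would then be removed (the better option, per the abstract) before training the final 1-NN classifier.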
Discovery Science | 2004
Benoît Vaillant; Philippe Lenca; Stéphane Lallich
It is a common issue that KDD processes may generate a large number of patterns, depending on the algorithm used and its parameters. It is hence impossible for an expert to assess all these patterns. This may be the case with the well-known APRIORI algorithm. One of the methods used to cope with such an amount of output is the use of interestingness measures. Stating that selecting interesting rules also means using an adapted measure, we present an experimental study of the behaviour of 20 measures on 10 datasets. This study is compared to a previous analysis of formal and meaningful properties of the measures, by means of two clusterings. One of the goals of this study is to enhance our previous approach. Both approaches seem to be complementary and could be profitable for the problem of a user's choice of a measure.
Quality Measures in Data Mining | 2007
Stéphane Lallich; Olivier Teytaud; Elie Prudhomme
The search for interesting Boolean association rules is an important topic in knowledge discovery in databases. The set of admissible rules for the selected support and confidence thresholds can easily be extracted by algorithms based on support and confidence, such as Apriori. However, they may produce a large number of rules, many of which are uninteresting. One has to resolve a two-tier problem: choosing the measures best suited to the problem at hand, then validating the interesting rules against the selected measures. First, the usual measures suggested in the literature are reviewed and criteria to assess the qualities of these measures are proposed. Statistical validation of the most interesting rules requires performing a large number of tests. Thus, controlling for false discoveries (type I errors) is of prime importance. An original bootstrap-based validation method is proposed which controls, for a given level, the number of false discoveries. The interest of this method for the selection of interesting association rules is illustrated by several examples.
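As background to the "admissible rules" the abstract starts from, here is a deliberately tiny, brute-force Apriori-style extraction of rules passing support and confidence thresholds; it is a sketch for illustration, not the authors' implementation, and the thresholds and data are made up:

```python
# Brute-force extraction of association rules passing support and
# confidence thresholds (illustrative sketch; fine only for tiny data).
from itertools import combinations

def apriori_rules(transactions, min_sup=0.5, min_conf=0.8):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    # frequent itemsets, enumerated exhaustively by size
    freq = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            sup = sum(set(cand) <= t for t in transactions) / n
            if sup >= min_sup:
                freq[cand] = sup
    # rules A -> B from each frequent itemset, kept if confident enough
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for a in combinations(itemset, r):
                conf = sup / freq[a]   # antecedent is frequent too
                if conf >= min_conf:
                    rules.append((a, tuple(i for i in itemset if i not in a), conf))
    return rules

tx = [{"bread", "milk"}, {"bread", "milk", "beer"}, {"bread"}, {"milk", "bread"}]
print(apriori_rules(tx))  # keeps milk -> bread, drops bread -> milk
```

Even on such toy data the support/confidence filter discards rules asymmetrically, which is the starting point for the paper's further statistical validation.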
Quality Measures in Data Mining | 2007
Philippe Lenca; Benoît Vaillant; Patrick Meyer; Stéphane Lallich
It is a common problem that KDD processes may generate a large number of patterns, depending on the algorithm used and its parameters. It is hence impossible for an expert to assess these patterns. This is the case with the well-known Apriori algorithm. One of the methods used to cope with such an amount of output is the use of association rule interestingness measures. Stating that selecting interesting rules also means using an adapted measure, we present a formal and an experimental study of 20 measures. The experimental studies carried out on 10 data sets lead to an experimental classification of the measures. This study is compared to an analysis of the formal and meaningful properties of the measures. Finally, the properties are used in a multi-criteria decision analysis in order to select, among the available measures, the one or those that best take into account the user's needs. These approaches seem to be complementary and could be useful in solving the problem of a user's choice of measure.
Knowledge Discovery and Data Mining | 2008
Philippe Lenca; Stéphane Lallich; Thanh-Nghi Do; Nguyen-Khang Pham
In data mining, large differences in prior class probabilities, known as the class imbalance problem, have been reported to hinder the performance of classifiers such as decision trees. Dealing with imbalanced and cost-sensitive data has been recognized as one of the 10 most challenging problems in data mining research. In decision tree learning, many measures are based on the concept of Shannon's entropy. A major characteristic of these entropies is that they take their maximal value when the distribution of the modalities of the class variable is uniform. To deal with the class imbalance problem, we proposed an off-centered entropy which takes its maximum value for a distribution fixed by the user. This distribution can be the a priori distribution of the class variable modalities or a distribution taking into account the costs of misclassification. Other authors have proposed an asymmetric entropy. In this paper we present the concepts of the three entropies and compare their effectiveness on 20 imbalanced data sets. All our experiments are founded on the C4.5 decision tree algorithm, in which only the entropy function is modified. The results are promising and show the interest of off-centered entropies for dealing with the problem of class imbalance.
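A sketch of the off-centering idea for a binary class, under the assumption, drawn from the off-centered entropy literature rather than restated from this abstract, that the class proportion p is rescaled piecewise-linearly so that the user-chosen proportion theta maps to 1/2, where Shannon's entropy peaks:

```python
# Sketch (assumed piecewise-linear rescaling, binary case): Shannon's
# entropy is maximal at p = 0.5; the off-centered variant is maximal at a
# user-fixed proportion theta, e.g. the a priori minority-class frequency.
import math

def shannon(p):
    """Binary Shannon entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def off_centered(p, theta):
    # map [0, theta] -> [0, 1/2] and [theta, 1] -> [1/2, 1], then apply shannon
    if p <= theta:
        pi = p / (2 * theta)
    else:
        pi = (p + 1 - 2 * theta) / (2 * (1 - theta))
    return shannon(pi)

print(off_centered(0.1, theta=0.1))  # 1.0: maximal at the chosen proportion
print(off_centered(0.5, theta=0.1))  # strictly below 1.0
```

With this rescaling, a node whose class distribution matches the a priori imbalanced distribution is treated as maximally impure, so splits that deviate from it toward the minority class are rewarded.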
International Symposium on Methodologies for Intelligent Systems | 2002
Stéphane Lallich; Fabrice Muhlenbach; Djamel A. Zighed
It is common that a database contains noisy data. An important source of noise consists of mislabeled training instances. We present a new approach that improves classification accuracy in such a case by using a preliminary filtering procedure. An example is suspect when, in its neighborhood defined by a geometrical graph, the proportion of examples of the same class is not significantly greater than in the whole database. Such suspect examples in the training data can be removed or relabeled. The filtered training set is then provided as input to the learning algorithm. Our experiments on ten benchmarks of the UCI Machine Learning Repository using 1-NN as the final algorithm show that removing gives better results than relabeling. Removing maintains the generalization error rate when we introduce from 0 to 20% of noise on the class, especially when classes are well separable.
BMC Medical Informatics and Decision Making | 2012
Marie Hélène Metzger; Thierry Durand; Stéphane Lallich; Roger Salamon; Philippe Castets
Background: In France, recent developments in healthcare system organization have aimed at strengthening decision-making and action in public health at the regional level. Firstly, the 2004 Public Health Act, by setting 100 national and regional public health targets, introduced an evaluative approach to public health programs at the national and regional levels. Meanwhile, the implementation of regional platforms for managing electronic health records (EHRs) has also been under assessment to coordinate the deployment of this important instrument of care within each geographic area. In this context, the development and implementation of a regional approach to epidemiological data extracted from EHRs are an opportunity that must be seized as soon as possible. Our article addresses certain design and organizational aspects so that the technical requirements for such use are integrated into regional platforms in France. The article is based on the organization of the Rhône-Alpes regional health platform.
Discussion: Different tools being deployed in France allow us to consider the potential of these regional platforms for epidemiology and public health (implementation of a national health identification number and a national information system interoperability framework). The deployment of the Rhône-Alpes regional health platform began in the 2000s. By August 2011, 2.6 million patients were identified in this platform. A new development step is emerging because regional decision-makers need to measure healthcare efficiency. To pool the heterogeneous information contained in various independent databases, the format, norm and content of the metadata have been defined. Two types of databases will be created according to the nature of the data processed: one for extracting structured data, and the second for extracting non-structured and de-identified free-text documents.
Summary: Regional platforms for managing EHRs could constitute an important data source for epidemiological surveillance in the context of epidemic alerts, but also in monitoring a number of indicators of infectious and chronic diseases for which no data are yet available in France.
EGC (best of volume) | 2010
Thanh-Nghi Do; Philippe Lenca; Stéphane Lallich; Nguyen-Khang Pham
The random forests method is one of the most successful ensemble methods. However, random forests do not perform well when dealing with very-high-dimensional data in the presence of dependencies. In this case one can expect that there exist many combinations between the variables, and unfortunately the usual random forests method does not effectively exploit this situation. We here investigate a new approach for supervised classification with a huge number of numerical attributes. We propose a method of random oblique decision trees: it consists of randomly choosing a subset of predictive attributes and using an SVM as the split function on these attributes. We compare, on 25 datasets, the effectiveness with classical measures (e.g. precision, recall, F1-measure and accuracy) of random forests of random oblique decision trees with SVMs and random forests of C4.5. Our proposal has significantly better performance on very-high-dimensional datasets, with slightly better results on lower-dimensional datasets.
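To show what an oblique split is, here is a sketch in which a crude centroid-difference linear separator stands in for the paper's SVM: the node still splits on a weighted sum of a random subset of attributes rather than on a single attribute, which is the defining feature of oblique trees. Names and data are illustrative:

```python
# Sketch of one oblique tree node. Substitution, loudly flagged: instead of
# training an SVM on the random attribute subset (the paper's method), we use
# the difference of class centroids as the hyperplane normal, thresholded at
# the midpoint. The split is still oblique: a linear combination of attributes.
import random

def oblique_split(X, y, n_attrs=2, seed=0):
    rng = random.Random(seed)
    attrs = rng.sample(range(len(X[0])), n_attrs)   # random attribute subset
    pos = [x for x, c in zip(X, y) if c == 1]
    neg = [x for x, c in zip(X, y) if c == 0]
    centroid = lambda pts, a: sum(p[a] for p in pts) / len(pts)
    w = {a: centroid(pos, a) - centroid(neg, a) for a in attrs}
    b = sum(w[a] * (centroid(pos, a) + centroid(neg, a)) / 2 for a in attrs)
    return lambda x: 1 if sum(w[a] * x[a] for a in attrs) >= b else 0

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 1, 1, 1]
split = oblique_split(X, y)
print([split(x) for x in X])  # separates the two clusters
```

A forest would grow many such trees, each node drawing a fresh random attribute subset, and aggregate their votes.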
Data Mining | 2010
Yannick Le Bras; Philippe Lenca; Stéphane Lallich
Many studies have shown the limits of the support/confidence framework used in Apriori-like algorithms to mine association rules. There are many efficient implementations based on the anti-monotonicity property of the support, but candidate set generation is still costly, and many rules are uninteresting or redundant. In addition, one can miss interesting rules like nuggets. We are thus facing both a complexity issue and a quality issue.
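The anti-monotonicity property the abstract refers to can be checked directly on toy data: an itemset can never be more frequent than any of its subsets, which is what lets Apriori-like algorithms prune every superset of an infrequent candidate (data are made up for illustration):

```python
# Checking anti-monotonicity of support on toy transactions: every itemset
# is at most as frequent as each of its immediate subsets.
from itertools import combinations

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

for size in (2, 3):
    for cand in combinations("abc", size):
        s = support(set(cand))
        # each subset with one item removed must be at least as frequent
        assert all(support(set(sub)) >= s for sub in combinations(cand, size - 1))

print(support({"a"}), support({"a", "b"}))  # 0.75 0.5
```

This is what makes support-based pruning sound; the abstract's point is that soundness does not make the surviving rules interesting.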