Jefrey Lijffijt
Ghent University
Publication
Featured research published by Jefrey Lijffijt.
Data Mining and Knowledge Discovery | 2014
Jefrey Lijffijt; Panagiotis Papapetrou; Kai Puolamäki
Hypothesis testing using constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p-value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and that there exists no efficient algorithm with a finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify, in various settings, a small set of patterns that statistically explains the data, and to formulate data mining problems in terms of statistical significance.
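The greedy strategy mentioned in the abstract can be illustrated with a sketch. The paper scores candidate pattern sets by a global p-value; the stand-in objective below is plain coverage of data items, so all names and data here are hypothetical:

```python
# Illustrative sketch only: greedily pick the pattern that explains the most
# still-unexplained data items, until nothing further can be explained.
# The real method uses a statistical (p-value based) objective instead.

def greedy_pattern_selection(patterns, data_items):
    """Repeatedly pick the pattern covering the most unexplained items."""
    unexplained = set(data_items)
    chosen = []
    while unexplained:
        best = max(patterns, key=lambda p: len(unexplained & p["covers"]))
        if not (unexplained & best["covers"]):
            break  # no remaining pattern explains anything further
        chosen.append(best["name"])
        unexplained -= best["covers"]
    return chosen

patterns = [
    {"name": "A", "covers": {1, 2, 3}},
    {"name": "B", "covers": {3, 4}},
    {"name": "C", "covers": {4, 5, 6}},
]
print(greedy_pattern_selection(patterns, range(1, 7)))  # ['A', 'C']
```

Note how the redundant pattern B is skipped: once A and C are chosen, B adds no explanatory value, which is the intuition behind seeking the *smallest* explaining set.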
Literary and Linguistic Computing | 2016
Jefrey Lijffijt; Terttu Nevalainen; Tanja Säily; Panagiotis Papapetrou; Kai Puolamäki; Heikki Mannila
Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
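The recommended bootstrap test, resampling at the level of texts rather than words, can be sketched as follows. The function name and the toy frequency lists are made up for illustration; a real analysis would use per-text relative frequencies from the corpora being compared:

```python
import random

def bootstrap_pvalue(freqs_a, freqs_b, n_boot=2000, seed=0):
    """Two-sided bootstrap test for a difference in mean word frequency.
    Each value is one text's frequency, so texts (not words) are treated
    as the independent sampling units."""
    rng = random.Random(seed)
    observed = sum(freqs_a) / len(freqs_a) - sum(freqs_b) / len(freqs_b)
    pooled = freqs_a + freqs_b  # null hypothesis: both groups share one pool
    extreme = 0
    for _ in range(n_boot):
        sample_a = [rng.choice(pooled) for _ in freqs_a]
        sample_b = [rng.choice(pooled) for _ in freqs_b]
        diff = sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    return (extreme + 1) / (n_boot + 1)

# Toy per-text frequencies (occurrences per 1,000 words; invented numbers)
group_a = [1.2, 0.8, 1.5, 0.9, 1.1]
group_b = [2.1, 1.9, 2.4, 1.7, 2.2]
print(bootstrap_pvalue(group_a, group_b))
```

Because entire texts are resampled, a word that is concentrated in a few texts (poorly dispersed) will naturally produce a wider null distribution and hence a more conservative p-value than a word-level test would.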
pervasive technologies related to assistive environments | 2010
Jefrey Lijffijt; Panagiotis Papapetrou; Jaakko Hollmén; Vassilis Athitsos
We study the performance of three dynamic programming methods on music retrieval. The methods are designed for time series matching but can be directly applied to retrieval of music. Dynamic Time Warping (DTW) identifies an optimal alignment between two time series, and computes the matching cost corresponding to that alignment. Significant speed-ups can be achieved by constrained Dynamic Time Warping (cDTW), which narrows down the set of positions in one time series that can be matched with specific positions in the other time series. Both methods are designed for full sequence matching but can also be applied for subsequence matching, by using a sliding window over each database sequence to compute a matching score for each database subsequence. In addition, SPRING is a dynamic programming approach designed for subsequence matching, where the query is matched with a database subsequence without requiring the match length to be equal to the query length. SPRING has a lower computational cost than DTW and cDTW. Our database consists of a set of MIDI files taken from the web. Each MIDI file has been converted to a 2-dimensional time series, taking into account both note pitches and durations. We have used synthetic queries of fixed size and different noise levels. Surprisingly, when looking for the top-K best matches, all three approaches show similar behavior in terms of retrieval accuracy for small values of K. This suggests that for the specific application area, a computationally cheaper method, such as SPRING, is sufficient to retrieve the best top-K matches.
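The DTW and cDTW variants described above share one dynamic program; the constrained version simply restricts which positions may be matched (a Sakoe-Chiba band). A minimal sketch for 1-D sequences, assuming absolute difference as the local cost:

```python
def dtw(x, y, window=None):
    """Dynamic Time Warping cost between 1-D sequences x and y.
    `window` (a Sakoe-Chiba band half-width) gives the constrained
    variant cDTW; None means unconstrained DTW."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        lo = 1 if window is None else max(1, i - window)
        hi = m if window is None else min(m, i + window)
        for j in range(lo, hi + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of: insertion, deletion, or match
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the warping absorbs the repeat
```

Subsequence matching as described in the abstract would slide such a computation over each database sequence, whereas SPRING avoids the sliding window by letting the alignment start anywhere, which is where its cost advantage comes from.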
knowledge discovery and data mining | 2016
Bo Kang; Jefrey Lijffijt; Raul Santos-Rodriguez; Tijl De Bie
Methods that find insightful low-dimensional projections are essential to effectively explore high-dimensional data. Principal Component Analysis is used pervasively to find low-dimensional projections, not only because it is straightforward to use, but also because it is often effective: the variance in data is often dominated by relevant structure. However, even if the projections highlight real structure in the data, not all structure is interesting to every user. If a user is already aware of, or not interested in, the dominant structure, Principal Component Analysis is less effective for finding interesting components. We introduce a new method called Subjectively Interesting Component Analysis (SICA), designed to find data projections that are subjectively interesting, i.e., projections that truly surprise the end-user. It is rooted in information theory and employs an explicit model of a user's prior expectations about the data. The corresponding optimization problem is a simple eigenvalue problem, and the result is a trade-off between explained variance and novelty. We present five case studies on synthetic data, images, time series, and spatial data, to illustrate how SICA enables users to find (subjectively) interesting projections.
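The eigenvalue formulation referenced in the abstract can be made concrete with ordinary PCA, whose computational core is the same kind of eigenproblem (SICA swaps in a matrix that accounts for the user's prior expectations). A sketch, with hypothetical function names:

```python
import numpy as np

def pca_top_components(X, k=1):
    """Top-k principal components via the eigenvalue problem on the
    sample covariance matrix. SICA solves an analogous eigenproblem,
    but on a matrix adjusted for the user's prior beliefs."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]   # take the k largest
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([5.0, 0.5])  # variance mostly on axis 0
W, lam = pca_top_components(X, k=1)
print(abs(W[0, 0]))  # close to 1: the dominant axis is recovered
```

If the user already *expects* the large-variance axis (say, a known size effect), PCA will keep reporting it anyway; SICA's subjective objective is designed to down-weight such expected structure and surface what remains surprising.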
ieee international conference on data science and advanced analytics | 2015
Jefrey Lijffijt; Eirini Spyropoulou; Bo Kang; Tijl De Bie
Methods for local pattern mining are fragmented along two dimensions: the pattern syntax, and the data types on which they are applicable. Pattern syntaxes include subgroups, n-sets, itemsets, and many more; common data types include binary, categorical, and real-valued. Recent research on relational pattern mining has shown how the aforementioned pattern syntaxes can be unified in a single framework. However, a unified model to deal with various data types is lacking, certainly for more complex structured types such as real numbers, time of day (which is circular), geographical location, terms from a taxonomy, etc. We introduce P-N-RMiner, a generic tool for mining interesting local patterns in (relational) data with structured attributes. We show how to handle the attribute structures in a generic manner, by modelling them as partial orders. We also derive an information-theoretic subjective interestingness measure for such patterns and present an algorithm to efficiently enumerate the patterns. We find that (1) P-N-RMiner finds patterns that are substantially more informative, (2) the new interestingness measure cannot be approximated using existing methods, and (3) we can leverage the partial orders to speed up enumeration.
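The idea of modelling a structured attribute as a partial order can be illustrated with real-valued intervals ordered by containment. This is a generic illustration of the modelling device, not the paper's implementation:

```python
def interval_leq(a, b):
    """Containment order on real intervals: a <= b iff a lies inside b.
    This is a partial order: reflexive, antisymmetric, transitive,
    but not total -- disjoint intervals are simply incomparable."""
    return b[0] <= a[0] and a[1] <= b[1]

print(interval_leq((1, 2), (0, 3)))  # True: (1,2) is contained in (0,3)
print(interval_leq((0, 1), (2, 3)))  # False
print(interval_leq((2, 3), (0, 1)))  # False: an incomparable pair
```

Circular time-of-day ranges or taxonomy terms (ordered by is-a) fit the same mould, which is what lets a single enumeration algorithm treat all of these attribute types uniformly.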
european conference on machine learning | 2013
Jefrey Lijffijt
We consider the problem of mining subsequences with surprising event counts. When mining patterns, we often test a very large number of potentially present patterns, leading to a high likelihood of finding spurious results. Typically, this problem grows as the size of the data increases. Existing methods for statistical testing are not usable for mining patterns in big data, because they are either computationally too demanding, or fail to take into account the dependency structure between patterns, leading to true findings going unnoticed. We propose a new method to compute the significance of event frequencies in subsequences of a long data sequence. The method is based on analyzing the joint distribution of the patterns, obviating the need for randomization. We argue that computing the p-values exactly is computationally costly, but that an upper bound is easy to compute. We investigate the tightness of the upper bound and compare the power of the test with the alternative of post-hoc correction. We demonstrate the utility of the method on two types of data: text and DNA. We show that the proposed method is easy to implement and can be computed quickly. Moreover, we conclude that the upper bound is sufficiently tight and that meaningful results can be obtained in practice.
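The post-hoc correction baseline mentioned above can be sketched for a simple setting: testing every length-w window of a symbol sequence for a surprisingly high event count under an i.i.d. background model, with a Bonferroni correction over windows. Everything here (names, model, toy data) is an illustrative assumption, not the paper's exact method:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance that a window of n symbols
    holds at least k event occurrences under an i.i.d. background model."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def windowed_significance(seq, symbol, w, alpha=0.05):
    """Naive baseline: test every length-w window, Bonferroni-corrected.
    Overlapping windows are highly dependent, which is exactly the
    structure the paper's joint-distribution approach exploits."""
    p = seq.count(symbol) / len(seq)        # background event rate
    n_windows = len(seq) - w + 1
    hits = []
    for start in range(n_windows):
        k = seq[start:start + w].count(symbol)
        pval = binom_tail(w, k, p)
        if pval * n_windows < alpha:        # Bonferroni correction
            hits.append((start, k, pval))
    return hits

# A long quiet sequence with a burst of 'a' in the middle
seq = "b" * 100 + "a" * 20 + "b" * 100
print(len(windowed_significance(seq, "a", 20)) > 0)  # True: burst detected
```

Because overlapping windows share most of their symbols, Bonferroni is very conservative here; accounting for the joint distribution of the window counts, as the paper proposes, yields a tighter (more powerful) test.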
pervasive technologies related to assistive environments | 2010
Jefrey Lijffijt; Panagiotis Papapetrou; Jaakko Hollmén
Monitoring human motion has recently received great attention and can be used in many applications, such as human motion prediction. We present a data set collected from a body sensor network attached to the human body. The set of sensors consists of accelerometers measuring acceleration in three directions that are attached to the upper and lower back as well as the knees and ankles. In addition, pressures on the insoles are measured with four pressure sensors inside each shoe. Two types of motion are considered: walking backwards on a straight line and walking forwards on a figure-8 path. Finally, we study and present basic statistics of the data.
european conference on machine learning | 2016
Kai Puolamäki; Bo Kang; Jefrey Lijffijt; Tijl De Bie
Data visualization and iterative/interactive data mining are attracting rapidly growing attention, both in research and in industry. However, integrated methods and tools that combine advanced visualization and data mining techniques are rare, and those that exist are often specialized to a single problem or domain. In this paper, we introduce a novel generic method for interactive visual exploration of high-dimensional data. In contrast to most visualization tools, it is not based on the traditional dogma of manually zooming and rotating data. Instead, the tool initially presents the user with an 'interesting' projection of the data and then employs data randomization with constraints to allow users to flexibly and intuitively express their interests or beliefs using visual interactions that correspond to exactly defined constraints. These constraints expressed by the user are then taken into account by a projection-finding algorithm to compute a new 'interesting' projection, a process that can be iterated until the user runs out of time or finds that the constraints explain everything she needs to know about the data. We present the tool by means of two case studies, one controlled study on synthetic data and another on real census data. The data and software related to this paper are available at http://www.interesting-patterns.net/forsied/interactive-visual-data-exploration-with-subjective-feedback/.
knowledge discovery and data mining | 2018
Xi Chen; Jefrey Lijffijt; Tijl De Bie
Controversy, disagreement, conflict, polarization and opinion divergence in social networks have been the subject of much recent research. In particular, researchers have addressed the question of how such concepts can be quantified given people's prior opinions, and how they can be optimized by influencing the opinion of a small number of people or by editing the network's connectivity. Here, rather than optimizing such concepts given a specific set of prior opinions, we study whether they can be optimized in the average case and in the worst case over all sets of prior opinions. In particular, we derive the worst-case and average-case conflict risk of networks, and we propose algorithms for optimizing these. For some measures of conflict, these are non-convex optimization problems with many local minima. We provide a theoretical and empirical analysis of the nature of some of these local minima, and show how they are related to existing organizational structures. Empirical results show how a small number of edits quickly decreases a network's conflict risk, both average-case and worst-case. Furthermore, they show that minimizing average-case conflict risk often does not reduce worst-case conflict risk. Minimizing worst-case conflict risk, on the other hand, while computationally more challenging, is generally effective at minimizing both worst-case and average-case conflict risk.
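One standard way to quantify conflict given a specific set of prior opinions is the summed squared opinion difference across edges; the sketch below uses it purely as an illustration of how editing connectivity changes conflict (the paper studies several measures and, crucially, their worst-/average-case behaviour over all prior opinions):

```python
def disagreement(edges, opinions):
    """Summed squared opinion difference across edges: one common
    conflict measure for a fixed vector of opinions. Illustrative only."""
    return sum((opinions[i] - opinions[j]) ** 2 for i, j in edges)

edges = [(0, 1), (1, 2), (2, 3)]   # a path joining two camps
opinions = [1, 1, -1, -1]          # polarized prior opinions
print(disagreement(edges, opinions))             # 4: one cross-camp edge
print(disagreement(edges + [(0, 3)], opinions))  # 8: extra cross-camp edge
```

Under this kind of measure, each edge edit changes conflict for every possible opinion vector at once, which is why the paper can meaningfully optimize the average-case and worst-case risk without fixing the opinions in advance.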
Data Mining and Knowledge Discovery | 2018
Bo Kang; Jefrey Lijffijt; Raul Santos-Rodriguez; Tijl De Bie
The information in high-dimensional datasets is often too complex for human users to perceive directly. Hence, it may be helpful to use dimensionality reduction methods to construct lower-dimensional representations that can be visualized. The natural question that arises is: how do we construct a maximally informative low-dimensional representation? We study this question from an information-theoretic perspective and introduce a new method for linear dimensionality reduction. The obtained model that quantifies the informativeness also allows us to flexibly account for prior knowledge a user may have about the data. This enables us to provide representations that are subjectively interesting. We call the method Subjectively Interesting Component Analysis (SICA) and expect it to be mainly useful for iterative data mining. SICA is based on a model of a user's belief state about the data. This belief state is used to search for surprising views. The initial state is chosen by the user (it may be empty up to the data format) and is updated automatically as the analysis progresses. We study several types of prior beliefs: if a user only knows the scale of the data, SICA yields the same cost function as Principal Component Analysis (PCA), while if a user expects the data to have outliers, we obtain a variant that we term t-PCA. Finally, scientifically more interesting variants are obtained when a user has more complicated beliefs, such as knowledge about similarities between data points. The experiments suggest that SICA enables users to find subjectively more interesting representations.