Daoyuan Li
University of Luxembourg
Publication
Featured research published by Daoyuan Li.
IEEE Transactions on Information Forensics and Security | 2017
Li Li; Daoyuan Li; Tegawendé François D Assise Bissyande; Jacques Klein; Yves Le Traon; David Lo; Lorenzo Cavallaro
The Android packaging model offers ample opportunities for malware writers to piggyback malicious code in popular apps, which can then be easily spread to a large user base. Although recent research has produced approaches and tools to identify piggybacked apps, the literature lacks a comprehensive investigation into this phenomenon. We fill this gap by: 1) systematically building a large set of piggybacked and benign app pairs, which we release to the community; 2) empirically studying the characteristics of malicious piggybacked apps in comparison with their benign counterparts; and 3) providing insights on piggybacking processes. Among several findings that analysis techniques can build upon to improve the overall detection and classification accuracy of piggybacked apps, we show that piggybacking operations not only concern app code but also extensively manipulate app resource files, largely contradicting common beliefs. We also find that piggybacking is done with little sophistication, in many cases automatically, and often via library code.
2015 IEEE International Conference on Software Quality, Reliability and Security | 2015
Li Li; Kevin Allix; Daoyuan Li; Alexandre Bartel; Tegawendé François D Assise Bissyande; Jacques Klein
We discuss the capability of a new feature set for malware detection based on potential component leaks (PCLs). PCLs are defined as sensitive data-flows that involve Android inter-component communications. We show that PCLs are common in Android apps and that malicious applications indeed manipulate significantly more PCLs than benign apps. Then, we evaluate a machine learning-based approach relying on PCLs. Experimental validations show high performance for identifying malware, demonstrating that PCLs can be used for discriminating malicious apps from benign apps.
2017 IEEE/ACM 4th International Conference on Mobile Software Engineering and Systems (MOBILESoft) | 2017
Li Li; Daoyuan Li; Tegawendé François D Assise Bissyande; Jacques Klein; Haipeng Cai; David Lo; Yves Le Traon
To devise efficient approaches and tools for detecting malicious packages in the Android ecosystem, researchers are increasingly required to have a deep understanding of malware. There is thus a need to provide a framework for dissecting malware and locating malicious program fragments within app code in order to build a comprehensive dataset of malicious samples. Towards addressing this need, we propose in this work a tool-based approach called HookRanker, which provides ranked lists of potentially malicious packages based on the way malware behaviour code is triggered. With experiments on a ground truth set of piggybacked apps, we are able to automatically locate the malicious packages from piggybacked Android apps with an accuracy of 83.6% in verifying the top five reported items.
International Conference on Software Engineering | 2017
Li Li; Daoyuan Li; Tegawendé François D Assise Bissyande; Jacques Klein; Yves Le Traon; David Lo; Lorenzo Cavallaro
The Android packaging model offers ample opportunities for attackers to inject malicious code into popular benign apps, producing new malicious apps that can then be easily spread to a large user base. Although the literature has already presented a number of tools to detect piggybacked apps, a comprehensive investigation of the piggybacking process is still lacking. To fill this gap, we collect a large set of benign/piggybacked app pairs that can serve as benchmark apps for further investigation. We manually examine these benchmark pairs to understand the characteristics of piggybacked apps and report 20 interesting findings. We expect these findings to initiate new research directions such as practical and scalable piggybacked app detection, explainable malware detection, and malicious code location.
Machine Learning and Data Mining in Pattern Recognition | 2016
Daoyuan Li; Li Li; Tegawendé François D Assise Bissyande; Jacques Klein; Yves Le Traon
Time series data are abundant in various domains and are often characterized as large in size and high in dimensionality, leading to storage and processing challenges. Symbolic representation of time series – which transforms numeric time series data into texts – is a promising technique to address these challenges. However, these techniques are essentially lossy compression functions, and information is partially lost during transformation. To address this, we propose a novel approach named Domain Series Corpus (DSCo), which builds per-class language models from the symbolized texts. To classify unlabeled samples, we compute the fitness of each symbolized sample against all per-class models and choose the class represented by the model with the best fitness score. Our work takes advantage of mature techniques from both the time series mining and NLP communities. Through extensive experiments on an open dataset archive, we demonstrate that it performs similarly to approaches working with original uncompressed numeric data.
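The classification idea in this abstract (symbolize, build one language model per class, pick the best-fitting model) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the DSCo implementation: the four-letter alphabet with SAX-style Gaussian breakpoints, the character bigram models with add-one smoothing, the log-likelihood fitness score, and the toy "rising" and "zigzag" classes.

```python
import math
from collections import defaultdict

ALPHABET = "abcd"
# Assumed breakpoints: split a standard normal into four equiprobable regions.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def symbolize(series):
    """Z-normalize a numeric series and map each point to a letter."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n) or 1.0
    letters = []
    for x in series:
        z = (x - mean) / std
        letters.append(ALPHABET[sum(z > b for b in BREAKPOINTS)])
    return "".join(letters)

def train_bigram_model(texts):
    """Count character bigrams over all symbolized texts of one class."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in texts:
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    return {a: dict(row) for a, row in counts.items()}

def log_fitness(model, text):
    """Log-likelihood of `text` under a bigram model with add-one smoothing."""
    score = 0.0
    for a, b in zip(text, text[1:]):
        row = model.get(a, {})
        score += math.log((row.get(b, 0) + 1) / (sum(row.values()) + len(ALPHABET)))
    return score

def classify(models, series):
    """Return the class whose language model fits the symbolized series best."""
    text = symbolize(series)
    return max(models, key=lambda label: log_fitness(models[label], text))

# Toy training data (hypothetical): two classes with two samples each.
training = {
    "rising": [[1, 2, 3, 4, 5, 6, 7, 8], [0, 1, 2, 3, 5, 6, 8, 9]],
    "zigzag": [[1, 9, 1, 9, 1, 9, 1, 9], [2, 8, 2, 8, 3, 8, 2, 9]],
}
models = {
    label: train_bigram_model([symbolize(s) for s in samples])
    for label, samples in training.items()
}

print(classify(models, [0, 2, 4, 6, 8, 10, 12, 14]))
print(classify(models, [5, 0, 5, 0, 5, 0, 5, 0]))
```

Because each series is z-normalized before symbolization, the unseen samples match their class by shape rather than by absolute scale.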
Software Engineering and Knowledge Engineering | 2016
Daoyuan Li; Tegawendé François D Assise Bissyande; Jacques Klein; Yves Le Traon
Time series mining has become essential for extracting knowledge from the abundant data that flows out from many application domains. To overcome storage and processing challenges in time series mining, compression techniques are being used. In this paper, we investigate the loss/gain of performance of time series classification approaches when fed with lossy-compressed data. This empirical study is essential for reassuring practitioners, but also for providing more insights on how compression techniques can even be effective in reducing noise in time series data. From a knowledge engineering perspective, we show that time series may be compressed by 90% using discrete wavelet transforms and still achieve remarkable classification accuracy, and that residual details left by popular wavelet compression techniques can sometimes even help achieve higher classification accuracy than the raw time series data, as they better capture essential local features.
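As a rough illustration of wavelet-based compression, the sketch below applies the Haar discrete wavelet transform repeatedly and keeps only the approximation coefficients. It is a simplified stand-in, not the study's pipeline: the choice of the Haar wavelet, the power-of-two input length, and the strategy of discarding all detail coefficients are assumptions. Three levels keep one eighth of the points (an 87.5% reduction); the abstract does not say how the 90% figure was reached.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail).

    Assumes len(signal) is even.
    """
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def compress(signal, levels):
    """Keep only the approximation coefficients after `levels` Haar steps.

    Assumes len(signal) is divisible by 2 ** levels.
    """
    approx = list(signal)
    for _ in range(levels):
        approx, _detail = haar_step(approx)
    return approx

# Hypothetical 16-point series compressed to 2 coefficients.
series = [float(i % 4) for i in range(16)]
compressed = compress(series, levels=3)
print(len(series), "->", len(compressed))
```

The discarded detail coefficients are what the abstract calls residual details; a variant that keeps some of them would trade compression ratio for local features.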
ACM Symposium on Applied Computing | 2016
Li Li; Daoyuan Li; Alexandre Bartel; Tegawendé François D Assise Bissyande; Jacques Klein; Yves Le Traon
Despite much effort in the community, the momentum of Android research has not yet produced complete tools to perform thorough analysis on Android apps, leaving users vulnerable to malicious apps. Because it is hard for a single tool to efficiently address all of the various challenges of Android programming that make analysis difficult, we propose to instrument the app code to reduce the analysis complexity, e.g., transforming a hard problem into an easily resolvable one. To this end, we introduce Apkpler, a plugin-based framework for supporting such instrumentation. We evaluate Apkpler with two plugins, demonstrating the feasibility of our approach and showing that Apkpler can indeed be leveraged to reduce the analysis complexity of Android apps.
Intelligent Data Analysis | 2016
Daoyuan Li; Tegawendé François D Assise Bissyande; Jacques Klein; Yves Le Traon
The abundance of time series data in various domains and their high dimensionality make it challenging to harvest useful information from them. To tackle storage and processing challenges, compression-based techniques have been proposed. Our previous work, Domain Series Corpus (DSCo), compresses time series into symbolic strings and takes advantage of language modeling techniques to extract knowledge about different classes from the training set. However, this approach was flawed in practice due to its excessive memory usage and the need for a priori knowledge about the dataset. In this paper we propose DSCo-NG, which reduces DSCo’s complexity and offers an efficient (linear time complexity and low memory footprint), accurate (performance comparable to approaches working on uncompressed data) and generic (so that it can be applied to various domains) approach for time series classification. Our claims are backed by extensive experimental evaluation against publicly accessible datasets, which also offers insights on when DSCo-NG can be a better choice than others.
Journal of Computer Science and Technology | 2017
Li Li; Daoyuan Li; Tegawendé François D Assise Bissyande; Jacques Klein; Haipeng Cai; David Lo; Yves Le Traon
To devise efficient approaches and tools for detecting malicious packages in the Android ecosystem, researchers are increasingly required to have a deep understanding of malware. There is thus a need to provide a framework for dissecting malware and locating malicious program fragments within app code in order to build a comprehensive dataset of malicious samples. Towards addressing this need, we propose in this work a tool-based approach called HookRanker, which provides ranked lists of potentially malicious packages based on the way malware behaviour code is triggered. With experiments on a ground truth of piggybacked apps, we are able to automatically locate the malicious packages from piggybacked Android apps with an accuracy@5 of 83.6% for such packages that are triggered through method invocations and an accuracy@5 of 82.2% for such packages that are triggered independently.
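The accuracy@5 figures above can be read through a small sketch of the metric. The definition used here is an assumption, since the abstract does not spell it out: an app counts as a hit when its ground-truth malicious package appears among the top k entries of the ranked list. The app identifiers and package names are hypothetical.

```python
def accuracy_at_k(ranked_lists, ground_truth, k=5):
    """Fraction of apps whose true malicious package is in the top-k ranking."""
    hits = sum(
        1
        for app, ranking in ranked_lists.items()
        if ground_truth[app] in ranking[:k]
    )
    return hits / len(ranked_lists)

# Hypothetical ranked package lists for three piggybacked apps.
ranked_lists = {
    "app1": ["com.ads.x", "com.evil.a", "com.lib.y"],
    "app2": ["com.lib.a", "com.lib.b", "com.lib.c", "com.lib.d", "com.lib.e", "com.evil.b"],
    "app3": ["com.evil.c"],
}
ground_truth = {"app1": "com.evil.a", "app2": "com.evil.b", "app3": "com.evil.c"}

print(accuracy_at_k(ranked_lists, ground_truth, k=5))
```

In this toy data the malicious package of `app2` is ranked sixth, so it only counts as a hit once k is raised to 6.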
International Journal of Software Engineering and Knowledge Engineering | 2016
Daoyuan Li; Tegawendé François D Assise Bissyande; Jacques Klein; Yves Le Traon
Time series mining has become essential for extracting knowledge from the abundant data that flows out from many application domains. To overcome storage and processing challenges in time series mining, compression techniques are being used. In this paper, we investigate the loss/gain of performance of time series classification approaches when fed with lossy-compressed data. This extended empirical study is essential for reassuring practitioners, but also for providing more insights on how compression techniques can even be effective in smoothing and reducing noise in time series data. From a knowledge engineering perspective, we show that time series may be compressed by 90% using discrete wavelet transforms and still achieve remarkable classification accuracy, and that residual details left by popular wavelet compression techniques can sometimes even help to achieve higher classification accuracy than the raw time series data, as they better capture essential local features.