Do Le Quoc

Dresden University of Technology

Publication

Featured research published by Do Le Quoc.

international world wide web conferences | 2016

IncApprox: A Data Analytics System for Incremental Approximate Computing

Dhanya R. Krishnan; Do Le Quoc; Pramod Bhatotia; Christof Fetzer; Rodrigo Rodrigues

Incremental and approximate computations are increasingly being adopted for data analytics to achieve low-latency execution and efficient utilization of computing resources. Incremental computation updates the output incrementally instead of re-computing everything from scratch for successive runs of a job with input changes. Approximate computation returns an approximate output for a job instead of the exact output. Both paradigms rely on computing over a subset of data items instead of computing over the entire dataset, but they differ in their means for skipping parts of the computation. Incremental computing relies on the memoization of intermediate results of sub-computations, and reusing these memoized results across jobs. Approximate computing relies on representative sampling of the entire dataset to compute over a subset of data items. In this paper, we observe that these two paradigms are complementary, and can be married together! Our idea is quite simple: design a sampling algorithm that biases the sample selection to the memoized data items from previous runs. To realize this idea, we designed an online stratified sampling algorithm that uses self-adjusting computation to produce an incrementally updated approximate output with bounded error. We implemented our algorithm in a data analytics system called IncApprox based on Apache Spark Streaming. Our evaluation using micro-benchmarks and real-world case-studies shows that IncApprox achieves the benefits of both incremental and approximate computing.
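
The core idea of biasing the sample toward memoized items can be illustrated with a minimal sketch. It is not the IncApprox algorithm itself: the function name `biased_sample`, its parameters, and the return shape are hypothetical, and the statistical correction needed to keep the error bounded is left out.

```python
import random


def biased_sample(items, memoized, sample_size, rng=None):
    """Pick a sample from one stratum, preferring already-memoized items.

    Returns (reused, fresh): items whose results can be reused from an
    earlier run, and items that still have to be computed.
    All names here are illustrative assumptions, not the paper's API.
    """
    rng = rng or random.Random()
    reuse_candidates = [x for x in items if x in memoized]
    fresh_candidates = [x for x in items if x not in memoized]

    # Enough memoized items: the whole sample can be served from the cache.
    if len(reuse_candidates) >= sample_size:
        return rng.sample(reuse_candidates, sample_size), []

    # Otherwise reuse everything memoized and top up with new items.
    needed = min(sample_size - len(reuse_candidates), len(fresh_candidates))
    return reuse_candidates, rng.sample(fresh_candidates, needed)


reused, fresh = biased_sample(range(10), memoized={0, 1, 2}, sample_size=5)
print("reused:", reused, "to compute:", fresh)
```

Only the items in `fresh` trigger new computation; the larger the overlap between successive inputs, the less work each run has to do.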

international conference on cloud computing | 2015

UniCrawl: A Practical Geographically Distributed Web Crawler

Do Le Quoc; Christof Fetzer; Pascal Felber; Etienne Rivière; Valerio Schiavoni; Pierre Sutra

As the wealth of information available on the web keeps growing, being able to harvest massive amounts of data has become a major challenge. Web crawlers are the core components to retrieve such vast collections of publicly available data. The key limiting factor of any crawler architecture is, however, its large infrastructure cost. To reduce this cost, and in particular the high upfront investments, we present in this paper a geo-distributed crawler solution, UniCrawl. UniCrawl orchestrates several geographically distributed sites. Each site operates an independent crawler and relies on well-established techniques for fetching and parsing the content of the web. UniCrawl splits the crawled domain space across the sites and federates their storage and computing resources, while minimizing the inter-site communication cost. To assess our design choices, we evaluate UniCrawl in a controlled environment using the ClueWeb12 dataset, and in the wild when deployed over several remote locations. We conducted several experiments over 3 sites spread across Germany. When compared to a centralized architecture with a crawler simply stretched over several locations, UniCrawl shows a performance improvement of 93.6% in terms of network bandwidth consumption, and a speedup factor of 1.75.
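
One simple way to split a domain space across sites is to hash each host name to a site, so that all URLs of a domain are handled in one place. The abstract does not say how UniCrawl assigns domains; the hashing scheme, the function name `site_for_url`, and the site labels below are assumptions made purely for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def site_for_url(url, sites):
    """Map a URL to one crawling site based on its host name.

    Hash partitioning is an assumed strategy here, not necessarily
    the one UniCrawl uses.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(sites)
    return sites[index]


sites = ["site-a", "site-b", "site-c"]  # hypothetical site labels
for url in ("http://example.org/a", "http://example.org/b", "http://example.com/"):
    print(url, "->", site_for_url(url, sites))
```

Because the assignment depends only on the host, links within a domain never have to cross sites, which is one way to keep inter-site traffic low.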

conference on the future of the internet | 2014

DoLen: User-Side Multi-cloud Application Monitoring

Do Le Quoc; Lenar Yazdanov; Christof Fetzer

Cloud computing is a popular platform offering computation, storage and communication resources as a service. Monitoring the performance and behavior of multi-cloud applications, where applications are deployed on multiple cloud providers, is a challenge for the cloud computing research community. In this paper, we propose a framework that allows near real-time monitoring of the resource utilization, including CPU, memory, disk activity and network traffic, of applications deployed in multiple clouds. We conducted various experiments using a cluster to simulate three clouds, showing the framework's ability to analyze the resource consumption of Hadoop applications deployed in clouds in near real-time. We also performed VM-to-VM attacks to evaluate the anomaly detection capability of the proposed framework.

ieee/acm international conference utility and cloud computing | 2013

Scalable and Real-Time Deep Packet Inspection

Do Le Quoc; André Martin; Christof Fetzer

Internet traffic has continued to grow at a spectacular rate over the past ten years. Understanding and managing network traffic has become an important issue for network operators to meet service-level agreements with their customers. In addition, the emergence of high-speed networks, such as 20 Gbps and 40 Gbps Ethernet and beyond, requires fast analysis of a large volume of network traffic, which is beyond the capabilities of a single machine. Distributed parallel processing schemes have recently been developed to analyze high quantities of traffic data. However, scalable Internet traffic analysis in real-time is difficult because a large dataset requires high processing intensity. In this paper, we describe a real-time Deep Packet Inspection (DPI) system based on the MapReduce programming model. We combine a stand-alone classification engine (L7-filter) with the distributed programming MapReduce model. Our experimental results show that the MapReduce programming paradigm is a useful approach for building highly scalable real-time network traffic processing systems. We generate 20 Gbps network traffic to validate the real-time analysis ability of the proposed system.

Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference | 2017

StreamApprox: approximate computing for stream analytics

Do Le Quoc; Ruichuan Chen; Pramod Bhatotia; Christof Fetzer; Volker Hilt; Thorsten Strufe

Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing, based on the chosen sample size, can make a systematic trade-off between the output accuracy and computation efficiency. Unfortunately, the state-of-the-art systems for approximate computing primarily target batch analytics, where the input data remains unchanged during the course of computation. Thus, they are not well-suited for stream analytics. This motivated the design of StreamApprox, a stream analytics system for approximate computing. To realize this idea, we designed an online stratified reservoir sampling algorithm to produce approximate output with rigorous error bounds. Importantly, our proposed algorithm is generic and can be applied to two prominent types of stream processing systems: (1) batched stream processing such as Apache Spark Streaming, and (2) pipelined stream processing such as Apache Flink. To showcase the effectiveness of our algorithm, we implemented StreamApprox as a fully functional prototype based on Apache Spark Streaming and Apache Flink. We evaluated StreamApprox using a set of microbenchmarks and real-world case studies. Our results show that Spark- and Flink-based StreamApprox systems achieve a speedup of 1.15× to 3× compared to the respective native Spark Streaming and Flink executions, with the sampling fraction varying from 80% to 10%. Furthermore, we have also implemented an improved baseline in addition to the native execution baseline: a Spark-based approximate computing system leveraging the existing sampling modules in Apache Spark. Compared to the improved baseline, our results show that StreamApprox achieves a speedup of 1.1× to 2.4× while maintaining the same accuracy level.
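
As a rough illustration of stratified reservoir sampling, the sketch below keeps one fixed-size reservoir per sub-stream and scales each reservoir by the number of items seen in its stratum to estimate a sum. It is a textbook-style simplification under assumed names (`StratifiedReservoir`, `offer`, `estimate_sum`); the adaptive sizing and the error-bound computation described in the abstract are not shown.

```python
import random
from collections import defaultdict


class StratifiedReservoir:
    """One fixed-size reservoir per stratum (sub-stream); illustrative only."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.rng = random.Random(seed)
        self.samples = defaultdict(list)  # stratum -> sampled values
        self.seen = defaultdict(int)      # stratum -> items observed so far

    def offer(self, stratum, value):
        """Process one stream item belonging to the given stratum."""
        self.seen[stratum] += 1
        reservoir = self.samples[stratum]
        if len(reservoir) < self.capacity:
            reservoir.append(value)
            return
        # Classic reservoir step: the new item replaces a random slot
        # with probability capacity / seen.
        slot = self.rng.randrange(self.seen[stratum])
        if slot < self.capacity:
            reservoir[slot] = value

    def estimate_sum(self):
        """Estimate the sum over all items, weighting each stratum's sample."""
        total = 0.0
        for stratum, reservoir in self.samples.items():
            weight = self.seen[stratum] / len(reservoir)
            total += weight * sum(reservoir)
        return total


sampler = StratifiedReservoir(capacity=100, seed=42)
for i in range(10_000):
    sampler.offer("sensor-a", 1.0)       # frequent sub-stream
    if i % 100 == 0:
        sampler.offer("sensor-b", 50.0)  # rare sub-stream
print(sampler.estimate_sum())
```

Sampling each stratum separately keeps rare sub-streams represented, which a single reservoir over the whole stream would tend to miss.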

international conference on cloud computing | 2015

Scalable Network Traffic Classification Using Distributed Support Vector Machines

Do Le Quoc; Valerio D'Alessandro; Byungchul Park; Luigi Romano; Christof Fetzer

Internet traffic has increased dramatically in recent years due to the popularization of the Internet and the appearance of wireless Internet mobile devices such as smartphones and tablets. The explosive growth of Internet traffic has introduced a practical example that demonstrates the concept of Big Data. Accurate identification and classification of large network traffic data plays an important role in network management, including capacity planning, network forensics, QoS and intrusion detection. However, the state-of-the-art solutions, which rely on a dedicated server, are not scalable for analyzing high-volume network traffic data. In this paper, we implement a distributed Support Vector Machines (SVMs) framework for classifying network traffic using Hadoop, an open-source distributed computing framework for Big Data processing. We design a global parameter store that maintains the global shared parameters between SVM training nodes. The distributed SVMs have been deployed on a 20-node cluster to analyze a real network traffic trace. The results demonstrate that with 19 Mapper nodes the system is around 30% faster than the Cloud SVM solution and outperforms the standalone SVM, running nearly 9 times faster in the training process and 15 times faster in the classification process. In addition, the distributed SVMs architecture is designed to analyze large-scale datasets. Therefore, it can be used not only for processing network traffic datasets, but also for other large-scale datasets such as Web data.

symposium on cloud computing | 2018

ApproxJoin: Approximate Distributed Joins

Do Le Quoc; Istemi Ekin Akkus; Pramod Bhatotia; Spyros Blanas; Ruichuan Chen; Christof Fetzer; Thorsten Strufe

A distributed join is a fundamental operation for processing massive datasets in parallel. Unfortunately, computing an equi-join over such datasets is very resource-intensive, even when done in parallel. Given this cost, the equi-join operator becomes a natural candidate for optimization using approximation techniques, which allow users to trade accuracy for latency. Finding the right approximation technique for joins, however, is a challenging task. Sampling, in particular, cannot be directly used in joins; naïvely performing a join over a sample of the dataset will not preserve statistical properties of the query result. To address this problem, we introduce ApproxJoin. We interweave Bloom filter sketching and stratified sampling with the join computation in a new operator that preserves statistical properties of an aggregation over the join output. ApproxJoin leverages Bloom filters to avoid shuffling non-joinable data items around the network, and then applies stratified sampling to obtain a representative sample of the join output. We implemented ApproxJoin in Apache Spark, and evaluated it using microbenchmarks and real-world workloads. Our evaluation shows that ApproxJoin scales well and significantly reduces data movement, without sacrificing tight error bounds on the accuracy of the final results. ApproxJoin achieves a speedup of up to 9x over unmodified Spark-based joins with the same sampling ratio. Furthermore, the speedup is accompanied by a significant reduction in the shuffled data volume, which is up to 82x less than unmodified Spark-based joins.
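
The filtering stage can be sketched with a small hand-rolled Bloom filter: build one filter per input, intersect them, and drop every record whose key fails the combined filter before any shuffle would happen. This is a single-process illustration under assumed names (`BloomFilter`, `filter_joinable`) and assumed parameters; combining the per-input filters with a bitwise AND is also an assumption of this sketch, and the stratified sampling stage is omitted.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def intersect(self, other):
        """Bitwise AND; keys present in both inputs always pass the result."""
        result = BloomFilter(self.num_bits, self.num_hashes)
        result.bits = self.bits & other.bits
        return result


def filter_joinable(left, right):
    """Keep only (key, value) records whose key may occur in both inputs."""
    left_filter, right_filter = BloomFilter(), BloomFilter()
    for key, _ in left:
        left_filter.add(key)
    for key, _ in right:
        right_filter.add(key)
    join_filter = left_filter.intersect(right_filter)
    keep_left = [(k, v) for k, v in left if join_filter.might_contain(k)]
    keep_right = [(k, v) for k, v in right if join_filter.might_contain(k)]
    return keep_left, keep_right


orders = [(1, "book"), (2, "pen"), (3, "lamp")]
customers = [(2, "Ana"), (3, "Bo"), (4, "Cy")]
print(filter_joinable(orders, customers))
```

Bloom filters can report false positives but never false negatives, so a few non-joinable records may slip through while no joinable record is lost.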

ieee/acm international conference utility and cloud computing | 2013

Tutorial: Elastic and Fault Tolerant Event Stream Processing using StreamMine3G

André Martin; Do Le Quoc

The massive amount of new data being generated each day by data sources such as smartphones and sensor devices calls for new techniques to process such continuous streams of data. Event Stream Processing (ESP) addresses this problem and enables users to process such data streams in (soft) real time, allowing the detection of, as well as a quick reaction to, relevant situations. In this tutorial, we will introduce the participants to ESP techniques as well as ESP systems such as Storm, Apache S4 and StreamMine3G. We will cover aspects such as programming models, fault tolerance as well as elasticity and cloud support of these platforms.

usenix annual technical conference | 2017