Dionysios Logothetis
University of California, San Diego
Publications
Featured research published by Dionysios Logothetis.
symposium on cloud computing | 2010
Dionysios Logothetis; Christopher Olston; Benjamin Reed; Kevin C. Webb; Ken Yocum
This work addresses the need for stateful dataflow programs that can rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows, such as weekly web crawls, daily image/video uploads, log files, and growing social networks. While programmers may simply re-run the entire dataflow when new data arrives, this is grossly inefficient, increasing result latency and squandering hardware resources and energy. Alternatively, programmers may use prior results to incrementally incorporate the changes. However, current large-scale data processing tools, such as Map-Reduce or Dryad, limit how programmers incorporate and use state in data-parallel programs. Straightforward approaches to incorporating state can result in custom, fragile code and disappointing performance. This work presents a generalized architecture for continuous bulk processing (CBP) that raises the level of abstraction for building incremental applications. At its core is a flexible, groupwise processing operator that takes state as an explicit input. Unifying stateful programming with a data-parallel operator affords several fundamental opportunities for minimizing the movement of data in the underlying processing system. As case studies, we show how one can use a small set of flexible dataflow primitives to perform web analytics and mine large-scale, evolving graphs in an incremental fashion. Experiments with our prototype using real-world data indicate significant data movement and running time reductions relative to current practice. For example, incrementally computing PageRank using CBP can reduce data movement by 46% and cut running time in half.
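The core idea above is a groupwise operator that takes prior state as an explicit input, so only the groups touched by new data are recomputed. A minimal sketch of that idea follows; the names (`run_increment`, `translate`) are illustrative, not the CBP API.

```python
from collections import defaultdict

def run_increment(state, new_records, translate):
    """Apply the groupwise translate function only to groups that
    received new records; untouched state is carried over as-is."""
    grouped = defaultdict(list)
    for key, value in new_records:
        grouped[key].append(value)
    for key, values in grouped.items():
        state[key] = translate(key, state.get(key), values)
    return state

# Example: incremental counting; state holds the running count per key.
def count_translate(key, prior, values):
    return (prior or 0) + len(values)

state = {}
state = run_increment(state, [("a", 1), ("b", 2), ("a", 3)], count_translate)
state = run_increment(state, [("a", 4)], count_translate)
# state == {"a": 3, "b": 1}; the second increment never revisits "b".
```

The second call processes only key "a", which is the data-movement saving the abstract describes: state for unaffected groups never leaves the store.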
very large data bases | 2008
Dionysios Logothetis; Ken Yocum
Ad-hoc data processing has proven to be a critical paradigm for Internet companies processing large volumes of unstructured data. However, the emergence of cloud-based computing, where storage and CPU are outsourced to multiple third-parties across the globe, implies large collections of highly distributed and continuously evolving data. Our demonstration combines the power and simplicity of the MapReduce abstraction with a wide-scale distributed stream processor, Mortar. While our incremental MapReduce operators avoid data re-processing, the stream processor manages the placement and physical data flow of the operators across the wide area. We demonstrate a distributed web indexing engine against which users can submit and deploy continuous MapReduce jobs. A visualization component illustrates both the incremental indexing and index searches in real time.
cluster computing and the grid | 2005
Yang-Suk Kee; Dionysios Logothetis; Richard Y. Huang; Henri Casanova; Andrew A. Chien
Simple resource specification, resource selection, and effective binding are critical capabilities for grid middleware. We describe the virtual grid, an abstraction for providing these capabilities in complex resource environments. Elements of the virtual grid include a novel resource description language (vgDL) and a resource selection and binding component (vgFAB), which accepts a vgDL specification and returns a virtual grid, that is, a set of selected and bound resources. The goals of vgFAB are efficiency, scalability, robustness to high resource contention, and the ability to produce results with quantifiable high quality. We present the design of vgDL, showing how it captures application-level resource abstractions using resource aggregates and connectivity amongst them. We present and evaluate a prototype implementation of vgFAB. Our results show that resource selection and binding for virtual grids of tens of thousands of resources can scale up to grids with millions of resources, identifying good matches in less than one second. Further, these matches have quantifiable quality, enabling applications to have high confidence in the results. We demonstrate the effectiveness of our combined selection and binding approach in the presence of resource contention, showing that robust selection and binding can be achieved at moderate cost.
symposium on cloud computing | 2013
Luis M. Vaquero; Félix Cuadrado; Dionysios Logothetis; Claudio Martella
In recent years, large-scale graph processing has gained increasing attention, with most recent systems placing particular emphasis on latency. One possible technique to improve runtime performance in a distributed graph processing system is to reduce network communication. The most notable way to achieve this goal is to partition the graph by minimizing the number of edges that connect vertices assigned to different machines, while keeping the load balanced. However, real-world graphs are highly dynamic, with vertices and edges being constantly added and removed. Carefully updating the partitioning of the graph to reflect these changes is necessary to avoid the introduction of a large number of cut edges, which would gradually worsen computation performance. In this paper we show that performance degradation in dynamic graph processing systems can be avoided by continuously adapting the graph partitions as the graph changes. We present a novel, highly scalable adaptive partitioning strategy, and show a number of refinements that make it work under the constraints of a large-scale distributed system. The partitioning strategy is based on iterative vertex migrations, relying only on local information. We have implemented the technique in a graph processing system, and we show through three real-world scenarios how adapting graph partitioning reduces execution time by over 50% when compared to commonly used hash-partitioning.
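The iterative, local-information migration can be illustrated with a toy greedy pass: each vertex moves to the partition holding most of its neighbors, subject to a simple capacity cap. This is a sketch only; the paper's strategy adds refinements (migration quotas, convergence control) not modeled here, and `migrate_step` and `cut_edges` are invented names.

```python
from collections import Counter

def migrate_step(adj, part, capacity):
    """One pass: each vertex moves to the partition where most of its
    neighbors live, if that partition has spare capacity. Returns the
    number of vertices moved (0 means the assignment is stable)."""
    moved = 0
    for v, neighbors in adj.items():
        target = Counter(part[n] for n in neighbors).most_common(1)[0][0]
        if target != part[v] and Counter(part.values())[target] < capacity:
            part[v] = target
            moved += 1
    return moved

def cut_edges(adj, part):
    """Count undirected edges whose endpoints sit on different partitions."""
    return sum(1 for v, ns in adj.items() for n in ns if part[v] != part[n]) // 2

# Two triangles joined by a single bridge edge, initially split badly.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
part = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}
while migrate_step(adj, part, capacity=4):
    pass
# Converges to one triangle per partition, leaving only the bridge cut.
```

The capacity check is what keeps the load balanced while cut edges shrink; without it, every vertex would collapse into a single partition.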
symposium on cloud computing | 2013
Dionysios Logothetis; Soumyarupa De; Ken Yocum
A fundamental challenge for big-data analytics is how to efficiently tune and debug multi-step dataflows. This paper presents Newt, a scalable architecture for capturing and using record-level data lineage to discover and resolve errors in analytics. Newt's flexible instrumentation allows system developers to collect this fine-grain lineage from a range of data-intensive scalable computing (DISC) architectures, actively recording the flow of data through multi-step, user-defined transformations. Newt pairs this API with a scale-out, fault-tolerant lineage store and query engine. We find that while active collection can be expensive, it incurs modest runtime overheads for real-world analytics (<36%) and enables novel lineage-based debugging techniques. For instance, Newt can efficiently recreate errors (crashes or bad outputs) or remove input data from the dataflow to enable data cleaning strategies. Additionally, Newt's active lineage collection allows retrospective analyses of a dataflow's behavior, such as identifying anomalous processing steps. As case studies, we instrument two DISC systems, Hadoop and Hyracks, with less than 105 lines of additional code for each. Finally, we use Newt to systematically clean input data to a Hadoop-based de novo genome assembler, improving the quality of the output assembly.
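Record-level lineage capture of the kind described here can be sketched as a wrapper that logs which input record ids produced which output ids, plus a backward query that walks those links to the raw inputs. The store and API below (`traced`, `backward`) are invented for illustration and are far simpler than Newt's scale-out lineage store.

```python
lineage = []  # (step, input_id, output_id) triples recorded at runtime

def traced(step, fn):
    """Wrap a per-record transformation so every input->output pair is
    logged to the lineage store as the dataflow runs."""
    def wrapper(records):
        outputs = []
        for in_id, value in records:
            for out_value in fn(value):
                out_id = (step, len(outputs))  # unique id within this step
                lineage.append((step, in_id, out_id))
                outputs.append((out_id, out_value))
        return outputs
    return wrapper

def backward(out_id):
    """Return the set of original input ids that contributed to out_id,
    by transitively following lineage links backward."""
    sources, changed = {out_id}, True
    while changed:
        changed = False
        for _, i, o in lineage:
            if o in sources and i not in sources:
                sources.add(i)
                changed = True
    return {s for s in sources if not isinstance(s, tuple)}

# A two-step dataflow: tokenize, then uppercase each token.
split = traced("split", lambda line: line.split())
upper = traced("upper", lambda word: [word.upper()])
words = split([("r1", "big data"), ("r2", "lineage")])
final = upper(words)
# backward(final[0][0]) == {"r1"}: the first output traces back to record r1.
```

A backward query like this is what supports the debugging uses in the abstract: recreating a crash from exactly the inputs that fed a bad output, or deleting those inputs to clean the data.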
cloud data management | 2012
Zhuhua Cai; Dionysios Logothetis; Georgos Siganos
Real-time data processing is increasingly gaining momentum as the preferred method for analytical applications. Many of these applications are built on top of large graphs with hundreds of millions of vertices and edges. A fundamental requirement for real-time processing is the ability to do incremental processing. However, graph algorithms are inherently difficult to compute incrementally due to data dependencies. At the same time, devising incremental graph algorithms is a challenging programming task. This paper introduces GraphInc, a system that builds on top of the Pregel model and provides efficient incremental processing of graphs. Importantly, GraphInc supports incremental computations automatically, hiding the complexity from the programmers. Programmers write graph analytics in the Pregel model without worrying about the continuous nature of the data. GraphInc integrates new data in real-time in a transparent manner, by automatically identifying opportunities for incremental processing. We discuss the basic mechanisms of GraphInc and report on the initial evaluation of our approach.
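One way to make incremental computation automatic, in the spirit described above, is memoization at the vertex level: re-run a vertex's compute function only when its incoming messages differ from the last superstep. The toy below illustrates that idea only; it is not GraphInc's actual mechanism, and `run` and `memo` are invented names.

```python
def run(adj, values, compute, memo):
    """One incremental superstep: only vertices whose incoming messages
    changed since the last run execute compute(). Returns the number of
    vertices that were actually recomputed."""
    active = []
    for v in adj:
        msgs = tuple(sorted(values[u] for u in adj[v]))
        if memo.get(v) != msgs:        # inputs changed -> must recompute
            memo[v] = msgs
            active.append(v)
    for v in active:
        values[v] = compute(values[v], memo[v])
    return len(active)

# Min-label propagation (connected components), a classic Pregel program.
compute = lambda val, msgs: min((val,) + msgs)

adj = {0: [1], 1: [0, 2], 2: [1]}
values = {0: 0, 1: 1, 2: 2}
memo = {}
while run(adj, values, compute, memo):
    pass
# values converges to {0: 0, 1: 0, 2: 0}; later supersteps touch fewer
# vertices as messages stabilize, which is the incremental saving.
```

The programmer writes only `compute`, exactly as in plain Pregel; the change-detection against `memo` is what the system layer supplies transparently.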
Computers & Security | 2016
Yazan Boshmaf; Dionysios Logothetis; Georgos Siganos; Jorge Lería; Jose Lorenzo; Matei Ripeanu; Konstantin Beznosov; Hassan Halawa
Detecting fake accounts in online social networks (OSNs) protects both OSN operators and their users from various malicious activities. Most detection mechanisms attempt to classify user accounts as real (i.e., benign, honest) or fake (i.e., malicious, Sybil) by analyzing either user-level activities or graph-level structures. These mechanisms, however, are not robust against adversarial attacks in which fake accounts cloak their operation with patterns resembling real user behavior. In this article, we show that victims – real accounts whose users have accepted friend requests sent by fakes – form a distinct classification category that is useful for designing robust detection mechanisms. In particular, we present Integro – a robust and scalable defense system that leverages victim classification to rank most real accounts higher than fakes, so that OSN operators can take actions against low-ranking fake accounts. Integro starts by identifying potential victims from user-level activities using supervised machine learning. After that, it annotates the graph by assigning lower weights to edges incident to potential victims. Finally, Integro ranks user accounts based on the landing probability of a short random walk that starts from a known real account. As this walk is unlikely to traverse low-weight edges in a few steps and land on fakes, Integro achieves the desired ranking. We implemented Integro using widely-used, open-source distributed computing platforms, where it scaled nearly linearly. We evaluated Integro against SybilRank, which is the state-of-the-art in fake account detection, using real-world datasets and a large-scale deployment at Tuenti – the largest OSN in Spain with more than 15 million active users. We show that Integro significantly outperforms SybilRank in user ranking quality, with the only requirement that the employed victim classifier is better than random. Moreover, the deployment of Integro at Tuenti resulted in up to an order of magnitude higher precision in fake account detection, as compared to SybilRank.
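The ranking step described above can be sketched as power iteration of a k-step random walk on the weighted graph: probability flows along edges in proportion to their weights, so the down-weighted edges toward likely victims pass little probability to fakes. The graph, weights, and function name below are made up for illustration.

```python
def walk_probabilities(edges, weights, seed, steps):
    """Landing probability of a `steps`-step random walk from `seed`.
    edges: node -> list of neighbors; weights[(u, v)]: edge weight."""
    prob = {v: 0.0 for v in edges}
    prob[seed] = 1.0
    for _ in range(steps):
        nxt = {v: 0.0 for v in edges}
        for u in edges:
            total = sum(weights[(u, v)] for v in edges[u])
            for v in edges[u]:
                # Probability mass splits proportionally to edge weight.
                nxt[v] += prob[u] * weights[(u, v)] / total
        prob = nxt
    return prob

# Three real accounts A, B, C; a fake F attached to victim C through a
# low-weight edge (0.1), as the victim classifier would annotate it.
edges = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "F"], "F": ["C"]}
weights = {("A", "B"): 1.0, ("B", "A"): 1.0, ("A", "C"): 1.0, ("C", "A"): 1.0,
           ("B", "C"): 1.0, ("C", "B"): 1.0, ("C", "F"): 0.1, ("F", "C"): 0.1}
prob = walk_probabilities(edges, weights, seed="A", steps=3)
# prob["F"] ends up far below the real accounts, so F ranks last.
```

Because the walk is short and the victim-incident edge carries only a small fraction of C's outgoing probability, almost no mass reaches F, which is exactly the ranking property the system relies on.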
very large data bases | 2008
Emiran Curtmola; Alin Deutsch; Dionysios Logothetis; K. K. Ramakrishnan; Divesh Srivastava; Ken Yocum
We describe XTreeNet, a distributed query dissemination engine which facilitates democratization of publishing and efficient data search among members of online communities with powerful full-text queries. This demonstration shows XTreeNet in full action. XTreeNet serves as a proof of concept for democratic community search by proposing a novel distributed infrastructure in which data resides only with the publishers owning it. Expressive user queries are disseminated to publishers. Given the virtual nature of the global data collection (e.g., the union of all local data published in the community), our infrastructure efficiently locates the publishers that contain documents matching a specified query, processes the complex full-text query at the publisher, and returns all relevant documents to the querier.
Archive | 2015
Claudio Martella; Roman Shaposhnik; Dionysios Logothetis
Previous chapters introduced the generic Giraph programming model and talked about a few common use cases that lend themselves easily to being modeled as graph-processing applications. This chapter covers practical aspects of developing Giraph applications and focuses on running Giraph on top of Hadoop.
network and distributed system security symposium | 2015
Yazan Boshmaf; Dionysios Logothetis; Georgos Siganos; Jorge Lería; Jose Lorenzo; Matei Ripeanu; Konstantin Beznosov