Alessandro Morari

Pacific Northwest National Laboratory

Publication

Featured research published by Alessandro Morari.

international parallel and distributed processing symposium | 2011

A Quantitative Analysis of OS Noise

Alessandro Morari; Roberto Gioiosa; Robert W. Wisniewski; Francisco J. Cazorla; Mateo Valero

Operating system noise is a well-known problem that may limit application scalability on large-scale machines, significantly reducing their performance. Though the problem is well studied, much of the previous work has been qualitative. We have developed a technique to provide a quantitative descriptive analysis for each OS event that contributes to OS noise. The mechanism allows us to detail all sources of OS noise through precise kernel instrumentation and provides frequency and duration analysis for each event. Such a description gives OS developers better guidance for reducing OS noise. We integrated this data with a trace visualizer, allowing quicker and more intuitive understanding of the data. Specifically, the contributions of this paper are three-fold. First, we describe a methodology whereby detailed quantitative information may be obtained for each OS noise event. Though not the thrust of the paper, we show how we implemented that methodology by augmenting LTTng. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise. Second, we provide a case study in which we use our methodology to analyze the OS noise when running benchmarks from the LLNL Sequoia applications. Our experiments enrich and expand previous results with our quantitative characterization. Third, we describe how a detailed characterization makes it possible to disambiguate noise signatures of qualitatively similar events, allowing developers to address the true cause of each noise event.

international parallel and distributed processing symposium | 2014

Scaling Irregular Applications through Data Aggregation and Software Multithreading

Alessandro Morari; Antonino Tumeo; Daniel G. Chavarría-Miranda; Oreste Villa; Mateo Valero

Emerging applications in areas such as bioinformatics, data analytics, semantic databases and knowledge discovery employ datasets from tens to hundreds of terabytes. Currently, only distributed memory clusters have enough aggregate space to enable in-memory processing of datasets of this size. However, in addition to large sizes, the data structures used by these new application classes are usually characterized by unpredictable and fine-grained accesses: i.e., they exhibit irregular behavior. Traditional commodity clusters, instead, exploit cache-based processors and high-bandwidth networks optimized for locality, regular computation and bulk communication. For these reasons, irregular applications are inefficient on these systems, and require custom, hand-coded optimizations to provide scaling in both performance and size. Two techniques are key to speeding up large-scale irregular applications on commodity clusters: lightweight software multithreading, which tolerates data access latencies by overlapping network communication with computation, and aggregation, which reduces overheads and increases bandwidth utilization by coalescing fine-grained network messages. In this paper we describe GMT (Global Memory and Threading), a runtime system library that couples software multithreading and message aggregation together with a Partitioned Global Address Space (PGAS) data model to enable higher performance and scaling of irregular applications on multi-node systems. We present the architecture of the runtime, explaining how it is designed around these two critical techniques. We show that irregular applications written using our runtime can outperform, even by orders of magnitude, the corresponding applications written using other programming models that do not exploit these techniques.
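
The aggregation idea mentioned in this abstract can be illustrated with a minimal sketch. It is not GMT's implementation or API: the `Aggregator` class, its `send` callback and the `max_batch` threshold are hypothetical names chosen for illustration, and a real runtime would presumably also flush on timeouts and operate on network buffers rather than Python lists.

```python
from collections import defaultdict


class Aggregator:
    """Coalesce small per-destination messages into batches (illustrative only)."""

    def __init__(self, send, max_batch=4):
        self.send = send                  # hypothetical callback: send(dest, batch)
        self.max_batch = max_batch        # flush threshold, counted in messages
        self.buffers = defaultdict(list)  # one pending buffer per destination

    def put(self, dest, msg):
        # Buffer the fine-grained message instead of sending it immediately.
        buf = self.buffers[dest]
        buf.append(msg)
        if len(buf) >= self.max_batch:
            self.flush(dest)

    def flush(self, dest):
        # Send everything pending for one destination as a single batch.
        buf = self.buffers.pop(dest, [])
        if buf:
            self.send(dest, buf)

    def flush_all(self):
        # Drain every remaining partial batch.
        for dest in list(self.buffers):
            self.flush(dest)


def demo():
    sent = []
    agg = Aggregator(lambda dest, batch: sent.append((dest, batch)), max_batch=3)
    for i in range(7):
        agg.put(i % 2, i)  # alternate between two destinations
    agg.flush_all()
    return sent
```

In this toy run, seven fine-grained messages reach the `send` callback as three batches, which is the overhead reduction that aggregation aims for.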

international parallel and distributed processing symposium | 2012

Evaluating the Impact of TLB Misses on Future HPC Systems

Alessandro Morari; Roberto Gioiosa; Robert W. Wisniewski; Bryan S. Rosenburg; Todd A. Inglett; Mateo Valero

TLB misses have been considered an important source of system overhead and one of the causes that limit scalability on large supercomputers. This assumption led to HPC lightweight kernel designs that usually statically map page table entries to TLB entries and do not take TLB misses. While this approach worked for petascale clusters, programming and debugging exascale applications composed of billions of threads is not a trivial task, and users have started to explore novel programming models and tools, which require richer system software support. In this study we present a quantitative analysis of the effect of TLB misses on current and future parallel applications at scale. To provide a fair evaluation, we compare a noiseless OS (CNK) with a custom version of the same OS capable of handling TLB misses on a BG/P system (up to 4096 cores). Our methodology follows a two-step approach: we first analyze the effects of TLB misses with a low-overhead, range-checking TLB miss handler, and then simulate a more complex TLB management system through TLB noise injection. We analyze the system behavior with different page sizes and an increasing number of nodes and perform a sensitivity analysis. Our results show that the overhead introduced by TLB misses on complex HPC applications from the LLNL and ANL benchmarks is below 2% if the TLB pressure is contained and/or the TLB miss handler overhead is low, even with 1 MB pages and under large TLB noise injection. These results open the possibility of implementing richer OS memory management services to satisfy the requirements of future applications and users.

IEEE Micro | 2014

Scaling Semantic Graph Databases in Size and Performance

Alessandro Morari; Vito Giovanni Castellana; Oreste Villa; Antonino Tumeo; Jesse Weaver; David J. Haglin; Sutanay Choudhury; John Feo

GEMS is a full software system that implements a large-scale, semantic graph database on commodity clusters. Its framework comprises a SPARQL-to-C++ compiler, a library of distributed data structures, and a custom multithreaded runtime library. The authors evaluated their software stack on the Berlin SPARQL benchmark with datasets of up to 10 billion graph edges, demonstrating scaling in dataset size and performance as they added cluster nodes.

IEEE Computer | 2015

In-Memory Graph Databases for Web-Scale Data

Vito Giovanni Castellana; Alessandro Morari; Jesse Weaver; Antonino Tumeo; David J. Haglin; Oreste Villa; John Feo

A software stack relies primarily on graph-based methods to implement scalable resource description framework databases on top of commodity clusters, providing an inexpensive way to extract meaning from volumes of heterogeneous data.

symposium on computer architecture and high performance computing | 2012

Efficient Sorting on the Tilera Manycore Architecture

Alessandro Morari; Antonino Tumeo; Oreste Villa; Simone Secchi; Mateo Valero

We present an efficient implementation of the radix sort algorithm for the Tilera TILEPro64 processor. The TILEPro64 is one of the first successful commercial manycore processors. It is composed of 64 tiles interconnected through multiple fast networks-on-chip and features a fully coherent, shared distributed cache. The architecture has a large degree of flexibility, and allows various optimization strategies. We describe how we mapped the algorithm to this architecture. We present an in-depth analysis of the optimizations for each phase of the algorithm with respect to the processor's sustained performance. We discuss the overall throughput reached by our radix sort implementation (up to 132 MK/s) and show that it provides comparable or better performance-per-watt with respect to state-of-the-art implementations on x86 processors and graphics processing units.
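
For readers unfamiliar with the algorithm, the sketch below shows a textbook sequential least-significant-digit radix sort with its usual phases (histogram, prefix sum, scatter). It is not the authors' TILEPro64 implementation, whose parallel mapping and optimizations the abstract does not detail. The function name and the `bits_per_pass` and `key_bits` parameters are assumptions for illustration, and keys are assumed to be non-negative integers below `2**key_bits`.

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """Textbook LSD radix sort for non-negative integers below 2**key_bits."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    keys = list(keys)
    for shift in range(0, key_bits, bits_per_pass):
        # Phase 1: histogram of the current digit.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Phase 2: exclusive prefix sum gives each digit's first output slot.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # Phase 3: stable scatter of the keys into their slots.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

Each pass is stable, so ordering by lower digits is preserved while higher digits are processed. The per-phase structure is what a parallel version would typically distribute across cores.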

international conference on big data | 2013

Accelerating semantic graph databases on commodity clusters

Alessandro Morari; Vito Giovanni Castellana; David J. Haglin; John Feo; Jesse Weaver; Antonino Tumeo; Oreste Villa

We are developing a full software system for accelerating semantic graph databases on commodity clusters that scales to hundreds of nodes while maintaining constant query throughput. Our framework comprises a SPARQL-to-C++ compiler, a library of parallel graph methods and a custom multithreaded runtime layer, which provides a Partitioned Global Address Space (PGAS) programming model with fork/join parallelism and automatic load balancing over commodity clusters. We present preliminary results for the compiler and for the runtime.

IEEE Transactions on Computers | 2013

SMT Malleability in IBM POWER5 and POWER6 Processors

Alessandro Morari; Carlos Boneti; Francisco J. Cazorla; Roberto Gioiosa; Chen-Yong Cher; Alper Buyuktosunoglu; Pradip Bose; Mateo Valero

While several hardware mechanisms have been proposed to control the interaction between hardware threads in an SMT processor, few have addressed the issue of software-controllable SMT performance. The IBM POWER5 and POWER6 are the first high-performance processors implementing a software-controllable hardware-thread prioritization mechanism that controls the rate at which each hardware thread decodes instructions. This paper shows the potential of this basic mechanism to improve several target metrics for various applications on POWER5 and POWER6 processors. Our results show that although the software interface is exactly the same, the software-controlled priority mechanism has a different effect on POWER5 and POWER6. For instance, hardware threads in POWER6 are less sensitive to priorities than in POWER5 due to the in-order design. We study the SMT thread malleability to enable user-level optimizations that leverage software-controlled thread priorities. We also show how to achieve various system objectives, such as parallel application load balancing, in order to reduce execution time. Finally, we characterize user-level transparent execution on POWER5 and POWER6, and identify the workload mix that best benefits from it.

international parallel and distributed processing symposium | 2016

GraQL: A Query Language for High-Performance Attributed Graph Databases

Daniel G. Chavarría-Miranda; Vito Giovanni Castellana; Alessandro Morari; David J. Haglin; John Feo

Graph databases are becoming a critical tool for the analysis of graph-structured data in the context of multiple scientific and technical domains, including cybersecurity and computational biology. In particular, the storage, analysis and querying of attributed graphs is a very important capability. Attributed graphs contain properties attached to the vertices and edges of the graph structure. Queries over attributed graphs include not only structural pattern matching but also conditions over the values of the attributes. In this work, we present GraQL, a query language designed for high-performance attributed graph databases hosted on a high-memory-capacity cluster. GraQL is designed to be the front-end language for the attributed graph data model for the GEMS database system.

Resource Management for Big Data Platforms | 2016

Load Balancing and Fault Tolerance Mechanisms for Scalable and Reliable Big Data Analytics

Nitin Sukhija; Alessandro Morari; Ioana Banicescu

Data collection and analysis is rapidly changing the way scientific, national security and business communities operate. Data analytics applications, especially the ones involving graph analytics, have received increased attention over the years. Moreover, with this increasing interest in graph processing, the diversity of the graph datasets and the graph processing algorithms has also increased. There has been a similar explosion in the design and development of big data platforms to manage, store, process, and analyze large-scale graph datasets. Although these platforms have gained unquestionable success, it is currently difficult to choose a platform for deploying big data applications, due to a lack of comprehensive understanding of the performance and the design tradeoffs of these platforms in terms of handling both real-world workloads and resource failures. In this chapter, we survey the load balancing and fault tolerance strategies employed by the most dominant graph database platforms.