
Publications


Featured research published by Haryadi S. Gunawi.


symposium on operating systems principles | 2005

IRON file systems

Vijayan Prabhakaran; Lakshmi N. Bairavasundaram; Nitin Agrawal; Haryadi S. Gunawi; Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau

Commodity file systems trust disks to either work or fail completely, yet modern disks exhibit more complex failure modes. We suggest a new fail-partial failure model for disks, which incorporates realistic localized faults such as latent sector errors and block corruption. We then develop and apply a novel failure-policy fingerprinting framework to investigate how commodity file systems react to a range of more realistic disk failures. We classify their failure policies in a new taxonomy that measures their Internal RObustNess (IRON), which includes both failure detection and recovery techniques. We show that commodity file system failure policies are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures. Finally, we design, implement, and evaluate a prototype IRON file system, Linux ixt3, showing that techniques such as in-disk checksumming, replication, and parity greatly enhance file system robustness while incurring minimal time and space overheads.
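
To make the in-disk checksumming and replication ideas concrete, here is a minimal C sketch of a checksum-protected block store with a replica fallback. It is an illustration under assumptions, not the ixt3 code: the Fletcher-32 checksum, the struct layout, and the helper names are all invented for this example.

/* Minimal sketch of checksum-protected blocks with a replica fallback. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Fletcher-32 over a block; any strong checksum (e.g., CRC32) would do. */
static uint32_t block_checksum(const uint8_t *data, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65535;
        b = (b + a) % 65535;
    }
    return (b << 16) | a;
}

struct stored_block {
    uint8_t  data[BLOCK_SIZE];
    uint32_t csum;                    /* persisted alongside the data */
};

/* Write path: compute and persist the checksum with the block. */
static void block_write(struct stored_block *dst, const uint8_t *src)
{
    memcpy(dst->data, src, BLOCK_SIZE);
    dst->csum = block_checksum(dst->data, BLOCK_SIZE);
}

/* Read path: verify the primary; on a mismatch, try the replica. */
static int block_read(const struct stored_block *primary,
                      const struct stored_block *replica, uint8_t *out)
{
    if (block_checksum(primary->data, BLOCK_SIZE) == primary->csum) {
        memcpy(out, primary->data, BLOCK_SIZE);
        return 0;
    }
    fprintf(stderr, "checksum mismatch: falling back to replica\n");
    if (replica && block_checksum(replica->data, BLOCK_SIZE) == replica->csum) {
        memcpy(out, replica->data, BLOCK_SIZE);
        return 0;
    }
    return -1;                        /* report the partial failure upward */
}

int main(void)
{
    static struct stored_block primary, replica;
    uint8_t src[BLOCK_SIZE] = "hello", out[BLOCK_SIZE];

    block_write(&primary, src);
    block_write(&replica, src);
    primary.data[0] ^= 0xFF;          /* inject corruption on the primary */

    if (block_read(&primary, &replica, out) == 0)
        printf("recovered: %s\n", out);
    return 0;
}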


symposium on operating systems principles | 2003

Transforming policies into mechanisms with infokernel

Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau; Nathan C. Burnett; Timothy E. Denehy; Thomas J. Engle; Haryadi S. Gunawi; James A. Nugent; Florentina I. Popovici

We describe an evolutionary path that allows operating systems to be used in a more flexible and appropriate manner by higher-level services. An infokernel exposes key pieces of information about its algorithms and internal state; thus, its default policies become mechanisms, which can be controlled from user level. We have implemented two prototype infokernels based on the Linux 2.4 and NetBSD kernels, called infoLinux and infoBSD, respectively. The infokernels export key abstractions as well as basic information primitives. Using infoLinux, we have implemented four case studies showing that policies within Linux can be manipulated outside of the kernel. Specifically, we show that the default file cache replacement algorithm, file layout policy, disk scheduling algorithm, and TCP congestion control algorithm can each be turned into base mechanisms. For each case study, we have found that infokernel abstractions can be implemented with little code and that the overhead and accuracy of synthesizing policies at user level are acceptable.
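
A small user-level sketch of the infokernel idea, under heavy assumptions: the /proc/infolinux/cache_victims file, its record format, and the protected path prefix are hypothetical, invented only to show how exposed eviction information could turn the cache replacement policy into a user-controllable mechanism.

/* User-level policy built on a hypothetical infokernel interface. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical file listing (file, page index) pairs the kernel
     * intends to evict next; not a real infoLinux interface. */
    FILE *info = fopen("/proc/infolinux/cache_victims", "r");
    if (!info) { perror("infokernel interface"); return 1; }

    char path[4096];
    long page;
    while (fscanf(info, "%4095s %ld", path, &page) == 2) {
        if (strncmp(path, "/srv/hot/", 9) != 0)
            continue;                 /* only protect our hot data set */
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        char buf[4096];
        if (fseek(f, page * 4096L, SEEK_SET) == 0) {
            size_t n = fread(buf, 1, sizeof buf, f);  /* re-touch the page */
            (void)n;                  /* keeping it warm is the whole effect */
        }
        fclose(f);
    }
    fclose(info);
    return 0;
}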


symposium on operating systems principles | 2007

Improving file system reliability with I/O shepherding

Haryadi S. Gunawi; Vijayan Prabhakaran; Swetha Krishnan; Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau

We introduce a new reliability infrastructure for file systems called I/O shepherding. I/O shepherding allows a file system developer to craft nuanced reliability policies to detect and recover from a wide range of storage system failures. We incorporate shepherding into the Linux ext3 file system through a set of changes to the consistency management subsystem, layout engine, disk scheduler, and buffer cache. The resulting file system, CrookFS, enables a broad class of policies to be easily and correctly specified. We implement numerous policies, incorporating data protection techniques such as retry, parity, mirrors, checksums, sanity checks, and data structure repairs; even complex policies can be implemented in less than 100 lines of code, confirming the power and simplicity of the shepherding framework. We also demonstrate that shepherding is properly integrated, adding less than 5% overhead to the I/O path.
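
A compilable sketch of what a composed shepherding policy might look like: retry the primary, fall back to a mirror, then attempt repair. The two-disk model is simulated in memory, and disk_read, checksum_ok, and repair_block are stand-ins for shepherding primitives rather than the CrookFS API.

/* Sketch of a composed read policy over simulated disks. */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 8
#define BSZ     16

static char     disks[2][NBLOCKS][BSZ];  /* two tiny simulated disks */
static unsigned stored_sum[NBLOCKS];     /* out-of-band checksums */

static unsigned sum(const char *b)
{
    unsigned s = 0;
    for (int i = 0; i < BSZ; i++)
        s += (unsigned char)b[i];
    return s;
}

static int disk_read(int d, int blk, char *buf)
{
    memcpy(buf, disks[d][blk], BSZ);
    return 0;                            /* simulated disks never time out */
}

static int checksum_ok(int blk, const char *buf)
{
    return sum(buf) == stored_sum[blk];
}

static int repair_block(int blk, char *buf)
{
    (void)blk; (void)buf;
    return -1;                           /* last resort; unimplemented here */
}

enum { PRIMARY, MIRROR, RETRIES = 3 };

/* One policy: retry the primary, then the mirror, then attempt repair. */
static int shepherd_read(int blk, char *buf)
{
    for (int i = 0; i < RETRIES; i++)
        if (disk_read(PRIMARY, blk, buf) == 0 && checksum_ok(blk, buf))
            return 0;
    if (disk_read(MIRROR, blk, buf) == 0 && checksum_ok(blk, buf))
        return 0;
    fprintf(stderr, "block %d: both copies bad, trying repair\n", blk);
    return repair_block(blk, buf);
}

int main(void)
{
    strcpy(disks[PRIMARY][0], "important data");
    strcpy(disks[MIRROR][0], "important data");
    stored_sum[0] = sum(disks[PRIMARY][0]);
    disks[PRIMARY][0][0] ^= 0x40;        /* inject corruption on the primary */

    char buf[BSZ];
    if (shepherd_read(0, buf) == 0)
        printf("read ok: %s\n", buf);    /* served from the mirror */
    return 0;
}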


programming language design and implementation | 2009

Error propagation analysis for file systems

Cindy Rubio-González; Haryadi S. Gunawi; Ben Liblit; Remzi H. Arpaci-Dusseau; Andrea C. Arpaci-Dusseau

Unchecked errors are especially pernicious in operating system file management code. Transient or permanent hardware failures are inevitable, and error-management bugs at the file system layer can cause silent, unrecoverable data corruption. We propose an interprocedural static analysis that tracks errors as they propagate through file system code. Our implementation detects overwritten, out-of-scope, and unsaved unchecked errors. Analysis of four widely-used Linux file system implementations (CIFS, ext3, IBM JFS and ReiserFS), a relatively new file system implementation (ext4), and shared virtual file system (VFS) code uncovers 312 error propagation bugs. Our flow- and context-sensitive approach produces more precise results than related techniques while providing better diagnostic information, including possible execution paths that demonstrate each bug found.
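
The overwritten-error defect this analysis reports fits in a few lines. The following self-contained C sketch (the file-system helpers are hypothetical stand-ins, not kernel code) contrasts the bug with its fix:

#include <errno.h>
#include <stdio.h>

/* Stand-ins for file-system helpers; read_super fails here to show the bug. */
static int read_super(void)  { return -EIO; }
static int read_bitmap(void) { return 0; }

/* Overwritten-error bug: err holds -EIO but is clobbered before use. */
static int mount_fs_buggy(void)
{
    int err;
    err = read_super();    /* err == -EIO ... */
    err = read_bitmap();   /* BUG: the unchecked error is overwritten */
    return err;            /* caller sees success */
}

/* Fixed version: propagate the first failure. */
static int mount_fs_fixed(void)
{
    int err = read_super();
    if (err)
        return err;
    return read_bitmap();
}

int main(void)
{
    printf("buggy: %d, fixed: %d\n", mount_fs_buggy(), mount_fs_fixed());
    return 0;  /* prints "buggy: 0, fixed: -5" on Linux */
}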


international symposium on computer architecture | 2005

Deconstructing Commodity Storage Clusters

Haryadi S. Gunawi; Nitin Agrawal; Andrea C. Arpaci-Dusseau; J. Schindler; Remzi H. Arpaci-Dusseau

The traditional approach for characterizing complex systems is to run standard workloads and measure the resulting performance as seen by the end user. However, unique opportunities exist when characterizing a system that is itself constructed from standardized components: one can also look inside the system itself by instrumenting each of the components. In this paper, we show how intra-box instrumentation can help one understand the behavior of a large-scale storage cluster, the EMC Centera. In our analysis, we leverage standard tools for tracing both the disk and network traffic emanating from each node of the cluster. By correlating this traffic with the running workload, we are able to infer the structure of the software system (e.g., its write update protocol) as well as its policies (e.g., how it performs caching, replication, and load-balancing). Further, by imposing variable intra-box delays on network and disk traffic, we can confirm the causal relationships between network and disk events. Thus, we are able to infer the semantics of the messages between nodes without examining a single line of source code.
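
A sketch of the correlation step, with made-up event values: given timestamped network and disk events from one node, flag each disk write that follows a network message within a short window as a likely causal pair.

/* Correlating per-node network and disk traces; event values are made up. */
#include <stdio.h>

struct ev { double t; const char *what; };

int main(void)
{
    /* e.g., distilled from tcpdump on the node's NIC ... */
    struct ev net[]  = { { 0.010, "recv WRITE request" },
                         { 0.135, "recv REPLICATE msg"  } };
    /* ... and from a block-level disk trace on the same node. */
    struct ev disk[] = { { 0.042, "write blk 811" },
                         { 0.162, "write blk 811" } };
    const double window = 0.100;  /* seconds: max plausible causal gap */
    int nn = sizeof net / sizeof net[0];
    int nd = sizeof disk / sizeof disk[0];

    /* Flag each disk write that closely follows a network message. */
    for (int d = 0; d < nd; d++)
        for (int n = 0; n < nn; n++)
            if (disk[d].t > net[n].t && disk[d].t - net[n].t < window)
                printf("%.3fs %s  <-  %.3fs %s\n",
                       disk[d].t, disk[d].what, net[n].t, net[n].what);
    return 0;
}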


symposium on cloud computing | 2016

Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages

Haryadi S. Gunawi; Mingzhe Hao; Riza O. Suminto; Agung Laksono; Anang D. Satria; Jeffry Adityatama; Kurnia J. Eliazar

We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed 1,247 news headlines and public post-mortem reports that detail 597 unplanned outages that occurred within a 7-year span from 2009 to 2015. We analyzed outage duration, root causes, impacts, and fix procedures. This study reveals the broader availability landscape of modern cloud services and provides answers to why outages still take place even with pervasive redundancies.


architectural support for programming languages and operating systems | 2016

TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems

Tanakorn Leesatapornwongsa; Jeffrey F. Lukman; Shan Lu; Haryadi S. Gunawi

We present TaxDC, the largest and most comprehensive taxonomy of non-deterministic concurrency bugs in distributed systems. We study 104 distributed concurrency (DC) bugs from four widely-deployed cloud-scale datacenter distributed systems: Cassandra, Hadoop MapReduce, HBase, and ZooKeeper. We study DC-bug characteristics along several axes of analysis, such as the triggering timing condition and input preconditions, error and failure symptoms, and fix strategies, collectively stored as 2,083 classification labels in the TaxDC database. We discuss how our study can open up many new research directions in combating DC bugs.
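
One way to picture a TaxDC-style record is as a struct labeled along the paper's axes. The field names, enum labels, and the sample bug below are invented for illustration, not the actual TaxDC schema:

/* Illustrative TaxDC-style record; fields and labels are invented. */
#include <stdio.h>

enum timing  { MSG_MSG_RACE, MSG_CRASH_RACE, MSG_REBOOT_RACE };
enum symptom { SILENT_STATE_CORRUPTION, NODE_HANG, JOB_FAILURE };
enum fixkind { ADD_SYNCHRONIZATION, IGNORE_STALE_MSG, RETRY_OPERATION };

struct dc_bug {
    const char  *system;   /* Cassandra, Hadoop MapReduce, HBase, ZooKeeper */
    const char  *bug_id;   /* issue-tracker ticket */
    enum timing  trigger;  /* timing condition that exposes the race */
    enum symptom failure;  /* observed error/failure symptom */
    enum fixkind strategy; /* how developers fixed it */
};

static const char *timing_name[] = { "message-message race",
                                     "message-crash race",
                                     "message-reboot race" };

int main(void)
{
    struct dc_bug b = { "ZooKeeper", "(hypothetical ticket)",
                        MSG_CRASH_RACE, SILENT_STATE_CORRUPTION,
                        ADD_SYNCHRONIZATION };
    printf("%s %s: triggered by a %s\n",
           b.system, b.bug_id, timing_name[b.trigger]);
    return 0;
}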


international conference on data engineering | 2010

Impact of disk corruption on open-source DBMS

Sriram Subramanian; Yupu Zhang; Rajiv Vaidyanathan; Haryadi S. Gunawi; Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau; Jeffrey F. Naughton

Despite the best intentions of disk and RAID manufacturers, on-disk data can still become corrupted. In this paper, we examine the effects of corruption on database management systems. By injecting faults into the MySQL DBMS, we find that in certain cases, corruption can greatly harm the system, leading to untimely crashes, data loss, or even incorrect results. Overall, of 145 injected faults, 110 lead to serious problems. More detailed observations point us to three deficiencies: MySQL does not have the capability to detect some corruptions due to lack of redundant information, does not isolate corrupted data from valid data, and has inconsistent reactions to similar corruption scenarios. To detect and repair corruption, a DBMS is typically equipped with an offline checker. Unfortunately, the MySQL offline checker is not comprehensive in the checks it performs, misdiagnosing many corruption scenarios and missing others. Sometimes the checker itself crashes; more ominously, its incorrect checking can lead to incorrect repairs. Overall, we find that the checker does not behave correctly in 18 of 145 injected corruptions, and thus can leave the DBMS vulnerable to the problems described above.
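
A minimal C sketch of the fault-injection step, assuming a scratch copy of a database file: flip a single bit at a chosen offset to emulate on-disk corruption before running the DBMS or its offline checker against the corrupted copy. The default path and offset are placeholders.

/* Flip one bit in a file copy to emulate on-disk corruption. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "copy-of-table.MYD";
    long offset      = argc > 2 ? atol(argv[2]) : 4096;

    FILE *f = fopen(path, "r+b");
    if (!f) { perror(path); return 1; }

    fseek(f, offset, SEEK_SET);
    int c = fgetc(f);
    if (c == EOF) { fprintf(stderr, "offset past EOF\n"); fclose(f); return 1; }

    fseek(f, offset, SEEK_SET);      /* reposition between read and write */
    fputc(c ^ 0x01, f);              /* single-bit corruption */
    fclose(f);

    printf("flipped bit 0 of byte %ld in %s\n", offset, path);
    return 0;  /* now run the DBMS / offline checker on the corrupted copy */
}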


ACM Transactions on Storage | 2017

Tiny-Tail Flash: Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND SSDs

Shiqin Yan; Huaicheng Li; Mingzhe Hao; Michael Hao Tong; Swaminathan Sundararaman; Andrew A. Chien; Haryadi S. Gunawi

Flash storage has become the mainstream destination for storage users. However, SSDs do not always deliver the performance that users expect. The core culprit of flash performance instability is the well-known garbage collection (GC) process, which causes long delays as the SSD cannot serve (i.e., blocks) incoming I/Os, inducing the long-tail latency problem. We present ttFlash as a solution to this problem. ttFlash is a “tiny-tail” flash drive (SSD) that eliminates GC-induced tail latencies by circumventing GC-blocked I/Os with four novel strategies: plane-blocking GC, rotating GC, GC-tolerant read, and GC-tolerant flush. These four strategies leverage the timely combination of modern SSD internal technologies such as powerful controllers, parity-based redundancies, and capacitor-backed RAM; they depend on the use of intra-plane copyback operations. Through an extensive evaluation, we show that ttFlash comes close to a “no-GC” scenario: between the 99th and 99.99th percentiles, ttFlash is only 1.0 to 2.6× slower than the no-GC case, while a base approach suffers from 5–138× GC-induced slowdowns.
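
A toy illustration of the GC-tolerant read idea: if the target plane is busy with GC, rebuild the requested page by XOR-ing the remaining pages of its parity group instead of waiting. The plane layout and helpers are simulated here and are not an SSD-controller API.

/* GC-tolerant read over a simulated parity group of flash planes. */
#include <stdio.h>
#include <string.h>

#define PLANES    4
#define PAGE_SIZE 8

static unsigned char plane[PLANES][PAGE_SIZE]; /* planes 0..2 data, 3 parity */
static int gc_busy[PLANES];

static void read_page(int p, unsigned char *out)
{
    if (!gc_busy[p]) {                 /* fast path: plane is free */
        memcpy(out, plane[p], PAGE_SIZE);
        return;
    }
    /* GC-tolerant path: rebuild from the rest of the parity group. */
    memset(out, 0, PAGE_SIZE);
    for (int q = 0; q < PLANES; q++)
        if (q != p)
            for (int i = 0; i < PAGE_SIZE; i++)
                out[i] ^= plane[q][i];
}

int main(void)
{
    memcpy(plane[0], "AAAAAAA", PAGE_SIZE);
    memcpy(plane[1], "BBBBBBB", PAGE_SIZE);
    memcpy(plane[2], "CCCCCCC", PAGE_SIZE);
    for (int i = 0; i < PAGE_SIZE; i++)            /* build the parity plane */
        plane[3][i] = plane[0][i] ^ plane[1][i] ^ plane[2][i];

    gc_busy[1] = 1;                    /* plane 1 is blocked by GC */
    unsigned char buf[PAGE_SIZE];
    read_page(1, buf);
    printf("reconstructed: %.7s\n", buf);          /* prints BBBBBBB */
    return 0;
}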


symposium on cloud computing | 2014

The Case for Drill-Ready Cloud Computing

Tanakorn Leesatapornwongsa; Haryadi S. Gunawi

As cloud computing has matured, more and more local applications are replaced by easy-to-use, on-demand services accessible via computer networks (a.k.a. cloud services). Running behind these services are massive hardware infrastructures and complex management tasks (e.g., recovery, software upgrades) that, if not tested thoroughly, can exhibit failures that lead to major service disruptions. Some researchers estimate that 568 hours of downtime at 13 well-known cloud services since 2007 had an economic impact of more than …

Collaboration


Dive into Haryadi S. Gunawi's collaboration.

Top Co-Authors

Andrea C. Arpaci-Dusseau (University of Wisconsin-Madison)
Remzi H. Arpaci-Dusseau (University of Wisconsin-Madison)
Thanh Do (University of Wisconsin-Madison)
Pallavi Joshi (University of California)