A Platform for Automating Chaos Experiments
Ali Basiri, Aaron Blohowiak, Lorin Hochstein, Casey Rosenthal
Netflix
{abasiri, ablohowiak, lhochstein, crosenthal}@netflix.com

©2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/ISSREW.2016.52

Abstract—The Netflix video streaming system is composed of many interacting services. In such a large system, failures in individual services are not uncommon. This paper describes the Chaos Automation Platform, a system for running failure injection experiments on the production system to verify that failures in non-critical services do not result in system outages.
1. Introduction
To an end-user, Netflix is a single service that allows them to stream television shows and movies over the Internet. To the engineers who work for the company, Netflix is a distributed system made up of many services that interact via remote procedure call (RPC), sometimes referred to as a microservice architecture [1].

In a large system such as Netflix, where hundreds of services run on thousands of machines and engineers are making changes every day, many things can go wrong. Fortunately, many of the internal services that make up Netflix are not critical for the user to be able to watch a video. For example, a personalized list of recommendations and bookmarks that recall where you left off when previously watching a video add value to the user, but if the services that implement these features stop working, we should still be able to provide a reasonable user experience. Hodges describes this kind of graceful degradation as partial availability [2].

Partial availability doesn't come for free: engineers must explicitly implement fallback behavior when making RPC calls against non-critical services. If fallback behavior is not implemented correctly, a problem in a non-critical service can lead to an outage. This work addresses the following question: how can we have confidence that Netflix users will still be able to stream videos after non-critical services have failed?

At Netflix, we practice Chaos Engineering [3]. Namely, we believe there is a level of complexity in modern distributed systems that is chaotic, and that a chief architect cannot hold all of the system's moving parts in their head. Chaos Engineering is about engineering practices that help us surface systemic effects, as embodied by the Principles of Chaos Engineering [4].

In particular, we believe that to have maximum confidence you must test in your production environment with live traffic. Chaos Monkey [5] is one example of Chaos Engineering in practice at Netflix.
Another example is automated canary analysis [6], which tests new code in the production environment with live traffic. Unfortunately, canary analysis is not guaranteed to test the code paths associated with dealing with failures in non-critical services. Another tenet of Chaos Engineering is automation: we want an automated solution for ensuring the system is resilient to failures in non-critical services.

This paper describes our proposed solution: the Chaos Automation Platform, or ChAP. ChAP enables engineering teams to run Chaos Engineering experiments on live traffic in production in order to build confidence that their service will degrade gracefully when non-critical downstream services fail.

ChAP works by diverting a fraction of production traffic, injecting failures into the diverted traffic, and checking that the system behaves as expected. Section 4 describes how an engineer would use ChAP to verify that Netflix is resilient to failures in a particular service.
2. Individual service failures vs. system-level failures
As Hodges notes, “distributed systems are different because they fail often” [2]. When a system runs on thousands of servers, it becomes very likely that something will go wrong somewhere.

A simple example of a failure is a bug that results in an unhandled exception, such as a null pointer exception. In Netflix's microservice architecture, an unhandled exception results in a service returning an HTTP 500 error code [7].

There are other failure modes that are common for an individual service in a microservice architecture. One common problem is resource exhaustion. Examples of finite resources on a server include memory, disk space, CPU cycles, threads, and open TCP/IP connections. When a server runs out of one of these resources, system calls that would normally succeed may block or throw exceptions. Resource exhaustion can be caused by a resource leak, but it may also occur if the load on a server exceeds its capacity. Here the problem is that the service has been insufficiently scaled: not enough servers have been allocated to that service.
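The unhandled-exception failure mode can be sketched with a minimal request dispatcher (a hypothetical sketch, not Netflix's actual framework):

```python
# Minimal sketch: how a service framework typically turns an unhandled
# exception in a handler into an HTTP 500 response to the caller.

def dispatch(handler, request):
    """Invoke a request handler, mapping unhandled exceptions to a 500."""
    try:
        body = handler(request)
        return 200, body
    except Exception:
        # An unhandled error (e.g. a null/None dereference) surfaces to
        # the caller as a generic server error instead of crashing the
        # process.
        return 500, "internal server error"

def buggy_handler(request):
    return request["user"].upper()   # KeyError if "user" is missing

status, _ = dispatch(buggy_handler, {})                    # -> 500
ok_status, _ = dispatch(buggy_handler, {"user": "alice"})  # -> 200
```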
1. At Netflix, most services are implemented in Java, which uses exceptions for error signaling.

Figure 1. Unexpected fallback behavior

When a server runs low on one of its resources, one symptom is an increase in the average response time of the server. For example, memory pressure on a server may lead to garbage collection pauses. Another example: for a service that allocates one thread per request, if the number of pending requests exceeds the number of available threads, latency will increase.

Yet another issue is the environment that these services run in. All of the Netflix services run within the Amazon Web Services Elastic Compute Cloud (EC2), an infrastructure-as-a-service cloud computing environment [8]. Because cloud providers such as EC2 compete on price, in order to reduce costs they use commodity-grade hardware instead of more reliable enterprise-grade hardware. This increases the likelihood of an individual server failing because of hardware issues. Transient networking issues such as latency spikes are also not uncommon in cloud computing environments. When deploying to a cloud computing platform, it is the responsibility of the software engineers to design systems that incorporate redundancy to compensate for occasional failures in hardware.

Individual service failures are inevitable, and Netflix engineers leverage the Hystrix [9] library to implement fallback logic to handle failures in downstream services. Our goal is to prevent system-level failures. In particular, our goal is to reduce the likelihood of an outage, when Netflix customers are not able to stream videos. The primary metric of system health at Netflix is the number of video stream starts per second, internally referred to as SPS [10].

A failure of an individual service can lead to a drop in SPS if the client calling the service does not have proper fallbacks in place. A study by Yuan et al.
revealed that 92% of catastrophic system failures happened because of incorrect error-handling logic [11].

Even if fallback logic is present in a client, the failure of a non-critical service may still lead to a system-level failure due to cascading effects. Consider the following failure scenario, illustrated in Figure 1. Typically, service A calls service B. For some reason, C starts to become overloaded, and returns errors to B. The fallback behavior for B is not working correctly, which causes B to return errors. A detects a problem and calls C as a fallback. Fallback behavior that should have alleviated the load on C instead increased the load on C, accelerating the problem and resulting in an outage.

We believe that ChAP will help us identify these kinds of failure modes before they result in outages.

Figure 2. Services in the request path when calling Gallery
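The cascade above can be made concrete with a toy simulation (the service behavior and all numbers here are hypothetical):

```python
# Toy simulation of the cascade in Figure 1. B depends on C; when C is
# overloaded, B's broken fallback surfaces errors to A, and A falls back
# by calling C directly -- adding load to the already-overloaded service.

C_CAPACITY = 100  # requests/sec that C can serve before erroring (assumed)

def handle_request(load_on_c):
    """One request through A. Returns (status, extra load placed on C)."""
    if load_on_c <= C_CAPACITY:
        return "ok", 0        # B -> C succeeds; A responds normally
    # C errors to B; B's fallback is broken, so B errors to A;
    # A's fallback is a direct call to C, adding one more request to C.
    return "error", 1

load_on_c = 150               # C is already past capacity
errors = 0
for _ in range(50):
    status, extra = handle_request(load_on_c)
    load_on_c += extra
    errors += status == "error"
# load_on_c -> 200, errors -> 50: the fallback amplified the overload
```

Instead of shedding load from C, every failed request added to it, which is exactly the amplification the scenario describes.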
3. Example of a non-critical service: gallery
When a user logs in to Netflix, they are presented with rows of images, called galleries, that represent video content. Each gallery represents a different category. Examples of galleries include:
• Trending Now
• Recently Added
• Critically-acclaimed Comedies
• TV Dramas
The list of galleries and the contents of the gallery are personalized for each Netflix user: different users will be shown different galleries. The
Gallery microservice is responsible for generating the galleries. If this service stops working, the client that calls the Gallery service must return a sensible fallback. For example, it may return an older gallery that is present in a local cache. Or, it may return a gallery that is not personalized for the particular user. From the user's perspective, the Netflix interface should still appear to be working properly, even if the content presented to the user is stale or not fully personalized.

Figure 2 shows the request path for requests that ultimately reach the Gallery service. The first service in the request path is Zuul [12], a reverse proxy that serves as the front door to Netflix. Next in the request path is a service called API [13]. API contains the Gallery client library that makes calls against the Gallery service. It is this client library that is responsible for serving fallbacks in the event that the Gallery service fails. To verify that this fallback behavior works correctly, we must inject failures on the calls from API to Gallery.
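The fallback behavior described for the Gallery client might be sketched as follows (class and function names are hypothetical; the real client is built on Hystrix):

```python
# Hypothetical sketch of a Gallery client with the fallback behavior
# described above: serve the live gallery if the RPC succeeds, otherwise
# fall back to a stale cached copy, and finally to an unpersonalized
# default gallery.

DEFAULT_GALLERY = ["Trending Now", "Recently Added"]  # not user-specific

class GalleryClient:
    def __init__(self, rpc_call):
        self.rpc_call = rpc_call   # function: user_id -> gallery rows
        self.cache = {}            # user_id -> last successful response

    def get_gallery(self, user_id):
        try:
            rows = self.rpc_call(user_id)
            self.cache[user_id] = rows
            return rows
        except Exception:
            # Stale-but-personalized beats fresh-but-generic.
            return self.cache.get(user_id, DEFAULT_GALLERY)

def failing_rpc(user_id):
    raise ConnectionError("Gallery service unavailable")

client = GalleryClient(failing_rpc)
rows = client.get_gallery("user-1")  # -> DEFAULT_GALLERY (no cached copy yet)
```

Either way, the user sees a working interface; only the freshness and personalization of the content degrade.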
4. Running a ChAP experiment
Consider the following scenario: Alice, a (fictional) QA engineer on the Gallery team, wants to verify that Netflix is resilient to failures in the Gallery service. She uses ChAP's web interface to define an experiment. Because ChAP injects failures on the client side of the request, she selects the API server group as the subject of the experiment. She specifies that all calls to the Gallery service should fail. She chooses to divert only a small amount of traffic for this experiment: 0.3%. She chooses a duration of 30 minutes for the experiment.

Finally, she selects the metrics that she is interested in observing for the experiment. She chooses a number of Hystrix commands to track for the experiment. Hystrix is a library that allows engineers to wrap RPC calls and specify what the fallback behavior should be if an RPC call fails. Each Hystrix command has a name, e.g.: “GetGallery”.

For each Hystrix command, for the control and experiment server groups, ChAP will display counts of:
• successful requests served
• successful fallbacks served
• failed fallbacks served
An example set of plots for the GetGallery Hystrix command is shown in Figure 3.

Alice expects to see a large number of successful requests served in the control group, and a large number of successful fallbacks served in the experiment group.

Once the experiment starts, the following things happen, as depicted in Figure 4. ChAP creates two new server groups, named api-chap-control and api-chap-experiment. The servers in these two new groups are deployed with the same software as the servers in the api server group.

Of all of the requests that are destined for the API services, 99.7% are routed to the original API server group, 0.15% are routed to the api-chap-control group, and 0.15% are routed to the api-chap-experiment group.
In the api-chap-experiment group, all of the RPC calls to the Gallery service fail immediately with an error.

ChAP presents Alice with a dashboard that plots the metrics specified by the user for the control and experiment groups. The dashboard also shows the SPS for each group. By comparing the metrics between the two groups, Alice can determine whether the system is handling Gallery failures correctly.
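The 0.3% diversion, split evenly between control and experiment, amounts to weighted request routing; a deterministic sketch (the bucketing scheme here is a hypothetical stand-in for Zuul's actual routing):

```python
# Route requests to the baseline, control, and experiment server groups
# in the proportions used in the example experiment (0.3% diverted,
# split evenly between control and experiment).

def route(request_id):
    bucket = hash(request_id) % 10000   # 10000 buckets of 0.01% each
    if bucket < 15:
        return "api-chap-experiment"    # 0.15% -- failures injected here
    if bucket < 30:
        return "api-chap-control"       # 0.15% -- identical, no injection
    return "api"                        # 99.7% -- untouched production group

counts = {"api": 0, "api-chap-control": 0, "api-chap-experiment": 0}
for i in range(10000):
    counts[route(i)] += 1
# counts -> {"api": 9970, "api-chap-control": 15, "api-chap-experiment": 15}
```

Keeping a control group the same size as the experiment group is what makes the dashboard comparison meaningful: any difference between the two small groups can be attributed to the injected failures rather than to the diversion itself.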
5. Implementation details
ChAP uses an internally developed system called FIT [14] to cause RPC calls between microservices to fail. FIT is only able to inject two types of failures: an error response and an increase in latency. However, from the point of view of a client making a call to a service, a large number of problems that can occur in an individual service manifest as either an error response or a response delay. Hence, ChAP can model many types of real failures in individual services.

ChAP works by coordinating among many existing systems inside of Netflix. In addition to FIT, ChAP interacts with Hystrix [9] (fault tolerance), Spinnaker [15] (deployment), Eureka [16] (service discovery), Zuul [12] (reverse proxy), Archaius [17] (dynamic configuration management), Ribbon [18] (interprocess communication), Atlas [19] (telemetry) and Mantis [20] (stream processing).
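FIT's two failure types, error responses and added latency, can be illustrated with a small client-side wrapper (a sketch only; this is not FIT's actual API):

```python
import time

# Sketch of client-side failure injection in the style described for FIT:
# a wrapped RPC call can be forced to fail fast or to respond slowly.
# The function names and configuration knobs here are hypothetical.

def inject(rpc_call, failure=None, added_latency=0.0):
    """Wrap an RPC call; `failure` and `added_latency` mimic FIT's two modes."""
    def wrapped(*args, **kwargs):
        if failure is not None:
            raise failure               # mode 1: immediate error response
        if added_latency > 0:
            time.sleep(added_latency)   # mode 2: increased latency
        return rpc_call(*args, **kwargs)
    return wrapped

def get_gallery(user_id):
    return ["Trending Now"]

failing = inject(get_gallery, failure=ConnectionError("injected"))
slow = inject(get_gallery, added_latency=0.05)
```

From the caller's perspective these two modes cover most real failures: a crashed, overloaded, or partitioned downstream service looks like either an error or a slow response.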
6. Current status and future work
ChAP is still under heavy development, with a few teams inside of Netflix currently test-driving the system and providing feedback. Our ultimate goal is to be able to detect automatically whether a service is resilient to failure rather than relying on a human looking at dashboards and making a judgment. We also plan to integrate ChAP into the Spinnaker deployment system so that ChAP experiments can be started automatically as part of the deployment process.

There are failures that FIT (and, hence, ChAP) cannot currently model. We can only inject failures in the request path, in requests that originate from a Netflix client device. In particular, we cannot yet inject failures in calls between services that occur during the startup of a service.

Finally, while we use SPS as our health metric, what we are ultimately concerned about is the user experience. In the future, we hope to use information from client devices to get more accurate information on the impact of a ChAP experiment on a user.
References

[1] S. Newman, Building Microservices. O'Reilly Media, 2015.
[2] J. Hodges, “Notes on distributed systems for young bloods,” blog post, January 2013.
[3] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal, “Chaos engineering,” IEEE Software, vol. 33, no. 3, pp. 35–41, May 2016.
[4] “Principles of Chaos Engineering,” http://principlesofchaos.org, accessed: 2016-07-27.
[5] C. Bennett and A. Tseitlin, “Chaos Monkey released into the wild,” http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html, July 30, 2012, Netflix Tech Blog.
[6] B. Schmaus, “Deploying the Netflix API,” http://techblog.netflix.com/2013/08/deploying-netflix-api.html, August 14, 2013, Netflix Tech Blog.
[7] R. Fielding and J. Reschke, “Hypertext transfer protocol (HTTP/1.1): Semantics and content,” Internet Requests for Comments, RFC Editor, RFC 7231, June 2014.
[8] P. Mell and T. Grance, “The NIST definition of cloud computing,” National Institute of Standards and Technology, Tech. Rep. 800-145, September 2011.
[9] B. Christensen, “Introducing Hystrix for resilience engineering,” http://techblog.netflix.com/2012/11/hystrix.html, November 26, 2012, Netflix Tech Blog.
[10] P. Fisher-Ogden, C. Sanden, and C. Rioux, “SPS: the pulse of Netflix streaming,” http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html, February 2, 2015, Netflix Tech Blog.
[11] D. Yuan, Y. Luo, X. Zhuang, G. R. Rodrigues, X. Zhao, Y. Zhang, P. U. Jain, and M. Stumm, “Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems,”