A Platform for Automating Chaos Experiments
Ali Basiri, Aaron Blohowiak, Lorin Hochstein, Casey Rosenthal
Netflix
{abasiri, ablohowiak, lhochstein, crosenthal}@netflix.com

©2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/ISSREW.2016.52

Abstract—The Netflix video streaming system is composed of many interacting services. In such a large system, failures in individual services are not uncommon. This paper describes the Chaos Automation Platform, a system for running failure injection experiments on the production system to verify that failures in non-critical services do not result in system outages.
1. Introduction
To an end-user, Netflix is a single service that allows them to stream television shows and movies over the Internet. To the engineers who work for the company, Netflix is a distributed system made up of many services that interact via remote procedure call (RPC), sometimes referred to as a microservice architecture [1].

In a large system such as Netflix, where hundreds of services run on thousands of machines and engineers are making changes every day, many things can go wrong. Fortunately, many of the internal services that make up Netflix are not critical for the user to be able to watch a video. For example, a personalized list of recommendations and bookmarks that recall where you left off when previously watching a video add value to the user, but if the services that implement these features stop working, we should still be able to provide a reasonable user experience. Hodges describes this kind of graceful degradation as partial availability [2].

Partial availability doesn't come for free: engineers must explicitly implement fallback behavior when making RPC calls against non-critical services. If fallback behavior is not implemented correctly, a problem in a non-critical service can lead to an outage. This work addresses the following question: how can we have confidence that Netflix users will still be able to stream videos after non-critical services have failed?

At Netflix, we practice Chaos Engineering [3]. Namely, we believe there is a level of complexity in modern distributed systems that is chaotic, and that a chief architect cannot hold all of the system's moving parts in their head. Chaos Engineering is about engineering practices that help us surface systemic effects, as embodied by the Principles of Chaos Engineering [4].

In particular, we believe that to have maximum confidence you must test in your production environment with live traffic. Chaos Monkey [5] is one example of Chaos Engineering in practice at Netflix.
Another example is automated canary analysis [6], which tests new code in the production environment with live traffic. Unfortunately, canary analysis is not guaranteed to test the code paths associated with dealing with failures in non-critical services. Another tenet of Chaos Engineering is automation: we want an automated solution for ensuring the system is resilient to failures in non-critical services.

This paper describes our proposed solution: the Chaos Automation Platform, or ChAP. ChAP enables engineering teams to run Chaos Engineering experiments on live traffic in production in order to build confidence that their service will degrade gracefully when non-critical downstream services fail.

ChAP works by diverting a fraction of production traffic, injecting failures into the diverted traffic, and checking that the system behaves as expected. Section 4 describes how an engineer would use ChAP to verify that Netflix is resilient to failures in a particular service.
2. Individual service failures vs. system-level failures
As Hodges notes, “distributed systems are different because they fail often” [2]. When a system runs on thousands of servers, it becomes very likely that something will go wrong somewhere.

A simple example of a failure is a bug that results in an unhandled exception, such as a null pointer exception. In Netflix's microservice architecture, an unhandled exception results in a service returning an HTTP 500 error code [7].

There are other failure modes that are common for an individual service in a microservice architecture. One common problem is resource exhaustion. Examples of finite resources on a server include memory, disk space, CPU cycles, threads, and open TCP/IP connections. When a server runs out of one of these resources, system calls that would normally succeed may block or throw exceptions. Resource exhaustion can be caused by a resource leak, but it may also occur if the load on a server exceeds its capacity. Here the problem is that the service has been insufficiently scaled: not enough servers have been allocated to that service.
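The unhandled-exception failure mode can be sketched with a minimal request dispatcher (a hypothetical sketch, not Netflix's actual framework):

```python
# Minimal sketch: how a service framework typically turns an unhandled
# exception in a handler into an HTTP 500 response to the caller.

def dispatch(handler, request):
    """Invoke a request handler, mapping unhandled exceptions to a 500."""
    try:
        body = handler(request)
        return 200, body
    except Exception:
        # An unhandled error (e.g. a null/None dereference) surfaces to
        # the caller as a generic server error instead of crashing the
        # process.
        return 500, "internal server error"

def buggy_handler(request):
    return request["user"].upper()   # KeyError if "user" is missing

status, _ = dispatch(buggy_handler, {})                    # -> 500
ok_status, _ = dispatch(buggy_handler, {"user": "alice"})  # -> 200
```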
1. At Netflix, most services are implemented in Java, which uses exceptions for error signaling.

Figure 1. Unexpected fallback behavior

When a server runs low on one of its resources, one symptom is an increase in the average response time of the server. For example, memory pressure on a server may lead to garbage collection pauses. Another example: for a service that allocates one thread per request, if the number of pending requests exceeds the number of available threads, latency will increase.

Yet another issue is the environment that these services run in. All of the Netflix services run within the Amazon Web Services Elastic Compute Cloud (EC2), an infrastructure-as-a-service cloud computing environment [8]. Because cloud providers such as EC2 compete on price, in order to reduce costs they use commodity-grade hardware instead of more reliable enterprise-grade hardware. This increases the likelihood of an individual server failing because of hardware issues. Transient networking issues such as latency spikes are also not uncommon in cloud computing environments. When deploying to a cloud computing platform, it is the responsibility of the software engineers to design systems that incorporate redundancy to compensate for occasional failures in hardware.

Individual service failures are inevitable, and Netflix engineers leverage the Hystrix [9] library to implement fallback logic to handle failures in downstream services. Our goal is to prevent system-level failures. In particular, our goal is to reduce the likelihood of an outage, when Netflix customers are not able to stream videos. The primary metric of system health at Netflix is the number of video stream starts per second, internally referred to as SPS [10].

A failure of an individual service can lead to a drop in SPS if the client calling the service does not have proper fallbacks in place. A study by Yuan et al.
revealed that 92% of catastrophic system failures happened because of incorrect error-handling logic [11].

Even if fallback logic is present in a client, the failure of a non-critical service may still lead to a system-level failure due to cascading effects. Consider the following failure scenario, illustrated in Figure 1. Typically, service A calls service B. For some reason, C starts to become overloaded, and returns errors to B. The fallback behavior for B is not working correctly, which causes B to return errors. A detects a problem and calls C as a fallback. Fallback behavior that should have alleviated the load on C instead increased the load on C, accelerating the problem and resulting in an outage.

We believe that ChAP will help us identify these kinds of failure modes before they result in outages.

Figure 2. Services in the request path when calling Gallery
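The cascade above can be made concrete with a toy simulation (the service behavior and all numbers here are hypothetical):

```python
# Toy simulation of the cascade in Figure 1. B depends on C; when C is
# overloaded, B's broken fallback surfaces errors to A, and A falls back
# by calling C directly -- adding load to the already-overloaded service.

C_CAPACITY = 100  # requests/sec that C can serve before erroring (assumed)

def handle_request(load_on_c):
    """One request through A. Returns (status, extra load placed on C)."""
    if load_on_c <= C_CAPACITY:
        return "ok", 0        # B -> C succeeds; A responds normally
    # C errors to B; B's fallback is broken, so B errors to A;
    # A's fallback is a direct call to C, adding one more request to C.
    return "error", 1

load_on_c = 150               # C is already past capacity
errors = 0
for _ in range(50):
    status, extra = handle_request(load_on_c)
    load_on_c += extra
    errors += status == "error"
# load_on_c -> 200, errors -> 50: the fallback amplified the overload
```

Instead of shedding load from C, every failed request added to it, which is exactly the amplification the scenario describes.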
3. Example of a non-critical service: gallery
When a user logs in to Netflix, they are presented with rows of images, called galleries, that represent video content. Each gallery represents a different category. Examples of galleries include:
• Trending Now
• Recently Added
• Critically-acclaimed Comedies
• TV Dramas
The list of galleries and the contents of the gallery are personalized for each Netflix user: different users will be shown different galleries. The
Gallery microservice is responsible for generating the galleries. If this service stops working, the client that calls the Gallery service must return a sensible fallback. For example, it may return an older gallery that is present in a local cache. Or, it may return a gallery that is not personalized for the particular user. From the user's perspective, the Netflix interface should still appear to be working properly, even if the content presented to the user is stale or not fully personalized.

Figure 2 shows the request path for requests that ultimately reach the Gallery service. The first service in the request path is Zuul [12], a reverse proxy that serves as the front door to Netflix. Next in the request path is a service called API [13]. API contains the Gallery client library that makes calls against the Gallery service. It is this client library that is responsible for serving fallbacks in the event that the Gallery service fails. To verify that this fallback behavior works correctly, we must inject failures on the calls from API to Gallery.
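The fallback behavior described for the Gallery client might be sketched as follows (class and function names are hypothetical; the real client is built on Hystrix):

```python
# Hypothetical sketch of a Gallery client with the fallback behavior
# described above: serve the live gallery if the RPC succeeds, otherwise
# fall back to a stale cached copy, and finally to an unpersonalized
# default gallery.

DEFAULT_GALLERY = ["Trending Now", "Recently Added"]  # not user-specific

class GalleryClient:
    def __init__(self, rpc_call):
        self.rpc_call = rpc_call   # function: user_id -> gallery rows
        self.cache = {}            # user_id -> last successful response

    def get_gallery(self, user_id):
        try:
            rows = self.rpc_call(user_id)
            self.cache[user_id] = rows
            return rows
        except Exception:
            # Stale-but-personalized beats fresh-but-generic.
            return self.cache.get(user_id, DEFAULT_GALLERY)

def failing_rpc(user_id):
    raise ConnectionError("Gallery service unavailable")

client = GalleryClient(failing_rpc)
rows = client.get_gallery("user-1")  # -> DEFAULT_GALLERY (no cached copy yet)
```

Either way, the user sees a working interface; only the freshness and personalization of the content degrade.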
4. Running a ChAP experiment
Consider the following scenario: Alice, a (fictional) QA engineer on the Gallery team, wants to verify that Netflix is resilient to failures in the Gallery service. She uses ChAP's web interface to define an experiment. Because ChAP injects failures on the client side of the request, she selects the API server group as the subject of the experiment. She specifies that all calls to the Gallery service should fail. She chooses to divert only a small amount of traffic for this experiment: 0.3%. She chooses a duration of 30 minutes for the experiment.

Finally, she selects the metrics that she is interested in observing for the experiment. She chooses a number of Hystrix commands to track for the experiment. Hystrix is a library that allows engineers to wrap RPC calls and specify what the fallback behavior should be if an RPC call fails. Each Hystrix command has a name, e.g.: “GetGallery”.

For each Hystrix command, for the control and experiment server groups, ChAP will display counts of:
• successful requests served
• successful fallbacks served
• failed fallbacks served
An example set of plots for the GetGallery Hystrix command is shown in Figure 3.

Alice expects to see a large number of successful requests served in the control group, and a large number of successful fallbacks served in the experiment group.

Once the experiment starts, the following things happen, as depicted in Figure 4. ChAP creates two new server groups, named api-chap-control and api-chap-experiment. The servers in these two new groups are deployed with the same software as the servers in the api server group.

Of all of the requests that are destined for the API services, 99.7% are routed to the original API server group, 0.15% are routed to the api-chap-control group, and 0.15% are routed to the api-chap-experiment group.
In the api-chap-experiment group, all of the RPC calls to the Gallery service fail immediately with an error.

ChAP presents Alice with a dashboard that plots the metrics specified by the user for the control and experiment groups. The dashboard also shows the SPS for each group. By comparing the metrics between the two groups, Alice can determine whether the system is handling Gallery failures correctly.
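The 0.3% diversion, split evenly between control and experiment, amounts to weighted request routing; a deterministic sketch (the bucketing scheme here is a hypothetical stand-in for Zuul's actual routing):

```python
# Route requests to the baseline, control, and experiment server groups
# in the proportions used in the example experiment (0.3% diverted,
# split evenly between control and experiment).

def route(request_id):
    bucket = hash(request_id) % 10000   # 10000 buckets of 0.01% each
    if bucket < 15:
        return "api-chap-experiment"    # 0.15% -- failures injected here
    if bucket < 30:
        return "api-chap-control"       # 0.15% -- identical, no injection
    return "api"                        # 99.7% -- untouched production group

counts = {"api": 0, "api-chap-control": 0, "api-chap-experiment": 0}
for i in range(10000):
    counts[route(i)] += 1
# counts -> {"api": 9970, "api-chap-control": 15, "api-chap-experiment": 15}
```

Keeping a control group the same size as the experiment group is what makes the dashboard comparison meaningful: any difference between the two small groups can be attributed to the injected failures rather than to the diversion itself.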
5. Implementation details
ChAP uses an internally developed system called FIT [14] to cause RPC calls between microservices to fail. FIT is only able to inject two types of failures: an error response and an increase in latency. However, from the point of view of a client making a call to a service, a large number of problems that can occur in an individual service manifest as either an error response or a response delay. Hence, ChAP can model many types of real failures in individual services.

ChAP works by coordinating among many existing systems inside of Netflix. In addition to FIT, ChAP interacts with Hystrix [9] (fault tolerance), Spinnaker [15] (deployment), Eureka [16] (service discovery), Zuul [12] (reverse proxy), Archaius [17] (dynamic configuration management), Ribbon [18] (interprocess communication), Atlas [19] (telemetry) and Mantis [20] (stream processing).
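FIT's two failure types, error responses and added latency, can be illustrated with a small client-side wrapper (a sketch only; this is not FIT's actual API):

```python
import time

# Sketch of client-side failure injection in the style described for FIT:
# a wrapped RPC call can be forced to fail fast or to respond slowly.
# The function names and configuration knobs here are hypothetical.

def inject(rpc_call, failure=None, added_latency=0.0):
    """Wrap an RPC call; `failure` and `added_latency` mimic FIT's two modes."""
    def wrapped(*args, **kwargs):
        if failure is not None:
            raise failure               # mode 1: immediate error response
        if added_latency > 0:
            time.sleep(added_latency)   # mode 2: increased latency
        return rpc_call(*args, **kwargs)
    return wrapped

def get_gallery(user_id):
    return ["Trending Now"]

failing = inject(get_gallery, failure=ConnectionError("injected"))
slow = inject(get_gallery, added_latency=0.05)
```

From the caller's perspective these two modes cover most real failures: a crashed, overloaded, or partitioned downstream service looks like either an error or a slow response.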
6. Current status and future work
ChAP is still under heavy development, with a few teams inside of Netflix currently test-driving the system and providing feedback. Our ultimate goal is to be able to detect automatically whether a service is resilient to failure rather than relying on a human looking at dashboards and making a judgment. We also plan to integrate ChAP into the Spinnaker deployment system so that ChAP experiments can be started automatically as part of the deployment process.

There are failures that FIT (and, hence, ChAP) cannot currently model. We can only inject failures in the request path, in requests that originate from a Netflix client device. In particular, we cannot yet inject failures in calls between services that occur during the startup of a service.

Finally, while we use SPS as our health metric, what we are ultimately concerned about is the user experience. In the future, we hope to use information from client devices to get more accurate information on the impact of a ChAP experiment on a user.
References

[1] S. Newman, Building Microservices. O'Reilly Media, 2015.
[2] J. Hodges, “Notes on distributed systems for young bloods,” blog post, January 2013.
[3] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal, “Chaos engineering,” IEEE Software, vol. 33, no. 3, pp. 35–41, May 2016.
[4] “Principles of Chaos Engineering,” http://principlesofchaos.org, accessed: 2016-07-27.
[5] C. Bennett and A. Tseitlin, “Chaos Monkey released into the wild,” http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html, July 30, 2012, Netflix Tech Blog.
[6] B. Schmaus, “Deploying the Netflix API,” http://techblog.netflix.com/2013/08/deploying-netflix-api.html, August 14, 2013, Netflix Tech Blog.
[7] R. Fielding and J. Reschke, “Hypertext transfer protocol (HTTP/1.1): Semantics and content,” Internet Requests for Comments, RFC Editor, RFC 7231, June 2014.
[8] P. Mell and T. Grance, “The NIST definition of cloud computing,” National Institute of Standards and Technology, Tech. Rep. 800-145, September 2011.
[9] B. Christensen, “Introducing Hystrix for resilience engineering,” http://techblog.netflix.com/2012/11/hystrix.html, November 26, 2012, Netflix Tech Blog.
[10] P. Fisher-Ogden, C. Sanden, and C. Rioux, “SPS: the pulse of Netflix streaming,” http://techblog.netflix.com/2015/02/sps-pulse-of-netflix-streaming.html, February 2, 2015, Netflix Tech Blog.
[11] D. Yuan, Y. Luo, X. Zhuang, G. R. Rodrigues, X. Zhao, Y. Zhang, P. U. Jain, and M. Stumm, “Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems,”