FIXME: Enhance Software Reliability with Hybrid Approaches in Cloud
Jinho Hwang, Larisa Shwartz, Qing Wang, Raghav Batta, Harshit Kumar, Michael Nidd
IBM T.J. Watson Research Center
Yorktown Heights, NY, USA
{jinho, lshwart}@us.ibm.com, {qing.wang1, raghav.batta1}@ibm.com
Harshit Kumar
IBM Research India
New Delhi, [email protected]
Michael Nidd
IBM Research Europe
Zurich, [email protected]
Abstract—With the promise of reliability in cloud, more enterprises are migrating to cloud. The process of continuous integration/deployment (CICD) in cloud connects developers who need to deliver value faster and more transparently with site reliability engineers (SREs) who need to manage applications reliably. SREs feed development issues back to developers, and developers commit fixes and trigger CICD to redeploy. The release cycle is more continuous than ever, thus the path from code to production is faster and more automated. To provide this higher level of agility, cloud platforms become more complex in the face of flexibility, with deeper layers of virtualization. However, reliability does not come for free with all these complexities. Software engineers and SREs need to deal with a wider information spectrum from virtualized layers. Therefore, providing correlated information with true-positive evidence is critical to identify the root cause of issues quickly in order to reduce mean time to recover (MTTR), a key performance metric for SREs. Similarity-, knowledge-, or statistics-driven approaches have been effective, but with increasing data volume and types, an individual approach is limited in correlating semantic relations of different data sources. In this paper, we introduce FIXME to enhance software reliability with hybrid diagnosis approaches for enterprises. Our evaluation results show that the hybrid diagnosis approach is about 17% better in precision. The results are helpful for both practitioners and researchers to develop hybrid diagnosis in the highly dynamic cloud environment.
Index Terms—event management, hybrid system, event correlation, localization, cloud
I. INTRODUCTION
With the promise of reliability, cloud has become more flexible and dynamic to provide continuous software development and deployment. While the contemporary microservices architecture has simplified the scope of software developers through well-defined representational state transfer application programming interfaces (REST APIs), the roles of site reliability engineers (SREs) towards availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning have become even more complex. Container deployments are more dynamic than ever, with lifespans of 10 seconds or less becoming increasingly prevalent, emphasizing the need for real-time visibility that delivers detailed audit and forensics records [1]. The ephemeral and immutable nature of containers is advantageous for development and operations (DevOps), but can simultaneously be challenging for software developers and SREs to correctly diagnose incidents and resolve them in a timely manner. Based on the 2020
Fig. 1: Evolution of server and network virtualization (from applications on physical machines, through virtual machines and containers under a cluster orchestrator, with physical, virtual, and overlay network interfaces)
SRE report, 80% of SREs work on post-mortem analysis of incidents due to lack of provided information, and 16% of toil comes from investigating false positives/negatives [2], [3].

As shown in Figure 1, cloud has adopted more virtualization technologies; in turn, virtualized layers stack up to run applications and the number of applications running in one node increases. This also means that any virtualized layer below applications can have direct impact on running applications. This increased scope of application impact is derived not only from server virtualization, but also from network virtualization. A microservices architecture has driven this, as the number of containers is larger than for the same type of monolithic application. That is, more information such as logs, alerts, and metrics is generated, thus consolidating this information in a meaningful way becomes intractable. As noted in [2], where 41% of SREs answered that half or more of their work is toil, when dealing with high amounts of toil in an organization, the underlying reasons are lack of intelligence and automation. To cope with this hardship, artificial intelligence for IT operations (AIOps) has emerged to help SREs recognize serious issues faster and with greater accuracy than humans. The ultimate objective of AIOps is to minimize mean time to recover (MTTR) by providing software engineers and SREs with localized and correlated information. MTTR spans different stages including mean-time-to-detect (MTTD), -identify (MTTI), -know (MTTK), -repair (MTTRepair), and -resolve (MTTResolve). From the large amount of data, problem determination and information correlation are the keys to starting with the right problem and correct information. Not to mention that this reduces the number of false positives/negatives.

From various data sources such as logs, alerts, metrics, anomaly events, and the like, the premise of information consolidation is that pieces of information have some common elements, appear within a time window, and occur in proximity of location. The two main bodies of research include event correlation [4] and problem localization [5], [6]. By and large, similarity-based, knowledge-based, and statistical approaches have been used to identify patterns and groups, and to filter and prioritize events [4], [5] for both pre-mortem and post-mortem analysis. However, in AIOps, no single approach can always perform better than others because it is extremely hard to identify underlying patterns of different data sources at all levels and also to generalize knowledge obtained from one application to another. In practice, a hybrid or ensemble approach usually results in better correlation [7], [8], [9].

In this paper, we introduce hybrid methodologies for AIOps used in production, and evaluate how they improve problem determination and information correlation.
Our contributions include the following:
• Explore various data sources in cloud native environments and their normalization,
• Find patterns defined by complex predicates in large, constantly changing datasets,
• Introduce hybrid diagnosis approaches for entity resolution and event correlation,
• Incorporate continuous learning with feedback, and
• Evaluate the hybrid diagnosis approaches and the impact of the continuous learning.

II. BACKGROUND AND CHALLENGES
A monitoring system is an automated system that provides an effective and reliable means of ensuring that anomalous behavior or degradation of the vital signs of a hybrid application is flagged as a problem candidate (monitoring event) and sent for diagnosis and resolution. Events from the hybrid environment are consolidated in an enterprise event management system (EM). EM often incorporates rules that define whether to create an incident record (ticket) for IT problem reporting. In some cases, SREs analyze incoming events or symptoms before deciding on creating a ticket. Tickets are collected by an Incident, Problem, and Change (IPC) system. Independently, the information is also collected by a myriad of tools that ingest data from various sources and provide events of their own.

Operations or SREs perform problem determination, diagnosis, and resolution based on the symptom information in the ticket. In interviews conducted with SREs, they identified diagnosis as the most difficult task. The majority of SREs pointed out that given the right diagnosis, they would be able to quickly derive the actions required to resolve the issue. Being able to troubleshoot a problem and to arrive at a diagnosis is often considered to be an innate skill [10]. Problem determination is a labor-intensive process, and Operations/SREs use any help they can get from analytics. There has been a great deal of effort spent on developing methodologies for specifying and reasoning about symptoms/signals provided through monitoring of systems, be they hardware or software.

While a large body of techniques exists, each technique is usually focused on a single type of data (events, log anomalies, or metric analysis). Lately, a number of service providers have embarked on extending their existing analytics to other data types.
The benefit of this approach is shortening time to market through adaptation of well-researched capabilities; one obvious drawback is that the performance of any methodology is optimized for specific data, and it does not perform as well on different data types. In this paper we present an in-depth review of the operational data types and combinations of algorithms for working with each data type to achieve the goal of identifying a group of symptoms with a common root cause and localizing the problem.

Although the methodology we describe is successfully used for identification and linkage of symptoms from a variety of data sources, we are not able to adjust for some changes in the architecture and monitoring strategy. It is imperative to continue improving the models through continuous learning and feedback. We point out that in our experience, SREs do not tend to provide extensive feedback; more often than not their feedback is limited to a thumbs-up/thumbs-down response. Learning from this feedback is a challenge if the insight that they assess is complex. For example, a negative feedback received for a grouping of symptoms is missing critical information about which signals do not belong to the group. In this paper we describe an approach to using this minimal feedback efficiently.

It is a widely accepted fact today that operational data poses additional challenges in comparison to, for example, social data. In this paper we describe the various data types used in operations and the challenges associated with them. To name a few: some IT data is often created using templates by different developers, so the content of data of the same type can vary drastically; and data that represents a symptom of a problem only makes sense in the context of system/application configuration or resources, yet this critical information is often embedded in the text and not provided as a first-class item.

III. RELATED WORK
The main body of related research is information (event or alert) correlation. Prior art for information correlation is categorized into similarity-based, knowledge-based, and statistical approaches. Similarity-based event correlation works on the premise that two events that have similar root causes should be grouped together. The base logic is to compute similarity between the feature spaces of data and measure the score [11], [12], [13]. For example, a weighted sum of feature similarity of two data points is used to measure the score [14]. One of the most popular similarity-based methods is clustering. Clustering divides data into a number of groups such that data in the same group are more similar to each other than to data in other groups. Klaus [15] proposes an alert clustering approach for identification of the root causes that trigger an alert. Vaarandi [16] clusters log event data, which helps one to detect frequent patterns from logs, to build log profiles, and to identify anomalous logs.

Experience-based knowledge often becomes the source of intelligence or automation, which can be turned into rules, templates, or scenarios. Kabiri et al. [17] propose a rule-based temporal alert correlation system that uses an inference engine to aggregate redundant alerts and derive correlations between alerts using a scenario-based knowledge base. Klaus [18] mines rules by learning from previous events, and uses these rules to cluster new incoming events as they occur. Dain et al. [19] map each incoming event to one of manually prepared predefined scenarios or patterns, where each scenario represents a sequence of actions. Another type of knowledge-based approach is expert systems. Expert systems aim at reproducing the performance of a human expert.
The approaches under expert systems include building a simulator that generates alerts [20] or learns alert scenarios [21], [22], [23], [24], [25] on software applications to simulate various faults; further, models are proposed for an event correlator [26], [27] that correlates related events.

Statistical traits find latent characteristics of data and join them in meaningful ways towards the objective. Much prior art uses machine learning algorithms to find patterns of data points and repetition patterns [28]. Among them, Dain et al. [29] mine scenarios from historical events, and classify each incoming event into one of the candidate scenarios using probabilistic methods. Smith et al. [30] use a hierarchical unsupervised machine learning structure to identify the first and second levels of groups that bubble up. They also use an auto-encoder to learn the event distribution and to correlate events together. Pietraszek et al. [28] propose an ensemble approach that uses various machine learning algorithms such as support vector machines, decision trees, and naïve Bayes to suppress false positives. Peter et al. [31] define activation patterns that store information in an associative network graph and perform a graph analysis to find common patterns based on nominal and distance-based features.

To the best of our knowledge, we are the first to investigate a hybrid diagnosis system that spans all of the similarity-based, knowledge-based, and statistical approaches for entity resolution and event correlation.

IV. DATA
As a rule of thumb when building AI systems, results can only be as good as the quality of the data. Understanding the types of data and guaranteeing data quality are the first steps towards a better AI system. This section defines terminologies and discusses details about input data.
A. Terminology

An event indicates that something of note has happened and is associated with one or more applications, services, or other managed resources. For instance, a container has moved to a new node, a column was added to a DB table, a new version of an application is deployed, or memory or CPU is exhausted. One or more events can turn into any form of alerts or anomalies based on the deviation from what is defined as standard, normal, or expected. Anomaly detection (a.k.a. outlier analysis) is a step in data mining that identifies events and/or observations that deviate from a dataset's normal behavior. Anomalous data can indicate critical incidents, such as a technical glitch, or potential opportunities, for instance a change in consumer behavior.

An alert is a record (type) of an event indicating a (fault) condition in the managed environment. It requires, or will require in the future, human or automatic attention and actions toward remediation. For instance, a disk drive failure or a network link down could be alerts. An incident represents a reduction in the quality of a business application or service. It is driven by one or more alerts. Incidents require prompt attention. For instance, an unresponsive application or an inaccessible storage array could be serious outages. A ticket is an actionable or incident embodied in a service desk tool, where the client has decided to use one. A change record is a description of a change that should be deployed, including configuration changes, code changes, security updates, and so on.

B. Logs / Metrics
Logs and metrics are two fundamental data sources generated from every level of components as shown in Figure 1. A log records an event that happened, and a metric is a measurement of the health of a system. In general, logs include informational, debug, warning, error, and critical levels depending on the severity. In production systems, only warning, error, and critical logs may be collected. Each log line includes details about the event such as a resource that was accessed, who accessed it, and the time. Each log is meant to have different sets of data so that problem localization can be obvious.

While logs are about a specific event, metrics are a measurement at a point in time. Each metric data point can have a value, a timestamp, and an identifier of what that value applies to (like a source or a tag). While logs may be collected any time an event takes place, metrics are typically collected at fixed-time intervals, and are thus called time-series metrics. A sudden rise or spike of CPU or memory utilization renders alerts. However, without logs, it is hard to understand what causes such spikes. Therefore, bringing together both logs and metrics can provide much more clarity. During spikes, there may be some unusual log entries indicating the specific event that caused the spike.

From logs and metrics, SREs look for important signals that can lead to the problem diagnosis and potentially resolution. Broadly, four golden signals are known as the most helpful signals (events) (see https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems):

Latency: Latency is the time it takes to service a request. While latency can be captured for both successful requests and failed requests, it is important to differentiate between them. The failed requests may increase the overall latency, but this may not necessarily be the latency of the system. At the same time, a slow error is even worse than a fast error, so it is important to track error latency, as opposed to just filtering out errors.

Fig. 2: Alerts over 1 year with severities for a production application
Traffic: Traffic is a measure of demand for the system, measured in a high-level system-specific metric. For instance, too many HTTP requests to a web server or API may result in additional stress on the system, triggering downstream effects. The traffic signal helps differentiate capacity problems from improper system configurations that can cause problems even during low traffic.
Errors: This is the rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (e.g., an HTTP 200 success response, but coupled with the wrong content), or by policy (e.g., violating a service level objective (SLO)). It is useful to diagnose misconfigurations in your infrastructure, bugs in your application code, or broken dependencies.
Saturation: Saturation is the load on the system resources (e.g., CPU utilization, memory usage, disk capacity, and operations per second). Note that many systems degrade in performance before they reach 100% utilization, so having a utilization target is essential. Since parts of the system become saturated first, these metrics are often leading indicators, so capacity can be adjusted before performance degrades.
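As a rough sketch of how these four signals might be computed from raw request records; the data, thresholds, and field layout here are illustrative assumptions, not from a real monitoring stack:

```python
from statistics import mean

# Hypothetical request records: (latency_ms, http_status)
requests = [
    (120, 200), (95, 200), (4000, 500), (110, 200),
    (102, 200), (3800, 500), (98, 200), (105, 200),
]

def golden_signals(requests, window_seconds=60, cpu_util=0.87, cpu_target=0.80):
    ok = [lat for lat, status in requests if status < 500]
    err = [lat for lat, status in requests if status >= 500]
    return {
        # Latency: tracked separately for successes and failures,
        # since slow errors are worse than fast errors.
        "latency_ok_ms": mean(ok) if ok else None,
        "latency_err_ms": mean(err) if err else None,
        # Traffic: demand on the system (requests per second).
        "traffic_rps": len(requests) / window_seconds,
        # Errors: rate of explicitly failed requests.
        "error_rate": len(err) / len(requests),
        # Saturation: alert before reaching 100% utilization.
        "cpu_saturated": cpu_util >= cpu_target,
    }

signals = golden_signals(requests)
```

Keeping error latency separate from success latency, as in the sketch, is what allows a slow-error regression to surface instead of being averaged away.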
C. Events / Alerts
While the main sources of operational data are metrics and logs (§IV-B), based on rules or algorithms, they become events. Other artifacts such as configuration, security, code changes, and the like can also be sources of events. Anomalous events—again, based on rules or algorithms—become alerts. As shown in Figure 1, every component in the virtualized layers can produce events, and parts of them can become alerts (i.e., important error signals). In many cases, a single alert cannot tell what went wrong exactly, so correlating information with patterns would likely lead to a better diagnosis. For example, a spike in error rate could indicate the failure of a database or a network outage. Also, following a code deployment, it could indicate bugs in the code that somehow survived testing or only surfaced in the production environment.

As an example, Figure 2 shows 1 year's worth of alerts (only escalated events) generated from one production application that is composed of 26 microservices running on Kubernetes clusters. The alerts include 10,399 (75.79%) critical, 367 (2.68%) error, 3 (0.02%) warning, and 2,951 (21.51%) info alerts.
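A minimal sketch of the rule-based step that turns raw metric samples into events and escalates only the critical ones into alerts; the thresholds and data are hypothetical, not a real rule set:

```python
# Hypothetical rule: CPU samples crossing a warning threshold become
# events; events crossing the critical threshold become alerts.
cpu_samples = [35, 40, 92, 95, 97, 41, 38, 99]

def to_events(samples, warn=80, crit=95):
    events = []
    for i, v in enumerate(samples):
        if v >= crit:
            events.append({"t": i, "value": v, "severity": "critical"})
        elif v >= warn:
            events.append({"t": i, "value": v, "severity": "warning"})
    return events

def to_alerts(events):
    # Only critical events are escalated (i.e., important error signals).
    return [e for e in events if e["severity"] == "critical"]

events = to_events(cpu_samples)
alerts = to_alerts(events)
```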
D. Incidents / Change Records
A manually or automatically created incident serves as a formal record to log all the information relevant to an issue and its resolution. An incident record typically captures information like: id, title, description, opened date, severity, impacted configuration item(s), outage start time, outage end time, state (resolved, open), resolution description, and a change ID to represent that the incident has been "caused by change". A change record captures key attributes for a change like: id, title, description, purpose, environment, request date, start date, end date, team, state (open, closed), closure code (successful, failed, induced issues), backout plan, close notes, and configuration items for resources or images associated with the change ticket. Incidents and change records are often used to predictively measure the potential risk of a submitted change (not yet deployed) or to reactively find which images have been problematic.

Any changes—for instance, performed to add new features, address vulnerabilities, or improve the performance of the system—can also induce alerts/incidents. Any changes to the software are generally made by modifying the source code, rebuilding the images, and redeploying the newer version of the images to containers (§IV-E). So, when an alert or group of alerts is produced on any component of the virtualized layer, checking for recent changes deployed on the component can help in the root cause analysis.

Observer tools that monitor the CICD pipeline can help link changes directly to the DevOps workflow, but in the absence of such a mechanism, we have to extract references to image names from the change tickets. While in some cases information about the image(s) deployed by a change can be mentioned in structured fields of the change ticket, in most cases the reference to images is hidden in the unstructured text fields.
We map change tickets to topology objects by mining the references to image names from the change text. We first look for direct mentions of image names in the change tickets. If direct mentions are not identified, then we search for references to image tags, as it is unlikely for two images to have the same tag. If both of these return no matches, then we search for changes similar to the current change ticket and add the image reference if the similar changes have any images mapped to them.

Based on the above image mapping, we identify the virtual topology objects associated with each change ticket and add change ticket references to the alert(s) mapped to the same topology object through entity resolution (§V-A) if the change was recently deployed before the alert was generated.
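The three-step fallback above can be sketched as follows; the image names, tags, and change texts are hypothetical, and a production system would match against the topology service's actual inventory:

```python
import re

# Hypothetical known images and tags from the topology service.
known_images = {"orders-api", "cart-svc"}
known_tags = {"v2.3.1", "v1.0.9"}

def map_change_to_image(change_text, similar_changes=()):
    # 1) Direct mention of a known image name in the unstructured text.
    for image in known_images:
        if re.search(rf"\b{re.escape(image)}\b", change_text):
            return image
    # 2) Fall back to an image tag (two images rarely share a tag).
    for tag in known_tags:
        if tag in change_text:
            return f"<image-with-tag:{tag}>"
    # 3) Fall back to images mapped to similar past change tickets.
    for past in similar_changes:
        if past.get("image"):
            return past["image"]
    return None

result = map_change_to_image("Redeploy orders-api to fix timeout")
```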
E. Virtual Topology
In legacy systems, a (physical) node is often equivalent to an application (or multiple processes running in the same node) or one database, and its physical interfaces represent connectivity with other nodes (applications). In today's cloud systems, a physical node is virtualized to run multiple virtual machines (VMs); even further, each VM is virtualized to run multiple containers. Also, the virtualized network provides virtual interfaces that support overlay or hybrid protocols. Figure 1 illustrates the evolution from legacy systems (left) to contemporary cloud systems (right).

TABLE I: Entity and edge semantic relationships in topology service
Node types: application, backplane, bridge, card, chassis, command, component, container, cpu, database, directory, disk, emailaddress, event, fan, file, firewall, fqdn, group, host, hsrp, hub, ipaddress, loadbalancer, location, networkaddress, networkinterface, operatingsystem, organization, path, person, process, product, psu, router, rsm, sector, server, service, serviceaccesspoint, slackchannel, snmpsystem, status, storage, subnet, switch, tcpudpport, variable, vlan, volume, vpn, vrf
Edge (Aggregation): contains, federates, members
Edge (Association): aliasOf, assignedTo, attachedTo, classifies, configures, deployedTo, exposes, has, implements, locatedAt, manages, monitors, movedTo, origin, owns, rates, resolvesTo, realizes, segregates, uses
Edge (Data flow): accessedVia, bindsTo, communicatesWith, connectedTo, downlinkTo, reachableVia, receives, routes, routesVia, loadBalances, resolved, resolves, sends, traverses, uplinkTo
Edge (Dependency): dependsOn, runsOn
Edge (Metadata): metadataFor

To correctly diagnose a problem, identifying where alerts are generated is the key to understanding the problem correctly. The systems are often represented as topological graphs that have nodes (i.e., any box in Figure 1) and edges (i.e., vertical or horizontal connectivity). Table I shows the types of nodes and edges representing cloud systems: 52 node types and 41 edge types. The semantic relationships among nodes and edges are quite complex, as shown in Figure 3. Therefore, identifying 'correct' topological entities (§V-A) and correlating alerts derived from them (§V-C) are the main stepping stones for SREs to tackle the problem.
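A toy illustration of such a topological graph, using a few of the node and edge types from Table I (the entity names are made up), with a small bounded-hop scope query of the kind used to decide which alerts might share a root cause:

```python
# Illustrative topology as (source, relation, destination) triples,
# using edge types from Table I; entity names are hypothetical.
topology = [
    ("app-frontend", "dependsOn", "deploy-frontend"),
    ("deploy-frontend", "manages", "replicaset-frontend"),
    ("replicaset-frontend", "manages", "pod-frontend-1"),
    ("pod-frontend-1", "contains", "container-nginx"),
    ("container-nginx", "runsOn", "server-7"),
]

def neighborhood(entity, hops=2):
    """Entities reachable within `hops` edges (either direction),
    used to scope which alerts could relate to the same problem."""
    frontier, seen = {entity}, {entity}
    for _ in range(hops):
        nxt = set()
        for src, _rel, dst in topology:
            if src in frontier and dst not in seen:
                nxt.add(dst)
            if dst in frontier and src not in seen:
                nxt.add(src)
        seen |= nxt
        frontier = nxt
    return seen

scope = neighborhood("container-nginx", hops=1)
```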
F. Enrichment
Artifacts like alerts or change tickets generally make reference to topological entities to which they relate. Knowing which topological entities are involved is necessary for problem localization, and is a general assistance to any correlation effort. These entities are not always explicitly listed in a well-formatted way, although they sometimes are. They may be included in a JSON object that has been converted to a string in an embedded field that is standard in a particular deployment, referenced by name in a field provided by a performance monitor template, or identified by IP address in the free text of the description.

Logs and metrics are two key sources of information that an application generates, irrespective of whether it is in a healthy state or an unhealthy state. Moreover, the logs generated by an application are in a finite space such that they can be mined and mapped to a set of template-ids. In order to use logs for event correlation, each log line is processed and templatized using a pipeline called the log-template pipeline. The log-template pipeline consists of two components: error classification of a log line followed by templatization of the log line. When any event is generated, its corresponding log lines that are within a fixed duration of the start of the event are fetched. Each log line from the set of log lines is input to a pre-trained classifier; the output of the classifier is 0 (erroneous) or 1 (non-erroneous). The error classifier allows us to separate the erroneous log lines from the log lines pertaining to the healthy state of the system and the corresponding microservice. The outcome of the error classifier is a subset of the total log lines, which is then input to the next step of the pipeline, the template miner. A template miner is pre-trained on millions of log lines and can map a log line to a template id. For each erroneous log line, we obtain its corresponding template-id from the template miner, thus yielding a set of template-ids for all log lines. Each event is enriched with its entities and a set of template-ids. The entities and template-ids contained in the enriched event are used at a later stage for event correlation.

Fig. 3: Topological entities and relationships from a cloud deployment
G. Normalization
Various data sources (events) flowing into FIXME are normalized or standardized into the same format in order to increase the cohesion of data types and reduce redundancy. Having looked through many data sources, we have found that the following information is enough to run all the algorithms: title, description, created at, resolved at, severity, source, and features. Note that features are expandable to accommodate any unique information from any data source as an object, such as name, URL, alert id, team, and application.
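A sketch of normalizing two hypothetical source payloads into this common schema; the raw field names of the source payloads are assumptions for illustration:

```python
def normalize(raw, source):
    """Map a source-specific payload onto the common event schema."""
    if source == "log_anomaly":
        return {
            "title": raw["summary"],
            "description": raw.get("detail", ""),
            "created_at": raw["ts"],
            "resolved_at": raw.get("resolved_ts"),
            "severity": raw.get("level", "warning"),
            "source": source,
            # `features` is an open object for source-specific fields.
            "features": {"template_ids": raw.get("template_ids", [])},
        }
    if source == "monitoring":
        return {
            "title": raw["alert_name"],
            "description": raw.get("message", ""),
            "created_at": raw["firing_since"],
            "resolved_at": None,
            "severity": raw["severity"],
            "source": source,
            "features": {"team": raw.get("team")},
        }
    raise ValueError(f"unknown source: {source}")

event = normalize({"alert_name": "HighCPU", "firing_since": 1700000000,
                   "severity": "critical"}, "monitoring")
```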
V. METHODOLOGY
A. Hybrid Entity Resolution
FIXME uses multiple sources for operational events. Some are more easily associated with a specific topological entity than others. For example, if our log anomaly detection subsystem creates an event to alert operations of unusual behavior, then it can label that event with the exact container that generated the log being observed. If, on the other hand, we import an alert from an external alert management system, which in turn created it in response to a level-crossing observed by a monitoring tool, matching it with the topology is harder.

The first challenge is that these alerts are often created manually, and how they identify the resource depends on local standards. To allow the embedded information to be used for locating the resources, we have defined a "matcher" language (template) for describing how these locations may be embedded in the raw alert object. Alert management systems have some standard fields, so our starting point is a standard extraction file that checks these known places. Customers who follow local practices, such as a standard name-value pair in the title of the alert, can add this to the template rules file.

In addition to this template-based approach, we also use a dictionary-based approach to scan for things that "look like" entity references. Specifically, we know the domain name server (DNS) in use in the subject environment. We also know a number of names used by Docker images and Kubernetes objects in this environment. These are all partial names, not knowing every host name or every image tag, but combining the names that we do know with regular expressions for how they generally appear allows us to search for likely references. To adapt to the changing environment, we use an internal query system that allows us to compile a rapid regular-expression matcher for fast application, while still allowing the specific list of names (dictionary) to be updated dynamically.

The second challenge is that resources may be identified using different methods, and no single method will work for all resources. In our case, we use resource IDs (e.g., UUIDs) assigned by a topology service when the topological object is added to its database, but an external monitoring tool will not know these IDs. Kubernetes has unique object IDs, but they are also generally invisible to an external monitor. IP addresses and DNS names identify an endpoint, but that is often just a front end, such as a Kubernetes ingress or a load balancer. Once we have used the rules file to get a value that should identify an object in the topology, we then use a search to resolve that reference to an object ID. Topology searches need to identify the time at which to search, so we use the creation timestamp on the alert. If the alert was actually a result of the object being deleted, then it did not exist at the creation time of the alert, so we will not resolve that object. In practice, this does not appear to be a significant problem.
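A minimal sketch of the dictionary-based matcher: the partial names we do know are compiled into a single regular expression that can be rebuilt whenever the dictionary changes. The names and DNS domain here are illustrative assumptions, not the real internal query system:

```python
import re

# Hypothetical dictionary of known partial names and local DNS suffix.
known_names = ["checkout", "payments"]
dns_domain = r"apps\.example\.com"

def compile_matcher(names, domain):
    name_alt = "|".join(map(re.escape, names))
    # Either a DNS-style reference under the known domain, or a bare
    # image-name reference with an optional tag.
    return re.compile(
        rf"\b(?:{name_alt})(?:-[a-z0-9]+)*\.{domain}\b"
        rf"|\b(?:{name_alt})(?::[\w.-]+)?\b"
    )

matcher = compile_matcher(known_names, dns_domain)
hits = matcher.findall(
    "level-crossing on checkout-api.apps.example.com, image payments:1.4"
)
```

When the dictionary is updated, `compile_matcher` is simply re-run, which mirrors the idea of a fast compiled matcher over a dynamically updated name list.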
B. Learning Patterns
The same types of problems tend to happen repeatedly, and capturing them as patterns helps problem determination. In this section, we introduce various ways to capture patterns for the given datasets.
Association rule mining: one of the most important data mining algorithms, it aims to discover interesting relationships latent in large datasets [32]. A typical and widely known application of association rules is market basket analysis, learning from sales transactions. The strategy of association rule mining is composed of two phases [33]:
• Frequent Itemset Generation: The objective of this phase is to find all the itemsets, called frequent itemsets, whose items occur together more often than a minimum support threshold (min_sup) in transactions.
• Rule Generation: This phase extracts all the high-confidence rules (i.e., strong rules), based on the minimum confidence threshold (min_conf), from the frequent itemsets generated in the first phase.
Fig. 4: An example of converting sequential events to event transactions.

In this example, three types of events (i.e., e1, e2, and e3) are monitored and reported from time t_start to time t_end. A fixed time window (e.g., 10 mins) is applied to generate event transactions from the sequential events monitored from t_start to t_end (seen in Figure 4). Each row in the table corresponds to a transaction containing a unique identifier labeled TID and a set of events in a time window. In this example, let E = {e1, e2, e3} be the set of all event types and T = {t1, t2, t3} be the set of all transactions. Supposing min_sup = 50% and an appropriate min_conf, we can uncover the frequent itemset {e1, e2} and the strong rule {e1 → e2} through the following calculations:

Support({e1, e2}) = σ({e1, e2}) / |T| > min_sup, where σ(X) = |{t_i | X ⊆ t_i, t_i ∈ T}|, and
Confidence(e1 → e2) = Support({e1, e2}) / Support({e1}) > min_conf.

The uncovered correlations (i.e., frequent itemsets and strong rules) can form a basis for decision making and prediction to support hybrid cloud management tasks such as event correlation, anomaly detection, fault localization, and the like.
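The two phases can be sketched in brute-force form as follows. This is an illustrative toy (a real apriori implementation prunes candidates using the downward-closure property rather than enumerating all itemsets), and the events e1–e3 are hypothetical:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Phase 1: enumerate itemsets whose support meets min_sup."""
    items = sorted({e for t in transactions for e in t})
    n = len(transactions)
    freq = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            sup = sum(1 for t in transactions if set(cand) <= t) / n
            if sup >= min_sup:
                freq[frozenset(cand)] = sup
    return freq

def strong_rules(freq, min_conf):
    """Phase 2: keep rules X -> Y whose confidence meets min_conf."""
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                conf = sup / freq[frozenset(lhs)]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset) - set(lhs), conf))
    return rules

# Toy transactions: e1 and e2 co-occur in 2 of 3 time windows.
T = [{"e1", "e2"}, {"e1", "e2", "e3"}, {"e3"}]
freq = frequent_itemsets(T, min_sup=0.5)
rules = strong_rules(freq, min_conf=0.9)
```

On this toy data the only multi-item frequent itemset is {e1, e2}, yielding the strong rules e1 → e2 and e2 → e1.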
Log templates: Two events may or may not have a similar description; however, if the underlying logs are similar, then the events are most likely related to each other. This is the key hypothesis of using logs for event correlation. Each application consists of several microservices, and some of these services are related to other services, forming a graph. If one service fails, then any other service upstream or downstream of the failed service could throw error log lines. It is important to identify error log lines for each failed microservice during an execution of an application and collate them together to form a log signature for a particular event. In order to use logs for event correlation, each log line is processed and templatized, and the results are collated to form a log signature for each event. To recall from §IV-F, the log-template pipeline consists of two components: error classification of a log line and templatization of a log line. As a result, for a given event, there is a set of template ids and corresponding application ids. We propose a log-signature representation for each event from its template ids and corresponding application ids, and use that for event correlation. The example below shows a log signature for an event; there are three log template ids—template_id_a, template_id_b, and template_id_c; two log template ids (template_id_a and template_id_b) belong to application_id_a, and one log template id (template_id_c) belongs to application_id_b. This representation is called the log signature of an event.

{ "templates": [
    {"application_id": "application_id_a", "template": "template_id_a"},
    {"application_id": "application_id_a", "template": "template_id_b"},
    {"application_id": "application_id_b", "template": "template_id_c"}
]}

Fig. 5: Silhouette coefficients in case of 15 clusters.

Fig. 6: t-SNE for word embeddings on alerts and 15 clusters.
Once we have a log signature for each event, the similarity between two events is calculated by computing the overlap between their application ids; for each application id that overlaps, the overlap between their respective template ids is computed to produce a score called the log template similarity score. For event grouping using template clustering, the algorithm starts with a list of groups, where each group can contain one or more events. Next, it computes the similarity between each pair of groups using the log template similarity, as explained above. This produces pairs of groups with a similarity score between them; the pair with the highest similarity score above a threshold is taken, and the groups in the pair are merged into one single group. The process is repeated until there is only one group left or the highest similarity score between groups falls below the threshold.
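This grouping loop can be sketched as follows. The scoring function here is a hypothetical Jaccard-style overlap (the exact production formula is not specified in the text), and the event signatures are illustrative:

```python
def signature_similarity(sig_a, sig_b):
    """Overlap of application ids, then of template ids per shared app.

    Each signature maps application_id -> set of template ids.
    This is an assumed Jaccard-style scoring, not the exact formula.
    """
    shared_apps = set(sig_a) & set(sig_b)
    if not shared_apps:
        return 0.0
    app_overlap = len(shared_apps) / len(set(sig_a) | set(sig_b))
    tpl_scores = [
        len(sig_a[app] & sig_b[app]) / len(sig_a[app] | sig_b[app])
        for app in shared_apps
    ]
    return app_overlap * sum(tpl_scores) / len(tpl_scores)

def merge_groups(groups, threshold):
    """Greedily merge the most similar pair of groups until no pair
    of groups exceeds the threshold (each group is a list of signatures)."""
    def group_sim(g1, g2):
        return max(signature_similarity(a, b) for a in g1 for b in g2)
    while len(groups) > 1:
        i, j = max(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda p: group_sim(groups[p[0]], groups[p[1]]),
        )
        if group_sim(groups[i], groups[j]) < threshold:
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

ev1 = {"app_a": {"t1", "t2"}}
ev2 = {"app_a": {"t1", "t2"}, "app_b": {"t3"}}
ev3 = {"app_c": {"t9"}}
merged = merge_groups([[ev1], [ev2], [ev3]], threshold=0.3)
```

Here ev1 and ev2 share application and template ids, so their groups merge; ev3 shares nothing and remains its own group.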
Word embeddings: Application events, unlike network events, are often unstructured in that the data descriptions or log messages are written in natural language, so natural language processing helps in understanding the patterns of the events. Because learning word embeddings is effective at capturing the same or similar representations for words that have the same meaning [34], the fastText (https://fasttext.cc) word embedding (vector size = 300) has been applied to the alert descriptions to train the embedding model. Later, this model turns alert descriptions into learned representations. In order to identify how the embeddings are clustered, balanced iterative reducing and clustering using hierarchies (BIRCH) is applied to the embeddings [35]. The silhouette coefficient is a measure in [−1, 1] of how close each point in one cluster is to points in the neighboring clusters, and thus provides a way to assess the goodness of clusters. Figure 5 shows the silhouette coefficients. The average silhouette coefficient is maximal when the number of clusters is 15, meaning the quality of clustering is best when 15 clusters are formed. This is further evaluated in §VI-D. Also, for the same number of clusters, Figure 6 shows the clustering space in 2 dimensions with t-SNE, a technique for non-linear dimensionality reduction and visualization of multi-dimensional data.

TABLE II: Qualitative comparison for different approaches

                                 Similarity   Knowledge   Statistical
Group alerts from diff. sources  O            O           X
Require prior knowledge          O            O           X
Detect false alerts              O            O           Guess
Detect multi-stage incidents     X            O           Guess
Find new incidents               O            X           O
Error rate                       Mid          Low         High
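The silhouette coefficient itself is easy to compute directly; the sketch below is a simplified pure-Python version for one-dimensional points (a production pipeline would more likely use a library implementation such as scikit-learn's silhouette_score):

```python
def silhouette(points, labels):
    """Mean silhouette coefficient in [-1, 1] for a labeled 1-D clustering.

    a(i): mean distance to other points in the same cluster;
    b(i): lowest mean distance to the points of any other cluster;
    s(i) = (b - a) / max(a, b).
    """
    def dist(p, q):
        return abs(p - q)
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        a = sum(same) / len(same) if same else 0.0
        other = {}
        for j, q in enumerate(points):
            if labels[j] != labels[i]:
                other.setdefault(labels[j], []).append(dist(p, q))
        b = min(sum(ds) / len(ds) for ds in other.values())
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Two tight, well-separated clusters score close to 1.
pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
lbl = [0, 0, 0, 1, 1, 1]
score = silhouette(pts, lbl)
```

Sweeping the number of clusters and keeping the count with the highest mean silhouette is exactly the model-selection step that yields 15 clusters above.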
C. Hybrid Correlation
As briefly discussed in §I and §III, similarity (e.g., rules, patterns), knowledge (e.g., scenarios, knowledge bases), and statistical (e.g., statistical estimation, causal relation estimation, reliability degree combination) approaches are generally used for information correlation [4], [5]. Table II compares the three approaches qualitatively. No one approach is perfect, but a hybrid approach can complement their deficiencies. FIXME uses a mix of different methods to make a correct verdict on the correlated information: for similarity, rules (time, spatial, prior information) and patterns/templates (§V-B); for knowledge, a knowledge base (capturing causal inference from SREs or root cause analysis) and feedback (§V-E); and for statistical, machine learning algorithms (e.g., association rule mining) and clustering (§V-B). Figure 7 illustrates some of the methods in the time (y-axis) and spatial (x-axis) dimensions. Three node shapes represent spatial information (§IV-E): physical nodes, VMs, and containers, which have vertical "runsOn" relationships and horizontal "dependsOn" relationships (Table I). Three different glyphs represent the three types of alerts. Dotted outlines describe correlated alert groups made by the different methods. The temporal group is based on time, given that alerts are generated from the same topological entity. Similarly, the spatial group considers both time and space, where entities have a special semantic relationship (in this case "dependsOn"). The rule group is derived from SREs' input, the apriori group is based on the patterns learned from association rule mining, and the similarity of the log-template group is measured by the templates of alerts (logs). We demonstrate the effectiveness of this ensemble approach in §VI-C.

D. Localization
The correlated alert group itself explains the problem, but often it is not enough for software developers and SREs to understand the root cause immediately, especially when the group includes multiple alerts from different locations. Additional explainability helps localize the problem and understand its impact. Treating alerts in the group as error signals, and knowing the topological entities and their relationships, enables reasoning to find the root entity of the problem and its impact on the related entities.

Fig. 7: Illustration of correlated alert groups with different methods

In Figure 7, the dotted localization outline shows the localized entity given the correlated group based on the apriori method, and the dotted blast-radius outline shows the impact of the problem. While SREs know the architecture of their application well, the localization and blast radius help SREs visualize the problem together with the descriptions of the alerts. FIXME uses a Souffle reasoner to traverse topological entities with error signals and outputs topological entities for localization and blast radius.

E. Continuous Improvement (Feedback)
Feedback is an essential part of continuous learning for AIOps systems, but collecting feedback is challenging for the following reasons: first, collecting a large amount of feedback is hard, as SREs are not committed or incentivized; second, correctness or consistency is not guaranteed due to ambiguity from different experiences or skill sets; third, details are often missing, as the most practical collection methods are yes/no, thumbs up/down, or at best a drop-down with a short list of selections. Especially since the ChatOps interface (the main interface for FIXME) flows along a timeline rather than a dashboard, it is harder to wait for SREs to finish their work and come back to provide feedback. Therefore, learning from minimal feedback (i.e., active learning) is important. The feedback is expected to improve the correctness of correlation and suppress false positives. FIXME uses a content-based approach that leverages clustering split and merge operations based on the word embedding [36] (§V-B). Applying feedback improves the correctness of rules, patterns, and the knowledge base.

VI. EVALUATION
FIXME enables hybrid diagnosis approaches on various data types to correctly identify entities and group information, so that software engineers and SREs can diagnose problems without spending too much time on post-mortem analysis. In this section, we evaluate FIXME with the following goals:
• Demonstrate a hybrid approach for data enrichment with entity resolution and show its effectiveness (§VI-B),
• Show how hybrid correlation approaches improve correctness (§VI-C), and
• Analyze how minimal feedback can improve the correlation results (§VI-D).

https://github.com/souffle-lang/souffle

A. Setup
A total of 12,696 alerts (1 year) have been obtained from the product team that manages a software as a service (SaaS) application in IBM Cloud. Our algorithms have been applied to 1,134 alerts covering 1 month, identifying 382 issues (groups): 141 issues have more than 1 alert, and the remaining 241 include only a single alert. This dataset has been shared with developers and SREs to get feedback on whether each alert belongs to the correct group or not. The remaining 11,565 alerts have been used for training to learn patterns. In addition, a snapshot of the virtual topology running on Kubernetes has been shared; it includes 516 nodes, 755 deployments, 1,532 pods, and 801 services.

B. Hybrid Entity Resolution
Our two main extraction techniques, described in §V-A, are template-based and dictionary-based approaches. As the labeled dataset does not include ground truth for the topological entities, we use sampling and manual labeling. When deciding which to incorporate into the final system, we start with a random sample of 10 events from the dataset and manually identify 16 useful topological references that they contain (true positives). The template-based approach finds 9 and the dictionary-based approach finds 10, so the combined approach returns 15 correct (93% accuracy). However, only 4 are found by both. Although it is a small set, the results gave us a clear indication that neither technique alone would likely be sufficient.
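The combination of the two extractors amounts to a set union of their candidate references. The sketch below mirrors the sampled counts (9 template hits, 10 dictionary hits, 4 found by both, 15 in the union) using placeholder names, since the real entity names are confidential:

```python
# Hypothetical reference names stand in for the confidential entity names.
template_hits = {f"ref{i}" for i in range(1, 10)}    # template-based: 9 refs
dictionary_hits = {f"ref{i}" for i in range(6, 16)}  # dictionary-based: 10 refs

combined = template_hits | dictionary_hits  # hybrid = union of both extractors
overlap = template_hits & dictionary_hits   # references found by both
```

With 16 manually identified true references, the union recovers 15, while either extractor alone recovers well under two-thirds, which is why the hybrid is used.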
C. Hybrid Correlation
As described in §V-C, hybrid correlation is expected to benefit correctness. In this section, we show how applying various methods helps improve performance.
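For reference, the precision, recall, and F1 metrics used throughout this section follow the standard definitions. The sketch below plugs in the baseline counts reported for temporal and spatial grouping (825 TP, 68 + 241 = 309 FP), treating the uncollected false negatives as zero:

```python
def precision_recall_f1(tp, fp, fn):
    """Standard correctness metrics; recall defaults to 1.0 when no
    false negatives were collected (as in the baseline evaluation)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn) if (tp + fn) else 1.0
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Baseline: 825 correct, 68 incorrect + 241 not-applicable counted as FP.
p, r, f1 = precision_recall_f1(tp=825, fp=68 + 241, fn=0)
```

Because false negatives are unknown, precision is the meaningful figure here; recall is trivially 1.0 under this convention.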
Temporal and spatial: Without learned models that help identify similarity, knowledge, or statistical patterns, time and space are the base attributes for correlation. Out of 1,134 alerts (forming 382 groups), based on time and space, 825 alerts are labeled as correct (TP), 68 as incorrect (FP), and 241 (a single alert in a group) as not applicable (FP). Since the data is not necessarily associated with problems, feedback about what additional data should be included (FN) has not been collected. Therefore, Precision (P) = TP / (TP + FP) = 825 / 1134 ≈ 0.73, Recall (R) = TP / (TP + FN) = 1.0, and F1 Score = (2 × R × P) / (R + P) = (2 × 1.0 × 0.73) / (1.0 + 0.73) ≈ 0.84. Given the labeled data without knowing true/false negatives, precision is the most important measure of correctness. This is our baseline performance.

(The application name and data are considered IBM confidential, so they are not included in this paper. TP: true positive; FP: false positive; FN: false negative.)

Fig. 8: Running time for generating frequent itemsets and association rules at different minimum support levels
Fig. 9: The number of frequent itemsets and association rules at different minimum support levels with fixed min_conf

Association rule mining: In this section, we explain how we apply apriori, one of the most popular association rule mining algorithms, to find frequently co-occurring alerts (i.e., events) in our dataset, which are generated from 73 sources (i.e., pods, containers, nodes). Our goal is to find associations among these alerts for each source. As detailed in §V-B, we first derive alert transactions from the alert sequences and use apriori to generate frequent itemsets and rules for each source. Since no single source is large enough, 5 mins is used as the fixed time window, which finally yields 852 transactions in total. The maximum number of transactions per source is 52 and the average number is 12. The running time for generating frequent itemsets with minimum supports from 0.2 to 0.9 is shown in Figure 8. Figure 9 indicates that the number of frequent itemsets and the number of association rules increase as the minimum support is reduced on this data. In order to reveal how apriori can improve the correlation performance, we select a minimum support threshold of 0.3 (min_sup = 0.3); there are 4 L2 and 2 L3 frequent itemsets, where Lk indicates itemsets of length k, among all 52 generated frequent itemsets. Based on the feedback from SREs, all the learned frequent itemsets are correct groups. Using these frequent itemsets (i.e., clusters), we split a group into multiple smaller groups or merge multiple clusters into a larger one in order to improve the accuracy of the groups. As expected, 31 alerts, out of the 68 incorrectly labeled alerts from the temporal and spatial group results, are split and merged into other groups or become a new correct group; the precision of correlation improves to 0.78, while F1 improves by 3 percentage points.

Log-template grouping: In this section, we want to verify the hypothesis that two similar alerts may or may not have lexically matching incident descriptions, but their logs should have high overlap, and that log overlaps are discriminative. In order to validate the hypothesis, a total of 10 similar pairs and 19 dissimilar pairs are sampled from the dataset with their descriptions, log templates, and entities. For each pair of alerts, we compute the similarity score between them based on: alert description, entity, and log templates. To compute the alert-description based similarity between two alerts, we obtain the distributed representation using a universal sentence encoder for each alert in the pair, and then compute the cosine similarity between them. For entity based similarity, we first extract entities from each alert and represent them as a vector of
entities. Then, we compute the tf-idf similarity between them. §V-B outlines our method for calculating log-template based similarity between two alerts.

The results of the experiments are presented in Table III (without each data point). The average similarity scores of similar and dissimilar alert pairs computed using alert descriptions are 0.84 and 0.75, respectively. This shows that one cannot rely only on the alert-description based similarity score, as it might result in many false positives; precision would be high, but accuracy would be low. These results indicate that the alert-description based similarity score is not discriminative enough to separate similar pairs from dissimilar pairs. On the other hand, the average similarity scores of similar and dissimilar alert pairs calculated using log templates are 0.72 and 0.55, respectively. This clearly indicates the discriminative power of log-template based similarity scores, which can cleanly separate the similar pairs from the dissimilar pairs. The entity based similarity scores show similar results: the average similarity scores for similar and dissimilar pairs are 0.66 and 0.53, respectively.

In order to establish the threshold value for alert grouping, i.e., the similarity value below which grouping should stop, we calculate the accuracy of log-template based, entity based, and alert-description based alert grouping for different threshold values. Table IV shows that the precision of alert-description based alert grouping is the lowest and the precision of log-template based alert grouping is the highest. The threshold at which log-template alert grouping reaches its maximum precision is therefore used in FIXME.

TABLE III: Similarity scores for similar and dissimilar (Dissim.) events using three different methods

          Log-template        Entity              Event-desc
          Similar  Dissim.    Similar  Dissim.    Similar  Dissim.
AVERAGE   0.72     0.55       0.66     0.53       0.84     0.75
STDEV     0.049    0.09       0.10     0.07       0.10     0.06
MEDIAN    0.71     0.54       0.63     0.54       0.82     0.75

TABLE IV: Precision of different event grouping models for varying values of threshold

                    Threshold=0.6   0.65   0.7
Alert-description   0.34            0.38   0.48
Time & Spatial      0.72            0.75   0.79
Log-template        0.82            —      —
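The entity-based similarity described above (entities as a vector, tf-idf weighting, cosine similarity) can be sketched as follows. The smoothing in the weighting formula and the entity names are illustrative assumptions, not the exact production implementation:

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, corpus):
    """Cosine similarity between tf-idf vectors of two token lists."""
    n = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # document frequency of each entity token

    def vec(doc):
        tf = Counter(doc)
        # Smoothed idf so tokens present in every document still get weight.
        return {t: (tf[t] / len(doc)) * math.log((1 + n) / (1 + df[t]) + 1)
                for t in tf}

    va, vb = vec(doc_a), vec(doc_b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Entities extracted from three hypothetical alerts.
alert_x = ["pod-a", "vm-1"]
alert_y = ["pod-a", "vm-2"]
alert_z = ["node-9"]
corpus = [alert_x, alert_y, alert_z]
sim_related = tfidf_cosine(alert_x, alert_y, corpus)
sim_unrelated = tfidf_cosine(alert_x, alert_z, corpus)
```

Alerts sharing an entity (pod-a) score well above alerts with no entities in common, which is the behavior the grouping threshold relies on.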
D. Feedback
From the labeled data, we take the incorrect labels (6% response rate) as feedback. As detailed in §V-E, FIXME collects simple yes/no feedback from SREs and uses it to perform clustering split and merge operations, dividing a cluster (group) into multiple clusters or combining multiple clusters into one. Both operations help improve correctness. 11,565 alerts have been used for training word embeddings (size = 300) (§V-B), and the word embedding vectors are used as input to the clustering. Figure 10 illustrates the quality of clusters (i.e., consistency) using the silhouette score and accuracy against the labeled data (§VI-A); 15 clusters show the best consistency and accuracy.

Fig. 10: Cluster consistency and accuracy for the number of clusters

From the labeled data (§VI-C), out of the 68 incorrectly labeled alerts, 46 alerts have been split and merged into other groups, becoming correct. Also, 149 alerts (out of the 241 from single-alert groups) have been merged into other groups, becoming correct. Therefore, the precision improves to (825 + 46 + 149) / 1134 ≈ 0.90.

VII. CONCLUSION
We have introduced hybrid approaches used in AIOps production systems and demonstrated that hybrid entity resolution and hybrid event correlation provide better results than any single method. Our experimental results clearly show that combining template-based and dictionary-based approaches to entity resolution achieves 93% accuracy, while neither technique alone provides sufficiently good results. We also show that consecutive application of temporal and spatial methods together with association rule mining and log-template grouping helps to improve performance. We conclude the paper with a description of the methods used for feedback processing for improved accuracy. We have focused our initial work on data enrichment, correlation of a variety of data, and fault localization, as these steps were identified by SREs as the most difficult in their troubleshooting process; we plan to focus on root cause identification next.

REFERENCES

[1] Sysdig. Sysdig 2019 container usage report. https://sysdig.com/blog/sysdig-2019-container-usage-report/, 2019. [Online].
[2] Catchpoint. SRE report 2020. https://blog.catchpoint.com/2020/06/24/catchpoint-sre-report-2020/, 2020. [Online].
[3] Junjie Chen, Xiaoting He, Qingwei Lin, Yong Xu, Hongyu Zhang, Dan Hao, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. An empirical investigation of incident triage for online service systems. In International Conference on Software Engineering: Software Engineering in Practice, 2019.
[4] Seyed Ali Mirheidari, Sajjad Arshad, and Rasool Jalili. Alert correlation algorithms: A survey and taxonomy. Lecture Notes in Computer Science, pages 183–197, 2013.
[5] Ayush Dusia and Adarshpal S. Sethi. Recent advances in fault localization in computer networks. Commun. Surveys Tuts., 18(4), 2016.
[6] Malgorzata Steinder and Adarshpal Sethi. A survey of fault localization techniques in computer networks. Science of Computer Programming, 53:165–194, November 2004.
[7] Abhishek Verma, Luis Pedrosa, Madhukar R. Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the European Conference on Computer Systems (EuroSys), Bordeaux, France, 2015.
[8] Ouissem Ben Fredj. A realistic graph-based alert correlation system. Security and Communication Networks, 8(15):2477–2493, 2015.
[9] Seyed Hossein Ahmadinejad, Saeed Jalili, and Mahdi Abadi. A hybrid model for correlating alerts of known and unknown attack scenarios and updating attack graphs. Comput. Netw., 55(9):2221–2240, June 2011.
[10] Google. SRE book. https://landing.google.com/sre/sre-book/chapters/effective-troubleshooting/, 2020. [Online].
[11] F. Cuppens. Managing alerts in a multi-intrusion detection environment. In Annual Computer Security Applications Conference, USA, 2001.
[12] Fredrik Valeur, Giovanni Vigna, Christopher Kruegel, and Richard A. Kemmerer. A comprehensive approach to intrusion detection alert correlation. IEEE Trans. Dependable Secur. Comput., 1(3), July 2004.
[13] Huwaida Tagelsir Elshoush and Izzeldin Mohamed Osman. Intrusion Alert Correlation Framework: An Innovative Approach, pages 405–420. Springer Netherlands, Dordrecht, 2013.
[14] Alfonso Valdes and Keith Skinner. Probabilistic alert correlation. In Proceedings of the 4th International Symposium on Recent Advances in Intrusion Detection, RAID '00. Springer-Verlag, 2001.
[15] K. Julisch. Mining alarm clusters to improve alarm handling efficiency. In Annual Computer Security Applications Conference, 2001.
[16] R. Vaarandi. A data clustering algorithm for mining patterns from event logs. In Proceedings of the 3rd IEEE Workshop on IP Operations Management, pages 119–126, 2003.
[17] Peyman Kabiri and Ali Ghorbani. A rule-based temporal alert correlation system. International Journal of Network Security, 5:66–72, July 2007.
[18] Klaus Julisch. Clustering intrusion detection alarms to support root cause analysis. ACM Trans. Inf. Syst. Secur., 6(4):443–471, November 2003.
[19] Oliver M. Dain and Robert K. Cunningham. Building scenarios from a heterogeneous alert stream. In Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, volume 6. United States Military Academy, West Point, NY, 2001.
[20] Peng Ning and Yun Cui. An intrusion alert correlator based on prerequisites of intrusions. Technical report, North Carolina State University at Raleigh, USA, 2002.
[21] S. Cheung, U. Lindqvist, and M. W. Fong. Modeling multistep cyber attacks for scenario recognition. In DARPA Information Survivability Conference and Exposition, pages 284–292, vol. 1, May 2003.
[22] Steven T. Eckmann, Giovanni Vigna, and Richard A. Kemmerer. STATL: An attack language for state-based intrusion detection. J. Comput. Secur., 10(1–2):71–103, July 2002.
[23] Benjamin Morin, Ludovic Mé, Hervé Debar, and Mireille Ducassé. M2D2: A formal data model for IDS alert correlation. In Andreas Wespi, Giovanni Vigna, and Luca Deri, editors, Recent Advances in Intrusion Detection, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.
[24] Benjamin Morin, Ludovic Mé, Hervé Debar, and Mireille Ducassé. A logic-based model to support alert correlation in intrusion detection. Inf. Fusion, 10(4):285–299, October 2009.
[25] Safaa O. Al-Mamory and Hongli Zhang. Intrusion detection alarms reduction using root cause analysis and clustering. Comput. Commun., 32(2):419–430, February 2009.
[26] Peng Ning, Yun Cui, and Douglas S. Reeves. Constructing attack scenarios through correlation of intrusion alerts. In Proceedings of the 9th ACM Conference on Computer and Communications Security, CCS '02, pages 245–254, New York, NY, USA, 2002. Association for Computing Machinery.
[27] Peng Ning, Yun Cui, Douglas S. Reeves, and Dingbang Xu. Techniques and tools for analyzing intrusion alerts. ACM Trans. Inf. Syst. Secur., 7(2):274–318, May 2004.
[28] Tadeusz Pietraszek and Axel Tanner. Data mining and machine learning—towards reducing false positives in intrusion detection. Inf. Secur. Tech. Rep., 10(3):169–183, January 2005.
[29] Oliver Dain and Robert K. Cunningham. Fusing a Heterogeneous Alert Stream into Scenarios, pages 103–122. Springer US, Boston, MA, 2002.
[30] Reuben L. Smith, Nathalie Japkowicz, and Maxwell G. Dondo. Clustering using an autoassociator: A case study in network event correlation. In IASTED PDCS, 2005.
[31] Peter Teufl, Udo Payer, and Reinhard Fellner. Event correlation on the basis of activation patterns. In Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-Based Processing, PDP '10, pages 631–640, USA, 2010. IEEE Computer Society.
[32] Jiawei Han, Jian Pei, and Micheline Kamber. Data Mining: Concepts and Techniques. Elsevier, 2011.
[33] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Education India, 2016.
[34] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.
[35] Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD Rec., 25(2):103–114, June 1996.
[36] Maria-Florina Balcan and Avrim Blum. Clustering with interactive feedback, January 2008.