Feature Engineering for Scalable Application-Level Post-Silicon Debugging
Debjit Pal*, Shobha Vasudevan†
Email: {*dpal2, †shobhav}@illinois.edu

Abstract—We present systematic and efficient solutions for both observability enhancement and root-cause diagnosis in post-silicon validation of System-on-Chips (SoCs) with diverse usage scenarios. We model specifications of interacting flows in typical applications for message selection. Our message selection method optimizes flow specification coverage and trace buffer utilization. We define the diagnosis problem as identifying buggy traces as outliers and bug-free traces as inliers/normal behaviors, for which we use unsupervised learning algorithms for outlier detection. Instead of directly applying machine learning algorithms over trace data using the signals as raw features, we use feature engineering to transform raw features into more sophisticated features using domain-specific operations. The engineered features are highly relevant to the diagnosis task and generic enough to be applied across hardware designs. We present debugging and root-cause analysis of subtle post-silicon bugs in the industry-scale OpenSPARC T2 SoC. We achieve a trace buffer utilization of 98.96% with a flow specification coverage of 94.3% (average). Our diagnosis method was able to diagnose up to 66.7% more bugs and took up to 847x less diagnosis time as compared to manual debugging, with a diagnosis precision of 0.769.

I. INTRODUCTION
Post-silicon validation is a crucial component of modern SoC design validation that is performed under highly aggressive schedules and accounts for a substantial fraction of the overall validation cost [23], [34].

An expensive component of post-silicon validation is application-level use-case validation through message passing. In this activity, a validator exercises various target usage scenarios of the system (e.g., for a smartphone, playing videos or surfing the Web while receiving a phone call) and monitors for failures (e.g., hangs, crashes, deadlocks, overflows, etc.). Each usage scenario involves interleaved execution of several protocols among the IPs in the SoC design. The concurrent execution of multiple protocols [1], [9], [28], extremely long execution traces (millions of clock cycles), and the lack of bug reproducibility and error sequentiality make the post-silicon diagnosis effort extremely time consuming. In current industrial practice [20], post-silicon diagnosis is a manual, unsystematic, and ad hoc process relying primarily on the creativity of the validator, and it often takes weeks to months of validation time. Consequently, it is crucial to determine techniques to streamline this activity.

In this paper, we present an automated post-silicon debug and diagnosis methodology that shortens diagnosis time using machine learning and feature engineering.

In previous work [22], we developed a message selection method that specifically targets use-case validation. To debug a use-case scenario, the validator typically needs to observe and comprehend the messages being sent by the constituent IPs. An effective way to do that is hardware tracing [23], where a small set of signals is monitored continuously during execution. The effectiveness of hardware tracing is limited by the signals selected for tracing. Note that the omission of a critical signal manifests only during post-silicon debug, when a silicon respin is infeasible.

To select trace signals that are most beneficial for use-case debugging, we depart from the gate-level post-silicon signal selection approaches [4], [8], [18], [24] of prior art and raise the design abstraction at which we apply hardware tracing. We systematically model and analyze usage scenarios at the application level. Our message selection framework uses protocol formalizations as sequences of transactions, or flows [1], [9], [28], [31]. Given a collection of usage scenarios and the application-level flows they activate (and the constituent messages), our algorithm computes the messages that are most beneficial for debugging. We scale our observability-enhancing algorithms to the industry-scale OpenSPARC T2 SoC [29], [30], which is orders of magnitude larger and more complex than the traditional ISCAS89 benchmarks used in the literature. Along with scale, we argue with empirical evidence that the selected observables are of higher quality and are highly effective for debugging post-silicon usage scenario failures.

Although in [22] we automated message selection, debug and diagnosis remained a manual and extremely tedious task. The primary objectives of manual post-silicon debug and diagnosis are i) to understand the desired behavior from the specification, ii) to learn the correct message patterns as per the specification, and iii) to learn one or more message patterns that are symptomatic of the bug(s).
Machine learning (ML) algorithms [19], [21] automatically learn statistical models from large amounts of training data. In this paper, we argue that ML algorithms can learn models of the correct and buggy executions of an SoC during post-silicon debug and diagnosis. To train the models, we can use the large amount of post-silicon trace data that is generated during use-case validation. The primary challenge of applying ML to the diagnosis problem is in representing the training data such that the ML model can learn the differences between correct and buggy behavior and generalize to any arbitrary design.

Logical bugs in designs can be considered as triggering corner-case design behavior, which is infrequent and deviant from normal design behavior. In ML parlance, outlier detection [7], [10] is a technique to identify infrequent and deviant data points, called outliers, whereas normal data points are called inliers. We apply outlier detection techniques to automatically diagnose post-silicon failures by modeling normal design behaviors as inliers and buggy design behaviors as outliers. Consequently, the task of learning a buggy design behavior transforms into a task of modeling the buggy design behavior as an outlier.

In post-silicon execution, design bugs typically manifest as one or more patterns of consecutive messages (also known as message interleavings) in the trace data. Human engineers spend considerable time and effort to identify such patterns in the trace data. We call such a message pattern an anomalous message sequence. In this work, we apply ML to identify such anomalous message sequences automatically.

The task for the ML algorithm is outlier detection, where the ML model is expected to learn normal and buggy design behaviors and classify buggy behaviors as outliers. The data used for training is post-silicon execution data, where each data point is a triplet consisting of i) the cycle of occurrence of a message, ii) the IP interface from which the message is sourced, and iii) the message itself. In our experiments we found that these raw features are insufficient to capture the infrequent and deviant nature of buggy design behaviors as compared to normal behavior (c.f., Figure 4).

Feature engineering is a critically important task in the successful generalization of ML to a problem domain [5], [12]. We engineer domain-specific features that are highly relevant to the diagnosis task to control the normal and buggy behavior model as seen by the outlier detection algorithms. The engineered features are generic, i.e., they are transformations that can be applied to any hardware design.

In our formulation we need the anomalous message sequences to appear as outliers. We identify infrequency and deviancy as the relevant features that capture the distinction between normal and anomalous message sequences. Our engineered feature space needs to capture this distinction: we would like normal message sequences to be close and densely distributed in the feature space, and anomalous message sequences to be sparsely distributed and distant. Due to the large number of possible message sequences, the analysis can become computationally expensive. To keep computation tractable, we pre-process trace message sequences to create message aggregates and characterize each such aggregate for anomaly.
A message aggregate with infrequent message sequences contains more information than a message aggregate with frequent message sequences. We use entropy [27] to quantify the information content of an aggregate. As the number of infrequent message sequences in an aggregate increases, the entropy of the aggregate increases monotonically. To quantify the deviancy of a message sequence with respect to the other message sequences in the aggregate, we use a string similarity metric, in particular the Levenshtein distance [13]. As an aggregate contains more and more deviant message sequences, the average pairwise Levenshtein distance of the aggregate increases monotonically. We identify message aggregates with both high entropy and high Levenshtein distance as outliers and report them as candidate root causes.

The primary benefits of this diagnosis solution are: i) it automatically learns the normal and buggy design behaviors from trace message data without labeled training data, ii) the engineered features are generic and independent of any particular design and/or application, and iii) the proposed method can sift through a large amount of trace data, thereby improving detection of candidate anomalous message sequences that are symptomatic of design bugs.
Fig. 1: 1a shows a flow for an exclusive line access request for a cache coherence flow [31] along with participating IPs. 1b shows two legally indexed instances of the cache coherence flow.

To show the scalability and effectiveness of our automated diagnosis approach, we perform our experiments on the industrial-scale OpenSPARC T2 SoC [29], [30]. We inject complex and subtle bugs, with each bug symptom taking several hundred observed messages (up to 457 messages) and up to 21,290,999 clock cycles to manifest. Our analysis shows that the proposed diagnosis method is computationally efficient. It incurred a runtime of up to 44.3 seconds and a peak memory usage of up to 508.7 MB to pre-process trace messages into aggregates. To detect outlier message aggregates, it incurred a runtime of up to 18.91 seconds and a peak memory usage of up to 508.2 MB.

We also evaluated the effectiveness of our engineered features for outlier detection. We found that the candidate anomalous message aggregates have entropy of up to 4.3482 and Levenshtein distance of up to 3.0. This shows that our engineered features are highly effective in demarcating anomalous message aggregates from normal aggregates.

Our analysis shows that our proposed diagnosis method is highly effective: it was able to diagnose up to 66.7% more injected bugs with up to 847x less diagnosis time, with a high precision of up to 0.769, as compared to manual debugging.

Our contributions over [22] are as follows. First, we pose the post-silicon diagnosis problem as an outlier detection problem and propose an ML-based scalable and efficient technique to diagnose post-silicon use-case failures. Second, we systematically model buggy behavior as an outlier and normal behavior as an inlier in the ML data space. To that end, we engineered two features that are highly relevant to the diagnosis task and applicable across hardware designs. Third, we establish with empirical evidence that our ML-based technique is highly effective and can diagnose many more bugs in a fraction of the time, with high precision, as compared to manual debugging.

II. BACKGROUND AND PRELIMINARIES
Conventions: In SoC designs, a message can be viewed as an assignment of Boolean values to the interface signals of a hardware IP. In our formalization below, we leave the definition of a message implicit, but we will treat it as a pair $\langle C, w \rangle$ where $w \in \mathbb{Z}^{+}$. Informally, $C$ represents the content of the message and $w$ represents the number of bits required to represent $C$. Given a message $m = \langle C, w \rangle$, we will refer to $w$ as the bit width of $m$, denoted by $width(m)$ or $|m|$.

Definition 1: A flow is a directed acyclic graph (DAG) defined as a tuple $\mathcal{F} = \langle \mathcal{S}, \mathcal{S}_0, \mathcal{S}_p, \mathcal{E}, \delta_{\mathcal{F}}, Atom \rangle$ where $\mathcal{S}$ is the set of flow states, $\mathcal{S}_0 \subseteq \mathcal{S}$ is the set of initial states, $\mathcal{S}_p \subseteq \mathcal{S}$ with $\mathcal{S}_p \cap Atom = \emptyset$ is the set of stop states, $\mathcal{E}$ is a set of messages, $\delta_{\mathcal{F}} \subseteq \mathcal{S} \times \mathcal{E} \times \mathcal{S}$ is the transition relation, and $Atom \subset \mathcal{S}$ is the set of atomic states of the flow.

Fig. 2: Two instances of the cache coherence flow of Figure 1a interleaved.

We use $\mathcal{F}.\mathcal{S}$, $\mathcal{F}.\mathcal{E}$, etc. to denote the individual components of a flow $\mathcal{F}$. A stop state of a flow is its final state after successful completion. $Atom$ is a mutex set of flow states, i.e., any two flow states in $Atom$ cannot happen together. Other components of $\mathcal{F}$ are self-explanatory. In Figure 1a, we show a toy cache coherence flow along with the participating IPs and the messages. In Figure 1a, $\mathcal{S} = \{Init, Wait, GntW, Done\}$, $\mathcal{S}_0 = \{Init\}$, $\mathcal{S}_p = \{Done\}$, and $Atom = \{GntW\}$. Each of the messages in the cache coherence flow is 1 bit wide, hence $\mathcal{E} = \{\langle ReqE, 1 \rangle, \langle GntE, 1 \rangle, \langle Ack, 1 \rangle\}$.
Definition 2: Given a flow $\mathcal{F}$, an execution $\rho$ is an alternating sequence of flow states and messages ending with a stop state. For flow $\mathcal{F}$, $\rho = s_0 \alpha_1 s_1 \alpha_2 s_2 \ldots \alpha_n s_n$ such that $s_i \xrightarrow{\alpha_{i+1}} s_{i+1}$, $\forall\, 0 \leq i < n$, $s_i \in \mathcal{F}.\mathcal{S}$, $\alpha_{i+1} \in \mathcal{F}.\mathcal{E}$, and $s_n \in \mathcal{F}.\mathcal{S}_p$. The trace of an execution $\rho$ is defined as $trace(\rho) = \alpha_1 \alpha_2 \ldots \alpha_n$.

For the cache coherence flow of Figure 1a, $\rho = \{n, ReqE, w, GntE, c, Ack, d\}$ and $trace(\rho) = \{ReqE, GntE, Ack\}$.

Intuitively, a flow provides a pattern of system execution. A flow can be invoked several times, even concurrently, during a single run of the system. To make precise the relation between an execution of the system and the participating flows, we need to distinguish between these instances of the same flow. The notion of indexing accomplishes that by augmenting a flow with an "index".

Definition 3: An indexed message is a pair $\alpha = \langle m, i \rangle$ where $m$ is the message and $i \in \mathbb{N}$ is referred to as the index of $\alpha$. An indexed state is a pair $\hat{s} = \langle s, j \rangle$ where $s$ is a flow state and $j \in \mathbb{N}$ is referred to as the index of $\hat{s}$. An indexed flow $\langle f, k \rangle$ is a flow consisting of indexed messages and indexed states, indexed by $k \in \mathbb{N}$.

Figure 1b shows two instances of the cache coherence flow of Figure 1a, indexed with their respective instance numbers. In our modeling, we ensure by construction that two different instances of the same flow do not have the same indices. Note that in practice, most SoC designs include architectural support to enable tagging, i.e., uniquely identifying different concurrently executing instances of the same flow. Our formalization simply makes the notion of tagging explicit.
Definition 4: Any two indexed flows $\langle \mathcal{F}, i \rangle$, $\langle \mathcal{G}, j \rangle$ are said to be legally indexed if either $\mathcal{F} \neq \mathcal{G}$, or $\mathcal{F} = \mathcal{G}$ and $i \neq j$.

Figure 1b shows two legally indexed instances of the cache coherence flow of Figure 1a. Indices uniquely identify each instance of the cache coherence flow.

A usage scenario is a pattern of frequently used applications. Each such pattern comprises multiple interleaved flows corresponding to communicating hardware IPs.
Definition 5: Let $\mathcal{F}$, $\mathcal{G}$ be two legally indexed flows. The interleaving $\mathcal{F} \parallel \mathcal{G}$ is a flow called the interleaved flow, defined as $\mathcal{U} = \mathcal{F} \parallel \mathcal{G} = \langle \mathcal{F}.\mathcal{S} \times \mathcal{G}.\mathcal{S},\ \mathcal{F}.\mathcal{S}_0 \times \mathcal{G}.\mathcal{S}_0,\ \mathcal{F}.\mathcal{S}_p \times \mathcal{G}.\mathcal{S}_p,\ \mathcal{F}.\mathcal{E} \cup \mathcal{G}.\mathcal{E},\ \delta_{\mathcal{U}},\ \mathcal{F}.Atom \cup \mathcal{G}.Atom \rangle$ where $\delta_{\mathcal{U}}$ is defined by the rules:

i) $\dfrac{s_1 \xrightarrow{\alpha} s_1' \;\wedge\; s_2 \notin \mathcal{G}.Atom}{\langle s_1, s_2 \rangle \xrightarrow{\alpha} \langle s_1', s_2 \rangle}$ and ii) $\dfrac{s_2 \xrightarrow{\beta} s_2' \;\wedge\; s_1 \notin \mathcal{F}.Atom}{\langle s_1, s_2 \rangle \xrightarrow{\beta} \langle s_1, s_2' \rangle}$

where $s_1, s_1' \in \mathcal{F}.\mathcal{S}$, $s_2, s_2' \in \mathcal{G}.\mathcal{S}$, $\alpha \in \mathcal{F}.\mathcal{E}$, and $\beta \in \mathcal{G}.\mathcal{E}$. Every path in the interleaved flow is an execution of $\mathcal{U}$ and represents an interleaving of the messages of the participating flows.

Rule i) of $\delta_{\mathcal{U}}$ says that if $s_1$ evolves to the state $s_1'$ when message $\alpha$ is performed and if $\mathcal{G}$ has a state $s_2$ which is not atomic/indivisible, then in the interleaved flow the state $(s_1, s_2)$ evolves to the state $(s_1', s_2)$ when message $\alpha$ is performed. A similar explanation holds for Rule ii) of $\delta_{\mathcal{U}}$. For any two concurrently executing legally indexed flows $\mathcal{F}$ and $\mathcal{G}$ with $\mathcal{J} = \mathcal{F} \parallel \mathcal{G}$, for any $s \in \mathcal{F}.Atom$ and any $s' \in \mathcal{G}.Atom$, $(s, s') \notin \mathcal{J}.\mathcal{S}$: if one flow is in one of its atomic/indivisible states, then no other concurrently executing flow can be in its atomic/indivisible state.

Figure 2 shows a partial interleaving $\mathcal{U}$ of the two legally indexed flow instances of Figure 1b. Since $c_1$ and $c_2$ are both atomic states, the state $(c_1, c_2)$ is an illegal state in the interleaved flow. $\delta_{\mathcal{U}}$ and the $Atom$ set ensure that such illegal states do not appear in interleaved flows.

Trace buffer availability is measured in bits, thus rendering the bit width of a message important. In Definition 6, we define a message combination. Different instances of the same message, i.e., indexed messages, are not required while computing the bit width of a message combination.
Definition 6: A message combination $\mathcal{M}$ is an unordered set of messages. The total bit width $W$ of a message combination $\mathcal{M}$ is the sum of the bit widths of the individual messages contained in $\mathcal{M}$, i.e., $W(\mathcal{M}) = \sum_{i=1}^{k} width(m_i) = \sum_{i=1}^{k} w_i$, $m_i \in \mathcal{M}$, $k = |\mathcal{M}|$.

We introduce a metric called flow specification coverage to evaluate the quality of a message combination.
Definition 7: Let $\mathcal{F}$ be a flow. The set of visible flow states $visible(\alpha)$ of a message $\alpha \in \mathcal{F}.\mathcal{E}$ is defined as the set of flow states reached on the occurrence of message $\alpha$, i.e., $visible(\alpha) = \{ s' \mid s \xrightarrow{\alpha} s';\ s, s' \in \mathcal{F}.\mathcal{S} \}$. The flow specification coverage $FCov(\mathcal{M})$ of a message combination $\mathcal{M}$ is defined as the set union of the visible flow states of all the messages in the message combination, expressed as a fraction of the total number of flow states, i.e., $FCov(\mathcal{M}) = \frac{|\cup_{i=1}^{k} visible(\alpha_i)|}{|\mathcal{F}.\mathcal{S}|}$, $k = |\mathcal{M}|$.

We extend the definition of the trace $trace(\rho)$ of an execution $\rho$ (c.f., Definition 2) to define message sequences and message aggregates for diagnosis.

Definition 8: A message sequence $m(\rho)$ of a $trace(\rho)$ is defined as a subsequence of the trace of the execution. The length $k$ of a message sequence $m(\rho)$ is defined as the number of messages contained in $m(\rho)$. For example, for $trace(\rho) = \alpha_1 \alpha_2 \alpha_3 \ldots \alpha_n$, $m(\rho) = \langle \alpha_1 \alpha_2 \alpha_3 \rangle$ is a message sequence of $trace(\rho)$ of length $k = 3$. Any two message sequences $m_i(\rho)$ and $m_j(\rho)$ of length $k$ are distinct if $\exists\, l \in [1, k]$ such that $\alpha_{i,l} \neq \alpha_{j,l}$, where $\alpha_{i,l} \in m_i(\rho)$ and $\alpha_{j,l} \in m_j(\rho)$.

Definition 9: A message aggregate $maggr(\rho)$ of a $trace(\rho)$ is defined as an unordered set of message sequences of length $k$. Each distinct message sequence in a message aggregate is called a unique message sequence of that message aggregate. For example, $maggr(\rho) = \{\langle \alpha_1 \alpha_2 \alpha_3 \rangle, \langle \alpha_2 \alpha_3 \alpha_4 \rangle\}$ is a message aggregate of length-3 message sequences of $trace(\rho)$. Each of $\langle \alpha_1 \alpha_2 \alpha_3 \rangle$ and $\langle \alpha_2 \alpha_3 \alpha_4 \rangle$ is a unique message sequence of $maggr(\rho)$.

For the cache coherence flow of Figure 1a, $m_1(\rho) = \langle ReqE, GntW \rangle$ and $m_2(\rho) = \langle GntW, Ack \rangle$ are two length-2 message sequences, and $maggr(\rho) = \{m_1, m_2\} = \{\langle ReqE, GntW \rangle, \langle GntW, Ack \rangle\}$ is a message aggregate.
A. Entropy and mutual information gain
Entropy: Entropy [27] measures the uncertainty in a random variable. Let $X$ be a discrete random variable with possible values $X_{val} = \{x_1, x_2, \ldots, x_n\}$. Let $p(x)$ be the associated probability mass function of $X$. The entropy of the random variable $X$ is defined as $H(X) = -\sum_{x_i \in X_{val}} p(x_i) \log_2 p(x_i)$ where $p(x_i) = |X = x_i| / |X_{val}|$ denotes the fraction of $X$ in which $X = x_i$.

Mutual information gain: Mutual information gain [27] measures the amount of information that can be obtained about one random variable $X$ by observing another random variable $Y$. More precisely, the conditional entropy of a random variable $X$ with respect to another random variable $Y$ is the reduction in uncertainty in the realization of $X$ when the outcome of $Y$ is known. For jointly distributed discrete random variables $X$ and $Y$, the mutual information gain of $X$ relative to $Y$ is given by $I(X; Y) = \sum_{x,y} p(x, y) \log_2 \left( \frac{p(x, y)}{p(x)\,p(y)} \right)$, where $p(x)$ and $p(y)$ are the marginal probability mass functions of $X$ and $Y$ respectively.
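To make the two definitions concrete, the following minimal Python sketch (our own illustration, not the paper's implementation) computes empirical entropy and mutual information gain from observed samples:

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Empirical entropy H(X) in bits of a list of observed values."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mutual_information(xs, ys):
    """Empirical mutual information gain I(X;Y) in bits from paired samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# A fair coin has 1 bit of entropy, and a variable carries full information
# about itself, so I(X;X) equals H(X).
xs = ['h', 't', 'h', 't']
print(entropy(xs))                 # 1.0
print(mutual_information(xs, xs))  # 1.0
```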
B. Levenshtein distance

The Levenshtein distance is a string similarity metric measuring the dissimilarity between two strings. Mathematically, the Levenshtein distance between two strings $a$, $b$ (of length $|a|$ and $|b|$), written $L_{a,b}(|a|, |b|)$, is defined as:

$$L_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0,\\[2pt] \min \begin{cases} L_{a,b}(i-1, j) + 1\\ L_{a,b}(i, j-1) + 1\\ L_{a,b}(i-1, j-1) + \mathbb{1}(a_i \neq b_j) \end{cases} & \text{otherwise,} \end{cases}$$

where $\mathbb{1}(a_i \neq b_j)$ is the indicator function equal to 0 when $a_i = b_j$ and equal to 1 otherwise. $L_{a,b}(i, j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$. We will denote $L_{a,b}(|a|, |b|)$ as $L(a, b)$.

The salient features of the Levenshtein distance are: i) it is at least the difference of the sizes of the two strings; ii) it is at most the length of the longer string; iii) it is zero if and only if the strings are equal; and iv) if the strings are of the same size, the Hamming distance [11] is an upper bound.
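The recurrence translates directly into the standard dynamic program below (a sketch of our own; it works on strings and on lists of messages alike):

```python
def levenshtein(a, b):
    """Levenshtein distance between sequences a and b via the recurrence above."""
    m, n = len(a), len(b)
    # dist[i][j] holds L_{a,b}(i, j), the distance between a[:i] and b[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                       # base case: min(i, j) == 0
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                           # deletion
                dist[i][j - 1] + 1,                           # insertion
                dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitution
            )
    return dist[m][n]

print(levenshtein('aba', 'bab'))  # 2: one deletion plus one insertion
print(levenshtein('aba', 'cdc'))  # 3: three substitutions
```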
III. OUTLIER DETECTION

A. Outliers in ML
In ML, outliers are defined as data samples whose characteristics notably deviate from our expectation [7], [10]. Outliers have two characteristics: i) they are different from and ii) they are rare as compared to normal data samples.

In spite of this straightforward definition, detecting an outlier is challenging. First, the boundary between outliers and normal samples is often imprecise. Additionally, some outliers only manifest their outlierness in an engineered feature space derived from the original feature space via non-trivial transformations. Second, the ground truth of the outliers is often absent due to prohibitive labeling cost. Hence, in many cases, outliers are determined based on sample characteristics alone. Unsupervised outlier detection algorithms identify outliers through the patterns and intrinsic properties of the feature space alone, and hence do not require any ground-truth labels.
B. Different notions of outliers
There are three distinct notions of outliers, depending on how normal samples are profiled: i) classification-based, ii) density-based, and iii) spectral-based.

Classification-based outlier detection: Outliers can be defined by a classifier that is learned in the feature space to distinguish between normal and anomalous samples [7]. Any sample that does not fit the representation of the normal samples is considered an outlier. When the ground truth is unavailable, the classifier can be learned in an unsupervised manner. One-class Support Vector Machine (OCSVM) [2], [26] is an unsupervised outlier detection method that adopts this notion of outliers.
Density-based outlier detection: Density-based outliers rest on the assumption that normal data samples reside in neighborhoods of high density whereas outliers reside in low-density regions [7]. There are two distinct notions of density-based outlierness. First, the local density of a sample can be estimated by its distance to its k nearest neighbors (this distance can be the distance to the k-th most distant neighbor or the average of the distances to all k neighbors), with larger distances indicating higher degrees of outlierness. The k-Nearest Neighbors (kNN) [3], [25] technique is an unsupervised outlier detection technique that adopts this notion of outliers directly and uses the distance to quantify outlierness. Second, the relative density of each data sample with respect to the density of its k neighbors can be used as an indication of the outlierness of a sample. A normal sample has a local density similar to that of its neighbors, whereas an outlier's local density is lower than that of its neighbors. Local Outlier Factor (LOF) [6] is an unsupervised outlier detection method that identifies outliers based on the relative density of a sample's neighborhood.
Spectral-based outlier detection: The spectral-based notion of outliers assumes that the difference between normal samples and outliers can be significantly enhanced when the data is embedded into a lower-dimensional subspace [7]. Hence, outlier detection methods that adopt this notion approximate the data space using a transformation of the original features to capture the variability in the data for easy outlier identification.
Principal Component Analysis (PCA) [15] is an unsupervised outlier detection method that projects data into a lower-dimensional space where most of the variability of the data is captured and explained by the new dimensions. The variability that is not captured by the new dimensions is considered anomalous.

Fig. 3: Our message selection approach. Step 1: find message combinations (input: system-level flows and trace buffer width; output: message combinations with information gain maximized and the trace buffer maximally utilized). Step 2: select a message combination based on mutual information gain. Step 3: pack the trace buffer.
Isolation Forest (IForest) [16], [17] is another unsupervised outlier detection method that attempts to identify outliers using only a subset of the features. IForest recursively selects features and splits feature values at random until samples are isolated. Since outliers are rare and lie further away from the normal samples in the feature space, the number of splits required to isolate an outlier serves as its outlier score.
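As a small illustration of these notions (assuming the PyOD library [35] that we use in Section VI; the toy data is our own), a density-based and an isolation-based detector both flag a distant point sitting next to a dense cluster:

```python
import numpy as np
from pyod.models.lof import LOF          # density-based (relative density)
from pyod.models.iforest import IForest  # isolation-based

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=0.1, size=(50, 2))  # one dense cluster
outlier = np.array([[3.0, 3.0]])                        # sparse, distant point
X = np.vstack([inliers, outlier])

for clf in (LOF(n_neighbors=10), IForest(random_state=0)):
    clf.fit(X)  # unsupervised: no ground-truth labels are given
    # labels_: 0 = inlier, 1 = outlier; decision_scores_: higher = more outlying
    print(type(clf).__name__, clf.labels_[-1], clf.decision_scores_[-1])
```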
C. Metrics of an outlier detection algorithm
Definition 10: The precision of an outlier detection algorithm is defined as the number of true positives expressed as a fraction of the total number of samples labeled as belonging to the outlier class, i.e., $Precision = \frac{t_p}{t_p + f_p}$, where $t_p$ is the number of true positives and $f_p$ is the number of false positives.

Definition 11: The recall of an outlier detection algorithm is defined as the number of true positives expressed as a fraction of the total number of true positives and false negatives, i.e., $Recall = \frac{t_p}{t_p + f_n}$, where $f_n$ is the number of false negatives.

Definition 12: The accuracy of an outlier detection algorithm is defined as the number of samples that are correctly labeled as belonging to both the outlier class and the normal class, expressed as a fraction of the total number of samples, i.e., $Accuracy = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}$, where $t_n$ is the number of true negatives.
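The three metrics reduce to a few lines of code; the counts in the example calls are hypothetical:

```python
def precision(tp, fp):
    """Definition 10: fraction of samples labeled as outliers that truly are."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Definition 11: fraction of true outliers that were detected."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Definition 12: fraction of all samples that were labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: 10 outliers found, 3 false alarms, 2 misses, 85 normals.
print(round(precision(10, 3), 3))        # 0.769
print(round(recall(10, 2), 3))           # 0.833
print(round(accuracy(10, 85, 3, 2), 3))  # 0.95
```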
IV. MESSAGE SELECTION METHODOLOGY

A. Objective of message selection methodology
Maximizing information gain increases flow specification coverage during post-silicon debug of usage scenarios. The message selection procedure considers the message combination $\mathcal{M}$ for tracing, whereas to calculate information gain over $\mathcal{U}$, it uses indexed messages.

Given a set of legally indexed participating flows of a usage scenario, the bit widths of the associated messages, and a trace buffer width constraint, our method selects a message combination such that information gain is maximized over the interleaved flow $\mathcal{U}$ and the trace buffer is maximally utilized.

For the cache coherence flow example of Figure 1a, we assume a trace buffer width of 2 bits and concurrent execution of two instances of the flow. The ReqE, GntE, and Ack messages each pass between a pair of participating IPs. ReqE, GntE, and Ack consist of the req, gnt, and ack signals respectively, and each of the messages is 1 bit wide. Let $\mathbb{B} = \{0, 1\}$ be the set of Boolean values. $C(ReqE) = \mathbb{B}^{|req|}$, $C(GntE) = \mathbb{B}^{|gnt|}$, and $C(Ack) = \mathbb{B}^{|ack|}$ denote the respective message contents.

B. Step 1: Finding message combinations
In Step 1, we identify all possible message combinations from the set of all messages of the participating flows in a usage scenario. While we enumerate the different message combinations, we also calculate the total bit width of each combination. Any message combination that has a total bit width less than or equal to the available trace buffer width is kept for further analysis in Step 2; each such message combination is a potential candidate for tracing. (For multi-cycle messages, the number of bits that can be traced in a single cycle is considered as the message bit width.)

In the example of Figure 1a, there are 3 messages and $\sum_{k=1}^{3} \binom{3}{k} = 7$ different message combinations. Of these, only one ($ReqE, GntE, Ack$) has a bit width more than the trace buffer width (2). We retain the remaining six message combinations for further analysis in Step 2.
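A minimal sketch of this step for the running example (the helper name and data layout are ours):

```python
# Enumerate all message combinations that fit within the trace buffer.
from itertools import combinations

def candidate_combinations(widths, buffer_width):
    """All message combinations whose total bit width <= buffer_width.

    widths: dict mapping message name -> bit width.
    """
    names = sorted(widths)
    keep = []
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            if sum(widths[m] for m in combo) <= buffer_width:
                keep.append(combo)
    return keep

widths = {'ReqE': 1, 'GntE': 1, 'Ack': 1}   # each message is 1 bit wide
print(candidate_combinations(widths, buffer_width=2))
# six combinations survive; only ('Ack', 'GntE', 'ReqE'), of width 3, is dropped
```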
C. Step 2: Selecting a message combination based on mutual information gain

In this step, we compute the mutual information gain of the message combinations computed in Step 1 over the interleaved flow. We then select the message combination that has the highest mutual information gain for tracing.

We use mutual information gain as a metric to evaluate the quality of the selected set of messages with respect to the interleaving of a set of flows. We associate two random variables with the interleaved flow, namely $X$ and $Y_i$. $X$ represents the different states of the interleaved flow, i.e., it can take any value in the set $\mathcal{S}$ of states of the interleaved flow. Let $M = \bigcup_i \mathcal{E}_i$ be the set of all possible indexed messages in the interleaved flow. Let $Y_i'$ be a candidate message combination and $Y_i$ be a random variable representing all indexed messages corresponding to $Y_i'$. All values of $X$ are equally probable since the interleaved flow can be in any state, hence $p_X(x) = 1/|\mathcal{S}|$. To find the marginal distribution of $Y_i$, we count the number of occurrences of each indexed message over the entire interleaved flow; $p_{Y_i}(y)$ is the fraction of these occurrences contributed by the indexed message $y$. To find the joint probability, we use the conditional probability and the marginal distribution, i.e., $p(x, y) = p(x|y)\,p(y) = p(y|x)\,p(x)$. $p(x|y)$ can be calculated as the fraction of the interleaved flow states in which $x$ is reached after the message $Y_i = y$ has been observed; in other words, $p_{X|Y_i}(x|y)$ is the fraction of times $x$ is reached out of the total number of occurrences of the indexed message $y$ in the interleaved flow. We then substitute these values into $I(X; Y_i)$ to calculate the mutual information gain of the state set $X$ w.r.t. $Y_i$.

In Figure 2, $p_X(x) = 1/|\mathcal{S}|$ for all $x \in \mathcal{S}$. Let $Y' = \{GntE, ReqE\}$ be a candidate message combination and $Y$ the corresponding set of indexed messages. For $I(X; Y)$, we compute $p(y = y_i)$ for each $y_i \in Y$. After observing $GntE_1$, flow 1 is in state $c_1$, so $p_{X|Y}(x \mid GntE_1) = 1/3$ for each of $x = (c_1, n_2)$, $(c_1, w_2)$, and $(c_1, d_2)$, and $p_{X,Y}(x, GntE_1) = p_{X|Y}(x \mid GntE_1)\, p_Y(GntE_1)$ for those states. Similarly, we calculate $p_{X,Y}(x, GntE_2)$, $p_{X,Y}(x, ReqE_1)$, and $p_{X,Y}(x, ReqE_2)$. The mutual information gain is then given by $I(X; Y) = \sum_{x,y} p(x, y) \log_2 \big( p(x, y) / (p(x)\,p(y)) \big)$.

Similarly, we calculate the mutual information gain for the remaining five message combinations and select the one with the highest mutual information gain, thereby selecting the message combination $Y' = \{ReqE, GntE\}$ for tracing. Intuitively, in an execution of $\mathcal{U}$ of Figure 2, once the traced messages are observed, we are immediately able to localize the execution to the two paths shown in red in Figure 2 among the many possible paths of $\mathcal{U}$.

Fig. 4: (a), (b), and (c) show the inability of raw feature data (legal IP-pair index, message index, cycle range) to demarcate anomalous message sequences (case studies 1, 3, and 5; k = 5).
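The probability model just described can be sketched as follows; the interleaved flow is represented as a list of (state, message, next state) transitions, and all function and variable names are our own illustration rather than the paper's implementation:

```python
# A simplified sketch of Step 2: score candidate combinations by I(X;Y).
from collections import Counter
from math import log2

def mi_gain(transitions, states, combo):
    """Mutual information gain of states X w.r.t. the messages in combo.

    transitions: list of (state, message, next_state) edges of the
    interleaved flow; states: set of interleaved-flow states.
    """
    edges = [(m, t) for (_, m, t) in transitions if m in combo]
    if not edges:
        return 0.0
    total = len(edges)
    msg_count = Counter(m for m, _ in edges)    # occurrences of each message
    joint_count = Counter(edges)                # (message, reached state) pairs
    p_x = 1.0 / len(states)                     # all states equally probable
    gain = 0.0
    for (m, x), c in joint_count.items():
        p_y = msg_count[m] / total              # marginal of message m
        p_xy = (c / msg_count[m]) * p_y         # p(x|y) * p(y)
        gain += p_xy * log2(p_xy / (p_x * p_y))
    return gain

def select_combination(transitions, states, candidates):
    """Pick the candidate message combination with the highest gain."""
    return max(candidates, key=lambda c: mi_gain(transitions, states, c))
```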
D. Step 3: Packing the trace buffer

The message combination with the highest mutual information gain selected in Step 2 may not completely fill the trace buffer. To maximize trace buffer utilization, in this step we pack smaller message groups that are small enough to fit in the leftover trace buffer width. Usually, these smaller message groups are part of a larger message that cannot fit into the trace buffer; e.g., in OpenSPARC T2, dmusiidata is a 20-bit wide message whereas cputhreadid, a subgroup of dmusiidata, is 6 bits wide. We select a message group that can fit into the leftover trace buffer width such that the information gain of the selected message combination in union with this smaller message group is maximal. We repeat this step until no more smaller message groups can be added to the leftover trace buffer. The benefits of packing are shown empirically in Section VII-A.

In our running example, the trace buffer is filled up by the set of selected message combinations. The flow specification coverage achieved with $Y'$ is 0.7333.
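Packing is a greedy loop; below is a sketch under the assumption that a scoring function gain_fn (for example, the mutual information gain of Step 2) and a table of subgroup widths are available. All names are illustrative:

```python
def pack_trace_buffer(selected, leftover, groups, gain_fn):
    """Greedily pack small message groups into the leftover buffer bits.

    selected: messages already chosen in Step 2; leftover: unused buffer
    bits; groups: dict mapping group name -> bit width (e.g., subfields of
    wide messages, such as a 6-bit cputhreadid inside a 20-bit dmusiidata);
    gain_fn(msgs): information gain of a message set.
    """
    packed = list(selected)
    while True:
        fitting = {g: w for g, w in groups.items()
                   if w <= leftover and g not in packed}
        if not fitting:
            break  # nothing else fits: the buffer is maximally utilized
        # pick the group whose union with the current selection scores best
        best = max(fitting, key=lambda g: gain_fn(packed + [g]))
        packed.append(best)
        leftover -= fitting[best]
    return packed, leftover
```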
V. BUG SYMPTOM DIAGNOSIS METHODOLOGY

A. Formulation of post-silicon debug as an outlier detection problem
A post-silicon execution passes if it finishes without any failures (e.g., hangs, deadlock, livelock, crash, etc.); otherwise the execution fails. For the diagnosis problem, we consider the messages traced during execution as input data. In a post-silicon execution, a failure happens due to the occurrence of one or more message sequences that are symptomatic of one or more design bugs. We consider such a message sequence an anomalous message sequence. Since an anomalous message sequence represents a deviant design behavior, we consider such a message sequence an outlier in the post-silicon execution data space. Consequently, we formulate post-silicon diagnosis as an outlier detection problem. Given a set of anomalous post-silicon executions, our diagnosis method identifies one or more candidate anomalous message sequences.
Since post-silicon executions span millions of clock cycles, for tractable computation we segregate raw trace data into multiple cycle ranges. Further, we assign an index to every legal IP pair (an IP pair is legal if a message is passed between them) and to every unique message that happens in a post-silicon execution (this index is an enumeration of traced messages and is different from the indexed messages discussed in Definition 3). The segregated trace data has three raw features: i) the cycle range in which the message occurred, ii) the index of the legal IP pair between which the message occurred, and iii) the index of the message that occurred. In Figure 4 we show raw trace data in this three-dimensional feature space for several case studies (c.f., Section VIII) of the OpenSPARC T2 SoC.
B. Insufficiency of raw features for detection
An anomalous message sequence has two primary characteristics: i) it is infrequent and ii) it is deviant from other, normal message sequences. An in-depth inspection of Figure 4 shows that the trace data in the raw feature space has the following deficiencies: i) the raw features provide only message-specific information, ii) in the raw feature space outliers are not well demarcated, and iii) the raw features fail to provide the context of the failure during diagnosis.

Hence, we pre-process raw trace message data to construct message sequences and characterize each such message sequence for infrequency and deviancy using engineered features (c.f., Section III-A). The computational cost of analyzing each individual message sequence can be prohibitively large due to the large number of message sequences obtained from traces. To keep the computational cost nominal, instead of analyzing each message sequence individually, we analyze message aggregates of message sequences and characterize each such aggregate for anomaly.
C. Intuition of engineered features
To quantify the characterization of anomalousness, we calculate two engineered feature values for each message aggregate: i) entropy (characterizes infrequency) and ii) Levenshtein distance (characterizes deviancy).
Entropy as an engineered feature: A message aggregate is characterized as anomalous if it contains one or more infrequent unique message sequences. An aggregate is considered to be more anomalous if it contains many such infrequent unique message sequences. An information-theoretic way to quantify the notion of infrequency is to compute the information content of the aggregate. Entropy is one such metric that succinctly quantifies information content. An aggregate with frequent unique message sequences will have less entropy due to less information content. On the other hand, an aggregate with more and more infrequent unique message sequences will have higher entropy due to higher information content. The entropy of a message aggregate is lower bounded by 0.0 (when the aggregate contains exactly one unique message sequence) and is upper bounded by $\log_2(n)$ (when the aggregate contains exactly one of each of $n$ unique message sequences).

Levenshtein distance as an engineered feature: Entropy fails to characterize the specific relationship that exists between the individual unique message sequences of a message aggregate. Consequently, we calculate a similarity metric, in particular the Levenshtein distance (c.f., Section II-B), to quantify the deviancy of the constituent message sequences in a message aggregate. If a message aggregate contains similar unique message sequences, the dissimilarity score will be small, whereas if the message aggregate contains deviant unique message sequences, the dissimilarity score will be large. A message aggregate with a higher Levenshtein distance is likely to be more anomalous than another message aggregate with a smaller Levenshtein distance. The Levenshtein distance of a message aggregate is lower bounded by 0.0 (when the aggregate contains exactly one unique message sequence) and is upper bounded by the average of the pairwise Hamming distances [11] of the unique message sequences (when the aggregate contains $n$ different unique message sequences).

Let us consider aggregates A1: {'aba', 'bab'} and A2: {'aba', 'cdc'}, where a, b, c, d are messages. For each of A1 and A2, the entropy is $\log_2(2) = 1$. Although A2 comprises dissimilar unique message sequences as compared to A1, entropy alone fails to capture that dissimilarity. Hence we calculate the Levenshtein distance of each of the aggregates to quantify the dissimilarity of the constituent message sequences. For A1, L('aba', 'bab') = 2 (1 deletion and 1 insertion) and for A2, L('aba', 'cdc') = 3 (3 substitutions). Clearly, in spite of having the same entropy, the Levenshtein distance helps identify A2 as more anomalous than A1.

In our diagnosis solution, we define a message aggregate as anomalous (i.e., containing anomalous unique message sequences) if it has both high entropy and high Levenshtein distance. Table I summarizes our definition of the anomalousness of a message aggregate.

TABLE I: Definition of anomalies using the engineered features entropy and Levenshtein distance. Ldist: Levenshtein distance. ✓: non-anomalous message aggregate. ✗: anomalous message aggregate.

                Entropy
  Ldist     Low      High
  Low        ✓        ✓
  High       ✓        ✗

Usage of outlier detection algorithms: We apply outlier detection algorithms to the engineered feature data space spanning entropy and Levenshtein distance. In the engineered feature space, message aggregates that represent normal behavior will be very close to each other and will form a dense cluster.
On the other hand, message aggregates that represent anomalous behavior will be sparsely distributed and distant from the normal message aggregates. The outlier algorithms output a list of anomalous message aggregates ranked by outlier score. We output the message sequences contained in the top-five anomalous message aggregates as candidate anomalies.

Fig. 5: Example execution trace and a set of message sequences of length k = 2 and granularity g = 100 cycles.

D. Example for generating engineered feature values from raw feature values
We use the example trace of Figure 5 to explain the steps for generating engineered feature values. This methodology is parameterized by i) the length $k$ of the message sequences for which anomalies need to be detected and ii) the granularity $g$, in number of cycles, at which message aggregates are created. For this example, we use $k = 2$ and $g = 100$.

Step 1 (Creation of message aggregates): We use a sliding window of length $k$ to create a set of $k$-length message sequences. The set of message sequences is partitioned into message aggregates based on the granularity $g$. In the example, the set of two-length message sequences is $S = \{ab, ba, ab, ba, ac\}$. We partition $S$ at a granularity of 100 cycles, which creates two message aggregates $s_1 = \{X, Y, X\}$ and $s_2 = \{X, Y, Z\}$ where $X = ab$, $Y = ba$, $Z = ac$.

Step 2 (Identifying unique message sequences and their occurrences per message aggregate): We identify the unique message sequences per message aggregate and calculate their numbers of occurrences. In this example, $s_1$ has two unique message sequences $X$ and $Y$, and $s_2$ has three unique message sequences $X$, $Y$, and $Z$. In $s_1$, $X$ occurs two times and $Y$ occurs one time. In $s_2$, each of $X$, $Y$, and $Z$ occurs one time.

Step 3 (Calculation of entropy and Levenshtein distance per message aggregate): We calculate the entropy and Levenshtein distance of each message aggregate using the information about unique message sequences from Step 2. In the example, for aggregate $s_1$, $p(X) = 2/3$ and $p(Y) = 1/3$. Hence $H(s_1) = -p(X)\log_2 p(X) - p(Y)\log_2 p(Y) = -2/3 \cdot \log_2(2/3) - 1/3 \cdot \log_2(1/3) = 0.9182$, with $L(X, Y) = 2$, $L(X, X) = 0$, and $L(Y, X) = 2$. The average Levenshtein distance of aggregate $s_1$ is $(2 + 0 + 2)/3 \approx 1.33$.

Similarly, for aggregate $s_2$, $p(X) = p(Y) = p(Z) = 1/3$. Hence $H(s_2) = -p(X)\log_2 p(X) - p(Y)\log_2 p(Y) - p(Z)\log_2 p(Z) = -3 \cdot 1/3 \cdot \log_2(1/3) = 1.58$, with $L(X, Y) = 2$, $L(X, Z) = 1$ (one substitution), and $L(Y, Z) = 2$. The average Levenshtein distance of aggregate $s_2$ is $(2 + 1 + 2)/3 \approx 1.67$.

The aggregates $s_1$ and $s_2$ are thus represented by the tuples (0.9182, 1.33) and (1.58, 1.67) respectively in the engineered feature space. We input these tuples to outlier detection algorithms to detect anomalous message aggregates.
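The three steps can be reproduced end to end with a short script (our own illustration of the methodology, using a compact variant of the Levenshtein helper of Section II-B):

```python
# Reproducing the worked example: sliding window, aggregation, and the two
# engineered features.
from collections import Counter
from itertools import combinations
from math import log2

def levenshtein(a, b):
    """Single-row dynamic program for the Levenshtein distance of Section II-B."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def features(aggregate):
    """(entropy, average pairwise Levenshtein distance) of a message aggregate."""
    counts = Counter(aggregate)
    n = len(aggregate)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    pairs = list(combinations(aggregate, 2))
    avg_ldist = sum(levenshtein(a, b) for a, b in pairs) / len(pairs)
    return entropy, avg_ldist

# Step 1: the two aggregates derived above from the k = 2 sliding window
# over the trace of Figure 5, partitioned at g = 100 cycles.
X, Y, Z = ('a', 'b'), ('b', 'a'), ('a', 'c')
s1 = [X, Y, X]   # cycle range 1
s2 = [X, Y, Z]   # cycle range 2

# Steps 2-3: per-aggregate engineered feature tuples.
print(features(s1))  # (0.918..., 1.333...)
print(features(s2))  # (1.584..., 1.666...)
```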
Fig. 6: Block diagram of OpenSPARC T2 processor. NCU:Non-cacheable unit, MCU: Memory Controller Unit [29], [30]
Fig. 7: Experimental setup to convert design signals to flow messages.
E. Limitations of the engineered features
Our proposed engineered features can diagnose a wide range of post-silicon use-case failures. However, we do not claim that our features are sufficient to diagnose arbitrary post-silicon use-case failures. For instance, if a buggy post-silicon execution trace consists of very few unique messages repeated over and over (e.g., 'abcabcabc...' where a, b, and c are unique messages), then the message aggregates will contain frequently occurring, similar unique message sequences. This results in low entropy (due to the frequent occurrence of each of the unique message sequences in the aggregate) and a small average Levenshtein distance per message aggregate (due to the similar unique message sequences in the aggregate), causing our method to fail to diagnose the bug. Certain classes of bugs may escape our diagnosis method if the engineered features fail to demarcate correct and buggy behaviors in the engineered feature space. However, this does not limit the practical applicability of our diagnosis solution: it expedites predominantly manual post-silicon debugging by several orders of magnitude (c.f., Section VIII-F). To make our solution comprehensive, additional engineered features would be needed.
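A quick numeric illustration of this failure mode on a toy trace of our own construction:

```python
from collections import Counter
from math import log2

trace = 'abc' * 20                 # repetitive buggy execution 'abcabc...'
k = 3
windows = [trace[i:i + k] for i in range(len(trace) - k + 1)]

counts = Counter(windows)          # only {'abc', 'bca', 'cab'} ever occur
n = len(windows)
H = -sum((c / n) * log2(c / n) for c in counts.values())
print(round(H, 2))  # ~1.58 bits, far below the log2(58) ~ 5.86 of an
                    # all-unique aggregate, and flat however long the trace
                    # runs; pairwise Levenshtein distances are likewise small
```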
VI. EXPERIMENTAL SETUP
Design testbed: We primarily use the publicly available OpenSPARC T2 SoC [29], [30] to demonstrate our results. Figure 6 shows an IP-level block diagram of the T2. The three different usage scenarios considered in our debugging case studies are shown in Table II along with the participating flows (columns 2-6) and participating IPs (column 7). We also use the USB design [33] to compare with other methods that cannot scale to the T2.
Testbenches: We used 37 different tests from the fc1_all_T2 regression environment. Each test exercises two or more IPs and their associated flows. We monitored message communication across participating IPs and recorded the messages into an output trace file using the SystemVerilog monitor of Figure 7. We also recorded the status (passing/failing) of each of the tests.

TABLE II: Usage scenarios and participating flows in T2. UID: usage scenario ID. PI: participating IPs. PRC: number of potential root causes. PIOR: PIO read, PIOW: PIO write, NCUU: NCU upstream, NCUD: NCU downstream, and Mon: Mondo interrupt flow. ✓ indicates that a scenario executes a flow and ✗ indicates that it does not. Flows are annotated with (no. of flow states, no. of messages).

UID | PIOR (6,5) | PIOW (3,2) | NCUU (4,3) | NCUD (3,2) | Mon (6,5) | PI                 | PRC
S1  |     ✓      |     ✓      |     ✗      |     ✗      |     ✓     | NCU, DMU, SIU      | 9
S2  |     ✗      |     ✗      |     ✓      |     ✓      |     ✓     | NCU, MCU, CCX      | 8
S3  |     ✓      |     ✓      |     ✓      |     ✓      |     ✗     | NCU, MCU, DMU, SIU | 9

TABLE III: Representative bugs injected in IP blocks of OpenSPARC T2, listed by bug ID, bug depth, bug category, bug type, and buggy IP. Bug depth indicates the hierarchical depth of an IP block from the top. Bug type is the functional implication of a bug.
Bug injection: We created 5 different buggy versions of the T2, which we analyze as five different case studies. Each case study comprises 5 different IPs, and we injected a total of 14 different bugs across the 5 IPs in each case. The injected bugs derive from two sources: i) sanitized examples of communication bugs received from our industrial partners and ii) the "bug model" developed at Stanford University in the QED [14] project, capturing commonly occurring bugs in an SoC design. A few representative injected bugs are detailed in Table III, which shows that the set of injected bugs is complex, subtle, and realistic. It took up to 457 observed messages and up to 21,290,999 clock cycles for each bug symptom to manifest, demonstrating the complexity and subtlety of the injected bugs. Following [29], [30] and Table III, we identified several potential architectural causes that can cause an execution of a usage scenario to fail. Column 8 of Table II shows the number of potential root causes per usage scenario.
Anomaly detection techniques: We used six different outlier detection algorithms, namely IForest, PCA, LOF, LkNN (kNN with the longest-distance method), MukNN (kNN with the mean-distance method), and OCSVM, from PyOD [35]. We applied each of these outlier detection algorithms to the failure trace data generated from each of the five case studies to diagnose the anomalous message sequences that are symptomatic of each injected bug per case study.
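The detector roster can be instantiated as follows (a sketch assuming PyOD's model classes; PyOD's kNN `method` options 'largest' and 'mean' correspond to the LkNN and MukNN variants named above):

```python
from pyod.models.iforest import IForest
from pyod.models.pca import PCA
from pyod.models.lof import LOF
from pyod.models.knn import KNN
from pyod.models.ocsvm import OCSVM

detectors = {
    'IForest': IForest(),
    'PCA': PCA(),
    'LOF': LOF(),
    'LkNN': KNN(method='largest'),  # kNN scored by the longest-distance method
    'MukNN': KNN(method='mean'),    # kNN scored by the mean-distance method
    'OCSVM': OCSVM(),
}

def rank_aggregates(X):
    """Per detector, return aggregate indices ranked most-outlying first.

    X: engineered feature matrix with one (entropy, Levenshtein distance)
    row per message aggregate.
    """
    ranking = {}
    for name, clf in detectors.items():
        clf.fit(X)                     # unsupervised: no labels needed
        scores = clf.decision_scores_  # higher score = more outlying
        ranking[name] = sorted(range(len(X)), key=lambda i: -scores[i])
    return ranking
```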
VII. EXPERIMENTAL RESULTS ON MESSAGE SELECTION
In this section, we provide insights into the effectiveness of our message selection technique using five different (buggy) case studies across three usage scenarios of the T2.
TABLE IV: Trace buffer utilization, flow specification coverage, and path localization of traced messages for 3 different usage scenarios. FSP Cov: flow specification coverage (Definition 7). WP: with packing. WoP: without packing. A 32-bit wide trace buffer is assumed.

Case  | Usage    | Trace Buffer Utilization | FSP Cov          | Path Localization
study | Scenario | WP        WoP            | WP       WoP     | WP       WoP
1     | S1       | 96.88%    84.37%         | 99.86%   97.22%  | 0.13%    3.23%
2     |          |                          |                  | 0.31%    6.11%
3     | S2       | 100%      71.87%         | 99.69%   93.75%  | 0.26%    5.13%
4     |          |                          |                  | 0.10%    2.47%
5     | S3       | 100%      93.75%         | 83.33%   77.78%  | 0.11%    2.65%
Fig. 8: Correlation analysis between mutual information gain and flow specification coverage for different message combi-nations for three different usage scenarios.
A. Flow specification coverage and trace buffer utilization
Table IV demonstrates the value of the traced messages with respect to flow specification coverage (Definition 7) and trace buffer utilization, the two objectives for which our message selection is optimized. Messages selected without packing achieve up to 93.75% trace buffer utilization with up to 97.22% flow specification coverage. With packing, message selection achieves up to 100% trace buffer utilization and up to 99.86% flow specification coverage. This shows that we can cover most of the desired functionality while utilizing the trace buffer maximally.
B. Path localization during debug of traced messages
In this experiment, we use buggy executions and traced messages to show the extent of path localization per bug. Localization is calculated as a fraction of the total paths of the interleaved flow. In Table IV, columns 7 and 8 show the extent of path localization. We needed to explore no more than 6.11% of the interleaved flow paths using our selected messages; with packing, we needed to explore no more than 0.31% of the total interleaved flow paths during debugging. Even with packing, subtle bugs like the NCU bugs of buggy designs 2 and 3 needed more paths to explore.

TABLE V: Comparison of signals selected by our method with SigSeT [4] and PRNet [18] for the USB design. P: partial bits of the signal selected.

Signal Name   | Module           | SigSeT | PRNet | Info Gain
rx data       | UTMI line speed  |   ✗    |   ✓   |    ✓
rx valid      |                  |   ✗    |   ✓   |    ✓
rx data valid | Packet decoder   |   ✗    |   ✗   |    ✓
token valid   |                  |   ✗    |   ✗   |    ✓
rx data done  |                  |   ✗    |   ✗   |    ✓
tx data       | Packet assembler |   ✗    |   ✗   |    ✓
tx valid      |                  |   ✗    |   ✓   |    ✓
send token    | Protocol engine  |   ✗    |   ✗   |    ✓
token pid sel |                  |   P    |   P   |    ✓
data pid sel  |                  |   P    |   ✗   |    ✓

TABLE VI: Selection of important messages by our method.

Message | Affecting Bug IDs | Bug coverage | Message importance | Selected Y/N | Usage scenario
m1      | 8, 33, 36         | 0.21         | 4.76               | Y            | 1, 2
m2      | 8, 33, 34, 36     | 0.28         | 3.57               | Y            | 1, 2
m3      | 33, 36            | 0.14         | 7.14               | Y            | 1, 2
m4      | 8, 29, 33         | 0.21         | 4.76               | Y            | 1, 3
m5      | 18, 33            | 0.14         | 7.14               | Y            | 1, 2
m6      | -                 | -            | -                  | N            | -
m7      | -                 | -            | -                  | Y            | 1, 3
m8      | 33                | 0.07         | 14.28              | Y            | 2
m9      | 1, 33             | 0.14         | 7.14               | N            | -
m10     | 24                | 0.07         | 14.28              | Y            | 2
m11     | 1, 24             | 0.14         | 7.14               | Y            | 2
m12     | 24                | 0.07         | 14.28              | Y            | 2
m13     |                   |              |                    |              |
m14     | 1, 17, 33         | 0.21         | 4.76               | Y            | 2
m15     | 1, 17, 18, 33     | 0.28         | 3.57               | N            | -
m16     | 1, 17, 18, 33     | 0.28         | 3.57               | Y            | 2, 3

Fig. 9: Root causing the buggy IP. (a) Number of traced messages investigated vs. number of pruned candidate legal IP pairs. (b) Number of selected messages investigated vs. number of pruned potential root causes (case studies 1-5).
C. Validity of information gain as message selection metric
We select messages per usage scenario. In Figure 8 we analyze the correlation between flow specification coverage and the mutual information gain of the selected messages. Flow specification coverage (Definition 7) increases monotonically with the mutual information gain over the interleaved flow of the corresponding usage scenario. This establishes that an increase in mutual information gain corresponds to higher flow specification coverage, indicating that mutual information gain is a good metric for message selection.
D. Comparison of our method to existing signal selection methods
To demonstrate that existing Register Transfer Level signal selection methods cannot select messages in system-level flows, we compare our approach with an SRR-based method [4] and a PageRank-based method [18]. We could not apply existing SRR-based methods to the OpenSPARC T2, since these methods are unable to scale; we instead use the smaller USB design [33] for this comparison, considering a usage scenario consisting of two flows. Table V shows that our (mutual information gain based) method selects all of token_pid_sel, data_pid_sel, and the other important interface signals for system-level debugging. SigSeT, on the other hand, selects signals that are not useful for system-level debugging. Our messages are composed of interface signals and achieve a far higher flow specification coverage than the messages composed of the interface signals selected by SigSeT and PRNet.

Fig. 10: Selected message-cause pruning distribution for diagnosis (case studies 1-5): plausible causes vs. pruned causes.

TABLE VII: Diagnosed root causes and debugging statistics for our case studies on OpenSPARC T2 (per case study: number of flows, legal IP pairs, legal IP pairs investigated, messages investigated, and root-caused architecture-level function).

Fig. 11: (a) shows the total number of message aggregate samples for different-length message sequences for the different debugging case studies. (b) and (c) demonstrate that our diagnosis methodology is computationally efficient in terms of runtime and peak memory usage across six different outlier detection algorithms for each of the case studies.

E. Selection of important messages by our method
For evaluation purposes, we use bug coverage as a metric to determine which messages are important. A message is said to be affected by a bug if its value in an execution of the buggy design differs from its value in an execution of the bug-free design. Intuitively, if multiple bugs affect a message, it is highly likely that the message is part of multiple design paths. The bug coverage of a message is defined as the total number of bugs that affect the message, expressed as a fraction of the total number of injected bugs. From a debugging perspective, a message is important if it is affected by very few bugs, implying that the message symptomatizes subtle bugs. Table VI confirms that post-silicon bugs are subtle and tend to affect no more than 4 messages each. Columns 4, 5, and 6 of Table VI show that our method was able to select important messages from the interleaved flow to debug subtle bugs. Table VI also shows that message m15 is affected by four bugs and message m9 is affected by two bugs, but because their sizes are wider than the 32-bit trace buffer, our method does not select them.

F. Effectiveness of selected messages in debugging usage scenarios
F. Effectiveness of selected messages in debugging usage scenarios

Every message is sourced by an IP and reaches a destination IP. Bugs are injected into specific IPs (Table III). During debug, sequences of IPs are explored from the point at which a bug symptom is observed, in order to find the buggy IP. An IP pair (<source IP, destination IP>) is legal if a message is passed between them. We use the number of legal IP pairs investigated during debug as a metric for the selected messages. Table VII shows that we investigated an average of 54.67% of the total legal IP pairs, implying that our selected messages help us focus on a fraction of the legal IP pairs.

To debug a buggy execution, we start with the traced message in which a bug symptom is observed and backtrack to other traced messages. The choice of which traced message to investigate is pseudo-random and guided by the participating flows. Figure 9(a) plots the number of such investigated traced messages and the corresponding candidate legal IP pairs that are eliminated with each traced message. Figure 9(b) shows a similar relationship between the traced messages and the candidate root causes, i.e., the architecture-level functions that might have caused the bug to manifest in the traced messages. Both graphs show that with more traced messages, more candidate legal IP pairs as well as candidate root causes are progressively eliminated. This implies that every one of our traced messages contributes to the debug process; a sketch of this elimination loop appears below.

Figure 10 shows that the traced messages were able to prune out a large number of potential root causes in all five case studies. Our traced messages pruned out an average of 78.89% (max. 88.89%) of candidate root causes.
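A minimal sketch of the elimination loop, assuming each traced message carries hypothetical src_ip and dst_ip fields and that investigating a message retires the corresponding legal IP pair (the actual investigation order is flow-guided, as described above):

def investigate(traced_messages, legal_ip_pairs):
    """Walk back-tracked messages, reporting surviving candidate IP pairs."""
    candidates = set(legal_ip_pairs)
    for msg in traced_messages:                    # pseudo-random, flow-guided order
        candidates.discard((msg.src_ip, msg.dst_ip))   # pair explained by this message
        yield msg, len(candidates)                 # remaining candidates after this step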
Fig. 12: (a), (b), and (c) show that the engineered features (entropy vs. Levenshtein distance) demarcate normal and anomalous message aggregates for Case Studies 1, 3, and 5, each with k = 5.

VIII. EXPERIMENTAL RESULTS ON DEBUG AND DIAGNOSIS
In this section we provide insights into our bug diagnosis methodology by debugging five different buggy case studies across three usage scenarios of the OpenSPARC T2 SoC. For these experiments, we used g = 100000 cycles and varied k from two to the number of valid IP pairs (c.f., Table VII) for each of the case studies. The number of message aggregate samples for message sequences of different lengths, for each outlier detection algorithm and debugging case study, is shown in Figure 11a.
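The following sketch shows one way such aggregates can be built, assuming the trace is a list of (cycle, message) tuples; the g-cycle windowing and the overlapping length-k slicing are our simplifications of the preprocessing described above.

from collections import defaultdict

def message_aggregates(trace, g=100_000, k=2):
    """Split a (cycle, message) trace into g-cycle windows and collect the
    length-k message sequences of each window as one aggregate."""
    windows = defaultdict(list)
    for cycle, msg in trace:
        windows[cycle // g].append(msg)
    return [[tuple(msgs[i:i + k]) for i in range(len(msgs) - k + 1)]
            for _, msgs in sorted(windows.items())]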
A. Computational effort for data preprocessing and outlier message sequence diagnosis

In this experiment, we show the scalability of the automated diagnosis methodology in terms of runtime and peak memory usage. Figure 11b and Figure 11c show the runtime and peak memory usage of the preprocessing step and of the outlier detection algorithms. To calculate the average runtime and average peak memory usage of each outlier detection algorithm, we ran each of them multiple times and averaged the measurements. Preprocessing the trace message data to create message sequence aggregates incurred a runtime of up to 44.3 seconds (average 10.8 seconds) and a peak memory usage of up to 508.7 MB (average 457.73 MB). Running each outlier detection algorithm on the processed message aggregates incurred only up to 18.91 seconds (average 2.77 seconds) of runtime and a peak memory usage of up to 508.2 MB (average 451.27 MB). Since preprocessing incurs up to 443x (average 3x) more runtime than running each outlier detection algorithm, we plot runtime on a log scale in Figure 11b. This experiment shows that our trace-data preprocessing and diagnosis is computationally efficient.
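Measurements of this kind can be reproduced with a simple harness such as the sketch below. This is our own harness, not the authors' tooling (which is not specified), and tracemalloc tracks only Python-heap allocations.

import time
import tracemalloc

def profile(fn, *args, repeats=10):
    """Average wall-clock runtime (seconds) and peak memory (MB) of fn."""
    runtimes, peaks = [], []
    for _ in range(repeats):
        tracemalloc.start()
        t0 = time.perf_counter()
        fn(*args)
        runtimes.append(time.perf_counter() - t0)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak / 2**20)          # bytes -> MB
    return sum(runtimes) / repeats, sum(peaks) / repeats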
B. Validity of entropy and Levenshtein distance as engineered features for outlier message sequence diagnosis
In this experiment, we analyze the effectiveness of entropy and Levenshtein distance in identifying message aggregates that contain anomalous message sequences. In Figure 12 we show the joint probability distribution of entropy and Levenshtein distance, and in Figure 13 we show the minimum, maximum, and average entropy and Levenshtein distance of anomalous message aggregates across message sequences of different lengths for three debugging case studies.

As shown in Figure 12, in the engineered feature space the message aggregates for normal behavior form a dense cluster, whereas anomalous message sequences are sparsely distributed and lie at a distance from the normal message aggregates. Further, Figure 13 shows that message aggregates that contain anomalous message sequences have an entropy of up to 4.3482 (average 2.08) and a Levenshtein distance of up to 3.0 (average 1.5734). This experiment validates that entropy and Levenshtein distance are valuable and effective engineered features for demarcating anomalous message aggregates from normal ones.
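Both engineered features are inexpensive to compute per message aggregate, as the sketch below shows. The choice of reference sequence against which the Levenshtein distance is measured (e.g., the specification-expected message sequence) is our assumption; the paper defines the features formally in an earlier section.

import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy (bits) of the symbol distribution in one sequence [27]."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def levenshtein(a, b):
    """Edit distance between two message sequences [13], two-row DP."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]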
C. Agreement among different outlier detection algorithms in detecting outlier message sequences
In this experiment, we assess the extent of agreement between the anomalies identified by the various outlier detection algorithms (c.f., Section III). Since these algorithms use different methods for outlier detection, we surmise that the confidence in an anomalous message aggregate is higher if multiple outlier detection algorithms identify it as such. For this analysis, we consider the top 10% of anomalous message aggregates per outlier detection algorithm per case study. Our analysis showed that all six outlier detection algorithms agree on a total of six anomalous message aggregates, five algorithms agree on a total of 17 anomalous message aggregates, three algorithms agree on a total of six anomalous message aggregates, and two algorithms agree on a total of six anomalous message aggregates; each of these agreement sets diagnoses a fraction of the injected bugs.
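The agreement tally reduces to counting votes across the detectors' top-10% sets, as in this sketch; detector outputs are assumed to be given as sets of aggregate identifiers.

from collections import Counter

def agreement(top10_by_algo):
    """top10_by_algo: dict algorithm name -> set of aggregate ids in its top 10%.
    Returns a dict: vote count -> set of aggregates receiving that many votes."""
    votes = Counter()
    for ids in top10_by_algo.values():
        votes.update(ids)
    by_count = {}
    for agg, n in votes.items():
        by_count.setdefault(n, set()).add(agg)
    return by_count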
This experiment shows that our engineered features are generic enough to characterize anomalies, such that multiple outlier detection algorithms agree on a large number of anomalies that diagnose multiple bugs. This observation motivated us to use a comprehensive anomaly score to rank message aggregates. We explain the comprehensive anomaly score calculation in Section VIII-F.
Fig. 13: (a), (b), and (c) show that the minimum, maximum, and average values of the engineered features are high for anomalous message aggregates irrespective of message sequence length. <H_min, H_max, H_avg>: minimum, maximum, and average entropy. <L_min, L_max, L_avg>: minimum, maximum, and average Levenshtein distance.

TABLE VIII: Diagnosis statistics for different outlier detection algorithms for different case studies using the OpenSPARC T2 SoC [29], [30]. PCA: Principal Component Analysis [15]. LOF: Local Outlier Factor based algorithm [6]. LkNN: k-nearest neighbor using largest distance as metric [3], [25]. MukNN: k-nearest neighbor using mean distance as metric [3], [25]. OCSVM: one-class Support Vector Machine [26]. D: fraction of injected bugs diagnosed by an outlier detection algorithm. t_p: total number of true positive message sequences. f_p: total number of false positive message sequences (no more than 37% anomalous message sequences). P: precision of an outlier detection algorithm. OS: overall diagnosis statistics for each outlier detection algorithm per debugging case study. Columns (per algorithm IForest, PCA, LOF, LkNN, MukNN, OCSVM): D, t_p, f_p, P.

D. Comparison of precision of different outlier detection algorithms in detecting outlier message sequences
In this experiment, we compare the precision (c.f., Definition 10), recall (c.f., Definition 11), and accuracy (c.f., Definition 12) of each outlier detection algorithm in diagnosing anomalous message sequences per debugging case study. In Table VIII, we show the fraction of injected bugs diagnosed and the number of true positive and false positive candidate anomalous message sequences identified by each outlier detection algorithm per debugging case study. In Table IX, we show the fraction of the total number of injected bugs diagnosed and the total number of true positive, false positive, true negative, and false negative candidate anomalous message sequences identified across all outlier detection algorithms per debugging case study. For this analysis, we considered only the top 10% anomalous message aggregates identified by each outlier detection algorithm per debugging case study.

Our analysis shows that IForest, MukNN, and OCSVM consistently performed better in anomalous message sequence diagnosis than the other three algorithms, PCA, LOF, and LkNN. Each outlier detection algorithm diagnosed up to 100% of the injected bugs. IForest diagnosed on average 73% of injected bugs with a precision of up to 0.8 (average 0.69), MukNN diagnosed on average 67% of injected bugs with a precision of up to 0.77 (average 0.70), and OCSVM diagnosed on average 67% of injected bugs with a precision of up to 0.8 (average 0.74) per debugging case study. On the other hand, PCA diagnosed on average 40% of injected bugs with a precision of up to 0.8 (average 0.69), LOF diagnosed on average 47% of injected bugs with a precision of up to 1.0 (average 0.81), and LkNN diagnosed on average 47% of injected bugs with a precision of up to 0.82 (average 0.76) per debugging case study. Further analysis (c.f., Table IX) shows that our automated diagnosis technique was able to detect up to 100% (average 81.8%) of injected bugs with a precision of up to 0.769 (average 0.756) per debugging case study.

In Table IX, we also show the recall and accuracy per debugging case study. Our diagnosis methodology achieved up to 0.69 (average 0.46) recall and up to 0.56 (average 0.39) accuracy. We note that the recall and accuracy values in Table IX are relatively small. This is because we consider only the top 10% anomalous message aggregates for this analysis. Consequently, the t_p in the numerator is computed from those top 10% anomalous message aggregates, whereas f_n and t_n are computed over the entire set of message aggregates. The numerators are therefore much smaller than the denominators (c.f., Definition 11 and Definition 12), which results in small recall and accuracy values. This experiment shows that our automated diagnosis methodology using engineered features is effective in identifying complex and subtle bugs with high precision.
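For clarity, the three metrics are the standard ones, computed from the t_p, f_p, t_n, and f_n counts reported in Tables VIII and IX:

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

As noted above, restricting t_p to the top-10% aggregates while counting f_n and t_n over all aggregates shrinks the numerators of recall and accuracy relative to their denominators.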
E. Improvement in diagnosis over manual debugging
In this experiment, we analyze the improvement in diagnosis, in terms of the number of injected bugs diagnosed and the diagnosis time, over manual debugging. Table X (columns 7 and 8) summarizes the diagnosis improvement. We were able to diagnose up to 66.7% more injected bugs (average 46.67%) with up to 847x (average 464.35x) less diagnosis time. This experiment shows that our automated bug diagnosis is effective and expedites debugging.

TABLE IX: Overall statistics of automated debugging across all outlier detection algorithms and all case studies. D: fraction of injected bugs detected. P: precision. R: recall. A: accuracy. Columns: Case Study; D; sequence counts t_p, t_n, f_p, f_n; P; R; A.
F. Comprehensive ranking of outlier message sequences
In Section VIII-D, our experimental results showed that IForest, OCSVM, and MukNN are the three most effective of the six outlier detection algorithms for diagnosing useful anomalous message sequences that can help in debugging. Each of IForest, OCSVM, and MukNN (c.f., Section III) detects anomalous message aggregates from a different perspective. IForest selects an anomalous message aggregate based on the shorter path lengths created by random selection of a feature and recursive partitioning of the feature data. OCSVM selects an anomalous message aggregate by solving an optimization problem to find a maximal-margin hyperplane that best separates anomalous message aggregates. MukNN (i.e., k-NN with mean distance as metric) selects an anomalous message aggregate based on the aggregate's local density and the distance to its k-th nearest neighbor.

Consequently, to incorporate these different perspectives into our diagnosis methodology, we use a heuristic combination of the outlier scores from each of the above three algorithms for each message aggregate. We found that a linear combination of the outlier scores of a message aggregate is in closer agreement with our empirical findings than relying on the outlier score from any individual algorithm. Let x be a message aggregate, Ano(x) be the comprehensive outlier score of x, and IForest(x), OCSVM(x), and MukNN(x) be the outlier scores of x under the IForest, OCSVM, and MukNN algorithms, respectively. We define Ano(x) as

Ano(x) = (IForest(x) + OCSVM(x) + MukNN(x)) / 3.

In our experiments, we rank anomalous message aggregates based on the comprehensive outlier score defined above.
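A sketch of this comprehensive scoring using the PyOD toolbox [35] follows. The min-max scaling of each detector's scores before averaging is our assumption, since raw scores from different detectors live on different scales; the feature matrix X (one row of engineered features per message aggregate) is hypothetical.

import numpy as np
from pyod.models.iforest import IForest
from pyod.models.knn import KNN      # method='mean' gives MukNN-style scoring
from pyod.models.ocsvm import OCSVM

def comprehensive_scores(X):
    """Average of scaled outlier scores from IForest, OCSVM, and mean-distance
    kNN, mirroring Ano(x) above."""
    detectors = [IForest(), OCSVM(), KNN(method='mean')]
    scaled = []
    for det in detectors:
        det.fit(X)
        s = det.decision_scores_                       # outlier score per sample
        scaled.append((s - s.min()) / (s.max() - s.min() + 1e-12))
    return np.mean(scaled, axis=0)                     # Ano(x) per aggregate

Ranking the aggregates then amounts to sorting them by this score in descending order, e.g., np.argsort(-comprehensive_scores(X)).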
IX. QUALITATIVE CASE STUDY ON EFFECTIVENESS OF OUR MESSAGE SELECTION AND DIAGNOSIS METHODOLOGY

It is illuminating to walk through a case study to appreciate the effectiveness of the selected messages and our bug detection methodology in the debugging process.
Symptom: In this experiment we used the traced messages from Table XI. The simulation failed with the error message FAIL: Bad Trap.

Manual debug with selected messages:
We consider the bug symptom causes of Table XI to debug this case.

TABLE X: Summary of detection improvements achieved using the automated detection technique over manual debugging. N: number of symptomatic message sequences identified. T: time taken to identify a symptomatic message sequence (hours for manual debugging, seconds for automated). D: improvement in terms of the number of additional detected bugs as a fraction of injected bugs. t: improvement in detection time. A diamond marker denotes "not available".

From the observed trace messages, siincu and piowcrd, we identify that NCU got back the correct credit ID at the end of the PIO read and the PIO write operations, respectively. This rules out two causes out of 9. However, we cannot rule out causes related to the PIO payload, since a wrong payload may cause the computing thread to catch
BAD Trap by requesting an operand from a wrong memory location. The absence of the trace messages mondoacknack and reqtot implies that NCU did not service any Mondo interrupt request and that SIU did not request a Mondo payload transfer to NCU, respectively. Further, there is no message corresponding to dmusiidata.cputhreadid in the trace file, implying that DMU was never able to generate a Mondo interrupt request for NCU to process. This rules out all causes except one, which we explore further to find the root cause.

Manual root causing:
From [29], [30], we note that an interrupt is generated only when DMU has credit and all previous DMA reads are done. We found no prior DMA read messages, and DMU had all of its credit available. The absence of a dmusiidata message with the correct CPUID and ThreadID implies that DMU never generated a Mondo interrupt request. This makes DMU a plausible location of the root cause of the bug.
Debug with bug diagnosis methodology:
We apply our bug diagnosis methodology to the same set of trace messages as before. The methodology identified five anomalous message aggregates containing a total of 26 unique message sequences. We found 20 true positive anomalous message sequences that are symptomatic of different bugs that we injected into the design. Among these 20 anomalous message sequences, 18 were symptomatic of the bug that we identified manually. The remaining two message sequences were symptomatic of the other two injected bugs. Clearly, while debugging manually, we were unable to detect the latter two bugs because i) they were more subtle and ii) their symptomatic message sequences were extremely infrequent. Interestingly, the manual debug took approximately eight hours to diagnose one symptomatic message sequence. In comparison, the automated bug diagnosis methodology took
TABLE XI: Representative potential root causes for one case study. The remaining root causes are omitted due to lack of space; the remaining case studies are available in [32].
Selected Messages | Potential Causes | Potential Implication
reqtot, grant, mondoacknack, siincu, piowcrd | Mondo request forwarded from DMU to SIU's bypass queue instead of the ordered queue | Mondo interrupt not serviced
dmusiidata.cputhreadid | Invalid Mondo payload forwarded to NCU from DMU via SIU | Interrupt assigned to wrong CPU ID and Thread ID

only approximately 62 seconds to pre-process the trace messages and to diagnose candidate anomalous message sequences using the different outlier detection algorithms (Table X quantifies the improvement). Additionally, the diagnosis method was able to diagnose candidate anomalous message sequences for two more bugs, an improvement over manual debugging (c.f., Table X). This case study shows that our bug diagnosis methodology automates and expedites the tedious and error-prone manual debugging process of post-silicon failures.
X. DISCUSSIONS AND CONCLUSION
In light of our experimental findings, we believe that a synergistic application of feature engineering and anomaly detection is a powerful tool for application-level post-silicon debug and diagnosis. Although the two features presented in this work capture a wide range of bugs, we acknowledge that this set of features is not complete and may fail to capture certain application-level bugs. Since our proposed bug diagnosis framework is generic, one may engineer additional features and plug them in to diagnose a wider set of bugs.

In conclusion, we have presented an automated post-silicon bug diagnosis methodology for SoC use-case failures. Our solution uses the power of machine learning and feature engineering to automatically learn the buggy design behavior and the normal design behavior from the trace data by analyzing intrinsic data features, without requiring prior knowledge of the design. Our proposed diagnosis solution is highly effective and can diagnose many more bugs in a fraction of the time, with high precision, as compared to manual debugging. We demonstrate the effectiveness of our proposed diagnosis solution using real-world debugging case studies on the OpenSPARC T2 SoC.
REFERENCES

[1] Y. Abarbanel, E. Singerman, and M. Y. Vardi. Validation of SoC firmware-hardware flows: Challenges and solution directions. In The 51st Annual DAC '14, San Francisco, CA, USA, June 1-5, 2014, pages 2:1–2:4, 2014.
[2] M. Amer, M. Goldstein, and S. Abdennadher. Enhancing one-class support vector machines for unsupervised anomaly detection. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pages 8–15. ACM, 2013.
[3] F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '02, pages 15–26, London, UK, 2002. Springer-Verlag.
[4] K. Basu and P. Mishra. Efficient trace signal selection for post silicon validation and debug. In 24th International Conference on VLSI Design, pages 352–357. IEEE, 2011.
[5] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. arXiv preprint arXiv:1206.5538, 2014.
[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, pages 93–104, New York, NY, USA, 2000. ACM.
[7] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
[8] D. Chatterjee, C. McCarter, and V. Bertacco. Simulation-based signal selection for state restoration in silicon debug. In 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 595–601. IEEE, 2011.
[9] R. Fraer, D. Keren, Z. Khasidashvili, A. Novakovsky, A. Puder, E. Singerman, E. Talmor, M. Y. Vardi, and J. Yang. From visual to logical formalisms for SoC validation. In Twelfth ACM/IEEE MEMOCODE 2014, Lausanne, Switzerland, October 19-21, 2014, pages 165–174, 2014.
[10] M. Goldstein and S. Uchida. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE, 11(4):e0152173, 2016.
[11] R. W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147–160, April 1950.
[12] J. Heaton. An empirical analysis of feature engineering for predictive modeling. In SoutheastCon 2016, 2016.
[13] Levenshtein distance. https://en.wikipedia.org/wiki/Levenshtein_distance.
[14] D. Lin, T. Hong, Y. Li, S. Kumar, F. Fallah, N. Hakim, D. Gardner, S. Mitra, et al. Effective post-silicon validation of system-on-chips using quick error detection. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 33(10):1573–1590, 2014.
[15] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang. A novel anomaly detection scheme based on principal component classifier. In Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the Third IEEE International Conference on Data Mining (ICDM '03), pages 172–179, 2003.
[16] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pages 413–422, Washington, DC, USA, 2008. IEEE Computer Society.
[17] F. T. Liu, K. M. Ting, and Z.-H. Zhou. Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data, 6(1):3:1–3:39, March 2012.
[18] S. Ma, D. Pal, R. Jiang, S. Ray, and S. Vasudevan. Can't see the forest for the trees: State restoration's limitations in post-silicon trace signal selection. In Proceedings of ICCAD 2015, Austin, TX, USA, November 2-6, 2015, pages 1–8, 2015.
[19] T. M. Mitchell. Machine Learning. McGraw-Hill, New York, NY, USA, 1st edition, 1997.
[20] S. Mitra, S. A. Seshia, and N. Nicolici. Post-silicon validation opportunities, challenges and recent advances. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 12–17, New York, NY, USA, 2010. ACM.
[21] K. P. Murphy. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning series. MIT Press, 2012.
[22] D. Pal, A. Sharma, S. Ray, F. M. de Paula, and S. Vasudevan. Application level hardware tracing for scaling post-silicon debug. In Proceedings of the 55th Annual Design Automation Conference, DAC 2018, San Francisco, CA, USA, June 24-29, 2018, pages 92:1–92:6, 2018.
[23] P. Patra. On the cusp of a validation wall. IEEE Design and Test of Computers, 24(2):193–196, 2007.
[24] K. Rahmani, S. Ray, and P. Mishra. Postsilicon trace signal selection using machine learning techniques. IEEE Trans. VLSI Syst., 25(2):570–580, 2017.
[25] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, pages 427–438, New York, NY, USA, 2000. ACM.
[26] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, July 2001.
[27] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.
[28] E. Singerman, Y. Abarbanel, and S. Baartmans. Transaction based pre-to-post silicon validation. In Proceedings of the 48th DAC 2011, San Diego, California, USA, June 5-10, 2011.
[31] FMCAD 2015, Austin, Texas, USA, September 27-30, 2015, pages 168–175, 2015.
[32] Zoom Out and See Better: Scalable Message Tracing for Post-Silicon SoC Debug, 2017. http://hdl.handle.net/2142/98857.
[33] USB 2.0, 2008. http://opencores.org/project,usb.
[34] S. Yerramilli. Addressing Post-Silicon Validation Challenge: Leverage Validation and Test Synergy. Keynote, International Test Conference, 2006.
[35] Y. Zhao, Z. Nasrullah, and Z. Li. PyOD: A Python toolbox for scalable outlier detection. arXiv preprint arXiv:1901.01588, 2019.