A Computational Approach to Packet Classification
Alon Rashelbach
Ori Rottenstreich
Mark Silberstein
ABSTRACT
Multi-field packet classification is a crucial component in modern software-defined data center networks. To achieve high throughput and low latency, state-of-the-art algorithms strive to fit the rule lookup data structures into on-die caches; however, they do not scale well with the number of rules. We present a novel approach,
NuevoMatch, which improves the memory scaling of existing methods. A new data structure,
Range Query Recursive Model Index (RQ-RMI), is the key component that enables NuevoMatch to replace most of the accesses to main memory with model inference computations. We describe an efficient training algorithm that guarantees the correctness of the RQ-RMI-based classification. The use of RQ-RMI allows the rules to be compressed into model weights that fit into the hardware cache. Further, it takes advantage of the growing support for fast neural network processing in modern CPUs, such as wide vector instructions, achieving a rate of tens of nanoseconds per lookup.

Our evaluation using 500K multi-field rules from the standard ClassBench benchmark shows a geometric mean compression factor of 4.9×, 8×, and 82×, and average performance improvement of 2.4×, 2.6×, and 1.6× in throughput compared to CutSplit, NeuroCuts, and TupleMerge, all state-of-the-art algorithms.

INTRODUCTION

Packet classification is a cornerstone of packet-switched networks. Network functions such as switches use a set of rules that determine which action they should take for each incoming packet. The rules originate in higher-level domains, such as routing, Quality of Service, or security policies. They match the packets' metadata, e.g., the destination IP address and/or the transport protocol. If multiple rules match, the rule with the highest priority is used.

Packet classification algorithms have been studied extensively. There are two main classes: those that rely on Ternary Content Addressable Memory (TCAM) hardware [13, 20, 23, 28, 37], and those that are implemented in software [3, 8, 21, 22, 34, 36, 41, 44]. In this work, we focus on software-only algorithms that can be deployed in virtual network functions, such as forwarders or ACL firewalls, running on commodity x86 servers.

Software algorithms fall into two major categories: decision-tree based [8, 21, 22, 34, 41, 44] and hash-based [3, 36].
The former use decision trees for indexing and matching the rules, whereas the latter perform lookups via hash-tables by hashing the rules' prefixes. Other methods for packet classification [7, 38] are less common, as they either require too much memory or are too slow.

A key to achieving high classification performance on modern CPUs is to ensure that the classifier fits into the CPU on-die cache. When the classifier is too large, the lookup involves high-latency

This work does not raise any ethical issues.
Figure 1: NuevoMatch overview. The rules are divided into Independent Sets indexed by RQ-RMIs and the Remainder Set indexed by any classifier. One RQ-RMI predicts the storage index of the matching rule. The Selector chooses the highest-priority matching rule.

memory accesses, which stall the CPU, as the data-dependent access pattern during the lookup impedes hardware prefetching. Unfortunately, as the number of rules grows, it becomes difficult to maintain the classifier in the cache. In particular, in decision-tree methods, rules are often replicated among multiple leaves of the decision tree, inflating its memory footprint and limiting scalability. Consequently, recent approaches, notably CutSplit [21] and NeuroCuts [22], seek to reduce rule replication to achieve better scaling. However, they still fail to scale to large rule-sets, which in modern data centers may reach hundreds of thousands of rules [6]. Hash-based techniques also suffer from poor scaling, as adding rules increases the number of hash-tables and their size.

We propose a novel approach to packet classification,
NuevoMatch, which compresses the rule-set index dramatically to fit it entirely into the upper levels of the CPU cache (L1/L2), even for large 500K rule-sets. We introduce a novel
Range Query Recursive Model Index (RQ-RMI) model and train it to learn the rules' matching sets, turning rule matching into neural network inference. We show that RQ-RMI achieves out-of-L1-cache execution by reducing the memory footprint on average by 4.9×, 8×, and 82× compared to recent CutSplit [21], NeuroCuts [22], and TupleMerge [3] on the standard ClassBench [39] benchmarks, and up to 29× for real forwarding rule-sets.

To the best of our knowledge, NuevoMatch is the first to perform packet classification using trained neural network models. NeuroCuts also uses neural nets, but it applies them to optimizing the decision tree parameters during the offline tree construction phase; its rule matching still uses traditional (optimized) decision trees. In contrast, NuevoMatch performs classification via RQ-RMIs, which are more space-efficient than decision trees or hash-tables, improving scalability by an order of magnitude.

NuevoMatch transforms the packet classification task from memory- to compute-bound. This design is appealing because it is likely to scale well in the future, with rapid advances in hardware acceleration of neural network inference [11, 19, 29]. On the other hand, the performance of both decision trees and hash-tables is inherently limited by the poor scaling of DRAM access latency and CPU on-die cache sizes (e.g., 1.× over five years for L1 in Intel's CPUs).

NuevoMatch builds on the recent work on learned indexes [18], which applies a Recursive Model Index (RMI) model to indexing key-value pairs. The values are stored in an array, and the RMI is trained to learn the mapping function between the keys and the indexes of their values in the array. The model is used to predict the index given the key. When applied to databases [18], RMI boosts performance by compressing the indexes to fit in CPU caches.

Unfortunately, RMI is not directly applicable to packet classification.
First, a key (packet field) may not have an exact matching value, but instead match a rule range, whereas RMI can learn only exact key-index pairs. This is a fundamental property of RMI: it guarantees correctness only for the keys used during training, but provides no such guarantees for non-existing keys ([18], Section 3.4). Thus, range matching would require enumeration of all possible keys in the range, making it too slow. Second, the match is evaluated over multiple packet fields, requiring lookup in a multi-dimensional space. Unfortunately, multi-dimensional RMI [17] requires that the input be flattened into one dimension, which in the presence of wildcards results in an exponential blowup of the input domain, making it too large to learn with compact models. Finally, a key may match multiple rules, with the highest-priority one used as output, whereas RMI retrieves only a single index for each key. NuevoMatch successfully solves these challenges.

RQ-RMI. We design a novel model which can match keys to ranges, with an efficient training algorithm that does not require exhaustive key enumeration to learn the ranges. The training strives to minimize the prediction error of the index while maintaining a small model size. We show that the models can store indices of 500K ClassBench rules in 35 KB (§5.2.1). We prove that our algorithm guarantees range lookup correctness (§3.3).
Multi-field packet classification. To enable multi-field matching with overlapping ranges, the rule-set is split into independent sets with non-overlapping ranges, called iSets, each associated with a single field and indexed with its own RQ-RMI model. The iSet partitioning (§3.6) strives to cover the rule-set with as few iSets as possible, discarding those that are too small. The remainder set of the rules not covered by large iSets is indexed via existing classification techniques. In practice, the rules in the remainder constitute a small fraction of representative rule-sets, so the remainder index fits into a fast cache together with the RQ-RMIs.

Figure 1 summarizes the complete classification flow. The query of the RQ-RMI models produces the hints for the secondary search that selects one matching rule per iSet. The validation stage selects the candidates with a positive match across all the fields, and a selector chooses the highest-priority matching rule.

Conceptually, NuevoMatch can be seen as an accelerator for existing packet classification techniques and thus complements them. In particular, the RQ-RMI model is best used for indexing rules with high value diversity that can be partitioned into fewer iSets. We show that the iSet construction algorithm is effective at selecting the rules that can be indexed via RQ-RMI, leaving the rest in the remainder (§5.3.1). The performance benefits of NuevoMatch become evident when it indexes more than 25% of the rules. Since the remainder is only a fraction of the original rule-set, it can be indexed efficiently with smaller decision-trees/hash-tables or will fit smaller TCAMs.

Our experiments show that NuevoMatch outperforms all the state-of-the-art algorithms on synthetic and real-life rule-sets. For example, it is faster than CutSplit, NeuroCuts, and TupleMerge by 2.7×, 4.4×, and 2.6× in latency and 2.4×, 2.6×, and 1.6× in throughput respectively, averaged over 12 rule-sets of 500K ClassBench-generated rules, and by 7.× in latency and 3.× in throughput vs. TupleMerge for the real-world Stanford backbone forwarding rule-set.

NuevoMatch supports rule updates by removing the updated rules from the RQ-RMI and adding them to the remainder set indexed by another algorithm that supports fast updates, e.g., TupleMerge. This approach requires periodic retraining to maintain a small remainder set; hence it does not yet support more than a few thousand updates (§3.9). Algorithmic solutions for directly updating RQ-RMI are deferred to future work.

In summary, our contributions are as follows.
• We present a novel RQ-RMI model and a training technique for learning packet classification rules.
• We demonstrate the application of RQ-RMI to multi-field packet classification.
• NuevoMatch outperforms existing techniques in terms of memory footprint, latency, and throughput on challenging rule-sets with up to 500K rules, compressing them to fit into small caches of modern processors.
BACKGROUND

This section describes the packet classification problem and surveys existing solutions.
Packet classification is the process of locating a single rule that is satisfied by an input packet among a set of rules. A rule contains a few fields of the packet's metadata. Wildcards define ranges, i.e., they match multiple values. Ranges may overlap with each other, i.e., a packet may match several rules, but only the one having the highest priority is selected. Figure 2 illustrates a classifier with two fields and five overlapping matching rules. An incoming packet matches two of the rules, but only the one with the higher priority is used.

Packet classification performance becomes difficult to scale as the number of rules and the number of matching fields grow. Therefore, it has received renewed interest with the increased complexity of software-defined data center networks, featuring hundreds of thousands of rules per virtual network function [5] and tens of matching fields (up to 41 in OpenFlow 1.4 [27]).

Decision Tree Algorithms.
The rules are viewed as hyper-cubes and packets as points in a multi-dimensional space. The axes of the rule space represent different fields and hold non-negative integers. The source code of NuevoMatch is available in [31].

Figure 2: Packet classification with two fields: IP address and port.
A recursive partitioning technique divides the rule space into subsets with at most binth rules. Thus, to match a rule, a tree traversal finds the smallest subset for a given packet, while a secondary search scans the subset's rules to select the best match.

Unfortunately, a rule replication problem may hinder performance on larger rule-sets: when a rule spans several subspaces, it is duplicated, dramatically increasing the tree's memory footprint. Early works such as HiCuts [8] and HyperCuts [34] both suffer from this issue. More recent EffiCuts [41] and CutSplit [21] suggest splitting the rule set into groups of rules that share similar properties and generating a separate decision-tree for each. NeuroCuts [22], the most recent work in this domain, uses reinforcement learning to optimize decision tree parameters, reducing the memory footprint or the number of memory accesses during traversal by efficiently exploring a large tree configuration space.
Hash-Based Algorithms.
Tuple Space Search [36] and the recent TupleMerge [3] partition the rule-set into subsets according to the number of prefix bits in each field. As all rules of a subset have the same number of prefix bits, they can act as keys in a hash table. Classification is performed by extracting the prefix bits of an incoming packet in all fields and checking all hash-tables for matching candidates. A secondary search eliminates false-positive results and selects the rule with the highest priority.

Hash-based techniques are effective in an online classification setting with frequent rule updates, whereas decision trees are not. However, decision trees have traditionally been considered faster in classification. Nevertheless, the recent TupleMerge hash-based algorithm closes the gap and achieves high classification throughput while supporting high-performance updates.
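The tuple-space idea above can be sketched in a few lines. This is a minimal illustration, not the TupleMerge implementation: the rule format is hypothetical, and all fields are treated as 32-bit values masked by their prefix length.

```python
from collections import defaultdict

def mask(value, prefix_len, width=32):
    # Keep only the prefix_len most significant bits of a 32-bit field.
    return value >> (width - prefix_len) if prefix_len else 0

def build_tables(rules):
    # rules: list of (priority, [(field_value, prefix_len), ...]).
    # Rules with identical per-field prefix lengths share one hash table.
    tables = defaultdict(dict)
    for prio, fields in rules:
        lengths = tuple(pl for _, pl in fields)
        key = tuple(mask(v, pl) for v, pl in fields)
        existing = tables[lengths].get(key)
        if existing is None or prio > existing[0]:
            tables[lengths][key] = (prio, fields)
    return tables

def classify(tables, packet):
    # Probe every tuple (hash table); keep the highest-priority hit.
    best = None
    for lengths, table in tables.items():
        key = tuple(mask(v, pl) for v, pl in zip(packet, lengths))
        hit = table.get(key)
        if hit and (best is None or hit[0] > best[0]):
            best = hit
    return best
```

Note that the number of probes grows with the number of distinct prefix-length combinations, which is exactly the scaling problem discussed next.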
The packet classification performance of all the existing techniques does not scale well with the number of rules. This happens because their indexing structures spill out of the fast L1/L2 CPU caches into L3 or DRAM. Indeed, as we show in our experiments (§5), TupleMerge and NeuroCuts exceed the 1MB L2 cache with 100K rules, and CutSplit with 500K rules. However, keeping the entire indexing structure in fast caches is critical for performance. The inherent lack of access locality in hash and tree data structures, combined with the data-dependent nature of the accesses, makes hardware prefetchers ineffective at hiding memory access latency. Thus, the performance of all lookups drops dramatically.
Figure 3: RMI model structure and inference [18].
The performance drop is significant even when the data structures fit in the L3 cache. This cache is shared among all the cores, whereas the L1 and L2 caches are per-core. Thus, L3 is not only slower (up to 90 cycles in recent x86 CPUs), but also suffers from cache contention, e.g., when another core runs a cache-demanding workload and causes cache thrashing. We observe the effects of L3 contention in §5.2.1.

NuevoMatch aims to provide a more space-efficient representation of the rule index to scale to large rule-sets.
We first explain the RMI model for learned indexes, which we use as the basis, explain its limitations, and then show our solution that overcomes them.
Kraska et al. [18] suggest using machine-learning models for storing key-value pairs instead of conventional data structures such as B-trees or hash tables. The values are stored in a value array, and a
Recursive Model Index (RMI) is used to retrieve the value given a key. Specifically, RMI predicts the index of the corresponding value in the value array using a model that learned the underlying key-index mapping function.

The main insight is that any index structure can be expressed as a continuous monotonically increasing function y = h(x) : [0, 1] ↦ [0, 1], where x is a key scaled down uniformly into [0, 1], and y is the index of the respective value in the value array, scaled down uniformly into [0, 1]. RMI is trained to learn h(x). The resulting learned index model ĥ(x) performs lookups in two phases: first it computes the predicted index ŷ = ĥ(key), and then it performs a secondary search in the array, in the vicinity ϵ of the predicted index, where ϵ is the maximum index prediction error of the model, namely |ĥ(key) − h(key)| ≤ ϵ.

Model structure.
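The two-phase lookup contract can be sketched as follows (a minimal illustration, assuming the model maps keys to a position in [0, 1) and the key array is sorted):

```python
import bisect

def lookup(model, keys, values, eps, key):
    # Phase 1: predict the index from the model's [0, 1) output.
    n = len(keys)
    pred = min(int(model(key) * n), n - 1)
    # Phase 2: secondary search, restricted to the +/- eps window.
    lo, hi = max(0, pred - eps), min(n, pred + eps + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < hi and keys[i] == key:
        return values[i]
    return None  # key not in the index
```

The correctness of the window restriction rests entirely on ϵ being a valid bound for every trained key, which is why RMI must compute ϵ exhaustively.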
RMI is a hierarchical model made of several (n) stages (Figure 3). Each stage i includes W_i submodels m_{i,j}, j < W_i, where W_i is the stage width. The first stage has a single submodel. Each successive stage has a larger width. The submodels in each stage are trained on a progressively smaller subset of the input keys, refining the index prediction toward the submodels in the leaves. Thus, each key-index pair is learned by one submodel at each stage, with the leaf submodel producing the index prediction.

RMI is a generic structure; a variety of machine learning models or data structures can be used as submodels, such as regression models or B-trees. The type of the submodels, the number of stages, and the width of each stage are configured prior to training.

Training.
Training is performed stage by stage.
First stage.
The submodel in stage 0, m_{0,0}, is trained on the whole data set. Then, the input key-index pairs are split into W_1 disjoint subsets. The input partitioning is performed as follows. For each input key-index pair {key : idx} we compute the submodel prediction ĵ = m_{0,0}(key), satisfying ĵ ∈ [0, 1). The output ĵ is used to obtain j = ⌊ĵ · W_1⌋, which is the index of the submodel in stage 1, m_{1,j}, to be used for learning {key : idx}. We call the subset of the input to be learned by model m_{i,j} the model input responsibility domain R_{i,j}, or responsibility for short. R_{0,0} is the whole input.

Internal stages.
The submodels in stage i, m_{i,j}, are trained on the keys in R_{i,j} (j < W_i). After training, the responsibilities of the submodels in stage i + 1 are computed.

Last stage.
The submodels of the last stage must predict the actual index of the matching value in the value array. However, a submodel may have a prediction error. Therefore, RMI uses the model prediction as a hint. The matching value is found by searching in the value array in the vicinity of the predicted index, as defined by the maximum error bound ϵ of the model. Note that ϵ must be valid for all input key-index pairs. To compute ϵ, RMI exhaustively computes the submodel prediction for each input key in its responsibility. Submodels with a high error bound are retrained or converted to B-trees.

Inference.
Given a key, we iteratively evaluate each submodel stage after stage, starting from m_{0,0}. We use the prediction in stage i − 1 to select the submodel in stage i, until we reach the last stage. The last selected submodel predicts the index in the value array. This index î determines the range for the secondary search in the value array, which spans [î − ϵ, î + ϵ].

Direct application of RMI to indexing packet classification rules is not possible for the following reasons:
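The stage-by-stage inference above can be sketched as follows (a minimal illustration; each submodel is assumed to be a callable mapping a key to [0, 1)):

```python
def rmi_predict(stages, key):
    # stages: list of stages, each a list of submodels. Internal
    # stages select the submodel of the next stage; the leaf output
    # is the (scaled) index prediction for the secondary search.
    j = 0
    for i, stage in enumerate(stages):
        out = stage[j](key)
        if i + 1 < len(stages):
            width = len(stages[i + 1])
            j = min(int(out * width), width - 1)  # j = floor(out * W)
    return out  # leaf prediction in [0, 1)
```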
No support for range matching. RMI allows only an exact match for a given key, whereas packet classification requires retrieving rules with matching ranges as defined by wildcards. This problem is fundamental: RMI exhaustively enumerates all the keys in all the ranges to calculate the submodel responsibility and the maximum model prediction error (see the underlined parts of the training algorithm). In other words, all the values in the range must be materialized into key-index pairs for RMI to learn them, since RMI does not guarantee correct lookup for keys not used in training [18]. The original paper sketches a few possible solutions; however, they either rely on model monotonicity (which we do not assume) or use smarter yet still expensive enumeration techniques.
Slow multi-dimensional indexing.
RMI is ineffective for multi-dimensional indexes because the proposed solution [17] leads to

The RMI paper also uses the term range index while applying RMI to range index data structures (i.e., B-trees) that can quickly retrieve all stored keys in a requested numerical range. Our work is fundamentally different: given a key, it retrieves the index of its matching range.
Figure 4: Transition inputs (t1, ..., t5) for a piece-wise linear function with an output domain of size 4.

generating an exponential number of rules in the presence of wildcards. For example, a single rule with wildcards in destination IP (0.0.0.*), port (10-100), and protocol (TCP/UDP) results in 46,592 distinct key-index pairs. Since the input domain becomes too large, it requires a large model that exceeds the CPU cache.

In the following we outline the solutions to these challenges. We first discuss Range-Query RMI (RQ-RMI), which extends RMI to perform range-value queries in a one-dimensional index where ranges do not overlap (§3.3-§3.5). We then show how to apply RQ-RMI in a multi-dimensional index space with overlaps (§3.6-§3.7).

We first seek a way to perform range matching over a set of non-overlapping ranges in one dimension. There are two basic ideas:
Sampling.
Each submodel m_{i,j} is trained by generating a uniform sample of key-index pairs from the input ranges in its responsibility. The samples are generated on-the-fly for each submodel (§3.5.4).

Analytical error bound estimation for ranges.
We eliminate RMI's requirement for exhaustive key-value enumeration during training by making the following observation: if a submodel is a piece-wise linear function, the worst-case error bound ϵ can be computed analytically, thereby enabling efficient learning of ranges.

The intuition behind this observation is illustrated in Figure 4. It shows the graph of some piece-wise linear function which represents a submodel M whose outputs are quantized into integers in [0, 4), i.e., M predicts the index in an array of size 4. We call the inputs for which this function changes its quantized output transition inputs t_i ∈ T. In turn, the transition inputs determine the regions of inputs with the same quantized output. Therefore, given an input range in the model's responsibility, to compute the model's maximum prediction error for any key in that range, it suffices to evaluate the prediction error at the transition inputs that fall in the range. We describe the training algorithm that relies on these observations in Section 3.4. We now provide a more formal description, but leave most of the proofs to the Appendix.

We choose to use a 3-layer fully-connected neural network (NN) with a single hidden layer and ReLU activation A. Such NNs were suggested in the original RMI paper [18]; however, it did not leverage their properties to accelerate error bound computations.

We denote a submodel as m_{i,j} and define it as follows.

Figure 5: The submodel training process. The additional phase for training submodels in the leaves is depicted with dashed lines.
Definition 3.1 (RQ-RMI submodel).
Denote the output of a 3-layer fully-connected neural network as:

N_{i,j}(x) = A(x · w1 + b1) × w2 + b2

where x is a scalar input, w1, b1 are the weight and bias row-vectors of layer 1 (the hidden layer), and w2, b2 are the weight column-vector and bias scalar of layer 2. Note that N_{i,j}(x) is a scalar. The ReLU activation A applies a function a to each element of an input vector:

a(x) = x if x ≥ 0, and 0 otherwise.

The submodel output, denoted M_{i,j}(x), is defined as M_{i,j}(x) = H(N_{i,j}(x)), where H(x) trims the output domain to [0, 1).

Corollary 3.2. M_{i,j}(x) is a piece-wise linear function.

We use Corollary 3.2 to compute the transition inputs and the responsibility of the submodels. We provide a simplified description; see the Appendix for the precise explanation.
RQ-RMI training is similar to RMI's. It is performed stage by stage. Figure 5 illustrates the training process for one stage. We start by training the single submodel in the first stage using the entire input domain. Next, we calculate its transition inputs (§3.5.2) and use them to find the responsibilities of the submodels in the following stage (§3.5.3). We proceed by training the submodels in the subsequent stage using designated datasets we generate based on the submodels' responsibilities (§3.5.4). We repeat this process until all submodels in all internal stages are trained. For the submodels in the leaves (last stage), there is an additional phase (dashed lines in Figure 5). After training, we calculate their error bounds and retrain the submodels that do not satisfy a predefined error threshold (§3.5.6).
Given a trained submodel m_{i,j}, we can analytically find all its linear regions and, respectively, the inputs delimiting them, which we call trigger inputs g_l. For all inputs in the region [g_l, g_{l+1}], the model function, denoted M(x), is linear by construction. On the other hand, the uniform output quantization defines a step-wise function Q = ⌊M(x) · W⌋ / W, where W is the size of the quantized output domain (Figure 4). Thus, for each input region [g_l, g_{l+1}], the transition inputs t_l ∈ T are those where M(x) and Q intersect.

Given a trained submodel m_{i,j} in an internal stage i, we say that it maps a key to a submodel m_{i+1,k}, k < W_{i+1}, if ⌊M_{i,j}(key) · W_{i+1}⌋ = k. As discussed informally earlier, the responsibility R_{i+1,k} of m_{i+1,k} is defined as all the inputs which are mapped by submodels in stage i to m_{i+1,k}. In other words, the trained submodels at stage i define the responsibility of the untrained submodels at stage i + 1. We compute R_{i+1,k} using the transition inputs of m_{i,j}. In the following, we assume for clarity that R_{i,j} is contiguous and that m_{i,j} is the only submodel at stage i.

We compute R_{i+1,k} by observing that it is composed of all the inputs in the regions (t_l, t_{l+1}) that map to submodel m_{i+1,k}, where t_l ∈ T_{i,j} are transition inputs of m_{i,j}. By construction, the inputs in the region between two adjacent transition points map to the same output. Thus, it suffices to compute the output of m_{i,j} at its transition points and choose the respective input ranges that are mapped to m_{i+1,k}.

Up to this point, we used only key-index pairs as model inputs. Now we focus on training on input ranges. A range can be represented as all the keys that fall into the range, all associated with the same index of the respective rule. For example, 10.1.1.0-10.1.1.255 includes 256 keys. Our goal is to train a model such that, given a key in the range, the model predicts the correct index.
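The analytical transition-input computation can be sketched for a submodel of the form in Definition 3.1 (a simplified, pure-Python illustration; the function and parameter names are ours):

```python
def transition_inputs(w1, b1, w2, b2, W, lo=0.0, hi=1.0):
    # M(x) = H(N(x)) with N(x) = sum_i w2[i]*relu(w1[i]*x + b1[i]) + b2,
    # quantized into W output bins over the input domain [lo, hi].
    def M(x):
        n = sum(w2i * max(w1i * x + b1i, 0.0)
                for w1i, b1i, w2i in zip(w1, b1, w2)) + b2
        return min(max(n, 0.0), 1.0 - 1e-12)  # H: trim to [0, 1)

    # Trigger inputs: ReLU kinks delimiting the linear regions.
    kinks = [-b / w for w, b in zip(w1, b1) if w != 0 and lo < -b / w < hi]
    grid = sorted([lo, hi] + kinks)
    trans = []
    for a, b in zip(grid, grid[1:]):       # one linear region per pair
        ya, yb = M(a), M(b)
        if yb == ya:
            continue                       # constant region: no transitions
        slope = (yb - ya) / (b - a)
        qa, qb = int(ya * W), int(yb * W)
        # The quantized output changes where M(x)*W crosses an integer.
        for k in range(min(qa, qb) + 1, max(qa, qb) + 1):
            trans.append(a + (k / W - ya) / slope)
    return sorted(trans)
```

Because the number of kinks equals the hidden-layer width, the whole computation is linear in the model size rather than in the size of the key domain.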
Enumerating all the keys in the ranges is inefficient. Instead, we use sampling as follows. We generate the training key-index pairs by uniformly sampling the submodel's responsibility. We start with a low sampling frequency. A sample is included in the training set if there is an input rule range that matches the sampled key. Thus, the number of samples per input range is proportional to its relative size in the submodel's responsibility. Note that some input ranges (or individual keys) might not be sampled at all. Nevertheless, they will be matched correctly, as we explain further.
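The sampling step can be sketched as follows (a minimal illustration with a hypothetical range format of non-overlapping (lo, hi, index) tuples):

```python
def sample_dataset(ranges, resp_lo, resp_hi, n_samples):
    # Uniformly sample the responsibility [resp_lo, resp_hi). A sample
    # is kept only if some rule range covers it, so larger ranges
    # contribute proportionally more key-index pairs.
    step = (resp_hi - resp_lo) / n_samples
    pairs = []
    for i in range(n_samples):
        key = resp_lo + i * step
        for lo, hi, idx in ranges:
            if lo <= key <= hi:
                pairs.append((key, idx))
                break
    return pairs
```

Doubling n_samples (as the retraining loop does) simply regenerates a denser version of the same dataset.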
We train submodels on the generated datasets using supervised learning with the Adam optimizer [14] and a mean squared error loss function.
Given a trained submodel in the last stage, we compute the prediction error bound for all inputs in its responsibility by evaluating the submodel on its transition inputs. The prediction error is thus computed also for inputs that were not necessarily sampled, guaranteeing match correctness. If the error is too large, we double the number of samples, regenerate the key-index pairs, and retrain the submodel. Training continues until the target error bound is attained or a predefined number of attempts is exhausted. If training does not converge, the target error bound may be increased by the operator. The error bound determines the search distance of the secondary search; hence a larger bound lowers system performance. We evaluate this tradeoff later (§5.3.4).
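The retraining loop described above can be sketched with two pluggable callbacks (the names and the default constants are ours): train_fn(s) trains a leaf submodel on s samples, and error_bound(m) evaluates m at its transition inputs to get the worst-case prediction error.

```python
def train_leaf(train_fn, error_bound, threshold, s0=64, max_attempts=5):
    # Double the sample count until the analytical error bound meets
    # the threshold or the attempt budget is exhausted.
    s = s0
    for _ in range(max_attempts):
        model = train_fn(s)
        eps = error_bound(model)
        if eps <= threshold:
            break
        s *= 2  # regenerate a denser dataset and retrain
    return model, eps
```

If the loop returns with eps above the threshold, the operator can accept the larger bound at the cost of a wider secondary search.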
Figure 6: The rules from Figure 2 are split into two iSets: three rules (by port) and two rules (by IP address).

NuevoMatch supports overlapping ranges and matching over multiple dimensions, i.e., packet fields, by combining two simple ideas: partitioning the rule-set into disjoint independent sets (iSets), and performing multi-field validation of each rule. In the following, we use the terms dimension and field interchangeably.
Partitioning.
Each iSet contains rules that do not overlap in one specific dimension. We refer to the coverage of an iSet as the fraction of the rules it holds out of those in the input. One iSet may cover all the rules if they do not overlap in at least one dimension, whereas a dimension with many overlapping ranges may require multiple iSets. Figure 6 shows the iSets for the rules from Figure 2.

Each iSet is indexed by one RQ-RMI. Thus, to find the match to a query with multiple fields, we query all RQ-RMIs (in parallel), each over the field on which it was trained. Then, the highest-priority result is selected as the output.

Each iSet adds to the total memory consumption and computational requirements of NuevoMatch. Therefore, we introduce a heuristic that strives to find the smallest number of iSets that cover the largest part of the rule-set (§3.6.1).
Multi-field validation.
Since an RQ-RMI builds an index of the rules over a single field, it might retrieve a rule that does not match the other fields. Hence, each rule returned by an RQ-RMI is validated across all fields. This enables NuevoMatch to avoid indexing all dimensions, yet obtain correct results.
We introduce a greedy heuristic that repetitively constructs the largest iSet from the input rules, producing a group of iSets. To find the largest iSet over one dimension, we use a classical interval scheduling maximization algorithm [15]. The algorithm sorts the ranges by their upper bounds, and repetitively picks the range with the smallest upper bound that does not overlap the previously selected ranges.

We apply the algorithm to find the largest iSet in each field. Then we greedily choose the largest iSet among all the fields and remove its rules from the input set. We continue until the input is exhausted. This heuristic is sub-optimal but quite efficient. We plan to improve it in future work.

Having a larger number of fields in a rule-set might help improve coverage. For example, if the rules that overlap in one field do not overlap in another and vice versa, two iSets cover the whole rule-set, whereas each field in isolation would require more iSets.
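The heuristic above can be sketched as follows. This is a simplified illustration with a hypothetical rule format (a dict with a 'ranges' map from field name to an inclusive (lo, hi) pair); the min_fraction cutoff for discarding small iSets is an assumption of ours.

```python
def largest_iset(rules, field):
    # Interval scheduling maximization: the largest subset of rules
    # whose ranges in `field` are pairwise non-overlapping.
    chosen, last_hi = [], None
    for rule in sorted(rules, key=lambda r: r['ranges'][field][1]):
        lo, hi = rule['ranges'][field]
        if last_hi is None or lo > last_hi:
            chosen.append(rule)
            last_hi = hi
    return chosen

def partition(rules, fields, min_fraction=0.05):
    # Repeatedly carve out the largest iSet over any field; iSets
    # smaller than min_fraction of the rule-set go to the remainder.
    isets, remaining = [], list(rules)
    while remaining:
        best = max((largest_iset(remaining, f) for f in fields), key=len)
        if len(best) < min_fraction * len(rules):
            break  # too small: leave the rest to the remainder classifier
        isets.append(best)
        picked = {id(r) for r in best}
        remaining = [r for r in remaining if id(r) not in picked]
    return isets, remaining
```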
Real-world rule-sets may require many iSets for full coverage, with a single rule per iSet in the extreme cases. Using separate RQ-RMIs for such iSets would hinder performance. Therefore, we merge small iSets into a single remainder set. The rules in the remainder set are indexed using an external classifier. Each query is performed on both the RQ-RMIs and the external classifier.

In essence, NuevoMatch serves as an accelerator for the external classifier. Indeed, if a rule-set is covered by a few large iSets, the external classifier needs to index only a small remainder set that often fits into faster memory, so it can be very fast.

Two primary factors determine the end-to-end performance: (1) the number of iSets required for high coverage (depends on the rule-set), and (2) the number of iSets for achieving high performance (set by an operator).

Our evaluation (§5.3.1) shows that most of the evaluated rule-sets can be covered with high coverage, above 90%, with only 2-3 iSets. This is enough to accelerate the external classifier, as is evident from the performance results. On the other hand, the choice of the number of iSets depends on the external classifier's properties, in particular its sensitivity to memory footprint. We analyze this tradeoff in §5.3.
Worst-case inputs.
Some rule-sets cannot achieve good coverage with only a few iSets. For example, a rule-set with a single field whose ranges overlap requires too many iSets to be covered. To obtain better intuition about the origins of worst-case inputs, we consider the notion of rule-set diversity for rule-sets with exact matches. Rule-set diversity in a field is the number of unique values in it across the rule-set, divided by the total number of rules.
The rule-set diversity is an upper bound on the fraction of rules in the largest iSet of that field. In other words, low diversity implies that using the field for iSet partitioning would result in poor coverage.
We can also identify challenging rule-sets with ranges. We define rule-set centrality as the maximal number of rules such that each pair of them overlap (i.e., they all share a point in the multi-dimensional space).
The rule-set centrality is a lower bound on the number of iSets required for full coverage. The diversity and centrality metrics can indicate the potential of NuevoMatch to accelerate the classification of a rule-set. On the positive side, our iSet partitioning algorithm is effective at segregating the rules that cannot be covered well from the rules that can, thereby accelerating the remainder classifier as much as possible for a given rule-set. We analyze this property in §5.3.3.
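The two metrics can be computed directly. The following sketch (illustrative, not the paper's code) computes diversity for an exact-match field and, for the single-field range case, centrality as the maximum number of ranges sharing a point:

```python
def rule_set_diversity(rules, field):
    """Number of unique exact-match values in a field divided by the number
    of rules; an upper bound on the fraction of rules in that field's
    largest iSet."""
    values = [r[field] for r in rules]
    return len(set(values)) / len(values)

def interval_centrality(ranges):
    """Maximum number of closed integer ranges sharing a single point,
    computed by a sweep over endpoints; a lower bound on the number of
    iSets needed for full coverage of a single-field rule-set."""
    events = []
    for lo, hi in ranges:
        events.append((lo, 1))        # range opens
        events.append((hi + 1, -1))   # range closes just past hi
    depth = best = 0
    for _, delta in sorted(events):
        depth += delta
        best = max(best, depth)
    return best
```

For example, three identical ranges have centrality 3, so at least three iSets are needed to cover them.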
We briefly summarize all the steps of NuevoMatch.
Training: (1) partition the input into iSets and a remainder set; (2) train one RQ-RMI on each iSet; (3) construct an external classifier for the remainder set.
Lookup: (1) query all the RQ-RMIs; (2) query the external classifier; (3) collect all the outputs and return the highest-priority rule.
Figure 7: Impact of updates on throughput over time. An upper bound (in green) is for zero training time.
We explain how NuevoMatch can support updates with limited performance degradation. First, the external classifier used for the remainder must support updates. Among the evaluated external classifiers, only TupleMerge is designed for fast updates. Second, we distinguish four types of updates: (i) a change in the rule action; (ii) rule deletion; (iii) a change in the rule matching set; (iv) rule addition. The first two types are supported without performance degradation and require a lookup followed by an update in the value array. However, if an update modifies a rule's matching set or adds a new rule, it might require modifications to the RQ-RMI model. We currently do not know an algorithmic way to update an RQ-RMI without retraining; therefore, an updated rule is always added to the remainder set. Unfortunately, this design leads to gradual performance degradation, as updates are likely to increase the remainder set. Accordingly, the model is retrained on the updated rule-set, either periodically or when a large performance degradation is detected. Updates occurring while retraining are accommodated in the following batch of updates.
Estimating sustained update rate.
Let r and u be the total number of rules and the number of updates that move a rule to the remainder, respectively; u can be smaller than the real rate of rule updates. We assume that the updates are independent and uniformly distributed among the r rules. For each update, a given rule is modified w.p. (with probability) 1/r. Thus, a rule is not modified in any of the updates w.p. (1 − 1/r)^u ≈ e^(−u/r). The expected number of unmodified rules is r · (1 − 1/r)^u ≈ r · e^(−u/r). Throughput behaves as a weighted average between that of NuevoMatch and the remainder implementation, based on the number of rules in each.
Figure 7 illustrates the throughput over time for different retraining rates given a certain update rate. If retraining is invoked every τ time units, the slower the training process, the worse the performance degradation.
With these update estimates, using the measured speedup as a function of the fraction of the remainder (§5.3.3), NuevoMatch can sustain up to 4K updates per second for 500K rule-sets, yielding about half the speedup of the update-free case, assuming minute-long training. These results indicate the need to speed up training, but we conjecture there might be a more efficient way to perform updates directly in RQ-RMI without complete retraining of all submodels. Accelerating updates is left for future work.
RQ-RMI structure.
The number of stages and the width of each stage depend on the number of rules to index. We increase the width of the last stage from 16 for 10K rules to as much as 512 for 500K. See Table 4 in the Appendix.
Submodel structure.
Each submodel is a fully connected 3-layer neural net with 1 input, 1 output, and 8 neurons in the hidden layer with ReLU activation. This structure affords an efficient vectorized implementation (see below).
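Under the structure described above, a submodel's forward pass is just two small products. A minimal NumPy sketch (with illustrative weight names, not actual trained parameters):

```python
import numpy as np

def submodel_predict(x, w1, b1, w2, b2):
    """Forward pass of a 1 -> 8 (ReLU) -> 1 fully connected submodel.
    With 8 hidden neurons, each layer maps onto a handful of 8-wide
    vector operations (e.g., a single AVX register per layer)."""
    h = np.maximum(0.0, x * w1 + b1)   # hidden layer, shape (8,)
    return float(h @ w2 + b2)          # scalar prediction
```

With identity-like weights (w1 = 1, b1 = 0, w2 = 1/8, b2 = 0), the submodel reproduces its non-negative input, which is a convenient sanity check.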
Training.
We use TensorFlow [1] to train each submodel on a CPU. Training a submodel requires a few seconds, but a whole RQ-RMI may take up to a few minutes (see §5.3.4). We believe, however, that a much faster training time could be achieved with more optimizations, e.g., by replacing TensorFlow (known for its poor performance on small models). We leave this for future work.
iSet partitioning.
We implement the iSet partitioning algorithm in Python. The partitioning takes at most a few seconds and is negligible compared to the RQ-RMI training time.
Inference and secondary search.
We implement RQ-RMI inference in C++. For each iSet, we sort the rules by the value of the respective field to optimize the secondary search. To reduce the number of memory accesses, we pack multiple field values from different rules into the same cache line.
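The secondary search can be sketched as a binary search restricted to the error-bound window around the predicted index. This is a simplified illustration (names are ours), assuming the iSet's rules are sorted by the lower bound of the indexed field:

```python
import bisect

def secondary_search(sorted_los, predicted, err, key):
    """Binary search limited to [predicted - err, predicted + err] over the
    rules' sorted lower bounds. Returns the index of the last rule whose
    lower bound <= key (the candidate whose range may contain the key),
    or None if no candidate lies in the window."""
    lo = max(0, predicted - err)
    hi = min(len(sorted_los), predicted + err + 1)
    i = bisect.bisect_right(sorted_los, key, lo, hi) - 1
    return i if i >= lo else None
```

The returned candidate must still be validated against all the rule's fields, as described in the validation phase.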
Handling long fields.
Both the iSet partitioning algorithm and the RQ-RMI models map the inputs into single-precision floating-point numbers. This allows packing more scalars into vector operations, resulting in faster inference. While sufficient for 32-bit fields, doing so might cause poor performance for 64-bit and 128-bit fields. We compared two solutions: (1) splitting a long field into 32-bit parts and treating each as a distinct field, and (2) using a single-precision floating-point number to express the long field. The two showed similar results for iSet partitioning with MAC addresses, while with IPv6, splitting into multiple fields worked better. Note that both the secondary search and the validation phases are not affected, because the rules are stored with the original fields.
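Splitting a long field into 32-bit parts (option 1 above) can be sketched as:

```python
def split_long_field(value, bits):
    """Split a long field (e.g., a 128-bit IPv6 address) into 32-bit parts,
    most significant first; each part then fits single-precision-friendly
    arithmetic and is treated as a distinct field."""
    return [(value >> shift) & 0xFFFFFFFF
            for shift in range(bits - 32, -1, -32)]
```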
Vectorization.
We accelerate the inference by using wide CPU vector instructions. Specifically, with 8 neurons in the hidden layer of each submodel, computing the prediction involves a handful of vector instructions. Validation is also vectorized. Table 1 shows the effectiveness of vectorization. The use of wider units speeds up inference, highlighting the potential for scaling NuevoMatch in future CPUs.
Parallelization.
NuevoMatch lends itself to parallel execution, where the iSets and the remainder classifier run in parallel on different CPU cores. The system receives the packets and enqueues each for execution into the worker threads. The threads are statically allocated to run the RQ-RMIs or the external classifier with a balanced load between the cores. Note that since RQ-RMIs are small and fit in L1, running them on a separate core enables L1-cache-resident execution even if the remainder classifier is large. Such efficient cache utilization could not have been achieved with other classifiers running on two cores.

Table 1: Submodel acceleration via vectorization. Methods are annotated with the number of floats per single instruction.

Instruction set (width)   Serial (1)   SSE (4)   AVX (8)
Inference time (ns)       126          62        49
Early termination.
One drawback of the parallel implementation is that the slowest thread determines the execution time. Our experiments show that the remainder classifier is the slowest one: it holds only a small fraction of the rules, so it returns an empty set for most of the queries, which in turn leads to the worst-case lookup time. In TupleMerge, for example, a query that does not find any matching rule results in a search over all the tables, whereas in the average case some tables are skipped. Instead, we query the remainder after obtaining the results from the iSets, and terminate the search when we determine that the target rule is not in the remainder. To achieve that, we make minor changes to the existing classification techniques. Specifically, in decision-tree algorithms, we store in each node the maximum priority of all the sub-tree's rules. Whenever we encounter a maximum priority that is lower than the one found in the iSets, we terminate the tree-walk. The changes to the hash-based algorithms are similar. We call this optimization early termination. With this optimization, both the iSets and the remainder are queried on the same core. While a parallel implementation is possible, it incurs higher synchronization overheads among threads.
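The decision-tree variant of early termination can be sketched as follows; the Node layout and the per-node pick function are illustrative, not the actual classifier code:

```python
class Node:
    def __init__(self, max_priority, rules=None, children=None, pick=None):
        self.max_priority = max_priority  # max priority over the subtree's rules
        self.rules = rules or []          # leaf: list of (priority, match_fn)
        self.children = children or []    # internal: child subtrees
        self.pick = pick                  # internal: packet -> child index

def tree_lookup(root, packet, best_priority):
    """Tree-walk with early termination: a subtree whose maximum priority
    cannot beat the best match already found in the iSets is skipped."""
    node = root
    while node.children:
        if node.max_priority <= best_priority:
            return None                   # early termination
        node = node.children[node.pick(packet)]
    best = None
    for prio, match in node.rules:
        if prio > best_priority and match(packet):
            best_priority, best = prio, (prio, match)
    return best
```

If the iSets already produced a match with priority 7 and the remainder subtree's maximum is 5, the walk stops immediately.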
In the evaluation, we pursue the following goals: (1) comparison of NuevoMatch with the state-of-the-art algorithms TupleMerge [3], CutSplit [21], and NeuroCuts [22]; (2) systematic analysis of the performance characteristics, including coverage in challenging data sets, the effect of the RQ-RMI error bound, and training time.
We ran the experiments on an Intel Xeon Silver 4116 @ 2.1 GHz with 12 cores, 32KB L1, 1024KB L2, and 16MB L3 caches, running Ubuntu 16.04 (Linux kernel 4.4.0). We disabled power management for stable measurements.
Evaluated configurations.
CutSplit (cs) is set with binth = 8, as suggested in [21].
For NeuroCuts (nc), we performed a hyperparameter sweep and selected the best classifier per rule-set. As recommended in [22], we focused on top-node partitioning and reward scaling. We ran the search on three 12-core Intel machines, allocating six hours per configuration to converge. In total, we ran nc training for up to 36 hours per rule-set. In addition, we developed a C++ implementation of nc for faster evaluation of the generated classifiers, much faster than the authors' Python-based prototype.
TupleMerge (tm) is used in the version that supports updates, with collision-limit = 40, as suggested in [3].
NuevoMatch (nm) was trained with a maximum error threshold of 64. We present the analysis of the sensitivity to the chosen parameters and training times in §5.3.2.
Multi-core implementation.
We run the parallel implementation on two cores. NuevoMatch allocates one core for the remainder computations and the second for the RQ-RMIs. For cs, nc, and tm, we ran two instances of the algorithm in parallel on two cores using two threads (i.e., no duplication of the rules), splitting the input equally between the cores. We discarded iSets with coverage below 25% for comparisons against cs and nc, and below 5% for comparisons against tm. We used batches of 128 packets to amortize the synchronization overheads. Thus, these algorithms achieve almost linear scaling and the highest possible throughput with perfect load-balancing between the cores.
Single-core implementation.
We used a single core to measure the performance of NuevoMatch with the early termination optimization. For nm, we discarded iSets with coverage below 25%.
For evaluating each classifier, we generated traces with 700K packets. We processed each trace 6 times, using the first five as warmup and measuring the last. We report the average of 15 measurements.
Uniform traffic.
We generate traces that access all matching rules uniformly to evaluate the worst-case memory access pattern.
Skewed traffic.
For each rule-set, we generate traces that follow a Zipf distribution with four different skew parameters, characterized by the amount of traffic that the 3% most frequent flows account for (e.g., the 3% most frequent flows account for 80% of the traffic). This is representative of real traffic, as has been shown in previous works [13, 33]. Additionally, we use a real CAIDA trace from the Equinix datacenter in Chicago [2]. As CAIDA does not publish the rules used to process the packets, we modify the packet headers in the trace to match each evaluated rule-set as follows. For each rule, we generate one matching five-tuple. Then, for each packet in CAIDA, we replace the original five-tuple with a random five-tuple generated from the rule-set, while maintaining a consistent mapping between the original and the generated one. Note that the rule-set access locality of the generated trace is the same as, or higher than, that of the original trace.
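A skewed trace of the kind described above can be sketched as follows; this is an illustrative generator (rank-frequency p(k) ∝ 1/k^alpha), not the one used in the evaluation:

```python
import random

def zipf_trace(num_rules, length, alpha, seed=0):
    """Generate a trace of rule indices whose access frequencies follow a
    Zipf distribution with skew parameter alpha; rank 0 is the most
    frequent rule. Deterministic for a given seed."""
    rng = random.Random(seed)
    weights = [1.0 / (k ** alpha) for k in range(1, num_rules + 1)]
    return rng.choices(range(num_rules), weights=weights, k=length)
```

Larger alpha concentrates more of the trace on the top-ranked rules, which is how the "80% of traffic in 3% of flows" style of skew arises.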
ClassBench rules.
ClassBench [39] is a standard benchmark broadly used for evaluating packet classification algorithms [3, 16, 21, 22, 28, 41, 44]. It creates rule-sets that correspond to the rule distribution of three different applications: Access Control List (ACL), Firewall (FW), and IP Chain (IPC). We created rule-sets of sizes 500K, 100K, 10K, and 1K, each with 12 distinct applications, all with 5-field rules: source and destination IP, source and destination port, and protocol.

Figure 8: ClassBench: NuevoMatch vs. CutSplit, NeuroCuts, and TupleMerge, using two CPU cores. (See rule-sets in the Appendix.)

Real-world rules.
We used the Stanford Backbone dataset, which contains a large enterprise network configuration [46]. There are four IP forwarding rule-sets with roughly 180K single-field rules each (i.e., destination IP address).
For a fair comparison, NuevoMatch used the same algorithm for both the remainder classifier and the baseline. For example, we evaluated the speedup produced by NuevoMatch over cs while also using cs to index the remainder set. We present the results for random packet traces, followed by skewed and CAIDA traces.
Large rule-sets: ClassBench: multi-core.
Figure 8 shows that, on the largest rule-sets (500K), the parallel implementation of NuevoMatch achieves a geometric mean factor of 2.7×, 4.4×, and 2.6× lower latency and 1.3×, 2.2×, and 1.2× higher throughput over cs, nc, and tm, respectively. For the classifiers with 100K rules, the gains are lower but still significant: 2.0×, 3.6×, and 2.6× lower latency and 1.0×, 1.7×, and 1.2× higher throughput over cs, nc, and tm, respectively. The performance varies among rule-sets, e.g., some classifiers are up to 1.8× faster than cs for 100K inputs.
Large rule-sets: ClassBench: single core.
Figure 9 shows the throughput speedup of nm compared to cs, nc, and tm. For 500K rule-sets, NuevoMatch achieves a geometric mean improvement of 2.4×, 2.6×, and 1.6× in throughput compared to cs, nc, and tm, respectively. For the single-core execution, the latency and throughput speedups are the same.
Large rule-sets: Stanford backbone: multi-core.
Figure 10 shows the speedup of nm over tm for the real-world Stanford backbone dataset with 4 rule-sets. nm achieves over 3× higher throughput and over 7× lower latency over tm on all four rule-sets.
Small rule-sets: multi-core.
For rule-sets with 1K and 10K rules, NuevoMatch results in the same or lower throughput, and 2.2× and 1.9× better latency on average compared to cs and tm. The lower speedup is expected, as both cs and tm fit into L1 (§5.2.1), so nm does not benefit from a reduced memory footprint while adding computational overheads. See the Appendix for the detailed chart.

Figure 10: End-to-end performance on real Stanford backbone data sets.
The cs results are averaged over three rule-sets of 1K and six rule-sets of 10K. In the remaining rule-sets, NuevoMatch did not produce large-enough iSets to accelerate the remainder. Note, however, that it promptly identifies the rule-sets expected to be slow and falls back to the original classifier.
The source of speedups.
The ability to compress the rule-set to fit into faster memory while retaining fast lookup is the key factor underlying the performance benefits of NuevoMatch. To illustrate it, we take a closer look at the performance. We evaluate tm with and without nm acceleration as a function of its memory footprint on ClassBench-generated 1K, 10K, 100K, and 500K rule-sets for one application (ACL). Figure 11 shows that the performance of tm degrades as the number of rules grows, causing the hash tables to spill out of the L1 and L2 caches. nm compresses a large part of the rule-set (see coverage annotations), thereby making the remainder index small enough to fit in the L1 cache, and gaining back throughput equivalent to tm's on small rule-sets.
ClassBench: Skewed traffic.
Figure 12 shows the evaluation of the early termination implementation on skewed packet traces. We report the throughput speedup of nm compared to cs and tm; the results for nc are similar to those of cs. We perform 6000 experiments using 25 traces per rule-set: five traces per Zipf distribution plus five modified CAIDA traces. We evaluate over twelve 500K rule-sets and report the geometric mean. Additionally, we evaluate the CAIDA traces in two settings. First, the classifier runs with access to the entire 16MB of the L3 cache (denoted as CAIDA). Second, the classifier's use of L3 is restricted to 1.5MB via Intel's Cache Allocation Technology, emulating a multi-tenant setting (denoted as CAIDA*).

Figure 9: ClassBench: NuevoMatch vs. CutSplit, NeuroCuts, and TupleMerge, using a single CPU core.

Figure 11: Throughput vs. number of rules for TupleMerge and NuevoMatch. Annotations are coverage (%) and index memory size in KB (remainder : total).

Figure 12: ClassBench: NuevoMatch vs. CutSplit and TupleMerge with skewed traffic.

NuevoMatch is significantly faster than cs, but its benefits over tm diminish for workloads with higher skews. Yet, the speedups are more pronounced under the smaller L3 allocation. Overall, we observe lower speedups for the skewed traffic than for the random trace. This is not surprising, as skewed traces induce a higher cache hit rate for all the methods, which in turn reduces the performance gains of nm over both cs and tm, similar to the case of small rule-sets. Nevertheless, it is worth noting that classification algorithms are usually applied alongside caching mechanisms that capture the packets' temporal locality. For instance, Open vSwitch applies caching for the most frequently used rules and invokes Tuple Space Search upon cache misses [30]. Therefore, if NuevoMatch is applied at this stage, we expect it to yield performance gains equivalent to those reported for unskewed workloads. Open vSwitch integration is the goal of our ongoing work.

Figure 13 compares the memory footprint of the classifiers without and with NuevoMatch (the two right-most bars in each bar cluster). We use the same number of iSets as in the end-to-end experiments. Note that a smaller footprint alone does not necessarily lead to higher performance if more iSets are used. Therefore, the results should be considered in conjunction with the end-to-end performance. The memory footprint includes only the index data structures but not the rules themselves. In particular, the memory footprint for NuevoMatch includes both the RQ-RMI models and the remainder classifier. Each bar is the average of all the 12 application rule-sets of the same size. For nm we show both the remainder index size (middle bar) and the total RQ-RMI size (right-most bar). Note that due to the logarithmic scale of the Y axis, the actual ratio between the two is much higher than it might seem. For example, the remainder for 10K tm is almost 100× the size of the RQ-RMI. Note also that since we run nm on two cores, both the RQ-RMI and the remainder classifier use their own CPU caches. Overall, NuevoMatch enables dramatic compression of the memory footprint, in particular for 500K rule-sets, with 4.9×, 8×, and 82× on average over cs, nc, and tm, respectively. The graph explains the end-to-end performance results well. For 1K rule-sets, the original classifiers fit into the L1 cache, so nm is not effective. For 10K sets, even though the remainder index fits in L1, the ratio between L1 and L2 performance is insufficient to cover the RQ-RMI overheads. For 100K, the situation is similar for cs; however, for nc, the remainder fits in L1, whereas the original nc spills to L3. For tm, the remainder is already in L2, yielding a lower overall speedup compared to nc. Last, for 500K rule-sets, all the original classifiers spill to L3, whereas the remainder fits well in L2, yielding clear performance improvements.
Performance under L3 cache contention.
The small memory footprint of nm plays an important role even when the rule index fits in the L3 cache (16MB in our machine). L3 is shared among all the CPU cores; therefore, cache contention is not rare, in particular in data centers. nm reduces the effects of L3 cache contention on packet classification performance. In this experiment, we use the 500K rule-set (1) and compare the performance of cs and nm (with cs) while limiting the L3 to 1.5MB. cs loses half of its performance, whereas nm slows down by only 30%, increasing the original speedup.
Figure 13: Memory size for CutSplit, NeuroCuts, TupleMerge vs. NuevoMatch with them indexing the remainder. Each bar is a geometric mean of 12 applications.

Table 2: iSet coverage.
Figure 14: Coverage and execution time breakdown of NuevoMatch vs. varying number of iSets.
Table 2 shows the cumulative coverage achieved with up to 4 iSets, averaged over 12 rule-sets (ClassBench) of the same size. The coverage of smaller rule-sets is worse on average but improves with the size of the rule-set. The last row shows a representative result for the Stanford backbone rule-set (the other three differ within 1%). Two iSets are enough to achieve 90% coverage, and three are needed for 95%. This data set differs from ClassBench in that it contains only one field, providing fewer opportunities for iSet partitioning.
We seek to understand the tradeoff between the iSet coverage of the rule-set and the computational overheads of adding more RQ-RMIs. All computations were performed on a single core to obtain the latency breakdown. We use cs for indexing the remainder.

Table 3: Throughput and a single iSet coverage vs. the fraction of low-diversity rules in a 500K rule-set.

% Low-diversity rules   % Coverage   Speedup (throughput)
70%                     25%          1.07×
50%                     50%          1.14×
30%                     70%          1.60×

Figure 14 shows the geometric mean of the coverage and the runtime breakdown over 12 rule-sets of 500K. The breakdown includes the runtime of the remainder classifier, validation, secondary search, and RQ-RMI inference. Zero iSets means that cs was run alone. Adding more iSets shows diminishing returns because of their compute overhead, which is not compensated by the remainder runtime improvements once the coverage saturates at almost 100%. Using one or two iSets shows the best trade-off. nc shows similar results. tm behaved differently (not shown): tm occupies much more memory than cs; therefore, using more iSets to achieve higher coverage allowed us to further speed up the remainder by fitting it into an upper-level cache. Thus, 4 iSets was the best configuration. We note that the runtime is split nearly equally between model inference and validation (which are compute-bound) and the secondary search and the remainder computations (which are memory-bound). We expect the compute performance of future processors to scale better than their cache capacity and memory access latency. Therefore, we believe nm will scale better than memory-bound state-of-the-art classifiers.
We seek to understand how low-diversity rule-sets affect NuevoMatch. To analyze that, we synthetically generated a large rule-set as a Cartesian product of a small number of values per field (no ranges). We blended them into a 500K ClassBench rule-set, replacing randomly selected rules with those from the Cartesian product, while keeping the total number of rules the same. Table 3 shows the coverage and the speedup over tm on the resulting mixed rule-sets for different fractions of low-diversity rules. The partitioning algorithm successfully segregates the low-diversity rules, achieving coverage inversely proportional to their fraction in the rule-set. Note that NuevoMatch becomes effective when it offloads the processing of about 25% of the rules.
RQ-RMIs are trained to minimize the prediction error bound in order to achieve a small secondary search distance. Recall that a secondary search involves a binary search within the error bound, where each rule is validated to match all the fields. The tradeoff between training time and secondary search performance is not trivial. A larger search distance enables faster training but slows down the secondary search. A smaller search distance results in a faster search but slows down the training. In extreme cases, the training does not converge, since a higher precision might require larger submodels. However, increasing the size of the submodels leads to a larger memory footprint and longer computations.

Figure 15: RQ-RMI training time in minutes vs. maximum search range bound.
Figure 15 shows the average end-to-end training time in minutes of 500 models as a function of the secondary search distance and the rule-set size. The measurements include all training iterations, as described in §3.5. As mentioned (§4), our training implementation can be dramatically accelerated, so the results here indicate the general trend. Training with a bound of 64 is expensive, but is it really necessary? To answer this, we evaluate the performance impact of the search distance on the secondary search time. We measure 40ns for retrieving a rule with a precise prediction (no search). For distances of 64, 128, and 256, the search time varies between 75 and 80ns thanks to the binary search. Last, it turns out that the actual search distance from the predicted index is often much smaller than the worst-case one enforced in training. Our analysis shows that, in practice, training with a relatively large bound of 128 leads to 80% of the lookups having a search distance of at most 64, and 60% at most 32. We conclude that training with larger bounds is likely to have a minor effect on the end-to-end performance while significantly accelerating training. This property is important for supporting more frequent retraining and faster updates (§3.9).
Adding fields to an existing classifier will not harm its coverage, so it will not affect the RQ-RMI performance. Nonetheless, more fields will increase the validation time. Unfortunately, we did not find public rule-sets with a large number of fields. Thus, we ran a microbenchmark, increasing the number of fields and measuring the performance of the validation stage. As expected, we observed almost linear growth in the validation time, from 25ns for one field to 180ns for 40 fields.
Hardware-based classifiers.
Hardware-based solutions for classification, such as TCAMs and FPGAs, achieve very high throughput [6, 35]. Consequently, many software algorithms take advantage of them, further improving classification performance [13, 20, 23, 24, 28, 32, 37]. Our work is complementary and can be used to improve the scaling of these solutions. For example, if the original classifier required a large TCAM, the remainder set would fit a much smaller TCAM.
GPUs for classification.
Accelerating classification on GPUs has been suggested by numerous works. PacketShader [10] uses the GPU for packet forwarding and provides integration with Open vSwitch. However, packet forwarding is a single-dimensional problem, so it is easier than multi-field classification [9]. Varvello et al. [42] implemented various packet classification algorithms on GPUs, including linear search, Tuple Space Search, and bloom search. Nonetheless, these techniques suffer from poor scalability for large classifiers with wildcard rules, which NuevoMatch aims to alleviate.
ML techniques for networking.
Recent works suggest using ML techniques for solving networking problems, such as TCP congestion control [4, 12, 45], resource management [25], quality of experience in video streaming [26, 43], routing [40], and decision-tree optimization for packet classification [22]. NuevoMatch is different in that it uses an ML technique for building space-efficient representations of the rules that fit in the CPU cache.
We have presented NuevoMatch, the first packet classification technique that uses the Range-Query RMI machine learning model to accelerate packet classification. We have shown an efficient way of training RQ-RMI models, making them learn the matching ranges of large rule-sets via sampling and analytical error-bound computations. We demonstrated the application of RQ-RMI to multi-field packet classification using rule-set partitioning. We evaluated NuevoMatch on synthetic and real-world rule-sets and confirmed its benefits for large rule-sets over state-of-the-art techniques. NuevoMatch introduces a new point in the design space of packet classification algorithms and opens up new ways to scale it on commodity processors. We believe that its compute-bound nature and use of neural networks will enable further scaling with future CPU generations, which will feature powerful compute capabilities targeting faster execution of neural-network-related computations.
We thank the anonymous reviewers of SIGCOMM'20 and our shepherd Minlan Yu for their helpful comments and feedback. We would also like to thank Isaac Keslassy and Leonid Ryzhyk for their feedback on the early draft of the paper. This work was partially supported by the Technion Hiroshi Fujiwara Cyber Security Research Center and the Israel National Cyber Directorate, by the Alon fellowship, and by the Taub Family Foundation. We gratefully acknowledge support from the Israel Science Foundation (Grant 1027/18) and the Israeli Innovation Authority.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In USENIX OSDI.
[2] CAIDA. [n.d.]. The CAIDA UCSD Anonymized Internet Traces 2019.
[3] James Daly, Valerio Bruschi, Leonardo Linguaglossa, Salvatore Pontarelli, Dario Rossi, Jerome Tollet, Eric Torng, and Andrew Yourtchenko. 2019. TupleMerge: Fast Software Packet Processing for Online Packet Classification. IEEE/ACM Transactions on Networking (TON) 27, 4 (2019), 1417–1431.
[4] Mo Dong, Tong Meng, Doron Zarchy, Engin Arslan, Yossi Gilad, Brighten Godfrey, and Michael Schapira. 2018. PCC Vivace: Online-Learning Congestion Control. In USENIX NSDI.
[5] Daniel Firestone. 2017. VFP: A Virtual Switch Platform for Host SDN in the Public Cloud. In USENIX NSDI.
[6] Daniel Firestone, Andrew Putnam, Sambrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian M. Caulfield, Eric S. Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert G. Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In USENIX NSDI.
[7] Pankaj Gupta and Nick McKeown. 1999. Packet Classification on Multiple Fields. In ACM SIGCOMM.
[8] Pankaj Gupta and Nick McKeown. 2000. Classifying Packets with Hierarchical Intelligent Cuttings. IEEE Micro 20, 1 (2000), 34–41.
[9] Pankaj Gupta and Nick McKeown. 2001. Algorithms for Packet Classification. IEEE Network 15, 2 (2001), 24–32.
[10] Sangjin Han, Keon Jang, KyoungSoo Park, and Sue Moon. 2010. PacketShader: A GPU-Accelerated Software Router. In ACM SIGCOMM.
[11] Intel. 2019. Intel Nervana Neural Network Processors.
[12] … arXiv preprint arXiv:1810.03259 (2018).
[13] Naga Praveen Katta, Omid Alipourfard, Jennifer Rexford, and David Walker. 2016. CacheFlow: Dependency-Aware Rule-Caching for Software-Defined Networks. In ACM SOSR.
[14] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 (2014).
[15] Jon M. Kleinberg and Éva Tardos. 2006. Algorithm Design. Addison-Wesley, 116–125.
[16] Kirill Kogan, Sergey Nikolenko, Ori Rottenstreich, William Culhane, and Patrick Eugster. 2014. SAX-PAC (Scalable and Expressive Packet Classification). In ACM SIGCOMM.
[17] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Jialin Ding, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System.
[18] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In ACM SIGMOD.
[19] Habana Labs. 2019. Habana AI Processors. Retrieved September 25, 2019 from https://habana.ai/product
[20] Karthik Lakshminarayanan, Anand Rangarajan, and Srinivasan Venkatachary. 2005. Algorithms for Advanced Packet Classification with Ternary CAMs. In ACM SIGCOMM.
[21] Wenjun Li, Xianfeng Li, Hui Li, and Gaogang Xie. 2018. CutSplit: A Decision-Tree Combining Cutting and Splitting for Scalable Packet Classification. In IEEE INFOCOM.
[22] Eric Liang, Hang Zhu, Xin Jin, and Ion Stoica. 2019. Neural Packet Classification. In ACM SIGCOMM.
[23] Alex X. Liu, Chad R. Meiners, and Yun Zhou. 2008. All-Match Based Complete Redundancy Removal for Packet Classifiers in TCAMs. In IEEE INFOCOM.
[24] Yadi Ma and Suman Banerjee. 2012. A Smart Pre-Classifier to Reduce Power Consumption of TCAMs for Multi-Dimensional Packet Classification. In ACM SIGCOMM.
[25] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. 2016. Resource Management with Deep Reinforcement Learning. In ACM SIGCOMM HotNets Workshop.
[26] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. 2017. Neural Adaptive Video Streaming with Pensieve. In ACM SIGCOMM.
[27] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. 2008. OpenFlow: Enabling Innovation in Campus Networks. ACM SIGCOMM CCR 38, 2 (2008), 69–74.
[28] Nina Narodytska, Leonid Ryzhyk, Igor Ganichev, and Soner Sevinc. 2019. BDD-Based Algorithms for Packet Classification. In Formal Methods in Computer Aided Design (FMCAD).
[29] Nvidia. 2019. Nvidia Deep Learning Inference Platform.
[30] … In USENIX NSDI.
[31] Alon Rashelbach. 2020. NuevoMatch source code. Retrieved June 21, 2020 from https://github.com/acsl-technion/nuevomatch
[32] Ori Rottenstreich and János Tapolcai. 2015. Lossy Compression of Packet Classifiers. In ACM/IEEE ANCS.
[33] Nadi Sarrar, Steve Uhlig, Anja Feldmann, Rob Sherwood, and Xin Huang. 2012. Leveraging Zipf's Law for Traffic Offloading. Computer Communication Review 42, 1 (2012), 16–22.
[34] Sumeet Singh, Florin Baboescu, George Varghese, and Jia Wang. 2003. Packet Classification Using Multidimensional Cutting. In ACM SIGCOMM.
[35] Ed Spitznagel, David E. Taylor, and Jonathan S. Turner. 2003. Packet Classification Using Extended TCAMs. In IEEE ICNP.
[36] Venkatachary Srinivasan, Subhash Suri, and George Varghese. 1999. Packet Classification Using Tuple Space Search. In ACM SIGCOMM.
[37] David E. Taylor. 2005. Survey and Taxonomy of Packet Classification Techniques. ACM Computing Surveys (CSUR) 37, 3 (2005), 238–275.
[38] David E. Taylor and Jonathan S. Turner. 2005. Scalable Packet Classification Using Distributed Crossproducing of Field Labels. In IEEE INFOCOM.
[39] David E. Taylor and Jonathan S. Turner. 2007. ClassBench: A Packet Classification Benchmark. IEEE/ACM Transactions on Networking (TON) 15, 3 (2007), 499–511.
[40] Asaf Valadarsky, Michael Schapira, Dafna Shahaf, and Aviv Tamar. 2017. Learning to Route with Deep RL. In NIPS Deep Reinforcement Learning Symposium.
[41] Balajee Vamanan, Gwendolyn Voskuilen, and T. N. Vijaykumar. 2010. EffiCuts: Optimizing Packet Classification for Memory and Throughput. In ACM SIGCOMM.
[42] Matteo Varvello, Rafael Laufer, Feixiong Zhang, and T. V. Lakshman. 2016. Multilayer Packet Classification with Graphics Processing Units. IEEE/ACM Transactions on Networking (TON) 24, 5 (2016), 2728–2741.
[43] Hyunho Yeo, Youngmok Jung, Jaehong Kim, Jinwoo Shin, and Dongsu Han. 2018. Neural Adaptive Content-Aware Internet Video Delivery. In USENIX OSDI.
[44] Sorrachai Yingchareonthawornchai, James Daly, Alex X. Liu, and Eric Torng. 2018. A Sorted-Partitioning Approach to Fast and Scalable Dynamic Packet Classification. IEEE/ACM Transactions on Networking (TON) 26, 4 (2018), 1907–1920.
[45] Yasir Zaki, Thomas Pötsch, Jay Chen, Lakshminarayanan Subramanian, and Carmelita Görg. 2015. Adaptive Congestion Control for Unpredictable Cellular Networks. In ACM SIGCOMM.
[46] Hongyi Zeng, Peyman Kazemian, George Varghese, and Nick McKeown. 2012. Automatic Test Packet Generation. In ACM CoNEXT.

Appendices are supporting material that has not been peer-reviewed.
A RQ-RMI CORRECTNESS
A.1 Responsibility of a submodel

Denote the input domain of an RQ-RMI model as D ⊂ R and its number of stages as n.

Theorem A.1 (Responsibility Theorem). Let s_i be a trained stage such that i < n − 1. The responsibilities of the submodels in s_{i+1} can be calculated by evaluating a finite set of inputs over the stage s_i.

The intuition behind Theorem A.1 is based on Corollary 3.2, namely that submodels output piecewise linear functions. Proving it requires some additional definitions.
Definition A.2 (Stage Output). The output of stage s_i is defined for x ∈ D as S_i(x) = M_{i, f_i(x)}(x), where f_i(x) is the index of the submodel in s_i that is responsible for the input x, defined as

f_i(x) = 0 for i = 0, and f_i(x) = ⌊S_{i−1}(x) · W_i⌋ for i ∈ {1, 2, ..., n−1}.

Definition A.3 (Submodel Responsibility). The responsibility of a submodel m_{i,j} is defined as

R_{i,j} = D for i = 0, and R_{i,j} = { x | f_i(x) = j } for i ∈ {1, 2, ..., n−1}.

Note that the responsibilities of every two submodels in the same stage are disjoint.
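The recursive definitions of f_i and S_i above can be sketched directly as code. The two-stage toy model below is hypothetical, and the clamping of the index to a valid range is an assumption for illustration; it is not part of the definitions.

```python
import math

def make_rqrmi_eval(stages, widths):
    """stages[i][j] is submodel M_{i,j}: a map from an input x to [0, 1).
    widths[i] is W_i, the number of submodels in stage i."""
    def f(i, x):
        # Definition A.2: f_0(x) = 0; otherwise f_i(x) = floor(S_{i-1}(x) * W_i).
        if i == 0:
            return 0
        idx = math.floor(S(i - 1, x) * widths[i])
        return min(max(idx, 0), widths[i] - 1)  # clamp to a valid submodel index

    def S(i, x):
        # Stage output (Definition A.2): the responsible submodel's output.
        return stages[i][f(i, x)](x)

    return f, S

# Hypothetical two-stage model over D = [0, 1): one root, four leaf submodels.
stages = [
    [lambda x: x],                      # stage 0: a single identity submodel
    [lambda x: x for _ in range(4)],    # stage 1: four identity submodels
]
f, S = make_rqrmi_eval(stages, widths=[1, 4])
print(f(1, 0.6))   # floor(0.6 * 4) = 2: submodel 2 is responsible for 0.6
```

With identity submodels, the stage-1 responsibilities R_{1,j} partition [0, 1) into four equal quarters, matching the disjointness noted above.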
Definition A.4 (Left and Right Slopes). For a range R, if the points min_{x∈R} x or max_{x∈R} x are defined, we refer to them as the boundaries of the range; we refer to all other points as internal points of the range. For a piecewise linear function defined over some range R and every internal point x ∈ R, there exists δ > 0 such that the function is linear over (x − δ, x) and over (x, x + δ). Accordingly, we can refer to the left slope and the right slope of a point, defined as those of the two linear functions.

[Figure 16: Illustration of the trigger inputs (g_1, ..., g_6) and the transition inputs (t_1, ..., t_5) for the graph M_{i,j}(x) of a submodel m_{i,j}. Note that W_{i+1}, namely the number of submodels in stage i + 1, affects the transition inputs of m_{i,j} and equals 4 in the illustration.]

Definition A.5 (Trigger Inputs).
We say that an input g ∈ D is a trigger input of a submodel m_{i,j} if one of the following holds: (i) g is a boundary point of D (namely, g = min_{y∈D} y or g = max_{y∈D} y); (ii) g is an internal point of D and the left and right slopes of M_{i,j} at g differ.

Definition A.6 (Transition Inputs). We say that an input t ∈ D is a transition input of a submodel m_{i,j} if it changes the submodel selection in the following stage. Formally, there exists ε > 0 such that for all 0 < δ < ε:

⌊M_{i,j}(t − δ) · W_{i+1}⌋ ≠ ⌊M_{i,j}(t + δ) · W_{i+1}⌋

Definition A.7 (The Function B_i(x)). We define the function B_i for i ∈ {0, 1, ..., n−1}. B_i is a staircase function with values in [0, W_{i+1} − 1], defined as B_i(x) = ⌊x · W_{i+1}⌋ for x ∈ [0, 1).

For a submodel m_{i,j}, we term the set of its trigger inputs G_{i,j} and the set of its transition inputs T_{i,j}. See Figure 16 for an illustration. From the submodel definition and Corollary 3.2, we can tell that a submodel's ReLU operations determine its trigger inputs. Consequently, any set of trigger inputs is finite and can be calculated using a few linear equations. Nonetheless, calculating the transition inputs of a submodel is not straightforward. We show a fast and efficient way of doing so in the following lemma:

Lemma A.8.
Let m_{i,j} be an RQ-RMI submodel, and let a < b ∈ G_{i,j} be two adjacent trigger inputs of m_{i,j}. Then the set S = [a, b] ∩ T_{i,j} is finite and can be calculated using the inputs a and b alone.

Proof. We divide the construction of S into two subsets, S = S_1 ∪ S_2. First we handle S_1. For each x ∈ {a, b}, x ∈ S_1 if and only if there exists ε > 0 such that for all 0 < δ < ε:

B_i(M_{i,j}(x − δ)) ≠ B_i(M_{i,j}(x + δ))

Now to S_2. Without loss of generality, M_{i,j}(a) ≤ M_{i,j}(b). From Corollary 3.2 and Definition A.5, M_{i,j} is linear in [a, b]. If B_i(M_{i,j}(a)) = B_i(M_{i,j}(b)), then S_2 = ∅. Otherwise, M_{i,j}(a) ≠ M_{i,j}(b). B_i(x) outputs discrete values between B_i(M_{i,j}(a)) and B_i(M_{i,j}(b)) for all x ∈ (a, b). Denote this finite set of discrete values as M. For any y ∈ M there exists a value d ∈ (a, b] such that M_{i,j}(d) · W_{i+1} = y. By the linearity of M_{i,j} in [a, b]:

d = (y / W_{i+1} − M_{i,j}(a)) · (b − a) / (M_{i,j}(b) − M_{i,j}(a)) + a

We construct S_2 as follows:

S_2 = { (y / W_{i+1} − M_{i,j}(a)) · (b − a) / (M_{i,j}(b) − M_{i,j}(a)) + a | y ∈ M } ∎

Corollary A.9.
The set of transition inputs T_{i,j} can be calculated using G_{i,j}, and its size is bounded such that |T_{i,j}| ≤ W_{i+1} · |G_{i,j}|.

Not all transition inputs of all submodels are reachable, as some exist outside of their corresponding submodel's responsibility. Therefore, we define the set of reachable transition inputs of a stage s_i as the transition set of that stage:

Definition A.10 (Transition Set). The transition set U_i of a stage s_i is an ordered set, defined as:

U_i = {min(D)} ∪ ( ⋃_{j=0}^{W_i − 1} (T_{i,j} ∩ R_{i,j}) ) ∪ {max(D)}

The proof of Theorem A.1 directly follows from the next two lemmas:
Lemma A.11. Let s_i, s_{i+1} be two adjacent stages. For any two adjacent values u_1 < u_2 ∈ U_i, there exists a submodel m_{i+1,j} such that S_{i+1}(x) is piecewise linear and equal to M_{i+1,j}(x) for all x ∈ (u_1, u_2).

Proof. We show that there exists a submodel m_{i+1,j} such that any x ∈ (u_1, u_2) satisfies x ∈ R_{i+1,j}, which implies f_{i+1}(x) = j and so S_{i+1}(x) = M_{i+1,j}(x). By Corollary 3.2, S_{i+1} is then piecewise linear for all x ∈ (u_1, u_2).

Let x < y ∈ (u_1, u_2). Assume by contradiction that there exist two submodels m_{i+1,j_1} and m_{i+1,j_2} such that x ∈ R_{i+1,j_1} and y ∈ R_{i+1,j_2}. From Definition A.3, f_{i+1}(x) ≠ f_{i+1}(y), which implies B_i(S_i(x)) ≠ B_i(S_i(y)). Thus, there exists an input z ∈ (x, y] and ε > 0 such that for all 0 < δ < ε: B_i(S_i(z − δ)) ≠ B_i(S_i(z + δ)). Since S_i consists of the outputs of submodels in s_i, there exists a submodel m_{i,k} such that S_i(z) = M_{i,k}(z). Therefore, z ∈ T_{i,k} and z ∈ R_{i,k}, which means z ∈ U_i, in contradiction to the definition of u_1 and u_2. ∎

Lemma A.12.
Let s_i be an RQ-RMI stage such that i ∈ {0, 1, ..., n − 2}. The function f_{i+1}, defined over the domain D, can be calculated using the inputs U_i over S_i.

Proof. Let u_1 < u_2 ∈ U_i be two adjacent values. By Lemma A.11 there exists a submodel m_{i+1,j} such that S_{i+1}(x) = M_{i+1,j}(x) for all x ∈ (u_1, u_2). From Definition A.2, f_{i+1}(x) = j for all x ∈ (u_1, u_2). By calculating B_i(S_i(u_1)) and B_i(S_i(u_2)), f_{i+1}(x) is known for all x ∈ [u_1, u_2]. Since min{D} ∈ U_i and max{D} ∈ U_i, f_{i+1}(x) is known for all x ∈ D. ∎
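The closed form for S_2 in the proof of Lemma A.8 can be sketched as a short helper. This is an illustrative sketch, not the paper's implementation: it assumes the submodel M is linear on [a, b] (as it is between adjacent trigger inputs), and it computes only the interior subset S_2, omitting the boundary subset S_1.

```python
import math

def transition_inputs(M, a, b, W_next, ):
    """Candidate transition inputs of a submodel M that is linear on [a, b]
    (a, b adjacent trigger inputs), for a next stage with W_next submodels.
    Implements the set S_2 from the proof of Lemma A.8."""
    ya, yb = M(a), M(b)
    if ya > yb:                       # WLOG M(a) <= M(b); swap endpoints
        a, b, ya, yb = b, a, yb, ya
    lo, hi = math.floor(ya * W_next), math.floor(yb * W_next)
    if lo == hi or ya == yb:
        return []                     # B_i is constant on [a, b]: S_2 is empty
    out = []
    for y in range(lo + 1, hi + 1):   # each staircase level crossed in (a, b]
        # Solve M(d) * W_next = y for d, using linearity of M on [a, b].
        d = (y / W_next - ya) * (b - a) / (yb - ya) + a
        out.append(d)
    return out

# Hypothetical linear segment: M(x) = x on [0, 0.999], next stage of width 4.
print(transition_inputs(lambda x: x, 0.0, 0.999, 4))  # crossings near 1/4, 1/2, 3/4
```

Consistent with Corollary A.9, the helper returns at most W_next candidates per pair of adjacent trigger inputs.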
[Figure 17: A detailed version of end-to-end performance for small rule-sets (1K and 10K rules). Speedup in throughput and latency of NuevoMatch against stand-alone versions of CutSplit and TupleMerge. Classifiers with no valid iSets are not displayed.]
A.2 Submodel prediction error
Theorem A.13 (Submodel Prediction Error).
Let s_{n−1} be the last stage of an RQ-RMI model. The maximum prediction error of any submodel in s_{n−1} can be calculated using a finite set of inputs over the stage s_{n−1}.

The intuition behind Theorem A.13 is to address the set of range-value pairs as an additional, virtual, stage of the model.
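This virtual-stage view can be made concrete with a short sketch: the W_n range-value pairs are indexed by f_n(x) = ⌊S_{n−1}(x) · W_n⌋, exactly like a stage of submodels. The last-stage output below is a made-up identity function, and the grid-based responsibility estimate is an illustration only; the appendix computes responsibilities analytically, not by sampling.

```python
import math

def f_n(S_last, W_n, x):
    """Index of the range-value pair predicted for input x."""
    v = math.floor(S_last(x) * W_n)
    return min(max(v, 0), W_n - 1)   # clamp to a valid pair index (assumption)

def pair_responsibilities(S_last, W_n, grid):
    """Approximate each pair's responsibility R_p = {x | f_n(x) = v} by
    evaluating a sample grid of inputs."""
    resp = {v: [] for v in range(W_n)}
    for x in grid:
        resp[f_n(S_last, W_n, x)].append(x)
    return resp

# Hypothetical last stage: identity over D = [0, 1), indexing W_n = 4 pairs.
grid = [i / 100 for i in range(100)]
resp = pair_responsibilities(lambda x: x, 4, grid)
print(sorted(len(v) for v in resp.values()))   # each pair claims 25 of 100 samples
```

With a perfect last stage, each pair's responsibility coincides with its true range; the misclassified sets of Definition A.15 below capture exactly the inputs where the two differ.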
Definition A.14 (Range-Value Pair). A range-value pair ⟨r, v⟩ is defined such that r is an interval in D and v ∈ {0, 1, 2, ...} is unique to that pair.

We term W_n the number of range-value pairs an RQ-RMI model should index. Similar to the definitions for submodels, we extend f such that f_n(x) = ⌊S_{n−1}(x) · W_n⌋, and say that the responsibility R_p of a pair p = ⟨r, v⟩ is the set of inputs {x | f_n(x) = v}. Consequently, we make the following two observations. First, all inputs in the range r \ R_p should have reached p but did not. Second, all inputs in the range R_p \ r did reach p but should not.

Definition A.15 (Misclassified Pair Set).
Let m be a submodel in s_{n−1} with responsibility R_m. Denote by P_m the set of all pairs such that a pair p = ⟨r, v⟩ ∈ P_m satisfies ((r \ R_p) ∪ (R_p \ r)) ∩ R_m ≠ ∅. In other words, P_m holds all pairs that are misclassified by m, and is termed the misclassified pair set of m.

Definition A.16 (Maximum Prediction Error). Let m be a submodel in s_{n−1} with responsibility R_m and misclassified pair set P_m. The maximum prediction error of m is defined as:

max { |f_n(x) − v| : ⟨r, v⟩ ∈ P_m, x ∈ R_m }

Lemma A.17.
The misclassified pair sets of all submodels in s_{n−1} can be calculated using U_{n−2} over S_{n−1}.

Proof. Let q_1 < q_2 be two adjacent values in U_{n−2}. From Lemma A.11 there exists a single submodel m_{n−1,j}, j ∈ [0, W_{n−1}), such that S_{n−1}(x) = M_{n−1,j}(x) for all x ∈ (q_1, q_2). Hence, using Corollary 3.2, S_{n−1} is linear in (q_1, q_2). Therefore, the values of S_{n−1} in [q_1, q_2] can be calculated using q_1 and q_2 alone. Consequently, according to the definitions of f_n and the responsibility of a pair, the set of pairs P_j with responsibilities in [q_1, q_2] can also be calculated using q_1 and q_2. Calculating the responsibilities of all pairs is performed by repeating the process for any two adjacent points in U_{n−2}.

At this point, as we know R_p for every pair p = ⟨r, v⟩, calculating the set (r \ R_p) ∪ (R_p \ r) is trivial. Acquiring the responsibility of any submodel in s_{n−1} using Theorem A.1 enables us to calculate its misclassified pair set immediately. ∎

Proof of Theorem A.13
Proof. Let m be a submodel in s_{n−1} with responsibility R_m. For simplicity, we address the case where R_m is a continuous range; extension to the general case is possible by repeating the proof for any continuous range in R_m.

Denote the submodel's finite set of trigger inputs as G_m. Define the set Q as follows:

Q = {min R_m} ∪ (G_m ∩ R_m) ∪ {max R_m}

Let q_1 < q_2 be two adjacent values in Q. From the definition of trigger inputs, m outputs a linear function in [q_1, q_2]. Hence, the set of values S = {f_n(x) | x ∈ [q_1, q_2]} can be calculated using only q_1 and q_2 over S_{n−1}. From Lemma A.17, the misclassified pair set P_m can be calculated using the finite set U_{n−2}. Denote the set

P̂ = { ⟨r, v⟩ | ⟨r, v⟩ ∈ P_m, r ∩ [q_1, q_2] ≠ ∅ }

Calculating max{ |s − v| : s ∈ S, ⟨r, v⟩ ∈ P̂ } yields the maximum error of m in [q_1, q_2]. Repeating the process for any two adjacent points in Q yields the maximum error of m for all of R_m. ∎

Rule-set names in Figures 8 and 17, by order: ACL1, ACL2, ACL3, ACL4, ACL5, FW1, FW2, FW3, FW4, FW5, IPC1, IPC2.
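As a concreteness check for Definitions A.15–A.16, the brute-force sketch below finds the maximum prediction error of a predictor over a sample grid. The true ranges and the slightly shifted predictor are made up for illustration, and sampling replaces the analytical procedure of the proof above.

```python
import math

def max_prediction_error(predict, pairs, grid):
    """predict(x) -> predicted pair index f_n(x); pairs[v] = (lo, hi) is the
    true range of pair v. Returns the maximum |f_n(x) - v| over sampled inputs
    whose prediction disagrees with their true pair (cf. Definition A.16)."""
    err = 0
    for x in grid:
        pred = predict(x)
        true = next(v for v, (lo, hi) in enumerate(pairs) if lo <= x < hi)
        if pred != true:
            err = max(err, abs(pred - true))
    return err

# Made-up example: 4 equal ranges over [0, 1), predictor shifted by 0.05.
pairs = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
predict = lambda x: min(math.floor((x + 0.05) * 4), 3)   # off near boundaries
grid = [i / 1000 for i in range(1000)]
print(max_prediction_error(predict, pairs, grid))        # off by one pair index
```

A bounded maximum error is what makes RQ-RMI usable for classification: the prediction only needs to be refined among the few candidate pairs within the error bound.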
[Table 4: RQ-RMI configurations for different input rule-set sizes.]