A Computational Approach to Packet Classification
Alon Rashelbach
Ori Rottenstreich
Mark Silberstein
ABSTRACT
Multi-field packet classification is a crucial component in modern software-defined data center networks. To achieve high throughput and low latency, state-of-the-art algorithms strive to fit the rule lookup data structures into on-die caches; however, they do not scale well with the number of rules. We present a novel approach,
NuevoMatch, which improves the memory scaling of existing methods. A new data structure,
Range Query Recursive Model Index (RQ-RMI), is the key component that enables NuevoMatch to replace most of the accesses to main memory with model inference computations. We describe an efficient training algorithm that guarantees the correctness of the RQ-RMI-based classification. The use of RQ-RMI allows the rules to be compressed into model weights that fit into the hardware cache. Further, it takes advantage of the growing support for fast neural network processing in modern CPUs, such as wide vector instructions, achieving a rate of tens of nanoseconds per lookup.

Our evaluation using 500K multi-field rules from the standard ClassBench benchmark shows a geometric mean compression factor of 4.9×, 8×, and 82×, and average performance improvement of 2.4×, 2.6×, and 1.6× in throughput compared to CutSplit, NeuroCuts, and TupleMerge, all state-of-the-art algorithms.

INTRODUCTION

Packet classification is a cornerstone of packet-switched networks. Network functions such as switches use a set of rules that determine which action they should take for each incoming packet. The rules originate in higher-level domains, such as routing, Quality of Service, or security policies. They match the packets' metadata, e.g., the destination IP address and/or the transport protocol. If multiple rules match, the rule with the highest priority is used.

Packet classification algorithms have been studied extensively. There are two main classes: those that rely on Ternary Content Addressable Memory (TCAM) hardware [13, 20, 23, 28, 37], and those that are implemented in software [3, 8, 21, 22, 34, 36, 41, 44]. In this work, we focus on software-only algorithms that can be deployed in virtual network functions, such as forwarders or ACL firewalls, running on commodity x86 servers.

Software algorithms fall into two major categories: decision-tree based [8, 21, 22, 34, 41, 44] and hash-based [3, 36].
The former use decision trees for indexing and matching the rules, whereas the latter perform lookups via hash-tables by hashing the rules' prefixes. Other methods for packet classification [7, 38] are less common, as they either require too much memory or are too slow.

A key to achieving high classification performance on modern CPUs is to ensure that the classifier fits into the CPU on-die cache. When the classifier is too large, the lookup involves high-latency

This work does not raise any ethical issues.
Figure 1: NuevoMatch overview. The rules are divided into Independent Sets indexed by RQ-RMIs and the Remainder Set indexed by any classifier. One RQ-RMI predicts the storage index of the matching rule. The Selector chooses the highest-priority matching rule.

memory accesses, which stall the CPU, as the data-dependent access pattern during the lookup impedes hardware prefetching. Unfortunately, as the number of rules grows, it becomes difficult to maintain the classifier in the cache. In particular, in decision-tree methods, rules are often replicated among multiple leaves of the decision tree, inflating its memory footprint and limiting scalability. Consequently, recent approaches, notably CutSplit [21] and NeuroCuts [22], seek to reduce rule replication to achieve better scaling. However, they still fail to scale to large rule-sets, which in modern data centers may reach hundreds of thousands of rules [6]. Hash-based techniques also suffer from poor scaling, as adding rules increases the number of hash-tables and their size.

We propose a novel approach to packet classification,
NuevoMatch, which compresses the rule-set index dramatically to fit it entirely into the upper levels of the CPU cache (L1/L2), even for large 500K rule-sets. We introduce a novel
Range Query Recursive Model Index (RQ-RMI) model and train it to learn the rules' matching sets, turning rule matching into neural network inference. We show that RQ-RMI achieves out-of-L1-cache execution by reducing the memory footprint on average by 4.9×, 8×, and 82× compared to recent CutSplit [21], NeuroCuts [22], and TupleMerge [3] on the standard ClassBench [39] benchmarks, and up to 29× for real forwarding rule-sets.

To the best of our knowledge, NuevoMatch is the first to perform packet classification using trained neural network models. NeuroCuts also uses neural nets, but it applies them to optimizing the decision tree parameters during the offline tree construction phase; its rule matching still uses traditional (optimized) decision trees. In contrast, NuevoMatch performs classification via RQ-RMIs, which are more space-efficient than decision trees or hash-tables, improving scalability by an order of magnitude.

NuevoMatch transforms the packet classification task from memory- to compute-bound. This design is appealing because it is likely to scale well in the future, with rapid advances in hardware acceleration of neural network inference [11, 19, 29]. On the other hand, the performance of both decision trees and hash-tables is inherently limited by the poor scaling of DRAM access latency and CPU on-die cache sizes (e.g., 1.× over five years for L1 in Intel's CPUs).

NuevoMatch builds on the recent work on learned indexes [18], which applies a Recursive Model Index (RMI) model to indexing key-value pairs. The values are stored in an array, and the RMI is trained to learn the mapping function between the keys and the indexes of their values in the array. The model is used to predict the index given the key. When applied to databases [18], RMI boosts performance by compressing the indexes to fit in CPU caches.

Unfortunately, RMI is not directly applicable to packet classification.
First, a key (packet field) may not have an exact matching value, but instead match a rule range, whereas RMI can learn only exact key-index pairs. This is a fundamental property of RMI: it guarantees correctness only for the keys used during training, but provides no such guarantees for non-existing keys ([18], Section 3.4). Thus, range matching would require enumeration of all possible keys in the range, making it too slow. Second, the match is evaluated over multiple packet fields, requiring lookup in a multi-dimensional space. Unfortunately, multi-dimensional RMI [17] requires that the input be flattened into one dimension, which in the presence of wildcards results in an exponential blowup of the input domain, making it too large to learn with compact models. Finally, a key may match multiple rules, with the highest-priority one used as output, whereas RMI retrieves only a single index for each key. NuevoMatch successfully solves these challenges.

RQ-RMI. We design a novel model which can match keys to ranges, with an efficient training algorithm that does not require exhaustive key enumeration to learn the ranges. The training strives to minimize the prediction error of the index while maintaining a small model size. We show that the models can store indices of 500K ClassBench rules in 35 KB (§5.2.1). We prove that our algorithm guarantees range lookup correctness (§3.3).
Multi-field packet classification. To enable multi-field matching with overlapping ranges, the rule-set is split into independent sets with non-overlapping ranges, called iSets, each associated with a single field and indexed with its own RQ-RMI model. The iSet partitioning (§3.6) strives to cover the rule-set with as few iSets as possible, discarding those that are too small. The remainder set of the rules not covered by large iSets is indexed via existing classification techniques. In practice, the rules in the remainder constitute a small fraction of representative rule-sets, so the remainder index fits into a fast cache together with the RQ-RMIs.

Figure 1 summarizes the complete classification flow. The query of the RQ-RMI models produces the hints for the secondary search that selects one matching rule per iSet. The validation stage selects the candidates with a positive match across all the fields, and a selector chooses the highest-priority matching rule.

Conceptually, NuevoMatch can be seen as an accelerator for existing packet classification techniques and thus complements them. In particular, the RQ-RMI model is best used for indexing rules with high value diversity that can be partitioned into fewer iSets. We show that the iSet construction algorithm is effective at selecting the rules that can be indexed via RQ-RMI, leaving the rest in the remainder (§5.3.1). The performance benefits of NuevoMatch become evident when it indexes more than 25% of the rules. Since the remainder is only a fraction of the original rule-set, it can be indexed efficiently with smaller decision-trees/hash-tables or will fit smaller TCAMs.

Our experiments show that NuevoMatch outperforms all the state-of-the-art algorithms on synthetic and real-life rule-sets. For example, it is faster than CutSplit, NeuroCuts, and TupleMerge by 2.7×, 4.4×, and 2.6× in latency and 2.4×, 2.6×, and 1.6× in throughput respectively, averaged over 12 rule-sets of 500K ClassBench-generated rules, and by 7.× in latency and 3.× in throughput vs. TupleMerge for the real-world Stanford backbone forwarding rule-set.

NuevoMatch supports rule updates by removing the updated rules from the RQ-RMI and adding them to the remainder set indexed by another algorithm that supports fast updates, e.g., TupleMerge. This approach requires periodic retraining to maintain a small remainder set; hence it does not yet support more than a few thousand updates (§3.9). Algorithmic solutions for directly updating RQ-RMI are deferred to future work.

In summary, our contributions are as follows.
• We present a novel RQ-RMI model and a training technique for learning packet classification rules.
• We demonstrate the application of RQ-RMI to multi-field packet classification.
• NuevoMatch outperforms existing techniques in terms of memory footprint, latency, and throughput on challenging rule-sets with up to 500K rules, compressing them to fit into small caches of modern processors.
BACKGROUND

This section describes the packet classification problem and surveys existing solutions.
Packet classification is the process of locating a single rule that is satisfied by an input packet among a set of rules. A rule contains a few fields of the packet's metadata. Wildcards define ranges, i.e., they match multiple values. Ranges may overlap with each other, i.e., a packet may match several rules, but only the one having the highest priority is selected. Figure 2 illustrates a classifier with two fields and five overlapping matching rules. An incoming packet matches two of the rules, but only the one with the higher priority is used.

Packet classification performance becomes difficult to scale as the number of rules and the number of matching fields grow. Therefore, it has received renewed interest with the increased complexity of software-defined data center networks, featuring hundreds of thousands of rules per virtual network function [5] and tens of matching fields (up to 41 in OpenFlow 1.4 [27]).

Decision Tree Algorithms.
The rules are viewed as hyper-cubes and packets as points in a multi-dimensional space. The axes of the rule space represent different fields and hold non-negative integers. The source code of NuevoMatch is available in [31].

Figure 2: Packet classification with two fields: IP address and port.
A recursive partitioning technique divides the rule space into subsets with at most binth rules. Thus, to match a rule, a tree traversal finds the smallest subset for a given packet, while a secondary search scans the subset's rules to select the best match.

Unfortunately, a rule replication problem may hinder performance on larger rule-sets: when a rule spans several subspaces, it is duplicated, dramatically increasing the tree's memory footprint. Early works such as HiCuts [8] and HyperCuts [34] both suffer from this issue. More recent EffiCuts [41] and CutSplit [21] suggest splitting the rule set into groups of rules that share similar properties and generating a separate decision-tree for each. NeuroCuts [22], the most recent work in this domain, uses reinforcement learning to optimize decision tree parameters, reducing the memory footprint or the number of memory accesses during traversal by efficiently exploring a large tree configuration space.
Hash-Based Algorithms.
Tuple Space Search [36] and the recent TupleMerge [3] partition the rule-set into subsets according to the number of prefix bits in each field. As all rules of a subset have the same number of prefix bits, they can act as keys in a hash table. Classification is performed by extracting the prefix bits of an incoming packet in all fields and checking all hash-tables for matching candidates. A secondary search eliminates false-positive results and selects the rule with the highest priority.

Hash-based techniques are effective in an online classification setting with frequent rule updates, whereas decision trees are not. However, decision trees have traditionally been considered faster in classification. Nevertheless, the recent TupleMerge hash-based algorithm closes the gap and achieves high classification throughput while supporting high-performance updates.
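The tuple-space idea above can be sketched in a few lines. This is a minimal illustration, not the TupleMerge implementation: the rule format is hypothetical, and all fields are treated as 32-bit values masked by their prefix length.

```python
from collections import defaultdict

def mask(value, prefix_len, width=32):
    # Keep only the prefix_len most significant bits of a 32-bit field.
    return value >> (width - prefix_len) if prefix_len else 0

def build_tables(rules):
    # rules: list of (priority, [(field_value, prefix_len), ...]).
    # Rules with identical per-field prefix lengths share one hash table.
    tables = defaultdict(dict)
    for prio, fields in rules:
        lengths = tuple(pl for _, pl in fields)
        key = tuple(mask(v, pl) for v, pl in fields)
        existing = tables[lengths].get(key)
        if existing is None or prio > existing[0]:
            tables[lengths][key] = (prio, fields)
    return tables

def classify(tables, packet):
    # Probe every tuple (hash table); keep the highest-priority hit.
    best = None
    for lengths, table in tables.items():
        key = tuple(mask(v, pl) for v, pl in zip(packet, lengths))
        hit = table.get(key)
        if hit and (best is None or hit[0] > best[0]):
            best = hit
    return best
```

Note that the number of probes grows with the number of distinct prefix-length combinations, which is exactly the scaling problem discussed next.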
The packet classification performance of all the existing techniques does not scale well with the number of rules. This happens because their indexing structures spill out of the fast L1/L2 CPU caches into L3 or DRAM. Indeed, as we show in our experiments (§5), TupleMerge and NeuroCuts exceed the 1MB L2 cache with 100K rules, and CutSplit with 500K rules. However, keeping the entire indexing structure in fast caches is critical for performance. The inherent lack of access locality in hash and tree data structures, combined with the data-dependent nature of the accesses, makes hardware prefetchers ineffective at hiding memory access latency. Thus, the performance of all lookups drops dramatically.
Figure 3: RMI model structure and inference [18].
The performance drop is significant even when the data structures fit in the L3 cache. This cache is shared among all the cores, whereas the L1 and L2 caches are per-core. Thus, L3 is not only slower (up to 90 cycles in recent x86 CPUs), but also suffers from cache contention, e.g., when another core runs a cache-demanding workload and causes cache thrashing. We observe the effects of L3 contention in §5.2.1.

NuevoMatch aims to provide a more space-efficient representation of the rule index to scale to large rule-sets.
We first explain the RMI model for learned indexes, which we use as the basis, explain its limitations, and then show our solution that overcomes them.
Kraska et al. [18] suggest using machine-learning models for storing key-value pairs instead of conventional data structures such as B-trees or hash tables. The values are stored in a value array, and a
Recursive Model Index (RMI) is used to retrieve the value given a key. Specifically, RMI predicts the index of the corresponding value in the value array using a model that learned the underlying key-index mapping function.

The main insight is that any index structure can be expressed as a continuous monotonically increasing function y = h(x) : [0, 1] ↦ [0, 1], where x is a key scaled down uniformly into [0, 1], and y is the index of the respective value in the value array, scaled down uniformly into [0, 1]. RMI is trained to learn h(x). The resulting learned index model ĥ(x) performs lookups in two phases: first it computes the predicted index ŷ = ĥ(key), and then it performs a secondary search in the array, in the vicinity ϵ of the predicted index, where ϵ is the maximum index prediction error of the model, namely |ĥ(key) − h(key)| ≤ ϵ.

Model structure.
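The two-phase lookup contract can be sketched as follows (a minimal illustration, assuming the model maps keys to a position in [0, 1) and the key array is sorted):

```python
import bisect

def lookup(model, keys, values, eps, key):
    # Phase 1: predict the index from the model's [0, 1) output.
    n = len(keys)
    pred = min(int(model(key) * n), n - 1)
    # Phase 2: secondary search, restricted to the +/- eps window.
    lo, hi = max(0, pred - eps), min(n, pred + eps + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < hi and keys[i] == key:
        return values[i]
    return None  # key not in the index
```

The correctness of the window restriction rests entirely on ϵ being a valid bound for every trained key, which is why RMI must compute ϵ exhaustively.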
RMI is a hierarchical model made of several (n) stages (Figure 3). Each stage i includes W_i submodels m_{i,j}, j < W_i, where W_i is the stage width. The first stage has a single submodel. Each successive stage has a larger width. The submodels in each stage are trained on a progressively smaller subset of the input keys, refining the index prediction toward the submodels in the leaves. Thus, each key-index pair is learned by one submodel at each stage, with the leaf submodel producing the index prediction.

RMI is a generic structure; a variety of machine learning models or data structures can be used as submodels, such as regression models or B-trees. The type of the submodels, the number of stages, and the width of each stage are configured prior to training.

Training.
Training is performed stage by stage.
First stage.
The submodel in stage 0, m_{0,0}, is trained on the whole data set. Then, the input key-index pairs are split into W_1 disjoint subsets. The input partitioning is performed as follows. For each input key-index pair {key : idx} we compute the submodel prediction ĵ = m_{0,0}(key), satisfying ĵ ∈ [0, 1). The output ĵ is used to obtain j = ⌊ĵ · W_1⌋, which is the index of the submodel in stage 1, m_{1,j}, to be used for learning {key : idx}. We call the subset of the input to be learned by model m_{i,j} the model input responsibility domain R_{i,j}, or responsibility for short. R_{0,0} is the whole input.

Internal stages.
The submodels in stage i, m_{i,j}, are trained on the keys in R_{i,j} (j < W_i). After training, the responsibilities of the submodels in stage i + 1 are computed.

Last stage.
The submodels of the last stage must predict the actual index of the matching value in the value array. However, a submodel may have a prediction error. Therefore, RMI uses the model prediction as a hint. The matching value is found by searching in the value array in the vicinity of the predicted index, as defined by the maximum error bound ϵ of the model. Note that ϵ must be valid for all input key-index pairs. To compute ϵ, RMI exhaustively computes the submodel prediction for each input key in its responsibility. Submodels with a high error bound are retrained or converted to B-trees.

Inference.
Given a key, we iteratively evaluate each submodel stage after stage, starting from m_{0,0}. We use the prediction in stage i − 1 to select the submodel in stage i, until we reach the last stage. The last selected submodel predicts the index in the value array. This index î determines the range for the secondary search in the value array, which spans [î − ϵ, î + ϵ].

Direct application of RMI to indexing packet classification rules is not possible for the following reasons:
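The stage-by-stage inference above can be sketched as follows (a minimal illustration; each submodel is assumed to be a callable mapping a key to [0, 1)):

```python
def rmi_predict(stages, key):
    # stages: list of stages, each a list of submodels. Internal
    # stages select the submodel of the next stage; the leaf output
    # is the (scaled) index prediction for the secondary search.
    j = 0
    for i, stage in enumerate(stages):
        out = stage[j](key)
        if i + 1 < len(stages):
            width = len(stages[i + 1])
            j = min(int(out * width), width - 1)  # j = floor(out * W)
    return out  # leaf prediction in [0, 1)
```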
No support for range matching. RMI allows only an exact match for a given key, whereas packet classification requires retrieving rules with matching ranges as defined by wildcards. This problem is fundamental: RMI exhaustively enumerates all the keys in all the ranges to calculate the submodel responsibility and the maximum model prediction error (see the underlined parts of the training algorithm). In other words, all the values in the range must be materialized into key-index pairs for RMI to learn them, since RMI does not guarantee correct lookup for keys not used in training [18]. The original paper sketches a few possible solutions; however, they either rely on model monotonicity (which we do not assume) or use smarter yet still expensive enumeration techniques.
Slow multi-dimensional indexing.
RMI is ineffective for multi-dimensional indexes because the proposed solution [17] leads to

The RMI paper also uses the term range index while applying RMI to range index data structures (i.e., B-trees) that can quickly retrieve all stored keys in a requested numerical range. Our work is fundamentally different: given a key, it retrieves the index of its matching range.
Figure 4: Transition inputs (t1, ..., t5) for a piece-wise linear function with an output domain of size 4.

generating an exponential number of rules in the presence of wildcards. For example, a single rule with wildcards in destination IP (0.0.0.*), port (10-100), and protocol (TCP/UDP) results in 46,592 distinct key-index pairs. Since the input domain becomes too large, it requires a large model that exceeds the CPU cache.

In the following we outline the solutions to these challenges. We first discuss Range-Query RMI (RQ-RMI), which extends RMI to perform range-value queries in a one-dimensional index where ranges do not overlap (§3.3-§3.5). We then show how to apply RQ-RMI in a multi-dimensional index space with overlaps (§3.6-§3.7).

We first seek a way to perform range matching over a set of non-overlapping ranges in one dimension. There are two basic ideas:
Sampling.
Each submodel m_{i,j} is trained by generating a uniform sample of key-index pairs from the input ranges in its responsibility. The samples are generated on-the-fly for each submodel (§3.5.4).

Analytical error bound estimation for ranges.
We eliminate RMI's requirement for exhaustive key-value enumeration during training by making the following observation: if a submodel is a piece-wise linear function, the worst-case error bound ϵ can be computed analytically, thereby enabling efficient learning of ranges.

The intuition behind this observation is illustrated in Figure 4. It shows the graph of some piece-wise linear function which represents a submodel M whose outputs are quantized into integers in [0, 4), i.e., M predicts the index in an array of size 4. We call the inputs for which this function changes its quantized output transition inputs t_i ∈ T. In turn, the transition inputs determine the regions of inputs with the same quantized output. Therefore, given an input range in the model's responsibility, to compute the model's maximum prediction error for any key in that range, it suffices to evaluate the prediction error at the transition inputs that fall in the range. We describe the training algorithm that relies on these observations in Section 3.4. We now provide a more formal description, but leave most of the proofs to the Appendix.

We choose to use a 3-layer fully-connected neural network (NN) with a single hidden layer and ReLU activation A. Such NNs were suggested in the original RMI paper [18]; however, it did not leverage their properties to accelerate error bound computations.

We denote a submodel as m_{i,j} and define it as follows.

Figure 5: The submodel training process. The additional phase for training submodels in the leaves is depicted with dashed lines.
Definition 3.1 (RQ-RMI submodel).
Denote the output of a 3-layer fully-connected neural network as:

N_{i,j}(x) = A(x · w1 + b1) × w2 + b2

where x is a scalar input, w1, b1 are the weight and bias row-vectors of layer 1 (the hidden layer), and w2, b2 are the weight column-vector and bias scalar of layer 2. Note that N_{i,j}(x) is a scalar. The ReLU activation A applies a function a to each element of an input vector:

a(x) = x if x ≥ 0, and 0 otherwise.

The submodel output, denoted M_{i,j}(x), is defined as M_{i,j}(x) = H(N_{i,j}(x)), where H(x) trims the output domain to [0, 1).

Corollary 3.2. M_{i,j}(x) is a piece-wise linear function.

We use Corollary 3.2 to compute the transition inputs and the responsibility of the submodels. We provide a simplified description; see the Appendix for the precise explanation.
RQ-RMI training is similar to RMI's. It is performed stage by stage. Figure 5 illustrates the training process for one stage. We start by training the single submodel in the first stage using the entire input domain. Next, we calculate its transition inputs (§3.5.2) and use them to find the responsibilities of the submodels in the following stage (§3.5.3). We proceed by training the submodels in the subsequent stage using designated datasets we generate based on the submodels' responsibilities (§3.5.4). We repeat this process until all submodels in all internal stages are trained. For the submodels in the leaves (last stage), there is an additional phase (dashed lines in Figure 5). After training, we calculate their error bounds and retrain the submodels that do not satisfy a predefined error threshold (§3.5.6).
Given a trained submodel m_{i,j}, we can analytically find all its linear regions and, respectively, the inputs delimiting them, which we call trigger inputs g_l. For all inputs in the region [g_l, g_{l+1}], the model function, denoted M(x), is linear by construction. On the other hand, the uniform output quantization defines a step-wise function Q = ⌊M(x) · W⌋ / W, where W is the size of the quantized output domain (Figure 4). Thus, for each input region [g_l, g_{l+1}], the transition inputs t_l ∈ T are those where M(x) and Q intersect.

Given a trained submodel m_{i,j} in an internal stage i, we say that it maps a key to a submodel m_{i+1,k}, k < W_{i+1}, if ⌊M_{i,j}(key) · W_{i+1}⌋ = k. As discussed informally earlier, the responsibility R_{i+1,k} of m_{i+1,k} is defined as all the inputs which are mapped by submodels in stage i to m_{i+1,k}. In other words, the trained submodels at stage i define the responsibility of the untrained submodels at stage i + 1. We compute R_{i+1,k} using the transition inputs of m_{i,j}. In the following, we assume for clarity that R_{i,j} is contiguous and that m_{i,j} is the only submodel at stage i.

We compute R_{i+1,k} by observing that it is composed of all the inputs in the regions (t_l, t_{l+1}) that map to submodel m_{i+1,k}, where t_l ∈ T_{i,j} are transition inputs of m_{i,j}. By construction, the inputs in the region between two adjacent transition points map to the same output. Thus, it suffices to compute the output of m_{i,j} at its transition points and choose the respective input ranges that are mapped to m_{i+1,k}.

Up to this point, we used only key-index pairs as model inputs. Now we focus on training on input ranges. A range can be represented as all the keys that fall into the range, all associated with the same index of the respective rule. For example, 10.1.1.0-10.1.1.255 includes 256 keys. Our goal is to train a model such that, given a key in the range, the model predicts the correct index.
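The analytical transition-input computation can be sketched for a submodel of the form in Definition 3.1 (a simplified, pure-Python illustration; the function and parameter names are ours):

```python
def transition_inputs(w1, b1, w2, b2, W, lo=0.0, hi=1.0):
    # M(x) = H(N(x)) with N(x) = sum_i w2[i]*relu(w1[i]*x + b1[i]) + b2,
    # quantized into W output bins over the input domain [lo, hi].
    def M(x):
        n = sum(w2i * max(w1i * x + b1i, 0.0)
                for w1i, b1i, w2i in zip(w1, b1, w2)) + b2
        return min(max(n, 0.0), 1.0 - 1e-12)  # H: trim to [0, 1)

    # Trigger inputs: ReLU kinks delimiting the linear regions.
    kinks = [-b / w for w, b in zip(w1, b1) if w != 0 and lo < -b / w < hi]
    grid = sorted([lo, hi] + kinks)
    trans = []
    for a, b in zip(grid, grid[1:]):       # one linear region per pair
        ya, yb = M(a), M(b)
        if yb == ya:
            continue                       # constant region: no transitions
        slope = (yb - ya) / (b - a)
        qa, qb = int(ya * W), int(yb * W)
        # The quantized output changes where M(x)*W crosses an integer.
        for k in range(min(qa, qb) + 1, max(qa, qb) + 1):
            trans.append(a + (k / W - ya) / slope)
    return sorted(trans)
```

Because the number of kinks equals the hidden-layer width, the whole computation is linear in the model size rather than in the size of the key domain.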
Enumerating all the keys in the ranges is inefficient. Instead, we use sampling as follows. We generate the training key-index pairs by uniformly sampling the submodel's responsibility. We start with a low sampling frequency. A sample is included in the training set if there is an input rule range that matches the sampled key. Thus, the number of samples per input range is proportional to its relative size in the submodel's responsibility. Note that some input ranges (or individual keys) might not be sampled at all. Nevertheless, they will be matched correctly, as we explain further.
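The sampling step can be sketched as follows (a minimal illustration with a hypothetical range format of non-overlapping (lo, hi, index) tuples):

```python
def sample_dataset(ranges, resp_lo, resp_hi, n_samples):
    # Uniformly sample the responsibility [resp_lo, resp_hi). A sample
    # is kept only if some rule range covers it, so larger ranges
    # contribute proportionally more key-index pairs.
    step = (resp_hi - resp_lo) / n_samples
    pairs = []
    for i in range(n_samples):
        key = resp_lo + i * step
        for lo, hi, idx in ranges:
            if lo <= key <= hi:
                pairs.append((key, idx))
                break
    return pairs
```

Doubling n_samples (as the retraining loop does) simply regenerates a denser version of the same dataset.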
We train submodels on the generated datasets using supervised learning with the Adam optimizer [14] and a mean squared error loss function.
Given a trained submodel in the last stage, we compute the prediction error bound for all inputs in its responsibility by evaluating the submodel on its transition inputs. The prediction error is thus computed also for inputs that were not necessarily sampled, guaranteeing match correctness. If the error is too large, we double the number of samples, regenerate the key-index pairs, and retrain the submodel. Training continues until the target error bound is attained or a predefined number of attempts is exhausted. If training does not converge, the target error bound may be increased by the operator. The error bound determines the search distance of the secondary search; hence a larger bound lowers system performance. We evaluate this tradeoff later (§5.3.4).
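The retraining loop described above can be sketched with two pluggable callbacks (the names and the default constants are ours): train_fn(s) trains a leaf submodel on s samples, and error_bound(m) evaluates m at its transition inputs to get the worst-case prediction error.

```python
def train_leaf(train_fn, error_bound, threshold, s0=64, max_attempts=5):
    # Double the sample count until the analytical error bound meets
    # the threshold or the attempt budget is exhausted.
    s = s0
    for _ in range(max_attempts):
        model = train_fn(s)
        eps = error_bound(model)
        if eps <= threshold:
            break
        s *= 2  # regenerate a denser dataset and retrain
    return model, eps
```

If the loop returns with eps above the threshold, the operator can accept the larger bound at the cost of a wider secondary search.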
Figure 6: The rules from Figure 2 are split into two iSets: three rules (by port) and two rules (by IP address).

NuevoMatch supports overlapping ranges and matching over multiple dimensions, i.e., packet fields, by combining two simple ideas: partitioning the rule-set into disjoint independent sets (iSets), and performing multi-field validation of each rule. In the following, we use the terms dimension and field interchangeably.
Partitioning.
Each iSet contains rules that do not overlap in one specific dimension. We refer to the coverage of an iSet as the fraction of the rules it holds out of those in the input. One iSet may cover all the rules if they do not overlap in at least one dimension, whereas a dimension with many overlapping ranges may require multiple iSets. Figure 6 shows the iSets for the rules from Figure 2.

Each iSet is indexed by one RQ-RMI. Thus, to find the match to a query with multiple fields, we query all RQ-RMIs (in parallel), each over the field on which it was trained. Then, the highest-priority result is selected as the output.

Each iSet adds to the total memory consumption and computational requirements of NuevoMatch. Therefore, we introduce a heuristic that strives to find the smallest number of iSets that cover the largest part of the rule-set (§3.6.1).
Multi-field validation.
Since an RQ-RMI builds an index of the rules over a single field, it might retrieve a rule that does not match the other fields. Hence, each rule returned by an RQ-RMI is validated across all fields. This enables NuevoMatch to avoid indexing all dimensions, yet obtain correct results.
We introduce a greedy heuristic that repetitively constructs the largest iSet from the input rules, producing a group of iSets. To find the largest iSet over one dimension, we use a classical interval scheduling maximization algorithm [15]. The algorithm sorts the ranges by their upper bounds, and repetitively picks the range with the smallest upper bound that does not overlap the previously selected ranges.

We apply the algorithm to find the largest iSet in each field. Then we greedily choose the largest iSet among all the fields and remove its rules from the input set. We continue until the input is exhausted. This heuristic is sub-optimal but quite efficient. We plan to improve it in future work.

Having a larger number of fields in a rule-set might help improve coverage. For example, if the rules that overlap in one field do not overlap in another and vice versa, two iSets cover the whole rule-set, whereas each field in isolation would require more iSets.
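The heuristic above can be sketched as follows. This is a simplified illustration with a hypothetical rule format (a dict with a 'ranges' map from field name to an inclusive (lo, hi) pair); the min_fraction cutoff for discarding small iSets is an assumption of ours.

```python
def largest_iset(rules, field):
    # Interval scheduling maximization: the largest subset of rules
    # whose ranges in `field` are pairwise non-overlapping.
    chosen, last_hi = [], None
    for rule in sorted(rules, key=lambda r: r['ranges'][field][1]):
        lo, hi = rule['ranges'][field]
        if last_hi is None or lo > last_hi:
            chosen.append(rule)
            last_hi = hi
    return chosen

def partition(rules, fields, min_fraction=0.05):
    # Repeatedly carve out the largest iSet over any field; iSets
    # smaller than min_fraction of the rule-set go to the remainder.
    isets, remaining = [], list(rules)
    while remaining:
        best = max((largest_iset(remaining, f) for f in fields), key=len)
        if len(best) < min_fraction * len(rules):
            break  # too small: leave the rest to the remainder classifier
        isets.append(best)
        picked = {id(r) for r in best}
        remaining = [r for r in remaining if id(r) not in picked]
    return isets, remaining
```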
Real-world rule-sets may require many iSets for full coverage, with a single rule per iSet in the extreme cases. Using separate RQ-RMIs for such iSets would hinder performance. Therefore, we merge small iSets into a single remainder set. The rules in the remainder set are indexed using an external classifier. Each query is performed on both the RQ-RMIs and the external classifier.

In essence, NuevoMatch serves as an accelerator for the external classifier. Indeed, if a rule-set is covered by a few large iSets, the external classifier needs to index only a small remainder set that often fits into faster memory, so it can be very fast.

Two primary factors determine the end-to-end performance: (1) the number of iSets required for high coverage (depends on the rule-set), and (2) the number of iSets for achieving high performance (set by an operator).

Our evaluation (§5.3.1) shows that most of the evaluated rule-sets can be covered with high coverage, above 90%, with only 2-3 iSets. This is enough to accelerate the external classifier, as is evident from the performance results. On the other hand, the choice of the number of iSets depends on the external classifier's properties, in particular its sensitivity to memory footprint. We analyze this tradeoff in §5.3.
Worst-case inputs.
Some rule-sets cannot achieve good coverage with only a few iSets. For example, a rule-set with a single field whose ranges overlap requires too many iSets to be covered. To obtain better intuition about the origins of worst-case inputs, we consider the notion of rule-set diversity for rule-sets with exact matches. Rule-set diversity in a field is the number of unique values in it across the rule-set, divided by the total number of rules.
The rule-set diversity is an upper bound on the fraction of rules in the largest iSet of that field. In other words, low diversity implies that using the field for iSet partitioning would result in poor coverage.
We can also identify challenging rule-sets with ranges. We define rule-set centrality as the maximal number of rules such that each pair of them overlap (i.e., they all share a point in the multi-dimensional space).
The rule-set centrality is a lower bound on the number of iSets required for full coverage. The diversity and centrality metrics can indicate the potential of NuevoMatch to accelerate the classification of a rule-set. On the positive side, our iSet partitioning algorithm is effective at segregating the rules that cannot be covered well from the rules that can, thereby accelerating the remainder classifier as much as possible for a given rule-set. We analyze this property in §5.3.3.
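The two metrics can be computed directly. The following sketch (illustrative, not the paper's code) computes diversity for an exact-match field and, for the single-field range case, centrality as the maximum number of ranges sharing a point:

```python
def rule_set_diversity(rules, field):
    """Number of unique exact-match values in a field divided by the number
    of rules; an upper bound on the fraction of rules in that field's
    largest iSet."""
    values = [r[field] for r in rules]
    return len(set(values)) / len(values)

def interval_centrality(ranges):
    """Maximum number of closed integer ranges sharing a single point,
    computed by a sweep over endpoints; a lower bound on the number of
    iSets needed for full coverage of a single-field rule-set."""
    events = []
    for lo, hi in ranges:
        events.append((lo, 1))        # range opens
        events.append((hi + 1, -1))   # range closes just past hi
    depth = best = 0
    for _, delta in sorted(events):
        depth += delta
        best = max(best, depth)
    return best
```

For example, three identical ranges have centrality 3, so at least three iSets are needed to cover them.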
We briefly summarize all the steps of NuevoMatch.
Training: (1) partition the input into iSets and a remainder set; (2) train one RQ-RMI on each iSet; (3) construct an external classifier for the remainder set.
Lookup: (1) query all the RQ-RMIs; (2) query the external classifier; (3) collect all the outputs and return the highest-priority rule.
Figure 7: Impact of updates on throughput over time. An upper bound (in green) is for zero training time.
We explain how NuevoMatch can support updates with limited performance degradation. First, the external classifier used for the remainder must support updates. Among the evaluated external classifiers, only TupleMerge is designed for fast updates. Second, we distinguish four types of updates: (i) a change in the rule action; (ii) rule deletion; (iii) a change in the rule matching set; (iv) rule addition. The first two types are supported without performance degradation and require a lookup followed by an update in the value array. However, if an update modifies a rule's matching set or adds a new rule, it might require modifications to the RQ-RMI model. We currently do not know an algorithmic way to update an RQ-RMI without retraining; therefore, an updated rule is always added to the remainder set. Unfortunately, this design leads to gradual performance degradation, as updates are likely to increase the remainder set. Accordingly, the model is retrained on the updated rule-set, either periodically or when a large performance degradation is detected. Updates occurring while retraining are accommodated in the following batch of updates.
Estimating sustained update rate.
Let r and u be the total number of rules and the number of updates that move a rule to the remainder, respectively; u can be smaller than the real rate of rule updates. We assume that the updates are independent and uniformly distributed among the r rules. For each update, a given rule is modified w.p. (with probability) 1/r. Thus, a rule is not modified in any of the updates w.p. (1 − 1/r)^u ≈ e^(−u/r). The expected number of unmodified rules is r · (1 − 1/r)^u ≈ r · e^(−u/r). Throughput behaves as a weighted average between that of NuevoMatch and the remainder implementation, based on the number of rules in each.
Figure 7 illustrates the throughput over time for different retraining rates given a certain update rate. If retraining is invoked every τ time units, the slower the training process, the worse the performance degradation.
With these update estimates, using the measured speedup as a function of the fraction of the remainder (§5.3.3), NuevoMatch can sustain up to 4K updates per second for 500K rule-sets, yielding about half the speedup of the update-free case, assuming minute-long training. These results indicate the need to speed up training, but we conjecture there might be a more efficient way to perform updates directly in RQ-RMI without complete retraining of all submodels. Accelerating updates is left for future work.
RQ-RMI structure.
The number of stages and the width of each stage depend on the number of rules to index. We increase the width of the last stage from 16 for 10K rules to as much as 512 for 500K. See Table 4 in the Appendix.
Submodel structure.
Each submodel is a fully connected 3-layer neural net with 1 input, 1 output, and 8 neurons in the hidden layer with ReLU activation. This structure affords an efficient vectorized implementation (see below).
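Under the structure described above, a submodel's forward pass is just two small products. A minimal NumPy sketch (with illustrative weight names, not actual trained parameters):

```python
import numpy as np

def submodel_predict(x, w1, b1, w2, b2):
    """Forward pass of a 1 -> 8 (ReLU) -> 1 fully connected submodel.
    With 8 hidden neurons, each layer maps onto a handful of 8-wide
    vector operations (e.g., a single AVX register per layer)."""
    h = np.maximum(0.0, x * w1 + b1)   # hidden layer, shape (8,)
    return float(h @ w2 + b2)          # scalar prediction
```

With identity-like weights (w1 = 1, b1 = 0, w2 = 1/8, b2 = 0), the submodel reproduces its non-negative input, which is a convenient sanity check.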
Training.
We use TensorFlow [1] to train each submodel on a CPU. Training a submodel requires a few seconds, but a whole RQ-RMI may take up to a few minutes (see §5.3.4). We believe, however, that a much faster training time could be achieved with more optimizations, e.g., by replacing TensorFlow (known for its poor performance on small models). We leave this for future work.
iSet partitioning.
We implement the iSet partitioning algorithm in Python. The partitioning takes at most a few seconds and is negligible compared to the RQ-RMI training time.
Inference and secondary search.
We implement RQ-RMI inference in C++. For each iSet, we sort the rules by the value of the respective field to optimize the secondary search. To reduce the number of memory accesses, we pack multiple field values from different rules into the same cache line.
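The secondary search can be sketched as a binary search restricted to the error-bound window around the predicted index. This is a simplified illustration (names are ours), assuming the iSet's rules are sorted by the lower bound of the indexed field:

```python
import bisect

def secondary_search(sorted_los, predicted, err, key):
    """Binary search limited to [predicted - err, predicted + err] over the
    rules' sorted lower bounds. Returns the index of the last rule whose
    lower bound <= key (the candidate whose range may contain the key),
    or None if no candidate lies in the window."""
    lo = max(0, predicted - err)
    hi = min(len(sorted_los), predicted + err + 1)
    i = bisect.bisect_right(sorted_los, key, lo, hi) - 1
    return i if i >= lo else None
```

The returned candidate must still be validated against all the rule's fields, as described in the validation phase.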
Handling long fields.
Both the iSet partitioning algorithm and the RQ-RMI models map the inputs into single-precision floating-point numbers. This allows packing more scalars into vector operations, resulting in faster inference. While sufficient for 32-bit fields, doing so might cause poor performance for 64-bit and 128-bit fields. We compared two solutions: (1) splitting a long field into 32-bit parts and treating each as a distinct field, and (2) using a single-precision floating-point number to express the long field. The two showed similar results for iSet partitioning with MAC addresses, while with IPv6, splitting into multiple fields worked better. Note that both the secondary search and the validation phases are not affected, because the rules are stored with the original fields.
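Splitting a long field into 32-bit parts (option 1 above) can be sketched as:

```python
def split_long_field(value, bits):
    """Split a long field (e.g., a 128-bit IPv6 address) into 32-bit parts,
    most significant first; each part then fits single-precision-friendly
    arithmetic and is treated as a distinct field."""
    return [(value >> shift) & 0xFFFFFFFF
            for shift in range(bits - 32, -1, -32)]
```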
Vectorization.
We accelerate the inference by using wide CPU vector instructions. Specifically, with 8 neurons in the hidden layer of each submodel, computing the prediction involves a handful of vector instructions. Validation is also vectorized. Table 1 shows the effectiveness of vectorization. The use of wider units speeds up inference, highlighting the potential for scaling NuevoMatch in future CPUs.
Parallelization.
NuevoMatch lends itself to parallel execution, where the iSets and the remainder classifier run in parallel on different CPU cores. The system receives the packets and enqueues each for execution into the worker threads. The threads are statically allocated to run the RQ-RMIs or the external classifier with a balanced load between the cores. Note that since RQ-RMIs are small and fit in L1, running them on a separate core enables L1-cache-resident execution even if the remainder classifier is large. Such efficient cache utilization could not have been achieved with other classifiers running on two cores.

Table 1: Submodel acceleration via vectorization. Methods are annotated with the number of floats per single instruction.

Instruction set (width)   Serial (1)   SSE (4)   AVX (8)
Inference time (ns)       126          62        49
Early termination.
One drawback of the parallel implementation is that the slowest thread determines the execution time. Our experiments show that the remainder classifier is the slowest one: it holds only a small fraction of the rules, so it returns an empty set for most of the queries, which in turn leads to the worst-case lookup time. In TupleMerge, for example, a query that does not find any matching rule results in a search over all the tables, whereas in the average case some tables are skipped. Instead, we query the remainder after obtaining the results from the iSets, and terminate the search when we determine that the target rule is not in the remainder. To achieve that, we make minor changes to the existing classification techniques. Specifically, in decision-tree algorithms, we store in each node the maximum priority of all the sub-tree's rules. Whenever we encounter a maximum priority that is lower than the one found in the iSets, we terminate the tree-walk. The changes to the hash-based algorithms are similar. We call this optimization early termination. With this optimization, both the iSets and the remainder are queried on the same core. While a parallel implementation is possible, it incurs higher synchronization overheads among threads.
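The decision-tree variant of early termination can be sketched as follows; the Node layout and the per-node pick function are illustrative, not the actual classifier code:

```python
class Node:
    def __init__(self, max_priority, rules=None, children=None, pick=None):
        self.max_priority = max_priority  # max priority over the subtree's rules
        self.rules = rules or []          # leaf: list of (priority, match_fn)
        self.children = children or []    # internal: child subtrees
        self.pick = pick                  # internal: packet -> child index

def tree_lookup(root, packet, best_priority):
    """Tree-walk with early termination: a subtree whose maximum priority
    cannot beat the best match already found in the iSets is skipped."""
    node = root
    while node.children:
        if node.max_priority <= best_priority:
            return None                   # early termination
        node = node.children[node.pick(packet)]
    best = None
    for prio, match in node.rules:
        if prio > best_priority and match(packet):
            best_priority, best = prio, (prio, match)
    return best
```

If the iSets already produced a match with priority 7 and the remainder subtree's maximum is 5, the walk stops immediately.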
In the evaluation, we pursue the following goals: (1) comparison of NuevoMatch with the state-of-the-art algorithms TupleMerge [3], CutSplit [21], and NeuroCuts [22]; (2) systematic analysis of the performance characteristics, including coverage in challenging data sets, the effect of the RQ-RMI error bound, and training time.
We ran the experiments on an Intel Xeon Silver 4116 @ 2.1 GHz with 12 cores, 32KB L1, 1024KB L2, and 16MB L3 caches, running Ubuntu 16.04 (Linux kernel 4.4.0). We disabled power management for stable measurements.
Evaluated configurations.
CutSplit (cs) is set with binth = 8, as suggested in [21].
For NeuroCuts (nc), we performed a hyperparameter sweep and selected the best classifier per rule-set. As recommended in [22], we focused on top-node partitioning and reward scaling. We ran the search on three 12-core Intel machines, allocating six hours per configuration to converge. In total, we ran nc training for up to 36 hours per rule-set. In addition, we developed a C++ implementation of nc for faster evaluation of the generated classifiers, much faster than the authors' Python-based prototype.
TupleMerge (tm) is used in the version that supports updates, with collision-limit = 40, as suggested in [3].
NuevoMatch (nm) was trained with a maximum error threshold of 64. We present the analysis of the sensitivity to the chosen parameters and training times in §5.3.2.
Multi-core implementation.
We run the parallel implementation on two cores. NuevoMatch allocates one core for the remainder computations and the second for the RQ-RMIs. For cs, nc, and tm, we ran two instances of the algorithm in parallel on two cores using two threads (i.e., no duplication of the rules), splitting the input equally between the cores. We discarded iSets with coverage below 25% for comparisons against cs and nc, and below 5% for comparisons against tm. We used batches of 128 packets to amortize the synchronization overheads. Thus, these algorithms achieve almost linear scaling and the highest possible throughput with perfect load-balancing between the cores.
Single-core implementation.
We used a single core to measure the performance of NuevoMatch with the early termination optimization. For nm, we discarded iSets with coverage below 25%.
For evaluating each classifier, we generated traces with 700K packets. We processed each trace 6 times, using the first five as warmup and measuring the last. We report the average of 15 measurements.
Uniform traffic.
We generate traces that access all matching rules uniformly to evaluate the worst-case memory access pattern.
Skewed traffic.
For each rule-set, we generate traces that follow a Zipf distribution with four different skew parameters, characterized by the amount of traffic that the 3% most frequent flows account for (e.g., the 3% most frequent flows account for 80% of the traffic). This is representative of real traffic, as has been shown in previous works [13, 33]. Additionally, we use a real CAIDA trace from the Equinix datacenter in Chicago [2]. As CAIDA does not publish the rules used to process the packets, we modify the packet headers in the trace to match each evaluated rule-set as follows. For each rule, we generate one matching five-tuple. Then, for each packet in CAIDA, we replace the original five-tuple with a random five-tuple generated from the rule-set, while maintaining a consistent mapping between the original and the generated one. Note that the rule-set access locality of the generated trace is the same as, or higher than, that of the original trace.
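A skewed trace of the kind described above can be sketched as follows; this is an illustrative generator (rank-frequency p(k) ∝ 1/k^alpha), not the one used in the evaluation:

```python
import random

def zipf_trace(num_rules, length, alpha, seed=0):
    """Generate a trace of rule indices whose access frequencies follow a
    Zipf distribution with skew parameter alpha; rank 0 is the most
    frequent rule. Deterministic for a given seed."""
    rng = random.Random(seed)
    weights = [1.0 / (k ** alpha) for k in range(1, num_rules + 1)]
    return rng.choices(range(num_rules), weights=weights, k=length)
```

Larger alpha concentrates more of the trace on the top-ranked rules, which is how the "80% of traffic in 3% of flows" style of skew arises.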
ClassBench rules.
ClassBench [39] is a standard benchmark broadly used for evaluating packet classification algorithms [3, 16, 21, 22, 28, 41, 44]. It creates rule-sets that correspond to the rule distribution of three different applications: Access Control List (ACL), Firewall (FW), and IP Chain (IPC). We created rule-sets of sizes 500K, 100K, 10K, and 1K, each with 12 distinct applications, all with 5-field rules: source and destination IP, source and destination port, and protocol.

Figure 8: ClassBench: NuevoMatch vs. CutSplit, NeuroCuts, and TupleMerge, using two CPU cores. (See rule-sets in the Appendix.)

Real-world rules.
We used the Stanford Backbone dataset, which contains a large enterprise network configuration [46]. There are four IP forwarding rule-sets with roughly 180K single-field rules each (i.e., destination IP address).
For a fair comparison, NuevoMatch used the same algorithm for both the remainder classifier and the baseline. For example, we evaluated the speedup produced by NuevoMatch over cs while also using cs to index the remainder set. We present the results for random packet traces, followed by skewed and CAIDA traces.
Large rule-sets: ClassBench: multi-core.
Figure 8 shows that, on the largest rule-sets (500K), the parallel implementation of NuevoMatch achieves a geometric mean factor of 2.7×, 4.4×, and 2.6× lower latency and 1.3×, 2.2×, and 1.2× higher throughput over cs, nc, and tm, respectively. For the classifiers with 100K rules, the gains are lower but still significant: 2.0×, 3.6×, and 2.6× lower latency and 1.0×, 1.7×, and 1.2× higher throughput over cs, nc, and tm, respectively. The performance varies among rule-sets, e.g., some classifiers are up to 1.8× faster than cs for 100K inputs.
Large rule-sets: ClassBench: single core.
Figure 9 shows the throughput speedup of nm compared to cs, nc, and tm. For 500K rule-sets, NuevoMatch achieves a geometric mean improvement of 2.4×, 2.6×, and 1.6× in throughput compared to cs, nc, and tm, respectively. For the single-core execution, the latency and throughput speedups are the same.
Large rule-sets: Stanford backbone: multi-core.
Figure 10 shows the speedup of nm over tm for the real-world Stanford backbone dataset with 4 rule-sets. nm achieves over 3× higher throughput and over 7× lower latency over tm on all four rule-sets.
Small rule-sets: multi-core.
For rule-sets with 1K and 10K rules, NuevoMatch results in the same or lower throughput, and 2.2× and 1.9× better latency on average compared to cs and tm. The lower speedup is expected, as both cs and tm fit into L1 (§5.2.1), so nm does not benefit from a reduced memory footprint while adding computational overheads. See the Appendix for the detailed chart.

Figure 10: End-to-end performance on real Stanford backbone data sets.
The cs results are averaged over three rule-sets of 1K and six rule-sets of 10K. In the remaining rule-sets, NuevoMatch did not produce large-enough iSets to accelerate the remainder. Note, however, that it promptly identifies the rule-sets expected to be slow and falls back to the original classifier.
The source of speedups.
The ability to compress the rule-set to fit into faster memory while retaining fast lookup is the key factor underlying the performance benefits of NuevoMatch. To illustrate it, we take a closer look at the performance. We evaluate tm with and without nm acceleration as a function of its memory footprint on ClassBench-generated 1K, 10K, 100K, and 500K rule-sets for one application (ACL). Figure 11 shows that the performance of tm degrades as the number of rules grows, causing the hash tables to spill out of the L1 and L2 caches. nm compresses a large part of the rule-set (see coverage annotations), thereby making the remainder index small enough to fit in the L1 cache, and gaining back throughput equivalent to tm's on small rule-sets.
ClassBench: Skewed traffic.
Figure 12 shows the evaluation of the early termination implementation on skewed packet traces. We report the throughput speedup of nm compared to cs and tm; the results for nc are similar to those of cs. We perform 6000 experiments using 25 traces per rule-set: five traces per Zipf distribution plus five modified CAIDA traces. We evaluate over twelve 500K rule-sets and report the geometric mean. Additionally, we evaluate the CAIDA traces in two settings. First, the classifier runs with access to the entire 16MB of the L3 cache (denoted as CAIDA). Second, the classifier's use of L3 is restricted to 1.5MB via Intel's Cache Allocation Technology, emulating a multi-tenant setting (denoted as CAIDA*).

Figure 9: ClassBench: NuevoMatch vs. CutSplit, NeuroCuts, and TupleMerge, using a single CPU core.

Figure 11: Throughput vs. number of rules for TupleMerge and NuevoMatch. Annotations are coverage (%) and index memory size in KB (remainder : total).

Figure 12: ClassBench: NuevoMatch vs. CutSplit and TupleMerge with skewed traffic.

NuevoMatch is significantly faster than cs, but its benefits over tm diminish for workloads with higher skews. Yet, the speedups are more pronounced under the smaller L3 allocation. Overall, we observe lower speedups for the skewed traffic than for the random trace. This is not surprising, as skewed traces induce a higher cache hit rate for all the methods, which in turn reduces the performance gains of nm over both cs and tm, similar to the case of small rule-sets. Nevertheless, it is worth noting that classification algorithms are usually applied alongside caching mechanisms that capture the packets' temporal locality. For instance, Open vSwitch applies caching for the most frequently used rules and invokes Tuple Space Search upon cache misses [30]. Therefore, if NuevoMatch is applied at this stage, we expect it to yield performance gains equivalent to those reported for unskewed workloads. Open vSwitch integration is the goal of our ongoing work.

Figure 13 compares the memory footprint of the classifiers without and with NuevoMatch (the two right-most bars in each bar cluster). We use the same number of iSets as in the end-to-end experiments. Note that a smaller footprint alone does not necessarily lead to higher performance if more iSets are used. Therefore, the results should be considered in conjunction with the end-to-end performance. The memory footprint includes only the index data structures but not the rules themselves. In particular, the memory footprint for NuevoMatch includes both the RQ-RMI models and the remainder classifier. Each bar is the average of all the 12 application rule-sets of the same size. For nm we show both the remainder index size (middle bar) and the total RQ-RMI size (right-most bar). Note that due to the logarithmic scale of the Y axis, the actual ratio between the two is much higher than it might seem. For example, the remainder for 10K tm is almost 100× the size of the RQ-RMI. Note also that since we run nm on two cores, both the RQ-RMI and the remainder classifier use their own CPU caches. Overall, NuevoMatch enables dramatic compression of the memory footprint, in particular for 500K rule-sets, with 4.9×, 8×, and 82× on average over cs, nc, and tm, respectively. The graph explains the end-to-end performance results well. For 1K rule-sets, the original classifiers fit into the L1 cache, so nm is not effective. For 10K sets, even though the remainder index fits in L1, the ratio between L1 and L2 performance is insufficient to cover the RQ-RMI overheads. For 100K, the situation is similar for cs; however, for nc, the remainder fits in L1, whereas the original nc spills to L3. For tm, the remainder is already in L2, yielding a lower overall speedup compared to nc. Last, for 500K rule-sets, all the original classifiers spill to L3, whereas the remainder fits well in L2, yielding clear performance improvements.
Performance under L3 cache contention.
The small memory footprint of nm plays an important role even when the rule index fits in the L3 cache (16MB in our machine). L3 is shared among all the CPU cores; therefore, cache contention is not rare, in particular in data centers. nm reduces the effects of L3 cache contention on packet classification performance. In this experiment, we use the 500K rule-set (1) and compare the performance of cs and nm (with cs) while limiting the L3 to 1.5MB. cs loses half of its performance, whereas nm slows down by only 30%, increasing the original speedup.
Figure 13: Memory size for CutSplit, NeuroCuts, TupleMerge vs. NuevoMatch with them indexing the remainder. Each bar is a geometric mean of 12 applications.

Table 2: iSet coverage.
Figure 14: Coverage and execution time breakdown of NuevoMatch vs. varying number of iSets.
Table 2 shows the cumulative coverage achieved with up to 4 iSets, averaged over 12 rule-sets (ClassBench) of the same size. The coverage of smaller rule-sets is worse on average but improves with the size of the rule-set. The last row shows a representative result for the Stanford backbone rule-set (the other three differ within 1%). Two iSets are enough to achieve 90% coverage, and three are needed for 95%. This data set differs from ClassBench in that it contains only one field, providing fewer opportunities for iSet partitioning.
We seek to understand the tradeoff between the iSet coverage of the rule-set and the computational overheads of adding more RQ-RMIs. All computations were performed on a single core to obtain the latency breakdown. We use cs for indexing the remainder.

Table 3: Throughput and a single iSet coverage vs. the fraction of low-diversity rules in a 500K rule-set.

% Low-diversity rules   % Coverage   Speedup (throughput)
70%                     25%          1.07×
50%                     50%          1.14×
30%                     70%          1.60×

Figure 14 shows the geometric mean of the coverage and the runtime breakdown over 12 rule-sets of 500K. The breakdown includes the runtime of the remainder classifier, validation, secondary search, and RQ-RMI inference. Zero iSets means that cs was run alone. Adding more iSets shows diminishing returns because of their compute overhead, which is not compensated by the remainder runtime improvements once the coverage saturates at almost 100%. Using one or two iSets shows the best trade-off. nc shows similar results. tm behaved differently (not shown): tm occupies much more memory than cs; therefore, using more iSets to achieve higher coverage allowed us to further speed up the remainder by fitting it into an upper-level cache. Thus, 4 iSets was the best configuration. We note that the runtime is split nearly equally between model inference and validation (which are compute-bound) and the secondary search and the remainder computations (which are memory-bound). We expect the compute performance of future processors to scale better than their cache capacity and memory access latency. Therefore, we believe nm will scale better than memory-bound state-of-the-art classifiers.
We seek to understand how low-diversity rule-sets affect NuevoMatch. To analyze that, we synthetically generated a large rule-set as a Cartesian product of a small number of values per field (no ranges). We blended them into a 500K ClassBench rule-set, replacing randomly selected rules with those from the Cartesian product, while keeping the total number of rules the same. Table 3 shows the coverage and the speedup over tm on the resulting mixed rule-sets for different fractions of low-diversity rules. The partitioning algorithm successfully segregates the low-diversity rules, achieving coverage inversely proportional to their fraction in the rule-set. Note that NuevoMatch becomes effective when it offloads the processing of about 25% of the rules.
RQ-RMIs are trained to minimize the prediction error bound in order to achieve a small secondary search distance. Recall that a secondary search involves a binary search within the error bound, where each rule is validated to match all the fields. The tradeoff between training time and secondary search performance is not trivial. A larger search distance enables faster training but slows down the secondary search. A smaller search distance results in a faster search but slows down the training. In extreme cases, the training does not converge, since a higher precision might require larger submodels. However, increasing the size of the submodels leads to a larger memory footprint and longer computations.

Figure 15: RQ-RMI training time in minutes vs. maximum search range bound.
Figure 15 shows the average end-to-end training time in minutes of 500 models as a function of the secondary search distance and the rule-set size. The measurements include all training iterations, as described in §3.5. As mentioned (§4), our training implementation can be dramatically accelerated, so the results here indicate the general trend. Training with a bound of 64 is expensive, but is it really necessary? To answer this, we evaluate the performance impact of the search distance on the secondary search time. We measure 40ns for retrieving a rule with a precise prediction (no search). For distances of 64, 128, and 256, the search time varies between 75 and 80ns thanks to the binary search. Last, it turns out that the actual search distance from the predicted index is often much smaller than the worst-case one enforced in training. Our analysis shows that, in practice, training with a relatively large bound of 128 leads to 80% of the lookups having a search distance of at most 64, and 60% at most 32. We conclude that training with larger bounds is likely to have a minor effect on the end-to-end performance while significantly accelerating training. This property is important for supporting more frequent retraining and faster updates (§3.9).
Adding fields to an existing classifier will not harm its coverage, so it will not affect the RQ-RMI performance. Nonetheless, more fields will increase the validation time. Unfortunately, we did not find public rule-sets with a large number of fields. Thus, we ran a microbenchmark, increasing the number of fields and measuring the performance of the validation stage. As expected, we observed almost linear growth in the validation time, from 25ns for one field to 180ns for 40 fields.
Hardware-based classifiers.
Hardware-based solutions for classification, such as TCAMs and FPGAs, achieve very high throughput [6, 35]. Consequently, many software algorithms take advantage of them, further improving classification performance [13, 20, 23, 24, 28, 32, 37]. Our work is complementary and can be used to improve the scaling of these solutions. For example, if the original classifier required a large TCAM, the remainder set would fit a much smaller TCAM.
GPUs for classification.
Accelerating classification on GPUs has been suggested by numerous works. PacketShader [10] uses the GPU for packet forwarding and provides integration with Open vSwitch. However, packet forwarding is a single-dimensional problem, so it is easier than multi-field classification [9]. Varvello et al. [42] implemented various packet classification algorithms on GPUs, including linear search, Tuple Space Search, and bloom search. Nonetheless, these techniques suffer from poor scalability for large classifiers with wildcard rules, which NuevoMatch aims to alleviate.
ML techniques for networking.
Recent works suggest using ML techniques for solving networking problems, such as TCP congestion control [4, 12, 45], resource management [25], quality of experience in video streaming [26, 43], routing [40], and decision-tree optimization for packet classification [22]. NuevoMatch is different in that it uses an ML technique for building space-efficient representations of the rules that fit in the CPU cache.
We have presented NuevoMatch, the first packet classification technique that uses the Range-Query RMI machine learning model to accelerate packet classification. We have shown an efficient way of training RQ-RMI models, making them learn the matching ranges of large rule-sets via sampling and analytical error-bound computations. We demonstrated the application of RQ-RMI to multi-field packet classification using rule-set partitioning. We evaluated NuevoMatch on synthetic and real-world rule-sets and confirmed its benefits for large rule-sets over state-of-the-art techniques. NuevoMatch introduces a new point in the design space of packet classification algorithms and opens up new ways to scale it on commodity processors. We believe that its compute-bound nature and use of neural networks will enable further scaling with future CPU generations, which will feature powerful compute capabilities targeting faster execution of neural-network-related computations.
We thank the anonymous reviewers of SIGCOMM'20 and our shepherd Minlan Yu for their helpful comments and feedback. We would also like to thank Isaac Keslassy and Leonid Ryzhyk for their feedback on the early draft of the paper. This work was partially supported by the Technion Hiroshi Fujiwara Cyber Security Research Center and the Israel National Cyber Directorate, by the Alon fellowship, and by the Taub Family Foundation. We gratefully acknowledge support from the Israel Science Foundation (Grant 1027/18) and the Israeli Innovation Authority.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In USENIX OSDI.
[2] CAIDA. [n.d.]. The CAIDA UCSD Anonymized Internet Traces 2019.
[3] James Daly, Valerio Bruschi, Leonardo Linguaglossa, Salvatore Pontarelli, Dario Rossi, Jerome Tollet, Eric Torng, and Andrew Yourtchenko. 2019. TupleMerge: Fast Software Packet Processing for Online Packet Classification. IEEE/ACM Transactions on Networking (TON) 27, 4 (2019), 1417–1431.
[4] Mo Dong, Tong Meng, Doron Zarchy, Engin Arslan, Yossi Gilad, Brighten Godfrey, and Michael Schapira. 2018. PCC Vivace: Online-Learning Congestion Control. In USENIX NSDI.
[5] Daniel Firestone. 2017. VFP: A Virtual Switch Platform for Host SDN in the Public Cloud. In USENIX NSDI.
[6] Daniel Firestone, Andrew Putnam, Sambrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian M. Caulfield, Eric S. Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert G. Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In USENIX NSDI.
[7] Pankaj Gupta and Nick McKeown. 1999. Packet Classification on Multiple Fields. In ACM SIGCOMM.
[8] Pankaj Gupta and Nick McKeown. 2000. Classifying Packets with Hierarchical Intelligent Cuttings. IEEE Micro 20, 1 (2000), 34–41.
[9] Pankaj Gupta and Nick McKeown. 2001. Algorithms for Packet Classification. IEEE Network 15, 2 (2001), 24–32.
[10] Sangjin Han, Keon Jang, KyoungSoo Park, and Sue Moon. 2010. PacketShader: A GPU-Accelerated Software Router. In ACM SIGCOMM.
[11] Intel. 2019. Intel Nervana Neural Network Processors.
[12] … arXiv preprint arXiv:1810.03259 (2018).
[13] Naga Praveen Katta, Omid Alipourfard, Jennifer Rexford, and David Walker. 2016. CacheFlow: Dependency-Aware Rule-Caching for Software-Defined Networks. In ACM SOSR.
[14] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 (2014).
[15] Jon M. Kleinberg and Éva Tardos. 2006. Algorithm Design. Addison-Wesley, 116–125.
[16] Kirill Kogan, Sergey Nikolenko, Ori Rottenstreich, William Culhane, and Patrick Eugster. 2014. SAX-PAC (Scalable and Expressive Packet Classification). In ACM SIGCOMM.
[17] Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H. Chi, Jialin Ding, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. SageDB: A Learned Database System.
[18] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The Case for Learned Index Structures. In ACM SIGMOD.
[19] Habana Labs. 2019. Habana AI Processors. Retrieved September 25, 2019 from https://habana.ai/product
[20] Karthik Lakshminarayanan, Anand Rangarajan, and Srinivasan Venkatachary. 2005. Algorithms for Advanced Packet Classification with Ternary CAMs. In ACM SIGCOMM.
[21] Wenjun Li, Xianfeng Li, Hui Li, and Gaogang Xie. 2018. CutSplit: A Decision-Tree Combining Cutting and Splitting for Scalable Packet Classification. In IEEE INFOCOM.
[22] Eric Liang, Hang Zhu, Xin Jin, and Ion Stoica. 2019. Neural Packet Classification. In ACM SIGCOMM.
[23] Alex X. Liu, Chad R. Meiners, and Yun Zhou. 2008. All-Match Based Complete Redundancy Removal for Packet Classifiers in TCAMs. In IEEE INFOCOM.
[24] Yadi Ma and Suman Banerjee. 2012. A Smart Pre-Classifier to Reduce Power Consumption of TCAMs for Multi-Dimensional Packet Classification. In ACM SIGCOMM.
[25] Hongzi Mao, Mohammad Alizadeh, Ishai Menache, and Srikanth Kandula. 2016. Resource Management with Deep Reinforcement Learning. In ACM SIGCOMM HotNets Workshop.
[26] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. 2017. Neural Adaptive Video Streaming with Pensieve. In ACM SIGCOMM.
[27] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. 2008. OpenFlow: Enabling Innovation in Campus Networks. ACM SIGCOMM CCR 38, 2 (2008), 69–74.
[28] Nina Narodytska, Leonid Ryzhyk, Igor Ganichev, and Soner Sevinc. 2019. BDD-Based Algorithms for Packet Classification. In Formal Methods in Computer Aided Design (FMCAD).
[29] Nvidia. 2019. Nvidia Deep Learning Inference Platform.
[30] … In USENIX NSDI.
[31] Alon Rashelbach. 2020. NuevoMatch source code. Retrieved June 21, 2020 from https://github.com/acsl-technion/nuevomatch
[32] Ori Rottenstreich and János Tapolcai. 2015. Lossy Compression of Packet Classifiers. In ACM/IEEE ANCS.
[33] Nadi Sarrar, Steve Uhlig, Anja Feldmann, Rob Sherwood, and Xin Huang. 2012. Leveraging Zipf's Law for Traffic Offloading. Computer Communication Review 42, 1 (2012), 16–22.
[34] Sumeet Singh, Florin Baboescu, George Varghese, and Jia Wang. 2003. Packet Classification Using Multidimensional Cutting. In ACM SIGCOMM.
[35] Ed Spitznagel, David E. Taylor, and Jonathan S. Turner. 2003. Packet Classification Using Extended TCAMs. In IEEE ICNP.
[36] Venkatachary Srinivasan, Subhash Suri, and George Varghese. 1999. Packet Classification Using Tuple Space Search. In ACM SIGCOMM.
[37] David E. Taylor. 2005. Survey and Taxonomy of Packet Classification Techniques. ACM Computing Surveys (CSUR) 37, 3 (2005), 238–275.
[38] David E. Taylor and Jonathan S. Turner. 2005. Scalable Packet Classification Using Distributed Crossproducing of Field Labels. In IEEE INFOCOM.
[39] David E. Taylor and Jonathan S. Turner. 2007. ClassBench: A Packet Classification Benchmark. IEEE/ACM Transactions on Networking (TON) 15, 3 (2007), 499–511.
[40] Asaf Valadarsky, Michael Schapira, Dafna Shahaf, and Aviv Tamar. 2017. Learning to Route with Deep RL. In NIPS Deep Reinforcement Learning Symposium.
[41] Balajee Vamanan, Gwendolyn Voskuilen, and T. N. Vijaykumar. 2010. EffiCuts: Optimizing Packet Classification for Memory and Throughput. In ACM SIGCOMM.
[42] Matteo Varvello, Rafael Laufer, Feixiong Zhang, and T. V. Lakshman. 2016. Multilayer Packet Classification with Graphics Processing Units. IEEE/ACM Transactions on Networking (TON) 24, 5 (2016), 2728–2741.
[43] Hyunho Yeo, Youngmok Jung, Jaehong Kim, Jinwoo Shin, and Dongsu Han. 2018. Neural Adaptive Content-Aware Internet Video Delivery. In USENIX OSDI.
[44] Sorrachai Yingchareonthawornchai, James Daly, Alex X. Liu, and Eric Torng. 2018. A Sorted-Partitioning Approach to Fast and Scalable Dynamic Packet Classification. IEEE/ACM Transactions on Networking (TON) 26, 4 (2018), 1907–1920.
[45] Yasir Zaki, Thomas Pötsch, Jay Chen, Lakshminarayanan Subramanian, and Carmelita Görg. 2015. Adaptive Congestion Control for Unpredictable Cellular Networks. In ACM SIGCOMM.
[46] Hongyi Zeng, Peyman Kazemian, George Varghese, and Nick McKeown. 2012. Automatic Test Packet Generation. In ACM CoNEXT.

Appendices are supporting material that has not been peer-reviewed.
A RQ-RMI CORRECTNESS
A.1 Responsibility of a submodel

Denote the input domain of an RQ-RMI model as D ⊂ R and its number of stages as n.

Theorem A.1 (Responsibility Theorem). Let s_i be a trained stage such that i < n − 1. The responsibilities of the submodels in s_{i+1} can be calculated by evaluating a finite set of inputs over the stage s_i.

The intuition behind Theorem A.1 is based on Corollary 3.2, namely that submodels output piecewise linear functions. Proving it requires some additional definitions.
Definition A.2 (Stage Output). The output of stage s_i is defined for x ∈ D as S_i(x) = M_{i, f_i(x)}(x), where f_i(x) is the index of the submodel in s_i that is responsible for the input x, defined as

f_i(x) = 0 for i = 0, and f_i(x) = ⌊S_{i−1}(x) · W_i⌋ for i ∈ {1, 2, ..., n−1}.

Definition A.3 (Submodel Responsibility). The responsibility of a submodel m_{i,j} is defined as

R_{i,j} = D for i = 0, and R_{i,j} = { x | f_i(x) = j } for i ∈ {1, 2, ..., n−1}.

Note that the responsibilities of every two submodels in the same stage are disjoint.
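The recursive definitions of f_i and S_i above can be sketched directly as code. The two-stage toy model below is hypothetical, and the clamping of the index to a valid range is an assumption for illustration; it is not part of the definitions.

```python
import math

def make_rqrmi_eval(stages, widths):
    """stages[i][j] is submodel M_{i,j}: a map from an input x to [0, 1).
    widths[i] is W_i, the number of submodels in stage i."""
    def f(i, x):
        # Definition A.2: f_0(x) = 0; otherwise f_i(x) = floor(S_{i-1}(x) * W_i).
        if i == 0:
            return 0
        idx = math.floor(S(i - 1, x) * widths[i])
        return min(max(idx, 0), widths[i] - 1)  # clamp to a valid submodel index

    def S(i, x):
        # Stage output (Definition A.2): the responsible submodel's output.
        return stages[i][f(i, x)](x)

    return f, S

# Hypothetical two-stage model over D = [0, 1): one root, four leaf submodels.
stages = [
    [lambda x: x],                      # stage 0: a single identity submodel
    [lambda x: x for _ in range(4)],    # stage 1: four identity submodels
]
f, S = make_rqrmi_eval(stages, widths=[1, 4])
print(f(1, 0.6))   # floor(0.6 * 4) = 2: submodel 2 is responsible for 0.6
```

With identity submodels, the stage-1 responsibilities R_{1,j} partition [0, 1) into four equal quarters, matching the disjointness noted above.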
Definition A.4 (Left and Right Slopes). For a range R, if the points min_{x∈R} x or max_{x∈R} x are defined, we refer to them as the boundaries of the range; we refer to all other points as internal points of the range. For a piecewise linear function defined over some range R and every internal point x ∈ R, there exists δ > 0 such that the function is linear over (x − δ, x) and over (x, x + δ). Accordingly, we can refer to the left slope and the right slope of a point, defined as those of the two linear functions.

[Figure 16: Illustration of the trigger inputs (g_1, ..., g_6) and the transition inputs (t_1, ..., t_5) for the graph M_{i,j}(x) of a submodel m_{i,j}. Note that W_{i+1}, namely the number of submodels in stage i + 1, affects the transition inputs of m_{i,j} and equals 4 in the illustration.]

Definition A.5 (Trigger Inputs).
We say that an input g ∈ D is a trigger input of a submodel m_{i,j} if one of the following holds: (i) g is a boundary point of D (namely, g = min_{y∈D} y or g = max_{y∈D} y); (ii) g is an internal point of D and the left and right slopes of M_{i,j} at g differ.

Definition A.6 (Transition Inputs). We say that an input t ∈ D is a transition input of a submodel m_{i,j} if it changes the submodel selection in the following stage. Formally, there exists ε > 0 such that for all 0 < δ < ε:

⌊M_{i,j}(t − δ) · W_{i+1}⌋ ≠ ⌊M_{i,j}(t + δ) · W_{i+1}⌋

Definition A.7 (The Function B_i(x)). We define the function B_i for i ∈ {0, 1, ..., n−1}. B_i is a staircase function with values in [0, W_{i+1} − 1], defined as B_i(x) = ⌊x · W_{i+1}⌋ for x ∈ [0, 1).

For a submodel m_{i,j}, we term the set of its trigger inputs G_{i,j} and the set of its transition inputs T_{i,j}. See Figure 16 for an illustration. From the submodel definition and Corollary 3.2, we can tell that a submodel's ReLU operations determine its trigger inputs. Consequently, any set of trigger inputs is finite and can be calculated using a few linear equations. Nonetheless, calculating the transition inputs of a submodel is not straightforward. We show a fast and efficient way of doing so in the following lemma:

Lemma A.8.
Let m_{i,j} be an RQ-RMI submodel, and let a < b ∈ G_{i,j} be two adjacent trigger inputs of m_{i,j}. Then the set S = [a, b] ∩ T_{i,j} is finite and can be calculated using the inputs a and b alone.

Proof. We divide the construction of S into two subsets, S = S_1 ∪ S_2. First we handle S_1. For each x ∈ {a, b}, x ∈ S_1 if and only if there exists ε > 0 such that for all 0 < δ < ε:

B_i(M_{i,j}(x − δ)) ≠ B_i(M_{i,j}(x + δ))

Now to S_2. Without loss of generality, M_{i,j}(a) ≤ M_{i,j}(b). From Corollary 3.2 and Definition A.5, M_{i,j} is linear in [a, b]. If B_i(M_{i,j}(a)) = B_i(M_{i,j}(b)), then S_2 = ∅. Otherwise, M_{i,j}(a) ≠ M_{i,j}(b). B_i(x) outputs discrete values between B_i(M_{i,j}(a)) and B_i(M_{i,j}(b)) for all x ∈ (a, b). Denote this finite set of discrete values as M. For any y ∈ M there exists a value d ∈ (a, b] such that M_{i,j}(d) · W_{i+1} = y. By the linearity of M_{i,j} in [a, b]:

d = (y / W_{i+1} − M_{i,j}(a)) · (b − a) / (M_{i,j}(b) − M_{i,j}(a)) + a

We construct S_2 as follows:

S_2 = { (y / W_{i+1} − M_{i,j}(a)) · (b − a) / (M_{i,j}(b) − M_{i,j}(a)) + a | y ∈ M } ∎

Corollary A.9.
The set of transition inputs T_{i,j} can be calculated using G_{i,j}, and its size is bounded such that |T_{i,j}| ≤ W_{i+1} · |G_{i,j}|.

Not all transition inputs of all submodels are reachable, as some exist outside of their corresponding submodel's responsibility. Therefore, we define the set of reachable transition inputs of a stage s_i as the transition set of that stage:

Definition A.10 (Transition Set). The transition set U_i of a stage s_i is an ordered set, defined as:

U_i = {min(D)} ∪ ( ⋃_{j=0}^{W_i − 1} (T_{i,j} ∩ R_{i,j}) ) ∪ {max(D)}

The proof of Theorem A.1 directly follows from the next two lemmas:
Lemma A.11. Let s_i, s_{i+1} be two adjacent stages. For any two adjacent values u_1 < u_2 ∈ U_i, there exists a submodel m_{i+1,j} such that S_{i+1}(x) is piecewise linear and equal to M_{i+1,j}(x) for all x ∈ (u_1, u_2).

Proof. We show that there exists a submodel m_{i+1,j} such that any x ∈ (u_1, u_2) satisfies x ∈ R_{i+1,j}, which implies f_{i+1}(x) = j and so S_{i+1}(x) = M_{i+1,j}(x). By Corollary 3.2, S_{i+1} is then piecewise linear for all x ∈ (u_1, u_2).

Let x < y ∈ (u_1, u_2). Assume by contradiction that there exist two submodels m_{i+1,j_1} and m_{i+1,j_2} such that x ∈ R_{i+1,j_1} and y ∈ R_{i+1,j_2}. From Definition A.3, f_{i+1}(x) ≠ f_{i+1}(y), which implies B_i(S_i(x)) ≠ B_i(S_i(y)). Thus, there exists an input z ∈ (x, y] and ε > 0 such that for all 0 < δ < ε: B_i(S_i(z − δ)) ≠ B_i(S_i(z + δ)). Since S_i consists of the outputs of submodels in s_i, there exists a submodel m_{i,k} such that S_i(z) = M_{i,k}(z). Therefore, z ∈ T_{i,k} and z ∈ R_{i,k}, which means z ∈ U_i, in contradiction to the definition of u_1 and u_2. ∎

Lemma A.12.
Let s_i be an RQ-RMI stage such that i ∈ {0, 1, ..., n − 2}. The function f_{i+1}, defined over the domain D, can be calculated using the inputs U_i over S_i.

Proof. Let u_1 < u_2 ∈ U_i be two adjacent values. By Lemma A.11 there exists a submodel m_{i+1,j} such that S_{i+1}(x) = M_{i+1,j}(x) for all x ∈ (u_1, u_2). From Definition A.2, f_{i+1}(x) = j for all x ∈ (u_1, u_2). By calculating B_i(S_i(u_1)) and B_i(S_i(u_2)), f_{i+1}(x) is known for all x ∈ [u_1, u_2]. Since min{D} ∈ U_i and max{D} ∈ U_i, f_{i+1}(x) is known for all x ∈ D. ∎
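The closed form for S_2 in the proof of Lemma A.8 can be sketched as a short helper. This is an illustrative sketch, not the paper's implementation: it assumes the submodel M is linear on [a, b] (as it is between adjacent trigger inputs), and it computes only the interior subset S_2, omitting the boundary subset S_1.

```python
import math

def transition_inputs(M, a, b, W_next, ):
    """Candidate transition inputs of a submodel M that is linear on [a, b]
    (a, b adjacent trigger inputs), for a next stage with W_next submodels.
    Implements the set S_2 from the proof of Lemma A.8."""
    ya, yb = M(a), M(b)
    if ya > yb:                       # WLOG M(a) <= M(b); swap endpoints
        a, b, ya, yb = b, a, yb, ya
    lo, hi = math.floor(ya * W_next), math.floor(yb * W_next)
    if lo == hi or ya == yb:
        return []                     # B_i is constant on [a, b]: S_2 is empty
    out = []
    for y in range(lo + 1, hi + 1):   # each staircase level crossed in (a, b]
        # Solve M(d) * W_next = y for d, using linearity of M on [a, b].
        d = (y / W_next - ya) * (b - a) / (yb - ya) + a
        out.append(d)
    return out

# Hypothetical linear segment: M(x) = x on [0, 0.999], next stage of width 4.
print(transition_inputs(lambda x: x, 0.0, 0.999, 4))  # crossings near 1/4, 1/2, 3/4
```

Consistent with Corollary A.9, the helper returns at most W_next candidates per pair of adjacent trigger inputs.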
[Figure 17: A detailed version of end-to-end performance for small rule-sets (1K and 10K rules). Speedup in throughput and latency of NuevoMatch against stand-alone versions of CutSplit and TupleMerge. Classifiers with no valid iSets are not displayed.]
A.2 Submodel prediction error
Theorem A.13 (Submodel Prediction Error).
Let s_{n−1} be the last stage of an RQ-RMI model. The maximum prediction error of any submodel in s_{n−1} can be calculated using a finite set of inputs over the stage s_{n−1}.

The intuition behind Theorem A.13 is to address the set of range-value pairs as an additional, virtual, stage of the model.
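This virtual-stage view can be made concrete with a short sketch: the W_n range-value pairs are indexed by f_n(x) = ⌊S_{n−1}(x) · W_n⌋, exactly like a stage of submodels. The last-stage output below is a made-up identity function, and the grid-based responsibility estimate is an illustration only; the appendix computes responsibilities analytically, not by sampling.

```python
import math

def f_n(S_last, W_n, x):
    """Index of the range-value pair predicted for input x."""
    v = math.floor(S_last(x) * W_n)
    return min(max(v, 0), W_n - 1)   # clamp to a valid pair index (assumption)

def pair_responsibilities(S_last, W_n, grid):
    """Approximate each pair's responsibility R_p = {x | f_n(x) = v} by
    evaluating a sample grid of inputs."""
    resp = {v: [] for v in range(W_n)}
    for x in grid:
        resp[f_n(S_last, W_n, x)].append(x)
    return resp

# Hypothetical last stage: identity over D = [0, 1), indexing W_n = 4 pairs.
grid = [i / 100 for i in range(100)]
resp = pair_responsibilities(lambda x: x, 4, grid)
print(sorted(len(v) for v in resp.values()))   # each pair claims 25 of 100 samples
```

With a perfect last stage, each pair's responsibility coincides with its true range; the misclassified sets of Definition A.15 below capture exactly the inputs where the two differ.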
Definition A.14 (Range-Value Pair). A range-value pair ⟨r, v⟩ is defined such that r is an interval in D and v ∈ {0, 1, 2, ...} is unique to that pair.

We term W_n the number of range-value pairs an RQ-RMI model should index. Similar to the definitions for submodels, we extend f such that f_n(x) = ⌊S_{n−1}(x) · W_n⌋, and say that the responsibility R_p of a pair p = ⟨r, v⟩ is the set of inputs {x | f_n(x) = v}. Consequently, we make the following two observations. First, all inputs in the range r \ R_p should have reached p but did not. Second, all inputs in the range R_p \ r did reach p but should not.

Definition A.15 (Misclassified Pair Set).
Let m be a submodel in s_{n−1} with responsibility R_m. Denote by P_m the set of all pairs such that a pair p = ⟨r, v⟩ ∈ P_m satisfies ((r \ R_p) ∪ (R_p \ r)) ∩ R_m ≠ ∅. In other words, P_m holds all pairs that are misclassified by m, and is termed the misclassified pair set of m.

Definition A.16 (Maximum Prediction Error). Let m be a submodel in s_{n−1} with responsibility R_m and misclassified pair set P_m. The maximum prediction error of m is defined as:

max { |f_n(x) − v| : ⟨r, v⟩ ∈ P_m, x ∈ R_m }

Lemma A.17.
The misclassified pair sets of all submodels in s_{n−1} can be calculated using U_{n−2} over S_{n−1}.

Proof. Let q_1 < q_2 be two adjacent values in U_{n−2}. From Lemma A.11 there exists a single submodel m_{n−1,j}, j ∈ [0, W_{n−1}), such that S_{n−1}(x) = M_{n−1,j}(x) for all x ∈ (q_1, q_2). Hence, using Corollary 3.2, S_{n−1} is linear in (q_1, q_2). Therefore, the values of S_{n−1} in [q_1, q_2] can be calculated using q_1 and q_2 alone. Consequently, according to the definitions of f_n and the responsibility of a pair, the set of pairs P_j with responsibilities in [q_1, q_2] can also be calculated using q_1 and q_2. Calculating the responsibilities of all pairs is performed by repeating the process for any two adjacent points in U_{n−2}.

At this point, as we know R_p for every pair p = ⟨r, v⟩, calculating the set (r \ R_p) ∪ (R_p \ r) is trivial. Acquiring the responsibility of any submodel in s_{n−1} using Theorem A.1 enables us to calculate its misclassified pair set immediately. ∎

Proof of Theorem A.13
Proof. Let m be a submodel in s_{n−1} with responsibility R_m. For simplicity, we address the case where R_m is a continuous range; extension to the general case is possible by repeating the proof for any continuous range in R_m.

Denote the submodel's finite set of trigger inputs as G_m. Define the set Q as follows:

Q = {min R_m} ∪ (G_m ∩ R_m) ∪ {max R_m}

Let q_1 < q_2 be two adjacent values in Q. From the definition of trigger inputs, m outputs a linear function in [q_1, q_2]. Hence, the set of values S = {f_n(x) | x ∈ [q_1, q_2]} can be calculated using only q_1 and q_2 over S_{n−1}. From Lemma A.17, the misclassified pair set P_m can be calculated using the finite set U_{n−2}. Denote the set

P̂ = { ⟨r, v⟩ | ⟨r, v⟩ ∈ P_m, r ∩ [q_1, q_2] ≠ ∅ }

Calculating max{ |s − v| : s ∈ S, ⟨r, v⟩ ∈ P̂ } yields the maximum error of m in [q_1, q_2]. Repeating the process for any two adjacent points in Q yields the maximum error of m for all of R_m. ∎

Rule-set names in Figures 8 and 17, by order: ACL1, ACL2, ACL3, ACL4, ACL5, FW1, FW2, FW3, FW4, FW5, IPC1, IPC2.
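As a concreteness check for Definitions A.15–A.16, the brute-force sketch below finds the maximum prediction error of a predictor over a sample grid. The true ranges and the slightly shifted predictor are made up for illustration, and sampling replaces the analytical procedure of the proof above.

```python
import math

def max_prediction_error(predict, pairs, grid):
    """predict(x) -> predicted pair index f_n(x); pairs[v] = (lo, hi) is the
    true range of pair v. Returns the maximum |f_n(x) - v| over sampled inputs
    whose prediction disagrees with their true pair (cf. Definition A.16)."""
    err = 0
    for x in grid:
        pred = predict(x)
        true = next(v for v, (lo, hi) in enumerate(pairs) if lo <= x < hi)
        if pred != true:
            err = max(err, abs(pred - true))
    return err

# Made-up example: 4 equal ranges over [0, 1), predictor shifted by 0.05.
pairs = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
predict = lambda x: min(math.floor((x + 0.05) * 4), 3)   # off near boundaries
grid = [i / 1000 for i in range(1000)]
print(max_prediction_error(predict, pairs, grid))        # off by one pair index
```

A bounded maximum error is what makes RQ-RMI usable for classification: the prediction only needs to be refined among the few candidate pairs within the error bound.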
[Table 4: RQ-RMI configurations for different input rule-set sizes.]