SmartChoices: Hybridizing Programming and Machine Learning
Victor Carbune, Thierry Coppey, Alexander Daryin, Thomas Deselaers, Nikhil Sarda, Jay Yagnik
Abstract
We present SmartChoices, an approach to making machine learning (ML) a first-class citizen in programming languages, which we see as one way to lower the entrance cost of applying ML to problems in new domains. There is a growing divide in approaches to building systems: on the one hand, programming leverages human experts to define a system, while on the other hand machine learning learns behavior from data. We propose to hybridize these two by providing a 3-call API which we expose through an object called SmartChoice. We describe the SmartChoices interface, show how it can be used in programming with minimal code changes, and demonstrate that it is an easy to use but still powerful tool by showing improvements over not using ML at all on three algorithmic problems: binary search, QuickSort, and caches. In these three examples, we replace the commonly used heuristics with an ML model entirely encapsulated within a SmartChoice, thus requiring minimal code changes. As opposed to previous work applying ML to algorithmic problems, our proposed approach does not require dropping existing implementations but seamlessly integrates into the standard software development workflow and gives the software developer full control over how ML methods are applied. Our implementation relies on standard Reinforcement Learning (RL) methods. To learn faster, we use the heuristic function that is being replaced as an initial function. We show how this initial function can be used to speed up and stabilize learning while providing a safety net that prevents performance from becoming substantially worse, allowing for a safe deployment in critical applications in real life.

Google Research. Correspondence to: Victor Carbune.
Reinforcement Learning for Real Life (RL4RealLife) Workshop at the 36th International Conference on Machine Learning, Long Beach, California, USA, 2019. Copyright 2019 by the author(s).
1. Introduction
Machine Learning (ML) has had many successes in the past decade in terms of techniques and systems as well as in the number of areas in which it is successfully applied. However, using ML has a cost that comes from the additional complexity added to software systems (Sculley et al., 2014). There is a fundamental impedance mismatch between the two approaches to system building. Software systems have evolved from the idea that experts have full control over the behavior of the system and specify the exact steps to be followed. ML, on the other hand, has evolved from learning behavior by observing data. It allows for learning more complex but implicit programs, leading to a loss of control for programmers since the behavior is now controlled by data. We believe it is very difficult to move from one of these approaches to the other, and that a hybrid between them needs to exist which allows leveraging both the developer's domain-specific knowledge and the adaptability of ML systems.

We propose to hybridize ML with programming. We expose a new object called SmartChoice which exposes a 3-call API, is backed by ML models, and determines its value at runtime. A developer is able to use a SmartChoice just like any other object and combine it with heuristics, domain-specific knowledge, problem constraints, etc. in ways that are fully under the developer's control. This represents an inversion of control compared to how ML systems are usually built. SmartChoices allow ML to be integrated tightly into systems and algorithms, whereas traditional ML systems are built around the model.

Our approach combines methods from reinforcement learning (RL) and online learning with a novel API and aims to make using ML in software development easier by avoiding the overhead of going through the traditional steps of building an ML system: (1) collecting and preparing training data, (2) defining a training loss, (3) training an initial model, (4) tweaking and optimizing the model, (5) integrating the model into the system, and (6) continuously updating and improving the model to adjust for drift in the distribution of the data processed.

We show how these properties allow for applying ML in domains that have traditionally not been using it, and that this is possible with minimal code changes. We demonstrate that ML can help improve the performance of "classical" algorithms that typically rely on a heuristic. The concrete implementation of SmartChoices in this paper is based on standard deep RL. We emphasize that this is just one possible implementation.

In this paper we show SmartChoices in the context of the Python programming language (PL) using concepts from object oriented PLs. The same ideas can be transferred directly to functional or imperative PLs, where a SmartChoice could be modelled after a function or a variable.

We show how SmartChoices can be used in three algorithmic problems – binary search, QuickSort, and caches – to improve performance by replacing the commonly used heuristic with an ML model with minimal code changes, leaving the structure of the original code (including potential domain-specific knowledge) untouched. We chose these problems as first applications for ease of reproducibility, but believe that this demonstrates that our approach could benefit a wide range of applications, e.g. systems applications, content recommendations, or modelling of user behavior. Further, we show how to use the heuristics that are replaced as "initial functions", as a means to guide the initial learning, help targeted exploration, and act as a safety net to prevent very bad performance.

The main contributions of this paper are: (i) we propose a way to integrate ML methods directly into the software development workflow using a novel API; (ii) we show how standard RL and online learning methods can be leveraged through our proposed API; (iii) we demonstrate that this combination of ideas is simple to use yet powerful enough to improve performance of standard algorithms over not using ML at all.
2. Software Development with SmartChoices
A SmartChoice has a simple API that allows the developer to provide enough information about its context, predict its value, and provide feedback about the quality of its predictions. SmartChoices invert the control compared to common ML approaches, which are model centric. Here, the developer has full control over how data and feedback are provided to the model, how inference is called, and how predictions are used.

To create a SmartChoice, the developer chooses its output type (float, int, category, ...), shape, and range; defines which data the SmartChoice is able to observe (type, shape, range); and optionally provides an initial function. In the following example we instantiate a scalar float SmartChoice taking on values between 0 and 1, which can observe three scalar floats (each in the range between 0 and 10), and which uses a simple initial function:

choice = SmartChoice(
    output_def=(float, shape=[1], range=[0,1]),
    observation_defs={'low': (float, [1], [0,10]),
                      'high': (float, [1], [0,10]),
                      'target': (float, [1], [0,10])},
    initial_function=lambda observations: 0.5)

The SmartChoice can then be used. It determines its value when read, using inference in the underlying ML model, e.g.

value = choice.Predict()

Specifically, developers should be able to use a SmartChoice instead of a heuristic or an arbitrarily chosen constant. SmartChoices can also take the form of a stochastic variable, shielding the developer from the underlying complexity of inference, sampling, and explore/exploit strategies.

The SmartChoice determines its value on the basis of observations about the context that the developer passes in:

choice.Observe('low', 0.12)
choice.Observe({'high': 0.56, 'target': 0.43})

A developer might provide additional side information to the SmartChoice that an engineered heuristic would not be using but which a powerful model is able to use in order to improve performance.

The developer provides feedback about the quality of previous predictions once it becomes available:

choice.Feedback(reward=10)

In this example we provide numerical feedback. Following common RL practice, a SmartChoice aims to maximize the sum of reward values received over time (possibly discounted). In other setups, we might become aware of the correct value in hindsight and provide the "ground truth" answer as feedback, turning the learning task into a supervised learning problem. Some problems might have multiple metrics to optimize for (run time, memory, network bandwidth) and the developer might want to give feedback for each dimension.

This API allows for integrating SmartChoices easily and transparently into existing applications with little overhead. See listing 1 for how to use the SmartChoice created above in binary search. In addition to the API calls described above, model hyperparameters can be specified through additional configuration, which can be tuned independently. The definition of the SmartChoice only determines its interface (i.e. the types and shapes of inputs and outputs).
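As an end-to-end illustration, the following sketch shows how the three calls could wrap an arbitrarily chosen constant in existing code, mirroring the API shown above; the surrounding names (process_batch, handle_queue) and the reward definition are hypothetical and not part of the paper.

# Hypothetical example: tuning a batch size that was previously a hard-coded constant.
batch_choice = SmartChoice(
    output_def=(int, shape=[1], range=[1, 512]),
    observation_defs={'queue_len': (int, [1], [0, 10000])},
    initial_function=lambda obs: 64)  # the old constant becomes the initial function

def handle_queue(queue):
    while queue:
        batch_choice.Observe('queue_len', len(queue))
        batch_size = int(batch_choice.Predict())     # read the SmartChoice
        batch, queue = queue[:batch_size], queue[batch_size:]
        elapsed = process_batch(batch)                # hypothetical workload
        batch_choice.Feedback(reward=-elapsed)        # lower latency, higher reward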
3. Initial Functions in SmartChoices
We allow the developer to pass an initial function to the SmartChoice. We anticipate that in many cases the initial function will be the heuristic that the SmartChoice is replacing. Ideally it is a reasonable guess at what values would be good for the SmartChoice to return. The SmartChoice will use this initial function to avoid bad performance in the initial predictions, and observe the behavior of the initial function to guide its own learning process, similar to imitation learning (Hussein et al., 2017). The existence of the initial function should strictly improve the performance of a SmartChoice. In the worst case, the SmartChoice could choose to ignore it completely, but ideally it will allow the SmartChoice to explore solutions which are not easily reachable from a random starting point. Further, the initial function plays the role of a heuristic policy which explores the state and action space, generating initial trajectories which are then used for learning. Even though such exploration is biased, off-policy RL can train on this data. In contrast to imitation learning, where an agent tries to become as good as the expert, we explicitly aim to outperform the initial function as quickly as possible, similar to (Schmitt et al., 2018).

For a SmartChoice to make use of the initial heuristic, and to balance between learning a good policy and the safety of the initial function, it relies on a policy selection strategy. This strategy switches between exploiting the learned policy, exploring alternative values, and using the initial function. It can be applied at the action or episode level depending on the requirements. Finally, the initial function provides a safety net: in case the learned policy starts to misbehave, the SmartChoice can always fall back to the initial function at little cost.
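A minimal sketch of one possible policy selection strategy is given below; it only illustrates the switching idea described above (the paper does not specify this exact rule), and the window size, thresholds, and update increments are assumptions.

import random

class PolicySelector:
    # Chooses, per action (or per episode), whether to use the initial function,
    # the learned policy, or an exploratory action, based on recent rewards.
    def __init__(self, explore_prob=0.1, window=100):
        self.explore_prob = explore_prob
        self.window = window
        self.rewards = {'initial': [], 'learned': []}
        self.learned_fraction = 0.05  # mostly trust the initial function at first

    def choose(self):
        if random.random() < self.explore_prob:
            return 'explore'
        return 'learned' if random.random() < self.learned_fraction else 'initial'

    def record(self, source, reward):
        # Track recent rewards per policy and shift usage toward the learned
        # policy once it is at least as good as the initial function.
        if source in self.rewards:
            self.rewards[source].append(reward)
            self.rewards[source] = self.rewards[source][-self.window:]
        init, learned = self.rewards['initial'], self.rewards['learned']
        if init and learned and sum(learned) / len(learned) >= sum(init) / len(init):
            self.learned_fraction = min(1.0, self.learned_fraction + 0.01)
        else:
            self.learned_fraction = max(0.05, self.learned_fraction - 0.01)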
4. SmartChoices in Algorithms
In this section, we describe how SmartChoices can be used in three different algorithmic problems and how a developer can easily leverage the power of machine learning with just a few lines of code. We show experimentally how using SmartChoices helps improve algorithm performance. The interface described above naturally translates into an RL setting: the inputs to Observe calls are combined into the state, the output of the Predict call is the action, and
Feedback is the reward.

To evaluate the impact of SmartChoices we measure cumulative regret over training episodes. Regret measures how much worse (or better, when it is negative) a method performs compared to another method. Cumulative regret captures whether a method is better than another method over all previous decisions. For practical use cases we are interested in two properties: (1) Regret should never be very high, to guarantee acceptable performance of the SmartChoice under all circumstances. (2) Cumulative regret should become permanently negative as early as possible. This corresponds to the desire to have better performance than the baseline model as soon as possible.

Unlike the usual setting which distinguishes a training and an evaluation mode, we perform evaluation from the point of view of the developer without this distinction. The developer just plugs in the SmartChoice and starts running the program as usual. Due to the online learning setup in which SmartChoices are operating, overfitting does not pose a concern (Dekel & Singer, 2005). The (cumulative) regret numbers thus do contain potential performance regressions due to exploration noise. This effect could be mitigated by performing only a fraction of the runs with exploration.

In our experiments we do not account for the computational costs of inference in the model. The goal of our study is to demonstrate that the proposed approach is generally feasible and that with minimal code changes ML can be used in programming. While for algorithms like those we are experimenting with here the actual run time does matter, we believe that advances in specialized hardware will enable running machine learning models at insignificant cost (Kraska et al., 2018). Further, even if such costs seem high, we see SmartChoices as applicable to a wide variety of problems, e.g. those relying on expensive approximation heuristics or working with inherently slow hardware, such as filesystems, where the inference time is less relevant. And lastly, our approach is applicable to a wide variety of problems ranging from systems problems, over user modelling, to content recommendation, where the computational overhead for ML is not as problematic.

Our implementation currently is a small library exposing the SmartChoice interface to client applications (fig. 1). A SmartChoice assembles observations, actions, and feedback into episode logs that are passed to a replay buffer. The models are trained asynchronously. When a new checkpoint becomes available the SmartChoice loads it for use in consecutive steps.

Figure 1.
An overview of the architecture for our experiments: how client code communicates with a SmartChoice and how the model for the SmartChoice is trained and updated.
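The plumbing described above could look roughly like the following sketch; this is not the paper's library, and the class and method names (ReplayBuffer.add, model.predict, model.is_untrained) are hypothetical.

class SmartChoiceRuntime:
    # Assembles Observe/Predict/Feedback calls into (state, action, reward) steps,
    # pushes finished steps to a replay buffer, and hot-swaps model checkpoints
    # that are trained asynchronously.
    def __init__(self, model, replay_buffer, initial_function=None):
        self.model = model
        self.replay_buffer = replay_buffer
        self.initial_function = initial_function
        self.pending = {'observations': {}, 'action': None}

    def Observe(self, name, value):
        self.pending['observations'][name] = value

    def Predict(self):
        state = dict(self.pending['observations'])
        if self.initial_function is not None and self.model.is_untrained():
            action = self.initial_function(state)   # safety net / warm start
        else:
            action = self.model.predict(state)
        self.pending['action'] = action
        return action

    def Feedback(self, reward):
        self.replay_buffer.add(self.pending['observations'],
                               self.pending['action'], reward)
        self.pending = {'observations': {}, 'action': None}
        self._maybe_reload_checkpoint()  # pick up asynchronously trained weights

    def _maybe_reload_checkpoint(self):
        ...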
To enable SmartChoices we leverage recent progress in RL for modelling and training, which allows applying SmartChoices to the most general use cases. While we are only looking at RL methods here, SmartChoices could be used with other learning methods such as multi-armed bandits or supervised learning. We build our models on DDQN (Hasselt et al., 2016) for categorical outputs and on TD3 (Fujimoto et al., 2018) for continuous outputs. Deep Q-learning has become a de facto standard in RL since the successes of deep RL systems such as AlphaGo (Silver et al., 2016). TD3 is a recent modification to DDPG (Lillicrap et al., 2015) using a second critic network to avoid overestimating the expected reward. We summarize the hyperparameters used in our experiments in table 1. While these hyperparameters are new parameters that the developer can tweak, we hypothesize that, on the one hand, tuning hyperparameters is often simpler than manually defining new problem-specific heuristics, and, on the other hand, that improvements in automatic model tuning from the general machine learning community will be easily applicable here too.

Our policy selection strategy starts by only evaluating the initial function and then gradually increases the use of the learned policy. It keeps track of the rewards received by these policies and adjusts the use of the learned policy depending on its performance. We show the usage rate of the initial function when we use it (fig. 2, bottom), demonstrating the effectiveness of this strategy.
Table 1.
Parameters for the different experiments described below (FC = fully connected layer, LR = learning rate). See (Henderson et al., 2018) for details on these parameters.
Columns: Binary search, QuickSort, Caches (discrete), Caches (continuous). A "·" marks a value lost in the source.
Learning algorithm: TD3, DDQN, DDQN, TD3
Actor network: FC → tanh, –, –, FC → tanh
Critic/value network: FC (FC, ReLU) → FC, FC (FC, ReLU) → FC, FC, FC
Key embedding size: –, –, 8
Discount: ·, ·
LR actor: ·, –, –, ·
Initial function decay: yes, no
Batch size: 256, 1024
Action noise σ: ·, –, –, ·
Target noise σ: ·, –, –, ·
Temperature: –, ·, –
Update ratio (τ): 0.05, ·
Common: Optimizer: Adam; LR critic: ·; Replay buffer: Uniform, FIFO, size 20000; Update period: 1.

Binary search (Williams, 1976) is a standard algorithm for finding the location l_x of a target value x in a sorted array A = {a_0, a_1, ..., a_{N-1}} of size N. Binary search has a worst case runtime complexity of ⌈log2(N)⌉ steps when no further knowledge about the distribution of data is available. Prior knowledge of the data distribution can help reduce the average runtime: e.g. in case of a uniform distribution, the location of x can be approximated using linear interpolation, l_x ≈ (N − 1)(x − a_0)/(a_{N−1} − a_0). We show how SmartChoices can be used to speed up binary search by learning to estimate the position l_x for a more general case.

The simplest way of using a SmartChoice is to directly estimate the location l_x and incentivize the search to do so in as few steps as possible by penalizing each step by the same negative reward (listing 1). At each step, the SmartChoice observes the values a_L, a_R at both ends of the search interval and the target x. The SmartChoice output q is used as the relative position of the next read index m, such that m = qL + (1 − q)R.

In order to give a stronger learning signal to the model, the developer can incorporate problem-specific knowledge into the reward function or into how the SmartChoice is used. One way to shape the reward is to account for problem reduction. For binary search, reducing the size of the remaining search space will speed up the search proportionally and should be rewarded accordingly. By replacing the step-counting reward (the Feedback(-1) call) in listing 1 with the search range reduction (R_t − L_t)/(R_{t+1} − L_{t+1}), we directly reward reducing the size of the search space. By shaping the reward like this, we are able to attribute the feedback signal to the current prediction and to reduce the problem from RL to a contextual bandit (which we implement by using a discount factor of 0).
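For illustration, the shaped-reward variant could change the feedback in the listing-1-style code roughly as follows; this is only a sketch of the idea above (it assumes the SmartChoice from section 2), not code from the paper.

def bsearch_shaped(x, a, l=0, r=None):
    # Same structure as listing 1, but the reward is the factor by which the
    # search range shrinks, (R_t - L_t) / (R_{t+1} - L_{t+1}), instead of -1 per step.
    if r is None:
        r = len(a) - 1
    if l > r:
        return None
    choice.Observe({'target': x, 'low': a[l], 'high': a[r]})
    q = choice.Predict()
    m = int(q * l + (1 - q) * r)
    if a[m] == x:
        return m
    new_l, new_r = (m + 1, r) if a[m] < x else (l, m - 1)
    choice.Feedback((r - l) / max(new_r - new_l, 1))  # reward search-range reduction
    return bsearch_shaped(x, a, new_l, new_r)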
Figure 2.
The cost of different variants of binary search (top left), cumulative regret compared to vanilla binary search (right), and initial function usage (bottom).

Alternatively, we can change the way the prediction is used to cast the problem in a way that lets the SmartChoice learn faster and makes it unable to predict very bad values. For many algorithms (including binary search) it is possible to predict a combination of (or a choice among) several existing heuristics rather than predicting the value directly. We use two heuristics: (a) vanilla binary search, which splits the search range {a_L, ..., a_R} into two equally large parts using the split location l_v = (L + R)/2, and (b) interpolation search, which interpolates the split location as l_i = ((a_R − x)L + (x − a_L)R)/(a_R − a_L). We then use the value q of the SmartChoice to mix between these heuristics to get the predicted split position l_q = q·l_v + (1 − q)·l_i. Since in practice both of these heuristics work well on many distributions, any point in between will also work well. This reduces the risk of the SmartChoice picking a value that is really bad, which in turn helps learning. A disadvantage is that it is impossible to find the optimal strategy if its values lie outside of the interval between l_v and l_i.

To evaluate our approaches we use a test environment where, in each episode, we search for a random element in a sorted array of elements taken from a randomly chosen distribution (uniform, triangular, normal, pareto, power, gamma, and chisquare), with values in a fixed range.

Figure 2 shows the results for the different variants of binary search using a SmartChoice and compares them to the vanilla binary search baseline. The results show that the simplest case (pink line), where we directly predict the relative position with the simple reward and without using an initial function, performs poorly initially but then becomes nearly as good as the baseline (cumulative regret becomes nearly constant after an initial bad period). The next case (yellow line) has an identical setup but uses the initial function, and we see that the initial regret is substantially smaller. By using the shaped reward (blue line), the SmartChoice is able to learn the behavior of the baseline quickly. Both approaches that mix the heuristics (green and red lines) significantly outperform the baselines.
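A sketch of the mixing variant is shown below; it only illustrates how the prediction q could be combined with the two heuristics described above (again assuming the SmartChoice from section 2) and is not code from the paper.

def mixed_split(a, l, r, x):
    # Mix vanilla binary search and interpolation search using the SmartChoice output q.
    choice.Observe({'target': x, 'low': a[l], 'high': a[r]})
    q = choice.Predict()                     # q in [0, 1]
    l_v = (l + r) / 2.0                      # vanilla split
    if a[r] != a[l]:
        l_i = ((a[r] - x) * l + (x - a[l]) * r) / (a[r] - a[l])  # interpolation split
    else:
        l_i = l_v
    return int(q * l_v + (1 - q) * l_i)      # mixed split position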
Listing 1.
Standard binary search (left) and a simple way to use a SmartChoice in binary search (right).

def bsearch(x, a, l=0, r=None):
    if r is None:
        r = len(a) - 1
    if l > r:
        return None
    q = 0.5
    m = int(q * l + (1 - q) * r)
    if a[m] == x:
        return m
    if a[m] < x:
        return bsearch(x, a, m + 1, r)
    return bsearch(x, a, l, m - 1)

def bsearch(x, a, l=0, r=None):
    if r is None:
        r = len(a) - 1
    if l > r:
        return None
    choice.Observe({'target': x, 'low': a[l], 'high': a[r]})
    q = choice.Predict()
    m = int(q * l + (1 - q) * r)
    if a[m] == x:
        return m
    choice.Feedback(-1)  # step-counting reward: penalize each additional step
    if a[m] < x:
        return bsearch(x, a, m + 1, r)
    return bsearch(x, a, l, m - 1)

QuickSort (Hoare, 1962) sorts an array in place by partitioning it into two sets (smaller/larger than the pivot) recursively until the array is fully sorted. QuickSort is one of the most commonly used sorting algorithms, and many heuristics have been proposed to choose the pivot element. While the average time complexity of QuickSort is Θ(N log(N)), a worst case time complexity of O(N²) can happen when the pivot elements are badly chosen. The optimal choice for a pivot is the median of the range, which splits it into two parts of equal size.

To improve QuickSort using a SmartChoice we aim at tuning the pivot selection heuristic. To allow for sorting arbitrary types, we use the SmartChoice to determine the number of random samples to pick from the array to sort, and use their median as the partitioning pivot (listing 2). As feedback signal for a recursion step, we estimate the impact of the pivot selection on the computational cost ∆c:

∆c = (c_piv + ∆c_rec) / c_expected = (c_piv + a log a + b log b − n log n) / (n log n),   (1)

where n is the size of the array, a and b are the sizes of the partitions with n = a + b, and c_piv = c_median + c_partition is the cost to compute the median of the samples and to partition the array. ∆c_rec takes into account how close the current partition is to the ideal case (median). The cost is a weighted sum of the number of reads, writes, and comparisons. Similar to the shaped reward in binary search, this reward allows us to reduce the RL problem to a contextual bandit problem and we use a discount of 0.

For evaluation we use a test environment where we sort randomly shuffled arrays. Results of the experiments are presented in fig. 3 and show that the learned method outperforms all baseline heuristics within less than 100 episodes. 'Vanilla' corresponds to a standard QuickSort implementation that picks one pivot at random in each step. 'Random3' and 'Random9' sample 3 and 9 random elements respectively and use the median of these as pivots. 'Adaptive' uses the median of max(1, ⌊log2(n) − c⌋) randomly sampled elements (for some constant c) as pivot when partitioning a range of size n. It uses more samples for larger arrays, leading to a better approximation of the median, and thus to faster problem size reduction.

Fig. 4 shows that the SmartChoice learns a non-trivial policy.
Figure 3.
Results from using a SmartChoice for selecting the number of pivots in QuickSort. (a) shows the overall cost for the different baseline methods and for the variant with a SmartChoice over training episodes. (b) shows the cumulative regret of the SmartChoice method compared to each of the baselines over training episodes.
Figure 4.
Number of pivots chosen by the SmartChoice in QuickSort after 5000 episodes. The expected approximation error of the median is given in the legend, next to the number of samples (1 sample: 50%, 3: 37%, 5: 31%, 7: 27%, 9: 24%, 11: 22%, 13: 20%, 15 samples: 19%).

The SmartChoice learns to select more samples for larger array sizes, which is similar to the behavior that we hand-coded in the adaptive baseline, but in this case no manual heuristic engineering was necessary and a better policy was learned. Also, note that a SmartChoice-based method is able to adapt to changing environments, which is not the case for engineered heuristics. One surprising result is that the SmartChoice prefers 13 over 15 samples at large array sizes. We hypothesize this happens because relatively few examples of large arrays are seen during training (one per episode, while arrays of smaller sizes are seen multiple times per episode).
Caches are a commonly used component to speed up computing systems. They use a cache replacement policy (CRP) to determine which element to evict when the cache is full and a new element needs to be stored. Probably the most popular CRP is the least recently used (LRU) heuristic, which evicts the element with the oldest access timestamp. A number of approaches have been proposed to improve cache performance using machine learning (see sec. 5). We propose two different approaches for how SmartChoices can be used in a CRP to improve cache performance.
Discrete (listing 3): A SmartChoice directly predicts which element to evict or chooses not to evict at all (by predicting an invalid index). That is, the SmartChoice learns to become a CRP itself. While this is the simplest way to use a SmartChoice, it makes it more difficult to learn a CRP better than LRU (in fact, even learning to be on par with LRU is non-trivial in this setting).
Listing 2.
A QuickSort implementation that uses a SmartChoice to choose the number of samples to compute the next pivot. As feedback, we use the cost of the step compared to the optimal partitioning.

from math import log

def qsort(a, l=0, r=None):
    if r is None:
        r = len(a)
    if r <= l + 1:
        return
    m = pivot(a, l, r)
    qsort(a, l, m)
    qsort(a, m + 1, r)

def delta_cost(c_pivot, n, a, b):
    # Relative cost of this step compared to an ideal (median) pivot, cf. eq. (1);
    # max(., 1) guards against empty partitions.
    a, b = max(a, 1), max(b, 1)
    return (c_pivot + a * log(a) + b * log(b) - n * log(n)) / (n * log(n))

def pivot(a, l, r):
    choice.Observe({'left': l, 'right': r})
    q = min(1 + 2 * choice.Predict(), r - l)   # number of samples (odd)
    v = median(sample(a[l:r], q))
    m = partition(a, l, r, v)
    c = cost_of_median_and_partition()
    d = delta_cost(c, r - l, m - l, r - m)
    choice.Feedback(1 / d)
    return m
Listing 3.
Cache replacement policy directly predicting eviction decisions (Discrete).
keys = ...  # keys currently stored in the cache

def miss(key):
    # The requested key is not in the cache: negative feedback, then ask the
    # SmartChoice which element to evict (an invalid index means: do not evict).
    choice.Feedback(-1)
    choice.Observe('access', key)
    choice.Observe('memory', keys)
    return evict(choice.Predict())

def evict(i):
    if i >= len(keys):
        return None
    choice.Feedback(-1)  # additional penalty when an eviction actually happens
    choice.Observe('evict', keys[i])
    return keys[i]

def hit(key):
    choice.Feedback(1)
    choice.Observe('access', key)
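For context, one possible way to wire the callbacks of listing 3 into a cache lookup path is sketched below; the cache dictionary and load_value function are hypothetical, and keys is assumed to be a list of cached keys.

cache = {}

def get(key, load_value):
    # load_value is a hypothetical function that fetches the value on a miss.
    if key in cache:
        hit(key)
        return cache[key]
    victim = miss(key)          # the SmartChoice decides what to evict (or None)
    if victim is not None:
        del cache[victim]
        keys.remove(victim)
    cache[key] = load_value(key)
    keys.append(key)
    return cache[key]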
Listing 4.
Cache replacement policy using a priority queue (Continuous).

q = min_priority_queue(capacity)

def priority(key):
    # The SmartChoice predicts an offset to the access timestamp: items with a
    # higher score are kept longer, items with a lower score are evicted sooner.
    choice.Observe(...)
    score = choice.Predict()
    score *= capacity * scale
    return time() + score

def hit(key):
    choice.Feedback(1)
    q.update(key, priority(key))

def miss(key):
    choice.Feedback(-1)
    return q.push(key, priority(key))
Figure 5.
The architecture of the neural networks for TD3 with the key embedding network.
Continuous (listing 4): A SmartChoice is used to enhance LRU by predicting an offset to the last access timestamp. Here, the SmartChoice learns which items to keep in the cache longer and which items to evict sooner. In this case it becomes trivial to be as good as LRU by predicting a zero offset. The SmartChoice value in (−1, 1) is scaled to get a reasonable value range for the offsets. It is also possible to choose not to store the element by predicting a sufficiently negative score.

In both approaches the feedback given to the SmartChoice is whether an item was found in the cache (+1) or not (−1). In the discrete approach we also give a reward of −1 if the eviction actually takes place.

In our implementation the observations are the history of accesses, memory contents, and evicted elements. The SmartChoice can observe (1) keys as a categorical input or (2) features of the keys.
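As an illustration of option (2), key features such as access frequencies over a fixed-size window (used by the "continuous frequencies" variant discussed below) could be computed roughly as follows; the window size and helper names are assumptions, not taken from the paper.

from collections import deque, Counter

WINDOW_SIZE = 1000                  # assumed window length
window = deque(maxlen=WINDOW_SIZE)  # most recent accesses
counts = Counter()

def record_access(key):
    # Maintain a sliding window of accesses and per-key counts within it.
    if len(window) == WINDOW_SIZE:
        counts[window[0]] -= 1      # the oldest access is about to drop out
    window.append(key)
    counts[key] += 1

def frequency_feature(key):
    # Historical frequency of this key within the window, observed by the SmartChoice.
    return counts[key] / max(len(window), 1)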
Figure 6.
Cache performance for power law access patterns with two different exponents α (top and bottom). (a) Hit ratio (without exploration) and (b) cumulative regret (with exploration).

Observing keys as a categorical input allows us to avoid feature engineering and enables directly learning the properties of particular keys (e.g. which keys are accessed the most) but makes it difficult to deal with rare and unseen keys. To handle keys as input we train an embedding layer shared between the actor and critic networks (fig. 5).

As features of the keys we observe historical frequencies computed over a window of fixed size. This approach requires more effort from the developer to implement such features, but pays off with better performance and the fact that the model does not rely on particular key values.

We experiment with three combinations of these options: (1) discrete caches observing keys, (2) continuous caches observing keys, (3) continuous caches observing frequencies. For evaluation we use a cache of fixed size and integer keys from a fixed range. We use two synthetic access patterns of fixed length, sampled i.i.d. from power law distributions with two different values of α. Fig. 6 shows results for the three variants of predicted caches, a standard LRU cache, and an oracle cache, the latter giving a theoretical, non-achievable upper bound on the performance.

We look at the hit ratio without exploration to understand the potential performance of the model once learning has converged. However, cumulative regret is still reported under exploration noise.

Both implementations that work directly on key embeddings learn to behave similarly to the LRU baseline without exploration (comparable hit ratio). However, the continuous variant pays a higher penalty for exploration (higher cumulative regret). Note that this means that the continuous variant learned to predict constant offsets (which is trivial), whereas the discrete implementation actually learned to become an LRU CRP, which is non-trivial. The continuous implementation with frequencies quickly outperforms the LRU baseline, making the cost/benefit worthwhile long-term (negative cumulative regret after a few hundred episodes).

Nonetheless, similar to many works that build on RL technology, we are faced with the reproducibility issues described by (Henderson et al., 2018). Among multiple runs of any experiment, only some runs exhibit the desired behavior, which we report. In the "failing" runs, we observe baseline performance because the initial function acts as a safety net. Thus, our experiments show that we can outperform the baseline heuristics without a high risk of failing badly. The design construct specific to SmartChoices, and what distinguishes it from standard Reinforcement Learning, is that it is applied in software control, where developers are often able to provide safe initial functions or write the algorithm in a way that limits the cost of a poorly performing policy. While we do not claim to have the solution to reproducibility, the use of the initial function can mitigate it, and any solution for better reproducibility and higher stability developed by the community will be applicable in our approach as well.

In table 2, we provide details on the reproducibility and performance of our experiments over 100 identical experiments for each of the problems described earlier. The table shows the cumulative regret and the break-even point for our experiments for various quantiles and as the mean. Cumulative regret indicates how much worse our method is than not using ML at all; if it is negative it means that it is better than not using it.
The break-even point is the number of episodes after which cumulative regret becomes negative and never turns positive anymore. In some experiments the break-even point is not reached; we report the percentage of runs in which it was reached in the 'mean' column.

We want to highlight that, while the experiments for some problems are more reproducible than others, our approach does not perform substantially worse than the initial function provided by the developer; e.g. cumulative regret for none of the problems grows very large, indicating that performance remains acceptable. This is very visible for the cache experiments: while the break-even point was reached for only 26% of the runs, meaning that the cache performs strictly better than before, it performs worse than before in only 14% of the runs. For 60% of the runs, the use of ML neither helps nor hurts compared to using the LRU heuristic.
5. Related work
The most relevant work to our proposed interface is (Chang et al., 2016), where a programming interface is proposed for joint prediction together with a method that allows for unifying the implementation for training and inference. Similarly, probabilistic programming (Gordon et al., 2014) introduces interfaces which reduce the developer complexity of working with statistical models and conditioning variable values on run-time observations. Our proposed interfaces are at a higher level in that the user does not need to know about the inner workings of the underlying models. In fact, to implement our proposed APIs, techniques from probabilistic programming might be useful. Similarly, (Sampson et al., 2011) propose a programming interface for approximate computation.

Similar in spirit to our approach is (Kraska et al., 2018), which proposes to incorporate neural models into database systems by replacing existing index structures with neural models that can be both faster and smaller. In contrast, we aim not to replace existing data structures or algorithms but to transparently integrate with standard algorithms and systems. Our approach is general enough to be used to improve the heuristics in algorithms (as done here), to optimize database systems (similar to (Kraska et al., 2018)), or to simply replace an arbitrarily chosen constant. Another approach that is similar to SmartChoices is Spiral (Bychkovsky et al., 2018), but it is far more limited in scope than SmartChoices in that it aims to predict boolean values only and relies on ground truth data for model building.

Similarly, a number of papers apply machine learning to algorithmic problems: e.g. Neural Turing Machines (Graves et al., 2014) aim to build a full neural model for program execution, and (Kaempfer & Wolf, 2018; Kool et al., 2018; Bello et al., 2016) propose end-to-end ML approaches to combinatorial optimization problems. In contrast to our approach, these approaches replace the existing methods with an ML system rather than augmenting them. They are a good demonstration of the inversion of control problem mentioned above: using ML requires giving full control to the ML system.

There are a few approaches that are related to our use of the initial function; however, most common problems where RL is applied do not have a good initial function.
Table 2.
Reproducibility data for our experiments: We report cumulative regret for different quantiles of experiments at different training episodes as well as the average over all episodes. We also report the respective break-even point as a number of episodes, which is the number of training episodes at which cumulative regret becomes negative and never turns positive anymore. For the break-even point we report the percentage of runs in which the break-even point was reached in the column "mean".
Problem / metric                Percentile:  1      5      10     25     50     75     90    95    99    mean
Binary Search (N=120)
  Cum. Regret @5K episodes                  -2.71  -2.66  -2.62  -2.45  -2.03  -1.01  0.44  0.70  0.78  -1.59
  Cum. Regret @50K episodes                 -3.99  -3.83  -3.76  -3.64  -3.34  -2.85  3.80  3.86  3.92  -2.20
  Break-even (episodes)                      127    201    271    417    758   2403   ∞     ∞     ∞

Generally related is the idea of imitation learning (Hussein et al., 2017), where the agent aims to replicate the behavior of an expert. Typically the amount of training data created by an expert is very limited. Based on imitation learning is the idea of using previously trained agents to kickstart the learning of a new model (Schmitt et al., 2018), where the authors concurrently use a teacher and a student model and encourage the student model to learn from the teacher through an auxiliary loss that is decreased over time as the student becomes better. In some applications it may be possible to obtain additional training data from experts from other sources; e.g. (Hester et al., 2018; Aytar et al., 2018) leverage YouTube videos of gameplay to increase the training speed of their agents. These approaches work well in cases where it is possible to leverage external data sources.

Caches are an interesting application area where multiple teams have shown in the past that ML can improve cache performance (Zhong et al., 2018; Lykouris & Vassilvitskii, 2018; Hashemi et al., 2018; Narayanan et al., 2018; Gramacy et al., 2002). In contrast to our approach, all these ML models are built for task-specific caches and do not generalize to other tasks. Algorithm selection has been used as an approach to apply RL to improving sorting algorithms (Lagoudakis & Littman, 2000). Search algorithms have also been improved using genetic algorithms to tweak code optimization (Li et al., 2005).
6. Conclusion
We have introduced a new programming concept called a SmartChoice which aims to make it easier for developers to use machine learning from their existing code in new application areas. Contrary to other approaches, SmartChoices can easily be integrated and hand full control to the developer over how ML models are used and trained. Our approach bridges the chasm between the traditional approaches of software systems building and machine learning modeling, and thus allows the developer to focus on refining their algorithm and metrics rather than on building pipelines to incorporate machine learning. We achieve this by proposing a new object called SmartChoice which provides a 3-call API. A SmartChoice observes information about its context and receives feedback about the quality of predictions instead of being assigned a value directly.

We have studied the feasibility of SmartChoices in three algorithmic problems. For each we show how easily SmartChoices can be incorporated and how performance improves in comparison to not using a SmartChoice at all. Specifically, through our experiments we highlight both advantages and disadvantages that reinforcement learning brings when used as a solution behind a generic interface such as SmartChoices.

Note that we do not claim to have the best possible machine learning model for each of these problems; our contribution lies in building a framework that allows for using ML easily, spreading its use, and improving performance in places where machine learning would not have been used otherwise. SmartChoices are applicable to more general problems across a large variety of domains, from system optimization to user modelling. Our current implementation of SmartChoices is built on standard RL methods, but other ML methods such as supervised learning are in scope as well if the problem is appropriate.
Future Work.
In this paper we barely scratch the surface of the new opportunities created with SmartChoices. The current rate of progress in ML will enable better results and wider applicability of SmartChoices to new applications. We hope that SmartChoices will inspire the use of ML in places where it has not been considered before.
Acknowledgements.
The authors are part of a larger effort aiming to hybridize machine learning and programming. We would like to thank all other members of the team for their contributions to this work: George Baggott, Gabor Bartok, Jesse Berent, Eugene Brevdo, Andrew Bunner, Jeff Dean, Arkady Epshteyn, Sanjay Ghemawat, Daniel Golovin, Alex Grubb, Ramki Gummadi, Wei Huang, Eugene Kirpichov, Effrosyni Kokiopoulou, Ketan Mandke, Luciano Sbaiz, Benjamin Solnik, Weikang Zhou. Further we would like to thank the authors and contributors of the TF-Agents (Guadarrama et al., 2018) library: Sergio Guadarrama, Julian Ibarz, Anoop Korattikara, Oscar Ramirez.
References
Aytar, Y., Pfaff, T., Budden, D., Paine, T. L., Wang, Z., and de Freitas, N. Playing hard exploration games by watching YouTube. In NIPS, 2018.

Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. Neural combinatorial optimization with reinforcement learning. ArXiv, 2016.

Bychkovsky, V., Cipar, J., Wen, A., Hu, L., and Mohapatra, S. Spiral: Self-tuning services via real-time machine learning. Technical report, Facebook, 2018. https://code.fb.com/data-infrastructure/spiral-self-tuning-services-via-real-time-machine-learning/.

Chang, K.-W., He, H., Ross, S., Daumé, H., and Langford, J. A credit assignment compiler for joint prediction. In NIPS, 2016.

Dekel, O. and Singer, Y. Data-driven online to batch conversions. In NIPS, 2005.

Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In ICML, 2018.

Gordon, A. D., Henzinger, T. A., Nori, A. V., and Rajamani, S. K. Probabilistic programming. In Proc. FOSE, 2014.

Gramacy, R. B., Warmuth, M. K., Brandt, S. A., and Ari, I. Adaptive caching by refetching. In NIPS, 2002.

Graves, A., Wayne, G., and Danihelka, I. Neural Turing machines. ArXiv, 2014.

Guadarrama, S., Korattikara, A., Ramirez, O., Castro, P., Holly, E., Fishman, S., Wang, K., Gonina, E., Harris, C., Vanhoucke, V., and Brevdo, E. TF-Agents: A library for reinforcement learning in TensorFlow. https://github.com/tensorflow/agents, 2018.

Hashemi, M., Swersky, K., Smith, J. A., Ayers, G., Litz, H., Chang, J., Kozyrakis, C. E., and Ranganathan, P. Learning memory access patterns. In ICML, 2018.

Hasselt, H. v., Guez, A., and Silver, D. Deep reinforcement learning with double Q-learning. In AAAI, 2016.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In AAAI, 2018.

Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Sendonaris, A., Dulac-Arnold, G., Osband, I., Agapiou, J., Leibo, J. Z., and Gruslys, A. Learning from demonstrations for real world reinforcement learning. In AAAI, 2018.

Hoare, C. A. R. Quicksort. The Computer Journal, 5(1):10–16, 1962.

Hussein, A., Gaber, M. M., Elyan, E., and Jayne, C. Imitation learning: A survey of learning methods. ACM Comput. Surv., 2017.

Kaempfer, Y. and Wolf, L. Learning the multiple traveling salesmen problem with permutation invariant pooling networks. ArXiv, 2018.

Kool, W., van Hoof, H., and Welling, M. Attention solves your TSP, approximately. ArXiv, 2018.

Kraska, T., Beutel, A., Chi, E. H., Dean, J., and Polyzotis, N. The case for learned index structures. In SIGMOD, 2018.

Lagoudakis, M. G. and Littman, M. L. Algorithm selection using reinforcement learning. In ICML, 2000.

Li, X., Garzarán, M. J., and Padua, D. A. Optimizing sorting with genetic algorithms. In Int. Sym. on Code Generation and Optimization, 2005.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ArXiv, 2015.

Lykouris, T. and Vassilvitskii, S. Competitive caching with machine learned advice. In ICML, 2018.

Narayanan, A., Verma, S., Ramadan, E., Babaie, P., and Zhang, Z.-L. DeepCache: A deep learning based framework for content caching. In NetAI'18, 2018.

Sampson, A., Dietl, W., Fortuna, E., Gnanapragasam, D., Ceze, L., and Grossman, D. EnerJ: Approximate data types for safe and general low-power computation. In ACM SIGPLAN Notices, volume 46, pp. 164–174. ACM, 2011.

Schmitt, S., Hudson, J. J., Zídek, A., Osindero, S., Doersch, C., Czarnecki, W., Leibo, J. Z., Küttler, H., Zisserman, A., Simonyan, K., and Eslami, S. M. A. Kickstarting deep reinforcement learning. ArXiv, 2018.

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., and Young, M. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

Williams, Jr., L. F. A modification to the half-interval search (binary search) method. In Proc. 14th Annual Southeast Regional Conference, 1976.

Zhong, C., Gursoy, M. C., and Velipasalar, S. A deep reinforcement learning-based framework for content caching. In