Cache Bypassing for Machine Learning Algorithms
Asim Ikram
Department of Computer Sciences, National University of Computer and Emerging Sciences, Islamabad, Pakistan
[email protected]
Muhammad Awais Ali
Department of Computer Sciences, National University of Computer and Emerging Sciences, Islamabad, Pakistan
[email protected]
Mirza Omer Beg
Assistant Professor, Department of Computer Sciences, National University of Computer and Emerging Sciences, Islamabad, Pakistan
[email protected]
Abstract
Graphics Processing Units (GPUs) were once used solely for graphics workloads, but with the growth of machine learning applications, their use for general purpose computing has increased markedly in recent years. GPUs execute massive numbers of threads to achieve a high degree of parallelism. Despite this computational power, they suffer from cache contention due to the SIMT execution model they use. One solution to this problem is called "cache bypassing". This paper presents a predictive model that analyzes the access patterns of various machine learning algorithms and determines whether certain data should be stored in the cache or not. It presents insights on how well each model performs on different datasets and also shows how minimizing the size of each model affects its performance. The performance of most of the models was found to be around 90%, with KNN performing the best, though not with the smallest size. We further enrich the feature set by splitting the addresses into chunks of 4 bytes. We observe that this substantially improved the neural network, raising its accuracy to 99.9% with three neurons.
Keywords: cache optimization, cache bypassing, machine learning
1 Introduction

Recent advances in artificial intelligence have led to renewed attention towards a diverse set of difficult combinatorial problems [1, 4, 25, 26, 34]. Graphics Processing Units (GPUs) have been used to perform graphics-intensive tasks since their early days, but in recent years, thanks to the rise of machine learning, GPUs have also been used to perform high performance tasks [12]. Due to the massive computational power that GPUs offer, they are being used to perform tasks that were once meant to be executed by CPUs. Since GPUs provide more computational power than CPUs, they have been used extensively to train machine learning algorithms [20, 27, 30]. Initially, GPUs only had a shared on-board memory called the scratchpad memory.
Figure 1. Miss rates for L1 cache bypassing on selected Rodinia benchmark programs.

While scratchpad memory was programmable and enabled rapid fetching of data, it had its own pitfalls: it performed well on processes that had uniform access patterns but performed poorly on processes that exhibited irregular access patterns [10]. To handle these problems, vendors started to employ cache memory on GPUs as well.

Caches on GPUs perform well on tasks that exhibit non-uniform or irregular access patterns. GPU caches do not operate in the same way as CPU caches and hence cannot be optimized in the same way [3]. They also exhibit a low level of temporal and spatial locality. While CPU caches have been studied in detail, GPU caches remain an area to be explored [16].

GPUs work by dispatching a large number of threads to streaming multiprocessors (SMs) for processing large datasets in parallel. Since a large number of threads need to be executed, the small size of the cache becomes a bottleneck for performance, especially when training deep learning models [2, 11, 22]. One problem that arises is called "cache contention". Under a naive approach, the processor evicts data from the cache using the defined eviction policy when the cache reaches its maximum capacity. This approach is ineffective, since the data already present in the cache might be used more frequently than newer data. Because GPUs require a massive number of threads to be executed in parallel on a shared cache, harmful evictions occur very frequently. One solution to this problem is called "cache bypassing", which involves storing some of the data directly in the L2 cache rather than the L1 cache. Figure 1 illustrates how cache bypassing can improve the performance of randomly selected programs running on a GPU: we executed a few programs from the Rodinia benchmark on the Kepler GPU architecture, modified the PTX data access instructions to randomly bypass the cache, and observed improvements in all cases, as seen in Figure 1.

This paper proposes a mechanism that analyzes the access patterns of various machine learning algorithms and uses a model that predicts whether a certain address should be bypassed or not. Furthermore, since the mechanism is intended to be embedded in the hardware, the size of the models has been reduced to a considerable degree to reduce the cost of implementation. This paper makes the following contributions:

• We propose a cache bypassing mechanism using various machine learning algorithms that learn the access patterns of other machine learning algorithms and take bypassing decisions.
• We shrink the size of each model and evaluate its performance. We reduce the size of the model since it is intended to be implemented in the hardware.
• We further split the addresses into multiple parts, evaluate our models on the modified datasets, and find that the performance of the neural network increases substantially.

The rest of this paper is organized as follows. Section 2 describes the background, section 3 presents an overview of the related work, section 4 describes our methodology, section 5 lists the experimental setup used to conduct the experiments, section 6 lists the results of the experiments, and section 7 summarizes our work.
2 Background

GPUs consist of multiple components, such as Streaming Multiprocessors (SMs). SMs are processors designed to handle CUDA requests. Each SM has multiple CUDA cores and a shared L1 data cache [23], and is responsible for dispatching warps (blocks of 32 threads) to the CUDA cores.
General Purpose GPUs (GPGPUs) are GPUs that carry out tasks that would otherwise be executed on a CPU. Since GPUs allow massive thread-level parallelism, they have been used to perform compute-intensive tasks. Due to the thread handling capabilities of GPUs, even lower-end GPUs can handle tasks that might otherwise require cutting-edge CPUs.
GPU caches were introduced to counteract the drawbacks of scratchpad memory: they perform well on data that exhibits irregular access patterns. While caches have their benefits, they also suffer from some drawbacks. GPUs employ a mechanism called Single Instruction Multiple Threads (SIMT), in which many threads are dispatched for a single process, and the small size of the cache becomes a bottleneck when handling such a large number of threads running in parallel. To increase the performance of the cache in this scenario, cache bypassing can be used. Cache bypassing involves storing in the cache only those instructions that have a high reuse rate; instructions that do not have a high reuse rate are not stored in the cache and are accessed directly from memory. This can decrease cache miss rates.
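To make the intuition concrete, the toy Python simulation below (ours, not the paper's; the trace, cache capacity, and reuse predicate are all invented for exposition) compares the miss rate of a small LRU cache with and without bypassing addresses that a simple oracle marks as low-reuse:

```python
from collections import OrderedDict, Counter
import random

def miss_rate(trace, capacity, bypass=lambda addr: False):
    """Simulate a small fully associative LRU cache and return its miss
    rate. Addresses for which bypass(addr) is True skip the cache."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if bypass(addr):
            misses += 1                  # serviced directly by the next level
            continue
        if addr in cache:
            cache.move_to_end(addr)      # hit: refresh LRU position
        else:
            misses += 1
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses / len(trace)

# Synthetic trace: a few hot (high-reuse) addresses interleaved with
# streaming one-shot accesses that would otherwise pollute the cache.
random.seed(0)
hot = list(range(8))
trace = [random.choice(hot) if random.random() < 0.5 else 1000 + i
         for i in range(10_000)]

counts = Counter(trace)
low_reuse = lambda addr: counts[addr] < 2   # oracle "low reuse" predicate

print("no bypass:", miss_rate(trace, capacity=8))
print("bypassing:", miss_rate(trace, capacity=8, bypass=low_reuse))
```

With bypassing, the one-shot accesses no longer evict the hot lines, so the hot addresses stay resident and the overall miss rate drops, which is the effect the paper's predictive model aims to achieve on real address traces.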
3 Related Work

GPUs have been used in recent years for high performance computing, and to support such tasks vendors have started to employ caches on GPUs [10, 31]. These authors observe that, due to the nature of GPGPU applications, the data reuse rate is very low, and that such data can be bypassed to improve cache performance. Yijie Huangfu and Wei Zhang [10] proposed a mechanism that filters data based on addresses; their mechanism improved cache performance by 13.8%. The authors in [31] classify data into three types based on locality, use static bypassing for data with high and low levels of locality, and use dynamic bypassing for data with a medium level of locality.

The authors in [7, 19, 33] propose compiler-based bypassing mechanisms. The authors in [19] state that only global load instructions are stored in the cache, so only those instructions need to be identified and bypassed [7, 19]. They propose a heuristic-based method for a compiler that filters out global load instructions and generates optimized code. The authors in [7] propose a model that identifies the optimal number of warps. The authors in [33] propose 'Hyper Loop Parallelism' to improve the performance of CUDA GPUs: they build a compiler around a mechanism that identifies whether a loop can be expressed in vector form. [16] also propose a compile-time framework for limiting the number of threads that can access a cache. [24] propose a bypass-first policy for the last level cache that only stores those addresses that are likely to be re-referenced. [35] propose a bypassing scheme targeted at un-coalesced loads. Their mechanism uses two approaches: one is to bypass data when the number of accesses exceeds a pre-determined threshold; the second is to bypass memory accesses when the L1 data cache is stalled.

The authors in [15, 29] propose bypassing mechanisms based on reuse. [15] propose a mechanism that uses feedback control loops to predict reuse patterns for each instruction. They keep track of reuses in a reuse table and use its data to statistically determine whether to bypass an instruction, resulting in an almost twofold speedup. [29] propose an adaptive cache bypassing mechanism to avoid premature eviction: they use the PC trace to predict which blocks are unlikely to be re-referenced and bypass them, which results in a higher hit rate. [17] propose a method that dynamically bypasses instructions and only stores in the L1 data cache those instructions that have a high reuse rate and a short reuse distance. The authors also propose decoupling the L1 data cache to increase its energy efficiency and to enable the storage of more reuse patterns with a lower overhead.

4 Methodology

The purpose of this paper is to analyze the address patterns of machine learning algorithms and use a predictive model to decide whether data should be stored in the cache or not. Since the model is intended to be embedded in the hardware, its size should ideally be kept to a minimum.

A large number of machine learning models exist, and using all of them was not feasible, so we chose a subset of the available algorithms. The algorithms that we ran our tests on include
Decision Tree, K Nearest Neighbors (KNN), Logistic Regression, and Neural Network (MLP). These machine learning algorithms have many parameters to tune, and exhaustive tuning is very computationally expensive: obtaining optimal values, whether manually or automatically, takes several days, which cannot be considered feasible. Therefore, we optimized each algorithm around the core parameter that drives it. For the decision tree, we varied the depth and the impurity. We focused on these parameters because the depth controls how many nodes exist in the tree; for increased depths, the nodes in the tree increase along with its overall complexity. When we consider impurity, the computational cost is higher when the impurity threshold is lower. In the case of KNN, we chose to vary the value of K, since this parameter controls the number of neighbors that the model consults. For logistic regression, instead of using only one solver, we focused on trying multiple solvers to evaluate the performance of each one and gain insight into which one performs best on the given datasets. Neural networks have seen massive use in recent years: with the introduction of deep learning algorithms as well as the increase in processing power, neural networks have been applied to a wide variety of domains such as autonomous driving and IoT [9, 13, 18, 21, 28, 32]. Since neural networks have been extremely popular, we were interested in their performance at cache bypassing. As our model is intended to be implemented in the hardware, we could not afford too many neurons in the network, since that would increase the implementation cost; we therefore varied the number of neurons while keeping the number of hidden layers at one.

The dataset that we used was imbalanced, and to balance it we used SMOTE [5]. The parameter sweeps we used are given in Algorithms 1, 2, 3, 4, and 5; a hedged Python sketch of these sweeps follows Algorithm 4.
Algorithm 1: Decision Tree (Depth)
  for DepthParam ← … to … do
      Initialize DecisionTree(depth = DepthParam);
      evaluate Decision Tree
  end

Algorithm 2: Decision Tree (Impurity)
  for ImpParam ← … to … do
      Initialize DecisionTree(impurity = ImpParam);
      evaluate Decision Tree
  end

Algorithm 3: KNN
  for KParam ← … to … do
      Initialize KNN(K = KParam);
      evaluate KNN Model
  end

Algorithm 4: Logistic Regression
  for SolverParam in [newton-cg, LBFGS, liblinear, sag] do
      Initialize LogisticRegression(Solver = SolverParam);
      evaluate Logistic Regression Model
  end
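For concreteness, a Python sketch of these sweeps is given below. The paper only states that the models were implemented in Python; the scikit-learn and imbalanced-learn APIs, the synthetic placeholder data, and the specific parameter ranges are our assumptions for illustration, not the paper's code.

```python
# Hedged sketch of Algorithms 1-5; library choice, data, and ranges are
# assumptions for illustration only.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2**32, size=(1000, 1)).astype(float)  # address feature
y = (rng.random(1000) < 0.2).astype(int)                  # imbalanced labels

X_bal, y_bal = SMOTE().fit_resample(X, y)                 # balance classes [5]
X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.2,
                                          random_state=0)

def evaluate(model):
    """Fit on the training split and return held-out accuracy."""
    return model.fit(X_tr, y_tr).score(X_te, y_te)

# Algorithm 1: sweep the decision tree depth.
for depth in range(1, 10):
    print("depth", depth, evaluate(DecisionTreeClassifier(max_depth=depth)))

# Algorithm 2: sweep the impurity threshold (mapped onto
# min_impurity_decrease here; the paper does not name the exact knob).
for imp in (0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    print("imp", imp, evaluate(DecisionTreeClassifier(min_impurity_decrease=imp)))

# Algorithm 3: sweep K for both uniform and distance-weighted KNN.
for k in range(1, 10):
    for w in ("uniform", "distance"):
        print("K", k, w, evaluate(KNeighborsClassifier(n_neighbors=k, weights=w)))

# Algorithm 4: try each logistic regression solver.
for solver in ("newton-cg", "lbfgs", "liblinear", "sag"):
    print(solver, evaluate(LogisticRegression(solver=solver, max_iter=1000)))

# Algorithm 5: a single hidden layer with a varying neuron count.
for n in range(1, 6):
    print("neurons", n, evaluate(MLPClassifier(hidden_layer_sizes=(n,),
                                               max_iter=500)))
```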
The dataset that we used consisted of only one feature: the address. This caused our algorithms to perform poorly, especially the neural network. To handle this, we split the address into chunks of 4 bytes and used that data to train the models once more. We present the results for both versions of the dataset. The splitting procedure is given in Algorithm 6, and a minimal Python rendering of it appears below.
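The sketch below mirrors the digit-wise loop of Algorithm 6; the function and variable names are ours, not the paper's.

```python
def split_address(addr: int) -> list:
    """Decompose an integer address into decimal digits, least
    significant first, as in the loop of Algorithm 6."""
    digits = []
    while addr != 0:
        addr, remainder = divmod(addr, 10)
        digits.append(remainder)
    return digits

# A byte-oriented variant of the same idea (the prose describes 4-byte
# chunks) would use divmod(addr, 2**32) or bit masks instead of base 10.
data = [0x7FFEE3C0, 0x10A4F2B8]            # illustrative addresses
split_data = [split_address(record) for record in data]
```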
Algorithm 5: Neural Network
  for NeuronParam ← … to … do
      Initialize NeuralNetwork(neurons = NeuronParam);
      evaluate Neural Network
  end

Algorithm 6: Splitting the Addresses
  SplitData ← [];
  for i in range(sizeOf(Data)) do
      record ← Data[i];
      tempList ← [];
      while record ≠ 0 do
          record, remainder ← divmod(record, 10);
          tempList.append(remainder)
      end
      SplitData.append(tempList)
  end

5 Experimental Setup

The experiments were conducted on datasets generated from TensorFlow examples [8]. The codes that we used are given in Table 1; they were run on the MNIST dataset [14]. We ran several iterations of each example and generated a dataset containing approximately 1,000,000 records.
Table 1. Codes Used for Generating Datasets

  Code Used
  Nearest Neighbors
  Logistic Regression
  Random Forest
  Recurrent Neural Network

The models that we used for training are described in Section 4. The models were trained on a machine with an Intel i7 3630QM processor (2.4 GHz), 8 GB RAM, and an Nvidia Geforce GT 630M [6]; the specifications of the GPU are given in Table 2. The models were implemented using Python 3.6.4. As the model(s) would be implemented at the hardware level, we reduced the size of the models to reduce implementation costs while maintaining an acceptable level of accuracy.
6 Results

We tested our models using the data generated from the TensorFlow examples mentioned earlier, measuring the accuracy of each model with respect to the model's size.
Table 2. Geforce GT 630M Specifications

  Item                      Value
  CUDA Cores                96
  Graphics Clock            800 MHz
  Memory Interface Width    128 bits
  Architecture              Kepler
  Memory Bandwidth          32 GB/s
  Memory Size               2 GB

The results for each of the datasets on the different models are given in the subsequent sections.
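The paper does not state how model size was quantified. One plausible proxy, sketched here with invented data purely for illustration, is the serialized footprint of the trained model:

```python
# Illustrative only: report accuracy alongside a model's pickled size.
import pickle
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 4))             # placeholder 4-chunk address features
y = rng.integers(0, 2, size=500)     # placeholder bypass labels

def model_size_bytes(model) -> int:
    """Approximate storage cost as the size of the pickled model."""
    return len(pickle.dumps(model))

# Shrinking the hidden layer trades accuracy for a smaller footprint.
for n in range(1, 6):
    clf = MLPClassifier(hidden_layer_sizes=(n,), max_iter=500).fit(X, y)
    print(n, "neurons:", round(clf.score(X, y), 3),
          model_size_bytes(clf), "bytes")
```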
Figure 2 shows the results obtained from the machine learning models on the logistic regression dataset. For the decision tree, when the impurity was kept constant and the depth of the tree was varied, the accuracy varied as the depth changed from 1 to 5; from depth 5 onwards the accuracy remained the same. When the depth was kept constant and the impurity was varied, the accuracy was high when the impurity was lower than 0.05 but dropped drastically when the impurity was increased beyond this point; at impurities greater than or equal to 0.15 the accuracy remained the same. For KNN, the accuracy decreased more sharply in the case of uniform weights. All of the versions of the neural network had an accuracy of approximately 50% and exhibited irregular behavior as the number of neurons in the network was changed. A notable point is that the sigmoid and LBFGS solvers gave their highest accuracy with one neuron, while the Adam solver showed the converse behavior. For logistic regression, all the solvers gave the same error, which implies that the solver most suited to the situation can be used without affecting the performance of the model.

Figure 2. Results on the Logistic Regression Dataset: (a) Decision Tree (Depth), (b) Decision Tree (Impurity), (c) KNN (Uniform Weights), (d) KNN (Weighted), (e) Logistic Regression, (f) Neural Network (LBFGS), (g) Neural Network (Sigmoid), (h) Neural Network (Adam).
Figure 3 shows the results obtained on the nearest neighbors dataset using the machine learning algorithms. In the case of the decision tree, the accuracy increased suddenly when the depth was increased from 1 to 2 and increased very slightly when the depth was increased from 3 to 4; it remained constant at depths greater than or equal to 4 (impurity held constant). When the depth was kept constant and the impurity was varied, the accuracy was very high when the impurity was below 0.30 but decreased abruptly beyond this point, which implies that the nearest neighbors dataset is relatively resilient to impure splitting. The accuracy obtained on the same dataset using both versions of KNN was very high; the graphs show that accuracy decreased more consistently with increasing values of K when using uniform weights. The mean absolute error obtained using logistic regression was the same for all solvers. For the neural network, the sigmoid solver exhibited a constant accuracy as the number of neurons was changed, while the LBFGS and Adam solvers displayed a completely random pattern no matter how many neurons existed in the network; the Adam solver achieved its highest accuracy with one neuron, while LBFGS exhibited its lowest accuracy with one neuron.

Figure 3. Results on the Nearest Neighbors Dataset (same panels as Figure 2).

Figure 4. Results on the Random Forest Dataset (same panels as Figure 2).
Figure 4 shows the results obtained on the random forest dataset using the machine learning algorithms. In the case of the decision tree, when the depth was varied and the impurity was kept constant, the accuracy increased at approximately equal intervals. When the depth was kept constant and the impurity was varied, the accuracy dropped abruptly as soon as the impurity was increased from 0 and stayed constant at values of 0.05 and above, which implies that accuracy decreases substantially when the splitting of nodes is not pure. For both versions of KNN, the accuracy is highest when K is 3, but it decreases sharply with uniform weights and remains almost constant with weighted KNN. When using logistic regression, the error for newton-cg was considerably lower than for the other three solvers; the error for LBFGS, liblinear, and sag was the same. In the case of the neural networks, the sigmoid solver behaved differently from the rest, exhibiting a constant accuracy no matter how many neurons existed in the network, while the LBFGS and Adam solvers behaved irregularly as the number of neurons was changed.
Figure 5 shows the results achieved on the RNN dataset using the models. For the decision tree, the accuracy increased substantially when the depth was increased from 1 to 2 but only by a very minor amount when the depth was increased from 2 to 3; it remained constant at depths greater than or equal to 3 (impurity held constant). When the depth was kept constant and the impurity was increased, the accuracy decreased rapidly as soon as the impurity was increased from 0, meaning that even a slightly impure split can cause a rapid decrease in accuracy on this dataset. For both versions of KNN, the highest accuracy was achieved when the value of K was 5; for values of K greater than 5, the behavior of both versions was almost the same. The error for all solvers remained the same when using logistic regression, implying that for the RNN dataset any of the solvers could be applied (preferably the one best suited to the situation) without affecting the error. For the neural network, all of the solvers showed random behavior as the number of neurons in the network was changed; the LBFGS solver achieved its highest accuracy with one neuron in the network, while the sigmoid and Adam solvers exhibited the opposite behavior under the same conditions.

Figure 5. Results on the RNN Dataset (same panels as Figure 2).
At first we tested our models on the original dataset, which consisted of only one feature. Since using one feature was causing a linear mapping, we split the data into parts to obtain a more generic model and tested our models on the modified dataset as well. The results of this modification are described in the subsequent sections.
Figure 6 shows the results obtained from the machine learning models on the logistic regression dataset with multiple features. For the decision tree, when the impurity was kept constant and the depth of the tree was varied, the accuracy varied as the depth was changed from 1 to 9 and kept increasing. When the depth was kept constant and the impurity was varied, the accuracy was high when the impurity was lower than 0.05 but dropped drastically when the impurity was increased beyond this point; at impurities greater than or equal to 0.15 the accuracy remained the same. For KNN, the accuracy decreased more sharply in the case of uniform weights. All of the versions of the neural network had an accuracy of approximately 50% and exhibited irregular behavior as the number of neurons in the network was changed. A notable point is that the sigmoid and LBFGS solvers gave their highest accuracy with two neurons, while the Adam solver achieved its highest accuracy with five neurons. For logistic regression, all the solvers gave the same error, implying that the solver most suited to the situation can be used without affecting the performance of the model.

Figure 6. Results on the Logistic Regression Dataset (Case 2) (same panels as Figure 2).

Figure 7 shows the results obtained on the nearest neighbors dataset with multiple features. In the case of the decision tree, the accuracy increased suddenly when the depth was increased from 1 to 2 and increased very slightly when the depth was increased from 3 to 4; it remained constant at depths greater than or equal to 4 (impurity held constant). When the depth was kept constant and the impurity was varied, the accuracy was very high when the impurity was below 0.05, decreased slightly beyond this point, and then dropped abruptly when the impurity was increased beyond 0.25, which implies that the nearest neighbors dataset is relatively resilient to impure splitting. The accuracy obtained on the same dataset using both versions of KNN was very high; the graphs show that accuracy decreased more consistently with increasing values of K when using uniform weights, and increased gradually with weighted KNN as K was increased. The mean absolute error obtained using logistic regression was the same for all solvers. For the neural network, the sigmoid solver exhibited a constant accuracy as the number of neurons was changed, while the LBFGS and Adam solvers displayed almost similar patterns: with LBFGS the accuracy was initially 75% with one neuron, decreased suddenly when the number of neurons was increased to 2, and after that behaved like Adam, increasing gradually as the number of neurons was increased.

Figure 7. Results on the Nearest Neighbors Dataset (Case 2) (same panels as Figure 2).
Figure 8 shows the results obtained on the random forest dataset with multiple features using the machine learning algorithms. In the case of the decision tree, when the depth was varied and the impurity was kept constant, the accuracy increased gradually. When the depth was kept constant and the impurity was varied, the accuracy dropped abruptly as soon as the impurity was increased from 0 and stayed constant at values of 0.15 and above, which implies that accuracy decreases substantially when the splitting of nodes is not pure. For KNN, the accuracy is highest when K is 1, but it decreases drastically in both cases as K is increased. When using logistic regression, the error for all three solvers remained the same. In the case of the neural networks, all solvers exhibited the same behavior as the number of neurons increased: accuracy increased drastically at first and then remained constant as the neurons were further increased.

Figure 8. Results on the Random Forest Dataset (Case 2) (same panels as Figure 2).
Figure 9 shows the results achieved on the RNN dataset with multiple features. For the decision tree, the accuracy increased substantially when the depth was increased from 1 to 2 but then remained constant at depths greater than 2. When the depth was kept constant and the impurity was increased, the accuracy decreased rapidly as soon as the impurity was increased from 0, meaning that even a slightly impure split can cause a rapid decrease in accuracy on this dataset. For both versions of KNN, the highest accuracy was achieved when the value of K was 5; for values of K greater than 5, the behavior of both versions was almost the same. The error for all the solvers remained the same when using logistic regression, except for liblinear, which showed a slightly lower error than the others; this implies that for the RNN dataset the liblinear solver could be applied without affecting the error. For the neural network, LBFGS showed its highest accuracy with 3 neurons and then remained constant as the number of neurons was increased, while the sigmoid and Adam solvers exhibited irregular behavior as the number of neurons in the network was changed.

Figure 9. Results on the RNN Dataset (Case 2) (same panels as Figure 2).
7 Conclusion

GPUs have been used in recent years to perform general purpose computing tasks. For this purpose, caches have been added to GPUs, but they perform poorly due to the massive number of threads accessing the cache. This paper analyzes different machine learning algorithms and presents insights on whether to bypass or cache addresses using various machine learning algorithms. It also shows how reducing the size of the machine learning models affects their performance.
References

[1] Talha Anwar and Omer Baig. 2020. TAC at SemEval-2020 Task 12: Ensembling Approach for Multilingual Offensive Language Identification in Social Media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation. 2177–2182.
[2] Muhammad Asad, Muhammad Asim, Talha Javed, Mirza O Beg, Hasan Mujtaba, and Sohail Abbas. 2020. DeepDetect: detection of distributed denial of service attacks using deep learning. Comput. J. 63, 7 (2020), 983–994.
[3] Mirza Beg and Peter Van Beek. 2010. A graph theoretic approach to cache-conscious placement of data for direct mapped caches. In Proceedings of the 2010 International Symposium on Memory Management. 113–120.
[4] Mirza Omer Beg. 2013. Combinatorial problems in compiler optimization. (2013).
[5] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Int. Res. 16, 1 (2002), 321–357.
[6] NVIDIA Corporation. 2018. Geforce GT 630M Specifications. (March 2018). Retrieved March 2, 2005 from
[7] Hongwen Dai, Chao Li, Huiyang Zhou, Saurabh Gupta, Christos Kartsaklis, and Mike Mantor. 2016. A Model-Driven Approach to Warp/Thread-Block Level GPU Cache Bypassing Experimental Methodology. In . 94:1–94:6.
[8] Aymeric Damien. 2018. TensorFlow-Examples. (March 2018). Retrieved March 2, 2005 from https://github.com/aymericdamien/TensorFlow-Examples
[9] Ranik Guidolini, Alberto F. De Souza, Filipe Wall Mutz, and Claudine Badue. 2017. Neural-based model predictive control for tackling steering delays of autonomous cars. In International Joint Conference on Neural Networks, IJCNN 2017. 4324–4331. https://doi.org/10.1109/IJCNN.2017.7966403
[10] Yijie Huangfu and Wei Zhang. 2015. Boosting GPU performance by profiling-based L1 data cache bypassing. In IEEE/ACM 15th International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2015. 1119–1122. https://doi.org/10.1109/CCGrid.2015.67
[11] Abdul Rehman Javed, Mirza Omer Beg, Muhammad Asim, Thar Baker, and Ali Hilal Al-Bayatti. 2020. AlphaLogger: Detecting motion-based side-channel attack using smartphone keystrokes. Journal of Ambient Intelligence and Humanized Computing (2020), 1–14.
[12] Hussain S Khawaja, Mirza O Beg, and Saira Qamar. 2018. Domain Specific Emotion Lexicon Expansion. In . 1–5.
[13] Sreela Kodali, Patrick Hansen, Niamh Mulholland, Paul N. Whatmough, David M. Brooks, and Gu-Yeon Wei. 2017. Applications of Deep Neural Networks for Ultra Low Power IoT. In IEEE International Conference on Computer Design, ICCD. 589–592. https://doi.org/10.1109/ICCD.2017.102
[14] Yann Lecun, Leon Bottou, Y Bengio, and Patrick Haffner. 1998. Gradient-Based Learning Applied to Document Recognition. In Proceedings of the IEEE, Vol. 86. 2278–2324.
[15] Shin Ying Lee and Carole Jean Wu. 2016. Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs. In Proceedings of the 34th IEEE International Conference on Computer Design, ICCD 2016. 133–140. https://doi.org/10.1109/ICCD.2016.7753271
[16] Ang Li, Gert-Jan van den Braak, Akash Kumar, and Henk Corporaal. 2015. Adaptive and Transparent Cache Bypassing for GPUs. In International Conference for High Performance Computing, Networking, Storage and Analysis. 1–12. https://doi.org/10.1145/2807591.2807606
[17] Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and Huiyang Zhou. 2015. Locality-Driven Dynamic GPU Cache Bypassing. In Proceedings of the 29th ACM on International Conference on Supercomputing. 67–77. https://doi.org/10.1145/2751205.2751237
[18] He Li, Kaoru Ota, and Mianxiong Dong. 2018. Learning IoT in Edge: Deep Learning for the Internet of Things with Edge Computing. IEEE Network 32, 1 (2018), 96–101.
[19] Yun Liang, Xiaolong Xie, Guangyu Sun, and Deming Chen. 2015. An Efficient Compiler Framework for Cache Bypassing on GPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34, 10 (2015), 1677–1690. https://doi.org/10.1109/TCAD.2015.2424962
[20] Adil Majeed, Hasan Mujtaba, and Mirza Omer Beg. 2020. Emotion detection in Roman Urdu text using machine learning. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering Workshops. 125–130.
[21] Mehdi Mohammadi, Ala I. Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. 2017. Deep Learning for IoT Big Data and Streaming Analytics: A Survey. CoRR abs/1712.04301 (2017).
[22] Bilal Naeem, Aymen Khan, Mirza Omer Beg, and Hasan Mujtaba. 2020. A deep learning framework for clickbait detection on social area network using natural language cues. Journal of Computational Social Science (2020), 1–13.
[23] NVIDIA Corporation. 2016. NVIDIA Tesla P100 Whitepaper. Technical Report. 45 pages.
[24] Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2017. A Bypass First Policy for Energy-Efficient Last Level Caches. In . 63–70. https://doi.org/10.1109/SAMOS.2016.7818332
[25] Saira Qamar, Hasan Mujtaba, Hammad Majeed, and Mirza Omer Beg. 2021. Relationship Identification Between Conversational Agents Using Emotion Analysis. Cognitive Computation (2021), 1–15.
[26] Hareem Sahar, Abdul A Bangash, and Mirza O Beg. 2019. Towards energy aware object-oriented development of android applications. Sustainable Computing: Informatics and Systems 21 (2019), 28–46.
[27] Muhammad Tariq, Hammad Majeed, Mirza Omer Beg, Farrukh Aslam Khan, and Abdelouahid Derhab. 2019. Accurate detection of sitting posture activities in a secure IoT based assisted living environment. Future Generation Computer Systems 92 (2019), 745–757.
[28] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2017. DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. CoRR abs/1708.08559 (2017).
[29] Yingying Tian, Sooraj Puthoor, Joseph L Greathouse, Bradford M Beckmann, and Daniel A Jiménez. 2015. Adaptive GPU cache bypassing. In . 25–35. https://doi.org/10.1145/2716282.2716283
[30] Ahmed Uzair, Mirza O Beg, Hasan Mujtaba, and Hammad Majeed. 2019. Weec: Web energy efficient computing: A machine learning approach. Sustainable Computing: Informatics and Systems 22 (2019), 230–243.
[31] Xiaolong Xie, Yun Liang, Yu Wang, Guangyu Sun, and Tao Wang. 2015. Coordinated Static and Dynamic Cache Bypassing for GPUs. In IEEE 21st International Symposium on High Performance Computer Architecture, HPCA. 76–88. https://doi.org/10.1109/HPCA.2015.7056023
[32] Xiaofeng Xie, Di Wu, Siping Liu, and Renfa Li. 2017. IoT Data Analytics Using Deep Learning. CoRR abs/1708.03854 (2017).
[33] Shixiong Xu and David Gregg. 2015. Exploiting Hyper-Loop Parallelism in Vectorization to Improve Memory Performance on CUDA GPGPU. In . 53–60. https://doi.org/10.1109/Trustcom.2015.612
[34] Rabail Zahid, Muhammad Owais Idrees, Hasan Mujtaba, and Mirza Omer Beg. 2020. Roman Urdu reviews dataset for aspect based opinion mining. In . 138–143.
[35] Chen Zhao, Fei Wang, Zhen Lin, Huiyang Zhou, and Nanning Zheng. 2017. Selectively GPU cache bypassing for un-coalesced loads. In International Conference on Parallel and Distributed Systems - ICPADS. 908–915. https://doi.org/10.1109/ICPADS.2016.0122