Measurement-driven Analysis of an Edge-Assisted Object Recognition System
Apostolos Galanopoulos∗, Víctor Valls†, George Iosifidis∗, Douglas J. Leith∗
∗School of Computer Science and Statistics, Trinity College Dublin, Ireland
†Department of Electrical Engineering, and Institute for Network Science, Yale University, USA
Abstract—We develop an edge-assisted object recognition system with the aim of studying the system-level trade-offs between end-to-end latency and object recognition accuracy. We focus on developing techniques that optimize the transmission delay of the system, and demonstrate the effect of the image encoding rate and neural network size on these two performance metrics. We explore optimal trade-offs between these metrics by measuring the performance of our real-time object recognition application. Our measurements reveal hitherto unknown parameter effects and sharp trade-offs, hence paving the road for optimizing this key service. Finally, we formulate two optimization problems using our measurement-based models and, following a Pareto analysis, we find that careful tuning of the system operation yields at least 33% better performance under real-time conditions over the standard transmission method.
Index Terms—Edge Computing, Real Time Object Recognition
I. INTRODUCTION
Edge-assistance will most likely be a key component of future latency-critical and computationally-demanding mobile applications, such as video analytics and Tactile Internet services [1], [2]. Augmented Reality [3] and real-time object recognition [4] are examples of such services that can benefit from the computational power of a nearby edge server, since mobile devices are too slow to timely perform the required computations. Nevertheless, the practical performance benefits of such edge architectures remain unexplored. On the one hand, data transmissions add to the service delay. On the other hand, the quality and execution delay of the analytics are affected by the volume of the transmitted data, as well as by the complexity of the algorithm running on the edge server.

In this paper we investigate this issue experimentally, by building the edge computing system illustrated in Fig. 1. We develop a real-time object recognition system, as a representative of the plethora of emerging visual-aided services, e.g., video stream analytics, mobile augmented reality, etc. A mobile handset (client) captures camera images and transmits them to an edge server for processing; the server uses a deep neural network (NN) to detect and classify objects in the images, and sends the output back to the handset, which overlays this information on the screen. We built the above system using an Android application and a state-of-the-art deep learning network running on GPU hardware at the server. We use a high-performance 802.11ac wireless link for communication between the handset and the server, which features technology likely to persist in future small cells (we use MU-MIMO/OFDM and channel aggregation at the PHY layer, and employ packet aggregation at the MAC layer to reduce framing/signaling overheads), hence making our results relevant to a range of systems.

Fig. 1: Schematic of edge-assisted object recognition system.

Our goal is to understand the system-level trade-offs between end-to-end (E2E) latency and object recognition accuracy, and to propose specific solutions that can improve the performance of the system. We first show that the degree of image compression and the deep learning NN input size are key parameters affecting both performance metrics. In particular, more aggressive image compression saves on communication latency between client and server (since the transmitted image file is smaller), but at the cost of reduced object recognition accuracy. While the impact of image degradation due to noise or blur on recognition accuracy has started to receive attention in the deep learning literature [5], the impact of compression on accuracy remains relatively poorly understood. Furthermore, a larger NN size improves recognition performance at the cost of higher execution delay at the server, hence increasing the E2E latency. To the best of our knowledge, the trade-off between E2E latency and recognition accuracy with respect to these parameters has not previously been explored.

We focus our effort on designing wireless transmission interventions that further improve the communication delay of the system. Such interventions have not yet received significant attention in the edge computing literature, as most efforts have been devoted to minimizing computation delays [6]–[8]. This delay source, however, is of critical importance to low-latency services, and hinders their ability to achieve real-time performance, e.g., [4], [9]. We show that the transmit time can be reduced by up to 65% by sending the images as short back-to-back bursts of UDP packets.
We also find that the client Network Interface Controller (NIC) power-save can incur substantial transmit latency and, hence, smarter sleep-mode adaptation can further decrease latency by up to 60%.

Finally, we model the different sources of delay in our system, and the obtained accuracy, as functions of the NN size and encoding rate using our measurements. We illustrate the use of the developed model to highlight optimal trade-offs between E2E latency and system object detection accuracy. Moreover, we show that the smart wireless transmission techniques employed can nearly double the system performance along the Pareto-optimal curve of accuracy vs. frame rate. Our main contributions are as follows.
• We build the edge architecture of Fig. 1, where the image encoding rate and input NN size are tunable parameters.
• We tailor the system design with wireless transmission interventions (transport layer, MAC aggregation, device wake-up), reducing the communication delay to just 2-6 ms.
• Using the system, we explore the impact of image encoding quality and NN size on the delay and recognition accuracy. Extensive experiments reveal sharp trade-offs between these two performance criteria.
• We collect a wealth of measurements and use them to build statistical models for the performance metrics of interest. These can be used to tailor the system operation to the needs of the client, e.g., maximize accuracy for a minimum perceived frame rate.
Paper Organization. In Sec. II we describe the system architecture and the evaluation scenario. In Sec. III we measure the impact of the image encoding and NN size on the E2E latency, and present our design choices for reducing the transmission delay. In Sec. IV we analyze the inherent latency-accuracy trade-off, while in Sec. V we use our measurements to obtain analytical models for delay and accuracy. Finally, Sec. VI discusses the related work, and Sec. VII concludes the paper.
II. PRELIMINARIES
A. Hardware & Software Setup
We developed an Android application that captures images through the handset's camera, carries out JPEG encoding, and then transmits the compressed images to an edge server for processing. The server software (written in C/C++) decompresses and pre-processes the images, and submits them to the deep learning neural network (NN), which is implemented using a GPU-optimized framework. The results, i.e., the bounding boxes and labels, are then sent back to the client handset and overlaid on the displayed image.

Object recognition is performed by YOLO [10], a state-of-the-art deep learning detector implemented on darknet, an open-source framework that supports GPU computations via CUDA. It takes an n × n array of image pixels as input, with each pixel being a float value, and down-samples by 32 to give an n/32 × n/32 grid. Each grid cell then proposes bounding boxes and labels for any contained objects. These results are filtered to generate the output, consisting of a set of bounding boxes of recognized objects with their labels and respective confidence.

We use different mobile devices to measure the effect of the end user's hardware on the system's performance: (i) a Google Pixel 2 (default device), (ii) a Samsung Galaxy S8, and (iii) a Huawei P10 Lite. All devices are equipped with 802.11ac chipsets, and we will be using the Google phone unless stated otherwise. The edge server is connected via Ethernet to a WiFi router that serves as an access point (802.11ac, 5 GHz) for the handsets, see Fig. 1. (The edge server is a 3.7 GHz Core i7 PC equipped with 32 GB of RAM and a GeForce RTX 2080Ti GPU; the router is an ASUS RT-AC86U.)
B. The Need for Edge Server Offload

We first investigated the viability of running YOLO on the handset by cross-compiling darknet, but found that the running times were excessive (on the order of minutes). Use of a cut-down version of YOLO, referred to as TinyYOLO [10], was also investigated. Its running time was around 1 s per image, substantially faster than with the full YOLO network but still very slow compared to the server. Note also that the speedup of TinyYOLO comes at the cost of significantly reduced object recognition accuracy, and it supports only a small subset of object types. Our tests convey the same message as previous studies [11], [12], namely they confirm the necessity of offloading the object recognition task to a powerful server if low-latency operation is to be obtained.
C. Evaluation Scenario

To evaluate the system performance we used the extensive COCO dataset [13], which covers a wide range of images and objects, and includes ground truth for each image (object locations and labels within each image). To quantify performance, we used the Average Precision (AP) and Average Recall (AR) metrics for a range of Intersection-over-Union (IoU) values. A detection is considered successful when the ratio of the overlapping area between the detected object and the ground truth, over their respective union area, is higher than an IoU value of 0.5. (AP is the ratio Tp/(Tp + Fp) while AR is the ratio Tp/(Tp + Fn), with Tp being the true positive detections, Fp the false positives and Fn the false negatives; the results are averaged over all object classes.) COCO further breaks the precision and recall metrics down by whether objects are large, medium or small. YOLO is known to perform poorly on small objects, and so we focus on large and medium objects.

To use the COCO images we connected the phone to a server via a USB cable; a Python script on the server sends commands to the phone using the Android Debug Bridge (adb). The server initiates the client application through adb and configures the system parameters for the experiment (e.g., the JPEG compression level). Then it iterates over 5000 images from the COCO validation set, sending them one-by-one to the phone through the cable. The phone transmits each image to the server through the wireless interface, as if it were an image captured by its camera, receives the server response over WiFi, and passes it back over the USB cable for logging.
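The harness logic is simple enough to sketch. Below is a minimal Python outline, assuming adb is on the PATH; the activity name, device-side directory, and quality extra are hypothetical placeholders, not the exact identifiers of our app.

```python
import subprocess, time

APP_ACTIVITY = "com.example.edgeclient/.MainActivity"  # hypothetical activity name
DEVICE_DIR = "/sdcard/coco"                            # hypothetical device-side path

def adb(*args):
    """Run one adb command and return its stdout."""
    return subprocess.run(["adb", *args], capture_output=True, text=True).stdout

# Launch the client app, passing the JPEG quality as an integer extra
# (parameter name is illustrative).
adb("shell", "am", "start", "-n", APP_ACTIVITY, "--ei", "jpeg_quality", "75")

# Push validation images one by one; the app transmits each over WiFi and
# the responses are pulled back over USB for logging. In the real experiment
# this loop covers the 5000-image COCO validation set.
for name in ["000000000139.jpg", "000000000285.jpg"]:
    adb("push", f"coco/val2017/{name}", f"{DEVICE_DIR}/{name}")
    time.sleep(0.1)  # pace the experiment

# AP/AR from detection counts, per the definitions above.
def ap(tp, fp): return tp / (tp + fp)
def ar(tp, fn): return tp / (tp + fn)
```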
III. SYSTEM END-TO-END LATENCY
Our first goal is to measure each of the delay components involved in the procedure, and to investigate how they are affected by the encoding rate q and the NN size n, but also by the network setup (from the transport, to the data link and physical layers). Based on our findings we propose and evaluate network design choices that speed up the task completion.

A. Encoding Delay (T_enc)

The handset application converts its camera images to JPEG format before transmission to the server. We use JPEG as it is widely adopted and supported by the Android API. While image encoding is a typical step in such systems, its impact on the performance of edge-assisted object recognition has not received attention, with only few exceptions [2]. JPEG is a lossy format and its compression is decided by the encoding rate q. Note that we rely on the terminology of the compression library we employed in our system (for JPEG compression, through quantization, we used the Android library https://developer.android.com/reference/android/graphics/YuvImage) and define q ∈ [10, 100] as the percentage ratio of the compressed image size over its actual size, where q = 100 for an uncompressed image. At higher encoding rates, the number of discrete cosine transform coefficients that represent the JPEG image is larger, leading to an expected increase in the encoding delay. Indeed, Fig. 2a (upper plot) shows the encoding delay T_enc vs. the encoding rate q. It can be seen that T_enc grows from 5 ms to 11 ms as q increases from 25% to 100%. This also has an impact on the size of the compressed image, see Fig. 2a (lower plot).

B. Decoding and Pre-processing Delay (T_dec)

Upon receiving an image, the server (i) decompresses it to obtain an RGB image; (ii) re-samples/pads the image to match the input size n of the deep learning network; (iii) rotates the image to compensate for the handset camera orientation; and (iv) converts the pixel values from 0-255 integers to 0-1.0 floats. Our profiling indicates that most of this processing is limited by memory resources rather than the CPU. Hence, in our implementation we execute steps (i) and (ii) jointly so as to minimize memory movements and maximize the scope for in-processor caching; similarly, we designed our implementation to execute steps (iii) and (iv) simultaneously. Contrary to the encoding delay, this part of the processing depends both on the encoding rate and on the NN size. Fig. 2b plots measurements of the processing time vs. q and n. Observe that below a threshold encoding rate the latency is largely insensitive to q, i.e., it is dominated by the preprocessing steps other than image decompression. Similarly, the NN size n significantly affects T_dec only when it is very large (notice the sudden increase at the largest values of n). As we will see later, these findings create opportunities for optimizing the overall system operation.
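Both delay components can be probed offline with a few lines of Python. The sketch below uses Pillow and numpy rather than the Android encoder and our C/C++ server pipeline, so absolute timings will differ, but it exposes the same q and n knobs.

```python
import io, time
import numpy as np
from PIL import Image  # pip install Pillow

def encode(img: Image.Image, q: int) -> bytes:
    """Client side: JPEG-encode at rate q (T_enc)."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=q)
    return buf.getvalue()

def decode_preprocess(jpeg: bytes, n: int) -> np.ndarray:
    """Server side: decode, resize to the n x n NN input, and convert
    the 0-255 pixel values to 0-1.0 floats (T_dec)."""
    img = Image.open(io.BytesIO(jpeg)).convert("RGB").resize((n, n))
    return np.asarray(img, dtype=np.float32) / 255.0

img = Image.open("000000000139.jpg")  # any COCO image
for q in (25, 50, 75, 100):
    t0 = time.perf_counter(); jpeg = encode(img, q)
    t1 = time.perf_counter(); x = decode_preprocess(jpeg, 416)
    t2 = time.perf_counter()
    print(f"q={q:3d}: {len(jpeg)/1024:6.1f} KB, "
          f"T_enc={1e3*(t1-t0):5.1f} ms, T_dec={1e3*(t2-t1):5.1f} ms")
```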
Fig. 2: Time used for: (a) JPEG encoding (upper plot) and resulting image size (lower plot), (b) decoding and preprocessing, vs encoding rate q. Results are averaged over the entire COCO library (5000 images).

C. Transmission Delay (T_tx)

Next, we investigate the network impact on the task delay, and propose specific solutions that can effectively halve this time.
Fig. 3: (a) Wireless transmission delay using TCP vs JPEG encoding rate, (b) example time history of the NIC state on the mobile handset when power saving is enabled.

First, note that the size of the transmitted images varies between 20–250 KB, corresponding to roughly 13–166 packets (each 1500 B long). In contrast, the server response contains object bounding boxes and typically fits into a single packet. Hence, the network transmission delay is dominated by the time taken to transmit the image, and we expect that it will increase with the encoding rate q.

The solid line in Fig. 3a plots the transmission delay vs. q. This delay includes the time needed to send the image to the server and the time for transmitting back the response. The measurements are taken when TCP is used with default Android and Linux settings, i.e., Cubic congestion control and dynamic socket buffer sizing. As expected, the delay tends to increase with the JPEG quality (for larger q). However, for small q the delay is relatively insensitive to the encoding rate. Further investigation reveals that this insensitivity is mainly caused by two factors. First, the handset's power management aggressively puts the NIC into sleep mode, and this induces a delay to wake the NIC when transmission or reception restarts. Second, the dynamics of TCP congestion control mean that it takes multiple round-trip times to transmit all image packets. Next, we propose solutions for these two issues.
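The client-side timing behind the solid line can be approximated with a plain blocking socket. The sketch below is illustrative only: the server address, port, and 4-byte length-prefix framing are assumptions, and unlike our instrumented system it does not separate out the server's processing time.

```python
import socket, struct, time

SERVER = ("192.168.1.10", 5000)  # assumed edge-server address and port

def send_image_tcp(sock: socket.socket, jpeg: bytes) -> float:
    """Send one length-prefixed image and wait for the detection response.
    Returns the measured delay in ms; in the real system the server-side
    processing time is logged separately and excluded from T_tx."""
    t0 = time.perf_counter()
    sock.sendall(struct.pack("!I", len(jpeg)) + jpeg)  # 4-byte length prefix
    sock.recv(1500)  # bounding boxes + labels typically fit in one packet
    return (time.perf_counter() - t0) * 1e3

with socket.create_connection(SERVER) as s:  # TCP handshake incurred once
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # measurement hygiene
    jpeg = open("frame.jpg", "rb").read()
    print(f"round trip: {send_image_tcp(s, jpeg):.1f} ms")
```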
1) Handset NIC Wake-from-Sleep Latency:
When entering sleep mode, the handset's 802.11 NIC sends a special flagging frame to the AP, which then buffers any packets awaiting transmission until the handset signals it has woken up. Fig. 3b plots an example time history of the handset's NIC state, derived by extracting these state transitions from tcpdump data. (In this experiment a delay is inserted between inputs of each image to the Android app, to make the power-save behavior easier to see.) Also indicated in Fig. 3b are "active" periods where the NIC is awake and exchanges data with the server. Note that the NIC regularly enters the sleep state, waking up when the handset starts to send an image. As indicated by our measurements above, the handset can roughly predict when the next image transmission will occur: a newly captured image is transmitted approximately 5-10 ms after capture (the time for its encoding), and this could be used to preemptively wake up the NIC.

Solution: To investigate the potential latency gains of smart wake-up strategies, we adopted the cruder approach of using iperf to generate 1 Mb/s of background UDP traffic from the server to the client, keeping the handset's wireless interface awake. The dashed line in Fig. 3a shows that the overall transmit delay is now decreased for all values of q, consistent with the handset NIC no longer having to be woken up before transmitting the image. The delay reduction is approximately 5 ms for all encoding rates, which corresponds to a reduction of 50% in the wireless transmission delay.
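iperf served here as a convenient stand-in for a smarter wake-up strategy; the same effect can be approximated with a trivial keep-alive sender on the server. A sketch follows, where the handset address, port, and packet size are arbitrary choices:

```python
import socket, time

HANDSET = ("192.168.1.20", 6000)  # assumed handset address and port

def keepalive(rate_bps: int = 1_000_000, pkt_len: int = 1250) -> None:
    """Stream ~1 Mb/s of dummy UDP traffic to the handset so that its NIC
    never enters power-save sleep between image transmissions."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    interval = pkt_len * 8 / rate_bps  # 10 ms between 1250 B packets at 1 Mb/s
    payload = bytes(pkt_len)
    while True:
        sock.sendto(payload, HANDSET)
        time.sleep(interval)
```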
2) Latency Caused by TCP Dynamics: The upper plot in Fig. 4a shows the time history when transferring an image using TCP. The connection is kept open and used for sending multiple images, so that the overhead of the TCP handshake (SYN-SYNACK-ACK) is only incurred once (it takes 4 ms; not shown). The compressed image in this example is 31335 B in size and, when the HTTP request header is added, it occupies 22 TCP packets (the payload of a 1500 B TCP packet is 1448 B, including header overheads). Its transmission lasts 2.5 ms and uses 4 MAC frames for data and 3 for TCP ACKs. On average, 5.5 TCP data packets are therefore sent in each MAC frame. Observe that the client needs to receive TCP ACKs before it can send the full image, since the TCP congestion window (cwnd) limits the packets in flight to around 10 when starting a new transfer. Also, observe that there is contention between uplink and downlink due to the ACKs transmitted by the server.

Solution: We explore the gains from removing uplink/downlink contention, and the impact of the TCP cwnd, by modifying the Android client and server to use UDP. At the client side, an image is segmented and placed into a sequence of UDP packets, which are then sent to the socket back-to-back to facilitate aggregation by the NIC. The lower plot in Fig. 4a shows UDP measurements for transmission of the same image. Although the UDP packets fit within a single MAC frame (our system can aggregate up to 128 packets in one frame), we see that the transfer actually used 3 frames, presumably due to the scheduling delays between the kernel and the NIC, and the relative timing of channel access opportunities and packet arrivals. Nevertheless, we find that the data transfer time is now 0.8 ms (including the time needed to segment the image into UDP packets, so that the values are comparable with the TCP data), i.e., 3 times faster than with TCP. Finally, Fig. 4b plots measurements of the overall wireless transmission time (sending the image and receiving its response) for the full COCO data set when using TCP and UDP, with the mobile NIC power-save disabled.
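A minimal sketch of the client-side UDP bursting follows; the 4-byte fragment header is a hypothetical format for reassembly at the server, not the exact wire format of our implementation.

```python
import socket, struct

SERVER = ("192.168.1.10", 5001)  # assumed edge-server UDP port
MTU_PAYLOAD = 1400               # stay under the 1500 B MTU after headers

def send_image_udp(sock: socket.socket, image_id: int, jpeg: bytes) -> None:
    """Segment a JPEG into UDP packets and send them back-to-back, so the
    NIC can aggregate them into as few MAC frames as possible."""
    chunks = [jpeg[i:i + MTU_PAYLOAD] for i in range(0, len(jpeg), MTU_PAYLOAD)]
    for idx, chunk in enumerate(chunks):
        # header: image id, fragment index, total fragments (hypothetical format)
        hdr = struct.pack("!HBB", image_id, idx, len(chunks))
        sock.sendto(hdr + chunk, SERVER)  # no pacing: back-to-back on purpose

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_image_udp(sock, 1, open("frame.jpg", "rb").read())
```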
Fig. 4: (a) Time histories showing transfer of a compressed image from client to server using TCP (upper plot) and UDP (lower plot); markers indicate packet boundaries. (b) Wireless transmission delay for TCP and UDP vs JPEG encoding rate q, with mobile NIC power-save disabled.
Fig. 5: Server recognition delay (T_dl) for different NN sizes.

We find that using UDP packet bursting roughly halves the transmit time for all JPEG encoding rates. Concluding, in this subsection we showed that tailored transmission strategies, such as smart NIC power-saving and using UDP with packet bursting, reduce the transmit time to around 5 ms. This improvement is hugely important given the targeted E2E latency budgets (to achieve real-time frame update rates, such as 30 fps, the total available latency budget is only 33 ms).

D. Recognition Delay (T_dl) and Impact of Handheld

YOLO outputs the coordinates of the image's detected objects along with their labels. The recognition delay T_dl depends on the NN size, and our measurements in Fig. 5 show that it increases roughly quadratically with n. Other works have reported similar findings, e.g., see [7], [11], but their delays are considerably higher than our results, presumably due to the use of older GPU hardware. Furthermore, DeepMon [6] proposes NN optimizations on the mobile devices that reduce the delay to about 1 s for YOLO, which is still worse than our system's performance. These values may vary from system to system, but we expect the trend to persist qualitatively.

Similarly, we suspect that the handset hardware affects the results only slightly (i.e., quantitatively). To verify this, we repeat our experiments with two additional mobile devices. The delays that are directly related to the handset device, and may vary due to the different hardware specifications, are the encoding and transmission delays. Fig. 6 plots the total encoding and transmission delay measured for the three devices (Pixel 2, P10 Lite, Galaxy S8) for each encoding rate q (averaging over all dataset images). We find that, compared to the Pixel 2, the other two devices are slightly faster in image encoding, but also slower in transmitting. Such differences likely arise from the different chipsets/firmware implementations. Observe, however, that the roughly quadratic increase of both delay components persists across all devices as q increases. Hence, qualitatively the results hold for different hardware.

Fig. 6: Edge device delay comparison.

IV. PERFORMANCE TRADE-OFFS
Using our measurements above, we discuss here the interaction and trade-offs between the two performance metrics, i.e., accuracy and E2E delay, under a range of different scenarios. We discover that in several cases there are sharp trade-off curves, which create opportunities for improving the system operation by carefully tuning the parameters q and n.

Figures 7a-7b plot the object recognition average precision and recall vs the encoding rate q and the NN size n. We see that both metrics generally increase with q and n, although there is a sharp improvement going from n = 128 to n = 256. Moreover, as n drops, the precision and recall performance deteriorate and cannot be improved even if we use a high q (e.g., see the last row in each matrix). This finding differs from previous studies, e.g., [5], perhaps due to the COCO dataset, which contains images with a large range of object sizes.

We further study the impact of the object sizes on performance, while considering different detection thresholds (IoU values) [13]. In Fig. 7c we plot the precision and recall vs n and q for large and medium objects, averaged over a range of IoU values. We see that for large objects the accuracy increases rapidly with n but then plateaus, while for medium objects the benefits of a larger input size (and so higher image resolution) are greater and the accuracy plateaus only at larger n. Fig. 7d shows that the dependence on q, albeit not that strong, indeed follows a continuous increase. We note that the precision and recall values in these plots are relatively low because we use very high IoU thresholds (up to 0.95; we used the Python library CoCoApi for calculating these metrics, https://github.com/cocodataset/cocoapi/tree/master/PythonAPI/pycocotools). At n = 608 we already have satisfactory precision but also large delays.

Finally, we study the frame rate, i.e., the reciprocal of the E2E latency, for different NN sizes and image encoding rates. Fig. 7e presents the average frame rate for each scenario. Notice that for small NNs the encoding rate significantly affects the frame rate, but this effect is weaker for larger n. For example, when n = 608 the rate falls below 30 fps even for very small values of q. In other words, we find that in the low NN size regime, the accuracy gains from choosing a high encoding rate are not significant, while the frame rate gains of a low encoding rate are substantial. Hence, a low encoding rate is probably more suitable for a small NN. The opposite is true in the high NN size regime, where we can achieve substantial accuracy gains without significantly compromising the frame rate. These findings underline the importance of jointly selecting the values of the parameters n and q. The next section provides a systematic methodology towards that end.

V. DATA MODELS AND PARETO ANALYSIS
A. Fitting the Measurements
Our measurements indicate that the latency components and the accuracy can be approximated using quadratic functions of the decision variables n and q. Note that only the decoding delay T_dec and the precision f (we omit recall for brevity) depend on both n and q. On the other hand, the encoding and transmission delays, T_enc and T_tx, depend only on q, and the deep learning delay T_dl only on n. We therefore define:

T_enc(q) = α_0 + α_1 q + α_2 q²,   (1)
T_dec(n, q) = β_0 + β_1 n + β_2 q + β_3 nq + β_4 n² + β_5 q²,   (2)
T_tx(q) = γ_0 + γ_1 q + γ_2 q²,   (3)
T_dl(n) = δ_0 + δ_1 n + δ_2 n²,   (4)
f(n, q) = ε_0 + ε_1 n + ε_2 q + ε_3 nq + ε_4 n² + ε_5 q².   (5)

The model parameters are obtained by fitting our measurements to (1)-(5). Clearly, the exact values of these parameters can change if, for instance, we use a different access point or server. However, as our tests with the different handset devices have revealed, the changes are minimal and only quantitative.
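The coefficients follow from ordinary least squares on the measurement logs. A sketch with numpy is shown below; the (n, q, T_dec) triples are illustrative placeholders, not our logged measurements.

```python
import numpy as np

# Placeholder measurement log: one entry per experiment (n, q, observed value).
n = np.array([128, 256, 320, 512, 608, 256, 512], dtype=float)
q = np.array([25, 25, 50, 75, 100, 100, 50], dtype=float)
t_dec = np.array([3.1, 3.3, 3.6, 5.9, 8.2, 4.0, 5.5])  # ms, illustrative only

# Design matrix for the full quadratic model (2): [1, n, q, nq, n^2, q^2].
X = np.column_stack([np.ones_like(n), n, q, n * q, n**2, q**2])
beta, *_ = np.linalg.lstsq(X, t_dec, rcond=None)

def T_dec(n_, q_):
    """Evaluate the fitted decoding/pre-processing delay model (2)."""
    return beta @ np.array([1.0, n_, q_, n_ * q_, n_**2, q_**2])

print(T_dec(416, 60))
```

The single-variable models (1), (3), (4) are fitted the same way with the design matrix reduced to [1, q, q²] or [1, n, n²].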
We leverage the above models to explore the interaction of the decision variables:

n ∈ N ≜ {n ∈ [128, 608] : mod(n, 32) = 0},  q ∈ Q ≜ [10, 100],

i.e., we study how they jointly affect the precision and the frame rate (E2E latency), while we also derive the Pareto fronts for these two performance criteria by following a detailed parameter-sensitivity analysis. We formulate two optimization problems: P1, where we maximize the precision subject to achieving a minimum frame rate; and P2, where we maximize the frame rate while not dropping the precision below a threshold value. (The handsets affect only the values of the parameters {α_i}_i and {γ_i}_i.) Formally, the two problems can be written as:
P1:  maximize_{n ∈ N, q ∈ Q}  f(n, q)   (6)
     s.t.  T_total(n, q) ≤ T_max   (7)

P2:  minimize_{n ∈ N, q ∈ Q}  T_total(n, q)   (8)
     s.t.  f(n, q) ≥ f_min,   (9)

where we have defined T_total(n, q) = T_enc(q) + T_dec(n, q) + T_tx(q) + T_dl(n), and T_max is the highest tolerable delay in order to achieve a frame rate of 1/T_max fps. Respectively, f_min is the target precision requested by the user. In essence, constraint (7) ensures that the total delay does not exceed T_max, and hence the frame rate 1/T_total will be greater than or equal to the threshold 1/T_max. Similarly, in P2 we maximize the frame rate by minimizing T_total. Using both problem formulations we are able to highlight the trade-offs between delay and precision.

Fig. 7: Performance trade-offs (power-save disabled; UDP). (a-b) Precision and recall (IoU = 0.5) vs n and q. (c-d) Precision and recall for medium and large objects vs n for uncompressed images (in (c)), and vs q with fixed n = 512 (in (d)). (e) Frame rate vs NN size n and encoding rate q. All results are averaged over all images and all IoUs in [0.5, 0.95].

Fig. 8a plots the values of n and q that maximize the precision while keeping the frame rate at or above the value indicated on the x-axis (recall that n is a multiple of 32). The achieved precision for each frame rate is displayed with a solid line in Fig. 8b. Observe how an increasing frame rate dictates a drop in the NN size and encoding rate, which in turn results in decreasing precision. Moreover, we observe that the NN size continuously drops or stays level as the frame rate grows, while the encoding rate can increase in some cases; this occurs when the NN size has been reduced, so that an increase of the encoding rate can sustain a higher precision. Notice that for the largest range of frame rates the NN size can be kept quite high (around and above 400), even when exceeding 30 fps. This yields a satisfactory precision at 40 fps (recall that we obtain low precision values because we deliberately used very high IoU thresholds; for more typical thresholds the precision is much higher). However, beyond the 40 fps threshold the NN size has to be very small to facilitate fast object recognition, and the precision drops dramatically.

To highlight the impact of our optimized networking configuration, we compare this performance with the respective results of a non-optimized (vanilla) system, shown with a dashed line in Fig. 8b.
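Since N is a small finite set and Q can be discretized, both problems can be solved by exhaustive search over the fitted models; the sketch below illustrates this for P1 (we do not prescribe a particular solver, and f and T_total here are the fitted functions of Sec. V-A).

```python
import itertools

N_SIZES = range(128, 609, 32)  # n: multiples of 32 in [128, 608]
Q_RATES = range(10, 101, 5)    # q: discretized encoding rates in [10, 100]

def solve_p1(f, T_total, T_max):
    """P1: maximize precision f(n, q) subject to T_total(n, q) <= T_max.
    Returns the optimal (n*, q*), or None if the problem is infeasible."""
    feasible = [(n, q) for n, q in itertools.product(N_SIZES, Q_RATES)
                if T_total(n, q) <= T_max]
    if not feasible:
        return None  # e.g., frame-rate targets beyond the system's reach
    return max(feasible, key=lambda p: f(*p))

# Example: best (n, q) for a 30 fps target, i.e. T_max = 1000/30 ms.
# best = solve_p1(f, T_total, T_max=1000 / 30)
```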
Fig. 8: (a) Optimal NN size and encoding rate for the desired frame rate. (b) Corresponding maximal precision values.
Fig. 9: (a) Optimal NN size and encoding rate for the target accuracy. (b) Corresponding maximal frame rate values.

Namely, these results were obtained by fitting the non-optimized (TCP, power-save enabled) wireless transmission delay measurements to (3) and solving P1. Clearly, the increased transmission delays hamper the ability of the system to achieve high precision at acceptable frame rates (the precision drops by 33% at 30 fps). Moreover, P1 becomes infeasible for a target frame rate above 34 fps, indicating the greater range in which the system can operate after configuring the network. The respective results for P2 are displayed in Figs. 9a and 9b. The optimal frame rate can be kept very close to 30 fps, even for a very high target precision. Also, we observe a huge gap between the optimized and non-optimized solutions in this case, with the former achieving up to 93% higher frame rate than the latter when the target precision is very low.

VI. RELATED WORK
Deep Learning with Compressed Images. The impact of image compression on recognition accuracy has started to receive attention in the deep learning literature, see the seminal paper [5] and follow-up works, but this aspect of performance remains relatively poorly understood. Most attention has focused on developing new compression approaches tailored to deep learning, e.g., see [3], [14]. The authors in [2] explore the effect of the image compression rate on the object detection accuracy. To the best of our knowledge, however, the system-level trade-offs between E2E latency and deep learning accuracy introduced by the use of image compression have not been previously explored.
Edge-Assistance. JAGUAR [15] and Glimpse [4] are edge-assisted, real-time object recognition systems. They both use object tracking to reduce the number of recognitions, but do not use state-of-the-art deep learning techniques for object recognition. [11] proposes a solution for deciding the execution location of augmented reality tasks, either on the mobile device or on an edge server. The idea of distributing the neural network layers among different tiers of the network architecture is demonstrated in [16], [17]: the devices, based on their computation resources, execute smaller or larger parts of the NN towards increasing the accuracy of inferences with tolerable execution and network delays. In [8] the authors propose a framework for distributing deep learning sub-processes to edge, cloudlet and cloud nodes towards increasing the job execution rate of the system. [3] presents an augmented reality object detection system that leverages an edge server, as well as object tracking and image encoding, to improve latency. The above works indicate the necessity of edge architectures for improving the E2E latency of delay-sensitive services.
Accuracy/Latency Trade-off. JALAD [18] proposes decoupling a Deep Neural Network (DNN) between edge and cloud towards minimizing latency with execution accuracy guarantees. Overlay [9] presents an augmented reality system for mobile devices, assisted by a GPU-enabled server, that is designed to minimize the tracking error. MobiQoR [19] studies the trade-off between delay and Quality of Result for edge applications that involve machine learning and analytics, such as face recognition; the authors show that sacrificing computation result quality can decrease delay as well as energy consumption. LAVEA [12] proposes a system for offloading data analytics computations to nearby edge nodes; the formulated optimization problem makes offloading and bandwidth allocation decisions towards minimizing latency. DeepDecision [7] is a video analytics system that balances accuracy and latency by properly adjusting the camera sample rate, video encoding rate, and deep learning model; however, both its transmission and processing delays are much higher than the ones obtained by our system. DeepMon [6] distributes the execution of a large DNN across multiple mobile GPUs to reduce latency; it focuses on DNN optimizations, instead of the network-centric analysis presented in our work. All the above works highlight the inherent trade-off between latency and accuracy in edge architectures. Our work, however, goes beyond that by proposing important delay-reducing modifications that readily enable real-time performance for object recognition.
VII. CONCLUSIONS
We develop an edge-assisted object recognition system and show that careful network transmit and power-save strategies can significantly reduce the wireless transmission delay. We find that the level of image compression, as well as the dimension of the deep learning network used, are key design parameters, affecting both end-to-end latency and object recognition accuracy. We demonstrate how our measurements can be used to choose these design parameters so as to optimally trade off execution delay against accuracy.
ACKNOWLEDGMENTS
The authors would like to thank Domenico Giustiniano for helpful input and discussions during development of the system. This publication has emanated from research supported in part by SFI research grants 17/CDA/4760 and 16/IA/4610, and is co-funded under the European Regional Development Fund under Grant Number 13/RC/2077.
REFERENCES

[1] G. Ananthanarayanan et al., "Real-time video analytics: The killer app for edge computing," Computer, vol. 50, no. 10, pp. 58–67, 2017.
[2] J. Ren, Y. Guo, D. Zhang, Q. Liu, and Y. Zhang, "Distributed and efficient object detection in edge computing: Challenges and solutions," IEEE Network, vol. 32, no. 6, pp. 137–143, 2018.
[3] L. Liu, H. Li, and M. Gruteser, "Edge assisted real-time object detection for mobile augmented reality," in Proc. of ACM MobiCom, 2019.
[4] T. Y.-H. Chen et al., "Glimpse: Continuous, real-time object recognition on mobile devices," in Proc. of ACM SenSys, 2015.
[5] S. Dodge and L. Karam, "Understanding how image quality affects deep neural networks," in QoMEX, 2016.
[6] L. N. Huynh, Y. Lee, and R. K. Balan, "DeepMon: Mobile GPU-based deep learning framework for continuous vision applications," in Proc. of ACM MobiSys, 2017.
[7] X. Ran et al., "DeepDecision: A mobile deep learning framework for edge video analytics," in Proc. of IEEE INFOCOM, 2018.
[8] M. Ali et al., "Edge enhanced deep learning system for large-scale video stream analytics," in Proc. of IEEE ICFEC, 2018.
[9] P. Jain, J. Manweiler, and R. Roy Choudhury, "OverLay: Practical mobile augmented reality," in Proc. of ACM MobiSys, 2015.
[10] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv, 2018.
[11] X. Ran, H. Chen, Z. Liu, and J. Chen, "Delivering deep learning to mobile devices via offloading," in Workshop on Virtual Reality and Augmented Reality Network, 2017.
[12] S. Yi et al., "LAVEA: Latency-aware video analytics on edge computing platform," in Proc. of ACM/IEEE Symposium on Edge Computing, 2017.
[13] T. Lin et al., "Microsoft COCO: Common objects in context," arXiv, vol. abs/1405.0312, 2014.
[14] Z. Liu et al., "DeepN-JPEG: A deep neural network favorable JPEG-based image compression framework," CoRR, vol. abs/1803.05788, 2018.
[15] W. Zhang, B. Han, and P. Hui, "Jaguar: Low latency mobile augmented reality with flexible tracking," in Proc. of ACM Conference on Multimedia, 2018.
[16] C. Lo, Y. Su, C. Lee, and S. Chang, "A dynamic deep neural network design for efficient workload allocation in edge computing," in Proc. of IEEE ICCD, 2017.
[17] S. Teerapittayanon, B. McDanel, and H. T. Kung, "Distributed deep neural networks over the cloud, the edge and end devices," in Proc. of IEEE ICDCS, 2017.
[18] H. Li et al., "JALAD: Joint accuracy- and latency-aware deep structure decoupling for edge-cloud execution," CoRR, vol. abs/1812.10027, 2018.
[19] Y. Li, Y. Chen, T. Lan, and G. Venkataramani, "MobiQoR: Pushing the envelope of mobile edge computing via quality-of-result optimization," in Proc. of IEEE ICDCS, 2017.