AQuA: Analytical Quality Assessment for Optimizing Video Analytics Systems
Sibendu Paul∗, Purdue University, West Lafayette ([email protected])
Utsav Drolia†, NEC Laboratories America, San Jose ([email protected])
Y. Charlie Hu, Purdue University, West Lafayette ([email protected])
Srimat T. Chakradhar, NEC Laboratories America, San Jose ([email protected])
∗ Work mostly done as an intern at NEC Laboratories America.
† Work done when Utsav Drolia was a Researcher at NEC Laboratories America.
ABSTRACT
Millions of cameras at the edge are being deployed to power a variety of different deep learning applications. However, the frames captured by these cameras are not always pristine: they can be distorted due to lighting issues, sensor noise, compression, etc. Such distortions not only deteriorate visual quality, they also impact the accuracy of deep learning applications that process such video streams. In this work, we introduce AQuA to protect application accuracy against such distorted frames by scoring the level of distortion in the frames. It takes into account the analytical quality of frames, not the visual quality, by learning a novel metric, classifier opinion score, and uses a lightweight, CNN-based, object-independent feature extractor. AQuA accurately scores distortion levels of frames and generalizes to multiple different deep learning applications. When used for filtering poor-quality frames at the edge, it reduces high-confidence errors for analytics applications by 17%. Through filtering, and due to its low overhead (14 ms), AQuA can also reduce computation time and average bandwidth usage by 25%.
1 INTRODUCTION
Video camera deployments are increasing rapidly, powering applications like city-scale traffic analytics, security, and retail analytics. A recent report estimated the market size of video analytics to be $4.10 billion in 2020, and $20.80 billion by 2027 [42]. A CNBC study reported that by 2021, about one billion surveillance cameras will be ensuring our safety and security [8]. Figure 1 illustrates how such cameras can continuously capture high-resolution video of the real world and transmit it to application services running on nearby edge computing nodes or on the cloud. The exponential growth of camera deployments and video analytics applications can be attributed to two main reasons: deep learning, which is enabling accurate computer vision applications [35], and 5G, which is making low-latency and high-bandwidth communication possible [9, 53].
Several factors impact the quality of data acquired by the cameras. In-camera distortions are introduced by camera hardware or on-board software during the video capture process. Such distortions include texture distortions, artifacts due to exposure and lens limitations, and focus and color aberrations. Factors such as lighting (low light, glare, and haze), noise sensitivity, acquisition speed, camera setup, and camera shake can also adversely affect a video's perceived visual quality. Some distortions, such as exposure- and color-related distortions, occur more frequently than others. Different forms of distortion can also be introduced post-capture. For example, video compression (such as H.264, MPEG, or VP9 encoding) is lossy, and transmission over IP networks or wireless networks is error-prone; both can introduce distortions that adversely affect the perceived video quality. Note that all these distortions occur naturally in video acquisition and transmission; they are not introduced by an adversary. Irrespective of the source of distortions, both content and network providers are deeply invested in finding better ways to monitor and control video quality.
Figure 1: Large-scale, real-time video analytics deployment.
Figure 2: Impact of distortions on accuracy. (a) Original: Schnauzer (0.853); (b) Blurred: Irish Terrier (0.416); (c) Over-exposed: Rottweiler (0.181).
Designing reliable predictive models and algorithms for detecting and ameliorating distortions is of great interest [18, 63].
Figure 2 shows a few examples of distortions (for more examples of distortions, please see Figure 2 in [18]). These distortions do lower the perceived image quality (as observed by a human being), but more importantly, they adversely affect the accuracy of video analytics applications. Figure 2 shows examples of the adverse effects that distorted images can have on a state-of-the-art image classifier, an ImageNet-trained ResNet-101 (we chose ResNet because recent computer vision models rely on DCNNs and use such image classifiers as the base network). Although the original version of the image is correctly classified, slight distortions of the image result in mis-classifications. For example, Figure 2b has minuscule motion blur, but the classifier is confidently wrong (there are 1000 classes, and the classifier has a 41.6% confidence in its mis-prediction). These high-confidence errors adversely impact the accuracy of the application, and they cannot be filtered out from further consideration by using simple thresholds on prediction confidence. In Figure 2c, although the classifier again predicts incorrectly, it is not confident in its prediction. This can be filtered out by using suitable thresholds on prediction confidence. However, the distorted image had to be transmitted from the camera to the application service and processed by the computationally expensive deep learning classifier, only to be filtered out due to low confidence. In a typical edge-assisted video analytics system, there are multiple analytics applications simultaneously analyzing a video stream (as shown in Figure 3), and the negative consequences of wasted computation or high-confidence errors due to low-quality input frames do snowball.
In this paper, our goal is to detect and score poor-quality frames from video streams as early as possible, ideally immediately after capture at the edge (either the edge node or the edge camera in Figure 1), and enable other actions so that the overall accuracy of the applications increases while compute and network resource usage decreases. In the present work, we filter these poor-quality frames. In our ongoing work, we are exploring other actions like alerting operators about the quality, establishing a feedback loop with the camera to dynamically adapt its settings to improve quality, marking frames as prospects for future fine-tuning, etc.
At first glance, the obvious approach to detect these low-quality frames is to use state-of-the-art image quality assessment (IQA) tools [28, 31, 45, 46, 60, 76] that score the perceptual quality of an image. However, as we show later (Section 3.2), these IQA models' [43, 69] image quality assessments do not align with a classifier's assessment of the "quality", as evidenced by the classifier's confidence in the correct class. Motivated by this finding, and inspired by the human opinion score used by IQA models, we introduce the notion of classifier opinion score, which captures the classifier's assessment of the image quality. Armed with the classifier's opinion score, we present AQuA, an analytical quality assessor for images. Our approach judges whether a frame is good enough for further analytics, and assigns a quality score accordingly.
We also construct a filtering system based on AQuA that can identify, flag, and/or discard distorted frames immediately after capture, or after video compression and transmission.
In this paper, we make the following contributions:
(1) We show empirically that image quality assessments from state-of-the-art IQA models do not align with a classifier's assessment of image quality, thereby leading to mis-classification and high-confidence errors.
(2) We propose two new metrics, Mean Classifier Opinion Score (MCOS) and its semi-supervised version MCOS_SS, which are used for training a new, deep-learning-based analytical quality assessor. To our knowledge, this is the first time that a classifier's notion of image quality has been defined, quantified, and used to train an effective image quality assessment model.
(3) We design and train a new, lightweight feature extractor, which leverages early layers of pre-trained image classifiers to quickly learn low-level image features. Our model is faster than state-of-the-art image classifiers, which makes it a good fit for resource-constrained mobile, embedded, or edge processing environments.
(4) We implement AQuA, a deep-learning model that leverages classifier opinion scores to estimate a frame's analytical quality. To our knowledge, this is the first system to explicitly consider a classifier's assessment of image quality, and thus improve any real-time video analytics pipeline.
(5) We conduct multiple evaluations of AQuA to show its accuracy and efficacy as a filter, and quantify its impact on application accuracy and resource usage.
Figure 3: Multiple analytics performed on the same video stream (Mall).
AQuA enables filtering frames with high precision and recall compared to existing IQA models. It can be used for multiple computer vision applications such as object detection, instance segmentation, and pose estimation. When evaluated on a real-world application (face recognition), it can reduce false positives by 17% (3x more than BRISQUE). By filtering low-quality frames, coupled with its low overhead of only 14 ms, AQuA reduces computation and communication resource requirements for both edge-only and edge-cloud systems.
Visual distortions manifest as noise, artifacts, or loss of detail in a frame. These distortions can occur due to multiple factors and are grouped into two broad categories depending on when the distortion occurs: (1) image acquisition and (2) image transmission. During image acquisition, distortions can happen due to incorrect camera settings, such as focus (focal blur), exposure settings (over- or under-exposure), or shutter speed (motion blur). Cameras using a low-quality sensor may add Gaussian noise [15] to the frame, or cause sparse but intense disturbances in low light.
For efficient image transmission, raw frames need to undergo compression, such as H.264, MJPEG, and HEVC. These compression algorithms are typically lossy and can induce artifacts like blocking and blurring. The effects of such distortions can be seen in Figure 4.
Image Quality Assessment (IQA) techniques are used to score the visual quality of images. They typically take the form of a machine learning model, which is trained to estimate a human observer's opinion of a given input image. The training data for these models includes original and distorted images (X), and the human opinion score for each image (Y). Human observers rate the difference between the original image and its distorted version as a score; a large opinion score implies a higher level of distortion in the image under consideration. TID2013 [50], AVA [47], and LIVE [64] are a few commonly used training datasets. These models employ a two-stage framework: feature extraction followed by regression. Figure 5 shows how a typical IQA model is trained and used.
Early IQA algorithms used natural scene statistics (NSS) based feature extraction, which encodes the resultant distribution from image filters. These include BIQI [45], DIIVINE [46], BLINDS-II [60], GMLOG [76], and BRISQUE [43]. CORNIA [78] was the first to propose that image features can be learnt directly from raw pixels using CNNs. CORNIA's success motivated other CNN-based algorithms such as [31], ILGNet [28], and Neural Image Assessment (NIMA) [69].
Image classification is regarded as a basic computer vision task, where the input image is classified according to what object(s) it contains. With the advent of convolutional neural networks [36] combined with deep learning [35], models have now started to surpass human-level accuracy [65] for image classification on large datasets [11].
Figure 4: Types of distortions and their visual impact. (a) Original image; (b) Over-exposed; (c) Under-exposed; (d) Low contrast; (e) High contrast; (f) Motion blur; (g) Compression artifact; (h) Low-light noise; (i) Defocus blur; (j) Gaussian noise.
Figure 5: Training a typical IQA model.
These high-accuracy classifiers are also used as backbones for other computer vision tasks such as object detection [55, 56], pose estimation [44], and instance segmentation [52]. Thus, we use image classification as a running example of a basic unit that any video analytics application might have.
We first show that distortions not only adversely affect the perceptual quality of frames (as observed in Figure 4), but also cause classifiers to make errors. Then, we show that traditional image quality assessment, based on perceptual quality, is not up to the task of identifying distorted images on which a classifier would falter.
Section 1 showed how an image classifier falters, with both a high-confidence error and a low-confidence error, in the presence of minor distortions.
Figure 6: Effect of distortions on accuracy. Top-1 and top-5 accuracy vs. distortion index for (a) Brightness, (b) Motion blur, (c) Compression artifact, and (d) Defocus blur.
To further understand the impact of distortions, we conducted experiments with several image classifiers on a distorted-image dataset, which we created by distorting images from ImageNet. Multiple types of distortions were used, and we also varied the degree of each type of distortion. We use the top-1 and top-5 accuracy to understand the impact of distortions on classifiers. Figure 6 shows how accuracy is affected by different types and degrees of distortions.
Figure 7: Correlation between quality scores from state-of-the-art IQA models (BRISQUE, NIMA) and the classifier's confidence of the correct class, per distortion type. "All" signifies the cumulative result of considering all the different types of distortions.
We observe that for any distortion type, as we increase the degree of distortion, the accuracy drops. For example, consider Figure 6(a). As we vary the brightness index to be either greater than 1 (over-exposed) or less than 1 (under-exposed), the classifier's accuracy drops from 90%, with a precipitous drop observed for a brightness index beyond 1.25. In contrast, a slight increase in the degree of motion blur or defocus blur leads to an almost linear drop in the classifier's accuracy (Figure 6(b) and (d)). So, the classifier is particularly sensitive to even slight motion or defocus blur, while it can tolerate a modest increase or decrease in brightness-related distortion. Our results show that both the type and the degree of distortion matter, and classifiers can tolerate some types of distortion better than others. This implies that any analytical quality assessor must capture the differential impact of such distortions. Our experiments with four other popular image classifiers, and with more distortion types, show similar trends.
In this section, we examine the correlation between perceptual and analytical quality. We use state-of-the-art IQA techniques, NSS-based BRISQUE [43] and CNN-based NIMA [69], which are expressly designed to detect visual quality degradation, to quantify visual quality. We obtain analytical quality scores from a classifier for a common set of images with various types and degrees of distortion. Here, we define the analytical quality of an image as the confidence of the correct class (CCC) as observed at the softmax layer of a classifier.
Figure 7 shows the correlation between the perceptual quality scores from each IQA technique and the analytical quality scores from the classifier, for different types of distortions. For example, consider the two bars for motion blur. For BRISQUE, our experiments show that the absolute Spearman correlation is 0.1, which suggests a weak correlation. Similarly, for NIMA, we see a very weak correlation. We observed a higher correlation between the quality scores of NIMA and the classifier (0.5) for compression distortions. Experiments with four other classifiers showed a similar trend. So, we empirically conclude that there is a weak correlation between visual quality and analytical quality, and the extent of correlation depends on the type and degree of distortion. This weak correlation implies that IQA methods are poor estimators of the analytical quality of images.
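To make the comparison concrete, the correlation shown in Figure 7 can be computed as sketched below. This is a minimal illustration, assuming per-image perceptual quality scores (e.g., from BRISQUE) and the classifier's logits are already available; the helper names are ours, not taken from the paper's code.

```python
# Sketch: correlating perceptual quality scores with a classifier's
# confidence of the correct class (CCC), as in Figure 7.
import numpy as np
from scipy.stats import spearmanr

def ccc_from_logits(logits: np.ndarray, true_class: int) -> float:
    """Confidence of the correct class, read off the softmax layer."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[true_class])

def abs_spearman(iqa_scores, ccc_scores) -> float:
    """Absolute Spearman rank correlation between the two score lists."""
    rho, _ = spearmanr(iqa_scores, ccc_scores)
    return abs(rho)
```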
Perceptual IQA methods have two essential parts: a feature extractor, which captures important aspects of the image from a perceptual quality point of view, and a regressor, which assigns a quality score. Inspired by the perceptual IQA design, we hypothesize that a good analytical quality assessor should have the following desirable properties:
• The feature extractor of the assessor should extract features that are representative of the image features that a classifier typically considers. Note that it does not have to classify images like a classifier does, so its feature extraction process does not have to learn the higher-level features that are necessary for image classification.
• Any analytical quality assessor must show strong correlation with a classifier's notion of image quality. Therefore, it should consider a classifier's opinion, rather than the perceptual opinion of a human observer. Accordingly, the regressor in the assessor should produce quality scores that correlate well with a classifier's notion of image quality.
• The analytical quality assessor must be efficient, with inference speeds that are much higher (10x or better) than a classifier, and a model size that is significantly smaller. This will ensure that the assessor can be used in resource-constrained mobile, embedded, and edge processing environments.
We now describe the design of our proposed analytical quality assessor to satisfy these properties.
As mentioned in Section 2.2, training a perceptual IQA model requires images (original and distorted) and the human opinion score for each image. The perceptual IQA model then learns the mapping between each image and its human opinion score.
Figure 8: Semi-supervised and supervised Classifier Opinion Score.
Our insight is that by replacing the human opinion score with the classifier's opinion score, we can dramatically improve analytical quality assessment. To this end, we discuss two ways of computing a classifier opinion for an image. Using similar ideas, one can devise more elaborate quality scores from the results of a classifier, and the proposed approach would still be applicable to train an effective analytical quality estimator.
Just as the human opinion is based on scoring the visual differences between original and distorted images, the classifier opinion should also depend on the differences between the original and distorted images. We consider the sum of the correct class confidence (CCC) and a normalized correct class rank (NCCR) as an indicator of the analytical quality of an image. The sum takes into account both the correctness and the confidence of the classification. NCCR maps the correct class rank (CCR) to a real number between 0 and 1, where the last rank tends to 0: if the number of classes is $N$, then $NCCR = (N - CCR)/N$. Computing this sum requires knowledge of the true class of the image, which makes this approach a supervised method. We define the classifier opinion score (COS) as the difference between the sums for the original image and its distorted version (as shown in Figure 8). Other linear combinations of CCC and NCCR can also be used as the classifier opinion score.
To attain a more robust opinion, we use several different classifiers and compute the mean of the COS scores for the image across the different classifiers. We refer to the mean as MCOS, which is computed using Equation 1:

$$\mathrm{MCOS} = \frac{1}{|M|} \sum_{i \in M} \Big[ (\mathrm{CCC} + \mathrm{NCCR})^{i}_{\mathrm{org}} - (\mathrm{CCC} + \mathrm{NCCR})^{i}_{\mathrm{dist}} \Big] \qquad (1)$$

As mentioned above, computing MCOS requires labeled images. Intuitively, this seems like an excessive requirement, since the goal is not to learn anything specific to a class, but only to capture the classifier's perceived image quality. Therefore, we also propose a semi-supervised approach, which does not require labeled data. We use the entire softmax output of the classifier for a given input image as an indicator of the image quality. Based on the softmax output, we define a new, semi-supervised classifier opinion score (COS_SS), which is the distance between the softmax outputs for the original image and its distorted version. The intuition is that the softmax output for original images tends to be a unimodal distribution across the classes, with a strong peak at the correct class; as the image gets distorted, this distribution tends towards uniformity because the classifier can no longer discern classes strongly. The distance between softmax outputs can be calculated using a number of different methods: KL divergence [33], Mean Absolute Difference (MAD), L1/L2 norms, Bhattacharyya distance [4], and JS divergence [30]. Figure 8 shows how the semi-supervised COS is calculated.
Like the supervised case, we use several different classifiers to get a better, more robust classifier opinion. We compute the mean of the COS_SS values for the image across the different classifiers (see Equation 2), and refer to this mean as MCOS_SS, the classifier opinion score for the semi-supervised case:

$$\mathrm{MCOS}_{SS} = \frac{1}{|M|} \sum_{i \in M} D\big(\mathrm{Softmax}^{i}_{\mathrm{org}},\, \mathrm{Softmax}^{i}_{\mathrm{dist}}\big) \qquad (2)$$
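A minimal sketch of Equations (1) and (2) is shown below, assuming the softmax outputs of each classifier in the bank M have already been computed for the original and distorted image. The function names are illustrative, and MAD is used as the distance D for the semi-supervised score.

```python
# Sketch of Equations (1) and (2): supervised MCOS and semi-supervised
# MCOS_SS over a bank M of classifiers.
import numpy as np

def cos_supervised(p_org, p_dist, true_class):
    """COS for one classifier: (CCC + NCCR)_org - (CCC + NCCR)_dist."""
    n = len(p_org)                                    # number of classes N
    def ccc_plus_nccr(p):
        ccc = p[true_class]                           # confidence of correct class
        rank = int(np.argsort(-p).tolist().index(true_class)) + 1  # CCR, 1 = best
        nccr = (n - rank) / n                         # normalized correct class rank
        return ccc + nccr
    return ccc_plus_nccr(np.asarray(p_org)) - ccc_plus_nccr(np.asarray(p_dist))

def cos_semi_supervised(p_org, p_dist):
    """COS_SS for one classifier: distance D between softmax outputs (MAD here)."""
    return float(np.mean(np.abs(np.asarray(p_org) - np.asarray(p_dist))))

def mcos(softmax_pairs, true_class=None, supervised=True):
    """Mean COS over the bank; softmax_pairs is [(p_org, p_dist), ...]."""
    if supervised:
        scores = [cos_supervised(o, d, true_class) for o, d in softmax_pairs]
    else:
        scores = [cos_semi_supervised(o, d) for o, d in softmax_pairs]
    return float(np.mean(scores))
```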
Perceptual IQA methods use feature extractors like NSS (natural scene statistics) to extract the features that a human observer would use to assess perceptual quality. Similarly, an analytical quality assessor should extract the image features that classifiers would use. We observe that almost all off-the-shelf, high-accuracy classifiers share the following:
(1) They use convolution and pooling layers to extract local features. Convolution layers use multiple, small filters within a patch of the image [20, 29].
(2) The first few layers extract low-level features like edges, shapes, or stretched patterns [20, 29].
We could use all the convolution and pooling layers of a classifier for analytical quality assessment, but such a model would have high inference overhead, and it might capture features that are unnecessary for analytical quality. Instead, we propose to use a truncated network, with only the layers that capture low-level, object-independent features. This lowers the overhead of feature extraction in our analytical quality estimator, allowing for real-time performance. By using a pre-trained feature extractor, we also dramatically lower training times.
Figure 9: Training and running AQuA.
Figure 9 shows how our analytical quality assessor is built. All the desirable properties described earlier are captured in the different components of the design. The feature extractor is a shallow, CNN-based, pre-trained network. It is lightweight and it extracts exactly the same features that analytical applications use to make decisions. The regressor is a fully-connected layer that is trained on classifier opinions of images, using the MCOS or MCOS_SS scores.
Following recent advancements in edge architectures for AI [38], edge deployments generally contain GPUs or neural accelerators. Here, all performance experiments are conducted on an NVIDIA GeForce RTX 2070 GPU.
Type of distortion       Range of distortion
Brightness               [0.1, 5]
Contrast                 [0.1, 5]
Motion blur              [5, 30]
Compression artifact     [20, 50]
Focal blur               [1, 20]
Gaussian noise           [0.05, 0.5]
Low-light noise          [1, 100]
Table 1: Type and range of distortions
We train AQuA using images from the validation set of the ImageNet ILSVRC-2017 dataset [11] and their distorted versions. This dataset has 21K original images over 120 different classes. All of these images undergo 7 different types of distortions; for each distortion, 6 degrees of distortion are applied, uniformly sampled from within a fixed range. The different distortion types and their ranges are shown in Table 1. 80% of the images from each class are used for training, reserving 20% for testing. Examples of the images are shown in the Appendix.
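The following sketch illustrates how such a distorted training set can be generated. The paper does not specify its exact distortion implementations, so Pillow-based brightness adjustment, Gaussian blur (standing in for defocus blur), and JPEG re-encoding are used here as illustrative stand-ins for a subset of Table 1.

```python
# Sketch: generating distorted variants of an image at uniformly sampled
# degrees within the Table 1 ranges (subset shown, implementations assumed).
import io
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

DISTORTION_RANGES = {            # subset of Table 1
    "brightness": (0.1, 5.0),
    "focal_blur": (1.0, 20.0),
    "compression_artifact": (20, 50),
}

def distort(img: Image.Image, kind: str, level: float) -> Image.Image:
    if kind == "brightness":
        return ImageEnhance.Brightness(img).enhance(level)
    if kind == "focal_blur":
        return img.filter(ImageFilter.GaussianBlur(radius=level))
    if kind == "compression_artifact":
        buf = io.BytesIO()
        # Higher index -> stronger artifact, i.e. lower JPEG quality (assumption).
        img.save(buf, format="JPEG", quality=max(1, 100 - int(level)))
        buf.seek(0)
        return Image.open(buf).convert("RGB")
    raise ValueError(kind)

def distorted_versions(img: Image.Image, kind: str, degrees: int = 6):
    lo, hi = DISTORTION_RANGES[kind]
    return [distort(img, kind, lvl) for lvl in np.linspace(lo, hi, degrees)]
```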
Figure 10: AQuA model architecture.
To obtain MCOS and MCOS_SS, these training images need to be passed through a bank of classifiers. We use 5 classifiers with the lowest top-1 error on ImageNet validation images [51]: DenseNet-121, ResNeXt-101, Wide ResNet-101, Inception-v3, and VGG-19.
Different distortions manipulate local statistics at different granularities [68]. For example, exposure affects coarse textures, while motion blur or defocus blur affects finer textures. In convolutional layers, larger kernel sizes focus on global textures, while stacked convolutional layers extract fine-grained local features. To capture all these granularities, we use the Inception module from Inception-v3 [67], which has convolutional layers with diverse kernel sizes (i.e., 1x1, 3x3, and 5x5) in parallel. We build the feature extractor for AQuA using the layers up to the first Inception module in Inception-v3, followed by a pooling layer. This is followed by fully-connected layers for regression, as shown in Figure 10.
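A plausible PyTorch realization of this architecture is sketched below, using torchvision's pretrained Inception-v3 truncated after its first Inception module (Mixed_5b). The pooling choice and regressor widths are our assumptions, not taken from the paper.

```python
# Sketch of the Figure 10 architecture: truncated Inception-v3 feature
# extractor, pooling, and fully-connected regression layers.
import torch
import torch.nn as nn
from torchvision.models import inception_v3

class AQuA(nn.Module):
    def __init__(self):
        super().__init__()
        base = inception_v3(pretrained=True, aux_logits=True)
        # Stem plus the first Inception module (Mixed_5b) of Inception-v3.
        self.features = nn.Sequential(
            base.Conv2d_1a_3x3, base.Conv2d_2a_3x3, base.Conv2d_2b_3x3,
            nn.MaxPool2d(3, stride=2),
            base.Conv2d_3b_1x1, base.Conv2d_4a_3x3,
            nn.MaxPool2d(3, stride=2),
            base.Mixed_5b,                       # first Inception module (256 channels)
            nn.AdaptiveAvgPool2d(1),
        )
        # Fully-connected regressor that predicts the (M)COS quality score.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.features(x))

# Example: score a batch of 299x299 RGB frames.
# model = AQuA().eval()
# scores = model(torch.randn(4, 3, 299, 299))
```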
We use transfer learning to train AQuA. The Inception-based feature extractor is initialized with weights from an ImageNet-trained Inception-v3 model and frozen; while training, only the weights in the fully-connected regression layers are updated. The model is trained end-to-end with an initial learning rate of 10− using the Adam optimizer [32] for 200 epochs. Out of the 5 different distance measures tested, MAD shows the highest monotonic correlation between MCOS_SS and distortion levels; hence, MAD is used to compute the final MCOS_SS score.
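This transfer-learning setup can be sketched as follows; the MSE loss and the 1e-4 learning rate are assumptions, and the model is assumed to expose the `features`/`regressor` split from the earlier sketch.

```python
# Sketch: the feature extractor stays frozen; only the regressor is trained
# with Adam against (M)COS targets. Loss and learning rate are assumptions.
import torch
import torch.nn as nn

def train_aqua(model, loader, epochs=200, lr=1e-4, device="cuda"):
    model.to(device)
    for p in model.features.parameters():        # freeze pretrained feature extractor
        p.requires_grad = False
    optimizer = torch.optim.Adam(model.regressor.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for frames, mcos in loader:               # mcos: MCOS or MCOS_SS targets
            frames, mcos = frames.to(device), mcos.to(device).float()
            pred = model(frames).squeeze(1)
            loss = loss_fn(pred, mcos)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```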
We evaluate the impact of using supervised and semi-supervised classifier opinion scores, and of the proposed lightweight feature-extraction model. The metric for success here is how well the predicted quality score correlates with the classifier's confidence on a large dataset. We also show the results of using AQuA as a filter (AQuA-Filter) through a Receiver Operating Characteristic (ROC) curve, which shows the discriminating capability of a filter for varying thresholds on quality; a higher area under the ROC curve (AUC) indicates better filtering capability. The definitions of some of the terms are as follows:
True Positive (TP): a frame that is correctly classified by the classifier, and also passed (i.e., considered to be of good quality) by AQuA-Filter.
False Positive (FP): a frame that is incorrectly classified by the classifier, but passed by AQuA-Filter.
False Negative (FN): a frame that is correctly classified by the classifier, but filtered (i.e., considered to be of poor quality) by AQuA-Filter.
True Negative (TN): a frame that is incorrectly classified by the classifier, and also filtered by AQuA-Filter.
For the experiments in this section, the reserved 20% of images from Section 5.1.1 are used as the testing dataset. Note that this dataset contains images that have not been seen by AQuA during training, and the images have levels of distortion that were not seen in the training dataset. Also, the classifier chosen for assessing the confidence-quality correlation was not used in generating any of the COS scores used for training. We use ResNet-101 [25] and GoogLeNet [66] as the classifiers against which correlation is measured; we observed similar results with GoogLeNet.
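Given these definitions, the ROC curve in Figure 11b can be traced by sweeping the quality threshold. A minimal sketch with scikit-learn follows, where the per-frame label is whether the classifier was correct and the score is the predicted analytical quality; the function name is ours.

```python
# Sketch: ROC curve for quality-based filtering (Figure 11b).
import numpy as np
from sklearn.metrics import roc_curve, auc

def filter_roc(quality_scores, classifier_correct):
    """quality_scores: predicted quality per frame (higher = more likely passed);
    classifier_correct: 1 if the classifier's prediction was correct, else 0."""
    fpr, tpr, thresholds = roc_curve(np.asarray(classifier_correct),
                                     np.asarray(quality_scores))
    return fpr, tpr, auc(fpr, tpr)
```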
To evaluate the impact of the novel training metric, COS, existing visual IQA models were retrained using COS as the target regression score and the training dataset described in Section 5.1.1. These models were then tested on the reserved 20% of images. The quality scores produced by each model are compared with the test classifier's confidence. The correlation results are shown in Figure 11a, where the COS versions have the suffix "COS" appended to their name. These graphs clearly show that using COS as the training metric drastically improves the correlation of the quality score with the classifier's confidence. This implies that when the COS-trained IQA methods estimate the quality of a frame to be low, it is very likely that the classifier will make a classification error.
The impact of COS is also evident in the ROC curve (Figure 11b): BRISQUE and NIMA, compared to their COS variants, have lower AUC and are thus worse filters of analytical quality. We use "VGG-COS" instead of "NIMA-COS", since NIMA is essentially a VGG-19 model trained for a quality regression task.
To understand the impact of semi-supervised training on analytical quality estimates, AQuA and AQuA_SS can be compared in both Figure 11a and Figure 11b. In both of these graphs we see that AQuA_SS performs similarly to or better than AQuA. Moreover, since both AQuA and AQuA_SS use the same model architecture, their inference speeds are the same as well (Table 2). This shows that semi-supervised training is an effective approach to train an analytical quality assessor. Using the entire softmax output captures more information than just using the correct class, and it also helps the model generalize better. This will be further established when evaluations on other datasets and applications are presented.
The impact of the choice of feature-extractor type and complexity can be seen in Figure 11a, which compares the different COS methods. Going from the NSS-based feature extractor in BRISQUE-COS to a CNN-based one in NIMA/VGG-COS provides a significant increase in correlation, almost 2x for some distortions. When we compare the deep feature extractor in VGG-COS with the shallower extractor in AQuA, we see that deep extractors show slightly higher correlation, which is not surprising. The same improvement is also evident in the ROC curves in Figure 11b.
The choice of feature extractor directly impacts the computation time of the model. Table 2 presents the time for processing a frame. Compared to AQuA, a deeper feature extractor (VGG) is more than 10x slower and consumes more GPU memory during computation. We believe that this disproportionate improvement in latency easily outweighs the minor accuracy advantage of VGG-COS.
Table 2: Comparing assessor latencies (ms) for BRISQUE, NIMA, AQuA, AQuA_SS, and VGG-COS.
Edge-based video analytics applications require models for finer-grained tasks, such as object detection for detecting pedestrians or cars, face detection for person recognition, and body-keypoint detection for estimating pose and recognizing actions. If necessary, AQuA can easily be custom-tailored to each of these different recognizers and detectors. However, in real-world field trials, we observed that many different classifiers and detectors usually process the same video stream, as shown in Figure 3. Thus, a good analytical quality assessor should be able to indicate if any of these models will falter on an input frame.
Figure 11: Accuracy of different quality assessors. (a) CCC-quality correlation per distortion type; (b) ROC for quality-based filtering.
Figure 12: AQuA generalization. Predicted quality vs. mAP for (a) object detection, (b) instance segmentation, and (c) keypoint detection.
To establish this, we conducted experiments with models for three different tasks: object detection [56], instance segmentation [24], and keypoint estimation [14]. The goal of the experiment is to show whether AQuA, trained once as described in Section 5.1.1 on image classifier opinions, can assign quality scores to images that align with other models' accuracy metrics. The metrics compared are the quality score produced by AQuA and the mean Average Precision (mAP) of these models. The dataset for this experiment is generated by applying distortions, as described in Section 5.1.1, to the COCO dataset [39]. mAP is defined as the mean area under the precision-recall curve over all classes, and is computed using pycocotools [10].
The results for this experiment can be seen in Figure 12. The graphs show a strong correlation between the mAP and the predicted quality. That is, if AQuA estimates that an image is of poor analytical quality, all of the application models under consideration will most likely falter on it. Moreover, the generalization of AQuA is boosted by semi-supervision, because it accounts for the frame and the appearance of the object inside the frame, rather than just considering the specific object.
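For reference, mAP on such a distorted COCO split can be computed with pycocotools [10] roughly as follows; the file paths are placeholders.

```python
# Sketch: computing COCO mAP with pycocotools for detection results on a
# distorted copy of the COCO validation set.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def coco_map(ann_file="instances_val2017.json", det_file="detections.json",
             iou_type="bbox"):
    coco_gt = COCO(ann_file)                  # ground-truth annotations
    coco_dt = coco_gt.loadRes(det_file)       # model detections
    ev = COCOeval(coco_gt, coco_dt, iou_type) # "bbox", "segm" or "keypoints"
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats[0]                        # mAP over IoU 0.50:0.95
```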
As discussed earlier, distorted images can cause models to make high-confidence errors. We conduct the following experiment to evaluate whether an analytical quality-based filter can lower such false positives (i.e., high-confidence errors).
Table 3: Face-recognition datasets for evaluation. Unique faces: 530 (FaceScrub), 9211 (CelebA).
Figure 13: ROC performance on face-recognition datasets: (a) FaceScrub, (b) CelebA.
We consider the face-recognition application [62]. This application has a database of persons, and each person has one or more images of their face. Given an image, the application either recognizes the faces in the image as known persons (who are already in the database), or it classifies them as unknown. AQuA is placed upstream from the application: if AQuA judges the image to be of high quality, the image is forwarded to the application.
We use two different face-recognition datasets, CelebA [41] and FaceScrub [48], for the evaluation. The key properties, i.e., the number of unique faces and the total test frames under consideration for these two datasets, are listed in Table 3. One image per person is used as a reference in the person database. For the queries, two images per person are selected from the dataset and randomly distorted, just as before (Section 5.1.1).
Figure 13 shows the results of the experiment on both datasets as ROC curves. The ROC curves show that the AUC under either AQuA or AQuA_SS is higher than under existing IQA methods. This suggests that both AQuA variants can better filter out poor-quality frames compared to other IQA methods. Table 4 shows that false positives (high-confidence errors) reduce by over 17% when AQuA is used, with minimal impact on true positives. Thus we conclude that it can effectively lower high-confidence errors in such tasks.
As mentioned earlier, AQuA can be used as a quality filter, AQuA-Filter. Such a filter can improve accuracy and reduce resource usage by filtering out poor-quality frames.
Table 4: Reduction of FP and TP for face recognition (TP decrease % and FP decrease % for BRISQUE, NIMA, AQuA, and AQuA_SS).
Figure 14: Face-recognition pipeline (a) without the AQuA filter and (b) with the AQuA filter.
In this section, we evaluate AQuA-Filter on naturally distorted continuous videos, instead of synthetically distorted images. Due to the lack of suitable, publicly available video datasets, we used proprietary videos with distortions due to environmental changes (these are confidential customer videos, so we cannot share the dataset at this time; our work is currently deployed in field trials at several major arenas, casinos, and airports). We focus on two videos with multiple chronic distortions, (a) Daytime and (b) Nighttime:
• Daytime: This video had the sun shining directly into the camera, blowing out regions of the frame.
• Nighttime: This video was captured after sunset, and it suffered from low-light noise and under-exposure.
The application running on these video streams is face recognition, and AQuA is placed upstream from it, as shown in Figure 14b. The state-of-the-art face-recognition pipeline (shown in Figure 14a) takes captured frames as input from the edge camera and pushes them to a face detector that produces face bounding boxes. Each face bounding box is then passed to a face-recognition engine for feature extraction and feature matching against the reference face database. This face-recognition pipeline is widely used [21, 34, 54, 61] and widely adopted by enterprises and governments at airports [72, 74], on roads [58, 70], and in shopping malls [17, 57] for surveillance and for enhancing customer experience. Note that AQuA has not been trained for this application specifically. The computation time of the application is 55 ms for face detection per frame, and 200 ms for face feature extraction and matching per face.
Figure 15: AQuA performance on videos: (a), (c) day-time video; (b), (d) night-time video. TH1, TH2, and TH3 signify three different quality thresholds used for AQuA, while S1, S2, and S4 indicate frame-sampling rates for AQuA prediction.
To observe the accuracy vs. resource-consumption trade-off, AQuA-Filter was run with 9 different configurations: three quality thresholds (TH1, TH2, TH3) and three sampling rates (S1, S2, S4). The sampling rate S_x denotes that AQuA-Filter was invoked every x frames; the decision on this frame, whether to filter or not, was then applied to the next x−1 frames.
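A minimal sketch of this filtering logic is shown below; the threshold semantics (treating a lower predicted distortion score as better analytical quality) and the `score` method name are assumptions.

```python
# Sketch: AQuA-Filter with a quality threshold (TH) and a sampling rate (S_x).
# The model is invoked on every x-th frame; its decision is reused for the
# following x-1 frames.
def aqua_filter(frames, aqua_model, threshold, sample_every=2):
    """Yield only the frames judged good enough for downstream analytics."""
    keep = True
    for i, frame in enumerate(frames):
        if i % sample_every == 0:
            # Lower predicted score is treated here as better analytical
            # quality (MCOS is near 0 for pristine frames) -- an assumption.
            keep = aqua_model.score(frame) <= threshold
        if keep:                      # decision reused for the next x-1 frames
            yield frame
```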
Edge-only:
For an edge-only scenario [2], where all the analytics are performed on the edge device, AQuA-Filter will reduce the computation overhead by filtering poor-quality frames. As more analytics are performed on the same video stream (e.g., multiple face analytics performed on the edge device of EagleEye [79]), AQuA-Filter's relative resource reduction will be even higher. A cloud-only system would also benefit in the same way.
Edge-Cloud Collaboration:
In edge-assisted real-time AR systems [12, 40, 79], edge cameras capture the frames and may partially process them before streaming to a remote server. Pushing AQuA-Filter onto the camera can drop poor-quality frames and reduce the streaming bandwidth requirement. This can be applied along with video compression algorithms [22, 73] or other filtering approaches [5, 7, 37].
Scalable Video Analytics:
Along with its filtering capability based on analytical quality, in multi-camera network systems [26, 27, 75], AQuA can also enhance a video analytics system's capacity to serve multiple video streams at the same time. For multi-camera video feeds, discarding low-quality distorted frames helps process multiple parallel streams at the same time. Hence, AQuA can also improve scalability.
Low resolution (LR) is one of the earliest examples of poor image quality that has been studied in computer vision research.
One of the first branches of computer vision applications to look into low-quality images was face detection and recognition. This is because a number of different security and analytics applications rely on faces, but the cameras used for such applications tend to be cheap, low-resolution cameras. [1] surveys recent work in LR face recognition and argues that there still are significant challenges to be overcome.
One way to tackle LR images is to construct a high-resolution version of them through super-resolution. This is a classic computer vision problem, which has received renewed attention due to the success of deep learning and CNN-based models. [77] provides a good overview of different approaches for generic super-resolution. There has been additional work directing super-resolution toward specific applications, such as person identification in a crowded scene (EagleEye [79]) and object detection [23]; the most recent edge-assisted face-recognition system also employs super-resolution to accurately identify a missing person from LR faces captured in a crowded urban space. Although low resolution can impact image quality, it is orthogonal to the kinds of distortions considered in this work. Moreover, the methods to overcome it are complementary to this work.
Image classification models have recently surpassed human-level accuracy on large datasets such as ImageNet [11]. This has been made possible through deep learning and CNNs [25, 35, 65-67]. However, it has been shown that these models are brittle and can produce erroneous predictions even when the input is distorted in minimal ways.
Adversarial distortions are small, calculated, and deliberate perturbations of input images, visually imperceptible, that cause classifiers to fail [3, 6, 19]. Although this work addresses distortions too, it looks at "natural" distortions due to image acquisition or transmission.
There have been multiple efforts showing that image classification suffers on images that undergo common distortions, like blur, noise, and over-exposure [15, 49, 59, 68]. The main reason proposed in these papers is that most classifiers are trained on high-quality images, typically scraped from the Internet, and hence fail on low-quality images. [71] shows that fine-tuning an existing classifier with blurred images can improve the classifier's performance on such images, but can hurt its performance on pristine images. [83] showed that classifiers can also be fine-tuned on noisy images using the same approach, with the same drawback of reduced overall accuracy.
[13] attaches another network to the input of a classifier that rectifies blurry and noisy images, and thus presents a cleaner image to the classifier. This network, however, requires camera parameters for training, making it difficult to generalize. [16] introduces MixQualNets, which takes an ensemble learning approach. Each model within the ensemble is an image classifier, but trained with a different kind of distortion. Specifically, their proposed ensemble consists of 3 image classifiers, trained on clean, noisy, and blurry images, respectively. The overall accuracy on all kinds of images is better than that of each individual classifier, though it comes at a high computation cost.
[82] proposes a new training method, stability training, which improves the resilience of the network to common distortions. They validate the approach on highly compressed JPEG images and show that their method outperforms the base image classifier.
Although multiple approaches have been explored to tackle the issue of degraded images, most of these cannot be applied in large-scale video analytics deployments. Most of the earlier work requires retraining of classifiers, and since a single application can use a number of different models, this might not be practical. Most approaches also increase the size of the network, leading to higher compute times, which is detrimental to video analytics applications. AQuA takes a different approach: it filters degraded frames, safeguarding the accuracy of all models in the application pipeline. Moreover, by filtering such frames at the head of the pipeline, it reduces resource usage.
Recently, tuning the video analytics pipeline for better accuracy along with efficient resource usage has gained a lot of attention. Most of these works, e.g., Chameleon [27], AWStream [80], and VideoStorm [81], focus on tuning parameters such as resolution, frame rate, and the analytical model under consideration to achieve a better resource-accuracy trade-off. Such parameter tuning is applied after frame acquisition by the edge camera. However, if inferior-quality frames are continuously captured during acquisition due to camera misconfigurations (e.g., camera focus or exposure settings), high resource consumption occurs without any of the desired analytics being performed.
AQuA not only reduces redundant resource consumption on edge devices by segregating inferior-quality frames, it can also be trained to predict the misconfigurations that caused the quality deterioration. This will further reduce the chances of inferior-quality frame acquisition.
Live, real-time video analytics applications at the edge are increasing, propelled by deep learning and 5G connectivity. However, in-camera and transmission distortions cause applications to falter and produce erroneous analytical outcomes. In this work, we introduced AQuA to judge frames and assign an analytical quality score. We also proposed that AQuA can be used as a filter to drop low-quality, distorted frames and to eliminate frames that can lead to high-confidence errors. Our approach is inspired by IQA methods, but instead of human opinions, we define a new metric, the classifier opinion score, that is used to train AQuA. AQuA uses a truncated Inception-v3 for feature extraction, to extract low-level, object-independent features. We evaluate AQuA and show that it can outperform state-of-the-art IQA methods in terms of correlating with application confidence, filtering, and reducing false positives. We also show that AQuA generalizes to multiple different video analytics applications and can reduce resource (i.e., communication and compute) consumption without degrading inference accuracy while processing video data at scale.
REFERENCES
[1] 2018. Face Recognition in Low Quality Images: A Survey.
CoRR abs/1805.11519 (2018). arXiv:1805.11519 http://arxiv.org/abs/1805.11519[2] Kittipat Apicharttrisorn, Xukan Ran, Jiasi Chen, Srikanth V Krishna-murthy, and Amit K Roy-Chowdhury. 2019. Frugal following: Powerthrifty object detection and tracking for mobile augmented reality.In
Proceedings of the 17th Conference on Embedded Networked SensorSystems . 96–109.[3] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. 2018.Synthesizing robust adversarial examples. In
International conferenceon machine learning . PMLR, 284–293.[4] Bhattacharyya-distance. . https://en.wikipedia.org/wiki/Bhattacharyya_distance.[5] Christopher Canel, Thomas Kim, Giulio Zhou, Conglong Li, HyeontaekLim, David G Andersen, Michael Kaminsky, and Subramanya R Dulloor.2019. Scaling video analytics on constrained edge nodes. arXiv preprintarXiv:1905.13536 (2019).[6] Nicholas Carlini and David Wagner. 2017. Adversarial examples arenot easily detected: Bypassing ten detection methods. In
Proceedingsof the 10th ACM Workshop on Artificial Intelligence and Security . 3–14.[7] Tiffany Yu-Han Chen, Lenin Ravindranath, Shuo Deng, Paramvir Bahl,and Hari Balakrishnan. 2015. Glimpse: Continuous, real-time objectrecognition on mobile devices. In
Proceedings of the 13th ACM Confer-ence on Embedded Networked Sensor Systems . 155–168.[8] CNBC-Study. 2019. One billion surveillance cameras willbe watching around the world in 2021, a new study says.cnbc_study_reports_1blllion_surveillancecamera_by2021.[9] CNET. 2019. How 5G aims to end network latency.CNET_5G_network_latency_time.[10] cocoapi github. . pycocotools. https://github.com/cocodataset/cocoapi/tree/master/PythonAPI/pycocotools.[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.2009. Imagenet: A large-scale hierarchical image database. In . IEEE, 248–255.[12] Yang Deng, Tao Han, and Nirwan Ansari. 2020. FedVision: FederatedVideo Analytics With Edge Computing.
IEEE Open Journal of theComputer Society
CoRR abs/1701.06487(2017). arXiv:1701.06487 http://arxiv.org/abs/1701.06487[14] Xintao Ding, Qingde Li, Yongqiang Cheng, Jinbao Wang, Weixin Bian,and Biao Jie. 2020. Local keypoint-based Faster R-CNN.
APPLIEDINTELLIGENCE (2020).[15] Samuel Dodge and Lina Karam. 2016. Understanding how image qual-ity affects deep neural networks. In . IEEE, 1–6.[16] Samuel F Dodge and Lina J Karam. 2018. Quality robust mixtures ofdeep neural networks.
IEEE Transactions on Image Processing
IEEE Transactions on Circuits andSystems for Video Technology
28, 9 (2018), 2061– 2077.[19] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014.Explaining and harnessing adversarial examples. arXiv preprintarXiv:1412.6572 (2014). [20] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, AmirShahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, JianfeiCai, et al. 2018. Recent advances in convolutional neural networks.
Pattern Recognition
77 (2018), 354–377.[21] Guodong Guo and Na Zhang. 2019. A survey on deep learning basedface recognition.
Computer Vision and Image Understanding
189 (2019),102805.[22] H264. . H.264 Video Encoding. https://en.wikipedia.org/wiki/Advanced_Video_Coding.[23] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. 2018.Task-Driven Super Resolution: Object Detection in Low-resolutionImages.
CoRR abs/1803.11316 (2018). arXiv:1803.11316 http://arxiv.org/abs/1803.11316[24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017.Mask r-cnn. In
Proceedings of the IEEE international conference oncomputer vision . 2961–2969.[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deepresidual learning for image recognition. In
Proceedings of the IEEEconference on computer vision and pattern recognition . 770–778.[26] Samvit Jain, Xun Zhang, Yuhao Zhou, Ganesh Ananthanarayanan,Junchen Jiang, Yuanchao Shu, Paramvir Bahl, and Joseph Gonzalez.2020. Spatula: Efficient cross-camera video analytics on large cameranetworks. (2020).[27] Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, SiddharthaSen, and Ion Stoica. 2018. Chameleon: scalable adaptation of videoanalytics. In
Proceedings of the 2018 Conference of the ACM SpecialInterest Group on Data Communication . 253–266.[28] Xin Jin, Le Wu, Xiaodong Li, Xiaokun Zhang, Jingying Chi, Siwei Peng,Shiming Ge, Geng Zhao, and Shuying Li. 2018. ILGNet: inceptionmodules with connected local and global features for efficient imageaesthetic quality classification using domain adaptation.
IET ComputerVision
Proceed-ings of the IEEE conference on computer vision and pattern recognition .1733–1740.[32] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochasticoptimization. arXiv preprint arXiv:1412.6980 (2014).[33] KL. . Kullback-Leibler Divegence. https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence.[34] Yassin Kortli, Maher Jridi, Ayman Al Falou, and Mohamed Atri. 2020.Face recognition systems: A Survey.
Sensors
20, 2 (2020), 342.[35] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenetclassification with deep convolutional neural networks. In
Advancesin neural information processing systems . 1097–1105.[36] Yann LeCun et al. 2015. LeNet-5, convolutional neural networks.
URL:http://yann.lecun.com/exdb/lenet
20, 5 (2015), 14.[37] Yuanqi Li, Arthi Padmanabhan, Pengzhan Zhao, Yufei Wang, Guo-qing Harry Xu, and Ravi Netravali. 2020. Reducto: On-Camera Filter-ing for Resource-Efficient Real-Time Video Analytics. In
Proceedings ofthe Annual conference of the ACM Special Interest Group on Data Com-munication on the applications, technologies, architectures, and protocolsfor computer communication . 359–376.[38] Qianlin Liang, Prashant Shenoy, and David Irwin. 2020. AI on theEdge: Rethinking AI-based IoT Applications Using Specialized EdgeArchitectures. arXiv preprint arXiv:2003.12488 (2020).
[39] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In
European conference oncomputer vision . Springer, 740–755.[40] Luyang Liu, Hongyu Li, and Marco Gruteser. 2019. Edge assisted real-time object detection for mobile augmented reality. In
The 25th AnnualInternational Conference on Mobile Computing and Networking . 1–16.[41] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. 2015. DeepLearning Face Attributes in the Wild. In
Proceedings of InternationalConference on Computer Vision (ICCV)
IEEETransactions on image processing
21, 12 (2012), 4695–4708.[44] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. 2019. Posefix:Model-agnostic general human pose refinement network. In
Proceed-ings of the IEEE Conference on Computer Vision and Pattern Recognition .7773–7781.[45] Anush Krishna Moorthy and Alan Conrad Bovik. 2010. A two-stepframework for constructing blind image quality indices.
IEEE Signalprocessing letters
17, 5 (2010), 513–516.[46] Anush Krishna Moorthy and Alan Conrad Bovik. 2011. Blind imagequality assessment: From natural scene statistics to perceptual quality.
IEEE transactions on Image Processing
20, 12 (2011), 3350–3364.[47] Naila Murray, Luca Marchesotti, and Florent Perronnin. 2012. AVA: Alarge-scale database for aesthetic visual analysis. In . IEEE, 2408–2415.[48] Hong-Wei Ng and Stefan Winkler. 2014. A data-driven approach tocleaning large face datasets. In . IEEE, 343–347.[49] Yanting Pei, Yaping Huang, Qi Zou, Hao Zang, Xingyuan Zhang, andSong Wang. 2018. Effects of image degradations to CNN-based imageclassification. arXiv preprint arXiv:1810.05552 (2018).[50] Nikolay Ponomarenko, Oleg Ieremeiev, Vladimir Lukin, Karen Egiazar-ian, Lina Jin, Jaakko Astola, Benoit Vozel, Kacem Chehdi, Marco Carli,Federica Battisti, et al. 2013. Color image database TID2013: Peculiari-ties and preliminary results. In european workshop on visual informationprocessing (EUVIP) . IEEE, 106–111.[51] pytorch. . Pretrained Models. https://pytorch.org/docs/stable/torchvision/models.html.[52] Siyuan Qiao, Liang-Chieh Chen, and Alan Yuille. 2020. DetectoRS:Detecting Objects with Recursive Feature Pyramid and SwitchableAtrous Convolution. arXiv preprint arXiv:2006.02334 (2020).[53] Qualcomm. 2019. How 5G low latency improves your mobile experi-ences. Qualcomm_5G_low-latency_improves_mobile_experience.[54] Rajeev Ranjan, Ankan Bansal, Jingxiao Zheng, Hongyu Xu, JoshuaGleason, Boyu Lu, Anirudh Nanduri, Jun-Cheng Chen, Carlos DCastillo, and Rama Chellappa. 2019. A fast and accurate system forface detection, identification, and verification.
IEEE Transactions onBiometrics, Behavior, and Identity Science
1, 2 (2019), 82–96.[55] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016.You only look once: Unified, real-time object detection. In
Proceedingsof the IEEE conference on computer vision and pattern recognition . 779–788.[56] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks.In
Advances in neural information processing systems arXiv preprint arXiv:1807.10108 (2018).[60] Michele A Saad, Alan C Bovik, and Christophe Charrier. 2012. Blindimage quality assessment: A natural scene statistics approach in theDCT domain.
IEEE transactions on Image Processing
21, 8 (2012), 3339–3352.[61] Muhammad Sajjad, Mansoor Nasir, Khan Muhammad, Siraj Khan, Za-hoor Jan, Arun Kumar Sangaiah, Mohamed Elhoseny, and Sung WookBaik. 2020. Raspberry Pi assisted face recognition framework for en-hanced law-enforcement services in smart cities.
Future GenerationComputer Systems
108 (2020), 995–1007.[62] Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015.Facenet: A unified embedding for face recognition and clustering.In
Proceedings of the IEEE conference on computer vision and patternrecognition . 815–823.[63] K. Seshadrinathan, R. Soundarajan, A. C. Bovik, and L. K. Cormack.2010. Study of subjective and objective quality assessment of video.
IEEE Transactions on Image Processing
19, 6 (2010), 1427–1441.[64] Hamid R Sheikh. 2005. LIVE image quality assessment database. http://live. ece. utexas. edu/research/quality (2005).[65] Karen Simonyan and Andrew Zisserman. 2014. Very deep convo-lutional networks for large-scale image recognition. arXiv preprintarXiv:1409.1556 (2014).[66] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and AndrewRabinovich. 2015. Going deeper with convolutions. In
Proceedings ofthe IEEE conference on computer vision and pattern recognition . 1–9.[67] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, andZbigniew Wojna. 2016. Rethinking the inception architecture forcomputer vision. In
Proceedings of the IEEE conference on computervision and pattern recognition . 2818–2826.[68] Timothy Tadros, Nicholas C Cullen, Michelle R Greene, and Emily ACooper. 2019. Assessing Neural Network Scene Classification fromDegraded Images.
ACM Transactions on Applied Perception (TAP)
16, 4(2019), 1–20.[69] Hossein Talebi and Peyman Milanfar. 2018. NIMA: Neural imageassessment.
IEEE Transactions on Image Processing
CoRR
Proceedings ofthe 8th ACM on Multimedia Systems Conference . 38–49.[76] Wufeng Xue, Xuanqin Mou, Lei Zhang, Alan C Bovik, and XiangchuFeng. 2014. Blind image quality assessment using joint statistics of radient magnitude and Laplacian features. IEEE Transactions on ImageProcessing
23, 11 (2014), 4850–4862.[77] Chih-Yuan Yang, Chao Ma, and Ming-Hsuan Yang. 2014. Single-imagesuper-resolution: A benchmark. In
European Conference on ComputerVision . Springer, 372–386.[78] Peng Ye, Jayant Kumar, Le Kang, and David Doermann. 2012. Unsu-pervised feature learning framework for no-reference image qualityassessment. In . IEEE, 1098–1105.[79] Juheon Yi, Sunghyun Choi, and Youngki Lee. 2020. EagleEye: wear-able camera-based person identification in crowded urban spaces. In
Proceedings of the 26th Annual International Conference on Mobile Com-puting and Networking . 1–14.[80] Ben Zhang, Xin Jin, Sylvia Ratnasamy, John Wawrzynek, and Edward ALee. 2018. Awstream: Adaptive wide-area streaming analytics. In
Proceedings of the 2018 Conference of the ACM Special Interest Group onData Communication . 236–252.[81] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Phili-pose, Paramvir Bahl, and Michael J Freedman. 2017. Live video analyt-ics at scale with approximation and delay-tolerance. In { USENIX } Symposium on Networked Systems Design and Implementation ( { NSDI } . 377–392.[82] Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. 2016.Improving the Robustness of Deep Neural Networks via Stability Train-ing. arXiv:1604.04326 [cs.CV][83] Yiren Zhou, Sibo Song, and Ngai-Man Cheung. 2017. On classificationof distorted images with deep convolutional neural networks. In . IEEE, 1213–1217.. IEEE, 1213–1217.