Modeling Score Distributions and Continuous Covariates: A Bayesian Approach
Mel McCurrie, Hamish Nicholson, Walter J. Scheirer, and Samuel Anthony
Perceptive Automata, Inc. · University of Notre Dame · Harvard University
Abstract
Computer Vision practitioners must thoroughly understand their model's performance, but conditional evaluation is complex and error-prone. In biometric verification, model performance over continuous covariates—real-number attributes of images that affect performance—is particularly challenging to study. We develop a generative model of the match and non-match score distributions over continuous covariates and perform inference with modern Bayesian methods. We use mixture models to capture arbitrary distributions and local basis functions to capture non-linear, multivariate trends. Three experiments demonstrate the accuracy and effectiveness of our approach. First, we study the relationship between age and face verification performance and find previous methods may overstate performance and confidence. Second, we study preprocessing for CNNs and find a highly non-linear, multivariate surface of model performance. Our method is accurate and data efficient when evaluated against previous synthetic methods. Third, we demonstrate the novel application of our method to pedestrian tracking and calculate variable thresholds and expected performance while controlling for multiple covariates.
1. Introduction
Computer Vision practitioners must thoroughly understand their model's performance, but conditional evaluation is complex and error-prone. In biometric verification, model performance over continuous covariates—known, measurable, real-number attributes of images that affect performance—is particularly challenging to study. It is impossible to use traditional regression methods because metrics like ROC curves are calculated over an entire dataset, not at individual data points. Current methods make strong and simplifying assumptions about the data, commonly treating continuous covariates as discrete or assuming a fixed threshold. Other studies avoid these statistical complexities and limit their analysis to covariates that can be simulated. These studies generate billions or trillions of data points to perform computationally expensive grid searches over covariates like blur, noise, and occlusion, and still only capture performance at a finite set of points. Especially in the presence of limited data or multiple covariates, current methods for continuous covariates fall short of computer vision practitioners' needs.

Figure 1. Our objective is to capture the relationships between multiple continuous covariates and biometric performance. It is impossible to use traditional regression techniques to estimate verification metrics like ROC curves that are calculated over the entire dataset. Instead, we develop a generative model that allows us to estimate the latent match and non-match distributions, and thus metrics, at any covariate values. (A) Here we perform pair-wise comparisons between images on the Labeled Faces in the Wild verification dataset, where each image is preprocessed with a random scale value, the covariate, before being center cropped and fed into VGGFace2. (B) On the diagonal slice of the multivariate data we can see the match and non-match densities have non-linear trends—a single threshold and an aggregate metric would be inappropriate even for this univariate slice. (C, D) Slices of the data at specific covariate values with few points create noisy empirical densities, but our method estimates the smooth, latent distributions.

Our objective is to capture continuous relationships between multiple continuous covariates and biometric performance without making strong assumptions or synthesizing an unreasonable amount of data. Instead of directly modeling the final performance metric, we model the underlying match and non-match distributions of feature distances over continuous covariates with flexible Bayesian methods.
To capture arbitrary feature distribution shapes we replace the traditional empirical approach with a mixture model. To capture non-linear trends we model mixture component parameters with local basis functions. To capture uncertainty where data is limited and scale where data is abundant we perform inference with Monte Carlo Sampling and Stochastic Variational Inference on modern hardware. In short, we introduce a method for conditional analysis that does the following:

• Accurately models both the match and non-match score distributions.
• Captures continuous, non-linear relationships between multiple covariates and model performance.
• Controls for both Query and Gallery covariates.
• Generates match and non-match feature distances at arbitrary covariate specifications.
• Expresses uncertainty where data is limited.
• Reduces data and compute needs.

We demonstrate and evaluate our method with three experiments. First, we re-examine a study [2] on the relationship between subject age and verification performance. The study we examine constrains query and gallery pairs to have the same discrete covariate values—we extend this methodology to the continuous case. Ultimately we find that previous studies probably overstate model performance and confidence.

Second, we study the effect of preprocessing on the verification performance of a Convolutional Neural Network. We find extremely low model performance outside of the optimal range of image scale levels, and show how previous works that do not model both query and gallery covariates can understate model robustness. We regenerate the entire Labeled Faces in the Wild [10] (LFW) dataset at 100 combinations of query and gallery covariate levels to approximate ground truth model performance at each point. Our method accurately captures the highly non-linear trends in performance when compared to the approximate ground truth.
Additionally, we apply our method to the "preferred view", a method to isolate synthetic covariate effects introduced by RichardWebster et al. [24].

Third, we tackle a previously unstudied problem, and calculate the performance of a pedestrian re-identification model used to track pedestrians over temporal occlusion. We model the match and non-match distributions as functions of elapsed time and control for changes in detection size. We demonstrate how our method can be used during inference to dynamically vary the threshold in a pedestrian tracking scenario to maintain a constant False Positive Rate.
2. Related Work
Our work builds upon existing studies that evaluate verification systems under varying conditions. We organize previous work into Natural Covariate Studies and Synthetic Covariate Studies. Natural covariate studies annotate and analyze attributes like age, gender, and race that are difficult to synthetically alter independent of identity. These studies face a statistical challenge due to limited data. Synthetic covariate studies programmatically alter face attributes like expression and image attributes like blur and noise. These studies are the gold standard, but are expensive to scale and can only be used for some covariates.
Each image in the dataset of a natural covariate study only has one value per covariate of interest. Synthetic techniques cannot be used to generate alternate versions of the dataset. Data is usually distributed unevenly over covariate values, resulting in limited data in many regions of covariate space. In these sparsely populated regions, conditional performance evaluation methods face a difficult challenge measuring performance with certainty.

The 2002 Face Recognition Vendor Test [20] studied changes in verification and identification performance over changes in covariates including gender, age, and elapsed time. They found different cohorts performed differently, but only considered covariates in the query set, compared identification rates between different dataset configurations, and binned age and elapsed time. Mitra et al. [16] used a GLM with random effects to predict match rates from covariates like illumination with a normal, linear assumption. Scheirer et al. [27] modeled a surface of match score performance with a Support Vector Machine. O'Toole et al. [18] studied the effects of race and gender on verification performance. They concentrated on removing race and gender as potential discriminative features by constraining non-match pairs to be of the same race or gender, a technique called "yoking". Best-Rowden and Jain [3] modeled the effects of elapsed time, race, gender, and other covariates on the match score. They fit a multi-level regression model with normal, linear assumptions and estimated confidence intervals by bootstrapping. Lu et al. [14] binned age into seven groups and studied group effects on verification performance. Cook et al. [8] binned age into two groups and used a linear regression with bootstrapped confidence intervals to measure effects on performance. Most recently Albiero and Bowyer [2] binned age into three groups: "young", "middle", and "old".
They estimated ROC curves for each group, controlling for race and gender, and calculated bootstrapped confidence intervals.

Previous works that do not model continuous relationships may miss interesting trends in both performance and certainty. Previous works that do model continuous trends concentrate on modeling match scores, usually with the intent of capturing effects. For the sake of analysis, most works make unrealistic assumptions of linear trends or normal feature distributions. In this work we develop a generative model of the continuous match and non-match densities over continuous query and gallery covariates. This allows us to numerically calculate optimal thresholds and expected metrics, as well as empirically estimate performance by simulating specific populations from posterior samples. We do not assume trends are linear, do not assume a specific distribution, and do not assume fixed thresholds. It is worth noting that other fields have developed similar methods for calculating covariate-specific ROC curves; for a thorough review see the work of Rodríguez-Álvarez et al. [25].
Synthetic covariate studies reproduce entire datasets and environments under any specified conditions. Computer graphics software can be used to synthesize faces with varying expression [24] and pose [12] and pedestrians with varying clothing [21]. Recent developments in generative models enable researchers to manipulate skin tone, hair color, and gender of existing face images [6] or generate new ones. Scheirer et al. [26] simulated varying amounts of occlusion to compare different face detectors. They graph accuracy against the area of face that is visible, calling the graph a psychometric curve. RichardWebster et al. [23] used the psychometric curve to display object recognition performance against perturbations like rotation, blur, and contrast. Grm et al. [9] plotted the mean and standard deviation of face verification accuracy against parameters of perturbations like occlusion, contrast, and compression that were applied to the query set. Kortylewski et al. [12] synthetically rendered face images with different lighting and pose and measured identification performance. They examined joint covariate effects and controlled covariate distributions in their train and test data. Nicholson [17] manipulated images in the query set of a pedestrian re-identification dataset with blur, noise, compression, and other perturbations, and measured Rank-1 retrieval performance. RichardWebster et al. [24] manipulated expression, contrast, blur, and other covariates in the query set and measured Rank-1 retrieval performance. They pruned a dataset such that each model's performance is optimal before applying a perturbation, introducing the "preferred view" of the dataset.

Although progress with generative models and computer graphics is promising, biometric researchers face the unique challenge of manipulating attributes while maintaining a subject's underlying identity.
For assessment of attributes that are integral to an identity, or at least difficult to vary independently, natural covariate studies remain necessary. Additionally, existing synthetic studies generally limit their analysis to perturbations on the query set, only examining a slice of the true joint metric surface. Even within this slice, perturbations are only applied at finite intervals. Finally, because studies tend to regenerate the entire dataset at every interval, scaling to multiple covariates and increasing the density of the finite intervals would be expensive. Our approach can be used with synthetic techniques to calculate continuous results over joint query and gallery covariates with significantly less data and reduced computational burden.
3. Method
In this section we describe the general data generation process that underlies our experiments and develop a generative model that captures the densities of the match and non-match distributions over continuous covariates.
A set of $N$ images is collected, and each image is annotated with an identity value and some attributes. Not all identity values are unique. During evaluation a pair-wise distance or similarity is calculated between the dataset and itself, resulting in $N^2$ data points. Rows are called "query" or "probe" rows and columns are called "gallery" columns. The complete dataset has $N^2$ data points $\{X_i, y_i\}$ for $i \in \{1, \dots, N^2\}$, where $y_i$ is the distance or similarity score between two images or their extracted features and $X_i$ is a vector of attributes that includes the original query attributes, the original gallery attributes, and user-defined interactions between those attributes. One interaction always calculated is a boolean equivalence between query identity and gallery identity that results in the "match" attribute of 1 (match) or 0 (non-match). Usually we are interested in the difference between feature distances whose associated "match" attribute is 1 and those whose associated "match" attribute is 0. We study how this difference varies as a subset of attributes, called "covariates", varies. We use "attribute" to describe any latent or known value associated with an image, "perturbation" to describe an attribute created with synthetic manipulation, "features" to describe attributes used to calculate feature distances, and "covariates" to describe the known, measured attributes we study in relation to model performance.

The resulting dataset of attributes and feature distances is partitioned and manipulated based on attributes $X$. Many works define fixed sets of query and gallery ids, leaving at most $N^2/4$ data points. Many works, especially those that calculate metrics derived from the CMC curve, reduce the number of non-match data points by removing all query images with no match in the gallery set. Works in multi-camera pedestrian re-identification reduce the number of match points by removing all data points with equivalent query and gallery camera identity attributes.
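The pairwise construction described above can be sketched in a few lines. This is an illustrative toy, not the paper's pipeline: the identity labels and two-dimensional features below are hypothetical stand-ins for real images and network embeddings.

```python
import math

# Toy dataset: each image carries an identity value and a precomputed feature
# vector. In the paper's setting features would come from a network such as
# VGGFace2; these small vectors are made-up placeholders.
images = [
    {"identity": "a", "feature": [0.0, 0.1]},
    {"identity": "a", "feature": [0.1, 0.0]},
    {"identity": "b", "feature": [1.0, 1.1]},
]

def euclidean(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

# Pairwise comparison of the dataset with itself yields N^2 data points.
# The "match" attribute is the boolean equivalence of query and gallery identity.
dataset = []
for query in images:           # rows: query / probe
    for gallery in images:     # columns: gallery
        dataset.append({
            "match": int(query["identity"] == gallery["identity"]),
            "y": euclidean(query["feature"], gallery["feature"]),
        })

assert len(dataset) == len(images) ** 2   # N^2 points; diagonal distances are 0
```

Note that the diagonal entries compare an image to itself, so their feature distance is exactly 0; these are the points the preferred-view construction in Section 4 draws on.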
Most commonly researchers only consider the upper triangle of the $N \times N$ matrix, as symmetric distance functions create a symmetric matrix, and deterministic distance functions cause the diagonal to be $0$ in the case of feature distances, or a maximum value in the case of similarity scores.

In our framework, evaluating models amounts to calculating metrics over the distances $y$ conditioned on the attributes $X$. We treat partitioning and manipulation of a dataset as conditioning on attributes, and thus consider all $N^2$ data points in our analysis. Keeping all match and non-match points increases the data used to estimate the match and non-match distributions. Conditional analysis with feature distances from the matrix diagonal in synthetic experiments produces interesting results. Keeping redundant points from the symmetric matrix provides no additional information but is convenient for modeling over a continuous space and simplifies conditional statements.

We introduce a method to estimate the full density of the match and non-match distributions over continuous covariates, allowing us to efficiently estimate model performance with uncertainty at any given range of continuous covariate values with limited data. The metrics we estimate are derived from the ROC curve, defined as

$$\mathrm{TPR}(fpr) = F_M\!\left(F_{\bar{M}}^{-1}(fpr)\right) \tag{1}$$

where $fpr \in [0, 1]$ is the false positive rate and $F_M$ and $F_{\bar{M}}$ are the cumulative distribution functions of the match and non-match feature distances, respectively. In addition to ROC curves we summarize performance with the Area Under the ROC Curve (AUC) and the True Positive Rate at a fixed false positive rate.

It follows from Equation 1 that to estimate an ROC curve we can estimate $F_M$ and $F_{\bar{M}}^{-1}$ independently. Most commonly researchers make few assumptions and use the empirical CDF to estimate $F_M$, use the empirical quantile function to estimate $F_{\bar{M}}^{-1}$, and bootstrap to estimate confidence intervals. This non-parametric, empirical approach is convenient but limiting in the conditional case where we want to estimate

$$\mathrm{TPR}(fpr \mid x) = F_M\!\left(F_{\bar{M}}^{-1}(fpr \mid x) \,\middle|\, x\right) \tag{2}$$

where $x$ is our vector of covariates. In continuous space we will have limited or no data at any specific value of $x$, resulting in uncertain or undefined metrics. We use a more flexible approach and estimate $F_M$ and $F_{\bar{M}}$ with mixtures of normals and numerically invert $F_{\bar{M}}$ to calculate $F_{\bar{M}}^{-1}(fpr)$. Instead of bootstrapping, uncertainty is captured with posterior draws from the mixtures. Figure 2 demonstrates this method on real data.

Figure 2. Estimating match and non-match distributions with mixtures of normals allows us to estimate continuous densities as a function of covariates and capture uncertainty with posterior draws. Here we show this method is accurate on real data even at low False Positive Rates. In this specific example, we use VGGFace2 to extract features from the LFW dataset center cropped at a fixed scale, and calculate Euclidean distances between query/gallery pairs. In the left graph we display a teal histogram of match feature distances and a grey histogram of non-match feature distances. Also in the left graph we show posterior draws from the mixture with solid black lines and show individual mixture components scaled by weight with dotted black lines. In the right graph we show a traditional log ROC curve estimated using empirical distributions with a black line, and our method's log ROC curves calculated with posterior draws from the mixtures with grey lines. A traditional ROC curve estimation cannot be extended as a surface over continuous covariates, but our method can.

Estimating $F_M$ and $F_{\bar{M}}$ with mixtures of normals allows us to use continuous trends in the data to estimate $\mathrm{TPR}(fpr \mid x)$ at any value of $x$.
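As a concrete reading of Equation 1, the unconditional empirical estimate reduces to one quantile lookup and one CDF evaluation. This is a minimal sketch, not the authors' implementation, for distance scores where a pair is declared a match when its distance falls at or below a threshold:

```python
def tpr_at_fpr(match_dists, nonmatch_dists, fpr):
    """Empirical TPR(fpr) = F_M(F_Mbar^{-1}(fpr)) for distance scores.

    FPR is the fraction of non-match distances at or below the threshold,
    so the threshold is the empirical fpr-quantile of the non-match scores.
    """
    s = sorted(nonmatch_dists)
    k = int(fpr * len(s))                     # number of tolerated false positives
    threshold = s[k - 1] if k > 0 else float("-inf")
    # Empirical CDF of the match distances, evaluated at that threshold.
    return sum(d <= threshold for d in match_dists) / len(match_dists)

# Well-separated toy distributions: matches near 0, non-matches near 1.
matches = [0.05, 0.10, 0.15, 0.20]
nonmatches = [0.90, 0.95, 1.00, 1.05, 1.10]
print(tpr_at_fpr(matches, nonmatches, fpr=0.2))   # prints 1.0
```

At a specific covariate value $x$ there may be no pairs at all, which is exactly why this empirical recipe breaks down in the conditional case of Equation 2 and why we replace the empirical CDFs with covariate-conditioned mixtures.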
This amounts to modeling the weight and location of each component as a function of $x$, such that a mixture is defined as

$$\sum_{h=1}^{H} \pi_h(x)\, \mathcal{N}\!\left(y_i \mid \mu_h(x), \sigma_h\right) \tag{3}$$

where $H$ is the number of components, $\pi_h$ models the weights as a function of $x$, and $\mu_h$ models the locations as a function of $x$. In order to capture non-linear trends and express uncertainty in regions with limited data, we model the locations and weights of our mixtures with local radial basis functions at evenly spaced locations. In some of our experiments we find data is entirely concentrated in small regions of covariate space, so we prune basis functions in regions with no data to improve inference. Additionally, we find both hyperparameter tuning and a Dirichlet Process prior can choose appropriate numbers of components, and opt for hyperparameter tuning in our experiments because it is more consistently stable. For principled estimates of uncertainty we perform MCMC inference using Pyro [4] and PyTorch [19], and scale to large datasets with Stochastic Variational Inference.
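A minimal numerical sketch of Equation 3 follows, in plain Python rather than the Pyro model: component weights come from a softmax of radial-basis-function expansions and component locations are linear in the same basis. The basis centers, bandwidth, and coefficients here are made-up illustrative values, not fitted parameters.

```python
import math

CENTERS = [0.0, 0.5, 1.0]    # evenly spaced basis function locations (assumed)
BANDWIDTH = 0.25             # shared RBF bandwidth (assumed)

def rbf_features(x):
    """Local radial basis functions evaluated at a scalar covariate x."""
    return [math.exp(-((x - c) ** 2) / (2 * BANDWIDTH ** 2)) for c in CENTERS]

def softmax(zs):
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    total = sum(es)
    return [e / total for e in es]

def normal_pdf(y, mu, sigma):
    return math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(y, x, weight_coef, loc_coef, sigmas):
    """p(y | x) = sum_h pi_h(x) * N(y | mu_h(x), sigma_h), as in Equation 3."""
    phi = rbf_features(x)
    # Each component's weight logit and location are linear in the RBF features,
    # so both vary smoothly and non-linearly with the covariate.
    logits = [sum(w * f for w, f in zip(row, phi)) for row in weight_coef]
    mus = [sum(b * f for b, f in zip(row, phi)) for row in loc_coef]
    pis = softmax(logits)
    return sum(p * normal_pdf(y, mu, s) for p, mu, s in zip(pis, mus, sigmas))

# Two components (H = 2) over three basis functions, with hypothetical coefficients.
weight_coef = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]
loc_coef = [[0.2, 0.5, 0.8], [1.0, 1.2, 1.5]]
density = mixture_density(y=0.5, x=0.3, weight_coef=weight_coef,
                          loc_coef=loc_coef, sigmas=[0.1, 0.1])
assert density > 0.0
```

In the actual model these coefficients are latent variables with priors, inferred with MCMC or Stochastic Variational Inference; posterior draws of them give the posterior draws of the densities shown in Figure 2.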
4. Experimental Results
In this section we describe the setup and results of three experiments. We study age in face verification, scale in face verification, and temporal occlusion and detection size in pedestrian re-identification.
Researchers commonly study age when analyzing covariate effects on face verification performance [15]. In this experiment we use the MORPH dataset [1] cleaned by the procedure in [2]. This dataset has over 50,000 images that were annotated for age at capture time, with additional metadata on gender and race. We compare our method to the recent work of Albiero et al. [2].

Albiero et al. control for age while studying the relationship between age and model performance. Controlling for a covariate amounts to only comparing images that have the same associated covariate value, effectively removing that covariate as a possible discriminative feature. Thus, in experiments with continuous covariates we want to estimate model performance and uncertainty on the diagonal of the query and gallery covariate dimensions where they are exactly equal. Our method uses trends in the feature distances near the diagonal to capture the latent match and non-match distributions on the diagonal, providing continuous estimates of both performance and uncertainty. Previous studies compensate for the lack of data on this diagonal by binning age into discrete groups [2, 11, 20, 8] and calculating intra-bin metrics. Albiero et al. use "young" (16-29), "middle" (30-49), and "old" (50-70). Binning can misrepresent performance and confidence in three ways. First, it discards the distribution of data within a bin. For example, in the MORPH dataset the quantity of data decreases within the "old" bin—our intuition says confidence in our metric should decrease too. Second, it can average over continuous trends in the data. Even if the old group performs worse, our intuition says performance is not as different between ages 49 and 50 as it is between 30 and 70. Third, it requires a researcher to choose the number and size of bins. Fewer bins results in more confidence but a worse representation of the continuous data; more bins better captures continuous trends but results in less data per bin and less reasonable confidence estimates from bootstrapping.

Figure 3. Verification performance probably decreases with age, but we report very high uncertainty where data is limited and would need more data for conclusive results. Here, we plot the True Positive Rate at a fixed False Positive Rate against age. Shown in blue, previous work discretizes age into three bins and estimates the 95% confidence interval with bootstrapping. We use Bayesian methods to capture continuous trends of performance and uncertainty without binning, and show the 95% and 99% credible intervals in dark and light grey, respectively. Binning estimates higher performance and higher certainty than our method because it averages over the true continuous trends, effectively weighting results by intra-bin data density. (Note that previous work [2] does not calculate the bottom two graphs, "African American Female" and "Caucasian Female", because of limited data. We generated these for illustrative purposes.)

In our experiments we use VGGFace2 [5] to extract features from images in the MORPH dataset and calculate Euclidean feature distances. For comparison with our method we bin images into the same age ranges as [2], calculate intra-bin true positive rates at a fixed false positive rate, and estimate confidence intervals from one hundred bootstrapped calculations. The results are displayed in blue in Figure 3. For our method we independently normalize the match and non-match feature distances and the query and gallery age, and evenly space basis functions for component locations and weights over the normalized space. The entire dataset lies close to the diagonal of query and gallery age, so we a priori prune basis functions more than one standard deviation away from any data point. We perform Stochastic Variational Inference in Pyro [4] and use 100 samples from our posterior to capture uncertainty. The results are displayed in grey in Figure 3.

Our method shows that performance probably tends to decrease with age. However, our method also expresses very high uncertainty as age increases, reflecting the decrease of data within the old bin. We would need more data in the old group to verify our results. There are two differences between our results and the binned result in Figure 3. First, our method estimates a lower true positive rate than the "old" bin. This can be explained by the distribution of data within the old bin. Most of the data tends to come from younger people within the old bin, and thus "old" bin performance is similar to an average over the true continuous relationship, weighted by data quantity.
In fact, we find more, smaller bins capture this relationship. Second, the low certainty from our method contrasts starkly with the high certainty from binned bootstrapping. This can be explained by the large bin size. Within the "old" bin, from age 50 to age 70, the amount of data decreases. However, the "old" bin captures the total of that data, which is enough to cause confident bootstrap results.

Computer Vision practitioners introduce covariates when they perform image preprocessing for Convolutional Neural Networks. Here we study how the scale parameter in center cropping affects verification performance of the VGGFace2 network. We use the Labeled Faces in the Wild [10] (LFW) face verification dataset with over 13,000 images. We compare our method to the recent work of RichardWebster et al. [24].

RichardWebster et al. artificially perturb images to study model performance and robustness, introducing the parameter of a perturbation function as a covariate. At 100 perturbation parameter values, they perturb a 1,000 image subset of the LFW dataset and calculate Rank-1 performance, where the original images are the gallery set and the perturbed images are the query set. They also propose a method to isolate the effects of the covariate of interest. They select a partition of the 1,000 images that maximizes model performance and dataset size using a graph cut algorithm, creating a "Preferred View" of the dataset. We make several modifications to the original study: we use ROC curve based metrics instead of Rank-1, we use all 13,000 images of the LFW dataset, and we study image scale, which is not one of the perturbations originally studied. Most importantly, we simplify the Preferred View selection algorithm and provide a visual explanation of the Preferred View's importance in Figure 4. We simply satisfy the two conditions of the preferred view by selecting feature distances from the diagonal of the symmetric $N \times N$ distance matrix. Maximal performance is satisfied because all diagonal feature distances are $0$. Dataset size is maximal because we do not reduce the non-match data points, and theoretically we can use infinite match data points because we can compare any image to itself. We call the match distribution selected from the diagonal the "preferred view match distribution". The practical benefit of our preferred view formulation is that when we perturb the query and gallery set, feature distances in the preferred view match distribution are no longer $0$, and their positive value is totally explained by the perturbation.

Figure 4. When the gallery images are scaled by a fixed value before center cropping, the match and non-match distributions are well-separated for a large range of query scales, indicating robust performance. Works that do not model both query and gallery scales will see a different slice at a different fixed gallery scale, and report low robustness. Here, we scatter the non-match, match, and preferred view match feature distances over changes in query scale and show the estimated 80% densities from our method. The top images make up a legend that exemplifies non-match as comparisons between different identities, match as comparisons between same identities but different original images, and preferred view match as comparisons between same identities with same original images. The closer the preferred view match distribution is to the match distribution, the more the match distribution is explained by the covariate of interest, scale.

For comparison with our method we pick an evenly spaced ten by ten grid over the query and gallery scale dimensions, spanning from the original scale of the LFW dataset to a considerably zoomed-in scale. At each point on the grid described by a query scale value $x_q$ and a gallery scale value $x_g$ we generate our query set by scaling the entire dataset by $x_q$ and generate our gallery set by scaling the entire dataset by $x_g$. At each point $\{x_g, x_q\}$ we calculate the True Positive Rate at a fixed False Positive Rate from the feature distances. In total we calculate over 17 billion feature distances. For our method we generate 100 times less data, perturbing each image only once by a random uniform scale value to simulate how natural datasets are formed. Based on preliminary data analysis we choose a less smooth prior than in previous experiments and reduce the distance between basis functions. Inference is performed the same as in our previous experiment.

Figure 5. VGGFace2 performance peaks at a clear optimal scale of LFW query and gallery images before center cropping. Performance is poor if either query or gallery images are scaled outside the optimal range, and is most robust to changes in scale when query and gallery image scales are the same. Colors indicate True Positive Rate at a fixed False Positive Rate, where 0 is black and 1 is white. An ideal synthetic study, pictured in the left image, captures model performance at a finite number of query and gallery scales. Using over 100 times less data than the synthetic study, our method captures continuous non-linear trends and outputs the dense surface of performance seen in the right image. Increasing the 10x10 pixel heatmap on the left to be a 250x250 heatmap like the right would require calculating over 10 trillion more feature distances.

Our method shows that VGGFace2 has a very clear optimal range of scale parameters, and performance sharply drops outside of that range. There are two major differences between our method and previous methods. First, we observe that previous methods only perturb the query images and would misrepresent the robustness and performance of the model. The model performs best where query and gallery scales are equal and zoomed in. Fixing the gallery set at the original scale and perturbing the query set would show only a poor slice of performance despite robustness on the diagonal. In Figure 4 we visualize how the match, preferred view match, and non-match distributions change over the query scale at two different fixed gallery scales, demonstrating it is necessary to model both the query and gallery covariates. In Figure 5 we model both the query and gallery covariates and visualize the resulting metric surface. Our method produces a dense continuous surface of performance that captures robustness on the diagonal and the optimal scale point. In contrast, the synthetic method uses 100 times more data, produces metric results at a finite set of 100 values, and requires exponentially more data and compute to increase the density. Finally, we can consider performance calculated with the 100 simulated datasets to be a gold standard approximation of ground truth model performance and compare our method's estimate at those covariate values to understand its accuracy. From 100 posterior draws our method achieves a high $R^2$ across the 90% credible interval.

Recent improvements in the rapidly growing field of pedestrian re-identification [13] have improved pedestrian tracking performance [7, 28]. Unfortunately, evaluation of pedestrian re-identification models has received little attention, and performance specific to tracking models remains unmeasured. In this experiment we demonstrate how our method can be used to estimate a pedestrian re-identification model's performance in a tracking setting. We use the Joint Attention in Autonomous Driving (JAAD) dataset [22], a dataset of dashcam videos from a moving vehicle annotated with pedestrian detection, tracking, and various other attributes.

A good pedestrian tracking algorithm tracks pedestrians through temporal occlusion—when objects like other pedestrians and cars obstruct the camera view.
It is common to use Convolutional Neural Networks trained for pedestrian re-identification to measure the similarity between two detected pedestrians, and to use a hard threshold on feature distances to determine whether two detections are the same pedestrian. However, a hard threshold is unlikely to be optimal given the wide range of conditions in 2D pedestrian tracking. Our intuition tells us that pedestrian appearance probably changes more with elapsed time, and that far-away pedestrians will have less discriminative features than those closer to the car. In the 2D tracking setting these effects can be captured by the elapsed time between detections and the bounding box sizes of the two detections. We examine these three covariates, model the threshold as a function of the covariates and a fixed False Positive Rate, and estimate our expected True Positive Rate in specific conditions.

This task is fundamentally a verification task, so we maintain the experimental setup of our previous experiments. Pedestrian images are extracted by cropping detections from the unoccluded subset of the JAAD test set. We extract features with a high-performing pedestrian re-identification model, OSNet [29], and calculate Euclidean feature distances. We model the match distribution as a function of query detection height, gallery detection height, and the time between image captures. We model the non-match distribution as a function of query detection height and gallery detection height.

Figure 6. We show that the threshold needed to maintain a False Positive Rate of 10− is drastically different for different detection sizes. When a small detection is compared to a medium to large detection, a very high threshold is required. As the query and gallery detection heights each get closer to 33% of the frame height, we can use a much lower threshold. Our confidence for large detections is much lower, as there are fewer examples in the dataset.
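Assembling the covariate-tagged score data described here might look like the sketch below. The detection record fields and the precomputed embeddings are hypothetical stand-ins, not the JAAD/OSNet pipeline itself; the point is that every same-pedestrian pair contributes one distance together with the covariates attached to that pair.

```python
import numpy as np
from itertools import combinations

def match_rows(detections):
    """Build the match-distribution dataset for a tracking video.

    `detections` is a list of dicts with hypothetical keys:
    'track_id', 'time' (seconds), 'height' (fraction of the frame
    height), and 'feature' (an L2-comparable embedding).
    Returns one (distance, query_height, gallery_height, elapsed)
    row for every pair of detections of the same pedestrian;
    cross-track pairs would feed the non-match distribution.
    """
    rows = []
    for a, b in combinations(detections, 2):
        if a['track_id'] != b['track_id']:
            continue  # non-match pair, handled elsewhere
        dist = np.linalg.norm(a['feature'] - b['feature'])
        rows.append((dist, a['height'], b['height'],
                     abs(a['time'] - b['time'])))
    return np.array(rows)
```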
We represent detection height as a percentage of the frame height, independently normalize the covariate dimensions, use a smooth prior, and prune our basis functions in regions more than one standard deviation away from any data point. Inference is performed the same way as in previous experiments.

We calculate thresholds at a False Positive Rate of 10−. Specifically, we numerically invert F_M̄ to calculate F_M̄^{-1}(10− | x), where x is a specific pair of query box height and gallery box height values. We display the results, and our confidence in those results, as a heatmap in Figure 6. The most obvious trend we find is that smaller detections need a much higher threshold when being compared with medium to large detections. Our method also captures high uncertainty for large boxes, where data is limited. We would err on the side of caution and choose a high threshold at the edge of our 95% credible interval when encountering large boxes in practice.

Modeling the match distribution to estimate the True Positive Rate in a tracking dataset requires an extra covariate. The majority of feature distances in the match distribution come from images captured a very short time apart, resulting in overstated performance and unreasonably high confidence, similar to the binning method in our age experiments. Estimating the match distribution as a function of elapsed time between images allows us to capture decreasing performance and decreasing confidence over increasing temporal occlusions. In Figure 7 we visualize these high-dimensional results by graphing the expected True Positive Rate against elapsed time for several specific pairs of gallery detection height and query detection height. In general, we notice that overall performance decreases rapidly within the first few seconds of elapsed time and that uncertainty is very high at larger elapsed times, reflecting the small number of long tracks in the dataset. Contrary to our intuitions, we see a local peak in performance after ten seconds of occlusion at some query/gallery detection height combinations. This is likely caused by tracks where the car and pedestrian are stationary, so the pedestrians' features are fairly constant throughout the tracks.

Figure 7. Even after adjusting thresholds, different query and gallery detection sizes will perform differently over temporal occlusions in a pedestrian tracking scenario. Here we graph our method's 90% and 98% credible intervals for the estimated True Positive Rate at a False Positive Rate of 10− against the number of seconds between pedestrian detections. Each graph is for a specific combination of query and gallery detection heights, calculated as a percent of the frame height. For example, in the top-left graph we show that a detection with a height that is 33% of the frame height will correctly be matched with detections with a height that is 24% of the frame height about 80% of the time after 12 seconds of occlusion. In general we observe that medium-size boxes perform best, and performance is negatively correlated with seconds elapsed. However, there are some interesting non-linear trends, like the late peaks in the top-left and bottom-right graphs.
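The numerical inversion behind the variable thresholds can be done with plain bisection, since a CDF is monotone in its argument. This is a minimal sketch under stated assumptions: `cdf` stands in for the fitted non-match distribution's CDF evaluated at one covariate value (and one posterior draw), and the bracket [lo, hi] is assumed to contain the answer.

```python
def invert_cdf(cdf, target, lo=0.0, hi=10.0, iters=60):
    """Numerically invert a monotone CDF by bisection.

    Finds t such that cdf(t) is approximately `target`; used here
    to turn a fixed False Positive Rate into a covariate-specific
    threshold t(x), one inversion per covariate value and per
    posterior draw.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) < target:
            lo = mid  # threshold lies above mid
        else:
            hi = mid  # threshold lies at or below mid
    return 0.5 * (lo + hi)
```

Running this inversion for every posterior draw yields a distribution of thresholds at each covariate value, from which the credible intervals in the heatmap follow directly.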
5. Discussion
Throughout our experiments we found our method was best used together with previous works. Binning methods are popular because they are simple, intuitive, and easy to debug. We found it useful to use different numbers and sizes of bins to verify our own results and explore the data. Synthetic methods are expensive but accurate. We used a combination of sparse synthetic trials and our Bayesian method to decide on the final range of image scale values ([0., .]) studied in our second experiment. Measuring conditional model performance is a complex task. Claims about model performance require advanced methods and domain knowledge. We present a tool for the computer vision practitioner that makes it easier.

References

[1] Morph dataset.
[2] V. Albiero, K. Bowyer, K. Vangara, and M. King. Does face recognition accuracy get better with age? Deep face matchers say no. In The IEEE Winter Conference on Applications of Computer Vision, pages 261–269, 2020.
[3] L. Best-Rowden and A. K. Jain. Longitudinal study of automatic face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(1):148–162, 2017.
[4] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978, 2019.
[5] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In , pages 67–74. IEEE, 2018.
[6] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
[7] G. Ciaparrone, F. L. Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera. Deep learning in video multi-object tracking: A survey. Neurocomputing, 381:61–88, 2020.
[8] C. M. Cook, J. J. Howard, Y. B. Sirotin, J. L. Tipton, and A. R. Vemury. Demographic effects in facial recognition and their dependence on image acquisition: An evaluation of eleven commercial systems. IEEE Transactions on Biometrics, Behavior, and Identity Science, 1(1):32–41, 2019.
[9] K. Grm, V. Štruc, A. Artiges, M. Caron, and H. K. Ekenel. Strengths and weaknesses of deep learning models for face recognition against image degradations. IET Biometrics, 7(1):81–89, 2017.
[10] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. 2008.
[11] B. F. Klare, M. J. Burge, J. C. Klontz, R. W. V. Bruegge, and A. K. Jain. Face recognition performance: Role of demographic information. IEEE Transactions on Information Forensics and Security, 7(6):1789–1801, 2012.
[12] A. Kortylewski, B. Egger, A. Schneider, T. Gerig, A. Morel-Forster, and T. Vetter. Empirically analyzing the effect of dataset biases on deep face recognition systems. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2093–2102, 2018.
[13] Q. Leng, M. Ye, and Q. Tian. A survey of open-world person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[14] B. Lu, J.-C. Chen, C. D. Castillo, and R. Chellappa. An experimental evaluation of covariates effects on unconstrained face verification. IEEE Transactions on Biometrics, Behavior, and Identity Science, 1(1):42–55, 2019.
[15] Y. M. Lui, D. Bolme, B. A. Draper, J. R. Beveridge, G. Givens, and P. J. Phillips. A meta-analysis of face recognition covariates. In , pages 1–8. IEEE, 2009.
[16] S. Mitra, M. Savvides, and A. Brockwell. Statistical performance evaluation of biometric authentication systems using random effects models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4):517–530, 2007.
[17] H. Nicholson. Psychophysical evaluation of deep re-identification models. arXiv preprint arXiv:2005.02136, 2020.
[18] A. J. O'Toole, P. J. Phillips, X. An, and J. Dunlop. Demographic effects on estimates of automatic face recognition performance. Image and Vision Computing, 30(3):169–176, 2012.
[19] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019.
[20] P. J. Phillips, P. J. Grother, R. J. Micheals, D. M. Blackburn, E. Tabassi, and M. Bone. Face Recognition Vendor Test 2002: Evaluation report. Technical report, 2003.
[21] A. Pumarola, J. Sanchez-Riera, G. Choi, A. Sanfeliu, and F. Moreno-Noguer. 3DPeople: Modeling the geometry of dressed humans. In Proceedings of the IEEE International Conference on Computer Vision, pages 2242–2251, 2019.
[22] A. Rasouli, I. Kotseruba, and J. K. Tsotsos. Are they going to cross? A benchmark dataset and baseline for pedestrian crosswalk behavior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 206–213, 2017.
[23] B. RichardWebster, S. E. Anthony, and W. J. Scheirer. PsyPhy: A psychophysics driven evaluation framework for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2280–2286, 2018.
[24] B. RichardWebster, S. Yon Kwon, C. Clarizio, S. E. Anthony, and W. J. Scheirer. Visual psychophysics for making face recognition algorithms more explainable. In Proceedings of the European Conference on Computer Vision (ECCV), pages 252–270, 2018.
[25] M. X. Rodríguez-Álvarez and V. Inacio. ROCnReg: An R package for receiver operating characteristic curve inference with and without covariate information. arXiv preprint arXiv:2003.13111, 2020.
[26] W. J. Scheirer, S. E. Anthony, K. Nakayama, and D. D. Cox. Perceptual annotation: Measuring human vision to improve computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1679–1686, 2014.
[27] W. J. Scheirer, A. Bendale, and T. E. Boult. Predicting biometric facial recognition failure with similarity surfaces and support vector machines. In , pages 1–8. IEEE, 2008.
[28] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In , pages 3645–3649. IEEE, 2017.
[29] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang. Omni-scale feature learning for person re-identification. In