Adaptive Algorithm and Platform Selection for Visual Detection and Tracking
Shu Zhang, Qi Zhu, and Amit K. Roy-Chowdhury
Abstract—Computer vision algorithms are known to be extremely sensitive to the environmental conditions in which the data is captured, e.g., lighting conditions and target density. Tuning of parameters or choosing a completely new algorithm is often needed to achieve a certain performance level, especially when there is a limitation on computational resources. In this paper, we focus on this problem and propose a framework to adaptively select the "best" algorithm-parameter combination (referred to as the best algorithm for simplicity) and the computation platform under performance and cost constraints at design time, and to adapt the algorithms at runtime based on real-time inputs. This necessitates developing a mechanism to switch between different algorithms as the nature of the input video changes. Our proposed algorithm calculates a similarity function between a test video scenario and each unique training scenario, where the similarity calculation is based on learning a manifold of image features that is shared by both the training and test datasets. Similarity between training and test datasets indicates that the same algorithm can be applied to both of them and achieve similar performance. We design a cost function with this similarity measure to find the training scenario most similar to the test data. The "best" algorithm under a given platform is obtained by selecting the algorithm with the specific parameter combination that performs best on the corresponding training data. The proposed framework can be used first offline to choose the platform based on performance and cost constraints, and then online, whereby the "best" algorithm is selected for each new incoming video segment on the given platform. In the experiments, we apply our approach to the problems of pedestrian detection and tracking, and show how to adaptively select platforms and algorithm-parameter combinations. Our results provide optimal performance on 3 publicly available datasets.
Index Terms—algorithm-parameter selection, platform selection
I. INTRODUCTION

Shu Zhang, Qi Zhu, and Amit K. Roy-Chowdhury are with the Department of Electrical and Computer Engineering, University of California, Riverside, CA, USA, 92521.

Numerous algorithms have been developed for different computer vision applications like object detection, object recognition, tracking, etc. Also, many public datasets have been released to help researchers fairly evaluate their algorithms. For instance, the CAVIAR [1], ETHMS [2], and TUD-Brussels [3] datasets have been commonly used in the area of tracking. In most cases, each algorithm is able to achieve very good performance on some datasets, while failing to beat other algorithms on others. It is also interesting to see that some algorithms perform well on parts of a dataset, but cannot achieve good results on other parts. This is because every algorithm is sensitive to the environmental conditions in each dataset or parts thereof. Moreover, although some state-of-the-art algorithms can achieve better results than other algorithms in simulation, their high computational complexity might significantly reduce their performance in real-world scenarios, or require computation platforms that are outside the monetary or energy budget. In such cases, choosing other algorithms with a different platform may be more beneficial. All these observations raise an important question: can we automatically select the best computation platform and the best algorithm-parameter combination for an application domain?

The goal of this paper is the following. Given a set of existing computer vision algorithms and their parameters, i.e., algorithm-parameter combinations, for a certain problem, can we select the proper platform under certain computation and performance constraints? Also, can we automatically select the "best" algorithm-parameter combination on a given platform for a particular dataset (at design time this would be a benchmark; at runtime this would be real-time input video)? The answer, in most cases, will not lie in one specific algorithm-parameter combination but in an adaptive mechanism for selecting among the set of algorithm-parameter combinations, since the conditions in the video will likely change over time. Conditions that could trigger a switch include the lighting in the video, the number of targets in the scene, the resolution of the targets, and so on - factors which are known to affect the performance of vision algorithms. In the experiments, we specifically focus on the problems of pedestrian detection and tracking, since pedestrian detection is a fundamental low-level task that is crucial to higher-level tasks, e.g.,
tracking, and these two tasks are known to be sensitive to environmental factors. The proposed methodology can be generally applied to other computer vision applications, with specific features selected for those applications.

An illustration of such an algorithm selection process is shown in Fig. 1, where the results of four pedestrian detectors are affected by the frequently-changing scales of objects, number of objects, and illumination conditions. There are two different sets of algorithm-parameter combination selection results, obtained under two platforms. In both the left and right parts of Fig. 1, each row shows representative image frames from a video recorded at different times of day, and each column denotes the person detection results of four different algorithms. It is noted that each pedestrian detector achieves the desired results on some image frames while not performing well on the others. For instance, in the first frame, the detectors with the best performance are detectors 1, 2 and 4, while in the second frame, the detector with the best performance changes to the first detector. In the third frame, only detector 3 successfully detects all the pedestrians. In the fourth and fifth frames, the best performance is not achieved by the same detector.

Fig. 1: Illustrations of pedestrian detection results by four algorithms {A_1, A_2, A_3, A_4} with the parameter frames per second (FPS) in a video sequence within a day. The two parts represent two computation platforms, each of which leads to different FPSs for these four algorithms. In each part, a row denotes the results of an algorithm-parameter combination, and each column represents an image frame. An example of the adaptive algorithm-parameter selection under a platform is shown with green arrows in the left part. In the right part, with a different platform, the algorithm-parameter selection results change.
The image features of the first two image frames are similar to each other. However, in the third frame, the illumination condition changes and the number of pedestrians increases. It is shown that detector 3 achieves the best performance in the third frame. Similar observations can be made in the rest of the images. An example of an ideal detector, obtained by switching between the four original detectors, is shown with green arrows in the left part. This shows the importance of developing an adaptive switching mechanism between the algorithm-parameter combinations that minimizes the detection error for each scenario. The selection of platform and algorithm-parameter combination in the left part is based on the performance constraint. In the right part, with a different performance constraint and other constraints (e.g., energy consumption), the selected platform and algorithm-parameter combination differ from those of the left part. Although algorithms 3 and 4 perform well under the first platform, their computation time under the second platform is too high and thus they cannot be used in this application. Such a sacrifice in performance is often necessary in real applications when there is a need to find a balance between the performance and the computational demand of algorithms.

A. Overview and Contributions
Motivated by observations from Fig. 1, we propose a switching algorithm which adaptively selects the best available algorithm-parameter combination, along with a platform, for each scenario based on the characteristics of the video under certain constraints. Our input consists of a set of existing algorithms that are well known in the community for the specific vision task, in this case pedestrian detection and tracking, the applications we focus on in this paper. These algorithms' parameters are also known. In addition, we have datasets on which these algorithms have been tested. Each available algorithm takes image frames as inputs and produces performance results as outputs.

There are two operating phases in our proposed framework: design time and runtime. At design time (offline), we learn the mapping between each unique scenario in the training data and the algorithm-parameter combinations for each platform, and select the platform based on the training dataset and the performance and cost constraints. The parameters that we use include frames per second (FPS) and image resolution. The algorithm-parameter combination that obtains the best performance under the platform is then labeled as the best algorithm-parameter combination for this training scenario. At runtime, we adapt the algorithm-parameter combination based on the real-time input and the performance and cost constraints. Specifically, we segment every video sequence in the test dataset into time windows. The goal of the proposed algorithm selection process is to choose the "best" algorithm-parameter combination for each video segment. This is done in two steps. The first step is to compute a similarity function between the test video segment and all the training scenarios over a learned manifold of image features shared by the training and test datasets. This method has been referred to as domain adaptation in the literature [4], [5]. The output is the training scenario that the test video is closest to in this space.
In the second step, the "best" algorithm-parameter combination for a video segment is obtained by selecting the algorithm-parameter combination with the best performance on the selected training scenario. Note that in principle we could also adjust the platform (to some degree, depending on the platform's capability), but this is out of the scope of this paper.

We demonstrate the efficacy of the proposed approach on multiple well-known datasets. We apply 10 algorithm-parameter combinations on 3 public datasets [2], [3], [6]. We show how to choose the "best" algorithm-parameter combination for each time window of image frames through switches from one algorithm-parameter combination to another within a dataset under a performance constraint. We show that the proposed approach is able to obtain optimal or close-to-optimal performance among all the algorithm-parameter combinations' performances given certain performance constraints.
B. Related Works
Algorithm selection has been studied in a few recent works. In [7], image segmentation algorithms are selected for different images. Features are learned by a support vector machine (SVM) and the performance of each algorithm is mapped to a four-bin ranking vector based on the correlation between features. The results are shown to be effective on 1000 synthetic images. In [8], the goal is to segment the pixels of an image into different regions that are suitable for different algorithms. Different features are classified by a random forest classifier, and different optical flow algorithms are automatically selected.

Our work is different from these two approaches. We consider the problem of automatically switching the algorithm based on the scene similarity between a test time window and all the unique scenarios in the training dataset. Our proposed algorithm does not learn which specific feature to use for a dataset, and does not need manual analysis of the feature-performance correspondence. This is more general than [8], where the effects of the features on the training dataset are manually analyzed to obtain the correlations between features and algorithms, e.g., which feature has an impact on specific data. The methodology of domain adaptation that we use finds the underlying correspondences between features, while the approaches in [7], [8] do not investigate this issue. In [9], budget constraints are taken into consideration as the leverage rule between different algorithms in the context of handwriting recognition and scene categorization. The algorithms are selected based on a binary tree. Our work does not only consider budget constraints. Instead, our proposed framework considers both the budget constraints and the performance of each algorithm.
We also investigate the possibility of changing computation parameters to improve algorithm performance.

To the best of our knowledge, this is the first work that adaptively selects algorithms with applications to pedestrian detection and tracking. We briefly introduce some related works on these two applications. The most widely used pedestrian detector of the past decade is the Histograms of Oriented Gradients (HOG) detector [10], where the HOG feature is developed and a linear SVM is adopted to classify HOG features. The part-based model (PBM) developed in [11], [12] applies HOG features to a part-based multi-component model and achieves very good performance on some datasets. The work in [13] considers a deformable part model with k parts as k-poselets, and uses a separate HOG template to model the appearance. A summary of works on person detection can be found in [14]. The work in [15] looks into the problem of how a previously trained classifier can be adapted to a target dataset and proposes deep networks which jointly solve the problem of target detection as well as reconstruction of the target scene. Our method differs from this in two important ways. First, we build upon existing well-known algorithms whose performance is very well understood and choose the best algorithm for each video segment. Second, the method proposed here can be applied online as new test data becomes available.

In the area of multi-target tracking, most state-of-the-art works focus on solving the problem of data association, given that the detections are available. The works [16], [17] adopted the bipartite graph matching method to find initial detection association results, and used the statistics or other properties of the associated tracks to obtain the final tracking results. The works [18], [19], [20] developed complex detection association models based on the grouping behaviors between targets.
In the works [21], [22], [23], the problem of multi-target tracking was modeled as a network-flow problem. Although state-of-the-art results have been obtained recently, the computational costs of these algorithms are usually too high for applications that have budget constraints or processing speed requirements. For example, the works [19], [24] have complex graph structures which make the learning and inference of the graphs time consuming. To meet the computation time requirements, we adopt a simple yet effective approach that has been widely used as the baseline algorithm in the works [16], [17], [18], [19], [20]. The details are provided in the experimental section.

II. METHODOLOGY
A. Problem Description
We assume the availability of a number of algorithms for the problem. Representative parameters of these algorithms are also known. Our goal is to answer the following questions: for every part of an unknown dataset, is it possible to automatically select an algorithm-parameter combination, along with a platform, among all available algorithm-parameter combinations that achieves the best result under certain performance constraints? And for the entire unknown dataset, what is the best strategy to switch between algorithms?

In our problem, the input is the set of K available algorithms A = {A_1, ..., A_K} with different parameters and the dataset on which they are evaluated. We call this the training dataset T = {T_1, ..., T_M}, where T_i represents the i-th unique scenario in T. The segmentation of T can be done by any data classification methodology. We combine algorithms with different parameters to obtain a set of algorithm-parameter combinations, denoted by B = {B_1, ..., B_H}.

Under each platform at design time, we apply every algorithm-parameter combination B_h in B to each T_i in T. Given a constraint such as a performance constraint, we select a platform with the corresponding computation capability. We then select the algorithm-parameter combination that performs the best as the training label Y_i under the constraint.

The unknown dataset is called the test dataset R. In R, we assume that there are in total N time windows. Every time window of images is denoted as R_j, j = 1, ..., N. At runtime, the selection of algorithms for R_j is represented by L_j. Given the pairs (T_i, Y_i) under the same performance constraint, the problem is how to find the unknown label L_j for each R_j in R. All the notations are summarized in Table I.

Fig. 2: Overall Methodology. The algorithms A = {A_1, ..., A_K} are combined with a number of parameters to generate the algorithm-parameter combinations under certain performance constraints. We learn the mapping between the training data and each algorithm-parameter combination, and obtain the training labels Y = {Y_1, ..., Y_M}. The feature similarity scores between the training and test datasets are calculated from T and R. A cost function with two steps is defined and solved in Sec. II-D.

TABLE I: Notation Table.
A        the set of available algorithms A_1, ..., A_K
B        the set of available algorithm-parameter combinations B_1, ..., B_H
T        training dataset T_1, ..., T_M
R        test dataset R_1, ..., R_N
Y_i      the label of the training data T_i ∈ T
L_j      the label of the test data R_j ∈ R
t_i      the feature of T_i, the dimension of which is a
r_j      the feature of R_j, the dimension of which is a
b        the dimension of the subspaces of t_i and r_j
x_i      the basis of the subspace of t_i, the dimension of which is a × b
z_j      the basis of the subspace of r_j, the dimension of which is a × b
x̃_i      orthogonal to x_i, the dimension of which is a × (a − b)
z̃_j      orthogonal to z_j, the dimension of which is a × (a − b)
W_ij     geodesic kernel
θ(y)     geodesic flow parameterized by y in Eq. 3
Λ_i      diagonal matrices in Eq. 4

B. Solution Overview
The overview of our solution is shown in Fig. 2. In an unknown test dataset R, every video sequence/image set is segmented into a sequence of non-overlapping time windows R_j. The output of the algorithm, the label set L = {L_1, ..., L_N}, is obtained by a two-step cost function. This cost function is able to automatically select the best algorithm-parameter combination for a specific time window R_j under a certain platform, which is determined by the performance constraint. In the first step, we measure a similarity score S(T_i, R_j) between R_j and every training scenario T_i in the training dataset T, and find the scenario T_i* that is most similar to R_j, i.e., the one with the largest S(T_i*, R_j). In the second step, we find the "best" algorithm-parameter selection for T_i*, and choose that selection for R_j.

The structure of our solution is as follows. In Sec. II-C, we introduce how to calculate the similarity score S(T_i, R_j) between the time window R_j in the test dataset and a particular scenario T_i in the training dataset. In Sec. II-D, we introduce the overall two-step cost function that is able to find the best algorithm-parameter selection for every time window in R.

C. Similarity Scores between Training Scenarios and Test Time Windows
Following the similarity definition in [16], the similarity between R_j and T_i, denoted by S(T_i, R_j), is calculated as an exponential function of the feature distance:

S(T_i, R_j) = e^{-d(T_i, R_j)},   (1)

where d(T_i, R_j) represents the feature distance between T_i and R_j, whose computation is shown below.
1) Feature Distance Computation:
In this section, we provide a solution for d(T_i, R_j) in Eq. 1. Different from [8], where feature distances are directly computed, we consider the mismatch between the training data and the test data. This mismatch can come from many sources, e.g., pose, illumination, image quality, etc. In other words, even though the training and test data have features lying in different spaces, a domain shift might indicate similar distributions of the two sets of features. An example is shown in Fig. 3, where each column of images does not have the same feature distribution in terms of illumination, size of pedestrians, etc. However, the pedestrian detection experiments show that the same algorithm should be applied to each column of images to achieve the best performance. This indicates that directly calculating the feature distance between two datasets may lead to wrong classification results. It is highly likely that there is an underlying space that is shared by the features of both training and test data. If the features of T_i and R_j share similar distributions in such a space, there is a high chance that the same algorithm can be applied to both T_i and R_j. Finding such a space is often known as the problem of domain adaptation. Our solution is motivated by the approaches in [4], [5], where the mapping between the training data and the test data is modeled as a geodesic flow.

Fig. 3: Examples of mismatches between feature distributions. The images in each column do not share the same feature space. The application of domain adaptation indicates that the same pedestrian detector should be applied to both rows of each column.

The key idea of domain adaptation is to project both the training data and each video segment of the test data into subspaces to learn domain-invariant features. A challenge is how to determine and select the subspace that is shared by both training data and test data.
We use a geodesic flow curve to link the training and test data on a Grassmann manifold (similar to [4], [5]), which is a special type of Riemannian manifold [25]. As long as the projections of the two datasets on the Grassmann manifold are similar, the features of the data also share similar distributions. Now the problem is converted to the computation of the geodesic flow, as explained below.

Denote the features of T_i and R_j as t_i ∈ R^a and r_j ∈ R^a respectively. We denote by b the dimension of the subspaces of t_i and r_j. Performing principal component analysis (PCA) on t_i and r_j, we obtain x_i ∈ R^{a×b} and z_j ∈ R^{a×b}, which are the basis vectors of the subspaces of t_i and r_j. The orthogonal complement of x_i is defined as x̃_i ∈ R^{a×(a−b)}, and that of z_j is defined as z̃_j ∈ R^{a×(a−b)}.

The work in [4] provides a closed-form solution for the geodesic flow between t_i and r_j, which is

t_i^T W_ij r_j = \int_0^1 (θ(y) t_i)^T (θ(y) r_j) dy,   (2)

where θ(y) is a constructed geodesic flow function parameterized by a continuous variable y ∈ [0, 1] and W_ij is the kernel function that is defined below.

The term θ(y) in Eq. 2 represents how to smoothly project a feature t_i onto r_j, where θ(y) t_i projects a feature onto the y-th subspace on the Grassmann manifold. θ(y) is defined as

θ(y) = x_i, if y = 0;  z_j, if y = 1;  x_i U Σ_1(y) − x̃_i V Σ_2(y), otherwise,   (3)

where U, V, Σ_1 and Σ_2 are obtained by singular value decomposition (SVD) of x_i^T z_j and x̃_i^T z_j. The index y denotes the y-th subspace. In other words, y is a continuous variable and θ(y) parameterizes an infinite number of subspaces to construct the geodesic flow.

Looking back at Eq. 2, we find that it calculates the geodesic flow over all y, which means that the projected original features are expanded to all subspaces. In this case, W_ij is important since it induces inner products between features with infinite dimensions.
In theory, W_ij is the kernel between t_i and r_j and can be calculated by

W_ij = [x_i U  x̃_i V] [Λ_1  Λ_2; Λ_2  Λ_3] [U^T x_i^T; V^T x̃_i^T],   (4)

where the matrices Λ_1 to Λ_3 are diagonal. The elements of Λ_1 to Λ_3 come from Σ_1 and Σ_2 in Eq. 3. The details of the derivation can be found in [4].

In summary, calculating the feature distance kernel assumes that the subspaces of t_i and r_j lie on a Grassmann manifold. Eq. 2 constructs the geodesic flow between t_i and r_j, where the correlations between t_i and r_j are parameterized by the continuous variable y. The projection of the feature t_i on the Grassmann manifold is θ(y) t_i, and that of the feature r_j is θ(y) r_j. The kernel inner product of t_i and r_j essentially represents how close their subspace projections are.

The feature distance d in Eq. 1 can be calculated using the kernel distance [26] given the computed kernel W_ij. The kernel distance is able to calculate the distance between two sets of points which lie on geometric surfaces, i.e., the manifold that t_i and r_j lie on. The kernel distance between T_i and R_j is defined as in [26]

d(T_i, R_j) = t_i^T W_ij t_i + r_j^T W_ij r_j − 2 t_i^T W_ij r_j.   (5)

In Eq. 5, the first two terms are the self-similarities of the feature t_i of the training data T_i and the feature r_j of the test data R_j respectively. The third term, defined in Eq. 2, is the inner product of the two features and measures how closely they are correlated with each other.

D. Adaptive Algorithm-Parameter Selection Cost Function
The ultimate goal of our proposed approach is to automatically select an algorithm-parameter combination for a time window of images R_j in the test data under performance and cost constraints. We formulate the selection process on the test dataset R as a two-step optimization function.
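Before detailing the two steps, the similarity machinery of Sec. II-C (Eqs. 1-5) can be put together in code. The sketch below is a minimal numpy version, not the authors' implementation: `scenario_subspace` is our assumed way of obtaining the PCA bases x_i and z_j from per-frame features, the Λ expressions follow the closed form derived in [4], and the numerical safeguards (`eps`, zeroed columns at vanishing principal angles) are our own additions.

```python
import numpy as np

def scenario_subspace(F, b):
    """PCA basis of a scenario: F is an n x a matrix of per-frame features;
    returns an a x b orthonormal basis (x_i or z_j in the paper's notation)."""
    Fc = F - F.mean(axis=0)
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Vt[:b].T

def gfk_kernel(x, z, eps=1e-10):
    """Closed-form geodesic-flow kernel W between two subspace bases x, z
    (a x b, orthonormal columns), in the spirit of Eqs. 2-4 and [4]."""
    a, b = x.shape
    Q, _ = np.linalg.qr(x, mode='complete')
    x_perp = Q[:, b:]                              # x~, the orthogonal complement
    U, s, Rt = np.linalg.svd(x.T @ z)              # principal angles via SVD
    theta = np.arccos(np.clip(s, 0.0, 1.0))
    sin_t = np.sin(theta)
    # Second block recovered from x~^T z = -V sin(theta) R^T; columns with
    # theta ~ 0 receive ~0 weight below, so zeroing them is numerically safe.
    V = -(x_perp.T @ z) @ Rt.T @ np.diag(
        np.where(sin_t > eps, 1.0 / np.maximum(sin_t, eps), 0.0))
    tt = 2.0 * theta
    lam1 = 0.5 * np.where(theta > eps, 1.0 + np.sin(tt) / np.maximum(tt, eps), 2.0)
    lam2 = 0.5 * np.where(theta > eps, (np.cos(tt) - 1.0) / np.maximum(tt, eps), 0.0)
    lam3 = 0.5 * np.where(theta > eps, 1.0 - np.sin(tt) / np.maximum(tt, eps), 0.0)
    E = np.hstack([x @ U, x_perp @ V])             # [x U, x~ V] as in Eq. 4
    L = np.block([[np.diag(lam1), np.diag(lam2)],
                  [np.diag(lam2), np.diag(lam3)]])
    return E @ L @ E.T

def similarity(t, r, W):
    """Eqs. 1 and 5: kernel distance, then the exponential similarity score."""
    d = t @ W @ t + r @ W @ r - 2.0 * t @ W @ r
    return np.exp(-d)
```

A useful sanity check on this sketch: when x and z span the same subspace, all principal angles are zero, W reduces to the projector x x^T, the distance between identical features vanishes, and the similarity is 1.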
1) Step 1 of the Cost Function:
We obtain the training scenario that is closest to the test time window R_j. This training scenario T_i* is obtained by finding the maximum similarity between all the training scenarios and R_j:

T_i* = arg max_i S(T_i, R_j),   (6)

where S denotes the similarity function defined in Eq. 1.
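The first step reduces to an argmax over the similarity scores; together with the label lookup formalized by Eq. 7, the runtime selection for one window can be sketched as follows. The `similarity` callable stands in for the score of Sec. II-C and is an assumption of this sketch.

```python
import numpy as np

def select_label(r_j, train_feats, train_labels, similarity):
    """Two-step runtime selection for one test time window R_j.

    train_feats:  list of scenario features t_i (one per unique scenario T_i)
    train_labels: design-time labels Y_i (best combination B_h for each T_i)
    similarity:   callable S(t_i, r_j); hypothetical stand-in for Sec. II-C
    """
    scores = [similarity(t, r_j) for t in train_feats]   # S(T_i, R_j) for all i
    i_star = int(np.argmax(scores))                      # step 1, the argmax above
    return train_labels[i_star]                          # step 2: L_j = Y_{i*}
```

Applied to every window R_1, ..., R_N in turn, this yields the label sequence L.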
2) Step 2 of the Cost Function:
At design time, given constraints such as system performance and cost, we select a platform configuration C that meets the requirements. At runtime, we find the algorithm-parameter combination that performs the best for T_i* (obtained in the first step of the cost function) given the selected platform configuration C, and choose this algorithm-parameter combination for R_j.

The algorithm-parameter selection, which is essentially the output label L_j, is obtained by selecting the best performance P for the training scenario T_i* under the selected platform configuration C:

L_j = arg max_h P(B_h | T_i*, C),   (7)

where the subscript h denotes the h-th algorithm-parameter combination.

Note that at design time, we exhaustively apply every algorithm-parameter combination B_h to each training scenario T_i. The algorithm-parameter combination obtaining the minimum error (best performance), which was calculated at design time, is selected as the solution to Eq. 7 at runtime. The details of the computation parameter selections are shown in Sec. III-A.

III. EXPERIMENTS
A. Experimental Setup
In the experiments, we show results of our method on two applications: pedestrian detection and pedestrian tracking. There are 5 state-of-the-art detection algorithms available: HOG [10], PartBased [11], Cascades [27], ACF [28], and LDCF [29]. The public datasets that are used as the test datasets are: INRIA [10], ETHMS [2], and TUD Stadtmitte [30].

For pedestrian tracking, we adopt the baseline algorithm shared by [16], [17], [18], [19], [20], due to its computational efficiency. This algorithm is based on the detection association methodology, so the effects of different detection results on tracking can be demonstrated by using the same tracking module. Any other detection association algorithm can be adopted, as long as the computation requirements are satisfied. Among the detection datasets, we use all the datasets except for the INRIA dataset and the TUD-Brussels dataset for tracking. The reason is that the INRIA dataset is not composed of consecutive image frames and thus is not suitable for the problem of tracking, and that the TUD-Brussels dataset was recorded from a fast-moving platform, which makes every pedestrian exist for only 1-2 frames.

We extract four different features that are used for distance calculation in the experiments: HOG features [10], SIFT features [31], gradient features [32], and texture features [33]. We resize every image frame to a fixed resolution and use Principal Component Analysis (PCA) to reduce the dimension of the feature combinations to 1288, where the HOG features have a size of 800 due to their dominance in pedestrian detection.

In the training dataset T, all the scenarios are clustered into 15 unique scenarios. The number of unique scenarios is determined based on observation of the characteristics of the training data, e.g., the lighting conditions, the density of the scenarios, etc. In the test dataset R, all the videos/image frames are segmented into different time windows.
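A toy version of this feature pipeline can be sketched in numpy. The gradient-orientation histogram below is a simple stand-in for the actual HOG/SIFT/gradient/texture features, and the PCA step mirrors the dimensionality reduction described above; the feature design and dimensions here are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def grad_orientation_hist(img, bins=9):
    """Toy stand-in for the HOG-style features: a global histogram of
    gradient orientations weighted by gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi            # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-12)          # normalize for scale invariance

def pca_reduce(X, k):
    """Project the rows of X (one feature vector per frame) onto the top-k
    principal components, as the paper does to reach a fixed dimension."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T
```

In the real pipeline, the concatenated HOG/SIFT/gradient/texture vector replaces the toy histogram, and k is the 1288-dimensional target described above.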
The length of a time window is set to be 30 frames, except for the INRIA dataset; because the INRIA dataset is composed of non-consecutive image frames, we set the length of its time window to 10. The computational complexity of our algorithm is low, since the cost function only needs two max operations. Typically the computation time for each time window is less than 0.5s on an Intel i7 platform.

B. Results of Algorithm-Parameter Selection on Various Platforms
We investigate the effects of algorithm-parameter combination switches in every dataset. At design time, we estimate the classification threshold of each detection algorithm that leads to FPPI = 1. In the test dataset, we keep the detections with scores greater than this threshold for each algorithm. We evaluate the detections for each time window, where the number of missed detections is used as the evaluation metric. The reason for using missed detections as the evaluation metric is that the overall FPPI of every algorithm is fixed to be around 1, and the number of missed detections is assumed to be dominant in determining the performance for each time window of the test dataset. For the problem of tracking, we adopt four evaluation metrics: mostly tracked (MT), mostly lost (ML), number of ID switches (IDS) and false positives (FP).

We show the validity and importance of the algorithm selection process in Fig. 4, where the numbers of missed detections on the INRIA dataset are shown under two platforms. 4 algorithm-parameter combinations, whose FPS are shown in the legends, are used under each platform. In each subfigure, the x-axis represents time windows and the y-axis represents the number of missed detections. The selected algorithms are shown by pink curves. Overall, our approach selects the best algorithm-parameter combinations in most time windows for both datasets under both platforms. It is also noted that the selected algorithms are not the same under different platforms. For instance, ACF-240x320 performs well in most time windows with FPS=10 under the first platform, while ACF-480x640 obtains the best results in most time windows with FPS=30 under the second platform. Our algorithm selection process successfully captures such algorithm changes within a platform and between platforms. The results of (b) are better than those of (a) because the designed platform is more powerful than that of (a), and thus makes the FPS of each algorithm higher than in (a).
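The design-time FPPI = 1 thresholding can be sketched as a score sweep. The `scores` and `is_false_positive` inputs would come from running a detector on the training data with ground-truth matching; the helper below is an illustrative assumption, not the authors' evaluation code.

```python
import numpy as np

def threshold_at_fppi(scores, is_false_positive, num_images, target_fppi=1.0):
    """Pick the detection-score threshold at which a detector yields about
    `target_fppi` false positives per image (FPPI = 1 in this paper)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]                     # sweep high to low score
    fp = np.cumsum(np.asarray(is_false_positive, dtype=float)[order])
    fppi = fp / float(num_images)                        # non-decreasing sequence
    idx = int(np.searchsorted(fppi, target_fppi))        # first point at target
    idx = min(idx, len(scores) - 1)
    return scores[order][idx]
```

Keeping only the detections with scores at or above this threshold then gives roughly the target FPPI, which is how the per-algorithm thresholds above are fixed before counting missed detections.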
A detailed description of Fig. 4 is given in the figure caption.

Fig. 4: Algorithm-parameter selection results for pedestrian detection on the INRIA dataset with four algorithm-parameter combinations under two platforms. In each subfigure, the x-axis represents the time window and the y-axis represents the number of missed detections. Each subfigure shows the results under one platform. It is shown that our algorithm-parameter selection process selects the low-error results in most time windows under both platforms. For instance, in (a), the selected algorithm-parameter only fails to select the best performance at time windows 13, 14, 17, 24, 25, and 27 among 29 time windows in total. The selected algorithm-parameter switches between HOG-RES:480x640-FPS:8 and ACF-RES:240x320-FPS:10, each of which obtains the best results on some time windows. ACF-RES:240x320-FPS:10 obtains the best results in most time windows under platform 1. Given a stricter performance constraint, platform 2 is selected in (b). The algorithm ACF-RES:480x640 performs the best in most time windows, because the high performance requirement leads to a powerful platform selection, which is easily able to process high FPS and resolutions. The selected algorithm follows the correct trend and lies in ACF-RES:480x640 in all time windows; it fails to obtain the best performance in only four time windows.

In Fig. 5, we show both detection and tracking results on the TUD-Stadtmitte dataset, where the top two subfigures show the detection results and the bottom two show the tracking results. The detection algorithm-parameter selections are different under the two platforms, leading to different tracking results. In (a), the detection algorithm-parameters switch between ACF-RES:240x320 and HOG-RES:480x640 with the corresponding FPSs under platform 1. In (b), the selection always lies in ACF-RES:480x640 with FPS=30 under platform 2. This is reasonable because the performance requirement is strict in (b); such a performance requirement leads to a powerful platform selection that allows a high resolution and FPS for an algorithm. In (c) and (d), we show the tracking results with the four evaluation metrics: MT, ML, FP and IDS. A good tracker should have high MT and low ML, FP, and IDS. In (c), the selected algorithm-parameter obtains better results than any single algorithm-parameter combination under the first platform.
It is demonstrated that the trackingperformances of the selected algorithm follow the trend of thedetection performances. The results prove the effectiveness ofadaptively selecting algorithm-parameters. In (d), the resultsof the selected algorithm follows the trend of (a), where ACF-RES:480x640 obtains the best result among all the algorithm-parameter combinations. Similarly, we also show results onETHMS dataset under the same platforms in Fig. 6. Detaileddescriptions of Fig. 5 and 6 are illustrated in their captions.
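The runtime selection behind these results, namely picking the algorithm-parameter combination that performed best on the most similar training scenario, can be sketched as follows. This is a much-simplified illustration: a plain Euclidean distance between pooled feature vectors stands in for the paper's learned shared-manifold similarity, and all names are hypothetical.

```python
# Simplified sketch of the runtime step: find the training scenario whose
# feature vector is nearest to the test segment's, then reuse that scenario's
# best algorithm-parameter combination. Euclidean distance is a stand-in for
# the manifold-based similarity used in the paper.
import math

def nearest_scenario(test_feature, training_scenarios):
    """Return the name of the training scenario with the closest feature vector.

    test_feature      : list of floats (e.g., a pooled image descriptor)
    training_scenarios: dict mapping scenario name -> feature vector
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return min(training_scenarios,
               key=lambda s: dist(test_feature, training_scenarios[s]))

def select_algorithm(test_feature, training_scenarios, best_algorithm):
    """Pick the combination that performed best on the most similar
    training scenario (best_algorithm: scenario name -> combination)."""
    return best_algorithm[nearest_scenario(test_feature, training_scenarios)]
```

For example, with two hypothetical scenarios, a test segment whose features resemble the "crowded" scenario inherits that scenario's best combination.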
C. Results of Platform Selection
Different performance constraints may lead to different selections of platforms and algorithm-parameter combinations. In our experiments, we consider two parameters of each algorithm: FPS and image resolution. If the performance constraints (e.g., tracking accuracy) are moderate, a platform with low computation capability and low cost may be chosen, and the most suitable algorithm and its parameters can then be chosen accordingly. If the performance requirements are high, a platform with high computation capability may be needed, and correspondingly a different set of algorithms and parameters may be chosen. This is the essence of selecting the platform and algorithm (including parameters) based on design requirements, which include the performance requirement and other constraints such as energy consumption or cost.

In the experiments, we consider different performance requirements that lead to two different platform selections. Then, for each platform, we consider the set of algorithm-parameter combinations that are computationally feasible on the platform and select the one that provides the best performance for pedestrian detection. We have tested two PCs as candidate platforms. The algorithm-parameter combinations under each platform are shown in Table II, where the parameters are obtained by running the codes on the two platforms. In Fig. 4, we can see that ACF-RES:240x320-FPS:10 provides the best performance for platform 1, while ACF-RES:480x640-FPS:30 provides the best performance for platform 2. Note that we have also tested the PartBased, Cascades, and LDCF algorithms. However, because their low FPSs yield poor performance, we only show results of HOG and ACF, which obtain reasonable results when adaptively selecting the platform and algorithm-parameter combination under performance constraints.

IV. CONCLUSION
In this paper, we present a novel approach to adaptively select the best platform and algorithm-parameter combination. Our approach consists of an offline design step and an online running step. In the design step, we select the platform configuration based on the performance and cost constraints. Then we exhaustively apply every algorithm-parameter combination to each platform and obtain the corresponding performances. At run time, we calculate the feature similarity on the manifold that is shared between the training and test data. The more similar they are, the higher the possibility that they share the same algorithm-parameter combination. We obtain the "best" algorithm-parameter combination for the test data by selecting the most similar training data's selection. We show the efficacy of the proposed method on the applications of pedestrian detection and tracking. Our promising experimental results demonstrate that adaptively selecting the algorithm-parameter combination for each scenario obtains the best, or close to the best, performance under a given constraint.

Fig. 5: Algorithm-parameter selection results on the TUD-Stadtmitte dataset with four algorithm-parameter combinations under two platforms. In the top subfigures, the x-axis represents the time window and the y-axis represents the number of missed detections. The left subfigures show the results under the first platform and the right subfigures show the results under the second platform. The top subfigures show the detection results under different platforms given performance constraints. In (a), the selection fails to pick the best algorithm-parameter only at the second time window. Given a new performance constraint, the second platform is chosen in (b), where the selected algorithm-parameter always lies in ACF-RES:480x640, unlike under the first platform. (c) and (d) show the tracking results. In (c), the selected algorithm-parameter obtains better MT, ML, and FP than any single algorithm-parameter combination; though its IDS is slightly higher than that of two other algorithms, its overall performance is the best. In (d), the tracking performance of the selected algorithm is the same as that of ACF-RES:480x640, which obtains the best performance for all time windows.

TABLE II: Algorithm-parameter combinations under two platforms.

          Platform 1                            Platform 2
          FPS / Resolution   FPS / Resolution   FPS / Resolution   FPS / Resolution
HOG       15 / 240x320       8 / 480x640        30 / 240x320       15 / 480x640
ACF       10 / 240x320       5 / 480x640        20 / 240x320       10 / 480x640
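The offline design step over a profile table like Table II, choosing the cheapest platform that still has an algorithm-parameter combination meeting the performance constraint and taking that platform's best combination, can be sketched as below. The cost model, the error numbers, and all names are hypothetical illustrations.

```python
# Illustrative sketch of the design-time selection: profile every
# algorithm-parameter combination on each candidate platform, discard those
# missing the performance constraint, and keep the cheapest platform that
# still has a feasible combination. All inputs are hypothetical.

def choose_platform(profiles, max_missed, platform_cost):
    """profiles     : dict platform -> {combination: measured missed detections}
    max_missed   : performance constraint (upper bound on missed detections)
    platform_cost: dict platform -> cost
    Returns (platform, best combination) or None if infeasible."""
    feasible = []
    for platform, combos in profiles.items():
        ok = {c: m for c, m in combos.items() if m <= max_missed}
        if ok:
            best = min(ok, key=ok.get)  # lowest error on this platform
            feasible.append((platform_cost[platform], platform, best))
    if not feasible:
        return None
    _, platform, combo = min(feasible)  # cheapest feasible platform
    return platform, combo
```

Tightening the constraint naturally pushes the selection toward the more powerful (and costlier) platform, mirroring the platform-2 selections reported above.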
Fig. 6: Algorithm-parameter selection results on the ETHMS dataset with four algorithm-parameter combinations under two platforms. Note that we only show part of the time windows of the ETHMS dataset to clearly show how the results switch. In the top subfigures, the x-axis represents the time window and the y-axis represents the number of missed detections. The left subfigures show the results under the first platform and the right subfigures show the results under the second platform given a new performance constraint. The top subfigures show the detection results under the different platforms. The selected algorithm-parameter captures the lowest detection errors in most time windows under the two platforms given different performance constraints. Different from Fig. 4 and Fig. 5, where the selected algorithm-parameter mainly switches between HOG-RES:480x640 and ACF-RES:240x320 with different FPSs under the first platform, the selection on the ETHMS dataset also includes ACF-RES:480x640 with different FPSs under the first platform. Under the second platform (b), the selected algorithm always lies in ACF-RES:480x640, the same as in Fig. 4 and Fig. 5. In the tracking results of both (c) and (d), the selected algorithm-parameter obtains the best performance, which also supports the detection algorithm selection.