Online Multi-Object Tracking with Historical Appearance Matching and Scene Adaptive Detection Filtering
Young-chul Yoon, Abhijeet Boragule, Young-min Song, Kwangjin Yoon, Moongu Jeon
Gwangju Institute of Science and Technology
123 Cheomdangwagi-ro, Buk-gu, Gwangju, 61005, South Korea
{zerometal9268, abhijeet, sym, yoon28, mgjeon}@gist.ac.kr

Abstract
In this paper, we propose methods to handle temporal errors during multi-object tracking. Temporal errors occur when objects are occluded or when noisy detections appear near an object. In those situations tracking may fail, producing errors such as drift or ID switches. It is hard to overcome temporal errors using motion and shape information alone, so we propose a historical appearance matching method and a joint-input siamese network trained by a 2-step process. Together they prevent tracking failures even when objects are temporarily occluded or the last matching information is unreliable. We also provide a useful technique that removes noisy detections effectively according to the scene condition. Attaching our methods substantially improves tracking performance, especially identity consistency.
1. Introduction
The current paradigm of multi-object tracking is the tracking-by-detection approach. Most trackers assume that detections are already given and focus on labeling each detection with a specific ID. This labeling is done by data association. For an online tracker, the data association problem can be simplified to a bipartite matching problem, and the Hungarian algorithm has frequently been adopted to solve it. Before solving the data association problem, a cost matrix has to be defined: each element of the cost matrix measures the affinity (similarity) between a specific object and a detection (observation). Because data association simply finds 1-to-1 matches on the cost matrix, deriving accurate affinity scores is important for good performance. Motion is a basic affinity cue; it is the only information available in a simple tracking environment (e.g., tracking dots on a 2D field, where the dots are signals from objects such as ships or airplanes). The Kalman filter has frequently been adopted for motion modeling. It can absorb temporal errors by adaptively predicting and updating the positions of objects according to the tracking condition.
Figure 1: Example of temporal errors. The ACF detector creates noisy detections which include several objects simultaneously or only a small part of an object. Many temporal occlusions also occur because of the complex scene condition.

However, motion alone is insufficient for tracking in more complex situations. Scenes taken directly from an RGB camera contain many difficulties. As shown in Figure 1 (temporal occlusion), objects are occluded by other objects and by obstacles in the scene. To overcome this, we can exploit appearance information. Many works [2, 1, 7, 10, 8] have tried to derive an accurate appearance affinity. Several of them [2, 7] designed appearance models without a deep neural network (DNN); those trackers improved results, but not significantly. Along with the rapid development of deep learning, several works [1, 10, 8] applied DNNs to appearance affinity computation, most of them [1, 10] using a siamese network to calculate the affinity score. Although the siamese network has strong discriminating power, it only sees cropped patches, which contain limited information. If imperfect detectors [3, 4] are used, the detections themselves carry inaccurate information; such detections are ambiguous, as shown in Figure 1 (noisy detections), and may lead to inaccurate appearance affinities. We propose several methods to tackle the aforementioned problems (noisy detections and temporal occlusion). First, it is hard to match an object to an observation when the recent object appearance is ambiguous. To break this ambiguity, we save reliable historical appearances; with their support, we can obtain an accurate affinity score even in ambiguous situations.
Figure 2: Our tracking framework.

It is also necessary to reduce noisy detections as much as possible for better performance. Many trackers have used a constant detection threshold (e.g., 30) for all sequences. Instead of a constant threshold, we propose a method that decides the detection threshold according to the scene condition. To the best of our knowledge, this is the first work that filters out detections according to the scene condition. In summary, our main contributions are:

• The proposed historical appearance matching method solves the matching ambiguity problem;
• The proposed 2-step training method for the siamese network produces better tracking performance than training on a single dataset;
• We propose a simple method to adaptively decide the detection confidence threshold; it sets the threshold according to the scene condition and performs better than a constant threshold.

The experiments section shows that each of our methods improves the tracking performance.
2. Proposed methods
In this section, we describe our tracking framework and the proposed methods. Our framework, shown in Figure 2, is based on a simple online multi-object tracking pipeline: it first associates existing objects with observations, then updates the states of the objects using the associated observations and processes the birth and death of objects. Our main contributions are the design of the appearance cue and the pre-processing of the given detections, explained in the following subsections. A minimal sketch of the association step is given below.
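To make the pipeline concrete, here is a minimal sketch of one association round, assuming the affinity matrix has already been built as in Eq. (1) below; the helper name `associate` and its default threshold are illustrative, and SciPy's `linear_sum_assignment` plays the role of the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(affinity, tau_asc=0.1):
    """One association round: maximize total affinity (Hungarian
    algorithm), then drop matches below the association threshold."""
    rows, cols = linear_sum_assignment(-affinity)  # negate: solver minimizes cost
    matches = [(i, j) for i, j in zip(rows, cols) if affinity[i, j] > tau_asc]
    matched_objs = {i for i, _ in matches}
    matched_obs = {j for _, j in matches}
    # Unmatched objects are death candidates; unmatched observations
    # are birth candidates for new tracks.
    dead = [i for i in range(affinity.shape[0]) if i not in matched_objs]
    born = [j for j in range(affinity.shape[1]) if j not in matched_obs]
    return matches, dead, born
```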
2.1. Affinity models

Our affinity model consists of three cues: appearance, shape and motion. The affinity matrix is calculated by multiplying the scores from each cue:

$$\Lambda(i, j) = \Lambda_A(X_i, Z_j)\,\Lambda_S(X_i, Z_j)\,\Lambda_M(X_i, Z_j) \qquad (1)$$

where $A$, $S$ and $M$ indicate appearance, shape and motion. The shape and motion scores are calculated as:

$$\Lambda_S(X, Z) = \exp\Big(-\xi \Big\{ \frac{|h_{\hat X} - h_Z|}{h_{\hat X} + h_Z} + \frac{|w_{\hat X} - w_Z|}{w_{\hat X} + w_Z} \Big\}\Big),$$
$$\Lambda_M(X, Z) = \exp\big(-\eta\,(p_Z - p_{\hat X})^T \Sigma^{-1} (p_Z - p_{\hat X})\big) \qquad (2)$$

The appearance affinity score is calculated by our proposed method, explained in Sections 2.2 and 2.3. Unlike other tracking methods, we predict the state of each object $X$ not only for motion but also for shape. Although we model the motion and appearance affinities to be robust to error, tracking may still fail because of noisy detections with different sizes, so we apply the Kalman filter to the shape state $(w, h)$ in the same way it predicts the motion state $p$. $\hat X$ denotes the predicted state of object $X$. For shape, we calculate the relative differences of height and width between the object and the observation. The motion affinity is the Mahalanobis distance between the predicted and observed positions under a predefined covariance matrix $\Sigma$ that generally works well in any scene condition. A sketch of these two cues follows.
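A minimal sketch of Eq. (1)-(2), assuming Kalman-predicted states are available as dictionaries; the gains `XI`, `ETA` and the covariance behind `SIGMA_INV` are placeholders, not the paper's tuned constants.

```python
import numpy as np

# Placeholder constants; the paper does not report its tuned values here.
XI, ETA = 1.0, 0.5
SIGMA_INV = np.linalg.inv(np.diag([25.0, 25.0]))  # assumed diagonal position covariance

def shape_affinity(x_hat, z):
    """Eq. (2), first line: relative height/width differences of the
    Kalman-predicted shape state against the observation."""
    dh = abs(x_hat["h"] - z["h"]) / (x_hat["h"] + z["h"])
    dw = abs(x_hat["w"] - z["w"]) / (x_hat["w"] + z["w"])
    return np.exp(-XI * (dh + dw))

def motion_affinity(x_hat, z):
    """Eq. (2), second line: Mahalanobis distance between predicted and
    observed positions under the predefined covariance."""
    d = np.asarray(z["p"]) - np.asarray(x_hat["p"])
    return np.exp(-ETA * d @ SIGMA_INV @ d)

def total_affinity(x_hat, z, appearance_score):
    """Eq. (1): product of the appearance, shape and motion cues."""
    return appearance_score * shape_affinity(x_hat, z) * motion_affinity(x_hat, z)
```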
2.2. Joint-input siamese network

There are various siamese network structures one could use in multi-object tracking. In the experiments of prior work [10], the joint-input siamese network outperforms the other types. It is also important to keep the output in the range 0-1 so that it balances with the other affinities (motion and shape); the softmax layer of the joint-input siamese network naturally constrains the output to this range. Our network structure is described in Figure 3:

layer            | filter size | input      | output
conv & bn & relu | 9x9x12      | 128x64x6   | 120x56x12
max pool         | 2x2         | 120x56x12  | 60x28x12
conv & bn & relu | 5x5x16      | 60x28x12   | 56x24x16
max pool         | 2x2         | 56x24x16   | 28x12x16
conv & bn & relu | 5x5x24      | 28x12x16   | 24x8x24
max pool         | 2x2         | 24x8x24    | 12x4x24
flatten          | -           | 12x4x24    | 1x1152
dense            | -           | 1x1152     | 1x150
dense            | -           | 1x150      | 1x2
softmax          | -           | 1x2        | 1x2

Figure 3: Our joint-input siamese network structure. bn indicates a batch normalization layer. The two final outputs are the probabilities that the two inputs are identical or different.

Different from prior works, we use batch normalization [6] for better accuracy. It prevents overfitting and improves convergence, which is useful when training the network with a small amount of training data. Thanks to the convolutional neural network, which extracts rich appearance features, our tracker, even without historical matching, outperforms color-histogram-based tracking (Figure 7(a)). We trained our network in two steps: pre-training and domain adaptation. The details of network training are explained in Figure 6 and Section 3.1. A sketch of the architecture, read from the table above, follows.
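Read directly from the layer table, the network could be sketched as follows in PyTorch (a framework assumption; the paper does not name its implementation). The 6-channel input reflects the joint input: two 128x64 RGB patches concatenated along the channel axis.

```python
import torch
import torch.nn as nn

class JointInputSiamese(nn.Module):
    """Joint-input siamese network following the layer table above."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 12, 9), nn.BatchNorm2d(12), nn.ReLU(),   # 128x64x6 -> 120x56x12
            nn.MaxPool2d(2),                                       # -> 60x28x12
            nn.Conv2d(12, 16, 5), nn.BatchNorm2d(16), nn.ReLU(),  # -> 56x24x16
            nn.MaxPool2d(2),                                       # -> 28x12x16
            nn.Conv2d(16, 24, 5), nn.BatchNorm2d(24), nn.ReLU(),  # -> 24x8x24
            nn.MaxPool2d(2),                                       # -> 12x4x24
        )
        self.head = nn.Sequential(
            nn.Flatten(),                  # -> 1152
            nn.Linear(12 * 4 * 24, 150),
            nn.Linear(150, 2),
            nn.Softmax(dim=1),             # [p_identical, p_different]
        )

    def forward(self, patch_a, patch_b):
        # Joint input: concatenate the two patches channel-wise (3+3=6).
        x = torch.cat([patch_a, patch_b], dim=1)
        return self.head(self.features(x))
```

For example, `JointInputSiamese()(torch.rand(1, 3, 128, 64), torch.rand(1, 3, 128, 64))` returns the two class probabilities for a single pair.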
2.3. Historical appearance matching

Because of occlusion and inaccurate detections, the object state may be unreliable. As mentioned in Section 2.1, the shape and motion cues can handle temporal errors using the Kalman filter. Unlike those cues, however, the appearance feature is large and is hard to model against temporal errors. Before explaining our method, we revisit adaptive color histogram updating, which updates an object's color histogram according to the current matching affinity score:
$$Hist^X_t = \alpha\,Hist^{\hat X}_t + (1 - \alpha)\,Hist^X_{t-1} \qquad (3)$$

$Hist^X_t$ is the saved color histogram of object $X$ at time $t$, and $Hist^{\hat X}_t$ is the color histogram of the observation matched to $X$ at time $t$. The update ratio is easily controlled by $\alpha$: $\alpha$ is large if the current matching affinity score is high, and vice versa. However, the color histogram is still sensitive to changes in lighting, background and object pose. The joint-input siamese network produces a much more reliable affinity score, but its features cannot be updated adaptively like the color histogram because the input images are concatenated and jointly inferred through the network. So, we propose historical appearance matching (HAM):

$$\Lambda_A(X_i, Z_j) = \begin{cases} siam(X^r_i, Z_j), & \text{baseline} \\ ham(X_i, Z_j), & \text{proposed} \end{cases} \qquad (4)$$
The baseline method is defined for comparison. $X_i$ and $Z_j$ are the $i$-th object and the $j$-th observation, respectively. The baseline ($siam$) takes the score from the siamese network with two inputs, $X^r_i$ ($r$ indicates the most recently matched appearance) and $Z_j$. $ham$ is the proposed method, designed to break confusion and ambiguity as illustrated in Figure 4:

$$ham(X_i, Z_j) = c^r_i \cdot siam(X^r_i, Z_j) + (1 - c^r_i) \sum_{n=1}^{N^h_i} w^n_i \cdot siam(X^n_i, Z_j) \qquad (5)$$

where $c^r_i$ is the recent matching confidence (affinity) of the object. The relative weights of the two terms are controlled by $c^r_i$: if the recent matching is unreliable ($c^r_i \downarrow$), the second term (reliable historical appearances) takes a bigger portion of the appearance affinity, and vice versa. $N^h_i$ is the number of saved historical appearances of object $X_i$, and $w^n_i$ is the relative weight of $X^n_i$ (the $n$-th historical appearance of object $X_i$), defined as:

$$w^n_i = \frac{c^n_i}{\sum_{k=1}^{N^h_i} c^k_i} \qquad (6)$$

Each historical appearance weight is obtained by dividing its matching confidence ($c^n_i$) by the sum of all matching confidences; $c^k_i$ is the affinity score ($\Lambda$) of $X_i$ at the time the $k$-th historical appearance was matched to $X_i$. The weights $w^n_i$ therefore sum to 1, which assures that $ham$ stays in the range 0-1. As described in Figure 2, the historical appearance of each object is updated when the matching affinity exceeds $\tau_{conf}$. We keep at most 10 historical appearances per object, and the oldest one must be within 15 frames of the current frame.

Figure 4: Example of breaking ambiguity using historical appearance matching. (Black arrow): it is hard to choose between observations (1) and (2) because the recent object appearance is ambiguous. (Red arrow): with the support of historical appearances, the object can be matched to the correct observation (2).

Processing the siamese network can be a time bottleneck. To reduce processing time, we apply a simple but efficient gating technique from our previous work [17]. Before calculating the appearance affinity, we create shape and motion affinity matrices using Eq. (2):

$$\Lambda_{SM}(i, j) = \Lambda_S(X_i, Z_j)\,\Lambda_M(X_i, Z_j), \quad \forall i \in \{1, \cdots, N_X\}, \ \forall j \in \{1, \cdots, N_Z\} \qquad (7)$$

where $N_X$ and $N_Z$ are the total numbers of objects ($X$) and observations ($Z$). Then we calculate the final affinity matrix as:

$$\Lambda(i, j) = \begin{cases} \Lambda_{SM}(i, j)\,\Lambda_A(X_i, Z_j), & \text{if } \Lambda_{SM}(i, j) > \tau_{asc} \\ 0, & \text{else} \end{cases} \qquad (8)$$

The final affinity matrix multiplies in the appearance affinity only where $\Lambda_{SM}$ is larger than the pre-defined association threshold $\tau_{asc}$. Although objects and observations are associated by the Hungarian algorithm, pairs with affinity smaller than $\tau_{asc}$ are ignored anyway, so we only compute the appearance affinity for pairs whose $\Lambda_{SM}$ exceeds $\tau_{asc}$. This saves a lot of processing time. A sketch of HAM and this gating follows.
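The following sketch ties Eq. (5)-(8) together, assuming each tracked object carries its most recent patch and confidence (`recent`, `c_recent`) plus history buffers (`hist_patches`, `hist_conf`), and that `siam(a, b)` wraps the network above to return the "identical" probability; these names are illustrative, not the paper's.

```python
import numpy as np

def ham(obj, z_patch, siam):
    """Eq. (5)-(6): blend the recent appearance score with
    confidence-weighted scores against the saved history."""
    recent = siam(obj.recent, z_patch)
    if not obj.hist_patches:          # no history yet: fall back to baseline
        return recent
    conf = np.asarray(obj.hist_conf)
    w = conf / conf.sum()             # Eq. (6): weights sum to 1
    hist = sum(w_n * siam(p_n, z_patch)
               for w_n, p_n in zip(w, obj.hist_patches))
    return obj.c_recent * recent + (1.0 - obj.c_recent) * hist

def final_affinity(objects, observations, lam_sm, siam, tau_asc=0.1):
    """Eq. (7)-(8): evaluate the siamese network only where the
    shape-motion affinity already exceeds the association threshold."""
    lam = np.zeros_like(lam_sm)
    for i, obj in enumerate(objects):
        for j, z_patch in enumerate(observations):
            if lam_sm[i, j] > tau_asc:          # gating saves siamese calls
                lam[i, j] = lam_sm[i, j] * ham(obj, z_patch, siam)
    return lam
```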
2.4. Scene adaptive detection filtering

In public benchmarks [11, 13], detections extracted by ACF [3] or DPM [4] are given by default. Neither is a state-of-the-art detector, and both make many false-positive and false-negative errors, so it is necessary to filter out noisy detections for better performance. It is common to do so using the detection confidences produced by the detector, and many previous works simply discard detections whose confidence is lower than a pre-defined constant threshold $\tau_{const}$. However, the distribution of detection confidences varies with the tracking environment. Figure 5 shows an example: the average detection confidence is high in PETS09-S2L1, which is taken by a static camera and in which object sizes are constant, whereas it is low in ETH-SUNNYDAY, which is taken from a moving camera under strong illumination and in which object sizes vary. If the detection threshold ($\tau_{const}$) is fixed to work well on PETS09-S2L1, many true-positive detections are filtered out in ETH-SUNNYDAY (see Figure 5(b)). The confidence distribution even varies within the same scene as time flows. So, we propose a simple method to adaptively decide the threshold depending on the scene:

$$\tau_t = (1 - \rho^t)\,\tau_{sa} + \rho^t\,\tau_{const} \qquad (9)$$

where $\tau_t$ is the detection threshold at frame $t$; detections with confidence lower than $\tau_t$ are eliminated before tracking in frame $t$. The first term on the right-hand side is the scene adaptive threshold ($\tau_{sa}$), which considers both inter-scene and intra-scene differences (described in Figure 5(a)). $\tau_{sa}$ is defined as:

$$\tau_{sa} = \arg\min_{\tau} \big\| \big(\beta P(D_t \le \tau) + (1 - \beta) P(D^{all}_t \le \tau)\big) - p_d \big\| \qquad (10)$$

Figure 5: (a) The upper row shows the different scene conditions of ETH-Sunnyday and PETS09-S2L1; the lower row shows the varying scene condition between different frames of ETH-Pedcross2. (b) Comparison of the average detection confidence between the two scenes (ETH-Sunnyday, PETS09-S2L1).

Two cumulative distribution functions $P(D \le \tau)$ of the Gaussian variable $D$ are combined through $\beta$. The Gaussian variable $D = N(\mu, \sigma)$ is derived from the average ($\mu$) and standard deviation ($\sigma$) of the detection confidences. $D_t$ is computed from the detection confidences of the 10 most recent frames. Because $D_t$ is usually computed from a small number of samples, $D^{all}_t$, computed from all detection confidences collected up to the current frame, is needed for smoothness; $\beta$ controls the degree of smoothing. $p_d$ is an important constant that decides $\tau_{sa}$; we found that 0.4 generally works best (Figure 8(b)). The second term of Eq. (9) is the pre-defined threshold ($\tau_{const}$). Because our tracker operates in a fully online way, this pre-defined threshold is needed for the first few frames, when the number of detection samples is too small to estimate the distribution for $\tau_{sa}$. Its proportion ($\rho^t$) shrinks as the frame index $t$ grows; we heuristically set $\rho$ to 0.95. A sketch of this thresholding follows.
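A minimal sketch of Eq. (9)-(10), assuming detection confidences are tracked per frame; the grid search over candidate thresholds and the `beta` default are illustrative choices, while `p_d = 0.4` and `rho = 0.95` follow the text.

```python
import numpy as np
from scipy.stats import norm

def scene_adaptive_threshold(recent_scores, all_scores, beta=0.5, p_d=0.4):
    """Eq. (10): pick the tau whose blended Gaussian CDF value is closest
    to p_d. recent_scores covers the last ~10 frames; all_scores covers
    every confidence collected so far."""
    cdf_recent = norm(np.mean(recent_scores), np.std(recent_scores)).cdf
    cdf_all = norm(np.mean(all_scores), np.std(all_scores)).cdf
    taus = np.linspace(np.min(all_scores), np.max(all_scores), 500)
    blended = beta * cdf_recent(taus) + (1.0 - beta) * cdf_all(taus)
    return taus[np.argmin(np.abs(blended - p_d))]

def detection_threshold(t, recent_scores, all_scores, tau_const, rho=0.95):
    """Eq. (9): blend tau_sa with the constant threshold; the constant
    term dominates for the first few frames and decays as t grows."""
    tau_sa = scene_adaptive_threshold(recent_scores, all_scores)
    return (1.0 - rho**t) * tau_sa + rho**t * tau_const
```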
3. Experiments
In this section, we explain the implementation details and show how tracking performance improves as our methods are attached one by one. We also compare the performance of our tracker with other published trackers.
3.1. Network training

A network may be confused if it learns directly from tracking sequences. As shown in Figure 6 (2DMOT2015), many occluded or noisy objects are marked as ground-truth objects, and a network trained on those samples may lose performance. For this reason, it is better to train the network on the general concept of appearance comparison before training it with examples from real tracking sequences. We separate the training process into two steps: pre-training on the CUHK02 dataset and domain adaptation on tracking sequences. CUHK02 [12] was developed for the person re-identification task and contains 1816 identities, each with 4 different samples. All images are clean and unoccluded, so it is a proper dataset for learning the general concept of appearance comparison. First, we train the network for 300 epochs, at which point training converges. In each epoch, 3000 pairs with a 1:1 positive:negative ratio are trained with mini-batch size 100. After pre-training, we decrease the learning rate and re-train the network on the 2DMOT2015 training sequences in a similar way. Cross-entropy loss and the stochastic gradient descent (SGD) optimizer are used for back-propagation. A sketch of this schedule is given below.
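A minimal sketch of the 2-step schedule under PyTorch-style assumptions; `cuhk02_pairs` and `mot15_pairs` are hypothetical loaders yielding (patch_a, patch_b, label) with a 1:1 positive:negative ratio, and the learning-rate and step-2 epoch values are placeholders since the paper's exact numbers are not reproduced here.

```python
import torch
import torch.nn as nn

def run(model, loader, lr, epochs):
    """One training stage: SGD with cross-entropy, as in the paper.
    NLLLoss on the log of the softmax output equals cross-entropy."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    nll = nn.NLLLoss()
    for _ in range(epochs):
        for patch_a, patch_b, label in loader:
            opt.zero_grad()
            probs = model(patch_a, patch_b)
            loss = nll(torch.log(probs.clamp_min(1e-8)), label)
            loss.backward()
            opt.step()

model = JointInputSiamese()                       # network from Section 2.2
run(model, cuhk02_pairs, lr=1e-2, epochs=300)     # step 1: pre-train on CUHK02
run(model, mot15_pairs, lr=1e-3, epochs=50)       # step 2: domain adaptation, lower lr
```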
Figure 6: Our 2-step training process. The CUHK02 dataset is clean, so it is good for learning the general concept of similarity; in contrast, 2DMOT2015 contains occluded, noisy samples and is good for learning real-world tracking situations.

Figure 7: MOTA improvement from applying our methods, tested on two different training datasets. In each graph, the left-most bar shows the result using the color-histogram-based appearance model of Eq. (3). (a) 2DMOT2015 training set: the middle bar shows that our joint-input siamese network outperforms the color-histogram appearance model even without HAM; the right-most bar shows the further improvement from historical appearance matching. (b) MOT16 training set: the second and third bars from the left show the results of networks trained on a single dataset (CUHK02 or the 2015 training set); the right-most bar verifies the superior accuracy of 2-step training.

Performance improvement: To quantify the contribution of our methods to tracking performance, we provide several experimental results (Figures 7 and 8). In Figure 7, we compare results while sequentially attaching our appearance-affinity methods, testing on two training datasets (2DMOT2015, MOT16). On 2DMOT2015 we test the validity of historical appearance matching; for a fair experiment, we use the network pre-trained on CUHK02 without additional training on the 2015 training set. As Figure 7(a) shows, historical appearance matching clearly improves overall tracking performance. On the MOT16 training set we test the validity of the 2-step training method; as Figure 7(b) shows, it yields the highest MOTA score, outperforming networks trained on either single dataset. Figure 8 presents experimental results that demonstrate the necessity of scene adaptive detection filtering: we compare our MOTA score with those of other filtering methods ($\tau_{const}$, $\tau_t(\beta)$).

Figure 8: (a) MOTA scores on the 2DMOT2015 training set for two baseline kinds of detection threshold. $\tau_{const}$: pre-defined thresholds (20, 30, 40 from left to right). $\beta = 0$: the special case of Eq. (10) that does not consider the intra-scene difference ($p_d$: 0.3, 0.4, 0.5). (b) MOTA comparison between the proposed method and the baseline methods ($\tau_{const}$, $\beta = 0$), taking the best score of each method from (a).
Comparison with other trackers: We compare our methods with several trackers on the 2DMOT2015 and MOT17 benchmarks. Because the ID consistency metrics (IDSw, IDF1) are critically affected by false positives (FP) and false negatives (FN), we carefully selected trackers with FP and FN counts similar to ours for comparison. The overall results are in Tables 1 and 2. Table 1 reports results on the 2DMOT2015 benchmark, where we evaluate our tracker in two configurations, SADF and VisBest. In SADF, we apply the proposed scene adaptive detection filtering ($p_d = 0.4$). Although SADF does not reach state-of-the-art MOTA, it outperforms methods that use no appearance reasoning [14, 15], which demonstrates the effectiveness of deep appearance features. SADF also shows better ID consistency (IDF1, IDSw) than the fine-tuned baseline method [10]. IDF1 was proposed to compensate for the limitations of the IDSw metric; high performance on both IDF1 and IDSw shows that our tracker manages object IDs consistently. In VisBest, we heuristically chose, for each sequence, the detection filtering threshold that visually seemed best. VisBest removes many FN and produces near state-of-the-art MOTA, the highest IDF1, and the second-best IDSw after SADF. It is remarkable that our tracker shows far better ID consistency than [10, 1], which use a similar siamese network; we attribute this to our historical appearance matching and 2-step training method. In Table 2, we compare against a few state-of-the-art trackers on the MOT17 benchmark. Ours shows competitive performance on all metrics and even outperforms the state-of-the-art LSTM-based tracker [9] in MOTA and IDSw.

Method                    | MOTA↑ | IDF1↑ | IDSw↓ | FP↓  | FN↓
SiameseCNN [10] (offline) | 29.0  | 34.3  | 639   | –    | –
PHD_GSDL [5]              | –     | –     | –     | –    | –
SCEA [16]                 | 29.1  | 37.2  | 604   | 6060 | 36912
oICF [7]                  | 27.1  | –     | 460   | 7485 | 35910
Ours (SADF)               | 25.2  | 37.8  | –     | –    | –
Ours (VisBest)            | –     | –     | –     | –    | –

Table 1: Comparison on the 2DMOT2015 benchmark.

Method              | MOTA↑ | IDF1↑ | IDSw↓ | FP↓   | FN↓
MHT_DAM [8] (offline) | –   | –     | –     | –     | –
MHT_bLSTM [9]       | 47.5  | –     | –     | –     | –
EAMTT [14]          | 42.6  | 41.8  | 4488  | 30711 | 288474
Ours (SADF)         | –     | –     | –     | –     | –

Table 2: Comparison on the MOT17 benchmark. We applied $\tau_{sa}$ ($p_d = 0.4$) for the DPM detections and did not apply a threshold for FRCNN and SDP.
4. Conclusion
We proposed several methods to overcome temporal errors caused by occlusion and noisy detections. First, we designed a joint-input siamese network for appearance matching and trained it with a 2-step training method. We then applied historical appearance matching to break matching ambiguity. Finally, we derived an adaptive detection threshold that generally works well across sequences. As confirmed in the experiments, our tracker shows improved performance, especially on ID consistency metrics. A limitation of our work is that the network only takes cropped patches as input and therefore lacks contextual information; in future work, we will exploit contextual information instead of directly cropping patches from the image.
5. Acknowledgement
This work was financially supported by the ICT R&D program of MSIP/IITP [2014-0-00077, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis] and Lotte Data Communication Company.
References

[1] S.-H. Bae and K.-J. Yoon. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3), 2018.
[2] A. Boragule and M. Jeon. Joint cost minimization for multi-object tracking. In AVSS, 2017.
[3] P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, Aug. 2014.
[4] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 2010.
[5] Z. Fu, P. Feng, F. Angelini, J. A. Chambers, and S. M. Naqvi. Particle PHD filter based multiple human tracking using online group-structured dictionary learning. IEEE Access, 6, 2018.
[6] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[7] H. Kieritz, S. Becker, W. Hübner, and M. Arens. Online multi-person tracking using integral channel features. In AVSS, 2016.
[8] C. Kim, F. Li, A. Ciptadi, and J. Rehg. Multiple hypothesis tracking revisited. In ICCV, 2015.
[9] C. Kim, F. Li, and J. Rehg. Multi-object tracking with neural gating using bilinear LSTM. In ECCV, 2018.
[10] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler. Learning by tracking: Siamese CNN for robust target association. In DeepVision Workshop in conjunction with CVPR, 2016.
[11] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv:1504.01942, 2015.
[12] W. Li and X. Wang. Locally aligned feature transforms across views. In CVPR, 2013.
[13] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv:1603.00831, 2016.
[14] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro. Multi-target tracking with strong and weak detections. In BMTT Workshop in conjunction with ECCV, 2016.
[15] S. Wang and C. C. Fowlkes. Learning optimal parameters for multi-target tracking with contextual interactions. International Journal of Computer Vision, 122(3), 2017.
[16] J. Yoon, C. Lee, M. Yang, and K. Yoon. Online multi-object tracking via structural constraint event aggregation. In CVPR, 2016.
[17] Y. Yoon, Y. Song, K. Yoon, and M. Jeon. Online multi-object tracking using selective deep appearance matching. In