Review of person re-identification techniques
Mohammad Ali Saghafi, Aini Hussain, Halimah Badioze Zaman, Mohamad Hanif Md. Saad
Faculty of Engineering and Built Environment, Universiti Kebangsaan Malaysia (UKM), Bangi, Malaysia
Institute of Visual Informatics, Universiti Kebangsaan Malaysia (UKM), Bangi, Malaysia
E-mail: [email protected]
Published in IET Computer Vision. Received on 21st February 2013. Revised on 14th November 2013. Accepted on 18th December 2013. doi: 10.1049/iet-cvi.2013.0180
ISSN 1751-9632
Abstract:
Person re-identification across different surveillance cameras with disjoint fields of view has become one of the most interesting and challenging subjects in the area of intelligent video surveillance. Although several methods have been developed and proposed, certain limitations and unresolved issues remain. In all of the existing re-identification approaches, feature vectors are extracted from segmented still images or video frames. Different similarity or dissimilarity measures have been applied to these vectors. Some methods have used simple constant metrics, whereas others have utilised models to obtain optimised metrics. Some have created models based on local colour or texture information, and others have built models based on the gait of people. In general, the main objective of all these approaches is to achieve a higher accuracy rate and lower computational cost. This study summarises several developments in the recent literature and discusses the various available methods used in person re-identification. Specifically, their advantages and disadvantages are mentioned and compared.

1 Introduction
One of the most important aspects of intelligent surveillance systems, which has been considered in the literature, is person re-identification, especially in cases in which more than one camera is used [1–…]. Re-identification is a pipelined process consisting of a series of image-processing techniques that finally indicate the same person who has appeared in different cameras. In identification, the entire process is performed under the same illumination, viewpoint and background conditions, but these conditions are uncontrolled in re-identification. Furthermore, in identification a large number of samples are available, whereas in re-identification one cannot expect to have seen the unknown person earlier. The combination of these uncontrolled conditions makes re-identification more difficult and, at the same time, more useful than identification in most cases. However, identification still has special applications in vision industries. The potential to make surveillance systems more operator-independent than before, and the need to address related issues that have not yet been solved, make re-identification an interesting subject of research.

1.1 What is re-identification?
According to Paul McFedries [4], re-identification is the process of matching anonymous census data with the individuals who provided the data. Thus, the term 'person re-identification' can be defined as the process of matching individuals with a dataset in which the samples contain different light, pose and background conditions from the query sample. The matching process can be considered as finding a person of interest among pre-recorded images, sequences of photos [5–7] or video frames [1, 2, 8–10] that track the individual in a network of cameras in real time. Obviously, the latter is more challenging, has open unsolved issues and is being actively pursued by researchers worldwide.

1.2 Why is re-identification significant?
Surveillance in public places is widely used to monitor various locations and the behaviour of people in those areas. Since events such as terrorist attacks in different public places have occurred more frequently in recent years, a growing need for video network systems to guarantee the safety of people has emerged. In addition, in public transport (airports, train stations or even inside trains and airplanes), intelligent surveillance has proven to be a useful tool for detecting and preventing potentially violent situations. Re-identification can also play a part in processes needed for activity analysis, event recognition and scene analysis. In an intelligent video surveillance system, a sequence of real-time video frames is grabbed from its source, normally closed-circuit television (CCTV), and processed to extract the relevant information. Developing techniques that can process these frames to extract the desired data in an automatic and operator-independent manner is crucial for state-of-the-art applications of surveillance systems.

Today, the growth in the computational capabilities of intelligent systems, along with vision techniques, has provided new opportunities for the development of new approaches in video surveillance systems [1]. This includes automatic processing of video frames for surveillance purposes, such as segmentation, object detection, object recognition, tracking and classification. One of the most important aspects in this area is person re-identification. As long as a person stays within a single camera's view, his position, as well as the lighting condition and background, is known to the system. However, problems arise in applications in which a network of cameras must be used because the person moves out of one camera's view and enters another. Although tracking a person within a single camera stream creates issues related to occlusion and is typically based on continuous user observations, multi-camera tracking raises the concern of uncovered areas where the user is not observed by any camera. How, then, does the system know that the person seen in one camera is the same person seen earlier in another camera? This is known as the re-identification problem. It centres on the task of identifying people separated in time and location. The lack of spatial continuity in the information received from different camera observations makes person re-identification a complex problem.

The person re-identification problem has three aspects. First, there is a need to determine which parts should be segmented and compared (i.e. to find the correspondences). Second, there is a need to generate invariant signatures for comparing the corresponding parts [11]. Third, an appropriate metric must be applied to compare the signatures. In most studies, the method is designed under the assumption that the appearance of the person remains unchanged [1, 12, 13], which seems sensible. Based on this assumption, local descriptors such as colour and texture can be considered to exploit robust signatures from images. A collection of solutions has been applied towards this objective, including gait [10, 14, 15] and colour [11, 16–…]. Re-identification must deal with several challenges, such as variations in illumination conditions, poses (Fig. 1) and occlusions across time and cameras.
In addition, different people may dress similarly. In gait-based methods [10], although there is no colour-constancy problem, the gait of a person appears different from different viewing angles and poses. Thus, the recognition rate diminishes when the individuals to be identified are viewed from different angles. Partial occlusions created by other people or objects also affect gait-based methods.

There are two styles of survey in the state-of-the-art re-identification literature: the method-based survey [21] and the phase-based survey [22–…]. … re-identification, and discuss the most reliable methods that have been employed in these areas. As such, the review afforded in this current work is more extensive than the one reported earlier in [23].

The paper is organised as follows. In Section 2, the various issues regarding re-identification are explained. Section 3 describes the methods that have been used in re-identification; the most popular databases and evaluation metrics used by different methods are also reported in this section. In Section 4, we discuss the methods stated in the literature and enumerate their pros and cons. The conclusion summarises the contents of this paper.

2 Re-identification issues
Problems related to re-identification make it more difficult than the identification task. Although some research has been done in this area, several problems have yet to be solved. These issues can generally be classified into two categories: (i) inter-camera and (ii) intra-camera. The problems may differ between scenarios; for example, the considerations for re-identification in public places such as train stations or crowded shopping malls differ from those in a house. However, all applications share common problems. To track the same person across different cameras, the system must be robust against illumination changes and outdoor clutter. Different camera viewpoints can also cause problems for methods based on the gait or shape of the moving person. Occlusion in public places is another issue that must be addressed. In methods based on the appearance of people, the clothing of the subject must not change from one camera to another; otherwise, the system fails. People enter a camera's field of view with different poses; thus, for approaches that extract a model based on the movement of the person, changing pose creates difficulties. To prevent failure, the designed methods must have the flexibility and capability to deal with these problems.

2.1 Inter-camera issues
Several issues (i.e. inter-camera issues) can cause problems in tracking people in a network of cameras with disjoint fields of view. Given that clothing is mostly characterised by its colour and texture, having a constant appearance of colour among all the cameras is important. The different illumination conditions at different camera sites are a problem to consider. Owing to bandwidth limits, images grabbed in a network of cameras are compressed, which adds unwanted noise to these images [6]. Even cameras from the same manufacturer have different characteristics and therefore differences in illumination. Another main problem is the different poses of the humans of interest at different camera angles. This problem decreases the detection rate, especially in gait-based methods. Several researchers have proposed more robust methods to address this problem [12, 16, 25]. Tracking individuals between disjoint points of view is another problem in re-identification. Most methods rely on extracting the colour and texture features of an individual's clothing; however, methods that are invariant even to a rapid change of clothing would be better.

Fig. 1 Differences in poses and lighting conditions in four different cameras [26]

2.2 Intra-camera issues
Some of the problems to be addressed are related to varying light conditions at different times of the day (Fig. 2). In addition, most surveillance cameras are low-resolution cameras; hence, detection techniques that depend on the quality of the frames (e.g. face recognition methods [27]) can rarely be used; they have mostly been implemented and evaluated on local datasets rather than on standard well-known datasets [25]. Occlusion (Fig. 3) in camera frames is another problem, creating difficulty in image segmentation (one of the steps in re-identification).

Fig. 2 Lighting conditions at different times of day [26]

Fig. 3 Sample of occlusion in scene [26]

As mentioned, the re-identification task is a pipelined process consisting of different stages, such as image segmentation, feature extraction and classification. Each of these stages represents a vast area of research in image processing and computer vision, with its own specific considerations and issues. We do not discuss those concerns in this paper; we consider only the concatenation of these stages, which leads to the task of re-identification. Table 1 presents the issues that must be overcome in re-identification. Some of these problems have been solved in previous work, whereas others remain unsolved.

Table 1 Re-identification issues

Types          Issues
inter-camera   illumination changes for different scenes; disjoint fields of view; different entrance poses in different cameras; similarity of clothing; rapid changes in person's clothing; and blurred images
intra-camera   background illumination changes; low resolution of CCTVs; and occlusion in frames

All of the methods proposed for re-identification attempt to extract signatures (invariant features) from video frames and classify them appropriately to overcome the aforementioned problems. Thus far, no comprehensive framework reported in the literature covers all the issues related to re-identification; each method can cover the issues only partly. In the following sections, the different methods that have been investigated by several researchers are discussed.

3 Methods used in re-identification
In this section, the most significant studies in the area of person re-identification in recent years are categorised and explained. Then, in Section 4, a summary of the methods, comparing their advantages and disadvantages, is presented. The backgrounds of most re-identification techniques in their present structure refer to multi-camera tracking approaches [28, 29], content-based image retrieval techniques [30, 31] and algorithms used to extract colour and texture information from still images in order to classify and label them within a large volume of raw data. Generally, re-identification methods can be divided into two main groups. The first group includes methods that try to extract signatures from the colour, texture and other appearance properties of frames: these are appearance-based methods. The second group tries to extract features from the gait and motion of the persons of interest: these are gait-based methods, which are not yet popular because of the restrictions caused by different viewpoints and far-view camera frames in which the subject's gait is not clearly visible. Whether the approach is appearance-based or gait-based, re-identification consists of three main steps, which are depicted in Fig. 4. The first step is to extract the blob of the person of interest from the other parts of the image. The second step is to extract the signatures, and the last step is to compare the extracted signatures and evaluate the similarities between the query and the gallery set.

3.1 Pre-processing
Before we go through the feature extraction and classification stages, we briefly review the types of methods used in re-identification approaches as pre-processing, including
human detection, background elimination and, in some work, shadow elimination. These pre-processing steps are not necessarily included in all re-identification approaches, but applying them can increase accuracy.

3.1.1 Background elimination: The first step in re-identification after data acquisition is background elimination, which is needed to detect the person or region of interest. Although there may be some approaches in which background elimination is not suitable [32], in most cases it is necessary to remove the background to obtain better accuracy in the later stages. In re-identification, the same background may not exist for different frames because the data are grabbed from different cameras with different backgrounds; methods that use background subtraction with respect to a reference background frame are therefore useless. Manual silhouette segmentation is the most naïve approach to background elimination [33]. Gaussian mixture models (GMMs) [34] are widely used for background/foreground classification [14, 35–…]. Bak et al. [41] used the probability density function of the colour features of a target region to find the log-likelihood ratio of the foreground class. Gheissari et al. [11] used the maximum frequency image for background/foreground segmentation. Park et al. [6] also proposed a Gaussian background model operating at two levels, pixel and image; in this approach, the mean and variance of the background model are updated recursively using temporal and spatial information. The methods used by Gheissari et al. and Park et al. are useful only where sequences of frames are available; in scenarios in which only still images of pedestrians are available, these methods cannot be utilised.
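The GMM background/foreground step above can be sketched with OpenCV's stock MOG2 subtractor. This is a minimal illustration of the technique, not the exact models of [6, 11, 34, 41]; the video path is a placeholder.

```python
import cv2

# GMM-based background subtractor (MOG2); detectShadows=True additionally
# marks shadow pixels with the value 127, foreground with 255.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

cap = cv2.VideoCapture("camera1.avi")  # placeholder video path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)  # per-pixel GMM classification
    # Keep only confident foreground, dropping the shadow label (127).
    fg = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]
    # 'fg' is the silhouette mask handed to the later feature-extraction stage.
cap.release()
```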
3.1.2 Human detection: Some approaches prefer to use human detection and extract features from the bounding boxes that surround the human body. The histogram of oriented gradients (HOG) proposed by Dalal and Triggs [42] is one of the most useful methods applied to human detection [38, 41, 43–46] and even human body-part detection [43, 47]. This method is suitable for cases in which sequences of video frames are not available, but it needs training samples, as does the structure element (STEL) model. It is reliable under different illumination conditions (being mostly extracted from grey-scale images) and different scales (using a multi-scale approach); however, varying human poses may decrease the detection rate. A local binary pattern (LBP)-based detector called the simplified local binary pattern (SLBP) was proposed by Corvee et al. [48], in which the set of 256 vector elements of the LBP texture descriptor is reduced to 16. In their work [48], each cell is divided into four parts; the mean intensities and mean differences of the parts form the SLBP, and an AdaBoost classifier is trained on the features. To cope with different scales, different cell sizes were examined. Goldmann [35] and Monari [49] used a pixel-wise method proposed by Horprasert et al. [50] to detect persons in video streams. This algorithm mimics the human visual system, which is more sensitive to illumination than to colour. The difference between the pixel value of the current image and the background value in red-green-blue (RGB) colour space is decomposed into chromaticity and brightness components. A pixel is classified as foreground only if the chromaticity component exceeds a predefined threshold; if only the brightness component differs, the pixel is considered shadow. In [51], Albiol et al. formed a height map based on the calculation of pixel heights from the ground. To detect the moving persons (blobs), a threshold was applied to the height-map image, followed by connected-component analysis. Where two persons overlap, the watershed algorithm is used to split the blob; the criterion for splitting a blob is the existence of more than one local maximum (head position) in that blob. One of the major concerns in the human detection step for re-identification is real-time implementation. Eisenbach and Kolarow [52] used a real-time algorithm proposed by Wu et al. [53], capable of detecting humans at 20 frames per second. This method uses the census transform histogram visual descriptor, which outperforms the HOG and LBP methods. The descriptor encodes the signs of comparisons of neighbouring pixels and composes a histogram of these codes. In contrast to HOG, the focus of this descriptor is on the contour information of images; only the sign information of neighbouring pixels is preserved, while their magnitude information is ignored. Using this human detection method together with an efficient, real-time colour-based human tracking method [54] enabled Eisenbach et al. to track persons through video frames in real time. In another work, Aziz et al. [55] proposed one of the most applicable methods for detecting humans in crowds, feasible in real-time applications with only a short delay. In this method, background subtraction is performed, and a particle filter is then used to detect and track heads and skeleton graphs in the video frames. The method was designed to work in crowded scenes involving more than one person.
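OpenCV ships a pre-trained Dalal-Triggs HOG plus linear SVM pedestrian detector, which gives a minimal runnable stand-in for the HOG-based detectors cited above; the image path and score threshold below are assumptions.

```python
import cv2

hog = cv2.HOGDescriptor()  # default 64x128 Dalal-Triggs parameters
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("frame.png")  # placeholder image path
# Multi-scale sliding-window detection; returns boxes and SVM scores.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h), score in zip(boxes, weights):
    if float(score) > 0.5:  # confidence gate; the threshold is an assumption
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```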
In the case of occlusions where the bodies of two persons overlap, the head nearest to the camera is kept and the other head is ignored.

Fig. 4 Steps in re-identification

The moving foreground silhouettes can also be tracked based on their spatial and colour probability distributions.
Javed et al., as reported in [56], considered a Gaussian spatial probability density function for moving persons in consecutive frames and used normalised colour histograms of foreground pixels as the objects' colour distributions. A foreground pixel with the maximum product of colour and spatial probability votes for an object label. At the next level, a foreground region in the current frame is assigned to an object when most of its pixels (above a threshold) have voted for that object. In the case of partial occlusion, the position of the partially occluded object is estimated from the mean and variance of the pixels that voted for that object label.

3.1.3 Shadow elimination: Sometimes the shadows are not eliminated in the background subtraction step, so additional methods are used to remove the remaining shadows. Roy et al. [14] used the method in [57], in which, after background subtraction, the angle between the background pixel value and the foreground pixel value is compared with a threshold to decide whether the pixel belongs to a shadow. Park et al. [6] also used the same Gaussian model they had used for background subtraction, applying it to the foreground pixels in hue-saturation-value (HSV) space; the subtraction was first performed on V and then on the H and S values. In [10], shadows were detected if the difference between the pixel value and the expected value of the Gaussian background model was within two thresholds.

3.2 Body partitioning for re-identification
The models created to describe appearance or gait can be extracted using a holistic description of an individual [58] or a part-based (region-based) description of that individual [16, 21, 59]. In both appearance-based and motion-based approaches, the intermediate step of extracting spatial information helps to obtain more robust features and, ultimately, a better re-identification rate. The partitioning can be done using fixed proportions of the bounding box around the person of interest [59], but this cannot properly separate the regions and portions.

3.2.1 Symmetry-based partitioning: One of the most applicable partitioning algorithms, utilised in many approaches [27, 52, 60], was proposed by Farenzena et al. [16]. In this algorithm, three main body regions, corresponding to the head, torso and legs, are divided by two horizontal asymmetry axes. The head/torso division is based on the maximum difference between the numbers of pixels in two moving rectangles that sweep the image, and the torso/legs division on the maximum colour difference between these bounding boxes. In the last two regions, a vertical axis of appearance symmetry is estimated. The use of symmetry axes (weighting pixels by their distance from the axes) significantly improves the pose independence of the method. Fig. 5 shows body separation by this method. In [59], two rectangles like those in [16] were used to scan the image and find the best separating line between the torso and the legs; here, the maximum colour dissimilarity between torso and legs was found using the Bhattacharyya coefficients of the histograms of the two rectangles.
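A rough sketch of the torso/legs boundary search in the spirit of [16, 59]: a horizontal split row is chosen where the HSV histograms above and below it are maximally dissimilar under the Bhattacharyya measure. The search range and bin counts are arbitrary choices, and the head axis and pixel weighting of [16] are omitted.

```python
import cv2
import numpy as np

def hsv_hist(region_bgr):
    """Normalised 2D hue-saturation histogram of a BGR image region."""
    hsv = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2HSV)
    h = cv2.calcHist([hsv], [0, 1], None, [18, 8], [0, 180, 0, 256])
    return cv2.normalize(h, h).flatten()

def torso_leg_boundary(person_bgr, lo=0.3, hi=0.8):
    """Pick the row whose upper/lower HSV histograms are most dissimilar
    (largest Bhattacharyya distance), cf. the sweeping rectangles of [16, 59]."""
    H = person_bgr.shape[0]
    best_row, best_dist = None, -1.0
    for r in range(int(lo * H), int(hi * H)):
        d = cv2.compareHist(hsv_hist(person_bgr[:r]),
                            hsv_hist(person_bgr[r:]),
                            cv2.HISTCMP_BHATTACHARYYA)
        if d > best_dist:
            best_row, best_dist = r, d
    return best_row

person = np.random.randint(0, 255, (128, 48, 3), dtype=np.uint8)  # stand-in crop
split_row = torso_leg_boundary(person)
```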
3.2.2 Spatiotemporal segmentation: Gheissari et al. [11] proposed an algorithm that segments the silhouette based on its salient edges. It is robust against clothing wrinkles and the temporary edges caused by changing illumination conditions in different frames; the significance of this method is that it groups pixels based on their fabric. In this method, an over-segmentation is first performed using the watershed algorithm, resulting in a set of contiguous regions, as shown in Fig. 6. In the next step, a graph G = {V, E} with spatial and temporal edges is defined. If two regions i and i' in one frame share a common boundary, their corresponding edge is e_{i,i'}^{t,t}; if two regions in consecutive frames are indicated as corresponding, their edge is e_{i,i'}^{t,t+1}. The correspondence between two regions in two consecutive frames is detected via the frequency image of the frames. Finally, a graph-based partitioning algorithm groups the connected regions (temporally and spatially) whose inter-class variations are less than their intra-class variations.

3.2.3 Body-part detection: As previously mentioned, in addition to its use for human detection, HOG can be used to detect body parts. Bak et al. [43] and Bedagkar-Gala and Shah [47] used HOG as a part detector, as shown in Fig. 7. The idea is the same as for human detection, and the system must be trained with negative and positive samples for each body part. This type of segmentation allows the algorithm to compare corresponding parts with each other in the classification stage, which decreases the computational cost.
Fig. 5 Body segmentation using symmetry and asymmetry axes [16]
Fig. 6 Spatiotemporal segmentation [11]
3.2.4 Group representation of people for re-identification: In contrast to most re-identification approaches, which emphasise detecting and re-identifying an individual person of interest, Zheng et al. [20] proposed two spatial descriptors in which associated groups of people are considered as visual contexts. These descriptors are invariant to the rotations and swaps that occur within associated groups and are definitely invariant to scene occlusion. In fact, rather than re-identifying previously viewed persons individually, these spatial descriptors try to re-identify previously viewed groups of people. The SIFT-RGB features of the image are extracted and classified into n visual words w_1, ..., w_n. Then, the pixel values are replaced with their corresponding visual words, and the image is divided into l non-overlapping regions (p_1, ..., p_l) that expand from the centre of the image, as depicted in Fig. 8. For each region p_i, a histogram h_i is built, where h_i(a) denotes the frequency of occurrence of visual word w_a in that ring. An intra-ratio occurrence index h_i(a, b) is also defined, indicating the ratio of the frequency of occurrence of w_a to that of w_a and w_b together:

h_i(a, b) = \frac{h_i(a)}{h_i(a) + h_i(b) + 1}   (1)

To obtain inter-ratio occurrence indices, g_i and s_i are first defined as

g_i = \sum_{j=1}^{i-1} h_j, \qquad s_i = \sum_{j=i+1}^{l} h_j   (2)

Finally, G_i(a, b) and S_i(a, b) are defined as inter-ratio occurrence indices:

G_i(a, b) = \frac{g_i(a)}{g_i(a) + g_i(b) + 1}, \qquad S_i(a, b) = \frac{s_i(a)}{s_i(a) + s_i(b) + 1}   (3)

Therefore, for each region p_i, the centre rectangular ring ratio-occurrence (CRRRO) descriptor is defined as T_i^r = {H_i, G_i, S_i}, and the whole image is described by \{T_i^r\}_{i=1}^{l}. Fig. 9 shows how CRRRO helps to extract the inter-person spatial information of groups of people.

Another spatial descriptor defined in this approach is the block-based ratio-occurrence (BRO) descriptor. This descriptor is designed to extract likely local patch information from individuals, as can be seen in Fig. 8 (right). The image is divided into grid blocks, and BRO is defined inside each block. Each block B_i is divided into sub-blocks SB_i^γ. In fact, this descriptor copes with the non-rotational position changes of a person, which CRRRO cannot do. A complementary sub-region SB_i^{γ+1} is also considered, to cover the other visually similar blocks in the same group of people. Similar to CRRRO, an index H_i^j is defined in this descriptor between visual words in each region SB_i^j, but an extra index O_i is defined here to explore the inter-ratio occurrence of SB_i^j and the other block regions:

O_i(a, b) = \frac{t_i(a)}{t_i(a) + z_i(b) + 1}, \qquad \bar{O}_i(a, b) = \frac{z_i(b)}{t_i(a) + z_i(b) + 1}   (4)

where t_i and z_i are the visual word histograms of B_i and of the complementary image region SB_i^{γ+1}. Thus, the BRO is represented by T_i^b = \{H_i^j\}_{j=1}^{\gamma+1} \cup \{O_i^j\}, i = 1, ..., m, where m is the number of blocks.
Fig. 7 HOG body part detector [43, 47]
Fig. 8 CRRRO (left) and BRO (right) descriptors [20]
Fig. 9 Inter-spatial information extraction in CRRRO [20]
3.3 Appearance-based models
According to the literature, appearance-based methods are more suitable for re-identification because of the short time required by the entire process. Although varying pose and illumination directly affect appearance features (such as colour or texture) across images, the availability and discriminative potential of appearance-based features are the likely reasons for their use in most re-identification work. In this section, models based on the appearance features of images are discussed further.

3.3.1 Colour histograms: Colour histograms are the most popular tools used to describe the occurrence frequency of colours in an image. In re-identification, a holistic representation of the scene is neither applicable nor effective; histograms are therefore preferably extracted from segmented parts and regions [11, 12, 16, 27, 61]. Colour histograms are defined in different colour spaces: RGB colour histograms [12, 61], HSV colour histograms [16, 27, 32, 62] and LAB colour space histograms [62, 63] are examples. Among these colour spaces, the HSV channels are the most robust against illumination changes. The luminance and chromatic channels are also separated in the LAB colour space, which can help to overcome the effects of the varying illumination of different frames. The main disadvantage of histograms is their lack of geometric and spatial information, which is necessary in re-identification applications. To add spatial information to histograms, the silhouette can be divided into horizontal stripes, with a colour-position histogram defined for each stripe [57, 58]. Spatiograms can also provide complementary spatial information by adding higher-order spatial moments to histograms [2].

D'Angelo and Dugelay [63] proposed a probabilistic representation of histograms called the probabilistic colour histogram (PCH), in which the colours are quantised into 11 culture colours using a fuzzy k-nearest-neighbour algorithm. In this approach, a data element can belong to more than one cluster. A membership vector u(n) = {u_1(n), u_2(n), ..., u_C(n)} is defined for each pixel, indicating the degree of association of pixel n with each of the C = 11 clusters. Then, for each segmented part, the PCH is defined as a vector H(X):

H(X) = [H_1(X), H_2(X), \ldots, H_C(X)], \qquad H_c(X) = \frac{1}{N} \sum_{n=1}^{N} u_c(X_n)   (5)

where N is the total number of pixels in that segment. The fuzzy clustering used in this method allows a pixel to belong to more than one cluster. The quantisation of colours into 11 culture colours and the fuzzy nature of the histogram make this method more reliable against illumination changes than normal histograms. However, the histogram contains no spatial information. In addition, the method has not been compared with any other method, so there is no evidence of improvement over normal histograms. The fuzzy space colour histogram (FSCH) recently proposed by Xiang et al. [37] contains both colour and space information. In this approach, the usual three-dimensional (3D) histogram (in RGB space) is replaced by a 5D fuzzy histogram \tilde{P}(R, G, B, x, y), which includes the pixel geometry.
A membership function w_i(x) is also defined so that each pixel belongs to two neighbouring bins in each dimension. In the implementation stage for re-identification, the authors reduced the dimensionality of the histogram to 4D by removing the x dimension to make the histogram robust against pose variations.

3.3.2 Colour context people descriptor: The idea of the colour context people descriptor (CCPD) proposed by Khan et al. [59] was inspired by the shape context structure of Belongie et al. [64]. In this approach, the shape context structure is placed at the centre of the segmented object (the torso or the legs). Then, based on the pixels' radial and angular bins, a colour histogram is generated. The 3D structure of CCPD is shown in Fig. 10. Depending on the pose, the legs may contribute more or less colour information to the histogram; the bottom histogram for the same person will therefore differ from pose to pose. Thus, it is important to ignore the background pixels and consider only the legs' pixels to make the descriptor more discriminative. A back-projection algorithm [65] is used to identify the pixels that represent the legs. Colour histograms of both the top and bottom rectangular areas are created, and the Bhattacharyya coefficient [66] is computed to match the histograms. Applying the CCPD to the whole torso region introduces unwanted background pixels into the histogram computation. To mitigate this, it is better to apply a background/foreground segmentation algorithm to the detected person bounding box before applying CCPD, or to apply it to smaller patches extracted from the torso region.

3.3.3 MPEG7 colour descriptors: MPEG7 colour descriptors have generally been used in image retrieval applications [67, 68]; however, a number of researchers have used these descriptors specifically for re-identification purposes. The visual descriptors in the MPEG7 standard use both colour and spatial information, and most of them are invariant to scale, rotation and translation, which is why MPEG7 is considered a potentially capable descriptor for re-identification. Annesley et al. [33] used MPEG7 colour descriptors (dominant colour, colour layout, scalable colour and colour structure) to re-identify a person in a dataset grabbed by different cameras at different times. They collected a set of image sequences of pedestrians entering and leaving a room, viewed by two cameras, as the test set (the data are from medium and far-view image sequences). A dataset was generated from the top and bottom clothing components of each individual. They evaluated how this kind of segmentation affects the retrieval accuracy of the system. In addition, they investigated the effect of combining colour descriptors to improve retrieval accuracy. However, on multiple-camera datasets, the MPEG7 colour descriptors do not outperform a simple (R, G, B) mean description of the foreground data, because of the lack of colour constancy.
Fig. 10 CCPD [59]
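The part-wise colour-histogram signatures discussed in this subsection can be sketched as follows; the fixed head/torso/legs proportions are a simplification of the adaptive splits of Section 3.2, and all bin counts are assumptions.

```python
import cv2
import numpy as np

def part_signature(person_bgr):
    """One normalised HSV histogram per fixed-proportion body part."""
    H = person_bgr.shape[0]
    parts = [person_bgr[:H // 5],              # head
             person_bgr[H // 5:3 * H // 5],    # torso
             person_bgr[3 * H // 5:]]          # legs
    sig = []
    for p in parts:
        hsv = cv2.cvtColor(p, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1, 2], None, [8, 8, 4],
                         [0, 180, 0, 256, 0, 256])
        sig.append(cv2.normalize(h, h).flatten())
    return sig

def match(sig_a, sig_b):
    """Sum of per-part Bhattacharyya distances; lower means a better match."""
    return sum(cv2.compareHist(a, b, cv2.HISTCMP_BHATTACHARYYA)
               for a, b in zip(sig_a, sig_b))

a = part_signature(np.random.randint(0, 255, (128, 48, 3), dtype=np.uint8))
b = part_signature(np.random.randint(0, 255, (128, 48, 3), dtype=np.uint8))
score = match(a, b)
```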
In another work, Bak et al. [41] used the dominant colour descriptor (DCD) to extract robust signatures from segmented images. In this approach, a human detection algorithm was used to find people in video sequences, and an individual was then tracked through several frames to generate a human signature. The DCD signature was created by extracting the dominant colours of the upper- and lower-body parts. These two sets were combined using the AdaBoost scheme to capture the different appearances corresponding to one individual. The method used cross-correlation model functions to be robust against differences in illumination and pose, and to handle inter-camera colour calibration.

3.3.4 Interest points for re-identification: Gheissari et al. [11] proposed a method based on generating a large number of interest points (colour and structure information) around regions with high information content (Fig. 11). The Hessian affine invariant operator [69] was used to nominate interest regions. The HSV histogram and the 'edgel' histogram are two colour and structural features extracted from these regions for comparison. In [60], Martinel et al. used scale-invariant feature transform (SIFT) [70] interest points as the centres of circular regions, and a Gaussian function was used to construct a weighted colour histogram from the interest regions. The SIFT interest points are 3D histograms of the location and gradient orientation of the pixels. The gradient location and orientation histogram (GLOH) [71] is another descriptor, an extension of the SIFT descriptor that improves its distinctiveness [72]. In another attempt, Hamdoun et al. [3, 58] proposed a method based on harvesting SIFT-like interest point descriptors from different frames of a video sequence. In contrast to the method in [11], where matches are made image-to-image, this method exploits a sequence of images and generates a more dynamic, multi-view descriptor than a single image can. In the learning phase of this algorithm, the given object, person or car is tracked in one camera to extract interest points and descriptors to build the model. The interest point detection and descriptor computation are conducted using the 'key points' functions of the Camellia image processing library, which was inspired by speeded-up robust features (SURF) [73] but is even faster because the detector mostly relies on integral images to approximate the determinant of the Hessian matrix. De Oliveira and De Souza Pio [74] also used the SURF descriptor to locate interest regions, included the HSV histogram of the points, and saved it as a compact signature. The correlation of the compact signatures was then computed to find the best matches. The main advantage of using interest points for detection and description is their invariance to illumination and partial invariance to pose changes. However, the redundancy of interest points is not desirable and must be limited. Another issue to take into account is that interest point detectors are sensitive to edges, so their performance on silhouette edges may be degraded.
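A minimal SIFT keypoint matching sketch (OpenCV 4.4+ includes SIFT in the main build), standing in for the SIFT/SURF/GLOH pipelines surveyed above; Lowe's 0.75 ratio test and the match-count score are conventional choices, not those of any cited work, and the image paths are placeholders.

```python
import cv2

sift = cv2.SIFT_create()  # available in OpenCV >= 4.4
img_a = cv2.imread("person_a.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
img_b = cv2.imread("person_b.png", cv2.IMREAD_GRAYSCALE)
kp_a, des_a = sift.detectAndCompute(img_a, None)
kp_b, des_b = sift.detectAndCompute(img_b, None)

bf = cv2.BFMatcher()
pairs = bf.knnMatch(des_a, des_b, k=2)  # two nearest neighbours per keypoint
good = [m for m, n in (p for p in pairs if len(p) == 2)
        if m.distance < 0.75 * n.distance]  # Lowe's ratio test prunes ambiguity
score = len(good)  # crude similarity: number of surviving correspondences
```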
3.3.5 Covariance descriptors for re-identification: The insensitivity to noise and the invariance to identical shifts of colour make the covariance descriptor suitable for re-identification [43, 52, 75–…]. Given a region R as a segmented part of an image I, the covariance descriptor of region R is defined as a d × d covariance matrix [18]:

C_R = \frac{1}{n-1} \sum_{k=1}^{n} (f_k - \mu)(f_k - \mu)^{T}   (6)

where {f_k}, k = 1, ..., n, are the d-dimensional feature points of region R, n is the number of points in R and μ is the mean of the region points. The feature vector in the covariance descriptor can contain the colour, gradient or spatial derivatives of points. Bak et al. [43] proposed the spatial covariance regions (SCR) descriptor, in which the physical locations of the points, along with their RGB colours, gradient magnitudes and gradient orientations, were used to construct the feature vector. This model handles differences in illumination, pose and camera parameters. In this person re-identification approach, the HOG-based human detector and body-part detector are applied to establish the correspondence between body parts. Then, the covariance descriptor is used to assess the similarity between corresponding body parts. Finally, the concept of spatial pyramid matching [79] is used to design a new dissimilarity measure between human signatures. Hirzer et al. [78] used the covariance descriptor of the horizontal stripes of an image patch. The feature vector in their descriptor contained the y position, the LAB colour channels and the vertical/horizontal derivatives of the luminosity channel. Independence from the x axis made the descriptor robust against pose changes, but the naïve horizontal segmentation of the images made it less discriminative than the part-based segmentation in [43] or the dense grid segmentation in [75].
Fig. 11 Interest point detectors (SIFT detector) from [11] and [60]
The feature vector can also consist of Gabor and LBP texture features [19]. A Gabor mask with a given orientation was used to make the descriptor invariant to pose changes, and the LBP provides invariance to grey-level changes. In other studies by these authors [48, 75], they proposed the mean Riemannian covariance (MRC) descriptor, the temporal mean of the covariance matrices of overlapped regions; the feature vector of the covariance descriptor was almost the same as in their previous work [43]. Covariance descriptors are robust to rotation and illumination changes, and the dense representation (overlapped regions) makes the descriptor robust to partial occlusion. However, covariance descriptors are not defined in Euclidean space and have no additive structure. Thus, every operation, such as the mean or variance, must be specially treated, which leads to greater computational cost. One solution is to compare only the means of the covariance matrices of corresponding parts instead of the whole descriptor [76].

3.3.6 Texture features for re-identification: Texture features play a complementary role in constructing appearance-based models. In re-identification, they must be combined with colour features to improve performance; texture-based descriptors alone are not very effective [19]. The recurrent high-structured patches (RHSP) descriptor proposed by Farenzena et al. [16] is one of the most applicable texture descriptors and is utilised in several other works [38, 52]. The descriptor is based on selecting patches from the foreground and discarding those with low structural information by thresholding their entropy. Transformations are applied to the pruned patches to evaluate their invariance to geometric variations, and the patches are then combined to construct the RHSP. This texture descriptor is invariant to pose and rotation changes, but it needs images of at least medium resolution to be applicable. Fig. 12 shows the extracted RHSP texture features from a torso.

Gabor [80] and Schmid [81] filters are mostly applied to luminance channels. These filters are rotation invariant and thus create features that are pose and viewpoint invariant for re-identification.

Co-occurrence matrices have also been used for texture description [35]. This descriptor is constructed from square matrices that provide information about the relative occurrence of neighbouring pixels. The joint probabilities of neighbouring pixels P(i, j) inside a square matrix of dimension N × N can be described as

P(i, j) = \sum_{x=1}^{N} \sum_{y=1}^{N} \begin{cases} 1, & \text{if } I(x, y) = i \text{ and } I(x + \Delta x, y + \Delta y) = j \\ 0, & \text{otherwise} \end{cases}   (7)

where (Δx, Δy) is the displacement between the pixel of interest and its neighbour and I(x, y) is the pixel value at point (x, y).
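Eq. (7) is essentially what scikit-image's graycomatrix computes (counts become joint probabilities when normed=True); a minimal sketch on a stand-in patch:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

patch = np.random.randint(0, 8, (64, 32), dtype=np.uint8)  # stand-in grey patch
# P(i, j) of eq. (7) for displacement (dx, dy) = (1, 0); 'levels' must cover
# the grey-value range of the input.
P = graycomatrix(patch, distances=[1], angles=[0], levels=8, normed=True)
contrast = graycoprops(P, "contrast")  # one scalar per (distance, angle) pair
```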
Most texture descriptors are defined on grey-level channels, which makes them robust against illumination changes.

3.4 Gait-based methods for re-identification
A person's gait is a biometric that seems useful for re-identification because it is difficult for people to deliberately alter the way they walk without looking unnatural. The first method to demonstrate that humans could be recognised by their gait was introduced in the 1970s and employed a moving light display [82]. Lights were fastened to the major joints of the human body, and only the lights were visible as the person moved in complete darkness. Analysing the frames showed that the person could be recognised from this pattern. A person's gait changes with walking speed, type of clothing and even mood; however, these factors can be considered constant within the short period of the re-identification process. Approaches using gait analysis are mostly applied to recognition and identification purposes and are rarely used for re-identification. In identification, the processing time is not important because the procedure is offline, whereas in re-identification the processing time is an important factor. Therefore, gait recognition techniques that require extensive computation are not suitable for re-identification. A drawback of gait-based methods is that the subject must be observed for at least one or two steps before an analysis can be done. The gait is mostly extracted from the side view of a person; gait-based algorithms therefore require a pure side view of individuals to extract the gait features, which limits these methods. However, extracting gait features does not require high-resolution images, which is a benefit of gait-based approaches.

Two types of methods are used in gait recognition: model-based and motion-based. In the model-based approach, a model is fitted to the video data and the parameters of the model are used for identification. This approach is computationally expensive and time-consuming because of the large number of parameters, and it involves problems such as determining the positions of the joints in the arms and legs. In contrast, motion-based approaches directly use the binary information of silhouette sequences, exploiting their gait and motion information. The quality of the frames and their resolution are not important in these approaches; thus, motion-based approaches are considered more often among gait-based methods. Motion-based methods use actual images of a subject during a walking sequence and extract features from them, such as the silhouette and contour of the person. The viewpoint affects the functioning of motion-based methods. Moreover, the high dimensionality of the feature vectors may cause problems (the 'curse of dimensionality'). On the positive side, motion-based methods are cheaper and easier to compute, and less technical to implement [10].

The method in [10] transforms the extracted features of the silhouettes in frames into a gait representation based on the sequences of silhouettes of the subject, which is then used in re-identification by comparing the representations from different silhouettes using a simple classification method.
Fig. 12 RHSP textural structures [16]
The methods used in this approach are: (i) the active energy image (AEI) [83] and gait energy image (GEI) representations, (ii) the 3D Fourier transform [84] of the gait silhouette volume (to remove high-frequency noise from silhouettes), (iii) the frame difference energy image, (iv) the self-similarity plot [85] and (v) a method that uses the distance curves of a sequence of contours. If \{B_t(x, y)\}_{t=1}^{N} is a sequence of silhouettes, the GEI is defined as

G(x, y) = \frac{1}{N} \sum_{t=1}^{N} B_t(x, y)   (8)

In this grey-level image, the frequently appearing regions of the silhouettes become brighter. The AEI (A_t) is defined as

D_t(x, y) = \begin{cases} B_t(x, y), & t = 1 \\ B_{t-1}(x, y) - B_t(x, y), & t > 1 \end{cases}   (9)

A(x, y) = \frac{1}{N} \sum_{t=1}^{N} D_t(x, y)   (10)

Fig. 13 shows GEI and AEI images. In these approaches, the re-identification procedure is regarded as a pipelined process that starts by distinguishing the interesting parts of the video (i.e. the people) from the uninteresting stationary background. This is achieved using a mixture-of-Gaussians algorithm. Using the segmented video, the positions of different people are tracked as they move. The position data are then used together with the segmented video frames to create a representation of the different persons' gaits. Subsequently, re-identification is performed by comparing the different gaits using a simple classification procedure (nearest neighbour). In this approach, a mixture of Gaussian distributions [34] is used to model each background pixel, and an effective estimation of the parameters based on an expectation-maximisation approach [86] is applied for the segmentation of the interesting objects (people).

In contrast to the interest point operator approach in [11], which generates a large number of potential correspondences, model-based algorithms (which represent the second approach used in that particular study) establish a map from one individual to another. Specifically, a decomposable triangulated graph [87] is used to model the articulated shape of a person, as depicted in Fig. 14. This method can be categorised in the model-fitting category of gait-based methods. A dynamic-programming algorithm is used to fit the model to the person's image [87]. Model fitting localises different body parts such as the arms, torso, legs and head, thus facilitating the comparison of appearance and structure between corresponding body parts. The main shortcoming of this model is that it works on the front view of the gait, whereas in real scenarios many frames are not front views.

Kawai et al. [32] proposed spatiotemporal HOGs (STHOGs) as a gait descriptor to extract both shape and motion features of the silhouettes. In this descriptor, the spatial and temporal gradients of two subsequent frames are calculated in (x, y, t) space, that is, G = [G_x, G_y, G_t]. Then, the orientations of the spatial (φ) and temporal (θ) gradients are calculated:

\varphi = \tan^{-1}\!\left(\frac{G_y}{G_x}\right), \qquad \theta = \tan^{-1}\!\left(\frac{G_t}{\sqrt{G_x^2 + G_y^2}}\right)   (11)

The temporal and spatial orientations are quantised separately into 9 bins each and then combined into a single 18-bin histogram to construct the STHOG. Since this descriptor is very sensitive to the contour of the foreground, removing the whole background is not suitable; a background attenuator [88] is used instead. Finally, this gait feature is combined with colour features (a holistic HSV histogram of the silhouettes) to form a mixture of gait and colour features. The problem with this descriptor is that it is too sensitive to differences in the point of view.
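A numpy sketch of eqs. (8)-(10) on a stack of binary silhouettes; the random stand-in data below replaces a real silhouette sequence.

```python
import numpy as np

def gei(silhouettes):
    """Gait energy image, eq. (8): mean of N binary silhouettes B_t(x, y)."""
    return silhouettes.mean(axis=0)

def aei(silhouettes):
    """Active energy image, eqs. (9)-(10): mean of the difference images,
    with D_1 = B_1 and D_t = B_{t-1} - B_t for t > 1."""
    B = silhouettes.astype(np.float64)
    D = np.concatenate([B[:1], B[:-1] - B[1:]])
    return D.mean(axis=0)

frames = np.random.randint(0, 2, (30, 128, 64))  # stand-in silhouette stack
g, a = gei(frames), aei(frames)
```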
The pose energy image (PEI) is another gait feature [14] used for re-identification. First, the gait cycle is divided into K different poses. Then, the averages of all the silhouettes belonging to one person in a particular pose k_i are calculated to obtain K PEIs. Unsupervised K-means clustering is used to classify each gait frame into one of the K poses. When all the frames in a sequence have been allocated to the K poses, the fraction of time each k_i occurs in a gait cycle of N frames is estimated as

T_i = \frac{1}{N} \sum_{t=1}^{N} \begin{cases} 1, & \text{if frame } f_t \text{ belongs to } k_i \\ 0, & \text{otherwise} \end{cases}   (12)

If the binary silhouette of frame t in the sequence is I_t(x, y), the ith PEI is defined as

\mathrm{PEI}_i(x, y) = \frac{1}{N \times T_i} \sum_{t=1}^{N} I_t(x, y), \quad \text{if } I_t(x, y) \text{ belongs to } k_i   (13)

This gait descriptor is more robust than other gait descriptors such as GEI and AEI because it contains finer temporal information of the gait and shows exactly how the shape of a person changes. However, like other gait features, this descriptor cannot be formed without temporal information (sequence availability).
Fig. 13 Depiction of GEI (first row) and AEI (second row) images [10]
Fig. 14 Decomposable triangulated graph used as person model [11]
3.5 Classifiers and matching metrics in re-identification approaches
The third stage of re-identification is to compare the extracted features to find the most similar correspondents. The Euclidean distance and the Bhattacharyya coefficient are the most popular distance measures in re-identification [63]. Their general forms are

\text{Euclidean dist}(a, b) = \|a - b\|, \qquad \text{Bhattacharyya coeff}(p, q) = \sum \sqrt{pq}   (14)

where \|\cdot\| denotes the vector norm. For the Euclidean distance, a and b represent feature vectors of any dimension; for the Bhattacharyya coefficient, p and q represent two different probability distributions of features. As colour and texture histograms are probability distributions, this measure is widely used to compare such features in corresponding samples. In some cases, the formulation \text{Bhattacharyya dist}(p, q) = \sqrt{1 - \text{Bhattacharyya coeff}(p, q)} is used, to better satisfy the properties of a metric. It must be noted that when the features do not belong to Euclidean space, the Euclidean distance cannot be used. For example, as previously stated, covariance descriptors do not lie in a vector space; a special covariance distance metric is therefore used for these descriptors:

\rho(C_i, C_j) = \sqrt{\sum_{k=1}^{d} \ln^2 \lambda_k (C_i, C_j)}   (15)

where C_i is a covariance descriptor of dimension d and λ_k(C_i, C_j) are the generalised eigenvalues of the pair. The sum of quadratic distances [74], the sum of absolute differences [3, 58], correlation coefficients [51] and the Mahalanobis distance [16, 52] are other metrics used to compare feature vectors and find the most similar individuals in different scenes. The advantage of the Mahalanobis distance over the other metrics mentioned is that it accounts for the correlation between the feature vectors. Although the aforementioned metrics differ and have their own definitions, they share one major property: they are fixed and inflexible. In other words, they treat every feature fed to them equally and cannot discard useless features. In re-identification, this property can be a severe limitation: under strong changes in illumination, pose and viewpoint, some features are more distinctive than others, so some features should be given more weight and others discarded. However, the standard metrics cannot discriminate between the features.
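A sketch of the covariance descriptor of eq. (6) together with the metric of eq. (15) and the Bhattacharyya distance of eq. (14); scipy's eigvalsh solves the generalised symmetric eigenproblem, and the small ridge added below is a practical guard (an assumption, not from the source) against singular covariance matrices.

```python
import numpy as np
from scipy.linalg import eigvalsh

def covariance_descriptor(F):
    """Eq. (6): F holds n d-dimensional per-pixel feature points, one per row."""
    Z = F - F.mean(axis=0)
    return Z.T @ Z / (len(F) - 1)

def covariance_distance(Ci, Cj):
    """Eq. (15): sqrt of the summed squared log generalised eigenvalues."""
    lam = eigvalsh(Ci, Cj)  # solves Ci v = lambda Cj v
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def bhattacharyya_distance(p, q):
    """Distance form of eq. (14) for two normalised histograms."""
    return float(np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q)))))

ridge = 1e-6 * np.eye(5)  # keeps the matrices positive definite (assumption)
C1 = covariance_descriptor(np.random.rand(500, 5)) + ridge
C2 = covariance_descriptor(np.random.rand(500, 5)) + ridge
d = covariance_distance(C1, C2)
```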
Based on this shortcoming, some researchers have recently applied learned distance metrics [62, 89], optimised distance metrics [90] or probabilistic distances [45, 61] to re-identification. These approaches attempt to solve re-identification from a distance-learning point of view.

Zhao et al. [91] proposed a method in which only the salient patches of the foreground are selected and compared. The salient patches are selected using learning methods, which enables the system to compare only the most relevant parts. They examined k-nearest neighbour (KNN) and one-class support vector machine (SVM) approaches to select the salient patches; in their experiments, there was no major difference between the results of KNN and the one-class SVM. The one-class SVM was trained only on positive samples and its goal was to detect outliers. It was formulated as an optimisation problem defining a hypersphere in the feature space, the goal being to minimise the objective function while including most of the training samples inside the hypersphere:

\min_{R \in \mathbb{R},\ \xi \in \mathbb{R}^l,\ c \in F} \; R^2 + \frac{1}{\nu l} \sum_i \xi_i \quad \text{s.t.} \quad \|\Phi(X_i) - c\|^2 \le R^2 + \xi_i, \; \forall i \in \{1, \ldots, l\}: \xi_i \ge 0   (16)

where ξ represents the misclassification error, R and c are the radius and centre of the hypersphere, and Φ(X_i) is the multi-dimensional feature vector of training sample X_i, with l training samples. The parameter ν is a trade-off parameter taking a value between 0 and 1. The SVM has been used as a classifier in several re-identification studies [35, 90, 92] in different styles. The common point of all these approaches is that an objective function is optimised and the hyperplane is selected such that the vectors from the two classes are separated with maximum margin. In [38], to reduce the complexity cost and speed up the training phase, an active learning method [93] was exploited: only the samples near the decision plane were labelled and used for learning, which reduces the number of samples that negatively affect the classifier. The training phase started with one positive and one negative sample; after the first round, only the samples closest to the hyperplane were selected. This selective sampling decreased the training set to a quarter of the total number of samples.

Gray and Tao [12] and Bak et al. [41] used the AdaBoost scheme to construct an ensemble of likelihood ratios forming a similarity function. In [78], a similar boosting algorithm was used to select the more discriminative features among all features. In this approach, a subset of T more informative features, corresponding to the weak classifiers in the boosting algorithm, was selected from the whole set {f_1, ..., f_M}. Fig. 15 shows the procedure of the adaptive boosting method. The main disadvantage of boosting methods is that they can easily overfit and are sensitive to noise and outliers.

An ensemble of decision trees, known as a random forest [94], is another tool that has been used for classification in re-identification. An ensemble of decision trees is less sensitive and has reduced variance compared with an individual decision tree.
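Eq. (16) is the model that scikit-learn's OneClassSVM fits (in its ν-parameterised dual form); a sketch with stand-in patch features, where treating predicted outliers as 'salient' mirrors the rarity notion of [91].

```python
import numpy as np
from sklearn.svm import OneClassSVM

X_train = np.random.rand(200, 32)  # stand-in patch features (positives only)
X_test = np.random.rand(20, 32)

# 'nu' plays the role of the trade-off parameter v in eq. (16): it upper-bounds
# the fraction of training samples left outside the learned support.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_train)
labels = ocsvm.predict(X_test)   # +1 = inside the support, -1 = outlier
salient = X_test[labels == -1]   # rare patches, treated here as salient
```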
An ensemble of decision trees, known as a random forest [94], is another tool that has been used for classification in re-identification; it is less sensitive and has reduced variance compared with an individual decision tree. Du et al. [95] proposed a model known as the random ensemble of colour features (RECF), in which random forests are used to learn a similarity function f(·) with the channels of different colour spaces as features. Given training samples (x_i, y_i), where x_i denotes a pair of person images and y_i the label for x_i, the similarity function is a combination of the posterior probabilities p_t(x_i) of T trees and is formulated as follows

$$f(x_i) = \frac{1}{T} \sum_{t=1}^{T} p_t(x_i) \qquad (17)$$

where p_t(x_i) is the posterior probability of a leaf node l of tree t, estimated as the fraction of positive subsamples that reach the leaf node (N_{l-pos}) over the total number of subsamples that reach that leaf node (N_{l-pos} + N_{l-neg}).

Zheng et al. [61, 89] introduced the novel probabilistic relative distance comparison (PRDC) and relative distance comparison (RDC) models, in which re-identification is formulated as a distance-learning problem. In contrast to conventional approaches that attempt to minimise the intra-class variation (i.e. between images of one person) and maximise the inter-class variation (i.e. between images of two different persons), the objective function used by PRDC aims to maximise the probability that a pair which is a true match (i.e. two true images of person A) has a smaller distance than a pair which is a related wrong match (i.e. images of persons A and B, respectively). The function f for PRDC must therefore be learned so that the distance between relevant pairs is less than that between irrelevant ones

$$f(x_i^p) < f(x_i^n) \qquad (18)$$

where x_i^p is the distance of a relevant pair and x_i^n that of an irrelevant pair. The probability of the above event is computed, and the function is then learned based on the maximum likelihood principle

$$f = \arg\min_f r(f, O) \qquad (19)$$

$$P\big(f(x_i^p) < f(x_i^n)\big) = \Big(1 + \exp\big(f(x_i^p) - f(x_i^n)\big)\Big)^{-1} \qquad (20)$$

$$r(f, O) = -\log\Big(\prod_{O_i} P\big(f(x_i^p) < f(x_i^n)\big)\Big) \qquad (21)$$

$$O = \big\{ O_i = (x_i^p, x_i^n) \big\} \qquad (22)$$

This function can also be learned by an SVM, as was done by Prosser et al. [90]. The difference between RDC and RankSVM is that RDC uses a logistic function that provides a soft margin measure for the vectors x_i^n, whereas RankSVM has no such margin; thus, RDC is more robust against inter-class and intra-class variations. The large margin nearest neighbour (LMNN) classifier [61, 96] is another kind of learned distance metric used for re-identification. In this approach, a linear transformation L is learned to minimise the distance between a data point and its k nearest neighbours with the same label, while simultaneously maximising the distance of this data point from differently labelled data points. Khedher et al. [45] developed two GMMs to model the distance distributions of relevant pairs (GMM_rel) and irrelevant pairs (GMM_irr). To decide whether a test distance d is relevant, its likelihood ratio must be greater than 1

$$\mathrm{LR} = \frac{P(d \mid \mathrm{GMM}_{rel})}{P(d \mid \mathrm{GMM}_{irr})} \qquad (23)$$

$$P(d \mid \mathrm{GMM}_i) = \sum_{g=1}^{G} \frac{C_{gi}}{\sqrt{2\pi}\,\sigma_{gi}} \exp\left(-\frac{(d - \mu_{gi})^2}{2\sigma_{gi}^2}\right) \qquad (24)$$

where μ_gi, σ_gi and C_gi are the mean, standard deviation and weight of component g of GMM_i.
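The likelihood-ratio test of eqs. (23) and (24) can be sketched with two Gaussian mixtures fitted to synthetic relevant and irrelevant distance samples. The component count, data and threshold below are assumptions for illustration, not the settings of Khedher et al. [45].

```python
# Sketch of the GMM likelihood-ratio decision of eqs. (23)-(24): accept a
# test distance as "relevant" when LR > 1 (log-LR > 0). Synthetic data only.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
d_relevant = rng.normal(0.3, 0.1, size=(1000, 1))    # small matching distances
d_irrelevant = rng.normal(0.9, 0.2, size=(1000, 1))  # large non-match distances

gmm_rel = GaussianMixture(n_components=3, random_state=0).fit(d_relevant)
gmm_irr = GaussianMixture(n_components=3, random_state=0).fit(d_irrelevant)

def is_relevant(d):
    # score_samples returns log densities, so LR > 1  <=>  log-LR > 0
    log_lr = gmm_rel.score_samples([[d]])[0] - gmm_irr.score_samples([[d]])[0]
    return log_lr > 0.0

print(is_relevant(0.35), is_relevant(0.95))  # expect True, False
```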
Standard datasets for re-identification

Some standard databases have been used across different works for the evaluation and comparison of re-identification approaches. Farenzena et al. [16] used public databases such as viewpoint invariant pedestrian recognition (VIPeR) [12], the imagery library for intelligent detection systems (i-LIDS) [97] (which was also used by Zheng et al. [61]) and ETH Zurich (ETHZ) [98].
Fig. 15 AdaBoost algorithm
In [27], Satta et al. similarly used VIPeR to compare their results with [16]. Bak et al. [41] used the TRECVID database (organised by the National Institute of Standards and Technology (NIST)) to train their human detection algorithm with 10 000 positive (human) samples and 20 000 negative (background scene) samples. They also used CAVIAR [99], as did Hamdoun et al. in [3, 58]. In another work, Bak et al. [43] used i-LIDS as their database. Khan et al. [59] used the NICTA and CAVIAR datasets to evaluate their results. In some studies, datasets were created based on the scenarios in which the re-identification should be performed. For instance, Cong et al. [2] created real datasets using two cameras in the desired places, Gheissari et al. [11] used their own datasets, and Skog [10] and Annesley et al. [33] both made two different datasets based on the scenes that they needed.

The frequent use of the above-mentioned datasets in various works has turned them into standards for the evaluation of re-identification experiments, and they present the challenges that a re-identification method must overcome. However, they are not appropriate for gait- and motion-based approaches, because these approaches require databases that contain successive image sequences together with information about their points of view.

The CAVIAR, i-LIDS and TRECVID databases are in video form, whereas VIPeR contains still images and ETHZ is a set of consecutive frames. Almost all of them contain videos and images of pedestrians from different points of view and under different illumination conditions, but among them only i-LIDS and ETHZ contain occluded frames. TRECVID and VIPeR also have images that show pedestrians carrying objects (partial occlusion). The TRECVID and i-LIDS datasets contain images that were obtained under actual surveillance conditions. Table 2 shows the most discriminatory and challenging features of each of these datasets compared with the others.

One major issue in the reviewed papers is the way in which their results are evaluated and compared. Most methods have used the cumulative matching characteristic (CMC) curve, as stated in [77], to evaluate and compare their results [2, 16, 27, 41, 43, 61]. In CMC curves, the cumulative number of re-identified queries is shown against the rank at which they were re-identified. If the number of truly re-identified queries at rank i is tq(i), the CMC value for rank i is defined as

$$\mathrm{CMC}(i) = \sum_{r=1}^{i} tq(r) \qquad (25)$$

Other curves have also been used to evaluate re-identification systems, such as receiver operating characteristics [59, 92], precision–recall (PR) [3, 58] and other metrics [10, 11]. In PR curves, the indices are calculated as

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{\text{target number}} \qquad (26)$$

where TP and FP stand for true positives and false positives. A number of other works prefer PR to CMC curves [6, 55]. The only advantage of the PR curve over the CMC curve is that the ratio of falsely re-identified samples is shown explicitly, whereas in the CMC curve it is hidden. However, the lack of rank information in PR curves is a shortcoming when evaluating the performance of a re-identification system.
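The small sketch below shows how CMC values (eq. (25)) and PR indices (eq. (26)) might be computed from a list of true-match ranks; the numbers are toy data, not results from any cited work.

```python
# Sketch of CMC (eq. (25)) and precision/recall (eq. (26)) computation.
import numpy as np

def cmc_curve(true_match_ranks, max_rank):
    """true_match_ranks[i] is the rank of the correct match for query i."""
    ranks = np.asarray(true_match_ranks)
    tq = np.array([(ranks == r).sum() for r in range(1, max_rank + 1)])
    return np.cumsum(tq) / len(ranks)   # cumulative, normalised to [0, 1]

def precision_recall(tp, fp, n_targets):
    return tp / (tp + fp), tp / n_targets

ranks = [1, 1, 2, 3, 1, 5, 2, 1, 4, 1]    # toy results for ten queries
print(cmc_curve(ranks, max_rank=5))        # rank-1 rate = 0.5 for this toy data
print(precision_recall(tp=8, fp=3, n_targets=10))
```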
Xiang et al. [38] evaluated the performance of their re-identification system by measuring the true positive ratio and the true negative ratio against different numbers of labelled samples in two different curves, but these ratios cover only the first-ranked results, and such evaluation curves cannot show the subsequent ranks of re-identified samples. The advantage of the CMC curve over the other curves is that it indicates not only the rank of the first true re-identified query but also the true re-identified queries at the other ranks; in this way, the performance of a re-identification system can be depicted better. It is important to mention that the two important factors in CMC curves are the first-rank re-identification rate and the steepness of the curve: the steeper the curve, the better the performance.

In the literature, re-identification methods can be grouped into two sets. The first group is composed of single-shot methods that analyse a single image of a person [12, 13]; they are applied when tracking frames are unavailable. The second group includes multiple-shot approaches, which employ multiple frames of a person (usually obtained via tracking) to build the signatures [3, 11, 100]. Both approaches simplify the problem by adding temporal reasoning to the spatial layout of the monitored environment in order to prune the candidate set to be matched [100].

In most security-related research, the colour and/or texture of the clothes is regarded as the most significant cue for person retrieval [2, 11, 16, 52], which shows that clothing information can be used for local descriptors.
Table 2 Most challenging and discriminatory features of each dataset

| Dataset | Challenging features |
|---|---|
| CAVIAR | real situation, low resolution |
| ETHZ | illumination variation, occlusion |
| VIPeR | pose variation, background variation |
| i-LIDS | real situation, occlusion |
| TRECVID | real situation, partial occlusion |
Table 3 Re-identification rate for different normalisation methods [101]

| Normalisation method | RGB colour space | Grey world | Histogram equalisation | Affine normalisation |
|---|---|---|---|---|
| first rank of CMC re-identification rate, % | 70 | 95 | 97.5 | 95 |
Using other features, such as facial and gait features [10, 11, 25] or even the height of individuals [6, 51], has the great advantage that these features are likely to remain constant over a longer period of time. A person's face and gait are popular descriptors in recognition tasks and have also been used for re-identification [7, 10], but facial descriptors tend to produce less accurate results when combined with low-resolution cameras. Nevertheless, such cameras are widely used in actual surveillance systems, mainly for economic reasons.

As previously mentioned, the three most important issues in re-identification are illumination changes, viewpoint and pose variations, and scene occlusion. Scene occlusion has received insufficient attention in re-identification studies, and only a small number have provided solutions for it, whereas almost all have addressed illumination and pose variations. The various colour, texture, shape and gait models and descriptors proposed in these works share a common goal, which is to improve the true re-identification rate by overcoming these issues as well as possible. Here, the various solutions given in the literature for these issues are summarised.

Since some colour spaces, such as HSV and LAB, are more robust against illumination changes, some approaches prefer to use them instead of the RGB colour space in histogram descriptors or textural features [11, 16, 27, 63]. Reducing the number of quantisation levels of the colour channels also decreases the sensitivity to noise and intensity [63]. In methods that use colour histograms, a common way to make them illumination invariant is histogram equalisation [2, 51, 74]. The assumption behind histogram equalisation is that although illumination changes make the sensors respond differently, the rank ordering of their responses remains unchanged. The rank measure for level i is defined as

$$M(i) = \frac{\sum_{t=1}^{i} H(t)}{\sum_{t=1}^{N_b} H(t)} \qquad (27)$$

where N_b is the total number of quantisation levels and H(·) is the histogram. Greyworld and affine normalisations have also been used; these normalisation methods are applied to the colour channels I_k instead of the histograms

$$\text{norm}_{\text{Grey}}(I_k) = \frac{I_k}{\text{mean}(I_k)}, \qquad \text{norm}_{\text{Affine}}(I_k) = \frac{I_k - \text{mean}(I_k)}{\text{std}(I_k)} \qquad (28)$$

where std is the standard deviation of the pixels in colour channel I_k. Histogram normalisation is another method used in some works [37, 58]. Cong et al. [101] ran the same algorithm on data grabbed from two locations, but applied different normalisation methods to it; the resulting re-identification rates are shown in Table 3.
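A minimal sketch of the normalisations in eqs. (27) and (28) is given below; it applies the rank measure, grey-world and affine normalisation to a random image channel and is an illustration only, not code from [101].

```python
# Sketch of eqs. (27)-(28): cumulative rank measure of a histogram, plus
# grey-world and affine normalisation of a single colour channel.
import numpy as np

def rank_measure(hist):
    """M(i) of eq. (27): cumulative histogram over total mass."""
    return np.cumsum(hist) / np.sum(hist)

def greyworld(channel):
    return channel / channel.mean()                    # eq. (28), grey world

def affine_norm(channel):
    return (channel - channel.mean()) / channel.std()  # eq. (28), affine

img = np.random.default_rng(3).integers(0, 256, size=(64, 48)).astype(float)
hist, _ = np.histogram(img, bins=32, range=(0, 256))
print(rank_measure(hist)[:5])
print(greyworld(img).mean(), affine_norm(img).std())   # ~1.0 and ~1.0
```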
The light absorbed by the camera's sensor is reflected from different sources in the scene. Thus, to obtain more realistic illumination compensation, the influence of these local light sources on the object's colours must be determined. Monari [49] divided scene lights into the three categories of ambient light, overhead light and backlight, and proposed a model for an object's pixel intensity I_object(x) and its shadow pixel intensity I_shadow(x), made up of the ambient and overhead lights

$$I_{\text{shadow}}(x) \cong K_c \big( I_{\text{amb}}(x) + I_{\text{overhead}}(x) \big) \qquad (29)$$

By dividing the shadow intensity by a known ground-floor reflection coefficient K_c, the pure light sources are obtained (I_amb(x) + I_overhead(x)), and the object's pixel intensity I(x) is then normalised by this value

$$I_{\text{norm}}(x) = \frac{K_c\, I(x)}{I_{\text{shadow}}(x)} \qquad (30)$$

To guarantee correct illumination compensation, this approach must properly detect the shadows of the individuals and requires environmental information about the scene (the ground-floor reflection coefficient), which are its drawbacks. Aziz et al. [55] classified people's appearances into frontal and back appearances to obtain robustness against illumination changes; they also normalised the SIFT descriptor that they used for feature extraction.

One of the most effective methods for obtaining pose invariance was devised by Farenzena et al. [16]. In their method, a silhouette is divided into the three main parts of head, torso and legs. Then, for each part, a symmetry axis divides the silhouette into two halves. Features are selected from the two sides of this axis and weighted based on their distance from it. This symmetric selection of patches makes the feature extraction pose invariant. In contrast, an arbitrary and random selection of patches from the torso or legs would make the method vulnerable to pose variations [27]. The rotation-invariant nature of the Gabor and Schmid descriptors is the reason for their identical response when applied to different poses of a silhouette, which results in pose-invariant features [12]. The distance metric can also affect the pose invariance of a method, like the Mahalanobis distance that Martinel and Micheloni [60] used to compare two extracted SIFT descriptors. Most pose variations in different frames occur around the vertical axes of the scenes; defining models and descriptors to be independent of the x axis therefore partially improves robustness against pose variations [12, 51, 61].

The descriptors proposed by Zheng et al. [20] are among the most relevant ones designed to deal with scene occlusion. The descriptors are designed to operate on groups of individuals instead of single persons, but because their formulation is based on inter-person distances, the method is unable to re-identify individuals separately. Wang et al. [102] trained an LMNN-R classifier with real and synthesised occluded datasets to overcome the partial occlusion of the test dataset. An interesting body print that is totally robust against occlusions was proposed by Albiol et al. [51]. This body print is extracted from all RGB values of the pixels at height h (from the ground) of the foreground.
The signature uses temporal information from consecutive frames of one person; during these frames, the signatures of the occluded frames are neglected, so the occluded frames do not affect the final signature extracted from the whole set of frames. The height information in this method was obtained with a Microsoft Kinect sensor and cannot be used in outdoor environments, which is a large limitation for re-identification applications. Gait-based approaches are highly sensitive to occlusion, because even a small partial occlusion in consecutive frames prevents the gait feature from being extracted properly. Roy et al. [14] used the phase of the motion of the silhouettes in a hierarchical way.
This feature was used in the frames in which the gait was affected by occlusion.

Real-time re-identification

The ultimate goal for a generic re-identification system is to be capable of working in real-time situations. To do so, the whole re-identification procedure must be fast enough to allow for real-time implementation. Although some attempts at real-time re-identification have been noted [35, 37, 56], many issues must still be solved before a complete real-time re-identification system can be successfully implemented.

The FSCH descriptor exploited by Xiang et al. [37] took 1.14 ms to extract from a 70 × 25 image patch. This algorithm must perform 42 additional computations compared with the simple histogram method, mainly because of the membership degree and 5D space computations involved. They also performed similarity matching between the extracted descriptors with a correlation metric, which is quite fast. Goldmann [35] used simple features such as RGB values, the colour structure descriptor, the co-occurrence matrix and intensity-based histograms to be able to implement a real-time system. However, since their classification stage was based on a supervised learning method, it needed a training phase that must be executed prior to the testing phase.

Eisenbach and Kolarow [52] adopted a person detection method based on contour cues [53] and a real-time tracking method [54] to track people and speed up their algorithm. Besides that, they extracted appearance features from the upper and lower body of any person who passed the camera while recording the video, to reduce the total computation time. An online feature selection scheme was used, in which the joint mutual information [103] estimated the dependencies between each feature and the class label of the detected person. Thus, the best features could be selected for a specific class, redundant features removed, and the re-identification performed faster. In [77], Wang et al. used integral computations to build the occurrence matrix that served as their descriptor and to speed up their algorithm; the computational complexity is then independent of the size of the rectangular domain D over which the integral computation is applied.

Satta et al. [40] reduced the matching time between the probe and the gallery images by transposing them into a dissimilarity space. The components from the same parts of the individuals in the gallery set were pooled and clustered into a number of prototypes, so that each part (torso or legs) in the gallery set had its own prototype bank. Then, the difference between each part of every individual in the gallery and the centroids of the prototypes was computed and set as the dissimilarity vector of that individual; the same was done for the query. In the last step, the nearest individual to the query was selected based on the lowest distance between dissimilarity vectors. The matching time in their experiments was less than 0.01 ms. In this method, instead of comparing complex descriptors with an elaborate measure, only the vectors were compared, using a simple distance metric (the Hausdorff distance [104]), which drastically reduced the computation cost and memory usage.
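As a rough sketch of this dissimilarity-space idea: gallery descriptors are clustered into prototypes and matching is done between short prototype-distance vectors. The use of k-means and plain Euclidean matching below is an assumption for illustration, not necessarily the clustering or metric of Satta et al. [40].

```python
# Toy dissimilarity-space matching: represent each descriptor by its
# distances to a bank of prototype centroids, then match the short vectors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
gallery = rng.random((200, 64))     # per-part descriptors of gallery people
probe = rng.random(64)              # descriptor of the query person

km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(gallery)
prototypes = km.cluster_centers_

def to_dissimilarity(x):
    """Distance of descriptor x to every prototype centroid."""
    return np.linalg.norm(prototypes - x, axis=1)

gal_vectors = np.array([to_dissimilarity(g) for g in gallery])
probe_vector = to_dissimilarity(probe)

# nearest gallery individual by a simple distance between the short vectors
best = np.argmin(np.linalg.norm(gal_vectors - probe_vector, axis=1))
print("best matching gallery index:", best)
```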
To implement a re-identification system in real-time, the background subtraction, human detection, feature extraction and matching steps must all be based on real-time algorithms. However, the methods mentioned here have only partially implemented their algorithms in real-time. As such, a completely real-time re-identification system is very much needed, and real-time re-identification remains an open issue among researchers.

Table 4 lists the most popular features and descriptors used in the state-of-the-art re-identification works mentioned in the previous sections, namely colour, texture, shape and gait features and descriptors. In this table, the features and descriptors are compared based on their effectiveness in solving re-identification issues, and a qualitative comparison between them is also provided: their robustness against varying illumination, pose and occlusion is shown. In Table 4, the acronyms 'CI' and 'PI' are short for 'completely independent' and 'partially independent', respectively; for cases in which the invariance depends on some other factor, the acronym 'DI', which stands for 'dependent invariance', is used.

The major function of pure colour features and descriptors is to make signatures that are invariant against varying illumination. As stated in Section 4.1, several methods have been used to make these signatures independent of illumination changes, but to make the signatures pose invariant they must be combined with complementary spatial and textural features. Methods that use region-based colour representations definitely outperform holistic colour representations. Table 5 presents an example of the first-rank CMC re-identification rates of the spatial covariance descriptor (SCR) proposed by Bak et al. [43] and of a normal colour histogram with and without the group context models (CRRRO and BRO) proposed by Zheng et al. [20]; the i-LIDS dataset was used for these experiments. As can be seen from the table, the colour representation without spatial information has the lowest re-identification rate. However, when this colour representation is combined with the spatial group context models, i.e. the CRRRO and BRO descriptors, the results are significantly improved. The SCR descriptor outperforms both, because the covariance matrices extracted from the overlapped regions are robust against pose variations and partial occlusions; the first derivative of the grey-level channel used in constructing the covariance feature vectors has also made it illumination invariant and improved the re-identification rate.

In addition, interest point descriptors are attractive tools for re-identification. Nevertheless, the reported results on their performance are somewhat varied: in [45, 92], SURF was reported to outperform SIFT, but in [72] SIFT and GLOH outperform SURF and CCPD. However, the datasets on which these descriptors were applied and the methods used in the segmentation and classification processes are known to have a significant effect on the final re-identification rate. In applying interest point descriptors, the key issue is to decrease the number of corresponding interest regions in order to control the computational cost.
Advanced methods based on interest point descriptors, such as visual words or attribute-based methods, are new approaches in re-identification that can also direct its future development [105, 106].

The advantage of using gait features in re-identification is that the low resolution of CCTV does not affect them, but their main disadvantages are (i) their susceptibility to occlusion, that is, the gait cannot be captured properly if the silhouette is occluded by other objects or by the silhouettes of other people, and (ii) their need for a side camera view, because most gait features can only be defined from a side view of the gait. These shortcomings, along with the high computational demand of gait features, make them less favourable in re-identification approaches. However, gait-based features can be very effective supplements to appearance-based models.
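To illustrate the kind of score-level fusion of gait and appearance cues discussed here (and evaluated by Kawai et al. [32], whose actual formulation this does not reproduce), a toy weighted fusion might look as follows; the scores and the weight alpha are invented for the example.

```python
# Illustrative score-level fusion of a gait score and a colour score;
# scores are assumed normalised to [0, 1], alpha weights the two cues.
import numpy as np

def fuse_scores(gait_scores, colour_scores, alpha=0.5):
    """Weighted score-level fusion of two normalised score arrays."""
    return alpha * np.asarray(gait_scores) + (1 - alpha) * np.asarray(colour_scores)

# toy matching scores of one probe against five gallery identities
gait = [0.9, 0.2, 0.4, 0.1, 0.3]     # reliable near side views
colour = [0.6, 0.3, 0.8, 0.2, 0.1]   # reliable under pose changes
fused = fuse_scores(gait, colour, alpha=0.5)
print("best match:", int(np.argmax(fused)))  # identity 0 wins after fusion
```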
Table 4 Summary of the various main features and descriptors used in re-identification

| Research which used the feature/descriptor | Feature/descriptor name | Type of feature/descriptor | Illumination invariance | Pose invariance | Occlusion invariance | Remarks |
|---|---|---|---|---|---|---|
| Gheissari et al. [11], Satta et al. [27], Wang and Lewandowski [102] | colour histograms | colour | DI | × | × | using HSV and LAB colour spaces makes histograms partially robust against illumination changes, but they are totally vulnerable to pose differences and occlusion |
| D'Angelo and Dugelay [63] | PCH | colour | PI | × | × | ─ |
| Xiang et al. [37] | FSCH | colour | PI | DI | × | the x dimension of the descriptor must be ignored for pose invariance |
| Annesley et al. [33], Bak et al. [41] | MPEG7 | colour | DI | PI | × | the DCD descriptor is not pose invariant, but the colour layout descriptor (CLD) is |
| Khan et al. [59] | CCPD | colour | PI | PI | PI | the polar representation of the descriptor makes it pose invariant; the type of colour space is important for illumination invariance |
| Farenzena et al. [16], Satta et al. [27] | RHSP | texture | × | CI | × | ─ |
| Zhang et al. [19], Gray and Tao [12] | Gabor and Schmid | texture | CI | CI | × | ─ |
| Zhang et al. [19], Corvee et al. [48], Hirzer et al. [62] | LBP | texture | DI | × | × | LBP is invariant to the grey-level channel |
| Goldmann [35] | co-occurrence matrices | texture | DI | CI | × | this descriptor must be applied to grey-level colour channels in order to be robust against varying illumination |
| Gheissari et al. [11] | frequency image | gait | PI | PI | × | ─ |
| Skog [10] | AEI | gait | PI | CI | × | ─ |
| Skog [10] | GEI | gait | PI | CI | × | ─ |
| Kawai et al. [32] | STHOG | gait | CI | PI | × | contains shape and motion features |
| Roy et al. [14] | PEI | gait/shape | CI | DI | × | the side view of the silhouette must be available for PEI extraction |
| Bauml and Stiefelhagen [72] | SIFT | colour, texture | PI | PI | PI | ─ |
| Khedher et al. [45], Hamdoun et al. [58] | SURF | colour, texture | PI | PI | PI | ─ |
| Bauml and Stiefelhagen [72] | GLOH | colour, texture | PI | PI | PI | ─ |
| Zhang et al. [19], Bak et al. [43, 75] | covariance matrices | colour, texture | PI | CI | PI | the covariance matrices must be extracted from overlapped regions in order to be robust against occlusion |
| Zheng et al. [20], Xiang et al. [38] | HOG | colour, texture | DI | CI | PI | the training set plays a salient role in making the descriptor robust against pose and occlusion |
| Zheng et al. [20] | CRRRO and BRO | colour, texture | DI | CI | CI | the selection of the colour feature used to construct the descriptor is important for its illumination invariance |
Hierarchical models designed to fuse appearance and gait/motion features have proven to be useful and can be applied in forthcoming research on re-identification [14, 21]. For a quantitative comparison supporting this statement, the results presented by Kawai et al. [32] are reproduced in Fig. 16, which compares the first-rank re-identification rates of their proposed STHOG gait descriptor, an HSV colour histogram as a pure colour feature, and the fusion of the two. The descriptors were applied to their own dataset with different viewing angles; the query viewing angle is near ID5, which indicates the side view. As shown in Fig. 16, for viewing angles near the query image the gait feature outperforms the colour feature (ID4, ID5 and ID6), but when the pose starts to differ, the gait feature can no longer outperform the colour feature (ID1, ID2 and ID3). However, at all viewing angles the fusion of gait and colour features produced better results than either alone, which is somewhat expected. A point to highlight is that this graph shows the performance of the descriptors for different viewing angles, which had never been reported previously.

The application of learning distance metrics to re-identification is growing intensely, and it is anticipated to become a prevalent area of research in the near future. In fact, the strategy of these metrics, which is to maximise the inter-cluster distance while minimising the intra-cluster distance, has performed well, especially on appearance-based models, which suffer from noise and variations even within the same cluster.

Table 6 is a comparative evaluation of the first ranks of the CMC curves of the different learning metrics and the Bhattacharyya distance discussed in the previous sections, for the VIPeR, i-LIDS and ETHZ datasets. These results were obtained from Zheng et al. [61, 89]: the same colour and texture features were extracted from the persons of interest and the feature vectors were then fed to the different classifiers. For training, the numbers of pedestrians used as samples were 316, 40 and 30 for VIPeR, ETHZ and i-LIDS, respectively.

As expected, and as shown in Table 6, all of the learning metrics outperformed the Bhattacharyya distance. RDC was the best learning metric, producing the best results of 15.66, 44.05 and 72.65% for the VIPeR, i-LIDS and ETHZ datasets, respectively. It is worth mentioning that RECF [95] achieves a 16.96% accuracy rate on VIPeR, which is above RDC and shows its efficiency in alleviating the over-fitting problem when there is a limited amount of training data; however, this classifier was not examined on the other existing datasets. The challenging VIPeR dataset, as expected, has the lowest re-identification rates compared with the results on the i-LIDS and ETHZ datasets. Obviously, the learning metrics perform better on the video databases ETHZ and i-LIDS than on VIPeR.
Fig. 16 Re-identification rate for the STHOG gait descriptor, the HSV colour histogram and their combination [32]

Table 6 First-rank re-identification rate of learning metrics and the Bhattacharyya coefficient for different datasets [61, 89]

| Learning metric/classifier | VIPeR | i-LIDS | ETHZ |
|---|---|---|---|
| RDC | 15.66 | 44.05 | 72.65 |
| AdaBoost | 8.16 | 33.58 | 69.21 |
| LMNN | 6.23 | 33.68 | 64.88 |
| Bhattacharyya distance | 4.65 | 31.77 | 60.97 |
Table 7 Highlighted methods which are capable of being pursued in future research on re-identification

| Re-identification stage | Highlighted method | Examples | Key points |
|---|---|---|---|
| feature extraction | 1. attribute extraction | Layne et al. [105] | allows a more meaningful representation of objects |
| feature extraction | 2. interest point detectors | Bauml and Stiefelhagen [72], Martinel and Micheloni [60] | high level of robustness against illumination and pose variations |
| feature extraction | 3. fuzzy analysis | Xiang et al. [37], D'Angelo and Dugelay [63] | ability to handle severe illumination changes |
| feature extraction | 4. spatiotemporal methods | Kawai et al. [32], Bedagkar-Gala and Shah [47] | simultaneously uses temporal, spatial, colour and sometimes gait information from the frames |
| classification | 1. distance learning metrics | Zheng et al. [89], Mignon and Jurie [96] | makes the classifier discriminative and allows it to be optimised for different features |
| classification | 2. feature selection methods | Hirzer et al. [78] | reduces the computational cost of classification; the most relevant features are selected |
Table 5 Comparative results of using holistic pure colour features against colour features combined with spatial features [43]

| Method | Colour histogram without group context (CRRRO and BRO) | Colour histogram with group context (CRRRO and BRO) | SCR |
|---|---|---|---|
| first-rank re-identification rate of CMC curve, % | 10 | 17 | 33 |
In terms of performance, the reason behind the high accuracy of RDC is that the RDC (also known as PRDC) learning metric adopts the second-order moments of the features, so that the joint effects of the features are considered, whereas the other classifiers treat each feature independently and allow no interaction between the features. In terms of the datasets used, performance on ETHZ and i-LIDS was better because of the consecutive nature of their frames, whereas the VIPeR dataset consists only of single shots.

Generally, in re-identification the models must not only be descriptive but also discriminative at the same time. Features extracted from the silhouettes may yield completely descriptive descriptors, but most of these features are not discriminative enough. Learning distance methods attempt to move the burden of discrimination from the features to the classifier; this property is rarely found in methods with conventional distance classifiers [3, 58, 63]. Finally, we also recommend and highlight several potential approaches for extracting features and performing classification in re-identification, which are summarised in Table 7.

Conclusion

In this paper, we provide a review of the existing state-of-the-art research on re-identification, covering both appearance and gait/motion descriptors, and spell out the abilities, limitations and advantages of the various available methods. We begin the paper by discussing the issues pertaining to people re-identification, which involve inter- and intra-camera issues; among them, the most serious are illumination, pose changes and scene occlusions. Next, we present the methods that have been used for person re-identification, in which different types of descriptors based on the colour, texture and gait of the silhouettes are described. The different classifiers, the existing standard datasets and the evaluation methods for re-identification are also explained.

We also highlight the challenges that need to be resolved, which mainly concern robustness against illumination, viewpoint and pose variations and scene occlusion. Although several solutions to the above-mentioned issues have been proposed and developed, the existing methods are still unable to overcome these issues completely. Truthfully, the solutions so far are not applicable in all practical scenarios, in which illumination variations, pose changes and occlusions may occur simultaneously over time; and if they do occur simultaneously, a high computational cost is incurred. As such, new and improved descriptors and models are needed so that these issues can be solved more efficiently. In this paper, we have provided both qualitative and quantitative comparisons of several re-identification methods to depict the advantages and shortcomings of the models being used. Finally, to provide insight into future research directions, we highlight the methods that are capable of being pursued in forthcoming research.

To conclude, we believe that research in re-identification will proliferate as the demand for efficient intelligent video surveillance systems increases, so as to ensure a secure and safe environment for society.

Acknowledgments

The authors would like to express their gratitude to the Malaysian government and the Universiti Kebangsaan Malaysia for providing financial assistance via grants LRGS/TD/2011/UKM/ICT/04/02 and DPP-2013-003 given for this project.

References
1 'People re-identification across a camera network'. Master's thesis in computer science, School of Computer Science and Engineering, Royal Institute of Technology, Stockholm, Sweden, 2010
2 Cong, T., Achard, C., Khoudour, L., Douadi, L.: 'Video sequences association for people re-identification across multiple non-overlapping cameras'. Proc. Int. Conf. Image Analysis and Processing (ICIAP), Vietri Sul Mare, Italy, September 2009, pp. 179–
3 'Person re-identification in multi-camera system by signature based on interest point descriptors collected on short video sequences'. Proc. IEEE Conf. Distributed Smart Cameras, CA, USA, September 2008, pp. 1–
4 'word spy', …fication.asp
5 Guermazi, R., Hammami, M., Hama, A.B.: 'Violent web images classification based on MPEG7 color descriptors'. Proc. IEEE Int. Conf. Systems Man and Cybernetics, San Antonio, USA, October 2009, pp. 3106–
6 'ViSE: visual search engine using multiple networked cameras'. Proc. 18th Int. Conf. Pattern Recognition, Hong Kong, China, August 2006, pp. 1204–
7 'A data association algorithm for people re-identification in photo sequences'. Proc. IEEE Int. Symp. Multimedia, Taichung, Taiwan, December 2010, pp. 318–
8 'Tracking multiple people with a multi-camera system'. Proc. IEEE Workshop on Multi-Object Tracking, Vancouver, Canada, July 2001, pp. 19–26
9 Lantagne, M., Parizeau, M., Bergevin, R.: 'VIP: vision tool for comparing images of people'. Proc. 16th IEEE Conf. Vision Interface, Halifax, Canada, June 2003, pp. 35–
10 'Gait-based re-identification of people in urban surveillance video'. Master's thesis, Uppsala University, Department of Information Technology, 2010
11 Gheissari, N., Sebastian, T.B., Tu, P.H., Rittscher, J., Hartley, R.: 'Person re-identification using spatiotemporal appearance'. Proc. Conf. Computer Vision and Pattern Recognition (CVPR), New York, USA, June 2006, pp. 1528–
12 'Viewpoint invariant pedestrian recognition with an ensemble of localized features'. Proc. 10th European Conf. Computer Vision (ECCV), Marseille, France, October 2008, pp. 262–
13 'Learning discriminative appearance-based models using partial least squares'. Proc. XXII Brazilian Symp. Graphics and Image Processing (SIBGRAPI), Rio de Janeiro, Brazil, October 2009, pp. 322–
14 'A hierarchical method combining gait and phase of motion with spatiotemporal model for person re-identification', Pattern Recognit. Lett., 2012, (14), pp. 1891–
15 'Silhouette analysis-based gait recognition for human identification', IEEE Trans. Pattern Anal. Mach. Intell., 2003, (12), pp. 1505–
16 'Person re-identification by symmetry-driven accumulation of local features'. Proc. Int. Conf. Computer Vision and Pattern Recognition (CVPR), San Francisco, USA, June 2010, pp. 2360–
17 'Maximally stable color regions for recognition and matching'. Proc. Conf. Computer Vision and Pattern Recognition (CVPR), MN, USA, June 2007, pp. 1–8
18 Tuzel, O., Porikli, F., Meer, P.: 'Region covariance: a fast descriptor for detection and classification'. Proc. Ninth European Conf. Computer Vision, Graz, Austria, May 2006, pp. 589–
19 'Gabor-LBP based region covariance descriptor for person re-identification'. Proc. Sixth Int. Conf. Image and Graphics (ICIG), Anhui, China, August 2011, pp. 368–
20 'Associating groups of people'. Proc. British Machine Vision Conf. (BMVC), London, UK, September 2009, pp. 2360–
21 'Appearance-based person re-identification in camera networks: problem overview and current approaches', J. Ambient Intell. Humanized Comput., 2011, (2), pp. 127–
22 'Person re-identification in crowd', Pattern Recognit. Lett., 2012, (14), pp. 1828–
23 Saghafi, M.A., Hussain, A., Saad, H., Tahir, N., Zaman, H.B., Hannan, M.: 'Appearance-based methods in re-identification: a brief review'. Proc. IEEE Eighth Int. Colloquium on Signal Processing and its Applications, Melaka, Malaysia, March 2012, pp. 404–
24 D'Orazio, T., Cicirelli, G.: 'People re-identification and tracking from multiple cameras: a review'. Proc. 19th IEEE Int. Conf. Image Processing (ICIP), FL, USA, October 2012, pp. 1601–
25 'Multi-pose face recognition for person retrieval in camera networks'. Proc. Seventh IEEE Int. Conf. Advanced Video and Signal Based Surveillance, Boston, USA, September 2010, pp. 441–
26 'Adaptive color transformation for person re-identification in camera networks'. Proc. IEEE Int. Conf. Distributed Smart Cameras (ICDSC), Atlanta, USA, September 2010, pp. 199–
27 'A multiple component matching framework for person re-identification'. Proc. 16th Int. Conf. Image Analysis and Processing (ICIAP), Ravenna, Italy, September 2011, pp. 140–
28 'A particle filter approach for multi-camera tracking systems in a large view space', Int. J. Innov. Comput. Inf. Control, 2010, (6), pp. 2827–
29 'Multiple-person tracking devoted to distributed multi smart camera networks'. Proc. IEEE Int. Conf. Intelligent Robots and Systems, Taipei, Taiwan, October 2010, pp. 2469–
30 'An efficient color representation for image retrieval', IEEE Trans. Image Process., 2001, (1), pp. 140–
31 'A novel content based image retrieval method based on splitting the image into homogeneous regions', Int. J. Innov. Comput. Inf. Control, 2010, (9), pp. 4029–
32 'Person re-identification using view-dependent score-level fusion of gait and color features'. Proc. 21st Int. Conf. Pattern Recognition (ICPR), Tsukuba Science City, Japan, 2012, pp. 2694–
33 'Evaluation of MPEG7 color descriptors for visual surveillance retrieval'. Proc. Second Joint IEEE Int. Workshop on VS-PETS, Beijing, China, October 2005, pp. 105–
34 'Adaptive background mixture models for real-time tracking'. Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, 1999, vol. 2, pp. 246–
35 'Appearance-based person recognition for surveillance applications' (Technical University of Berlin, Communication Systems Group, 2006)
36 Huang, C., Wu, Y., Shih, M.: 'Unsupervised pedestrian re-identification'. Advances in Image and Video Technology, Springer Berlin Heidelberg, 2009, pp. 771–
37 'Person re-identification by fuzzy space color histogram'. Multimedia Tools and Applications, 2012, pp. 1–
38 'Active learning for person re-identification'. Proc. Int. Conf. Machine Learning and Cybernetics (ICMLC), Xian, China, July 2012, pp. 336–
39 'Stel component analysis: modeling spatial correlations in image class structure'. Proc. Conf.
Computer Vision and Pattern Recognition (CVPR), Miami, USA, June 2009, pp. 2044–
40 'Exploiting dissimilarity representations for re-identification', Lect. Notes Comput. Sci., 2011, pp. 275–
41 'Person re-identification using haar-based and DCD-based signature'. Proc. Seventh IEEE Int. Conf. Advanced Video and Signal Based Surveillance, Boston, USA, September 2010, pp. 1–8
42 Dalal, N., Triggs, B.: 'Histograms of oriented gradients for human detection'. Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR), CA, USA, June 2005, pp. 886–
43 'Person re-identification using spatial covariance regions of human body parts'. Proc. Seventh IEEE Int. Conf. Advanced Video and Signal Based Surveillance, Boston, USA, September 2010, pp. 435–
44 'Recognizing people in non-intersecting camera views'. Proc. Third Int. Conf. Imaging for Crime Detection and Prevention (ICDP), London, UK, December 2009, pp. 44–
45 Khedher, M.I., El-Yacoubi, M.A., Dorizzi, B.: 'Probabilistic matching pair selection for SURF-based person re-identification'. Proc. Int. Conf. Biometrics Special Interest Group (BIOSIG), Darmstadt, Germany, September 2012, pp. 1–6
46 Simonnet, D., Lewandowski, M.: 'Re-identification of pedestrians in crowds using dynamic time warping'. Computer Vision ECCV Workshops and Demonstrations, Springer Berlin Heidelberg, 2012, pp. 423–
47 'Part-based spatio-temporal model for multi-person re-identification', Pattern Recognit. Lett., 2012, (14), pp. 1908–
48 'People detection and re-identification for multi surveillance cameras'. Proc. VISAPP Int. Conf. Computer Vision Theory and Applications, Rome, Italy, February 2012, pp. 82–
49 'Color constancy using shadow-based illumination maps for appearance-based person re-identification'. Proc. IEEE Ninth Int. Conf. Advanced Video and Signal-based Surveillance, Beijing, China, September 2012, pp. 197–
50 'A statistical approach for real-time robust background subtraction and shadow detection'. Proc. IEEE Int. Conf. Computer Vision (ICCV), Kerkyra, Greece, September 1999, pp. 1–
51 'Who is who at different cameras: people re-identification using depth cameras', IET Comput. Vision, 2012, (5), pp. 378–
52 'View invariant appearance-based person re-identification using fast online feature selection and score level fusion'. Proc. IEEE Ninth Int. Conf. Advanced Video and Signal-based Surveillance (AVSS), Beijing, China, September 2012, pp. 184–
53 'Real-time human detection using contour cues'. Proc. IEEE Int. Conf. Robotics and Automation (ICRA), Shanghai, China, May 2011, pp. 860–
54 'Color-based tracking of heads and other mobile objects at video frame rates'. Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, San Juan, USA, June 1997, pp. 21–
55 'People re-identification across multiple non-overlapping cameras system by appearance classification and silhouette part segmentation'. Proc. Eighth IEEE Int. Conf. Advanced Video and Signal Based Surveillance (AVSS), Klagenfurt, Austria, September 2011, pp. 303–
56 'KNIGHT™: a real time surveillance system for multiple and non-overlapping cameras'. Proc. Int. Conf. Multimedia and Expo (ICME)
57 'Gait recognition for human identification based on ICA and fuzzy SVM through multiple views fusion', Pattern Recognit. Lett., 2007, (16), pp. 2401–
58 'Interest points harvesting in video sequences for efficient person identification'. Proc. Eighth Int. Workshop on Visual Surveillance, Marseille, France, October 2008
59 Khan, A., Zhang, J., Wang, Y.: 'Appearance-based re-identification of people in video'. Proc. Int. Conf. Digital Image Computing: Techniques and Applications (DICTA)
60 'Re-identify people in wide area camera network'. Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops, RI, USA, June 2012, pp. 31–
61 'Person re-identification by probabilistic relative distance comparison'. Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), CO, USA, June 2011, pp. 649–
62 'Person re-identification by efficient impostor-based metric learning'. Proc. Ninth Int. Conf. Advanced Video and Signal-based Surveillance, Beijing, China, September 2012, pp. 203–
63 D'Angelo, A., Dugelay, J.L.: 'People re-identification in camera networks based on probabilistic color histograms'. Proc. Electronic Imaging Conf. 3D Image Processing (3DIP) and Applications, CA, USA, January 2011, pp. 23–
64 'Shape matching and object recognition using shape contexts', IEEE Trans. Pattern Anal. Mach. Intell., 2002, (4), pp. 509–
65 'Indexing via color histograms'. Active Perception and Robot Vision, 1992, pp. 261–
66 'On a measure of divergence between two statistical populations defined by their probability distribution', Bull. Calcutta Math. Soc., 1943, pp. 99–
67 'Image retrieval based on MPEG-7 dominant color descriptor'. Proc. IEEE Ninth Int. Conf. Young Computer Scientists, Hunan, China, November 2008, pp. 753–
68 'Dominant color structure descriptor for image retrieval'. Proc. Int. Conf. Image Processing, San Antonio, USA, October 2007, vol. 4, pp. 365–
69 'A comparison of affine region detectors', Int. J. Comput. Vis., 2005, (1–2)
70 'Distinctive image features from scale-invariant key-points', Int. J. Comput. Vis., 2004, (2), pp. 91–
71 'Performance evaluation of local descriptors', IEEE Trans. Pattern Anal. Mach. Intell., 2005, (10), pp. 1615–
72 'Evaluation of local features for person re-identification in image sequences'. Proc. Eighth IEEE Int. Conf. Advanced Video and Signal based Surveillance (AVSS), Klagenfurt, Austria, September 2011, pp. 291–
73 'SURF: speeded up robust features'. Proc. Ninth European Conf. Computer Vision (ECCV), LNCS
74 'People re-identification in a camera network'. Proc. Eighth Int. Conf. Dependable, Autonomic and Secure Computing (DASC), Chengdu, China, December 2009, pp. 461–
75 'Multiple-shot human re-identification by mean Riemannian covariance grid'. Proc. IEEE Int. Conf. Advanced Video and Signal-based Surveillance, Klagenfurt, Austria, September 2011, pp. 179–
76 'Appearance-based re-identification of humans in low-resolution videos using means of covariance descriptors'. Proc. Ninth Int. Conf. Advanced Video and Signal-based Surveillance, Beijing, China, September 2012, pp. 191–
77 'Shape and appearance context modeling'. Proc. IEEE 11th Int. Conf. Computer Vision, Rio de Janeiro, Brazil, October 2007, pp. 1–8
78 Hirzer, M., Beznai, C., Roth, P.M., Bischof, H.: 'Person re-identification by descriptive and discriminative classification'. Proc. 17th Scandinavian Conf. Image Analysis, Ystad Saltsjobad, Sweden, May 2011, pp. 91–
79 'The pyramid match kernel: discriminative classification with sets of image features'. Proc. 10th Int. Conf. Computer Vision (ICCV), Beijing, China, October 2005, pp. 1458–
80 'Gabor filters as texture discriminator', Biol. Cybern., 1989, (2), pp. 103–
81 'Constructing models for content-based image retrieval'. Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, HI, USA, December 2001, pp. 11–
82 'Visual perception of biological motion and a model for its analysis', Percept. Psychophys., 1973, (2), pp. 201–
83 'Active energy image plus 2DLPP for gait recognition', Signal Process., 2010, (7), pp. 2295–
84 'Gait volume spatio-temporal analysis of walking'. Proc. Fifth Workshop on Omni Directional Vision, Camera Networks and Non-Classical Cameras, Prague, Czech Republic, May 2004, pp. 79–
85 'Gait recognition using image self-similarity', EURASIP J. Appl. Signal Process., 2004, (1900), pp. 572–
86 'A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models' (International Computer Science Institute, 1998), pp. 1–13
87 Felzenszwalb, P.F.: 'Representation and detection of deformable shapes', IEEE Trans. Pattern Anal. Mach. Intell., 2005, (2), pp. 208–
88 'Background cut'. Proc. Ninth European Conf. Computer Vision (ECCV), Graz, Austria, May 2006, pp. 628–
89 'Re-identification by relative distance comparison', IEEE Trans. Pattern Anal. Mach. Intell., 2013, (3), pp. 653–
90 'Person re-identification by support vector ranking'. Proc. British Machine Vision Conf. (BMVC), Aberystwyth, UK, September 2010, vol. 2, no. 5, pp. 21.1–
91 'Unsupervised salience learning for person re-identification'. Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)
92 'People re-identification by spectral classification of silhouettes', J. Signal Process., 2010, (8), pp. 2362–
93 'Active learning methods for interactive image retrieval', IEEE Trans. Image Process., 2008, (7), pp. 1200–
94 'Random forests', Mach. Learn., 2001, (1), pp. 5–
95 'Evaluation of color spaces for person re-identification'. Proc. Int. Conf. Pattern Recognition (ICPR), Tsukuba Science City, Japan, November 2012, pp. 1371–
96 'PCCA: a new approach for distance learning from sparse pairwise constraints'. Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), June 2012, pp. 2666–
97 'i-LIDS multiple camera tracking scenario definition', …ce.gov.uk/science-research/hosdb/i-lids/, 2008
98 Ess, A., Leibe, B., Gool, L.V.: 'Depth and appearance for mobile scene analysis'. Proc. IEEE 11th Int. Conf. Computer Vision, Rio De Janeiro, Brazil, October 2007, pp. 1–
99 …fique, K., Rasheed, Z., Shah, M.: 'Modeling inter-camera space-time and appearance relationships for tracking across non-overlapping views', Comput. Vis. Image Underst., 2008, pp. 146–
100 'People re-identification by means of a camera network using a graph-based approach'. Proc. Conf. Machine Vision Applications, Minato, Japan, May 2009, pp. 152–
102 'Re-identification of pedestrians with variable occlusion and scale'. Proc. IEEE Int. Conf. Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, November 2011, pp. 1876–
103 'Feature selection based on joint mutual information'. Proc. Int. ICSC Symp. Advances in Intelligent Data Analysis, NY, USA, June 1999, pp. 22–
104 'Solving the multiple-instance problem: a lazy learning approach'. Proc. Int. Conf. Machine Learning, CA, USA, February 2000, pp. 1119–
105 'Person re-identification by attributes'. Proc. British Machine Vision Conf. (BMVC), 2012, vol. 2, p. 3
106 Vaquero, D., Feris, R., Tran, D., Brown, L., Hampapur, A., Turk, M.: 'Attribute-based people search in surveillance environments'. Proc. Workshop on Applications of Computer Vision (WACV), UT, USA, December 2009, pp. 1–