LOAD: Local Orientation Adaptive Descriptor for Texture and Material Classification
Xianbiao Qi a,b, Guoying Zhao a, Linlin Shen b, Qingquan Li b, Matti Pietikäinen a

a Center for Machine Vision Research, University of Oulu, PO Box 4500, FIN-90014, Finland. E-mails: [email protected], [email protected], [email protected]
b Shenzhen University, Shenzhen 518000, China. E-mails: [email protected], [email protected]
Abstract
In this paper, we propose a novel local feature, called Local Orientation Adaptive Descriptor (LOAD), to capture regional texture in an image. In LOAD, we propose to define the point description on an Adaptive Coordinate System (ACS), adopt a binary sequence descriptor to capture the relationships between one point and its neighbors, and use a multi-scale strategy to enhance the discriminative power of the descriptor. The proposed LOAD enjoys not only discriminative power to capture texture information, but also strong robustness to illumination variation and image rotation. Extensive experiments on benchmark data sets for texture classification and real-world material recognition show that the proposed LOAD yields state-of-the-art performance. It is worth mentioning that we achieve a 65.4% classification accuracy on the Flickr Material Database using a single feature, which is, to the best of our knowledge, the highest record by far. Moreover, by combining LOAD with the feature extracted by Convolutional Neural Networks (CNN), we obtain significantly better performance than either the LOAD or the CNN alone. This result confirms that the LOAD is complementary to learning-based features.
Keywords:
Local Orientation Adaptive Descriptor, Texture Classification, Material Recognition, Improved Fisher Vector, Convolutional Neural Network
1. Introduction
Visual image classification [31, 32, 18, 29, 9, 14] is a challenging problem in computer vision, especially under multiple sources of image transformation, e.g., rotation, illumination, affine and scale variations. The Bag-of-Words (BoW) [5] model, as a powerful intermediate image representation, has been the most popular approach to visual categorization over the past ten years. In the BoW model, low-level feature extraction and mid-level feature encoding are the two most important problems. In the past few years, some advanced mid-level feature encoding approaches have been proposed, such as Locality-constrained Linear Coding (LLC) [35], the Vector of Locally Aggregated Descriptors (VLAD) [13] and the Improved Fisher Vector (IFV) [25]. These encoding methods have greatly advanced the BoW approach. On the other side, however, progress in low-level feature extraction has been slow.

Earlier works on texture description mainly focused on capturing either global texture information (e.g., GIST [24], Gabor) or fine texture micro-structure (e.g., the MR8 filter bank [32], the Local Binary Pattern (LBP) [23]). Global texture descriptors capture global texture information well but miss most texture details. For instance, GIST is good at capturing the spatial layout of a scene, but performs poorly on simple texture classification tasks in which micro-structures are important. Fine texture descriptors defined on very small patches capture small texture structures well, but ignore global texture information. For example, the LBP and MR8 perform well on some simple texture data sets, but work poorly on complex material data sets in which regional texture information is important. Some works have tried to bridge the gap between these two types of features. However, as we will discuss later, these features may suffer from limitations such as sensitivity to image transformations or limited discriminative power.

This paper aims to provide a powerful regional texture descriptor. To this end, we propose a novel Local Orientation Adaptive Descriptor (LOAD). The proposed descriptor has two important advantages. (i) Strong regional texture discrimination, which comes from two aspects: firstly, at a single point, we adopt a binary sequence description that has stronger discriminative power than the gradient orientation (e.g., SIFT [21], MROGH [7]) and the local intensity order (e.g., LIOP [36]); secondly, to further enhance the discriminative power of the descriptor, we use a multi-scale description to capture multi-scale texture information. (ii) Robustness to image rotation and illumination variation: because the LOAD is defined on an Adaptive Coordinate System, it is robust to image rotation; meanwhile, the binary sequence description used in the LOAD endows the feature with strong robustness to illumination variation.

Our first contribution in this paper is to propose a novel and discriminative texture descriptor, LOAD, and to demonstrate its effectiveness on two applications: texture classification and real-world material classification. On the traditional texture data sets [22, 15], the LOAD almost saturates the classification performance.
On the real-world Flickr Material Database (FMD) [18], the LOAD achieves 65.4%, which is the best result for a single feature as far as we know.

Our second contribution is that we build a new real-world material data set from the newly introduced ETHZ Synthesizability data set. We name the new data set OULU-ETHZ. We evaluate and compare the LOAD with the LBP, PRICoLBP and CNN on the OULU-ETHZ. Experiments show that our LOAD achieves promising performance on the new data set.

Our third contribution is that we experimentally demonstrate that the proposed LOAD is strongly complementary to learning-based features such as Convolutional Neural Networks (CNN) [16, 14]. On the Flickr Material Database [18], our LOAD combined with the CNN achieves 72.5%, which significantly outperforms the CNN (61.2%) and the LOAD (65.4%). On the OULU-ETHZ data set, the combination of the LOAD and the CNN improves on the CNN by around 6.0%.

We believe the strong complementary information arises because the IFV representation with LOAD and the CNN belong to two different approaches: non-structured and structured methods. The former is robust to image rotation and translation, but does not capture structured information well. In contrast, the latter is good at capturing structured information because its hierarchical max-pooling strategy preserves the structured information, but it is not robust to heavy image rotation and translation.
2. Related Work
Since the proposed descriptor is partially inspired by the Local Binary Pattern (LBP) [23], we will give a brief introduction to the LBP.
LBP is an effective gray-scale texture operator. Each LBP pattern corresponds to a kind of local structure in natural images, such as a flat region, edge or contour. For a pixel (x_c, y_c) in an image I, its LBP code is computed by thresholding the pixel values of its neighbors with the pixel value of the central point (x_c, y_c):

\mathrm{LBP}_{P,R}(x_c, y_c) = \sum_{p=0}^{P-1} \mathrm{sign}(g_p - g_c)\, 2^p, \quad \mathrm{sign}(t) = \begin{cases} 1, & t \geq 0, \\ 0, & t < 0, \end{cases}   (1)

where P is the number of neighbors and R is the radius. g_c = I(x_c, y_c) is the gray value of the central pixel (x_c, y_c), and g_p = I(x_p, y_p) is the value of its p-th neighbor (x_p, y_p).

Ojala et al. [23] also pointed out that the patterns with at most two bitwise transitions describe the fundamental properties of the image, and they called these patterns "uniform patterns". The number of bitwise transitions can be calculated as follows:

\Phi(\mathrm{LBP}_{P,R}(x_c, y_c)) = \sum_{p=1}^{P} \left| \mathrm{sign}(g_p - g_c) - \mathrm{sign}(g_{p-1} - g_c) \right|,   (2)

where g_P equals g_0. The uniform patterns are defined as those with \Phi(\mathrm{LBP}_{P,R}) \leq 2. For instance, "11000011" and "00001110" are two uniform patterns, while "00100100" and "01001110" are non-uniform patterns.

Figure 1: Sample patches under different image rotations.

The LBP with P = 8 has 2^8 = 256 patterns, of which 58 are uniform and 198 are non-uniform. According to the statistics in [23], although the number of uniform patterns is significantly smaller than that of non-uniform patterns, uniform patterns account for 80%-90% of all patterns. Thus, instead of the original 256-pattern LBP, the uniform LBP is widely used in many applications such as face recognition.
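To make the operator concrete, the following minimal Python sketch computes the LBP code of Eq. 1 and the uniformity measure of Eq. 2 for one pixel. It assumes interior pixels of a gray-scale image stored as a 2-D array, and the function names (lbp_pattern, bilinear, is_uniform) are illustrative rather than taken from the paper.

import numpy as np

def bilinear(img, x, y):
    """Bilinear interpolation of a gray-scale image at (x, y); interior points assumed."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1] +
            (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def lbp_pattern(img, xc, yc, P=8, R=1.0):
    """Eq. 1: threshold the P circular neighbors against the center pixel."""
    gc = img[yc, xc]
    code = 0
    for p in range(P):
        gp = bilinear(img, xc + R * np.cos(2 * np.pi * p / P),
                      yc - R * np.sin(2 * np.pi * p / P))
        code |= (1 if gp >= gc else 0) << p
    return code

def is_uniform(code, P=8):
    """Eq. 2: at most two 0/1 transitions around the circular bit string."""
    bits = [(code >> p) & 1 for p in range(P)]
    return sum(bits[p] != bits[p - 1] for p in range(P)) <= 2  # bits[-1] wraps around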
3. Local Orientation Adaptive Descriptor
Our goal is to design a discriminative texture descriptor that has the following two properties:

• Regional texture discrimination:
Most descriptors, such as SIFT and HOG, are designed for image matching or human detection, not especially for texture description; thus their texture discrimination may be limited. Although effective texture descriptors exist in the literature, such as GIST, LBP and the Completed LBP (CLBP), most of them are constructed for global or fine texture description, and thus ignore regional texture information. In this work, we focus on designing a discriminative regional texture descriptor.

Figure 2: Illustration of the Local Orientation Adaptive Descriptor. The point O is the central point of the patch. The pattern for point A is "00001111", and the pattern for point B is "00000110".

• Robust to image transformations:
Natural images contain rich image transformations, of which rotation and illumination variation are the two most common. Thus, these two aspects should be carefully considered when designing a feature.

In what follows, we describe the LOAD descriptor in detail. In Section 3.1, we describe the description strategy for each point under an adaptive coordinate system. In Section 3.2, we introduce a multi-scale description strategy used to enhance the discriminative power of the descriptor. We then describe the histogram construction and normalization approaches in Section 3.3. Finally, in Section 3.4, we discuss the relationship between the LOAD and some existing features.
Given similar patches under different image rotations, as shown in Figure 1, our objective is to extract a descriptor that is discriminative and transformation invariant. To achieve rotation invariance, traditional methods (e.g., SIFT) first estimate a reference orientation (also called the main orientation), and then align the patch to the reference orientation. However, estimating the reference orientation significantly increases the computational cost of the descriptor. Meanwhile, as indicated by [7], the descriptor is sensitive to the error introduced by the orientation estimation.

As a circular patch is symmetric with respect to any line across its central point, we choose to sample a circular region around each point. Given a sampled point O, we can obtain a circular patch around the point O. By rotating the patch around the central point O, we can obtain a patch with an arbitrary angle, as shown in Figure 1. For any point A in the patch, an Adaptive Coordinate System (ACS) can be formed by the point A and the reference point O, as shown in Figure 2. Under the ACS, the neighboring relationship between point A and its neighbors is invariant to image rotation. That is, as shown in Figure 2, the positions of point A's neighbors are always fixed relative to point A. Thus, the pixel values of A's neighbors are also invariant to image rotation.

Under the ACS, any point in the patch can be encoded in a rotation-invariant way. In this paper, we propose a novel Local Orientation Adaptive Descriptor (LOAD) that is built on the ACS. As illustrated in Figure 2, the LOAD pattern for the point A is encoded as follows:

\mathrm{LOAD}_{P,R}(x_A, y_A, \theta_A) = \sum_{p=0}^{P-1} \mathrm{sign}(V(A_p) - V(A))\, 2^p,   (3)

\begin{cases} x_{A_p} = x_A + R\cos(2\pi p/P - \theta_A), \\ y_{A_p} = y_A - R\sin(2\pi p/P - \theta_A), \end{cases}

where P is the number of neighbors, R is the radius, (x_A, y_A) and (x_{A_p}, y_{A_p}) are the positions of the central point A and its p-th neighbor under the ACS, V(A) and V(A_p) denote the pixel values of the points (x_A, y_A) and (x_{A_p}, y_{A_p}) respectively, \mathrm{sign}(\cdot) is the sign function, and \theta_A = \arctan\frac{y_A - y_O}{x_A - x_O}.

In the same way, under the ACS, the adaptive gradient magnitude for the point A is computed as:

M(A) = \sqrt{(V(A_0) - V(A_4))^2 + (V(A_2) - V(A_6))^2},   (4)

where M(A) is computed with R = 1.

The encoding approach of Eq. 3 has two advantageous properties. (i) Rotation invariance: under the ACS, the neighboring relationships between one point and its neighbors are fixed. As shown in Figure 2, the same start point A_0 will always be selected for the point A. Thus, the LOAD encoding is rotation invariant. (ii) Robustness to illumination variation: using the binary sequence description, the LOAD is also robust to illumination variation, because illumination variation usually does not change the binary comparison relationship between two adjacent pixels.

According to Eq. 3, when P is set to 8, the LOAD has 2^8 = 256 patterns, which may be too many for a local descriptor. Motivated by the "uniform" encoding in LBP [23], we also adopt the "uniform" strategy in the LOAD. Thus, the dimension of the LOAD is 59.

Figure 3: Multi-scale Local Orientation Adaptive Descriptor. The pattern for the inner scale is "10001111", and the pattern for the outer scale is "10000011".

Multiresolution analysis, also called multi-scale analysis, is an effective way to depict texture information at different scales. The multi-scale strategy is widely used in the LBP [23] and its variants [10, 11, 38].
As pointed out by previous works, the multi-scale description performs significantly better than the single-scale one. The multi-scale version of the LOAD is defined as follows:

\mathrm{LOAD}_{P,R}(x_A, y_A, \theta_A, s) = \sum_{p=0}^{P-1} \mathrm{sign}(V(A_p) - V(A))\, 2^p,   (5)

\begin{cases} x_{A_p} = x_A + s \times R\cos(2\pi p/P - \theta_A), \\ y_{A_p} = y_A - s \times R\sin(2\pi p/P - \theta_A), \end{cases}

where s is a scale factor. Compared with Eq. 3, Eq. 5 introduces a scale factor. By choosing different scale factors, we can obtain LOAD patterns at different scales. Figure 3 shows the LOAD with two scales. In practice, we can choose 2, 3 or 4 scales. As shown in Figure 3, the binary sequence for the inner scale is "10001111", and the binary sequence for the outer scale is "10000011". If the patterns of the inner and outer scales are similar, it may indicate that the structure around this point is consistent, and vice versa.
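The following sketch illustrates the point-level encoding of Eqs. 3 and 5 and the adaptive gradient magnitude of Eq. 4; it reuses the bilinear() helper from the LBP sketch above. Note that np.arctan2 stands in for the arctan of the text (it resolves the quadrant), and the pairing of opposite neighbors (A_0, A_4) and (A_2, A_6) in Eq. 4 is our assumption for P = 8.

import numpy as np

def neighbor_value(img, A, theta_A, p, P=8, R=1.0, s=1):
    """Value of the p-th neighbor of A on the rotated circle of Eqs. 3 and 5."""
    x_p = A[0] + s * R * np.cos(2 * np.pi * p / P - theta_A)
    y_p = A[1] - s * R * np.sin(2 * np.pi * p / P - theta_A)
    return bilinear(img, x_p, y_p)  # bilinear() as in the LBP sketch above

def load_pattern(img, A, O, P=8, R=1.0, s=1):
    """LOAD_{P,R}(x_A, y_A, theta_A, s): binary code of point A at scale s."""
    theta_A = np.arctan2(A[1] - O[1], A[0] - O[0])  # orientation of A in the ACS
    v_A = bilinear(img, A[0], A[1])
    return sum((1 if neighbor_value(img, A, theta_A, p, P, R, s) >= v_A else 0) << p
               for p in range(P))

def adaptive_magnitude(img, A, O, P=8):
    """Eq. 4 (R = 1): orthogonal differences of opposite neighbors in the ACS."""
    theta_A = np.arctan2(A[1] - O[1], A[0] - O[0])
    v = [neighbor_value(img, A, theta_A, p, P, R=1.0) for p in range(P)]
    return np.hypot(v[0] - v[4], v[2] - v[6])  # assumed neighbor pairing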
Calculation of LOAD feature
Input:
One reference point O and a circular patch P around O ; Output:
LOAD histogram feature H Initiate a 2-D histogram H with zeros, the size of H isset as × S ; for all O i ∈ P do Compute the gradient orientation M ( O i ) of the point O i as Eq. 4, for each s ∈ [1 , S ] do Calculate the uniform LOAD pattern with g as shown in Figure 2 as start point, denote it as U s ( O i ) , Accumulate the histogram H, H ( U s ( O i ) , s ) = H ( U s ( O i ) , s ) + M ( O i ) , end for end for Resize the histogram H into 1-D vector and normalizeit with square root norm,
Return H.
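A minimal sketch of Algorithm 1, assuming the load_pattern and adaptive_magnitude helpers above; the L1 normalization before the square root follows the RootSIFT convention and is our assumption:

import numpy as np

def build_uniform_index(P=8):
    """Map each 8-bit code to one of 58 uniform bins; non-uniform codes share bin 58."""
    table, nxt = {}, 0
    for code in range(2 ** P):
        bits = [(code >> p) & 1 for p in range(P)]
        if sum(bits[p] != bits[p - 1] for p in range(P)) <= 2:
            table[code], nxt = nxt, nxt + 1
        else:
            table[code] = 58
    return table

UNIFORM_INDEX = build_uniform_index()

def load_histogram(img, patch_points, O, S=4, P=8, R=1.0):
    """Algorithm 1: accumulate the 59 x S magnitude-weighted LOAD histogram."""
    H = np.zeros((59, S))
    for A in patch_points:                            # every point O_i in the patch
        m = adaptive_magnitude(img, A, O, P)          # Eq. 4
        for s in range(1, S + 1):
            code = load_pattern(img, A, O, P, R, s)   # Eqs. 3 and 5
            H[UNIFORM_INDEX[code], s - 1] += m        # Eq. 6
    h = H.ravel()
    h /= h.sum() + 1e-12                              # L1 normalize, then square root
    return np.sqrt(h)                                 # (RootSIFT-style normalization)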
Given a circular patch with the point O as the central point, suppose that the patch has K points and that we use S scales. The dimension for each scale is 59; thus, the final feature dimension is 59 × S. We initialize a 2-D histogram H with zeros. Then, for each point O_i, i ∈ [1, K], we accumulate the histogram H as follows:

H(U_s(O_i), s) = H(U_s(O_i), s) + M(O_i),   (6)

where s ∈ [1, S], M(O_i) is the gradient magnitude of the point O_i under the ACS, computed according to Eq. 4, and U_s(O_i) is the "uniform" pattern of the LOAD feature of the point O_i at scale s. After accumulating all K points of the patch into the histogram H, we resize the histogram into a 1-D vector.

Feature normalization is an important step for both feature description (e.g., RootSIFT [1]) and image representation [34, 25]. In this paper, we follow the operator in RootSIFT and apply a square root operation to our LOAD. Previous works [1, 34] have shown that the square root normalization performs better than L2 normalization. For clarity, the algorithm for calculating the LOAD feature is summarized in Algorithm 1, in which S is the number of scales and U_s(O_i) is the uniform pattern representation of the LOAD feature of the point O_i at scale s.

Our LOAD feature is related to some existing features in the literature. The first category of related features consists of the LBP-based methods, e.g., LBP [23] and CLBP [10]. Another set of related features consists of the local intensity order based methods, including MROGH [7] and LIOP [36]. Different from the LBP-based methods, our LOAD has the following two properties:

• Regional texture discrimination: Our LOAD is a patch-based feature, whereas the LBP-based methods, e.g., LBP and CLBP, were designed to depict micro-structures. Image representation based on the LBP computes a histogram of patterns, while image representation with the LOAD uses the BoW model.

• Trade-off between rotation invariance and discriminative power: Our LOAD descriptor for each point is built on the ACS. Thus, the LOAD not only achieves good robustness to image rotation but also has strong discriminative power. In contrast, the LBP-based methods achieve rotation invariance at the cost of discriminative power.

Different from LIOP and MROGH, our LOAD has the following two properties:

• Richer patterns: The LOAD adopts a binary pattern description. Using the binary pattern descriptor, the LOAD has richer patterns at a single point than LIOP (16 patterns) and MROGH (8 patterns).
• Robustness to region division: The LOAD does not employ region division. Intensity order based region division [7, 36] may be sensitive to non-monotonic illumination variation. Meanwhile, region division greatly increases the feature dimension.
4. Encoding
The Improved Fisher Vector (IFV) [25] encoding was proposed to address the information loss in the feature encoding process of the traditional BoW model. Within the IFV framework, an image is represented by encoding densely sampled local descriptors. Principal Component Analysis (PCA) is first used to remove the correlation between arbitrary pairs of dimensions; we keep D components. Then, a Gaussian Mixture Model (GMM) is estimated to build the visual words for the after-PCA local descriptors. The IFV measures the normalized deviations of the local descriptors w.r.t. the GMM parameters. More specifically, let I = {x_t, t = 1, ..., T} be the set of D-dimensional after-PCA local descriptors extracted from an image. Denote the set of parameters of a K-component GMM by λ = {π_k, μ_k, Σ_k, k = 1, ..., K}, where π_k, μ_k and Σ_k are the prior, mean vector and covariance matrix of the k-th component, respectively. Given x_t with a soft assignment λ_{tk} to each of the K components, the IFV encoding of I is defined as follows:

\Phi(I) = \frac{1}{T} \sum_{t=1}^{T} \phi(x_t),   (7)

with

\phi(x_t) = [\phi_1(x_t), \cdots, \phi_K(x_t)],   (8)

where

\phi_k(x_t) = \left[ \frac{\lambda_{tk}}{\sqrt{\pi_k}} \frac{x_t - \mu_k}{\sigma_k},\; \frac{\lambda_{tk}}{\sqrt{2\pi_k}} \left( \frac{(x_t - \mu_k)^2}{\sigma_k^2} - 1 \right) \right]^{\top} \in \mathbb{R}^{2D}, \quad k = 1, \cdots, K,   (9)

and \sigma_k^2 is the diagonal of \Sigma_k. The IFV encoding is a vector representation of 2D × K dimensions. In the IFV, the power (signed square root) normalization usually performs better than L2 normalization.

Compared with the BoW with K-means, the IFV framework provides a more general way to represent an image by a generative process of local descriptors, and it can be computed efficiently from much smaller vocabularies. Chatfield et al. [2] evaluated state-of-the-art encoding methods such as the IFV, the Super Vector and Locality-constrained Linear Coding (LLC), and showed that the IFV performs best among all compared encodings.
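The sketch below encodes the after-PCA descriptors of one image into an IFV following Eqs. 7-9, with sklearn's GaussianMixture standing in as one possible GMM training tool; applying the power normalization followed by L2 normalization is common IFV practice rather than a detail taken from the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(x, gmm):
    """IFV of Eqs. 7-9 for the T x D matrix x of after-PCA descriptors."""
    T, D = x.shape
    pi, mu = gmm.weights_, gmm.means_            # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)            # (K, D); diagonal GMM assumed
    gamma = gmm.predict_proba(x)                 # soft assignments lambda_tk, (T, K)
    parts = []
    for k in range(len(pi)):
        u = (x - mu[k]) / sigma[k]               # (T, D)
        parts.append((gamma[:, k, None] * u).sum(0) / (T * np.sqrt(pi[k])))
        parts.append((gamma[:, k, None] * (u ** 2 - 1)).sum(0) / (T * np.sqrt(2 * pi[k])))
    fv = np.concatenate(parts)                   # 2 * D * K dimensions
    fv = np.sign(fv) * np.sqrt(np.abs(fv))       # power (signed square root) norm
    return fv / (np.linalg.norm(fv) + 1e-12)     # followed by L2 normalization

# gmm = GaussianMixture(n_components=256, covariance_type="diag").fit(train_descs)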
5. Experiments
LOAD.
For the LOAD, we use four scales ((8, 1), (8, 2), (8, 3) and (8, 4)). The dimension for each scale is 59; thus, the final dimension is 236. Experiments show that the performance with four scales is usually slightly better than that with two scales (e.g., (8, 1) and (8, 3)).
IFV.
We first sample 100,000 LOAD features from the training samples; these features are used to learn the PCA components, and 100 principal components are preserved as the basis for dimension reduction. As pointed out by [27], the PCA, which removes the correlation between arbitrary pairs of dimensions, is a key step in the IFV framework. With the above-mentioned 100,000 after-PCA LOAD features, we learn a GMM with 256 components. For the PCA, we use the Matlab built-in SVD (Singular Value Decomposition). For the GMM, we use VLFeat [33] to learn the parameters θ = {π_k, μ_k, Σ_k, k = 1, ..., K}. In the IFV, Σ_k is forced to be diagonal. The final dimension of the IFV representation of each image is 2 × 100 × 256 = 51,200.

Classifier.
We train a one-vs-all linear SVM classifier (with C = 10) using the Liblinear [8] toolbox. It should be pointed out that the computational cost of our LOAD descriptor is low: on a desktop computer with a dual-core 3.4 GHz CPU, the C++ (Matlab mex) implementation takes about 2 s to extract 8,000 features.
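Putting these settings together, a minimal sketch of the pipeline might look as follows, with sklearn standing in for the Matlab SVD, VLFeat and Liblinear tools used in the paper; sampled_descs, train_image_descs and train_labels are placeholder names, and fisher_vector is the IFV sketch from Section 4.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

# sampled_descs: 100,000 x 236 LOAD descriptors drawn from the training images
pca = PCA(n_components=100).fit(sampled_descs)
gmm = GaussianMixture(n_components=256, covariance_type="diag").fit(
    pca.transform(sampled_descs))

def encode(image_descs):
    """IFV of one image: 2 x 100 x 256 = 51,200 dimensions."""
    return fisher_vector(pca.transform(image_descs), gmm)  # see the IFV sketch above

# train_image_descs: a list with one descriptor matrix per training image
X_train = np.stack([encode(d) for d in train_image_descs])
clf = LinearSVC(C=10).fit(X_train, train_labels)  # one-vs-all linear SVM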
Rotation Invariance.
To evaluate the rotation invariance of the LOAD feature, we use three data sets: Outex TC 00010 (TC10), Outex TC 00012 (TC12) and UIUC. The experimental setup for each data set is presented in the corresponding application section below. We compare the LOAD with RootSIFT, ensuring that both features use the same number of features and the same IFV representation framework. The experimental results for both features are shown in Table 1.

Table 1: Evaluation of the rotation invariance of the LOAD on the TC10, TC12 and UIUC data sets.
Method     UIUC    TC10    TC12 (t184)   TC12 (horizon)
RootSIFT   97.1    48.78   53.98         54.56
LOAD       99.6    99.95   99.65         99.33
According to Table 1, we have two observations: (1) on the data sets with strong rotation, such as TC10 and TC12, the LOAD shows great robustness to image rotation and significantly outperforms RootSIFT; (2) on the UIUC data set, which has small image rotations, our LOAD still performs better than RootSIFT.
Discriminative Power.
To assess the discriminative power of the LOAD, we directly compare it with RootSIFT under two sampling strategies: single-scale and multi-scale sampling. For single-scale sampling, we directly sample points on the original images. For multi-scale sampling, we densely extract features from six scales with rescaling factors 2^{-i/2}, i = -1, 0, 1, ..., 4. We evaluate the LOAD and RootSIFT on the Flickr Material Database (FMD) and UIUC data sets. The results are shown in Table 2.

Table 2: Comparison of the LOAD and RootSIFT on the FMD and UIUC data sets.

Sampling strategy   Feature    FMD   UIUC
Single-scale        RootSIFT   –     –
Single-scale        LOAD       –     –
Multi-scale         RootSIFT   –     –
Multi-scale         LOAD       –     –

From Table 2, under both the single-scale and multi-scale sampling strategies, our LOAD outperforms RootSIFT. For instance, with single-scale sampling, our LOAD improves on RootSIFT by 5.6% on the FMD data set. Meanwhile, we can also see that the multi-scale sampling strategy consistently outperforms the single-scale sampling strategy.
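A minimal sketch of this six-scale dense sampling, assuming OpenCV (cv2) for resizing and a grid step of 4 pixels (the step reported later for the FMD); dense_multiscale_points is an illustrative name.

import cv2

def dense_multiscale_points(img, step=4):
    """Yield (rescaled image, grid point) pairs over six pyramid scales."""
    scales = [2 ** (-i / 2) for i in range(-1, 5)]   # 2^(-i/2), i = -1, 0, ..., 4
    for s in scales:
        rescaled = cv2.resize(img, None, fx=s, fy=s)
        h, w = rescaled.shape[:2]
        for y in range(step, h - step, step):
            for x in range(step, w - step, step):
                yield rescaled, (x, y)               # each point is fed to LOAD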
Figure 4: Sample images from the TC10, TC12 and UIUC texture data sets: (a) TC10, (b) TC12, (c) UIUC. Note that TC10 and TC12 have strong rotation variation, and UIUC has strong rotation, scale and affine transformations.
The Outex [22] database has two test suites:
Outex TC 00010 (TC10) and
Outex TC 00012 (TC12). The two test suites contain the same 24 classes of textures, which were collected under three different illuminations ("horizon", "inca" and "t184") and nine different rotation angles (0°, 5°, 10°, 15°, 30°, 45°, 60°, 75° and 90°). There are 20 non-overlapping 128 × 128 texture samples for each class. For TC10, the samples of illumination "inca" with angle 0° in each class were used for training, and the other eight rotation angles under the same illumination were used for testing. Hence, there are 480 (24 × 20) training samples and 3,840 (24 × 20 × 8) validation samples. For TC12, the classifier was trained with the same training samples as TC10, and it was tested with all samples captured under illumination "t184" or "horizon". Hence, there are 480 (24 × 20) training samples and 4,320 (24 × 20 × 9) validation samples for each illumination. It should be noted that the training images come from only one angle, but the testing images come from different angles.

The UIUC [15] texture data set contains 1,000 images: 25 different texture categories with 40 samples in each category. The image size is 640 × 480. This data set has strong rotation and scale variations. In the experiments, 20 samples from each category are used for training, and the remaining 20 samples are used for testing.

Sample images from the above three data sets are shown in Figure 4. For all three data sets, we densely extract features from six scales with rescaling factors 2^{-i/2}, i = -1, 0, 1, ..., 4. We use the IFV representation and a linear SVM. The results on the TC10, TC12 and UIUC data sets are shown in Table 3.

Methods                 TC10    TC12 (t184)   TC12 (horizon)
Dense SIFT (SVM)        48.78   53.98         54.56
CLBP SM/C (NN) [10]     99.14   95.18         95.55
BRINT (NN) [20]         99.35   97.69         98.56
BRINT (SVM) [20]        99.30   98.13         98.33
LOAD (SVM)              99.95   99.65         99.33

(a) Experimental results on the TC10 and TC12 data sets.

Methods                 Acc.    Methods        Acc.
Lazebnik et al. [15]    96.0    WMFS [37]      98.6
BIF [4]                 98.8    SRP [19]       98.56
Sifre et al. [30]       –       RootSIFT       97.0
Cimpoi et al. [3]       99.0    LOAD           99.6

(b) Experimental results on the UIUC data set.

Table 3: Comparison with state-of-the-art methods on the TC10, TC12 and UIUC texture data sets.

Table 3(a) shows that the rotation invariant methods, including CLBP, BRINT and LOAD, significantly outperform the rotation sensitive method (dense SIFT with IFV). Meanwhile, among all rotation invariant methods, our LOAD works best. According to Table 3(b), on the UIUC data set, our LOAD also outperforms the state-of-the-art methods, including SRP [19] and two recently published works [30, 3].
The Flickr Material Database (FMD) [18] is a challenging real-world material data set. It contains 10 categories: fabric, foliage, glass, leather, metal, paper, plastic, stone, water and wood. As pointed out in [29], the FMD was designed with the specific goal of capturing the appearance variations of real-world materials and of including a diverse selection of samples in each category. Each category in the FMD has 100 images, of which 50 are used for training and the remaining 50 for testing. Sample images are shown in Figure 5. We use multi-scale sampling and densely extract features from six scales with rescaling factors 2^{-i/2}, i = -1, 0, 1, ..., 4. The step size for our sampling is 4; with this setting, about 43,000 points are sampled from each image.

Figure 5: Sample images of the 10 categories of the FMD data set.

In the experiments, we compare our feature with many state-of-the-art methods, including the Kernel Descriptor [12], the Pairwise Rotation Invariance Co-occurrence LBP (PRICoLBP) [26], DTD (a texture attribute descriptor) [3] and the CNN [16] (we use the OverFeat [28] toolbox in this paper).

This paper investigates two key issues: (1) how much does the proposed feature depend on the dictionary (learned by the GMM) in the IFV? (2) how much complementary information can learning-based methods (e.g., the CNN) provide for the LOAD feature with the IFV representation? For the first question, we compare the LOAD with the IFV representation using vocabularies learned from the FMD or from an external data set; we randomly select 500 images from [6] as the external data set. For the second question, we evaluate the combination of our LOAD with the CNN feature. All relevant results are shown in Table 4, and the classification confusion matrices for the CNN, the LOAD and the combination of the CNN and the LOAD are shown in Figure 6.

Figure 6: Classification confusion matrices for the CNN (61.2), the LOAD (65.4) and the combination of the CNN and the LOAD (72.5) on the FMD data set.

Table 4: Comparison of state-of-the-art methods on the FMD data set. LOAD* means using the vocabulary learned from an external data set. Note that the recognition accuracy of humans on the FMD is 84.9%, as reported in [29].

Methods                        Accuracy
Liu et al. CVPR'10 [18]        44.6
Hu et al. BMVC'11 [12]         49
Qi et al. ECCV'12 [26]         –
Li et al. ECCV'12 [17]         48.1
Sharan et al. IJCV'13 [29]     57.1
DTD CVPR'14 [3]                –
Features Combined [3]          –
CNN [28]                       61.2
LOAD*                          –
LOAD                           65.4
LOAD* + CNN                    72.5

From Table 4, we can observe that:

• The LOAD achieves better performance than previous works, including the methods with a single feature, such as the Kernel Descriptor, DTD and PRICoLBP. Meanwhile, it also outperforms some methods with multiple features, such as Liu et al. [18] and Sharan et al. [29], whose results are based on the combination of seven features.

• The LOAD combined with the CNN significantly improves on both of them. The combination decreases the error rate of the LOAD by about 20%, and decreases the error rate of the CNN by about 30%.

• The LOAD is not sensitive to the source of the vocabulary. The LOAD with the vocabulary learned from the FMD only slightly improves on the LOAD* with the vocabulary learned from an external data set.

From Figure 6, we can see that the performances of the CNN and the LOAD vary a lot on some categories, such as "foliage", "metal" and "stone". Meanwhile, we observe that on several categories, such as "fabric" and "glass", the LOAD combined with the CNN improves on the weaker of the two by more than 10%.
Discussion.
We believe the reason behind the significant increase in classification performance is that the CNN and the IFV representation belong to two different approaches: structured and non-structured. The CNN is a structured method that is discriminative in capturing spatial layout information; with its hierarchical max-pooling strategy, structured information is well preserved and captured. On the other hand, the CNN may not be robust to heavy image rotation and translation. In contrast, the IFV representation with the LOAD feature is robust to image rotation and translation, but is not powerful in describing spatial structure information. We believe this is why these two methods carry strong complementary information.
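The paper does not detail how the LOAD and CNN features are combined; one simple reading, shown below, is late fusion by concatenating the L2-normalized IFV and CNN vectors before the linear SVM. This is an assumption rather than the authors' confirmed procedure, and train_ifv, train_cnn and train_labels are placeholder names.

import numpy as np
from sklearn.svm import LinearSVC

def fuse(ifv_feat, cnn_feat):
    """Concatenate the two representations after per-feature L2 normalization."""
    ifv = ifv_feat / (np.linalg.norm(ifv_feat) + 1e-12)
    cnn = cnn_feat / (np.linalg.norm(cnn_feat) + 1e-12)
    return np.concatenate([ifv, cnn])

X = np.stack([fuse(a, b) for a, b in zip(train_ifv, train_cnn)])
clf = LinearSVC(C=10).fit(X, train_labels)  # same one-vs-all linear SVM as above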
A New Material Data Set (OULU-ETHZ) is introduced in this paper. The new data set is compiled from the newly introduced ETHZ Synthesizability data set, which contains rich material images. The ETHZ data set was designed to evaluate the synthesizability of images, not for material recognition. In this paper, we select 13 material categories from it and construct a new data set for material recognition.

The 13 categories are Cloud, Fabric, Flour, Fur, Glass, Grass, Leather, Metal, Paper, Plastic, Sand, Water and Wood. The number of images in each category ranges from 44 to 420. Deriving from the ETHZ data set, all samples share the same image size. Some sample images are shown in Figure 7.

The OULU-ETHZ and FMD data sets share some similar properties and also have some differences:

• The images in both the FMD and the OULU-ETHZ are collected from real-world material images, and rich appearance variation occurs in both data sets. For instance, the "Cloud" category in Figure 7 shows huge illumination variation.

• Compared with the FMD, most of the images in the OULU-ETHZ are close-up images, and thus the OULU-ETHZ shows better alignment. This means that the images in the OULU-ETHZ have a stronger scale and rotation prior than those in the FMD.

To evaluate different algorithms, we use 20 samples per category for training and the rest for testing. We pre-create five training-testing configurations and report the averaged accuracy. We compare the proposed LOAD with two baseline methods (LBP, PRICoLBP) and also with the CNN. (The ETHZ Synthesizability data set contains 21,302 texture images downloaded with 60 keywords. Following [26], we use the χ² kernel for the LBP and PRICoLBP; the LBP uses three scales, the PRICoLBP uses 6 templates, and their dimensions are 54 and 3,540, respectively. We use a linear SVM for our LOAD and the CNN.) The results are shown in Table 5.

Table 5: Experimental results on the OULU-ETHZ data set.

Methods                  Accuracy
LBP (Gray) [23]          –
PRICoLBP (Gray) [26]     –
PRICoLBP (Color) [26]    –
SIFT (IFV)               –
CNN                      –
LOAD (IFV)               –
LOAD + CNN               –

From Table 5, we can observe that:

• The CNN achieves the best result among all compared approaches, and our LOAD ranks second. The LOAD outperforms the LBP, PRICoLBP and SIFT.

• The LOAD is strongly complementary to the CNN: their combination improves on the CNN by about 6%.
Discussion.
It is interesting to investigate why the LOAD performs better than the CNN on the FMD, but worse than the CNN on the OULU-ETHZ. We believe the following two points may be the main reasons:

• The OULU-ETHZ shows better consistency in appearance (e.g., color). The CNN is built on color images, while the LOAD is extracted from gray images. We believe that the OULU-ETHZ may have a stronger color prior than the FMD. This argument is supported by the fact that the color PRICoLBP performs better than the gray PRICoLBP on the OULU-ETHZ, but only achieves similar performance to the gray PRICoLBP on the FMD. The consistency of appearance on the OULU-ETHZ is important for the CNN.

• Most of the images in the OULU-ETHZ are close-up images, which are strongly aligned in scale. Meanwhile, due to the skews when collecting the ETHZ data set, the images also have good alignment in rotation. Scale and rotation are two issues that are difficult for the CNN to handle.
Figure 7: Sample images from the OULU-ETHZ real-world material data set (Cloud, Fabric, Flour, Fur, Glass, Grass, Leather, Metal, Paper, Plastic, Sand, Water and Wood). The OULU-ETHZ has rich image transformations.

6. Conclusion

This paper proposed a novel Local Orientation Adaptive Descriptor (LOAD) to capture regional texture information for image classification. It enjoys not only discriminative power to capture texture information, but also strong robustness to illumination variation and image rotation. Superior performance on texture and real-world material classification tasks fully demonstrates its effectiveness. Meanwhile, the LOAD is strongly complementary to learning-based methods (e.g., Convolutional Neural Networks): the LOAD combined with the CNN significantly outperforms both of them. We believe the strong complementary information arises because the IFV representation with the LOAD feature and the CNN belong to two different approaches: non-structured and structured. The former is robust to image rotation and translation, but does not capture structured information well. In contrast, the latter is good at capturing structured information because of its hierarchical max-pooling strategy, but is not robust to heavy image rotation and translation. Therefore, they exhibit strong complementary properties.