An Improved Relevance Feedback in CBIR
A Preprint
Subhadip Maji ∗
M.Tech QROR, Indian Statistical Institute, Kolkata
Kolkata, 700108
[email protected]

Smarajit Bose
Interdisciplinary Statistical Research Unit, Indian Statistical Institute, Kolkata
Kolkata, 700108
[email protected]
September 1, 2020

Abstract
Relevance Feedback in Content Based Image Retrieval is a method in which feedback about the retrieval performance is used to improve the system itself. Prior works use feature re-weighting and classification techniques as Relevance Feedback methods. This paper presents a novel addition to these methods that further improves the retrieval accuracy. In addition, the paper presents a novel idea for improving even the 0-th iteration retrieval accuracy using the information gathered from Relevance Feedback.

Keywords: Relevance Feedback · Content Based Image Retrieval · Deep Learning · Classification · Clustering
Relevance feedback is a method in which we take feedback from users after image retrieval and use it to gradually improve the retrieval accuracy with each iteration. Several relevance feedback methods exist for this purpose. In this paper, we use relevance feedback to make the system learn, so that as users use the Content Based Image Retrieval (CBIR) system more and more, the average retrieval accuracy increases and the average number of iterations needed to reach the maximum retrieval accuracy decreases.

There are many Relevance Feedback methods [7, 8, 9, 11, 12, 13] that further improve the precision after the first iteration. Throughout this paper, however, we use Feature Re-weighting [2] as the elementary Relevance Feedback method with which the system learns to improve the average retrieval accuracy and the average number of relevance feedback iterations.
Human perception of image similarity is semantic, task-dependent and subjective. Although content-based methods give promising directions for image retrieval, retrieval results based on the similarities of purely visual features are not necessarily semantically and perceptually meaningful. In addition, each type of visual feature tends to capture only one aspect of an image, and it is usually hard for a user to specify clearly how different aspects should be combined. To tackle these problems, relevance feedback, a technique from traditional text-based information retrieval systems, is introduced.

Relevance feedback is a supervised active learning technique used to improve the effectiveness of information systems. The main idea is to use positive and negative examples given by the user to improve the performance of the system. For a given query image, the system first retrieves a list of ranked images with respect to a predefined similarity measure. Then, the user labels the retrieved images as relevant (positive examples) or not relevant (negative examples) to the given query image. The system then refines the retrieval results based on the user feedback and retrieves a new set of images for the user. The key issue in relevance feedback is how to incorporate the positive and negative examples given by the user to refine the query and/or to adjust the similarity measure so as to improve the system's retrieval performance. The mechanism is generally applied iteratively.

∗ GitHub Repo: https://github.com/pidahbus
In Feature Re-weighting [2], the information obtained from the 0-th iteration (after the user has classified the images as relevant and non-relevant) is used to assign meaningful weights $w_j$ to each of the $d = 100$ features (discussed above). The query image is again compared with each image in the database (minus the 20 images already retrieved), but using a weighted L1 norm distance measure:

$$D = \sum_{j=1}^{d} w_j \, |f^I_j - f^Q_j| \tag{1}$$

The images having the least $D$ values are returned (1st iteration). The number of images returned in the 1st iteration is Scope minus the number of relevant images returned in the 0-th iteration. The user once again classifies these images, the additional information being used to modify the weights, and the same process continues for each iteration. For a particular iteration, the number of images returned equals Scope minus the total number of relevant images returned in the preceding iterations. The process continues for up to six iterations, or until the total number of relevant images returned equals the Scope.

An obvious criterion for the choice of weights [8] is that they should be higher for those features which differ significantly between the relevant and non-relevant classes, and thus discriminate well between relevant and non-relevant images, and lower for those features which behave similarly in both classes. Let $\sigma^{(t)}_j$ and $\sigma^{(t)}_{rel,j}$ be the standard deviations of $f_j$ over the sets $N_t \cup R_t$ and $R_t$ respectively, where $R_t$ and $N_t$ are respectively the relevant and non-relevant sets at the $t$-th RF iteration. An intuitive choice of weight for the feature $f_j$ at the $(t+1)$-th iteration is:

$$w^{(t+1)}_j = \frac{\sigma^{(t)}_j}{\sigma^{(t)}_{rel,j}} \tag{2}$$

If $\sigma^{(t)}_{rel,j}$ becomes zero, the denominator is assigned a small positive value $\Delta$ to avoid computational problems.

Wu and Zhang proposed an efficient way of using both the relevant and non-relevant samples by forming a discriminant ratio which determines the ability of a feature to separate relevant images from non-relevant ones. If $F^{(t)}_{rel,j} = \{ f_{I,j} : I \in R_t \}$ is the collection of the $j$-th feature of all images in $R_t$, then the dominant range over relevant images at the $t$-th iteration for the $j$-th feature component is defined as

$$D^{(t)}_j = \left[ \min\left(F^{(t)}_{rel,j}\right),\ \max\left(F^{(t)}_{rel,j}\right) \right] \tag{3}$$

The discriminant ratio proposed is:

$$\delta^{(t)}_j = 1 - \frac{\text{No. of non-relevant images having the } j\text{-th feature in } D^{(t)}_j}{|N_t|} \tag{4}$$

The value of $\delta_j$ lies between 0 and 1. It is 0 when the $j$-th feature of all non-relevant images lies within the dominant range, and thus no weight should be given to that feature component. On the other hand, when no non-relevant image has its $j$-th feature component lying within the dominant range, maximum weight should be given to that feature component ($\delta_j = 1$). Based on this, another choice of weights is:

$$w^{(t+1)}_j = \delta^{(t)}_j \cdot \frac{\sigma^{(t)}_j}{\sigma^{(t)}_{rel,j}} \tag{5}$$
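The weight computation of Eqs. (2)-(5) and the weighted distance of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function and variable names are ours, features are assumed to be rows of a matrix, and `eps` plays the role of the small positive value $\Delta$.

```python
import numpy as np

def reweight(features, rel_idx, nonrel_idx, eps=1e-6):
    """Per-feature weights for the next RF iteration, following Eqs. (2)-(5)."""
    rel = features[rel_idx]                                   # R_t
    union = features[np.concatenate([rel_idx, nonrel_idx])]   # N_t U R_t
    sigma = union.std(axis=0)                                 # sigma_j^(t)
    sigma_rel = rel.std(axis=0)                               # sigma_rel,j^(t)
    sigma_rel = np.where(sigma_rel == 0, eps, sigma_rel)      # Delta guard

    # Dominant range over the relevant images, Eq. (3)
    lo, hi = rel.min(axis=0), rel.max(axis=0)
    nonrel = features[nonrel_idx]
    inside = ((nonrel >= lo) & (nonrel <= hi)).sum(axis=0)
    delta = 1.0 - inside / len(nonrel_idx)                    # Eq. (4)

    return delta * sigma / sigma_rel                          # Eq. (5)

def weighted_l1(query, database, w):
    """Weighted L1 distance of Eq. (1): D = sum_j w_j |f_j^I - f_j^Q|."""
    return (w * np.abs(database - query)).sum(axis=1)
```

The images to return at the next iteration are then those with the smallest `weighted_l1` values among the not-yet-shown database images.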
Maji et al. [6] used Precision as the evaluation metric:

$$\text{Precision} = \frac{\text{Number of relevant images retrieved}}{\text{Number of retrieved images}} \tag{6}$$

Generally, the number of images retrieved by any CBIR method (called the Scope of the method) is a pre-specified positive integer. Precision values are calculated for each image in the database and averaged over all images. These averages are conventionally plotted for different values of the Scope to illustrate the overall retrieval performance of the method.

However, under relevance feedback the scenario is slightly different. Here, after the user identifies the relevant and non-relevant images at each iteration, a different set of images (not necessarily disjoint from the earlier set) is usually retrieved in the following iteration due to the change in the search criterion. Several issues are involved here. For example, it is not desirable to return the same image (relevant or non-relevant) to the user a second time after it was retrieved at an earlier iteration. Therefore one should aim to retrieve a new set of images at each iteration, containing none of the images retrieved earlier. Further, if we consider the initial Scope (S) as the number of relevant images the user is looking for, it makes sense to retrieve only S − R images at every step, where R is the current number of relevant images. Under such considerations, the total number of images to be retrieved changes after every iteration and is expected to differ between query images. Hence another evaluation measure has been proposed [1] whose behaviour remains consistent irrespective of whether RF is used:

$$\text{Retrieval Accuracy} = \frac{\text{Number of relevant images retrieved}}{\text{Scope}} \tag{7}$$
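The distinction between the two metrics can be made concrete with a small sketch (illustrative names, not from the paper). Precision divides by however many images were just retrieved, while Retrieval Accuracy always divides by the fixed Scope, so it remains comparable across RF iterations that retrieve fewer and fewer images.

```python
def precision(n_relevant_retrieved, n_retrieved):
    """Eq. (6): fraction of the retrieved images that are relevant."""
    return n_relevant_retrieved / n_retrieved

def retrieval_accuracy(n_relevant_retrieved, scope):
    """Eq. (7): relevant images retrieved so far, relative to the fixed Scope."""
    return n_relevant_retrieved / scope
```

For example, with Scope 20, if 15 relevant images were found in earlier iterations and 3 of the 5 newly retrieved images are relevant, `precision(3, 5)` is 0.6 for that iteration, while `retrieval_accuracy(18, 20)` is 0.9 overall.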
This database contains 9144 images from 102 categories. The number of images in each category varies from 34 to 800 [5].
This paper is a continuation of the paper by Maji et al. [6], where pre-trained deep learning features were used to obtain state-of-the-art results. With their method, they achieved an average precision of around 82% on this dataset. On top of their result, we introduce relevance feedback to further improve the retrieval results.

The features extracted from the CBIR model referred to above are high-dimensional (n = 1536). To tackle the curse of dimensionality, we compute relevance feedback accuracies using roughly the first 100 principal components. The idea of applying PCA to the extracted features is taken from [6].

Once the top 20 images (having the least distance from the query image) have been retrieved by the classical CBIR approach [6] (0-th iteration), these 20 images are presented to the user, who is asked to manually label each of them as "relevant" or "non-relevant" to the query. This feedback is used in subsequent iterations, so that the ranking criterion is updated and a new set of images is retrieved. In this way, the subjective human perception of image similarity is incorporated into the system, and the retrieval results are expected to improve, from the user's viewpoint, with the RF iterations. The process is generally terminated when there is no further improvement or when the required number of relevant images (here, 20) is retrieved. In this work, we proceed up to 6 iterations, assuming that the user will get tired after 6 iterations and will not want to continue classifying the retrieved images manually. The improvement up to 6 iterations is shown in Figure 1.
Now we try to further improve the retrieval accuracy as the user continues using the system. This improvement can be achieved in two ways: increasing the average retrieval accuracy and decreasing the average iteration number. We performed this experiment on the DBCaltech dataset with all 9144 images.
Figure 1: Relevance Accuracies up to iteration 6 for the Normal Feature Re-weighting RF Method
To implement this, we divided the DBCaltech dataset into three parts. First, one image was selected randomly from each category (a total of 102 images) among those with low precision at RF0; we call these the test images. The retrieval accuracy of these images before and after the experiment shows how much improvement is obtained. Second, another 1000 images were randomly selected from the remaining 9042 images. These are the images the user employs to retrieve similar images over time; we will show that as the user queries with these images, the RF gradually continues to improve. We call these the validation images. The remaining images (a total of 8042) form the dataset from which similar images are retrieved; we call it the retrieval dataset.
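The three-way split can be sketched as follows. This is an illustrative helper, not the authors' code: the paper selects the per-category test image among low-RF0-precision images, while this sketch stands in with a random pick per category.

```python
import random

def split_dataset(images_by_category, n_validation=1000, seed=0):
    """Three-way split: one test image per category, n_validation random
    validation images, and everything else as the retrieval set."""
    rng = random.Random(seed)
    test, rest = [], []
    for cat, imgs in images_by_category.items():
        imgs = list(imgs)
        rng.shuffle(imgs)
        test.append(imgs[0])      # one test image per category
        rest.extend(imgs[1:])
    rng.shuffle(rest)
    return test, rest[:n_validation], rest[n_validation:]
```

With the full DBCaltech dictionary (102 categories, 9144 images) this yields the 102 / 1000 / 8042 partition described above.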
Without touching the validation images, we applied the pre-trained neural network method described above to calculate the precision without RF; the value came to 56.12%. Using the Normal Feature Re-weighting method demonstrated above, we also carried out the Relevance Feedback retrieval accuracy calculation. As mentioned above, we fixed the iteration number at a maximum of 6 for each of the test images. Along with the retrieval accuracies of the test images, we recorded the iteration number at which the accuracy of each image reaches 1. So,
RF iteration number for image i = min{iteration number at which the accuracy reaches 1, 6}
By doing so, we obtain a mean retrieval accuracy of 82.69% and an average RF iteration number of 3.71.
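The per-image iteration count defined above can be written directly as a small helper (an illustrative sketch; the function name and input convention are ours):

```python
def rf_iteration_number(accuracy_by_iteration, max_iter=6):
    """min{iteration at which the accuracy reaches 1, 6} for one image;
    accuracy_by_iteration[t-1] is the retrieval accuracy after RF iteration t."""
    for t, acc in enumerate(accuracy_by_iteration, start=1):
        if acc >= 1.0:
            return min(t, max_iter)
    return max_iter
```

Averaging this quantity over all test images gives the reported 3.71.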
Now suppose that over a certain period of time the user is using the CBIR system we have developed, and the images they query with are the validation images, which come from outside the retrieval dataset. Our aim is to group the images of the retrieval dataset based on the Relevance Feedback provided by the user after each retrieval.

At the very beginning there are no groups. After at most 6 iterations for the first validation image, we obtain $n$ ($n \le$ Scope) relevant images. With these $n$ images, we create our first group. While retrieving images for the second validation image, the user selects the relevant images from the sample after the first retrieval. If any one of the relevant images matches an image of the created group, we take a random sample from that group equal to the number of irrelevant images, and the accuracy becomes 1 after one iteration. If the number of images in the group is less than the number of irrelevant images in that retrieval, we include all the images from that group and find the remaining images by Normal Feature Re-weighting. After at most 6 iterations, say we obtain $n$ relevant images; we then add these $n$ images to the previous group. If none of the relevant images after the first retrieval of the second validation image matches any image of the group, we gather more relevant images by the Normal Feature Re-weighting method alone and, after at most 6 iterations, create a new group with the $n$ relevant images. In this process, after at most 6 iterations we either add the relevant images to an existing group or create a new group with them. It may happen that the relevant images after the first retrieval come from different groups; in that case we merge all those groups.

To show the improvement over the use of the validation images, we selected 5 equal intervals in the process of group formation and checked the mean accuracy and RF iteration number at each interval: first after 200 images, second after 400 images, and so on. The graphical result is shown in Figure 2.

Figure 2: Improvement in Accuracy and RF iteration number over the use of validation images
Here RF0 means the CBIR without any relevance feedback. After carrying out the above grouping process over the 1000 validation images, we obtain 117 groups, whereas the original number of groups is 102. It is reasonable to expect that if users continue beyond 1000 validation images, the number of groups will eventually shrink from 117 to 102. The idea now is to train a neural network model on this dataset with the specified classes, and thereby obtain a feature set more appropriate for our dataset.
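The group bookkeeping described above (add relevant images to a touched group, merge groups when the feedback spans several of them, or start a new group) can be sketched as a single update step. This is an illustrative sketch, not the authors' code; image ids and the list-of-sets representation are assumptions.

```python
def update_groups(groups, relevant_images):
    """Fold one round of relevant feedback into the group structure:
    merge every existing group that shares an image with the relevant set
    (together with the new images), or start a new group if none does."""
    relevant = set(relevant_images)
    touched = [g for g in groups if g & relevant]      # groups to merge
    untouched = [g for g in groups if not (g & relevant)]
    return untouched + [relevant.union(*touched)]
```

Applying this after every validation-image session yields the 117 groups reported above; further merges would drive the count toward the true 102 categories.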
First, we removed groups with fewer than 10 images, which reduces the number of groups from 117 to 104. We then created two folders, Train and Validation: 20% of the images from each group were randomly placed in the Validation folder, and the rest in the Train folder.
We fine-tuned the InceptionResNet model pre-trained on the ImageNet dataset. First, we extracted the convolutional base, then added two hidden layers of 1536 and 256 nodes respectively, and finally a softmax layer with 104 classes. The model was compiled with the RMSprop optimizer and a learning rate of 1e-05, and trained by mini-batch gradient descent with a batch size of 20. At epoch 21 the model gave the highest validation accuracy of around 95%. We then trained on the whole dataset for 21 epochs, keeping everything else intact. As in the earlier approach, we removed the softmax layer, and the last 256-dimensional layer became our new feature representation of the images. Finally, we encoded all images in the retrieval dataset into 256-dimensional feature vectors by feeding them through this newly trained neural network.
Using the method described by Maji et al. [6], the RF0 precision for the 102 test images is 56.12%. After this fine-tuning, the result improves to 65.78%. The improvement is shown in Figure 3. The fine-tuned result could be improved further with a better network architecture and hyper-parameter tuning; however, the aim of this section is to show how the RF0 precision can be improved using the relevance feedback from the users.

Figure 3: Improvement in RF0 precision over the use of validation images
Like the earlier section, this section also concentrates on improving the RF0 precision by collecting and storing information during relevance feedback to update model parameters. But unlike grouping images and solving a multi-class classification problem, this section formulates the problem as a simple binary classification task, which turns out to be superior to the earlier methods at retrieving relevant images.

During relevance feedback the user points out, for a given query image, which images are relevant and which are not. For example, for a scope of $n$, given one query image, the user indicates that $r_1$ images are relevant and $r_2$ images are not relevant, where $r_1 \ge 0$, $r_2 \ge 0$ and $r_1 + r_2 = n$. With this information we can create $\binom{r_1}{2}$ combinations in which the two images are similar, and $r_1 \times r_2$ combinations in which the two images are dissimilar. Each combination consists of two images. If we label similar combinations as 1 and dissimilar combinations as 0, we can formulate this as a binary classification problem: given two images as input, predict whether they are similar or not.

We used a Siamese Neural Network [4] to train a model on this dataset. The shared CNN layers in the Siamese Network are taken from the pre-trained layers of MobileNetV2 [10] (excluding the last softmax dense layer). These shared CNN layers take two images as input and output a dense vector (of dimension 1280) for each input image. The L1 distance is calculated between these two vectors, and finally a dense layer with one unit and sigmoid activation predicts the output. The model architecture is shown in Figure 4.

We trained the model with the Adam [3] optimizer using a low learning rate (0.00005), as the model parameters were pre-trained. We split the dataset into a train and a validation set; it typically took 4-5 epochs to train the model. After model training is complete, we extract the MobileNetV2 encoding from the model architecture, whose input is an image array (224 × 224 × 3) and whose output is a 1280-dimensional vector. We use this tuned encoding layer as our new CBIR encoder, which creates the feature representation of a given image. It turns out that fine-tuning the encoder with the information obtained from relevance feedback improves the quality of the CBIR encoder for retrieval.
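The pair construction from one feedback round can be sketched directly from the combinatorics above ($\binom{r_1}{2}$ similar pairs, $r_1 \times r_2$ dissimilar pairs). This is an illustrative helper with our own names, not the authors' code:

```python
from itertools import combinations

def make_pairs(relevant, non_relevant):
    """Binary-classification dataset from one feedback round: every pair of
    relevant images is labelled 1 (similar), every relevant/non-relevant
    pair is labelled 0 (dissimilar)."""
    pairs = [((a, b), 1) for a, b in combinations(relevant, 2)]
    pairs += [((a, b), 0) for a in relevant for b in non_relevant]
    return pairs
```

Each labelled pair is then fed to the two inputs of the Siamese network, whose L1-distance-plus-sigmoid head is trained against the 0/1 label.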
Figure 4: Siamese Neural Network architecture diagram

To evaluate the efficacy of the approach, we randomly took $x\%$ of the images from the database as query images and retrieved images using the CBIR encoder model described by Maji et al. [6]. Once the retrieval results were prepared, the user feedback was used to create the above-mentioned binary-class dataset. Here $x$ takes eight values between 0 and 100. We fit a separate model and obtained a MobileNetV2 encoder for each value of $x$; these newly tuned encoder models are then used to calculate precision. Table 1 and Figure 5 show the change in average precision values as the sample size $x$ varies from 0 to 100 using the new encoder models. In-sample precision means the average precision when the $x\%$ sampled images are used as query images, and out-of-sample precision means the average precision when the remaining $(100 - x)\%$ of images are used as query images. Overall precision is the average precision over all images in the Caltech dataset. A sample ($x$) of 0% means that no image is used in relevance feedback, and the corresponding 82.02% precision is the value obtained using the method described by Maji et al. [6]. It is clearly seen that as we increase the sample size, all precision values except that for 5% increase significantly using the new encoder models obtained from the binary classification approach.

Finally, from the practical point of view of deploying a CBIR system, we suggest the following road map. First deploy the system using the processes described by Maji et al. [6], and store the relevance feedback information from the users. Once the number of retrieval sessions reaches around 10-15% of the database size, use the stored feedback information to build a Siamese Network model. Once the model is built, extract the encoder model and replace the one from Maji et al. [6] with it. As shown above, this improves the retrieval precision.
We suggest updating this encoder model at regular intervals as more user feedback information is collected. Two points should be noted here:

• The above-mentioned 10-15% means that if there are 1000 images in the database, then after 100-150 retrieval iterations it is recommended to build the new encoder model.

• Throughout this paper we used database images as query images, but from a practical usability perspective, the query image may also come from outside the CBIR database.
10 Conclusion
This paper presents a novel method to gradually improve the retrieval accuracy of CBIR by collecting user feedback and grouping images, and also uses this feedback information to further improve the 0-th iteration retrieval accuracy.
Table 1: Overall, In Sample, and Out of Sample Precision (%) for each sampled percentage (x) of query images
References

[1] Smarajit Bose et al. "A Hybrid Approach for Improved Content-based Image Retrieval using Segmentation". Feb. 2015.

[2] Gita Das, Sid Ray, and Campbell Wilson. "Feature Re-Weighting in Content-Based Image Retrieval". In: Proceedings of the 5th International Conference on Image and Video Retrieval, CIVR'06. Tempe, AZ: Springer-Verlag, 2006, pp. 193-200. ISBN: 3540360182. URL: https://doi.org/10.1007/11788034_20.

[3] Diederik P. Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". arXiv:1412.6980, Dec. 2014.

[4] Gregory R. Koch. "Siamese Neural Networks for One-Shot Image Recognition". 2015.

[5] Li Fei-Fei, R. Fergus, and P. Perona. "Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories". June 2004, pp. 178-178.

[6] Subhadip Maji and Smarajit Bose. "CBIR using features derived by Deep Learning". arXiv:2002.07877, Feb. 2020.

[7] Henning Müller et al. "A Review of Content-Based Image Retrieval Systems in Medical Applications - Clinical Benefits and Future Directions". In: International Journal of Medical Informatics 73 (Mar. 2004), pp. 1-23.

[8] Michael Ortega-Binderberger and Sharad Mehrotra. "Relevance Feedback in Multimedia Databases". May 2003.

[9] Yong Rui, Thomas S. Huang, and Shih-Fu Chang. "Image Retrieval: Current Techniques, Promising Directions, and Open Issues". In: Journal of Visual Communication and Image Representation 10 (1999), pp. 39-62.

[10] Mark Sandler et al. "MobileNetV2: Inverted Residuals and Linear Bottlenecks". arXiv:1801.04381, Jan. 2018.

[11] A. Yoshitaka and T. Ichikawa. "A survey on content-based retrieval for multimedia databases". In: IEEE Transactions on Knowledge and Data Engineering. ISSN: 2326-3865.

[12] Hongjiang Zhang and Zhong Su. "Relevance Feedback in CBIR". In: Visual and Multimedia Information Management: IFIP TC2/WG2.6 Sixth Working Conference on Visual Database Systems, May 29-31, 2002, Brisbane, Australia. Ed. by Xiaofang Zhou and Pearl Pu. Boston, MA: Springer US, 2002, pp. 21-35. ISBN: 978-0-387-35592-4. URL: https://doi.org/10.1007/978-0-387-35592-4_3.

[13] Xiang Zhou and Thomas Huang. "Relevance feedback in image retrieval: A comprehensive review". In: Multimedia Systems. DOI: 10.1007/s00530-002-0070-3.