Understanding Art through Multi-Modal Retrieval in Paintings
Noa Garcia, Benjamin Renoust, Yuta Nakashima
Institute for Datability Science, Osaka University, Japan
{noagarcia, renoust, n-yuta}@ids.osaka-u.ac.jp

Abstract
In computer vision, visual arts are often studied from a purely aesthetic perspective, mostly by analysing the visual appearance of an artistic reproduction to infer its style, its author, or its representative features. In this work, however, we explore art from both a visual and a language perspective. Our aim is to bridge the gap between the visual appearance of an artwork and its underlying meaning by jointly analysing its aesthetics and its semantics. We introduce the use of multi-modal techniques in the field of automatic art analysis by 1) collecting a multi-modal dataset with fine-art paintings and comments, and 2) exploring robust visual and textual representations for artistic images.
1. Introduction
The large-scale digitisation of artworks from collections all over the world has opened the opportunity to study art from a computer vision perspective, by building tools to help in the conservation and dissemination of cultural heritage. Some of the most promising work in this direction involves the automatic analysis of paintings, in which computer vision techniques are applied to study the content [5] or the style [4, 18] of a specific piece of art, or to classify its attributes [16, 15]. In this way, art has mostly been studied from a visual perspective [2, 10, 17, 19, 14], and less attention has been paid to automatically analysing the underlying meaning of each painting. In this work, we aim to bridge the gap between visual analysis and high-level understanding of art by proposing robust language and vision representations for multi-modal retrieval in paintings.

We first introduce a multi-modal dataset for the visual arts, in which each image of a painting is associated with an artistic comment (Figure 1). Differently from multi-modal datasets of natural images, such as VQA [1], Visual Genome [12], and MS-COCO [13], the interpretation of art is strongly related to the artistic context of each artwork. This peculiarity is observed in the proposed dataset both in terms of images, through the use of style and composition, and in terms of language, through the use of references.
Figure 1. Examples of paintings and comments in the SemArt dataset.

To leverage these differences and study art from a semantics perspective, we propose to enhance robust visual and language representations with artistic attributes. The enhanced representations are projected into a multi-modal artistic space in which image and text coexist. By fine-tuning the multi-modal representations in the art domain, paintings and comments that are semantically similar are represented closer together than dissimilar samples.

The quality of the proposed multi-modal artistic space is evaluated as a retrieval task, in which, given a painting image, the most representative comment from the collection must be found, and vice versa. Multi-modal retrieval allows us to discriminate whether the language and visual representations capture sufficient artistic insight to match corresponding paintings and comments together. In the evaluation, our method achieves results only 0.059 below human accuracy.
Figure 2. Proposed visual and language representations for multi-modal retrieval in art.
2. SemArt Dataset
Existing datasets in art analysis, such as PRINTART [3], Painting-91 [11], Rijksmuseum [16], or Art500k [15], are mostly annotated with attribute labels, such as author, style, or timeframe. Although this information is crucial in the analysis of visual arts, it does not provide enough insight for understanding the high-level semantics of fine-art paintings. To jointly study language and vision in art, we introduce SemArt, a dataset for semantic art understanding, available at http://noagarciad.com/SemArt/.

SemArt contains 21,384 reproductions of European paintings collected from the Web Gallery of Art, randomly split into training, validation, and test sets with 19,244, 1,069, and 1,069 samples, respectively. Each image is annotated with its main attributes (author, title, date, technique, type, school, and timeframe) and with a natural language comment. Interestingly, comments involve not only a description of the elements in the scene but also references to its technique, author, or context. Some examples are shown in Figure 1, and a complete analysis of the dataset can be found in [7].
3. Multi-Modal Representations
To jointly represent aesthetics and semantics in art, we propose to project robust visual and language representations, enhanced with artistic attributes, into a multi-modal artistic space, as depicted in Fig. 2. In total, we combine four different representations, which are described below.
Language representation. The language representation captures the high-level semantics of artworks by encoding both titles and artistic comments. Titles are encoded as a term frequency-inverse document frequency (tf-idf) vector, $v_{tit} \in \mathbb{R}^{N_t}$, with $N_t$ being the size of the title vocabulary, built from the alphabetic words in the training-set titles. Comments are encoded as another tf-idf vector, $v_{com} \in \mathbb{R}^{N_c}$, with $N_c$ being the size of the comment vocabulary, built from the alphabetic words occurring at least ten times in the training set. The language representation is obtained as $v_{lang} = v_{tit} \oplus v_{com}$, where $\oplus$ denotes vector concatenation.
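As an illustration, the tf-idf encodings described above could be computed with scikit-learn as in the sketch below. The toy titles and comments and the variable names are ours; note that min_df thresholds document frequency, so it only approximates the paper's ten-occurrence filter on the real corpus.

```python
# Sketch of the tf-idf language representation with scikit-learn.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_titles = ["Ships Moored Off a Rocky Coastline", "Water Carriers"]  # toy data
train_comments = [
    "This landscape depicts ships moored off a rocky coastline.",
    "This painting was inspired by the painter's travels in Italy.",
]

# Alphabetic words only, as in the paper.
token_pattern = r"(?u)\b[a-zA-Z]+\b"

title_vec = TfidfVectorizer(token_pattern=token_pattern)
# On the full corpus, min_df=10 would stand in for the ten-occurrence
# threshold; kept at 1 here so the toy example runs.
comment_vec = TfidfVectorizer(token_pattern=token_pattern, min_df=1)

v_tit = title_vec.fit_transform(train_titles)      # shape: (n_samples, N_t)
v_com = comment_vec.fit_transform(train_comments)  # shape: (n_samples, N_c)

# v_lang = v_tit (+) v_com, i.e. vector concatenation
v_lang = np.hstack([v_tit.toarray(), v_com.toarray()])
```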
Language attributes. Attributes capture essential information about a painting, such as its painter or its date of creation. We encode the type, school, timeframe, or author labels in the dataset as a one-hot vector, $v_{att} \in \mathbb{R}^{c}$, with $c$ being the number of labels in the attribute. Timeframes are periods of 50 years evenly distributed between 801 and 1900.
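A minimal sketch of this one-hot encoding, assuming a hypothetical label set for the school attribute:

```python
# One-hot attribute encoding v_att for a single painting.
import numpy as np

# Hypothetical label set for the "school" attribute.
schools = ["Dutch", "Flemish", "French", "Italian", "Spanish"]
label_to_idx = {label: i for i, label in enumerate(schools)}

def one_hot(label: str) -> np.ndarray:
    """Encode an attribute label as v_att in R^c, c = number of labels."""
    v_att = np.zeros(len(schools), dtype=np.float32)
    v_att[label_to_idx[label]] = 1.0
    return v_att

print(one_hot("French"))  # [0. 0. 1. 0. 0.]
```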
Visual representation. The visual representation captures the visual appearance of paintings. Painting images are scaled down to 256 pixels per side, randomly cropped into $224 \times 224$ patches, and fed into a ResNet50 [9] without its last fully connected layer, initialised with its standard pre-trained weights. Appearance is then represented by the output of the model as $v_{vis} \in \mathbb{R}^{2048}$.
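A possible implementation of this visual encoder in PyTorch is sketched below; the image path is a placeholder, and using torchvision's ImageNet weights as the "standard pre-trained weights" is our assumption.

```python
# Sketch of the visual representation: ResNet50 with the final fully
# connected layer removed, following the preprocessing described above.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),        # scale down to 256 pixels per side
    transforms.RandomCrop(224),           # random 224x224 patch (training time)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet50(pretrained=True)
resnet.fc = nn.Identity()                 # drop the last fully connected layer
resnet.eval()

image = Image.open("painting.jpg").convert("RGB")   # placeholder path
with torch.no_grad():
    v_vis = resnet(preprocess(image).unsqueeze(0))  # shape: (1, 2048)
```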
Image attributes. From the painting image, we use a contextual network (ContextNet) [6] to predict the artistic attributes. ContextNet is composed of two core modules, as depicted in Fig. 3: a ResNet [9], which obtains the visual information of the image, and a knowledge graph, which captures the contextual relationships of the painting. The visual encoding from the ResNet is fed both into an attribute classifier, an $n$-dimensional fully connected layer with ReLU and softmax, where $n$ is the number of classes of the predicted attribute, and into an encoder module, a 128-dimensional fully connected layer that projects the visual encoding into the knowledge graph space. The knowledge graph is built by connecting the training paintings in SemArt with their attributes, and its nodes are encoded into 128-dimensional graph representations using node2vec [8].
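The knowledge graph construction could look like the following sketch, using networkx and the node2vec package. The example edges are illustrative, and the random-walk parameters are assumptions: the text only specifies the 128-dimensional output.

```python
# Sketch of the ContextNet knowledge graph: training paintings connected to
# their attribute values, embedded with node2vec.
import networkx as nx
from node2vec import Node2Vec

G = nx.Graph()
# Each painting node is linked to its attribute nodes.
G.add_edge("painting:A Saddled Race Horse Tied to a Fence", "author:Horace Vernet")
G.add_edge("painting:A Saddled Race Horse Tied to a Fence", "type:Genre")
G.add_edge("painting:A Saddled Race Horse Tied to a Fence", "timeframe:1801-1850")

# 128-dimensional node embeddings, as in the paper; walk settings are ours.
n2v = Node2Vec(G, dimensions=128, walk_length=30, num_walks=50)
model = n2v.fit(window=5, min_count=1)
graph_embedding = model.wv["author:Horace Vernet"]  # 128-d vector
```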
Figure 3. ContextNet predicts the painting attributes, such as type, school, timeframe, or author, by fine-tuning a ResNet model based on the information captured by an artistic knowledge graph.

At training time, we compute the cross-entropy loss function, $\ell_c$, between the predicted attribute and the real attribute of the painting, and the smooth L1 loss function, $\ell_e$, between the encoder output and the graph embedding from the knowledge graph. The ContextNet weights are learnt by jointly optimising both losses as:

$$L = \lambda_c \sum_{j=1}^{N} \ell_{c,j} + \lambda_e \sum_{j=1}^{N} \ell_{e,j} \qquad (1)$$

where $\lambda_c$ and $\lambda_e$ are parameters that weight the contribution of the classification and the encoder modules, respectively, and $N$ is the number of training samples. To predict the painting attribute, the graph computation part is removed, and the attribute is predicted as the maximum value of the output of the ContextNet classifier, represented as $v_{ctx} \in \mathbb{R}^{c}$, with $c$ being the number of labels in the attribute.
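A minimal PyTorch sketch of this joint objective (Eq. 1); the loss weights are left as placeholders since their values are not given here.

```python
# Sketch of the ContextNet training objective: cross-entropy on the attribute
# classifier plus smooth L1 between the encoder output and the node2vec
# embedding of the painting.
import torch
import torch.nn as nn

cross_entropy = nn.CrossEntropyLoss()
smooth_l1 = nn.SmoothL1Loss()
lambda_c, lambda_e = 1.0, 1.0  # placeholder contribution weights

def contextnet_loss(attr_logits, attr_labels, encoder_out, graph_embeddings):
    """attr_logits: (B, n_classes); encoder_out, graph_embeddings: (B, 128)."""
    l_c = cross_entropy(attr_logits, attr_labels)   # classification loss
    l_e = smooth_l1(encoder_out, graph_embeddings)  # encoder loss
    return lambda_c * l_c + lambda_e * l_e
```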
4. Multi-Modal Projections
To learn the relationship between the visual attributes of the paintings and the semantics of the comments, we project the multi-modal representations of paintings and comments into a multi-modal artistic space. We define the vectors $p \in \mathbb{R}^{2048 + c}$ and $q \in \mathbb{R}^{N_t + N_c + c}$ as the joint representations of vision and image attributes, and of language and language attributes, respectively:

$$p = v_{vis} \oplus v_{ctx}, \qquad q = v_{lang} \oplus v_{att}$$

The two joint representation vectors are projected into a multi-modal artistic space using the non-linear functions $f(\cdot)$ and $g(\cdot)$, respectively, each implemented as a 128-dimensional fully connected layer followed by a tanh activation function and an $\ell_2$-normalisation layer. The whole model, except for the ContextNet, which is previously fine-tuned and frozen, is trained end-to-end using both matching and non-matching pairs of samples from the training set. The loss is computed as a cosine margin loss function:

$$\mathrm{Loss}(p_i, q_j) = \begin{cases} 1 - \mathrm{sim}(f(p_i), g(q_j)), & \text{if } i = j \\ \max(0, \mathrm{sim}(f(p_i), g(q_j)) - \Delta), & \text{if } i \neq j \end{cases}$$

where the sub-indices $i$ and $j$ denote the representations of the $i$-th and $j$-th training samples, $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity between two vectors, and $\Delta$ is the margin. We use the Adam optimiser with a learning rate of 0.0001.
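The projection functions and the cosine margin loss could be sketched in PyTorch as follows; the input dimensions and the margin value are placeholders, as the exact values are not recoverable from the text.

```python
# Sketch of the multi-modal projections f and g and the cosine margin loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Projection(nn.Module):
    """128-d fully connected layer + tanh + l2-normalisation."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, 128)

    def forward(self, x):
        return F.normalize(torch.tanh(self.fc(x)), p=2, dim=-1)

f = Projection(in_dim=2048 + 10)         # p = v_vis (+) v_ctx
g = Projection(in_dim=9000 + 9000 + 10)  # q = v_lang (+) v_att (dims illustrative)

def cosine_margin_loss(p, q, matching: bool, margin: float = 0.2):
    """1 - sim for matching pairs; hinge on sim for non-matching pairs.
    The margin value 0.2 is a placeholder."""
    sim = F.cosine_similarity(f(p), g(q), dim=-1)
    if matching:
        return (1.0 - sim).mean()
    return torch.clamp(sim - margin, min=0.0).mean()
```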
5. Evaluation
To evaluate the quality of language and vision representations in art, we design the Text2Art challenge, based on multi-modal retrieval, in which the aim is to find the most representative painting given an artistic comment, and vice versa, by ranking test samples according to their cosine similarity. In this way, the challenge evaluates whether the models capture enough of the insights and clues provided by the artistic comments to match each comment to the correct painting. Results are reported with standard retrieval metrics: median rank (MR) and recall rate at K (R@K), with K being 1, 5, and 10.

Table 1 reports an ablation study in which different combinations of the proposed representations are used. Vis&Lang uses the visual and language representations only. Att uses the vision, language, and language attribute representations (the attribute is specified in brackets), as well as the output of a ResNet152 attribute classification network as a simpler image attribute representation. Note that the image attribute representation predicted in this way has not been informed by the graph representation from the knowledge graph. Finally, Att&ContextNet considers the four multi-modal representations from Section 3, including the context-aware classifier.

The best results are obtained when the four proposed representations are used and the attributes for both language and image are given by the author. Att&ContextNet (Author) improves results by 37.24% on average with respect to vision and language only, suggesting the importance of considering context when studying art. When compared against Att, the use of ResNet152 instead of the context-aware classifier performs better with the type and school attributes, whereas Att&ContextNet is best with the timeframe and author attributes.

In Table 2, we evaluate the proposed multi-modal art representations against human performance, where human evaluators were asked to choose between 10 paintings according to an artistic comment, title, author, type, school, and timeframe. We performed two evaluations: in the easy setup, the 10 paintings were chosen randomly, whereas in the difficult setup, the 10 paintings shared the same type (i.e., landscape, portrait, etc.). The multi-modal representations using the ContextNet reached values close to human accuracy, outperforming Vis&Lang by 10.67% in the easy task and 9.67% in the difficult task.
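The retrieval metrics above can be computed directly from the learned embeddings; the following sketch assumes l2-normalised embedding matrices whose i-th rows form a matching comment-painting pair.

```python
# Sketch of the Text2Art retrieval evaluation: rank test paintings for each
# comment by cosine similarity, then report median rank and recall at K.
import numpy as np

def retrieval_metrics(text_emb: np.ndarray, img_emb: np.ndarray):
    """text_emb, img_emb: (n, d) l2-normalised embeddings; row i matches row i."""
    sims = text_emb @ img_emb.T        # cosine similarity matrix
    order = np.argsort(-sims, axis=1)  # best match first
    # rank of the ground-truth painting for each comment (1-indexed)
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(sims))])
    return {
        "MR": float(np.median(ranks)),
        **{f"R@{k}": float(np.mean(ranks <= k)) for k in (1, 5, 10)},
    }
```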
                         Text -> Image               Image -> Text
Encoding                 R@1    R@5    R@10   MR     R@1    R@5    R@10   MR
Vis&Lang                 0.164  0.384  0.505  10     0.162  0.366  0.479  12
Att (Type)
Att (School)
Att (Tf)
Att (Author)
Att&ContextNet (Type)
Att&ContextNet (School)
Att&ContextNet (Tf)
Att&ContextNet (Author)

Table 1. Results on the Text2Art Challenge when using vision and language only (Vis&Lang), when adding attributes (Att), and when adding the ContextNet classifier (Att&ContextNet).

Model          Easy     Difficult
Vis&Lang       0.750    0.620
Att&Context    0.830    0.680
Human          0.889    0.714

Table 2. Multi-modal representations against humans.

6. Conclusion
We addressed art understanding by introducing a new dataset of paintings with associated comments and exploring multi-modal representations in art.
Results showed that robust vision and language representations were able to capture the semantic content of paintings relatively well. However, performance was considerably improved when contextual information in the form of a knowledge graph was used to inform the model, which suggests a strong relationship between art and context. As future work, we would like to pursue the use of knowledge graphs to connect vision and language. We could enhance ContextNet with more robust graph embedding techniques, such as StarSpace [20], as well as enhance the language representation with the knowledge graph attributes.
Acknowledgement: This work was partly supported by JSPS KAKENHI Grant No. 18H03264.
References

[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015.
[2] Y. Bar, N. Levy, and L. Wolf. Classification of artistic styles using binarized features derived from a deep neural network. In ECCV Workshops, 2014.
[3] G. Carneiro, N. P. da Silva, A. Del Bue, and J. P. Costeira. Artistic image classification: An analysis on the PRINTART database. In ECCV, 2012.
[4] J. Collomosse, T. Bui, M. J. Wilber, C. Fang, and H. Jin. Sketching with style: Visual search with sketches and aesthetic context. In ICCV, 2017.
[5] E. J. Crowley and A. Zisserman. The art of detection. In ECCV, 2016.
[6] N. Garcia, B. Renoust, and Y. Nakashima. Context-aware embeddings for automatic art analysis. In ICMR, 2019.
[7] N. Garcia and G. Vogiatzis. How to read paintings: Semantic art understanding with multi-modal retrieval. In ECCV Workshops, 2018.
[8] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In KDD, 2016.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[10] S. Karayev, M. Trentacoste, H. Han, A. Agarwala, T. Darrell, A. Hertzmann, and H. Winnemoeller. Recognizing image style. In BMVC, 2014.
[11] F. S. Khan, S. Beigpour, J. Van de Weijer, and M. Felsberg. Painting-91: A large scale database for computational painting categorization. Machine Vision and Applications, 2014.
[12] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32-73, 2017.
[13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[14] D. Ma, F. Gao, Y. Bai, Y. Lou, S. Wang, T. Huang, and L.-Y. Duan. From part to whole: Who is behind the painting? In ACMMM, 2017.
[15] H. Mao, M. Cheung, and J. She. DeepArt: Learning joint representations of visual arts. In ACMMM, 2017.
[16] T. Mensink and J. Van Gemert. The Rijksmuseum challenge: Museum-centered visual recognition. In ICMR, 2014.
[17] B. Saleh and A. Elgammal. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. International Journal for Digital Art History, (2), 2016.
[18] A. Sanakoyeu, D. Kotovenko, S. Lang, and B. Ommer. A style-aware content loss for real-time HD style transfer. In ECCV, 2018.
[19] W. R. Tan, C. S. Chan, H. E. Aguirre, and K. Tanaka. Ceci n'est pas une pipe: A deep convolutional network for fine-art paintings classification. In ICIP, 2016.
[20] L. Y. Wu, A. Fisch, S. Chopra, K. Adams, A. Bordes, and J. Weston. StarSpace: Embed all the things! In AAAI, 2018.