Iconographic Image Captioning for Artworks
Eva Cetinic, Rudjer Boskovic Institute, Bijenicka cesta 54, 10000 Zagreb, Croatia, [email protected]
Abstract.
Image captioning implies automatically generating textual descriptions of images based only on the visual input. Although this has been an extensively addressed research topic in recent years, not many contributions have been made in the domain of art historical data. In this particular context, the task of image captioning is confronted with various challenges such as the lack of large-scale datasets of image-text pairs, the complexity of meaning associated with describing artworks and the need for expert-level annotations. This work aims to address some of those challenges by utilizing a novel large-scale dataset of artwork images annotated with concepts from the Iconclass classification system designed for art and iconography. The annotations are processed into clean textual descriptions to create a dataset suitable for training a deep neural network model on the image captioning task. Motivated by the state-of-the-art results achieved in generating captions for natural images, a transformer-based vision-language pre-trained model is fine-tuned using the artwork image dataset. Quantitative evaluation of the results is performed using standard image captioning metrics. The quality of the generated captions and the model's capacity to generalize to new data is explored by employing the model on a new collection of paintings and performing an analysis of the relation between commonly generated captions and the artistic genre. The overall results suggest that the model can generate meaningful captions that exhibit a stronger relevance to the art historical context, particularly in comparison to captions obtained from models trained only on natural image datasets.
Keywords: image captioning · vision-language models · fine-tuning · visual art.

Automatically generating meaningful and accurate image descriptions is a challenging task that has been extensively addressed in recent years. This task implies recognizing objects and their relationships in an image and generating syntactically and semantically correct textual descriptions. In resolving this task, significant progress has been made using deep learning-based techniques. A prerequisite for this kind of approach are large datasets of semantically related image and sentence pairs. In the domain of natural images, several well-known large-scale datasets are commonly used for caption generation, such as the MS COCO [22], Flickr30 [41] and Visual Genome [20] datasets.
Although the availability of such datasets enabled remarkable results in generating high-quality captions for photographs of various objects and scenes, the task of generating image captions still remains difficult for domain-specific image collections. In particular, in the context of the cultural heritage domain, generating image captions is an open problem with various challenges. One of the major obstacles is the lack of a truly large-scale dataset of artwork images paired with adequate descriptions. It is also relevant to address what kind of description would be regarded as "adequate" for a particular purpose. Considering for instance Erwin Panofsky's three levels of analysis [25], we can distinguish the "pre-iconographic" description, the "iconographic" description and the "iconologic" interpretation as possibilities of aligning semantically meaningful, yet very different textual descriptions with the same image. While captions of natural images usually function on the level of "pre-iconographic" descriptions, which implies simply listing the elements that are depicted in an image, for artwork images this type of description represents only the most basic level of visual understanding and is often not considered to be of great interest.

In the context of artwork images, it would be more interesting to generate "iconographic" captions that capture the subject and the symbolic relations between objects. Creating a dataset for such a complex task requires expert knowledge in the process of collecting sentence-based descriptions of images. There have been some attempts to create such datasets, but those existing datasets consist only of a few thousand images and are therefore not suitable to train deep neural models in the current state-of-the-art setting for image captioning. However, there are several existing large-scale artwork collections that associate images with keywords and specific concepts. The idea of this work is to use a concatenation of the concept descriptions associated with an image as textual input for training an image captioning model. Recently, an interesting large-scale artwork dataset has been published under the name "Iconclass AI Test Set" [27]. This dataset represents a collection of various artwork images assigned alphanumeric classification codes that correspond to notations from the Iconclass system [9]. Iconclass is a classification system designed for art and iconography and is widely accepted by museums and art institutions as a tool for the description and retrieval of subjects represented in images. Although the "Iconclass AI Test Set" is not structured primarily as an image captioning dataset, each code is paired with its "textual correlate" - a description of the iconographic subject of the particular Iconclass notation. Therefore, the main intention of this work is to extract and preprocess the given annotations into clean textual descriptions and create the "Iconclass Caption" dataset. This dataset is then used to fine-tune a pre-trained unified vision-language model on the downstream task of image captioning [42]. Transformer-based vision-language pre-trained models currently represent the leading approach for solving a variety of tasks at the intersection of computer vision and natural language processing.
This paper represents a first attempt to employ the aforementioned approach on a collection of artwork images with the goal of generating image captions relevant in the context of art history.
The availability of large collections of digitized artwork images led to an increase of interest in the employment of deep learning-based techniques for a variety of different tasks. Research in this area most commonly focuses on addressing problems related to computer vision in the context of art historical data, such as image classification [4, 29], visual link retrieval [3, 31], analysis of visual patterns and conceptual features [6, 11, 14, 33], object and face detection [10, 36], pose and character matching [19, 24] and computational aesthetics [5, 18, 30].

Recently, however, there has been a surge of interest in topics that deal not only with the visual, but with both the visual and textual modalities of artwork collections. The pioneering works in this research area mostly addressed the task of multi-modal retrieval. In particular, [15] introduced the SemArt dataset, a collection of fine-art images associated with textual comments, with the aim to map the images and their descriptions into a joint semantic space. They compare different combinations of visual and textual encodings, as well as different methods of multi-modal transformation. In projecting the visual and textual encodings into a common multimodal space, they achieve the best results by applying a neural network trained with a cosine margin loss on ResNet50 features as visual encodings and bag-of-words as textual encodings. The task of creating a shared embedding space was also addressed in [1], where the authors introduce a new visual semantic dataset named BibleVSA, a collection of miniature illustration and commentary text pairs, and explore supervised and semi-supervised approaches to learning cross-references between textual and visual information in documents. In [35] the authors present the Artpedia dataset consisting of 2930 images annotated with visual and contextual sentences. They introduce a cross-modal retrieval model that projects images and sentences into a common embedding space and discriminates between contextual and visual sentences of the same image. A similar extension of this approach to other artistic datasets was presented in [8].

Besides multi-modal retrieval, another emerging topic of interest is visual question answering (VQA). In [2] the authors annotated a subset of the Artpedia dataset with visual and contextual question-answer pairs and introduced a question classifier that discriminates between visual and contextual questions, as well as a model that is able to answer both types of questions. In [16] the authors introduce a novel dataset, AQUA, which consists of automatically generated visual and knowledge-based QA pairs, and also present a two-branch model where the visual and knowledge questions are handled independently.

A limited number of studies contributed to the task of generating descriptions of artwork images using deep neural networks, and all of them rely on employing the encoder-decoder architecture-based image captioning approach. For example, [34] proposes an encoder-decoder framework for generating captions of artwork images, where the encoder (a ResNet18 model) extracts the input image feature representation and the artwork type representation, while the decoder is a long short-term memory (LSTM) network.
They introduce two image captioning datasets referring to ancient Egyptian art and ancient Chinese art, which contain 17,940 and 7,607 images respectively. Another very recent work [17] presented a novel captioning dataset for art historical images consisting of 4000 images across 9 iconographies, along with a description for each image consisting of one or more paragraphs. They used this dataset to fine-tune different variations of image captioning models based on the well-known encoder-decoder approach introduced in [39].

Influenced by the success of utilizing large-scale pre-trained language models like BERT [13] for different tasks related to natural language processing, there has recently been a surge of interest in developing Transformer-based vision-language pre-trained models. Vision-language models are designed to learn joint representations that combine information from both modalities and the alignments across those modalities. It has been shown that models pre-trained on intermediate tasks with unsupervised learning objectives using large datasets of image-text pairs achieve remarkable results when adapted to different downstream tasks such as image captioning, cross-modal retrieval or visual question answering [7, 23, 37, 42]. However, to the best of our knowledge, this approach has until now not been explored for tasks in the domain of art historical data.
In our experiment we use a subset of 86,530 valid images from the "Iconclass AI Test Set" [27]. This is a very diverse collection of images sampled from the Arkyves database. It includes images of various types of artworks such as paintings, posters, drawings, prints, manuscript pages, etc. Each image is associated with one or more codes linked to labels from the Iconclass classification system. The authors of the "Iconclass AI Test Set" provide a JSON file with the list of images and corresponding codes, as well as an Iconclass Python package to perform analysis and extract information from the assigned classification codes. To extract textual descriptions of images for the purpose of this work, the English textual descriptions of each code associated with an image are concatenated. Further preprocessing of the descriptions includes removing text in brackets and some recurrent uppercased dataset-specific codes. In this dataset, the text in brackets most commonly includes very specific named entities, which are considered a noisy input in the image captioning task. Therefore, when preprocessing the textual items, all the text in brackets is removed, even at the cost of sometimes removing useful information (a minimal sketch of this cleaning step is given below). Figure 1 shows several example images from the Iconclass dataset and their corresponding descriptions before and after preprocessing. Depending on the number of codes associated with each image, the final textual descriptions can vary significantly in length. Also, because of the specific properties of this dataset, the image descriptions are not structured as sentences but as lists of comma-separated words and phrases.

Original description:
Madonna: i.e. Mary with the Christ-child, flowers: rose, historical persons (portraits and scenes from the life) (+ half-length portrait)
Clean description:
Madonna: i.e. Mary with the Christ-child, flowers: rose, historical persons.

Original description: adult woman, manuscript of musical score, writer, poet, author (+ portrait, self-portrait of artist), pen, ink-well, paper (writing material), codex, inscription, historical events and situations (1567), historical person (MONTENAY, Georgette de) - BB - woman - historical person (MONTENAY, Georgette de) portrayed alone, proverbs, sayings, etc. (O PLUME EN LA MAIN NON VAINE)

Clean description: adult woman, manuscript of musical score, writer, poet, author, pen, ink-well, paper, codex, inscription, historical events and situations, historical person, woman - historical person portrayed alone, proverbs, sayings.

Original description: plants and herbs (HELLEBORINE), plants and herbs (LUPINE),

Clean description: plants and herbs.
Fig. 1.
Example images from the Iconclass dataset and their corresponding descriptions before and after preprocessing.
Because of this type of structure, and because of having only one reference caption for each image, the Iconclass Caption dataset is not a standard image captioning dataset. However, having in mind the difficulties of obtaining adequate textual descriptions for images of artworks, this dataset can be considered a valuable source of image-text pairs in the current context, particularly because of the large number of annotated images, which enables training deep neural models. In the experimental setting, a subset of 76k items is used for training the model, 5k for validation and 5k for testing.
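To make the preprocessing step described above concrete, the following minimal sketch illustrates how the concatenated Iconclass label texts could be cleaned. The helper name, the example codes and the exact regular expressions are illustrative assumptions and do not reproduce the original implementation.

```python
import re

def clean_description(codes, iconclass_texts):
    """Concatenate the English Iconclass labels of an image and strip the
    noisy parts described above: bracketed text and recurrent upper-cased
    dataset-specific codes.  `iconclass_texts` maps a notation code to its
    English textual correlate (obtainable e.g. with the Iconclass Python
    package); helper name and regexes are illustrative assumptions."""
    parts = [iconclass_texts[c] for c in codes if c in iconclass_texts]
    text = ", ".join(parts)
    text = re.sub(r"\([^)]*\)", "", text)       # drop bracketed text (mostly named entities)
    text = re.sub(r"\b[A-Z]{2,}\b", "", text)   # drop upper-cased dataset-specific codes
    text = re.sub(r"\s+", " ", text).strip(" ,")
    return text + "."

# Example corresponding to the first item of Figure 1 (the codes are illustrative):
texts = {
    "A1": "Madonna: i.e. Mary with the Christ-child",
    "A2": "flowers: rose",
    "A3": "historical persons (portraits and scenes from the life) (+ half-length portrait)",
}
print(clean_description(["A1", "A2", "A3"], texts))
# -> Madonna: i.e. Mary with the Christ-child, flowers: rose, historical persons.
```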
In this work, the unified vision-language pre-training model (VLP) introduced in [42] is employed. This model is denoted as "unified" because the same pre-trained model can be fine-tuned for different types of tasks. Those tasks include both vision-language generation (e.g. image captioning) and vision-language understanding (e.g. visual question answering). The model is based on an encoder-decoder architecture comprised of 12 Transformer blocks. The model input consists of the image embedding, the text embedding and three special tokens that indicate the start of the image input, the boundary between visual and textual input, and the end of the textual input. The image input consists of 100 object-classification-aware region features extracted using the Faster R-CNN model [28] pre-trained on the Visual Genome dataset [20]. For a more detailed description of the overall VLP framework and pre-training objectives, the reader is referred to [42]. The experiments introduced in this work employ as the base model the VLP model pre-trained on the Conceptual Captions dataset [32] using the sequence-to-sequence objective. This base model is fine-tuned on the Iconclass Caption dataset using the recommended fine-tuning configuration, namely training with a constant learning rate of 3e-5 for 30 epochs. Because the descriptions in the Iconclass Caption dataset are on average longer than captions in other caption datasets, when fine-tuning the VLP model the maximum number of tokens in the input and target sequence is modified from the default value (20) to a higher value (100).
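As an illustration of the input structure described above, the sketch below assembles the visual and textual tokens into a single sequence. The hidden size, vocabulary size and special-token handling are assumptions made for illustration only and do not reproduce the actual VLP implementation.

```python
import torch

# Illustrative dimensions (assumptions, not the original VLP code):
HIDDEN = 768
NUM_REGIONS = 100        # object-classification-aware region features per image
MAX_TEXT_TOKENS = 100    # raised from the default 20 for the longer Iconclass captions

region_feats = torch.randn(NUM_REGIONS, 2048)   # stand-in for Faster R-CNN region features
img_proj = torch.nn.Linear(2048, HIDDEN)        # project regions into the embedding space
text_emb = torch.nn.Embedding(30522, HIDDEN)    # BERT-style token embeddings
special = torch.nn.Embedding(3, HIDDEN)         # start-of-image, image/text boundary, end-of-text

caption_ids = torch.randint(0, 30522, (MAX_TEXT_TOKENS,))   # stand-in for a tokenized caption

inputs = torch.cat([
    special(torch.tensor([0])),   # token marking the start of the image input
    img_proj(region_feats),       # 100 visual tokens
    special(torch.tensor([1])),   # token marking the boundary between visual and textual input
    text_emb(caption_ids),        # textual tokens
    special(torch.tensor([2])),   # token marking the end of the textual input
], dim=0)

print(inputs.shape)   # torch.Size([203, 768]): 1 + 100 + 1 + 100 + 1 tokens
# Fine-tuning then follows the recommended configuration described above:
# a constant learning rate of 3e-5 for 30 epochs with the seq2seq objective.
```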
To quantitatively evaluate the generated captions, standard language evaluation metrics for image captioning on the Iconclass Caption test set are used. Those include the standard four BLEU metrics [26], METEOR [12], ROUGE [21] and CIDEr [38]. BLEU, ROUGE and METEOR are metrics that originate from machine translation tasks, while CIDEr was specifically developed for image caption evaluation. The BLEU metrics represent n-gram precision scores multiplied by a brevity penalty factor to assess the length correspondence of candidate and reference sentences. ROUGE is a metric that measures the recall of n-grams and therefore rewards long sentences. Specifically, ROUGE-L measures the longest matching sequence of words between a pair of sentences. METEOR represents the harmonic mean of precision and recall of unigram matches between sentences and additionally includes synonym and paraphrase matching. CIDEr measures the cosine similarity between TF-IDF weighted n-grams of the candidate and the reference sentences. The TF-IDF weighting of n-grams reduces the score of frequent n-grams and assigns higher scores to distinctive words. The results obtained using these metrics are presented in Table 1.
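For reference, BLEU-N combines the modified n-gram precisions $p_n$ with a brevity penalty BP in the standard way (this is the general formulation, not specific to this work):

\[
\mathrm{BLEU\text{-}N} = \mathrm{BP}\cdot\exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r,\\
e^{\,1 - r/c} & \text{if } c \le r,
\end{cases}
\]

where $w_n = 1/N$ are uniform weights, $c$ is the total length of the candidate captions and $r$ the effective reference length.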
Table 1. Evaluation metrics (BLEU 1-4, METEOR, ROUGE and CIDEr) on the Iconclass Caption test set.

Although the current results cannot be compared with any other work, because the experiments are performed on a new and syntactically and semantically different dataset, the quantitative evaluation results are included to serve as a benchmark for future work. In comparison with current state-of-the-art caption evaluation results on natural image datasets (e.g. BLEU-4 ≈ 37 for COCO and ≈ 30 for Flickr30) [40, 42], the BLEU scores are lower for the Iconclass dataset. A similar behaviour is also reported in another study addressing iconographic image captioning [17]. On the other hand, the CIDEr score is quite high in comparison to the one reported for natural image datasets (e.g. CIDEr ≈ 68 for Flickr30) [40, 42].

However, it remains questionable how adequate these metrics are for assessing the overall quality of the captions in this particular context. All of the reported metrics mostly measure the word overlap between generated and reference captions. They are not designed to capture the semantic meaning of a sentence and therefore often correlate poorly with human judgement. Also, they are not appropriate for measuring very short descriptions, which are quite common in the Iconclass Caption dataset. Moreover, they do not address the relation between the generated caption and the image content, but express only the similarity between the original and generated textual descriptions. A generated caption could be semantically aligned with the image content but represent a different version of the original caption and therefore receive very low metric scores. Figure 2 presents several such examples from the Iconclass Caption test set. Those examples indicate that the existing evaluation metrics are not very suitable for assessing the relevance of generated captions for this particular dataset. Therefore, a qualitative analysis of the results is also required in order to better understand the potential contributions and drawbacks of the proposed approach.
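The reported scores can be reproduced with the standard COCO caption evaluation tooling. The sketch below assumes the pycocoevalcap package is available and uses made-up example captions; the Iconclass Caption dataset provides a single reference per image.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # requires a Java runtime

# Reference (ground-truth) and generated captions, keyed by image id (example data).
gts = {"img1": ["sea ."], "img2": ["arms , fingers ."]}
res = {"img1": ["sailing ship , sailing boat ."], "img2": ["hand ."]}

scorers = [(Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
           (Meteor(), "METEOR"), (Rouge(), "ROUGE-L"), (Cider(), "CIDEr")]

for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)   # corpus-level score(s)
    if isinstance(name, list):                   # Bleu returns the four n-gram scores
        for n, s in zip(name, score):
            print(f"{n}: {s:.3f}")
    else:
        print(f"{name}: {score:.3f}")
```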
For the purpose of qualitative analysis, examples of images and generated captions on two datasets are analyzed. One is the test set of the Iconclass Caption dataset, which serves for direct comparison between the generated captions and the ground-truth descriptions. The other dataset is a subset of the WikiArt painting collection, which does not include textual descriptions of images but has a broad set of labels associated with each image. This enables the study of the relation between generated captions and other concepts, e.g. the genre categorization of paintings, and gives an insight into how well the model generalizes to a different artwork dataset.
Ground-truth: sea.
Caption: sailing - ship, sailing - boat.
BLEU 1: 2.49e-16, BLEU 2: 2.88e-16, BLEU 3: 3.46e-16, BLEU 4: 4.51e-16, METEOR: 0.16, ROUGE: 0.0, CIDEr: 0.0
Ground-truth: apostle, unspecified, key.
Caption: head turned to the right, historical persons.
BLEU 1: 1.43e-16, BLEU 2: 1.54e-16, BLEU 3: 1.68e-16, BLEU 4: 1.85e-16, METEOR: 0.0, ROUGE: 0.0, CIDEr: 0.0
Ground-truth: arms, fingers.
Caption: hand.
BLEU 1: 3.67e-16, BLEU 2: 1.16e-11, BLEU 3: 3.67e-10, BLEU 4: 2.06e-09, METEOR: 0.0, ROUGE: 0.0, CIDEr: 0.0
Ground-truth: palace, king, New Testament, adoration of the kings: the Wise Men present their gifts to the Christ-child.
Caption:
New Testament.
BLEU 1: 0.00055, BLEU 2: 0.00055, BLEU 3: 5.53e-16, BLEU 4: 5.53e-07, METEOR: 0.0552, ROUGE: 0.184, CIDEr: 0.051
Fig. 2.
Examples of images from the Iconclass Caption test set, their corresponding ground-truth and generated captions, and the values of the evaluation metrics for those examples.
Iconclass Caption test set
To gain a better insight into the generated image captions, several examples are shown in Figure 3. The presented image-text pairs are chosen to demonstrate both successful examples (the left column) and failed examples (the right column) of generated captions.

Analysis of the failed examples indicates an existing "logic" in those erroneous captions, as well as demonstrates underlying biases within the dataset. For instance, in the Iconclass Caption training set there are more than a thousand examples that include the phrase "New Testament" in the description. Therefore, images that include structurally similar scenes, particularly from classical history and mythology, are sometimes wrongly attributed as depicting a scene from the New Testament.
Ground-truth: historiated initial , printed historiated initial , printed matter , Roman script ; scripts based on the Roman alphabet
Caption: historiated initial , printed historiated initial , printed matter , Roman script ; scripts based on the Roman alphabet
Ground-truth: device ,printed matter.
Caption: historiated initial , printed historiated initial , printed matter , Roman script ; scripts based on the Roman alphabet.
Ground-truth: New Testament, Christ in the house of Martha and Mary , Martha complains to Christ about Mary.
Caption:
New Testament.
Ground-truth: hoofed animals: deer, Classical Mythology and Ancient History, the sacrifice of Iphigenia.
Caption:
New Testament , the assumption of Mary : she is borne into heaven by angels.
Ground-truth: crustaceans: lobster, molluscs: oyster, glass, rummer, table-cloth.
Caption: crustaceans : lobster , molluscs : oyster , fruits : lemon , fruits : orange , fruits : peach , fruits : walnut , table - cloth , fruit , fruit , grape .
Ground-truth: glass, rummer, meat; still life of meat, tools, aids, implements crafts and industries: knife.
Caption: crustaceans : lobster , molluscs : oyster , fish ; still life of fishes.
Ground-truth: founder of the Order of Friars Minor , Francis of Assisi; possible attributes: book, crucifix, lily, skull, stigmata.
Caption: male saints.
Ground-truth: saints, St. Jerome as Doctor of the Latin Church in his study with book, pen and ink; lion and cardinal's hat beside him, study; 'studiolo'; library.
Caption: saints , the penitent harlot Mary Magdalene ; possible attributes : book , crown , crown of thorns , crucifix , jar of ointment , mirror , musical instrument , palm - branch , rosary , scourge , book.
Fig. 3.
Examples of images from the Iconclass Caption test set and their corresponding ground-truth and generated captions. Examples shown in the left column represent successfully generated captions, while examples shown in the right column demonstrate wrongly generated captions.

This signifies the importance of balanced examples in the training dataset and indicates directions for possible future improvements. The Iconclass dataset is a collection of very diverse images and, apart from the Iconclass classification codes, there are currently no other metadata available for the images. Therefore, it is difficult to perform an in-depth exploratory analysis of the dataset and of the generated results in regard to attributes relevant in the context of art history, such as the date of creation, style, genre, etc. For this reason, the fine-tuned image captioning model is employed on a novel artwork dataset - a subset of the WikiArt collection of paintings.
WikiArt dataset
In order to explore how the model generalizes to a new artwork dataset, a subset of 52,562 images of paintings from the WikiArt collection is used. Because images in the WikiArt dataset are annotated with a broad set of labels (e.g. style, genre, artist, technique, date of creation, etc.), the study of the relation between the generated captions and those labels is performed as one method of qualitative assessment. Figure 4 shows the distribution of the most commonly generated descriptions in relation to four different genres; a sketch of this analysis is given below. From this basic analysis it is obvious that the generated captions are meaningful in relation to the content and the genre classification of the images.

To understand the contribution of the proposed model in the context of iconographic image captioning, it is interesting to compare the Iconclass captions with captions obtained from models trained on natural images. For this purpose, two models of the same architecture, but fine-tuned on the Flickr30 and MS COCO datasets, are used. Figure 5 shows several examples from the WikiArt dataset with the corresponding Iconclass, Flickr and COCO captions. It is evident that the other two models generate results that are meaningful in relation to the image content but do not necessarily contribute to producing more fine-grained and context-aware descriptions.

Fig. 4. Distribution of the most commonly generated descriptions in relation to four different genres in the WikiArt dataset.
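The genre analysis behind Figure 4 amounts to a simple frequency count. The record format and variable names below are illustrative assumptions; only the idea of counting the most frequent generated captions per WikiArt genre comes from the text.

```python
from collections import Counter

# Hypothetical record format: one (genre, generated_caption) pair per WikiArt image.
records = [
    ("portrait", "historical persons."),
    ("portrait", "head of a woman."),
    ("landscape", "landscape with trees."),
    ("religious painting", "New Testament."),
    # ... one entry per painting (52,562 in the experiment described above)
]

per_genre = {}
for genre, caption in records:
    per_genre.setdefault(genre, Counter())[caption] += 1

for genre, counts in per_genre.items():
    # the most frequently generated captions per genre, cf. Figure 4
    print(genre, counts.most_common(5))
```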
This paper introduces a novel model for generating iconographic image captions. This is done by utilizing a large-scale dataset of artwork images annotated with concepts from the Iconclass classification system designed for art and iconography. To the best of our knowledge, this dataset has not yet been widely used in the computer vision community. Within the scope of this work, the available annotations are processed into clean textual descriptions and the existing dataset is transformed into a collection of suitable image-text pairs. The dataset is used to fine-tune a transformer-based vision-language model. For this purpose, object-classification-aware region features are extracted from the images using the Faster R-CNN model. The base model in our fine-tuning experiment is an existing model, called the VLP model, that is pre-trained on a natural image dataset on intermediate tasks with unsupervised learning objectives. Fine-tuning pre-trained vision-language models represents the current state-of-the-art approach for many different multimodal tasks.

The captions generated by the fine-tuned models are evaluated using standard image captioning metrics. Unlike other image captioning datasets, which usually contain several short sentences per image, the ground-truth descriptions of the Iconclass dataset vary significantly in length. Because of the specific properties of the Iconclass dataset, standard image captioning evaluation metrics are not very informative regarding the relevance and appropriateness of the generated captions in relation to the image content. Therefore, the quality of the generated captions and the model's capacity to generalize to new data are further explored by employing the model on another artwork dataset.
Jan van Hemessen, Christ Driving Merchants from the Temple, 1556
Iconclass caption:
New Testament .
Flickr caption:
A painting of a group of people .
Coco caption:
A painting of a group of people dancing .
Giovanni Bellini, Madonna Enthroned Cherishing the Sleeping Child, 1475
Iconclass caption:
Madonna : i . e . Mary with the Christ -child , sitting figure , historical persons .
Flickr caption:
A woman holding a baby .
Coco caption:
A painting of a woman holding a child .
Jan Gossaert, Adam and Eve in Paradise, 1527
Iconclass caption:
Adam and Eve holding the fruit .
Flickr caption:
Four naked men are standing in the mud .
Coco caption:
A couple of men standing next to each other .
Fig. 5.
Examples from the WikiArt dataset with captions generated by models fine-tuned on the Iconclass, Flickr and COCO datasets.

The overall quantitative and qualitative evaluation of the results suggests that the model can generate meaningful captions that capture not only the depicted objects but also the art historical context and the relations between subjects. However, there is still room for significant improvement. In particular, the unbalanced distribution of themes and topics within the training set results in often wrongly identified subjects in the generated image descriptions. Furthermore, the generated textual descriptions are often very short and could serve more as labels than as captions. Nevertheless, the current results show a significant improvement in comparison to captions generated from artwork images using models trained on natural image caption datasets. Further improvement can potentially be achieved by fine-tuning the current model on a smaller dataset with more elaborate ground-truth iconographic captions.
References
1. Baraldi, L., Cornia, M., Grana, C., Cucchiara, R.: Aligning text and document illustrations: towards visually explainable digital humanities. In: 2018 24th International Conference on Pattern Recognition (ICPR). pp. 1097-1102. IEEE (2018)
2. Bongini, P., Becattini, F., Bagdanov, A.D., Del Bimbo, A.: Visual question answering for cultural heritage. arXiv preprint arXiv:2003.09853 (2020)
3. Castellano, G., Vessio, G.: Towards a tool for visual link retrieval and knowledge discovery in painting datasets. In: Italian Research Conference on Digital Libraries. pp. 105-110. Springer (2020)
4. Cetinic, E., Lipic, T., Grgic, S.: Fine-tuning convolutional neural networks for fine art classification. Expert Systems with Applications, 107-118 (2018)
5. Cetinic, E., Lipic, T., Grgic, S.: A deep learning perspective on beauty, sentiment, and remembrance of art. IEEE Access, 73694-73710 (2019)
6. Cetinic, E., Lipic, T., Grgic, S.: Learning the principles of art history with convolutional neural networks. Pattern Recognition Letters, 56-62 (2020)
7. Chen, Y.C., Li, L., Yu, L., Kholy, A.E., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: UNITER: Learning universal image-text representations. arXiv preprint arXiv:1909.11740 (2019)
8. Cornia, M., Stefanini, M., Baraldi, L., Corsini, M., Cucchiara, R.: Explaining digital humanities by aligning images and textual descriptions. Pattern Recognition Letters, 166-172 (2020)
9. Couprie, L.D.: Iconclass: an iconographic classification system. Art Libraries Journal (2), 32-49 (1983)
10. Crowley, E.J., Zisserman, A.: In search of art. In: European Conference on Computer Vision. pp. 54-70. Springer (2014)
11. Deng, Y., Tang, F., Dong, W., Ma, C., Huang, F., Deussen, O., Xu, C.: Exploring the representativity of art paintings. IEEE Transactions on Multimedia (2020)
12. Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation. pp. 376-380 (2014)
13. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
14. Elgammal, A., Liu, B., Kim, D., Elhoseiny, M., Mazzone, M.: The shape of art history in the eyes of the machine. In: 32nd AAAI Conference on Artificial Intelligence, AAAI 2018. pp. 2183-2191. AAAI Press (2018)
15. Garcia, N., Vogiatzis, G.: How to read paintings: semantic art understanding with multi-modal retrieval. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
16. Garcia, N., Ye, C., Liu, Z., Hu, Q., Otani, M., Chu, C., Nakashima, Y., Mitamura, T.: A dataset and baselines for visual question answering on art. arXiv preprint arXiv:2008.12520 (2020)
17. Gupta, J., Madhu, P., Kosti, R., Bell, P., Maier, A., Christlein, V.: Towards image caption generation for art historical data. AI Methods for Digital Heritage, Workshop at KI2020, 43rd German Conference on Artificial Intelligence (2020)
18. Hayn-Leichsenring, G.U., Lehmann, T., Redies, C.: Subjective ratings of beauty and aesthetics: correlations with statistical image properties in western oil paintings. i-Perception (3), 2041669517715474 (2017)
19. Jenicek, T., Chum, O.: Linking art through human poses. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1338-1345. IEEE (2019)
20. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (1), 32-73 (2017)
21. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74-81 (2004)
22. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740-755. Springer (2014)
23. Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems. pp. 13-23 (2019)
24. Madhu, P., Kosti, R., Mührenberg, L., Bell, P., Maier, A., Christlein, V.: Recognizing characters in art history using deep learning. In: Proceedings of the 1st Workshop on Structuring and Understanding of Multimedia heritAge Contents. pp. 15-22 (2019)
25. Panofsky, E.: Studies in Iconology. Humanistic Themes in the Art of the Renaissance. New York: Harper and Row (1972)
26. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311-318 (2002)
27. Posthumus, E.: Brill Iconclass AI Test Set (2020)
28. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems. pp. 91-99 (2015)
29. Sandoval, C., Pirogova, E., Lech, M.: Two-stage deep learning approach to the classification of fine-art paintings. IEEE Access, 41770-41781 (2019)
30. Sargentis, G., Dimitriadis, P., Koutsoyiannis, D., et al.: Aesthetical issues of Leonardo da Vinci's and Pablo Picasso's paintings with stochastic evaluation. Heritage (2), 283-305 (2020)
31. Seguin, B., Striolo, C., Kaplan, F., et al.: Visual link retrieval in a database of paintings. In: European Conference on Computer Vision. pp. 753-767. Springer (2016)
32. Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556-2565 (2018)
33. Shen, X., Efros, A.A., Aubry, M.: Discovering visual patterns in art collections with spatially-consistent feature learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9278-9287 (2019)
34. Sheng, S., Moens, M.F.: Generating captions for images of ancient artworks. In: Proceedings of the 27th ACM International Conference on Multimedia. pp. 2478-2486 (2019)
35. Stefanini, M., Cornia, M., Baraldi, L., Corsini, M., Cucchiara, R.: Artpedia: A new visual-semantic dataset with visual and contextual sentences in the artistic domain. In: International Conference on Image Analysis and Processing. pp. 729-740. Springer (2019)
36. Strezoski, G., Worring, M.: OmniArt: a large-scale artistic benchmark. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (4), 1-21 (2018)
37. Tan, H., Bansal, M.: LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019)
38. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4566-4575 (2015)
39. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3156-3164 (2015)
40. Xia, Q., Huang, H., Duan, N., Zhang, D., Ji, L., Sui, Z., Cui, E., Bharti, T., Zhou, M.: XGPT: Cross-modal generative pre-training for image captioning. arXiv preprint arXiv:2003.01473 (2020)
41. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014)
42. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language pre-training for image captioning and VQA. In: Proceedings of the AAAI Conference on Artificial Intelligence (2020)