Cosine Similarity of Multimodal Content Vectors for TV Programmes
Saba Nazir, Taner Cagali, Chris Newell, Mehrnoosh Sadrzadeh

Abstract
Multimodal information originates from a variety of sources: audiovisual files, textual descriptions, and metadata. We show how one can represent the content encoded by each individual source using vectors, how to combine the vectors via middle and late fusion techniques, and how to compute the semantic similarities between the contents. Our vectorial representations are built from spectral features and Bags of Audio Words for audio, LSI topics and Doc2vec embeddings for subtitles, and categorical features for metadata. We implement our model on a dataset of BBC TV programmes and evaluate the fused representations to provide recommendations. The late fused similarity matrices significantly improve the precision and diversity of recommendations.
1. Introduction
Ideas put forward by Firth and Harris in the 1930's led to the development of vector representations for words. The original word vectors represented contexts of textual use, and the cosine distances between them, semantic similarities (Turney & Pantel, 2010). Subsequent research extended the vector representation methods from words to sentences and documents; nowadays, these vectors are learnt using neural networks; for a survey see (Jurafsky & Martin, 2019). Recently, textual vector representations have been enriched by other modes of information, such as audio-visual and cognitive (E. Bruni & Baroni, 2014; Kiela & Clark, 2017). The enriched representations are also used in multimodal content-based and hybrid recommendation systems for problems such as cold-start (Oramas et al., 2017; Barkan et al., 2019), e-commerce assortment (Iqbal et al., 2018b) and genre classification (Ekenel & Semela, 2013). A large body of work exists here, with none as extensive or conclusive: e.g. (Yang et al., 2007) only considers tags and titles for textual data, and (Bougiatiotis & Giannakopoulos, 2018) does use full subtitles but does not improve on the metadata-only recommendations.

Computer Science, University College London, UK; Electronic Engineering and Computer Science, Queen Mary University of London, UK; Research and Development, British Broadcasting Corporation, UK. Correspondence to: <[email protected]>, <[email protected]>. Proceedings of the th International Conference on Machine Learning, Vienna, Austria, PMLR 108, 2020. Copyright 2020 by the author(s).

Figure 1. Multimodal Content Recommendations Framework

This paper works with multimodal vector representations for audio and text, and investigates their application to TV recommendations, based on a dataset of 145 BBC TV programmes. On the methodological side, this is the first time that neural (Doc2Vec) and topical (LSI) document vectors are combined with audio (BoAW) vectors and vector representations of metadata. On the experimental side, the late fused representations significantly improve the precision and diversity of the recommendations.
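The cosine similarity underlying all of the comparisons in this paper can be computed directly. A minimal sketch in Python (the function name is illustrative, not from the paper's codebase):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two content vectors.

    Returns 0.0 for zero-norm vectors to avoid division by zero.
    """
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Parallel vectors are maximally similar, orthogonal ones dissimilar.
print(cosine_similarity(np.array([1.0, 2.0, 3.0]),
                        np.array([2.0, 4.0, 6.0])))  # -> 1.0
print(cosine_similarity(np.array([1.0, 0.0]),
                        np.array([0.0, 1.0])))       # -> 0.0
```

The same function applies unchanged to any of the modality vectors described below, since all are compared via angle rather than magnitude.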
2. Multimodal Content Vectors
Our dataset contains 145 BBC TV programmes with their subtitle and audiovisual files and metadata information.
Subtitle Vectors.
Latent Semantic Indexing (LSI) (Papadimitriou et al., 2000), a topic modelling technique, was applied to the subtitle files. LSI is a two-step procedure. Firstly, a document-term matrix is generated via a low-rank approximation obtained from the term vector space projections of the Bag of Words vectors. Secondly, Singular Value Decomposition (SVD) is applied to the document-term matrix, where the newly created eigenvectors represent the concepts within the latent space. We worked with 50-dimensional spaces. LSI improves on the term-document matrices, but does not take word order into account. To deal with this, we worked with the neural semantic embeddings of Doc2vec (Le & Mikolov, 2014). Doc2vec is an extension of the neural semantic word embeddings of Word2vec (T. Mikolov & Dean, 2013). We worked with Paragraph Vector Distributed Memory (PV-DM), which concatenates the unique document ID with the context words with respect to the specified context window over the text, and thereby preserves the order of words.

Audio Vectors.
We extracted acoustic features including MFCCs, Spectral Centroid, Zero Crossing Rate, Spectral Flatness and Root Mean Square using LibROSA (McFee et al., 2015), keeping an audio sampling rate of 22050 Hz and a hop length of 512 samples, with variable-length audio tracks averaging about 30 minutes each, for a detailed analysis. The extracted acoustic features are concatenated, normalised and then used as audio vectors for each audio file. We then followed (Kiela & Clark, 2017) and used a Bag of Audio Words (BoAW) model to learn abstract audio vector representations. BoAW is used in audio information retrieval and recognition (Liu et al., 2010; Pancoast & Akbacak, 2012; Rawat et al., 2013) and acoustic event detection (Plinge et al., 2014; Grzeszick et al., 2015; Lim et al., 2015). The audio words are learnt via mini-batch K-means clustering with K = 50.

Metadata Vectors.
Metadata representations are based on the editorially-assigned attributes of the programmes. We worked with hierarchical genre information, e.g. "factual/scienceandnature/natureandenvironment", where a match can occur at any level. A categorical feature vector is created for each attribute by traversing the trees.
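The three singular representations above can be sketched in plain NumPy. The actual pipeline uses LSI and PV-DM models trained on full subtitles, LibROSA frame features, and mini-batch K-means with K = 50; the toy helpers below (`lsi_vectors`, `boaw_vector`, `genre_vector`) are illustrative assumptions, not the paper's code:

```python
import numpy as np

# --- Subtitle vectors: LSI via truncated SVD of a term-document matrix ---
def lsi_vectors(term_doc: np.ndarray, k: int) -> np.ndarray:
    """Project documents into a k-dimensional latent topic space
    (the paper uses k = 50 on real subtitle data)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim row per document

# --- Audio vectors: Bag of Audio Words over frame-level features ---
def boaw_vector(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Histogram of nearest-codeword assignments across all frames."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None], axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(codebook))
    return counts / counts.sum()

# --- Metadata vectors: categorical features over genre-path prefixes ---
def genre_vector(path: str, vocab: list) -> np.ndarray:
    """Mark every prefix of e.g. 'factual/science/nature' so that a
    match can occur at any level of the genre tree."""
    parts = path.split("/")
    prefixes = {"/".join(parts[: i + 1]) for i in range(len(parts))}
    return np.array([1.0 if v in prefixes else 0.0 for v in vocab])
```

Each helper yields one fixed-length vector per programme, so all three modalities can be compared with the same cosine measure.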
Fusion.
Individual content vectors are ranked using cosine similarities and are fused with middle and late fusion techniques (Kiela & Clark, 2017; Atrey et al., 2010; Zhu et al., 2006). In middle fusion, we concatenated the different vector representations. In late fusion, we first computed the cosine similarities of pairwise vectors, resulting in symmetric 145 × 145 similarity matrices; we then combined these with each other by weighted averaging. Figure 1 shows our late fusion framework.
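Both fusion modes can be sketched as follows; this is a minimal NumPy sketch with illustrative function names, not the paper's implementation:

```python
import numpy as np

def cosine_sim_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between the row vectors of X
    (one row per programme)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def middle_fusion(*modality_vecs: np.ndarray) -> np.ndarray:
    """Middle fusion: concatenate per-modality vectors per programme,
    then compute a single similarity matrix afterwards."""
    return np.concatenate(modality_vecs, axis=1)

def late_fusion(sim_matrices, weights):
    """Late fusion: weighted average of per-modality similarity
    matrices, normalised by the total weight."""
    w = np.asarray(weights, dtype=float)
    return sum(wi * S for wi, S in zip(w, sim_matrices)) / w.sum()
```

With the per-modality matrices in hand, the paper's best-performing combination corresponds to `late_fusion([S_lsi, S_d2v, S_aud, S_md], [0.7, 1.5, 0.2, 0.65])`, using the weights reported in Section 3.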
3. Evaluation and Results
Performance of the singular and fused vectors is evaluated by a personalised Python-based recommender system evaluation framework, developed using the MyMediaLite library (MyMediaLite). We calculated the Mean Average Precision (MAP) and Intra-List Diversity (ILD) of the recommendations at ranks 10 and 20 obtained from cosine similarities, and compared these with a metadata-only recommendation system (MD) and the behavioural similarity measure obtained from users' viewings (USER). Our aim was to increase ILD while maintaining or increasing MAP. Even individually, LSI and Doc2Vec outperformed metadata in both MAP and ILD; audios (AUD) showed the highest ILD and lowest MAP. The best performance was by the multimodal late fusion model (FUS) with the weights 0.7 LSI, 1.5 D2V, 0.2 AUD, 0.65 MD. It increased MAP and ILD at both ranks 10 and 20; see Table 1. It also came very close to the gold standard user behaviour.

Table 1. Singular and fused model evaluations. The acronyms LSI, D2V, AUD, MD, and FUS are used for Latent Semantic Indexing, Doc2vec, Audio, Metadata, and Fusion, respectively. USER is the user-based behavioural similarity that we are trying to estimate.

MODEL   MAP@10   ILD@10   MAP@20   ILD@20
LSI     11.30    69.89    13.40    76.79
D2V     11.76    77.20    13.88    80.37
AUD
USER
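The two metrics can be sketched as follows. These are hypothetical helpers written for illustration; the paper's evaluation runs through a MyMediaLite-based framework rather than this code:

```python
def average_precision_at_k(ranked: list, relevant: set, k: int) -> float:
    """Average precision of one ranked list at cut-off k; MAP@k is
    the mean of this value over all users."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i
    return score / min(len(relevant), k) if relevant else 0.0

def intra_list_diversity(items: list, sim: dict, k: int) -> float:
    """ILD@k: average pairwise dissimilarity (1 - similarity) among
    the top-k recommended items."""
    top = items[:k]
    pairs = [(a, b) for i, a in enumerate(top) for b in top[i + 1:]]
    if not pairs:
        return 0.0
    return sum(1.0 - sim[(a, b)] for a, b in pairs) / len(pairs)
```

The tension the paper exploits is visible in these definitions: raising ILD pushes recommendations apart in similarity space, which tends to lower precision unless the fused similarities remain accurate.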
Analysis.
We analysed individual features of a subset of programmes to find out the roles various features played in measuring the similarities. We worked in three groups:

Group (I): closely correlated programmes of the same genre, e.g. Eastenders, Doctors, Waterloo Road.

Group (II): uncorrelated programmes of different genres, e.g. Eastenders, Football League Show, University Challenge.

Group (III): correlated programmes of different genres, e.g. Eastenders (soap drama), Jamie Private School Girl (comedy), Notorious Betty Page (autobiographical drama).

In Group (I), the genre similarities were nicely reflected by the audio and textual features. In Group (II), the uncorrelatedness of the programmes was manifested in the textual and audio features. In Group (III), a genre-only analysis failed but a multimodal one succeeded: the similarities between the different-genre programmes clearly showed themselves in the textual and audio features. The most efficient differentiating vectors were the BoAW model and LSI.
4. Future Work
Our framework can easily be extended to other modalities. We worked with standard Python packages for sentiment and writing style, with format and service metadata, and with 200 scene images extracted from the video files of programmes, but did not obtain improvements. Working with more sophisticated attributes and a larger number of images, and jointly learning multimodal representations via neural nets, as in (Iqbal et al., 2018a), is work in progress.
References
Atrey, P. K., Hossain, M. A., El Saddik, A., and Kankanhalli, M. S. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems, 16(6):345–379, 2010.

Barkan, O., Koenigstein, N., Yogev, E., and Katz, O. CB2CF: a neural multiview content-to-collaborative filtering model for completely cold item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 228–236, 2019.

Bougiatiotis, K. and Giannakopoulos, T. Enhanced movie content similarity based on textual, auditory and visual information. Expert Systems with Applications, 96:86–102, 2018.

E. Bruni, N. T. and Baroni, M. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47, 2014.

Ekenel, H. K. and Semela, T. Multimodal genre classification of TV programs and YouTube videos. Multimedia Tools and Applications, 63(2):547–567, 2013.

Grzeszick, R., Plinge, A., and Fink, G. A. Temporal acoustic words for online acoustic event detection. In German Conference on Pattern Recognition, pp. 142–153. Springer, 2015.

Iqbal, M., Kovac, A., and Aryafar, K. A multimodal recommender system for large-scale assortment generation in e-commerce. In The SIGIR 2018 Workshop On eCommerce, co-located with the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018), Ann Arbor, Michigan, USA, July 12, 2018, 2018a.

Iqbal, M., Kovac, A., and Aryafar, K. A multimodal recommender system for large-scale assortment generation in e-commerce. arXiv preprint arXiv:1806.11226, 2018b.

Jurafsky, D. and Martin, J. H. Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/, 2019.

Kiela, D. and Clark, S. Learning neural audio embeddings for grounding semantics in auditory perception. Journal of Artificial Intelligence Research, 60:1003–1030, 2017.

Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196, 2014.

Lim, H., Kim, M. J., and Kim, H. Robust sound event classification using LBP-HOG based bag-of-audio-words feature representation. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

Liu, Y., Zhao, W.-L., Ngo, C.-W., Xu, C.-S., and Lu, H.-Q. Coherent bag-of-audio-words model for efficient large-scale video copy detection. In Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 89–96, 2010.

McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., and Nieto, O. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8, 2015.

MyMediaLite. MyMediaLite recommender system library. (Accessed on 02/20/2020).

Oramas, S., Nieto, O., Sordo, M., and Serra, X. A deep multimodal approach for cold-start music recommendation. In Proceedings of the 2nd Workshop on Deep Learning for Recommender Systems, pp. 32–37, 2017.

Pancoast, S. and Akbacak, M. Bag-of-audio-words approach for multimedia event classification. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.

Papadimitriou, C. H., Raghavan, P., Tamaki, H., and Vempala, S. Latent semantic indexing: A probabilistic analysis. Journal of Computer and System Sciences, 61(2):217–235, 2000.

Plinge, A., Grzeszick, R., and Fink, G. A. A bag-of-features approach to acoustic event detection. In, pp. 3704–3708. IEEE, 2014.

Rawat, S., Schulam, P. F., Burger, S., Ding, D., Wang, Y., and Metze, F. Robust audio-codebooks for large-scale event detection in consumer videos. In INTERSPEECH, pp. 2929–2933, 2013.

T. Mikolov, K. Chen, G. C. and Dean, J. Efficient estimation of word representations in vector space. In Proceedings of ICLR, Scottsdale, AZ, 2013.

Turney, P. and Pantel, P. From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188, 2010.

Yang, B., Mei, T., Hua, X.-S., Yang, L., Yang, S.-Q., and Li, M. Online video recommendation based on multimodal fusion and relevance feedback. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, CIVR '07, pp. 73–80, New York, NY, USA, 2007. Association for Computing Machinery.

Zhu, Q., Yeh, M.-C., and Cheng, K.-T. Multimodal fusion using learned text concepts for image categorization. In