Automated Composition of Picture-Synched Music Soundtracks for Movies
Vansh Dassani
The University of Bristol, Bristol BS8 1UB, [email protected]
Jon Bird
The University of Bristol, Bristol BS8 1UB, [email protected]
Dave Cliff
The University of Bristol, Bristol BS8 1UB, [email protected]
ABSTRACT
We describe the implementation of and early results from a system that automatically composes picture-synched musical soundtracks for videos and movies. We use the phrase picture-synched to mean that the structure of the automatically composed music is determined by visual events in the input movie, i.e. the final music is synchronised to visual events and features such as cut transitions or within-shot key-frame events. Our system combines automated video analysis and computer-generated music-composition techniques to create unique soundtracks in response to the video input, and can be thought of as an initial step in creating a computerised replacement for a human composer writing music to fit the picture-locked edit of a movie. Working only from the video information in the movie, key features are extracted from the input video, using video analysis techniques, which are then fed into a machine-learning-based music generation tool, to compose a piece of music from scratch. The resulting soundtrack is tied to video features, such as scene transition markers and scene-level energy values, and is unique to the input video. Although the system we describe here is only a preliminary proof-of-concept, user evaluations of the output of the system have been positive.
CCS CONCEPTS
• Applied computing → Sound and music computing; • Computing methodologies → Image processing; Neural networks;
KEYWORDS
video soundtracks, automated music composition, machine learning
ACM Reference format:
Vansh Dassani, Jon Bird, and Dave Cliff. 2019. Automated Composition of Picture-Synched Music Soundtracks for Movies. In Proceedings of the 16th ACM SIGGRAPH European Conference on Visual Media Production, London, UK, Dec. 17–18 (CVMP 2019), 10 pages. DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Imagine that you have edited a 15-minute movie to the picture-lock stage, and now you need to source a continuous music soundtrack to be added in the final stages of postproduction: what are your options? You could attempt to find appropriate music on a stock-audio site such as MusicBed (Mus [n. d.]), but such sites tend to offer a large number of tracks that are five minutes or less and very few that are over 10 minutes; and even if you do find a pre-recorded track of a suitable length, its structure will not be synchronised to the events in your video. If you have the skills of an auteur, you could compose your own soundtrack, fitting the music to the video, but that will likely take some time; and maybe you just don't have the music skills. If instead you have money, you could pay a composer to write a specifically commissioned soundtrack, synchronised to the cut of the video. But what if none of these options are practicable? In this paper we sketch a solution for your woes: we have created a proof-of-concept automated system that analyses the visual content of a video and uses the results to compose a musical soundtrack that fits well to the sequence of events in the video, which we refer to as a picture-synched soundtrack. We have conducted a series of user studies, and the results demonstrate that our approach has merit and is worthy of further exploration. Because it's easier to refer to such a system via a proper noun, we have named our video-driven automated music composition system
Barrington, in honour of Barrington Pheloung (1954-2019), composer of clever themes, incidental music, and soundtracks for film and TV.
Another way of framing the task performed by Barrington is as follows: in the production of promo videos for music singles, especially in genres such as disco, house, techno, and other forms of electronic dance music with a strongly repetitive beat structure, it is quite common to see videos edited in such a way that they're cut to the beat, i.e. where abrupt (hard-cut or jump-cut) transitions between scenes in the video are synchronised with the beat-pattern of the music. In those kinds of promo videos, the pattern of switching between scenes within the movie is determined by the structure of the music: the music comes first, and the video is then cut to fit to the music. Barrington is a provisional system for automatically generating answers to the question posed by inverting this process: starting with a video, what music can be created that fits the style and structure of that video? Obviously, for any one video, there is an essentially infinite number of music tracks that could potentially be used as the soundtrack for that video, and deciding which ones are good, and which is best, is a subjective process requiring artistic judgement. Barrington incorporates a number of heuristics which are intended to improve the quality of its output. In its current proof-of-concept form, Barrington is certainly not offered as something that is likely to put professional music composers out of work anytime soon. More realistically, we think it likely that a later evolution of Barrington could conceivably be used to generate bespoke/unique soundtracks as a service for consumers, and an even more mature version could plausibly be used in creating targeted advertising. For example, the same video-advert could be shown to
Customer A with a rock-music picture-synched soundtrack, and to Customer B with a classical-music picture-synched soundtrack, with both soundtracks produced by Barrington on the basis of prior knowledge of the customers' respective tastes in music.
In Section 2 we review relevant background material. The technical notes on the design and implementation of Barrington are given in Section 3. Then in Section 4 we give details of our user studies and discuss the results. We close the paper with a discussion of further work in Section 5.
2 BACKGROUND
Computers have long been used as tools by musicians to experiment and produce music with. From simple applications that allow you to organise short audio clips into playable sequences, to complex audio synthesis models that can recreate the sounds of musical instruments, or even create new ones, computers have transformed our approach to making music. Today, the majority of musicians use digital audio workstations (DAWs), such as Logic Pro (Log [n. d.]) and Reaper (Rea [n. d.]), which provide a single platform through which to record, edit and process audio.
Pythagoras is attributed with discovering that numerical ratios could be used to describe basic intervals in music, such as octaves and fifths, that form harmonic series (Weiss and Taruskin 2007, Chapter 2). This observation paved the way for formalising musical concepts using mathematics. In the 18th Century, dice-based musical games, such as Mozart's Dice, were an example of algorithmic composition, in which a table of rules and a dice were used to randomly select small, predetermined sections of music to form a new piece (Hedges 1978). Today, those same games can run on a phone or computer, with no dice required.
In the 19th Century, Ada Lovelace hypothesized that Babbage's Analytical Engine might be capable of algorithmic composition (Menabrea and Lovelace 1842, Note A): "Again, it might act upon other things besides number, were objects found whose mutual fundamental relations could be expressed by those of the abstract science of operations, and which should be also susceptible of adaptations to the action of the operating notation and mechanism of the engine. Supposing, for instance, that the fundamental relations of pitched sounds in the science of harmony and of musical composition were susceptible of such expression and adaptations, the engine might compose elaborate and scientific pieces of music of any degree of complexity or extent."
The idea of algorithmic music composition has been explored since the 1950s, but for the most part the composition algorithms were developed manually. Rule-based systems, such as Ebcioğlu's CHORAL (Ebcioğlu 1990), allowed for automated composition based around a set of pre-programmed rules, often referred to as a grammar. Musical grammars describe the form in which notes can be grouped or ordered to create sequences. The use of a formal grammar allows a computer to execute the algorithm, without any human intervention or direction.
Arguably such algorithmic music composition techniques are examples of artificial intelligence (AI). However, more recent breakthroughs in AI, specifically deep learning methods, have vastly improved the quality and performance of computer-generated music systems. Researchers can use vast amounts of data, in this case in the form of existing music, to essentially 'teach' a computer to solve a problem like a human would. Here, instead of a programmer defining the musical grammar, the machine observes and processes the input data to define the grammar on its own: i.e., it is able to 'learn'. Van Der Merwe and Schulze used Markov chains to model chord duration, chord progression and rhythmic progression in music (Van Der Merwe and Schulze 2011). Markov chains model sequences of events as discrete states and the transition probabilities between states, and do not rely on knowledge of other states to make a transition. More sophisticated Markov models have been useful for tasks such as accompaniment or improvisation, as shown by Morris et al. (Morris et al. 2008) in their work on human-computer interaction for musical compositions, and Sastry's 2011 thesis (Sastry 2011) exploring machine improvisation and composition of tabla sequences. New deep learning models, such as BachBot (Liang 2016), are capable of producing complex music that can be virtually indistinguishable from their human-produced counterparts.
One of the first practical applications of this technology has been creating royalty-free music for video content creators. Video content is rapidly becoming the preferred medium through which to advertise and communicate over the internet. Cisco estimates that by 2022, online video content will make up 82% of consumer internet traffic (Cisco 2018). As demand for video content grows, the content creation industry is looking for novel methods to improve and streamline their processes. In recent years, tools such as Animoto (Ani [n. d.]) and Biteable (Bit [n. d.]) have allowed novices to produce professional quality videos in just a few clicks, by offloading the more time consuming and skill-based aspects of video editing to machines. These tools often provide a selection of royalty-free music for users to easily add a soundtrack to their productions. However, these compositions have a fixed style and structure, which leaves limited, if any, scope for altering the soundtracks to better support the visuals.
Advertisers have also turned to video content to better convey meaning and engage with customers. Drawing on relationships between auditory and visual emotional responses, it could become easier to create emotionally engaging content, thereby improving the effectiveness of advertising campaigns. In the future, it is possible to envision social media advertising where the genre or style of the soundtrack is tailored to individual viewers. Video hosting platforms, such as YouTube (You [n. d.]) and Vimeo (Vim [n. d.]),
could evolve to support high levels of real-time customisation, allowing different viewers to have their own unique experience of the same video content.
Computer vision is another field that has benefited from advances in machine learning, utilising vast datasets to infer features relevant to the task at hand. Research in this area has enabled many interesting applications, from identifying and tracking moving objects in videos (e.g., (Nam and Han 2016)), to detecting human emotions through facial expressions (e.g. (Bartlett et al. 2003)). Recently, Chu and Roy (Chu and Roy 2017) described a novel method of learning to identify emotional arcs in movies using audio-visual sentiment analysis, which opens up new opportunities for deriving meaningful information from video input.
Computationally-generated video from audio input is a well-established domain of study (see e.g. Beane's 2007 thesis (Beane 2007)). But if computers can generate visual imagery to support music, why not the other way round? As far as we can determine, very little research has been published in this area. There has been some work on extracting image features such as contour, colour and texture to generate music, as in Wu and Li's study of image-based music composition (Wu and Li 2008), and this field is often referred to as image sonification. As the name suggests, these methods work on still images and therefore could, in principle, be applied to video content.
The lack of published work on this topic was one motivation for our work; another was that the digital video industry is valued at over $135 billion in the USA alone (Mag [n. d.]), and represents a huge potential market for a solution. The contribution of this paper is to describe the design and implementation of Barrington, a proof-of-concept system that our initial user evaluation indicates could, with further refinement, prove to be extremely time and cost effective for content creators working with prohibitive music licensing costs, or those without the technical expertise to create or edit a soundtrack for their video.
3 DESIGN AND IMPLEMENTATION
In traditional film-making, directors and editors work closely with composers to develop a soundtrack that fits with their narrative intent. Similarly, our intention was to create a tool that was not totally autonomous but rather could be guided by visual content creators.
Barrington is primarily written in the Python 3 programming language, due to the availability of a wide range of open-source libraries; as is made clear later in this paper, some aspects of Barrington are controlled by shell-scripts consisting of operating-system commands.
Scene transitions are the most easily identifiable markers of progression in a video, and served as the starting point for the analysis. Scene transition detection, often also referred to as shot boundary detection, is a difficult task for computers, due to the variety of transition methods video editors have at their disposal. Abrupt transitions follow a predictable pattern, as the change in content from one frame to another is sudden and usually extensive. Gradual transitions often utilise visual effects to blend two scenes together in a smooth manner, such as cross-dissolves or fades, over any length of time, and with very little change in content from frame to frame. This variability makes it difficult to create a single method of detecting every type of transition. Transition detection algorithms typically look at pairs of frames, computing a score for each using some defined similarity measure, then evaluating these scores against some threshold to determine if a transition has occurred.
Following (Zhang et al. 1993), Barrington implements two basic scene detectors, one to detect abrupt transitions (cut detection) and the other to detect fades to/from black (fade detection). Cuts are detected by computing the average change in hue, saturation and intensity values from one frame to the next. This aggregate value will be low between frames with little change in content, and high between frames with large changes. A threshold is specified such that frames above the threshold are marked as abrupt transitions. Fades are detected using RGB intensity. For each frame, the R, G, and B pixel values are summed to produce an intensity score. A threshold is specified such that the RGB intensity values in the majority of frames within a scene lie above said threshold; if the RGB intensity for a frame falls below the threshold, it is marked as the start of the fade, and the next frame to return above the threshold is marked as the end. This remarkably simple technique works effectively, as any fade to/from black will inevitably contain at least one black frame, with an RGB intensity value of 0.
PySceneDetect (PyS [n. d.]) was the open-source library chosen to implement the scene detection algorithms described above. It provides methods for performing both fade detection and cut detection, each outputting a list of scenes identified with their start and end times. Barrington uses PySceneDetect's default threshold values of 12 and 30 for the fade and cut detection respectively, outputting a final list of scenes detected through both detection methods. The library is also able to export clips of individual scenes, which are used for the scene-level analysis discussed later in Section 3.5.
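The sketch below is a minimal, from-scratch illustration of the two detectors described above, written directly against OpenCV rather than via PySceneDetect; the function name, threshold values and per-channel averaging are our own illustrative assumptions, not Barrington's actual code (Barrington uses PySceneDetect's built-in detectors).

```python
import cv2
import numpy as np

def detect_transitions(path, cut_threshold=30.0, fade_threshold=12.0):
    """Illustrative cut and fade detection on a video file.

    Cuts: mean absolute change in HSV values between consecutive frames
    exceeds cut_threshold. Fades: frames whose mean RGB intensity drops
    below fade_threshold are treated as lying inside a fade to/from black.
    """
    cap = cv2.VideoCapture(path)
    cuts, fades = [], []
    prev_hsv, in_fade, frame_idx = None, False, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV).astype(np.float32)
        if prev_hsv is not None:
            # Average per-pixel change across the H, S and V channels.
            if np.mean(np.abs(hsv - prev_hsv)) > cut_threshold:
                cuts.append(frame_idx)
        # Mean pixel intensity of the frame (near 0 for a black frame).
        intensity = float(frame.mean())
        if intensity < fade_threshold and not in_fade:
            fades.append(("fade_start", frame_idx))
            in_fade = True
        elif intensity >= fade_threshold and in_fade:
            fades.append(("fade_end", frame_idx))
            in_fade = False
        prev_hsv, frame_idx = hsv, frame_idx + 1
    cap.release()
    return cuts, fades
```

In practice PySceneDetect packages equivalent detectors (content-based cut detection and threshold-based fade detection), which is why Barrington uses the library rather than hand-rolled code like the above.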
The list of scenes output from scene detection provides a starting point on which to base the temporal structure of the soundtrack. Taking a simplistic view, we can split the overall soundtrack into sections commonly found in popular music, such as the intro, verse, chorus and coda (outro), and map each scene to a section of music: the first scene being the intro and the last the coda. Mapping the rest of the scenes is more complicated, as different videos will have different numbers of scenes.
In order to gain an understanding of how best to approach this, we first created a simple rule-based system that loops some pre-existing samples of music, with the middle scene (or middle pair of scenes for cases where the total number of scenes is even) arbitrarily chosen as the point at which the soundtrack peaks, and every other scene transition either introducing or removing a particular sample. This is analogous to the way music is composed professionally, as repetition is often a key part of musical structure. These loop-based sequenced soundtracks were an important early step in developing Barrington because they allowed us to explore and demonstrate video-driven sequence-assembly, but they offer only a coarse-grained approach to music composition.
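As a concrete illustration of this scene-to-section mapping, the sketch below labels a list of detected scenes with section roles and marks the middle scene(s) as the peak; the role names and the build/release ramp are our own simplification, not Barrington's actual rule set.

```python
def map_scenes_to_sections(scene_durations):
    """Assign a rough musical role to each detected scene.

    First scene -> intro, last scene -> coda, middle scene(s) -> peak,
    and everything else a section that builds towards the peak or
    releases after it (where samples would be added or removed).
    """
    n = len(scene_durations)
    peak = {n // 2} if n % 2 else {n // 2 - 1, n // 2}
    sections = []
    for i, duration in enumerate(scene_durations):
        if i == 0:
            role = "intro"
        elif i == n - 1:
            role = "coda"
        elif i in peak:
            role = "peak"
        else:
            role = "build" if i < min(peak) else "release"
        sections.append({"scene": i, "duration": duration, "role": role})
    return sections

# Example: a five-scene video, durations in seconds.
print(map_scenes_to_sections([12.0, 8.5, 20.0, 9.5, 10.0]))
```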
To give Barrington more creative scope, we switched to using fully-fledged music generation systems. We initially used Magenta PolyphonyRNN, a recurrent neural network based on the BachBot architecture described in (Liang 2016). It is able to compose and complete polyphonic musical sequences using a long short-term memory (LSTM) generative model, without having to explicitly encode musical theory concepts. BachBot's success is based on modelling melodies and harmonisation independently. It is able to learn a representation of harmonisation based on samples from a corpus, and apply it to any melody. In user tests it was capable of producing music virtually indistinguishable from Bach's compositions. Unfortunately, the computational costs of PolyphonyRNN meant it was not suitable for our system: run-times lasting a week or more could be consumed in training the system, and the results were often disappointing. We therefore switched to IBM's Watson Beat.
IBM's Watson Beat (IBM [n. d.]) is designed to be a compositional aid for musicians, helping them generate new ideas they can then adapt to compose a piece of music, although it is capable of producing a complete piece on its own. Watson Beat provides more granular control over the composition compared to PolyphonyRNN; it composes parts for multiple instruments at once, and allows users to define the composition's mood, instrumentation, tempo, time signature and energy. This made it a much more versatile tool for developing Barrington.
A brief overview of the process of developing Watson Beat is available on IBM's website (IBM [n. d.]), in which the developers describe using supervised learning on a vast amount of data relating to each composition's pitch, rhythm, chord progression and instrumentation, as well as labelled information on emotions and musical genres. A key feature of Watson Beat is its ability to arrange the musical structure in a manner that provides the user control over individual sections of music, though it is unclear if this was manually encoded or learned from training data.
Watson Beat requires two inputs to compose a piece of music: a MIDI file with a melody to base the composition on; and a .ini file that specifies the structure of each section of the composition. The .ini file for each composition must start with the following parameters:
• Composition duration: in seconds.
• Mood: selected from one of the 8 pre-configured options.
• Complexity: three options (simple, semi-complex, or complex) which determines the complexity of the chord progressions in the composition.
This is followed by a list of parameters for each section of the composition, which are explained below:
• Section ID: a series of consecutive integers starting from 0 identifying the current section.
• Time signature.
• Tempo: in BPM.
• Energy: three options (low, medium or high) which represents the number of active instruments for the current section. The number of instruments (also referred to as layers) each energy option uses can vary, and depends on the selected mood. For example, low could signify 1-3 instruments for one mood option, but 2-5 for another.
• Section duration: in seconds (allows for a range to be specified, such as 10 to 20 seconds, for situations where a precise duration is not required).
• Direction: set to either up or down, and determines whether instrument layers will be added or removed during the section.
• Slope: one of three options (stay, gradual or steep) which control the rate of change of the Direction, that is, the rate at which layers are added or removed.
Watson Beat only requires a duration range to generate each section of music, and calculates a best-fit phrase length, tempo and time signature, from a pre-specified range of options for the given mood, for durations within this range. This calculation is non-deterministic as it utilises a randomly generated number to pick from the computed values. Given that section durations for the autogenerated soundtracks must match exactly with the video, the best-fit calculation is not ideal as it can often lead to an incorrect section duration.
To counter this, Barrington incorporates a Python function to compute the correct tempo and time signature required to produce the specified section duration. The function loops through every possible combination of parameters and selects those where the number of phrases is a whole number and the calculated duration equals the desired duration.
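A minimal sketch of such a search is shown below; the candidate tempi, time signatures and bars-per-phrase value are illustrative assumptions, not Watson Beat's actual mood presets, and the beat length is simplified to 60/BPM seconds regardless of the time-signature denominator.

```python
def valid_combinations(target_secs, tempos=range(60, 181, 2),
                       time_sigs=((4, 4), (3, 4), (6, 8)),
                       bars_per_phrase=4, tolerance=1e-6):
    """Enumerate (tempo, time signature) pairs whose phrases tile the
    target section duration exactly with a whole number of phrases."""
    combos = []
    for bpm in tempos:
        for beats, unit in time_sigs:
            # One phrase = bars_per_phrase bars * beats per bar * seconds per beat.
            phrase_secs = bars_per_phrase * beats * 60.0 / bpm
            phrases = target_secs / phrase_secs
            if phrases >= 1 and abs(phrases - round(phrases)) < tolerance:
                combos.append({"tempo": bpm,
                               "time_sig": f"{beats}/{unit}",
                               "phrases": int(round(phrases))})
    return combos

# Example: a 16-second section admits, among other options,
# 120 BPM in 4/4 (two 4-bar phrases) and 60 BPM in 4/4 (one phrase).
print(valid_combinations(16.0))
```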
These lists of valid combinations are generated for each scene, and then trimmed to only contain options that ensure a consistent tempo throughout the soundtrack. For each section, one of these possible combinations is randomly selected for use, ensuring variation in the resulting soundtracks. To generate the soundtrack, Barrington invokes a Bash shell script that runs Watson Beat, specifying the .ini file, an existing MIDI file with an input melody, and the output folder for the generated MIDI files.
As described thus far, Barrington is able to automatically produce synchronised soundtracks with discernibly different sections, only requiring the energy, direction and slope parameters to be manually input. These parameters, along with the composition's mood, play a role in the subjective interpretation of how the soundtrack supports the visual content, therefore it is important for users to have control over them. Nevertheless, for the system to operate at higher levels of autonomy, it must also be capable of determining these values itself, directly from the input video.
There are no definitive criteria to determine the energy of a scene of video or a section of music; the parameter is intrinsically dependent on the music generation tool being used. Watson Beat uses energy, direction and slope to define the number of instrument layers active in a given section of music, and how this number should vary through the composition, so energy values such as low, medium and high are only relevant in the context of a Watson Beat composition. Traditionally, tempo is used to convey energy - a high tempo is considered more energetic than a low tempo - so it stands to reason that any representation of energy derived through video analysis should control tempo, as well as the instrumentation parameters.
A naive approach to this problem could be to use activity or 'busyness' within a scene to determine energy. For example, a scene with lots of objects can be considered more energetic than one with very few. The downside of naive approximations is that once again there is no consistent way of comparing the results across different videos. Without a standardised metric, the only method of determining if a scene has low or high energy is to compare it with others within the same video.
To test the effectiveness of a naive object-counting approach, we wrote a Python function that uses ImageAI (Ima [n. d.]), an open-source computer vision library, and version 3 of the YOLO object detection model (Redmon et al. 2015). YOLO utilises a single convolutional neural network (CNN) for training and detection; CNNs are often used for image classification tasks. In mathematics, a convolution expresses how the shape of some function is affected as it is shifted across a static, or non-moving, function, i.e. 'blending' the two together. CNNs treat the input image as this static function and pass lots of different functions over it, each of these being a filter that looks for a specific feature in the image. Each filter is passed over small sections of the image at a time, to build up a feature map that tracks the locations at which various features occur across the image. YOLO is able to learn a general representation for different object classes, while optimising detection performance during training, resulting in faster, more precise detection.
YOLO subdivides an image into a grid, using features from across the whole image to predict the bounding box for every object class, and a confidence score that encodes the probability of that box containing an object. The clips of individual scenes are processed one by one to produce a listing of the number of objects detected in each scene. Barrington then calculates the mean and standard deviation of the results to label the scenes low, medium or high energy: scenes with a number of objects fewer than one standard deviation below the mean are classified as low energy; scenes within the mean plus or minus one standard deviation as medium energy; and scenes one standard deviation or greater above the mean as high energy. A similar approach is used to match an associated tempo; the tempo range for a particular mood is split into three and matched the same way as the energy value, then a tempo within this range is selected from the list of valid tempo options. This approach does not work for all scenes, as it relies on a tempo within the sub-range being valid for the particular scene duration, which may not be the case, but it is a useful first approximation.
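The sketch below illustrates the object-counting energy heuristic. The ImageAI calls follow that library's ObjectDetection interface as we understand it (v2.x era) and should be treated as an assumption, as should the model path, the one-representative-frame-per-scene sampling and the exact boundary handling.

```python
import numpy as np
from imageai.Detection import ObjectDetection

def count_objects(frame_paths, model_path="yolo.h5"):
    """Count detected objects in one representative frame per scene,
    using ImageAI's YOLOv3 detector (assumed API)."""
    detector = ObjectDetection()
    detector.setModelTypeAsYOLOv3()
    detector.setModelPath(model_path)
    detector.loadModel()
    counts = []
    for path in frame_paths:
        detections = detector.detectObjectsFromImage(
            input_image=path, output_image_path=path + ".detected.jpg")
        counts.append(len(detections))
    return counts

def label_energy(counts):
    """Label each scene low/medium/high energy relative to the mean
    object count, using a one-standard-deviation band."""
    counts = np.asarray(counts, dtype=float)
    mean, std = counts.mean(), counts.std()
    labels = []
    for c in counts:
        if c < mean - std:
            labels.append("low")
        elif c >= mean + std:
            labels.append("high")
        else:
            labels.append("medium")
    return labels

# Example with pre-computed counts (skipping detection):
print(label_energy([2, 3, 15, 4, 1]))  # -> ['medium', 'medium', 'high', 'medium', 'medium']
```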
The Watson Beat direction and slope parameters are selected by comparing each scene's energy value to the next, using a heuristic decision tree, details of which are given in Anonymous Citation. For the final scene in the video, the function assumes the next scene has the same energy value as the current one. Note that if 'stay' is selected for the slope parameter, it does not matter what direction is selected, as the number of instrument layers stays constant through the section.
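Since the exact decision tree is deferred to the cited report, the sketch below is only one plausible reading of the comparison logic (ranking the energy labels and choosing direction and slope from the difference); it is a hypothetical reconstruction, not the authors' actual rules.

```python
ENERGY_RANK = {"low": 0, "medium": 1, "high": 2}

def direction_and_slope(current, nxt=None):
    """Guess direction/slope from this scene's energy and the next scene's.

    Hypothetical reconstruction: the sign of the energy difference sets
    the direction, and its magnitude sets how steeply layers change.
    The final scene reuses its own energy as the 'next' value.
    """
    nxt = current if nxt is None else nxt
    delta = ENERGY_RANK[nxt] - ENERGY_RANK[current]
    if delta == 0:
        return "up", "stay"  # direction is irrelevant when slope is 'stay'
    direction = "up" if delta > 0 else "down"
    slope = "steep" if abs(delta) == 2 else "gradual"
    return direction, slope

# Example: a medium-energy scene followed by a high-energy scene.
print(direction_and_slope("medium", "high"))  # -> ('up', 'gradual')
```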
The final step involves post-processing the MIDI files and exporting a video with the generated soundtrack. Watson Beat outputs a separate MIDI file for each instrument within each section of music, allowing composers to rearrange the piece and select the instruments to use when converting to an audio file. While this option is still available for more technical users, the Barrington default post-processing step converts these MIDI files into a single audio file containing the final soundtrack, as well as producing a copy of the input video with the new soundtrack attached.
The MIDI specification allows for instruments to be declared in a message at the start of a file; however, the instrument labels used by Watson Beat do not map directly to instruments listed in the General MIDI 1 (GM1) sound set. Barrington takes all the generated MIDI files as an input, assigning the correct tempo to each section using the reference structure, and the correct instrument using a simple mapping based on recommendations found in Watson Beat's documentation (https://github.com/cognitive-catalyst/watson-beat/blob/master/customize_ini.md). It then concatenates them into a single file and uses FluidSynth (Flu [n. d.]), an open-source software synthesizer, to play and record the soundtrack to audio. A Bash script exports the final video, using ffmpeg (ffm [n. d.]), a cross-platform video conversion tool, to attach the soundtrack to the input video. Samples of output from Barrington are available online at Anonymous Website.
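A minimal sketch of these last two steps (rendering the concatenated MIDI to audio with FluidSynth and attaching it to the video with ffmpeg), driven from Python via subprocess, is shown below; the file names and SoundFont path are placeholders, and the exact flags used by Barrington's own Bash scripts may differ.

```python
import subprocess

def render_and_attach(midi_path, video_in, soundfont="default.sf2",
                      wav_path="soundtrack.wav", video_out="output.mp4"):
    """Render a MIDI soundtrack to audio and attach it to the input video."""
    # FluidSynth fast render: -n (no MIDI driver) -i (no shell), -F writes audio to file.
    subprocess.run(["fluidsynth", "-ni", soundfont, midi_path,
                    "-F", wav_path, "-r", "44100"], check=True)
    # ffmpeg: copy the video stream, take audio from the rendered WAV,
    # and stop at the shorter of the two streams.
    subprocess.run(["ffmpeg", "-y", "-i", video_in, "-i", wav_path,
                    "-map", "0:v", "-map", "1:a", "-c:v", "copy",
                    "-shortest", video_out], check=True)
    return video_out
```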
4 EVALUATION
In this section we describe the user studies we carried out to evaluate the current version of Barrington. We begin by describing the methodology used, and the goal of the studies, after which we present and discuss the results. We conclude with reflections on the project thus far and its outcomes.
Ultimately, a system like Barrington is only useful if the end user enjoys its outputs, so this is a primary measure by which to judge success. Our goal was to determine how the Barrington-generated soundtracks affected a user's experience of a video. There are few, if any, objective methods for measuring this, therefore a user study was carried out to measure subjective responses to a selection of Barrington's outputs, adapting the procedure used by Ghinea and Ademoye (Ghinea and Ademoye 2012). They used a questionnaire to measure a user's quality of experience (QoE) of visual media when presented alongside stimuli in other sensory modalities, such as olfaction. The QoE study they performed was particularly well suited for evaluating Barrington, as it also investigated the role of synchronisation of the different sensory stimuli on users' QoE, that is, how temporal offsets between the sensory stimuli affected users' reported QoE. A key feature of Barrington is the synchronisation of changes in auditory and visual content, so it is important to assess whether users were able to detect this synchronisation and, if so, how it affected their QoE.
Another element of Barrington that we wanted to evaluate is the relationship between the energy parameter of a section of the soundtrack, and the corresponding scene's visual content. In particular, we wanted to test whether the object detection method discussed in Section 3.5 is a suitable proxy for the mood or intensity of a scene.
Finally, it is important to assess how Barrington's output compares with existing solutions in the form of royalty-free music that is not picture-synched. We wanted to test whether the soundtracks generated by Barrington produced greater QoE than royalty-free music whose structure is not synched with the video content.
How these key features affected users' QoE was tested using an approach inspired by a Dual-Stimulus Impairment Scale (DSIS), a standard video testing methodology recommended by the International Telecommunications Union (ITU) (Series 2012). In a DSIS test the viewer sees a reference video and a video with an impairment or alteration, and is asked to indicate on a Likert scale how irritating they find this impairment. Instead of only measuring a user's experience of the impaired video, we asked participants to rate both the reference and the impaired videos, in order to determine whether their QoE was affected. Participants were not informed which video was which, to prevent any bias when responding to the questions.
The user evaluation was carried out to test two hypotheses. First, we wanted to investigate the effect of synching a soundtrack with video scene changes, hypothesising that:
H1 – users will report higher QoE for videos combined with synchronised soundtracks generated by Barrington than for videos combined with unsynchronised soundtracks.
Second, we sought to investigate whether the method of selecting the soundtrack's energy parameters was appropriate, hypothesising that:
H2 – users will perceive a correlation between the auditory and visual content of the synchronised soundtracks generated by Barrington.
The 26 participants who volunteered to take part in this study consisted of 17 males and 9 females between the ages of 20 and 24. All the participants were either undergraduate or graduate students at Anonymous University, the majority of whom study in the Computer Science Department. Each participant was provided with information and task instruction sheets explaining the goal of the project (which can be viewed in Anonymous Citation) before taking part, and was left alone in a private meeting room to proceed through the experiment at their own pace.
Each participant completed four tasks, the first three focusing on, respectively: picture-synchronisation, the relationship between auditory and visual energy, and a comparison with royalty-free soundtracks. The final task consisted of a short six-item questionnaire relating to videos and soundtracks.
For the first three tasks, each participant was presented with two videos, the reference and the impaired, which had identical visual content but different soundtracks. They were asked to watch each video in turn and complete the seven-item questionnaire shown below. Each item on the questionnaire required participants to select one option from a five-point Likert scale. For statements 1-6, participants were asked to indicate their level of agreement (LoA) by choosing one of the following options: Strongly agree, Agree, Neither agree nor disagree, Disagree, or Strongly disagree. Statement 7 required participants to rate the synchronisation of the soundtrack and the scene transitions by selecting one of the following options: Too early, Early, At an appropriate time, Late, or Too late.
(1) The soundtrack added to the video experience
(2) The soundtrack was relevant to the video
(3) The soundtrack was enjoyable
(4) There were noticeable points in the soundtrack where the melody/rhythm changed
(5) The mood/intensity of the soundtrack reflected the mood/intensity of the video content
(6) The synchronisation between the audio and video improved the overall experience
(7) In relation to the scene transitions in the video, the melody/rhythm of the music changed
In the final task, participants were presented with the following six statements and asked to indicate their LoA on a five-point Likert scale: Strongly agree, Agree, Neither agree nor disagree, Disagree, or Strongly disagree.
(1) Soundtracks are an important aspect of video content
(2) Soundtracks improve the experience of watching video content
(3) The mood of a soundtrack should match the mood of the video content
(4) A bad soundtrack negatively affects the viewing experience
(5) There was an obvious relationship between the soundtracks and the video content in the previous examples
(6) The synchronisation between the soundtrack and video improved the overall experience
We created three 60-second duration, soundtrack-free videos for this study - one for each of the first three tasks. The videos consisted predominantly of open-source wildlife and nature clips, chosen because they best complemented the Inspire mood that had been selected in Watson Beat and which generates calming, cinematic music. The only exceptions to this were a clip of a busy pedestrian crossing and a clip of a herd of wildebeest, which were used because they were a sharp contrast to the relatively motion-free nature scenes, and specifically chosen to output a high energy parameter when object detection was used. Each soundtrack-free video was processed by Barrington to create the soundtracks for the reference videos, while the process for creating each of the impaired videos is described below.
Task 1 investigated the effect of synchronisation of the video scene changes and the soundtrack on users' QoE and tested H1. The video used contained three scene transitions, which provided ample opportunity for the participants to determine if the audio and video content were synchronised. In the reference video, the soundtrack was synchronised with the video scene changes. In the impaired video, we altered the scene-transition timings so the soundtrack transitions were three seconds earlier than their corresponding scene transitions, then used the system to generate a new soundtrack based on the alteration.
Task 2 investigated the effect of the value of the energy parameter and investigated H2. The video contains just two scenes, one with low energy and the other high. The high-energy scene contains lots of moving objects, which intuitively corresponds to an increased scene intensity. To create the impaired video, the energy parameters were swapped from high to low and vice versa. (N.B. abrupt scene transitions were ignored in this video, due to the fact that one of the clips used consisted of multiple hard cuts within seconds of each other, which led to very short sections of music that clashed quite heavily and sounded bad.)
Task 3 compares a video with a Barrington-generated soundtrack to a video with a royalty-free soundtrack and also investigated H1. For the impaired video, a song was selected from Bensound (Ben [n. d.]), an online royalty-free music provider, and trimmed to the duration of the video.
Two aspects of synchronisation were tested: a) the detection of auditory and visual transitions; and b) the effect of their synchronisation on the participants' enjoyment of the video. For the reference video in Task 1, 26/26 (100%) participants responded that the music changed, or progressed, at an appropriate time in relation to scene transitions, while only 3/26 (12%) believed the same for the impaired video, where the soundtrack transitioned three seconds prior to the visual scene transitions.
This initially suggests that participants were able to recognise the points at which musical transitions occurred, and determine whether they were in sync with visual scene transitions. However, it is clear from the results in Figure 1 that while participants were able to detect that the audio and visuals were not synchronised, they were poor at judging whether the music transition occurred before or after the scene transition. 46% of participants responded with late or too late to question 7 even though the soundtrack was configured to transition three seconds earlier than the video.
Figure 1: Participant responses to the statement "In relation to the scene transitions in the video, the melody/rhythm of the music changed", in Task 1 of the user study.
When looking at the responses to this question across all videos synchronised by Barrington, Figure 2 shows that the majority of participants thought the synchronisation was appropriately timed. It is interesting to note that the remaining participants all selected late, suggesting that participants may differ from one another in what they consider to be appropriately synchronised. Additionally, the videos for Task 2 elicited different timing responses, which is unexpected because the timing parameters for both were identical. This is possibly a result of the musical transitions being too gradual, leaving participants unclear as to when a transition occurred. Or it is perhaps an indication that the end of a scene is not the best marker for a musical transition, although it is not immediately clear what an alternative may be.
Figure 2: Participant responses to the statement "In relation to the scene transitions in the video, the melody/rhythm of the music changed", across all videos produced via Barrington.
Figure 3 presents participant responses to statement 6, which ascertains whether the soundtrack has enhanced their enjoyment of the video. The impaired video, 1B, elicited a predominantly neutral response, which could suggest either that the delay between musical and visual transitions is not significant enough to adversely affect enjoyment, or that a lack of synchronisation does not adversely affect enjoyment. This makes intuitive sense, as one would assume a large proportion of videos on media platforms contain unsynchronised soundtracks. In contrast, the reference video 1A received predominantly positive responses, suggesting that synchronisation can enhance the quality of experience.
Figure 3: Participant responses to the statement "The synchronisation between the audio and video improved the overall experience", in Task 1 of the user study.
Looking at participant responses to statement 4, plotted in Figure 4, gives further insight into why the above results neither support nor disprove the hypothesis that synchronisation can enhance the experience.
Figure 4: Participant responses to the statement "There were noticeable points in the soundtrack where the melody/rhythm changed", in Task 1 of the user study.
Only 50% of participants were able to detect musical transitions in the impaired video 1B, compared with 73% for the reference video 1A. This discrepancy in responses is unexpected, as intuition suggests participants would respond the same way for 1A and 1B, given that both soundtracks contain three transitions. It is possible this is due to the different sections within those specific soundtracks sounding very similar to one another; however, it is not possible to determine this from the results of this study. Ambiguity regarding participants' interpretations of the word changed could also have played a role; however, none of the participants brought this up in the informal post-study feedback conversations. It seems that participants found it difficult to detect melody and rhythm changes, despite overall responding with a more positive perceived experience for picture-synched videos.
This test was to determine whether the number of objects in a scene is a good proxy for how the soundtrack's energy levels should reflect the visual intensity of a scene. Figure 5 shows the results for the reference and impaired videos in Task 2 of the study. Only a small proportion (27%) of participants provided the expected response disagreeing with statement 5 for the impaired video, whereas 81% agreed for 1A and 54% agreed for 1B. The results imply that the energy parameter has a minimal effect on the perceived mood/intensity of visual content, although it is promising to note that some participants did perceive a difference when the change was made, suggesting a more refined method could produce better results. This raises the question of whether a soundtrack should reflect the mood/intensity of visual content: Section 4.2.4 provides some insight on this.
In Task 3, participants responded more positively towards the royalty-free soundtrack on statements relating to the mood and relevance of the soundtrack with respect to the visual content. Figure 6 shows that 100% of participants agreed that the mood/intensity of the royalty-free soundtrack reflected the mood of the visual content, while only 46% thought the same for the Barrington-generated soundtrack.
Figure 5: Participant responses to the statement "The mood/intensity of the soundtrack reflected the mood/intensity of the video content", in Task 2 of the user study.
Figure 6: Participant responses to the statement "The mood/intensity of the soundtrack reflected the mood/intensity of the video content", in Task 3 of the user study.
This result reinforces the prior suspicion that the object-detection proxy for the soundtrack's energy level captures a limited representation of the mood of a scene in a video. However, it could also be a reflection of the audio quality of the music produced by the system. The royalty-free soundtrack was professionally composed and instrumented, whereas the generated soundtracks use a basic software synthesizer to apply instrument sounds as a post-process, a limitation imposed by Watson Beat. When it comes to synchronisation, participants were in agreement that the timing of the generated soundtrack was more accurate than the royalty-free one (see Figure 7), although this did not have the expected result on user enjoyment, which once again could have been due to the quality of the soundtrack. Figure 8 shows a negligible difference (4%, or 1 participant) in the enjoyment of the generated vs royalty-free soundtrack. This study was not intended to compare the quality of generated soundtracks against professionally produced royalty-free samples, but it is encouraging to note that the Watson Beat soundtracks were no less enjoyable than their counterparts.
Figure 9 presents a summary of the responses to the final task of the study, which involved indicating a level of agreement with six statements.
Figure 7: Participant responses to the statement "In relation to the scene transitions in the video, the melody/rhythm of the music changed", in Task 3 of the user study.
Figure 8: Participant responses to the statement "The soundtrack was enjoyable", in Task 3 of the user study.
Generally speaking, participants agreed with all the statements, although some disagreed with the notion that soundtracks should match the mood of the visual content (statement 3) and likewise some did not detect a relationship between the auditory and visual content (statement 5) in the sample videos. The responses to statement 5 were understandable given that some participants struggled to notice the temporal synchronisation, while the results from Task 2 suggested the number of objects in the scene was not a strong indicator of mood/intensity.
The key takeaway from this task is that 100% of participants agreed that soundtracks play an important role in experiencing video content. 62% of participants strongly agreed that synchronisation improves the experience. While these results suggest that Barrington is not yet performing as strongly as hoped for, there is a clear indication that the problems being tackled are relevant to potential users.
A complete breakdown of responses to individual questions in this study can be found in
Anonymous Citation.
This user study, while by no means conclusively proving that synchronisation of audio and video provides a better experience than unsynchronised content, was an important step in validating some of the work that has gone into this project.
Figure 9: Participant responses to all statements in Task 4 of the user study.
We used a fixed set of questions to disguise which video was the reference and which the impaired. Even when using subjective measures of quality, careful consideration is required to avoid ambiguous questions that could result in different interpretations by different participants.
Additionally, some interactive tasks could have been used instead of passive questionnaires. One example would be to ask participants to manually mark the points in a video at which they detect a scene transition, in order to objectively evaluate the performance of scene detection algorithms against a human. These interactive tasks could also form the basis of a dataset that could be used to train a machine to tackle some of the challenges with this topic.
This work has shown that temporal synchronisation of audio and video can lead to an enhanced user experience, and that Barrington's synchronisation method was accurate, as perceived by the participants. Our user evaluation provides some support for hypothesis H1. Less successful was the object-counting proxy used to determine the energy of a scene; this method was not able to accurately convey the intended relationship between auditory and visual content to a viewer and we can reject hypothesis H2.
The limiting factor of this system is the music generation system used. Watson Beat is the most feature-packed of the open-source options currently available, but a lack of published literature or documentation makes it a difficult tool to work with and adapt for more general use. As it stands, in terms of musical structure, there is little one can do beyond defining the number and duration of sections within a piece of music. More fine-grained temporal control over musical components, such as chord progressions and melodies, would allow for the development of new methods to create a temporal mapping. There is the potential for scene transitions, or even content within a scene, to define how different chords within a piece of music flow. For example, an action shot of a child raising its arms could be used to trigger a musical buildup or the climax of a particular phrase of music, which would more effectively demonstrate audio-visual synchronisation.
Emotionality is an extremely important aspect of music; significant work would be required to enable this system to associate moods and genres of music with visual content. Creating a dataset of images labelled with the same mood and emotion tags as the corpus of music used to train the music generation model would be an excellent first step in being able to associate visual and auditory moods using machine learning. For example, an image classifier trained on such a dataset could potentially identify the overarching mood of a video as happy or sad, using the kind of emotion detection described in (Chu and Roy 2017). The music generation model would then generate music in a style that has been labelled with the same mood, as opposed to the user-driven approach currently taken in this system.
Barrington is able to automatically generate soundtracks for videos, with the only requirements being a mood and melody on which to base the composition. Scene transition detection and some scene-level object detection form a base structure for the composition to follow.
Once generated, a simple post-processing step converts the MIDI representation of the composition into audio, using modifiable instrumentation, after which the soundtrack is ready for use. Processing components can be added or removed as required, and are easy to integrate; for example, we were able to add the object detection functionality in under 40 lines of code. A 30-second soundtrack takes under a minute to generate from an input video, making this a fast, versatile prototype system for testing ideas and a firm foundation for further work.
5 CONCLUSIONS AND FURTHER WORK
This paper reports on work in progress, and there remain open problems to be tackled in future.
Adapting and building upon Chu and Roy's work on emotion recognition in movies (Chu and Roy 2017) was something we planned to do during this project, but ultimately were unable to due to the limitations of Watson Beat. Further development of Watson Beat's model of structure is the logical next step, as it would open up opportunities to experiment with new video analysis techniques that can help build a better representation of a video's temporal structure. As musical structure is the limiting factor of a system of this kind, without any advances in capabilities, any further development using current tools would only offer minor improvements.
The results of Task 4 of the user trials (Section 4.2.4) confirmed our initial view that a system such as Barrington would be useful to the video content creation community. While the current version is not going to wow viewers with professional quality music, it demonstrates that the ultimate goal of automatically generating emotionally and structurally representative soundtracks from just a video input is valid and achievable.
REFERENCES
Marian Stewart Bartlett, Gwen Littlewort, Ian Fasel, and Javier R. Movellan. 2003. Real Time Face Detection and Facial Expression Recognition: Development and Applications to Human Computer Interaction. In Conference on Computer Vision and Pattern Recognition Workshop, Vol. 5. 53–53. https://doi.org/10.1109/CVPRW.2003.10057
Allison Brooke Beane. 2007. Generating audio-responsive video images in real-time for a live symphony performance. Ph.D. Dissertation. Texas A&M University.
Eric Chu and Deb Roy. 2017. Audio-Visual Sentiment Analysis for Learning Emotional Arcs in Movies. CoRR abs/1712.02896 (2017). arXiv:1712.02896 http://arxiv.org/abs/1712.02896
VNI Cisco. 2018. Cisco Visual Networking Index: Forecast and Trends, 2017–2022. White Paper (2018).
Kemal Ebcioğlu. 1990. An expert system for harmonizing chorales in the style of J.S. Bach. The Journal of Logic Programming 8, 1-2 (1990), 145–185.
Gheorghita Ghinea and Oluwakemi Ademoye. 2012. The sweet smell of success: Enhancing multimedia applications with olfaction. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 8, 1 (2012), 2.
Stephen A. Hedges. 1978. Dice Music in the Eighteenth Century. Music & Letters (1978).
Feynman Liang. 2016. BachBot: Automatic composition in the style of Bach chorales. Master's thesis, University of Cambridge.
Hyeonseob Nam and Bohyung Han. 2016. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You Only Look Once: Unified, Real-Time Object Detection. CoRR abs/1506.02640 (2015). arXiv:1506.02640 http://arxiv.org/abs/1506.02640
Avinash Sastry. 2011. N-gram modeling of tabla sequences using variable-length hidden Markov models for improvisation and composition. Ph.D. Dissertation. Georgia Institute of Technology.
BT Series. 2012. Methodology for the subjective assessment of the quality of television pictures. Recommendation ITU-R BT (2012), 500–13.
A. Van Der Merwe and W. Schulze. 2011. Music Generation with Markov Models. IEEE MultiMedia 18, 3 (March 2011), 78–85. https://doi.org/10.1109/MMUL.2010.44
Piero Weiss and Richard Taruskin. 2007. Music in the western world. Cengage Learning.
Xiaoying Wu and Ze-Nian Li. 2008. A study of image-based music composition. IEEE, 1345–1348.
HongJiang Zhang, Atreyi Kankanhalli, and Stephen W Smoliar. 1993. Automatic partitioning of full-motion video.