Automatic Realistic Music Video Generation from Segments of Youtube Videos
Sarah Gross, Xingxing Wei, Jun Zhu
Department of Computer Science and Technology, Tsinghua University
ABSTRACT
A Music Video (MV) is a video aiming at visually illustrating or extending the meaning of its background music. This paper proposes a novel method to automatically generate, from an input music track, a music video made of segments of Youtube music videos that fit this music. The system analyzes the input music to find its genre (pop, rock, ...) and finds segmented MVs of the same genre in the database. Then, K-Means clustering groups video segments by color histogram, i.e., segments of MVs having the same global distribution of colors. A few clusters are randomly selected, then assembled around music boundaries, which are moments where a significant change occurs in the music (for instance, transitioning from verse to chorus). This way, when the music changes, the video color mood changes as well. This work aims at generating high-quality realistic MVs, which could be mistaken for man-made MVs. By asking users to identify, in a batch of music videos containing professional MVs, amateur-made MVs and MVs generated by our algorithm, we show that our algorithm gives satisfying results, as 45% of generated videos are mistaken for professional MVs and 21.6% are mistaken for amateur-made MVs. More information can be found on the project website: http://ml.cs.tsinghua.edu.cn/~sarah/
CCS CONCEPTS
• Applied computing → Arts and humanities.

KEYWORDS
music video generation, music segmentation, video analysis, genre recognition, shot detection
ACM Reference Format:
Sarah Gross, Xingxing Wei, Jun Zhu. 2019. Automatic Realistic Music Video Generation from Segments of Youtube Videos. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION

With the progress of Artificial Intelligence, researchers have tried to question the understanding of machine-produced art. Much work has been done on text generation, such as creating poems [26], on image generation, such as generating images in the style of Van Gogh [10], or
even on music generation with the creation of polyphonic chorales in the style of Bach [12]. However, videos are the forgotten ugly duckling of media generation research. The best results in video generation [21][19][15] so far can only achieve very limited samples, such as GIFs of at most 64 ×
64 pixels, still lacking coherence in their structure, hence realism, and only in the very specific categories they were trained on (e.g., bouncing digits, kite surfing, babies, ...). Therefore, generating video from scratch is currently a very difficult task, unable to produce results comparable to real videos.

Yet, videos are nowadays the most popular media on the net. According to the Cisco Visual Networking Index Forecast [1], IP video traffic represented 75% of overall IP traffic in 2017 and will grow to 82% in 2022. Around 100 million hours of video are watched every day on Facebook [30], and Youtube is the world's largest search engine after Google [25]. The influencing power of videos is tremendous, as a video Tweet is 10x more likely to be retweeted than a photo Tweet [23]. For that reason, companies invested dramatically in this media, increasing branded video content by 99% on YouTube and 258% on Facebook between 2016 and 2017 [4].

Within this golden media, music videos thrive. "Music" was the most-searched term on Youtube in 2017 [25]. Within the top 30 most-viewed videos on Youtube, each accumulating more than 2 billion views, only two are not MVs [22]. Music videos represent great artistic value, as they combine two medias on which artistic work is already performed, music and video, to create an enhanced media. Being able to generate a music video is therefore a challenge with the potential to raise significant interest from the public towards AI research.

A few authors previously worked on MV generation, but none of them produced, from a music, MVs comparable to the real music videos produced for an artist. As these kinds of videos receive high interest from the Internet community, with fans creating their own alternative music videos when the artist does not provide one, our work provides not only an innovative method for MV generation, but also an alternative source of music videos for musics deprived of official video clips.

From a database of music videos downloaded from the Youtube-8M dataset, we extract video shots and store the color histogram of each shot. Given this configuration, our generation algorithm can create, in only a hundred seconds, a suitable music video for an audio file given as input. First, we identify the genre of the input music in order to use only video segments of our database originating from MVs of the same genre. Second, these video shots are grouped together based on their color histograms using the K-Means algorithm, creating K clusters of similar color distributions. Key changes in the song (boundaries between sections such as verse, chorus, bridge, etc.) are detected. Finally, a few clusters are selected and consecutively appended between each boundary, triggering a cluster change, and thus a major color change, whenever a music boundary is encountered.

Our contributions are summarized as follows. (i) To the best of our knowledge, we are the first to generate a realistic music video according to an input music, and we propose a systematic method using a database of Youtube music videos. (ii) Contrary to previous works on music video generation, our selection of features (color histogram, key changes in the music, genre) is justified by sociological studies of music videos. (iii) Our method is tested through a user study in which, in more than half of the cases, users mistake our generated videos for human-made videos, and whose feedback confirms our choice of features.
(iv) Our algorithm provides helper functions, such as music genre recognition and video inner-resolution harmonization, which can be used in future works.

The remainder of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 details the database configuration. Section 4 introduces the music video generation method. Experimental results are presented in Section 5, followed by the conclusion in Section 6.

2 RELATED WORK

Earlier works in MV generation relied on assembling images fetched from the internet based on the lyrics content [24][5], creating a video output similar to a slideshow. Another category of researchers decided not to base their MVs on still images, improving the output quality; however, the base video was a unique video shot in amateur conditions. In order to generate artistic-looking summary videos of private events, Yoon et al. [27], Hua et al. [13] and Foote et al. [9] take as input a user home video in addition to the music, analyze the video, segment it and assemble it around the music accordingly, with features specific to each work. Their approach differs significantly from our work in two respects. On the one hand, they have additional constraints: since they deal with amateur, homemade video, they need extra video pre-processing and analysis to discard low-quality shots, and must put in extra work to match the video with the music, since the home video likely contains actions that are neither especially aesthetic nor rhythmic. On the other hand, our database is made of tens of thousands of segments issued from about a thousand MVs, which therefore have different styles and looks, and feature different people. This gives us an additional constraint: assembling these segments while keeping some consistency throughout the video.

Finally, the most recent works perform the opposite process, taking one input video and finding the best-matching music to generate a music video. Lin et al. [16][17] trained a Deep Neural Network on the 65 MVs of the DEAP database to find emotional correspondences between video content and music content. Applied to a video, the trained algorithm picks, from a database of audio files, the music which would "emotionally" fit it best. Though this approach is highly innovative and technically interesting, it produces unconvincing results for two reasons: first, the video transitions have no relationship at all with the music; second, emotional correspondence between the video atmosphere and the music is actually not so important in professional MVs. For artistic reasons, music video creators do not focus much on the emotional feeling of the music and include many elements which have absolutely no correlation with it, for instance "performance" shots showing close-ups of the band playing musical instruments. In the Electronic genre, since the melody relies on strong beats and a dancing atmosphere, one could think of always pairing such musics with lively, joyful videos; however, very few Electronic MVs follow this rule, like Marshmello's single "Alone" which, in spite of having a lively melody, depicts the sad everyday life of a bullied high-schooler.

The current state of video generation [21][19][15] shows that generating an MV from scratch would be impossible. Instead, unlike previous works, we use extracts of real MVs to generate a music video for an input music.
3 DATABASE CONFIGURATION

Contrary to previous works, our study relies on a very large dataset: Youtube-8M, released in 2017 to help the improvement of video-based research. Among these 8 million videos, "Music Video" represents the fourth most frequent entity [2]. By querying the Youtube search API with the keywords "official music video", K. Choi provides the subset YouTube-music-video-5M [6], a list of 5,119,955 video ids. Approximately 1000 videos from the YouTube-music-video-5M dataset are downloaded to test our algorithm. We extract several pieces of information from these music videos in order to prepare the database in advance, hence minimizing the number of operations required later by the MV generation algorithm. We perform the process illustrated by figure 1 on each video of the database (a code sketch follows figure 1):

(1) Separate the video into shots, also called scenes. A scene is made of one single motion of the camera, without any cut.
(2) Calculate the color histogram of each scene.
(3) Store the color histogram array and the duration of each scene in a json file.
Figure 1: Database constitution process.
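As an illustration, here is a minimal sketch of this per-video indexing, assuming the PySceneDetect v0.6 API (detect with a ContentDetector); the file paths and the helper name are our own, not the paper's code, and the histogram of step (2) is sketched in the next subsection.

```python
import json
from scenedetect import detect, ContentDetector

def index_music_video(video_path, json_path, threshold=30.0):
    """Split one MV into scenes and store each scene's frames and duration."""
    # Content-aware detection: a cut is declared when the frame-to-frame
    # content difference exceeds the threshold (we kept the default, 30).
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    records = []
    for start, end in scenes:
        records.append({
            "start_frame": start.get_frames(),
            "end_frame": end.get_frames(),
            "duration_s": end.get_seconds() - start.get_seconds(),
            # the 768-value color histogram of step (2) would be added here
        })
    with open(json_path, "w") as f:
        json.dump(records, f, indent=2)

index_music_video("database/some_mv.mp4", "database/some_mv.json")
```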
The database is split into folders of sub-videos so that the generation algorithm can select a subset of scenes originating from different MVs, to finally concatenate them into a music video. This project's github provides all the guidance and functions for automatically converting a database of downloaded music videos to the format described here.
Existing shot boundary detection methods can generally be categorized [28] into rule-based ones and machine-learning based ones. Although Li et al. [14] judge machine learning methods to be more precise, notably because of the difficulty of finding a proper threshold in rule-based methods, we decided to use the video segmentation algorithm provided by the Python module PySceneDetect, relying on very simple threshold-based algorithms, due to its good performance.

PySceneDetect provides two different methods to detect scene transitions:

• Content-Aware Detection Mode: this mode looks at the difference between the pixels in each pair of adjacent frames, triggering a scene break when this difference exceeds a given threshold $t$. Zhang et al. [29] call this method pair-wise comparison. For a pixel of two-dimensional coordinates $(a, b)$ in a frame $i$ of dimensions $M \times N$ currently being compared with its successor, we define the binary function $DP_i(a, b)$ depending on $P_i(a, b)$, the intensity value of the pixel:

$$DP_i(a, b) = \begin{cases} 1 & \text{if } |P_i(a, b) - P_{i+1}(a, b)| > t \\ 0 & \text{otherwise} \end{cases}$$

A segment boundary is detected if the percentage of different pixels according to the metric $DP$ is greater than a given threshold $T$:

$$\frac{\sum_{a, b}^{M, N} DP_i(a, b)}{M \times N} \times 100 > T$$

• Threshold-Based Detection Mode: this mode looks at the average intensity of the current frame, computed by averaging the R, G, and B values over the whole frame, and triggers a scene break when the delta in intensity between two successive frames falls below a given threshold. Zhang et al. [29] call this method histogram comparison. If we denote $H_i(j)$ the color histogram of frame $i$ for the channel $j$ (either R, G or B), a boundary is detected by the formula:

$$SD_i = \sum_{j} |H_i(j) - H_{i+1}(j)| > T$$

After a few tests, we found that the content-aware mode with the default threshold (30) separated videos into correct shots.
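For illustration, the pair-wise comparison metric above can be re-implemented in a few lines of numpy; this is a sketch of the formula, not PySceneDetect's internal code, and the default values of t and T here are arbitrary.

```python
import numpy as np

def is_shot_boundary(frame_i, frame_next, t=30, T=50.0):
    """frame_i, frame_next: grayscale intensity arrays of shape (M, N)."""
    # DP_i(a, b) = 1 where the per-pixel intensity difference exceeds t
    dp = np.abs(frame_i.astype(np.int16) - frame_next.astype(np.int16)) > t
    # Boundary if the percentage of changed pixels exceeds T
    return 100.0 * dp.sum() / dp.size > T
```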
The point of this research is to create a music video from extracts of MVs of various origins. How does our algorithm decide which segments to select among all the different MVs?

A first lead could be focusing on the meaning of the lyrics to select the MV segments, like Cai et al. [5] or Wu et al. [24], who create music videos from images related to the lyrics. However, our observations and many sociological studies of MVs, such as Sedeño et al.'s [3] or the first chapter of Experiencing Music Video: Aesthetics and Cultural Context [20], point out that neither the song lyrics nor a clear narrative are decisive elements in the conceptualisation and production of MVs.

Sedeño et al.'s research [3] identified a recurring phenomenon in MVs: western professional music videos are often organized into two "styles" which alternate throughout the video, usually one depicting the side story written in the lyrics (called concept videos) while the other one is centered on the music performance, showing for instance the music group playing or the singer singing (called performance videos), as illustrated by figure 2. We observed that pop songs can even have more than two "styles" to appear more dynamic.
Figure 2: Alternation of styles for 3 popular MVs. From top to bottom: Secrets - One Republic, Grenade - Bruno Mars, Misery Business - Paramore.
The main feature that same-style scenes have in common is the color. Usually, a stark change is made in the color distribution of the frame in order to clearly indicate to the viewer that the video is showing content happening in a different time and spatial frame than a second ago. Thus, in order to reproduce this style structure in the generated MVs, we decided to group together video shots with similar color distributions, then alternate between the clusters hence generated.

The color distribution of an image is reflected by its color histogram. The color histogram for a given channel represents the statistical distribution of the pixels in the image for this channel. For instance, if we have $N_r$ pixels of values $(r, g_i, b_i)$, $0 \le r, g_i, b_i \le 255$, then:

$$Hist_{red}[r] = N_r \quad (1)$$

Equation (1) implies:

$$\sum_{c=0}^{255} Hist_{channel}[c] = P, \quad channel \in (R, G, B)$$

where $P$ is the total number of pixels in the image. As the color histogram is defined only for one image and one channel, we compute a 768-size array representing the video color histogram as follows (a code sketch follows figure 3):

(1) Extract a video frame every 5 frames (≈100 ms, the reaction time of human eyes).
(2) Concatenate these extracted frames to get an image representing the whole video.
(3) Compute the normalized color histogram of this image on each channel (B, G, R).
(4) Concatenate the three 256-size channel arrays to get a 768-size array.
(5) Store this array in a json file, as well as the video segment duration.

Step 3 is executed with the help of the cv2 Python library. The K-Means algorithm is executed with scikit-learn, applied on the arrays stored in the json files.

The computation above is executed only for scenes containing between 12 and 125 frames. Statistical analysis of our database shows that most scene lengths vary between 0 and 100 frames, values outside these boundaries being considered as extremes (see figure 3), hence the [12, 125] interval. Finally, we consider scenes shorter than 12 frames (almost 500 ms) as unfit, as this is the time it takes for human eyes to process information.

Figure 3: Box plot of database video scene characteristics.
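A minimal sketch of steps (1)-(4) above, assuming scene frames have already been sampled every 5 frames and share the same width; cv2.calcHist is the OpenCV call mentioned above, while the helper name is ours.

```python
import cv2
import numpy as np

def scene_color_histogram(frames):
    """frames: list of BGR images sampled every 5 frames from one scene."""
    # Step (2): stack sampled frames into one image representing the scene
    mosaic = np.vstack(frames)
    hists = []
    for channel in range(3):  # OpenCV channel order: B, G, R
        # Step (3): normalized 256-bin histogram for this channel
        h = cv2.calcHist([mosaic], [channel], None, [256], [0, 256]).flatten()
        hists.append(h / h.sum())
    # Step (4): concatenate the three arrays into one 768-size descriptor
    return np.concatenate(hists)
```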
Unfortunately, many amateur MVs are incorrectly labelled as "official music video" on Youtube. As we cannot afford to manually verify every video provided by the YouTube-music-video-5M dataset, our database contains many MVs whose segments are unsuitable for generating a music video.

We first roughly remove unsuitable videos based on the video and scene lengths (a code sketch follows figure 4). Based on the box plots of video and scene properties, we eliminate videos containing scenes longer than 60 seconds, as this represents more than 1/4 of the average video length (approximately 240 s) and significantly exceeds the 1.5 interquartile range above the upper quartile of scene lengths (4 s). Inspection of such videos in the database shows that they are indeed usually unsuitable, for instance makeup video tutorials or lyrics videos.
Figure 4: Box plot of database video characteristics.
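A sketch of this rough cleaning rule over the JSON records introduced earlier; the 60-second cutoff is the one stated above.

```python
import json

def is_video_suitable(json_path, max_scene_s=60.0):
    """Discard videos containing any scene longer than 60 seconds."""
    with open(json_path) as f:
        records = json.load(f)
    # Makeup tutorials, lyrics videos, etc. typically contain one very long take
    return all(r["duration_s"] <= max_scene_s for r in records)
```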
After this step, the most common remaining problems are either MVs with hard-written lyrics, or videos made from shots of games or movies over which a music track reflecting the desired atmosphere has been applied. In both cases, segments originating from such videos clearly give off an odd impression of mismatch in the overall music video and decrease the quality of the output MV.

Another major issue is the actual resolution of the MVs. Even though most videos on Youtube have a 16:9 resolution, many MVs actually consist of a smaller picture surrounded by black bars, still reported at the full resolution by players such as VLC, or have the black bars surrounding the video constantly changing. Although most man-made amateur MVs present this defect, we tackle it in order to generate better, realistic-looking music videos. For this purpose, we design an algorithm to harmonize both the outer size (resolution) and the inner size (within the black bars) of the videos. First, we resize all videos to a common resolution of 640 × 360.

4 MUSIC VIDEO GENERATION METHOD

Our algorithm takes as input an audio file and outputs a video using the input music as background. The algorithm consists of several steps designed to bring the output video as close as possible to a man-made MV (a high-level sketch follows this overview):

(1) Find boundaries in the input music at important music changes.
(2) Find the input music genre.
(3) Fetch from the database the videos with the same genre as found at step (2).
(4) Apply K-Means on the color histogram feature to cluster together scenes from the videos found at step (3).
(5) Randomly select C clusters from the clusters created at step (4).
(6) Assemble them around the boundaries found at step (1).

Based on the lengths of the videos in our database (figure 4), our algorithm rejects, after step (1), input musics whose length falls outside the range observed in the database. Step (2) is optional in our algorithm: if skipped, step (4) is applied to the videos of the whole database, but the resulting output presents significantly less consistency. We detail each step of this algorithm in the following subsections.
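Before detailing the steps, here is a high-level sketch of how they chain together; every helper name stands in for a component described in the following subsections and is illustrative only.

```python
def generate_music_video(audio_path, database, k=90):
    boundaries = detect_music_boundaries(audio_path)   # step (1), OLDA via MSAF
    genre = recognize_genre(audio_path)                # step (2), optional
    scenes = database.scenes_with_genre(genre)         # step (3)
    clusters = cluster_by_color_histogram(scenes, k)   # step (4), K-Means
    selected = pick_covering_clusters(clusters, audio_duration(audio_path))  # step (5)
    return assemble(selected, boundaries, audio_path)  # step (6)
```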
First and foremost, which rhythmic indicators are relevant for a music video?

As Goodwin explains in the third chapter of Dancing in the Distraction Factory [11], considered the Bible of MV analysis, rhythm in music video clips is generally not represented through the technique of cutting "on the beat": one can often observe a video shot transition without hearing a beat, or hear a beat without observing any change in the video. Yet, Goodwin noticed that many videos mirror the shifts of the harmonic developments in a song. When there is a key shift in a melody (a transition between sections such as chorus, verse, solo, ...), one can almost be sure to also observe a significant change in the video.

Figure 5: Flowchart of generation algorithm.

That is why we decided not to detect small rhythm changes, but only major changes between sections. For that purpose, after testing the boundary algorithms provided by the Python library MSAF, we settled on Ordinal Linear Discriminant Analysis (OLDA) from McFee et al. [18] due to its outstanding performance. This supervised learning algorithm is especially designed to split music into functional or locally homogeneous segments (e.g., verse or chorus).

Before applying OLDA, structural repetition features must be computed. First, beat-related features (Mel-frequency cepstrum coefficients or chroma) are extracted from the audio sample, then a similarity matrix is computed by linking each beat to its nearest neighbors in feature space. Repetitions appear in the diagonals of the resulting matrix. To convert diagonals into horizontals, the self-similarity matrix is skewed by shifting the $i$-th column down by $i$ rows. However, the nearest-neighbour method can produce a few errors, such as spurious links or skipped connections. To solve this issue, McFee et al. apply a horizontal median filter to the skewed self-similarity matrix, resulting in a matrix $R \in \mathbb{R}^{t \times t}$. Finally, as only the principal components of the matrix contain the most important factors, $R$ is reduced to a matrix $L \in \mathbb{R}^{d \times t}$, $d \ll t$, using matrix multiplications, representing the latent structural repetition. From the matrix $L$ and the audio sample, several useful features are extracted:

• mean MFCCs
• median chroma vectors
• latent MFCC repetitions
• latent chroma repetitions
• beat characteristics: indices and time-stamps (in seconds and normalized)

Repetitive features are used for genres with a verse-chorus structure like rock or pop, while non-repetitive features are used for other kinds of music like jazz.

OLDA is an improved version of the linear discriminant analysis algorithm (FDA) developed by R. Fisher [8]. This supervised learning algorithm takes as input a collection of labeled data $x_i \in \mathbb{R}^D$ with class labels $y_i \in \{1, 2, ..., C\}$, then simultaneously tries to maximize the distance between class centroids and to minimize the variance within each class. For that purpose, a linear transformation $W \in \mathbb{R}^{D \times D}$ is defined based on the scatter matrices:

$$A_w := \sum_{c} \sum_{i : y_i = c} (x_i - \mu_c)(x_i - \mu_c)^\top$$

$$A_o := \sum_{c < C} n_c (\mu_c - \mu_{c,c+1})(\mu_c - \mu_{c,c+1})^\top + n_{c+1} (\mu_{c+1} - \mu_{c,c+1})(\mu_{c+1} - \mu_{c,c+1})^\top$$

where $\mu_c$ is the mean of class $c$ and $n_c$ is the number of examples in class $c$. $A_o$ measures the deviation of successive segments $(c, c+1)$ from their mutual centroid $\mu_{c,c+1}$, defined as:

$$\mu_{c,c+1} := \frac{n_c \mu_c + n_{c+1} \mu_{c+1}}{n_c + n_{c+1}}$$

McFee et al. use the same within-class scatter matrix $A_w$ as Fisher, but improve over Fisher's between-class scatter matrix by defining $A_o$ and $\mu_{c,c+1}$ so as to force all segments of one song to be considered of the same class during training. They also add a regularization term $\lambda > 0$ in case $A_w$ is singular. The OLDA optimization aims at solving the following equation:

$$W := \operatorname*{argmax}_W \; \operatorname{tr}\left( (W^\top (A_w + \lambda I) W)^{-1}\, W^\top A_o W \right) \quad (2)$$
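In practice, the boundary detection of step (1) reduces to a single call to MSAF; this sketch assumes an installed MSAF exposing the "olda" boundaries algorithm.

```python
import msaf

def detect_music_boundaries(audio_path):
    # Returns section boundary times in seconds (labels are unused here)
    boundaries, labels = msaf.process(audio_path, boundaries_id="olda")
    return boundaries
```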
According to Goodwin [11] and Vernallis [20], the content of MVs is intimately related to their music genre. A music genre is a category that identifies a music as belonging to a set of conventions. Based on their works and an extensive analysis of our database (≈ 1000 MVs), we observe the following conventions:

• Pop/Indie: close-ups on the singer, dance routines, less conceptual scenes due to the focus on the artist.
• Rock/Metal/Alternative: musical performance, concert performance, more conceptual scenes.
• Hip-Hop/Rap/RnB: sport clothing, black performers, singer close-ups, less conceptual scenes due to the focus on the artist, sexualized women, street environment.
• Electronic/House: dance routines, very conceptual except for DJ concert performances, representations of people partying, nature environment.

There is no official list of all the existing music genres. These categories are established from popular tags sharing common characteristics and frequently paired together on the website Last.fm. Some significantly different genres, such as classical music or jazz, are not represented in our database.
Using this data, we match the input music only with segments of database videos whose music genre corresponds to the input music's genre. This way, we ensure consistency between the style of the music and the video segments.

In order to identify the input music genre, several deep neural networks trained on music genre identification were tested [7]. Unfortunately, neither using the raw weights provided on github nor fine-tuning the available algorithms gave correct results. Therefore, our algorithm uses fingerprinting technology to recognize the music genre (a sketch follows). First, the music name and artist are identified via the ACRCloud fingerprinting API. If the genre is not available, an additional request is sent to the Last.fm API with the music and artist name to finally get the input music genre. Naturally, this method is far from ideal, as it only works for music popular enough to have its fingerprint recorded in the ACRCloud database. To remedy this problem, we also allow users to manually input their music genre in case it is not recognized. The same method, applied to the audio channel of the videos, is used to get the genre of the music videos in our database.
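A sketch of this genre-recognition fallback: identify_track() stands in for the ACRCloud fingerprinting call (its SDK is not shown here), and the Last.fm request uses the public track.getTopTags REST method with a placeholder API key.

```python
import requests

LASTFM_KEY = "YOUR_API_KEY"  # placeholder credential

def genre_from_lastfm(artist, track):
    resp = requests.get("http://ws.audioscrobbler.com/2.0/", params={
        "method": "track.gettoptags", "artist": artist, "track": track,
        "api_key": LASTFM_KEY, "format": "json",
    })
    tags = resp.json().get("toptags", {}).get("tag", [])
    # Use the most popular tag as the genre, if any was returned
    return tags[0]["name"] if tags else None

def recognize_genre(audio_path):
    artist, track, genre = identify_track(audio_path)  # ACRCloud lookup (not shown)
    return genre or genre_from_lastfm(artist, track)
```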
From the previous step, we obtain a list of music videos having the same genre as the input music. These music videos were previously split into scenes in the database. In order to group together different scenes from these MVs, we perform a K-Means clustering, using the library scikit-learn, on the color histograms previously stored in json files. This unsupervised machine learning method assigns a set of points to K clusters following this process (a code sketch follows):

(1) Randomly position the initial centroids $(c_i)_{i=1}^{K}$.
(2) Form K clusters by assigning each point $x$ to its closest centroid: $\operatorname{argmin}_{c_i,\, i=1..K} \operatorname{dist}(c_i, x)$.
(3) Recompute the centroid of each cluster of points $C_i$: $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$.
(4) Repeat steps 2 and 3 until no more assignment changes are observed.

To select the value of K, we iterated over several values of K from 10 to 100 and checked the quality of the clustering. We eventually chose K = 90 to obtain a good partition of colors. From this step, we obtain a list of K clusters made of scenes originating from MVs with the same genre as the input music.
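A minimal sketch of this clustering with scikit-learn, operating on the 768-dimensional histograms loaded from the json files.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_by_color_histogram(histograms, k=90):
    """histograms: array of shape (num_scenes, 768)."""
    km = KMeans(n_clusters=k).fit(np.asarray(histograms))
    # labels[i] is the color cluster assigned to scene i
    return km.labels_
```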
Given the boundaries array of the music segmentation and a set of color clusters, we can finally assemble the final music video (a sketch of this assembly loop follows the list):

(1) Randomly select C clusters large enough to cover the whole song: $\sum_{k=1}^{C} \sum_{i} l_{ki} > l_{input}$, where $l_{ki}$ is the length of scene $i$ in cluster $k$.
(2) Shuffle the scenes, grouped by MV, within each cluster.
(3) As long as no boundary is met, append videos from the same cluster. If the end of the cluster is reached, start appending the next cluster.
(4) When a boundary is met, switch to the next cluster.
(5) Repeat steps 3 and 4 until (end of the song − END_OFFSET) seconds.
(6) Try to find one last scene covering the whole end of the song. If one is found, append it and add a fading effect.

We tried several values of C before settling on our final value.
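A sketch of the assembly loop (steps 3-5); the scene and cluster structures are schematic, and the END_OFFSET value here is an arbitrary placeholder.

```python
END_OFFSET = 5.0  # placeholder value, in seconds

def assemble_timeline(clusters, boundaries, song_len):
    """clusters: list of scene lists; boundaries: times in seconds."""
    timeline, t, current = [], 0.0, 0
    pending = list(boundaries)
    scenes = iter(clusters[current])
    while t < song_len - END_OFFSET:
        if pending and t >= pending[0]:
            # Music boundary met: switch to the next color cluster
            pending.pop(0)
            current = (current + 1) % len(clusters)
            scenes = iter(clusters[current])
        try:
            scene = next(scenes)
        except StopIteration:
            # End of cluster reached: start appending the next cluster
            current = (current + 1) % len(clusters)
            scenes = iter(clusters[current])
            scene = next(scenes)
        timeline.append(scene)
        t += scene["duration_s"]
    return timeline
```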
5 EXPERIMENTAL RESULTS

The original goal of this research was: can an AI algorithm create music videos comparable to human-made MVs?

Human-made MVs can be separated into two groups: professional MVs and amateur MVs. Professional MVs are created by a production team with professional cameramen, editors, etc. Amateur MVs, on the other hand, are music videos created by people as a hobby, usually because they are fans of the music or the artist: either they shoot a whole new music video themselves, or they assemble extracts of previous music videos from that artist in a video editing software to create a realistic-looking music video for that song. We consider the latter kind of amateur MVs for this study.

We asked volunteers to judge a total of 30 music videos and classify them into one of three categories: generated, professional or amateur MV. They were also asked to explain in a text input field why they made this decision. Among these 30 MVs:

• 15 were generated music videos, selected for their quality;
• the remaining 15 were split between professional MVs and amateur MVs.

In total, 126 different workers classified our 30 × 10 tasks. We controlled their classification labels based on the number of videos each worker classified, the correctness of their guesses and the detail of the justifications for their choices, in order to eliminate results from workers who seemed to give random answers. So as not to influence the workers' decisions, we explained each category in the following terms:

• generated: MV created by an AI algorithm
• professional: official MV created for the artist by a professional production team
• amateur: fan-made MV

We grouped together the classification answers for each category of music video to evaluate the performance of our algorithm in figure 6.
Figure 6: Percentage of answers for each category of music video. Light grey: generated MV answers, middle grey: amateur MV answers, dark grey: professional MV answers.
Results show that our generated MVs are most often perceived as human-made videos: 45% of generated videos were mistaken for professional music videos, and 21.6% were mistaken for amateur-made music videos. Users seem to have little knowledge of amateur MVs, as they mostly classified them as professional videos. Nonetheless, these results show a consistent logic:

• The percentage of classification as "generated" increases with the non-professionalism of the video: such labels represent 19% of answers for professional MVs, 23.2% for amateur MVs, and 32.8% for generated MVs.
• The percentage of classification as "professional" decreases with the non-professionalism of the video: such labels represent 65.1% of answers for professional MVs, 62.5% for amateur MVs, and 45.5% for generated MVs.
• The percentage of classification as "amateur" is the greatest for generated videos, which means they were confused for human-made videos but users still noticed some odd characteristics preventing them from giving the "professional" label.

By grouping amateur MVs and professional MVs into one category named human-made MVs, and considering "generated" answers as positives and "human-made" answers as negatives, we can evaluate classification metrics for this experiment in Table 1.
Table 1: Classification metrics

Metric       Value   Interpretation
accuracy     0.55    how correctly the users classified
precision    0.64    how accurate the "generated" predictions are
sensitivity  0.33    how well they recognized generated videos
specificity  0.79    how well they recognized man-made videos

From these metrics, we can interpret that users fail to recognize a generated video in two thirds of cases, but that when they do give the "generated" label, they are mostly correct (64%). They are reluctant to give the "generated" label even when watching a generated video, which is why we obtain a high specificity but a mediocre accuracy.
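For reference, these four metrics follow the usual confusion-matrix definitions, with "generated" as the positive class; the counts below are placeholders, not the study's raw data.

```python
def classification_metrics(tp, fp, tn, fn):
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "precision":   tp / (tp + fp),
        "sensitivity": tp / (tp + fn),  # recall on generated videos
        "specificity": tn / (tn + fp),  # recall on human-made videos
    }
```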
Figure 7: Reasons given by the workers to justify their classification.
To better understand the workers' choices, we analyze the reasons they gave for their classifications. The number of labels for each reason is shown in figure 7. The main reasons quoted by the workers are:

RHYTHM: how well the music and the video were synchronized.
RELATED: how much coherence the video shots presented altogether.
FILMING: the quality of the filming and video effects.
RECOGNIZE: the user recognized either artists or pieces in the video, or the author of the music, helping them decide.
MUSIC: the quality of the music in the video.
MEANING: whether the video content was related to the lyrics.
LIP-SYNC: whether lip-syncing was correctly performed.

A few elements better explain why such a big proportion of videos was classified as either "professional" or "amateur". First, since we decided to base our algorithm not on videos generated from scratch but on extracts of real videos, the output always uses segments of videos with professional filming, sometimes featuring video effects. This is why no video in the whole experiment was classified as "generated" when users judged based on the video quality. This reason also explains confusions between amateur and professional videos: an amateur mashup of high-resolution videos from Chris Brown clips was unanimously categorized as "professional", although a knowledgeable viewer could easily detect the amateur provenance of this video through the black bars and the watermarks. On the other hand, the official MVs Believe from Cher and Get Thru This from Art of Dying were frequently classified as "amateur" or "generated" due to their poor resolution. Bold artistic choices in the "relatedness" category can also induce classification errors: the official MV Money from Pink Floyd received the most "generated" labels in the whole batch, due to the recurrence of shots representing coins and the apparent randomness of the other shots. However, knowing the name of the music and paying attention to the lyrics, one can understand that the "random" and coin shots actually represent a critique of American consumerism. In contrast, amateur and generated music videos used more conventional shots, such as concert extracts, people partying, etc. Yet, the "randomness" of video shots, meaning their lack of coherence, appears to be a correct lead for identifying generated videos, as this was the main reason given for our generated MVs. Second, the quality of the music is one factor helping the workers decide whether the video is generated. Since our algorithm takes a real music as input instead of generating it, workers naturally tend to give a "human-made" label if they judge based on this criterion. Finally, as predicted in Section 3, the meaning of the lyrics represents only a small proportion of the reasons given by the workers to evaluate a music video.

To further interpret these results, we show which proportion of these labels actually comes from generated videos, in order to measure the performance of our algorithm in each of the reason categories shown above. Since we aim at generating realistic-looking videos, receiving the "generated" label for a given reason category means the generated video performed badly in this category, while a "human-made" label means the generated video performed well enough in this category.
Figure 8: Justifications given by the workers when they were actually classifying a generated MV.
In 55.6% of the cases, the rhythm was judged well enough performed to lead workers to classify the generated video as "human-made". Thus, the main leads for improvement are the relatedness of the video shots and the lip-syncing.
6 CONCLUSION

In this paper, we proposed a novel technique for generating a music video from extracts of Youtube music videos. This technique is based on sociological observations of MV structures. The results of the user study show that generated MVs can easily be mistaken for amateur and even professional music videos. User feedback shows that the quality of the video shooting, how the shots form a coherent ensemble, and synchronization with the rhythm are the most important elements for recognizing a realistic-looking music video.

As our algorithm only takes about 1m30s to run, thanks to storing the features and K-Means results in advance, we made it publicly testable on a website, where people can generate their own video clips for their music tracks. After inputting their audio file, users are informed of the progress of the generation algorithm through a modal, and can download the video after a short wait. This website received high interest, as over 300 videos have been generated so far.

There are still many opportunities for improving this algorithm. Out of 100 generated MVs, half of them needed to be discarded because they contained a segment of an unsuitable MV (lyrics video, piece of cartoon, etc.) which could immediately inform the viewer of the nature of the MV. Among these 100, 10 were of good enough quality to be used in the experimentation. Thus, a method for automatically cleaning the database would significantly improve our algorithm's performance, as in this case 20% of the generated videos could be considered of "good quality" instead of 10%. For the detection of lyrics music videos, one could implement a machine learning algorithm recognizing text in the video frames. For the detection of cartoons, one could further analyze the distribution of colors in the frames to detect the percentage of flat color sections.

A further step would be to improve the video-music synchronization and the illusion of lip-syncing, by matching music segments with no singing only with video segments where no mouth is moving, and by only using mouth-opening video segments of people with the same gender as the input music's singer.

Finally, further research on genre recognition would provide a new method more resilient to input music variety. Participation in this project is possible via its public github repository.
REFERENCES
[1] 2019. Cisco Visual Networking Index: Forecast and Trends, 2017-2022 White Paper. Cisco White Paper 1551296909190103. Cisco Systems, Inc.
[2] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. 2016. YouTube-8M: A Large-Scale Video Classification Benchmark. CoRR abs/1609.08675 (2016). arXiv:1609.08675 http://arxiv.org/abs/1609.08675
[3] Sedeño Valdellós Ana, Rodríguez López Jennifer, and Roger Acuña Santiago. 2016. The post-television music video. A methodological proposal and aesthetic analysis. Revista Latina de Comunicación Social 71, 1 (2016), 332–348. https://doi.org/10.4185/RLCS-2016-1098en
[4] Bree Brouwer. 2017. Facebook Video: Views of Sponsored Video Content Jump 258% Since 2016. Retrieved Aug 17, 2017 from https://tubularinsights.com/sponsored-content-q2-2017-report/
[5] R. Cai, L. Zhang, F. Jing, W. Lai, and W. Y. Ma. 2007. Automated Music Video Generation using WEB Image Resource. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), Vol. 2. II-737–II-740. https://doi.org/10.1109/ICASSP.2007.366341
[6] Keunwoo Choi. 2017. YouTube-music-video-5M. Retrieved Aug 3, 2017 from https://github.com/keunwoochoi/YouTube-music-video-5M
[7] Keunwoo Choi, György Fazekas, Mark B. Sandler, and Kyunghyun Cho. 2017. Transfer learning for music classification and regression tasks. CoRR abs/1703.09179 (2017). arXiv:1703.09179 http://arxiv.org/abs/1703.09179
[8] Ronald Aylmer Fisher. 1936. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7, 1 (September 1936), 179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
[9] Jonathan Foote, Matthew Cooper, and Andreas Girgensohn. 2002. Creating Music Videos Using Automatic Media Analysis. In Proceedings of the Tenth ACM International Conference on Multimedia (MULTIMEDIA '02). ACM, New York, NY, USA, 553–560. https://doi.org/10.1145/641007.641119
[10] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015. A Neural Algorithm of Artistic Style. CoRR abs/1508.06576 (2015). arXiv:1508.06576 http://arxiv.org/abs/1508.06576
[11] Andrew Goodwin. 1992. Dancing In The Distraction Factory: Music Television and Popular Culture. University of Minnesota Press, Minneapolis. https://muse.jhu.edu/
[12] Gaëtan Hadjeres and François Pachet. 2016. DeepBach: a Steerable Model for Bach chorales generation. CoRR abs/1612.01010 (2016). arXiv:1612.01010 http://arxiv.org/abs/1612.01010
[13] Xian-Sheng Hua, Lie Lu, and Hong-Jiang Zhang. 2004. Automatic Music Video Generation Based on Temporal Pattern Analysis. In Proceedings of the 12th Annual ACM International Conference on Multimedia (MULTIMEDIA '04). ACM, New York, NY, USA, 472–475. https://doi.org/10.1145/1027527.1027641
[14] Li Li, Xianglin Zeng, Xi Li, Weiming Hu, and Pengfei Zhu. 2009. Video Shot Segmentation Using Graph-based Dominant-set Clustering. In Proceedings of the First International Conference on Internet Multimedia Computing and Service (ICIMCS '09). ACM, New York, NY, USA, 166–169. https://doi.org/10.1145/1734605.1734645
[15] Yitong Li, Martin Renqiang Min, Dinghan Shen, David E. Carlson, and Lawrence Carin. 2017. Video Generation From Text. CoRR abs/1710.00421 (2017). arXiv:1710.00421 http://arxiv.org/abs/1710.00421
[16] Jen-Chun Lin, Wen-Li Wei, and Hsin-Min Wang. 2016. Automatic Music Video Generation Based on Emotion-Oriented Pseudo Song Prediction and Matching. In Proceedings of the 2016 ACM on Multimedia Conference (MM '16). ACM, New York, NY, USA, 372–376. https://doi.org/10.1145/2964284.2967245
[17] Jen-Chun Lin, Wen-Li Wei, James Yang, Hsin-Min Wang, and Hong-Yuan Mark Liao. 2017. Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing. In Proceedings of the 25th ACM International Conference on Multimedia (MM '17). ACM, New York, NY, USA, 519–527. https://doi.org/10.1145/3123266.3123399
[18] Brian McFee and Daniel P. W. Ellis. 2014. Learning to Segment Songs with Ordinal Linear Discriminant Analysis. In Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[20] Carol Vernallis. 2004. Experiencing Music Video: Aesthetics and Cultural Context. Columbia University Press, New York.
[21] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Generating Videos with Scene Dynamics. CoRR abs/1609.02612 (2016). arXiv:1609.02612 http://arxiv.org/abs/1609.02612
[22] Wikipedia. 2019. List of most-viewed YouTube videos. Retrieved 7 March 2019 from https://en.wikipedia.org/wiki/List_of_most-viewed_YouTube_videos
[23] Marissa Window. 2018. 5 data-driven tips for scroll stopping video. Retrieved July 24, 2018 from https://business.twitter.com/en/blog/5-data-driven-tips-for-scroll-stopping-video.html
[24] Xixuan Wu, Bing Xu, Yu Qiao, and Xiaoou Tang. 2012. Automatic Music Video Generation: Cross Matching of Music and Image. In Proceedings of the 20th ACM International Conference on Multimedia (MM '12).
[26] In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, Maosong Sun, Xiaojie Wang, Baobao Chang, and Deyi Xiong (Eds.). Springer International Publishing, Cham, 211–223.
[27] Jong-Chul Yoon, In-Kwon Lee, and Siwoo Byun. 2008. Automated music video generation using multi-level feature-based segmentation. Multimedia Tools and Applications 41 (2008), 197. https://doi.org/10.1007/s11042-008-0225-0
[28] J. Yuan, H. Wang, L. Xiao, W. Zheng, J. Li, F. Lin, and B. Zhang. 2007. A Formal Study of Shot Boundary Detection. IEEE Transactions on Circuits and Systems for Video Technology 17, 2 (Feb 2007), 168–186. https://doi.org/10.1109/TCSVT.2006.888023
[29] HongJiang Zhang, Atreyi Kankanhalli, and Stephen W. Smoliar. 1993. Automatic partitioning of full-motion video. Multimedia Systems 1, 1 (1993), 10–28.