An audio-only method for advertisement detection in broadcast television content
António Ramires [email protected]
Diogo Cocharro [email protected]
Matthew E. P. Davies [email protected]
INESC TEC, Sound and Music Computing Group, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
Abstract
We address the task of advertisement detection in broadcast television content. While typically approached from a video-only or audio-visual perspective, we present an audio-only method. Our approach centres on the detection of short silences which exist at the boundaries between programming and advertising, as well as between the advertisements themselves. To identify advertising regions we first locate all points within the broadcast content with very low signal energy. Next, we use a multiple linear regression model to reject non-boundary silences based on features extracted from the local context immediately surrounding the silence. Finally, we determine the advertising regions based on the long-term grouping of detected boundary silences. When evaluated over a 26 hour annotated database covering national and commercial Portuguese television channels we obtain a Matthews correlation coefficient in excess of 0.87 and outperform a freely available audio-visual approach.
The classification of audiovisual content into categories and the identification of advertising has become increasingly important for end-users, broadcasters and entities that have contracted advertising space. This has special importance in the case of television content both for the need to archive content with the advertising removed, and in streaming contexts to allow for the region-specific substitution of advertising.

Currently, the delimitation of advertising segments, i.e., the identification of the beginning and end moments of a contiguous set of advertising content, is typically performed by a human operator. As a result, the process is labour-intensive, expensive, and potentially error-prone [2]. One means to improve the workflow of the human operator is to provide an automatic analysis of the broadcasting content which can classify the time-line into regions of advertising and regular programming.

Existing algorithms for the detection of advertising fall into two main categories: those which use explicit prior knowledge of a known set of advertisements and identify them using fingerprinting methods [1], and those which rely on heuristics as advertising indicators. For both types of approach, information can be leveraged from the video signal alone (logos, black frames, scene changes etc.), or in combination with the audio stream within audio-visual approaches [3].

In this work, our focus is on an audio-only approach for television advertising detection which makes no use of video information, meta-data concerning the content type, or any prior knowledge of which advertisements can appear. To this end, we seek to discover if there is sufficient information in the audio signal alone to locate where advertisements occur.
From this perspective, relevant acoustic cues include the presence of silence (often co-occurring with black frames at content boundaries), the presence of jingles, fast paced narration, background music, and identified repeated content, which operates on the assumption that advertisements are repeated more frequently than regular programming.

In our audio-based approach, we focus on a single acoustic property, that of silence, which we assume to indicate very low signal energy rather than digital zeros in the audio bit-stream. We believe that silences have been under-used in the existing literature, having been treated as just one feature among many which contribute towards the final decision. In our approach, we seek to maximize the information that can be obtained from detecting silences. Furthermore, we propose that by effective characterisation of different types of silences and the large scale grouping of an identified set of “boundary silences” we can obtain a very reliable descriptor of advertising boundaries in television.
Our approach centres on the existence and detection of short pauses of silence (i.e., very low audio signal energy) in between separate pieces of content. We now provide an overview of each stage of the algorithm. Throughout, we assume the audio signal (a stereo signal sampled at 48 kHz with 24-bit precision) has already been separated from the video content, and mixed down to mono. We notate the audio input as $x$.

To maintain parity with the video frame rate of the television content (and allow easy integration with future video-based analysis) we partition $x$ into non-overlapping audio frames of 1920 samples (equivalent to 25 video frames per second). In each audio frame, $x_i$, we calculate the signal energy, $e_i = 20 \log_{10}\left(\sqrt{\mathrm{mean}(x_i^2)}\right)$. By taking the measurement in dB, we force all low energy parts of the signal to take large negative values. Next, to find all the low energy points in the input signal, we compare $e_i$ at each frame $i$ to a silence threshold, $\eta = -60$ dB, and retain those frames $i_s$ for which $e_i \leq \eta$. An example is shown in the top plot of Fig. 1.
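The framing and thresholding steps above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the energy is the RMS level in dB, that a trailing partial frame is discarded, and that a small floor is applied before the logarithm so that digital zeros do not produce minus infinity. The function names `frame_energy_db` and `silent_frames` and the `floor` parameter are hypothetical.

```python
import numpy as np

FRAME_LEN = 1920     # samples per frame: 48 kHz audio at 25 frames per second
SILENCE_DB = -60.0   # silence threshold (eta in the text)

def frame_energy_db(x, frame_len=FRAME_LEN, floor=1e-10):
    """Return the RMS energy in dB of each non-overlapping frame of x.

    Assumption: samples left over after the last full frame are dropped,
    and the RMS is floored (hypothetical choice) to avoid log10(0).
    """
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(np.maximum(rms, floor))

def silent_frames(e, threshold=SILENCE_DB):
    """Return the indices of frames whose energy is at or below the threshold."""
    return np.flatnonzero(np.asarray(e) <= threshold)
```

With a mono signal `x` scaled to the range [-1, 1], `silent_frames(frame_energy_db(x))` would give the candidate silence frames that the later stages filter and group.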
Figure 1: (top) Energy of input signal, with regions under the silence threshold shown in black. (middle) Output of the regression model on detected silences. Points above the decision threshold are shown in black. (bottom) The output classified as advertising and the corresponding ground truth.

Since short regions of silence can occur naturally within programming, e.g. as pauses between speech (either during narration or interviews with no background music or noise), we must filter out those silences which do not correspond to content boundaries. In our model we assume a boundary silence to be short in duration, to have a low minimum value, and to be surrounded by regions of much higher energy. From a broadcast perspective we understand this as perceptually loud advertising content either side of a brief, imperceptible drop in energy, as shown in Fig. 2.

Figure 2: Examples of non-boundary silence (left) and boundary silence (right). The detected silence is at the mid-point of each plot.

To distinguish between different types of silence we collect a small set of statistics: the max, mean, min, inter-quartile range, standard deviation, skewness, and kurtosis from a small temporal window of ± … (± … audio frames) of the energy signal $e$ surrounding each detected silence $i_s$. We then perform a basic multiple linear regression on the extracted features where positive examples (i.e. annotated boundary silences) are labelled as 1, and non-boundary silences are labelled as 0. The output of the regression is shown in the middle plot of Fig. 1. Here, all detected silences with a regression output greater than the decision threshold, $\beta = 0.25$, are retained and set to a value of 1, with all others discarded.

In the final stage of our algorithm, we pass a sliding rectangular window of 150 s duration across the thresholded regression output. We determine regions of advertising as those which adhere to the following two conditions: i) there is more than one detected boundary silence within the long-term window (i.e. at least one starting and one ending silence); ii) the total duration of any period of advertising must be at least 60 s. In this way isolated silences or those which are far from one another are excluded. The start of the detected advertising region is marked at the frame where the first detected boundary silence exits the long-term window. Likewise the end of the region occurs at the frame when the final boundary silence of any group exits the long-term window. An example of the final output of the system is shown in the bottom plot of Fig. 1.

We evaluate our algorithm over an annotated dataset we have compiled covering national (two instances of RTP 1 and one of RTP 2) and commercial channels (SIC and TVI) of Portuguese television. The dataset contains over 26 hours of content (segmented in 28 programmes), which has been annotated at two levels.
First, to mark the high level boundaries between regular programming and advertising blocks, and second at a finer temporal level to mark the boundaries between all commercials. We use this second level for training the linear regression model.

In order to measure the performance of our silence-based method, we first count the number of true positives, $TP$, true negatives, $TN$, false positives, $FP$, and false negatives, $FN$, where a $TP$ corresponds to a region which is both annotated and detected as advertising.

As we can expect with broadcast television content, a far greater proportion of the content corresponds to scheduled programming rather than advertising (in our case, approximately 12% is advertising). While many approaches in the literature report the F-measure as a performance indicator for advertising, this excludes any information about the number of $TN$. To incorporate this information we instead report the Matthews correlation coefficient, $M$, which is calculated as follows:

$$M = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (1)$$

In addition to reporting the performance of our proposed approach, we also ran an open source audio-visual approach called ComSkip (v. 0.82, accessed 06-15-2017) under the default parameter settings. A comparison of performance between the two approaches is shown in Table 1.

Table 1: Summary of dataset and comparison of algorithm performance. Columns: Input Channel, Total Duration, Advertising Duration, ComSkip Accuracy, Proposed Alg. Accuracy.

As can be seen, our proposed approach outperforms ComSkip across all channels, with a correlation coefficient in excess of 0.87. Indeed, our approach performs especially well on the commercial channels (SIC and TVI), which contain large blocks of advertising content (running into several minutes at a time) with explicit use of silences between individual advertisements.

The lowest performance was obtained on RTP 2. This channel contained a far lower proportion of commercial advertising, with the breaks between programming more frequently containing trailers for upcoming in-channel content (and without such prominent silence boundaries). Since this content falls between the main programming, it can be understood as advertising, and thus something which our current approach cannot readily detect. However, given the critical requirement in advertising removal applications not to misclassify programming as advertising, our proposed approach has explicitly been parameterised to minimise false positives. To this end, it provides “conservative” estimates of advertising boundaries. Indeed, over the 26 hours, our approach has just 6 false positive frames, with ComSkip having only 761 false positive frames (∼30 s).

A potential criticism of the comparative results is that they may be somewhat optimistic since our approach has partial access to the dataset for training, whereas ComSkip does not. However, our multiple linear regression model was trained using leave-one-out cross-validation at the programme level, and therefore we maintain some separation between training and testing material. Informal tests on currently un-annotated validation data also indicate highly promising performance, and larger-scale evaluation will be among the main areas of future work.
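For readers who wish to reproduce the evaluation measure, Eq. (1) can be computed directly from the four frame counts. The sketch below is illustrative only: the function name `matthews_coefficient` is hypothetical, and returning 0 when the denominator is zero is a common convention that we assume here, since the text does not specify that case.

```python
import math

def matthews_coefficient(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts, as in Eq. (1).

    Assumption: when any marginal sum is zero the denominator vanishes,
    and we return 0.0 by convention rather than raising an error.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike the F-measure, the numerator and denominator both depend on the true negative count, so a detector that labels everything as programming scores 0 rather than benefiting from the roughly 88% of frames that are not advertising.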
We have presented a new audio-only approach for the detection of advertising in television broadcast content. Our approach relies on the short, medium, and long-term modelling of silences within the audio stream as a means for distinguishing regular programming from advertising. A novel feature of our approach is the ability to reject silences (e.g. pauses in speech) which do not exhibit the statistical properties of content boundaries. Currently our approach has been optimised for Portuguese television content, therefore a main focus of our future work will be to investigate the accuracy of our approach on international television content. Furthermore, we intend to enhance our audio-only model via the inclusion of other important cues, including jingle detection, music/speech separation and audio production effects related to bandwidth and dynamic range.
This article is a result of the project MOG CLOUD SETUP - Nº

References
[1] P. Cardinal, V. Gupta, and G. Boulianne. Content-based advertisement detection. In INTERSPEECH, pages 2214–2217, 2010.
[2] D. Conejero and X. Anguera. TV advertisements detection and clustering based on acoustic information. In Intl. Conf. on Computational Intelligence for Modelling Control Automation, pages 452–457, 2008.
[3] M. Covell, S. Baluja, and M. Fink. Advertisement detection and replacement using acoustic and visual repetition. In