Prediction and Optimal Scheduling of Advertisements in Linear Television
Mark J Panaggio, Pak-Wing Fok, Ghan S Bhatt, Simon Burhoe, Michael Capps, Christina J Edholm, Fadoua El Moustaid, Tegan Emerson, Star-Lena Estock, Nathan Gold, Ryan Halabi, Madelyn Houser, Peter R Kramer, Hsuan-Wei Lee, Qingxia Li, Weiqiang Li, Dan Lu, Yuzhou Qian, Louis F Rossi, Deborah Shutt, Vicky Chuqiao Yang, Yingxiang Zhou
PPanaggio et al.
RESEARCH
Prediction and Optimal Scheduling ofAdvertisements in Linear Television
Mark J Panaggio , Pak-Wing Fok , Ghan S Bhatt , Simon Burhoe , Michael Capps , Christina JEdholm , Fadoua El Moustaid , Tegan Emerson , Star-Lena Estock , Nathan Gold , Ryan Halabi ,Madelyn Houser , Peter R Kramer , Hsuan-Wei Lee , Qingxia Li , Weiqiang Li , Dan Lu , YuzhouQian , Louis F Rossi , Deborah Shutt , Vicky Chuqiao Yang and Yingxiang Zhou * Correspondence:[email protected] Mathematics Department,Rose-Hulman Institute ofTechnology , Terre Haute, IN47803, USAFull list of author information isavailable at the end of the article
Abstract
Advertising is a crucial component of marketing and an important way forcompanies to raise awareness of goods and services in the marketplace.Advertising campaigns are designed to convey a marketing image or message toan audience of potential consumers and television commercials can be aneffective way of transmitting these messages to a large audience. In order to meetthe requirements for a typical advertising order, television content providers mustprovide advertisers with a predetermined number of “impressions” in the targetdemographic. However, because the number of impressions for a given program isnot known a priori and because there are a limited number of time slots availablefor commercials, scheduling advertisements efficiently can be a challengingcomputational problem. In this case study, we compare a variety of methods forestimating future viewership patterns in a target demographic from past data.We also present a method for using those predictions to generate an optimaladvertising schedule that satisfies campaign requirements while maximizingadvertising revenue.
Keywords: advertising; programmatic TV; optimization; linear programming;machine learning; Bayesian estimation; data science
AMS Subject Classification:
The use of advertisements as a means for marketing consumer goods has been com-mon practice for centuries. Mass distribution of advertisements through newspaperswas first made possible by the printing press, but it was the advent of the radio andthe television in the twentieth century that ultimately revolutionized advertising byallowing companies to transmit marketing messages into millions of homes aroundthe world simultaneously [1]. Today, advertising is an over $500 trillion dollar globalindustry, and although advertising through digital media is growing rapidly, tele-vision remains the primary advertising medium with total television advertisingexpenditures making up approximately 40% of the worldwide total [2].Advertising on “linear” (traditional live, not on-demand) television typically con-sists of an arrangement between content providers/programmers (TV networks suchas ABC, NBC and Fox or cable operators such as Comcast or Cox) and advertis-ing agencies in which the networks/operators are paid to run commercials in orderto reach a desired audience. This audience is typically specified in demographic or a r X i v : . [ m a t h . O C ] A ug anaggio et al. Page 2 of 24 psychographic terms, such as “women 18 - 54”, or “people concerned with healthand fitness.” A campaign’s marketing target can be quantified by three measures: impressions , which is the total number of times the message or ad is seen by amember of the target audience, reach , which is the number of unique members ofthe target groups exposed to the ad, and frequency , which is the average numberof times the ad is viewed by each member of the target group that is reached bythe ad. When a content provider agrees to fill an advertising order, they committo running the commercial as many times as necessary until the desired numberof impressions (and possibly reach and frequency) have been obtained. Since theavailable commercial time in a given time frame is limited and each order that acontent provider is able to fill provides additional revenue, it is of interest to meeteach impression target in such a way that leaves broadcast time available for ad-ditional orders. Therefore, before an order can be accepted, content programmersmust assess whether the number of impressions can be achieved in an acceptabletime-frame and whether the budget for the order is large enough to warrant usingbroadcast time to provide those impressions.In order to make this determination, it is necessary to be able to generate a sched-ule that satisfies the order constraints in an efficient way. However, the generationof optimal advertising schedules poses a number of challenges. First of all, one mustknow the viewership demographics of each television program. Unfortunately, thisis not known in advance and can only be estimated from past viewership data.Secondly, depending on the content provider and time-frame, there can be a largenumber of possible orders and available commercial slots along with a large numberconstraints on what schedules are acceptable. This makes finding an optimal sched-ule a computationally intensive problem that cannot usually be solved by hand.For this reason, naive approaches to scheduling lead to wasted resources and disen-chanted audiences when ads fail to reach the interested consumers efficiently andmust be aired repeatedly in order to meet impression targets.Here, we address two aspects of this interesting mathematical problem. First,using data modeled statistically mimic real TV viewership behavior as reported,for example by The Nielsen Company [3], we explore a number of methods forpredicting the number of impressions of future programming (Section 2) . Thesemethods make use of spectral analysis, machine learning, Kalman filtering and dis-tance scores. Second, we demonstrate that when combined with advertising orders,these predictions can be used to formulate a nonlinear optimization problem thatcan be solved using standard integer programming techniques (Section 3). Finally,we outline a method for extending earlier results to account for the reach and fre-quency of an advertisement (Section 4). This work was sponsored by clypd Inc. andmade possible by the 2015 Mathematical Problems in Industry Workshop.
In order to generate predictions for the viewership and demographics of futureprograms, we used simulated past program viewership data provided by clypd Inc.Since precise data on the viewership of commercials was not available, we assumethat program viewership data is representative of the viewership of advertisementsas well. Figure 1 shows a time series for the number of impressions on one particular anaggio et al.
Page 3 of 24 channel over a period of about 273 days. Qualitatively, the signal is noisy with largespikes in the viewership. time (days) N u m be r o f i m p r e ss i on s Figure 1
Impressions per hour. Number of impressions for each hour over the course of 273 daysstarting from October 23, 2014 on one particular channel.
Although the data appears noisy, there are clear periodic trends in the data. Theseare mostly likely driven by the periodic nature of the channel programming. Inorder to identify these trends, we assume that the number of impressions, S ( t ) canbe decomposed into a deterministic, periodic part P ( t ) and a stochastic part η ( t ), S ( t ) = P ( t ) + η ( t ) . (1)We attempt to filter the signal to remove η ( t ), leaving behind P ( t ) by first fillingin missing data using linear interpolation and then performing a Fourier transformon the data and taking only the dominant modes in the power spectrum.For this filtering scheme, we used Matlab’s fft and ifft algorithms to computethe power spectrum and then removed all frequencies with amplitude less than agiven threshold A thresh . We write the full signal as S ( t ) = A + N (cid:88) j =1 ( A j e ik j t/T + A ∗ j e − ik j t/T ) , (2)where t is the time (in hours), N is the number of frequencies in the transform, T is the total duration of the signal, A j and A ∗ j are complex amplitudes (conju-gates of each other), and k j are dimensionless frequencies. For our data set, wehave N = 3275, and the total duration of the signal is T = 6551 hours ≈
273 days.This corresponds to 6551 data points S (0) , . . . , S (6550) and 6551 Fourier coeffi-cients (distinguishing between A j and A ∗ j ) and therefore, representation (2) exactlyreproduces S ( t ). anaggio et al. Page 4 of 24
Time (Days)0 50 100 150 200 250 300 I m p r e ss i on s , S ( t ) (a) Original Data Frequency, k/T (1/Day)0 5 10 15 A m p lit ud e , | A k | × (b) Spectrum Time (Days)0 50 100 150 200 250 300 F ilt e r e d I m p r e ss i on s , P ( t ) (c) Reconstructed Data ( A thresh = 15 , Frequency, k/T (1/Day)0 1 2 3 A m p lit ud e , | A k | × (d) Spectrum Frequency, k/T (1/Day)0 1 2 3 A m p lit ud e , | A k | × (f) Spectrum Time (Days)0 50 100 150 200 250 300 F ilt e r e d I m p r e ss i on s , P ( t ) (e) Reconstructed Data ( A thresh = 20 , Figure 2
Total viewership. (a,b): unfiltered data and power spectrum (frequency has units ofday − ). (c,d): filtered signal and power spectrum with A thresh = 15 , . (e,f): filtered signal andpower spectrum with A thresh = 20 , . Time (Days) I m p r e ss i on s , S ( t ) (a) Original Data Frequency, k/T (1/Day) A m p lit ud e , | A k | (b) Spectrum Time (Days) F ilt e r e d I m p r e ss i on s , P ( t ) (c) Reconstructed Data ( A thresh = 2 , Frequency, k/T (1/Day) A m p lit ud e , | A k | (d) Spectrum Frequency, k/T (1/Day) A m p lit ud e , | A k | (f) Spectrum Time (Days) F ilt e r e d I m p r e ss i on s , P ( t ) (e) Reconstructed Data ( A thresh = 3 , Figure 3
Viewership of males 65 and older (a,b): unfiltered data and power spectrum (frequencyhas units of day − ). (c,d): filtered signal and power spectrum with A thresh = 2 , . (e,f): filteredsignal and power spectrum with A thresh = 3 , .anaggio et al. Page 5 of 24
We decompose the signal S ( t ) as follows S ( t ) = A + (cid:88) j : | A j | >A thresh ( A j e ik j t/T + A ∗ j e − ik j t/T ) (cid:124) (cid:123)(cid:122) (cid:125) filtered signal , P ( t ) + (cid:88) j : | A j |≤ A thresh ( A j e ik j t/T + A ∗ j e − ik j t/T ) (cid:124) (cid:123)(cid:122) (cid:125) noise , η ( t ) (3)where the signal is defined as P ( t ) = A + (cid:88) j : | A j | >A thresh ( A j e ik j t/T + A ∗ j e − ik j t/T ) . (4)while the noise is defined as η ( t ) = (cid:88) j : | A j |≤ A thresh ( A j e ik j t/T + A ∗ j e − ik j t/T ) . (5)Therefore, noise can be eliminated by removing all frequency modes j such that | A j | ≤ A thresh . In other words, which Fourier modes are considered part of thesignal and which are part of the noise depends solely on the cut-off A thresh . The resulting signal is displayed in Figures 2 (for the general viewership) and 3(for males 65 and older). Power spectra are shown for rescaled frequencies k j /T which have units of inverse time. In Figure 2, we see that there are “spikes” at k/T = i/ i = 1 , . . . ,
7. There are also spikes at k/T = 0 and k/T = 2per day. Broadly speaking, the power spectra reveal that the predictable portion ofthe signal contains nine dominant frequencies/periods for viewer behavior1. The zero mode in which the television is always on or off, regardless of thetime ( k/T = 0 per day)2. Viewing patterns that repeat weekly ( k/T = 1 / k/T = 2 / k/T = 3 / k/T = 4 / k/T = 5 / k/T = 6 / k/T = 1 per day)9. Viewing patterns that repeat twice daily ( k/T = 2 per day)These behaviors do not necessarily refer to the same members of the viewership. Thefrequencies above also appeared in different demographic groups. However, malesand females 65 and older did not exhibit these frequencies (see Fig. 3) suggestingthat older members of the population have qualitatively different viewing habits.We should note, however, that upon taking different 7-week subsets of the data,some of the spikes at k/T = 1 / , / , / , . . . , / k/T = 0, k/T = 1 and k/T = 2 per day modes are the most robust.In summary, viewership patterns are periodic with both daily and weekly fre-quency components. The weekly pattern is shown as a solid black curve in Figure anaggio et al. Page 6 of 24
Time(hr)0 20 40 60 80 100 120 140 160 180 a m oun t o f i m p r e ss i on s i n ea c h hou r × Figure 4
Predicted week-long viewership trend. The signal (for channel 7287) exhibits clearperiodicity with dominant frequencies k/T = , , . . . , per day and k/T = 2 per day.
4. This figure indicates the median number of impressions over a typical week alongwith confidence intervals. The distribution of impressions at a given time in theweek is found from the first 38 weeks of data. 90% confidence intervals (dashedlight blue) are calculated by taking the 5th and 95th percentiles of the distribu-tion. We see that during the evening of each day, there is a rise in the viewership.Saturday seems to be the hardest to predict since it has the highest variability inimpressions. The red curve shows the viewership over the 39th week, which mostlyfalls within the confidence intervals. −200 −150 −100 −50 0 50 100 150 200 250 30000.0020.0040.0060.008 η P r ob a b ilit y D e n s it y Figure 5
Distribution of noise data points. A histogram of the data points in the time-series forthe noise η ( t ) (obtained using the cut-off A thresh = 20 , ) is displayed along with fitted normaland t Location-scale distributions. The mean and standard deviation of the best-fit normal pdf are µ = − . and σ = 32 . The parameters of the best-fit t Location-scale pdf are ( µ, σ, ν ) = ( − . , , . We now examine the noise η ( t ), after removing the periodic signal using A thresh =20 , η is a time-homogeneous sequence of random variables with anaggio et al. Page 7 of 24 a different realization for every t :Prob[ x ≤ η ( t ) ≤ x + dx ] = f ( x ) dx. (6)Attempts at fitting this data to various well-known probability distributions revealstwo distributions that yield a good fit: normal and the t location-scale distributions(see Fig 5) with t -location-scale distribution yielding a better overall fit. For anormal distribution, the we obtained the best-fit N ( µ, σ ) with µ = − . σ = 32. For the t location-scale distribution with probability density function f ( x ) = Γ[( ν + 1) / σ √ νπ Γ( ν/ (cid:20) [( x − µ ) /σ ] + νν (cid:21) − ν +12 , (7)we found that the best-fit parameters were µ = − . σ = 20 and ν = 3. Figure 6
Definition of spike time. As highlighted in the schematic, spike time is defined as thebeginning of a series of impressions that exceed a certain threshold. In our analysis, the thresholdis set to be the 95th percentile of the data.
One notable feature of S ( t ) is that large spikes seem to occur randomly in time.We define the spiking event time as the time when the impressions signal crossesa fixed threshold from below (see Fig. 6). Here we choose the threshold to bethe 95th percentile of the available impressions data points. In Figure 7, we showthat the distribution of waiting times τ between consecutive spikes appears to beapproximately exponentially distributed:Prob[ t ≤ τ ≤ t + dt ] = λ exp( − λt ) dt. (8)This suggests that the spiking has no memory (spiking is approximately Markovian)and occurs at a Poisson rate of λ ≈ .
015 per hour, so that the mean time betweenspikes is about 69 hours. The analysis of spiking time was performed on unfiltereddata S ( t ). One might also consider spikes in the noise left over from the filtering, η ( t ). However, this yields similar results because the periodic signal generally hassmall amplitude. anaggio et al. Page 8 of 24 f r equen cy −8 −6 −4 −2 number of hours between spike events e s t i m a t ed pd f exponential distribution λ e − λ t , λ = 0.0145ksdensity fit Figure 7
Distribution of spike times. The time between two spikes in viewership approximatelyfollows an exponential distribution. ksdensity is a Matlab routine that attempts to find the pdfassociated with given data.
Although spectral analysis of viewership data provides insight into the mechanismsthat contribute to the observed trends, this approach assumes periodic behavior andignores programming information. These findings can be helpful for filling in missingdata and for estimating viewership of new programming, but an approach that takesinto account the programming could potentially explain both the periodic behaviorand the noise. We therefore implement a machine learning algorithm to predictthe number of impressions in a time slot by learning from past data. The machinelearning task is defined with the following attributes: program ID (nominal), day ofthe week (nominal) and time of the day (numeric). The output class is the numberof impressions.We explored two different approaches. First, we treated the output class as nu-meric. A number of machine learning methods are suitable for this task, and wetested k-nearest neighbor, neural networks, linear regression, regression tree, andk-star. The best performing algorithm was 1-nearest neighbor. The root relativesquare error for this method was 61% (100% would correspond to the error innaively guessing the mean of all impressions).Second, we divided the output class into 5 bins, and tested 4 algorithms: decisiontrees, random forest, naive Bayes, and random tree. Decision tree and randomforest performed equally well - the error was 44% (compared with 80% from naivelyguessing the correct bin).A learning curve is shown in Figure 8. The fact that the curve did not plateaushows potential for more accurate prediction given more data. The fact that testingerror decreases with training error demonstrates that this approach does not over-fitthe data.
We also investigated the use of Bayesian estimation and Kalman filtering to predictthe number of impressions for a program. In order to do this, we neglected any sam-pling issues that may be present in the data and assumed perfect data. We treatedthe number of interested viewers as fixed with a particular probability of watchingthe program or not. This probability can be modeled by a binomial distribution anaggio et al.
Page 9 of 24 % i n c o rr e c t c l a ss i f i c a t i on Learning curve for decision tree training errortesting error
Figure 8
Learning curve for decision tree machine learning algorithm. The lack of a plateausuggests that the algorithm could be improved with more data. Given the small range ofvariability, alternative decision trees should be explored. because of the two possible values; however, due to the large sample size of viewers,we can apply the Central Limit theorem and approximate the distribution with anormal distribution [4].For a Bayesian estimation, a prior probability distribution and likelihood func-tion must be assumed in order to formulate an initial prediction. After this initialprediction is made, the new data is observed and is used to update the probabilitydistribution and gives a posterior probability distribution. For our purposes, thisposterior distribution was then used as our prior for predicting the next week [5].The number of impressions, S , for each week w was modeled by a Gaussian withmean µ ( w ) and standard deviation σ . While we allow µ to vary from week to week,we keep σ fixed for simplicity. Under the Bayesian framework, we view µ ( w ) as anunknown parameter which we will represent by a subjective probability distribution.We can think of µ ( w ) as a measure of the popularity of the show at week w , whilethe actual viewership will have an unpredictable fluctuation from week to week dueto external factors. The standard deviation σ measures this inherent variability inweekly viewership.To find a reasonable value for the fixed σ , the available data was divided into aparticular number of bins. For each bin of data, the standard deviation was found,and then all of the standard deviations were averaged together to get an average standard deviation. This was repeated multiple times with varying numbers of binsin order to choose the optimal (smallest) standard deviation. The smallest valuefound was then used as σ for S .To understand the distribution of µ , a recursive Bayesian estimation was used. Tostart, a weak prior probability distribution was chosen for µ based on a Gaussianfit to a histogram of 39 weeks of historical viewership data for a given time sloton a given channel on a given day. These histograms and their Gaussian fits areillustrated in Figure 9 for 7 different weekly time slots, namely 8 p.m. on the givenday of the week and a given channel. In particular, the prior distribution for µ willdepend on the time of the week (and channel) under consideration. Bayes’ rule isapplied in the usual way to update the prior distribution for an upcoming week withthe likelihood of the data, once collected, to obtain a posterior distribution for µ onthat week. This posterior distribution is then taken as the prior distribution for µ anaggio et al. Page 10 of 24 M onda y −5 T ue s da y −5 W edne s da y −5 T hu r s da y −5 x 10 F r i da y −5 S a t u r da y −5 S unda y −5 Impressions at 8pm for 39 weeks
Figure 9
Impressions by day. (Left) Histogram of the number of impressions during the 8 p.m.slot for each day of the week over 39 weeks. (Right) Approximating normal distribution ofavailable data. on the following week. The likelihood model, as described above, is Gaussian, so therecursive Bayesian estimation procedure reduces to a simple version of the Kalmanfilter, with the predicted (prior) distribution of the parameter µ at the next weektaken to be the same as the posterior distribution on the current week. The evolvingvalue of µ is used as a point estimate for the expected number of impressions for agiven channel on a given day of the week at a given time.In order to assess the performance of this model, we chose the first 20 weeks ofdata as the “training” data and then tested the model against the last 19 weeks.We tested 7 such data sets, namely the viewereship of a given channel at 8 p.m. onone of the 7 days of the week. For every day of the week the relative errors betweenthe predicted impressions and observed number of impressions for the last 19 weekswere computed. The root mean square (RMS) of the relative errors was calculatedas the measure of error for our model. The RMS error for each day of the weekwas below 30%, where days such as Tuesday, Friday and Saturday were under 10%.This suggests that our model is able to make reasonably accurate predictions for anaggio et al. Page 11 of 24 the number of future impressions given the data from the first 20 weeks. Theseresults are displayed in Table 1. The table also includes an example of our model’sprediction for a week (39 th week) for a particular network at 8:00pm. Day Mon Tues Wed Thurs Fri Sat SunDate 2/2 2/3 2/4 2/5 2/6 2/7 2/8Model
Actual
Rel. Error
RMS for Day
Table 1
Impressions estimate for February 2-8, 2015 for a given network at 8:00pm. The modelestimates are given along with the observed number of impressions. The relative errors for thedisplayed week is calculated and compared to the root mean square determined from all of the 39weeks.
One shortcoming the previous two approaches is that they require accurate dataabout the viewers of a particular program to train the algorithm. However, for newprogramming it is necessary to generate predictions for the number of impressionswith limited data. To overcome this challenge, it is helpful to consider the viewer-ship trends of similar programs. We therefore created a difference score to identifyappropriate similar programs for comparison. Given two programs, p and p , attimes t and t respectively, we define a function ∆( p , t , p , t ) that returns ascore measuring how different those programs are at those times in terms of theirdemographic ratios. The idea is that the viewership breakdown of future televisionshows can be predicted by studying the breakdown of similar shows that have airedin the past.First we let D p,t be the demographics information for program p at time t (where t includes information about day of the week and time slot during the day, ex:Monday from 3pm to 4pm). D p,t = ( d , d , ..., d N )where d i is the number of impressions from the i -th demographic (ex: females ages 21to 24) and N is the number of demographic fields in the data (in our case N = 30,with 15 age ranges for both males and females). Now let I p,t = (cid:80) Ni =1 d i , be thetotal number of impressions for program p in the time slot t . Let ˆ D p,t = D p,t || D p,t || = D p,t /I p,t which is D p,t normalized in the (cid:96) norm. ˆ D p,t ∈ R N is a breakdown of theviewership by demographic for program p at time t .The distance score is defined as∆( p , t , p , t ) = || ˆ D p ,t − ˆ D p ,t || , (9)which is just the Euclidean distance between ˆ D p ,t and ˆ D p ,t in R N . Note thatthis means that programs with lower scores are more similar because their viewerdemographics will be closer to each other in terms of the Euclidean distance. anaggio et al. Page 12 of 24
Program ID Day of showing Hour of showing ∆ Random Random Random 0.5128Random Same Same 0.4845Same Random Same 0.3146Same Same Random 0.2842
Table 2
Average distance score for 100,000 randomly selected program pairings. The distance ∆ isdefined in equation (9). Since we do not know where in the space our data sits, we randomly selected pairsof programs to see how far apart they are on average, keeping certain programattributes (such as program ID, day of showing, hour of showing) identical. Ourresults are shown in Table 2. This reveals that the programming content is a betterpredictor of the demographic data than the programming date and time.This difference measure relies on the L norm, however alternative metrics mightalso yield reasonable scores. For example, we could compute the difference between D p,t and D p (cid:48) ,t (cid:48) using a dot product (cid:104) ˆ D p,t , ˆ D p (cid:48) ,t (cid:48) (cid:105) resulting in a score between 0 and 1 (after dividing by a normalization factor).These scores could be used to estimate viewership of new programs by averagingthe impression data from a group of similar programs. These predictions could thenbe enhanced by correcting for the spectral properties of the viewing habits of eachdemographic in the relevant timeslots. Then as additional data is collected, thesepredictions could be adjusted using the Kalman filtering approach. In this section, we implement a method for creating an optimal schedule of adver-tisements, given a set of orders from an advertising agency and predicted viewershipnumbers such as those that could be generated using the methods outlined in Sec-tion 2.In order to formalize the optimization method we define the following notationand assumptions: • C is the total number of channels, and the subscript index c with 1 ≤ c ≤ C is used to denote one particular channel. • N c is the number of commercial slots on channel c , and the subscript index i with 1 ≤ i ≤ N c is used to denote the corresponding slot. This index takesinto account both day and time. We assume the number of slots are specifiedby programmers in advance. • P c,i indicates the price for commercial slot i on channel c . We assume theseprices are set by the programmer in advance. • S ( d ) c,i contains the number of impressions for slot i , demographic group d , onchannel c . We assume these values are provided in advance and that this datais reliable. • A gives the number of advertising orders, and the superscript index ( a ) with1 ≤ a ≤ A denotes one particular order. We assume that all orders for a givenweek are received in advance, that the schedule can be determined one week anaggio et al. Page 13 of 24 at a time, and that all advertisers have equality priority and therefore ordersaccepted or rejected only on the basis of whether the order is likely to besatisfiable. • V ( a ) is a binary vector indicating the target demographics in the order foradvertiser a . • S ( a ) c,i contains the number of impressions for slot i , in the demographics spec-ified by advertiser a , on channel c . In other words, S ( a ) c,i = (cid:80) d ∈ V ( a ) S ( d ) c,i . Weassume that all target demographics are of equal value to the advertiser andtherefore the desired number of impressions can be satisfied by any subset ofthe target audience. • B ( a ) represents the budget of advertising order a , and R ( a ) represents desiredimpressions for order a . We assume that these requirements are strict and canbe implemented as hard inequality constraints for the solution. • X ( a ) c,i is a ‘binary matrix’ indicating whether advertiser a is assigned to slot i on channel c . This is the schedule we are trying to find. We now use this notation to express the scheduling problem as a constrained opti-mization problem.
The most basic constraint on a proposed schedule is that two advertisements cannotair simultaneously on the same channel.
1. No overlap:
Only one advertiser can use a given slot on a given channel. Mathematically, thiscan be stated as (cid:88) a X ( a ) c,i ≤ . This constraint can be modified to allow for variable length commercials by weight-ing each entry in X ( a ) c,i by the commercial length for advertiser a and then changingthe right hand side to include the number of ‘time slots’ in each commercial break.In addition to this, each order ( a ) that is accepted imposes two additional in-equality constraints on the schedule.
2. Budget:
The total cost to each advertiser must not exceed their budget B ( a ) . This impliesthat (cid:88) c,i X ( a ) c,i P c,i ≤ B ( a ) . Note that it may not be possible to satisfy every order, so if the total cost to anadvertiser is greater than 0, then we must also meet the target number of impres-sions.
3. Impression Target: anaggio et al.
Page 14 of 24
The total number of impressions (as given by S ( a ) c,i ) must exceed the campaign goal R ( a ) . In other words, (cid:88) c,i X ( a ) c,i S ( a ) c,i ≥ R ( a ) . Since this linear inequality only yields a feasible region if it is possible to satisfyevery order (which in practice is unlikely), we impose these constraints by solvinga sequence of optimization problems where R ( a ) are replaced with 0 for the orderswe are not able to fill. In Section 3.3, we propose a value function that can be usedto determine which orders should be eliminated.The above constraints are necessary to obtain a usable schedule that satisfies theadvertising campaign goals. However, programmers may impose additional require-ments on allowable schedules to prevent consecutive airings of the same commercial( X ( a ) c,i + X ( a ) c,i +1 ≤ i , c , a ), commercials with adult content from airingduring children’s programming ( X ( a ) c,i = 0 for particular i , c , a ), etc. These andany other requirements can be implemented by imposing additional inequality andequality constraints on X ( a ) c,i . However, for simplicity, we omit these constraints inwhat follows. Presumably, the advertising schedule is set by the programmer or an intermedi-ary who is interested in maximizing advertising revenue. Therefore, the objectivefunction of interest is simply the total revenue which is given by (cid:88) a (cid:88) c,i X ( a ) c,i P c,i . (10) This optimization problem involves finding a vector inputs X (a vectorized versionof X ( a ) c,i ) that satisfies a set of linear constraints and that maximizes the value ofa linear objective function. If the inputs were real numbers, then this could besolved with a linear program. However, X must contain binary inputs so thereforewe solved this using a binary integer program.To implement this program, we write the inequality constraints as a matrix in-equality A X ≤ B and the objective function as dot product f ( X ) = P · X , where P is a vectorized version of P c,i . This allows us to make use of MATLAB’sbuilt-in mixed integer linear programming algorithm from the optimization toolbox.This algorithm consists of the following three steps [6, 7, 8]:1. Solve the linear programming problem without the integer valued constraints. anaggio et al. Page 15 of 24
2. Use a heuristic algorithm to find a nearby feasible integer solution.3. Perform branching to try to improve on the heurtistic feasible solution.Depending on the data and parameters used, this algorithm occasionally finds asolution which sells all of the time slots or it fills all of the order leaving sometime slots unfilled. These outcomes represent the global maximum for the revenue.Other times, it cannot find a feasible solution at all. This suggests that a highquality heuristic is necessary for finding feasible solutions to initialize the integerprogram [9]. We explore one such heuristic in Section 3.2. One explanation for thisinability to find a feasible solution is the fact that it is not always possible to satisfyall of the orders. To overcome this, we iteratively remove orders and instead choosea subset of the orders that includes only the ones that are the most valuable (seeSection 3.3).
In order to generate a feasible solution both to initialize the integer program andto compare to the results from the integer program, we also implemented a greedyalgorithm. In this algorithm, we generate a matrix V , where the rows are indexedby the slot { i, c } and the columns are index by the advertiser a . We assign a valueto each entry of V for each advertiser. Then we choose the slot with the highestvalue and assign that to the advertiser who gets the most value from that slot. Thevalue of slot { i, c } is for advertiser a is equal to V ( a ) c,i = (cid:32) S ( a ) c,i R ( a ) (cid:33) / (cid:18) P c,i B ( a ) (cid:19) (11)which represents the fraction of the desired impressions that can be provided bythe slot divided by the fraction of the budget that must be used for the slot.This process is repeated until there are no longer available slots, the orders haveall been met or no advertisers can afford a slot. At this point, incomplete ordersare removed and the process is repeated with the remaining orders until all ordershave been satisfied or removed. In order to determine which orders should be rejected, a systematic way of priori-tizing orders is needed. Below, we propose a heuristic ranking scheme.
Step 1: Eliminate unreasonable orders
If the number of impressions desired is more than the size of the viewership forthat demographic (in the time allotted), then the order cannot be satisfied. Theseorders should be rejected immediately. In other words, an order must be rejected if (cid:80) c,i S ( a ) c,i < R ( a ) . Step 2: Monte Carlo Method
A Monte Carlo method can be used to estimate the number of feasible solutions foreach advertiser. We then assign a value proportional to that number.1. First generate a random binary vector X ( a ) c,i for fixed a . anaggio et al. Page 16 of 24
2. Check to see if it is feasible.3. Repeat N times.Let F be an N by 1 vector indicating which candidate solutions are feasible. Then,the fraction of solutions that are feasible is given by W ( a ) MC = F · N (12)which we call the value function, and the orders that are more likely to be satisfiableshould be prioritized. This provides a simple method for identifying feasible orders. However, in practicethe decision making process might be more complex. For example, if rather thanhaving fixed time-slot prices advertisers were allowed to bid for a given slot, thenthe advertisers with larger budgets relative to their demands should be prioritizedsince their orders are likely to be both more satisfiable and more lucrative. Also, ifa variable scheduling horizon for orders is allowed (rather than focusing on a sin-gle week at a time), then the urgency of the order should also be taken into account.
Modification 1: Weight by excess budget.
This value function can be modified to account for bidding on time slots by weight-ing the feasible solutions by the price the advertiser would be willing to pay inan auction. An advertiser should be willing to increase the bid by a factor of B ( a ) (cid:80) c,i X ( a ) c,i P c,i to remain under budget and ensure that their advertising campaignorder is accepted. So, given a set of random X ( a ) c,i from the Monte Carlo methodabove, rather than just computing the fraction that are feasible, the feasible sched-ules can be weighted to account for the amount of leftover money in the budget.For a feasible solution j , the percentage of the budget that is unused is just E ( a ) j = B ( a ) − (cid:80) c,i X ( a ) c,i P c,i B ( a ) . Let F be an N by 1 vector indicating which candidate solutions are feasible. Let E ( a ) be the percentages from above. Then the total value of an order would be givenby the average of this excess W ( a ) BID = F · E ( a ) N and orders with larger average value should be prioritized. In order to balance bothfeasibility and value, a balance of the Monte Carlo and bidding based values suchas W ( a ) COMB = (1 − r ) W ( a ) MC + rW ( a ) BID was employed, where r represents the relative weight of the excess budget to thefeasibility. anaggio et al. Page 17 of 24
Modification 2: Weight by urgency.
In order to take into account variable time frames for orders, this value could fac-tor in the fact that certain orders are more urgent than others. Orders that havealready been accepted will usually need to be prioritized over new orders. For neworders, some orders with short time tables will be infeasible, others with short timetables would need to be completed immediately, and orders with long time tablesmay be saved for later, but not postponed so long that they become infeasible. Thismodification was not implemented, but it warrants further exploration.
In order to test this optimization scheme, we used a subset of the viewership dataover a single work week (Monday-Friday) between the hours 5:00am-12:00am forthree channels. We also assumed that impressions by demographic are constant overthe span of an hour (if there are two half hour shows in an hour we average theimpressions per demographic from both shows) and do not take into account anyuncertainty in the estimates of impressions per time slot. The results are displayedbelow.
Figure 10
Optimal schedule schedule for a programmer with one slot per hour on three differentchannels for an entire work week. The resulting schedule satifies the requirements for the top 49orders over this span. The vertical axis corresponds to different advertiser’s orders while thehorizontal axis corresponds to the time slot index. Black corresponds to a slot being filled.
For this data set, an optimal schedule can be found that satisfies the top 49 ordersand fills all available ad time when orders are sorted using the Monte Carlo valuefunction. The optimal schedule is shown in Figure 10. With more orders, however,the algorithm fails because it is unable to find a feasible solution. Thus iterativereductions in the total number of orders are necessary before a satisfiable subsetof orders can be found. In contrast, the greedy algorithm always yields a feasiblesolution (by automatically rejecting unsatisfied orders), but that solution may notbe close to optimal. With 149 orders (the total number of sample orders in ourdata-set), a solution satisfying 112 orders that generates 91.8% of the maximum anaggio et al.
Page 18 of 24
Figure 11
Run-time for computing an optimal schedule with a variable advertising time frame.The integer program was initialized with a variable number of weeks and fixed set of 40 orders(rescaled to match the number of weeks). These data points can be fit by a quadratic y = 0 . x with R = 0 . . This suggests that the computation time is proportional to thesquare of the time frame. allowable revenue is found in 0.57 seconds whereas the integer program finds theoptimal solution satisfying as many as 49 orders in 6.5 seconds.In the our data set, adding more time slots by expanding the time horizon to sev-eral weeks or using more channels with varied viewership allows us to accommodatemore orders, but it can also increase the computation time (see Figure 11). Thusthis method may be more suitable for smaller problems with fixed time horizons.Scaling the algorithm to larger data sets and to include more complex constraints(for example, to take into account the reach and frequency of an advertisement)would require a hybrid approach involving multiple algorithmic frameworks. Whenlong time horizons, large numbers of channels or large numbers of orders must beconsidered, one promising approach would be to segment orders and schedules intosmaller intervals and then apply this integer programming method to each interval.The most efficient method for this segmentation would likely be dependent on thedata considered but there may be structural properties of this type of problem thatcan be exploited. Therefore, scaling this method effectively would require a deeperlook at segmentation strategies that would allow for the large scale problem to bedivided into pieces that could be solved in parallel.These results suggest that binary integer programming provides a flexible frame-work for implementing the essential constraints and searching for optimal solutions.However, the development of new heuristics and improvement of current heuristicscould be beneficial for scaling of the problem to more realistic sizes and warrantsfurther exploration. In addition to the total number of impressions, advertisers may also be interestedin specifying the desired reach and frequency of their advertisements. The reach ofan advertisement is the number of unique individuals who have received at least anaggio et al.
Page 19 of 24 one impression of the advertisement, and the frequency of an advertisement is themean number of times the advertisement is seen by these individuals. From detailedviewership data, we can express the reach of an advertisement by the following exactformula: R = N ( a ) (cid:88) j =1 S (cid:93)i j (13)where S (cid:93)i j is number of new impressions made at time slot i j of the j th airing ofadvertisement, and N ( a ) is the total number of times the advertisement is shown. Toreduce the notational complexity, we are dropping here the indices referring to thechannel and demographic group, essentially assuming we are focusing on a fixedchannel and a specified demographic group. This can be generalized in principleto allow multiple channels and multiple demographic groups, with accompanyingcomplication in notation.The formula for reach is to be contrasted with the formula for total number ofimpressions made by the advertising campaign, I = N ( a ) (cid:88) j =1 S i j where S i j is the number of impressions (including both new and repeat viewers)made at the time slot i j of the j th airing of the advertisement. S (cid:93)i j by contrast onlycounts those viewers on whom an impression was made at time slot i j , but not on aprevious airing of that advertisement. Consequently, to count reach, we must knowmore about the viewership than simply the statistics for the number of impressionslikely to be made in each time slot. Put another way, the number of impressions S i is only a function of the data at time slot i (and thus may be thought of as aone-dimensional marginal distribution of the viewership data), whereas the numberof new impressions S (cid:93)i depends not only on the statistics of time slot i by alsoon statistics of previous time slots (and thus inherently involves joint distributionsof viewership data at different time slots). To make matters more complicated forthe purpose of schedule optimization, the number of impressions S i in a time slotdepends only on the time slot in which the ad is scheduled, whereas the numberof new impressions S (cid:93)i depends not only on the time slot i but also the previouslyscheduled time slots of the advertisement. The average frequency fortunately iseasily determined from the reach by the simple formula: F = I/R.
As noted above, the reach of a previously aired advertisement can easily be cal-culated from historical data. However, predicting the reach of a proposed futureadvertising campaign is a much more delicate matter. Even if the viewership offuture programs could be assumed to be identical to previous weeks, the reachcould be calculated for any proposed schedule, but this would be an expensive and anaggio et al.
Page 20 of 24 unwieldy calculation. Either it would be necessary to do an online processing ofhistorical data, or it would be necessary to refer to an intractably large data struc-ture which has precomputed reach scores for every feasible advertising schedule.Including reach into the optimization scheme for the advertising schedule would, aswe show, introduce an inherent nonlinearity, and nonlinear optimization typicallyrequires additional iterations beyond that required for linear optimization. Thatmeans many schedules will be proposed in the optimization, and thus the expensivereach computation would be invoked many times. We therefore consider a simplifiedway to estimate future reach. Such simplification is justified in particular becausefuture viewership cannot be perfectly predicted, so involving a expensive and pre-cise computation for reach in the schedule optimization algorithm would seem tobe a misplaced effort.We propose encoding the information needed for future reach calculations in atwo-slot function P i,i (cid:48) which, for i < i (cid:48) , represents an estimate of the fraction ofviewers at time slot i (cid:48) who also viewed time slot i . Under our standing simpilfy-ing assumptions, P i,i (cid:48) could be estimated from historical data, possibly using theKalman filtering idea in Section 2.3. If we allow future time slots to be associatedwith programs different from those in the past, we could try to develop an inferencescheme for combining historical data on viewership of time slots with viewershipof programs. But this might still be attempting too fine a resolution. Since thenumber of potential time slots in which a proposed advertisement could air withina typical campaign window is large, such a detailed data-driven approach wouldrequire the storage of P as an immense matrix, which would be at least 10 × for a weeklong campaign even in our very simplified setting, and much larger inpractice. A more tractable approach might be to simply treat P i,i (cid:48) as a function ofonly i − i (cid:48) , meaning essentially the time difference between slots (and possibly alsoa measure of difference between channels for a multichannel campaign). We mightimagine that P i,i (cid:48) begins as a decaying function of | i − i (cid:48) | , but has peaks at multiplesof a day and a week for patterned viewer behavior. Perhaps historical data couldbe fit to a sum of a small number of periodic functions with frequencies identifiedby the spectral analysis in Section 2.1, with decaying amplitudes.We now assume we have in hand some scheme for estimating the two-slot function P i,i (cid:48) , and now wish to estimate the reach of a proposed scheduling of the advertisingcampaign in time slots { i , i , . . . , i N ( a ) } . According to formula (13), we need toestimate the number of new impressions made with each airing of the advertisement,and we propose to approximate this in terms of the estimated number of impressionsand the two-slot function as follows: S (cid:93)i j ≈ S i j (cid:89) j (cid:48) 95, then the prob-ability model underlying the formula (14) would have a substantial positive corre-lation between the viewers of the first and second airing of the advertisement. Thepoint of the conditional independence assumption is that we assume all such cor-relations between the viewership of the various airings of the advertisement can bewell represented by an explicit model of the correlation between the viewer of eachairing j (cid:48) < j and the airing j under consideration, with the correlations betweenthe previous airings being implied (not neglected) by the conditional independenceassumption.The conditional independence assumption can lead to either overestimates orunderestimates of the reach. For example, if the airings occur during successiveepisodes of a program with a substantial committed core base who watches everyepisode, the number of new impressions would be underestimated by formula (14).On the other hand, if the airings of the advertisement involve some episodes re-peated at different times during a week, the viewership of those airings would bemore negatively correlated than the conditional independence assumption, and thenumber of new impressions could be overestimated by formula (14). This can beverified under a simple model in which no viewer makes repeated viewings of thesame episode at different times. Some kind of conditional independence assumptionseems to be necessary to reduce the reach calculation to a complexity comparable tothe 2-slot statistic. Another natural way to invoke conditional independence is viaa Markov chain model, which would only attempt to explicitly model the repetitionin viewership between successive airings of the advertisement. Such a Markovianapproach appears less suitable than the conditional independence we suggest in theprevious paragraph for a couple of reasons. First of all, it is unclear how to deducethe number of new impressions made on the third airing of an advertisement byknowing how many viewers of the first advertisement saw the second advertisement,and how many viewers of the second advertisement saw the third advertisement.How does one infer from this the number of viewers of the third advertisementwho saw neither the first nor the second airing? Moreover, the Markovian approachseems completely incapable of representing the likely strong repeat viewership ofa regularly airing program from one week to the next, if advertisements are alsoaired in between those weekly episodes. So if the first and third airing of the ad-vertisement took place one week apart in successive episodes, and a second airingtook place in between, one would expect a large number of repeat viewers betweenthe first and third airing, but not between the second airing and either the first orthird airing. anaggio et al. Page 22 of 24 The approximate reach estimate developed in Subsection 4.1 can be expressed as apolynomial function of the schedule vector X ( a ) c,i : R ( X ( a ) c,i ) = N c (cid:88) i =1 X ( a ) c,i S i (cid:89) i (cid:48)
In this report, we analyzed the problem of optimally scheduling advertisements us-ing several different methods. First, we analyzed historical data to obtain trends inthe viewership. We found that the viewership was strongly periodic and that devi-ations from the periodic signal (noise) were approximately bell-shaped. We supple-mented these analyses with predictions from several machine learning algorithmsfor viewership, from a Bayesian procedure for predicting new program impressionsfrom the program’s target demographic, and a measure for comparing programsin order to fill in missing or unknown data. Second, we developed an algorithm,based on binary integer programming, to schedule advertisements. Given orders inthe form of a budget, number of impressions desired and demographic targets, thealgorithm produces a binary matrix that tells the media company how to sched-ule advertisements in such a way as to maximize revenue. The algorithm can beinitialized with a schedule generated by a greedy heuristic. Finally, we developeda theoretical framework to quickly estimate the reach (number of new impressionsmade) of an advertisement. This framework approximates the number of new view-ers through historical impressions data and a two-slot function, which gives thefraction of viewers who watched the same advertisement in two time slots.In summary, mathematical analysis can be an extremely useful tool for under-standing how to best schedule advertisements. Techniques from probability, statis-tics, data science, signal analysis and linear/non-linear programming can all be usedto improve and optimize advertising campaigns, give insight into viewership trendsand predict the reach of future television programs. anaggio et al. Page 24 of 24 Competing interests The authors declare that they have no competing interests. Author’s contributions GS Bhatt, S Burhoe, M Capps, CJ Edholm, S-L Estock, P-W Fok, N Gold, M Houser, P Kramer, H-W Lee, L Rossi,D Shutt, & VC Yang contributed primarily to the development of methods for predicting viewership. F El Moustaid,T Emerson, R Halabi, Q Li, W Li, D Lu, Y Qian, MJ Panaggio, & Y Zhou contributed primarily to the formulationand solution of the scheduling optimization problem. MJ Panaggio and P-W Fok compiled the content of themanuscript and all contributed to the revision of the manuscript. Acknowledgements This problem and the data used in this project were provided by Marco Montes de Oca and clypd, Inc. This work waspartially supported by NSF Grant DMS-1261594 through the 2015 Mathematical Problems in Industry Workshop. Author details Mathematics Department, Rose-Hulman Institute of Technology , Terre Haute, IN 47803, USA. Department ofMathematical Sciences, University of Delaware, Newark, DE 19716, USA. Department of Mathematical Sciences,Tennessee State University, Nashville, TN 37209, USA. Department of Mathematics and Statistics, University ofMassachusetts Amherst, Amherst, MA 01003, USA. Department of Mathematics, Colorado State University , FortCollins, CO 80523, USA. Department of Mathematics, University of Nebraska-Lincoln, Lincoln, NE 68588, USA. Department of Mathematics, Temple University, Philadelphia, PA 19122, USA. Department of Mathematics andStatistics, York University, Toronto, ON M3J 1P3, Canada. Department of Mathematics, University of California,Davis, Davis, CA 95616, USA. Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY12180, USA. Department of Mathematics, University of North Carolina at Chapel Hill , Chapel Hill, NC 27599,USA. Department of Mathematics and Computer Science, Fisk University, Nashville, TN 37208, USA. Applied Mathematics and Statistics Department, Colorado School of Mines, Golden, CO 80401, USA. Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL 60208, USA. References 1. O’Barr, W.M.: A brief history of advertising in America. Advertising & Society Review (1) (2010)2. Global media report 2014. Technical report, McKinsey & Company (September 2014). 3. The Nielsen Company - Solutions: Television. Accessed 2015-10-154. Mood, A., Graybill, F., Boes, D.: Introduction to the theory of statistics. (1974)5. Humpherys, J., Redd, P., West, J.: A fresh look at the Kalman filter. SIAM Review (4), 801–823 (2012).doi:10.1137/1007996666. Wolsey, L.A.: Integer Programming vol. 42. John Wiley & Sons, New York, NY (1998)7. Wolsey, L.A., Nemhauser, G.L.: Integer and Combinatorial Optimization. John Wiley & Sons, Hoboken, NJ(2014)8. Danna, E., Rothberg, E., Le Pape, C.: Exploring relaxation induced neighborhoods to improve MIP solutions.Mathematical Programming (1), 71–90 (2005)9. Savelsbergh, M.W.: Preprocessing and probing techniques for mixed integer programming problems. ORSAJournal on Computing6