Clarification of Video Retrieval Query Results by the Automated Insertion of Supporting Shots
Sean Butler

Department of Computer Science and Creative Technologies, UWE Bristol, BS16 1QY, UK
Abstract
Computational video editing systems generally produce output in a particular form, e.g. conversations or music videos; in this way they are domain specific. We describe a recent development in our video annotation and segmentation system to support general computational video editing, in which we derive a single generic editing strategy from general cinema narrative principles instead of using a hierarchical film grammar. We demonstrate how this single principle, coupled with a database of scripts derived from annotated videos, leverages the existing video editing knowledge encoded within the editing of those sequences in a flexible and generic manner. We discuss the cinema theory foundations for this generic editing strategy, review the algorithms used to effect it, and go on, by means of examples, to show its appropriateness in an automated system.
Introduction

Researchers have applied Computational Video Editing (CVE) across a variety of domains, including news programmes [1], sports [2], show trailers [3], and interviews [4]. Systems generally make use of a distinctive, limited set of editing principles to assemble their outputs for that domain. While textbooks, theorists and practitioners [5] [6] discuss complex film grammars applied in the general case, automatically targeting this general case is hard due to an explosion in permutations, and we know of no computational intelligence applicable in the general case.

Grammatical film editing techniques such as the master shot, which can result in a movie whose structure is visible to the audience, have fallen out of fashion. In narrative-driven movies it is generally unacceptable for the audience to see the structure of the editing, except when it isn't; see [7] [8], which both draw attention to their formal structure as part of the storytelling. Our aim has been to find generic automatic editing strategies which are both applicable across multiple domains and which are not apparent to the audience each time they are used, with the objective of developing systems which use a naturalistic editing style.

Advances in computational intelligence and natural language processing [9] mean systems can now translate natural language into machine-readable formats, such as the formal and logical Conceptual Graphs [10], and systems can examine images and create accurate descriptions of their content [11].

We have created a video retrieval and editing system which uses human-readable queries to assemble video sequences from a database of footage. The database of video is processed, and annotations are combined and segmented so that they describe the visual content of the frame over intervals. A single global editing strategy derived from narrative and film theory, based on cause and effect using the preconditions and postconditions of a situation, is used to recombine video in response to external queries.

The primary benefits of this approach are:

• We can extract and apply the existing editing knowledge built into reused/re-edited video sequences.

• The video produced follows readable editing patterns without visible structure.

In addition, the system is easily interfaced with other computational intelligence technologies, and the sequence processing, being derived from set and graph theory, contributes to a formal understanding of film theory and a definition of the unit of film grammar.
The Kuleshov Effect

In the 1920s Kuleshov showed that the effect of perceiving a continuous reality when shown a rapid sequence of frames is so effective, and so closely mimics conscious experience, that it also works across sequences of images that do not have similar contours. This has since become known as "The Kuleshov Effect" [12]. In Kuleshov's experiment, audiences perceive hunger, grief and attraction in the same shot of an actor's face when it is juxtaposed with different successive shots. As can be seen in Figure 1 [13], the effect is revealed to be an illusion, but when presented with the sequences in isolation audiences perceive the emotion.

Figure 1: Stills from a demonstration of the Kuleshov Effect, where the perceived emotion changes depending on the context.

The Kuleshov Effect may be explained by saying the audience uses their imagination and experience to create a fiction which accounts for the images before them. The audience reads the film assuming the shots are related and discovers or invents [14] a reason for that relation. Cognitive scientists [15] have studied the phenomenon. Explanations of the effect have ranged widely over, for example, theories of point of view, consciousness and eye movements. Film students are commonly advised to consider the camera as an additional person in the scene observing the other protagonists, and to edit according to an observer's behaviour.
Context and Meaning

Due to the Kuleshov Effect we already know that should the context of a shot change, its meaning will also likely change. All cases of retrieval, except the first solitary case of a shot retrieved into isolation, change the context; therefore retrieval also changes the meaning. After retrieval, any editing operations which change the context will also alter the meaning.

The permutations of meaning change are many and would present a significant generation and indexing challenge. It follows that a system designed to perform automatic re-editing of video footage will run into significant problems should it attempt to maintain data about the meaning of shots or sequences. We must therefore focus on managing the content of the shots rather than the meaning of those shots.
Shot Boundary Detection

Since [16], many authors [17] have developed algorithms to determine the structure of video. These approaches generally use a mathematical method to summarise the difference between neighbouring frames or those in short succession, and in doing so they detect the linear structure of a shot sequence. We envisage strategies such as those we are developing being integrated with such approaches to create hybrid systems. A minimal sketch of this family of techniques is shown below.
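As an illustration of the kind of frame-difference summarisation these systems use, the following Python sketch flags a shot boundary when the colour histograms of successive frames diverge sharply. It is a sketch only, in the spirit of [16] [17] rather than a reproduction of any cited method; the use of OpenCV, the function name, the file name and the threshold value are all our illustrative assumptions.

```python
# A minimal sketch of histogram-based shot boundary detection.
# The threshold and histogram binning are illustrative choices.
import cv2

def detect_shot_boundaries(path, threshold=0.5):
    """Return frame indices where the colour histogram changes sharply."""
    cap = cv2.VideoCapture(path)
    boundaries = []
    prev_hist = None
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Summarise the frame as a coarse HSV colour histogram.
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Low correlation between successive histograms suggests a cut.
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:
                boundaries.append(index)
        prev_hist = hist
        index += 1
    cap.release()
    return boundaries

# Hypothetical usage:
# print(detect_shot_boundaries("footage.mp4"))
```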
Related Work

Operating on video containing complex engineering activities, Parkes [18] showed the effectiveness of Conceptual Graphs (CGs) [10] as a notation to describe the content of an instructional video. CGs support description, and also processing for transitions suitable for the re-editing of instructional videos for different learners.

"Conceptual graphs (CGs) are a system of logic based on the existential graphs of Charles Sanders Peirce and the semantic networks of artificial intelligence. They express meaning in a form that is logically precise, humanly readable, and computationally tractable." [19]

The Stratification System [20] (an early attempt to maintain and visualise layered descriptions of video content) stores keywords over time and is focussed on the interface, so that a user can manipulate the descriptions easily for video annotation. Butler & Parkes [21] presented a video annotation and editing system which also used keywords over time and provided a visual interface; in this case the user could join shots together to create a space-time diagram and see the structure of the edits.

Sack and Davis [3] describe the annotation and repurposing of Star Trek episodes by means of story plans and content annotations to create simplified or short sequences following a particular dramatic structure. Davis [22] considers a semantic approach that handles the issue of context and meaning by arguing for a notation that can encompass the meaning in isolation and the meaning in context.

Butler and Parkes [23] described a system which also uses CGs to describe video content, coupled with layers of intervals [24]. The main development was the use of the generalisation of Conceptual Graphs by means of an ontology, along with a range of idiomatic editing rules such as the 180 degree rule, parallel actions and subjective shots.

Nack [25] described theme-oriented video editing using planning form and theme strategies to achieve comedic effect. Focussing on argued-opinion documentaries, Bocconi and others [26] go on to show how higher-level annotations and recombination schemas can be used to generate specific forms of video production.

Various algorithms were investigated [27] for the creation of multi-level video summaries in the home movies domain, using a hybrid of technologies including mathematical shot boundary detection, external annotation and defined structures of summaries.

More recent computational approaches use a variety of technologies, including machine learning and computer vision; for example [4], with a focus on cross-cutting between complete captures of scenes from multiple angles for multi-party conversation. Merabti [28] created a hybrid system making use of Hidden Markov Models to learn editing from source videos which have been annotated by a person.

In summary, many researchers working at the current state of the art in computational video editing make use of a hybrid approach, that is, machine vision and machine learning along with symbolic annotations. In some cases the systems generate annotations and in others they use external annotations, with the system operating on the annotations. Systems have evolved from the purely symbolic logic programming approaches of the 90s, through the application of computer vision technologies, to the modern hybrid approaches of machine learning and symbolic annotation combined.
Narrative Principles

A short review of some major narrative principles follows. By focussing on editing and narrative principles which are applicable in any/all contexts, and which are unrelated to specific events or portrayals, we have formulated a generic editing principle.

Deus Ex Machina, or "god from the machine", refers to the unsatisfying resolution of a story which otherwise would have been impossible for the protagonists to resolve that way. Historically the gods might have intervened in the story and entered from above the stage by means of a mechanism [29]. In modern stories it refers to sudden or late powerful interference, impossible deductions, coincidence, dreams or other similar unsatisfying resolutions. For automatic video this means we cannot show an event which relies on an event, object or other entity within the narrative that has not appeared prior to this point.

Figure 2: Stills from the major shots of the "Shaun of the Dead" survival planning sequence.

Figure 3: Stills from each shot of the "Shaun of the Dead" car subplan.

One of the expressions of "the rule of three" in Hollywood movies is that you show things three times: first show the audience that you are going to do it, then show it happening, then show them that you did it. Historically this may have been taken quite literally; more recently we do not show all three phases of an event. In practice the speed of the shot and the complexity of the event have an impact on whether we show one, two or three of the parts.

Consider the sequence in "Shaun of the Dead" where the protagonist and his buddy are discussing how they are going to survive the zombie apocalypse. Several refinements of their plan are discussed, and the sequences are shortened each time. The first iteration of the plan is shown very quickly yet remains understandable, because each event is shown with multiple shots, those shots being the preconditions and/or postconditions of an event in the sequence. Examining the car sequence within the first iteration of the plan, we can see how taking the car is shown in great detail, with several successive shots of the preconditions.
Carroll's Film Grammar

J. M. Carroll [30] discusses film psychology and grammar in great detail. In his psychological and grammatical treatment of film, Carroll separates the visualisation of actions in film into preparatory actions and focal actions, and also focal actions and event terminations. He then builds these as separate non-terminals into his partial film grammar:

⟨sequence⟩ ::= ⟨action⟩ | ⟨action⟩ ⟨action⟩
⟨action⟩ ::= ⟨preparatory action⟩ ⟨focal action⟩
⟨action⟩ ::= ⟨focal action⟩ ⟨event termination⟩

In his transformational grammar, rules may be used to change a sequence of actions into a sequence of shots depending on the interrelationships between the shots and the desires of the editor/director. Throughout the work Carroll presents examples of additional rules for specific scenarios, such as master scene, parallel actions and subjective shot. Being a partial system, and due to the recursive rules, what is not clear is how the rules can be motivated to achieve a specific editing goal.
A Generic Editing Strategy

Each of the above strategies is a storytelling technique, and can therefore be used as a general principle to be applied repeatedly across an entire video.

• We should omit events that are unnecessary for the audience to understand the story.

• Audiences fill in the gaps in missing, truncated or summarised events.

• If we can't show the event itself, then we can show the preconditions and/or the postconditions, and the audience will conclude the event has occurred.

• If we wish to show an event more clearly, then we can show the preconditions and/or the postconditions bracketing the event itself.

Assuming that video editing is an interleaving of events to create a linear sequence, to show an event in film we have a variety of choices:
Preconditions, then the Situation, then the Postconditions:
This is the clearest form and is the least challenging for an audience to understand. It takes longer, but this approach is appropriate when the event is complicated, fast, or not obvious to an audience.
Preconditions, then the Situation:
This works for most circumstances and is appropriate where the results of the event are well understood because the audience is familiar with the domain.
Situation, then the Postconditions:
This works for most circumstances and is appropriate where the result of the event is important for the later content.

Preconditions, then the Postconditions:
Show a plane taking off, then someone walking into an office. Show someone posting a letter, then show another person reading a letter; we do not need to show it being written, being received, or travelling between the places. Each of these approaches has similarities and differences in its interpretation by an audience, which is a topic for future work, but all serve to communicate more clearly.
Production Rules

The effect of the principles discussed can be more succinctly described by the following production rules. Each terminal and non-terminal symbol can ultimately be replaced by a situation from the database.

⟨action⟩ ::= ⟨precondition⟩ ⟨action⟩ ⟨postcondition⟩
⟨action⟩ ::= ⟨precondition⟩ ⟨action⟩
⟨action⟩ ::= ⟨action⟩ ⟨postcondition⟩
⟨action⟩ ::= ⟨precondition⟩ ⟨postcondition⟩

These are not shots. The footage showing the desired action becomes shots only after it has been extracted from a longer sequence (of situations) and combined into the resulting edited sequence. A minimal sketch of how these rules might be applied appears below.
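To make the rule set concrete, here is a minimal Python sketch that expands an action into a linear sequence of situations using the four productions. The script format, the field names, and the selection policy (prefer the fullest expansion the database can satisfy) are our illustrative assumptions, not a definitive account of the implementation.

```python
# A sketch of the four production rules. A "script" records, for an
# action, which situations act as its preconditions and postconditions;
# the field names and data shapes here are illustrative only.

def expand_action(action, scripts, database):
    """Expand an action into a linear sequence of situations.

    Tries the four productions in order of completeness:
      pre + action + post, pre + action, action + post, pre + post.
    """
    script = scripts.get(action, {})
    pre = [s for s in script.get("preconditions", []) if s in database]
    post = [s for s in script.get("postconditions", []) if s in database]
    have_action = action in database

    if have_action and pre and post:
        return [pre[0], action, post[0]]   # <pre> <action> <post>
    if have_action and pre:
        return [pre[0], action]            # <pre> <action>
    if have_action and post:
        return [action, post[0]]           # <action> <post>
    if pre and post:
        return [pre[0], post[0]]           # <pre> <post>: event implied
    return [action] if have_action else []

# Hypothetical example loosely based on the paper's "visit" query:
scripts = {"visit": {"preconditions": ["walk up path"],
                     "postconditions": ["greet at door"]}}
database = {"visit", "walk up path"}
print(expand_action("visit", scripts, database))
# -> ['walk up path', 'visit']
```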
System Overview

The system overview (Figure 4) shows each of the major data stores and processes in the system. The 'Processor' is where sequences are generated from the interval tree scripts in the database when presented with a query. The final assembly is carried out in a separate module from a list of cuts, similar in principle to a non-linear editor.

Figure 4: Overview of system dataflows.

The system maintains a set of CGs. Each CG represents a description of the visible content of some contiguous frames of video and is accompanied by filename and time data which represent the interval for which it is said to hold. Multiple overlapping intervals with their associated CGs are inserted into an interval tree, which affords time-efficient query operations.

During an insert operation, if no overlap exists, then the new interval and CG annotation are inserted. If an overlap is found, then new intervals are generated from the intersection of those overlapping, such that the overlap is removed. Corresponding to each new interval, a union of the associated CGs is created to form a new concept describing the content of the image during that interval. A minimal sketch of this insert operation appears below.

A script consists of an ordered sequence of CGs and is used to indicate that some events are preconditions and some are postconditions of others. The exact timings are not important, rather their order is; in this way the scripts resemble a traditional AI plan or script [31].

As it is impossible for a system to know all the possibilities for meaning change as context changes, we do not implement rules to choose intervals based on their conceptual description and its relation to the event, though other systems have been implemented which operate in this way [25]. Where possible the system chooses from the intervals in the database which match those in the scripts; where more than one possibility exists, we choose the one with the closest relationship to the event. Where the retrieval action results in a choice of more than one precondition or more than one postcondition for a given event, we built the system to choose the interval which is closest in sequence to the event. Similar scripts can be generated for any linear sequence of situations: watering the plants, making coffee, visiting friends, assembling an assault rifle, operating a micrometer, etc.
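The following Python sketch illustrates the overlap-splitting insert described above. For illustration a CG annotation is represented as a frozen set of concept labels, so the "union" of two annotations is a set union, and a plain list stands in for the interval tree; the real system operates on full Conceptual Graphs, and all names here are our own.

```python
def insert_interval(intervals, new):
    """Insert (start, end, annotation) into a list of annotated intervals.

    Splits at every boundary point and unions the annotations that hold
    over each resulting piece, so the output contains no overlaps. Each
    interval is a (start, end, frozenset_of_concepts) tuple.
    """
    all_ivs = intervals + [new]
    points = sorted({p for s, e, _ in all_ivs for p in (s, e)})
    result = []
    for s, e in zip(points, points[1:]):
        # Union of every annotation covering this elementary segment.
        ann = frozenset().union(*(a for ls, le, a in all_ivs
                                  if ls <= s and e <= le))
        if ann:
            result.append((s, e, ann))
    return result

# Hypothetical example: a "drink" annotation overlapping a "person" one.
ivs = [(0, 100, frozenset({"person"}))]
ivs = insert_interval(ivs, (40, 120, frozenset({"drink"})))
# -> [(0, 40, {person}), (40, 100, {person, drink}), (100, 120, {drink})]
```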
Examples

Looking at specific examples, we can show how the precondition/postcondition event rules can be used to generate sequences which are clearer and more understandable. Each concept in the query is tested against the conceptual database which describes the video content available, and if the database contains the concepts requested, they are retrieved. In the examples here we have forced the system to retrieve results with and without applying the editing principles, for comparison.

Consider the following example query, in which we ask the system to retrieve a short video of a person visiting.

[visit]<-(AGNT)<-[person]
Figure 5: Still from the shot resulting from the "Visit" query.

Figure 6: Stills from each shot resulting from the "Visit" query, where we see the insertion of a preconditions shot.

You can see the results in Figure 5: the system finds the visit script and retrieves a shot. In Figure 6 it looks for a precondition and inserts that into the generated edit list.

[visit]<-(AGNT)<-[person]
[drink]<-(AGNT)<-[person]
For this query the system finds the visit script and the cup of coffee script, and again looks for a precondition and inserts it into the generated edit list.

You can see the results of the following query in Figure 7.

Figure 7: Stills from each shot of the results from the "Allotment and Drink" query.

Figure 8: Stills from each shot of the results from the "Allotment and Drink" query, where we see the insertion of a preconditions shot for the allotment watering and a preconditions shot for the drink.

[irrigate]<-(AGNT)<-[person]<-(OBJ)<-[plants]
[drink]<-(AGNT)<-[person]
Figure 8 shows the same query with the pre/postconditions editing rule applied. You can see the results of the following query in Figure 9.

[irrigate]<-(AGNT)<-[person]<-(OBJ)<-[plants]
[canning]<-(AGNT)<-[person]<-(OBJ)<-[fruit]
Figure 10 shows the same query with the pre/postconditions editing rule applied.

Figure 9: Stills from each shot of the results from the "Allotment and Canning" query.

Figure 10: Stills from each shot of the results from the "Allotment and Canning" query, with the preconditions and postconditions rules applied.
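As an illustration of how a linear-form query like those above might be matched against the annotated intervals, here is a minimal Python sketch. The parsing of the bracketed notation and the matching policy are simplified assumptions of ours; the real system operates on full Conceptual Graphs with an ontology, not bare concept labels.

```python
import re

# Sketch: parse a linear-form query such as "[visit]<-(AGNT)<-[person]"
# into its concept labels, then find annotated intervals containing all
# of them. The interval format matches the insert_interval sketch above.

def query_concepts(query):
    """Extract the concept labels from a linear-form CG query."""
    return set(re.findall(r"\[(\w+)\]", query))

def retrieve(intervals, query):
    """Return intervals whose annotation contains every queried concept."""
    wanted = query_concepts(query)
    return [(s, e, ann) for s, e, ann in intervals if wanted <= ann]

# Hypothetical database of two annotated intervals:
intervals = [(0, 40, frozenset({"person"})),
             (40, 100, frozenset({"person", "visit"}))]
print(retrieve(intervals, "[visit]<-(AGNT)<-[person]"))
# -> [(40, 100, frozenset({'person', 'visit'}))]
```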
Discussion

The system assembles video sequences which match the concepts requested. The sequences assembled by the application of this single strategy are clearly more easily comprehensible than those without.

Scripts of events from normal/common causal chains often include events which, though they are preconditions for this particular instance of an event, are not preconditions for the generic form. Additional annotations would fix this in the short term; we aim to develop a system which does not need additional data beyond the image contents. The ordering of events is determined by the scripts, which are derived from the content of human-made video sequences. It is in this way we capture some of the 'grammar' of film without having to construct specific rules. Applying the editing strategies outlined earlier, the system chooses a context of preconditions or postconditions for the events supplied. The combination of these additional shots with the desired event renders the overall sequence understandable.

Rather than applying a hierarchical rule-based system, automated video editing and retrieval should focus on generic methods for the assembly of a context of shots. The location of potential cuts in the database for the system to use is derived from the processing of overlapping intervals, each associated with a concept which can be said to be true and to hold over all the images in the interval. Each time the system chooses a potential cut it does so to satisfy a goal at the sequence level; the places it can choose from are not limited to existing cuts in the sequence. The videos created are clear and understandable (though the rhythm/pacing is inconsistent).

The unit of film grammar should not be considered as a shot bounded by cuts, but instead the 'interval', where an interval is a sequence of frames with continuous contours and constant visible content. There is also an argument for the unit being the sign-interval within the frame, though further work is needed; if this is the case, then the interval becomes a non-terminal within a more detailed system.
Future Work
Our immediate research goals lie in two areas: machine learning, and further analysis of overlapping intervals.

To apply the editing rules to a much larger body of video and scripts, we plan to extend our system into a larger toolchain, interfacing the editing system with a source of automatically generated annotations of video image content, along with a feedback user interface so an operator can indicate the quality of the results generated.

We also plan to automatically derive the scripts from existing sequences of video, and to automatically derive a hierarchy from the multiple overlapping intervals of concepts describing the objective content of the video. This hierarchy would be used to more flexibly support the application of the singular editing rule in different scenarios.

This study has focussed on investigating computational approaches to generic video editing using concept boundaries as potential cut opportunities. Cutting at concept boundaries is, however, only one approach; another is to cut in the middle of an action. To this end we plan to investigate the "invisible cut" technique, where shot boundaries occur mid-action as an event passes from one internal phase to another.
References

[1] Chien Yong Low, Qi Tian, and Hongjiang Zhang. An automatic news video parsing, indexing and browsing system. In Proceedings of the Fourth ACM International Conference on Multimedia, MULTIMEDIA '96, pages 425–426, New York, NY, USA, 1996. ACM.

[2] Manolis Delakis, Guillaume Gravier, and Patrick Gros. Audiovisual integration with segment models for tennis video parsing. Computer Vision and Image Understanding, 111(2):142–154, 2008.

[3] W. Sack and M. Davis. IDIC: assembling video sequences from story plans and content annotations. In Proceedings of the International Conference on Multimedia Computing and Systems, pages 30–36, May 1994.

[4] Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. Computational video editing for dialogue-driven scenes. ACM Trans. Graph., 36(4):130:1–130:14, July 2017.

[5] D. Arijon. Grammar of the Film Language. Silman-James Press, 1991.

[6] S.D. Katz. Film Directing Shot by Shot: Visualizing from Concept to Screen. Michael Wiese Productions Series. Michael Wiese Productions in conjunction with Focal Press, 1991.

[7] W. Anderson. Moonrise Kingdom. Focus Features, 2012.

[8] P. Greenaway. The Falls. British Film Institute, 1980.

[9] E. Cambria and B. White. Jumping NLP curves: A review of natural language processing research [review article]. IEEE Computational Intelligence Magazine, 9(2):48–57, May 2014.

[10] John F. Sowa. Conceptual structures: Information processing in mind and machine. Artif. Intell., 33:259–266, 1984.

[11] Timor Kadir and Michael Brady. Saliency, scale and image description. International Journal of Computer Vision, 45(2):83–105, Nov 2001.

[12] Dean Mobbs, Nikolaus Weiskopf, Hakwan C. Lau, Eric Featherstone, Raymond J. Dolan, and Chris D. Frith. The Kuleshov effect: the influence of contextual framing on emotional attributions. Social Cognitive and Affective Neuroscience, 1(2):95–106, 2006.

[13] esteticaCC. The Kuleshov effect, 2009.

[14] B. Kawin. How Movies Work. University of California Press, 1992.

[15] Joseph Magliano and Jeffrey M. Zacks. The impact of continuity editing in narrative film on event segmentation. Cognitive Science, 35(8):1489–1517, 2011.

[16] HongJiang Zhang, Atreyi Kankanhalli, and Stephen W. Smoliar. Automatic partitioning of full-motion video. Multimedia Systems, 1:10–28, 1993.

[17] Rainer Lienhart. Comparison of automatic shot boundary detection algorithms. In Storage and Retrieval for Image and Video Databases, 1999.

[18] Alan P. Parkes. The prototype CLORIS system: Describing, retrieving and discussing videodisc stills and sequences. Information Processing and Management, 25(2):171–186, 1989.

[19] John F. Sowa. Conceptual graphs, 2005. Accessed: 2018-05-29.

[20] Thomas G. Aguierre Smith and Glorianna Davenport. The stratification system: a design environment for random access video. In P. Venkat Rangan, editor, Network and Operating System Support for Digital Audio and Video, pages 250–261, Berlin, Heidelberg, 1993. Springer Berlin Heidelberg.

[21] S. Butler and A. P. Parkes. Filmic space-time diagrams for video structure representation. Sig. Proc.: Image Comm., 8:269–280, 1996.

[22] Marc Davis. Knowledge representation for video. In AAAI-94, 1994.

[23] S. Butler and A. P. Parkes. Film sequence generation strategies for automatic intelligent video editing. Applied Artificial Intelligence, 11(4):367–388, 1997.

[24] James F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26:832–843, 1983.

[25] Frank Nack and Alan Parkes. Toward the automated editing of theme-oriented video sequences. Applied Artificial Intelligence, 11(4):331–366, 1997.

[26] S. Bocconi, F. Nack, and L. Hardman. Using rhetorical annotations for generating video documentaries. In IEEE International Conference on Multimedia and Expo (ICME), 4 pp., July 2005.

[27] Frank Shipman, Andreas Girgensohn, and Lynn Wilcox. Generation of interactive multi-level video summaries. In Proceedings of the Eleventh ACM International Conference on Multimedia, MULTIMEDIA '03, pages 392–401, New York, NY, USA, 2003. ACM.

[28] B. Merabti, M. Christie, and K. Bouatouch. A virtual director using hidden Markov models. Computer Graphics Forum, 35(8):51–67, 2016.

[29] M. Heath. Aristotle, Poetics (trans.). Penguin, 1996.

[30] John M. Carroll. Toward a Structural Psychology of Cinema. 1980.

[31] R.C. Schank and R.P. Abelson. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Lawrence Erlbaum Associates, 1977.