LookOut! Interactive Camera Gimbal Controller for Filming Long Takes
MOHAMED SAYED, University College London
ROBERT CINCA, University College London
ENRICO COSTANZA, University College London
GABRIEL J. BROSTOW, University College London and Niantic
http://visual.cs.ucl.ac.uk/pubs/lookOut/
Fig. 1. a) Script Creation: The LookOut GUI enables a user to pre-script where the camera should point. It allows the user to define multiple alternate scripts, as seen here, and to switch between them on-the-fly. A script consists of one or more behaviors chained together. A behavior can be as simple as a pan, or as complex as positioning multiple subjects in different parts of the frame. Each camera behavior is triggered by a cue, such as the appearance of a specific actor, someone issuing a voice command, or the actor reaching a specific zone within the frame. b) Guided Setup: For field-use, LookOut resembles a very simplified dialog system, guiding the user through system checks and scene-specific initialization. The user-worn LookOut rig consists of a light backpack computer, a hand-held motorized gimbal, dual cameras (normal and wide-view), earphones, a lapel microphone, and a joystick for initial setup. When first turned on, LookOut prompts the user, through text-to-speech, to set audio levels, choose a script, and assign actor identities as needed. c-f) Interactive Filming: Four frames from a LookOut-captured video, but with false-coloring to visualize which actor(s) the scripted behaviors were attending to at the time. At the user's instruction, LookOut frames (c) both dancers, then (d) orients the gimbal to center on the male, then the female (e), and back to the male (f). The user receives audio feedback when switching between camera behaviors. Without a field monitor, the user can watch where they're going, while trusting our controller to handle their dynamic requests.
The job of a camera operator is more challenging, and potentially dangerous, when filming long moving camera shots. Broadly, the operator must keep the actors in-frame while safely navigating around obstacles, and while fulfilling an artistic vision. We propose a unified hardware and software system that distributes some of the camera operator's burden, freeing them up to focus on safety and aesthetics during a take. Our real-time system provides a solo operator with end-to-end control, so they can balance on-set responsiveness to action vs. planned storyboards and framing, while looking where they're going. By default, we film without a field monitor. Our LookOut system is built around a lightweight commodity camera gimbal mechanism, with heavy modifications to the controller, which would normally just provide active stabilization. Our control algorithm reacts to speech commands, video, and a pre-made script. Specifically, our automatic monitoring of the live video feed saves the operator from distractions. In pre-production, an artist uses our GUI to design a sequence of high-level camera "behaviors." Those can be specific, based on a storyboard, or looser objectives, such as "frame both actors." Then during filming, a machine-readable script, exported from the GUI, ties together with the sensor readings to drive the gimbal. To validate our algorithm, we compared tracking strategies, interfaces, and hardware protocols, and collected impressions from a) film-makers who used all aspects of our system, and b) film-makers who watched footage filmed using LookOut.
CCS Concepts: • Computer systems organization → Embedded systems; Robotics.
Additional Key Words and Phrases: cinematography, videography, video editing, camera gimbal
Filming for journalism and movies is a creative and often collaborative process, where the budget dictates if the roles of director, director of photography (DP), and camera operators are fulfilled by a team, or rest on just one person's shoulders. Ultimately, the person holding the camera has the responsibility of delivering both the content and style that was agreed in advance, while safely adapting to dynamic changes on set.
After budget, time is the next biggest constraint. We consider two types of filming scenarios: one type where a journalist or documentary-maker must catch a one-off unrepeatable event, and the other type where actors and crew follow a storyboard with blocking, repeating the performance until the director is happy. Our system, called "LookOut," is designed to help with both types of filming, if the aim is to capture a long take with a moving camera. Long takes stand out as novel and complex to choreograph in big-budget films, though they are common for journalism, documentaries, and run & gun videos, i.e. for the majority of working video/cinematographers. Moving the camera helps keep long takes interesting for the viewer [4, 26, 36]. However, moving cameras and moving people stretch the attention of camera operators, who are trying to simultaneously walk about and adequately frame their stars. Usain Bolt was famously run over by a cameraman who suffered from task overload while steering a Segway at the World Athletics Championships in 2015.
Speaking informally with independent film-makers, we found there was some interest in drone cinematography systems like [16, 39, 53], but a strong desire for three things: 1) to have interactive control while filming, 2) for a system that tracks indoors and outdoors without special costumes, and 3) ideally, to work with lightweight hand-held hardware, because drones are prohibited in many populated areas, and most countries require a pilot's license. This seeded our research process, which, with feedback and validation from filmmakers, has led to our proposed LookOut system (see Fig 1).
The overall LookOut system serves as an interactive digital assistant for filming long takes with a camera gimbal. LookOut consists of software and 3D printed hardware that augments an existing lightweight motorized camera gimbal ($130), with a video feed and rudimentary two-way speech-interface connected to a backpack computer. Without innovations, some of the individual components existed in principle, but would not integrate into a usable or responsive video-making algorithm. Therefore, our two main technical contributions are:
• A visual tracking system that detects and tracks actors in realtime. It re-frames them dynamically, based on requested "behaviors."
• A combined controller that dynamically balances script-induced constraints like smoothness and intentional framing, while still being responsive to tracker outputs that have inherent noise and drop-outs.
The camera operator often wears many other hats, but from their perspective, during the critical moments of filming, the LookOut system responds to voice commands and follows alternative or sequential pre-specified behaviors. It rotates and stabilizes the camera within its joint limits, to follow the actors and to compensate for the operator's trajectory through the scene. For our experiments, operators didn't see a monitor while filming, so were free to look around and keep one hand spare as they walked, climbed, or cycled through different environments. See the films [37] and
Birdman [22], both filmed to look like one take, versus Michael Bay's average shot length of 3 seconds [40].
The graphics community has a long history of exploring camera placement [7] and control systems, striving to be automatic and cinematic. For "offline" scenery special effects, motion control camera systems have been used since the work of computer graphics pioneer John Whitney in the 1950's [54]. While programmable camera trajectories can help with stop motion animation, and with layered compositing of scenery and special effects, they require hiring of specialized crew, are usually constrained to a short track, and the systems ignore actors and other dynamic events. We therefore focus this review on the context of our system, namely the following and framing of actors in video. This includes stabilizing gimbals, visual active tracking, and the efforts in drone cinematography.
Camera gimbals play a crucial role in isolating rotational movement of a camera. They are essential for smooth video capture, especially when the whole assembly is held by a walking camera operator. The Steadicam [5] was invented by Garrett Brown in 1975 and allows for a camera operator to physically move the camera and simultaneously capture smooth footage. It has been famously used in many Hollywood film productions, including
Rocky (1976) [2],
Goodfellas (1990) [44], and
Indiana Jones and the Temple of Doom (1984) [47]. Steadicams provide an extra layer of isolation from the camera operator compared to gimbals, in that they also dampen camera translation. Some are motorized to provide active stabilization and manual motorized control over the direction of the camera. Although the camera operator no longer has to worry about keeping the camera steady, the operator must still point the camera while moving, either electronically through a joystick or manually by rotating the camera assembly. BaseCam Electronics [13] develop different hardware and software components for the construction of stabilizing gimbals. Their firmware offers control and flexibility over every stabilization parameter. We build on top of their BaseCam Handy gimbal, which offers 3-axis control over camera orientation. Communication to the gimbal is achieved through a serial API that allows for online control and settings changes on the fly.
Many early active tracking systems focus on surveillance applications. Daniilidis et al. [9]'s pan-tilt camera control system orients a camera to focus on motion in a static scene. Dinh et al. [10] and Funahashi et al. [15] propose multi-camera or multi-focal length camera systems for identifying pedestrians through facial recognition. These systems are among the many that actively controlled pan, tilt, and zoom.
Closest to our own hardware is the DJI Osmo Mobile [11]. It is a commercial real-time handheld active tracker. It uses a motorized gimbal and inertial measurement units (IMUs) to control a smartphone camera's orientation. The gimbal enables the user to create stabilized camera footage and select a single object to actively track. A smartphone is used as the camera and processing unit. The tracking algorithm is not made public. Unlike our system, users have no control over framing and complex scripting, and no ability to track multiple targets.
Most trackers in the literature are designed with different requirements in mind. Generally, the ability of a tracker can be measured based on some high level performance criteria. Among them are speed, accuracy including robustness to ID switching or drift, number of trackable objects (usually one vs. many), robustness to appearance changes, and the ability to be run online.
We focus almost entirely on trackers that can approach real-time speeds. The MOT challenge [38] provides performance metrics on trackers for multiple people in crowded scenes. The average shot length in MOT is ~31 seconds, with most targets exhibiting shorter life spans. While MOT includes metrics that measure ID swaps, the metrics still reward trackers for continued tracking of a person with a new ID after an ID swap. A target swap during filming would very likely ruin a take and cause delays. Our application requires robust tracking of a handful of targets for long durations (>20 minutes). Robustness to ID switches and target re-acquiring after occlusion, especially in busy and cluttered environments, are crucial to our use case.
The VOT challenges [29–31] cater to single target tracking of any class. The VOT Short-Term Challenge allows tracks to be reset, with a penalty and a timeout of five frames, to make use of the entire dataset. Trackers in the main VOT challenge are not required to deal with longer term occlusion and confidence reporting. In our use case, actors often appear and disappear as filming progresses. While the VOT Long-Term Challenge evaluates trackers with metrics that put a greater emphasis on longer term tracking (the average video is 2m04s long and contains 10 occlusions lasting 52 frames [31]), it does not run trackers in a multiple object regime.
A family of single object trackers are built on top of Siamese network architectures [3, 48, 50, 58]. Most notably, SiamMask [50] achieves state-of-the-art performance on the VOT2018 challenge for object tracking while operating at around 50Hz for bounding box prediction. DaSiamRPN [58] includes a "distractor aware" module and a training sampling strategy in an effort to prevent distractors from causing track loss errors. DaSiamRPN performs well at 110Hz and achieves first place on the VOT2018 real-time challenge, and second place on the long-term tracking challenge. We experiment with both trackers and show how they are both prone to imposters of the same object type in long takes and cluttered environments.
Among the lightweight high scoring MOT multi-person trackers, DeepSORT [51] and MOTDT [35] stand out. Both incorporate a tracking-by-detection paradigm and use a combination of IOU and appearance costs via ReID networks for assignment. Assuming detections are precomputed in advance, they could theoretically operate at 120Hz and 60Hz respectively. In Sec. 7 we compare against these trackers and show that while they are capable of tracking in dense scenes with short lived tracks, as in MOT, they are not robust to ID switches when tracking people in frame for longer videos, making them inadequate for our use case.
While these trackers offer good performance across a wide range of metrics and for different classes of objects, no one tracker satisfies all the requirements of our use case, especially for people tracking.
Though drones are contentious with safety restrictions in many countries, we share many objectives with drone-based cinematography. The Skydio R2 [46] is an autonomous drone made for hands-free aerial filming and houses an NVIDIA TX2, six 200° cameras for navigation, and a 12.3MP camera (20mm). It can fly autonomously, avoid obstacles, and keep a target in the center of the frame. The R2 can also perform different localization moves relative to the target to achieve motion shots such as follow, lead, and orbit.
DJI provides a line of drones with Active Track capability [12]. These drones can follow a target while also achieving certain motion objectives. Like DJI's gimbal models, the drone-based tracking algorithms are not public, the user has limited control over framing, and multiple actors can not be tracked in series or in parallel.
Galvane et al. [16] formulate a system for drone path planning and actor based framing, and utilize the Prose Storyboard Language (PSL) [43] for a high level description of subject framing. In Nägeli et al. [39], multiple drones can be scripted to fly through specific paths in a scene and be guided by actor location. Huang et al. [20] control the location of a drone around a subject in a clutter free environment, using actor pose as input. Similarly, Joubert et al. [24] control the framing of two subjects in drone filming. While these methods either use limited UIs and/or non-visual means of actor tracking (GPS and infrared markers), they showed promise for the concept of scripted and actor-driven camera control.
Xie et al. [53] construct a feasible drone path given user-defined way-points for videography of landmarks. Huang et al. [21] learn a motion control model for drone cinematography by training on videos filmed with expert pilots.
Ashtari et al. [1] propose a system for mimicking the kinematics of FPV shots filmed by humans in a fixed drone flight path for drone cinematography. While the method provides dynamic control over drone kinematics for "shake" style, it relies on a fixed drone trajectory and doesn't allow for actor driven framing and multiple actors.
While we share the excitement around drone-based filming, drones are not always the correct or perhaps even legal tool for the task. Most actor driven shots take place in close quarters, with the camera closely following actors in the middle of the action. Drones are often only operated outdoors, in accordance with size and safety restrictions. Further, while dubbed audio may be used in scenes, the noise they produce will ruin on-set audio.
At a very high level, the proposed LookOut system lets a user specify what they want to track, and then aims the camera gimbal at that target during filming. Achieving that aim required many iterations of hardware and software, user interfaces, and especially (1) innovations in long-term visual tracking and (2) a novel control system. Here, we outline the components of the system, and how they help the operator to design and safely film the long takes they want. A solo camera operator, without specialized programming skills, uses our GUI for offline pre-production, and our rig for live filming. We consider post-production only as part of Future Work.
Fig. 2.
A novice camera operator filming using the LookOut system: (a) is an existing active camera gimbal, designed to stabilize mobile-phone filming. The mini-joystick is inactive by default. The orange 3D-printed handle channels the cables and protects the USB connectors from being bumped. (b) is the backpack computer, connected to the gimbal by one USB cable and connected to (d) with another. Not shown, the backpack also has headphones and a lapel mic, for two-way speech communication with the operator. (c) is the primary camera, recording high quality footage to local memory. (d) is the guide-camera, which has a wider field of view than (c), and whose video is fed to the backpack computer for real-time analysis.
Interestingly, Leake et al. [33], Wang et al. [49], and Zhang et al. [56] built interfaces that use learning to assist precisely with film-editing of existing clips. Instead, through our GUI, the user defines their intentions up-front, somewhat like telling an assistant what to expect. Those intentions are saved into scripts, which are later parsed by the LookOut control system during filming. On set, the camera operator wears a backpack-computer (see Fig 2) as the control-center and sensor-hub. The user also holds the camera gimbal in one hand, and has dialog with the LookOut controller, by wearing a microphone and headphone.
We give a brief overview of these components here, before providing their specifics in Sections 4 through 6.
GUI:
Before filming takes place, the camera operator uses LookOut's GUI to "tell" the camera gimbal how to behave and what to expect. The behaviors are chained together into a relative timeline. Instead of absolute times, user-specified cues conclude and then trigger each subsequent behavior in turn. Through the script file saved by the GUI, non-programmer users instruct the LookOut control system with what to look for in the audio and video sensor inputs, and how to react. Please see the supplemental materials where we show the Blockly-based LookOut GUI for designing long takes. There, we explain how non-programmer users build script files by assembling chains of behaviors. A resulting script file encapsulates how one or more actors (and even non-actors) should be framed while filming. The script file switches between behaviors when triggered by user-controlled cues, that LookOut checks for continuously: Speech cues, Elapsed Time, Actor Appearance/Disappearance, Actor in Landing Zone, and Relative Actor Size. We are proud of the GUI for being easy to learn and for matching many of the wishes voiced by consulted film-makers.
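The GUI exports these behavior chains as a machine-readable script. The exported format itself is not reproduced here, so the snippet below is only a hypothetical sketch (all field names are invented for illustration) of how behaviors and their triggering cues could be chained together:

# Hypothetical sketch of a LookOut-style script: an ordered chain of
# behaviors, each ended by a cue that triggers the next one.
script = [
    {
        "behavior": "frame_actors",          # keep both actors in frame
        "actors": ["actor_1", "actor_2"],
        "end_cue": {"type": "speech", "phrase": "switch"},
    },
    {
        "behavior": "center_actor",          # re-frame onto a single actor
        "actors": ["actor_1"],
        "end_cue": {"type": "actor_in_landing_zone",
                    "actor": "actor_2", "zone": "frame_left"},
    },
    {
        "behavior": "pan",                   # a simple scripted pan
        "end_cue": {"type": "elapsed_time", "seconds": 5},
    },
]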
System Startup and Setup:
When the LookOut hardware is first switched on, the user selects which scripts to load into the system. LookOut then parses these scripts and asks the user, through guided audio feedback, to enroll actors for tracking. The user adds an actor by pointing the camera roughly in the actor's general direction and pressing a button on a small joystick. LookOut guides the user for each additional actor. The system then prompts the user to utter each script-relevant speech trigger. This ensures all speech triggers are registered by LookOut using the user's current hardware audio configuration. LookOut informs the user that setup is complete and remains in Manual Mode until the user requests Automatic Mode. Every mode switch and behavior trigger is met with audio feedback.
Controller:
The controller reconciles the input script(s) with incoming sensor data, to dynamically drive the gimbal motors. When a script sets out the camera behaviors, the controller listens for the relevant audio-cues, and analyzes the video feed to monitor spatial relationships between enrolled actors. It then dynamically drives the gimbal to achieve the desired framing and smoothness. Finally, it gives audio feedback to the user, so they know that the LookOut system is correctly following the script and the current actions.
Visual Tracking:
Dynamic framing of one or more actors requires our system to follow along, monitoring where people are on-screen, even when they are briefly occluded or on the edge of the field of view (FoV). For these aims, we needed a visual tracker that can detect people and distinguish between them for long periods of time, despite imposter-objects, e.g. people or things that could resemble the main actor(s). Our tracker balances the need for accuracy against the need to feed low-latency tracks to the controller.
In this section, we describe the hardware and software on which LookOut is built. Please see Figs. 2 and 3 for close-ups.
Backpack:
Our system requires low latency feedback control in the wild. We use a VR backpack computer with a quad-core Intel i7 7820HK CPU @ 2.90GHz and a mobile Nvidia GTX 1070 GPU. The backpack can operate for 1.5-2 hours, allowing for very long shots and multiple takes, and is light at 3.6kg.
Stabilizing Gimbal:
We use the Basecam Handy gimbal to carry the camera assembly. The gimbal is programmable through a serial API and allows high speed, low latency control and telemetry data transfer up to 80Hz. The gimbal has an Inertial Measurement Unit (IMU) on the camera frame assembly, and an encoder for each axis for tight closed loop feedback control. We have exclusive control over velocities on yaw (ψ), pitch (θ), and roll (ϕ) of the camera frame assembly, regardless of the orientation of the handle. We disable any internal low pass filters on velocity to ensure controllability. We tune the gimbal's internal PID [27] loop for the tightest possible axis velocity control, while ensuring loop stability, given our camera array.
Camera:
We use two cameras in our system. One serves as a guide camera for visual tracking over a 90° field of view. It operates at 60Hz, at a resolution 1280 pixels wide. The other is the "star" camera, for capturing high quality footage. This configuration was preferred by filmmakers in our initial scoping. It allows for cinematic freedom over camera parameters, without hurting the performance of the visual tracking. We design and 3D print a carrier assembly for the cameras, shown in Fig 3. It maximises the balance on all gimbal axes, while minimizing the distance between the optical centers of both cameras within the gimbal's confined space.

Fig. 3. a) Camera frame axes with pitch (θ), roll (ϕ), and yaw (ψ). The LookOut controller drives the orientation of the camera assembly. On top is the guide camera, while the bottom camera is the "star" camera. b) Gimbal handle enclosure to allow for wire pass through and a comfortable grip. c) Camera assembly engineered for balance, and alignment of camera optical axes.

Remote Screen:
We use a remote HDMI transmitter and screen when turning the system on. Once the system is set up, the screen is put away.
Audio:
The user wears a lapel mic and earphones to speak commands to the system during filming, and to receive feedback throughout actor-enrollment and filming. We use an online wake word detection framework, Porcupine [41], for recognizing speech commands.
To achieve LookOut's aim of framing actors, the system needs to know their locations in screen space. The tracking component must work reliably for filming impromptu run & gun situations. Attaching real tags to actors such as in [16, 39] is often impractical. To this end, the tracker must be completely visual in nature. The requirements of the tracker are that it must:
(1) be capable of locating multiple targets of interest simultaneously, with a focus on actors,
(2) reacquire actors when they appear back in frame, while being robust to ID switches, and
(3) maintain a high online refresh rate (>30Hz) and low latency to ensure fast actor movements are captured and acted on by the control feedback loop discussed in Sec 6.
We cover the current state-of-the-art in Sec 2.3. Broadly, the trackers that are fast enough (>20Hz) fall into two categories: single object trackers aimed at the VOT [6] and OTB [52] challenges, and multi-target trackers from the MOT [38] challenge. We compare against the best trackers from these challenges in Sec 7.1. Notably, while single object trackers like DaSiamRPN [58] and SiamMask [50] perform well when keeping track of an object in frame, they are prone to tracking imposters when an object is occluded and reappears in frame, not satisfying (2). To satisfy (1), a different instance of each tracker would need to run separately for each actor; this compromises (3), since the runtime now scales linearly with the number of actors.
For trackers competing in MOT [38], almost all use tracking-by-detection, where detection bounding boxes for each scene are provided in advance and taken for granted. Multiple detectors in the literature are aimed at real-time performance at acceptable accuracy. A good choice of detector allows for real-time performance. Combining this with a budgeted algorithm for assigning detections to tracks at each timestep, trackers of this kind would suffer a relatively small penalty for each additional target. However, the MOT dataset has scenes whose mean length is only
~31 seconds, where targets only occasionally change view throughout their short life, and rarely reappear after long term occlusion. If a target is reacquired with a different ID, the tracker is only given a small penalty for ID switching/reassignment and is still given points for continuing to track with the wrong ID, breaking requirement (2).
To this end, we take inspiration from the high scoring trackers with relatively high throughput from the MOT challenge, DeepSORT [51] and MOTDT [35]. We add three contributions:
• a reworked cost structure for data association and detection/track assignment, with a concentration on tracking a handful of targets robustly,
• a recovery phase and mechanism, and
• a set of lightweight long term appearance-encoding history-management strategies, to aid in track recovery after long occlusion.
Appearance encodings are what the tracker relies on for differentiating actors and other people during filming. They are made by storing encodings of matched detections during tracking. A reliable per-actor feature gallery is important for tracking and recovery. All three components, explained below and in pseudocode in the Supplementary Material, focus on maintaining correct IDs for each actor, especially after occlusion.
Cost Formulation and Data Association:
At the heart of a tracking-by-detection tracker is an assignment problem. It involves minimizing an overall assignment cost for matching a set of targets T = {t_1, ..., t_i}, including appearance and bounding box information, to a set of detections in the current frame D = {d_1, ..., d_j}. Taking inspiration from DeepSORT [51], we combine c^IOU_ij, the IOU bounding box cost [23, 38], with c^feature_ij, the cosine distance on appearance features, born from the Siamese network in [57]. We do not use a Kalman filter state based cost, as detections from our choice of lightweight detector, tiny-YOLOv3, are very noisy spatially over time.
IOU costs are useful when a target is in isolation, but useless when overlaps occur or when coming out of a long occlusion. Complementing IOU, appearance costs are crucial for locating a target after long absences or during partial occlusion scenarios. However, a target's appearance changes over time as they move in and out of different lighting and exhibit out of plane rotation. A collection of appearance encodings, an encodings gallery, must be accumulated early in the track's life before we can rely on the appearance cost. To this end, we formulate a dynamic cost structure specific to each target, that emphasizes robustness by relying on IOU when no more than one detection competes for the same target, and on the appearance cost when a target is crowded. The cost for associating each target and detection, c(t_i, d_j), when that particular target, t_i, is under normal tracking (not being recovered nor lost) is given by

$$c(t_i, d_j) = \begin{cases} c^{\mathrm{IOU}}_{ij} & \text{if } c^{\mathrm{IOU}}_{ik} > \tau_{\mathrm{overlap}} \;\; \forall k \neq j \\ c^{\mathrm{feature}}_{ij} + c^{\mathrm{IOU}}_{ij} & \text{otherwise.} \end{cases} \quad (1)$$

τ_overlap is set to a high, strict value to prevent ID switches when a target is occluded by other people, i.e. the individual cost of assigning the target to all other detections based on IOU alone should be very high before we rely on IOU. To further reduce target switches, a track/detection pair are deemed incompatible if either the IOU cost or the appearance cost exceeds a defined low maximum.
Each appearance cost c^feature_ij is assigned the smallest cosine distance between d_j and all versions of t_i in that target's encoding gallery. As discussed later in this section, great care is taken to discourage imposter appearance encodings from entering a target's history. In the case that a rogue or noisy encoding is included, a further step is taken to reduce ID switches. An average of the N lowest appearance costs from the target's history is computed and, if found to exceed a predefined maximum, no match is allowed with this combination of detection and target.
Finally, all costs are passed along to a linear assignment step [32] where globally optimal target and detection assignments are found.
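As an illustration of this assignment step, the sketch below builds the dynamic cost of Eq. (1), applies the compatibility gating described above, and solves the resulting linear assignment. It is a minimal reading of the text, not the actual implementation; all threshold values are placeholders.

import numpy as np
from scipy.optimize import linear_sum_assignment

INCOMPATIBLE = 1e6  # large cost that effectively forbids a match

def build_cost_matrix(c_iou, c_feat, tau_overlap=0.9,
                      max_iou_cost=0.95, max_feat_cost=0.4):
    """Sketch of the per-target dynamic cost of Eq. (1).

    c_iou[i, j]  : IOU cost between target i and detection j (e.g. 1 - IOU).
    c_feat[i, j] : smallest cosine distance between detection j and
                   target i's appearance gallery.
    """
    n_targets, n_dets = c_iou.shape
    cost = np.empty((n_targets, n_dets))
    for i in range(n_targets):
        for j in range(n_dets):
            others = np.delete(c_iou[i], j)
            # Target i is uncrowded w.r.t. detection j when every other
            # detection is a poor spatial match: rely on IOU alone.
            if others.size == 0 or np.all(others > tau_overlap):
                cost[i, j] = c_iou[i, j]
            else:
                cost[i, j] = c_feat[i, j] + c_iou[i, j]
            # Gate: forbid the pair if either cue is individually too poor.
            if c_iou[i, j] > max_iou_cost or c_feat[i, j] > max_feat_cost:
                cost[i, j] = INCOMPATIBLE
    return cost

def associate(c_iou, c_feat):
    """Globally optimal target/detection assignment over the gated costs."""
    cost = build_cost_matrix(c_iou, c_feat)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < INCOMPATIBLE]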
Recovery:
Actors of interest will go into planned or unplanned short and long term occlusion throughout filming. During occlusion, the tracker must not confuse imposters with actors, and should then recover these actors when out of occlusion. We use appearance costs, c^feature_ij, exclusively for this step. However, the appearance encodings available to us on detections and targets are temporally noisy. An imposter detection might present a noisy appearance encoding in one frame that matches to a lost target. To prevent these types of false matches, we define a recovery phase that begins when a detection is matched to a lost target. For a target to come out of recovery, it must be matched to a detection for R sequential timesteps. This mechanism sacrifices a few frames of tracking for recovery in the short term, but greatly improves the tracker's resistance to ID switching and long term tracking. We test our tracker without a recovery step in Table 1.
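A minimal sketch of this recovery bookkeeping, assuming a per-target state flag; the value of R and the state names are illustrative:

from dataclasses import dataclass

R = 5  # illustrative; consecutive matches required to confirm recovery

@dataclass
class Target:
    state: str = "tracked"      # "tracked" | "lost" | "recovering"
    recovery_hits: int = 0

def update_recovery(target: Target, matched: bool) -> None:
    """A lost target must be re-matched (appearance cost only) for R
    sequential timesteps before it is trusted again."""
    if target.state == "lost" and matched:
        target.state, target.recovery_hits = "recovering", 1
    elif target.state == "recovering":
        if matched:
            target.recovery_hits += 1
            if target.recovery_hits >= R:
                target.state, target.recovery_hits = "tracked", 0
        else:
            # Sequential requirement: a miss sends the target back to lost.
            target.state, target.recovery_hits = "lost", 0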
Feature History Management:
In dense scenes and in a target's recovery phase, the tracker relies solely on each target's appearance encoding gallery, R_i = {r_1, ..., r_L}, for data association. Ideally, an infinitely sized history would allow for the most accurate representation of the target's appearance. However, encoding comparisons for calculating appearance costs would get expensive with longer target life cycles: 10 minutes at 30Hz yields 18,000 appearance encodings. In dense scenes, encodings that are produced on occluded bounding boxes might later allow an imposter to match to this target incorrectly. We explored methods for maintaining the most informative but small gallery of a target's appearance.

TP↑      MT↓      FP↓      D̄↓      T (ms)↓
Our Tracker
Faulty Encodings     0.785    0.140    0.075    26.1     19.0
Greedy Encodings     0.698    0.234    0.068    41.2     19.0
Simple History       0.688    0.182    0.131    50.7     19.0

Table 1. Ablation study of our tracker on the two test sequences and the metrics we establish in Section 7.1. Simple History is a flavor of our tracker but with no feature history management; only the last seen L encodings are stored in memory. No Recovery is our tracker but without a recovery stage; if a detection matches a target once, it is accepted as the target, leading to stray incorrect tracks on distractors, a high FP score, and a lower TP score in the long term. Greedy Encodings stores a new incoming encoding into the feature gallery even if similar ones exist, filling up the gallery faster, thus leading to a restrictive appearance memory. Faulty Encodings accepts detection encodings that are overlapped with other detections in the scene; this pollutes the gallery with noisy encodings and detracts from the tracker's ability to avoid distractors. Since the gallery sampling strategy is random, all trackers are run 40 times to ensure fairness.

To address the faulty encodings issue, encodings are added exclusively in normal tracking, when only one detection competes for the current target. This prevents encodings produced by imposters overlapping our target from producing ID switches later in tracking. In Table 1, a tracker without this check is referred to as Faulty Encodings.
Almost all MOT trackers use a fixed size gallery of the most recent L encodings. This works for short sequences (~30 seconds), as in the MOT challenge, where targets do not have a long life cycle and whose appearances do not vary greatly. However, it is less successful for longer sequences, where a target may reappear with different lighting or pose than when they went into occlusion. We address the rapid increase in the gallery's size by selectively adding encodings to the appearance gallery on every time step. An encoding is added only if it is sufficiently distant, via the cosine distance, from all other encodings in the gallery. This slows down the growth of the gallery by an order of magnitude and prioritizes space and time on informative encodings.
Although these steps help reduce the expansion of the gallery's size and maintain its integrity, they only delay the pruning problem when the gallery is full. We explored informed techniques that cluster encodings to select the most informative ones. However, most clustering techniques are iterative and time consuming, especially in this high dimensional space. A simple k-means run consumes 7ms for each target. Alternatively, a simple and effective solution is to randomly sample L_k encodings from the gallery when it is full. Randomly sampling on each iteration is still expensive, so we only sample once the gallery is 10% larger than L_k. This has the effect of maintaining new appearances of a target while keeping a fading memory of older appearances for longer, since with every sampling step, encodings of an older age stamp are less likely to be propagated forward. Table 1 shows the performance of a tracker with a naive last-L_k encodings history.
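A compact sketch of the gallery policy described above: only add an encoding when the target is uncrowded and the encoding is sufficiently novel, and randomly resample once the gallery overshoots its budget by roughly 10%. The capacity and distance thresholds here are illustrative placeholders.

import random
import numpy as np

L_K = 128          # illustrative gallery capacity, not the paper's value
OVERFLOW = 1.10    # resample once the gallery is ~10% over capacity
MIN_DIST = 0.15    # illustrative novelty threshold on cosine distance

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def maybe_add_encoding(gallery, encoding, uncrowded):
    """gallery   : list of appearance encodings (np arrays) for one target.
    encoding  : encoding of the detection matched to this target.
    uncrowded : True only when no other detection overlaps the target,
                which keeps imposter appearances out of the gallery."""
    if not uncrowded:
        return
    # Only keep encodings that add information: sufficiently far (cosine
    # distance) from everything already stored.
    if all(cosine_distance(encoding, g) > MIN_DIST for g in gallery):
        gallery.append(encoding)
    # Defer the pruning until the gallery has drifted ~10% over capacity,
    # then randomly keep L_K encodings; older appearances fade out
    # gradually over repeated resamplings.
    if len(gallery) > OVERFLOW * L_K:
        gallery[:] = random.sample(gallery, L_K)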
Speed:
All trackers in the MOT challenge report refresh rates with bounding boxes for detections computed in advance. In our real-time and purely online application, we must include the time taken for computing detections. A survey of the detection field shows that single shot object detection, either YOLOv3 [42] or SSD [34], is best suited for this application for its attractive trade-off between speed and performance. With a CPU and GPU constrained system, tiny-YOLOv3 [42] produces detections at a rate of 40Hz, and so meets our requirements for low latency and reasonable accuracy. This detector is trained for numerous object categories, but our default LookOut experiments keep only people, cars, and bicycles. As shown in the Supplementary Videos, we also experimented with DaSiamRPN [58], which allows enrollment of novel objects, such as a shop window and a garden gnome. GPU tasks, such as computing detections and appearance encodings, are all kept in a separate computation thread. We keep all other tracker tasks in a separate thread running concurrently. The completely online implementation of our tracker runs at 34Hz with an average latency of 30ms in a constrained system, satisfying (3).
We downsample all camera update frames to 740 × 416 for input to tiny-YOLOv3. Our choice of detector does however produce temporally noisy and intermittently missing detections, especially for small objects. We tune the Kalman Filters that we use for tracking updates to reduce spatial noise passed to any control loops down the pipeline.
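The thread split described above could be organized roughly as follows; detect_and_encode(), associate(), and publish() are hypothetical stand-ins for the GPU work, the assignment step of Sec. 5, and handing tracks to the controller.

# Rough sketch of the two-thread pipelining: GPU work (detection plus
# appearance encoding) in one thread, association and track bookkeeping
# in another, connected by a small queue to bound latency.
import queue
import threading

gpu_out = queue.Queue(maxsize=2)   # small buffer keeps end-to-end latency low

def gpu_worker(frames):
    for frame in frames:
        dets, encodings = detect_and_encode(frame)    # hypothetical stand-in
        gpu_out.put((frame.timestamp, dets, encodings))

def tracker_worker(tracks):
    while True:
        timestamp, dets, encodings = gpu_out.get()
        associate(tracks, dets, encodings)            # assignment step, Sec. 5
        publish(tracks, timestamp)                    # hand P_T to controller

threading.Thread(target=gpu_worker, args=(camera_frames,), daemon=True).start()
threading.Thread(target=tracker_worker, args=(tracks,), daemon=True).start()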
Subject Enrollment:
Our tracker allows a subject to be enrolled by using a chosen detection as a start point. The tracker allows tracking to take place immediately, and it learns the appearance of the subject as the scene progresses. An optional extra step just before filming can help improve the tracker's robustness: having the target turn around, and ideally even enter differently lit areas of the scene, during setup.
We drive the camera orientation to re-frame actors dynamically over time. The controller reconciles live tracker data with the user's instructions and then drives motors on the gimbal to adjust the camera assembly's orientation, to achieve the user's desired framing of one or more actors. The interface for user instructions is discussed in the supplementary materials, and actor location tracking was in Section 5.
The visual servoing community has made tremendous progress in constructing methods for moving cameras and robotic arms to desired positions in space and/or orienting them based on some external visual signal [28]. The bulk of visual servoing use-cases are in robot end effector control in manufacturing. Usually these methods involve the solution of a Jacobian matrix [8, 14] that encodes tasks and joint movement constraints. While some work explores modulating the variance of the mean position of all visual points of interest in image space [17, 55], none has provided a transparent formulation for controlling per-target variance, nor a framework for gradual change between different tasks and constraints. We borrow themes from the visual servoing literature while constructing a task specific control scheme.
Appealing camera positioning and orientation is essential for effective video game design; as such, the video gaming community has generated methods and implementation tricks for dynamic cameras that follow in-game action on-the-fly [18]. While these methods assume that targets are known with certainty and that control over camera properties is instantaneous, we take hints from the community when designing our own control scheme, and incorporate strategies to both mitigate and cope with real world noise.
At a high level, the controller is a closed loop feedback system with proportional-integral-derivative (PID) [27] controllers that minimize an error signal, e(t), by modifying the camera frame's yaw and pitch over time. e(t) is an abstraction of the error between real-time dynamic actor locations and the desired user framing encapsulated in the script. If we simplify camera space conversions, ignore noise, and assume only a single tracked target, then e(t) is just the screen space difference between the actor's tracker location and the user's screen space requirement for actor framing, with both x and y components. Errors in x and y are corrected by changing the camera frame's yaw and pitch respectively. The corrections are handled by PID controllers, so

$$\dot{\psi} = \mathrm{PID}(e_x(t)) \quad \text{and} \quad \dot{\theta} = \mathrm{PID}(e_y(t)). \quad (2)$$

We tune our PID controllers using a relaxed version of the Ziegler-Nichols procedure [59] to achieve the tightest response possible, while minimizing overshoot, given delay and processing constraints. Note that these are camera frame radial velocities and not direct motor torque commands. The underlying gimbal camera assembly radial velocity stabilization is tuned in the gimbal's firmware and is not discussed here.
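As a concrete reading of Eq. (2), a minimal per-axis discrete PID update is sketched below. The gains and loop rate are illustrative placeholders, not the values obtained from our Ziegler-Nichols tuning.

class PID:
    """Minimal discrete PID controller for one camera axis (Eq. 2)."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# One controller per controlled axis: yaw corrects x error, pitch corrects y.
yaw_pid = PID(kp=2.0, ki=0.1, kd=0.05, dt=1 / 60)
pitch_pid = PID(kp=2.0, ki=0.1, kd=0.05, dt=1 / 60)

def control_step(e_x, e_y):
    # Screen-space framing errors in, camera-frame radial velocities out.
    yaw_rate = yaw_pid.step(e_x)       # psi_dot
    pitch_rate = pitch_pid.step(e_y)   # theta_dot
    return yaw_rate, pitch_rate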
This abstracted version of e(t) is suitable for a single actor, but it will produce erratic camera motion, since raw tracker locations are noisy, either due to tracker inaccuracy or due to subtle actor movements. This is fine if the preferred style is very erratic, unnerving camera motion with a random component due to noise, but not for any other desired style. Tuning the PID controllers to be lazy would ignore noise and allow for a lazy camera, but would erode control over all actor driven camera behavior and remove responsiveness when responsive corrections are required. Other design considerations include handling behavior transitions and tracker dropouts, where potential camera jerks are likely and a single one would ruin a take. The controller must handle a variety of filming scenarios and behaviors: from single actor to multi actor, from action scenes to calmer, slower paced scenes, and the transitions between them. We therefore design the controller to pursue these design objectives:
(1) Achieve desired user framing: on every loop, the system should minimize the difference in required actor framing vs. actual actor framing. Logical compromises should occur when framing multiple actors at once.
(2) Only move the camera if motivated: the user can provide an ellipse for each actor in each behavior, indicating an area around the actor where their movements do not result in camera frame reorientation. The controller should also ignore noise from raw tracker estimates, so the camera does not oscillate and produce unpleasant motion. This is discussed in Section 6.1.
(3) Enable smooth transitions: as behaviors change, different actors come in and out of scene. The transitions between different actors must be smooth. This is discussed in Sec 6.2.

Fig. 4. High level control loop view of how LookOut fulfills subject framing. On top, user inputs come in the form of the GUI during pre-production and through the use of speech commands on-set. At the bottom, the tracker converts camera footage into raw tracks, P^T. All of these inputs enter the orchestrator, whose job is to drive the gimbal through the PID [27] controller. By modulating process variances, h, the controller balances between responsiveness and smoothness for one or more actors. h is among the outputs from behavior logic, which had access to augmented track points from the last timestep (not pictured) and current target points, P^T. h helps compute the augmented points, P^A, which go into the procrustes module. The other main input to the procrustes module is the required locations for each actor, P^R. Finally, the weighted difference between required locations and augmented locations drives the gimbal update. Not seen here is a velocity fading module that fades between different velocities at the transition from one type of behavior to another.
The e(t) signal driving the PID controller is computed based on these objectives. Specifically, e(t) comes from a weighted Procrustes module, that we simplified: it aligns the current 2D actor location(s) with the location(s) required by the user, subject to the available degrees of freedom. We found that in-plane rotation for framing wasn't helpful, and our star camera lacks external control over focal length. Therefore, in all our experiments, we used a Procrustes model that simply finds the translation vector T_c as the weighted difference of the average actual locations and the average required locations; T_c acts as our error vector e(t). To address (2), we add "Leniency," where instead of passing raw tracker components to compute T_c, we instead produce dynamically decoupled and filtered Augmented locations in Sec. 6.1. To address (3), we modify the influence each actor has in current framing, given transitions and tracker confidence, and introduce filtering on required script behavior in Sec. 6.2.
Minimizing the difference between actor locations given by the tracker, P^T = {p^T_1, ..., p^T_n}, and required user locations, P^R = {p^R_1, ..., p^R_n}, using Procrustes satisfies (1). However, this wouldn't filter noise, either from the tracker or from abrupt camera translation, and wouldn't allow for selectively ignoring small actor movement. Instead, to satisfy (2), we define tracked-smoothed-augmented points ("Augmented") P^A that are smoothed versions of P^T and use those to compute T_c. The obvious means of making augmented versions P^A is via Kalman filtering, so

$$p^A_i = \mathrm{KalmanFilter}(p^T_i, h_i), \quad (3)$$

where h_i is a process variance. A high h_i means an augmented point follows its tracked point quickly, allowing for an immediate change in the error term for that actor and an immediate correction signal from Procrustes, resulting in a camera that is very responsive to actor movement. A small h_i allows for the opposite: each p^A_i lazily follows its track point p^T_i, resulting in a less immediate corrective signal and less eager camera panning.
However, fixing h_i limits user control. Ideally, there should be definable areas of forgiveness around an actor where small movements are ignored. Setting a small h_i allows this, but would ignore actor movements outside of this area when they do matter. Instead we modulate each h_i based on d^LE_i, the current discrepancy between a tracked point p^T and its augmented point p^A from the previous time step. We make h_i proportional to d^LE_i, so that with a small h_i the Kalman filter will ignore new updates given by p^T_i and instead choose to maintain the older location of p^A_i. As a point p^T_i moves too far from its p^A_i, the distance d^LE_i ramps up h_i, and the Kalman filter becomes more sensitive to new incoming updates via p^T_i, so p^A_i follows p^T_i more closely.
The relationship between d^LE_i and h_i is user definable and based on a family of exponential functions. We define a set of artistic quantities:
• Zero Error Lift, v: this forces a non-zero value when d^LE_i is at zero. The result of a high v is an immediate responsive pan from the camera when the actor moves small distances from rest.
• Agnostic Gap, a: this defines how much distance the actor has to travel before the camera pans, and
• Curve Profile, q: this defines the ramp up at the edge of the allowed area of leniency, and determines how sharply the camera will pan when an actor begins to leave that leniency area.
We also set a hard limit on h via η. This cap limits the impact from temporal instabilities in the tracker, and was experimentally set to 0.01 for vertical motion and 0.05 for horizontal motion in all experiments. We include a qualitative experiment demonstrating these smoothing functions and raw tracker values in the supplemental videos. For each actor, each component of h = (h_x, h_y) is computed as:

$$h_x = \eta_x \, \mathrm{clamp}\big(0,\, 1,\; e^{q_x (d^{LE}_x - a_x)} + v_x\big) \quad \text{and} \quad h_y = \eta_y \, \mathrm{clamp}\big(0,\, 1,\; e^{q_y (d^{LE}_y - a_y)} + v_y\big).$$

These equations are not obvious, but the three input parameters have interpretable connections to the radii (r_x, r_y) of each ellipse drawn by the user in the GUI. The functions relating the radii r to each of these parameters are given in the Supplementary Material. In brief, for large axis ranges in r, v is reduced such that almost no movement occurs at zero error, a is made to satisfy the distance defined by r, and q is set so that the transition is smooth. Conversely, for a small r, v is kept high for immediate reaction, a is set so that the point at which the curve increases happens earlier, and q is set so that the curve is sharp. See Fig 5 for different curves corresponding to different user input radii. We include an example of multiple actor leniency in the supplemental video.
Note that the naive solution of simply weighting the error associated with the i-th actor to zero when the actor is in some allowed radius will not achieve multi-actor leniency. Most situations result in an optimum where required actor locations are not fulfilled perfectly, due to physical limitations. A zero weight for an actor would result in a new optimization and, counterintuitively, produce camera motion when none was required. See Figure 6 for an illustration of this.

Fig. 5. Curve profiles for different user prescribed leniency radii. These radii represent areas around the actor where no camera motion should happen if the actor moves in that area. The y-axis is applied to η to produce each Kalman filter's process variance, h. The x-axis is the difference, d^LE_i, between the augmented version of the actor's location from the previous timestep, p^A_i, and the raw tracker location, p^T_i, and is normalized relative to screen space size. A smaller ellipse radius limits the area where the actor can move without a camera pan, as the process variance ramps up immediately. A larger ellipse allows for more actor movement before the camera starts panning.

To allow for smooth transitions between subjects, satisfying (3), each actor is assigned a weight, w_i, that modifies the actor's error term in the Procrustes optimization. When an actor appears in frame and is part of the current behavior, their weight is increased progressively, and decreased when they either disappear from frame, due to occlusion or tracking failure, or are no longer included in the behavior.
For each actor, we also apply Kalman filters on user selected required points as they transition between different behaviors, so that no discontinuities occur.
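The sketch below illustrates the Sec. 6.1 leniency mechanism for a single axis: the exponential curve maps d^LE to a process variance h, and a scalar Kalman-style update makes the augmented point lazily follow the raw track point. The clamp range, measurement variance, and parameter values are assumptions for illustration only.

import math

def process_variance(d_le, a, q, v, eta):
    """Leniency curve (one axis): maps the discrepancy d_le between the raw
    tracked point and its augmented point to a Kalman process variance.
    a, q, v follow from the user's ellipse radius; the clamp to [0, 1] is
    our reading of the capped curve in Fig. 5."""
    return eta * min(1.0, max(0.0, math.exp(q * (d_le - a)) + v))

class AugmentedPoint:
    """Scalar sketch of Eq. (3): the augmented coordinate lazily follows the
    tracked coordinate, with responsiveness set by the process variance h.
    The measurement variance below is an illustrative constant."""
    def __init__(self, x0, meas_var=0.02):
        self.x = x0           # augmented estimate p_A
        self.p = 1.0          # estimate variance
        self.meas_var = meas_var

    def update(self, x_tracked, h):
        self.p += h                               # predict: inflate by h
        k = self.p / (self.p + self.meas_var)     # Kalman gain
        self.x += k * (x_tracked - self.x)        # correct toward p_T
        self.p *= (1.0 - k)
        return self.x

def augment(point: AugmentedPoint, x_tracked, a, q, v, eta=0.05):
    # Small movements inside the leniency ellipse barely change p_A, so the
    # framing error, and hence the camera, stay still; larger excursions
    # ramp h up and p_A snaps to p_T.
    d_le = abs(x_tracked - point.x)
    h = process_variance(d_le, a, q, v, eta)
    return point.update(x_tracked, h)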
Separately, a behavior can be intentionally shaky, to give the viewer a hand-held impression. To achieve shakiness or intentional banking behavior (like an airplane changing course), the controller reads the gimbal IMU accelerations on the camera's horizontal axis, applies smoothing, and actuates the roll axis. See the supplemental video for an example of a path behavior.
The LookOut system has been used to film over 12 hours of footage. To measure its strengths and find its weaknesses, we split up validation into five components:
(1) Tracker performance,
(2) Controller evaluation,
(3) Hands-on evaluation by film-makers,
(4) Discussion of LookOut footage with senior film-makers, and
(5) Qualitative showcase of LookOut in different scenarios.
For (1) and (2) we also compare performance against the DJI Osmo Mobile 3 in the supplemental. Note that illustrated footage in the supplemental website is slowed down to make ingesting telemetry data easier.
Fig. 6. a), b), and c) show an example where the user specifies that both actors should be framed in the center, given by required locations p^R_1 and p^R_2. However, the actors' relative locations at p^T_1 and p^T_2 make it impossible for that requirement to be fulfilled. As such, the best framing possible at steady state is where both are equidistant from the center. In a) no leniency is defined, and so a movement by either p^T_1 or p^T_2 will need a new optimization, and the camera moves. In b) and c), leniency is required on one actor, given by the red ellipse defined by the user, i.e. if the actor at p^T_2 moves within the ellipse, the camera should not respond. In b), a naive solution to achieve leniency is to attenuate the error term d^E_2 when the target p^T_2 is close to the point of the optimization at steady state (where p^T_2 sits). However, since this is a less than ideal framing, with both required points at the center, a new optimization will be found that improves d^E_1, and the camera moves, disregarding leniency. Instead, in c), we formulate a new augmented point p^A_2 that is output from a Kalman filter on p^T_2, whose process variance is modulated by the distance d^LE_2 w.r.t. the ellipse. The actor and ellipse can move around the augmented point, and as long as the augmented point p^A_2 is in the ellipse, it will not move as the target point p^T_2 moves, since the process variance h remains low; the error term d^E_2 doesn't change since p^A_2 remains stationary although the target point has moved, and so the camera does not move to compensate. When the actor leaves this ellipse, h is ramped up, p^A_2 moves to follow p^T_2, the error term d^E_2 changes, a new optimized framing is found, and the camera pans.

We test our tracker's performance on the VOT Long-Term Challenge [31], and on two long manually annotated videos that better represent our film-production use case. Market (a one actor scene at 3m20s with annotations every frame) and TwoPeople (a two actor scene at 10m30s with annotations every five frames) are challenging scenes with representative clutter, many occlusions by distractors, variable appearance before and after occlusion, and lighting changes (see Fig 7). Crucially, the subjects' appearance changes to something not seen before when emerging after an occlusion. While our tracker and others can sometimes be shown the subject from all angles to build a representative history, this test also checks for pickup-and-go filming performance, so no such five second grace training period is given. We ultimately advocate our tracker for the tracking of people in our use case. However, we include all videos from the VOT challenge in the comparison.
Fig. 7. Sample frames from annotated videos used for benchmarks. Top: Market, a 3m20s scene of the actor in the beige coat walking through a crowded market. There are many occlusions in this scene, including where the target appears in frame with a different appearance than when they went into occlusion. Bottom: TwoPeople, a 10m30s scene of two actors on a walk through a campus and a park. Both actors wear similar looking clothes, occlude one another, disappear from frame entirely, are seen at different scales, and walk at various distances away from the camera.
We obtain ground truth bounding boxes by manually annotating input videos at 740 × 416. A tracker is given a true positive (TP) point for a frame if it either correctly predicts the bounding box of the actor or correctly predicts that the actor is occluded. We use a bounding box IOU threshold to determine if the correct bounding box is output. If a tracker outputs an incorrect bounding box, regardless of whether or not the actor is occluded, it is given a false positive point (FP) for that frame. If a tracker does not output a bounding box when the actor is unoccluded, it is given a missed track (MT) point. We distinguish between FP and MT in this way to highlight errors that would point the camera away from the targets of interest, as is expressed with FP. We also compute the pixel distance between the center of the ground truth box and the center of the track, D, and obtain a mean over all updates, D̄. The center of frame is used instead of the tracker's output when the tracker is lost, and in case the tracker outputs some bounding box but the target is actually occluded. All trackers are instantiated only once, using the first ground truth bounding box. Detection based trackers are all run on tiny-YOLOv3 output, and given the detection best fitting the ground truth as a start point. All trackers are run in a single thread, including ours. We report raw unnormalised results for Market and TwoPeople (average of both actors) and normalized results on VOT-LT2019 [31] sequences in Table 2. Since our tracker has a random component, we average 40 runs on the same LookOut backpack computer.
We also ran a qualitative experiment with the leading VOT 2018 real-time tracker, DaSiamRPN [58]. We filmed an actor walking in a pedestrian area using both our tracker and DaSiamRPN [58] in separate takes. The rest of LookOut is kept constant, including actor weighting and actor specific leniency, which both help to mitigate tracker noise and errors (but don't affect tracking). We run two takes each and show all takes in the supplementary video. These takes show the importance of our robustness to imposters in filming.
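A per-frame scoring sketch of these definitions follows; the IOU threshold value is an illustrative placeholder, since the exact value is not stated here.

def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def score_frame(pred_box, gt_box, gt_occluded, iou_thresh=0.5):
    """Returns one of 'TP', 'FP', 'MT' per the definitions above."""
    if pred_box is None:
        # No output: correct if the actor is occluded, otherwise a miss.
        return "TP" if gt_occluded else "MT"
    if gt_occluded:
        return "FP"            # output a box while the actor is occluded
    return "TP" if iou(pred_box, gt_box) >= iou_thresh else "FP"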
Table 2 reports the metrics TP↑, MT↓, FP↓, D̄↓, and T (ms)↓ for Our Tracker and the other evaluated trackers, on each of the three test sets: Market (3m20s, one actor), TwoPeople (10m30s, two actors), and VOT-LT2019 [31] (~2m24s, one target).

Table 2. We evaluate our tracker and other leading state-of-the-art real-time trackers on the VOT long-term tracking dataset. Other algorithms outperform ours on VOT. However, the VOT videos are qualitatively different in appearance from our use cases. So we introduce two further test sequences with 12,300 manually labeled annotations. These videos are more representative because of their cinematic style, both long and short term occlusions, and the presence of distractors, including people in cluttered environments. A high TP (true positive) is obviously advantageous. A low FP discourages the camera from moving onto a distractor. Some missed tracks, MTs, are tolerable, but especially after a long occlusion, missing the target could lead to catastrophic target loss. While a low MT score is important, a trivial tracker that always outputs a bounding box, whether or not the target is occluded, would allow the tracker to be distracted. In the short term, this will lead to FPs, and in the long term, it will pollute that actor's appearance representation. FPs are especially detrimental for LookOut, because the camera is controlled by tracker output. An inaccurate position will move the camera away, further decreasing the chances of recovery and ruining a take. All run times include detector latency when appropriate.

Operation                   T (ms)↓
Frame Grab                  8.2
Frame Resize                3.1
Gimbal Control              0.7
Gimbal Metrics Retrieval    8.9
Miscellaneous               1.6
Total
Table 3. Breakdown of task times in LookOut. There are three main threads.One main thread handles communication and control of the gimbal alongwith scripting logic. Gimbal Metrics Retrieval and Frame Grab are softlyrun in parallel. Tracker tasks are split into two threads, one handles GPUcomputation and the other handles the final detection/track assignment.All threads run in a pipelined manner.
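To make the scoring rules above concrete, here is a minimal sketch of how per-frame 𝑇𝑃/𝐹𝑃/𝑀𝑇 points and the center distance could be tallied. The box format, the score_frame helper, and the 0.5 IOU threshold are assumptions of this illustration, not the paper's evaluation code.

```python
# Minimal sketch (not the paper's evaluation code) of the per-frame scoring
# described above. Boxes are (x, y, w, h); gt_box is None when the actor is
# annotated as occluded, pred_box is None when the tracker reports "lost".
# The 0.5 IOU threshold is an assumption of this sketch.
IOU_THRESH = 0.5

def iou(a, b):
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def score_frame(gt_box, pred_box, frame_w, frame_h):
    """Return (TP, FP, MT, centre_distance) contributions for one frame."""
    if pred_box is not None and (gt_box is None or iou(gt_box, pred_box) < IOU_THRESH):
        tp, fp, mt = 0, 1, 0      # wrong box, whether or not the actor is occluded
    elif pred_box is None and gt_box is not None:
        tp, fp, mt = 0, 0, 1      # actor visible but no box was output
    else:
        tp, fp, mt = 1, 0, 0      # correct box, or correctly reported occlusion
    dist = 0.0
    if gt_box is not None:
        # The frame centre stands in for the tracker output when it is lost.
        if pred_box is None:
            tx, ty = frame_w / 2.0, frame_h / 2.0
        else:
            tx, ty = pred_box[0] + pred_box[2] / 2.0, pred_box[1] + pred_box[3] / 2.0
        gx, gy = gt_box[0] + gt_box[2] / 2.0, gt_box[1] + gt_box[3] / 2.0
        dist = ((tx - gx) ** 2 + (ty - gy) ** 2) ** 0.5
    return tp, fp, mt, dist
```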
In order to evaluate the controller components responsible for translating script commands into target camera-frame radial velocities, we film multiple qualitative videos and also run an ablated version of the system. We film two takes of the same running scene at the same location and with the same predetermined path, making sure to keep the relative motion between the camera and actor consistent. One take was filmed using our full system, including a minor leniency that is close to the minimum allowed. The second take was filmed with an ablated version of the control system, or "Standard Control." The ablated version passes raw tracker values as-is to the PID controller, without the actor control weight adjustment (Sec. 6.2) or the leniency mechanism (Sec. 6.1). Fig. 8 shows camera-frame radial velocities for both modes throughout this scene. Overall, the full controller satisfies the scripted actor framing and largely ignores both actor track noise and the camera's translational motion that manifests as screen-space motion. Following these internal and external noise sources would lead to an uncontrollably erratic camera. Please see this side-by-side comparison in the supplemental validation video and on the website, as the video pair named Fully Ablated Control under Control Ablation, for both the illustrated visualization and the star footage of this targeted A/B ablation comparison.
In the video Ablated Multi-Actor Weighting under Control Ablation, we show how using binary weights for actor script transitions produces a nervous, erratic camera at best, and usually leads to a broken take. This happens because the error 𝑇𝑐 goes from being entirely Actor2-focused to entirely Actor1-focused, and vice versa, in one time step. This spikes the PID controllers, leading to erratic corrections and a nervous camera. All other videos with multiple actors show behavior with the method outlined in Sec. 6.2 and with leniency from Sec. 6.1. We also film scenes to show the effect of variable leniency on a single actor and on multiple actors, namely Hampstead Leniency Switch and Clown and Calm in the supplemental website. In the supplemental website, please also see other video illustrations of actor control weights, leniency ellipses, and actor process variances, displayed when available in the filming metrics.
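To make the contrast with binary weight switching concrete, the following is a minimal sketch of blending per-actor control weights over a transition window before they scale the screen-space framing errors fed to the PID controllers. The linear ramp, window length, and function names are assumptions of this illustration; the paper's actual scheme is defined in Sec. 6.2.

```python
# Hypothetical illustration of smooth actor control-weight blending during a
# behavior transition, as opposed to the binary switch criticized above.
def transition_weights(t, t_start, duration):
    """Weights for (outgoing_actor, incoming_actor) at time t."""
    if duration <= 0:
        return 0.0, 1.0
    alpha = min(max((t - t_start) / duration, 0.0), 1.0)
    return 1.0 - alpha, alpha

def framing_error(weighted_targets):
    """Combine weighted per-actor screen-space errors into one control error.

    weighted_targets: list of (weight, (err_x, err_y)) pairs, where err_* is
    the offset between an actor's tracked position and its scripted position,
    in normalized screen coordinates.
    """
    total_w = sum(w for w, _ in weighted_targets) or 1.0
    ex = sum(w * e[0] for w, e in weighted_targets) / total_w
    ey = sum(w * e[1] for w, e in weighted_targets) / total_w
    return ex, ey  # fed to the yaw and pitch PID controllers, respectively

# Example: half-way through a 1.5 s transition from Actor2 to Actor1, the
# error is an even blend rather than a single-step jump that spikes the PIDs.
w_out, w_in = transition_weights(t=0.75, t_start=0.0, duration=1.5)
err = framing_error([(w_out, (0.30, -0.05)), (w_in, (-0.20, 0.10))])
```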
We designed and ran a small field study on an intermediate prototype, composed of two parts. Part 1 consisted of participants building a script using the LookOut GUI, while Part 2 involved the same participants filming the scene they had programmed.
Participants: In total we had 5 participants: four completed both parts, while one completed only Part 1. We recruited the five volunteers (two female) by posting an advert on an amateur film-makers' group and through our own social networks. Three of them work within the film and entertainment industry (one lighting technician, one backstage support, and one director), while two are university students. All participants had prior experience with filming, from beginner to amateur. Filming experience ranged from filming static scenes to action shots using Steadicams, and from short clips for the Web to short movies. None of the participants were familiar with computer vision, nor had they been exposed to the system before the study. One of the participants reported being familiar with Blockly from toys such as the Sphero™, which she had previously encountered in her part-time work.
Fig. 8. Yaw (top) and pitch (bottom) camera-frame radial velocities for both our full system and the ablated control, throughout the control-ablation running scene. The standard controller (ablated system) does not ignore tracker noise, and translates changes in perceived actor location from the tracker directly into an error in the PID controller. This leads to massive over-corrections and an uncontrollably erratic camera. Instead, our full controller handles tracker noise and camera translation by using modifiable leniency (Sec. 6.1) on raw tracker locations and applying control weight adjustment (Sec. 6.2). Note that these are not gimbal motor velocities; rather, they are target radial velocities for the camera frame to achieve. Raw gimbal axis velocities and torque are a function of the required camera-frame velocities and the external forces acting on the gimbal assembly.
Experimental Design: The study was designed to expose participants to the full operation of the system, from the creation of the configuration scripts using the GUI, to the actual filming of the action. To harmonize task complexity across participants, we asked them to film a predefined sequence, communicated to them through a storyboard (printed in color on a single A3 page). Note that this form of study does not check creativity in run & gun scenarios, but rather productivity [45] when a DP is working solo. Designing a suitable storyboard required careful consideration, to balance conflicting requirements. On one hand, we wanted a setting that really challenges visual tracking algorithms, and a storyboard that is particularly complex for a single operator to film in one shot. These requirements were there to assess the system's ability to deal with challenging filming situations, and the GUI's ability to expose a spectrum of behaviors. On the other hand, the storyboard design was constrained by concerns around the health and safety of participants (and the need to satisfy our research ethics review requirements). These concerns made us rule out any sequences involving stairs, streets with vehicles, or any other scenes that could be deemed unsafe. We also limited the number of actors required to two, and the overall study duration for each participant to one hour. After a number of iterations, involving consultation with a separate filmmaker, we agreed on the storyboard in Figure 9. Like many long takes, it incorporates a variety of shots, some of which would be quite hard to implement with standard filming techniques. One such difficult shot implements a sudden camera transition between the two actors, followed by the participant having to run to keep up with the actor named "Blue." Another hard shot is the swooping pan where the camera starts low and ends up high as the participant moves around the tree until they are behind actor "Red." This would normally be hard to execute, as it involves the camera operator moving from a crouched to a standing position while ensuring the actor is kept within frame. With LookOut, the camera angle is adjusted automatically to frame the actor, letting the camera operator focus on their own movement.
Fig. 9. Storyboard given to participants to film using LookOut during the user study.
As confirmation that the story and park setting were challenging in themselves, two of our participants commented that, if they had the option, they would split the scene into separate shots ("I would segment the scene into different shots" and "normally I would split the scene into several parts").
Procedure:
Participants were given verbal instructions providing a brief overview of the user interface and the scene they were required to film. The setting was a local park, in late afternoon through dusk. Participants were handed a copy of the storyboard and left on a bench to create the required configuration scripts on a laptop running the GUI. Figure 10 shows an example of a script created by a participant.
Fig. 10. An example of a configuration script programmed by one of the participants. They opted to use whip pans and actor-based cues to automate most of the camera's behavior changes.
Once participants declared that they were satisfied with their scripts, they were given a quick overview of how the rest of LookOut works, and were invited to start filming. As they tested their scripts, they were allowed to go back to the GUI and change aspects they thought did not work very well. For example, one participant went back and changed the speed of transitions, having realized that the "very fast" setting might miss locating the actor entirely. Within 50 minutes of the start of the study, or as soon as participants filmed a scene they were satisfied with, the filming ended, and participants were asked to take part in a short interview (10 minutes) about their experience.
Configuration Scripts and GUI:
All five participants who attempted Part 1 of the study were able to successfully use the GUI to create configuration scripts matching the storyboard. This process lasted between 20 and 25 minutes, and was carried out independently by the participants, although they were allowed to ask the experimenters questions. Participants were generally pleased with the UI's functionality. One participant commented that "programming the framing was like coding so it was simple enough," while another stated that he was happy with the UI's range of behavior possibilities: "already a lot with actor recognition and speech recognition." However, some participants did mention the need for a "zoom function or focal length change." In addition, one participant wanted a feature to track objects: "e.g. if you wanted to track a statue while walking around it." Although LookOut supports object tracking, the UI did not offer this possibility at the time, only letting them select actors. In some cases, after one or more attempts at filming the scene, participants realized that they were not happy with some of the details in their configuration scripts. In these cases, participants edited the configuration scripts using the GUI. In one case a participant realized that the duration for a timed cue was too short, so they adjusted the value. In another case they were not happy with the yaw angle of a pan, so they increased it. The adjustments took less than 5 minutes, as the changes were minor parameter settings. No issues were reported or observed with the interface. These findings confirm that the task of scripting the behavior of the LookOut controller can be completed with minimal training by novice users. The editing of the parameters after a script was tested indicates that participants were able to relate the two, and could refine the script behavior to match their needs.
Filming and Resulting Footage:
In the remaining 15-25 minutes, two of the four participants had enough time to record a long take of the scene that they were happy with. In the other two cases, there were issues with the tracking of actors that led to the scene not being adequately filmed within the prescribed time frame. This was caused by uncharacteristically bad lighting: on most film sets, there would be procedures in place to reduce the effect of strong sunlight filtering through the trees, to keep the actors consistently lit. One participant pointed out that even though the camera was not giving feedback about the filming through a GUI, she could see the physical movement of the camera and thus knew whether the camera was doing what she wanted it to do: "from the physical movement of camera it looked smooth." Participants also spoke about the convenience of the camera moving automatically, as it meant they could focus on other aspects of the filming, such as keeping up with the actors. One participant described the task of keeping the camera focused on an actor as "you can just track someone without caring about it."
Participant Comments:
The aim of the storyboard was to include several different types of shots, some of them harder to execute with traditional filming equipment. One of these shots involved the camera quickly panning between the two actors: "the whip pan was easier with the AI, it found and tracked the subject automatically. Otherwise I would have to rehearse that 3-4 times to get it correctly." By using LookOut, the participant was able to correctly capture the shot on the first take. Participants were particularly pleased with using voice as a trigger for the next action in the scene: "voice activating the cues worked very well." One participant stated that they "could see directors using that to program in actor's lines." This feature simplified the filming process for participants, with all participants who attempted Part 2 using speech triggers within their scripts.
We sought out three senior film-makers, separate from the film-makers who influenced the design of LookOut, and separate from those who did the Hands-On Evaluation (Section 7.3). They have been working as professional Directors of Photography for 9, 13, and 25 years respectively. Each of them has a mix of experience, in both scripted scenes with crew and actors, and run & gun filming for documentaries or journalism. We interviewed them separately, each time showing the same three unedited video examples, shot using the LookOut system (see Fig. 11). We asked the same pre-defined set of questions to prompt them to think aloud while watching the videos.
Fig. 11. Videos shown to senior film-makers. a) Rocky escarpment: camera operator climbing on foot and with one hand free. b) Bike ride: camera operator also riding a bike. c) Pyramids: camera operator walking backwards on stairs.
The questions are listed in the Supplementary Material, but can be broadly grouped as concerning i) the equipment and people needed to film these long takes normally (without LookOut), and ii) critiques of both the footage and current LookOut capabilities. First, to shoot such takes without LookOut, two of the film-makers have used drones, and would consider using them here, if a licensed pilot were available and the noise were not prohibitive. Two of them said they would use cranes for video A, if the budget allowed. One complained, however, that many cranes have badly placed viewfinders, resulting in them shooting blindly for long periods. For videos B and C, one said he would use a Steadicam, and the other two had specific two- or one-handed gimbals (like the one modified for LookOut) that they would try again, despite their small and awkward viewfinders. They would each need a second person at minimum, and usually more, to help with typical stabilization-only filming. Independently, they all said that if only one extra helper were available, that person would be the spotter for the operator; a spotter physically guides the operator around obstacles. Second, their views of the footage and the LookOut system were very positive, with some caveats. The two more senior ones expressed the sentiment that LookOut would have no place in a big-budget project, because the Director and DP can give verbal orders that eventually get carried out. Also, those two would need to use LookOut multiple times before they would trust its reliability, and would ideally prefer that colleagues make some films with it first. Transcribed interview quotes are in the supplemental material, and include comments such as "That would be so helpful! Especially in those run & gun situations, documentary, travel, journalism. If you're filming something that won't happen again, you can focus on the other things" and "I could be more creative once I got used to it."
LookOut has been used by the authors, (pre-Covid-19) by test subjects, and by novices who usually (but not exclusively) filmed using existing behavior scripts. A representative cross-section is shown on the supplemental videos web page. Some noteworthy examples include sports where the operator is participating, such as skateboarding, or using one hand while e.g. playing frisbee, scrambling, or cycling. For the Gnome and Plumbing-shop sequences, we filmed, as an exception, using the DaSiamRPN [58] tracker within LookOut, to cope with unusual object categories, though this required multiple takes. In contrast, the vast majority of takes using our tracker worked out on the first try.
The LookOut premise, software, and hardware each have limitations. While it would be informative to do the end-to-end evaluation under run & gun conditions, which represent the vast majority of users, those situations are rarely repeatable and are considered dangerous from an ethical experimentation perspective. That led us to use simple scripted scenarios for that evaluation. The senior film-makers are likely right that big-budget productions will be reluctant to use LookOut. The field study tested participants from our low-budget demographic of film-makers with a fixed storyboard, but an ideal comprehensive user study would focus on adventure athletes and journalists in somewhat dangerous conditions, to check real run & gun scenarios. The LookOut GUI worked better and more intuitively than expected. The detector and tracker combination, too, perform admirably across really diverse scenarios, though they were designed initially for tracking actors across occlusions in hand-held films, and are unremarkable on the standard computer vision benchmarks MOT [38] and VOT [31]. The single weakest component in the LookOut system is the detector. We have seen it confuse the tracker when the actor hides or gets too small, when there is too much motion blur, or when actors wear the same uniform. For now, better detectors are available, but not with the low latency required by the controller. LookOut is built in Python, which is not optimized for real-time, multi-threaded use. We chose it for easier comparison with other trackers and for rapid prototyping, so efficiency gains are possible. Like other appearance encodings, ours is susceptible to harsh and variable lighting (see Fig. 12), which makes the system most vulnerable at dusk or dawn, and possibly when switching between indoors and outdoors.
Fig. 12. Harsh light and lens flares can upset the detector and lead to gaps in tracking. If such a lighting change is fast enough and then long-lasting, the tracker may not adequately associate new encodings with known actors, leading to a loss of tracking.
There are potentially two improvements for the hardware. First, some users requested that LookOut also manage focus-pulling and zooming; this would require a star camera whose focal length is software-controllable in real time, and we have not yet found a suitable model. Second, we use a guide camera with a limited field of view. 360° cameras are rarely used for cinematic filming due to limited resolution, but they could function as guide cameras. New behaviors could then better "anticipate" actors that are not yet in-frame for the star camera. We will release the LookOut blueprints and downloadable system.
REFERENCES
[1] Amirsaman Ashtari, Stefan Stevšić, Tobias Nägeli, Jean-Charles Bazin, and Otmar Hilliges. 2020. Capturing Subjective First-Person View Shots with Drones for Automated Cinematography. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH) 39, 5, Article 159 (Aug. 2020), 14 pages. https://doi.org/10.1145/3378673
[2] John G. Avildsen, Irwin Winkler, and Robert Chartoff. 1976. Rocky. United Artists.
[3] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. 2016. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision. Springer, 850–865.
[4] B. Brown. 2016. Cinematography: Theory and Practice: Image Making for Cinematographers and Directors.
IEEE Transactions on Image Processing 25, 3 (2016), 1261–1274.
[7] Marc Christie, Patrick Olivier, and Jean-Marie Normand. 2008. Camera control in computer graphics. In Computer Graphics Forum, Vol. 27. 2197–2218.
[8] Peter I. Corke. 1994. Experiments in high-performance robotic visual servoing. In Experimental Robotics III, Tsuneo Yoshikawa and Fumio Miyazaki (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 193–205.
[9] Kostas Daniilidis, Christian Krauss, Michael Hansen, and Gerald Sommer. 1998. Real-time tracking of moving objects with an active camera. Real-Time Imaging.
Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on.
IEEE Transactions on Robotics and Automation 8, 3 (1992), 313–326.
[15] Takuma Funahasahi, Masafumi Tominaga, Takayuki Fujiwara, and Hiroyasu Koshimizu. 2004. Hierarchical face tracking by using PTZ camera. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on. IEEE, 427–432.
[16] Quentin Galvane, Christophe Lino, Marc Christie, Julien Fleureau, Fabien Servant, François-Louis Tariolle, and Philippe Guillotel. 2018. Directing Cinematographic Drones. ACM Trans. Graph. 37, 3, Article 34 (July 2018), 18 pages. https://doi.org/10.1145/3181975
[17] N. R. Gans, G. Hu, and W. E. Dixon. 2008. Keeping Objects in the Field of View: An Underdetermined Task Function Approach to Visual Servoing. 432–437.
[18] Mark Haigh-Hutchinson. 2009. Real Time Cameras: A Guide for Game Designers and Developers. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
[19] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. 2014. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 3 (2014), 583–596.
[20] Chong Huang, Fei Gao, Jie Pan, Zhenyu Yang, Weihao Qiu, Peng Chen, Xin Yang, Shaojie Shen, and Kwang-Ting Cheng. 2018. ACT: An Autonomous Drone Cinematography System for Action Scenes. (2018), 7039–7046.
[21] Chong Huang, Chuan-En Lin, Zhenyu Yang, Yan Kong, Peng Chen, Xin Yang, and Kwang-Ting Cheng. 2019. Learning to Film from Professional Human Motion Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Alejandro G. Iñárritu. 2014. Birdman. United States: Fox Searchlight Pictures.
[23] Paul Jaccard. 1912. The distribution of the flora in the alpine zone. 1. New Phytologist 11, 2 (1912), 37–50.
[24] Niels Joubert, Dan B Goldman, Floraine Berthouzoz, Mike Roberts, James A Landay, Pat Hanrahan, et al. 2016. Towards a drone cinematographer: Guiding quadrotor cameras using visual composition principles. arXiv preprint arXiv:1610.01691 (2016).
[25] Z. Kalal, K. Mikolajczyk, and J. Matas. 2012. Tracking-Learning-Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 7 (July 2012), 1409–1422. https://doi.org/10.1109/TPAMI.2011.239
[26] S.D. Katz. 2004. Cinematic Motion: A Workshop for Staging Scenes. Michael Wiese Productions.
[27] M. King. 2016. Process Control: A Practical Approach. Wiley.
[28] Danica Kragic and Henrik I Christensen. 2002. Survey on Visual Servoing for Manipulation. Technical Report. Computational Vision and Active Perception Laboratory.
[29] Matej Kristan, Aleš Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Čehovin Zajc, Tomas Vojir, Gustav Häger, Alan Lukežič, Abdelrahman Eldesokey, and Gustavo Fernandez. 2017. The Visual Object Tracking VOT2017 Challenge Results.
[30] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Cehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, Gustavo Fernandez, et al. 2018. The sixth Visual Object Tracking VOT2018 challenge results.
[31] Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kamarainen, Luka Čehovin Zajc, Ondrej Drbohlav, Alan Lukezic, Amanda Berg, Abdelrahman Eldesokey, Jani Kapyla, and Gustavo Fernandez. 2019. The Seventh Visual Object Tracking VOT2019 Challenge Results.
[32] Harold W Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 1-2 (1955), 83–97.
[33] Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. 2017. Computational Video Editing for Dialogue-Driven Scenes. ACM Trans. Graph. 36, 4, Article 130 (2017), 14 pages.
[34] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21–37.
[35] Chen Long, Ai Haizhou, Zhuang Zijie, and Shang Chong. 2018. Real-time Multiple People Tracking with Deeply Learned Candidate Selection and Person Re-identification. In ICME.
[36] J.V. Mascelli. 1976. The Five C's of Cinematography: Motion Picture Filming Techniques Simplified. Cine/Grafic Publications.
[37] Sam Mendes. 2019. 1917. United Kingdom: Universal Pictures.
[38] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. 2016. MOT16: A Benchmark for Multi-Object Tracking. arXiv:1603.00831 [cs] (2016).
[39] Tobias Nägeli, Lukas Meier, Alexander Domahidi, Javier Alonso-Mora, and Otmar Hilliges. 2017. Real-time Planning for Automated Multi-view Drone Cinematography. ACM Trans. Graph. 36, 4, Article 132 (July 2017), 10 pages. https://doi.org/10.1145/3072959.3073712
[40] Vashi Nedomansky. 2013. Average Shot Length of 6 Famous Directors. https://vashivisuals.com/average-shot-length-of-6-famous-directors. Accessed: 2019-08-25.
[41] Picovoice. 2019. On-device wake word detection powered by deep learning. https://github.com/picovoice/porcupine.
[42] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An Incremental Improvement. arXiv (2018).
[43] Remi Ronfard, Vineet Gandhi, and Laurent Boiron. 2015. The prose storyboard language: A tool for annotating and directing movies. arXiv preprint arXiv:1508.07593 (2015).
[44] Martin Scorsese and Irwin Winkler. 1990. Goodfellas. United States: Warner Bros.
[45] Ben Shneiderman. 2007. Creativity Support Tools: Accelerating Discovery and Innovation. Commun. ACM.
Indiana Jones and the Temple of Doom. Paramount Pictures.
[48] Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. 2017. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2805–2813.
[49] Miao Wang, Guo-Wei Yang, Shi-Min Hu, Shing-Tung Yau, and Ariel Shamir. 2019. Write-A-Video: Computational Video Montage from Themed Text. In ACM Transactions on Graphics (Proceedings SIGGRAPH-Asia), Vol. 38. Article No. 177.
[50] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. 2019. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1328–1338.
[51] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. 2017. Simple online and realtime tracking with a deep association metric. IEEE, 3645–3649.
[52] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. 2013. Online Object Tracking: A Benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[53] Ke Xie, Hao Yang, Shengqiu Huang, Dani Lischinski, Marc Christie, Kai Xu, Minglun Gong, Daniel Cohen-Or, and Hui Huang. 2018. Creating and Chaining Camera Moves for Quadrotor Videography. ACM Transactions on Graphics (Proc. SIGGRAPH) 37, 4 (2018), 88:1–88:13.
[54] Gene Youngblood and R Buckminster Fuller. 1970. Expanded Cinema. P. Dutton and Co.
[55] M. Zarudzki, H. Shin, and C. Lee. 2017. An image based visual servoing approach for multi-target tracking using an quad-tilt rotor UAV. 781–790.
[56] Xuaner Zhang, Kevin Matzen, Vivien Nguyen, Dillon Yao, You Zhang, and Ren Ng. 2019. Synthetic Defocus and Look-Ahead Autofocus for Casual Videography. ACM Trans. Graph. 38, 4, Article 30 (July 2019).
[57] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. 2016. MARS: A Video Benchmark for Large-Scale Person Re-identification. In European Conference on Computer Vision.
[58] Zheng Zhu, Qiang Wang, Li Bo, Wei Wu, Junjie Yan, and Weiming Hu. 2018. Distractor-aware Siamese Networks for Visual Object Tracking. In European Conference on Computer Vision.
[59] John G Ziegler and Nathaniel B Nichols. 1942. Optimum settings for automatic controllers. Trans. ASME 64, 11 (1942).
Supplementary Text (Videos at http://visual.cs.ucl.ac.uk/pubs/lookOut/scenes.html) for LookOut! Interactive Camera Gimbal Controller for Filming Long Takes
Contents
1 Interface for Designing Long Takes
  1.1 Scripts
  1.2 Behaviors
  1.3 Cues
2 Tracking Algorithm Pseudo-code
3 Leniency From User Radii
4 Result Videos
5 Questions Asked of Senior Film-Makers
1 Interface for Designing Long Takes
The LookOut GUI is explained here and illustrated in Fig 1. In many ways, it is integral to the LookOut system, making it a form of interactive digital assistant.
1.1 Scripts
A script is a preprogrammed sequence of camera behaviors. Ahead of filming, the user designs their long take using the LookOut GUI, typically on a laptop. They then export one or more scripts to the controller. During filming, the system carries out the user's requests and provides audio feedback about which script is being used and which behavior the system is performing. These requests come in the form of speech commands, spoken by the user (or an actor), to interrupt a script, restart it, or jump to alternate scripts. The chain of behaviors within a script is linked together by cues, which are explicit audible or visible events. The controller monitors for these events during filming, so a long take can be storyboarded and followed precisely, or it can be improvised in response to the operator's play-by-play instructions. Transitions between behaviors and between scripts are tuned to be responsive yet smooth.
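As an illustration only, a chain of behaviors linked by cues could be represented along the following lines. The field names and values are hypothetical; this is not LookOut's actual machine-readable export format.

```python
# Hypothetical sketch of a script as a chain of behaviors linked by cues.
# Field names and values are illustrative, not LookOut's actual export format.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Cue:
    kind: str                       # e.g. "speech", "landing_zone", "elapsed_time"
    params: dict = field(default_factory=dict)

@dataclass
class Behavior:
    kind: str                       # e.g. "frame_actor", "pan", "banking"
    params: dict = field(default_factory=dict)
    next_cue: Optional[Cue] = None  # event that concludes this behavior

@dataclass
class Script:
    name: str
    behaviors: list                 # executed in order, each gated by its cue

script = Script(
    name="park_take_A",
    behaviors=[
        Behavior("frame_actor", {"actor": "Red", "screen_pos": (0.3, 0.5)},
                 next_cue=Cue("speech", {"trigger": "action"})),
        Behavior("pan", {"axis": "yaw", "degrees": 30, "speed": "fast"},
                 next_cue=Cue("elapsed_time", {"seconds": 4.0})),
        Behavior("frame_actor", {"actor": "Blue", "screen_pos": (0.5, 0.5)}),
    ],
)
```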
1.2 Behaviors
Within the LookOut GUI, the continuous domain of camera motions is organized into a menu of discrete, parameterized behaviors. Examples include standing still, or panning left 30°. The existing behaviors in LookOut come from requirements gathering with two film-makers, but further behaviors can be programmed in the future, and users can already cover a broad range of use cases by chaining behaviors together.
Actor Based Framing:
In most filming scenarios, one or more humans are the focus of attention, so a shot is driven either by actor monolog/dialog, or by them performing physical movements. Unsurprisingly, a user designing a script with an actor-based behavior must first attach a specific ID to that behavior. An actor ID is mostly just a name for now; the discriminative appearance info for each actor in the pool will only be filled in on-set during the startup phase. The user then specifies how this behavior frames that particular actor. They do this by placing a dot for the actor in the desired part of screen-space, e.g. on the left of the frame (see Fig 2). Often, scenes involve multiple actors. The interface for a multi-actor behavior is essentially the same, and LookOut will later optimize, striving to satisfy all the behavior's constraints.
Not all movements the actors make on screen require the camera to move. Camera movements must be motivated [26] [4]. A motivated camera movement or pan draws the attention of the viewer. Action shots may require tight and fast camera pans to keep a subject in frame, whereas a slow indoor shot would not benefit from quick camera pans when a subject rocks side to side. For each actor's location, the user can specify an elliptical area of leniency, where actor movements are not immediately converted into camera movements. The LookOut GUI provides further behaviors for actor framing. Especially for storyboarded takes, a path behavior lets the user spell out the framing of the actor over time. For example, a camera may pan to follow where an actor gazes when searching, or may pan to look ahead in a dynamic running shot. The path is constructed in screen space from a set of dots, with the distance between each point signifying how fast the camera moves in that part of the path.
Fig. 1. Workspace GUI for script creation. Two scripts (shown with light blue "blocks") can be seen in this workspace. The left panel houses different structures for defining camera behavior. The Behaviors tab is open and displays some of the camera framing modes (in brown) available to the operator. Green "blocks" are cues, i.e. events that are being monitored, to then conclude a behavior and/or start the next one.
Fig. 2. The operator can select where an actor must be positioned on screen, with a yellow dot for location and a red ellipse for leniency (left), string together multiple points for an actor path behavior (middle), and define an area for a landing zone cue (right). The blue grid represents the star camera's image space, including aspect ratio.
Non-Actor Behaviors:
A few of the available behaviors are independent of actor framing. A panning behavior makes the camera "scan" the scene, with a specified yaw or pitch direction, speed, and range. The banking behavior rolls the camera (in response to the IMU), simulating the effect of an aircraft dipping one wing while changing direction. The UI simply exposes the options for these behaviors using drop-down menus and text-entry fields.
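For illustration, a panning behavior of this sort can be thought of as a simple generator of target camera-frame angular velocities. The class name, units, and update interface below are assumptions of this sketch, not LookOut's implementation.

```python
# Hypothetical sketch of a non-actor "pan" behavior: sweep the camera through
# a yaw range at a chosen speed, then report completion. Units and the update
# interface are assumptions for illustration only.
class PanBehavior:
    def __init__(self, axis="yaw", degrees=30.0, speed_deg_per_s=10.0):
        self.axis = axis
        self.remaining = abs(degrees)
        self.direction = 1.0 if degrees >= 0 else -1.0
        self.speed = speed_deg_per_s

    def update(self, dt):
        """Return (target_velocity_deg_per_s, done) for this control tick."""
        if self.remaining <= 0.0:
            return 0.0, True
        step = min(self.speed * dt, self.remaining)
        self.remaining -= step
        return self.direction * (step / dt), self.remaining <= 0.0

# Example: a 30 degree pan to the left at 10 deg/s, ticked at 50 Hz.
pan = PanBehavior(axis="yaw", degrees=-30.0, speed_deg_per_s=10.0)
vel, done = pan.update(dt=0.02)
```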
1.3 Cues
Camera movements are often deliberate, triggered by changes in the scene or the progression of the action. LookOut responds to these changes to initiate each successive behavior in the chain. To do this, LookOut monitors events that the user designated in the GUI as significant cues for each context. LookOut informs the user through quick audio feedback when the next cue is hit. LookOut can currently monitor for the following cues:
Actor Appearance and Disappearance:
Actors coming into and out of frame can signify a new camera behavior. This cue could signal that a new actor is to be followed, or that an informative pan is to take place. The user can specify which actor LookOut should monitor, and how sensitive LookOut is to that change.
Landing Zone:
We adapt the concept of a landing zone into a cue. This cue is triggered when the requisite actor enters a specific user-defined part of screen space (see Fig 2).
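A landing-zone check of this kind can be as simple as testing whether the tracked actor's position falls inside the user-drawn region. The axis-aligned rectangle in normalized coordinates below is an assumption of this sketch; the GUI lets the user draw the area (see Fig 2).

```python
# Hypothetical sketch of a landing-zone cue test. The zone is assumed to be an
# axis-aligned rectangle in normalized guide-camera coordinates (0..1).
def landing_zone_hit(actor_center, zone):
    """actor_center: (x, y) of the tracked actor; zone: (x0, y0, x1, y1)."""
    x, y = actor_center
    x0, y0, x1, y1 = zone
    return x0 <= x <= x1 and y0 <= y <= y1

# Example: trigger the next behavior once the actor reaches the right third.
if landing_zone_hit((0.72, 0.55), (0.66, 0.0, 1.0, 1.0)):
    pass  # advance to the next behavior in the script
```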
Elapsed Time:
Although rigid, an Elapsed Time cue also gives the user control over how long a behavior runs.
Speech:
Speech recognition is included as a cue in LookOut, paired with speech synthesis. The aim is to give LookOut basic dialog-system capabilities, analogous to the verbal instructions between a director and a camera operator. The user types trigger words into the GUI when connecting a cue to a behavior, and either the operator or an actor with pre-defined lines wears a lapel microphone. The UI rejects trigger words that are too similar to distinguish.
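The paper does not spell out how "too similar to distinguish" is measured. One plausible sketch is a string-similarity threshold over the existing trigger words, used to reject confusable additions; the difflib measure and the 0.8 threshold below are assumptions of this illustration.

```python
# Hypothetical sketch of rejecting trigger words that are too similar to ones
# already in the script. Uses stdlib difflib; the 0.8 threshold is an assumption.
import difflib

def too_similar(new_word, existing_words, threshold=0.8):
    new_word = new_word.strip().lower()
    for word in existing_words:
        ratio = difflib.SequenceMatcher(None, new_word, word.lower()).ratio()
        if ratio >= threshold:
            return True, word
    return False, None

# Example: "actions" would be rejected if "action" is already a trigger word.
rejected, clash = too_similar("actions", ["action", "cut", "follow blue"])
```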
Relative Actor Size:
This cue is useful for shots where the subject's relative size in the image frame is important to the narrative or to the on-frame action. For example, a subject may appear in the distance, but the camera is to remain agnostic to them until they appear at a large enough screen size.
2 Tracking Algorithm Pseudo-code
See Algorithm 1 for pseudo-code of the LookOut tracker's cost formulation strategy. The main explanation of the tracker is in the paper.
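For readers who prefer running code to pseudo-code, here is a compact sketch of the same cost-matrix idea, followed by Hungarian assignment [32]. The cost-function arguments and the dict-based track representation are stand-ins (assumptions), not LookOut's actual implementations.

```python
# Sketch of the cost-matrix formulation plus Hungarian assignment [32].
# iou_cost / app_cost / avg_app_cost are illustrative stand-ins; the real
# definitions are in the paper. Requires numpy and scipy.
import numpy as np
from scipy.optimize import linear_sum_assignment

def build_cost_matrix(tracks, detections, iou_cost, app_cost, avg_app_cost,
                      tau_overlap=0.7):
    """Assemble C = C_iou + C_feature + C_avgfeature as in Algorithm 1."""
    N, M = len(tracks), len(detections)
    C_iou = np.zeros((N, M))
    C_feat = np.zeros((N, M))
    C_avg = np.zeros((N, M))
    for i, t in enumerate(tracks):          # tracks assumed to be dicts here
        overlaps = 0
        for j, d in enumerate(detections):
            C_iou[i, j] = iou_cost(t, d)    # low cost = strong spatial overlap
            C_feat[i, j] = app_cost(t, d)
            C_avg[i, j] = avg_app_cost(t, d)
            if C_iou[i, j] < tau_overlap:
                overlaps += 1
        if overlaps <= 1 and not t.get("lost", False):
            # Only one detection competes spatially for this track:
            # don't rely on appearance costs for this row.
            C_feat[i, :] = 0.0
            C_avg[i, :] = 0.0
    return C_iou + C_feat + C_avg

# Detections are then assigned to tracks with the Hungarian method [32]:
# rows, cols = linear_sum_assignment(build_cost_matrix(tracks, dets, f1, f2, f3))
```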
3 Leniency From User Radii
For advanced users, q, v, and a would be made available for fine control over leniency. However, in our UI implementation, leniency curves are abstracted into a single radii pair, (𝑟𝑥, 𝑟𝑦), one per axis, which the user specifies via an ellipse in the UI. Note that this pair is normalized by the guide camera's image-space size. We calculate each axis of the finer values from 𝑟 as 𝑎 = ⟨·⟩𝑟 − ⟨·⟩, 𝑣 = ⟨·⟩(𝑟 − ⟨·⟩), and 𝑞 = 𝑟 + 0.01.
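To make the role of the radii concrete, the following is a minimal sketch under our own simplifying assumption that an actor offset strictly inside the leniency ellipse produces no camera correction; the full behavior uses the smoother q/v/a leniency curves described above.

```python
# Hypothetical sketch of the elliptical leniency test. Offsets and radii are in
# normalized guide-camera coordinates. A full implementation would use the
# q/v/a leniency curves rather than this hard inside/outside cut.
def within_leniency(dx, dy, rx, ry):
    """True if the actor's offset from its scripted position is inside the ellipse."""
    if rx <= 0.0 or ry <= 0.0:
        return False
    return (dx / rx) ** 2 + (dy / ry) ** 2 <= 1.0

def lenient_error(dx, dy, rx, ry):
    """Suppress small offsets; pass larger ones through to the PID controller."""
    return (0.0, 0.0) if within_leniency(dx, dy, rx, ry) else (dx, dy)

# Example: a small bob of the actor (0.03, 0.05) inside a (0.1, 0.1) ellipse
# produces no camera correction.
print(lenient_error(0.03, 0.05, 0.1, 0.1))  # -> (0.0, 0.0)
```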
4 Result Videos
For videos, please see the supplemental website, http://visual.cs.ucl.ac.uk/pubs/lookOut/scenes.html, which contains guide-camera footage with overlays of the LookOut system's inner processes. Every video also has associated higher-quality "star" camera footage from a Sony RX0, albeit at a reduced bit-rate for consumption via the web. The telemetry/illustrated footage is displayed at a reduced frame rate to make the data easier to ingest by eye. The available telemetry data is listed on each video's page.
Algorithm 1: Cost Matrix Formulation Pseudo-code for the LookOut tracker.
Input: A set of tracks 𝑇 = {𝑡𝑖, 𝑖 ≤ 𝑁}. A set of detections in the current frame, 𝐷 = {𝑑𝑗, 𝑗 ≤ 𝑀}. A chain of features for each track of length 𝐿𝑖, 𝐹𝑖 = {𝑓𝑘, 𝑘 ≤ 𝐿𝑖}, and a single feature 𝑓𝑑𝑗 associated with each detection.
Output: A matrix consisting of costs for each pair of detection, 𝑑𝑗, and track, 𝑡𝑖.
𝐶 = [𝑐𝑖𝑗, 𝑖 < 𝑁 and 𝑗 < 𝑀]  (overall cost matrix)
𝐶_iou = [𝑐_iou_𝑖𝑗, 𝑖 < 𝑁 and 𝑗 < 𝑀]  (IOU-based cost matrix)
𝐶_feature = [𝑐_feature_𝑖𝑗, 𝑖 < 𝑁 and 𝑗 < 𝑀]  (appearance-based cost matrix)
𝐶_avgfeature = [𝑐_avgfeature_𝑖𝑗, 𝑖 < 𝑁 and 𝑗 < 𝑀]  (average-appearance-based cost matrix)
foreach track 𝑡𝑖 ∈ 𝑇 do
    overlapCount = 0
    foreach detection 𝑑𝑗 ∈ 𝐷 do
        𝐶_iou[𝑖, 𝑗] = computeIOU(𝑡𝑖, 𝑑𝑗)
        𝐶_feature[𝑖, 𝑗] = computeAppearanceCost(𝑡𝑖, 𝑑𝑗)
        𝐶_avgfeature[𝑖, 𝑗] = computeAvgAppearanceCost(𝑡𝑖, 𝑑𝑗)
        if 𝑐_iou_𝑖𝑗 < 𝜏_overlap then
            overlapCount = overlapCount + 1
        end
    end
    if overlapCount ≤ 1 and 𝑡𝑖 is not lost or being recovered then
        /* Only one detection is competing for this track spatially. Don't rely on appearance costs. */
        foreach detection 𝑑𝑗 ∈ 𝐷 do
            𝐶_feature[𝑖, 𝑗] = 0
            𝐶_avgfeature[𝑖, 𝑗] = 0
        end
    end
end
𝐶 = 𝐶_iou + 𝐶_feature + 𝐶_avgfeature
5 Questions Asked of Senior Film-Makers
These questions were put to our three most senior film-makers, while getting them to think aloud as they critiqued Videos A, B, and C in Figure 3, all filmed using LookOut. These interviews and all experiments pre-date the Covid-19 pandemic.
• How many people would you need to shoot a scene comparable to this?
• How many specialists would you need for it?
• What is the level of skill necessary for the operators to have for this type of shot? (1-5)
• What equipment would you need to shoot a scene comparable to this?
• Would you need to change the set? As in, bolt something to the floor, drill holes, dig, etc.
• Would you have a steadicam/gimbal operator do this, or maybe have a crane?
• How much would it cost to get that equipment on site?
• How many takes / how consistent is the framing?
• How much planning goes into a shot like this?
Fig. 3. Videos shown to senior film-makers. a) Rocky escarpment: camera operator climbing on foot and with one hand free. b) Bike ride: camera operator also riding a bike. c) Pyramids: camera operator walking backwards on stairs.
To get natural reactions from the participants, we did not insist that they answer our specific questions; they did answer some of them. We made sure they at least gave their impressions about i) the equipment and people needed to film these long takes normally (without LookOut), and ii) critiques of both the footage and current LookOut capabilities. Here are responses from each of the three film-makers, in turn. These responses were written in short-hand, and we omit various stories and deviations recounted during the interview, as had been agreed in advance, because some stories are unflattering, and we felt this could help make the responses more candid.
Quotes from film-maker "W":
• To shoot a scene comparable to this? (video-A and overall): Depends on the budget. More people if you can afford it. Probably try to make do with 2: myself and a spotter, but still too rough a terrain and would need to rehearse a lot.
• (How many takes?) Just need a few more takes to find your feet and get it right; learn your movements and theirs.
• Skill? Probably prefer to get a crane operator, costs 2k per day. Similar price but less faff with a Steadicam operator, but depends on the shot.
• I'd also think about using a drone on shots like these - just need someone who's licensed. But it's noisy, so no good if you're recording the dialog. Fine to try on a bigger production, and with no [low] wind.
• (LookOut useful? Current capabilities?) Need to trust it first. Prefer if I saw other film-makers using it repeatedly. Was the same for Red cameras. Saw the same with the Steadicam - nobody wants to be first, because an expensive set is expensive because of so many people, and their time costs money.
• (How many specialists, skill level?) Hire someone on the basis of their experience or their style, not really quantifiable.
Quotes from film-maker "F":
• To shoot a scene comparable to this? (video-A): myself + spotter, but still too rough a terrain. Need to rehearse. Could use a drone, but they're loud, a hassle, and then someone else is deciding.
• (video-A, how many takes?) Maybe just not attempt it, opt for a simpler [shot], tripod.
• (video-B, special skill?) Probably not [needed] for most people; common to be riding on something driven by others.
• (video-B, would do differently?) I like the orbiting, I'd like more face. Tighter framing of the guy.
• (video-C): Gimbal, big monitor to make framing easier.
• (video-C, how many takes?) Practice first, if they move too quickly, or you trip... Be CLEAR in your direction first. DJI Ronin - majestic mode can't respond to quick movements.
• (Skill level? For all 3 videos:) If I really respected them, I'd tell them to get on with it. Vs. more novice, I'd give specific instructions. Not really detailed. Can always be a little different - so many variables.
• (What equipment?) Xion crane - but the motor on the back covers the LCD screen! And then mirrorless cameras, the screen is too small. Usually Sony A7 series, manual for zoom, autofocus, or a remote focus wheel in the other hand. But those become a nuisance when time is limited.
• (Later, after seeing LookOut and understanding it) Big thing with this, that you wouldn't need to worry about it. Use it when you can't look at the screen, and need to pay attention to other things. Low or high angles, when you can't see the viewfinder.
• (On seeing LookOut) That would be so helpful! Especially in those run & gun situations, documentary, travel, journalism. If you're filming something that won't happen again, you can focus on the other things.
• (Would you use it yourself?) Definite market for this; people could be funny about losing control [with motion control]; here purists could say it's part of the unique take.
• (Asked us if they could try it out "when we start selling them" - really?) [Well, I] can't see it in high-end commercial feature film. There, you have time on your side. ... Next time, let me know. I work with a lot of cinematographers. Lots of contacts who would like this. Should talk to DJI - probably most popular.
• (What's missing from LookOut?) Would you be able to control the zoom?
Quotes from film-maker "G":
• To shoot a scene comparable to this? (video-A): Use a gimbal camera with a spotter to lead, move differently: smaller steps to minimize up and down; or clear a path through the boulders for walking, using a digger. The path needs to be out of shot, obviously.
• (video-A, equipment): Maybe a 2-handed gimbal to reduce side-to-side; hold the gimbal forward, at stomach level.
• (video-B): Maybe a Segway or a cart driven by another person, so the operator can focus on shooting.
• (Skill for video-B?) Should have an understanding of camera work; need not be technical.
• (video-C): Definitely a spotter leading me up the steps, on my shoulder, plus a motorized single-handed gimbal, Easyrig [easyrig.se].
• (On seeing LookOut) That's amazing!
• (LookOut useful?) Does it know something about the scene? (answered him and explained the system) I probably would be comfortable with a speech command.
• (Would you use it yourself?) It wouldn't take me very long to trust it to get the shot I needed, if I got to see it working a few times. I could be more creative once I got used to it. [On big-budget projects] not many things [shots] that I can only get the first time.
• (Current capabilities?) Focus is a massive consideration, can have it pulled just right, but the dynamics of the scene change. Or auto-focus goes wrong: change focus from person A to B, or from very near to very far, while I pan up to see a mountain.
• (Other features?) Nothing right now. Is resistance pre-defined? Would like to adjust that, maybe dynamically: example, walk toward a building and then look up, so need resistance. Sony FS7 with Ronin S, film a lot in slow motion. Start at 50 or 100fps then slow down.