Multi-view data capture using edge-synchronised mobiles
Matteo Bortolon, Paul Chippendale, Stefano Messelodi, Fabio Poiesi
Fondazione Bruno Kessler, Trento, Italy
[email protected], {chippendale, messelod, poiesi}@fbk.eu

Keywords: Synchronisation, Free-Viewpoint Video, Edge Computing, Augmented Reality, ARCloud.

Abstract: Multi-view data capture permits free-viewpoint video (FVV) content creation. To this end, several users must capture video streams, calibrated in both time and pose, framing the same object/scene from different viewpoints. New-generation network architectures (e.g. 5G) promise lower latency and larger bandwidth connections supported by powerful edge computing, properties that seem ideal for reliable FVV capture. We have explored this possibility, aiming to remove the need for bespoke synchronisation hardware when capturing a scene from multiple viewpoints, making it possible through off-the-shelf mobiles. We propose a novel and scalable data capture architecture that exploits edge resources to synchronise and harvest frame captures. We have designed an edge computing unit that supervises the relaying of timing triggers to and from multiple mobiles, in addition to synchronising frame harvesting. We empirically show the benefits of our edge computing unit by analysing latencies and show the quality of 3D reconstruction outputs against an alternative and popular centralised solution based on Unity3D.
Immersive computing represents the next step in human interactions, with digital content that looks and feels as if it is physically in the same room as you. The potential for immersive digital content impacts upon entertainment, advertising, gaming, mobile telepresence and tourism (Shi et al., 2015; Jiang and Liu, 2017; Elbamby et al., 2018; Rematas et al., 2018; Park et al., 2018; Qiao et al., 2019). But for immersive computing to become mainstream, a vast amount of easy-to-create free-viewpoint video (FVV) content will be needed. The production of 3D digital objects inside a real-world space is almost considered a solved problem (Schonberger and Frahm, 2016), but doing the same for FVV is still a challenge (Richardt et al., 2016). This is mainly because FVVs require the temporal capture of dynamic objects from different, calibrated viewpoints (Guillemaut and Hilton, 2011; Mustafa and Hilton, 2017). Synchronisation across viewpoints dramatically reduces reconstruction artefacts in FVVs (Vo et al., 2016). For controlled setups, frame-level synchronisation can be achieved using shutter-synchronised cameras (Mustafa and Hilton, 2017); however, this is impractical in uncontrolled environments with conventional mobiles. Moreover, mobile pose estimation (i.e. position and orientation) with respect to a global coordinate system is necessary for content integration. This can be achieved using frame-by-frame Structure from Motion (SfM) (Schonberger and Frahm, 2016) or Simultaneous Localisation And Mapping (SLAM) (Mur-Artal et al., 2015). The latter has shown to be effective for collaborative Augmented Reality through the ARCloud (ARCore Anchors, 2019). ARCore Anchors (an AR Anchor is a rigid transformation from a local to a global coordinate system) are targeted at AR multiplayer gaming, but we experimentally verified that they are not yet ready for FVV production as-is, i.e. relative pose estimation is not sufficiently accurate for 3D reconstruction. To the best of our knowledge, a flexible and scalable solution for synchronised, calibrated data capture deployable on conventional mobiles does not yet exist.

Figure 1: Multi-view data capture for free-viewpoint video creation generates large data throughput, and requires synchronised and calibrated cameras. Our solution offloads computations from mobiles and the cloud to the edge, handling synchronisation and image processing more efficiently. Moving processing closer to the user improves performance and fosters scalability.

As FVVs require huge data transmissions, throughput is another challenge for immersive computing. Fig. 1 shows four people recording a person with their mobiles. The measured throughput in this setup was about 52 Mbps, with frames captured at 10 Hz, at a resolution of 640 x 480 pixels and encoded in JPEG. Note that only a portion of the recorded person was covered; for a more complete, 360-degree coverage, several more mobiles would be needed.
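As a sanity check on the figures above, the average per-frame payload implied by the measured throughput can be computed directly. In this sketch, only the 52 Mbps, 10 Hz and four-mobile figures come from our setup; the per-frame size is derived, not measured:

```python
# Back-of-envelope check: average JPEG payload implied by the measured
# throughput of four mobiles capturing at 10 Hz.
num_mobiles = 4
capture_rate_hz = 10
throughput_mbps = 52

bits_per_frame = throughput_mbps * 1e6 / (num_mobiles * capture_rate_hz)
print(f"~{bits_per_frame / 8 / 1024:.0f} KiB per frame")  # ~159 KiB, plausible for a 640x480 JPEG
```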
With the introduction of new wireless architectures that target high data throughput and low latency, such as Multi-access Edge Computing (MEC), device-to-device communications and network slicing in 5G networks, scalable communication mechanisms that are appropriate for the deployment of immersive content on consumer devices are key (Shi et al., 2016; Qiao et al., 2019). Hybrid cloud/fog/edge solutions will ensure that users get low-lag feedback as well as the possibility to offload computationally intensive tasks.

In this paper, we present a system that harvests data generated from multiple handheld mobiles at the edge instead of harvesting it in the cloud. This promotes scalability, lowers latency and facilitates synchronisation. Our implementation consists of a server, namely Edge Relay, that handles communications across mobiles, and a Data Manager that harvests the content captured by the mobiles. We handle synchronisation through a relay server because, as opposed to a peer-to-peer one, relay servers can effectively reduce bandwidth usage and improve connection quality (Hu et al., 2016). Although relay servers may increase network latency, peer-to-peer connections may be ineffective because, when mobiles are within different networks (e.g. different operators in 5G networks), they cannot retrieve their respective IP addresses due to Network Address Translation (NAT). Moreover, to mitigate the latency problem, we designed a latency compensation strategy, which we empirically found to be effective when network conditions are fairly stable. We developed an app that each mobile uses to capture frames and estimate its pose with respect to a global coordinate system through ARCore (ARCore, 2019). As the relative poses from ARCore are not accurate enough for FVV, we refine them using SfM (Schonberger and Frahm, 2016). To summarise, the key contributions of our system are (i) the protocols we have designed to allow users under different networks (or operators) to join the same data capture session, (ii) the integration of a latency compensation mechanism to mitigate the communication delay among devices, and (iii) the integration of these modules with a SLAM framework to estimate mobile pose in real time and to globally localise mobiles in an environment through the ARCloud. To the best of our knowledge, this is the first proof-of-concept, decentralised system for synchronised multi-view data capture usable for FVV content creation. We have carried out a thorough experimental analysis by jointly assessing latency and temporal 3D reconstruction. We have compared results against an alternative and popular centralised solution based on Unity3D (Unity3D Multiplayer Service, 2019).
Low Latency Immersive Computing:
To foster immersive interactions between multiple users in augmented spaces, low-latency computing and communications must be supported (Yahyavi and Kemme, 2013). Although mobiles are the ideal medium to deliver immersive experiences, they have finite resources for complex visual scene understanding, reasoning and graphical tasks, hence computational offloading is preferred for demanding and low-latency interactions. Fog and edge computing, soon to be mainstream thanks to 5G, will be one of the key enablers. Thankfully, not all immersive computing tasks (e.g. scene understanding, gesture recognition, volumetric reconstruction, illumination estimation, occlusion reasoning, rendering, collaborative interaction sharing) have the same time-critical nature. (Chen et al., 2016; Zhang et al., 2018) showed that scene understanding via object classification could be performed at a rate of several times per minute by outsourcing computations to the cloud. (Zhang et al., 2018) showed how an optimised image retrieval pipeline for a mobile AR application can be created by exploiting fog computing, reducing data transfer latency up to five times compared to cloud computing. (Sukhmani et al., 2018) analysed the concept of dynamic content caching for mobiles, i.e. what to cache and where, and illustrated that a dramatic performance increase could be obtained by devising appropriate task offloading strategies. (Bastug et al., 2017) showed how a pro-active content request strategy could effectively be used to predict content before it was actually requested, thus reducing immersive experience latency at the cost of increased data overhead. In the case of FVV, which is very sensitive to synchronisation issues, communications must be executed as close to the user as possible to reduce lag. Solutions to address the problem can be hardware or software based. Hardware-based solutions include timecode synchronisation with or without genlock (Kim et al., 2012; Wu et al., 2008), and the Wireless Precision Time Protocol. Hardware-based solutions are not our target as they require important modifications to the communication infrastructure. Software-based solutions are often based on the Network Time Protocol (NTP), instructing devices in a session to acquire frames at prearranged time intervals and then attempting to compensate for delays (Latimer et al., 2015). Cameras can share timers that are updated by a host camera (Wang et al., 2015). Alternatively, errors in temporal frame alignment have been addressed using spatio-temporal bundle adjustment, in an offline post-processing phase (Vo et al., 2016). However, this type of post-alignment incurs a high computational overhead, as well as adding more latency to the creation and consumption of FVV reconstructions. Although NTP approaches are simple to implement, they are unaware of situational context: the way in which clients are instructed to capture images in a session is totally disconnected from scene activity, hence they are unable to optimise acquisition rates either locally or globally, prohibiting optimisation techniques such as (Poiesi et al., 2017) that aim to save bandwidth and maximise output quality. Our solution operates online and is aimed at decentralising synchronisation supervision, and is thus more appropriate for resource-efficient, dynamic-scene capture.

Free-viewpoint video production:
Free-viewpoint (volumetric or 4D) videos can be created either through the synchronised capture of objects from different viewpoints (Guillemaut and Hilton, 2011; Mustafa and Hilton, 2017) or with Convolutional Neural Networks (CNNs) (Rematas et al., 2018) that estimate unseen content. The former strategy needs camera poses to be estimated/known for each frame, using approaches like SLAM (Zou and Tan, 2013), or by having hardware-calibrated camera networks (Mustafa and Hilton, 2017). Typically, estimated poses lead to less-accurate reconstructions (Richardt et al., 2016) when compared to calibrated setups (Mustafa and Hilton, 2017). Conversely, CNN-based strategies do not need camera poses, but instead need synthetic training data of 3D objects extracted, for example, from video games or YouTube videos (Rematas et al., 2018), as they need to estimate unobserved data. Traditional FVV (i.e. non-CNN) approaches can be based on shape-from-silhouette (SFS) (Guillemaut and Hilton, 2011), shape-from-photoconsistency (SFP) (Slabaugh et al., 2001), multi-view stereo (MVS) (Richardt et al., 2016) or deformable models (DM) (Huang et al., 2014). SFS methods aim to create 3D volumes (or visual hulls) from the intersections of visual cones formed by the 2D outlines (silhouettes) of objects visible from multiple views; a minimal voxel test in this spirit is sketched below. SFP methods create volumes by assigning intensity values to voxels (or volumetric pixels) based on pixel-colour consistencies across images. MVS methods create dense point clouds by merging the results of multiple depth maps computed from multiple views. DM methods try to fit known reference 3D models to visual observations, e.g. 2D silhouettes or 3D point clouds. All these methods need frame-level synchronised cameras. (Vo et al., 2016) proposed a spatio-temporal bundle adjustment algorithm to jointly calibrate and synchronise cameras. Because it is a computationally costly algorithm, it is desirable to initialise it with "good" initial camera poses and synchronised frames. Amongst these methods, MVS produces reconstructions that are geometrically more accurate than the other alternatives, albeit at a higher computational cost. Approaches like SFS and SFP are more suitable for online applications as they are fast, but their outputs have less definition.

Figure 2: Block diagram of our edge-based architecture. Mobiles are connected to the same local network (e.g. WiFi or 5G). When they perform a FVV capture, data relaying and processing is performed on the edge, ensuring low-latency and synchronised frame capture.
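The SFS voxel test referred to above can be made concrete with a small sketch. This is an illustration under assumed inputs, not the method of any cited paper: a voxel survives only if it projects inside the silhouette of every view (points are assumed to lie in front of the cameras).

```python
import numpy as np

def carve_visual_hull(voxels, projections, silhouettes):
    """Minimal shape-from-silhouette sketch.
    voxels:      (N, 3) array of voxel centres
    projections: list of 3x4 camera projection matrices
    silhouettes: list of binary HxW masks, one per view"""
    keep = np.ones(len(voxels), dtype=bool)
    hom = np.hstack([voxels, np.ones((len(voxels), 1))])   # homogeneous coordinates
    for P, mask in zip(projections, silhouettes):
        uvw = hom @ P.T
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)    # pixel coordinates
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        h, w = mask.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)   # within the image
        keep &= inside
        keep[inside] &= mask[v[inside], u[inside]].astype(bool)  # within the silhouette
    return voxels[keep]
```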
Our system carries out synchronisation and data capture at the edge, and uses cloud services for session initialisation. The edge hosts two applications: an Edge Relay and a Data Manager. The services used in the cloud are a Unity3D Match Making server and the Google Cloud Platform. Fig. 2 shows the block diagram of our system. Fig. 3 details the session setup. When a group of users want to initiate a multi-view capture, they must connect their mobiles to a wireless network via WiFi or 5G, and then open our frame-capture app.

Figure 3: Session setup procedure. A description of the steps can be found in the text.

One user, designated the session host, creates a new session in the app. This sends a request to the Match Making server (Unity3D MatchMaking, 2019) to create an acquisition session (a). This request includes the IP address of the host as seen from the Edge Relay. The Match Making server adds this session to a list of active sessions, associating it with a unique session ID. The Match Making server publicises this session ID for others to join as clients. The host device informs the Edge Relay that it is ready to accept client devices (b). Client devices see the list of active sessions from the Match Making server and choose one to join (c). When a client chooses a session, it retrieves the session ID and the host's IP address from the Match Making server, and uses this information to connect to the Edge Relay (d). The Edge Relay validates client connections by verifying that the information provided is correct. Once all clients have joined a session, the Match Making server is not used again.

Before starting the frame capture, (i) all users must map their local surroundings in 3D using the app's built-in ARCore functionality, then (ii) the host measures the communication latency between itself and the clients to inform the clients of the compensation needed to handle network delays. Latency compensation is explained in Sec. 4. 3D mapping involves capturing sparse geometric features of the environment (Cadena et al., 2016). Once mapping is complete, the host user places an AR Anchor in the mapped scene to define a common coordinate system for all devices (e). This AR Anchor is uploaded by the host to the Google Cloud Platform and then automatically downloaded by the clients through HTTPS (Belshe et al., 2015), or through the QUIC protocol on devices that support it (QUIC, 2019) (f). Finally, the capture session starts when the host presses 'start' on their app.

Snapshots, or captured frames, can be taken either periodically (Knapitsch et al., 2017) or dynamically based on scene content (Resch et al., 2015; Poiesi et al., 2017). Frame capture based on scene content is desirable because one can avoid excessive data traffic when a scene is still, and capture fast dynamics by increasing the rate when high activity is observed. However, the latter is more challenging than the former as it requires mobiles to perform on-board processing and a decentralised mechanism to reliably relay synchronisation signals. We designed our system to be suitable for dynamic frame captures, hence we have chosen to let the host mobile drive synchronisation, rather than fixing the rate beforehand on an edge or cloud server. To trigger the other mobiles to capture a frame, the host sends a snapshot (or synchronisation) trigger to the Edge Relay. Snapshot triggers instruct mobiles in a session to capture frames. The Edge Relay forwards triggers received from the host to all clients, instructing them to capture frames and generate associated meta-data (e.g. mobile pose, camera parameters). Captured frames and meta-data are momentarily buffered on the devices and then transmitted to the Data Manager asynchronously. Without loss of generality, the host uses an internal timer to take snapshots, with a countdown that expires every C = 1/F seconds, where F is the frame-capture frequency. A snapshot counter is incremented each time the countdown expires and is used as a unique identifier for each snapshot taken.
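A minimal sketch of this host-side trigger timer follows. The send_trigger callback is a placeholder for the RPC that reaches the Edge Relay; it is not the actual implementation:

```python
import threading
import time

class SnapshotTriggerTimer:
    """Host-side timer: every C = 1/F seconds it increments the snapshot
    counter and relays a trigger (carrying that counter) to all clients."""

    def __init__(self, rate_hz, send_trigger):
        self.period = 1.0 / rate_hz        # C = 1/F
        self.counter = 0                   # unique identifier of each snapshot
        self.send_trigger = send_trigger   # stand-in for the RPC to the Edge Relay
        self._stop = threading.Event()

    def run(self):
        next_deadline = time.monotonic()
        while not self._stop.is_set():
            next_deadline += self.period
            self.counter += 1
            self.send_trigger(self.counter)            # Edge Relay forwards to clients
            time.sleep(max(0.0, next_deadline - time.monotonic()))

    def stop(self):
        self._stop.set()
```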
Traditional architectures for creating multi-user experiences are based on authoritative servers (Yahyavi and Kemme, 2013), typically exploiting relay servers (Unity3D Multiplayer Service, 2019; Photon, 2019). An authoritative-server based system allows one of the participants to be both a client and the host at the same time, thus having the authority to manage the session (Unity3D HLAPI, 2019). Our Edge Relay routes session control messages from the host to the clients via UDP, to avoid delays caused by flow-control systems (e.g. TCP). The Edge Relay handles four different types of messages: Start Relaying, Connect-to-Server, Data and Disconnect Client.

The host makes a Start Relaying request to the Edge Relay to begin a session. This request carries the connection configuration (e.g. client disconnection deadline, maximum packet size, host's IP address and port number), which is used as a verification mechanism for all the clients to connect to the Edge Relay (MLAPI Configuration, 2019). The verification is performed through a cyclic redundancy check (CRC). If the host is behind NAT, clients will not be able to retrieve its IP address nor the port number, hence they will not be able to include them in the connection configuration, and hence they will not pass the verification stage. To mitigate this NAT-related issue, we require the Edge Relay to communicate to the host the IP address and port number with which the Edge Relay sees the host. This IP address can be the actual IP of the host if the Edge Relay and host are within the same local network, or the IP address of the router (NAT) if host and Edge Relay are on different networks. After the host receives this information from the Edge Relay, it communicates its IP address and port number to the Match Making server. In this way, when the clients discover the session ID from the Match Making server, they can retrieve the host's IP address and port number, and use them for verification to connect to the Edge Relay.

When a client decides to join a session, it sends a Connect-to-Server request message to the Edge Relay. This message contains the IP address and port of the host, which the client retrieved from the Match Making server. The Edge Relay checks whether the requested session associated to this IP address and port is already hosted by a mobile. If it is, the Edge Relay adds this client to the list of session participants.

Data messages carry information from one device to another in a session. When a data packet is received by the Edge Relay, it inspects the header to understand where the packet must be forwarded: either to specific devices or broadcast to all. We use a data messaging system that involves two types of messages: State Update packages and Remote Procedure Calls (RPCs) (MLAPI Messaging System, 2019). State Update packages are used to update elements in the session and to propagate the information to all participants, i.e. the AR Anchor, while RPCs are used for control commands, i.e. synchronisation triggers and the latency estimation mechanism. The RPC of the synchronisation trigger carries the information of the snapshot counter (Sec. 3).

A Disconnect Client message is exchanged when a user exits the session. This message can be sent by the client or by the Edge Relay. The Edge Relay detects the exit of a client if a Keep-alive packet is not received within a timeout. We set the Keep-alive time at 100 ms and the disconnect timeout at 1.6 s. Upon disconnection of a client, the Edge Relay informs all the other participants and removes this device from the list of participants of the session.
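The message handling described above could look as follows. This is a simplified single-session sketch: the one-byte message types, port and packet layout are illustrative only, not the MLAPI wire format.

```python
import socket
import time

KEEPALIVE_TIMEOUT = 1.6   # disconnect deadline from the text, in seconds

def relay_loop(port=9000):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    sock.settimeout(0.1)
    host, clients, last_seen = None, set(), {}
    while True:
        try:
            packet, addr = sock.recvfrom(2048)
        except socket.timeout:
            packet, addr = None, None
        now = time.monotonic()
        if packet:
            last_seen[addr] = now
            kind = packet[:1]                     # illustrative 1-byte message type
            if kind == b"S":                      # Start Relaying (from the host)
                host = addr
            elif kind == b"C" and host:           # Connect-to-Server (from a client)
                clients.add(addr)
            elif kind == b"D":                    # Data: forward to every other peer
                for peer in ({host} | clients) - {addr, None}:
                    sock.sendto(packet, peer)
        # Disconnect Client when the keep-alive deadline expires
        for c in [c for c in clients if now - last_seen[c] > KEEPALIVE_TIMEOUT]:
            clients.discard(c)
            for peer in ({host} | clients) - {None}:
                sock.sendto(b"X", peer)           # notify the remaining participants
```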
To deal with latency variation and large throughput, we have implemented two optimisation strategies.

The latency between client/host and the Edge Relay can vary due to the distance between devices and the antenna, network traffic, or interference with other networks (Soret et al., 2014). A high latency can negatively affect the geometric quality of the reconstructed object, so it must be understood and compensated for (Vo et al., 2016). To cope with network latency issues, we have implemented a latency compensation mechanism that uses Round Trip Time measurements on the communication link between host and clients. We model the latency measured between devices to delay the capture of a frame upon reception of a synchronisation trigger, for each device independently. This enables the devices to capture frames (nearly) synchronously. Specifically, during the initialisation phase, the host builds an N x M matrix P, where the element p(i, j) is the j-th measured Round Trip Time (i.e. ping) between the host and the i-th client, N is the number of clients and M is the total number of ping measurements. The host computes the average ping for each client, as p̄(i) = (1/M) Σ_{j=1}^{M} p(i, j), and extracts the maximum ping as p̂ = max({p̄(1), ..., p̄(N)}). Then the i-th client captures the snapshot with a delay of Δt_i = (p̂ - p̄(i))/2 ms, whereas the host captures the snapshot with a delay of p̂/2 ms.

Each time a client receives a synchronisation trigger, it captures a frame and the associated meta-data, i.e. pose (with respect to the global coordinate system), camera intrinsic parameters (i.e. focal length, principal point), device identifier and snapshot counter. To effectively handle frame captures when synchronisation triggers are received, we use two threads on the mobiles. Triggers are managed by the main thread, which uses a scheduler to guarantee that a frame is captured when the calculated delay Δt_i expires. Each captured frame is passed to a second thread that encodes it into a chosen format (e.g. JPEG) and enqueues it for transmission to the Data Manager. If there is bandwidth available over the communication channel, frames are transmitted immediately, otherwise they are buffered. In the next section we explain how the Data Manager processes the received frames.
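The compensation above reduces to a few lines; in this sketch the halving assumes the one-way delay is half the measured RTT, so host and clients all capture p̂/2 ms after the trigger is sent:

```python
import statistics

def compensation_delays(P):
    """P[i][j] is the j-th RTT (ping, in ms) measured between the host and
    client i. Returns the host delay and the per-client delays Δt_i."""
    p_bar = [statistics.mean(row) for row in P]       # average ping per client
    p_hat = max(p_bar)                                # slowest client dominates
    client_delays = [0.5 * (p_hat - p) for p in p_bar]
    host_delay = 0.5 * p_hat
    return host_delay, client_delays

# Example: two clients with average pings of 42 ms and 80 ms.
host_d, client_d = compensation_delays([[40, 44, 42], [80, 78, 82]])
print(host_d, client_d)  # 40.0 [19.0, 0.0]: all devices capture ~40 ms after the trigger
```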
The Data Manager is an application that resides on the edge unit and functions independently from the Edge Relay (Fig. 2). The communication between the Data Manager and the participants is based on HTTP requests (Berners-Lee et al., 1996). Fig. 4 shows the architecture of our Data Manager. The Data Manager's operation consists of three phases: Stream Initialisation, Frame Transmission and Stream End.

Figure 4: Data manager architecture.

During the Stream Initialisation phase, the host sends an initial HTTP request to the Data Manager containing information about the number of devices that it should expect to receive data from within a session. When this request is received, the Data Manager creates a unique Stream ID for the session, which is sent to the host and clients. Then, the Data Manager initialises a new thread to perform decoding and merging operations (explained later).
Frame Transmission occurs when the criteria for taking a snapshot have been met (Sec. 5). In particular, a client, after it receives the RPC and compensates for the latency, sends the requested frame along with the meta-data to the Data Manager through a HTTP request. We measured that the Data Manager can process a HTTP request in about 200 to 300 ms. To optimise the HTTP-request ingestion rate on the Data Manager, each mobile creates multiple simultaneous HTTP requests that are processed in parallel by Request Handling Threads. We create as many Request Handling Threads as the number of CPU cores available. We measured that a mobile can process up to 12 simultaneous requests with negligible computational time and that the Data Manager can handle up to 100 requests. Therefore we created a policy where, if N is the number of mobiles connected, each mobile can create up to r = min{⌊100/N⌋, 12} HTTP requests, where ⌊·⌋ denotes rounding down to the nearest integer. Each Request Handling Thread processes each HTTP request and pushes it into a synchronised queue, which in turn feeds a decoder in charge of converting frames into a single format (e.g. JPG, PNG). We measured that this operation can handle up to four mobiles transmitting at 20 fps in real time. Lastly, we use a merging operation to re-organise data based on their snapshot counter. If the merging operation detects that, for a given snapshot trigger, the number of frames received is not the same as the number of mobiles connected, the received frames are labelled as partial when stored in the database, so that the FVV reconstruction algorithm can handle them accordingly.

A Stream End occurs when the host ends a capture session. In addition to stopping frame acquisition, the host also sends a request to terminate the session to the Data Manager, which in turn waits until the last acquired snapshots have been received before terminating the opened threads.
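A sketch of the request policy and of the partial-frame labelling follows; the function and variable names are ours, not the actual implementation:

```python
MAX_REQUESTS_PER_MOBILE = 12    # measured limit on a mobile
MAX_REQUESTS_MANAGER = 100      # measured limit on the Data Manager

def requests_per_mobile(n_mobiles):
    """r = min{floor(100 / N), 12}, as reconstructed above."""
    return min(MAX_REQUESTS_MANAGER // n_mobiles, MAX_REQUESTS_PER_MOBILE)

def merge(snapshots, n_mobiles):
    """Group received frames by snapshot counter and flag partial captures.
    `snapshots` maps a snapshot counter to a list of (device_id, frame)."""
    return {counter: {"frames": frames, "partial": len(frames) != n_mobiles}
            for counter, frames in snapshots.items()}

print(requests_per_mobile(4))    # -> 12 (mobile limit dominates)
print(requests_per_mobile(10))   # -> 10 (Data Manager limit dominates)
```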
Motivation:
Evaluating multi-view data capture quantitatively is challenging because both pose estimation and synchronisation should be assessed. A possibility could be to create a FVV using the captured frames and assess the output quality. However, FVV ground truth is difficult to obtain, especially when the object being reconstructed is non-rigid. (Mustafa and Hilton, 2017; Richardt et al., 2016) mainly evaluated their FVV outputs qualitatively, and selected sub-modules for the quantitative assessment. Based on a similar idea, we have performed a qualitative analysis consisting of a live recording using handheld mobiles. We used four mobiles (two Huawei P20 Pro, one OnePlus Five and one Samsung S9) simultaneously observing a moving person, and we reconstructed this person using a popular SfM technique, i.e. COLMAP (Schonberger and Frahm, 2016). Then, we quantified the performance of our system under controlled conditions by evaluating the reconstruction accuracy (3D triangulation error) of a rotating texture-friendly object. Lastly, to explicitly determine the end-to-end time difference of the acquired frames, we performed a two-view frame capture of a stopwatch displayed on an iPad screen.
Implementation:
Our Edge Relay is based on MLAPI (MLAPI, 2019) and is developed in C#. Frames are captured at a resolution of 640 x 480 pixels.
3D Reconstruction assessment:
We placed a reference object on an adjustable angular-velocity turntable, and rotated it 270 degrees clockwise and then 270 degrees anticlockwise. We 3D-reconstructed the reference object over time from images captured by two Huawei P20 Pro positioned on tripods: vertically at the same height, horizontally 20 cm from each other, and 40 cm from the rotating object. We used tripod mounts to reduce pose-estimation errors, as we wanted to quantify reconstruction errors brought about by network lag and synchronisation effects. Pairs of frames with the same snapshot counter are fed into COLMAP. The 3D-reconstruction algorithm processed all the pairs captured in the experiment. Fig. 5a shows the object from the left-hand camera; the red bounding box highlights the region of interest we have used for our analysis of keypoints/3D points. Fig. 5b shows the view from the right-hand camera with the keypoints highlighted in green, and the keypoints that have been 3D triangulated shown in blue.

Figure 5: Experimental setup: (a) Left-hand camera: the red box shows the region of interest for our analysis. (b) Right-hand camera: green points show the keypoints extracted and blue points show the keypoints that have been 3D triangulated.

Latencies have been compared to those obtained using the Unity3D Relay (Unity3D Multiplayer Service, 2019). We also assessed the quality of 3D reconstructions over time by comparing the volumetric models of the reference object under different network latencies with respect to a ground truth, which was created by reconstructing the reference object in 3D, frame-by-frame, with one degree of separation between frames, for a total of 270 degrees. This created two sequences of 270 aligned frames, one sequence for each camera. By picking one frame from the first sequence and then another frame from the second sequence captured with a different pose, we could simulate different angular velocities and different frame-capture rates. For example, say we wanted to simulate the reference object rotating at 50 deg/s, captured at F = 10 Hz. This corresponds to a 100 ms interval between frames, corresponding to an object rotation of 5 deg. We can pick frame t from the first sequence and frame t + 5 from the second. We simulated angular velocities between 50 deg/s and 100 deg/s with steps of 10 deg/s, and modelled snapshots that were captured with a frequency of F = 10 Hz. We then modelled the latency between the mobile and the Edge Relay by adding normally distributed delays, i.e. N(µ, σ), where µ is the mean and σ is the standard deviation that we measured on our experimental WiFi network. In order to use realistic latency estimates, we recreated latency variation conditions, ranging from 0 ms to 150 ms with a step of 30 ms, by injecting delays into the network using NetEm (NetEm, 2019). The mobiles were connected via a WiFi 2.4 GHz network. We used an off-the-shelf WiFi access point (Thomson TG587nV2). Due to interference caused by neighbouring networks (a typical scenario nowadays), we observed the typical Round Trip Times (RTT) shown in Tab. 1.

Table 1: Round Trip Time between mobile and Edge Relay measured on our WiFi network. 'Set' are the latencies set with NetEm (NetEm, 2019), and 'Meas' are mean ± standard deviation calculated over 40 Round Trip Time measurements.

Set (ms)    0        30       60       90        120       150
Meas (ms)   14 ± 11  50 ± 14  80 ± 31  108 ± 26  141 ± 26  167 ±

We used the data in Tab. 1 to quantify the triangulation error between ground-truth 3D points and 3D points triangulated under simulated delays. We calculated the triangulation error using a grid composed of 16 x 16 cells.
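One plausible reading of this simulation protocol, as a sketch: frame indices equal the object rotation in degrees, so a delay sampled for the second camera maps directly to an index offset into its sequence.

```python
import random

def simulate_snapshot_pairs(omega_deg_s, f_hz, rtt_mu_ms, rtt_sigma_ms, n_frames=270):
    """Generate (i, j) frame-index pairs: camera 1 captures on-trigger,
    camera 2 captures after a normally distributed delay N(mu, sigma)."""
    pairs, angle = [], 0.0
    while angle < n_frames:
        delay_s = max(0.0, random.gauss(rtt_mu_ms, rtt_sigma_ms)) / 1000.0
        i = round(angle)                           # first camera, on-trigger
        j = round(angle + omega_deg_s * delay_s)   # second camera, delayed
        if j >= n_frames:
            break
        pairs.append((i, j))
        angle += omega_deg_s / f_hz                # rotation between snapshots
    return pairs

# e.g. 50 deg/s at 10 Hz with the RTT statistics of the 90 ms setting in Tab. 1
print(simulate_snapshot_pairs(50, 10, rtt_mu_ms=108, rtt_sigma_ms=26)[:5])
```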
End-to-end delay assessment:
We used two mobiles, configured as in the 3D reconstruction case, to capture the time ticked by a stopwatch (up to millisecond precision). We extracted the time information from each pair of frames using OCR (Amazon Textract, 2019) and computed their time difference. We performed this experiment with delay compensation activated. The configurations tested are with our Edge Relay operating locally (i.e. at the edge) and in the cloud. For the latter, we deployed our Edge Relay on Amazon Web Services (AWS), and connected the mobiles to the cloud through a high-quality optical-fibre-based internet connection and through 4G, to resemble a real capture scenario.
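For illustration, once the stopwatch digits have been OCRed (we used Amazon Textract), the per-snapshot delay statistics reduce to the following sketch; the m:ss.mmm stopwatch format is an assumption.

```python
import re
import statistics

STOPWATCH = re.compile(r"(\d+):(\d{2})\.(\d{3})")   # m:ss.mmm, assumed format

def to_ms(text):
    """Convert an OCRed stopwatch reading to milliseconds."""
    m, s, ms = map(int, STOPWATCH.search(text).groups())
    return (m * 60 + s) * 1000 + ms

def end_to_end_delay(ocr_pairs):
    """ocr_pairs: list of (host_text, client_text), one pair per snapshot."""
    diffs = [abs(to_ms(a) - to_ms(b)) for a, b in ocr_pairs]
    return statistics.mean(diffs), statistics.stdev(diffs)

print(end_to_end_delay([("0:12.345", "0:12.381"), ("0:13.345", "0:13.370")]))
```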
Relay latency:
We assessed synchronisation trigger delays by measuring the latency between the host and clients in the case of our Edge Relay and of the Unity3D Relay. We performed these measurements with a dedicated feature integrated into the app that produces 250 RTT measurements, and then calculated the average measured RTT. The Unity3D Relay is cloud based and can be located anywhere around the globe; based on a user's location, the closest is usually queried (Unity3D Network Manager, 2019). In the case of our Edge Relay, we measured the host-client RTT and obtained an average of 66 ms.

We then performed various reconstructions of the reference object, spinning on the turntable at a constant angular velocity ω, by varying the latency between the mobiles. This analysis is performed without using ground-truth information. In one experiment, the object performed two spins: the first spin was 720 degrees counterclockwise, the second spin was 720 degrees clockwise. We conducted six experiments in total, where we injected an additional 30 ms of communication delay into the WiFi network (using NetEm (NetEm, 2019)) between the two mobiles at each experiment. Note that if synchronisation triggers are delivered with a delay, the object will appear with a different pose in the two camera frames. This affects the computation of the 3D points: matched keypoints will correspond to different 3D locations, and there will be parts of the object that might also be occluded. In this first experiment, we could not accurately measure the triangulation error against the ground truth because we did not have direct access to the angular state of the turntable during a spin. Therefore, to make 3D reconstructions comparable for each trigger and for each experiment, we quantified the output as the ratio between the number of 3D points reconstructed and the keypoints visible within the region of interest. We compared cases with latency compensation disabled and then enabled.

Figure 6: Instability of the 3D reconstruction when the latency between two mobiles increases up to 150 ms. The 3D point ratio is defined as the ratio between the number of 3D points and the number of keypoints counted inside the region of interest.

Fig. 6 shows the variation of the 3D point ratio over time in these two cases. We refer to the left-hand mobile as A and the right-hand one as B. The first 100 frames correspond to 720 degrees of counterclockwise spin. The eight peaks correlate to a face of the box pointing towards both mobiles. When latency compensation is disabled, mobile A does not delay the frame capture by p̂/2 ms (Sec. 5) when it sends a synchronisation trigger. As the induced latency increases, mobile B receives its trigger later and later. During this time, the object will have rotated a few degrees in the same direction as mobile B, resulting in a more favourable viewpoint for keypoint matching and 3D triangulation (i.e. it is almost seeing an identical view as mobile A). Hence, the 3D point ratio is seen to increase as the induced latency increases when the rotation is counterclockwise. However, the computed 3D points will not be calculated correctly and they will also have inaccurate coordinates (we show quantitative evidence of this in the next section). Vice versa, when the rotation is clockwise, mobile A is more likely to capture frames of object regions that are occluded from mobile B's viewpoint, as they will have already rotated out of view. When our latency compensation algorithm is enabled, 3D reconstructions become symmetric in both spinning directions, illustrating its effectiveness.
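The metric of Fig. 6 in code form (a trivial sketch; the names are ours):

```python
def point_ratio(num_points3d, num_keypoints):
    """3D point ratio: triangulated 3D points over detected keypoints,
    both counted inside the region of interest."""
    return num_points3d / max(1, num_keypoints)
```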
3D reconstruction analysis:
We quantify the reconstruction accuracy using simulated latency (i.e. against the ground truth). Fig. 7 and 8 show the 3D triangulation error in the cases of counterclockwise and clockwise spin, respectively. From these graphs we see that the triangulation error increases as the simulated latency increases. This occurs because, after keypoints are matched across the two image planes and triangulated in 3D (with an initially-guessed projection matrix), the Bundle Adjustment algorithm in the SfM pipeline tries to optimise the parameters of the projection matrix by minimising the re-projection (3D to 2D) error (Schonberger and Frahm, 2016). Hence, an erroneous object pose is captured, affecting the estimation of the extrinsic and intrinsic parameters (e.g. providing different focal-length estimates) and the estimation of the 3D points (which are highly likely to be placed somewhere in between the real 3D positions of the keypoints of the two frames).

Figure 7: 3D triangulation error in the case of a counterclockwise-spinning object (ω = 50 to 100 deg/s). The error is computed relative to the distance between the centre of mass of the two cameras and the object, which is 40 cm. The camera baseline is fixed at 20 cm.

Figure 8: 3D triangulation error in the case of a clockwise-spinning object (ω = 50 to 100 deg/s). The error is computed relative to the distance between the centre of mass of the two cameras and the object, which is 40 cm. The camera baseline is fixed at 20 cm.

Error rates in the two cases differ due to the same phenomena illustrated in Fig. 6. Because synchronisation triggers are generated from the left-hand mobile, which takes snapshots upon generation, the right-hand mobile only takes snapshots when triggers are received. When the object spins clockwise, the larger the delay, the more often the object appears with self-occluded parts in the two cameras, as it will have rotated more.
End-to-end delay:
Tab. 2 shows the end-to-end delay between mobiles, measured as the difference between the times captured by two mobiles through OCR under different communication configurations with the Edge Relay. As expected, the experiments show that when the Edge Relay is deployed at the edge we can capture frames with the lowest latency. When the Edge Relay is deployed in the cloud, even through a highly reliable optical-fibre connection, there is a worsening in performance due to the extra communication link to AWS.

Table 2: End-to-end delay (ms) between mobiles, automatically measured by capturing the time ticked by a stopwatch. Case studies when the Edge Relay was deployed in the cloud and at the edge (i.e. locally). The connection to the cloud was carried out through 4G and a high-quality optical fibre.

             Cloud (4G)   Cloud (optical fibre)   Edge
Delay (ms)   36.– ± –     –.– ± –                 –.– ± –
Qualitative analysis:
We qualitatively analyse the performance of our approach by comparing the 3D reconstruction over time of a moving person using the standard Unity3D Relay (Unity3D Multiplayer Service, 2019) and our Edge Relay. We used the AR Anchor to estimate the scale of the point cloud in metric units. Fig. 9 shows the dense point clouds at two instants of time, where (1a, 2a) are the outputs with the Unity3D Relay and (1b, 2b) with our Edge Relay. We can see that when the object is still (Case 1) the results (i.e. density of triangulated 3D points) using the Unity3D Relay and the Edge Relay are comparable. Whereas, when the object moves (Case 2), synchronisation is key to achieving accurate 3D triangulation, and using the Unity3D Relay leads to sparser reconstructions.

Figure 9: Examples of a moving person reconstructed using (1a, 2a) the Unity3D Relay (Unity3D Multiplayer Service, 2019) and (1b, 2b) our Edge Relay. Case 1: when the object is still, the results (i.e. density of triangulated 3D points) using the Unity3D Relay and the Edge Relay are comparable. Case 2: when the object moves, synchronisation is key to achieving accurate 3D triangulation, and using the Unity3D Relay leads to sparser reconstructions.

We quantified the 3D triangulation accuracy by calculating the average reprojection error after Bundle Adjustment. We measured 0.– ± 0.04 pixels using the Unity3D Relay and 0.– ± 0.02 pixels using our Edge Relay. This result shows that we could achieve a more accurate reconstruction using the Edge Relay. During these experiments, we also monitored the percentage of frames successfully received by the Data Manager from all the mobiles. Given a snapshot trigger, an instant of time that is captured only by a subset of mobiles leads to a partial capture, as only the frames of those mobiles that received the trigger and performed the capture will be transmitted to the Data Manager. We name these frames "partial frames" and, ideally, we would like to achieve zero percent partial frames. The percentage of partial frames we measured using the Unity3D Relay was 41%, whereas with our Edge Relay it was 24%. Frames are lost because snapshot triggers are transmitted through UDP, which does not acknowledge whether packets are received and does not re-send packets in the case of failed reception (unlike TCP). However, UDP is necessary to guarantee the timely delivery of packets. A video of our qualitative analysis can be found at https://youtu.be/znoJmovdCgs. The video illustrates the effect of losing 3D points and frames with the Unity3D Relay.
We proposed a system to move data relaying from the cloud to the edge, showing that this is key to making frame-capture synchronisation more reliable than cloud-based solutions and to enabling number-of-users scalability. Our implementation consists of an Edge Relay, which handles the snapshot triggers used for capturing images for FVV production, and a Data Manager, which receives captured frames via HTTP requests. Synchronisation triggers are generated by a host, rather than by a system timer, to enable a motion-based, adaptive sampling rate, fostering reduced data throughput. Although the creation of high-quality FVVs was not the scope of this work, we succeeded in showing the benefit of our decentralised data capturing system using a state-of-the-art 3D reconstruction algorithm (i.e. COLMAP) and by implementing the assessment of end-to-end capture delays through OCR.

Future research directions include the integration of a volumetric 4D reconstruction algorithm that can be executed in real time on the edge to provide telepresence functionality, together with the integration of temporal filtering of 3D reconstructed points to provide more stable volumetric videos. We also aim to improve reconstruction accuracy by post-processing ARCore's pose estimates. By the end of this year we will deploy our system on a 5G network and carry out the first FVV production in uncontrolled environments using off-the-shelf mobiles.
REFERENCES
Amazon Textract (accessed: Nov 2019). https://aws.amazon.com/textract/.
ARCore (accessed: Nov 2019). https://developers.google.com/ar.
ARCore Anchors (accessed: Nov 2019). https://developers.google.com/ar/develop/developer-guides/anchors.
Bastug, E., Bennis, M., Medard, M., and Debbah, M. (2017). Toward Interconnected Virtual Reality: Opportunities, Challenges, and Enablers. IEEE Communications Magazine, 55(6):110–117.
Belshe, M., Peon, R., and Thomson, M. (2015). Hypertext Transfer Protocol Version 2 (HTTP/2). RFC 7540.
Berners-Lee, T., Fielding, R., and Nielsen, H. (1996). Hypertext Transfer Protocol – HTTP/1.0. RFC 1945.
Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., and Leonard, J. (2016). Past, Present, and Future of Simultaneous Localization And Mapping: Towards the Robust-Perception Age. IEEE Trans. on Robotics, 32(6):1309–1332.
Chen, Y.-H., Balakrishnan, H., Ravindranath, L., and Bahl, P. (2016). GLIMPSE: Continuous, Real-Time Object Recognition on Mobile Devices. GetMobile: Mobile Computing and Communications, 20(1):26–29.
Elbamby, M., Perfecto, C., Bennis, M., and Doppler, K. (2018). Toward Low-Latency and Ultra-Reliable Virtual Reality. IEEE Network, 32(2):78–84.
Guillemaut, J.-Y. and Hilton, A. (2011). Joint Multi-Layer Segmentation and Reconstruction for Free-Viewpoint Video Applications. International Journal on Computer Vision, 93(1):73–100.
Hu, Y., Niu, D., and Li, Z. (2016). A Geometric Approach to Server Selection for Interactive Video Streaming. IEEE Trans. on Multimedia, 18(5):840–851.
Huang, C.-H., Boyer, E., Navab, N., and Ilic, S. (2014). Human Shape and Pose Tracking Using Keyframes. In Proc. of IEEE Computer Vision and Pattern Recognition, Columbus, US.
Jiang, D. and Liu, G. (2017). An Overview of 5G Requirements. In Xiang, W., Zheng, K., and Shen, X., editors, 5G Mobile Communications. Springer.
Kim, H., Guillemaut, J.-Y., Takai, T., Sarim, M., and Hilton, A. (2012). Outdoor Dynamic 3D Scene Reconstruction. IEEE Trans. on Circuits and Systems for Video Technology, 22(11):1611–1622.
Knapitsch, A., Park, J., Zhou, Q.-Y., and Koltun, V. (2017). Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM Transactions on Graphics, 36(4).
Latimer, R., Holloway, J., Veeraraghavan, A., and Sabharwal, A. (2015). SocialSync: Sub-Frame Synchronization in a Smartphone Camera Network. In Proc. of European Conference on Computer Vision Workshops, Zurich, CH.
MLAPI (accessed: Nov 2019). https://midlevel.github.io/MLAPI.
MLAPI Configuration (accessed: Nov 2019). https://github.com/MidLevel/MLAPI.Relay.
MLAPI Messaging System (accessed: Nov 2019). https://mlapi.network/wiki/ways-to-syncronize/.
Mur-Artal, R., Montiel, J., and Tardós, J. (2015). ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. on Robotics, 31(5):1147–1163.
Mustafa, A. and Hilton, A. (2017). Semantically Coherent Co-Segmentation and Reconstruction of Dynamic Scenes. In Proc. of IEEE Computer Vision and Pattern Recognition, Honolulu, US.
NetEm (accessed: Nov 2019). http://man7.org/linux/man-pages/man8/tc-netem.8.html.
Park, J., Chou, P., and Hwang, J.-N. (2018). Volumetric Media Streaming for Augmented Reality. In Proc. of IEEE Global Communications Conference, Abu Dhabi, United Arab Emirates.
Photon (accessed: Nov 2019).
Poiesi, F., Locher, A., Chippendale, P., Nocerino, E., Remondino, F., and Gool, L. V. (2017). Cloud-based Collaborative 3D Reconstruction Using Smartphones. In Proc. of European Conference on Visual Media Production.
Qiao, X., Ren, P., Dustdar, S., Liu, L., Ma, H., and Chen, J. (2019). Web AR: A Promising Future for Mobile Augmented Reality – State of the Art, Challenges, and Insights. Proceedings of the IEEE (Early Access).
QUIC (accessed: Nov 2019).
Rematas, K., Kemelmacher-Shlizerman, I., Curless, B., and Seitz, S. (2018). Soccer on Your Tabletop. In Proc. of IEEE Computer Vision and Pattern Recognition, Salt Lake City, US.
Resch, B., Lensch, H. P. A., Wang, O., Pollefeys, M., and Sorkine-Hornung, A. (2015). Scalable Structure from Motion for Densely Sampled Videos. In Proc. of IEEE Computer Vision and Pattern Recognition, Boston, US.
Richardt, C., Kim, H., Valgaerts, L., and Theobalt, C. (2016). Dense Wide-Baseline Scene Flow From Two Handheld Video Cameras. In Proc. of 3D Vision, Stanford, US.
Schonberger, J. and Frahm, J.-M. (2016). Structure-from-Motion Revisited. In Proc. of IEEE Computer Vision and Pattern Recognition, Las Vegas, US.
Shi, W., Cao, J., Zhang, Q., Li, Y., and Xu, L. (2016). Edge Computing: Vision and Challenges. IEEE Internet of Things Journal, 3(5):637–646.
Shi, Z., Wang, H., Wei, W., Zheng, X., Zhao, M., and Zhao, J. (2015). A Novel Individual Location Recommendation System Based on Mobile Augmented Reality. In Proc. of IEEE Identification, Information, and Knowledge in the Internet of Things, Beijing, CN.
Slabaugh, G., Culbertson, B., Malzbender, T., and Schafer, R. (2001). A Survey of Methods for Volumetric Scene Reconstruction from Photographs. In International Workshop on Volume Graphics, New York, US.
Soret, B., Mogensen, P., Pedersen, K., and Aguayo-Torres, M. (2014). Fundamental Tradeoffs Among Reliability, Latency and Throughput in Cellular Networks. In Proc. of IEEE Globecom Workshops, Austin, US.
Sukhmani, S., Sadeghi, M., Erol-Kantarci, M., and Saddik, A. E. (2018). Edge Caching and Computing in 5G for Mobile AR/VR and Tactile Internet. IEEE Multimedia (Early Access).
Unity3D HLAPI (accessed: Nov 2019). https://docs.unity3d.com/Manual/UNetUsingHLAPI.html.
Unity3D MatchMaking (accessed: Nov 2019). https://docs.unity3d.com/520/Documentation/Manual/UNetMatchMaker.html.
Unity3D Multiplayer Service (accessed: Nov 2019). https://unity3d.com/unity/features/multiplayer.
Unity3D Network Manager (accessed: Nov 2019). https://docs.unity3d.com/ScriptReference/Networking.NetworkManager-matchHost.html.
Vo, M., Narasimhan, S., and Sheikh, Y. (2016). Spatiotemporal Bundle Adjustment for Dynamic 3D Reconstruction. In Proc. of IEEE Computer Vision and Pattern Recognition, Las Vegas, US.
Wang, Y., Wang, J., and Chang, S.-F. (2015). CamSwarm: Instantaneous Smartphone Camera Arrays for Collaborative Photography. arXiv:1507.01148.
Wu, W., Yang, Z., Jin, D., and Nahrstedt, K. (2008). Implementing a Distributed Tele-immersive System. In Proc. of IEEE International Symposium on Multimedia, pages 477–484.
Yahyavi, A. and Kemme, B. (2013). Peer-to-Peer Architectures for Massively Multiplayer Online Games: A Survey. ACM Comput. Surv., 46(1):1–51.
Zhang, W., Han, B., and Hui, P. (2018). Jaguar: Low Latency Mobile Augmented Reality with Flexible Tracking. In Proc. of ACM International Conference on Multimedia, Seoul, KR.
Zou, D. and Tan, P. (2013). CoSLAM: Collaborative Visual SLAM in Dynamic Environments. IEEE Trans. on Pattern Analysis and Machine Intelligence.