Targeted Adversarial Perturbations for Monocular Depth Prediction

Abstract

We study the effect of adversarial perturbations on the task of monocular depth prediction. Specifically, we explore the ability of small, imperceptible additive perturbations to selectively alter the perceived geometry of the scene. We show that such perturbations can not only globally re-scale the predicted distances from the camera, but also alter the prediction to match a different target scene. We also show that, when given semantic or instance information, perturbations can fool the network to alter the depth of specific categories or instances in the scene, and even remove them while preserving the rest of the scene. To understand the effect of targeted perturbations, we conduct experiments on state-of-the-art monocular depth prediction methods. Our experiments reveal vulnerabilities in monocular depth prediction networks, and shed light on the biases and context learned by them.


Alex Wong

Department of Computer Science, University of California, Los Angeles

Safa Cicek

Department of Electrical and Computer Engineering, University of California, Los Angeles

Stefano Soatto

Department of Computer Science, University of California, Los Angeles

Figure 1: Altering the predicted scene with adversarial perturbations. Top to bottom: input image; adversarial perturbations with upper norm of × − ; predicted scene visualized as disparity. Left to right: original image and predicted scene; overall scene altered to be 10% closer; all vehicles altered to be 10% closer; vehicle in the center of the road is removed by perturbations.

Consider the image shown in the top-left of Fig. 1, captured from a moving car. The corresponding depth of the scene, inferred by a deep neural network and visualized as disparity, is shown underneath. Can adding a small perturbation cause the perceived vehicle in front of us to disappear? Indeed, this is shown in the rightmost panel of the same figure: the perturbed image, shown on the top-right, is indistinguishable from the original. Yet the perturbation, amplified and shown in the center row, alters the depth map in a way that makes the car in front of us disappear.

Preprint. Under review. arXiv [cs.CV], Jun.

Adversarial perturbations are small signals that, when added to images, are imperceptible yet can cause the output of a deep neural network to change catastrophically [Szegedy et al., 2013]. We know that they can fool a network into mistaking a tree for a peacock [Moosavi-Dezfooli et al., 2016]. But as autonomous vehicles increasingly employ learned perception modules, mistaking a stop sign for a speed limit [Eykholt et al., 2018] or causing obstacles to disappear is not just an interesting academic exercise. We explore the possibility that small perturbations can alter not just the class label associated with an image, but the inferred depth map, for instance to make the entire scene appear closer or farther, or portions of the scene, like specific objects, to become invisible or be perceived as being elsewhere in the scene.

When semantic segmentation is available, perturbations can target a specific category in the predicted scene. Some categories (e.g. traffic lights, humans) are harder to attack than others (e.g. roads, nature). When instance segmentation is available, perturbations can manipulate individual objects, for instance make a car disappear or move it to another location. We collectively call these phenomena stereopagnosia, the solid geometric analogue of prosopagnosia [Damasio et al., 1982].

Stereopagnosia sheds light on the role of context in the representation of geometry with deep networks. When attacking a specific category or instance, while most of the perturbations are localized, some are distributed throughout the scene, far from the object of interest. Even when the target effect is localized (e.g., make a car disappear), the perturbations are non-local, indicating that the network exploits non-local context, which represents a vulnerability. Could one perturb regions in the image, for instance by displaying billboards, and thus make cars seemingly disappear?

We note that, although the adversarial perturbations we consider are not universal, that is, they are tailored to a specific scene and its corresponding image, they are somewhat robust. Blurring the image after applying the perturbations reduces, but does not eliminate, stereopagnosia. To understand the generalizability of adversarial perturbations, we examine the transferability of the perturbations between two monocular depth prediction models with different architectures and losses.

Adversarial perturbations have been studied extensively for classification (Sec. 2.1). We focus on regression, where there exists some initial work. However, we study the targeted case where the entire scene, a particular object class, or even an instance is manipulated by the choice of perturbation.

The early works [Szegedy et al., 2013, Goodfellow et al., 2014] show the existence of small, imperceptible additive noises that can alter the predictions of deep learning based classification networks. Since then, many more advanced attacks [Moosavi-Dezfooli et al., 2016] have been proposed. [Moosavi-Dezfooli et al., 2017] showed the existence of universal perturbations, i.e. constant additive perturbations that degrade the accuracy over the entire dataset.

More recently, [Naseer et al., 2019] studied the transferability of attacks across datasets and models. [Peck et al., 2017] derived lower bounds on the magnitudes of perturbations. [Najafi et al., 2019] studied attacks in the semi-supervised learning setting. [Qin et al., 2019, Tramèr and Boneh, 2019] proposed methods to enhance robustness to adversarial attacks. [Laidlaw and Feizi, 2019] extended adversarial attacks beyond small additive perturbations. [Ilyas et al., 2019] showed that the existence of adversarial attacks makes deep networks more predictive.

Despite the exponentially growing literature on adversarial attacks for the classification task, there have been only a few works extending the analysis of adversarial perturbations to dense-pixel prediction tasks. [Xie et al., 2017a] studied adversarial perturbations for detection and segmentation. [Hendrik Metzen et al., 2017] demonstrated targeted universal attacks for semantic segmentation. [Mopuri et al., 2018] examined universal perturbations in a data-free setting for segmentation and depth prediction to alter predictions in arbitrary directions. Unlike them, we study targeted attacks where the network is fooled into predicting a specific target.

Our goal is to analyze the robustness of monocular depth prediction networks to different targeted attacks to explore possible explanations of what is learned by these models. With a similar motivation, [Hu et al., 2019] identified the smallest set of image pixels from which the network can estimate a depth map with small error. Unlike them, we analyze monocular depth networks by studying their robustness against targeted adversarial attacks.

2.2 Monocular Depth Prediction

[Eigen et al., 2014, Eigen and Fergus, 2015, Liu et al., 2015, Liu et al., 2016, Laina et al., 2016] trained deep networks with ground-truth annotations to predict depth from a single image. However, high quality depth maps are often unavailable and, when available, are expensive to acquire. Hence, trends shifted to weaker supervision from crowd-sourced data [Chen et al., 2016], and ordinal relationships amongst depth measurements [Zoran et al., 2015, Fu et al., 2018].

Recently, supervisory trends shifted to unsupervised (self-supervised) learning, which relies on stereo-pairs or video sequences during training, and provides supervision in the form of image reconstruction. While depth from video-based methods is up to an unknown scale, stereo-based methods can predict depth in metric scale because the pose (baseline) between the cameras is known.

To learn depth from stereo-pairs, [Garg et al., 2016] predicted disparity by reconstructing one image from its stereo-counterpart. Monodepth [Godard et al., 2017] predicted both left and right disparities from a single image and laid the foundation for [Poggi et al., 2018, Pillai et al., 2019, Wong and Soatto, 2019]. To learn depth from videos, [Mahjourian et al., 2018, Zhou et al., 2017] also jointly learned pose between temporally adjacent frames to enable image reconstruction by reprojection. [Wang et al., 2018, Yang et al., 2018] leveraged visual odometry, [Fei et al., 2018] used gravity, [Casser et al., 2019, Luo et al., 2018] considered motion segmentation, and [Yin and Shi, 2018] jointly learned depth, pose and optical flow. Monodepth2 [Godard et al., 2019] explored both stereo and video-based methods and proposed a reprojection loss to discard potential occlusions.

To study the effect of adversarial perturbations, we examine the robustness of Monodepth2, the state-of-the-art, and its predecessor, Monodepth. While Monodepth2 proposed both stereo and video-based models, we choose their stereo model because the predicted depth is in metric scale, which enables us to study perturbations that alter the scale of the scene without changing its topology.

In Sec. 3, we discuss our method. We show perturbations for altering entire predictions to a target scene in Sec. 4 and localized attacks on specific categories and object instances in Sec. 5. We discuss the transferability of such perturbations in Sec. 6 and their robustness against defenses in Sec. B of Supp. Mat.

Given a pretrained depth prediction network $f_d : \mathbb{R}^{H \times W \times 3} \to \mathbb{R}_+^{H \times W}$, $f_d : x \mapsto d(x)$, our goal is to find a small additive perturbation $v(x) \in \mathbb{R}^{H \times W \times 3}$, as a function of the input image $x$, which can change its prediction to a target depth $f_d(x + v(x)) = d^t(x) \neq d(x)$ with some norm constraint $\|v(x)\|_\infty \le \xi$ and high probability $P(f_d(x + v(x)) = d^t(x)) \ge 1 - \delta$.

We begin by examining Dense Adversarial Generation (DAG), proposed by [Xie et al., 2017a] for finding adversarial perturbations for the semantic segmentation task. The perturbations from DAG can be formulated as the sum of a gradient ascent term (that pushes the predictions away from those of the original image) and a gradient descent term (that pulls predictions towards the target predictions). In the case of semantic segmentation, this formulation works well because the gradient ascent term suppresses the probability for the original predictions, which naturally increases the probability of the target predictions (zero-sum) driven by the gradient descent term. However, such is not the case for regression tasks, which require the network to predict a real-valued scalar (as opposed to probability mass) for a targeted scene. Hence, the gradient ascent term maximizes the difference between the original and predicted depth, which results in DAG "overshooting" the target depth.

Instead, we use a simple objective function, similar to [Hendrik Metzen et al., 2017], but we modify it for the regression task by minimizing the normalized difference between predicted and target depth,

$$\ell(x, v(x), d^t(x), f_d) := \frac{\| f_d(x + v(x)) - d^t(x) \|_1}{d^t(x)}. \tag{1}$$

We minimize this objective function with respect to an image $x$ by following an iterative optimization procedure as detailed in Alg. 1. The $\mathrm{CLIP}(v_n(x), -\xi, \xi)$ operation clamps any value of $v(x)$ larger than ξ to ξ and any value smaller than −ξ to −ξ.
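As a rough illustration of the clip-then-descend loop of Alg. 1 applied to the objective of Eqn. 1, the sketch below runs it on a toy problem. It is not the authors' implementation: `toy_depth` is a hypothetical stand-in for $f_d$ with a hand-derived gradient, the reduction over pixels is assumed to be a mean, the final clip is an addition so that the returned perturbation respects the bound, and the values of `A`, `xi`, `lr` and `steps` are illustrative only.

```python
import numpy as np

A = 10.0  # sensitivity of the toy model (arbitrary, for illustration)

def toy_depth(x):
    """Hypothetical stand-in for f_d: H x W x 3 image -> positive H x W depth map."""
    return np.exp(A * x.mean(axis=-1))

def normalized_l1(pred, target):
    """Eqn. 1 read as the per-pixel mean of |pred - target| / target (assumed reduction)."""
    return float(np.mean(np.abs(pred - target) / target))

def loss_grad(x, v, target):
    """Analytic gradient of normalized_l1(toy_depth(x + v), target) with respect to v."""
    pred = toy_depth(x + v)
    # d pred / d v[..., c] = pred * A / 3 for each of the three channels
    per_pixel = np.sign(pred - target) / target * pred * A / 3.0
    return np.repeat(per_pixel[..., None], 3, axis=-1) / per_pixel.size

def targeted_perturbation(x, target, xi=0.02, lr=1e-3, steps=500):
    """Clip-then-descend loop following the structure of Alg. 1."""
    v = np.zeros_like(x)                        # Init: v_0(x) = 0
    for _ in range(steps):
        v = np.clip(v, -xi, xi)                 # v_n = CLIP(v_n, -xi, xi)
        v = v - lr * loss_grad(x, v, target)    # v_{n+1} = v_n - eta * grad
    return np.clip(v, -xi, xi)                  # extra final clip (not part of Alg. 1)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 0.2, size=(4, 4, 3))       # toy "image"
target = 0.9 * toy_depth(x)                     # target: whole scene 10% closer
v = targeted_perturbation(x, target)
print("error before:", normalized_l1(toy_depth(x), target))
print("error after: ", normalized_l1(toy_depth(x + v), target))
```

With a real network, `loss_grad` would be replaced by automatic differentiation through the model; the loop itself would stay the same.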
For all the experiments, ξ ∈ { × − , × − , × − , × − }.

Algorithm 1: Proposed method to calculate targeted adversarial perturbations for a regression task.
  Parameters: learning rate η, noise upper norm ξ.
  Inputs: image x, target depth map d^t(x), pretrained depth network f_d.
  Outputs: perturbation v_N(x).
  Init: v_0(x) = 0.
  for n = 0 : N − 1 do
    v_n(x) = CLIP(v_n(x), −ξ, ξ)
    Calculate ℓ(x, v_n(x), d^t(x), f_d) as defined in Eqn. 1.
    v_{n+1}(x) = v_n(x) − η ∇ℓ(x, v_n(x), d^t(x), f_d)
  end for

We evaluate adversarial targeted attacks on the KITTI semantic split [Alhaija et al., 2018]. This is a dataset of outdoor scenes, captured by car-mounted stereo cameras and a LIDAR sensor, with ground-truth semantic segmentation and instance labels. The semantic and instance labels in this split enable our experiments in Sec. 5 for targeting specific categories or instances in a scene.

The depth models (Monodepth, Monodepth2) are trained on the KITTI dataset [Geiger et al., 2012] using the Eigen split [Eigen and Fergus, 2015]. The Eigen split contains 32 out of the total 61 scenes, and comprises 23,488 stereo pairs with an average size of × . Images are resized to × as a preprocessing step and perturbations are computed with 500 steps of SGD. The entire optimization for each frame takes ≈ s (Monodepth2 takes ms for each forward pass and s ≈ × ms in total) using a GeForce GTX 1080. Details on hyper-parameters are provided in Sec. C of Supp. Mat.

For all the experiments, we use absolute relative error (ARE), computed with respect to the target depth $d^t(x)$, as our evaluation metric:

$$\mathrm{ARE} = \frac{\| f_d(x + v(x)) - d^t(x) \|_1}{d^t(x)}. \tag{2}$$

Given a depth network $f_d$, our goal is to find adversarial perturbations to alter the predictions to a target scene $d^t(x)$ for an image $x$. For this, we examine three settings: (i) scaling the entire scene by a factor, (ii) symmetrically flipping the scene, and (iii) altering the scene to a preset scene.

For autonomous navigation, misjudging an obstacle to be farther away than it is could prove disastrous. Hence, to alter the distances in the predicted scene without changing its topology or structure, we examine perturbations that will scale the scene (bringing the scene closer to or farther away from the camera) by a factor of 1 + α. The target scene is defined as:

$$d^t(x) = \mathrm{scale}(f_d(x)) = (1 + \alpha)\, f_d(x) \tag{3}$$

for α ∈ {−0.10, −0.05, +0.05, +0.10}, or −10%, −5% (closer), +5%, +10% (farther), respectively.

Column two of Fig. 1 shows the scene scaled 10% closer to the camera by applying visually imperceptible perturbations with ξ = 2 × − . On average, scaling the scene by −10%, −5%, +5%, +10% with ξ = 2 × − requires a ‖v(x)‖ of 0.0160, 0.0124, 0.0126, and 0.0161, respectively. We note that scaling the scene by ±5% requires less perturbation than ±10%, and the magnitude required for both directions is approximately symmetric. Also, perturbations are typically located along the object boundaries with concentrations on the road. For a side by side visualization of comparisons between different scaling factors, please see Sec. H.1 in Supp. Mat.

In Fig. 2-(a, b), we compare our approach with DAG (Sec. 3), re-purposed for the depth prediction task. While both are bounded by the same upper norm, DAG consistently produces results with higher error and generally with a higher standard deviation. As seen in Fig. 2-(a, b), even with ξ = 2 × − , we are able to find perturbations that can scale the scene to be ≈ from ± and ≈ from ± . With ξ = 2 × − , we are able to fully reach all four targets with less than ≈ . error.

Figure 2: ARE with various upper norm ξ for scaling and flipping Monodepth2 predictions. (a) and (b) Comparisons between DAG and the proposed method for scaling the scene by ±5% and ±10%. (c) Results for horizontally and vertically flipping the predictions. (d) Comparison between scaling and flipping tasks. Vertical flipping proves to be the most challenging.

We now examine the problem setting where the target scene still retains the same structures given by the image; however, they are mirrored across the y-axis (horizontal flip) or the x-axis (vertical flip).

For the horizontal flip scenario, we denote the target depth as $d^t(x) = \mathrm{fliph}(f_d(x))$, where the fliph operator horizontally flips the predicted depth map $f_d(x)$ across the y-axis.

Fig. 3-(b) shows that the perturbations can fool the network into predicting horizontally flipped scenes. For scenes with different structures on either side, $v(x)$ fools the network into creating and removing surfaces. We note that the amount of noise required to horizontally flip the scene is much more than that of scaling the scene (i.e. for ξ = 2 × − , ‖v(x)‖ = 0 . for scaling +10% and 0.326 for flipping horizontally), which illustrates the difficulty in altering the scene structures. Interestingly, the amount of noise required to remove the white wall (Fig. 3-(b)) is significantly less than the rest.

Figure 3: (a) Examples of success (left) and failure (right) cases for vertical flip. For the failure case, the car and road still remain on the bottom of the predictions. This is likely because the network is biased to predict closer structures on the bottom half of the image and farther ones on the top half (last two rows). (b) Examples of horizontal flip. Here, we observe the noise required to create and remove surfaces. Surprisingly, removing the white wall (right) requires very little perturbation.

For the vertical flip scenario, we denote the target depth as $d^t(x) = \mathrm{flipv}(f_d(x))$, where the flipv operator vertically flips the predicted depth map $f_d(x)$ across the x-axis.

As seen in Fig. 3-(a), perturbations cannot fully flip the predictions vertically. Even on successful attempts (left), there are still artifacts in the output. For failure cases (right), portions of the cars still remain on the bottom half of the predictions. This experiment reveals the potential biases learned by the network. To verify this, we feed vertically flipped images to the network. As seen in the last two rows of Fig. 3-(a), the network still assigns closer depth values to the bottom half of the image (now sky) and farther depth values to the top half (now road and cars).

In Fig. 2-(c, d), we plot the ARE achieved by the proposed method for different target depth maps: horizontal flip, vertical flip and different scales. Both flipping tasks are much harder than the scaling tasks. In particular, fooling the network into producing vertically flipped predictions is the most challenging task, as the error is ≈ , even with ξ = 2 × − .

We now examine perturbations for altering the predicted scene $d(x_1) = f_d(x_1)$ to an entirely different pre-selected one $d^t(x_1) = f_d(x_2)$, obtained from images sampled from the same training distribution, $x_1, x_2 \sim P(x)$: $d(x_1) = f_d(x_1) \neq d^t(x_1) = f_d(x_2)$.

Fig.
4 shows that cars can be removed and road signs can be replaced with trees (leftmost), walls (column two) and vegetation (column three) can be added to the scene, and an urban street with vehicles can be transformed to an open road (rightmost). While perturbations are visually imperceptible, we note that ‖v(x)‖ = 0 . , which is ≈ × the amount required for horizontal flip. However, the existence of such perturbations demonstrates just how vulnerable depth prediction networks can be. Additionally, this experiment also confirms the biases learned by the network discussed in Sec. 4.2. While perturbations can alter the scene to a preset one with structures not present in the image, we have difficulties finding perturbations that can vertically flip the predicted scene.

Figure 4: Altering the predicted scene to a preset scene. Adversarial perturbations can remove a car and replace a road sign with trees (leftmost), add walls (column two) and vegetation (column three) to open streets, and transform an urban environment with vehicles to an open road (rightmost).

Given semantic and instance segmentation [Alhaija et al., 2018], we now examine adversarial perturbations to target localized regions in the scene. Our goal is to fool the network into (i) predicting depths that are closer or farther by a factor of 1 + α for all objects belonging to a semantic category, (ii) removing specific instances from the scene, and (iii) moving specific instances to different regions of the scene, all the while keeping the rest of the scene unchanged.

Unlike Sec. 4.1, we want to alter a subset of the scene, partitioned by semantic segmentation, such that predictions belonging to an object category (e.g. vehicle, nature, human) are brought closer to or farther from the camera by a factor of 1 + α for α ∈ {−0.10, −0.05, +0.05, +0.10}.

Figure 5: ARE for scaling different categories closer and farther. It is easier to fool the network to predict vehicle and nature categories closer and farther than it is to fool human and traffic categories.

We assume a binary category mask $M \in \{0, 1\}^{H \times W}$ derived from a semantic segmentation, where all pixels belonging to a category are marked with 1 and 0 otherwise. We denote the target depth as

$$d^t(x) = (\mathbf{1} - M) \circ f_d(x) + (1 + \alpha)\, M \circ f_d(x) \tag{4}$$

where $\mathbf{1}$ is an H × W matrix of 1s. Column three of Fig. 1 illustrates this problem setting, where the perturbations fool Monodepth2 into predicting all vehicles to be 10% closer to the camera. Fig. 5 shows a comparative study between different categories. Unlike Sec.
4.1, it is more difficult to alter a specific portion of the scene without affecting the rest. Moreover, each category exhibits a different level of robustness to adversarial noise. Some categories are harder to attack than others, e.g. traffic signs and human categories ( ≈ error for α = ± and ≈ for α = ± ) are harder to alter than vehicle and nature ( ≈ . error for α = ± and ≈ for α = ± ). For interested readers, please see Sec. H.3 in Supp. Mat. for visualizations, additional experiments, and performance comparisons amongst all categories.

Figure 6: Selectively removing instances of human (bikers and pedestrians). Removing a localized target requires attacking non-local contextual information. Moreover, one can attack an instance without perturbing it at all. We demonstrate this by constraining the perturbation to be either completely within the instance mask or completely out of the mask.
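The whole-scene and category-level targets used up to this point (Eqn. 3, the flips of Sec. 4.2, and Eqn. 4) are simple array operations on the predicted depth map. A minimal NumPy sketch follows; the function names, the 2 × 3 depth map and the mask are made up for illustration and are not taken from the paper.

```python
import numpy as np

def scale_target(depth, alpha):
    """Eqn. 3: scale the whole predicted scene by (1 + alpha)."""
    return (1.0 + alpha) * depth

def flip_target(depth, direction):
    """Flip targets: mirror the depth map left-right or top-bottom."""
    if direction == "horizontal":
        return depth[:, ::-1]   # mirrored across the vertical (y) axis
    if direction == "vertical":
        return depth[::-1, :]   # mirrored across the horizontal (x) axis
    raise ValueError("direction must be 'horizontal' or 'vertical'")

def category_scale_target(depth, mask, alpha):
    """Eqn. 4: scale only the pixels where the binary mask is 1."""
    return (1.0 - mask) * depth + (1.0 + alpha) * mask * depth

depth = np.array([[10.0, 20.0, 30.0],
                  [40.0, 50.0, 60.0]])   # made-up predicted depth
mask = np.array([[0.0, 1.0, 0.0],
                 [0.0, 1.0, 1.0]])       # hypothetical "vehicle" pixels

closer = category_scale_target(depth, mask, alpha=-0.10)
print(closer)  # masked pixels are 10% closer, the rest are unchanged
```

Setting the mask to all ones makes `category_scale_target` coincide with `scale_target`, which is one way to see Eqn. 3 as a special case of Eqn. 4.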

We now consider the case where instance labels are available for removing a specific instance (e.g. car, pedestrian) from the scene. By examining this scenario, we hope to shed light on the possibility that a depth prediction network can "miss" a human or car, which may cause incorrect rendering in augmented reality or an accident in the autonomous navigation scenario.

Similar to Sec. 4.1, we assume a binary mask $M$, but in this case, of specific instance(s) in the scene, e.g. all pixels belonging to a specific pedestrian are marked with 1 and 0 otherwise. To obtain $d^t(x)$, we first remove the depth values in $f_d(x)$ belonging to $M$ by multiplying $f_d(x)$ by $\mathbf{1} - M$. Then, we use the depth values $f_d(x)$ on the contour of $M$ to linearly interpolate the depth in the missing region:

$$d^t(x) = (\mathbf{1} - M) \circ f_d(x) + M \circ d^t_M(x) \tag{5}$$

where $d^t_M(x) := \mathrm{interp}(\mathrm{contour}(f_d(x), M))$. Fig. 6 shows examples of pedestrian and biker removal in the driving scenario where perturbations completely remove the targeted instance. With this attack, the road ahead appears clear, which makes the agent susceptible to causing an accident.

Even though perturbations are concentrated on the targeted instance region, non-zero perturbations can be observed in the surrounding regions. While the target effect is localized (e.g., make a pedestrian disappear), the perturbation is non-local, implying that the network exploits non-local context, which presents a vulnerability to attacks against a target instance by perturbing other parts of the image.

Motivated by our results in Sec. 5.2, we extend the instance conditioned removal task to more constrained scenarios where the perturbations have to either exist (i) completely within the target instance mask $M$ or (ii) completely outside of it.

For perturbations within the targeted instance mask $M$, we constrain $v(x)$ to satisfy $\|M \circ v(x)\|_\infty \le \xi$ and $\|(\mathbf{1} - M) \circ v(x)\|_\infty = 0$.
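To make the instance-removal target of Eqn. 5 and the within-mask constraint concrete, here is a small NumPy sketch. The interpolation scheme is an assumption: the text only says that contour depths are linearly interpolated, so `removal_target` fills each masked run row by row from its nearest unmasked neighbours, which is one simple reading and may differ from the authors' `interp(contour(·))`. Both function names are hypothetical.

```python
import numpy as np

def removal_target(depth, mask):
    """Eqn. 5 under an assumed row-wise scheme: masked pixels are linearly
    interpolated from the nearest unmasked pixels to their left and right."""
    target = depth.astype(float)          # astype returns a copy
    cols = np.arange(depth.shape[1])
    for r in range(depth.shape[0]):
        hole = mask[r].astype(bool)
        # Rows that are fully masked have no boundary values and are left as-is.
        if hole.any() and not hole.all():
            target[r, hole] = np.interp(cols[hole], cols[~hole], depth[r, ~hole])
    return target

def project_inside_mask(v, mask, xi):
    """Enforce ||M o v||_inf <= xi and ||(1 - M) o v||_inf = 0."""
    return np.clip(v, -xi, xi) * mask[..., None]

depth = np.array([[1.0, 2.0, 9.0, 9.0, 5.0],
                  [3.0, 3.0, 3.0, 3.0, 3.0]])   # made-up depth, instance at 9.0
mask = np.array([[0.0, 0.0, 1.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0, 0.0]])    # hypothetical instance mask
print(removal_target(depth, mask))
```

Applying `project_inside_mask` after each update step of the optimization is one way to keep the perturbation inside the instance; passing `1.0 - mask` instead would give the complementary constraint of perturbing only outside the instance.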
When constrained within $M$, perturbations can only remove some instances successfully (e.g. the biker is completely removed in row two, column three of Fig. 6). In other cases, the perturbations can only remove the outer part of the instance, leaving parts of the instance in the scene (row four). This shows that depth prediction networks leverage global context; without attacking the contextual information located outside of $M$ (e.g. without perturbing the entire image as in column two of Fig. 6), it is not always possible to completely remove the target instance.

Second, we want to answer the question posed in Sec. 1. Can perturbations remove an object (e.g. a car) by attacking anywhere (e.g. a billboard) but the object itself? In this more challenging case, the perturbations are constrained to be outside of the instance mask: $\|(\mathbf{1} - M) \circ v(x)\|_\infty \le \xi$ and $\|M \circ v(x)\|_\infty = 0$. Column four of Fig. 6 shows that even though there are no "direct attacks on the object" (perturbations in the masked region), the perturbations can still remove parts of the target instance. While some of the target instance still remains, our experiment demonstrates that depth prediction networks are indeed susceptible to attacks against a target instance that do not require perturbing the instance at all.

In this case study, we examine perturbations for moving an instance (e.g. vehicle, pedestrian) horizontally or vertically in the image space. As Sec. 5.2 and 5.3 have demonstrated the ability to remove localized objects from the scene, we now show that it is possible for perturbations to move such objects to different locations in the scene (removing the instance and creating it elsewhere) while keeping the rest of the scene unchanged.

Fig. 7-(a) shows that perturbations can fool a network to move the target instance by ≈ across the image in the left and right directions. When moved left, the biker (left column) is now in front of our vehicle. When moved right, the truck (right column) is in the wrong lane and looks to be oncoming traffic. Moreover, Fig. 7-(b) shows that perturbations can move select instances by ≈ in the upward direction, creating the illusion that there are "flying cars" in the scene.

Figure 7: (a) Moving horizontally. Selected instances are moved by ≈ in the left and right directions while the rest of the scene is preserved. (b) Flying cars. A vehicle instance is moved ≈ upward while the rest of the scene is preserved. Noise is concentrated around the targeted instance.

Transferability is important for black-box scenarios, a practical setting where the attacker does not have access to the target model or its training data. To examine transferability, we test our perturbations crafted for Monodepth2 [Godard et al., 2019] to fool its predecessor Monodepth [Godard et al., 2017] (different architecture and loss function) in Monodepth2 → Monodepth, and vice versa in Monodepth → Monodepth2, for the scene scaling task (Sec. 4.1).

Figure 8: Transferability across models. Perturbations are (i) optimized for Monodepth and Monodepth2 separately, (ii) optimized for both together and (iii) summed over perturbations calculated for Monodepth and Monodepth2 separately. Each is tested on Monodepth and Monodepth2.

To this end, we also optimized perturbations for Monodepth to scale the entire scene. Overall, the perturbations optimized for one model do not transfer to another and, interestingly, transferability decays with increasing norm (Fig. 8), which may be due to perturbations overfitting to the model. We summed the perturbations for Monodepth and Monodepth2 ("Sum" in Fig. 8) and found that their summation can affect both models with reduced effects as the upper norm increases. For ξ = 2 × − , the potency is nearly unaffected, meaning, for small norms, their summation can attack both models equally well. Lastly, by optimizing for both models ("Both" in Fig.
8), the same perturbation can fool both as if it were optimized for the models individually, with performance indistinguishable from Monodepth → Monodepth and Monodepth2 → Monodepth2 across all norms. This shows that both models share a space that is vulnerable to adversarial attacks. Hence, crafting perturbations for an array of potential models may be an avenue towards achieving absolute transferability across models.
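The three settings compared in Fig. 8 (separately optimized, "Sum", and "Both") differ only in which losses feed the update. The sketch below shows that structure on a deliberately contrived toy: two hypothetical "models" that each read a different colour channel, so a perturbation crafted for one cannot affect the other by construction. It says nothing about the behaviour of Monodepth or Monodepth2. All names and values are assumptions, the per-pixel mean reduction is assumed, and the re-clipping of the summed perturbation is a guess, since the text does not say whether the sum is clipped.

```python
import numpy as np

A = 10.0  # sensitivity shared by both toy models (arbitrary)

def make_toy_model(channel):
    """Hypothetical depth model that reads a single colour channel."""
    def depth(x):
        return np.exp(A * x[..., channel])
    def loss_grad(x, v, target):
        # Gradient of mean(|depth(x + v) - target| / target) with respect to v.
        pred = depth(x + v)
        grad = np.zeros_like(v)
        grad[..., channel] = np.sign(pred - target) / target * pred * A / pred.size
        return grad
    return depth, loss_grad

def are(pred, target):
    """Eqn. 2 read as a per-pixel mean (assumed reduction)."""
    return float(np.mean(np.abs(pred - target) / target))

def optimize(x, attacks, xi=0.02, lr=5e-4, steps=500):
    """Clip-then-descend on the summed loss of every (loss_grad, target) pair."""
    v = np.zeros_like(x)
    for _ in range(steps):
        v = np.clip(v, -xi, xi)
        v = v - lr * sum(grad_fn(x, v, target) for grad_fn, target in attacks)
    return np.clip(v, -xi, xi)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 0.2, size=(4, 4, 3))
depth_a, grad_a = make_toy_model(channel=0)
depth_b, grad_b = make_toy_model(channel=1)
t_a, t_b = 0.9 * depth_a(x), 0.9 * depth_b(x)           # both scenes 10% closer

v_a = optimize(x, [(grad_a, t_a)])                      # crafted for model A only
v_b = optimize(x, [(grad_b, t_b)])                      # crafted for model B only
v_sum = np.clip(v_a + v_b, -0.02, 0.02)                 # "Sum" (re-clipping assumed)
v_both = optimize(x, [(grad_a, t_a), (grad_b, t_b)])    # "Both"

for name, v in [("A only", v_a), ("Sum", v_sum), ("Both", v_both)]:
    print(name, are(depth_a(x + v), t_a), are(depth_b(x + v), t_b))
```

For real networks the two gradients would come from backpropagating through each model, and whether a jointly optimized perturbation exists is an empirical question of the kind the experiments above address.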

Conclusion

Depth prediction networks are indeed vulnerable to adversarial perturbations. Not only can such perturbations alter the perception of the scene, but they can also affect specific instances, making the network behave unpredictably, which can be catastrophic in applications that involve interaction with physical space. These perturbations also shed light on the network's dependency on non-local context for local predictions, making non-local targeted attacks possible. While we have exposed vulnerabilities, we hope that our findings on the network's biases, the effect of context, and the robustness of the perturbations can help design more secure and interpretable models that are not susceptible to such attacks.

References

[Alhaija et al., 2018] Alhaija, H., Mustikovela, S., Mescheder, L., Geiger, A., and Rother, C. (2018). Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision (IJCV).

[Casser et al., 2019] Casser, V., Pirk, S., Mahjourian, R., and Angelova, A. (2019). Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8001–8008.

[Chen et al., 2016] Chen, W., Fu, Z., Yang, D., and Deng, J. (2016). Single-image depth perception in the wild. In Advances in Neural Information Processing Systems, pages 730–738.

[Cordts et al., 2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[Damasio et al., 1982] Damasio, A. R., Damasio, H., and Van Hoesen, G. W. (1982). Prosopagnosia: anatomic basis and behavioral mechanisms. Neurology, 32(4):331–331.

[Deng et al., 2009] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.

[Eigen and Fergus, 2015] Eigen, D. and Fergus, R. (2015). Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658.

[Eigen et al., 2014] Eigen, D., Puhrsch, C., and Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374.

[Eykholt et al., 2018] Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., Prakash, A., Kohno, T., and Song, D. (2018). Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1625–1634.

[Fei et al., 2018] Fei, X., Wong, A., and Soatto, S. (2018). Geo-supervised visual depth prediction. arXiv preprint arXiv:1807.11130.

[Fu et al., 2018] Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. (2018). Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011.

[Garg et al., 2016] Garg, R., BG, V. K., Carneiro, G., and Reid, I. (2016). Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer.

[Geiger et al., 2012] Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE.

[Godard et al., 2017] Godard, C., Mac Aodha, O., and Brostow, G. J. (2017). Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279.

[Godard et al., 2019] Godard, C., Mac Aodha, O., Firman, M., and Brostow, G. J. (2019). Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3828–3838.

[Goodfellow et al., 2014] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

[Hendrik Metzen et al., 2017] Hendrik Metzen, J., Chaithanya Kumar, M., Brox, T., and Fischer, V. (2017). Universal adversarial perturbations against semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2755–2764.

[Hu et al., 2019] Hu, J., Zhang, Y., and Okatani, T. (2019). Visualization of convolutional neural networks for monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3869–3878.

[Ilyas et al., 2019] Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. (2019). Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pages 125–136.

[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Laidlaw and Feizi, 2019] Laidlaw, C. and Feizi, S. (2019). Functional adversarial attacks. In

Advances inNeural Information Processing Systems , pages 10408–10418.[Laina et al., 2016] Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., and Navab, N. (2016). Deeperdepth prediction with fully convolutional residual networks. In

3D Vision (3DV), 2016 Fourth InternationalConference on , pages 239–248. IEEE.[Liu et al., 2015] Liu, F., Shen, C., and Lin, G. (2015). Deep convolutional neural ﬁelds for depth estimationfrom a single image. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition ,pages 5162–5170.[Liu et al., 2016] Liu, F., Shen, C., Lin, G., and Reid, I. (2016). Learning depth from single monocular imagesusing deep convolutional neural ﬁelds.

IEEE transactions on pattern analysis and machine intelligence ,38(10):2024–2039.[Luo et al., 2018] Luo, C., Yang, Z., Wang, P., Wang, Y., Xu, W., Nevatia, R., and Yuille, A. (2018). Everypixel counts++: Joint learning of geometry and motion with 3d holistic understanding. arXiv preprintarXiv:1810.06125 .[Mahjourian et al., 2018] Mahjourian, R., Wicke, M., and Angelova, A. (2018). Unsupervised learning ofdepth and ego-motion from monocular video using 3d geometric constraints. In

Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , pages 5667–5675.[Moosavi-Dezfooli et al., 2017] Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., and Frossard, P. (2017). Uni-versal adversarial perturbations. In

Proceedings of the IEEE conference on computer vision and patternrecognition , pages 1765–1773.[Moosavi-Dezfooli et al., 2016] Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. (2016). Deepfool: asimple and accurate method to fool deep neural networks. In

Proceedings of the IEEE conference on computervision and pattern recognition , pages 2574–2582.[Mopuri et al., 2018] Mopuri, K. R., Ganeshan, A., and Babu, R. V. (2018). Generalizable data-free objective forcrafting universal adversarial perturbations.

IEEE transactions on pattern analysis and machine intelligence ,41(10):2452–2465.[Najaﬁ et al., 2019] Najaﬁ, A., Maeda, S.-i., Koyama, M., and Miyato, T. (2019). Robustness to adversarialperturbations in learning from incomplete data. In

Advances in Neural Information Processing Systems ,pages 5542–5552.[Naseer et al., 2019] Naseer, M. M., Khan, S. H., Khan, M. H., Khan, F. S., and Porikli, F. (2019). Cross-domaintransferability of adversarial perturbations. In

Advances in Neural Information Processing Systems , pages12885–12895.[Peck et al., 2017] Peck, J., Roels, J., Goossens, B., and Saeys, Y. (2017). Lower bounds on the robustness toadversarial perturbations. In

Advances in Neural Information Processing Systems , pages 804–813.[Pillai et al., 2019] Pillai, S., Ambru¸s, R., and Gaidon, A. (2019). Superdepth: Self-supervised, super-resolvedmonocular depth estimation. In , pages9250–9256. IEEE.[Poggi et al., 2018] Poggi, M., Tosi, F., and Mattoccia, S. (2018). Learning monocular depth estimation withunsupervised trinocular assumptions. In , pages 324–333.IEEE.[Qin et al., 2019] Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham, K., Fawzi, A., De, S., Stanforth, R.,and Kohli, P. (2019). Adversarial robustness through local linearization. In

Advances in Neural InformationProcessing Systems , pages 13824–13833.[Silberman et al., 2012] Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. (2012). Indoor segmentation andsupport inference from rgbd images. In

European conference on computer vision , pages 746–760. Springer.[Szegedy et al., 2013] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus,R. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 .[Tramèr and Boneh, 2019] Tramèr, F. and Boneh, D. (2019). Adversarial training and robustness for multipleperturbations. In

Advances in Neural Information Processing Systems , pages 5858–5868. Wang et al., 2018] Wang, C., Miguel Buenaposada, J., Zhu, R., and Lucey, S. (2018). Learning depth frommonocular videos using direct methods. In

Proceedings of the IEEE Conference on Computer Vision andPattern Recognition , pages 2022–2030.[Wong and Soatto, 2019] Wong, A. and Soatto, S. (2019). Bilateral cyclic constraint and adaptive regularizationfor unsupervised monocular depth prediction. In

Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition , pages 5644–5653.[Xie et al., 2017a] Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., and Yuille, A. (2017a). Adversarial examplesfor semantic segmentation and object detection. In

Proceedings of the IEEE International Conference onComputer Vision , pages 1369–1378.[Xie et al., 2017b] Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017b). Aggregated residual transfor-mations for deep neural networks. In

Proceedings of the IEEE conference on computer vision and patternrecognition , pages 1492–1500.[Yang et al., 2018] Yang, N., Wang, R., Stückler, J., and Cremers, D. (2018). Deep virtual stereo odometry:Leveraging deep depth prediction for monocular direct sparse odometry. In

European Conference onComputer Vision , pages 835–852. Springer.[Yin et al., 2019] Yin, W., Liu, Y., Shen, C., and Yan, Y. (2019). Enforcing geometric constraints of virtualnormal for depth prediction. In

Proceedings of the IEEE International Conference on Computer Vision , pages5684–5693.[Yin and Shi, 2018] Yin, Z. and Shi, J. (2018). Geonet: Unsupervised learning of dense depth, optical ﬂow andcamera pose. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages1983–1992.[Zhou et al., 2017] Zhou, T., Brown, M., Snavely, N., and Lowe, D. G. (2017). Unsupervised learning of depthand ego-motion from video. In

CVPR , volume 2, page 7.[Zoran et al., 2015] Zoran, D., Isola, P., Krishnan, D., and Freeman, W. T. (2015). Learning ordinal relationshipsfor mid-level vision. In

Proceedings of the IEEE International Conference on Computer Vision , pages 388–396. upplementary Materials A Summary of Contents

In Sec. B, the robustness of perturbations against defenses is discussed. Additional implementation details that we could not fit into the main text due to space constraints are given in Sec. C. More experimental results on changing the scale of the scene are provided in Sec. D. In Sec. E, the existence of successful adversarial attacks for indoor scenes (NYU-V2) is shown for a state-of-the-art indoor monocular depth prediction model. In Sec. F, we examine how predictions behave when linear operations are applied to perturbations (the sum of two perturbations and the linear scaling of a perturbation). Limitations and failure cases for the perturbations are analyzed in Sec. G. Finally, in Sec. H, more qualitative and quantitative results are provided for the experiments whose compressed versions are presented in the main text.

B Robustness of the Targeted Attacks Against Defense Mechanisms

In the main text, we have shown that depth prediction networks are prone to adversarial attacks. In this section, we examine the robustness of the perturbations against common defense mechanisms: (i) Gaussian blurring and (ii) adversarial training.

B.1 Defense through Gaussian Blurring

In Fig. 9, we show the effect of Gaussian blurring as a simple defense mechanism on our targeted attacks by blurring the image with additive perturbations. Even though Gaussian blur does reduce the effectiveness of the perturbations (increased ARE over all scales), the resulting scene is still only ≈ away from a target depth that is 10% closer or farther than the original predictions for ξ = 2 × − . This is the performance that the method achieves for the case of ξ = 2 × − without blurring. In other words, the effect of the blurring can be suppressed simply by increasing the upper norm of the noise by × for scaling the scene by ± .

Figure 9: Gaussian blur. Absolute relative error (ARE) achieved by adversarial perturbations of different norms (ξ) for different scales. For each scale, we plot the ARE with and without Gaussian blur. Even though the absolute relative error increases with Gaussian blur, the proposed method can still find a small-norm noise to alter the scene.

B.2 Defense through Adversarial Training

To examine the robustness of adversarial perturbations to adversarial training, we crafted adversarial perturbations for scaling the scene by a factor of α where α ∈ {− . , − . , +0 . , +0 . } for the KITTI Eigen split [Eigen and Fergus, 2015] (consisting of 22600 stereo pairs). We trained Monodepth2 [Godard et al., 2019] by minimizing the normalized discrepancy between the predicted depth of a perturbed image, f_d(x + v(x)), and its prediction for the original image, f_d(x):

$$\ell(x, v(x), f_d) = \frac{\| f_d(x) - f_d(x + v(x)) \|}{f_d(x)} \tag{6}$$

Fig. 10 shows the effect of the perturbations on Monodepth2 after adversarial training. While training does reduce the influence of adversarial perturbations on the scene scaling task, it does not make the network invariant to the adversarial perturbations. With perturbations of ξ = 2 × − , the predicted scene is ≈ from the target scene that is closer than or farther from the original and ≈ from the target scene that is closer or farther. For ξ = 2 × − , perturbations can still fool the network to predict a target scene scaled by ± with ≈ absolute relative error and ≈ error for fooling the network to predict a target scene that is scaled by ± .

To compare the two defense mechanisms, we refer to Fig. 11. For smaller norms, e.g. ξ = 2 × − , we observe similar performance when using Gaussian blur (Sec. B.1) and adversarial training as defenses against adversarial perturbations, whereas adversarial training is clearly better for larger ξ. This may be due to Gaussian blur's ability to destroy the perturbation for small norms and hence mitigate the effect of the perturbations. However, for larger norms, the blurring does not corrupt the perturbations enough and therefore does not reduce the effect of perturbations by as much.

Figure 10: Adversarial training. Absolute relative error (ARE) achieved by adversarial perturbations of different norms (ξ) for different scales. For each scale, we plot the ARE with and without adversarial training. Even though the absolute relative error increases with adversarial training, the perturbations can still affect the predicted scene with small-norm noise.

Figure 11: Adversarial training vs Gaussian blur. Absolute relative error (ARE) achieved by adversarial perturbations of different norms (ξ) for different scales. For each scale, we plot the ARE without any defense, with Gaussian blur and with adversarial training. Both Gaussian blur and adversarial training make the depth prediction network more robust to perturbations. Performances of the two defense mechanisms are comparable for small norms (ξ), but adversarial training is more effective on larger norms.

C Additional Implementation Details for Outdoor Scenario

In this section, we provide additional implementation details for crafting adversarial perturbations for Monodepth [Godard et al., 2017] and Monodepth2 [Godard et al., 2019] on the KITTI dataset (outdoor driving scenario), as discussed in the main text.

C.1 Hyper-parameters

Upper Norm    ξ = 2 × −    ξ = 5 × −    ξ = 1 × −    ξ = 2 × −
Monodepth     η = 1 .      η = 2 .      η = 3 .      η = 4 .
Monodepth2    η = 0 .      η = 1 .      η = 3 .      η = 5 .

Table 1: Learning rates. We achieve the best performances with the given learning rates.

Regarding hyper-parameters for crafting adversarial perturbations: We search the learning rate for each noise norm from the set { . , . , . , . , . , . , . }. We report the best performing ones in Table 1. Regarding our choice for the number of steps to run, we experimented with 200, 400, 500, 800, and 1000 steps and found little difference in performance, measured by ARE, between 500 steps and 800 or 1000 steps. While an increased number of steps yields slight performance improvements, conscious of the time cost, we chose 500 for our experiments.

Regarding hyper-parameters for adversarial training: As a defense against adversarial perturbations, we optimized Eqn. 6 for Monodepth2 using Adam [Kingma and Ba, 2014] with β_1 = 0 . and β_2 = 0 . . We used a batch size of 4 and a starting learning rate of × − . We decreased the learning rate to × − after 10 epochs, to . × − after 20 epochs, and to × − after 30 epochs, for a total of 40 epochs. Training takes approximately 4 hours using an Nvidia GeForce GTX 1080 GPU.

C.2 Monodepth and Monodepth2

We study the effects of adversarial perturbations on the state-of-the-art monocular depth prediction method, Monodepth2 [Godard et al., 2019], and its predecessor, Monodepth [Godard et al., 2017]. The two models use different network architectures and are trained with different loss functions. In this section, we provide details on the two methods.

Regarding Monodepth: Monodepth uses a ResNet50 encoder as its backbone and a standard decoder with skip connections. Monodepth predicts both left and right disparities from a single image (assuming it is the left image of a stereo pair) and uses image reconstruction as supervision. Additionally, it is trained with a standard local smoothness term weighted by image gradients and a left-right disparity consistency term as its regularizers.

Regarding Monodepth2: Monodepth2, unlike Monodepth, uses a ResNet18 encoder (pretrained on ImageNet) as its backbone network architecture. Rather than simply minimizing an image reconstruction loss, Monodepth2 leverages a heuristic to discount occluded pixels and also uses a criterion to discount static frames. Similar to Monodepth, Monodepth2 also minimizes a local smoothness regularizer weighted by image gradients.
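As a rough illustration of the "local smoothness regularizer weighted by image gradients" shared by both methods, the NumPy sketch below penalizes disparity gradients less where the image itself has strong gradients. The function name `edge_aware_smoothness`, the exponential weighting and the channel-averaged image gradient are assumptions chosen for illustration; they follow a commonly used form of this regularizer and are not taken from the Monodepth or Monodepth2 code.

```python
import numpy as np


def edge_aware_smoothness(disparity, image):
    """Illustrative edge-aware smoothness penalty (assumed form).

    disparity: array of shape (H, W).
    image:     array of shape (H, W, C) with intensities in [0, 1].
    """
    # Absolute first-order differences of the disparity map.
    d_dx = np.abs(disparity[:, 1:] - disparity[:, :-1])
    d_dy = np.abs(disparity[1:, :] - disparity[:-1, :])

    # Absolute image differences, averaged over color channels.
    i_dx = np.mean(np.abs(image[:, 1:] - image[:, :-1]), axis=-1)
    i_dy = np.mean(np.abs(image[1:, :] - image[:-1, :]), axis=-1)

    # Disparity changes are penalized less where the image has an edge.
    return np.mean(d_dx * np.exp(-i_dx)) + np.mean(d_dy * np.exp(-i_dy))


# Toy inputs: a disparity step that either coincides with an image edge or not.
step_disparity = np.zeros((4, 6))
step_disparity[:, 3:] = 1.0
flat_image = np.zeros((4, 6, 3))
edge_image = np.zeros((4, 6, 3))
edge_image[:, 3:, :] = 1.0

penalty_flat = edge_aware_smoothness(step_disparity, flat_image)
penalty_edge = edge_aware_smoothness(step_disparity, edge_image)
```

In this toy case the same disparity step costs less when it lines up with an image edge, which is the behavior the weighting is meant to encourage.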

D Scaling with Larger Factors

Figure 12: ARE with various upper norm ξ for scaling Monodepth2 predictions by larger factors. We increased the scaling to , , and closer and farther. We can see that for ξ = 2 × − , perturbations are still able to scale the entire scene by ≈ error.

In Sec. 4.1 and Fig. 2-(a, b) of the main text, we showed that it is possible, even for small norms such as ξ = 2 × − , to scale the scene to be or closer or farther with small error. In this section, we demonstrate that it is possible to scale the scene by larger amounts (up to closer or farther). Fig. 12 shows that with ξ = 2 × − , perturbations are only able to scale the scene up to with reasonable error, whereas perturbations with ξ = 5 × − can achieve this up to closer or farther. However, using larger norms ( × − and ξ = 2 × − ), one can scale the scene up to with small errors (less than ARE for ξ = 1 × − and ≈ ARE for ξ = 2 × − ).

To see how far we can push for each upper norm, Fig. 17 shows various scales that each upper norm is capable of achieving. We note that ξ = 2 × − is still able to obtain less than ARE when scaling the scene by ; however, the standard deviation starts to grow larger as the scaling increases.

E Adversarial Attacks for Indoor Scenes

Figure 13: Indoor quantitative results. ARE with various upper norm ξ for scaling and flipping VNL predictions. (a) Results for horizontally and vertically flipping the predictions. (b) Results for scaling the scene by ± . (c) Comparison between scaling and flipping tasks.

To show the applicability of the adversarial method to indoor scenes, we examine the adversarial perturbations for the Kinect dataset NYU Depth V2 (NYU-V2) [Silberman et al., 2012]. We tested the effectiveness of adversarial perturbations on Virtual Normal Loss (VNL) [Yin et al., 2019], which is the state-of-the-art monocular depth prediction method for NYU-V2, trained in the supervised setting.
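The ARE reported in these plots can be read as a mean per-pixel relative error against the target depth. A minimal sketch follows, assuming the ARE is the mean of |prediction − target| / target and that scaling by α means multiplying the depth by (1 + α). Both conventions, and the names `scaled_target` and `absolute_relative_error`, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def scaled_target(depth, alpha):
    """Target depth for scaling the whole scene (assumed convention).

    alpha < 0 moves the scene closer, alpha > 0 moves it farther.
    """
    return (1.0 + alpha) * np.asarray(depth, dtype=float)


def absolute_relative_error(pred, target):
    """Mean per-pixel |pred - target| / target for strictly positive targets."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if np.any(target <= 0):
        raise ValueError("target depth must be strictly positive")
    return float(np.mean(np.abs(pred - target) / target))


# Toy example: the unperturbed prediction measured against a 10% farther target.
original = np.array([[2.0, 4.0], [8.0, 16.0]])
target = scaled_target(original, alpha=0.10)
baseline_are = absolute_relative_error(original, target)  # equals 0.1 / 1.1
```

Under these assumptions, a successful attack drives the error from this baseline toward zero, since the perturbed prediction is meant to match the target.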

E.1 Implementation Details

NYU-V2 consists of 1449 RGBD images gathered from a wide range of buildings, comprising 464 different indoor scenes across 26 scene classes. The images were hand-selected from 435,103 video frames to ensure diversity. The 1449 labeled samples are split into 795 training and 654 test images.

In the method proposed by [Yin et al., 2019], a 3D point cloud is reconstructed from the estimated depth map. Then, three non-colinear points are randomly sampled with large distances to form a virtual plane. The deviation between ground truth and prediction for the direction of the normal vector corresponding to the plane is penalized. A ResNeXt-101 [Xie et al., 2017b] model pre-trained on ImageNet [Deng et al., 2009] is used as the backbone architecture. During training, images are cropped to the size 384 × 384 for NYU-V2. We use the same image resolution for our experiments. The training set is randomly sampled from 29K images of the raw unlabeled training set.

The time it takes to forward an image with this method is ≈ . seconds ( ≈ times more than Monodepth2). Due to computational limitations, we choose the first images out of images of the test split for our experiments. We run SGD for steps. The learning rate is kept at . for the entire optimization.

E.2 Scaling and Symmetrically Flipping the Scene

In Fig. 13, we compare the performance for different target depth maps: scaling the scene by ± , horizontal flipping and vertical flipping. Unlike the outdoor case (KITTI), in the indoor case (NYU-V2) horizontal flipping is a harder task than vertical flipping. Achieving a vertically flipped scene was expected to be simpler, as layouts of indoor scenes are more diverse. Hence, the depth network does not overfit to a particular layout type, e.g. one in which there are large depth values only at the top of the image. Horizontal flipping being relatively harder for indoor scenes can be explained by the large divergence between the depth distributions of the original predictions and the target depths. The reason is that most indoor scenes are not symmetric in the horizontal direction, unlike the outdoor driving scenario where the left and right parts of the scene from the ego-view are usually symmetric.

Since [Yin et al., 2019] normalizes images with the deviation of the dataset, which approximately scales the image by , it also effectively scales the noise with the same deviation. But, since the relative norm is still the same, we use the same norm values ξ ∈ { × − , × − , × − , × − } when plotting the ARE in Fig. 13.

In Fig. 14, we present qualitative results for NYU-V2 for ξ = 2 × − . Small, white borders around RGB images exist in the raw dataset. For all the tasks, including vertical flip, adversarial perturbations manage to fool the model to predict the target depth with small errors. For the horizontally flipped target depth (a), predictions have more artifacts than for the vertically flipped depth (b) and the scaled depths (c, d).

Figure 14: Indoor qualitative results. From left to right: (a) horizontal flip, (b) vertical flip, (c) scale 10% closer and (d) scale 10% farther. From top to bottom: RGB, noise, original disparity, disparity prediction for the perturbed image.
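To make the argument about flipped targets concrete, the sketch below builds horizontally and vertically mirrored target depth maps and measures how far each target is from the original map. The toy layout, the helper names `flip_target` and `relative_gap`, and the use of a mean relative difference as the gap measure are all assumptions made for illustration; no depth network is involved.

```python
import numpy as np


def flip_target(depth, direction):
    """Return a mirrored copy of an (H, W) depth map to use as a target."""
    if direction == "horizontal":
        return np.fliplr(depth).copy()
    if direction == "vertical":
        return np.flipud(depth).copy()
    raise ValueError("direction must be 'horizontal' or 'vertical'")


def relative_gap(original, target):
    """Mean per-pixel |original - target| / target (assumed gap measure)."""
    return float(np.mean(np.abs(original - target) / target))


# Toy driving-like layout: far at the top, near at the bottom,
# and identical across columns (left-right symmetric).
row_depths = np.array([50.0, 20.0, 10.0, 5.0])
road_scene = np.tile(row_depths[:, None], (1, 6))

gap_horizontal = relative_gap(road_scene, flip_target(road_scene, "horizontal"))
gap_vertical = relative_gap(road_scene, flip_target(road_scene, "vertical"))
```

In this symmetric toy layout the horizontally flipped target coincides with the original map, while the vertically flipped target differs in every row. This mirrors the explanation that the difficulty of a flipping task tracks the divergence between the original prediction and the target depth, although real scenes are of course far less regular.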

F Linear Operations

Figure 15: Disparity for x + γv(x) where γ is . , . , . , . , . from top to bottom. v(x) is calculated for d^t = fliph(f_d(x)). So, the top is the original disparity map while the bottom-most is the flipped one. In between, portions of the scene are flipped smoothly.

Figure 16: (1st row) left to right: RGB, sum of noises. (2nd row) left to right: original disparity, disparity when two noises v_1(x) + v_2(x) are added to the image. (3rd row) left to right: noise for 10% closer, noise for 10% farther. (4th row): Disparity predictions for the images perturbed with the noises in the 3rd row. When added, the two noises cancel each other's effect: the scale for f_d(x + v_1(x) + v_2(x)) is close to the original one, f_d(x).

To better understand how predictions of the depth network change within a ball of small radius, we examine the effect of linear operations on perturbations. Specifically, we visualize the predictions for the scaled perturbations and for the perturbations which we get after summing two perturbations calculated for two different target depth maps.

In Fig. 15, we take the perturbation v(x), which we calculated to horizontally flip the prediction for the given image, and we visualize the prediction of the network for x + γv(x) where γ ∈ { . , . , . , . , . }. As can be seen, between γ = 0 and γ = 1, the scene is smoothly flipped. This implies that the adversarial perturbations can be used to control the depth prediction in a disentangled way. In other words, one causal factor (e.g. horizontal orientation) of the prediction can be independently controlled by tweaking γ only, keeping everything else the same.

As observed before, noise is small for the white regions. See the third column, where there is a gray rectangle in the noise corresponding to the white region of the truck. We speculate that the reason behind this phenomenon is that the white color lies on the border of the support of RGB images. However, the noise is still large for black regions, which are at the other extreme of the support (see perturbations corresponding to black vehicles). We therefore leave further understanding of this phenomenon as future work.

In Fig. 16, we take v_1(x) and v_2(x), which are optimized to scale the scene 10% closer and 10% farther, respectively. We then visualize the summed perturbation, v(x) = v_1(x) + v_2(x), and the prediction for x + v(x). As can be seen, the two noises cancel each other: ||v_1(x)|| ≈ ||v_2(x)|| ≫ ||v_1(x) + v_2(x)||. Furthermore, the prediction for the image perturbed with the summed noise is close to the original prediction: f_d(x) ≈ f_d(x + v_1(x) + v_2(x)). This shows that two perturbations with inverse functionalities can neutralize their effects when applied together.

Figure 17: ARE with various upper norm ξ for scaling Monodepth2 predictions. This time, error is plotted for large scale ratios , , and (up to for larger ξ), for scaling both closer and farther, showing the limitations of each norm for the scaling task. While ARE is still relatively small for larger norms, the standard deviation grows larger, meaning the perturbations can no longer scale the scene consistently with low error.

Figure 18: Additional examples of failure cases for vertical flip. While adversarial perturbations with ξ = 2 × − can fool Monodepth2 into predicting scenes that are scaled by large amounts and horizontally flipped, they cannot cause the network to vertically flip the scene, leaving behind cars and roads as artifacts.

G Limitations

Fig. 17 shows the absolute relative error (ARE) with respect to the target scaling factor for each upper norm. As we can see, for smaller norms of × − and × − , the perturbations are limited to scaling the scene by ≈ and ≈ , respectively. Scaling factors higher than these increase the ARE by ≈ for every 5% increase in scaling factor, signaling the limit for these norms. For larger norms of × − and × − , the perturbations can afford to scale the scene by a much larger factor. For ξ = 1 × − , perturbations can scale the scene by as much as closer and farther with less than error, whereas for ξ = 1 × − , perturbations can scale the scene up to ± with less than ARE. However, while large scaling still has a low ARE, the standard deviation for larger norms increases drastically, showing that the perturbations can no longer consistently scale the scene.

For smaller scales (e.g. ± , ± ), the ARE and the amount of noise required are approximately the same (see Sec. 4.1, main text), suggesting similar difficulty levels. However, as we plot the errors for larger scales in Fig. 17, scaling the scene farther generally yields lower error than scaling the scene closer.

Fig. 18 shows additional examples of failure cases for vertical flip. While we have shown in the main paper as well as Sec. D and H that it is possible to manipulate the scene with small norm perturbations, we show here that perturbations cannot fool a network into vertically flipping the scene.

H Additional Results on Outdoor Scenarios

In this section, we show (i) side-by-side visualization of the perturbations required to scale the scene, (ii) additional visualizations of perturbations to horizontally and vertically flip the scene, (iii) quantitative results on targeted attacks to semantic categories and (iv) qualitative results on targeted attacks to instances.

H.1 Scaling the Scene

Figure 19: Visually imperceptible perturbations v(x), with ξ = 2 × − , can fool Monodepth2 into predicting scenes that are 5% or 10% closer and also 5% or 10% farther.

Here, we show qualitative results for the task of scaling the scene (Sec. 4.1, main text) by a factor of α where α ∈ {−0.10, −0.05, +0.05, +0.10}. As seen in Fig. 19, the perturbations are successful in fooling the state-of-the-art monocular depth prediction method, Monodepth2 [Godard et al., 2019], into predicting the scene 5% or 10% closer and also 5% or 10% farther. Additionally, the perturbations are concentrated in similar regions for scaling the scene 5% or 10% closer and for 5% or 10% farther. As noted in the main text, the amount of noise required for scaling the scene by ± is approximately the same, as is the amount for scaling the scene by ± . This is visible in Fig. 19.

H.2 Symmetrically Flipping the Scene

In Sec. 4.2 and Fig. 3 in the main text, we demonstrated that adversarial perturbations can cause a monocular depth prediction network to predict a horizontally or vertically flipped scene. Here, we show additional qualitative results on the horizontal and vertical flipping tasks in Fig. 20 and Fig. 21. We note that while perturbations can cause the network to predict a horizontally flipped scene, they have trouble fooling the network into predicting a vertically flipped scene. This is unlike our findings in the indoor scenario (Sec. E), as seen in Fig. 13 and 14, where fooling the network to vertically flip the scene is in fact easier than fooling it to horizontally flip the scene. This confirms the biases (roads on bottom, sky on top) that the network learned from the outdoor dataset, which are not present in the indoor dataset.

Figure 20: Additional examples of horizontal flip. Adversarial perturbations with ξ = 2 × − can fool Monodepth2 into predicting scenes that are horizontally flipped.

Figure 21: Additional examples of vertical flip. Adversarial perturbations with ξ = 2 × − can fool Monodepth2 into predicting scenes that are vertically flipped. Even in these successful examples, there are still artifacts (ripples, waviness) in the output.

H.3 Category Conditioned Scaling

In Sec. 5.1 and Fig. 5 of the main text, we showed category-specific attacks that scale all objects belonging to a given category to be a factor of α closer or farther, where α ∈ {−0.10, −0.05, +0.05, +0.10}. Here, we provide the performance, measured in ARE, of adversarial perturbations crafted for each category. We use the same convention for grouping different classes into categories as the Cityscapes dataset [Cordts et al., 2016], with the exception of the “Human” category, which includes the bicycles that the bikers are riding.

Fig. 22 shows a comparative study between different categories. Not all categories are equally easy to fool with the perturbations; some are more robust to adversarial attacks than others. As seen in Fig. 22, each category exhibits a different level of robustness to adversarial noise: “Human” and “Traffic” categories are the hardest to fool, “Construction”, “Vehicle” and “Flat” are more susceptible, and “Sky” and “Nature” are the easiest to attack. Plots are cropped at the maximum error across different categories to enable comparison of the difficulty in fooling different categories. We note that attacking localized regions in the scene is considerably harder than attacking the entire scene. Fig. 12 shows that perturbations can attack the entire scene with small errors across various norms, while Fig. 22 shows that, even with large norms, there are still errors ( ≈ to ARE). We show visualizations for the “Construction”, “Nature”, and “Vehicle” categories in Fig. 23, 24, and 25 respectively.
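A category-conditioned target of the kind described above can be sketched by scaling the depth only inside a semantic mask and leaving the rest of the scene unchanged. The multiplicative (1 + α) convention, the restriction of the error to the masked region, and the names `category_scaled_target` and `masked_are` are assumptions made for illustration; the toy mask stands in for a real semantic segmentation.

```python
import numpy as np


def category_scaled_target(depth, mask, alpha):
    """Scale depth by (1 + alpha) inside the mask; keep it unchanged elsewhere."""
    depth = np.asarray(depth, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    return np.where(mask, (1.0 + alpha) * depth, depth)


def masked_are(pred, target, mask):
    """Mean |pred - target| / target over the masked pixels only."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no pixels")
    return float(np.mean(np.abs(pred[mask] - target[mask]) / target[mask]))


# Toy scene: constant depth of 10, with a hypothetical "Vehicle" region
# occupying the two leftmost columns.
depth = np.full((3, 4), 10.0)
vehicle_mask = np.zeros((3, 4), dtype=bool)
vehicle_mask[:, :2] = True

# Target that moves only the masked category 10% closer.
target = category_scaled_target(depth, vehicle_mask, alpha=-0.10)
baseline_error = masked_are(depth, target, vehicle_mask)  # equals 1 / 9
```

Because the target equals the original depth outside the mask, any change the perturbation causes there counts as error, which is one way to see why localized attacks are harder than scaling the entire scene.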

Figure 22: ARE for scaling different categories closer and farther. From top to bottom: “Construction”, “Flat”, “Human”, “Nature”, “Sky”, “Traffic”, “Vehicle”. From left to right: 10% closer, 5% closer, 5% farther, 10% farther. The y-axis is kept the same for the same scale, to make comparison across categories possible. It is easier to fool the network to predict vehicle and nature categories closer and farther than it is to fool human and traffic categories.

Figure 23: Examples of targeted attacks on regions belonging to the “Construction” category.

Figure 24: Examples of targeted attacks on regions belonging to the “Nature” category.

Figure 25: Examples of targeted attacks on regions belonging to the “Vehicle” category.

Figure 26: Additional examples of human removal. Targeted adversarial perturbations can remove humans from the predicted scene. The rightmost panel shows that we can target multiple humans and remove them from the scene without affecting the remaining pedestrian on the right.

Figure 27: Examples of vehicle removal. Targeted adversarial perturbations can remove vehicles from the predicted scene. The rightmost panel shows that we can target a truck and a car on the right side and remove them. Note that the cars in the middle of the road still remain.

H.4 Instance Conditioned Targeted Attacks

In Sec. 5.2 and Fig. 6 in the main text, we showed that, when given instance segmentation, adversarial perturbations can target specific instances and remove them from the scene, which can cause unforeseen consequences. Fig. 26 shows additional examples of removing humans from the scene and Fig. 27 demonstrates that it is possible to remove vehicles from the scene as well. In the rightmost panel of Fig. 26, we show that it is possible to remove some pedestrians from the scene without affecting others. Similarly, in the rightmost panel of Fig. 27, we removed a truck and a car on the right side and left the cars in the center untouched, leaving this as a still plausible highway driving scenario.

In Sec. 5.4 and Fig. 7 in the main text, we show that perturbations can move an instance to another location in the scene (which requires removing the instance from its original location and creating it in the new location). In this section, we give more visuals for the perturbations used for moving an instance (e.g. vehicle, pedestrian) horizontally or vertically in the image space while keeping the rest of the scene unchanged.

Fig. 28 shows that perturbations can fool a network to move the target instance by ≈ across the image in the left and right directions. Furthermore, Fig. 29 shows that perturbations can move select instances by ≈ in the upward direction, creating the illusion that there are “flying vehicles” in the scene. We note that in both cases, the perturbations are concentrated on the instance and the region to which the instance is moved. For example, when moving a vehicle right or left, the corresponding perturbations also move right or left.

Figure 28: Move instance horizontally. The selected instance is moved by ≈ in the left and right directions while the rest of the scene is preserved. Noise and disparity for both directions are given. We note that the noise is concentrated around the targeted instance and the region to which the instance is moved.

Figure 29: Flying vehicles.

Selected vehicle is moved by ≈42%