Learned Camera Gain and Exposure Control for Improved Visual Feature Detection and Matching
Justin Tomasi, Brandon Wagstaff, Steven L. Waslander, Jonathan Kelly
Learned Camera Gain and Exposure Control forImproved Visual Feature Detection and Matching
Abstract—Successful visual navigation depends upon capturing images that contain sufficient useful information. In this paper, we explore a data-driven approach to account for environmental lighting changes, improving the quality of images for use in visual odometry (VO) or visual simultaneous localization and mapping (SLAM). We train a deep convolutional neural network model to predictively adjust camera gain and exposure time parameters such that consecutive images contain a maximal number of matchable features. The training process is fully self-supervised: our training signal is derived from an underlying VO or SLAM pipeline and, as a result, the model is optimized to perform well with that specific pipeline. We demonstrate through extensive real-world experiments that our network can anticipate and compensate for dramatic lighting changes (e.g., transitions into and out of road tunnels), maintaining a substantially higher number of inlier feature matches than competing camera parameter control algorithms.
Index Terms—Deep Learning for Visual Perception, Vision-Based Navigation, Visual Learning
I. INTRODUCTION

Reliable perception is crucial for safe robot operation in dynamic environments. While inexpensive commercial cameras have become ubiquitous due to their size, weight, and performance, camera image quality can be degraded by rapid motion and by scene lighting changes. In turn, poor-quality images will reduce the performance of many visual navigation algorithms and, in the worst case, may cause a navigation algorithm to fail entirely [1].
There are three general approaches to increase the robustness of visual navigation algorithms to dynamic lighting conditions [2]. The first approach involves applying some form of post-processing after image capture in an effort to mitigate changes in illumination [3]–[7]. The second approach is to utilize feature detection and matching algorithms that have some degree of invariance to brightness variations [8], [9]. These two approaches can help to improve visual navigation when the acquired images already contain sufficient information, but cannot recover information that is lost due to overexposure or underexposure [10].
Manuscript received: October 15, 2020; Revised January 10, 2021; Accepted February 2, 2021. This paper was recommended for publication by Editor Cesar Cadena Lerma upon evaluation of the Associate Editor and Reviewers' comments. This research was supported in part by the Canada Research Chairs program. Justin Tomasi, Brandon Wagstaff, and Jonathan Kelly, a Vector Institute Faculty Affiliate, are with the Space and Terrestrial Autonomous Robotic Systems (STARS) Laboratory, University of Toronto Institute for Aerospace Studies (UTIAS), Toronto M3H 5T6, Canada. Steven L. Waslander is with the Toronto Robotics and AI Laboratory (TRAIL), UTIAS, Toronto M3H 5T6, Canada.
Fig. 1: Our method selects camera parameter values that yield images with a higher number of inlier feature matches (bottom row) compared to built-in automatic gain and exposure time control (top row). This behaviour is demonstrated before entering a tunnel (left column) and while exiting from a tunnel (right column). Inlier feature tracks are shown in red.

The third approach, which we follow in this work, is to compensate for dynamic lighting during the image acquisition process by actively adjusting the relevant camera imaging parameters.
The two camera parameters that have the greatest effect on image quality (i.e., information content) are gain and exposure time. Often, these parameters are set to fixed values for simplicity or adjusted automatically by a built-in, proprietary parameter control algorithm. Built-in control algorithms are usually adequate for situations in which lighting conditions are constant or change slowly. However, during fast lighting transitions, relying on automatic gain and exposure time control often results in poorly-exposed images due to relatively slow algorithm response times [11].
One reason for the poor performance of built-in parameter controllers and other hand-crafted algorithms is that they operate in a reactive manner. Adjustments are made only after a large change in overall image brightness has been recorded, which is too late to prevent the loss of valuable information (caused by overexposure or underexposure). We posit that improved image quality under dynamic lighting conditions can be obtained through predictive adjustments to compensate for impending lighting changes. We design such a predictive controller by training a deep neural network to adjust the camera gain and exposure time (hereafter exposure) in such a way that the quality of future images will be improved.
Our approach is data-driven: we learn a deep convolutional neural network (CNN) model that takes as input a sequence of recent images and the corresponding camera parameter values and outputs updated parameter values that are applied before the next image is acquired. The substantial representational capacity of deep networks allows us to capture important dependencies between scene content, lighting, and VO performance. For example, when trained on image data from roadways, the CNN learns to overexpose the sky in order to better expose the road region, because the sky contains limited or no useful information for navigation. The training process is fully self-supervised and leverages the outputs of an underlying VO front end; our loss function is designed to maximize the number of inlier feature matches across consecutive images. Our main contributions are:
1) an algorithm for predictively adjusting camera gain and exposure parameters such that acquired images contain a greater number of sequential inlier feature matches;
2) an approach for generating training targets in a fully self-supervised manner; and
3) extensive real-world experimental results demonstrating that our method yields images with more inlier matches than competing parameter control algorithms.
In particular, we demonstrate an ability to maintain successful visual tracking (and pose estimation) through road tunnel entry and exit transitions, which are challenging examples of dramatic lighting change that cause competing algorithms to consistently fail.
Although we focus on improving the performance of feature-based VO, our general approach can be altered (through an appropriate choice of loss function) to improve the quality of images for use in many different visual navigation and mapping tasks.

II. RELATED WORK
Recent work in the area of camera parameter control for VO and SLAM has focused on task-agnostic, reactive adjustment to maximize some, often heuristic, measure of image 'quality.' Adjustments are generally made under the assumptions that the scene content and lighting remain relatively unchanged over the adjustment period. In this section, we review existing methods in the literature for camera gain and exposure control to improve various image quality measures.
A. Exposure Control
Hand-crafted approaches for camera parameter control have, in many cases, focused on adjustments of exposure only [10]–[12]. Exposure directly impacts image brightness and sharpness by varying the amount of light that strikes the image sensor during acquisition. In [11], Shim et al. derive an image quality metric that is based on the magnitude of the image gradients. (We note that our use of the word 'metric' herein refers to a measure of image quality rather than to a distance in the mathematical sense.) After an image is captured, the algorithm in [11] generates a series of synthetic counterparts by applying various levels of gamma correction; the optimal exposure value corresponds to the gamma correction that maximizes the gradient metric. Critically, however, synthetically-generated images are only able to approximate the effects of varied exposure settings (e.g., this approach does not consider motion blur induced by longer exposures). Also, the method in [11] is reactive—exposure adjustments are made based on the most recently acquired image only.
Similar to [11], a gradient-based image quality metric is also employed by Zhang et al. in [10]. The photometric (camera) response function is applied to model the changes in image pixel intensity that result from changes in exposure. A gradient measure that is smoothly differentiable with respect to exposure is used in conjunction with the camera response function to determine the best exposure adjustment. The exposure is only adjusted in the direction of the (estimated) optimal value, however, rather than directly to the optimal setting as in [11]. Additionally, photometric calibration must be carried out to determine the camera response function [13], which may be inconvenient or impossible in many situations.
In [12], gradient magnitudes and the Shannon entropy of the image are combined to form a quality metric. Adjustments to camera exposure are made via Bayesian optimization by sampling sparsely from the parameter space. The sampling strategy involves acquiring images at various 'test' exposure values, which may result in poor-quality intermediate images. Further, the method in [12] is reactive and the optimization process requires significant time. During rapid environmental lighting changes, prior queries of the objective surface are no longer reliable and the rate at which 'optimal' images can be obtained is significantly reduced. These factors limit the applicability of the approach for real-time scenarios, particularly in dynamic environments.
Although the adjustment of a single camera parameter reduces algorithm complexity and can, in some cases, result in improved image quality, the lack of gain control in [10]–[12] is a significant drawback. For example, increases in image brightness can only be achieved through increases in exposure, which also contributes to motion blur and other detrimental effects. To acquire high-quality images in a variety of conditions, both camera gain and exposure must be controlled.
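To illustrate the style of gradient-based measure used in [11], the sketch below scores synthetic gamma-corrected variants of the current image by their total gradient magnitude. The specific gamma values and the simple sum-of-gradient-magnitudes form are our own illustrative assumptions, not the exact weighting proposed by Shim et al.

```python
import cv2
import numpy as np

def gradient_metric(image_gray):
    """Total gradient magnitude, a simple stand-in for the gradient-based
    quality measure of [11] (which uses a more elaborate weighting)."""
    img = image_gray.astype(np.float32)
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
    return float(np.sum(np.sqrt(gx ** 2 + gy ** 2)))

def best_gamma(image_gray, gammas=(0.5, 0.7, 1.0, 1.5, 2.0)):
    """Score synthetic gamma-corrected counterparts of the captured image and
    return the gamma that maximizes the gradient metric; the chosen gamma then
    indicates the direction and size of the exposure update."""
    norm = image_gray.astype(np.float32) / 255.0
    scores = {g: gradient_metric((norm ** g) * 255.0) for g in gammas}
    return max(scores, key=scores.get)
```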
B. Gain and Exposure Control
The simultaneous adjustment of gain and exposure provides more flexibility to improve image quality under a wider range of conditions. In [2], Lu et al. develop an algorithm to adjust gain and exposure to maximize the Shannon entropy of acquired images. The authors suggest that the increased entropy will result in images that contain more useful information. The optimization of gain and exposure in [2] is sampling-based: a number of images must be captured over time with varied gain and exposure settings to determine the values that maximize the image entropy. This approach, however, is intended for use under static lighting conditions and breaks down when the lighting changes dynamically.
More recently, in [14] Shin et al. propose to maximize an image quality metric that is a combination of the Shannon entropy, the gradient metric from [11], and a noise quantification measure. Gain and exposure are independently adjusted through a Nelder-Mead simplex optimization that requires sampling of real images. As in [2], this method is only intended for operation under static or slowly-varying lighting conditions and can fail when lighting changes occur quickly.
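For reference, the Shannon entropy objective maximized in [2] (and used as one component of the metric in [14]) can be computed directly from the intensity histogram. This is a bare-bones sketch assuming an 8-bit grayscale image, not the authors' exact implementations.

```python
import numpy as np

def image_entropy(image_gray):
    """Shannon entropy (in bits) of the 8-bit intensity histogram."""
    hist, _ = np.histogram(image_gray, bins=256, range=(0, 256))
    p = hist.astype(np.float64) / max(hist.sum(), 1)
    p = p[p > 0]  # drop empty bins so log2 is well defined
    return float(-np.sum(p * np.log2(p)))
```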
Fig. 2: The structure of our predictive gain and exposure control network. The network takes as input a sequence of images {I_t, I_{t-1}, I_{t-2}} and the corresponding gain {G_t, G_{t-1}, G_{t-2}} and exposure {E_t, E_{t-1}, E_{t-2}} values, and outputs the next gain Ĝ_{t+1} and exposure Ê_{t+1} settings that predictively maximize the number of inlier feature matches in future image frames. [Diagram: a 15-channel 3×(RGB+E+G) input at 224×224 passes through Conv1 (5×5), Conv2 (3×3), Conv3 (3×3), and Conv4 (7×7) blocks (Conv+Pool+BN+ReLU / Conv+BN+ReLU), followed by FC1, FC2, and FC3 (Linear+BN+ReLU) and an output layer producing Ê_{t+1}, Ĝ_{t+1}.]

Unlike these existing methods [2], [10]–[12], [14], we avoid the pitfalls of reactive, sampling-based techniques through predictive, data-driven learning. At test time, our approach makes direct and immediate adjustments to both gain and exposure without the need for iteration.

III. METHODOLOGY
We train a deep CNN to predictively adjust camera gain and exposure settings in real time. Our training approach is self-supervised: we rely on an existing VO front end to extract and match features across consecutive images, and use as training targets the parameter settings that correspond to images containing a high number of features and inlier feature matches.
A. System Overview
Our network architecture is shown in Figure 2. The network is built from a series of four convolutional blocks that incorporate max-pooling [15], followed by three fully-connected layers. All layers, except the output, make use of batch normalization [16] and ReLU activation functions [17]. The network takes as input images captured over the last three time steps, {I_t, I_{t-1}, I_{t-2}}, as well as the corresponding camera parameter settings. The images are downsampled to a lower resolution and the gain {G_t, G_{t-1}, G_{t-2}} and exposure {E_t, E_{t-1}, E_{t-2}} values at each acquisition are concatenated as additional input channels to create a 15-channel input. The gain and exposure are linearly scaled to the image intensity range and assigned to every pixel in the corresponding input channel. (Concatenation is a straightforward way to provide gain and exposure values to the network—we plan to explore more efficient network architectures in future work.) We make use of sequential input images to ensure that temporal information about the scene and any lighting changes is available to the network; the inclusion of the gain and exposure settings allows the network to decouple the changes due to varying parameter settings from changes due to varying external illumination. The network outputs the next gain Ĝ_{t+1} and exposure Ê_{t+1} settings to be sent to the camera.
Our network is trained with target gain and exposure values, G*_i and E*_i, respectively, for the i-th training sample. The batch loss for N training samples is calculated as the weighted combination of the gain and exposure losses, where the ε value can be tuned to specify the importance of each parameter:

L = \frac{\varepsilon}{N} \sum_{i=1}^{N} |\hat{G}_i - G_i^*| + \frac{1-\varepsilon}{N} \sum_{i=1}^{N} |\hat{E}_i - E_i^*|.   (1)

To ensure that the two terms equally contribute to the loss, we rescale the gain and exposure targets, which have an allowable range of 0–30 dB and 75 µs–30 ms, respectively, to be within the range [0, 1]. At test time, we clamp the (unrestricted) network outputs to be within [0, 1], and then invert the scaling as a final step. Through empirical testing, we determined that regressing the absolute parameter values resulted in better performance than regressing a scaling or additive term applied to the current parameter values.
The targets in Equation (1) can be generated to suit any task-specific problem by selecting parameter values that maximize a specific objective function (or minimize a specific loss function). Herein, we develop a data labelling procedure that identifies gain and exposure settings that lead to images with a high number of identifiable features and inlier feature matches. Our motivation for this choice is discussed in the next section.
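To make the architecture and the loss in Equation (1) concrete, the following is a minimal PyTorch sketch. The kernel sizes and block structure follow Figure 2, but the channel widths, fully-connected layer sizes, pooling placement, and the framework itself are our own assumptions rather than our released implementation.

```python
import torch
import torch.nn as nn

class GainExposureNet(nn.Module):
    """Sketch of the controller network in Fig. 2: four convolutional blocks
    followed by three fully-connected layers, mapping a 15-channel input
    (3 x (RGB + exposure + gain channels) at 224 x 224) to the next gain and
    exposure, both expressed in [0, 1]."""

    def __init__(self):
        super().__init__()

        def block(c_in, c_out, k):
            # Conv + max-pool + batch norm + ReLU (pooling in every block is
            # an assumption; Fig. 2 shows pooling only in some blocks).
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, k, padding=k // 2),
                nn.MaxPool2d(2),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        self.conv = nn.Sequential(
            block(15, 32, 5),   # Conv1, 5x5
            block(32, 64, 3),   # Conv2, 3x3
            block(64, 64, 3),   # Conv3, 3x3
            block(64, 32, 7),   # Conv4, 7x7
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 14 * 14, 256), nn.BatchNorm1d(256), nn.ReLU(inplace=True),
            nn.Linear(256, 64), nn.BatchNorm1d(64), nn.ReLU(inplace=True),
            nn.Linear(64, 2),   # output layer: [gain, exposure] in [0, 1]
        )

    def forward(self, x):              # x: (N, 15, 224, 224)
        return self.fc(self.conv(x))   # (N, 2)

def gain_exposure_loss(pred, target, eps):
    """Weighted L1 loss of Equation (1); pred and target are (N, 2) tensors
    holding [gain, exposure] rescaled to [0, 1], and eps trades off the two
    terms."""
    gain_term = torch.mean(torch.abs(pred[:, 0] - target[:, 0]))
    exp_term = torch.mean(torch.abs(pred[:, 1] - target[:, 1]))
    return eps * gain_term + (1.0 - eps) * exp_term
```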
B. Feature Matching as a Proxy for VO Performance

Most techniques that seek to optimize VO performance attempt to minimize pose estimation error. However, identifying the correct camera parameters that result in the highest pose estimation accuracy is, in many situations, intractable. The real-world image acquisition process is not differentiable; we cannot employ backpropagation to update the camera parameters from captured images. Further, the majority of state-of-the-art VO pipelines are also not fully differentiable, and it is not obvious how to determine what portion of the pose estimation error should be attributed to gain and exposure adjustments.
Instead of attempting to minimize pose estimation error directly, we follow the approach pioneered by Clement et al. in [4] and choose to maximize a proxy measure of VO performance. Our proxy measure is a combination of the number of features found in the most recently captured image and the number of inlier feature matches between sequential images (including the most recent image). This approach naturally admits a self-supervised training methodology (described in Section III-D) in which we leverage the VO front end itself
to generate our training targets. An additional advantage is that there is no requirement that the underlying pipeline be differentiable. Although existing work has shown that good VO accuracy can be achieved using small sets of features in specific instances [18], we note that, in general, having more features increases VO performance and robustness (experimental evidence for improved pose estimation accuracy with a larger number of features is provided in [19]).

C. Dataset Collection
We use a sampling-based data collection procedure to identify the parameter values that maximize our proxy measure. However, unlike existing camera parameter controllers [2], [11], [12], [14], no sampling is required at test time; image sampling is only carried out as part of the training process. Further, our sampling strategy involves capturing images while moving, which ensures that we account for motion blur due to changes in exposure time.
Our unique sampling approach leverages a dual-camera configuration, with two identical cameras mounted side-by-side (see Figure 4b). Although this technique requires additional hardware, it enables us to sample twice at each camera pose. While the number of samples from the parameter space at each pose is reduced compared to [2] and [11], we acquire images at a relatively high frame rate (e.g., 15 Hz), which allows for a wide range of parameter values to be sampled within a short amount of time. We describe the sampling and data labelling process in more detail in Section III-D.
To ensure that we effectively sample values that are near, in the majority of cases, to the optimal region of the parameter space, we use an 'informed' sampling approach. Namely, we sample around a reference set of gain and exposure values that already produce satisfactory images, rather than sampling randomly over the entire parameter space. When 'better' quality images (i.e., those having more features or inlier feature matches) are found after perturbing the reference parameter values, the new parameter values are used for data labelling. Otherwise, the reference parameters are used. To generate perturbed settings at each camera pose, the reference parameters (G_t and E_t) from Camera 1 are independently multiplied by a random scaling factor. The perturbed parameter values are then applied to acquire an image with Camera 2. We found that applying a scaling factor of 1 ± [0, …], where the closed interval is sampled uniformly, balances exploration of the parameter space with the possibility of sampling from sub-optimal regions (a sketch of this perturbation step is given at the end of this subsection). In the case of G_t = …
An initial set of training data, collected using the built-in auto-gain and auto-exposure controller (hereafter AG+AE) as the reference, can be used to generate preliminary training targets through the data labelling procedure outlined in Section III-D. These targets are then used to train our gain and exposure control network. The AG+AE reference settings, however, may not always be near the 'optimal' parameter values, causing sub-optimal labels to be generated with a higher probability. Therefore, an iterative data collection approach is used: after training our network with the initial labels, we replace the AG+AE reference with the trained network, and collect additional (perturbed) data. Since the trained network yields parameter settings that are closer to the 'optimal' values relative to the AG+AE controller, perturbing around this new operating point is more likely to yield even higher-quality images (i.e., improved training targets). (We note the similarity of our technique to reinforcement learning and describe these similarities and differences in Section VI.)
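A minimal sketch of the perturbation step used during data collection follows. The bound max_offset stands in for the interval bound that is elided in the text above, and clamping the perturbed values to the allowable camera ranges is our own assumption.

```python
import random

GAIN_RANGE = (0.0, 30.0)         # dB
EXPOSURE_RANGE = (75e-6, 30e-3)  # seconds (75 microseconds to 30 ms)

def perturb(reference, valid_range, max_offset):
    """Multiply a reference parameter value by a random scaling factor of
    1 +/- u, with u drawn uniformly from [0, max_offset]; the clamp to the
    allowable camera range is an assumption, not stated in the text."""
    scale = 1.0 + random.uniform(-max_offset, max_offset)
    return min(max(reference * scale, valid_range[0]), valid_range[1])

def perturbed_settings(gain_ref, exposure_ref, max_offset):
    """Camera 1 runs the reference controller; Camera 2 applies these
    independently perturbed gain and exposure settings."""
    return (perturb(gain_ref, GAIN_RANGE, max_offset),
            perturb(exposure_ref, EXPOSURE_RANGE, max_offset))
```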
D. Self-Supervised Data Labelling

In our dataset, the gain and exposure targets G*_t and E*_t are generated for an image at time step t by analyzing the images in a window of future times {t+1, t+2, t+3, t+4} and finding the parameters that produce the image (and image pair) that maximizes our feature count and inlier matches metrics. We consider both the next image and images farther in the future so that there is a larger number of samples (and possible parameter values) to choose from. A window of four poses is the minimum size required to sample from the four quadrants of the parameter space (as described in Section III-C). Although the number of inlier feature matches in the future images may not represent the true number that would exist if we could sample repeatedly from the same pose (because the scene changes slightly with time), the windowed approach is a close approximation when the camera frame rate is reasonably high.
Concretely, for all images within a window, we determine the number of image features, M_feat(I^(i)_t), and for all sequential image pairs within the window, we determine the number of inlier feature matches, M_match(I^(i)_t, I^(j)_{t+1}), where i, j ∈ {1, 2} correspond to images from Camera 1 (reference) or Camera 2 (perturbed). We select the images that maximize these metrics, and use the corresponding parameter values as our training targets. Finally, we consider an additional metric, M_hybrid, which is a combination of the two metrics above.
1) Generating Labels with M_feat: The target gain and exposure values that maximize the M_feat metric, G*_{t,feat} and E*_{t,feat}, are straightforward to acquire in general. At time step t, each image I^(i)_{t+a} from camera i ∈ {1, 2} in the window of future time steps a ∈ {1, 2, 3, 4} is processed using a feature detection algorithm. The number of features in each image is counted and we select the image I^(i*)_{t+a*}, where i* is the index of the camera that produced the image at time t + a* with the maximal M_feat score:

\{i^*, a^*\} = \mathrm{argmax}_{i,a} \; M_{\mathrm{feat}}(I_{t+a}^{(i)}).   (2)
Fig. 3: Our sampling-based method for generating training labels G*_{t,match} and E*_{t,match} using the M_match method. With two cameras, four sampled image pairs (and corresponding sets of parameters) are generated at each time step: (I^(1)_t, I^(1)_{t+1}), (I^(1)_t, I^(2)_{t+1}), (I^(2)_t, I^(1)_{t+1}), and (I^(2)_t, I^(2)_{t+1}).

The gain and exposure values, G^(i*)_{t+a*} and E^(i*)_{t+a*}, respectively, used to acquire I^(i*)_{t+a*} are obtained and selected as the training target for time step t:

G_{t,\mathrm{feat}}^* = G_{t+a^*}^{(i^*)}, \quad E_{t,\mathrm{feat}}^* = E_{t+a^*}^{(i^*)}.   (3)
2) Generating Labels with M_match: Obtaining the target camera parameters for the M_match metric, G*_{t,match} and E*_{t,match}, involves performing multiple matching steps over the window of images (see Figure 3). Features from both images I^(i)_{t+b} captured by cameras i ∈ {1, 2} are matched with features from both images I^(j)_{t+b+1} captured by cameras j ∈ {1, 2}, for a particular pair of sequential time steps, b ∈ {0, 1, 2, 3}. We select the image pair (I^(i*)_{t+b*}, I^(j*)_{t+b*+1}) that results in the maximal M_match score:

\{i^*, j^*, b^*\} = \mathrm{argmax}_{i,j,b} \; M_{\mathrm{match}}(I_{t+b}^{(i)}, I_{t+b+1}^{(j)}).   (4)

Note that the selected image pair may include images acquired from the same camera (i = j) or different cameras (i ≠ j), depending on which combination contains more inlier feature matches. The gain and exposure values, G^(j*)_{t+b*+1} and E^(j*)_{t+b*+1}, used to acquire the image with camera j* at time step t + b* + 1 are selected as the training target for time step t:

G_{t,\mathrm{match}}^* = G_{t+b^*+1}^{(j^*)}, \quad E_{t,\mathrm{match}}^* = E_{t+b^*+1}^{(j^*)}.   (5)
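Functionally, Equations (2)–(5) reduce to an argmax over the images (and sequential image pairs) in the look-ahead window. The sketch below captures only that selection logic; the window/record layout and the scoring callables (e.g., a feature counter and an inlier-match counter such as the ORB routine sketched in Section IV-D) are our own assumptions about how the bookkeeping might be organized.

```python
def feat_targets(window, count_features):
    """Equations (2)-(3): window is a list of per-time-step dicts (t+1..t+4),
    each mapping a camera index (1 or 2) to a record with keys 'image',
    'gain', and 'exposure'. Returns the parameters of the single image with
    the largest feature count."""
    best = max((rec for frame in window for rec in frame.values()),
               key=lambda rec: count_features(rec['image']))
    return best['gain'], best['exposure']

def match_targets(frames, count_inlier_matches):
    """Equations (4)-(5): frames is a list of per-time-step dicts covering
    t..t+4. Over every sequential time-step pair and every camera
    combination, find the pair with the most inlier matches and return the
    gain/exposure of its second (later) image."""
    best_score, best_rec = -1, None
    for first, second in zip(frames[:-1], frames[1:]):
        for rec_i in first.values():
            for rec_j in second.values():
                score = count_inlier_matches(rec_i['image'], rec_j['image'])
                if score > best_score:
                    best_score, best_rec = score, rec_j
    return best_rec['gain'], best_rec['exposure']
```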
3) Combination of Feature Metrics:
The M_feat metric generally yields images that are bright and well-exposed, as such images typically contain the most features. Consequently, this metric is well-suited for generating targets across lighting transitions. However, due to the sparsity of parameter sampling in our dataset, use of this metric can result in sequential targets that are quite variable. This situation occurs in particular under static lighting conditions and may lead to a low number of feature matches between consecutive frames. Conversely, the M_match metric generally yields images that have relatively consistent gain and exposure settings across frames. Consequently, this metric is well-suited for generating training targets under static conditions. Generating targets with M_match, however, is less suitable during lighting transitions, as the metric favours maintaining existing feature tracks for longer durations. In turn, the changes to gain and exposure are small, and fewer new features can be found because image regions may become over- or underexposed. We aim to balance the responsiveness of the M_feat metric and the stability of the M_match metric through the use of a combined metric, designated as M_hybrid. The gain and exposure values selected using the M_hybrid metric are the weighted average of the parameter values obtained using the M_feat and M_match metrics.
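The hybrid target is therefore a simple weighted average of the two parameter targets; a short sketch is below, where the weight w is a tunable assumption (w = 0.5 corresponds to the equal weighting used for our training set, as noted in Section IV-C).

```python
def hybrid_targets(feat_target, match_target, w=0.5):
    """Weighted average of the (gain, exposure) targets from M_feat and
    M_match; w = 0.5 gives an equal weighting of the two metrics."""
    g_feat, e_feat = feat_target
    g_match, e_match = match_target
    return (w * g_feat + (1.0 - w) * g_match,
            w * e_feat + (1.0 - w) * e_match)
```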
IV. EXPERIMENTS

In this section we describe our hardware platform and provide details about the environments in which we collected training data and obtained our experimental results. We then discuss the specifics of our training process and explain how we evaluated the performance of our network.
A. Hardware and Configuration
Our dual-camera setup consisted of two FLIR Blackfly S U3-31S4C machine vision cameras, each with a Fujinon 6.23 mm focal length C-mount lens, mounted side-by-side on a rigid platform on the roof of our test vehicle (Figure 4a) in a fronto-parallel configuration (with a baseline of 3.92 cm), as shown in Figure 4b. Images were captured synchronously from both cameras. Image capture and network processing were carried out using a Lenovo Legion Y730 laptop with an Intel i7-8750H CPU (2.20 GHz) and an NVIDIA GeForce GTX 1050 Ti GPU. With this hardware, the maximum input processing rate of the network was 640 Hz. In practice, we were limited by the rate of image acquisition (∼15 Hz).
B. Data Collection and Experiment Environments
One appreciable challenge with online parameter adjustment is that the performance of the controller cannot be evaluated with previously-captured data. This issue arises because changes to the parameter settings affect the image acquisition process itself. Furthermore, training data cannot be accurately simulated. Thus, our data and results were obtained from driving in real-world conditions in all cases.
We drove our test vehicle on roads with several tunnels in the cities of London and Toronto, Ontario, Canada, under a range of outdoor illumination conditions (bright sun, low-level clouds, etc.). Since changes in brightness of up to 120 dB may occur during outdoor tunnel transitions [20], tunnels are ideal environments for stress testing our predictive parameter controller.
For our training dataset collection, we selected a closed route in London (see Figure 4c) that contains two tunnel passages (Figure 4d), each of roughly eighty metres in length. Additionally, the route contains straight sections of unobstructed road where lighting conditions are relatively constant.
Fig. 4: Photographs of the experimental setup and of the London, Ontario validation road route (which included two tunnels): (a) data collection vehicle; (b) dual camera rig; (c) London route aerial view; (d) London tunnel entrance.
Our training data consisted of trajectories acquired along the above route as well as on a variety of other roadways in London. After training, we experimentally validated the performance of the network on the closed route. Our held-out test environment consists of a highway exit ramp tunnel in Toronto, for which the structure, appearance, and length are significantly different from the training or validation trajectories. Overall, we collected eight validation sequences and two test sequences, for a total of 18 tunnel traverses.
C. Training and Dataset Details
Our training dataset consists of a total of fifty-five sequences containing 58,782 RGB images acquired at a resolution of 2048 × …. Training targets were generated using the M_hybrid metric, with an equal weighting between the maximized M_feat and M_match metrics. Training images (seen in Figure 2) were downsampled to a resolution of 224 × 224 pixels. Each training sample required three sequential input images: to generate the sample, we selected one of the two available images (i.e., from either camera) at each of the three time steps (yielding eight possible sequences). We maximized our data efficiency by using all eight combinations for training, resulting in N = …,934 total training samples; this also ensured that the network was trained with images acquired with a wide range of gain and exposure combinations. Our network was trained for 200 epochs using the Adam optimizer [21] with a batch size of 64 and a learning rate of 1 × 10^{−…}. We also made use of dropout [22] (p = 0.4) in each layer to improve generalization. We selected ε = ….

D. Network Evaluation
To evaluate the performance of our network-based controller, we conducted an extensive set of real-world experiments, where we measured the number of inlier feature matches in images acquired using our method. We compared the performance of our controller to both built-in AG+AE and the method of Shin et al. [14], and also investigated combining these controllers with a type of illumination-invariant image transformation.
Our validation experiments involved capturing full traversals of the selected route in London (Figure 4c), starting and ending at roughly the same pose. We subdivided each sequence into two categories: 'dynamic', which corresponds to tunnel regions, and 'static', and compared the performance of the controllers on each subsequence. For every route traversal, we selected two dynamic sections and two static sections. Our test-time experiment involved acquiring images while driving into and out of a highway tunnel in Toronto. In all cases, the captured images were processed using the OpenCV ORB [8] and libviso2 [23] feature matching algorithms (a minimal sketch of this matching step is given at the end of this subsection). Additionally, we repeated the matching experiments after transforming all images using the "SumLog" image transformation outlined in [4]. We tuned the transformation parameters using a subset of our training dataset by selecting the parameter values that yielded the maximum number of feature matches between sequential image pairs.
To evaluate the success of our method, we recorded the median number of inlier feature matches (median NFM) and the minimum number of inlier feature matches (minimum NFM) on a per-image basis over sections of the recorded trajectories. The minimum NFM is an important performance measure because VO may fail in cases where the NFM is too low. A low minimum NFM typically occurs when a series of consecutive frames are over- or underexposed. The median NFM provides an indication of the expected 'average' performance of VO over the trajectory.
Finally, we sought to determine if our method improves the overall robustness of VO. To do so, we processed the images from the validation and test sequences using ORB-SLAM2 [24], which fails when images do not contain sufficient numbers of matchable features (e.g., in our case, images acquired near tunnel entrances and exits). We recorded the number of sequences for which ORB-SLAM2 was able to maintain successful tracking throughout. We expect that, within dynamic lighting regions, images acquired using our network-based controller will yield more reliable ORB-SLAM2 outputs compared to images acquired using the built-in AG+AE controller.
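As a sketch of the per-image-pair matching used to compute the NFM scores, the routine below counts geometrically verified ORB matches with OpenCV. The detector settings, ratio test, and RANSAC threshold are illustrative assumptions rather than our exact evaluation configuration.

```python
import cv2
import numpy as np

def inlier_feature_matches(img_a, img_b, n_features=1000):
    """Count inlier ORB feature matches between two (grayscale) images."""
    orb = cv2.ORB_create(nfeatures=n_features)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0

    # Brute-force Hamming matching with a Lowe-style ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = matcher.knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.8 * p[1].distance]
    if len(good) < 8:
        return 0

    # Geometric verification: inliers to a RANSAC-estimated fundamental matrix.
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    _, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.99)
    return int(mask.sum()) if mask is not None else 0
```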
V. RESULTS
Fig. 5: A sequence of images acquired at test time during a transition out of a tunnel along the Toronto route. Our approach (bottom row) compensates for the drastic change in lighting and maintains a higher number of inlier feature matches across all images during the transition when compared with AG+AE (top row).

We summarize our feature matching results for the validation sequences in Table I and the test sequences in Table II. Under static lighting conditions, all of the parameter controllers were able to obtain large median and minimum NFM scores; however, our network generally produced images containing more matchable features. The advantages of our approach are most noticeable in the dynamic experiments. Here, our network obtained significantly higher median and minimum NFM scores, especially compared with the Shin algorithm [14], which failed to find suitable parameter settings during the fast tunnel transitions.
In the SumLog transformation (SL) image experiments, our method yielded higher median and minimum inlier feature matches under dynamic lighting conditions compared with AG+AE (SL), despite these scores being lower than those for the untransformed images. We attribute this in part to the SL transform being suited for matching across extreme appearance changes, rather than matching across sequential images that are already relatively similar. Our results show that existing post-processing techniques cannot recover information that is lost due to image saturation. Instead, a method such as ours that ensures appropriate parameter settings are used during image capture is vital for acquiring high-quality, matchable images.
Examples of the operation of our controller during both a validation sequence and a test sequence tunnel transition are shown in Figure 6.
Fig. 6: ORB and libviso2 inlier feature match statistics plotted over time for a London validation sequence (left) and a Toronto test sequence (right) tunnel transition. The corresponding per-frame gain and exposure settings are also shown. Our method preemptively adjusts gain before entering and exiting the tunnel and maintains a lower exposure setting, leading to a reduction in blur and more inlier feature matches compared with AG+AE.

TABLE I: A comparison of our network with AG+AE (six traversals) and the Shin method [14] (two traversals) over the London validation sequences. We recorded the median NFM and minimum NFM, averaged across all sequences, with an average of ∼250 images in each dynamic sequence and ∼225 images in each static sequence.
Lighting   Method        Median NFM            Minimum NFM
                         ORB     libviso2      ORB     libviso2
Static     AG+AE         593     8301          366     5840
           Ours          600     9591          401     6636
           Shin [14]     414     5295          183     3555
           Ours          439     5666          200     4032
           AG+AE (SL)    436     …             …       …
           Ours (SL)     …       …             …       …
Dynamic    AG+AE         393     5350          54      301
           Ours          395     7380          98      421
           Shin [14]     131     614           0       0
           Ours          202     3530          37      127
           AG+AE (SL)    …       …             …       …
           Ours (SL)     264     3680          40      385
TABLE II: A comparison of inlier feature match statistics for our network compared with AG+AE for the Toronto test sequences.

Lighting   Method        Median NFM            Minimum NFM
                         ORB     libviso2      ORB     libviso2
Dynamic    AG+AE         323     6035          63      245
           Ours          357     10229         96      305
           AG+AE (SL)    198     2684          17      154
           Ours (SL)     279     4927          38      290

During the transition into the tunnel, beginning at approximately frame 40 (left plots) and approximately frame 20 (right plots), our network preemptively increased the gain value and, to a lesser extent, the exposure, relative to AG+AE. During the transition out of the tunnel, around frames 90–110 (left plots) and frames 80–100 (right plots), our network preemptively reduced the gain and exposure relative to AG+AE, resulting in images containing less motion blur and more visible details outside of the tunnel. Consequently, there was a dramatic increase in both ORB and libviso2 inlier matches in this region (see Figure 1, lower row). Examples of images acquired during a Toronto test sequence tunnel exit are shown in Figure 5.
ORB-SLAM2 was able
IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED FEBRUARY, 2021
TABLE III: The number of experiments in which
ORB-SLAM2
VOsuccessfully maintained tracking.
VO Tracking SuccessesMethod Validation Trials (London) Test Trials (Toronto)AG+AE 0/6 0/2Ours to successfully track across all validation and test sequenceswhen our network was employed to control the camera.Conversely,
ORB-SLAM2 consistently failed when the built-inAG+AE controller was used instead. Here, a failure was iden-tified if, at any point in the sequence,
ORB-SLAM2 reportedthat too few features matches were available to compute apose change estimate. Notably, our analysis revealed that, inall cases, tracking failed because the AG+AE images wereoverexposed during transitions out of tunnels.VI. C
ONCLUSIONS AND F UTURE W ORK
The predictive adjustment of camera gain and exposure settings can improve the quality of acquired images for use in visual navigation. We demonstrated that our CNN, trained in a self-supervised manner with targets generated from a feature-based VO front end, selects camera parameters that result in images containing significantly more features and sequential inlier feature matches compared with reactive algorithms, under static and dynamic lighting conditions. Our network can predict changes in lighting due to an approaching tunnel entrance or exit, for example, and compensate for these changes by adjusting gain and exposure preemptively. We verified that the increased number of inlier matches due to preemptive adjustment improves the robustness of feature-based VO. Results in the literature also indicate that adjustments which improve feature matching are likely to benefit other vision tasks as well [11], [14].
Although our predictive controller works well under static and dynamic conditions, we expect that our results could be further improved if a more sophisticated sampling strategy were employed, or better yet, if more images could be obtained from the same (or nearby) camera poses. We relied on the number of features and inlier feature matches as a proxy for feature-based VO performance, but our network could also be trained with a different loss function. For example, a dense photometric loss would improve the quality of images for use in direct VO. Alternatively, improvements could possibly be obtained by framing camera parameter control as a reinforcement learning (RL) problem. The iterative component of our technique is similar in spirit to RL, in the sense that we are learning a policy and then iteratively updating the policy to increase the 'reward' (the number of features and feature matches). In practice, we developed our unique method because it was infeasible to train our network with standard RL algorithms (due to the large state space of images and to challenges involving simulation). However, these challenges may be overcome; we leave the implementation of an RL-based framework as future work.
REFERENCES

[1] J. Kim and A. Kim, "Light condition invariant visual SLAM via entropy based image fusion," in Int. Conf. Ubiquitous Robots and Ambient Intelligence (URAI), 2017, pp. 529–533.
[2] H. Lu, H. Zhang, S. Yang, and Z. Zheng, "Camera parameters auto-adjusting technique for robust robot vision," in IEEE Intl. Conf. Robotics and Automation (ICRA), 2010, pp. 1518–1523.
[3] L. Clement and J. Kelly, "How to train a CAT: Learning canonical appearance transformations for direct visual localization under illumination change," IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 2447–2454, 2018.
[4] L. Clement, M. Gridseth, J. Tomasi, and J. Kelly, "Learning matchable image transformations for long-term metric visual localization," IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 1492–1499, 2020.
[5] R. Gomez-Ojeda, Z. Zhang, J. Gonzalez-Jimenez, and D. Scaramuzza, "Learning-based image enhancement for visual odometry in challenging HDR environments," in IEEE Intl. Conf. Robotics and Automation (ICRA), 2018, pp. 805–811.
[6] H. Porav, W. Maddern, and P. Newman, "Adversarial training for adverse conditions: Robust metric localisation using appearance transfer," in IEEE Intl. Conf. Robotics and Automation (ICRA), 2018, pp. 1011–1018.
[7] S. Park, T. Schöps, and M. Pollefeys, "Illumination change robustness in direct visual SLAM," in IEEE Intl. Conf. Robotics and Automation (ICRA), 2017, pp. 4523–4530.
[8] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in IEEE Intl. Conf. Computer Vision (ICCV), 2011, pp. 2564–2571.
[9] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, "LIFT: Learned invariant feature transform," in European Conf. Computer Vision (ECCV), 2016, pp. 467–483.
[10] Z. Zhang, C. Forster, and D. Scaramuzza, "Active exposure control for robust visual odometry in HDR environments," in IEEE Intl. Conf. Robotics and Automation (ICRA), 2017, pp. 3894–3901.
[11] I. Shim, J. Lee, and I. S. Kweon, "Auto-adjusting camera exposure for outdoor robotics using gradient information," in IEEE/RSJ Intl. Conf. Intelligent Robots and Systems (IROS), 2014, pp. 1011–1017.
[12] J. Kim, Y. Cho, and A. Kim, "Exposure control using Bayesian optimization based on entropy weighted image gradient," in IEEE Intl. Conf. Robotics and Automation (ICRA), 2018, pp. 857–864.
[13] P. E. Debevec and J. Malik, "Recovering high dynamic range radiance maps from photographs," in Conf. on Computer Graphics and Interactive Techniques, ser. SIGGRAPH '97, 1997, pp. 369–378.
[14] U. Shin, J. Park, G. Shim, F. Rameau, and I. S. Kweon, "Camera exposure control for robust robot vision with noise-aware image quality assessment," in IEEE/RSJ Intl. Conf. Intelligent Robots and Systems (IROS), 2019, pp. 1165–1172.
[15] Y. Zhou and R. Chellappa, "Computation of optical flow using a neural network," in IEEE Int. Conf. Neural Networks (ICNN), 1988, pp. 71–78.
[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Intl. Conf. Machine Learning (ICML), 2015, pp. 448–456.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[18] I. Cvišić, J. Ćesić, I. Marković, and I. Petrović, "SOFT-SLAM: Computationally efficient stereo visual simultaneous localization and mapping for autonomous unmanned aerial vehicles," J. Field Robotics, vol. 35, no. 4, pp. 578–595, 2018.
[19] J. Tomasi, "Learned adjustment of camera gain and exposure time for improved visual feature detection and matching," M.A.Sc. thesis, University of Toronto, Toronto, Ontario, Canada, 2020.
[20] J. Westerhoff, M. Meuter, and A. Kummert, "A generic parameter optimization workflow for camera control algorithms," in IEEE Conf. Intelligent Transportation Systems (ITSC), 2015, pp. 944–949.
[21] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Intl. Conf. Learning Representations (ICLR), 2015.
[22] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014.
[23] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3D reconstruction in real-time," in IEEE Intelligent Vehicles Symp. (IV), 2011, pp. 963–968.
[24] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, 2015.