Matching Representations of Explainable Artificial Intelligence and Eye Gaze for Human-Machine Interaction
Tiffany Hwu, [email protected], HRL Laboratories, LLC, Malibu, CA, USA
Mia Levy, [email protected], HRL Laboratories, LLC, Malibu, CA, USA
Steven Skorheim, [email protected], HRL Laboratories, LLC, Malibu, CA, USA
David Huber, [email protected], HRL Laboratories, LLC, Malibu, CA, USA
ABSTRACT
Rapid non-verbal communication of task-based stimuli is a challenge in human-machine teaming, particularly in closed-loop interactions such as driving. To achieve this, we must understand the representations of information for both the human and machine, and determine a basis for bridging these representations. Techniques of explainable artificial intelligence (XAI) such as layer-wise relevance propagation (LRP) provide visual heatmap explanations for high-dimensional machine learning techniques such as deep neural networks. On the side of human cognition, visual attention is driven by the bottom-up and top-down processing of sensory input related to the current task. Since both XAI and human cognition should focus on task-related stimuli, there may be overlaps between their representations of visual attention, potentially providing a means of nonverbal communication between the human and machine. In this work, we examine the correlations between LRP heatmap explanations of a neural network trained to predict driving behavior and eye gaze heatmaps of human drivers. The analysis is used to determine the feasibility of using such a technique for enhancing driving performance. We find that LRP heatmaps show increasing levels of similarity with eye gaze according to the task specificity of the neural network. We then propose how these findings may assist humans by visually directing attention towards relevant areas. To our knowledge, our work provides the first known analysis of LRP and eye gaze for driving tasks.
KEYWORDS
human machine teaming, explainable artificial intelligence, XAI, layer-wise relevance propagation, LRP, eyetracking, driving, attention
INTRODUCTION
Human cognition is flexible and adaptive, learning from past experiences and generalizing to novel situations. Machines, on the other hand, can tirelessly process lifetimes of data for pattern recognition. The complementary strengths of humans and machines make for ideal teamwork, but several obstacles exist in connecting the two disparate systems of information processing. A prerequisite for teaming in such situations is an examination of the mental representations of humans and machines to find common ground for communication. When performing certain tasks, the inputs that the human and machine find salient often overlap. Areas of machine salience can be found by explainable AI techniques that explain the inner workings of machine learning algorithms, while areas of salience in human cognition can be found through physiological monitoring techniques such as gaze tracking. The work described in this paper performs such comparisons on a dataset of driver eye gaze, revealing insightful patterns that expand possible applications in adaptive teaming.
BACKGROUND
The term "explainable artificial intelligence" (XAI) refers to methods that explain the processes used by artificial intelligence and machine learning and the results that they produce [9]. Particularly for high-dimensional machine learning methods such as deep learning, the methods are a black box, which often leads to mistrust and concerns in safety-critical applications such as driving. XAI can improve the safety and trustworthiness of AI by showing the user or the developer why the AI is making certain decisions, which allows humans to validate and improve the AI's decisions. One specific method of XAI is known as layer-wise relevance propagation (LRP) [3], which can be applied to any multi-layered classifier, decomposing an output prediction into the contributions of each component of each layer. In the case of image classification, LRP displays a heatmap of which parts of an image contributed the most to the classification.
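As a concrete illustration of the idea, the widely used LRP-epsilon rule (one of several propagation rules introduced in [3]; the specific variant used later in this work is not stated in the text) redistributes the relevance of each neuron j in layer l+1 to the neurons i of layer l in proportion to their contributions:

$$R_i^{(l)} = \sum_j \frac{a_i w_{ij}}{\sum_{i'} a_{i'} w_{i'j} + \epsilon \operatorname{sign}\big(\sum_{i'} a_{i'} w_{i'j}\big)} \, R_j^{(l+1)}$$

Here $a_i$ are the layer's activations, $w_{ij}$ its weights, and $\epsilon$ a small stabilizing term. Summed relevance is approximately conserved from layer to layer, so the relevances that reach the input layer form a pixel-level heatmap over the image.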
Eye tracking is a way of approximating human attention by measuring the focal point of human gaze as the human views an image or performs a task. It is most often done with hardware that detects pupil position using the infrared or visible light spectrum and aligns it with gaze points during a calibration period, after which human gaze can be estimated. The visual features that drive gaze can be divided into bottom-up and top-down categories [4]. Bottom-up features are visually salient and may have high contrast, brightness, or motion/flicker, whereas top-down features are related to the task at hand and are driven by the human's mental models and goals. In addition, top-down features are found when the human is attentive to the task at hand, whereas bottom-up features are more common in inattentive situations. In the case of driving, both types of features are encountered, including unexpected pedestrians and traffic signals.

Similarly, most eye tracking datasets have varying levels of bottom-up and top-down influences, where some are more task-specific and others are recorded on free viewing. For instance, the SALICON dataset is a free viewing dataset collected from the mouse movements of online subjects [11]. There are also other free viewing datasets based on actual human gaze, such as the CAT2000 dataset [5], in which subjects view a collection of indoor and outdoor scenes, line drawings, and low resolution images. For predicting gaze in such free viewing tasks, the DeepGaze II gaze estimation model extracts high-level features found from neural networks trained on image recognition [12]. The DeepGaze II model shows the potential of extracting information from machine learning methods to aid in human state estimation in the domain of free viewing.

The DR(eye)VE dataset includes eye tracking data recorded from several human subjects driving a car in naturalistic environments, including different times of day, settings, and weather conditions [2, 15]. It differs from the previously discussed datasets as it is task-specific: instead of viewing images passively, the subject views a real scene while performing a relevant task. This work uses the DR(eye)VE dataset, which contains over 500,000 frames of data captured from eye tracking glasses in a physical driving setup and matches them with dashcam footage. While there are other eye tracking datasets for driving [20], they are often collected in simulated environments due to the difficulty of collecting and processing data in a naturalistic setting. The DR(eye)VE dataset is one of the most comprehensive datasets yet, and also includes human annotations indicating segments of time in which the driver is attentive towards driving-relevant stimuli or inattentive.

The conjunction of XAI and eyetracking has been rarely explored. [17] compared LRP heatmapping methods to eye gaze for facial expression recognition, finding some correlations between which parts of the face received the most attention and which parts the neural network found relevant. The results of this work indicate some promise in using visual LRP heatmaps as a means for human-machine teaming. Compared to the task of expression recognition, driving requires rapid task-dependent attention and motivates the creation of a closed-loop human-machine teaming system.
As autonomous and semi-autonomous driving research increases, comparisons between humans and machines enable new technological development in this area.

METHODS
The analysis described here centers around the comparison of saliency heatmaps generated by methods that vary along the scale of bottom-up and top-down processing. Each method is described below.
For each dashcam frame in the DR(eye)VE dataset, there exists a corresponding frame indicating the gaze focus of the driver, integrated over a short segment of time before and after the frame. The frames are grayscale with a resolution of 1920x1080, where whiter values represent more salient areas. Each of the other heatmapping methods was compared to this ground truth measurement of gaze.

Layer                    Output Shape   Parameters
input                    1080x1920x3    0
vgg16                    33x60x512      14714688
global average pooling   512            0
dropout                  512            0
dense                    2              1026
Table 1: Neural Network Layers
To form a baseline, a basic method of saliency detection known as the spectral residual [10] was selected. The method is independent of features, categories, and prior knowledge of the image, relying on spatial frequency to find areas of interest. The implementation from the OpenCV library was used to generate grayscale saliency maps of the dashcam frames of the same dimension as the ground truth gaze [6].
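For reference, a minimal sketch of this baseline using OpenCV's saliency module follows; the file names are hypothetical, and the opencv-contrib-python package is assumed:

```python
import cv2

# Spectral residual saliency baseline: feature- and category-independent,
# operating purely on the spatial frequency content of the frame.
frame = cv2.imread("dashcam_frame.png")  # hypothetical dashcam frame

saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
success, saliency_map = saliency.computeSaliency(frame)  # float32 map in [0, 1]

if success:
    # Convert to 8-bit grayscale and match the 1920x1080 ground-truth format.
    heatmap = cv2.resize((saliency_map * 255).astype("uint8"), (1920, 1080))
    cv2.imwrite("spectral_residual_heatmap.png", heatmap)
```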
This work also sought to compare against the heatmaps generated by an existing model for effectively predicting human gaze. The DeepGaze II model was ideal for this, and its implementation was used to generate the saliency maps.
The main heatmaps of interest were those generated by LRP explanations of a deep neural network trained on the dashcam images as input. The default VGG16 network was used as a base model [18], removing the output layer and adding a layer of global average pooling, a dropout layer of rate 0.2, and a dense layer with output size 2. The final network structure is described in Table 1.

80 percent of shuffled frames in the DR(eye)VE dataset were used for training, with the remaining frames used for testing. Using the iNNvestigate toolbox [1], LRP heatmaps of the trained neural networks running inference on each of the testing frames were generated, creating saliency heatmaps of the same format as the other methods. To examine the network at various stages of training, the heatmaps were generated on three versions of the neural network: 1) completely randomized weights according to TensorFlow defaults, 2) the VGG16 portion containing weights pretrained on the ImageNet dataset [7], and 3) the same as 2) but with the non-VGG16 layers fine-tuned to predict the steering and velocity of the vehicle at each frame. The fine-tuning of layers was trained with mean squared error loss to convergence at 5 epochs, batch size 8, and learning rate 1e-3. For 3), the DR(eye)VE dataset did not contain ground truth values for steering and velocity. These values were estimated using monocular visual odometry to find the camera yaw and translation from the current frame to the next frame. For this, the implementation from https://github.com/avisingh599/mono-vo was used, rescaling the yaw and translation values to the range of 0 to 1.

For further analysis on whether any specific objects in the frames received more attention in the LRP heatmaps, Yolov3 was used [16], implemented in TensorFlow 1.0 from the GitHub repository at https://github.com/YunYang1994/tensorflow-yolov3/. The network was pretrained to segment and identify objects from the COCO dataset [13].
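A minimal sketch of the Table 1 network and the LRP analysis step is given below, assuming an iNNvestigate release compatible with tf.keras; the linear output activation and the specific LRP variant ("lrp.epsilon") are assumptions, as the text does not specify them:

```python
import numpy as np
import innvestigate
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# VGG16 base without its classification head, plus the added layers of Table 1.
base = VGG16(weights="imagenet", include_top=False, input_shape=(1080, 1920, 3))
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dropout(0.2)(x)
out = layers.Dense(2)(x)  # steering (yaw) and velocity, rescaled to [0, 1]
model = models.Model(base.input, out)

# Fine-tune only the non-VGG16 layers on the visual-odometry targets.
for layer in base.layers:
    layer.trainable = False
model.compile(optimizer="adam", loss="mse")  # Adam's default lr is 1e-3
# model.fit(train_frames, train_targets, epochs=5, batch_size=8)

# Generate an LRP heatmap for one (dummy) test frame with iNNvestigate [1].
analyzer = innvestigate.create_analyzer("lrp.epsilon", model)
test_frame = np.random.rand(1, 1080, 1920, 3).astype("float32")  # stand-in
relevance = analyzer.analyze(test_frame)  # relevance map, same shape as input
heatmap = relevance.sum(axis=-1)          # collapse color channels
```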
Figure 1: Heatmap generation techniques on a sample image. A) Original image with YOLO segmentation of objects such as pedestrians, bikes, and traffic lights. B) Ground truth gaze focus from DR(eye)VE dataset, which focuses on a traffic light. C) Saliency heatmap generated using spectral residual method, which seems to focus on pedestrians. D) Saliency heatmap generated using DeepGaze II method, which focuses on several areas including the traffic lights.

Figure 2: LRP heatmap generation. A) Original image. B) LRP heatmap of neural network with random weights, which seems to contain evenly distributed outlines. C) LRP heatmap of neural network pretrained on ImageNet, which outlines many object-related areas such as people and trees. D) LRP heatmap of neural network trained on driving output. Particular emphasis is found on the traffic light, which is not seen in C.
Figure 3: Comparison of emphasis on traffic-related objects between heatmapping methods. X-axis contains differences in heatmap values within bounding boxes of objects, in units of e-7. A) Mean emphasis difference between LRP ImageNet and LRP Random. B) Mean emphasis difference between LRP Driving and LRP ImageNet. Larger emphasis is found on road-centered objects such as traffic lights and vehicles.
RESULTS
In initial examinations of the DR(eye)VE dataset, it was found that many frames were largely similar, with no notable events and gaze directed towards the center of view. The set of frames was narrowed to include those annotated as attentive and non-trivial. In addition, videos from nighttime runs detected objects poorly on the pretrained Yolov3 model. A further filter was applied such that only frames used in the testing set of the neural network were used. Using only frames that met these conditions, this work ultimately generated all heatmaps on a dataset of 56,266 images.

Following the methods of [17], cosine similarity and Spearman correlation were used to compare the heatmaps to ground truth gaze. To calculate cosine similarity, the heatmaps were flattened into vectors A and B and compared as

$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} \quad (1)$$

Similarly, Spearman correlation was calculated as

$$r_s = \rho_{rg_A, rg_B} = \frac{\operatorname{cov}(rg_A, rg_B)}{\sigma_{rg_A}\,\sigma_{rg_B}} \quad (2)$$

in which $rg_A$ and $rg_B$ are the ranks of the raw values within each vector, $\rho$ is the correlation coefficient, $\operatorname{cov}(rg_A, rg_B)$ is the covariance of the rank variables, and $\sigma_{rg_A}$ and $\sigma_{rg_B}$ are the standard deviations of the rank variables. A minimal code sketch of both metrics appears below.

Prior to analysis, a positive correlation was predicted between the task-specificity of the heatmapping technique and its similarity with the heatmap of the driver's gaze. It was also predicted that the ratio of similarity scores between attentive and inattentive frames for each state would increase with task specificity. This reflects the bottom-up and top-down nature of eye movements, with inattentive states likely to be driven by unexpected saliency and attentive states driven by the task.

Method              All      Attentive  Inattentive  Ratio
Spectral Residual   0.00624  0.00619    0.00649      0.95524
Deep Gaze II        0.01332  0.01319    0.01422      0.92707
LRP Random          0.00914  0.00906    0.00958      0.94570
LRP ImageNet        0.01114  0.01109    0.01144      0.96951
LRP Driving         0.01330  0.01337    0.01297      1.03069
Table 2: Cosine Similarity, Median
Method              All      Attentive  Inattentive  Ratio
Spectral Residual   1222.1   1226.1     1196.4       1.02486
Deep Gaze II        1577.7   1578.8     1570.5       1.00525
LRP Random          1022.7   1029.9     969.37       1.062488
LRP ImageNet        801.629  804.465    788.097      1.02076
LRP Driving         857.030  858.234    848.420      1.01156
Table 3: Spearman Correlation, Median

The main hypothesis of this work was that there would be a positive correlation between the similarity scores and the task-specificity of the heatmap generation method. Calculating similarity scores to compare each heatmapping technique as shown in Tables 2-3, this main hypothesis was evaluated, calculating the median scores within all frames, attentive frames, and inattentive frames, as well as the ratio of attentive to inattentive. DeepGaze II had higher similarity than the other methods in many conditions. As a prior state-of-the-art technique, this was not unexpected, especially as it was tailored specifically towards gaze prediction while the other methods were not.

For cosine similarity, the hypotheses were supported in several ways. Within the LRP methods, the similarity score increased as the task specificity increased from Random to ImageNet to Driving. Spectral Residual was considered to be the least task-specific and also had the lowest similarity, while LRP Driving was the most task-specific and had the highest cosine similarity. Looking at the ratio of similarity scores between attentive and inattentive states for LRP methods, it was found that the more task-specific the heatmap, the higher the ratio between attentive and inattentive states. An inattentive person may look at salient objects unrelated to the task, which would be best predicted by methods such as DeepGaze II, whereas an attentive person may look at objects related to the task, which could be better predicted by XAI methods. All findings were significant with p < .005. The significance between heatmapping methods was calculated using one-way ANOVA, and the significance of ratios of median values was found with the two-sided Mann-Whitney test of log values of attentive and inattentive.

Examining Spearman correlation, the effects are not as apparent. Among the heatmapping methods, it was noted that DeepGaze II generally outperformed the other methods in similarity. As will be shown in qualitative examples, a plausible reason for these observations can be found in the fact that LRP heatmaps consist of diffuse outlines, while the other heatmapping methods have more concentrated focal points. Because of this, the rank variables used in the Spearman metric may not capture the relevant portions of the LRP heatmaps as well as the cosine metric does.
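The sketch of the two similarity metrics referenced above, a direct transcription of Equations (1) and (2) using NumPy and SciPy (array names are hypothetical):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(gaze_map, heatmap):
    """Equation (1): cosine of the angle between the flattened heatmaps."""
    a = gaze_map.ravel().astype(np.float64)
    b = heatmap.ravel().astype(np.float64)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def spearman_correlation(gaze_map, heatmap):
    """Equation (2): Pearson correlation of the rank-transformed values."""
    rho, _p = spearmanr(gaze_map.ravel(), heatmap.ravel())
    return rho

# Hypothetical usage on two 1080x1920 grayscale maps:
gaze = np.random.rand(1080, 1920)  # stand-in for a ground-truth gaze map
lrp = np.random.rand(1080, 1920)   # stand-in for an LRP heatmap
print(cosine_similarity(gaze, lrp), spearman_correlation(gaze, lrp))
```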
Figure 4: Example comparison between LRP Driving and LRP ImageNet on highway. A) Heatmap for LRP Driving. B) Heatmap for LRP ImageNet. C) Heatmap for clipped difference between A and B, showing emphasis on lane features. D) Original image for reference.

Figure 1 shows an example of heatmap generation of a frame in which the driver is stopped at an intersection. The ground truth gaze in 1B shows that the driver is currently looking at the traffic light on the right. The spectral residual heatmap in 1C shows emphasis on pedestrians, and the DeepGaze II heatmap in 1D shows emphasis in several areas including the traffic lights, pedestrians, and bikes. Figure 2 shows the heatmaps generated from LRP. Compared to the others, these are far more diffuse, unlike the focal nature of gaze. However, they are revealing as to which stimuli are task-relevant. The LRP Random heatmap in 2B shows general diffuse outlines, the LRP ImageNet heatmap in 2C shows particular emphasis on objects such as the pedestrians and trees, and the LRP Driving heatmap in 2D shows more focal emphasis on traffic-specific areas, such as the traffic light, which is almost undetectable in LRP ImageNet. LRP ImageNet heatmaps contained more noise in background areas, likely due to the extra untrained layers added to the pretrained VGG16 network. While LRP heatmaps alone may not be good predictors of gaze, they may reveal areas of interest that other methods cannot.
The emphasis on the traffic light led to questions of whether any particular objects received more emphasis in one heatmapping technique over another. To compare the emphasis between any two techniques, the proportion of heatmap activity within the bounding box was determined for each frame in the dataset and for each bounded object in that frame. The difference of this value between the two techniques was computed and then averaged for each object over all frames. Figure 3 shows the mean emphasis differences for traffic-related objects between the methods of LRP ImageNet and LRP Random in 3A, as well as LRP Driving and LRP ImageNet in 3B. Confirming observations, traffic lights were indeed more prominent in the LRP Driving heatmaps. In addition, the LRP Driving heatmaps tended to emphasize objects found on the road in plain view of the driver. While stop signs, bicycles, and pedestrians may affect driving output, they are often observed in the periphery, only affecting driving if seen within the vehicle's planned line of travel. Thus, a dataset with more corner case situations could potentially lead to different results, while the DR(eye)VE dataset captures typical driving situations. Also of note, many regular signs in the distance were falsely detected as stop signs, hence not being emphasized the same way that traffic lights were. Notably, the neural network and LRP heatmaps themselves had no notion of object bounding boxes or object labels, and still emphasized them according to task. LRP heatmaps may lead to emergent discovery of task-relevant stimuli that are not obvious to human evaluators.
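A sketch of this per-object comparison, under the assumption that the YOLO boxes are given as pixel-coordinate corners (x1, y1, x2, y2):

```python
import numpy as np

def box_emphasis(heatmap, box):
    """Proportion of total heatmap activity falling inside one bounding box."""
    x1, y1, x2, y2 = box  # assumed pixel-coordinate corner format
    total = heatmap.sum()
    return heatmap[y1:y2, x1:x2].sum() / total if total > 0 else 0.0

def mean_emphasis_difference(heatmaps_a, heatmaps_b, boxes_per_frame):
    """Average per-object emphasis difference between two heatmapping methods;
    boxes_per_frame lists the detected boxes of one object class per frame."""
    diffs = [box_emphasis(a, box) - box_emphasis(b, box)
             for a, b, boxes in zip(heatmaps_a, heatmaps_b, boxes_per_frame)
             for box in boxes]
    return np.mean(diffs) if diffs else 0.0
```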
This work further looked at which aspects of the frames were emphasized more in the LRP Driving heatmaps compared to LRP ImageNet, not accounting for object segmentation. To do this, the heatmaps from each method were normalized, then the LRP ImageNet heatmap was subtracted from the LRP Driving heatmap. To improve visibility, the LRP heatmaps, which were originally in the range of 0-255, were clipped to the range 0-100. Figure 4 shows an example where this subtraction showed a particular emphasis on lane features such as painted lines and boundaries. Without any explicit identification of such features, they were detected with LRP. This provides a potential means for highlighting features to inattentive drivers, as the subtracted heatmap is sparse enough to pick targeted areas for visual prompting.
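A sketch of the subtraction step follows; the exact normalization used is not specified in the text, so per-map max scaling is assumed here:

```python
import numpy as np

def lrp_difference(lrp_driving, lrp_imagenet, clip_max=100):
    """Normalize both LRP heatmaps to [0, 1], subtract ImageNet from Driving,
    rescale to the original 0-255 range, and clip to 0-100 for visibility."""
    a = lrp_driving.astype(np.float64) / max(lrp_driving.max(), 1e-8)
    b = lrp_imagenet.astype(np.float64) / max(lrp_imagenet.max(), 1e-8)
    diff = np.clip((a - b) * 255.0, 0.0, float(clip_max))
    return diff.astype(np.uint8)  # sparse map of driving-specific emphasis
```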
DISCUSSION
These results show that XAI techniques, particularly LRP, may be valuable in building a human-machine teaming system for useful applications in machine operation such as driving. Specifically, the LRP heatmaps are able to identify task-relevant stimuli without explicit object segmentation and classification, allowing both low-level and object-based features to be emphasized. Combined with other techniques for estimating human eye gaze, the full spectrum of bottom-up salient features and top-down task-specific features can be addressed.

LRP heatmapping techniques alone do not predict human gaze as well as other methods, raising the question of why the LRP heatmaps could not also be trained with human gaze data. While this is an appropriate next step, we wished to examine the human and machine representations separately before determining how to merge the two. We expect that a combination of machine-based task training and human-driven eye data would lead towards the goal of human-machine teaming methods of driver assistance.

Driving alert systems for quickly orienting distracted drivers have been developed in the past, with emphasis on using multiple modalities including auditory, haptic, and visual cues [14, 19]. In particular, visual cues in heads-up displays can alert drivers without requiring them to switch gaze between the road and dashboard [8]. With such alerting systems, it is crucial to avoid unnecessary distractions and false alarms. The techniques in this work help to filter sensory input by task relevance, enabling intelligent filtering of cues and alerts. Moreover, XAI and LRP techniques are agnostic to sensor type, and are able to explain neural networks processing any modality of information.
FUTURE WORK
Examining the frames without temporal information was a necessary prerequisite for gauging whether basic XAI methods such as LRP would be useful to explore. However, as human gaze involves integrating many focal points over time, we hope to extend our evaluation to temporal XAI methods. For example, rather than using the VGG16 architecture, which takes isolated frames as input, a recurrent network may be used instead. This may increase the similarity with human gaze, as a human may choose to look away from task-relevant stimuli once they are acknowledged and processed. To continue the work in developing such a human-machine system, we hope to test such methods in closed-loop training of human drivers. This requires design considerations in how to generate LRP-based visual prompts in real time, while avoiding distracting or irrelevant information. This would involve a mental model of human state, which may include not only whether the human is attentive or not, but also which stimuli the human has already processed. As we have seen, the method is also dependent on the dataset, requiring training on corner cases. Advanced causal reasoning methods could potentially be combined with the XAI techniques for better handling of such corner cases.