Dissonance Between Human and Machine Understanding
ZIJIAN ZHANG, JASPREET SINGH, UJWAL GADIRAJU, AVISHEK ANAND,
L3S Research Center, Leibniz Universität Hannover

Complex machine learning models are nowadays deployed in several critical domains, including healthcare and autonomous vehicles, albeit as functional black boxes. Consequently, there has been a recent surge in interpreting the decisions of such complex models in order to explain their actions to humans. Models which correspond to human interpretation of a task are more desirable in certain contexts and can help attribute liability, build trust, expose biases and in turn build better models. It is therefore crucial to understand how and which models conform to human understanding of tasks. In this paper we present a large-scale crowdsourcing study that reveals and quantifies the dissonance between human and machine understanding through the lens of an image classification task. In particular, we seek to answer the following questions: Which (well-performing) complex ML models are closer to humans in their use of features to make accurate predictions? How does task difficulty affect the feature selection capability of machines in comparison to humans? Are humans consistently better at selecting features that make image recognition more accurate? Our findings have important implications for human-machine collaboration, considering that a long-term goal in the field of artificial intelligence is to make machines capable of learning and reasoning like humans.

CCS Concepts: • Human-centered computing; • Applied computing → Law, social and behavioral sciences; • Information systems;

Additional Key Words and Phrases: Dissonance; Humans; Machine Learning Models; Neural Networks; Machines; Interpretability; Object Recognition; Image Understanding; Crowdsourcing; Human Intelligence
ACM Reference Format:
Zijian Zhang, Jaspreet Singh, Ujwal Gadiraju, Avishek Anand. 2019. Dissonance Between Human and Machine Understanding.
Proc. ACM Hum.-Comput. Interact.
3, CSCW, Article 56 (November 2019), 23 pages. https://doi.org/10.1145/3359158
For several decades researchers have attempted to build machine learning models that can elicit higher-order human behaviour and thinking [31]. Recent advances in the computational capabilities of machines, alongside advances in algorithmic intelligence, have surpassed expectations and resulted in staggering feats such as ‘AlphaGo’ defeating a world champion in the game of Go using deep neural networks [56, 57].

With all the perceived superiority of machines in decision making, arising partly from their computational prowess, we are interested in the question, “Do machines think like humans?” At the same time, it is worth noting that humans are very good at dealing with abstract and subjective tasks, notions that machines struggle to model and cope with. This raises the question of whether humans are consistently better decision makers in tasks they are naturally suited to.
Author’s address: Zijian Zhang, Jaspreet Singh, Ujwal Gadiraju, Avishek Anand, L3S Research Center, Leibniz Universität Hannover, Hannover, Germany. {zzhang,singh,gadiraju,anand}@l3s.de.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2019 Association for Computing Machinery. 2573-0142/2019/11-ART56 $15.00 https://doi.org/10.1145/3359158

Proc. ACM Hum.-Comput. Interact., Vol. 3, No. CSCW, Article 56. Publication date: November 2019.

Understanding these broad questions is crucial in building machine learning systems [66] and guiding interpretable system design [51]. This is even more so with the focus on algorithmic transparency, where it is paramount to understand the rationale behind a decision towards building trust in the system [21]. Intelligent machines have now become an integral part of our everyday lives, where the interaction, collaboration and cooperation between humans and intelligent machines shape various aspects of our society [73].
Recent technological advances have led to the growing popularity of a variety of such systems, ranging from voice-based conversational assistants that facilitate and support everyday social interactions [45, 65], and mobile health (mHealth) applications which have been proposed to transform healthcare and health promotion [60], to pervasive recommender systems which support the online and offline activities of humans with growing regularity.

There has been plenty of interest in the machine learning community in making machines more understandable to humans, studied under the interpretability of machine learning models [10, 26]. One line of work focuses on building systems that are interpretable by design, or whose decision process can be unambiguously explained. On the other hand, there have been approaches that provide post-hoc explanations for already trained models [37, 49].

To the best of our knowledge, most prior work focuses largely on faithfully explaining a trained machine learning model. However, little work has been done on answering the question of how human-like the machine is behaving. A general consensus across research communities suggests that machines which can reason or act more congruently with human expectations can create more seamless solutions for collaboration and cooperation with humans in socio-technological systems. We aim to fill this knowledge gap by enhancing the current comprehension of the “dissonance between human and machine understanding”. By doing so, we make important strides in CSCW and HCI towards building machines which are more congruent with human expectations.

In this paper we focus on dissonance with respect to a task that is natural to humans – image recognition [30]. Our choice of task is further motivated by recent machine learning models in image classification that have reached near-human performance [61, 62].
Specifically, we focus on two scenarios of human decision making central to the image recognition task – selection of important parts of an image that make an object detectable in the image, and identification or recognition of an object. The scope of this work is guided by the following research questions:

• RQ1: How do humans compare to machines in selecting important features/segments for the image classification task?
• RQ2: What factors influence the accuracy of humans in an image recognition task?
Task in a Nutshell.
Towards answering these questions we employed a novel two-stage crowdsourcing approach (over 7,000 HITs – human intelligence tasks) based on a consistent explanation space to gather a collective understanding of human and machine behaviour.

As a contextual grounding for our proposed approach to this problem, we base our task design on Biederman’s theory of image understanding [6]. The author proposed a bottom-up process, called recognition-by-components, to explain object recognition. Biederman showed that humans recognise objects by separating them into the object’s main component parts. Inspired by this, we choose image super pixels as the space of input features over which we gather selection information from both humans and neural network models. In the first task we ask humans to select relevant segments of an image given an object (in the image)/label that needs to be recognised. This gives us human ‘reasons’, whereas the SHAP [37] interpretability approach allows us to identify the input image segment attribution for a given decision (classified image) by a neural network. By gathering human judgements and machine explanations on the same set of segments we can directly analyse and quantify differences in reasoning, which has been relatively unexplored in the literature. In the second task, we present segments of a given image one at a time to human assessors, in a decreasing order of importance determined by humans or NN models, asking them to identify the object. In doing so, we compare the dissonance between human selection and machine selection based on the number of segments revealed before eliciting the correct guess (i.e., the accurate class label pertaining to the given image).

Fig. 1. An example of a segmented image from the ‘kimono’ class (1a) as displayed to humans in Task-1, and 5 of the most discriminative segments uncovered in Task-2 (1b, 1c, 1d, 1e), according to the ordering based on humans (HUMAN) and machines (Inception, ResNet, VGG). Humans considered the segments corresponding to the kimono itself to be most discriminative in recognizing the kimono, while the neural networks also picked contextual features such as the faces and hands of the women wearing the kimonos.
Key findings and outcomes.
A key tangible outcome is a dataset of 300 images annotated by 377 workers across over 7,000 HITs, which we also release. Previous works have shown how human domain understanding can be utilized in building effective machine learning models [51, 66]. To the best of our knowledge, this is the largest dataset to be used for the evaluation of interpretability in the image classification task. To ensure replicability of data collection using our tasks, the instructions for all tasks will be released along with the complete dataset.

From a findings perspective, we found that neural network (NN) models that are close to human selection patterns tend to generalise well. This has key implications for the utility of our dataset in machine learning (ML) model design. That said, our results suggest that humans do not always select the most discriminative segments for recognition. For example, in Figure 1, we report the first 5 discriminative segments as perceived by humans and the ML models. Interestingly, we find that Inception and ResNet focus on more human-understandable features responsible for faster human prediction. We find that some ML models outperform humans on 25% more images. On closer examination we find that this can be attributed in part to the inability of humans to effectively choose good features from the context information that is vital for quick recognition by the crowd. Humans may potentially use more context in their decision-making process than they attribute to it. Further experiments are required to fully understand this. We also use the data generated by our tasks to characterise the performance of the state-of-the-art neural networks that we chose in our study. Specifically, we find that while deeper networks tend to generalise better and choose more important features, they are less effective on difficult images.
On the contrary, wide and over-parameterized networks tend to be robust in spite of being markedly different from human intuition.

Our work aims to foster research on understanding how trust manifests, builds and evolves between humans and machines, as a result of measuring the congruence of machines with human expectations. This lies at the core of HCI research, and we aim to bridge the gap between the machine learning and AI communities and the CSCW community through our work.

Researchers in the CSCW and HCI communities have shed light on the unintended consequences of algorithms and machine learning models that can have a societal impact unanticipated by their creators [7, 42]. Others have also reflected on the benefits that machine learning models can offer to society at large by supporting human decision-making [3, 25, 27]. Several machine learning models mediate our social, cultural, economic and political interactions in today’s world [47]. Therefore, understanding these models and how congruent they are with human expectations is of paramount importance, so as to control their actions, enjoy their benefits and mitigate their harms. For example, online pricing models have been shown to shape the cost of products differently for different customers [20]. Understanding the full breadth of societal effects that machines can have becomes more complex in hybrid systems composed of many humans and machines interacting, demonstrating collective behaviour [55]. In a recently laid out HCI research agenda, the authors reflected on how a lot of work in the AI and ML communities tends to suffer from a lack of usability, practical interpretability and efficacy on real users, calling on the HCI community to take the lead in ensuring that new intelligent systems and ML models are transparent from the ground up, and congruent with human expectations [1].
In this paper, we aim to bridge the knowledge gap in understanding how congruent machine learning models are with the expectations of humans in image classification tasks, where machine learning models have been shown to be on par with human performance. Our findings have direct implications for HCI and CSCW research that aims to understand how humans and machines differ in their decision-making. We make a foundational contribution towards studying the decision-making processes of humans and machines, attempting to understand how and where they differ.

We discuss related literature in four broad realms – (1) work on algorithmic transparency by using explanations understandable to humans, (2) methodological approaches in model interpretability, (3) neuroscience approaches that explore the ‘humans versus machines’ context in object recognition, and (4) theories on human understanding.
Today’s world is characterised by an increasing dependency on algorithmic decision-making systems [72]. Since these systems augment our everyday lives, recent CSCW and HCI research has reflected upon the importance for people to understand them better [1]. As described by Rader et al., algorithmic transparency involves encountering non-obvious information that is typically difficult for the user of a system to learn and experience directly, about how and why a system works the way it does and what this means for the system’s outputs [46]. Several recommender systems provide explanations alongside their recommendations with an aim to be more persuasive, ensuring that the system’s goals are served [5]. Explanations in such contexts present a user with information regarding how and why the system produced a given recommendation. Prior works have focused on various attributes of explanations: cognitive fit [18], content type [19], data sources [43], and modality [40]. In other work, authors classified explanations into ‘black box’ and ‘white box’ descriptions [13]. ‘Black box’ explanations provide justifications for the outcomes of a system but do not disclose how the system works [68]. On the other hand, ‘white box’ explanations delve into the inputs and outputs of a system and the steps taken through the course of arriving at particular outcomes [64]. Recent work by Binns et al. argued that there may be no ‘best’ approach to explaining algorithmic decisions [7].

A significant amount of prior work has focused on the importance and effects of algorithmic transparency, and on the role of explanations in helping human users better comprehend the functioning of intelligent machines. This includes work from the CSCW community on algorithmic fairness in the sharing economy [34], and algorithmic mediation in group decisions [33]. However, few works have juxtaposed human understanding with that of machines. In this paper, we aim to fill this gap by studying the dissonance between human and machine understanding.
Unlike work on creating explanations, it is important to note that there is a difference between explaining why a system behaves a certain way and interpreting a model. Interpretable models can be categorised into two broad classes: model introspective and model agnostic. Model introspection refers to “interpretable” models, such as decision trees, rules [35], additive models [8] and attention-based networks [70]. Instead of supporting models that are functionally black boxes, such as an arbitrary neural network or random forests with thousands of trees, these approaches use models in which there is the possibility of meaningfully inspecting model components directly, e.g., a path in a decision tree, a single rule, or the weight of a specific feature in a linear model.

Model agnostic approaches, on the other hand, extract post-hoc explanations by treating the original model as a black box, either by learning from the output of the black box model, by perturbing the inputs, or both [28, 50]. Model agnostic interpretability is of two types: local and global.
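To give a concrete sense of the perturbation-based, model-agnostic family described above, the following is a minimal sketch (our own illustration, not the method of any cited paper): it masks input features at random, queries the black-box model on each masked input, and estimates each feature's importance as the difference in mean output between samples where the feature is present and absent. The function name and parameters are hypothetical.

```python
import random

def perturbation_attribution(predict, n_features, n_samples=2000, seed=0):
    """Crude model-agnostic attribution: mask features at random and
    compare the black box's mean output with each feature on vs. off."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(n_features)]
        samples.append((mask, predict(mask)))
    weights = []
    for i in range(n_features):
        on = [y for m, y in samples if m[i] == 1]
        off = [y for m, y in samples if m[i] == 0]
        weights.append(sum(on) / len(on) - sum(off) / len(off))
    return weights
```

Approaches such as LIME [50] refine this idea by additionally weighting perturbed samples by their proximity to the instance being explained and fitting a sparse linear surrogate to the black box's outputs.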
Local interpretability refers to explanations used to describe a single decision of the model. There are also other notions of interpretability; for a more comprehensive description of the approaches we point the readers to [36]. Local interpretability can be model agnostic or introspective. In the model agnostic case, as in [50], a simple linear model is trained to explain a single data point by perturbing it systematically and labelling the new synthetic data using the model.

More recently, Lundberg and Lee [37] introduced their model introspective approach, known as SHAP, which utilizes the classical Shapley value estimation method from cooperative game theory. In essence, SHAP generates feature importance values for a given decision over a pre-trained model by propagating differences in activation to the expected value through the network. In this work, we use SHAP scores over the image segments (which we consider as features in our setting) to compute feature importance in Task-2.
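For intuition about the Shapley values underlying SHAP, here is a brute-force computation over a handful of "segments", treating the model as a set function over the segments that are present. This is an illustration only (exponential in the number of features; SHAP itself uses efficient approximations), and the toy model below is our own invention.

```python
from itertools import permutations

def shapley_values(f, n):
    """Exact Shapley values for a set-function f over n players (segments):
    average each player's marginal contribution over all join orders."""
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        present, prev = set(), f(set())
        for i in order:
            present.add(i)
            cur = f(present)
            phi[i] += cur - prev  # marginal contribution of segment i
            prev = cur
    return [p / len(perms) for p in phi]
```

For a toy classifier whose confidence comes mostly from segment 0, with a small interaction between segments 0 and 2, the interaction is split evenly between the two segments involved, and the efficiency property guarantees the attributions sum to f(all segments) − f(∅).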
Our work in this paper is not the first attempt to study how humans and artificial neural network (NN) models differ in the way they perceive objects. Afraz et al. proposed falsifiable, predictive models that account for the neural encoding and decoding processes that underlie visual object recognition [2]. With an aim to better understand neural encoding in the higher areas of the ventral stream of human brains, Yamins et al. used computational techniques to identify a NN model that matches human performance on an object categorisation task [71]. The authors found that the model was highly predictive of neural responses in both the V4 cortex and the inferior temporal cortex, the top two layers of the ventral visual hierarchy in humans. Schrimpf et al. proposed Brain-Score, a composite of several neural and behavioural benchmarks that score a neural network on how similar it is to a primate brain’s mechanisms for core object recognition [53]. Rajalingham et al. systematically compared specific neural network models with the behavioural responses of humans and monkeys at the resolution of individual images [48]. The authors found that the NN models which they tested significantly diverged from primate behaviour.

In contrast to the aforementioned approaches that utilize fMRIs and other sensing devices to correlate features with NN models, in this work we rely on gathering explicit feedback from humans on their decision-making process for the task of object recognition. Although object recognition is intuitive to humans, understanding the reasons for their decisions in unobtrusive ways (for example, by using eye tracking, fMRIs, etc.) is expensive and does not scale easily. The novelty of our work lies in understanding dissonance between humans and machines based on instance-level fine-grained reasoning, enabled by our choice of task, NNs and interpretability techniques.

Cognitive scientists have proposed that much of our thinking, memory and attitudes operate on two levels: conscious and deliberate, and unconscious and automatic [39]. Intuition is our capacity for immediate insight without observation or reason, i.e.,
thinking without conscious awareness. Kahneman [24] argues that, like the perceptual system, intuition operates through impressions and judgements that directly reflect impressions. In contrast, deliberate thinking is reflective, reasoning-like, critical, analytic, and operates in the realm of conscious awareness. Intuitive judgements can of course be overridden by a more deliberate, rational process, but intuition may still affect subsequent responses through priming [24].

Consequently, human decision making is based on these two levels of rationality. Even the most tedious decisions that appear to be deliberate and well considered, like market investments or medical diagnostics, involve a certain amount of intuition. Herbert Simon’s theory of bounded rationality [58] argues against the strict rationality model and states that decisions can be made with reasonable amounts of calculation, and using incomplete information.

With an aim to further the understanding of human-machine dissonance, we chose the machine learning task of image classification, since humans are known to be capable of solving image recognition tasks with high accuracy using their intuition and deliberation. Moreover, neural networks (NNs) have matched and surpassed human performance on many benchmarks in the task of object recognition and are being used in various real-world applications [61, 63]. This task also has added benefits from a feasibility standpoint: several trained NNs with clear descriptions of their architecture are freely available. Interpretability techniques developed in the machine learning community allow us to examine the decision-making process of NNs. Having been studied over several years for object recognition in particular, these interpretability techniques are now mature. Towards this end, we involve a large number of human subjects in a crowdsourcing setting, as described in the following section.
The ventral stream is involved with object and visual identification and recognition (cf. the two-stream hypothesis [12]). Core object recognition is the ability to rapidly recognise objects despite variations in their appearance.
The ImageNet data set was created to help train machine learning models to classify objects in images [9]. It consists of over a million images and 1,000 classes. Each image is labelled with a single class even if there are multiple objects in the image. Classes range from broad categories like ‘minivan’ to specific breeds of dogs like ‘shih-tzu’. As motivated by prior work, creating ground truth data for evaluation using human input and intuition is often an expensive process when scaled [23]. This is indeed the case for industry-sized data sets such as ImageNet [29]. Moreover, to study the research questions posed earlier we are not constrained by a need for a very large data set. Thus, we selected 50 classes out of 1,000 and sampled 6 images at random from each class to create a data set of 300 images. Additionally, we ensured that all chosen images are classified correctly by the models we consider.

We solicited the aid of 3 researchers in our university to select these 50 classes pro bono. We only showed them the full list of classes (not the images). We defined selection criteria based on the scope of our research as follows:

• Familiar: the class should be familiar to all the annotators, i.e., all annotators should know what exactly the selected class of objects refers to. This criterion was added to help select classes that most people would recognise and reduce undue effort from crowd workers.
• Unambiguous: the class should have only one clear connotation for the given object. For instance, the class ‘crane’ can refer to either the machine or the animal, and is thereby ambiguous.
• Non-specific: the class should not be a specialisation or a potential sub-class of another class in ImageNet. If it is, then neither class can be selected; for example, the classes ‘cat’ and ‘Persian cat’. Since crowd workers are not experts in identifying various fine-grained classes of objects, we cannot expect them to be able to identify features pertaining to a very specific class, whereas the ML models are exposed to all classes in training.

Apart from this, we also gathered annotations based on whether the annotators believed that it would be easy to identify objects from the selected class in a given image. We marked classes as difficult to identify if at least one annotator indicated so. From this process we ended up with 28 easy classes and 22 difficult classes. Finally, we considered the first 50 classes that the annotators completely agreed on according to the criteria.
We employed three neural networks in our experiments – VGG19 [59], Inception-ResNet-V2 [61] and Inception-V3 [63]. These are state-of-the-art models that report high accuracy and human-level performance on the ImageNet data set. Furthermore, they differ in key areas of their network architecture, as discussed below.

First released in 2014, VGG19 won the first and second prizes of the ILSVRC (ImageNet) localisation and classification challenges. It has 16 convolution layers and 3 fully connected layers, which made it one of the deepest NN architectures at the time. They report a 74.5% Top-1 accuracy on the validation data of the ILSVRC2012 [9] contest. The number of parameters of VGG19 (143,667,240) is the highest among the three models chosen in this work.

Inception-V3 is an improved version of the original GoogLeNet [62]. They introduced concatenated pooling layers and showed that breaking down large convolution kernels into several small ones significantly improves performance while reducing the number of parameters. Its number of parameters (23,851,784) is the smallest among the three models chosen, and its reported Top-1 accuracy is 78.8%.
Fig. 2. Tasks in our crowdsourcing study. (a) Task 1 presents the clickable segmented image along with the actual object name. (b) and (c) show the image recognition UI for Task 2, where segments are shown one at a time: (b) shows the initial human-selected segment and (c) the initial machine-selected segment, for the same image of class go-kart.

Inception-ResNet-V2 has a hybrid structure consisting of residual and inception units that accelerate training while maintaining the precision of the network. The depth of Inception-ResNet-V2 is 572, the highest among the three models considered in this work, while its number of parameters (55,873,736) is approximately twice that of Inception-V3. Its Top-1 accuracy on ILSVRC2012 is 82.2%.

For the sake of readability, we will refer to the VGG19, Inception-ResNet-V2, and Inception-V3 models as VGG, ResNet, and Inception, respectively, hereafter in this paper.
Models that make decisions close to the way humans do are often desired and tend to behave less perplexingly on unseen data [51]. Even if models have similar performance according to metrics like accuracy, they may differ in the reasons that drive their decisions. These reasons can be attributed to the training data, architecture, training procedure, or a combination of such factors. In this work we focus on models that have been trained and validated using the same data but have different architectures; all three neural networks (VGG, ResNet, and Inception) were trained on the same 1.2M images belonging to 1K classes in the ImageNet data set.

Our task design is inspired by Biederman’s seminal work on human image understanding [6]. Biederman proposed the recognition-by-components theory, which can account for the major phenomena of object recognition. He showed that if an arrangement of a few primitive components can be recovered from the input, objects can be quickly recognised even in the presence of a significant amount of noise. Thus, in the context of object recognition in images, we define human intuition or reasoning in terms of the segments of an image which are perceived to aid the accurate recognition of the image class or label.

For instance, take the image shown in Figure 2a, whose class label according to ImageNet is “go-kart”. To correctly identify the object as a go-kart, not only are the segments corresponding to the kart strong reasons, but so are those pertaining to the driver. To accurately capture human intuition in this task, we not only need all the segments that humans use to make a decision, but we also need to understand the relative importance of each segment. To this end we first deployed a crowdsourcing
image classification task on FigureEight, a primary crowdsourcing platform, to gather human intuition judgements corresponding to the 300 images from 50 different classes.

We divide each image into 50 segments. Super pixel segmentation is utilised to cluster spatially similar pixels into a fixed number of segments. Standard grid lines also allow for such fixed-size segmentation, but are unaware of object boundaries, which is crucial in identifying segments of importance. For example, a single segment in a grid can contain an important part of the object and a large part of the background, which may be non-essential. Super pixels are less susceptible to such effects and are hence also utilised by SHAP and other approaches like LIME [50].

Crowd workers are shown the images and their corresponding labels, and then instructed to select all segments in the image that help them correctly identify the given object. Workers are urged to select segments in the order of perceived importance, where the first segment they select is the strongest indicator of the object in the image. Annotators can click on each segment to select it. The first segment that is clicked is marked with the number 1, and every subsequent click is also recorded and displayed with the corresponding selection number, as shown in Figure 2a. Note that the workers were explicitly encouraged to select the most important segments that could help in identifying the object in the image, including segments with contextual cues.

We collected 5 distinct judgements for each of the 300 images. Workers were paid at an hourly rate of 7.50 USD. To ensure a high reliability of the judgements gathered, we restricted participation to the highest quality workers using an inbuilt feature on the platform.
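The super pixel segmentation described above can be illustrated with a toy, SLIC-style clustering: k-means over (intensity, x, y) features, so that segment boundaries follow changes in appearance rather than a fixed grid. This is a deliberate simplification for illustration (greyscale instead of colour, no connectivity enforcement); a study like ours would use a standard implementation such as the segmenters shipped with SHAP or LIME. All names below are our own.

```python
def toy_superpixels(image, n_segments=4, spatial_weight=0.01, iters=10):
    """SLIC-style super pixels in miniature: k-means over (intensity, x, y).
    `image` is a 2-D list of grey values in [0, 1]."""
    h, w = len(image), len(image[0])
    step = max(1, int((h * w / n_segments) ** 0.5))
    # seed cluster centres on a regular grid: (intensity, x, y)
    centres = [(image[y][x], float(x), float(y))
               for y in range(step // 2, h, step)
               for x in range(step // 2, w, step)][:n_segments]
    labels = [[0] * w for _ in range(h)]
    for _ in range(iters):
        # assignment step: nearest centre in joint colour+space distance
        for y in range(h):
            for x in range(w):
                dists = [(image[y][x] - ci) ** 2
                         + spatial_weight * ((x - cx) ** 2 + (y - cy) ** 2)
                         for ci, cx, cy in centres]
                labels[y][x] = dists.index(min(dists))
        # update step: move each centre to its cluster's mean
        acc = [[0.0, 0.0, 0.0, 0] for _ in centres]
        for y in range(h):
            for x in range(w):
                a = acc[labels[y][x]]
                a[0] += image[y][x]; a[1] += x; a[2] += y; a[3] += 1
        centres = [(a[0] / a[3], a[1] / a[3], a[2] / a[3]) if a[3] else c
                   for a, c in zip(acc, centres)]
    return labels
```

On an image with four uniform quadrants, the clusters snap to the quadrant boundaries rather than to an arbitrary grid, which is exactly the property that makes super pixels preferable for attributing importance to object parts.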
We created gold-standard data and used test questions within the task, facilitating the training of workers while maintaining overall quality [14, 41]. We balanced the distribution of easy and difficult classes in our gold-standard data by having an equal number of images from the easy and difficult classes, to prevent potential biases. For convenience, we will refer to this task as Task-1 hereafter. Through the remainder of the paper, we do not use the terms ‘easy’ and ‘difficult’ to refer to the classes; we define image-level difficulty as perceived by workers in Section 3.4.

In our setting, humans select image segments in order to help identify an object. The human annotations are essentially an ordering or ranking of image segments per image per crowd worker. Rankings are inherently different from categorical and ordinal scale annotations, which means we cannot use standard agreement measures (like Fleiss’ Kappa) or aggregation methods like averaging or majority voting. In our case, since we do not enforce an exact number of segments to select, we have non-conjoint partial rankings, i.e., for the same image we can have (i) different segments, (ii) a varied number of segments, and (iii) differing preferences. Additionally, we are most interested in the top-ranked segments. Standard rank correlation metrics like Kendall’s Tau are not designed to handle these conditions. A better measure for this purpose is rank biased overlap (RBO) [69], which is specifically designed to address these shortcomings. To measure the segment selection agreement between workers for an image we compute the average pairwise RBO. In our experiments we found a high agreement between workers, with
𝑅𝐵𝑂 = 0. (Level-3 contributors on FigureEight comprise workers who completed >100 test questions across hundreds of different types of tasks, and have a near-perfect overall accuracy.)

To aggregate the partial rankings from multiple workers into a single ordering per image, we use the Plackett-Luce (PL) model, a generalisation of the model known as the Bradley-Terry model meant for the case of pairwise comparisons. Given a set of rankings we estimate the parameters of the model using maximum likelihood. Each parameter corresponds to the probability of selection for an item from a set of alternatives. We order the segments based on the estimated PL model for each image. For segments that are not selected by any workers, we randomise their order and append them to the list of ordered segments. We then convert the parameter estimates for the segments into a probability distribution using softmax to compute certain measures for dissonance (EMD) in our study.
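As a minimal sketch of the Plackett-Luce estimation step (not the authors' exact implementation), the parameters can be fit by minimising the negative log-likelihood of the observed rankings with scipy; the example rankings below are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def pl_neg_log_likelihood(theta, rankings):
    """Negative log-likelihood of rankings under the Plackett-Luce model.

    Each ranking lists item indices from most to least preferred; at every
    position the chosen item competes against all items ranked at or below it.
    """
    nll = 0.0
    for r in rankings:
        for t in range(len(r) - 1):
            rest = theta[r[t:]]
            nll += np.log(np.exp(rest).sum()) - theta[r[t]]
    # A tiny ridge term removes the shift invariance of the parameters.
    return nll + 1e-4 * (theta ** 2).sum()

def fit_plackett_luce(n_items, rankings):
    res = minimize(pl_neg_log_likelihood, np.zeros(n_items), args=(rankings,))
    theta = res.x
    return np.exp(theta) / np.exp(theta).sum()  # softmax -> selection probabilities

# Hypothetical judgements over 4 segments: segment 0 is always picked first,
# and segment 1 usually beats segment 2.
rankings = [np.array([0, 1, 2, 3])] * 5 + [np.array([0, 2, 1, 3])] * 2
probs = fit_plackett_luce(4, rankings)
```

The resulting probability vector gives the aggregate segment ordering (and, after softmax, the distribution used by the EMD-based measures).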
Next, we aim to understand the factors that influence the accuracy of humans in an image prediction task that is informed by the discriminative features identified either by other humans or by machines.

In this task, workers were asked to identify an object in an image within a game-like experience. Workers were incrementally shown segments of an object in an image (one segment at a time), based on the aggregated human ordering (HUMAN) or that corresponding to one of the 3 neural network models (VGG, Inception, ResNet). In all cases, the segments were revealed according to a decreasing order of importance. The overall objective of the workers was to guess which object was being revealed, using as few uncovered segments as possible. The task began with one uncovered segment, and workers could make at most 3 guesses by filling a text field after every new uncovered segment. Workers were also allowed to uncover another segment in case they did not have any guesses at each stage, by clicking a ‘Show One More!’ button. To encourage workers to correctly identify the object using the fewest number of segments possible, we incentivized them with a bonus payment of 3 USD cents for every object they correctly identified using the fewest segments among the corresponding cohort of 5 workers for each image. After 50% of an image was uncovered (i.e., 25 segments were shown), we automatically revealed the entire image and workers were allowed to make a final set of 3 guesses. We accepted misspelled guesses within an edit-distance of 1, and also expanded the list of acceptable responses by using a dictionary of synonyms. If workers failed to correctly identify the object, they were asked to identify whether the said object was present in the image using a multiple choice question (with ‘Yes’, ‘No’, or ‘I Don’t Know’ options). Finally, all workers were asked to respond to a question regarding how difficult it was to identify the given object in the image on a 5-point Likert scale ranging from ‘1: Very Easy’ to ‘5: Very Difficult’. For convenience, we will refer to this task as Task-2 hereafter.
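The fuzzy matching of guesses described above can be sketched as follows. This is an illustrative re-implementation, not the authors' code, and the helper names and example words are hypothetical:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def accept_guess(guess: str, label: str, synonyms=()) -> bool:
    """Accept a guess within edit distance 1 of the label or any synonym."""
    return any(edit_distance(guess.lower(), t.lower()) <= 1
               for t in (label, *synonyms))
```

In the study, the synonym list came from a dictionary of synonyms; here it is simply an optional argument.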
In this section we first introduce the notion of image difficulty and how it is computed in our setting. Recall that, in Task-2 (cf. Section 3.3.3), subjects are asked to assess the difficulty of identifying the object in the image (on a 5-point Likert scale) after the completion of their guessing procedure. In soliciting responses there is inherent variability in the assessments of workers that might stem from factors such as their familiarity with the object, sub-optimality of the segments being uncovered as a function of features chosen by humans or machines, and so forth.

In coming up with an aggregate measure for the inherent difficulty of an image classification instance given a certain sequence of uncovered segments, we assume the following:
• For the same sequence of segments presented to humans (same model), there is inherently low variability in assessments.
• The optimal sequence for guessing, that is, the best sequence that results in a successful guess in the smallest number of segments, sets the difficulty of the task.
We explore these assumptions and qualitatively argue their validity in guiding the design of our measure for difficulty. First, although there is variability in the number of uncovered segments needed to correctly guess the object in an image, we found low entropy in the self-reported difficulty assessments from corresponding workers. So for an image-model pair we take the median of the difficulty assessment values, say m_{i,j}, where i is the image and j is the model that is generating the sequence (a neural network or humans).

The optimal sequence of segments presented to the user that would solicit the best guess is unknown. We can, however, provide an upper bound by choosing the model that has the lowest difficulty estimate. Hence, we denote the difficulty of an image i as min_j {m_{i,j}}. Consider an image i from our data set and NN j (one of VGG, Inception or ResNet) with the number of segments needed to guess the correct label from 5 different crowd workers, for example the values (4, 5, 6, 10, 11). We take the median of these assessments to get m_{i,j} = 6. We compute this for each j in VGG, Inception or ResNet for the image i. Let us say these values are (6, 10, 21). Then the inherent difficulty of image i is the minimum of 6, 10 and 21, which is 6. This gives us a data-driven measure of difficulty per image. Note that this is different from the class-level difficulty we solicited at the beginning of our experiments.

Within the scope of our study, we propose two distinct notions of disparity between humans and machines (ML models): implicit and explicit dissonance. We characterise implicit dissonance as the difference between humans and machines emerging from Task-1, due to collective differences in the features (segments in our case) that humans and machines perceive as being more important for accurate classification. This plays a pivotal role in enabling workers to readily recognise images in the second task. We characterise explicit dissonance based on the performance of humans and machines in Task-2.
Features.
Note that we used the same pixel clustering approach as SHAP when gathering judgements in Task-1, so as to ensure that the SHAP explanations are comparable to the data we gathered. Since the output of SHAP is an importance-score (Shapley value) distribution over segments, we order segments in decreasing order of these scores.
Implicit Dissonance.
We quantify the dissonance between human and machine understanding of these images as the distance between the human-annotated segments and the output explanation of SHAP for each neural network model. We analysed the performance of the three neural networks with four measures having different semantics: Jaccard similarity, NDCG [22], weighted Kendall's 𝜏 [54] and EMD [52]. The simplest measure is coverage using the Jaccard similarity between the human- and machine-annotated segments. Jaccard similarity, however, does not capture the importance of segments indicated by their order of selection in our case. We use a weighted version of Kendall's 𝜏 to measure rank correlation between human and machine selection. Weighting here allows us to pay more attention to the ordering of the top segments.

𝜏 entails order preservation but fails to capture locality. Locality is important because minor rank differences between segments that are spatially very close may be negligible. Earth Mover's Distance (EMD) [38] is a Wasserstein metric that measures the distance between 2 distributions and takes locality into account. EMD between two sets of points in R^d of equal sizes (say, s) is defined to be the cost of the minimum-cost bipartite matching between the two point sets. It is a natural metric for comparing sets of geometric features of objects. The EMD is based on a solution to the transportation problem from linear optimisation, for which efficient algorithms are available, and also allows naturally for partial matching. It is more robust than histogram matching techniques, in that it can operate on variable-length representations of the distributions that avoid quantization and other binning problems typical of histograms.
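As a hedged illustration of the equal-size bipartite-matching form of EMD described above (not the authors' implementation), using 2-D segment centroids as the point sets:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd_equal_sets(points_a, points_b):
    """EMD between two equal-sized point sets in R^d: the average cost of
    the minimum-cost bipartite matching under Euclidean ground distance."""
    a = np.asarray(points_a, float)
    b = np.asarray(points_b, float)
    assert a.shape == b.shape, "this simplified form requires equal sizes"
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return cost[rows, cols].sum() / len(a)

# Hypothetical segment centroids: shifting every point up by 1 pixel
# should cost exactly 1 per point.
a = [[0.0, 0.0], [1.0, 0.0]]
b = [[0.0, 1.0], [1.0, 1.0]]
```

The general EMD additionally allows unequal weights and partial matching; this sketch covers only the equal-size case the text defines.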
Explicit Dissonance.
To get a more explicit notion of dissonance we use the data from Task-2. For each image i we have the median number of segments needed to correctly classify it. Let m_{i,j} denote the median number of segments needed to guess an image i given model j's segment ordering. We define the dissonance between a pair of models j, k over N images as the average difference in the number of segments needed to correctly classify images:

dissonance(j, k) = ( Σ_i |m_{i,j} − m_{i,k}| / Z ) / N

where Z is a normalising factor, chosen to be the maximum number of segments, that bounds the value to [0, 1].

By analysing the data gathered from our first task, we aim to understand how close the feature selection of machines is to human understanding (RQ). Human understanding is encoded in the segments selected by crowd workers and is operationalized by aggregations of these assessments from Task-1. It can be represented as a set (for precision), a sequence or ordered list (for Kendall's 𝜏), or a distribution (EMD). Table 1 presents the differences in the implicit dissonance measures between humans and machines. We can clearly see that Inception and ResNet are closer to human intuition than VGG. Using multiple one-way ANOVAs we found statistically significant differences between all the implicit metrics for dissonance across the three NN models. Models that are closer to human intuition in their feature selection also tend to have higher classification accuracy (Top-1 Acc). This finding relates to prior works, which have argued that models that correlate more with human feature selection tend to generalise better [11, 17].
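A minimal sketch of the explicit dissonance measure defined above, with hypothetical median segment counts and Z = 50 (the number of segments per image in our setup):

```python
import numpy as np

def dissonance(m_j, m_k, z=50):
    """Average normalised difference in median segments-to-correct-guess
    between two models' orderings; bounded to [0, 1] since each
    |m_ij - m_ik| is at most z."""
    m_j = np.asarray(m_j, float)
    m_k = np.asarray(m_k, float)
    return float(np.abs(m_j - m_k).sum() / z / len(m_j))

# Hypothetical medians for 2 images under two different orderings:
# (|2-4| + |4-8|) / 50 / 2 = 0.06
d = dissonance([2, 4], [4, 8])
```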
Table 1. Implicit Dissonance Measures – How close are machines to human understanding when selecting features? Rows: Inception, ResNet, VGG; columns: p@5, p@10, emd and tau (each reported for easy and difficult classes), plus Top-1 Acc.
Next, we explore whether NN models (Inception, ResNet, and VGG in our case) which are closer to human intuition result in superior performance in the image recognition task. We are interested to see whether the sequence of segments aggregated from the segments selected by humans in Task-1 indeed results in better image recognition by other human subjects in Task-2. Figure 3 illustrates our findings. Contrary to what was expected, we found that human selection of important segments (HUMAN) does not always lead to the best prediction by other humans. For the sake of readability, we present and discuss our findings in 5-segment intervals with respect to the
number of segments uncovered for accurate image recognition.

Fig. 3. Distribution of easy and difficult images that were correctly recognised by workers in the guessing task (Task-2), where segments were uncovered in orders determined by humans (HUMAN) in comparison to different machines (Inception, ResNet, VGG). (Panels: (a) Humans versus Machines; (b) Humans and Machines Segment Ordering; (c) Humans versus Machines; x-axis: No. of Segments Uncovered.)

ResNet ordering resulted in the best performance by far in the image recognition task within the first 5 uncovered segments (62 images accurately recognised), when compared to HUMAN (33 images accurately recognised), Inception (21 images accurately recognised) and
VGG (16 images accurately recognised), as illustrated in Figure 3a. Note that our findings are consistent when the data is analysed in a continuous fashion without intervals. This is the first evidence suggesting that human understanding of feature selection is not the most discriminative for recognising images.

Image recognition based on HUMAN ordering catches up with ResNet as more segments are uncovered. Figure 4 shows an example image where humans were able to better identify discriminatory segments. Human selection is independent of the dataset biases that the NNs are exposed to during training. Interestingly, we found that segment ordering based on Inception and VGG results in an increase in the number of images correctly recognised by humans in Task-2 after the uncovering of around 10 segments.
Why ResNet, why: We further examined the images where ResNet performed considerably better than humans in Task-2. We consistently found that while humans particularly focus on segments belonging to the given object, ResNet, and the other machines in general, also focused on discriminative features outside the body of the object that comprise the context. This is illustrated in Figure 1. We see that Inception and ResNet also pick the faces of the women, which is rich context for guessing the correct label of the image, ‘kimono’. The importance of context for image recognition is well documented in the human cognition literature [4] as well as in machine learning [32]. Thus, we reveal that although humans are good at classifying images, they do not always perform well in selecting the most discriminatory features for image recognition in our setting. It is indeed the case that we do not explicitly ask crowd workers to select discriminative segments with respect to the nature of Task-2, but neither are the NNs trained specifically to help humans determine the class label in the fewest segments. In fact, we found that
HUMAN ordering helped other humans guess the fewest images overall (217). VGG helps users guess the most images correctly (244), albeit slowly (i.e., after several segments are uncovered). We reason that since VGG tends to overfit and memorize more patterns, it is able to eventually present good enough segments to facilitate a correct answer for most images. Our findings suggest that deeper networks with residual connections like ResNet learn similar abstractions for image understanding as humans and hence are capable of identifying the segments most essential for accurate image recognition.
Fig. 4. An example of a segmented image from the ‘boathouse’ class (4a) as displayed to humans in Task-1, and the 5 most discriminative segments uncovered in Task-2 (4b: HUMAN, 4c: Inception, 4d: ResNet, 4e: VGG), where humans (HUMAN) selected segments covering the boathouse mostly in the first 5 selections, while machines (Inception, ResNet, VGG) tend to select contextual segments including the sky, river, and grass.
ResNet has the highest explicit dissonance (0.221) while also helping humans guess the most objects within the first 5 segments. Interestingly, in this light, it reinforces the finding that ResNet selects more discriminatory features early on compared to HUMAN. Inception (0.209) and VGG (0.207) are less dissonant but do worse than humans in estimating the importance of discriminative features. VGG exhibits a negative 𝜏 but still gets the most correct guesses overall by revealing key object segments after the context segments. VGG also exhibits a high EMD, indicating its tendency to cater to context first.

Figure 5a illustrates how ResNet selects the best feature to guess coil and has a median number-of-segments-to-correct-guess of 2, while all others require more than 20. This is in accordance with Biederman's recognition-by-components theory, where he showed that a delay in the determination of an object's components has an effect on the identification latency of the object [6]. Therefore, with respect to RQ, we found that humans are not always superior to machines in selecting discriminative segments in images. ResNet ordering led to the most correct guesses within the first 5 segments.
In this section we elaborate further on the impact of image difficulty on both segment selection (Task-1) and object recognition (Task-2).
Task-1: On analyzing the segments selected by humans and machines, we found that the average number of segments selected by humans and the different NNs is nearly the same (∼18 for easy images, ∼17 for difficult images, as shown in Table 2). For the number of segments selected by a NN we only considered segments with a positive score as returned by SHAP. Segments with a positive score are those which directly contribute towards the correct classification decision.

However, on average across all segment orders, humans successfully recognize objects in Task-2 after uncovering around 10 segments of the easy images and 15 segments of the difficult images. We conducted two one-way between-subjects ANOVAs to investigate the effect of image difficulty (for each of easy and difficult) on the average number of segments uncovered to elicit accurate image recognition across the segment ordering conditions (HUMAN, Inception, VGG, ResNet). For both the easy and the difficult images, we found a significant difference across all conditions (p < .05). Post-hoc Tukey HSD tests revealed a significant difference between ResNet and the other three models in case of easy images, and between ResNet and each of Inception and VGG (p < .01) in case of difficult images. Thus, we found that ResNet needs the fewest uncovered segments for successful object recognition for both easy (8.7 segments) and difficult images (14.3 segments) on average, followed by HUMAN with 9.6 segments for easy and 14.6 segments for difficult images.

A two-tailed t-test revealed a significant difference in the average number of segments uncovered to elicit accurate image recognition in Task-2 based on image difficulty (easy, difficult), across all models (humans and machines in aggregate); easy images can be recognised more quickly than their difficult counterparts.

Table 2. Comparison of the number of discriminative segments selected in Task-1 and the number of segments uncovered before eliciting accurate image recognition in Task-2, across different models (humans and machines) and with respect to inherent image difficulty (easy, difficult).
                                         HUMAN            Inception        VGG              ResNet
                                         Easy  Difficult  Easy  Difficult  Easy  Difficult  Easy  Difficult
Time Taken (Task-2, in seconds, avg.)    44    30         46    36         46    34         47    29
                                         172   45         183   53         186   58         179   43
We conducted two one-way between-subjects ANOVAs to investigate the effect of image difficulty (easy and difficult) on the average amount of time taken by human assessors in Task-2 to accurately recognize the images across the different segment ordering conditions (HUMAN, Inception, VGG, ResNet). In case of the easy images, we found a significant difference across the conditions (p < .05). A post-hoc Tukey HSD test revealed a significant difference between the HUMAN segment ordering and each of the three neural networks, while no comparable difference emerged for the difficult images. Our findings show that human assessors took more time to recognize images accurately when the segments were revealed according to the HUMAN ordering in comparison to each of the neural networks, when the images were easy. We reason that this is because the neural networks focus on the context early on, whereas humans tend to select the whole object first, which may still make it hard to identify the object without the aid of contextual cues.
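The one-way ANOVAs used throughout this section can be sketched with scipy's `f_oneway`; the per-condition samples below are hypothetical, not our measurements:

```python
from scipy.stats import f_oneway

# Hypothetical time-to-recognise samples (seconds), one list per
# segment-ordering condition.
human     = [52, 48, 55, 50, 53]
inception = [41, 44, 43, 46, 42]
resnet    = [40, 42, 39, 44, 41]
vgg       = [45, 43, 46, 44, 47]

# One-way ANOVA: does mean recognition time differ across conditions?
f_stat, p_value = f_oneway(human, inception, resnet, vgg)
significant = p_value < 0.05
```

A significant omnibus result would then be followed by post-hoc pairwise comparisons (e.g. Tukey HSD, available in statsmodels) as in the analysis above.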
Task-2: We first explored the nature of image classes in our dataset with respect to the class membership of images that were correctly recognised. We define a class as being covered if at least one image belonging to it was correctly recognised.
VGG corresponds to the highest class coverage of 68%, while HUMAN corresponds to the lowest.
Table 3. Class coverage resulting from segment ordering by humans and different machines. Bold classes are covered *only* by the corresponding model.

Ordering    Class Coverage   Example Covered Classes
Inception   60%              strainer, water buffalo, scoreboard, ...
ResNet                       dam, milk can, cannon, ...
VGG                          freight car, strainer, kimono, ...
HUMAN       58%              boathouse, common iguana, car mirror, ...

Across all images that were correctly recognised using human and NN ordering of segments, we observe the expected trend of easy images being recognised quickly (with fewer uncovered segments) and the difficult images requiring more uncovered segments before being correctly guessed (as shown in Figure 3b). Finally, we also found that the
VGG segment ordering was most effective in correctly recognising difficult images in comparison to HUMAN and the other machines (see Figure 3c).

In Table 4, we present a confusion matrix of cases when a given model (human or machine) performs better than, or dominates, another.
Table 4. Confusion matrix of model domination. Domination values are counts in two image scenarios: Easy/Difficult.

            Inception   ResNet   VGG      Human
Inception   -           66/31    103/22   81/31
ResNet                  -
VGG                              -
Human                                     -

Each cell (i, j) counts the number of instances when model i dominates model j in terms of the number of segments required to guess the correct image type. We present domination values as counts in two scenarios – when the image is considered to be easy, and difficult. Consider the difference between HUMAN selection and ResNet.
HUMAN selection dominates ResNet on 73 easy images, as opposed to being dominated on 104 easy images by ResNet, wherein ResNet segment ordering leads to correct recognition with fewer uncovered segments.

Addressing RQ, we found that image difficulty, and the order and number of discriminative segments revealed, influence the accuracy of humans (i.e., crowd workers in Task-2) in the image recognition task.
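The domination counts in Table 4 can be sketched as follows; the per-image median segment counts and the easy/difficult split below are hypothetical:

```python
import numpy as np

def domination_counts(m, easy_mask):
    """m: dict model -> per-image median segments-to-correct-guess.
    Returns dict (i, j) -> (easy_count, difficult_count): how often
    model i needed strictly fewer segments than model j."""
    easy_mask = np.asarray(easy_mask, bool)
    counts = {}
    for i, mi in m.items():
        for j, mj in m.items():
            if i == j:
                continue
            wins = np.asarray(mi) < np.asarray(mj)
            counts[(i, j)] = (int((wins & easy_mask).sum()),
                              int((wins & ~easy_mask).sum()))
    return counts

# Hypothetical medians for 4 images: the first two are easy, the last two difficult.
m = {"HUMAN": [3, 9, 14, 20], "ResNet": [2, 7, 15, 18]}
counts = domination_counts(m, easy_mask=[True, True, False, False])
# ResNet dominates HUMAN on both easy images and one difficult image.
```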
Fig. 5. Heat maps encoding the order of segment importance corresponding to humans and machines (in clockwise direction from top-left: HUMAN, VGG, Inception, ResNet) for five example images: (a) coil, (b) accordion, (c) custard apple, (d) strainer, and (e) car mirror. The heat map is a visualisation of the importance scores returned by SHAP. The intensity of the green colour shows the relative importance between segments. In Task-2, the segments are shown in the order of intensity as displayed in these heat maps. Segments with no coloration have an importance score of 0 or lower.
Demographics of Participants –
To maintain the integrity of the experimental setup and not divert worker attention from the task at hand, we did not gather explicit background information from crowd workers regarding their demographics. Based on the data available by default from the FigureEight platform, we found that 68 distinct trustworthy workers from 17 different countries completed 1,500 instantiations of the image classification task (300 images × 5 judgements each).

Key Takeaways –
Our results revealed interesting insights into how both humans and machines approach the task of object recognition in images. The first key takeaway is that humans are not consistently better than machines when it comes to selecting discriminative segments in images. From our study we see that ResNet is better than HUMAN at helping workers quickly identify images. ResNet is a deep neural network with residual connections that help to better train a deep network. We see that deep networks (including Inception) select good discriminative features
when compared to the denser and shorter
VGG. We ascertain these features to be discriminative due to the support from Biederman's work on ‘human image understanding’, where he showed that a delay in the determination of an object's components leads to increased latency in recognising the object. We note that some of the interesting scenarios where humans are worse than neural network models in selecting discriminative features for recognition open up interesting avenues for future work. In particular, understanding how humans perceive context, and the role that context plays in human understanding, can be pivotal in building more human-like machines. Our work presents an important first step towards the vision of thoroughly understanding the dissonance between humans and machines across a variety of tasks.
Crowdsourcing Setup –
We took several measures to ensure the reliability of responses gathered from crowd workers in Task-1 and Task-2 [15, 16]. By using dynamic worker lists, we made sure that workers participated in only one crowdsourcing task in our entire study. Workers in Task-1 were not allowed to participate in Task-2, and workers were not allowed to participate in more than one condition within Task-2 (
HUMAN, Inception, ResNet or VGG). We chose not to show workers in Task-1 all 1000 ImageNet classes, since in our pilot study workers exhibited a tendency to repeatedly guess the labels of images they had previously been exposed to in the task upon encountering a new image to recognise. We accounted for this in our final study setup described in Task-1 by limiting the number of images being shown and ensuring that only images from distinct classes were shown to each worker. Showing workers all 1000 classes beforehand would also have potentially increased their cognitive load significantly, thereby biasing our experimental setup.
Recognition-by-Components –
We adopt a simplified understanding of Biederman's theory [6] of object recognition. Note that in the original theory, Biederman showed that a set of components could be derived from five properties of edges in a 2-dimensional image: curvature, co-linearity, symmetry, parallelism and co-termination. Since the detection of these properties has been shown to be invariant to image quality and viewing position, we project this notion of components onto image ‘segments’ in our case.
Super pixel segmentation –
By operating on this space for both humans and neural networks, we make comparisons easier and more accurate. Using free-form annotations from humans as an alternative to super pixel segmentation would make agreement computation complex, make aggregation of annotations hard, and introduce noise into the metrics.
Framing of Task-1 –
Our goal within Task-1 was to understand how humans select important segments for identifying the given object in an image. It was therefore important to frame the task without confounding it with the end goal of helping other humans recognise the object. Our rationale is that in the image classification task, ML models also operate with the aim of correctly identifying the object in the image. The aim of Task-2 was then to measure the impact of the dissonance between human and machine understanding of images, where other humans are tasked with recognising an object being revealed one segment at a time. Further experiments are required to test whether framing the task differently, and asking workers to focus more on the segments they would pick to help other humans identify the object, would have a significant impact on their segment selection process.
Selection of Discriminatory Segments –
It must be acknowledged that an alternative hypothesis that can explain the segment selection process of humans in Task-1 is the possibility that humans use more context in their decision making than they attribute to it. Another potential factor that may influence the segment selection process of humans is noise in their ranking of segments in decreasing order of importance beyond the first few segments.
Choices Made for Segment Ordering –
Using SHAP with deep models possessing fewer parameters (Inception and ResNet) only gives us the distribution of importance over the segments the network focuses on. Since they are smaller models, they focus on fewer segments and we do not have an overall ordering of all 50 segments. For some images, it is a difficult task to order all 50 segments accurately, even for humans. To overcome such cases, once we run out of annotations/importance estimates, the segment ordering for the rest of the image is uniformly random, which could lead to low information gain. Using SHAP with VGG, on the other hand, results in information about nearly all segments, which could be another indicator as to why it corresponds to the most images accurately recognised overall in Task-2.
An important goal for CSCW and HCI research today is to make AI systems more receptive to human needs. Understanding human-machine dissonance (e.g. by answering which neural network is more human-like) has direct applications in evaluating and building credible and interpretable machine learning models [67] which can support and shape our everyday interactions. Users are more likely to trust and adopt credible models whose explanations conform to established domain understanding. We provide metrics to understand dissonance, and a dataset that the community can use for evaluating and training models for object recognition. Our work can inspire and inform further studies that evaluate the “human-ness” of neural network models in different tasks, both from a design-choices standpoint and through our findings.
Ethical implications of our work.
Our study can inform and further the ethical discussions around machine learning models in terms of their congruence with human expectations. Machine learning models increasingly mediate our daily lives, nudging human behaviour along the way [47]. However, with the boon of nudging human behaviour in a positive or intended direction comes the risk that human behaviour may be nudged in undesirable or unintended ways. For example, people can be influenced to buy certain products, watch particular television programs, or even vote for particular political parties.

We aim to better understand the congruence of human expectations with machines by studying the dissonance between humans and machines. We use the lens of the image classification task, analysing the segments of an image that humans consider important for classifying a given object, in contrast to machines. Images where humans take longer to determine the class label (a higher number of segments in Task-2), or justify their decisions differently (according to metrics like EMD and 𝜏 in Task-1) compared to a neural network model, are strong indicators of human and machine misalignment. Understanding such disagreement can help to reason about whether the misalignment is ethically sensitive, i.e. whether the model is making the right choice for ethically or morally wrong reasons. Secondly, since Task-2 does not explicitly inform subjects about the source of the segment ordering (humans or machines), we can potentially further analyse which neural network models engender the trust of actual end-users. Finally, our experimental framework provides a principled approach for evaluating “how congruent machine learning models are to the expectations of humans”, which can be defined in terms of ethical considerations.
In this paper, we focus on juxtaposing human understanding in an image recognition task with that of machines in two central scenarios of human decision making – the selection of discriminative segments in an image and object recognition. We conducted a large-scale crowdsourcing study entailing 7,000 HITs with the aim of furthering the understanding of dissonance between humans and machines in the image classification task. To this end, we proposed novel metrics to measure the dissonance between humans and 3 state-of-the-art neural network models (Inception, ResNet,
Proc. ACM Hum.-Comput. Interact., Vol. 3, No. CSCW, Article 56. Publication date: November 2019. ). Our findings suggest that human perception of feature importance (i.e., the selection ofdiscriminative segments in Task-1) does not consistently result in better human image recognition(in Task-2) in comparison to that by the neural network models considered in this work. We foundthat the models that are close to human understanding also generalise better. Our experimentalevidence shows that humans are not always able to effectively exploit the use of context towardsdetermining good features (i.e., discriminative segments in images).We also found that image difficulty is directly correlated with the effort in recognising objectsirrespective of human or machine selected features. Finally, we release our entire dataset consistingof the two-stage crowd sourced tasks, complete with annotations from crowd workers for evaluationof image classification models. Our experiments in this paper shed light on the value that such adataset and task design can bring to the CSCW and HCI community in furthering the understandingof human-machine dissonance. For example we unearth the fact that over-parametrized models like
VGG tend to be more robust even if they are not the best performing models on easy images. We posit that building more human-like machines can result in their seamless integration into our everyday lives, through interactions including collaboration and cooperation.

In future work, we will investigate the dissonance between humans and machines (i) when they both make the same error ('are machines wrong for the right reasons?'), which is key in critical domains like health and defence, and (ii) in other tasks such as visual question answering, machine translation, and document retrieval. We also aim to investigate the effects of a closed-domain assumption for image recognition and other classification tasks, where the set of classes/labels is known to the assessor.
Acknowledgements
We thank all the anonymous crowd workers who participated in our experiments. This research has been supported in part by the Amazon Research Awards, and the Erasmus+ project DISKOW (grant no. 60171990).
REFERENCES
[1] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y Lim, and Mohan Kankanhalli. 2018. Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 582.
[2] Arash Afraz, Daniel LK Yamins, and James J DiCarlo. 2014. Neural mechanisms underlying visual object recognition. In Cold Spring Harbor Symposia on Quantitative Biology, Vol. 79. Cold Spring Harbor Laboratory Press, 99–107.
[3] Avishek Anand, Kilian Bizer, Alexander Erlei, Ujwal Gadiraju, Christian Heinze, Lukas Meub, Wolfgang Nejdl, and Bjoern Steinroetter. 2018. Effects of Algorithmic Decision-Making and Interpretability on Human Behavior: Experiments using Crowdsourcing. In Proceedings of the HCOMP 2018 Works in Progress and Demonstration Papers Track of the Sixth AAAI Conference on Human Computation and Crowdsourcing (HCOMP 2018), Zurich, Switzerland, July 5–8, 2018.
[4] Mark E Auckland, Kyle R Cave, and Nick Donnelly. 2007. Nontarget objects can influence perceptual processes during object recognition. Psychonomic Bulletin & Review 14, 2 (2007), 332–337.
[5] Shlomo Berkovsky, Ronnie Taib, and Dan Conway. 2017. How to recommend?: User trust factors in movie recommender systems. In Proceedings of the 22nd International Conference on Intelligent User Interfaces. ACM, 287–300.
[6] Irving Biederman. 1985. Human image understanding: Recent research and a theory. Computer Vision, Graphics, and Image Processing 32, 1 (1985), 29–73.
[7] Reuben Binns, Max Van Kleek, Michael Veale, Ulrik Lyngs, Jun Zhao, and Nigel Shadbolt. 2018. 'It's Reducing a Human Being to a Percentage': Perceptions of Justice in Algorithmic Decisions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM, New York, NY, USA, Article 377, 14 pages. https://doi.org/10.1145/3173574.3173951
[8] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1721–1730.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 248–255.
[10] Finale Doshi-Velez and Been Kim. 2017. Towards a rigorous science of interpretable machine learning. (2017).
[11] Leonidas AA Doumas, Guillermo Puebla, and Andrea E Martin. 2018. Human-like generalization in a machine through predicate learning. arXiv preprint arXiv:1806.01709 (2018).
[12] Michael W Eysenck and Mark T Keane. 2013. Cognitive Psychology: A Student's Handbook. Psychology Press.
[13] Gerhard Friedrich and Markus Zanker. 2011. A taxonomy for generating explanations in recommender systems. AI Magazine 32, 3 (2011), 90–98.
[14] Ujwal Gadiraju, Besnik Fetahu, and Ricardo Kawase. 2015. Training workers for improving performance in crowdsourcing microtasks. In Design for Teaching and Learning in a Networked World. Springer, 100–114.
[15] Ujwal Gadiraju, Ricardo Kawase, Stefan Dietze, and Gianluca Demartini. 2015. Understanding malicious behavior in crowdsourcing platforms: The case of online surveys. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 1631–1640.
[16] Ujwal Gadiraju, Jie Yang, and Alessandro Bozzon. 2017. Clarity is a worthwhile quality: On the role of task clarity in microtask crowdsourcing. In Proceedings of the 28th ACM Conference on Hypertext and Social Media. ACM, 5–14.
[17] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Schütt, Matthias Bethge, and Felix A Wichmann. 2018. Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems. 7549–7561.
[18] Justin Scott Giboney, Susan A Brown, Paul Benjamin Lowry, and Jay F Nunamaker Jr. 2015. User acceptance of knowledge-based system recommendations: Explanations, arguments, and fit. Decision Support Systems 72 (2015), 1–10.
[19] Shirley Gregor and Izak Benbasat. 1999. Explanations from intelligent systems: Theoretical foundations and implications for practice. MIS Quarterly (1999), 497–530.
[20] Anikó Hannák, Claudia Wagner, David Garcia, Alan Mislove, Markus Strohmaier, and Christo Wilson. 2017. Bias in online freelance marketplaces: Evidence from TaskRabbit and Fiverr. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. ACM, 1914–1933.
[21] IEEE Global Initiative et al. 2016. Ethically Aligned Design. IEEE Standards v1 (2016).
[22] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[23] Tatiana Josephy, Matt Lease, Praveen Paritosh, Markus Krause, Mihai Georgescu, Michael Tjalve, and Daniela Braga. 2014. CrowdScale 2013: Crowdsourcing at Scale Workshop Report. AI Magazine 35, 2 (2014), 75–78.
[24] Daniel Kahneman. 2003. A perspective on judgment and choice: Mapping bounded rationality. American Psychologist 58, 9 (2003), 697.
[25] Daniel Kahneman, Andrew M Rosenfield, Linnea Gandhi, and Tom Blaser. 2016. Noise: How to overcome the high, hidden cost of inconsistent decision making. Harvard Business Review 94, 10 (2016), 38–46.
[26] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. 2016. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems. 2280–2288.
[27] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. 2017. Human decisions and machine predictions. The Quarterly Journal of Economics (2017).
[28] Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730 (2017).
[29] Ranjay A Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A Shamma, Li Fei-Fei, and Michael S Bernstein. 2016. Embracing error to enable rapid crowdsourcing. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, 3167–3179.
[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[31] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. 2017. Building machines that learn and think like people. Behavioral and Brain Sciences 40 (2017).
[32] Wallace Lawson, Laura Hiatt, and J Trafton. 2014. Leveraging cognitive context for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 381–386.
[33] Min Kyung Lee and Su Baykal. 2017. Algorithmic mediation in group decisions: Fairness perceptions of algorithmically mediated vs. discussion-based social division. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. ACM, 1035–1048.
[34] Min Kyung Lee, Daniel Kusbit, Evan Metsky, and Laura Dabbish. 2015. Working with machines: The impact of algorithmic and data-driven management on human workers. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 1603–1612.
[35] Benjamin Letham, Cynthia Rudin, Tyler H McCormick, David Madigan, et al. 2015. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics 9, 3 (2015), 1350–1371.
[36] Zachary C Lipton. 2016. The mythos of model interpretability. ICML Workshop on Human Interpretability of Machine Learning (2016).
[37] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. 4765–4774.
[38] Gaspard Monge. 1781. Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie Royale des Sciences de Paris (1781).
[39] David G Myers. 2002. The powers & perils of intuition. Psychology Today 35, 6 (2002), 42–52.
[40] Kenya Freeman Oduor and Eric N Wiebe. 2008. The effects of automated decision algorithm modality and transparency on reported trust and task performance. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 52. SAGE Publications, Los Angeles, CA, 302–306.
[41] David Oleson, Alexander Sorokin, Greg P Laughlin, Vaughn Hester, John Le, and Lukas Biewald. 2011. Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing. Human Computation 11, 11 (2011).
[42] Cathy O'Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, New York, NY.
[43] Alexis Papadimitriou, Panagiotis Symeonidis, and Yannis Manolopoulos. 2012. A generalized taxonomy of explanations styles for traditional and social recommender systems. Data Mining and Knowledge Discovery 24, 3 (2012), 555–583.
[44] Robin L Plackett. 1975. The analysis of permutations. Applied Statistics (1975), 193–202.
[45] Martin Porcheron, Joel E Fischer, Stuart Reeves, and Sarah Sharples. 2018. Voice interfaces in everyday life. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 640.
[46] Emilee Rader, Kelley Cotter, and Janghee Cho. 2018. Explanations As Mechanisms for Supporting Algorithmic Transparency. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM, New York, NY, USA, Article 103, 13 pages. https://doi.org/10.1145/3173574.3173677
[47] Iyad Rahwan, Manuel Cebrian, Nick Obradovich, Josh Bongard, Jean-François Bonnefon, Cynthia Breazeal, Jacob W Crandall, Nicholas A Christakis, Iain D Couzin, Matthew O Jackson, et al. 2019. Machine behaviour. Nature 568, 7753 (2019), 477–486.
[48] Rishi Rajalingham, Elias B Issa, Pouya Bashivan, Kohitij Kar, Kailyn Schmidt, and James J DiCarlo. 2018. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. Journal of Neuroscience 38, 33 (2018), 7255–7269.
[49] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386 (2016).
[50] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1135–1144.
[51] Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez. 2017. Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 2662–2670. https://doi.org/10.24963/ijcai.2017/371
[52] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. 1998. A metric for distributions with applications to image databases. In Computer Vision, 1998. Sixth International Conference on. IEEE, 59–66.
[53] Martin Schrimpf, Jonas Kubilius, Ha Hong, Najib J Majaj, Rishi Rajalingham, Elias B Issa, Kohitij Kar, Pouya Bashivan, Jonathan Prescott-Roy, Kailyn Schmidt, et al. 2018. Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like? BioRxiv (2018), 407007.
[54] Grace S Shieh. 1998. A weighted Kendall's tau statistic. Statistics & Probability Letters 39, 1 (1998), 17–24.
[55] Hirokazu Shirado and Nicholas A Christakis. 2017. Locally noisy autonomous agents improve global human coordination in network experiments. Nature 545, 7654 (2017), 370–374.
[58] Herbert A Simon. 1997. Models of Bounded Rationality: Empirically Grounded Economic Reason. Vol. 3. MIT Press.
[59] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[60] Elizabeth Stowell, Mercedes C Lyson, Herman Saksono, Reneé C Wurth, Holly Jimison, Misha Pavel, and Andrea G Parker. 2018. Designing and Evaluating mHealth Interventions for Vulnerable Populations: A Systematic Review. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 15.
[61] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, Vol. 4. 12.
[62] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[63] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2818–2826.
[64] Nava Tintarev and Judith Masthoff. 2007. A survey of explanations in recommender systems. In Data Engineering Workshop, 2007 IEEE 23rd International Conference on. IEEE, 801–810.
[65] Alexandra Vtyurina and Adam Fourney. 2018. Exploring the role of conversational cues in guided task support with virtual assistants. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 208.
[66] Jiaxuan Wang, Jeeheh Oh, Haozhu Wang, and Jenna Wiens. 2018. Learning Credible Models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, New York, NY, USA, 2417–2426. https://doi.org/10.1145/3219819.3220070
[67] Jiaxuan Wang, Jeeheh Oh, Haozhu Wang, and Jenna Wiens. 2018. Learning credible models. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2417–2426.
[68] Weiquan Wang and Izak Benbasat. 2007. Recommendation agents for electronic commerce: Effects of explanation facilities on trusting beliefs. Journal of Management Information Systems 23, 4 (2007), 217–246.
[69] William Webber, Alistair Moffat, and Justin Zobel. 2010. A Similarity Measure for Indefinite Rankings. ACM Trans. Inf. Syst. 28, 4, Article 20 (Nov. 2010), 38 pages. https://doi.org/10.1145/1852102.1852106
[70] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning. 2048–2057.
[71] Daniel LK Yamins, Ha Hong, Charles F Cadieu, Ethan A Solomon, Darren Seibert, and James J DiCarlo. 2014. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences (2014).
[72] Tal Z Zarsky. 2016. The trouble with algorithmic decisions: An analytic road map to examine efficiency and fairness in automated and opaque decision making. Science, Technology, & Human Values 41, 1 (2016), 118–132.
[73] Nan-ning Zheng, Zi-yi Liu, Peng-ju Ren, Yong-qiang Ma, Shi-tao Chen, Si-yu Yu, Jian-ru Xue, Ba-dong Chen, and Fei-yue Wang. 2017. Hybrid-augmented intelligence: Collaboration and cognition. Frontiers of Information Technology & Electronic Engineering 18, 2 (2017), 153–179.