Humans learn too: Better Human-AI Interaction using Optimized Human Inputs
Johannes Schneider University of Liechtenstein, Vaduz, [email protected]
Abstract
Humans rely more and more on systems with AI components. The AI community typically treats human inputs as a given and optimizes AI models only. This thinking is one-sided and it neglects the fact that humans can learn, too. In this work, human inputs are optimized for better interaction with an AI model while keeping the model fixed. The optimized inputs are accompanied by instructions on how to create them. They allow humans to save time and cut down on errors, while keeping required changes to the original inputs limited. We propose continuous and discrete optimization methods modifying samples in an iterative fashion. Our quantitative and qualitative evaluation, including a human study on different hand-generated inputs, shows that the generated proposals lead to lower error rates, require less effort to create and differ only modestly from the original samples.
Introduction
In an industrial setting, production processes involving humans and robots are constantly improved to reduce errors and improve efficiency. Improvements target measures towards humans and machines alike in a holistic manner. In the machine learning community, the setup is usually different: treat the dataset originating from people as fixed and find the best possible model. This work breaks with this paradigm. It focuses on helping humans to improve, i.e., humans are offered feedback allowing them to alter their inputs to machine learning systems. See Figure 1 for a process overview. By improving their interaction with AI systems, humans might benefit from lower time to create inputs and fewer misunderstandings in interaction. This might be highly attractive, since "smart systems" containing machine learning components have already deeply penetrated our daily life and humans interact with multiple such systems on a daily basis: recommendation systems (on web pages), voice assistants, gesture recognition systems, driver-assisted cars, to name a few. Furthermore, thanks to the recent success of AI and related technologies such as deep learning, interaction is likely to continue and include more and more safety-critical systems, such as fully self-driving cars, where errors in interaction might have fatal consequences. Thus, there are good reasons why humans
might want to improve on their inputs.

Figure 1: Humans change their inputs to AI based on feedback, leading to time savings and better recognizability

Furthermore, if human inputs are of very low quality, so that they are either not recognizable or easy to confuse, even highly sophisticated machine learning systems might not be able to correctly interpret them. These points illustrate the need to help humans improve on their behavior. Humans should be educated to produce inputs that reduce errors in interaction and require little effort to create. Furthermore, learning the proposed changes should also be easy. Unfortunately, fulfilling all of these objectives is difficult for multiple reasons. First, there is a trade-off between accuracy in recognition and effort to create. Simplifying inputs too much will increase error rates. Making inputs more complex, e.g. by adding redundant features to inputs, might reduce errors, but increase the amount of effort to create them. Second, streamlining human-to-AI interaction is more intricate than human-to-human interaction, since AI systems process information differently and are sensitive to input changes that are hardly noticed by humans. Using almost invisible, adversarial perturbations, classifiers might be "fooled" to misjudge samples that can be categorized without any problems by humans (e.g. (Poursaeed et al. 2018)). From our perspective this is both good and bad news. While it suggests that minor changes exist that might increase confidence in the correct class (rather than in an incorrect class as used in an adversarial setting), it also highlights that humans might not even be able to distinguish an improved from a non-improved sample if optimization is not done with care. Moreover, human-controlled movement of limbs (and the vocal tract) is subject to stochastic variation, making it close to impossible to alter inputs in very subtle ways as done in adversarial settings like (Poursaeed et al.
2018).

Designing adequate optimization objectives is non-trivial. Even if the proposed samples optimize a specific mathematical objective, it is not clear how well humans can actually learn such changes. That is, humans might deem the proposed inputs unnatural and they might, at least initially, struggle to reproduce changes that deviate from deeply rooted habits they have pursued for decades. While we argue in this work that small changes bearing some similarity to existing inputs from humans are easier to learn, this is not the sole reason that proposals should resemble the original inputs. Neglecting this constraint would ultimately lead to proposing the same, "optimal" prototypical input for each human. This is not just at odds with the requirement to suggest small changes that are easier to learn, but it also strongly works against human diversity, which is considered highly valuable.

This work is among the first to discuss the topic of improving human-to-AI interaction through optimization. It contributes as follows:

• Proposing algorithms to optimize single samples of an individual directly in an iterative manner rather than using a separate system to do so. While the latter might be computationally faster, it typically performs some sort of generalization leading to more uniform, less diverse samples. Our approach leads to highly personalized samples, which yields better outcomes in terms of needed time and accuracy, while preserving diversity among humans.

• Showing inputs in combination with how to create them. That is, we utilize (hand) movement data, highlighting the order and direction of movements rather than exposing a human only to the suggested optimized inputs without any instructions on how to create them efficiently.
• Conducting an extensive evaluation including a user study, proposing and using metrics that account for discrepancies between suggested inputs and those that are actually created by humans due to innate variation in human movement.
Problem
This work focuses on object classification. It works on data being the result of human (physical) activity, i.e. hand movements as needed for sketching and writing (and gestures). Each input X by the human to the classifier should be labeled as a specific class Y. An input X is a sequence of points X = ((x_i, y_i, I_i)) ordered in the way the input was created, where (x_i, y_i) are the coordinates of the i-th point and I_i is an indicator taking one of the three values {-1, 0, 1}. The value '1' means no line was drawn when moving to this point. '0' indicates a line has been drawn from the prior point i-1 to point i. The value '-1' only applies for (array) data structures of fixed length, if there are fewer points than the array length; it indicates that the (array) position is empty. The sequence of points can be split into contiguous line segments, i.e. strokes. A stroke denotes a sequence of connected points, so that no points before and after the sequence (if they exist) are connected to any point of the sequence. We denote as a stroke segment two connected points of a stroke, i.e. ((x_i, y_i, I_i), (x_{i+1}, y_{i+1}, I_{i+1})). The classifier C processing human inputs is optimized using a known loss L_C, typically the cross-entropy loss. The model C is treated as unchangeable, but it can be used in the optimization process. Our human interaction optimization algorithm O is provided a sample X with its label Y. It computes an optimized input (or proposal) X̂ := O(X, Y) that should improve on X according to one or several objectives. Thus, there is no labeled data available, i.e. there are no known optimal proposals which could be used to train a model. We aim to provide some guidance for a user showing how a suggested proposal can be created.
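The point-sequence encoding above can be illustrated with a minimal Python sketch (function and variable names are ours, not the paper's):

```python
# Minimal sketch of the input encoding described above (illustrative only).
# A point is (x, y, I): I == 1 means the pen moved here without drawing,
# I == 0 means a line was drawn from the prior point, I == -1 marks padding
# in fixed-length arrays.

def split_into_strokes(points):
    """Split a sequence of (x, y, I) points into contiguous strokes."""
    strokes = []
    current = []
    for x, y, ind in points:
        if ind == -1:                 # padding: no more real points
            break
        if ind == 1 and current:      # pen lifted: close the previous stroke
            strokes.append(current)
            current = []
        current.append((x, y, ind))
    if current:
        strokes.append(current)
    return strokes

# A two-stroke input: each stroke is a run of connected points.
sketch = [(0, 0, 1), (10, 0, 0), (10, 10, 0),   # stroke 1
          (20, 20, 1), (30, 20, 0)]             # stroke 2
print(len(split_into_strokes(sketch)))  # → 2
```

A stroke segment in this encoding is simply a pair of consecutive points within one of the returned strokes.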
That is, we show all operations on how to draw the optimized output X̂ based on changing her input X. We consider the following alterations of the creation process of the original input: (i) changing the (drawing) direction of a stroke, (ii) changing the position of one or more strokes, (iii) moving a point and (iv) deleting a stroke segment. If a stroke segment is deleted that does not consist of the last two or first two points of a stroke, then the stroke is split into two shorter strokes.

Objectives and Measures
We consider three main objectives:
Minimizing time to create inputs: While the time to create an optimized sample is easily measured, e.g. in a user study, it is difficult to incorporate as an optimization objective. In our optimization algorithm, we use a more tangible proxy metric to estimate the time to create an optimized sample, i.e. the total distance the hand has to move to create the sketch. We do not include the movement to the start point, but we account for movements of the hand between the end point of a stroke and the starting point of the next stroke. We denote this as the effort loss:

L_E(X) := Σ_{i=0}^{|X|-2} √((x_i - x_{i+1})² + (y_i - y_{i+1})²)

Minimizing mis-understanding while accounting for human variation: The amount of wrongly extracted or interpreted information by the AI should be kept as small as possible. That is, optimized inputs should lead to better task performance, i.e. higher recognition accuracy, when processed by the AI. Thus, generated samples X̂ are created minimizing the classifier loss L_C(X̂) among other losses. Proposed samples should also allow for variance in human behavior. Human behavior is characterized by unintentional variation. For example, a human is not able to reproduce even a single one of her own strokes exactly. Since classifiers are knowingly sensitive to small changes in the input, as witnessed by adversarial examples, the robustness of optimized samples should be evaluated, e.g. as done for adversarial samples using linear programs (Bastani et al. 2016). We model human variation by creating noisy samples of a proposed input X̂. We measure accuracy Acc_Noi on these noisy samples. There are multiple approaches to create noisy samples, e.g. using local and global deformations (Yu et al. 2017). We went for a well-established, easy to comprehend approach: a noisy sample X̂' is created by adding uniform noise to each coordinate. That is, for a sequence X̂, the noisy sequence X̂' is

X̂' := ((x̂_i + ε_{i,1}, ŷ_i + ε_{i,2}, Î_i) | (x̂_i, ŷ_i, Î_i) ∈ X̂)

Each ε_{i,j} ∈ [-r, r] is chosen uniformly and independently at random within a fixed range. The range is upper and lower bounded by a constant r that depends on the dataset, e.g. it might be 20 pixels or 1 cm.

Minimize modifications of original samples: The proposed samples should bear large similarity to the original inputs. Preserving characteristics of the inputs as much as possible is aligned with the idea that inputs remain comprehensible for other humans or other systems (given that they were so in the first place), diversity is maintained among humans, and changes are easy to comprehend and execute for humans. We use two loss objectives depending on what type of alteration is applied to an input. The length-wise difference loss L_D captures how much the original and suggested sample differ in visible parts. Length corresponds to the sum of lengths of all visible strokes. That is, only the final outcome is relevant. We do not account for ordering and direction of strokes, which impacts distances to be moved but not the outcome.
The length of visible parts, i.e. strokes, of an input X is given by

V(X) := Σ_{i=0, I_{i+1} ≠ 1}^{|X|-2} √((x_i - x_{i+1})² + (y_i - y_{i+1})²)

The loss is:

L_D(X, X̂) := |V(X) - V(X̂)|

This objective is adequate if parts of the input are removed. In this case, the original and the modified sample are identical except that one is missing some parts. The point-wise difference loss L_P captures the displacement of individual points between the original and suggested sample. It is suitable when the positions of points are altered, i.e. points are moved but their order is kept:

L_P(X, X̂) := Σ_{i=0}^{|X|-1} √((x_i - x̂_i)² + (y_i - ŷ_i)²)

These three objectives require trade-offs: keeping changes minimal is at odds with the other objectives, since any change to the input to address the other objectives is non-desirable. Furthermore, minimizing time to create samples is at odds with recognizability. Little time to create implies little information is contained in the outputs, which makes discriminating inputs harder. Thus, what is most preferred (time, recognizability, or minimal changes) is a subjective decision left to the end user. She must state her preferences. We consider two mechanisms by which a user can state her inclinations: (i) weighing objectives and (ii) providing constraints. Constraints are ensured not to be violated while objectives are optimized, but no guarantee can be given beforehand to what extent an objective is fulfilled. To keep matters simple, we shall consider two (primary) objectives, i.e. either minimize "time" or maximize "accuracy", while constraining the maximal distortion of the original input. Additionally, we enforce that optimized inputs must still be recognizable by the classifier or become recognizable due to the optimization process.
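The loss terms and the noise model above can be written down directly; the following is a minimal Python sketch under our own naming (inputs are lists of (x, y, I) points as defined in the Problem section):

```python
import math
import random

def effort_loss(points):
    """L_E: total hand-movement distance over consecutive points,
    including pen-up transitions between strokes."""
    return sum(math.hypot(points[i][0] - points[i + 1][0],
                          points[i][1] - points[i + 1][1])
               for i in range(len(points) - 1))

def visible_length(points):
    """V: summed length of drawn segments only, i.e. segments whose
    endpoint has indicator I == 0 (a line was drawn to it)."""
    return sum(math.hypot(points[i][0] - points[i + 1][0],
                          points[i][1] - points[i + 1][1])
               for i in range(len(points) - 1) if points[i + 1][2] == 0)

def length_diff_loss(x, x_hat):
    """L_D: absolute difference in visible length."""
    return abs(visible_length(x) - visible_length(x_hat))

def point_diff_loss(x, x_hat):
    """L_P: summed displacement of corresponding points."""
    return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(x, x_hat))

def noisy_copy(points, r):
    """Model human variation: independent uniform noise in [-r, r]
    on each coordinate, keeping the indicators."""
    return [(x + random.uniform(-r, r), y + random.uniform(-r, r), ind)
            for x, y, ind in points]
```

Measuring accuracy over several `noisy_copy` versions of each proposal (e.g. with r = 20 pixels) corresponds to the Acc_Noi metric used later.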
Methodology
In our setting, learning requires trial-and-error to identify which alterations of input samples are beneficial, due to the lack of labeled data. We optimize each input in an iterative fashion, where the strategy to investigate the next option to "try" depends on the number of possible solutions. We employ three strategies: gradient descent, a greedy approach and a brute-force approach. Alternatively, one might train a machine learning model to do the optimization of subsequently provided inputs in one forward pass (e.g. (Schneider 2020; Riaz Muhammad et al. 2018)). A model is likely advantageous if labeled data is available, generalization across samples is helpful, or computational load is a concern. But since detailed characteristics of the input should be preserved as much as possible and the diversity of inputs is large, generalization possibilities seem limited. Applying optimization techniques directly on individual samples seems preferable, as also confirmed by our experimental evaluation.
Optimizing Individual Samples
We need to solve a constrained optimization problem encompassing discrete operations, e.g. removal of stroke segments, and continuous operations, e.g. changing coordinates of points. We employ a general framework (Algorithm 1) that is adjusted to each alteration operation.
Algorithm 1
Discrete optimization
Input : Input X of class Y , Classifier C , Max. distortion d , Primary objective Obj
Output: Optimized input X̂

  n_iter := 20000  {number of iterations}
  X̂ := X  {best solution so far}
  Eval(Z) := L_E(Z) if Obj = Time, L_C(Z) if Obj = Accuracy
  for i from 1 to n_iter do
    Z := solution candidate obtained by changing X̂
    if L_M(Z) < d · V(X) then  {similar to original}
      if C(Z) = Y ∨ C(X̂) ≠ Y then  {classified correctly, or no best solution (incl. X) has ever been}
        if Eval(Z) < Eval(X̂) then X̂ := Z
  end for

Algorithm 1 shows the general outline for our discrete optimization employing (one of) the following operations: removal of strokes, reversing stroke direction or changing stroke order. The maximal allowed deviation d is specified by the user, as well as the objective Obj, which is either time or accuracy. The algorithm creates a new solution candidate Z in each iteration using the considered operation. A solution candidate Z is checked whether it fulfills the following two constraints: (i) it does not deviate too much from the original input; (ii) it is classified correctly or no best solution X̂ has been classified correctly so far. If a solution candidate Z fulfills both constraints and has a lower loss value according to the user-determined objective Obj, the best solution X̂ is set to the solution candidate Z. Solution candidates are generated for each operation as follows: for removal of visible parts, we consider all stroke segments s_i := ((x_i, y_i, I_i), (x_{i+1}, y_{i+1}, I_{i+1})). A solution candidate Z based on X̂ is obtained by removing segment s_i from X̂, i.e. Z := X̂ \ s_i. In case point i+1 is the last point of a stroke, it is also removed. Otherwise, we additionally set I_{i+1} = 1 to indicate the start of a new stroke. The order in which the solution candidates are created is important. That is, it matters which stroke segments are removed first. A natural choice (denoted as CL) is to remove stroke segments with the smallest classifier loss L_C first.
The idea being that segments only weakly indicative of the given class Y can likely be removed without causing a mis-classification. Increasing the number of strokes due to removal of stroke segments is not desirable. Therefore, we consider a variant (CE), where only stroke segments at the beginning or end of a stroke are removed. For comparison, we also consider removal in the reverse of the order in which points were created (RO), and in the same order (SO). The motivation being that this procedure constantly removes one stroke after the other without splitting strokes. Furthermore, humans might first draw the most important high-level outline that might help in distinguishing objects and add more details over time.

For changing the direction of strokes and changing the order of strokes, we choose a solution candidate randomly. That is, for changing the direction of a stroke, we flip the direction of a randomly chosen stroke of X̂ by reversing the sequence of points of the stroke. To change the order of strokes, we perform a cut-and-paste operation. That is, we choose a random sequence consisting of subsequent strokes, remove it and insert it either after a random stroke or at the very beginning of the sequence. We also consider doing both, that is, for each cut-and-paste operation we flip the direction randomly with 50% probability. This randomized approach is essentially equal to trying all options given the number of strokes is small.

For continuous optimization, we use gradient descent with gradients obtained from the classifier. That is, we maintain a solution candidate Z that is initialized with the original input X. It is updated in each iteration using a gradient descent step, treating the classifier weights as fixed and the input X as variable. Otherwise the procedure is identical to Algorithm 1.
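The accept/reject skeleton of Algorithm 1 might look as follows in Python (a minimal sketch under our own naming; `propose` stands in for any of the operations above, `distortion` for the applicable difference measure L_M, and `eval_loss` for the user-chosen primary objective):

```python
import random  # typically used inside propose() implementations

def optimize_sample(x, y, classify, eval_loss, propose, distortion,
                    max_dist, n_iter=20000):
    """Iterative search of Algorithm 1: accept a candidate if it stays
    similar to the original, keeps (or first achieves) a correct
    classification, and improves the primary objective."""
    best = x
    for _ in range(n_iter):
        z = propose(best)                    # e.g. drop one stroke segment
        if distortion(x, z) >= max_dist:     # constraint (i): stay similar
            continue
        if classify(z) != y and classify(best) == y:
            continue                         # constraint (ii): correctness
        if eval_loss(z) < eval_loss(best):   # primary objective improved
            best = z
    return best
```

The greedy removal variants plug in a `propose` that deletes one stroke segment (ordered e.g. by classifier loss, as in CL), while the continuous variant replaces the random proposal with a gradient descent step on the input.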
The loss function L_Tot is a weighted combination of the classifier loss L_C, the point-wise difference loss L_P and the effort loss L_E:

L_Tot(Z) := β_C L_C(Z) + β_P L_P(Z) + β_E L_E(Z)

Evaluation
Datasets: We used two datasets. 1 million samples of the QuickDraw (Ha and Eck 2017) dataset, distributed equally among the first 30 classes; it consists of human sketches of an object given its name. The data was created primarily using mouse and touch devices. The second dataset consists of musical notes drawn using a pen (Calvo-Zaragoza and Oncina 2014). It consists of 32 classes but only 15200 samples. We padded or stripped sequences to a fixed length, i.e. 104 points. For evaluation, we optimized 256 samples for each class of the QuickDraw dataset that were not used for training, yielding 7680 test samples. For the Homus dataset we used 20% of the data, corresponding to 3040 samples. Note that our optimization algorithms do not require any training data; they optimize test samples directly.

Models: For classifier C, we trained three instances of two classifier architectures for each dataset, yielding 12 classifiers. For each instance, we ran our optimization algorithms. We used PyTorch 1.6.0 and trained on 2080 Ti Nvidia GPUs. Our "LSTM" architecture is made of 3 Conv1D, 2 LSTM and 2 dense layers with dropout. The "Conv1D" network consists of 3 Conv1D layers with stride 2 and a dense layer. For comparison to prior work, we also implemented an architecture based on (Yu et al. 2017). Details on architectures can be found in the supplementary material. If not stated differently, we used d = 20%, i.e. we allowed at most a distortion of 20% for deletion and moving points by at most 20%.

Procedure: We conduct: i) a user study, where humans have to redraw original and optimized samples; ii) a qualitative evaluation, illustrating created samples; and iii) a quantitative evaluation, discussing the impact of parameters and comparing various approaches. Results for all combinations of models and datasets were very similar in nature. Therefore, we focus on just one scenario, i.e. the LSTM network on the QuickDraw dataset; results for other setups are summarized in the end and details can be found in the supplementary material.
Qualitative Evaluation
We discuss optimized samples for each alteration operation separately. As seen in Figure 2, methods preferring removal of segments in the same order as creation (SO) or in reverse order (RO) tend to remove entire strokes. This can lead to unnatural sketches, e.g. angels without heads. Random removal (RA) and classifier-loss based ordering (CL) increase the number of strokes, which might be undesirable when it comes to reproducing the optimized sample. Removing only end points based on classifier loss (CE) tends to produce samples that are well-recognizable without artifacts, and it does not split strokes.

Figure 2: Original and generated samples for removal, minimizing creation effort (more samples in the supplementary material)

Figure 3 shows outcomes for continuous optimization. Changes appear more subtle than for removal, in particular when optimizing for accuracy (second row). Continuous optimization tends to shorten strokes, straighten them (best seen for the wings of the first angel) and it might also rotate them, as done for clock hands. When only optimizing for effort (last row), which should lead to the largest distortion, changes seem to be the least noticeable. Our quantitative analysis shows that this is not the case: samples get scaled entirely in a more uniform manner, making changes harder to spot (best seen for the third image from the left in the last row).

Samples for altering order and direction of strokes are in the supplementary material.
Quantitative Evaluation
We use the previously described metrics related to (i) the classifier's capability to recognize samples (Acc(uracy) and Acc_Noi(se)), (ii) the effort of creating a sample, L_E, and (iii) the distortion of the original sample, L_D and L_P. To compute Acc_Noi, we create 10 noisy samples for each input, where each ε_{i,j} is drawn uniformly from the fixed range [-r, r]. For the QuickDraw dataset, where points are within a range of [0, 255], this means that the maximal distance between two points due to the addition of ε_{i,j} to each coordinate is about 28 pixels, or 10% of the canvas used for sketching. We report the accuracy on all created samples, i.e. for the QuickDraw dataset with 7680 test samples, we report the accuracy over 76800 samples.

Table 1 shows all metrics for continuous optimization varying the loss weights β. All settings improve upon the original in terms of accuracy. This is expected given the constraint that no modification is made that changes a correctly classified sample into an incorrectly classified one. When optimizing for effort, the accuracy gains of optimized samples vary. For noisy samples, accuracy can even be lower than for the original: reduced samples contain less (redundant) information for classification than the original ones, making them somewhat more sensitive to noise. If optimizing the effort loss only (β_E = 1), effort loss is not lowest among all options. Having a classifier loss β_C > 0 not only strongly improves accuracy, but interestingly also leads to the lowest effort loss. Without a classifier loss, all parts of the sketch are altered irrespective of whether they are relevant for classification. Thus, a significant increase in loss is incurred due to small movements of points highly relevant for classification. This is largely avoided using a classifier loss.

Table 1: Results varying loss term weights β_C, β_P, β_E
Table 2: Results for removal

The results for all removal strategies (Table 2) indicate that abandoning irrelevant stroke parts yields improvements for both effort and accuracy at the same time. Using the classifier loss (CL or CE) for ordering removals yields the best results in terms of accuracy. When optimizing for effort, only these methods also achieve a much better noisy accuracy Acc_Noi than the original samples. Both also do well in terms of effort. When being allowed to split strokes (CL), accuracy is larger than when removing parts at the ends of strokes (CE), but these gains come at the expense of having more strokes. Removing samples in reverse order (RO) or in sorted order (SO) gives the lowest effort loss. Removing entire strokes from the beginning (or end) yields benefits, since we do not account for moving to the first point or from the last point to some starting point, and there is often a significant distance between the end point of one stroke and the start point of the next. This distance is also saved when removing entire strokes. In contrast, when removing stroke segments in the middle, a transition between strokes remains.

Table 3 compares the proposed optimization procedure when only a fixed percentage of average stroke segments of a category is kept, as proposed and described for GDSA in (Muhammad et al. 2019), on a smaller subset of QuickDraw. In this setup, removal takes place even if it leads to erroneous classification. Our classifier-guided methods CE and CL achieve significantly higher accuracy than prior work (DQSN (Zhou, Xiang, and Cavallaro 2018) and GDSA), which is based on training a model using reinforcement learning. We attribute this to the fact that we optimize samples individually in an iterative manner.

Figure 3: Original and generated samples for continuous optimization for various β_C, β_P and β_E

Method            Original Acc.   ∆ if keep 50%   ∆ if keep 25%
DQSN              0.92            -0.12           -0.27
GDSA              0.92            -0.06           -0.20
CE (this paper)   0.95            -0.02           -0.16
CL (this paper)   0.95

Table 3: Accuracy reduction for Sketch-a-Net architecture when reducing visible elements; bold shows best

As shown in Table 4, permuting strokes (P), reversing directions (R) and doing both (B) all yield significant gains in terms of both accuracy and effort.
Table 4: Results for permuting strokes (P), reversing direction (R) and doing both (B)
Table 5: Results for applying multiple methods sequentially; (C)ontinuous point movement, (B) reverse and permute, (D)eletion of parts

We also applied two and more methods sequentially (Table 5). For continuous optimization of points (C) we used fixed weights (β_C, β_P, β_E). Applying multiple methods gives some further improvement. That is, both the maximum accuracy and the minimum effort loss improve when applying multiple methods. We also investigated different orderings, e.g. B-C instead of C-B, as well as performing multiple applications simulating an interwoven optimization, for example B-C-B-C with reduced iterations for each method. This leads to some additional improvements.

Other networks and datasets:
We found that qualitatively all results were identical, meaning that if there was a clear improvement for optimized samples for one dataset and one network type, this also held for the others. But gains could vary per dataset, network and operation considered. For example, the highest accuracy gain relative to the original was achieved for Conv1D on QuickDraw (13.4%), compared to 8.3% for LSTM on QuickDraw.
User study
We conducted an experiment to assess if optimized samples can be reproduced by humans and if these reproductions indeed yield gains according to the specified objective. While the prior numerical investigation is highly suggestive, optimized samples might be unnatural for humans. Thus, reproductions of those samples might take longer and deviate more strongly from the proposal than non-optimized samples, making a user study necessary. We used samples generated by method "D-B" (Table 5) optimized towards accuracy for the QuickDraw dataset. The overall pool of sketches consisted of 10 original samples per class, where each sample consisted of up to 7 strokes to ensure good readability of the instructions, i.e. numbering and arrows. Since we are particularly interested in whether errors in interaction can be mitigated, we chose 5 (of the 10) original samples per class that were misclassified. Each participant had to copy an optimized version of a human input and the original version for five randomly selected sketches, yielding 10 sketches per user; see Figure 4. It was decided randomly whether a user was first presented the optimized or the original version. Users were advised to use the same number of strokes and draw them in the order and direction indicated by the numbering and arrows. We recruited 200 English-speaking participants on Amazon Mechanical Turk. We removed reproduced sketches that did not match the instructed number of strokes, that took more than 60 s to create, or for which only the original or only the corresponding optimized sample was drawn adequately, i.e. within 60 s and with the correct number of strokes. The (LSTM) classifier had an accuracy of 54% on sketches resembling the original and 68% on sketches based on the optimized sample.

Figure 4: During the user study participants are shown a sketch with numbered strokes and the stroke start indicated (left panel). They should reproduce it (right panel)
The differences are statistically significant according to a t-test. Participants took on average 23.2 s to (re)sketch an original sample. They were 1.7 s faster for optimized samples (though with weaker statistical significance). Note that we used samples optimized towards accuracy, not effort. Still, even those samples have (mostly) fewer visible strokes (lower L_D), while overall hand movements are typically similar to the original samples (similar L_E); see Table 5.

Related Work
Human-AI interaction: (Rzepka and Berger 2018) summarized the effects of various user and AI system characteristics in general, while (Martins, Santos, and Dias 2019) focused on digital AI assistants. Interaction between AI and users has also been studied (Amershi et al. 2019; Janssen et al. 2019; Bansal et al. 2019; Carroll et al. 2019; Martins, Santos, and Dias 2019; Nocentini et al. 2019; Ghosh et al. 2019; Shneiderman 2020) in various contexts such as social robots (Martins, Santos, and Dias 2019; Nocentini et al. 2019). The primary focus has been on desirable AI behavior, e.g. empathy, or strategies for how AI can adapt to user behavior (Ghosh et al. 2019; Carroll et al. 2019), with few exceptions. (Bansal et al. 2019) explicitly investigated how users can alter the behavior, i.e. override decisions of the AI, by understanding the error boundary of a classifier, while (Shneiderman 2020) provides general guidelines on human-centered AI. Closest to our work is (Schneider 2020), which introduced a human-to-AI coach based on an auto-encoder that, given a picture of a digit, outputs a digit that has lower classifier loss and potentially consists of fewer pixels. In contrast, we optimize samples individually in an iterative manner, also provide instructions on how to create samples, and we are the first to evaluate on actual users. Our work also uses more complex datasets that are commonly studied in other contexts, e.g. see (Xu 2020) for a survey on machine learning and human sketches. Abstracting sketches using removal of stroke segments and entire strokes, while preserving semantics, was studied in (Riaz Muhammad et al. 2018; Muhammad et al. 2019). That is, an agent learns to select strokes that are relevant for a classifier to maintain the correct class. From the perspective of this paper, this is similar to neglecting all constraints and focusing on "time", while narrowing down on just one option for abstracting: removal.
In this paper, we also consider a gradual movement of points. (Riaz Muhammad et al. 2018; Muhammad et al. 2019) used reinforcement learning with correct classification as a reward (rather than as an objective). The implementation using reinforcement learning also differs from our approach of improving individual samples directly. A combination of both approaches, i.e. RL with a planning and simulation capability, might lead to better outcomes. (Liu et al. 2019) used GANs to complete artificially corrupted sketches, i.e. through occlusion. They achieved high-quality results comparable to other methods such as image inpainting.
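The iterative, per-sample optimization with gradual point movement can be illustrated, in spirit, by the following random-search sketch. This is a minimal hedged example, not the paper's exact procedure: the black-box loss, step size, and shift budget are placeholders.

```python
import random

def optimize_sample(points, loss, step=0.05, budget=200, max_shift=0.3):
    """Iteratively nudge individual points to reduce a black-box classifier
    loss, while keeping each point close to its original position."""
    best = [list(p) for p in points]
    best_loss = loss(best)
    for _ in range(budget):
        cand = [list(p) for p in best]
        i = random.randrange(len(cand))       # pick one point
        axis = random.randrange(2)            # x or y coordinate
        cand[i][axis] += random.choice((-step, step))
        # constraint: the optimized sample must stay near the original
        if abs(cand[i][axis] - points[i][axis]) > max_shift:
            continue
        cand_loss = loss(cand)
        if cand_loss < best_loss:             # keep improving moves only
            best, best_loss = cand, cand_loss
    return best, best_loss

# Toy usage with a hypothetical loss: squared distance to target points.
random.seed(0)
target = [(0.1, 0.1), (0.5, 0.5)]
def toy_loss(pts):
    return sum((x - tx) ** 2 + (y - ty) ** 2
               for (x, y), (tx, ty) in zip(pts, target))
start = [(0.0, 0.0), (0.6, 0.4)]
sketch, final = optimize_sample(start, toy_loss)
```

The hard cap on per-point movement reflects the requirement that optimized samples differ only modestly from the original input.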
Explainability:
This paper has strong ties to (personalized) explanations (Schneider and Handali 2019; Guidotti et al. 2018) and explanations in the field of human-AI interaction (Hois, Theofanou-Fuelbier, and Junk 2019). Counterfactual explanations seek to identify a modification of the input to obtain another class (Dhurandhar et al. 2018; Goyal et al. 2019). (Dhurandhar et al. 2018) aims to identify minimal changes to digits on a pixel level using perturbations. Thus, in contrast to our work, they focus on mis-classified samples. Moreover, the suggested changes commonly involve adding or removing multiple pixels distributed across the digit, which seems infeasible for humans, since they are not able to reproduce digits on that level of detail. (Goyal et al. 2019) combine two images: the image to change and an image from the class the image should be changed to.
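A minimal counterfactual search in the spirit of the cited work can be sketched as follows. This is a hedged illustration, not their algorithms: the black-box `score`, the greedy single-feature steps, and the toy logistic model are all assumptions made here.

```python
import math

def find_counterfactual(x, score, step=0.05, max_iters=200):
    """Greedy counterfactual search: starting from an input scored below
    0.5, repeatedly take the single-feature step that most increases the
    black-box score, until the decision (threshold 0.5) flips."""
    cand = list(x)
    for _ in range(max_iters):
        if score(cand) >= 0.5:
            return cand                       # class flipped
        moves = [(i, d) for i in range(len(cand)) for d in (-step, step)]
        def scored(move):
            i, d = move
            return score([v + d if j == i else v for j, v in enumerate(cand)])
        i, d = max(moves, key=scored)
        cand[i] += d
    return None

# Toy "model" (an assumption): class 1 iff 2*x0 + x1 - 1 >= 0.
score = lambda p: 1 / (1 + math.exp(-(2 * p[0] + p[1] - 1)))
cf = find_counterfactual([0.1, 0.2], score)
```

Because the loop changes one feature at a time by a small step, the returned input differs from the original only where the change matters for the decision, which is the core idea behind counterfactual explanations.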
Discussion and Conclusions
Humans will interact more and more with AI. This paper provided first steps towards improving this interaction by showing how human inputs to an AI can be optimized. Our results indicate that optimized samples can be created faster and lead to fewer mis-classifications, while still bearing similarity to the original input. We chose to optimize samples individually, which allows personalization to a very high degree. We believe that an approach using meta-learning or reinforcement learning with planning could lead to even better results. Our optimized samples were also accompanied by instructions on how to create them efficiently. While our human study confirmed that reconstructing proposed samples leads to savings in time and reduces mis-classifications, more exploration of the field of human-to-AI interaction is needed: improve interaction on a semantic level as needed for interaction with chatbots, beyond making chatbots more human (Chaves and Gerosa 2019; Ciechanowski et al. 2019); consider other recognition problems such as speech (Zhang et al. 2018); perform a joint optimization of human inputs and AI models, e.g. interactive modeling (Ware et al. 2001); derive optimization algorithms that use inputs of a human to provide general rules as feedback; and assess additional concerns such as acceptance of technology by humans (Venkatesh et al. 2003). We hope that the community will pick up on these questions to foster seamless use of AI and reduce risks due to miscommunication.

References

[Amershi et al. 2019] Amershi, S.; Weld, D.; Vorvoreanu, M.; Fourney, A.; Nushi, B.; Collisson, P.; Suh, J.; Iqbal, S.; Bennett, P. N.; Inkpen, K.; et al. 2019. Guidelines for human-AI interaction. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1–13.
[Bansal et al. 2019] Bansal, G.; Nushi, B.; Kamar, E.; Lasecki, W. S.; Weld, D. S.; and Horvitz, E. 2019. Beyond accuracy: The role of mental models in human-AI team performance. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, 2–11.
[Bastani et al. 2016] Bastani, O.; Ioannou, Y.; Lampropoulos, L.; Vytiniotis, D.; Nori, A.; and Criminisi, A. 2016. Measuring neural net robustness with constraints. In Advances in Neural Information Processing Systems.
[Calvo-Zaragoza and Oncina 2014] Calvo-Zaragoza, J., and Oncina, J. 2014. Recognition of pen-based music notation: the HOMUS dataset. In .
[Carroll et al. 2019] Carroll, M.; Shah, R.; Ho, M. K.; Griffiths, T.; Seshia, S.; Abbeel, P.; and Dragan, A. 2019. On the utility of learning about humans for human-AI coordination. In Advances in Neural Information Processing Systems, 5174–5185.
[Chaves and Gerosa 2019] Chaves, A. P., and Gerosa, M. A. 2019. How should my chatbot interact? A survey on human-chatbot interaction design. arXiv preprint arXiv:1904.02743.
[Ciechanowski et al. 2019] Ciechanowski, L.; Przegalinska, A.; Magnuski, M.; and Gloor, P. 2019. In the shades of the uncanny valley: An experimental study of human–chatbot interaction. Future Generation Computer Systems.
Advances in Neural Information Processing Systems.
[Ghosh et al. 2019] Ghosh, A.; Tschiatschek, S.; Mahdavi, H.; and Singla, A. 2019. Towards deployment of robust AI agents for human-machine partnerships. arXiv preprint arXiv:1910.02330.
[Goyal et al. 2019] Goyal, Y.; Wu, Z.; Ernst, J.; Batra, D.; Parikh, D.; and Lee, S. 2019. Counterfactual visual explanations. arXiv preprint arXiv:1904.07451.
[Guidotti et al. 2018] Guidotti, R.; Monreale, A.; Turini, F.; Pedreschi, D.; and Giannotti, F. 2018. A survey of methods for explaining black box models.
[Ha and Eck 2017] Ha, D., and Eck, D. 2017. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477.
[Hois, Theofanou-Fuelbier, and Junk 2019] Hois, J.; Theofanou-Fuelbier, D.; and Junk, A. J. 2019. How to achieve explainability and transparency in human AI interaction. In International Conference on Human-Computer Interaction, 177–183. Springer.
[Janssen et al. 2019] Janssen, C. P.; Donker, S. F.; Brumby, D. P.; and Kun, A. L. 2019. History and future of human-automation interaction. International Journal of Human-Computer Studies.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5830–5839.
[Martins, Santos, and Dias 2019] Martins, G. S.; Santos, L.; and Dias, J. 2019. User-adaptive interaction in social robots: A survey focusing on non-physical interaction. International Journal of Social Robotics.
Proceedings of the IEEE International Conference on Computer Vision, 71–80.
[Nocentini et al. 2019] Nocentini, O.; Fiorini, L.; Acerbi, G.; Sorrentino, A.; Mancioppi, G.; and Cavallo, F. 2019. A survey of behavioral models for social robots. Robotics.
Proceedings of the Conference on Computer Vision and Pattern Recognition.
[Riaz Muhammad et al. 2018] Riaz Muhammad, U.; Yang, Y.; Song, Y.-Z.; Xiang, T.; and Hospedales, T. M. 2018. Learning deep sketch abstraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8014–8023.
[Rzepka and Berger 2018] Rzepka, C., and Berger, B. 2018. User interaction with AI-enabled systems: a systematic review of IS research. In International Conference on Information Systems (ICIS).
[Schneider and Handali 2019] Schneider, J., and Handali, J. 2019. Personalized explanation in machine learning. In European Conference on Information Systems (ECIS).
[Schneider 2020] Schneider, J. 2020. Human-to-AI coach: Improving human inputs to AI systems. In International Symposium on Intelligent Data Analysis, 431–443. Springer.
[Shneiderman 2020] Shneiderman, B. 2020. Human-centered artificial intelligence: Reliable, safe & trustworthy. International Journal of Human–Computer Interaction.
MIS Quarterly.
International Journal of Human-Computer Studies.
arXiv preprint arXiv:2001.02600.
[Yu et al. 2017] Yu, Q.; Yang, Y.; Liu, F.; Song, Y.-Z.; Xiang, T.; and Hospedales, T. M. 2017. Sketch-a-net: A deep neural network that beats humans. International Journal of Computer Vision.
ACM Transactions on Intelligent Systems and Technology (TIST).
arXiv preprint arXiv:1807.03089.