Classifying and Qualifying GUI Defects
Valéria Lelli
INSA Rennes, [email protected]
Arnaud Blouin
INSA Rennes, [email protected]
Benoit Baudry
Inria, [email protected]
Abstract—Graphical user interfaces (GUIs) are integral parts of software systems that require interactions from their users. Software testers have paid special attention to GUI testing in the last decade, and have devised techniques that are effective in finding several kinds of GUI errors. However, the introduction of new types of interactions in GUIs presents new kinds of errors that are not targeted by current testing techniques. We believe that to advance GUI testing, the community needs a comprehensive and high level GUI fault model, which incorporates all types of interactions. The work detailed in this paper establishes 4 contributions: 1) A GUI fault model designed to identify and classify GUI faults. 2) An empirical analysis for assessing the relevance of the proposed fault model against failures found in real GUIs. 3) An empirical assessment of two GUI testing tools (i.e. GUITAR and Jubula) against those failures. 4) GUI mutants we have developed according to our fault model. These mutants are freely available and can be reused by developers for benchmarking their GUI testing tools.
I. INTRODUCTION
The increasing presence of system interactivity requires software testing to closely consider the testing of graphical user interfaces (GUI). GUIs are composed of graphical objects called widgets, such as buttons. Users interact with these widgets (e.g. press a button) to produce an action (also called command [1], [2] or event [3]) that modifies the state of the system. For example, pressing the button "Delete" of a drawing editor produces an action that deletes the selected shapes from the drawing. Most of these standard widgets provide users with an interaction composed of a single input event (e.g. pressing a button). In this paper we call such interactions "mono-event interactions". These standard widgets work identically in many GUI platforms. In the context of GUI testing, the tools rely on the concept of standard widgets and have demonstrated their ability for finding several kinds of errors in GUIs composed of such widgets, called WIMP GUIs (WIMP stands for Windows, Icons, Menus, and Pointing device) [3], [4], [5], [6], [7].

The current trend in GUI design is the shift from designing GUIs composed of standard widgets to designing GUIs relying on more complex interactions and ad hoc widgets [2], [8], [9]. So, standard widgets are more and more replaced by ad hoc ones. By ad hoc widgets we mean non-standard widgets developed specifically for a GUI. Such widgets involve multi-event interactions (in opposition to mono-event interactions, e.g. multi-touch interactions for zooming or rotating) that aim at being more adapted and natural to users: more complex from a software engineering point of view, but closer to how people interact with objects in real life. A simple example of such widgets is the drawing area of graphical editors, with which users interact using more complex interactions such as pencil-based or multi-touch interactions. GUIs containing such widgets are called post-WIMP GUIs [10]. The essential objective is the advent of GUIs providing users with more adapted and natural interactions, and the support of new input devices such as multi-touch screens. As Beaudouin-Lafon wrote in 2004, "the only way to significantly improve user interfaces is to shift the research focus from designing interfaces to designing interaction" [8].

This new trend of GUI design confronts developers with new kinds of GUI faults that current GUI testing tools cannot detect. An essential pre-requisite to propose comprehensive testing techniques for both WIMP and post-WIMP GUIs is to define an exhaustive and high level GUI fault model. Indeed, testing consists of looking for errors in a program, which requires a clear idea about the errors we are looking for. This is the goal of fault models, which make it possible to qualify the effectiveness of testing techniques [11].

In this paper, we leverage the evolution of the current Human-Computer Interaction (HCI) state-of-the-art concepts to propose an original, complete fault model for GUIs. This model tackles dual objectives: 1) provide a conceptual framework against which GUI testers can evaluate their tool or technique; and 2) build a set of benchmark mutations to evaluate the ability of GUI testing tools to detect failures for both WIMP and post-WIMP GUIs. We assess the coverage of the proposed model through an empirical analysis: 279 GUI-related bug reports of highly interactive open-source GUIs have been successfully classified using our fault model. Also, we assess the ability of two GUI testing tools (i.e. GUITAR and Jubula) to find real GUI failures. Then, from an open-source system we created mutants implementing the faults described in our fault model. These mutants are freely available and can be used for benchmarking GUI testing tools. As an illustrative use of these mutants, we conducted an experiment to evaluate the ability of two GUI testing tools to detect them. We show that some mutants cannot be detected by current GUI testing tools and discuss future work to address the new kinds of GUI faults.

The paper is organized as follows. The next section examines in detail the seminal HCI concepts we leveraged to build our GUI fault model. Based on these concepts, the proposed GUI fault model is then detailed. Subsequently, the benefits of our proposal are highlighted through: an empirical analysis of existing GUI bug reports; the manual creation of GUI mutants on an existing system; and an evaluation of the ability of two GUI testing tools to detect such mutants. This paper ends with related work and a conclusion presenting GUI testing challenges.

II. SEMINAL HCI CONCEPTS
Identifying GUI faults requires a detailed examination of the major HCI concepts. In this section we detail these concepts to highlight and explain in Section III the resulting GUI faults.

Before introducing these seminal HCI concepts, we recall the basic elements that compose GUIs. Users act on an interactive system by performing a user interaction on a GUI. A user interaction produces as output an action that modifies the state of the system. For example, the user interaction that consists of pressing the button "Delete" of a drawing editor produces an action that deletes the selected shapes from the drawing. A user interaction is composed of a sequence of events (mouse move, etc.) produced by input devices (mouse, etc.) handled by users. One interaction may involve several input devices; it is then called a multi-modal interaction. For instance, pointing at a position on a map and speaking to perform an action is a multi-modal interaction. The correct synchronization between the different input devices is a key concern and is called multi-modal fusion. A GUI is composed of graphical components, called widgets, laid out following a specific order. The graphical elements displayed by a widget are either purely aesthetic (fonts, etc.) or presentations of data. The state of a widget can evolve in time, with effects on its graphical representation (e.g. visibility, position, value, data content).
Direct manipulation is one of the seminal HCI concepts [12], [13]. It aims at minimizing the mental effort required to use systems. To do so, direct manipulation promotes several rules to respect while developing GUIs. One of these rules stipulates that users have to feel engaged with the objects of interest they control, not with GUIs or systems themselves. An example of direct manipulation is the drawing area of drawing editors. Such a drawing area represents shapes as 2D/3D graphical objects, as most people define the concept of shapes. Users can handle these shapes by interacting directly within the drawing area to move or scale them, using advanced interactions such as bi-manual interactions. Direct manipulation is in opposition to the use of standard widgets that bring indirection between users and their objects of interest. For instance, scaling a shape using a bi-manual interaction on its graphical representation is more direct than using a text field. So, developing direct manipulation GUIs usually implies the development of ad hoc widgets, such as the drawing area. These ad hoc widgets are usually more complex than standard ones since they rely on: advanced interactions (e.g. bi-manual, speech+pointing interactions); a dedicated data representation (e.g. shapes painted in the drawing area). Testing such heterogeneous and ad hoc widgets is thus a major challenge.

This contrast between GUIs composed of standard widgets only and GUIs that contain advanced widgets is reified, respectively, under the terms WIMP and post-WIMP. Van Dam proposed that a post-WIMP GUI is one "containing at least one interaction technique not dependent on classical 2D widgets such as menus and icons" [10].

Another seminal HCI concept is feedback [13], [14], [2], [9]. Feedback is provided to users while they interact with GUIs. It allows users to continuously evaluate the outcome of their interactions with the system. Feedback is computed and provided by the system through the user interface and can take many forms. A first simple example is when users move the cursor over a button: to notify that the cursor is correctly positioned to interact with it, the button changes its shape. A more sophisticated example is the selection process of most drawing editors, which can be done using a Drag-And-Drop (DnD) interaction. While the DnD is performed on the drawing area, a temporary rectangle is painted to notify users about the current selection area.

Another HCI concept is the notion of reversible actions [12], [13], [9]. The goal of reversible actions is to reduce user anxiety about making mistakes [12]. In WIMP GUIs, reverting actions is reified under the undo/redo features, usually performed using buttons or shortcuts that revert the latest executed actions. In post-WIMP GUIs, recent work promotes the ability to cancel actions in progress [15].

All the HCI concepts introduced in this section are interactive features that must be tested. However, we demonstrate in this paper that current GUI fault models and GUI testing tools do not cover all these features. In the next section, the GUI faults stemming from WIMP and post-WIMP GUIs are detailed.

III. FAULT MODEL
In this section we present an exhaustive GUI fault model. Bochmann et al. [11] define a fault model as:

Definition 1 (Fault Model): A fault model describes a set of faults responsible for a failure, possibly at a higher level of abstraction.

To recall what a fault is:

Definition 2 (Fault): Faults are textual (or graphical) differences between an incorrect and a correct behavior description [16].

Based on these definitions, we propose the following definitions of a GUI fault, error, and failure:

Definition 3 (GUI Fault): GUI faults are differences between an incorrect and a correct behavior description of a GUI.

Definition 4 (GUI Error): A GUI error is an activation of a GUI fault that leads to an unexpected GUI state.

Definition 5 (GUI Failure): A GUI failure is a manifestation of an unexpected GUI state provoked by a GUI fault.

A GUI fault can be introduced at different levels of a GUI software (e.g. GUI code, GUI models). An illustration of a GUI fault is a correct line of GUI code vs an incorrect one. For example, a GUI fault can be activated when an unexpected entry, such as a wrong value typed into an input widget, is not handled correctly by the GUI code. An unexpected GUI state is then manifested (e.g. a crash, as a GUI failure) when a user clicks on a button after typing this entry.

To build the proposed GUI fault model we first analyzed the state-of-the-art HCI concepts (see Section II). We then analyzed real GUI bug reports (different than those used in Section IV) to assess and refine the fault model. We performed a round-trip process between the analysis of HCI concepts and GUI bug reports until we obtained a stable fault model.

The description of our fault model is divided into two groups: the user interface faults and the user interaction faults. The user interface faults refer to faults affecting the structure and the behavior of graphical components of GUIs. The user interaction faults refer to faults affecting the interaction process when a user interacts with a GUI.
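As a concrete illustration of this fault/error/failure chain, consider the following minimal Java Swing sketch (ours, not taken from any of the systems studied in this paper; the widget and method names are hypothetical):

```java
import javax.swing.*;

// Minimal sketch (ours): a GUI fault, its activation (GUI error), and the
// resulting GUI failure, following Definitions 3-5.
public class ZoomPanel extends JPanel {
    private final JTextField zoomField = new JTextField("100", 5);
    private final JButton applyButton = new JButton("Apply zoom");

    public ZoomPanel() {
        add(zoomField);
        add(applyButton);
        applyButton.addActionListener(e -> {
            // FAULT: the handler assumes the field always contains an
            // integer. Typing "abc" and clicking the button activates the
            // fault (GUI error): parseInt throws a NumberFormatException
            // and the action is never executed, an unexpected GUI state
            // that manifests as a GUI failure.
            int zoom = Integer.parseInt(zoomField.getText());
            // A correct behavior description would validate the entry
            // first, e.g. reject it and report the problem to the user.
            applyZoom(zoom);
        });
    }

    private void applyZoom(int percent) { /* apply the zoom level */ }

    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("GUI fault illustration");
            frame.setContentPane(new ZoomPanel());
            frame.pack();
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}
```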
TABLE I. USER INTERFACE FAULTS

GUI Structure and Aesthetics
  GSA1. Incorrect layout of widgets (e.g. alignment, dimension, orientation, depth).
    Possible failures: The positions of 2 widgets are inverted. A text is not fully visible since the size of the text field is too small. Rulers do not appear on the top of a drawing editor. The vertical lines for visualizing the precise position of shapes in the drawing editor are not displayed.
  GSA2. Incorrect state of widgets (e.g. visible, activated, selected, focused, modal, editable, expandable).
    Possible failures: Not possible to click on a button since it is not activated. A window is not visible so that its widgets cannot be used. Not possible to draw in the drawing area of a drawing editor since it is not activated.
  GSA3. Incorrect appearance of widgets (e.g. font, color, icon, label).
    Possible failures: The icon of a button is not visible. In a GUI of a power plant, the color reflecting the critical status of a pump is green instead of red.

Data Presentation
  DT1. Incorrect data rendering (e.g. scaling factors, rotating, converting).
    Possible failures: The size of a text is not scaled properly. In a drawing editor, a dotted line is painted as a dashed one. A rectangle is painted as an ellipse.
  DT2. Incorrect data properties (e.g. selectable, focused).
    Possible failure: A web address in a text is not displayed as a hyperlink.
  DT3. Incorrect data type or format (e.g. degree vs radian, float vs double).
    Possible failures: The date is displayed with five digits (e.g. dd/mm/y) instead of six digits (e.g. dd/mm/yy). A text field displays an angle in radian instead of in degree.

A. User Interface Faults
GUIs are composed of widgets that can act as mediators to interact indirectly (e.g. buttons in WIMP GUIs) or directly (direct manipulation principle in post-WIMP GUIs) with objects of the data model. In this section, we categorize the user interface faults, i.e. faults related to the structure, the behavior, and the appearance of GUIs. We further break down user interface faults into two categories: the GUI structure and aesthetics faults, and the data presentation faults, as introduced below. Table I presents an overview of these faults and their potential failures.
1) GUI Structure and Aesthetics Fault:
This fault category corresponds to unexpected GUI designs. Since GUIs are composed of widgets laid out following a given order, the first fault is the incorrect layout of widgets (GSA1). Possible failures corresponding to this fault occur when GUI widgets follow an unexpected layout (e.g. wrong size or position). The next fault concerns the incorrect state of widgets (GSA2). Widgets' behavior is dynamic and widgets can be in different states such as visible, enabled, or selected. This fault occurs when the current state of a widget differs from the expected one; for example, a widget is unexpectedly visible. The last fault concerns the unexpected appearance of widgets (GSA3). It covers the aesthetic aspects of widgets not bound to the data model, such as look-and-feels, fonts, icons, or misspellings.
2) Data presentation:
In many cases, widgets aim at editing and visualizing data of the data model. For example, in WIMP GUIs, text fields or lists can display simple data to be edited by users. Post-WIMP GUIs share this same principle, with the difference that the data representation is usually ad hoc and more complex. For example, the drawing area of a drawing editor paints shapes of the data model. Such a drawing area has been developed for the specific case of this editor; this permits representing graphically, in a single widget, complex data (e.g. shapes). In other cases, widgets aim at monitoring data only. This is notably the case for some GUIs in control commands of power plants, where data are not edited but monitored by users. The definition of data representations is complex and error-prone. It thus requires adequate data presentation faults.

The first fault of this category is the incorrect data rendering (DT1). DT1 is provoked when data is converted or scaled wrongly. Possible failures for this fault are manifested by an unexpected data appearance (e.g. wrong color, texture, opacity, shadow) or data layout (e.g. wrong position, geometry). The second fault concerns incorrect data properties (DT2). Properties define specific visualizations of data, such as selectable or focused. A possible failure is a web address that is not displayed as a hyperlink. The last fault (DT3) occurs when an incorrect data type or format is displayed. For instance, an angle value is displayed in radian instead of in degree, as sketched below.
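A minimal sketch of such a DT3 fault, assuming a hypothetical widget of a drawing editor (the class and method names are ours):

```java
import javax.swing.JTextField;

// Hypothetical sketch of a DT3 fault (incorrect data type or format):
// the data model stores angles in radians, but the widget is expected
// to display degrees.
class AngleField extends JTextField {
    void showAngle(double angleInRadians) {
        // FAULT (DT3): the radian value is displayed as-is.
        setText(String.format("%.2f", angleInRadians));
        // Correct behavior description:
        // setText(String.format("%.2f", Math.toDegrees(angleInRadians)));
    }
}
```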
B. User Interaction Faults
In this section, we introduce the faults that concern user interactions. The proposed faults are based on the characteristics of WIMP and post-WIMP GUIs detailed in the previous section. For each fault we separated our analysis into two parts: one dedicated to WIMP interactions and another one to post-WIMP interactions. WIMP interactions refer to interactions performed on WIMP widgets. They are simple and composed of few events (click, key pressed, etc.; a click is one interaction composed of the event mouse pressed followed by the event mouse released; its simple behavior has led to considering a click as an event itself). Post-WIMP interactions refer to interactions performed on post-WIMP widgets. Such interactions are more complex since they can: be multimodal, i.e. involve multiple input devices (gesture, gyroscope, multi-touch screen); be concurrent (e.g. in bi-manual interactions the two hands evolve in parallel); be composed of numerous events (e.g. multimodal interactions may be composed of sequences of pressure, move, and voice events). Such interactions can be modeled as finite-state machines [9], [17], [18]. Following the direct manipulation principles, other particularities of post-WIMP interactions are that they aim at: being as natural as possible; providing users with the feeling of handling data directly (e.g. shapes in drawing editors). Table II summarizes the user interaction faults and some of their potential failures for both WIMP and post-WIMP interactions. These faults are detailed as follows.

TABLE II. USER INTERACTION FAULTS

Interaction Behavior
  IB1. Incorrect behavior of a user interaction.
    Possible failures: A bi-manual interaction developed for a specific purpose does not work properly. The synchronization between the voice and the gesture does not work properly in a voice+gesture interaction.

Action
  ACT1. Incorrect action results.
    Possible failures: Translating a shape to a position (x, y) translates it to the position (−x, −y). Setting the zoom level at 150% sets it at 50%.
  ACT2. No action executed.
    Possible failures: Clicking on a button has no effect. Executing a DnD on a drawing area to draw a rectangle has no effect.
  ACT3. Incorrect action executed.
    Possible failures: Clicking on the button Save shows the dialogue box used for loading. Scaling a shape results in its rotation. Performing a DnD to translate shapes results in their selection.

Reversibility
  RVSB1. Incorrect results of undo or redo operations.
    Possible failures: Clicking on the button redo does not re-apply the latest undone action as expected. Pressing the keys ctrl+z does not revert the latest executed action as expected.
  RVSB2. Reverting the current interaction in progress works incorrectly.
    Possible failures: Pressing the key "Escape" during a DnD does not abort it. Saying the word "Stop" does not stop the interaction in progress.
  RVSB3. Reverting the current action in progress works incorrectly.
    Possible failure: Clicking on the button "Cancel" to stop the loading of the file previously selected does not work properly.

Feedback
  FDBK1. Feedback provided by widgets to reflect the current state of an action in progress works incorrectly.
    Possible failure: The progress bar that shows the loading progress of a file works incorrectly.
  FDBK2. The temporary feedback provided all along the execution of long interactions is incorrect.
    Possible failure: Given a drawing editor, drawing a rectangle using a DnD interaction does not show the created rectangle during the DnD as expected.
1) Interaction Behavior:
Developing post-WIMP interactions is complex and error-prone. Indeed, as explained in the section on GUIs' characteristics, it may involve many sequences of events or require the fusion of several modalities, such as voice and gesture. So, the first fault (IB1) occurs when the behavior of the developed interaction does not work properly. This fault mainly concerns post-WIMP widgets, since WIMP widgets embed simple and hard-coded interactions. For instance, an event such as a pressure can be missing in a bi-manual interaction. Another example is the incorrect synchronization between the voice and the gesture in a voice+gesture interaction.
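Since such interactions can be modeled as finite-state machines (Section II), an IB1 fault typically corresponds to a wrong or missing transition in that machine. A minimal sketch, assuming a simplified DnD interaction (the types below are ours, not from any toolkit):

```java
// Minimal sketch (ours): a DnD interaction modeled as a finite-state
// machine. An IB1 fault is, e.g., a wrong or missing transition.
enum DnDState { IDLE, DRAGGING, DONE }

class DnDInteraction {
    private DnDState state = DnDState.IDLE;

    void onPress() {
        if (state == DnDState.IDLE) state = DnDState.DRAGGING;
    }

    void onMove() {
        // Stays in DRAGGING; a real implementation would update feedback.
    }

    void onRelease() {
        // FAULT (IB1): forgetting this transition (or guarding it with
        // the wrong state) means the interaction never terminates properly.
        if (state == DnDState.DRAGGING) state = DnDState.DONE;
    }

    DnDState state() { return state; }
}
```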
2) Action:
This category groups the faults that concern the actions produced while interacting with the system. The first fault (ACT1) focuses on the incorrect results of actions. In this case the expected action is executed but its results are not correct. For instance, with a drawing editor, a failure can be the translation of one shape to the position (−x, −y) while the position (x, y) was expected. The root cause of this failure can be located in the action itself or in its settings: a first root cause of the previous failure can be the incorrect coding of the translation operation; a second root cause can be located in the settings of the translation action.

The second fault (ACT2) concerns the absence of action when interacting with the system. For instance, this fault can occur when an interaction, such as a keyboard shortcut, is not correctly bound to its widget, as sketched below.

The third fault (ACT3) consists of the execution of a wrong action. The root cause of this fault can be that the wrong action is bound to a widget at a given instant. For instance: clicking on the button Save shows the dialogue box used for loading; doing a DnD interaction on a drawing area selects shapes instead of translating them.
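A minimal Java Swing sketch of such an ACT2 fault, following the keyboard shortcut example above (the action key "delete" is ours):

```java
import javax.swing.*;

// Hypothetical sketch of an ACT2 fault: no action is executed because
// the keyboard shortcut is not correctly bound to its action.
class ShortcutSetup {
    static void installDeleteShortcut(JComponent component, Action deleteAction) {
        component.getInputMap().put(KeyStroke.getKeyStroke("DELETE"), "delete");
        // FAULT (ACT2): the matching ActionMap entry is missing, so
        // pressing the DELETE key has no effect.
        // Correct behavior description:
        // component.getActionMap().put("delete", deleteAction);
    }
}
```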
3) Reversibility:
This fault category groups three faults. The first fault (RVSB1) concerns the incorrect behavior of the undo/redo operations. Undo and redo operations usually rely on WIMP widgets such as buttons and key shortcuts. These operations revert or re-execute actions already terminated and stored by the system. A possible failure is the incorrect reversion of the latest executed action when the key shortcut ctrl+z is used.

Contrary to WIMP interactions, which are mainly one-shot, many interactions last some time, such as the DnD interaction. In such a case, users may be able to stop an interaction in progress. The second fault (RVSB2) thus consists of the incorrect interruption of the current interaction in progress. For instance, pressing the key "Escape" during a DnD does not stop it. This fault could have been classified as an interaction behavior fault; we decided to consider it as a reversibility fault since it concerns the ability to revert an ongoing interaction.

Once launched, actions may take time to be executed entirely. In this case such actions can be interrupted. The third fault (RVSB3) concerns the incorrect interruption of an action in progress. A possible failure concerns the file loading operation: clicking on the button "Cancel" to stop the loading of a file does not work properly.
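The following minimal sketch (ours, assuming a command-based implementation of undo/redo) shows where an RVSB1 fault can hide:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch (ours) of an RVSB1 fault in command-based undo/redo.
interface Command {
    void execute();    // applies the action
    void unexecute();  // reverts the action
}

class UndoHistory {
    private final Deque<Command> undos = new ArrayDeque<>();
    private final Deque<Command> redos = new ArrayDeque<>();

    void done(Command c) { undos.push(c); redos.clear(); }

    void undo() {
        if (undos.isEmpty()) return;
        Command c = undos.pop();
        c.unexecute();
        redos.push(c);
    }

    void redo() {
        if (redos.isEmpty()) return;
        Command c = redos.pop();
        // FAULT (RVSB1): calling c.unexecute() here instead of
        // c.execute() reverts the action again instead of re-applying
        // it, so clicking "Redo" does not restore the undone action.
        c.execute();
        undos.push(c);
    }
}
```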
4) Feedback:
Widgets are designed to provide immediate and continuous feedback to users while they interact with them. For instance, a progress bar showing the loading progress of a file is a kind of feedback provided to users. The first fault of this category (FDBK1) concerns the incorrect feedback provided by widgets to reflect the current state of an action in progress. This fault focuses on actions that last in time and whose progress should be monitored by users.

The second fault (FDBK2) focuses on potentially long interactions (i.e. interactions taking a certain amount of time to be completed) whose progress should be discernible by users. For instance, with a drawing editor, when drawing a shape on the drawing area, the shape in creation should be visible so that the user knows the progression of her work. So, a possible failure is that drawing a rectangle using a DnD interaction, which works correctly, does not show the created rectangle during the DnD as expected.
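A minimal Java Swing sketch of this last FDBK2 example (ours; a real drawing editor would be more elaborate):

```java
import java.awt.*;
import javax.swing.*;

// Minimal sketch (ours) of an FDBK2 fault: while a rectangle is drawn
// with a DnD, the shape in creation must be painted as temporary feedback.
class DrawingArea extends JPanel {
    // Set on mouse press/drag by the DnD interaction, null otherwise.
    Rectangle shapeInProgress;

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        // FAULT (FDBK2): omitting the block below still lets the final
        // shape be created on mouse release, but the user gets no
        // feedback during the DnD.
        if (shapeInProgress != null) {
            g.setColor(Color.GRAY);
            ((Graphics2D) g).draw(shapeInProgress);
        }
    }
}
```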
C. Discussion
The definition and the use of a fault model raise several questions that we discuss in this sub-section.
What are the benefits of the proposed GUI fault model?
The benefits of our GUI fault model are twofold. First, a fault model is an exhaustive classification of faults for a specific concern [11]. Providing a GUI fault model permits GUI developers and testers to have a precise idea of the different faults they must consider. As an illustration, Section IV describes an empirical analysis we conducted to classify and discuss GUI failures of open-source GUIs. Second, our GUI fault model allows developers of GUI testing tools to evaluate the efficiency of their tool in terms of bug detection power w.r.t. a GUI-specific fault model. As detailed in Section VI, we created mutants of an existing GUI. Each mutant contains one GUI fault of our fault model, whose activation provokes a GUI failure. Developers of GUI testing tools can run their tools against these mutants for benchmarking purposes.
Should usability be considered as a GUI fault?
Answering this question requires the definition of a fault to be recalled: a fault is a difference between the observed behavior description and the expected one. Usability issues consist of reporting that the current observed behavior of a specific part of a GUI falls short of being usable. That does not mean the observed behavior differs from the behavior expected by test oracles. Instead, it usually means that the expected behavior has not been defined correctly regarding some usability criteria. That is why we do not consider usability as a GUI fault. This reasoning can be extended to other concerns such as performance.
How to classify GUI failures into a fault model?
A GUI failure is a perceivable manifestation of a GUI error. Classifying a GUI failure thus requires having identified the root cause (i.e. the GUI error) of the failure. So, classifying GUI failures can be done by experts of the GUI under test. These experts need sufficient information, such as patches, logs, or stack traces, to identify whether the root cause of a failure is a GUI error, and then to classify it. For example, a failure manifested in the GUI but caused by a precondition violation is not classified into the GUI fault model. Similarly, classifying a GUI failure correctly also requires qualifying the involved widgets (e.g. standard or ad hoc) as well as the interaction (e.g. mono-event or multi-event interaction).
How to classify failures stemming from other failures?
For instance, the incorrect result of the execution of an action (action fault) leaves a widget not visible as expected (GUI structure fault). In such cases, only the first failure must be considered, since it puts the GUI in an unexpected and possibly unstable state. Besides, the appearance of a GUI error depends on the previous actions and interactions successfully executed. Typical examples are the undo and redo actions: a redo action can be executed only if an action has been previously performed, and the success of a redo action may depend on the previously executed actions. We considered this point during the creation of mutants (as detailed in Section VI) to provide failures that appear both with and without previous actions.

IV. RELEVANCE OF THE FAULT MODEL: AN EMPIRICAL ANALYSIS
In this section the proposed GUI fault model is evaluated. We conducted an empirical analysis to assess the relevance of the model w.r.t. faults currently observed in existing GUIs. The goal is to state whether our GUI fault model is relevant against failures found in real GUIs.
A. Introduction
To assess the proposed fault model, we analyzed bug reports of 5 popular open-source software systems: Sweet Home 3D, File-roller, JabRef, Inkscape, and Firefox Android. These systems implement various kinds of widgets and interactions, and encompass different platforms (desktop and mobile). Their GUIs cover the following main features: indirect and direct manipulation; several input devices (e.g. mouse, keyboard, touch); ad hoc widgets such as canvas; discrete data manipulation (e.g. vector-based graphics); and undo/redo actions.

B. Experimental Protocol
Bug reports have been analyzed manually from the researcher/tester perspective by looking only at the data available in the failure reports (i.e. black box analysis). To focus on detailed and commented bug reports that concern GUI failures, the selection has been driven by the following rules. Only closed, fixed, and in-progress bug reports were selected. The following search string has also been used to reduce the resulting sample: interface OR "user interface" OR "graphical user interface" OR "graphical interface" OR GUI OR UI OR layout OR design OR graphic OR interaction OR "user interaction" OR interact OR action OR feedback OR revert OR reversible OR undo OR redo OR abort OR stop OR cancel. Each report has then been manually analyzed to state whether it is a GUI failure. Also, selected bug reports have to provide explanations about the root cause of the failure, such as a patch or comments. This step is crucial to be able to categorize the failures using our GUI fault model considering their root cause. We also discarded failures identified as non-reproducible, duplicated, usability, or user misunderstanding. From this selection we kept 279 bug reports (in total for the five systems), each describing one GUI failure. The following sub-sections discuss these failures and the classification process.

TABLE III. DISTRIBUTION OF ANALYZED FAILURES PER SOFTWARE

Software        | Analyzed failures | User interface failures | User interaction failures | Repository
Sweet Home 3D   | 33 | 55% | 45% | http://sourceforge.net/p/sweethome3d/bugs/
File-roller     | 32 | 28% | 72% | https://bugzilla.gnome.org/query.cgi
JabRef          | 84 | 42% | 58% | http://sourceforge.net/p/jabref/bugs/
Inkscape        | 82 | 28% | 72% | https://bugs.launchpad.net/inkscape/
Firefox Android | 48 | 60% | 40% | https://bugzilla.mozilla.org/
C. Classification and Analysis
All the 279 failures have been successfully classified into our fault model. Fig. 1 gives an overview of the selected bug reports classified using our proposed fault model. These failures were classified into the Action (119 failures, 43%), GUI Structure and Aesthetics (75 failures, 27%), Data Presentation (39 failures, 14%), Reversibility (31 failures, 11%), Interaction Behavior (12 failures, 4%), and Feedback (3 failures, 1%) fault categories. Most of the failures classified into GUI Structure and Aesthetics concern the incorrect layout of widgets (51%). Likewise, most of the failures in the Action category refer to incorrect action results (75%).

Fig. 1. Classification of the 279 bug reports using the GUI fault model
Table III shows the distribution of the 279 analyzed GUI failures per software and category (user interface or user interaction). These results point out that the systems Sweet Home 3D and Firefox Android seem to be more affected by user interface failures. Most of these failures concern the GUI structure and aesthetics faults, which can be explained by the complex and ad hoc GUI structure of these systems. The File-roller and JabRef GUIs include widgets with coarse-grained properties (i.e. simple input values such as numbers or texts). Most of their failures concern WIMP interactions classified into the Action category. In contrast, Inkscape presented more failures classified as post-WIMP. Indeed, Inkscape, a vector graphics software, mainly relies on its drawing area, which provides users with different post-WIMP interactions. These failures have been categorized mainly into Interaction Behavior, Action, and Reversibility.

Fig. 2. Manifestation of failures at the user interface and user interaction levels

As depicted by Fig. 2, 41% of these 279 GUI failures originate from faults classified into the user interface category and 59% into the user interaction category. Most user interaction failures have been classified into the incorrect action results fault (54%). This plot also highlights that only 25% of the analyzed user interface failures and 18% of the user interaction ones have been classified as post-WIMP. We comment on these results in the following sub-section.
D. Discussion
The empirical results must be balanced with the fact that user interactions are less tangible than user interfaces. So, users may report more GUI failures when they can perceive failures graphically (an issue in the layout of a GUI or in the result of an action visible through a GUI). Users, however, may have difficulties detecting a failure in an interaction itself while interacting with the GUI. That may explain the low number of failures (4%) classified into Interaction Behavior. Another explanation may be the primary use of WIMP widgets, which rely on simple interactions.

In our analysis, many failures that could be related to Feedback were discarded since they concerned enhancements or usability issues, which are out of the scope of a GUI fault model, as discussed previously. For instance, GUI failures that concern the lack of haptic feedback in Firefox Android were discarded. So, few faults (1%) were classified into this category. Another explanation may be the difficulty for users to identify feedback issues as real failures that should be reported.

We observed that some reported GUI failures are false positives regarding the fault localization: if the report does not have enough information about the root cause of a failure (e.g. a patch or an exception log), a GUI failure can be classified into a wrong fault category. Consider, for example, a report stating that moving a shape using a DnD does not move it. At a first glance, the root cause of this failure can be associated with an incorrect behavior of the DnD, so this failure could be categorized into the interaction behavior category. However, the analysis of the root cause shows that it is an action failure: the DnD works properly, but no action is linked to this interaction.

Likewise, the failures related to Reversibility and Feedback were easily identified through the steps to reproduce them. For example in JabRef, "pressing the button "Undo" will clear all the text in the field, but then pressing the button "Redo" will not recover the text". Furthermore, some systems do not revert interactions step by step but entirely. This can imply a failure from a user's point of view, but sometimes it is considered as an invalid failure (e.g. requirements vs. usability issues) by developers. In JabRef, the undo/redo actions did not revert discrete operations: pressing the button "Undo" clears all the texts typed into different text fields instead of clearing only one field each time the button is pressed.

Another important point concerns the WIMP vs. post-WIMP GUI faults. We classified more failures involving WIMP than post-WIMP widgets. A possible explanation is that, despite the increasing interactivity of GUIs, the analyzed GUIs still rely more on WIMP widgets and interactions. Moreover, users now master the behavior of WIMP widgets, so that they can easily identify when these provoke failures. It may not be the case with ad hoc and post-WIMP widgets.

V. ARE GUI TESTING TOOLS ABLE TO DETECT CLASSIFIED FAILURES? AN EMPIRICAL STUDY
This section provides an empirical study of two GUI testing tools: GUITAR [19] and Jubula. To demonstrate the current limitations of GUI testing tools in testing real GUIs, we applied those tools to detect the failures previously classified with our GUI fault model.

A. GUITAR and Jubula
GUITAR is one of the most widespread academic GUI testing tools (http://sourceforge.net/apps/mediawiki/guitar/). It extracts the GUI structure by reverse engineering. This structure is transformed into a GUI Event Flow Graph (EFG), where each node represents a widget event. Based on this EFG, test cases are generated and executed automatically over the SUT. We used the plugin for Java Swing (i.e. JFC GUITAR version 1.1.1). In GUITAR, each test case is composed of a sequence of widget events. The generation of test cases can be parameterized with the size of that sequence (i.e. the test case length).

Jubula is a semi-automated GUI testing tool that leverages pre-defined libraries to create test cases. These libraries contain modules that can be reused to manually assemble test sequences. The modules encompass actions (e.g. check, select) and interactions (e.g. click, drag and drop) over different GUI toolkits (e.g. Swing, SWT, RCP, mobile). We have reused the library dedicated to Java Swing (Jubula version 7.2) to write the test cases presented in the next experiments. This library contains actions to test only standard widgets, such as dragging a column/row of a table by passing an index. To test ad hoc widgets (e.g. canvas), we made a workaround by mapping actions directly to these widgets. For example, to draw a shape on a canvas we need to specify the exact position (e.g. drag and drop coordinates) where the interaction should be executed.
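To give an intuition of how GUITAR derives test cases from an EFG, here is a minimal sketch of the idea (ours; it is not GUITAR's actual code or API):

```java
import java.util.*;

// Minimal sketch (ours) of GUITAR's event-flow graph (EFG) idea: nodes are
// widget events; an edge (e1, e2) means that e2 can be executed right after
// e1. Test cases are paths of a given length through this graph.
class EventFlowGraph {
    private final Map<String, Set<String>> follows = new HashMap<>();

    void addEdge(String e1, String e2) {
        follows.computeIfAbsent(e1, k -> new HashSet<>()).add(e2);
    }

    // Enumerates all event sequences that extend 'start' by 'length' events.
    List<List<String>> testCases(String start, int length) {
        List<List<String>> result = new ArrayList<>();
        walk(start, new ArrayList<>(List.of(start)), length, result);
        return result;
    }

    private void walk(String event, List<String> path, int remaining,
                      List<List<String>> out) {
        if (remaining == 0) {
            out.add(new ArrayList<>(path));
            return;
        }
        for (String next : follows.getOrDefault(event, Set.of())) {
            path.add(next);
            walk(next, path, remaining - 1, out);
            path.remove(path.size() - 1);
        }
    }
}
```

For instance, after addEdge("click File", "click Save"), the call testCases("click File", 2) enumerates all sequences of three events starting with "click File"; the test case length parameter mentioned above corresponds to this bound.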
B. Experiment

We selected JabRef (http://jabref.sourceforge.net/), a software to manage bibliographic references. JabRef is written in Java, which allows us to apply both GUITAR and Jubula. For each fault described in our GUI fault model, we selected one reported failure. To reproduce each failure, we downloaded the corresponding faulty version of JabRef. We used the exact test sequence (i.e. number of actions) required to reproduce a failure. In GUITAR, all test cases were generated automatically over a faulty version. In Jubula, each test case was created manually to detect one failure. All test cases were written by one of the authors of this paper, who has expertise in JabRef. Their test sequences were extracted by analyzing the failure reports (e.g. the steps to reproduce a failure) and reusing Jubula's libraries. Then, GUITAR and Jubula ran all their test cases automatically to check whether the selected failures were found.

C. Results and Discussion
Table V summarizes the detection of the JabRef GUI failures by GUITAR and Jubula. These failures cover 11 out of the 15 faults described in our fault model. The remaining four faults were not covered for two reasons: 1) no failure was classified for that fault; or 2) a failure was classified, but we could not reproduce it since it only occurred in a specific environment (e.g. operating system) or given a certain input (e.g. a particular database in JabRef).

TABLE V. JABREF FAILURES DETECTED BY GUITAR AND JUBULA

Fault ID | GUITAR | Jubula
GSA1     |   ✗    |   ✗
GSA2     |   ✗    |   ✓
GSA3     |   ✗    |   ✓
DT1      |   ✗    |   ✓
DT2      |   ✗    |   ✓
IB1      |   ✓    |   ✓
ACT2     |   ✓    |   ✓
ACT3     |   ✗    |   ✗
RVSB1    |   ✗    |   ✓
RVSB2    |   ✓    |   ✓
FDBK1    |   ✗    |   ✓

The reported failures in JabRef are mostly related to WIMP widgets, so we would expect GUITAR and Jubula to detect them, but this was not the case. For instance, one failure concerned properties (e.g. text, event handlers) of buttons. In GUITAR, checking the properties of the concerned widget did not reveal this failure since the expected and actual values of its size property (e.g. width) remained the same. In Jubula, the concerned widget cannot be mapped for test case execution and thus cannot be tested.

Other failures were not detected since no error is raised (e.g. no exception) and the GUI properties are the "expected" ones. For example, a text property of a status bar contains the value "Redo: change field" when this action was actually not redone. Similarly, a failure classified into FDBK2 could not be provoked: the test case was successfully replayed by Jubula, and the input text typed via the keyboard was saved automatically without any interference from the auto-completion feature.

Another point is the accuracy of the test cases created manually in Jubula. Detecting one of the failures requires typing a special character (e.g. \%) and then checking that the output shown in a preview window does not contain any command. In Jubula, one can write a test case that checks whether the text contains such a pattern (e.g. SelectPattern[%,equals] in ComponentText[preview]), or a test case that checks whether an entire text matches the expected one (e.g. CheckText[100%, equals] in ComponentText[preview]). However, the latter test case will fail since the text of the preview window in JabRef is internally represented as HTML and, in Jubula, the action's parameters cannot be specified in that format.

Our experiment does not aim at comparing both tools, since GUITAR is a fully automated tool contrary to Jubula. However, the results of this study highlight the current limitations of GUI testing tools: GUITAR and Jubula currently mainly work for detecting failures that affect the properties of standard widgets. Moreover, GUITAR does GUI regression testing: it considers a given GUI as the reference one from which tests will be produced. If this GUI is faulty, GUITAR will produce tests that consider these failures as the correct behavior. A possible solution to overcome this issue is to base the test process on the specifications (requirements, etc.) of the GUI.

VI. FORGING FAULTY GUIS FOR BENCHMARKING
In this section, we evaluate the usefulness of our fault model by applying it to a highly interactive open-source software system. We created mutants of this system corresponding to the different faults of the model. The main goal of these mutants is to provide GUI testers with benchmark tools to evaluate the ability of GUI testing tools to detect GUI failures. As an illustration of the practical use of these mutants, we executed two GUI testing tools against the mutants of the system; thereby we caught a glimpse of their ability to cover our proposed fault model. The goal of this experiment is to answer the research question: what are the benefits of this fault model for GUI testing?

A. Mutants Generation
As highlighted by Zhu et al., "software testing is often aimed at detecting faults in software. A way to measure how well this objective has been achieved is to plant some artificial faults into the program and check if they are detected by the test. A program with a planted fault is called a mutant of the original program" [20]. Following this principle, we planted 65 faults in a highly interactive open-source software system, namely Latexdraw (http://sourceforge.net/projects/latexdraw/), using our proposed fault model. Latexdraw has been selected because of the following points: 1) it is a highly interactive system written in Java and Scala (dedicated to the creation of drawings for LaTeX); 2) its GUI mixes both standard and ad hoc widgets; 3) it is released under an open-source license (GPL2) so that it can be freely used by the testing community.

We created 65 mutants corresponding to the different faults of our proposed fault model. All these mutants and the original version are freely available (https://github.com/arnobl/latexdraw-mutants). Each mutant is documented to detail its planted fault and the oracle permitting to find it. Multiple mutants have been created from each fault by: using WIMP (22 mutants) or post-WIMP (43 mutants) widgets to kill the mutants; varying the test case length (i.e. the number of actions required to provoke the failure). Each action (e.g. select a shape) requires a minimal number of events (e.g. in Latexdraw a DnD requires at least three events: press/move/release) to be executed. Table IV summarizes the number of forged mutants and the minimal and maximal test case length for each fault. For instance, a minimal length of 0 means that the failure can be provoked without executing any action (e.g. IB1, DT1). An example of a planted fault is sketched after the table.

TABLE IV. MUTANTS PLANTED ACCORDING TO FAULTS IN THE GUI FAULT MODEL
ID: GSA1, GSA2, GSA3, DT1, DT2, DT3, IB1, ACT1, ACT2, ACT3, RVSB1, RVSB2, RVSB3, FDBK1, FDBK2
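As an illustration of how such a fault is planted, the following sketch (ours, not actual Latexdraw code; the names are hypothetical) shows an ACT1 mutant where the translation action is executed but produces a wrong result:

```java
// Hypothetical sketch of an ACT1 mutant: the action is executed but its
// result is incorrect (cf. Table II).
interface Shape {
    void translate(double tx, double ty);
}

class TranslateAction {
    void apply(Shape shape, double tx, double ty) {
        // Original (correct behavior description):
        // shape.translate(tx, ty);
        // Mutant (ACT1): the offsets are negated, so translating to
        // (x, y) actually translates to (-x, -y).
        shape.translate(-tx, -ty);
    }
}
```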
B. How GUI testing tools kill our GUI mutants: a first experiment

We applied the GUI testing tools GUITAR and Jubula on the mutants to evaluate their ability to kill them. Our goal is not to provide benchmarks against these tools but rather to highlight the current challenges in testing interactive systems that are not considered yet (e.g. post-WIMP interactions). GUITAR test cases have been generated automatically while Jubula ones have been written manually.

Considering the mutants planted at the user interface level, the Jubula and GUITAR tests killed the mutants that involve checking standard widget properties, such as layout (e.g. width, height) and state (e.g. enabled, selection, focusable). Also, it is possible to test simple data (e.g. string values in text fields) on those widgets. However, most of the mutants that concern the ad hoc widgets remained alive, notably when test cases involve testing complex data from the data model. For example, it is not possible to compare the actual shape on the canvas against the expected one. Even if some shape properties (e.g. the rotation angle) are presented in standard widgets (e.g. a spinner), GUITAR and Jubula cannot state whether the current values in these widgets match the expected shape rotation on the canvas.

Likewise, our GUITAR and Jubula tests cannot kill most of the user interaction mutants that result in a wrong presentation of shapes, in particular the mutants planted into the Reversibility or Feedback categories. For example, testing the undo/redo operations in Latexdraw should compare all the states of the manipulation of a shape on the canvas. Moreover, the test verdicts in Jubula passed even though interactions are defined incorrectly (e.g. the mouse cursor does not follow a DnD) or actions cannot be executed (e.g. a button is disabled). In GUITAR, the generated test cases do not properly cover actions having dependencies. For example, the action "Delete" in Latexdraw requires first selecting a shape on the canvas. However, no test sequence that contains "Select Shape" before "Delete Shape" was generated. Thus, some mutants could not be killed.

Table VI gives an overview of the number of mutants killed by GUITAR and Jubula. The results show that both tools are not able to kill all the mutants, because of the four following reasons. 1) Testing Latexdraw with GUITAR and Jubula is limited to the test of the standard Swing widgets: in Jubula, the test cases can only be written according to the libraries available for the Swing toolkit; in GUITAR, the basic package for Java Swing GUIs only covers standard widgets and mono-events (e.g. a click on a button). 2) Configuring or customizing a GUI testing tool to test post-WIMP widgets is not a trivial task: each sequence of a test case in Jubula needs to be mapped to the corresponding GUI widget manually, and GUITAR needs to be extended to generate test cases for ad hoc widgets (e.g. canvas) as well as their interactions (e.g. multi-modal interactions). 3) Testing post-WIMP widgets requires a long test case sequence: in Latexdraw, a sequence to test interactions over these widgets is composed of at least two actions, and that sequence is longer when we have to detect failures in undo/redo operations. 4) It is not possible to give a test verdict for complex data: the oracles provided by the two GUI testing tools do not know the internal behavior of ad hoc widgets, their interaction features, and their data presentation. These results answer the research question by highlighting the benefits of our fault model for measuring the ability of GUI testing tools in finding GUI failures.
C. Threats to Validity
Regarding the conducted empirical studies, we identified the two following threats to validity. The first one concerns the scope of the proposed fault model, since we evaluated it empirically on a small number (five) of interactive systems. To limit this threat, we selected interactive systems that cover different aspects of the HCI concepts we detailed in Section II. The second threat concerns the subjectivity observed in bug reports to describe failures. To deal with it, we based the classification on the bug report artifacts (patches, logs, etc.) to identify the root cause of the reported failures.
TABLE VI. MUTANTS KILLED BY GUITAR AND JUBULA

        |      GUITAR      |      JUBULA
ID      | WIMP | post-WIMP | WIMP | post-WIMP
GSA1    |  2   |    0      |  2   |    0
GSA2    |  5   |    0      |  6   |    1
GSA3    |  3   |    0      |  3   |    0
DT1     |  -   |    0      |  -   |    0
DT2     |  -   |    0      |  -   |    0
DT3     |  -   |    0      |  -   |    1
IB1     |  -   |    0      |  -   |    0
ACT1    |  0   |    0      |  0   |    1
ACT2    |  3   |    0      |  3   |    0
ACT3    |  2   |    0      |  2   |    0
RVSB1   |  2   |    0      |  2   |    0
RVSB2   |  -   |    0      |  -   |    0
RVSB3   |  -   |    -      |  -   |    -
FDBK1   |  1   |    0      |  1   |    0
FDBK2   |  -   |    0      |  -   |    0
VII. RELATED WORK
Existing fault classifications are presented at a higher level of abstraction, mainly considering the components that are affected by faults. Most classifications leverage the software assets (e.g. specification, models, architecture, code) to define their faults. These faults have been described in fault models [11], [16] or defect taxonomies [21].

In an effort to cover GUIs, the Orthogonal Defect Classification (ODC) [21] has been extended by IBM Research to include GUI faults. These faults focus on the appearance of widgets, the navigation between widgets, and the unexpected behavior of widget events and input devices. In our fault model, we do not cover faults that concern the behavior of input devices (i.e. hardware faults). Although this taxonomy considers GUI faults, it does not separate the user interface and user interaction faults. Moreover, this extension does not consider faults caused by post-WIMP widgets and their advanced interactions, nor faults of the data presentation category.

Li et al. categorize faults of industrial and open source projects using the ODC taxonomy [22]. The category Interface concerns several GUI defects. However, this single category covers several user interface defects related to specific widgets such as window, title bar, menu, or tool bar. Similarly, the interaction defects are limited to mouse and keyboard. Thus, it is not possible to identify the kind of faults classified into these categories since they are not detailed. For example, a fault classified into the mouse category can concern an interaction, an action, or an input device.

Brooks et al. [23] present a study that characterizes GUIs based on reported faults of three industrial systems. To classify all these faults (GUI and non-GUI faults), the authors adapted a defect taxonomy by including other categories such as GUI defects. This category encompasses both the user interface and user interaction faults. Also, Børretzen et al. [24] analyze faults reported by four projects by combining two defect taxonomies. Both works introduce a category that concerns the GUI faults, but these faults are not described and thus no classification is presented. Strecker et al. [25] characterize faults that affect GUI test suites. However, these faults do not concern GUI faults but any fault at the code level (e.g. class or method faults) that may affect the GUI.

In contrast, several research papers concern the fault effects by classifying GUI failures instead of GUI faults. In general, these works focus on specific GUIs (automotive GUIs [26]) or domains (mobile [27], safety-critical [28]). For example, Maji et al. characterize failures for mobile operating systems [27]. These failures are classified according to the fault localization: a failure manifested in a camera is categorized in the Camera segment, and failures for other segments such as Web, Multimedia, or GUI are categorized similarly. Also, Zaeem et al. [29] have conducted a bug study for Android applications to automate oracles. They identified 20 categories, including some GUI issues such as Rotation (device's rotation), Gestures (zooming in and out), and Widget. Although these papers have investigated failures in a context that brings many advances in terms of interactive features, no classification or discussion about these kinds of failures is presented.

Mauser et al. propose a GUI failure classification for automotive systems [26]. This classification is based on three categories: design, content, and behavior. In the Design category, the failures refer to GUI layouts (e.g. color, font, position). In the Content category, the failures are associated with displayed data such as texts, animations, and symbols/icons. The failures in the Behavior category are caused by a wrong behavior of windows (e.g. wrong pop-up) or widgets (e.g. wrong focus). The authors focus on characterizing GUI failures based only on a small set of specific widgets designed for these kinds of GUIs. Furthermore, they do not consider user interaction failures.

VIII. CONCLUSION AND RESEARCH AGENDA
This paper proposes a GUI fault model for providing GUI testers with benchmark tools to evaluate the ability of GUI testing tools to detect GUI failures. This fault model has been empirically assessed by analyzing and classifying 279 GUI bug reports of different open-source GUIs. To demonstrate the benefits of the proposed fault model, mutants have then been developed from it on a Java open-source GUI. As an illustrative use case of these mutants, we executed two GUI testing tools on them to evaluate their ability to detect the planted faults. This experiment shows that, if current GUI testing tools have demonstrated their ability for finding several kinds of GUI errors, they also fail at detecting several of the GUI faults we identified. The underlying reasons are twofold. First, GUI failures may be related to the graphical rendering of GUIs. Testing GUI rendering is a complex task since current testing techniques mainly rely on code analysis, which can hardly capture graphical properties. Second, the current trend in GUI design is the shift from designing GUIs composed of standard widgets to designing GUIs relying on more complex interactions and ad hoc widgets [2], [8], [9]. New GUI testing techniques thus have to be proposed for fully testing, as automatically as possible, GUI rendering and complex interactions using ad hoc widgets.

ACKNOWLEDGEMENTS

This work is partially supported by the French BGLE Project CONNEXION.

REFERENCES
[1] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
[2] M. Beaudouin-Lafon, "Instrumental interaction: an interaction model for designing post-WIMP user interfaces," in Proc. of CHI'00. ACM, 2000, pp. 446-453.
[3] A. M. Memon, "An event-flow model of GUI-based applications for testing," STVR, vol. 17, no. 3, pp. 137-157, 2007.
[4] M. Cohen, S. Huang, and A. Memon, "Autoinspec: Using missing test coverage to improve specifications in GUIs," in Proc. of ISSRE'12, 2012, pp. 251-260.
[5] S. Arlt, A. Podelski, C. Bertolini, M. Schaf, I. Banerjee, and A. Memon, "Lightweight static analysis for GUI testing," in Proc. of ISSRE'12, 2012.
[6] L. Mariani, M. Pezzè, O. Riganelli, and M. Santoro, "Autoblacktest: Automatic black-box testing of interactive applications," in Proc. of ICST'12. IEEE, 2012, pp. 81-90.
[7] D. H. Nguyen, P. Strooper, and J. G. Süß, "Automated functionality testing through GUIs," in Proc. of ACSC'10, 2010, pp. 153-162.
[8] M. Beaudouin-Lafon, "Designing interaction, not interfaces," in Proc. of AVI'04, 2004.
[9] A. Blouin and O. Beaudoux, "Improving modularity and usability of interactive systems with Malai," in Proc. of EICS'10, 2010, pp. 115-124.
[10] A. van Dam, "Post-WIMP user interfaces," Commun. ACM, vol. 40, no. 2, pp. 63-67, Feb. 1997.
[11] G. von Bochmann, A. Das, R. Dssouli, M. Dubuc, A. Ghedamsi, and G. Luo, "Fault models in testing," in Protocol Test Systems, 1991, pp. 17-30.
[12] B. Shneiderman, "Direct manipulation: a step beyond programming languages," IEEE Computer, vol. 16, no. 8, pp. 57-69, 1983.
[13] E. L. Hutchins, J. D. Hollan, and D. A. Norman, "Direct manipulation interfaces," Hum.-Comput. Interact., vol. 1, no. 4, pp. 311-338, 1985.
[14] D. A. Norman, The Design of Everyday Things, reprint paperback ed. Basic Books, 2002.
[15] C. Appert, O. Chapuis, and E. Pietriga, "Dwell-and-spring: undo for direct manipulation," in Proc. of CHI'12. ACM, 2012, pp. 1957-1966.
[16] A. Pretschner, D. Holling, R. Eschbach, and M. Gemmar, "A generic fault model for quality assurance," in Proc. of MODELS'13, 2013.
[17] A. Blouin, B. Morin, G. Nain, O. Beaudoux, P. Albers, and J.-M. Jézéquel, "Combining aspect-oriented modeling with property-based reasoning to improve user interface adaptation," in Proc. of EICS'11, 2011, pp. 85-94.
[18] C. Appert and M. Beaudouin-Lafon, "SwingStates: Adding state machines to Java and the Swing toolkit," Software: Practice and Experience, vol. 38, no. 11, pp. 1149-1182, 2008.
[19] B. Nguyen, B. Robbins, I. Banerjee, and A. Memon, "GUITAR: an innovative tool for automated testing of GUI-driven software," Automated Software Engineering, pp. 1-41, 2013.
[20] H. Zhu, P. A. V. Hall, and J. H. R. May, "Software unit test coverage and adequacy," ACM Comput. Surv., vol. 29, no. 4, pp. 366-427, 1997.
[21] R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus, B. K. Ray, and M.-Y. Wong, "Orthogonal defect classification: a concept for in-process measurements," IEEE Trans. Softw. Eng., vol. 18, no. 11, pp. 943-956, 1992.
[22] N. Li, Z. Li, and X. Sun, "Classification of software defect detected by black-box testing: An empirical study," in Proc. of WCSE'10.
[23] P. Brooks, B. Robinson, and A. Memon, "An initial characterization of industrial graphical user interface systems," in Proc. of ICST'09.
[24] J. A. Børretzen and R. Conradi, "Results and experiences from an empirical study of fault reports in industrial projects," in Proc. of PROFES'06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 389-394.
[25] J. Strecker and A. Memon, "Relationships between test suites, faults, and fault detection in GUI testing," in Proc. of ICST'08, 2008, pp. 12-21.
[26] D. Mauser, A. Klaus, R. Zhang, and L. Duan, "GUI failure analysis and classification for the development of in-vehicle infotainment," in Proc. of VALID'12, 2012, pp. 79-84.
[27] A. Kumar Maji, K. Hao, S. Sultana, and S. Bagchi, "Characterizing failures in mobile OSes: A case study with Android and Symbian," in Proc. of ISSRE'10, 2010, pp. 249-258.
[28] R. Lutz and I. C. Mikulski, "Empirical analysis of safety-critical anomalies during operations," IEEE Trans. Softw. Eng., pp. 172-180, 2004.
[29] R. N. Zaeem, M. R. Prasad, and S. Khurshid, "Automated generation of oracles for testing user-interaction features of mobile apps," in Proc. of ICST'14, 2014.