Classifying and Qualifying GUI Defects
Valéria Lelli
INSA Rennes, [email protected]
Arnaud Blouin
INSA Rennes, [email protected]
Benoit Baudry
Inria, [email protected]
Abstract—Graphical user interfaces (GUIs) are integral parts of software systems that require interactions from their users. Software testers have paid special attention to GUI testing in the last decade, and have devised techniques that are effective in finding several kinds of GUI errors. However, the introduction of new types of interactions in GUIs presents new kinds of errors that are not targeted by current testing techniques. We believe that to advance GUI testing, the community needs a comprehensive and high level GUI fault model, which incorporates all types of interactions. The work detailed in this paper establishes 4 contributions: 1) A GUI fault model designed to identify and classify GUI faults. 2) An empirical analysis for assessing the relevance of the proposed fault model against failures found in real GUIs. 3) An empirical assessment of two GUI testing tools (i.e. GUITAR and Jubula) against those failures. 4) GUI mutants we have developed according to our fault model. These mutants are freely available and can be reused by developers for benchmarking their GUI testing tools.
I. INTRODUCTION
The increasing presence of system interactivity requires software testing to closely consider the testing of graphical user interfaces (GUI). GUIs are composed of graphical objects called widgets, such as buttons. Users interact with these widgets (e.g. press a button) to produce an action (also called command [1], [2] or event [3]) that modifies the state of the system. For example, pressing the button "Delete" of a drawing editor produces an action that deletes the selected shapes from the drawing. Most of these standard widgets provide users with an interaction composed of a single input event (e.g. pressing a button). In this paper we call such interactions "mono-event interactions". These standard widgets work identically in many GUI platforms. In the context of GUI testing, the tools rely on the concept of standard widgets and have demonstrated their ability for finding several kinds of errors in GUIs composed of such widgets, called WIMP GUIs (WIMP stands for Windows, Icons, Menus, and Pointing device) [3], [4], [5], [6], [7].

The current trend in GUI design is the shift from designing GUIs composed of standard widgets to designing GUIs relying on more complex interactions and ad hoc widgets [2], [8], [9]. So, standard widgets are more and more replaced by ad hoc ones. By ad hoc widgets we mean non-standard widgets developed specifically for a GUI. Such widgets involve multi-event interactions (in opposition to mono-event interactions, e.g. multi-touch interactions for zooming or rotating) that aim at being more adapted and natural to users: more complex from a software engineering point of view, but closer to how people interact with objects in real life. A simple example of such widgets is the drawing area of graphical editors, with which users interact using more complex interactions such as pencil-based or multi-touch interactions. GUIs containing such widgets are called post-WIMP GUIs [10]. The essential objective is the advent of GUIs providing users with more adapted and natural interactions, and the support of new input devices such as multi-touch screens. As Beaudouin-Lafon wrote in 2004, "the only way to significantly improve user interfaces is to shift the research focus from designing interfaces to designing interaction" [8].

This new trend of GUI design confronts developers with new kinds of GUI faults that current GUI testing tools cannot detect. An essential pre-requisite to propose comprehensive testing techniques for both WIMP and post-WIMP GUIs is to define an exhaustive and high level GUI fault model. Indeed, testing consists of looking for errors in a program, which requires a clear idea about the errors we are looking for. This is the goal of fault models, which make it possible to qualify the effectiveness of testing techniques [11].

In this paper, we leverage the evolution of the current Human-Computer Interaction (HCI) state-of-the-art concepts to propose an original, complete fault model for GUIs. This model tackles dual objectives: 1) provide a conceptual framework against which GUI testers can evaluate their tool or technique; and 2) build a set of benchmark mutations to evaluate the ability of GUI testing tools to detect failures for both WIMP and post-WIMP GUIs. We assess the coverage of the proposed model through an empirical analysis: 279 GUI-related bug reports of highly interactive open-source GUIs have been successfully classified using our fault model. Also, we assess the ability of two GUI testing tools (i.e. GUITAR and Jubula) to find real GUI failures. Then, from an open-source system we created mutants implementing the faults described in our fault model. These mutants are freely available and can be used for benchmarking GUI testing tools. As an illustrative use of these mutants, we conducted an experiment to evaluate the ability of two GUI testing tools to detect them. We show that some mutants cannot be detected by current GUI testing tools and discuss future work to address the new kinds of GUI faults.

The paper is organized as follows. The next section examines in detail the seminal HCI concepts we leveraged to build our GUI fault model. Based on these concepts, the proposed GUI fault model is then detailed. Subsequently, the benefits of our proposal are highlighted through: an empirical analysis of existing GUI bug reports; the manual creation of GUI mutants on an existing system; and an evaluation of the ability of two GUI testing tools to detect such mutants. This paper ends with related work and a conclusion presenting GUI testing challenges.

II. SEMINAL HCI CONCEPTS
Identifying GUI faults requires a detailed examination of the major HCI concepts. In this section we detail these concepts to highlight and explain in Section III the resulting GUI faults.

Before introducing these seminal HCI concepts, we recall the basic elements that compose GUIs. Users act on an interactive system by performing a user interaction on a GUI. A user interaction produces as output an action that modifies the state of the system. For example, the user interaction that consists of pressing the button "Delete" of a drawing editor produces an action that deletes the selected shapes from the drawing. A user interaction is composed of a sequence of events (mouse move, etc.) produced by input devices (mouse, etc.) handled by users. One interaction may involve several input devices; it is then called a multi-modal interaction. For instance, pointing at a position on a map and speaking to perform an action is a multi-modal interaction. The correct synchronization between the different input devices is a key concern and is called multi-modal fusion. A GUI is composed of graphical components, called widgets, laid out following a specific order. The graphical elements displayed by a widget are either purely aesthetic (fonts, etc.) or presentations of data. The state of a widget can evolve in time, with effects on its graphical representation (e.g. visibility, position, value, data content).
Direct manipulation is one of the seminal HCI concepts [12], [13]. It aims at minimizing the mental effort required to use systems. To do so, direct manipulation promotes several rules to respect while developing GUIs. One of these rules stipulates that users have to feel engaged with the objects of interest they control, not with GUIs or systems themselves. An example of direct manipulation is the drawing area of drawing editors. Such a drawing area represents shapes as 2D/3D graphical objects, as most people define the concept of shapes. Users can handle these shapes by interacting directly within the drawing area to move or scale them, using advanced interactions such as bi-manual interactions. Direct manipulation is in opposition to the use of standard widgets that bring indirection between users and their objects of interest. For instance, scaling a shape using a bi-manual interaction on its graphical representation is more direct than using a text field. So, developing direct manipulation GUIs usually implies the development of ad hoc widgets, such as the drawing area. These ad hoc widgets are usually more complex than standard ones since they rely on: advanced interactions (e.g. bi-manual, speech+pointing interactions); a dedicated data representation (e.g. shapes painted in the drawing area). Testing such heterogeneous and ad hoc widgets is thus a major challenge.

This contrast between GUIs composed of standard widgets only and GUIs that contain advanced widgets is reified, respectively, under the terms WIMP and post-WIMP. Van Dam proposed that a post-WIMP GUI is one "containing at least one interaction technique not dependent on classical 2D widgets such as menus and icons" [10].

Another seminal HCI concept is feedback [13], [14], [2], [9]. Feedback is provided to users while they interact with GUIs. It allows users to continuously evaluate the outcome of their interactions with the system. Feedback is computed and provided by the system through the user interface and can take many forms. A first simple example is when users move the cursor over a button: to notify that the cursor is correctly positioned to interact with it, the button changes its shape. A more sophisticated example is the selection process of most drawing editors, which can be done using a Drag-And-Drop (DnD) interaction. While the DnD is performed on the drawing area, a temporary rectangle is painted to notify users about the current selection area.

Another HCI concept is the notion of reversible actions [12], [13], [9]. The goal of reversible actions is to reduce user anxiety about making mistakes [12]. In WIMP GUIs, reverting actions is reified under the undo/redo features, usually performed using buttons or shortcuts that revert the latest executed actions. In post-WIMP GUIs, recent work promotes the ability to cancel actions in progress [15].

All the HCI concepts introduced in this section are interactive features that must be tested. However, we demonstrate in this paper that current GUI fault models and GUI testing tools do not cover all these features. In the next section, the GUI faults stemming from WIMP and post-WIMP GUIs are detailed.

III. FAULT MODEL
In this section we present an exhaustive GUI fault model. Bochmann et al. [11] define a fault model as:

Definition 1 (Fault Model): A fault model describes a set of faults responsible for a failure, possibly at a higher level of abstraction.

To recall what a fault is:

Definition 2 (Fault): Faults are textual (or graphical) differences between an incorrect and a correct behavior description [16].

Based on these definitions, we propose the following definitions of a GUI fault, error, and failure:

Definition 3 (GUI Fault): GUI faults are differences between an incorrect and a correct behavior description of a GUI.

Definition 4 (GUI Error): A GUI error is an activation of a GUI fault that leads to an unexpected GUI state.

Definition 5 (GUI Failure): A GUI failure is a manifestation of an unexpected GUI state provoked by a GUI fault.

A GUI fault can be introduced at different levels of a GUI software (e.g. GUI code, GUI models). An illustration of a GUI fault is a correct line of GUI code vs an incorrect one. For example, a GUI fault can be activated when an unexpected entry, such as a wrong value typed into an input widget, is not handled correctly by the GUI code. An unexpected GUI state is then manifested (e.g. a crash, as a GUI failure) when a user clicks on a button after typing this entry.

To build the proposed GUI fault model we first analyzed the state-of-the-art HCI concepts (see Section II). We then analyzed real GUI bug reports (different than those used in Section IV) to assess and refine the fault model. We performed a round-trip process between the analysis of HCI concepts and GUI bug reports until we obtained a stable fault model.

The description of our fault model is divided into two groups: the user interface faults and the user interaction faults. The user interface faults refer to faults affecting the structure and the behavior of graphical components of GUIs. The user interaction faults refer to faults affecting the interaction process when a user interacts with a GUI.
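As a concrete illustration of this fault/error/failure chain, consider the following minimal Java Swing sketch (ours, not taken from any of the systems studied in this paper; the widget and method names are hypothetical):

```java
import javax.swing.*;

// Minimal sketch (ours): a GUI fault, its activation (GUI error), and the
// resulting GUI failure, following Definitions 3-5.
public class ZoomPanel extends JPanel {
    private final JTextField zoomField = new JTextField("100", 5);
    private final JButton applyButton = new JButton("Apply zoom");

    public ZoomPanel() {
        add(zoomField);
        add(applyButton);
        applyButton.addActionListener(e -> {
            // FAULT: the handler assumes the field always contains an
            // integer. Typing "abc" and clicking the button activates the
            // fault (GUI error): parseInt throws a NumberFormatException
            // and the action is never executed, an unexpected GUI state
            // that manifests as a GUI failure.
            int zoom = Integer.parseInt(zoomField.getText());
            // A correct behavior description would validate the entry
            // first, e.g. reject it and report the problem to the user.
            applyZoom(zoom);
        });
    }

    private void applyZoom(int percent) { /* apply the zoom level */ }

    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("GUI fault illustration");
            frame.setContentPane(new ZoomPanel());
            frame.pack();
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}
```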
TABLE I. USER INTERFACE FAULTS

GUI Structure and Aesthetics
  GSA1. Incorrect layout of widgets (e.g. alignment, dimension, orientation, depth).
    Possible failures: The positions of 2 widgets are inverted. A text is not fully visible since the size of the text field is too small. Rulers do not appear on the top of a drawing editor. The vertical lines for visualizing the precise position of shapes in the drawing editor are not displayed.
  GSA2. Incorrect state of widgets (e.g. visible, activated, selected, focused, modal, editable, expandable).
    Possible failures: Not possible to click on a button since it is not activated. A window is not visible so that its widgets cannot be used. Not possible to draw in the drawing area of a drawing editor since it is not activated.
  GSA3. Incorrect appearance of widgets (e.g. font, color, icon, label).
    Possible failures: The icon of a button is not visible. In a GUI of a power plant, the color reflecting the critical status of a pump is green instead of red.

Data Presentation
  DT1. Incorrect data rendering (e.g. scaling factors, rotating, converting).
    Possible failures: The size of a text is not scaled properly. In a drawing editor, a dotted line is painted as a dashed one. A rectangle is painted as an ellipse.
  DT2. Incorrect data properties (e.g. selectable, focused).
    Possible failure: A web address in a text is not displayed as a hyperlink.
  DT3. Incorrect data type or format (e.g. degree vs radian, float vs double).
    Possible failures: The date is displayed with five digits (e.g. dd/mm/y) instead of six digits (e.g. dd/mm/yy). A text field displays an angle in radian instead of in degree.

A. User Interface Faults
GUIs are composed of widgets that can act as mediators to interact indirectly (e.g. buttons in WIMP GUIs) or directly (direct manipulation principle in post-WIMP GUIs) with objects of the data model. In this section, we categorize the user interface faults, i.e. faults related to the structure, the behavior, and the appearance of GUIs. We further break down user interface faults into two categories: the GUI structure and aesthetics faults, and the data presentation faults, as introduced below. Table I presents an overview of these faults and their potential failures.
1) GUI Structure and Aesthetics Fault:
This fault category corresponds to unexpected GUI designs. Since GUIs are composed of widgets laid out following a given order, the first fault is the incorrect layout of widgets (GSA1). Possible failures corresponding to this fault occur when GUI widgets follow an unexpected layout (e.g. wrong size or position). The next fault concerns the incorrect state of widgets (GSA2). Widgets' behavior is dynamic and widgets can be in different states such as visible, enabled, or selected. This fault occurs when the current state of a widget differs from the expected one; for example, a widget is unexpectedly visible. The last fault concerns the unexpected appearance of widgets (GSA3). It covers the aesthetic aspects of widgets not bound to the data model, such as look-and-feels, fonts, icons, or misspellings.
2) Data presentation:
In many cases, widgets aim at editing and visualizing data of the data model. For example, in WIMP GUIs, text fields or lists can display simple data to be edited by users. Post-WIMP GUIs share this same principle, with the difference that the data representation is usually ad hoc and more complex. For example, the drawing area of a drawing editor paints shapes of the data model. Such a drawing area has been developed for the specific case of this editor; this permits representing graphically, in a single widget, complex data (e.g. shapes). In other cases, widgets aim at monitoring data only. This is notably the case for some GUIs in control commands of power plants, where data are not edited but monitored by users. The definition of data representations is complex and error-prone. It thus requires adequate data presentation faults.

The first fault of this category is the incorrect data rendering (DT1). DT1 is provoked when data is converted or scaled wrongly. Possible failures for this fault are manifested by an unexpected data appearance (e.g. wrong color, texture, opacity, shadow) or data layout (e.g. wrong position, geometry). The second fault concerns incorrect data properties (DT2). Properties define specific visualizations of data, such as selectable or focused. A possible failure is a web address that is not displayed as a hyperlink. The last fault (DT3) occurs when an incorrect data type or format is displayed. For instance, an angle value is displayed in radian instead of in degree, as sketched below.
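A minimal sketch of such a DT3 fault, assuming a hypothetical widget of a drawing editor (the class and method names are ours):

```java
import javax.swing.JTextField;

// Hypothetical sketch of a DT3 fault (incorrect data type or format):
// the data model stores angles in radians, but the widget is expected
// to display degrees.
class AngleField extends JTextField {
    void showAngle(double angleInRadians) {
        // FAULT (DT3): the radian value is displayed as-is.
        setText(String.format("%.2f", angleInRadians));
        // Correct behavior description:
        // setText(String.format("%.2f", Math.toDegrees(angleInRadians)));
    }
}
```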
B. User Interaction Faults
In this section, we introduce the faults that concern user interactions. The proposed faults are based on the characteristics of WIMP and post-WIMP GUIs detailed in the previous section. For each fault we separated our analysis into two parts: one dedicated to WIMP interactions and another one to post-WIMP interactions. WIMP interactions refer to interactions performed on WIMP widgets. They are simple and composed of few events (click, key pressed, etc.; a click is one interaction composed of the event mouse pressed followed by the event mouse released; its simple behavior has led to considering a click as an event itself). Post-WIMP interactions refer to interactions performed on post-WIMP widgets. Such interactions are more complex since they can: be multimodal, i.e. involve multiple input devices (gesture, gyroscope, multi-touch screen); be concurrent (e.g. in bi-manual interactions the two hands evolve in parallel); be composed of numerous events (e.g. multimodal interactions may be composed of sequences of pressure, move, and voice events). Such interactions can be modeled as finite-state machines [9], [17], [18]. Following the direct manipulation principles, other particularities of post-WIMP interactions are that they aim at: being as natural as possible; providing users with the feeling of handling data directly (e.g. shapes in drawing editors). Table II summarizes the user interaction faults and some of their potential failures for both WIMP and post-WIMP interactions. These faults are detailed as follows.

TABLE II. USER INTERACTION FAULTS

Interaction Behavior
  IB1. Incorrect behavior of a user interaction.
    Possible failures: A bi-manual interaction developed for a specific purpose does not work properly. The synchronization between the voice and the gesture does not work properly in a voice+gesture interaction.

Action
  ACT1. Incorrect action results.
    Possible failures: Translating a shape to a position (x, y) translates it to the position (−x, −y). Setting the zoom level at 150% sets it at 50%.
  ACT2. No action executed.
    Possible failures: Clicking on a button has no effect. Executing a DnD on a drawing area to draw a rectangle has no effect.
  ACT3. Incorrect action executed.
    Possible failures: Clicking on the button Save shows the dialogue box used for loading. Scaling a shape results in its rotation. Performing a DnD to translate shapes results in their selection.

Reversibility
  RVSB1. Incorrect results of undo or redo operations.
    Possible failures: Clicking on the button redo does not re-apply the latest undone action as expected. Pressing the keys ctrl+z does not revert the latest executed action as expected.
  RVSB2. Reverting the current interaction in progress works incorrectly.
    Possible failures: Pressing the key "Escape" during a DnD does not abort it. Saying the word "Stop" does not stop the interaction in progress.
  RVSB3. Reverting the current action in progress works incorrectly.
    Possible failure: Clicking on the button "Cancel" to stop the loading of the file previously selected does not work properly.

Feedback
  FDBK1. Feedback provided by widgets to reflect the current state of an action in progress works incorrectly.
    Possible failure: The progress bar that shows the loading progress of a file works incorrectly.
  FDBK2. The temporary feedback provided all along the execution of long interactions is incorrect.
    Possible failure: Given a drawing editor, drawing a rectangle using a DnD interaction does not show the created rectangle during the DnD as expected.
1) Interaction Behavior:
Developing post-WIMP interactions is complex and error-prone. Indeed, as explained in the section on GUIs' characteristics, it may involve many sequences of events or require the fusion of several modalities, such as voice and gesture. So, the first fault (IB1) occurs when the behavior of the developed interaction does not work properly. This fault mainly concerns post-WIMP widgets, since WIMP widgets embed simple and hard-coded interactions. For instance, an event such as a pressure can be missing in a bi-manual interaction. Another example is the incorrect synchronization between the voice and the gesture in a voice+gesture interaction.
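Since such interactions can be modeled as finite-state machines (Section II), an IB1 fault typically corresponds to a wrong or missing transition in that machine. A minimal sketch, assuming a simplified DnD interaction (the types below are ours, not from any toolkit):

```java
// Minimal sketch (ours): a DnD interaction modeled as a finite-state
// machine. An IB1 fault is, e.g., a wrong or missing transition.
enum DnDState { IDLE, DRAGGING, DONE }

class DnDInteraction {
    private DnDState state = DnDState.IDLE;

    void onPress() {
        if (state == DnDState.IDLE) state = DnDState.DRAGGING;
    }

    void onMove() {
        // Stays in DRAGGING; a real implementation would update feedback.
    }

    void onRelease() {
        // FAULT (IB1): forgetting this transition (or guarding it with
        // the wrong state) means the interaction never terminates properly.
        if (state == DnDState.DRAGGING) state = DnDState.DONE;
    }

    DnDState state() { return state; }
}
```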
2) Action:
This category groups the faults that concern the actions produced while interacting with the system. The first fault (ACT1) focuses on the incorrect results of actions. In this case the expected action is executed but its results are not correct. For instance, with a drawing editor, a failure can be the translation of one shape to the position (−x, −y) while the position (x, y) was expected. The root cause of this failure can be located in the action itself or in its settings: a first root cause of the previous failure can be the incorrect coding of the translation operation; a second root cause can be located in the settings of the translation action.

The second fault (ACT2) concerns the absence of action when interacting with the system. For instance, this fault can occur when an interaction, such as a keyboard shortcut, is not correctly bound to its widget, as sketched below.

The third fault (ACT3) consists of the execution of a wrong action. The root cause of this fault can be that the wrong action is bound to a widget at a given instant. For instance: clicking on the button Save shows the dialogue box used for loading; doing a DnD interaction on a drawing area selects shapes instead of translating them.
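A minimal Java Swing sketch of such an ACT2 fault, following the keyboard shortcut example above (the action key "delete" is ours):

```java
import javax.swing.*;

// Hypothetical sketch of an ACT2 fault: no action is executed because
// the keyboard shortcut is not correctly bound to its action.
class ShortcutSetup {
    static void installDeleteShortcut(JComponent component, Action deleteAction) {
        component.getInputMap().put(KeyStroke.getKeyStroke("DELETE"), "delete");
        // FAULT (ACT2): the matching ActionMap entry is missing, so
        // pressing the DELETE key has no effect.
        // Correct behavior description:
        // component.getActionMap().put("delete", deleteAction);
    }
}
```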
3) Reversibility:
This fault category groups three faults. The first fault (RVSB1) concerns the incorrect behavior of the undo/redo operations. Undo and redo operations usually rely on WIMP widgets such as buttons and key shortcuts. These operations revert or re-execute actions already terminated and stored by the system. A possible failure is the incorrect reversion of the latest executed action when the key shortcut ctrl+z is used.

Contrary to WIMP interactions, which are mainly one-shot, many interactions last some time, such as the DnD interaction. In such a case, users may be able to stop an interaction in progress. The second fault (RVSB2) thus consists of the incorrect interruption of the current interaction in progress. For instance, pressing the key "Escape" during a DnD does not stop it. This fault could have been classified as an interaction behavior fault; we decided to consider it as a reversibility fault since it concerns the ability to revert an ongoing interaction.

Once launched, actions may take time to be executed entirely. In this case such actions can be interrupted. The third fault (RVSB3) concerns the incorrect interruption of an action in progress. A possible failure concerns the file loading operation: clicking on the button "Cancel" to stop the loading of a file does not work properly.
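The following minimal sketch (ours, assuming a command-based implementation of undo/redo) shows where an RVSB1 fault can hide:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch (ours) of an RVSB1 fault in command-based undo/redo.
interface Command {
    void execute();    // applies the action
    void unexecute();  // reverts the action
}

class UndoHistory {
    private final Deque<Command> undos = new ArrayDeque<>();
    private final Deque<Command> redos = new ArrayDeque<>();

    void done(Command c) { undos.push(c); redos.clear(); }

    void undo() {
        if (undos.isEmpty()) return;
        Command c = undos.pop();
        c.unexecute();
        redos.push(c);
    }

    void redo() {
        if (redos.isEmpty()) return;
        Command c = redos.pop();
        // FAULT (RVSB1): calling c.unexecute() here instead of
        // c.execute() reverts the action again instead of re-applying
        // it, so clicking "Redo" does not restore the undone action.
        c.execute();
        undos.push(c);
    }
}
```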
4) Feedback:
Widgets are designed to provide immediate and continuous feedback to users while they interact with them. For instance, a progress bar showing the loading progress of a file is a kind of feedback provided to users. The first fault of this category (FDBK1) concerns the incorrect feedback provided by widgets to reflect the current state of an action in progress. This fault focuses on actions that last in time and whose progress should be monitored by users.

The second fault (FDBK2) focuses on potentially long interactions (i.e. interactions taking a certain amount of time to be completed) whose progress should be discernible by users. For instance, with a drawing editor, when drawing a shape on the drawing area, the shape in creation should be visible so that the user knows the progression of her work. So, a possible failure is that drawing a rectangle using a DnD interaction, which works correctly, does not show the created rectangle during the DnD as expected.
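A minimal Java Swing sketch of this last FDBK2 example (ours; a real drawing editor would be more elaborate):

```java
import java.awt.*;
import javax.swing.*;

// Minimal sketch (ours) of an FDBK2 fault: while a rectangle is drawn
// with a DnD, the shape in creation must be painted as temporary feedback.
class DrawingArea extends JPanel {
    // Set on mouse press/drag by the DnD interaction, null otherwise.
    Rectangle shapeInProgress;

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        // FAULT (FDBK2): omitting the block below still lets the final
        // shape be created on mouse release, but the user gets no
        // feedback during the DnD.
        if (shapeInProgress != null) {
            g.setColor(Color.GRAY);
            ((Graphics2D) g).draw(shapeInProgress);
        }
    }
}
```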
C. Discussion
The definition and the use of a fault model raise several questions that we discuss in this sub-section.
What are the benefits of the proposed GUI fault model?
The benefits of our GUI fault model are twofold. First, a fault model is an exhaustive classification of faults for a specific concern [11]. Providing a GUI fault model permits GUI developers and testers to have a precise idea of the different faults they must consider. As an illustration, Section IV describes an empirical analysis we conducted to classify and discuss GUI failures of open-source GUIs. Second, our GUI fault model allows developers of GUI testing tools to evaluate the efficiency of their tool in terms of bug detection power w.r.t. a GUI-specific fault model. As detailed in Section VI, we created mutants of an existing GUI. Each mutant contains one GUI fault of our fault model, whose activation provokes a GUI failure. Developers of GUI testing tools can run their tools against these mutants for benchmarking purposes.
Should usability be considered as a GUI fault?
Answering this question requires the definition of a fault to be recalled: a fault is a difference between the observed behavior description and the expected one. Usability issues consist of reporting that the current observed behavior of a specific part of a GUI falls short of being usable. That does not mean the observed behavior differs from the behavior expected by test oracles. Instead, it usually means that the expected behavior has not been defined correctly regarding some usability criteria. That is why we do not consider usability as a GUI fault. This reasoning can be extended to other concerns such as performance.
How to classify GUI failures into a fault model?
A GUI failure is a perceivable manifestation of a GUI error. Classifying a GUI failure thus requires having identified the root cause (i.e. the GUI error) of the failure. So, classifying GUI failures can be done by experts of the GUI under test. These experts need sufficient information, such as patches, logs, or stack traces, to identify whether the root cause of a failure is a GUI error, and then to classify it. For example, a failure manifested in the GUI but caused by a precondition violation is not classified into the GUI fault model. Similarly, classifying a GUI failure correctly also requires qualifying the involved widgets (e.g. standard or ad hoc) as well as the interaction (e.g. mono-event or multi-event interaction).
How to classify failures stemming from other failures?
For instance, the incorrect result of the execution of an action (action fault) leaves a widget not visible as expected (GUI structure fault). In such cases, only the first failure must be considered, since it puts the GUI in an unexpected and possibly unstable state. Besides, the appearance of a GUI error depends on the previous actions and interactions successfully executed. Typical examples are the undo and redo actions: a redo action can be executed only if an action has been previously performed, and the success of a redo action may depend on the previously executed actions. We considered this point during the creation of mutants (as detailed in Section VI) to provide failures that appear both with and without previous actions.

IV. RELEVANCE OF THE FAULT MODEL: AN EMPIRICAL ANALYSIS
In this section the proposed GUI fault model is evaluated. We conducted an empirical analysis to assess the relevance of the model w.r.t. faults currently observed in existing GUIs. The goal is to state whether our GUI fault model is relevant against failures found in real GUIs.
A. Introduction
To assess the proposed fault model, we analyzed bug reports of 5 popular open-source software systems: Sweet Home 3D, File-roller, JabRef, Inkscape, and Firefox Android. These systems implement various kinds of widgets and interactions, and encompass different platforms (desktop and mobile). Their GUIs cover the following main features: indirect and direct manipulation; several input devices (e.g. mouse, keyboard, touch); ad hoc widgets such as canvas; discrete data manipulation (e.g. vector-based graphics); and undo/redo actions.

B. Experimental Protocol
Bug reports have been analyzed manually from the researcher/tester perspective by looking only at the data available in the failure reports (i.e. black box analysis). To focus on detailed and commented bug reports that concern GUI failures, the selection has been driven by the following rules. Only closed, fixed, and in-progress bug reports were selected. The following search string has also been used to reduce the resulting sample: interface OR "user interface" OR "graphical user interface" OR "graphical interface" OR GUI OR UI OR layout OR design OR graphic OR interaction OR "user interaction" OR interact OR action OR feedback OR revert OR reversible OR undo OR redo OR abort OR stop OR cancel. Each report has then been manually analyzed to state whether it is a GUI failure. Also, selected bug reports have to provide explanations about the root cause of the failure, such as a patch or comments. This step is crucial to be able to categorize the failures using our GUI fault model considering their root cause. We also discarded failures identified as non-reproducible, duplicated, usability, or user misunderstanding. From this selection we kept 279 bug reports (in total for the five systems), each describing one GUI failure. The following sub-sections discuss these failures and the classification process.

TABLE III. DISTRIBUTION OF ANALYZED FAILURES PER SOFTWARE

Software        | Analyzed failures | User interface failures | User interaction failures | Repository
Sweet Home 3D   | 33 | 55% | 45% | http://sourceforge.net/p/sweethome3d/bugs/
File-roller     | 32 | 28% | 72% | https://bugzilla.gnome.org/query.cgi
JabRef          | 84 | 42% | 58% | http://sourceforge.net/p/jabref/bugs/
Inkscape        | 82 | 28% | 72% | https://bugs.launchpad.net/inkscape/
Firefox Android | 48 | 60% | 40% | https://bugzilla.mozilla.org/
C. Classification and Analysis
All the 279 failures have been successfully classified into our fault model. Fig. 1 gives an overview of the selected bug reports classified using our proposed fault model. These failures were classified into the Action (119 failures, 43%), GUI Structure and Aesthetics (75 failures, 27%), Data Presentation (39 failures, 14%), Reversibility (31 failures, 11%), Interaction Behavior (12 failures, 4%), and Feedback (3 failures, 1%) fault categories. Most of the failures classified into GUI Structure and Aesthetics concern the incorrect layout of widgets (51%). Likewise, most of the failures in the Action category refer to incorrect action results (75%).

Fig. 1. Classification of the 279 bug reports using the GUI fault model
Table III shows the distribution of the 279 analyzed GUI failures per software and category (user interface or user interaction). These results point out that the systems Sweet Home 3D and Firefox Android seem to be more affected by user interface failures. Most of these failures concern the GUI structure and aesthetics faults, which can be explained by the complex and ad hoc GUI structure of these systems. The File-roller and JabRef GUIs include widgets with coarse-grained properties (i.e. simple input values such as numbers or texts). Most of their failures concern WIMP interactions classified into the Action category. In contrast, Inkscape presented more failures classified as post-WIMP. Indeed, Inkscape, a vector graphics software, mainly relies on its drawing area, which provides users with different post-WIMP interactions. These failures have been categorized mainly into Interaction Behavior, Action, and Reversibility.

Fig. 2. Manifestation of failures at the user interface and user interaction levels

As depicted by Fig. 2, 41% of these 279 GUI failures originate from faults classified into the user interface category and 59% into the user interaction category. Most user interaction failures have been classified into the incorrect action results fault (54%). This plot also highlights that only 25% of the analyzed user interface failures and 18% of the user interaction ones have been classified as post-WIMP. We comment on these results in the following sub-section.
D. Discussion
The empirical results must be balanced with the fact that user interactions are less tangible than user interfaces. So, users may report more GUI failures when they can perceive failures graphically (an issue in the layout of a GUI or in the result of an action visible through a GUI). Users, however, may have difficulties detecting a failure in an interaction itself while interacting with the GUI. That may explain the low number of failures (4%) classified into Interaction Behavior. Another explanation may be the primary use of WIMP widgets, which rely on simple interactions.

In our analysis, many failures that could be related to Feedback were discarded since they concerned enhancements or usability issues, which are out of the scope of a GUI fault model, as discussed previously. For instance, GUI failures that concern the lack of haptic feedback in Firefox Android were discarded. So, few faults (1%) were classified into this category. Another explanation may be the difficulty for users to identify feedback issues as real failures that should be reported.

We observed that some reported GUI failures are false positives regarding the fault localization: if the report does not have enough information about the root cause of a failure (e.g. a patch or an exception log), a GUI failure can be classified into a wrong fault category. Consider, for example, a report stating that moving a shape using a DnD does not move it. At a first glance, the root cause of this failure can be associated with an incorrect behavior of the DnD, so this failure could be categorized into the interaction behavior category. However, the analysis of the root cause shows that it is an action failure: the DnD works properly, but no action is linked to this interaction.

Likewise, the failures related to Reversibility and Feedback were easily identified through the steps to reproduce them. For example in JabRef, "pressing the button "Undo" will clear all the text in the field, but then pressing the button "Redo" will not recover the text". Furthermore, some systems do not revert interactions step by step but entirely. This can imply a failure from a user's point of view, but sometimes it is considered as an invalid failure (e.g. requirements vs. usability issues) by developers. In JabRef, the undo/redo actions did not revert discrete operations: pressing the button "Undo" clears all the texts typed into different text fields instead of clearing only one field each time the button is pressed.

Another important point concerns the WIMP vs. post-WIMP GUI faults. We classified more failures involving WIMP than post-WIMP widgets. A possible explanation is that, despite the increasing interactivity of GUIs, the analyzed GUIs still rely more on WIMP widgets and interactions. Moreover, users now master the behavior of WIMP widgets, so that they can easily identify when these provoke failures. It may not be the case with ad hoc and post-WIMP widgets.

V. ARE GUI TESTING TOOLS ABLE TO DETECT CLASSIFIED FAILURES? AN EMPIRICAL STUDY
This section provides an empirical study of two GUI testing tools: GUITAR [19] and Jubula. To demonstrate the current limitations of GUI testing tools in testing real GUIs, we applied those tools to detect the failures previously classified with our GUI fault model.

A. GUITAR and Jubula
GUITAR is one of the most widespread academic GUI testing tools (http://sourceforge.net/apps/mediawiki/guitar/). It extracts the GUI structure by reverse engineering. This structure is transformed into a GUI Event Flow Graph (EFG), where each node represents a widget event. Based on this EFG, test cases are generated and executed automatically over the SUT. We used the plugin for Java Swing (i.e. JFC GUITAR version 1.1.1). In GUITAR, each test case is composed of a sequence of widget events. The generation of test cases can be parameterized with the size of that sequence (i.e. the test case length).

Jubula is a semi-automated GUI testing tool that leverages pre-defined libraries to create test cases. These libraries contain modules that can be reused to manually assemble test sequences. The modules encompass actions (e.g. check, select) and interactions (e.g. click, drag and drop) over different GUI toolkits (e.g. Swing, SWT, RCP, mobile). We have reused the library dedicated to Java Swing (Jubula version 7.2) to write the test cases presented in the next experiments. This library contains actions to test only standard widgets, such as dragging a column/row of a table by passing an index. To test ad hoc widgets (e.g. canvas), we made a workaround by mapping actions directly to these widgets. For example, to draw a shape on a canvas we need to specify the exact position (e.g. drag and drop coordinates) where the interaction should be executed.
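To give an intuition of how GUITAR derives test cases from an EFG, here is a minimal sketch of the idea (ours; it is not GUITAR's actual code or API):

```java
import java.util.*;

// Minimal sketch (ours) of GUITAR's event-flow graph (EFG) idea: nodes are
// widget events; an edge (e1, e2) means that e2 can be executed right after
// e1. Test cases are paths of a given length through this graph.
class EventFlowGraph {
    private final Map<String, Set<String>> follows = new HashMap<>();

    void addEdge(String e1, String e2) {
        follows.computeIfAbsent(e1, k -> new HashSet<>()).add(e2);
    }

    // Enumerates all event sequences that extend 'start' by 'length' events.
    List<List<String>> testCases(String start, int length) {
        List<List<String>> result = new ArrayList<>();
        walk(start, new ArrayList<>(List.of(start)), length, result);
        return result;
    }

    private void walk(String event, List<String> path, int remaining,
                      List<List<String>> out) {
        if (remaining == 0) {
            out.add(new ArrayList<>(path));
            return;
        }
        for (String next : follows.getOrDefault(event, Set.of())) {
            path.add(next);
            walk(next, path, remaining - 1, out);
            path.remove(path.size() - 1);
        }
    }
}
```

For instance, after addEdge("click File", "click Save"), the call testCases("click File", 2) enumerates all sequences of three events starting with "click File"; the test case length parameter mentioned above corresponds to this bound.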
B. Experiment

We selected JabRef (http://jabref.sourceforge.net/), a software to manage bibliographic references. JabRef is written in Java, which allows us to apply both GUITAR and Jubula. For each fault described in our GUI fault model, we selected one reported failure. To reproduce each failure, we downloaded the corresponding faulty version of JabRef. We used the exact test sequence (i.e. number of actions) required to reproduce a failure. In GUITAR, all test cases were generated automatically over a faulty version. In Jubula, each test case was created manually to detect one failure. All test cases were written by one of the authors of this paper, who has expertise in JabRef. Their test sequences were extracted by analyzing the failure reports (e.g. the steps to reproduce a failure) and reusing Jubula's libraries. Then, GUITAR and Jubula ran all their test cases automatically to check whether the selected failures were found.

C. Results and Discussion
Table V summarizes the detection of the JabRef GUI failures by GUITAR and Jubula. These failures cover 11 out of the 15 faults described in our fault model. The remaining four faults were not covered for two reasons: 1) no failure was classified for that fault; or 2) a failure was classified, but we could not reproduce it since it only occurred in a specific environment (e.g. operating system) or given a certain input (e.g. a particular database in JabRef).

TABLE V. JABREF FAILURES DETECTED BY GUITAR AND JUBULA

Fault ID | GUITAR | Jubula
GSA1     |   ✗    |   ✗
GSA2     |   ✗    |   ✓
GSA3     |   ✗    |   ✓
DT1      |   ✗    |   ✓
DT2      |   ✗    |   ✓
IB1      |   ✓    |   ✓
ACT2     |   ✓    |   ✓
ACT3     |   ✗    |   ✗
RVSB1    |   ✗    |   ✓
RVSB2    |   ✓    |   ✓
FDBK1    |   ✗    |   ✓

The reported failures in JabRef are mostly related to WIMP widgets, so we would expect GUITAR and Jubula to detect them, but this was not the case. For instance, one failure concerned properties (e.g. text, event handlers) of buttons. In GUITAR, checking the properties of the concerned widget did not reveal this failure since the expected and actual values of its size property (e.g. width) remained the same. In Jubula, the concerned widget cannot be mapped for test case execution and thus cannot be tested.

Other failures were not detected since no error is raised (e.g. no exception) and the GUI properties are the "expected" ones. For example, a text property of a status bar contains the value "Redo: change field" when this action was actually not redone. Similarly, a failure classified into FDBK2 could not be provoked: the test case was successfully replayed by Jubula, and the input text typed via the keyboard was saved automatically without any interference from the auto-completion feature.

Another point is the accuracy of the test cases created manually in Jubula. Detecting one of the failures requires typing a special character (e.g. \%) and then checking that the output shown in a preview window does not contain any command. In Jubula, one can write a test case that checks whether the text contains such a pattern (e.g. SelectPattern[%,equals] in ComponentText[preview]), or a test case that checks whether an entire text matches the expected one (e.g. CheckText[100%, equals] in ComponentText[preview]). However, the latter test case will fail since the text of the preview window in JabRef is internally represented as HTML and, in Jubula, the action's parameters cannot be specified in that format.

Our experiment does not aim at comparing both tools, since GUITAR is a fully automated tool contrary to Jubula. However, the results of this study highlight the current limitations of GUI testing tools: GUITAR and Jubula currently mainly work for detecting failures that affect the properties of standard widgets. Moreover, GUITAR does GUI regression testing: it considers a given GUI as the reference one from which tests will be produced. If this GUI is faulty, GUITAR will produce tests that consider these failures as the correct behavior. A possible solution to overcome this issue is to base the test process on the specifications (requirements, etc.) of the GUI.

VI. FORGING FAULTY GUIS FOR BENCHMARKING
In this section, we evaluate the usefulness of our fault model by applying it to a highly interactive open-source software system. We created mutants of this system corresponding to the different faults of the model. The main goal of these mutants is to provide GUI testers with benchmark tools to evaluate the ability of GUI testing tools to detect GUI failures. As an illustration of the practical use of these mutants, we executed two GUI testing tools against the mutants of the system; thereby we caught a glimpse of their ability to cover our proposed fault model. The goal of this experiment is to answer the research question: what are the benefits of this fault model for GUI testing?

A. Mutants Generation
As highlighted by Zhu et al., "software testing is often aimed at detecting faults in software. A way to measure how well this objective has been achieved is to plant some artificial faults into the program and check if they are detected by the test. A program with a planted fault is called a mutant of the original program" [20]. Following this principle, we planted 65 faults in a highly interactive open-source software system, namely Latexdraw (http://sourceforge.net/projects/latexdraw/), using our proposed fault model. Latexdraw has been selected because of the following points: 1) it is a highly interactive system written in Java and Scala (dedicated to the creation of drawings for LaTeX); 2) its GUI mixes both standard and ad hoc widgets; 3) it is released under an open-source license (GPL2) so that it can be freely used by the testing community.

We created 65 mutants corresponding to the different faults of our proposed fault model. All these mutants and the original version are freely available (https://github.com/arnobl/latexdraw-mutants). Each mutant is documented to detail its planted fault and the oracle permitting to find it. Multiple mutants have been created from each fault by: using WIMP (22 mutants) or post-WIMP (43 mutants) widgets to kill the mutants; varying the test case length (i.e. the number of actions required to provoke the failure). Each action (e.g. select a shape) requires a minimal number of events (e.g. in Latexdraw a DnD requires at least three events: press/move/release) to be executed. Table IV summarizes the number of forged mutants and the minimal and maximal test case length for each fault. For instance, a minimal length of 0 means that the failure can be provoked without executing any action (e.g. IB1, DT1). An example of a planted fault is sketched after the table.

TABLE IV. MUTANTS PLANTED ACCORDING TO FAULTS IN THE GUI FAULT MODEL
ID: GSA1, GSA2, GSA3, DT1, DT2, DT3, IB1, ACT1, ACT2, ACT3, RVSB1, RVSB2, RVSB3, FDBK1, FDBK2
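As an illustration of how such a fault is planted, the following sketch (ours, not actual Latexdraw code; the names are hypothetical) shows an ACT1 mutant where the translation action is executed but produces a wrong result:

```java
// Hypothetical sketch of an ACT1 mutant: the action is executed but its
// result is incorrect (cf. Table II).
interface Shape {
    void translate(double tx, double ty);
}

class TranslateAction {
    void apply(Shape shape, double tx, double ty) {
        // Original (correct behavior description):
        // shape.translate(tx, ty);
        // Mutant (ACT1): the offsets are negated, so translating to
        // (x, y) actually translates to (-x, -y).
        shape.translate(-tx, -ty);
    }
}
```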
B. How GUI testing tools kill our GUI mutants: a first experiment

We applied the GUI testing tools GUITAR and Jubula on the mutants to evaluate their ability to kill them. Our goal is not to provide benchmarks against these tools but rather to highlight the current challenges in testing interactive systems that are not considered yet (e.g. post-WIMP interactions). GUITAR test cases have been generated automatically while Jubula ones have been written manually.

Considering the mutants planted at the user interface level, the Jubula and GUITAR tests killed the mutants that involve checking standard widget properties, such as layout (e.g. width, height) and state (e.g. enabled, selection, focusable). Also, it is possible to test simple data (e.g. string values in text fields) on those widgets. However, most of the mutants that concern the ad hoc widgets remained alive, notably when test cases involve testing complex data from the data model. For example, it is not possible to compare the actual shape on the canvas against the expected one. Even if some shape properties (e.g. the rotation angle) are presented in standard widgets (e.g. a spinner), GUITAR and Jubula cannot state whether the current values in these widgets match the expected shape rotation on the canvas.

Likewise, our GUITAR and Jubula tests cannot kill most of the user interaction mutants that result in a wrong presentation of shapes, in particular the mutants planted into the Reversibility or Feedback categories. For example, testing the undo/redo operations in Latexdraw should compare all the states of the manipulation of a shape on the canvas. Moreover, the test verdicts in Jubula passed even though interactions are defined incorrectly (e.g. the mouse cursor does not follow a DnD) or actions cannot be executed (e.g. a button is disabled). In GUITAR, the generated test cases do not properly cover actions having dependencies. For example, the action "Delete" in Latexdraw requires first selecting a shape on the canvas. However, no test sequence that contains "Select Shape" before "Delete Shape" was generated. Thus, some mutants could not be killed.

Table VI gives an overview of the number of mutants killed by GUITAR and Jubula. The results show that both tools are not able to kill all the mutants, because of the four following reasons. 1) Testing Latexdraw with GUITAR and Jubula is limited to the test of the standard Swing widgets: in Jubula, the test cases can only be written according to the libraries available for the Swing toolkit; in GUITAR, the basic package for Java Swing GUIs only covers standard widgets and mono-events (e.g. a click on a button). 2) Configuring or customizing a GUI testing tool to test post-WIMP widgets is not a trivial task: each sequence of a test case in Jubula needs to be mapped to the corresponding GUI widget manually, and GUITAR needs to be extended to generate test cases for ad hoc widgets (e.g. canvas) as well as their interactions (e.g. multi-modal interactions). 3) Testing post-WIMP widgets requires a long test case sequence: in Latexdraw, a sequence to test interactions over these widgets is composed of at least two actions, and that sequence is longer when we have to detect failures in undo/redo operations. 4) It is not possible to give a test verdict for complex data: the oracles provided by the two GUI testing tools do not know the internal behavior of ad hoc widgets, their interaction features, and their data presentation. These results answer the research question by highlighting the benefits of our fault model for measuring the ability of GUI testing tools in finding GUI failures.
C. Threats to Validity
Regarding the conducted empirical studies, we identified the two following threats to validity. The first one concerns the scope of the proposed fault model, since we evaluated it empirically on a small number (five) of interactive systems. To limit this threat, we selected interactive systems that cover different aspects of the HCI concepts we detailed in Section II. The second threat concerns the subjectivity observed in bug reports to describe failures. To deal with it, we based the classification on the bug report artifacts (patches, logs, etc.) to identify the root cause of the reported failures.
TABLE VI. MUTANTS KILLED BY GUITAR AND JUBULA

        |      GUITAR      |      JUBULA
ID      | WIMP | post-WIMP | WIMP | post-WIMP
GSA1    |  2   |    0      |  2   |    0
GSA2    |  5   |    0      |  6   |    1
GSA3    |  3   |    0      |  3   |    0
DT1     |  -   |    0      |  -   |    0
DT2     |  -   |    0      |  -   |    0
DT3     |  -   |    0      |  -   |    1
IB1     |  -   |    0      |  -   |    0
ACT1    |  0   |    0      |  0   |    1
ACT2    |  3   |    0      |  3   |    0
ACT3    |  2   |    0      |  2   |    0
RVSB1   |  2   |    0      |  2   |    0
RVSB2   |  -   |    0      |  -   |    0
RVSB3   |  -   |    -      |  -   |    -
FDBK1   |  1   |    0      |  1   |    0
FDBK2   |  -   |    0      |  -   |    0
VII. RELATED WORK
Existing fault classifications are presented at a higher level of abstraction, mainly considering the components that are affected by faults. Most classifications leverage the software assets (e.g. specification, models, architecture, code) to define their faults. These faults have been described in fault models [11], [16] or defect taxonomies [21].

In an effort to cover GUIs, the Orthogonal Defect Classification (ODC) [21] has been extended by IBM Research to include GUI faults. These faults focus on the appearance of widgets, the navigation between widgets, and the unexpected behavior of widget events and input devices. In our fault model, we do not cover faults that concern the behavior of input devices (i.e. hardware faults). Although this taxonomy considers GUI faults, it does not separate the user interface and user interaction faults. Moreover, this extension does not consider faults caused by post-WIMP widgets and their advanced interactions, nor faults of the data presentation category.

Li et al. categorize faults of industrial and open source projects using the ODC taxonomy [22]. The category Interface concerns several GUI defects. However, this single category covers several user interface defects related to specific widgets such as window, title bar, menu, or tool bar. Similarly, the interaction defects are limited to mouse and keyboard. Thus, it is not possible to identify the kind of faults classified into these categories since they are not detailed. For example, a fault classified into the mouse category can concern an interaction, an action, or an input device.

Brooks et al. [23] present a study that characterizes GUIs based on reported faults of three industrial systems. To classify all these faults (GUI and non-GUI faults), the authors adapted a defect taxonomy by including other categories such as GUI defects. This category encompasses both the user interface and user interaction faults. Also, Børretzen et al. [24] analyze faults reported by four projects by combining two defect taxonomies. Both works introduce a category that concerns the GUI faults, but these faults are not described and thus no classification is presented. Strecker et al. [25] characterize faults that affect GUI test suites. However, these faults do not concern GUI faults but any fault at the code level (e.g. class or method faults) that may affect the GUI.

In contrast, several research papers concern the fault effects by classifying GUI failures instead of GUI faults. In general, these works focus on specific GUIs (automotive GUIs [26]) or domains (mobile [27], safety-critical [28]). For example, Maji et al. characterize failures for mobile operating systems [27]. These failures are classified according to the fault localization: a failure manifested in a camera is categorized in the Camera segment, and failures for other segments such as Web, Multimedia, or GUI are categorized similarly. Also, Zaeem et al. [29] have conducted a bug study for Android applications to automate oracles. They identified 20 categories, including some GUI issues such as Rotation (device's rotation), Gestures (zooming in and out), and Widget. Although these papers have investigated failures in a context that brings many advances in terms of interactive features, no classification or discussion about these kinds of failures is presented.

Mauser et al. propose a GUI failure classification for automotive systems [26]. This classification is based on three categories: design, content, and behavior. In the Design category, the failures refer to GUI layouts (e.g. color, font, position). In the Content category, the failures are associated with displayed data such as texts, animations, and symbols/icons. The failures in the Behavior category are caused by a wrong behavior of windows (e.g. wrong pop-up) or widgets (e.g. wrong focus). The authors focus on characterizing GUI failures based only on a small set of specific widgets designed for these kinds of GUIs. Furthermore, they do not consider user interaction failures.

VIII. CONCLUSION AND RESEARCH AGENDA
This paper proposes a GUI fault model for providing GUI testers with benchmark tools to evaluate the ability of GUI testing tools to detect GUI failures. This fault model has been empirically assessed by analyzing and classifying 279 GUI bug reports of different open-source GUIs. To demonstrate the benefits of the proposed fault model, mutants have then been developed from it on a Java open-source GUI. As an illustrative use case of these mutants, we executed two GUI testing tools on them to evaluate their ability to detect the planted faults. This experiment shows that, if current GUI testing tools have demonstrated their ability for finding several kinds of GUI errors, they also fail at detecting several of the GUI faults we identified. The underlying reasons are twofold. First, GUI failures may be related to the graphical rendering of GUIs. Testing GUI rendering is a complex task since current testing techniques mainly rely on code analysis, which can hardly capture graphical properties. Second, the current trend in GUI design is the shift from designing GUIs composed of standard widgets to designing GUIs relying on more complex interactions and ad hoc widgets [2], [8], [9]. New GUI testing techniques thus have to be proposed for fully testing, as automatically as possible, GUI rendering and complex interactions using ad hoc widgets.

ACKNOWLEDGEMENTS

This work is partially supported by the French BGLE Project CONNEXION.

REFERENCES
[1] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
[2] M. Beaudouin-Lafon, "Instrumental interaction: an interaction model for designing post-WIMP user interfaces," in Proc. of CHI'00. ACM, 2000, pp. 446-453.
[3] A. M. Memon, "An event-flow model of GUI-based applications for testing," STVR, vol. 17, no. 3, pp. 137-157, 2007.
[4] M. Cohen, S. Huang, and A. Memon, "Autoinspec: Using missing test coverage to improve specifications in GUIs," in Proc. of ISSRE'12, 2012, pp. 251-260.
[5] S. Arlt, A. Podelski, C. Bertolini, M. Schaf, I. Banerjee, and A. Memon, "Lightweight static analysis for GUI testing," in Proc. of ISSRE'12, 2012.
[6] L. Mariani, M. Pezzè, O. Riganelli, and M. Santoro, "Autoblacktest: Automatic black-box testing of interactive applications," in Proc. of ICST'12. IEEE, 2012, pp. 81-90.
[7] D. H. Nguyen, P. Strooper, and J. G. Süß, "Automated functionality testing through GUIs," in Proc. of ACSC'10, 2010, pp. 153-162.
[8] M. Beaudouin-Lafon, "Designing interaction, not interfaces," in Proc. of AVI'04, 2004.
[9] A. Blouin and O. Beaudoux, "Improving modularity and usability of interactive systems with Malai," in Proc. of EICS'10, 2010, pp. 115-124.
[10] A. van Dam, "Post-WIMP user interfaces," Commun. ACM, vol. 40, no. 2, pp. 63-67, Feb. 1997.
[11] G. von Bochmann, A. Das, R. Dssouli, M. Dubuc, A. Ghedamsi, and G. Luo, "Fault models in testing," in Protocol Test Systems, 1991, pp. 17-30.
[12] B. Shneiderman, "Direct manipulation: a step beyond programming languages," IEEE Computer, vol. 16, no. 8, pp. 57-69, 1983.
[13] E. L. Hutchins, J. D. Hollan, and D. A. Norman, "Direct manipulation interfaces," Hum.-Comput. Interact., vol. 1, no. 4, pp. 311-338, 1985.
[14] D. A. Norman, The Design of Everyday Things, reprint paperback ed. Basic Books, 2002.
[15] C. Appert, O. Chapuis, and E. Pietriga, "Dwell-and-spring: undo for direct manipulation," in Proc. of CHI'12. ACM, 2012, pp. 1957-1966.
[16] A. Pretschner, D. Holling, R. Eschbach, and M. Gemmar, "A generic fault model for quality assurance," in Proc. of MODELS'13, 2013.
[17] A. Blouin, B. Morin, G. Nain, O. Beaudoux, P. Albers, and J.-M. Jézéquel, "Combining aspect-oriented modeling with property-based reasoning to improve user interface adaptation," in Proc. of EICS'11, 2011, pp. 85-94.
[18] C. Appert and M. Beaudouin-Lafon, "SwingStates: Adding state machines to Java and the Swing toolkit," Software: Practice and Experience, vol. 38, no. 11, pp. 1149-1182, 2008.
[19] B. Nguyen, B. Robbins, I. Banerjee, and A. Memon, "GUITAR: an innovative tool for automated testing of GUI-driven software," Automated Software Engineering, pp. 1-41, 2013.
[20] H. Zhu, P. A. V. Hall, and J. H. R. May, "Software unit test coverage and adequacy," ACM Comput. Surv., vol. 29, no. 4, pp. 366-427, 1997.
[21] R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus, B. K. Ray, and M.-Y. Wong, "Orthogonal defect classification: a concept for in-process measurements," IEEE Trans. Softw. Eng., vol. 18, no. 11, pp. 943-956, 1992.
[22] N. Li, Z. Li, and X. Sun, "Classification of software defect detected by black-box testing: An empirical study," in Proc. of WCSE'10.
[23] P. Brooks, B. Robinson, and A. Memon, "An initial characterization of industrial graphical user interface systems," in Proc. of ICST'09.
[24] J. A. Børretzen and R. Conradi, "Results and experiences from an empirical study of fault reports in industrial projects," in Proc. of PROFES'06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 389-394.
[25] J. Strecker and A. Memon, "Relationships between test suites, faults, and fault detection in GUI testing," in Proc. of ICST'08, 2008, pp. 12-21.
[26] D. Mauser, A. Klaus, R. Zhang, and L. Duan, "GUI failure analysis and classification for the development of in-vehicle infotainment," in Proc. of VALID'12, 2012, pp. 79-84.
[27] A. Kumar Maji, K. Hao, S. Sultana, and S. Bagchi, "Characterizing failures in mobile OSes: A case study with Android and Symbian," in Proc. of ISSRE'10, 2010, pp. 249-258.
[28] R. Lutz and I. C. Mikulski, "Empirical analysis of safety-critical anomalies during operations," IEEE Trans. Softw. Eng., pp. 172-180, 2004.
[29] R. N. Zaeem, M. R. Prasad, and S. Khurshid, "Automated generation of oracles for testing user-interaction features of mobile apps," in Proc. of ICST'14, 2014.