Prioritize Crowdsourced Test Reports via Deep Screenshot Understanding

Shengcheng Yu, Chunrong Fang*, Zhenfei Cao, Xu Wang, Tongyu Li, Zhenyu Chen
State Key Laboratory for Novel Software Technology, Nanjing University, China
*Corresponding author: [email protected]
Abstract—Crowdsourced testing is increasingly dominant in mobile application (app) testing, but it is a great burden for app developers to inspect the enormous number of test reports. Many approaches have been proposed to process test reports based only on texts, or additionally on simple image features. However, in mobile app testing, the texts contained in test reports are condensed, and the information they carry is inadequate. Many screenshots are included as complements, and they contain much richer information beyond the texts. This trend motivates us to prioritize crowdsourced test reports based on a deep screenshot understanding. In this paper, we present a novel crowdsourced test report prioritization approach, namely DeepPrior. We first represent the crowdsourced test reports with a newly introduced feature, namely DeepFeature, which includes all the widgets along with their texts, coordinates, types, and even intents, based on a deep analysis of the app screenshots and the textual descriptions in the crowdsourced test reports. DeepFeature includes the Bug Feature, which directly describes the bugs, and the Context Feature, which depicts the thorough context of the bug. The similarity of the DeepFeature is used to represent the test reports' similarity and to prioritize the crowdsourced test reports. We formally define this similarity as DeepSimilarity. We also conduct an empirical experiment to evaluate the effectiveness of the proposed technique with a large dataset group. The results show that DeepPrior is promising: it outperforms the state-of-the-art approach with less than half the overhead.
Index Terms—Crowdsourced Testing, Mobile App Testing, Deep Screenshot Understanding
I. INTRODUCTION
Crowdsourcing has become one of the mainstream techniques in many areas. The openness of crowdsourcing brings many advantages. For example, the operations on crowdsourcing subjects can be simulated in multiple different real practical environments. Such advantages help alleviate the severe "fragmentation problem" in mobile application (app) testing [1]: there are hundreds of thousands of different mobile device models with different brands, operating system (OS) versions, and hardware sensors. Crowdsourced testing is one of the best solutions to this problem. App developers can distribute their apps to crowdworkers with different mobile devices and require them to submit test reports containing app screenshots and textual descriptions. This helps app developers reveal as many problems as possible.

However, report reviewing efficiency in crowdsourced testing is a severe problem. The openness of crowdsourcing can lead to a great number of reports being submitted, and almost 82% of the submitted reports are duplicates [2]. Processing all the reports automatically is quite tough due to several complexities. In the text part, the ambiguity of natural language means that crowdworkers may use different words to describe the same objects, or use one word to describe completely different scenarios. In the image part, whole-screenshot similarity provides little help because many app functions share similar UIs. Therefore, it is hard but important for app developers to reveal all the mentioned bugs as early as possible.

In recent research, the processing of test reports is always divided into two parts: app screenshots and textual descriptions. Existing studies analyze these two parts separately to extract features. For textual descriptions, existing approaches extract the keywords and normalize them according to a predefined vocabulary. For app screenshots, they treat each screenshot as a whole and extract image features represented as numeric vectors. After obtaining the results from the two parts, most current studies rely on the texts and consider the screenshots as supplemental material, or simply concatenate the image information and the text information. However, we believe this kind of processing loses much valuable information: the relationship between textual descriptions and app screenshots is left out, and report deduplication or prioritization can be less effective.

In this paper, we propose a novel approach, namely DeepPrior, to PRIORitize crowdsourced test reports via DEEP screenshot understanding. DeepPrior considers the deep understanding of both app screenshots and textual descriptions in detail. For a submitted test report, we extract information from both screenshots and texts. In screenshots, we collect all the widgets with computer vision (CV) technologies, and we locate the problem widget (denoted as W_P) according to the textual descriptions (details in Section III-A1). The remaining widgets are treated as context widgets (denoted as W_C). Texts are processed with natural language processing (NLP) technologies and are divided into two parts: the reproduction steps (denoted as R) and the bug description (denoted as P). The reproduction steps are further normalized into "action-object" sequences. The bug description is further processed to extract the problem widget description for W_P localization.

Instead of processing app screenshots and textual descriptions separately, we take them as a whole and collect all the information as a DeepFeature for each report. Based on the relativity to the bug itself, the DeepFeature includes the Bug Feature (BFT) and the Context Feature (CFT). The Bug Feature consists of W_P and P, and it represents the information directly relevant to the bug revealed in the report. The Context Feature consists of W_C and R, and it represents the context information, including the operation track triggering the bug and the activity information where the bug occurs.

After integrating the above features into the DeepFeature, DeepPrior calculates the DeepSimilarity among all the reports for prioritization. For the Bug Feature and the Context Feature, we calculate the DeepSimilarity separately.

For the Bug Feature, to calculate the DeepSimilarity of the W_P in the reports, we use CV technologies to extract and match the feature points. P is a short textual description, so we use NLP technologies to extract the bug-related keywords based on our self-built vocabulary, and compare the keyword frequency as the DeepSimilarity.

For the Context Feature, W_C is fed into a pre-trained deep learning classifier to identify each widget's type, and the count vector over the types is used for the W_C DeepSimilarity. R is composed of a series of actions and the corresponding widgets, representing the sequence from the app launch to the bug occurrence. Therefore, we extract the actions and the objects using NLP technologies in the order of R. We take the "action-object" sequence as the operation track and calculate the DeepSimilarity.

Then the prioritization starts. We first construct a NULL Report (defined in Section III-D) and append it to the prioritized report pool. Then, we repeatedly calculate the DeepSimilarity between each unprioritized report and all the reports in the prioritized report pool. The report with the lowest "minimum DeepSimilarity" over all the reports in the prioritized report pool is put into the prioritized report pool.

We also design an empirical experiment, using a large-scale dataset group from a large and active crowdsourced testing platform. We compare DeepPrior with baseline strategies, and the results show that DeepPrior is effective.

The noteworthy contributions of this paper are as follows.
• We propose a novel approach that prioritizes crowdsourced test reports via deep screenshot understanding and detailed text analysis. We extract all the widgets from the screenshots, classify textual information into different categories, and form the DeepFeature.
• We construct an integrated dataset group for deep screenshot understanding, including a large-scale widget image dataset, a large-scale test report keyword vocabulary, a large-scale text classification dataset, and a large-scale crowdsourced test report dataset.
• Based on the dataset group, we conduct an empirical evaluation of the proposed approach, DeepPrior, and the results show that DeepPrior outperforms the state-of-the-art approach with less than half the overhead.
More resources can be found in our online package (anonymized for the double-blind principle): https://sites.google.com/view/deepprior

II. BACKGROUND & MOTIVATION
Crowdsourced testing has gained great popularity in mobile app testing. Its advantages are obvious, but its drawbacks are also unignorable. On most mainstream crowdsourced testing platforms, crowdworkers are required to submit a report to describe the bug they meet. The main body of a report is a screenshot of the bug and a textual description. The app screenshot and the textual description are also the principal basis for prioritizing the crowdsourced test reports.

Current solutions for crowdsourced test report processing that consider the screenshots, like [2] [3], mainly analyze the app screenshot features and the textual description information to measure the similarity among all the reports. Though they consider the app screenshots, they simply treat the images as width × height × RGB matrices. These approaches ignore rich and valuable information; we hold the opinion that an app screenshot should be viewed as a collection of meaningful widgets instead of a collection of meaningless pixels. We take this stand because, while reviewing the crowdsourced test report dataset, we found some vivid examples that existing approaches have difficulty handling, since they merely perform simple feature extraction instead of deep screenshot understanding.
A. Example 1: Different App Theme
Nowadays, apps support different themes, making it possible for users to customize the app appearance according to their preferences (Fig. 1). Moreover, the supported "dark mode" makes the color scheme even more complex. Image feature extraction algorithms can hardly handle such complexity and will make mistakes. From the examples, we can find that the app screenshots in the three reports are of blue, white, and green themes. All three reports describe the same loading failure of the music resource files. However, according to [2], the image color feature is one vital component of the report surrogate, so app screenshots with different colors will be recognized as different screenshots.

Fig. 1. Example 1: Different App Theme

Report: Choose a directory containing music files and "mark as music dir", but music files do not show when returning to the main page.
Report: The song list cannot show the music files.
Report: The page shows "No media found" after choosing the music library.
B. Example 2: Different Bugs on the Same Screenshots
As shown in Fig. 2, the two reports use screenshots of the same app activity, and an image feature extraction algorithm will assign a high similarity to these two screenshots. However, according to the bug descriptions, the two reports describe completely different bugs. In DeepPrior, for the report about the missing song list, we can extract the text "no media found"; for the report about the volume, we can extract the volume widget besides the prompt information, so DeepPrior can identify the different problems.

Fig. 2. Example 2: Different Bugs on the Same Screenshots

Report: When the headphones are plugged in, the volume is automatically increased while playing.
Report: The song list cannot show the music files.
C. Example 3: Same Bug on Different Screenshots
As shown in Fig. 3, the ImageView widget on the top has different contents, and it occupies a large proportion of the entire page. Also, the comments are different due to different testing times. Therefore, existing approaches will consider the two screenshots to be of low similarity, which pulls down the overall similarity even when the textual descriptions are highly similar. With DeepPrior, we can extract the pop-up information on the bottom, saying "comment failed", and assign a high similarity to the two reports. Such pop-ups are considered quite significant widgets that contain the bug.

Report: The page reminds failure after submitting a comment, but the submission shows in the list.
Report: When inputting the comment, the page reminds submit failure, but when reentering the page, the comment is already in the list.

Fig. 3. Example 3: Same Bug on Different Screenshots
III. APPROACH

This section presents the details of DeepPrior, which prioritizes crowdsourced test reports via deep screenshot understanding. DeepPrior consists of 4 stages: feature extraction, feature aggregation, DeepSimilarity calculation, and report prioritization. In the first stage, we collect 4 different types of report features from both the app screenshots and the textual descriptions. We then aggregate the extracted features into a DeepFeature, which includes the Bug Feature and the Context Feature. Based on the DeepFeature, we design an algorithm to calculate the DeepSimilarity between every two test reports. Based on the pre-defined rules (details in Section III-D), we prioritize the test reports according to the DeepSimilarity. The general framework of the DeepPrior approach is shown in Fig. 4.
A. Feature Extraction
The first and most important step is feature extraction. In this step, we analyze the app screenshots and the textual descriptions in the crowdsourced test reports separately.

1) Features from App Screenshots: App screenshots are vital in crowdsourced test reports. Crowdworkers are required to take screenshots when bugs occur to better illustrate them. As described in [2], texts alone can be confusing because textual descriptions provide only limited information. Therefore, screenshots are taken into consideration to provide much more information beyond the textual descriptions. A screenshot contains many different widgets, and some widgets prompt the bug information. Therefore, the deep understanding of the screenshots mainly relies on the widgets. In DeepPrior, we use both CV technologies and deep learning (DL) technologies to extract all the widgets and analyze their information: DL technologies are powerful, and CV technologies can handle a larger variety of tasks [4].

Fig. 4. DeepPrior Framework

Problem Widget. An app activity (https://developer.android.com/reference/android/app/Activity) can be seen as an organized set of widgets (https://developer.android.com/reference/android/widget/package-summary). Generally speaking, in crowdsourced testing tasks, the bugs that crowdworkers can find are revealed through the widgets. Therefore, it is important to locate the widget that triggers the bug and distinguish this widget, which we define as the Problem Widget (W_P), from the other widgets. To distinguish the Problem Widget, we analyze the textual descriptions: in crowdsourced test reports, crowdworkers point out which widget is operated before the bug occurs. As shown in Section III-A2, we can extract the problem widget description from the textual descriptions, and to locate the Problem Widget, we adopt two different strategies for different situations (a localization sketch follows the list):
• If the extracted widget contains texts, we match the texts between the widget screenshots and the textual description. The matched widget is considered the Problem Widget.
• If there are no texts on the widgets or the text matching fails, we feed the extracted widgets into a deep neural network to identify a simple widget intention. The deep neural network is adapted from the research of Xiao et al. [5]. The model encodes the widget screenshot into a feature vector with a convolutional neural network (CNN). The output is a short text fragment decoded from the feature vector with a recurrent neural network (RNN), and the text fragment describes the widget intent.
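As referenced above, the following is a minimal sketch of the first, text-matching strategy, assuming each widget's on-screen text has already been recovered (e.g., by OCR) and the object phrase has been extracted from the bug description as in Section III-A2. The helper names, the whitespace tokenization (which would need word segmentation for Chinese text), and the overlap threshold are all illustrative assumptions, not the exact implementation.

```python
def token_overlap(description_phrase: str, widget_text: str) -> float:
    """Fraction of the description's tokens that also appear in the widget text."""
    desc = set(description_phrase.lower().split())
    widget = set(widget_text.lower().split())
    return len(desc & widget) / len(desc) if desc else 0.0


def locate_problem_widget(widgets, description_phrase, threshold=0.5):
    """Pick the widget whose recovered text best matches the object phrase
    extracted from the bug description. Returns None when text matching
    fails, in which case the widget-intent classifier (second strategy
    above) would be used as the fallback."""
    best, best_score = None, 0.0
    for widget in widgets:  # each widget assumed as {"text": str, "bbox": (...)}
        score = token_overlap(description_phrase, widget.get("text", ""))
        if score > best_score:
            best, best_score = widget, score
    return best if best_score >= threshold else None
```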
Context Widget. Besides the Problem Widget, the widget set representing the app screenshot also contains many more widgets that make up the context, which is also critical to deep image understanding. From the early-stage survey, we find it common for app activities to be entirely different even when the problem widget, the reproduction steps (the activity launching path), and the bug description are the same (like the motivating example in Section II-C). In such situations, the context widgets are significant for identifying the differences. Therefore, we collect the rest of the widgets as Context Widgets (W_C). For each context widget, we feed the widget screenshot into a convolutional neural network to identify its type, and the counts of the types form a 14-dimensional vector. The convolutional neural network is capable of identifying 14 different types of the most widely used widgets: Button (BTN), CheckBox (CHB), CheckTextView (CTV), EditText (EDT), ImageButton (IMB), ImageView (IMV), ProgressBarHorizontal (PBH), ProgressBarVertical (PBV), RadioButton (RBU), RatingBar (RBA), SeekBar (SKB), Switch (SWC), Spinner (SPN), and TextView (TXV); the short names are used in this paper for convenience. To train the neural network, we collect 36,573 widget screenshots that are evenly distributed over the 14 types. The ratio of the training set, validation set, and test set is 7:1:2, a common practice for an image classification task. The neural network is composed of multiple Convolutional layers, MaxPooling layers, and FullyConnected layers. The AdaDelta algorithm is used as the optimizer, and the model adopts the categorical_crossentropy loss function.
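The exact layer configuration is not specified in the paper, so the following Keras sketch is only one plausible instantiation of the described design (convolutional, max-pooling, and fully connected layers with the AdaDelta optimizer and categorical cross-entropy loss); the input resolution and filter counts are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# 14 widget types: BTN, CHB, CTV, EDT, IMB, IMV, PBH, PBV, RBU, RBA, SKB, SWC, SPN, TXV
NUM_WIDGET_TYPES = 14


def build_widget_classifier(input_shape=(64, 64, 3)):  # input size is an assumption
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),      # fully connected layer
        layers.Dense(NUM_WIDGET_TYPES, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adadelta(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```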
2) Features from Textual Descriptions: Besides app screenshots, textual descriptions can provide bug information more intuitively and directly, and they positively supplement the app screenshots. In DeepPrior, we adopt NLP technologies, specifically DL algorithms, to process the textual descriptions in the test reports.

In the textual description, crowdworkers are required to describe the bug in the screenshot and to provide the reproduction steps, i.e., the operation sequence from the app launch to the bug occurrence. However, on most crowdsourced testing platforms, the bug descriptions and the reproduction steps are mixed together, and crowdworkers are not required to obey specific patterns due to the great diversity of their professional capability [6]. Therefore, it is complex to distinguish bug descriptions from reproduction steps. To handle this problem, we adopt the TextCNN model [7].

The TextCNN model can complete sentence-level classification tasks with pre-trained word vectors. Before feeding the texts into the model, we pre-process the data. The textual descriptions of the test reports are segmented into sentences; we then use the jieba library (https://github.com/fxsjy/jieba) to segment the sentences into words and filter out the stop words according to a stop word list (https://github.com/goto456/stopwords). After the pre-processing, we feed the texts into a WordEmbedding layer. In this layer, the texts are transformed into 128-dimensional vectors using a Word2Vec model [8]. Afterwards, we adopt several Convolutional layers and MaxPooling layers to extract the text features. In the last layer, we use the SoftMax activation function and obtain the probability of each sentence being a bug description or a reproduction step. Finally, we merge all the sentences classified as bug descriptions or as reproduction steps, respectively. To train the TextCNN model, we form a large-scale text classification dataset composed of 2,252 bug descriptions and 2,088 reproduction steps. We set the ratio of the training set, validation set, and test set as 6:2:2, following the common practice.
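To make this pipeline concrete, here is a minimal TextCNN sketch in the spirit of Kim [7] with jieba-based pre-processing; the vocabulary size, sequence length, and filter windows are assumptions, and in the actual approach the embedding layer would be initialized from the pre-trained Word2Vec vectors rather than learned from scratch.

```python
import jieba
from tensorflow import keras
from tensorflow.keras import layers

STOP_WORDS = set()  # loaded from the stop word list in practice


def tokenize(sentence: str):
    """Segment a Chinese sentence with jieba and drop stop words."""
    return [w for w in jieba.lcut(sentence) if w not in STOP_WORDS]


def build_textcnn(vocab_size=20000, seq_len=50, embed_dim=128):
    inp = keras.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inp)  # Word2Vec-initialized in practice
    # Parallel convolutions over different n-gram windows, as in TextCNN.
    pooled = []
    for window in (3, 4, 5):
        conv = layers.Conv1D(100, window, activation="relu")(x)
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    x = layers.Concatenate()(pooled)
    out = layers.Dense(2, activation="softmax")(x)  # bug description vs. reproduction step
    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```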
Bug Description. Bug descriptions usually take the form of a short sentence. Therefore, we represent the sentence with a vector, which is also encoded using the Word2Vec model. Most bug descriptions follow specific patterns, like "apply SOME operation on SOME widget, and SOME unexpected behavior happens", so even if the specific words vary, it is effective to extract such a feature.

One more important process is to extract the description of the problem widget in order to help localize it. To achieve this goal, we use text segmentation algorithms based on Hidden Markov Models (HMMs) [9] and analyze the part-of-speech of each part of the bug description after text segmentation. Then, we extract the object components as the basis for problem widget localization; such object components of the sentences are the widgets that trigger the bugs. After acquiring the objects, we use the strategies introduced above to localize the problem widget.
Reproduction Step. In addition to the bug description, another significant part of the textual description is the reproduction steps. A reproduction step sequence is a series of operations describing the user's actions from the app launch to the bug occurrence. Sentences classified into the reproduction step class are processed in their initial order in the reports. We use the same NLP algorithms to perform text segmentation and analyze the part-of-speech of each text segment of each sentence. Then, the action part and the object part are collected to form an "action-object" pair, and the pairs are concatenated into an "action-object" sequence. Besides the action words and the objects, we add complementary information for some specific operations. For example, if one operation is a typing action, we add the input content as supplementary information, because different test inputs can lead to different consequences and direct the app to different activities. Finally, after this formalization, we obtain the Reproduction Step feature from the textual descriptions. A sketch of this extraction follows.
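The following is a minimal sketch of the "action-object" extraction with jieba's part-of-speech tagger; taking the first verb as the action and the next noun as the object is a simplifying assumption standing in for the segmentation-and-POS analysis described above.

```python
import jieba.posseg as pseg


def extract_action_object(sentence: str):
    """Return an (action, object) pair from one reproduction-step sentence:
    the first verb ('v*') as the action and the first following noun ('n*')
    as the object. A simplified heuristic, not the exact implementation."""
    action, obj = None, None
    for word, flag in pseg.cut(sentence):
        if action is None and flag.startswith("v"):
            action = word
        elif action is not None and flag.startswith("n"):
            obj = word
            break
    return (action, obj)


def build_operation_track(step_sentences):
    """Concatenate per-sentence pairs into the 'action-object' sequence."""
    return [extract_action_object(s) for s in step_sentences]
```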
B. Feature Aggregation
After acquiring all the features from both the app screenshots and the textual descriptions, we aggregate them into two feature categories: the Bug Feature (BFT) and the Context Feature (CFT). The Bug Feature refers to the features that directly reflect or describe the bug in the crowdsourced test report, while the Context Feature is assembled from the features that depict the environment of the bug occurrence.

1) Bug Feature (BFT): The Bug Feature directly provides information about the bugs. Since a crowdsourced test report is composed of an app screenshot and a textual description, both components contain critical information about the occurring bug. From the app screenshot, we extract the Problem Widget, which is a widget screenshot; DeepPrior can extract such information automatically. In the textual description, the Bug Description part directly describes the bug. Therefore, with a balanced consideration of the app screenshot and the textual description, we aggregate the Problem Widget and the Bug Description into the Bug Feature.

2) Context Feature (CFT): The Context Feature includes the features that construct a thorough context for the bug occurrence. From the app screenshot, the Context Widget consists of all the widgets except the Problem Widget. From the textual description, the Reproduction Step information is taken into consideration because it provides the full operation path from the app launch to the bug occurrence, and it can help identify whether the bugs of two test reports are on the same app activity. Therefore, the Context Widget and the Reproduction Step are aggregated to form the Context Feature.
3) Feature Aggregation: With the Bug Feature and the Context Feature, we aggregate all the obtained features from both the app screenshots and the textual descriptions of the crowdsourced test reports into the final DeepFeature. We perform deep screenshot understanding on the app screenshots instead of directly transforming them into simple feature vectors, and we achieve a tighter combination between app screenshots and textual descriptions. Moreover, we take the app screenshots and textual descriptions as a whole and divide them according to their roles in reflecting the bug. The Bug Feature is undoubtedly important, and we hold that the Context Feature also plays a crucial role in crowdsourced test report prioritization, because the calculation of the bug similarity relies heavily on the whole context.

C. DeepSimilarity Calculation
To prioritize the crowdsourced test reports, one significant step is to calculate the similarity among all the reports. Because we are the first to introduce deep screenshot understanding into report prioritization, we name this similarity DeepSimilarity. Following the common practice of merging different features in previous studies [2] [3], we calculate the DeepSimilarity of the different features separately and allocate different weights to the results. The formal expression is as follows (Sim is short for similarity):

DeepSimilarity = γ · Sim_BFT + (1 − γ) · Sim_CFT    (1a)
Sim_BFT = α · Sim_WP + (1 − α) · Sim_P    (1b)
Sim_CFT = β · Sim_WC + (1 − β) · Sim_R    (1c)

1) Bug Feature: We calculate the DeepSimilarity of the Problem Widget and of the Bug Description separately and merge them with the weight parameter α.
Problem Widget. The Problem Widget is a widget screenshot extracted from the app screenshot according to the strategies introduced in Section III-A1. To calculate the DeepSimilarity of the problem widgets, we extract the image features of the widget screenshots with the SIFT (Scale-Invariant Feature Transform) algorithm [10], so each widget is represented by a feature point set. The SIFT algorithm has the advantage of handling images with different sizes, positions, and rotation angles, a common phenomenon in an era when mobile devices come in hundreds of thousands of different models. To compare and match the problem widgets from different crowdsourced test reports, we use the FLANN library [11] (https://github.com/mariusmuja/flann). The calculation yields a score ranging from 0 to 1, where 0 means completely different and 1 means completely the same. This score is the DeepSimilarity of the Problem Widget.
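A minimal OpenCV sketch of this matching step follows (assuming an OpenCV build that provides SIFT); the KD-tree parameters, the ratio-test threshold, and the normalization of the match count into [0, 1] are illustrative choices rather than the paper's exact settings.

```python
import cv2


def problem_widget_similarity(img_a, img_b, ratio=0.7):
    """Match SIFT feature points between two widget screenshots with a
    FLANN-based matcher and Lowe's ratio test, then normalize the number
    of good matches into a [0, 1] score."""
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0.0  # no feature points found on one of the widgets
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),  # KD-tree index
                                  dict(checks=50))
    matches = flann.knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) / max(len(kp_a), len(kp_b))
```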
Bug Description. The Bug Description is a short sentence briefly describing the bug in the crowdsourced test report. Therefore, we use NLP technologies to encode the bug descriptions. Following previous studies, we use the Word2Vec model as the encoder. To improve the performance of the Word2Vec model, we construct a test report keyword database containing 8,647 keywords related to software testing, mobile apps, and test reports, including labeled synonyms, antonyms, and polysemies. The encoded bug description is a 100-dimensional vector. Afterward, also referring to previous studies like [2] [3], we adopt the widely used Euclidean metric to calculate the DeepSimilarity of the bug descriptions of different test reports in pairs. To unify values of different scales, we normalize each result x to the [0, 1] interval with the function (x − min) / (max − min), where max is the maximum and min is the minimum of all results.
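A small gensim-based sketch of this step follows; averaging the word vectors into a sentence vector is an assumption (the paper only states a 100-dimensional encoding), while the min-max normalization follows the formula above.

```python
import numpy as np
from gensim.models import Word2Vec


def sentence_vector(model: Word2Vec, tokens):
    """Encode a segmented bug description as the mean of its word vectors
    (averaging is an assumption, not the paper's stated method)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)


def normalized_distances(vectors):
    """Pairwise Euclidean distances, min-max normalized to [0, 1]
    with (x - min) / (max - min), as in the paper."""
    n = len(vectors)
    d = np.array([[np.linalg.norm(vectors[i] - vectors[j]) for j in range(n)]
                  for i in range(n)])
    lo, hi = d.min(), d.max()
    return (d - lo) / (hi - lo) if hi > lo else np.zeros_like(d)
```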
2) Context Feature: We also calculate the DeepSimilarity of the Context Widget and of the Reproduction Step separately and merge them with the weight parameter β.

Context Widget. The Context Widget is also a very important component of the whole context of the occurring bug. To achieve a deep understanding of the app screenshots, specifically of the widgets on them, we use a convolutional neural network to identify the widget type of each extracted widget screenshot and form a vector containing the counts of the 14 widget types. Afterward, we use the Euclidean metric to calculate the distance between the acquired 14-dimensional vectors, thereby considering both the absolute number of widgets of each type and the overall widget distribution. The result, normalized to the range 0 to 1, is considered the Context Widget DeepSimilarity.
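A minimal sketch of the Context Widget comparison follows: the classified widget types are counted into the 14-dimensional vector and compared with the Euclidean metric. Mapping the distance to a (0, 1] similarity via 1 / (1 + d) is an assumption; the paper only states that the result is normalized to the 0-1 range.

```python
import numpy as np

WIDGET_TYPES = ["BTN", "CHB", "CTV", "EDT", "IMB", "IMV", "PBH",
                "PBV", "RBU", "RBA", "SKB", "SWC", "SPN", "TXV"]


def type_count_vector(predicted_types):
    """Count the classifier's predicted types into a 14-d vector."""
    vec = np.zeros(len(WIDGET_TYPES))
    for t in predicted_types:
        vec[WIDGET_TYPES.index(t)] += 1
    return vec


def context_widget_similarity(vec_a, vec_b):
    """Euclidean distance mapped to a (0, 1] similarity; this particular
    mapping is an illustrative assumption."""
    return 1.0 / (1.0 + np.linalg.norm(vec_a - vec_b))
```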
Reproduction Step. The Reproduction Step is transformed into an "action-object" sequence during feature extraction. To calculate the DeepSimilarity of the "action-object" sequences, we adopt the Dynamic Time Warping (DTW) algorithm [12] to process the sequences under comparison. DTW is most widely known from automatic speech recognition. In this paper, we adapt it to process the operation paths that trigger the bugs in the corresponding crowdsourced test reports. DTW can measure the similarity of temporal sequences, especially sequences that may vary in "speed"; the "speed" in our task refers to the situation where different user operations reach the same app activity through different paths. Compared with other track similarity algorithms, DTW has a better matching effect because it can process sequences of different lengths, which suits the "action-object" sequences.
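A minimal dynamic-programming DTW sketch over two "action-object" sequences follows; using an exact pair match as the local cost (0 or 1) is a simplifying assumption.

```python
import numpy as np


def dtw_distance(seq_a, seq_b):
    """Classic DTW over two 'action-object' sequences of possibly different
    lengths. Local cost: 0 for an identical (action, object) pair, else 1."""
    n, m = len(seq_a), len(seq_b)
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            dp[i, j] = cost + min(dp[i - 1, j],      # insertion
                                  dp[i, j - 1],      # deletion
                                  dp[i - 1, j - 1])  # match
    return dp[n, m]


# Usage: dtw_distance([("tap", "login"), ("type", "password")],
#                     [("tap", "login"), ("tap", "submit")])
```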
D. Report Prioritization
After aggregating the DeepFeature and defining the DeepSimilarity calculation rule, we start to prioritize the crowdsourced test reports. First, we construct two empty report pools: the unprioritized report pool and the prioritized report pool. All the crowdsourced test reports are initially put into the unprioritized report pool.

Different from the strategy adopted in [3], where a report is randomly chosen as the initial report, we think all reports should be treated equally, since a randomly chosen report is likely to affect the final prioritization. Therefore, to formalize and unify the prioritization algorithm, we introduce the concept of the NULL Report, which also contains the four features.

• Problem Widget: the screenshot of the problem widget is essentially a 3-dimensional matrix representing the width, the height, and the three color channels. Therefore, we construct the problem widget of the NULL Report as a zero matrix whose width and height are set to the average size over all the actual crowdsourced test reports. Intuitively speaking, it is an all-black image.
• Bug Description: the bug description of the NULL Report is directly set to an empty string. Since the string length is 0, it contains no words, so after the Word2Vec processing, the feature vector is a 100-dimensional vector of all zeros.
• Context Widget: for the context widget of the NULL Report, we directly construct the vector representing the counts of the 14 different widget types with all elements set to 0. This represents that there are "no" widgets on the app screenshot of the report.
• Reproduction Step: the reproduction step of the NULL Report is also set to an empty string, and the "action-object" sequence has a length of 0.

The primary consensus for prioritization is to reveal all the bugs as early as possible under the circumstance that some reports describe the same problems repetitiously [3] [13] [14]. Therefore, it is important to present as many reports describing different bugs as possible to the developers early.
Based on this idea, we design our prioritization strategy as follows; the formal expression is presented in Algorithm 1.

Algorithm 1 Crowdsourced Test Report Prioritization
Input: Crowdsourced Test Report Set R_initial
Output: Prioritized Crowdsourced Test Report Set P
  initiate unprioritized report pool U ← R_initial
  initiate prioritized report pool P = ∅
  initiate target report r_t
  initiate NULL Report r_null
  P.append(r_null)
  while |U| ≠ 0 do
    initiate similarity = 2
    for each r ∈ U do
      initiate similarity_r = 2
      for each r_p ∈ P do
        Sim_BFT = α · Sim_WP + (1 − α) · Sim_P
        Sim_CFT = β · Sim_WC + (1 − β) · Sim_R
        calSemSim(r, r_p) = γ · Sim_BFT + (1 − γ) · Sim_CFT
        if calSemSim(r, r_p) < similarity_r then
          similarity_r = calSemSim(r, r_p)
        end if
      end for
      if similarity_r < similarity then
        r_t = r
        similarity = similarity_r
      end if
    end for
    P.append(r_t)
    U.remove(r_t)
  end while
  return P
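A direct Python transcription of Algorithm 1 follows, assuming a deep_similarity(r, r_p) function that implements Equation (1) and a make_null_report() constructor that follows the NULL Report rules above; both names are hypothetical.

```python
def prioritize(reports, deep_similarity, make_null_report):
    """Greedy prioritization: repeatedly move the unprioritized report with
    the lowest minimum DeepSimilarity into the prioritized pool."""
    unprioritized = list(reports)
    prioritized = [make_null_report()]  # seed the pool with the NULL Report
    while unprioritized:
        best_report, best_score = None, float("inf")
        for r in unprioritized:
            # minimum DeepSimilarity of r over the whole prioritized pool
            min_sim = min(deep_similarity(r, rp) for rp in prioritized)
            if min_sim < best_score:
                best_report, best_score = r, min_sim
        prioritized.append(best_report)
        unprioritized.remove(best_report)
    return prioritized[1:]  # drop the NULL Report from the returned order
```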
First, we construct the NULL Report according to the rules mentioned above and append it to the empty prioritized report pool. Then follows an iterative process. We calculate the DeepSimilarity of each unprioritized report to the whole prioritized report pool, defined as the minimum DeepSimilarity between the unprioritized report and each report in the prioritized report pool. The report with the lowest DeepSimilarity to the prioritized report pool is moved into the prioritized report pool.

IV. EVALUATION
A. Experimental Setting
To evaluate the proposed DeepPrior, we design an empirical experiment. For the experiment, we collect 536 crowdsourced test reports from 10 different mobile apps (details in TABLE I). The apps are labeled A1 to A10, and the number of test reports per app ranges from 10 to 152. We also invite software testing experts to manually classify the test reports according to the bugs they describe; the average number of reports in a bug category is 8.06.

TABLE I: Experiment Apps (App No., App Category, Reports)

Based on the crowdsourced test report dataset, we build three specific datasets to better support the evaluation, including a large-scale widget image dataset, a large-scale test report keyword set, and a large-scale text classification dataset. All 4 datasets build up the integrated dataset group.

In total, we design three research questions (RQs) to evaluate the proposed test report prioritization approach, DeepPrior.
• RQ1: How effectively can DeepPrior identify the types of the widgets extracted from the app screenshots?
• RQ2: How effectively can DeepPrior classify the textual descriptions from the crowdsourced test reports?
• RQ3: How effectively can DeepPrior prioritize the crowdsourced test reports?
B. RQ1: Widget Type Classification
The first research question evaluates the effectiveness of our processing of the app screenshots. The most important component of app screenshot processing is widget extraction and classification. Therefore, we evaluate the accuracy of the widget type classification CNN. 36,573 widget images are collected from real-world apps, evenly distributed over the 14 categories.

The details of the CNN are presented in Section III-A1. The dataset is divided by the ratio 7:2:1 into the training set, the validation set, and the test set, as is common practice. After the CNN model is trained, we evaluate the accuracy on the test set. The overall accuracy of widget type classification reaches 89.98%. Specifically, we use precision, recall, and F-Measure to evaluate the network. The formulas are as follows, where TP means true positives, FP means false positives, TN means true negatives, and FN means false negatives:

precision = TP / (TP + FP),  recall = TP / (TP + FN)    (2)

F-Measure = 2 / (1/precision + 1/recall)    (3)

The evaluation results can be seen in TABLE II. The precision reaches an average of 90.05%; the lowest precision is 74.36% and the highest is 99.81%. For recall, which measures the proportion of relevant instances that are actually retrieved, the average value is 89.98%, and the recall values range from 70.83% to 100.00%. F-Measure is the harmonic mean of precision and recall, and it reaches an average of 89.92%. The above results reflect the outstanding capability of the proposed classifier.
TABLE II: Widget Type Classification

Widget Type | precision | recall | F-Measure
BTN | 82.30% | 76.53% | 79.31%
CHB | 96.35% | 94.89% | 95.61%
CTV | 93.27% | 92.73% | 93.00%
EDT | 74.66% | 87.81% | 80.70%
IMB | 76.73% | 85.26% | 80.77%
IMV | 74.36% | 72.01% | 73.17%
PBH | 98.65% | 93.78% | 96.15%
PBV | 94.35% | 99.81% | 97.00%
RBU | 94.17% | 93.45% | 93.81%
RBA | 99.05% | 99.81% | 99.43%
SWC | 98.85% | 97.73% | 98.29%
SKB | 99.23% | 94.89% | 97.01%
SPN | 99.81% | 100.00% | 99.90%
TXV | 78.95% | 70.83% | 74.67%
Average | 90.05% | 89.97% | 89.92%
We also look into the results in depth. We find two groups of widgets that are easily confused. The first group includes ImageButton and ImageView. It is easy to understand that, from a visual perspective, these two types can hardly be distinguished. The only difference is that an ImageButton can trigger an action while an ImageView is a simple image. However, in app design, developers can add a hyperlink to an ImageView widget to achieve an equivalent effect. The second group includes Button, EditText, and TextView. These three widgets are all fixed areas containing a text fragment, which are visually similar and hard to distinguish even for humans. Moreover, some special renderings make the widgets even harder to identify. According to our survey, these two confusing groups do not have much effect, since the widgets within each group can be treated as equivalent from both the visual and the functional perspective.

Results for RQ1: The overall accuracy of the CNN in classifying the widget types reaches 89.98%. Over the specific types, the average precision is 90.05%, the lowest precision is 74.36%, and the average F-Measure is 89.92%. Also, according to our survey on real test reports, even the types with low precision do not negatively affect DeepPrior, given their visual and functional equivalence within the confused groups.

C. RQ2: Textual Description Classification
In processing the textual descriptions, we classify them into two categories: bug description and reproduction step. The two kinds of textual descriptions are considered different report features. To classify the textual descriptions, we segment them into sentences. Then, we feed the sentences into a TextCNN model to complete the task; the details are presented in Section III-A2. Also, to better train and evaluate the network, we build a large-scale text classification dataset. The dataset contains 4,340 labeled textual segments, including 2,252 bug descriptions and 2,088 reproduction steps. The dataset is divided into the training set, validation set, and test set at the ratio of 7:2:1.

TABLE III: Text Classification

Category | precision | recall | F-Measure
Bug Description | 98.46% | 97.81% | 98.13%
Reproduction Step | 97.95% | 98.53% | 98.24%
Average | 98.21% | 98.17% | 98.19%

The results on the test set can be seen in TABLE III, and the overall accuracy of the model reaches 96.65%. More specifically, we use the same measurements as in RQ1: precision, recall, and F-Measure. The average precision over the 2 types of textual description reaches 98.21%, the average recall reaches 98.17%, and the average F-Measure reaches 98.19%. The results are quite promising. We also manually check the textual descriptions and find that, compared with reproduction steps, bug descriptions tend to contain bug-related words, such as "crash", "flashback", "missing element", "wrong", "fail", and "no response", while the reproduction steps contain just the operations, target widgets, and the corresponding responses.
Results for RQ2: The overall accuracy of text classification reaches 96.65%, and the precision, recall, and F-Measure are all over 98%. Such results show DeepPrior's excellent capability of analyzing textual descriptions, which lays a solid foundation for crowdsourced test report prioritization.
D. RQ3: Crowdsourced Test Report Prioritization
In this research question, we evaluate the test report prioritization effectiveness of DeepPrior. The metric we use is APFD (Average Percentage of Fault Detected) [15], which was also used by Feng et al. to prioritize crowdsourced test reports [3]. In the formula, T_fi is the index of the report that first reveals bug i, n is the total number of reports, and M is the total number of revealed bugs (a small computation sketch follows the strategy list below):

APFD = 1 − (Σ_{i=1}^{M} T_fi) / (n × M) + 1 / (2n)    (4)

To better illustrate the advantage of DeepPrior, we compare DeepPrior with the following prioritization strategies:
• IDEAL: This strategy is the theoretically best prioritization, meaning that developers can review all the bugs revealed by the reports in the shortest time.
• IMAGE: This strategy uses only the results of the deep image understanding of DeepPrior to rank the test reports, because deep image understanding is a significant part of our research.
• BDDIV: This strategy refers to the algorithm in Feng et al.'s work [3], which is also the state-of-the-art approach for crowdsourced test report prioritization.
• RANDOM: The RANDOM strategy refers to the situation without any prioritization strategy.
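As referenced above, here is a small sketch of the APFD computation from Equation (4), assuming the hypothetical argument first_detection_indices holds the 1-based position T_fi of the first report revealing each bug.

```python
def apfd(first_detection_indices, n_reports):
    """APFD = 1 - (sum of T_fi) / (n * M) + 1 / (2n), per Equation (4)."""
    m = len(first_detection_indices)  # M: number of revealed bugs
    return (1.0
            - sum(first_detection_indices) / (n_reports * m)
            + 1.0 / (2 * n_reports))
```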
TABLE IV: DeepPrior Report Prioritization Result and Comparison

App No. | IDEAL | DEEPPRIOR | BDDIV | DEEPPRIOR vs. BDDIV | Overhead Comparison | IMAGE | DEEPPRIOR vs. IMAGE | RANDOM | DEEPPRIOR vs. RANDOM
A1 | 0.974 | 0.927 | 0.914 | 1.36% | 68.75% | 0.926 | 0.09% | 0.805 | 15.15%
A2 | 0.931 | 0.839 | 0.822 | 2.10% | 33.40% | 0.805 | 4.29% | 0.655 | 28.07%
A3 | 0.957 | 0.865 | 0.827 | 4.65% | 35.49% | 0.721 | 20.00% | 0.692 | 25.00%
A4 | 0.980 | 0.933 | 0.941 | -0.87% | 83.29% | 0.845 | 10.42% | 0.794 | 17.51%
A5 | 0.967 | 0.898 | 0.894 | 0.43% | 56.14% | 0.892 | 0.64% | 0.751 | 19.52%
A6 | 0.942 | 0.827 | 0.765 | 8.04% | 41.89% | 0.827 | 0.00% | 0.619 | 33.54%
A7 | 1.000 | 1.000 | 1.000 | 0.00% | 32.18% | 1.000 | 0.00% | 0.750 | 33.33%
A8 | 0.941 | 0.850 | 0.826 | 2.97% | 34.38% | 0.858 | -0.86% | 0.694 | 22.61%
A9 | 0.958 | 0.931 | 0.875 | 6.35% | 23.97% | 0.847 | 9.84% | 0.681 | 36.73%
A10 | 0.938 | 0.863 | 0.763 | 13.11% | 35.12% | 0.854 | 0.98% | 0.621 | 38.93%
Average | 0.956 | 0.893 | 0.863 | 3.81% | 44.46% | 0.857 | 4.54% | 0.706 | 27.04%
BDD IV strategy, we run30 times and calculate the average value as the original paper[3]; and for R ANDOM strategy, we run 100 times to eliminatethe effect of the occasional circumstances.First, we compare D EEP P RIOR with R ANDOM strategy. Asshown in the TABLE IV, we find D EEP P RIOR outperforms R ANDOM strategy much, ranging from 15.15% to 38.93%,and the average improvement reaches 27.04%. This shows asuperiority of D EEP P RIOR .Then we compare the results of D EEP P RIOR with thesingle I MAGE strategy. The average improvement of D EEP -P RIOR is 4.54%, and in 2 apps (A3 and A4), D EEP P RIOR outperforms much. For A8, D EEP P RIOR is slightly weakerthan I MAGE strategy. We review the reports of A8 and findthat the textual descriptions are not well written, and cannotpositively help the prioritization of the reports. Generallyspeaking, the results prove the necessity of combining bothtext analysis and deep image understanding, and a singlestrategy will compensate each other’s drawbacks and improvethe prioritization accuracy.Also, we make a comparison between D EEP P RIOR and
BDD IV strategy, which is the state-of-the-art approach. Ac-cording to the experiment results, D EEP P RIOR outperforms
BDD IV with an average improvement of 3.81%. The im-provements in some apps are especially obvious. Moreover,we record the total time overhead from reading the reportcluster to the output of the prioritized reports. It shows that D EEP P RIOR uses less than half of the time of
BDD IV , whichshows great performance superiority.Another advantage of D EEP P RIOR over
BDD IV is that the D EEP P RIOR can output stable results while
BDD IV ’s resultswill float. According to the detailed results of BDD IV strategy(in the online package), we find that BDD IV is quite volatile.The improvements over baseline strategies vary amongdifferent apps, and some reasons account for this. First, the“report/category” rate for each app is different, so in thelimited activity set of an app, the recurring of the sameactivities become frequent. Second, different apps have variouscontents. For example, A1 is a kids’ education app, and itconsists of a large number of pictures, videos, variant texts. Such a situation makes it much more complex to extract usefultext information and have a thorough understanding of the appscreenshot. As a result, the matric would decrease. Results for RQ3 : D EEP P RIOR ’s capability of prioritizingcrowdsourced test reports is excellent, it outperforms the state-of-the-art approach,
BDD IV , with less than half the overhead.Also, the specific experiment of the I MAGE strategy shows theeffectiveness of our deep screenshot understanding algorithm.Compared with the state-of-the-art approach, D EEP P RIOR performs much more stable.
E. Threats to Validity
The categories of the apps in this experiment are limited. Our experimental apps cover eight different categories (according to app store taxonomy), so the coverage is limited. Moreover, because our deep screenshot understanding involves characterizing the layout of the app activity, DeepPrior is only suitable for analyzing apps with a grid layout or a list layout, and we limit our claims to apps with such layouts.

The enrollment of the crowdworkers is also uncontrolled. The crowdworkers' capability is uncontrolled, and low-quality reports may occur. However, even if the quality of a report is low, DeepPrior can identify the bug it describes if it actually contains one. If not, DeepPrior places the report in a category of its own, which does not affect the prioritization of the other reports.

The datasets we construct are in Chinese. The language of the datasets may be another threat, but NLP and OCR technologies are quite mature. If we replace the text processing engine with that of another language, the text processing will still be completed well and will not negatively impact DeepPrior. Moreover, the maturity of machine translation [16] also makes it robust to process cross-language textual information.

V. RELATED WORK
A. Crowdsourced Testing
Crowdsourced testing has become a mainstream testing strategy. It is significantly different from traditional testing: testing tasks are distributed to a large group of crowdworkers in different locations and with widely varying abilities. The most notable advantages of crowdsourced testing are the capability of simulating different usage conditions and the relatively low economic cost [17] [18]. However, the openness of crowdsourced testing leads to a large number of redundant reports, so the key problem is to improve the developers' efficiency in reviewing the test reports. Some studies start from selecting skillful crowdworkers to complete the tasks [19] [20] [21]. Such a strategy is effective, but it is still hard to control, because even skillful crowdworkers may loaf on the task. Therefore, we think it is more important to process the test reports themselves than to control other factors in crowdsourced testing. Liu et al. [22] and Yu [6] proposed approaches to automatically generate descriptions from screenshots for test reports, based on the consensus that app screenshots are easy to acquire while textual descriptions are hard for all crowdworkers to write. This idea inspired us to apply deep screenshot understanding to better prioritize test reports.
B. Crowdsourced Test Report Processing
Much research has been done on processing crowdsourced test reports to better help developers review the reports and fix bugs. Basic strategies include report classification, duplicate detection, and report prioritization. In this section, we present the related work for each of these strategies.

Banerjee et al. proposed FactorLCS [23], utilizing common sequence matching, which is effective on open bug tracking repositories. They also proposed a method [24] with a multi-label classifier to find the "primary" report of a cluster of highly similar reports. Similarly, Jiang et al. proposed TERFUR [14], a tool that clusters the test reports with NLP technologies and also filters out low-quality reports. Wang et al. [25] take the features of the crowdworkers into consideration as features of the test reports and then perform the clustering. Wang et al. also propose LOAF [26], which is the first to consider the operation steps and the result descriptions separately for report feature extraction.

More research has been done on detecting duplicate test reports. Sun et al. [27] use information retrieval models to detect duplicate bug reports more accurately. Sureka et al. [28] adopt a character n-gram based model for the duplicate detection task. Prifti et al. [29] conducted a survey on the test reports of large-scale open-source projects and proposed a method that concentrates the search for duplicate reports on specific portions of the whole repository. Sun et al. proposed a retrieval function, REP [30], to measure report similarity, including the similarity of non-textual fields like component, version, etc. Nguyen et al. introduced DBTM [31], a tool that utilizes both IR-based features and topic-based features and detects duplicate bug reports according to the technical issues they describe. Alipour et al. [32] performed a more comprehensive analysis of the test report context and improved the detection accuracy. Hindle et al. [33] make improvements by combining contextual quality attributes, architecture terms, and system-development topics to improve duplicate bug detection.

The above approaches, covering report classification and duplicate detection, choose part of the test reports to represent all of them. However, we hold that all the reports contain valuable information even if duplicates exist. Moreover, after duplicates are detected, developers still need to review the reports to carry the bug processing forward. Therefore, we think report prioritization is a better choice.

There is also much research on report prioritization. Zhou et al. introduced BugSim [34], which considers both textual and statistical features to rank the test reports. DRONE, proposed by Tian et al. [35], is a machine learning-based approach to predict the priority of test reports, considering different factors of the reports. Feng et al. proposed a series of approaches, DivRisk [36] and BDDiv [3], to prioritize test reports; they were the first to consider the screenshots in the test reports. Subsequently, Wang et al. [2] went further and explored a more sound approach to prioritize test reports with much more attention paid to the screenshots.

Among all the above studies, only a few, like [2] and [3], consider the app screenshots, which we think are a valuable source of features for processing the test reports. But these studies only treat the screenshots as simple matrices instead of meaningful content.
C. Deep Image Understanding
Image understanding is a hotspot issue in the computer vision (CV) field. In this section, we mainly focus on research utilizing image understanding in software testing. Lowe [10] proposed the SIFT algorithm, which matches feature points on target images and calculates their similarity, making use of a new kind of local image feature that is invariant to image changes, including translation, scaling, and rotation. Optical character recognition (OCR) is a widely used tool to recognize texts, which helps in understanding images through their rich textual information. Nguyen et al. [37] proposed REMAUI, which uses CV technologies to identify the widgets, texts, images, and even containers in an app screenshot. Moran et al. [38] proposed REDRAW based on REMAUI, which identifies the widgets more precisely and can automatically generate code for the app UI. Similarly, Chen et al. [39] proposed a tool to generate GUI skeletons from app screenshots with a combination of CV technologies and machine learning. Yu et al. [1] proposed a tool named LIRAT to record and replay mobile app test scripts across different platforms with a thorough understanding of the app screenshots.

VI. CONCLUSION
This paper proposes a crowdsourced test report prioritization approach, DeepPrior, based on deep screenshot understanding. DeepPrior transforms the app screenshots and textual descriptions into four different features: problem widget, context widget, bug description, and reproduction step. The features are then aggregated into the DeepFeature, which includes the Bug Feature and the Context Feature, according to their relativity to the bug. Afterwards, we calculate the DeepSimilarity based on the features. Finally, the reports are prioritized according to the DeepSimilarity with a preset rule. We also conducted an experiment to evaluate the proposed approach, and the results show that DeepPrior outperforms the state-of-the-art approach with less than half the overhead.

ACKNOWLEDGEMENT

This work is supported partially by the National Key R&D Program of China (2018AAA0102302), the National Natural Science Foundation of China (61802171, 61772014, 61690201), the Fundamental Research Funds for the Central Universities (14380021), and the National Undergraduate Training Program for Innovation and Entrepreneurship (202010284073Z).

REFERENCES
[1] S. Yu, C. Fang, Y. Feng, W. Zhao, and Z. Chen, "LIRAT: Layout and image recognition driving automated mobile testing of cross-platform," IEEE, 2019, pp. 1066–1069.
[2] J. Wang, M. Li, S. Wang, T. Menzies, and Q. Wang, "Images don't lie: Duplicate crowdtesting reports detection with screenshot information," Information and Software Technology, vol. 110, pp. 139–155, 2019.
[3] Y. Feng, J. A. Jones, Z. Chen, and C. Fang, "Multi-objective test report prioritization using image understanding," IEEE, 2016, pp. 202–213.
[4] N. O'Mahony, S. Campbell, A. Carvalho, S. Harapanahalli, G. V. Hernandez, L. Krpalkova, D. Riordan, and J. Walsh, "Deep learning vs. traditional computer vision," in Science and Information Conference. Springer, 2019, pp. 128–144.
[5] X. Xiao, X. Wang, Z. Cao, H. Wang, and P. Gao, "IconIntent: Automatic identification of sensitive UI widgets based on icon classification for Android apps," IEEE, 2019, pp. 257–268.
[6] S. Yu, "Crowdsourced report generation via bug screenshot understanding," IEEE, 2019, pp. 1277–1279.
[7] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[8] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[9] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[10] D. G. Lowe et al., "Object recognition from local scale-invariant features," in ICCV, vol. 99, no. 2, 1999, pp. 1150–1157.
[11] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," VISAPP (1), vol. 2, no. 331-340, p. 2, 2009.
[12] D. F. Silva and G. E. Batista, "Speeding up all-pairwise dynamic time warping matrix calculation," in Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM, 2016, pp. 837–845.
[13] T. Y. Chen, F.-C. Kuo, R. G. Merkel, and T. Tse, "Adaptive random testing: The ART of test case diversity," Journal of Systems and Software, vol. 83, no. 1, pp. 60–66, 2010.
[14] B. Jiang, Z. Zhang, W. K. Chan, and T. Tse, "Adaptive random test case prioritization," IEEE, 2009, pp. 233–244.
[15] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, "Prioritizing test cases for regression testing," IEEE Transactions on Software Engineering, vol. 27, no. 10, pp. 929–948, 2001.
[16] S. Karimi, F. Scholer, and A. Turpin, "Machine transliteration survey," ACM Computing Surveys (CSUR), vol. 43, no. 3, pp. 1–46, 2011.
[17] R. Gao, Y. Wang, Y. Feng, Z. Chen, and W. E. Wong, "Successes, challenges, and rethinking – an industrial investigation on crowdsourced mobile application testing," Empirical Software Engineering, vol. 24, no. 2, pp. 537–561, 2019.
[18] K. Mao, L. Capra, M. Harman, and Y. Jia, "A survey of the use of crowdsourcing in software engineering," Journal of Systems and Software, vol. 126, pp. 57–84, 2017.
[19] Q. Cui, S. Wang, J. Wang, Y. Hu, Q. Wang, and M. Li, "Multi-objective crowd worker selection in crowdsourced testing."
[20] Q. Cui, J. Wang, G. Yang, M. Xie, Q. Wang, and M. Li, "Who should be selected to perform a task in crowdsourced testing?" vol. 1. IEEE, 2017, pp. 75–84.
[21] M. Xie, Q. Wang, G. Yang, and M. Li, "Cocoon: Crowdsourced testing quality maximization under context coverage constraint," IEEE, 2017, pp. 316–327.
[22] D. Liu, X. Zhang, Y. Feng, and J. A. Jones, "Generating descriptions for screenshots to assist crowdsourced testing," IEEE, 2018, pp. 492–496.
[23] S. Banerjee, B. Cukic, and D. Adjeroh, "Automated duplicate bug report classification using subsequence matching," IEEE, 2012, pp. 74–81.
[24] S. Banerjee, Z. Syed, J. Helmick, and B. Cukic, "A fusion approach for classifying duplicate problem reports," IEEE, 2013, pp. 208–217.
[25] J. Wang, Q. Cui, Q. Wang, and S. Wang, "Towards effectively test report classification to assist crowdsourced testing," in Proceedings of the 10th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2016, pp. 1–10.
[26] J. Wang, S. Wang, Q. Cui, and Q. Wang, "Local-based active classification of test report to assist crowdsourced testing," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, 2016, pp. 190–201.
[27] C. Sun, D. Lo, X. Wang, J. Jiang, and S.-C. Khoo, "A discriminative model approach for accurate duplicate bug report retrieval," in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering – Volume 1, 2010, pp. 45–54.
[28] A. Sureka and P. Jalote, "Detecting duplicate bug report using character n-gram-based features," IEEE, 2010, pp. 366–374.
[29] T. Prifti, S. Banerjee, and B. Cukic, "Detecting bug duplicate reports through local references," in Proceedings of the 7th International Conference on Predictive Models in Software Engineering, 2011.
[30] C. Sun, D. Lo, S.-C. Khoo, and J. Jiang, "Towards more accurate retrieval of duplicate bug reports," IEEE, 2011, pp. 253–262.
[31] A. T. Nguyen, T. T. Nguyen, T. N. Nguyen, D. Lo, and C. Sun, "Duplicate bug report detection with a combination of information retrieval and topic modeling," IEEE, 2012, pp. 70–79.
[32] A. Alipour, A. Hindle, and E. Stroulia, "A contextual approach towards more accurate duplicate bug report detection," IEEE, 2013, pp. 183–192.
[33] A. Hindle, A. Alipour, and E. Stroulia, "A contextual approach towards more accurate duplicate bug report detection and ranking," Empirical Software Engineering, vol. 21, no. 2, pp. 368–410, 2016.
[34] J. Zhou and H. Zhang, "Learning to rank duplicate bug reports," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012, pp. 852–861.
[35] Y. Tian, D. Lo, and C. Sun, "DRONE: Predicting priority of reported bugs by multi-factor analysis," IEEE, 2013, pp. 200–209.
[36] Y. Feng, Z. Chen, J. A. Jones, C. Fang, and B. Xu, "Test report prioritization to assist crowdsourced testing," in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 225–236.
[37] T. A. Nguyen and C. Csallner, "Reverse engineering mobile application user interfaces with REMAUI (T)," in IEEE/ACM International Conference on Automated Software Engineering, 2016.
[38] K. Moran, C. Bernal-Cárdenas, M. Curcio, R. Bonett, and D. Poshyvanyk, "Machine learning-based prototyping of graphical user interfaces for mobile apps," arXiv preprint arXiv:1802.02312, 2018.
[39] C. Chen, T. Su, G. Meng, Z. Xing, and Y. Liu, "From UI design image to GUI skeleton: A neural machine translator to bootstrap mobile GUI implementation," in