Session-Based Recommender Systems for Action Selection in GUI Test Generation
"Session-Based Recommender Systems for Action Selection in GUI Test Generation" by Varun Nayak and Daniel Kraus, submitted to the 3rd IEEE Workshop on NEXt level of Test Automation (NEXTA) 2020. This is a preprint of the accepted version of this paper.

Varun Nayak*, Daniel Kraus†
ReTest GmbH
Haid-und-Neu-Straße 7
76131 Karlsruhe, Germany
Email: *[email protected], †[email protected]

Abstract—Test generation at the graphical user interface (GUI) level has proven to be an effective method to reveal faults. When doing so, a test generator has to repeatedly decide what action to execute given the current state of the system under test (SUT). This problem of action selection usually involves random choice, which is often referred to as monkey testing. Some approaches leverage other techniques to improve the overall effectiveness, but only a few try to create human-like actions, or even entire action sequences. We have built a novel session-based recommender system that can guide test generation. This allows us to mimic past user behavior, reaching states that require complex interactions. We present preliminary results from an empirical study, where we use GitHub as the SUT. These results show that recommender systems appear to be well-suited for action selection, and that the approach can significantly contribute to the improvement of GUI-based test generation.
Index Terms—Test generation, testing and debugging, information filtering.
I. INTRODUCTION
System tests through the graphical user interface (GUI) are important since they stimulate software from end to end, i.e., somewhat from a user's perspective down to persistence layers such as databases. When used wisely, they can be a powerful part of a testing strategy. However, such tests usually have a bad reputation because they tend to be "[...] brittle, expensive to write, and time consuming to run." [1] Both academia and the industry try to overcome these issues by automatically generating GUI tests; not just to free developers and testers from the burden of test creation and maintenance, but to reduce the overall costs, without compromises and at the pace required [2].

While generated tests cannot fully compensate for hand-crafted test cases, the future of testing is said to drastically increase the use of automation [3]. Nowadays, test generators already yield impressive results in a wide range of application areas. Sapienz [4], for example, found previously unknown crashes in an empirical study with Android apps from the Google Play store. Meanwhile, the former research project has been deployed at Facebook, where it is now used to automatically test the mobile apps of, e.g., Instagram, WhatsApp and Facebook itself [5].

When it comes to GUI-based test generation, a crucial part is to decide what action to execute next given the current state of the system under test (SUT). Many of today's approaches rely on random choice, a.k.a. monkey testing. This is sometimes combined with techniques like (meta-)heuristics or machine learning (ML) to improve the generated tests, for instance, ant colony optimization [6], genetic programming [7], ML-enhanced evolutionary computing [8], data mining [9], deep learning [10], Q-learning [11] or other reinforcement learning algorithms [12]. Yet, only a few actually focus on creating human-like sequences of actions, e.g., to allow a test generator to get behind "gate GUIs" [13] such as login screens or non-trivial forms.
And although random testing is effective in finding relevant faults, it tends to miss bugs that humans do reveal [14]. Therefore, generating sequences that mimic past user behavior might help to reduce this gap.

We propose a novel approach to action selection in GUI-based test generation by leveraging recommender systems. Recommender systems are a well-studied field, and they form the core of many successful businesses like Netflix [15] or YouTube [16], for which targeted recommendations are indispensable. We investigate a possible intersection between recommender systems and the problem of action selection by mapping GUI actions to items and sessions within a SUT to users. Provided an adequate amount of data, our approach is able to predict actions a user would likely perform in the current state. By using a session-based recommender system as our model, we not just suggest single actions, but sequences of actions. This allows a test generator to be guided through states that require complex user interactions.

First, we give a brief introduction to recommender systems and some advances relevant for this paper in Section II. Section III outlines our technical approach, where we describe the overall design and various implementation details. In Section IV, we conduct a first empirical study on top of GitHub by mixing real-world and synthetic data. Afterwards, we summarize our findings and report on our planned future work in Section V.

II. RECOMMENDER SYSTEMS
Receiving recommendations of different forms has become a part of our daily online experience in a variety of application domains such as e-commerce, social media and content streaming. Internally, such systems analyze the past behavior of individual users to detect patterns in data. On typical online sites, various types of user actions can be recorded, e.g., that a user views an item or makes a purchase. These recorded actions and the detected patterns are then used to provide recommendations to the user. In this context, the entity being recommended is called item, and the entity that receives the recommendation is referred to as the user.

The basic models for recommender systems work primarily with user-item interactions such as ratings or like/dislike, or attributes like user interests or item properties. Based on this, there are two main approaches to recommender systems: collaborative filtering and content-based filtering. Traditional techniques such as matrix factorization have treated user-item interactions as flat, matrix-structured data, often ignoring the temporal structure and order within the data [17]. Being able to predict a user's short-term interests in an online session is a highly relevant problem in practice, e.g., to adapt to item viewing and purchase activities in e-commerce. Within such application domains, the items have to be recommended in a certain order, or the recommendation of one item only makes sense after some other event has happened.
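To make the collaborative-filtering idea concrete, the following sketch (ours, not from the paper; the session data is invented for illustration) computes a simple item-to-item co-occurrence similarity and recommends items that frequently appear in the same sessions:

```python
from collections import defaultdict
from itertools import combinations

def item_similarity(sessions):
    """Count how often each pair of items co-occurs in the same session."""
    co_counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for a, b in combinations(set(session), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts

def recommend(co_counts, item, n=3):
    """Return the n items most often seen together with `item`, best first."""
    ranked = sorted(co_counts[item].items(), key=lambda kv: -kv[1])
    return [other for other, _ in ranked[:n]]

sessions = [["login", "search", "cart"],
            ["login", "search", "checkout"],
            ["search", "cart", "checkout"]]
sims = item_similarity(sessions)
print(recommend(sims, "search"))  # the items most often co-occurring with "search"
```

This flat co-occurrence view ignores the order of events within a session, which is precisely the limitation that session-based approaches address.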
Session-based recommender systems consider the information embedded in between sessions and treat sessions as the basic recommendation unit. A session could be a set of items or a collection of actions consumed in one event or in a particular period of time. When dealing with such sequential data, recurrent neural networks (RNNs) are heavily used [18]. But practical applications involve temporal dependencies spanning many time steps, where the network is often unable to propagate useful information from the output end of the model back to the layers near the input end, known as the vanishing gradient problem [19].

In 2016, Hidasi et al. [20] presented a gated recurrent unit (GRU)-powered RNN for session-based recommendations and called this method GRU4Rec. A GRU is a more elaborate model of an RNN that can deal with the aforementioned problem [21]. The gatings within these units essentially learn when and how much to update the hidden state of the unit. This enables more accurate recommendations for session-based data.

III. OUR TECHNICAL APPROACH
We design our approach on top of the work by Hidasi et al. as illustrated in Figure 1. We adopt their general network architecture, but specialize it for our purposes.

The input to our model is a batch of sessions where each session is encoded as a sequence of action IDs. Action IDs are derived from the targeted GUI element and the action performed on it. That is, two actions only produce the same ID if they target the same element with the same action (ignoring possible input data like text). An RNN layer is added next, which consists of GRU or long short-term memory (LSTM) units. (In our evaluation below, we explore multiple network types.) This layer is expected to learn the temporal patterns in the action-selection behavior. The following dense layer converts this information into a probability distribution over the given action IDs, the output of our model. Thus, we get a stream of ranked actions, based on the past behavior of actual users, that a test generator can choose from.
Fig. 1. General architecture of our network based on [20].
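A minimal sketch of such a network, written by us in Keras rather than taken from the authors' implementation (the vocabulary size of 200 action IDs, the embedding dimension of 32 and the 64 recurrent units are assumptions for illustration):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

NUM_ACTIONS = 200  # assumed size of the action-ID vocabulary

model = keras.Sequential([
    # Each session is a (zero-padded) sequence of integer action IDs.
    layers.Embedding(input_dim=NUM_ACTIONS, output_dim=32, mask_zero=True),
    # Recurrent layer; layers.LSTM(64) would be the LSTM variant.
    layers.GRU(64),
    # Dense softmax layer: a probability distribution over all action IDs.
    layers.Dense(NUM_ACTIONS, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A batch of two dummy sessions, zero-padded to length 5.
batch = np.array([[151, 1, 2, 3, 4],
                  [151, 4, 5, 3, 0]])
probs = model.predict(batch)  # shape (2, NUM_ACTIONS), each row sums to 1
```

Sorting each output row yields the ranked list of next actions a test generator can sample from.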
To implement our approach, we are using the GRU4Rec library based on [20], [22]. It adds several extensions to a sequence modeling architecture like ours, for example, session-parallel mini-batches, mini-batch-based output sampling and the use of a pairwise ranking loss function. Sequence-to-sequence models produce the result one item at a time, in other words, by solving a classification problem at each time step. According to Hidasi et al., pairwise ranking losses are expected to give a better performance with the given network setup. The loss function compares the ranks of pairs of a positive and a negative item and enforces that the rank of the positive item should be lower than that of the negative one. For more details on the basic network setup, please refer to the original paper(s). Next, we conduct a first empirical study and show how the network can be trained and how it performs.

IV. A FIRST EMPIRICAL EVALUATION
According to the World Quality Report 2018-19 [23], data is a main obstacle when it comes to the adoption of artificial intelligence (AI) in testing. The problem of data scarcity is an important factor since data is at the core of any ML project. This issue is especially challenging for young or small organizations, because only rarely are there cooperations with large enterprises where sufficient data is available.

As we were struggling with small data too, we have designed an extract, transform, load (ETL) pipeline to increase the amount of real-world data by adding synthetic data, so that we get a first impression of how our approach performs. We decided to target web applications (i.e., web-based GUIs), where we picked GitHub as the SUT; a popular code hosting and development platform. This setup allows insights based on (i) a mature and widely-used target platform, and (ii) a sufficiently complex and well-known SUT.

To create real-world data, we recorded our own user sessions on GitHub using the Selenium IDE, a record-and-playback tool available as a Chrome and Firefox extension. We exported these sessions as Java tests, where every test case represents a user session. Each exported test was executed with a custom Selenium WebDriver, which allows us to extract training data as CSV. Note that the Selenium IDE comes with the functionality that when the default locator (a particular GUI element property such as an ID, typically used in test scripts to locate elements) doesn't find an element, it will fall back to other available means. This fallback mechanism ensures that most recorded tests don't fail, e.g., due to recording inaccuracies. The exported Java code does not have this feature and only uses a single locator, which is why we had to manually fix many locators to avoid runtime failures. The poor code export quality when using the Selenium IDE is a major bottleneck in the proposed ETL pipeline that we aim to address as part of our future work (see Section V).

One of the key parts in the pre-processing step is the assignment of accurate action IDs. The tests exported via the Selenium IDE already contain the absolute XPath for each element; we combine this information with the web page the element appears on and the performed action type to derive the action ID. Table I illustrates some resulting data samples, where every session is of arbitrary length (the actions executed by the user) and represented by a sequence of action IDs.

TABLE I
REAL-WORLD DATA SAMPLES AFTER PRE-PROCESSING.

Session ID | Action ID sequence                | Timestamp
1          | (151, 1, 2, 3, 4)                 | 1568573073
2          | (151, 4, 5, 3, 1, 2, 3, 4)        | 1568573079
3          | (6, 7, 8, 9, 10, 11, 12, 2, 3, 4) | 1568573088
4          | (151, 4, 5, 3, 4, 1, 2, 3, 4)     | 1568573099
5          | (6, 109, 110, 2, 3, 4)            | 1568573362
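The derivation of action IDs can be sketched as follows (our illustration, not the authors' code; the interning scheme, the field names and the example pages are assumptions):

```python
# Map (page, element XPath, action type) triples to stable integer IDs.
# Input data like typed text is deliberately ignored, so two actions get
# the same ID iff they target the same element with the same action type.
_action_ids = {}

def action_id(page, xpath, action_type):
    key = (page, xpath, action_type)
    if key not in _action_ids:
        _action_ids[key] = len(_action_ids) + 1  # IDs start at 1
    return _action_ids[key]

a = action_id("github.com/login", "//input[@id='login_field']", "type")
b = action_id("github.com/login", "//input[@id='login_field']", "type")
c = action_id("github.com/login", "//input[@id='password']", "type")
print(a, b, c)  # 1 1 2
```

Repeating the same interaction on the same element thus reproduces the same ID, which is what makes the sequences in Table I comparable across sessions.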
Fig. 2. Action ID distribution within the recorded user sessions.
When it comes to synthetic data, Wu et al. [24] formalize the problem of generating such datasets using the maximum entropy principle for categorical data, which captures the characteristics of the underlying data. Figure 2 shows the distribution of action IDs within the recorded user sessions. As can be seen, some actions are more common across sessions. These frequent actions usually carry a deeper meaning and represent short-term user goals that correspond to common use cases like a login procedure. The assumption we made is that most sessions will have such recurring patterns in the recorded interactions, and in between a user will perform arbitrary actions. The synthetic sessions we generated still hold these properties, but have been mixed up with randomly created action IDs. In practice, click botnets may also create a considerable amount of traffic [25], so we additionally interjected random noise to represent spam and the like.

The resulting dataset is summarized in Table II. We split this data into a training set (80 %) and a test set (20 %) for evaluation.

TABLE II
SUMMARY OF THE USED DATASET.

Real-world sessions  |    Avg. no. of actions  |
Synthetic sessions   |    Min. no. of actions  |
Distinct actions     |    Max. no. of actions  |

In the context of recommender systems, we are most likely interested in recommending an item from the top-n list of items. Therefore, we calculate relevant metrics with regards to the first n actions instead of all actions. Precision at n is the proportion of recommended actions present in the top-n list that are relevant. Recall at n is the proportion of relevant actions found in the top-n recommendations. Mean reciprocal rank (MRR), which is important in cases where the order of recommendations matters, is the mean of the reciprocals of the rank from all queries. The reciprocal rank is set to zero if the rank is above n.

We further followed the practice of Quadrana et al. [26], where items from each session in the test set are grouped together to form a sequence, and each sequence is further split into the user profile and ground truth. The user profile is composed of the first event in the sequence that is fed into the system and used to compute recommendations. The ground truth is composed of the remainder of the sequence that is used for performance evaluation. Items are revealed incrementally, and the evaluation is performed after each new item. This helps to evaluate the recommendation quality in a setting where user profiles are revealed sequentially. Metrics are averaged over each sequence and then averaged over all sequences.
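For a single query, the three metrics can be sketched as follows (our illustration, not the evaluation code used in the study; the recommended and relevant action IDs are made up):

```python
def precision_at_n(recommended, relevant, n):
    """Fraction of the top-n recommendations that are relevant."""
    top = recommended[:n]
    return sum(1 for a in top if a in relevant) / len(top)

def recall_at_n(recommended, relevant, n):
    """Fraction of the relevant actions found in the top-n."""
    top = recommended[:n]
    return sum(1 for a in relevant if a in top) / len(relevant)

def reciprocal_rank_at_n(recommended, target, n):
    """1/rank of the target within the top-n, or 0 if it is ranked below n."""
    top = recommended[:n]
    return 1 / (top.index(target) + 1) if target in top else 0.0

recommended = [3, 151, 4, 2, 9]   # ranked action IDs from the model
relevant = {151, 4}               # ground-truth next actions
print(precision_at_n(recommended, relevant, 5))   # 0.4
print(recall_at_n(recommended, relevant, 5))      # 1.0
print(reciprocal_rank_at_n(recommended, 151, 5))  # 0.5
```

The MRR is then the mean of these reciprocal ranks over all queries.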
As a baseline, we used a simple k-nearest neighbor recommender based on an item-to-item similarity. In this setting, the similarity matrix is pre-computed from the available session data, i.e., actions that are often executed together in sessions are deemed to be similar. This similarity matrix is then used during the session to recommend the most similar actions to the one the user has currently performed. We compared this model to different GRU-based network types with different losses, as well as a custom model using LSTM and cross-entropy loss. The time to convergence depended on the loss function and the amount of data. The evaluation results are shown in Figure 3.

An observation we made is that the MRR stays roughly within a narrow range across all the evaluated recommendation list lengths. This indicates that the best relevant actions were retrieved between top-5 and top-10. The MRR, however, seems to saturate, which we assume is a consequence of the data deficit. The baseline performance being on par with the other models is most likely owing to the short average session length. Precision and recall both indicate the accuracy of the models. Precision for all models except the custom one is very high at the top-1 recommendation, then it continues to drop as the recall climbs. This is because, in order to recall everything, it is required to keep generating results which are not accurate, hence lowering the precision.

We also observed that the models did not significantly outperform the baseline in the top-5 recommendation and beyond. In all experiments, the best-choice model performed better by at most 18 %. This could possibly have to do with the used dataset, but the lack of data remains a threat to validity and adds some uncertainty. Apart from this, the models performed well. The recommended sequences reflect many of the recorded use cases, mastering also complex states.

V. CONCLUSION AND FUTURE WORK
We have built a novel prototype for action selection in GUI test generation using a session-based recommender system. We conducted a first empirical study on top of GitHub, for which we presented preliminary results. These results suggest that action selection, when seen as a sequence modeling task, can guide a test generator through states that require complex interactions by mimicking past user behavior.

Based on our current approach and findings, we identified several tasks for future work. First, there is an overall need for more real-world data in order to develop sophisticated models. Therefore, we strive for cooperations with owners of big web applications and other researchers. Second, a major bottleneck was the poor quality of the Selenium IDE code export. We plan to develop (i) a native Selenium IDE plugin to leverage fallback locators and (ii) a script for web application owners to extract anonymous usage traces. Third, adopting additional GRU4Rec extensions from [22] could improve the results. Moreover, hyperparameter optimization tools may be used to further improve the models' performance. Last and most importantly, the presented results are preliminary. To prove its effectiveness, the approach must be evaluated as part of a large-scale study using multiple, diverse SUTs, ideally in comparison to other test generators.

We believe that addressing these tasks can significantly contribute to the improvement of GUI-based test generation.

ACKNOWLEDGEMENTS

As part of the joint research project "Surili", this work is supported by a grant (no. 01IS17092A) from the German Federal Ministry of Education and Research.
REFERENCES

[1] M. Fowler. (May 2012). TestPyramid, [Online]. Available: https://martinfowler.com/bliki/TestPyramid.html.
[2] A. Walgude and S. Natarajan, "World quality report 2019–20," Paris, France, Tech. Rep., 2019.
[3] K. Wiklund and M. Wiklund, "The next level of test automation: What about the users?" in Proceedings of the 2018 International Conference on Software Testing, Verification, and Validation Workshops, ser. ICSTW '18, Västerås, Sweden: IEEE, 2018, pp. 159–162.
[4] K. Mao, M. Harman, and Y. Jia, "Sapienz: Multi-objective automated testing for Android applications," in Proceedings of the 25th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA '16, Saarbrücken, Germany: ACM, 2016, pp. 94–105.
[5] N. Alshahwan, X. Gao, M. Harman, Y. Jia, K. Mao, A. Mols, T. Tei, and I. Zorin, "Deploying search based software engineering with Sapienz at Facebook," in Proceedings of the 10th International Conference on Search-Based Software Engineering, ser. SSBSE '18, Montpellier, France: Springer, 2018, pp. 3–45.
[6] S. Bauersfeld, S. Wappler, and J. Wegener, "A metaheuristic approach to test sequence generation for applications with a GUI," in Proceedings of the 3rd International Conference on Search-Based Software Engineering, ser. SSBSE '11, Szeged, Hungary: Springer, 2011, pp. 173–187.
[7] A. I. Esparcia-Alcázar, F. Almenar, T. E. J. Vos, and U. Rueda, "Using genetic programming to evolve action selection rules in traversal-based automated software testing: Results obtained with the TESTAR tool," Memetic Computing, vol. 10, no. 3, pp. 257–265, Sep. 2018.
[8] D. Kraus, Machine learning and evolutionary computing for GUI-based regression testing, 2018. arXiv: 1802.03768.
[9] M. Ermuth and M. Pradel, "Monkey see, monkey do: Effective generation of GUI tests with inferred macro events," in Proceedings of the 25th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA '16, Saarbrücken, Germany: ACM, 2016, pp. 82–93.
[10] Y. Li, Z. Yang, Y. Guo, and X. Chen, A deep learning based approach to automated Android app testing, 2019. arXiv: 1901.02633.
[11] A. I. Esparcia-Alcázar, F. Almenar, M. Martínez, U. Rueda, and T. E. J. Vos, "Q-learning strategies for action selection in the TESTAR automated testing tool," in Proceedings of the 6th International Conference on Metaheuristics and Nature Inspired Computing, ser. META '16, Marrakech, Morocco, 2016, pp. 174–180.
[12] C. Degott, N. P. Borges Jr., and A. Zeller, "Learning user interface element interactions," in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA '19, Beijing, China: ACM, 2019, pp. 296–306.
[13] D. Amalfitano, V. Riccio, N. Amatucci, V. De Simone, and A. R. Fasolino, "Combining automated GUI exploration of Android apps with capture and replay through machine learning," Information and Software Technology, vol. 105, pp. 95–116, May 2019.
[14] I. Ciupa, B. Meyer, M. Oriol, and A. Pretschner, "Finding faults: Manual testing vs. random+ testing vs. user reports," in Proceedings of the 19th IEEE International Symposium on Software Reliability Engineering, ser. ISSRE '08, Seattle, WA, USA: IEEE Press, 2008, pp. 157–166.
[15] C. A. Gomez-Uribe and N. Hunt, "The Netflix recommender system: Algorithms, business value, and innovation," ACM Transactions on Management Information Systems, vol. 6, no. 4, pp. 13:1–13:19, Dec. 2015.
[16] P. Covington, J. Adams, and E. Sargin, "Deep neural networks for YouTube recommendations," in Proceedings of the 10th ACM Conference on Recommender Systems, ser. RecSys '16, Boston, MA, USA: ACM, 2016, pp. 191–198.
[17] S. Wang, L. Cao, and Y. Wang, A survey on session-based recommender systems, 2019. arXiv: 1902.04864.
[18] Z. C. Lipton, J. Berkowitz, and C. Elkan, A critical review of recurrent neural networks for sequence learning, 2015. arXiv: 1506.00019.
[19] Y. Bengio, P. Y. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, Mar. 1994.
[20] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, Session-based recommendations with recurrent neural networks, 2016. arXiv: 1511.06939.
[21] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, 2014. arXiv: 1409.1259.
[22] B. Hidasi and A. Karatzoglou, Recurrent neural networks with top-k gains for session-based recommendations, 2017. arXiv: 1706.03847.
[23] M. Buenen and A. Walgude, "World quality report 2018–19," Paris, France, Tech. Rep., 2018.
[24] H. Wu, Y. Ning, P. Chakraborty, J. Vreeken, N. Tatti, and N. Ramakrishnan, Generating realistic synthetic population datasets, 2016. arXiv: 1602.06844.
[25] S. Nagaraja and R. Shah, Clicktok: Click fraud detection using traffic analysis, 2019. arXiv: 1903.00733.
[26] M. Quadrana, P. Cremonesi, and D. Jannach, "Sequence-aware recommender systems,"