Context-Aware Target Apps Selection and Recommendation for Enhancing Personal Mobile Assistants
Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, W. Bruce Croft
MOHAMMAD ALIANNEJADI∗, University of Amsterdam, The Netherlands
HAMED ZAMANI, University of Massachusetts Amherst, USA
FABIO CRESTANI, Università della Svizzera italiana (USI), Switzerland
W. BRUCE CROFT, University of Massachusetts Amherst, USA

Users install many apps on their smartphones, raising issues related to information overload for users and resource management for devices. Moreover, the recent increase in the use of personal assistants has made mobile devices even more pervasive in users’ lives. This paper addresses two research problems that are vital for developing effective personal mobile assistants: target apps selection and recommendation. The former is the key component of a unified mobile search system: a system that addresses the users’ information needs for all the apps installed on their devices with a unified mode of access. The latter, instead, predicts the next apps that the users would want to launch. Here we focus on context-aware models to leverage the rich contextual information available to mobile devices. We design an in situ study to collect thousands of mobile queries enriched with mobile sensor data (now publicly available for research purposes). With the aid of this dataset, we study the user behavior in the context of these tasks and propose a family of context-aware neural models that take into account the sequential, temporal, and personal behavior of users. We study several state-of-the-art models and show that the proposed models significantly outperform the baselines.

CCS Concepts: •
Information systems → Retrieval on mobile devices; Query log analysis; Recommender systems; Personalization; Query intent; •
Human-centered computing → Personal digital assistants; Field studies.

ACM Reference Format:
Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. 2019. Context-Aware Target Apps Selection and Recommendation for Enhancing Personal Mobile Assistants. ACM Transactions on Information Systems 0, 0, Article 0 (2019), 30 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
INTRODUCTION

In recent years, the number of available apps on the mobile app market has been growing due to high demand from users, leading to over 3.5 million apps on the Google Play Store, for example. As a consequence, users now spend an average of over five hours a day using their smartphones, accessing a variety of applications. An average user, for example, installs over 96 different apps on their smartphone [10]. In addition, the emergence of intelligent assistants, such as Google Assistant, Microsoft Cortana, and Apple Siri, has made mobile devices even more pervasive. These assistants aim to enhance the capability and productivity of users by answering questions, performing actions in mobile apps, and improving the user experience while interacting with their mobile devices. Another goal is to provide users with a universal voice-based search interface; however, they still have a long way to go to provide a unified interface with the wide variety of apps installed on users’ mobile phones. The diversity of mobile apps makes it challenging to design a unified voice-based interface. However, given that users spend most of their time working within apps (rather than a browser), it is crucial to improve their cross-app information access experience.

In this paper, we aim to address two research problems that are crucial for the effective development of a personal mobile assistant: target apps selection and recommendation in mobile devices. Target apps selection is the key component towards achieving a unified mobile search system, i.e., a system that can address the users’ information needs not only from the Web, but also from all the apps installed on their devices.

∗Part of the work reported in this paper was done while Mohammad Aliannejadi was affiliated with the Università della Svizzera italiana (USI).

Fig. 1. A typical workflow of a unified mobile search framework.
We argued the need for a universal mobile search system in our previous work [6], where our experiments suggested that the existence of such a system would improve the users’ experience. Target apps recommendation, instead, predicts the next apps that the users would want to launch and interact with, which is equivalent to target apps selection with no query. A unified mobile search framework is depicted in Figure 1. As we see in the figure, with such a framework, the user could submit a query through the system, which would then identify the best target app(s) for the issued query. The system would then route the query to the identified target apps and display the results in an integrated interface. Thus, the first step towards designing a unified mobile search framework is identifying the target apps for a given query, which is the target apps selection task [6].
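For illustration, the workflow above can be sketched as a simple routing loop: score every installed app for the query, forward the query to the top-ranked apps, and merge their results into one list. The following Python sketch is only a toy illustration under assumed data structures; the term-overlap scorer and the app profiles are hypothetical placeholders, not the models proposed in this paper.

```python
# Toy sketch of a unified mobile search loop: select target apps for a
# query, route the query to them, and merge the returned results.
# The scoring heuristic and app profiles are hypothetical placeholders.

def score_app(query, app):
    """Toy relevance score: term overlap between the query and an app profile."""
    q_terms = set(query.lower().split())
    return len(q_terms & app["profile"])

def unified_search(query, apps, k=2):
    """Select up to k target apps, route the query, and merge their results."""
    ranked = sorted(apps, key=lambda a: score_app(query, a), reverse=True)
    targets = [a for a in ranked[:k] if score_app(query, a) > 0]
    results = []
    for app in targets:  # route the query to each selected target app
        results.extend(app["search"](query))
    return targets, results

apps = [
    {"name": "Spotify", "profile": {"music", "song", "hits", "playlist"},
     "search": lambda q: [f"Spotify result for {q!r}"]},
    {"name": "Maps", "profile": {"restaurant", "directions", "near"},
     "search": lambda q: [f"Maps result for {q!r}"]},
]
targets, results = unified_search("katy perry hits", apps)
print([a["name"] for a in targets])  # the selected target app(s)
```

In this sketch the selection step is a trivial term-overlap ranking; the paper’s actual models replace it with context-aware neural scoring.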
Target apps recommendation is also crucial in a mobile environment. It has attracted a great deal of attention in multiple research communities [12, 28, 49]. Among the various benefits and use cases discussed in the literature, we find the following two the most important: (i) assisting users in finding the right app for a given task they wish to perform; and (ii) helping the operating system manage its resources more efficiently. It is worth noting that both use cases potentially play essential roles in improving end users’ experience. The former reduces the users’ effort to find the right app among the various apps installed on their phone. The latter, on the other hand, can affect the users’ experience through smart resource management. For instance, a system could remove many background processes of apps that are not going to be used in the near future to avoid excessive battery usage. It can also be used to allocate the required resources to an app that is going to be launched by the user in the immediate future, providing a faster and smoother user experience. The combined use of a target apps recommendation system and a target apps selection system brings even more benefits. While app usage data can help a target apps selection model provide more accurate predictions, the submitted cross-app queries can also improve a recommendation system’s performance. For example, when a user is traveling, they would use travel and navigation apps more often. This could be considered as an indication of the current user’s information need to the system. Also, assume a user submits the query “Katy Perry hits” to Google.

http://flurrymobile.tumblr.com/post/157921590345/us-consumers-time-spent-on-mobile-crosses-5
The recommendation system could use this information in its prediction and recommend music apps. As mobile devices provide rich contextual information about users, previous studies [2, 32, 60] have tried to incorporate query context in various domains. In particular, query context is often defined as information provided by previous queries and their corresponding clickthrough data [57, 58], or situational context such as location and time [14, 29, 60]. However, as user interactions on a mobile device are mostly with apps, exploring app usage patterns reveals important information about the user contexts, information needs, and behavior. For instance, a user who starts spending time on travel-related apps, e.g., TripAdvisor, is likely to be planning a trip in the near future. Carrascal and Church [18] verified this claim by showing that people use certain categories of apps more intensely as they do mobile search. Modeling the latent relations between apps is of great importance because, while people use few apps on a regular basis, they tend to switch between apps multiple times [18]. In fact, previous studies have tried to address app usage prediction by modeling personal and contextual features [10], exploiting the context-dependency of app usage patterns [35], the sequential order of apps [59], and collaborative models [56]. However, our previous attempt to study unified mobile search through crowdsourcing did not capture users’ contexts in the data collection phase [6] because it was done in the phone’s browser, failing to record any contextual and sensor data related to the user location and activities. In addition, there are some other limitations. For example, we asked workers to complete a set of given search tasks, which obviously were not generated by their actual information needs, and thus the queries were likely different from their real search queries. In addition, not all of the workers completed their tasks on actual mobile devices, which affected their behavior.
Furthermore, the user behavior and queries could not be studied over a day-long or week-long continuous period. These limitations have motivated us to conduct the first in situ study of target apps selection for unified mobile search. This enables us to obtain clearer insights into the task. In particular, we are interested in studying the users’ behavior as they search for real-life information needs using their own mobile devices. Moreover, we studied the impact of contextual information on the apps they used for search. To this aim, we developed a simple open-source app, called uSearch, and used it to build an in situ collection of cross-app queries. Over a period of 12 weeks, we collected thousands of queries, which enables us to investigate various aspects of user behavior as users search for information in a cross-app search environment.

Using the collected data, we conducted an extensive data analysis, aiming to understand how users’ behavior varies across different apps while they search for their information needs. A key finding of our analysis is that users conduct the majority of their daily search tasks using specific apps, rather than Google. Among the various available contextual information, we focus on the users’ app usage statistics as their app usage context, leaving others for future work. This is motivated by the results of our analysis, in which we show that users often search on the apps that they use more frequently. Based on the insights we got from our data analysis, we propose a context-aware neural target apps selection model, called CNTAS. In addition, as we aimed to model the sequential app usage patterns while incorporating personal and temporal information, we propose a neural target apps recommendation model, called NeuSA, which is able to predict
the next apps that a user would launch at a certain time. The model learns complex behavioral patterns of users at different times of day by learning high-dimensional app representations, taking into account the sequence of previously-used apps.

In summary, the main contributions of this paper are:
• An in situ mobile search study for collecting thousands of real-life cross-app queries. We make the app, the collected search query data, and the annotated app usage data publicly available for research purposes.
• The first in situ analysis of cross-app queries and users’ behavior as they search with different apps. More specifically, we study different attributes of cross-app mobile queries with respect to their target apps, sessions, and contexts.
• A context-aware neural model for target apps selection.
• A personalized sequence-aware neural model for target apps recommendation.
• Outperforming baselines for both target apps selection and recommendation tasks.

Our analyses and experiments lead to new findings compared to previous studies, opening specific future directions in this research area.

This paper extends our previous work on in situ and context-aware target apps selection for unified mobile search [5]. We previously stressed the importance of incorporating contextual information in unified mobile search and studied the app usage statistics data to identify the user’s intent in submitting a query more accurately. We showed that considering which applications a person has used most in the past 24 hours is useful for improving the effectiveness of target apps selection. In this paper, we further explore the effect of the sequential app usage behavior of users for target apps recommendation. This is an ideal complement to our context-aware target apps selection model, as these two components constitute an important part of context-aware mobile computing [23].
In summary, this paper extends our previous work as follows:
• It presents a novel personalized time-aware target apps recommendation model, called NeuSA.
• It compares the performance of NeuSA to state-of-the-art target apps recommendation models.
• It describes the new dataset that we have collected and annotated for target apps recommendation, which we make publicly available for research purposes.
• It includes more analysis of the collected data and the experimental results.
• It provides more details on our proposed context-aware target apps selection model, CNTAS.

This paper demonstrates that both our proposed models are able to outperform the state of the art. Also, it provides new analysis and insights into the effect of context in both target apps selection and recommendation tasks. Finally, the joint analysis of context allows the reader to observe and compare the effectiveness of analyzing and incorporating user behavior data into the prediction.

The remainder of the paper is organized as follows. Section 2 provides a brief overview of the relevant studies in the literature. Section 3 elaborates on our effort for collecting the data, followed by Section 4, where we analyze the collected data in depth. Then, in Sections 5 and 6 we describe our two proposed models for context-aware target apps selection and recommendation, respectively. Section 7 then includes details on the experimental setup, followed by Section 8, discussing and analyzing the results. Finally, Section 9 concludes the paper and discusses possible future directions that stem from this work.

https://github.com/aliannejadi/uSearch
https://github.com/aliannejadi/istas
https://github.com/aliannejadi/LSApp

RELATED WORK
Our work is related to the areas of mobile IR, context-aware search, target apps recommendation, human interaction with mobile devices (mobile HCI), and proactive IR. Moreover, relevant related research has been carried out in the areas of federated search, aggregated search, and query classification. In the following, we briefly summarize the related research in each of these areas.

A mobile IR system aims at enabling users to carry out all the classical IR operations on a mobile device [23], as conventional Web-based approaches fail to satisfy users’ information needs on mobile devices [20]. Many researchers have tried to characterize the main differences in user behavior on different devices throughout the years. In fact, Song et al. [53] found significant differences between search patterns on iPhone, iPad, and desktop. Studying search queries is one of the main research topics in this area, as queries are one of the main elements of a search session. Kamvar and Baluja [31] conducted a large-scale mobile search query analysis, finding that mobile search topics were less diverse compared to desktop search queries. Analogously, Guy [26] and Crestani and Du [22] conducted comparative studies on mobile spoken and typed queries, showing that spoken queries are longer and closer to natural language. All these studies show significant changes in user behavior. The change of interaction mode, as well as the purpose and change of the information need, are among the reasons for this change [6]. Moreover, there have been studies on how mobile search sessions compare with desktop search sessions [12, 27, 28, 55]. van Berkel et al.
[55] did a comprehensive analysis of how various inactivity gaps can be used to define an app usage session on mobile devices, concluding that “researchers should ignore brief gaps in interaction.” Carrascal and Church [18] studied user interactions with respect to mobile apps and mobile search, finding that users’ interactions with apps impact search. Also, they found that mobile search sessions and app usage sessions have significant differences.

Given that mobile devices provide rich contextual information about users’ whereabouts, a large body of research has tried to study the effect of such information on users’ behavior. Church and Oliver [19] did a diary and interview study to understand users’ mobile Web behavior. Aliannejadi et al. [3] conducted a field study where the recruited participants completed various search tasks in predefined time slots. They found that the temporal context, as well as the user’s current activity mode (e.g., walking vs. sitting), influenced their perception of task difficulty and their overall search performance. Also, Sohn et al. [52] conducted a diary study in which they found that contextual features such as activity and time influence 72% of mobile information needs. This is a very important finding, as it implies that using such information can greatly impact system performance and user satisfaction. In fact, research on proactive IR mainly focuses on this fact [13, 49]. Shokouhi and Guo [49] analyzed user interactions with information cards and found that the usage patterns of proactive information cards depend on time, location, and the user’s reactive search history. Proactive IR is very useful in a mobile context, where the user has a limited attention span for the mobile device and the applications running on it. Similarly, Benetka et al. [13] studied how various types of activities affect users’ information needs. They showed that not only do information needs vary across activities, but they also change during an activity.
Our work follows a similar line, leveraging the changing context to determine the target apps for a given query. Other works focused on a more comprehensive comparison of user behavior, finding that information from user search sessions on different platforms can be used to improve performance [40]. It has also been shown that external information, such as online reviews, can be used to improve the performance of search on mobile devices [43]. Park et al. [42] inferred users’ implicit intentions from social media for the task of app recommendation. This last work
is closely related to our previous work [6], where we introduced the need for a unified mobile search framework as we collected cross-app queries through crowdsourcing. In contrast, in this work we collect real-life cross-app queries over a longer period with an in situ study design.

Research on unified mobile search has considerable overlap with federated search, aggregated search, and query classification. While federated search systems assume the environment to be uncooperative and the data to be homogeneous, aggregated search systems blend heterogeneous content from cooperative resources [9]. Target apps selection, on the other hand, assumes an uncooperative environment with heterogeneous content. Federated search has a long history in IR for Web search. In the case of uncooperative resources, Callan and Connell [15] proposed a query-based sampling approach to probe the resources. Markov and Crestani [39] carried out an extensive theoretical, qualitative, and quantitative analysis of different resource selection approaches for uncooperative resources. One could study probing for unified mobile search; however, we argue that apps could potentially communicate more cooperatively, depending on how the operating system would facilitate that. More recently, research on aggregated search has gained more attention. Aggregated search shares certain similarities with target apps selection in dealing with heterogeneous data [50]. However, research on aggregated search often enjoys fully cooperative resources, as the resources are usually different components of a bigger search engine. For example, Diaz [25] proposed modeling query dynamics to detect news queries for integrating the news vertical in the SERP. Research on query classification has also been of interest for a long time in the field of IR. Different strategies are used to assign a query to predefined categories.
As mobile users are constantly being distracted by external sources, their queries often vary a lot, and it is not easy to determine whether a query is related to the same information need that originated the previous query. Kang and Kim [33] defined three types of queries, each of which requires the search engine to behave differently. Shen et al. [48] introduced an intermediate taxonomy used to classify queries into specified target categories. Cao et al. [16] leveraged conditional random fields to incorporate users’ neighboring queries in a session as context. More recently, Zamani and Croft [62] studied word embedding vectors for the query classification task and proposed a formal model for query embedding estimation.

Predicting app usage has been studied for a long time in the field. Among the first works that tried to model app usage, Liao et al. [37] proposed an app widget where users would see a list of recommended apps. Their model predicted the list of apps based on the temporal usage profiles of users. Also, Huang et al. [30] studied different prediction models for this problem, including linear and Bayesian models, finding that contextual information, as well as sequential usage data, plays an important role in the accurate prediction of app usage. As smartphones kept evolving throughout these years, more data about various apps and users’ context became available. As a result, more research focused on studying the effect of such information, as well as incorporating it into prediction models. For instance, Lu et al. [38] studied the effect of location data and proposed a model that takes into account GPS data together with other contextual information. Baeza-Yates et al. [10] studied next app recommendation for an improved home screen usage experience, extracting a set of personal and contextual features in a more commercial setting. Lee et al.
[35] found that the usage probabilities of apps follow Zipf’s law, as opposed to “inter-running” and running times, which follow log-normal distributions. Wang et al. [56] modeled the apps following the idea of collaborative filtering, proposing a context-aware collaborative filtering model to unload and pre-load apps. Xu et al. [59] modeled sequential app usage using recurrent networks. Zhao et al. [63] proposed the AppUsage2Vec model, inspired by doc2vec. Their proposed architecture includes an app-attention mechanism and a dual-DNN layer.

As indicated in the literature, contextual and personal information have a great impact on predicting user behavior on mobile devices. Also, researchers in the areas of federated and aggregated search
have shown that contextual information plays an important role in improving performance. In this work, we explore various sources of contextual information for both tasks. We also explore the use of recent app usage data as an implicit source of contextual information for target apps selection and show that it indeed provides useful contextual information to the model. Moreover, we study the collected data for both tasks, aiming to shed more light on the tasks of target apps selection and recommendation.

Fig. 2. uSearch interface on LG Google Nexus 5, as well as the survey. Checkboxes are used to indicate the target app for a query.
In this section, we describe how we collected ISTAS (In SiTu collection of cross-App mobile Search), which is, to the best of our knowledge, the first in situ dataset of cross-app mobile search queries. We collected the data in 2018 by recruiting 255 participants. The participants installed a simple Android app, called uSearch, for at least 24 hours on their smartphones. We asked them to use uSearch to report their real-life cross-app queries as well as the corresponding target apps. We first describe the characteristics of uSearch. Then, we provide details on how we recruited participants, as well as how we instructed them to report queries through the app. Finally, we give details on how we checked the quality of the collected data.

To facilitate the query reporting procedure, we developed uSearch, an Android app shown in Figure 2. We chose the Android platform because, in comparison with iOS, it imposes fewer restrictions in terms of sensor data collection and background app activity.
User interface. As shown in Figure 2, uSearch consists of three sections. The upper part lists all the apps that are installed on the phone, with the most used apps ranked higher. The participants were supposed to select the app in which they had carried out their real-life search (e.g., Facebook). In the second section, the participants were supposed to enter exactly the same query that they had entered in the target app (e.g., Facebook). Finally, the lower part of the app provided them easy access to a unique ID of their device and an online survey on their demographics and backgrounds.
Collected data. Apart from the participants’ input data, we also collected their interactions within uSearch (i.e., taps and scrolling). Moreover, a background service collected the phone’s sensor data. We collected data from the following sensors: (i) GPS; (ii) accelerometer; (iii) gyroscope; (iv) ambient light; (v) WiFi; and (vi) cellular. Also, we collected other available phone data that can be used to better understand a user’s context. The additional collected data are as follows: (i) battery level; (ii) screen on/off events; (iii) app usage statistics; and (iv) app usage events. Note that app usage statistics indicate how often each app has been used in the past 24 hours, whereas app usage events provide more detailed app events. App usage events record user interactions in terms of: (i) launching a specific app; (ii) interacting with a launched app; (iii) closing a launched app; (iv) installing an app; and (v) uninstalling an app. The background service collected the data at a predefined time interval. The data was securely transferred to a cloud service.
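For illustration, an app usage event of the kind listed above can be represented as a small record. The field names and types below are our own hypothetical choices for exposition, not the actual schema used by uSearch.

```python
# Hypothetical record for one app usage event; field names are illustrative
# and do not reflect the actual uSearch log schema.
from dataclasses import dataclass

# The five event types described in the text.
EVENT_TYPES = {"launch", "interact", "close", "install", "uninstall"}

@dataclass
class AppUsageEvent:
    user_id: str
    app_package: str   # e.g., "com.facebook.katana"
    event_type: str    # one of EVENT_TYPES
    timestamp: float   # Unix time in seconds

    def __post_init__(self):
        # Reject event types outside the documented set.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")

event = AppUsageEvent("u042", "com.facebook.katana", "launch", 1_528_000_000.0)
print(event.event_type)  # prints "launch"
```

A stream of such records, ordered by timestamp, is what the per-query quality checks and the sequential models later in the paper operate on.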
We recruited participants through an open call on Amazon Mechanical Turk. The study received the approval of the ethics committee of the university. We provided a clear statement to the participants about the kind of data that we were collecting and the purpose of the study. Furthermore, we used secure encrypted servers to store users’ data. We asked the participants to complete a survey inside uSearch. Moreover, we listed all the steps the participants needed to follow in order to report a query. In short, we asked the participants to open uSearch after every search they did using any installed app on their phones. Then, we asked them to report the app as well as the query they used to perform their search task. We encouraged the participants to report their search as soon as it occurred, as it was crucial to capture their context at the right moment.

After running several pilot studies, over a period of 12 weeks we recruited 255 participants, asking them to let the app run on their smartphones for at least 24 hours and to report at least 5 queries. Since some people may not submit 5 search queries during a period of 24 hours, we asked them to keep the app running on their phones after the first 24 hours until they reported 5 queries. Also, we encouraged them to continue reporting more than 5 queries for an additional reward. As an incentive, we paid the participants $0.2 per query. We recruited participants only from English-speaking countries.
During the course of data collection, we performed daily quality checks on the collected data. The checks were done manually with the help of some data visualization tools that we developed. We visualized the use of selected apps in the participant’s app-usage history in a timeline, to validate a user’s claim when they reported using a specific app for their search. As we were paying participants a reward per query, we carefully studied the submitted queries as well as user interactions to prevent participants from reporting false queries. For each query, we checked the app usage statistics and events for the same day. If a participant reported a query in a specific app (e.g., Facebook) but we could not find any recent usage events for that app, we assumed that the query was falsely reported. Moreover, if a participant reported more than 10 queries per day, we took some extra quality measures into account. Finally, we approved 6,877 queries out of 7,750 reported queries.

https://developer.android.com/reference/android/app/usage/package-summary
http://mturk.com

Fig. 3. Number of queries and active participants per day, during the course of data collection (best viewed in color).

To prevent unwanted carrier charges, we limited data transfer to WiFi only. For this reason, we provided a very flexible implementation to manage the data in our app. In our app design, the data is stored locally as long as the device is not connected to a WiFi network. As soon as a WiFi connection is available, the app uploads the data to the cloud server. We made this point very clear in the instructions and asked the participants to take part in the study only if they had a strong WiFi connection at home or at the office.
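The per-query validation described earlier (matching each reported query against the same day’s app usage events) can be approximated by a simple rule: a reported query is accepted only if the usage log contains a sufficiently recent event for the reported app. The Python sketch below illustrates this rule; the one-hour window and the record format are assumptions for exposition, not the exact thresholds used in the study.

```python
# Sketch of the quality-check heuristic: a query report is plausible only if
# the reported app shows a usage event shortly before the report time.
# The one-hour window is an assumed value, not the study's actual threshold.

RECENT_WINDOW = 3600  # seconds; hypothetical

def is_plausible(report, usage_events, window=RECENT_WINDOW):
    """Accept a report if the reported app was used within `window` seconds before it."""
    return any(
        ev["app"] == report["app"]
        and 0 <= report["time"] - ev["time"] <= window
        for ev in usage_events
    )

events = [{"app": "Facebook", "time": 1000.0}, {"app": "Chrome", "time": 4000.0}]
print(is_plausible({"app": "Facebook", "time": 1500.0}, events))  # True
print(is_plausible({"app": "YouTube", "time": 1500.0}, events))   # False
```

In the actual study this check was one signal among several (manual inspection, timelines, and extra scrutiny for heavy reporters), not a fully automatic filter.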
Before asking for the required app permissions, we made a clear statement about our intentions regarding how we were going to use the participants’ collected data, as well as what was collected from their devices. We assured them that their data were stored on secure cloud servers and that they could opt out of the study at any time, in which case we would remove all their data from the servers. While granting app usage access was mandatory, granting location access was optional. We asked participants to allow uSearch to access their location only if they felt comfortable with that. Note that, through the background service, we did not collect any other data that could be used to identify participants.
In this section, we describe the basic characteristics of ISTAS, and present a thorough analysis oftarget apps, queries, sessions, and context.
ISTAS. During the period of 86 days, with the help of 255 participants, we collected 6,877 search queries and their target apps, as well as sensor and usage data. The collected raw data was over 300 gigabytes. Here, we summarize the main characteristics of the participants based on the submitted surveys. Over 59% of the participants were female. Nearly 50% of them were aged between 25 and 34, followed by 22% between 35 and 44, and 15% between 18 and 24. Participants had all kinds of educational backgrounds, ranging from a high school diploma to a PhD. In particular, 32% of them had a college degree, followed by 30% with a bachelor’s degree. The smartphone was the main device used for connecting to the Internet for 53% of the participants, followed by the laptop (25%). Among the participants, 67% used their smartphones more often for personal reasons than for work. Finally, half of the participants stated that they use their smartphones 4 hours a day or
Table 1. Statistics of ISTAS.
21) is due to the existence of some users who submitted numerous queries. Moreover, Figure 3 shows the number of queries and active participants per day during the data collection period. Note that, as shown in Figure 3, in the first half of the collection period we were mostly developing the visualization tools and did not recruit many participants.
LSApp.
We collected LSApp (Large dataset of Sequential mobile App usage) using the uSearch data collection tool during an eight-month period involving 292 users. Notice that 255 of the users were the same people involved in collecting ISTAS. The extra 37 participants were the ones that either did not submit any queries during this period, or submitted low-quality queries and were removed in the quality-check phase. Table 2 summarizes the statistics of LSApp. Since we observed many repeated app usage records with very short time differences (< 10 seconds), we considered all repeated app usage records with less than one minute time difference as one record. Also, as the app usage data includes various system apps, we filtered out background system packages and kept only the most popular apps in the data. We identify the most popular apps based on the data we collected in this dataset.

How apps are distributed.
Figure 4 shows how queries are distributed with respect to the top 20 apps. We see that the top 20 apps account for 88% of the searches in ISTAS, showing that the app distribution follows a power law. While Google and Chrome queries respectively attract 26% and 23% of the target apps, users conduct half (51%) of their search tasks using other apps. This finding is in line with what was shown in previous work [6], even though we observe a higher percentage of searches done using the Google and Chrome apps. In [6], we collected a dataset of cross-app queries called UniMobile under a different experimental setup, where we asked the participants to submit cross-app queries for given search tasks. Therefore, the differences in the collected data can be due to two reasons: (i) ISTAS is collected in situ and on mobile devices, thus being more realistic than UniMobile; (ii) ISTAS queries reflect real-life information needs rather than a set of given search tasks, hence the information need topics are more diverse than in UniMobile.

https://github.com/aliannejadi/uSearch

Table 2. Statistics of LSApp.

Fig. 4. Number of queries per app for top 20 apps in ISTAS.

Moreover, we observe a notable variety of apps among the top 20 apps, such as Spotify and Contacts. We also see Google Play Store among the top target apps. This suggests that people use their smartphones to search for a wide variety of information, most of which is sought with apps other than Google or Chrome. It should also be noted that users direct the majority of their information needs to various apps, even though there exists no unified mobile search system on their smartphones, suggesting that they might do an even smaller portion of their searches using Google or Chrome if a unified mobile search system were available on their smartphones.
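The top-k share reported above can be computed directly from a query log. A minimal sketch, using a hypothetical toy log rather than the ISTAS data:

```python
from collections import Counter

def top_k_share(target_apps, k=20):
    """Fraction of all queries whose target app is among the k most frequent apps."""
    counts = Counter(target_apps)
    top_k_total = sum(c for _, c in counts.most_common(k))
    return top_k_total / len(target_apps)

# Toy log: each entry is the target app of one query (counts are illustrative only).
log = ["Google"] * 26 + ["Chrome"] * 23 + ["YouTube"] * 3 + ["Spotify"] * 2 + ["Contacts"]
print(top_k_share(log, k=2))  # Google + Chrome cover 49/55 of this toy log
```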
How apps are selected.
Here, we analyze the behavior of the participants in ISTAS, as they searched for real-life information needs, in terms of the apps they chose for performing the search. Figure 5a shows the distribution of unique apps per user, that is, how many users selected a certain number of unique apps, with an average of 5.14 unique apps per user. Again, this indicates that users seek information in a diverse set of apps. It is worth noting that in Figure 5a we observe a totally different distribution compared to [6], where the average number of unique apps per user was much lower. We believe this difference is due to the fact that the participants in our work reported their real-life queries, as opposed to the crowdsourcing setup of [6].
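The per-user statistic above can be computed from the query log as follows; a minimal sketch over a hypothetical (user, target app) record layout:

```python
from collections import defaultdict

def avg_unique_apps(records):
    """records: iterable of (user_id, target_app) pairs.
    Returns the average number of unique target apps per user."""
    apps_per_user = defaultdict(set)
    for user, app in records:
        apps_per_user[user].add(app)
    return sum(len(apps) for apps in apps_per_user.values()) / len(apps_per_user)

records = [("u1", "Google"), ("u1", "Chrome"), ("u1", "Google"),
           ("u2", "Spotify"), ("u2", "Contacts"), ("u2", "Gmail")]
print(avg_unique_apps(records))  # (2 + 3) / 2 = 2.5
```

The same function applied to (session_id, target_app) pairs yields the per-session figure.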
Fig. 5. Distribution of unique apps (a) per user and (b) per session in ISTAS.

Table 3. Cross-app query attributes for 9 apps. The upper part of the table lists the distribution of the number of query terms as well as the mean query terms per app. The lower part lists the query overlap at different similarity thresholds (denoted by τ) per app. All shows query distributions across all apps.

Query overlap (τ)   All   Google  YouTube  Facebook  Amazon Sh.  Maps  Gmail  G. Play Store  Spotify  Contacts
> 0.25              56%   39%     41%      28%       27%         26%   27%    25%            8%       14%
> 0.50              19%   11%     15%      13%       7%          11%   12%    12%            4%       10%
> 0.75              13%   5%      8%       11%       5%          9%    12%    10%            2%       10%

On the other hand, Figure 5b plots the distribution of unique apps with respect to sessions, that is, how many unique apps were selected during a single search session. We see an entirely different distribution, where the average number of unique apps per task is 1.36. This shows that while users seek information using multiple apps, they are less open to switching between apps in a single session. This can partly be due to the fact that switching between apps is not very convenient. However, this behavior requires more investigation to be fully understood, which we leave for future work.
In order to understand the differences in user behavior when formulating information needs in different apps, we conducted an analysis of the attributes of the queries with respect to their target apps. First, we study the number of query terms per app for the top 9 apps in ISTAS.
How query length differs among apps.
The upper part of Table 3 lists the distribution of the number of query terms in the whole dataset (denoted by All) as well as in each app, along with the average number of query terms per app. As we can see, the overall average query length is 3.00, which is slightly lower than in previous studies on mobile query analysis [26, 31]. However, the average query length is higher for apps that deal with general web search, such as Google, while Contacts has the lowest average query length. Gmail and Google Play Store have an average query length lower than 2, as most of their searches are keyword-based (e.g., part of an email subject or an app name). This shows a clear behavioral difference in formulating queries for different apps. Moreover, we can see that the distribution of the number of query terms varies among apps; take Contacts as an example, whose single-term queries, which are often names of the user's personal contacts, constitute 81% of its query distribution. This indicates that the structure of queries varies across target apps. Studying the most frequent query unigrams of each app also confirms this finding. For example, Google's most popular unigrams are mostly stopwords (i.e., "to", "the", "of", "how"), whereas Facebook's most popular unigrams are not (i.e., "art", "eye", "wicked", "candy").
How query similarity differs across apps.
The lower part of Table 3 lists the query similarity, or query overlap, using a simple function employed in previous studies [6, 21]. We measure the query overlap at various thresholds using the similarity function $sim(q_1, q_2) = |q_1 \cap q_2| / |q_1 \cup q_2|$, which simply measures the overlap of query terms. We see that, among all queries, 18% are similar to no other query. We also see different levels of query overlap in queries belonging to different apps. The highest overlap is among queries from web search apps such as Chrome and Google. Lower query similarity is observed for personal apps such as Facebook and for more focused apps such as Amazon Shopping. Note that the query overlap is higher when all app queries are taken into account (All), as opposed to individual apps. This shows that users tend to use the same or a very similar query when they switch between different apps, suggesting that switching between apps is part of the information seeking or query reformulation procedure on mobile devices.
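The overlap function above is a Jaccard coefficient over query terms; a direct implementation:

```python
def query_overlap(q1: str, q2: str) -> float:
    """Jaccard similarity over query terms: |q1 ∩ q2| / |q1 ∪ q2|."""
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    if not (t1 | t2):
        return 0.0
    return len(t1 & t2) / len(t1 | t2)

print(query_overlap("cheap flights to rome", "flights to rome"))  # 3/4 = 0.75
```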
ISTAS.
A session is a "series of queries by a single user made within a small range of time" [51]. Similar to previous work [18, 31, 51], we consider a five-minute range of inactivity as closing a session. ISTAS consists of 3,796 sessions, with 1.81 queries per session on average. The majority of sessions contain only one query.

LSApp.
For consistency with the search sessions, we also consider a five-minute range of inactivity for LSApp. It is worth noting that even though related work suggests shorter inactivity periods [18, 55], we assume that a session ends after five minutes of inactivity in order to tackle noisy app usage data and the appearance of background services while the user is continuously using the same app. The collection contains a total of 61,632 app usage sessions. Table 2 reports the mean and median length of sessions in terms of time and number of switches between apps. We also report the mean and median number of unique apps that users launch in a session. Comparing the number of app switches with the number of unique apps, we see that in many sessions users tend to work with two apps and switch between them multiple times. To gain more insight into the nature of app switches, we perform the two analyses shown in Figures 6 and 7.

Fig. 6. Heat map depicting the co-occurrence of apps in the same sessions with other apps in LSApp. The graph on the left shows the co-occurrence at the app level, whereas the one on the right shows it at the category level. Popular apps such as Google Chrome dominantly co-occur with most other apps in various categories.

Our first goal here is to show how top-used apps in LSApp are used in the same session by users. To this end, we count the number of co-occurrences in sessions and normalize the numbers by summing over all co-occurrence values (the definition of an app usage session is given in Section 4.4). Figure 6 illustrates the co-occurrence values in the form of a heat map, at the level of individual apps on the left and of app categories on the right. We use the official taxonomy of apps from Google Play Store. Since every app always co-occurs with itself (hence having the maximum value of each row), we set the diagonal values to zero for a better quality of the figure. We see from the first column that
Google Chrome has the highest share of usage compared to other apps, as it has the highest value in most rows. It is interesting to see that users employ more popular apps such as Google together with other apps in most of the sessions. As argued in [6], users tend to use multiple apps to complete a single search task; switching between popular search apps in our data suggests that the same behavioral pattern is observed here. On the right side of the figure, we see how each app co-occurs with other apps based on their categories. It is interesting to observe that some app features could affect which types of apps co-occur. For example, observing the co-occurrences of the "Photography" app category, we see that social networking apps such as Instagram and Telegram exhibit some of the lowest co-occurrence values. This could be because of the photography features that already exist in such apps. Conversely, we see that apps such as Messages and Gmail co-occur more frequently. Also, we see that other apps belonging to the same or related categories are, in some cases, used in the same session. For example, we see that Phone co-occurs with Messaging and Contacts. It is also interesting to observe the lowest row of the figure, showing the co-occurrences of Hangouts. We see that while Hangouts exhibits high co-occurrence with social media apps like Facebook and Instagram, it is not highly used in the same sessions with instant messaging apps such as WhatsApp Messenger, Facebook Messenger, and Messages. This suggests that apps that fall into the same high-level category (i.e., social networking) tend to co-occur in a session, as users achieve different goals with them. However, users tend to use only one of the apps that fulfill very similar needs (i.e., instant messaging).

Fig. 7. App switch pattern in sessions with a Markov model on (a) app category level and (b) apps belonging to the Social and Communication categories. Edges represent a transition probability of over 0.05. Edges are directed and weighted by transition probability, with blue and red edges indicating over 0.2 and 0.4 transition probabilities, respectively.

We illustrate the transition probabilities between app categories in Figure 7a. The figure shows a Markov model of how users switch between apps that belong to different categories in a session. We see that the majority of sessions start with apps of the Tools, Social, and Communication categories. Although users switch between various categories of apps, we see that they mostly tend to use apps
of the same categories in a single session. This suggests that perhaps the types of tasks users complete in a single session can be done using a single app or a set of apps with similar purposes (i.e., belonging to the same category). To explore the transition probabilities between apps, we show in Figure 7b a Markov model of app transitions in sessions for Social and Communication apps. Here, we also see that even though users often switch among different apps, there is a higher tendency to switch back to the same app (i.e., blue- and red-colored edges indicate higher probabilities). This suggests that while users are trying to perform a task, they might be interrupted by environmental distractions or notifications on their phones, closing the current app and opening it again later. In particular, we see a self-transition probability of over 0.4 on Phone, Instagram, Hangouts, and Facebook. This is perhaps related to the users' tendency to engage with these apps for longer, leading to a higher probability of interruption. Interestingly, we observe that native Communication apps (i.e., Contacts, Phone, and Messaging) form a cluster on the left side of the figure, with users switching mainly among these three apps, while switching to other apps only through Messaging.

Fig. 8. Time-of-the-day distribution of queries and unique apps (best viewed in color).

Temporal behavior.
We analyze the behavior of users as they search, with respect to day-of-week and time-of-day. We see that the distribution of queries across the days of the week peaks slightly on Fridays. Notice that in this analysis we only include users who participated in our study for more than six days. Moreover, Figure 8 shows the distribution of queries and unique target apps across time-of-day for all participants. Our findings agree with similar studies in the field [12, 28]. As we can see, more queries are submitted in the evenings; however, we do not see a notable difference in the number of unique target apps.
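The app-switch transition probabilities discussed above (Figure 7) can be estimated with a first-order Markov model over app sessions; a minimal sketch, with hypothetical session data:

```python
from collections import Counter, defaultdict

def transition_probs(sessions):
    """sessions: list of app sequences (one list of app names per session).
    Returns first-order Markov transition probabilities p(next_app | current_app)."""
    counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

sessions = [["Phone", "Contacts", "Phone"], ["Phone", "Messaging"],
            ["Contacts", "Phone", "Phone"]]
probs = transition_probs(sessions)
print(probs["Phone"])  # each of the three observed transitions from 'Phone' gets 1/3
```

Keeping consecutive repeats of the same app in the sequences yields the self-transition probabilities shown in Figure 7b.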
Apps usage context.
We define a user's apps usage context at a given time $t$ as the apps usage statistics of that specific user during the 24 hours before $t$. Apps usage statistics contain details about the amount of time the user spent on every app installed on their smartphone. This gives valuable information on users' personal app preferences as well as their contexts. For example, a user who has interacted with travel guide apps in the past 24 hours is probably planning a trip in the near future. Therefore, we analyze how users' apps usage context can potentially help a target apps selection model. Figure 9 shows the histogram of target app rankings in the users' apps usage contexts. We see that participants often looked for information in the apps that they use more frequently. For instance, 19% of searches were done on the most used app, followed by 10% on the second most used app. We also see that, in most cases, as the rank increases, the percentage of target apps decreases, suggesting that incorporating users' apps usage context is critical for target apps selection.

Fig. 9. Apps usage context ranking distribution of relevant target apps. Lower values on the x-axis mean that the app has been used more often in the past 24 hours.
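The ranking analysis of Figure 9 boils down to ranking each target app by the user's time spent per app in the preceding 24 hours; a minimal sketch with hypothetical usage figures:

```python
def usage_context_rank(usage_seconds, target_app):
    """usage_seconds: dict mapping app -> seconds the user spent on it in the
    past 24 hours. Returns the 1-based rank of target_app by usage time."""
    ranking = sorted(usage_seconds, key=usage_seconds.get, reverse=True)
    return ranking.index(target_app) + 1

usage = {"Chrome": 5400, "Spotify": 3600, "Maps": 600, "Gmail": 1200}
print(usage_context_rank(usage, "Gmail"))  # Gmail is the 3rd most-used app -> 3
```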
In this section, we propose a context-aware neural model called CNTAS (Context-aware Neural Target Apps Selection), which is an extension of our recent neural target apps selection model (i.e., NTAS1) [6]. Our model takes as input a query $q$, a candidate app $a$, and the corresponding query context $c_q$, and produces a score indicating the likelihood of the candidate app $a$ being selected by the user as the target app for the query $q$. In the following, we first describe a general framework for context-aware target apps selection and then explain how it is implemented and how context is incorporated into the framework.

Formally, the CNTAS framework estimates the probability $p(S = 1 \mid q, a, c_q; A)$, where $S$ is a binary random variable indicating whether the app $a$ should be selected ($S = 1$) or not ($S = 0$). $A$ denotes the set of candidate apps. This set can be all possible apps, those that are installed on the user's mobile device, or a set of candidate apps obtained by another model in a cascade setting. The app selection probability in the CNTAS framework is estimated as follows:

$$p(S = 1 \mid q, a, c_q; A) = \psi(\phi_Q(q), \phi_A(a), \phi_C(c_q)), \qquad (1)$$

where $\phi_Q$, $\phi_A$, and $\phi_C$ respectively denote the query representation, app representation, and context representation components. $\psi$ is a target apps selection component that takes the mentioned representations and generates an app selection score. These components can be implemented in different ways. In addition, $c_q$ can contain various types of query context, including search time, search location, and the user's apps usage.

We implement the component $\phi_Q$ with two major functions: an embedding function $E: V \to \mathbb{R}^d$ that maps each vocabulary term to a $d$-dimensional embedding space, and a global term weighting function $W: V \to \mathbb{R}$ that maps each vocabulary term to a real-valued number indicating its global importance. The matrices $E$ and $W$ are the network parameters of our model and are learned to provide task-specific representations. The query representation component $\phi_Q$ represents a given query $q = \{w_1, w_2, \cdots, w_{|q|}\}$ as follows:

$$\phi_Q(q) = \sum_{i=1}^{|q|} \widehat{W}(w_i) \cdot E(w_i),$$

which is the weighted element-wise summation over the terms' embedding vectors. $\widehat{W}$ is the normalized global weight, computed using a softmax function as follows:

$$\widehat{W}(w_i) = \frac{\exp(W(w_i))}{\sum_{j=1}^{|q|} \exp(W(w_j))}.$$
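A minimal, self-contained sketch of this softmax-weighted bag-of-embeddings representation (with toy, hand-set embedding and weight tables standing in for the learned $E$ and $W$):

```python
import math

def phi_q(query_terms, E, W):
    """Softmax-weighted sum of term embeddings.
    E: term -> embedding vector; W: term -> global weight (learned in the model)."""
    exps = [math.exp(W[t]) for t in query_terms]
    z = sum(exps)
    d = len(next(iter(E.values())))
    vec = [0.0] * d
    for t, e in zip(query_terms, exps):
        w_hat = e / z                 # normalized global weight of this term
        for i in range(d):
            vec[i] += w_hat * E[t][i]
    return vec

E = {"train": [1.0, 0.0], "schedule": [0.0, 1.0]}
W = {"train": 0.0, "schedule": 0.0}   # equal weights -> plain average
print(phi_q(["train", "schedule"], E, W))  # [0.5, 0.5]
```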
This is a simple yet effective approach for query representation based on the bag-of-words assumption, which has proven effective for target apps selection [6] and ad-hoc retrieval [24, 47].

To implement the app representation component $\phi_A$, we learn a $d$-dimensional dense representation for each app. More specifically, this component consists of an app representation matrix $A \in \mathbb{R}^{N \times d}$, where $N$ denotes the total number of apps. Therefore, $\phi_A(a)$ returns the row of the matrix $A$ that corresponds to the app $a$.

Various context definitions can be considered to implement the context representation component. General types of context, such as location and time, have been extensively explored in different tasks, such as web search [14], personal search [60], and mobile search [29]. In this paper, we refer to the apps usage time as context, which is a special type of context for our task. As introduced earlier in Section 4.5, the apps usage context is the time that the user spent on each mobile app in the 24 hours preceding the search time. To implement $\phi_C$, we first compute a probability distribution based on the apps usage context, as follows:

$$p(a' \mid c_q) = \frac{\text{time spent on app } a' \text{ in the past 24 hours}}{\sum_{a'' \in A} \text{time spent on app } a'' \text{ in the past 24 hours}},$$

where $A$ is the set of candidate apps. $\phi_C$ is then computed as:

$$\phi_C(c_q) = \sum_{a' \in A} p(a' \mid c_q) \cdot A_C[a'],$$

where $A_C \in \mathbb{R}^{N \times d}$ denotes an app representation matrix that is different from the matrix $A$ used in the app representation component. This matrix is meant to learn app representations suitable for representing the apps usage context. $A_C[a']$ denotes the representation of app $a'$ in the app representation matrix $A_C$.

In summary, each of the representation learning components $\phi_Q$, $\phi_A$, and $\phi_C$ returns a $d$-dimensional vector. The app selection component is modeled as a fully-connected feed-forward network with two hidden layers and an output dimensionality of 1.
We use the rectified linear unit (ReLU) as the activation function in the hidden layers of the network, and sigmoid as the final activation function. To avoid overfitting, the dropout technique [54] is employed. For each query, the following vector is fed to this network:

$$(\phi_Q(q) \circ \phi_A(a)) \cdot |\phi_Q(q) - \phi_A(a)| \cdot (\phi_C(c_q) \circ \phi_A(a)) \cdot |\phi_C(c_q) - \phi_A(a)|,$$

where $\circ$ denotes the Hadamard product (i.e., element-wise multiplication) and $\cdot$ here denotes concatenation. In fact, this component computes the similarity of the candidate app with the query content and context, and estimates the app selection score based on the combination of both.

We train our model using pointwise and pairwise settings. In the pointwise setting, we use mean squared error (MSE) as the loss function. MSE for a mini-batch $b$ is defined as follows:

$$\mathcal{L}_{MSE}(b) = \frac{1}{|b|} \sum_{i=1}^{|b|} \left( y_i - \psi(\phi_Q(q_i), \phi_A(a_i), \phi_C(c_{q_i})) \right)^2,$$
Fig. 10. The architecture of NeuSA.

where $q_i$, $c_{q_i}$, $a_i$, and $y_i$ denote the query, the query context, the candidate app, and the label in the $i$-th training instance of the mini-batch. For this training setting, we use a linear activation for the output layer.

CNTAS can also be trained in a pairwise fashion. In this case, each training instance consists of a query, the query context, a target app, and a non-target app. To this end, we employ the hinge loss (max-margin loss function), which has been widely used in the learning-to-rank literature for pairwise models [36]. Hinge loss is a linear loss function that penalizes examples violating the margin constraint. For a mini-batch $b$, hinge loss is defined as below:

$$\mathcal{L}_{Hinge}(b) = \frac{1}{|b|} \sum_{i=1}^{|b|} \max\{0,\, 1 - \mathrm{sign}(y_{i1} - y_{i2})(\widehat{y}_{i1} - \widehat{y}_{i2})\},$$

where $\widehat{y}_{ij} = \psi(\phi_Q(q_i), \phi_A(a_{ij}), \phi_C(c_{q_i}))$.

In this section, we propose a neural sequence-aware model called NeuSA (Neural Sequential target App recommendation), which captures the sequential dependencies of apps as well as users' behavior with respect to their usage patterns (i.e., the personal app sequence) and temporal behavior. In the following, we first give an overview of our target apps recommendation framework and then explain how it is implemented.

Formally, NeuSA estimates the probability $p(L = 1 \mid u, a, c_u; A)$, where $L$ is a binary random variable indicating whether the app $a$ should be launched ($L = 1$) or not ($L = 0$). $A$ denotes the set of candidate apps. Similar to CNTAS, this set can be all apps, those that are installed on the user's mobile device, or a set of candidate apps obtained by another model in a cascade setting. The app recommendation probability in the NeuSA framework is estimated as follows:

$$p(L = 1 \mid u, a, c_u; A) = \psi(\phi_U(u), \phi_A(a), \phi_C(c_u)),$$

where $\phi_U$, $\phi_A$, and $\phi_C$ denote the user, app, and user context representation components, respectively. $\psi$ is a target apps recommendation component that takes the mentioned representations and generates a recommendation score. Any of these components can be implemented in different ways. In addition, $c_u$ can contain various types of user context, including time, location, and the sequence of previously-used apps.

We implement the component $\phi_U$ with an embedding function $E: U \to \mathbb{R}^d$ that maps each user to a $d$-dimensional embedding space. The matrix $E$ is a network parameter of our model and is learned to provide task-specific representations. To implement the app representation component $\phi_A$, we learn a $d$-dimensional dense representation for each app. In more detail, this component consists of an app representation matrix $A \in \mathbb{R}^{N \times d}$, where $N$ denotes the total number of apps. Therefore, $\phi_A(a)$ returns the row of the matrix $A$ that corresponds to the app $a$.

General types of context, such as location and time, have been extensively explored in different tasks, such as web search [14] and mobile search [29]. In this paper, we refer to the $k$ previously-used apps and time as context, with $k = 9$. Therefore, we define a window of size $k$ and consider the sequence of apps used just before the time of recommendation as the sequence context. Following [11], we break a full day (i.e., 24 hours) into 8 equal time bins (early morning to late night). To implement $\phi_C$, we first compute a probability distribution based on the apps usage records, as follows:

$$p(a' \mid c_u) = \frac{\text{time spent on app } a' \text{ in the current time bin}}{\sum_{a'' \in A} \text{time spent on app } a'' \text{ in the current time bin}},$$

where $A$ is the set of candidate apps. $\phi_C$ is then computed as:

$$\phi_C(c_u) = \sum_{a' \in A} p(a' \mid c_u) \cdot A[a'],$$

where $A \in \mathbb{R}^{N \times d}$ denotes an app representation matrix. This matrix is meant to learn app representations suitable for representing sequences of apps. $A[a']$ denotes the representation of app $a'$ in the app representation matrix $A$.

Each of the representation learning components $\phi_U$, $\phi_A$, and $\phi_C$ returns a $d$-dimensional vector. The app recommendation component is modeled as a recurrent neural network (RNN) consisting of Long Short-Term Memory (LSTM) units. After modeling the sequence of apps in this layer, the resulting parameters, together with the user and time features, are passed to a fully-connected feedforward network with two hidden layers. We use ReLU as the activation function in the hidden layers of the network, and softmax as the final activation function. To avoid overfitting, the dropout technique [54] is employed. We train our model in a pointwise setting with cross entropy as the loss function. Figure 10 depicts the architecture of our proposed network.

In this section, we evaluate the performance of the proposed model in comparison with a set of baseline models.
Data.
We evaluate the performance of our proposed models on the ISTAS dataset. We follow two different strategies to split the data: (i) in ISTAS-R, we randomly select 70% of the queries for training, 10% for validation, and 20% for testing; (ii) in ISTAS-T, we sort the queries of each user chronologically and keep the first 70% of each user's queries for training, the next 10% for validation, and the last 20% for testing. ISTAS-T is used to evaluate the methods when information about users' search history is available. To minimize random bias, for ISTAS-R we repeated the experiments 10 times and report the average performance. The hyper-parameters of all models were tuned based on the nDCG@3 value on the validation sets.
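The two splitting strategies can be sketched as follows (the record layout is hypothetical; each query carries a user id and a timestamp):

```python
import random

def split_istas_r(queries, seed=0):
    """Random 70/10/20 split over all queries (ISTAS-R style)."""
    q = list(queries)
    random.Random(seed).shuffle(q)
    n = len(q)
    return q[:int(0.7 * n)], q[int(0.7 * n):int(0.8 * n)], q[int(0.8 * n):]

def split_istas_t(queries_by_user):
    """Per-user chronological 70/10/20 split (ISTAS-T style).
    queries_by_user: dict user -> list of (timestamp, query)."""
    train, valid, test = [], [], []
    for user, qs in queries_by_user.items():
        qs = sorted(qs)  # chronological order by timestamp
        n = len(qs)
        train += qs[:int(0.7 * n)]
        valid += qs[int(0.7 * n):int(0.8 * n)]
        test += qs[int(0.8 * n):]
    return train, valid, test

tr, va, te = split_istas_t({"u1": [(t, f"q{t}") for t in range(10)]})
print(len(tr), len(va), len(te))  # 7 1 2
```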
Evaluation metrics.
Effectiveness is measured by four standard evaluation metrics that were also used in [6]: mean reciprocal rank (MRR), and normalized discounted cumulative gain for the top 1, 3, and 5 retrieved apps (nDCG@1, nDCG@3, nDCG@5). We determine statistically significant differences using the two-tailed paired t-test with Bonferroni correction at a 95% confidence interval ($p < 0.05$).

Compared methods.
We compared the performance of our model with the following methods:
• MFU (Most Frequently Used): For every query, we rank the apps in order of their popularity in the training set, as a static (query-independent) model.
• QueryLM, BM25, BM25-QE: For every app, we aggregate all its relevant queries from the training set to build a document representing the app. QueryLM is the query likelihood retrieval model [45]. For BM25-QE, we adopt Bo1 [8] for query expansion. We use the Terrier [41] implementation of these methods.
• k-NN, k-NN-AWE: To find the nearest neighbors in k nearest neighbors (k-NN), we consider the cosine similarity between the TF-IDF vectors of queries. Then, we take the labels (apps) of the nearest queries and produce the app ranking. As for k-NN-AWE [62], we compute the cosine similarity between the average word embeddings (AWE) of the queries, obtained from 300-dimensional GloVe [44] vectors.
• ListNet, ListNet-CX: For every query-app pair, we use the scores obtained by BM25-QE, k-NN, k-NN-AWE, and MFU as features to train ListNet [17], implemented in RankLib. For every query, we consider all irrelevant apps as negative samples. ListNet-CX also includes the users' apps usage context as an additional feature.
• NTAS: A neural model that we designed for the target apps selection task in our previous work [6]. We use the NTAS1 model due to its superior performance compared to NTAS2.
• Contextual baselines: In order to carry out a fair comparison between CNTAS and other context-aware baselines, we apply a context filter to all non-contextual baselines. We create the context filter as follows: for every app α in the training samples of user u, we take the time that u has spent on α in the past 24 hours as its score. We then perform a linear interpolation with the scores of all the mentioned baselines. Note that all scores are normalized. All these models are denoted by a -CR suffix.

Data.
For every user, we take the earliest 70% of app usage records as the training set, the next 10% of records as validation, and the latest 20% of records as the test set.
Evaluation metrics.
Effectiveness is measured by six standard evaluation metrics: mean reciprocal rank (MRR), normalized discounted cumulative gain for the top 1, 3, and 5 predicted apps (nDCG@1, nDCG@3, nDCG@5), and recall for the top 3 and 5 predicted apps (Recall@3, Recall@5). Our choice of evaluation metrics was motivated by the two main purposes of app recommendation discussed in Section 1. The MRR and nDCG@k metrics are intended to evaluate the effectiveness for an improved homescreen app ranking user experience, whereas Recall@k mainly evaluates how well a model is able to pre-load the next app among the top k predicted apps. We determine statistically significant differences using the two-tailed paired t-test at a 99.9% confidence interval ($p < 0.001$).

https://sourceforge.net/p/lemur/wiki/RankLib/

Compared methods.
• MFU (Most Frequently Used): For every test instance, we rank the apps in order of their popularity in the training set, as a static recommendation model.
• MRU (Most Recently Used): For every test instance, we rank the apps in order of their interaction time, so that the most recently used apps are ranked higher.
• Bayesian & Linear [30]: We implement the two baselines proposed by Huang et al. [30], namely Bayesian and Linear. Both baselines incorporate various contextual information in modeling app usage. In this work, we only use the contextual information available in our dataset, i.e., time, weekday, user, and previous app.
• LambdaMART & ListNet: For a given candidate app and every app in the sequence context, we compute the cosine similarity of their representations and consider it as a feature. The app representations are the average word embeddings (AWE) of app descriptions on Google Play Store. Other features include the recommendation time and the current user. These features were used to train LambdaMART and ListNet, as state-of-the-art learning-to-rank (LTR) methods, implemented in RankLib.
• k-NN & DecisionTree: Similar to the LTR baselines, we take the AWE similarity between app pairs as well as the user and time as classification features. We also include the apps that appear in the context sequence as additional features. We train k-NN and DecisionTree classifiers implemented in scikit-learn (https://scikit-learn.org/).
• TempoLSTM [59]: Models the sequence of apps using a two-layer network of LSTM units. The temporal information as well as the application is directly passed to each LSTM node.
• NeuSA w/o user, NeuSA w/o time & NeuSA w/o user, w/o time: These are three variations of our model. The only difference is in the use of the time and user features. NeuSA w/o user is trained without user data; NeuSA w/o time without time data; and NeuSA w/o user, w/o time with neither of them.
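Since each query or recommendation instance has exactly one relevant app (the true target), the ranking metrics above reduce to simple closed forms. A minimal sketch (function names are ours, not from the paper):

```python
import math

def mrr(ranked_lists, targets):
    """Mean reciprocal rank; each target is the single relevant app per query."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(ranked_lists)

def ndcg_at_k(ranked, target, k):
    """nDCG@k with one relevant item: the ideal DCG is 1 (relevant at rank 1),
    so nDCG@k is just the discounted gain at the target's rank."""
    for i, app in enumerate(ranked[:k]):
        if app == target:
            return 1.0 / math.log2(i + 2)
    return 0.0

def recall_at_k(ranked, target, k):
    """1 if the true next app appears among the top-k predictions, else 0."""
    return 1.0 if target in ranked[:k] else 0.0
```

Averaging `ndcg_at_k` and `recall_at_k` over all test instances yields the per-dataset numbers reported in the tables.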
In the following, we evaluate the performance of CNTAS trained on both data splits and study the impact of context on the performance. We further analyze how the models perform on both data splits.
Performance comparison.
Table 4 lists the performance of our proposed methods versus the compared methods. First, we compare the relative performance drop between the two data splits. We see that almost all non-contextual models perform worse on ISTAS-T compared to ISTAS-R, whereas almost all context-aware models perform better on ISTAS-T. Among the non-contextual methods, ListNet is the most robust model with the lowest performance drop, and k-NN-AWE is the only method that performs better on ISTAS-T (apart from MFU). The better results achieved by MFU on ISTAS-T suggest that this split is more biased towards popular apps. On the other hand, QueryLM exhibits the highest performance drop (−27% on average), as opposed to Contextual-k-NN-AWE with the highest performance improvement on ISTAS-T (+10% on average). This indicates that k-NN-AWE is able to capture similar queries effectively, whereas QueryLM relies heavily on the indexed queries.

Table 4. Performance comparison with baselines on the ISTAS-R and ISTAS-T datasets. The superscript * denotes significant differences compared to all the baselines.

                      ISTAS-R Dataset                     ISTAS-T Dataset
Method                MRR      nDCG@1   nDCG@3   nDCG@5   MRR      nDCG@1   nDCG@3   nDCG@5
MFU                   0.4502   0.2597   0.4435   0.4891   0.4786   0.2884   0.4752   0.5173
QueryLM               0.3556   0.2431   0.3534   0.3900   0.2706   0.1486   0.2713   0.3097
BM25                  0.4205   0.3134   0.4363   0.4564   0.3573   0.2447   0.3771   0.3948
BM25-QE               0.4319   0.2857   0.4371   0.4727   0.3930   0.2411   0.4053   0.4364
k-NN                  0.4433   0.2761   0.4455   0.4811   0.4067   0.2294   0.3982   0.4655
k-NN-AWE              0.4742   0.2937   0.4815   0.5211   0.4859   0.2950   0.4919   0.5392
ListNet               0.5170   0.3330   0.5211
MFU-CR                0.4903   0.3015   0.4901   0.5268   0.5289   0.3576   0.5358   0.5573
QueryLM-CR            0.4540   0.2773   0.4426   0.5013   0.4696   0.3023   0.4597   0.5145
BM25-CR               0.5398   0.3653   0.5394   0.5871   0.5249   0.3496   0.5255   0.5723
BM25-QE-CR            0.5215   0.3398   0.5223   0.5693   0.5230   0.3474   0.5260   0.5728
k-NN-CR               0.4978   0.3114   0.4926   0.5431   0.5161   0.3481   0.4956   0.5555
k-NN-AWE-CR           0.5144   0.3233   0.5142   0.5632   0.5577   0.3722   0.5612
ListNet-CR            0.5391   0.3544   0.5417   0.5845   0.5599   0.3780   0.5657   0.6037
ListNet-CX            0.5349   0.3580   0.5343   0.5784   0.5019   0.3139   0.5153   0.5521
NTAS-pointwise-CR     0.5532   0.3745   0.5580   0.5883   0.5627   0.3865   0.5663   0.5965
NTAS-pairwise-CR      0.5576   0.3779   0.5568   0.5870   0.5683   0.3923   0.5661   0.6047
CNTAS-pointwise       0.5614*  0.3833*  *        *        0.5586   *        *        *

Fig. 11. Performance comparison with respect to certain apps with and without context.

Among the non-contextual baselines, we see that NTAS-pairwise performs best in terms of most evaluation metrics on both data splits. This is because it learns high-dimensional app and query representations, which help it to perform more effectively. We see that applying the contextual filter improves the performance of all models. These improvements are statistically significant in all cases, so they are not shown in the table. Although this filter is very simple, it is still able to incorporate useful information about user context and behavior into the ranking. This also indicates the importance of apps usage context, as mentioned in Section 4.5. Among the context-aware baselines, we see that NTAS-pairwise-CR performs best in terms of MRR and nDCG@1, while k-NN-AWE-CR and ListNet-CR perform better in terms of the other evaluation metrics. It should also be noted that ListNet-CR performs better than ListNet-CX. This is due to the fact that ListNet-CX integrates the apps usage context as an additional feature, whereas ListNet-CR is the result of the combination of ListNet and the contextual filter.
Fig. 12. Performance difference per app (a) and per user (b) in terms of ΔMRR.
We see that our proposed CNTAS outperforms all the baselines with respect to the majority of evaluation metrics. In particular, CNTAS-pairwise exhibits the best performance. The achieved improvements in terms of MRR and nDCG@1 are statistically significant. The reason is that CNTAS is able to learn latent features from the interactions of mobile usage data in context; these interactions reveal useful signals for better understanding the users' information needs.
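The significance claims throughout this section rely on the two-tailed paired t-test over aligned per-query scores, at the 99.9% confidence level stated earlier. A minimal sketch of that test (the helper name is ours; `scipy.stats.ttest_rel` is the standard paired t-test):

```python
from scipy.stats import ttest_rel

def significantly_better(scores_a, scores_b, alpha=0.001):
    """Two-tailed paired t-test on per-query metric values (e.g., reciprocal
    ranks) of two systems evaluated on the same queries. Returns True when
    system A is significantly better than system B at level alpha."""
    t_stat, p_value = ttest_rel(scores_a, scores_b)
    return bool(p_value < alpha and t_stat > 0)
```

Note that the test must be run on per-query scores, not on the aggregated table values, so both score lists have to be aligned on the same test instances.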
Impact of context on performance per app.
In this experiment we demonstrate the effect of context on the performance with respect to various apps. Figure 11 shows the performance for queries that are labeled for specific target apps (as listed in the figure). We see that the context-aware model performs better while predicting social media apps such as Facebook and Instagram. However, we see that the performance for Google drops as it improves for Chrome. This happens because users do most of their browsing activities on Chrome, rather than on Google; hence the usage statistics of Chrome help the model to predict it more effectively. Moreover, we study the difference of MRR between the model with and without context for all apps. Our goal is to see how context improves the performance for every target app. We see in Figure 12a that the performance is improved for 39% of the apps. As shown in the figure, the improvements are much larger compared with the performance drops. Among the apps with the highest context improvements are Quora, Periscope, and Inbox.

Impact of context on performance per user.
Here we study the difference of MRR between the model with and without context for all users. Our goal is to see how many users are impacted positively by incorporating context in the target apps selection model. Figure 12b shows how performance differs per user when we apply context compared with when we do not. As we can see, users' apps usage context is able to improve the effectiveness of target apps selection for the majority of users. In particular, the performance for 57% of the users is improved by incorporating the apps usage context. In fact, we observed that the users with the highest impact from context use less popular apps.
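The per-user ΔMRR analysis above can be computed by grouping per-query reciprocal ranks by user and differencing the two runs. A small sketch (the data layout, pairs of user id and reciprocal rank, is our assumption):

```python
from collections import defaultdict

def mrr_per_user(results):
    """results: iterable of (user_id, reciprocal_rank) pairs for one model."""
    by_user = defaultdict(list)
    for user_id, rr in results:
        by_user[user_id].append(rr)
    return {u: sum(v) / len(v) for u, v in by_user.items()}

def delta_mrr_per_user(with_ctx, without_ctx):
    """Per-user MRR difference between the context-aware and context-free runs,
    for users present in both runs."""
    a, b = mrr_per_user(with_ctx), mrr_per_user(without_ctx)
    return {u: a[u] - b[u] for u in a.keys() & b.keys()}
```

The fraction of users with a positive delta then gives the "57% of the users improved" style of statistic; the per-app analysis of Figure 12a is identical with app ids in place of user ids.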
Impact of context on performance per query length.
We divide the test queries into three buckets based on query length, such that the buckets contain approximately equal numbers of queries. The first bucket, called Short queries, contains the shortest queries; the second, called Med. queries, consists of medium-length queries; and the last bucket, called Long queries, includes the longest queries of our test set. Table 5 lists the performance of the model with and without context in terms of MRR. As we can see, the average MRR for all three buckets is improved as we apply context. However, we observe that as the queries become shorter, the improvement increases. The reason is that shorter queries tend to be more general or ambiguous, and thus query context can have a higher impact on improving search for these queries.

Table 5. Performance analysis based on query length, dividing the test queries into three evenly-sized length buckets.

              Short queries   Med. queries   Long queries
              MRR             MRR            MRR
w/o context   0.5302          0.4924         0.4971
w/ context

In the following, we evaluate the performance of NeuSA trained on LSApp and study the impact of the time and user features, as well as of the learned app representations.

Table 6. Performance comparison with baselines on the LSApp dataset. The superscripts ∗, ★, †, and ‡ denote significant improvements compared to all the baselines, NeuSA w/o user, w/o time, NeuSA w/o user, and NeuSA w/o time, respectively (𝑝 < .001).

Method                      MRR    nDCG@1   nDCG@3   nDCG@5   Recall@3   Recall@5
MFU
MRU
Bayesian [30]
Linear [30]
LambdaMART
ListNet
k-NN
DecisionTree
TempoLSTM [59]
NeuSA w/o user, w/o time    ∗               ∗        ∗        ∗          ∗
NeuSA w/o user              ∗★     ★        ∗★       ∗★       ∗★         ∗★
NeuSA w/o time              ∗★†    ∗        ∗★†      ∗★†      ∗★†        ∗★†
NeuSA                       ∗★†‡   ∗★†      ∗★†      ∗★†      ∗★†        ∗★†
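The even-size bucketing by query length described above can be sketched as follows (length in word tokens is our assumption; ties keep their original order):

```python
def length_buckets(queries, n_buckets=3):
    """Split queries into n_buckets of approximately equal size by length,
    shortest first, mirroring the Short / Med. / Long query analysis."""
    ordered = sorted(queries, key=lambda q: len(q.split()))
    size, rem = divmod(len(ordered), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        # Earlier buckets absorb the remainder, one extra query each.
        end = start + size + (1 if i < rem else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets
```

Bucketing on count rather than on fixed length thresholds is what keeps the three buckets comparable in size, so the averaged MRR values in Table 5 are equally reliable across buckets.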
Performance comparison.
Table 6 lists the performance of our proposed method as well as its variations and baselines. As we can see, ListNet exhibits the best performance among the LTR baselines, and DecisionTree among the classification baselines. Moreover, all models outperform MFU in terms of all evaluation metrics. In particular, we see that Recall@5 is improved for all methods, indicating that simply keeping the most-used apps loaded in the background is not effective. Also, we see that while ListNet consistently outperforms LambdaMART, k-NN exhibits a better performance than DecisionTree in terms of Recall@3 and Recall@5. We see that all models, including MFU and MRU, outperform the statistical baselines, namely, Bayesian and Linear. The large performance margin between simple models such as k-NN and these two baselines indicates the effectiveness of representation-based features (i.e., AWE similarity) for this task. Furthermore, we see that NeuSA outperforms all the baselines by a large margin in terms of all evaluation metrics. For instance, we see a 39% relative improvement over DecisionTree in terms of MRR and a 40% relative improvement over k-NN in terms of Recall@5. This suggests that learning high-dimensional sequence-aware representations of apps enables the model to capture users' behavioral patterns in using their smartphones. It is worth noting that NeuSA achieves a high value of Recall@5, suggesting that a mobile operating system would be able to pre-load 5 apps with an 87% recall value.

Fig. 13. Performance of NeuSA in terms of MRR for different numbers of previously-used apps as context (𝑘).

Impact of time and user features.
To evaluate the impact of the time and user features, we compare the performance of NeuSA with three variations, called NeuSA w/o user, NeuSA w/o time, and NeuSA w/o user, w/o time. As described earlier, these three models are trained after removing the user and time features from the data. We see that in all cases the performance consistently drops. In particular, when both the user and time features are removed, NeuSA w/o user, w/o time exhibits the largest performance loss, while still outperforming all the baseline models for the majority of metrics. As we add the user feature back to the model, the performance improves, showing that a personalized app recommendation model is effective. In particular, NeuSA w/o time outperforms NeuSA w/o user, w/o time significantly in terms of all evaluation metrics. Also, we see a large drop in performance when we remove the user data from NeuSA, confirming again that personal app usage patterns should be taken into consideration for this problem. Therefore, a practical system can be trained on a large dataset of app usage from various users and be fine-tuned on every user's phone according to their personal usage behavior. Furthermore, although adding time to the NeuSA w/o user, w/o time model results in significant improvements (i.e., NeuSA w/o user), we do not observe the same impact after adding the user data to the model (comparing NeuSA against NeuSA w/o time). This suggests that while temporal information contains important signals revealing time-dependent app usage patterns, it does not add useful information to the personal model. This can be due to the fact that the personal information already conveys the temporal app usage behavior of the user (i.e., each user's temporal behavior is unique).
Impact of the context length.
Here, we evaluate the effect of the number of previously-used apps that we consider in our NeuSA model. To do so, we keep all the model parameters the same and change the number of apps in the context (𝑘). We plot the performance of NeuSA for various 𝑘 values in Figure 13. As we see in the figure, even though the performance largely converges for 𝑘 ≥ 3, the best performance is achieved with 𝑘 = 9. This indicates that while the model depends highly on the latest three apps used by the user, it can also learn longer patterns in some rare cases. Moreover, it is worth noting that the model's performance using only one app in the context in terms of MRR is 0.
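The 𝑘-sweep above amounts to rebuilding the training pairs with a different context window over each user's app sequence. A minimal sketch of that windowing (the padding token and function name are ours, not from the paper):

```python
def context_windows(app_sequence, k):
    """Build (previous-k apps, next app) training pairs from one usage log.
    Histories shorter than k are left-padded so every app in the sequence
    yields one prediction target."""
    pad = ["<pad>"] * k
    padded = pad + list(app_sequence)
    return [
        (padded[i:i + k], padded[i + k])  # context window, next app
        for i in range(len(app_sequence))
    ]
```

Increasing 𝑘 only widens the context slice; the number of training pairs per sequence stays the same, which keeps the runs in Figure 13 comparable.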
CONCLUSIONS AND FUTURE WORK
In this paper, we conducted the first in situ study on the task of target apps selection, which was motivated by the growing interest in intelligent assistants and conversational search systems where users interact with a universal voice-based search system [1, 4, 7, 34, 61]. To this aim, we developed an app, uSearch, and recruited 255 participants, asking them to report their real-life cross-app mobile queries via uSearch. We observed notable differences in length and structure among queries submitted to different apps. Furthermore, we found that while users search using various apps, a few apps attract most of the search queries. We found that even though Google and Chrome are the most popular apps, users do only 26% and 23% of their searches in these apps, respectively. The in situ data collection enabled us to collect valuable information about users' contexts. For instance, we found that the target app for 29% of the queries was among the top two most used apps of a particular user. Inspired by our data analysis, we proposed a model that learns high-dimensional latent representations for the apps usage context and predicts the target app for a query. The model was trained in an end-to-end setting. Our model produces a score for a given context-query-app triple. We compared the performance of our proposed method with state-of-the-art retrieval baselines, splitting the data following two different strategies. We observed that our approach significantly outperforms all baselines.

Furthermore, we proposed a neural sequence-aware model, called NeuSA, for predicting next app usage. NeuSA learns a high-dimensional representation for mobile apps, incorporating the app usage sequence as well as temporal and personal information into the model. We trained the model on app usage data collected from 292 real users. The results showed that the proposed model is able to capture complex behavioral patterns of users while using their phones, outperforming classification and LTR baselines significantly in terms of nDCG@𝑘, Recall@𝑘, and MRR.

Limitations.
Like any other study, ours has some limitations. First, the study relies on self-reporting, which could introduce specific biases into the collected data. For instance, participants may prefer to report shorter queries simply because it requires less work. Also, in many cases, participants are likely to forget to report queries, or may not report all the queries that belong to the same session. Second, the reported queries are not actually submitted to a unified search system, and users may formulate their queries differently in such a setting. For example, in a unified system a query may be "videos of Joe Bonamassa," but in YouTube it may be "Joe Bonamassa." Both of the mentioned limitations are mainly due to the lack of an existing unified mobile search app; hence, building such an app would be essential for building a more realistic collection. Also, our study does not consider the users' success or failure in their search. Submitting queries in certain apps could result in different chances of success and, consequently, affect users' behavior in the session when submitting other queries in the same app or other apps. Finally, more efficient data collection strategies could be employed based on active learning [46].
Future work.
The next step in this research would be exploring the influence of other types of contextual information, such as location and time, on the target apps selection and recommendation tasks. In addition, it would be interesting to explore result aggregation and presentation in the future, considering two important factors: information gain and user satisfaction. This direction can be studied in both areas of information retrieval and human-computer interaction. Furthermore, based on the findings of our analyses, we believe that mobile search queries can be leveraged to improve the user experience. For instance, suppose a user searches for a restaurant using a unified search system and finds some relevant information on Yelp. In this case, considering the user's personal preference as well as the context, the system could send the user a notification with information about the traffic near the restaurant. This would certainly improve the quality of the user experience. We also plan to investigate whether the demographics of the participants are linked to particular queries and behavior and, if such behavioral biases exist, how different models are able to address such issues.
Acknowledgements.
We thank the anonymous reviewers for the valuable feedback. This work was supported in part by the RelMobIR project of the Swiss National Science Foundation (SNSF), and in part by the Center for Intelligent Information Retrieval. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.
REFERENCES
[1] Mohammad Aliannejadi, Manajit Chakraborty, Esteban Andrés Ríssola, and Fabio Crestani. 2020. Harnessing Evolution of Multi-Turn Conversations for Effective Answer Retrieval. In
CHIIR . 33–42.[2] Mohammad Aliannejadi and Fabio Crestani. 2017. Venue Appropriateness Prediction for Personalized Context-AwareVenue Suggestion. In
SIGIR . 1177–1180.[3] Mohammad Aliannejadi, Morgan Harvey, Luca Costa, Matthew Pointon, and Fabio Crestani. 2019. UnderstandingMobile Search Task Relevance and User Behaviour in Context. In
CHIIR . 143–151.[4] Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail S. Burtsev. 2020. ConvAI3:Generating Clarifying Questions for Open-Domain Dialogue Systems (ClariQ).
CoRR abs/2009.11352 (2020).[5] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. 2018. In Situ and Context-Aware TargetApps Selection for Unified Mobile Search. In
CIKM . 1383–1392.[6] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. 2018. Target Apps Selection: Towards aUnified Search Framework for Mobile Devices. In
SIGIR . 215–224.[7] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. 2019. Asking Clarifying Questions inOpen-Domain Information-Seeking Conversations. In
SIGIR . 475–484.[8] Giambattista Amati. 2003.
Probability models for information retrieval based on divergence from randomness . Ph.D.Dissertation. University of Glasgow, UK.[9] Jaime Arguello. 2017. Aggregated Search.
Foundations and Trends in Information Retrieval
10, 5 (2017), 365–502.[10] Ricardo A. Baeza-Yates, Di Jiang, Fabrizio Silvestri, and Beverly Harrison. 2015. Predicting The Next App That YouAre Going To Use. In
WSDM . 285–294.[11] Linas Baltrunas, Karen Church, Alexandros Karatzoglou, and Nuria Oliver. 2015. Frappe: Understanding the Usage andPerception of Mobile App Recommendations In-The-Wild.
CoRR abs/1505.03014 (2015).[12] Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David A. Grossman, and Ophir Frieder. 2004. Hourly analysis ofa very large topically categorized web query log. In
SIGIR . 321–328.[13] Jan R. Benetka, Krisztian Balog, and Kjetil Nørvåg. 2017. Anticipating Information Needs Based on Check-in Activity.In
WSDM . 41–50.[14] Paul N. Bennett, Filip Radlinski, Ryen W. White, and Emine Yilmaz. 2011. Inferring and using location metadata topersonalize web search. In
SIGIR . 135–144.[15] James P. Callan and Margaret E. Connell. 2001. Query-based sampling of text databases.
ACM Trans. Inf. Syst.
19, 2(2001), 97–130.[16] Huanhuan Cao, Derek Hao Hu, Dou Shen, Daxin Jiang, Jian-Tao Sun, Enhong Chen, and Qiang Yang. 2009. Context-aware query classification. In
SIGIR . 3–10.[17] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach tolistwise approach. In
ICML . 129–136.[18] Juan Pablo Carrascal and Karen Church. 2015. An In-Situ Study of Mobile App & Mobile Search Interactions. In
CHI .2739–2748.[19] Karen Church and Nuria Oliver. 2011. Understanding mobile web and mobile search use in today’s dynamic mobilelandscape. In
Mobile HCI . 67–76.[20] Karen Church, Barry Smyth, Keith Bradley, and Paul Cotter. 2008. A large scale study of European mobile searchbehaviour. In
Mobile HCI . 13–22.[21] Karen Church, Barry Smyth, Paul Cotter, and Keith Bradley. 2007. Mobile information access: A study of emergingsearch behavior on the mobile Internet.
TWEB
1, 1 (2007), 4.[22] Fabio Crestani and Heather Du. 2006. Written versus spoken queries: A qualitative and quantitative comparativeanalysis.
JASIST
57, 7 (2006), 881–890.[23] Fabio Crestani, Stefano Mizzaro, and Ivan Scagnetto. 2017.
Mobile Information Retrieval. Springer.
[24] Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. 2017. Neural Ranking Models with Weak Supervision. In
SIGIR . 65–74.[25] Fernando Diaz. 2009. Integration of news content into web results. In
WSDM . 182–191.[26] Ido Guy. 2016. Searching by Talking: Analysis of Voice Queries on Mobile Web Search. In
SIGIR . 35–44.[27] Martin Halvey, Mark T. Keane, and Barry Smyth. 2006. Mobile web surfing is the same as web surfing.
Commun. ACM
49, 3 (2006), 76–81.[28] Martin Halvey, Mark T. Keane, and Barry Smyth. 2006. Time based patterns in mobile-internet surfing. In
CHI . 31–34.[29] Shun Hattori, Taro Tezuka, and Katsumi Tanaka. 2007. Context-Aware Query Refinement for Mobile Web Search. In
SAINT Workshops .[30] Ke Huang, Chunhui Zhang, Xiaoxiao Ma, and Guanling Chen. 2012. Predicting mobile application usage usingcontextual information. In
Ubicomp . 1059–1065.[31] Maryam Kamvar and Shumeet Baluja. 2006. A large scale study of wireless search behavior: Google mobile search. In
CHI . 701–709.[32] Maryam Kamvar and Shumeet Baluja. 2007. The role of context in query input: using contextual signals to completequeries on mobile devices. In
Mobile HCI . 405–412.[33] In-Ho Kang and Gil-Chang Kim. 2003. Query type classification for web document retrieval. In
SIGIR . 64–71.[34] Antonios Minas Krasakis, Mohammad Aliannejadi, Nikos Voskarides, and Evangelos Kanoulas. 2020. Analysing theEffect of Clarifying Questions on Document Ranking in Conversational Search. In
ICTIR . 129–132.[35] Joohyun Lee, Kyunghan Lee, Euijin Jeong, Jaemin Jo, and Ness B. Shroff. 2016. Context-aware application schedulingin mobile systems: what will users do and not do next?. In
UbiComp . 1235–1246.[36] Hang Li. 2011.
Learning to Rank for Information Retrieval and Natural Language Processing . Morgan & ClaypoolPublishers.[37] Zhung-Xun Liao, Po-Ruey Lei, Tsu-Jou Shen, Shou-Chung Li, and Wen-Chih Peng. 2012. Mining Temporal Profiles ofMobile Applications for Usage Prediction. In
ICDM . 890–893.[38] Eric Hsueh-Chan Lu, Yi-Wei Lin, and Jing-Bin Ciou. 2014. Mining mobile application sequential patterns for usageprediction. In
GrC . 185–190.[39] Ilya Markov and Fabio Crestani. 2014. Theoretical, Qualitative, and Quantitative Analyses of Small- DocumentApproaches to Resource Selection.
ACM Trans. Inf. Syst.
32, 2 (2014), 9–37.[40] George D. Montanez, Ryen W. White, and Xiao Huang. 2014. Cross-Device Search. In
CIKM . 1669–1678.[41] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Douglas Johnson. 2005. TerrierInformation Retrieval Platform. In
ECIR . 517–519.[42] Dae Hoon Park, Yi Fang, Mengwen Liu, and ChengXiang Zhai. 2016. Mobile App Retrieval for Social Media Users viaInference of Implicit Intent in Social Media Text. In
CIKM . 959–968.[43] Dae Hoon Park, Mengwen Liu, ChengXiang Zhai, and Haohong Wang. 2015. Leveraging User Reviews to ImproveAccuracy for Mobile App Retrieval. In
SIGIR . 533–542.[44] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.In
EMNLP . 1532–1543.[45] Jay M. Ponte and W. Bruce Croft. 1998. A Language Modeling Approach to Information Retrieval. In
SIGIR . 275–281.[46] Md. Mustafizur Rahman, Mucahid Kutlu, and Matthew Lease. 2019. Constructing Test Collections using Multi-armedBandits and Active Learning. In
WWW . 3158–3164.[47] Ivan Sekulic, Amir Soleimani, Mohammad Aliannejadi, and Fabio Crestani. 2020. Longformer for MS MARCODocument Re-ranking Task.
CoRR abs/2009.09392 (2020).[48] Dou Shen, Jian-Tao Sun, Qiang Yang, and Zheng Chen. 2006. Building bridges for web query classification. In
SIGIR .131–138.[49] Milad Shokouhi and Qi Guo. 2015. From Queries to Cards: Re-ranking Proactive Card Recommendations Based onReactive Search History. In
SIGIR . 695–704.[50] Milad Shokouhi and Luo Si. 2011. Federated Search.
Foundations and Trends in Information Retrieval
5, 1 (2011), 1–102.[51] Craig Silverstein, Monika Rauch Henzinger, Hannes Marais, and Michael Moricz. 1999. Analysis of a Very Large WebSearch Engine Query Log.
SIGIR Forum
33, 1 (1999), 6–12.[52] Timothy Sohn, Kevin A. Li, William G. Griswold, and James D. Hollan. 2008. A diary study of mobile informationneeds. In
CHI . 433–442.[53] Yang Song, Hao Ma, Hongning Wang, and Kuansan Wang. 2013. Exploring and exploiting user search behavior onmobile and tablet devices to improve search relevance. In
WWW . 1201–1212.[54] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: ASimple Way to Prevent Neural Networks from Overfitting.
Journal of Machine Learning Research
15 (2014), 1929–1958.[55] Niels van Berkel, Chu Luo, Theodoros Anagnostopoulos, Denzil Ferreira, Jorge Gonçalves, Simo Hosio, and VassilisKostakos. 2016. A Systematic Assessment of Smartphone Usage Gaps. In
CHI. 4711–4721.
[56] Yingzi Wang, Nicholas Jing Yuan, Yu Sun, Fuzheng Zhang, Xing Xie, Qi Liu, and Enhong Chen. 2016. A contextual collaborative approach for app usage forecasting. In
UbiComp . 1247–1258.[57] Ryen W. White, Paul N. Bennett, and Susan T. Dumais. 2010. Predicting short-term interests using activity-basedsearch context. In
CIKM . 1009–1018.[58] Biao Xiang, Daxin Jiang, Jian Pei, Xiaohui Sun, Enhong Chen, and Hang Li. 2010. Context-aware ranking in websearch. In
SIGIR . 451–458.[59] Shijian Xu, Wenzhong Li, Xiao Zhang, Songcheng Gao, Tong Zhan, Yongzhu Zhao, Wei-wei Zhu, and Tianzi Sun. 2018.Predicting Smartphone App Usage with Recurrent Neural Networks. In
WASA . 532–544.[60] Hamed Zamani, Michael Bendersky, Xuanhui Wang, and Mingyang Zhang. 2017. Situational Context for Ranking inPersonal Search. In
WWW . 1531–1540.[61] Hamed Zamani and Nick Craswell. 2020. Macaw: An Extensible Conversational Information Seeking Platform. In
SIGIR . 2193–2196.[62] Hamed Zamani and W. Bruce Croft. 2016. Estimating Embedding Vectors for Queries. In
ICTIR . 123–132.[63] Sha Zhao, Zhiling Luo, Ziwen Jiang, Haiyan Wang, Feng Xu, Shijian Li, Jianwei Yin, and Gang Pan. 2019. AppUsage2Vec:Modeling Smartphone App Usage for Prediction. In