AOBTM: Adaptive Online Biterm Topic Modeling for Version Sensitive Short-texts Analysis
Mohammad Abdul Hadi
Department of Computer Science, The University of British Columbia
Kelowna, [email protected]
Fatemeh H Fard
Department of Computer Science, The University of British Columbia
Kelowna, [email protected]
Abstract—Analysis of mobile app reviews has shown its important role in requirement engineering, software maintenance and evolution of mobile apps. Mobile app developers check their users' reviews frequently to clarify the issues experienced by users or capture the new issues that are introduced due to a recent app update. App reviews have a dynamic nature and their discussed topics change over time. The changes in the topics among collected reviews for different versions of an app can reveal important issues about the app update. A main technique in this analysis is using topic modeling algorithms. However, app reviews are short texts and it is challenging to unveil their latent topics over time. Conventional topic models such as Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) suffer from the sparsity of word co-occurrence patterns while inferring topics for short texts. Furthermore, these algorithms cannot capture topics over numerous consecutive time-slices (or versions). Online topic modeling algorithms such as Online LDA (OLDA) and Online Biterm Topic Model (OBTM) speed up the inference of topic models for the texts collected in the latest time-slice by saving a fraction of data from the previous time-slice. But these algorithms do not analyze the statistical data (such as topic distributions) of all the previous time-slices, which can confer contributions to the topic distribution of the current time-slice.
In this paper, we propose Adaptive Online Biterm Topic Model (AOBTM) to model topics in short texts adaptively. AOBTM alleviates the sparsity problem in short texts and considers the statistical data for an optimal number of previous time-slices. We also propose parallel algorithms to automatically determine the optimal number of topics and the best number of previous versions that should be considered in the topic inference phase. Automatic evaluation on collections of app reviews and real-world short text datasets confirms that AOBTM can find more coherent topics and outperforms the state-of-the-art baselines. For reproducibility of the results, we open-source all scripts.
Index Terms—App review analysis, adaptive topic model, biterm, online algorithm, automatic parameter setting
I. INTRODUCTION
This research is supported by the Canada NSERC Discovery Grant [RGPIN-2019-05175].

Mobile app reviews form a main feedback channel for the app developers [1] to evaluate their products and improve application maintenance and evolution tasks [2]. The app developers must analyze app reviews in order to gain insights about the current state of their apps from users' perspectives. Several studies have been proposed to analyze app reviews, including extracting informative text [3], summarizing user reviews [4], identifying bugs or feature requests [5]–[7], prioritizing feature inclusions [8], and extracting insights about the apps [1]. Although (popular) apps are updated frequently [9], app review analysis studies mostly consider the app reviews static [10], [11]. However, app reviews have a dynamic nature and their discussed topics change over time. If the update-centric analysis is neglected, it misses the point that feedback is written on a certain update [11]. The change in the topics extracted from reviews for different app versions can reveal important issues about the app [10], [11]. A recent example of the importance of discussed topics over time is the
Zoom Cloud Meeting, a popular app for video conferencing. Zoom received massive one-star ratings (the lowest rating) in Google Play Store during the COVID-19 outbreak in March 2020. Most of the issues were related to users' concerns about data-privacy and security malpractices. These issues were so severe that in a letter to Zoom, the New York attorney general's office expressed concerns and addressed security flaws [12]. These issues were extensively discussed in the user reviews and could have been flushed out months ago through proper inspection of the user reviews and topic changes from user feedback on Google Play Store.
App reviews are short texts that are time/version sensitive, as these texts are generated constantly and are collected regularly for consecutive app versions [9], [10]. The underlying latent topics derived from app reviews can benefit the developers extensively [13]. However, extracting relevant topics from app reviews is challenging due to their dynamic nature and the lack of rich context in short texts.
The most popular topic modeling methods for discovering the underlying topics from a text corpus are LDA [14] and PLSA [15]. But these topic models do not perform well with a text corpus containing short texts as documents [16]. These algorithms consider individual short texts as separate documents and model each of these documents as a mixture of topics, where each topic is considered as a probability distribution over words. The models then utilize various statistical techniques to determine the topic components and mixture coefficients of each document by implicitly capturing the document-level word co-occurrence patterns [17], [18]. While dealing with typical lengthy documents, the mentioned algorithms could rely on larger word counts to know how the words are related.
However, the natural sparseness of the word co-occurrence patterns in each short document makes these models suffer from the data-sparsity problem [19]. Moreover, short texts lack the richness of context, making it more difficult for these topic models to identify the senses of ambiguous words in short documents. Biterm Topic Model (BTM) alleviates these problems by learning topics over short texts and explicitly modeling the generation of biterms in the whole corpus to enhance topic learning [16].
As mentioned, app reviews are usually collected in batches over consecutive time-slices [10]. Each time a new batch of text data arrives, these topic models (e.g., LDA, BTM) require retraining to discover latent topic distributions from the new dataset, which is prohibitively time and memory consuming. The popular way to alleviate the scalability problem is to develop online algorithms such as Online LDA (OLDA) [20] and Online BTM (OBTM) [16]. These online algorithms store a small fraction of data on the fly in order to accommodate the dataset of the upcoming time-slice. When a new batch of text data arrives, online algorithms model the topics of texts either by using the statistics of samples collected in the immediately previous version/time-slice [21] or by naively aggregating statistics of all the previous time-slices [16]. However, these online algorithms do not take different versions' varying contributions into account. Statistics of the textual data collected over different time periods or different versions may have a non-negligible difference in similarity with that of the latest time-slice and thus can contribute differently to the latest version [10], [18]. Adaptive versions of online algorithms, such as Adaptively Online LDA (AOLDA), can be used to address the problem of the varying contribution of different versions [10].
But the underlying model in AOLDA is LDA, which again makes it suffer from the mentioned data-sparsity problem while working with short texts.
In this paper, we propose a new adaptive online topic model for short texts which takes previous versions' varying contributions into account. We refer to this novel model as the Adaptive Online Biterm Topic Model (AOBTM). AOBTM inherits the characteristics of BTM to deal with the data sparsity issue. It is an online algorithm that can scale for the increasing volume of the dataset that is generated frequently. AOBTM also endows the statistics of the previous versions with different contributions to the topic distributions of the current version of the dataset. Also, we have employed a preprocessing technique that is useful for yielding better top contributing key-terms to help the manual investigation of the inferred topics. Our contributions are listed below (note that time-slice and version are semantically equal in this paper):
1) We propose a novel method called AOBTM for version-sensitive content analysis of short texts. This method adaptively combines the topic distributions of a selected number of prior versions to generate topic distributions of the current version.
2) We propose two parallel algorithms; the first algorithm can identify an optimal number of topics to be derived in the latest version, and the second algorithm can identify the optimal number of previous versions to be taken into consideration for adaptive aggregation of statistical data.
3) To encourage replicability of the research results, we make all scripts, codes, and graphs available to the community.
We have conducted experiments on app review datasets and a Twitter dataset with a large number of records to evaluate the performance of AOBTM compared to five baseline algorithms. Also, we integrated AOBTM into the state-of-the-art online app-review analysis framework called IDEA for comparison [10].
Our results show that topics captured by AOBTM are more coherent compared to the topics extracted by baseline methods.
The rest of the paper is organized as follows: Sections II and III describe the background and our topic model design. Sections IV and V are dedicated to the proposed parallel algorithms and to experiments and results, followed by the related works in Section VI. We add threats to validity in Section VII and conclude the paper in Section VIII.

II. BACKGROUND AND MOTIVATION
A. Topic Modeling for conventional text-documents
Topic modeling algorithms such as PLSA and LDA are widely embraced for identifying latent semantic structures from a text corpus without requiring any prior annotations of the documents [22]. These algorithms observe each document as a mixture of topics, while a distribution over the vocabulary terms characterizes each topic. Statistical techniques such as variational methods and Gibbs sampling are then applied to infer the latent topic distributions of given documents and the word distributions of inferred topics [23]. Although these algorithms and their variants contributed largely to modeling text collections such as blogs, research papers, and news articles [20], [24], [25], these topic models endure considerable performance deterioration while handling short texts [16], [26]. Directly applying these models on short texts suffers from a severe data sparsity problem [19], as the frequency of words in short texts plays a less discriminative role, which makes it hard to infer word correlations from short documents [19]. The limited context in short texts also makes it challenging to identify the sense of ambiguous words.
B. Topic Modeling for short text-documents
Researchers have proposed numerous topic modeling algorithms for short texts by trying to address one or more of the following inherent characteristics of short texts: i) lack of enough word co-occurrence information, with most individual short texts likely being generated by a single topic, ii) inability to fully capture semantically co-related but rarely co-occurring words (due to the lack of statistical information of words among texts), and iii) the probability that a single-topic assumption is too strong for some short texts [22]. In [22], Qiang et al. divided short text topic modeling algorithms into three major categories: Dirichlet multinomial mixture (DMM) based methods, global word co-occurrence based methods, and self-aggregation based methods. Brief introductions to the related works can be found in Section VI.

Scripts, codes, and graphs are available at: https://anonymous.4open.science/r/995c2443-74d9-4e10-a3fa-4f814082b06d/
C. Online Topic Modeling for short texts
A particular issue with the traditional topic modeling algorithms is that they cannot scale with an expanding dataset. Whenever a new batch of data arrives, these topic models (i.e., LDA, PLSA) need to be trained from scratch. Moreover, these conventional topic models cannot guarantee consistency in the sequence of topics if independent training is performed on different batches of the same corpus [14], [16]. This inconsistency occurs because, before each independent training, we set the prior topic distribution to a default value, a Dirichlet parameter; so, the fixed number of topics (assume, K) can be generated in any sequence [16]. For example, when we train a corpus, using BTM or LDA, with K topics, each independent training generates K topic distributions, where we cannot ascertain that each time we train the corpus, a topic number k (here, k = [1, ..., K]) will always correspond to a specific topic (e.g., "UI component").
Researchers proposed online models (i.e., OBTM, OLDA) to circumvent the problems with streaming datasets. Here, we can assume that the documents are generated in streams and can be collected and divided into different time-slices or versions, where the documents are exchangeable within a time-slice. For example, Online BTM (OBTM) accommodates and deals with batches of short-text documents divided into different time-slices or versions. Let us assume that OBTM has already got the topic distribution for the (t−1)-th time-slice. When a new batch (the t-th time-slice) arrives, OBTM utilizes the topic distribution of the (t−1)-th time-slice to set the prior topic distribution of the t-th time-slice. This, in turn, ensures that after the training on the t-th time-slice, the k-th topic in the t-th time-slice is closely related to the k-th topic generated in the (t−1)-th time-slice.
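As an illustration, the biterm extraction that BTM-family models (including OBTM) operate on can be sketched as below. This is a minimal sketch, not the paper's implementation: the tokenization, the window size, and the function name are illustrative assumptions.

```python
from itertools import combinations

def extract_biterms(tokens, window=3):
    """Collect unordered pairs of distinct terms that co-occur inside a
    sliding fixed-size window (the 'short context') over a term sequence."""
    biterms = set()
    n = len(tokens)
    if n < 2:
        return biterms
    # Slide the short-context window across the term sequence.
    for start in range(max(1, n - window + 1)):
        context = tokens[start:start + window]
        for w1, w2 in combinations(context, 2):
            if w1 != w2:  # only two *distinct* terms form a biterm
                biterms.add(tuple(sorted((w1, w2))))  # unordered pair
    return biterms

# A single short context of size 3 yields all three unordered pairs:
print(sorted(extract_biterms(["w1", "w2", "w3"], window=3)))
# → [('w1', 'w2'), ('w1', 'w3'), ('w2', 'w3')]
```

Because all biterms of a corpus are pooled for inference rather than kept per document, longer-range word co-occurrence statistics become available despite each individual text being short.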
If we introduce a completely new topic in the latest batch of documents (the t-th time-slice), the new topic will merge into one (or more) existing topic(s) generated in the (t−1)-th time-slice that has (have) strong correlation(s) with the introduced topic [16].

D. Adaptively Online Topic Modeling for short texts
The problem with online topic modeling algorithms is that they do not consider or compare the varying consequential correlation among all the preceding time-slices or versions of the short texts while inferring topics for the latest time-slice. For example, in OBTM, this limitation transpires as the topic model generates the topic distribution of a time-slice by making it directly dependent on the topic distribution of the preceding time-slice. If new topics are being introduced in each time-slice, the k-th topic in the latest time-slice would be significantly different from that of the first time-slice. We cannot reliably compare the distribution of the k-th topic between two non-consecutive time-slices, as OBTM does not impose or investigate the varying correlation between them. In our proposed method, we aim to alleviate the mentioned problem by adaptively integrating the topic distributions of all the previous time-slices, with their respective weights as contributions, for generating the prior distribution of the latest, t-th, time-slice. This way, we can warrant both coherence of specific topics and consistency in the topics' sequence across all the available versions. The adaptiveness enables us to compare topic distributions of any two different time-slices reliably.

Fig. 1. Overview of the Framework
In Figure 1, we show an overview of the framework. Here, version-tagged short texts are processed and fed into the AOBTM algorithm to find a better topic distribution for the latest version by leveraging previous versions' statistical data. The details of each part are discussed in Sections III and IV.

III. ADAPTIVE ONLINE BITERM TOPIC MODEL
In this section, we discuss the details of the Adaptive Online Biterm Topic Model (AOBTM). This method introduces adaptiveness to give the online algorithm, OBTM, a version or time-slice sensitivity, so that the prior topic distribution of the latest time-slice takes the varying (topic-distribution-wise) contributions of the previous time-slices into account. After setting the prior, we train the model to find the final topic distribution. The details of the proposed method are described below.
A. Applied Biterm Extraction technique
We have adopted the definition of Biterm from [16], where it denotes an unordered term-pair co-occurring in a small, fixed-size window over a term sequence. The fixed-size window is referred to as a short-context. The optimal size of the short-context varies from dataset to dataset and can be considered an important parameter setting. In a given short-context, an unordered pair of any two distinct terms can form a biterm. For example, a short-context with size = 3 generates the following biterms: (w1, w2, w3) ⇒ {(w1, w2), (w1, w3), (w2, w3)}.
In previous works on topic labeling (e.g., [27], [28]), inferred topics have been labeled with the term(s) that has (have) a more significant contribution to the respective topics. Here, term denotes any non-redundant word in the document which cannot be found in the Natural Language Toolkit's (NLTK) stop-word list. But Gao et al. [9] showed that the most contributing singular term or their combination could not adequately represent the respective topic. So, instead of using singular terms, we use meaningful phrases to label the topics, where a Phrase refers to two frequently co-occurring words. To ensure the comprehensibility of the extracted phrases, we use a Pointwise Mutual Information (PMI)-based phrase extraction method [29], where a higher frequency of two words' co-occurrence warrants the generation of a more meaningful phrase. For our model, we have empirically set our frequency threshold to 24. After identifying the phrases, we convert them into single terms using '_' (e.g., w1_w2), to train them along with other terms using our algorithm. During biterm extraction, words constructing the phrases are also considered as terms when they appear outside identified phrases. We train the phrases to capture their underlying semantics, which, in turn, would help us to label the topics with the most relevant phrases. We will further demonstrate this modification's impact in the experiment section.

B. Model Description
To alleviate the data-sparsity problem faced by AOLDA and to capture more coherent, comprehensible, and discriminative topics, we propose an adaptive online topic modeling method, AOBTM, which improves OBTM by adaptively combining the topic distributions in previous versions. The details of the proposed AOBTM method are described in Figure 2.
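AOBTM's core step, softmax-weighting the previous win versions' topic-word distributions to form the new prior, can be sketched numerically as follows. This is an illustrative sketch under assumptions: function names and the plain-list representation of Φ and β are ours, and subtracting the maximum before exponentiating is only a numerical-stability device that leaves the softmax unchanged.

```python
import math

def softmax_weights(phi_prev, beta_prev):
    """Weight for each previous version of topic k: a softmax over the
    dot-product similarity between that version's topic-word distribution
    and the (t-1)-th version's prior."""
    sims = [sum(p * b for p, b in zip(phi, beta_prev)) for phi in phi_prev]
    m = max(sims)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

def adaptive_prior(phi_prev, beta_prev, n_wk):
    """Prior for topic k in the t-th version: weighted sum of the previous
    versions' distributions plus the current topic-word counts."""
    gamma = softmax_weights(phi_prev, beta_prev)
    return [sum(g * phi[w] for g, phi in zip(gamma, phi_prev)) + n_wk[w]
            for w in range(len(n_wk))]

# Two previous versions over a 2-term vocabulary; the version more similar
# to the (t-1)-th prior receives the larger weight.
phi_prev = [[0.7, 0.3], [0.5, 0.5]]    # distributions of versions t-1, t-2
beta_prev = [0.6, 0.4]                 # prior of the (t-1)-th version
print(softmax_weights(phi_prev, beta_prev))
print(adaptive_prior(phi_prev, beta_prev, n_wk=[2, 1]))
```

With a single previous version the weight collapses to 1 and the update reduces to an OBTM-style prior, i.e., the previous distribution plus the current counts.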
Fig. 2. Overview of AOBTM. The red rectangle highlights the adaptive integration of the topics of the win previous versions for generating the prior Φ in the t-th version.

After the preprocessing, the short texts are separated into different time-slices or versions, and input into AOBTM sequentially. AOBTM treats the short-text set from each time-slice as a separate corpus. We denote the whole corpus as R = R_1, R_2, ..., R_t, where t indicates the t-th time-slice. Following the literature, we denote the prior distributions over corpus-topics as α and the prior distributions over topic-words as β; both α and β are defined initially. The topic-word distributions determine the topic's distribution over all the non-redundant terms (including the phrases) that appear in the corpus. The number of topics is specified as K. For the k-th topic, Φ^t_k is the probability distribution vector over all the input terms in the t-th time-slice. We introduce a new parameter, win (window size), which defines the number of previous versions to be considered for inferring the topic distributions of the current version. The overview of the AOBTM model is depicted in Figure 2. Different from OBTM (and similar to AOLDA), as Figure 2 shows, we adaptively integrate the topic distributions of the previous win versions, denoted as Φ^{t−1}, Φ^{t−2}, ..., Φ^{t−i}, ..., Φ^{t−win}, for generating the prior, β^t, for the t-th version. The adaptive integration sums up the topic distributions of different versions with different weights, γ^{t,i}:

\beta^t_k = \sum_{i=1}^{win} \gamma^{t,i}_k \Phi^{t-i}_k + n^t_{w|k} \qquad (1)

Here, i denotes the i-th previous version (1 ≤ i ≤ win), and n^t_{w|k} denotes the number of times word w is assigned to topic k in time-slice t. The weight γ^{t,i}_k is determined by considering the similarity of the k-th topic between the (t−i)-th version and the (t−1)-th version, which is calculated by the following softmax function:

\gamma^{t,i}_k = \frac{\exp(\Phi^{t-i}_k \cdot \beta^{t-1}_k)}{\sum_{j=1}^{win} \exp(\Phi^{t-j}_k \cdot \beta^{t-1}_k)} \qquad (2)

Algorithm 1: Adaptive Online BTM
Input: K, win, α, β, B^(1), ..., B^(T)
Output: {Φ^(t), θ^(t)}_{t=1}^{T}
Set α^(1) = (α, ..., α) and {β^(1)_k = (β, ..., β)}_{k=1}^{K}
for t ← 1 to T do
    Randomly assign topics to biterms in B^(t);
    for iter ← 1 to N_iter do
        foreach biterm b_i = (w_{i,1}, w_{i,2}) ∈ B^(t) do
            Draw topic k from Eq. 3;
            Update n^(t)_k, n^(t)_{w_{i,1}|k}, and n^(t)_{w_{i,2}|k};
        end
        Set α^(t+1) by Eq. 4;
        Set {β^(t+1)_k}_{k=1}^{K} by Eq. 1;
    end
    Compute Φ^(t) by φ_{k,w} = (n_{w|k} + β) / (n_{·|k} + Wβ); (refer to [16])
    Compute θ^(t) by θ_k = (n_k + α) / (N_B + Kα); (refer to [16])
end

In Equation 2, [Φ^{t−i}_k · β^{t−1}_k] represents an Einstein summation and computes the similarity between the topic distribution Φ^{t−i}_k and the prior of the (t−1)-th version, β^{t−1}_k. This adaptive aggregation allows the topics of the previous versions to endow different contributions to the topic distributions of the current version. The steps are shown in Algorithm 1. In Algorithm 1, B denotes the biterm collection. Here, N_B is the number of biterms, and b_i denotes a biterm with two terms, w_{i,1} and w_{i,2}, in the i-th biterm. We use W as the total number of words in the vocabulary, and θ^(t) as a K-dimensional multinomial distribution which denotes the corpus-topic distribution for a time-slice. Here, n_k, n_{k|d}, and n_{w|k} denote the number of words in topic k, the number of words in document d assigned to topic k, and the number of times word w is assigned to topic k, respectively.
In Algorithm 1, the topics are drawn from Eq. 3, and the prior distribution α for the latest time-slice is calculated using Eq. 4:

P\big(z_i = k \mid \mathbf{z}^{(t)}_{-i}, B^{(t)}, \alpha^{(t)}, \{\beta^{(t)}_k\}_{k=1}^{K}\big) \propto (n^{(t)}_{-i,k} + \alpha^{(t)}_k) \, \frac{(n_{-i,w_i|k} + \beta^{(t)}_{k,w_i})(n_{-i,w_j|k} + \beta^{(t)}_{k,w_j})}{\big[\sum_{w=1}^{W}(n_{-i,w|k} + \beta^{(t)}_{k,w})\big]^2} \qquad (3)

\alpha^{(t+1)}_k = \alpha^{(t)}_k + n^{(t)}_k \qquad (4)

where z ∈ [1, K] refers to the topic indicator variable and P(z) refers to the prevalence of topics in the corpus. We use symmetric Dirichlet distributions as the initial priors by setting α = (α, ..., α) and β_k = (β, ..., β). Given α^t and [β^t_k]_{k=1}^{K}, we iteratively draw topic assignments for each biterm b_i ∈ B^t, according to the conditional distribution stated in Eq. 3. Once iterations are completed, we obtain the counts n^t_k and n^t_{w|k}. We adjust the hyperparameters α^t and [β^t_k]_{k=1}^{K} for time-slice (t + 1) by setting α^(t+1)_k and β^(t+1)_k using Eq. 4 and Eq. 1, respectively. The derivation of Eq. 3 and Eq. 4 can be found in [16].

C. AOBTM Complexity and Comparison with Baselines
In this section, we discuss the details of the running time and memory requirements for AOBTM and compare them with different batch, online, and adaptive online algorithms. We have listed the time complexity and the number of in-memory variables for different topic models in Table I.
In the following discussion, \bar{l} refers to the average document length, and N_D refers to the number of documents in the corpus. We can assume that all the documents in the short-text corpus have almost the same length [16], [21]. It is reasonable to infer N_B (the number of biterms in the corpus) using this assumption, as we apply N_B only for the topic models which are devised for short texts (i.e., BTM, OBTM, and AOBTM). According to our assumption, each document with length \bar{l} would produce \bar{l}(\bar{l}-1)/2 biterms; so, we have the equivalence N_B ≈ N_D · \bar{l}(\bar{l}-1)/2. Furthermore, in Table I, win denotes the user-defined window size (the number of previous versions to consider) in the adaptive online algorithms, W denotes the total number of terms, and v refers to the number of available time-slices.

TABLE I
TIME COMPLEXITIES AND THE NUMBER OF IN-MEMORY VARIABLES IN DIFFERENT TOPIC MODELS

Methods | Time Complexity                               | In-Memory Variables
LDA     | O(N_iter K N_D \bar{l})                       | N_D K + WK + N_D \bar{l}
BTM     | O(N_iter K N_B)                               | K + WK + N_B
OLDA    | O(N_iter K |N^(t)_D \bar{l}^(t)|)             | N_D K + WK + |N^(t)_D \bar{l}^(t)|
OBTM    | O(N_iter K |N^(t)_B|)                         | K + WK + |N^(t)_B|
AOLDA   | O(N_iter K |N^(t)_D \bar{l}^(t)| + vKW)       | N_D K + vWK + |N^(t)_D \bar{l}^(t)|
AOBTM   | O(N_iter K |N^(t)_B| + vKW)                   | K + vWK + |N^(t)_B|

Time Complexity.
The most time-consuming part in these topic models is the component calculating the conditional probability of topic assignments, which requires O(K) time. While LDA draws a topic for each word occurrence, BTM draws a topic for each biterm. So, the overall time complexities for LDA and BTM turn out to be O(N_iter K N_D \bar{l}) and O(N_iter K N_B), respectively [14], [16]. From our previous assumption-based calculation of N_B, we can further expand the time complexity of BTM to O(N_iter K N_D \bar{l}(\bar{l}-1)/2), which is approximately (\bar{l}-1)/2 times the time complexity of LDA. As BTM works with short texts, where the value of \bar{l} is considerably small, the run-time of BTM can still be compared to that of LDA [21].
The online algorithms, such as OLDA and OBTM, deal with the documents and short texts, respectively, present in the latest time-slice. In Table I, we have used the superscript t to denote the latest time-slice or version. But the adaptively online algorithms (i.e., AOLDA, AOBTM) compare and determine the contributions of the previous v topic-word distributions for different time-slices, which requires an additional O(vKW) time.

Number of Variables Stored in Memory.
LDA maintains the following counts in cached memory: the number of words in a document d assigned to topic k, n_{k|d} (= N_D K), and the number of times word w is assigned to topic k, n_{w|k} (= WK). LDA also stores the topic assignment for each word occurrence (= N_D \bar{l}) [30]. On the other hand, BTM stores the following variables: the number of topics, n_k (= K), the number of times word w is assigned to topic k, n_{w|k} (= WK), and the topic assignment for each biterm (= N_B) [16].
Unlike the batch algorithms, online topic models do not require running over all documents (in the case of OLDA), or all biterms (in the case of OBTM), observed up to the latest time-slice. Instead, OLDA only iteratively runs over the words present in the current time-slice's documents, whereas OBTM only iterates over the biterm set in the latest time-slice. These online algorithms require an almost constant memory cost to update the models, since the number of documents, their average length, and the number of biterms are often stable [20], [21].
In the adaptively online algorithms, topic-word assignments for different versions are compared, weighted, and combined to set the prior topic-word distribution of the latest time-slice. Therefore, the counts n_{w|k} (= WK) for all the previous time-slices need to be stored in cache memory. As win ∈ [1, ..., v], we consider vWK as the counts stored in memory.
From Table I, we can see that AOBTM's time complexity is higher, but it is comparable to the other algorithms while dealing with a smaller number of short texts. In practice, the number of texts is bound to decline as they are separated into different versions or time-slices. On the other hand, AOBTM has to store some additional variables to accommodate adaptiveness, yet it incurs less memory cost than the other adaptive online algorithm, AOLDA.

IV. ALGORITHMS TO FIND OPTIMAL NUMBER OF TOPICS TO INFER AND PREVIOUS VERSIONS TO CONSIDER
Two parameters in the adaptive online topic modeling method play key roles in the quality of the topics discovered: (i) the number of topics to derive, and (ii) the number of previous versions to consider for adaptive integration. In previous studies, the values of these crucial parameters were set via an informed guess established from manual examinations performed over the dataset [10]. We propose algorithms to determine the values of these parameters automatically. Before developing algorithms to find optimal values for these parameters, we need to determine a suitable evaluation metric for measuring the quality of discovered topics.
Perplexity (or marginal likelihood) evaluated on a held-out test set has been utilized in many studies to assess the effectiveness and efficiency of a generated topic model [14], [31], [32]. But minimized perplexity as a metric is not suitable for our approach for the following reasons. First, the mentioned studies focused on LDA-based topic models where the likelihood of word occurrences in documents is optimized, whereas, in our approach, the likelihood of biterm occurrences in the latest time-slice is optimized. Second, it was argued in [33] that topic models with better held-out likelihood might infer less semantically meaningful topics, which deviates from our underlying expectations of topic models (e.g., better interpretability and coherence of the derived topics).
For our purpose, we can use
PMI-Score measures the coherence of a topicbased on point-wise mutual information using large scale textdatasets from external sources, such as Wikipedia [29]. Thisidea resonates with the underlying assumption of our approach,which maintains that words co-occurring more frequently inan external dataset, should be more likely to belong to thesame topic. Since the external dataset is model-independent,the generated PMI-Score would fluctuate consistently fordistinct topic models with different parameter values [16].Therefore, we exploit PMI-Score to evaluate the discoveredtopic quality, which measures the pairwise association among T most contributing words in a discovered topic, k : PMI-Score ( k ) = T ( T − (cid:80) ≤ i An inadequate number of topics could render our topicmodel too coarse to identify distinct and particular topics.Conversely, an inordinate number of topics could deliver amodel that is too involved, making subjective validation andinterpretation difficult. https://github.com/martin-majlis/Wikipedia-API Algorithm 2: Optimal Number of Topics Input : InputArr, iter, span, dataset Output: optT opicNum Set maxP MI ← . , optV al ← InputArr [0]; ) for i ← to InputArr.size () − do //Each thread works on 1 element of InputArr PMI sum ← for j ← to iter − do Set threadId ← omp get thread num (); Set topicNum ← InputArr [ threadId ]; Set K in Algorithm 1 with topicNum ; { Φ ( t ) , θ ( t ) } Tt =1 ← Run AOBTM (Algorithm 1); PMI[ K ] ← new array of double; for t ← to K − do PMI[ t ] ← by Eq. 
5; end PMI score ← K (cid:80) k PMI-Score [ k ] ; PMI sum + = PMI score; end PMI final ← PMI sum / iter ; if (PMI final > maxP MI ) then optVal ← topicNum; maxPMI ← PMI final; end ) for i ← optV al − (cid:100) span/ (cid:101) to optV al + (cid:100) span/ (cid:101) do Set threadId ← omp get thread num (); Set tmp ← threadId + optV al − (cid:100) span/ (cid:101) ; if ( tmp == optV al ) then break; else repeat lines 4 to Set topicNum ← tmp ; repeat lines 7 to end optT opicNum ← topicNum ; To estimate the most appropriate number of topics for ourtopic modeling approach, we propose a 2-step parallel algo-rithm. For the parallelization, we have employed OpenMP, aset of compiler directives and an API for our program (writtenin C++) that provides support for multi-platform, multiprocess-ing programming in shared-memory environments. OpenMPenabled us to write the algorithm so that the multithreadingdirectives are skipped (or replaced with regular arguments) inthe machines that do not have OpenMP installed. The designedalgorithm to determine the optimal number for topics inferenceis provided in Algorithm 2.The first step of our parallel algorithm takes an array ofintegers, InputArr . This array stores candidate number oftopics, such as [ n , n ,..., n t ], where n is an integer and t isthe array-size . If core refers to the number of CPU-coresavailable, it is advisable to limit t within [2 , ( cores − .This limit warrants that only one core would be assignedfor each element in the array. Each core, in turn, builds iter number of AOBTM models and calculates respective PMI-Scores. If we independently train AOBTM multiple times, onthe same dataset with the same number of topics, we endup with slightly different PMI-Scores with each independenttraining. 
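Under the assumption that word and word-pair probabilities from the external corpus are available as dictionaries, the PMI-Score and the average-then-argmax selection of the first phase can be sketched as below. The names are illustrative, and the sequential loop stands in for the paper's parallel OpenMP threads with a max-reduction.

```python
import math
from itertools import combinations

def pmi_score(top_words, p_word, p_pair):
    """Average pairwise PMI over the T most probable words of one topic,
    with probabilities estimated from an external corpus (e.g. Wikipedia)."""
    T = len(top_words)
    total = sum(math.log(p_pair[frozenset((wi, wj))] / (p_word[wi] * p_word[wj]))
                for wi, wj in combinations(top_words, 2))
    return 2.0 * total / (T * (T - 1))

def best_topic_number(candidates, train_and_score, iters=3):
    """For each candidate K, average the PMI-Score of `iters` independent
    trainings and keep the maximizer (sequential stand-in for phase 1)."""
    best_k, best_pmi = None, float("-inf")
    for k in candidates:
        avg = sum(train_and_score(k) for _ in range(iters)) / iters
        if avg > best_pmi:
            best_k, best_pmi = k, avg
    return best_k

# Toy check: one strongly associated pair among three equally frequent words.
p_word = {"a": 0.1, "b": 0.1, "c": 0.1}
p_pair = {frozenset(("a", "b")): 0.05,
          frozenset(("a", "c")): 0.01,
          frozenset(("b", "c")): 0.01}
print(round(pmi_score(["a", "b", "c"], p_word, p_pair), 3))
# → 0.536
```

Averaging over repeated trainings stabilizes the metric, since Gibbs-sampled models yield slightly different scores on each run.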
So, we designed our algorithm in such a way that, for one candidate number of topics, each core builds iter models and generates separate PMI-Scores. We stabilize the metric for the corresponding candidate number by taking an average of the distinct PMI-Scores. We find the optimal candidate number (optVal) through reduction (an OpenMP operator) after all the threads finish their execution.
In the second step of our algorithm, we fine-tune near the optimal candidate number (optVal) to determine the final value of the optimal topic number, optTopicNum. The user may specify span with an integer, which defines the breadth of the grid search around the optVal observed in the algorithm's first step. Without any user specification, span is automatically set to cores − 1. For each integer in the range of span around optVal, we repeat the procedure of the first step to determine the optimal value and set it as the appropriate number of topics.
Fig. 3 illustrates the first and second phases of Algorithm 2, respectively. In essence, the first phase determines the appropriate number of topics for inference (optVal) by evaluating the elements of InputArr; the second phase determines the optimal topic number by evaluating the integers around optVal. The algorithm ensures that, in each phase, one core evaluates only one integer topic number.

Fig. 3. Two phases of Algorithm 2, determining the optimal topic number. Each element of InputArr is handled by one CPU core. Circle-max represents the reduction operator.

B. Algorithm Determining Number of Versions to Consider

Earlier, we discussed how previous versions or time slices make different contributions to the topic distributions of the latest time slice. In AOBTM, we form the prior topic distributions for the t-th time slice by taking the weighted contributions of the previous win time slices into account. Users can define the parameter win ∈ [1, v − 1] in Algorithm 1, where v denotes the number of available time slices.
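One way to picture this windowed contribution is to mix the topic-word distributions Φ of the last win slices into the Dirichlet prior of the current slice. The linearly increasing, normalized weights below are only an illustrative assumption; the actual weighting is defined by AOBTM (Algorithm 1):

```python
import numpy as np

def adaptive_prior(phis, win, base_beta=0.01):
    """Mix the (K, V) topic-word matrices of the last `win` time slices into
    an asymmetric prior for the current slice. Linearly increasing weights
    (newer slices count more) are an illustrative assumption."""
    recent = phis[-win:]
    weights = np.arange(1, len(recent) + 1, dtype=float)
    weights /= weights.sum()                     # normalized, newest slice heaviest
    mixed = sum(w * phi for w, phi in zip(weights, recent))
    return base_beta + mixed                     # each row sums to base_beta * V + 1

rng = np.random.default_rng(0)
K, V = 4, 50
phis = [rng.dirichlet(np.ones(V), size=K) for _ in range(5)]  # 5 previous slices
prior = adaptive_prior(phis, win=3)
```

Setting win = 1 recovers an OBTM-style dependence on the immediately preceding slice only.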
To make an educated decision about the parameter win, we can analyze the change in PMI-Scores for different values of win. We build v − 1 AOBTM models for the latest time slice with different values of win and calculate the PMI-Scores. In the parallel block, we use the OpenMP reduction clause to find the maximized PMI-Score. If OpenMP is not enabled on a machine, we store the scores in an array, where each score's index corresponds to the number of considered versions. Then, with one scan of the array, we determine the cutoff point where the PMI-Score dropped and did not rise again.
We propose Algorithm 3 to automatically determine the appropriate number of previous versions to consider.

Algorithm 3: Optimal Number of Versions
Input : dataset, v, iter
Output: optVerNum
1: Set maxPMI ← 0.0, optVerNum ← 1;
2: (parallel) for i ← 0 to v − 2 do
3:   Set threadId ← omp_get_thread_num();
4:   Set win ← threadId + 1;
     /* Each thread runs AOBTM iter times by taking win previous versions into account, calculates the PMI-Scores, and saves their average in PMI_final (repeating lines 5 to 18 of Algo. 2); */
5:   if (PMI_final > maxPMI) then
6:     optVerNum ← win; maxPMI ← PMI_final;
7:   end
8: end

V. EXPERIMENTS AND RESULTS

In this section, we evaluate the performance of AOBTM in identifying consistent and distinctive latent topics from corpora comprising short text documents. We explain the datasets and compare the results of different topic modeling algorithms. Our focus is to answer the following research questions:
• RQ1: Can AOBTM achieve better performance compared to other topic modeling methods?
• RQ2: How do different parameter settings, document lengths, and preprocessing approaches impact the performance of AOBTM?
• RQ3: Using the parameters set by our parallel algorithms, how discriminative and coherent are the topics discovered by different topic modeling methods?

A. Setups

1) Datasets: To show the effectiveness of our approach, in addition to using app reviews, we use a large dataset of Twitter microblogs.
Tweets are considered short texts, and evaluation on this dataset can show the applicability of AOBTM to short text analysis. The details of the datasets are as follows:
App Reviews from Apple Store and Google Play. We use the dataset provided by Gao et al. [10], which was previously studied to evaluate AOLDA for extracting topics from app reviews. The dataset includes reviews that are related to a number of versions of the collected apps. The subject apps are distributed across different categories and platforms; this choice ensures the generalizability of our approach. We enriched the provided datasets by adding the user reviews collected from the latest versions of the subject apps. However, one of the apps in this dataset, "Clean Master," was discontinued, and we could not acquire its app changelogs. Another app in the provided dataset, "eBay," had an enormous number of its app reviews pulled from the app stores. As the changelogs are critical to our evaluation metrics, we decided to discard these two apps from the evaluation. We double-checked the provided app reviews and changelogs in the dataset against the app stores and discarded the ones that could not be found. Table II summarizes the specifications of the app review datasets.
Tweets2020 is a collection of approximately 200,000 tweets scraped from Twitter between January 1st and May 20th, 2020, where each month is considered a time slice. For the collection of the tweets, we used an open-sourced twitter-scraper. We used 300 top trending topics over the region of North America to collect the tweets with timestamps. Besides the content, each tweet includes the user id, timestamp, number of retweets, and likes.
The user review and tweet collections contain many noisy words, such as repetitive words, casual words, misspelled words, and non-informative words (e.g., "normally"). We performed common text preprocessing techniques, including removing meaningless words, lowercasing, lemmatization, and digit and name replacement, following [35].
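A minimal sketch of the lightweight normalization steps just listed; lemmatization and name replacement are omitted here, since they depend on external NLP resources:

```python
import re

def preprocess(docs):
    """Sketch: lowercase, replace digit runs with a <digit> placeholder,
    drop single-word documents and exact duplicates. Lemmatization and
    name replacement are omitted (they need external NLP resources)."""
    seen, cleaned = set(), []
    for doc in docs:
        text = re.sub(r"\d+", "<digit>", doc.lower())
        if len(text.split()) < 2 or text in seen:  # single-word doc or duplicate
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

docs = ["Great app!!", "great APP!!", "Crashes after update 5.2.1", "Bad"]
out = preprocess(docs)
# -> ['great app!!', 'crashes after update <digit>.<digit>.<digit>']
```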
We apply a preprocessing technique for lemmatization and replace all digits with "<digit>". We also removed duplicate records and documents with a single word.

TABLE II
SUBJECT APPS FROM DIFFERENT APP-STORES (columns: App Name, Category, Platform)

2) Baselines: We select LDA [14], BTM [16], OLDA [20], OBTM [21], and AOLDA [10] as our baseline methods to evaluate the performance of AOBTM. The details of the baseline algorithms are explained in Section II. All the experiments were carried out on a Linux machine with an Intel 2.21 GHz CPU and 16 GB memory. Following the literature [10], we have used all algorithms implemented with Gibbs sampling in C++.
3) Evaluation Metrics: Good topic models deliver coherent [28] and discriminative topics, which cover unique and comprehensive aspects of the corpus [10]. So, we utilized the PMI-Score as a measure of coherence [16] and the Discreteness Score (Dis_Score), inspired by the semantic similarity mapping in [10], to measure the discriminative property of the derived topics. Higher values of PMI-Score and Dis_Score suggest the discovery of more coherent and discriminative topics. We also present time cost (seconds) per iteration (Time Cost in Table III) as a third performance metric. We picked the top 10 terms from each generated topic to calculate the PMI-Scores, as explained in Section IV. For calculating Dis_Score, we use the Jensen-Shannon (JS) divergence D_JS [36] to estimate the difference between two topic distributions (Φ). The equations are provided below:

Dis_Score = (1/K) Σ_{k=1}^{K} [ (1/K) Σ_{j=1, j≠k}^{K} D_JS(Φ^t_k || Φ^t_j) ];   (6)

D_JS(φ^t_k || φ^t_j) = (1/2) D_KL(φ^t_k || M) + (1/2) D_KL(φ^t_j || M);   (7)

D_KL(P || Q) = Σ_i P(i) log( P(i) / Q(i) );   M = (1/2)(φ^t_k + φ^t_j).   (8)

Eq. 7 elaborates the D_JS of Eq. 6. Eq. 8 defines D_KL (the Kullback-Leibler divergence) and the M of Eq. 7.
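Eqs. 6-8 translate directly into code. The 3-topic Φ below is a toy matrix; in practice each row is a topic-word distribution from the model, and zero entries would need smoothing before taking logarithms:

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) of Eq. 8 for discrete distributions with positive entries."""
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """D_JS of Eq. 7: symmetrized KL against the midpoint distribution M."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dis_score(phi):
    """Dis_Score of Eq. 6 for a (K, V) topic-word matrix Phi."""
    K = phi.shape[0]
    total = 0.0
    for k in range(K):
        inner = sum(js(phi[k], phi[j]) for j in range(K) if j != k)
        total += inner / K
    return total / K

phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1],
                [0.2, 0.2, 0.6]])
score = dis_score(phi)  # larger when the topic distributions differ more
```

Note that D_JS is symmetric and bounded above by log 2, so Dis_Score is bounded as well.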
In Eq. 6, the inner term, (1/K) Σ_{j=1, j≠k}^{K} D_JS(Φ^t_k || Φ^t_j), measures the average difference between a single topic distribution and the rest of the topic distributions. In Eq. 8, P(i) (or Q(i)) is the i-th item of P (or Q).
To provide a better comparison, we adopted three more performance metrics used in [10] to evaluate the performance of AOLDA: Precision_E, Recall_L, and F_hybrid. Further details about these metrics are discussed in [10]. For app reviews and Twitter data, we have taken app changelogs and popular hashtags, respectively, as our ground truths. Here, Precision_E measures the accuracy in detecting emerging topics in the latest time slice, t [10]. Recall_L evaluates whether our prioritized topics (including both emerging and non-emerging) reflect the changes mentioned in the changelogs or hashtags. A higher F_hybrid suggests that the changelogs and hashtags are more explicitly covered by the detected topics. A higher F_hybrid score also signifies that the prioritized issues reflect more of the changelog and hashtag contents [10].

https://pypi.org/project/twitter-scraper/
Code of BTM: http://code.google.com/p/btm/

B. Result of RQ1: Comparison Results with Different Methods

Table III presents the evaluation results, where P_E, R_L, and F_h refer to Precision_E, Recall_L, and F_hybrid, respectively.

TABLE III
COMPARISON RESULT OF DIFFERENT METHODS (PMI-Score column; other entries omitted)

App-Name (~texts per version)   LDA    AOBTM
Tweets2020 (~39,803)            2.03   2.13
NOAA Radar (~632)               1.34   1.48
Youtube (~1,272)                1.56   1.66
Viber (~2,147)                  1.65   1.89
Swiftkey (~1,360)               1.53   1.71

During the evaluation for RQ1, we set the parameters as win = 3 and K = 10 for the adaptive online algorithms for the sake of uniformity. We initialized α and β for the LDA-based methods with the values that achieved the best performance for short texts in [16].
We have set α = 50/K and β = 0.01 for the BTM-based algorithms [16]. From Table III, we observe that AOBTM delivers the highest PMI-Scores on every dataset by alleviating the data sparsity problem and considering the varying contributions of different time slices or versions. So, the topics discovered by AOBTM are more coherent and comprehensible. For discriminative topic learning, AOBTM performs better than the other methods, except on the Tweets2020 dataset. The large amount of short texts per time slice in the Tweets2020 dataset helps AOLDA learn better document-level word correlations and infer more discriminative topics; still, AOLDA did not generate a higher PMI-Score than AOBTM for Tweets2020.
From the results, it is apparent that AOBTM exceeds the benchmark methods, including its online version, OBTM, as well as AOLDA. Although AOBTM marginally improved on the performance of AOLDA, the performance improvement of AOLDA over OLDA is approximately the same as that of AOBTM over OBTM. AOBTM also generated the highest scores for every dataset for Precision_E, Recall_L, and F_hybrid, which indicates that our topic model can select emerging topics more precisely.
We acknowledge that AOBTM is more time-expensive than all the other baselines, but the runtime is comparable to the adaptive online methods when the dataset is small. From Table III, we observe that the difference in runtime between AOLDA and AOBTM is trivial; AOBTM even outperforms AOLDA in runtime on the NOAA Radar dataset, which has the lowest average number of short texts per version.

C. Result of RQ2: Effect of Different Parameter Settings, Document Lengths and Preprocessing Approaches

Effect of Different Parameter Settings. In Fig. 4 and Fig. 5, we compare our method with AOLDA, as the number of previous versions to consider is unique to the adaptive online algorithms. In Fig. 4, we consider a distinct uniform topic number for each dataset and calculate PMI-Scores for different numbers of previous versions.
The topic number for each dataset is calculated by Algorithm 2. In Fig. 5, we consider a fixed window size (number of versions) and calculate the PMI-Scores for a varying number of topics.
We can observe that AOBTM in general generates the highest PMI-Scores, and the trendlines of both methods are analogous. In Fig. 4, the declines in the performance of AOBTM (e.g., win = 25 in the YouTube dataset) can occur for the following reasons: the emergence of an unrelated novel topic in the recently considered versions (e.g., from the 20th to 25th versions in the YouTube dataset), content drifting, and a higher occurrence of meaningless texts in the newly included versions [21]. In Fig. 5, in all cases, the methods produce increasing PMI-Scores until the tipping point that generates the highest score. Once the methods reach their peak, a further increase in the number of topics generates incoherent, coinciding topics, inducing the reduction in PMI-Scores.

Fig. 4. PMI-Scores for varying number of considered versions or time-slices.

Fig. 5. PMI-Scores for varying number of topics.

Effect of Document Length. In Fig. 6, we present AOBTM's performance using PMI-Scores with respect to varying document lengths. We consider the average document length of each dataset to evaluate the considered methods. Tweets2020, NOAA Radar, Youtube, Viber, and Swiftkey have average document lengths of 68.3, 8.5, 13.6, 9.4, and 6.2, respectively. AOBTM performs better than the other methods for all datasets. It is worth noting that, as expected, the LDA-based methods performed well with large document lengths and bigger corpora, mostly because of the dataset's content richness and the abundance of document-level word co-occurrences.

Fig. 6. PMI-Scores for varying document lengths.

Effect of Preprocessing. In Section III, we mentioned a preprocessing technique with phrase extraction that can deliver more comprehensible top contributing terms in each discovered topic.
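The effect can be imitated with a toy collocation-style phrase extractor; the scoring rule below (pair count relative to individual word counts) is an illustrative assumption in the spirit of word2phrase-style scoring, not the exact technique of Section III:

```python
from collections import Counter

def extract_phrases(token_docs, min_count=2, threshold=2.0):
    """Join adjacent word pairs whose co-occurrence count, relative to the
    individual word counts, exceeds a threshold. Illustrative sketch only."""
    word_counts = Counter(w for doc in token_docs for w in doc)
    pair_counts = Counter((a, b) for doc in token_docs for a, b in zip(doc, doc[1:]))
    total = sum(word_counts.values())
    phrases = {pair for pair, n in pair_counts.items()
               if n >= min_count
               and n * total / (word_counts[pair[0]] * word_counts[pair[1]]) > threshold}
    out = []
    for doc in token_docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
                merged.append(doc[i] + "_" + doc[i + 1])  # keep the pair as one token
                i += 2
            else:
                merged.append(doc[i])
                i += 1
        out.append(merged)
    return out
```

Phrases merged this way ("stop_racism") are then trained alongside ordinary terms, which is what lets multi-word key terms surface in the topic representation.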
We implement AOBTM+ with the phrase-extraction preprocessing technique. In Table IV, we show how this technique lets AOBTM+ generate better topic words than AOBTM without phrase extraction. We selected a topic discovered from the Twitter dataset that is related to racism. We can see that extracting phrases during preprocessing and training them with the rest of the terms helps distinguish key terms for topic representation, which is captured only by AOBTM+.

TABLE IV
FIVE MOST CONTRIBUTING TERMS FROM A TOPIC FROM TWEETS

Methods   Key-Terms
AOLDA     hate, race, black, white, stop
AOBTM     black, hate, white, race, crime
AOBTM+    stop racism, black, stop hate, police brutality, white supremacy

D. Result of RQ3: Quality of Discovered Topics Using Parameters Determined by Proposed Algorithms

In [37]-[40], researchers have explored different ways to fine-tune topic models' parameters. Their basic approach is to train various topic models (with different parameter settings) over several iterations and select the one with the best performance. All the proposed procedures are computationally expensive, especially when executed sequentially. Moreover, all the existing libraries and packages that implement the mentioned procedures use LDA-based models [41]. So, we proposed two parallel algorithms, as described in Section IV, to automatically determine two important parameters of our approach: i) the number of topics to discover (K), and ii) the number of previous versions/time slices to consider (win). In Table V, time complexities are provided for both algorithms. It is worth noting that the proposed algorithms run sequentially if the environment is not set up to perform parallelism.
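The graceful degradation to sequential execution can be mirrored outside of OpenMP as well; a sketch (thread-based, since this illustration is in Python rather than the paper's C++):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, items, workers=None):
    """Evaluate fn over items with a thread pool when one is available,
    falling back to a plain sequential loop otherwise; this mirrors how
    the OpenMP directives are skipped on machines without OpenMP."""
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(fn, items))
    except Exception:  # no thread support available: run sequentially
        return [fn(x) for x in items]

# e.g., score every candidate topic number with the same evaluation function
scores = parallel_map(lambda k: k * k, [10, 20, 30])
```

Either path yields identical results; only the wall-clock time differs.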
TABLE V
TIME COMPLEXITIES FOR PROPOSED PARALLEL ALGORITHMS

              Time-Complexity
Algorithm 2   O( [iter + span] [ N_iter K |N_B^(t)| + vKW ] )
Algorithm 3   O( [iter] [ N_iter K |N_B^(t)| + vKW ] )

We have employed the proposed algorithms to determine the best values of the parameters K and win for each dataset. For Tweets2020, Youtube, Viber, NOAA, and Swiftkey, Algorithm 2 determined the corresponding numbers of topics to be derived as 31, 22, 18, 13, and 11, respectively, whereas Algorithm 3 determined the best win values to be considered as 5, 34, 9, 16, and 11, respectively. After setting the best detected parameters for the online and adaptive online algorithms, we calculated the PMI-Scores for each dataset and present the results in Fig. 7. The plot shows that AOBTM outperforms all the other online algorithms for all datasets. Compared to OBTM, AOLDA and OLDA perform better on datasets that contain longer documents. Furthermore, OBTM outperforms both of the LDA-based methods when it comes to the limited datasets containing short texts.

Fig. 7. PMI-Scores generated by setting the parameters determined by Algorithms 2 and 3.

E. Real-World Application

Prompt detection of emerging topics from user reviews is crucial for app developers, as these topics reveal users' requirements, preferences, and complaints [10]. Developers can proactively identify user complaints and take quick action to improve the user experience through efficient analysis of app reviews. Timely and precise identification of emerging issues helps developers fix bugs, refine existing features, and add essential functions in the subsequent update of the application. For this purpose, Gao et al. developed a framework named IDEA to detect emerging issues from the app reviews of popular applications [10].
IDEA collects user reviews after the publication of different versions of an app and implements AOLDA to get the topic-word distributions for the app reviews collected after the publication of the latest version. We have modified the open-source framework IDEA and incorporated AOBTM instead of AOLDA to generate the topic-word distributions. The rest of the framework's components, such as preprocessing, emerging topic identification, and topic interpretation, remain the same. The modified version is denoted as OPRA (Online aPp Review Analysis).

TABLE VI
FIVE MOST CONTRIBUTING TERMS FROM TWO SAMPLE TOPICS

Topics    IDEA       OPRA
Topic 1   password   zoombomb
          meeting    password
          abuse      security
          attack     policy
          policy     disturb
Topic 2   message    group chat
          status     message
          channel    notification
          chat       transfer
          link       link

Inspired by the Zoom case study explained in Section I, we collected around 15,000 app reviews for Zoom Cloud Meetings from Google Play. The average review length for this dataset is 7.8. These reviews were generated after the publication of the latest 5 versions of the app. For emerging topic detection, we set the parameters as win = 3 and K = 10 for a fair evaluation. We also changed the initial values of α and β to 0.1 and 0.01, respectively, as these values yielded the best performance for IDEA (implementing AOLDA) in [10].
In Table VI, we report the top 5 most contributing words from two topics generated by IDEA and OPRA: the first topic is closely related to app security, and the second topic is closely related to the messaging feature of the app. In Table VII, we report the corresponding PMI-Scores and time costs for both frameworks. We can see that applying AOBTM in the framework slightly increases the time cost, but generates more comprehensive and coherent topics.

TABLE VII
PMI-SCORES & TIME COST EVALUATION FOR ZOOM APP-REVIEWS

Frameworks   PMI Score   Time Cost   Precision_E   Recall_L   F_hybrid
IDEA         2.08        12.52       0.572         0.608      0.586
OPRA         2.35        16.8        0.593         0.619      0.608

VI.
RELATED WORK

For topic modeling of short texts, PLSA, LDA, and their variants suffer from the lack of sufficient word co-occurrences. To boost the performance of topic models, researchers have utilized external knowledge to produce supplementary essential word co-occurrences across short texts [42]-[44]. The problem is that such auxiliary information can be too scarce or too expensive (or both) to deploy.
In the short text topic modeling regime, Yin et al. introduced a DMM-based topic modeling method in [45], where it is presumed that each short text is sampled from only one latent topic. But this proved to be too simple and too strong an assumption for any reasonable short text topic model [46]. In self-aggregation based methods, short texts are merged into long pseudo-documents before topic inference to help develop rich word co-occurrence information. Researchers have used this type of method in [47], [48], where they have presumed that each short text is sampled from a long concatenated pseudo-document (unobserved in the current text collection). This presumption enables inferring latent topics from long pseudo-documents. But the concatenation yielded suboptimal results in [49], [50], as merging short texts into long pseudo-documents using word embeddings cannot alleviate the loss of auxiliary information or metadata. Global word co-occurrence based methods (e.g., [16], [51]) try to use the rich global word co-occurrence patterns for inferring latent topics, where the adequacy of these co-occurrences alleviates the sparsity problem of short texts. [16] posits that the two words of a biterm share the same topic, drawn from a mixture of topics over the whole corpus. This topic modeling algorithm is comparatively more robust and suitable for all the mentioned characteristics of short texts [52].
To solve the problems with streaming short texts and unordered topic generation, researchers proposed online models such as OLDA [20] and Online BTM [16].
In essence, online algorithms fit conventional topic models (i.e., LDA and BTM, respectively) over the data in a time slice t and use the inferred statistical data to adjust the Dirichlet hyperparameters for the next time slice.
Gao et al. [10] introduced Adaptively Online LDA (AOLDA) by factoring in the contributions of all the previous time slices, instead of just the preceding one. They have shown that the comparison among the topic distributions of more than two consecutive time slices/versions can lead to more coherent and distinguishable topic learning. But this method uses LDA as its underlying topic model, which suffers from several discussed issues when it comes to short texts [16].

VII. THREATS TO VALIDITY

Human Evaluation. In our experiments, we have only used deterministic scores to evaluate our results against the benchmarks. The authors of this paper have manually reviewed the outcomes of the considered topic models. We have consulted with fellow researchers about the model outcomes and have considered the opinions of industry developers, which confirmed that AOBTM generates more coherent key words for extracted topics. But we did not perform any formal human evaluation to assess how well our model performs in practice compared to others.
Datasets. Our proposed topic model works with text corpora distributed over different versions or time slices. Evaluating our model using only version-tagged app reviews from mobile applications that have multiple published versions in the app store would not give us a precise idea of how the algorithm works with time slices. In addition, we have evaluated our approach using only a few mobile applications, which might affect the generality of our model. To mitigate these problems, we have incorporated a new dataset with around 200,000 timeline-tagged tweets, scraped by considering 300 top trending topics from Canada and the USA.
Furthermore, we carefully selected apps so that we could demonstrate our topic model's performance for apps that have a small or large number of reviews per version (~623-2,147).
Ground Truth. To measure the extensibility of our topic model, we wanted to know how it scales to other online algorithms for prioritizing topics and detecting emerging ones. For the ground truth, we have used key terms from app changelogs for app reviews, as Gao et al. did in [10]. In order to calculate precision, recall, and F-score, app changelogs are used as the ground truth, similar to [10]. However, we did not have any changelogs or tweet summaries to take as the ground truth for Tweets2020. We manually selected top-trending hashtags over different time slices to mine the key terms. Other approaches for evaluation should be studied.
Memory and Time Cost. We acknowledge that our model cannot compete with the benchmarks when it comes to memory and time cost. Still, we are currently endeavoring to incorporate a word co-occurrence pattern algorithm to make our topic model faster while using significantly fewer resources.

VIII. CONCLUSION AND FUTURE WORK

In this paper, we proposed a novel adaptive topic modeling algorithm, AOBTM, which is able to discover coherent and discriminative topics from short texts. AOBTM addresses the problems of conventional topic models by adopting a version-sensitive strategy. Along with AOBTM, we use a preprocessing technique that enables capturing distinguishable terms in an extracted topic. Moreover, we implemented two parallel algorithms to automatically determine the values of the two most important parameters of our model. The results of several experiments on app review and Twitter datasets confirm the performance of AOBTM compared to the state-of-the-art algorithms.
We plan to improve the underlying BTM method using short text expansion and concept drift detection, and to integrate it with a topic visualization tool specifically designed for app reviews.
For the parallel algorithms, we plan to use GPU cores and a shared-memory cache to make the program run faster. We are currently endeavoring to incorporate a word co-occurrence pattern algorithm to make our topic model faster while using significantly fewer resources.

REFERENCES

[1] C. Gao, J. Zeng, D. Lo, C.-Y. Lin, M. R. Lyu, and I. King, "Infar: Insight extraction from app reviews," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 904-907.
[2] X. Gu and S. Kim, "What parts of your apps are loved by users?," in Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '15. IEEE Press, 2015, pp. 760-770. [Online]. Available: https://doi.org/10.1109/ASE.2015.57
[3] N. Chen, J. Lin, S. C. Hoi, X. Xiao, and B. Zhang, "Ar-miner: Mining informative reviews for developers from mobile app marketplace," in Proceedings of the 36th International Conference on Software Engineering, 2014, pp. 767-778.
[4] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, "A survey of app store analysis for software engineering," IEEE Transactions on Software Engineering, vol. 43, no. 9, pp. 817-847, 2016.
[5] F. Palomba, M. Linares-Vasquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia, "User reviews matter! Tracking crowdsourced reviews to support evolution of successful apps," IEEE, 2015, pp. 291-300.
[6] D. Pagano and W. Maalej, "User feedback in the appstore: An empirical study," IEEE, 2013, pp. 125-134.
[7] W. Maalej and H. Nabil, "Bug report, feature request, or simply praise? On automatically classifying app reviews," IEEE, 2015, pp. 116-125.
[8] L. Carreo and K. Winbladh, "Analysis of user comments: An approach for software requirements evolution," in International Conference on Software Engineering, 05 2013, pp. 582-591.
[9] C. Gao, W. Zheng, Y. Deng, D. Lo, J. Zeng, M. R. Lyu, and I.
King, "Emerging app issue identification from user feedback: Experience on WeChat," in Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, ser. ICSE-SEIP '19. IEEE Press, 2019, pp. 279-288. [Online]. Available: https://doi.org/10.1109/ICSE-SEIP.2019.00040
[10] C. Gao, J. Zeng, M. R. Lyu, and I. King, "Online app review analysis for identifying emerging issues," in Proceedings of the 40th International Conference on Software Engineering, ser. ICSE '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 48-58. [Online]. Available: https://doi.org/10.1145/3180155.3180218
[11] S. Hassan, C.-P. Bezemer, and A. E. Hassan, "Studying bad updates of top free-to-download apps in the google play store," IEEE Transactions on Software Engineering, 2019.
[14] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993-1022, Mar. 2003.
[15] T. Hofmann, "Probabilistic latent semantic indexing," SIGIR Forum, vol. 51, no. 2, pp. 211-218, Aug. 2017. [Online]. Available: https://doi.org/10.1145/3130348.3130370
[16] X. Cheng, X. Yan, Y. Lan, and J. Guo, "Btm: Topic modeling over short texts," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 12, pp. 2928-2941, 2014.
[17] J. Boyd-Graber and D. Blei, "Syntactic topic models," in Proceedings of the 21st International Conference on Neural Information Processing Systems, ser. NIPS'08. Red Hook, NY, USA: Curran Associates Inc., 2008, pp. 185-192.
[18] X. Wang and A. McCallum, "Topics over time: A non-markov continuous-time model of topical trends," in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '06. New York, NY, USA: Association for Computing Machinery, 2006, pp. 424-433. [Online]. Available: https://doi.org/10.1145/1150402.1150450
[19] L. Hong and B. D.
Davison, "Empirical study of topic modeling in twitter," in Proceedings of the First Workshop on Social Media Analytics, ser. SOMA '10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 80-88. [Online]. Available: https://doi.org/10.1145/1964858.1964870
[20] M. D. Hoffman, D. M. Blei, and F. Bach, "Online learning for latent dirichlet allocation," in Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS'10. Red Hook, NY, USA: Curran Associates Inc., 2010, pp. 856-864.
[21] X. Hu, H. Wang, and P. Li, "Online biterm topic model based short text stream classification using short text expansion and concept drifting detection," Pattern Recognition Letters, vol. 116, 10 2018.
[22] Q. Jipeng, Q. Zhenyu, L. Yun, Y. Yunhao, and W. Xindong, "Short text topic modeling techniques, applications, and performance: A survey," 2019.
[23] D. M. Blei, "Probabilistic topic models," Commun. ACM, vol. 55, no. 4, pp. 77-84, Apr. 2012. [Online]. Available: https://doi.org/10.1145/2133806.2133826
[24] P. Xie and E. P. Xing, "Integrating document clustering and topic modeling," in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, ser. UAI'13. Arlington, Virginia, USA: AUAI Press, 2013, pp. 694-703.
[25] P. Xie, D. Yang, and E. Xing, "Incorporating word correlation knowledge into topic modeling," in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[26] Proceedings of the 23rd International Conference on World Wide Web, ser. WWW '14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 539-550. [Online]. Available: https://doi.org/10.1145/2566486.2567980
[27] J. H. Lau, D. Newman, S. Karimi, and T. Baldwin, "Best topic word selection for topic labelling," in Proceedings of the 23rd International Conference on Computational Linguistics: Posters, ser. COLING '10. USA: Association for Computational Linguistics, 2010, pp.
605-613.
[28] Q. Mei, X. Shen, and C. Zhai, "Automatic labeling of multinomial topic models," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '07. New York, NY, USA: Association for Computing Machinery, 2007, pp. 490-499. [Online]. Available: https://doi.org/10.1145/1281192.1281246
[29] D. Newman, J. H. Lau, K. Grieser, and T. Baldwin, "Automatic evaluation of topic coherence," in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ser. HLT '10. USA: Association for Computational Linguistics, 2010, pp. 100-108.
[30] G. Heinrich, "Parameter estimation for text analysis," Technical report, Tech. Rep., 2005.
[31] A. Gruber, Y. Weiss, and M. Rosen-Zvi, "Hidden topic markov models," in Artificial Intelligence and Statistics, 2007, pp. 163-170.
[32] T. L. Griffiths, M. Steyvers, D. M. Blei, and J. B. Tenenbaum, "Integrating topics and syntax," in Advances in Neural Information Processing Systems, 2005, pp. 537-544.
[33] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, and D. M. Blei, "Reading tea leaves: How humans interpret topic models," in Proceedings of the 22nd International Conference on Neural Information Processing Systems, ser. NIPS'09. Red Hook, NY, USA: Curran Associates Inc., 2009, pp. 288-296.
[34] D. Mimno, H. M. Wallach, E. Talley, M. Leenders, and A. McCallum, "Optimizing semantic coherence in topic models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP '11. USA: Association for Computational Linguistics, 2011, pp. 262-272.
[35] Y. Man, C. Gao, M. R. Lyu, and J. Jiang, "Experience report: Understanding cross-platform app issues from user reviews," 2016, pp. 138-149.
[36] "Jensen-Shannon divergence," April 2020. [Online]. Available: https://en.wikipedia.org/wiki/Jensen-Shannon_divergence
[37] R. Arun, V. Suresh, C. E. Veni Madhavan, and M. N.
Narasimha Murthy,“On finding the natural number of topics with latent dirichlet allocation:Some observations,” in Proceedings of the 14th Pacific-Asia Conferenceon Advances in Knowledge Discovery and Data Mining - Volume Part I ,ser. PAKDD10. Berlin, Heidelberg: Springer-Verlag, 2010, p. 391402.[Online]. Available: https://doi.org/10.1007/978-3-642-13657-3 4338] J. Cao, T. Xia, J. Li, Y. Zhang, and S. Tang, “A density-based method for adaptive lda model selection,” Neurocomput. ,vol. 72, no. 79, p. 17751781, Mar. 2009. [Online]. Available:https://doi.org/10.1016/j.neucom.2008.06.011[39] R. Deveaud, E. SanJuan, and P. Bellot, “Accurate and effective la-tent concept modeling for ad hoc information retrieval,” Documentnum´erique , vol. 17, no. 1, pp. 61–84, 2014.[40] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedingsof the National academy of Sciences , vol. 101, no. suppl 1, pp. 5228–5235, 2004.[41] M. Ponweiser, “Latent dirichlet allocation in r,” WU Vienna UniversityJournal , 2012.[42] O. Jin, N. N. Liu, K. Zhao, Y. Yu, and Q. Yang, “Transferring topicalknowledge from auxiliary long texts for short text clustering,” in Proceedings of the 20th ACM International Conference on Informationand Knowledge Management , ser. CIKM 11. New York, NY, USA:Association for Computing Machinery, 2011, p. 775784. [Online].Available: https://doi.org/10.1145/2063576.2063689[43] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi, “Learning to classifyshort and sparse text & web with hidden topics from large-scale datacollections,” in Proceedings of the 17th International Conference onWorld Wide Web , ser. WWW 08. New York, NY, USA: Associationfor Computing Machinery, 2008, p. 91100. [Online]. Available:https://doi.org/10.1145/1367497.1367510[44] R. Mehrotra, S. Sanner, W. Buntine, and L. 
Xie, “Improving ldatopic models for microblogs via tweet pooling and automatic labeling,”in Proceedings of the 36th International ACM SIGIR Conference onResearch and Development in Information Retrieval , ser. SIGIR 13.New York, NY, USA: Association for Computing Machinery, 2013, p.889892. [Online]. Available: https://doi.org/10.1145/2484028.2484166[45] J. Yin and J. Wang, “A dirichlet multinomial mixture model-basedapproach for short text clustering,” in Proceedings of the 20thACM SIGKDD International Conference on Knowledge Discovery andData Mining , ser. KDD 14. New York, NY, USA: Associationfor Computing Machinery, 2014, p. 233242. [Online]. Available:https://doi.org/10.1145/2623330.2623715[46] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li,“Comparing twitter and traditional media using topic models,” in Pro-ceedings of the 33rd European Conference on Advances in InformationRetrieval , ser. ECIR11. Berlin, Heidelberg: Springer-Verlag, 2011, p.338349.[47] X. Quan, C. Kit, Y. Ge, and S. J. Pan, “Short and sparse text topicmodeling via self-aggregation,” in Proceedings of the 24th InternationalConference on Artificial Intelligence , ser. IJCAI15. AAAI Press, 2015,p. 22702276.[48] Y. Zuo, J. Wu, H. Zhang, H. Lin, F. Wang, K. Xu, and H. Xiong, “Topicmodeling of short texts: A pseudo-document view,” in Proceedingsof the 22nd ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining , ser. KDD 16. New York, NY, USA:Association for Computing Machinery, 2016, p. 21052114. [Online].Available: https://doi.org/10.1145/2939672.2939880[49] C. Li, H. Wang, Z. Zhang, A. Sun, and Z. Ma, “Topic modeling forshort texts with auxiliary word embeddings,” in Proceedings of the 39thInternational ACM SIGIR Conference on Research and Developmentin Information Retrieval , ser. SIGIR 16. New York, NY, USA:Association for Computing Machinery, 2016, p. 165174. [Online].Available: https://doi.org/10.1145/2911451.2911499[50] P. Bicalho, M. Pita, G. 
Pedrosa, A. Lacerda, and G. L. Pappa, “A generalframework to expand short text for topic modeling,” Inf. Sci. , vol. 393,no. C, p. 6681, jul 2017.[51] Y. Zuo, J. Zhao, and K. Xu, “Word network topic model: A simplebut general solution for short and imbalanced texts,” Knowl. Inf.Syst. , vol. 48, no. 2, p. 379398, Aug. 2016. [Online]. Available:https://doi.org/10.1007/s10115-015-0882-z[52] C. Li, Y. Duan, H. Wang, Z. Zhang, A. Sun, and Z. Ma, “Enhancingtopic modeling for short texts with auxiliary word embeddings,”