Asynchronous Stochastic Variational Inference
Saad Mohamad
Department of Computing, Bournemouth University, Poole, UK ([email protected])
Abdelhamid Bouchachia
Department of Computing, Bournemouth University, Poole, UK ([email protected])
Moamar Sayed-Mouchaweh
Department of Informatics and Automatics, IMT Lille Douai, Douai, France ([email protected])
ABSTRACT
Stochastic variational inference (SVI) employs stochastic optimization to scale up Bayesian computation to massive data. Since SVI is at its core a stochastic gradient-based algorithm, horizontal parallelism can be harnessed to allow larger scale inference. We propose a lock-free parallel implementation for SVI which allows distributed computations over multiple slaves in an asynchronous style. We show that our implementation leads to linear speed-up while guaranteeing an asymptotic ergodic convergence rate O(1/√T), given that the number of slaves is bounded by √T (T is the total number of iterations). The implementation is done in a high-performance computing (HPC) environment using message passing interface (MPI) for Python (MPI4py). The extensive empirical evaluation shows that our parallel SVI is lossless, performing comparably well to its counterpart serial SVI with linear speed-up.

KEYWORDS
Distributed Variational Inference, Probabilistic Modelling, Topic Mining, HPC
1 INTRODUCTION

Probabilistic models with latent variables have grown into a backbone of many modern machine learning applications such as text analysis, computer vision, time series analysis, network modelling, and others. The main challenge in such models is to compute the posterior distribution over some hidden variables encoding hidden structure in the observed data. Generally, computing the posterior is intractable and approximation is required. Markov chain Monte Carlo (MCMC) sampling has been the dominant paradigm for posterior computation. It constructs a Markov chain on the hidden variables whose stationary distribution is the desired posterior. Hence, the approximation is based on sampling for a long time to (hopefully) collect samples from the posterior [2].

Recently, variational inference (VI) has become widely used as a deterministic alternative to MCMC sampling. In general, VI tends to be faster than MCMC, which makes it more suitable for problems with large data sets. VI turns the inference problem into an optimization problem by positing a simpler family of distributions and finding the member of the family that is closest to the true posterior distribution [19]. Hence, the inference task boils down to the optimization of a non-convex objective function. This allows us to bring sophisticated tools from the optimization literature to tackle the performance problems. Recently, stochastic optimisation has been applied to VI in order to cope with massive data [8]. While VI requires repeatedly iterating over the whole data set before updating the variational parameters (the parameters of the variational objective), stochastic variational inference (SVI) updates the parameters every time a single data example is processed. Therefore, by the end of one pass through the dataset, the parameters will have been updated multiple times. Hence, the model parameters converge faster, while using less computational resources. The idea of SVI is to move the variational parameters at each iteration in the direction of a noisy estimate of the variational objective's natural gradient, based on a couple of examples [8]. Following these stochastic gradients with certain conditions on the (decreasing) learning rate schedule, SVI provably converges to a local optimum [17].

Although stochastic optimization improves the performance of VI, its serial employment prevents scaling up the inference and harnessing distributed resources. Since SVI is at its core a stochastic gradient-based optimisation algorithm, horizontal parallelism is straightforward. That is, computing stochastic gradients of a batch of data samples can be done locally in parallel, given that the parameter update is synchronised. Such synchronisation limits scalability by requiring slaves to send their stochastic gradients to the master prior to each parameter update. Hence, synchronous methods suffer from the curse of the last reducer; that is, a single slow slave can dramatically slow down the whole performance. Thus, asynchronous parallel optimization is an interesting alternative, provided it maintains a convergence rate comparable to its synchronous counterpart. Indeed, asynchronous parallel stochastic gradient-based optimisation algorithms have recently received broad attention [1, 6, 12, 16, 20]. The authors of [1] show that for smooth stochastic convex problems the effects of asynchrony are asymptotically negligible and order-optimal convergence results can be achieved.
Since the resulting objective function of SVI is non-convex, we are particularly interested in the asynchronous parallel stochastic gradient algorithm (ASYSG) for smooth non-convex optimization [3]. A recent study [10] breaks the usual convexity assumption taken by [1]. Nonetheless, theoretical guarantees (convergence and speed-up) for many recent successes of ASYSG are provided. In this paper, we use the ASYSG algorithm proposed in [1] to derive an asynchronous stochastic variational inference (ASYSVI) algorithm for a wide family of Bayesian models. We also adapt the theoretical study of ASYSG for smooth non-convex optimization from [10] to explain ASYSVI's convergence and speed-up properties. This paper proposes a novel contribution that allows linearly speeding up SVI by distributing its stochastic natural gradient computations in an asynchronous way, while guaranteeing an ergodic convergence rate O(1/√T) under some assumptions. We take latent Dirichlet allocation (LDA) as a case study to empirically evaluate ASYSVI.

The rest of the paper is structured as follows. We briefly review variational and stochastic variational inference in Sec. 2. We describe our asynchronous stochastic variational inference algorithm along with its convergence analysis in Sec. 3. The latent Dirichlet allocation case study is developed in Sec. 4. Related work is discussed in Sec. 5. Empirical evaluation is presented in Sec. 6 and the paper concludes with a discussion in Sec. 7.

2 STOCHASTIC VARIATIONAL INFERENCE

In the following, we derive the model family studied in this paper and review SVI. We follow the same pattern as in [8].
Model family.
Our family of models consists of three random variables: observations x = x_{1:n}, local hidden variables z = z_{1:n}, global hidden variables β, and fixed parameters α. The model assumes that the distribution of the n pairs (x_i, z_i) is conditionally independent given β. Further, their distribution and the prior distribution of β are in an exponential family:

p(β, x, z | α) = p(β | α) ∏_{i=1}^{n} p(z_i, x_i | β),   (1)

p(z_i, x_i | β) = h(x_i, z_i) exp( β^T t(x_i, z_i) − a(β) ),   (2)

p(β | α) = h(β) exp( α^T t(β) − a(α) ).   (3)

Here, we overload the notation for the base measures h(.), sufficient statistics t(.) and log normalizer a(.). While the spirit of the proposed approach is generic, for simplicity we assume a conjugacy relationship between (x_i, z_i) and β. That is, the distribution p(β | x, z) is in the same family as the prior p(β | α).

Note that this innocent-looking family of models includes (but is not limited to) latent Dirichlet allocation [4], Bayesian Gaussian mixtures, probabilistic matrix factorization, hidden Markov models, hierarchical linear and probit regression, and many Bayesian nonparametric models.

Mean-field variational inference.
Variational inference (VI) approximates the intractable posterior p(β, z | x) by positing a family of simple distributions q(β, z) and finding the member of the family that is closest to the posterior (closeness is measured with the KL divergence). The resulting optimization problem is equivalent to maximizing the evidence lower bound (ELBO):

L(q) = E_q[ log p(x, z, β) ] − E_q[ log q(z, β) ] ≤ log p(x).   (4)

Mean-field is the simplest family of distributions, where the distribution over the hidden variables factorizes as follows:

q(β, z) = q(β | λ) ∏_{i=1}^{n} q(z_i | ϕ_i).   (5)

Further, each variational distribution is assumed to come from the same family as the true one. Mean-field variational inference optimizes the new ELBO with respect to the local and global variational parameters ϕ and λ:

L(λ, ϕ) = E_q[ log( p(β) / q(β) ) ] + ∑_{i=1}^{n} E_q[ log( p(x_i, z_i | β) / q(z_i) ) ].   (6)

It iteratively updates each variational parameter holding the other parameters fixed. With the assumptions taken so far, each update has a closed-form solution. The local parameters are a function of the global parameters:

ϕ(λ_t) = arg max_ϕ L(λ_t, ϕ).   (7)

We are interested in the global parameters, which summarise the whole dataset (clusters in a Bayesian Gaussian mixture, topics in LDA):

L(λ) = max_ϕ L(λ, ϕ).   (8)

To find the optimal value of λ given that ϕ is fixed, we compute the natural gradient of L(λ) and set it to zero, which gives:

λ* = α + ∑_{i=1}^{n} E_{ϕ_i(λ_t)}[ t(x_i, z_i) ].   (9)

Thus, the new optimal global parameters are λ_{t+1} = λ*. The algorithm works by iterating between computing the optimal local parameters given the global ones (Eq. (7)) and computing the optimal global parameters given the local ones (Eq. (9)).
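To make the step from Eq. (6) to Eq. (9) explicit (it also grounds the slave gradient used later in Alg. 2), here is a brief sketch of the standard argument from [8], stated in the simplified notation above. Under the conjugacy assumption, the complete conditional of β stays in the prior's exponential family with natural parameter α + ∑_{i=1}^{n} t(x_i, z_i). Taking q(β | λ) in that same family, the gradient of Eq. (6) with respect to λ is

∇_λ L(λ) = ∇²a(λ) ( α + ∑_{i=1}^{n} E_{ϕ_i(λ)}[t(x_i, z_i)] − λ ).

Premultiplying by the inverse Fisher information (∇²a(λ))^{−1} of q(β | λ) yields the natural gradient

∇̂_λ L(λ) = α + ∑_{i=1}^{n} E_{ϕ_i(λ)}[t(x_i, z_i)] − λ,

whose unique zero is exactly the update of Eq. (9).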
Stochastic variational inference.
Rather than analysing all the data to compute λ* at each iteration, stochastic optimization can be used. Assuming that a data point is selected uniformly at random from the dataset, an unbiased noisy estimator of L(λ, ϕ) can be developed based on a single data point:

L_i(λ, ϕ_i) = E_q[ log( p(β) / q(β) ) ] + n E_q[ log( p(x_i, z_i | β) / q(z_i) ) ].   (10)

The unbiased stochastic approximation of the ELBO as a function of λ can be written as follows:

L_i(λ) = max_{ϕ_i} L_i(λ, ϕ_i).   (11)

Following the same steps as in the previous section, we end up with a noisy unbiased estimate of Eq. (9):

λ̂ = α + n E_{ϕ_i(λ_t)}[ t(x_i, z_i) ].   (12)

At each iteration, we move the global parameters a step of size ρ_t (the learning rate) in the direction of the noisy natural gradient:

λ_{t+1} = (1 − ρ_t) λ_t + ρ_t λ̂.   (13)

With certain conditions on ρ_t, namely ∑_{t=0}^{∞} ρ_t = ∞ and ∑_{t=0}^{∞} ρ_t² < ∞, the algorithm converges [17].
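For concreteness, the serial SVI loop of Eqs. (10)-(13) can be sketched as follows (Python, matching the language of the paper's implementation). The callables local_step and expected_suff_stats are hypothetical model-specific placeholders standing for Eq. (11) and for E_{ϕ_i(λ)}[t(x_i, z_i)]; the step-size schedule is one standard choice satisfying the conditions above, not necessarily the one used in the experiments.

import numpy as np

def svi(data, alpha, lam0, n_iters, local_step, expected_suff_stats, tau0=1.0, kappa=0.9):
    """Serial SVI: one uniformly sampled data point per iteration (Eqs. (10)-(13))."""
    lam = np.asarray(lam0, dtype=float).copy()
    n = len(data)
    for t in range(n_iters):
        x_i = data[np.random.randint(n)]                       # sample a data point uniformly
        phi_i = local_step(x_i, lam)                           # local optimisation, Eq. (11)
        lam_hat = alpha + n * expected_suff_stats(x_i, phi_i)  # noisy estimate of Eq. (9), Eq. (12)
        rho_t = (t + tau0) ** (-kappa)                         # sum(rho) = inf, sum(rho^2) < inf for kappa in (0.5, 1]
        lam = (1.0 - rho_t) * lam + rho_t * lam_hat            # update, Eq. (13)
    return lam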
3 ASYNCHRONOUS STOCHASTIC VARIATIONAL INFERENCE

In this section, we describe our proposed parallel implementation of ASYSVI on a computer cluster and study its convergence and speed-up properties. The steps of the algorithm follow the original ASYSG in [1]. ASYSVI is presented analogously to ASYSG in [10], but in the context of VI.

The architecture of the computer network on which ASYSVI runs is known as the star-shaped network. In this network, a master machine maintains the global variational parameter λ, whereas the other machines serve as slaves which independently and simultaneously compute the local variational parameters ϕ and stochastic gradients of the ELBO L(λ). The slaves only communicate with the master to exchange information: they access the state of the global variational parameter and provide the master with the stochastic gradients. These gradients are computed with respect to λ based on few (mini-batched or single) data points acquired from distributed sources. The master aggregates a predefined number of stochastic gradients from the slaves, regardless of which slaves they come from. Then, it updates its current global variational parameter. The update step is performed as an atomic operation, during which the slaves cannot read the value of the global variational parameter. However, vertical parallelism can be achieved by adopting the ASYSG algorithm proposed in [16]. Furthermore, hybrid horizontal-vertical parallelism could be achieved by combining the mechanism used in [15] with ASYSVI (more details in Sec. 5 and Sec. 7).

Algorithm 1 ASYSVI-Master
1: Input: number of iterations T and step-sizes {ρ_t}_{t=0,...,T−1}
2: Initialize λ_0 randomly and set t = 0
3: while t < T do
4:   Aggregate M stochastic natural gradients ∇̂L_1(λ_{t−τ_{t,1}}), ..., ∇̂L_M(λ_{t−τ_{t,M}}) from the slaves
5:   Average the M stochastic natural gradients: G^M_t = (1/M) ∑_{m=1}^{M} ∇̂L_m(λ_{t−τ_{t,m}})
6:   Update the current estimate of the global variational parameter: λ_{t+1} = λ_t + ρ_t G^M_t
7:   t = t + 1
8: end while

Algorithm 2 ASYSVI-Slave
1: Input: data size D
2: while true do
3:   Sample a data point x_i uniformly from the data set
4:   Pull the global variational parameter λ* from the master
5:   Compute the local variational parameters ϕ*_i(λ*) corresponding to the data point x_i and the global variational parameter λ*: ϕ*_i(λ*) = arg max_{ϕ_i} L_i(λ*, ϕ_i)
6:   Compute the stochastic natural gradient with respect to the global parameter λ: g_i(λ) = α + D E_{ϕ_i(λ)}[t(x_i, z_i)] − λ
7:   Push g_i(λ*) to the master
8: end while

The key difference between ASYSVI and synchronous parallel SVI is that ASYSVI does not lock the slaves until the master's update step is done. That is, the slaves might compute some stochastic gradients based on an earlier (stale) value of the global variational parameter. By allowing delayed and asynchronous updates, one might expect slower convergence, if convergence at all. In the next section, we apply the study of [10] to SVI to show that the effect of the stochastic gradient delays vanishes asymptotically. The algorithms of ASYSVI-master and ASYSVI-slave are shown in Alg. 1 and Alg. 2. We denote by τ_{t,m} the delay between the current iteration t and the iteration at which the slave pulled the global variational parameter used to compute the m-th stochastic gradient.
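Below is a minimal MPI4py sketch of Algorithms 1 and 2 on the star-shaped network (rank 0 as master, every other rank as a slave). It is an illustration in the spirit of the implementation described in Sec. 6, not the authors' code; local_step and expected_suff_stats are the same hypothetical placeholders as in the previous sketch, and shutdown handling is only outlined.

from mpi4py import MPI
import numpy as np

TAG_GRAD, TAG_LAMBDA = 1, 2

def run_master(comm, lam0, T, M, rho):
    """Algorithm 1 (ASYSVI-Master): average M (possibly stale) natural gradients per update."""
    lam = lam0.copy()
    status = MPI.Status()
    for dest in range(1, comm.Get_size()):
        comm.send(lam, dest=dest, tag=TAG_LAMBDA)          # initial pull of lambda by every slave
    for t in range(T):
        acc = np.zeros_like(lam)
        for _ in range(M):
            g = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_GRAD, status=status)
            comm.send(lam, dest=status.Get_source(), tag=TAG_LAMBDA)   # reply with the current lambda
            acc += g
        lam = lam + rho(t) * acc / M                       # lambda_{t+1} = lambda_t + rho_t * G_t^M
    for _ in range(comm.Get_size() - 1):                   # drain one last gradient per slave, then stop it
        comm.recv(source=MPI.ANY_SOURCE, tag=TAG_GRAD, status=status)
        comm.send(None, dest=status.Get_source(), tag=TAG_LAMBDA)
    return lam

def run_slave(comm, data, alpha, D, local_step, expected_suff_stats):
    """Algorithm 2 (ASYSVI-Slave): push natural gradients computed on the pulled (possibly stale) lambda."""
    lam = comm.recv(source=0, tag=TAG_LAMBDA)
    while lam is not None:
        x_i = data[np.random.randint(len(data))]           # sample a data point uniformly
        phi_i = local_step(x_i, lam)                       # line 5 of Alg. 2
        g = alpha + D * expected_suff_stats(x_i, phi_i) - lam   # line 6 of Alg. 2
        comm.send(g, dest=0, tag=TAG_GRAD)                 # push the gradient to the master
        lam = comm.recv(source=0, tag=TAG_LAMBDA)          # pull the (possibly newer) lambda

Such a script would typically be launched with one MPI process for the master plus one per slave (e.g., mpiexec -n 10 for nine slaves).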
Following [1, 10], we take the same assumptions, but replace the gradient with the natural gradient:

• Unbiased gradient: the expectation of the stochastic natural gradient of Eq. (11) is equal to the natural gradient of Eq. (8):

∇̂L(λ) = E[ ∇̂L_i(λ) ],   (14)

where ∇̂ denotes the natural gradient. This assumption already holds in SVI problems for the family of models shown in Sec. 2.

• Bounded variance: the variance of the stochastic natural gradient is bounded for all λ ∈ G, E[ ||∇̂L_i(λ) − ∇̂L(λ)||² ] ≤ σ². By plugging in the SVI natural gradient, we end up with the following formulation:

E[ || n E_{ϕ_i(λ)}[t(x_i, z_i)] − ∑_{i=1}^{n} E_{ϕ_i(λ)}[t(x_i, z_i)] ||² ] ≤ σ².   (15)

• Lipschitz-continuous gradient: the natural gradient is L-Lipschitz-continuous for all λ ∈ G and λ′ ∈ G, ||∇̂L(λ) − ∇̂L(λ′)|| ≤ L ||λ − λ′||. By plugging in the SVI natural gradient, we end up with the following formulation:

|| ∑_{i=1}^{n} E_{ϕ_i(λ)}[t(x_i, z_i)] − λ − ∑_{i=1}^{n} E_{ϕ_i(λ′)}[t(x_i, z_i)] + λ′ || ≤ L ||λ − λ′||.   (16)

• Bounded delay: all delay variables τ_{t,m} are bounded:

max_{t,m} τ_{t,m} ≤ B.   (17)

In addition to these assumptions, the authors of [1, 10] assume that each slave receives a stream of independent data points. Although this assumption might not be satisfied strictly in practice, we follow it for analysis purposes. Thus, the theoretical results obtained by [10] can be applied to ASYSVI, namely an ergodic convergence rate O(1/√(MT)), provided that T is greater than O(B). The results also show that, since the number of slaves is proportional to B, the ergodic convergence rate is achieved as long as the number of slaves is bounded by O(√(T/M)). Note that O(1/√(MT)) is consistent with serial stochastic gradient (SG) methods and with stochastic variational inference (SVI). Thus, ASYSG and ASYSVI allow for a linear speed-up if B ≤ O(√(T/M)).
4 CASE STUDY: LATENT DIRICHLET ALLOCATION

Latent Dirichlet allocation (LDA) is an instance of the family of models described in Sec. 2, where the global, local and observed variables and their distributions are set as follows:

• The global variables {β_k}_{k=1}^{K} are the topics of LDA. A topic is a distribution over the vocabulary, where the probability of a word w in topic k is denoted by β_{k,w}. The prior distribution of β is a Dirichlet distribution, p(β) = ∏_k Dir(β_k; η).

• The local variables are the topic proportions {θ_d}_{d=1}^{D} and the topic assignments {{z_{d,w}}_{w=1}^{W}}_{d=1}^{D}, which index the topics that generate the observations. Each document is associated with a topic proportion, which is a distribution over topics, p(θ) = ∏_d Dir(θ_d; α). The assignments z_{d,w} are indices, generated by θ_d, that couple topics with words, p(z_d | θ_d) = ∏_w θ_{d, z_{d,w}}.

• The observations x_d are the words of document d, which are assumed to be drawn from the topics β selected by the indices z_d, p(x_d | z_d, β) = ∏_w β_{z_{d,w}, x_{d,w}}.

The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [4]. LDA assumes the following generative process:

1. Draw topics β_k ∼ Dir(η, ..., η) for k ∈ {1, ..., K}.
2. For each document d ∈ {1, ..., D}:
   (a) Draw topic proportions θ_d ∼ Dir(α, ..., α).
   (b) For each word position w ∈ {1, ..., W}:
       i. Draw a topic assignment z_{d,w} ∼ Mult(θ_d).
       ii. Draw a word x_{d,w} ∼ Mult(β_{z_{d,w}}).

According to Sec. 2, each variational distribution is assumed to come from the same family as the true one. Hence, q(β_k | λ_k) = Dir(λ_k), q(θ_d | γ_d) = Dir(γ_d) and q(z_{d,w} | ϕ_{d,w}) = Mult(ϕ_{d,w}). To compute the stochastic natural gradient g_i in Alg. 2 for LDA, we need the sufficient statistic t(.) introduced in Eq. (2). By writing the likelihood of LDA in the form of Eq. (2), we obtain t(x_d, z_d) = ∑_{w=1}^{W} I_{z_{d,w}, x_{d,w}}, where I_{i,j} is equal to 1 for entry (i, j) and 0 everywhere else. Hence, the stochastic natural gradient g_i(λ_k) can be written as follows:

g_i(λ_k) = η + D ∑_{w=1}^{W} ϕ^k_{i,w} I_{k, x_{i,w}} − λ_k.   (18)

Details on how to compute the local variational parameters ϕ*_i(λ*) in Alg. 2 can be found in [8].

Having computed the elements needed to run ASYSVI's Algorithms 1 and 2, we move to the convergence analysis. Since the data is assumed to be subsampled uniformly, the unbiased gradient assumption holds for LDA. We can always find a constant such that the bounded variance assumption is satisfied: in the worst case, the variance of the stochastic natural gradient of LDA can be bounded in terms of DW( max_{i,w} ϕ^k_{i,w} − min_{i′,w′} ϕ^k_{i′,w′} ) for every k, and can therefore be bounded by O((DW)²). It is clear that the Lipschitz-continuous gradient assumption can be satisfied for any member of the family of models proposed in Sec. 2, and hence for LDA. Finally, the bounded delay can be guaranteed through the implementation. Therefore, ASYSVI for LDA converges, since the aforementioned assumptions can be satisfied.
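To make the slave step concrete for LDA, the following NumPy sketch computes lines 5 and 6 of Alg. 2 for one sampled document, using the standard local coordinate ascent of online LDA [7, 8]; variable names (word_ids, n_inner, tol) are ours rather than the paper's, and α and η are the scalar Dirichlet parameters above.

import numpy as np
from scipy.special import digamma

def lda_slave_gradient(word_ids, lam, alpha, eta, D, n_inner=50, tol=1e-4):
    """Stochastic natural gradient of Eq. (18) for one sampled document.

    word_ids: word index of each token in the document; lam: global parameter (K x V).
    """
    K, V = lam.shape
    E_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))  # E_q[log beta_{k,v}]
    gamma = np.ones(K)                                                   # variational topic proportions
    for _ in range(n_inner):                                             # local step (line 5 of Alg. 2)
        E_log_theta = digamma(gamma) - digamma(gamma.sum())
        log_phi = E_log_theta[:, None] + E_log_beta[:, word_ids]         # K x (number of tokens)
        phi = np.exp(log_phi - log_phi.max(axis=0))
        phi /= phi.sum(axis=0)                                           # normalise over topics per token
        new_gamma = alpha + phi.sum(axis=1)
        converged = np.abs(new_gamma - gamma).mean() < tol
        gamma = new_gamma
        if converged:
            break
    lam_hat = np.full((K, V), float(eta))                                # eta + D * sum_w phi^k_{i,w} I_{., x_{i,w}}
    for k in range(K):
        np.add.at(lam_hat[k], word_ids, D * phi[k])
    return lam_hat - lam                                                 # g_i(lambda), Eq. (18)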
5 RELATED WORK

Little work has been proposed to scale variational inference to large datasets. We can distinguish two major classes. The first class is based on the Bayesian filtering approach [5, 9]. That is, the sequential nature of Bayes' theorem is exploited to recursively update an approximation of the posterior. In particular, variational inference is used between the updates to approximate the posterior, which becomes the prior of the next step. The authors of [9] employ forgetting factors to decay the contributions of old data points in favour of newer ones. The algorithm proposed in [5] considers a sequence of data batches. It iterates over the data points in each batch until convergence. The computation of the batch posteriors is done in a distributed and asynchronous manner. That is, the algorithm applies VI by performing asynchronous Bayesian updates to the posterior as batches of data arrive continuously. Similar to our approach, a master-slave architecture is used.

The second class of work is based on optimization [8, 14, 15]. As already discussed, SVI proposed by [8] employs stochastic optimization to scale up Bayesian computation to massive data. SVI is inherently serial and requires the model parameters to fit in the memory of a single processor. The authors of [14] present an inference algorithm where the data is divided across several slaves, each of which performs VI updates in parallel. However, at each iteration the slaves are synchronized to combine their obtained parameters. Such synchronisation limits scalability and decreases the speed of the update to that of the slowest slave. To avoid bulk synchronization, the authors of [15] propose an asynchronous and lock-free update. In their scheme, vertical parallelism is adopted, where each processor asynchronously updates a subset of the parameters based on a subset of the data attributes. In contrast, we adopt horizontal parallelism based on few (mini-batched or single) data points acquired from distributed sources; the update steps are then aggregated to form the global update. Note that the proposed approach can make use of the mechanism proposed by [15] to achieve hybrid horizontal-vertical parallelism. Unlike [15], our approach is not customised for LDA and can readily be applied to any model from the family presented in Sec. 2.
6 EMPIRICAL EVALUATION

In the following, we demonstrate the usefulness of distributing the computation of SVI, mainly the speed-up advantages of ASYSVI. For this purpose, we compare the speed-up of ASYSVI LDA against serial SVI LDA (online LDA [7]). The two versions of LDA are evaluated on three datasets consisting of very large collections of documents. We also evaluate ASYSVI LDA in the streaming setting, where new documents arrive in the form of a stream. The algorithm implementation is available in Python (the code will be uploaded later).

The performance evaluation is done using held-out perplexity as a measure of model fit. Perplexity is defined as the geometric mean of the inverse marginal probability of each word in the held-out set of documents [4]. To validate the speed-up properties, following [10], we compute the running time speed-up (TSP):

TSP = (running time of online LDA) / (running time of asynchronous LDA).   (19)
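The following small helpers restate the evaluation metrics used below; log_word_probs is assumed to contain the (approximate) log probability of every word token in the held-out set, and the function names are ours rather than the paper's.

import numpy as np

def perplexity(log_word_probs):
    """Held-out perplexity: exp(-average log-likelihood per held-out word)."""
    return float(np.exp(-np.mean(log_word_probs)))

def tsp(time_online_lda, time_async_lda):
    """Eq. (19): running-time speed-up of ASYSVI LDA over online LDA."""
    return time_online_lda / time_async_lda

def rsp(perplexity_online_lda, perplexity_async_lda):
    """Ratio of serial (online) LDA perplexity to parallel (ASYSVI) LDA perplexity."""
    return perplexity_online_lda / perplexity_async_lda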
Table 1: Parameter settings. Best values of κ and τ for each batch size in {16, 64, 256, 1024} on the Enron emails, NYTimes news articles, and Wikipedia articles datasets.

Figure 1: Comparing ASYSVI LDA to online LDA on the Enron dataset: (a) TSP, (b) RSP.

Figure 2: TSP and RSP with respect to streaming samples on the Enron dataset: (a) TSP, (b) RSP.
Datasets: we perform all comparisons and evaluations on three corpora of documents. The first two corpora are available from [11]. The third corpus was used in [7].

• Enron emails: the corpus contains 39,861 email messages from about 150 users, mostly senior management of Enron. The data is processed before usage by removing all words not in a vocabulary dictionary of 28,102 words.

• NYTimes news articles: the corpus contains 300,000 news articles from the New York Times. The data is processed before usage by removing all words not in a vocabulary dictionary of 102,660 words.

• Wikipedia articles: this corpus contains 1M documents downloaded from Wikipedia. The data is processed before usage by removing all words not in a vocabulary dictionary of 7,… words.

Figure 3: TSP and RSP with respect to streaming samples on the NYTimes dataset: (a) TSP, (b) RSP.

Figure 4: TSP and RSP with respect to streaming samples on the Wikipedia dataset: (a) TSP, (b) RSP.
Setting the parameters: in all experiments, α and η are fixed at 0.01 and the number of topics is K = 50. We evaluated a range of settings of the learning parameters κ, τ, and batch on all corpora. The parameters κ and τ, defined in [7], control the learning step-size ρ_t. We use 29,861 emails from the Enron dataset, 50,000 news articles from the NYTimes dataset, and 300,000 documents from the Wikipedia dataset as training sets. We also reserve 5,000 documents as a validation set and another 5,000 documents as a testing set for each corpus. Online LDA is run (one time per corpus) on the training set of each corpus for three values of κ, four values of τ, and batch ∈ {16, 64, 256, 1024}. Table 1 summarises the best settings for each batch size along with the resulting perplexity on the test set for each corpus.

Comparing serial online LDA and asynchronous LDA: for each dataset, we use the parameter setting that gives the best performance (lowest perplexity). ASYSVI LDA is then compared against serial SVI LDA using the same parameter setting. The code is implemented in a high-performance computing (HPC) environment using message passing interface (MPI) for Python (MPI4py). The cluster consists of 10 nodes, excluding the head node, where each node is a four-core processor.
We run ASYSVI LDA on the Enron dataset for a range of numbers of workers nW; B is set to 5. The number of employed nodes is equal to nW as long as nW is less than 9. As nW becomes higher than the number of available nodes, the processors' cores are employed as slaves until all the cores of all the nodes are used, i.e., 9 × 4 = 36. Since the batch size is fixed to 1024, each slave processes a batch of data of size S = 1024/M per iteration, where M is fixed to 36. Thus, the gradient computed by each slave is multiplied by D/S. Hence, line 6 of Alg. 2 becomes g_i(λ) = α + (D/S) E_{ϕ_i(λ)}[t(x_i, z_i)] − λ.

Figure 1 summarises the total speed-up (i.e., TSP measured at the end of the algorithm) as well as the ratio of serial LDA perplexity to parallel LDA perplexity (RSP) on the test set for the Enron dataset. It shows TSP and RSP results with respect to the number of slaves. It is clear that as long as each node hosts a single slave, the speed-up is linear, which corroborates the convergence analysis of Sec. 3. The linear speed-up slowly turns sub-linear as single machines host more than one slave. The main reason for this behaviour is the communication delay caused by the increase in network traffic; hence, TSP is affected by the hardware. The communication cost starts affecting the algorithm's speed-up when it becomes comparable to the local computation. Increasing the local computation by increasing the batch size can therefore be used to soften the communication effect. However, this decreases the convergence rate and increases the local memory load, so a balanced trade-off should be considered. RSP in Figure 1 shows that although the speed of online LDA has been increased up to 15 times, performance is not seriously affected.
We also evaluate TSP and RSP on the NYTimes and Wikipedia datasets for nW = 27. The processing speed of online LDA on NYTimes has been increased 19.29 times, TSP = 19.29, with a slight loss of performance, RSP = 0.97. For Wikipedia, TSP = …58 and RSP = …. Figures 2, 3 and 4 show TSP and RSP with respect to streaming samples on the Enron, NYTimes and Wikipedia datasets. These figures show the performance of ASYSVI in a true online setting, where the algorithm continually collects samples from the hard drive in the case of Enron and NYTimes, or by downloading online in the case of Wikipedia. The perplexity is computed online on the incoming batches before they are used to update the model parameters. The plots in the figures are slightly smoothed using a low-pass filter in order to make them easier to read. These plots show that the speed-up becomes invariant as more samples are processed. The poor speed-up at the beginning is mainly caused by the initialization and loading process. It can also be noticed that the performance of ASYSVI LDA suffers at the beginning and then becomes comparable to that of online LDA after a certain number of iterations. This behaviour can be explained by the convergence condition shown in Sec. 3 (T greater than O(B)). Thus, as the number of iterations increases, the convergence of ASYSVI LDA is guaranteed and its performance becomes comparable to that of online LDA; hence, RSP approaches 1.

7 CONCLUSION

We have introduced ASYSVI, an asynchronous parallel implementation of SVI on a computer cluster. ASYSVI leads to linear speed-up, while guaranteeing an asymptotic convergence rate under some assumptions involving the number of slaves and iterations. Empirical results using the latent Dirichlet allocation topic model as a case study have demonstrated the advantages of ASYSVI over SVI, particularly with respect to the key issue of speeding up the computation while maintaining performance comparable to SVI.

In future work, vertical parallelism can be adopted along with the proposed horizontal one, leading to hybrid horizontal-vertical parallelism. In such a case, multi-core processors would be used for the vertical parallelism, while horizontal parallelism is achieved on a multi-node machine. Another avenue of interest is to derive an algorithm for streaming, distributed, asynchronous inference where the number of instances is not known. Moreover, it would be interesting to apply ASYSVI to very large scale problems, and particularly to other models of the family discussed in Sec. 2, studying the effect of the statistical properties of those models.
ACKNOWLEDGMENT
S. Mohamad and A. Bouchachia were supported by the European Commission under the Horizon 2020 Grant 687691 related to the project PROTEUS: Scalable Online Machine Learning for Predictive Analytics and Real-Time Interactive Visualization.
REFERENCES

[1] Alekh Agarwal and John C Duchi. 2011. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems. 873–881.
[2] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. 2003. An introduction to MCMC for machine learning. Machine Learning 50, 1-2 (2003), 5–43.
[3] Dimitri P Bertsekas and John N Tsitsiklis. 1989. Parallel and Distributed Computation: Numerical Methods. Vol. 23. Prentice Hall, Englewood Cliffs, NJ.
[4] David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3 (2003), 993–1022.
[5] Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. 2013. Streaming variational Bayes. In Advances in Neural Information Processing Systems. 1727–1735.
[6] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. 2016. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Trans. Automat. Control 61, 12 (2016), 3740–3754.
[7] Matthew Hoffman, Francis R Bach, and David M Blei. 2010. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems. 856–864.
[8] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. The Journal of Machine Learning Research 14, 1 (2013), 1303–1347.
[9] Antti Honkela and Harri Valpola. 2003. On-line variational Bayesian learning. 803–808.
[10] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems. 2737–2745.
[11] M. Lichman. 2013. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[12] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. 2015. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970 (2015).
[13] James McInerney, Rajesh Ranganath, and David M Blei. 2015. The population posterior and Bayesian inference on streams. arXiv preprint arXiv:1507.05253 (2015).
[14] Willie Neiswanger, Chong Wang, and Eric Xing. 2015. Embarrassingly parallel variational inference in nonconjugate models. arXiv preprint arXiv:1510.04163 (2015).
[15] Parameswaran Raman, Jiong Zhang, Hsiang-Fu Yu, Shihao Ji, and SVN Vishwanathan. 2016. Extreme stochastic variational inference: Distributed and asynchronous. arXiv preprint arXiv:1605.09499 (2016).
[16] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems. 693–701.
[17] Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The Annals of Mathematical Statistics (1951), 400–407.
[18] Lucas Theis and Matthew D Hoffman. 2015. A trust-region method for stochastic variational inference with applications to streaming data. arXiv preprint arXiv:1505.07649 (2015).
[19] Martin J Wainwright and Michael I Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1, 1-2 (2008), 1–305.
[20] Ruiliang Zhang and James T Kwok. 2014. Asynchronous distributed ADMM for consensus optimization. In ICML. 1701–1709.