METEOR: Learning Memory and Time Efficient Representations from Multi-modal Data Streams
Amila Silva, Shanika Karunasekera, Christopher Leckie, Ling Luo
School of Computing and Information Systems, The University of Melbourne, Parkville, Victoria, Australia
{amila.silva@student.,karus@,caleckie@,ling.luo@}unimelb.edu.au
ABSTRACT
Many learning tasks involve multi-modal data streams, where continuous data from different modes convey a comprehensive description about objects. A major challenge in this context is how to efficiently interpret multi-modal information in complex environments. This has motivated numerous studies on learning unsupervised representations from multi-modal data streams. These studies aim to understand higher-level contextual information (e.g., a Twitter message) by jointly learning embeddings for the lower-level semantic units in different modalities (e.g., text, user, and location of a Twitter message). However, these methods directly associate each low-level semantic unit with a continuous embedding vector, which results in high memory requirements. Hence, deploying and continuously learning such models in low-memory devices (e.g., mobile devices) becomes a problem. To address this problem, we present METEOR, a novel MEmory and Time Efficient Online Representation learning technique, which: (1) learns compact representations for multi-modal data by sharing parameters within semantically meaningful groups and preserves the domain-agnostic semantics; (2) can be accelerated using parallel processes to accommodate different stream rates while capturing the temporal changes of the units; and (3) can be easily extended to capture implicit/explicit external knowledge related to multi-modal data streams. We evaluate METEOR using two types of multi-modal data streams (i.e., social media streams and shopping transaction streams) to demonstrate its ability to adapt to different domains. Our results show that METEOR preserves the quality of the representations while reducing memory usage by around 80% compared to the conventional memory-intensive embeddings.
1 INTRODUCTION
Unsupervised representation learning [2, 28] has recently become a rapidly growing direction in machine learning due to its ability to: (1) exploit the availability of unlabeled data; and (2) associate the underlying factors behind data effortlessly. For a given domain, conventional representation learning techniques learn low-dimensional vectors (i.e., embeddings) for low-level data units (e.g., words in a language; such low-level units in different domains are referred to as "units" in this paper) such that these representations capture the underlying semantics of the particular domain in a task-independent manner. For example, the language modelling techniques in [10, 13] learn embeddings for words or characters by considering them as the low-level units of a language, from which higher-level structures (e.g., phrases and sentences) can be constructed and understood. Subsequently, these representations serve as features to solve different application-level problems. Lately, it has been empirically proven that these representations yield a significant performance boost for many downstream tasks in domains such as Natural Language Processing [3, 13] and Computer Vision [8, 21].
This paper focuses on representation learning techniques for multi-modal data streams: techniques to learn representations from records with different types of low-level units (i.e., attributes) in an online fashion to capture temporal changes of the units while preserving the relationships between different modalities. For example, a geo-tagged Twitter stream (see Figure 1) is such a multi-modal data stream, in which each record has multiple types of attributes, such as its location, user, timestamp, and text content. Although there has been previous work [19, 27] on this problem, our work aims to address the following research gaps in the literature.
Research Gaps.
First, conventional embedding learning techniques are memory-intensive as they assign an independent embedding vector to each low-level unit. If such techniques are extended to embed attributes with a growing number of distinct low-level units, such as users and words, the amount of memory required to store all the embedding vectors becomes a major overhead.
Figure 1: Geo-tagged Twitter Stream as an example of a multi-modal data stream

For example, consider a system that learns embeddings using a shopping transaction record stream, which consists of two types of low-level units: (1) 1 million distinct users; and (2) 1 million different products. If a 300-dimensional vector is assigned to each unit, the total memory requirement for the embeddings becomes (1,000,000 + 1,000,000) units × 300 dimensions × 4 bytes ≈ 2.4 GB (assuming the values in the embeddings are represented as single-precision float numbers, i.e., 4 bytes per value). Moreover, some multi-modal streams can have more than two modalities. If we consider a geo-tagged Twitter stream as an example, there are four modalities, namely: location, text, user, and timestamp. Hence, the total memory required for such a stream can be substantially higher than in the example above. Thus, the application of conventional embedding learning approaches to multi-modal streams becomes problematic, particularly on limited-memory platforms. There are several previous works (compared in Section 2) on learning compact memory-efficient representations. However, almost all these methods are not well suited to data streams [4, 18], or they are specific to a particular domain (e.g., the technique proposed in [16] is specific to natural language modelling). Thus, this paper proposes a domain-agnostic and memory-efficient representation learning technique to work with data streams.

Figure 2: The fraction of different users that appear within an updating window of length ∆W compared to the total users seen up to the particular window: (a) LA, a Geo-tagged Twitter Stream (∆W = 1 hour); and (b) IC, a Shopping Transaction Stream (∆W = 1 day). More details about LA and IC are provided in Section 5.

Second, the processing time per record of online learning techniques should meet at least the rate of the stream (i.e., 1/average records per unit time) to update models in a timely manner. Parallel-processing architectures can be used to meet this requirement when working with data streams. However, this problem is not well studied in the context of online representation learning, possibly due to the memory-intensive nature of conventional embeddings. With memory-efficient representations, it is feasible to adopt parallel processing to reduce the time complexity of online representation learning techniques. We propose a decomposable objective function to learn memory-efficient representations from streams, which allows the flexibility to assign more parallel processes (with different memory capacities) to computationally expensive steps in METEOR.
Third, some of the attributes in multi-modal data streams show specific relationships or behaviours (either explicit or implicit). For example, the products that appear in a shopping transaction stream can be grouped using higher-level product categories. Although such explicit relationships have been exploited in previous works [26] to improve the quality of the representations, they are not well studied for the task of making representations compact. Also, for a given attribute of a multi-modal stream, we observe that only a small fraction of the attribute's units appear during a short period of the stream, compared to the total number of distinct units of the particular attribute. For example, the fraction of users that appear in an updating window is less than 3% of the total users for the two multi-modal streams shown in Figure 2. Likewise, the attributes show specific relationships and behaviours in multi-modal data streams. Our model is designed in a manner to exploit such explicit/implicit relationships in multi-modal data streams.
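As a quick sanity check of the memory arithmetic in the first research gap above, the following minimal Python sketch (illustrative constants only) reproduces the ≈ 2.4 GB estimate:

```python
# Illustrative constants reproducing the first research gap's estimate.
n_users = 1_000_000
n_products = 1_000_000
dims = 300
bytes_per_value = 4   # single-precision floats, as assumed above

total_bytes = (n_users + n_products) * dims * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB")   # -> 2.4 GB
```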
Contributions.
In this paper, we propose METEOR, a novel compact representation learning technique using multi-modal data streams, which:
• learns compact online representations for multi-modal data units by sharing the parameters (i.e., basis vectors) inside semantically meaningful groupings of the units. METEOR is also domain-agnostic, thus yielding consistent results with multi-modal streams from different domains;
• proposes an architecture (METEOR-Parallel) to accelerate the learning of METEOR, which can learn embeddings of the same quality at twice the speed of METEOR; and
• can be easily extended to exploit explicit knowledge sources related to multi-modal data units by defining parameter sharing groups (implicitly defined if there is no such explicit knowledge source) based on the particular knowledge sources. Our results show that METEOR can further improve the quality of the compact embeddings by including explicit knowledge sources.

2 RELATED WORK
Although work on learning compressed coding systems began in the 1950s, with error correction codes [7] and Huffman codes [9], there has been recent progress in representation learning techniques [13, 15] that encode each low-level unit with a continuous embedding vector. In such approaches, the number of parameters in the embeddings grows linearly with the number of low-level units. As a result, learning compact representations has become a popular research problem and has recently been addressed by many previous works. Almost all the previous works on learning memory-efficient representations can be divided into two categories: (1) compositional embedding learning techniques; and (2) data compression techniques.
(1) Compositional embedding learning techniques.
In these techniques, a set of basis vectors is learned and shared between all low-level units. The final embedding for a given unit is then taken as a composition (e.g., a linear combination) of the basis vectors. These techniques mainly differ from each other in the way the basis vectors are composed to generate the final representations. METEOR also belongs to this category.
Figure 3: Overview of METEOR (input: recent records R∆; extract and update recency-aware costly embeddings for the units appearing in recent records using the Reconstruction Loss (Section 4.1); assign noisy fixed clusters to newly seen units using explicit or implicit knowledge (e.g., an explicit product hierarchy or a pretrained embedding space); update the compressed embeddings and sub-basis vectors using the Compression Loss (Section 4.2); output: compressed embeddings for units)
In some works [17, 24], predefined hashing functions (e.g., division hashing and tree-based hashing) are used to assign each low-level unit to one of the basis vectors. However, these approaches do not consider the semantics of the units when mapping them to the basis vectors; thus, they may blindly map vastly different units to the same basis vector. This can lead to a substantial loss of information and deterioration of the embeddings. To mitigate that, other recent works explore domain-specific sub-units with smaller vocabularies to define basis vectors. For example, in [10, 16, 23], the basis vectors are defined using characters and sub-words in a language. Hash functions are then defined to automatically map words (or texts) to pre-defined bases, according to which the vectors are composed. However, such approaches may not be scalable to other domains, as they use language-specific semantics when defining their compositions. In contrast, METEOR exploits the semantics of the low-level units in a domain-agnostic manner.
To learn task-independent compact embeddings, a few previous works [4, 18] attempt to jointly learn basis vectors and discrete codes for each unit, where the codes define the composition of the basis vectors. However, the main challenge of learning discrete codes to define the composition is that they cannot be directly optimized via SGD like other parameters. In [4, 18], the discrete encodings are relaxed using continuous relaxation techniques such as the Tempering-Softmax trick and the Gumbel-Softmax trick. In [22], a similar technique has been proposed, which divides the aforementioned non-differentiable objective function into several solvable sub-problems and sequentially solves each sub-problem. All these techniques require continuous costly embeddings (e.g., pretrained word embeddings) to be stored to guide their learning process. Also, we empirically observe that the latter is computationally expensive to learn in an online fashion. Thus, these techniques are not suitable for data streams, where the learning happens incrementally. In contrast, METEOR learns sparse continuous code vectors along with basis vectors, which can be trained in an online fashion. Also, to the best of our knowledge, METEOR is the first work on learning compact online embeddings using multi-modal data streams.
(2) Data compression techniques.
In addition, some previous works [1, 12, 14] adopt data compression techniques (e.g., quantization, dimension reduction, sparse coding, and quantum entanglement) to reduce the memory requirement to store embeddings. However, all these approaches linearly increase the size of the embedding table with the number of distinct units in the system, whereas METEOR introduces a small overhead with respect to the total number of distinct units.
3 PROBLEM STATEMENT
Let $R = \{r_1, r_2, r_3, \dots, r_n, \dots\}$ be a continuous stream of records that arrive in chronological order. Each record $r \in R$ is a tuple $\langle a_{r1}, a_{r2}, \dots, a_{rN} \rangle$, where $a_{ri}$ is the $i$-th attribute of $r$ and $N$ denotes the number of attributes of $r$.
Our problem is to learn embeddings for all possible units in each attribute, denoted as $A_i (= \bigcup_{\forall r} \{a_{ri}\})$ for the $i$-th attribute, such that the embedding $v_x$ of a unit $x \in A_i$:
(1) is a $d$-dimensional vector ($d \ll \sum_{i=1}^{N} |A_i|$), where $|A_i|$ is the number of different units in $A_i$;
(2) preserves the co-occurrences of the attributes;
(3) is continuously updated as new records ($R_\Delta$) arrive, to incorporate the latest information; and
(4) is memory efficient, with a memory complexity $\ll O(d)$.
Consider a shopping transaction stream as an example for $R$, in which each record (i.e., transaction) $r$ can be characterized using a tuple consisting of three attributes $\langle t_r, u_r, p_r \rangle$, where: (1) $t_r$ is the transaction time of $r$; (2) $u_r$ is the user id of $r$; and (3) $p_r$ is the set of products in the shopping basket of $r$. This work aims to jointly learn recency-aware vector representations for all possible units in each attribute (e.g., discretized timestamps, products, and users in $R$) such that co-occurring units have similar embeddings. (Continuous attributes such as timestamps should be discretized to make them feasible for embedding.)
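To make the notation concrete, the following minimal Python sketch (with hypothetical records and attribute values of our own invention) shows how records arrive as attribute tuples and how the unit vocabularies $A_i$ accumulate:

```python
from collections import defaultdict

# Hypothetical shopping-transaction records <t_r, u_r, p_r>.
stream = [
    ("2020-07-01", "user_42", ("white_bread", "skim_milk")),
    ("2020-07-01", "user_7", ("full_cream_milk",)),
]

# A_i: the set of distinct units seen so far for the i-th attribute.
vocab = defaultdict(set)
for record in stream:
    for i, attr in enumerate(record):
        units = attr if isinstance(attr, tuple) else (attr,)
        vocab[i].update(units)

print({i: sorted(units) for i, units in vocab.items()})
```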
4 METEOR
Overview. For a given attribute (e.g., products) in a multi-modal data stream (e.g., a shopping transaction stream), METEOR groups all the possible low-level units (e.g., white bread and skim milk) of the particular attribute into a set of semantically meaningful clusters (i.e., noisy fixed clusters). Then, for each cluster, a set of basis vectors is trained to learn the costly embeddings (i.e., conventional memory-intensive embeddings) for the units in the particular cluster as a composition of the corresponding basis vectors. For each unit, the composition of the basis vectors (denoted as the compressed embedding) is defined as a sparse continuous vector.
Instead of learning the compressed embeddings of the units directly, METEOR follows a computationally efficient sequential approach, as shown in Figure 3. For a given set of recent records $R_\Delta$, METEOR initially extracts the costly embeddings for the units that appeared in $R_\Delta$, then updates the extracted costly embeddings using $R_\Delta$ to incorporate the recent information. Subsequently, the recency-aware costly embeddings are used to update the corresponding compressed embeddings and basis vectors. This section discusses in detail how these steps are performed incrementally.
4.1 Learning Costly Embeddings
The desired embedding space of METEOR should be able to predict any attribute of a record given the other attributes (i.e., to preserve the first-order co-occurrences in records). Thus, the costly embeddings in METEOR are learned to recover the attributes of records as much as possible, which is formally elaborated as follows. For a given record $r$, the embeddings are learned such that the units (e.g., the product or user in a shopping transaction) of $r$ can be recovered by looking at $r$'s other units. Formally, we model the likelihood for the task of recovering unit $x \in r$ given the other units $r_{-x}$ of $r$ as:

$$p(x \mid r_{-x}) = \frac{\exp(s(x, r_{-x}))}{\sum_{y \in X} \exp(s(y, r_{-x}))} \quad (1)$$

where $X$ is the type (e.g., product or user in a shopping transaction) of $x$, and $s(x, r_{-x})$ is the similarity score between $x$ and $r_{-x}$. We define $s(x, r_{-x})$ as $s(x, r_{-x}) = v_x^T h_x$, where $h_x$ is the mean embedding of $r$'s units except $x$.
Then, the final loss function for the attribute recovery task is the negative log likelihood of recovering all the attributes of the records in the current buffer $R_\Delta$:

$$O_{R_\Delta} = -\sum_{r \in R_\Delta} \sum_{x \in r} \log p(x \mid r_{-x}) \quad (2)$$

The objective function above is approximated using negative sampling (proposed in [13]) for efficient optimization using stochastic gradient descent (SGD). Then, for a selected record $r$ and unit $x \in r$, the loss function is:

$$L_{recon} = -\log(\sigma(s(x, r_{-x}))) - \sum_{n \in N_x} \log(\sigma(-s(n, r_{-x}))) \quad (3)$$

where $\sigma(z) = \frac{1}{1 + \exp(-z)}$ and $N_x$ is a set of randomly selected negative units that have the type of $x$.
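A minimal NumPy sketch of the negative-sampling loss in Equation 3 might look as follows; the function and variable names are ours, not from a released METEOR implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recon_loss(v_x, context_vecs, neg_vecs):
    """Negative-sampling form of Eq. 3 for one unit x of a record r."""
    h_x = context_vecs.mean(axis=0)                       # mean embedding of r minus x
    positive = np.log(sigmoid(v_x @ h_x))                 # pull x towards its context
    negatives = np.log(sigmoid(-(neg_vecs @ h_x))).sum()  # push sampled negatives away
    return -(positive + negatives)

d = 300
loss = recon_loss(rng.normal(size=d),        # target unit embedding
                  rng.normal(size=(2, d)),   # the record's other units
                  rng.normal(size=(3, d)))   # |N_x| = 3 sampled negatives
print(loss)
```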
Adaptive Optimization. Since the loss function is incrementally optimized using a stream, only the recent records in the stream are used to update the embeddings. Hence, we adopt a novel adaptive strategy to optimize the loss function in Equation 3 while alleviating overfitting to the recent records, as follows. For each record $r$, we compute the intra-agreement $\Psi_r$ of $r$'s attributes as:

$$\Psi_r = \frac{\sum_{x, y \in r, x \neq y} \sigma(v_x^\top v_y)}{\sum_{x, y \in r, x \neq y} 1} \quad (4)$$

Then the learning rate for $r$ is calculated as:

$$lr_r = \exp(-\tau \Psi_r) \cdot \eta \quad (5)$$

where $\eta$ denotes the standard learning rate and $\tau$ controls the importance given to $\Psi_r$. If the representations have already overfitted to $r$, then $\Psi_r$ takes a higher value; consequently, a low learning rate is assigned to $r$ to avoid overfitting. In addition, the learning rate for each unit $x$ in $r$ is further weighted using the approach proposed in AdaGrad [5] to alleviate overfitting to frequent items. Then, the update for $v_x$ at the $t$-th timestep is:

$$v_x^{t+1} = v_x^t - \frac{lr_r}{\sqrt{\sum_{j=1}^{t-1} \left(\frac{\partial L}{\partial v_x}\right)_j^2 + \epsilon}} \left(\frac{\partial L}{\partial v_x}\right)_t \quad (6)$$

Our experimental results verify that the proposed optimization technique to accommodate online learning yields comparable (sometimes even superior) results compared to state-of-the-art sampling-based approaches, without storing any historical records. (Sampling-based approaches require historical records to be stored; samples from the historical records are fed along with the recent records to update the embeddings incrementally.)
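The record-level learning rate of Equations 4-5 can be sketched as follows; tau=1.0 below is an assumed placeholder value, not taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def record_learning_rate(unit_vecs, eta=0.05, tau=1.0):
    """Eqs. 4-5: records whose unit embeddings already agree strongly
    (high intra-agreement Psi_r) receive a damped learning rate.
    tau=1.0 is an assumed value for illustration only."""
    n = len(unit_vecs)
    sims = [sigmoid(unit_vecs[i] @ unit_vecs[j])
            for i in range(n) for j in range(n) if i != j]
    psi = float(np.mean(sims))               # intra-agreement Psi_r of the record
    return float(np.exp(-tau * psi) * eta)   # Eq. 5: lr_r = exp(-tau * Psi_r) * eta

vecs = 0.1 * np.random.randn(3, 300)         # embeddings of one record's units
print(record_learning_rate(vecs))
```

In METEOR this record-level rate is further scaled per unit by the AdaGrad denominator of Equation 6.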
4.2 Learning Compressed Embeddings
The aforementioned framework learns an independent embedding vector for each unit, and is thus memory inefficient. To alleviate this problem, METEOR takes the following steps:
• Step 1: Define a set of clusters (i.e., noisy fixed clusters) for a given attribute such that each unit of the particular attribute belongs to a single cluster.
• Step 2: Assign a set of basis vectors to each cluster.
• Step 3: Impose an additional constraint on the costly embeddings of the units such that they are linear combinations of the basis vectors of the corresponding noisy fixed clusters.
Then, the compressed embedding $\hat{v}_x$ of a given unit $x$ is the set of weights related to the corresponding basis vectors (i.e., the composition), which can be used to reconstruct the costly embedding $v_x$ given the basis vectors. The rest of this section discusses each step in detail.

4.2.1 Noisy Fixed Clusters. For a given attribute $a$ (e.g., products in a shopping transaction stream), METEOR defines a set of clusters $C^a = \{C_1^a, C_2^a, \dots, C_{|C^a|}^a\}$ and a cluster assignment function $g_a: x \to c_x$ (where $c_x \in C^a$), which assigns each unit $x$ (e.g., white bread or skim milk if the attribute is a product in a shopping transaction stream) in $a$ to a single cluster in $C^a$. $C^a$ and $g_a$ can be determined either (1) using explicitly available domain knowledge or (2) implicitly using a pretrained embedding space.
(1) Explicit clusters. These are the clusters generated using an explicitly available grouping scheme. As an example, products (i.e., a modality of a shopping transaction stream) can be grouped using explicitly available product categories. Both $C^a$ and $g_a$ can be determined using such a grouping, which does not generally change over time.
(2) Implicit clusters. The clusters in $C^a$ can be generated using a clustering algorithm like KMeans [11] by clustering a pretrained embedding space of $a$ (obtained by optimizing Equation 3 using a subset $R_{pre}$ of $R$). Then, $g_a$ is defined such that it assigns each unit in the embedding space to the closest cluster, based on the Euclidean distance to each cluster.

Figure 4: A small fraction of users in the TF dataset change their clusters over different updating windows if cluster changes are allowed (without Assumption 1)

Figure 5: The sizes of the noisy fixed clusters (64 clusters) generated for users in the TF Dataset are uneven

Assumption 1.
METEOR assumes that the units do not change their clusters once they have been assigned to a noisy fixed cluster. This assumption is valid for explicit clusters. However, it may not be true for implicit clusters, as they are generated using the embeddings of the units, which are updated over time. As shown by [6], allowing hard cluster assignments to change over time could degrade the embedding space due to sudden changes of the embeddings, which are incrementally learned over time. To preserve the validity of this assumption for the implicit clusters, we set the number of noisy fixed clusters to a small number (i.e., < 1% of the total number of units) to make the clusters big enough to minimize cluster changes over time. We empirically observe that Assumption 1 restricts only a few changes in the cluster assignments under the aforementioned setting, as shown in Figure 4. In Figure 4, the users in the TF dataset are clustered into 64 clusters based on their embeddings using KMeans (with the cluster centres initialized using the cluster centres of the previous updating window) at the end of each updating window, and we observe that at most 0.06% of users change their clusters over time.
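For illustration, the implicit clusters and the assignment function $g_a$ could be realized with scikit-learn's KMeans, as in the following sketch (synthetic embeddings stand in for the pretrained space; all names are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for a pretrained embedding space of one attribute
# (in METEOR this comes from optimizing Eq. 3 on a subset R_pre).
pretrained = rng.normal(size=(10_000, 300))

# Keep the cluster count small (< 1% of the units) per Assumption 1.
kmeans = KMeans(n_clusters=64, n_init=10).fit(pretrained)

def g_a(v_x):
    """Assign a (possibly new) unit to its closest noisy fixed cluster."""
    return int(kmeans.predict(v_x.reshape(1, -1))[0])

print(g_a(rng.normal(size=300)))
```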
4.2.2 Basis Vectors. For each noisy fixed cluster $C_i^a$ of attribute $a$, a set of $d$-dimensional vectors $B_i^a$ is assigned to represent $C_i^a$'s basis. Let $|B_i^a|$ be the number of basis vectors in $B_i^a$; then $B_i^a \in \mathbb{R}^{d \times |B_i^a|}$ (where $d \gg |B_i^a|$).
Number of basis vectors per cluster.
We empirically observe that the noisy fixed clusters generated using either explicit or implicit approaches are uneven in size, as shown in Figure 5. Hence, when assigning the basis vectors to different noisy fixed clusters, METEOR assigns more basis vectors to the large clusters using Equation 7 (based on the cluster sizes at initialization) to accommodate uniform encoding for all the units. Let $K_a$ be the total number of basis vectors assigned to the different noisy fixed clusters of attribute $a$; then the number of basis vectors for $C_i^a$, $|B_i^a|$, is calculated as:

$$|B_i^a| = ceil\left(K_a \cdot \frac{|C_i^a|}{\sum_{\forall j} |C_j^a|}\right) \quad (7)$$

where $ceil(.)$ is the standard ceiling operation.

4.2.3 Composition of Basis Vectors. For each unit $x \in C_i^a$, METEOR learns $x$'s costly embedding $v_x$ as a linear combination of $B_i^a$:

$$v_x = B_i^a \cdot \hat{v}_x \quad (8)$$

where $\hat{v}_x$ is the memory-efficient compressed embedding of $x$, $\hat{v}_x \in \mathbb{R}^{|B_i^a|}$ and $|B_i^a| \ll d$. A trivial way to learn the memory-efficient embeddings $\hat{V}$ and basis vectors $B$ is to replace the embeddings in Equation 3 using Equation 8. However, that approach is computationally expensive to perform in an online fashion, as it introduces many matrix multiplications into the loss function.
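The allocation rule of Equation 7 and the composition of Equation 8 are simple enough to sketch directly (illustrative sizes only):

```python
import math
import numpy as np

def allocate_basis(K_a, cluster_sizes):
    """Eq. 7: split K_a basis vectors across clusters proportionally to size."""
    total = sum(cluster_sizes)
    return [math.ceil(K_a * s / total) for s in cluster_sizes]

print(allocate_basis(100, [5000, 3000, 500]))   # larger clusters get more bases

# Eq. 8: a unit's costly embedding is its cluster's bases weighted
# by the sparse compressed code v_hat.
d, k = 300, 8
B = np.random.randn(d, k)       # basis vectors B_i^a of the unit's cluster
v_hat = np.zeros(k)             # compressed embedding (sparse by design)
v_hat[2], v_hat[5] = 0.7, -0.3
v_x = B @ v_hat                 # reconstructed costly embedding
```

Note that the ceiling in Equation 7 means the per-cluster allocations can sum to slightly more than $K_a$, which guarantees every cluster at least one basis vector.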
Thus, METEOR decomposes the original loss function as follows:

$$L = L_{recon} + L_{comp} \quad (9)$$

where $L_{comp}$ for a given unit $x$ of attribute type $a$ is defined as:

$$L_{comp} = (B_i^a \cdot \hat{v}_x - v_x)^2 + \lambda \cdot \|\hat{v}_x\|_1 \quad (10)$$

where $\lambda$ is the weight given to the L1 regularization term. METEOR imposes L1 (i.e., Lasso) regularization on $\hat{V}$ to make the memory-efficient representations $\hat{V}$ as sparse as possible. This yields memory-efficient embeddings with a small fraction of non-zero values, which further reduces the memory requirement to store $\hat{V}$ using sparse matrix storage formats.
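A minimal sketch of optimizing the compression loss of Equation 10 for a single unit via subgradient steps follows; this is our own simplification (METEOR itself uses AdaGrad and also updates the basis vectors):

```python
import numpy as np

LAM = 0.001  # L1 weight, matching the lambda reported in Section 5

def comp_grad_step(B, v_hat, v_x, lr=0.001, lam=LAM):
    """One subgradient step on Eq. 10 w.r.t. the compressed code v_hat."""
    residual = B @ v_hat - v_x
    grad = 2.0 * B.T @ residual + lam * np.sign(v_hat)
    return v_hat - lr * grad   # small fixed step for stability in this sketch

d, k = 300, 8
B, v_x = np.random.randn(d, k), np.random.randn(d)
v_hat = np.zeros(k)
for _ in range(200):
    v_hat = comp_grad_step(B, v_hat, v_x)
print(np.linalg.norm(B @ v_hat - v_x))   # reconstruction error shrinks
```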
Initialization of Compressed Embeddings and Basis Vectors. METEOR exploits the costly embeddings in a pretrained embedding space, which is also used to generate the noisy fixed clusters, and iteratively optimizes $L_{comp}$ in Equation 10 using AdaGrad to initialize $\hat{V}$ and $B$. (The pretraining can be performed on a large server with sufficient memory capacity, as it needs to be performed only once.)
Incremental Learning of Compressed Embeddings and Basis Vectors. The compressed embeddings and the basis vectors are then learned incrementally using the newly arrived records from the stream, to incorporate recent information into the embeddings. For a given new set of records $R_\Delta$, METEOR adopts a sequential approach, as shown in Algorithm 1, to learn the model, instead of jointly optimizing $L$ in Equation 9. Initially, METEOR produces the costly embeddings for the units that appear in $R_\Delta$ (using Equation 8) from the compressed embeddings and basis vectors returned at the end of the previous updating window (Lines 2-5 in Algorithm 1). For new units, METEOR randomly initializes their costly embeddings. Then the costly embeddings are updated using Equation 3 to reconstruct the recent records. The new units of attribute type $a$ are then assigned to the noisy fixed clusters using the cluster assignment function $g_a$ introduced in Section 4.2.1 (Lines 11-14). Subsequently, $\hat{V}$ and $B$ are updated using the recently updated costly embeddings by minimizing the compression loss in Equation 10 (Lines 16-18).

Algorithm 1: METEOR Learning
Input: the noisy fixed cluster assignments $C$; the current compressed embeddings $\hat{V}$; the current basis vectors $B$; a collection of new records $R_\Delta$
Output: the updated $\hat{V}$ and $B$
1:  $V \leftarrow \emptyset$
2:  for each unit $x$ in $R_\Delta$ do
3:      $v_x \leftarrow$ initialize randomly if $x$ is a new unit; otherwise compute using Eq. 8
4:      $V \leftarrow \{v_x\} \cup V$
5:  end for
    // Optimize $L_{recon}$ using $R_\Delta$
6:  for epoch from 1 to $N$ do
7:      for $r$ in $R_\Delta$ do
8:          update $V$ to recover $r$'s attributes using Eq. 3
9:      end for
10: end for
    // Assign noisy fixed clusters to new units
11: for $v_x \in V$ do
12:     if $x$ is a new unit then
13:         assign $v_x$ to the closest noisy fixed cluster
14:     end if
15: end for
    // Optimize $L_{comp}$ using $V$
16: for $v_x \in V$ do
17:     update $\hat{V}$ and $B$ using Eq. 10
18: end for
19: delete $V$; return $\hat{V}$ and $B$

Within an updating window, METEOR maintains costly embeddings only for the units appearing within the particular window, which is a small fraction of the total number of units (see Figure 2). At the end of the updating window, METEOR deletes the costly embeddings and retains only the compressed embeddings and basis vectors in memory, which require considerably less memory (discussed in detail in Section 4.4). Thus, METEOR can be deployed on low-memory devices to learn recency-aware representations for different domains.

4.3 METEOR-Parallel
The other main challenge for a system like METEOR is scalability to high-speed data streams. The number of records processed by such a system should be at least similar to the rate of the stream to accommodate all the records in the stream. Since the rates of multi-modal streams can change significantly with the domain, a domain-agnostic model like METEOR should be able to adapt to a wide range of data rates. Thus, this section proposes a way to accelerate the learning process of METEOR using a parallel-processing architecture with different memory domains. We denote this version of METEOR as "METEOR-Parallel" for the rest of this paper.
Figure 6: Architecture of METEOR-Parallel

As presented above, METEOR optimizes the decomposed objective function in Equation 9 using two sequential steps: (1) update the costly embeddings to reconstruct the records in the current updating window; and (2) optimize the compression loss to update the compressed embeddings and the basis vectors. We have empirically identified that the latter step has a considerably lower time complexity than the former; thus, the bottleneck of the METEOR learning process lies with the former step. Also, we observed that performing Step 1 using parallel processors with separate memory units (e.g., a cluster of computers) does not deteriorate the quality of the compressed embeddings, and preserves the performance for downstream applications (see Figure 8). Hence, the architecture of METEOR-Parallel consists of multiple parallel processes to perform the first step and a single processor to perform the latter, as shown in Figure 6.
Then, for a given set of records $R_\Delta$, METEOR-Parallel divides the records among $p$ parallel processes. Each processor reads the most recent compressed embeddings and basis vectors into its own memory. Each processor then updates the costly embeddings (generated from the compressed embeddings and the basis vectors) to reconstruct the records arriving at that processor, as sketched below. Subsequently, all the updated costly embeddings are passed to the next stage, which centrally optimizes the compression loss to update the compressed embeddings and the basis vectors. (The architecture of METEOR-Parallel is different from conventional multi-threading architectures, which mostly use shared memory. Inside a process in METEOR-Parallel, conventional multi-threading can still be applied for further acceleration. However, we emulate the architecture of METEOR-Parallel using multiple threads in this paper, and leave the performance of METEOR-Parallel on a real computer cluster as future work.)
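A rough emulation of this two-stage loop with Python's ProcessPoolExecutor is sketched below; update_costly and run_window are hypothetical stand-ins for METEOR's Step 1 and the surrounding window logic, not names from the paper:

```python
from concurrent.futures import ProcessPoolExecutor

def update_costly(record_shard):
    # Step 1 (per worker): read the latest compressed embeddings and
    # bases, rebuild costly embeddings, and update them on this shard.
    return {unit: "updated-vector" for record in record_shard for unit in record}

def run_window(records, p=5):
    shards = [records[i::p] for i in range(p)]            # split R_delta over p workers
    with ProcessPoolExecutor(max_workers=p) as pool:
        partials = list(pool.map(update_costly, shards))  # Step 1 in parallel
    costly = {}
    for part in partials:                                 # gather workers' updates
        costly.update(part)
    # Step 2 (central): fold the costly embeddings into the compressed
    # embeddings and basis vectors by minimizing the compression loss.
    return costly

if __name__ == "__main__":
    print(run_window([("u1", "p1"), ("u2", "p2"), ("u3", "p3")]))
```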
4.4 Complexity Analysis
We analyse the complexity of METEOR and METEOR-Parallel using a multi-modal stream in which each attribute $a$ has $|A_a|$ distinct units. Assume that the number of noisy fixed clusters is $|C_a|$ (a small fraction of $|A_a|$) and the total number of basis vectors shared between the noisy fixed clusters of attribute $a$ is $K_a$. The space complexity to store conventional costly embeddings is $O(d \times \sum_{\forall a} |A_a|)$. In contrast, the space complexity of METEOR is $O(\sum_{\forall a} K_a (|A_a|/|C_a| + d))$ in the average case, where $K_a/|C_a| \ll d$ and $K_a \ll |A_a|$. The memory complexity per processor in METEOR-Parallel remains the same.
The time complexity of METEOR consists of: (1) $O(ENMd \max(|R_\Delta|))$ to optimize the reconstruction loss (Step 1); and (2) $O(Ed \sum_{\forall a} \max(|a_\Delta|) K_a / |C_a|)$ to optimize the compression loss (Step 2), where $E$, $N$, and $M$ are the number of epochs, the number of negative samples, and the maximum number of attributes of a record, respectively. $|a_\Delta|$ is the number of distinct units of attribute type $a$ that appear in $R_\Delta$, which is a small fraction of $|A_a|$, as shown in Figure 2. The complexity of METEOR-Parallel with $p$ processes remains the same for Step 2, and is $O(ENMd \max(|R_\Delta|)/p)$ for Step 1.

5 EXPERIMENTS
Datasets.
We conduct our experiments using three shopping transaction datasets:
• Complete Journey Dataset (CJ) contains transactions at a retailer by 2,500 frequent shoppers over two years.
• Ta-Feng Dataset (TF) includes shopping transactions of the Ta-Feng supermarket from November 2000 to February 2001.
• InstaCart Dataset (IC) contains the shopping transactions of Instacart, an online grocery shopping centre, in 2017.
and three geo-tagged Twitter datasets collected from three urbanized cities:
• LA Dataset (LA) contains around 1.2 million geo-tagged tweets from Los Angeles during the last quarter of 2014.
• NY Dataset (NY) includes 1.5 million geo-tagged tweets collected from New York during the last quarter of 2014.
• MB Dataset (MB) includes 263,363 geo-tagged tweets collected from Melbourne during the period from November 2016 to January 2018.
The descriptive statistics of the datasets are shown in Table 1. The evaluation based on datasets from different domains supports the domain-agnostic behaviour of the proposed model. As can be seen, datasets from the same domain also have significantly different statistics. For example, TF has a shorter collection period and IC has a larger user base. This helps to evaluate the performance of METEOR in different environment settings.
Baselines.
We compare METEOR with the following methods (dataset link: https://drive.google.com/file/d/0Byrzhr4bOatCRHdmRVZ1YVZqSzA/view):
• Dim_Reduct reduces the size of the costly embedding vectors ($d$) and learns the embeddings as shown in Section 4.1.
• $m$-bit Quantization evenly quantizes the values in the costly embeddings into $2^m$ bins as post-processing.
• Hash Trick adopts modulo-division hashing to assign low-level units to clusters, and a shared $d$-dimensional embedding is trained for each cluster (see the sketch after this list). The divisor of the hashing function for an attribute $a$ is set as $\gamma \times A$, where $A$ is the total number of distinct units in $a$, and $\gamma$ defines the value of the divisor as a proportion of $A$.
• DCN+Hard Clustering adopts the deep clustering approach proposed in [25] to learn the costly embeddings. At the end of each updating window, the embedding space related to an attribute $a$ is clustered into $\gamma \times A$ clusters, and the embeddings of the units are replaced by the corresponding cluster centers.
We also compare a few online learning techniques with the proposed costly embedding learning approach (METEOR-Full) in METEOR:
• METEOR-Decay and METEOR-Info adopt SGD optimization with the sampling-based online learning methods proposed in [27] and [20], respectively.
• METEOR-Cons adopts SGD optimization with the constraint-based online learning approach proposed in [27].

Table 1: Descriptive statistics of the datasets (Shopping Transaction Datasets and Geo-tagged Twitter Datasets)
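As referenced in the Hash Trick bullet above, a minimal sketch of modulo-division hashing into a shared embedding table follows (illustrative sizes; our own rendering, not the baseline's actual implementation):

```python
import numpy as np

# gamma fixes the shared table's rows as a proportion of the
# attribute's vocabulary size A.
A, gamma, d = 10_000, 0.2, 300
n_rows = int(gamma * A)
table = np.random.randn(n_rows, d)

def embed(unit_id):
    # Modulo-division hashing: semantically unrelated units can collide.
    return table[hash(unit_id) % n_rows]

print(embed("white_bread").shape)
```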
Parameter Settings.
The two main parameters of METEOR are $|C_a|$ and $K_a$. For most of the experiments, we set $|C_a|$ and $K_a$ to small fixed fractions of $|A_a|$ (the total number of distinct units of attribute $a$ in the pretrained embedding space); otherwise, the parameter values are specified. We present the results with different $|C_a|$ and $K_a$ values in Figure 7. The weight given to the sparsity constraint, $\lambda$, is set to 0.001 in METEOR after performing a grid search over candidate values. In addition, all the aforementioned techniques share the following common parameters (default values are given in brackets): (1) the costly embedding dimension $d$ (300); (2) the SGD learning rate $\eta$ (0.05); (3) the number of negative samples $|N_x|$ (3); and (4) the number of epochs $N$ (50). $\tau$ is fixed across all experiments.
Evaluation Metrics.
METEOR is quantitatively evaluated using two retrieval tasks: (1) an intra-basket item retrieval task for the shopping transaction datasets; and (2) a location retrieval task for the geo-tagged Twitter datasets. Both tasks follow a similar experimental setup. Similar to previous work [27], we adopt the following procedure to evaluate the performance of each retrieval task. For each record in the test set, we select one unit (e.g., a product for intra-basket item retrieval or the location for location retrieval) as the target prediction and the rest of the units of the record as the context. We mix the ground-truth target unit with a set of $M$ negative samples (i.e., a set of units that have the type of the ground truth) to generate a candidate pool to rank. $M$ is set to 10 for all the experiments. Then the size-$(M+1)$ candidate pool is sorted to get the rank of the ground truth. The average similarity of each candidate unit to the context of the corresponding test instance is used to produce the ranking of the candidate pool. Cosine similarity is used as the similarity measure for all the baselines.
If the model is well trained, then higher-ranked units are most likely to be the ground truth. Hence, we use two different evaluation metrics to analyze the ranking performance:

(1) Mean Reciprocal Rank: $\text{MRR} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{rank_q}$
(2) Recall@k: $\text{R@k} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \min(1, \lfloor k / rank_q \rfloor)$

where $Q$ is the set of test queries and $rank_q$ refers to the rank of the ground-truth label for the $q$-th query. $\lfloor . \rfloor$ is the floor operation. Good ranking performance should yield higher values for both evaluation metrics.
We divide the records in each data stream into different updating windows such that each window has a length of $\Delta w$ ($\Delta w$ = 1 day for the shopping transaction datasets and $\Delta w$ = 1 hour for the geo-tagged Twitter datasets). The first half of the period of each dataset is used to pretrain the costly embeddings, which are subsequently used to produce the noisy fixed clusters. We randomly select 20 query updating windows from the second half of the period of each dataset, and all the records in the randomly selected time windows are used as test instances. For each query window, we only use the records that arrive before the query window to train the different models. We ignore timestamps in both types of streams in the embedding, as they do not substantially affect the performance of the tasks. The locations in the geo-tagged Twitter streams are discretized into 300 m × 300 m grids to make them feasible for embedding.
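Both metrics are straightforward to compute from the ground-truth ranks; a minimal sketch (with hypothetical ranks) follows:

```python
import math

def mrr(ranks):
    """Mean Reciprocal Rank over the ground-truth ranks of |Q| queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """R@k: the fraction of queries whose ground truth ranks in the top k."""
    return sum(min(1, math.floor(k / r)) for r in ranks) / len(ranks)

ranks = [1, 3, 2, 11, 1]   # hypothetical ranks within size-(M+1) pools
print(mrr(ranks), recall_at_k(ranks, 1))
```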
Table 2: Results for intra-basket item retrieval
| Method | Parameter(s) for reducing model size | Additional memory for a new unit | CJ (size, MRR, R@1) | TF (size, MRR, R@1) | IC (size, MRR, R@1) |
|---|---|---|---|---|---|
| METEOR-Full | costly embeddings (d=300) | 2.34KB | 217MB, 0.6013, 0.4325 | 39MB, –, – | 586MB, –, – |
| METEOR-Info | costly embeddings (d=300) | 2.34KB | 217MB, 0.5991, 0.4275 | 39MB, 0.4046, 0.2205 | 586MB, 0.7482, 0.5852 |
| METEOR-Decay | costly embeddings (d=300) | 2.34KB | 217MB, 0.5984, 0.4221 | 39MB, 0.4020, 0.2186 | 586MB, 0.7117, 0.5442 |
| METEOR-Cons | costly embeddings (d=300) | 2.34KB | 217MB, 0.4610, 0.2742 | 39MB, 0.3996, 0.2031 | 586MB, 0.5942, 0.4193 |
| METEOR | two ($|C_a|$, $K_a$) settings | – | – | – | – |
| Dim_Reduct | d = 100 | 0.76KB | 72MB, 0.5773, 0.4044 | 13MB, 0.4762, 0.2940 | 195MB, 0.6923, 0.5213 |
| Dim_Reduct | d = 50 | 0.38KB | 36MB, 0.5495, 0.3664 | 7MB, 0.4509, 0.2713 | 98MB, 0.6628, 0.4825 |
| Dim_Reduct | d = 25 | 0.19KB | 18MB, 0.5178, 0.3323 | 4MB, 0.4228, 0.2305 | 49MB, 0.6284, 0.4381 |
| Quantization | 8-bit quant. | 0.59KB | 54MB, 0.5677, 0.3722 | 10MB, 0.4672, 0.2892 | 147MB, 0.6836, 0.5077 |
| Quantization | 4-bit quant. | 0.29KB | 27MB, 0.5453, 0.3502 | 5MB, 0.4478, 0.2700 | 73MB, 0.6447, 0.4603 |
| Quantization | 2-bit quant. | 0.15KB | 14MB, 0.4321, 0.2801 | 3MB, 0.3217, 0.1413 | 37MB, 0.5573, 0.3705 |
| Hash Trick | γ (largest) | 4B | 65MB, 0.5376, 0.3487 | 12MB, 0.4335, 0.2535 | 176MB, 0.6558, 0.4727 |
| Hash Trick | γ (middle) | 4B | 43MB, 0.4993, 0.3106 | 8MB, 0.3711, 0.1987 | 117MB, 0.6248, 0.4376 |
| Hash Trick | γ (smallest) | 4B | 22MB, 0.4677, 0.2988 | 4MB, 0.3417, 0.1786 | 58.6MB, 0.5688, 0.3848 |
| DCN+Hard Clustering | γ (largest) | 4B | 65MB, 0.5477, 0.3588 | 12MB, 0.4577, 0.2724 | 176MB, 0.6731, 0.4906 |
| DCN+Hard Clustering | γ (middle) | 4B | 43MB, 0.5321, 0.3411 | 8MB, 0.4122, 0.2236 | 117MB, 0.6482, 0.4583 |
| DCN+Hard Clustering | γ (smallest) | 4B | 22MB, 0.4882, 0.3075 | 4MB, 0.3876, 0.2033 | 58.6MB, 0.5883, 0.4017 |

(1) Compressed Embedding Learning in METEOR. Table 2 and Table 3 show the results collected for intra-basket item retrieval and location retrieval, respectively. METEOR shows significantly better results than the comparable baseline models, i.e., the models with similar size. For example, if we consider the results collected for intra-basket retrieval using the IC dataset, the model size of METEOR is similar to those of Dim_Reduct, Quantization, Hash Trick, and DCN+Hard Clustering at their smallest settings, yet METEOR outperforms these comparably sized baselines by a clear margin in MRR and by 17.55% in Recall@1. This observation is consistent for both intra-basket item retrieval and location retrieval.
Out of the baselines, Dim_Reduct and Quantization are the strongest, but their model sizes increase linearly with the total number of distinct units. Thus, the additional memory required to represent a newly seen unit is higher for those two baselines. For example, Quantization with 4-bit quantization requires 363% more memory per unit compared to the corresponding METEOR configuration. The other two baselines, Hash Trick and DCN+Hard Clustering, have significantly lower overheads for new units than METEOR. However, they perform a hard cluster assignment for each unit, which restricts the flexibility of the representations, thus yielding poor performance on the downstream tasks. In DCN+Hard Clustering, the hard clustering is performed incrementally at the end of each updating window; subsequently, for a given unit, the embedding is taken as the centre of the cluster that the unit belongs to. However, as shown in [6], such hard assignment can degrade the embedding space, which could be the reason for the poor results of DCN+Hard Clustering. Hash Trick is the worst baseline, which could be due to the hard cluster assignment being performed randomly, without considering the semantics of the units. Hence, it is important to consider the semantics when learning memory-efficient representations.
Explicit Clusters as Noisy Fixed Clusters.
As discussed in Section 4.2.1, METEOR can exploit the knowledge available in explicit grouping schemes when assigning the noisy fixed clusters for units. The results collected for intra-basket item retrieval using such explicit product categories in the CJ dataset are shown in Table 4. As can be seen, the compressed embeddings in METEOR can even outperform METEOR-Full, while achieving an 83% reduction in memory compared to METEOR-Full, when the noisy fixed clusters are assigned using the product category information. This could be due to the shared basis vectors inside the noisy fixed clusters: they can capture the additional knowledge introduced by the explicit clusters too, which could help to improve the embeddings of rarely occurring units (which have inaccurate embeddings in general) based on their neighbours in the same explicit cluster. To verify this point, the accuracy of the predicted labels for rarely occurring target test instances (appearing < 10 times) was examined. For those test instances, the compressed embeddings of METEOR with explicit noisy fixed clusters outperform the costly embeddings of METEOR by nearly 2% with respect to MRR. Thus, we can conclude that the compressed embeddings of METEOR with explicit noisy fixed clusters predict rarely occurring ground-truth instances more accurately than the costly embeddings.
Table 3: Results for location retrieval
| Method | Parameter(s) for reducing model size | Additional memory for a new unit | LA (size, MRR, R@1) | NY (size, MRR, R@1) | MB (size, MRR, R@1) |
|---|---|---|---|---|---|
| METEOR-Full | costly embeddings (d=300) | 2.34KB | 1118MB, –, – | 1313MB, –, – | 274MB, –, – |
| METEOR-Info | costly embeddings (d=300) | 2.34KB | 1118MB, 0.8053, 0.6943 | 1313MB, 0.8027, 0.6927 | 274MB, 0.6021, 0.4388 |
| METEOR-Decay | costly embeddings (d=300) | 2.34KB | 1118MB, 0.7614, 0.6488 | 1313MB, 0.7750, 0.6590 | 274MB, 0.5808, 0.4169 |
| METEOR-Cons | costly embeddings (d=300) | 2.34KB | 1118MB, 0.7602, 0.6488 | 1313MB, 0.7701, 0.6562 | 274MB, 0.5762, 0.4001 |
| METEOR | two ($|C_a|$, $K_a$) settings | – | – | – | – |
| Dim_Reduct | d = 100 | 0.76KB | 373MB, 0.7426, 0.6134 | 438MB, 0.7398, 0.6188 | 91MB, 0.5403, 0.3614 |
| Dim_Reduct | d = 50 | 0.38KB | 187MB, 0.7116, 0.5683 | 219MB, 0.7073, 0.5692 | 46MB, 0.5022, 0.3245 |
| Dim_Reduct | d = 25 | 0.19KB | 93MB, 0.6882, 0.5385 | 109MB, 0.6702, 0.5173 | 23MB, 0.4595, 0.2806 |
| Quantization | 8-bit quant. | 0.59KB | 280MB, 0.7262, 0.5882 | 328MB, 0.7288, 0.5946 | 69MB, 0.5333, 0.3572 |
| Quantization | 4-bit quant. | 0.29KB | 140MB, 0.6903, 0.5483 | 164MB, 0.6929, 0.5493 | 34MB, 0.4859, 0.3006 |
| Quantization | 2-bit quant. | 0.15KB | 70MB, 0.5883, 0.4277 | 82MB, 0.5892, 0.4196 | 17MB, 0.4007, 0.2177 |
| Hash Trick | γ (largest) | 4B | 335MB, 0.7001, 0.5481 | 394MB, 0.7065, 0.5582 | 82MB, 0.5338, 0.3520 |
| Hash Trick | γ (middle) | 4B | 224MB, 0.6638, 0.5083 | 263MB, 0.6690, 0.5173 | 55MB, 0.4961, 0.3169 |
| Hash Trick | γ (smallest) | 4B | 112MB, 0.6104, 0.4854 | 131MB, 0.6152, 0.4502 | 27MB, 0.4517, 0.2752 |
| DCN+Hard Clustering | γ (largest) | 4B | 335MB, 0.7177, 0.5764 | 394MB, 0.7208, 0.5872 | 82MB, 0.5382, 0.3520 |
| DCN+Hard Clustering | γ (middle) | 4B | 224MB, 0.6843, 0.5375 | 263MB, 0.6943, 0.5493 | 55MB, 0.5099, 0.3287 |
| DCN+Hard Clustering | γ (smallest) | 4B | 112MB, 0.6429, 0.5104 | 131MB, 0.6544, 0.4921 | 27MB, 0.4728, 0.2970 |
Table 4: Results for intra-basket item retrieval in the CJ Dataset with explicit product clusters (92,339 products grouped into 2,384 product categories) as the noisy fixed clusters

| Variant of METEOR | Average memory for a new unit | Model size | MRR | R@1 |
|---|---|---|---|---|
| Using costly embeddings (d=300) | 2.34KB | 217MB | 0.6013 | 0.4325 |
| Without explicit product clusters (two ($|C_a|$, $K_a$) settings) | – | – | – | – |
| With explicit product clusters (two ($|C_a|$, $K_a$) settings) | – | – | – | – |
Table 5: Ablation study of elements in METEOR

| Method | CJ MRR | CJ R@1 | LA MRR | LA R@1 |
|---|---|---|---|---|
| METEOR | 0.5870 | 0.4098 | 0.7701 | 0.6631 |
| (-) weighted basis vector assignment | 0.5703 | 0.3995 | 0.7577 | 0.6304 |
| (-) sparsity constraint | 0.5852 | 0.4098 | 0.7695 | 0.6631 |
Ablation Study.
To check the significance of the role of each element in METEOR, we perform an ablation study, as shown in Table 5: (1) by removing the weighted basis vector assignment, which instead distributes basis vectors uniformly among the noisy fixed clusters; and (2) by removing the sparsity constraint in Equation 10. As shown, the weighted basis vector assignment plays a significant role in METEOR. Although the sparsity constraint does not account for a significant performance improvement, we empirically observe that it yields compressed embeddings with many zeros, which can be exploited to reduce the memory needed to store the compressed embeddings using sparse matrix storage formats.

Figure 7: Parameter sensitivity on CJ: (a) model sizes and MRRs for different $|C_a|$ values with $K_a$ fixed; and (b) model sizes and MRRs for different $K_a$ values with $|C_a|$ fixed

Parameter Sensitivity.
In Figure 7, we check the performance of METEOR with different $K_a$ and $|C_a|$ values. For a given fixed $K_a$ value, when $|C_a|$ increases (see Figure 7a), the number of basis vectors per cluster reduces. Thus, the number of dimensions of the compressed embeddings reduces, in turn reducing the model size. Meanwhile, small noisy fixed clusters (i.e., high $|C_a|$ values) increase the violations of Assumption 1 (see Section 4.2.1). This could be the reason for the performance drop when $|C_a|$ increases. As shown in Figure 7b, when $K_a$ increases, both the model sizes and the MRRs for the retrieval tasks monotonically increase, ultimately reaching the model size and the performance of METEOR-Full.
(2) Online Learning in METEOR. Table 2 and Table 3 also show the performance of the proposed online learning approach of METEOR (i.e., METEOR-Full) to update costly embeddings. Comparing METEOR-Full with the other online learning variants of METEOR (METEOR-Info, METEOR-Decay, and METEOR-Cons), METEOR-Full's results are comparable with (mostly superior to) the sampling-based online learning variants of METEOR (i.e., METEOR-Decay and METEOR-Info), which store historical records to avoid overfitting to recent records. Also, METEOR-Full outperforms METEOR-Cons, which has a similar memory complexity to METEOR-Full, by as much as 30% with respect to MRR.
Figure 8: Average time taken to process a record and MRR of METEOR-Parallel with $p$ parallel processors: (a) intra-basket item retrieval using CJ; (b) intra-basket item retrieval using IC; (c) location retrieval using LA; and (d) location retrieval using NY

Hence, the proposed adaptive optimization-based online learning technique in METEOR achieves the performance of the state-of-the-art online learning methods (i.e., METEOR-Decay and METEOR-Info) in a memory-efficient manner, without storing any historical records.
(3) METEOR-Parallel. Figure 8 shows the results collected with METEOR-Parallel for different numbers of parallel processes. For both retrieval tasks, METEOR-Parallel yields around a 50% reduction in the time taken to process a single record with 5 parallel processes. In contrast, the drops in performance with respect to MRR with 5 processes are slight: 1.2%, 2.3%, 1.3%, and 1.1% for CJ, IC, LA, and NY, respectively. Thus, METEOR-Parallel can be used to accelerate the embedding learning process of METEOR while preserving the quality of the embeddings.
6 CONCLUSION
In this work, we proposed METEOR, which learns compact representations from multi-modal data streams in an online fashion. The learning of METEOR can be sped up using parallel processes with different memory domains. Our results show that METEOR is a domain-agnostic framework, which can substantially reduce the memory complexity of conventional embedding learning approaches while preserving the quality of the embeddings. METEOR achieves around an 80% reduction in memory compared to conventional costly embeddings, without sacrificing performance, by sharing parameters inside semantically meaningful groupings of the multi-modal units.
Hence, integrating METEOR with other similar explicit/implicit knowledge bases could be a promising research direction. Also, METEOR decides the number of shared parameters for each semantic group (i.e., noisy fixed cluster) based on its size; other factors (e.g., cluster shape) could also be considered when assigning the basis vectors to the noisy fixed clusters. In this work, we emulated the architecture of METEOR-Parallel using multi-threading, and we leave its actual implementation on a cluster of machines as future work. We also aim to explore applications of METEOR in other domains.
ACKNOWLEDGEMENT
This research was financially supported by the Melbourne Graduate Research Scholarship and the Rowden White Scholarship.
REFERENCES
[1] Anish Acharya, Rahul Goel, Angeliki Metallinou, and Inderjit Dhillon. 2019. Online embedding compression for text classification using low rank matrix factorization. In Proc. of AAAI.
[2] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (2013).
[3] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics (2017).
[4] Ting Chen, Martin Renqiang Min, and Yizhou Sun. 2018. Learning k-way d-dimensional discrete codes for compact embedding representations. In Proc. of ICML.
[5] John Duchi, Elad Hazan, and Yoram Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization. In Proc. of COLT.
[6] Xifeng Guo, Long Gao, Xinwang Liu, and Jianping Yin. 2017. Improved deep embedded clustering with local structure preservation. In Proc. of IJCAI.
[7] Richard W Hamming. 1950. Error detecting and error correcting codes. The Bell System Technical Journal (1950).
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proc. of CVPR.
[9] David A Huffman. 1952. A method for the construction of minimum-redundancy codes. Proc. of the IRE (1952).
[10] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In Proc. of AAAI.
[11] Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory (1982).
[12] Avner May, Jian Zhang, Tri Dao, and Christopher Ré. 2019. On the downstream performance of compressed word embeddings. In Proc. of NeurIPS.
[13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS.
[14] Aliakbar Panahi, Seyran Saeedi, and Tom Arodz. 2019. word2ket: Space-efficient Word Embeddings inspired by Quantum Entanglement. arXiv:1911.04975 (2019).
[15] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP.
[16] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv:1508.07909 (2015).
[17] Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. 2019. Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems. arXiv:1909.02107 (2019).
[18] Raphael Shu and Hideki Nakayama. 2017. Compressing word embeddings via deep compositional code learning. arXiv:1711.01068 (2017).
[19] Amila Silva, Shanika Karunasekera, Christopher Leckie, and Ling Luo. 2019. USTAR: Online Multimodal Embedding for Modeling User-Guided Spatiotemporal Activity. In Proc. of IEEE BigData.
[20] Amila Silva, Shanika Karunasekera, Christopher Leckie, and Ling Luo. 2019. USTAR: Online Multimodal Embedding for Modeling User-Guided Spatiotemporal Activity. In Proc. of IEEE BigData.
[21] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014).
[22] Jun Suzuki and Masaaki Nagata. 2016. Learning Compact Neural Word Embeddings by Parameter Space Sharing. In Proc. of IJCAI.
[23] Dan Tito Svenstrup, Jonas Hansen, and Ole Winther. 2017. Hash embeddings for efficient word representations. In Proc. of NIPS.
[24] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proc. of ICML.
[25] Bo Yang, Xiao Fu, Nicholas D Sidiropoulos, and Mingyi Hong. 2017. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proc. of ICML.
[26] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. 2016. Revisiting semi-supervised learning with graph embeddings. arXiv:1603.08861 (2016).
[27] Chao Zhang, Keyang Zhang, Quan Yuan, Fangbo Tao, Luming Zhang, Tim Hanratty, and Jiawei Han. 2017. ReAct: Online multimodal embedding for recency-aware spatiotemporal activity modeling. In Proc. of SIGIR.
[28] Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. 2018. Network representation learning: A survey.