An Algebraic Approach for High-level Text Analytics
Xiuwen Zheng
[email protected]
University of California San Diego
La Jolla, California, USA

Amarnath Gupta
[email protected]
University of California San Diego
La Jolla, California, USA
ABSTRACT
Text analytical tasks like word embedding, phrase mining and topic modeling are placing increasing demands as well as challenges on existing database management systems. In this paper, we provide a novel algebraic approach based on associative arrays. Our data model and algebra can bring together relational operators and text operators, which enables interesting optimization opportunities for hybrid data sources that have both relational and textual data. We demonstrate its expressive power in text analytics using several real-world tasks.
KEYWORDS
associative array, text analytics, natural language processing
ACM Reference Format:
Xiuwen Zheng and Amarnath Gupta. 2020. An Algebraic Approach for High-level Text Analytics. In SSDBM '20: Int. Conf. on Scientific and Statistical Data Management, June 03–05, 2020, Amsterdam, Netherlands. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
A significant part of today's analytical tasks involves text operations. A data scientist who has to manipulate and analyze text data today typically uses a set of text analysis software libraries (e.g., NLTK, Stanford CoreNLP, GenSim) for tasks like word embedding, phrase extraction, named entity recognition and topic modeling. In addition, most DBMS systems today have built-in support for full-text search. PostgreSQL, for instance, admits a text vector (called tsvector) that extracts and creates term and positional indices to enable efficient queries (called tsquery). Yet, some common and seemingly simple text analysis tasks cannot be performed simply within the boundaries of a single information system.
Example 1.
Consider a relational table R(newsID, date, newspaper, title, content) where title and content are text-valued attributes, and two sets L_o, L_p that represent a collection of organization names and person names respectively. Now, consider the following analysis:
• N_1 = Select a subset of news articles from date d_1 through d_2.
• N_2 = Identify all news articles in N_1 that contain more than c_1 organization names from L_o and more than c_2 persons from L_p.
• T_1 = Create a document-term matrix on N_2.content.
• T_2 = Remove rows and columns of the matrix if their row or column marginal sums are below θ_1 and θ_2 respectively.
• M = Compute a topic model using T_2, identifying topics that mention any one person (list L_p) and any one government organization (list L_o).
The analysis itself is straightforward and can be performed with a combination of SQL queries and Python scripts.
Our goal in this short paper is to present the idea that a novel relation-flanked associative array data model has the potential of serving as the underlying framework for the management and analysis of text-centric data. We develop the theoretical elements of the model and illustrate its utility through examples.

A number of current data systems, typically in the domain of polystore data systems, use associative arrays [3, 4] or variants like associative tables [1] and the tensor data model [5]. Many of these data models are used to support analytical (e.g., machine learning) tasks. In our setting, we specialize the essential associative model for text analytics. For our level of abstraction, our model reuses relational operations for all metadata of the associative arrays. While it has been shown [1] that associative arrays can express relational operations, we believe that using the relational abstraction along with our text-centric algebraic operations makes the system easier to program and interpret. At a more basic level, since most text processing operations include sorting (e.g., by TF-IDF scores), our model is based on partially ordered semirings.
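As a concrete illustration of the semirings defined next (a sketch of ours, not the paper's code; the names Semiring, counting and boolean are assumptions), a partially ordered semiring can be packaged as a small value object so that array operations can stay generic over the value set R:

```python
# A sketch, not the paper's code: a (partially ordered) semiring packaged
# as plain callables, so that TAA operations can stay generic over R.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]   # "plus": associative, commutative
    mul: Callable[[Any, Any], Any]   # "times": associative, distributes over add
    zero: Any                        # identity of add; annihilator of mul
    one: Any                         # identity of mul
    leq: Callable[[Any, Any], bool]  # the partial order on R

# The usual counting semiring (N, +, x, 0, 1) used for term occurrences;
# its natural total order makes it (trivially) partially ordered.
counting = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1,
                    lambda a, b: a <= b)

# The Boolean semiring (or, and, False, True), suitable for 0/1 term-index matrices.
boolean = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True,
                   lambda a, b: (not a) or b)
```

The same array code can then run over counts, Booleans, or tf-idf weights by swapping the semiring.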
Definition 2.1 (Semiring).
A semiring is a set R with two binary operations, addition ⊕ and multiplication ⊙, such that: 1) ⊕ is associative and commutative and has an identity element 0 ∈ R; 2) ⊙ is associative with an identity element 1 ∈ R; 3) ⊙ distributes over ⊕; and 4) multiplication by 0 annihilates R.

Definition 2.2 (Partially Ordered Semiring). [2] A semiring R is partially ordered if and only if there exists a partial order relation ≤ on R satisfying the following conditions for all a, b, c ∈ R:
• If a ≤ b, then a ⊕ c ≤ b ⊕ c;
• If a ≤ b and 0 ≤ c, then a ⊙ c ≤ b ⊙ c and c ⊙ a ≤ c ⊙ b.

Definition 2.3 (Text Associative Array).
The Text Associative Array (TAA) A is defined as a mapping A : K_1 × K_2 → R, where K_1 and K_2 are two key sets (named the row key set and column key set respectively), and R is a partially ordered semiring (Definition 2.2). We call K_1 × K_2 "the dimension of A", and denote A.K_1, A.K_2 and A.K as the row key set, column key set, and set of key pairs of A, respectively.
Next, we define the basic operations on text associative arrays, to be used by our primary text operations (Sec. 2.2).

Definition 2.4 (Addition).
Given two TAAs A, B : K_1 × K_2 → R, the addition operation C = (A ⊕ B) : K_1 × K_2 → R is defined as
C(k_1, k_2) = (A ⊕ B)(k_1, k_2) = A(k_1, k_2) ⊕ B(k_1, k_2).
Define 0_{K_1,K_2} as a TAA where 0_{K_1,K_2}(k_1, k_2) = 0 for all k_1 ∈ K_1, k_2 ∈ K_2. 0_{K_1,K_2} serves as an identity for the addition operation on key set K_1 × K_2.

Definition 2.5 (Hadamard Product).
Given two TAAs A, B : K_1 × K_2 → R, the Hadamard product operation C = (A ⊙ B) : K_1 × K_2 → R is defined as
C(k_1, k_2) = (A ⊙ B)(k_1, k_2) = A(k_1, k_2) ⊙ B(k_1, k_2).
Define 1_{K_1,K_2} as a TAA where 1_{K_1,K_2}(k_1, k_2) = 1 for all k_1 ∈ K_1, k_2 ∈ K_2. 1_{K_1,K_2} serves as an identity for the Hadamard product on key set K_1 × K_2.

Definition 2.6 (Array Multiplication).
Given two TAAs A : K_1 × K_2 → R and B : K_2 × K_3 → R, the array multiplication operation C = (A ⊗ B) : K_1 × K_3 → R is defined as
C(k_1, k_3) = (A ⊗ B)(k_1, k_3) = ⊕_{k_2 ∈ K_2} A(k_1, k_2) ⊙ B(k_2, k_3).

Definition 2.7 (Array Identity).
Given two key sets K_1 and K_2, and a partial function f : K_1 ⇀ K_2, the array identity E_{K_1,K_2,f} : K_1 × K_2 → R is defined as a TAA such that
E_{K_1,K_2,f}(k_1, k_2) = 1, if k_1 ∈ dom f and k_2 = f(k_1); 0, otherwise.
Specifically, if dom f = K_1 ∩ K_2 and f(k) = k for all k ∈ dom f, E_{K_1,K_2,f} is abbreviated to E_{K_1,K_2}. In general, E_{K_1,K_2,f} is not an identity for general array multiplication. However, E_{K,K} is an identity element for array multiplication on associative arrays K × K → R.

Definition 2.8 (Kronecker Product).
Given two TAAs A : K_1 × K_2 → R and B : K_3 × K_4 → R, their Kronecker product C = A ⊛ B : (K_1 × K_3) × (K_2 × K_4) → R is defined by
C((k_1, k_3), (k_2, k_4)) = A(k_1, k_2) ⊙ B(k_3, k_4).

Definition 2.9 (Transpose).
Given a TAA A : K_1 × K_2 → R, its transpose, denoted by A^T, is defined by A^T : K_2 × K_1 → R where A^T(k_2, k_1) = A(k_1, k_2) for k_1 ∈ K_1 and k_2 ∈ K_2.

We can express a number of fundamental text operations using the proposed TAA algebra. We first define three basic TAAs specifically for text analytics; then a series of text operations will be defined on general TAAs or these basic structures.
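To make the algebra above concrete, here is a minimal sketch (ours, not the paper's; function names are assumptions) of TAAs as sparse Python dictionaries over the ordinary (+, ×, 0, 1) semiring, with zero-valued entries simply omitted:

```python
# A sketch of Definitions 2.4-2.9 over the ordinary (+, x, 0, 1) semiring.
# A TAA is a sparse dict mapping (row_key, col_key) -> value; entries equal
# to the semiring zero are left out of the dict. Function names are ours.

def taa_add(A, B):
    """Addition: C(k1, k2) = A(k1, k2) + B(k1, k2)."""
    C = dict(A)
    for k, v in B.items():
        C[k] = C.get(k, 0) + v
    return C

def taa_hadamard(A, B):
    """Hadamard product: C(k1, k2) = A(k1, k2) * B(k1, k2)."""
    return {k: A[k] * B[k] for k in A.keys() & B.keys()}

def taa_matmul(A, B):
    """Array multiplication: C(k1, k3) = sum over k2 of A(k1, k2) * B(k2, k3)."""
    C = {}
    for (k1, k2), a in A.items():
        for (k2b, k3), b in B.items():
            if k2 == k2b:
                C[(k1, k3)] = C.get((k1, k3), 0) + a * b
    return C

def taa_transpose(A):
    """Transpose: A^T(k2, k1) = A(k1, k2)."""
    return {(k2, k1): v for (k1, k2), v in A.items()}
```

For instance, with A = {('d1', 'cat'): 2} and B = {('cat', 'mammal'): 1}, taa_matmul(A, B) yields {('d1', 'mammal'): 2}.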
Definition 2.10 (Document-Term Matrix).
Given a text corpus, a document-term matrix is defined as a TAA M : D × T → R where D and T are the document set and term set of the corpus. The term set in the document-term matrix can be the vocabulary or the bigrams of the corpus, or an application-specific user-defined set of interesting terms. The matrix value M(d, t) can also take different semantics: in one application it can be the occurrence count of t in document d, while in another it can be the term frequency-inverse document frequency (tf-idf). Typically, elements of D and T will have additional relational metadata; a document may have a date, and a term may have an annotation like a part-of-speech (POS) tag.

Definition 2.11 (Term-Index Matrix).
Given a document d, the term-index matrix is defined as a TAA N : T_d × I → {0, 1} where T_d = {d} × T is the set of terms in document d and I = {1, ..., I_d} is the index set (I_d is the size of d). Specifically, for (d, t) ∈ T_d and i ∈ I,
N((d, t), i) = 1, if the i-th word of document d is t; 0, otherwise.

Example 2.
For a document d = "Today is a sunny day", let its term-index matrix be N : ({d} × T) × I → {0, 1}; then we have T = {"today", "is", "a", "sunny", "day"}, I = {1, 2, 3, 4, 5}, and N("today", 1) = 1, N("is", 2) = 1, N("a", 3) = 1, N("sunny", 4) = 1, N("day", 5) = 1. For all other (t, i) pairs where (t, i) ∈ T × I, we have N(t, i) = 0.

Definition 2.12 (Term Vector).
There are two types of term vectors. 1) Given a set of terms T of a document d, the term vector is defined as a TAA V : {d} × T → R. 2) Given a set of terms T for a collection of documents D, V : {1} × T → R is a term vector for the corpus D.
The term vector represents some attribute of terms in the scope of one document or a corpus. For example, for a document d, the value of the term vector V : {d} × T can be the occurrence of each term in this document. For a corpus D, the value of its term vector V : {1} × T can be the idf value for each term in the whole corpus; this value is not specific to a single document.
Based on these structures, we can define our unit text operators as follows. Some operators are defined for general TAAs, while some are defined for a specific type of TAA.

Definition 2.13 (Extraction).
Given a TAA A : K_1 × K_2 → R and two projection sets K_1' ⊆ K_1, K_2' ⊆ K_2, we define the extraction operation as
Π_{K_1',K_2'}(A) = E_{K_1',K_1} ⊗ A ⊗ E^T_{K_2',K_2}.
Let B = Π_{K_1',K_2'}(A); then B(k_1, k_2) = A(k_1, k_2) for all (k_1, k_2) ∈ K_1' × K_2'. When only extracting row keys, the operation can be expressed as Π_{K_1',:}, and when only extracting column keys, it is expressed as Π_{:,K_2'}.

Definition 2.14 (Rename).
Given a TAA A : K_1 × K_2 → R, suppose K_2' is another column key set and there exists a bijection f : K_2 → K_2'. The column rename operation is defined as
ρ_{K_1, K_2→K_2', f}(A) = A ⊗ E_{K_2,K_2',f}.
Similarly, given another row key set K_1' and a bijection f : K_1 → K_1', the row rename operation is defined as
ρ_{K_1→K_1', K_2, f}(A) = E_{K_1',K_1,f^{-1}} ⊗ A.
The subscript f can be omitted if the bijection is clear, e.g., |dom f| = 1. In addition, the row rename operation and the column rename operation can be combined together as ρ_{K_1→K_1', K_2→K_2'}(A). Our rename operator is more general than the rename operation of relational algebra since it supports both row key set and column key set renaming.

Definition 2.15 (Apply).
Given a TAA A : K_1 × K_2 → R and a function f : R → R, define the apply operator by Apply_f(A) : K_1 × K_2 → R where
Apply_f(A)(k_1, k_2) = f(A(k_1, k_2)), for all (k_1, k_2) ∈ K_1 × K_2.

Definition 2.16 (Filter).
Given a TAA A : K_1 × K_2 → R and an indicator function f : R → {0, 1}, define the filter operation on A as B = Filter_f(A) = σ_f(A) : K_1f × K_2f → R, where
K_1f × K_2f = {(k_1, k_2) | (k_1, k_2) ∈ K_1 × K_2 and f(A(k_1, k_2)) = 1},
and B(k_1, k_2) = A(k_1, k_2).

Definition 2.17 (Sort).
Given a TAA A : K_1 × K_2 → R, for any k_1 ∈ K_1 we extract a TAA V = Π_{{k_1},:}(A) of dimension {k_1} × K_2. Since R is a partially ordered semiring (Definition 2.2), the value set {V(k_1, x) | x ∈ K_2} ⊆ R inherits the partial order from R, which implies an order V(k_1, x_1) ≤ V(k_1, x_2) ≤ ... ≤ V(k_1, x_{|K_2|}). Define Idx(k_1, x_i) = i; then the sort-by-column operation is defined as
Sort_c(A) : K_1 × K_2 → {1, ..., |K_2|},
where Sort_c(A)(k_1, x) = Idx(k_1, x). Similarly, we have the sort-by-row operation Sort_r(A) : K_1 × K_2 → {1, ..., |K_1|}. When the column key dimension or row key dimension is 1 (e.g., for a term vector), Sort_c or Sort_r is abbreviated to Sort.

Definition 2.18 (Merge).
Given two TAAs A : K_1A × K_2A → R and B : K_1B × K_2B → R, if (K_1A × K_2A) ∩ (K_1B × K_2B) = ∅, then the merge operation can be applied to them; it is defined as C = Merge(A, B) : K_1 × K_2 → R where K_1 = K_1A ∪ K_1B and K_2 = K_2A ∪ K_2B, and
C(k_1, k_2) = A(k_1, k_2), if (k_1, k_2) ∈ K_1A × K_2A; B(k_1, k_2), if (k_1, k_2) ∈ K_1B × K_2B; 0, otherwise.

Definition 2.19 (Expand).
Given an elementwise binary operator OP on associative arrays (e.g., ⊕ and ⊙), a term vector V : {1} × T → R and a document-term matrix M : D × T → R, the expand operator is defined as
Expand_OP(V, M) = ρ_{{1}×D→D, T×{1}→T}(V ⊛ 1_{D,{1}}) OP M.
This operator implicitly expands the term vector V to generate another associative array M' : D × T → R where M'(d, t) = V(1, t) for all d ∈ D and t ∈ T, and then applies OP on M' and M.
Suppose that for a corpus D there is a term vector V : {1} × T → R where V(1, t) is the mean occurrence of term t in D (i.e., Count_t / |D|, where Count_t is the total occurrence of t in D), and there is a document-term matrix M : D × T → R; then
Expand_⊕(Apply_{f(x) = −x}(V), M)
will generate the difference of each term's occurrence in each document from its average occurrence.

Definition 2.20 (Flatten).
Given an associative array A : K_1 × K_2 → R, the flatten operation is defined by Flatten(A) : {1} × (K_1 × K_2) → R where Flatten(A)(1, (k_1, k_2)) = A(k_1, k_2) for all (k_1, k_2) ∈ K_1 × K_2.

Definition 2.21 (Left Shift).
Given a term-index matrix N : ({d} × T) × I → R and a non-negative integer n, define the left shift operator by LShift_n(N) : ({d} × T) × I → R where LShift_n(N) = LShift(LShift_{n−1}(N)) and
LShift(N)((d, t), i) = N((d, t), i + 1), if i < I_d; 0, if i = I_d.
For a term-index matrix N of document d, LShift(N) generates another term-index matrix N' where N'((d, t), i) = 1 iff t is the (i + 1)-th word in d.

Definition 2.22 (Union).
Suppose there are two term-index matrices with the same index set I, N_1 : ({d} × T_1) × I → R and N_2 : ({d} × T_2) × I → R. The union operation on N_1 and N_2 is defined by
Union(N_1, N_2) = ρ_{({d}×T_1)×({d}×T_2) → {d}×(T_1×T_2), I×I → I}(Π_{:, {(i,i) | i ∈ I}}(N_1 ⊛ N_2)).
Suppose N = Union(N_1, N_2); then
N((d, (t_1, t_2)), i) = 1, if N_1((d, t_1), i) = N_2((d, t_2), i) = 1; 0, otherwise.
The left shift and union operations can be composed to compute all bigrams of a document. Given the term-index matrix N of document d, let N' = Union(N, LShift(N)); then N'((d, (t_1, t_2)), i) = 1 iff (t_1, t_2) is the i-th bigram in document d.

Definition 2.23 (Sum).
The sum operation takes as input a TAA A : K_1 × K_2 → R and an integer which can take the value 0, 1 or 2, with different semantics based on the integer value:
B : {1} × {1} = Sum_0(A) where B(1, 1) = ⊕_{k_1 ∈ K_1} ⊕_{k_2 ∈ K_2} A(k_1, k_2);
B : {1} × K_2 = Sum_1(A) where B(1, k_2) = ⊕_{k_1 ∈ K_1} A(k_1, k_2);
B : K_1 × {1} = Sum_2(A) where B(k_1, 1) = ⊕_{k_2 ∈ K_2} A(k_1, k_2).

As we stated in Section 2.2, a document-term matrix is a common representation model for a collection of documents, where the terms can be a list of important terms, the whole vocabulary, or bigrams. The entries of the matrix can be either the occurrence of each term or its tf-idf value.
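The left-shift/union recipe for bigrams (Definitions 2.21–2.22) can be sketched in plain Python on the same sparse dictionary representation used earlier (a sketch of ours; the function names are assumptions, not the paper's):

```python
# A sketch of bigram extraction via left shift + union (Defs 2.21-2.22).
# N is a sparse term-index matrix: ((doc, term), position) -> 1, 1-based positions.

def tokenize(doc_id, text):
    """Build the term-index matrix N((d, t), i) = 1 iff the i-th word of d is t."""
    return {((doc_id, t), i): 1 for i, t in enumerate(text.split(), start=1)}

def lshift(N):
    """LShift(N)((d, t), i) = N((d, t), i + 1): move every term one slot left."""
    return {((d, t), i - 1): v for ((d, t), i), v in N.items() if i > 1}

def union(N1, N2):
    """Union(N1, N2)((d, (t1, t2)), i) = 1 iff N1((d, t1), i) = N2((d, t2), i) = 1."""
    out = {}
    for ((d, t1), i) in N1:
        for ((d2, t2), j) in N2:
            if d == d2 and i == j:
                out[((d, (t1, t2)), i)] = 1
    return out

N = tokenize("d", "today is a sunny day")
bigrams = union(N, lshift(N))   # keys ((d, (t_i, t_{i+1})), i)
```

Here union(N, lshift(N)) pairs the word at position i with the word at position i + 1, exactly the bigram construction described above.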
Example 3.
For a document collection C, build a document-term matrix where the terms are all unigrams and bigrams in C, and the values are the occurrence of each term in the whole corpus.
Suppose there is a tokenization function called Tokenize that takes a document d as input and generates a term-index matrix N : ({d} × T) × I. The construction can be decomposed into two parts; the first part is to construct a term vector for one single document d containing all unigrams and bigrams together with their corresponding occurrences. Fig. 1 shows the construction process.

1. N_1 = Tokenize(d) : ({d} × T) × I
2. V_1 = ρ_{{1}→{d}, {d}×T→T}((Sum_2(N_1))^T) : {d} × T
3. T_2 = N_1 ⊗ (LShift(N_1))^T : ({d} × T) × ({d} × T)
4. V_2 = Flatten(T_2) : {1} × (({d} × T) × ({d} × T))
5. V_3 = ρ_{{1}→{d}, ({d}×T)×({d}×T)→(T×T)}(V_2) : {d} × (T × T)
6. V_4 = σ_{f: x→(x>0)}(V_3) : {d} × (T × T)
7. V_d = Merge(V_1, V_4) : {d} × (T ∪ (T × T))

Figure 1: Algebraic representation for the task in Example 3.
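The net effect of Figure 1 — one vector per document holding unigram and bigram counts — can be sketched directly in plain Python (ours, not the paper's code; the name term_vector is an assumption):

```python
# A plain-Python sketch of what Figure 1 computes for one document:
# a single vector keyed by unigrams and bigrams, valued by occurrence counts.
from collections import Counter

def term_vector(tokens):
    unigrams = Counter(tokens)                  # Steps 1-2: unigram counts
    bigrams = Counter(zip(tokens, tokens[1:]))  # Steps 3-6: bigram counts
    merged = dict(unigrams)                     # Step 7: Merge of the two vectors
    merged.update(dict(bigrams))                # string vs tuple keys never collide
    return merged

V = term_vector("today is a sunny day".split())
# V["sunny"] == 1 and V[("sunny", "day")] == 1
```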
Step 1 generates the term-index matrix where each term is a unigram. The Sum operation in Step 2 generates the term vector where V_1(d, t) is the occurrence of unigram t in document d. Steps 3–6 get the term vector V_4 whose column key set is all bigrams in d. Step 7 concatenates the two term vectors to get the representation for d.
For each document d_i in collection D = {d_1, ..., d_n}, we get its term vector V_{d_i} : {d_i} × (T_i ∪ (T_i × T_i)) → R using the above steps, then apply the Merge operation to get the document-term matrix M : D × T → R, where T = (T_1 ∪ ... ∪ T_n) ∪ ((T_1 × T_1) ∪ ... ∪ (T_n × T_n)) is the union of all unigrams and bigrams in the whole corpus:
Merge(V_{d_1}, Merge(V_{d_2}, ..., Merge(V_{d_{n−1}}, V_{d_n}))).
Besides word occurrences as the values of the document-term matrix, one can also use a term's tf-idf value. If all terms are considered, the document-term matrix M would be high-dimensional and sparse, which would be costly to manipulate. A simple and commonly adopted method to reduce the dimension is to select informative words. The following presents the queries to get a document-term matrix M with tf-idf values for only informative terms, where informativeness is measured by the idf value.

Example 4. Given a collection of documents D, we have to generate a document-term matrix M for the top 1000 "informative words" where M(d, t) is the tf-idf value for term t in document d.
Suppose there is a term-document matrix M_1 which stores the occurrences of all unigrams in each document (the construction is similar to that of Example 3 and thus is skipped). M can be generated by the following steps. The function idf in Step 4 calculates the idf value, which is defined as idf(x) = −log(x / |D|), where x is the number of documents that contain a specific term.

1. M_2 = Apply_{f: x→(x>0)}(M_1) : D × T
2. V_1 = Sum_1(M_2) : {1} × T
3. I_1 = σ_{f: x→(x≤1000)}(Sort(V_1)) : {1} × T'
4. V_2 = Apply_{idf}(Π_{:, I_1.K_2}(V_1)) : {1} × T'
5. M_3 = Π_{:, I_1.K_2}(M_1) : D × T'
6. M = Expand_⊙(V_2, M_3) : D × T'

Figure 2: Algebraic representation for the task in Example 4.
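The steps of Figure 2 can be sketched in plain Python over the sparse dictionary representation (ours, not the paper's code; tfidf_matrix, M1 and top_k are assumed names):

```python
# A sketch of Figure 2 in plain Python (function and variable names ours).
# M1 is a sparse document-term count matrix: (doc, term) -> occurrence count.
import math

def tfidf_matrix(M1, top_k=1000):
    docs = {d for (d, _t) in M1}
    # Steps 1-2: 0/1 presence matrix, then column sums = document frequency.
    df = {}
    for (d, t), c in M1.items():
        if c > 0:
            df[t] = df.get(t, 0) + 1
    # Step 3: rank terms by df ascending (rarest = most informative), keep top_k.
    keep = set(sorted(df, key=lambda t: df[t])[:top_k])
    # Step 4: idf(x) = -log(x / |D|) for the kept terms.
    idf = {t: -math.log(df[t] / len(docs)) for t in keep}
    # Steps 5-6: project to kept terms, scale counts by idf (the Expand step).
    return {(d, t): c * idf[t] for (d, t), c in M1.items() if t in keep}
```

Ties in document frequency are broken arbitrarily here, whereas the Sort operator leaves the order among equal values unspecified as well.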
For Example 1 introduced in Section 1, we express the analysis using relational algebra and the associative array operations. Suppose that the maximum number of words for a term in L_o ∪ L_p is 3; the analysis can then be expressed as follows. Step 1 is expressed in relational algebra. TopicModel in the last step is a function which takes a document-term matrix and produces a document-topic matrix and a topic-term matrix, the standard outputs of topic modeling, represented by another two TAAs DTM and TTM. Let T = σ_{f: x→(x > n_t − k)}(Sort_c(DTM)), where n_t is the number of topics; then T.K will return all (d, t) pairs where t is one of the top-k topics for d.

1. D = π_content(σ_{d_1 ≤ date ≤ d_2}(R))
2. Initialize M : {} × {} → R, FV : {} × {} → R
3. For each d ∈ D:
   3.1 N_1 = Tokenize(d); V = ρ_{{1}→{d}, {d}×T→T}((Sum_2(N_1))^T)
   3.2 N_2 = Union(N_1, LShift(N_1))
   3.3 N_3 = Union(N_2, LShift_2(N_1))
   3.4 N_4 = Merge(N_1, Merge(N_2, N_3))
   3.5 V_f = ρ_{{1}→{d}, {d}×T'→T'}((Sum_2(N_4))^T)
   3.6 FV = Merge(FV, V_f)
   3.7 M = Merge(M, V)
4. FV_o = Π_{:, L_o}(FV); FV_p = Π_{:, L_p}(FV)
5. I_o = σ_{f: x→(x > c_1)}(Sum_2(FV_o)); I_p = σ_{f: x→(x > c_2)}(Sum_2(FV_p))
6. M = Π_{I_o.K_1 ∩ I_p.K_1, :}(M)
7. I_t = σ_{f: x→(x ≥ θ_1)}(Sum_1(M)); I_d = σ_{f: x→(x ≥ θ_2)}(Sum_2(M))
8. M = Π_{I_d.K_1, I_t.K_2}(M)
9. DTM, TTM = TopicModel(M)

Figure 3: Algebraic representation for the task in Example 1.
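The entity-count filter at the heart of Figure 3 (the steps computing FV_o, FV_p, I_o and I_p) amounts to the following plain-Python sketch (ours; the names filter_docs_by_entities, orgs and persons are assumptions standing in for L_o and L_p):

```python
# A sketch of Figure 3's entity filter: keep documents that mention more
# than c1 organization terms and more than c2 person terms. FV is the sparse
# feature-vector matrix (doc, term) -> count built in the per-document loop.

def filter_docs_by_entities(FV, orgs, persons, c1, c2):
    def mentions(d, vocab):
        # Sum_2 over the projection to vocab: total mentions of vocab terms in d.
        return sum(c for (dd, t), c in FV.items() if dd == d and t in vocab)
    docs = {d for (d, _t) in FV}
    return {d for d in docs
            if mentions(d, orgs) > c1 and mentions(d, persons) > c2}
```

The surviving document set plays the role of I_o.K_1 ∩ I_p.K_1, which Figure 3 then uses to project the document-term matrix before topic modeling.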
REFERENCES
[1] Pablo Barceló, Nelson Higuera, Jorge Pérez, and Bernardo Subercaseaux. 2019. On the Expressiveness of LARA: A Unified Language for Linear and Relational Algebra. arXiv preprint arXiv:1909.11693 (2019).
[2] Jonathan S. Golan. 2013. Semirings and Affine Equations over Them: Theory and Applications. Vol. 556. Springer Science & Business Media.
[3] Hayden Jananthan, Ziqi Zhou, Vijay Gadepally, Dylan Hutchison, Suna Kim, and Jeremy Kepner. 2017. Polystore Mathematics of Relational Algebra. In Int. Conf. on Big Data. IEEE, 3180–3189.
[4] Jeremy Kepner, Vijay Gadepally, Hayden Jananthan, Lauren Milechin, and Siddharth Samsi. 2020. AI Data Wrangling with Associative Arrays. arXiv preprint arXiv:2001.06731 (2020).
[5] Éric Leclercq, Annabelle Gillet, Thierry Grison, and Marinette Savonnet. 2019. Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics. In