An Algebraic Approach for High-level Text Analytics
Xiuwen Zheng
[email protected]
University of California San Diego
La Jolla, California, USA

Amarnath Gupta
[email protected]
University of California San Diego
La Jolla, California, USA
ABSTRACT
Text analytical tasks like word embedding, phrase mining and topic modeling are placing increasing demands as well as challenges on existing database management systems. In this paper, we provide a novel algebraic approach based on associative arrays. Our data model and algebra can bring together relational operators and text operators, which enables interesting optimization opportunities for hybrid data sources that have both relational and textual data. We demonstrate its expressive power in text analytics using several real-world tasks.
KEYWORDS
associative array, text analytics, natural language processing
ACM Reference Format:
Xiuwen Zheng and Amarnath Gupta. 2020. An Algebraic Approach for High-level Text Analytics. In SSDBM '20: Int. Conf. on Scientific and Statistical Data Management, June 03–05, 2020, Amsterdam, Netherlands. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
A significant part of today's analytical tasks involves text operations. A data scientist who has to manipulate and analyze text data today typically uses a set of text analysis software libraries (e.g., NLTK, Stanford CoreNLP, GenSim) for tasks like word embedding, phrase extraction, named entity recognition and topic modeling. In addition, most DBMS systems today have built-in support for full-text search. PostgreSQL, for instance, admits a text vector (called tsvector) that extracts and creates term and positional indices to enable efficient queries (called tsquery). Yet, some common and seemingly simple text analysis tasks cannot be performed simply within the boundaries of a single information system.
Example 1.
Consider a relational table R(newsID, date, newspaper, title, content) where title and content are text-valued attributes, and two sets L_o, L_p that represent a collection of organization names and person names respectively. Now, consider the following analysis:
• N_1 = Select a subset of news articles from date d_1 through d_2.
• N_2 = Identify all news articles in N_1 that contain more than c_1 organization names from L_o and more than c_2 persons from L_p.
• T_1 = Create a document-term matrix on N_2.content.
• T_2 = Remove rows and columns of the matrix if their row or column marginal sums are below θ_1 and θ_2 respectively.
• M = Compute a topic model using T_2, identifying topics that mention any one person (list L_p) and any one government organization (list L_o).
The analysis itself is straightforward and can be performed with a combination of SQL queries and Python scripts.
Our goal in this short paper is to present the idea that a novel relation-flanked associative array data model has the potential of serving as the underlying framework for the management and analysis of text-centric data. We develop the theoretical elements of the model and illustrate its utility through examples.

A number of current data systems, typically in the domain of polystore data systems, use associative arrays [3, 4] or variants like associative tables [1] and the tensor data model [5]. Many of these data models are used to support analytical (e.g., machine learning) tasks. In our setting, we specialize the essential associative model for text analytics. For our level of abstraction, our model reuses relational operations for all metadata of the associative arrays. While it has been shown [1] that associative arrays can express relational operations, we believe that using the relational abstraction along with our text-centric algebraic operations makes the system easier to program and interpret. At a more basic level, since most text processing operations include sorting (e.g., by TF-IDF scores), our model is based on partially ordered semirings.
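As a concrete illustration of the semirings defined next (a sketch of ours, not the paper's code; the names Semiring, counting and boolean are assumptions), a partially ordered semiring can be packaged as a small value object so that array operations can stay generic over the value set R:

```python
# A sketch, not the paper's code: a (partially ordered) semiring packaged
# as plain callables, so that TAA operations can stay generic over R.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    add: Callable[[Any, Any], Any]   # "plus": associative, commutative
    mul: Callable[[Any, Any], Any]   # "times": associative, distributes over add
    zero: Any                        # identity of add; annihilator of mul
    one: Any                         # identity of mul
    leq: Callable[[Any, Any], bool]  # the partial order on R

# The usual counting semiring (N, +, x, 0, 1) used for term occurrences;
# its natural total order makes it (trivially) partially ordered.
counting = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1,
                    lambda a, b: a <= b)

# The Boolean semiring (or, and, False, True), suitable for 0/1 term-index matrices.
boolean = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True,
                   lambda a, b: (not a) or b)
```

The same array code can then run over counts, Booleans, or tf-idf weights by swapping the semiring.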
Definition 2.1 (Semiring).
A semiring is a set R with two binary operations, addition ⊕ and multiplication ⊙, such that: 1) ⊕ is associative and commutative and has an identity element 0 ∈ R; 2) ⊙ is associative with an identity element 1 ∈ R; 3) ⊙ distributes over ⊕; and 4) multiplication by 0 annihilates R.

Definition 2.2 (Partially Ordered Semiring). [2] A semiring R is partially ordered if and only if there exists a partial order relation ≤ on R satisfying the following conditions for all a, b, c ∈ R:
• If a ≤ b, then a ⊕ c ≤ b ⊕ c;
• If a ≤ b and 0 ≤ c, then a ⊙ c ≤ b ⊙ c and c ⊙ a ≤ c ⊙ b.

Definition 2.3 (Text Associative Array).
The Text Associative Array (TAA) A is defined as a mapping A : K_1 × K_2 → R, where K_1 and K_2 are two key sets (named the row key set and column key set respectively), and R is a partially ordered semiring (Definition 2.2). We call K_1 × K_2 "the dimension of A", and denote A.K_1, A.K_2 and A.K as the row key set, column key set, and set of key pairs of A, respectively.
Next, we define the basic operations on text associative arrays, to be used by our primary text operations (Sec. 2.2).

Definition 2.4 (Addition).
Given two TAAs A, B : K_1 × K_2 → R, the addition operation C = (A ⊕ B) : K_1 × K_2 → R is defined as
C(k_1, k_2) = (A ⊕ B)(k_1, k_2) = A(k_1, k_2) ⊕ B(k_1, k_2).
Define 0_{K_1,K_2} as a TAA where 0_{K_1,K_2}(k_1, k_2) = 0 for all k_1 ∈ K_1, k_2 ∈ K_2. 0_{K_1,K_2} serves as an identity for the addition operation on key set K_1 × K_2.

Definition 2.5 (Hadamard Product).
Given two TAAs A, B : K_1 × K_2 → R, the Hadamard product operation C = (A ⊙ B) : K_1 × K_2 → R is defined as
C(k_1, k_2) = (A ⊙ B)(k_1, k_2) = A(k_1, k_2) ⊙ B(k_1, k_2).
Define 1_{K_1,K_2} as a TAA where 1_{K_1,K_2}(k_1, k_2) = 1 for all k_1 ∈ K_1, k_2 ∈ K_2. 1_{K_1,K_2} serves as an identity for the Hadamard product on key set K_1 × K_2.

Definition 2.6 (Array Multiplication).
Given two TAAs A : K_1 × K_2 → R and B : K_2 × K_3 → R, the array multiplication operation C = (A ⊗ B) : K_1 × K_3 → R is defined as
C(k_1, k_3) = (A ⊗ B)(k_1, k_3) = ⊕_{k_2 ∈ K_2} A(k_1, k_2) ⊙ B(k_2, k_3).

Definition 2.7 (Array Identity).
Given two key sets K_1 and K_2, and a partial function f : K_1 ⇀ K_2, the array identity E_{K_1,K_2,f} : K_1 × K_2 → R is defined as a TAA such that
E_{K_1,K_2,f}(k_1, k_2) = 1, if k_1 ∈ dom f and k_2 = f(k_1); 0, otherwise.
Specifically, if dom f = K_1 ∩ K_2 and f(k) = k for all k ∈ dom f, E_{K_1,K_2,f} is abbreviated to E_{K_1,K_2}. In general, E_{K_1,K_2,f} is not an identity for general array multiplication. However, E_{K,K} is an identity element for array multiplication on associative arrays K × K → R.

Definition 2.8 (Kronecker Product).
Given two TAAs A : K_1 × K_2 → R and B : K_3 × K_4 → R, their Kronecker product C = A ⊛ B : (K_1 × K_3) × (K_2 × K_4) → R is defined by
C((k_1, k_3), (k_2, k_4)) = A(k_1, k_2) ⊙ B(k_3, k_4).

Definition 2.9 (Transpose).
Given a TAA A : K_1 × K_2 → R, its transpose, denoted by A^T, is defined by A^T : K_2 × K_1 → R where A^T(k_2, k_1) = A(k_1, k_2) for k_1 ∈ K_1 and k_2 ∈ K_2.

We can express a number of fundamental text operations using the proposed TAA algebra. We first define three basic TAAs specifically for text analytics; then a series of text operations will be defined on general TAAs or these basic structures.
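To make the algebra above concrete, here is a minimal sketch (ours, not the paper's; function names are assumptions) of TAAs as sparse Python dictionaries over the ordinary (+, ×, 0, 1) semiring, with zero-valued entries simply omitted:

```python
# A sketch of Definitions 2.4-2.9 over the ordinary (+, x, 0, 1) semiring.
# A TAA is a sparse dict mapping (row_key, col_key) -> value; entries equal
# to the semiring zero are left out of the dict. Function names are ours.

def taa_add(A, B):
    """Addition: C(k1, k2) = A(k1, k2) + B(k1, k2)."""
    C = dict(A)
    for k, v in B.items():
        C[k] = C.get(k, 0) + v
    return C

def taa_hadamard(A, B):
    """Hadamard product: C(k1, k2) = A(k1, k2) * B(k1, k2)."""
    return {k: A[k] * B[k] for k in A.keys() & B.keys()}

def taa_matmul(A, B):
    """Array multiplication: C(k1, k3) = sum over k2 of A(k1, k2) * B(k2, k3)."""
    C = {}
    for (k1, k2), a in A.items():
        for (k2b, k3), b in B.items():
            if k2 == k2b:
                C[(k1, k3)] = C.get((k1, k3), 0) + a * b
    return C

def taa_transpose(A):
    """Transpose: A^T(k2, k1) = A(k1, k2)."""
    return {(k2, k1): v for (k1, k2), v in A.items()}
```

For instance, with A = {('d1', 'cat'): 2} and B = {('cat', 'mammal'): 1}, taa_matmul(A, B) yields {('d1', 'mammal'): 2}.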
Definition 2.10 (Document-Term Matrix).
Given a text corpus, a document-term matrix is defined as a TAA M : D × T → R where D and T are the document set and term set of the corpus. The term set in the document-term matrix can be the vocabulary or the bigrams of the corpus, or an application-specific user-defined set of interesting terms. The matrix value M(d, t) can also take different semantics: in one application it can be the occurrence count of t in document d, while in another it can be the term frequency-inverse document frequency (tf-idf). Typically, elements of D and T will have additional relational metadata; a document may have a date, and a term may have an annotation like a part-of-speech (POS) tag.

Definition 2.11 (Term-Index Matrix).
Given a document d, the term-index matrix is defined as a TAA N : T_d × I → {0, 1} where T_d = {d} × T is the set of terms in document d and I = {1, ..., I_d} is the index set (I_d is the size of d). Specifically, for (d, t) ∈ T_d and i ∈ I,
N((d, t), i) = 1, if the i-th word of document d is t; 0, otherwise.

Example 2.
For a document d = "Today is a sunny day", let its term-index matrix be N : ({d} × T) × I → {0, 1}; then we have T = {"today", "is", "a", "sunny", "day"}, I = {1, 2, 3, 4, 5}, and N("today", 1) = 1, N("is", 2) = 1, N("a", 3) = 1, N("sunny", 4) = 1, N("day", 5) = 1. For all other (t, i) pairs where (t, i) ∈ T × I, we have N(t, i) = 0.

Definition 2.12 (Term Vector).
There are two types of term vectors. 1) Given a set of terms T of a document d, the term vector is defined as a TAA V : {d} × T → R. 2) Given a set of terms T for a collection of documents D, V : {1} × T → R is a term vector for the corpus D.
The term vector represents some attribute of terms in the scope of one document or a corpus. For example, for a document d, the value of the term vector V : {d} × T can be the occurrence of each term in this document. For a corpus D, the value of its term vector V : {1} × T can be the idf value for each term in the whole corpus; this value is not specific to a single document.
Based on these structures, we can define our unit text operators as follows. Some operators are defined for general TAAs, while some are defined for a specific type of TAA.

Definition 2.13 (Extraction).
Given a TAA A : K_1 × K_2 → R and two projection sets K_1' ⊆ K_1, K_2' ⊆ K_2, we define the extraction operation as
Π_{K_1',K_2'}(A) = E_{K_1',K_1} ⊗ A ⊗ E^T_{K_2',K_2}.
Let B = Π_{K_1',K_2'}(A); then B(k_1, k_2) = A(k_1, k_2) for all (k_1, k_2) ∈ K_1' × K_2'. When only extracting row keys, the operation can be expressed as Π_{K_1',:}, and when only extracting column keys, it is expressed as Π_{:,K_2'}.

Definition 2.14 (Rename).
Given a TAA A : K_1 × K_2 → R, suppose K_2' is another column key set and there exists a bijection f : K_2 → K_2'. The column rename operation is defined as
ρ_{K_1, K_2→K_2', f}(A) = A ⊗ E_{K_2,K_2',f}.
Similarly, given another row key set K_1' and a bijection f : K_1 → K_1', the row rename operation is defined as
ρ_{K_1→K_1', K_2, f}(A) = E_{K_1',K_1,f^{-1}} ⊗ A.
The subscript f can be omitted if the bijection is clear, e.g., |dom f| = 1. In addition, the row rename operation and the column rename operation can be combined together as ρ_{K_1→K_1', K_2→K_2'}(A). Our rename operator is more general than the rename operation of relational algebra since it supports both row key set and column key set renaming.

Definition 2.15 (Apply).
Given a TAA A : K_1 × K_2 → R and a function f : R → R, define the apply operator by Apply_f(A) : K_1 × K_2 → R where
Apply_f(A)(k_1, k_2) = f(A(k_1, k_2)), for all (k_1, k_2) ∈ K_1 × K_2.

Definition 2.16 (Filter).
Given a TAA A : K_1 × K_2 → R and an indicator function f : R → {0, 1}, define the filter operation on A as B = Filter_f(A) = σ_f(A) : K_1f × K_2f → R, where
K_1f × K_2f = {(k_1, k_2) | (k_1, k_2) ∈ K_1 × K_2 and f(A(k_1, k_2)) = 1},
and B(k_1, k_2) = A(k_1, k_2).

Definition 2.17 (Sort).
Given a TAA A : K_1 × K_2 → R, for any k_1 ∈ K_1 we extract a TAA V = Π_{{k_1},:}(A) of dimension {k_1} × K_2. Since R is a partially ordered semiring (Definition 2.2), the value set {V(k_1, x) | x ∈ K_2} ⊆ R inherits the partial order from R, which implies an order V(k_1, x_1) ≤ V(k_1, x_2) ≤ ... ≤ V(k_1, x_{|K_2|}). Define Idx(k_1, x_i) = i; then the sort-by-column operation is defined as
Sort_c(A) : K_1 × K_2 → {1, ..., |K_2|},
where Sort_c(A)(k_1, x) = Idx(k_1, x). Similarly, we have the sort-by-row operation Sort_r(A) : K_1 × K_2 → {1, ..., |K_1|}. When the column key dimension or row key dimension is 1 (e.g., for a term vector), Sort_c or Sort_r is abbreviated to Sort.

Definition 2.18 (Merge).
Given two TAAs A : K_1A × K_2A → R and B : K_1B × K_2B → R, if (K_1A × K_2A) ∩ (K_1B × K_2B) = ∅, then the merge operation can be applied to them; it is defined as C = Merge(A, B) : K_1 × K_2 → R where K_1 = K_1A ∪ K_1B and K_2 = K_2A ∪ K_2B, and
C(k_1, k_2) = A(k_1, k_2), if (k_1, k_2) ∈ K_1A × K_2A; B(k_1, k_2), if (k_1, k_2) ∈ K_1B × K_2B; 0, otherwise.

Definition 2.19 (Expand).
Given an elementwise binary operator OP on associative arrays (e.g., ⊕ and ⊙), a term vector V : {1} × T → R and a document-term matrix M : D × T → R, the expand operator is defined as
Expand_OP(V, M) = ρ_{{1}×D→D, T×{1}→T}(V ⊛ 1_{D,{1}}) OP M.
This operator implicitly expands the term vector V to generate another associative array M' : D × T → R where M'(d, t) = V(1, t) for all d ∈ D and t ∈ T, and then applies OP on M' and M.
Suppose that for a corpus D there is a term vector V : {1} × T → R where V(1, t) is the mean occurrence of term t in D (i.e., Count_t / |D|, where Count_t is the total occurrence of t in D), and there is a document-term matrix M : D × T → R; then
Expand_⊕(Apply_{f(x) = −x}(V), M)
will generate the difference of each term's occurrence in each document from its average occurrence.

Definition 2.20 (Flatten).
Given an associative array A : K_1 × K_2 → R, the flatten operation is defined by Flatten(A) : {1} × (K_1 × K_2) → R where Flatten(A)(1, (k_1, k_2)) = A(k_1, k_2) for all (k_1, k_2) ∈ K_1 × K_2.

Definition 2.21 (Left Shift).
Given a term-index matrix N : ({d} × T) × I → R and a non-negative integer n, define the left shift operator by LShift_n(N) : ({d} × T) × I → R where LShift_n(N) = LShift(LShift_{n−1}(N)) and
LShift(N)((d, t), i) = N((d, t), i + 1), if i < I_d; 0, if i = I_d.
For a term-index matrix N of document d, LShift(N) generates another term-index matrix N' where N'((d, t), i) = 1 iff t is the (i + 1)-th word in d.

Definition 2.22 (Union).
Suppose there are two term-index matrices with the same index set I, N_1 : ({d} × T_1) × I → R and N_2 : ({d} × T_2) × I → R. The union operation on N_1 and N_2 is defined by
Union(N_1, N_2) = ρ_{({d}×T_1)×({d}×T_2) → {d}×(T_1×T_2), I×I → I}(Π_{:, {(i,i) | i ∈ I}}(N_1 ⊛ N_2)).
Suppose N = Union(N_1, N_2); then
N((d, (t_1, t_2)), i) = 1, if N_1((d, t_1), i) = N_2((d, t_2), i) = 1; 0, otherwise.
The left shift and union operations can be composed to compute all bigrams of a document. Given the term-index matrix N of document d, let N' = Union(N, LShift(N)); then N'((d, (t_1, t_2)), i) = 1 iff (t_1, t_2) is the i-th bigram in document d.

Definition 2.23 (Sum).
The sum operation takes as input a TAA A : K_1 × K_2 → R and an integer which can take the value 0, 1 or 2, with different semantics based on the integer value:
B : {1} × {1} = Sum_0(A) where B(1, 1) = ⊕_{k_1 ∈ K_1} ⊕_{k_2 ∈ K_2} A(k_1, k_2);
B : {1} × K_2 = Sum_1(A) where B(1, k_2) = ⊕_{k_1 ∈ K_1} A(k_1, k_2);
B : K_1 × {1} = Sum_2(A) where B(k_1, 1) = ⊕_{k_2 ∈ K_2} A(k_1, k_2).

As we stated in Section 2.2, a document-term matrix is a common representation model for a collection of documents, where the terms can be a list of important terms, the whole vocabulary, or bigrams. The entries of the matrix can be either the occurrence of each term or its tf-idf value.
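The left-shift/union recipe for bigrams (Definitions 2.21–2.22) can be sketched in plain Python on the same sparse dictionary representation used earlier (a sketch of ours; the function names are assumptions, not the paper's):

```python
# A sketch of bigram extraction via left shift + union (Defs 2.21-2.22).
# N is a sparse term-index matrix: ((doc, term), position) -> 1, 1-based positions.

def tokenize(doc_id, text):
    """Build the term-index matrix N((d, t), i) = 1 iff the i-th word of d is t."""
    return {((doc_id, t), i): 1 for i, t in enumerate(text.split(), start=1)}

def lshift(N):
    """LShift(N)((d, t), i) = N((d, t), i + 1): move every term one slot left."""
    return {((d, t), i - 1): v for ((d, t), i), v in N.items() if i > 1}

def union(N1, N2):
    """Union(N1, N2)((d, (t1, t2)), i) = 1 iff N1((d, t1), i) = N2((d, t2), i) = 1."""
    out = {}
    for ((d, t1), i) in N1:
        for ((d2, t2), j) in N2:
            if d == d2 and i == j:
                out[((d, (t1, t2)), i)] = 1
    return out

N = tokenize("d", "today is a sunny day")
bigrams = union(N, lshift(N))   # keys ((d, (t_i, t_{i+1})), i)
```

Here union(N, lshift(N)) pairs the word at position i with the word at position i + 1, exactly the bigram construction described above.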
Example 3.
For a document collection C, build a document-term matrix where the terms are all unigrams and bigrams in C, and the values are the occurrence of each term in the whole corpus.
Suppose there is a tokenization function called Tokenize that takes a document d as input and generates a term-index matrix N : ({d} × T) × I. The construction can be decomposed into two parts; the first part is to construct a term vector for one single document d containing all unigrams and bigrams together with their corresponding occurrences. Fig. 1 shows the construction process.

1. N_1 = Tokenize(d) : ({d} × T) × I
2. V_1 = ρ_{{1}→{d}, {d}×T→T}((Sum_2(N_1))^T) : {d} × T
3. T_2 = N_1 ⊗ (LShift(N_1))^T : ({d} × T) × ({d} × T)
4. V_2 = Flatten(T_2) : {1} × (({d} × T) × ({d} × T))
5. V_3 = ρ_{{1}→{d}, ({d}×T)×({d}×T)→(T×T)}(V_2) : {d} × (T × T)
6. V_4 = σ_{f: x→(x>0)}(V_3) : {d} × (T × T)
7. V_d = Merge(V_1, V_4) : {d} × (T ∪ (T × T))

Figure 1: Algebraic representation for the task in Example 3.
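The net effect of Figure 1 — one vector per document holding unigram and bigram counts — can be sketched directly in plain Python (ours, not the paper's code; the name term_vector is an assumption):

```python
# A plain-Python sketch of what Figure 1 computes for one document:
# a single vector keyed by unigrams and bigrams, valued by occurrence counts.
from collections import Counter

def term_vector(tokens):
    unigrams = Counter(tokens)                  # Steps 1-2: unigram counts
    bigrams = Counter(zip(tokens, tokens[1:]))  # Steps 3-6: bigram counts
    merged = dict(unigrams)                     # Step 7: Merge of the two vectors
    merged.update(dict(bigrams))                # string vs tuple keys never collide
    return merged

V = term_vector("today is a sunny day".split())
# V["sunny"] == 1 and V[("sunny", "day")] == 1
```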
Step 1 generates the term-index matrix where each term is a unigram. The Sum operation in Step 2 generates the term vector where V_1(d, t) is the occurrence of unigram t in document d. Steps 3–6 get the term vector V_4 whose column key set is all bigrams in d. Step 7 concatenates the two term vectors to get the representation for d.
For each document d_i in collection D = {d_1, ..., d_n}, we get its term vector V_{d_i} : {d_i} × (T_i ∪ (T_i × T_i)) → R using the above steps, then apply the Merge operation to get the document-term matrix M : D × T → R, where T = (T_1 ∪ ... ∪ T_n) ∪ ((T_1 × T_1) ∪ ... ∪ (T_n × T_n)) is the union of all unigrams and bigrams in the whole corpus:
Merge(V_{d_1}, Merge(V_{d_2}, ..., Merge(V_{d_{n−1}}, V_{d_n}))).
Besides word occurrences as the values of the document-term matrix, one can also use a term's tf-idf value. If all terms are considered, the document-term matrix M would be high-dimensional and sparse, which would be costly to manipulate. A simple and commonly adopted method to reduce the dimension is to select informative words. The following presents the queries to get a document-term matrix M with tf-idf values for only informative terms, where informativeness is measured by the idf value.

Example 4. Given a collection of documents D, we have to generate a document-term matrix M for the top 1000 "informative words" where M(d, t) is the tf-idf value for term t in document d.
Suppose there is a term-document matrix M_1 which stores the occurrences of all unigrams in each document (the construction is similar to that of Example 3 and thus is skipped). M can be generated by the following steps. The function idf in Step 4 calculates the idf value, which is defined as idf(x) = −log(x / |D|), where x is the number of documents that contain a specific term.

1. M_2 = Apply_{f: x→(x>0)}(M_1) : D × T
2. V_1 = Sum_1(M_2) : {1} × T
3. I_1 = σ_{f: x→(x≤1000)}(Sort(V_1)) : {1} × T'
4. V_2 = Apply_{idf}(Π_{:, I_1.K_2}(V_1)) : {1} × T'
5. M_3 = Π_{:, I_1.K_2}(M_1) : D × T'
6. M = Expand_⊙(V_2, M_3) : D × T'

Figure 2: Algebraic representation for the task in Example 4.
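The steps of Figure 2 can be sketched in plain Python over the sparse dictionary representation (ours, not the paper's code; tfidf_matrix, M1 and top_k are assumed names):

```python
# A sketch of Figure 2 in plain Python (function and variable names ours).
# M1 is a sparse document-term count matrix: (doc, term) -> occurrence count.
import math

def tfidf_matrix(M1, top_k=1000):
    docs = {d for (d, _t) in M1}
    # Steps 1-2: 0/1 presence matrix, then column sums = document frequency.
    df = {}
    for (d, t), c in M1.items():
        if c > 0:
            df[t] = df.get(t, 0) + 1
    # Step 3: rank terms by df ascending (rarest = most informative), keep top_k.
    keep = set(sorted(df, key=lambda t: df[t])[:top_k])
    # Step 4: idf(x) = -log(x / |D|) for the kept terms.
    idf = {t: -math.log(df[t] / len(docs)) for t in keep}
    # Steps 5-6: project to kept terms, scale counts by idf (the Expand step).
    return {(d, t): c * idf[t] for (d, t), c in M1.items() if t in keep}
```

Ties in document frequency are broken arbitrarily here, whereas the Sort operator leaves the order among equal values unspecified as well.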
For Example 1 introduced in Section 1, we express the analysis using relational algebra and the associative array operations. Suppose that the maximum number of words for a term in L_o ∪ L_p is 3; the analysis can then be expressed as follows. Step 1 is expressed in relational algebra. TopicModel in the last step is a function which takes a document-term matrix and produces a document-topic matrix and a topic-term matrix, the standard outputs of topic modeling, represented by another two TAAs DTM and TTM. Let T = σ_{f: x→(x > n_t − k)}(Sort_c(DTM)), where n_t is the number of topics; then T.K will return all (d, t) pairs where t is one of the top-k topics for d.

1. D = π_content(σ_{d_1 ≤ date ≤ d_2}(R))
2. Initialize M : {} × {} → R, FV : {} × {} → R
3. For each d ∈ D:
   3.1 N_1 = Tokenize(d); V = ρ_{{1}→{d}, {d}×T→T}((Sum_2(N_1))^T)
   3.2 N_2 = Union(N_1, LShift(N_1))
   3.3 N_3 = Union(N_2, LShift_2(N_1))
   3.4 N_4 = Merge(N_1, Merge(N_2, N_3))
   3.5 V_f = ρ_{{1}→{d}, {d}×T'→T'}((Sum_2(N_4))^T)
   3.6 FV = Merge(FV, V_f)
   3.7 M = Merge(M, V)
4. FV_o = Π_{:, L_o}(FV); FV_p = Π_{:, L_p}(FV)
5. I_o = σ_{f: x→(x > c_1)}(Sum_2(FV_o)); I_p = σ_{f: x→(x > c_2)}(Sum_2(FV_p))
6. M = Π_{I_o.K_1 ∩ I_p.K_1, :}(M)
7. I_t = σ_{f: x→(x ≥ θ_1)}(Sum_1(M)); I_d = σ_{f: x→(x ≥ θ_2)}(Sum_2(M))
8. M = Π_{I_d.K_1, I_t.K_2}(M)
9. DTM, TTM = TopicModel(M)

Figure 3: Algebraic representation for the task in Example 1.
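The entity-count filter at the heart of Figure 3 (the steps computing FV_o, FV_p, I_o and I_p) amounts to the following plain-Python sketch (ours; the names filter_docs_by_entities, orgs and persons are assumptions standing in for L_o and L_p):

```python
# A sketch of Figure 3's entity filter: keep documents that mention more
# than c1 organization terms and more than c2 person terms. FV is the sparse
# feature-vector matrix (doc, term) -> count built in the per-document loop.

def filter_docs_by_entities(FV, orgs, persons, c1, c2):
    def mentions(d, vocab):
        # Sum_2 over the projection to vocab: total mentions of vocab terms in d.
        return sum(c for (dd, t), c in FV.items() if dd == d and t in vocab)
    docs = {d for (d, _t) in FV}
    return {d for d in docs
            if mentions(d, orgs) > c1 and mentions(d, persons) > c2}
```

The surviving document set plays the role of I_o.K_1 ∩ I_p.K_1, which Figure 3 then uses to project the document-term matrix before topic modeling.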
REFERENCES
[1] Pablo Barceló, Nelson Higuera, Jorge Pérez, and Bernardo Subercaseaux. 2019. On the Expressiveness of LARA: A Unified Language for Linear and Relational Algebra. arXiv preprint arXiv:1909.11693 (2019).
[2] Jonathan S. Golan. 2013. Semirings and Affine Equations over Them: Theory and Applications. Vol. 556. Springer Science & Business Media.
[3] Hayden Jananthan, Ziqi Zhou, Vijay Gadepally, Dylan Hutchison, Suna Kim, and Jeremy Kepner. 2017. Polystore Mathematics of Relational Algebra. In Int. Conf. on Big Data. IEEE, 3180–3189.
[4] Jeremy Kepner, Vijay Gadepally, Hayden Jananthan, Lauren Milechin, and Siddharth Samsi. 2020. AI Data Wrangling with Associative Arrays. arXiv preprint arXiv:2001.06731 (2020).
[5] Éric Leclercq, Annabelle Gillet, Thierry Grison, and Marinette Savonnet. 2019. Polystore and Tensor Data Model for Logical Data Independence and Impedance Mismatch in Big Data Analytics. In