A Lightweight Algorithm to Uncover Deep Relationships in Data Tables
Jin Cao
Nokia Bell Labs
[email protected]

Yibo Zhao
Indeed
[email protected]

Linjun Zhang
University of [email protected]

Jason Li
Academy for Information Technology, Scotch Plains, NJ
[email protected]
ABSTRACT
Many data we collect today are in tabular form, with rows as records and columns as attributes associated with each record. Understanding the structural relationships in tabular data can greatly facilitate the data science process. Traditionally, much of this relational information is stored in table schemas and maintained by their creators, usually domain experts. In this paper, we develop automated methods to uncover deep relationships in a single data table without expert or domain knowledge. Our method can decompose a data table into layers of smaller tables, revealing its deep structure. The key to our approach is a computationally lightweight forward addition algorithm that we developed to recursively extract the functional dependencies between table columns, and that scales to tables with many columns. With our solution, data scientists will be provided with automatically generated, data-driven insights when exploring new data sets.
CCS CONCEPTS
• Information systems → Data mining; • Mathematics of computing → Statistical paradigms.

KEYWORDS
Functional Dependency, Machine Learning, Random Permutation, Feature Engineering
1 INTRODUCTION
Prior to machine learning activities, the data scientist has to develop an understanding of the data. For a relational database, the contextual and structural information is expressed in the data schema, which is often generated by domain experts at the time of data creation. However, schema maintenance on big or evolving data is non-trivial, and often requires substantial manual effort and domain knowledge. In this paper, we aim to develop automated profiling methods for tabular data that allow for a fast and accurate understanding of the structural relationships between data columns. Our work is motivated by
Automated Machine Learning [1], whose goal is to develop automated procedures that improve the efficiency of machine learning for non-experts.

The key to our structural relationship discovery is the efficient extraction of important functional dependencies between the columns of a data table T with N columns. In simple words, a column combination C functionally determines a column Y, written C → Y, if and only if each set of C values is associated with precisely one Y value. The dependency is minimal if it no longer holds after the removal of any column in C. Existing approaches to uncovering functional dependencies focus on extracting all solutions of C for a given column Y in a data table. Unlike these methods, our approach recursively extracts the most important structural relations for the purpose of data profiling and understanding. Specifically, we discover those minimal functional dependencies that tend to contain fewer columns.

Another key to our approach is the recursive strategy. We start the discovery from the combination of all columns, i.e., the row index, as the initial Y. We find the descendants of Y, defined as those columns that are functionally dependent on Y. From its descendants, we attempt to find a small set of columns C such that C → Y. If such a C exists, we then recursively apply the same process to each column (or combination of columns) in C until failure. The outcome of the whole procedure is a skeleton of a schema tree, with nodes representing columns and a split representing a minimal functional dependency between a parent and its children. After the skeleton is extracted, we attach the remaining columns not in the skeleton to one of the skeleton nodes as descendants to complete the tree construction.
With this schema tree, a data table is decomposed into layers of smaller tables, thus revealing the deep dependencies within. Our main contributions include:
• A forward addition (FA) algorithm that identifies a solution of a column combination C that functionally determines a given column Y with only O(D log N) distinct count evaluations for a size constraint D, i.e., the number of columns in the solution C is at most D.
• A success probability analysis for finding a solution C with size constraint D in one run of the FA algorithm.
• Algorithms for finding the best solutions from multiple runs of FA with size and error constraints.
• A recursive process to build a schema tree giving a concise representation of data structure and dependency.
• An example of how to utilize the discovered schema tree for feature engineering.

Before we proceed, we discuss how our work on structural relationship discovery based on functional dependency relates to the widely studied statistical (or probabilistic) dependency between columns. Notice that functional dependency between columns is deterministic, while statistical dependency is probabilistic. In this view, statistical dependency can be viewed as a generalization of functional dependency. Many methods have been developed to detect strong statistical dependencies between columns in a data table, from simple metrics such as Pearson correlation [2] and mutual information [3] to more complex methods such as graphical models [4].

2 BACKGROUND AND RELATED WORK
In this section, we first provide the necessary background and then discuss related work. We start with notation and definitions. Let T be a data table with N columns and R rows. For the work presented in this paper, we assume there are no duplicated rows in T. Let C and Y be two column combinations.

Definition 1. For a column combination C, define r(C) = the distinct row count over the columns C of table T, and |C| as the number of elements (columns) in C.
Note that r(C) is non-decreasing with respect to column additions to C, and its maximum value is R.

Definition 2. C functionally determines Y, or C → Y, if and only if r(C) = r(C ∪ Y), i.e., each C value is associated with precisely one Y value. It is a minimal functional dependency if the removal of any column from C breaks the dependency.

Definition 3. Define
Descendant(C) as the set of all columns Y such that C → Y.

Definition 4. C is called a minimal unique if r(C) = R, and if the removal of any column from C breaks the equality. Notice this is a special case of Definition 2 where Y is the set of all columns.

For a column combination C, it is easy to obtain Descendant(C) by simply checking the equality r(C) = r(C ∪ Y) for each column Y. The inverse problem is much harder: for a given Y, find C such that C → Y is a minimal functional dependency. Many algorithms have been proposed in the literature, which can be classified into 'column-wise' algorithms ([5, 6]), 'row-wise' algorithms ([7–9]), or hybrid methods ([10, 11]). These methods focus on finding all solution sets of C, since finding one solution is considered a simple problem. For example, a commonly used approach for finding one solution of a minimal unique is what we call the 'Backward Elimination' (BE) algorithm. It starts from the complete set (all columns), then recursively attempts to eliminate a column from the set while maintaining the distinct count R. A solution is obtained when such an operation is no longer possible. It is, however, a much harder problem to find all solution sets of a minimal functional dependency, as in the worst case the number of possible solutions is exponential in N. Therefore, earlier work seeks algorithms that find all solutions with computational complexity polynomial in the size of the solution set. In contrast to these methods, we do not derive all solutions of C that satisfy the functional dependency but instead focus on finding important functional relations, especially those C that contain fewer columns. By extracting these functional dependencies recursively, they can be succinctly expressed in the form of a tree that is very useful for data understanding and greatly facilitates downstream machine learning.

The rest of the paper is organized as follows.
In Section 3, we present our FA algorithm to find one solution for (near) functional dependency and show that it favors short solutions under random column permutations. Section 4 presents algorithms to find the best solutions with multiple runs of the FA algorithm. In Section 5, we present an algorithm to build a schema tree for a data table using recursive functional dependency discovery. In Section 6, an application of the discovered schema tree to feature engineering is demonstrated. We conclude in Section 7.

3 THE FORWARD ADDITION (FA) ALGORITHM
In this section, we present a lightweight forward addition (FA) algorithm to obtain a single solution C such that C → Y for a column combination Y with a size constraint D, i.e., |C| ≤ D. We first illustrate our method for the case of finding minimal uniques and show that FA is much faster than the BE algorithm (Section 2) for a table with a large number of columns N. We then discuss the probabilistic version of FA using random column permutations and show that this probabilistic variation favors shorter solutions of C (i.e., those with fewer columns). Finally, we extend our algorithm to the general case of (approximate) functional dependency.

We first present the FA algorithm for deriving one solution of a minimal unique C (Definition 4) with a size (number of columns) no bigger than D. Given a sequence of column indices L which is a permutation of {1, 2, ..., N}, the FA algorithm attempts to find C in at most D sweeps of L. We start with an empty set C, and in each sweep one element of L is added to C until r(C) = R. The algorithm is presented both in pictorial (Figure 1) and pseudo code form (Algorithm 1). A detailed description follows.

Algorithm 1
FA Algorithm for finding a minimal unique with size ≤ D for table T
Input: L, a permutation of {1, 2, ..., N}
1: Initialize: C ← ∅, d ← 0, L ← L
2: while r(C) < R and d < D
3:   T ← C
4:   for i = 1 to |L|
5:     T ← (T, l_i), where l_i is the i-th element of L
6:     if r(T) = R then
7:       C ← (C, l_i), L ← (l_1, l_2, ..., l_{i−1}), d ← d + 1
8:       break (go to line 2)
9: if r(C) = R return C, else return ∅

First, initialize C = ∅, d = 0 and L = L (line 1 and the first panel). Next, we create a temporary set of columns T that is identical to C (line 3). We now add indices in L one by one to T until, for the first time, the distinct row count for the column combination T reaches R (lines 4-6 and panel 2). Suppose this is the h-th element of L. From here, we can conclude that {l_1, l_2, ..., l_h} must contain a solution set of a minimal unique, and that column l_h is required in the solution, so we add l_h to C and ignore the remaining columns after the first h columns of L (lines 7-8, and panel 3 in the figure, where the shaded columns are to be discarded). This completes one sweep of L. With an updated C, we go back to line 2 and repeat this process at most D times, where each time we find one column of L to add to the solution set C, until either r(C) = R or we exceed the size limit D (line 2 in the algorithm; the 4th and 5th panels of Figure 1 show the second iteration of this).
Figure 1: Illustration of Algorithm 1. The rectangle represents the data matrix with columns l_1, ..., l_N. Panel 1: the data matrix with columns l_1, ..., l_N. Panel 2 illustrates the result of the first iteration (line 6 with i = h), where column l_h (blue) is the first selected column; columns in the shaded area are then excluded from further consideration. Panel 3 shows column l_h moved to the front with the rest remaining in the same order (line 7). Panels 4 and 5 show the result of the second iteration. Panel 6 shows the final solution set consisting of the blue columns.

If r(C) = R, then C is a minimal unique with |C| ≤ D. Otherwise we have failed to find a solution (line 9, or panel 6).

It is helpful to understand how the FA algorithm works in the case of multiple minimal uniques using illustrative examples. Let the initial L be L = (1, 2, ..., N). Assume first that {1, 4, 6} is the only minimal unique. Then the FA algorithm would find column 6 first, then column 4, and finally column 1 to add to C in three sweeps of L (where each time L shrinks by discarding the elements after the selected one, the shaded columns in Figure 1). Now suppose there are two solutions, C1 = {1, 4, 6} and C2 = {2, 5, 7}. As the largest numbers in C1 and C2 are 6 and 7 respectively, and 6 is less than 7, the FA algorithm would return C1 as the solution since the search reaches C1 first (lines 4 and 5). Suppose instead the two solutions are C1 = {1, 4, 6} and C2 = {2, 5, 6}. In this case the largest number in each solution set ties (both are 6). Then, after 6 is added to C, since 4 from C1 is less than 5 from C2, this becomes the tiebreaker with the addition of the 2nd column index, which would return C1 as the solution. In general, FA returns a solution whose largest column index in the permuted table is the smallest among all solutions. If there is a tie, the tiebreaker is determined by the next additions in the same manner.
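The sweep logic above can be sketched in Python. The block below is a minimal, unoptimized sketch of Algorithm 1 (a linear scan stands in for the binary search used later in the O(log N) analysis), together with the Backward Elimination baseline from Section 2 for comparison; the helper names (`r`, `fa_minimal_unique`, `backward_elimination`) and the toy table are ours, not from the paper.

```python
def r(table, cols):
    """Distinct row count over the column indices in cols (Definition 1)."""
    if not cols:
        return 1 if table else 0
    return len({tuple(row[c] for c in cols) for row in table})

def fa_minimal_unique(table, L, D):
    """One run of FA (Algorithm 1): sweep permutation L at most D times."""
    R = len({tuple(row) for row in table})   # no duplicate rows assumed
    C, d = [], 0
    while r(table, C) < R and d < D:
        T, hit = list(C), False
        for i, col in enumerate(L):
            T.append(col)
            if r(table, T) == R:             # first time the count reaches R
                C.append(col)                # this column is required
                L = L[:i]                    # discard everything after it
                d += 1
                hit = True
                break
        if not hit:
            break
    return C if r(table, C) == R else None   # None plays the role of ∅

def backward_elimination(table):
    """BE baseline: drop columns one by one while the count stays at R."""
    R = len({tuple(row) for row in table})
    cols, changed = list(range(len(table[0]))), True
    while changed:
        changed = False
        for c in list(cols):
            rest = [x for x in cols if x != c]
            if rest and r(table, rest) == R:
                cols, changed = rest, True
    return cols

# toy table: columns {0, 1} and {1, 2} are both minimal uniques
rows = [(1, "a", "x"), (1, "b", "x"), (2, "a", "y"), (2, "b", "y")]
```

With L = (0, 1, 2), FA selects column 1 in the first sweep and column 0 in the second; a different permutation can return the other minimal unique {1, 2}.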
During each sweep, column indices in L are incrementally added to a temporary set T until the distinct row count reaches R (lines 5-6 of Algorithm 1). Since the distinct row count is strictly increasing with column additions until it hits R, if the number of columns N is large we can use binary search to locate this column with O(log N) distinct count evaluations. Therefore, with D sweeps, the total number of distinct count evaluations is O(D log N). In comparison, the required number of distinct count evaluations for the BE algorithm (Section 2) is O(N) for the case of D ≪ N, which implies that FA is less expensive than BE for a table with a large number of columns N.

Theorem 1. One run of the Forward Addition (FA) algorithm requires O(D log N) distinct count evaluations, while one run of the Backward Elimination (BE) algorithm requires O(N) distinct count evaluations.

In Algorithm 1, line 1, L is a permutation of the original column indices {1, 2, ..., N}. We now analyze the probabilistic properties of the FA algorithm when the index sequence L is a random permutation of {1, 2, ..., N}, especially when there are multiple competing minimal uniques. Suppose there exists at least one minimal unique with size no bigger than D. We also discuss the success probability that one run of the randomized FA algorithm finds a desired solution.

Suppose a column combination C is a minimal unique with d columns. Given a random permutation L of {1, 2, ..., N}, let m_1, m_2, ..., m_d be the index of each C_i in L. For example, suppose C = (1, 4, 6) is a minimal unique and, with a random permutation L of (1, 2, ..., N), the corresponding indices in L become (5, 3, 9) respectively. This means the 5th column in L is the 1st column in the original table, the 3rd column in L is the 4th column in the original table, and the 9th column in L is the 6th column in the original table. Then m_1 = 5, m_2 = 3, m_3 = 9. For d ≪ N with a large N, it can be shown that m_i/N, i = 1, ..., d, can be approximated by independent and identically distributed (i.i.d.) uniform random variables on [0, 1]. Let M = max(m_1, ..., m_d)/N, which is the largest index of the solution set after permutation, divided by N. Then M ≈ max_{i=1,...,d} U_i, where the U_i are i.i.d. uniform on [0, 1]. It is easy to show that M has the following approximate distribution: P(M ≤ x) ≈ x^d. We summarize the result in the following lemma.

Lemma 1. Suppose we are given a table T and a column combination C of d indices, C = (C_1, C_2, ..., C_d). After a random column permutation of table T, let m_i, i = 1, ..., d, denote the new index of C_i in the permuted table. Then, as the total number of columns N → ∞, the m_i/N, i = 1, ..., d, approximate independent Uniform(0, 1) random variables. Furthermore, let M = max_{i=1,...,d} m_i/N. Then P(M ≤ x) ≈ x^d for any x ∈ (0, 1) as N → ∞.

Now suppose there are K minimal uniques, C_1, C_2, ..., C_K, with lengths d_1, d_2, ..., d_K. Let

M_k ≜ the largest index of C_k in the randomly column-permuted table, k = 1, ..., K.

As shown before, if there are no ties in {M_k, k = 1, ..., K}, FA will return the solution C_j whose M_j is the smallest. For the special instance of two minimal uniques with lengths d_1, d_2 and no overlapping elements, a simple calculation by integration shows that

P(M_1 < M_2) ≈ d_2 / (d_1 + d_2).    (1)

Therefore, if d_1 < d_2, then P(M_1 < M_2) > 1/2, and the probability goes to 1 as d_2 becomes increasingly larger than d_1. This implies that FA probabilistically favors the shorter solution. It is easy to extend the conclusion to the case where C_1 and C_2 share common elements, by separating out the common elements from the non-overlapping elements and using similar integration calculations. We have the following lemma.

Lemma 2. Let C_1, C_2 be two minimal uniques with lengths d_1 and d_2 respectively. If d_1 < d_2, then P(M_1 < M_2) > 1/2, where M_1, M_2 are the maximal index values defined in Lemma 1.
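Equation (1) can be checked numerically. The sketch below (our own code, not the paper's) estimates P(M_1 < M_2) for two disjoint solution sets by simulating random column permutations, and also computes the smallest F with (1 − 1/K)^F < ϵ for the stopping rule discussed next.

```python
import math
import random

def prob_shorter_wins(N, d1, d2, trials, seed=0):
    """Estimate P(M1 < M2) for disjoint solution sets of sizes d1, d2."""
    rng = random.Random(seed)
    C1 = list(range(d1))                  # hypothetical shorter solution
    C2 = list(range(d1, d1 + d2))         # hypothetical longer solution, disjoint
    wins = 0
    for _ in range(trials):
        pos = list(range(N))
        rng.shuffle(pos)                  # pos[c] = index of column c after permutation
        if max(pos[c] for c in C1) < max(pos[c] for c in C2):
            wins += 1
    return wins / trials

def max_failures(K, eps):
    """Smallest F with (1 - 1/K)**F < eps, the consecutive-failure threshold."""
    return math.floor(math.log(eps) / math.log(1 - 1 / K)) + 1
```

With d1 = 2 and d2 = 5, Equation (1) predicts d2/(d1 + d2) = 5/7 ≈ 0.714, and the Monte Carlo estimate lands close to that value for any N, since for disjoint sets only the relative order of the d1 + d2 positions matters.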
This implies that if C_1 and C_2 are the only two minimal uniques, the FA algorithm with random column permutation favors C_1 as the returned solution, as the probability of returning C_1 is higher than that of C_2.

The above lemma can be generalized to multiple solutions, and to the case of arbitrary solution lengths (not necessarily d ≪ N). However, the extension to the most general case, where the solution sets overlap, may need careful treatment.

Theorem 2. FA with random column permutation (i.e., L is a random permutation of {1, 2, ..., N}) tends to favor the shorter solutions of minimal unique. In particular, if C_1, ..., C_K are the K solution sets with lengths d_1 < d_2 ≤ d_3 ≤ ... ≤ d_K ≤ D that are non-overlapping, then the probability that C_1 is the final returned solution is the highest.

Suppose there exist solutions with size no bigger than D among all K minimal uniques. Let Succ be the event that we successfully find a minimal unique satisfying the size constraint D after one run of FA with random permutation. Then, by Theorem 2, P(Succ) is at least 1/K. Therefore, for a finite K and an arbitrarily small ϵ > 0, we can find a maximum-failures parameter F such that (1 − 1/K)^F < ϵ. If there are F consecutive failures, we can probabilistically declare that there is no minimal unique with size less than or equal to D. It is worthwhile to mention that the performance of the BE algorithm is identical to FA in terms of success probabilities when the columns are randomly permuted.

We now generalize Algorithm 1 to the case of functional dependence. The generalization also covers approximate functional dependence, defined below. Again let C, Y be two column combinations in table T. We consider two definitions of error measurement in order to define an ϵ-approximate functional dependency C →ϵ Y.

Definition 5. Let e(C → Y) be the minimal fraction of rows to be removed for C → Y to hold. Then C →ϵ Y if e(C → Y) ≤ ϵ.
Definition 6. Define ẽ(C → Y) = 1 − r(C)/r(C ∪ Y); then C →ϵ Y if ẽ(C → Y) ≤ ϵ.

It is easy to show the following lemma, which states a monotone property of the error measures under column additions.

Lemma 3. e(C → Y) = 0 iff ẽ(C → Y) = 0. For a given Y, e(C → Y) is non-increasing with column additions to C. ẽ(C → Y) is non-increasing with column additions to C when the added columns are chosen from Descendant(Y).

We use two error definitions here due to computational considerations. The error in Definition 6 is much easier to compute than that in Definition 5, although by Lemma 3 its monotonicity only holds when the added columns come from Descendant(Y). However, this special case is indeed what we use to build a schema tree in Section 5. In the following, Algorithm 2 extends the FA algorithm to general functional dependence in a straightforward fashion, by replacing the check for minimal uniqueness with the check for functional dependency (lines 2, 6 and 9). The key to this extension is the monotone property of the errors in Lemma 3.

Algorithm 2
FA algorithm for finding C = FA(Y, ϵ, D) such that C →ϵ Y with |C| ≤ D, for a column combination Y
Input: L, a sequence of column indices after column permutation
1: Initialize: C ← C_0 (the initial solution, ∅ by default), d ← 0, L ← L \ C
2: while e(C → Y) > ϵ and d < D
3:   T ← C
4:   for i = 1 to |L|
5:     T ← (T, l_i), where l_i is the i-th element of L
6:     if e(T → Y) ≤ ϵ then
7:       C ← (C, l_i), L ← (l_1, l_2, ..., l_{i−1}), d ← d + 1
8:       break (go to line 2)
9: if e(C → Y) ≤ ϵ return C, else return ∅

4 FINDING THE BEST SOLUTIONS
For a set of columns Y from a data table T, in this section we propose algorithms to find the best solutions C such that C →ϵ Y with |C| ≤ D, for a given error ϵ ≥ 0 and size D > 0. These algorithms essentially run the FA algorithm (Algorithm 2) multiple times, where in each iteration we attempt to improve upon the previous solution until failure occurs. We consider the following two scenarios.

For a given Y and error upper bound ϵ ≥ 0, Algorithm 3 finds a solution C with the fewest number of columns, allowing at most F consecutive failures, such that C →ϵ Y. The algorithm takes as input a maximum number of consecutive failures F, after which we declare that no better solution can be found. A call to Algorithm 2 is performed in each iteration (line 4) to attempt to find a solution with the given error bound ϵ and the currently achievable minimal size D̃ (initialized to D). After the first solution is found, the algorithm tries to find a better solution with a smaller size (fewer columns), updating the solution on success (lines 6 and 7). The algorithm terminates when the maximum number of attempts has been reached or the solution size reaches 1 (line 2).

A more difficult scenario is to find the shortest solution C such that C →ϵ Y and |C| ≤ D while minimizing e(C → Y) for a given ϵ (i.e., minimizing e first, then the solution size). A naive approach is to run Algorithm 3 on a grid of e values, e < ϵ, and return the solution with the minimal possible e; if multiple solutions exist with the minimal value of e, return the shortest one.

Algorithm 3
FindShortest(Y, ϵ, D): find the shortest solution C with |C| ≤ D such that C →ϵ Y, for a column combination Y
Input: maximum number of consecutive failures F (e.g., 10)
1: Initialize: D̃ ← D, k ← 0, C ← ∅
2: while k < F and D̃ > 1
3:   L ← random permutation of (1, 2, ..., N)
4:   S ← FA(Y, ϵ, D̃) using Alg. 2 with input L
5:   if S = ∅ then k ← k + 1
6:   if (C = ∅ and S ≠ ∅) or (C ≠ ∅ and 0 < |S| < D̃) then
7:     C ← S, D̃ ← |S|, k ← 0
8:     (go to line 2)
9: return C

Algorithm 4 provides an alternative for finding the minimal e using hill climbing, avoiding the grid search, again with at most F consecutive failures. In the first iteration, it finds a candidate solution S by running Algorithm 2 (lines 3-4). Next, it attempts to reduce the error rate e using sequential column additions to S up to the maximum size D, i.e., ColAdd(S, D) (lines 5, 6). This works because, by Lemma 3, the error rate can be reduced with column additions. We assign this updated column set as our initial solution C, along with its error rate e. In the following iterations, we perform a similar updating process (lines 3, 4, 7-10) in an attempt to achieve an even smaller error rate, until we fail F consecutive times. At this point, we have found the smallest reachable e. Finally, we find the shortest solution using Algorithm 3 with the given minimum value of e (lines 11, 12).

Algorithm 4
FindShortest(Y, ϵ, D): find the shortest solution C among all solutions S such that S →ϵ Y for a column combination Y, with minimum e(S → Y) and |S| ≤ D
Input: maximum number of consecutive failures F (e.g., 10)
1: Initialize: e ← ϵ, k ← 0, C ← ∅, new ← 0
2: while k < F and e > 0
3:   L ← random permutation of (1, 2, ..., N)
4:   S ← FA(Y, e, D) using Alg. 2 with input L
5:   if C = ∅ and S ≠ ∅ then
6:     C ← ColAdd(S, D), e ← e(C → Y), new ← 1 (go to line 2)
7:   if C ≠ ∅ and S ≠ ∅ then
8:     S ← ColAdd(S, D), e′ ← e(S → Y)
9:     if e′ < e then C ← S, e ← e′, new ← 1
10:  if new = 1 then (k ← 0, new ← 0) else k ← k + 1
11: if |C| > 1, run Algorithm 3 to update C by Shortest(Y, e, D) with C as the initial solution
12: return C

In this section, we proposed algorithms to find the best solution of a column combination C that functionally determines Y under given size and error constraints, using multiple iterations of the FA algorithm with random column permutations. Our algorithms avoid the exhaustive search of all solutions and exploit the fact that the FA algorithm favors a short solution (Theorem 2). We want to point out that if the goal is to find all solutions of C that functionally determine Y, then running many iterations of the FA algorithm with independent column permutations is not efficient. This is because the number of all permutations is N! for a table with N columns, a number much larger than the number of all subsets, 2^N (the maximal number of solutions of C). That said, we believe more efficient algorithms can be developed for finding multiple solutions of C by adaptively choosing the column permutation L based on previously found solutions. For example, if C = {1, 4, 6} is the first solution found by one run of FA, then in the next iteration we can place {1, 4, 6} last in L and randomly reorder the remaining columns. This way, we are guaranteed to find a different solution if one exists (since the indices of the 2nd solution will appear earlier).
We leave this to future work.

5 BUILDING THE SCHEMA TREE
In this section, we present an algorithm that builds a schema tree for a data table T using recursive functional dependency discovery. The schema tree represents the structural dependency between table columns in a simple hierarchical form. As a result, it is possible to decompose a big data table into layers of smaller tables. In the next section, we show how this information is utilized for automated feature engineering.

Algorithm 5
Build the schema skeleton for table T
Input: size parameter D, error parameter ϵ
Output: schema tree skeleton V
1: define RecursiveSplit(Y):
2:   if Y = ∅ then return V
3:   E ← Descendant(Y)
4:   S ← Shortest(Y, ϵ, D), S ⊆ E, using Algorithm 3 or 4
5:   if S = ∅ then return V
6:   else grow V by adding child nodes S under node Y
7:   for each node W ∈ S, RecursiveSplit(W)
8: add an index (rowid) column Y; set Y as the root node of V
9: RecursiveSplit(Y)

Our schema tree is built via a two-step process. In the first step, we build the tree skeleton using Algorithm 5. Given a table T, we first add a hypothetical rowid column as the row index and start building the tree using this as the root node (lines 8 and 9). Let Y be the current node of interest. We attempt to split Y using columns from the set E = Descendant(Y) (Definition 3 and line 3). A split is found if there exists a solution S ⊆ E such that S →ϵ Y and |S| ≤ D, using Algorithm 3 or 4 (depending on the application context), for the given size constraint D and error constraint ϵ (line 4). If a solution S is found, then we split Y with the nodes in S as its children. This splitting process is applied recursively to all nodes in S (lines 6, 7) until we are no longer able to split (line 5). Note that when ϵ = 0, such a split indicates a bi-directional functional dependency (or equivalence) between Y and S, since S is derived from Descendant(Y). It is important to note that for well-designed data tables, the skeleton nodes are usually not decimal-valued numeric columns, so we do not include such columns when building the tree skeleton.

Once the schema skeleton is built, the second step is to add the remaining columns to complete the tree construction. We attach each remaining column to the deepest skeleton node on which it functionally depends. As explained in Section 2, this can be done by a simple distinct row count check. In the following, we elaborate our process using simulated and real data.
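As a concrete illustration, the two-step construction can be sketched as follows. This is our own simplified rendition: it uses exact dependencies (ϵ = 0) and a brute-force search over small column subsets in place of the randomized FA runs of Algorithms 3 and 4, so it is only practical for narrow tables; all names and the toy table are ours.

```python
from itertools import combinations

def r(table, cols):
    """Distinct row count over the column indices in cols."""
    if not cols:
        return 1 if table else 0
    return len({tuple(row[c] for c in cols) for row in table})

def determines(table, C, Y):
    """Exact functional dependency check: C -> Y iff r(C) == r(C ∪ Y)."""
    return r(table, C) == r(table, sorted(set(C) | set(Y)))

def split(table, Y, D):
    """Shortest S (|S| <= D) from Descendant(Y) with S -> Y; brute force."""
    n = len(table[0])
    if Y is None:                        # root: Y is the hypothetical rowid
        R = len({tuple(row) for row in table})
        desc = list(range(n))            # every column depends on the row index
        target = lambda S: r(table, S) == R
    else:
        desc = [c for c in range(n) if c not in Y and determines(table, Y, [c])]
        target = lambda S: determines(table, S, Y)
    for size in range(1, D + 1):
        for S in combinations(desc, size):
            if target(list(S)):
                return list(S)
    return None

def skeleton(table, D, Y=None, tree=None):
    """Recursively split nodes, mirroring RecursiveSplit in Algorithm 5."""
    if tree is None:
        tree = {}
    S = split(table, Y, D)
    if S is not None:
        tree["rowid" if Y is None else tuple(Y)] = S
        for w in S:
            skeleton(table, D, [w], tree)
    return tree

# toy order-like table: (orderID, productID, customerID, ordertype)
orders = [
    (1, 1, 4, "web"), (1, 6, 4, "web"), (2, 3, 2, "web"),
    (3, 2, 2, "web"), (3, 4, 2, "web"), (3, 6, 2, "web"),
    (4, 5, 3, "phone"),
]
```

On this toy table the only skeleton split is the root one, rowid ↔ (orderID, productID); the remaining columns would then be attached as leaves in the second step.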
Table 1 displays sample rows of a simulated table that stores information related to customer purchase orders on an e-commerce website: an order is placed by a customer at a specific time, and each order can contain multiple products spanning multiple rows. The table also contains the associated customer and product information. Here, proper column names are shown along with the contextual information, so that a domain expert can easily understand the structural relations between the columns. Understanding the column relations would be much more difficult if this contextual information were removed.
Figure 2: Schema tree of Order Table 1.
Figure 2 shows the discovered schema tree with ϵ = 0, where the red node is the root node representing the row index, blue nodes indicate the tree skeleton discovered by Algorithm 5, and the gray nodes are the remaining leaf nodes attached in the second step of tree construction. Let ↔ indicate a joint bidirectional functional dependency and, as before, let → indicate functional dependency. Figure 2 can be interpreted as follows:

rowid ↔ (orderID, productID)
orderID ↔ (customerID, time)
orderID → orderType; customerID → phoneno; productID → price

where the first two relations concern the blue nodes, and the last line gives sample relations concerning the gray nodes. The tree skeleton (red and blue nodes) is built using Algorithm 5 after removing weight and shippingCost, as these are decimal-valued columns and thus not initially included (decimal-valued numeric columns can be auto-detected by the methods in [12]). In the second step, we attach each of the remaining columns (gray nodes) to the deepest blue (or red) node on which it functionally depends. With this schema tree, we can extract several smaller tables: a 'Product' table from productID and its children, a 'Customer' table from customerID and its children, and a smaller 'Order' table from orderID and its children. Our big table is, in fact, the merge of these three tables.
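Extracting those smaller tables amounts to projection plus de-duplication. A minimal sketch (our own code; the mini table and column indices are hypothetical stand-ins for Table 1):

```python
def project_distinct(table, cols):
    """Distinct rows of table restricted to the column indices in cols,
    preserving first-seen order."""
    seen, out = set(), []
    for row in table:
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

# hypothetical mini order table: (orderID, productID, ptype, price)
orders = [(1, 1, "book", 15), (1, 6, "music", 11), (3, 6, "music", 11)]

# 'Product' table: the productID node plus its children (ptype, price)
products = project_distinct(orders, [1, 2, 3])
```

Here `products` becomes `[(1, 'book', 15), (6, 'music', 11)]`; the 'Customer' and 'Order' tables come out the same way from their own skeleton nodes.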
Table 1: Sample Rows from a Simulated Table of Customer Purchase Orders

orderID  productID  customerID  time   ordertype  fullname     phoneno       age  ptype     price  weight  shippingcost
1        1          4           day 1  web        Alex Smith   732-906-9882  29   book      15     0.5     4.5
1        6          4           day 1  web        Alex Smith   732-906-9882  29   music     11     NA      NA
2        3          2           day 1  web        Emma Miller  908-654-3213  36   beauty    39     0.2     4.2
3        2          2           day 2  web        Emma Miller  908-654-3213  36   clothing  32     0.6     4.6
3        4          2           day 2  web        Emma Miller  908-654-3213  36   games     18     NA      NA
3        6          2           day 2  web        Emma Miller  908-654-3213  36   music     11     NA      NA
4        5          3           day 2  phone      Kim Dole     973-211-1245  45   grocery   25     1       2.3
4        5          3           day 2  phone      Kim Dole     973-211-1245  45   grocery   12     1       2.3
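The dependencies read off Figure 2 can be verified mechanically against these sample rows via the distinct-count test r(C) = r(C ∪ Y). The sketch below (our own code) encodes the first seven sample rows above and checks a few of the dependencies; NA is encoded as None.

```python
# First seven sample rows of Table 1, encoded as tuples.
ROWS = [
    (1, 1, 4, "day 1", "web",   "Alex Smith",  "732-906-9882", 29, "book",     15, 0.5,  4.5),
    (1, 6, 4, "day 1", "web",   "Alex Smith",  "732-906-9882", 29, "music",    11, None, None),
    (2, 3, 2, "day 1", "web",   "Emma Miller", "908-654-3213", 36, "beauty",   39, 0.2,  4.2),
    (3, 2, 2, "day 2", "web",   "Emma Miller", "908-654-3213", 36, "clothing", 32, 0.6,  4.6),
    (3, 4, 2, "day 2", "web",   "Emma Miller", "908-654-3213", 36, "games",    18, None, None),
    (3, 6, 2, "day 2", "web",   "Emma Miller", "908-654-3213", 36, "music",    11, None, None),
    (4, 5, 3, "day 2", "phone", "Kim Dole",    "973-211-1245", 45, "grocery",  25, 1.0,  2.3),
]
COL = {name: i for i, name in enumerate(
    "orderID productID customerID time ordertype fullname "
    "phoneno age ptype price weight shippingcost".split())}

def r(cols):
    """Distinct row count over the named columns."""
    idx = [COL[c] for c in cols]
    return len({tuple(row[i] for i in idx) for row in ROWS})

def determines(C, Y):
    """C -> Y iff adding Y's columns does not increase the distinct count."""
    return r(C) == r(list(C) + [y for y in Y if y not in C])

# e.g. determines(["orderID"], ["ordertype"]) and the bidirectional
# dependency between orderID and (customerID, time) both hold here
```

The same check, run column by column against the skeleton nodes, is exactly how the gray leaves are attached in the second construction step.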
The second dataset contains broadband home router data records of customers from a network carrier during a 30-day period. Its columns include the device ID and type, the associated network nodes and their types, customer information, and time series of several KPIs.

Figure 3 shows the discovered schema tree with ϵ = 0. The first split from the red root node representing the row index is rowid ↔ (start_gp, customerattr2, day.1), and the next split is customerattr2 ↔ (day, sid). Furthermore, day ↔ day_id, sid ↔ (deviceid, bras_id), and deviceid ↔ dn. The leftmost 6 gray nodes are the associated KPI time series (these are time-related, as seen from the day.1 node in the first split from the root), and the rightmost gray nodes are device-related attributes. Again, with the schema tree we can decompose the table into layers: a 'Device' table from node deviceid, a 'customerattr2' table from node customerattr2, and a KPI time series table for each customerattr2/start_gp combination. Understanding the column relations is much more straightforward with our diagram.

Finally, we note that the schema tree derived from our algorithm may not be unique, especially for tables with complex column relationships. For example, if there are multiple short functional dependencies of the same length, the algorithm randomly picks one of them to follow through. Figure 4 shows two different realizations of our algorithm for an internal data table named 'Project' describing the progress of ongoing projects. From the root node rowid, both schema trees find a single column id as the first-level descendant. However, the trees differ in the next blue node split: the tree on the left is split by (creator, createdTime), while the tree on the right is split by (modifiedTime, actual_start_date). Although both solutions are valid, one may prefer the first schema tree.
This is because (creator, createdTime) is the combination of a label and a DateTime column, whereas (modifiedTime, actual_start_date) are two DateTime columns, and the former is a more natural table key than the latter. In this case, we are using the contextual information in the columns to select the more appropriate solution. How to automatically infer the contextual information of columns and utilize it in the selection of the schema tree is part of our future work.

Our schema tree, built automatically from a data table, not only provides valuable information for data scientists to better understand the table, it can also directly benefit downstream machine learning tasks. In this section, we show how information from the schema tree can be used for automated feature engineering. Traditionally, feature engineering often requires handcrafting based on contextual understanding.

We shall illustrate this using the simulated order Table 1. Suppose 'CustomerID' is an anchor node 'customer' and we are interested in engineering features for each customer. First, we notice that there are two types of relationship from CustomerID to any other node: one-to-one and one-to-many. In a one-to-one relation, each CustomerID determines a single instance of the other variable, which implies that these variables are either functionally equivalent to or descendants of CustomerID in the schema tree. The remaining nodes in the schema tree have a one-to-many relation from CustomerID, which implies that each CustomerID corresponds to multiple instances of the variable.

In this example, CustomerID has three gray nodes as descendants: age, fullname, phoneno. This implies that for each of the three nodes, there is a one-to-one relation from CustomerID, so we can take the instance value directly as a feature for each CustomerID. On the other hand, for each of the remaining columns, there is a one-to-many relation, as each distinct CustomerID will have one or several instances of its values. (Take the node Price as an example. In Table 1, Customer 4 has 2 price instances and Customer 2 has 4 price instances.) Therefore we need to aggregate these values to create a uniform number of features for each CustomerID. In this case, for any node N, we first find the shortest path from CustomerID to N in the schema tree, which gives us a bottom-to-top aggregation path towards CustomerID. For example, the path from Price to CustomerID, omitting rowid as it is a hypothetical node, is: Price → ProductID → OrderID → CustomerID. We further omit the one-to-one relations in the path, as no aggregation is necessary for them, and so the aggregation path is simplified to Price ⇒ OrderID ⇒ CustomerID, where ⇒ indicates a many-to-one relation. Finally, we extract all the subpaths and use a set of pre-specified aggregation functions to generate features. Specifically, Price is aggregated by
Price ⇒(f1) OrderID ⇒(f2) CustomerID, and Price ⇒(f3) CustomerID, where f1, f2, f3 are pre-specified aggregation functions chosen depending on the characteristics of the variable being aggregated. For numeric variables, these functions can be the max, mean, standard deviation, interval probabilities, or quantiles; for categorical variables, they can be distinct counts. This way we can come up with a set of features such as: the maximum (f2) of the average (f1) price per orderID, or the standard deviation (f3) of the price. This approach is related to recent research on automated feature engineering [13, 14]. However, these methods focus on relational databases with a known schema, under the assumption that each table has a shallow structure. In contrast, the method proposed here does not assume prior schema knowledge and can work with data tables that have a layered structure. In addition, our methods can be extended to relational databases by joining schema graphs together (not illustrated here).

Figure 3: Schema Tree of Home Router Device Data

Figure 4: Two Realizations of Schema Tree for Project Data Table
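As a concrete illustration of the aggregation paths just described, the following pandas sketch computes per-customer features from the order rows of Table 1. The particular choices f1 = mean, f2 = max, and f3 = standard deviation are ours, made for illustration only:

```python
import pandas as pd

# Order rows patterned after Table 1
df = pd.DataFrame({
    "customerID": [4, 4, 2, 2, 2, 2, 3, 3],
    "orderID":    [1, 1, 2, 3, 3, 3, 4, 4],
    "price":      [15, 11, 39, 32, 18, 11, 25, 12],
})

# Subpath Price ⇒(f1) OrderID ⇒(f2) CustomerID:
# f1 = mean price within each order, f2 = max of those means per customer
mean_price_per_order = df.groupby(["customerID", "orderID"])["price"].mean()
max_mean_order_price = mean_price_per_order.groupby("customerID").max()

# Subpath Price ⇒(f3) CustomerID: f3 = standard deviation of all prices
price_std = df.groupby("customerID")["price"].std()

print(max_mean_order_price.to_dict())  # → {2: 39.0, 3: 18.5, 4: 13.0}
```

Each subpath of the aggregation path contributes one feature per customer, so every CustomerID ends up with the same, fixed-length feature vector regardless of how many orders it has.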
In this paper, we developed an automatically generated, data-driven schema tree to represent the structural relations between the columns of a data table, which can greatly facilitate a data scientist exploring a new dataset. The key to our approach is the recursive extraction of important functional dependencies between the columns of the data table, for which we proposed a forward addition algorithm that requires much less computation than existing approaches.
REFERENCES
[1] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter, "Efficient and robust automated machine learning," Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 2755–2763, 2015.
[2] K. Pearson, "Note on regression and inheritance in the case of two parents," Proceedings of the Royal Society of London, vol. 58, pp. 240–242, 1895.
[3] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 1948. [Online]. Available: http://dblp.uni-trier.de/db/journals/bstj/bstj27.html
[4] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, ser. Adaptive computation and machine learning. MIT Press, 2009. [Online]. Available: https://books.google.co.in/books?id=7dzpHCHzNQ4C
[5] Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen, "TANE: An efficient algorithm for discovering functional and approximate dependencies," The Computer Journal, vol. 42, no. 2, pp. 100–111, 1999.
[6] A. Heise, J.-A. Quiané-Ruiz, Z. Abedjan, A. Jentzsch, and F. Naumann, "Scalable discovery of unique column combinations," Proc. VLDB Endow., vol. 7, no. 4, pp. 301–312, Dec. 2013. [Online]. Available: http://dx.doi.org/10.14778/2732240.2732248
[7] S. Lopes, J.-M. Petit, and L. Lakhal, "Efficient discovery of functional dependencies and Armstrong relations," in International Conference on Extending Database Technology. Springer, 2000, pp. 350–364.
[8] C. Wyss, C. Giannella, and E. Robertson, "FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances (extended abstract)," in International Conference on Data Warehousing and Knowledge Discovery. Springer, 2001, pp. 101–110.
[9] Y. Sismanis, P. Brown, P. J. Haas, and B. Reinwald, "GORDIAN: Efficient and scalable discovery of composite keys," in VLDB, 2006, pp. 691–702.
[10] Z. Abedjan and F. Naumann, "Advancing the discovery of unique column combinations," in International Conference on Information and Knowledge Management (CIKM), 2011, pp. 1565–1570.
[11] T. Papenbrock and F. Naumann, "A hybrid approach for efficient unique column combination discovery," in Datenbanksysteme für Business, Technologie und Web (BTW 2017), B. Mitschang, D. Nicklas, F. Leymann, H. Schöning, M. Herschel, J. Teubner, T. Härder, O. Kopp, and M. Wieland, Eds. Gesellschaft für Informatik, Bonn, 2017, pp. 195–204.
[12] H. Uzunalioglu, J. Cao, C. Phadke, G. Lehmann, A. Akyamac, R. He, J. Lee, and M. Able, "Augmented data science: Towards industrialization and democratization of data science," Nokia Bell Labs, Tech. Rep., 2019.
[13] J. M. Kanter and K. Veeramachaneni, "Deep feature synthesis: Towards automating data science endeavors," in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE, 2015, pp. 1–10.
[14] H. T. Lam, J.-M. Thiebaut, M. Sinn, B. Chen, T. Mai, and O. Alkan, "One button machine for automating feature engineering in relational databases," arXiv preprint arXiv:1706.00327, 2017.