A Systematic Method for On-The-Fly Denormalization of Relational Databases
AA Systematic Method for On-The-FlyDenormalization of Relational Databases
Sareen ShahZucker School of Medicine at Hofstra/Northwell
Abstract –
Normalized relational databases are a common method forstoring data, but pulling out usable denormalized data for consumptiongenerally requires either direct access to the source data or creation ofan appropriate view or table by a database administrator. End-usersare thus limited in their ability to explore and use data that is storedin this manner. Presented here is a method for performing automateddenormalization of relational databases at run-time, without requiringaccess to source data or ongoing intervention by a database administrator.Furthermore, this method does not require a restructure of the databaseitself and so it can be flexibly applied as a layer on top of already existingdatabases.
Introduction
Normalized relational databases took their form initially from a need to reduceduplication of records and to organize data in a manner whereby updating oradding records would not lead to data inconsistencies [1]. However, this struc-ture can lead to difficulty with the actual retrieval of usable information. Forwide databases, it can be an arduous process to discover the logical sequenceof tables that must be joined to bring information from different ends of thedatabase together, especially when many combinations of different data ele-ments are desired. A more important limitation is that an end-user of the datawithout direct access is reliant upon the database administrator creating thedenormalized tables a priori for any data exploration/analysis desired.Relational databases remain the most common method of storing data [2]. Datacollected for research purposes are also usually stored/provided using a rela-tional schema. Thus, there is a role for creating a technique for on-the-fly denor-malization, to allow for maximal flexibility for information retrieval by end-usersand to unburden the database administrator. The algorithms provided here willenable an end-user to be able to combine any number of columns of interest from1 a r X i v : . [ c s . D B ] O c t ny number of tables, and automatically construct the join sequences requiredto perform this denormalization ad hoc . Illustrative Example
We will use a modified version of the Northwind Traders database to demon-strate scenarios where an on-the-fly denormalization algorithm can be useful.The Northwind Traders database was originally created by the Microsoft Cor-poration, and the modified files can be found at https://github.com/sareeneng/NorthwindModified. The modified dataset consists of 11 tables that are orga-nized in a typical normalized fashion, although some of the relationships havebeen constructed in a way to highlight certain points in the discussion to follow,and so it is not a fully optimized data representation. For simplicity, only thejoining columns and a few other columns are shown in
Figure 1 . Figure 1: Schema of the modified Northwind dataset. *s represent many-to-one rela-tionships.
Suppose an end-user is interested in finding which employees and customershave interacted with each other. From the diagram, it is clear that the way toaccomplish this is to start with the orders table, and then join the employees and customers tables in. A database administrator might create a view to bring thedata from these two tables together if this is a frequently requested combinationof variables. This requires anticipation on the part of the administrator.If the end-user wants to explore the database and create many different com-binations of variables, then these must also be anticipated. For example, the2ser might want to know what territories the orders are originating from, orsee which vendors served which customers. The user may also want many morevariables pulled in together, such as quantity and names of the products soldby each employee. Notably, some of these combinations may not be possible;orders are linked to employees, but employees may serve multiple territories sothere is no way to definitively link the tables territories to orders .One (unfeasible) option is to anticipate all potential combinations that a usermay be interested in and create the associated denormalized tables (e.g. witha star schema). This becomes an exponentially laborious task the wider thedatabase and the more variables that a user might want to pull together, and thevast majority of these cases will be extraneous. The on-the-fly denormalizationalgorithm presented here avoids the need to anticipate end-user data requests. Graph Structure
A simple way to represent the relational database is by using a directed graphstructure that will then allow for the path-finding required. Tables are repre-sented as the nodes, and foreign key links are represented as the edges betweennodes. The box details the steps for constructing the graph, and
Figure 2 shows the resulting graph for the modified Northwind dataset.
Directed graph representation of relational database
For each foreign key FK link, use the following rules to construct thegraph G where the nodes are tables and the edges are the foreign keylinks. Let a relationship mean that both tables have only uniquevalues in their respective joining columns, let a M-1 relationship meanthat the first joining column has non-unique values and the second joiningcolumn has only unique values, and let a
M-M relationship mean thatboth joining columns have non-unique values. • If node A is already connected to B along a different F K , see text. • If A.F K
A/B has a
M-M relationship with
B.F K
B/A – Do not connect node A to node B • If A.F K
A/B has a
M-1 relationship with
B.F K
B/A – A F K −−→ B • If A.F K
A/B has a relationship with
B.F K
B/A – A FK ←→ B igure 2: Directed graph representation of the modified Northwind dataset. Many-to-one relationships are connected in one direction, and one-to-one relationships areconnected in both directions. The columns used to join tables together can be classified as either “many”(non-unique values) or “one” (only unique values). The method for determin-ing whether the columns have only unique values will depend on the databasemanagement system. The links between tables can then be manually specified,and the directionality of the arrows determined based on the columns classifi-cation in both tables. In the case where there are multiple foreign keys betweentwo tables, linking options will depend on the use-case. A database adminis-trator may decide that only one of the foreign keys is relevant, or all links maybe created and an option may be provided to the user at run-time as to whichspecific connection to use for the query.There are a few ways to minimize the amount of manual labor involved increating these links, including:1. If the database is stored in a typical database management system withforeign keys already set up, then these foreign keys can be used to createthe connections.2. Many databases are structured such that column names have a consistentmeaning across tables. In this case, one can iterate over all column namesand whenever a column is found in multiple tables, then a link can becreated with directionality based on the classification described above ifthe uniqueness of the columns is known and will not change.Once the graph has been created, then the connections can be stored and thegraph regenerated from this metadata at any time. At this point, two path-finding algorithms need to be implemented:1.
Find if a path exists at all between any two nodes. Find all simple paths between any two nodes - a simple path is one4hat connects the source to the target without traversing any intercedingnodes more than once.Many programming languages have packages/modules that can construct thesegraphs and implement the path-finding algorithms needed. For example, Pythonusers can install the networkx package [3].
Denormalization Algorithm
The basic algorithm builds on the idea that one needs to find an origin nodefrom which a simple path can be drawn to all individual desired nodes. Theorigin node may also be one of the desired nodes. For example, if the user isinterested in data from the suppliers and the categories tables, a path cannotbe drawn directly from one to the other. However, the origin node products has a path to each individual desired table, and so it can serve as the left-most table in the join sequence. There is the potential for a significant amountof redundancy; order details could have served as an origin node, but it doesnot add more useful information and adds more complexity if the user is onlyinterested in suppliers and categories , while products is a necessary table in thejoin sequence. The steps below describe how to eliminate the redundancy.
Find paths between two given tables
The first building block is to find all simple paths between an origin node andeach desired table. Some of the paths found may include extraneous tables(which may reduce useful information while adding to complexity), and so theredundancy should be removed (
Algorithm 1 ). First, sort the list of pathsfound from shortest to longest, then eliminate all paths that are a superset ofany previous path. For example, the simple paths from order details to suppliers are ORS → V E → SU , ORS → P R → SU , and ORS → P R → V E → SU .The first two paths are kept because neither is a superset of the other, and itis unknown at this time which is the optimal path (e.g. going through vendors might yield more rows or more relevant data than going through products , orvice-versa). However the last path can be removed because there is no possibilityof it being more optimal than either of the first two paths.Note that this is not the same as finding edge disjoint paths. If one path is A → B → C → E and another is A → B → D → E , these paths share an edgebut it is not clear if one is inferior to the other and so both should be kept.5 lgorithm 1: Finding reduced simple paths between two spe-cific nodes
1. Let L be a list of ordered pairs representing the edges of a singlesimple path T s → T x → T y → ... → T w → T z → T e as such: L = [( T s , T x ) , ( T x , T y ) , ..., ( T w , T z ) , ( T z , T e )]2. Let M ( T s , T e ) be a list of all L from T s to T e
3. Sort M by the number of unique nodes in M i in ascending order4. Remove all M i from M where the nodes in M i are a superset ofthe nodes in any prior M ...i − for all i = 2 ...n Finding join sequences between multiple tables
If an origin node exists from which a simple path can be drawn to all desirednodes, then at least one join sequence can be generated. The basic technique isto first take all the paths between all possible origin nodes and each of the desiredtables. The origin node might be one of the desired nodes. Then calculate thecartesian product to generate all possible valid combinations of simple paths,and in a similar fashion as above, sort the candidate join sequences by thenumber of unique tables to remove those that are simply supersets of priorsequences. These steps are outlined in
Algorithm 2 .The operation of finding all simple paths can be computationally expensive forvery large graphs. A significant optimization can be done by first checking to seeif a path exists at all from the candidate origin node to each of the desired nodesusing Dijkstra’s algorithm, and if not, then move on to the next candidate.The result will be a list of lists, with each individual list comprised of a setof ordered pairs representing a join sequence. For example: [( T .C , T .C ),( T .C , T .C ) , ... ] represents “ T JOIN T ON T .C = T .C JOIN T T .C T .C T , T ) , ( T , T )]A fully worked example is provided in the Appendix, where a join sequence is foundbetween customers , suppliers , and categories . Note that this is different from theexamples provided above in order to specifically highlight all the steps of the algorithm.The join sequence generated by this algorithm uses the order details table as theorigin/left-most table. Of note, the algorithm eliminates the potential join sequencethat makes use of the path from ORS → V E → SU because in order to capture categories , the products table must be included and since suppliers is directly relatedto products , there is no need to use vendors . If categories was not required, then twovalid join sequences would have been produced. lgorithm 2: Finding join sequences between multiple tables
1. Let the list of tables desired be represented as D = [ T , ..., T n ]2. Let H be an empty list of valid join sequences3. Let the list of all tables in the graph be represented as S = [ S , ..., S n ].For all S i...n : let V i, = M ( S i , D ) , V i, = M ( S i , D ) , ..., V i,n = M ( S i , D n ) where S i (cid:54) = D j . If V i,j is not empty for all j = 1 ...n , then:(a) Let P i = { V i, } × { V i, } × ... × { V i,n } (b) For each path combination P i,k :i. Flatten P i,k to a single list of ordered pairs.ii. Remove duplicated ordered pairs from P i,k .iii. Add P i,k to H
4. Sort H by the number of unique tables traversed by H i
5. Remove all H i from H where the tables traversed in H i are a supersetof the tables in any prior H ...i − for all i = 2 ...n Discussion
The algorithm presented here allows for end-user creation and consumption of denor-malized data without requiring a restructuring of the normalized database and withoutrequiring access to the source data. Once the initial set-up has been completed by adatabase administrator, data can be pulled from different tables with the appropriatejoins created automatically through the steps detailed above. Furthermore, this algo-rithm is relatively simple to implement in various programming languages. A plausibleuse-case scenario could be a dashboard where an end-user can select data elements ofinterest and choose filters, and then the resulting denormalized data can be used toconstruct visualizations.There are several limitations, some of which have been described above. • The links need to be set up manually, but this is a one-time task. Ideally thisshould be done by an administrator familiar with the database structure (espe-cially if the database is live) as this person can identify which are the importantlinks between tables, and classify the relationships appropriately based on de-fined constraints and foreign keys. If the data is provided as static files and thereis no database administrator (e.g. research data provided for distribution in .csv format), then the column data itself can be analyzed to determine if there areunique/non-unique values present, and then the appropriate relationships canbe set up. • For very large and highly connected graphs, the path-finding may be computa-tionally expensive. Some options to ameliorate this include storing the pathswhen found to avoid having to recalculate them, ensuring a path exists usingDjikstra’s algorithm, and setting a maximum search depth. Once the paths have een found for a pair of nodes, they can be stored for future use so that theydo not need to be recalculated every time. • There is no specific support for self-referencing tables, though the algorithmcould be modified to incorporate this. • In the case where there are multiple valid join sequences, a decision needs to bemade on which data to provide. Options are to 1) prioritize paths that traversemandatory over optional foreign keys, 2) provide the “union distinct” outputof all possible results, 3) provide each result separately, or 4) provide the tablethat has the most records in it.
Acknowledgments
I would like to thank Dr. Alysia Flynn for her thorough review and analysis of thealgorithm, as well as Dr. Nelson Sanchez-Pinto for his guidance and mentorship.
References [1] S. Malaika and M. Nicola, “The history of business records.” , Dec 2011.[2] solid IT, “Historical trend of the popularity ranking of database management sys-tems.” https://db-engines.com/en/ranking_trend , Sept 2020.[3] Networkx developers, “Networkx documentation.” https://networkx.github.io/ , 2020. ppendix Worked Example
Find a join sequence between customers , suppliers , and categories (fromFigure 1) D = [ CU, SU, CA ]2. H = [ ]3. S = [ ORD, ORS, T E, ET, EM, RE, CU, P A, V E, SU, CA ] V , = M ( S , D ) = M ( ORD, CU ) • L = [( ORD, CU )], this is the only simple path. • M ( ORD, CU ) → [ [( ORD, CU )] ] V , = M ( S , D ) = M ( ORD, SU ) → no simple paths found. Stop and moveonto the next candidate origin node. V , = M ( S , D ) = M ( ORS, CU ) • L = [ ( ORS, ORD ) , ( ORD, CU ) ] • V , = M ( ORS, CU ) → [ [( ORS, ORD ) , ( ORD, CU )] ] V , = M ( S , D ) = M ( ORS, SU ) • L = [( ORS, P R ) , ( P R, SU )] • L = [( ORS, V E ) , ( V E, SU )] • L = [( ORS, P R ) , ( P R, V E ) , ( V E, SU )] → nodes set is { ORS, P R, V E, SU } which is a superset of the nodes in L so eliminate it. Note that it is alsoa superset of L but the first condition is enough to eliminate. • V , = M ( ORS, SU ) → [ [( ORS, P R ) , ( P R, SU )][(
ORS, V E ) , ( V E, SU )] ] V , = M ( S , D ) = M ( ORS, CA ) • L = [( ORS, P R ) , ( P R, CA )] • V , = M ( ORS, CA ) → [ [( ORS, P R ) , ( P R, CA )] ] P = { V , } × { V , } × { V , } = [ [ [( ORS, ORD ) , ( ORD, CU )] , [( ORS, P R ) , ( P R, SU )] , [( ORS, P R ) , ( P R, CA )] ][ [(
ORS, ORD ) , ( ORD, CU )] , [( ORS, V E ) , ( V E, SU )] , [( ORS, P R ) , ( P R, CA )] ] ] flatten −−−−−→ [ [(
ORS, ORD ) , ( ORD, CU ) , ( ORS, P R ) , ( P R, SU ) , (cid:40)(cid:40)(cid:40)(cid:40)(cid:40) ( ORS, P R ) , ( P R, CA )][(
ORS, ORD ) , ( ORD, CU ) , ( ORS, V E ) , ( V E, SU ) , ( ORS, P R ) , ( P R, CA )] ]
Add P , and P , to HV ... have at least one M ( S i , D j ) = [ ], so no new valid join sequences arefound in this case, and H is fully populated from ORS as the origin node. . H sorted −−−−→ [ [( ORS, ORD ) , ( ORD, CU ) , ( ORS, P R ) , ( P R, SU ) , ( P R, CA )][(
ORS, ORD ) , ( ORD, CU ) , ( ORS, V E ) , ( V E, SU ) , ( ORS, P R ) , ( P R, CA )] ] H has 6 unique nodes { ORS, ORD, CU, P R, SU, CA } H has 7 unique nodes { ORS, ORD, CU, P R, SU, CA, V E }
5. The nodes in H are a superset of the nodes in H and so it can be eliminated.This leave only one join sequence:[( ORS, ORD ) , ( ORD, CU ) , ( ORS, P R ) , ( P R, SU ) , ( P R, CA )]which can be translated as follows:
ORS
JOIN
ORD ON ORS.orderID = ORD.orderID
JOIN CU ON ORD.customerID = CU.customerID
JOIN
P R ON ORS.productID = P R.productID
JOIN SU ON P R.supplierID = SU.supplierID