Hyper-Graph Based Database Partitioning for Transactional Workloads
Technical Report
August 2, 2013
ABSTRACT
A common approach to scaling transactional databases in practice is horizontal partitioning, which improves system scalability, availability and manageability. Usually it is very challenging to choose or design an optimal partitioning scheme for a given workload and database. In this technical report, we propose a fine-grained hyper-graph based database partitioning system for transactional workloads. The partitioning system takes a database, a workload, a node cluster and partitioning constraints as input and outputs a lookup table encoding the final database partitioning decision. The database partitioning problem is modeled as a multi-constraint hyper-graph partitioning problem. By deriving a min-cut of the hyper-graph, our system can minimize the total number of distributed transactions in the workload, balance the sizes and workload accesses of the partitions, and satisfy all the imposed partitioning constraints. Our system is highly interactive, as it allows users to impose partitioning constraints, watch visualized partitioning effects, and provide feedback based on human expertise and indirect domain knowledge for generating better partitioning schemes.
1. INTRODUCTION
The difficulty of scaling front-end applications is well known for DBMSs executing highly concurrent workloads. One approach to this problem, employed by many Web-based companies, is to partition the data and workload across a large number of commodity, shared-nothing servers using a cost-effective parallel DBMS, e.g. Greenplum Database. The scalability of online transaction processing (OLTP) applications on these DBMSs depends on the existence of an optimal database design, which defines how an application's data and workload are partitioned across nodes in the cluster, and how queries and transactions are routed to nodes. This in turn determines the number of transactions that access data stored on each node and how skewed the load is across the cluster. Optimizing these two factors is critical to scaling complex systems: a growing fraction of distributed transactions and load skew can degrade performance by over a factor of 10x. Hence, without a proper design, a DBMS will perform no better than a single-node system due to the overhead caused by blocking, inter-node communication, and load balancing issues.

Usually, it is very challenging to choose or design an optimal partitioning scheme for a given workload and database. Executing small distributed transactions incurs heavy overhead [9] and thus should be avoided whenever possible. However, especially when dealing with many-to-many relationships or very complex database schemas, it is not an easy task to put all the tuples that are accessed together onto the same node so as to reduce the overhead of distributed transactions. In the meantime, data skew or workload skew degrades the performance of the overloaded nodes and thereby lowers the overall system throughput.
Therefore, it is also critical to achieve both data and workload balance. Moreover, for a specific partitioning strategy to be feasible, it must not violate the constraints of the cluster configuration, such as node storage capacity, node processing ability, and network bandwidth between nodes.

Partitioning in databases has been widely studied, for both single-server and shared-nothing systems. However, most of the existing techniques for automatic database partitioning are tailored for large-scale analytical applications (i.e. data warehouses). These approaches typically produce possible partitions using round-robin (send each successive tuple to a different partition), range (divide up tuples according to a set of predicates), or hash partitioning (assign tuples to partitions by hashing them) [6], which are then evaluated using heuristics and cost models. Unfortunately, none of these approaches is ideal for transactional workloads, which are very different from analytical workloads: they feature numerous short-lived and highly concurrent transactions, a small set of pre-defined transaction types, and relatively few tuples touched by each transaction. For transactional workloads, if more than one tuple is accessed, round-robin and hash partitioning typically require access to multiple sites and thus incur distributed transactions, which, as we explained, have significant overhead. Range partitioning may be able to do a better job, but this requires carefully selecting ranges, which may be difficult to do by hand. The partitioning problem gets even harder when transactions touch multiple tables, which need to be divided along transaction boundaries. For example, it is difficult to partition the data for social networking web sites, where schemas are often characterized by many n-to-n relationships.

In this report, we introduce a fine-grained hyper-graph based database partitioning system for transactional workloads.
The input to our system includes a database, a workload, a node cluster and partitioning constraints imposed by users. We model the database partitioning problem as a multi-constraint hyper-graph partitioning problem. Our system first analyzes the database and workload and constructs a weighted hyper-graph. It then runs an iterative hyper-graph partitioning phase to get a feasible and near-optimal partitioning scheme. After each iteration of partitioning, our system evaluates the partitioning feasibility and performance, receives user feedback and then decides whether it should do hyper-graph refinement and re-partitioning. The final output is a lookup table which indicates how the database should be partitioned and distributed over the cluster so that the total number of distributed transactions in the workload is minimized, the sizes and workload accesses of the partitions are balanced, and all the imposed constraints are met.

Our database partitioning system can easily handle many-to-many table relationships and complex database schemas. It is also efficient, as the size of the derived hyper-graph is independent of the database size. It provides great opportunities for users to participate in the loop of decision making and import their human expertise and indirect domain knowledge for better partitioning performance.

The rest of the report is organized as follows: Section 2 describes the hyper-graph based database partitioning model. Section 3 presents the partitioner system architecture, as well as implementation details. Section 4 presents the experimental evaluation. Section 5 discusses related work. We conclude in Section 6.
2. HYPER-GRAPH BASED DATABASE PARTITIONING
Here we focus on horizontal partitioning of database tables. The effect of a partitioning scheme for a transactional workload is normally measured by the number of distributed transactions [9]. So the problem can be turned into finding a partitioning scheme that minimizes the number of distributed transactions. Data skew and workload skew will decrease the system throughput and thus are expected to be under a certain threshold. There are also constraints imposed on the partitioning in practice, such as node storage capacity, node processing ability and network bandwidth between physical nodes. For a partitioning strategy to be feasible, it must meet all these constraints. We thereby formalize the database partitioning problem as follows:
Given a database D, a workload W, the number of physical nodes k, and the constraints C, find the optimal partitioning solution to partition D over the k physical nodes so that the cost of executing W is minimized, while all the constraints C are satisfied and the imbalance degrees of the data sizes and workload accesses across the k nodes are under some balance threshold T.
Tuple Group. Before modeling the above database partitioning problem as a multi-constraint hyper-graph partitioning problem, we first give the definition of a tuple group. A tuple group is a collection of tuples within a relation which will always be accessed together throughout the execution of W. Each tuple group is essentially represented by a min-term predicate [13]. Given a relation R, where A is an attribute of R, a simple predicate p defined on R has the form

p : A θ const

where const is a constant value and θ ∈ {=, ≠, <, >, ≤, ≥}.

A min-term predicate is a conjunction of simple predicates. Given the set of simple predicates {p_1, p_2, ..., p_n} on relation R that are derived from W, a min-term predicate M is defined as

M = p_1* ∧ p_2* ∧ ... ∧ p_n*

where p_i* = p_i or p_i* = ¬p_i (1 ≤ i ≤ n), which means that each simple predicate can occur in a min-term predicate either in its natural form or its negated form.

The min-term predicate has the property that all the tuples satisfying this predicate will be accessed together. A min-term predicate has two attributes: min-term size and access count. The min-term size is the number of tuples it represents in the actual table. The access count is the number of times that transactions within the workload access (some of) the tuples covered by this min-term predicate. These two attributes of a tuple group M are denoted by size(M) and access(M) respectively.

Hyper-Graph Partitioning Problem Modeling. It is obvious that a good partitioning scheme should put all the tuples of a tuple group onto the same node in order to reduce the number of distributed transactions. So our basic idea for the partitioning is: we first analyze and split D into disjoint tuple groups, then try to place these tuple groups onto the k nodes.

A hyper-graph extends the normal graph definition so that an edge can connect any number of vertices.
A hyper-graph HG(V, E) is constructed as follows: each vertex v_i represents a tuple group M_i; each hyper-edge e_i = (v_1, v_2, ..., v_n) represents a transaction X_i in W accessing all the tuple groups connected by this hyper-edge. A vertex v_i has two kinds of weights, size(M_i) and access(M_i). The weight count(e_i) of a hyper-edge e_i is the number of transactions that access the same set of vertices (i.e. tuple groups).

Given a hyper-graph HG(V, E), a k-way partitioning of HG assigns the vertices V of HG to k disjoint nonempty partitions. The k-way partitioning problem seeks to minimize the net cut, i.e. the number of hyper-edges that span more than one partition, or, more generally, the sum of the weights of such hyper-edges. There are also constraints imposed on the graph partitioning, which correspond to the partition constraints C and the balance threshold T in the above database partitioning problem. Each cut edge incurs at least one distributed transaction, since the data that the transaction needs to access will be placed on at least two nodes. So the sum of the weights of the cut edges is equal to the total number of resulting distributed transactions.

As such, we turn the database partitioning problem into a multi-constraint hyper-graph partitioning problem which aims to get the minimum k-way net cut while keeping the graph partitions balanced and meeting the various constraints.
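The mapping from a workload to a weighted hyper-graph, and the net-cut objective, can be sketched in a few lines of Python. This is a toy illustration only; the function names and the miniature workload are our own, not the system's API:

```python
from collections import Counter

def build_hypergraph(txn_accesses):
    """Collapse transactions touching the same set of tuple groups into one
    hyper-edge whose weight count(e) is the number of such transactions."""
    return Counter(frozenset(groups) for groups in txn_accesses)

def net_cut(edges, assignment):
    """Sum of weights of hyper-edges spanning more than one partition,
    i.e. the number of distributed transactions this assignment induces."""
    return sum(w for e, w in edges.items()
               if len({assignment[v] for v in e}) > 1)

# Toy workload: vertices are tuple-group ids, one list per transaction.
txns = [["g1", "g2"], ["g1", "g2"], ["g3"], ["g2", "g3"]]
edges = build_hypergraph(txns)
# Co-locating g1 and g2 cuts only the {g2, g3} edge:
print(net_cut(edges, {"g1": 0, "g2": 0, "g3": 1}))  # -> 1
```

Note how the two identical transactions collapse into a single hyper-edge of weight 2, so separating g1 from g2 would cost two distributed transactions at once.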
3. SYSTEM DESCRIPTION
We first introduce the overall system architecture, and then present the implementation details.
Figure 1 illustrates the overview of our solution, whichconsists of the following six steps:
S1: DB and workload analysis.
Each table in the database is divided into one or multiple tuple groups, according to the information extracted from the workload. The tuples within each group are always accessed together throughout the whole workload. The sizes of the tuple groups are derived from the database meta-data and statistics stored in the system catalog. Besides, the information about which tuple groups are involved in each transaction of the workload is also recorded.

Figure 1: System Architecture
S2: Hyper-graph generation.
The database partitioning problem is modeled as a hyper-graph partitioning problem. The hyper-graph has the following characteristics: (1) each tuple group obtained from the previous step corresponds to a distinct graph vertex with two weights, the tuple group size and the number of transactions accessing this tuple group; (2) each transaction is mapped to a hyper-edge that connects all the tuple groups it accesses. It is possible for different transactions to be mapped to the same hyper-edge. Each hyper-edge is associated with a weight counting the number of transactions mapped to it.
S3: Hyper-graph partitioning.
A graph partitioning algorithm is used to produce a balanced min-cut partitioning of the hyper-graph into k partitions. Each vertex (i.e. tuple group) is assigned to one partition, and each partition is assigned to one cluster node. The min-cut of the hyper-graph means a minimized number of distributed transactions resulting from the corresponding database partitioning strategy. The partitioning algorithm also tries to keep the extent of incurred data skew and workload skew under certain thresholds.

S4: Partitioning effect evaluation.
The graph partitioning result from S3 is evaluated according to certain criteria. If the result meets the criteria, the next step is S6; otherwise it is S5. The criteria are two-fold. First, the resulting database partitioning must be feasible, which means that it should not violate the physical constraints of the cluster. For example, the total volume of data assigned to a cluster node must not exceed its storage capacity. Second, the partitioning performance, i.e. the number of distributed transactions and the extent of data skew and workload skew, should achieve the expectations that are optionally imposed by the user. During this phase, the user can watch the visualized partitioning effects, and optionally provide feedback based on their expertise and domain knowledge to affect the decision on whether the system should proceed to graph refinement and re-partitioning.
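The feasibility half of these criteria amounts to simple per-node capacity checks. A minimal sketch, with illustrative node ids and capacities of our own choosing (the processing-ability and bandwidth checks would be analogous):

```python
def feasible(assignment, sizes, capacities):
    """Check the S4 storage constraint: the total data volume assigned to
    each node must not exceed that node's storage capacity."""
    load = {}
    for group, node in assignment.items():
        load[node] = load.get(node, 0) + sizes[group]
    return all(load.get(n, 0) <= cap for n, cap in capacities.items())

sizes = {"g1": 40, "g2": 30, "g3": 50}
print(feasible({"g1": 0, "g2": 0, "g3": 1}, sizes, {0: 100, 1: 100}))  # True
print(feasible({"g1": 0, "g2": 0, "g3": 0}, sizes, {0: 100, 1: 100}))  # False
```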
S5: Hyper-graph refinement.
The existing hyper-graph is refined towards generating a better partitioning that meets the criteria defined in S4. The basic idea of refinement is to choose some tuple groups in the hyper-graph and break them into smaller ones as new vertices. The hyper-edges are adjusted accordingly. The newly derived hyper-graph is then fed into S3 for partitioning. Intuitively, the new hyper-graph represents an expanded solution space that subsumes the space represented by the old hyper-graph. Since the new hyper-graph is usually similar to the old one, in addition to running the complete partitioning algorithm, the partitioning of the former can be done by incrementally revising the partitioning result of the latter.
S6: Look-up table construction.
The finally decided database partitioning strategy is encoded into a look-up table, which records the tuple-to-node mappings via a compact data structure representation. This look-up table is used both when loading the database into the cluster and when routing transactions to the involved data nodes during workload execution.

In the following sections, we elaborate on the technical details of our database partitioning solution roughly depicted above.
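As an illustration of the look-up table described in S6, assuming min-term predicates that reduce to simple attribute ranges (the structure and names below are our own, not the system's actual compact representation):

```python
# Each entry maps a min-term predicate (here, closed attribute ranges)
# to the cluster node its tuple group was assigned to.
lookup = [
    ({"w_id": (1, 5)}, 0),   # tuples with 1 <= w_id <= 5  -> node 0
    ({"w_id": (6, 10)}, 1),  # tuples with 6 <= w_id <= 10 -> node 1
]

def route(tuple_row):
    """Return the node holding this tuple; used both when loading data
    and when routing transactions during workload execution."""
    for pred, node in lookup:
        if all(lo <= tuple_row[attr] <= hi for attr, (lo, hi) in pred.items()):
            return node
    raise KeyError("no partition covers this tuple")

print(route({"w_id": 7}))  # -> 1
```

Because min-term predicates are disjoint by construction, at most one entry matches any tuple, so a first-match scan (or an index over the ranges) suffices.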
The steps for obtaining the tuple groups, i.e. min-term predicates, for each relation R are illustrated below. First, extract all the simple predicates related to relation R in the workload. Second, construct the min-term predicate list by enumerating the conjunctions of all the simple predicates in either natural or negated form. Third, eliminate those min-term predicates containing contradicting simple predicates, and simplify the min-term predicates by removing the simple predicates that are implied by other simple predicates within the same min-term predicate. In order to control the number of min-term predicates generated, we may select only the top-k most accessed attributes of each relation for min-term predicate construction. k is configurable by the user and currently has a default value of 2.

We obtain the database meta-data and statistics information (e.g. histograms) from the underlying database system catalog, and then estimate size(M) of a min-term predicate with methods similar to those utilized by a conventional relational database optimizer. To obtain the access count access(M) of a min-term predicate, we examine each transaction in the workload and determine whether it accesses the tuple group M. A transaction X will access the tuple group M iff, for each attribute A of R, the set of simple predicates on A that are involved in X do not contradict M. Then access(M) is equal to the total number of transactions accessing the tuple group M.

The outputs of the DB and workload analysis include the min-term predicates for all the database relations w.r.t. the workload, and a transaction access list which tells which min-term predicates a transaction will access.

Our partitioning system employs an existing partitioning algorithm, hMETIS [12], to do the hyper-graph partitioning. hMETIS is the hyper-graph version of METIS, a multilevel graph partitioning algorithm.
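The enumeration and contradiction-elimination steps above can be illustrated over a toy finite domain, where each simple predicate is represented by the set of tuples satisfying it. This is an illustration of the idea only; the actual system works on predicates symbolically, not by materializing tuple sets:

```python
from itertools import product

def minterms(preds, domain):
    """preds: list of sets of tuples (over a toy finite domain) satisfying
    each simple predicate. Enumerate every conjunction of the predicates in
    natural or negated form and keep the non-empty ones: the tuple groups."""
    groups = []
    for signs in product([True, False], repeat=len(preds)):
        cell = set(domain)
        for p, keep in zip(preds, signs):
            cell &= p if keep else (domain - p)
        if cell:                      # drop contradicting min-terms
            groups.append(cell)
    return groups

domain = set(range(10))               # toy relation: tuples keyed 0..9
p1 = {x for x in domain if x < 4}     # simple predicate A < 4
p2 = {x for x in domain if x >= 7}    # simple predicate A >= 7
print(sorted(map(sorted, minterms([p1, p2], domain))))
# -> [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The min-term p1 ∧ p2 is contradictory (A < 4 and A ≥ 7) and is eliminated, leaving three disjoint tuple groups that together cover the relation.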
hMETIS tells which vertex belongs to which node, and also reports the sum of the weights of the net cut, which represents the number of distributed transactions that would be incurred by this partitioning solution. hMETIS also supports incrementally revising an already partitioned hyper-graph according to new constraints. This feature of hMETIS enables lighter-weight hyper-graph repartitioning after the hyper-graph refinement.

3.4 Partitioning Effect Evaluation
For a specific partitioning solution to be feasible, it must not violate the physical restrictions of the underlying node cluster. Three types of physical restrictions are considered. First, the storage capacity of each node is limited. Second, the data processing ability of each node, which depends on the CPU and I/O speeds, is also limited. Third, the bandwidths of the network connecting the nodes are limited.

Intuitively, when a node is assigned more data and accessed by more transactions, the speed at which this node handles transaction processing will be slower, and thus this node is more likely to become a performance bottleneck of the whole system. Therefore, the extent of data skew and workload skew resulting from a specific partitioning solution should be within a certain threshold which represents the performance expectation of the user. We define a skew factor SF to quantitatively measure the extent of data and workload skew. Assume a cluster with n nodes. Let s_i and t_i be the size of the assigned database partition and the number of accessing transactions, respectively, of the i-th node. Then SF is calculated as follows:

SF = (1/n) · Σ_{i=1}^{n} [ α · (s_i − (1/n) Σ_{j=1}^{n} s_j)² + β · (t_i − (1/n) Σ_{j=1}^{n} t_j)² ]

where α and β are configurable non-negative parameters (α + β = 1) which may be used to reflect the different performance impacts of data skew and workload skew. Generally, a smaller value of SF means a better partitioning result.

Finally, the user also inputs the expected number of partitioning iterations (i.e. the cycles of S3 → S4 → S5 → S3 in Figure 1), which represents the time budget that the user allows the system to consume before giving up on finding a feasible or better partitioning result.
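Under one plausible reading of the SF formula (a weighted average of squared deviations of the per-node data sizes and access counts from their means; the exact normalization is an assumption on our part), SF can be computed as:

```python
def skew_factor(sizes, accesses, alpha=0.5, beta=0.5):
    """Skew factor SF: weighted mean of squared deviations of per-node
    partition size s_i and access count t_i from their cluster-wide means.
    alpha + beta = 1 trade off data skew against workload skew."""
    n = len(sizes)
    s_bar = sum(sizes) / n
    t_bar = sum(accesses) / n
    return sum(alpha * (s - s_bar) ** 2 + beta * (t - t_bar) ** 2
               for s, t in zip(sizes, accesses)) / n

print(skew_factor([50, 50], [100, 100]))      # perfectly balanced -> 0.0
print(skew_factor([80, 20], [100, 100]) > 0)  # data skew only -> True
```

A perfectly balanced assignment yields SF = 0; any imbalance in either sizes or accesses raises it, weighted by α and β respectively.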
The evaluation generates predictions on multiple performance metrics: data distribution, workload distribution, the number of distributed transactions, as well as the system throughput and response latency, which are obtained by simulated execution of the workload with our previous tool PEACOD [14], a partitioning scheme evaluation and comparison system.

Our system always runs in one of two execution modes: the fully automatic mode and the interactive mode. The automatic mode relies entirely on the intelligence of the system to evaluate the partitioning performance in order to decide whether to continue or halt the iterative graph partitioning procedure. In contrast, the interactive mode allows the user to provide feedback to affect the partitioning strategy at runtime, jointly with the system intelligence.

In the interactive mode, after each graph partitioning iteration, the system produces the visualized partitioning results. Besides, the comparison of the partitioning results between this and the previous iteration is also visualized. After that, the system pauses execution and waits for instructions or feedback from the user. The interactions by the user can be of various types. First, the user may terminate the system execution early, with either an already satisfactory partitioning result or a hopelessly bad one. Second, the user may provide suggestions on how the current hyper-graph should be refined.
If the partitioning result is neither feasible nor good enough, we invoke partitioning refinement to get a feasible and better one. The basic idea is to split some tuple groups (i.e. hyper-graph vertices) and then redo the partitioning for the accordingly revised hyper-graph. Tuple group splitting is three-phase.

First, we rank the vertices with a ranking function. Vertices with higher ranks are more likely to be split. Currently, we use the vertex size as the ranking function. Alternative ranking functions, e.g. the ratio of size to access frequency, may also be utilized.

Second, we select the top-k vertices to split. k is configurable by the user and currently has a default value of 20.

Last, we split each selected vertex V into two new vertices V_1 and V_2. We pick the simple predicate p with the lowest selectivity in the min-term predicate M of V and then break p into two simple sub-predicates, p_1 and p_2, with the same selectivity. V_1 and V_2 correspond to the new min-term predicates constructed by replacing p in M with p_1 and p_2 respectively. A hyper-edge accesses V_1 and V_2 iff it accesses V. As a result, size(V_1) = size(V_2) = size(V)/2 and access(V_1) = access(V_2) = access(V).

Obviously, hyper-graph refinement through splitting vertices can't further reduce the number of distributed transactions. However, the refined hyper-graph does contain finer-grained vertices, which may enable feasible partitioning solutions as well as mitigate the issues of data and workload skew.
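The splitting rules can be sketched for the common case where the chosen predicate p is a range predicate split at its midpoint. This is a simplified illustration under our own representation of a vertex; selectivity-balanced splitting of arbitrary predicates is more involved:

```python
def split_vertex(v):
    """v: {'range': (lo, hi), 'size': int, 'access': int}.
    Split the range at its midpoint into two equal-selectivity halves.
    Per Section 3: size(V1) = size(V2) = size(V)/2, and both halves
    inherit the parent's access count unchanged."""
    lo, hi = v["range"]
    mid = (lo + hi) // 2
    half = dict(size=v["size"] // 2, access=v["access"])
    return ({"range": (lo, mid), **half}, {"range": (mid + 1, hi), **half})

v1, v2 = split_vertex({"range": (0, 99), "size": 1000, "access": 40})
print(v1["size"], v2["size"], v1["access"])  # -> 500 500 40
```

Every hyper-edge incident to the parent is then reattached to both halves, which is why the net cut cannot shrink, while the finer vertices give the partitioner more freedom to balance sizes and accesses.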
4. EXPERIMENTAL EVALUATION
In this section, we report the experimental results.
We have implemented a tool called PEACOD [14] to automatically and extensibly evaluate and compare various database partitioning schemes. PEACOD is a Java application that runs on Linux. The tool embeds several well-known OLTP benchmarks such as TPC-C, EPINIONS and TATP. We use PostgreSQL [2] as the target database server. The experiments used PostgreSQL 9.1.2 as the DBMS with the buffer pool size set to 1GB, hosted on a machine with two 2.4GHz cores and 4GB of physical RAM.
We have implemented and embedded seven partitioning schemes to be compared, including our hyper-graph based partitioning scheme (HGP). The other six schemes are:
CountMaxRoundRobin(CMRR).
In this scheme, the most frequently accessed attributes are selected as the partitioning keys. The tables are partitioned in a round-robin manner based on the partitioning key values.
SchemaHashing(SH).
This scheme selects partitioning keys based on the primary-foreign key relationship topology in the database schema [17]. The primary key of the root table becomes the main deriving partitioning key. Then the tables are hash-partitioned.
PKHashing(PKH).
This scheme selects the primary keys of tables as the partitioning keys and hash-partitions the tables.
PKRange(PKR).
This scheme selects the primary keys of tables as the partitioning keys and range-partitions the tables.
PKRoundRobin(PKRR).
This scheme selects the primary keys of tables as the partitioning keys and partitions the tables in a round-robin manner based on the partitioning key values.
AllReplicate(AllR).
This scheme replicates each table to all data nodes.
In the experiments, we used the following three transactional benchmarks:
TPC-C.
This benchmark is the current industry standard for evaluating the performance of OLTP systems [3]. It consists of nine tables and five transactions that simulate a warehouse-centric order processing application. All the transactions are associated with a parameter warehouse id, which is the foreign key ancestor for all tables except the ITEM table. In the experiments, we generate a 2-warehouse dataset and a 10-warehouse dataset.
EPINIONS.
The Epinions.com experiment aims to challenge our system with a scenario that is difficult to partition. It verifies the system's effectiveness in discovering intrinsic correlations between data items that are not visible at the schema or query level. It consists of four tables: users, items, reviews and trust. The reviews table represents an n-to-n relationship between users and items (capturing user reviews and ratings of items). The trust table represents an n-to-n relationship between pairs of users indicating a unidirectional trust value. The workload is obtained from the open-source OLTP benchmark suite oltpbenchmark [1].
TATP.
This benchmark is an OLTP application that simulates a typical caller location system used by a telecommunication provider [4]. It consists of four tables, three of which are foreign key descendants of the root SUBSCRIBER table. Most of the transactions in TATP are associated with a SUBSCRIBER id, allowing them to be routed directly to the correct node.
In these experiments, we evaluate the number of distributed transactions that each scheme produces, which we regard as the key metric of a partitioning scheme. We ran several experiments to obtain this number for the 7 partitioning schemes and 3 benchmarks mentioned above.
We first conducted the experiments using the TPC-C benchmark. All the experiments used a 1000-transaction workload. We tested three scenarios: partitioning a 2-warehouse dataset onto 2 nodes, partitioning a 2-warehouse dataset onto 8 nodes, and partitioning a 10-warehouse dataset onto 10 nodes. We tested all 7 partitioning schemes. The results are listed in Table 1 and Figure 2.

From the results, we can observe that HGP and SH are significantly better than the other partitioning schemes. AllR is the worst since it requires many update operations spanning all physical nodes. CMRR is also very bad since it chooses bad partitioning keys. The results of the three primary key based partitioning schemes are not very bad since they choose the
Table 1: Number of distributed transactions under TPC-C

Scenario     HGP   SH    PKH   PKR   PKRR   CMRR   AllR
10w -> 10    82    168   418   417   410    914    927

Figure 2: Number of distributed transactions under TPC-C for each partitioning scheme.
best partitioning keys. The most suitable partitioning keys for TPC-C are the primary keys of each table. But this does not hold for all OLTP benchmarks, so these three partitioning schemes may perform very badly on other benchmarks in the following experiments.

SH chooses its partitioning keys according to the PK-FK references, which happen to be optimal here, so its result is very good. But HGP is better than SH: HGP can analyze co-location relationships while SH cannot, so some distributed transactions can be eliminated using this information in HGP. Hence, HGP typically produces fewer distributed transactions than SH.

From the experimental results of the three PK-based schemes (PKH, PKR, PKRR), we find that the partitioning method (hashing, range, round-robin) is not very important: the three methods produce essentially the same results. Comparing CMRR and PKRR, we observe that the important factor for a partitioning scheme is the selection of the partitioning keys. Choosing the right partitioning keys yields very good results and performance; which partitioning method is chosen matters much less.
We conducted the TATP experiments using 2000 transactions. We partitioned the data onto 2, 4, 8 and 16 nodes separately. The results are shown in Figure 3. From these results, we find that HGP is far better than the other schemes: its proportion of distributed transactions is under 5 percent, while the other schemes' proportions are greater than 20 percent.
Figure 3: Proportion of distributed transactions under TATP.

Figure 4: Number of distributed transactions under Epinions (2 to 5 nodes).
The three PK-based schemes performed very badly in this experiment, even worse than CMRR and AllR. This indicates the importance of partitioning key selection. SH is also bad, since it cannot choose suitable partitioning keys by analyzing only the database schema: the correlations between data items are not visible at the schema level in Epinions.
The results of the experiment using the Epinions benchmark are shown in Figure 4. The experiment used 200 transactions. We partitioned the database onto 2, 3, 4 and 5 nodes separately. The experiment generated results similar to those of the TATP benchmark. HGP is far better than the other schemes. SH, PKH, PKR and PKRR generated the same number of distributed transactions, as they all chose primary keys as the partitioning keys; the methods (round-robin, range, hashing) used to distribute the data are not the key factor. In contrast, CMRR chose the most accessed attributes as the partitioning keys, and hence produced just half the number of distributed transactions of the PK-based schemes.

In conclusion, our partitioning scheme HGP performs better than the other schemes when we use the number of distributed transactions as the key metric. Other experiments will be conducted in the near future, and other performance metrics will be used to compare these partitioning schemes. We also built a demo prototype, as shown in Figure 5.
5. RELATED WORK
Database partitioning is crucial for scaling transactional databases, and it is very challenging to choose or design the optimal partitioning scheme for a given workload and database.

There already exist many kinds of general-purpose partitioning algorithms, among which round-robin, range-based and hash partitioning are the most widely used [6]. These algorithms are very effective for analytical workloads which scan very large data sets. But for transactional workloads, these methods typically require multi-node access, and therefore produce distributed transactions, when more than one tuple is accessed in a query.

In the meantime, more ad-hoc and flexible partitioning schemes tailored for specific-purpose applications have also been developed, such as the consistent hashing of Dynamo [8], Schism [9] and One Hop Replication [11].

Bubba provides many heuristic approaches to balance the access frequency rather than the actual number of tuples across partitions [5]. This algorithm is simple and cheap, but doesn't guarantee perfect balancing of processing. Schism provides a novel workload-aware graph-based partitioning scheme [9]. The scheme can obtain balanced partitions and minimize the number of distributed transactions.

Scaling social network applications has been widely reported to be challenging due to the highly interconnected nature of the data. One Hop Replication is an approach to scale these applications by replicating the relationships, providing immediate access to all data items within 'one hop' of a given record [11].

[16] provides a fine-grained partitioning called lookup tables for distributed databases. With this fine-grained partitioning, related individual tuples (e.g., cliques of friends) are co-located in the same partition in order to reduce the number of distributed transactions. But with a tuple-level lookup table, the database needs to store a large amount of meta-data about which partition each tuple resides in.
This consumes large storage space and makes the lookup operation not very efficient.

Consistent hashing [7] can be used to minimize data movement during re-partitioning, but it may cause nonuniform load distribution. Dynamo [8] extends consistent hashing by adding virtual nodes; it provides different partitioning strategies which can ensure uniform load distribution while providing excellent re-partitioning performance. Other works such as CRUSH [10] and FastScale [15] also provide algorithms which can be used for re-partitioning.
6. CONCLUSION
In this technical report, we propose a fine-grained hyper-graph based database partitioning system for transactional workloads. Our hyper-graph based database partitioning scheme has the following major advantages over previous ones.

First, our scheme can reach much more fine-grained and accurate partitioning results, and thus works well for all kinds of transactional workloads, by taking tuple groups as the minimum components of partitions. On the one hand, since tuple groups are directly calculated based on the workload information, compared with blind round-robin, range or hash partitioning methods, they are more likely to successfully cluster tuples that will eventually be co-accessed by transactions. As a result, our approach leads to fewer distributed transactions. On the other hand, by splitting tuple groups into smaller ones, we can more easily mitigate the issues of data skew and workload skew.

Second, our scheme is very light-weight and efficient. It has good scalability, as the size of the generated hyper-graph depends only on the workload size and not on the database size. Unlike previous approaches, it does not need to interact with the query optimizer for cost estimation, whose overhead is quite significant. This is feasible in practice, as the dominant performance bottleneck of transactional workloads lies in the number of distributed transactions, which can be directly counted from the hyper-graph partitioning result. Moreover, the repartitioning of the hyper-graph can be done incrementally.

Third, our scheme is very flexible.
Users are allowed to input their performance expectations at the beginning. During the partitioning iterations, they can watch the visualized partitioning effects, and optionally provide feedback based on their expertise and domain knowledge so as to affect the decision on whether the system should proceed to graph refinement and re-partitioning, as well as to provide suggestions on how the current hyper-graph should be refined. With such user interactions, our approach is able to reach a configurable and precise balance between partitioning speed and partitioning quality.

Figure 5: System Demonstration
7. REFERENCES