Ingesting High-Velocity Streaming Graphs from Social Media Sources
Subhasis Dasgupta
San Diego Supercomputer Center, Univ. of California San Diego
La Jolla, CA 92093, [email protected]
Aditya Bagchi
Dept. of Computer Science, RKMV Educational and Research Institute
Howrah 711202, West Bengal, [email protected]
Amarnath Gupta
San Diego Supercomputer Center, Univ. of California San Diego
La Jolla, CA 92093, [email protected]
Abstract—Many data science applications like social network analysis use graphs as their primary form of data. However, acquiring graph-structured data from social media presents some interesting challenges. The first challenge is the high data velocity and bursty nature of the social media data. The second challenge is that the complex nature of the data makes the ingestion process expensive. If we want to store the streaming graph data in a graph database, we face a third challenge – the database is very often unable to sustain the ingestion of high-velocity, high-burst data. We have developed an adaptive buffering mechanism and a graph compression technique that effectively mitigate the problem. A novel aspect of our method is that the adaptive buffering algorithm uses the data rate, the data content as well as the CPU resources of the database machine to determine an optimal data ingestion mechanism. We further show that an ingestion-time graph-compression strategy improves the efficiency of the data ingestion into the database. We have verified the efficacy of our ingestion optimization strategy through extensive experiments.
Index Terms—High-velocity graph processing, graph ingestion, ingestion optimization, adaptive ingestion buffer, ingestion management
I. INTRODUCTION
A significant fraction of data used in Data Science today comes from streaming data sources. These include data from social media streams like Facebook and Twitter, IoT data from sensors, and stock market data from stock exchanges and financial information sources. There are two broad categories of data science research for streaming data – real-time analytics and non-real-time analytics. In the first case, analytics tasks can be performed on a small window of in-flight data as the data streams in. For example, Bifet et al. [3] develop a streaming decision tree technique that operates on an in-memory snapshot of data and adapts to changes in streams. In the second case, although the data is collected in real time, a data ingestion system needs to collect data for some time before the analytics operations can be executed. For example, computing the hourly frequency distribution of hashtags from Twitter would require the system to store data because, due to the high velocity of Twitter streams, an hour's worth of data will often exceed the memory capacity of the streaming system. In this case, the ingestion of the streaming data must keep up with the fluctuating data rates of the stream so that it does not have
to resort to any load-shedding scheme of the kind applied to a previous generation of stream processing systems (e.g., [2]).

The reality, however, is that when the data gets more complex and needs pre-processing before storage, there is a distinct bandwidth gap between the data rate of the stream producer and the ingestion capability of the store that houses the streaming data. In this paper, we investigate the nature of, and mitigation strategies for, this bandwidth gap problem in the context of streaming social media data (a JSON stream) that is transformed into a graph and stored in a graph database for analytics operations. We consider graph data to be more complex because, in contrast to relational records without strong integrity constraints, the nodes and edges of a graph are not independent of each other. Therefore, while edges of the graph arrive in random order in streaming data, the DBMS has to spend additional time to ensure that two neighboring edges ingested at different times from the stream are connected to the same node inside the DBMS. Thus, the ingestion cost of graph data is higher, leading to the bandwidth gap between the ingestion rate and the storage rate of streaming graphs. Interestingly, while processing high-volume graph data for network analysis is an emerging research area [8], [16], ingestion optimization of streaming graphs is still an unexplored area.

This work is partly funded by NSF Grant 1738411 and the AWESOME Project at the San Diego Supercomputer Center.
Example Use Case.
To motivate the problem, we present an illustrative use case from the domain of political science. The objective of the study is to understand patterns in political conversations and public opinions on social media in the USA. One of the data sources for the study is Twitter; we use Twitter's streaming API (1% sample), from which we filter tweets by using a set of domain-specific keywords. During politically charged times, the rate of tweets received shows bursty behavior. Figure 1 shows the rate of data arrival over a 25-second period. The figure shows the bursty nature of tweet arrival and a peak value over 2500. This is in comparison with the average rate of 60 tweets/sec (1% of 5787 tweets/minute [17]) available in real time using the Twitter API. This tweet stream is accepted by an ingestor process, transformed into a graph model by a transformer process and then pushed to a backend graph DBMS (Neo4j). As the velocity of tweets increases and the transformer process pushes increasingly more data to the store, the Neo4j machine reaches 100% user time (Figure 2) while there is a small decrease in memory availability. The deterioration of system efficiency is further evidenced by the speed of context switching (Figure 3). With no intervention, this results in a significant slowdown or a total failure of the DBMS server.

Fig. 1: Performance measurement during direct tweet stream ingestion
Fig. 2: Performance measurement during direct tweet stream ingestion
Fig. 3: Performance measurement during direct tweet stream ingestion

This system failure points to a largely ignored aspect of the data science infrastructure – with all the advances in improving "Big Variety" problems, ingestion of streaming graph data into graph databases has remained unaddressed [4], [15]. In this paper, we address the above form of system failure by combining two completely different approaches to the problem:
1) Adaptive Buffering:
We develop an adaptive buffering scheme that monitors both the data arrival rate and the CPU load of the server and balances the effective ingestion load transmitted to the server.
2) Graph Compression:
We exploit the information redundancy in the social media data content to compress the graph load that would be ingested by the DBMS.

For adaptive buffering, we create a predictive model of how the CPU load will be impacted by the buffer size and the variation of data content as the data rate fluctuates. We find that the buffer size itself is controlled by a metric related to the diversity of the data content when the data is transformed to a graph. For graph compression, we make use of the observation that during a burst, a large number of users post about correlated content and, in the process, reuse hashtags created by others. We show that using this combined approach, we can largely adapt to the velocity and burstiness, and only on rare occasions resort to spilling the incoming data to local storage of the ingestor machine.

II. INGESTING STREAMING GRAPHS
Our stream processing engine has a pipelined architecture as shown in Figure 4. In the following, we first present the building blocks of this architecture and then present the controlling algorithms for managed stream ingestion.
A. Data Processing Pipeline
The data processing pipeline, consisting of seven steps, is developed on top of a threaded, multiprocessing and partially distributed environment. The primary computation in the pipeline is data manipulation and transformation, which is executed by breaking up the stream into mini-batches.

Fig. 4: Ingestion Steps

• Filter:
The ingestion process starts by filtering out data items (tweets) that do not satisfy the semantic requirements of the system. The filter is applied in two stages. The first set of filters is applied as a parameter of the streaming API provided by the data source (Twitter, in this case). In our example, we provide a set of keywords to the Twitter API for a specific application, and receive a data stream satisfying the filter. In the second phase, we apply a set of analysis-specific filtering criteria (e.g., remove tweets with only emojis). The choice of filtering criteria has a profound effect on the effective data rate of a stream. In our example, our keywords involve names of US politicians and some political issues. Therefore, whenever a political issue grabs public attention, we see a significant burst in the rate at which the data streams into our system. We have observed 15-45% velocity fluctuations on a normal day and over 250% fluctuation on extremely busy days.
• Buffer:
The filtered data is collected in a buffer. As mentioned before, the size of the buffer is an important parameter in the efficiency of data ingestion. Although using a buffer is a standard strategy, we determined that using a fixed buffer size does not effectively handle the problem of efficient ingestion management. In the case of a burst period, the CPU of the DBMS machine quickly goes to 100% load and a large buffer is needed to absorb the bursting content and control the CPU load. However, using a large buffer also delays the ingestion process, because when the content of a large buffer is transmitted to the CPU, its ingestion load increases. To counter this dilemma, we have developed an adaptive buffer management strategy, described in the next subsection, that senses the impact of an upcoming burst and adaptively modifies the size of the buffer. We show that the factors impacting the required buffer size depend not only on the data rate but also on the content of the data. The use of content in controlling the buffer size distinguishes our work from traditional buffer management algorithms that only perform congestion control [10], [18].
• Model Transformation:
Model transformation is the process of transforming data from its native form to a target form that conforms to the data model supported by the destination storage. In our case, tweets enter the system as a stream of tree-structured objects (JSON) and need to be converted into a property graph (a graph where both nodes and edges can have attributes and values). In this step, a tweet (sometimes a tweet set, as shown later) is manipulated to construct typed nodes, labeled edges, node properties and edge properties. Figure 6 shows an example of model transformation, where the JSON elements called user and tweet become nodes, but hashtag, a JSON property, is unnested and transformed into nodes because, in the target graph model, a hashtag will be shared by a number of nodes. The edges of the graph, namely owner, mentioned, hashtag-used-in and mentioned-with-ht, are constructed from the JSON content. For instance, the edge mentioned-with-ht connects a hashtag with a user who is mentioned in the tweet. In general, the model transformation uses a configuration file that specifies the mapping between the input and target data. Fig. 5 shows the node types and node properties the target graph must have, and a mapping section that specifies how these properties can be populated from the input (e.g., using the getName function). A minimal code sketch of this transformation appears at the end of this subsection.
• Batch Optimizer:
The model-transformed data is prepared for ingestion by wrapping the data into INSERT clauses for the graph database. However, the actual insertion process is expensive because of the ingestion latency of the target DBMS. This is managed by grouping the INSERT operations into "batches". The batch optimization process determines the optimal batch size to improve system efficiency. Mini-batching [7], [13], [19] is a standard strategy for optimizing throughput during ingestion. However, we exploit the observation that although the number of tweets increases during a burst, there is a high degree of redundancy among them. This presents an optimization opportunity because the redundant portions of a graph must be ingested only once. In our case, this optimization takes the form of dynamic graph compression, which utilizes the fact that nodes like user and hashtag should be computed only once during batch creation.
• Graph Ingestor:
The graph ingestor has two parts. The first part manipulates the data structures of the model transformation step and constructs the ingestion instructions so that the graph can be ingested by the target DBMS. The construction implements the graph compression method. The second part is an interface between our pipeline (Fig. 4) and the graph DBMS. The ingestor pushes the data to the DBMS ingestion pool, where the pool size is predefined and managed by third-party connectors. In our example, we choose the Neo4j DBMS, which uses bolt as a graph connector.

Fig. 5: Basic Components of an XML Map file
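To make the Model Transformation step above concrete, the following is a minimal Python sketch (not the authors' code) of turning one tweet JSON object into the node and edge types named above. The exact JSON field names (id_str, entities, user_mentions) and the tuple layout are assumptions made for illustration.

# Illustrative sketch: transforming one tweet JSON object into property-graph
# nodes and edges. Field names and the helper structure are assumptions.
def transform_tweet(tweet: dict):
    nodes, edges = [], []

    user = tweet["user"]
    nodes.append(("user", user["id_str"], {"name": user.get("name")}))
    nodes.append(("tweet", tweet["id_str"], {"text": tweet.get("text"),
                                             "created_at": tweet.get("created_at")}))
    edges.append(("owner", user["id_str"], tweet["id_str"], {}))

    # Unnest hashtags into their own nodes so they can be shared across tweets.
    hashtags = [h["text"].lower() for h in tweet.get("entities", {}).get("hashtags", [])]
    for tag in hashtags:
        nodes.append(("hashtag", tag, {}))
        edges.append(("hashtag-used-in", tag, tweet["id_str"], {}))

    # Mentioned users, and their co-occurrence with hashtags.
    for m in tweet.get("entities", {}).get("user_mentions", []):
        nodes.append(("user", m["id_str"], {"name": m.get("name")}))
        edges.append(("mentioned", tweet["id_str"], m["id_str"], {}))
        for tag in hashtags:
            edges.append(("mentioned-with-ht", tag, m["id_str"], {}))

    return nodes, edges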
III. INGESTION CONTROL
In this section, we present the design principles behind the adaptive buffer control and graph compression algorithms that improve ingestion efficiency.
A. Modeling the Ingestion Problem
To manage the buffer for the streaming data, we first need to establish the factors that govern the buffer size. Let us first define a set of relevant parameters.

Fig. 6: Model transformation for tweets

• Graph Density (d): The density of a graph G(V, E) is the ratio of the number of edges of G to the maximum possible number of edges that can be induced by the nodes of G. Thus, density d = 2|E| / (|V| (|V| − 1)), where |.| is the cardinality function.
• Ingestion Buffer Size (β): The ingestion buffer is the memory space used for all pre-processing operations on the streaming data, including filtering and model transformation.
• Effective Buffer/Output Buffer (β_e): The effective buffer (or output buffer) is the buffer that contains the output of the model transformation.
• Bucket Diversity Ratio (ρ): A bucket B[i] is a mini-batch of graph data that will be sent to the database for ingestion at time i. The bucket to be sent immediately has time index 0, the bucket to be sent right after has index 1, and so forth. The diversity of a bucket is the proportion of new nodes (e.g., new hashtags) that appear in that bucket. The bucket diversity ratio ρ is the average ratio of new nodes observed over k temporal buckets.

Based on the above parameters, the model to predict the effective buffer size β_e is

    β_e = f(ρ, d)    (1)

Notice that the model does not use β as a variable because β_e is generated from β. To set up the model, we assume that the model function f does not depend on time, but the parameters need to be dynamically determined at each time chunk. We further assume that the effects of ρ and d on β_e are linearly separable. Thus,

    β_e[i] = K[i]·φ_1(ρ[i]) + R[i]·φ_2(d[i])    (2)

where the functions φ_1, φ_2 and their linear coefficients K[i] and R[i] need to be learned from the data. The result of the prediction model is presented in detail in Section IV. Once the parameters of this model are determined, we need to estimate how the buffer size, i.e., the volume of data to be sent to the graph database for ingestion, impacts the stability of system resources on the DBMS side. The obvious candidate performance metrics to be considered on the DBMS side are memory, CPU user time (called CPU-usage later), context switching of the CPU, and interrupts per second. To simplify the model, we experimentally observe these metrics (Fig. 7) over time, where no buffer control is exercised over input streams. A comparison of these performance metrics shows that the CPU-usage rises from about 40% to 100% in less than a second as the number of ingested data records (i.e., the effective buffer size) increases. This effectively increases the delay time because the CPU spends longer updating the database.

The ingestion delay I_n is the time gap between a record appearing in the stream and the record being ready for querying. In other words, I_n is the total time that the data stays inside the ingestion pipeline. There are two factors responsible for this delay – buffer latency and system delay. Buffer latency refers to the time delay of a data item in the buffer due to the effective buffer size, while system delay refers to the delay that occurs because the CPU load was too high for the previous mini-batch, which results in a delay in getting/processing the next mini-batch. Let us assume that at any time point i the delay is the sum of the bucket delay δ_i and the system delay α_i.
Hence, the total system delay over T time units is:

    D = Σ_{i=1}^{T} (δ_i + α_i)    (3)

Let µ_exp[i] denote the expected value of CPU-usage at the i-th time point and β_e[i] the effective buffer size. Then ∆µ_user[n] = µ_user[n] − µ_user[n−1] is the change of CPU-usage at time n. Since we intend to regulate the CPU use of the DBMS machine even when the streaming data is very large, our goal is to bound the value of ∆µ_user[n] to achieve system stability. We observe that ∆µ_user[n] monotonically increases with the increase of β_e, the effective buffer size of the ingestion control system. However, the nature of this monotonic function needs to be determined by a second predictive model, of the form

    ∆µ_exp[n] = f(β_e[n])  ⟺  µ_exp[n] = f(β_e[n]) + µ_exp[n−1]    (4)

Substituting Eq. 2 from the first prediction model, we get

    ∆µ_exp[n] = f(K[n]·φ_1(ρ[n]) + R[n]·φ_2(d[n]))    (5)

In Section IV, we experimentally estimate the form of the model and the parameters of these equations, and demonstrate how we can effectively control the streaming ingestion problem.
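As an illustration of how the parameters of Eq. (2) could be learned, the sketch below fits K and R with ordinary least squares, assuming (as suggested in Section IV) that φ_1 is linear in ρ and φ_2 is quadratic in d. The synthetic samples and helper names are placeholders, not the authors' data or code.

# Illustrative sketch: fitting Eq. (2) with scikit-learn on synthetic per-bucket
# samples (rho = bucket diversity ratio, d = graph density).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
rho = rng.uniform(0.05, 0.6, size=500)
d = rng.uniform(0.001, 0.05, size=500)
beta_e = 0.6 * rho + 1.5 * d**2 + rng.normal(0, 0.01, size=500)   # stand-in observations

X = np.column_stack([rho, d**2])          # [phi_1(rho), phi_2(d)]
model = LinearRegression().fit(X, beta_e)
K, R = model.coef_                        # coefficients of Eq. (2)

def predict_effective_buffer(rho_i: float, d_i: float) -> float:
    """Predicted effective buffer size for the next bucket."""
    return float(model.predict([[rho_i, d_i**2]])[0])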
B. Algorithms
In this section we present the algorithms referred to in the previous subsections. The first algorithm (Algorithm 1) is the algorithm for model transformation (Section II-A), where a JSON object is manipulated to construct a property graph. In the process, we implement the graph compression strategy mentioned in Section I. The second algorithm implements the buffer control technique based on the prediction models from Section III-A. The third algorithm controls the actual ingestion process that transmits data from the effective buffer to the DBMS server.
Graph Model Transformation Algorithm:
Designed to be flexible, the graph model transformation algorithm takes as input an XML-structured mapping file and a data extraction library which can parse an input data object and extract its sub-elements as needed. This extraction process depends on the data model of the input data file, and is not specific to any particular data set. In our case, the extraction library operates on any JSON file. The specificity of the problem-specific input and output is provided by the mapping file. This makes the translation more "portable" – to choose a different data source (e.g., Reddit) that conforms to the same data model, one would only need to change the mapping file.

Fig. 7: Effect of Uncontrolled Ingestion – graph ingestion is a CPU-bound process

We choose an in-memory edge-centric data structure to represent the graph. The primary task of the algorithm is to extract information to populate an edge table and an indexed node list, followed by the construction of insertion instructions from this in-memory representation of the property graph. Figure 9 represents the structure of the edge table. Each edge has a unique id, start node, end node, start node properties, end node properties, and edge properties. Node and edge properties are stored as a simple MAP object where the 'key' is the name of the property and the 'value' is the property value. In addition, a set of table-level metadata like node density and diversity are computed for the edge table. We use a special property 'count' to handle duplicate edges. When we encounter a duplicate edge, we increase the value of 'count' (line 20 of Algorithm 1). The duplicate detection is handled by the procedure INSERTEDGE() (line 13 of Algorithm 1). The algorithm keeps a node index to search nodes, together with a list of connected nodes. The algorithm updates the node index on any insertion, while during the insertion it also searches for duplicate edges. The indexed node list together with the deduplicated edge table are the necessary data structures used in the graph compression step described in Algorithm 3.

Fig. 8: The Edge Table structure is implemented as a multithreaded structure. G1 represents the model transformation algorithm, and G2 is the graph insertion algorithm.
Fig. 9: Edge Table Structure
The CREATEEDGE() algorithm works in batch mode. It takes a set of statuses (tweets), a map function, and an extraction library as input. After initiation, the extraction function extracts nodes, node properties and edge properties (lines 2 to 11) and passes them to the INSERTEDGE() function. The run-time complexity of the algorithm is linear in the number of edges.
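A minimal Python sketch of the in-memory structures described above – an indexed node list plus a deduplicated edge table in which a repeated edge only increments a 'count' property. Class and method names are illustrative, not the paper's implementation.

# Illustrative sketch of the edge table with duplicate-edge counting.
class EdgeTable:
    def __init__(self):
        self.node_index = {}   # node key -> set of neighbouring node keys
        self.edges = {}        # (start, end, type) -> property map

    def insert_edge(self, start, end, edge_type, props=None):
        # Register both endpoints in the node index.
        self.node_index.setdefault(start, set()).add(end)
        self.node_index.setdefault(end, set()).add(start)
        key = (start, end, edge_type)
        if key in self.edges:
            # Duplicate edge: compress by bumping the 'count' property.
            self.edges[key]["count"] += 1
        else:
            self.edges[key] = dict(props or {}, count=1)

    def diversity_ratio(self, previously_seen: set) -> float:
        """Proportion of nodes in this bucket that were never seen before."""
        nodes = set(self.node_index)
        return len(nodes - previously_seen) / max(len(nodes), 1)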
Algorithm 1 Graph Model Transformation Algorithm
Input: list of records, record extraction library, XML mapping file
Output: edge table, node list, degree distribution, diversity ratio
procedure CREATEEDGE(statusList[], datalib, map)
    nodeList = new nodeList()
    edgeTable = new edgeTable()              ▷ initialize new edge table and node table
    for status st in statusList do
        d = new datalib(map)                 ▷ initialize new data extraction functions
        edgeTypeList[] = d.getEdgeType()
        for edge in edgeTypeList do
            stNode ← d.getStartNode(st)
            stProp[] ← d.getStartNodeProp(st)     ▷ returns map[name, value]
            endNode ← d.getEndNode(st)
            endProp[] ← d.getEndNodeProp(st)
            INSERTEDGE(stNode, endNode, edgeTable, nodeIndex)
    return edgeTable
procedure INSERTEDGE(stNode, endNode)
    if nodeIndex(stNode) and nodeIndex(endNode) then
        if not nodeIndex(stNode).getEdge(endNode) then
            nodeIndex(stNode).addEdge(endNode)
            nodeIndex(endNode).addEdge(stNode)
            addEdgeTable(stNode, endNode)
        else
            edgeProp.count = edgeProp.count + 1   ▷ duplicate edge: compress by counting
    else if not nodeIndex(stNode) and nodeIndex(endNode) then
        nodeIndex(endNode).addEdge(stNode)
        addEdgeTable(stNode, endNode)
        nodeIndex.add(stNode)
    else if nodeIndex(stNode) and not nodeIndex(endNode) then
        nodeIndex(stNode).addEdge(endNode)
        addEdgeTable(stNode, endNode)
        nodeIndex.add(endNode)
    else                                          ▷ neither endpoint indexed yet
        nodeIndex.add(stNode)
        nodeIndex.add(endNode)
        nodeIndex(stNode).addEdge(endNode)
        addEdgeTable(stNode, endNode)
    return edgeTable, nodeIndex

Buffer Controller Algorithm: The objective of the buffer control algorithm (Algorithm 2) is to improve system stability during ingestion. Since graph ingestion is a CPU-bound process, our algorithm maintains the 'CPU-usage' (the CPU utilization percentage for the user space) level within acceptable bounds (called cpu_min and cpu_max respectively). Specifically, as the data rate fluctuates, this algorithm controls the CPU load by adjusting the buffer size within the range [β_min, β_max]. The edge table computes the diversity ratio, velocity, and the degree distribution of the nodes (lines 17 to 20), and the Zabbix API supplies the CPU-usage. Hence, the input of the algorithm is the average CPU-usage data and the edge table. Depending on the data velocity and the diversity for a particular time range, we predict the actual buffer size by using multivariate linear regression. Next, we estimate the possible maximum buffer size from the CPU-usage data and the "acceleration", i.e., the second derivative of the data rate. We use linear regression to compute the predicted CPU-usage. The steps of the buffer control algorithm are detailed as follows.
1) With the input, the algorithm estimates the effective buffer size, the expected CPU-usage, and the 'velocity', i.e., the first derivative of the data rate.
2) If the expected CPU-usage is higher than cpu_max, it increases the buffer size by θ₁ (a constant in the range [0,1]) times the available memory.
3) It measures whether the CPU-usage is θ₂ times higher than cpu_max. If so, it writes to the local disk, which we call data throttling. Here, θ₁ and θ₂ are system-specific constants determined experimentally for our testbed.
4) If the expected CPU-usage is lower than cpu_max, we push the data to the graph database.
5) While the buffer size is greater than β_min, it decreases the buffer size by θ₁ times the available memory.
This increases the availability of the data because, with a lower buffer size, the ingestion latency is improved.
6) If the CPU-usage is θ₂ times lower than cpu_max, it reads from the disk where the data was stored during throttling, and pushes it forward to the DBMS server.
7) At every step of the above process, the buffer size, expected CPU-usage, and velocity are supplied by the PERFMON function, which uses our prediction models for the CPU user time.
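The following sketch shows one way the control loop in steps 1–7 could be wired together. The monitoring, prediction and I/O helpers are passed in as parameters, and all thresholds are placeholder values; this is a sketch under assumptions rather than the authors' implementation.

# Illustrative sketch of the adaptive buffer-control loop (all names and
# threshold values are placeholders; the callables wrap the regression models
# of Section III-A and a monitoring API such as Zabbix).
import time

def buffer_control_loop(stream, db_push, disk, predict_beta_e, predict_cpu, cpu_usage, *,
                        cpu_max=0.55, beta_min=500, beta_max=50_000,
                        theta1=0.1, theta2=1.2, mem_avail=1_000_000):
    beta = beta_min                           # current ingestion buffer size (records)
    while True:
        bucket = stream.read(beta)            # fill the buffer for this time chunk
        rho, d = bucket.diversity_ratio(), bucket.density()
        beta_e = predict_beta_e(rho, d)       # Eq. (2): effective buffer size
        mu_exp = predict_cpu(beta_e, cpu_usage())   # Eq. (4): expected CPU-usage

        if mu_exp >= cpu_max:                 # expected overload: grow buffer, maybe throttle
            beta = min(beta + theta1 * mem_avail, beta_max)
            if cpu_usage() >= theta2 * cpu_max:
                disk.spill(bucket)            # data throttling to local disk
                continue
        else:                                 # headroom: push data and shrink the buffer
            db_push(bucket)
            beta = max(beta - theta1 * mem_avail, beta_min)
            if cpu_usage() <= cpu_max / theta2 and disk.has_data():
                db_push(disk.read(beta_e))    # drain previously throttled data
        time.sleep(1)                         # next control period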
Graph Insertion Algorithm: The GRAPHPUSH method used to transmit the graph from the ingestion machine to the graph database is explained in Algorithm 3. The method converts the data from the edge table, node list and node properties to construct the insertion instructions using the CREATE and MERGE statements of Cypher, the query language of our target database. Algorithm 3 creates node and edge ingestion statements in Cypher (lines 6 to 11 of Algorithm 3) by extracting the start node and end node from the edges of the edge table. It uses an indexed list (line 4) to ensure that nodes are created only once in the target database. For each commit transaction, it also checks the integrity constraint that the nodes referred to in the edge table also exist in the node list. Since a commit to the DBMS may fail for many practical reasons like network failure, the method stores data in local memory until a timeout. It uses third-party data connectors to create a fault-tolerant connection to the DBMS and to maintain a suitable data pool size at the DBMS.
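As an illustration of the second part of the ingestor, the sketch below turns rows of the edge table sketched earlier into Cypher MERGE statements and commits them over Bolt with the official neo4j Python driver. The node label, key property, URI and credentials are assumptions, not the paper's configuration.

# Illustrative sketch: committing the deduplicated edge table to Neo4j.
from neo4j import GraphDatabase

def graph_push(edge_table, uri="bolt://localhost:7687", auth=("neo4j", "password")):
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        for (start, end, edge_type), props in edge_table.edges.items():
            # MERGE ensures shared nodes (users, hashtags) are created only once;
            # the edge carries its 'count' property from the compression step.
            rel = edge_type.upper().replace("-", "_")
            session.run(
                "MERGE (a:Node {key: $start}) "
                "MERGE (b:Node {key: $end}) "
                "MERGE (a)-[r:" + rel + "]->(b) "
                "SET r.count = $count",
                start=start, end=end, count=props.get("count", 1),
            )
    driver.close()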
Graph Compression:
In the previous process, at the end of the edge list traversal, the algorithm creates a set of unique node insertion instructions and the corresponding edge instructions as well. Our process guarantees the removal of duplicate nodes at this stage, while it compresses the number of edges during the edge table creation process. Hence, our algorithm ensures compressed and minimal ingestion for each of the buffers. In Section IV we show how the compression reduces the ingestion load and demonstrate the interaction between the compression and the buffer size.

Algorithm 2 Buffer Control Algorithm
Input: CPU performance data (µ), EdgeTable (et), cpu[max, min], nodeIndex (n), maximum and minimum buffer (β_max, β_min)
procedure BUFFERCONTROL(µ, et, cpu[min, max], n)
    β, µ_exp, s ← PERFMON(et, µ, n)
    if cpu[max] ≤ µ_exp then
        echo "CPU High Alert"
        sleep(n)                              ▷ sleep for n time periods; n depends on configuration
        if (β + (θ₁ ∗ β)) ≤ β_max then
            β ← β + θ₁ ∗ β                    ▷ increase the buffer size to delay the ingestion
        β_exp, µ_exp, s ← PERFMON(et, µ, n)
        if θ₂ ∗ cpu[max] ≤ µ_exp and s ≥ 0 then
            FlushDataToDisk()                 ▷ throttle the data to disk; θ₂ is a system-specific configuration constant
    if β ≥ β_max then
        if µ_exp ≤ cpu[max] then
            GRAPHPUSH(e)
    if (β − (θ₁ ∗ β)) ≥ β_min then
        β ← β − θ₁ ∗ β                        ▷ decrease the buffer size to reduce ingestion latency
    if θ₂ ∗ cpu[min] ≥ µ_exp then
        LoadDataFromDisk(expBuf.size())       ▷ read back the data that was throttled to disk
procedure PERFMON(edgeTable, CPU_user)
    δ = edgeTable.getDiversityRatio()
    ν = edgeTable.getNodeDensity()
    e = edgeTable.size() + nodeIndex.size()   ▷ the size of the effective buffer e
    β_exp = M ∗ ν + N/δ                       ▷ learned model for the effective buffer size (degree and coefficients)
    CPU_exp ← A ∗ β_exp + B ∗ avg(µ_user)     ▷ learned model for the expected CPU-usage
    s ← getCPUSlope()                         ▷ simple regression to get the slope of the data rate
    return e, CPU_exp, s

IV. EXPERIMENTS
Environment and Deployment.
The experimental testbed for our work is a cloud computing environment (SDSC OpenStack cloud) that runs CentOS 7. The deployment diagram in Figure 10 shows an ingestion server node, a database node with Neo4J 3.6 and a performance monitoring server (Zabbix 4.2 agent with JSON-RPC API support). Each node contains 2 VPUs and 32 GB memory and is connected using high-performance switches. The underlying processors are Intel Westmere (Nehalem-C) with 16384 KB cache and around 2.2 GHz clock speed.
Data Set. The data set for our experiments comes from the "Political Data Analysis" project at the San Diego Supercomputer Center. The project collects tweets continuously. In our experiments, we used two forms of data ingestion – (a) directly from the Twitter stream at its natural rate, and (b) streaming data from tweets stored in files, where we programmatically control the streaming rate to test the limits of our solution.

TABLE I: Experimental results of the coefficient and error estimations
(a) µ_exp = A·µ_{n−1} + B·log(β_exp)
    Max CPU   MAE    MSE     RMSE
    40        2.97   17.57   4.19
    55        3.29   39.66   6.29
    55        1.35   15.98   3.99

(b) µ_exp = A·µ_{n−1} + B·β_exp
    Max CPU   MAE    MSE     RMSE
    40        2.79   17.27   4.19
    50        3.09   37.19   3.09
    55        2.26   33.44   5.78

(c) µ_exp = A·µ_{n−1} + B·β_exp
    Max CPU   MAE    MSE     RMSE
    35        2.89   17.56   4.19
    50        3.29   37.19   6.09
    55        2.86   33.44   5.78

(d) µ_exp = A·log(µ_{n−1}) + B·log(β_exp)
    Max CPU   MAE    MSE     RMSE
    40        2.92   17.57   4.19
    50        3.29   37.19   6.09
    55        2.26   33.44   5.78

(e) µ_exp = A·µ_{n−1} + B·log(β_exp)
    Max CPU   MAE    MSE     RMSE
    40        2.79   17.27   4.19
    55        3.09   37.19   6.09
    50        2.26   33.44   5.78

(f) µ_exp = A·µ_{n−1} + B·log(β_exp)
    Max CPU   MAE    MSE     RMSE
    40        2.89   17.56   4.19
    50        3.29   37.19   6.09
    55        2.86   33.44   5.78

(g) µ_exp = A·µ_{n−1} + B·log(β_exp)
    Param       40      –       55
    A           .009    .008    0.09
    B           .001    .0024   .003
    Intercept   0.541   5.29    1.96
Algorithm 3 Graph Commit Algorithm
Input: edgeTable, XML configuration file
procedure GRAPHPUSH(edgeTable, map.xml)
    if poolSize ≤ maxPoolSize then
        db ← connect(map.xml)                ▷ create graph connection
        nodeList[] ← ∅                       ▷ hash list for nodes
        for i = 0; i < edgeIdList.size(); i++ do
            if not nodeList.contains(edgeList.getStartNode()) then
                nodeList.add(edgeList.getStartNode())
                node[] ← createNode(edgeList.getStartNode())
            if not nodeList.contains(edgeList.getEndNode()) then
                nodeList.add(edgeList.getEndNode())
                node[] ← createNode(edgeList.getEndNode())
            edge[] ← createEdge(edgeList.getEdge())
        nodeList[] ← ∅
        if commit = true then
            db.execute(node)
            db.execute(edge)
        if commit = false then
            archive.store(node, edge)

In both cases, the period of observation and control was 8 hours. The average velocity of tweets from the direct stream was 4.9 tweets per second, and the maximum rate was 23.78 tweets per second. In the simulation, we multiplied the velocity up to 5 times, with 5%–20% duplicate tweets.

Implementation Architecture.
The stream processing architecture in our experimental setup (Figure 8) is designed as a producer-consumer model. Under the control of the ingestion controller module, the graph ingestor accepts data from the buffer and distributes it over multiple threads to construct the edge table concurrently. The results from the partial edge tables are collected in the Cypher statement buffer, which performs the database commit operation. The ingestion controller governs this process by using the performance monitoring services.

Fig. 10: Deployment diagram of our testbed
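A minimal sketch of this producer-consumer layout, reusing the illustrative helpers sketched earlier (EdgeTable, transform_tweet, graph_push). The thread count and queue names are arbitrary choices, not the authors' configuration.

# Illustrative sketch: worker threads build partial edge tables concurrently;
# a single committer drains the Cypher statement buffer and commits to the DBMS.
import queue, threading

tweet_buffer = queue.Queue()       # filtered tweets from the ingestion buffer
cypher_buffer = queue.Queue()      # partial edge tables awaiting commit

def transformer_worker():
    while True:
        mini_batch = tweet_buffer.get()
        if mini_batch is None:          # shutdown signal
            break
        table = EdgeTable()             # per-thread partial edge table
        for tweet in mini_batch:
            nodes, edges = transform_tweet(tweet)
            for (etype, start, end, props) in edges:
                table.insert_edge(start, end, etype, props)
        cypher_buffer.put(table)

def committer_worker():
    while True:
        table = cypher_buffer.get()
        if table is None:
            break
        graph_push(table)               # Algorithm 3 style commit over Bolt

workers = [threading.Thread(target=transformer_worker) for _ in range(4)]
committer = threading.Thread(target=committer_worker)
for t in workers + [committer]:
    t.start()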
A. Prediction Models
We have tested two prediction models: for β_e, the effective buffer size as a function of the graph density d and the bucket diversity ratio ρ (Eq. 2), as well as for µ_exp, the expected CPU-usage as a function of β_e (Eq. 4). The models were tested with Python scikit-learn (https://scikit-learn.org).

Expected Buffer Estimation (β_e): We determined that in the equation β_e[i] = K[i]·φ_1(ρ[i]) + R[i]·φ_2(d[i]), φ_1 is linear while φ_2 fits best with a quadratic function. The linear parameters K[i] and R[i] were estimated as 0.597 and 1.48, with standard errors 0.024 and 0.021 respectively.

Expected CPU-Usage Estimation (µ_exp): Choosing an appropriate model for CPU-usage was a little more challenging. Table I shows the models we tested and their errors. Our experiments show that a model of the form µ_exp[n] = A·µ_exp[n−1] + B·log(β) + c (Table I(g)) is the closest fit, while a linear model is a close second. Figure 11 shows the observed values (X-axis) vs. the predicted values (Y-axis) of the expected CPU-usage for 4 different settings of cpu_max.

Fig. 11: CPU user predictions

It is seen that a low choice of cpu_max produces unclear results, but the prediction closely matches the observation for larger cpu_max. For the largest cpu_max settings, we observe that while the prediction is still good, the model demonstrates that the CPU-usage takes a quantum jump, explaining the gap in the plot in the CPU-usage range [0.21, 0.35].
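For illustration, the following sketch shows how error figures of the kind reported in Table I (MAE, MSE, RMSE) can be produced for the autoregressive CPU-usage model. The data here is synthetic, so the numbers it prints are not the paper's results.

# Illustrative sketch: fit mu_exp[n] = A*mu[n-1] + B*log(beta_e[n]) + c on a
# synthetic trace and report the Table I error metrics.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(1)
beta_e = rng.uniform(1e3, 5e4, size=300)            # effective buffer sizes
mu = np.empty(300)
mu[0] = 40.0                                        # CPU-usage trace (percent)
for n in range(1, 300):                             # synthetic ground truth
    mu[n] = 0.8 * mu[n - 1] + 2.0 * np.log(beta_e[n]) + rng.normal(0, 2)

X = np.column_stack([mu[:-1], np.log(beta_e[1:])])  # [mu[n-1], log(beta_e[n])]
y = mu[1:]
fit = LinearRegression().fit(X, y)
pred = fit.predict(X)

mae = mean_absolute_error(y, pred)
mse = mean_squared_error(y, pred)
rmse = float(np.sqrt(mse))
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")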
B. Effect of Graph Compression
The experimental summary of the graph compression is shown in Figure 13: the X-axis represents the compression ratio (the effective count of insert instructions over the number of original tweets), and the Y-axis represents the effective buffer size at that time. We observe that in most cases the effective compression rate (mean compression rate = 24.97%) varies between 15% and 35%. With increased buffer size, the impact of the compression is not as pronounced. We have observed that during a Twitter storm (e.g., one in January 2018 centered on a single hashtag), when the graph density is high, the algorithm gives a better compression ratio.
C. Performance Output of the Algorithm
The performance improvement of the final buffer is shown in Figure 12. One advantage of using the political tweet data set is that the minimum data rate for tweets is high, and during peak hours there is a 4.5-fold increase in the data rate. In contrast with the uncontrolled CPU-usage, the experiment (Figure 12) shows that after our technique is applied, the CPU user time never reaches a spiking condition. As expected, each time the CPU touches the maximum allowable limit, the algorithm reduces it by not pushing any data during that period. Further, the IPS and context switching of the CPU were in a low and stable condition, while the memory usage for the entire observation period is generally low. We monitored the system performance on the ingestor machine as well (Figure 14), and observed that CPU and memory utilization is well within control.
V. CONCLUSION
In this paper, we addressed a bandwidth gap problem encountered in ingesting and storing social media data. We took advantage of the temporal clustering property of social media data (i.e., the fact that many similar nodes and edges are created during bursty periods) to compress the graph to save ingestion time, and dynamically adjusted the buffer to control the CPU load. Our work sits in the middle of graph analytics research underlying many data science applications [1], [11], [14], which use small data sets, and graph database research that promotes in-database graph analytics [12], which does not consider streaming input. We view the graph stream ingestion problem discussed in this paper as a component of optimized ingestion control in the AWESOME polystore system [5], [6], where multiple streams of heterogeneous data can flow into a component DBMS managed under the polystore. We expect that the general idea of using buffer control, data compression and resource monitoring for a DBMS can be effectively applied there. In future work, we expect to extend this work to cover a larger variety of data models and data stores.

Fig. 12: The CPU utilization is controlled as cpu_max is set at 35% and 55% respectively.
Fig. 13: Effect of Graph Compression
Fig. 14: The CPU-usage, avg. CPU load, and memory use on the Ingestor Node stay within reasonable bounds for 8 hours.

Secondly, notice that in Algorithm 2 we form a data structure that contains generic graph properties like the degree distribution for the time-slice of data available in the buffer. Metrics like these are the building blocks of more complex analytical measures developed by graph-centric research communities. In future work, we will materialize more of these temporally evolving properties and use them for the evolutionary analysis of the social media graph, community detection [1], [11], [14], and other graph analytics operations [12], which will benefit from our continuous computation of these "building block" measures.

Finally, our work primarily solves an infrastructure problem that generalizes beyond just social media data. As part of our future work, we will apply, and if needed extend, our system for other forms of structured and semi-structured streaming data (e.g., newswire data, lifelog data [9]).
REFERENCES
[1] N. Ayman, T. F. Gharib, M. Hamdy, and Y. Afify. Influence ranking model for social networks users. In International Conference on Advanced Machine Learning Technologies and Applications, pages 928–937. Springer, 2019.
[2] B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation queries over data streams. In Proceedings of the 20th International Conference on Data Engineering, pages 350–361. IEEE, 2004.
[3] A. Bifet, J. Zhang, W. Fan, C. He, J. Zhang, J. Qian, G. Holmes, and B. Pfahringer. Extremely fast decision tree mining for evolving data streams. In Proc. of the 23rd ACM Int. Conf. on Knowledge Discovery and Data Mining (KDD), pages 1733–1742. ACM, 2017.
[4] R. Calvillo, C. Denton, J. A. Breckman, and J. Palmer. Management system for high volume data analytics and data ingestion, Jan. 4, 2018. US Patent App. 15/704,891.
[5] S. Dasgupta, K. Coakley, and A. Gupta. Analytics-driven data ingestion and derivation in the AWESOME polystore. In Proc. of the IEEE Int. Conf. on Big Data, pages 2555–2564. IEEE, Dec. 2016.
[6] S. Dasgupta, C. McKay, and A. Gupta. Generating polystore ingestion plans – a demonstration with the AWESOME system. In Proc. of the IEEE Int. Conf. on Big Data, pages 3177–3179, Dec. 2017.
[7] R. Grover and M. J. Carey. Data ingestion in AsterixDB. In EDBT, pages 605–616, 2015.
[8] C.-Y. Gui, L. Zheng, B. He, C. Liu, X.-Y. Chen, X.-F. Liao, and H. Jin. A survey on graph processing accelerators: Challenges and opportunities. Journal of Computer Science and Technology, 34(2):339–371, 2019.
[9] C. Gurrin, A. F. Smeaton, A. R. Doherty, et al. Lifelogging: Personal big data. Foundations and Trends in Information Retrieval, 8(1):1–125, 2014.
[10] M. Hirano and N. Watanabe. Traffic characteristics and a congestion control scheme for an ATM network. International Journal of Digital & Analog Communication Systems, 3(2):211–217, 1990.
[11] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo. An experimental survey on big data frameworks. Future Generation Computer Systems, 86:546–564, 2018.
[12] M. Kronmueller, D.-J. Chang, H. Hu, and A. Desoky. A graph database of Yelp dataset challenge 2018 and using Cypher for basic statistics and graph pattern exploration, pages 135–140. IEEE, 2018.
[13] J. Meehan, C. Aslantas, S. Zdonik, N. Tatbul, and J. Du. Data ingestion for the connected world. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017.
[14] F. S. Pereira, S. de Amo, and J. Gama. Evolving centralities in temporal graphs: a Twitter network analysis, volume 2, pages 43–48. IEEE, 2016.
[15] D. S. Reiner, N. Nanda, and T. Bruce. Ingestion manager for analytics platform, Jan. 4, 2018. US Patent App. 15/197,072.
[16] S. Samsi, V. Gadepally, M. Hurley, M. Jones, E. Kao, S. Mohindra, P. Monticciolo, A. Reuther, S. Smith, W. Song, et al. Graphchallenge.org: Raising the bar on graph analytic performance, pages 1–7. IEEE, 2018.
[17] G. Stricker. The 2014 #YearOnTwitter. Twitter Blog, 2014.
[18] A. Vishwanath, V. Sivaraman, and M. Thottan. Perspectives on router buffer sizing: Recent results and open problems. ACM SIGCOMM Computer Communication Review, 39(2):34–39, 2009.
[19] X. Wang and M. J. Carey. An IDEA: An ingestion framework for data enrichment in AsterixDB.