A Graph-Based Platform for Customer Behavior Analysis using Applications' Clickstream Data
Mojgan Mohajer, BMW AG
February 25, 2020
Technical Report
Abstract
Clickstream analysis has been receiving more attention as the usage of e-commerce and applications increases. Besides the analysis of customers' purchase behavior, there are also attempts to analyze customer behavior in relation to the quality of web or application design. In general, clickstream data can be considered as a sequence of log events collected at different levels of web/app usage. The analysis of clickstream data can be performed directly as sequence analysis or by extracting features from sequences. In this work, we show how representing and saving the sequences with their underlying graph structures can induce a platform for customer behavior analysis. Our main idea is that clickstream data containing sequences of actions of an application are walks on the corresponding finite state automaton (FSA) of that application. Our hypothesis is that the customers of an application normally do not use all possible walks through that FSA, and that the number of actual walks is much smaller than the total number of possible walks through the FSA. The sequence of such a walk normally consists of a finite number of cycles on the FSA graph. Identifying and matching these cycles in classical sequence analysis is not straightforward. We show that representing the sequences through their underlying graph structures not only groups the sequences automatically but also provides a compressed data representation of the original sequences.
Keywords: sequence analysis, customer behavior analysis, clickstream, graph-based, cycle, simple path, graph database
The collection and analysis of clickstream data have grown with the increasing usage of e-commerce and applications. The focus is mainly on purchasing and on the prediction of customers' purchase behavior [1, 2, 3, 4, 5, 6, 7]. However, there have also been attempts to analyze customer behavior with respect to the usability of web or application design [8, 9, 10, 11]. In general, clickstream data can be considered as a sequence of log events collected at different levels of web/app usage, such as the client or server level. It can also be considered at different levels of granularity, for example, page visits or action clicks on each visited page. The analysis of clickstream data has developed in several directions in recent years. Some works focus on feature extraction from clickstream sequences for further analysis [8, 1, 3, 4]. Other works use clickstream sequences as input data and apply sequence analysis methods directly to the sequences [2, 5, 12, 9, 10, 13, 6, 7]. Path analysis is also related to this topic: a specific path is defined and customer behavior is analyzed along this path to identify possible problems and bottlenecks [11, 14]. For analyzing customer behavior with a view to the quality of application design, extracting features or doing path analysis can be meaningful. However, such methods lack generality: in both, the focus of the analysis has to be defined explicitly from the beginning. Generalized methods, on the other hand, analyze the whole clickstream sequences and try to find patterns to understand customer behavior, and can thereby provide insights into design problems which are not clear from the beginning. Works which consider the clickstream data as sequences either perform classical sequence mining to find out, e.g., which pages are normally visited together [15, 16], or cluster the users by defining similarity measures between the sequences [2, 12, 9, 10, 7].
These methods normally suffer from scalability and performance problems.

Using graphs for sequence analysis in this domain is mostly limited to similarity-graphs and the corresponding algorithms on similarity-graphs [13, 7]. In a similarity-graph, each node represents a sequence and the edges represent the distance between nodes. In this sense, the graph is not used to take the structure of the sequence into account but is a tool for performing specific clustering algorithms optimized for similarity-graphs. Yet, representing the sequence in a graph structure has some benefits. In biology, for example, graphs are used to represent parts of sequences, because the graph structure is invariant under some transformations such as mirroring [17, 18, 19, 20]. In the domain of clickstream analysis, Baumann et al. [6] use a graph structure for modeling the clickstream sequences. They generate a graph for each sequence of click data and calculate graph metrics for that generated graph. Finally, they use the calculated graph metrics as input for further methods such as gradient boosted trees. They examine the importance of different metrics experimentally and show, for example, that the number of cycles and self-loops are significant metrics in the prediction of customer purchasing behavior. However, they lose track of how many times a cycle was traversed.

A graph structure can capture the cycles in a sequence and therefore provides a more compact representation of a sequence by counting the number of times a cycle was visited.
This is especially the case if the sequences are walks through a state automaton, which holds for most clickstream data from applications. Furthermore, if a sequence can be described by a regular expression, then it can be reduced to a walk on a finite state automaton as well [21].

In this work, we show how representing and saving the sequences with their underlying graph components can induce a platform for customer behavior analysis. The main idea behind our work is that clickstream data produced from actions in an application are walks on the corresponding finite state automaton (FSA) of that application. Each action corresponds to a state of the FSA. Our hypothesis is that the customers of an application normally do not use all the possible walks through that FSA, and that the number of real walks is much smaller than the number of possible walks through the FSA. The second point is that the walks (sequences) consist of a finite number of cycles on the FSA graph. Thus, these walks can be reduced to their common graph components. These components capture information which is not completely covered by commonly used methods such as k-gram pattern analysis in classical sequence analysis. In the k-gram method, normally the number of common sub-sequences of length k of two sequences is calculated. The correct choice of k is crucial. The repeated traversal of a cycle is captured to some extent by counting the k-grams, but the meaning of the cycle itself is totally lost.

We show that representing the sequences through their underlying graph structure not only groups the sequences automatically, but is also a compressed data representation of the original sequences. This compressed representation allows queries directly on the data saved in a graph database. In the original sequence format, each of these queries requires a specific algorithm. In the following sections, we describe the data and the formal definitions used in this paper.
Afterward, we introduce our method for converting sequences to their graph components. At the end, we discuss possible designs in a native graph database and introduce specific queries which can be performed better with this concept.

The data used for our analysis are log events of clickstreams from applications installed on board a vehicle, such as navigation, radio, etc. The data consist of sessions, which can roughly be considered as drives (it is still possible to have several sessions during a drive, depending on how a new session id is triggered). During a drive, the actions of the user on the HMI (Human-Machine Interface) are collected by an event-log mechanism. These events correspond to the states of an FSA. We consider the consecutive occurrence of the states of a specific application during a session as the clickstream sequence for that application and user. The states in a sequence also have a timestamp, which defines the order of the states in the sequence. A vehicle (user) can have several drives (sessions) over time. For each session, there exists a sequence for every running application. Each sequence will be represented by its corresponding graph components as described in the next section.
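The grouping of log events into sequences described above can be sketched as follows. This is a minimal illustration and not the extraction pipeline used for the prototype: the event fields `session`, `app`, `timestamp` and `state` are hypothetical names, and the events are assumed to fit in memory.

```python
from collections import defaultdict


def extract_sequences(events):
    """Group log events into one state sequence per (session, application).

    `events` is an iterable of dicts with the hypothetical keys
    'session', 'app', 'timestamp' and 'state'.
    """
    grouped = defaultdict(list)
    for event in events:
        key = (event["session"], event["app"])
        grouped[key].append((event["timestamp"], event["state"]))

    sequences = {}
    for key, stamped_states in grouped.items():
        # The timestamp defines the order of the states within a sequence.
        stamped_states.sort(key=lambda pair: pair[0])
        sequences[key] = [state for _, state in stamped_states]
    return sequences


events = [
    {"session": "d1", "app": "Navigation", "timestamp": 3, "state": "Route"},
    {"session": "d1", "app": "Navigation", "timestamp": 1, "state": "Home"},
    {"session": "d1", "app": "Radio", "timestamp": 2, "state": "Stations"},
    {"session": "d1", "app": "Navigation", "timestamp": 2, "state": "Search"},
]
sequences = extract_sequences(events)
```

Here one session yields two sequences, one per running application, each ordered by timestamp.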
We define a graph $G\langle V, E \rangle$ with a set of vertices $V$ and edges $E$ as a directed graph, such that an edge $e \in E$ is a tuple $(u, v)$ with starting vertex $u$ and ending vertex $v$.

• A (directed) walk in such a directed graph $G$ is a finite sequence of edges $(e_1, e_2, \ldots, e_{n-1})$ which joins a sequence of vertices $(v_1, v_2, \ldots, v_n)$ such that $e_i = (v_i, v_{i+1})$ for $i = 1, 2, \ldots, (n-1)$, and $(v_1, v_2, \ldots, v_n)$ is the vertex sequence of the directed walk.
• A (directed) trail is a (directed) walk in which all edges are distinct.
• A (directed) simple path is a (directed) trail in which all vertices are distinct.
• A (directed) cycle is a non-empty directed trail in which the only repeated vertices are the first and last vertices [22].

Figure 1: A small directed graph $G$.

It follows from these definitions that a directed walk can consist of several simple paths and cycles.

In our clickstream data from on-board applications, each sequence is the succession of state transitions in the underlying FSA of that application. Thus, each state transition can be considered as an edge in an FSA graph, and the whole sequence can be considered as a directed walk on that graph. Finally, these directed walks can be reduced to their components, namely simple paths and cycles. We introduce a method to split the sequences into their components.

As discussed in section 2, a sequence $q$ is the consecutive occurrence of the states of the FSA graph of a specific application during a drive. We assume that for each two consecutive states $u$ and $v$, with $v$ appearing after $u$ in sequence $q$, there exists a directed edge $e = (u, v)$ between those states in the FSA graph. With Algorithm 1, we split a sequence into simple paths and cycles. In general, other splits are also possible. The example below shows the result of our algorithm and an alternative split for a walk on the example graph depicted in Figure 1.

The sequence $q = (S_1, S_2, S_3, S_4, S_2, S_3, S_4, S_2, S_3, S_4, S_2, S_5)$ is a directed walk on the graph $G$ in Figure 1. Algorithm 1 reduces this sequence to the following components:

• a simple path: $(S_1, S_2)$
• three cycles: $(S_2, S_3, S_4, S_2)$
• a simple path: $(S_2, S_5)$

An alternative split of the same sequence can be:

• a simple path: $(S_1, S_2, S_3, S_4)$
• two cycles: $(S_4, S_2, S_3, S_4)$
• a simple path: $(S_4, S_2, S_5)$

Algorithm 1: Sequence to Graph Components
    Data: a sequence q
    list of visited vertices vs = [ ];
    list of paths ps = [ ];
    list of cycles cs = [ ];
    for each state u in q do
        if u in vs and u is the first element of vs then
            add u to the end of vs;
            add vs to cs;
            set vs = [u];
        else if u in vs and u is equal to the i-th element of vs then
            add u to the end of vs;
            add (v_1, ..., v_i) in vs as a simple path to ps;
            add (v_i, ..., u) in vs as a cycle to cs;
            set vs = [u];
        else
            add u to vs;
        end
    end
    if vs is not empty then
        add vs as a simple path to ps
    end
    Result: return the lists ps and cs as graph components of sequence q

If we put the simple paths from the first split together, we obtain a simple path from vertex $S_1$ to vertex $S_5$, namely $(S_1, S_2, S_5)$. Doing the same for the alternative split, we obtain $(S_1, S_2, S_3, S_4, S_2, S_5)$, which is no longer a simple path and includes a cycle. We can show that Algorithm 1 always produces the maximal number of cycles. For this, we start with some theorems.

Theorem 1. A simple path $p_1$ is a part of another simple path $p_2$ if and only if the edges of $p_1$ appear in the same order in $p_2$.

Proof. By definition, every edge in a simple path starts with the end vertex of its previous edge. Thus, for two consecutive edges $e_1$ and $e_2$ in $p_1$, we have $e_1 = (u, v)$ and $e_2 = (v, w)$. If the path $p_2$ has other edges between $e_1$ and $e_2$, then it contains a cycle and is no longer a simple path.

Theorem 2. A cycle $c_1$ with $m$ edges cannot be part of another cycle $c_2$ of length $n$ where $n > m$.

Proof. Every vertex on a directed cycle has by definition exactly one incoming edge and one outgoing edge. If a cycle is part of a longer cycle, then there are edges and vertices on the longer one which are not part of the shorter cycle. A cycle is a connected component. Therefore, these extra edges in the longer cycle must also be connected to the edges and vertices of the shorter cycle.
This contradicts the condition of exactly one incoming and one outgoing edge at each vertex.

From Theorem 2 it follows that a cycle is uniquely defined by the set of its edges.
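Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the implementation used for the prototype. It assumes that the simple path in front of a cycle ends at the vertex where the cycle starts; the function name `split_walk` and the state labels are hypothetical and chosen only for illustration.

```python
def split_walk(q):
    """Split a walk (a list of states) into simple paths and cycles."""
    vs = []  # vertices visited since the last cycle was closed
    ps = []  # simple paths, each stored as a tuple of vertices
    cs = []  # cycles, each stored as a tuple with equal first and last vertex
    for u in q:
        if u in vs:
            i = vs.index(u)
            if i > 0:
                # The vertices in front of the cycle form a simple path
                # that ends at u (assumption stated in the lead-in).
                ps.append(tuple(vs[: i + 1]))
            # The remaining vertices, closed by u, form a cycle.
            cs.append(tuple(vs[i:] + [u]))
            vs = [u]
        else:
            vs.append(u)
    if vs:
        # As in the pseudocode, whatever is left becomes the last path.
        ps.append(tuple(vs))
    return ps, cs


walk = ["S1", "S2", "S3", "S4", "S2", "S3", "S4", "S2", "S3", "S4", "S2", "S5"]
paths, cycles = split_walk(walk)
```

For this walk the sketch returns two simple paths and three traversals of the same cycle. Note that, following the pseudocode literally, a walk that ends exactly on a cycle leaves a trailing single-vertex path.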
Theorem 3. Considering only the simple paths found by Algorithm 1, these paths build a connected path from the start to the end of the original walk (sequence).

Proof. Consider a vertex $v$ on the given walk $q$ which is not the last vertex. Exactly one of three cases applies:

1. $v$ is the starting point of a cycle,
2. $v$ is the starting point of the next simple path,
3. $v$ is in the middle of a path or cycle.

In case 1, at the end of the cycle we arrive again at vertex $v$, so that $v$ is again either case 1 or case 2. Considering only paths, this means that a vertex at the end of a simple path which is not the last vertex of the walk $q$ is the starting point of the next simple path. Hence, putting all simple paths together results in a connected path from start to end.

Theorem 4.
Algorithm 1 identifies the maximum number of cycles on a walk $q$.

Proof. Assume there exists a splitting of walk $q$ with $n$ cycles, more than the result of Algorithm 1. Without loss of generality, we consider the $n$ cycles with $n$ separate starting nodes in the same order in which they appear on the walk $q$, so that we have the cycles $(v_1, \ldots, v_1)$, $(v_2, \ldots, v_2)$, ..., $(v_n, \ldots, v_n)$. Suppose there is a simple path before the first cycle $(v_1, \ldots, v_1)$. Algorithm 1 has to find the cycle $(v_1, \ldots, v_1)$, or it finds another cycle earlier. If it finds another cycle $(u, \ldots, u)$ earlier, $(u, \ldots, u)$ cannot occur before $(v_1, \ldots, v_1)$, otherwise we have $n + 1$ cycles in the walk $q$. From Theorem 2, it cannot be part of $(v_1, \ldots, v_1)$. This means that either it has its starting point on the path and ends on the cycle, $u, \ldots (v_1, \ldots, u, \ldots, v_1)$, or it is completely inside the cycle, $(v_1, \ldots, u, \ldots, u, \ldots, v_1)$. In both cases, the algorithm will find the cycle $(u, \ldots, u)$ instead of $(v_1, \ldots, v_1)$. That means the cycle $(u, \ldots, u)$ appears before the other $n -$ [...]

[...] walks $w_1$ and $w_2$ on the graph $G$ depicted in Figure 2a. Both walks start at the same vertex and enter the cycle at two different vertices. Figure 2b illustrates the two walks, with the edges of each walk marked in a different color. In Figure 2c, we see the splitting of the walks as a result of Algorithm 1. Both walks are split into three components: two simple paths and a cycle. They share the cycle but have different paths. The ending path of one walk is a part of the ending path of the other. Figures 2d and 2e show the components of walks $w_1$ and $w_2$ from Algorithm 1. On the other hand, we can consider the graph of each walk and split it into the cycle and non-cycle parts of the graph, as depicted in Figures 2f and 2g. In this case, the two walks share two similar components and only differ by their starting paths. It is just a matter of perspective what we are interested in and what we see as more similar. In the splitting in Figures 2f and 2g, some information from the original walks is lost. Both walks run through exactly the same vertices, but their lengths differ. This means that some vertices are visited more often in one walk than in the other, which is visible in Figure 2b, as part of the cycle is walked a second time. This information is encoded in the split in Figure 2c, so that the last paths differ. In this paper we consider that walks $w_1$ and $w_2$ share only one cycle, which they enter at different vertices.

Figure 2: Graph components versus walk components from Algorithm 1. (a) An example of a graph $G$. (b) Two possible walks on $G$ (green and magenta). (c) The two walks depicted in (b) and their components. (d) Components for walk $w_1$ from Algorithm 1. (e) Components for walk $w_2$ from Algorithm 1. (f) Graph components of walk $w_1$. (g) Graph components of walk $w_2$.

We use Algorithm 1 to split sequences into their components. In the next section, we show how we use a native graph database to save the sequences by their components and perform sequence analysis by querying the graph database.

Modeling sequences in graph database
In native graph databases, graph models play a central role. The graph model represents how we look at the data and which questions and queries we are interested in. In other words, our problem domain is reflected in our graph model [23]. In this section, we discuss several modeling scenarios for our sequence components and compare them with each other. We start with the problem description and discuss other aspects such as performance and flexibility in the following sections.
In our clickstream data, we are interested in the ways customers use our application in the car. The key analytical questions concern, e.g., common problems such as long common paths or cycles needed to fulfill a task, or the clustering of customers according to their usage pattern of the application.

In the literature, there are methods and topics which address similar or related problems, such as funnel analysis. In contrast to our method, funnel analysis first selects a specific path and then analyzes the behavior of the customers on that path [14].

The majority of the works on customer behavior and clickstream analysis focus either on sequence analysis by sequence vectorization [5, 12, 9, 10] or on the comparison of key performance indicators [3, 4]. In most of these studies the patterns inside a sequence are encoded and compared with each other, but the meaning and shape of a single sub-pattern in the sequence is lost or not considered. In the method introduced in section 3, we split sequences into their underlying components, i.e. simple paths and cycles. These components can be compared, classified and grouped in the process of sequence analysis. For this, we save the sequences in the form of their components in a graph database. Our models have a hierarchical structure, and at each level of the hierarchy we can save the corresponding metadata for that level. Figure 3 shows a modeled hierarchy for the clickstream data from a vehicle's onboard applications. At the top of the hierarchy we have customers. Each customer performs several drives. During each drive, different applications are used. For each application usage we have the corresponding clickstream sequence, which can be split into its underlying simple paths and cycles by Algorithm 1. In the example from Figure 3, application 1 in drives 1 and 2 shares components 2 and 3. The same application in drive 3 has totally different components than in drives 1 and 2.

At each level of the hierarchy we can have different attributes.
A customer can have gender, age and address. A drive has a start and an end time, and start and destination positions. An application sequence has the attribute of application category, e.g. Navigation, Entertainment, etc. Each application can have its own specific attributes. Navigation, for example, can have a binary status indicating whether a routing calculation was started during the drive.

Each sequence itself consists of a finite number of components, i.e. simple paths and cycles from the walk on the FSA of the corresponding application. In the next section we describe several scenarios for modeling these components in a graph database.

Figure 3: A modeled hierarchy for clickstream data.
As discussed in section 3, a simple path $p$ can be uniquely defined by the ordered list of its vertices, where the order defines the visiting time of each vertex. A simple path has a clear start and end point. From Theorem 2 it follows that a cycle can be uniquely defined by the set of its edges, while the order of the edges in the set is not important. A cycle does not have a defined start and end point. It can be entered at any vertex. We base our models on these mathematical definitions.

Suppose we have a set of sequences $Q$ to analyze, with $|Q|$ as the number of sequences in $Q$. Applying Algorithm 1 to the set $Q$ results in $|C|$ distinct cycles and $|P|$ distinct simple paths. Let $c_e$ and $c_v$ be the number of all edges and vertices of the cycles in $C$ (with duplicates). Analogously, let $p_e$ and $p_v$ be the number of all edges and vertices of the paths in $P$, respectively (with duplicates). The sequences in $Q$ are walks on an FSA graph. The FSA has $|V|$ vertices and $|E|$ edges. In the following we use these numbers to compare different modeling variants.

In the first variant we model directly on the FSA graph. We model the distinct states of an FSA as vertices in the graph database. Each path and cycle is marked with a new, different edge between the vertices of the FSA. Several edges can exist between two vertices. A simple example with three components in Figure 4, one cycle and two paths, demonstrates our model in variant 1. As we can see, between two of the vertices we have three edges, with a different type for each component. In this model we have to create $|V|$ vertices and $p_e + c_e$ edges in the graph database.

Figure 4: Cycles and paths in model variant 1.

In the second variant, we save each component as separate vertices and edges, as depicted in Figure 5. As a result, we have to create $p_v + c_v$ vertices and $p_e + c_e$ edges in the graph database.

Figure 5: Cycles and paths in model variant 2.
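The vertex and edge counts of the first two variants can be made concrete with a small sketch. It is only an illustration of the formulas above: the function names and the example components are hypothetical, and the components are assumed to be already distinct, with a cycle stored as a tuple whose first and last vertex coincide.

```python
def component_counts(paths, cycles):
    """Return (p_v, p_e, c_v, c_e) summed over the given distinct components."""
    p_v = sum(len(p) for p in paths)
    p_e = sum(len(p) - 1 for p in paths)
    # A cycle tuple repeats its first vertex at the end, so it has
    # as many distinct vertices as edges.
    c_e = sum(len(c) - 1 for c in cycles)
    c_v = c_e
    return p_v, p_e, c_v, c_e


def model_sizes(paths, cycles, num_states):
    """Vertices and edges to create in model variants 1 and 2."""
    p_v, p_e, c_v, c_e = component_counts(paths, cycles)
    return {
        "variant 1": {"vertices": num_states, "edges": p_e + c_e},
        "variant 2": {"vertices": p_v + c_v, "edges": p_e + c_e},
    }


sizes = model_sizes(
    paths=[("S1", "S2"), ("S2", "S5")],
    cycles=[("S2", "S3", "S4", "S2")],
    num_states=5,
)
```

With two paths and one cycle over five states, variant 1 needs 5 vertices and variant 2 needs 7, while both need 5 edges.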
In the third model variant, instead of the vertices of the FSA, we save the distinct edges of the FSA as nodes (transit-nodes) in the graph database. Each path and cycle is represented as a node and is related to its corresponding transit-nodes. That means components can share transit-nodes. In Figure 6, the same example from Figure 4 is modeled as variant 3. For an FSA with $|V|$ vertices as a directed graph, the maximum number of edges $|E|$ is limited to the number of 2-permutations of $|V|$ and can therefore be at most $|V| \cdot (|V| - 1)$ [...] $p_e + c_e$. We also have to add extra vertices to represent each path and cycle, which means $|C| + |P|$ extra nodes in the database.

Figure 6: Cycles and paths in model variant 3.

To prototype our models, we used the Neo4j community edition [24]. In general, the concept is independent of the graph database platform, but variations in the implementation of graph databases can still influence the performance of the different models. To clarify the vocabulary used in the discussion below, we give a short introduction to some conventions in Neo4j. In Neo4j, there are nodes and relationships, which can be considered equivalent to vertices and edges in a graph. Each node or relationship has a type which acts like a classification, which means we can create several nodes and relationships of the same type. Each node and relationship can also have several attributes of different data types, such as name (String), creation time (DateTime), visited (Boolean), etc. We use the Cypher graph database query language, widely used in the Neo4j community [23], for our expressions in this context. The Cypher expression below creates two nodes and a relationship between them. The nodes are of type State and the relationship is of type Cycle; a and r are variables. Type State has an attribute Name, which is Address Book for the first node and Search Field for the second node. The relationship of type Cycle has an attribute ID with value 101.

    CREATE (a:State {Name: 'Address Book'})-[r:Cycle {ID: 101}]->(:State {Name: 'Search Field'})
Suppose we have a set of $|C|$ distinct cycles and $|P|$ distinct paths to add to the graph database.

Injection in model variant 1: In model variant 1, we can create the $|V|$ states of the FSA graph at once. Then, for each cycle or path, it is enough to select its vertices in the database and create the edges. Due to the relatively small number of states, this task is performed in constant time. But Neo4j is not designed for multiple-edge modeling. For creating multiple relationships between two nodes, there are several possibilities:

• multiple relationships with the same type and different attributes, e.g. name. Neo4j does not support the creation of such multiple relationships between two nodes from the available drivers. It is possible to create such relationships directly from the Neo4j desktop.
• multiple relationships with different types.
• a single relationship with a list of attributes. Having paths as attribute entries in a list makes queries for searching and selecting those paths inefficient, because the graph database is optimized for queries performed directly on nodes and relationships.

In both cases with multiple relationships, whether of a single type or of multiple types, we have to check whether the new path or cycle already exists. As we know from Theorem 1, a shorter path can be part of a longer path. Thus, selecting such a path causes problems in model variant 1, because all the longer paths containing it will be selected along with it. In the end, we have to perform more complicated queries to find out which of them are the longer ones. Another disadvantage of this model is the assignment to the higher level. Because the distinction between components is modeled as edges, it is not possible to assign the parent-level nodes (drives / sequences) by a relationship. We have to put this information at the attribute level, which again forfeits the benefits of a graph structure.

Injection in model variant 2:
In model variant 2, for every distinct component we create nodes and relationships between those nodes as a new component. The path or cycle can be distinguished by, e.g., its ID or Name at the attribute level on nodes and relationships. At every injection we have to check whether that path or cycle already exists in the database. This can easily be done with the following expressions. • For a path with n nodes:
    MATCH p = (s:State {name: 'State Name 1', title: 'start'})
              -[r]->(:State {name: 'State Name i'})
              --> ... // all the nodes in between
              (e:State {name: 'State Name n', title: 'end'})
    RETURN r.name
The path is uniquely identified by its start, its end and the nodes in between. • For a cycle with n nodes:
    MATCH p = (:State {name: 'State Name 1'})
              -[r]->(:State {name: 'State Name i'})
              --> ... // all the nodes in between
              (:State {name: 'State Name 1'})
    RETURN r.name
Injection in model variant 3:
In the third variant we create the edges of the underlying FSA graph as nodes and assign them to the cycles and paths. The MATCH expressions to check whether the components already exist can have the following forms: • For a path with n nodes.
    // start_end is a list of tuples for each edge
    // with names of its start and end states.
    MATCH (c:Paths)-[d:P]->(:T)
    WHERE ALL( x IN {start_end} WHERE (c)-[:P]->(:T {start: x[0], end: x[1]}) )
    WITH c, count(d) AS pathLength
    WHERE pathLength = length({start_end})
    RETURN c.name

• For a cycle with n nodes.
    // start_end is a list of tuples for each edge
    // with names of its start and end states.
    MATCH (c:Circles)-[d:C]->(:T)
    WHERE ALL( x IN {start_end} WHERE (c)-[:C]->(:T {start: x[0], end: x[1]}) )
    RETURN c.name
In this model, the edges are shared. Hence, even if the component itself has to be created, it must be guaranteed that the nodes for its edges are added only once. For this, we use MERGE instead of CREATE. MERGE is equivalent to a MATCH followed by a CREATE if nothing was matched. Accordingly, a CREATE operation is faster than a MERGE.
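The get-or-create behavior that MERGE provides for the shared transit-nodes can be illustrated outside the database with a small in-memory sketch. The class `TransitStore` and its methods are hypothetical stand-ins and do not use any Neo4j API; they only mimic the idea that a transit-node is created once and then reused by every component that contains that edge.

```python
class TransitStore:
    """In-memory stand-in for the transit-nodes of model variant 3."""

    def __init__(self):
        self.transits = {}    # (start, end) -> transit id
        self.components = {}  # component name -> list of transit ids

    def merge_transit(self, start, end):
        """Return the id of the transit-node, creating it only if missing."""
        key = (start, end)
        if key not in self.transits:
            self.transits[key] = len(self.transits)
        return self.transits[key]

    def add_component(self, name, vertices):
        """Link a path or cycle to the transit-nodes of its edges."""
        edges = zip(vertices, vertices[1:])
        self.components[name] = [self.merge_transit(u, v) for u, v in edges]


store = TransitStore()
store.add_component("path-1", ("S1", "S2", "S3"))
store.add_component("cycle-1", ("S2", "S3", "S4", "S2"))
```

Both components contain the edge from S2 to S3, so they end up pointing to the same transit id, and only four transit-nodes exist for five edge occurrences.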
Injection of large amount of data:
If we have to deal with a large amount of data at once, it is better to avoid single inserts because of lock latencies. For this, in Neo4j we can benefit from table-wise insertion, in which the node and edge information is provided row-wise in a table. The table can be uploaded in one step and is processed internally row by row. In the case of variant 1, the problems already discussed for single injection also persist in table-wise insertion. In model variant 2, we create a row in a table for each edge of the components, with a start, an end and a component id. In row-wise insertion, first the start and the end nodes are merged. Finally, the relationship between the start and the end of the edge is created. In variant 3, we can provide the edges per component as rows of a table and merge them into the database in the same way as in single insertion.
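The table for model variant 2 can be sketched as follows: one row per edge of each component, with a start, an end and a component id. The column names, the component ids and the helper functions are hypothetical; the resulting CSV text is only meant to illustrate the row layout that a bulk import (for example Neo4j's LOAD CSV) could then process row by row.

```python
import csv
import io


def component_rows(components):
    """Yield one (start, end, component_id) row per edge of each component."""
    for component_id, vertices in components.items():
        for start, end in zip(vertices, vertices[1:]):
            yield {"start": start, "end": end, "component_id": component_id}


def to_csv(components):
    """Serialize the edge rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["start", "end", "component_id"])
    writer.writeheader()
    writer.writerows(component_rows(components))
    return buffer.getvalue()


components = {"p1": ("S1", "S2"), "c1": ("S2", "S3", "S4", "S2")}
rows = list(component_rows(components))
csv_text = to_csv(components)
```

The path contributes one row and the cycle three, so the table has four data rows below the header.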
The main purpose of transforming the data into a graph database is to open up possibilities for data queries which are either inefficient or even impossible in relational databases. We concentrate on several example queries which play an important role in the analysis of customer behavior and which, at the same time, do not perform well in traditional relational databases, particularly for large amounts of data. Of course, we can always add explicit indexing to relational databases to mimic the benefits of non-relational databases. The scope of the discussion here is the benefits of non-relational databases such as graph databases, and not the way such designs are realized.
Example 1:
Given a sequence of consecutive states such as $S_1 \to S_2 \to S_3$ as a simple path, we are interested in the customers who traversed exactly this simple path during their application usage. In a relational database, the data as described in section 2 has to be sorted per session and timestamp. Even if the data is stored sorted, we still have to search the whole table (e.g. several terabytes) for the start state of the simple path and check the consecutive states afterwards. In a graph database, such simple paths can be searched at once with all their nodes and edges. Experiments with our prototype show that the number of graph components converges. That means the search space is much smaller than the original table in the relational database. Example 2:
We are interested in finding out how many different paths the customers traverse between two states of interest $S_i$ and $S_j$. The query itself is similar to the previous example, but the variation depends strongly on whether or not we consider the loops and the multiple traversal of the loops. Identifying loops is not straightforward. However, the component-based storage of the clickstream data makes it possible to identify such structures. Example 3:
We are interested in finding out in which loops the customers are stuck most of the time. Firstly, finding the loops itself is not straightforward. Secondly, we have to be able to count loops as the same loop even if they appear with different starting points in the sequence. For example, $S_1 \to S_2 \to S_3 \to S_1$ and $S_2 \to S_3 \to S_1 \to S_2$ are the same cycle. For these kinds of analysis, we would otherwise have to extract the data and solve our questions algorithmically. The component-wise storage of the sequences allows us to perform such analysis as queries. We can also combine it with reasoning similar to funnel analysis, such as considering the loops that appear in a path of interest. Example 4:
One of the main focuses of clickstream analysis and customer behavior analytics in the literature is sequence pattern mining. These approaches usually use vectorization methods such as k-grams to map the sequences to vectors and define metrics on them [7]. Finally, they use known vector-based pattern mining methods such as clustering. In this work, we have introduced a component-wise analysis of the sequences. The main hypothesis is that the sequences consist of cycles and simple paths which they share. In other words, we can compare and group the sequences according to their common components. In our prototype, we can cluster the sequences into clusters of exactly the same components. The clustering itself can be performed by a single query and can easily be restricted for further analysis.

The first two examples are query-based analyses which show the benefits of storage in a graph database, while the third and fourth examples highlight the benefit of the component-wise analysis of the sequences.
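The ideas behind Examples 3 and 4 can be sketched outside the database as follows. This is not the query used in the prototype: the function names and the drive ids are hypothetical, a cycle is stored as a tuple whose first and last vertex coincide, and, following Theorem 2, a cycle is identified by the set of its edges so that different entry points map to the same key.

```python
from collections import Counter, defaultdict


def cycle_key(cycle):
    """Identify a cycle by the set of its edges, independent of entry point."""
    return frozenset(zip(cycle, cycle[1:]))


def count_cycles(cycles):
    """Count traversals per cycle, merging different starting points."""
    return Counter(cycle_key(c) for c in cycles)


def cluster_by_components(sequence_components):
    """Group sequence ids that consist of exactly the same components.

    `sequence_components` maps a sequence id to a (paths, cycles) pair.
    """
    clusters = defaultdict(list)
    for seq_id, (paths, cycles) in sequence_components.items():
        signature = (frozenset(paths), frozenset(cycle_key(c) for c in cycles))
        clusters[signature].append(seq_id)
    return list(clusters.values())


a = ("S1", "S2", "S3", "S1")
b = ("S2", "S3", "S1", "S2")  # the same cycle, entered at another vertex
counts = count_cycles([a, b, a])
clusters = cluster_by_components({
    "d1": ([("S0", "S1")], [a]),
    "d2": ([("S0", "S1")], [b, b]),
    "d3": ([("S0", "S2")], [b]),
})
```

In this sketch the signature ignores how often a cycle was traversed, so `d1` and `d2` fall into the same cluster; adding the traversal counts to the signature would be one way to refine the grouping.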
For our prototype we used a subset of the data described in section 2. The subset consists of about 200k drives within two weeks and is about 5 GB in size. For the storage in the graph database, the following steps were performed:

• Extracting the sequences per drive.
• Deriving the components of the sequences.
• Insertion of the sequences and their components into the graph.
• Insertion of customer-drive nodes into the graph and assignment to the sequences.

The resulting graph database is less than 500 MB in size, which is roughly a 10x reduction in space. We implemented all three model variants described in section 4.2. All three models result in a similar storage size of less than 500 MB. In what follows, we compare model variants 2 and 3 by performing the queries from examples 1 to 4. Table 1 gives an overview of the subset used for our experiments according to the quantities introduced in section 4.1. Table 2 summarizes the number of elements in each model. It also shows whether an element is modeled as a relationship (edge) or a node.
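The step of deriving the components of a sequence can be sketched as follows. This is only one plausible decomposition, assuming a stack-based scan that peels off a cycle whenever a state reappears on the current path; the procedure actually used for the prototype is the one defined in section 4.1 and may differ, for instance in how the remaining simple paths are split. The function name `decompose` and the state names are hypothetical.

```python
def decompose(sequence):
    """Split a walk into cycles and one residual simple path.

    Scans the walk while keeping the current simple path on a stack.
    When a state reappears, the states from its previous position up to
    the top of the stack are recorded as one cycle and popped again.
    """
    path = []       # current simple path (no repeated states)
    position = {}   # state -> index of that state in `path`
    cycles = []
    for state in sequence:
        if state in position:
            start = position[state]
            cycles.append(tuple(path[start:]))
            # Keep `state` itself on the path, drop the rest of the cycle.
            for removed in path[start + 1:]:
                del position[removed]
            del path[start + 1:]
        else:
            position[state] = len(path)
            path.append(state)
    return cycles, tuple(path)


cycles, simple_path = decompose(["A", "B", "C", "B", "D", "A", "E"])
```

For this walk the sketch yields the cycles B -> C -> B and A -> B -> D -> A and the residual simple path A -> E.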
Quantities                                                  number
|Q|          number of sequences ≡ drives                   224265
|C|          number of distinct cycles                      3767
|P|          number of simple paths                         7438
c_e = c_v    total number of edges or vertices in cycles    16077
p_e = p_v −
V            number of vertices in FSA ≡ states             124
E            number of edges in FSA ≡ transits              1691

Table 1: Overview of the statistics for the prototyped subset.

                                                Seq. Components
          Vehicle   Drove    Drive    Has       Circles   Paths   State   Transit
NodesV2   85975     -        224265   -         -         -       49375   -
EdgesV2   -         224265   -        1570774   16077     25736   -       -
NodesV3   85975     -        224265   -         3767      7438    -       1691
EdgesV3   -         224265   -        1570774   16077     25736   -       -

Table 2: Model Variant 2 compared with Model Variant 3

In the first query, from example 1, we search the database for all drives which have passed through the simple path S_1710 -> S_552 -> S_574, where the numbers 1710, 552 and 574 are the states' IDs. The search query in model variant 2 looks like this:
MATCH p=(v:Vehicle)-[:Drove]->(:Drive)-[:Has]->(:State)-[*0..]->
        (:State {stateId:1710})-->(:State {stateId:552})-->(:State {stateId:574})-[*0..]->(:State)
RETURN p
The same search in model variant 3 looks like this:
Match (v:Vehicle)-[:Drove]->(:Drive)-[:Has]->(c)-->(:T {name: "1710_552"}),
      (c)-->(:T {name: "552_574"})
with (v:Vehicle)-[:Drove]->(:Drive)-[:Has]->(c)-->(:T) as p
return distinct p
In both models it takes a few milliseconds to query the data. Figure 7 shows the results of query 1. The query finds 4 paths and 6 drives.

In the second query, from example 2, we search the database for all paths going through state S_1710 and state S_574. In Figure 8, we compare the result of query 2 in both model variants. In model variant 2, the query finds 8 simple paths and a cycle. In model variant 3, we have 9 paths and a cycle. The reason for the difference is that in model variant 3 it is difficult to take the order of the transit nodes into account. As a result, it also finds a path in which the state S_574 is visited before the state S_1710. Below is a possible query for variant 2:

MATCH (:State {stateId:1710})-[c*]->(:State {stateId:574})
with (:Vehicle)-->(:Drive)-->(:State)-[*{compHash: head(c).compHash}]->() as p
RETURN p
The same search in model variant 3 looks like this:
Match (v:Vehicle)-[:Drove]->(:Drive)-[:Has]->(c)-->(:T {start: "ST_Nav_DestInput_FTS_Result"}),
      (c)-->(:T {end: "ST_Nav_LastDestinations"})
with (v:Vehicle)-[:Drove]->(:Drive)-[:Has]->(c)-->(:T) as p
return distinct p
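The extra path reported by model variant 3 follows from matching the transit nodes without regard to their order. The sketch below contrasts an order-aware check with an order-agnostic one on a component given as a list of state IDs. The helper names and the two example components are hypothetical; only the state IDs 1710 and 574 are taken from the query for variant 2.

```python
def visits_in_order(component, first, second):
    """True if `first` occurs in the component and `second` occurs after it."""
    if first not in component:
        return False
    return second in component[component.index(first) + 1:]


def visits_both(component, first, second):
    """True if both states occur, regardless of order.

    This mimics order-agnostic matching; it is not the variant 3 query itself.
    """
    return first in component and second in component


forward = [1710, 552, 574]   # 1710 is visited before 574
backward = [574, 552, 1710]  # 574 is visited before 1710

ordered_hits = [c for c in (forward, backward) if visits_in_order(c, 1710, 574)]
unordered_hits = [c for c in (forward, backward) if visits_both(c, 1710, 574)]
```

The order-aware check keeps only `forward`, while the order-agnostic check also returns `backward`, which corresponds to the kind of additional result seen in variant 3.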
In the third query, we are looking for nontrivial cycles which are visited more than once by several users. Figure 9 shows one of the cycles, with six nodes, from the result list with 16 users. It shows that these users have difficulty using a specific functionality, as they repeat these particular steps several times. Here is Query 3 in model variant 2:

Match p =(d:Drive)-[h:Has]->(s:State)-[:Circle*]->(:State)
with id(d) as di , count(h) as NrVisit, s.compHash as CircleName,
     length(max(p))-1 as CircleLen
where NrVisit/CircleLen > 1
with CircleName, size(collect(di)) as NrDirves,
     avg(NrVisit/CircleLen) as NrVisits, CircleLen
where NrDirves > 10
Return CircleName, NrDirves, NrVisits, CircleLen
order by CircleLen desc
limit 20
And here is Query 3 in model variant 3:
MATCH (n:Drive)-[h:Has]->(c:Circles)
with n, count(h) as ch, c
where ch > 1
with c, size(collect(n.name)) as NrDirves, avg(ch) as NrVisits,
     size((:Circles {name: c.name})-->(:T)) as CircleLen
where NrDirves > 10
return c.name as CircleName, NrDirves, NrVisits, CircleLen
order by CircleLen desc
limit 20
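The aggregation behind Query 3 can be mirrored in memory to make its two filtering steps explicit. The sketch assumes that the `Has` relationships between drives and cycles are available as (drive, cycle) pairs, one pair per traversal. The function name, the thresholds exposed as parameters, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def repeated_cycles(has_edges, min_visits=2, min_drives=2):
    """Map each qualifying cycle to (number of drives, average visits).

    A drive counts for a cycle if it traversed the cycle at least
    `min_visits` times; a cycle qualifies if at least `min_drives`
    such drives exist.
    """
    visits = Counter(has_edges)      # (drive, cycle) -> number of traversals
    per_cycle = defaultdict(list)    # cycle -> visit counts of counted drives
    for (drive, cycle), n in visits.items():
        if n >= min_visits:
            per_cycle[cycle].append(n)
    return {
        cycle: (len(counts), sum(counts) / len(counts))
        for cycle, counts in per_cycle.items()
        if len(counts) >= min_drives
    }


edges = [
    ("d1", "c1"), ("d1", "c1"),
    ("d2", "c1"), ("d2", "c1"), ("d2", "c1"),
    ("d3", "c1"),
    ("d1", "c2"), ("d1", "c2"),
]
result = repeated_cycles(edges)
```

With `min_visits=2` and `min_drives=11` the filters correspond to the conditions `ch > 1` and `NrDirves > 10` used in the query for model variant 3.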
In Query 4, we cluster the drives by their common components. It clusters 166556 of the 224265 drives into 7063 clusters. There is one big cluster with 67151 elements whose drives consist of a single component (a cycle). This cycle is a trivial cycle in which the application starts by default. In this clustering we consider exactly matching sets of components. On the other hand, if we use only cycles for clustering, 180999 drives are clustered into 6048 clusters. To cluster the remaining singletons, we can use other scoring possibilities, such as the Jaccard distance, to compare the common neighborhood between drives. Here is Query 4 in model variant 2:
MATCH (n:Drive)-[:Has]->(:State)-[c]->(:State)
with n, c.compHash as compHash
order by compHash
with n, collect(distinct compHash) as cluster
with collect(n.name) as Drives, cluster
where size(Drives) > 1
return cluster, size(Drives) as clusterSize
order by size(Drives) desc
The same query in Model variant 3 looks like this:
MATCH (n:Drive)-[:Has]->(c:Circles)
with n, c.name as compHash
order by compHash
with n, collect(distinct compHash) as Cluster
with collect(n.name) as Drives, Cluster
where size(Drives) > 1
return Cluster, size(Drives) as clusterSize
order by size(Drives) desc

One of the major challenges, however, remains the design of a proper ingestion pipeline for inserting the components into the graph database. The difficulty is that duplicated components have to be avoided: at each insertion we have to check whether the component already exists in the database. We propose a hash indexing of the components in a look-up table to address this issue.
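The proposed hash indexing can be sketched with an in-memory look-up table. The sketch assumes that a component is described by a kind label and a tuple of state IDs, and it uses SHA-256 from Python's standard library as the hash. The class `ComponentIndex`, its methods, and the choice of hash function are assumptions for illustration, not the prototype's implementation.

```python
import hashlib


def component_hash(kind, states):
    """Hash a component from its kind ('cycle' or 'path') and its state IDs."""
    key = kind + ":" + ",".join(str(s) for s in states)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class ComponentIndex:
    """Look-up table that reports whether a component still has to be inserted."""

    def __init__(self):
        self._known = {}  # hash -> (kind, states)

    def register(self, kind, states):
        """Return (hash, is_new); `is_new` is False for a duplicate."""
        h = component_hash(kind, states)
        is_new = h not in self._known
        if is_new:
            self._known[h] = (kind, tuple(states))
        return h, is_new

    def __len__(self):
        return len(self._known)


index = ComponentIndex()
h1, new1 = index.register("cycle", (1710, 552, 574))
h2, new2 = index.register("cycle", (1710, 552, 574))  # duplicate, not new
h3, new3 = index.register("path", (1710, 552, 574))   # same states, other kind
```

Only components reported as new would be written to the graph; duplicates are linked through their existing hash. For cycles, the state list would need to be brought into a fixed rotation before hashing, so that different starting points map to the same entry.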
In this work we have discussed the idea of remodeling the data representation and storage system, which can provide new possibilities for data analysis. In the case of sequence analysis, instead of traditional vectorization methods, we suggest a graph component-wise analysis. The concept behind it is derived from the fact that a sequence itself is a traversal of a finite state automaton. Based on this assumption, we introduce a new way of looking at a sequence and of considering loops. The main hypothesis is that the variation of components in a real use case converges. This assumption is very important because in a fully connected graph the number of possible simple paths and cycles increases exponentially with the number of nodes and edges. The number of simple paths in a graph with n nodes, for example, can be approximated by the number of possible subsets of an n-element set, 2^n. Thus, we have to examine the hypothesis on a larger amount of data.

In general, due to the huge amount of data produced by the vehicles, scalability is of great interest to us. In the analysis of the model variants from section 4.2, we evaluated a specific implementation of Neo4j and highlighted its limitations. Further investigation of other graph databases and a comparative study of them is therefore necessary.

Last but not least, we introduce a platform for ad hoc analysis of customer click data. Most relational databases are equipped with dashboards and graphical visualization interfaces which make it easier for the end-user to use those systems. Graph databases have their own query languages, which are not familiar to most end-users. Therefore, an evaluation of user-friendly interfaces and visualization tools for graph databases is essential.

References

[1] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: Discovery and applications of usage patterns from web data.
ACM SIGKDD Explorations Newsletter, 1(2):12–23, 2000.
[2] Arindam Banerjee and Joydeep Ghosh. Clickstream clustering using weighted longest common subsequences. In Proceedings of the web mining workshop at the 1st SIAM conference on data mining, volume 143, page 144, 2001.
[3] Wendy W Moe, Hugh Chipman, Edward I George, and Robert E McCulloch. A bayesian treed model of online purchasing behavior using in-store navigational clickstream. Revising for 2nd review at Journal of Marketing Research, 2002.
[4] Jeffrey Heer and Ed H Chi. Separating the swarm: categorization methods for user sessions on the web. In Proceedings of the SIGCHI Conference on Human factors in Computing Systems, pages 243–250. ACM, 2002.
[5] Alan L Montgomery, Shibo Li, Kannan Srinivasan, and John C Liechty. Modeling online browsing and path analysis using clickstream data. Marketing science, 23(4):579–595, 2004.
[6] Annika Baumann, Johannes Haupt, Fabian Gebert, and Stefan Lessmann. Changing perspectives: Using graph metrics to predict purchase probabilities. Expert Systems with Applications, 94:137–148, 2018.
[7] Gang Wang, Xinyi Zhang, Shiliang Tang, Christo Wilson, Haitao Zheng, and Ben Y Zhao. Clickstream user behavior models. ACM Transactions on the Web (TWEB), 11(4):21, 2017.
[8] Nina Bhatti, Anna Bouch, and Allan Kuchinsky. Integrating user-perceived quality into web server design. Computer Networks, 33(1-6):1–16, 2000.
[9] I-Hsien Ting, Chris Kimble, and Daniel Kudenko. Ubb mining: finding unexpected browsing behaviour in clickstream data to improve a web site's design. In The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), pages 179–185. IEEE, 2005.
[10] Gang Wang, Tristan Konolige, Christo Wilson, Xiao Wang, Haitao Zheng, and Ben Y Zhao. You are how you click: Clickstream analysis for sybil detection. In Presented as part of the 22nd USENIX Security Symposium (USENIX Security 13), pages 241–256, 2013.
[11] Ruili Geng and Jeff Tian. Improving web navigation usability by comparing actual and anticipated usage. IEEE transactions on human-machine systems, 45(1):84–94, 2014.
[12] Lin Lu, Margaret Dunham, and Yu Meng. Mining significant usage patterns from clickstream data. In International Workshop on Knowledge Discovery on the Web, pages 1–17. Springer, 2005.
[13] Gang Wang, Xinyi Zhang, Shiliang Tang, Haitao Zheng, and Ben Y Zhao. Unsupervised clickstream clustering for user behavior analysis. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 225–236. ACM, 2016.
[14] Teresa Mah, Hendricus DJ Hoek, and Ying Li. Method and system for clickpath funnel analysis, March 28 2006. US Patent 7,020,643.
[15] Birgit Hay, Geert Wets, and Koen Vanhoof. Mining navigation patterns using a sequence alignment method. Knowledge and information systems, 6(2):150–163, 2004.
[16] Arthur Pitman and Markus Zanker. Insights from applying sequential pattern mining to e-commerce click stream data. In , pages 967–975. IEEE, 2010.
[17] Eduardo Corel, Philippe Lopez, Raphaël Méheust, and Eric Bapteste. Network-thinking: graphs to analyze microbial complexity and evolution. Trends in Microbiology, 24(3):224–237, 2016.
[18] Benedict Paten, Adam M Novak, Jordan M Eizenga, and Erik Garrison. Genome graphs and the evolution of genome inference. Genome research, 27(5):665–676, 2017.
[19] Vaddadi Naga Sai Kavya, Kshitij Tayal, Rajgopal Srinivasan, and Naveen Sivadasan. Sequence alignment on directed graphs. Journal of Computational Biology, 26(1):53–67, 2019.
[20] Chi Zhang, Fengyu Cong, Tuomo Kujala, Wenya Liu, Jia Liu, Tiina Parviainen, and Tapani Ristaniemi. Network entropy for the sequence analysis of functional connectivity graphs of the brain. Entropy, 20(5):311, 2018.
[21] Philip Hingston. Using finite state automata for sequence mining, volume 24. Australian Computer Society, Inc., 2002.
[22] Edward A Bender and S Gill Williamson. Lists, Decisions and Graphs, page 164. S. Gill Williamson, 2010.
[23] Ian Robinson, Jim Webber, and Emil Eifrem. Graph databases, page 25. "O'Reilly Media, Inc.", 2013.
[24] Neo4j Community Edition. https://neo4j.com/licensing/.

(a) Result of Query 1 in model variant 2. (b) Result of Query 1 in model variant 3.
Figure 7: Result of query 1 in different models. Four paths are found. Six drives and vehicles have these paths as their components.

(a) Result of Query 2 on model variant 2. (b) Result of Query 2 on model variant 3.
Figure 8: Model variant 3 finds one more path, because it does not take into account the order of visits.

(a) Result of Query 3 in model variant 2. (b) Result of Query 3 on model variant 3.