Client Network: An Interactive Model for Predicting New Clients
Massimiliano Mattetti, Akihiro Kishimoto, Adi Botea, Elizabeth Daly, Inge Vejsbjerg, Bei Chen, Öznur Alkan
IBM Research - Ireland
{akihirok,adibotea,elizabeth.daly,ingevejs,oalkan2,beichen2}@ie.ibm.com
Abstract.
Understanding prospective clients becomes increasingly important as companies aim to enlarge their market bases. Traditional approaches typically treat each client in isolation, either studying its interactions or similarities with existing clients. We propose the Client Network, which considers the entire client ecosystem to predict the success of sales pitches for targeted clients by complex network analysis. It combines a novel ranking algorithm with data visualization and navigation. Based on historical interaction data between companies and clients, the Client Network leverages organizational connectivity to locate the optimal paths to prospective clients. The user interface supports exploring the client ecosystem and performing sales-essential tasks. Our experiments and user interviews demonstrate the effectiveness of the Client Network and its success in supporting sellers’ day-to-day tasks.
Keywords:
Link prediction; Graph visualization; User Interfaces
Identifying new potential clients is an important and time-consuming task for many organizations. Sellers have to prospect for new clients and figure out which companies might be more interested in becoming a client. Sellers may also have to follow up on marketing responses where potential clients may have previously expressed an interest in a product by downloading a white paper or asking for additional information as a follow-up to a marketing campaign. With limited time and large numbers of follow-ups to pursue, sellers may need assistance in prioritizing which leads represent likely conversions. Typical approaches to client prospecting use the notion of similarity to identify clients who share attributes with the organization’s existing clients. Prioritization methods tend to filter out low-likelihood interactions or identify spam-like behavior to focus on interactions that express the most interest. These solutions evaluate the organization or interactions in isolation from the existing client relationships. Knowledge and trust of a brand is something that is built up over time with clients and can play an important role in a client’s decision to purchase from a company in the future. This influence does not need to be limited to the existing clients, given that the business world is a small world where key professionals move between organizations. When a person takes on a new role in an organization, they do not just start afresh; they also bring with them their knowledge, experiences and relationships. This can tell us something about the companies that they are moving between: it could be that they are either in a similar industry space or have a similar corporate culture. We can harness the fact that they may bring with them knowledge and experience of products and services used in their prior organization.
If we can find these companies, they can represent valuable prospective clients for an organization to approach.

In this paper we explore the research question of how to leverage latent information in the relationship graph to predict prospective clients and also surface the relationships to sellers, to provide tangible explanations and give context for the predictions. We present our solution, Client Network, which uses organizational connectivity to build a network with different types of links with different meanings and different weights. It combines a novel ranking algorithm with data visualization to allow the sellers to understand the client ecosystem better. We tuned the algorithm by analyzing our connectivity with existing clients under the assumption that the client link is missing, so that the resulting ranking can be used to predict the utility of a prospective client.

Our contributions in this paper are as follows: a scalable network algorithm that supports heterogeneous network relationships to rank prospective clients, allowing sellers to focus on leads with the highest probability of success; and an interactive visualisation tool which supports exploration and provides context for the predictions, allowing sellers to interpret the ranking and make better-informed decisions.
Given a network of companies connected together via multiple relationships, the problem of recommending prospective clients can be formulated as predicting whether the link client appears in the network or not. This problem is a variant of the standard link prediction problem, which aims to predict the formation of any type of relationship in a network. It has been studied extensively in various fields, such as web science [1,36], healthcare and biology [17,5,4,12], and recommender systems [3,23,32,20,15].

Formally, the link prediction task can be formulated as follows [20]. Given a network $G(V, E)$, let edge $e = (u, v) \in E$ represent an interaction between nodes $u, v \in V$ at time $t(e)$. Multiple interactions between $u$ and $v$ are recorded as parallel edges. For $t_0 \le t_0'$, let $G[t_0, t_0']$ denote the subgraph of $G$ restricted to the edges between $t_0$ and $t_0'$. For a training interval $[t_0, t_0']$, the link prediction task is to predict a list of edges that occur in the network $G[t_1, t_1']$, where $t_1 > t_0'$.

Various techniques exist for link prediction, which differ in model complexity, prediction performance, scalability, and generalization ability (see [30,22,14] for an overview). Topology-based link prediction methods [30] are unsupervised approaches that leverage the latent information contained in the network topology to assign a score to each pair of nodes. Algorithms in this category are generally divided into local and global similarity-based approaches [22]. Local approaches, such as Common Neighbors, the Jaccard Coefficient and the Adamic-Adar Coefficient [2], use node neighbourhood-related structural information to compute similarity among nodes. They are simple and fast to compute [22]. However, their performance in non-small-networks [20], where links can form between nodes at distances greater than two, is poor.
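To make the limitation of local approaches concrete, the following minimal sketch computes the three local scores named above on a toy graph. The graph, the node names and the plain-dictionary representation are illustrative assumptions and are not taken from the paper's dataset or implementation.

```python
import math


def common_neighbors(adj, u, v):
    """Number of neighbours shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbours divided by the size of the combined neighbourhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Shared neighbours weighted by 1 / log(degree); assumes u != v,
    so every shared neighbour has degree >= 2 and the logarithm is positive."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


# Hypothetical toy graph (undirected), stored as node -> set of neighbours.
edges = [("root", "a"), ("root", "b"), ("a", "x"), ("b", "x"), ("x", "y")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# "x" is two hops from "root" and shares the neighbours "a" and "b" with it.
print(common_neighbors(adj, "root", "x"), jaccard(adj, "root", "x"))
# "y" is three hops from "root": no shared neighbours, so every local score is 0.
print(common_neighbors(adj, "root", "y"), adamic_adar(adj, "root", "y"))
```

In this toy graph the node `y` receives a score of zero from all three measures because it lies at distance three from `root`, which is the situation the paragraph above describes.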
Global approaches, on the other hand, use the whole network topological information to score each link. They have strong predictive power but at higher computational cost. Path-based algorithms, such as the Katz index [19], and algorithms based on random walks, such as PageRank [24], SimRank [16], Hitting Time [20] and PropFlow [21], fall under the umbrella of global topology-based approaches.

A wide range of real-world systems can be modelled as heterogeneous information networks (HIN), such as human social activities, communication and computer systems, and biological networks. Formally, a HIN is defined as a network of multiple types of nodes (e.g., authors, conferences and papers) and multiple types of edges (e.g., co-author, author-write-paper and paper-published-in-conference) [27].

Initial approaches to the link prediction problem in HIN utilized the same algorithms used in the homogeneous ones, with no changes. These algorithms were not designed to take into account the dependency patterns across types that exist in heterogeneous networks. Such approaches treat all relationships equally or separately study homogeneous projections of the networks, completely ignoring information about the different topology or the different formation mechanism that each type of edge may have. Later approaches introduced new extensions to classical link prediction algorithms [9,33] which improved their performance in HIN. More recently, a meta-structure known as a meta path [28] has been proposed in order to account for the semantics of the different types of relationships. Furthermore, a new category of algorithms has arisen which leverages the concept of meta path [26,25].

Fig. 1. Application Workflow
Our solution, Client Network, aims to support sellers in identifying prospective clients. It leverages past interactions that took place within an ecosystem of people and organizations to build a network connecting the seller’s company to any other company in the ecosystem. It employs a novel ranking algorithm for computing the likelihood of turning a company into a new client and enables the seller to explore the network with an interactive UI. Hereafter we refer to the seller’s company with the term root.

Data Model.
The Client Network can work with any type of network provided it has the following characteristics: i) a subset of nodes represent companies; ii) one of the nodes of type company is marked as root; iii) a relationship of type client exists between the root node and other nodes of type company.

Thus, the Client Network is a generic solution, with no additional assumptions about, for example, the types of nodes and the types of relations.

However, for clarity, we use details from an actual network in the presentation. This is also used as input to our system in the evaluation. It is a network that combines information about the clients of our organization with financial information about banks, corporations, investment managers and more. Because the data is confidential, we are not able to publish the dataset used in the evaluation. The network contains two types of nodes, company and person. The latter is for professionals with decisional power in their companies. Each job role of a professional, either current or former, is represented as an edge between the corresponding person node and company node.

Job roles are divided into two types: board member and executive. We further attach to job-role edges a label such as current or former. For instance, given a company and its current CTO, the corresponding company and person nodes are connected with an edge labelled as current executive. Likewise, the node of a professional that has served on the company’s board in the past is connected to the company’s node through an edge labelled as former board member.

In addition to the connections capturing the organizational people flow, the network contains business to business (B2B) relationships, such as sponsor, subsidiary and investor, which can be in different states, such as pending, cancelled and prior.

Despite being a B2B relationship, the client relationship differs from the others because it does not have a state.
In other words, a company is either a current client of the root company or not. Former clients are equivalent to non-client companies from the seller’s point of view. Note that a client relationship always involves the root, being an edge between the root and the client at hand. It is worth mentioning that there is a significant imbalance in the distribution of relationships and nodes. For example, of the 11.5 million nodes in the network, two-thirds are organizations whereas the remaining third are professionals. Furthermore, within the organizations there is a ratio of 1 to 14 between the client and non-client nodes, respectively.

Finally, multiple relationships can exist between two entities, that is, the Client Network allows multiple edges between a pair of nodes.

Solution Overview.
The Client Network includes two main components: the Analyzer and the Interactive Visualizer.

The Analyzer extracts firmographic data and biographical information about professionals from relational data sources, converts the data into a graph data model and finally loads it into a database. Once the structure of the network is ready, our NORA algorithm, described in Section 3, assigns a score to each node in the network. These scores are then stored as attributes of the node objects in the database.

Footnote: In this work, a former client is a company that has not purchased any product/service produced/provided by the root company in the last 5 years.
The Interactive Visualizer allows users to interact with the network. Possible interactions include exploring the neighborhood of a node, retrieving the subgraph connecting a node to the root node, and ranking a list of companies, given as input, using NORA. These functionalities are backed by a set of REST APIs which have been designed to support any sales application in exploring the network data, therefore allowing an easier integration of our model with existing sellers’ tools.

The client ecosystem is a very dynamic environment. A new client acquisition, an individual who moves to a different company and a company that invests in another company are all examples of events that contribute to the evolution of the network. The Client Network requires an up-to-date snapshot of the network in order to provide a high prediction accuracy. Hence, a full import of the data, as well as a new scoring of the nodes, have to be performed periodically. Figure 1 describes the full workflow of our solution, where data ingestion and node scoring tasks are performed by the Analyzer and data exploration is enabled with the Interactive Visualizer.
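Two of the interactions listed above, exploring a node's neighbourhood and retrieving the subgraph that connects a node to the root, can be sketched on a toy version of the data model. Everything in this sketch is an assumption made for illustration: the company and person names, the tuple layout of an edge, and the function names are hypothetical and do not describe the actual REST APIs or database schema.

```python
from collections import deque

# Hypothetical toy network: (node, node, relationship type, state).
# The client relationship has no state, so its state is None.
EDGES = [
    ("root", "AcmeBank", "client", None),
    ("alice", "AcmeBank", "executive", "former"),
    ("alice", "BetaCorp", "executive", "current"),
    ("bob", "AcmeBank", "board member", "current"),
    ("bob", "BetaCorp", "board member", "former"),
    ("BetaCorp", "GammaLtd", "subsidiary", "prior"),
]


def build_adjacency(edges):
    """Undirected multigraph as node -> list of (neighbour, type, state)."""
    adj = {}
    for u, v, rel, state in edges:
        adj.setdefault(u, []).append((v, rel, state))
        adj.setdefault(v, []).append((u, rel, state))
    return adj


def neighborhood(adj, node):
    """Labelled edges incident to a node (parallel edges are kept)."""
    return list(adj.get(node, []))


def shortest_paths_to_root(adj, root, target):
    """All fewest-hop paths from root to target, found by breadth-first search."""
    dist = {root: 0}
    parents = {root: set()}
    queue = deque([root])
    while queue:
        n = queue.popleft()
        for c, _rel, _state in adj[n]:
            if c not in dist:                 # first time c is reached
                dist[c] = dist[n] + 1
                parents[c] = {n}
                queue.append(c)
            elif dist[c] == dist[n] + 1:      # another equally short way to reach c
                parents[c].add(n)
    if target not in dist:
        return []
    paths = [[target]]
    while paths[0][0] != root:                # all paths grow by one hop per round
        paths = [[p] + path for path in paths for p in sorted(parents[path[0]])]
    return paths


adj = build_adjacency(EDGES)
for path in shortest_paths_to_root(adj, "root", "BetaCorp"):
    print(" -> ".join(path))
```

Here `BetaCorp` is linked to the root through two professionals who moved between it and an existing client, so two paths are returned.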
User interface.
An API and UI have been developed to support navigating through the complex network of companies and individuals. Our aim was to facilitate the exploration of the network context while maintaining a task-oriented focus. Users can discover prospective clients with connections to the root company, explore each client’s relationships and receive a measure of the success chances of a sales pitch to that client. We envisioned that the Client Network would help sellers in several important tasks, such as: following up on a list of leads from marketing; exploring the connections of a recently acquired client; and finding out more about a social media interaction.

Initial discussions with domain experts revealed that one concern for the root organization was the amount of time that the sellers spend prospecting for whitespace companies (a term used internally by sellers to refer to prospective clients) using their existing tools like LinkedIn. Using the Client Network will allow sellers to save time when deciding which marketing leads they should prioritize, as well as to make discoveries which could help them shape their approach to the client and increase the likelihood of a successful sales opportunity outcome. The seller can get a feeling for how the individual company fits into a larger and connected client ecosystem relating back to their company.

The user interface is composed of a general landing and information area, a search interface to find companies, company ranking list views and a view to explore a company and its connections in more detail. The user interface consists of a React.js web application which communicates with a Java API that supplies it with the data to power the search and the visualizations. The UI is responsive: it adapts its styling and the information shown to the device screen size. The network graph is visualized using the vis.js library [7] and it uses a force-directed layout, with controls for zooming and resetting the graph.

The search functionality allows the user to search for a company name. On searching for a company, the user is presented with a list of results which shows company features such as the company name, ranking, probability of success, status, location, and year founded, allowing the user to disambiguate between similarly named companies. Once the targeted company is identified, the user can click through to access a detailed view of that company, or add it to a ranking list.
Fig. 2. Left: Company ranking. Right: Top Whitespace Companies
Sellers receive lists of prospective clients from the marketing team or their managers. Deciding which of these clients is a likely prospect and should be approached as a priority can be time-consuming. The seller can input a list of companies and the Client Network will return the list ranked by the confidence level of turning a company into a new client. The seller searches for each company in turn and adds them to the list. The system processes this list, returning a modal window with the list of companies. The list is ranked, showing first the companies assumed to have a higher chance of becoming a future client, as presented in Figure 2 left.

Depending on factors such as the product they are selling or the country they are targeting, some sellers may receive fewer leads than others. In the case of a seller struggling to find prospective clients, a good starting point would be to explore the connections of a recently acquired client. An existing client can be used as a starting point to prospect for whitespace clients that are connected to it. Thus, the Client Network turns a newly acquired client into an opportunity for approaching whitespace clients which were not taken into consideration previously due to a low connectivity to the root company. The seller searches for the company and can further investigate the “Whitespace connections”. A ranked list of whitespace clients that are connected to the newly acquired client is returned to the seller (Figure 2 right), allowing them to quickly identify companies in that ecosystem that have a high chance of conversion to a client.

In the third scenario, the seller can find information about the positioning of the company within the client ecosystem and use company descriptions and professional biographies to find out more. The seller can select a company to see a detailed view of that company and its associated subgraph, which visualizes how it links back to the root company, as shown in Figure 3.
The seller can further see details about companies and professionals that are connected by a series of relationships. Each relationship is labelled with its type. Current relationships are displayed with a solid line and previous relationships with a dashed line. The size of a node is dictated by its connectivity with the root company and is relative to the size of the other nodes in the subgraph, that is to say it is scaled based on the scores of the other nodes in the current subgraph.

Fig. 3. A detailed view of the company BlackOrange and its connections

An important feature is the interaction design that allows the sellers to make sense of a large graph. It has been suggested that producing a good visualization gets harder as the graph gets larger, with difficulties in the syntactic (e.g., avoiding occlusion) and semantic (e.g., highlighting the important features of the underlying network) areas [31]. With this in mind, we decided to limit the initial view to a subgraph, so that the user does not suffer from information overload and the initial view of the graph is not as overwhelming as it would be if the subgraph were too large for the window in which it is presented. Not every link between the root company and the target company is visualized, but just a selection of the shortest paths. The user can explore the graph in any direction that they find interesting by requesting the visualization of any further connections that the selected node in the subgraph may have. Information about the target company and each other node in the subgraph is available below the diagram in the form of short biographies of professionals and company descriptions.

A good visualization should emphasize readability and promote the understanding of the underlying relationships. Our aim with this visualization is to afford users the opportunity to understand the intrinsic information contained in the structure of the network, enabling them to better understand the data that is available about the companies and professionals. The external image of the graph becomes a kind of information store [35], allowing the sellers to use it as a space for visual problem solving.
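The relative node sizing described above (each node scaled against the scores of the other nodes in the current subgraph) could be realized in several ways. The sketch below assumes a simple linear min-max scaling; the function name and the pixel bounds are hypothetical choices, and the paper does not state which scaling the UI actually applies.

```python
def node_sizes(scores, min_size=10.0, max_size=40.0):
    """Map the scores of the nodes in the current subgraph to display sizes.

    Linear min-max scaling is an assumption made for this sketch: the node
    with the lowest score gets min_size, the one with the highest max_size.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All nodes are equally connected: give them the same medium size.
        return {node: (min_size + max_size) / 2 for node in scores}
    scale = (max_size - min_size) / (hi - lo)
    return {node: min_size + (score - lo) * scale for node, score in scores.items()}


# Hypothetical scores for three nodes of one subgraph.
print(node_sizes({"a": 0.0, "b": 0.5, "c": 1.0}))
```

Because the scaling depends only on the nodes currently displayed, the same company may be drawn at different sizes in different subgraphs.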
Ranking Mechanism
Marketing campaigns allow companies to capture the interest of prospective clients. This interest is converted into leads passed to the sales division. The seller who is supposed to follow up on these leads has to go through a careful skimming phase which involves prospecting and prioritizing the companies in the list. Our domain experts estimate that a seller spends over 30% of their time on these activities.

To support the seller in deciding which companies should have the highest priority and which ones may not be worth pursuing, the Client Network assigns a score to each company in the network. These scores are computed using our Node Relevance Algorithm (NORA), which computes a flow value for each node in a graph. Nodes can then be ranked (in decreasing order) based on the scores. The flow value of a node measures the strength of its relation to the source node (the seller’s company). Details about the computation of flow values are given later in this section.

Intuitively, the sellers should approach a company that has a close connection to their company. Additionally, a company that has multiple connections to the sellers’ company should be considered as a strong candidate. NORA’s flow values attempt to satisfy these criteria by considering a subset of the global topological structure of the Client Network.

Algorithm 1: NORA ranking technique
  input : graph $G = (V, E)$, start node $s$
  output: Scores of all nodes, to be used for ranking
  1  $(\Delta, E') \leftarrow$ AnalyzeNetwork$(G, s)$
  2  for $v \in V$ in the order given by $\Delta$ do
  3      $F(v) = \gamma \times \sum_{p \in P} \frac{F(p)}{od(p)}$, where $P$ is the set of all parent nodes in $G' = (V, E')$ and $\gamma < 1$
  4  return $F$

Algorithm 1 shows the main steps of NORA in pseudocode. At line 1, method AnalyzeNetwork has a three-fold purpose. First, it computes an ordering of the nodes $\Delta$ and a filtered set of edges $E' \subseteq E$. Secondly, edges in $E'$ become directed edges, as explained later in this section. Finally, $E'$ allows NORA to ensure the convergence of flow values with low computational complexity. We will introduce two approaches to this step, leading to two variations of our algorithm, called NORA-D and NORA-T.

The next step of NORA (lines 2–3 in Algorithm 1) is to compute a function $F : V \to \mathbb{R}^+$, called the flow-value function. A lookup table is a simple and easy implementation for $F$. Nodes are processed in the order given by $\Delta$. It remains to describe the actual formula to compute the flow value of a given node $v$ (line 3 in the pseudocode). Given a node $p$, let $od(p)$ be the number of outgoing directed edges from $p$ in graph $G'$. For a given node $v$, let $P$ be the set of all parent nodes of $v$ in $G'$. The flow of $v$ is defined as:

$$F(v) = \gamma \times \sum_{p \in P} \frac{F(p)}{od(p)},$$

where $\gamma < 1$ is a discount factor. Each parent node $p$ distributes its flow value among its children in $G'$. In the formula above, the value is shared equally among the children nodes of $p$. A child node accumulates flow values from all its parent nodes in $G'$. The discount factor $\gamma$ implements the intuition that, if a node $v$ is further away from $s$, then its flow value should be smaller.

Next we present NORA-D and NORA-T, our algorithmic versions that differ in the way they implement step 1 in Algorithm 1.

NORA-D. In this variant of our algorithm, step 1, corresponding to AnalyzeNetwork, performs a one-to-many shortest-path search to produce its output. We discuss two possible implementations of shortest-path search, one based on Dijkstra’s algorithm and one based on breadth-first search.
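Independently of how AnalyzeNetwork is implemented, the flow propagation in lines 2–3 of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the ordering $\Delta$ and the directed edges $E'$ are already given, that the source starts with a flow of 1 (the pseudocode does not state the initial value), and that $\gamma = 0.5$; the function and parameter names are hypothetical. The sketch pushes each parent's share to its children, which yields the same values as summing over parents as long as every parent precedes its children in the ordering.

```python
def propagate_flow(order, dag_edges, source, gamma=0.5, source_flow=1.0):
    """Compute F(v) = gamma * sum over parents p of F(p) / od(p).

    order      -- nodes in the order Delta (every parent before its children)
    dag_edges  -- directed (parent, child) pairs, i.e. the edge set E'
    source_flow and gamma are assumed values chosen for this sketch.
    """
    children = {}
    for parent, child in dag_edges:
        children.setdefault(parent, []).append(child)

    flow = {v: 0.0 for v in order}
    flow[source] = source_flow
    for p in order:
        kids = children.get(p, [])
        if not kids:
            continue
        share = gamma * flow[p] / len(kids)   # gamma * F(p) / od(p)
        for c in kids:
            flow[c] += share                  # a child accumulates from all parents
    return flow


# Hypothetical DAG: "c" is reachable through both "a" and "b", "d" only through "b".
order = ["s", "a", "b", "c", "d"]
dag_edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
flow = propagate_flow(order, dag_edges, "s")
ranking = sorted((v for v in order if v != "s"), key=lambda v: flow[v], reverse=True)
print(flow)
print(ranking)
```

In this toy DAG, `c` and `d` are at the same distance from the source, but `c` ends up with a larger flow value because it is reached through two parents, which matches the intuition that multiple connections make a stronger candidate.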
Algorithm 2: Extended Dijkstra
  input : graph $G = (V, E)$, start node $s$
  output: Ordering of nodes $\Delta$, in increasing order of the distance from $s$; edges that belong to shortest paths from $s$, in the data structure $E'$
  1   $g(s) \leftarrow 0$
  2   for $v \in V$, $v \neq s$ do
  3       $g(v) \leftarrow \infty$
  4   $P \leftarrow \{s\}$
  5   while $P \neq \emptyset$ do
  6       $n \leftarrow$ pop$(P)$
  7       append $n$ to $\Delta$
  8       for $(n, c) \in E$ do
  9           if $g(c) > g(n) + cost(n, c)$ then
  10              $g(c) \leftarrow g(n) + cost(n, c)$
  11              insert $c$ into $P$, unless this has been done before
  12              MarkedIncomingEdges$(c) \leftarrow \emptyset$
  13          $(n, c) \rightarrow$ MarkedIncomingEdges$(c)$
  14  $E' \leftarrow \bigcup_{v \in V, v \neq s}$ MarkedIncomingEdges$(v)$
  15  return $\Delta$, $E'$

With no assumptions made about whether the edges have a uniform cost or not, the first implementation runs Dijkstra’s algorithm [10] from the source node $s$. Dijkstra’s algorithm takes as input a graph, such as $G = (V, E)$, and a node $s \in V$. It computes the distance (i.e., the cost of a minimum-cost path) from $s$ to any other node in the graph. As such, Dijkstra’s algorithm is a one-to-many distance computation technique.

A standard implementation of Dijkstra’s algorithm returns, for each node $v \in V$, the cost of an optimal path from $s$ to $v$. We have slightly modified the Dijkstra implementation to return additional information besides the optimal costs. Specifically, we mark all edges in $E$ with the property that they belong to an optimal path from $s$ to some node $v$. More formally, let $Opt(a, b)$ be the set of all optimal paths from a node $a$ to a node $b$. An edge $e$ is marked iff $\exists v \in V, \exists \pi \in Opt(s, v)$ such that $e \in \pi$. Marked edges are added into the subset $E'$. We further assign a direction to the edges in $E'$, from the node with the smaller cost to the node with the larger cost. As such, the graph $G' = (V, E')$ is a directed acyclic graph (DAG).

Algorithm 2 shows Dijkstra’s algorithm in pseudocode, together with our extension that marks all edges that we want to keep after filtering. Dijkstra’s algorithm uses a priority list $P$ populated with elements $n \in V$, ordered on the cost $g(n)$. The cost $g(n)$, also called the $g$ value of the node $n$, is the cost from the starting node to the current node $n$. Popping a node from $P$ returns and removes from $P$ an element with a smallest $g$ value. The algorithm expands each node once, in increasing order of the $g$ value. Expanding a node $n$ updates the $g$ value of each successor $c$, if reaching $c$ through $n$ is a shortest path from $s$ to $c$ among all paths from $s$ to $c$ explored so far. At an expansion step, we also insert successor nodes into the priority list, unless a given successor node has been inserted into $P$ at a previous time. Our extensions, which keep track of $\Delta$ and $E'$, are shown at lines 7, 12, 13, 14 and 15.

Algorithm 3: AnalyzeNetwork (NORA-D)
  input : graph $G = (V, E)$, start node $s$
  output: node ordering $\Delta$; set of filtered edges $E'$
  1  Run Extended Dijkstra and take $\Delta$ and $E'$
  2  return $\Delta$, $E'$

Algorithm 4: AnalyzeNetwork (NORA-T)
  input : graph $G = (V, E)$, start node $s$
  output: node ordering $\Delta$; set of filtered edges $E'$
  1  Generate a DAG $G' = (V, E')$ from $G$
  2  Perform topological sort and take $\Delta$
  3  return $\Delta$, $E'$

On a uniform-cost graph, we can replace Dijkstra’s algorithm with breadth-first search. This reduces the complexity of method AnalyzeNetwork, as discussed later in this section.
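The Extended Dijkstra step of NORA-D can be sketched in Python as follows. This is an illustrative reading of Algorithm 2 rather than the authors' code, and it makes three assumptions that are not spelled out in the pseudocode: edge costs are strictly positive; an edge is marked only when it actually lies on a shortest path (an explicit equality check, following the prose definition of marked edges); and stale heap entries are skipped instead of checking whether a node was inserted before. The helper `undirected` and the toy graph are hypothetical.

```python
import heapq


def extended_dijkstra(adj, s):
    """Return (order, dag_edges) for a graph given as node -> [(neighbour, cost)].

    order      -- nodes in increasing order of distance from s (Delta)
    dag_edges  -- directed edges that lie on some shortest path from s (E')
    """
    g = {s: 0.0}
    marked = {}                       # node -> set of marked incoming edges
    order, done = [], set()
    heap = [(0.0, s)]
    while heap:
        g_n, n = heapq.heappop(heap)
        if n in done:
            continue                  # stale heap entry, node already expanded
        done.add(n)
        order.append(n)
        for c, cost in adj.get(n, []):
            candidate = g_n + cost
            if candidate < g.get(c, float("inf")):
                g[c] = candidate
                marked[c] = {(n, c)}  # strictly better path: reset the marks
                heapq.heappush(heap, (candidate, c))
            elif candidate == g[c] and c not in done:
                marked[c].add((n, c))  # equally short path: keep this edge too
    dag_edges = sorted(edge for edges in marked.values() for edge in edges)
    return order, dag_edges


def undirected(edges, cost=1.0):
    """Build a uniform-cost undirected adjacency list (helper for the example)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append((v, cost))
        adj.setdefault(v, []).append((u, cost))
    return adj


adj = undirected([("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d")])
order, dag_edges = extended_dijkstra(adj, "s")
print(order)
print(dag_edges)
```

On this toy graph the edge between `c` and `d` is filtered out because both nodes are at the same distance from `s`, while `c` keeps both of its incoming shortest-path edges.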
NORA-T. In constructing a DAG based on Dijkstra’s algorithm, NORA-D only takes into account the edges that belong to shortest paths. In NORA-T, on the other hand, method AnalyzeNetwork transforms the original graph into a DAG $G' = (V, E')$ where the edges in $E'$ do not necessarily have to be created from edges that belong to shortest paths in $G$. In this way, NORA-T attempts to consider more connections that may impact the companies.

NORA-T performs a topological sort and obtains a node ordering $\Delta$ (see Algorithm 4). If $G'$ has an edge $(n, m)$, the topological ordering will place $n$ before $m$ [8]. In line 2 of Algorithm 1, the topological node ordering $\Delta$ ensures that a node $n$ accumulates the values propagated to $n$ via all paths in $G'$.

The first step of NORA-T is to convert the input graph into a DAG. For an undirected graph, NORA-T starts by assigning a direction to each edge in the graph, namely the direction given by traversing the graph in a breadth-first manner, starting from the source node $s$. Any self-loop is also detected and removed while traversing the graph; thus the result of this step is a DAG. When the input graph is a directed graph, the initial step of NORA-T is to remove the cycles in the graph. Eliminating a minimum number of edges from a directed cyclic graph to obtain a DAG is known as the feedback arc set problem, which is NP-hard [18]. However, approximation techniques that run in polynomial time exist. We take such an approximation approach: NORA-T computes a feedback arc set using Eades et al.’s algorithm [11] and then builds the final DAG $G'$ by removing from $G$ all the edges in the set.

Either depth-first or breadth-first search can be used for the topological sort. We employ depth-first search, as in [29,8].

Worst-case Computational Complexity.
When using Dijkstra’s algorithm, the complexity of NORA-D is $O(|E| + |V| \log |V|)$, where $|V|$ is the number of nodes and $|E|$ is the number of edges. Dijkstra’s algorithm is the step with the largest complexity. When the graph edges have uniform costs, replacing Dijkstra’s algorithm with breadth-first search reduces NORA-D’s complexity to $O(|E| + |V|)$. The complexity of topological sort and assigning direction to the edges is $O(|E| + |V|)$. The algorithm of Eades et al. [11] calculates a feedback arc set in $O(|E|)$ time. NORA-T, therefore, calculates flow values in $O(|E| + |V|)$ time. Recall that, after applying NORA to compute flows, nodes need to be ranked. The ranking (sorting) complexity is within $O(|V| \log |V|)$. However, this step is specific to the problem, rather than being specific to an algorithm such as NORA. Regardless of the algorithm used to compute relevance scores for the nodes, nodes would have to be ranked based on those scores.

Our evaluation follows two main directions. Firstly, we evaluate the algorithm by formulating it as a solution to link prediction. Secondly, we gathered feedback from user interviews to evaluate the impact of our work from the perspective of real users.

The Client Network can support sellers in identifying prospective clients by highlighting the non-client companies which occupy the top positions in the ranking computed by the ranking algorithm. A human expert is involved in selecting the candidates, and contacting the selected candidates afterwards. A human expert will not have time to process the entire list, and they typically focus on a subset at the top of the list. Therefore, the Precision at K ($P@K$) is a key metric for evaluating the performance of the ranking algorithm. Specifically, we used Precision at 10, 50, 100 and 1000 for measuring the quality of the recommendations.

Moreover, for the sake of completeness, we added to the comparison metrics that are widely adopted in the link prediction literature [34], such as the Area Under the Receiver Operating Characteristic Curve (AUROC), the Area Under the Precision-Recall Curve (AUPR) and the Top $|P|$ Predictive Rate ($TPR_{|P|}$), where $P$ is the set of positive instances in the prediction results [34].

The client relationship partitions the organization nodes into two distinct classes, client and non-client. These two are unevenly represented in the network. The imbalance ratio is 1 to 14 between client nodes and non-client nodes. Furthermore, about 50% of the client nodes are only connected to the root node. This type of node does not provide any information that can be leveraged by the link prediction algorithms, especially those that rely on the topological structure of the network. Hence, only the remaining half of client nodes together with the non-client ones are used in our experiments.

We compared NORA with two random-walk based methods, Rooted PageRank and PropFlow. Rooted PageRank (RPR) [20] is a modified version of PageRank. The rank of a node corresponds to the probability that the node will be reached through a random walk from the source. A parameter $\alpha$ specifies how likely the algorithm is to visit the node’s neighbors rather than starting over. PropFlow [21] is related to Rooted PageRank, but it is more localized. The rank of a node corresponds to the probability that a restricted random walk starting at the source node ends at the target node in no more than $l$ steps.
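As an illustration of the Rooted PageRank baseline and of the Precision at K metric, the following sketch ranks the nodes of a toy graph and scores the ranking. It is not the evaluation code used in the paper: the power iteration with a fixed number of rounds, the value $\alpha = 0.85$, the toy graph and the choice of the held-out client are all assumptions made for the example.

```python
def rooted_pagerank(adj, root, alpha=0.85, iterations=100):
    """Random walk that follows an edge with probability alpha and otherwise
    restarts at the root. adj maps each node to a list of its neighbours."""
    nodes = list(adj)
    rank = {n: 0.0 for n in nodes}
    rank[root] = 1.0                       # the walk starts at the root
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            if not adj[n]:
                nxt[root] += alpha * rank[n]   # dead end: go back to the root
                continue
            share = alpha * rank[n] / len(adj[n])
            for c in adj[n]:
                nxt[c] += share
        nxt[root] += 1.0 - alpha           # restart mass (ranks sum to 1)
        rank = nxt
    return rank


def precision_at_k(ranked, positives, k):
    """Fraction of the top-k ranked nodes that are positive instances."""
    return sum(1 for n in ranked[:k] if n in positives) / k


# Hypothetical toy graph, stored in both directions because it is undirected.
edges = [("root", "a"), ("root", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

rank = rooted_pagerank(adj, "root")
candidates = [n for n in adj if n != "root"]
ranked = sorted(candidates, key=lambda n: rank[n], reverse=True)
held_out_clients = {"c"}                   # pretend the client link root-c was removed
print(ranked)
print(precision_at_k(ranked, held_out_clients, 1), precision_at_k(ranked, held_out_clients, 2))
```

In this toy setting the held-out client `c` is ranked first because it is reachable from the root through two intermediaries, so Precision at 1 is 1 and Precision at 2 drops to 0.5.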
Unlike RPR, PropFlow does not require walk restarts or convergence but simply employs a modified breadth-first search restricted to depth l.

We used a 10-fold cross-validation stratified edge holdout scheme to measure the performance of the three algorithms in predicting links of type client. A formal description of the evaluation methodology is presented in Algorithm 5. At line 1, method SampleFolds randomly samples nodes from the client and non-client classes. Its implementation ensures that the distribution of the two classes is preserved in each fold. Method ComputeMetrics (line 10 in Algorithm 5) calculates the Precision at 10, 50, 100, 1000 as well as the AUROC, AUPR and top |P| predictive rate by labelling the client and non-client nodes in the i-th fold as positive and negative instances, respectively. These values are computed for each algorithm j and fold i and stored into the multidimensional vector Metrics.

Algorithm 5: Evaluation Framework Workflow
Data: graph G = (V, E), root node r, set of link prediction algorithms AS
Result: average values of AUROC, AUPR, top |P| predictive rate and Precision at 10, 50, 100, 1000 for each algorithm in AS
    Folds ← SampleFolds(V)
    for each fold_i ∈ Folds do
        DC ← ∅
        for each node ∈ fold_i do
            if node is a client then
                insert node into DC
                RemoveClientLink(G, r, node)
        for each alg_j ∈ AS do
            Scores ← ComputeScores(G, r, alg_j)
            Metrics[j][i] ← ComputeMetrics(Scores, fold_i)
        for each node ∈ DC do
            AddClientLink(G, r, node)
    for each row_j ∈ Metrics do
        ComputeAverageMetrics(row_j)

Results.
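The Precision at K values reported here follow the usual definition: the fraction of the K highest-ranked nodes that are held-out clients. The following is a minimal sketch of that computation, assuming a hypothetical dictionary of node scores and a hypothetical set of positive (client) nodes; the names `precision_at_k`, `example_scores` and `example_clients` are illustrative and do not come from the evaluated implementation.

```python
def precision_at_k(scores, positives, k):
    """Return the fraction of the k highest-scoring nodes that are positives.

    `scores` maps node -> relevance score; `positives` is the set of
    held-out client nodes. Ties keep the dictionary's insertion order,
    because Python's sort is stable.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = sum(1 for node in ranked[:k] if node in positives)
    # Dividing by k (not by the number of ranked nodes) penalizes short rankings.
    return hits / k


# Illustrative data only: four candidate companies, two of them held-out clients.
example_scores = {"acme": 0.9, "globex": 0.8, "initech": 0.4, "umbrella": 0.1}
example_clients = {"acme", "initech"}
p_at_2 = precision_at_k(example_scores, example_clients, 2)
```

Because sellers only inspect the top of the list, a metric of this form rewards an algorithm for placing clients early, regardless of how the remainder of the ranking is ordered.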
All evaluated algorithms are implemented in Python using the NetworkX module [13]. An undirected multigraph represents the structure of the network. Further details on the configuration parameters used in the experiments are available in Table 1.

Table 2 contains the averages of the metrics for each algorithm. NORA-D achieves the highest scores for the Precision at 10 and 50, while NORA-T places the highest number of disconnected clients in the top 100 and 1000 of its ranking. Rooted PageRank offers the highest performance on the AUROC, AUPR and TPR|P|, but it is also the worst in terms of Precision when considering up to position 1000 of the ranking. PropFlow never excels in any of the metrics, although it is never the worst.

PropFlow has a computational complexity of O(|V| + |E|), where |V| is the number of nodes and |E| the number of edges in the graph. So do both NORA variants, since the graph has edges with uniform cost, as explained in the complexity analysis section. Rooted PageRank is the algorithm with the highest time complexity in the group, since each iteration of the algorithm runs in linear time O(|V| + |E|) and the algorithm iterates until it converges.

In summary, NORA has a good time complexity, and it performs well for the Precision at K metrics. It is worth mentioning again that these metrics are very important in our application, since the list of companies that the Client Network recommends the sellers to approach is extracted from the top of the ranking.

Table 1. Configurations used in the experiments

Algorithm          Parameters
Rooted PageRank    α = 0.15; Error tolerance = 1.e− …
…                  … = 1.0

Our aim in these interviews was to understand which features of the application were considered the most and least useful by the users and which features could save time when approaching prospective clients. We aimed to understand whether the visualization and the ranking would aid the users in their daily tasks.
Method.
We demonstrated the system in the sales department in our organization, after which we recruited 5 volunteers whose job roles involve evaluating whitespace clients. They were asked to evaluate the usability and usefulness of the Client Network in relation to their daily tasks. After initial group demonstrations, they tried out the functionality. We followed up with semi-structured interviews. A questionnaire with open-ended questions was developed which allowed some scope for exploring some of the responses in depth. We found this approach useful as the sellers have different roles in the organization and therefore were using the Client Network in different ways.
Results.
The sellers were enthusiastic about the functionality of the Client Network, but they also pointed out a need for a deeper integration with the Sales Customer Relationship Management (CRM) tool they use before they can really make use of its time-saving potential. There were positive responses to the company search coverage, indicating that they could find results related to the companies they were searching for, but there were some omissions when they searched for public sector clients.

Sellers showed an interest in the application’s display of information about the movement of people and subsidiaries. This information is difficult, if not impossible, to uncover in their existing software toolset, and is useful for them to know before they contact the company. Several sellers thought that the company description is very useful information to display because “... it’s important to know a bit about the client when you are calling the client ...”, and that this information could be useful as a conversational “hook” when cold calling: “... when you call 20 people a day you obviously cannot research every individual in a very detailed way.”

Note to Table 1: We use the term “weight” to indicate the flow capacity associated with each edge.

Table 2. Client Prediction Results. The values are averages over the 10 folds. The highest value for each metric is in bold.

Algorithm         P@10   P@50   P@100   P@1000   TPR|P|   AUROC   AUPR
Rooted PageRank   0.39   0.55   0.578   0.589
PropFlow          0.81   0.73   0.715   0.628    0.313    0.833   0.26
NORA-D
The majority of sellers reported that the visualization was the most important part of the application to them from an exploratory point of view. A number of key requirements have been identified for successful network exploration tools, including high quality layout algorithms, data filtering, clustering, statistics and annotation [6]. The findings of the user interviews suggest that we should reconsider our approach to annotating the data. They also suggest that the visualization should not be the only interface to the data when the use is task-based rather than exploratory, so that users have the option of exploring either visually or in a more structured manner.

There were very positive comments about the time-saving aspect of showing the company and subsidiary information, and about the fact that the data was more up to date than the existing system that the sellers would consult for this type of information. The suggested UI improvements were to add a legend to the visualization to help interpret the different types of connections and to improve the graph search filters.

The majority of the sellers strongly believed that the prototype could be even more useful if it were expanded to display historical product information when showing information about linked companies. For example, a linkage between nodes on the graph and the sales CRM used in the company would show which products the connected clients had purchased previously: “... there is a lot of functionality that if they are expanded in the right way, can actually help us a lot, and integrations are the very zero point to start ...”. Another use case suggested for the graph is to highlight the companies that have business relationships, as this is helpful when the sellers are trying to target business-to-business products.
In this paper, we formalized the problem of identifying prospective clients in sales and proposed an innovative solution, the Client Network. It utilizes a novel ranking algorithm for predicting new clients and provides an interactive interface for exploring the client ecosystem. While typical approaches study client interactions or explore clients’ similarity to existing clients based on market segmentation, our approach leverages the whole structure of the client ecosystem. This ecosystem can be represented as a heterogeneous network and, in such a context, the problem of identifying prospective clients can be formalized as an instance of the link prediction problem.

We presented NORA, a novel ranking algorithm, and compared its performance with two well-known algorithms in the link prediction field. The experiments demonstrate that our technique achieves higher Precision at K (for K = 10, 50, 100 and 1000) than Rooted PageRank and PropFlow, which makes it well suited to the use cases covered by the Client Network. In user interviews, sellers expressed their appreciation of how the Client Network can help them prioritize leads and how the information it provides complements that available from the tools they already work with. As part of future work, we will iterate the design of the UI based on the user feedback, add new features based on new use cases and explore novel ways to display the subgraph. Furthermore, we aim to improve the quality of the recommendations and re-evaluate the performance of the algorithms with new experiments.
We would like to acknowledge the support and collaboration of the CAO team (Alice Chang, Claire Tian, Ben J Dubiel, Sanjmeet Abrol and Weiwei Li) for their valuable insights into the realm of digital sellers.
References
1. Adafre, S.F., de Rijke, M.: Discovering missing links in wikipedia. In: Proceedings of the 3rd International Workshop on Link Discovery. pp. 90–97. LinkKDD ’05, ACM (2005). https://doi.org/10.1145/1134271.1134284
2. Adamic, L.A., Adar, E.: Friends and neighbors on the web. Social Networks (3), 211–230 (2003)
3. Aiello, L.M., Barrat, A., Schifanella, R., Cattuto, C., Markines, B., Menczer, F.: Friendship prediction and homophily in social media. ACM Trans. Web (2), 9:1–9:33 (Jun 2012)
4. Airoldi, E.M., Blei, D.M., Xing, E.P., Fienberg, S.E.: Mixed membership stochastic block models for relational data, with applications to protein-protein interactions. In: Proceedings of International Biometric Society-ENAR Annual Meetings (2006)
5. Almansoori, W., Gao, S., Jarada, T.N., ElSheikh, A.M., Murshed, A.N., Jida, J., Alhajj, R., Rokne, J.G.: Link prediction and classification in social networks and its application in healthcare and systems biology. NetMAHIB (1-2), 27–36 (2012)
6. Bastian, M., Heymann, S., Jacomy, M., et al.: Gephi: an open source software for exploring and manipulating networks. Icwsm (2009), 361–362 (2009)
7. B.V., A.: Vis.js (2018), http://visjs.org/
8. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, 3 edn. (2009)
9. Davis, D., Lichtenwalter, R., Chawla, N.V.: Multi-relational link prediction in heterogeneous information networks. In: Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining. pp. 281–288. ASONAM ’11 (2011)
10. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math. (1), 269–271 (Dec 1959). https://doi.org/10.1007/BF01386390, http://dx.doi.org/10.1007/BF01386390
11. …, 319–323 (1993)
12. Freschi, V.: A graph-based semi-supervised algorithm for protein function prediction from interaction maps. In: LION. Lecture Notes in Computer Science, vol. 5851, pp. 249–258. Springer (2009)
13. Hagberg, A.A., Schult, D.A., Swart, P.J.: Exploring network structure, dynamics, and function using NetworkX. In: Varoquaux, G., Vaught, T., Millman, J. (eds.) Proceedings of the 7th Python in Science Conference. pp. 11–15. Pasadena, CA (2008)
14. Hasan, M.A., Zaki, M.J.: A survey of link prediction in social networks. In: Social Network Data Analytics, pp. 243–275 (2011)
15. Huang, Z., Li, X., Chen, H.: Link prediction approach to collaborative filtering. In: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries. pp. 141–142. JCDL ’05, ACM, New York, NY, USA (2005)
16. Jeh, G., Widom, J.: Simrank: A measure of structural-context similarity. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 538–543. KDD ’02, ACM (2002)
17. Johnson, R.A., Yang, Y., Aguiar, E., Rider, A., Chawla, N.V.: Alive: A multi-relational link prediction environment for the healthcare domain. In: Proceedings of the 2012 Pacific-Asia Conference on Emerging Trends in Knowledge Discovery and Data Mining. pp. 36–46. PAKDD’12, Springer-Verlag, Berlin, Heidelberg (2013)
18. Karp, R.M.: Reducibility among combinatorial problems. In: a symposium on the Complexity of Computer Computations. pp. 85–103 (1972)
19. Katz, L.: A new status index derived from sociometric analysis. Psychometrika (1), 39–43 (March 1953), http://ideas.repec.org/a/spr/psycho/v18y1953i1p39-43.html
20. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology (7), 1019–1031 (2007). https://doi.org/10.1002/asi.20591, http://dx.doi.org/10.1002/asi.20591
21. Lichtenwalter, R.N., Lussier, J.T., Chawla, N.V.: New perspectives and methods in link prediction. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 243–252. KDD ’10, ACM, New York, NY, USA (2010). https://doi.org/10.1145/1835804.1835837, http://doi.acm.org/10.1145/1835804.1835837
22. Martínez, V., Berzal, F., Cubero, J.C.: A survey of link prediction in complex networks. ACM Comput. Surv. (4), 69:1–69:33 (Dec 2016). https://doi.org/10.1145/3012704, http://doi.acm.org/10.1145/3012704
23. Mori, J., Kajikawa, Y., Kashima, H., Sakata, I.: Machine learning approach for finding business partners and building reciprocal relationships. Expert Systems with Applications (12), 10402–10407 (2012). https://doi.org/10.1016/j.eswa.2012.01.202
24. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing order to the web. Tech. rep., Stanford University (1999)
25. Shakibian, H., Moghadam Charkari, N.: Mutual information model for link prediction in heterogeneous complex networks. Scientific Reports, 44981 (03 2017), http://dx.doi.org/10.1038/srep44981
26. Shi, C., Li, Y., Yu, P.S., Wu, B.: Constrained-meta-path-based ranking in heterogeneous information network. Knowl. Inf. Syst. (2), 719–747 (2016), http://dblp.uni-trier.de/db/journals/kais/kais49.html
27. Shi, C., Li, Y., Zhang, J., Sun, Y., Yu, P.S.: A survey of heterogeneous information network analysis. IEEE Trans. on Knowl. and Data Eng. (1), 17–37 (Jan 2017). https://doi.org/10.1109/TKDE.2016.2598561, https://doi.org/10.1109/TKDE.2016.2598561
28. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. PVLDB (11), 992–1003 (2011), http://dblp.uni-trier.de/db/journals/pvldb/pvldb4.html
29. Tarjan, R.E.: Edge-disjoint spanning trees and depth-first search. Acta Informatica, 171–185 (1976)
30. Wang, P., Xu, B., Wu, Y., Zhou, X.: Link prediction in social networks: the state-of-the-art. CoRR abs/1411.5118 (2014), http://arxiv.org/abs/1411.5118
31. Ware, C., Purchase, H., Colpoys, L., McGill, M.: Cognitive measurements of graph aesthetics. Information visualization (2), 103–110 (2002)
32. Wu, S., Sun, J., Tang, J.: Patent partner recommendation in enterprise social networks. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. pp. 43–52. WSDM ’13, ACM, New York, NY, USA (2013). https://doi.org/10.1145/2433396.2433404, http://doi.acm.org/10.1145/2433396.2433404
33. Yang, Y., Chawla, N., Sun, Y., Hani, J.: Predicting links in multi-relational and heterogeneous networks. In: Proceedings of the 2012 IEEE 12th International Conference on Data Mining. pp. 755–764. ICDM ’12, IEEE Computer Society, Washington, DC, USA (2012). https://doi.org/10.1109/ICDM.2012.144, http://dx.doi.org/10.1109/ICDM.2012.144
34. Yang, Y., Lichtenwalter, R.N., Chawla, N.V.: Evaluating link prediction methods. CoRR abs/1505.04094 (2015)
35. Zhang, J., Norman, D.A.: Representations in distributed cognitive tasks. Cognitive science (1), 87–122 (1994)
36. Zhu, J., Hong, J., Hughes, J.G.: Using markov models for web site link prediction. In: Proceedings of the Thirteenth ACM Conference on Hypertext and Hypermedia. pp. 169–170. HYPERTEXT ’02, ACM, New York, NY, USA (2002). https://doi.org/10.1145/513338.513381