Know Your Clients' behaviours: a cluster analysis of financial transactions

Abstract

In Canada, financial advisors and dealers are required by provincial securities commissions and self-regulatory organizations--charged with direct regulation over investment dealers and mutual fund dealers--to respectively collect and maintain Know Your Client (KYC) information, such as their age or risk tolerance, for investor accounts. With this information, investors, under their advisor's guidance, make decisions on their investments which are presumed to be beneficial to their investment goals. Our unique dataset is provided by a financial investment dealer with over 50,000 accounts for over 23,000 clients. We use a modified behavioural finance recency, frequency, monetary model for engineering features that quantify investor behaviours, and machine learning clustering algorithms to find groups of investors that behave similarly. We show that the KYC information collected does not explain client behaviours, whereas trade and transaction frequency and volume are most informative. We believe the results shown herein encourage financial regulators and advisors to use more advanced metrics to better understand and predict investor behaviours.


John R.J. Thompson, Longlong Feng, R. Mark Reesor, Chuck Grace

May 15, 2020

John R.J. Thompson (Corresponding Author), Department of Statistical and Actuarial Sciences, The University of Western Ontario, London, Ontario N6A
Longlong Feng, Department of Mathematics, Wilfrid Laurier University, Waterloo, Ontario N2L
R. Mark Reesor, Department of Mathematics, Wilfrid Laurier University, Waterloo, Ontario N2L
Chuck Grace, Department of Finance, Ivey Business School, London, Ontario N6G


Keywords: machine learning, clustering, behavioural finance, financial advising

Introduction

Investors hire financial advisors to help them select, facilitate, and manage their investment choices. In Canada, the client-advisor relationship varies by institution and regulatory regime. Some investors ask advisors to provide advice but ultimately make their own investment choices, other investors ask for a recommendation and then approve the advisor's investment choices, while still others delegate full discretionary investment choices to the advisor. However, regardless of the relationship, advisors are expected to provide recommendations that are suitable for the client.

Suitability is described by regulators in Canada as a meaningful dialogue with the client to obtain a solid understanding of the client's investment needs and objectives, and to explain how a proposed investment strategy is suitable for the client in light of the client's investment needs and objectives (Ontario Securities Commission, 2014). One of the suitability determinants for advisors is to determine the general investment needs and objectives of their client and any other factors necessary for them to determine whether a proposed purchase or sale is suitable (Know Your Client or KYC). The assumption is that any subsequent purchases or sales (trading behaviour) will conform to the KYC attributes and therefore be suitable.

Footnote: An important aspect of suitability is the product recommendation or KYP, which we will address in subsequent papers.

In this paper, we consider unique interconnected datasets of financial transactions and KYC attributes to examine the relationship between KYC and trading behaviour. The KYC data is comprised of objective demographic and identifying information and subjective financial situation information, where both are used to generate a client's risk tolerance. We quantify trading behaviour through metrics designed using an extended Recency, Frequency, and Monetary (RFM) model from behavioural finance. Our hypothesis is that groups of investors with similar KYC attributes will have the same risk tolerance and trading behaviours. KYC information should inform a risk tolerance score which the financial advisor, informed by suitability regulations, uses to delineate client investment transactions.

We conduct our analysis using a machine learning k-prototypes clustering algorithm and visualize the clusters using t-distributed stochastic neighbour embeddings. Using advanced data analytics, our analysis shows that:

• Objective and subjective KYC data have little influence on trading behaviours (cf. Table 1).
• The distribution of risk tolerance across each cluster's trading behaviour is found to be similar, showing that trading behaviours may on occasion be inconsistent with the KYC-generated risk tolerance (cf. Table 1 and Figure 12).
• KYC criteria appear to concentrate investors within narrow and rigid swim lanes and appear to do a [...] and investment outcomes.

Footnote: In this paper we have focused on trading behaviour but we plan to address portfolio construction, asset mix, and risk and returns in subsequent papers.

Figure 1: The downstream footprints of KYC regulations.

Our conclusion that KYC data does not demonstrate a strong relationship to the trading behaviours exhibited by investors is important because "Know Your Client" is a foundational principle behind the concept of "suitability" and the corresponding investment regulatory framework deployed in many jurisdictions. The principle has also become more important as employers and governments de-risk retirement and savings programs post-2009 and move more of the burden of investment decision making from professional portfolio managers to individual investors. Furthermore, the topic has become more urgent given the events of early 2020.

Footnote: See Proposed Amendments to National Instrument 31-103 Registration Requirements, Exemptions and Ongoing Registrant Obligations, December 2019 for a full discussion of the topic in Canada.

At this point, it is important to acknowledge that investor behaviour is a complex and dynamic topic. Investor behaviour is not only driven by the investor's personal motives, such as their goals and financial needs, but is also influenced by the advisor relationship, dealer processes, regulatory obligations, and market influences. As well, while the client onboarding and discovery process is foundational, it is also contextual and time-dependent since the corresponding product recommendations are constantly changing in real time. While the dataset and analysis used in this paper are unique, we are not privy to some of the subjective or undocumented influences and we cannot include them in our algorithms. We have also examined only one set period of time. It is therefore impossible for us to determine why the KYC process is not leading to the outcomes we would expect. Our analysis has inspired the question "Could protocols be improved?" but we cannot answer the question without further research.

Table 1
Cluster | KYC | Trade behaviour
1 – Active Traders | Average age, income & demographics. Average investment knowledge. Average $ accounts & balances | Trade frequently in large amounts and appear sensitive to market influences
2 – Early Savers | Slightly younger but average income & demographics. Average investment knowledge. Average $ accounts & balances | Smaller, regular deposits making use of PACs
3 – Just-in-Time | Average age, income & demographics. Average investment knowledge. Average $ accounts & balances | Infrequent trades at seemingly random intervals
4 – Older Investors | Older but average income & demographics. Average investment knowledge. Average $ accounts & balances | Primarily withdrawals, dividends, and interest payments
5 – Systematic Savers | Average age, income & demographics. Average investment knowledge. Average $ accounts & balances | Larger, systematic trades and re-balancing
Risk tolerance observed average (on a scale of 1 to 5, where 1 is a low or preservation risk tolerance and 5 is high or aggressive): [...]

The paper is organized as follows. The rest of Section 1 is a literature review on KYC regulations and trading behaviour. Section 2 introduces the client and advisor financial data collected by a dealer, and develops the features that were used to measure client behaviours. Section 3 describes the machine learning methods used to identify investor groups based on their KYC information and behaviour metrics. Section 4 shows the results from that clustering and Section 5 discusses the implications of the results and future work.

Investors hire financial advisors who, in turn, recommend or distribute suitable financial products from investment dealers. The regulations for investment suitability for clients in Canada have been in place for decades and were formed through a collaboration of dealers, advisors, and regulators, with significant updates in 2009.
This paper studies the KYC obligation that requires financial advisors and dealers to conduct due diligence on clients and take reasonable steps to establish such things as their identity, creditworthiness, investment needs, financial objectives, and risk tolerance. The KYC obligation is designed to protect clients and advisors from unnecessary financial risk that does not align with the needs of the client, and ensure advisors and dealers are acting in good faith.

Footnote: Please refer to Section 5 for our future research plans.

Know your client

To fulfill the KYC suitability requirement, advisors meet with clients to determine the client's identity, investment needs, financial objectives and circumstances, and risk tolerance. Many, but not all, will use a formal questionnaire to help gather this information and score the risk tolerance. An effective KYC protocol collects two types of information: (1) objective demographic information (legal identity), and (2) subjective information, from the perception of the client and their financial advisor, on the client's investment needs, financial objectives, investment knowledge, appetite for risk, and circumstances. For example, the questionnaire typically establishes the client's identity by their full name, social insurance number, date of birth, address, and phone number. For investment needs, financial objectives, and circumstances, they are asked about their income, net assets, living expenses, time horizon for the investment account, potential withdrawal of funds from the account over a year, how they would change their portfolio based on market changes, how they set aside savings, plan for retirement, and make retirement savings plan contributions. To help determine risk tolerance, they are asked about investment knowledge, dependants, debt, willingness to take on risk based on situational questions, and what they want to accomplish with their wealth.

Footnote: Questionnaires are not limited to these criteria since regulators do not require a specific questionnaire but to take reasonable steps to understand client needs.

Research in the area of effective KYC protocols is at the emergent stage and has focused on the collection and evaluation of KYC information. The main focuses of research by the financial community have been on the objective information for improving compliance to prevent illegal or terrorist activities and decreasing the cost associated with increased compliance. Where KYC research exists, it tends to focus on cost efficiency using distributed ledger systems (Moyano and Ross, 2017), how the financial crisis in the USA from 2007 to 2009 may have been affected by non-compliance with US KYC regulations (Bilali, 2011), on using KYC to protect client accounts (Mondal et al., 2016), and on improving auditor effectiveness in evaluating KYC compliance (Smet and Mention, 2011).

In contrast, few studies have examined the subjective information of the KYC obligation and its relationship to advisor and client investment behaviours, client investment objectives and outcomes, and dealer strategies to assist their advisors (Ontario Securities Commission, 2015). Picard and de Palma (2010) reviewed a number of existing risk tolerance assessment tools and concluded that while the neoclassical economic concept of risk tolerance is clear, its measurement through surveys is unclear. Since the economic definition of risk tolerance is a variation in future spending, many economists use questions that measure income volatility over time in order to assess risk tolerance. These questions are theoretically correct, but their performance as predictors of actual investment behaviour during volatile stock markets is mediocre.

At the outset, the hypothesis for our research was that a thorough and complete assessment of an investor's KYC data should lead to an accurate determination of their risk tolerance and suitability requirements. In turn, those determinations should manifest downstream in trading behaviour and, eventually, in product recommendations, portfolio construction, and investment outcomes.

In this paper, we look to better understand the relationship between collected KYC information and trading behaviours through applications of behavioural finance and statistical analysis. Behavioural finance is the intersection of psychology and finance to explain the trends and actions of financial markets, institutions, advisors, and individual investors. Behavioural finance has three main areas of application: analysis of patterns in stock returns, studying trading activity, and corporate finance (Subrahmanyam, 2008). Our analysis focuses on trading activity. Our dataset encompasses over 23,000 clients who work with financial advisors at an anonymous investment dealer under the auspices of the Investment Industry Regulatory Organization of Canada (IIROC) regulatory regime. We use an extended RFM behavioural finance model (Lumsden et al., 2008). RFM models are used primarily in direct marketing to analyze customer behaviours through the recency of their last purchase, the frequency of their purchases, and how much is spent on each purchase. RFM models have been embedded in data mining algorithms (Birant, 2011).

It is important to acknowledge that investor behaviour is a complex and dynamic topic. Investor behaviour is not only driven by the investor's personal motives, such as their goals and financial needs, but is also influenced by the advisor relationship, dealer processes, regulatory obligations, and market influences. While the dataset and analysis used in this paper are unique, we are not privy to some of the subjective or undocumented influences and we cannot include them in our algorithms. It is therefore impossible for us to determine why the KYC process is not leading to the outcomes we would expect. Our analysis has inspired the question "Could protocols be improved?" but we cannot answer the question without further research, which we discuss in Section 5.

Data description and feature engineering for behavioural finance

The data for this analysis is provided by a registered investment dealer that has provided investment products and technology to Canadian retail investors for over 30 years. The dealer currently has approximately 200 advisors who work with approximately 23,000 clients across Canada with over 5 billion Canadian dollars (CAD) in assets. Clients typically have multiple accounts, each with a different purpose. For example, a client may have accounts for: (i) retirement savings; (ii) children's education savings; and (iii) other savings. In total, clients with advisors who work with the dealer have over 50,000 accounts. The dealer provides a variety of financial products and services designed to support independent advisors. Its focus is to provide positive outcomes to clients and advisors, and not to push certain financial products.

In this section, we describe the KYC information and the trades and transactions recorded in the data. We use descriptive analysis to demonstrate the demographics of our data and that the data is of good quality. We describe the features engineered from the data to be used in clustering, including unique metrics that measure client behaviours.

The data is comprised of 52,025 accounts for 23,970 clients with associated KYC information and trade and transaction details from August 13th 2018 to August 12th 2019. The datasets were edited by the data donor prior to our receipt to ensure all client identifiers were anonymized consistent with Canada's Personal Information Protection and Electronic Documents Act (PIPEDA) and standard research ethics protocols. Even using anonymization practices, there is still the possibility that clients could be identified using machine learning algorithms (Rocher et al., 2019). Therefore, no individuals will be identified or referenced in this paper and no subset of the data can be shared with readers.

The data is organized into linked datasets where entries are uniquely determined by an anonymized account ID or other relational database information. The specific datasets we used are a KYC information dataset and a trades and transactions dataset. We created new features derived from both datasets that effectively supplement the KYC information with metrics that measure trading behaviours.

The data was processed by cleaning improper entries (e.g., recording typos), transforming values into categories (e.g., grouping occupations into classifications), and removing irrelevant, anonymized (e.g., contact information), or repeated (e.g., postal code in place of residence region) data. Any variable containing over 60 percent missing values or errors (e.g., '*' or 'unknown') is removed to avoid excessive bias from imputation in our analysis. On the remaining data, imputation is conducted for each numeric and categorical feature based on existing values. For example, missing values in categorical variables such as 'residency' are filled with the mode value 'Ontario' since more than 67% of clients are from Ontario; missing values in numerical variables such as 'annual income' are filled with the mean income based on the job categories from KYC.
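The cleaning and imputation rules above can be illustrated with a small standard-library sketch. It is not the authors' pipeline: the record layout, the column names (`residency`, `job`, `income`, `fax`), and the set of markers treated as missing are assumptions made for the example.

```python
from collections import Counter
from statistics import mean

# Markers treated as missing; this particular set is an assumption.
MISSING = {None, "", "*", "unknown"}

def is_missing(value):
    return value in MISSING

def drop_sparse_columns(rows, threshold=0.60):
    """Remove any column whose share of missing values exceeds the threshold."""
    keep = [c for c in rows[0].keys()
            if sum(is_missing(r[c]) for r in rows) / len(rows) <= threshold]
    return [{c: r[c] for c in keep} for r in rows]

def impute_mode(rows, column):
    """Fill missing categorical values with the most common observed value."""
    observed = [r[column] for r in rows if not is_missing(r[column])]
    mode_value = Counter(observed).most_common(1)[0][0]
    for r in rows:
        if is_missing(r[column]):
            r[column] = mode_value

def impute_group_mean(rows, column, group_column):
    """Fill missing numeric values with the mean of the row's group.

    Assumes every group has at least one observed value.
    """
    groups = {}
    for r in rows:
        if not is_missing(r[column]):
            groups.setdefault(r[group_column], []).append(r[column])
    for r in rows:
        if is_missing(r[column]):
            r[column] = mean(groups[r[group_column]])

# Hypothetical client records for illustration only.
clients = [
    {"residency": "ON", "job": "teacher", "income": 60000, "fax": "*"},
    {"residency": None, "job": "teacher", "income": None, "fax": "*"},
    {"residency": "BC", "job": "nurse", "income": 80000, "fax": "unknown"},
    {"residency": "ON", "job": "teacher", "income": 70000, "fax": "555"},
]
cleaned = drop_sparse_columns(clients)        # "fax" is 75% missing, so it is dropped
impute_mode(cleaned, "residency")             # missing residency becomes the mode
impute_group_mean(cleaned, "income", "job")   # missing income becomes the job-category mean
```

Dropping sparse variables before imputing means no statistic is ever estimated from a column that is mostly empty, which is the bias the 60 percent rule is meant to limit.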
See Table 8 in Appendix B for more details on missing data.

Table 2 shows the details of the pertinent objective KYC information. The distribution of client age is shown in Figure 2. The client age distribution is unimodal, centred at 58.1 years, has a standard deviation of 14.1 years, and is slightly left-skewed. The minimum age is 18 years, the legal age to open an account in Canada, and the maximum is 98.

Table 2: Details of variables from clients' KYC information
Variable | Summary | Data type | Example values
Age | Ages range from 18 to 98 years old, with average at 57.4 years | Continuous | 31 years old
Gender | 50.5% male and 49.5% female | Indicator | M, F
Residency | Province or Country or Region, with 70% from Ontario | Categorical | ON, UK, USEast, ...
Annual income | Gross annual income in CAD | Continuous | Multiples of 100 between $1,000 and $220,[...]

[...]658 and is right-skewed, with 50% of clients making less than $60k. There are also income spikes at $50k, $100k, $150k, and $200k. Table 4 shows the number of accounts per client. Most clients have two accounts and few have five or more.

Residency | ON | BC | AB | MB | NS | Other (CA) | Unknown | USA | UK
Percentage | 65.19 | 14.63 | 12.00 | 3.94 | 2.59 | 0.92 | 0.41 | 0.26 | 0.06
Abbreviations: Ontario (ON), British Columbia (BC), Alberta (AB), Nova Scotia (NS), Canada (CA), United States of America (USA), United Kingdom (UK).

Figure 3: Distribution of client annual incomes. The vertical dotted lines represent the three quartiles at $40k, $60k, and $100k.

Table 4: The number of clients by number of accounts.
Unique accounts | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Number of clients | 5475 | 7659 | 6661 | 3051 | 775 | 222 | 79 | 40 | 4 | 4

Our dataset contains a combination of trades and transactions for each client. We reserve the word "trades" for any interaction with mutual funds, stocks, securities, and bonds, and "transactions" for any other interaction, such as collecting dividends and interest. Trades are logged as orders, which are either active, inactive, filled, rejected, cancelled, or expired. In this paper, only filled orders are studied; the study of investor behaviours through their full order history is deferred to future work. Each trade and transaction is recorded with the type of product or transaction, size, value, currency type, security identification code, order date, process date, value date, and more. Using the trades and transactions dataset, we determined the variables that we believe contain information on client behaviours and developed new metrics using feature engineering to measure client behaviour.

Feature engineering in data science is the process of using industry knowledge about data to construct metrics or "features" that can act as a measure for a quantity to be used in a machine learning model (Zheng and Casari, 2018). Features generated from an RFM model can be used in conjunction with a machine learning algorithm (Anitha and Patil, 2019). We construct features using objective and subjective KYC information, and trade and transaction information, that we believe to be related to client investment behaviour. Our features are an extension of an RFM model and fall into four categories: recency, frequency, monetary, and profile (RFMP).

The RFMP features are aggregated into a cross-sectional dataset that is static in time, where the cross-section is calculated on the last day recorded in the dataset (August 12th 2019). Table 5 lists the features used for the clustering algorithm described in Section 3 and to generate the results shown in Section 4. We now describe each type.

Profile features describe who the client is and what their financial goals are. Commonly, they are considered influential factors in the behaviour of the client (Foerster et al., 2017). Profile features are generated from KYC and account information for each of the clients. Some of the profile features were immediately ready for use (for example, the time horizon of the account) whereas other variables needed to be derived: age in years is calculated from birth dates, and the number of accounts is determined by searching the database for client accounts.
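The two derived profile features mentioned above can be sketched as follows. This is a minimal illustration, assuming that age means completed years as of the cross-section date and that accounts are stored as rows with hypothetical `client_id` and `account_id` fields; neither detail is stated in the paper.

```python
from collections import Counter
from datetime import date

CROSS_SECTION_DATE = date(2019, 8, 12)  # last day recorded in the dataset

def age_in_years(birth_date, as_of=CROSS_SECTION_DATE):
    """Completed years between birth_date and as_of (assumed definition of age)."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)

def accounts_per_client(account_rows):
    """Count distinct account IDs for each client ID."""
    unique_pairs = {(r["client_id"], r["account_id"]) for r in account_rows}
    return Counter(client for client, _ in unique_pairs)

# Hypothetical account rows for illustration only.
accounts = [
    {"client_id": "c1", "account_id": "a1"},
    {"client_id": "c1", "account_id": "a2"},
    {"client_id": "c2", "account_id": "a3"},
    {"client_id": "c1", "account_id": "a2"},  # repeated row, counted once
]
n_accounts = accounts_per_client(accounts)
```

Computing age against the fixed cross-section date, rather than the current date, keeps the feature consistent with the other RFMP features, which are all evaluated on August 12th 2019.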
Table 5: The RFMP features engineered from the dataset
Feature type | Description | Variables
Recency | Number of days since last trade on record | Days between the most recent trade date and August 12, 2019
Frequency | Total number of trades | Number of trades between first trade date and August 12, 2019
Frequency | Average number of days between trades | Number of days divided by number of trades since first trade day
Monetary | Buy and sell size totals; buy and sell size minimum and maximum; trade size by type; variability of trade size by type | Third-party initiated trade type: dividends, income distribution, interest. Systematic trade type: auto-withdrawal, pre-authorized contribution, asset allocation, reinvested dividend. Periodic trade type: buys, sells, contribution, exchange, payment, electronic funds transfer (EFT), withdrawal, EFT deposit, tax-free savings account (TFSA) contribution, spousal contribution, redeems
Profile | KYC information; financial descriptors (e.g. number of accounts) | Age, gender, residency, annual income, investment knowledge level, number of accounts, marital status, retirement indicator

The recency feature is calculated as the number of days since a client's most recent trade or transaction. The frequency features are calculated from a client's overall amount of trading throughout the history of the dataset. These two feature types provide some information on their own, but are more informative when used together. If a client has a large total number of trades (frequency) and months have passed since their last trade (recency), this indicates a "burst" investing behaviour.

The monetary features are engineered from trade and transaction amount details, rather than their temporal attributes. Specifically, a trade size multiplied by the value for each unit is the total monetary value in CAD, which we will refer to as the trade amount. If we treated every trade as equivalent, as the recency and frequency features do, we would incorrectly consider purchasing a stock to be the same as re-investing a dividend. The stock purchase is an active trade that a client or advisor initiates, whereas a re-invested dividend is not. We classify trade sizes into the three metrics given by

$$\text{Third-party initiated trade size} = \text{Dividend} + \text{Income distribution} + \text{Interest}, \tag{1}$$

$$\text{Systematic trade size} = \text{Auto withdrawal} + \text{Pre-authorized contribution} + \text{Asset allocation} + \text{Reinvest dividend}, \tag{2}$$

$$\text{Periodic trade size} = \text{Buy (securities)} + \text{Sell (securities)} + \text{Contribution} + \text{Exchange} + \text{Payment} + \text{Electronic funds transfer (EFT)} + \text{Withdrawal} + \text{EFT deposit} + \text{TFSA} + \text{Spousal contribution} + \text{Redeem}, \tag{3}$$

where the descriptions of the trade types can be found in Appendix A. Third-party initiated trades are comprised of trade types that are initiated by a third party, such as a coupon collected as cash from a bond. Systematic trades are comprised of self-imposed automatic investment strategies, such as an automatic monthly withdrawal from savings to purchase a mutual fund. Periodic trades are client or advisor initiated trades and transactions, such as an unscheduled purchase of a mutual fund for a TFSA.

Figure 4 shows the relative percentages of transaction sizes comprising the three behavioural metrics in Equations (1) to (3) versus time. For third-party initiated trade size, dividend and income distribution dominate most of the transactions, and there appears to be a cyclical trend for dividends paid at the beginning of every month. For systematic trades, automatic withdrawal represents the majority of the feature size and has an obvious cyclical trend. There are spikes for asset allocation at the beginning of the year and six months in, indicating a twice-yearly cycle for asset allocations in systematic trades. For the periodic trades, the buy and sell types dominate without any cyclical trends.

Figure 4: The relative percentage of transaction sizes from the three behavioural metrics versus time (January to August 2019). Top, middle, and bottom panels correspond to third-party initiated, systematic, and periodic trades, respectively.

The features we engineer in this section are used directly as variables in our clustering model in Section 4. The next step is to use our engineered features in a clustering algorithm. The theoretical underpinnings for our algorithm are described in the next section, which is followed by empirical results from clustering in the subsequent section.
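As an illustration of how the recency, frequency, and monetary features in Table 5 and Equations (1) to (3) might be computed for one client, here is a standard-library sketch. The trade record fields, the type labels, and the choice to sum trade amounts (size times unit value) within each category are assumptions for the example; "average days between trades" follows one reading of Table 5, namely days since the first trade divided by the number of trades.

```python
from datetime import date

CROSS_SECTION_DATE = date(2019, 8, 12)

# Trade-type groupings from Equations (1) to (3); the label spellings are hypothetical.
THIRD_PARTY = {"dividend", "income_distribution", "interest"}
SYSTEMATIC = {"auto_withdrawal", "pre_authorized_contribution",
              "asset_allocation", "reinvest_dividend"}
PERIODIC = {"buy", "sell", "contribution", "exchange", "payment", "eft",
            "withdrawal", "eft_deposit", "tfsa", "spousal_contribution", "redeem"}

def rfm_features(trades, as_of=CROSS_SECTION_DATE):
    """trades: non-empty list of dicts with 'date', 'type', 'size', 'unit_value'."""
    dates = sorted(t["date"] for t in trades)
    features = {
        "recency_days": (as_of - dates[-1]).days,
        "frequency": len(dates),
        "avg_days_between": (as_of - dates[0]).days / len(dates),
        "third_party": 0.0, "systematic": 0.0, "periodic": 0.0,
    }
    for t in trades:
        amount = t["size"] * t["unit_value"]  # trade amount in CAD
        if t["type"] in THIRD_PARTY:
            features["third_party"] += amount
        elif t["type"] in SYSTEMATIC:
            features["systematic"] += amount
        elif t["type"] in PERIODIC:
            features["periodic"] += amount
        else:
            raise ValueError(f"unknown trade type: {t['type']}")
    return features

# Hypothetical trades for one client.
trades = [
    {"date": date(2019, 1, 10), "type": "buy", "size": 10, "unit_value": 25.0},
    {"date": date(2019, 3, 1), "type": "dividend", "size": 1, "unit_value": 12.5},
    {"date": date(2019, 8, 2), "type": "pre_authorized_contribution",
     "size": 4, "unit_value": 50.0},
]
features = rfm_features(trades)
```

Keeping the three monetary totals separate is what lets a clustering model distinguish a client who actively buys from one whose activity is mostly reinvested dividends, even when their trade counts are similar.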

Clustering is an unsupervised machine learning technique that is used to draw inferences about grouping commonalities from like individuals in high-dimensional data. It is a popular method for exploratory data analysis that finds previously unknown structures in data without specifying the underlying data generating process. Clustering is used in many fields, such as identifying fake news (Hosseinimotlagh and Papalexakis, 2018), bioinformatics (Krishna and Murty, 1999; Lan et al., 2018), text mining (Berry and Castellanos, 2004), and wireless sensor networks (Abbasi and Younis, 2007).

Clustering bears the task of grouping our set of clients by considering the similarity of their attributes and trading behaviour (Xu and Wunsch, 2008). We are interested in applications of clustering for financial data analytics (Le-Khac et al., 2012), particularly the area of Behaviour Clustering Analysis (BCA). Popular clustering algorithms used in this field are k-means (Steinley, 2006) and k-modes (Huang, 1998; Chaturvedi et al., 2001; Huang and Ng, 2003). In this section, we introduce the k-prototypes algorithm, which allows both continuous and categorical data to be used to cluster clients based on their similarity. Next, we introduce t-distributed stochastic neighbour embeddings, which reduce the dimensions of the data based on the similarity of each data point. The embeddings display the data in low dimensions by similarity, while the clustering algorithm identifies the clusters among the data points.

k-prototypes clustering

The k-prototypes algorithm used here is similar to the k-means algorithm, where k-prototypes incorporates methods for including categorical data (Huang, 1997). Suppose we have a set of $N$ accounts, each with a unique identifier or index in the set $\mathcal{N} = \{1, 2, \ldots, N\}$.
The goal of any clustering algorithm is to put clients into $k$ groups or clusters such that

• each client is put into exactly one cluster;
• clients within a cluster have similar attributes; and
• clients in different clusters have dissimilar attributes.

Mathematically, the $k$ clusters form a partition of the client index set into $k$ subsets. Let $\mathcal{N}_\ell$ denote the set of client indices in cluster $\ell$, $\ell = 1, 2, \ldots, k$, and $P_{\mathcal{N}} = \{\mathcal{N}_1, \mathcal{N}_2, \ldots, \mathcal{N}_k\}$ denote the partition of the client index set. Furthermore, let $n_\ell$ denote the number of clients in cluster $\ell$, such that $\sum_{\ell=1}^{k} n_\ell = N$.

Footnote: A partition of any set $A$ is a set of subsets $A_1, A_2, \ldots$, that are mutually disjoint ($A_i \cap A_j = \emptyset$ for all $i \neq j$) and exhaustive ($\cup_i A_i = A$).

Each client has attributes that describe the individual, given by their attribute vector $x_i$, $i = 1, \ldots, N$. These attributes are a combination of $p$ numeric variables (e.g., age) and $q$ categorical variables (e.g., marital status). Without loss of generality, we put the numeric attributes in the first $p$ positions of the attribute vector and the categorical attributes in the last $q$ positions, giving

$$x_i = (\underbrace{x_{i1}, x_{i2}, \ldots, x_{ip}}_{\text{numeric}}, \underbrace{x_{i(p+1)}, \ldots, x_{i(p+q)}}_{\text{categorical}}). \tag{4}$$

The clustering algorithm works in an iterative fashion according to the following steps.

1. Initialize the centroid (location) of the clusters by selecting $k$ clients as "prototype" centroids.
2. Allocate the clients to the clusters with the closest centroid.
3. Compute an overall cost of the allocation by computing the total distance of all clients from their assigned centroids.
4. Update cluster centroids.
5. Re-allocate the clients to the clusters with the closest (updated) centroid.
6. Compute the overall cost by computing total distance.
7. Iterate steps 4-6 until there is no change in the overall cost and output the clusters.

We begin by randomly selecting $k$ clients to serve as the initial centroids (locations) of the clusters. Specifically, the initial centroids are given by the attribute vectors of the randomly chosen $k$ clients and are denoted by

$$c_\ell = (\underbrace{c_{\ell 1}, c_{\ell 2}, \ldots, c_{\ell p}}_{\text{numeric}}, \underbrace{c_{\ell(p+1)}, \ldots, c_{\ell(p+q)}}_{\text{categorical}}), \quad \ell = 1, \ldots, k, \tag{5}$$

where $c_{\ell j}$ is the cluster-$\ell$, attribute-$j$ centroid. Attributes in the centroid vectors are positioned in exactly the same order as in the client attribute vectors. As we shall see, as clusters are formed the centroids get updated according to the individuals within each cluster.

After initializing the cluster centroids, we need some way of deciding how to put the clients into the clusters so that individuals within clusters are similar (close) and individuals across clusters are dissimilar (far apart). To measure the similarity between client $i$ and cluster $\ell$ we use the distance metric

$$d(x_i, c_\ell) = \sum_{n=1}^{p} \sqrt{(x_{in} - c_{\ell n})^2} + \sum_{n=p+1}^{p+q} \delta(x_{in}, c_{\ell n}), \tag{6}$$

where

$$\delta(a, b) = \begin{cases} 1, & a \neq b \\ 0, & a = b. \end{cases} \tag{7}$$

Note that the distance metric is zero if and only if the attribute vector is exactly the same as the centroid, and if there are no categorical variables ($q = 0$) then $d(\cdot, \cdot)$ is the usual Euclidean distance.

For client $i$, the distance between its attribute vector and each of the $k$ cluster centroids is computed, $d(x_i, c_\ell)$, $\ell = 1, \ldots, k$, and the client is placed in the closest cluster (i.e., minimum distance).
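The distance in Equation (6) and the allocation rule can be sketched in a few lines. This sketch takes Equation (6) as printed, with the square root inside the sum, so each numeric term reduces to an absolute difference; the function names and the example attributes are hypothetical.

```python
from math import sqrt

def mixed_distance(x, c, p):
    """Equation (6): the first p entries are numeric, the rest categorical."""
    numeric = sum(sqrt((x[n] - c[n]) ** 2) for n in range(p))
    # Equation (7): each categorical mismatch contributes 1, each match 0.
    categorical = sum(1 for n in range(p, len(x)) if x[n] != c[n])
    return numeric + categorical

def closest_cluster(x, centroids, p):
    """Index of the centroid at minimum distance from x (first one on ties)."""
    distances = [mixed_distance(x, c, p) for c in centroids]
    return distances.index(min(distances))

# Hypothetical client: (age, income in $k, marital status, residency).
client = (40.0, 60.0, "married", "ON")
centroids = [
    (35.0, 58.0, "single", "ON"),
    (60.0, 90.0, "married", "BC"),
]
nearest = closest_cluster(client, centroids, p=2)
```

Because the numeric and categorical parts are simply added, a numeric attribute measured in large units can outweigh every categorical mismatch, so in practice the numeric attributes would likely need to be put on comparable scales before clustering.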
This allocation is done for all $N$ clients (the clients initially chosen as centroids will clearly be placed in their own clusters), with each client assigned to exactly one of the $k$ clusters.

After all clients are assigned to a cluster, the overall distance between individuals and their cluster centroid is computed by the cost function

$$J = \sum_{\ell=1}^{k} \sum_{i \in \mathcal{N}_\ell} d(x_i, c_\ell). \tag{8}$$

The cluster centroids are updated by independently finding the middle of each cluster's attributes. For the numeric attributes, the centroids are updated to be the within-cluster average value. Specifically, the updated $j$-th attribute for cluster $\ell$ is

$$c_{\ell j} = \frac{1}{n_\ell} \sum_{i \in \mathcal{N}_\ell} x_{ij}, \quad j = 1, \ldots, p. \tag{9}$$

The categorical attributes of each cluster are updated using the mode, given by

$$c_{\ell j} = M(x_{ij} \mid i \in \mathcal{N}_\ell), \tag{10}$$

where $M$ is the mode function. Next, we re-allocate each client to clusters using the minimum distance between the client attribute vector and the updated cluster centroids. After re-allocation, the overall cost is computed using Equation 8. If the total cost is unchanged from the previous iteration, we stop; otherwise, the cluster centroids are updated and clients are re-allocated. This is repeated until the total cost function is unchanged.

Since the initial set of $k$ cluster centroids (i.e., the $k$ clients serving as initial centroids) is chosen randomly, the clustering process is repeated for a large number of randomly chosen initial cluster centroids to better search for the global minimum of the cost function. Each set of initial cluster centroids produces a clustering and its total cost. The best (and final) clustering is the one that minimizes the cost function over all randomly chosen initial cluster centroids. Typically it is infeasible to look at all possible sets of $k$ initial cluster centroids, which is the reason for the random sampling of the initial cluster centroids.
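The full iteration (allocate, compute the cost of Equation 8, update centroids by mean and mode, stop when the cost is unchanged, and repeat over random initialisations) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the paper: the function names, the `n_init`, `max_iter` and `seed` parameters, the absolute-difference numeric term, and the rule that an emptied cluster keeps its previous centroid are all hypothetical choices.

```python
import random
from collections import Counter
from statistics import mean


def distance(x, c, p):
    """Numeric absolute differences plus categorical mismatches (assumed form)."""
    numeric = sum(abs(x[n] - c[n]) for n in range(p))
    categorical = sum(1 for n in range(p, len(x)) if x[n] != c[n])
    return numeric + categorical


def update_centroid(members, p):
    """Within-cluster mean for numeric attributes, mode for categorical ones."""
    columns = list(zip(*members))
    numeric = tuple(mean(col) for col in columns[:p])
    categorical = tuple(Counter(col).most_common(1)[0][0] for col in columns[p:])
    return numeric + categorical


def cluster_once(clients, k, p, rng, max_iter=100):
    """One run from randomly chosen prototype clients; returns (cost, labels)."""
    centroids = rng.sample(clients, k)
    previous_cost = None
    for _ in range(max_iter):
        labels = [
            min(range(k), key=lambda l: distance(x, centroids[l], p))
            for x in clients
        ]
        cost = sum(distance(x, centroids[l], p) for x, l in zip(clients, labels))
        if cost == previous_cost:  # stop when the total cost no longer changes
            break
        previous_cost = cost
        for l in range(k):
            members = [x for x, lab in zip(clients, labels) if lab == l]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[l] = update_centroid(members, p)
    return cost, labels


def k_prototypes(clients, k, p, n_init=10, seed=0):
    """Repeat the run for several random initialisations and keep the cheapest."""
    rng = random.Random(seed)
    runs = [cluster_once(clients, k, p, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0])


# Hypothetical toy data: two numeric attributes followed by one category.
clients = [(0.0, 0.0, "a"), (0.1, 0.2, "a"), (5.0, 5.0, "b"), (5.2, 4.9, "b")]
best_cost, best_labels = k_prototypes(clients, k=2, p=2)
print(best_cost, best_labels)
```

The `max_iter` cap is a safeguard added for the sketch; the text itself stops purely on an unchanged cost.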
As an example of why an exhaustive search over initial centroids is infeasible, with $N = 25000$ clients and $k = 5$ clusters, the number of possible ways of choosing the initial cluster centroids is $25000 \times 24999 \times 24998 \times 24997 \times 24996$, which is an infeasible number of possibilities to examine.

$t$-distributed stochastic neighbour embeddings

Projecting high-dimensional data onto a lower-dimensional space is a common way to visualize it (Yang, 1999). The computationally efficient dimensionality reduction tool used herein is the $t$-distributed stochastic neighbour embedding ($t$-SNE) (Maaten and Hinton, 2008). The $t$-SNE method provides a significant dimensionality reduction from high-dimensional data to two or three dimensions while preserving the significant structure. This method is a nonlinear mapping which, as opposed to linear mappings, is better at preserving the local structure of data; that is, it keeps similar clients close together in a low-dimensional visualization. This is important for visualizing clusters since we are using a clustering method that evaluates clients by their similarity. Therefore, the $t$-SNE method creates a map of clients based on their similarity, and then we independently apply the clustering algorithm to the data, all without specifying the data generating process.

Figure 5 displays the visualization of some sample client data; $t$-SNE is applied to project the high-dimensional data into 2-D space. For the $t$-SNE method, "perplexity" is an important parameter that affects the visual behaviour of the data projection. Different datasets require different perplexities to display the clustering features, or lack thereof, present in the data. According to Maaten and Hinton (2008), the perplexity can be viewed as the algorithm's measure of the number of effective nearest neighbours, with typical values between 5 and 50. The user must tune the perplexity value during the modelling process. There is no standard method for specifying the perplexity value.
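To make the "effective number of nearest neighbours" reading concrete, the sketch below computes the perplexity of one client's neighbour distribution as defined by Maaten and Hinton (2008): Gaussian conditional probabilities $p_{j|i}$ with bandwidth $\sigma$, and perplexity $2^{H}$ where $H$ is the Shannon entropy in bits. The function names and the toy distances are hypothetical, and the sketch illustrates the definition only, not the full $t$-SNE optimisation.

```python
from math import exp, log2


def conditional_probabilities(squared_distances, sigma):
    """p_{j|i}: Gaussian similarities of client i to every other client j."""
    weights = [exp(-d / (2.0 * sigma ** 2)) for d in squared_distances]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probabilities):
    """Perplexity 2**H, where H is the Shannon entropy in bits."""
    entropy = -sum(p * log2(p) for p in probabilities if p > 0.0)
    return 2.0 ** entropy


# Hypothetical squared distances from one client to four others.
squared_distances = [1.0, 4.0, 9.0, 16.0]
narrow = perplexity(conditional_probabilities(squared_distances, sigma=0.5))
wide = perplexity(conditional_probabilities(squared_distances, sigma=5.0))
print(narrow, wide)
```

A uniform distribution over $m$ neighbours has perplexity exactly $m$, and widening the bandwidth moves the value towards the number of neighbours available. In scikit-learn this quantity is set through the `perplexity` argument of `sklearn.manifold.TSNE`.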
Larger datasets also require a larger perplexity (van der Maaten, 2009). For our dataset, the perplexity value is set to 200 to get a stable embedded data plot.

Figure 5: A $t$-SNE 2-D projection for a small sample of client data.

In this section, we discuss the results of applying the method described in Section 3 to the client data discussed in Section 2. The data cleaning, feature engineering, clustering algorithm, $t$-SNE embedding visualization, and analysis are implemented using Python version 3.6 and R version 3.5.3 (R Core Team, 2020). The implementation of the $k$-prototypes clustering algorithm originated from a GitHub repository (de Vos, 2020) and the $t$-SNE algorithm used for data visualization is in the sklearn Python package (Pedregosa et al., 2011).

Figure 6 shows a two-dimensional similarity representation of the data using the $t$-SNE algorithm with a perplexity of 200. Each point represents one client's attributes projected down to two dimensions, where the Euclidean distance between clients in the embedding quantifies their similarity. The next step is to use the $k$-prototypes clustering algorithm to identify the optimal number of clusters $k$ for this client dataset.

Two clustering performance evaluation methods are used to determine the optimal number of clusters: the Silhouette coefficient and the Davies-Bouldin (DB) score. The Silhouette coefficient (Rousseeuw, 1987) evaluates the cluster membership of each client by comparing their similarity within and between clusters, and indicates how well clients are assigned.
Figure 6: $t$-SNE visualization for the full data set projected onto two embeddings.

Footnote: See Section 3.2 for discussion on perplexity for the $t$-SNE method.

The Silhouette coefficient of client $i$ in cluster $\mathcal{N}_\ell$ is defined as

$$S_i = \frac{b_i - a_i}{\max(a_i, b_i)}, \tag{11}$$

where $a_i$ measures how close client $i$ is to the other clients within their cluster, given by

$$a_i = \frac{1}{|\mathcal{N}_\ell| - 1} \sum_{j \in \mathcal{N}_\ell,\, j \neq i} d(x_i, x_j),$$

and $b_i$ measures how close client $i$ is to the clients in the most similar, or closest, neighbouring cluster, given by

$$b_i = \min_{g \in \{1, 2, \ldots, k\},\, g \neq \ell} \left[ \frac{1}{|\mathcal{N}_g|} \sum_{j \in \mathcal{N}_g} d(x_i, x_j) \right].$$

The top panel of Figure 7 shows the average Silhouette coefficient $S = \frac{1}{N} \sum_{i=1}^{N} S_i$ for $k = 2$ to 8 clusters. The average Silhouette coefficient is maximized for this clustering method when we choose $k = 5$ clusters.

The DB score (Davies and Bouldin, 1979) is another cluster partition evaluation metric that compares the similarity between clusters with the size of the clusters themselves. The DB score is calculated as

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{s_i + s_j}{d_{ij}} \right), \tag{12}$$

where $k$ is the number of clusters, $s_i$ is the average distance of all clients in cluster $i$ from the centroid $c_i$, and $d_{ij}$ is the distance between cluster centroids $c_i$ and $c_j$. The DB index favours clusters that are dense and far apart from one another. Hence, the DB index decreases as the separation between the clusters increases. Similarly to the averaged Silhouette coefficient, the second plot in Figure 7 indicates that a $k = 5$ clustering partition yields the optimal clustering results.

Figure 8 shows the overlaid cluster membership on the $t$-SNE visualization.
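The two evaluation scores of Equations (11) and (12) can be sketched directly from their definitions. The code below is a minimal illustration, assuming at least two clusters, a user-supplied distance function `dist`, and the common convention that a client alone in its cluster scores zero; the function names and the one-dimensional toy data are hypothetical.

```python
def silhouette_average(points, labels, dist):
    """Average of S_i = (b_i - a_i) / max(a_i, b_i) over all clients."""
    clusters = {}
    for x, label in zip(points, labels):
        clusters.setdefault(label, []).append(x)
    scores = []
    for x, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # dist(x, x) is 0, so summing over the whole cluster equals summing over j != i.
        a = sum(dist(x, y) for y in own) / (len(own) - 1)
        b = min(
            sum(dist(x, y) for y in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin(points, labels, centroids, dist):
    """DB score; centroids maps each cluster label to its centroid."""
    clusters = {}
    for x, label in zip(points, labels):
        clusters.setdefault(label, []).append(x)
    spread = {
        g: sum(dist(x, centroids[g]) for x in members) / len(members)
        for g, members in clusters.items()
    }
    worst = [
        max(
            (spread[i] + spread[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Hypothetical one-dimensional example with two well-separated clusters.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
dist = lambda a, b: abs(a - b)
print(silhouette_average(points, labels, dist))
print(davies_bouldin(points, labels, {0: 0.5, 1: 10.5}, dist))
```

Evaluating both functions for each candidate $k$ and plotting the results would reproduce the kind of comparison summarized in Figure 7: a higher average Silhouette coefficient and a lower DB score both indicate a better partition.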
Among the 5 clusters shown in Figure 8, cluster 1 has 19% of the clients and its data points are green on the embedding map, cluster 2 has the largest portion of clients (36%) and is labelled blue, cluster 3 has 27% of clients and is labelled purple, cluster 4 has the smallest portion of clients (7%) and is labelled black, and cluster 5 has 12% of clients and is labelled orange. From the two-dimensional embedding map in Figure 8, there are distinct boundaries between clusters 2, 3 and clusters 1, 4, 5. There are overlaps between clusters 1 and 5, clusters 2 and 3, and clusters 1 and 4. It is noteworthy that a higher-dimensional embedding can reveal other higher-order boundaries that distinguish these overlapped clusters. The projection from three dimensions to these two dimensions creates the visual appearance of overlapping.

Figure 7: The top panel shows the average Silhouette coefficient and the bottom panel shows the DB score for different numbers of clusters. The optimal number of clusters is identified by the red circle at the elbow.

Figure 9 shows a tree-structured dendrogram with a heat map to visualize the patterns within and between the clusters' attributes. A sample of 53 clients from the dataset is selected by stratified random sampling, where each cluster represents a stratum and the number of selected individuals is proportional to the cluster size. Each row of the dendrogram shows an individual client's attributes, and the columns show the features used in clustering. The first column is the clustering labels from Figure 8. For each remaining column, a heat map is presented with the values scaled using the range of each attribute. The minimum value of the attribute is scaled to zero (black) and the maximum value is scaled to 1 (white), and the values between the minimum and maximum are mapped on a linear scale.
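The two preprocessing steps behind the heat map, range scaling of each attribute and stratified sampling proportional to cluster size, can be sketched as follows. The function names, the `seed` parameter, and the rule of drawing at least one client per cluster are assumptions for illustration; because of rounding, the number of sampled clients may differ slightly from the requested total.

```python
import random


def min_max_scale(values):
    """Map the minimum to 0, the maximum to 1, and the rest linearly in between."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant attribute carries no contrast
    return [(v - lo) / (hi - lo) for v in values]


def stratified_sample(labels, total, seed=0):
    """Sample client indices, allocating to each cluster in proportion to its size."""
    rng = random.Random(seed)
    by_cluster = {}
    for index, label in enumerate(labels):
        by_cluster.setdefault(label, []).append(index)
    chosen = []
    for label, indices in sorted(by_cluster.items()):
        share = max(1, round(total * len(indices) / len(labels)))
        chosen.extend(rng.sample(indices, min(share, len(indices))))
    return chosen


# Hypothetical example: 100 clients in two clusters of sizes 60 and 40.
labels = [1] * 60 + [2] * 40
chosen = stratified_sample(labels, total=10)
print(len(chosen), min_max_scale([2, 4, 6]))
```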
The dendrogram rows are ordered by distance between the clients' attributes using a hierarchical structure shown on the left side of the diagram.

Table 6 summarizes the mean values of the numeric features for each cluster. These mean values are the numeric attributes of the centroids (locations) of the optimal clusters. Figure 8 and Table 6 demonstrate the following patterns between each of the clusters:
• Clusters 1 (green) and 5 (orange) are similar in their demographics and trade types, but cluster 5 trades less often with smaller periodic trade sizes.
• Cluster 2 (blue) is distinct from the others in that its clients are largely inactive in their trading.
• Clusters 3 (purple) and 4 (gray) are similar, except that cluster 3 makes larger, less frequent trades and cluster 4 utilizes larger systematic trades.

Figure 8: $t$-SNE visualization for the full dataset by cluster projected onto two embeddings.

Figure 10 shows the clustering results for categorical features. For the residency and gender features, there are no obvious differences between clusters. For the age feature, cluster 4 has a high average age, and the distribution is left-skewed and appears almost bimodal. Clusters 1, 3 and 5 have similar age distributions. The cluster 2 age distribution appears shifted left and has younger clients compared to other clusters. The bottom right panel shows the percentages of the six account types in different clusters. Clients in clusters 1, 3 and 5 have similar account proportions. Cluster 2 has more cash accounts and cluster 4 has more RIF accounts.

Figure 11 shows the monthly average trade amount over time, where the shaded areas are 95% bootstrapped pointwise confidence intervals. We first note the scale of each type of trade in the figure: the three trade types differ by orders of magnitude. This may be caused by the nature of the trade types or by the number of elementary trade types within each of the trade type classes defined in Equations (1) to (3).
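One common way to obtain a bootstrapped pointwise interval for a monthly mean is the percentile bootstrap, sketched below. The text does not state which bootstrap variant was used, so the percentile method, the function name, and the `n_boot` and `seed` parameters are assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    tail = int(n_boot * (1.0 - level) / 2.0)  # resamples cut from each tail
    return means[tail], means[n_boot - 1 - tail]


# Hypothetical trade amounts for one cluster in one month.
trade_amounts = list(range(1, 101))
lower, upper = bootstrap_mean_ci(trade_amounts)
print(lower, upper)
```

Applying the function separately to each month and each cluster gives one interval per plotted point, which is what makes the band pointwise rather than simultaneous.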
• For third-party initiated trades, cluster 4 has a relatively high trade amount and the largest volatility. Cluster 1 has similarly high trade amounts but less volatility. Clusters 3 and 5 have very similar trade [...]
• For systematic trades, the pattern is similar to that of third-party initiated trades. Clusters 1 and 4 are again similar in trade amount and volatility, with cluster 4 having slightly larger amounts on average except in June. Clusters 3 and 5 have almost identical average trade amounts except in August, and cluster 2 has the smallest average trade amount. An interesting aspect of all clusters is the peaks in the average trade amount evident in January and June.
• Cluster 1 dominates the periodic trade amounts, while cluster 2 has almost zero periodic trade amounts on average with very little volatility. Clusters 3 to 5 have similar trade amounts and volatilities, except in February and March when there is a slight peak before trending down for clusters 3 and 5. Clusters 3 to 5 all have an uptick in the average trade amount in July. There is a clear scale difference compared to the previous two trade types.

Figure 9: A dendrogram of the clustering result with a heat map. Each attribute value is scaled to lie in the interval [0, 1].

Figure 12 shows the inferred risk tolerance (RT) score distributions for clients of each cluster. The majority of clients in each cluster's distribution (top four and bottom left panels) have an RT score close to three. Furthermore, the distributions appear quite similar, with smaller upticks at RT scores of two and four.
The panel in the bottom right shows the overlaid translucent densities of each cluster, where the reddish-brown area is the shape that all clusters share.

Figure 12: Inferred RT score distributions by cluster. The top four and bottom left panels are each cluster's distribution of the number of clients by inferred RT score. The bottom right panel is each of the clusters' risk score density overlaid.

We investigated the similarity of these distributions using a parametric ANOVA comparison of client RT score means and a nonparametric Kruskal-Wallis test comparison of mean ranks (Kruskal and Wallis, 1952; McKight and Najab, 2010); the null hypotheses of both tests were rejected, with P-values ≤ × − and 3.23 × −, respectively. A post hoc comparison of individual groups, with P-values adjusted for multiple comparisons, was conducted using Tukey's test (Tukey, 1949) for ANOVA and the nonparametric Dunn's test (Dunn, 1964) for the Kruskal-Wallis test. The results of these tests are shown in Appendix C. These results suggest that clusters 3 and 4 have significantly different distributions from the rest. We investigated the difference in the distributions using the histogram density estimators (Figure 12) in a pairwise symmetric Kullback-Leibler (KL) plug-in estimator (Kullback and Leibler, 1951; Ramírez et al., 2004; Wang et al., 2005). The KL estimator shows that the divergences for the unlike clusters (3, 4) are not much larger than the divergences for the like clusters (1, 2, 5). The results of the symmetric KL estimators are shown in Appendix C.

From these analyses of the distribution of inferred RT scores between the clusters, we can conclude that the distributions are similar, although there exists a statistically significant difference between them. With a smaller sample of points from each distribution, an analysis of variance test would have difficulty rejecting the null hypothesis.
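A pairwise symmetric KL plug-in estimate from two histograms on common bins can be sketched as follows. The text does not give the exact form used, so this sketch takes the symmetric divergence to be the sum $KL(P \| Q) + KL(Q \| P)$ (some authors use the average instead) and adds a small constant `eps` to every bin to avoid taking the logarithm of zero; both choices, the function names, and the toy counts are assumptions.

```python
from math import log


def to_probabilities(counts, eps=1e-9):
    """Normalise histogram counts; eps (an assumption) guards against empty bins."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]


def symmetric_kl(counts_p, counts_q):
    """Plug-in estimate of KL(P||Q) + KL(Q||P) from two histograms on the same bins."""
    p = to_probabilities(counts_p)
    q = to_probabilities(counts_q)
    kl_pq = sum(pi * log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * log(qi / pi) for pi, qi in zip(p, q))
    return kl_pq + kl_qp


# Hypothetical counts of clients per RT score bin (1 to 5) for two clusters.
cluster_a = [5, 20, 50, 20, 5]
cluster_b = [10, 30, 40, 15, 5]
print(symmetric_kl(cluster_a, cluster_b))
```

The estimate is zero for identical histograms and grows as they diverge, so comparing its value across like and unlike cluster pairs gives a scale-free sense of how different the RT score distributions are, independent of the sample size that drives the hypothesis tests.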
The mean pattern and shape of the risk tolerance distributions do not line up with what we would have expected. Clusters 1 and 4 are the most striking. Cluster 4 is demographically skewed towards older investors and we would expect to see RT scores weighted towards scores 1.0, 2.0 or 3.0. In fact, only 15.7% of clients in cluster 4 have an RT score below 3.0. Behaviourally, cluster 1 appears to pursue a riskier trading strategy and we would, therefore, have expected to see a strong weighting towards observations in the 4.0 to 5.0 RT score range. In fact, 14.8% of cluster 1 clients fall into the 4.0 to 5.0 RT score range.

From data to people – Personas

The cluster memberships are determined by the similarity of individuals, and we are interested in studying how the groups differ from each other. Using the plots and information presented so far, we summarize how the clusters differ using the variables most important to their cluster classification. We note that individuals from two different groups may appear similar, but they are classified based on subtle differences determined by the clustering algorithm.

Using our understanding of investors and finance, we have created 'personas' for clients to ease discussions and help understand the groups as real people and not just data. The five personas are as follows:
• Cluster 1: Active Traders (19% of investors) trade frequently (weekly and monthly) and in large amounts. The pattern of trades is seemingly random and initiated manually. These investors had investments across a spectrum of accounts (mainly registered savings plans (RSPs) and TFSAs), and were of an "average" age distribution and demographic. They had a derived risk tolerance rating that averaged 3.19 with standard deviation 0.63, where 1 is a low or preservative risk tolerance and 5 is high or aggressive.
• Cluster 2: Early Savers (36%) never actively trade and instead rely on systematic transactions (auto-withdrawal, pre-authorized contribution, asset allocations). This group tended to have investments in cash accounts and to be younger. They had a derived risk tolerance rating that averaged 3.18 with standard deviation 0.75.
• Cluster 3: Just-In-Time (27%) initiate trades manually but far less frequently than Cluster 1 and in smaller amounts. These investors had investments across a spectrum of accounts (RSPs, TFSAs, etc.), and were of an "average" age and demographic. They had a derived risk tolerance rating that averaged 3.12 with standard deviation 0.73.
• Cluster 4: Older Investors (7%) trade infrequently and the trades were either initiated systematically or from a third party (pre-authorized withdrawals, dividends and other disbursements). This cluster had an above-average concentration of RIFs, and tended to be older. They had a derived risk tolerance rating that averaged 2.95 with standard deviation 0.71.
• Cluster 5: Systematic Savers (12%) trade recurrently (every 60, 90, or 120 days), in small amounts driven by systematic processes (dollar cost averaging) and periodic trading. These investors had investments across a spectrum of accounts (RSPs, TFSAs, etc.), and were of an "average" age and demographic. They had a derived risk tolerance rating that averaged 3.19 with standard deviation 0.76.

Discussion and Future Plans

We have applied a variety of approaches to analyze the client dataset and extract financial behaviours. We have constructed data summaries and extracted features that we believe capture financial behaviours, and included those summaries and features in a descriptive analysis. The features engineered from our data will directly affect the performance of future predictive models we are developing. We applied a $k$-prototypes clustering algorithm to the extracted features, where the cluster memberships were determined by minimizing a similarity cost function. We evaluated our clustering method using a Silhouette coefficient and a DB score, and analyzed the clustering results using the centroids generated by the algorithm and $t$-SNE visualizations.

The ultimate goal of our research is to provide enhanced advice to clients and their advisors using both traditional and digital approaches. The projects described herein are a path to attain that goal, providing the necessary algorithms to give information and advice in good faith. The projects not only support digital advice, but the results can also be used to report to regulatory committees on how data-driven results can aid regulators in promoting financial wellness policies.

Moving forward, we will examine the behaviours of the clusters against the suitability and KYC protocols noted in this paper and then attempt to determine whether those behaviours have a constructive or destructive impact on client outcomes. We also plan to examine the impact that advisor behaviours have on the analysis noted above, while looking for evidence of whether we can change or nudge any or all of the noted behaviours. Previous research has determined that traditional characteristics explain only 12 percent of an investor's portfolio allocations (Foerster et al., 2014; Grace, 2014; Foerster et al., 2017; Linnainmaa et al., 2018). Our goal is to use new, sophisticated technologies to help examine the remaining 88 percent of unexplained investor behaviour (Grace, 2019).

Trade and Asset Mix

At the root of modern portfolio theory is the assumption that portfolio asset mix drives the portfolio's inherent risk. The determination of suitability, based on the KYC, extends through portfolio construction to ensure that the portfolio's asset mix is consistent with the investor's risk tolerance. In the next phase of the project, we will use the same statistical techniques and dataset as above to examine whether the trading behaviour identified in each cluster is "suitable", as defined by the prescribed regulations. We will complete this analysis by looking at the asset mix exhibited by each cluster. We will evaluate the security risk in the context of the client risk derived from the attributes of the cluster analysis. We will use security risk ratings (SRR) that are defined by industry for each of the securities bought, sold and held by the client. These risk ratings are required by regulators under the Know Your Product protocols (Ontario Securities Commission, 2019). We will examine the trading behaviour and trade mix at specific points in time and then along a longitudinal continuum to see if the relationship changes over time. From this analysis, we will be able to determine whether investor behaviour is suitable. We will examine how the trading behaviour exhibited by each cluster impacts their portfolios and the probability of achieving their desired outcomes. We will also look for evidence of whether the investor's trading behaviour leads to unintended changes in the portfolio's asset mix and risk characteristics over time.

Portfolio Returns

Where the analysis noted in the previous projects examines risk and the probability of success, we also plan to examine returns. We will analyze the assumption that higher risk should lead to higher returns (in the long run) and presumably faster portfolio growth. Likewise, lower risk will presumably lead to more modest returns and preservation of capital. During this examination, we will use multiple methods to calculate returns, including industry best practices and regulatory guidance.

Advice

This project recognizes that investor behaviour is a complex event with a number of variables influencing it. Spouses, family, friends, media and events, for example, can all influence the timing, characteristics and trajectory of behaviour. However, it is widely acknowledged that the investment advisor acts as the gatekeeper for most investment trades and therefore, presumably, the trading behaviour (Marsden et al., 2011; Montmarquette et al., 2012; Investment Funds Institute of Canada, 2012; Kinniry et al., 2014). In this project, we will look for evidence of whether the advisor's behaviour is influencing trading behaviour consistent with the KYC and suitability requirements.

Investor Outcome Improvements

In this project, we will take advantage of a second unique data set to examine whether it is possible to change or influence investor behaviours through new, systematic technologies. Using the same methodologies as above, and the same set of investors, we will examine investor behaviour before and after a significant system enhancement implemented in November 2019, leading into the market events of March 2020. We will make use of control charts to help determine the key variables that drive risky behaviour over time. This analysis will help assess the viability of potential new algorithms in the digital advice space.

References

Ameer Ahmed Abbasi and Mohamed Younis. A survey on clustering algorithms for wireless sensor networks. Computer communications, 30(14-15):2826–2841, 2007.
Palaksha Anitha and Malini M. Patil. RFM model for customer purchase behavior using $k$-means algorithm. Journal of King Saud University-Computer and Information Sciences, 2019.
Michael W. Berry and Malu Castellanos. Survey of text mining. Computing Reviews, 45(9):548, 2004.
Genci Bilali. Know your customer–or not. University of Toledo Law Review, 43:319, 2011.
Derya Birant. Data mining using RFM analysis. In Knowledge-oriented applications in data mining. IntechOpen, 2011.
Anil Chaturvedi, Paul E. Green, and J. Douglas Carroll. $k$-modes clustering. Journal of classification, 18(1):35–55, 2001.
David L. Davies and Donald W. Bouldin. A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence, (2):224–227, 1979.
Nico de Vos. Python implementations of the $k$-modes and $k$-prototypes clustering algorithms, for clustering categorical data. 2020. URL https://github.com/nicodv/kmodes.
Olive Jean Dunn. Multiple comparisons using rank sums. Technometrics, 6(3):241–252, 1964.
Stephen Foerster, Juhani T. Linnainmaa, Brian Melzer, and Alessandro Previtero. The costs and benefits of financial advice. Working paper, 2014.
Stephen Foerster, Juhani T. Linnainmaa, Brian T. Melzer, and Alessandro Previtero. Retail financial advice: does one size fit all? The Journal of Finance, 72(4):1441–1482, 2017.
Chuck Grace. Practitioner's summary: the costs and benefits of financial advice. 2014.
Chuck Grace. Next-gen financial advice: Digital innovation and Canada's policymakers. CD Howe Institute Commentary, 538, 2019.
Michael Guillemette, Michael S. Finke, and John Gilliam. Risk tolerance questions to best determine client portfolio allocation preferences. Journal of Financial Planning, 25(5):36–44, 2012.
Seyedmehdi Hosseinimotlagh and Evangelos E. Papalexakis. Unsupervised content-based identification of fake news articles with tensor decomposition ensembles. In Proceedings of the Workshop on Misinformation and Misbehavior Mining on the Web (MIS2), 2018.
Zhexue Huang. Clustering large data sets with mixed numeric and categorical values. In The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 21–34, 1997.
Zhexue Huang. Extensions to the $k$-means algorithm for clustering large data sets with categorical values. Data mining and knowledge discovery, 2(3):283–304, 1998.
Zhexue Huang and Michael K. Ng. A note on $k$-modes clustering. Journal of Classification, 20(2):257, 2003.
Investor Economics Investment Funds Institute of Canada. Mutual fund MERs and cost to customer in Canada: Measurement, trends and changing perspectives. 2012. URL .
Francis M. Kinniry, Colleen M. Jaconetti, Michael A. DiJoseph, and Yan Zilbering. Putting a value on your value: Quantifying Vanguard advisor's alpha. Vanguard Research, 16, 2014.
K. Krishna and M. Narasimha Murty. Genetic $k$-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(3):433–439, 1999.
William H. Kruskal and W. Allen Wallis. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621, 1952.
Solomon Kullback and Richard A. Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
Kun Lan, Dan-tong Wang, Simon Fong, Lian-sheng Liu, Kelvin K.L. Wong, and Nilanjan Dey. A survey of data mining and deep learning in bioinformatics. Journal of medical systems, 42(8):139, 2018.
Nhien-An Le-Khac, Cai Fan, and Tahar Kechadi. Clustering approaches for financial data analysis. In , July 2012.
Juhani T. Linnainmaa, Brian Melzer, and Alessandro Previtero. The misguided beliefs of financial advisors. Kelley School of Business Research Paper, (18-9), 2018.
Shelly-Ann Lumsden, Srikanth Beldona, and Alastair M. Morrison. Customer value in an all-inclusive travel vacation club: An application of the RFM framework. Journal of Hospitality & Leisure Marketing, 16(3):270–285, 2008.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using $t$-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
Mitchell Marsden, Cathleen D. Zick, and Robert N. Mayer. The value of seeking financial advice. Journal of family and economic issues, 32(4):625–643, 2011.
Patrick E. McKight and Julius Najab. Kruskal-Wallis test. The Corsini Encyclopedia Of Psychology, 2010.
Prakash Chandra Mondal, Rupam Deb, and Mohammad Nurul Huda. Transaction authorization from know your customer (KYC) information in online banking. In , pages 523–526. IEEE, 2016.
Claude Montmarquette, Nathalie Viennot-Briot, et al. Econometric Models on the Value of Advice of a Financial Adviser, volume 49. CIRANO, 2012.
José Parra Moyano and Omri Ross. KYC optimization using distributed ledger technology. Business & Information Systems Engineering, 59(6):411–423, 2017.
Ontario Securities Commission. CSA staff notice 31-336 guidance for portfolio managers, exempt market dealers and other registrants on the know-your-client, know-your-product and suitability obligations. Jan 2014.
Ontario Securities Commission. Amendments to national instrument 31-103 registration requirements, exemptions and ongoing registrant. 2019.
Investor Advisory Panel Ontario Securities Commission. Current practices for risk profiling in Canada and review of global best practices. 2015. URL .
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Nathalie Picard and André de Palma. Evaluation of MiFID questionnaires in France. Technical report, AMF, 2010.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL .
Javier Ramírez, Jaume C. Segura, Carmen Benítez, Angel De La Torre, and Antonio J. Rubio. A new Kullback-Leibler VAD for speech recognition in noise. IEEE signal processing letters, 11(2):266–269, 2004.
Luc Rocher, Julien M. Hendrickx, and Yves-Alexandre De Montjoye. Estimating the success of re-identifications in incomplete datasets using generative models. Nature communications, 10(1):1–9, 2019.
Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.
Dieter De Smet and Anne-Laure Mention. Improving auditor effectiveness in assessing KYC/AML practices: Case study in a Luxembourgish context. Managerial Auditing Journal, 26(2):182–203, 2011.
Douglas Steinley. $k$-means clustering: a half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59(1):1–34, 2006.
Avanidhar Subrahmanyam. Behavioural finance: a review and synthesis. European Financial Management, 14(1):12–29, 2008.
John W. Tukey. Comparing individual means in the analysis of variance. Biometrics, pages 99–114, 1949.
Laurens van der Maaten. Learning a parametric embedding by preserving local structure. In David van Dyk and Max Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 384–391, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr 2009. PMLR. URL http://proceedings.mlr.press/v5/maaten09a.html.
Qing Wang, Sanjeev R. Kulkarni, and Sergio Verdú. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064–3074, 2005.
Rui Xu and Don Wunsch. Clustering, volume 10. John Wiley & Sons, 2008.
Li Yang. 3D grand tour for multidimensional data and clusters. In David J. Hand, Joost N. Kok, and Michael R. Berthold, editors, Advances in Intelligent Data Analysis, pages 173–184, Berlin, Heidelberg, 1999. Springer Berlin Heidelberg.
Alice Zheng and Amanda Casari. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, Inc., 1st edition, 2018.

Appendix A - Trade type descriptions

Table 7: Types of trades in the client database

Type: Third-party initiated
Examples: Dividend; Income; Distribution Interest
Description: Third-party transactions are generated by product manufacturers and vary by product type (securities, ETFs, mutual funds, fixed income, etc.). The generation of these transactions does not require the participation of the advisor or investor, and they flow from the manufacturer to the dealer and then to the investor's account.

Type: Systematic
Examples: Auto Withdrawal; Pre-authorized Contribution; Asset Allocation; Reinvest Dividend
Description: Systematic transactions are created by the advisor or investor to automatically generate on a prescribed timetable (for example, monthly or quarterly). When these transactions are set up, they can run for months or years without change until such time as the advisor or investor determines a revision is required because of new circumstances.

Type: Periodic
Examples: Buy (securities); Sell (securities); Contribution; Exchange; Payment; Periodic; EFT Withdrawal; EFT deposit; TFSA; Spousal contribution; Redeem
Description: Periodic transactions are initiated by the advisor or investor without a prescribed transaction amount or time frame. The description for these transactions can vary by product type; for example, "sell" refers to the disposition of a security while "redeem" refers to the disposition of a mutual fund.

Appendix B - Imputation

The details of the specific variables that were imputed are shown in Table 8. For each variable with removed values, we investigated the effect of imputing the missing values and including them in the clustering algorithm. Clients with missing values in categorical variables that were between 5% and 10% missing were removed, since either these variables were found not to be important for determining cluster membership or imputing the categories introduced unnecessary bias into the sample.

Table 8: Summary of missing values and imputation for clustering

Variable                     Percent missing   Action
Age                          2.2%              Imputed with mean
Residency                    0.47%             Imputed with mode
Risk tolerance               14.16%            Removed from clustering algorithm
Investment objective         6.7%              Removed clients with missing information
Annual income                0.13%             Imputed with mean
Investment knowledge level   7.8%              Removed clients with missing information
Gender                       8.04%             Removed clients with missing information

Appendix C - Risk tolerance score distribution analysis