Comparison of Distance Metrics for Hierarchical Data in Medical Databases

Abstract

Distance metrics are broadly used in different research areas and applications, such as bio-informatics, data mining and many other fields. However, some metrics, like pq-gram and Edit Distance, are used specifically for data with a hierarchical structure. Other metrics used for non-hierarchical data are the geometric and Hamming metrics. We have applied these metrics to The Health Improvement Network (THIN) database, which has some hierarchical data. The THIN data has to be converted into a tree-like structure for the first group of metrics. For the second group of metrics, the data are converted into a frequency table or matrix; then, for all metrics, all distances are found and normalised. Based on this particular data set, our research question is: which of these metrics is useful for THIN data? This paper compares the metrics, particularly the pq-gram metric, on finding the similarities of patients' data. It also investigates the similar patients who have the same close distances, as well as the metrics' suitability for clustering the whole patient population. Our results show that the two groups of metrics perform differently as they represent different structures of the data. Nevertheless, all the metrics could represent some similar data of patients as well as discriminate sufficiently well in clustering the patient population using the k-means clustering algorithm.


Diman Hassan, Uwe Aickelin and Christian Wagner

I. INTRODUCTION

Since the representation of structured objects in large and modern databases like The Health Improvement Network (THIN) database is becoming more complex and important, such structures should be considered when searching for similar objects. Therefore, finding an efficient measurement for discovering similar objects in data sets is the key feature when the task is to classify new objects or to cluster data objects. The pq-gram [1] and Edit Distance [2] metrics are known to be two good approaches that have been used to measure the similarity of structured data objects, especially trees. The limitation of the Edit Distance metric is its computational complexity, which is considered very high [3] compared to the pq-gram metric. [...]

For the THIN database, which belongs to the general practice electronic healthcare databases, some research [6], [7] has been performed using data mining techniques, such as association and sequential patterns. The purpose was to detect associations between patient attributes (e.g. age, gender, medical history) and adverse events of drugs. No other data mining technique, such as clustering, has been applied to the THIN database yet; this motivated us to use the unexplored clustering approach for the prediction and detection of negative side effects of drugs. The overarching aim of our research is to cluster hierarchical data to identify adverse side effects of drugs in the THIN database. However, clustering techniques need distance measures to represent the similarity between patients who have similar side effects. For this reason, this preliminary work aims to find a useful and suitable measure for our hierarchical data set in order to cluster patients.

To achieve this aim, different metrics are considered and applied to the THIN data and their results compared. The investigation determines whether these metrics can measure similarity and find similar patients (i.e. the patients who have similar side effects of drugs). Additionally, looking at the whole patient population, is any of the metrics able to accurately represent similarity between patients when using, for example, the k-means clustering algorithm [8]?

The layout of this paper is as follows. In Section II, a background on the THIN database and the distance metrics is given. The data preparation for both groups of metrics, the calculation of the distances and the clustering using those metrics are explained in Section III, followed by a discussion of the results in Section IV. Section V presents a summary and the conclusion of the work.

Diman Hassan, Uwe Aickelin and Christian Wagner are with the School of Computer Science, University of Nottingham, Nottingham, United Kingdom (email: {dsh, uxa, cxw}@cs.nott.ac.uk).

II. MATERIALS AND METHODS

A. Background on THIN Database

The THIN database is one of the electronic health-care longitudinal databases that contains anonymous electronic medical records extracted directly from general practices throughout the United Kingdom. The database contains information on each patient registered within the general practice, including personal details such as gender, date of birth, date of registration and family history. In addition, the data on all the drug prescriptions and the associated set of symptoms based on which the drug is prescribed are also included. An individual medical record is represented in the THIN database by a reference code called the read code. The latter is an alphanumeric code that defines and groups illnesses using a hierarchical nosology system. The read codes are also a comprehensive coded medical language developed in the UK and funded by the National Health Service (NHS). In this paper, we run our experiments on a group of patients between the ages of 0 and 17 years. The information shown in Table I was extracted from THIN for two drugs that have been chosen based on the number of prescriptions. The first drug, DESLORATADINE, has a large number of prescriptions and is used to treat allergies; it belongs to the group of antihistamines. The second drug has a smaller number of prescriptions and belongs to the family of tricyclics, which are antidepressant drugs [9]. For our experiments on the first drug, a sample of 9949 prescriptions after 30 days of taking the drug (representing 988 patients) out of 53,995 prescriptions (representing 18,293 patients) has been tested to find the similarity between patients. For the second drug we used all the prescriptions (1172) after 30 days of taking the drug, for 42 patients.

TABLE I
A SUBSET OF INFORMATION FROM THE DATABASE FOR TWO KINDS OF DRUGS

                                          DESLORATADINE   DOXEPIN
All drug's codes in THIN data set         6               15
All prescriptions                         358,768         72448
All patients                              81,000          6152
All prescriptions (0-17 years)            53,995          2014
All patients (0-17 years)                 18,293          60
All prescriptions (0-17) after 30 days    9949            1172
All patients (0-17) after 30 days         988             42

B. Background on Distance Metrics

A metric space $(X, d)$ is a set $X$ that has the concept of a distance $d(x, y)$ between any pair of points $x, y \in X$, and the metric is a function on the set $X$ that satisfies the following properties for a distance [10], [11].

Definition: a metric $d$ on a set $X$ is a function $d : X \times X \to \mathbb{R}$ such that for all $x, y, z \in X$:

$d(x, y) \ge 0$ (non-negativity);
$d(x, y) = 0 \iff x = y$ (identity);
$d(x, y) = d(y, x)$ (symmetry);
$d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality).

The following are the six distance metrics used in this study:

1) Euclidean Distance Metric:

The Euclidean metric is a distance $d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, defined between any two points of the space $\mathbb{R}^n$ as

\[ d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} \tag{1} \]

where $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ [12].

2) Minkowski Distance Metric:

The Minkowski metric is a $p$-metric between $n$-dimensional points $x = (x_i)$ and $y = (y_i)$, defined as:

\[ d(x, y) = \sqrt[p]{\sum_{i=1}^{n} |x_i - y_i|^p} \tag{2} \]

If $p = 2$, it is called the Euclidean distance; if $p = 1$, it is the Manhattan distance; and if $p = \infty$, it is called the Chebyshev or maximum distance [4]. In our experiments, $p = 3$.

3) Manhattan Distance Metric:

The Manhattan metric is a special case of the Minkowski metric when $p = 1$:

\[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{3} \]

where $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$.
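
As a hedged illustration of the three geometric metrics, the following Python sketch computes the Minkowski distance for a chosen order p and obtains the Manhattan and Euclidean distances as the special cases p = 1 and p = 2. The function names and the example vectors are illustrative assumptions and are not taken from the experiments.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        # p = infinity gives the Chebyshev (maximum) distance
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def manhattan(x, y):
    """Special case p = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """Special case p = 2."""
    return minkowski(x, y, 2)


# Illustrative vectors (not THIN data)
x = (0, 3, 1)
y = (4, 0, 1)
print(manhattan(x, y))      # 7.0
print(euclidean(x, y))      # 5.0
print(minkowski(x, y, 3))   # order used for the Minkowski metric in the experiments
```

Writing the three metrics through one function makes it explicit that they differ only in the order p.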

4) Hamming Distance Metric:

The Hamming distance is used for the detection and correction of errors in digital communications. It is defined as the number of positions at which two equal-length sequences differ. For example, the Hamming distance between "toned" and "roses" is 3, and between 217389 and 213379 it is 2 [13].
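
The definition above translates directly into a position-by-position comparison. The short Python sketch below is one possible way to write it (the function name is an assumption) and reproduces the two examples from the text.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("toned", "roses"))     # 3
print(hamming("217389", "213379"))   # 2
```

The same function works on any pair of equal-length sequences, for example two rows of a frequency table given as lists.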

5) Edit Distance Metric:

According to Kailing et al. [2], the Edit Distance between two trees $T_1$ and $T_2$ is the minimum cost of all edit sequences that transform $T_1$ into $T_2$:

\[ \text{Edit Distance}(T_1, T_2) = \min \{\, c(S) \mid S \text{ is a sequence of edit operations transforming } T_1 \text{ into } T_2 \,\} \]

Kailing et al. point out that an advantage of the edit distance as a similarity measure is that the edit sequence (insertion, deletion and relabeling of nodes in a tree) provides a mapping between the nodes of the two trees.
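
To make the definition concrete, the following Python sketch computes the edit distance between two small ordered labeled trees with unit costs for insertion, deletion and relabeling. It uses the textbook memoised recursion over forests, which is simple but not efficient; it is not the Zhang and Shasha algorithm [3] and is not claimed to be the implementation used in this study. The tree encoding `(label, children)` and the function name are assumptions made for illustration.

```python
from functools import lru_cache


def tree_edit_distance(t1, t2):
    """Unit-cost edit distance between two ordered labeled trees.

    A tree is a tuple (label, children), where children is a tuple of trees.
    """

    @lru_cache(maxsize=None)
    def forest_dist(f, g):
        # f and g are forests: tuples of trees
        if not f and not g:
            return 0
        if not g:
            # delete the root of the rightmost tree of f; its children move up
            _, children = f[-1]
            return forest_dist(f[:-1] + children, g) + 1
        if not f:
            # insert the root of the rightmost tree of g
            _, children = g[-1]
            return forest_dist(f, g[:-1] + children) + 1
        (label_v, kids_v), (label_w, kids_w) = f[-1], g[-1]
        delete = forest_dist(f[:-1] + kids_v, g) + 1
        insert = forest_dist(f, g[:-1] + kids_w) + 1
        # map root v onto root w (cost 1 if the labels differ), then match
        # their subforests and the remaining forests independently
        relabel = (forest_dist(kids_v, kids_w)
                   + forest_dist(f[:-1], g[:-1])
                   + (label_v != label_w))
        return min(delete, insert, relabel)

    return forest_dist((t1,), (t2,))


# Illustrative trees (labels are made up)
t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("b", ()),))
print(tree_edit_distance(t1, t2))   # 1: delete node c
```

Children are stored as tuples so that forests are hashable and can be memoised with `lru_cache`.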

6) PQ-Gram Distance Metric:

The pq-gram distance was proposed by Augsten et al. [1] and is mainly used for computing distances between ordered labeled trees. The pq-grams of a tree are all its sub-trees of a specific shape, which is determined by the values of two parameters p and q. The tree T shown in Fig. 1 is expanded by inserting dummy nodes (*) to make sure that each node appears in at least one pq-gram. Each tree is expanded by inserting p-1 dummy nodes above the root node, q-1 dummy nodes before the first and after the last child of each non-leaf node, and q dummy nodes below each leaf node, for example p = 2, q = 3 in Fig. 2. After the expansion process, the 2,3-grams are extracted to produce the list of pq-grams. An example of a single 2,3-gram is given in Fig. 2, where p = (*, a6706022p) is the stem and q = (*, *, 1) is the base. Trees that have a large number of common pq-grams are considered more similar than trees that have fewer. Furthermore, the pq-gram distance is used to approximately match hierarchical data from large sources using the following equation:

\[ \mathrm{dist}^{(p,q)}(T_1, T_2) = |I_1 \uplus I_2| - 2\,|I_1 \cap I_2| \tag{4} \]

where $T_1$, $T_2$ are the two trees, and p and q are the two parameters that specify the shape of the pq-gram. The pq-gram indexes $I_1$ and $I_2$ are the bags of label-tuples of all pq-grams of $T_1$ and $T_2$, respectively. In addition, $\uplus$ refers to the bag union of $I_1$ and $I_2$, and $\cap$ refers to the bag intersection of the same indexes. The normalisation of the pq-gram distance is as follows:

\[ \mathrm{dist}_{\mathrm{norm}}^{(p,q)}(T_1, T_2) = \frac{\mathrm{dist}^{(p,q)}(T_1, T_2)}{|I_1 \uplus I_2| - |I_1 \cap I_2|} \tag{5} \]

The pq-gram metric was originally proposed to approximately match similar hierarchical information from autonomous sources, where the information may be represented differently in each source [1]. The pq-gram metric has the advantage of computational efficiency and can be computed in $O(n \log n)$ time and $O(n)$ space.
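
The expansion and extraction steps described above can be sketched in Python as follows. The sketch builds the bag of pq-grams with a shift-register traversal in the style of Augsten et al. [1] and then computes the distance as the size of the bag union minus twice the size of the bag intersection, normalised by the union size minus the intersection size. The tree encoding `(label, children)`, the function names and the example trees are assumptions for illustration; this is not the authors' implementation.

```python
from collections import Counter


def pq_gram_profile(tree, p=2, q=3):
    """Bag (Counter) of label-tuples of all pq-grams of an ordered labeled tree.

    A tree is a tuple (label, children); "*" is the dummy label.
    """
    profile = Counter()

    def visit(node, ancestors):
        label, children = node
        stem = ancestors[1:] + (label,)      # shift the node label into the stem
        base = ("*",) * q
        if not children:
            profile[stem + base] += 1        # leaf: base holds q dummy nodes
            return
        for child in children:
            base = base[1:] + (child[0],)    # shift the child label into the base
            profile[stem + base] += 1
            visit(child, stem)
        for _ in range(q - 1):               # q-1 dummy nodes after the last child
            base = base[1:] + ("*",)
            profile[stem + base] += 1

    visit(tree, ("*",) * p)                  # dummy ancestors above the root
    return profile


def pq_gram_distance(t1, t2, p=2, q=3, normalised=True):
    """pq-gram distance between two trees, optionally normalised to [0, 1]."""
    i1, i2 = pq_gram_profile(t1, p, q), pq_gram_profile(t2, p, q)
    union = sum(i1.values()) + sum(i2.values())   # size of the bag union
    inter = sum((i1 & i2).values())               # size of the bag intersection
    dist = union - 2 * inter
    return dist / (union - inter) if normalised else dist


# Illustrative trees (labels are made up)
t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("b", ()),))
print(pq_gram_distance(t1, t2, normalised=False))   # 6
print(pq_gram_distance(t1, t2))                     # 0.75
```

In this example the two trees have 6 and 4 pq-grams and share 2 of them, so the distance is 10 - 4 = 6 and the normalised distance is 6 / 8.
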
Another advantage of the pq-gram distance is that it can be tuned by adjusting the two parameters p and q [14]. The choice of p and q depends on the underlying semantics of the data. In general, increasing the values of p and q makes the distance between two trees more sensitive to the structure of the trees than to the data, while decreasing them makes the distance more sensitive to the data. In our experiments we used two settings, (p = 1, q = 3) and (p = 2, q = 3); the resulting pq-gram distances are shown in Table III and Table V. The results reveal that better distances are obtained when p = 1 and q = 3.

Fig. 1. An example of a tree T and its 2,3-extended tree

Fig. 2. An example of a single pq-gram from a THIN data tree

III. EXPERIMENTS AND RESULTS

A. Data Preprocessing

The THIN data is converted into trees before applying the pq-gram and Edit Distance metrics, while the data for the geometric and Hamming distance metrics is converted into a frequency table. The data extracted from THIN is based on different patient attributes, such as the patient's unique ID, the gender, the age at which the patient first took the specified drug and the medical codes related to the drug. The medical events are chosen at level 3 (the first three characters of the read codes, like H33). Fig. 3 shows part of this information as represented in THIN for three patients, who have unique identifiers in the database (a6706013B, a6706015R, a670601o8).

Fig. 3. Part of the THIN data extracted based on specific attributes

1) PQ-Gram and Edit Distance Preparation:

From the data in Fig. 3, we have converted each patient's records into a tree, as depicted in Fig. 4, to enable the computation of both the pq-gram and Edit Distance metrics. For the pq-gram metric, each tree is expanded in the same way as in Fig. 1. For our experiments, we use (p = 1, q = 3) and (p = 2, q = 3). The pq-grams are then extracted for each tree; Fig. 5 shows the 2,3-grams for the tree in Fig. 4. The pq-gram distance between two trees is determined by the pq-grams they have in common and is computed using equation (4), while the Edit Distance is calculated by inserting, deleting or relabeling nodes to convert one tree into another. A single edit operation has cost 1, and the Edit Distance between two trees is equal to the minimum cost, i.e. the minimum number of edit operations, needed to convert one tree into the other.

Fig. 4. A tree representation from THIN data

2) Geometric and Hamming Metrics Preparation:

The THIN data for the Euclidean, Minkowski, Manhattan and Hamming metrics has been converted into a frequency table, as shown in Table II. The table represents how many times each patient had a specific symptom after taking the specified drug. The table also contains additional columns: one for the patient's gender (in THIN, 1 = male, 2 = female) and others for the different ages of the patients taking the drug. In Table II, the ages of the patients are 10, 11, 12 and 15.

TABLE II
THE FREQUENCY TABLE FROM THIN DATA

Patient's ID  The medical events                                          Patient's ages  Sex
              168 171 195 19C 1A5 730 F58 H17 M0. M26 N24 N32 SD. SL. ZL5 10  11  12  15
a6706013B     0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   1   1
a6706015R     1   0   1   0   0   1   2   0   0   0   0   0   0   0   0   1   0   0   0   2
a670601o8     0   1   0   1   1   0   0   1   1   1   1   1   1   1   0   0   0   1   0   1
a670601yJ     0   0   0   0   0   0   0   0   0   1   0   0   0   0   1   0   1   0   0   2

Fig. 5. The 2,3-grams of a tree T
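
The conversion into a frequency table described above can be sketched in Python as follows. The record layout `(patient_id, read_code, age, sex)`, the patient identifiers and the read codes in the example are hypothetical; the sketch only assumes, as stated in the text, that events are counted at level 3 of the read code and that age and sex are appended as extra columns.

```python
from collections import Counter


def build_frequency_table(records):
    """Turn event records into one numeric row per patient.

    records: list of (patient_id, read_code, age, sex) tuples (hypothetical layout).
    Returns (header, rows), where rows maps patient_id to a list of values:
    event counts at read-code level 3, one indicator column per age, then sex.
    """
    events = sorted({code[:3] for _, code, _, _ in records})
    ages = sorted({age for _, _, age, _ in records})
    counts, age_of, sex_of = {}, {}, {}
    for pid, code, age, sex in records:
        counts.setdefault(pid, Counter())[code[:3]] += 1
        age_of[pid], sex_of[pid] = age, sex
    header = events + [str(a) for a in ages] + ["sex"]
    rows = {
        pid: [counts[pid][e] for e in events]
             + [int(age_of[pid] == a) for a in ages]
             + [sex_of[pid]]
        for pid in counts
    }
    return header, rows


# Made-up records for two patients
records = [
    ("p1", "M26..", 15, 1),
    ("p2", "168..", 10, 2),
    ("p2", "F58..", 10, 2),
    ("p2", "F58z.", 10, 2),
]
header, rows = build_frequency_table(records)
print(header)       # ['168', 'F58', 'M26', '10', '15', 'sex']
print(rows["p1"])   # [0, 0, 1, 0, 1, 1]
print(rows["p2"])   # [1, 2, 0, 1, 0, 2]
```

Each row can then be passed directly to a geometric or Hamming distance function.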

3) Distances Calculation:

The distances for all six metrics applied to the THIN data are calculated and normalised. After normalisation, small distances close to 0 indicate similar patients, while large distances near 1 indicate dissimilar patients. For the Euclidean, Minkowski and Manhattan metrics, the data in Table II has been used to calculate the distances using equations (1), (2) and (3), respectively. For the Hamming distances, the number of differing values between two equal-length sequences from Table II has been counted. These distances have been normalised using the formula:

\[ \text{norm-dist}(x) = \frac{x - \min(x)}{\max(x) - \min(x)} \]

where $x$ refers to the distance between two patients. Regarding the pq-gram metric, the distance between the trees of two patients is defined as the symmetric difference between the two sets of pq-grams using equation (4), while the normalisation of the pq-gram distances is calculated using equation (5). The Edit Distance, on the other hand, is equal to the minimum number of edit operations (inserting, deleting or renaming nodes) needed to convert one tree into another. Each edit operation has cost 1, so the distance equals the minimum cost of converting $T_1$ into $T_2$. The Tree Edit Distance Normalisation (TED_NORM) is:

\[ \mathrm{TED\_NORM}(T_1, T_2) = \frac{\mathrm{TED}(T_1, T_2)}{|T_1| + |T_2|} \tag{6} \]

where $|T_1| + |T_2|$ is the total number of nodes in the two trees.

The results of calculating the distances using all six metrics are summarised in Table III and Table V for DESLORATADINE and DOXEPIN, respectively. The tables contain the smallest normalised distances between patients (the most similar data).

The results for the first drug show that the geometric and Hamming metrics could find similar patients, as the distance between the two patients is equal to zero. In contrast, the pq-gram and Edit Distance metrics produced very few similar patients, like (a670605Up, a670602uS) and (a67340327, a681001KN), besides others who have some similarity, i.e. distances close to the identical level. The reason for this is related to the structure of the data, which is a hierarchical tree structure.

The experiment for the second drug also produced a number of patients with similar medical events based on the geometric and Hamming metrics, as shown in Table V, while for the pq-gram and Edit Distance metrics the table shows no similar distances. The reason for this could be the lack of data for the second drug.
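
The two normalisations used in the distance calculation can be sketched in Python as follows. The first function rescales a collection of pairwise distances to the range 0 to 1 by min-max normalisation; the second divides a tree edit distance by the total number of nodes of the two trees. The function names and the example values are illustrative assumptions, and returning 0 for every entry when all distances are equal is a choice made here to avoid division by zero.

```python
def min_max_normalise(distances):
    """Rescale a list of distances to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # all distances identical: nothing to discriminate (assumed convention)
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]


def ted_norm(ted, size_t1, size_t2):
    """Tree edit distance divided by the summed node counts of both trees."""
    return ted / (size_t1 + size_t2)


print(min_max_normalise([2.0, 4.0, 6.0]))   # [0.0, 0.5, 1.0]
print(ted_norm(1, 2, 2))                    # 0.25
```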

4) Clustering the Distances:

The results in Table III and Table V show the similarities and closest distances between patients using the previously mentioned metrics. The next step of this work was to use a clustering method to verify our results, to give a first insight into what the data looks like and to find which distance metric can represent similar distances better than the others. The clustering process has also been used to show whether all the similar distances in Tables III and V fall in one cluster or are distributed over some or all clusters. In this work, we used the k-means method and set the number of clusters to three. For the first drug, two figures are reported to show the clusters of patients (Fig. 6(a) and Fig. 6(b)) using the Euclidean and pq-gram distance metrics (one metric from each group of metrics). Fig. 6(c) and Fig. 6(d) show the clusters of patients using the same metrics for the second drug.

Since the k-means algorithm is known to be biased by the starting positions, it needs to be run more than once, and as a result we may get more than one outcome. The clusters shown in the figures are those resulting from the most frequent clustering (the majority vote over 10 runs in our experiments). In order to distinguish between the clusters, we report Table VI and Table VII, which contain the number of patients in each cluster for the first and second drug, respectively. In these tables, cluster 1 is labelled as similar and cluster 2 as non-similar.
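
The idea of re-running k-means from different starting positions and keeping the most frequent outcome can be sketched in Python as follows. This is a minimal pure-Python version of Lloyd's algorithm with a majority vote over the resulting partitions; the function names, the seed handling and the example points are assumptions for illustration, and the sketch does not reproduce the R-based setup used for the figures.

```python
import random
from collections import Counter


def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centres = rng.sample(points, k)           # random starting positions
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centres[c])))
            for pt in points
        ]
        if new_labels == labels:              # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:                       # keep the old centre if a cluster empties
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def as_partition(labels):
    """Label-independent view of a clustering: a set of sets of point indices."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())


def most_frequent_clustering(points, k, runs=10, seed=0):
    """Run k-means several times and return the partition that occurs most often."""
    rng = random.Random(seed)
    votes = Counter(as_partition(kmeans(points, k, rng)) for _ in range(runs))
    return votes.most_common(1)[0][0]


# Made-up two-dimensional points, three clusters, 10 runs
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0), (9.1, 0.0)]
partition = most_frequent_clustering(points, k=3, runs=10)
print(sorted(sorted(group) for group in partition))
```

Comparing partitions rather than raw labels matters because two runs can produce the same grouping with the cluster numbers permuted.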

TABLE III
SMALLEST NORMALISED DISTANCES FOR PATIENTS TAKING DESLORATADINE DRUG

The Normalised Distances
Patient's ID            Euclidean  Minkowski  Manhattan  Hamming  Edit Distance  1,3-Grams  2,3-Grams
a670605Up, a670602uS    0          0          0          0        0.25           0          0
a6732002X, a673200WF    0          0          0          0        0.25           0.888889   1
a6732002X, a673201@y    0          0          0          0        0.25           0.888889   1
a673200tm, a673201j7    0          0          0          0        0.25           0.888889   1
a673201@y, 673200WF     0          0          0          0        0.25           0.888889   1
a673201Wt, a6732025y    0          0          0          0        0.25           0.888889   1
a673201wI, a678701pI    0          0          0          0        0.6666         0.888889   1
a67340327, a681001KN    0          0          0          0        0.25           0          0
a677505bO, a677505pe    0          0          0          0        0.25           0.888889   1
a683104@Y, 677505bO     0          0          0          0        0.6666         0.888889   1
a683104@Y, a677505pe    0          0          0          0        0.6666         0.888889   1
a673201wI, a777805mH    0          0          0          0        0.25           0.888889   1
a673402zw, a683105Bk    0          0          0          0        0.25           0.888889   1
a678701pI, a777805mH    0          0          0          0        0.25           0.888889   1
a791600uB, a777806FG    0          0          0          0        0.25           0.888889   1
a7916065T, a777800Gj    0          0          0          0        0.25           0.888889   1

TABLE IV
THE SHARED MEDICAL EVENTS FOR PATIENTS IN TABLE III

Patient's ID            Events (patient 1)  Events (patient 2)  Description of the event
a670605Up, a670602uS    1B8                 1B8                 Itchy eye symptom
a6732002X, a673200WF    17Z                 17Z                 Respiratory symptom NOS
a6732002X, a673201@y    17Z                 17Z                 Respiratory symptom NOS
a673200tm, a673201j7    ZL5                 ZL5                 Referral to orthopaedic surgeon
a673201@y, 673200WF     17Z                 17Z                 Respiratory symptom NOS
a673201Wt, a6732025y    740                 740                 Submucous diathermy to turbinate of nose
a673201wI, a678701pI    H05                 H05                 Upper respiratory tract infection NOS
a67340327, a681001KN    171                 171                 Cough
a677505bO, a677505pe    8B3                 8B3                 Medication requested
a683104@Y, 677505bO     8B3                 8B3                 Medication requested
a683104@Y, a677505pe    8B3                 8B3                 Medication requested
a673201wI, a777805mH    H05                 H05                 Upper respiratory tract infection NOS
a673402zw, a683105Bk    H17, 8B3            H17, 8B3            Hay fever or pollens, Medication requested
a678701pI, a777805mH    H05                 H05                 Upper respiratory tract infection NOS
a791600uB, a777806FG    H17                 H17                 Hay fever or pollens
a7916065T, a777800Gj    A78                 A78                 Verrucae warts or Molluscum contagiosum

TABLE V
SMALLEST NORMALISED DISTANCES FOR PATIENTS TAKING DOXEPIN DRUG

The Normalised Distances
Patient's ID            Euclidean  Minkowski  Manhattan  Hamming  Edit Distance  1,3-Grams  2,3-Grams
a793901c8, a9910027z    0          0          0          0        0.4761         0.971      0.967
b977401S1, a999104cU    0          0          0          0        0.4            0.923      1
g989501KB, a999104cU    0          0          0          0        0.375          1          1
g989501KB, b990804AL    0          0          0          0        0.5            1          1
b990804AL, a999104cU    0          0          0          0        0.375          1          1

The remaining patients are grouped in cluster 3.

[...] pq-gram metric, which is designed for hierarchical data like the read codes in THIN. We have implemented the distance metrics using two different types of data structures and compared their results. The two data structures are the tree-like structure for the group of pq-gram and Edit Distance metrics, as shown in Fig. 4, and the frequency table or matrix for the group of geometric and Hamming metrics, as shown in Table II. The distance metrics have been applied to the data and, generally, the results revealed that these metrics produced good similarity distances between patients' data. Regarding the pq-gram, the distances depend mainly on the number of intersecting pq-grams between two trees as well as the values of the parameters p and q. Choosing the correct values of p and q is a matter of trade-offs. In [14], Srivastava et al. analysed the sensitivity of pq-gram distances to the values of p and q and concluded that increasing p relative to q implies that more importance is being given to the ancestors than to the children of the trees, i.e. two nodes are considered to be the same only when they share p common ancestors.

Thus, in our case, the smaller the value of p relative to q, the higher the probability of finding intersecting pq-grams between two trees and the more importance is given to the data rather than the structure of the trees. Based on that, the results in the seventh column of Table III and Table V are better than the results in the eighth column of the same tables. In general, the pq-gram metric is not the best metric compared to the other metrics, as it depends on many parameters (p, q and the tree structure), but it could highlight some similar patients and measure the similarity between their data, as shown in Table III (e.g. patients a670605Up, a670602uS and patients a67340327, a681001KN). On the other hand, Table V contains some non-similar distances produced by the pq-gram and Edit Distance metrics; for example, the two patients g989501KB and a999104cU have a normalised distance equal to 1, which means there is no similarity between the two patients' data. The reason for this could be the lack of data for the DOXEPIN drug. That is to say, the more data available, the higher the probability of having similar data for patients in the THIN database.

Fig. 6. The clusters of patients using Euclidean and pq-gram metrics: (a) Euclidean metric, DESLORATADINE drug; (b) pq-gram metric, DESLORATADINE drug; (c) Euclidean metric, DOXEPIN drug; (d) pq-gram metric, DOXEPIN drug

After finding all the distances using the chosen metrics, we verified our results by considering the whole population of patients for each drug and by checking whether these distance metrics discriminate sufficiently when clustering the patient population. Fig. 6(a), Fig. 6(b), Fig. 6(c) and Fig. 6(d) show the results of clustering using the k-means algorithm. The latter is the simplest clustering method and requires the number of clusters to be known in advance. In this work, we chose the number of clusters to be equal to 3. However, more thorough data analysis is required in future work, and more than three clusters might be considered. The clusters have been plotted using the clusplot function from the R software, which represents all the observations by points in the plots using principal component analysis (PCA) [15]. PCA is used on the data set for the purpose of visualisation only and no feature selection has been carried out. The clusters are labelled using numbers (1, 2 and 3) as shown in Fig. 6, and the geometric and Hamming metrics discriminate successfully on the population for both drugs. We chose only two figures for each drug, one for each group of metrics. Table VI and Table VII show the number of patients in each cluster. The patients in Table III are grouped in cluster [...] pq-gram and Edit Distance metrics have a very poor similarity. Thus cluster [...] k-means algorithm.

TABLE VI
THE NUMBER OF PATIENTS IN EACH CLUSTER FOR DESLORATADINE DRUG

                      Cluster 1 (similar)  Cluster 2 (non-similar)  Cluster 3 (others)
Euclidean Metric      513                  114                      361
Minkowski, p = 3      578                  89                       321
Manhattan Metric      602                  89                       304
Hamming Metric        579                  75                       334
PQ-Gram Metric        409                  164                      415
Edit Distance Metric  284                  332                      372

TABLE VII
THE NUMBER OF PATIENTS IN EACH CLUSTER FOR DOXEPIN DRUG

                      Cluster 1 (similar)  Cluster 2 (non-similar)  Cluster 3 (others)
Euclidean Metric      23                   3                        16
Minkowski, p = 3      81                   7                        17
Manhattan Metric      23                   3                        16
Hamming Metric        23                   3                        16
PQ-Gram Metric        15                   15                       12
Edit Distance Metric  16                   15                       11

In conclusion, the pq-gram metric might not be the best metric for THIN data, but it can measure similar distances and group them in one cluster. That is to say, it highlighted some known medical events related to the drugs being taken, for example the cough and itchy eye symptoms related to the DESLORATADINE drug.
As each group of metrics depends on different data structures, choosing the appropriate distance measure for the THIN data may require an appropriate structure of the data: for example, a mixed data structure combining both the hierarchical and non-hierarchical data. By building the tree structure for all the levels of read codes, the distances can be calculated for read codes only. As a result, the pq-gram metric could relate medical codes to each other in a better way.

REFERENCES

[1] N. Augsten, M. Böhlen, and J. Gamper, "The pq-gram distance between ordered labeled trees," ACM Transactions on Database Systems (TODS), vol. 35, no. 1, p. 4, 2010.
[2] K. Kailing, H.-P. Kriegel, and S. Schönauer, "Content-based image retrieval using multiple representations," in Knowledge-Based Intelligent Information and Engineering Systems. Springer, 2004, pp. 982–988.
[3] K. Zhang and D. Shasha, "Simple fast algorithms for the editing distance between trees and related problems," SIAM Journal on Computing, vol. 18, no. 6, pp. 1245–1262, 1989.
[4] R. Cordeiro de Amorim and B. Mirkin, "Minkowski metric, feature weighting and anomalous cluster initializing in k-means clustering," Pattern Recognition, vol. 45, no. 3, pp. 1061–1075, 2012.
[5] R. Shahid, S. Bertazzon, M. L. Knudtson, and W. A. Ghali, "Comparison of distance measures in spatial analytical modeling for health service planning," BMC Health Services Research, vol. 9, no. 1, p. 200, 2009.
[6] J. Reps, J. M. Garibaldi, U. Aickelin, D. Soria, J. E. Gibson, and R. B. Hubbard, "Discovering sequential patterns in a UK general practice database," in IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), 2012, pp. 960–963.
[7] J. Reps, J. Feyereisl, J. M. Garibaldi, U. Aickelin, J. E. Gibson, and R. B. Hubbard, "Investigating the detection of adverse drug events in a UK general practice electronic health-care database," UKCI, the 11th Annual Workshop on Computational Intelligence, Manchester, 2011.
[8] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.
[9] J. F. Committee and R. P. S. of Great Britain, British National Formulary (BNF). Pharmaceutical Press, 2012, vol. 64.
[10] J. C. Oxtoby, "Metric and topological spaces," in Measure and Category. Springer, 1971, pp. 39–41.
[11] T. Körner, "Metric and topological spaces," 2010.
[12] J. C. Gower, "Euclidean distance geometry," Mathematical Scientist, vol. 7, no. 1, pp. 1–14, 1982.
[13] S. Hosangadi, "Distance measures for sequences," arXiv preprint arXiv:1208.5713, 2012.
[14] N. Srivastava, V. Mishra, and A. Bhattacharya, "Analyzing the sensitivity of pq-gram distance with p and q," ACM, 2010.
[15] M. Maechler,