DrugDBEmbed : Semantic Queries on Relational Database using Supervised Column Encodings
Bortik Bandyopadhyay, Pranav Maneriker, Vedang Patel, Saumya Yashmohini Sahai, Ping Zhang, Srinivasan Parthasarathy
The Ohio State University, Columbus, Ohio
{bandyopadhyay.14,maneriker.1,patel.3140,sahai.17,zhang.10631}@osu.edu, [email protected]
ABSTRACT
Traditional relational databases contain a lot of latent semantic information that has largely remained untapped due to the difficulty involved in automatically extracting such information. Recent works have proposed unsupervised machine learning approaches to extract such hidden information by textifying the database columns and then projecting the text tokens onto a fixed-dimensional semantic vector space. However, in certain databases, task-specific class labels may be available, which unsupervised approaches are unable to lever in a principled manner. Also, when embeddings are generated at the individual token level, the column encoding of a multi-token text column has to be computed by averaging the vectors of the tokens present in that column for any given row. Such an averaging approach may not produce the best semantic vector representation of the multi-token text column, as observed while encoding paragraphs or documents in the natural language processing domain. With these shortcomings in mind, we propose a supervised machine learning approach using a Bi-LSTM based sequence encoder to directly generate column encodings for multi-token text columns of the DrugBank database, which contains gold-standard drug-drug interaction (DDI) labels. Our text-data driven encoding approach achieves very high accuracy on the supervised DDI prediction task for some columns, and we use those supervised column encodings to simulate and evaluate Analogy SQL queries on relational data to demonstrate the efficacy of our technique.
KEYWORDS
supervised learning, database column encoding, analogy sql query
ACM Reference Format:
Bortik Bandyopadhyay, Pranav Maneriker, Vedang Patel, Saumya Yashmohini Sahai, Ping Zhang, Srinivasan Parthasarathy. 2018. DrugDBEmbed: Semantic Queries on Relational Database using Supervised Column Encodings. In Woodstock ’18: ACM Symposium on Neural Gaze Detection, June 03–05, 2018, Woodstock, NY. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/1122445.1122456
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Woodstock ’18, June 03–05, 2018, Woodstock, NY
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/10.1145/1122445.1122456
Traditional relational database systems support a wide variety of well-defined structured and unstructured data types like numeric, categorical, unstructured text, images etc. [5, 6]. The relational data can be accessed by SQL, which only allows for syntactic matching or range-based queries [5] through select, project and join operations. However, as pointed out by Bordawekar and Shmueli [6], there is a lot of latent semantic information present inside a relational database that cannot be directly utilized through SQL queries on the original data. The latent semantic information could sometimes be at the individual token level, while in other cases, a sequence of text tokens can jointly convey that information [6]. While there are dictionary-based text-extenders [11] or ontology-based support systems [33] to partially enable semantic queries, such systems cannot directly extract the latent semantic information from the database [6]. Bordawekar and Shmueli [6] propose a textification-based database token embedding generation approach to automatically extract latent semantic information from a relational database, by projecting the textified database tokens to a low-dimensional vector space using an unsupervised token embedding generation technique, adapted from the popular word2vec algorithm [40]. Cosine similarity between such vector representations of database tokens can be easily computed via custom user-defined functions as part of SQL queries [5]. Additionally, those vector representations can be used to run a wide variety of semantic queries [5] like approximate nearest neighbor queries, Analogy SQL queries etc. The unsupervised token embedding generation approach proposed by Bordawekar et al. [5] is based on the distributional hypothesis of words in text corpora [40], which makes the embeddings generic and well suited for exploratory data analytics.
In contrast, we believe a supervised embedding generation approach may be better suited to task-specific scenarios, where ground-truth labels are available. Our assumption of databases containing task-specific supervision information is grounded in practical examples. Take, for example, a researcher who is working on a novel drug discovery problem in a pharmaceutical company. She has access to a highly curated proprietary relational database that contains information on prescription as well as under-evaluation drugs, along with gold-standard drug-drug interaction information. (A drug-drug interaction implies adverse side-effects when two prescription drugs are taken together by a patient. Detecting possible drug-drug interactions is an important task, since it can save a lot of human lives and control annual health care costs [18, 59].) Although she has little or no expertise in SQL, the researcher may be interested in retrieving from such a proprietary relational database the list of all drugs that interact with a query drug A in the same way as B interacts with A. Such an end-user query can be executed on the database using drug embeddings, through an approximation-based Analogy SQL query [5]. However, it would be preferred that the drug embeddings are task-specific, i.e., generated to capture the gold-standard drug-drug interaction information, instead of being task-agnostic. To this end, we propose a Bi-LSTM based supervised column encoding generation approach for multi-token text columns of a relational database using techniques from the natural language processing domain [28, 35, 55, 60] and utilize such embeddings to solve Analogy SQL queries [5] on the relational database.
For exposition simplicity, we describe our approach using the drug database example, where the ground-truth drug-drug interaction labels are present in a special table (DDI) in addition to the table containing drug information (DI). Each row of the DDI table contains an interaction type (class label) for a pair of rows (i.e., the corresponding drugs) in the DI table. We utilize the drug pair interaction label to generate column encodings for the text-based columns of the corresponding drug rows in the DI table, by training a classification model for the DDI prediction task. Thus, we generate database column encodings based on the relationship between a pair of rows using ground-truth labels. We train our classification model from scratch, which gives the embeddings of individual tokens as a by-product of our approach. Thus the column and token encodings generated by our proposed Bi-LSTM model are fully task-specific, interaction-relationship based encodings, as opposed to the generic word-distribution based encodings of [5]. To the best of our knowledge, our approach is the first to explicitly utilize gold-standard row-pair relationship labels to drive supervised column encodings for semantic queries on relational databases, using the drug-drug interaction use-case. Prior works like [5–8] have focused on generating task-agnostic unsupervised embeddings for each individual token of a sequence of text tokens within a database row. In such a setting, one way to compute the column encoding for a multi-token column is to compute the average of the vectors of the text tokens present in that column [5], which may not always be the best semantic vector representation for that text column [30]. In contrast, we focus on directly generating task-specific supervised column encodings, i.e., a vector representation for a sequence of text tokens present within the column boundary, similar to paragraph [30] and document [12] encoding generation approaches in NLP, albeit in a supervised learning setting.
We get embeddings for individual tokens as a by-product of our approach. Since we demonstrate our approach using the DDI scenario, it is important to point out that there are several works on the supervised DDI prediction task [13, 16, 27, 51, 58, 61], but most of them focus on the chemical structure of the drugs. The works [16, 27, 58] that use text-based information require a costly data pre-processing and feature extraction phase. In contrast, we focus on only categorical and text-based attributes, and our DDI prediction approach can take the text token sequence as input after a very light-weight text pre-processing step, which significantly reduces the overhead of the feature generation step for the classification task. Our proposed Bi-LSTM model achieves very competitive performance over standard BOW baselines for the classification task. We demonstrate the efficacy of our supervised column encodings for the Analogy SQL task, by proposing an intuitive DDI-pair based simulation and evaluation strategy, under two different real-world settings.
Language Embedding:
In the NLP domain, language embedding generation, for obtaining a vector representation of words in a language, has been extensively studied. A few such techniques include neural-network based learning [3], log-linear classifiers [38], function optimization of an objective matrix [47], matrix factorization techniques [32] etc. The unsupervised database token embedding generation technique used by Bordawekar et al. [5] is an adaptation of one of the most popular word embedding algorithms, called word2vec [39, 40]. The vectors produced by word2vec [39, 40] can be utilized for many semantic reasoning tasks like analogy [31] with good performance. Additional details about word2vec can be obtained in [20, 31, 41], while its modifications for databases are detailed in [5]. Paragraphs [30] and documents [12] can also be summarized using vector representations for various downstream applications. For text classification tasks, BOW models have been used as baselines [53]. However, there are many deep learning techniques, like the CNN [28], RNN [35], LSTM [55] and Bi-LSTM [60] based models, that can be utilized to generate a fixed low-dimensional representation of the input text and use it for the classification task. Yin et al. [56] have conducted a detailed comparative study of CNNs and RNNs for such tasks. Note that our goal is to learn vector representations for text token sequences in such a way that we learn the latent relationship between a pair of texts expressed by the class label (i.e., DDI type), similar to relation learning tasks [29, 34, 49].
Database Embedding generation and utilization:
To address the problem of resolving schematic differences between objects in multiple databases, Kashyap and Sheth [26] define the semantic proximity between two related objects, by associating the mapping of those objects with the comparison context. The DLDB [44] system has been designed to allow semantic web queries on relational databases using the FaCT description logic reasoner. Semantic queries through dictionary-based text-extenders (e.g., DB2 Text Extender [11]) or ontology-based support systems [33] have been in practice. Yu et al. [57] define keyword relationship summaries and utilize those summaries to develop novel ranking methods to select the most relevant database for a given keyword query. However, none of these works use database token embeddings for executing semantic reasoning queries [5] using SQL UDFs. Bordawekar and Shmueli [6] introduce an unsupervised machine learning technique to generate database token embeddings that can capture the latent semantic information between database entities. It is a two-step process. The first step involves converting heterogeneous data types of the original database (like numeric, text, images etc.) into a consistent text-based representation [5, 6]. The database is converted into a text corpus where each line of the text corpus represents a textified row of the original relational database. The authors adapt the very popular word2vec algorithm [40], an unsupervised approach, to generate embeddings of the tokens of the textified database corpus. The cosine similarity between any pair of database token (primary key) vectors is the approximate semantic similarity between those tokens (the rows).
Furthermore, the authors demonstrate [42] how custom user-defined functions can be incorporated as part of SQL queries to realize enhanced semantic querying capabilities [5, 6] like analogy queries, semantic clustering etc., which can be executed by leveraging this low-dimensional representation of database tokens. A sample case-study [5] demonstrates the usefulness of the novel insights that such semantic queries can uncover from relational databases using simple SQL syntax. The semantic vectors are primarily based on the information present within the database, with the added option of utilizing information from external corpora like Wikipedia [5] for semantic querying purposes. One advantage of the unsupervised approach is that the embeddings can be incrementally trained [5] in case of any database or content updates, which makes it very flexible in practice. Additionally, such unsupervised database token embeddings can also be effectively utilized for selectively disclosing database information [8]. In contrast, we propose a supervised task-specific database column encoding approach that summarizes the information present in database columns with respect to the specific task at hand, while also generating embeddings for individual database tokens. The sub-problem of approximate nearest neighbor search using token embeddings in a database has been further optimized [19] in the Freddy [21] framework, which is a free open-sourced implementation of semantic querying capabilities on top of the Postgresql database. Cappuzzo et al. [9] first construct a graph-based representation of the relational data and then lever an unsupervised graph embedding approach to generate database token embeddings, which outperform several baselines for the data integration task.
The idea of generating an embedding-based representation of each row of a database has also been utilized, quite successfully, for the missing value imputation task [4]. Srinivas et al. [52] propose a ‘siamese triplet’ network that assigns a small distance to two related surface forms of the same entity, which can be utilized very effectively for improving the performance of data merging tasks. Termite [15] is a relational embedding framework for the data integration task, where the authors aim to learn a distance metric that can be used to compute the similarity/distance of data coming from text and databases. In RETRO [22], database token embeddings are generated by formulating relation retrofitting as a learning problem (similar to [14]), such that relational information present in the database can be combined in a principled way with semantic information present in an external text corpus for better token embeddings. Arora and Bedathur [1] propose an LSTM based model for capturing inter-row relationships in relational databases and utilize such embeddings for similarity queries and data cell completion tasks. Their work clearly demonstrates the advantage of using an LSTM encoder for capturing sequence-based information (temporal information in their setting) over traditional approaches.
Drug-Drug Interaction Prediction:
Two drugs, when consumed together, may cause unexpected adverse side effects. Thus, detecting possible adverse drug-drug interactions is a very important task, which can save a lot of human lives as well as control annual health care costs [18, 59]. There has been a significant amount of research on the DDI prediction task, but most of those works involve costly pre-processing and feature extraction steps. Ryu et al. develop the DeepDDI framework [51], which uses the chemical structures of the input drug pair in simplified molecular-input line-entry system (SMILES) format to construct SSPs for each drug through a multi-step feature construction process, and uses those SSP vectors as input to train a multilayer perceptron that learns to predict multiple DDI types for the given drug pair. Deng et al. develop the DDIMDL framework [13] that uses multiple different kinds of drug information, like chemical structures, targets, enzymes and pathways, to construct corresponding similarity matrices after an elaborate feature extraction and similarity computation step. The similarity matrices are then used to train a modular DNN network which combines the individual sub-models' predictions to generate the final DDI prediction. In contrast, we use only text data as input, and lever a BiLSTM to do automatic input feature extraction for the one-out-of-many DDI class prediction problem. Zhang et al. [58] propose a label propagation framework for DDI prediction by integrating heterogeneous information (like chemical structure, side effects from package inserts of drugs etc.) from diverse sources. Fokoue et al. [16] design Tiresias, where semantic similarity scores from multiple diverse information sources are computed for drug pairs, and then these scores are used as features to train a logistic regression classifier to predict DDIs between the drug pairs. Kastrin et al.
[27] extract various topological and semantic-similarity based features for drug pairs from multiple heterogeneous information sources and utilize machine learning models (like SVM, GBM etc.) to perform DDI prediction. Ma et al. [37] propose attentive multi-view graph auto-encoders for DDI prediction. Note that while some of the previous works [16, 27, 58] use text data as one of the many information sources, there is a significant pre-processing burden for feature engineering, as different types of pairwise similarity scores need to be computed for each drug pair. Zheng et al. [59] propose an attention-based BiLSTM model for effective drug-drug interaction extraction (but not DDI prediction) from medical corpora. Asada et al. [2] utilize a graph convolutional neural network to capture the molecular structure information of drug pairs, which further improves the quality of drug-drug interaction extraction from biomedical text. Our approach focuses only on the text-based drug information, like ATC codes, categories, description etc. of each drug, but not the chemical structure, to better understand the true informativeness of the text-only data for the DDI task. We use a BiLSTM-encoder based supervised neural model for column encoding generation (similar to [59] without attention), by learning to predict drug-drug interactions from various text-based drug information. Thus our feature extraction process for the classification task is automatic due to the BiLSTM encodings, which can be utilized for the execution of semantic queries on relational databases. Unsupervised embedding based analogy queries have been evaluated on medical data [43], but not in the context of relational databases.
The overall steps are similar to the ones proposed by Bordawekar et al. [5], although our textification step is less involved, since we focus only on multi-token text columns, and the encodings we generate are fully supervised. In Section 3.1, we describe how to generate novel task-specific column encodings, and then in Section 3.2 we explain how to use such encodings to run the Analogy SQL query [5].
We assume that the data has been converted to text form (more details in Section 4.1), and hence we can employ various text sequence models for encoding. Recurrent Neural Network based models, in particular Long Short-Term Memory networks (LSTMs) [23], have been successfully used to generate text encodings that can generalize across various tasks [24, 48]. Our drug encoding models are also based on LSTMs. First, we describe the procedure we use to generate encodings for each column using LSTM based encodings. Consider a sequence of tokens, where each token is mapped to a unique integer in {0, . . . , N − 1}, where N denotes the size of the vocabulary consisting of all the tokens in the dictionary. By iterating through the entries in the column in the training set, such a dictionary can be constructed. A special token is also added to encode tokens unseen in the training data. Each element in the vocabulary is then mapped to a d-dimensional vector, and each sequence is either truncated or padded with special tokens to make sure that all sequences have the same length. Thus for each sequence s, we get a corresponding sequence of encoded d-dimensional vectors [x_{s1}, x_{s2}, . . . , x_{sn}], where n is the chosen maximum length. Each encoded sequence s is then passed through a multi-layer LSTM network.
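The dictionary construction, unseen-token handling, and truncation/padding just described can be sketched as follows (a minimal illustration; the `<pad>`/`<unk>` token names are our assumptions, not the paper's exact implementation):

```python
# Minimal sketch of vocabulary construction and fixed-length encoding.
# The PAD/UNK token names are illustrative assumptions.
PAD, UNK = "<pad>", "<unk>"

def build_vocab(column_values):
    """Map every token seen in the training column to a unique integer id."""
    vocab = {PAD: 0, UNK: 1}
    for text in column_values:
        for tok in text.split():
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, max_len):
    """Integer-encode a token sequence, truncating or padding to max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in text.split()]
    ids = ids[:max_len]
    ids += [vocab[PAD]] * (max_len - len(ids))
    return ids
```

Each integer id is then looked up in the embedding layer to obtain the d-dimensional vectors fed to the LSTM.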
The corresponding LSTM equations in the forward direction are [23, 45]:

i^l_{st} = σ(W^l_{ii} x^{l−1}_{st} + b^l_{ii} + W^l_{hi} h^l_{s(t−1)} + b^l_{hi})
f^l_{st} = σ(W^l_{if} x^{l−1}_{st} + b^l_{if} + W^l_{hf} h^l_{s(t−1)} + b^l_{hf})
g^l_{st} = tanh(W^l_{ig} x^{l−1}_{st} + b^l_{ig} + W^l_{hg} h^l_{s(t−1)} + b^l_{hg})
o^l_{st} = σ(W^l_{io} x^{l−1}_{st} + b^l_{io} + W^l_{ho} h^l_{s(t−1)} + b^l_{ho})
c^l_{st} = f^l_{st} ⊙ c^l_{s(t−1)} + i^l_{st} ⊙ g^l_{st}
h^l_{st} = o^l_{st} ⊙ tanh(c^l_{st})
x^l_{st} = h^l_{st}

where i, f, g, o represent the input, forget, cell, and output gates, t represents the time step, h represents the hidden state, l represents the layer, σ is the sigmoid function, and ⊙ represents the Hadamard product. W and b refer to the weights and biases for the internal layers. As shown in the last equation, the output hidden states h^l_{st} are used as the input states x^l_{st} for the next layer, with x_{st} being the d-dimensional encoding described previously. A reverse-direction LSTM processes the sequence from t = n down to t = 1. For an L-layer network, the final hidden states (h^{L(f)}_{sn} for the forward direction and h^{L(b)}_{s1} for the backward) are concatenated and used as the representation for the entire sequence. We label the output e_s = h^{L(f)}_{sn} ⊕ h^{L(b)}_{s1} as the embedding for sequence s, where ⊕ represents the concatenation operation. To tune these embeddings for predicting drug-drug interactions, we use ideas from relation learning [54]. For each pair of drug encodings e_i, e_j in the training data, we construct a pair encoding as

p_{ij} = (|e_i − e_j|) ⊕ (e_i ⊙ e_j)

Prior work [54] has shown that using the difference and product can help capture the relationship between pairs of encodings. Finally, we pass the pair encoding through a single linear layer to get the logit scores corresponding to each interaction label. Figure 1 demonstrates the complete series of steps used to get the scores for each interaction class for a single drug pair.
These scores are then converted into log-softmax values after normalization and then used to compute a negative log-likelihood based loss for predicting the correct interaction class. We use a single hidden layer to avoid offloading any complexities of the interactions onto the final layers: the encodings themselves contain the information corresponding to the interactions.
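The architecture described above (embedding layer, multi-layer Bi-LSTM, concatenated final hidden states, pair encoding |e_i − e_j| ⊕ (e_i ⊙ e_j), single linear layer, log-softmax) can be sketched in PyTorch, which the paper uses, as follows. The dimensions, layer count and class name are illustrative assumptions, not the paper's tuned configuration:

```python
import torch
import torch.nn as nn

class PairDDIClassifier(nn.Module):
    """Sketch of the Bi-LSTM column encoder and DDI pair classifier.
    Vocabulary size, dimensions and layer count are illustrative."""
    def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64,
                 num_layers=2, num_classes=86):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        # pair encoding |e_i - e_j| concat (e_i * e_j): each e is 2*hidden_dim
        self.out = nn.Linear(4 * hidden_dim, num_classes)

    def encode(self, token_ids):
        _, (h, _) = self.lstm(self.embed(token_ids))
        # concatenate final forward and backward hidden states of the top layer
        return torch.cat([h[-2], h[-1]], dim=-1)

    def forward(self, drug1_ids, drug2_ids):
        e_i, e_j = self.encode(drug1_ids), self.encode(drug2_ids)
        p_ij = torch.cat([(e_i - e_j).abs(), e_i * e_j], dim=-1)
        return torch.log_softmax(self.out(p_ij), dim=-1)
```

Training then minimizes `nn.NLLLoss()` between these log-probabilities and the ground-truth interaction labels, matching the loss described above.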
The deep learning based classification model developed above can be used to generate task-specific (i.e., DDI-prediction based) d-dimensional column encodings for each of the text columns of each drug, by concatenating the final bi-directional hidden states of the trained Bi-LSTM model. Then the column-specific encodings are uploaded into separate tables (called encoding tables) in the database, where the primary key is the drug id and the column contains the d-dimensional vector [5, 21]. Since the encodings are column specific, Analogy SQL queries [5] on drugs can now be executed by leveraging the relevant column encodings of the query drugs. We now present a brief overview of executing Analogy SQL queries using those encodings, as proposed by Bordawekar et al. [5]. Given that the elements A, B, C, D have been projected as vectors in a common multi-dimensional vector space, Rumelhart and Abrahamson [50] describe the analogy query (A : B :: C : D) as the task of finding a vector D whose distance from the vector of C is closest to the distance between the vectors of A and B. In our setting, the elements A, B, C, D can be thought of as unique drug names (or drug identifiers) which have been projected into a common vector space based on the text data of the specific column for the individual drugs (as described in Section 3.1). While many standard techniques exist in the NLP literature for analogy query computation using word embeddings, we use the cosine similarity (Cosine) based 3COSMUL [31] strategy, as it has shown consistently good performance in both NLP [31] and database [5] settings for the analogy computation task. For completeness, we present the formula to compute 3COSMUL [5, 31] for a given analogy query A : B :: C : D, with the corresponding vectors V_A, V_B, V_C, V_D, below:

arg max_{D ∈ Drugs} [C(V_D, V_C) ∗ C(V_D, V_B)] / [C(V_D, V_A) + ϵ]   (1)

where C(V_x, V_y) = (Cosine(V_x, V_y) + 1)/2, and ϵ = 0.001 to avoid division-by-zero errors [31]. The method 3COSMUL is implemented as part of an SQL query in the Freddy framework [21] to realize the column-specific analogy functionality utilizing the respective column embedding tables. A sample analogy query (‘DB08897’ : ‘DB11315’ :: ‘DB08897’ : ?) with real drugs using one of the columns' encodings (ColEncTable) is shown below.
SELECT DISTINCT
    T.drugbank_id AS DrugId,
    (C(v4.vector, v3.vector) * C(v4.vector, v2.vector)) /
        (C(v4.vector, v1.vector) + 0.001) AS Score
FROM DrugBankFullClean AS T
INNER JOIN ColEncTable AS v1 ON v1.drugbank_id = 'DB08897'
INNER JOIN ColEncTable AS v2 ON v2.drugbank_id = 'DB11315'
INNER JOIN ColEncTable AS v3 ON v3.drugbank_id = 'DB08897'
INNER JOIN ColEncTable AS v4 ON v4.drugbank_id = T.drugbank_id
WHERE T.drugbank_id NOT IN ('DB08897', 'DB11315', 'DB08897')
  AND (C(v4.vector, v3.vector) * C(v4.vector, v2.vector)) /
          (C(v4.vector, v1.vector) + 0.001) >= 0.25
ORDER BY (C(v4.vector, v3.vector) * C(v4.vector, v2.vector)) /
             (C(v4.vector, v1.vector) + 0.001) DESC
FETCH FIRST 10 ROWS ONLY;
Figure 1: Overview of the proposed DNN model for column encoding generation via learning the DDI prediction task. (The figure shows the word tensors of the two input drugs, truncated/padded to MAX_LEN, passing through the embedding layer and the LSTM encoder, followed by linear and log-softmax layers that score interaction labels 0 through 85; NLLLoss against the ground-truth class label is backpropagated to learn the weights and token embeddings.)
In the above query, note that A and C are the same, and only the top-10 most relevant drugs are retrieved as answer drugs for the given analogy query, such that the answer drugs drugD presumably have the same interaction type (i.e., DDI label) with drug drugC (‘DB08897’) as the interaction type between the drug pair drugA (‘DB08897’) and drugB (‘DB11315’). We describe the simulation and evaluation strategy we developed for such queries in Section 4.5 and Section 4.6, respectively.
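Outside the database, the 3COSMUL ranking of Eq. (1) can be sketched in a few lines (an illustrative sketch; in the paper the computation runs inside Freddy via the SQL UDF shown above, and the function names here are ours):

```python
import numpy as np

def shifted_cos(u, v):
    """C(u, v) = (cosine(u, v) + 1) / 2, mapping similarity into [0, 1]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return (cos + 1.0) / 2.0

def three_cos_mul(vecs, a, b, c, eps=0.001):
    """Rank candidate drugs D for the analogy A : B :: C : D by
    C(V_D, V_C) * C(V_D, V_B) / (C(V_D, V_A) + eps)."""
    scores = {}
    for d, v in vecs.items():
        if d in (a, b, c):
            continue  # mirrors the SQL NOT IN clause
        scores[d] = (shifted_cos(v, vecs[c]) * shifted_cos(v, vecs[b])
                     / (shifted_cos(v, vecs[a]) + eps))
    return sorted(scores, key=scores.get, reverse=True)
```

Candidates whose vectors are close to both V_C and V_B but far from V_A score highest, which is exactly the ordering the SQL query's ORDER BY clause produces.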
We implement all components of our framework in Python, and use Freddy [21] as the back-end database. The BOW based models and the DNN model have been implemented using scikit-learn [46] and PyTorch [45], respectively. Experiments are run on machines hosted at the Ohio Supercomputer Center [10].
We use a publicly available parser (https://github.com/dhimmel/drugbank) to extract the raw drug information downloaded from DrugBank. We focus on 6 text-based attributes, viz.: ATC Codes, Categories, Description, Merged Class information, Protein Binding and Target Action. We concatenate various class-based information of each drug into a single categorical attribute that we call Merged Class information in the rest of the paper. Our approach will work for any columns that are of categorical or text type, or can be easily textified by the various methods proposed in [5]. We use the ground-truth drug-drug interaction labels from Ryu et al. [51], which contain about 192K DDI pairs. Since our setting is a single-class classification problem, we randomly retain only a single DDI interaction for drug pairs that have multiple interaction labels. We perform very light-weight basic text pre-processing on the raw drug information and retain only those drugs for which there is a valid DDI pair as above. The final data contains textified information of 1705 drugs extracted from the DrugBank corpus, and 191,728 corresponding DDI pairs with a total of 86 different interaction types (class labels). For every input text field, we first convert all characters to lowercase. Following this step, we truncate multiple white spaces into a single space for cleaner tokenization, and split the input into tokens based on the space character. Further, we remove all words that are present in the NLTK stopwords corpus for English [36]. We replace most of the numeric tokens by a special token ‘numtkn’ that indicates the presence of a number (only applicable for certain columns). For some of the categorical columns, we observe that each drug may have multiple unique values in a column. For example, a single drug may have multiple different ATC codes.
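The light-weight pre-processing just described (lowercasing, whitespace collapsing, space tokenization, stopword removal and the ‘numtkn’ substitution) can be sketched as follows. The small inline stopword set is a stand-in for the NLTK English stopwords corpus, and treating only all-digit tokens as numeric is a simplification of "most of the numeric tokens":

```python
import re

# Stand-in for the NLTK English stopwords corpus used in the paper.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def preprocess(text, replace_numbers=True):
    """Lowercase, collapse whitespace, tokenize on spaces, drop stopwords,
    and optionally map numeric tokens to the special token 'numtkn'."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    tokens = []
    for tok in text.split(" "):
        if tok in STOPWORDS:
            continue
        if replace_numbers and tok.isdigit():
            tok = "numtkn"  # simplification: only all-digit tokens
        tokens.append(tok)
    return tokens
```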
Hence, during the data extraction phase, we separate each unique value of a multi-valued categorical column using a special delimiter character. Some columns receive additional column-specific processing:
• ATC codes: We first convert all text to lowercase, and then split the string on the delimiter. Each ATC code is 7 characters in length and has special ontological information encoded at each level. We want to preserve this ontological structure by pre-pending the level information to the code fragment corresponding to that level. Thus for an ATC code of “B01AE02” the processed token sequence is (“atcl1_b atcl2_01 atcl3_a atcl4_e atcl5_02”).
• Categories: We split the input text on the delimiter.
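The ATC level-prefixing step can be sketched as follows (the 1/2/1/1/2-character level widths follow the standard ATC coding scheme; the helper name is ours):

```python
# Sketch of the ATC level-prefix tokenization described above.
# ATC codes have five levels of widths 1, 2, 1, 1, 2 characters.
ATC_LEVEL_WIDTHS = [1, 2, 1, 1, 2]

def atc_tokens(code):
    """Turn a 7-character ATC code into level-prefixed tokens,
    e.g. 'B01AE02' -> ['atcl1_b', 'atcl2_01', ...]."""
    code = code.lower()
    tokens, pos = [], 0
    for level, width in enumerate(ATC_LEVEL_WIDTHS, start=1):
        tokens.append(f"atcl{level}_{code[pos:pos + width]}")
        pos += width
    return tokens
```

The level prefixes keep, say, the second-level fragment "01" of one drug from colliding with the fifth-level fragment "01" of another, preserving the ontological structure in the token vocabulary.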
We use the publicly available Postgresql based framework called Freddy [21] as the data back-end. All the textified information of the drugs is loaded in a table called DrugBankFullClean, where the DrugBank identifier field of each drug forms the primary key of this table. The DDI pairs, along with the ground-truth labels, are loaded in the DDITypeInfoTable having column names (drug1, drug2, label). This table has a composite primary key <D1Id, D2Id>, where D1Id and D2Id are the DrugBank identifiers for the first (drug1) and the second (drug2) columns of each of the DDI pairs, respectively. Besides these 2 tables, there are separate ColEncTable tables, one for each of the 6 columns, prefixed with the column name, that store the <Drug Id, Column Encodings>, where the corresponding column encodings of each drug are generated from the trained DNN model. The dimension of the encodings may vary for each column depending on the parameters of the best trained model for the corresponding column. To train the ML models for the supervised DDI prediction task, the drug pair interaction data consisting of the ground-truth labels needs to be partitioned into training, validation and testing sets. We create two separate partitioning strategies, as described below, and run all the experiments under both of these settings to evaluate the efficacy of our strategy.
This partitioning approach is a common way to train and evaluate many supervised machine learning models. We use a stratified sampling technique, with each of the 86 class labels as a stratum, to split the 191,728 drug pair interactions into an 80% training (153,346 DDI pairs), 10% validation (19,171 DDI pairs) and remaining 10% testing (19,211 DDI pairs) set. Thus, each unique label has a similar distribution in all the partitions, which makes model evaluation using the partitions easier.

This partitioning approach is rooted in the practical end-user scenario [58], where the database already contains several drugs as well as the corresponding ground truth DDI pairs, and then some new drugs (along with the drugs' information) get introduced into the database. The task is then to predict the DDI interactions of these new drugs using the DDI model and also to execute the Analogy SQL query using the corresponding encodings. We think this setting is a much harder one to evaluate, compared to the DDI pairwise partitioning scheme, albeit the most practical one. For this setting, we randomly select x% of the total unique drugs from our filtered drug set and consider them as held-off drugs. Then we remove all the ground-truth DDI pairs where at least one of the drugs belongs to the held-off drug set, and we designate this as the testing DDI set. The testing set may not contain drug pairs from all 86 classes and thus the label distribution is skewed. The remaining ground truth DDI pairs are split randomly into Training (90%) and Validation (10%) sets, ensuring that both of these sets contain at least one drug-drug interaction pair from each of the 86 classes.
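The 80/10/10 stratified DDI-pair split described above can be sketched with scikit-learn; the helper name is ours, and the toy data in the test stands in for the real DDI pairs:

```python
from sklearn.model_selection import train_test_split

def stratified_80_10_10(pairs, labels, seed=42):
    """Split DDI pairs 80/10/10 using each class label as a stratum,
    so every label keeps a similar distribution across the partitions."""
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        pairs, labels, test_size=0.2, stratify=labels, random_state=seed)
    # Split the remaining 20% in half: 10% validation, 10% testing.
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_tr, y_tr), (X_val, y_val), (X_te, y_te)
```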
Table 1: Drug Held-off partitioning statistics (1%, 2% and 3% Held-Off settings).
In our setup, we vary the percentage of drugs held off, i.e., 'x', as 1%, 2% and 3% to obtain different partitions, and report some of the key characteristics of the resulting data in Table 1. Clearly, as x% increases, the number of held-off test drugs increases, and so does the size of the final testing drug pairs. The distribution of DDI pairs as well as the size of the training/validation set is different from the previous setting (described in Section 4.2.1), and hence none of the results across these two separate partitioning schemes are comparable. Also, note that the label distribution of the test set is very different from that of the training/validation set, due to the way the splits have been constructed. This skew in label distribution makes classifier performance comparison very difficult between the validation and testing sets within each of the x% settings. Hence, this partitioning scheme makes model training and evaluation tasks much more challenging, while also capturing a more realistic end-user scenario.

| Column          | DDI Partition (c=1 / c=2) | 1% Drug Heldoff (c=1 / c=2) | 2% Drug Heldoff (c=1 / c=2) | 3% Drug Heldoff (c=1 / c=2) |
|-----------------|---------------------------|-----------------------------|-----------------------------|-----------------------------|
| ATC Code        | 132 / 123                 | 132 / 123                   | 132 / 121                   | 132 / 122                   |
| Categories      | 2262 / 1659               | 2256 / 1646                 | 2253 / 1653                 | 2241 / 1645                 |
| Description     | 12179 / 5905              | 12138 / 5873                | 12094 / 5862                | 12016 / 5800                |
| Merged Class.   | 710 / 565                 | 712 / 566                   | 710 / 561                   | 698 / 558                   |
| Protein Binding | 1471 / 477                | 1467 / 479                  | 1464 / 476                  | 1452 / 472                  |
| Target Action   | 3157 / 1539               | 3151 / 1538                 | 3145 / 1488                 | 3143 / 1524                 |

Table 2: Vocabulary size (including 3 fixed tokens); each cell shows counts for min-word-count c = 1 / c = 2.
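The Drug Held-off split described above can be sketched as follows. This is a simplified illustration with names of our own choosing; in particular, the paper's additional guarantee that training and validation each contain at least one pair from every class is omitted here:

```python
import random

def drug_heldoff_split(ddi_pairs, x_percent, seed=0):
    """Hold off x% of the unique drugs; every pair touching a held-off
    drug becomes test data, and the rest is split 90/10 train/val."""
    rng = random.Random(seed)
    drugs = sorted({d for pair, _ in ddi_pairs for d in pair})
    k = max(1, int(len(drugs) * x_percent / 100.0))
    held_off = set(rng.sample(drugs, k))
    test = [(p, l) for p, l in ddi_pairs if held_off & set(p)]
    rest = [(p, l) for p, l in ddi_pairs if not (held_off & set(p))]
    rng.shuffle(rest)
    n_val = len(rest) // 10  # remaining pairs: 90% train / 10% validation
    return held_off, rest[n_val:], rest[:n_val], test
```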
We compare the results from the LSTM-based approaches against the bag-of-words baselines for the classification task. To construct these baselines, the first step is to extract a bag-of-words representation for each column of each drug. First, we use the same vocabulary as the LSTM models for each column. Each column can then be represented in terms of the counts of the different tokens (from the vocabulary) in that column. We use the scikit-learn [46] CountVectorizer for this purpose. Each interaction pair can then be represented as the sum of the count vectors of the corresponding columns of the two drugs. This representation for a drug pair can be used as an input to multiple machine learning models. We run experiments using the popular k-nearest neighbors (KNN) and random forests (RF) models. One notable difference between these count-based representations and the ones derived from the LSTM is that the LSTM encoding vectors are semantic vectors, which can be utilized for running Analogy SQL queries on databases. The bag-of-words based approaches described here do not generate any encodings. Thus, the comparison of the BOW models against the LSTM is only possible at the individual column level, and only for the DDI prediction task, thereby greatly limiting their usage.

Besides the above bag-of-words models, we additionally implement a random classifier as a baseline to demonstrate how much "learning" is truly happening for each of our models in the presence of the skewness in the DDI label distribution. Our "Random" classifier first computes the frequency distribution of each class label using the training data. Then, for each drug pair in the validation/testing set, it randomly samples a class label using the frequency distribution computed above and compares the sampled label against the ground truth label for that pair to compute the accuracy. Note that this classifier does not use any column-specific information, and hence the performance is the same for all columns on a particular data partition. For each scenario, we run multiple simulations of the "Random" classifier and report the mean of the results.
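The bag-of-words pair representation described above can be sketched as follows; the toy column texts and drug identifiers are illustrative, not DrugBank data:

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Toy per-drug text for one column (illustrative only).
column_text = {
    "DB001": "inhibits cyp3a4 metabolism",
    "DB002": "induces cyp3a4 clearance",
}
vec = CountVectorizer()
vec.fit(column_text.values())  # in the paper, the vocabulary matches the LSTM models

def pair_vector(d1, d2):
    """Represent a drug pair as the sum of the two drugs' count vectors."""
    counts = vec.transform([column_text[d1], column_text[d2]])
    return np.asarray(counts.sum(axis=0)).ravel()

x = pair_vector("DB001", "DB002")  # input row for the KNN / RF baselines
```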
We conduct a hyper-parameter search for each column independently on the BOW baselines as well as the LSTM model. For the BOW baselines, we use 3-fold cross validation on the concatenated training and validation data to select the best hyper-parameters. We tune the value of K for the KNN model, while the number of estimators, the max tree depth and class-weighted v/s non-weighted loss are varied for the RF model. For the LSTM model, cross validation is too expensive, and hence each instance of the model is trained on the Training set and the best hyper-parameters are selected based on the performance on the Validation set using Accuracy as the scoring metric. During the hyper-parameter search, we vary the token embedding dimension, the number of layers, the number of hidden units within the LSTM cell, and the number of tokens considered.

The best way to verify the efficacy of the Analogy SQL query [5] for the drug database setting is a controlled user study involving a group of subject matter experts. However, such a study is very costly, time-consuming and hard to set up due to the sheer lack of experts. Thus, in the absence of a user study, we attempt to design an automated simulation and evaluation approach for Analogy SQL under the supervised setting. The main Analogy SQL query we study is [A : B :: A : (D?)]: given the drugs A & B, find other drugs (D) in the database that interact with drug A in a similar way as the drug A interacts with drug B. We lever the labelled drug pairs, already partitioned into validation and testing sets, to simulate such queries using Algorithm 1, and evaluate the qualitative performance using the strategy described in Section 4.6.2.
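The 3-fold cross-validated search over the BOW baselines described above can be sketched with scikit-learn's GridSearchCV; the grid values and the synthetic data below are illustrative assumptions, not the paper's actual settings:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Synthetic stand-in for the pair count vectors and DDI labels.
X = np.random.RandomState(0).rand(60, 5)
y = np.arange(60) % 3

# KNN: tune K; RF: tune estimators, depth, class-weighted vs non-weighted loss.
knn = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]},
                   cv=3, scoring="accuracy")
rf = GridSearchCV(RandomForestClassifier(random_state=0),
                  {"n_estimators": [50, 100],
                   "max_depth": [5, None],
                   "class_weight": ["balanced", None]},
                  cv=3, scoring="accuracy")
knn.fit(X, y)
rf.fit(X, y)
```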
Our classification task is a one-out-of-86 class classification problem for each drug pair, where we have ground-truth DDI labels [51]. The qualitative performance of the classifiers is evaluated by computing the Accuracy (A), Macro F1 (M) and Weighted F1 (W) metrics on the Validation and Testing sets of the respective data partitioning schemes (Section 4.2), using the scikit-learn.metrics package. The metrics are defined as follows:

$A = \frac{\text{Correct Predictions}}{\text{Total Predictions}} \qquad M = \frac{1}{C}\sum_{c=1}^{C} F1_c \qquad W = \sum_{c=1}^{C} \frac{N_c}{N}\, F1_c$

Algorithm 1
Simulation of [A : B :: A : (D?)]
Require: Fully initialized database
Require: Input DDI pairs of Val/Test partition
for (D1, D2) with label L in VAL/TEST DDI pairs do
    C = count how many times D1 interacts with another drug (except D2) having interaction type L
    if C == 0 then
        Skip simulation with (D1, D2)
    else
        Use (D1, D2) to frame the analogy [D1 : D2 :: D1 : (D3?)] using Analogy SQL (Section 3.2)
        Execute the query on the DrugBankFullClean table using the specific column encodings and fetch the list of D3 as D3List
        Compute the count of correct drugs (M) in D3List:
            SELECT COUNT(*) FROM DDITypeInfoTable
            WHERE (LABEL = L) AND (drug1 = 'D1' AND drug2 IN D3List);
        Precision@K = M / K
    end if
end for

where $F1_c$ denotes the F1 score of class $c$, with $C$ classes overall, $N_c$ denotes the number of samples in class $c$, and $N = \sum_{c=1}^{C} N_c$. The Precision (P), Recall (R) and F1 score for a single class are defined as:

$P_c = \frac{TP_c}{TP_c + FP_c} \qquad R_c = \frac{TP_c}{TP_c + FN_c} \qquad F1_c = \frac{2 \cdot P_c \cdot R_c}{P_c + R_c}$

where TP, FP, FN denote True Positives, False Positives and False Negatives respectively.

The ground truth DDI labels for all drug pairs are stored in the DDITypeInfoTable during the database initialization step (Section 4.1.3), with the following column names: (drug1, drug2, label). Our understanding of the data shared by Ryu et al. [51] prompts us to assume that the order of the drugs in a drug pair is important and should be retained during the simulation and evaluation steps. Thus, the triple (A, B, L) for the drug pair (A, B) with the DDI label L does not necessarily imply that the triple (B, A, L) is valid, which leads us to design a more conservative evaluation strategy. The candidate list of answer drugs (D3List) is generated using only the column encodings of drugs, as described in Steps 6 & 7 of Algorithm 1. The SQL query in Step 8 of Algorithm 1 is used to compute how many of the drugs, obtained as part of the Analogy SQL query in Step 7, interact with the query drug D1 in the same way as D1 interacts with D2 (i.e., have the same interaction label L).
The order of the individual drugs is maintained by assigning them to the appropriate column names (drug1 v/s drug2) in the SQL query of Step 8 in Algorithm 1. Additionally, it is possible that our Analogy SQL query may retrieve some answer drugs in D3List which interact with D1 via a label L′, such that the labels L and L′ may be closely related in terms of the underlying biological condition. However, if such a pair is absent from the ground truth DDI pairs, we consider that pair an incorrect result during Precision computation, due to our lack of subject matter expertise to systematically judge the relatedness of the labels L and L′.

We have outlined two possible data partitioning approaches in Section 4.2 that we utilize to demonstrate the efficacy of our approach. For each partitioning scheme, we present the results for the classification as well as the Analogy SQL task. Note that the BOW classifier models are primarily used to evaluate how well our DNN model learned to predict the DDI relationship between drug pairs. Since our data has significant skew in the label distribution, we use Accuracy (A), Macro F1 (M) and Weighted F1 (W) to compare the qualitative performance of the classifiers. We assume that the higher the classification accuracy of the proposed Bi-LSTM model, the better the quality of the DDI task-specific embeddings generated by the model, which should also be reflected in the performance of the Analogy SQL queries through an improved Precision@K value. We train and evaluate all the models using the same data partitions, while varying the minimum word occurrence count value between 1 and 2. Note that a minimum occurrence count value of 1 indicates that no word is pruned, i.e., all unique tokens have been retained in the final vocabulary. Table 2 lists the effective size of the vocabulary for these different count threshold values for the respective data partitioning approaches.
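The three reported metrics (Accuracy, Macro F1 and Weighted F1, as defined in Section 4.6.1) can be computed directly with scikit-learn; the toy label vectors below are illustrative:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy predictions over 3 classes (illustrative only).
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

A = accuracy_score(y_true, y_pred)                  # fraction of correct predictions
M = f1_score(y_true, y_pred, average="macro")       # unweighted mean of per-class F1
W = f1_score(y_true, y_pred, average="weighted")    # support-weighted mean of per-class F1
```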
In this section, we discuss the empirical results for the DDI pair partitioning strategy described in Section 4.2.1. This data partitioning strategy follows the general approach of evaluating classifiers in the machine learning domain, where the training, validation and testing sets have similar distributions of class labels, thereby making classifier performance comparisons much more straightforward. The performance of the DNN model as well as the other baseline models is presented in Table 3 and Table 4 for the minimum word occurrence count values of 1 and 2 respectively. The Random label selection algorithm does not depend on the text content of the corresponding column, and hence the performance metrics are the same across all columns for the validation and testing sets respectively. The gap in performance between the Random label selection strategy and every other learning-based strategy clearly demonstrates the performance benefits of using a supervised learning approach for the DDI prediction task. We observe that the DNN model consistently outperforms all the baseline models for each of the columns on both the validation and testing sets. The DNN model has approximately a 6x improvement in accuracy for columns like ATC codes, categories and description when compared to the random label selection algorithm. After the DNN model, the best performing classifier is the BOW-based Random Forest (RF) model. The RF model outperforms the KNN model, but RF consistently under-performs for all columns when compared to the DNN model on the validation and testing sets. However, RF is very competitive for the categories column, where it is about 7% poorer in terms of accuracy on the testing set compared to the DNN model for the minimum word count values of 1 and 2.
The DNN model performs better than the RF model by at least 7% and at most 14-17% on the testing data across columns for different word count values. Additionally, we observe that the performance gap of the DNN model on the testing set between the minimum word occurrence count values of 1 and 2 is very small across the columns, although the size of the vocabulary is very different for those minimum count values. This may be due to the fact that the common words, which occur more than once and hence are part of the vocabulary, influence the DDI prediction performance much more than the rare words that get pruned for a minimum count value of 2. Note that the BOW models do not generate any "semantic encoding vector", as the input to the models is a simple count vector of the words, and hence they can be used only for the classification task. The DNN model performs better than the best performing RF-based BOW model for each of the respective columns, which implies that the column encodings generated by our DNN model are able to capture, at least to some extent, meaningful information about the drug-drug interaction relationship, and may be utilized for the Analogy SQL task.

We then simulate Analogy SQL on the validation and testing sets using Algorithm 1, with the respective column encodings generated by the trained DNN model. We vary "K" from 1 to 10 while computing the Precision@K for the Analogy SQL queries using the methodology defined in Section 4.6.2, and report the mean value obtained after simulating all the queries using the respective partitions. Figure 2 and Figure 3 show the performance of the Analogy SQL queries on the corresponding data using column encodings obtained by varying the minimum word occurrence count value to 1 and 2 respectively. For each column, the Precision@K is highest for K = 1, but then gradually decreases as K is increased to 10. This means that the Analogy SQL queries are quite often able to fetch the top-most of the analogous drugs for the query pair correctly using our approach described in Algorithm 1. However, the Precision@K value generally decreases as K is gradually increased, implying that we do not consistently perform well for all top-K scenarios. It is important to point out that some drugs may not have a total of K interactions of type L (the query interaction label), which can contribute to this decrease in performance. Due to the lack of a user study by subject matter experts, it is hard to determine how many of those "incorrect" answer drugs could be approximately "semantically" related in terms of the interaction label. We observe that the columns, like ATC codes, categories and description, for which the DNN model achieves very high DDI prediction Accuracy values, show consistently higher performance in terms of Precision@K for the Analogy SQL simulations on both the validation and testing sets. In contrast, some of the poorer performing columns, in terms of classification accuracy, like merged class information and protein binding, show very low Precision@K values for the Analogy SQL simulations. Thus, our original assumption, that a higher classification accuracy of the DNN model would lead to better quality of the DDI task-specific embeddings, resulting in better performance on the Analogy SQL task, appears to hold in this case.
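The Precision@K computation used throughout these simulations (Steps 8-9 of Algorithm 1) can be sketched as follows; the toy retrieved list and ground-truth set are illustrative:

```python
def precision_at_k(d3_list, correct_set, k):
    """Precision@K for one analogy query: the fraction of the top-K
    retrieved drugs that truly interact with D1 under the query label L."""
    top_k = d3_list[:k]
    m = sum(1 for d in top_k if d in correct_set)  # M in Algorithm 1
    return m / k

# Toy query: 3 of the top-5 retrieved drugs are correct.
retrieved = ["DB7", "DB3", "DB9", "DB2", "DB5"]
truth = {"DB7", "DB9", "DB5", "DB1"}
```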
In this section, we present the simulation results for a fixed percentage of the Drug held-off partitioning strategy described in Section 4.2.2. This is a more realistic end-user scenario, where we train all our models using the ground truth label information of existing drugs, and then use such trained models to predict DDIs with newly added drugs [58]. However, evaluating this setup is very tricky, as the class label distribution is no longer the same across the training, validation and testing sets. Under this setup, the training and validation sets have approximately the same class label distribution, but the testing set has a different label distribution, as observed in Table 1.

For brevity, we present the classifier performance results for the 2% drug held-off setting in Table 5, and the performance of the Analogy SQL queries on the 1%, 2% and 3% drug held-off partitionings in Figure 4, for a minimum word count value of 2. In terms of testing accuracy, we observe that the Random Forest model performs better than the DNN model for the categories and the description columns, while the DNN model performs better for ATC codes, merged class and protein binding, whereas the performance on the target action column is almost the same. We have observed similar changes in performance between the Random Forest and DNN models for some of the columns with other drug held-off percentage values as well. This observation leads us to conclude that under different drug held-off settings, our proposed DNN model may not always be able to generate high quality encodings for certain columns, as indicated by a lower Accuracy score than the Random Forest model. Interestingly, the Random Forest model can still be used for achieving better DDI prediction accuracy, if that is the only requirement.
For different percentages of drug held-off partitioning, while the overall performance trends for the Analogy SQL query appear very similar, there is a significant dip in performance, specifically for columns like protein binding and merged class information. Note that the DNN classifier performance on the testing set has also degraded for these columns compared to the performance on the validation set, which explains this relatively poor performance. Also, the Precision@K performance on the Validation set is slightly better (by 0.1) than that on the Testing set in some cases (notice the difference in y-axis values). Nonetheless, our results on the 2% data (testing set) are very encouraging, and we believe this can lead to improved results with additional future investigations.
Our approach is built to lever gold-standard label information to generate task-specific supervised database column encodings. We observe that for some columns (like ATC codes and categories) our approach achieves very high qualitative performance on both the classification as well as the Analogy SQL task for the DDI Pair partitioning approach, while the performance is competitive for the Drug Held-off partitioning approach. The performance drops drastically for some other columns (like merged class information and protein binding) across all settings. This drop in performance could be due either to a lack of adequate task-specific information in these columns, or to our tokenization and information encoding scheme being ineffective for that specific column. We think that a column-specific textification and tokenization scheme, which can balance between human readability and sub-word based information reuse [17], along with more advanced encoders (like LSTM with Attention [55, 59]), may improve the performance for some of these under-performing columns.
| Column          | Split      | Random                  | KNN                     | RF                      | DNN                     |
|-----------------|------------|-------------------------|-------------------------|-------------------------|-------------------------|
| ATC Codes       | Validation | 15.81 / 0.0117 / 0.1581 | 64.08 / 0.4391 / 0.6285 | 76.80 / 0.5913 / 0.7619 | 92.15 / 0.7828 / 0.9203 |
| ATC Codes       | Testing    | 15.70 / 0.0115 / 0.1569 | 64.20 / 0.4119 / 0.6290 | 76.79 / 0.5772 / 0.7611 | 91.54 / 0.8051 / 0.9142 |
| Categories      | Validation | 15.81 / 0.0117 / 0.1581 | 79.92 / 0.6601 / 0.7956 | 89.85 / 0.7893 / 0.8964 | 96.88 / 0.8870 / 0.9686 |
| Categories      | Testing    | 15.70 / 0.0115 / 0.1569 | 80.17 / 0.7264 / 0.7980 | 89.95 / 0.8258 / 0.8975 | 97.08 / 0.9332 / 0.9707 |
| Description     | Validation | 15.81 / 0.0117 / 0.1581 | 58.76 / 0.3418 / 0.5647 | 79.67 / 0.5837 / 0.7898 | 96.35 / 0.8978 / 0.9633 |
| Description     | Testing    | 15.70 / 0.0115 / 0.1569 | 58.61 / 0.3350 / 0.5630 | 79.33 / 0.5973 / 0.7864 | 96.52 / 0.9367 / 0.9650 |
| Merged Class    | Validation | 15.81 / 0.0117 / 0.1581 | 52.99 / 0.3273 / 0.5083 | 60.47 / 0.4259 / 0.5926 | 72.79 / 0.5925 / 0.7223 |
| Merged Class    | Testing    | 15.70 / 0.0115 / 0.1569 | 52.29 / 0.3379 / 0.5016 | 60.22 / 0.4358 / 0.5904 | 72.09 / 0.6035 / 0.7154 |
| Protein Binding | Validation | 15.81 / 0.0117 / 0.1581 | 44.58 / 0.1437 / 0.4260 | 50.56 / 0.1536 / 0.4625 | 61.66 / 0.3641 / 0.5972 |
| Protein Binding | Testing    | 15.70 / 0.0115 / 0.1569 | 43.99 / 0.1324 / 0.4194 | 50.37 / 0.1656 / 0.4586 | 61.26 / 0.3841 / 0.5926 |
| Target Action   | Validation | 15.81 / 0.0117 / 0.1581 | 65.16 / 0.4914 / 0.6437 | 74.50 / 0.5901 / 0.7388 | 89.42 / 0.7822 / 0.8931 |
| Target Action   | Testing    | 15.70 / 0.0115 / 0.1569 | 65.29 / 0.4846 / 0.6444 | 74.40 / 0.6125 / 0.7373 | 88.99 / 0.8027 / 0.8887 |

Table 3: Qualitative performance comparison of DDI prediction by DNN and other baseline models on Validation and Testing sets for DDI Pair partitioning (Section 4.2.1) using min-word-count = 1. Each cell shows A / M / W.
| Column          | Split      | Random                  | KNN                     | RF                      | DNN                     |
|-----------------|------------|-------------------------|-------------------------|-------------------------|-------------------------|
| ATC Codes       | Validation | 15.81 / 0.0117 / 0.1581 | 64.27 / 0.4369 / 0.6299 | 76.80 / 0.5921 / 0.7620 | 91.89 / 0.8069 / 0.9176 |
| ATC Codes       | Testing    | 15.70 / 0.0115 / 0.1569 | 64.36 / 0.4151 / 0.6299 | 76.50 / 0.5742 / 0.7581 | 91.75 / 0.8184 / 0.9160 |
| Categories      | Validation | 15.81 / 0.0117 / 0.1581 | 79.78 / 0.6569 / 0.7942 | 89.97 / 0.7924 / 0.8977 | 96.89 / 0.9111 / 0.9686 |
| Categories      | Testing    | 15.70 / 0.0115 / 0.1569 | 80.14 / 0.7124 / 0.7976 | 90.10 / 0.8366 / 0.8993 | 96.85 / 0.9328 / 0.9683 |
| Description     | Validation | 15.81 / 0.0117 / 0.1581 | 58.55 / 0.3424 / 0.5628 | 80.09 / 0.6056 / 0.7947 | 96.68 / 0.9102 / 0.9667 |
| Description     | Testing    | 15.70 / 0.0115 / 0.1569 | 58.48 / 0.3517 / 0.5624 | 79.87 / 0.6196 / 0.7924 | 96.67 / 0.9300 / 0.9665 |
| Merged Class    | Validation | 15.81 / 0.0117 / 0.1581 | 52.31 / 0.3277 / 0.5037 | 59.77 / 0.4268 / 0.5865 | 72.23 / 0.5986 / 0.7187 |
| Merged Class    | Testing    | 15.70 / 0.0115 / 0.1569 | 52.03 / 0.3312 / 0.5009 | 59.39 / 0.4284 / 0.5826 | 71.98 / 0.6115 / 0.7162 |
| Protein Binding | Validation | 15.81 / 0.0117 / 0.1581 | 41.99 / 0.0638 / 0.3666 | 49.03 / 0.1486 / 0.4459 | 61.57 / 0.3647 / 0.5964 |
| Protein Binding | Testing    | 15.70 / 0.0115 / 0.1569 | 41.91 / 0.0665 / 0.3684 | 48.89 / 0.1526 / 0.4424 | 60.81 / 0.3739 / 0.5869 |
| Target Action   | Validation | 15.81 / 0.0117 / 0.1581 | 65.60 / 0.4901 / 0.6484 | 74.32 / 0.5883 / 0.7370 | 89.15 / 0.8062 / 0.8906 |
| Target Action   | Testing    | 15.70 / 0.0115 / 0.1569 | 65.29 / 0.4876 / 0.6447 | 74.18 / 0.6173 / 0.7354 | 88.81 / 0.8074 / 0.8868 |

Table 4: Qualitative performance comparison of DDI prediction by DNN and other baseline models on Validation and Testing sets for DDI Pair partitioning (Section 4.2.1) with min-word-count = 2. Each cell shows A / M / W.

Figure 2: Precision@K for Analogy queries on respective data for DDI pair partitioning (Section 4.2.1) for min-word-count = 1.

Figure 3: Precision@K for Analogy queries on respective data for DDI pair partitioning (Section 4.2.1) for min-word-count = 2.
| Column          | Split      | Random                  | KNN                     | RF                      | DNN                     |
|-----------------|------------|-------------------------|-------------------------|-------------------------|-------------------------|
| ATC Codes       | Validation | 15.91 / 0.0117 / 0.1591 | 66.95 / 0.4759 / 0.6620 | 77.11 / 0.6123 / 0.7659 | 91.92 / 0.8045 / 0.9176 |
| ATC Codes       | Testing    | 14.37 / 0.0116 / 0.1380 | 53.15 / 0.3628 / 0.5214 | 62.37 / 0.5288 / 0.6028 | 65.53 / 0.4901 / 0.6385 |
| Categories      | Validation | 15.91 / 0.0117 / 0.1591 | 80.18 / 0.7276 / 0.7980 | 90.64 / 0.8527 / 0.9044 | 96.22 / 0.9250 / 0.9620 |
| Categories      | Testing    | 14.37 / 0.0116 / 0.1380 | 75.93 / 0.7071 / 0.7533 | 81.85 / 0.7496 / 0.8111 | 78.15 / 0.7083 / 0.7757 |
| Description     | Validation | 15.91 / 0.0117 / 0.1591 | 58.89 / 0.3702 / 0.5663 | 80.73 / 0.6301 / 0.8013 | 96.17 / 0.9363 / 0.9616 |
| Description     | Testing    | 14.37 / 0.0116 / 0.1380 | 53.77 / 0.3425 / 0.5126 | 68.73 / 0.5656 / 0.6613 | 65.13 / 0.4814 / 0.6409 |
| Merged Class    | Validation | 15.91 / 0.0117 / 0.1591 | 54.95 / 0.4103 / 0.5420 | 60.76 / 0.4461 / 0.5950 | 73.43 / 0.6053 / 0.7294 |
| Merged Class    | Testing    | 14.37 / 0.0116 / 0.1380 | 32.40 / 0.2362 / 0.3278 | 36.84 / 0.2672 / 0.3655 | 41.18 / 0.3077 / 0.4092 |
| Protein Binding | Validation | 15.91 / 0.0117 / 0.1591 | 49.15 / 0.1336 / 0.4425 | 49.36 / 0.1556 / 0.4465 | 62.30 / 0.3620 / 0.6009 |
| Protein Binding | Testing    | 14.37 / 0.0116 / 0.1380 | 34.71 / 0.1106 / 0.2831 | 36.06 / 0.1276 / 0.2875 | 41.26 / 0.1582 / 0.3570 |
| Target Action   | Validation | 15.91 / 0.0117 / 0.1591 | 67.19 / 0.5289 / 0.6630 | 75.10 / 0.6459 / 0.7448 | 89.38 / 0.8342 / 0.8926 |
| Target Action   | Testing    | 14.37 / 0.0116 / 0.1380 | 57.61 / 0.4293 / 0.5584 | 61.34 / 0.4985 / 0.5968 | 61.39 / 0.4579 / 0.5986 |

Table 5: Qualitative performance comparison of DDI prediction by DNN and other baseline models on Validation and Testing sets for 2% Drug Held-off partitioning (Section 4.2.2) with min-word-count = 2. Each cell shows A / M / W.

Figure 4 panels: (a) 1% Data - Validation, (b) 1% Data - Testing, (c) 2% Data - Validation, (d) 2% Data - Testing, (e) 3% Data - Validation, (f) 3% Data - Testing.
Figure 4: Precision@K for Analogy queries on respective data for different percentages of Drug Held-off partitioning (Section 4.2.2) for min-word-count = 2.
An obvious criticism of our approach would be that it is focused on utilizing a single column's information and hence cannot utilize information from across columns. We argue that each column often has different data types as well as diverse information, and may contain a different perspective on semantic similarity, which our approach is able to capture better. For example, the ATC codes column contains codes, each of which is of fixed length, and the codes capture implicit clustering information about drugs, in terms of the organ/system of action, besides their therapeutic intent and other characteristics. This column is interpretable only to an expert who is familiar with the ATC coding system. On the other hand, the description column contains human-readable free text, making it much more interpretable to ordinary users, but it may contain too many details that are not necessarily relevant. Thus, our DNN model for the ATC codes column tries to capture the implicit clustering-based interaction information, while the DNN model for the description column tries to capture the more explicit text-based (i.e., medical concept based) interaction information. Clearly, when the encodings of these two columns are separately used for the Analogy SQL task, they provide very different perspectives on the drug-drug interaction prediction task, which may be useful for better debugging during the execution of end-user scenarios.

While there is room for further improvement in the DDI prediction and Analogy SQL performance for some of the columns, many other columns of the drug database currently remain unexplored. Note that the drug database has a wide variety of data types, like numeric and chemical structure information, in its columns.
One approach could be to follow the textification and tokenization strategy proposed by Bordawekar et al. [5] for all columns, i.e., numeric, categorical and text-based columns. However, for a column like chemical structure, which contains a graph representing the chemical structure of the drug (often stored in SMILES format in databases), it may be a better idea to encode the information through specialized models like graph convolutional networks [61] instead of using the textification approach [5]. This requires data type specific encoders for individual columns (graph v/s text encoders), thereby making the models more complex as well as significantly increasing the model maintenance cost, compared to the current simplistic text encoder based strategy that follows the broad textification approach proposed by Bordawekar et al. [5]. However, we think that such data type specific encoders would be able to better capture the latent information of the specific data type, despite the increase in model cost and complexity, and hence they need future exploration.

The Analogy SQL task that we have studied is [A : B :: A : (D?)]. However, another practical and even more challenging analogy query would be [A : B :: C : (D?)], where we seek a drug D which interacts with drug C in the same way as A interacts with B. Supervised evaluation of this scenario through simulation is quite challenging, as we need a systematic approach to generate the query triples (A, B, C) and to ensure that we have a consistent definition of the correct answer drug D for each query triple (A, B, C). In this scenario, it would be ideal to conduct a user study of our framework by recruiting subject matter experts and enabling them, via a web portal, to test their hypotheses using our column encoding based Analogy SQL queries. Moreover, since our Analogy SQL queries are approximate by nature, there will always be some errors in the final result.
A portion of this error could be due to answer drugs being retrieved from a class label (L′) which is semantically related to the query interaction label (L); we currently flag those as incorrect results, due to the lack of a principled way to compute the label similarities. A user study by subject matter experts could potentially reveal this error source in our approach, giving us more insight and possible scope for further improvement. However, recruiting experts and performing such a user study is still challenging in terms of cost and time, which we could not address as part of this work.

One shortcoming of our current approach is that we assume all the tables in the database are static while generating the corresponding column encodings. However, database updates may be quite frequent [5] and can typically be of two types in our setting, viz: 1) updates in drug information (i.e., database column values), and 2) updates in drug-drug interaction information (old interactions removed, new interactions added, etc.). As the information gets modified in different tables, the overall semantic context of the columns, captured by the supervised column encodings, should also get updated accordingly, similar to the updates proposed by Bordawekar et al. [5] for the unsupervised model. Unfortunately, modifying supervised encodings for database updates could be a non-trivial task. In our supervised setting, intuitively, a newly added interaction of a drug pair may be easier to handle: incrementally re-train the DNN model first, using the new interactions alongside a subset of existing interactions for the corresponding drugs, and then regenerate the column encodings for all drugs using the updated DNN model. But a deleted interaction between any drug pair is very difficult to incrementally "unlearn", and it is even harder to evaluate the deletion.
Similarly, any update to the column values of a drug will require re-training our DNN model on at least those interaction pairs in which the drug with updated information appears. Since we have one model per column, updates to individual columns mean re-training only the column-specific models, which may be relatively inexpensive provided the column data and the number of updated columns are both small. All of these are intuitive approaches that will need additional empirical validation. Note that this assumption of static information may be less problematic in our drug database scenario, due to the relatively low frequency of updates in practice, but it would be a major issue for other medical databases, such as the EHR data of hospital patients (e.g., MIMIC-III [25]), where all types of database updates are very frequent. Thus, designing task-specific supervised column encodings for databases with rapid and diverse updates is a very challenging problem, and is worth a thorough investigation in future work.
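The replay-style incremental re-training intuition above can be illustrated with a toy model. Everything in this sketch is a hypothetical stand-in: a simple logistic model replaces the actual DNN, and all feature sizes, batch sizes, and learning rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the DDI model: logistic regression over concatenated
# drug-pair features (the real system would fine-tune the DNN instead).
w = rng.normal(scale=0.1, size=16)

def sgd_step(X, y, w, lr=0.1):
    # One gradient step of logistic-loss SGD on a mini-batch.
    p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted interaction probability
    return w - lr * X.T @ (p - y) / len(y)

# Existing training pairs, plus a batch of newly added interactions.
X_old, y_old = rng.normal(size=(200, 16)), rng.integers(0, 2, 200).astype(float)
X_new, y_new = rng.normal(size=(20, 16)), np.ones(20)

# Incremental update: train on the new pairs mixed with a replay subset
# of old pairs, so the model is less likely to forget prior interactions.
for _ in range(50):
    replay = rng.choice(len(X_old), size=20, replace=False)
    X = np.vstack([X_new, X_old[replay]])
    y = np.concatenate([y_new, y_old[replay]])
    w = sgd_step(X, y, w)
```

After such an update, the column encodings for the affected drugs would be regenerated from the updated model; handling deletions, as noted above, has no equally simple counterpart.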
CONCLUSION
We study the task of generating semantic-information-preserving supervised column encodings for the multi-token text columns of a relational database. We use drug-drug interaction (DDI) prediction as a case study, in which we learn the column encodings of a pair of rows of the drug information table through supervised learning, using the ground-truth DDI label for the drug pair. We propose a DNN model for the text-based DDI prediction task that has minimal text-processing overhead compared to previous works, and our model achieves very high DDI prediction accuracy. Additionally, we use these column encodings to simulate Analogy SQL queries on the relational database and propose an evaluation strategy that demonstrates the efficacy of the column encodings for the Analogy SQL task.
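The encoder pattern summarized here — a Bi-LSTM that maps the tokenized text of one column to a fixed-size vector, with a pair of such encodings feeding a DDI classifier — can be sketched as follows. This is a minimal illustration with made-up sizes (vocabulary, embedding width, hidden size, label count), not our exact architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class ColumnEncoder(nn.Module):
    """Bi-LSTM encoder: token ids of one text column -> fixed-size vector.
    All sizes are illustrative placeholders."""
    def __init__(self, vocab_size=1000, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        _, (h, _) = self.lstm(self.emb(tokens))
        # Concatenate final forward and backward hidden states.
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * hidden)

class DDIClassifier(nn.Module):
    """Encode the same column for both drugs, classify the interaction."""
    def __init__(self, n_labels=10, hidden=64):
        super().__init__()
        self.encoder = ColumnEncoder(hidden=hidden)
        self.head = nn.Linear(4 * hidden, n_labels)

    def forward(self, tok_a, tok_b):
        za, zb = self.encoder(tok_a), self.encoder(tok_b)
        return self.head(torch.cat([za, zb], dim=-1))

model = DDIClassifier()
a = torch.randint(1, 1000, (2, 10))  # two drug pairs, 10 tokens per column
b = torch.randint(1, 1000, (2, 10))
print(model(a, b).shape)             # (2, n_labels) logits
```

The intermediate `za`/`zb` vectors are what serve as the supervised column encodings reused for the Analogy SQL queries.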
REFERENCES
[1] S. Arora and S. Bedathur. On embeddings in relational databases. arXiv preprint arXiv:2005.06437, 2020.
[2] M. Asada, M. Miwa, and Y. Sasaki. Enhancing drug-drug interaction extraction from texts by molecular structure information. arXiv preprint arXiv:1805.05593, 2018.
[3] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[4] F. Biessmann, D. Salinas, S. Schelter, P. Schmidt, and D. Lange. "Deep" learning for missing value imputation in tables with non-numerical data. In CIKM, pages 2017–2025, 2018.
[5] R. Bordawekar, B. Bandyopadhyay, and O. Shmueli. Cognitive database: A step towards endowing relational databases with artificial intelligence capabilities. CoRR, abs/1712.07199, 2017.
[6] R. Bordawekar and O. Shmueli. Enabling cognitive intelligence queries in relational databases using low-dimensional word embeddings. CoRR, abs/1603.07185, 2016.
[7] R. Bordawekar and O. Shmueli. Using word embedding to enable semantic queries in relational databases. In Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning, pages 1–4, 2017.
[8] R. Bordawekar and O. Shmueli. Exploiting latent information in relational databases via word embedding and application to degrees of disclosure. In CIDR, 2019.
[9] R. Cappuzzo, P. Papotti, and S. Thirumuruganathan. Local embeddings for relational data integration. arXiv preprint arXiv:1909.01120, 2019.
[10] Ohio Supercomputer Center. Pitzer supercomputer. http://osc.edu/ark:/19495/hpc56htp, 2018.
[11] R. Cutlip and J. Medicke. Integrated Solutions with DB2. Addison Wesley Longman Publishing Co., Inc., USA, 2003.
[12] A. M. Dai, C. Olah, and Q. V. Le. Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998, 2015.
[13] Y. Deng, X. Xu, Y. Qiu, J. Xia, W. Zhang, and S. Liu. A multimodal deep learning framework for predicting drug-drug interaction events. Bioinformatics, May 2020. btaa501.
[14] M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606–1615, 2015.
[15] R. C. Fernandez and S. Madden. Termite: A system for tunneling through heterogeneous data. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM '19, New York, NY, USA, 2019. Association for Computing Machinery.
[16] A. Fokoue, M. Sadoghi, O. Hassanzadeh, and P. Zhang. Predicting drug-drug interactions through large-scale similarity-based link prediction. In European Semantic Web Conference, pages 774–789. Springer, 2016.
[17] P. Gage. A new algorithm for data compression. C Users Journal, 12(2):23–38, Feb. 1994.
[18] K. M. Giacomini, R. M. Krauss, D. M. Roden, M. Eichelbaum, M. R. Hayden, and Y. Nakamura. When good drugs go bad. Nature, 446(7139):975–977, 2007.
[19] M. Günther, M. Thiele, and W. Lehner. Fast approximated nearest neighbor joins for relational database systems. In T. Grust, F. Naumann, A. Böhm, W. Lehner, T. Härder, E. Rahm, A. Heuer, M. Klettke, and H. Meyer, editors, BTW 2019, pages 225–244. Gesellschaft für Informatik, Bonn, 2019.
[20] Y. Goldberg and O. Levy. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. CoRR, abs/1402.3722, 2014.
[21] M. Günther. Freddy: Fast word embeddings in database systems. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, pages 1817–1819, New York, NY, USA, 2018. Association for Computing Machinery.
[22] M. Günther, M. Thiele, and W. Lehner. Retro: Relation retrofitting for in-database machine learning on textual data. arXiv preprint arXiv:1911.12674, 2019.
[23] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[24] J. Howard and S. Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, 2018.
[25] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
[26] V. Kashyap and A. Sheth. Semantic and schematic similarities between database objects: A context-based approach. The VLDB Journal, 5:276–304, 1996.
[27] A. Kastrin, P. Ferk, and B. Leskošek. Predicting potential drug-drug interactions on topological and semantic similarity features using statistical learning. PLoS ONE, 13(5), 2018.
[28] Y. Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.
[29] J. Kuang, Y. Cao, J. Zheng, X. He, M. Gao, and A. Zhou. Improving neural relation extraction with implicit mutual relations. arXiv preprint arXiv:1907.05333, 2019.
[30] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.
[31] O. Levy and Y. Goldberg. Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 171–180, 2014.
[32] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185, 2014.
[33] L. Lim, H. Wang, and M. Wang. Semantic queries by example. In Proceedings of the 16th International Conference on Extending Database Technology, pages 347–358, 2013.
[34] Y. Lin, S. Shen, Z. Liu, H. Luan, and M. Sun. Neural relation extraction with selective attention over instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2124–2133, 2016.
[35] P. Liu, X. Qiu, and X. Huang. Recurrent neural network for text classification with multi-task learning. CoRR, abs/1605.05101, 2016.
[36] E. Loper and S. Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 2002.
[37] T. Ma, C. Xiao, J. Zhou, and F. Wang. Drug similarity integration through attentive multi-view graph auto-encoders. arXiv preprint arXiv:1804.10850, 2018.
[38] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[39] T. Mikolov, Q. V. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168, 2013.
[40] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[41] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273, 2013.
[42] J. L. Neves and R. Bordawekar. Demonstrating AI-enabled SQL queries over relational data using a cognitive database. 2018.
[43] D. Newman-Griffis, A. M. Lai, and E. Fosler-Lussier. Insights into analogy completion from the biomedical domain. arXiv preprint arXiv:1706.02241, 2017.
[44] Z. Pan and J. Heflin. DLDB: Extending relational databases to support semantic web queries. Technical report, Lehigh University, Dept. of Computer Science and Electrical Engineering, 2004.
[45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[46] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[47] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
[48] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237, 2018.
[49] G. Rossiello, A. Gliozzo, R. Farrell, N. R. Fauceglia, and M. Glass. Learning relational representations by analogy using hierarchical siamese networks. In NAACL, pages 3235–3245, 2019.
[50] D. E. Rumelhart and A. A. Abrahamson. A model for analogical reasoning. Cognitive Psychology, 5(1):1–28, 1973.
[51] J. Y. Ryu, H. U. Kim, and S. Y. Lee. Deep learning improves prediction of drug-drug and drug-food interactions. Proceedings of the National Academy of Sciences, 115(18):E4304–E4311, 2018.
[52] K. Srinivas, A. Gale, and J. Dolby. Merging datasets through deep learning. CoRR, abs/1809.01604, 2018.
[53] S. Wang and C. D. Manning. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 90–94. Association for Computational Linguistics, 2012.
[54] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, volume 14, pages 1112–1119, 2014.
[55] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.
[56] W. Yin, K. Kann, M. Yu, and H. Schütze. Comparative study of CNN and RNN for natural language processing. CoRR, abs/1702.01923, 2017.
[57] B. Yu, G. Li, K. Sollins, and A. K. H. Tung. Effective keyword-based selection of relational databases. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, pages 139–150, New York, NY, USA, 2007. Association for Computing Machinery.
[58] P. Zhang, F. Wang, J. Hu, and R. Sorrentino. Label propagation prediction of drug-drug interactions based on clinical side effects. Scientific Reports, 5(1):1–10, 2015.
[59] W. Zheng, H. Lin, L. Luo, Z. Zhao, Z. Li, Y. Zhang, Z. Yang, and J. Wang. An attention-based effective neural model for drug-drug interactions extraction. BMC Bioinformatics, 18(1):445, 2017.
[60] P. Zhou, Z. Qi, S. Zheng, J. Xu, H. Bao, and B. Xu. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639, 2016.
[61] M. Zitnik, M. Agrawal, and J. Leskovec. Modeling polypharmacy side effects with graph convolutional networks.