Hybrid Semantic Recommender System for Chemical Compounds
Márcia Barros, André Moitinho, and Francisco M. Couto

LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749–016 Lisboa, Portugal. [email protected]
CENTRA, Departamento de Física, Faculdade de Ciências, Universidade de Lisboa, 1749–016 Lisboa, Portugal
Abstract.
Recommending Chemical Compounds of interest to a particular researcher is a poorly explored field. The few existent datasets with information about the preferences of the researchers use implicit feedback. The lack of Recommender Systems in this particular field presents a challenge for the development of new recommendation models. In this work, we propose a Hybrid recommender model for recommending Chemical Compounds. The model integrates collaborative-filtering algorithms for implicit feedback (Alternating Least Squares (ALS) and Bayesian Personalized Ranking (BPR)) and semantic similarity between the Chemical Compounds in the ChEBI ontology (ONTO). We evaluated the model on an implicit dataset of Chemical Compounds, CheRM. The Hybrid model was able to improve the results of state-of-the-art collaborative-filtering algorithms, especially for Mean Reciprocal Rank, with an increase of 6.7% when comparing the collaborative-filtering ALS and the Hybrid ALS_ONTO.
Keywords:
Recommender System · Implicit feedback · Ontology · Collaborative-Filtering · Semantic similarity.
Introduction

The recommendation of Chemical Compounds of interest for scientific researchers has not been widely explored [9,23]. However, Recommender Systems (RSs) may help in the discovery of compounds, for example, by suggesting items not yet studied by the researchers. One challenge in this field is the lack of available datasets with the preferences of the researchers about the Chemical Compounds for testing the RS. More recently, alternatives have emerged with the development of datasets consisting of data collected from implicit feedback. Unlike what happens with other datasets, for example, MovieLens [6], these datasets do
⋆ This work was supported by the Fundação para a Ciência e Tecnologia (FCT), under LASIGE Strategic Project UID/CEC/00408/2019, UIDB/00408/2020, CENTRA Strategic Project UID/FIS/00099/2019, FCT-funded project PTDC/CCI-BIO/28685/2017, and PhD Scholarship SFRH/BD/128840/2017.

not contain the specific interests of the researchers. Instead, this information is extracted from the activities of the researchers, for example, through the scientific literature [15,3].

Datasets of explicit or implicit feedback require different recommender algorithms, especially because implicit feedback has significant drawbacks, such as the lack of negative feedback and an unbalanced ratio of positive vs unobserved ratings [18,11]. When dealing with implicit feedback datasets, the solution involves applying learning to rank (LtR) approaches. LtR consists in, given a set of items, identifying in which order they should be recommended [17].

The main approaches in RSs are Collaborative-Filtering (CF) and Content-Based (CB) [20]. CF uses the similarity between the ratings of the users, and CB uses the similarity between the features of the items. CF approaches cannot deal with new items or new users in the system, i.e., items and users without ratings (the cold start problem). CB does not suffer from this problem for new items, and that is the main reason Hybrid RSs (CF + CB) exist. One of the tools used by CB is ontologies [27], which are related vocabularies of terms and definitions for a specific field of study [28,2]. Some examples of well-known ontologies are the Chemical Entities of Biological Interest (ChEBI) [7], the Gene Ontology (GO) [4], and the Disease Ontology (DO) [21].

In this paper, we propose a Hybrid recommender model for recommending Chemical Compounds, consisting of a CF module and a CB module.
In the CF module, we tested two algorithms for implicit feedback datasets, Alternating Least Squares (ALS) [8] and Bayesian Personalized Ranking (BPR) [18], separately. In the CB module, we explored the semantic similarity between the compounds in the ChEBI ontology (the ONTO algorithm). The Hybrid model combines ALS + ONTO and BPR + ONTO. The framework developed for this work is available at https://github.com/lasigeBioTM/ChemRecSys.

Related Work

There are a few studies using RSs for recommending Chemical Compounds. [9] describes the use of CF methods for creating a Free-Wilson-like fragment recommender system. [23] use RS techniques for the discovery of new inorganic compounds, applying machine learning to find the similarity between the proposed and the existent compounds.

Next, we describe studies using ontologies for improving the performance of CF algorithms. [12] created a RS for recommending English collections of books in a library. The authors developed PORE, a personal ontology Recommender System, which consists of a personal ontology for each user and the subsequent application of a CF method. They used a standard normalized cosine similarity for finding the similarity between the users. [26] also used an ontology for creating users' profiles, for the domain of books. They calculated the similarity not between the ratings of the users, but based on the interest scores derived from the ontology. The CF method used was k-nearest neighbours. [24] developed a Trust-Semantic Fusion approach, tested on movies and Yahoo! datasets. Their approach incorporates semantic knowledge into the items' primary information, using knowledge from the ontologies. They used the user-based Constrained Pearson Correlation and the user-based Jaccard similarity.

[16] presented a solution for the top@k recommendations specifically for implicit feedback data.
The authors developed SPrank, a semantic path-based ranking. They extracted path-based features of the items from DBpedia and used LtR algorithms to get the rank of the most relevant items. They tested the method on the music and movies domains. [1] developed a new semantic similarity measure, the Inferential Ontology-based Semantic Similarity. The new measure improved the results of a user-based CF approach, using Pearson Correlation for calculating the similarity between the users. The authors tested the approach on the tourism domain. More recently, [14] developed a Hybrid RS tested on the movies domain. The method used Singular Value Decomposition for dimensionality reduction for the item- and user-based CF, and ontologies for item-based semantic similarity, improving the CF results. They do not deal with implicit data.

To the best of our knowledge, our study is the first to use semantic similarity for recommending Chemical Compounds, dealing with implicit data by using state-of-the-art methods (ALS and BPR) and improving the results for the top@k in several evaluation metrics.

Methodology

The proposed model has two modules: CF and CB. Figure 1 shows the general workflow of the model.

Fig. 1: Workflow of the Hybrid recommender model.

The input data used in this model has the format <user, item, rating>. The unrated set represents the items we want to rank to provide the best recommendations in the first positions to a user. The rated set contains the items the users already rated. Since we will split the data into train and test, let us call the rated set the train set and the unrated set the test set. Both train and test sets are the input for the CF and CB modules. Using CF algorithms for implicit feedback datasets, the CF module gives a score for each item in the test set. The CB module uses semantic similarity for providing a score for the items in the test set.
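The workflow just described can be sketched end-to-end: a CF score and a CB (semantic) score per test item, combined by multiplication (as formalized later in Equation 2) and sorted in descending order. A minimal sketch with toy scores and hypothetical ChEBI identifiers, not the paper's actual code:

```python
def hybrid_rank(cf_scores, cb_scores):
    """Combine per-item CF and CB scores and rank items, best first."""
    combined = {item: cf_scores[item] * cb_scores[item] for item in cf_scores}
    return sorted(combined, key=combined.get, reverse=True)

# Toy CF and CB scores for three test-set compounds (hypothetical IDs).
cf = {"CHEBI:15377": 0.9, "CHEBI:16236": 0.4, "CHEBI:17234": 0.7}
cb = {"CHEBI:15377": 0.5, "CHEBI:16236": 0.8, "CHEBI:17234": 0.6}

ranking = hybrid_rank(cf, cb)
print(ranking)  # ['CHEBI:15377', 'CHEBI:17234', 'CHEBI:16236']
```

The multiplicative combination means an item must score well in both modules to reach the first positions of the recommendation list.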
In the last step, the scores from the CF and CB modules are combined and sorted in descending order.

For the CF module, we selected state-of-the-art CF recommender algorithms for implicit data, ALS [8] and BPR [18]. ALS is a latent factor algorithm that addresses the confidence of a user-item pair rating. BPR is also a latent factor algorithm, but it is more appropriate for ranking a list of items. BPR does not just consider the unobserved user-item pairs as zeros; instead, it takes into consideration the preference of a user between an observed and an unobserved rating.

The CB module (the ONTO algorithm) is based on the ChEBI ontology. This module assigns a score S to each item in the test set by calculating the semantic similarity between each item in the train and test sets, as shown in Figure 2.

Fig. 2: Example of the ONTO algorithm. I1 is a test item; I2, I3 and I4 are train items. The semantic similarity is calculated for each pair of test-train items. The score for I1 (S_I1) is the mean of the similarities of each test-train pair.

For calculating the similarity, we used DiShIn [5], a tool for calculating semantic similarities between the entities represented in an ontology. Semantic similarity measures how close two entities are in a semantic base. When using ontologies, the semantic similarity may be measured, for example, by calculating the shortest path connecting the nodes of two entities. DiShIn calculates three similarity metrics: Resnik [19], Lin [13], and Jiang and Conrath [10]. For this work, we used the Lin metric; we intend to test the other metrics in the future.

Whereas the CF module uses all the ratings from the train set to train the model, the CB module only takes into account the ratings of each user. Using DiShIn, we calculate the value of the similarity between each item in the train set and the items in the test set. Let I1 be the item in the test set, and I2, I3, ..., In the items in the train set, with size m, for a user U.
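For illustration, the Lin metric is based on information content (IC): Lin(a, b) = 2·IC(mica) / (IC(a) + IC(b)), where the "mica" is the most informative common ancestor of the two terms. The following toy sketch uses a made-up mini-ontology and term probabilities; it is not DiShIn's actual API:

```python
import math

# Toy ontology: term -> set of ancestors (including the term itself),
# and made-up occurrence probabilities for computing IC = -log p(term).
ancestors = {
    "alcohol": {"alcohol", "organic compound", "compound"},
    "ethanol": {"ethanol", "alcohol", "organic compound", "compound"},
    "methanol": {"methanol", "alcohol", "organic compound", "compound"},
    "benzene": {"benzene", "organic compound", "compound"},
}
prob = {"compound": 1.0, "organic compound": 0.8, "alcohol": 0.4,
        "ethanol": 0.1, "methanol": 0.1, "benzene": 0.2}

def ic(term):
    """Information content of a term."""
    return -math.log(prob[term])

def lin(a, b):
    """Lin similarity: 2*IC(mica) / (IC(a) + IC(b))."""
    common = ancestors[a] & ancestors[b]
    mica = max(ic(t) for t in common)  # most informative common ancestor
    denom = ic(a) + ic(b)
    return 2 * mica / denom if denom else 1.0

# ethanol/methanol share the informative ancestor "alcohol", so they are
# more similar than ethanol/benzene, which only share "organic compound".
print(round(lin("ethanol", "methanol"), 2), round(lin("ethanol", "benzene"), 2))
```

In the actual model, such pairwise similarities between a test compound and each of a user's train compounds are averaged to produce the ONTO score.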
The score S for I1 (S_I1) is calculated according to Equation 1. The ONTO algorithm does not use any real rating of the test items when calculating the score for each item in the test set; thus, we do not have the problem of introducing bias in the results.

S_I1 = (Sim_1,2 + Sim_1,3 + ... + Sim_1,n) / m    (1)

For obtaining a final score (FS) for each item in the test set, we combine the scores from the CF module (S_CF) and the CB module (S_CB) into a Hybrid recommendation approach, according to Equation 2. Our goal is to prove that by combining both modules, we can improve the results of each module separately.

FS_I1 = S_CF × S_CB    (2)

https://implicit.readthedocs.io/en/latest/index.html
https://github.com/lasigeBioTM/DiShIn

Experiments
The data used in this work is a subset of a dataset of Chemical Compounds, CheRM, with the format <user, item, rating> [3]. The users are authors of research articles, the items are Chemical Compounds present in ChEBI, and the (implicit) ratings are the number of articles the author wrote about the item. The subset has 102 Chemical Compounds, 1184 authors, 5401 ratings, and a sparsity level of 95.5%. We used a subset of CheRM because the full dataset has more than 22,000 items and there is a bottleneck in the calculation of the similarity between all the items in real time.

The algorithms tested were ALS, BPR, ONTO, and the hybrids ALS_ONTO and BPR_ONTO. For ALS and BPR we tested different numbers of latent factors, achieving the best results for this data with 150 factors. We used offline methods [25] for evaluating the performance of the algorithms for the top@k, with k varying between 0 and 20, with steps of 1. From the vast range of metrics for evaluating recommender algorithms, we selected Classification Accuracy Metrics (CAMet) and Rank Accuracy Metrics (RAMet). CAMet measure the relevant and irrelevant items recommended in a ranked list; examples of CAMet are Precision, Recall, and F-measure. RAMet measure the ability of an algorithm to recommend the items in the correct order; some well-known RAMet are Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (nDCG), and Limited Area Under the Curve (lAUC), a variation of AUC [22]. All the selected metrics range between 0 and 1, and values closer to 1 are better. For the segmentation of the dataset, we used a cross-validation approach, splitting users and items into 5 folds. Each iteration had 1/5 of the users and items as test data and 4/5 as train data. All the positive ratings in the test set are considered relevant items. We considered the unrated items as negative ratings, i.e., not relevant for the users.

Results
We present the results of this study in Figure 3, for all the algorithms and all the metrics described previously.

Fig. 3: Results comparing ALS, BPR, ONTO and the hybrids ALS_ONTO and BPR_ONTO, for Precision, Recall, F-measure, MRR, nDCG, and lAUC.

Analysing Figure 3, the ONTO algorithm alone has the lowest results in all metrics. Nevertheless, in metrics such as Precision, Recall and F-measure, it follows the trend of the other algorithms, and when measuring these metrics for the top@20, the results are similar. ONTO has the advantage of being a CB algorithm, therefore it does not have the cold start problem for new items. ALS and BPR cannot be used if an item in the test set is not in the train set at least once (i.e., at least one author in the train set wrote about that Chemical Compound).

Between ALS and BPR, ALS achieved the best results. Since BPR is an algorithm for ranking, it was expected to obtain better results. We believe this is due to the fact that the dataset has a large number of ratings equal to one, so many items have the same relevance and are difficult to rank.

The approach with the best results in most of the metrics is the Hybrid ALS_ONTO. The use of the ALS and ONTO algorithms together has a particularly positive effect on the metrics measuring the ranking accuracy (MRR, nDCG and lAUC), especially for MRR, with an increase of 6.7% when comparing the ALS algorithm and the Hybrid ALS_ONTO. This means that ONTO reorders the ALS scores in a way that the first results in the top@k are more relevant.

These are preliminary results. The study needs to be replicated with the full CheRM dataset (https://github.com/lasigeBioTM/CheRM), and we need to perform more studies to assess the real impact on the cold start problem. Nevertheless, the results seem promising, on the one hand for improving the relevant recommendations provided (CAMet), and on the other hand for enhancing the position of the most relevant items in a ranked list (RAMet). Our Hybrid algorithm may be applied to other areas, for example, to genes, phenotypes, and diseases, provided that an ontology exists for these items.
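MRR, the metric where the Hybrid gained most, averages the reciprocal rank of the first relevant item in each user's recommendation list. A minimal sketch over toy rankings (not CheRM results):

```python
def mrr(rankings, relevant):
    """Mean Reciprocal Rank.

    rankings: user -> ordered list of recommended items (best first).
    relevant: user -> set of items relevant to that user.
    """
    total = 0.0
    for user, ranked in rankings.items():
        rr = 0.0  # stays 0 if no relevant item appears in the list
        for pos, item in enumerate(ranked, start=1):
            if item in relevant[user]:
                rr = 1.0 / pos  # reciprocal rank of first relevant hit
                break
        total += rr
    return total / len(rankings)

rankings = {"u1": ["a", "b", "c"], "u2": ["b", "c", "a"]}
relevant = {"u1": {"a"}, "u2": {"a"}}
print(round(mrr(rankings, relevant), 3))  # (1/1 + 1/3) / 2 -> 0.667
```

A reordering that moves relevant items toward the first positions, which is what the ONTO scores do to the ALS output, directly raises this value.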
Conclusion

In this work, we presented a Hybrid recommendation model for recommending Chemical Compounds, based on CF algorithms for implicit data and a CB algorithm based on the semantic similarity of the Chemical Compounds using the ChEBI ontology. The obtained results support our hypothesis that by using the semantic similarity between the Chemical Compounds, the results of state-of-the-art CF algorithms can be improved. For future work, we intend to increase the size of the dataset, to test other similarity metrics, and to test other alternatives for calculating the final score of the Hybrid algorithm.
References
1. Al-Hassan, M., Lu, H., Lu, J.: A semantic enhanced hybrid recommendation approach: A case study of e-government tourism service recommendation system. Decision Support Systems, 97–109 (2015)
2. Barros, M., Couto, F.M.: Knowledge representation and management: a linked data perspective. Yearbook of Medical Informatics (01), 178–183 (2016)
3. Barros, M., Moitinho, A., Couto, F.M.: Using research literature to generate datasets of implicit feedback for recommending scientific items. IEEE Access, 176668–176680 (2019)
4. Consortium, G.O.: The Gene Ontology resource: 20 years and still going strong. Nucleic Acids Research (D1), D330–D338 (2018)
5. Couto, F., Lamurias, A.: Semantic similarity definition. Encyclopedia of Bioinformatics and Computational Biology (2019)
6. Harper, F.M., Konstan, J.A.: The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) (4), 1–19 (2015)
7. Hastings, J., Owen, G., Dekker, A., Ennis, M., Kale, N., Muthukrishnan, V., Turner, S., Swainston, N., Mendes, P., Steinbeck, C.: ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research (D1), D1214–D1219 (2015)
8. Hu, Y., Koren, Y., Volinsky, C.: Collaborative filtering for implicit feedback datasets. In: 2008 Eighth IEEE International Conference on Data Mining, pp. 263–272. IEEE (2008)
9. Ishihara, T., Koga, Y., Iwatsuki, Y., Hirayama, F.: Identification of potent orally active factor Xa inhibitors based on conjugation strategy and application of predictable fragment recommender system. Bioorganic & Medicinal Chemistry (2), 277–289 (2015)
10. Jiang, J.J., Conrath, D.W.: Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008 (1997)
11. Khawar, F., Zhang, N.L.: Conformative filtering for implicit feedback data. In: European Conference on Information Retrieval, pp. 164–178. Springer (2019)
12. Liao, I.E., Hsu, W.C., Cheng, M.S., Chen, L.P.: A library recommender system based on a personal ontology model and collaborative filtering technique for English collections. The Electronic Library (3), 386–400 (2010)
13. Lin, D., et al.: An information-theoretic definition of similarity. In: ICML, vol. 98, pp. 296–304. Citeseer (1998)
14. Nilashi, M., Ibrahim, O., Bagherifard, K.: A recommender system based on collaborative filtering using ontology and dimensionality reduction techniques. Expert Systems with Applications, 507–520 (2018)
15. Ortega, F., Bobadilla, J., Gutiérrez, A., Hurtado, R., Li, X.: Artificial intelligence scientific documentation dataset for recommender systems. IEEE Access, 48543–48555 (2018)
16. Ostuni, V.C., Di Noia, T., Di Sciascio, E., Mirizzi, R.: Top-N recommendations from implicit feedback leveraging linked open data. In: Proceedings of the 7th ACM Conference on Recommender Systems, pp. 85–92. ACM (2013)
17. Rendle, S., Balby Marinho, L., Nanopoulos, A., Schmidt-Thieme, L.: Learning optimal ranking with tensor factorization for tag recommendation. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 727–736. ACM (2009)
18. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 452–461. AUAI Press (2009)
19. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint cmp-lg/9511007 (1995)
20. Ricci, F., Rokach, L., Shapira, B.: Recommender systems: introduction and challenges. In: Recommender Systems Handbook, pp. 1–34. Springer (2015)
21. Schriml, L.M., Mitraka, E., Munro, J., Tauber, B., Schor, M., Nickle, L., Felix, V., Jeng, L., Bearer, C., Lichenstein, R., et al.: Human Disease Ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Research (D1), D955–D962 (2018)
22. Schröder, G., Thiele, M., Lehner, W.: Setting goals and choosing metrics for recommender system evaluations. In: UCERSTI2 Workshop at the 5th ACM Conference on Recommender Systems, Chicago, USA, vol. 23, p. 53 (2011)
23. Seko, A., Hayashi, H., Tanaka, I.: Compositional descriptor-based recommender system for the materials discovery. The Journal of Chemical Physics (24), 241719 (2018)
24. Shambour, Q., Lu, J.: A trust-semantic fusion-based recommendation approach for e-business applications. Decision Support Systems (1), 768–780 (2012)
25. Shani, G., Gunawardana, A.: Evaluating recommendation systems. In: Recommender Systems Handbook, pp. 257–297. Springer (2011)
26. Sieg, A., Mobasher, B., Burke, R.: Improving the effectiveness of collaborative recommendation with ontology-based user profiles. In: Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems, pp. 39–46. ACM (2010)
27. Tarus, J.K., Niu, Z., Mustafa, G.: Knowledge-based recommendation: a review of ontology-based recommender systems for e-learning. Artificial Intelligence Review (1), 21–48 (2018)
28. Uschold, M., Gruninger, M.: Ontologies: Principles, methods and applications. The Knowledge Engineering Review 11