CupQ: A New Clinical Literature Search Engine
Jesse Wang
Department of Translational Biomedical Science, School of Medicine and Dentistry, University of Rochester, Rochester, NY, [email protected]
Henry Kautz
Department of Computer Science, Hajim School of Engineering and Applied Sciences, University of Rochester, Rochester, NY, [email protected]
ABSTRACT
A new clinical literature search engine, called CupQ, is presented. It aims to help clinicians stay updated with medical knowledge. Although PubMed is currently one of the most widely used digital libraries for biomedical information, it frequently does not return clinically relevant results. CupQ utilizes a ranking algorithm that filters out non-medical journals, compares semantic similarity between queries and documents, and incorporates journal impact factor and publication date. It organizes search results into useful categories for medical practitioners: reviews, guidelines, and studies. Qualitative comparisons suggest that CupQ may return more clinically relevant information than PubMed. CupQ is available at https://cupq.io/.
CCS CONCEPTS
• Applied computing → Life and medical sciences; Health care information systems;
KEYWORDS
Medicine, Literature, Search Engine
The task of staying updated with advances in medicine remains a challenging aspect of clinical practice. An average of about two biomedical documents is added to the literature every minute [12]. The widely used PubMed digital library often does not deliver clinically relevant results within a reasonable time frame [10, 11, 15]. Other resources, such as UpToDate, NEJM Journal Watch, and ACP Journal Club, rely on the expensive and time-consuming process of using human curators to manually comb the literature for clinical information [1, 2, 9]. The current utilities for medical information retrieval may be inadequate for continuing medical education and consequently may be hindering efforts to improve patient care.

PubMed is a biomedical digital library built and maintained by the United States National Center for Biotechnology Information [6]. It often requires users to select filters, identify MeSH terms, and generate boolean entries to distill relevant clinical results [16, 22].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
arXiv, 2019, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
The complexity of PubMed may contribute to low search satisfaction among healthcare professionals [10, 11, 15]. Moreover, the newly released Best Match relevance algorithm does not incorporate important metrics such as journal rank and semantic similarity [12]. These ranking signals also appear to be missing in the related search tool, PubMed Clinical Queries [7]. To better fulfill the information needs of medical practice, PubMed may require further improvements.

This application note discusses the development of a new medical literature search engine called CupQ. The system uses Word2Vec to generate word embeddings for comparing semantic similarity between queries and documents [18, 26]. It also considers journal impact factor (JIF) and publication date [4]. Results are organized by reviews, guidelines, and clinical studies. Documents written in English and published in journals listed in the medicine subject area of ScimagoJR are returned [8]. Example search results suggest that CupQ may be more effective than PubMed for returning relevant clinical information. This publication aims to encourage utilization of CupQ for staying updated with medical literature.
Lu provides a survey of web tools for searching biomedical literature, including Quertle, MEDIE, and Semantic MEDLINE [17].

Quertle is a semantic search engine utilizing over 250 million subject-verb-object (SVO) associations to provide relevant publications [21]. It also features "Power Terms" that allow users to search topics. Example terms, denoted by a dollar sign ($) prefix, include "$Amino Acids," "$Biomarkers," and "$Chemicals." In addition, Quertle differentiates capitalizations, such as the "WHO" abbreviation for the World Health Organization and the "who" pronoun. Search results are presented in two tabs. One tab lists results derived from its semantic-based algorithm. Another tab lists results obtained from a standard PubMed search. Quertle was developed and is currently maintained by a for-profit private enterprise. The exact details of its search process are consequently unavailable to the public.

MEDIE also aims to incorporate grammatical meaning into its search algorithm [19]. It returns documents that match the user's desired SVO relations. For example, the query "what does p53 activate" would produce results that contain sentences matching "activate" and "p53" as the verb and object, respectively. Queries in MEDIE are first annotated with part-of-speech tags through the Enju head-driven phrase structure grammar parser. Genes and diseases are also annotated through a dictionary comparison approach. After annotation, results returned from a standard keyword search are filtered based on the predicate structure of their sentences. Users are shown the results with sentences matching the specified semantic relations.

Semantic MEDLINE, similar to Quertle and MEDIE, utilizes linguistic information [21]. In particular, it extracts normalized representations of semantic relations. For example, the phrase "Genes AFFECTS Circadian Rhythms" was parsed from the title "Clock genes are the genes that control circadian rhythms in physiology and behavior." The extraction process was developed using the SemRep natural language processing platform, which depends on the National Library of Medicine's Unified Medical Language System. Semantic categories and relationships are derived from this collection. The process was conducted on about 25 million MEDLINE abstracts and produced more than 26 million semantic relations.

An instance of CupQ uses two networked servers. A dedicated storage server is used to maintain persistent information, including a MySQL database, a MongoDB database, and other files. The storage server also performs operations relating to data download and extraction. A second server with high memory capacity is used for tokenization, embedding, indexing, searching, and website hosting. The storage server contains an Intel Core i7-4790K 4.0 GHz processor, Ballistix Sport 32 GB DDR3 RAM, and a Samsung 850 Evo 1 TB SSD. The memory server is a Dell R710 with dual Intel Xeon X5687 3.6 GHz processors, 288 GB PC3-10600R RAM, and a Samsung 850 Evo 256 GB SSD.
MEDLINE/PubMed data is downloaded via FTP as a directory of compressed XML files [3]. MD5 checksums are compared to ensure file integrity. Specific XML elements related to title, abstract, journal, authors, and publication date are parsed and inserted into a MongoDB collection. The most recent journal information from ScimagoJR and Journal Citation Reports is also downloaded. Documents published in journals listed in the medicine subject area of ScimagoJR are labeled. Each document in this subset is assigned the JIF of its publishing journal. Subsequent operations are performed only on this document subset.
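The checksum comparison can be illustrated with a minimal sketch. It is not the CupQ implementation: the function names are hypothetical, and the sketch assumes each downloaded file has a sidecar `.md5` file whose last whitespace-separated token is the expected hexadecimal digest.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(data_path, md5_path):
    """Compare a file's digest with the digest stored in its sidecar file.

    Assumption: the last whitespace-separated token of the sidecar file
    is the expected hexadecimal digest.
    """
    expected = Path(md5_path).read_text().split()[-1].lower()
    return md5_of_file(data_path) == expected
```

A file that fails this check would presumably be downloaded again before parsing; the source does not describe the recovery step.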
Tokens are extracted from titles and abstracts by splitting text on space and hyphen characters. The LuiNorm API is used for token normalization [5]. Stopwords, except for those fully capitalized, are removed. Then, the Gensim library is used to run Word2Vec, generating a vector representation for each token [20]. Vectors of 100 elements are produced using skip-gram and a window size of 100 without sentence boundaries for 10 epochs. Embeddings for document titles are computed as the sum of each token embedding multiplied by the log ratio of the corpus size to the number of documents containing the token.
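The tokenization and weighted-sum steps can be sketched as follows. This is an illustration under stated assumptions rather than the CupQ code: lowercasing stands in for LuiNorm normalization, the token vectors are assumed to come from a previously trained Word2Vec model, and the names `tokenize`, `idf`, `title_embedding`, `vectors`, and `doc_freq` are hypothetical.

```python
import math
import re


def tokenize(text, stopwords):
    """Split on spaces and hyphens; drop stopwords unless fully capitalized.

    Lowercasing is a stand-in for LuiNorm normalization (an assumption).
    Punctuation is left attached because only spaces and hyphens split text.
    """
    tokens = []
    for raw in re.split(r"[ \-]+", text):
        if not raw:
            continue
        if raw.lower() in stopwords and not raw.isupper():
            continue
        tokens.append(raw if raw.isupper() else raw.lower())
    return tokens


def idf(token, doc_freq, corpus_size):
    """Log ratio of corpus size to the number of documents with the token."""
    return math.log(corpus_size / doc_freq[token])


def title_embedding(tokens, vectors, doc_freq, corpus_size):
    """Sum each token vector scaled by its log-ratio weight."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for token in tokens:
        if token not in vectors or token not in doc_freq:
            continue  # skipping unknown tokens is an assumption
        weight = idf(token, doc_freq, corpus_size)
        for i, value in enumerate(vectors[token]):
            total[i] += weight * value
    return total
```

With this weighting, a token that appears in few documents contributes more to the title vector than a token that appears in many.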
A Java hash map with keys as integers and values as integer array lists is instantiated. Keys represent numeric token identifiers (TIDs). Values represent document PubMed identifiers (PMIDs). For each document, a hash set of title and abstract TIDs is created. The document PMID is added to the array list for each TID. Key-value pairs are stored in a MySQL table comprised of two integer columns, the first for TIDs and the second for PMIDs, with the primary key set over both columns. New MEDLINE/PubMed documents are automatically downloaded, processed, and indexed on a weekly basis.
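The index construction described above can be sketched in a few lines. The original system uses Java collections and a MySQL table; this Python version is only an illustration, and the function names and the `(pmid, tids)` input shape are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token identifier (TID) to the PMIDs of documents containing it.

    `documents` is an iterable of (pmid, tids) pairs, where `tids` holds the
    TIDs found in the title and abstract (a hypothetical input shape).
    """
    index = defaultdict(list)
    for pmid, tids in documents:
        for tid in set(tids):  # de-duplicate, like the per-document hash set
            index[tid].append(pmid)
    return index


def to_rows(index):
    """Flatten the index into unique (TID, PMID) rows for a two-column table."""
    return sorted({(tid, pmid) for tid, pmids in index.items() for pmid in pmids})
```

Because each (TID, PMID) pair is unique, the flattened rows are compatible with a primary key set over both columns.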
A Java search server tokenizes the search string and computes a weighted sum vector representation. The token contained in the fewest documents is passed to the inverted index, which returns a list of PMIDs. Only results written in English and containing all search tokens are retained. Errata, retracted documents, and documents published before the year 1990 are removed. Document information, including publication date, publication type, title embedding, and TIDs, is stored in an object array. Results are organized by publication type into array lists for reviews, guidelines, and studies.

After assigning documents into publication categories, a relevance score is computed for each document. A different relevance calculation is used for each publication category. Document lists are then sorted by relevance in descending order. The top 500 results are retained and cached into a MySQL table. A sublist containing results to be displayed for the user's requested page number is obtained. Display information including title, abstract, author abbreviations, journal ISO abbreviation, and publication year is retrieved from disk. Search results are returned to the web server as a JSON payload for HTML rendering.

The document relevance score is the sum of several min-max normalized subscores multiplied by empirically configured boosting factors. A semantic score is computed as the cosine similarity between the query vector and the title vector. A title count score is set to one if a title contains all search tokens and zero otherwise. A date score is computed as an estimated number of days. A journal score is set to the JIF. If a document was published over twenty years ago, the relevance score is fractioned by a tenth. If any of the subscores are zero, then the relevance score is zero. Different sets of boosting factors are used for each category (Table 1).
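The candidate retrieval step at the start of the search process (look up the rarest query token, then keep only documents containing every query token) can be sketched as follows. This is an assumption-laden illustration, not the Java search server: `retrieve_candidates` and the `doc_tids` mapping are hypothetical, and the language, erratum, retraction, and publication-year filters are omitted.

```python
def retrieve_candidates(query_tids, index, doc_tids):
    """Return PMIDs of documents that contain every query token.

    The posting list of the rarest query token seeds the candidate set, and
    each candidate is then checked against the full set of query tokens.
    `index` maps a TID to a list of PMIDs; `doc_tids` maps a PMID to the set
    of TIDs in that document (both are hypothetical input shapes).
    """
    if not query_tids or any(tid not in index for tid in query_tids):
        return []
    rarest = min(query_tids, key=lambda tid: len(index[tid]))
    required = set(query_tids)
    return [pmid for pmid in index[rarest] if required <= doc_tids[pmid]]
```

Starting from the shortest posting list keeps the number of containment checks as small as the index allows.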
Table 1: Publication Category Boosting Factors
Category | Title Cosine | Title Count | Date | Journal
Reviews | 4 | 3 | 1 | 2
Guidelines | 6 | 8 | 1 | 4
Studies | 3 | 5 | 1 | 1
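The relevance calculation can be sketched with the boosting factors from Table 1. Several details are not specified in the text, so this sketch makes labeled assumptions: min-max normalization is taken over the candidate list of one category, the zero rule is applied to the raw subscores, "fractioned by a tenth" is read as multiplication by 0.1, and the document field names are hypothetical.

```python
import math

# Boosting factors from Table 1: (title cosine, title count, date, journal).
BOOSTS = {
    "reviews": (4, 3, 1, 2),
    "guidelines": (6, 8, 1, 4),
    "studies": (3, 5, 1, 1),
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all ones (assumption)."""
    low, high = min(values), max(values)
    if high == low:
        return [1.0] * len(values)
    return [(v - low) / (high - low) for v in values]


def relevance_scores(docs, query_vec, query_tokens, category):
    """Score the documents of one publication category.

    Each doc is a dict with hypothetical keys: title_vec, title_tokens,
    days (estimated publication date in days), jif, and age_years.
    """
    raw = [
        (
            cosine(query_vec, d["title_vec"]),
            1.0 if set(query_tokens) <= set(d["title_tokens"]) else 0.0,
            float(d["days"]),
            float(d["jif"]),
        )
        for d in docs
    ]
    if not raw:
        return []
    columns = [min_max(list(col)) for col in zip(*raw)]
    scores = []
    for i, d in enumerate(docs):
        if any(value == 0 for value in raw[i]):
            scores.append(0.0)  # zero rule on raw subscores (assumption)
            continue
        score = sum(b * columns[k][i] for k, b in enumerate(BOOSTS[category]))
        if d["age_years"] > 20:
            score *= 0.1  # one reading of "fractioned by a tenth"
        scores.append(score)
    return scores
```

Sorting the documents by these scores in descending order and keeping the first 500 would then correspond to the caching step described earlier.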
CupQ provides a simple user interface that includes a search bar for entering queries and a tab bar for selecting publication categories (Figures 1–2).
Figure 1: CupQ home page.
Figure 2: CupQ results page showing the top three results for the query "stroke."
The top ten results for several queries and filters were compared between CupQ and PubMed. Note that the PubMed sidebar does not allow for selection of document types to be excluded. For example, it does not allow inclusion of documents that are reviews but not systematic reviews. Defining document types using advanced search strings in PubMed returns different results than selecting document types in the sidebar, perhaps because the Best Match ranking algorithm weighs document types in the search string whereas the sidebar selection behaves as a simple binary filter. Search result comparisons were made with the PubMed sidebar because its interface is more similar to the CupQ tab bar. Searches were performed on January 28, 2019.
For the query "myocardial infarction," CupQ returned review results from high impact factor journals, including New England Journal of Medicine (JIF = 79.26), Lancet (JIF = 53.254), and BMJ (JIF = 23.562) (Table A1). The titles of the first two results, "Acute myocardial infarction," were highly relevant to the query with a cosine similarity of 0.986. Both documents were published in 2017 issues of New England Journal of Medicine and Lancet. Other results referenced common concepts related to myocardial infarction, including coronary reperfusion strategies, percutaneous coronary intervention, electrocardiogram, and ST-segment elevation. Incorporation of JIF and Word2Vec query-title cosine similarity may explain the effective prioritization of high impact factor journals and titles containing concepts semantically related to myocardial infarction.

In contrast, PubMed returned no results from New England Journal of Medicine, Lancet, or BMJ (Table A2). The title of the first result was less relevant to the query with a cosine similarity of 0.832. Moreover, the first result was published in a journal unranked by JIF. Although there was a document with the highly relevant title "Acute myocardial infarction," it was published in a 2013 issue of Disease-A-Month, a relatively low impact factor journal (JIF = 0.891). PubMed did not return the newer documents with the same title from New England Journal of Medicine and Lancet. In addition, only one PubMed title referenced an aforementioned common topic related to myocardial infarction, percutaneous coronary intervention. PubMed and CupQ shared no common results. These observations suggest that PubMed may not effectively incorporate JIF in the context of query-title semantic similarity.
There were more similarities between CupQ and PubMed for the query "depression" when searching for guidelines (Tables A3–A4). The same document published in a 2016 issue of JAMA (JIF = 47.661) appeared as the top result for both search engines. However, PubMed lacked the result "Screening for Depression in Children and Adolescents: U.S. Preventive Services Task Force Recommendation Statement," published in a 2016 issue of Annals of Internal Medicine (JIF = 19.384). This was unusual behavior because PubMed was able to return a document with the same title and year, albeit from a lower impact factor journal, Pediatrics (JIF = 5.515). Unlike PubMed, CupQ may return relevant results by estimating importance via JIF.

All result titles in CupQ contained the query "depression." A problem with PubMed was that the title of the fourth result did not contain the query. Although this result was published within the last two years in a high impact factor journal, CA: A Cancer Journal for Clinicians (JIF = 244.585), it did not specifically focus on depression. This document encompassed strategies for addressing multiple conditions in patients with breast cancer, including chemotherapy-induced nausea, vomiting, and peripheral neuropathy. Although this document may be more appropriate as a top ten result for the query "depression breast cancer," it addresses too many topics other than depression to be a top ten result for the query "depression."
When searching for studies about stroke, all result titles from CupQ and PubMed contained the query (Tables A5–A6). CupQ only returned results from New England Journal of Medicine whereas PubMed returned no results from this journal. The first result returned by PubMed was published in Clinical Neurology and Neurosurgery (JIF = 1.736). The highest impact factor journal returned by PubMed was Lancet Neurology (JIF = 27.144). The first result from CupQ was published in 2018 whereas the first result from PubMed was published in 2017. Moreover, CupQ results were published from 2017 to 2018 whereas PubMed results were published from 2007 to 2018. CupQ may prioritize recent, high impact factor results whose titles contain the query.
Search engine performance can be assessed through a variety of approaches. Precision and recall can be measured, assuming a binary relevance model and an existing standard for relevance [14]. User task studies may demonstrate performance with respect to specific search objectives but may require statistical adjustment for prior user experience with comparative search tools [25]. Click-through rates may provide another indication of performance given a high volume of web traffic [12]. This paper qualitatively compared results between CupQ and PubMed for specific queries and publication categories. Because CupQ was recently launched in January 2019, future work will include analyses of click-through rates when there is significant traffic.

The CupQ ranking algorithm prioritizes title relevance, JIF, and publication date. It assumes that users place the most emphasis on title content when determining the relevance of a result. It also assumes that users weigh the reliability and importance of information, represented by JIF, either greater than or equal to the recency of information. Although JIF is not necessarily representative of individual articles in a journal, it may serve as a useful approximation for physicians who may have limited time to search for information [13, 23, 24]. In addition, CupQ only returns information published in journals that are listed in the medicine subject area of ScimagoJR. This unique implementation of title relevance, JIF, publication date, and journal category may enable CupQ to return relevant clinical information.
ACKNOWLEDGMENTS
Jesse Wang is an MD and PhD candidate in the Medical Scientist Training Program funded by the National Institutes of Health under grant T32 GM07356. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health. We thank Jie Wang at the University of Massachusetts Lowell and Daniel Schwartz at the University of Connecticut for their comments that greatly improved this manuscript.
REFERENCES
Implementation Science 9, 1 (2014), 125.
[11] Karen S Davies. 2011. Physicians and their use of information: a survey comparison between the United States, Canada, and the United Kingdom. Journal of the Medical Library Association: JMLA 99, 1 (2011), 88.
[12] Nicolas Fiorini, Kathi Canese, Grisha Starchenko, Evgeny Kireev, Won Kim, Vadim Miller, Maxim Osipov, Michael Kholodov, Rafis Ismagilov, Sunil Mohan, et al. 2018. Best Match: new relevance search for PubMed. PLoS biology 16, 8 (2018), e2005343.
[13] Eugene Garfield. 2006. The history and meaning of the journal impact factor. JAMA
Information Retrieval 4, 1 (2001), 33–59.
[15] Gah Juan Ho, Su May Liew, Chirk Jenn Ng, Ranita Hisham Shunmugam, and Paul Glasziou. 2016. Development of a search strategy for an evidence based retrieval service. PloS one 11, 12 (2016), e0167170.
[16] Wesley T Lindsey and Bernie R Olin. 2013. PubMed searches: Overview and strategies for clinicians. Nutrition in Clinical Practice 28, 2 (2013), 165–176.
[17] Zhiyong Lu. 2011. PubMed and beyond: a survey of web tools for searching biomedical literature. Database
Advances in neural information processing systems. 3111–3119.
[19] Tomoko Ohta, Yusuke Miyao, Takashi Ninomiya, Yoshimasa Tsuruoka, Akane Yakushiji, Katsuya Masuda, Jumpei Takeuchi, Kazuhiro Yoshida, Tadayoshi Hara, Jin-Dong Kim, et al. 2006. An intelligent search engine and GUI-based efficient MEDLINE search tool based on deep syntactic parsing. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions (2006), 17–20.
[20] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
[21] Thomas C Rindflesch, Halil Kilicoglu, Marcelo Fiszman, Graciela Rosemblat, and Dongwook Shin. 2011. Semantic MEDLINE: An advanced information management application for biomedicine. Information Services & Use 31, 1-2 (2011), 15–21.
[22] Tony Russell-Rose and Jon Chamberlain. 2017. Expert search strategies: the information retrieval practices of healthcare information professionals. JMIR medical informatics 5, 4 (2017), e33.
[23] Somnath Saha, Sanjay Saint, and Dimitri A Christakis. 2003. Impact factor: a valid measure of journal quality? Journal of the Medical Library Association 91, 1 (2003), 42.
[24] Per O Seglen. 1997. Why the impact factor of journals should not be used for evaluating research. BMJ
JSW 3, 1 (2008), 63–73.
[26] Lyndon White, Roberto Togneri, Wei Liu, and Mohammed Bennamoun. 2015. How well sentence embeddings capture meaning. In Proceedings of the 20th Australasian Document Computing Symposium. ACM, 9.
APPENDIX
The appendix consists of Tables A1–A6.
Table A1: Reviews returned by CupQ for the query "myocardial infarction."
No | Title | Journal | Year
Table A2: Reviews returned by PubMed for the query "myocardial infarction."
No | Title | Journal | Year
Table A3: Guidelines returned by CupQ for the query "depression."
No | Title | Journal | Year
Table A4: Guidelines returned by PubMed for the query "depression."
No | Title | Journal | Year
Table A5: Studies returned by CupQ for the query "stroke."
No | Title | Journal | Year
Table A6: Studies returned by PubMed for the query "stroke."
No | Title | Journal | Year
1 | Hereditary cerebral small vessel disease and stroke. | Clinical Neurology and Neurosurgery | 2017
2 | Imaging Markers of Post-Stroke Depression and Apathy: a Systematic Review and Meta-Analysis. | Neuropsychology Review | 2017
3 | Role of Total, Red, Processed, and White Meat Consumption in Stroke Incidence and Mortality: A Systematic Review and Meta-Analysis of Prospective Cohort Studies. | Journal of the American Heart Association | 2017
4 | Endarterectomy achieves lower stroke and death rates compared with stenting in patients with asymptomatic carotid stenosis. | Journal of Vascular Surgery | 2017
5 | The Course of Activities in Daily Living: Who Is at Risk for Decline after First Ever Stroke? | Cerebrovascular Diseases | 2017
6 | Prevalence, incidence, and factors associated with pre-stroke and post-stroke dementia: a systematic review and meta-analysis. | Lancet Neurology | 2009
7 | Acupuncture lowering blood pressure for secondary prevention of stroke: a study protocol for a multicenter randomized controlled trial. | Trials | 2017
8 | Decreased Serum Brain-Derived Neurotrophic Factor May Indicate the Development of Poststroke Depression in Patients with Acute Ischemic Stroke: A Meta-Analysis. | Journal of Stroke and Cerebrovascular Diseases | 2018
9 | Aerobic Exercises for Cognition Rehabilitation following Stroke: A Systematic Review. | Journal of Stroke and Cerebrovascular Diseases | 2016
10 | Types of stroke recurrence in patients with ischemic stroke: a substudy from the PRoFESS trial. | International Journal of Stroke | 2014