Automated text summarisation and evidence-based medicine: A survey of two domains

Abstract

The practice of evidence-based medicine (EBM) urges medical practitioners to utilise the latest research evidence when making clinical decisions. Because of the massive and growing volume of published research on various medical topics, practitioners often find themselves overloaded with information. As such, natural language processing research has recently commenced exploring techniques for medical domain-specific automated text summarisation (ATS), targeted towards the task of condensing large medical texts. However, the development of effective summarisation techniques for this task requires cross-domain knowledge. We present a survey of EBM, the domain-specific needs for EBM, automated summarisation techniques, and how they have been applied hitherto. We envision that this survey will serve as a first resource for the development of future operational text summarisation techniques for EBM.



ABEED SARKER and DIEGO MOLLÁ, Macquarie University
CÉCILE PARIS, CSIRO Australia


Categories and Subject Descriptors: H.3.4 [Information Systems]: Information Storage and Retrieval - Systems and Software; I.7.0 [Computing Methodologies]: Document and Text Processing - General; I.2.7 [Computing Methodologies]: Artificial Intelligence - Natural Language Processing

General Terms: Algorithms, Experimentation, Performance

Additional Key Words and Phrases: Text summarisation, medical text processing, evidence-based medicine, medical text summarisation, natural language processing

1. INTRODUCTION

Evidence-based medicine (EBM) promotes a form of medical practice in which practitioners incorporate the current, best research evidence available on a topic and combine it with their expertise and the patients’ preferences when making medical decisions. However, the amount of available medical research literature is massive, presenting practitioners with the problem of information overload. As such, there has been increasing interest in recent times in research targeted towards the development of post-retrieval systems that can summarise information based on the needs of medical practitioners. Text summarisation is a field of research within the domain of natural language processing (NLP), and its aim is to generate summaries from large volumes of text. While research on open-domain text summarisation has seen significant progress, its application to problems involving more complex forms of natural language, such as those in the medical domain, is still very much in its infancy [Sarker et al. 2016]. This can be attributed to the fact that systems attempting to summarise information in such complex domains must incorporate vast amounts of domain-specific knowledge, which practitioners acquire through years of experience, and integrate that knowledge with NLP techniques targeted towards summarisation of information. Thus, NLP researchers intending to develop end-to-end summarisation systems to support EBM require an in-depth understanding of the practice as well as techniques for automated summarisation. The intent of this survey is to bring together information from the distinct domains of EBM and ATS, and to provide a single resource for researchers attempting to develop automated summarisation systems addressing the needs of this practice.

In this paper, we review the state-of-the-art for EBM. We first provide a detailed description of EBM, its goals, the obstacles faced by practitioners, and a brief overview of how various NLP techniques can aid the practice. In the second part of this paper, we focus on ATS. We first provide a brief review of generic summarisation techniques, and then present a discussion of the state-of-the-art in summarisation/question-answering technologies in this domain, pointing out the gaps in research that need to be filled to implement end-to-end systems. We then provide a relatively detailed discussion of summarisation approaches that have been applied to the medical domain, and analyse some recent summarisation approaches in detail. In particular, we attempt to answer the following questions:

- What is EBM and what are the major obstacles hindering EBM practice?
- What are the characteristics of text in the medical domain?
- What tools and resources are available for text processing in the medical domain?
- What is ATS and what is its relevance to EBM?
- What is the current state of research in ATS, particularly for the medical domain?
- How can ATS approaches for EBM be evaluated?

The rest of the article is organised as follows. In Section 2 we provide a detailed overview of EBM, the technological needs of the practice, and the challenges associated with it. We also discuss the tools and resources that are currently available to EBM practitioners. We review the literature associated with ATS in Section 3 and discuss some key techniques that have been applied in the past. In Section 4, we focus our review on question answering and summarisation techniques specific to the medical domain, including current state-of-the-art in evidence-based text summarisation. In Section 5, we briefly review some evaluation techniques for ATS and discuss their applicability to EBM. We conclude the paper in Section 6, and summarise the key discoveries from our survey.

2. EVIDENCE-BASED MEDICINE

The phrase ‘evidence-based medicine’ was defined initially as “a process of turning clinical problems into questions and then systematically locating, appraising, and using contemporaneous research findings as the basis for clinical decisions” [Rosenberg and Donald 1995]. A more concrete and widely accepted definition of EBM was coined by Sackett et al. [1996], who explained it as “the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.” As the practice has evolved, medical practitioners increasingly integrate, as required by the latest clinical guidelines, the latest clinical evidence with their own expertise. In addition, practitioners are also required to incorporate the informed patients’ choices. The long-term objective of EBM is to enhance the standard of healthcare, promote practices that are proven by evidence to work, and identify elements of the practice that do not work. A patient-oriented approach is key to the practice: combining the best patient-oriented evidence with patient-centred care, placing the evidence in perspective with the needs and desires of the patient [Slawson and Shaughnessy 2005]. In the following sections, we provide an overview of EBM, discuss the problems associated with EBM practice, the resources and tools available for text processing in this domain and the solutions that NLP can offer.

Incorporation of research evidence when making clinical decisions involves “the practice of assessing the current problem in the light of the aggregated results of hundreds or thousands of comparable cases in a distant population sample, expressed in the language of probability and risk” [Greenhalgh 1999]. EBM therefore involves the efficient use of information search strategies to locate reliable and up-to-date information from varying sources and extraction strategies to efficiently collect and analyse retrieved information. The practice includes a process formally known as the Critical Appraisal Exercise, which involves the following steps according to Selvaraj et al. [2010]:

(1) clearly defining what the problem of a patient is, and what evidence is needed to address the patient’s problems;
(2) performing searches for the relevant literature efficiently;
(3) selecting the best of the performed studies, and applying the guidelines of EBM to ascertain the validity of the studies;
(4) appraising the quality of the evidence effectively; and
(5) extracting and synthesising relevant evidence and applying it to the problem at hand.

Defining a Patient Problem as a Clinical Question.

Formulating the patient problem forms the basis for the clinical question, which is used to search resources for an evidence-based answer. A well formulated question includes information about a patient (symptoms, signs, test results and knowledge of previous treatments), the particular values and preferences of the patient, and other factors that could be relevant [Greenhalgh 2006]. All that information should be summarised into a succinct question defining the problem and the specific additional items of information needed to solve the problem.

There has been substantial research in the area of medical question formulation and query-focused summarisation, particularly, in recent years, in the field of EBM (e.g., the CLEF eHealth shared tasks [Suominen et al. 2013] are a recent example of the research drive in this area). This is because it has been shown that the answerability of questions can be largely increased by better query formulation, among other things [Gorman and Helfand 1995]. The PICO format, which has four components, has become the accepted framework for formulating patient-specific clinical queries [Richardson et al. 1995]. The four components are: “primary Problem/Population, main Intervention, main intervention Comparison, and Outcome of intervention.” The PICO components of queries represent key aspects of patient care, and have become very much the standard for formulating EBM queries. Although this query formulation framework was originally designed for treatment-like queries, it was later extended to other kinds of medical queries [Armstrong 1999]. Research has shown that this mechanism of framing questions helps in better formulating clinical problems as queries, resulting in more accurate information retrieval results [Booth et al. 2000; Cheng 2004].
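
To make the four PICO components concrete, the following is a minimal sketch of how a PICO-structured question might be held in code and flattened into a search string. The class name `PicoQuery`, its method, the `AND` joining convention and the example clinical question are all assumptions made for illustration; none of them is taken from the surveyed literature or from any particular retrieval system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PicoQuery:
    """The four PICO components of a clinical question (hypothetical container)."""
    problem: str                       # primary Problem/Population
    intervention: str                  # main Intervention
    comparison: Optional[str] = None   # main intervention Comparison (may be absent)
    outcome: Optional[str] = None      # Outcome of intervention (may be absent)

    def to_search_string(self) -> str:
        """Join the components that are present into one conjunctive query."""
        parts = [self.problem, self.intervention, self.comparison, self.outcome]
        return " AND ".join(f"({p})" for p in parts if p)


# Illustrative question only; the clinical content is made up for the example.
query = PicoQuery(
    problem="adults with hypertension",
    intervention="ACE inhibitors",
    comparison="beta blockers",
    outcome="stroke",
)
print(query.to_search_string())
```

Leaving the comparison and outcome optional reflects the observation that not every clinical question maps cleanly onto all four elements.
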
Although the PICO format has gained popularity with time, it is well known that not all clinical questions can be mapped in terms of PICO elements because of the complex nature of the information needs, and this is particularly true for non-therapy questions [Huang et al. 2006]. There is also evidence that even doctors find it difficult to formulate questions in terms of the PICO format [Ely et al. 2002]. Variants of the framework have been suggested (e.g., PESICO [Scholosser et al. 2006], PIBOSO [Kim et al. 2011]); they offer more flexibility and comprehensiveness and have applications beyond query formulation.

Conducting Literature Search.

The medical domain has a massive amount of available published literature, scattered over many databases (e.g., MEDLINE indexes over 24 million articles), and searching for relevant literature requires expertise in this area. Searching for appropriate literature can include searching raw databases, databases with search filters, databases of pre-appraised articles, databases of synthesised articles and even personal contact with human sources. Some research has been carried out on strategies for retrieving high quality medical articles [Haynes et al. 1994; Hunt and McKibbon 1997; Shonjania and Bero 2001; Montori et al. 2005]. Searching for the correct literature is a rather tedious task, and it is in fact one of the major problems associated with EBM practice [Ely et al. 2005].

Selecting the Best Resources.

The selected articles must be closely relevant to the problem at hand, and, simultaneously, they must have a good level of evidence. The level of evidence of a medical publication can be influenced by a number of factors such as the publication type and the sample size of the study. Checking the relevance of the papers requires a thorough analysis, and a ranking system is usually employed for examining the level of evidence of different sources. Medical publication types include Systematic Reviews (SR), Randomised Controlled Trials (RCT), Meta-Analyses (MA), single Case Studies, Tutorial Reviews and many more, including even personal opinions. Although all of them provide evidence of some form, their levels of evidence vary significantly. Figure 1, slightly modified from the one provided by Gilbody [1996], provides a ranking of some of the common medical article types (highest ranked on top).

Fig. 1: Quality of evidence with respect to publication types, adapted from Gilbody [1996].

A detailed analysis of these studies is outside the scope of this survey. Guidelines are available for healthcare practitioners to obtain information from each of these types of studies and evaluate their levels of evidence. Examples of guidelines available online include ASSERT (A Standard for the Scientific and Ethical Review of Trials), PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), and EQUATOR (Enhancing the QUAlity and Transparency Of health Research).

Identifying Strengths and Weaknesses.

Once relevant papers are identified, each paper must be studied in detail to extract the evidence it contains with respect to the problem at hand. Practitioners are particularly interested in what type of study was conducted, the number of subjects, demographic information about the subjects, the kind of intervention, the length of the follow-up period, and the outcome measures [Greenhalgh 2006]. For quantitative tests, analysing the results of statistical tests is also an important task. Identification of the strengths and weaknesses provides a clear indication of the quality of the evidence provided by each article, and hence helps the practitioner to make the final decision.

Applying Evidence to Patient Care.

Practitioners make their final judgment considering the outcomes presented in the article(s) and the relevance of the article(s) to the problem at hand. Often, a number of articles suggest the same solution or a similar one, making it easier for the practitioner to make a decision. However, there are also cases when chosen articles provide contradictory outcomes. In such cases, practitioners have to choose between outcomes based on personal experience, the quality of evidence of the articles, the closeness of the articles to the given problem or other sources of evidence. Note that the quality of evidence here refers to the overall quality of the evidence obtained from all the articles combined, whereas the level of evidence of a single article refers to the reliability of that particular article only.

(http://prisma-statement.org. Accessed on 15th Jan, 2016.)

Generally, countries have mandatory clinical practice guidelines that must be adhered to when performing EBM. Practitioners primarily navigate through these guidelines during practice. As such, it is actually often the providers/developers of clinical guidelines who are required to perform elaborate literature searches and follow the approaches mentioned above for the preparation of the appropriate practice guidelines.

There has been research to identify a grading system that is suitable for the practice of EBM, and numerous requirements for the systems have been specified, some more important than others in typical EBM settings. Atkins et al. [2004], for example, suggested two aspects, simplicity and clarity, so that identifying the quality of a body of evidence is fast and can be easily projected on to some grading scale. Ebell et al. [2004] mentioned the comprehensiveness of evidence appraisal systems as a key aspect so that the same grading system may be applied to different types of clinical studies such as treatment, diagnosis and prognosis.

EBM requires practitioners to stay up-to-date with the latest medical literature and use the latest research discoveries. Practitioners currently utilise a variety of sources to obtain information.

2.1.1 The MEDLINE Database and Similar Databases.

The MEDLINE database is managed by the National Library of Medicine (NLM), U.S.A., and it is the most popular source of up-to-date evidence [Taylor et al. 2003]. It indexes over 24 million articles covering a wide range of topics. It also provides various specialised indexing information and search techniques to aid retrieval. Despite the broad coverage of MEDLINE and its impressive collection, there are still journals that are not indexed by this database, particularly journals not published in the United States. Hence practitioners often have to refer to other similar databases that specialise in the required areas. The following is a list of databases that are relevant to various areas. Note that providing details of each database is beyond the scope of this survey, and therefore appropriate links to the databases are provided as footnotes. This list of databases is by no means exhaustive.

- Embase: This database is a popular resource for EBM, and covers over 8,500 biomedical journals. It indexes over 29 million articles, including all that are covered by MEDLINE. It is versatile and quickly updated.
- Allied and Complementary Medicine Database (AMED): This is an alternative medicine database that is designed for practitioners and researchers interested in learning about alternative treatments. Topics include chiropractic, acupuncture, and so on, and many of the journals indexed are not available from other databases.
- CINAHL: This is a dedicated database for nursing and allied health professionals. Nursing, health education, social services in health care and other related disciplines are covered by this database.

2.1.2 Databases of Synthesised Information.

One problem with MEDLINE and other ‘raw’ databases is that they contain articles of varying quality, from high quality SRs and MAs to informal, unreliable clinical trials. When specific topics are searched for in these databases, the results returned invariably contain a mixture of high and low quality articles. To address this problem, there are databases that only contain articles that are of high quality. A good example of such a database with synthesised evidence is the Cochrane Library. It contains peer-reviewed Cochrane Systematic Reviews, Systematic Reviews listed in the Database of Abstracts of REviews (DARE), and selected published clinical trials in their Central Registry of Controlled Trials. However, the number of articles contained in this library is minute compared to the total number of medical articles available, and the scope of topics covered is understandably limited. Despite this, the Cochrane Library is very much the first port of call for clinical researchers looking for quality articles.

(http://store.elsevier.com/embase. Accessed on 15th Jan, 2016.)

Among such high quality databases, there are also guideline databases such as the National Guideline Clearing House and National Institute for Health and Clinical Excellence (NICE) that provide evidence-based clinical practice guidelines. Many countries have their own national clinical practice guidelines, and sometimes their use is mandated by law (e.g., in Finland). There are also sources that not only synthesise the best available information but also present it in readily usable formats, such as Clinical Evidence (CE), Evidence-Based On-Call (EBOC), ACP Journal Club, Journal of Family Practice, Evidence Based Medicine, and UpToDate.

There are several search engines specialised for the medical domain, and about.com provides an analysis of the top five search engines in the medical domain. The list includes PubMed, OmniMedicalSearch, WebMd, Healthline and HealthFinder. Generic search engines are also frequently used to search for evidence and they, particularly Google, have evolved into effective search tools for online full text peer reviewed journals [Greenhalgh 2006]. According to a study by Tutos and Mollá [2010] that assessed the abilities of search engines to identify the correct information for meeting the information needs of clinical queries, Google performs better than any other system, including PubMed, which is specialised for medical text. This indicates the strength of generic search engines and their emergence as a useful tool in medical article retrieval and EBM practice.

Ely et al. [2002] conducted a well known study to identify and explain the obstacles practitioners face when they attempt to answer clinical questions using evidence. The study revealed fifty-nine obstacles, which were divided into the following five broad categories: (i) recognising that there is a void in the knowledge of a practitioner on a specific topic, (ii) formulating a clinical query that accurately and comprehensively encapsulates the information needs of the practitioner, (iii) performing efficient searches for the relevant information, (iv) synthesising multiple and incomplete pieces of evidence to formulate the final response, and (v) using the formulated answer to make decisions about patient care, taking into account other relevant information such as the medical history of the patient. From the fifty-nine obstacles, Ely et al. [2002] also identified a few that are considered to be critical by practising doctors, which are as follows:

(1) The amount of time needed to find information is commonly seen as the most crucial obstacle in EBM practice. Most busy doctors lack the time or skills to track down and evaluate evidence, and when searching for evidence, practising doctors do not have time to search multiple resources. Literature search and appraisal may take hours and even days. According to Hersh et al. [2002], it takes more than 30 minutes on average for a practitioner to search for an answer, but practitioners usually spend about 2 minutes [Ely et al. 2000]. Hence, many questions go unanswered. Ely et al. [2005] classify this problem as a ‘resource-related’ problem and state that physicians want rapid access to concise answers that are easy to find and tell them what to do in specific terms.
(2) It is often difficult to structure a question that includes all the important information and is not vague. Recent research in medical information retrieval has focused on query formulation and other aspects of information retrieval to aid practitioners [Heppin and Jarvelin 2012; Kelly et al. 2014].
(3) It is difficult to select an optimal strategy to search for information as the medical literature is unwieldy, disorganised and biased [Godlee 1998], and although electronic databases facilitate searching, the procedure requires skill and expertise. Even for an experienced librarian, the search to find a comprehensive set of documents related to a clinical query may take hours. Other difficulties related to information seeking include the presence of large amounts of irrelevant material in search results, and the difficulty in finding correct search terms [Verhoeven et al. 2000].
(4) An identified resource may not cover the specific topic, and often practitioners may not have timely access to the necessary resources required to answer a question [Young and Ward 2001].
(5) It is often difficult to tell when all the relevant evidence associated with a topic has been found. The number of relevant documents can vary significantly between topics, and, therefore, physicians are often left unsure as to when they should stop searching [Ely et al. 2002; Green and Ruff 2005].
(6) The synthesis of information: once all the relevant information has been found, it is a daunting task to synthesise the information from different sources. Studies have shown that practitioners frequently mention difficulties with generalising research findings and applying evidence to individual patient care [Young and Ward 2001]. The reasons behind this include the inability of any single source to completely answer the clinical question, the selected articles not directly answering the clinical question, and different articles providing contradictory information [Ely et al. 2005, 2002].

(https://acpjc.acponline.org/. Accessed on 15th Jan, 2016.)
(http://ebm.bmj.com/. Accessed on 15th Jan, 2016.)
(http://uptodate.com/home/about/index.html. Accessed on 15th Jan, 2016.)
(http://websearch.about.com/od/enginesanddirectories/tp/medical.htm. Accessed on 15th Jan, 2016.)

Despite the presence of many barriers, EBM practice has gained popularity over recent years for a number of reasons, including its promise of improving patient healthcare in the long run. As for the barriers, advances in technology are gradually eliminating them and making the process more efficient. From the problems and barriers specified earlier in this section, it can be inferred that the next boost in EBM practice will come from research in NLP.

NLP offers suitable solutions to the problems faced by EBM practitioners, particularly for problems associated with information overload. Ideally, practitioners require bottom-line answers to their queries, along with estimates of the quality of the evidence. They require fast access to information, and evidence-backed reasoning for recommendations [Ely et al. 2005]. NLP has the potential of addressing all these requirements. For example, Query Analysis can be used to understand and expand practitioners’ queries through the use of domain-specific semantic information. Queries posed by practitioners are often very short, with an average of 2.5 words [Hoogendam et al. 2008], and therefore existing techniques for ontology-driven query reformulation [Schwitter 2010] can be built upon to help practitioners compose and refine queries. Information Retrieval techniques tailored for the medical domain can be used to increase the recall and precision of literature searches. Strength of recommendation values can be used to classify documents to make searches more reliable. There has already been some research in this area with promising outcomes [Karimi et al. 2009; Pohl et al. 2010]. Information Extraction techniques which incorporate domain knowledge (e.g., MeSH terms, UMLS) can be used to extract relevant information, based on practitioners’ questions, from the retrieved documents. This is also an area that is being explored by medical informatics researchers, and knowledge-based and statistical techniques have produced promising results [Demner-Fushman and Lin 2007]. Furthermore, some research has attempted to extract semantic information and use that information to represent the main concepts presented in documents [Fiszman et al. 2003, 2004, 2009]. Research in the area of information extraction is closely related to that of Document Summarisation. The goal of summarisation in this context is to summarise the content extracted from multiple documents and present it to the users (i.e., specific bottom-line recommendations). Summarisation of multiple documents is a well explored area (e.g., in the news domain); however, its application to the medical domain is still quite limited. Successful summarisation of medical documents and effective presentation of summarised information to the user are key to the successful development of end-to-end question answering systems that can be used for EBM practice. More specifically, query-focused multi-document summarisation that incorporates medical domain knowledge is an area of research that is worth exploring, and success in this area will significantly advance the practice of EBM. For the rest of this survey, we cover automated summarisation techniques and their applications to the medical domain.
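
As a rough illustration of the query analysis idea discussed in this section, the sketch below expands a short practitioner query with synonyms. The table `TOY_SYNONYMS` is a hand-written, hypothetical stand-in for a real resource such as MeSH or UMLS, and `expand_query` is an assumed helper name; this is not the ontology-driven method of any cited work, only a minimal example of the general mechanism.

```python
from typing import Dict, List

# Hypothetical, hand-written synonym table standing in for a real ontology
# such as MeSH or UMLS; the entries are illustrative only.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "heart attack": ["myocardial infarction"],
    "high blood pressure": ["hypertension"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> str:
    """Rewrite each known phrase in the query as an OR-group of its synonyms."""
    expanded = query.lower()
    for phrase, alternatives in synonyms.items():
        if phrase in expanded:
            group = " OR ".join([phrase] + alternatives)
            expanded = expanded.replace(phrase, f"({group})")
    return expanded


print(expand_query("heart attack aspirin", TOY_SYNONYMS))
```

A query of two or three words, as is typical for practitioners, gains recall from such expansion because documents using the clinical term are matched as well as those using the lay term.
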

3. AUTOMATED TEXT SUMMARISATION

In this section, we review the basics of automated summarisation techniques, important domain-independent approaches, breakthroughs in this research area, and approaches that have the potential of being applied to the medical domain. Since the field of ATS is too broad to be discussed in a single review, we have attempted to keep this section short with references to detailed literature for the interested reader.

According to Radev et al. [2002], the intent of a summary is to express the informative contents of a document in a compressed manner. Mani [2001] provides a more formal definition and explains that the process of summarisation involves “taking an information source, extracting content from it, and presenting the most important content to the user in a condensed form and in a manner sensitive to the user’s or application’s needs.” Sparck Jones [1999] explains that summarisation is a hard task because it requires the characterisation of a source text as a whole, “capturing its important content, where content is a matter of both information and its expression, and importance is a matter of what is essential as well as what is salient.”

The motivation for building automated summarisation systems has increased over time due to the increasing availability of web-based textual information, and the explosion of available information has necessitated intensive research in this area [Mani et al. 2000]. The number of online text documents has been increasing exponentially over recent years, and this is also true for the medical domain [Sarker et al. 2016]. Consequently, significant progress in automated summarisation has been made in the last two decades [Sparck Jones 2007].

The prime advantage of having a large amount of available information is the redundancy in it. Research has shown how summarisation systems can exploit this redundancy [Barzilay and McKeown 2005; Clarke et al. 2001; Dumais et al. 2002], particularly when summarising from multiple documents. As will be discussed later, many summarisation systems rely on redundancy (e.g., frequently occurring words, concepts, etc.) to generate automated summaries. In EBM, redundancy is beneficial as it gives stronger evidence for a given finding. On the other hand, the abundance of information also introduces the need for efficient information retrieval and extraction, both of which are difficult tasks. In many cases, identification of relevant information requires elaborate manual searching through redundant information, which is often quite time consuming and therefore rather inefficient [Barzilay and McKeown 2005]. “It has been realised that added value is not gained merely through larger quantities of data, but through easier access to the required information at the right time and in the most suitable form” [Afantenos et al. 2005]. This gives rise to the need for technologies that can gather required information for the users and present it in a simplified, concise and friendly manner. Thus, we can say that while the large amount of information is a necessary condition for the development of automated summarisation systems, it also introduces the problem of information overload.
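
The reliance on redundancy mentioned above (frequently occurring words) can be illustrated with a minimal extractive sketch: each sentence is scored by the average corpus frequency of its words, and the top-scoring sentences are returned. The stop-word list, the function names and the three toy sentences are assumptions made for this example; the sketch is a generic frequency heuristic, not the method of any system cited in this survey.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "with"}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and return its alphabetic, non-stop-word tokens."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarise(sentences: List[str], n: int = 1) -> List[str]:
    """Return the n sentences whose words are most frequent across the input."""
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence: str) -> float:
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favoured by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(sentences, key=score, reverse=True)
    chosen = set(ranked[:n])
    # Emit the selected sentences in their original order.
    return [s for s in sentences if s in chosen]


# Toy sentences invented for the example.
docs = [
    "Aspirin reduces the risk of stroke.",
    "Aspirin reduces stroke risk in adults.",
    "The trial enrolled patients from two hospitals.",
]
print(summarise(docs, n=1))
```

Because the first two sentences repeat the same finding, their words receive higher counts, which is how repeated evidence ends up dominating the summary.
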

Automated text summarisers must take into account a number of factors to achieve their goals. Here we discuss some of these factors, knowledge of which is essential to understand the process of text summarisation. The factors affecting ATS can be grouped into three main categories: input, purpose and output. We primarily draw the following information from existing literature [Sparck Jones 1999; Mani 2001; Afantenos et al. 2005; Sparck Jones 2007], and provide references to articles containing more detail in specific cases.

3.2.1 Input.

The following factors are associated with the inputs of a summarisation system:

Unit - Summarisers can take as input either one document or multiple documents. Summarising multiple documents represents a more difficult task than summarising single documents, and additional algorithms are required to overcome problems of redundancy (different documents might present identical information which could lead to repetitions in the summary), inconsistency (the information presented in distinct documents may not be consistent), and incoherence (information extracted from separate documents may not be coherent when presented together).

Language - A summarising system can be mono-lingual, multi-lingual or cross-lingual.

Domain - Summarisation systems can either be domain-specific or domain-independent. Domain-independent approaches are applicable to documents from various domains without requiring any changes in the algorithm. While having the benefit of portability to different domains, domain-independent approaches fail to take advantage of knowledge and resources available to specific domains. Domain-specific systems are designed for specific domains and use the resources and knowledge available for the relevant domain. There has been significant research in text summarisation specific to domains such as news [Barzilay and McKeown 2005; McKeown et al. 2002], scientific [Plaza et al. 2011], legal [Farzindar and Lapalme 2004], and medical [McKeown et al. 1998; Demner-Fushman and Lin 2007; Rindflesch et al. 2005].

Structure - This refers to the external structure of documents, such as headings, boxes, tables, and also more subtle exploitable structural information such as rhetorical patterns. The structure of documents can vary between domains. For example, news stories do not generally have much explicit structure other than top-level headlines [Sparck Jones 2007], while technical articles contain more structure which can be exploited by summarisation systems [Elhadad and McKeown 2001; McKeown et al. 1998; Saggion and Lapalme 2002; Demner-Fushman and Lin 2007]. Medical article abstracts may be structured or unstructured, and this factor plays a vital role in the performance of summarisers.

Meta-data - In some cases, header information or meta-data associated with input documents play an important role in summarisation. For example, dates of publications are crucial for news items. In the case of the medical domain, databases such as MEDLINE use meta-data to indicate the key topics in each article.

3.2.2 Purpose.

These factors are associated with the purpose of the summaries. In other words, these factors determine what a summary is for or what a summary is like.

Summary Type - Summaries can either be extractive or abstractive. Extractive summarisation involves extracting content from the source document(s) and presenting it as the summary. Abstractive summarisation, in contrast, involves discovering the most salient contents in the source document(s), aggregating them, and presenting them in a concise manner (e.g., via techniques that automatically generate natural language) [Afantenos et al. 2005].

Information - Summaries can be generic or user-oriented. Generic summaries only take into account the information found in the input document(s), whereas user-oriented summaries attempt to extract and summarise only the information that is needed to fulfil the information requirements of the users. The summarisation query enables the user to formulate the information needs.

Use - This is the most important purpose factor [Sparck Jones 2007], and has a major influence on summary content and presentation.

Audience - Summaries must take into account the intended audience. Different summaries can be generated from the same input sources to suit the requirements of the audience. News material, for example, “has at least two audiences: ordinary readers and analysts” [Sparck Jones 2007].

Time - Time is a critical factor for some summarisation systems. For example, query-oriented summaries need to be delivered promptly, and thus, corresponding systems need to be very fast.

3.2.3 Output.

These factors are associated with the summarised output.

Coverage - Coverage of a set of sources by a summary can be either comprehensive or selective [Sparck Jones 2007]. Reflective summaries are comprehensive (i.e., they summarise the whole source text), while query- or topic-oriented summaries are selective (i.e., they summarise only a segment of the source text that contains information associated with the query/topic) [Sparck Jones 2007].

Compression - This refers to the amount by which the information from the source(s) is reduced during summarisation.

Structure - The output of a summary may be plain text or the information may be represented using tables (e.g., Mani et al. [2000]), forms (e.g., Farzindar and Lapalme [2004]), and lists (e.g., Radev et al. [2000]), to name a few.

Early Summarisation Approaches.

The earliest known work on automated summarisation dates back to 1958, when Luhn [1958] proposed that the frequency of words could provide a measure of their importance in a document. In his work, he ranked words based on their frequencies and used the ranking of the individual words in a sentence to calculate the significance of that sentence, finally selecting the top-ranked sentences as the summary. Baxendale [1958] used sentence position as a feature for selecting important sentences in a document (e.g., in news text, early sentences are more important than later sentences). Edmundson [1969] extended this work on summarisation and proposed an extensible model for extractive summarisation, which later came to be known as the Edmundsonian Paradigm [Mani 2001]. In this proposed model, the author used a weighted linear equation to combine four features to score text nuggets (e.g., sentences) for summary inclusion:

$$W(s) = \alpha C(s) + \beta K(s) + \gamma L(s) + \delta T(s) \tag{1}$$

where $\alpha$, $\beta$, $\gamma$ and $\delta$ are manually assigned weights; $W$ is the overall weight of sentence $s$; $C$ represents the score given to sentence $s$ due to the presence of cue words (bonus words or stigma words) extracted from a corpus; $K$ represents the score given for key words (based on word frequency); $L$ is the score given based on sentence location features; and $T$ is the weight assigned based on terms in the sentence that are also present in the title. Such a statistical approach to sentence selection has been very popular in the research community, with Earl [1970] being the first to experiment with a variety of shallow lexical features. More recently, numerous statistical approaches using multiple words, noun phrases, main verbs, and named entities have been proposed [Barzilay and Elhadad 1997; Lin and Hovy 2000; Harabagiu and Lacatusu 2005; Hovy and Lin 1999; Lacatusu et al. 2003], while some research focused on utilising shallow lexical items with salient properties (e.g., subheadings) [Teufel and Moens 1997; Chakrabarti et al. 2001] or on the analysis of topical content [Ando et al. 2000]. The importance of the discourse structure of a text was realised quite early in summarisation research, and this property has been heavily exploited ever since [Hearst 1994; Marcu 1998; Hahn and Strube 1997].

3.3.2 Recent Advances in Automatic Summarisation.

Following the early summarisation research, work on summary generation and evaluation techniques has been boosted by the Document Understanding Conference (DUC) and other similar regular workshops and initiatives. They include the NII Test Collection for Information Retrieval Project (NTCIR), the Text REtrieval Conference (TREC), and the Text Analysis Conference (TAC), which was initiated from the Text Summarisation track of the DUC and the Question Answering track of the TREC. Since the commencement of widespread research on ATS, a branch of research has focused on the application of machine learning algorithms for text summarisation. In most of the initial research, machine learning was employed sparingly, as preliminary steps [Sparck Jones 2007]. Early research mostly assumed feature independence and used the Naïve Bayes classifier [Kupiec et al. 1995; Lin and Hovy 1997; Aone 1999] with various features including those used in the Edmundsonian Paradigm. Later research introduced the use of richer feature sets and a range of machine learning algorithms. For example, Lin [1999] used Decision Trees, Conroy and O’Leary [2001] employed Hidden Markov Models, Osborne [2002] applied a Log Linear Model for sentence extraction, Svore et al. [2007] used Neural Nets, and Schilder and Kondadadi [2008] used Support Vector Machines. Interested readers can find more information about these approaches and other important related work in Sparck Jones [2007].

While one branch of research focused on machine learning approaches, another branch progressed research on natural language analysis techniques. Miike et al. [1994] and Marcu [1998, 1999, 2000] use the Rhetorical Structure Theory (RST) by building RST source text trees using discourse structure information, and identifying the nuclei of the trees to build the summaries. Polanyi et al. [2004] and Thione et al. [2004] utilise the PALSUMM model, which uses more abstract discourse trees. Barzilay and Elhadad [1997] show the use of lexical chains (sequences of related words that can span short or long distances) for single-document summarisation. Templates have also been used: for example, McKeown et al. [1998] and Elhadad and McKeown [2001] use a template suited to the medical domain; McKeown and Radev [1995] and Radev and Mckeown [1998] fill template slots from a database of news information as a first step of their algorithm for news summarisation; and Sauper and Barzilay [2009] show the use of templates to automatically generate Wikipedia articles.

3.3.3 Multi-document Summarisation.

The field of multi-document summarisation was pioneered by the SUMMONS [McKeown and Radev 1995; Radev and Mckeown 1998] summarisation system, which employed a template-based summarisation approach. Extractive summarisation systems have been shown to work well for multiple documents, particularly in the news domain, an example being the MEAD [Radev et al. 2000] system. Multi-document summarisation approaches suffer from the problems of incoherence and redundancy, and numerous approaches have been proposed to address them.

(Footnotes: NTCIR, http://research.nii.ac.jp/ntcir/index-en.html; TREC, http://trec.nist.gov/. Both accessed on 15th Jan, 2016.)

One popular approach to reduce redundancy is the use of clustering. In this technique, common themes or concepts across document sets are identified and grouped or clustered together. Once the clusters are created, the summary can be generated by applying various algorithms that depend primarily on the content and compression needs. For example, some use a single sentence to represent each cluster in the final summary [McKeown and Radev 1995; Radev et al. 2000; Yih et al. 2007], while some generate a composite sentence from each cluster through the use of information fusion so as to combine the most salient concepts from multiple sentences within a cluster [Barzilay and Elhadad 1997; Barzilay et al. 1999; Barzilay and McKeown 2005].

Another approach that has been successfully applied to limit the selection of redundant information when performing summarisation is Maximal Marginal Relevance (MMR). In this technique, which is particularly useful for query-focused summarisation, text segments that are deemed relevant to a query, usually via some text similarity measure, are rewarded. At the same time, redundant text segments are penalised using a similarity measure between segments. The two similarity measures are combined using a linear equation in an attempt to balance redundancy and relevance. MMR was initially utilised for document retrieval and is given by the following formula:

$$\mathrm{MMR} \equiv \underset{D_x \in R \setminus S}{\arg\max} \left[ \lambda\, Sim_1(D_x, Q) - (1 - \lambda) \max_{D_y \in S} Sim_2(D_x, D_y) \right] \tag{2}$$

where, as explained by Carbonell and Goldstein [1998]: “$Q$ is a query; $R = IR(C, Q, \theta)$, i.e., the ranked list of documents retrieved by an IR system given a document collection $C$ and a relevance threshold $\theta$; $S$ is the subset of documents in $R$ already selected; $R \setminus S$ is the set difference, i.e., the set of unselected documents in $R$; $Sim_1$ is the similarity metric used in document retrieval and relevance ranking between documents and a query; and $Sim_2$ can be the same or a different metric.”

Graph-based approaches have also been applied to text summarisation [Mani and Bloedorn 1997; Mani and Maybury 1999; Erkan and Radev 2004; Leskovec et al. 2005], with Mani and Bloedorn [1997] being the pioneers in this area. In their approach, the authors use nodes to represent words, and edges between nodes represent relationships. The summaries generated can be topic driven, and there is no text in the summaries. Instead, the summary content is represented via nodes and edges that represent contents and the relations between them. When summarising a pair of documents, common nodes represent the same words or synonyms, while difference nodes are those that are not common. Sentence selection from the graph is computed from the average activated weights of the covered words: for a sentence $s$, its score in terms of coverage of common nodes is given by the following formula:

$$\mathrm{score}(s) = \frac{1}{|c(s)|} \sum_{i=1}^{|c(s)|} \mathrm{weight}(w_i) \tag{3}$$

where $c(s) = \{ w \mid w \in \mathit{Common} \cap s \}$. The score for differences is calculated similarly. The sentences with higher common and difference scores are selected for the final summary.

Erkan and Radev [2004] presented the LexRank system for multi-document summarisation, which is a graph-based system using a connected, undirected graph to represent documents.
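Returning to the MMR criterion of Equation (2), the greedy selection it implies can be sketched as follows. This is a minimal illustration rather than the implementation of Carbonell and Goldstein [1998]: the function names `cosine` and `mmr_select` are hypothetical, both $Sim_1$ and $Sim_2$ are assumed to be cosine similarity over whitespace-tokenised bag-of-words vectors, and the default value of `lam` is arbitrary.

```python
from collections import Counter
from math import sqrt


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two strings (assumed Sim)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def mmr_select(candidates, query, k, lam=0.5):
    """Greedily pick up to k segments, applying the Equation (2) criterion at each step."""
    selected = []               # S: segments already chosen
    remaining = list(candidates)  # R \ S: segments not yet chosen

    def mmr_score(segment):
        relevance = cosine(segment, query)  # Sim_1(D_x, Q)
        # max over S of Sim_2(D_x, D_y); zero while S is still empty
        redundancy = max((cosine(segment, s) for s in selected), default=0.0)
        return lam * relevance - (1 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the loop reduces to pure relevance ranking, so near-duplicate segments can both be selected; lowering `lam` increasingly favours segments that differ from those already chosen.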
A method similar to LexRank, suitable for single-document summarisation only, was proposed by Mihalcea and Tarau [2004]. Other graph-based approaches have been proposed, both in the medical domain [Reeve et al. 2007; Fiszman et al. 2004] and outside it [Litvak and Last 2008].

Similar to graph-based techniques are centroid-based summarisation techniques, first proposed by Radev et al. [2000, 2004]. The summarisation is done in three stages. The first stage involves grouping together news articles on the same topics (topic detection), with the goal of clustering together news stories based on the events they discuss. These groups are called clusters and the centroid for each cluster is computed. The centroid of a cluster is a pseudo-document consisting of a set of highly relevant words. The relevance of these words is determined using Term Frequency-Inverse Document Frequency ($tf \times idf$) measures: those having $tf \times idf$ values above a predetermined threshold are included in the centroid. The $tf$ and $idf$ values are computed using the following equations:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}; \qquad idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \tag{4}$$

where $n_{i,j}$ is the occurrence frequency of the considered term $t_i$ in document $d_j$, and $\sum_k n_{k,j}$ is the sum of the number of occurrences of all the terms in $d_j$; $|D|$ is the total number of documents the corpus contains, and $|\{d : t_i \in d\}|$ is the total count of documents where the term $t_i$ occurs. In the second stage, salient topics are identified via the use of centroids, and the overall score for each sentence is computed by combining three distinct values (centroid, positional, and overlap with the first sentence) minus a penalty for redundancy.
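The centroid construction described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the MEAD implementation: documents are assumed to be pre-tokenised lists of words, the natural logarithm is used for $idf$, a term's $tf$ is averaged over the documents of the cluster before being multiplied by its $idf$, and the function names (`tf`, `idf`, `centroid`, `centroid_score`) are hypothetical.

```python
from collections import Counter
from math import log


def tf(term, doc):
    """tf_{i,j} of Equation (4): term count divided by the document's token count."""
    return Counter(doc)[term] / len(doc) if doc else 0.0


def idf(term, corpus):
    """idf_i of Equation (4): log of corpus size over the number of documents with the term."""
    df = sum(1 for doc in corpus if term in doc)
    return log(len(corpus) / df) if df else 0.0


def centroid(cluster, corpus, threshold):
    """Pseudo-document: cluster terms whose tf x idf value exceeds the threshold."""
    vocabulary = {term for doc in cluster for term in doc}
    values = {}
    for term in vocabulary:
        # Assumption: average the term's tf over the cluster's documents.
        mean_tf = sum(tf(term, doc) for doc in cluster) / len(cluster)
        value = mean_tf * idf(term, corpus)
        if value > threshold:
            values[term] = value
    return values


def centroid_score(sentence, centroid_terms):
    """Centroid value of a tokenised sentence: sum of the centroid weights of its words."""
    return sum(centroid_terms.get(token, 0.0) for token in sentence)
```

Only the centroid component of the sentence score is shown; the positional value, the overlap with the first sentence, and the redundancy penalty would be combined with it as described in the text.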

4. SUMMARISATION AND QUESTION ANSWERING FOR THE MEDICAL DOMAIN

In this section, we provide an overview of text summarisation and Question Answering (QA) approaches targeted towards the medical domain. The goal here is to present some of the important research work in this area and analytically review more recent and promising approaches. The medical domain itself is quite broad, and so we attempt to adhere to research work that is relevant for EBM. It must be mentioned that, in the recent past, there has been steady ongoing research in biomedical and medical text processing [Afantenos et al. 2005; Zweigenbaum et al. 2007]. However, compared to other domains, there is very little published research in summarisation and question answering [Fiszman et al. 2009]. A variety of tools and resources have also been made available to aid summarisation approaches in this domain.

Summarisation of text in the medical domain requires the incorporation of significant amounts of domain-specific knowledge. There are numerous tools and knowledge resources currently available. Here we briefly introduce some of the important ones, starting with the resources.

4.1.1 UMLS.

The Unified Medical Language System (UMLS) is a system and resource developed by the U.S. National Library of Medicine (NLM) for biomedical vocabularies. The development and maintenance of this system is targeted to aid language processing systems in the biomedical domain. The goal is to create standards and mappings for terminologies, and relationships between terminologies that automated systems can understand. More specifically, it was developed as “an effort to overcome two significant barriers to the effective retrieval of machine-readable information: the variety of names used to express the same concept and the absence of a standard format for distributing terminologies” [Lindberg et al. 1993]. The UMLS consists of the following three major components:

Metathesaurus - This is the major component of the UMLS and consists of a repository of inter-related biomedical concepts and terms obtained from several controlled vocabularies and their relationships.

Semantic Network - This provides a set of high-level categories and relationships used to categorise and relate the entries in the Metathesaurus. Each concept in the Metathesaurus is given a ‘semantic type’ and certain ‘semantic relationships’ may be present between members of the various semantic types. For example, a disease mention (e.g., headache) may have an is-treated-by relationship with a drug mention (e.g., aspirin).

SPECIALIST Lexicon - This is a database of lexicographic information for use in NLP. Each entry in it contains syntactic, morphological and orthographic (spelling) information.

4.1.2 SNOMED CT.

SNOMED CT is the most comprehensive source of medical terminology and is a standard in the U.S. for the electronic exchange of health information. It can be accessed via the NLM and the National Cancer Institute (NCI). It is one of the controlled vocabularies used by the UMLS.

4.1.3 MetaMap.

MetaMap is a software tool, developed and managed by the NLM, that maps biomedical text to the UMLS Metathesaurus. It also employs word sense disambiguation and a part-of-speech tagger that is designed specifically for biomedical text.

4.1.4 SemRep.

SemRep is a program also developed at the NLM. It identifies semantic predications (i.e., subject-relation-object triples) in biomedical natural language. It has been used for a variety of biomedical applications, including automated summarisation, literature-based discovery and hypothesis generation.

4.1.5 Other Tools.

There are other text processing tools that have been used in the past for medical text processing although their functionalities are not restricted to this domain. The following is a brief list:

GATE - A Java platform from the University of Sheffield which, according to its website, is capable of solving almost any text processing problem.

LingPipe - A useful, Java-based language processing tool that is free for non-commercial use and is widely used for research.

OpenNLP - A source for a variety of Java-based NLP tools that can perform a range of basic and advanced text processing tasks.

(Footnotes: MetaMap, http://metamap.nlm.nih.gov, accessed on 14th May, 2014; SemRep, http://semrep.nlm.nih.gov, accessed on 14th May, 2014; GATE, http://gate.ac.uk, accessed on 17th May, 2014; LingPipe, http://alias-i.com/lingpipe, accessed on 18th May, 2014; OpenNLP, http://opennlp.sourceforge.net, accessed on 25th May, 2014.)

A number of factors make the medical domain a complex and interesting one for text processing. They include: large volume of data (e.g., about 24 million articles in MEDLINE alone); highly complex domain-specific terminologies (e.g., drug names and disease names); domain-specific linguistic and ontological resources (such as UMLS); and software tools and methods for identifying semantic concepts and relationships (such as MetaMap [Aronson 2001]).

The integration of technology with medical practice was initiated through the use of Clinical Decision Support (CDS) systems. Such systems help practitioners “make clinical decisions, deal with medical data or with the knowledge of medicine necessary to interpret such data” [Shortliffe 1990]. Early CDS systems consisted mostly of applications that facilitated diagnosis, treatment [Shortliffe et al. 1979; Miller et al. 1982] and the management of patients (e.g., through computerised guidelines and alerts) [Barnett et al. 1983]. However, the capabilities of such applications were quite limited, primarily because they did not have access to raw medical data [Friedman and Hripcsak 1999].

Since a lot of medical data is textual, the need to integrate NLP with CDS systems became more noticeable as the volume of medical data increased. This requirement introduced new challenges as well, such as faster processing [Demner-Fushman et al. 2009], which the latest research in large-scale distributed processing has successfully addressed. A detailed discussion of CDS systems is outside the scope of this paper and, in the following subsections, we focus instead on a subset of CDS systems that attempt to provide answers to clinical queries by summarising the information contained in medical texts. Such systems consist of information extraction, summarisation, and QA components, and we do not distinguish between these three types. Identifying and presenting evidence in a condensed manner is essentially a task of summarisation. Hence we refer to these systems/components as summarisation systems. For QA systems, we primarily discuss their summarisation components. For the interested reader, Friedman and Hripcsak [1999] provide a review of early NLP-based CDS approaches, and a recent and detailed analysis of CDS systems and NLP for the medical domain is provided by Demner-Fushman et al. [2009]. Readers unfamiliar with QA may refer to Mollá and Vicedo [2007] for an overview of QA techniques in restricted domains and to Athenikos and Han [2009] for a review of QA approaches in the biomedical domain.

Information Extraction Approaches.

Early summarisation systems were mostly concerned with extracting relevant information from structured or unstructured medical text. The Linguistic String Project (LSP) [Sager et al. 1994] is an example of early medical NLP work. Its primary purpose is the transformation of clinical narratives into formal representations. MedLEE [Friedman 2005] also extracts information from clinical narratives and presents it in structured form through the use of a controlled vocabulary. It is used for processing various forms of notes and reports and is integrated with a clinical information system. HiTEx [Zeng et al. 2006] is yet another system used for extracting findings such as diagnoses and family history from clinical narratives through the use of NLP techniques. TRESTLE (Text Retrieval Extraction and summarisation Technologies for Large Enterprises) [Gaizauskas et al. 2001] is an information extraction system that generates one-sentence summaries of pharmaceutical newsletters through the identification of Named Entities (NEs) followed by sentence extraction based on the presence of key NEs. Drug and disease names are considered by this system to be named entities. Hahn et al. [2002] present MEDSYNDIKATE, a natural language processor that automatically extracts clinical information from reports. The contents of the texts are transferred to conceptual representations that correspond to a knowledge base, and the system incorporates domain knowledge to semantically interpret major syntactic patterns in medical documents. Xu et al. [2010] propose the MedEx system, which uses NLP to extract medication information from clinical notes with very high recall and precision.

Extractive Summarisation Approaches. Most summarisation systems in this domain applied extractive summarisation approaches, like the initial MiTAP system (MITRE Text and Audio Processing) [Damianos et al. 2002], which was targeted towards the monitoring of infectious disease outbreaks or other biological threats. MiTAP monitors various sources of information such as online news, television news, and newswire feeds, and captures information which is filtered and processed to identify sentences, paragraphs, and POS tags. The final summary is generated by WebSumm [Mani and Maybury 1999] as extracted sentences from the processed text. Reeve et al. [2007] propose a single-document, extractive summarisation approach that combines BioChain [Reeve et al. 2006], which identifies relevant sentences using concept-chaining (similar to lexical chaining but applied to UMLS concepts), and the FreqDist system [Reeve et al. 2006], which uses a frequency distribution model to identify relevant sentences. This hybrid approach, called ChainFreq, sequentially employs the BioChain method to identify candidate sentences that contain relevant concepts, and then the FreqDist method, which generates a set of summary sentences from the chosen sentences. Other extractive approaches have been proposed with various intents: Mihalcea [2004] presents an approach for automated sentence extraction using graph-based ranking algorithms; Elhadad [2006] proposes extractive algorithms for performing user-sensitive text summarisation; and, more recently, Ben Abacha and Zweigenbaum [2011] put forth some summarisation approaches for the automated extraction of semantic relations between medical entities.

Non-extractive Approaches. MUSI (MUltilingual summarisation for the Internet) [Lenci et al. 2002] is an early system that applies semantic information thoroughly. It follows an approach similar to the ones already mentioned for sentence extraction but also has the capability of producing semantic representations of the extracted sentences to produce an abstractive summary. Other than the MUSI system, early research in this area did not consider abstractive summarisation. Recently, however, a number of abstractive summarisation systems have been proposed (e.g., PERSIVAL [Elhadad and McKeown 2001], MedQA [Lee et al. 2006a; Yu et al. 2007], EpoCare [Niu et al. 2005, 2006]). We discuss some of these approaches in more detail in the following subsections.

4.3.2 Progress in Medical Summarisation and Question Answering.

In recent years, medical document summarisation has received significant research attention, with approaches attempting to make use of greater computational power and the availability of more sources of domain knowledge. A variety of document types are now addressed by the different approaches, and QA systems have been developed specific to this domain. Most work on query-focused summarisation in this research area has been carried out under the broader domain of QA. While early NLP focused on domain-independent QA, restricted domain QA has received focus over the last decade [Mollá and Vicedo 2007]. Restricted domain QA, particularly in the medical domain, requires the incorporation of terminological and specialised domain-specific knowledge [Zweigenbaum 2003, 2009].

Some research has focused primarily on question analysis and coarse-grained classification as the first step for medical QA [Athenikos and Han 2009]. Yu and Sable [2005], Yu et al. [2005, 2007] and Yu and Cao [2008], for example, study the answerability of clinical questions and attempt to classify clinical questions based on the Evidence Taxonomy [Ely et al. 1999], and also into general topics. The QA approach proposed by Wang et al. [2007] relies heavily on semantic information present in the questions and documents. First, candidate sentences are identified and phrases are extracted by identifying mappings between question and answer semantic types. The system has been evaluated for factoid and complex questions and is shown to have very good recall and precision (77% and 92% respectively). The use of medical semantic types to identify salient information in sentences based on query needs has been exploited in later, more comprehensive summarisation systems, which we discuss in the next subsection. Finally, Workman et al. [2012] present a dynamic summarisation approach using an algorithm called Combo to identify salient semantic predications. It was shown to outperform several baseline methods in terms of recall and precision.

While discussing approaches in the previous subsection, we attempted to only provide the reader with a flavour of the various directions in which medical text summarisation has progressed. We intentionally skipped some important systems and filtered out many. In this subsection, we discuss the characteristics of a number of systems that perform summarisation of medical text. For full QA systems, we primarily focus on their summarisation components. This review not only discusses the capabilities of the systems but also presents their strengths and weaknesses from the perspective of EBM. Table I provides a summary comparison of the systems mentioned here. Note that in the table, SemRep represents the summarisation approach using SemRep proposed by Fiszman et al. [2009].

| System | Input | Unit | Use of Semantic Information | Summary Type | Target user |
|---|---|---|---|---|---|
| MedQA* | MEDLINE and the Web | Multi-document | No | Non-extractive/Definitional | Healthcare Practitioners |
| CQA 1.0* | MEDLINE abstracts | Multi-document but separate summaries for each abstract | Yes (UMLS) | Extractive | Healthcare Practitioners |
| SemRep | Clinical trials from MEDLINE | Multi-document | Yes (UMLS and SemRep) | Non-extractive | Healthcare Practitioners |
| EpoCare | MEDLINE abstracts (cited by CE articles) | Multi-document | Yes (UMLS) | Extractive | Healthcare Practitioners |
| PERSIVAL | Patient Records, Medical Articles and Web-based Text Articles | Multi-document | Yes (UMLS) | Non-extractive | Healthcare Practitioners and Laymen |
| AskHermes* | MEDLINE abstracts and full texts | Multi-document | Yes (UMLS) | Extractive | Healthcare Practitioners |
| QSpec | MEDLINE abstracts | Single- and multi-document | Yes (UMLS) | Extractive | Healthcare Practitioners |

Table I. Comparison of summarisation systems for the medical domain. Systems available online are marked with a *.

MedQA.

MedQA [Lee et al. 2006a; Yu et al. 2007] answers definitional questions by producing paragraph-level answers from MEDLINE and the web. Syntactic parsing, query formulation, and query classification techniques are used to prepare queries, and an IR engine retrieves relevant documents [Yu and Sable 2005; Yu et al. 2005]. The answer extraction component employs document zone detection and sentence categorisation via the identification of cue phrases. Definitional sentences are identified in this fashion. Hierarchical clustering [Lee et al. 2006b] and centroid-based summarisation techniques [Radev et al. 2000] are used for text summarisation. The system’s ability to answer definitional questions is evaluated against three search engines: Google, OneLook and PubMed [Yu et al. 2007; Yu and Kaufman 2007]. Evaluations show that Google is very effective in obtaining definitions, outperforming MedQA.

MedQA has opened new directions in medical text summarisation. It has shown how supervised classification can be used for intermediate steps in medical text summarisation, such as query analysis. MedQA, however, is not capable of incorporating semantic information, and the intent of the system is very different from the requirements of the EBM practitioner. The key limitation of the system is that it can only answer definitional questions, and not complex, real-life ones.

4.4.2 CQA 1.0.

Demner-Fushman and Lin [2007] present a QA system that uses a statistical and knowledge-based approach and is particularly targeted towards the practice of EBM. The proposed system uses PICO representations of questions as queries, which are sent to PubMed to retrieve an initial set of abstracts. From the abstracts, each of the PICO elements (Problem/Population, Intervention, Comparison and Outcome) is extracted using various techniques. MetaMap is extensively used to identify UMLS terms and their categories.
The population extractor uses a series of hand-crafted rules to identify occurrences of population terms in the abstracts, with preference given to terms occurring earlier in the documents. The problem extractor identifies elements that represent the UMLS semantic group ‘DISORDER’. The intervention and comparison elements are identified in a similar way, by recognition of multiple UMLS semantic types (e.g., clinical drug, diagnostic procedure, etc.). In structured abstracts, more weight is given if the semantic types occur in the ‘title’, ‘aims’ or ‘methods’ sections, while, in unstructured abstracts, more weight is given if they appear towards the beginning of the document. Using the extracted knowledge, the authors re-rank the retrieved documents using a document ranking algorithm that takes into account the knowledge elements, strength of evidence and other task-specific considerations [Lin and Demner-Fushman 2006]. An outcome extractor extracts outcome sentences from the retrieved documents using a “strategy based on an ensemble of classifiers (a rule-based classifier, a bag-of-words classifier, an n-gram classifier, a position classifier, an abstract length classifier and a semantic classifier)” [Lin and Demner-Fushman 2006]. Each sentence is given a probability based on the classifier scores, and the top-ranked sentences are chosen. As the final output, the system simply produces the top-ranked sentences from the top re-ranked documents along with the question and the strength of recommendation. Only basic clinical questions such as ‘What is the best drug treatment for X?’ are addressed. The authors evaluate the performance of the knowledge extractors against different baselines and also manually evaluate the final output against a baseline that only presents top sentences from unranked documents.

The techniques applied by the CQA system show the importance of statistics and domain-specific knowledge for medical text summarisation. There are several limitations of the system.
It relies on PICO frames for queries (instead of natural language questions), and it lacks a final information synthesis step that would produce a single answer from related documents. Furthermore, the algorithm applied to predict the quality of evidence does not follow an evidence-based guideline.

Footnotes: Available at: http://askhermes.org/MedQA/. Accessed on 15th Jan, 2016. [...]~demner/. Accessed on 15th Jan, 2016.

4.4.3 Summarisation using SemRep.

Fiszman et al. [2004], Rindflesch et al. [2005] and Fiszman et al. [2009] propose an approach to abstractive summarisation that primarily relies on utilising semantic information. The summariser uses SemRep as a semantic processor to perform source interpretation and predication listing, and relies on user-specified topics. A transformation stage generalises and condenses the list of predicates, generating a conceptual condensate for the input topic. The transformation is carried out in four phases: Relevance, Connectivity, Novelty, and Saliency. First, the Relevance phase is responsible for identifying predications for a particular topic (e.g., treatments). The UMLS semantic network is utilised, along with a controlled schema, to identify predications. The predications, which must conform to the schema, are called the ‘core predications’. The Connectivity phase is a generalisation process and retrieves all predications that share arguments with core predications. The Novelty phase condenses the summaries further by removing predications with general arguments (e.g., Pharmaceutical Preparations), i.e., arguments that appear higher in the UMLS Metathesaurus hierarchy. Finally, the Saliency phase uses simple frequency measures to keep the representative predicates, predications, and arguments. The final summary is produced in the form of a graph, and the approach can be applied to both single documents and multiple documents without requiring any modifications.

One of the drawbacks of the system is that it does not incorporate query information. The system is evaluated [Fiszman et al. 2009] on its capability to discover drug interventions for disorders/syndromes only (other forms of interventions are not taken into account). Therefore, the application domain of the system is very limited. Furthermore, only clinical trials are used as source documents.
The performance of the system is compared to a baseline that selects drug names based on the frequencies of their occurrence in source texts. A scoring mechanism called the clinical usefulness score is used. It rewards the systems for identifying beneficial drug interventions and penalises them for identifying harmful or not useful ones. To assess the usefulness of drug interventions, drug listings in Clinical Evidence (CE) articles are used as gold standards, in place of manually annotated references. The system is shown to outperform the baseline in terms of both the clinical usefulness score and mean average precision (MAP).

The summarisation approach applied by the SemRep system is simple, innovative, and effective. Its performance illustrates the importance of domain-specific semantic information, and the usefulness of distributional semantics in automated summarisation for this domain. Incorporating query focus into the summarisation procedure could make the technique more applicable for EBM practice. Summary information requirements must be identified from clinical questions instead of manually identified topics, and the summarisation component should be able to identify information other than drug interventions.

4.4.4 EPoCare.

The EPoCare (Evidence at Point of Care) project [Niu et al. 2003; Niu and Hirst 2004; Niu et al. 2005, 2006] is an initiative by the University of Toronto to develop a clinical QA system. The current implementation relies heavily on automatic identification of PICO elements from both clinical questions and their corresponding answers. PICO keywords are first identified from the question and used as keywords for retrieval. The problem of identifying answers to a clinical question is divided into four sub-problems: (i) identifying roles (PICO elements) in the text, (ii) identifying the lexical boundary of each element, (iii) analysing the relationships between distinct elements and (iv) determining which combinations of roles are most likely to contain correct answers. The initial work presented by Niu and colleagues addresses simple treatment-type questions. MetaMap is used for automated identification of interventions and problems. The authors note that identification of outcomes is a much more difficult task, and they identify cue words (nouns, verbs and adjectives) that indicate the presence of outcomes in sentences. The outcome detection task is further subdivided into two sub-tasks: outcome identification and lexical boundary determination. In addition to sentence-level outcome detection, the authors suggest that the polarity of outcomes plays an important role in determining which sentences to choose as answers. Four categories of polarity are defined (positive, negative, no outcome and neutral). The authors use Support Vector Machines (SVMs) to classify sentences into the four categories, and show that the best results are obtained by combining linguistic features with domain features.

The outcomes of the polarity classification task are used in a multi-document summarisation approach to automatically find information from MEDLINE abstracts to answer a clinical question.
Presence and polarity of outcomes, position of sentences in abstracts, length of sentences and Maximal Marginal Relevance (MMR) are used as features in a machine learning algorithm (SVM) that solves the summarisation problem. Sentences in MEDLINE abstracts cited by Clinical Evidence (CE) articles are manually annotated for each clinical query, and sentences obtained from the automated approach are compared with these for evaluation. A total of 197 abstracts from 24 subsections in CE are annotated to give a total of 2,298 sentences.

The outcome detection task is shown to have an accuracy of 83%. The polarity assessment task is shown to have an accuracy of 79.42%. For the summarisation task, it is observed that the identification of outcomes and polarity improves performance significantly. However, F-scores are shown to be very low (< [...]).

4.4.5 PERSIVAL.

PERSIVAL (PErsonalized Retrieval and summarisation of Image, Video and Language) [Elhadad and McKeown 2001] is a medical digital library designed to provide customised access to a distributed library of multimedia medical literature. It is not possible to discuss the whole project in this review, and hence we focus on the text summarisation component of the system, which produces customised, abstractive summaries for persons from technical and non-technical backgrounds [Elhadad et al. 2005; Elhadad 2006].

The summarisation system takes as input three different sources: patient records, medical articles (about cardiology) that are appropriate for the patient, and a query by the user from which key words are extracted. The input articles are classified as prognosis, treatment or diagnosis. Relevant sentences from the ‘Results’ sections of the articles are extracted and stored in a pre-defined template, which contains three sets of information: parameter(s), findings, and relation. The relation describes the relation between a parameter and a finding. The information extraction phase also identifies the extent to which the parameters are dependent, along with other meta-data, such as the position of the sentence from which a parameter has been extracted [Elhadad et al. 2005]. Following the extraction, the information is filtered to keep only the portions that are relevant to the patient’s medical records, and hand-crafted templates are filled with the extracted elements. In the next step, the filled templates are converted into semantic representations by mapping them into a graph. Repetitions (when two nodes are connected via multiple edges of the same type) and contradictions (when two nodes are connected by multiple edges of different types) are identified from the graphs, and this information is used to generate a coherent summary.
The representation is then ordered for summary generation: relations that are deemed relevant to the user question are given the highest preference, followed by repetitions and contradictions, which in turn are followed by preference based on the relation type (e.g., relations representing risks are given higher priority than association relations). Finally, dependent relations identified from one template are output together, and the final summary is generated using natural language generation techniques, along with hyperlinks to medical concepts.

The summaries generated by the system are not evidence-based, and the intent of the system is to provide personalised information for users from both technical and non-technical backgrounds. The approaches applied by this system have the potential of being applied for evidence-based summarisation.

4.4.6 AskHERMES.

AskHERMES [Cao et al. 2011] is a clinical QA system that is capable of taking queries in natural language, performs thorough query analysis, and generates single-document, query-focused extracts from a group of texts that are relevant to the query. The input to the system, being in natural language, requires minimal query formulation by the user, and the interface allows quick navigation of the summaries. The system is demonstrated to outperform Google and UpToDate when answering complex clinical questions.

AskHERMES operates in five phases: a query analysis module automatically identifies the information needs posed by questions by generating a list of informative query terms; a related questions extraction module identifies a list of questions that are similar, based on the terms; an information retrieval module returns the relevant documents; an information extraction module identifies relevant passages from the source documents; and a summarisation and answer presentation module analyses the identified relevant passages, identifies and removes redundant elements, and finally outputs summaries with structure.

In the query analysis phase, a query is first classified into one of 12 general topics [Yu and Cao 2008], and then keywords are automatically extracted from the question using Conditional Random Fields. Following the retrieval of relevant documents, a two-layer hierarchical clustering is applied to group passages into different topics. Query terms and their UMLS mappings are used to assign the cluster labels. The first layer of clustering generates the topic labels for a tree structure, and a second layer of clustering is applied to provide more refined categories. The final set of clusters is ranked based on the key query terms that are found in them, and redundancies are detected and removed using longest common substrings.

AskHERMES is shown to perform comparably to state-of-the-art systems, and its potential application in the field of EBM is very promising.
One issue is that the lengthy summaries do not satisfy the requirement of bottom-line recommendations required by practitioners. Due to the extractive nature of the summarisation, it is difficult to synthesise information from multiple documents and generate brief summaries. However, the system’s performance suggests that customising the summary generation process to the type of question may be beneficial. Furthermore, the authors show that key query terms can be used to determine topics.

4.4.7 QSpec.

QSpec [Sarker 2014; Sarker et al. 2012, 2013a, 2015, 2016] is a summarisation system that attempts to generate evidence-based summaries for complex medical queries posed by practitioners. The system utilises a publicly available data set [Mollá et al. 2015], which consists of real-life clinical queries posed by practitioners, evidence-based justifications in response to the queries (single-document summaries), and bottom-line recommendations (multi-document summaries). The proposed system consists of two primary components: (i) a query-focused summariser [Sarker et al. 2016], and (ii) an automatic evidence grader [Sarker et al. 2015]. The summariser module performs single-document, extractive summarisation based on multiple domain-specific and domain-independent features, such as sentence positions, lengths, sentence classes, UMLS semantic types, and others. The system takes as input questions in natural language. Following the retrieval of the relevant documents, single-document summaries are generated by selecting the three most appropriate sentences from each document. The crucial part of this component is the sentence selection process, which utilises statistics derived from the annotated data in the corpus. Using statistics associated with a range of topics (e.g., question-specific semantic types, associations between semantic types, maximal marginal relevance and others), the system attempts to apply weights to each sentence score, and finally ranks the sentences based on the scores.

Footnotes: http://web.science.mq.edu.au/~diego/medicalnlp/. Accessed on 2nd February, 2015. http://sourceforge.net/projects/ebmsumcorpus/. Accessed on 2nd February, 2015.

Multi-document summaries are generated by performing automatic, contextual polarity classification of sentences [Sarker et al. 2013b].
In addition to the multi-document summaries for each piece of evidence, a supervised learning approach that utilises a sequence of classifiers [Sarker 2014] is applied to grade the quality of the evidence.

The system is shown to perform significantly better than several baseline and benchmark systems on a scale that employs ROUGE F-scores for relative comparison of systems [Ceylan et al. 2010]. However, the generation of multi-document summaries is described as a much more difficult task, and the contextual polarity classifier is only applied to a subset of the corpus that addresses therapeutic questions. This system illustrates the usefulness of annotated corpora for this task, and it is likely that future research will utilise the publicly available data set that this system has been trained on.
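
Maximal Marginal Relevance (MMR), listed above among the features used by EPoCare and QSpec, trades relevance to the query against redundancy with sentences that have already been selected. The following is a minimal sketch only, assuming whitespace tokenisation, bag-of-words cosine similarity and a hypothetical trade-off weight `lam`; neither system is claimed to use MMR in exactly this form, since both combine it with other features. The default of three sentences mirrors the per-document summary length reported for QSpec.

```python
import math
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words vector (illustrative tokenisation only)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def mmr_select(query, sentences, k=3, lam=0.7):
    """Greedily pick up to k sentences by Maximal Marginal Relevance.

    lam is a hypothetical trade-off weight:
    1.0 scores by query relevance only, 0.0 by novelty only.
    """
    q = bow(query)
    vecs = [bow(s) for s in sentences]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            # Redundancy: similarity to the closest already-selected sentence.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected),
                             default=0.0)
            return lam * cosine(vecs[i], q) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With a duplicated sentence in the input, the redundancy term pushes the duplicate below a less similar but novel sentence, which is the behaviour that motivates MMR in multi-document settings.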

5. EVALUATION

The objective of this section is to briefly discuss the evaluation techniques used for automated summarisation, approaches attempted in the past and possible approaches for evaluating summaries for EBM. Evaluation of automatically generated summaries is a hard problem [Sparck Jones 1999, 2007]. This is primarily because it is a heuristic problem: there is more than one acceptable solution, and a universally accepted standard evaluation is absent. Evaluation measures usually focus on specific features of the summarised text, which depend largely on the summarisation factors. The features focused upon by evaluation techniques for generic summaries can be significantly different from those for query-focused summaries. Evaluation techniques can also vary according to the unit of summarisation (i.e., single vs. multi-document), domain, type (extractive vs. non-extractive) and other factors mentioned previously. In this brief section, we focus only on approaches that have been used or may be relevant for application in EBM summarisers. The interested reader can refer to [Lin and Hovy 2002] for a brief discussion on automatic and manual evaluation techniques.

Techniques for automatic summary evaluation can be divided into two broad classes: extrinsic and intrinsic. Extrinsic evaluation techniques focus on the purpose of summaries. The objective is to measure the usefulness of a summary for a specific task. The single feature that plays the most important role in determining the usefulness of a summary is its content, and numerous evaluation techniques use this feature as an indication of the quality of summaries. Recall, precision, F-score, coverage and other similar measures [Lin and Hovy 2002] have been frequently used in the past as simple evaluations of content relevance (e.g., [Wang et al. 2007; Niu et al. 2003; Niu and Hirst 2004; Niu et al. 2005, 2006]). More sophisticated techniques have also been applied for extrinsic evaluation. An example is the assessment of whether questions that can be answered from the original texts can also be answered from the summarised texts [Morris et al. 1992; Teufel 2001; Mani et al. 2002]. These evaluation techniques have been employed in open-domain QA, but have not been explored for the medical domain until recently [?]. Such an evaluation approach can be useful for summarisers targeted towards EBM, as discussed later.

Footnote: The system does not perform information retrieval.

Intrinsic methods concentrate on the summary itself, trying to measure features such as coherence, cohesion, grammaticality, readability and other important features. Intrinsic methods for extractive summaries assess features such as discourse well-formedness, while those for non-extractive approaches must also assess sentence well-formedness. Although in some domains, such as news, intrinsic evaluation plays an important role, for a query-focused summarisation system for EBM, we believe it to be much less important. The reasons are discussed later in the section.

Both manual and automated evaluation techniques are used for evaluating summaries.
The following are some common techniques used in both methods of evaluation.

5.2.1 Gold Standards.

Gold standards (also known as human reference summaries) are often used for evaluating automatic summarisation primarily because humans can be “relied upon to capture important source content and to produce well-formed output text” [Sparck Jones 2007; Suominen et al. 2008]. The expected output summaries are manually created by human experts, who are often experts in a specific domain. The created summaries therefore contain the necessary content and become the target performance for systems. Evaluation then compares the generated summaries with the gold standard summaries. This can be done automatically or manually, and is itself a research problem. The more similar a generated summary is to the gold standard, the better it is considered to be.

For years, the absence of gold standard data has been an obstacle to summarisation research in the medical domain. However, the recent creation of specialised corpora for EBM [Mollá et al. 2015] is driving the development of data-driven summarisation and evaluation techniques.

5.2.2 Baselines.

Baselines can be considered to be the opposite of gold standards in that they indicate the minimum level of required performance by a summarisation system. For extractive summarisation, various baselines such as n random sentences have been used. A more appropriate baseline for news summarisation was introduced by Brandow et al. [1995], who used the first n sentences. With these baselines, the minimum performance required by a system is to select n sentences that better summarise the source than the baseline. Similar baselines have been established for summarisation in various other domains. For example, Demner-Fushman and Lin [2007] propose an outcome extractor that is compared against a baseline of ‘last n sentences’ (since outcomes presented in a medical paper usually appear towards the end of the abstract). Baseline measures of tf.idf type have also been used in the literature and even for EBM [Fiszman et al. 2009]. In such baselines, the summarisation units (sentences, words, n-grams) are chosen based on their tf.idf values. A standard baseline for summarisation systems across domains and even within specific domains, however, still does not exist.

5.2.3 Topic-oriented Evaluation.

Topic-oriented evaluation techniques are specialised to the topic and intent of the summarisation task. A number of topic-oriented evaluation schemes have been proposed in the literature, both within the medical domain and outside. A recent example of a topic-oriented evaluation mechanism is the ‘Clinical Usefulness Score’ (CUS) [Fiszman et al. 2009], a unique evaluation of generated summaries. The CUS is a categorical performance metric. In calculating this score, interventions extracted by a system are assigned to one of four high-level categories depending on how they match the interventions in a predetermined reference standard. The goal is to give credit to a system for finding beneficial interventions and, similarly, penalise it for finding harmful interventions.

Due to the difficulty associated with evaluating automatic summaries, manual evaluation is still a common practice. There is more confidence in manual evaluation (compared to automatic evaluation) since humans can infer, paraphrase and use world knowledge to relate text units with similar meanings but different wording. In such evaluations, domain experts (often several for a single summary) read and grade summaries, usually using some chosen scale. Most of the systems mentioned in the last subsection of the previous section have undergone some form of manual evaluation, or at least involved human experts for the preparation of a gold standard (e.g., [Demner-Fushman and Lin 2007]). However, agreement among human summarisers is generally quite low, and the process of manual evaluation is quite expensive. Human judgements have been shown to be unstable and inconsistent as well [Lin and Hovy 2002]. As a result, alternative automatic evaluations having high correlation with human scores are usually used.

Nenkova and Passonneau [2004] present the pyramid approach for the manual evaluation of summaries. Summary content is divided into summarisation content units (SCUs), and SCUs representing the same semantic information are manually annotated in each reference summary. Once annotation is complete, all the SCUs are assigned weights based on the number of summaries each of them appears in. Next, the SCUs are partitioned into a pyramid in which each tier contains SCUs of the same weight and higher tiers contain SCUs of higher weight (i.e., SCUs appearing in more human summaries). Summaries that contain more top-tier SCUs are ranked higher than those that contain lower-tier SCUs. An optimal summary consists of all the top-tier SCUs combined, within the summary length limits.
The score for an automatically generated summary is calculated by summing the weights of its constituent SCUs, and dividing this number by the sum of the SCU weights of the corresponding optimal summary.

An earlier approach presented by van Halteren and Teufel [2003] is similar: atomic semantic units, called factoids, are used to represent the meanings of sentences. The approach requires the generation of a large number of summarised articles from which the gold standard can be obtained by identifying the most frequently occurring factoids. As an example, the authors show that to generate a news article summary of 100 words (from 50 sample summaries), all factoids appearing in at least 30% of the summaries had to be included. Hence, gold standards of different lengths can be generated by varying the factoid threshold. Although this approach ensures very high agreement among the humans, its requirement of a large number of sample summaries makes it quite an expensive approach.
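
The pyramid score described above reduces to a short computation once the manual SCU annotation is done. The sketch below is a simplified illustration: SCUs are assumed to be already identified and are represented by hypothetical string labels, and the summary length limit is approximated by a hypothetical `max_scus` argument (the number of SCUs an optimal summary may contain).

```python
from collections import Counter


def scu_weights(reference_scus):
    """Weight of each SCU = number of reference summaries it appears in."""
    weights = Counter()
    for scus in reference_scus:
        weights.update(set(scus))  # count each SCU once per summary
    return weights


def pyramid_score(system_scus, reference_scus, max_scus):
    """Ratio of the system's total SCU weight to that of an optimal summary."""
    weights = scu_weights(reference_scus)
    # SCUs absent from every reference summary contribute a weight of 0.
    observed = sum(weights[scu] for scu in set(system_scus))
    # Optimal summary: fill the length limit from the top tier downwards.
    optimal = sum(sorted(weights.values(), reverse=True)[:max_scus])
    return observed / optimal if optimal else 0.0
```

For three reference summaries containing the SCUs {A, B, C}, {A, B} and {A, D}, the weights are A=3, B=2, C=1 and D=1, so a two-SCU system summary containing A and C scores (3 + 1) / (3 + 2).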

5.4.1 ROUGE.

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [Lin and Hovy 2003; Lin 2004] is one of the most widely used tools for automatic summary evaluation, and consists of a number of metrics that evaluate summaries based on distinct criteria. The various ROUGE measures attempt to compute similarities between system-generated summaries and gold standard summaries at a lexical level. One of the metrics, for example, called ROUGE-N, is an n-gram based recall-oriented measure, and is calculated using the following formula:

\[
\text{ROUGE-N}(s) = \frac{\sum_{r \in R} \langle \phi_n(r), \phi_n(s) \rangle}{\sum_{r \in R} \langle \phi_n(r), \phi_n(r) \rangle} \tag{5}
\]

where $R = \{r_1, \ldots, r_m\}$ denotes the group of gold-standard summaries, $s$ denotes an automated system summary, and $\phi_n(d)$ denotes a binary vector representing the n-grams contained in a document $d$ [Lin and Hovy 2003; Lin 2004]. This metric, therefore, simply attempts to measure the extent to which an automatically generated summary contains the same information as the reference summary. Other ROUGE metrics apply different techniques with the same primary intent. For example, ROUGE-L attempts to find the longest common subsequence (LCS) between two summaries, with the rationale that summaries with longer LCSs are more similar. Another ROUGE metric, known as ROUGE-S, is a gappy alternative of the ROUGE-N metric (for n = 2), and matches ordered bi-grams of the generated summary with reference summaries.

One problem with using ROUGE for evaluation is that it gives an absolute score for a generated summary. As a result, it is difficult to ascertain how good or bad a summarisation system’s score is compared to the best possible obtainable score. It is also difficult to perform relative comparisons between systems. To address these drawbacks, Ceylan et al. [2010] paved the way for relative comparisons between ROUGE scores using a percentile-rank based approach. Ceylan et al. [2010] empirically showed that since the number of extractive summaries for a set of documents is finite, the ROUGE scores for summary sentence combinations fall within a finite range, following a Gaussian distribution. Most combinations get a ROUGE score that is very close to the mean. This leads to a long-tailed probability distribution for all ROUGE scores, meaning that a small increase in the ROUGE score assigned to a system can indicate large increases in percentile ranks [Sarker et al. 2016].
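
Equation (5) and the percentile-rank idea can be illustrated together on a toy scale. The sketch below is not the ROUGE toolkit, nor the procedure of Ceylan et al. [2010]: it assumes whitespace tokenisation, treats the binary n-gram vector as a set of distinct n-grams (so each inner product in Equation (5) becomes a set intersection size), and exhaustively enumerates every k-sentence extract, which is only feasible for very small inputs. All function names are hypothetical.

```python
from itertools import combinations


def ngram_set(text, n=1):
    """Distinct n-grams of a text, i.e. the support of the binary n-gram vector."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def rouge_n(system, references, n=1):
    """Equation (5) with binary vectors: shared n-grams over reference n-grams."""
    sys_grams = ngram_set(system, n)
    ref_sets = [ngram_set(r, n) for r in references]
    matched = sum(len(ref & sys_grams) for ref in ref_sets)
    total = sum(len(ref) for ref in ref_sets)
    return matched / total if total else 0.0


def percentile_rank(system_score, sentences, references, k, n=1):
    """Percentage of all k-sentence extracts that score strictly below system_score."""
    scores = [rouge_n(" ".join(combo), references, n)
              for combo in combinations(sentences, k)]
    below = sum(1 for s in scores if s < system_score)
    return 100.0 * below / len(scores) if scores else 0.0
```

Reporting the percentile rank alongside the raw score places a system within the distribution of all possible extracts for the same data, rather than on an absolute scale.
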
Importantly, the distribution of all ROUGE scores for a particular data set can be used to perform relative comparisons between systems, and such a technique has been applied recently for comparative evaluation of systems.

5.4.2 BLEU.

BLEU [Papineni et al. 2002] was originally designed for evaluating machine translation, and has been shown to be promising for automatic summary evaluation as well [Lin and Hovy 2002]. In this technique, automatically computed accumulative n-gram matching scores (NAMS) between a model unit (MU) and a system summary (S) are used as performance indicators of the system [Papineni et al. 2002]. A number of combinations of n-grams are used to compute NAMS, and the technique is shown to have satisfactory correlation with human scores.

5.4.3 Information-theoretic Evaluation.

Lin et al. [2006] introduce an information-theoretic approach to the automatic evaluation of summaries based on the divergence of distribution of terms between an automatic summary and model summaries. For a set of documents D, the authors assume that there exists a probabilistic distribution with parameters specified by θ_R that generates reference summaries from D, and the task of summarisation is to estimate θ_R. Similarly, the authors assume that every system summary is generated from a probabilistic distribution with parameters specified by θ_A. The process of summary evaluation then becomes the task of estimating the distance between θ_R and θ_A. The authors present a number of variants of divergence measures (e.g., Jensen-Shannon divergence (JS), Kullback-Leibler divergence (KL)) for this and show that this technique is comparable to ROUGE for the evaluation of single-document summarisation, and better than ROUGE in evaluating multi-document summarisation systems. Louis and Nenkova [2008] and Kabadjov et al. [2010] present similar information-theoretic evaluation approaches, which incorporate other divergence measures and also attempt to capture complex phenomena such as synonymy.

5.4.4 ParaEval.

The motivation behind this evaluation technique is the lack of semantic matching of content in automatic evaluation. The authors of this evaluation technique [Zhou et al. 2006] explain that an essential part of semantic matching involves paraphrase matching, and this evaluation system attempts to perform that. ParaEval applies a three-level comparison strategy. At the top level, an optimal search via dynamic programming is used to find multi-word to multi-word paraphrase matches between generated and reference summaries. At the second level, a greedy algorithm is used to find single-word paraphrase matches among the fragments left unmatched at the first level. Finally, literal lexical uni-gram matching is performed on the remaining text at the third level. The authors show that the quality of ParaEval’s evaluations closely resembles that of ROUGE.

If summarising for EBM is a hard task, evaluating summaries is even harder. A summarisation system for EBM should be capable of extracting evidence from medical articles and additionally assess the grade of the evidence. The evaluation should be able to determine if the evidence is correctly extracted and also if the extracted information correctly answers the practitioner’s query. Therefore, the single most important aspect of the summary is its content, and a strong focus on extrinsic evaluation is required. Intrinsic evaluation to assess aspects such as coherence and readability is perhaps not very important, and a stronger focus needs to be on the purpose of summaries [Sparck Jones 1999].

For the automatic evaluation of summaries, approaches based on n-gram co-occurrence such as ROUGE are the most frequently used. Although these approaches are very robust and efficient, their main drawback is that if two summaries convey the same information using non-overlapping or almost non-overlapping vocabulary, the similarity score assigned to them by purely n-gram based metrics would be too low and, hence, unrepresentative of the actual information they share. This is definitely not desirable for evaluation, particularly in the medical domain, where relations such as synonymy and hyponymy play important roles. Furthermore, making relative comparisons between different summarisers is difficult using automatic evaluation approaches such as ROUGE. Thus, approaches that incorporate domain knowledge, semantic similarities, and comparisons between summarisation systems are required. Research on the evaluation of summarisers for EBM is still very much in its infancy. However, recent works in this area, such as those of Fiszman et al. [2009] and Sarker et al. [2016], have made useful contributions by ensuring that evaluation techniques assess the performance of the summariser in the light of its goals.
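
The vocabulary-mismatch drawback described above is easy to reproduce. In the sketch below, two summaries that convey the same finding share only half of their unigrams, so a purely lexical recall is low; mapping terms to concept identifiers before matching restores the overlap. The `CONCEPTS` table and its identifiers are invented placeholders standing in for a real terminology resource such as the UMLS, and the naive string replacement is an assumption made for brevity, not a recommended concept-mapping method.

```python
# Hypothetical phrase-to-concept table; the identifiers are invented placeholders.
CONCEPTS = {
    "heart attack": "CONCEPT_MI",
    "myocardial infarction": "CONCEPT_MI",
    "lowers": "CONCEPT_REDUCE",
    "reduces": "CONCEPT_REDUCE",
}


def normalise(text):
    """Replace known phrases with concept identifiers (naive string matching)."""
    text = text.lower()
    for phrase, concept in CONCEPTS.items():
        text = text.replace(phrase, concept)
    return text


def unigram_recall(system, reference):
    """Fraction of distinct reference tokens that also occur in the system text."""
    ref_tokens, sys_tokens = set(reference.split()), set(system.split())
    return len(ref_tokens & sys_tokens) / len(ref_tokens) if ref_tokens else 0.0


reference = "aspirin reduces risk of myocardial infarction"
system = "aspirin lowers risk of heart attack"

# Lexical matching sees only "aspirin", "risk" and "of" in common.
lexical = unigram_recall(system.lower(), reference.lower())
# After concept normalisation the two summaries become identical.
semantic = unigram_recall(normalise(system), normalise(reference))
```

Here `lexical` is 0.5 and `semantic` is 1.0, although both summaries state the same clinical finding.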

6. CONCLUSIONS

In this paper, we provided an overview of EBM and described how summarisation of text can aid practitioners at point-of-care. We discussed the obstacles that EBM practitioners face, as indicated by various research papers on the topic. Following our review of the domain, we provided an overview of automatic summarisation, its intent, and some important contributions to automatic summarisation research. We noted that, unlike automatic summarisation research in domains such as news, summarisation in the medical domain has not received much research attention. We also explained that domain-independent summarisation techniques lack sufficient domain knowledge, the incorporation of which can be crucial for summarisation research. We reviewed several recent summarisation systems that are customised for the medical domain. Our review of these systems revealed promising approaches, including the clever use of domain-specific information and distributional semantics. Our survey indicates that combining some of the useful approaches from the existing literature, and building on these already explored techniques, may produce encouraging results for text summarisation in this domain. Our survey also reveals that an important factor limiting summarisation research in this domain has been the lack of suitable corpora and gold standard summaries. However, recent trends in the creation of specialised corpora will inevitably drive the development of more data-centric systems in the future.

REFERENCES

S. D. Afantenos, V. Karkaletsis, and P. Stamatopoulos. 2005. Summarization from Medical Documents: A Survey. Artificial Intelligence in Medicine 33, 2 (2005), 157–177.

R. K. Ando, B. K. Boguraev, R. J. Byrd, and M. S. Neff. 2000. Multi-document Summarization by Visualizing Topical Content. In Proceedings of the NAACL-ANLP Workshop on Automatic Summarization. 79–88.

O. C. Aone. 1999. Advances in Automatic Text Summarization. MIT Press, Chapter A Trainable Summarizer with Knowledge Acquired from Robust NLP Techniques, 71–80.

E. C. Armstrong. 1999. The well-built clinical question: the key to finding the best evidence efficiently. Wisconsin Medical Journal 98, 2 (1999), 25–28.

A. R. Aronson. 2001. Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program. In Proceedings AMIA Annual Symposium. 17–21.

S. J. Athenikos and H. Han. 2009. Biomedical question answering: A survey. Computer Methods and Programs in Biomedicine 99, 1 (2009), 1–24.

D. Atkins, D. Best, P. A. Briss, M. Eccles, Y. Falck-Ytter, S. Flottorp et al., and the G. R. A. D. E. Working Group. 2004. Grading quality of evidence and strength of recommendations. BMJ

Medical Care 21, 4 (1983), 400–409.

R. Barzilay and M. Elhadad. 1997. Using Lexical Chains for Text Summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization. 10–17.

R. Barzilay and K. McKeown. 2005. Sentence Fusion for Multidocument News Summarization. Computational Linguistics 31, 3 (2005), 297–328.

R. Barzilay, K. McKeown, and M. Elhadad. 1999. Information Fusion in the Context of Multi-Document Summarization. In Proceedings of the 37th annual meeting of ACL. 550–557.

P. Baxendale. 1958. Machine-made index for Technical Literature - An Experiment. IBM Journal of Research Development 2, 4 (1958), 354–361.

A. Ben Abacha and P. Zweigenbaum. 2011. Automatic extraction of semantic relations between medical entities: a rule based approach. Journal of Biomedical Semantics 2, 5 (2011), S4.

A. Booth, A. J. O’Rourke, and N. J. Ford. 2000. Structuring the pre-search reference interview: a useful technique for handling clinical questions. Bulletin of the Medical Library Association 88, 3 (July 2000), 239–246.

R. Brandow, K. Mitze, and L. F. Rau. 1995. Automatic condensation of electronic publications by sentence selection. Information Processing and Management 31, 5 (1995), 675–686.

Y. Cao, F. Liu, P. Simpson, L. D. Antieau, A. Bennett, J. J. Cimino, J. W. Ely, and H. Yu. 2011. AskHermes: An Online Question Answering System for Complex Clinical Questions. Journal of Biomedical Informatics 44, 2 (2011), 277–288.

J. Carbonell and J. Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the International ACM-SIGIR Conference on Research and Development in Information Retrieval. 335–336.

H. Ceylan, R. Mihalcea, U. Özertem, E. Lloret, and M. Palomar. 2010. Quantifying the Limits and Success of Extractive Summarization Systems Across Domains. In Proceedings of NAACL. 903–911.

S. Chakrabarti, M. Joshi, and V. Tawde. 2001. Enhanced Topic Distillation Using Text, Markup tags, and Hyperlinks. In Proceedings of the International ACM-SIGIR Conference on Research and Development in Information Retrieval. 208–216.

G. Y. Cheng. 2004. A study of clinical questions posed by hospital clinicians. Journal of the Medical Library Association 92, 3 (2004), 445–458.

C. L. A. Clarke, G. V. Cormack, and T. R. Lynam. 2001. Exploiting redundancy in question answering. In Proceedings of the International ACM-SIGIR Conference on Research and Development in IR. 358–365.

J. M. Conroy and D. P. O’Leary. 2001. Text Summarization via Hidden Markov Models. In Proceedings of the International ACM-SIGIR Conference on Research and Development in IR. 406–407.

L. Damianos, S. Wohlever, J. Ponte, G. Wilson, F. Reeder, T. McEntee, R. Kozierok, L. Hirschman, and D. Day. 2002. Real users, real data, real problems: the MiTAP system for monitoring bio events. In Proceedings of the second international conference on Human Language Technology Research. 357–362.

D. Demner-Fushman, W. W. Chapman, and C. J. McDonald. 2009. Methodological Review: What can natural language processing do for clinical decision support? Journal of Biomedical Informatics 42, 5 (October 2009), 760–772.

D. Demner-Fushman and J. J. Lin. 2007. Answering Clinical Questions with Knowledge-Based and Statistical Techniques. Computational Linguistics 33, 1 (2007), 63–103.

S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng. 2002. Web question answering: is more always better?. In Proceedings of the International ACM-SIGIR Conference on Research and Development in Information Retrieval. 291–298.

L. Earl. 1970. Experiments in Automatic Extracting and Indexing. Information Storage and Retrieval

American Family Physician 69, 3 (Feb. 2004), 548–556.

H. P. Edmundson. 1969. New Methods in Automatic Extracting. Journal of the ACM 16, 2 (1969), 264–285.

N. Elhadad. 2006. User-Sensitive Text Summarization: Application to the Medical Domain. Ph.D. Dissertation. Columbia University.

N. Elhadad, M. Kan, J. L. Klavans, and K. McKeown. 2005. Customization in a Unified Framework for Summarizing Medical Literature. Artificial Intelligence in Medicine 33, 2 (2005), 179–198.

N. Elhadad and K. R. McKeown. 2001. Towards generating patient specific summaries of medical articles. In Proceedings of the NAACL 2001 Workshop on Automatic Summarization. 31–39.

J. W. Ely, J. Osheroff, P. Gorman, M. Ebell, L. Chambliss, E. Pifer, and Z. Stavri. 2000. A taxonomy of generic clinical questions: classification study. BMJ 321 (2000), 429–432.

J. W. Ely, J. A. Osheroff, M. L. Chambliss, M. H. Ebell, and M. E. Rosenbaum. 2005. Answering Physicians’ clinical Questions: Obstacles and Potential Solutions. Journal of the American Medical Informatics Association 12, 2 (2005), 217–224.

J. W. Ely, J. A. Osheroff, M. H. Ebell, G. R. Bergus, B. T. Levy, M. L. Chambliss, and E. R. Evans. 1999. Analysis of questions asked by family doctors regarding patient care. BMJ

Journal of Artificial Intelligence Research 22 (2004), 457–479.

A. Farzindar and G. Lapalme. 2004. Legal text summarization by exploration of the thematic structures and argumentative roles. In Text Summarization Branches Out Conference held in conjunction with ACL 2004. 27–38.

M. Fiszman, D. Demner-Fushman, H. Kilicoglu, and T. C. Rindflesch. 2009. Automatic summarization of MEDLINE citations for evidence-based medical treatment: A topic-oriented evaluation. Journal of Biomedical Informatics 42, 5 (2009), 801–813.

M. Fiszman, T. C. Rindflesch, and H. Kilicoglu. 2004. Abstraction Summarization for Managing the Biomedical Research Literature. In Proceedings of the NAACL-HLT workshop on Computational Lexical Semantics. 76–83.

M. A. Fiszman, T. C. Rindflesch, and H. Kilicoglu. 2003. Integrating a hypernymic proposition interpreter into a semantic processor for biomedical texts. Proceedings of the AMIA Annual Symposium (2003), 239–243.

C. Friedman. 2005. Knowledge management and datamining in biomedicine. Springer New York, Chapter Semantic Text Parsing for patient records, 423–448.

C. Friedman and G. Hripcsak. 1999. Natural Language Processing and Its Future in Medicine. Academic Medicine 74, 8 (August 1999), 890–893.

R. Gaizauskas, P. Herring, M. Oakes, M. Beaulieu, P. Willett, H. Fowkes, and A. Jonsson. 2001. Intelligent Access to Text: Integrating Information Extraction Technology into Text Browsers. In Proceedings of the Human Language Technology Conference (HLT).

S. Gilbody. 1996. Evidence-based Medicine. An Improved Format for Journal Clubs. Psychiatric Bulletin 20 (1996), 673–675.

F. Godlee. 1998. Getting evidence into practice. BMJ 317 (1998), 6.

P. N. Gorman and M. Helfand. 1995. Information seeking in primary care: How physicians choose which clinical questions to pursue and which to leave unanswered. Medical Decision Making 15, 2 (1995), 113–119.

M. L. Green and T. R. Ruff. 2005. Why Do Residents Fail to Answer Their Clinical Questions? A Qualitative Study of Barriers to Practicing Evidence-Based Medicine. Academic Medicine: Journal of the Association of American Medical Colleges 80, 2 (February 2005), 176–182.

T. Greenhalgh. 1999. Narrative based medicine in an evidence based world. BMJ 318 (1999), 323–325.

T. Greenhalgh. 2006. How to read a paper: The Basics of Evidence-based Medicine (3 ed.). Blackwell Publishing.

U. Hahn, M. Romacker, and S. Schulz. 2002. MEDSYNDIKATE - a natural language system for the extraction of medical information from findings reports. International Journal of Medical Informatics 67, 1–3 (2002), 63–74.

U. Hahn and M. Strube. 1997. Centering in-the-large computing referential discourse segments. In Proceedings of the 35th Annual Meeting of the ACL and the 8th Conference of the European Chapter of the ACL. 104–111.

S. Harabagiu and F. Lacatusu. 2005. Topic Themes for Multi-document Summarization. In Proceedings of the International ACM-SIGIR Conference on Research and Development in Information Retrieval. 202–209.

R. B. Haynes, N. Wilczynski, K. A. McKibbon, C. J. Walker, and J. C. Sinclair. 1994. Developing Optimal Search Strategies for Detecting Clinically Sound Studies in MEDLINE. Journal of the American Medical Informatics Association 1, 6 (1994), 447–458.

M. A. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the ACL. Association for Computational Linguistics, 9–16.

K. F. Heppin and A. Jarvelin. 2012. Towards Improving Search Results for Medical Experts and Laypersons. In Proceedings of CLEFeHealth.

W. R. Hersh, M. K. Crabtree, D. H. Hickman, L. Scherek, C. P. Friedman, P. Tidmarsh, C. Mosbaek, and D. Kraemer. 2002. Factors associated with searching MEDLINE and applying evidence to answer clinical questions. Journal of the American Medical Informatics Association

BMC Medical Informatics and Decision Making

Advances in Automatic Text Summarization. MIT Press, Chapter Automated Text Summarisation in SUMMARIST, 81–94.

X. Huang, J. Lin, and D. Demner-Fushman. 2006. Evaluation of PICO as a Knowledge Representation for Clinical Questions. In Proceedings of AMIA Annual Symposium. 359–363.

D. L. Hunt and K. A. McKibbon. 1997. Locating and Appraising Systematic Reviews. Annals of Internal Medicine

Advances in Information Retrieval. Lecture Notes in Computer Science, Vol. 5993. Springer Berlin / Heidelberg, 662–666.

S. Karimi, J. Zobel, S. Pohl, and F. Scholer. 2009. The Challenge of High Recall in Biomedical Systematic Search. In Proceedings of the third international workshop on Data and text mining in bioinformatics. 89–92.

L. Kelly, S. Dungs, S. Kriewel, A. Hanbury, L. Goeuriot, G. J. F. Jones, G. Langs, and H. Muller. 2014. Professional: Multilingual, Multimodal Professional Medical Search. In ECIR 2014. 754–758.

S. N. Kim, D. Martinez, L. Cavedon, and L. Yencken. 2011. Automatic classification of sentences to support Evidence Based Medicine. BMC bioinformatics 12 Suppl 2 (2011).

J. Kupiec, J. Pedersen, and F. Chen. 1995. A Trainable Document Summarizer. In Proceedings of the International ACM-SIGIR Conference on Research and Development in Information Retrieval. 68–73.

F. Lacatusu, P. Parker, and S. Harabagiu. 2003. Lite-GISTexter: Generating Short Summaries with Minimal Resources. In Proceedings of the Document Understanding Conference. 122–128.

M. Lee, J. Cimino, H. R. Zhu, C. Sable, V. Shanker, J. Ely, and H. Yu. 2006a. Beyond Information Retrieval – Medical Question Answering. In Proceedings of AMIA Annual Symposium. 469–473.

M. Lee, W. Wang, and H. Yu. 2006b. Exploring supervised and unsupervised methods to detect topics in biomedical text. BMC Bioinformatics

Proceedings of the Third International Conference on Language Resources and Evaluation. 1464–1471.

J. Leskovec, N. Milic-Frayling, and M. Grobelnik. 2005. Impact of Linguistic Analysis on the Semantic Graph Coverage and Learning of Document Extracts. In Proceedings of AAAI. 1069–1074.

C. Lin. 1999. Training a Selection Function for Extraction. In Proceedings of the Eighteenth Annual International ACM Conference on Information and Knowledge Management (CIKM). 1–8.

C. Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of NAACL-HLT.

C. Lin, G. Cao, J. Gao, and J. Nie. 2006. An information-theoretic approach to automatic evaluation of summaries. In Proceedings of NAACL-HLT. 463–470.

C. Lin and E. Hovy. 1997. Identifying topics by position. In Proceedings of the Fifth conference on Applied Natural Language Processing. 283–290.

C. Lin and E. Hovy. 2000. The Automated Acquisition of Topic Signatures for Text Summarization. In Proceedings of the 18th International Conference on Computational Linguistics. 495–501.

C. Lin and E. Hovy. 2002. Manual and automatic evaluation of summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization. 45–51.

C. Lin and E. Hovy. 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In Proceedings of NAACL-HLT. 71–78.

J. J. Lin and D. Demner-Fushman. 2006. The role of knowledge in conceptual retrieval: a study in the domain of clinical medicine. In Proceedings of the International ACM-SIGIR Conference on Research and Development in Information Retrieval. 99–106.

D. A. Lindberg, B. L. Humphreys, and A. McCray. 1993. The Unified Medical Language System. Methods of Information in Medicine 32 (1993), 281–291.

M. Litvak and M. Last. 2008. Graph-Based Keyword Extraction for Single-Document Summarization. In Proceedings of the Workshop on Multi-source Multilingual Information Extraction and Summarization. 17–24.

A. Louis and A. Nenkova. 2008. Automatic Summary Evaluation without Human Models. Technical Report.

H. P. Luhn. 1958. The Automatic Creation of Literature Abstracts. IBM Journal

Automatic Summarization. John Benjamins Publishing Company.

I. Mani and E. Bloedorn. 1997. Multi-Document Summarization by Graph Search and Matching. In Proceedings of AAAI. 622–628.

I. Mani, M. T. Maybury (editors, and M. Sanderson). 2000. Book Review: Advances in Automatic Text Summarization. MIT Press.

I. Mani, G. Klein, D. House, L. Hirschman, T. Firmin, and B. Sundheim. 2002. SUMMAC: A text summarisation evaluation. Natural Language Engineering 8, 1 (2002), 43–68.

I. Mani and M. T. Maybury. 1999. Advances in Automatic Text Summarization. MIT Press.

D. Marcu. 1998. To Build Text Summaries of High Quality, Nuclearity is not Sufficient. In Proceedings of AAAI. 1–8.

D. Marcu. 1999. Advances in Automatic Text Summarization. MIT Press, Chapter Discourse Trees are Good Indicators of Importance in Text, 123–136.

D. Marcu. 2000. The Theory and Practice of Discourse Parsing and Summarisation. Cambridge MA: MIT Press.

K. R. McKeown, R. Barzilay, D. Evans, V. Hatzivassiloglou, J. L. Klavans, A. Nenkova, C. Sable, B. Schiffman, and S. Sigelman. 2002. Tracking and summarizing news on a daily basis with Columbia’s Newsblaster. In Proceedings of the second international conference on Human Language Technology Research. 280–285.

K. R. McKeown, D. A. Jordan, and V. Hatzivassiloglou. 1998. Generating Patient-Specific Summaries of Online Literature. Technical Report. AAAI Technical Report SS-98-06. 34–43 pages.

K. R. McKeown and D. R. Radev. 1995. Generating Summaries of Multiple News Articles. In Proceedings of the International ACM-SIGIR Conference on Research and Development in Information Retrieval. 74–82.

R. Mihalcea. 2004. Graph-based ranking algorithms for sentence extraction, applied to summarization. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions.

R. Mihalcea and P. Tarau. 2004. TextRank: Bringing Order into Texts. In Proceedings of EMNLP. Barcelona, Spain.

S. Miike, E. Itoh, K. Ono, and K. Sumita. 1994. A Full-text Retrieval System with a Dynamic Abstract Generation Function. In Proceedings of the International ACM-SIGIR Conference on Research and Development in Information Retrieval. 152–161.

R. A. Miller, H. E. Pople, and J. D. Myers. 1982. Internist-1, an experimental computer-based diagnostic consultant for general internal medicine. The New England Journal of Medicine

Language Resources and Evaluation (2015), 1–23. DOI: http://dx.doi.org/10.1007/s10579-015-9327-2

D. Mollá and J. L. Vicedo. 2007. Question Answering in Restricted Domains: An Overview. Computational Linguistics 33 (2007), 41–61.

V. M. Montori, N. L. Wilczynski, D. Morgan, and R. B. Haynes. 2005. Optimal search strategies for retrieving systematic reviews from Medline: analytical survey. BMJ

Information Systems Research 3, 1 (1992), 17–35.

A. Nenkova and R. Passonneau. 2004. Evaluating Content Selection in Summarization: The Pyramid Method. In Proceedings of NAACL-HLT.

Y. Niu and G. Hirst. 2004. Analysis of semantic classes in medical text for question answering. In Proceedings of the ACL-2004 workshop Question Answering in Restricted Domains, Barcelona, Spain.

Y. Niu, G. Hirst, G. McArthur, and P. Rodriguez-Gianolli. 2003. Answering Clinical Questions with Role Identification. In Proceedings of the ACL-2003 workshop Natural Language Processing in Biomedicine.

Y. Niu, X. Zhu, and G. Hirst. 2006. Using outcome polarity in sentence extraction for medical question-answering. In Proceedings of AMIA Annual Symposium. 599–603.

Y. Niu, X. Zhu, J. Li, and G. Hirst. 2005. Analysis of polarity information in medical text. In Proceedings of AMIA Annual Symposium. 570–574.

M. Osborne. 2002. Using Maximum Entropy for Sentence Extraction. In Proceedings of the ACL’02 Workshop on Automatic Summarization. 1–8.

K. Papineni, S. Roukos, T. Ward, and W. J. Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the ACL.

L. Plaza, A. Diaz, and P. Gervas. 2011. A semantic graph-based approach to biomedical summarisation. Artificial Intelligence in Medicine 53 (2011), 1–14.

S. Pohl, J. Zobel, and A. Moffat. 2010. Extended Boolean Retrieval for Systematic Biomedical Reviews. In Proceedings of the thirty-third Australasian Computer Science Conference.

L. Polanyi, C. Culy, M. van den Berg, G. L. Thione, and D. Ahn. 2004. A Rule-based Approach to Discourse Parsing. In Proceedings of the fifth SIGdial workshop on Discourse and Dialogue. 108–117.

D. R. Radev, E. Hovy, and K. McKeown. 2002. Introduction to the special issue on summarization. Computational Linguistics 28, 4 (2002), 399–408.

D. R. Radev, H. Jing, and M. Budzikowska. 2000. Centroid-based Summarization of Multiple Documents: Sentence Extraction, Utility-based Evaluation, and User Studies. In Proceedings of the NAACL-ANLP Workshop on Automatic Summarization. 21–30.

D. R. Radev, H. Jing, M. Stys, and D. Tam. 2004. Centroid-based Summarization of Multiple Documents. Information Processing and Management 40 (2004), 919–938.

D. R. Radev and K. R. Mckeown. 1998. Generating Natural Language Summaries from Multiple On-line Sources. Computational Linguistics 24, 3 (1998), 470–500.

D. R. Radev, J. M. Prager, and V. Samn. 2000. Ranking suspected answers to natural language questions using predictive annotation. In Proceedings of the sixth conference on Applied Natural Language Processing. 150–157.

L. H. Reeve, H. Han, and A. D. Brooks. 2006. BioChain: Using lexical chaining methods for biomedical text summarization. In Proceedings of the 21st annual ACM symposium on applied computing, bioinformatics track. 180–184.

L. H. Reeve, H. Han, and A. D. Brooks. 2007. The use of domain-specific concepts in biomedical text summarization. Information Processing and Management 43 (2007), 1765–1776.

L. H. Reeve, H. Han, S. V. Nagori, J. C. Yang, T. A. Schwimmer, and A. D. Brooks. 2006. Concept frequency distribution in biomedical text summarization. In Proceedings of the ACM 15th conference on information and knowledge management (CIKM’06). 604–611.

S. W. Richardson, M. C. Wilson, J. Nishikawa, and R. S. Hayward. 1995. The well-built clinical question: a key to evidence-based decisions. ACP Journal Club

BMJ 310 (1995), 1122–1126.

D. L. Sackett, W. M. C. Rosenberg, J. A. M. Gray, B. R. Haynes, and W. S. Richardson. 1996. Evidence based medicine: what it is and what it isn’t. BMJ

Journal of the American Medical Informatics Association

Computational Linguistics 28, 4 (2002), 497–526.

A. Sarker. 2014. Automated Medical Text Summarisation to Support Evidence-based Medicine. Ph.D. Dissertation. Macquarie University. http://web.science.mq.edu.au/~diego/theses/AbeedSarker.pdf

A. Sarker, D. Mollá, and C. Paris. 2012. Towards Two-step Multi-Document Summarisation for Evidence Based Medicine. In Proceedings of the ALTW. 79–87.

A. Sarker, D. Mollá, and C. Paris. 2013a. An Approach for Query-Focused Text Summarisation for Evidence Based Medicine. In Artificial Intelligence in Medicine, N. Peek, R. M. Morales, and M. Peleg (Eds.). Lecture Notes in Computer Science, Vol. 7885. Springer Berlin Heidelberg, 295–304.

A. Sarker, D. Mollá, and C. Paris. 2013b. Automatic Prediction of Evidence-based Recommendations via Sentence-level Polarity Classification. In Proceedings of the International Joint Conference on Natural Language Processing. 712–718.

A. Sarker, D. Mollá, and C. Paris. 2015. Automatic evidence quality prediction to support evidence-based decision making. Artificial Intelligence in Medicine 64, 2 (June 2015), 89–103.

A. Sarker, D. Mollá, and C. Paris. 2016. Query-oriented evidence extraction to support evidence-based medicine practice. Journal of Biomedical Informatics 59 (2016), 169–184.

C. Sauper and R. Barzilay. 2009. Automatically Generating Wikipedia Articles: A Structure-Aware Approach. In Proceedings of the ACL.

F. Schilder and R. Kondadadi. 2008. FastSum: Fast and Accurate Query-based Multi-document Summarization. In Proceedings of ACL-HLT, Short Papers. 205–208.

R. W. Schlosser, R. Koul, and J. Costello. 2006. Asking well-built questions for evidence-based practice in augmentative and alternative communication. Journal of Communication Disorders 40 (2006), 225–238.

R. Schwitter. 2010. Creating and Querying Formal Ontologies via Controlled Natural Language. Applied Artificial Intelligence 24, 1-2 (2010), 149–174.

S. Selvaraj, Y. Kumar, E., P. Saraswathi, D. Balaji, P. Nagamani, and S. K. Mohan. 2010. Evidence-based medicine - a new approach to teach medicine: a basic review for beginners. Biology and Medicine 2, 1 (2010), 1–5.

K. G. Shojania and L. A. Bero. 2001. Taking Advantage of the Explosion of Systematic Reviews: An Efficient MEDLINE Search Strategy. Effective Clinical Practice 4, 4 (2001), 157–162.

E. H. Shortliffe. 1990. Computer Programs to Support Clinical Decision Making. The Journal of the American Medical Association

Proceedings of the IEEE, Vol. 67. 1207–1223.

D. C. Slawson and A. F. Shaughnessy. 2005. Teaching Evidence-Based Medicine: Should We Be Teaching Information Management Instead? Academic Medicine 80, 7 (2005), 685–689.

K. Sparck Jones. 1999. Automatic summarizing: factors and directions. In Advances in automatic text summarization, Inderjeet Mani and Mark T. Maybury (Eds.). The MIT Press, Chapter 1, 1–12.

K. Sparck Jones. 2007. Automatic summarising: The state of the art. Information Processing and Management 43 (2007), 1449–1481.

H. Suominen, S. Pyysalo, M. Hissa, F. Ginter, S. Liu, and D. Marghescu et al. 2008. Handbook of Research on Text and Web Mining Technologies. IGI Global, Chapter Performance Evaluation Measures for Text Mining, 724–747.

H. Suominen, S. Salantera, S. Velupillai, W. W. Chapman, G. Savova, N. Elhadad, S. Pradhan, B. R. South, D. L. Mowery, G. J. F. Jones, J. Leveling, L. Kelly, L. Goeuriot, D. Martinez, and G. Zuccon. 2013. Overview of the ShARe/CLEF eHealth Evaluation Lab 2013. In Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Pamela Forner, Henning Muller, Roberto Paredes, Paolo Rosso, and Benno Stein (Eds.). Lecture Notes in Computer Science, Vol. 8138. 212–231.

K. M. Svore, L. Vanderwende, and C. J. C. Burges. 2007. Enhancing Single-document summarization by combining RankNet and Third-party Sources. In Proceedings of the 2007 Joint Conference on EMNLP-CoNLL. 448–457.

R. J. Taylor, B. R. McAvoy, and T. O’Dowd. 2003. General Practice Medicine: an illustrated colour text. Elsevier Health Sciences.

S. Teufel. 2001. Task-Based Evaluation of Summary Quality: Describing Relationships between Scientific Papers. In Proceedings of the NAACL 2001 Workshop on Automatic Summarization. 12–21.

S. Teufel and M. Moens. 1997. Sentence Extraction as a Classification Task. In Proceedings of the ACL-97. 58–65.

G. L. Thione, M. Van den Berg, L. Polanyi, and C. Culy. 2004. Hybrid Text Summarization: Combining External Relevance Measures with Structural Analysis. In Proceedings of the Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. 51–55.

A. Tutos and D. Mollá. 2010. A Study on the Use of Search Engines for Answering Clinical Questions. In Proceedings of the Australian HIKM Workshop. 61–68.

H. van Halteren and S. Teufel. 2003. Examining the consensus between human summaries: initial experiments. In Proceedings of the NAACL-HLT Workshop on Text summarization, Vol. 5. 57–64.

A. A. H. Verhoeven, E. J. Boerma, and B. M. de Jong. 2000. Which literature retrieval method is most effective for GPs? Family Practice 17, 1 (2000), 30–35.

W. Wang, D. Hu, M. Feng, and L. Wenyin. 2007. Automatic clinical question answering based on UMLS relations. In Proceedings of the third International Conference on Semantics, Knowledge and Grid.

T. E. Workman, M. Fiszman, and J. F. Hurdle. 2012. Text summarisation as a decision support aid. BMC Medical Informatics and Decision Making 12 (2012), 41–53.

H. Xu, S. P. Stenner, S. Doan, K. B. Johnson, L. R. Waltman, and J. C. Denny. 2010. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association 17 (2010), 19–24.

W. Yih, J. Goodman, L. Vanderwende, and H. Suzuki. 2007. Multi-Document Summarization by Maximizing Informative Content-Words. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI). 1776–1782.

J. M. Young and J. E. Ward. 2001. Evidence-based Medicine in General Practice: Beliefs and Barriers Among Australian GPs. Journal of Evaluation in Clinical Practice 7, 2 (2001), 201–210.

H. Yu and Y. Cao. 2008. Automatically Extracting Information Needs from Ad Hoc Clinical Questions. In Proceedings of AMIA Annual Symposium. 96–100.

H. Yu and D. Kaufman. 2007. A Cognitive Evaluation of Four Online Search Engines for Answering Definitional Questions Posed by Physicians. Pacific Symposium on Biocomputing 12 (2007), 328–339.

H. Yu, M. Lee, D. Kaufman, J. Ely, J. A. Osheroff, G. Hripcsak, and J. Cimino. 2007. Development, implementation, and a cognitive evaluation of a definitional question answering system for physicians. Journal of Biomedical Informatics 40 (2007), 236–251.

H. Yu and C. Sable. 2005. Being Erlan Shen: Identifying Answerable Questions. In Proceedings of the IJCAI’05 Workshop on Knowledge and Reasoning for Answering Questions (KRAQ’05). 6–14.

H. Yu, C. Sable, and H. Zhu. 2005. Classifying Medical Questions based on an Evidence Taxonomy. In Proceedings of the AAAI Workshop Question Answering in Restricted Domains. 27–35.

Q. T. Zeng, S. Goryachev, S. Weiss, M. Sordo, S. N. Murphy, and R. Lazarus. 2006. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Medical Informatics and Decision Making

Proceedings of NAACL-HLT. 447–454.

P. Zweigenbaum. 2003. Question Answering in Biomedicine. In Proceedings of the EACL 2003 Workshop on Natural Language Processing for Question Answering.

P. Zweigenbaum. 2009. Knowledge and reasoning for medical question-answering. In Proceedings of the 2009 Workshop on Knowledge and Reasoning for Answering Questions, ACL-IJCNLP 2009. 1–2.

P. Zweigenbaum, D. Demner-Fushman, H. Yu, and K. B. Cohen. 2007. Frontiers of biomedical text mining: current progress.