Domain-topic models with chained dimensions: charting an emergent domain of a major oncology conference
Alexandre Hannud Abdo, Jean-Philippe Cointet, Pascale Bourret, Alberto Cambrosio
Domain-topic models with chained dimensions: charting the evolution of a major oncology conference (1995-2017)
Alexandre Hannud Abdo, Jean-Philippe Cointet, Pascale Bourret, Alberto Cambrosio
LISIS, UMR 1326 INRAE, Université Paris Est, Champs sur Marne, Marne-la-Vallée, France; Sciences Po, médialab, Paris, France; Aix Marseille Univ, INSERM, IRD, SESSTIM, Marseille, France; Department of Social Studies of Medicine, McGill University, Canada; Garoa Hacker Clube, São Paulo, Brazil
First submitted for publication on 2019-02-20, this version on 2020-03-03. Supporting information S1 and S2 are available at
Abstract
This paper presents three main contributions to the computational study of science from bibliographic corpora. First, by combining hypergraphs and stochastic block models, it introduces a new approach to model corpora based on their substantive contents and integrating both temporal and other metadata dimensions. We call this simultaneous modeling of documents and words "domain-topic models", and their integration with metadata their "chained dimensions". Second, the paper introduces a new form of interactive map for the exploration of hypergraph data that enables the seamless navigation of the different dimensions, scales, and their relations, as expressed in the models, and describes the steps to accurately read these new science maps. Third, it introduces a new corpus that is both of great interest to current STS research and an exemplary case for the new methodology presented here: the 1995-2017 collection of abstracts presented at ASCO, the largest annual oncology research conference. It is shown that the new approach, named SASHIMI, is able to infer thematic clusters in the corpus, describe them as assemblages of topics, and detect the presence of significant temporal patterns, identifying the major thematic transformations of oncology during the period.
Keywords: science mapping; scientometrics; text mining; stochastic block model; hypergraph; history of oncology; clinical cancer research
Introduction
In this paper, we introduce a new approach for the computational analysis of research activities and their dynamics. We begin by briefly situating it vis-a-vis previous approaches, followed by a detailed description of our modeling, instructions on how to read the maps it generates, and a concrete example of its application to a substantive area, namely oncology. Since De Solla Price’s pioneering work [1,2] , the quantitative investigation of scientific activities has led to the development of a wide range of methodologies that often embed different understandings of the nature and dynamics of science. The availability of bibliometric information, as epitomized by the Science Citation Index [3] was instrumental for the development of the first science indicators. It also underlies initial attempts to map the epistemic structure of science. This accounts for the fact that early efforts to visualize “the topology of relationships between elements or aspects of science” [4] resorted to the analysis of citation networks. Co-citation networks [5] and bibliographic coupling [6] are two influential examples of these early approaches. The former connects articles that are jointly cited by other articles, whereas the latter links articles whose cited references strongly overlap. The main goal of citation analysis is to map the organization of the scientific community into a number of sub-communities, and to examine how this structure evolves through time [7], but citation analysis can also provide insights into the content of scientific activities, insofar as clusters of citing and cited references can be interpreted as corresponding to distinct topical domains or epistemic fields, i.e. as sets of papers that practitioners can recognize as belonging to a given research area or thought community. Not every document, however, contains references. 
Conference abstracts such as those analyzed in this article, for instance, typically lack citations, which precludes the use of citation analysis to map the research-front results presented at those meetings, or to compare claims made at conferences with those subsequently published in clinical or scientific journals. Because citation analysis relies on the chronology and biases of citations, it is also subject to criticism that it fails to capture the dynamics of research fronts and fast moving domains [8], prompting the usage of hybrid methods [9]. An alternative to citation networks, the analysis of scientific collaboration networks provided an attractive option for mapping scientific activities. While initial work focused on the fine-grained description of different kinds of relationships (co-authoring, mentoring, etc.) between scientists within small research teams [10], the subsequent availability of large publication databases allowed for the large-scale analysis of collaborative structures derived from co-authoring networks [11,12]. These collaborative networks correspond to the organization of research by teams, but this approach fails to capture and describe situations characterized, for instance, by the presence of competing teams working with similar questions and conceptual frameworks, or one team's internally evolving research program. Furthermore, when mapping scientific activities by employing network analysis methods, they often feature a single analytical dimension, such as clusters of co-authors, citations, or keywords (see, e.g. [13–15]). The resulting visualizations appear easy to interpret but conceal a number of drawbacks. For instance, in addition to their inability to simultaneously analyze and display multiple dimensions, most mapping algorithms flatten the research landscape, missing hierarchical relations between and within clusters, and only display a limited number of nodes, often disregarding part of the data. 
This calls for more accurate and exhaustive descriptions of the multi-level structure of scientific activities.
The development of co-word analysis [16] resulted in a major shift in focus. The “sociology of translation” that underlies co-word analysis is primarily interested in how texts are framed and organized in order to funnel the attention (and interests) of readers. Articles are defined as literary inscriptions that are both the outcome of research activities and artefacts specifically designed to enroll other scientists. A key purpose of co-word analysis is thus to reveal the strategies deployed by authors when they choose to employ a given assemblage of terms to describe (“problematize”) a given issue. Connections between terms are thus said to provide faithful descriptions of a given scientific endeavor. Clusters of terms correspond to research specialties whose emergence or disappearance during subsequent periods is signaled by variations in the strength of associations between words [17]. Co-word analysis, insofar as it does not resort to citations or references to connect documents can be used with a variety of textual sources (including grey literature, legal judgments, etc.) and has been designed to capture the dynamics of research fronts whose emergence can be tracked by new associations between words. Co-word analysis, however, leads to a purely lexical description of research specialties. Entities featured in co-word maps [18] are “severed” from their original medium, namely scientific articles. Paradoxically, the co-word analysis model, although inspired by the sociology of translation, does not allow for the inclusion of heterogeneous actors such as authors or institutions, or does so only metaphorically (e.g. in [19]), without providing an actual computational model measuring how heterogeneous nodes are connected to each other. 
As we will see, SASHIMI, like co-word analysis, analyzes word associations within documents, but instead of subsequently discarding documents to focus uniquely on lexicon, investigates word-occurrence patterns in relation to the documents that feature them. Topic models [20] also draw on word co-occurrence patterns to explore the semantic structure of document collections. More precisely, by examining the joint occurrence of words within a single document, this approach infers a generative model whereby documents are modeled as a mixture of topics. Correspondingly, topics are comprised of a probabilistic mixture of terms. Although not originally designed to describe the evolution of scientific fields, topic models have proven to be an efficient tool to analyze these processes [21]. As with co-word analysis, topic models only require as an input statistical information about word occurrences and co-occurrences within documents. Moreover, they describe individual documents as consisting of a weighted list of topics, and other kinds of document-level information can potentially be included in the analysis. For instance, author topic models [22] link topics to specific authors. They, however, constrain the collection generation model by pre-defining a limited number of topics that can be assigned to each author, the underlying rationale being that a single author can only participate in a restricted number of endeavors. Additionally, while efficient for generating descriptions of scientific fields, the variety of categorizations that topic models can describe is constrained by strong priors that typically take the form of a Dirichlet distribution [23]. Most importantly, despite their solid mathematical foundations and their empirically proven efficiency, topic models do not provide per se a way to cluster individual documents into groups. Problems in dealing with documents are further compounded by the probabilistic model used to assign topics to documents. 
The question is then how to reconcile the real-time, highly granular semantic nature of word-based methods for portraying research activities , with the need to reconstruct sets of semantically coherent papers that constitute research domains. Named SASHIMI (Symmetrical And Sequential analysis from Hierarchical Inference of Multidimensional Information), our approach builds on work by Peixoto [24] on Stochastic Block Models (henceforth: SBM) and is technically related to recent work by Gerlach et al. [23] on topic models. It centers on the analysis of textual corpora enriched with metadata (e.g., in the case of scientific activities, collections of articles, conference abstracts, or grant proposals). Its main feature is a dual description of corpora in terms of research domains (collections of documents) and topics (collections of words). SASHIMI seamlessly articulates these two different analytical dimensions in what we will call a Domain-Topic Model. Using the resulting domain organization as a starting point, inference chains are subsequently constructed to unveil the structure of additional metadata, generating , for instance, hierarchical clusters of meetings, journals, authors, or institutions. Finally, because SASHIMI inherits the nonparametric, parsimonious, and nested nature of our choice of SBM (see methods section below), our approach is robust, avoiding overfitting as well as underfitting the data. Its nested hierarchy also allows for the description of topics and research domains, as well as metadata structures at different scales or levels of granularity. To test our approach, we applied it to a rarely investigated, yet quite significant kind of scientific corpus, namely abstracts of presentations at yearly research conferences. 
In the present case, we selected a major clinical research conference, the annual meetings of the American Society of Clinical Oncology (ASCO) during the 1995-2017 period (for more details on ASCO and the reasons for this choice, see the Material and Methods section below). We use this dataset to illustrate how our method provides insights into both the structure and the dynamics of a research domain. For brevity’s sake we analyze only the chronological dimension of the metadata, i.e. the annual meeting at which each paper was presented. However, the same procedure could be used to assess the organization of other dimensions, such as institutional strategies. Moreover, because our main interest is to explore the unique feature of providing a dual description of corpora in terms of domains and topics, in this paper we shall not strive to directly compare our results to existing models, which identify either clusters of documents or topics of words, but not both. Nonetheless, we note that approaches based on the same underlying SBM framework adopted here have already been successfully compared to clustering methods [24] and to topic models [23].
Material and methods
The work presented here has received ethical approval by the Institutional Review Board of the Faculty of Medicine of McGill University (IRB Study Number A07-E55-15B).
Domain-topic models
As a starting point, we posit that research domains can be defined as sets of scientific texts addressing the same questions, using shared methods, and focusing on the same or related entities. Research domains also carry a social dimension, given that they are grounded in the activities of specific teams and institutions, whose results are made public at professional conferences and in scholarly journals. Our approach thus first relies on the content of documents to cluster them into research domains that differentially assemble content topics, while reciprocally clustering content elements, usually words, into topics that differentially appear in domains. Additional dimensions such as authorship, institutional affiliations, funding sources, or year of publication can subsequently be clustered based on their distinctive relationship to domains. In other words, they can be explored through the lens of their differential distribution across and within these research domains. To analyze documents we adopt a classic bag-of-words hypothesis [25] and model each document as an unordered set of words. As exemplified by figure 1, document 1 is composed of the terms “the”, “patient” and “surgery”; document 2 of the terms “the”, “patient”, “average”, and “radiation”; document 3 of the terms “the”, “average”, “therapy”, and “breast_cancer”. We take into account every single term contained in a document, including stop-words like “the” that are generally removed by Natural Language Processing pipelines. In addition, we automatically identify the most prominent collocations, such as “breast_cancer” (see section on data processing for more details), although this step is not essential.
Figure 1: Representation of relationships found in a research corpus. Documents work as hyperedges, connecting their textual content and metadata dimensions.
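The toy corpus of figure 1 can be written down directly as bags of words. The following is a minimal sketch, not the authors' pipeline: the labels `doc1` to `doc3` and both helper functions are illustrative names, and repeated occurrences of a term within a document are collapsed, following the "unordered set of words" description above.

```python
from collections import Counter

# Toy corpus from figure 1; the document labels are illustrative.
corpus = {
    "doc1": ["the", "patient", "surgery"],
    "doc2": ["the", "patient", "average", "radiation"],
    "doc3": ["the", "average", "therapy", "breast_cancer"],
}

def incidence_pairs(corpus):
    """Return sorted (document, term) pairs, one per distinct term in a document."""
    return sorted((doc, term) for doc, terms in corpus.items() for term in set(terms))

def term_document_frequency(corpus):
    """Count the number of documents in which each term appears."""
    return Counter(term for terms in corpus.values() for term in set(terms))

pairs = incidence_pairs(corpus)
doc_freq = term_document_frequency(corpus)
```

Each pair ties one document to one of its terms, so a document acts as a hyperedge over its vocabulary. Note that the stop-word “the” is kept and appears in every document, consistent with the choice not to filter stop-words.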
Our goal is to study the document hypergraph through the lens of its substantive dimension (terms in the text) by clustering its incidence graph restricted to such dimension, then use the clustered hyperedges to induce clusterings on the other dimensions. The ensemble of documents is subsequently turned into a network, as shown in figure 2. The network features two kinds of nodes: individual documents and terms in their vocabulary. Edges are placed connecting each document to the terms it contains. As this only connects documents to terms, and not nodes of the same type, the network is bipartite. The next step is to simultaneously cluster documents and terms to produce a categorization of both partitions: documents are organized by domains, while terms are organized by topics (see figure 2). Traditionally, content-based clustering of documents groups them into domains that share a similar term usage, whereas topic modeling groups terms into topics that appear together in documents. A domain-topic model does both at once in a synergistic way, resulting in a dual clustering of domains on one side, and topics on the other, according to connectivity patterns in the bipartite document-term graph.
Figure 2: Domain-topic model of a bipartite graph of documents linked to their terms. The resulting dual structure features on the right side a hierarchy of research topics and on the left side a hierarchy of research domains gathering articles with a similar distribution of topics (1D = level-1 domains, 2D = level-2 domains, 1T and 2T for topics).
The empirical difference in connectivity patterns on the two sides of the bipartite network raises the following issue. On the one hand, terms may vary from being present in only a few documents to spanning all of them, while, on the other hand, all documents contain about the same number of terms. Put differently, the frequency distribution of terms is wide while the distribution of the number of terms per document is narrow.
This suggests that we will either need to couple two different models or employ a model that can capture quite general patterns. To deal with this obstacle, we employ the nested degree-corrected stochastic block model [24], which clusters nodes according to general connectivity patterns and provides a parsimonious method for model fitting and selection. In SBM terminology, the clusters are called blocks. The basic SBM is a generative model where nodes are organized into blocks and connected according to edge probabilities between the blocks to which they belong. Degree-correction improves on it by accounting for heterogeneous degree frequencies within blocks, while nesting lets it express patterns at different scales, producing a hierarchy of blocks. This stochastic model and related variants have been successfully applied to the analysis of both static and time-varying graphs, and have been shown to robustly reveal unexpected connectivity patterns while avoiding overfitting [26]. In the framework we adopt, fitting the model to network data is done by partitioning the data nodes into blocks, and by tracing the edge counts and degree frequencies from the connections in the data. A best fit is to be found by seeking one such partition that minimizes the combined informational weights of describing the data given the model and of describing the model parameters, following a minimal description length approach [24]. As detailed in [24], with $A$, $k$, $e$ and $b$ respectively corresponding to the data, the degree sequences, the edge counts and the block structure, one has to minimize the description length $\Sigma$:
$$\Sigma = -\log P(A \mid k, e, b) - \log P(k, e, b)$$
This class of models further provides our domain-topic model with the desired property that blocks can be described at different scales, producing a nested hierarchy whereby blocks come together to form higher level blocks, which may themselves be grouped, and so forth.
Such a hierarchy provides meaningful boundaries to manage the interpretation of a very large number of domains or topics. Furthermore, the hierarchy plays an important role from an inference perspective as it helps to avoid underfitting the model, and its depth is inferred from the data in the same way as the number and size of blocks [25]. This yields a similar equation for the description length $\Sigma$ to be minimized. With $\{e_l\}, \{b_l\}$ standing for the values of $e, b$ for every level $l$:
$$\Sigma = -\log P(A \mid k, \{e_l\}, \{b_l\}) - \log P(k, \{e_l\}, \{b_l\})$$
This can be related in a Bayesian formulation to maximizing the likelihood of the data given the model parameters, plus the likelihood of the model parameters, in a nonparametric model where both the number and the size of blocks are modeled as hyperparameters with uninformative hyperpriors and are thus themselves derived from the data [27]. Thus, in a hierarchical domain-topic model, instead of a single level of abstraction (i.e., domains containing documents and topics containing terms), multiple levels of subdomains and subtopics are possible, up from the respective ground levels of document and term. Level 1 corresponds to the most detailed or granular block level, followed by level 2 and up to the highest level that provides the coarsest description. Another major challenge in the study of corpora is the ease and reproducibility of result interpretation. This concern guides the following two choices regarding our modeling. First, stochastic models can be employed either by searching for a single model that best fits the data or by averaging the values of interest over a distribution of models yielded according to their fitness. In this paper we have worked with the single best-known fit for our data. Second, the model adopted here allows for overlapping as well as non-overlapping blocks. In this paper we have worked with non-overlapping blocks.
Contrary to what one might expect, it has been shown that these models are often a better fit than overlapping ones [28]. This remains an area for future work, including the development of models that overlap topics but not domains, to better take into account the polysemy of terms. Together, these choices greatly simplify interpretability insofar as each document is attached to a single domain, both probabilistically (single best fit) and concretely (non-overlapping).
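Returning to the description length introduced above, the following lines are a reading aid that spells out why minimizing it amounts to a Bayesian maximization. It is a sketch under two assumptions that are not stated in the text: that, as in the microcanonical formulation of [24], the degree sequences and edge counts are fully determined by the data and the partition, and that the logarithm is natural. The evidence $P(A)$ is a symbol introduced here for illustration.

```latex
\begin{align}
\Sigma &= -\log P(A \mid k, e, b) - \log P(k, e, b)
        = -\log P(A, k, e, b) = -\log P(A, b), \\
P(b \mid A) &= \frac{P(A, b)}{P(A)} = \frac{\exp(-\Sigma)}{P(A)} .
\end{align}
```

Since $P(A)$ does not depend on the partition $b$, the partition with the smallest description length is also the one with the largest posterior probability.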
To sum up: we resort to nested degree-corrected stochastic block models to categorize the document-term bipartite graph, giving rise to a hierarchical domain-topic model. These models have several advantages over other community detection methods, namely:
● They detect general connectivity patterns and are thus not limited by assortativity, in contrast to traditional modularity based community detection methods [29].
● They are non-parametric; in particular, both the number of groups and levels in the hierarchy are directly inferred from the data.
● They do not overfit: by accounting for the information weight of model parameters when maximizing model probability, they only infer statistically significant structures.
We perform our optimization procedures using the SBM implementation found in the graph-tool library [30], which employs an efficient Markov Chain Monte Carlo (MCMC) approach. Our usage of the library and the specific procedures adopted for this paper are detailed in the supporting information (S2 Data).
Chained dimensions
After obtaining the domain-topic model, we can use its partitions to form an inference chain towards the documents’ metadata. This is achieved by first adding new elements to the graph that correspond to metadata values, while connecting them to the documents within which they appear. Then, we fit the same model to this new graph, while keeping fixed the nested blocks of documents and topics we previously inferred, thus optimizing only on metadata blocks given their connections to documents and their domains. Optionally, one can remove the terms and topics from the graph since they are fixed and not connected to the metadata. We call this procedure a chained stochastic block model. It allows us to obtain hierarchical partitions of metadata elements from their relationship to the domains, corresponding to our stated objective of studying other dimensions through the lens of the thematic organization of research. Figure 3 illustrates the chaining of temporal metadata.
Figure 3: In SASHIMI, the model can be extended to any other variable that can be assigned to individual documents, like publication years as shown in this example. The research domain hierarchy which has been computed based on the content of documents is fixed. Years are then arranged in blocks in search of the most parsimonious model (1P = level-1 periods, 2P = level-2 periods).
"Sashimi" interactive maps
To investigate and account for SASHIMI’s domain-topic model and chained dimensions, we developed a form of interactive map appropriate to navigate its hypergraph structure and the relationships established therein. Individually, these interactive maps allow one to navigate all hierarchical levels and zoom over specific blocks, which can be domains, topics or even blocks from an extended dimension. One may as well consult a summary of any block by positioning the mouse over it.
This summary displays general information such as the total number and examples of documents or terms contained in the block, as well as terms or document titles it associates with in the opposite side of the model. Moreover, clicking on a block displays additional information about it on the side of the map. Because of their appearance and what they represent, we call these maps "sashimi" (lower case), optionally adding the respective dimension, e.g. the domain sashimi. Yet, sashimis are most useful in the form of a “sashimi combo” map, where two or more maps are presented side-by-side. In the case of a domain-topic map, a dual map with a domain map to the left and a topic map to the right, a click on a domain will change the colors of the topic map to correspond to the degree to which each topic is more present in that specific domain than in the full corpus. Conversely, clicking on a topic will color the domain map to display how much each domain employs that chosen topic in relation to its global usage. We call the first case the topic map for a given domain, and the second the domain map for a given topic. Both colorings are expressed by the following simple formula, respectively with a fixed $D$ or $T$:
$$m(D, T) = \log \frac{\text{Abstracts in } D \text{ using } T}{\text{Abstracts in } D} - \log \frac{\text{Total abstracts using } T}{\text{Total abstracts}}$$
In a later section we will present static frames obtained from these visualizations, as we investigate a few significant research domains at different levels by discussing their most salient topics (see, e.g., figure 5). We add labels and notes on top of the static maps to guide readers in the absence of interaction, but otherwise they are encouraged to follow the interactive "combo" map included in the supporting information (S1 Sashimis). We note that the formula above also applies in the context of a domain-chained map, where $T$ is a block from any chained dimension.
While free exploration of these maps can be useful in itself, we provide here a more disciplined approach to reading them. The first step, starting from a domain-topic model over a corpus, is to produce a volumetric combo map showing the simple count of documents in the domains and terms in the topics. This initial contact can be used to check the general outlook of the maps, making sure there are no suspicious structures suggesting that the model is not fully optimized and that further optimization might yield a significantly better fit. Such a sign could be the presence of a very small block on a very high level, whose content doesn't look that different from the rest. Still, one would not usually worry about this, and if optimization has already been performed thoroughly (see S2 Data) this can be entirely skipped. Then, with the same volumetric maps, one proceeds to get familiar with and annotate the topic hierarchy. Topics are much simpler than domains, since at their base they represent groups of terms, while domains represent groups of documents, which are themselves assemblages of different topics. Thus, the full topic hierarchy can be traversed and understood by itself, with higher levels annotated on the basis of their children. This becomes a powerful resource when reading the domains later on. Next, towards studying the domain hierarchy, we note that at the highest level of aggregation domains can display a quite heterogeneous composition. We thus consider that, for interpretation purposes, it is better to move up the hierarchy, starting from the lowest level, where we find the most coherent subdomains. In practice, this means that level-1 domains can be characterized from their topic maps. From there, we can build up our understanding of the higher levels, which are best understood from reading their annotated subdomains.
However, lower levels are often too numerous, so one must choose a criterion to find a path through the corpus, which will depend on the research question being posed. Perhaps the simplest criterion is to start with high volume domains. Alternatively, one may use some other meaningful quantity. In this paper, we model a chained dimension, namely time, and take the change in importance of domains between periods (blocks of years) as indication of interest. This could be done for other extended dimensions in an analogous fashion, producing other kinds of paths through the corpus. Finally, one may start with a topic of interest, of any level, and use the domain map for that topic to identify domains where it is, or isn't, relevant. One may also follow the topic map for an identified domain, and jump between both sides of the domain-topic model, to find domains that share some topic with a previous one. Different combinations of these steps will answer different questions. And, in case some domain of interest is not a level-1 domain, one shall replay these procedures to characterise and annotate the lower levels until one can properly read it. As a general rule, good characterisation of a domain or topic usually comes from the bottom-up. Beyond simple volumetric maps, contrast maps (see, e.g., figure 7) are especially useful for depicting how the prevalence of each domain changed between two subsets of the corpus, which can be obtained, for example, from a partition of a chained model over an extended dimension. As just mentioned, this is what we will do to the temporal dimension of our corpus (see figure 6).
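Before turning to the coloring of contrast maps, the coloring used earlier for the topic map of a given domain (and the domain map of a given topic) can be sketched as a small function: the log of the share of a domain's abstracts that use a topic, minus the log of the share of all abstracts that use it. This is an illustrative sketch rather than the authors' code. The function name and argument names are hypothetical, the natural logarithm is an assumption (the text only says "log"), and raising an error on zero counts is a choice made here because the logarithm is undefined in that case.

```python
import math

def topic_domain_color(in_domain_using_topic, in_domain, total_using_topic, total):
    """Log-ratio of a topic's share within a domain to its share in the full corpus.

    Positive values mean the topic is over-represented in the domain,
    negative values that it is under-represented, zero that the shares match.
    """
    counts = (in_domain_using_topic, in_domain, total_using_topic, total)
    if min(counts) <= 0:
        # Assumption of this sketch: undefined logarithms are reported, not imputed.
        raise ValueError("all counts must be positive for the logarithms to be defined")
    domain_share = in_domain_using_topic / in_domain
    corpus_share = total_using_topic / total
    return math.log(domain_share) - math.log(corpus_share)

# Hypothetical counts: 50 of 200 abstracts in a domain use the topic,
# against 100 of 1000 abstracts in the whole corpus.
over_represented = topic_domain_color(50, 200, 100, 1000)
# Same share inside the domain as in the corpus.
neutral = topic_domain_color(10, 100, 100, 1000)
```

With the hypothetical counts above, the topic is used by a quarter of the domain's abstracts but only a tenth of the corpus, so the value is positive. Equal shares give a value of zero.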
The formula below is for the color of contrast maps between partitions $P^+$ and $P^-$:
$$m(D, P^+, P^-) = \frac{\text{Abstracts in } D \text{ for } P^+}{\text{Total abstracts for } P^+} - \frac{\text{Abstracts in } D \text{ for } P^-}{\text{Total abstracts for } P^-}$$
Moreover, while a domain may shift in overall importance, as compared to other domains of the same level, it may also exhibit interior motion, with its subdomains changing their proportions within the domain over time or, in the case of a level-1 domain, which has no subdomains, with its topics varying in usage over time. Charting this interior motion would provide further understanding of domains that have already been characterised both statically and in their longitudinal prevalence in the corpus. To conclude this section, when creating maps, we strongly suggest that domains and topics should be consistently presented in contrasting colors, e.g. red and blue. Extended dimensions should avoid these colors. In our maps, colors are normalized within each level, so that the blocks of strongest color correspond to the greatest value at their level of granularity.
The ASCO Annual Meeting abstracts
To test and exemplify our approach, we applied it to the analysis of a dataset of conference abstracts from the American Society for Clinical Oncology (ASCO) Annual Meeting between 1995 and 2017. Why conference abstracts rather than journal articles? SASHIMI only requires access to the textual content of individual documents, rather than specific metadata, and is therefore able to analyze different kinds of textual documents, including publications, grant proposals, or conference proceedings. Scientific and clinical gatherings such as the ASCO meetings are a major forum for the introduction of the latest clinical and scientific research results, with the understanding that some of those results are preliminary, will not necessarily be confirmed, and will therefore remain unpublished [31] . Investigating conference abstracts rather than publications will thus provide us with a privileged take on “science in the making” [32] , i.e., in our case, the moving front of oncology research, while also creating the possibility of comparing different stages of the production of scientific knowledge. Science studies scholars have rarely investigated scientific conferences. Notable exceptions include Söderqvist and Silverstein [33,34] who analyzed immunology meetings to unravel the sub-disciplinary structure of that discipline: their approach, however, was based on a cluster analysis of meeting participants, rather than an investigation of the actual content of presentations. But can abstracts be considered as representative of the content of full presentations? Recent evidence [35] — albeit based on journal abstracts in a different domain — shows that abstracts cannot be considered as “mere teasers”, and that in fact their “generosity” (defined on the basis of the amount and importance of information provided by the abstracts) has been increasing in recent years. But why ASCO? 
The ASCO Annual Meeting is the main venue for cancer researchers from around the world to present innovative results. Established in 1964 with 66 members specializing in the then emerging chemotherapy domain, ASCO membership had grown by 2010 to more than 27,000 adherents representing all oncology subspecialties. Its network presently connects close to 45,000 oncology professionals and covers more than 150 countries. Attendance at the annual meeting broke the 5,000 mark in 1985, and in 2018 had increased to 40,700 participants (33,100 professionals and 5,700 exhibitors). Regularly attended by a large number of foreign practitioners (46% of all attendees in 2018), the ASCO Annual Meeting has become one of the largest gatherings of medical professionals in the world. Initially held in conjunction with the annual meeting of the American Association for Cancer Research (AACR), the ASCO conference became self-standing in 1993, and, according to its organizers, by 1995 a decision was made to increase its emphasis on translational science. We can thus safely claim that papers selected for presentation at ASCO provide a representative sample of oncology investigations at the international research front, often including practice-changing results.

Figure 4 shows the evolution of the number of abstracts presented at the ASCO Annual Meeting from 1995 to 2017. After a period of steady growth, the yearly number of abstracts reached a plateau of approximately 5,000 abstracts in the 2010s. There are both practical and theoretical reasons for our decision to examine these abstracts and select this 23-year window. Oncology has played a pioneering role in the much-touted area of translational research [36], a subfield characterized by rapid change, and the ASCO Annual Meeting is the main annual event in this domain. As just mentioned, 1995 coincides with a readjustment of the socio-technical content of the annual meetings that continues to characterize them to the present day.
We can therefore test our approach on a reasonably stable, coherent set of documents that pertain, at the same time, to a dynamic domain. The highly selective nature of the ASCO Annual Meeting moreover spares us the thorny issue of separating the grain of what practitioners consider innovative contributions from the chaff of the mundane or routine findings that can be found in a rapidly expanding number of oncology journals.

Figure 4: The corpus of abstracts from the American Society of Clinical Oncology (ASCO) Annual Meeting between 1995 and 2017, for a total of 83,476 abstracts.
Finally, we should note that presentations at ASCO meetings are distributed throughout different conference sections and sessions according to their subject. However, these “native” categories are primarily designed for organizational purposes, and thus highly contingent. By contrast, we focus on the processing of the abstracts’ raw textual material as it provides a direct take on the thematic landscape of the conference as a whole and across time.
Data processing
We now turn to a more technical description of the processing pipeline we apply to the ASCO abstracts in preparation for inferring their domain-topic block structure. The procedure we adopt is entirely language independent. We do not apply any language-specific NLP procedure such as stemming or lemmatizing, but we still search for collocations using classical statistical analysis [36] to extract frequent bigrams in the corpus such as “stem_cells”, “breast_cancer”, “partial_response”, “aplastic_anemia”, or “colorectal_cancer”. We then index each document by its own set of terms, i.e. words and bigrams, effectively transforming each abstract into a vector that codes for the presence or absence of each term, thus following the classical bag-of-words prescription [37] to model documents.

We differ in two important ways from other traditional text-analysis procedures [38]: (i) we do not filter out a priori very rare or very frequent terms; (ii) our document vectors are purely binary, meaning that we do not store the information about how many times a term occurs in the abstract. Regarding (i), the inference method we use succeeds in gathering stop-words and infrequent terms in separate sub-categories. This reflects the aforementioned property of the SBM, which is able to detect any connectivity pattern found in the data. In the case of stop-words, the pattern is that of a uniform distribution. Regarding (ii), by ignoring the local frequency of terms within an abstract we attenuate stylistic variations across papers, thus avoiding groupings that are based on research features other than thematic ones.

Results and discussion
We adopt the notation below to refer to an individual domain or topic block at a particular level. For instance, L2T29 is the topic with index 29 at level 2, and L3D40 is the research domain with index 40 at level 3. Although the notation distinguishes domains from topics, all blocks are indexed together in such a way that, for a given level, domain indices start after the highest topic index.

$$\mathrm{L}i\,\mathrm{D}j \equiv \text{domain } j \text{ at level } i, \qquad \mathrm{L}i\,\mathrm{T}j \equiv \text{topic } j \text{ at level } i$$

The interactive maps discussed in the "Sashimi" section, from which the static frames shown in this section are taken, are available as self-contained HTML files in the supporting information (S1 Sashimis).
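As an illustration of this notation, the following minimal sketch parses a block label and checks it against the shared indexing of topics and domains. The names `Block`, `parse_block` and `index_in_range` are hypothetical helpers and not part of the authors' library. Zero-based indices are an assumption, suggested by labels such as L4T0 and L4D5 together with the level-4 counts in Table 1 (5 topics, 4 domains).

```python
import re
from typing import NamedTuple


class Block(NamedTuple):
    level: int
    kind: str  # "domain" or "topic"
    index: int


# Labels look like "L2T29" (a topic) or "L3D40" (a domain).
_LABEL = re.compile(r"L(\d+)([DT])(\d+)")


def parse_block(label: str) -> Block:
    """Split a label such as 'L2T29' into its level, kind and index."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a block label: {label!r}")
    level, kind, index = match.groups()
    return Block(int(level), "domain" if kind == "D" else "topic", int(index))


def index_in_range(block: Block, n_topics: int, n_domains: int) -> bool:
    """Check the shared indexing at one level: topics first, then domains.

    Assumes indices are zero-based (an assumption, see the text above).
    """
    if block.kind == "topic":
        return 0 <= block.index < n_topics
    return n_topics <= block.index < n_topics + n_domains


if __name__ == "__main__":
    print(parse_block("L2T29"))
    # Level 4 has 5 topics and 4 domains according to Table 1.
    print(index_in_range(parse_block("L4D5"), n_topics=5, n_domains=4))
```

Under these assumptions, the level-4 domain labels used in the text (L4D5, L4D6, L4D8) all fall in the range reserved for domains, after the five topic indices.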
Overview
We applied SASHIMI to the set of 83,476 abstracts covering the 23-year period between 1995 and 2017 of the ASCO Annual Meeting. As detailed in the Domain-topic model section, and as seen on the map in figure 5, our method inferred a dual hierarchical model of the data, with domains (collections of documents) on one side and topics (collections of terms) on the other. The domain map is color-coded in red, and the topic map in blue. The inference procedure determined that 5 hierarchical levels (corresponding to columns in the block maps of figure 5) yielded the description that best fit our data. Table 1 summarizes the partition sizes: the number of domains and topics at different levels. Figure 5 also displays their nested structure, whereby the intensity of color corresponds respectively to the fraction of documents belonging to a domain, and the fraction of documents containing any term in a topic.

Let us exemplify these characteristics by considering levels and partitions on the domain map (the same applies, mutatis mutandis, to the topic map). Level 5 corresponds to the trivial partition grouping all abstracts together. The level-4 partition is made of 4 different blocks. The block at the bottom (L4D5) contains the largest number of abstracts, and the one just above (L4D6) the smallest number of abstracts. At level 3, one can see that the topmost level-4 block (L4D8) is divided into two level-3 sub-blocks (L3D47 and L3D44). While one of these level-3 sub-blocks (L3D44) is the largest level-3 block, it is not a sub-block of the largest level-4 block (L4D5), which gets split into 7 sub-blocks of smaller size.

Figure 5: The domain hierarchy on the left and topic hierarchy on the right, as presented in the interactive map. Here, colors correspond to the fraction of documents either belonging to a domain or containing any term in a topic. Domains are consistently presented in red and topics in blue. Columns represent the different levels, where each block from a higher level is sliced into sub-blocks of equal height in the lower level.
          L5   L4   L3   L2   L1        N
Domains    1    4   24  110  479    83476
Topics     1    5   37  112  407   253758

Table 1: Partition counts for domains and topics at each nested level of the model, plus the number (N) of documents and terms modeled.
Topics
While our main interest lies in the research domains and their topic composition, by inspecting the hierarchy of topics we can gain a better understanding of their semantic structure and organization. Unlike level-1 domains, level-1 topics are just groups of terms, which are for the most part directly interpretable. Topics at higher levels can in turn be characterised from the subtopics composing them, upwards from the groups of terms at level 1. We provide below only a brief characterization of the highest-level (level-4) topics in our inferred model. Following the y-axis at level 4 of the topic map, we have:

● common terms ("cancer", "patients", "studies"), generic qualifiers ("between", "possibly", "known") and stop-words ("it", "and", "the"). (L4T0)
● terms for most cancer types, symptoms, and early-stage clinical trials. (L4T1)
● broad terms for disease progression, diagnosis and treatment; terms for breast and liver cancer. (L4T2)
● terms for social and health factors, public health concerns; biopsy and assays, genetics; terms for skin, colon, bone and reproductive organ cancers. (L4T3)
● terms including methylation and genetic alterations; smoking and pregnancy (factors that may contribute to the first); plus a large subtopic with terms that only appear once in the corpus (extremely specific terms or typos). (L4T4)

Time
As we mentioned, the block maps in figure 5 allow us to visualize the largest domains and topics in terms of their number of documents and terms. Using their interactive version (see section "Sashimi" interactive maps), one can then explore their substantive content, i.e. the composition and relationships of domains and topics therein, in order to gain an understanding of the corpus. Although this is already a worthwhile endeavor, especially if one is unfamiliar with the activities taking place in a given field, it only provides a static picture of that field. We find it more interesting to analyze our results by introducing a temporal dimension that allows us to capture dynamic aspects of the field. To this end, starting from the dual representation of topics and domains, we infer a chained block model that extends the model to conference years, as explained in the Chained dimensions section and illustrated in figure 3. The resulting partition of years corresponds to a two-level hierarchy (see figure 6). At the highest level, two groups of consecutive years are found: the first period ranges from 1995 to 2005, and the second from 2006 to 2017.

Figure 6: The inferred hierarchy of annual conferences. Contrary to domains and topics, there is no statistically significant distinction found above level 2. Since conferences are treated as categorical data, the fact that the partitions respect the chronological sequence is not a given, but instead reveals the progressive evolution of the domains in time.
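Because years enter the model as unordered categories, one may want to verify after the fact that an inferred partition of years respects chronology. The following is a minimal sketch of such a check, assuming the partition is available as a plain mapping from year to group label. The function name and the labels "early" and "late" are hypothetical; only the two periods themselves come from the text.

```python
from typing import Dict, Hashable


def groups_are_consecutive(group_of_year: Dict[int, Hashable]) -> bool:
    """Return True if every group is one unbroken run of the sorted years.

    Only the years present in the mapping are considered, so a year that is
    missing from the mapping does not count as breaking a run.
    """
    finished = set()  # groups whose run has already ended or started
    previous = None
    for year in sorted(group_of_year):
        group = group_of_year[year]
        if group != previous:
            if group in finished:
                return False  # the group reappears after another group
            finished.add(group)
            previous = group
    return True


if __name__ == "__main__":
    # The two top-level periods reported in the text.
    partition = {year: "early" for year in range(1995, 2006)}
    partition.update({year: "late" for year in range(2006, 2018)})
    print(groups_are_consecutive(partition))
```

A partition in which, say, the year 2000 were grouped with the later period would fail this check, which is why chronological groups obtained from categorical input are informative.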
It is worth noting that the model processes years as categorical values; the consecutiveness of the groups of years identified, at all levels, is thus not assumed but is a consequence of the progressive and cumulative character of cancer research. The split found at 2005/2006 corresponds to the strongest structural frontier in the data. Each period is then sub-divided into three sub-periods, but we have chosen to examine the temporal dynamics of the ASCO Annual Meeting through the lens of the main periods inferred, that is, the evolution from [1995-2005] towards [2006-2017]. This is represented in figure 7, where colors correspond to the growth (in red) or decline (in grey) of a given domain over time, as measured by the difference between the two periods in the relative volume of abstracts belonging to the domain.

Figure 7: Colors represent the growth, in red, or decline, in grey, of the fraction of abstracts in a domain. We consider the evolution between the two highest-level periods: [1995-2005] and [2006-2017]. Colors are normalized within each level, so that the blocks of strongest color correspond to the greatest change at their level of granularity.

Having detected a dividing line between the 1995-2005 and 2006-2017 periods, the domain-topic model exposes a number of significant features defining the evolution of the thematic content of the corpus. As shown by figure 7, different domains gained or lost momentum between these periods. Readers should note that some of the results we will discuss, such as the growth of precision oncology research, correspond to obvious evolutions in the field, while other results point to less publicized trends. At this stage, however, our argument does not rest on the production of unexpected or unprecedented results, but simply on the development of a suitable method for categorizing and characterizing the dynamics of these clinical-scientific activities.
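The growth and decline values colored in figure 7 follow the contrast measure described in the text: the share of a period's abstracts that belong to a domain, compared across two periods. The following is a minimal sketch of that computation, assuming each abstract is reduced to a (year, domain) pair at a single level. The function names and toy data are hypothetical and do not reproduce the authors' implementation. Treating the later period as the positive one is also an assumption, chosen so that growth comes out positive.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Set, Tuple


def contrast_by_domain(
    abstracts: Iterable[Tuple[int, Hashable]],
    period_plus: Set[int],
    period_minus: Set[int],
) -> Dict[Hashable, float]:
    """Difference, per domain, of its share of abstracts in each period."""
    in_plus, in_minus = Counter(), Counter()
    total_plus = total_minus = 0
    for year, domain in abstracts:
        if year in period_plus:
            in_plus[domain] += 1
            total_plus += 1
        elif year in period_minus:
            in_minus[domain] += 1
            total_minus += 1
    if total_plus == 0 or total_minus == 0:
        raise ValueError("both periods must contain at least one abstract")
    domains = set(in_plus) | set(in_minus)
    return {
        d: in_plus[d] / total_plus - in_minus[d] / total_minus for d in domains
    }


def normalize_within_level(values: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Scale values so the strongest change at this level has magnitude 1."""
    peak = max((abs(v) for v in values.values()), default=0.0)
    if peak == 0:
        return {d: 0.0 for d in values}
    return {d: v / peak for d, v in values.items()}


if __name__ == "__main__":
    # Toy data: (conference year, domain label) for seven abstracts.
    toy = [(1995, "A"), (1995, "A"), (1996, "B"),
           (2006, "A"), (2006, "B"), (2007, "B"), (2008, "B")]
    raw = contrast_by_domain(toy, period_plus={2006, 2007, 2008},
                             period_minus={1995, 1996})
    print(normalize_within_level(raw))
```

Since every abstract belongs to exactly one domain per level, the raw contrasts at a given level sum to zero: growth in some domains is necessarily offset by decline in others.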
Our method successfully captures the fine-grained structure of domains and subdomains, while simultaneously providing insights into the dynamics of oncology research as featured at ASCO Annual Meetings. Going back to figure 7, at level 4 one can see that the blocks at the bottom and top of the map underwent respectively the most pronounced decline and growth. Still, the subdomains assembled by a given domain are often characterized by different degrees of change. As a result, a growing domain may contain declining subdomains as well as subdomains that exceed its growth, and vice versa.

In the following sections we investigate in detail the domains of most pronounced change between our periods of interest, at different scales. At the most granular level (level 1), the block with the highest growth refers to survival and prognosis across different cancers, while the block with the strongest decline refers to traditional chemotherapy for lung cancer and is consistent with the demise of cytotoxic chemotherapy approaches at the research front. At the subsequent level (level 2), we will see that the highest growth corresponds to cancer genomics as defined by work on genomic alterations, mutations, and molecular profiles across cancer types. We also find a stable domain that corresponds to hereditary cancer. And at level 3 we will describe a high-growth domain that assembles the fields of health system research, healthcare policy, healthcare services, and cost analysis. Let us then have a closer look at these domains, by inspecting their relationship to topics.

Level-1 domains
The highest-growth level-1 domain (L1D808), as defined by abstract titles and terms, refers to survival and prognosis across different cancers, with a strong presence of lung cancer, including issues such as treatment outcomes and mortality, and the cancer statistics program known as SEER (Surveillance, Epidemiology, and End Results). Three level-3 topics (see figure 8) are constitutive of this domain: a first topic on external causes of cancer (L3T31); a second on cost-effectiveness measurement that appears to be disassembled, at the lower level, into race, geriatric and pediatric comorbidities, and Medicare (L3T27); and finally a third topic on economic inequalities and socio-epidemiology (L3T10). All of these topics characterise the domain since they are overrepresented in it. It is interesting to note that, consistent with the inferred hierarchy, the topical structure of this level-1 domain resembles, although it does not entirely overlap with, the topical structure of its parent level-3 domain on healthcare policy (L3D44), which will be discussed below.

Figure 8: Topic prevalence of the level-1 domain on survival and prognosis (L1D808). Intensity of color for a topic corresponds to how much more present it is in the domain than overall. To ease visualization of the salient topics, values are normalized by their absolute maximum within each level and negative values are shown as blank.

The level-1 domain showing the largest decline (L1D458) refers to chemotherapy-based treatments for lung cancer and follows the demise of traditional chemotherapeutic approaches at the research front (as exemplified by the ASCO Annual Meeting). While it continues to play an important role in routine cancer care, traditional chemotherapy has been upstaged by the emergence of the new targeted therapies linked to the development of molecular profiling (a level-2 domain (L2D133) discussed below).
This is particularly the case for lung cancer, dubbed by historians a “recalcitrant disease” [39] insofar as, contrary to other types of cancer such as breast cancer, traditional chemotherapeutic approaches have been largely unsuccessful. Initial applications of molecular oncology have therefore often targeted lung cancer, in the hope of addressing this “unmet medical need”. Specific topics of this domain thus include the chemotherapy of mesothelioma (a rare, aggressive form of cancer related to the lungs), investigations of lung cancer by cooperative oncology groups (traditional networks of clinical cancer trialists presently competing with more flexible arrangements such as ad hoc consortia), the side effects of chemotherapies, lung cancer pathology and therapeutic regimens, esophageal cancer, and patient performance status (to assess ability to tolerate therapy).

Temporal analysis provides us with access to the dynamics of the oncology field as reflected in the ASCO Annual Meeting abstracts. After selecting growing and declining domains on the basis of the comparison between the 1995-2005 and 2006-2017 periods, it is possible to inspect their temporal profile on a yearly basis. The level 1 section of figure 9 shows this kind of profile for the two aforementioned level-1 domains.

Figure 9: Yearly evolution of the fraction of abstracts in a domain, for the highest-level domains (level 4) and for the domains discussed in the paper.

Level-2 domains
If we now move up to level-2 domains, abstract titles and terms of the highest-growth block (L2D133) correspond to the area of cancer genomics, as defined by work on genomic alterations, mutations, and molecular profiles across cancer types. Interestingly, there is a clear difference between abstract titles of the two periods. Titles in 1995-2005 concern two early biomarkers, HER2 and EGFR, that have contributed to the emergence of the field and were initially explored with more classical methods (immunohistochemistry, FISH, PCR), whereas titles in the subsequent period center on more recently analyzed biomarkers (BRAF, KRAS, PIK3CA) and related pathways (AKT, mTOR), as analyzed using next-generation sequencing (NGS) techniques. If we now switch to the corresponding topics (figure 10), the five most important ones are as follows:

● Mutations and targeted therapies, with a focus on the identification and interpretation of “actionable” mutations (EGFR, tyrosine kinase, ALK), and resistance mechanisms to targeted therapies (L3T7). This topic is subdivided into level-2 and level-1 topics relating to: (a) the identification and analysis of mutations and genomic landscapes across cancer types; and, more specifically, (b) the analysis of genomic alterations and targeted therapies for lung and colorectal cancers. It is important to note that a large number of these level-2 and level-1 subtopics that flow into the mutations and targeted therapies topic are also highly prevalent, in particular the following ones: exploring mutations and genetic landscapes across cancer types using sequencing techniques (L2T20); genomic alterations and targeted therapies for lung and colorectal cancer (L2T43); mutation analysis in colorectal cancer (L1T191); methods and algorithms to identify and interpret actionable genetic alterations across cancer types (L1T22); genomic alterations (esp. KRAS/NRAS, BRAF), targeted therapies (esp. cetuximab) and resistance to therapies, with a focus on colorectal cancer (L1T56); and genomic alterations (esp. ALK, EGFR, ROS), targeted therapies and resistance to therapies, esp. for lung cancer (L1T130).
● Pharmacogenomics and pharmacogenetics, genotyping, and Single Nucleotide Polymorphisms (SNPs) (L3T5). Subtopics refer to polymorphisms as predictors of drug response and toxicity, and molecular cytogenetics (chromosomal deletions and translocations, loss of heterozygosity).
● Epigenetics and methylation across cancer types, in particular those related to smoking (L3T14).
● Smoking as a risk factor (in particular in Asia), smoker status, and smoking prevention (L3T31).
● Methodology for detecting circulating tumor cells (CTCs) and its comparison with other techniques for assessing the presence of biomarkers (L3T32). This topic consists of subtopics dealing with CTCs, the molecular pathology of HER2 expression, biopsy techniques (core- and fine-needle), and drugs targeting DNA-repair mechanisms (PARP inhibitors).

Figure 10: Topic prevalence of the cancer genomics domain (L2D133).

To sum up, this domain and its topics center on the rapidly growing area of cancer genomics. Its identification by our SASHIMI approach, while far from unexpected, confirms the ability of the approach to neatly capture this recent trend and its components when examined by topics. Moreover, SASHIMI was able to detect a number of specific subtopics that contribute differentially to the development of this domain, for instance a strong presence of investigations focusing on the relation between smoker status and lung cancer molecular profiling and treatment, or the distinction between research on somatic mutations in cancer and other subfields such as epigenetics or the analysis of the role of the genome in drug response (pharmacogenomics).

Figure 11: Topic prevalence of the hereditary cancer domain (L2D139).
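The topic-prevalence maps of figures 8, 10 and 11 color each topic by how much more present it is in a domain than overall. The following rough sketch shows one way such values could be obtained. It assumes that presence is the fraction of documents containing any term of the topic, and that "more present" is a simple difference of fractions. Both choices are assumptions made for illustration and may differ from the exact definition used for the maps. All names and the toy data are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Optional, Set


def _share(docs: Iterable[Hashable], doc_terms: Dict[Hashable, Set[str]],
           terms: Set[str]) -> float:
    """Fraction of the given documents containing at least one of the terms."""
    docs = list(docs)
    return sum(1 for d in docs if doc_terms[d] & terms) / len(docs)


def topic_prevalence(
    doc_terms: Dict[Hashable, Set[str]],
    doc_domain: Dict[Hashable, Hashable],
    topic_terms: Dict[Hashable, Set[str]],
    domain: Hashable,
) -> Dict[Hashable, float]:
    """Presence of each topic in the domain minus its presence in the corpus."""
    in_domain = [d for d, label in doc_domain.items() if label == domain]
    if not in_domain:
        raise ValueError(f"no documents in domain {domain!r}")
    return {
        topic: _share(in_domain, doc_terms, terms)
        - _share(doc_terms, doc_terms, terms)
        for topic, terms in topic_terms.items()
    }


def for_display(values: Dict[Hashable, float]) -> Dict[Hashable, Optional[float]]:
    """Normalize by the absolute maximum; negative values become blank (None)."""
    peak = max((abs(v) for v in values.values()), default=0.0)
    if peak == 0:
        return {t: 0.0 for t in values}
    return {t: (None if v < 0 else v / peak) for t, v in values.items()}


if __name__ == "__main__":
    # Toy corpus: binary term sets per document, one domain per document.
    doc_terms = {1: {"brca1", "cancer"}, 2: {"cancer", "cost"},
                 3: {"brca1"}, 4: {"cost"}}
    doc_domain = {1: "X", 2: "Y", 3: "X", 4: "Y"}
    topics = {"genes": {"brca1", "brca2"}, "econ": {"cost"}, "common": {"cancer"}}
    print(for_display(topic_prevalence(doc_terms, doc_domain, topics, "X")))
```

In this toy corpus the hypothetical "genes" topic is overrepresented in domain X, the "econ" topic is underrepresented and would be left blank, and the "common" topic is exactly as present in the domain as in the corpus.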
A second example of a level-2 domain concerns a stable one (L2D139) that mainly refers to research on the breast cancer hereditary susceptibility genes BRCA1 and BRCA2, as contrasted with the investigation of somatic mutations. Identified in 1994 and 1995, the BRCA genes have not only led to the establishment of hereditary breast cancer clinics and to successful, albeit controversial, commercial initiatives (as epitomized by the company Myriad), but also to major research programs that have continued in subsequent years [40]. It is thus not surprising that this domain maintains a stable presence during the period covered by our analysis. Topics contributing to this domain (see figure 11) include hereditary breast/ovarian cancer (HBOC) and colorectal cancer (L3T28), with a sub-cluster corresponding more specifically to BRCA1 and BRCA2 (L2T63), and a sub-cluster devoted to genetic testing (L1T97). Other topics include pregnancy during and after cancer (L3T33) and community cancer care and physicians’ education (L3T10). Finally, a subtopic refers to issues of race and ethnicity (L2T53) (as differentially related to cancer risk), while another subtopic concerns the different but related issue (insofar as it concerns gynecological cancers) of HPV cervical cancer screening (L2T42). The level 2 section of figure 9 displays the stable temporal profile of the hereditary cancer domain and the growing importance of cancer genomics.

Level-3 domains
Moving up the nested structure, domains become more heterogeneous but maintain a degree of thematic unity. Abstract titles and terms of the highest-growth level-3 domain (L3D44) situate it within the broad field of health system research, healthcare policy, healthcare services and costs. A better sense of this high-level domain is provided by its topics (figure 12). For our present purpose we will discuss four main topics.

● A first topic (L3T10) concerns issues of physician and healthcare-worker education, and patient communication. This topic can be further disassembled into subtopics pertaining, respectively, to: (a) the improvement of practices, access to services, and telemedicine; (b) patient satisfaction and their emotional and spiritual needs, especially at the end of life; (c) insurance issues, economic and geographic inequalities; and (d) training, communication, and patient information.
● A second topic (L3T27) concerns health economics and related issues such as cost-effectiveness, cost-utility, but also race disparities. It can be disassembled into five subtopics, namely: (a) cost-effectiveness and comparative effectiveness; (b) racial and ethnic disparities; (c) comorbidities, mortality, and survival; (d) quality of life; and (e) systematic reviews and meta-analysis.
● A third topic (L3T21) covers questionnaires about anxiety, psychological distress, cachexia, nutrition, weight, and diet. It separates neatly into subtopics focusing on (a) weight, nutrition, body mass, and diet; and (b) depression, anxiety, sleep, fatigue, and psychological distress.
● A final topic (L3T33) centers on pregnancy and fertility issues after cancer treatment.

Figure 12: Topic prevalence of the health system research, healthcare policy, healthcare services, and costs domain (L3D44).

The level 3 section of figure 9 shows the yearly temporal profile of this domain.
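The yearly profiles of figure 9 plot, for each conference year, the fraction of that year's abstracts belonging to a domain. The following is a minimal sketch of this computation, assuming each abstract is given as a (year, domain) pair in which the domain label is already the abstract's ancestor block at the level of interest. The function name and toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def yearly_fraction(
    abstracts: Iterable[Tuple[int, Hashable]], domain: Hashable
) -> Dict[int, float]:
    """Fraction of each year's abstracts that belong to the given domain."""
    totals, hits = Counter(), Counter()
    for year, label in abstracts:
        totals[year] += 1
        if label == domain:
            hits[year] += 1
    # Years appear in chronological order; years without abstracts are absent.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    # Toy data: (conference year, level-3 domain label) for five abstracts.
    toy = [(2015, "L3D44"), (2015, "L3D47"), (2016, "L3D44"),
           (2016, "L3D44"), (2017, "L3D47")]
    print(yearly_fraction(toy, "L3D44"))
```

Normalizing by the yearly total keeps the profile comparable across years even though the number of abstracts per meeting grew over the period, as shown in figure 4.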
The domain as a whole corresponds to the growing importance within the oncology field itself of discussions concerning healthcare policy (often dubbed “oncopolicy” by practitioners) and health economics, especially following the development of increasingly expensive drugs and diagnostic tools, and the growing awareness of wide disparities in cancer care within and across countries. About 20% of all contributions were related to this domain, confirming its growing centrality to the field. A close investigation of the topics that inform the shaping and framing of this domain shows the presence, in addition to more traditional economic and policy themes, of issues related to the patients’ quality of life in terms not only of quantitative indicators but also of qualitative experiences.

Institutions
While the present paper focuses on the analysis of the thematic content of the abstracts, our approach can be extended to include other aspects such as the contribution of specific institutions to different (sub)domains. To briefly illustrate this procedure, we selected a group of US and French institutions, namely the 27 members of the US National Comprehensive Cancer Network and the 20 French Comprehensive Cancer Centers. Several other institutions from different countries (in particular the UK and other European countries) have of course contributed papers to the ASCO Meetings, but these 47 leading national centers have increased their share of ASCO presentations from 19% in 2001 to 23% in 2009 and 33% in 2017. They thus provide a robust enough basis for illustrating institutional analysis. Unsurprisingly, a small number of leading US (e.g., MD Anderson and Memorial Sloan-Kettering) and French (e.g., Gustave Roussy) institutions have year after year contributed the largest number of papers, and it is thus to be expected that they are also major contributors to most (sub)domains. Other, medium-sized institutions clearly follow a more specialized strategy: for instance, UCSD’s Moores Cancer Center has a strong presence in the fast-growing cancer genomics subdomain (L2D133) but a relatively minor presence in the healthcare policy domain (L3D44). Future work will include a more fine-grained analysis of this kind of institutional dynamics, for instance by clustering institutions that share a similar profile in terms of activity domains.

Conclusion
In this paper we introduced a simultaneous hierarchical modeling of documents and terms into domains and topics, detecting the thematic organization of a corpus in such a way that domains can be described as assemblages of topics, and compared to other domains in a clear, distinctive fashion. Given that documents connect the different elements of a corpus, we showed how their thematic organization can be used as a basis to structure other dimensions of the data. Applying this approach to the clinical oncology research showcased at the ASCO Annual Meeting during the last two decades, we focused on the chronological dimension of this dynamic process. Having detected a dividing line between the 1995-2005 and 2006-2017 periods, the domain-topic model was able to characterize a number of significant features defining the evolution of the thematic content of the corpus. Our method successfully captures the fine-grained structure of domains and subdomains, while simultaneously providing insights into the dynamics of oncology research as featured at ASCO Annual Meetings.

As shown by figure 7, different domains gained or lost momentum between these two periods. In the Results section, we discussed three research domains with the highest growth at three different levels. The most granular domain refers to survival and prognosis across different cancers (L1D808). A second-level domain corresponds to cancer genomics as defined by work on genomic alterations, mutations, and molecular profiles across cancer types (L2D133). A rising domain at level 3 assembles the fields of health system research, healthcare policy, healthcare services, and cost analysis (L3D44). Conversely, we noticed the strong decline of a domain that involves traditional chemotherapy for lung cancer (L1D458) and is consistent with the demise of cytotoxic chemotherapy approaches at the research front.
Our approach provides a clear picture of the evolution of the content of the ASCO Annual Meeting, and deeper analyses of the results may reveal more about the nature of the transition between the two main periods detected by our temporal analysis, as well as less dramatic changes between the shorter sub-periods. In order to better explain the original contribution of our SASHIMI approach, it is useful to compare it to a similar model recently described by [23]. Starting from the construction of a bipartite network connecting terms to documents, their model also applies a stochastic block model approach to extract structure from a corpus of texts. Their main purpose, however, is to compare their algorithm’s efficacy with that of other topic models. As a consequence, they are mostly concerned with the grouping of terms within semantically coherent groups. While they mention the dual nature of their model, they do not explore it. In contrast, in our work it is precisely the domain-topic duality that motivated the choice of construction. Our three interrelated goals were: to explore hierarchical and semantic relationships between domains, to describe domains in terms of their topic mixtures, and to infer linkages between domains and non-textual dimensions such as time, institutions, and community organization.
An interesting avenue for future research will be to further explore the heuristic value of the nested SBM, as embedded in SASHIMI, in relation to these goals. From a methodological or technical point of view, one of the advantages of the nested SBM is its generality and robustness. We acknowledge, however, that the dual (domain-topic) nature could be achieved by a combination of other models. At a more substantive level, future work could involve comparative SASHIMI analysis of different kinds of corpora. For instance, in addition to ASCO abstracts, one could analyze the abstracts of its twin European organization (ESMO), including the identification of core sets of authors who define the international research agenda in oncology. At a longitudinal level, a promising research avenue involves the analysis of the research projects financed by the US National Cancer Institute, thus following the link from grant proposals to conference presentations and, finally, to publications in scientific and clinical journals.

Code and data availability
A library that allows one to perform the steps described in this paper is available at < https://gitlab.com/solstag/abstractology/ >. Our source code makes use of the graph-tool library [30]. To accomplish the chaining procedure hierarchically, a slight modification to that library is necessary for now. We provide it as a patch in our repository.

Acknowledgments
Work for this paper has been made possible by a grant from the Canadian Institutes for Health Research (CIHR - MOP-142478). We thank our Co-PI James A. Evans (University of Chicago) and his assistant Antonio Nanni for their suggestions and assistance, as well as ASCO, and in particular Dr. Richard L. Schilsky, for providing access to electronic copies of the annual conference abstracts.
Bibliography