DLBCL-Morph: Morphological features computed using deep learning for an annotated digital DLBCL image set

Abstract

Diffuse Large B-Cell Lymphoma (DLBCL) is the most common non-Hodgkin lymphoma. Though histologically DLBCL shows varying morphologies, no morphologic features have been consistently demonstrated to correlate with prognosis. We present a morphologic analysis of histology sections from 209 DLBCL cases with associated clinical and cytogenetic data. Duplicate tissue core sections were arranged in tissue microarrays (TMAs), and replicate sections were stained with H&E and immunohistochemical stains for CD10, BCL6, MUM1, BCL2, and MYC. The TMAs are accompanied by pathologist-annotated regions-of-interest (ROIs) that identify areas of tissue representative of DLBCL. We used a deep learning model to segment all tumor nuclei in the ROIs, and computed several geometric features for each segmented nucleus. We fit a Cox proportional hazards model to demonstrate the utility of these geometric features in predicting survival outcome, and found that it achieved a C-index (95% CI) of 0.635 (0.574,0.691). Our finding suggests that geometric features computed from tumor nuclei are of prognostic importance, and should be validated in prospective studies.


Damir Vrabac, Akshay Smit, Rebecca Rojansky, Yasodha Natkunam, Ranjana H. Advani, Andrew Y. Ng, Sebastian Fernandez-Pol, and Pranav Rajpurkar

Department of Computer Science, Stanford University; Department of Pathology, Stanford University School of Medicine; Department of Medicine, Division of Oncology, Stanford University School of Medicine

* Corresponding author(s): Pranav Rajpurkar ([email protected])
† These authors contributed equally to this work.
‡ These authors contributed equally to this work.


Background & Summary

Diffuse Large B-Cell Lymphoma (DLBCL) is the most common type of non-Hodgkin lymphoma (NHL), accounting for over a third of cases, with more than 20,000 patients diagnosed annually in the United States. DLBCL is fatal without treatment; however, approximately 70% of patients can be cured with contemporary therapeutic regimens. Treatment outcomes following standard R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone) therapy are highly variable, and depend on a number of clinical, biologic, and genetic factors. Currently, the most effective prognostic classification is the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI), which incorporates five clinical variables: age, lactate dehydrogenase (LDH), extra-nodal sites of involvement, Ann Arbor stage, and ECOG performance status. The NCCN-IPI model is widely used to risk stratify patients into good, intermediate, and poor-risk categories; however, it is insufficient to guide therapeutic decision-making for individual patients.

Gene expression profiling (GEP) studies revealed distinct subtypes of DLBCL that correspond to differences in cell of origin (COO) and show different outcomes in response to R-CHOP therapy. This approach categorizes DLBCL as either germinal center B-cell (GCB), activated B-cell (ABC), or indeterminate, based on the phase of B-cell development it most closely resembles. A practical algorithm employing this approach for immunohistochemically stained, formalin-fixed, paraffin-embedded tissue was developed by Hans et al., and despite imperfect concordance with the gold standard GEP method, it is now the most widely used algorithm in the United States for DLBCL. The GCB subtype is associated with more favorable outcomes than the non-GCB subtype.

In addition to COO subtyping, double-hit lymphomas with concurrent chromosomal translocations of the MYC and BCL2 genes, or less commonly MYC and BCL6 genes, and double-expressor lymphomas with dual overexpression of MYC and BCL2 proteins have been found to correlate with an aggressive clinical course and poor outcomes when treated with R-CHOP. Determination of these molecular subsets is now standard of care per the World Health Organization (WHO) guidelines, and patients harboring dual chromosomal translocations are now formally classified as having high grade B-cell lymphoma, with MYC and BCL2 and/or BCL6 translocations (HGBL).

Figure 1. Data pipeline for a single core from an H&E stained tissue microarray (TMA). Panels a) to f): annotated core, extracted patch, segmented tumor nuclei, neoplastic nucleus masks, single nucleus segmentation map, and shape and size features (for example hull area 449 px, min diameter 25 px, max diameter 28 px, elongation). In a) the red rectangle is the pathologist-annotated ROI. In c) red corresponds to cell nuclei classified as "neoplastic" by HoVer-Net, green corresponds to "inflammatory", and orange corresponds to "non-neoplastic epithelial".

While COO subtyping by the Hans algorithm corresponds to morphologically distinct benign precursors, germinal center type B-cells and activated B-cells, classification based on the morphologic properties of the tumor itself has historically been challenging due to the significant histomorphologic heterogeneity of DLBCL. Cytologically, DLBCL may resemble centroblasts with multiple peripheral nucleoli and vesicular chromatin or immunoblasts with abundant cytoplasm and a single prominent nucleolus. However, the prognostic significance of these and other recognised cytologic variants, for example anaplastic type DLBCL, is unclear and the subject of continued debate.

Though several studies have thus far failed to conclusively demonstrate that morphologic classification can predict outcomes in DLBCL, automated imaging methods could potentially identify novel, prognostically significant morphological or immunohistochemical biomarkers. The ability of automated methods to identify prognostically relevant features on H&E sections that have eluded pathologists has been demonstrated. If successful, automated image analysis could be scaled up into a cost-effective alternative to current classification methods, which are typically costly and/or labor intensive. A critical requirement for the development of these models is the availability of datasets containing digitally scanned slides stained to show cell morphology and expression of relevant proteins with accompanying prognostic outcome data.

Here we present DLBCL-Morph, a publicly available dataset containing 42 digitally scanned high-resolution tissue microarrays (TMAs) from 209 DLBCL cases at Stanford Hospital.
Each TMA was stained with H&E as well as for CD10, BCL6, MUM1, BCL2, and MYC protein expression. All of the TMAs are accompanied by pathologist-annotated regions-of-interest (ROIs) that indicate areas representative of DLBCL. For each patient in the dataset, we provide survival data, follow-up status, and a wide range of clinical and molecular variables such as age and MYC/BCL2/BCL6 gene translocations. We also segmented out tumor nuclei from ROIs inside the H&E stained TMAs, and provide several geometric features for each tumor nucleus.

Figure 2. Tissue microarrays (TMAs) with region-of-interest (ROI) annotations. a) H&E stained TMA. The red rectangles denote ROIs annotated by a human expert. Some missing or unrepresentative cores have no ROIs. b) A single core from the TMA in a) with an ROI that ignores unrepresentative areas of the core. c) BCL6 stained TMA, containing cores from the same patients as a). d) A single annotated core from the TMA in c). Cells stained orange show greater BCL6 expression.

Methods

Our dataset contains digitally scanned TMAs accompanied by pathologist-annotated ROIs. We extracted patches from the ROIs inside the H&E stained TMAs, and used a deep learning model called HoVer-Net to segment tumor cell nuclei. We then computed several geometric descriptors for each segmented nucleus. Figure 1 shows our pipeline for an H&E stained TMA core. Our project was approved by the Institutional Review Board of Stanford University. All protected health information was removed and the project had no impact on clinical care, so the requirement for individual patient consent was waived.

Patient Cohort

The study cohort consists of patients with de novo, CD20+ DLBCL treated with curative intent with R-CHOP or R-CHOP–like immunochemotherapy with available clinical data from the Stanford Cancer Institute, Stanford, California. This patient cohort was included in a prior study, and the clinicopathologic inclusion criteria are as previously described.

Tissue Microarray

Stained tissue microarray (TMA) slides were scanned at 40x magnification (0.25 µm per pixel) on an Aperio AT2 scanner (Leica Biosystems, Nussloch, Germany) in ScanScope Virtual Slide (SVS) format. This high magnification level displays the tissue in very fine detail, which we believe to be beneficial for the development of automated imaging models. Each SVS file also contains a slide label image, a macro camera image, and a thumbnail image. The slide label image is a low-resolution image of the slide's label, which shows the TMA number and the stain (e.g. BCL2). The macro camera image is a low-resolution picture of the entire slide. The thumbnail is an image of the whole scanned TMA.

Our dataset includes 7 TMAs, each with a 0.4 micron thick formalin-fixed, paraffin-embedded (FFPE) section of tumors assembled in a grid. Within the microarray each tumor is represented by a 0.6-mm core diameter sample in duplicate. Replicates of each TMA were stained with H&E, which shows cell morphology. They were also stained for the expression of the following 5 oncogenes: CD10, BCL6, MUM1, BCL2, and MYC. We therefore have 6 stains per TMA, resulting in 42 distinct digitally-scanned slides. An example of an H&E stained TMA is shown in Figure 2 a) and a BCL6 stained TMA is shown in Figure 2 c). Since overexpression of one or more of these proteins is observed in a significant portion of DLBCL cases, automated imaging models can use the immunostained TMAs to potentially identify prognostically significant features related to protein expression.

Figure 3. Rectangle and ellipse fitted to a single segmented tumor nucleus. a) A binary segmentation image for a tumor cell nucleus. For visual clarity, the image is zero-padded by 5 pixels on each side. b) Rotated rectangle fit to the nucleus. Our dataset provides the rectangle's center coordinates, width, height and rotation angle. c) Rotated ellipse fit to the nucleus. Our dataset provides the ellipse's center coordinates, perimeter, area, and major and minor axis lengths.

Pathologist annotations

Although TMA cores were already taken from areas of tissue showing DLBCL, some of the cores were partially or entirely missing. Furthermore, some cores still contained areas of tissue that had very few or no tumor cells. We obtained rectangular ROI annotations from expert pathologists to highlight the core regions which represent DLBCL accurately. The annotations were created for all TMAs and all stains at 40x magnification. The pixel coordinates for the rectangles in ROIs, along with the corresponding deidentified unique patient_id, are included in our dataset. We believe the exclusion of missing or insufficiently representative tissue areas will be beneficial for automated prognostic models which use patches from the TMAs as input. Example ROI annotations are shown in Figure 2 b) and d).

Patches from stained TMAs

We extracted patches of size 224x224 pixels from within the ROIs in the stained TMAs, at 40x magnification. The patches were extracted uniformly from inside each annotated rectangle, starting from the top-left corner and proceeding until the bottom-right corner. The patches are non-overlapping, and we omitted patches that are mostly white and contain little tissue. We provide these patches as part of our dataset. Due to our ROI annotation process detailed above, our patches exclude missing and unrepresentative areas of cores. Since deep learning based imaging methods typically cannot directly operate on images as large as the 40x magnification image, the patches can instead be used as input. We also used patches from H&E stained TMAs to segment tumor cell nuclei as described below.
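The tiling described above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: the function name tile_roi, the whiteness thresholds (white_level, max_white_frac), and the assumption that the slide region is already loaded as an RGB NumPy array are all hypothetical.

```python
import numpy as np

PATCH = 224  # patch side length in pixels, as stated in the text


def tile_roi(image, x0, y0, x1, y1, patch=PATCH,
             max_white_frac=0.5, white_level=220):
    """Yield (x, y, tile) for non-overlapping patches inside one ROI.

    image: H x W x 3 uint8 RGB array at full (40x) resolution.
    (x0, y0), (x1, y1): upper-left and lower-right ROI corners in pixels.
    max_white_frac and white_level are illustrative thresholds only.
    """
    # Walk the ROI row by row from the top-left corner; partial tiles at
    # the right and bottom edges are dropped.
    for y in range(y0, y1 - patch + 1, patch):
        for x in range(x0, x1 - patch + 1, patch):
            tile = image[y:y + patch, x:x + patch]
            # A pixel counts as white when all three channels are bright.
            white_frac = np.all(tile >= white_level, axis=-1).mean()
            if white_frac <= max_white_frac:
                yield x, y, tile
```

Writing the function as a generator means a caller can save each tile as it is produced instead of holding every patch of a large ROI in memory.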

Tumor cell nucleus segmentation

We used a deep learning based nucleus segmentation and classification model called HoVer-Net to segment every tumor cell inside each of the patches from H&E stained TMAs. HoVer-Net operates independently on each patch, and produces an output image segmenting all individual cell nuclei in the patch, and another output image specifying the classification of each segmented nucleus. HoVer-Net classifies segmented nuclei into 5 categories: neoplastic, non-neoplastic epithelial, inflammatory, connective, and dead. HoVer-Net uses a neural network based on a pretrained ResNet-50 architecture to extract image features. These extracted features are then processed in three steps: the nuclear pixel (NP) step, the HoVer step, and the nuclear classification (NC) step. The NP step determines whether each pixel belongs to a nucleus or the background, and the HoVer step predicts the vertical and horizontal distances of nucleus pixels to their centroid, thereby allowing separation of touching nuclei. Then the NC step classifies each nucleus pixel, and aggregates these across all pixels in a segmented nucleus to classify each nucleus as neoplastic, non-neoplastic epithelial, inflammatory, connective, or dead.

We used the HoVer-Net output to identify each neoplastic cell nucleus in a patch, and saved it as a separate binary image, thereby obtaining one binary image for each tumor cell. Each binary image illustrates the size and shape of the nucleus, and we provide these in our dataset. An example binary image is shown in Figure 1 e) and another is shown in Figure 3 a). We used these binary images to compute geometric features for each tumor cell nucleus as described below.
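As a rough sketch of the post-processing described above, the snippet below turns a per-patch instance map and a per-pixel class map into one binary mask per neoplastic nucleus. The array layout (inst_map with 0 as background, type_map with integer class labels), the label value NEOPLASTIC = 1, and the majority-vote aggregation are assumptions made for illustration; the actual HoVer-Net output format and aggregation rule may differ.

```python
import numpy as np

NEOPLASTIC = 1  # assumed integer label of the "neoplastic" class


def neoplastic_masks(inst_map, type_map, target=NEOPLASTIC):
    """Return {nucleus_id: cropped binary mask} for neoplastic nuclei.

    inst_map: H x W int array, 0 = background, k > 0 = pixels of nucleus k.
    type_map: H x W array of non-negative int class labels per pixel.
    """
    masks = {}
    for inst_id in np.unique(inst_map):
        if inst_id == 0:
            continue  # skip background
        region = inst_map == inst_id
        # Aggregate pixel-level classes over the nucleus by majority vote.
        votes = np.bincount(type_map[region])
        if votes.argmax() != target:
            continue  # non-tumor nuclei are discarded
        # Crop the mask to the nucleus bounding box.
        ys, xs = np.nonzero(region)
        crop = region[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        masks[int(inst_id)] = crop.astype(np.uint8)
    return masks
```

Because discarded nuclei keep their original ids, the ids of the returned masks are generally non-consecutive.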

Geometric features from tumor nuclei

We used the per-nucleus binary segmentation images to compute several geometric features for each tumor cell nucleus. While end-to-end imaging models may not require such hand-crafted features, prognostic models which use these features can give more explainable results, and can more clearly indicate the prognostic importance of these features.

We fit a (possibly rotated) rectangle of minimum area enclosing the binary mask, and provide the rectangle's top left point coordinates, width and height, and rotation angle. An example rectangle is shown in Figure 3 b). The rectangle's top left point is a tuple corresponding to the feature rectCenter. The first element of the tuple corresponds to the x-coordinate, and the second element corresponds to the y-coordinate. The width and height are in a tuple corresponding to the feature rectDimension. The first element of the tuple corresponds to the width, and the second element to the height. The rotation angle corresponds to the feature rotate_angle, which ranges from −90° to 0°. A value of −90° corresponds to an axis-aligned rectangle. As the rectangle is rotated clockwise, the angle increases toward 0°, at which point the rectangle is again axis-aligned and the angle resets to −90°.

We fit an ellipse around the nucleus in the binary segmentation mask, and provide the ellipse center, major axis, minor axis, perimeter and area of the ellipse. An example ellipse is shown in Figure 3 c). The ellip_centroid parameter is a tuple containing the x and y coordinates of the ellipse center. The features shortAxis and longAxis correspond to the lengths of the minor and major axes respectively. The feature ellip_perimt corresponds to the ellipse perimeter, and ellip_area corresponds to the ellipse area.

We computed the maximum and minimum Feret diameters for each segmented nucleus, and provide the corresponding angles.
Given an object and a fixed direction, the Feret diameter is the distance between two parallel tangents to the object, where the tangents are perpendicular to the fixed direction. The feature maxDiameter contains the Feret diameter maximized over all directions, and maxAngle specifies the angle (between −180° and 180°) at which the maximum diameter is obtained. The features minDiameter and minAngle are similar but for the minimum Feret diameter. We further computed the convex hull of the segmented nucleus. The feature hull_area corresponds to the area of the convex hull.

Finally we computed a number of geometric features derived from the quantities described above. These features are esf, csf, sf1, sf2, elongation, and convexity, defined in (1) to (6). The esf, sf1, sf2 and elongation are all simple ratios that can be thought of as measures of how "elongated" the nucleus is. In particular, they are all equal to 1 if the nucleus is perfectly circular. The csf is similar: it is a measure of circularity, and is equal to 1 if the nucleus is perfectly circular. For increasingly elliptical nuclei, the csf decreases towards 0.

$$\text{esf} = \frac{\text{shortAxis}}{\text{longAxis}} \tag{1}$$

$$\text{csf} = \frac{4\pi \cdot \text{ellip\_area}}{\text{ellip\_perimt}^2} \tag{2}$$

$$\text{sf1} = \frac{\text{shortAxis}}{\text{maxDiameter}} \tag{3}$$

$$\text{sf2} = \frac{\text{minDiameter}}{\text{maxDiameter}} \tag{4}$$

$$\text{elongation} = \frac{\text{maxDiameter}}{\text{minDiameter}} \tag{5}$$

$$\text{convexity} = \sqrt{\frac{\text{ellip\_area}}{\text{hull\_area}}} \tag{6}$$

Data Records

The DLBCL-Morph dataset is organized into three folders, TMA, Patches, and Cells, as shown in Figure 4. The clinical data of the patients together with the outcome is stored in clinical_data.xlsx and clinical_data_cleaned.csv, where the latter contains all the patients for which the outcome is recorded and has all categorical variables converted to numerical values; e.g. 'neg', 'pos', and 'no data' were converted to 0, 1, and NaN, respectively, for the variable CD10 IHC. Each patient has a unique identifier. There are 209 patients recorded in clinical_data_cleaned.csv. The column OS records the overall survival, which is the length of time (in years) from the end of treatment until death or last follow-up. The column Follow-up Status (FUS) is 1 if the patient was deceased at the time of last follow-up, else 0.
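The categorical-to-numeric conversion mentioned above might look like the following sketch. The CSV header names (patient_id, CD10 IHC, OS, FUS) and the three sample rows are synthetic placeholders chosen for illustration; they are not taken from clinical_data.xlsx, whose exact column headers should be checked before use.

```python
import csv
import io
import math

# Mapping described in the text: 'neg' -> 0, 'pos' -> 1, 'no data' -> NaN.
CODES = {"neg": 0.0, "pos": 1.0, "no data": math.nan}

# Synthetic placeholder rows, not real patient data.
RAW = """patient_id,CD10 IHC,OS,FUS
p1,pos,4.2,0
p2,neg,1.5,1
p3,no data,7.0,0
"""


def encode(value):
    """Map a raw categorical entry to 0, 1 or NaN (unknown strings -> NaN)."""
    return CODES.get(value.strip().lower(), math.nan)


def load_rows(text):
    """Parse CSV text into dicts with numeric CD10 IHC, OS (years) and FUS."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "patient_id": rec["patient_id"],
            "CD10 IHC": encode(rec["CD10 IHC"]),
            "OS": float(rec["OS"]),   # time to death or last follow-up
            "FUS": int(rec["FUS"]),   # 1 = deceased, 0 = censored
        })
    return rows
```

Keeping missing values as NaN rather than dropping the row leaves the choice of imputation or exclusion to the downstream model.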

TMA

The TMA folder contains a total of 42 digitally-scanned TMAs, which are organized within subfolders for each stain. The filename of each TMA is a TMA id which is the same across all stains, i.e. DLBCL-Morph/TMA/HE/TMA255 and DLBCL-Morph/TMA/BCL2/TMA255 contain cores from the same set of patients. The TMA id together with the row and column number of each core, starting with 0 and 0, respectively, in the upper left corner, can be linked to the patient id through core.csv; each patient has two cores. The annotations.csv file contains coordinates of ROIs annotated by human experts. For each annotation there is a patient id, TMA id, and stain, where the TMA id and the stain are used to locate the TMA file that the annotation belongs to. The annotations are rectangular and the coordinates record the upper left and lower right corners based on the 40x magnification level of the TMAs.
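One way to use the core-to-patient mapping is to build a lookup keyed by (TMA id, row, column). The sketch below assumes hypothetical column names (tma_id, row, col, patient_id) and uses made-up rows; the real header of core.csv should be substituted.

```python
import csv
import io
from collections import defaultdict

# Made-up rows with hypothetical column names, for illustration only.
CORE_CSV = """tma_id,row,col,patient_id
TMA255,0,0,p1
TMA255,0,1,p1
TMA255,1,0,p2
TMA255,1,1,p2
"""


def build_core_index(text):
    """Return (core -> patient, patient -> list of cores) lookups."""
    core_to_patient = {}
    patient_to_cores = defaultdict(list)
    for rec in csv.DictReader(io.StringIO(text)):
        # Rows and columns are 0-based, counted from the upper left corner.
        key = (rec["tma_id"], int(rec["row"]), int(rec["col"]))
        core_to_patient[key] = rec["patient_id"]
        patient_to_cores[rec["patient_id"]].append(key)
    return core_to_patient, dict(patient_to_cores)
```

Since the TMA id is shared across stains, the same key locates a patient's core in the H&E slide and in each immunostained slide.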

Figure 4. The directory structure of DLBCL-Morph.

Patches

The Patches folder contains one subfolder per stain, each of which contains a subfolder for every patient that has at least one ROI. The patches are located in the patient id folders, with a patch id as the filename, and are stored in PNG format. There are 195 patients that have at least one patch from at least one stained TMA. However, some patients do not have patches for all 6 stains, which can occur if the core for a particular stain was missing or not covered by any ROIs.

Cells

The Cells folder contains one subfolder per patient id, each of which contains one subfolder per patch id. The binary segmentation images for the tumor cell nuclei are located in the patch id folders, with the cell number as the filename, and are stored in NPY format. The NPY format is used by the NumPy package for Python to save arrays; in this case we are storing 2-dimensional arrays with binary values as segmentations of tumor cell nuclei. The cell numbers are non-consecutive since all non-tumor cells are discarded in each patch. All the geometric features computed from tumor nuclei are stored in cell_shapes.csv and can be linked to the nucleus segmentation images through the patch id and the cell number.

Technical Validation

We performed survival regression using the geometric and clinical features in our dataset to measure the utility of these features in predicting prognostic outcome. This analysis was performed on the 170 patients for whom patches from H&E stained TMAs were available. For each of the geometric features computed per tumor nucleus, we computed the mean and standard deviation across all nuclei for each patient. We then fit Cox Proportional Hazards models using the binary Follow-up Status (FUS) feature as an indicator of censoring, and the overall survival (OS) feature as the time to event or censoring (in years). We evaluated our models using Harrell's C-index. Random prediction would give a C-index of 0.5. Specifically we fit three models: i) using both clinical and geometric features, ii) using only clinical features, and iii) using only geometric features.

We used the bootstrap method to obtain an "optimism-corrected" C-index. We sampled 1000 bootstrap replicates with replacement and fit the model on each bootstrap replicate. We then evaluated the model on both the original data and the bootstrap replicate. We recorded the performance decrease between evaluating on the bootstrap replicate and evaluating on the original data. This decrease, averaged over all bootstrap replicates, was subtracted from the original C-index to obtain the optimism-corrected C-index. We also generated the corresponding 95% two-sided confidence intervals (CI) for the optimism-corrected C-indices using the non-parametric percentile bootstrap method with 1000 bootstrap replicates.

The resulting optimism-corrected C-indices with 95% CIs for our models were: i) 0 . ( . , . ) using clinical and geometric features, ii) 0 . ( . , . ) using only clinical features and iii) 0.635 (0.574, 0.691) using only geometric features. Thus, use of the geometric features alone allowed significantly better than random survival prediction. Use of both clinical and geometric features led to a higher performance than the use of clinical features alone, although this performance difference was not statistically significant. While prognostic classification based on the morphologic properties of the tumor has proved to be challenging and the subject of continued debate, our results suggest that geometric features computed from H&E-stained tumor nuclei can provide a significant signal to predict survival outcome. This finding should be further evaluated on external datasets and prospectively in future studies.

Usage Notes

The DLBCL-Morph dataset can be downloaded here: https://stanfordmedicine.box.com/s/0sh3plpjfovea6gv93y8a5ch1k3j0lr5. The data is organized as shown in Figure 4. We have provided publicly available Jupyter Notebooks to illustrate computation of geometrical features as well as usage of the data. One notebook uses the clinical and geometric variables in the dataset to reproduce the survival regression results described in the Technical Validation section. Another notebook visualizes and reproduces the computation of several geometric features for any segmented tumor nucleus in our dataset. Finally, we provide another notebook to extract patches uniformly from inside any of the ROIs in the dataset. These patches are already included as part of the dataset, but we believe this notebook will be beneficial for researchers who work with the SVS files in our dataset. The notebooks, along with the code to compute all geometrical features from tumor nuclei, are provided at https://github.com/stanfordmlgroup/DLBCL-Morph.
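For readers who want a feel for the geometric features before opening the notebooks, here is a minimal NumPy sketch. It is not the repository's implementation: feret_diameters approximates the Feret diameters by projecting pixel centers onto directions sampled at 1° steps, and shape_factors evaluates the six derived ratios, taking csf as 4π·ellip_area / ellip_perimt², a standard circularity measure that equals 1 for a perfect circle. Both function names are hypothetical.

```python
import math

import numpy as np


def feret_diameters(mask, n_angles=180):
    """Approximate (min, max) Feret diameter of a binary mask in pixels.

    Pixel centers are projected onto n_angles directions in [0, 180) degrees;
    the extent of each projection is the Feret diameter for that direction.
    Using pixel centers underestimates the true extent by about one pixel.
    """
    ys, xs = np.nonzero(mask)
    thetas = np.deg2rad(np.arange(n_angles) * 180.0 / n_angles)
    proj = np.outer(np.cos(thetas), xs) + np.outer(np.sin(thetas), ys)
    widths = proj.max(axis=1) - proj.min(axis=1)
    return float(widths.min()), float(widths.max())


def shape_factors(short_axis, long_axis, ellip_area, ellip_perimt,
                  min_diameter, max_diameter, hull_area):
    """Derived ratios; each equals 1 for a perfectly circular nucleus."""
    return {
        "esf": short_axis / long_axis,
        "csf": 4 * math.pi * ellip_area / ellip_perimt ** 2,  # assumed form
        "sf1": short_axis / max_diameter,
        "sf2": min_diameter / max_diameter,
        "elongation": max_diameter / min_diameter,
        "convexity": math.sqrt(ellip_area / hull_area),
    }
```

A stored mask could be passed in after loading it with np.load; the exact file path depends on the patient id, patch id and cell number.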

Code availability

The code to compute all geometric features from all tumor nuclei in our dataset, along with notebooks to illustrate usage of our data and reproduce all survival regression results, is publicly available at https://github.com/stanfordmlgroup/DLBCL-Morph.
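The evaluation procedure from the Technical Validation section can be illustrated in plain Python. This is a sketch under stated assumptions, not the published code: harrell_c_index scores tied risks as 0.5 and ignores tied event times, and optimism_corrected takes caller-supplied fit and score callables (for example a Cox model fit from a survival analysis library, which is not shown here). Both function names are hypothetical.

```python
import random


def harrell_c_index(times, events, risks):
    """Fraction of comparable pairs ordered correctly by predicted risk.

    A pair (i, j) is comparable when i had the event (events[i] == 1)
    strictly before time j. Higher risk should mean earlier event.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def optimism_corrected(data, fit, score, n_boot=1000, seed=0):
    """Bootstrap optimism correction of an apparent performance score.

    data: list of records; fit(records) -> model; score(model, records) -> float.
    """
    rng = random.Random(seed)
    apparent = score(fit(data), data)
    optimism = 0.0
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]  # sample with replacement
        model = fit(boot)
        # Drop in performance from the bootstrap sample to the original data.
        optimism += score(model, boot) - score(model, data)
    return apparent - optimism / n_boot
```

A model that ranks patients at random is expected to score about 0.5 under harrell_c_index, which is the baseline used in the text.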

References

Project, T. N.-H. L. C. A Clinical Evaluation of the International Lymphoma Study Group Classification of Non-Hodgkin's Lymphoma. Blood, 3909–3918, 10.1182/blood.V89.11.3909 (1997). https://ashpublications.org/blood/article-pdf/89/11/3909/1408169/3909.pdf.

Horvat, M. et al. Diffuse large b-cell lymphoma: 10 years' real-world clinical experience with rituximab plus cyclophosphamide, doxorubicin, vincristine and prednisolone. Oncol. Lett.

Leonard, J. P. et al. Randomized phase II study of r-CHOP with or without bortezomib in previously untreated patients with non–germinal center b-cell–like diffuse large b-cell lymphoma. J. Clin. Oncol., 3538–3546, 10.1200/jco.2017.73.2784 (2017).

Zhou, Z. et al. An enhanced international prognostic index (NCCN-IPI) for patients with diffuse large b-cell lymphoma treated in the rituximab era. Blood, 837–842, 10.1182/blood-2013-09-524108 (2014).

Alizadeh, A. A. et al. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 503–511, 10.1038/35000501 (2000).

Scott, D. W. Cell-of-origin in diffuse large b-cell lymphoma: Are the assays ready for the clinic? Am. Soc. Clin. Oncol. Educ. Book, e458–e466, 10.14694/edbook_am.2015.35.e458 (2015).

Basso, K. & Dalla-Favera, R. Germinal centres and b cell lymphomagenesis. Nat. Rev. Immunol., 172–184, 10.1038/nri3814 (2015).

Riedell, P. A. & Smith, S. M. Should we use cell of origin and dual-protein expression in treating DLBCL? Clin. Lymphoma Myeloma Leuk., 91–97, 10.1016/j.clml.2017.12.003 (2018).

Gutiérrez-García, G. et al. Gene-expression profiling and not immunophenotypic algorithms predicts prognosis in patients with diffuse large b-cell lymphoma treated with immunochemotherapy. Blood, 4836–4843, 10.1182/blood-2010-12-322362 (2011).

Scott, D. W. et al. Prognostic significance of diffuse large b-cell lymphoma cell of origin determined by digital gene expression in formalin-fixed paraffin-embedded tissue biopsies. J. Clin. Oncol., 2848–2856, 10.1200/jco.2014.60.2383 (2015).

Fu, K. et al. Addition of rituximab to standard chemotherapy improves the survival of both the germinal center b-cell–like and non–germinal center b-cell–like subtypes of diffuse large b-cell lymphoma. J. Clin. Oncol., 4587–4594, 10.1200/jco.2007.15.9277 (2008).

Alizadeh, A. A. et al. Prediction of survival in diffuse large b-cell lymphoma based on the expression of 2 genes reflecting tumor and microenvironment. Blood, 1350–1358, 10.1182/blood-2011-03-345272 (2011).

Lenz, G. et al. Molecular subtypes of diffuse large b-cell lymphoma arise by distinct genetic pathways. Proc. Natl. Acad. Sci., 13520–13525, 10.1073/pnas.0804295105 (2008).

Riedell, P. A. & Smith, S. M. Double hit and double expressors in lymphoma: Definition and treatment. Cancer, 4622–4632, 10.1002/cncr.31646 (2018).

Swerdlow, S. H. et al. The 2016 revision of the World Health Organization classification of lymphoid neoplasms. Blood, 2375–2390, 10.1182/blood-2016-01-643569 (2016). https://ashpublications.org/blood/article-pdf/127/20/2375/1393632/2375.pdf.

Engelhard, M. et al. Subclassification of Diffuse Large B-Cell Lymphomas According to the Kiel Classification: Distinction of Centroblastic and Immunoblastic Lymphomas Is a Significant Prognostic Risk Factor. Blood, 2291–2297, 10.1182/blood.V89.7.2291 (1997). https://ashpublications.org/blood/article-pdf/89/7/2291/1642341/2291.pdf.

Baars, J. W. et al. Diffuse large b-cell non-hodgkin lymphomas: the clinical relevance of histological subclassification. Br. J. Cancer, 1770–1776, 10.1038/sj.bjc.6690282 (1999).

Diebold, J. et al. Diffuse large b-cell lymphoma: A clinicopathologic analysis of 444 cases classified according to the updated kiel classification. Leuk. & Lymphoma, 97–104, 10.1080/10428190210173 (2002).

Nakamine, H. et al. Prognostic significance of clinical and pathologic features in diffuse large b-cell lymphoma. Cancer, 3130–3137, 10.1002/1097-0142(19930515)71:10<3130::aid-cncr2820711039>3.0.co;2-r (1993).

Salar, A. et al. Diffuse large b-cell lymphoma: is morphologic subdivision useful in clinical management? Eur. J. Haematol., 202–208, 10.1111/j.1600-0609.1998.tb01023.x (2009).

Villela, L. et al. Prognostic features and outcome in patients with diffuse large b-cell lymphoma who do not achieve a complete response to first-line regimens. Cancer, 1557–1562, 10.1002/1097-0142(20010415)91:8<1557::aid-cncr1165>3.0.co;2-4 (2001).

Beck, A. H. et al. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci. Transl. Medicine, 108ra113–108ra113, 10.1126/scitranslmed.3002564 (2011). https://stm.sciencemag.org/content/3/108/108ra113.full.pdf.

Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Medicine, 1054–1056, 10.1038/s41591-019-0462-y (2019).

Jain, M. S. & Massoud, T. F. Predicting tumour mutational burden from histopathological images using multiscale deep learning. bioRxiv.

Graham, S. et al. Hover-net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images (2018). 1812.06499.

Rosenwald, A. et al. Prognostic significance of myc rearrangement and translocation partner in diffuse large b-cell lymphoma: A study by the lunenburg lymphoma biomarker consortium. J. Clin. Oncol., 3359–3368, 10.1200/JCO.19.00743 (2019). PMID: 31498031, https://doi.org/10.1200/JCO.19.00743.

Harrell, F. E., Jr., Califf, R. M., Pryor, D. B., Lee, K. L. & Rosati, R. A. Evaluating the Yield of Medical Tests. JAMA, 2543–2546, 10.1001/jama.1982.03320430047030 (1982). https://jamanetwork.com/journals/jama/articlepdf/372568/jama_247_18_030.pdf.

Harrell, F. E., Lee, K. L. & Mark, D. B. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Medicine, 361–387, 10.1002/(sici)1097-0258(19960229)15:4<361::aid-sim168>3.0.co;2-4 (1996).

Efron, B. & Tibshirani, R. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Stat. Sci., 54–75, 10.1214/ss/1177013815 (1986). Publisher: Institute of Mathematical Statistics.

Perez, F. & Granger, B. E. Ipython: A system for interactive scientific computing. Comput. Sci. Eng., 21–29 (2007).

Kluyver, T. et al. Jupyter notebooks - a publishing format for reproducible computational workflows. In Loizides, F. & Schmidt, B. (eds.) Positioning and Power in Academic Publishing: Players, Agents and Agendas, 87–90 (IOS Press, 2016).

Author contributions statement

DV, AS, SF, and PR developed the concept and design; DV, AS, RR, YN, RHA, SF, and PR performed acquisition, analysis, or interpretation of data; AYN, SF, and PR provided supervision. DV, AS, and RR drafted the manuscript, and all authors provided critical revision of the manuscript for important intellectual content.

Competing interests