Deep Learning-based Computational Pathology Predicts Origins for Cancers of Unknown Primary
Ming Y. Lu, Melissa Zhao, Maha Shady, Jana Lipkova, Tiffany Y. Chen, Drew F. K. Williamson, Faisal Mahmood
Affiliations: Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA; Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA; Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA
Interactive Demo: http://toad.mahmoodlab.org *Correspondence:
Faisal Mahmood, 60 Fenwood Road, Hale Building for Transformative Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA. [email protected]

Abstract

Cancer of unknown primary (CUP) is an enigmatic group of diagnoses where the primary anatomical site of tumor origin cannot be determined
[1, 2]. This poses a significant challenge, since modern therapeutics such as chemotherapy regimens and immune checkpoint inhibitors are specific to the primary tumor. Patients with a CUP diagnosis routinely undergo an extensive diagnostic work-up of pathology, radiology, endoscopy, laboratory tests, and clinical correlation in an attempt to determine the primary origin. Such exploration is not only time and resource consuming, but it might significantly delay administration of the suitable treatment. Despite extensive diagnostic work-ups, the primary may never be determined in many cases. Recent work has focused on using genomics and transcriptomics for identification of tumor origins. However, genomic testing is not conducted for every patient and lacks clinical penetration in low-resource settings. Herein, to overcome these challenges, we present a deep learning-based computational pathology algorithm, TOAD, that can provide a differential diagnosis for CUP using routinely acquired histology slides. We used 17,486 gigapixel whole slide images with known primaries spread over 18 common origins to train a multi-task deep model to simultaneously identify the tumor as primary or metastatic and predict its site of origin. We tested our model on an internal test set of 4,932 cases with known primaries and achieved a top-1 accuracy of 0.84 and a top-3 accuracy of 0.94, while on our external test set of 662 cases from 202 different hospitals, it achieved top-1 and top-3 accuracies of 0.79 and 0.93, respectively. We further curated a dataset of 717 CUP cases from 151 different medical centers and identified a subset of 290 cases for which a differential diagnosis was assigned. Our model predictions resulted in concordance for 50% of cases (κ = 0.4 when adjusted for agreement by chance) and a top-3 agreement of 75%.
Our proposed method can be used as an assistive tool to assign differential diagnoses to complicated metastatic and CUP cases, and could be used in conjunction with or in lieu of immunohistochemical analysis and extensive diagnostic work-ups to reduce the occurrence of CUP.

For the vast majority of cancer diagnoses, the site of a primary tumor can be determined via pathological examination of tissue, or through a clinical and radiological assessment of the patient. However, 1-3% of cases are often categorized as enigmatic cancers of unknown primary (CUP), where the anatomic site of primary origin cannot be assigned despite extensive diagnostic investigation and clinical correlation
[1, 2]. Decades of study have led to cancer treatment strategies that generally rely upon knowledge of the primary site of the tumor, whether it be surgical resection, radiation therapy, chemotherapeutic regimen, or targeted immunotherapies. A majority of CUP cases where a putative primary cannot be assigned are treated with empirical chemotherapy and have poor prognosis (median survival 7-11 months, one-year survival 25%)
[1, 2]. Hence, CUP patients often undergo comprehensive diagnostic work-ups including pathology, radiology, endoscopic, and laboratory examinations to determine the occult primary site
[2, 3]. Recent work has proposed using molecular (genomic and transcriptomic) features for determining primary origin. However, such testing is not routinely performed for every patient and lacks clinical penetration in low-resource settings. The frontline of primary site classification remains tissue examination by a pathologist using histology with the aid of immunohistochemistry (IHC). However, despite the improvements from sophisticated imaging modalities, specific and sensitive immunohistochemical testing, and molecular profiling, the diagnosis of CUP remains a current-day diagnostic challenge. Moreover, uncertainty in classifying a lesion as primary or metastatic and mistaking a relapse of an antecedent malignancy have also been reported in the literature
[7, 8]. Evidence suggests that around 10% of CUPs can be prematurely diagnosed due to suboptimal investigation at the time of presentation
[1, 2]. Recent advances in deep learning
[9, 10] have increasingly demonstrated accurate and reliable performance on a variety of different human-identifiable features and phenotypes, as well as phenotypes that are typically not recognized by human experts.

Tumor origin assessment via deep learning
In order to address the difficulties and complexities associated with identifying the primary sites of tumor specimens, we propose a deep learning-based solution that uses scanned H&E whole slide images (WSIs), which are routinely used for clinical diagnosis, for identifying the site of primary origin without immunohistochemical analysis, genomic testing, or extensive clinical diagnostic screening. We developed Tumor Origin Assessment via Deep-learning (TOAD), a high-throughput, interpretable deep learning framework that can be used to simultaneously predict whether the histological sample is metastatic and assign a differential diagnosis for primary origin. In addition to addressing an unmet need in the diagnosis of CUP patients, TOAD can also act as an assistive tool for pathologists for complicated metastatic cases where a large number of IHCs are required to narrow a differential diagnosis. TOAD is capable of providing assistance with differential diagnosis (top-3, top-5 predictions) instead of a single diagnosis for the pathologist's consideration. Such differential diagnoses are a routine part of the clinical and pathological work-up for CUP cases and assist with narrowing down the possibilities of potential primaries. Our study uses 24,885 WSIs from 23,273 patient cases from the Brigham and Women's Hospital and the TCGA, where each slide was treated as an independent case.

Figure 1:
Tumor Origin Assessment via Deep Learning (TOAD) workflow.
Patient data in the form of digitized high-resolution FFPE H&E histology slides (known as WSIs) serve as input into the main network. For each WSI, the tissue content is automatically segmented and divided into an average of thousands to tens of thousands of regions as small image patches. These images are processed by a pretrained convolutional neural network, which serves as an encoder to extract a compact, descriptive feature vector from each patch. Using an attention-based multiple instance learning algorithm, TOAD learns to rank all tissue regions in the slide using their feature vectors and aggregate their information across the whole slide based on their relative importance, assigning greater weights to regions perceived to have high diagnostic relevance. As an additional covariate, the patient's sex can be fused with the aggregated histology features to further guide classification. By using a multi-branched network architecture and a multi-task objective, TOAD can predict both the tumor origin as well as whether the cancer is primary or metastatic. Additionally, the attention scores that the network assigns to each region can be used to interpret the model's prediction.

We trained our model using 17,486 WSIs using our weakly-supervised multi-task training paradigm. Then, extensive analysis was conducted to assess the performance of TOAD by first testing on 4,932 WSIs with known primaries, and carefully analyzing complicated metastatic cases to determine the capability of TOAD for assigning differential diagnoses. Second, to further assess the adaptability of our model, we evaluated on an external multi-institutional test set of 662 cases from 202 different medical centers.
Third, we curated an additional test dataset of 717 consented CUP cases received from 151 medical centers that could not be assigned a primary using histology alone and identified a subset of 290 cases where a primary differential was identified based on immunohistochemical analysis, radiology, patient history, clinical correlation or at autopsy (see
Extended Data Figure 1 for an overview of our study design).

Our weakly-supervised multi-task deep learning classifier model was trained on gigapixel WSIs without requiring manual expert annotation of regions of interest (ROIs) and predicts major primary sites at the slide level. We combine transfer learning and weakly-supervised multi-task learning to allow a single, unified predictive model to be efficiently trained on tens of thousands of WSIs, a scale that is likely required to solve both the complex problem of classifying 18 common cancer origins and predicting if the cancer is metastatic or primary simultaneously. Using attention-based learning, our approach automatically learns to locate regions in the slide that are of high diagnostic relevance and aggregates their information to make the final predictions. Subsequently, using custom visualization tools, the relative importance of each region examined by the model can be intuitively displayed as high-resolution attention heatmaps for human interpretability and validation (
Figure 3, Extended Data Figure 6, Interactive Demo).

TOAD begins by automatically segmenting and patching the tissue regions in the WSI into many smaller cropped regions that can be directly processed by a convolutional neural network (CNN). Using transfer learning, a deep residual CNN is first deployed as an encoder to compress the raw input data by embedding them into compact low-dimensional feature vectors for efficient training and inference. Following feature extraction, TOAD uses a custom, lightweight neural network that takes in the deep features of all tissue regions in the slide as input for weakly-supervised learning. Building upon the attention-based pooling operator
[12, 28], an attention module learns to rank each region's relative importance toward the determination of each classification task of interest and aggregates their deep feature representations into a single slide-level feature vector for each task by computing their attention-score-weighted average. Further, we explored incorporating the patient's gender as an additional covariate by fusing it with the slide-level features via concatenation before the final classification layers, which independently predict both the cancer origin (multi-class classification) and whether the tumor is primary or metastatic (binary classification). The two classification problems are learned jointly during training by using a multi-task objective and sharing the model parameters of intermediate layers. A separate attention branch, however, is used for each task to increase the model's expressivity, allowing it to attend to different sets of information-rich regions of the slide depending on the task (
Figure 1). Further details of the model architecture, training and dataset are described in the Methods and
Extended Data Figure 1, Extended Data Table 1.

Evaluation of model performance
We evaluated our proposed deep learning framework by partitioning our dataset, which consists of a total of 24,885 FFPE H&E digitized diagnostic slides from 23,297 patient cases, into 70/10/20 splits for training, validation and testing, respectively. On this held-out test set of 4,932 slides with known primaries that were not previously seen by the model, TOAD achieved an overall accuracy of 83.6% and a micro-averaged AUC ROC of 0.988 (95% CI: 0.987 - 0.990) (
Figure 2c). When the model is evaluated using top-k differential diagnosis accuracy, i.e., how often the ground-truth label is found in the model's k highest confidence predictions, TOAD achieved a top-3 accuracy of 94.4% and a top-5 accuracy of 97.8% (
Figure 2e). Such top differential predictions can be especially useful for complicated metastatic and CUP cases, where narrowing down potential primaries can assist with the diagnostic workflow and reduce the number of IHC stains and other diagnostic tests required to pinpoint the culprit primary.
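The top-k differential diagnosis accuracy defined above can be computed directly from the model's per-class probability scores. A minimal sketch, using hypothetical probabilities over four origins rather than TOAD's actual outputs or code:

```python
import numpy as np

def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k
    highest-probability predictions (one row of probs per sample)."""
    # indices of the k largest probabilities for each sample
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = [label in row for row, label in zip(top_k, labels)]
    return float(np.mean(hits))

# hypothetical predictions over 4 origins for 3 samples
probs = np.array([
    [0.70, 0.20, 0.05, 0.05],   # true origin 0 -> top-1 hit
    [0.10, 0.30, 0.40, 0.20],   # true origin 1 -> within top-3
    [0.05, 0.15, 0.30, 0.50],   # true origin 0 -> miss even at top-3
])
labels = np.array([0, 1, 0])

print(top_k_accuracy(probs, labels, 1))  # 1 of 3 correct at top-1
print(top_k_accuracy(probs, labels, 3))  # 2 of 3 within top-3
```

As in the paper's evaluation, top-3 and top-5 accuracy are by construction at least as high as top-1 accuracy, since they credit any prediction whose k highest-confidence candidates contain the ground truth.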
Figure 2a shows performance for each individual primary site, and a full summary table of classification performance metrics, including precision, recall, F1-score, and one-vs-rest AUC ROC, is included in
Extended Data Table 3. The training and validation performance over time for models of different configurations are shown in
Extended Data Figure 5 (see
Supplementary Data File Table 1 for individual case assessments). By binning the model's predictions based on their confidence, we note that the majority of predictions on the test set are made with high confidence and are more accurate compared to less confident predictions (Figure 2b).

Figure 2: Model performance of TOAD. a. Slide-level classification performance on the test set (n=4,932) for 18 tumor origins. Columns represent the tumor's true origin and rows represent the model's predicted origins. b. For each origin, fraction of samples correctly classified with a confidence (probability) score of greater than 0.5, 0.75 and 0.95, respectively (top); fraction of samples (y-axis) correctly classified at or above a certain confidence threshold (x-axis, computed over increments of 0.025 in probability score) (bottom). c. Micro-averaged receiver operating characteristic (ROC) curves for the multi-class classification of the tumor origin, evaluated on the test set (n=4,932) and an independent test set of external cases only (n=662). d. ROC curves for the auxiliary task of predicting primary vs. metastasis in the test set and external test set. e. Top-k accuracy of the model for tumor origin classification on the test set and external test set for k ∈ {1, 3, 5}. f. Overall accuracy (left) of the model for predicting primary vs. metastasis and (right) accuracy of the model's predictions stratified into low confidence and high confidence bins. g. (left) Sensitivity score for the best, median and worst tumor origin based on the model's top-k predictions for k ∈ {1, 2, ..., 18}. (right) Accuracy of predictions for different bins of prediction confidence.

Together, the high top-k accuracy suggests that we can potentially use TOAD's top predictions for a given slide to narrow down the origin of the tumor to a handful of possible locations, while predictions with high confidence are generally reliable (Figure 2g). For interpretability and further validation, we examined attention heatmaps for tumors metastasized from the lung, breast and colon and confirmed that high attention regions generally exhibit tumor morphology characteristic of the respective primary tumor (
Figure 3, Extended Data Figure 6). Additionally, TOAD was able to predict whether the tumor specimen is a primary or metastatic tumor with an accuracy of 89.4% and an AUC ROC of 0.934 (95% CI: 0.926 - 0.942) (
Figure 2d); high confidence predictions were assigned to the majority of correct cases (
Figure 2f). Extended Data Table 7 shows site-wise performance for this binary task.
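The confidence-stratified accuracies reported in Figure 2b, f and g follow from binning each prediction by its probability score and measuring accuracy within each bin. A sketch with illustrative bin edges and toy data (the paper's exact thresholds are not reproduced here):

```python
import numpy as np

def accuracy_by_confidence(probs, labels, edges):
    """Group predictions into confidence bins (keyed by bin edges,
    using the max class probability as confidence) and report
    (count, accuracy) per bin."""
    preds = probs.argmax(axis=1)      # predicted class per sample
    conf = probs.max(axis=1)          # confidence = top probability
    correct = preds == labels
    out = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf < hi)
        n = int(mask.sum())
        out[(lo, hi)] = (n, float(correct[mask].mean()) if n else None)
    return out

# toy 3-class probabilities: one confident correct prediction,
# two lower-confidence predictions (one correct, one wrong)
probs = np.array([
    [0.95, 0.03, 0.02],
    [0.55, 0.40, 0.05],
    [0.60, 0.30, 0.10],
])
labels = np.array([0, 0, 1])
print(accuracy_by_confidence(probs, labels, edges=[0.5, 0.75, 1.01]))
```

In this toy example the higher-confidence bin is perfectly accurate while the lower bin is not, mirroring (in miniature) the trend the paper reports across its confidence bins.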
Generalization to multi-institutional external test cohort
To assess the adaptability of our model across different healthcare systems with different H&E staining protocols and patient populations, we also validated TOAD on an additional test set of 662 external cases submitted from 202 US and international medical centers (for geographic diversity, see
Extended Data Figure 4, Supplementary Data File Table 4). Without tuning or any form of domain adaptation, our trained model produced an accuracy of 78.5%, a top-3 accuracy of 92.6%, a top-5 accuracy of 95.9% and an AUC ROC of 0.981 (95% CI: 0.976 - 0.986) on this additional independent test set (
Figure 2c,e). Similarly, on the second task of distinguishing between metastasis and primary tumor, the model scored an AUC of 0.922 (95% CI: 0.899 - 0.945) and an accuracy of 87.3% (
Figure 2d,f). The model's performance is consistent with results on the first test set, indicating that our model is capable of generalizing to diverse data sources not encountered during training. Individual case assessments are available in
Supplementary Data File Table 2.

Evaluation on challenging metastatic cases
It is challenging to objectively evaluate the model's ability to correctly predict the origin of tumors for CUP cases because there are limited and weak ground-truth labels. In light of this challenge, we first analyzed the performance of TOAD on difficult metastatic cases in our test set for which a diagnosis is available. On these 882 metastatic cases, TOAD achieved a micro-averaged AUC of 0.939 (95% CI: 0.930 - 0.948), an overall accuracy of 62.6%, a top-3 accuracy of 84.9% and a top-5 accuracy of 92.3% (
Table 1B). This demonstrates that TOAD can assist with assigning a differential diagnosis by narrowing down possible origins. The sensitivity for correctly identifying these cases as metastatic, using the prediction from our multi-task network, is 70.9%. Furthermore, we queried the pathology report for each metastatic case in our database and were able to extract the number of IHC tests performed in 767 reports (median: 2, min: 0, max: 27). Using the number of IHC tests performed as an indirect measure of the difficulty of diagnosing the case, we examined the performance of TOAD across different levels of IHC usage. As expected, in cases that were diagnosed without requiring IHC (n = 264), TOAD scored its highest accuracy of 65.9%, and a top-3 and top-5 accuracy of 89.8% and 93.9%, respectively.

Figure 3: Exemplars of metastases from primary sites with attention heatmaps. For all cases, smooth attention scores are computed on overlapping 256 x 256 patches, normalized using percentiles and displayed on top of the original H&E WSI as a semi-transparent overlay, where overlaid regions range from crimson (high attention, high diagnostic relevance) to navy (low attention, low diagnostic relevance). From left to right: low magnification with corresponding attention map, medium magnification with corresponding attention map, and high magnification patches of high attention regions. a: Medium and high magnification views demonstrate sheets of cells as well as small tubules and glands, morphologies consistent with metastatic breast carcinomas. b: Medium and high magnification views demonstrate so-called "dirty necrosis" and variably-sized glands with densely-packed, hyperchromatic nuclei, characteristic of colorectal adenocarcinoma. The attention heatmaps allow the model's predictions for each case to be visually interpretable for human experts, revealing the morphological features used by the model for classification.
By studying attention heatmaps for different metastatic tumors, we verified that the model is attending strongly to tumor regions for predicting the site of origin, as expected. More attention heatmaps for tumors metastasized from the lung are shown in Extended Data Figure 6, and high-resolution heatmaps for cases from all primary sites can be accessed through our interactive demo available at http://toad.mahmoodlab.org.

Table 1: Testing on primary, metastasis of known primary and CUP
A. Performance on Overall Test and External Test Set
Primary and Met. Test Set | Cohen's κ | Top-1 Acc | Top-3 Acc | Top-5 Acc
Test Cases (n=4,932) | 0.820 | 0.836 | 0.944 | 0.978
External Test Cases (n=662) | 0.746 | 0.785 | 0.926 | 0.959
B. Performance Analysis on Challenging Metastatic Cases
Metastatic Test Set | Cohen's κ | Top-1 Acc | Top-3 Acc | Top-5 Acc
All Cases (n=882) | 0.567 | 0.626 | 0.849 | 0.923
Required no IHC (n=264) | 0.610 | 0.659 | 0.898 | 0.939
Required between 1 and 5 IHC (n=303) | 0.551 | 0.617 | 0.838 | 0.911
Required ≥ 5 IHC (n=200) | — | 0.560 | 0.790 | 0.915

C. Performance on CUP Cases
CUP Test Set | Cohen's κ | Top-1 Agreement | Top-3 Agreement | Top-5 Agreement
Primary assigned (n=290) | 0.397 | 0.500 | 0.745 | 0.900

For the most challenging cases (n = 200) that required 5 (75th percentile) or more IHC tests, TOAD still managed to achieve an accuracy of 56.0% and a top-3 accuracy of 79.0%. Moreover, the top-5 accuracy for these difficult cases remained at 91.5%. Similarly, we identified and analyzed performance on two other subsets of challenging cases, including 64 cases that could not be diagnosed with IHC analysis and required further clinical or radiologic correlation to make the diagnosis (top-1: 60.9%, top-3: 76.6%, top-5: 93.8%) and 155 cases that were characterized as poorly-differentiated tumors in the pathology reports (top-1: 58.7%, top-3: 83.9%, top-5: 92.9%). As expected, performance for challenging and difficult-to-diagnose cases was lower, but largely consistent with performance on the entire set of metastatic cases. It is worth noting that the model was able to achieve this level of performance without having access to additional clinical variables or IHC results, as it makes its predictions solely based on the digitized H&E slide and the patient's gender. Overall, these results again suggest the potential of TOAD to provide reliable and valuable candidate primaries based on its top predictions to guide further differential diagnosis and IHC work-up, and to reduce the requirement of laboratory tests and clinical correlation. We also calculated Cohen's kappa, which measures the inter-observer agreement between the model and the assigned differential while taking into account agreement by chance. The κ scores fell in the range of moderate to substantial agreement for metastatic cases and indicated even stronger agreement on our overall test and external test sets (Table 1A, B).

Evaluation on a multi-institutional CUP cohort
We further curated a dataset of 717 consented cases from 151 medical centers that were assigned a diagnosis of CUP at some point during their course of diagnosis and treatment. These challenging cases were submitted from 146 US and 5 international medical centers (
Extended Data Figure 4, Supplementary Data File Table 5). None of these cases could be assigned a primary diagnosis using the histology slide alone. Instead, all cases required thorough IHC testing and the patients underwent extensive clinical work-ups (including radiology, endoscopy, etc.) in an attempt to determine the occult primary anatomic site. For a more thorough evaluation of our model, we carefully analyzed all available electronic medical records (EMRs) for patients in this CUP dataset, including clinical and familial history, radiology reports, endoscopy reports, treatment, and follow-up history. We identified a subset of 290 cases which were assigned a primary differential at some point during the course of diagnosis and treatment, while the remaining cases could not be assigned a primary at any point or had limited medical records. As expected, these differential diagnoses for CUP cases involve elements of uncertainty and conjecture and should be distinguished from confident, ground-truth labels, which cannot be realistically obtained for CUP cases.

We used our trained TOAD model to assign predictions for each case in our CUP dataset (using only information contained in the histology slide and the patient's gender) and observed that the model's top prediction directly concorded with the site indicated by the primary differential assigned in 145 of the 290 cases (50.0%), with a κ score of 0.397, indicating fair agreement after accounting for agreement by chance (Table 1C, see
Supplementary Data File Table 3 for individual case predictions). When using the model's top-3 and top-5 predictions, the agreement jumps to 74.5% and 90.0% of cases, respectively. This is a particularly encouraging result since our model was able to assign concordant differential diagnoses based on the histology image alone; such differentials are typically assigned using extensive investigative diagnostic work-ups. We observed that, similar to what we found on the test sets, a significant fraction of the model's predictions were made with high confidence (e.g.,
196 out of 290 predictions were made with a high confidence score), and agreement was higher among these high-confidence predictions. This serves as further evidence for our hypothesis that the model's high confidence predictions are generally more reliable, which we also observed on our test set of confirmed metastatic and primary cases. Here, the model's average confidence on all 717 CUP cases is 0.605 (median: 0.579) and 457/717 (63.8%) of cases are predicted with a confidence of 0.5 or higher.

We further subcategorized the CUP cases into high-certainty diagnoses (n=185) and low-certainty diagnoses (n=105) based on the strength of the evidence used to make the determination, the language used in EMRs, and whether the cancer was treated based on a certain primary as well as whether the patient responded to that particular treatment. As expected, agreement is poor for cases in the low-certainty bin, while for high-certainty diagnoses, higher agreement was observed across all metrics (Table 1C). In
Extended Data Figures 8 and 9, we demonstrate in two independent cases how the top predictions from the TOAD model can be used in conjunction with IHC testing to assist in suggesting and determining the origins of challenging metastatic cases initially assigned a diagnosis of CUP.
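Cohen's κ, used throughout Table 1 to adjust raw agreement for the agreement expected by chance, can be sketched as follows. The model predictions and assigned differentials below are toy values over three origins, not the study's data:

```python
import numpy as np

def cohens_kappa(a, b, n_classes):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the
    observed agreement and p_e the agreement expected by chance
    from each rater's marginal label frequencies."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = float(np.mean(a == b))
    # chance agreement: product of per-class frequencies, summed
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in range(n_classes))
    return (p_o - p_e) / (1.0 - p_e)

# toy model predictions vs. assigned differentials (3 origins)
model = [0, 0, 1, 1, 2, 2, 0, 1]
assigned = [0, 0, 1, 2, 2, 2, 1, 1]
print(cohens_kappa(model, assigned, n_classes=3))  # ≈ 0.63
```

Here raw agreement is 6/8 = 0.75, but κ is lower because some of that agreement would be expected by chance alone, which is exactly why the paper reports κ alongside top-1 agreement for the CUP cohort.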
Discussion
In this study, we presented TOAD, the first deep learning algorithm developed to predict the origin of a tumor based on whole-slide histopathology. TOAD targets the difficult problem of assigning origins for cancers of unknown primary using just histology WSIs, a task that is typically accomplished using extensive clinical work-ups including IHC analysis, radiologic imaging and clinical correlation. By using weakly-supervised learning, our algorithm was developed using tens of thousands of cases without requiring any fine-grained manual annotation of the slides or representative ROIs, and can be easily applied to digitized histology images of arbitrary size. To demonstrate the effectiveness of our proposed algorithm, we trained and validated our model using a large dataset of diagnostic WSIs and showed consistent performance between a large test set of metastatic and primary cases and a geographically diverse, independent test set of external cases received from over two hundred different institutions. It has been shown that pathologists have limited ability to identify the origins of metastatic tumors when provided with minimal clinical information, especially when evaluating based on morphology alone. We show that despite using only histology and the patient's gender as input for decision making, our model can make fairly accurate predictions, particularly in assigning top-3 or top-5 primary differentials, even for challenging metastatic cases that required extensive IHC tests and clinical or radiologic correlation to diagnose. Lastly, we also curated a large set of CUP cases, which were acquired from a diverse cohort of US and international medical centers, and we subsequently identified a subset of cases for which a primary differential was assigned at some point after the initial diagnosis, often after extensive clinical work-ups.
For these extremely challenging cases, the H&E slide proved to be insufficient for human experts to assign a primary, whereas, despite being limited to just the patient's gender and the morphological information in the WSIs, our model was able to make predictions that are concordant with the primary differentials assigned after IHC work-up to a meaningful degree. We also showed that metadata such as the patient's gender can be incorporated into the model and that, using multi-task learning, we can use a single model to additionally predict whether a tumor is primary or metastatic without sacrificing performance. We conducted ablation experiments to investigate the effect of adding gender as a covariate and of multi-task learning, and found minimal difference in performance (Extended Data Table 4).

As further analysis, we also showed that our multi-task network can distinguish between primary tumors and metastatic tumors found at the same site, which can pose occasional difficulties to pathologists
[7, 8]. As an example, when asked to predict tumors found in the central nervous system (CNS) as primary or metastatic, the model reached an accuracy of 95.0% (n = 446) on gliomas and other tumors metastasized to the brain, and similarly reported high accuracy for GI metastatic sites (colorectal: 92.9% accuracy for n = 406 and esophagogastric: 94.1% accuracy for n = 273) in our test set (Extended Data Figure 7). From additional experiments, we confirmed that TOAD can also be applied to predict the primary for subsets of tumors that share the same morphological appearance (e.g., various subtypes of squamous cell carcinoma,
Extended Data Figure 2) or have metastasized to a common site (e.g., the lymph node,
Extended Data Figure 3). While more restricted in their scope than our main 18-class classifier, these networks targeting specific subgroups of tumors have the potential to serve as additional readers when attempting to rule out plausible candidate origins proposed by the main network (
Extended Data Figure 9).

Overall, an encouraging observation is that a substantial fraction of our model's predictions were made with high confidence, and these predictions also consistently proved to be more reliable and accurate. By using the top-k predictions, the model is also capable of narrowing the tumor origin down to a handful of possible locations (e.g., the top-3 or top-5 most likely locations) with fairly high accuracy. This suggests the potential clinical applicability of TOAD both to suggest high-likelihood candidate primaries when the diagnosis of a case is initially ambiguous and to serve as a second reader to human experts, potentially prompting re-evaluation or the exploration of alternative hypotheses when the model produces a high confidence prediction in disagreement. In such cases, the attention heatmap and high attention patches (
Figure 3, Extended Data Figure 6, Interactive Demo) may be used in conjunction with the model's probability score predictions for human interpretability and validation. Unlike previous works that predicted the cancer type based on genomic alterations in the tumor, our approach, as the first histology-based deep learning algorithm proposed and validated for automated prediction of tumor origins, is arguably more broadly applicable, especially for low-resource settings where clinical expertise, immunohistochemistry and molecular testing may be limited.
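The attention-based aggregation at the heart of TOAD can be illustrated with a small numerical sketch: softmax-normalized attention scores weight each patch's feature vector in the slide-level average. The tanh scoring and random weights below are illustrative stand-ins in the spirit of the attention-based MIL pooling operator, not TOAD's trained parameters or code:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(features, V, w):
    """Aggregate N patch feature vectors (N x d) into one slide-level
    vector: score each patch with a tanh-gated projection, normalize
    scores with a softmax, and take the weighted average."""
    scores = np.tanh(features @ V) @ w          # (N,) raw attention scores
    scores = scores - scores.max()              # numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    slide_vec = weights @ features              # (d,) attention-weighted average
    return slide_vec, weights

# a toy "slide": 5 patches with 8-dim features, random attention parameters
features = rng.normal(size=(5, 8))
V = rng.normal(size=(8, 4))
w = rng.normal(size=4)
slide_vec, weights = attention_pool(features, V, w)
# weights sum to 1 and indicate each patch's relative importance,
# analogous to the attention heatmaps shown in Figure 3
```

In the full model, a covariate such as the patient's sex would then be concatenated to the pooled slide-level vector before the task-specific classification heads, with a separate attention branch per task.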
Online Methods
Dataset Description
For model development, we curated a dataset of 14,518 WSIs from internal consented patient cases at the Brigham and Women's Hospital. These slides were collected between 2010 and 2019 and scanned at 20× using an Aperio scanner. Unless indicated otherwise, each WSI corresponds to a unique patient. We grouped these cases into 18 common cancer origins, where each origin encompasses both common and rare tumor subtypes for which at least 10 cases were found in our database (Extended Data Table 2). All data used for this study were anonymised. We additionally queried the TCGA Data Commons and downloaded all diagnostic WSIs from repositories corresponding to the 18 classes. Among slides downloaded from the TCGA, slides which did not contain tumors or that lacked lower magnification downsamples were excluded. In total, we gathered 10,367 WSIs from 8,755 patient cases across the 24 TCGA studies. Our overall dataset was composed of 24,885 FFPE H&E digitized diagnostic slides (20,413 primary and 4,472 metastatic WSIs from 23,297 patient cases; 54.7% F, 45.3% M) (
Extended Data Table 1 ). This roughly amounted to 21.8 Terabytes of raw data. Thisdataset is randomly partitioned and is stratified by class, into a training set (70% of cases), a validation set(10% of cases) and a test set (20% of cases). The partitioning was performed at the patient-level and thereforeall slides from the same patient are always placed into the same set. Additionally, we processed 662 externalconsult cases consented for research and received at the Brigham & Women’s Hospital from 202 medicalcenters across 34 states in USA and 19 international medical centers from 8 other countries (see
ExtendedData Figure 4 for geographic diversity). Slides for these cases were prepared at their respective institutionsusing a variety of different tissue preparation, processing and staining protocols. A full origin-wise breakdownof these datasets are summarized in
Extended Data Table 1 . Lastly, to further validate our model, we alsoidentified 717 consented cases that were assigned a diagnosis of CUP. These cases are received from 146medical centers across 22 US states and 5 international centers from 2 other countries (see
Extended DataFigure 4 for geographic diversity). For each case we reviewed electronic medical records including pathologyreport in combination with laboratory results, patient history, oncology, radiology, endoscopy and autopsyreports where applicable and if available, we determined a subset of 290 cases with a primary differential.These differentials were assigned during the course of diagnosis or treatment. It was verified that none ofthese cases could be diagnosed using histology alone and required extensive immunohistochemical analysisor clinical correlation. While it is not possible to obtain a ground truth for CUP cases such analysis based onthe differential diagnosis was used to assess the value of our model in assigning appropriate differentials toCUP cases using histology alone. For further analysis, we additional split these 290 cases into high-certanity(n=185) and low-certainty differentials (n=105) based on the language and evidence used in the electronicmedical records (
Extended Data Figure 1 for detailed study design).
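A class-stratified, patient-level partition like the one described above can be sketched as follows; the function, ratios and case list here are illustrative, not the actual splitting code used for the study:

```python
import random
from collections import defaultdict

def patient_level_split(cases, train=0.7, val=0.1, seed=1):
    """Stratified train/val/test partition at the patient level.

    cases: list of (patient_id, origin_label) pairs, one per patient.
    All slides of a patient inherit that patient's assigned split, so no
    patient's tissue can leak between training and evaluation.
    """
    by_label = defaultdict(list)
    for pid, label in cases:
        by_label[label].append(pid)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, pids in by_label.items():
        rng.shuffle(pids)
        n_train, n_val = int(train * len(pids)), int(val * len(pids))
        splits["train"] += pids[:n_train]
        splits["val"] += pids[n_train:n_train + n_val]
        splits["test"] += pids[n_train + n_val:]
    return splits

# Toy cohort: 10 Lung and 10 Breast patients
cases = [(f"pt{i}", "Lung" if i % 2 else "Breast") for i in range(20)]
s = patient_level_split(cases)
assert set(s["train"]) & set(s["test"]) == set()   # no patient leaks across splits
```

Stratifying within each origin keeps the class distribution comparable across the three sets even for the rarer origins.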
Multi-task Weakly-Supervised Computational Pathology
We used deep learning to simultaneously predict the origin of the tumor in each WSI and whether it is the primary site or a metastasis. Due to the enormous size of gigapixel WSIs, as well as the large variation in the shape of the tissue content captured by each image, it is generally considered inefficient, unintuitive, and intractable to deploy deep learning algorithms based on convolutional neural networks (CNNs) directly on the entire WSI for training or inference. While it is possible to use smaller regions of interest (ROIs) for training, this approach has a drawback: since the slide-level diagnosis (e.g. Lung Adenocarcinoma) is manifested in only a fraction of the tissue content of the WSI, naively associating smaller regions with the slide-level diagnosis leads to noisy and erroneous labels unless human expertise and manual labor ensure that these regions are representative of the diagnosis made for the entire slide. To overcome this limitation, we built a compact neural network model and used a form of weakly-supervised machine learning known as multiple instance learning. By considering each WSI as a collection (known as a bag) of smaller image regions (known as instances), we trained the multi-task network directly with slide-level labels, without the need for manual extraction of regions of interest, while still taking into account information from the entire slide. For computational efficiency, we first performed dimensionality reduction on the raw image data by encoding each 256 × 256 RGB image patch into a 1024-dimensional feature vector using a pretrained CNN (transfer learning). In the low-dimensional feature space, the information from all tissue regions in each slide is aggregated by extending attention-based pooling to multiple tasks, based on which the classification layers of the network output the final slide-level predictions. Specifically, two stacked fully-connected layers Fc_1 and Fc_2, parameterized by weights W_1, b_1 and W_2, b_2 in the base of the network, allow the model to learn histology-specific feature representations by tuning the deep features extracted through transfer learning, mapping each patch feature embedding z_k ∈ R^1024 in a given WSI to a 512-dimensional vector:

h_k = ReLU(W_2 ReLU(W_1 z_k^T + b_1) + b_2)    (1)

Multi-task Attention Pooling.
In the proposed multi-task learning framework, the multi-layered attention module consists of layers Attn-Fc_1 and Attn-Fc_2 with weight parameters V_a and U_a (shared across all tasks), and one independent layer W_{a,t} for each task t. This network module is trained to assign an attention score a_{k,t} (eqn 2) to each patch, where, after Softmax activation, a high score (near 1) indicates that a region is highly informative for the slide-level classification task and a low score (near 0) indicates that the region has no diagnostic value (for simplicity, the bias parameters are not shown in the equation):

a_{k,t} = exp{W_{a,t}(tanh(V_a z_k^T) ⊙ sigm(U_a z_k^T))} / Σ_{j=1}^{N} exp{W_{a,t}(tanh(V_a z_j^T) ⊙ sigm(U_a z_j^T))}    (2)

Attention pooling then simply averages the feature representations {h_k} of all patches in the slide, weighted by their respective predicted attention scores {a_{k,t}}, and the resulting feature vector h_{slide,t} ∈ R^512 is treated as the histology deep-feature representation of the entire slide for task t. This intuitive, trainable aggregation function allows the network to learn to automatically identify the subset of informative regions in the slide in order to predict the primary, without requiring detailed annotation outlining the precise regions of tumor.

Late-stage Fusion and Classification.
We adopt a simple fusion mechanism to incorporate a patient's biological sex into the model's prediction by treating the sex s as an additional binary-encoded covariate and concatenating it with the deep features extracted from the histology slide. The concatenation results in a 513-dimensional feature vector that is fed into the final classification layer W_{cls,t} for task t to obtain the slide-level probability prediction scores:

p_t = Softmax(W_{cls,t} Concat([h_{slide,t}, s]) + b_{cls,t})    (3)

In our study, the first task of predicting the origin site of the tumor is an 18-class classification problem and the second task of predicting whether a tumor is primary or metastatic is a binary problem. Accordingly, the task-specific classification layers are parameterized by W_{cls,1} ∈ R^{18×513} and W_{cls,2} ∈ R^{2×513} respectively.

Training Details.
We randomly sampled slides using a mini-batch size of 1 WSI and used multi-task learning to supervise the neural network during training. For each slide, the total loss is a weighted sum of the loss incurred on the first task of predicting the tumor origin and the loss on the second task of predicting primary vs. metastasis:

L_total = c_1 L_cls,1 + c_2 L_cls,2    (4)

The standard cross-entropy loss was used for both tasks, and to give higher importance to the main task of tumor origin prediction, the origin loss was weighted more heavily (c_1 > c_2). After each mini-batch, the model parameters are updated via the Adam optimizer with an L2 weight decay of 1e-5 and a learning rate of 2e-4. To curb potential over-fitting, we also used a dropout layer after every hidden layer.

Model Selection.
During training, the model's performance on the validation set was monitored after each epoch. Beyond epoch 50, if the validation loss on the tumor-origin prediction task had not decreased for 20 consecutive epochs, early stopping was triggered and the best model (the one with the lowest validation loss) was used for reporting the performance on the held-out test set.
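The model described by equations (1)-(4) can be sketched in PyTorch as follows. The hidden sizes (512, 384), the loss weights, and all names are illustrative assumptions rather than the released TOAD implementation; for simplicity the gating here operates on the transformed features h rather than the raw embeddings z:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskAttentionMIL(nn.Module):
    """Sketch of the multi-task gated-attention model of eqns (1)-(3)."""
    def __init__(self, in_dim=1024, hid=512, attn_dim=384, n_origins=18):
        super().__init__()
        # Eqn (1): two stacked FC layers adapt transfer-learned features
        self.fc = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                nn.Linear(hid, hid), nn.ReLU())
        # Eqn (2): gated attention; V_a, U_a shared, one score layer per task
        self.V = nn.Linear(hid, attn_dim)
        self.U = nn.Linear(hid, attn_dim)
        self.scores = nn.ModuleList([nn.Linear(attn_dim, 1) for _ in range(2)])
        # Eqn (3): sex covariate concatenated before each task-specific head
        self.heads = nn.ModuleList([nn.Linear(hid + 1, n_origins),   # origin
                                    nn.Linear(hid + 1, 2)])          # primary/met

    def forward(self, z, sex):
        h = self.fc(z)                                       # (N, hid)
        gate = torch.tanh(self.V(h)) * torch.sigmoid(self.U(h))
        out = []
        for score, head in zip(self.scores, self.heads):
            a = F.softmax(score(gate), dim=0)                # (N, 1) attention
            h_slide = (a * h).sum(dim=0)                     # attention pooling
            out.append(head(torch.cat([h_slide, sex])))      # 513-d -> logits
        return out

model = MultiTaskAttentionMIL()
bag = torch.randn(100, 1024)                     # 100 patch embeddings of one WSI
origin_logits, met_logits = model(bag, torch.tensor([1.0]))
# Eqn (4): weighted multi-task loss favoring the origin task (weights illustrative)
c1, c2 = 0.75, 0.25
loss = (c1 * F.cross_entropy(origin_logits.unsqueeze(0), torch.tensor([3]))
        + c2 * F.cross_entropy(met_logits.unsqueeze(0), torch.tensor([1])))
```

Because the bag is pooled once per task through its own attention branch, each task can attend to different tissue regions while sharing the same feature backbone.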
Additional Experiments
Classification of Adenocarcinoma and Squamous Cell Carcinoma.
The adenocarcinoma model was developed using a subset of 8,292 adenocarcinoma WSIs that fall under 5 of the 18 tumor origin classes considered by the main network: Lung (2558), Colorectal (2448), Esophagogastric (1320), Prostate (1101) and Pancreatic (865). Similarly, the squamous cell carcinoma (SCC) network was developed using a subset of 1,707 SCC WSIs from 4 origins: Lung (854), Head Neck (424), Cervix (264), and Esophagogastric (165). For all experiments, the cases were partitioned into 70/10/20 splits for training/validation/testing. The model architecture, learning schedule and hyperparameters used were the same as for the main network.
Classification of tumor metastasized to the liver and lymph.
The lymph node site-specific model was developed using a subset of 697 WSIs of metastatic tumors from four primary origins: Lung (341), Breast (185), Skin (110) and Thyroid (61). The liver site-specific network was developed using a subset of 740 WSIs of metastatic tumors from four primary origins: Pancreatic (225), Colorectal (224), Breast (179), and Lung (112). For all experiments, the cases were partitioned into 70/10/20 splits for training/validation/testing. The model architecture, learning schedule and hyperparameters used were the same as for the main network, except that the multi-task attention branch for predicting primary vs. metastatic was disabled since all cases were metastatic.
Computational Hardware and Software
We processed all WSIs on Intel Xeon multi-core CPUs (Central Processing Units) and a total of 16 NVIDIA P100 and 2080 Ti GPUs (Graphics Processing Units) using our custom, publicly available CLAM whole-slide processing pipeline implemented in Python. Each deep learning model was trained on multiple GPUs using the PyTorch deep learning library (version 1.5). Unless otherwise specified, plots were generated in Python (version 3.7.5) using matplotlib (version 3.1.1), and numpy (version 1.18.1) was used for vectorized numerical computation. The geographic diversity maps were generated using additional Python packages including pyshp (version 2.1.0), basemap (version 1.1.0) and geopy (version 1.22.0). The confusion matrix plot was created in R (version 3.6.3) using ComplexHeatmap (version 2.5.3). The area under the receiver operating characteristic curve (AUC ROC) was estimated using the scikit-learn scientific computing library (version 0.22.1), based on the Mann-Whitney U-statistic. The 95% confidence interval of the true AUC was estimated using DeLong's method as implemented by pROC (version 1.16.2) in R.

WSI Processing
Segmentation.
Tissue segmentation of WSIs was performed automatically using the CLAM library at a downsampled magnification of each slide. A binary mask for the tissue regions was computed by applying binary thresholding to the saturation channel of the downsampled image after conversion from RGB to the HSV color space. Median blurring and morphological closing were also performed to smooth the detected tissue contours and suppress artifacts such as small gaps and holes. The approximate contours of the detected tissue, as well as tissue cavities, were then filtered based on their area to produce the final segmentation mask.
Patching.
We exhaustively cropped segmented tissue contours into 256 × 256 patches (without overlap) at 20× magnification (if the 20× downsample was not found in the image pyramid, 512 × 512 patches were instead cropped from the 40× downsample and downscaled to 256 × 256).

Feature Extraction.
Given the enormous bag sizes (number of patches in each WSI) in our dataset, we first used a convolutional neural network based on the ResNet50 architecture to encode each patch into a compact low-dimensional feature vector. Specifically, a ResNet50 model pretrained on ImageNet was truncated after the 3rd residual block and followed by an adaptive mean-spatial pooling layer to reduce the spatial feature map obtained from each 256 × 256 patch to a 1024-dimensional feature vector.
To visually interpret the importance of each region in a WSI towards the model's classification predictions, we first computed the reference distribution of attention scores by tiling the WSI into 256 × 256 patches without overlap and computing the attention score of each patch for the task of primary origin prediction. To generate more fine-grained heatmaps, we subsequently repeated the tiling with an overlap of up to 90% and converted the attention scores computed from the overlapping crops to normalized percentile scores between 0.0 (low attention) and 1.0 (high attention) based on the initial reference distribution. The normalized scores were then registered onto the original WSI according to each patch's spatial location, and scores in overlapping regions were accumulated and averaged. Finally, a colormap was applied to the attention scores and the heatmap was displayed as an overlay with a transparency value of 0.5. These attention maps are shown in Figure 3 and Extended Data Figures 6, 8 and 9, and can also be visualized in our interactive demo at http://toad.mahmoodlab.org.
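Converting raw attention scores to percentile scores against the reference distribution can be sketched in pure numpy; the helper below is illustrative:

```python
import numpy as np

def to_percentile(ref_scores, query_scores):
    """Map raw attention scores from overlapping crops to percentile scores in
    [0, 1] relative to the reference (non-overlapping) score distribution."""
    ref = np.asarray(ref_scores)
    # Fraction of reference scores <= s, i.e. the empirical CDF evaluated at s
    return np.array([(ref <= s).mean() for s in np.asarray(query_scores)])

ref = np.array([0.1, 0.2, 0.3, 0.4, 0.5])     # scores from non-overlapping tiling
q = to_percentile(ref, [0.05, 0.3, 0.5])
# 0.05 falls below every reference score; 0.5 is >= all of them
```

Normalizing against a fixed reference keeps the color scale comparable across the overlapping crops that produce the fine-grained heatmap.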
Data Availability
Digitized, high-resolution diagnostic whole slide image data from the TCGA and their corresponding diagnoses are publicly accessible through the NIH Genomic Data Commons. All reasonable requests for in-house raw and analyzed data and materials will be promptly reviewed by the authors to determine whether the request is subject to any intellectual property or confidentiality obligations. Patient-related data not included in the paper may be subject to patient confidentiality. All requests for data that can be shared will be processed through formal channels, in concordance with institutional and departmental guidelines, and will require a material transfer agreement.
Code Availability
All code was implemented in Python using PyTorch as the primary deep learning package. All code and scripts to reproduce the experiments of this paper are available at https://github.com/mahmoodlab/TOAD
All source code is provided under the GNU GPLv3 free software license.
Author Contributions
M.Y.L. and F.M. conceived the study and designed the experiments. M.Y.L. performed the experimental analysis. M.Z., D.W. and T.C. curated the in-house datasets. M.Y.L., M.Z., M.S., J.L. and F.M. analyzed the results. M.Y.L., M.S. and J.L. developed data visualization tools. M.Y.L. and F.M. prepared the manuscript. F.M. supervised the research.

Acknowledgements
The authors would like to thank Alexander Bruce for scanning internal cohorts of patient histology slides at BWH; Jingwen Wang, Matteo Barbieri, Katerina Bronstein, Lia Cirelli and Eric Askeland for querying the BWH slide database and retrieving archival slides; Celina Li for assistance with EMRs and RPDR; Martina Bragg, Terri Mellen and Sarah Zimmet for logistical support; Zahra Noor for developing the interactive demo website; and Kai-ou Tung of Boston Children's Hospital for anatomical illustrations. This work was supported in part by internal funds from BWH Pathology, the Google Cloud Research Grant, the Nvidia GPU Grant Program and NIGMS R35GM138216 (F.M.). M.S. was additionally supported by the NIH Biomedical Informatics and Data Science Research Training Program (grant number NLM T15LM007092). The content is solely the responsibility of the authors and does not reflect the official views of the National Institutes of Health, the National Institute of General Medical Sciences or the National Library of Medicine.
Competing Interests
The authors declare that they have no competing financial interests.
Ethics Oversight
The study was approved by the Mass General Brigham (MGB) IRB office under protocol 2020P000233.
References
1. Rassy, E. & Pavlidis, N. Progress in refining the clinical management of cancer of unknown primary in the molecular era. Nature Reviews Clinical Oncology (2020).
2. Varadhachary, G. R. & Raber, M. N. Cancer of unknown primary site. New England Journal of Medicine, 757–765 (2014).
3. Massard, C., Loriot, Y. & Fizazi, K. Carcinomas of an unknown primary origin: diagnosis and treatment. Nature Reviews Clinical Oncology, 701–710 (2011).
4. Jiao, W. et al. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nature Communications, 1–12 (2020).
5. Penson, A. et al. Development of genome-derived tumor type prediction to inform clinical cancer care. JAMA Oncology, 84–91 (2020).
6. Grewal, J. K. et al. Application of a neural network whole transcriptome-based pan-cancer method for diagnosis of primary and metastatic cancers. JAMA Network Open, e192597 (2019).
7. Nass, D. et al. MiR-92b and miR-9/9* are specifically expressed in brain primary tumors and can be used to differentiate primary from metastatic brain tumors. Brain Pathology, 375–383 (2009).
8. Estrella, J. S., Wu, T.-T., Rashid, A. & Abraham, S. C. Mucosal colonization by metastatic carcinoma in the gastrointestinal tract: a potential mimic of primary neoplasia. The American Journal of Surgical Pathology, 563–572 (2011).
9. Esteva, A. et al. A guide to deep learning in healthcare. Nature Medicine, 24–29 (2019).
10. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature, 436–444 (2015).
11. Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nature Medicine (2020).
12. Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. arXiv preprint arXiv:2004.09666 (2020).
13. Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine, 1301–1309 (2019).
14. Chen, P.-H. C. et al. An augmented reality microscope with real-time artificial intelligence integration for cancer diagnosis. Nature Medicine, 1453–1457 (2019).
15. Mei, X. et al. Artificial intelligence-enabled rapid diagnosis of patients with COVID-19. Nature Medicine (2020).
16. Ouyang, D. et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature, 252–256 (2020).
17. Hollon, T. C. et al. Near real-time intraoperative brain tumor diagnosis using stimulated Raman histology and deep neural networks. Nature Medicine, 52–58 (2020).
18. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 115–118 (2017).
19. Bulten, W. et al. Automated deep-learning system for Gleason grading of prostate cancer using biopsies: a diagnostic study. The Lancet Oncology (2020).
20. Bera, K., Schalper, K. A., Rimm, D. L., Velcheti, V. & Madabhushi, A. Artificial intelligence in digital pathology: new tools for diagnosis and precision oncology. Nature Reviews Clinical Oncology, 703–715 (2019).
21. Raghunath, S. et al. Prediction of mortality from 12-lead electrocardiogram voltage data using a deep neural network. Nature Medicine (2020).
22. Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nature Medicine, 1559–1567 (2018).
23. Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nature Medicine, 1054–1056 (2019).
24. Tomašev, N. et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature, 116–119 (2019).
25. AbdulJabbar, K. et al. Geospatial immune variability illuminates differential evolution of lung adenocarcinoma. Nature Medicine (2020).
26. Poplin, R. et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering, 158 (2018).
27. Mitani, A. et al. Detection of anaemia from retinal fundus images via deep learning. Nature Biomedical Engineering, 18–27 (2020).
28. Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple instance learning. In International Conference on Machine Learning, 2132–2141 (2018).
29. McHugh, M. L. Interrater reliability: the kappa statistic. Biochemia Medica, 276–282 (2012).
30. Sheahan, K. et al. Metastatic adenocarcinoma of an unknown primary site: a comparison of the relative contributions of morphology, minimal essential clinical data and CEA immunostaining status. American Journal of Clinical Pathology, 729–735 (1993).

Extended Data Figure 1. Overall study design. 1. Model training and testing. For model development, we collected in total 24,885 FFPE H&E digitized diagnostic slides (from 23,297 patient cases) of confirmed diagnosis and randomly sampled 70% of cases (17,486 slides) for training the model and 20% of cases (4,932 slides) as a held-out set for evaluation. The remaining 10% of cases (2,467 slides) were used for validation during training in order to select the best performing model.
2. External test.
In order to further assess the model's ability to generalize to data from sources not encountered during training, we also evaluated the model on an external test cohort of 662 cases, submitted for consultation from over 200 US and international medical centers.
3. Evaluation on challenging CUP cases.
Lastly, to assess the model's ability to inform meaningful predictions for origins of cancers that cannot be readily diagnosed by human experts, we curated an additional diverse dataset of 717 CUP cases sourced from institutions across the country and outside the US. While the primary could not be initially assigned for any of these cases based on H&E histology alone, using EMRs and evidence from many other forms of clinical and diagnostic reports we identified a subset of 290 cases for which a primary differential was eventually assigned over the course of the patient's history. We validated our model against the recorded primary differential for agreement, showcasing its applicability to cases without a clear morphological indication of a particular primary.

Extended Data Figure 2. Classification performance of adenocarcinoma network and squamous cell carcinoma network.
Often pathologists can readily distinguish between adenocarcinoma (AD) and squamous cell carcinoma (SCC) based on the morphological and architectural appearance of the tumor cells present in the tissue. However, within the respective families of AD and SCC subtypes, determining the origin of the tumor can remain a challenging task. We therefore hypothesized that we could develop TOAD models to specifically predict the origin of tumors for the top primary sites of adenocarcinoma (a.) and SCC (b.). Cases from five and four primary sites were chosen for the development of the AD classifier and SCC classifier respectively, based on their frequency in the database. The confusion matrix is plotted for each TOAD model (middle). Additionally, micro-averaged AUC and overall accuracy are noted for the models trained with and without incorporating sex as an additional covariate (left, right). The AD network achieved a micro-averaged AUC ROC of 0.977 (95% CI: 0.974 - 0.981) and overall accuracy of 85.8% and did not benefit from adding sex, where the model achieved a similar AUC ROC of 0.978 (95% CI: 0.975 - 0.981) and accuracy of 86.2%. The SCC network scored a higher sensitivity for cervical cancer (0.85 with sex vs. 0.69 without sex), which led to a modest increase in AUC from 0.945 (95% CI: 0.932 - 0.959) to 0.956 (95% CI: 0.944 - 0.967) and in accuracy from 78.4% to 82.9%.

Extended Data Figure 3. Classification performance of site-specific networks for tumors metastasized to the liver and lymph node. We also explored the possibility of using TOAD to predict the primary origins of metastatic tumors grouped by a common metastatic site, including the liver (a.) and the lymph node (b.). Metastatic cases from the top four primary origins for each site were chosen based on their frequency in our database. left. Micro-averaged ROC curve. middle. Confusion matrix. right. Overall accuracy and micro-averaged AUC. For tumors metastasized to the liver, the micro-averaged AUC ROC was 0.890 (95% CI: 0.862 - 0.918) when incorporating sex vs. 0.874 (95% CI: 0.843 - 0.905) without sex as an additional covariate. We found that while incorporating sex improved the sensitivity for breast cancer (0.87 with sex vs. 0.56 without sex), it came at the expense of lowered sensitivity for all other primary sites. On the other hand, incorporating sex led to a substantial increase in the sensitivities for lung and breast cancers metastasized to the lymph node, and the overall accuracy of the lymph node network increased from 73.6% to 76.9%.

Extended Data Figure 4. Geographical diversity of our external test set and CUP cases. left.
The external test set of 662 cases was submitted from in total 202 medical centers across 34 US states and 19 medical centers from 8 other countries: Switzerland, Brazil, Greece, United Arab Emirates, China, Saudi Arabia, Kuwait and Canada. right. Similarly, our CUP cases consist of 717 slides from 146 medical centers across 22 US states and 5 centers from 2 other countries: China and Kuwait.

Extended Data Figure 5. Model performance during training and validation. a, b. Classification error and cross-entropy loss for predicting the tumor origin for different model configurations (averaged over each epoch) on the training and validation sets respectively. c, d. Classification error and cross-entropy loss for predicting primary vs. metastatic tumor for multi-task (and single-task) model configurations.

Extended Data Figure 6. Exemplars of metastases from lung primaries with attention heatmaps. From left to right: low magnification with corresponding attention map, medium magnification with corresponding attention map, and high magnification patches. Medium and high magnification views demonstrate sheets of cells, variably-sized glands, and cells in infiltrative single files. The cells have large, hyperchromatic nuclei and low nuclear:cytoplasmic ratio, consistent with metastatic lung carcinomas.

Extended Data Figure 7. Model performance on the binary problem of distinguishing between primary and metastatic tumors in different tissue sites.
The barplot shows model accuracy (y-axis) on the test set (n=4,932) for different tissue sites (x-axis) and the number of cases found at each site. These sites should not be confused with the 18 common primary sites used for the origin determination task. The bar plot was produced by stratifying all test cases by the site from which the tissue was sampled, and the accuracy reported is for predicting whether the slide is a primary or metastatic tumor.

Extended Data Figure 8. TOAD-assisted CUP work-up: case study 1. The figure above shows a representative case which underwent a standard CUP work-up involving extensive IHC staining and clinical correlation. Strong PAX8 staining suggested Müllerian origin and multiple IHCs were used to rule out other primaries. Retrospectively, we analyzed the case with TOAD and found the top-3 determinations to be Ovarian, Breast and Lung; following this determination, just three IHC stains (PAX8, GATA3, and TTF1) could be used to confirm Müllerian origin and rule out breast carcinoma and lung adenocarcinoma respectively. This workflow demonstrates how TOAD can be used as an assistive diagnostic tool.

Extended Data Figure 9. TOAD-assisted CUP work-up: case study 2.
This representative case demonstrates that TOAD can be used as an assistive tool or as an additional reader even when the top-1 prediction is not in agreement with the differential assigned. This particular case of brain metastasis underwent a typical CUP work-up with several stains and clinical correlation. Retrospectively, when we analyzed the case using TOAD, we found that the top-3 predictions included Lung, GI and Pancreatic in decreasing order of confidence, where the confidence for lung and GI was almost the same. TTF1 could be used to rule out Lung, p63 could be used to rule out SCC because of prior SCC history, and optionally CDX2 and SATB2 could be used to confirm GI origin. Additionally, since adenocarcinoma morphology was identified, we tested this case using our adenocarcinoma TOAD model (see Extended Data Figure 2). This model suggested similar prediction scores for Lung and Esophagogastric, but listed Colorectal as the other likely candidate, whereas Pancreatic did not appear within the top-3 predicted origins and was assigned only a low probability score. This suggests it may be possible to also consider predictions from more specific networks (e.g. metastatic-site-specific or morphological-subtype-specific) when trying to rule out plausible candidates.

Extended Data Table 1. WSI dataset summary

Primary Organ     Training  Validation  Test  External Test  Total
Lung              2800      405         792   83             4080
Breast            2286      328         651   70             3335
Colorectal        1718      244         486   212            2660
Ovarian           778       109         220   20             1127
Pancreatic        628       88          182   11             909
Adrenal           198       21          55    0              274
Skin              693       97          189   52             1031
Prostate          774       120         212   20             1126
Renal             983       136         284   11             1414
Bladder           758       102         210   22             1092
Esophagogastric   1046      148         298   20             1512
Thyroid           708       100         202   2              1012
Head Neck         626       86          178   10             900
Glioma            1649      235         446   111            2441
Germ Cell         256       28          66    3              353
Endometrial       1009      140         297   11             1457
Cervix            262       34          75    2              373
Liver             314       46          89    2              451
Total             17486     2467        4932  662            25547

Extended Data Table 2. Tumor types grouped for classification

Primary Organ: Disease Models Included
Lung: Lung Adenocarcinoma, Lung Squamous Cell Carcinoma, Non-Small Cell Lung Cancer, Small Cell Lung Cancer, Poorly Differentiated Non-Small Cell Lung Cancer, Lung Carcinoid, Large Cell Neuroendocrine Carcinoma, Atypical Lung Carcinoid, Lung Adenosquamous Carcinoma, Lung Neuroendocrine Tumor, Sarcomatoid Carcinoma of the Lung, Large Cell Lung Carcinoma
Breast: Breast Invasive Ductal Carcinoma, Invasive Breast Carcinoma, Breast Invasive Lobular Carcinoma, Breast Mixed Ductal and Lobular Carcinoma, Breast Invasive Mixed Mucinous Carcinoma, Breast Ductal Carcinoma In Situ
Colorectal: Colon Adenocarcinoma, Rectal Adenocarcinoma, Colorectal Adenocarcinoma, Mucinous Adenocarcinoma of the Colon and Rectum
Glioma: Glioblastoma, Glioblastoma Multiforme, Astrocytoma, Diffuse Glioma, Anaplastic Astrocytoma, Oligodendroglioma, Pilocytic Astrocytoma, Anaplastic Oligodendroglioma, Ganglioglioma, Anaplastic Oligoastrocytoma, Oligoastrocytoma
Esophagogastric: Esophageal Adenocarcinoma, Stomach Adenocarcinoma, Esophagogastric Adenocarcinoma, Adenocarcinoma of the Gastroesophageal Junction, Esophageal Squamous Cell Carcinoma, Diffuse Type Stomach Adenocarcinoma, Poorly Differentiated Carcinoma of the Stomach
Endometrial: Uterine Endometrioid Carcinoma, Uterine Serous Carcinoma/Uterine Papillary Serous Carcinoma, Endometrial Carcinoma, Uterine Carcinosarcoma/Uterine Malignant Mixed Mullerian Tumor, Uterine Mixed Endometrial Carcinoma, Uterine Clear Cell Carcinoma, Uterine Undifferentiated Carcinoma
Renal: Renal Clear Cell Carcinoma, Renal Cell Carcinoma, Papillary Renal Cell Carcinoma, Chromophobe Renal Cell Carcinoma, Renal Oncocytoma, Collecting Duct Renal Cell Carcinoma, Renal Non-Clear Cell Carcinoma, Unclassified Renal Cell Carcinoma
Ovarian: High-Grade Serous Ovarian Cancer, Endometrioid Ovarian Cancer, Clear Cell Ovarian Cancer, Low-Grade Serous Ovarian Cancer, Serous Ovarian Cancer, Ovarian Epithelial Tumor, Ovarian Carcinosarcoma/Malignant Mixed Mesodermal Tumor, Mucinous Ovarian Cancer, Serous Borderline Ovarian Tumor, Ovarian Cancer, Other, Mixed Ovarian Carcinoma, Mucinous Borderline Ovarian Tumor, Small Cell Carcinoma of the Ovary
Prostate: Prostate Adenocarcinoma, Prostate Small Cell Carcinoma
Bladder: Bladder Urothelial Carcinoma, Upper Tract Urothelial Carcinoma, Bladder Adenocarcinoma, Bladder Squamous Cell Carcinoma
Thyroid: Papillary Thyroid Cancer, Medullary Thyroid Cancer, Anaplastic Thyroid Cancer, Hurthle Cell Thyroid Cancer, Poorly Differentiated Thyroid Cancer
Skin: Melanoma, Cutaneous Melanoma
Pancreatic: Pancreatic Adenocarcinoma, Adenosquamous Carcinoma of the Pancreas, Intraductal Papillary Mucinous Neoplasm, Acinar Cell Carcinoma of the Pancreas
Head and Neck: Oral Cavity Squamous Cell Carcinoma, Oropharynx Squamous Cell Carcinoma, Head and Neck Squamous Cell Carcinoma, Larynx Squamous Cell Carcinoma, Sinonasal Squamous Cell Carcinoma
Liver: Hepatocellular Carcinoma
Cervix: Cervical Squamous Cell Carcinoma, Endocervical Adenocarcinoma, Cervical Adenocarcinoma, Cervical Adenosquamous Carcinoma
Germ Cell: Seminoma, Mixed Germ Cell Tumor, Yolk Sac Tumor, Embryonal Carcinoma, Teratoma, Mature Teratoma, Non-Seminomatous Germ Cell Tumor
Adrenal: Adrenocortical Carcinoma, Adrenocortical Adenoma

Extended Data Table 3. Test performance on 18-class classification of primary origin
Primary Origin    Precision  Recall  F1-score  AUC-ROC (95% CI)        Count
Lung              0.778      0.808   0.793     0.970 (0.965 - 0.975)   792
Breast            0.879      0.873   0.876     0.988 (0.985 - 0.992)   651
Colorectal        0.928      0.877   0.902     0.991 (0.987 - 0.995)   486
Glioma            0.975      0.951   0.963     0.999 (0.998 - 1.000)   446
Esophagogastric   0.819      0.715   0.763     0.964 (0.952 - 0.975)   298
Endometrial       0.895      0.778   0.832     0.986 (0.979 - 0.992)   297
Renal             0.892      0.898   0.895     0.989 (0.981 - 0.996)   284
Ovarian           0.651      0.805   0.720     0.980 (0.974 - 0.986)   220
Prostate          0.763      0.925   0.836     0.992 (0.987 - 0.997)   212
Bladder           0.843      0.743   0.790     0.983 (0.975 - 0.991)   210
Thyroid           0.902      0.911   0.906     0.995 (0.991 - 0.998)   202
Skin              0.791      0.783   0.787     0.984 (0.977 - 0.990)   189
Pancreatic        0.605      0.808   0.692     0.971 (0.958 - 0.983)   182
Head Neck         0.897      0.781   0.835     0.988 (0.982 - 0.995)   178
Liver             0.806      0.843   0.824     0.996 (0.993 - 1.000)   89
Cervix            0.759      0.587   0.662     0.978 (0.963 - 0.993)   75
Germ Cell         0.866      0.879   0.872     0.997 (0.994 - 0.999)   66
Adrenal           0.909      0.727   0.808     0.996 (0.992 - 1.000)   55
Micro-avg         0.836      0.836   0.836     0.988 (0.987 - 0.990)   4932
Macro-avg         0.831      0.816   0.820     0.986                   4932
Weighted-avg      0.843      0.836   0.837     0.984                   4932

Extended Data Table 4. Ablation study
A. Performance on Origin Prediction
Test set (n=4932)              Micro-Avg AUC (95% CI)   Top-1 Acc  Top-3 Acc  Top-5 Acc
Hist. feat. + sex, multi-task  0.988 (0.987 - 0.989)    0.836      0.944      0.978
Hist. feat. only, multi-task   0.986 (0.984 - 0.987)    0.828      0.939      0.968
Hist. feat. + sex              0.988 (0.987 - 0.989)    0.825      0.945      0.976
Hist. feat. only               0.987 (0.985 - 0.988)    0.824      0.939      0.971