Learning Interclass Relations for Image Classification
Raouf Muhamedrahimov Amir Bar Ayelet Akselrod-Ballin
Zebra Medical Vision Ltd. {raouf,amir,ayelet}@zebra-med.com
Abstract
In standard classification, we typically treat class categories as independent of one another. In many problems, however, we would be neglecting the natural relations that exist between categories, which are often dictated by an underlying biological or physical process. In this work, we propose novel formulations of the classification problem, based on a realization that the assumption of class-independence is a limiting factor that leads to the requirement of more training data. First, we propose manual ways to reduce our data needs by reintroducing knowledge about problem-specific interclass relations into the training process. Second, we propose a general approach to jointly learn categorical label representations that can implicitly encode natural interclass relations, alleviating the need for strong prior assumptions, which are not always available. We demonstrate this in the domain of medical images, where access to large amounts of labelled data is not trivial. Specifically, our experiments show the advantages of this approach in the classification of Intravenous Contrast enhancement phases in CT images, which encapsulate multiple interesting interclass relations.
In multi-class classification, class categories are typically treated as equally (or rather, infinitely) different from one another. In many tasks, however, this is not the case. In medical imaging, for example, discrete class categories often represent stages in continuous physiological processes, such as the progression of a pathology [1, 2, 3, 4]. Hence, in the standard representation of class categories, information on their underlying phenomena is lost. Naturally, we would want to formulate our problem in a way that captures the ordinal nature of categories in the learning process, to more closely reflect the reality of our data. Ordinal Regression [5, 6, 7] deals specifically with these classification problems, where categories follow some natural order [8, 9, 7, 10]. In recent work, this type of classification has been performed by mapping discrete ground truth labels into a soft distribution during training, in order to incorporate ordinal relations [8]. However, this approach requires prior knowledge or assumptions about the label domain and the precise ordinal nature of the categories. In this study, we extend this framework and propose different approaches to encoding underlying ordinal relations into ground truth label representations, demonstrating that these can either be defined based on prior knowledge or learnt from data. We argue that these approaches better represent the formulation of some problems, with the potential to improve model performance.

The application domain we focus on is intravenous (IV) contrast enhancement phase classification in Abdominal CT. IV contrast administration and the resulting enhancement patterns are often critical in the diagnostic process in CT. Multiphase CT scans are acquired at distinct physiologic vascular time points after initial IV administration, such as the non-enhanced, arterial, venous and delayed phases [1, 11, 12].
Information about the particular contrast phase of a CT scan relies upon manual entry by a technician and is often partially or inconsistently captured in the metadata associated with the scan (contained in the image format, DICOM). As such, an algorithmic solution to contrast phase classification is essential in permitting fully automated ML analysis of dynamic radiographic findings, capable of discerning, for example, between benign liver fibronodular hyperplasia and malignant hepatocellular carcinoma [13, 1].

Preprint. Under review.

Figure 1: Examples of interclass relations for Intravenous Contrast flow phases. (a) An example of a CT abdomen image taken in four different phases. (b)-(d) demonstrate different class relations, where in (b) class categories are unordered, in (c) contrast phases are ordered sequentially, and in (d) phases are ordered in a circular manner.

This task is particularly interesting from an Ordinal Regression standpoint as its interclass similarities can be articulated in different ways. The categories (or phases) follow a temporal order – the time following contrast injection. The visual features of the image, however, are cyclic – following the delayed phase, the contrast agent is fully excreted and the scan returns to a non-enhanced appearance. Furthermore, the diagnostic characteristics overlap between categories such that a single clinical finding could potentially be diagnosed in multiple phases [11, 12].

In summary, we propose three formulations of the ordinal classification task, centered around introducing representations of interclass relations into the training process, and demonstrate their effectiveness in the application of intravenous contrast phase classification in CT images. Our main contributions are as follows.
(1) We demonstrate that encoding cyclic ordinal assumptions into ground-truth label representations during training improves classification performance over a naive one-hot approach in a medical imaging setting. We show that these improvements are most significant for small training sets, which are typical in the domain. (2) We propose two reformulations of the classification task in which label representations are learned from data. Under constraints, we demonstrate that natural ordinal relations can be implicitly learnt and encoded during training, leading to the same improvements in performance while requiring few prior assumptions. (3) To the best of our knowledge, this is the first time a circular ordinal regression approach has been employed in the medical imaging domain of IV contrast. Finally, while our experiments suggest these approaches may be particularly well suited to medical imaging tasks, the methods themselves are general to all domains and may lead to improvements outside the field of medical imaging.

Ordinal regression.
The goal of both classification and ordinal regression is to predict the category of an input instance x ∈ R^{H×W×D} from a set of K possible class labels in a set Y. The difference is that in ordinal regression, there is a natural ordering (or ranking) associated with the classes [8, 9, 7, 10, 14, 15]. In some studies, an approach based on regression is taken; inputs are mapped to a continuous value in the label (or rank) space and K − 1 thresholds are defined for categorization [7, 9, 10]. Other approaches formulate the task as a classification problem, treating thresholds in the rank-space as fixed and training classifiers over the K ordinal categories [8, 14, 15]. This study takes the latter approach and builds on recent work by [8], who propose the Soft Ordinal vectors (SORD) framework, where known ordinal information is encoded into the ground truth vectors through a soft target encoding scheme. Unlike previous work, we apply this in the medical domain, and our goal is to encode cyclic ordinal assumptions based on visual semantics. We extend their framework and propose an approach we term PL-SORD, where we allow the network to implicitly choose an optimal label encoding through reformulation of the training loss, defining a set of "candidates" for the categorical ordering. In some cases, this alleviates the need for strong prior assumptions, with the same benefits to network supervision.

Intravenous contrast. Recent works on IV contrast have modeled this as an image classification task, relying on human-based annotations for training [16, 17, 18]. [17, 18] propose systems to quality-assess whether a scan was accurately taken in the Portal Venous Phase (PVP). The proposed approach in [17] is semi-automatic and requires an expert in the loop. In [18], a fully automatic system is proposed based on a CNN; however, the input is preprocessed such that it contains only a limited view of the image which includes the Aorta and Portal Vein.
This constraint might degrade overall algorithm performance, as described in [17]. [16] deals with contrast phase classification using a view of a full abdominal CT slice; however, it is mainly used to examine neural network visualization approaches in the context of clinical decision making. Moreover, this approach is based on manually extracted 2D slices and human annotations, with a limited dataset of 3253 scans from a single institution. In this study, we also approach this as a classification task, with the primary goal of demonstrating the effectiveness of introducing ordinal assumptions in label encodings. In terms of modelling, our network is based on a 3D representation of abdominal organs, using training labels extracted from scan metadata. In doing so, we are able to build up a training set of 264,198 full CT scans from more than 10 institutions, supporting the generalizability of our experimental results.
Classification tasks performed over a set of discrete categories Y generally necessitate a label encoding f : Y → R^{|Y|}, which maps any target class t ∈ Y (the ground truth) to a vector of probability values y_{·|t} = f(t). The labels in this setting represent a probability distribution that the network attempts to match by optimizing some loss metric L over a given training set. Naturally, any relation that might exist between a class i ∈ Y and the target class t can be represented in the encoded value y_{i|t} = f_i(t).

Classification tasks over K classes are most commonly performed by representing each category as a one-hot vector (see Figure 2):

y^{oh}_{i|t} = f_i(t) = 1 if i = t, 0 otherwise    (1)

In training, the network will treat all classes where i ≠ t as equally (or infinitely) wrong. In practice, this might not be the case – some classes could be considered more correct than others. Biases may also exist in the data labels themselves. As the network tries to match the label distribution exactly, there is motivation to define f in a way that assigns a higher (non-zero) value to those categories closely related to a target class.

In Section 4, we start by formulating the task as an ordinal regression problem by assigning K ordinal positions (or ranks) corresponding to each class, Λ = {r_1, r_2, ..., r_K}, in some continuous domain r_i ∈ R^d. Using the SORD encoding approach, we incorporate known class relations into label representations based on a pairwise interclass similarity metric φ(r_t, r_i), which represents the ordinal "distance" between the categories. Within this framework, we can represent the cyclic nature of the categories by simply ranking them in polar coordinates.
Once the categorical ranks Λ and distance metric φ for the task are specified based on prior knowledge, the predetermined mapping f can be directly applied to the labels.

In Section 5, instead of predefining a label encoding based on prior knowledge, we propose a more general approach whereby the label encoding f is learnt from the data as part of the training process, exploiting the ability of deep neural networks to generalize based on the visual semantics of the ordinal categories [19, 20, 21, 22]. In Section 5.1, we propose an approach we term PL-SORD, where we allow the network to implicitly choose an optimal label encoding f by defining a set of "candidates" Λ ∈ S for the categorical ordering. Extending this, we loosen our constraints even further to see how this impacts the optimal label representation learnt from the data and the extent to which it reflects natural interclass similarities. Specifically, in Section 5.2, we propose a formulation of the problem that attempts to directly learn a parametrized encoding function f from the data, optimized jointly with the network parameters.
Figure 2: Examples of SORD-encoded label representation values under different assumptions on the relations between classes: (left) no relations (one-hot encoding, s = ∞), (middle) linear relations, and (right) circular relations. In each setting, categories considered adjacent to the true class are assigned a higher (non-zero) value than distant ones.

We start by describing the formulation of a classification task as an ordinal regression problem, and define a fixed label representation f using the SORD label encoding scheme based on prior assumptions on the ordinal relationship between categories. We define K ordinal positions (or ranks) corresponding to each class, Λ = {r_1, r_2, ..., r_K}, in some continuous domain r_i ∈ R^d. If the target class is t, the label representation for class i can be computed as the softmax over the pairwise interclass similarities:

y_{i|t} = f_i(t) = e^{−φ(r_t, r_i)} / Σ_{k=1}^{K} e^{−φ(r_t, r_k)}    (2)

where φ(r_t, r_i) is a metric function representing the ordinal "distance" between the categories. Once the categorical ranks Λ and distance metric φ for the task are specified, the mapping f can be directly applied to the labels.

The gradual return to some original state can be accounted for by defining the ordinal categories as angles in polar coordinates (see Figure 2). The distance metric can then be defined as the shortest angular distance between two classes in this space:

φ_θa(r_t, r_i) = |r_i − r_t| mod 2π    (3)
φ_θb(r_t, r_i) = 2π − φ_θa(r_t, r_i)    (4)
φ_θ(r_t, r_i) = s × min(φ_θa, φ_θb)    (5)

where s can be understood as a scaling factor for the target probability distribution, treated as a network hyperparameter – as s → ∞, the label distribution becomes one-hot; as s → 0, all y_{·|t} become equal.

Figure 3: Summary of the PL-SORD approach. A set of proposed categorical orderings define separate label encodings, each used to calculate an order-specific loss for a single network prediction. During training, both the network parameters and the weighted contribution of each loss term to the total are optimized over the data.

In this section, we propose a more generalized approach: jointly learning the label encoding f to encapsulate categorical relations within the training process itself. Instead of pre-defining the categorical ordering Λ as in Section 4.2, we relax this constraint and allow the network to choose from a set S of possible categorical orderings (Figure 3). We refer to this approach as PL-SORD. While the set of possible relations S can be defined quite generally, here we propose one possible approach. We first specify the set of unique ordinal positions that can be assigned to any of our K categories, so any rank r_i must take a value from {R_1, ..., R_M}. The set S of possible orderings is then taken as all size-K permutations of these "available" positions {R_i}_{i=1}^{M}. Using a fixed distance metric φ and Eq. 2, each ordinal permutation Λ ∈ S defines a separate label encoding f(t; Λ) = y^Λ_{·|t} as in Section 4.1, which can then be used to calculate the order-specific loss for the prediction ŷ, without changes to the network:

L_Λ(ŷ, y) = L(ŷ, y^Λ)    (6)

The total network loss is computed as a weighted sum of the N order-specific losses for the same network, as illustrated in Figure 3.
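The circular SORD encoding and the weighted sum over order-specific losses can be sketched together as follows. This is a minimal numpy sketch under our reading of the equations; the function names and the 1e-12 log stabilizer are ours, and in practice the loss weights λ are optimized jointly with the network by gradient descent rather than evaluated in numpy:

```python
import numpy as np

def circular_distance(r_t, r_i, s):
    """Scaled shortest angular distance between two ranks (Eqs. 3-5)."""
    a = abs(r_i - r_t) % (2 * np.pi)
    return s * min(a, 2 * np.pi - a)

def sord_labels(t, ranks, s):
    """Soft label vector: softmax over negative distances to the target rank (Eq. 2)."""
    d = np.array([circular_distance(ranks[t], r, s) for r in ranks])
    e = np.exp(-d)
    return e / e.sum()

def pl_sord_loss(y_pred, t, orderings, lam, s):
    """Softmax(lambda)-weighted sum of per-ordering cross-entropies (Eq. 6, Figure 3)."""
    w = np.exp(lam - lam.max())
    w = w / w.sum()                                      # sigma(lambda)
    losses = np.array([
        -np.sum(sord_labels(t, ranks, s) * np.log(y_pred + 1e-12))
        for ranks in orderings
    ])
    return float(np.dot(w, losses))

# Four phases ranked as angles on a circle: NC, arterial, venous, delayed
ranks = [0.0, np.pi / 2, np.pi, 3 * np.pi / 2]
y = sord_labels(1, ranks, s=1.0)
# For the arterial target, the adjacent NC and venous phases receive equal
# non-zero mass, while the opposite (delayed) phase receives the least.
```

Larger values of s concentrate the label mass on the target class, recovering one-hot labels in the limit, consistent with the role of s described above.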
We thus optimize for both the loss weights λ and the model parameters W during training:

min_{W,λ} L_S(ŷ, y) = Σ_{j=1}^{N} σ_j(λ) L_{Λ_j}    (7)

Here, σ_j(λ) = e^{λ_j} / Σ_{k=1}^{N} e^{λ_k} is the j-th output of the softmax applied over the learned weights λ ∈ R^N, and represents the contribution to the total loss from permutation Λ_j. Intuitively, the ordinal permutation which contributes the lowest training loss would be assigned the largest weight during training.

We alleviate the need to set constraints on distance metrics, rank-space, and symmetrical relationships between classes, and instead propose a formulation of the problem that attempts to directly learn the label encoding from the data. Specifically, when the target class is t, we hope to directly learn the distribution y^α_{i|t} = f(t, α). It is clear that imposing no constraints would result in a degenerate solution, so we preserve the ground-truth signal by fixing the label value corresponding to the target class t as s ∈ (0, 1), which is treated as a hyperparameter, with the remaining (1 − s) distributed over the remaining classes. Concretely, the encoding for each target class t is directly parametrized by α_t ∈ R^{K−1} and given by:

y^α_{i|t} = f_i(t; α) = s if i = t, (1 − s) e^{α_{t,i}} / Σ_k e^{α_{t,k}} otherwise    (8)

During training, we optimize for both the network parameters W and the encoding parameters α:

min_{W,α} L_S(ŷ, y) = L(ŷ, f(t; α))    (9)

For this study, we use a proprietary abdominal CT dataset consisting of 334,079 scans (181,881 patients) from over 10 institutions, with 80%/10%/10% training, validation, and test partitions, respectively. All 264,198 scans (144,144 patients) in the training partition were used for this study. For model validation, 1,000 scans (963 patients) were sampled, with 250 scans from each class, equally sampled from 5 institutions. Four different labels were assigned to the respective contrast phases (non-contrast, arterial, venous, delayed).
The labels were assigned using regular expressions applied to free-text information contained in the SeriesDescription DICOM header. For testing, 192 CT scans were sampled from the held-out test partition and manually labeled by an expert radiologist, with similar representation of each class and 3 institutions represented within each class. Modelling approaches were compared on several randomly-sampled training sets with 2k, 4k, 8k, 16k, 32k, 80k, and 264k samples, respectively.

All Patient Health Information (PHI) was removed from the data prior to acquisition, in compliance with HIPAA standards. The axial slices of all scans have an identical size of 512x512, but the number of slices in each scan varies between 42 and 1026, with slice spacing ranging from 0.45 mm to 5.0 mm.
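The regular-expression labeling of SeriesDescription text described above might look like the following. The patterns here are our illustrative assumption; the paper does not publish its actual expressions:

```python
import re

# Illustrative phase keywords only -- not the authors' actual patterns.
PHASE_PATTERNS = [
    ("non-contrast", re.compile(r"non[\s-]?contrast|without\s+contrast|plain", re.I)),
    ("arterial", re.compile(r"arterial|\bart\b", re.I)),
    ("venous", re.compile(r"venous|portal|\bpvp\b", re.I)),
    ("delayed", re.compile(r"delay|excretory", re.I)),
]

def label_from_series_description(description):
    """Map free-text DICOM SeriesDescription to a contrast-phase label."""
    for phase, pattern in PHASE_PATTERNS:
        if pattern.search(description):
            return phase
    return None  # no match: the scan stays unlabeled and would be excluded

label_from_series_description("ABD PORTAL VENOUS 2.5mm")  # 'venous'
```

Checking "non-contrast" first matters: a description such as "NON-CONTRAST ABDOMEN" should not fall through to a later pattern, and descriptions matching no pattern are simply left unlabeled.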
The same model and training configuration was used for all experiments in this study. A region of interest containing the liver, kidneys, aorta, and inferior vena cava (IVC) is automatically localized for each CT scan using an algorithmic approach similar to [23]. 20 input slices were uniformly sampled from the extracted region and each resized to a fixed in-plane resolution. The input CT image pixel range was clipped to a CT window centered at 40 with a width of 350 Hounsfield units. We use a 3D ResNet50 architecture [24, 25] and apply a cross-entropy loss to the output of the model. The networks are trained on 2 NVIDIA GPUs using an Adam optimizer [26]. We use a batch size of 32 with equal representation of samples from each class. We continue each training run for 150k iterations, measuring performance on the validation set every 150 iterations and saving the model with the highest categorical accuracy. Models were implemented using Keras (Tensorflow backend).

Figure 4: Comparison of contrast phase classification test performance using the one-hot encoding baseline (s = ∞), circular SORD encoding (s = 1, s = 0.625), and PL-SORD (s = 0.625) with learnt ordinal ranks. The highest test accuracy for a training set equal or smaller in size is reported. Ordinal formulations result in higher accuracy across all training set sizes, most notably for small datasets.
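The intensity preprocessing described above, a window centered at 40 HU with a width of 350 HU, corresponds to clipping to the range [-135, 215] and can be sketched as (the function name is ours):

```python
import numpy as np

def window_clip(hu_volume, center=40.0, width=350.0):
    """Clip CT intensities to a window centered at 40 HU, width 350 HU -> [-135, 215]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return np.clip(hu_volume, lo, hi)

# Air (-1000), water (0), soft tissue (40), and bone (300) after clipping:
clipped = window_clip(np.array([-1000.0, 0.0, 40.0, 300.0]))
# -> [-135., 0., 40., 215.]
```

In practice the clipped range is often rescaled to a fixed interval before being fed to the network; the paper does not state its normalization, so that step is left out here.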
Introducing prior knowledge about class relations, we compare the performance of the SORD encoding approach described in Section 4.2 to a one-hot baseline on the task of contrast phase classification. We start by assigning equally-spaced ranks r_i ∈ [0, 2π) to each of the categories in order of their physiological appearance: r_NC = 0, r_A = π/2, r_V = π, and r_D = 3π/2, and assume a circular relationship exists between phases, in line with their visual and diagnostic features. Two scaling factors for the hyperparameter s defined in Eq. 5 are compared: s = 0.625, for which the loss is more distributed across adjacent classes, and s = 1, for which it is more centered around the ground truth class.

In Figure 4, SORD encoding resulted in performance improvements across all training set sizes for both scaling factors s = 0.625 and s = 1. We see the most marked improvements for small datasets. Specifically, when holding everything else fixed, the models utilizing the circular SORD achieve improved performance over the standard one-hot formulation even when presented with a fraction of the training data.

We compare the approach described in Section 5.1 to a one-hot baseline and SORD encoding for contrast phase classification. A circular relationship is still assumed between phases and we allow them to take on 4 angular positions r_i ∈ {0, π/2, π, 3π/2}, without setting a constraint on the relative order. We then define our set of possible ordinal relations S as all size-K permutations of these equally-spaced positions. With an equal number of categories and possible positions, we are left with |S| = (K − 1)!/2 = 3 candidates for the natural ordering, after taking into account invariance to reversal and rotation in the circular setting.
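The count of three candidate orderings can be checked by enumerating the assignments of K equally-spaced angular positions to K classes and collapsing rotations and reflections (a small illustrative script, not from the paper):

```python
from itertools import permutations

def circular_orderings(k):
    """Distinct assignments of k equally-spaced angles (indexed 0..k-1) to k
    classes, up to rotation (+r mod k) and reflection (negation mod k).
    For distinct classes this yields (k-1)!/2 orderings when k >= 3."""
    seen, reps = set(), []
    for perm in permutations(range(k)):
        variants = {
            tuple((sign * p + r) % k for p in perm)
            for sign in (1, -1)
            for r in range(k)
        }
        canon = min(variants)          # canonical representative of the orbit
        if canon not in seen:
            seen.add(canon)
            reps.append(perm)
    return reps

print(len(circular_orderings(4)))  # 3 candidate orderings for four phases
```

Each surviving representative corresponds to one element of the candidate set S.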
Letting Λ = (r_NC, r_A, r_V, r_D) ∈ S represent the assigned ranks of the non-enhanced, arterial, venous, and delayed phases, respectively, our set is given by:

S = {Λ_1, Λ_2, Λ_3} = {(0, π/2, π, 3π/2), (0, π/2, 3π/2, π), (0, π, π/2, 3π/2)}    (10)

The approach was compared to the circular SORD and one-hot encoding for the same datasets and configuration. The results in Figure 4 indicate that the performance is comparable to the circular SORD, although it requires fewer assumptions. In a manual analysis of the learned weights (σ(λ) in Eq. 7), it was found that the permutation that minimizes the network training loss follows the natural ordering of the classes, i.e. ("Non-Contrast", "Arterial", "Venous", "Delayed"). So even if the ordering itself does not necessarily need to be enforced for a given task, representing the relationship in the labels can still be beneficial.

In the angular setting, permutations that are rotations of one another, such as (0, π/2, π, 3π/2) and (π/2, π, 3π/2, 0), are all equivalent.
Figure 5: Comparison of data label representations with a learnt encoding: (a) one-hot encoding baseline with unrelated classes (s = ∞), (b) circular relationship between classes (φ_θ, s = 0.625), and (c) learnt label representation (s = 0.855).

The approach described in Section 5.2 of directly learning label representations is compared to the one-hot baseline on the contrast enhancement phase classification task. In this setting, the label encoding is learned from the data and no assumptions are imposed on the pairwise similarities; intuitively, this encourages the network to make certain mistakes above others. For comparison, we use the training set of 32k samples and set the hyperparameter s = 0.855 (see Section 5.2), such that we impose a maximum y value similar to the circular SORD experiment. This model obtained a test accuracy of 92.28%, whereas the one-hot baseline achieved 90.63%.

We find that the learnt label distribution y^α_{i|t} (Figure 5c) results in an asymmetric encoding that could not be defined through the SORD framework described above, showing a higher degree of flexibility. However, it is worth noting that a "circular" relationship is not reflected; rather, weight is given to a single category responsible for the bulk of misclassifications for each target class. It is clear that the easiest-to-optimize encoding does not necessarily reflect the natural ordinal relations without appropriate constraints on f. In addition to overlap of features, this highlights that some of the benefit of the SORD approach could be attributed to its accounting for biases, errors, or overlap in the labels of the data.

While this does not result in improvements in performance over the SORD approaches, it offers a promising avenue for tasks where the relationship between categories cannot be reasonably assumed, due to a large number of classes or limited knowledge of the problem.
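For reference, the parametrized encoding of Eq. (8) used in this experiment can be sketched as follows. This is a minimal numpy version; in practice α is a trainable tensor optimized jointly with the network, and the function name is ours:

```python
import numpy as np

def learned_labels(t, alpha_t, s=0.855):
    """Eq. (8): the target class is fixed at s; the remaining 1 - s mass is a
    learned softmax over the other K - 1 classes, parametrized by alpha_t."""
    e = np.exp(alpha_t - alpha_t.max())   # numerically stable softmax
    off_target = (1.0 - s) * e / e.sum()
    return np.insert(off_target, t, s)    # place the fixed value s at index t

# K = 4 phases; alpha_t holds 3 logits for the non-target classes
y = learned_labels(t=2, alpha_t=np.array([0.5, -1.0, 0.2]))
# y[2] == 0.855, and y sums to 1
```

Fixing the target value at s is what prevents the degenerate solution noted in Section 5.2: without it, the encoding could collapse and discard the ground-truth signal entirely.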
In this study we proposed three formulations of a classification task, centered around the introduction of interclass relations into label representations. In the application of intravenous contrast phase classification in CT images, we demonstrate that incorporating cyclic ordinal assumptions during training significantly improves classification performance over standard one-hot approaches, particularly when datasets are small. In the PL-SORD approach, we show that we can learn a label encoding that implicitly incorporates the natural ordinal relations, leading to the same improvements in performance while requiring fewer prior assumptions. Finally, by alleviating the need to set any ordinal assumptions and directly learning a label encoding from data, we again demonstrate improvements over a one-hot encoding. While the directly-learned approach does not result in accuracy improvements over the SORD-based approaches (or exactly reflect our ordinal assumptions), it offers a promising avenue for tasks where the relationship between categories cannot be reasonably assumed. It also highlights that some of the benefit of these approaches could be attributed to their accounting for biases, errors, or overlap in the labels of the data. All of these point to promising avenues for improving performance in medical classification tasks, where small datasets, ordinal relations, and noisy labels are exceedingly common.

References

[1] Kyu Jin Choi, Jong Keon Jang, Seung Soo Lee, Yu Sub Sung, Woo Hyun Shim, Ho Sung Kim, Jessica Yun, Jin-Young Choi, Yedaun Lee, Bo-Kyeong Kang, et al. Development and validation of a deep learning system for staging liver fibrosis by using contrast agent-enhanced CT images in the liver.
Radiology, 289(3):688–697, 2018.

[2] Felix Grassmann, Judith Mengelkamp, Caroline Brandl, Sebastian Harsch, Martina E Zimmermann, Birgit Linkohr, Annette Peters, Iris M Heid, Christoph Palm, and Bernhard HF Weber. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology, 125(9):1410–1420, 2018.

[3] H Kuang, M Najm, D Chakraborty, N Maraj, SI Sohn, M Goyal, MD Hill, AM Demchuk, BK Menon, and W Qiu. Automated ASPECTS on noncontrast CT scans in patients with acute ischemic stroke using machine learning. American Journal of Neuroradiology, 40(1):33–38, 2019.

[4] CG Peterfy, A Guermazi, S Zaim, PFJ Tirman, Y Miaux, D White, M Kothari, Y Lu, K Fye, S Zhao, et al. Whole-organ magnetic resonance imaging score (WORMS) of the knee in osteoarthritis. Osteoarthritis and Cartilage, 12(3):177–190, 2004.

[5] Ben G Armstrong and Margaret Sloan. Ordinal regression models for epidemiologic data. American Journal of Epidemiology, 129(1):191–204, 1989.

[6] Frank E Harrell Jr. Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer, 2015.

[7] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Support vector learning for ordinal regression. 1999.

[8] Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4738–4747, 2019.

[9] Wei Chu and S Sathiya Keerthi. Support vector ordinal regression. Neural Computation, 19(3):792–815, 2007.

[10] Bin Gu, Victor S Sheng, Keng Yeow Tay, Walter Romano, and Shuo Li. Incremental support vector learning for ordinal regression. IEEE Transactions on Neural Networks and Learning Systems, 26(7):1403–1416, 2014.

[11] Kyongtae T Bae. Intravenous contrast medium administration and scan timing at CT: considerations and approaches. Radiology, 256(1):32–61, 2010.

[12] Kristie Guite, Louis Hinshaw, and Fred Lee. Computed tomography in abdominal imaging: how to gain maximum diagnostic information at the lowest radiation dose. In Selected Topics on Computed Tomography. InTech, 2013.

[13] Changjian Sun, Shuxu Guo, Huimao Zhang, Jing Li, Meimei Chen, Shuzhi Ma, Lanyi Jin, Xiaoming Liu, Xueyan Li, and Xiaohua Qian. Automatic segmentation of liver tumors from multiphase contrast-enhanced CT images based on FCNs. Artificial Intelligence in Medicine, 83:58–66, 2017.

[14] Wenzhi Cao, Vahid Mirjalili, and Sebastian Raschka. Rank-consistent ordinal regression for neural networks. arXiv preprint arXiv:1901.07884, 2019.

[15] Eibe Frank and Mark Hall. A simple approach to ordinal classification. In European Conference on Machine Learning, pages 145–156. Springer, 2001.

[16] Kenneth A Philbrick, Kotaro Yoshida, Dai Inoue, Zeynettin Akkus, Timothy L Kline, Alexander D Weston, Panagiotis Korfiatis, Naoki Takahashi, and Bradley J Erickson. What does deep learning see? Insights from a classifier trained to predict contrast enhancement phase from CT images. American Journal of Roentgenology, pages 1184–1193, 2018.

[17] Laurent Dercle, Lin Lu, Philip Lichtenstein, Hao Yang, Deling Wang, Jianguo Zhu, Feiyun Wu, Hubert Piessevaux, Lawrence H Schwartz, and Binsheng Zhao. Impact of variability in portal venous phase acquisition timing in tumor density measurement and treatment response assessment: metastatic colorectal cancer as a paradigm. JCO Clinical Cancer Informatics, 1(1):1–8, 2017.

[18] Jingchen Ma, Laurent Dercle, Philip Lichtenstein, Deling Wang, Aiping Chen, Jianguo Zhu, Hubert Piessevaux, Jun Zhao, Lawrence H Schwartz, Lin Lu, et al. Automated identification of optimal portal venous phase timing with convolutional neural networks. Academic Radiology, 27(2):e10–e18, 2020.

[19] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

[20] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[21] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

[22] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[23] Berkman Sahiner, Aria Pezeshk, Lubomir M Hadjiiski, Xiaosong Wang, Karen Drukker, Kenny H Cha, Ronald M Summers, and Maryellen L Giger. Deep learning in medical imaging and radiation therapy. Medical Physics, 2018.

[24] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.

[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.