CoMIR: Contrastive Multimodal Image Representation for Registration
Nicolas Pielawski, Elisabeth Wetzer, Johan Öfverstedt, Jiahao Lu, Carolina Wählby, Joakim Lindblad, Nataša Sladoje
Department of Information Technology, Uppsala University, Sweden
Abstract
We propose contrastive coding to learn shared, dense image representations, referred to as CoMIRs (Contrastive Multimodal Image Representations). CoMIRs enable the registration of multimodal images where existing registration methods often fail due to a lack of sufficiently similar image structures. CoMIRs reduce the multimodal registration problem to a monomodal one, in which general intensity-based, as well as feature-based, registration algorithms can be applied. The method involves training one neural network per modality on aligned images, using a contrastive loss based on noise-contrastive estimation (InfoNCE). Unlike other contrastive coding methods, used e.g. for classification, our approach generates image-like representations that contain the information shared between modalities. We introduce a novel, hyperparameter-free modification to InfoNCE, to enforce rotational equivariance of the learnt representations, a property essential to the registration task. We assess the extent of achieved rotational equivariance and the stability of the representations with respect to weight initialization, training set, and hyperparameter settings, on a remote sensing dataset of RGB and near-infrared images. We evaluate the learnt representations through registration of a biomedical dataset of bright-field and second-harmonic generation microscopy images; two modalities with very little apparent correlation. The proposed approach based on CoMIRs significantly outperforms registration of representations created by GAN-based image-to-image translation, as well as a state-of-the-art, application-specific method which takes additional knowledge about the data into account. Code is available at: https://github.com/MIDA-group/CoMIR

∗ Authors contributed equally. Preprint. Under review.

Figure 1: Registration of images of different modalities (here bright-field (BF) and second-harmonic generation imaging (SHG)) may be very challenging. CoMIR successfully estimates the shared representation of these images, and enables their successful registration by monomodal approaches.

Introduction

Multimodal images refer to images captured with multiple types of sensors, where every sensor outputs information not fully provided by the other sensors. Multimodal image fusion is the process of combining information from multiple imaging modalities. It allows downstream tasks to exploit complementary information as well as relationships between modalities. In order to perform image fusion, the images need to be aligned, either by joint acquisition, or by manual or automated registration. Our method, Contrastive Multimodal Image Representation for registration (CoMIR), reduces the challenging problem of multimodal registration to a simpler, monomodal one, as shown in Fig. 1. Successful registration allows us to make use of the combined information. However, to enable registration, common structures between the modalities need to be found, reflected by the shared or mutual information (MI). While image-to-image translation aims to predict cross-modality, i.e. to predict one modality given the other modality as input, our method directly learns representations, called CoMIRs, related to the MI by maximizing the Mutual Information Noise-Contrastive Estimation (InfoNCE) [19, 24]. InfoNCE can, under certain assumptions, act as a lower bound to MI, and has been successfully applied in connection with contrastive coding in other tasks such as classification and segmentation. In the case of registration, the representations must be both translation and rotation equivariant, whereas classification requires invariance.
Although contrastive losses (CL) are often used for representation learning [2, 6, 21, 23, 24, 42, 45, 46, 48, 52, 60], to the best of our knowledge the method presented here is the first to produce dense representations for very different imaging modalities, which can be utilized for monomodal registration by existing feature- or intensity-based registration methods.

The contributions of this paper are the following: We show that (1) contrastive learning on aligned pairs of images can produce representations that reduce complex (or even impossible) multimodal registration tasks to much simpler monomodal registration; (2) the proposed CoMIRs are rotation equivariant due to a modification of a commonly used CL. This modification is model-independent and requires neither architecture changes nor additional hyperparameter tuning. Thanks to our sophisticated scheme for generating training patches, as little as one image pair can be sufficient to generate CoMIRs (depending on the nature of the imaging data).
Related Work

Image registration consists of finding transformations matching unaligned images. This process is extremely useful, notably in computer vision and medical imaging, and has been well studied over the last fifty years [18]. In many cases, a modality-independent solution can be obtained by optimizing MI or generating hand-crafted local structural feature descriptors [22]. Unfortunately, optimizing MI has many caveats: it has a slow convergence rate and a narrow catchment basin with many local maxima. Moreover, the method relies on specific statistical cues that not all datasets contain. Learning-based algorithms can potentially provide a solution, as the algorithm can learn semantic relationships between modalities. In general it is difficult to find suitable image similarities between multimodal images due to their difference in appearance and/or signal expression. Deep learning has changed the field of similarity and metric learning from learning distances to learning feature embeddings that fit distance functions [17]. CL [20], and in particular the triplet loss [47, 59], have shown great success in deep metric learning for a wide range of applications, such as object retrieval [44], single-shot object classification [54], object tracking [33, 51], classification [8] and multimodal patch matching of aerial images [61]. In [25], keypoint descriptors are learnt using contrastive metric learning, minimizing the difference between the feature representations of two corresponding points and maximizing the difference between the feature representations of two distant points, to register MR and CT images. The link between CL, classification and MI is an active area of study, with [48] showing that the triplet loss can be extended to make use of N negative samples instead of one, and is an approximation of the softmax function.

Contrastive Loss and MI
Several recent representation learning approaches [2, 23, 24, 45, 50, 52, 60] are based on the infomax principle, which refers to maximizing the MI between input and output [3, 38]. One approach showing impressive results is Deep InfoMax (DIM) [24], which argues that maximizing MI with high dimensional inputs and outputs is a difficult task. Similarly, Contrastive Predictive Coding (CPC) [45] maximizes MI between global and local representation pairs by sequential aggregation, to learn latent features which can be used for classification. Using Noise-Contrastive Estimation (InfoNCE) as a lower bound to MI with a flexible critic was found to result in the most useful representations for classification in [24, 45]. It was further used in [2], where InfoNCE is maximized between features from multiple views in a self-supervised manner. The term multiview can refer to augmented versions of one unlabelled image, or to multiple modalities of one instance. Instead of maximizing MI between features from one single image as in DIM, MI is maximized across multiple feature scales simultaneously, as well as across independently augmented copies of each image. However, in [53], Tschannen et al. show that maximizing tighter bounds on MI than InfoNCE can result in worse representations, and argue that the success of these methods cannot only be attributed to the maximization of MI. The authors put the success of approximate MI maximization in connection with the usage of the triplet loss or CL, by showing that representation learning across different views by maximizing InfoNCE can be equivalent to metric learning using the multi-class K-pair loss presented in [48]. Tschannen et al. point out that the negative samples for the CL have to be drawn independently in order for InfoNCE to yield a lower bound on the MI, an assumption often disregarded. Despite this, [2, 6, 24, 46, 52] report a better performance on the respective downstream task when using many negative samples or hard examples.
Equivariance
As [24] emphasizes, the usefulness and quality of a learnt representation is not only a matter of information content, but also of representational characteristics. Features desirable to request from a representation are equivariances. In tasks such as segmentation or registration, translational and rotational equivariance are highly beneficial. Equivariance is the property of a function to commute with the action of a symmetry group when its domain and codomain are acted on by that symmetry group. A function or operator $f : \Omega \mapsto Y$ is called equivariant under a family of transformations $\mathcal{T}$ if for any transformation $T \in \mathcal{T}$ there exists $T' \in \mathcal{T}$ such that

$$f(T(X)) = T'(f(X)) \quad \forall X \in \Omega. \qquad (1)$$

Feature equivariance can be achieved by three different approaches. Firstly, by data augmentation, where randomly rotated pairs of inputs and label masks are passed to a model, which learns some degree of rotational equivariance. The second approach is model-based, as proposed in [12, 13, 58], where equivariant mappings are achieved by adjusting the convolutional, activation and pooling layers to be applied over groups, with shared weights. Thereby these so-called group equivariant convolutional networks (G-CNNs) ensure that the layers themselves become equivariant. A further in-depth study of the theory of equivariant CNNs is presented in [11] and [57]. Thirdly, [9] encourages rotational invariance through an additional constraint in the loss function, which has been adapted by [36] to achieve equivariance. They propose an additional term to the cross-entropy loss used to train the model for segmentation: $\mathcal{L}_{rot} = \frac{1}{N}\sum_{I \in X} \|O(I) - \bar{r}(I)\|$, where $O(I)$ is the feature map of the image at 0° and $\bar{r}(I)$ the mean feature map of the input rotated by multiples of 90°.
This introduces a hyperparameter, impairs training due to the different scale of gradients between the cross-entropy loss and $\mathcal{L}_{rot}$, and cannot guarantee any rotational equivariance beyond the $C_4$ symmetry group, i.e. rotations by multiples of 90°. Enforcing equivariance through the loss was also used by [32], which modifies a CL proposed by [55] to learn action-equivariance for Markov Decision Processes.

Method

We introduce a modality-independent approach to map two given images of different modalities to similar representations called CoMIRs. These learnt representations are sufficiently similar to allow the application of monomodal registration algorithms. As our method does not require any additional knowledge regarding the modalities at hand, there are no limitations with respect to application area. In the following section we introduce the CL used in our method, the sampling scheme needed to form the CL, and our modifications to achieve rotational equivariance. We also discuss the choice of critic used for the downstream task of registration.
Contrastive Loss
Here, we introduce the CL for two modalities; the general case for $M$ modalities is given in App. 7.1. Let $\mathcal{D} = \{(x_{1,i}, x_{2,i})\}_{i=1}^{n}$ be an i.i.d. dataset containing $n$ data points, where $x_j$ is an image in modality $j$, and $f_j$ the network processing modality $j$ with respective parameters $\theta_j$, for $j \in \{1, 2\}$. For an arbitrary datapoint $x = (x_1, x_2) \in \mathcal{D}$, the loss is given by

$$\mathcal{L}_{\mathrm{opt}}(\mathcal{D}) = -\,\mathbb{E}_{(x_1, x_2) \sim \mathcal{D}} \log \frac{\frac{p(x_1, x_2)}{p(x_1)\,p(x_2)}}{\frac{p(x_1, x_2)}{p(x_1)\,p(x_2)} + \sum_{x_i \neq x} \frac{p(x_{1,i}, x_{2,i})}{p(x_{1,i})\,p(x_{2,i})}} \qquad (2)$$

Eq. (2) is equivalent to a categorical loss discriminating between negative and positive samples. The ratio $\frac{p(x_1, x_2)}{p(x_1)p(x_2)}$ is approximated by the exponential of a critic function $h(y_1, y_2)$, which computes an arbitrary similarity between $y_1 = f_1(x_1)$ and $y_2 = f_2(x_2)$, for a scaling parameter $\tau > 0$:

$$\mathcal{L}(\mathcal{D}) = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{e^{h(y_{1,i},\, y_{2,i})/\tau}}{e^{h(y_{1,i},\, y_{2,i})/\tau} + \sum_{(y_{1,j},\, y_{2,j}) \in \mathcal{D}_{\mathrm{neg}}} e^{h(y_{1,j},\, y_{2,j})/\tau}} \qquad (3)$$

$\mathcal{L}(\mathcal{D})$ is the InfoNCE loss described in [45]. Its minimization approximately maximizes a lower bound on the MI, given by $I(y_1, y_2) \geq \log(n) - \mathcal{L}(\mathcal{D})$, which gets tighter as $n \to \infty$, assuming that $\mathcal{D}_{\mathrm{neg}}$, the set of chosen negative samples, is sampled i.i.d. The MI is defined as $I(x, y) = \mathbb{E}_{x \sim X,\, y \sim Y}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right]$.

The Critic
The loss function in Eq. (3) contains the ratio $\frac{p(x_1, x_2)}{p(x_1)p(x_2)}$, which is an unknown quantity [60], but can be approximated with the exponential of a statistical, often bilinear, model [2, 6, 45, 48, 52, 60]. Our experiments using a bilinear model resulted in CoMIRs less suitable for registration (see section 4.2 and App. 7.4, Fig. 11) than choosing a positive, symmetric critic $h(y_1, y_2)$ with a global maximum for $y_1 = y_2$. We experiment with both a Gaussian model with a constant variance, $h(y_1, y_2) = -\|y_1 - y_2\|^2$, which uses the mean squared error (MSE) as a similarity function, and a trigonometric model, $h(y_1, y_2) = \frac{\langle y_1, y_2 \rangle}{\|y_1\| \, \|y_2\|}$, which relates to the cosine similarity.

Rotational Equivariance
To equip our model with rotational equivariance, we propose to merge an additional constraint into our objective function, which does not require additional parameter tuning and can be incorporated into the CL. In particular, instead of only maximizing $h(y_{1,i}, y_{2,i})$ within the CL for an aligned pair $x_{1,i}, x_{2,i}$, we also maximize $h(y_{1,i}, T'(f_2(T(x_{2,i}))))$ and $h(y_{2,i}, T'(f_1(T(x_{1,i}))))$, for $f_i$ the model trained on modality $x_i$, $h$ the critic between the resulting representations, and $T_i, T'_i \in \mathcal{T}$, here $\mathcal{T} = C_4$, the finite, cyclic symmetry group of rotations by multiples of 90°. Rather than extending our original loss term $h(y_{1,i}, y_{2,i})$ by three explicit loss terms, we combine the constraints in one loss term that implicitly enforces the $C_4$ equivariance:

$$h\big(T_1'(f_1(T_1(x_{1,i}))),\; T_2'(f_2(T_2(x_{2,i})))\big) \qquad (4)$$

We randomly sample $T_1$ and $T_2$ once per training step for each element in the batch, which iteratively optimizes all combinations of $C_4$-transformations.

Sampling of negative samples
In [31], the authors argue that the ability to discriminate between signal and noise increases with more negative samples. We sample the negative samples $\mathcal{D}_{\mathrm{neg}}$ from random patches during training, within the entirety of all training image pairs, at random positions with random orientations. The patches can be extracted from the original images without introducing any padding border effects due to the rotations. Every extracted patch is subject to data-specific, random augmentation. This sampling scheme guarantees large variation within every batch, which increases the (statistical) efficiency and generalization of the model. The elements of the batch are reused as negatives, such that for a given pair of matching samples, there are $n - 1$ negative samples available, for a batch size of $n$.

We evaluate our proposed method on two multimodal image datasets.
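To summarize the method, the critic, the InfoNCE loss of Eq. (3), the $C_4$ constraint of Eq. (4), and the in-batch negatives can be assembled into a short numpy sketch of one forward pass. This is an illustrative sketch only, not the released implementation; the function names (`critic_mse`, `c4_transform`, `info_nce`) and the identity stand-ins for the two networks are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def critic_mse(y1, y2):
    # Gaussian critic: negative squared L2 distance between representations
    return -np.sum((y1 - y2) ** 2)

def c4_transform(f, x, k):
    # T'(f(T(x))) with T a rotation by k*90 degrees and T' = T^{-1}
    return np.rot90(f(np.rot90(x, k)), -k)

def info_nce(y1_batch, y2_batch, tau=0.5, critic=critic_mse):
    """InfoNCE over a batch: element i of each modality is the positive pair,
    the other batch elements act as the n-1 negative samples."""
    n = len(y1_batch)
    h = np.array([[critic(y1_batch[i], y2_batch[j]) for j in range(n)]
                  for i in range(n)]) / tau
    h -= h.max(axis=1, keepdims=True)          # numerical stability
    log_softmax = np.diag(h) - np.log(np.exp(h).sum(axis=1))
    return -log_softmax.mean()

# One forward pass with random C4 rotations sampled per batch element:
f1 = f2 = lambda x: x                      # stand-ins for the two networks
x1 = rng.random((4, 8, 8))                 # toy batch of patches, modality 1
x2 = x1 + 0.05 * rng.random((4, 8, 8))     # modality 2, roughly aligned
ks = rng.integers(0, 4, size=(4, 2))
y1 = [c4_transform(f1, x1[i], ks[i, 0]) for i in range(4)]
y2 = [c4_transform(f2, x2[i], ks[i, 1]) for i in range(4)]
loss = info_nce(y1, y2)
```

Since the positive pairs are nearly identical here, the loss is close to its minimum of zero; during real training the gradients of this loss would be backpropagated into both networks.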
Zurich Dataset
The open Zurich dataset [56] consists of 20 aerial images of the city of Zurich of about × px. The images are composed of four channels, RGB and Near-InfraRed (NIR), and are captured with the same sensor at identical resolution. An example is given in App. 7.2 Fig. 5.

Biomedical Dataset
The dataset consists of 206 aligned Bright-Field (BF) and Second-Harmonic Generation (SHG) tissue microarray core image pairs, which are × px in spatial dimensions and were manually aligned using landmarks [29]. The training set consists of 40 image pairs of size × px, which are center-cropped from the corresponding original images. The validation set for the CNN training consists of another 25 such pairs, a tuning set for registration parameters of an additional 7 pairs, and the test set for evaluation of another 134 pairs. App. 7.2 Fig. 6 shows an example image of an original tissue microarray core and the center-cropped patch used in this study. The modified dataset used in this study is provided in [1].

Figure 2: Top row: RGB input patch $x_1$ cropped from the Zurich test set and the resulting CoMIRs using the cosine similarity, with the $C_4$ constraint disabled and enabled; bottom row: the matching NIR image $x_2$ and its resulting CoMIRs. The CoMIRs $f_i(x_i)$ and $T'(f_i(T(x_i)))$ are shown for $T$ a rotation of 90° and $T' = T^{-1}$. Without the $C_4$-equivariance constraint in the loss function, the resulting CoMIRs are not identical and show a directional component along edges. The polar plot shows the positive correlation between CoMIR$_{0°} = f_i(x_i)$ and CoMIR$_{\theta} = T'(f_i(T(x_i)))$ for $T$ a rotation by $\theta \in [0, 360)$°, demonstrating that rotational equivariance beyond multiples of 90° is achieved by the $C_4$-constrained CL. Animated version: https://youtu.be/iN5GlPWFZ_Q

While different downstream tasks might require different properties of the learnt representations, for registration methods based on α-AMD [43] and the Scale-Invariant Feature Transform (SIFT, [39]), we require the representations to be rotationally equivariant.
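The equivariance required here can be checked numerically for the $C_4$ subgroup. Below, a toy rotation-equivariant map (a periodic 3×3 mean filter, our illustrative stand-in for a trained network) satisfies Eq. (1) exactly for rotations by multiples of 90°; all names are ours:

```python
import numpy as np

def mean_filter(x):
    """3x3 mean filter with periodic boundaries: a toy C4-equivariant map.
    Its stencil is symmetric under 90-degree rotations, so it commutes with
    np.rot90 on a periodic domain."""
    return sum(np.roll(x, (di, dj), axis=(0, 1))
               for di in (-1, 0, 1) for dj in (-1, 0, 1)) / 9.0

rng = np.random.default_rng(0)
x = rng.random((16, 16))
# Eq. (1) with T = T' = rotation by k*90 degrees: f(T(x)) == T(f(x))
equivariant = all(np.allclose(mean_filter(np.rot90(x, k)),
                              np.rot90(mean_filter(x), k))
                  for k in range(4))
```

A learnt CoMIR network is only approximately equivariant; the correlation test of Fig. 2 quantifies how closely the two sides of Eq. (1) agree for arbitrary angles.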
Visually, we see the impact of the rotational equivariance constraint on the learnt CoMIRs of the Zurich dataset in Fig. 2. We observe that if neither the model nor the loss enforces equivariant feature maps, the learnt representations have a directional component, particularly along edges in the image. We test the level of equivariance by measuring the positive correlation between the CoMIR of an unrotated image, subsequently rotated by the angle θ, and the CoMIR of the same input image rotated by the angle θ before being passed to the network. The result for θ ∈ [0, 360)° is shown in Fig. 2 and demonstrates that the $C_4$-constrained CL achieves rotational equivariance for all angles, beyond multiples of 90°.

Reproducibility of CoMIR
Training a model to produce CoMIRs with $c$ channels leads to multiple solutions reaching the same loss value. One such case is a permutation of the channels, which is likely to occur between two training sessions: $c!$ different permutations can be obtained. It is possible, however, to consistently obtain very similar CoMIRs between experiments. [10] observe that the convergence of convolutional filters is in direct connection with their initialization. Similarly, we find that initializing the models' weights randomly with a fixed seed results in similarly trained models and CoMIRs. We compare CoMIRs generated by 50 identical models, all initialized with the same seed and trained on the Zurich dataset. We then compute the mean pairwise correlation between all the CoMIRs to measure the similarity and consistency between runs. The sources of stochasticity between runs are the minibatch sampling and the data augmentation scheme. To test the influence of the data, we train 20 models on patches from the same single image, and additionally train 19 models on patches augmented from a single training image varying for each experiment. Table 1 shows the results of the experiments. We observe a high consistency between training runs when the training patches are sampled from the same training image with a fixed weight initialization. Fig. 10 in App. 7.4 shows the generated CoMIRs used to compute the average correlation.

Temperature Analysis τ

The temperature τ is a hyperparameter of the loss function. A high τ yields a smoother loss than low values. We tested a wide range of values of τ; all models converged and produced reasonable CoMIRs. See App. 7.4 Fig. 11 for examples of representations with various τ and loss combinations. We found τ = 0. to be a good setting for the Zurich dataset and τ = 0. for the biomedical dataset.
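The consistency numbers reported in Table 1 are mean pairwise Pearson correlations over flattened CoMIRs from independent runs; a minimal sketch (the function name is ours):

```python
import numpy as np

def mean_pairwise_corr(reps):
    """Mean Pearson correlation over all pairs of flattened representations."""
    flat = [np.ravel(r) for r in reps]
    pairs = [(i, j) for i in range(len(flat)) for j in range(i + 1, len(flat))]
    return float(np.mean([np.corrcoef(flat[i], flat[j])[0, 1]
                          for i, j in pairs]))
```

Identical runs yield a correlation of 1.0; unrelated (e.g. channel-permuted or randomly initialized) runs drop toward zero or below.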
Figure 3: Multiple methods to transform BF and SHG into one common modality. To the left, the original BF (top) and SHG (bottom) images from the test set, followed by image translations of BF and SHG into the respectively other domain (pix2pix, CycleGAN, DRIT). To the right, CoMIRs based on MSE and cosine similarity. The arrows indicate the image pairs to be registered in the experimental setup.

Table 1: Mean pairwise correlation between the CoMIRs during 4 different experiments on the Zurich dataset. The consistency (i.e. the reproducibility of a given training session) between the CoMIRs is highest when the weights are initialized with a fixed seed (most important) and the training distribution is similar (less important). 95% C.I. estimated with empirical bootstrap.

Mean pairwise Corr.   1 fixed train image              1 varying train image per model
Fixed weights         93.0% [ 90.6%; 95.5%], N=100     78.4% [ 70.4%; 83.9%], N=38
Random weights        -2.7% [-22.3%; -1.4%], N=40       1.9% [-18.0%;  5.7%], N=38
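The registration experiments described next score every attempt by the mean Euclidean displacement of the four image corners, with displacements above 100 px counted as failures; a sketch (function names are ours):

```python
import numpy as np

def corner_error(corners_ref, corners_reg):
    """Mean Euclidean displacement between corresponding corner points,
    given as (4, 2) arrays of (x, y) coordinates."""
    d = np.asarray(corners_ref, dtype=float) - np.asarray(corners_reg, dtype=float)
    return float(np.linalg.norm(d, axis=1).mean())

def is_failure(err, threshold=100.0):
    # registrations with a mean corner displacement above 100 px count as failed
    return err > threshold
```

For example, a pure translation by (3, 4) px moves every corner by exactly 5 px, so the error equals 5.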
For each image pair in the test set of 134 biomedical images, first a random rotation of up to ± degrees, followed by random translations in x and y of up to ± px, was applied. To avoid any border effects, these transformed patches were cropped from the original × images. The magnitude of the transformation was measured by the average Euclidean displacement of the corner points $C_i$ between the reference image and the transformed image, $\frac{1}{4}\sum_{i=1}^{4} \|C_i^{Ref} - C_i^{Trans}\|$. While the transformations themselves were entirely random, they were sampled in a stratified manner, such that 44 pairs were considered to have undergone a "large" transformation, 45 a "medium" and another 45 a "small" transformation. An average displacement of ≤ 100 px was considered a small transformation, a displacement in (100, ] px a medium one, and anything beyond that a large one. The same evaluation metric was used for all registration results, $err = \frac{1}{4}\sum_{i=1}^{4} \|C_i^{Ref} - C_i^{Reg}\|$, for $C_i^{Ref}, C_i^{Reg}$, $i \in \{1, 2, 3, 4\}$, the corners of the reference image and of the one resulting from the registration, respectively; an $err > 100$ px is considered a failure.

Implementation details
We make no assumptions about the properties of the model apart from differentiability. We choose two identical dense U-Nets [28], one per modality, which share no weights. We experiment with both the MSE and the cosine similarity as critics. The temperature was set to τ = 0. . The number of negative samples was 46. An important parameter to choose is the field of view of the patches that the model is trained on. This is data-specific, and a patch should be large enough to cover significant structures in the images. For both datasets, patches of size × px were chosen. For the Zurich dataset we choose 3-channel CoMIRs; for the biomedical dataset we choose 1-channel CoMIRs, as this is suitable for the downstream registration methods based on α-AMD, SIFT and MI. The SHG images were preprocessed by applying a log-transform log(1 + x), with x ∈ [0, 1]. Implementation details of our method, as well as of the following registration algorithms, can be found in App. 7.3.

Baseline
As a baseline comparison we perform multimodal registration using Mattes MI algorithm[41] on the original SHG and BF images with a (1+1) evolutionary strategy [49].
Intensity-based Registration using α-AMD

α-AMD [43] is a general registration method based on distances combining intensity and spatial information [37]. It has been shown to be both accurate and to have a large convergence region. One of its limitations, compared to e.g. MI, is that it requires the intensities to be in [0, 1] and approximately equal for corresponding structures (not merely correlated), which typically is not the case in multimodal scenarios.

Feature-based Registration using SIFT
SIFT is a feature detection algorithm introduced by [39]. It extracts features from both a reference and a floating image which are invariant to scale and rotation, and robust across a large range of affine distortions, additive noise, and changes in illumination. The extracted feature points are matched using Random Sample Consensus (RANSAC) [16].
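The RANSAC step can be illustrated with a small numpy sketch that estimates a rigid transform (rotation plus translation) from noisy 2D point correspondences via the Kabsch algorithm. This is a toy stand-in for the full SIFT/RANSAC pipeline, and all names are ours:

```python
import numpy as np

def fit_rigid(P, Q):
    """Least-squares rigid transform (R, t) with Q ≈ P @ R.T + t (Kabsch).
    P and Q are (n, 2) arrays of corresponding points."""
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # avoid reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = Q.mean(axis=0) - P.mean(axis=0) @ R.T
    return R, t

def ransac_rigid(P, Q, iters=300, thresh=3.0, seed=0):
    """RANSAC: repeatedly fit on a minimal 2-point sample, keep the largest
    inlier set, then refit on all inliers."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(P), size=2, replace=False)
        R, t = fit_rigid(P[idx], Q[idx])
        err = np.linalg.norm(P @ R.T + t - Q, axis=1)
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    return fit_rigid(P[best], Q[best]) + (best,)
```

With exact correspondences plus a minority of gross outliers, the estimator recovers the true rotation and translation to machine precision.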
Manual Registrations
To obtain a baseline for comparing the machine performance to human level, a panel of six annotators was selected to perform the registration on a small set of test patches (n=10, randomly selected, identical for all annotators). The setup for this registration task is the same as for the automatic registration methods, i.e. × px patches from the center of the tissue microarray cores. Note that this differs significantly from the setup in which the ground truth (GT) for this dataset was originally manually acquired, where the full cores were available to the annotator, as shown in App. 7.2 Fig. 6. The GT was obtained by manually choosing landmarks [29], while for the manual registrations in our evaluation, the SHG images were moved over the bright-field images to obtain an appropriate fit.
Generative models like generative adversarial networks (GANs) are often used in image-to-image translation [27, 62], in which they have the potential to enable the use of monomodal registration methods by translating one modality into the other. We implement three well-known image translation methods: pix2pix [27], CycleGAN [62], and DRIT [34, 35]. As image translation aims to predict a representation cross-modality, i.e. to predict a BF image given the SHG input and vice versa, the resulting images can be considered to be in a common space in which monomodal registration can be attempted. It was only possible to find corresponding SIFT features for a maximum of three image pairs among all combinations of image translations and modalities, and for those the registration error also exceeded our threshold of 100 px. We report the results for registration by MI for these methods in App. 7.3 Fig. 8.
Data-specific State-of-the-Art
The first intensity-based registration method capable of automatically aligning SHG and BF images was proposed as CurveAlign in [29]. CurveAlign relies on data-specific, biomedical knowledge, i.e. that the BF images are stained with hematoxylin and eosin (H&E), where eosin stains extracellular matrix components such as collagen in shades of pink, and that SHG mainly corresponds to collagen fibers. Using this prior on the data, CurveAlign performs segmentation on the BF image to isolate collagen structures, which are then registered to the SHG using a registration scheme based on MI with a (1+1) evolutionary algorithm.
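The kind of color prior CurveAlign exploits can be illustrated crudely: eosin-stained (pinkish) pixels have red well above green, with blue in between. The rule and threshold below are our own toy choices for illustration, not CurveAlign's actual segmentation:

```python
import numpy as np

def pinkish_mask(rgb):
    """Toy eosin/pink detector on an RGB image with channels in [0, 1].
    A crude illustration of a color prior, not CurveAlign's segmentation."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (r > g + 0.1) & (b > g)
```

Applied to an H&E image, such a mask keeps pink (eosin-dominated) regions and discards gray or hematoxylin-blue ones; CurveAlign builds a far more careful collagen segmentation on top of this idea.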
Choice of Critic
The visual effect of the choice of critic on an image of the Zurich test set can be seen in App. 7.4 Fig. 11. The correlation between the CoMIRs of RGB and NIR using a bilinear model, the cosine similarity and the MSE as critics was averaged across a set of temperature settings τ (N=15). The results show that a bilinear critic produces weakly correlated maps w.r.t. τ, ρ̄_bilinear ≈ 60%, whereas using the MSE resulted in ρ̄_MSE ≈ 85%, and the cosine similarity in ρ̄_cosine ≈ 91%. The 95% C.I. were computed with empirical bootstrap. Hence, we perform the registration evaluation on the biomedical dataset for the MSE and the cosine similarity. Training with the MSE gave consistently better CoMIRs w.r.t. registration, see App. 7.4 Fig. 9.

Results
Fig. 3 shows an image pair of the biomedical test set, together with the representations produced by the image translations through pix2pix, CycleGAN and DRIT, as well as the corresponding CoMIRs. The arrows indicate the pairs of modalities for which the registration task was attempted. The difference between the intensities of the GAN-generated images and the respective originals is too large for intensity-based registration by α-AMD. Fig. 4 shows the empirical cumulative distribution functions (eCDF) of the successful registrations as a function of the error for α-AMD, α-AMD with multiple starts (MS), and SIFT on the CoMIRs based on MSE, using both the CoMIR learnt from BF (A→B) and from SHG (B→A) as the reference image. The eCDF gives the cumulative number of successful registrations for increasing error. As a baseline comparison, it also shows the eCDF of the MI registration on the original BF and SHG images, the eCDF for CurveAlign, which registers BF to SHG, as well as the median eCDF for the six manual registrations on a subset of ten images. A registration error ≥ 100 px is considered a failure and thus is not shown. The feature- and intensity-based registration performance on CoMIRs is consistently better than the multimodal registration approaches on the original multimodal images. Wilcoxon signed-rank tests show that SIFT significantly (p = 5e− ) outperforms CurveAlign, but show no significant difference between α-AMD MS and SIFT (p = . ).

Figure 4: The cumulative number of successful registrations for increasing error, shown as an eCDF of the different registration methods (α-AMD, α-AMD (MS) and SIFT, each A→B and B→A on CoMIRs; MI and CurveAlign on the original images) over the biomedical test set. The results are compared to the median of the error of six independent manual registrations on a subset of ten images. The relative error is the proportion of the patch width and height. A Wilcoxon signed-rank test is performed to statistically highlight the differences between CurveAlign, SIFT and α-AMD.

The median error of the manual registration highlights the difficulty of the task at hand. We further observe that the success of registration by MI is highly dependent on the size of the transformation between the image pair. We perform registration by MI of the original images; of the corresponding CoMIRs; and of the BF and GAN-generated BF, as well as the SHG and GAN-generated SHG, for pix2pix, CycleGAN and DRIT, and observe a linear dependence between the mean registration error and the mean displacement of the corner points for all attempts. This shows that while MI is suitable to handle different modalities, its success depends on the proximity to the global extremum. This is shown in App. 7.4 Fig. 8. SIFT registration was attempted for the GAN-produced representations, however it failed to achieve a mean pixel error < 100 px on the entire test set. The choice of critic influences the resulting CoMIRs. The MSE encourages the intensities of the representations to be more similar than the cosine similarity does, which is favorable for the registration task. We observe that for both registration by SIFT and by α-AMD, the results are considerably better for the MSE-based CoMIRs than for the cosine similarity, see App. 7.4 Fig. 9. App. 7.4 Table 2 gives the number of successfully registered image pairs which resulted in less than relative error w.r.t. the image width and height ( < px), as well as relative error ( < px).
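The eCDF curves of Fig. 4 amount to sorting the per-pair errors and plotting the cumulative fraction of successes, censoring failures at the 100 px threshold; a sketch (the function name is ours):

```python
import numpy as np

def ecdf_success(errors, failure_threshold=100.0):
    """Sorted errors and cumulative success fraction, dropping failed
    registrations (error at or above the threshold)."""
    e = np.sort(np.asarray(errors, dtype=float))
    frac = np.arange(1, len(e) + 1) / len(e)
    ok = e < failure_threshold
    return e[ok], frac[ok]
```

A method whose curve rises earlier and higher registers more pairs at a given error budget.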
Both the training (GPU, 1345 s) and inference (CPU, 5 s/image) of the model are fast, and the process of registration adds between 2 s/image with SIFT and 25 s/image with α-AMD MS. More details are found in App. 7.4.1. An animated illustration depicting the process of registration of three example images with α-AMD is available at: https://youtu.be/zpcgnqcQgqM

Conclusion

The contrastive loss based on InfoNCE was successfully adjusted to the task of registration and modified to result in rotationally equivariant features. The proposed CoMIRs successfully extract shared content in multimodal images and enable multimodal image registration by reducing the problem to a monomodal one. Using monomodal, intensity- and feature-based registration methods on CoMIRs significantly outperforms multimodal registration by MI, as well as a state-of-the-art, data-specific approach. For registration tasks, the generated CoMIRs contain more valuable information than GAN-generated images. We show that the training of CoMIRs is stable w.r.t. hyperparameters and reproducible in connection with weight initialization and training data. We show that the size of the training dataset can be as little as one image pair, and give insight into the choice of critic. Future research could explore adding dense aleatoric uncertainty to CoMIRs, learning CoMIRs of modalities apart from images (e.g. volumetric images, audio, videos, and time-series), extending the equivariance properties to other groups, and investigating the CoMIRs' applicability to segmentation and pixel regression tasks.
Broader Impact
Using CoMIRs for multimodal image registration has a direct application in multimodal image fusion. A wide range of areas benefit from fusing the content of images from different sensors. One such area is materials science, where early-stage anomaly detection is used to characterize newly developed materials w.r.t. their physical properties. As a concrete example, carbides along the grain boundary of a material can indicate impairments in material strength, but detecting them requires different Scanning Electron Microscopy (SEM) sensors which acquire images asynchronously at different spatial resolutions [7]. This results in a multimodal registration problem which could be addressed by the proposed CoMIRs. The implications that materials-science research based on successful registration and fusion of images of this kind can have on society are widespread. The Materials Genome Initiative (MGI), launched by the US Federal Government in 2011, for example, aims to address clean energy, national security, and human welfare [40]. While this area of research can have a beneficial impact on society, by developing biocompatible materials for medical advances or materials needed in a variety of settings to reduce the carbon footprint, scientific findings in this area are also directly linked to military developments (see e.g. the US Air Force's involvement in MGI). Another area of application which makes use of multimodal imaging data is the field of remote sensing. Again, while aerial observation can be used to monitor geological changes, such as melting glaciers due to climate change, or for early detection of wildfires, it is also tightly connected to military action (espionage, navigation systems such as Unmanned Aerial Vehicles (UAVs), target localization, lethal autonomous weapons). The area which can profit the most from multimodal image registration and subsequent fusion in a positive way is biomedicine.
It is a broad and active field of research to combine information from modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET), or BF, SHG, and two-photon-excited fluorescence (TPEF) microscopy. These imaging techniques often provide complementary and clinically relevant information needed for a diagnostic task. For example, CT gives good spatial resolution and dense tissue contrast, while MRI yields better soft tissue contrast. BF and SHG, for example, are studied together in connection with collagen organization in tumor growth. When providing a joint representation between multiple modalities, issues regarding patient data anonymization need to be taken into account: e.g., if only the images in one modality were subject to anonymization (e.g. photographs), while the other was not (e.g. CT scans), CoMIRs could potentially facilitate data de-anonymization from one modality to the other. Registration by MI is computationally expensive and, as we show in our paper, MI is highly susceptible to the initial starting position being close to the global extremum. To perform well, a large number of restarts is needed to avoid being trapped in a local extremum, increasing the computational load even more. Generating CoMIRs for an entire dataset combined with registration by SIFT is much cheaper computationally, especially given the small amount of training data needed for CoMIR generation. Hence, in settings where multimodal registration is already in place, CoMIRs can reduce the required energy cost and, in consequence, the environmental impact; however, they may encourage the analysis of multimodal data on a much greater scale than ever before. It must be pointed out that the use of CoMIRs is not limited to the task of multimodal registration, but could be useful for classification, segmentation, or patch retrieval.
This is subject to future research, but we foresee a potential in segmentation, for example by training on the label masks as a second modality.

References

[1] Anonymous. Multimodal biomedical dataset for evaluating registration methods (patches from TMA cores). https://doi.org/10.5281/zenodo.3874362, June 2020.
[2] P. Bachman, R. D. Hjelm, and W. Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems 32, pages 15535–15545. Curran Associates, Inc., 2019.
[3] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.
[4] S. L. Best, Y. Liu, A. Keikhosravi, C. R. Drifka, K. M. Woo, M. Altwegg, G. S. Mehta, T. N. Thimm, M. Houlihan, J. S. Bredfeldt, E. J. Abel, W. Huang, and K. W. Eliceiri. Collagen organization of renal cell carcinoma differs between low and high grade tumors. BMC Cancer, 19:490, 2019.
[5] J. S. Bredfeldt, Y. Liu, M. W. Conklin, P. J. Keely, T. R. Mackie, and K. W. Eliceiri. Automated quantification of aligned collagen for human breast carcinoma prognosis. Journal of Pathology Informatics, 5:28, 2014.
[6] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[7] Y.-H. Chen. Multimodal Image Fusion and Its Applications. PhD thesis, University of Michigan, 2016.
[8] G. Cheng, C. Yang, X. Yao, L. Guo, and J. Han. When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs. IEEE Transactions on Geoscience and Remote Sensing, 56(5):2811–2821, May 2018.
[9] G. Cheng, P. Zhou, and J. Han. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(12):7405–7415, 2016.
[10] J. P. Cohen, H. Z. Lo, and W. Ding. RandomOut: Using a convolutional gradient norm to win the filter lottery. International Conference on Learning Representations (ICLR) Workshops, 2016.
[11] T. Cohen, M. Geiger, and M. Weiler. A general theory of equivariant CNNs on homogeneous spaces. In Advances in Neural Information Processing Systems 32, pages 9145–9156. Curran Associates, Inc., 2019.
[12] T. Cohen and M. Welling. Group equivariant convolutional networks. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2990–2999. PMLR, 2016.
[13] T. S. Cohen and M. Welling. Steerable CNNs. arXiv preprint arXiv:1612.08498, 2016.
[14] M. W. Conklin, J. C. Eickhoff, K. M. Riching, C. A. Pehlke, K. W. Eliceiri, P. P. Provenzano, A. Friedl, and P. J. Keely. Aligned collagen is a prognostic signature for survival in human breast carcinoma. American Journal of Pathology, 178(3):1221–1232, 2011.
[15] C. R. Drifka, A. G. Loeffler, K. Mathewson, A. Keikhosravi, J. C. Eickhoff, Y. Liu, S. M. Weber, W. J. Kao, and K. W. Eliceiri. Highly aligned stromal collagen is a negative prognostic factor following pancreatic ductal adenocarcinoma resection. Oncotarget, 7(46):76197–76213, 2016.
[16] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981.
[17] W. Ge. Deep metric learning with hierarchical triplet loss. In The European Conference on Computer Vision (ECCV), September 2018.
[18] A. A. Goshtasby. Image Registration: Principles, Tools and Methods. Advances in Computer Vision and Pattern Recognition. Springer-Verlag London, 1st edition, 2012.
[19] M. Gutmann and A. Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 297–304. PMLR, 2010.
[20] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1735–1742, June 2006.
[21] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning, 2019.
[22] M. P. Heinrich, M. Jenkinson, M. Bhushan, T. Matin, F. V. Gleeson, S. M. Brady, and J. A. Schnabel. MIND: Modality independent neighbourhood descriptor for multi-modal deformable registration. Medical Image Analysis, 16(7):1423–1435, 2012. Special Issue on the 2011 Conference on Medical Image Computing and Computer Assisted Intervention.
[23] O. J. Hénaff, A. Srinivas, J. D. Fauw, A. Razavi, C. Doersch, S. M. A. Eslami, and A. van den Oord. Data-efficient image recognition with contrastive predictive coding. CoRR, abs/1905.09272, 2019.
[24] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.
[25] J. Hu, S. Sun, X. Yang, S. Zhou, X. Wang, Y. Fu, J. Zhou, Y. Yin, K. Cao, Q. Song, and X. Wu. Towards accurate and robust multi-modal medical image registration using contrastive metric learning. IEEE Access, 7:132816–132827, 2019.
[26] M. J. Huttunen, R. Hristu, A. Dumitru, I. Floroiu, M. Costache, and S. G. Stanciu. Multiphoton microscopy of the dermoepidermal junction and automated identification of dysplastic tissues with deep learning. Biomed. Opt. Express, 11(1):186–199, Jan 2020.
[27] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 2017.
[28] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 11–19, 2017.
[29] A. Keikhosravi, B. Li, Y. Liu, and K. W. Eliceiri. Intensity-based registration of bright-field and second-harmonic generation images of histopathology tissue sections. Biomed. Opt. Express, 11(1):160–173, Jan 2020.
[30] D. S. Kerby. The simple difference formula: An approach to teaching nonparametric correlation. Comprehensive Psychology, 3:11–IT, 2014.
[31] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan. Supervised contrastive learning, 2020.
[32] T. Kipf, E. van der Pol, and M. Welling. Contrastive learning of structured world models. In International Conference on Learning Representations, 2020.
[33] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-convolutional siamese networks for object tracking. European Conference on Computer Vision (ECCV) 2016 Workshops, 9914, 2016.
[34] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. K. Singh, and M.-H. Yang. Diverse image-to-image translation via disentangled representations. In European Conference on Computer Vision, 2018.
[35] H.-Y. Lee, H.-Y. Tseng, Q. Mao, J.-B. Huang, Y.-D. Lu, M. K. Singh, and M.-H. Yang. DRIT++: Diverse image-to-image translation via disentangled representations. arXiv preprint arXiv:1905.01270, 2019.
[36] K. Lin, B. Huang, L. M. Collins, K. Bradbury, and J. M. Malof. A simple rotational equivariance loss for generic convolutional segmentation networks: preliminary results. In IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium, pages 3876–3879, 2019.
[37] J. Lindblad and N. Sladoje. Linear time distances between fuzzy sets with applications to pattern matching and classification. IEEE Transactions on Image Processing, 23(1):126–136, 2013.
[38] R. Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
[39] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.
[40] Materials Genome Initiative. Accessed: 2020-06-01.
[41] D. Mattes, D. R. Haynor, H. Vesselle, T. K. Lewellyn, and W. Eubank. Nonrigid multimodality image registration. In Medical Imaging 2001: Image Processing, volume 4322, pages 1609–1620. International Society for Optics and Photonics, SPIE, 2001.
[42] I. Misra and L. van der Maaten. Self-supervised learning of pretext-invariant representations, 2019.
[43] J. Öfverstedt, J. Lindblad, and N. Sladoje. Fast and robust symmetric image registration based on distances combining intensity and spatial information. IEEE Transactions on Image Processing, 28(7):3584–3597, 2019.
[44] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[45] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[46] B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker. On variational bounds of mutual information. In Proceedings of the 36th International Conference on Machine Learning (ICML), volume 97 of Proceedings of Machine Learning Research, pages 5171–5180. PMLR, 2019.
[47] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[48] K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems 29, pages 1857–1865. Curran Associates, Inc., 2016.
[49] M. Styner, C. Brechbuhler, G. Székely, and G. Gerig. Parametric estimate of intensity inhomogeneities applied to MRI. IEEE Transactions on Medical Imaging, 19(3):153–165, 2000.
[50] C. Sun, F. Baradel, K. Murphy, and C. Schmid. Contrastive bidirectional transformer for temporal representation learning. CoRR, abs/1906.05743, 2019.
[51] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[52] Y. Tian, D. Krishnan, and P. Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
[53] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic. On mutual information maximization for representation learning. In International Conference on Learning Representations, 2020.
[54] E. Ustinova and V. Lempitsky. Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems 29, pages 4170–4178. Curran Associates, Inc., 2016.
[55] E. van der Pol, T. Kipf, F. A. Oliehoek, and M. Welling. Plannable approximations to MDP homomorphisms: Equivariance under actions, 2020.
[56] M. Volpi and V. Ferrari. Semantic segmentation of urban scenes by learning local class interactions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2015.
[57] M. Weiler and G. Cesa. General E(2)-equivariant steerable CNNs. In Advances in Neural Information Processing Systems 32, pages 14334–14345. Curran Associates, Inc., 2019.
[58] M. Weiler, F. A. Hamprecht, and M. Storath. Learning steerable filters for rotation equivariant CNNs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[59] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems 18, pages 1473–1480. MIT Press, 2006.
[60] H. Wu, A. Gattami, and M. Flierl. Conditional mutual information-based contrastive loss for financial time series forecasting. arXiv preprint arXiv:2002.07638, 2020.
[61] H. Zhang, W. Ni, W. Yan, D. Xiang, J. Wu, X. Yang, and H. Bian. Registration of multimodal remote sensing image based on deep fully convolutional neural network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(8):3028–3042, Aug 2019.
[62] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.

Appendix
In this section, we present a general algorithm for (supervised) training of models for generating CoMIRs given a dataset of aligned images. The contrastive loss (CL) used in this work is not fundamentally limited to two modalities, but can be extended to $M$ modalities.

Let $\mathcal{D} = \{(x_i^1, \dots, x_i^M)\}_{i=1}^{n}$ be an i.i.d. dataset containing $n$ data points, where $x^j$ is an image in modality $j$, and $f_j$ the network processing modality $j$ with respective parameters $\theta_j$, for $j \in \{1, \dots, M\}$. For an arbitrary datapoint $x_i = (x_i^1, x_i^2, \dots, x_i^M) \in \mathcal{D}$, $i \in \{1, 2, \dots, n\}$, the loss is given by

$$
\mathcal{L}_{\mathrm{opt}}(\mathcal{D}) = -\,\mathbb{E}_{x^{1},\dots,x^{M}\sim\mathcal{D}}\left[\log\frac{\dfrac{p(x^{1},\dots,x^{M})}{\prod_{i=1}^{M}p(x^{i})}}{\dfrac{p(x^{1},\dots,x^{M})}{\prod_{i=1}^{M}p(x^{i})}+\sum_{x_{j}\neq x}\dfrac{p(x_{j}^{1},\dots,x_{j}^{M})}{\prod_{i=1}^{M}p(x_{j}^{i})}}\right] \qquad (5)
$$

The pseudo-code for training models using the CL to produce CoMIRs, based on aligned images, is given in Algorithm 1.

Algorithm 1: CoMIR learning algorithm

input: batch size N, set of datapoints {x_i}_{i=1}^N in M modalities, a group G (e.g. C_4), a critic function h(·,·), a temperature τ
initialize all models {f_i}_{i=1}^M
for each sampled mini-batch x = {x_k}_{k=1}^N do
    for all m ∈ {1, ..., M} do
        for all k ∈ {1, ..., N} do
            T ∼ G
            y_k^m ← T′(f_m(T(x_k^m)))
        end for
    end for
    z ← concatenation of y^1, ..., y^M  (MN rows)
    S ∈ R^{MN×MN}
    for all i ∈ {1, ..., MN} do
        for all j ∈ {1, ..., i} do
            S_{i,j} ← S_{j,i} ← log h(z_i, z_j) − log τ
        end for
    end for
    L ← 0
    for all i ∈ {1, ..., MN} do
        for all m ∈ {1, ..., M} do
            L ← L − S_{i,(mN+i) mod MN} + log( Σ_{j=1}^{MN} e^{S_{i,j}} − Σ_{j=1}^{M} e^{S_{i,(jN+i) mod MN}} + e^{S_{i,(mN+i) mod MN}} )
        end for
    end for
    L ← L / (MN(M−1))
    update all models {f_i}_{i=1}^M to minimize L
end for
return all models {f_i}_{i=1}^M

Zurich Dataset
The dataset can be downloaded at https://sites.google.com/site/michelevolpiresearch/data/zurich-dataset . Each image was acquired with QuickBird and consists of four channels in the order [NIR, B, G, R]. An example image of the dataset is given in Fig. 5. In all our experiments we train on only one image of the Zurich dataset, zh10.tif, except for the experiments in Table 1, for which we investigate the impact of the training distribution and vary which image from the Zurich dataset is used as the training image. The image zh12.tif was chosen as the test image to evaluate the representations.
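For concreteness, splitting a QuickBird tile in [NIR, B, G, R] channel order into the two training modalities (an RGB image and an NIR image) can be sketched as below; the array shapes and values are illustrative stand-ins, not taken from the dataset loader:

```python
import numpy as np

def split_modalities(tile):
    """Split a QuickBird tile with channels [NIR, B, G, R], shape (H, W, 4),
    into the two training modalities: an RGB image and a 1-channel NIR image."""
    nir = tile[..., 0:1]          # near-infrared channel
    rgb = tile[..., [3, 2, 1]]    # reorder B, G, R -> R, G, B
    return rgb, nir

# Dummy 4-channel tile standing in for e.g. zh10.tif.
tile = np.stack([np.full((2, 2), v) for v in (10, 20, 30, 40)], axis=-1)
rgb, nir = split_modalities(tile)
```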
Biomedical Dataset
The dataset used in this study was kindly provided by the authors of [29] and modified for this study. The modified dataset used in our experimental setup can be downloaded at https://zenodo.org/record/3874362 , [1]. Multimodal image registration is of particular interest in biomedical tasks such as the registration of bright-field (BF) microscopy images and second-harmonic generation (SHG) images. The two imaging techniques provide complementary information and are of particular interest for studying collagen organization in cancerous tissue [4], [15], [14] and [5]. In general, the two modalities are captured in different microscopes, which means that the SHG sample must be relocated and registered within a BF scan. This is a challenging task, as the modalities produce very different signals and have little appearance in common. The difficulties of this registration task are discussed in detail in [29], which provides the first intensity-based registration method, called CurveAlign, capable of automatically aligning SHG and BF images, which are usually aligned manually. They evaluate their method on tissue microarray cores, as can be seen in Fig. 6 (left). In this paper we use center-cropped patches within these cores, marked by the green square. This registration task is harder, as the overview of larger structures and the tissue boundary are missing. Using these patches allows us to avoid any padding effects after applying random transformations, as described in Sec. 4.2. The GT alignment used for the evaluation is based on manual landmark annotation provided by [29] and was done on the full tissue microarray cores. The manual registration performed as part of the evaluation in this paper is, however, performed on the cropped patches, by overlaying the SHG image onto the BF image. The two kinds of manual registration (GT and manual registration in the evaluation) are hence not directly comparable.
The manual registration in the evaluation was done according to the setup used in the automated methods, to give a baseline and a sense of the magnitude of pixel error that can be expected in this task. By enabling registration on patches within the tissue microarrays, a further downstream task of patch retrieval can be achieved. This is of particular interest for finding SHG patches within BF whole-slide images, as is relevant in ongoing research such as [26].
We chose the same dense U-Net architecture [28] for the experiments on the Zurich and the biomedical dataset, using 32 convolutional filters for the first convolution, 4 dense blocks of depth 6 as down- and up-blocks, and 4 bottleneck layers. Upsampling was used to avoid grid artefacts. Max pooling, a dropout rate of 0.2, no early transition, no activation function in the last layer, and a compression rate of . for the convolutional layers were used. The commonly used non-linear activation in the final layer is omitted. The 1-channel CoMIRs illustrated in Fig. 3 are visualized by applying a logistic function with a temperature of . . Illustrations on the Zurich dataset are made by normalizing two corresponding CoMIRs jointly by the maximum of the 1-percentile and the minimum of the 99-percentile of the two representations.

Experiments on Zurich Dataset
For generating CoMIRs in experiments on the Zurich dataset, the Adam optimizer was used with a learning rate of e− and a weight decay of e− . Temperature τ was set to . , the batch size to 24, and the steps per epoch to . The gradient norm was limited to . The L1 and L2 activation decays were set to . To generate a batch, patches of size × were cropped from the training image. Data augmentation consisted of flips (p = 0. ) and random rotations of up to ± °, using a linear, nearest-neighbor, or cubic interpolation chosen at random. With probability p = 0. , either additive Gaussian noise (µ = 0, σ ∈ (0. , . ) chosen at random), Gaussian blur (σ = 0. ), or coarse dropout (dropout rate of per channel, superpixel size of ) is applied, or the edge image is taken. Lastly, each channel is multiplied randomly (p = 0.3) by a value in (0. , . ). The models were trained for 5 epochs.

Figure 7: Variation in manual annotation errors across test patches (blue dots represent the errors of individual annotators) compared to the best performing registration methods, α-AMD and SIFT, applied on CoMIRs. The parameters ˆσ are estimated along with their 95% confidence intervals (Rayleigh distribution).

Experiments on Biomedical Dataset
For generating 1-channel CoMIRs for the multimodal registration experiments, stochastic gradient descent was used with a learning rate of e− , a weight decay of e− , and a momentum of . . Temperature τ was set to . , the batch size to 32, and the steps per epoch to . The gradient norm was limited to . The L1 activation decay was set to e− , the L2 activation decay to e− . To generate a batch, patches of size × were cropped from the training image, and the data augmentation consisted of flips (p = 0. ) and random integer rotations of up to ± °, using a linear, nearest-neighbor, or cubic interpolation chosen at random. The models were trained for 23 epochs in the efficient mode of the model implementation.

Registration using α-AMD

For α-AMD, 3 resolution levels were used, with sub-sampling factors , , and Gaussian smoothing σ of , , respectively. An SGD optimizer was used with a momentum of . and step-sizes (2, , ) → . , where the transition in the last resolution level is a linear interpolation between and . , and a hard gradient clipping at a magnitude of was applied parameter-wise. The logistic function (without temperature) is applied to the CoMIRs to map them into the range [0, 1], which is needed for the method, and the images are then quantized into (non-zero) levels. The numbers of iterations per resolution level are , , respectively. A sampling fraction of . was used. For the multistart version of the method, three starts were chosen at rotations (in radians) in {− . , 0, . }, and the registration with the lowest final distance value was chosen as the output of the algorithm. We used a modified version of an open-source implementation (https://github.com/MIDA-group/py_alpha_amd_release) of the method; we provide the modified version in the GitHub repository.

Registration using SIFT
The SIFT implementation in Fiji 2.0.0 was used, based on the mpicbg.imagefeatures package. The feature descriptor size was set to 4 samples per row and column, and the orientation histograms to 8 bins per local histogram. The scale octaves were set to lie in [128, ] px, with 3 steps per scale octave and an initial σ of each scale octave equal to . .

Registration by Mutual Information
Registration by Mattes MI was implemented in Matlab 2019b, using a (1+1) evolutionary algorithm with an initial search-radius size of 1e-5, a minimum search-radius size of 1.5e-8, a search-radius growth factor of (1+1e-4), and a maximum of 1500 iterations. The number of spatial samples for the MI computation was 500, and the number of histogram bins 80. All pixels were included in the overlap region. As registration by MI requires 1-channel input, all 3-channel representations (BF and GAN-generated BF images) were reduced to one channel by principal component analysis (PCA).
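The PCA-based channel reduction can be sketched as follows (a NumPy illustration, not the Matlab code used in the paper): each pixel's 3-channel vector is projected onto the first principal component of the image's channel covariance.

```python
import numpy as np

def pca_to_single_channel(img):
    """Reduce an (H, W, 3) image to (H, W) by projecting each pixel onto
    the first principal component of the channel covariance."""
    h, w, c = img.shape
    x = img.reshape(-1, c).astype(np.float64)
    x -= x.mean(axis=0)                      # center the channels
    cov = x.T @ x / (x.shape[0] - 1)         # (c x c) channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    first_pc = eigvecs[:, -1]                # direction of largest variance
    return (x @ first_pc).reshape(h, w)

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))                  # dummy 3-channel image
gray = pca_to_single_channel(img)
```

By construction, the resulting single channel retains at least as much variance as any individual input channel.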
Registration by CurveAlign
CurveAlign v5.0 Beta provides HSV color-based registration of SHG and BF images. The method was run with the default settings, except that it was set to rigid registration, where the default is affine. The code is available at https://github.com/uw-loci/curvelets .

Manual Registration
The registration was performed in Fiji 2.0.0 using TrakEM2. The mean corner error after manual registration for ten randomly chosen images of the test set is shown in Fig. 7.

Table 2: Number of successfully registered image pairs using α-AMD MS, α-AMD, SIFT, MI and CurveAlign which resulted in less than 1% pixel error (< px) and less than 5% pixel error (< px). ˆBF and ˆSHG denote the fake images produced by GAN image translation given the respective other modality. 95% C.I. are Clopper–Pearson intervals.

Originals (MI: BF→SHG / SHG→BF; CurveAlign: BF→SHG)
  Succ. (<1% Err.): MI 7 [3; 14] / 7 [3; 14]; CurveAlign [14; 32]
  Succ. (<5% Err.): MI 30 [21; 41] / 29 [20; 40]; CurveAlign [24; 44]
CoMIR MSE (A→B / B→A)
  Succ. (<1% Err.): α-AMD MS 72 [60; 84] / 70 [58; 82]; α-AMD 49 [38; 61] / 43 [33; 55]; SIFT [63; 86] / 67 [55; 79]; MI 7 [3; 14] / 9 [4; 17]
  Succ. (<5% Err.): α-AMD MS 90 [78; 101] / 87 [75; 98]; α-AMD 63 [51; 75] / 54 [43; 66]; SIFT [87; 108] / 93 [82; 103]; MI 30 [21; 41] / 30 [21; 41]
CoMIR Cos. (A→B / B→A)
  Succ. (<1% Err.): α-AMD MS [43; 66] / 53 [42; 65]; α-AMD 19 [12; 28] / 18 [11; 27]; SIFT 47 [36; 59] / 52 [41; 64]; MI 12 [6; 20] / 11 [6; 19]
  Succ. (<5% Err.): α-AMD MS [50; 74] / 62 [50; 74]; α-AMD 34 [24; 45] / 33 [24; 44]; SIFT 67 [55; 79] / 69 [57; 81]; MI 29 [20; 40] / 27 [18; 37]
CycleGAN (ˆBF→BF / ˆSHG→SHG)
  Succ. (<1% Err.): SIFT 0 [0; 4] / 0 [0; 4]; MI [2; 13] / 5 [2; 11]
  Succ. (<5% Err.): SIFT 0 [0; 4] / 0 [0; 4]; MI [21; 41] / 28 [19; 39]
Pix2Pix (ˆBF→BF / ˆSHG→SHG)
  Succ. (<1% Err.): SIFT 0 [0; 4] / 0 [0; 4]; MI 6 [2; 13] / [3; 14]
  Succ. (<5% Err.): SIFT 0 [0; 4] / 0 [0; 4]; MI 29 [20; 40] / [21; 41]
DRIT (ˆBF→BF / ˆSHG→SHG)
  Succ. (<1% Err.): SIFT 0 [0; 4] / 0 [0; 4]; MI 5 [2; 11] / [2; 13]
  Succ. (<5% Err.): SIFT 0 [0; 4] / 0 [0; 4]; MI [22; 42] / 29 [20; 40]

The alternative methods pix2pix, CycleGAN, and DRIT train on image patches created and augmented in the same way as for the generation of CoMIRs; however, the patches are not created at runtime but before training.
The code for the competing methods is provided at https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix for pix2pix and CycleGAN, and at https://github.com/HsinYingLee/DRIT for DRIT. The pix2pix and CycleGAN models were trained for 100 epochs with an initial learning rate of e− , and for another 100 epochs with a linearly decaying learning rate thereafter. The Adam optimizer was used with β = 0. . The GAN objective was lsgan, the discriminator architecture was a × PatchGAN, and a 9-block ResNet was used as the generator architecture. 64 filters were used in the last convolutional layer of the generator, as well as in the first layer of the discriminator. Instance normalization was used, and no dropout in the generator. During the test phase, the input image sizes were padded to the smallest multiples of 256 (i.e., as the input images were of size 834, they were padded to 1024), due to the network architecture restrictions of pix2pix [27] and CycleGAN [62]. To avoid edge artefacts during inference, the padded borders were filled with the reflection of the image, mirrored at the first and last values along each axis. The padded areas of the translated images are cropped off before further processing. DRIT [34, 35] was adjusted to fit the registration task by using the disentangled attribute representation of the image to be registered together with the corresponding image, rather than the disentangled attribute representation of a random image from the other modality. The discriminator had no normalization layers, and its scale was set to 3. The model was trained for 1200 epochs, with a learning rate decay during the last 600 epochs.
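The test-time padding scheme described above can be sketched as follows (a NumPy illustration; padding only on the bottom and right edges is an assumption, and the exact layout used in the paper may differ):

```python
import numpy as np

def pad_to_multiple(img, multiple=256):
    """Reflect-pad an (H, W, C) image so H and W become the smallest multiples
    of `multiple`; returns the padded image and the pad widths so the
    translated output can be cropped back afterwards."""
    h, w = img.shape[:2]
    pads = ((0, (-h) % multiple), (0, (-w) % multiple), (0, 0))
    # mode="reflect" mirrors the image at its border without repeating
    # the edge value, avoiding hard edge artefacts.
    return np.pad(img, pads, mode="reflect"), pads

def crop_back(img, pads):
    (_, ph), (_, pw), _ = pads
    return img[: img.shape[0] - ph, : img.shape[1] - pw]

x = np.arange(834 * 834 * 3, dtype=np.float32).reshape(834, 834, 3)
padded, pads = pad_to_multiple(x)   # 834 -> 1024 along both spatial axes
restored = crop_back(padded, pads)  # recovers the original extent
```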
In Table 2, the number of successfully registered image pairs which resulted in less than 1% relative error with respect to the image width and height (< px), as well as less than 5% relative error (< px), is given for each applicable registration method. Fig. 7 shows the results of the manual registration on ten image pairs from the test set, together with the results of the automated registration based on α-AMD and SIFT. Fig. 8 shows the error of the registration results based on MI with respect to the displacement of the corner points resulting from the transformation applied to the image. To point out the importance of the extent of the transformation which needs to be recovered in the case of MI, SIFT results are also shown for comparison. Fig. 9 shows the eCDFs of the automatic registrations of the biomedical test set on the CoMIRs learnt using a critic based on MSE and on cosine similarity. Fig. 10 shows the influence that weight initialization and the particular training image have on the CoMIRs (3 channels) generated for an image patch of zh12.tif in the Zurich test set. Fig. 11 shows the influence of the temperature τ and the choice of critic on the generated CoMIRs (3 channels) of zh12.tif.

The experiments are run on an Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz (hyperthreading enabled) and a Titan V (12GB) GPU. Table 3 shows the approximate time needed for registration using MI on the CPU, versus generating CoMIRs and registering using SIFT. The inference includes the loading of the dataset, and the reported time includes generation and registration using each modality as the reference image (A→B and B→A).
Figure 8: The mean registration error of the corner points vs. the mean displacement of the corner points, shown for registration by MI as well as by SIFT. It can be seen that MI suffers from large transformations and fails to find the global extremum if the initial starting position of the optimization is too far away. This dependence of the accuracy on the starting position can be observed for all modalities. In contrast, SIFT does not depend on the extent of the transformation between the images to be registered, as can be seen to the right.

Table 3: Approximate time for the steps in the registration pipelines, given in seconds for the entire test set.
Device            Preprocessing   Registration using MI   Preprocessing
                                  (Matlab)                + Registration
CPU, 1 worker     114             9593                    9707
CPU, 18 workers   —               896                     1010

Device            Training, 23 epochs   Inference          Registration using SIFT   Inference
                  (PyTorch/Python)      (PyTorch/Python)   (Fiji/Java)               + Registration
GPU               1345                  49                 —                         324
CPU, 1 worker     —                     2631               275                       2906
CPU, 18 workers   —                     1020               —                         1295
While registration using MI scales linearly with the number of image pairs which need to be registered, the training of the proposed model can be considered a constant overhead, and the generation of CoMIRs takes only . seconds per image on a GPU. After training, inference can also be performed on the CPU; although slower, this can be beneficial in a clinical setting where no access to a GPU can be given, or for encoding images of large spatial dimensions which do not fit onto a GPU. It should be noted that no parallelization attempt was made for registration by SIFT, nor for the preprocessing (turning the images into 1-channel images by PCA), which could speed up the registration further.
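The PCA preprocessing mentioned above, which reduces a multi-channel image to a single channel, can be sketched as projecting each pixel's channel vector onto the first principal component. The following is an illustrative reimplementation of that idea, not the authors' exact code:

```python
import numpy as np

def pca_to_single_channel(img):
    """Project an H x W x C image onto its first principal component,
    yielding a single-channel (H x W) image."""
    h, w, c = img.shape
    pixels = img.reshape(-1, c).astype(np.float64)
    pixels -= pixels.mean(axis=0)  # center each channel
    # The first right-singular vector of the centered pixel matrix
    # is the first principal component of the channel space.
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    projected = pixels @ vt[0]
    return projected.reshape(h, w)

rgb = np.random.default_rng(0).random((8, 8, 3))
gray = pca_to_single_channel(rgb)
print(gray.shape)  # → (8, 8)
```

Because the pixels are centered before projection, the resulting channel has zero mean; it can be rescaled to any desired intensity range before registration.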
Figure 9: eCDF of the automatic registration methods over the biomedical test dataset for CoMIRs produced by the MSE-based critic as well as the cosine similarity-based critic ((a) α-AMD, (b) α-AMD MS, (c) SIFT, and (d) MI). Wilcoxon tests between the results based on MSE and cosine similarity are shown on the right side of each plot, along with the matched-pairs rank-biserial correlation r (effect size) [30].

(a) CoMIRs of models trained with different weight initializations and different images. (b) CoMIRs of models trained with different weight initializations and a unique common image. (c) CoMIRs of models trained with the same weight initialization and different images. (d) CoMIRs of models trained with the same weight initialization and a unique common image.

Figure 10: Different experiments were performed to show that a random weight initialization scheme with a fixed seed yields similar models. Using a small training set increases the similarity of the trained models. All models were trained on only one training image, either fixed or varying, depending on the experiment. In (a) and (c), the 12th image is omitted, as those models were trained on the same image as the test image used to create this figure.

(a) CoMIRs of models trained with the MSE-based critic. (b) CoMIRs of models trained with the cosine similarity-based critic. (c) CoMIRs of models trained with the bilinear model-based critic.

Figure 11: CoMIRs generated from models trained with different losses and temperatures τ ∈ {0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50}.
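The statistics reported alongside Fig. 9 can be reproduced with standard tools: SciPy provides the Wilcoxon signed-rank test, and the matched-pairs rank-biserial correlation r can be derived from the signed ranks of the paired differences. The sketch below uses hypothetical error values, not the paper's actual measurements:

```python
import numpy as np
from scipy.stats import wilcoxon, rankdata

def matched_pairs_rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation r for paired samples:
    r = (sum of ranks of positive differences - sum of ranks of
    negative differences) / (total rank sum). Zero differences are
    dropped, as in the standard Wilcoxon procedure."""
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0]
    ranks = rankdata(np.abs(d))
    r_plus = ranks[d > 0].sum()
    r_minus = ranks[d < 0].sum()
    return (r_plus - r_minus) / ranks.sum()

# Hypothetical paired registration errors for the two critics.
err_mse = np.array([12.0, 30.0, 7.0, 50.0, 21.0, 90.0, 14.0, 33.0])
err_cos = np.array([10.0,  8.0, 5.0, 20.0, 19.0, 40.0, 11.0, 25.0])

stat, p = wilcoxon(err_mse, err_cos)  # two-sided signed-rank test
r = matched_pairs_rank_biserial(err_mse, err_cos)
print(p, r)
```

An r of +1 means the MSE-based errors exceed the cosine-based errors for every pair; values near 0 indicate no systematic difference between the two critics.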