Evolution Is All You Need: Phylogenetic Augmentation for Contrastive Learning
Amy X. Lu
University of Toronto, Vector Institute
[email protected]
Alex X. Lu
University of Toronto [email protected]
Alan Moses
University of Toronto [email protected]
Presented at Machine Learning in Computational Biology (MLCB) 2020.

Abstract
Self-supervised representation learning of biological sequence embeddings alleviates computational resource constraints on downstream tasks while circumventing expensive experimental label acquisition. However, existing methods mostly borrow directly from large language models designed for NLP, rather than being designed with bioinformatics philosophies in mind. Recently, contrastive mutual information maximization methods have achieved state-of-the-art representations for ImageNet. In this perspective piece, we discuss how viewing evolution as natural sequence augmentation and maximizing information across phylogenetic "noisy channels" is a biologically and theoretically desirable objective for pretraining encoders. We first review the current contrastive learning literature, then provide an illustrative example showing that contrastive learning with evolutionary augmentation can serve as a representation learning objective which maximizes the mutual information between biological sequences and their conserved function, and finally outline the rationale for this approach.
Self-supervised representation learning of biological sequences aims to capture meaningful properties for downstream analyses, while pretraining only on labels derived from the data itself. Embeddings alleviate computational constraints and yield new biological insights from analyses in a rich latent space; doing so in a self-supervised manner further circumvents the expensive and time-consuming need to gather experimental labels. Though recent works have successfully demonstrated the ability to capture properties such as fluorescence, pairwise contact, phylogenetics, structure, and subcellular localization, these works mostly use methods designed for natural language processing (NLP) (Yang et al., 2018; Bepler and Berger, 2019; Riesselman et al., 2019; Rives et al., 2019; Alley et al., 2019; Heinzinger et al., 2019; Elnaggar et al., 2019; Gligorijevic et al., 2019; Armenteros et al., 2020; Madani et al., 2020; Elnaggar et al., 2020). This leaves open the question of how best to design self-supervised methods which align with biological principles.

Recently, contrastive methods for learning representations achieve state-of-the-art results on ImageNet (van den Oord et al., 2018; Hénaff et al., 2019; Tian et al., 2019; He et al., 2019; Chen et al., 2020). Two "views" $v_1$ and $v_2$ of an input are defined (e.g. two image augmentation strategies), and the contrastive objective is to distinguish one pair of "correctly paired" views from $N-1$ "incorrectly paired" dissimilar views. This incentivizes the encoder to learn meaningful properties of the input while disregarding nuisance factors. Theoretically, it can be shown that such an objective maximizes a lower bound on the mutual information, $I(v_1; v_2)$ (Poole et al., 2019).

In this piece, we first provide a review of the current contrastive learning literature for obtaining representations in non-biological modalities. Then, we propose that molecular evolution is a good choice of augmentation to provide "views" for contrastive learning in computational biology, from both the theoretical and biological perspectives. Finally, we illustrate how evolutionary augmentation can be used to optimize a deep neural network encoder to preserve the information in biological sequences that pertains to their function.

To encourage the development of novel contrastive methods for biological applications, we provide a broad overview of existing contrastive methods. Section 3 describes one such method, SimCLR, in greater detail.
The InfoMax optimization principle (Linsker, 1988) aims to find a mapping $g$ such that the Shannon mutual information between the input and output is maximized, i.e. $\max_{g \in \mathcal{G}} I(X; g(X))$. Recent works revive this principle as a representation learning objective to train deep encoders as $g$, and yield empirically desirable representations in the modalities of imaging (van den Oord et al., 2018; Hjelm et al., 2018; Bachman et al., 2019; Tian et al., 2019; Hénaff et al., 2019; Löwe et al., 2019; He et al., 2019; Chen et al., 2020; Tian et al., 2020; Wang and Isola, 2020), text (Rivière et al., 2020; van den Oord et al., 2018; Kong et al., 2019), and audio (Löwe et al., 2019; van den Oord et al., 2018).

Most follow a variation of this optimization objective: given input $x$ and transformations $t_1$ and $t_2$, define $v_1 = t_1(x)$ and $v_2 = t_2(x)$ as two different "views" of $x$. These "transformations" can be parameterless augmentations (Chen et al., 2020), or another neural network summarizing global information (Hjelm et al., 2018; van den Oord et al., 2018). Further, define encoders $g_1$ and $g_2$, and latent representations $z_1 = g_1(v_1)$ and $z_2 = g_2(v_2)$. The encoder mappings may be constrained by $\mathcal{G}_1$ and $\mathcal{G}_2$ (e.g. architecturally). In some works, $g_1$ and $g_2$ may share some (Hjelm et al., 2018) or all (Chen et al., 2020) parameters. The goal is to find encoder mappings which maximize the (estimated) mutual information between the outputs:

$$\max_{g_1 \in \mathcal{G}_1,\, g_2 \in \mathcal{G}_2} I'(g_1(v_1); g_2(v_2)) \qquad (1)$$

This objective is shown to lower-bound the true InfoMax objective (Tschannen et al., 2019). Perhaps the most widely adopted estimator is the InfoNCE estimator, which provides an unnormalized lower bound on the mutual information by optimizing the objective (van den Oord et al., 2018):

$$\mathcal{L}_{\text{NCE}} := -\,\mathbb{E}_{v_1, v_2^{+}, \{v_{2j}^{-}\}} \left[ \log \frac{\exp\big(f(g_1(v_1), g_2(v_2^{+}))\big)}{\exp\big(f(g_1(v_1), g_2(v_2^{+}))\big) + \sum_{j=1}^{N-1} \exp\big(f(g_1(v_1), g_2(v_{2j}^{-}))\big)} \right], \qquad (2)$$

where $(v_1, v_2^{+}) \sim p(v_1, v_2)$ is a "real" pair of views drawn from their empirical joint distribution, and negative samples $v_2^{-} \sim p(v_2)$ are drawn from the marginal distribution to form $N-1$ "fake" pairs. $N$ denotes the total number of pairs (and, in practice, often refers to the batch size). In Arora et al. (2019), losses in this general form are termed "contrastive learning".

We see that Equation 2 is a cross-entropy which distinguishes one positive pair from $N-1$ negative pairs, where $f$ is a "critic" classifier (reminiscent of adversarial learning) that should learn to return high values for the "real" pair. As is common in deep learning, the expectation is calculated over multiple batches. For a more detailed discussion of the connection between the InfoNCE loss, the InfoMax objective for representation learning, and other mutual information estimators, see Appendix A.

Existing works select "views" of the input in different ways. These include using different time steps of an audio or video sequence (van den Oord et al., 2018; Sermanet et al., 2018) or using different patches of the same image (van den Oord et al., 2018; Hénaff et al., 2019; Hjelm et al., 2018; Bachman et al., 2019). Recently, contrastive learning between local and sequentially-global embeddings has been used to establish representations for proteins (Lu et al., 2020).
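For concreteness, the following is a minimal sketch (not taken from any of the cited works) of the InfoNCE loss in Equation 2 for a batch of paired views, assuming a simple cosine-similarity critic with a temperature hyperparameter, as popularized by SimCLR (Chen et al., 2020); the tensor shapes, temperature value, and function name are illustrative assumptions.

```python
# Minimal InfoNCE sketch: one positive pair per row, N-1 in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, d) batches of view embeddings; row i of z1 is paired with row i of z2."""
    z1 = F.normalize(z1, dim=-1)           # cosine-similarity critic (an assumption here)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature     # (N, N) critic scores; the diagonal holds positive pairs
    labels = torch.arange(z1.size(0), device=z1.device)
    # Cross-entropy distinguishes the one positive pair from the N-1 negatives in each row (Eq. 2).
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings for a batch of N = 8 pairs.
z1, z2 = torch.randn(8, 64), torch.randn(8, 64)
loss = info_nce_loss(z1, z2)
```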
Augmentations are an oft-used strategy for constructing different views (Hu et al., 2017; He et al., 2019; Chen et al., 2020), sometimes applied in conjunction with image patching (Hénaff et al., 2019; Bachman et al., 2019). In this work, we argue that using evolution as a sequence augmentation strategy is a biologically and theoretically desirable choice for constructing views. Previous work has explored evolutionary conservation as a means of sequence augmentation during training, such as augmenting an HMM using simulated evolution (Kumar and Cowen, 2009), or generating from a PSSM (Asgari et al., 2019). Other methods include using generative adversarial networks (GANs) for -omics data augmentation (Eftekhar, 2020; Marouf et al., 2020) or injecting noise by replacing amino acids drawn from a uniform distribution (Koide et al., 2018). For genomic sequences, augmentations can be formed using reverse complements and extending (or cropping) genome flanks (Cao and Zhang, 2019; Kopp et al., 2020).
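To make the contrast with evolutionary augmentation concrete, the following sketch illustrates the kinds of hand-crafted sequence augmentations cited above (uniform substitution noise, reverse complement, and flank cropping); the mutation rate, alphabet, and function names are assumptions chosen for illustration rather than details from the cited works.

```python
# Illustrative hand-crafted sequence augmentations (cf. Koide et al., 2018;
# Cao and Zhang, 2019; Kopp et al., 2020). All parameters are assumed values.
import random

DNA = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def uniform_substitution(seq: str, rate: float = 0.05, alphabet: str = DNA) -> str:
    """Replace each position with a uniformly random letter with probability `rate`."""
    return "".join(random.choice(alphabet) if random.random() < rate else c for c in seq)

def reverse_complement(seq: str) -> str:
    """Reverse complement, a standard genomic augmentation."""
    return "".join(COMPLEMENT[c] for c in reversed(seq))

def random_crop(seq: str, length: int) -> str:
    """Crop a random window, mimicking shifted genome flanks."""
    start = random.randint(0, max(0, len(seq) - length))
    return seq[start:start + length]

augmented = random_crop(reverse_complement(uniform_substitution("ACGTACGTACGT")), length=8)
```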
As a motivating example of how phylogenetic augmentation can be used in contrastive learning, we specifically describe how SimCLR (Chen et al., 2020) can be applied to biological sequences, though specific details (e.g. choice of critic, encoding process, etc.) can be adapted following Section 2.

As outlined in Figure 1, homologous sequences can be considered as "evolutionarily augmented views" of a common ancestor, $x$. Sequences $v_1$ and $v_2$ are encoded by an encoder $g(\cdot)$ to obtain embeddings $z_1$ and $z_2$. The pair of embeddings augmented from the same ancestor – that is, embeddings of homologous sequences – forms the positive pair $(v_1, v_2^{+}) \sim p(v_1, v_2)$. To sample negatives $\{v_{2j}^{-}\}_{j=1}^{N-1}$ from $p(v_2)$, we can draw from all non-homologous sequences.

Figure 1: SimCLR (Chen et al., 2020) can be recast as a phylogenetic tree where the augmentations are evolution. In the original Chen et al. (2020) paper, $x$ is an input image, and two image augmentation methods, $t_1$ and $t_2$, are sampled from a set of image augmentation methods $\mathcal{T}$ to produce augmentations $v_1$ and $v_2$, which are then passed into a trainable encoder $g(\cdot)$ (i.e. $g_1$ and $g_2$ share parameters entirely). In conceptualizing evolution as an augmentation strategy, $x$ can be viewed as a common ancestor, while $\mathcal{T}$ denotes possible evolutionary trajectories, characterized by different evolutionary distances, mutation, and genetic drift, and $t_1$, $t_2$ are two example trajectories that lead to $v_1$ and $v_2$, sampled from a set of homologs. Note that notation is adapted from the original SimCLR paper for consistency with the current work.

The key idea is that properties of the ancestral sequence that were important for its biological function will be preserved in both descendants (i.e. views). By training the encoder to project these to nearby locations in the latent space, we ensure that proximity in the latent space corresponds to similar biological function without explicit labels during pretraining, analogous to how SimCLR learns semantic content without image labels. Practically, existing bioinformatics databases of homologs can be used; since contrastive learning uses pairs of positive examples, the number of training examples for pretraining increases quadratically with the number of homologs. We see that contrastive learning frameworks such as SimCLR can be directly adapted to capture phylogenetic principles.
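As an illustration of the practical recipe described above, the following sketch shows how homolog families from an existing database could be turned into SimCLR-style positive pairs, with sequences from other families serving as in-batch negatives; the toy sequences, data layout, and function names are hypothetical.

```python
# Sketch of building "evolutionarily augmented" positive pairs from homolog families.
import itertools
import random

# Each inner list holds homologous sequences descending from a common ancestor
# (hypothetical toy sequences; in practice these would come from a homology database).
families = [
    ["MKTAYIAKQR", "MKTAYVAKQR", "MKSAYIAKHR"],   # homolog family 1
    ["GSHMLEDPQR", "GSHMLEEPQR"],                  # homolog family 2
]

def positive_pairs(families):
    """All unordered pairs of distinct homologs within a family: the number of
    positive pairs grows quadratically with the number of homologs."""
    for fam in families:
        yield from itertools.combinations(fam, 2)

def sample_batch(families, batch_size):
    """Positive pairs are homologs; within a batch, sequences from other
    families act as the N - 1 negatives, as in SimCLR."""
    pairs = list(positive_pairs(families))
    return random.sample(pairs, min(batch_size, len(pairs)))

batch = sample_batch(families, batch_size=2)
```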
4 Why Evolution as Biological Sequence Augmentation?

Biological sequences are vehicles for information transmission. As such, information-theoretic principles are directly applicable to biological sequence analyses, and this may therefore be a more powerful approach than methods based on the analogy with natural language (Alley et al., 2019; Rives et al., 2019; Rao et al., 2019; Elnaggar et al., 2020).

The analogy between molecular evolution and noisy-channel coding is well-rooted in prior work (Gatlin et al., 1972; MacKay, 2003; Vinga, 2014; Kuruoglu and Arndt, 2017): DNA dictates information transmission across generations, which must be transferred through a noisy "mutation and drift channel". Further, as noted in Kimura (1961), since the genotype-to-phenotype manifestation is information transfer, and genomic information is passed down by heredity, we may view functional phenotypes as "decoded" information that was transmitted from a common ancestor via molecular evolution. Drawing from these writings, we argue that maximizing mutual information across homologs is a good proxy for structure and function (Adami et al., 2000), which are the central aims for biological sequence embeddings (Rao et al., 2019).

Even without relying on the mutual-information-estimation interpretation of the InfoNCE loss, the contrastive learning objective directly encourages representational invariance to shared features across views (Chen et al., 2020). Therefore, in using phylogenetic relationships to create views, learned representations directly capture the philosophy of evolutionary conservation in comparative genomics: functional elements will be preserved in comparisons of related sequences, while non-functional sequences will decay. Hence, functional elements in biological sequences can be identified through sequence comparisons (Hardison, 2003). This is perhaps the most successfully employed presumption in bioinformatics (Eddy, 1998; Altschul et al., 1990).

We therefore argue that InfoMax-based deep learning on evolutionary augmentation has two attractive features from the biological perspective: (1) molecular evolution and the genotype-to-phenotype relationship have a clear analogy to information transmission; and (2) contrastive learning in this setting encourages agreement between important features across evolutionary views (homologous sequences), which directly mirrors comparative genomics.
Tian et al. (2020) propose the "InfoMin" principle for selecting optimal views. The authors theoretically and empirically demonstrate that good views should minimize their shared mutual information while keeping task-relevant information intact for downstream uses. More formally, for a downstream classification task $C$ to predict label $y \in \mathcal{Y}$ from $x$, the optimal representation $z^* = g(x)$ is the minimal sufficient statistic for task $C$, such that the representation is as useful as access to $x$ while disregarding all nuisance in $x$ (Tian et al., 2020; Soatto and Chiuso, 2014). Then, the optimal views for task $C$ are $(v_1^*, v_2^*) = \arg\min_{v_1, v_2} I(v_1; v_2)$, subject to $I(v_1; y) = I(v_2; y) = I(x; y)$. With optimal views $(v_1^*, v_2^*)$, the subsequently learned representations $(z_1^*, z_2^*)$ are optimal for task $C$.¹

There are two implications in adapting the InfoMin principle to biology which render evolutionary augmentations desirable. Firstly, sampling evolutionary trajectories $t_1, t_2 \sim \mathcal{T}$ to create $v_1 = t_1(x)$ and $v_2 = t_2(x)$ provides a simple way to reduce $I(v_1; v_2)$, by selecting paired views $(v_1, v_2^{+})$ with a greater phylogenetic distance between them. Secondly, note that in order to choose views based on the InfoMin principle, access to labels $y \in \mathcal{Y}$ and knowledge of task $C$ is needed. In fact, supervised contrastive learning (Khosla et al., 2020) empirically yields improved results by explicitly sampling negatives from a different downstream class. If given labels for a downstream biological task of interest (e.g. remote homology), one can explicitly sample negatives from dissimilar classes (e.g. different folds); however, owing to the difficult label-acquisition process and open-ended nature of biological questions, access to $\mathcal{Y}$, or even task $C$, may not always be possible. Further, using homology to define views can approximate supervised contrastive learning while still circumventing expensive experimental label gathering. Hence, it may be best considered a general strategy for weakly-supervised contrastive learning.

¹The optimality of representations $(z_1^*, z_2^*)$ assumes access to a "minimally sufficient encoder" which serves as a minimal sufficient statistic of the input (Soatto and Chiuso, 2014). More formally, a "sufficient encoder" $g_1^{\text{sufficient}}$ requires that $g_1^{\text{sufficient}}(v_1)$ keeps all information about $v_2$ contained in $v_1$, and a "minimal sufficient encoder" $g_1 \in \mathcal{G}_1^{\text{sufficient}}$ discards all irrelevant "nuisance" information such that $I(g_1(v_1); v_1) \le I(g_1^{\text{sufficient}}(v_1); v_1)$ for all $g_1^{\text{sufficient}} \in \mathcal{G}_1^{\text{sufficient}}$.

Current methods for self-supervised representation learning in biology are mostly adapted from NLP methods. Contrastive learning achieves state-of-the-art results in the image modality, and has the desirable theoretical property of being a lower-bound estimator of mutual information. We demonstrate how evolution can be used as a sequence augmentation strategy for contrastive learning, and provide justification for doing so from biological and theoretical perspectives. More generally, data augmentation is a critical preprocessing step in many image analysis applications of deep learning, but it is less clear how to augment data for biological sequence analysis. As research in applications of deep learning in biology expands, we hope the view of evolution as augmentation will guide the ideation of deep learning methods in computational biology.
References
Christoph Adami, Charles Ofria, and Travis C Collier. Evolution of biological complexity. Proceedings of the National Academy of Sciences, 97(9):4463–4468, 2000.

Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-only deep representation learning. bioRxiv, page 589333, 2019.

Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.

Jose Juan Almagro Armenteros, Alexander Rosenberg Johansen, Ole Winther, and Henrik Nielsen. Language modelling for biological sequences–curated datasets and baselines. bioRxiv, 2020.

Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.

Ehsaneddin Asgari, Nina Poerner, Alice McHardy, and Mohammad Mofrad. Deepprime2sec: Deep learning for protein secondary structure prediction from the primary sequences. bioRxiv, page 705426, 2019.

Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pages 15509–15519, 2019.

David Barber and Felix V Agakov. The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems, 2003.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129–1159, 1995.

Tristan Bepler and Bonnie Berger. Learning protein sequence embeddings using information from structure. arXiv preprint arXiv:1902.08661, 2019.

Zhen Cao and Shihua Zhang. Simple tricks of convolutional neural network architectures improve DNA–protein binding prediction. Bioinformatics, 35(11):1837–1843, 2019.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.

Sean R Eddy. Profile hidden Markov models. Bioinformatics (Oxford, England), 14(9):755–763, 1998.

Majid Ghorbani Eftekhar. Prediction of protein subcellular localization using deep learning and data augmentation. bioRxiv, 2020.

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, and Burkhard Rost. End-to-end multitask learning, from protein language to protein features without alignments. bioRxiv, page 864405, 2019.

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Debsindhu Bhowmik, et al. ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing. arXiv preprint arXiv:2007.06225, 2020.

Lila L Gatlin et al. Information Theory and the Living System. Columbia University Press, 1972.

Vladimir Gligorijevic, P Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Kyunghyun Cho, Tommi Vatanen, Daniel Berenberg, Bryn C Taylor, Ian M Fisk, Ramnik J Xavier, et al. Structure-based function prediction using graph convolutional networks. bioRxiv, page 786236, 2019.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.

Ross C Hardison. Comparative genomics. PLoS Biology, 1(2):e58, 2003.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian Dallago, Dmitrii Nechaev, Florian Matthes, and Burkhard Rost. Modeling the language of life–deep learning protein sequences. bioRxiv, page 614313, 2019.

Olivier J Hénaff, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019.

R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1558–1567. JMLR.org, 2017.

Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.

Motoo Kimura. Natural selection as the process of accumulating genetic information in adaptive evolution. Genetics Research, 2(1):127–140, 1961.

Satoshi Koide, Keisuke Kawano, and Takuro Kutsuna. Neural edit operations for biological sequences. In Advances in Neural Information Processing Systems, pages 4960–4970, 2018.

Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, and Dani Yogatama. A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350, 2019.

Wolfgang Kopp, Remo Monti, Annalaura Tamburrini, Uwe Ohler, and Altuna Akalin. Deep learning for genomics using Janggu. Nature Communications, 11(1):1–7, 2020.

Anoop Kumar and Lenore Cowen. Augmented training of hidden Markov models to recognize remote homologs via simulated evolution. Bioinformatics, 25(13):1602–1608, 2009.

Ercan E Kuruoglu and Peter F Arndt. The information capacity of the genetic code: Is the natural code optimal? Journal of Theoretical Biology, 419:227–237, 2017.

Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.

Sindy Löwe, Peter O'Connor, and Bastiaan Veeling. Putting an end to end-to-end: Gradient-isolated learning of representations. In Advances in Neural Information Processing Systems, pages 3033–3045, 2019.

Amy X Lu, Haoran Zhang, Marzyeh Ghassemi, and Alan Moses. Self-supervised contrastive learning of protein representations by mutual information maximization. bioRxiv, 2020.

David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.

Ali Madani, Bryan McCann, Nikhil Naik, Nitish Shirish Keskar, Namrata Anand, Raphael R Eguchi, Po-Ssu Huang, and Richard Socher. ProGen: Language modeling for protein generation. arXiv preprint arXiv:2004.03497, 2020.

Mohamed Marouf, Pierre Machart, Vikas Bansal, Christoph Kilian, Daniel S Magruder, Christian F Krebs, and Stefan Bonn. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nature Communications, 11(1):1–12, 2020.

XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A Alemi, and George Tucker. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922, 2019.

Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, and Yun Song. Evaluating protein transfer learning with TAPE. In Advances in Neural Information Processing Systems, pages 9686–9698, 2019.

Adam J Riesselman, Jung-Eun Shin, Aaron W Kollasch, Conor McMahon, Elana Simon, Chris Sander, Aashish Manglik, Andrew C Kruse, and Debora S Marks. Accelerating protein design using autoregressive generative models. bioRxiv, page 757252, 2019.

Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, page 622803, 2019.

Morgane Rivière, Armand Joulin, Pierre-Emmanuel Mazaré, and Emmanuel Dupoux. Unsupervised pretraining transfers well across languages. arXiv preprint arXiv:2002.02848, 2020.

Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine Hsu, Eric Jang, Stefan Schaal, Sergey Levine, and Google Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1134–1141. IEEE, 2018.

Stefano Soatto and Alessandro Chiuso. Visual representations: Defining properties and deep approximations. arXiv preprint arXiv:1411.7676, 2014.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243, 2020.

Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Susana Vinga. Information theory applications for biological sequence analysis. Briefings in Bioinformatics, 15(3):376–389, 2014.

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. arXiv preprint arXiv:2005.10242, 2020.

Kevin K Yang, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. Learned protein embeddings for machine learning. Bioinformatics, 34(15):2642–2648, 2018.
A InfoMax Principle and Mutual Information Estimation for Representation Learning
A.1 Applying InfoMax to Representation Learning
Using InfoMax for representation learning extends as far back as ICA (Bell and Sejnowski, 1995). As described in Equation 1, recent works typically maximize the mutual information between two encoded "views" of an input (e.g. different patches of an image, or augmentations). By the data processing inequality, Tschannen et al. (2019) show that:

$$I(g_1(v_1); g_2(v_2)) \le I(x; g_1(v_1), g_2(v_2)), \qquad (3)$$

such that maximizing Equation 1 is equivalent to maximizing a lower bound on the true InfoMax objective. The ability to maximize the mutual information in the latent embedding space, rather than directly between the input and the encoded output (as per the original InfoMax formulation), has a few advantages (Tschannen et al., 2019): MI is difficult to estimate in high dimensions, and this formulation enables MI estimation in a lower-dimensional space; furthermore, the ability to use creative encoders for $\mathcal{G}$ can accommodate specific modelling needs and data intricacies.
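For completeness, one way to sketch why Equation 3 holds is via the chain rule of mutual information, under the additional assumption (made here only for illustration) that the two views, and hence $z_1 = g_1(v_1)$ and $z_2 = g_2(v_2)$, are conditionally independent given $x$ (e.g. independently sampled augmentations):

\begin{align*}
I(z_1; z_2) &\le I(z_1; x, z_2) \\
            &= I(z_1; x) + I(z_1; z_2 \mid x) \\
            &= I(z_1; x) \;\le\; I(x; z_1, z_2),
\end{align*}

where the second line is the chain rule of mutual information and the third uses the assumed conditional independence of $z_1$ and $z_2$ given $x$.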
A.2 InfoNCE Estimator

InfoNCE is one of many mutual information estimators, and following the rationale in Section A.1, the original van den Oord et al. (2018) paper performs this estimation in the embedding space. For the InfoNCE loss (Equation 2), which estimates $I(z_1; z_2) = I(g_1(v_1); g_2(v_2))$ in Equation 3, the optimal critic function is $f^*(z_1, z_2) = \frac{p(z_2 \mid z_1)}{p(z_2)}$ (van den Oord et al., 2018). Inserting this into the InfoNCE loss function (Equation 2) and rearranging, we have the bound (van den Oord et al., 2018; Poole et al., 2019):

$$I(z_1; z_2) \ge \log(N) - \mathcal{L}^{*}_{\text{NCE}}, \qquad (4)$$

where $N$ is the number of samples. From Equation 4, note that the bound is tight when: (1) we use more samples $N$, which increases the $\log(N)$ term; and (2) we have a better $f$, which results in a lower $\mathcal{L}_{\text{NCE}}$. Empirically, most works corroborate the former theoretical observation regarding $N$, whereas the latter observation regarding $f$ does not usually hold, as will be further discussed in Section A.3.

The contrastive nature of the InfoNCE loss stems from its direct adaptation of the noise-contrastive estimation (NCE) method (Gutmann and Hyvärinen, 2010). Noise-contrastive estimation was originally proposed for the problem of estimating parameters of unnormalized statistical models in high dimensions, by reducing the problem to estimating logistic regression parameters that distinguish observed data from noise. In InfoNCE, the distinction is made between "similarity scores", as scored by the critic $f(z_1, z_2)$, for one positive pair and $N-1$ negative pairs of encoded views.
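As an illustrative numerical check of Equation 4 (not drawn from the cited works), the following sketch computes the lower-bound estimate $\log(N) - \mathcal{L}_{\text{NCE}}$ with a cosine-similarity critic for two toy cases, correlated views and independent views; the dimensions, noise level, and temperature are arbitrary assumptions.

```python
# Toy demonstration of the bound I(z1; z2) >= log(N) - L_NCE (Eq. 4).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, d = 128, 32
z1 = torch.randn(N, d)
z2_corr = z1 + 0.1 * torch.randn(N, d)   # views that share information
z2_indep = torch.randn(N, d)             # independent views (true MI is zero)

def nce_loss(a, b, temperature=0.1):
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t() / temperature
    return F.cross_entropy(logits, torch.arange(a.size(0)))

for name, z2 in [("correlated", z2_corr), ("independent", z2_indep)]:
    bound = torch.log(torch.tensor(float(N))) - nce_loss(z1, z2)
    print(f"{name}: MI lower-bound estimate = {bound.item():.2f} nats")
# Correlated views give an estimate near log(N) ~ 4.85 nats; independent views give ~0.
# No estimate from this bound can exceed log(N), illustrating why larger N tightens it.
```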
A.3 Other Mutual Information Estimators

The InfoNCE estimator is one of many approaches which build on advancements in variational methods to create differentiable and tractable sample-based mutual information estimators in high dimensions (Donsker and Varadhan, 1983; Barber and Agakov, 2003; Nguyen et al., 2010; Alemi et al., 2016; Belghazi et al., 2018; van den Oord et al., 2018; Hjelm et al., 2018). Many of these estimators involve a "critic" classifier $f$, which can be as simple as a bilinear model (van den Oord et al., 2018; Hénaff et al., 2019; Tian et al., 2019) or a dot product (Chen et al., 2020; Lu et al., 2020), or a model accepting a concatenation of two views as input. There may be a different $f$ for each view (van den Oord et al., 2018), or a global $f$ (Hjelm et al., 2018). Usually, $f$ is trained jointly with $g_1$ and $g_2$.

The aim of $f$ is often to approximate the unknown densities $p(B)$ and $p(B \mid A)$, or the density ratio $\frac{p(A \mid B)}{p(A)} = \frac{p(B \mid A)}{p(B)}$ (Poole et al., 2019). Intuitively, if $I(A; B)$ is high, then $f$ should be able to easily assign high probabilities to samples drawn from $p(A, B)$ (Tschannen et al., 2019). The InfoNCE estimator reduces variance compared to other estimators by depending on multiple samples, but trades off bias to do so (Poole et al., 2019).

Importantly, it should be noted that whether the empirical success of the InfoNCE loss is attributable to mutual information estimation has been questioned (Poole et al., 2019; Tschannen et al., 2019), with success instead attributed to geometric properties of the latent space (Wang and Isola, 2020). For example, a higher-capacity $f$