# The Agnostic Structure of Data Science Methods

arXiv preprint [stat.OT]

Domenico Napoletani ∗, Marco Panza †, Daniele Struppa ‡

∗ University Honors Program and Institute for Quantum Studies, Chapman University, Orange (CA)

† CNRS, IHPST (CNRS and Univ. of Paris 1, Panthéon-Sorbonne) & Chapman University, Orange (CA)

‡ The Donald Bren Presidential Chair in Mathematics, Chapman University, Orange (CA)

In this paper we want to discuss the changing role of mathematics in science, as a way to examine some methodological trends at work in big data science. Classically, any application of mathematical techniques requires a previous understanding of the phenomena and of the mutual relations among the relevant data. Modern data analysis, on the other hand, does not require that. It rather appeals to mathematics to re-organize data in order to reveal possible patterns uniquely attached to the specific questions we may ask about the phenomena of interest. These patterns may or may not provide further understanding per se, but nevertheless provide an answer to these questions.

It is because of this diminished emphasis on understanding that we suggested in [13] to denote such methods under the label of ‘agnostic science’, and we speak of ‘blind methods’ to denote individual instances of agnostic science. These methods usually rely only on large and diverse collections of data to solve questions about a phenomenon. As we will see in Section 3, a reliance on large amounts of data is, however, not sufficient in itself to make a method in data science blind.

The lack (or only partial presence) of understanding of phenomena in agnostic science makes the solution to any specific problem dependent on patterns identified automatically through mathematical methods. At the same time, this approach calls for a different kind of understanding of what makes mathematical methods and tools well adapted to these tasks.

A current explanation of the power of data science is that it succeeds only because of the relevance and size of the data themselves. Accordingly, the use of mathematics in data science reduces itself to disconnected methods devoid of a common structure. This view would amount to a new Wignerian paradox of “unreasonable effectiveness”, where such effectiveness is assigned to data and not to mathematics. This affirmation of the exclusive primacy of data would be a sort of revenge of facts against their mathematization.

We reject such an answer as both immaterial and unsupported. In our rejection we do not argue that any exploration of data is doomed to failure without some previous understanding of the phenomenon (in other words, we do not oppose data science’s methods by defending [...] modus operandi.

This account relies on the results of our previous works [13], [15], [16], which we reorganize in a comprehensive way. Furthermore, we identify a possible direction for future research, and a promising perspective from which the question can be tackled.

In [13] we discussed the lack of understanding proper to the current big data methods. In so doing, we did not borrow any general and univocal notion of understanding for an empirical (physical, biological, biomedical, social, etc.) phenomenon.
We simply observed that big data methods typically apply not to the study of a certain empirical phenomenon, but rather to a pair composed of a phenomenon and a question about it. Such methods are used when it is impossible to identify a small number of independent variables whose measurement suffices to describe the phenomenon and to answer the question at hand.

We have argued that, when this happens, no appropriate understanding of the phenomenon is available, regardless of how one could conceive the notion of understanding. This highlights the need to understand why (and when) mathematics is successful in agnostic science, despite the blindness of its methods.

Let us now briefly describe the structure of the paper.

In Section 2, we discuss what we consider to be the basic trend of agnostic science: the “Microarray Paradigm”, as we called it in [13]. This name was chosen to reflect the fact that this trend first became manifest in biology and biomedical sciences, though it is now pervasive in all data science. It is characterized by the handling of huge amounts of data, whose specific provenance is often unknown, and whose modes of selection are often disconnected from any previous identification of a relevant structure in the phenomenon under observation. This feature is intended as the most notable virtue of the paradigm, since it allows investigating the relevant phenomena, and specifically the data that have been gathered about them, without any previous hypothesis on possible correlations or causal relationships among the variables that are measured.

As already noted, there is an important distinction to be made between the use of powerful and often uncontrollable algorithms on huge amounts of data and agnostic science properly so called. This distinction will be investigated in Section 3 with the help of a negative example, the PageRank algorithm used by Google to weight web pages.
The basic point, here, is that the lack of local control of the algorithm in use is not the same as lack of understanding of the relevant phenomenon. The former is proper to any algorithm working on huge amounts of data, like PageRank; the latter is instead, by definition, the characteristic feature of agnostic science.

In Section 4, we will go further in our analysis of agnostic science, by investigating the relations between optimization and “forcing”, a term we first introduced in this context in [13]. More specifically, by forcing we referred in [16] to the following methodological practice:

The use of specific mathematical techniques on the available data that is not motivated by the understanding of the relevant phenomena, but only by the ability of such techniques to structure the data to be amenable to further analysis.

For example, we could impose continuity and smoothness on the data, even in the case of variables that can only take discrete values, just for the purpose of being able to use derivatives and differential equations to analyze them. In cases such as this, we say that we are forcing mathematics over the available data.

In our terminology, optimization itself can be seen as a form of forcing when we carefully consider the ways it is used within agnostic science.

By better describing this modus operandi, in relation to deep learning techniques, we will also make clear that the reasons for the effectiveness of optimization cannot be regarded as the rationale for the success of agnostic science. In short, this is because optimization is itself a form of forcing and does not ensure that the optimal solutions found by such methods correspond to anything of significance in the evolution or state of the phenomenon.

Having made this clear, in Section 5, we shall outline a tentative answer to the question of the success and the appropriateness of agnostic science, by indicating a possible direction for further reflection.

Our suggestion is based on the observation that blind methods can be regarded as complying with a simple and general prescription for algorithms’ development, which we called “Brandt’s Principle” in [16]. We then discuss the way in which data science ([7], Chapter 16) relies on entire families of methods (“ensemble methods”), and we interpret it in light of the Microarray Paradigm, ultimately showing how this forces some implicit analytic constraints on blind methods. We finally propose that these analytic constraints, when taken together with Brandt’s Principle, exert strong restrictions on the relevant data sets that agnostic science can deal with.

We now describe in some detail a biological problem and a corresponding experimental technique. This provides a paradigmatic example of a blind method, and illustrates the typical way agnostic science works.

One of the great advances of biology and medical practice has been the understanding of the relevance of genes in the development of several diseases such as cancer. The impact of genetic information, as encompassed by DNA sequences, is primarily mediated in an organism by its expression through corresponding messenger RNA (mRNA) molecules. We know that the specific behavior of a cell largely depends on the activity, concentration, and state of proteins in the cell, and the distribution of proteins is, in turn, influenced by the changes in levels of mRNA. This opens the possibility of understanding diseases and their genetic basis through the analysis of mRNA molecules.

The mechanism that leads from a certain distribution of mRNA molecules to the manifestation of a certain disease is, however, often not understood, nor are the specific mRNA molecules that are relevant for particular diseases. Biologists developed a technique, called ‘DNA microarray’, that can bypass, to some extent, this lack of understanding, and allows the identification of patterns, within mRNA distributions, that may be markers of the presence of some diseases.

We first describe very briefly the experimental structure of the DNA microarray, and then the way it can be used in diagnostics (we refer to [13] for a list of references on this technique, or to [1] for a broad introduction).

A DNA microarray is essentially a matrix of microscopic sites where several thousands of different short pieces of a single strand of DNA are attached.
Messenger RNA (mRNA) molecules are extracted from some specific tissues of different patients, then amplified and marked with a fluorescent substance and finally dropped on each site of the microarray.

This makes each site take on a more or less intense fluorescence according to the amount of mRNA that binds with the strands of DNA previously placed in it. The intensity and distribution of the fluorescence give a way to evaluate the degree of complementarity of the DNA and the mRNA strands.

This provides a correspondence between the information displayed by a DNA microarray and the behavior of a cell. This correspondence, however, is by no means exact or univocal, since the function of many proteins in the cell is not known, and several strands of DNA are complementary to the mRNA strands of all protein types. Nevertheless, thousands of strands of DNA are checked on a single microarray, so that one might expect this method to offer a fairly accurate description of the state of the cells, even if it does not offer any sort of understanding of what is happening in the relevant tissues. The microarray is, indeed, particularly valuable for a huge number of variables, whose relation to each other and to the state of the cell is unknown to us.

This does not forbid, for example, the use of microarrays for the diagnosis of many illnesses, since by measuring the activity of proteins one may be able to distinguish patients affected by a certain pathology from those who are not, even without knowing the reason for the differences.

From a mathematical point of view, this process can be described as follows: let us label by ‘$X_i$’ the vector of expression levels of mRNA strands associated to a patient $i$: the hope is to be able to find a function $F$ such that $F(X_i) = 0$ if the patient does not have a particular disease, and $F(X_i) = 1$ if the patient does have the disease.
The question of how to find such a function $F$ is at the core of agnostic science, and we will come back to it momentarily.

This short description of DNA microarrays should already be enough to justify why this technology can be taken as a paradigmatic example of the way agnostic science works. In short, this point of view can be summarized as follows:

If enough and sufficiently diverse data are collected regarding a certain phenomenon, we can answer all relevant questions about it.

This slogan, to which we refer as the Microarray Paradigm, pertains not only to this specific example, but applies as well to agnostic science as a whole. The question we want to tackle here is what makes this paradigm successful (at least in a relevant number of cases).

(We introduced the term ‘Microarray Paradigm’ in [13] to refer to a specific attitude towards the solution of data science problems, and not only to denote the technique underlying DNA microarrays. The microarray paradigm, as an attitude to the solution of problems, should not be confused with agnostic science, which is a general scientific practice implementing this attitude. It should also not be confused with any particular blind method, namely a specific way to implement the microarray paradigm.)

To better grasp the point, let us sketch the general scheme that agnostic science applies under this paradigm: data are processed through an appropriate algorithm that works on the available data independently both of their specific nature, and of any knowledge concerning the relations possibly connecting the relevant variables. The process is subject to normalization constraints imposed by the data, rather than by the (unknown) structure of the phenomenon, and this treatment produces an output which is taken as an answer to a specific question about this phenomenon.

This approach makes it impossible to generalize the results or even to deal with a change of scale: different questions require different algorithms, whose structure is general and applied uniformly across different problems. Moreover, the specific mathematical way in which a question is formulated depends on the structure of the algorithm which is used, and not the other way around.

To better see how blind methods work, we will provide a quick overview of Supervised Machine Learning (we shall come back later to this description in more detail). While this is just an example, it will show the strong similarity between blind methods and interpolation and approximation theory.

The starting point is a training set $(X, Y)$, constituted by $M$ pairs $(X_i, Y_i)$ ($i = 1, 2, \ldots, M$), where each $X_i$ is typically an array $\left(X_i^{[j]}\right)$ ($j = 1, 2, \ldots, N$) of given values of $N$ variables. For example, in the DNA microarray example, each array $X_i$ is the expression of mRNA fragments detected by the microarray, while the corresponding $Y_i$ indicates the presence ($Y_i = 1$) or absence ($Y_i = 0$) of a given disease.

By looking at this training set with the help of an appropriate algorithm, the data scientist looks for a function $F$ such that $F(X_i) = Y_i$ or $F(X_i) \approx Y_i$ ($i = 1, 2, \ldots, M$). This function is usually called a ‘classifier’ if the output is categorical, a ‘model’ or a ‘learned function’ for continuous outputs. In the following we will refer in both cases to $F$ as a ‘fitting function’, to stress the connection of supervised machine learning with approximation theory.

In general, one looks for a function $F$ that satisfies these conditions for most indices $i$; moreover it is helpful if the function belongs to a functional space $\mathcal{F}$ selected because of its ability to approximate general regular functions in a computationally efficient way. For example, if the regular functions of interest are analytic functions over an interval, $\mathcal{F}$ can be taken to be the space of polynomial functions defined on the same interval.

Let $A$ be the space of parameters that define a function in $\mathcal{F}$, and denote its elements as ‘$F_a(X)$’, where $a$ is an array of parameters in $A$. The standard way to find the most suitable $F_a \in \mathcal{F}$ for a supervised learning problem is akin to functional approximation and can be described as follows. One defines a “fitness function” $E(a)$ by comparing the output $F_a(X_i)$ to the corresponding value $Y_i$ at each $X_i$, and by setting

$$E(a) = \sum_i \left(Y_i - F_a(X_i)\right)^2.$$

The most suitable $F_{\bar a}$ is identified, then, by seeking a value $\bar a$ that minimizes $E(a)$. The function $F_{\bar a}$ selected in this way is, then, tested on a testing set $(X_i, Y_i)$ ($i = M+1, M+2, \ldots, M+M'$), and, if it is found to be accurate on this set as well, is finally used, by analytical continuation, to forecast $\tilde Y$ when a new instance of argument $\tilde X$ is taken into account.

Though brief, this description shows that supervised machine learning consists in analytically continuing a function found by constraining its values on a discrete subset of points defined by the training set. Moreover, supervised learning gives a mathematical form to the basic classification problem: given a finite set of classes of objects, and a new object that is not labelled, find the class of objects to which it belongs. What is relevant, however, is that each of the numbers $M$, $N$, and $M'$ may be huge, and we have no idea of how the values $Y_i$ depend on the values $X_i$, or how the values in each array $X_i$ are related to each other. In particular, we do not know whether the variables taking these values are reducible (that is, depend on each other in some way) or whether it is possible to apply suitable changes of scale or normalization on the variables.

Supervised learning is therefore emblematic of agnostic science since we have no way to identify a possible interpolating function $F_a$, except the use of appropriate algorithms. Our lack of understanding of the phenomenon, in particular, ensures that there is no effective criterion to guide the choice of the vector of parameters $a$, which are instead taken first to be arbitrary values, and eventually corrected by successive reiterations of the algorithm, until some sort of stability is achieved.

Note that not all data science algorithms fall directly under the domain of supervised learning.
For example, in unsupervised learning, the goal is not to match an input $X$ to an output $Y$, but rather to find patterns directly in the set of inputs $\{X_1, \ldots, X_M\}$. The most recent consensus is that unsupervised learning is most efficiently performed when conceived as a particular type of supervised learning (we shall come back later to this point). Another important modality of machine learning is reinforcement learning, a sort of semi-supervised learning strategy, where no fixed output $Y$ is attached to $X$, and the fitting function $F_a$ is evaluated with respect to a system of “rewards” and “penalties”, so that the algorithm attempts to maximize the first and minimize the second. This type of machine learning is most often used when the algorithm needs to make a series of consecutive decisions to achieve a final goal (as for example when attempting to win in a game such as chess or Go). Similarly to unsupervised learning, it has been shown ([12]) that reinforcement learning works best when implemented in a modified, supervised learning setting.

Given the possibility of reducing both unsupervised and reinforcement learning to supervised learning schemes, we continue our analysis of supervised learning algorithms.

We now turn to consider another important question, suggested by the uncontrolled parameter structure of the supervised fitting function. Isn’t it possible that, by working on a large enough data set, one can find arbitrary patterns in the data set that have no predictive power? The question can be addressed with the help of combinatorics, namely through Ramsey theory. This makes it possible to establish, in many cases, the minimal size for a set $S$ in order to allow the identification of a given combinatorial pattern in a subset of $S$ [6].
By adapting Ramsey theory to data analysis, Calude and Longo ([4]) have shown that large enough data sets allow one to establish any possible correlation among the data themselves. This might suggest that wondering how much data is enough is only part of the story. Another, possibly more important part, if we want to avoid the risk of making the results of blind methods perfectly insignificant, is wondering how much data is too much.

(On the question of how much data is enough, note that there are ways to apply data science also to small data sets, if we accept strong limitations on the type of questions and we impose strong regularization restrictions on the type of solutions [18].)

There are several comments to be made on this matter.

To begin with, we note that Ramsey theory proves the existence of lower bounds on the size of sets of data that ensure the appearance of correlations. But these lower bounds are so large as to be of little significance for the size of data sets one usually handles.

Even more importantly, Ramsey theory allows one to establish the presence of patterns in subsets of the initial data set. In supervised learning, on the other hand, we require that every element of $X$ matches with an appropriate element of $Y$: this is essentially different from seeking correlations in a subset of the data set. In other words, Ramsey theory would only show that it is possible to write $F(X_i) = Y_i$ for some specific subset of elements of $X$ and of $Y$: this would have no useful application in practice for supervised learning, where the totality of the available data must be properly matched.

Hence, as long as finding patterns within a data set $X$ is tied to supervised learning, there is no risk of uncontrolled and spurious correlations. Instead, any such correlation will be strongly dependent on its relevance in finding the most appropriate fitting function $F$. Moreover, we will see in Section 4 that, even when blind methods seem not to fall within the structural constraints of supervised learning, they can still be reinterpreted as such.

We should add that agnostic science enters the game not in opposition to traditional, theoretically-tied methods, but as another mode of exploration of phenomena, and it should in no way discourage, or even inhibit, the search for other methods based on previous understanding. Any form of understanding of the relevant phenomena is certainly welcome. Still, our point here is that there is no intrinsic methodological weakness in blind methods that is not, in one way or another, already implicit in those other methodologies with a theoretical bent. At their core they all depend on some sort of inductive inference: the assumption that a predictive rule, or a functional interpolation of data, justified either by a structural account of phenomena or by analytical continuation of interpolating functions, will continue to hold true when confronted with new observations.

Supervised learning shows that we can succeed, despite the obvious (theoretical and/or practical) risks, in using data to find patterns useful to solve specific problems with the available resources (though not necessarily patterns universally associated with the relevant phenomena). The degree of success is manifest in disparate applications such as: face recognition [25]; automated translation algorithms [28]; playing (and beating humans) at difficult [...]
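The supervised learning scheme described in this section (a training set, a fitness function $E(a)$ minimized from arbitrary initial parameters, and a testing set) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the authors' implementation: the hidden rule generating the data, the choice of quadratic polynomials as the functional space, the use of plain gradient descent, the averaging of $E(a)$ over the number of pairs (which does not change the minimizer), and all function names are hypothetical choices made for the example.

```python
import random

def predict(a, x):
    """Evaluate the polynomial F_a(x) = a[0] + a[1]*x + a[2]*x**2 + ..."""
    return sum(coef * x ** k for k, coef in enumerate(a))

def fitness(a, data):
    """E(a) divided by the number of pairs: mean of (Y_i - F_a(X_i))**2."""
    return sum((y - predict(a, x)) ** 2 for x, y in data) / len(data)

def fit(train, degree=2, steps=2000, lr=0.1, seed=0):
    """Minimize the fitness by gradient descent from arbitrary initial parameters."""
    rng = random.Random(seed)
    a = [rng.uniform(-1.0, 1.0) for _ in range(degree + 1)]  # arbitrary start
    for _ in range(steps):
        grad = [0.0] * len(a)
        for x, y in train:
            residual = predict(a, x) - y
            for k in range(len(a)):
                grad[k] += 2.0 * residual * x ** k
        # Correct the parameters a little at each reiteration.
        a = [ak - lr * gk / len(train) for ak, gk in zip(a, grad)]
    return a

def make_data(xs, seed):
    """Hypothetical data: hidden rule y = 1 + 2x - x^2 plus small noise (assumption)."""
    rng = random.Random(seed)
    return [(x, 1.0 + 2.0 * x - x * x + rng.gauss(0.0, 0.05)) for x in xs]

M, M_prime = 40, 20  # sizes of the training and testing sets
train = make_data([-1.0 + 2.0 * i / (M - 1) for i in range(M)], seed=1)
test_rng = random.Random(2)
test = make_data([test_rng.uniform(-1.0, 1.0) for _ in range(M_prime)], seed=3)

a_bar = fit(train)
print("fitted parameters:", [round(c, 3) for c in a_bar])
print("training error:", round(fitness(a_bar, train), 5))
print("testing error:", round(fitness(a_bar, test), 5))
```

The algorithm never uses the hidden rule: it only adjusts the parameters until the fitness stabilizes, and the testing set is what indicates whether the fitted function can be trusted on new arguments.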

Before continuing our search for such a (meta-)understanding, we observe that agnostic science is not equivalent to the use of data-driven algorithms on huge amounts of data. As we will see in this section, we can single out computationally efficient algorithms that can be applied to extremely large data sets. And yet these very algorithms can be proven to converge, we fully understand their output, and we understand the structure of the data that makes them useful, so that they cannot be considered blind algorithms, and their use is not an example of agnostic science. To better see this point, we will describe PageRank: the algorithm used by Google to weight webpages ([3], [19]).

Let $A$ be a web page with $n$ other pages $T_i$ ($i = 1, 2, \ldots, n$) pointing to it. We introduce a damping factor $d_A$ ($0 \le d_A \le 1$) that describes the probability that a random web-surfer landing on $A$ will leave the page. If $d_A = 0$, no surfer will leave the page $A$; if $d_A = 1$, every surfer will abandon the page. One can choose $d_A$ arbitrarily, or on the basis of any possible a priori reason. Such a choice does not affect the outcome of the algorithm in the limit of a sufficiently large number of iterations of the algorithm itself. The PageRank of $A$ is given by this formula:

$$PR(A) = (1 - d_A) + d_A \left( \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)} \right),$$

where $PR(T_i)$ is the PageRank of $T_i$ and $C(T_i)$ is the number of links going out of it.

This formula is very simple. But it is recursive: in order to compute $PR(A)$, one needs to compute the PageRank of all the pages pointing to $A$. In general, this makes it impossible to compute it directly, since, if $A$ points to some $T_i$, then $PR(T_i)$ depends on $PR(A)$, in turn. However, this does not make the computation of $PR(A)$ impossible, since we can compute it by successive approximations: (1) we begin by computing $PR(A)$ choosing any arbitrary value for $PR(T_i)$; (2) the value of $PR(A)$ computed in step (1) is used to provisionally compute $PR(T_i)$; (3) next $PR(A)$ is recalculated on the basis of the values of $PR(T_i)$ found in (2); and so forth, for a sufficient number of times.

It is impossible to say a priori how many times the process is to be reiterated in order to reach a stable value for any page of the Web. Moreover, the actual complexity and dimension of the Web make it impossible to follow the algorithm’s computation at any of its stages, and for all the relevant pages. This is difficult even for a single page $A$, if the page is sufficiently connected within the Web. Since the Web is constantly changing, the PageRank of each page is not fixed and is to be computed again and again, so that the algorithm needs to be run continuously.
Thus the impossibility of any local control over this process is obvious. Still, it can be demonstrated that the algorithm converges to the principal eigenvector of the normalized link matrix of the Web. This makes the limit PageRank of any page, namely the value of the PageRank of the given page in this vector, a measure of the centrality of this page in the Web.

Whether this actually measures the importance of the page is a totally different story. What is relevant is that the algorithm has been designed to compute the principal eigenvector, under the assumption that the value obtained in this way is an index of the importance of the page. Given any definition of importance, and under suitable conditions, it has been recently proved ([9], Theorem 2) that PageRank will asymptotically (in the size of the web) rank pages according to their importance.

This result confirms the essential point: the algorithm responds to a structural understanding of the Web, and to the assumption that the importance of any page is proportional to its centrality in the normalized link matrix. Then, strictly speaking, there is nothing blind in this approach, and using it is in no way an instance of agnostic science, though the Web is one of the most obvious examples of Big Data we might imagine. But, then, what makes blind methods blind, and agnostic science agnostic?

Agnostic science appears when, for the purpose of solving specific problems, one uses methods to search for patterns which (unlike the case of PageRank) correspond to no previous understanding. This means we use methods and algorithms to find problem-dependent patterns in the hope that, once discovered, they will provide an apparent solution to the given problem. If this is so, then agnostic science is not only a way to solve problems from data without structural understanding, but also a family of mathematically sophisticated techniques to learn from experience by observation of patterns.
Still, attention to invariant patterns is, ultimately, what Plato (Theaetetus, 155d) called ‘astonishment [θαυμάζειν]’, and considered to be “the origin of philosophy [ἀρχὴ φιλοσοφίας]”. What happens with agnostic science is that we have too much data to be astonished by our experience as guided by the conceptual schemas we have at hand. So we use blind methods to look for sources of astonishment deeply hidden within these data.
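The successive-approximation scheme for PageRank described in this section can be sketched on a toy example. The code below is a hedged illustration only: the four-page web, the use of a single damping factor `d` shared by all pages (rather than a page-specific $d_A$), the synchronous update of all pages at each iteration, and the function name `pagerank` are assumptions made for the example, not a description of Google's actual implementation.

```python
def pagerank(links, d=0.85, iterations=200, initial=1.0):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it points to; every page is
    assumed to appear as a key and to have at least one outgoing link.
    """
    pages = list(links)
    # For each page, collect the pages pointing to it.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)
    # Step (1): start from an arbitrary value for every page.
    pr = {p: initial for p in pages}
    # Steps (2), (3), ...: recompute all values from the previous ones.
    for _ in range(iterations):
        pr = {
            p: (1.0 - d) + d * sum(pr[t] / len(links[t]) for t in incoming[p])
            for p in pages
        }
    return pr

# Hypothetical toy web: A links to B and C, B to C, C to A, D to C.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

ranks_from_one = pagerank(web, initial=1.0)
ranks_from_ten = pagerank(web, initial=10.0)
for page in sorted(web):
    print(page, round(ranks_from_one[page], 4), round(ranks_from_ten[page], 4))
```

On this small example the two runs, started from different arbitrary values, settle on the same ranks, which illustrates why the lack of local control over the iterations does not prevent us from understanding what the algorithm computes.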

We will now try to understand the features of an algorithm that make it suitable to identify appropriate patterns within a specific problem.

The question has two facets. On one hand, it consists in asking what makes these algorithms successful. On the other hand, it consists in wondering what makes them so appropriate (for a specific problem). The difficulty is that what appears to be a good answer to the first question seems to contrast, at least at first glance, with the possibility of providing a satisfactory answer to the second question.

Indeed, as to the first question, we would say, in our terminology, that the algorithms perform successfully because they act by forcing, i.e. by choosing interpolation methods and selecting functional spaces for the fitting functions in agreement with a criterion of intrinsic (mathematical) effectiveness, rather than conceiving these methods in connection with the relevant phenomena.

This answer seems to be in contrast with the possibility of providing a satisfactory answer to the second question, since it seems to negate from the start the possibility of understanding the appropriateness of methods in agnostic science. We think it is not. In this section we shall refine the notion of forcing by looking more carefully at the use of optimization in data science, and more particularly for a powerful class of algorithms, the so-called Deep Learning Neural Networks.

Finally, in Section 5 we explore several ways to make the answer to the question of the successfulness of data science algorithms compatible with the existence of an answer to the appropriateness question.

As we showed in [13, 16], boosting algorithms are a clear example of forcing. They are designed to improve weak classifiers, generally just slightly better than random ones, and to transform them, by iteration, into strong classifiers.
This is achieved by carefully focusing at each iteration on the data points that were misclassified at the previous iteration. Boosting algorithms are particularly effective in improving the accuracy of classifiers based on sequences of binary decisions (so-called “classification trees”). Such classifiers are easy to build but on their own are relatively inaccurate. Boosting can, in some cases, reduce error rates for simple classification trees from a high of [...] to about [...] ([7], Chapter 10).

Regularization algorithms offer a second example. If the data are too complicated and/or rough, these algorithms render them amenable to being treated by other algorithms, for example by reducing their dimension. Despite the variety of regularization algorithms, they can all be conceptually equated to the process of approximating a not necessarily differentiable function by a function whose derivative is bounded in absolute value everywhere on its domain.

The use of these algorithms reveals a double application of forcing: forcing on the original data to smooth them; and then forcing on the smooth data to treat them with a second set of algorithms. For example, after a regularization that forces the data to be smooth, some data science methods advocate the forcing of unjustified differential equations, in the search for a fitting function ([22], chapter 19).
These methods have been very effective in recognizing the authenticity of handwritten signatures, and they depend essentially on the condition that the data are accounted for by a smooth function.

Since in virtually all instances of forcing the mathematical structure of the methods is either directly or indirectly reducible to an optimization technique, we claim that optimization is a form of forcing within the domain of agnostic science.

In a sense, this is suggested by the historical origins of optimization methods ([20]; [21]). When Maupertuis, then President of the Berlin Academy of Sciences, first introduced the idea of least action, he claimed to have found the quantity that God wanted to minimize when creating the universe. Euler, at the time a member of the Berlin Academy of Sciences, could not openly criticize his President, but clearly adopted a different attitude, by maintaining that action was nothing but what was expressed by the equations governing the system under consideration. In other terms, he suggested one should force the minimization (or maximization) of an expression like

$$\int F(x)\,dx$$

on any physical system in order to find the function $F$ characteristic of it. Mutatis mutandis, this is the basic idea that we associate today with the Lagrangian of a system. Since then, optimization has become the preeminent methodology in solving empirical problems. One could say that the idea of a Lagrangian has been generalized to the notion of fitting function, whose optimization characterizes the dynamics of a given system.

Though this might be seen as a form of forcing, within a quite classical setting, one should note that, in this case, the only thing that is forced on the problem is the form of the relevant condition, and the request that a certain appropriate integral reaches a maximum or minimum.
In this classical setting, however, the relevant variables are chosen on the basis of a preliminary understanding of the system itself, and the relevant function is chosen so that its extremal values yield those solutions that have already been found in simpler cases.

Things change radically when the fitting function is selected within a convenient functional space through an interpolation process designed to make the function fit the given data. In this case, both the space of functions and the specific fitting procedure (which makes the space of functions appropriate) are forced on the system. These conditions are often not enough to select a unique fitting function or to find or ensure the existence of an absolute minimum, so that an additional choice may be required (forced) for this purpose.

There are many reasons why such an optimization process can be considered effective. One is that it matches the microarray principle: enough data, and a sufficiently flexible set of algorithms, will solve, in principle, any scientific problem. More concretely, optimization has been shown to be both simple and relatively reliable: not necessarily in finding the actual solution of a problem, but rather in obtaining, without exceeding time and resource constraints, outcomes that can be taken as useful solutions to the problem. The outcomes of optimization processes can be tested in simple cases and shown to be compatible with solutions that had been found with methods based on a structural understanding of the relevant phenomenon. In addition, the result of an optimization process may turn out to be suitable for practical purposes, even when it is not the best possible solution. An example is provided by algorithms for self-driving cars.
In this case, the aim is not that of mimicking human reactions in the best possible way, but rather simply to have a car that can drive autonomously with sufficient attention to the safety of the driver and of all other cars and pedestrians on the road.

This last example makes it clear that we can conceive optimization as a motivation for finding algorithms without being constrained by the search for the best solution. Optimization becomes a conceptual framework for the development of blind methods. This is different from choosing the fitting function on the basis of solutions previously obtained with the help of an appropriate understanding. [...] may provide fundamental contributions to data science in this respect. On the other hand, blind methods place at the center the process of interpolation itself, rather than any correspondence between existing instances of solutions (i.e. simpler problems) and those to be determined (that we can equate to more complex problems).

Optimization as forcing also raises some important issues, beyond the obvious one which is typical of blind methods, namely the absence of an a priori justification.

One issue is that, in concrete data science applications such as pattern recognition or data mining, optimization techniques generally require fixing a large number of parameters, sometimes millions of them, which not only makes control of the algorithms hopeless, but also makes it difficult to understand the way algorithms work. This large number of parameters often results in a lack of robustness, since different initial choices of the parameters can lead to completely different solutions.

Another issue is evident when considering the default technique of most large-scale optimization problems, the so-called point by point optimization. Essentially, this is a technique where the search for an optimal solution is done locally, by gradually improving any currently available candidate for the optimal choice of parameters.
This local search can be done, for example, by using the gradient descent method ([5], section 4.3), which does not guarantee that we will reach the desired minimum, or even a significant relative minimum. Since virtually all significant supervised machine learning methods can be shown to be equivalent to point by point optimization [16], we will briefly describe the gradient descent method.

If $F(X)$ is a real-valued multi-variable function, its gradient $\nabla F$ is the vector that gives the slope of its tangent oriented towards the direction in which it increases most. The gradient descent method exploits this fact to obtain a sequence of values of $F$ which converges to a minimum. Indeed, if we take $K_n$ small enough and set

$$x_{n+1} = x_n - K_n \nabla F(x_n) \qquad (n = 0, 1, \ldots)$$

then

$$F(x_0) \geq F(x_1) \geq F(x_2) \geq \ldots$$

and one can hope that this sequence of values converges towards the desired minimum. However, this is only a hope, since nothing in the method can warrant that the minimum it detects is significant.

(Footnote: For example, in [17] a general classification problem from developmental biology is formulated as a path integral akin to those used in quantum mechanics. Such integrals are usually analyzed with the help of perturbative methods such as WKB approximation methods (see [26], Chapter 18).)

4.2 Deep Learning Neural Networks

Let us now further illustrate the idea of optimization as forcing, by considering the paradigmatic example of Deep Learning Neural Networks (we follow here [7], Section 11.3, and [5]) as it applies to the simple case of classification problems.

The basic idea is the same anticipated above for the search of a fitting function $F$ by supervised learning. One starts with a Training Set $(X, Y)$, where $X$ is a collection of $M$ arrays of variables:

$$X = (X_1, \ldots, X_M), \qquad X_i = \left(X_i^{[1]}, \ldots, X_i^{[N]}\right)$$

and $Y$ is a corresponding collection of $M$ variables:

$$Y = (Y_1, \ldots, Y_M).$$
What is specific to deep learning neural networks is the set of specific steps used to recursively build $F$.

1. We build $K$ linear functions

   $$Q^{[k]}(X_i) = A_{0,k} + \sum_{n=1}^{N} A_{n,k} X_i^{[n]} \qquad (k = 1, \ldots, K),$$

   where $A_{n,k}$ are $K(N+1)$ parameters chosen on some a priori criterion, possibly even randomly, and $K$ is a positive integer chosen on the basis of the particular application of the algorithm.

2. One then selects an appropriate non-linear function $G$ (we will say more about how this function is chosen) to obtain $K$ new arrays of variables

   $$H^{[k]}(X_i) = G\left(Q^{[k]}(X_i)\right) \qquad (k = 1, \ldots, K).$$

3. One chooses (as in step 1) a new set of $T(K+1)$ parameters $B_{k,t}$ in order to obtain $T$ linear combinations of the variables $H^{[k]}(X_i)$

   $$Z^{[t]}(X_i) = B_{0,t} + \sum_{k=1}^{K} B_{k,t} H^{[k]}(X_i) \qquad (t = 1, \ldots, T),$$

   where $T$ is a positive integer appropriately chosen in accordance with the particular application of the algorithm.

If we stop after a single application of steps 1-3, the neural network is said to have only one layer (and is, then, “shallow” or not deep). We can set $T = 1$ and the process ends by imposing that all the values of the parameters are suitably modified (in a way to be described shortly) to ensure that

$$Z^{[1]}(X_i) \approx Y_i \qquad (i = 1, \ldots, M).$$

For any new input $\tilde{X}$, we can then define our fitting function $F$ as $F(\tilde{X}) = Z^{[1]}(\tilde{X})$.

In deep networks, steps 1-3 are iterated several times, starting every new iteration from the $M$ arrays $Z^{[t]}(X_i)$ constructed by the previous iteration. This iterative procedure creates several “layers”, by choosing different parameters $A$ and $B$ (possibly of different size as well) at each iteration, with the obvious limitation that the dimension of the output of the last layer $L$ has to match the dimension of the elements of $Y$. If we denote by $Z_L^{[1]}(X_i)$ the output of the last layer $L$, we impose, similarly to the case of one layer, that $Z_L^{[1]}(X_i) \approx Y_i$.

In other words, the construction of a deep learning neural network involves the repeated transformation of an input $X$ by the recursive application of a linear transformation (step 1) followed by a nonlinear transformation (step 2) and then another linear transformation (step 3).

The algorithm is designed to allow learning also in the absence of $Y$ by using $X$ itself, possibly appropriately regularized, in place of $Y$ (auto-encoding). When an independent $Y$ is used, the learning is called ‘supervised’ and provides an instance of the setting described in Section 3.
In the absence of it, the learning is, instead, called ‘unsupervised’ ([5], chapter 14), and its purpose is to find significant patterns and correlations within the set $X$ itself. The possibility of using an algorithm designed for supervised learning also for unsupervised learning is an important shift of perspective. It allows one to constrain the exploration of patterns within $X$, for the sole purpose of regularizing the data themselves. Whichever correlations and patterns are found, they will be instrumental to this specific aim, rather than to the ambiguous task of finding causal relationships within $X$.

Two things remain to be explained. The first concerns the non-linear function $G$, called the ‘activation function’ (because of the origin of the algorithm as a model for neural dynamics). Such a function can take different forms. Two classical examples are the sigmoid function

$$G(u) = \frac{1}{1 + e^{-u}}$$

and the ReLU (Rectified Linear Unit) function

$$G(u) = \max(0, u).$$

This second function is composed of two linear branches and therefore is, mathematically speaking, much simpler than the sigmoid function. While the ReLU is not linear, it has uniform slope on a wide portion of its domain, and this seems to explain its significantly better performance as activation function for deep networks. The use of an appropriate activation function allows the method to approximate any function that is continuous on compact sets in $\mathbb{R}^n$. This result is known as the Universal Approximation Theorem for Neural Networks ([8]).

The second thing to be explained concerns the computation of the parameters according to the condition

$$Z_L^{[1]}(X_i) \approx Y_i \qquad (i = 1, \ldots, M).$$

This is usually done by applying the gradient descent method to minimize a fitness function such as

$$\sum_{i=1}^{M} \left[ Y_i - Z_L^{[1]}(X_i) \right]^2.$$

(Footnote: For classification problems, one often imposes $P(Z^{[1]}(X_i)) \approx Y_i$ $(i = 1, \ldots, M)$, where $P$ is a final, suitably chosen, output function (see [7], Section 11.3). For any new input $\tilde{X}$, the fitting function is, then, $F(\tilde{X}) = P(Z^{[1]}(\tilde{X}))$.)

The gradient is computed by an appropriate fast algorithm adapted to neural networks known as backpropagation ([7], Section 11.4). As we already noted in [14], the effectiveness of neural networks (both deep and shallow) seems to depend more on the specific structure of the backpropagation algorithm than on the universal approximation properties of neural networks. Note also that the minimization of the fitness function is equivalent to a regularization of the final fitting function, if we stop the iterative application of the backpropagation algorithm when the value of the fitness function no longer decreases significantly ([5], Section 7.8).

When dealing with deep networks, one can go as far as considering hundreds of layers, though it is not generally true that increasing the number of layers always improves the minimum of the corresponding fitness function. Nevertheless, in many cases, taking more layers allows up to a tenfold reduction of errors. For example, it has been shown that in a database of images of handwritten digits, classification errors go from a rate of 1.6% for a 2-layer network ([7], section 11.7) to a rate of 0.23% with a network of about 10 layers ([24]).

This short description of the way in which Deep Learning Neural Networks work should be enough to clarify why we have taken them as an example of optimization by forcing. Above all, both the dependence of the effectiveness of neural networks on the structure of the backpropagation algorithm, and their interpretation as regularization are clear indications that the way neural networks are applied is an instance of forcing.

More broadly, the opacity of the recursive process that creates the layers of the network is matched by computationally driven considerations that establish the specific type of gradient descent method to be used, and by a criterion to stop the whole iterative process that is simply based on the inability to find better solutions.
But this same description should also be enough to make clear the second question mentioned at the beginning of the present section: how can methods like deep learning neural networks be appropriate for solving specific problems, when the methods themselves do not reflect in any way the particular features of the problems? We explore this question in the next section.
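
The one-layer construction described in this section (steps 1-3 with $T = 1$, a ReLU activation, and gradient descent stopped when the fitness function no longer decreases significantly) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the works cited: the squared-error fitness function, the step-halving rule used to keep $K_n$ "small enough", the tolerance `tol`, the function names and the toy training set are all choices made here for the example.

```python
import random

def relu(u):
    # Activation function G(u) = max(0, u).
    return u if u > 0.0 else 0.0

def forward(x, A, B):
    # Step 1: K linear functions Q[k] = A[k][0] + sum_n A[k][n+1] * x[n].
    Q = [a[0] + sum(w * xn for w, xn in zip(a[1:], x)) for a in A]
    # Step 2: non-linear activation H[k] = G(Q[k]).
    H = [relu(q) for q in Q]
    # Step 3 (with T = 1): Z = B[0] + sum_k B[k+1] * H[k].
    Z = B[0] + sum(b * h for b, h in zip(B[1:], H))
    return Q, H, Z

def fitness(X, Y, A, B):
    # Assumed fitness function: sum of squared errors over the training set.
    return sum((y - forward(x, A, B)[2]) ** 2 for x, y in zip(X, Y))

def gradient(X, Y, A, B):
    # Backpropagation for this single layer: chain rule from Z back to A and B.
    gA = [[0.0] * len(a) for a in A]
    gB = [0.0] * len(B)
    for x, y in zip(X, Y):
        Q, H, Z = forward(x, A, B)
        dZ = -2.0 * (y - Z)
        gB[0] += dZ
        for k in range(len(A)):
            gB[k + 1] += dZ * H[k]
            dQ = dZ * B[k + 1] * (1.0 if Q[k] > 0.0 else 0.0)
            gA[k][0] += dQ
            for n, xn in enumerate(x):
                gA[k][n + 1] += dQ * xn
    return gA, gB

def train(X, Y, K=4, rate=0.05, tol=1e-9, max_iter=2000, seed=0):
    rng = random.Random(seed)
    N = len(X[0])
    # Parameters chosen randomly, as the text allows.
    A = [[rng.uniform(-1, 1) for _ in range(N + 1)] for _ in range(K)]
    B = [rng.uniform(-1, 1) for _ in range(K + 1)]
    history = [fitness(X, Y, A, B)]
    for _ in range(max_iter):
        gA, gB = gradient(X, Y, A, B)
        while True:
            newA = [[a - rate * g for a, g in zip(ra, rg)] for ra, rg in zip(A, gA)]
            newB = [b - rate * g for b, g in zip(B, gB)]
            new_fit = fitness(X, Y, newA, newB)
            if new_fit <= history[-1] or rate < 1e-12:
                break
            rate /= 2.0  # step too large: take a smaller K_n
        if not (new_fit <= history[-1]):
            break  # no admissible step left
        A, B = newA, newB
        history.append(new_fit)
        if history[-2] - history[-1] < tol:
            break  # fitness no longer decreases significantly: stop early
    return A, B, history

# Toy training set (hypothetical): learn Y = |X| from five sample points.
X = [[-1.0], [-0.5], [0.0], [0.5], [1.0]]
Y = [1.0, 0.5, 0.0, 0.5, 1.0]
A, B, history = train(X, Y)
```

The list `history` records the values of the fitness function along the iteration, which are non-increasing by construction. Nothing in the sketch guarantees that the value at which the iteration stops is a significant minimum, and a different `seed` may lead to a different fitting function.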

A simple way to elude the question of the appropriateness of blind methods is by negating its premise: one can argue that, in fact, blind methods are in no way appropriate; that their success is nothing but appearance and that the faith in their success is actually dangerous, since such faith provides an incentive to the practice of accepting illusory forecasts and solutions.

The problem with this argument is that it ultimately depends on arguing that blind methods do not succeed since they do not conform with the pattern of classical science. But an objective look at the results obtained in data science should be enough to convince ourselves that this cannot be a good strategy. Of course, to come back to the example of DNA microarrays, grounding cancer therapy only on microarrays is as inappropriate as it is dangerous, since, in a domain like that, looking for causes is as crucial as it is necessary. But we cannot deny the fact that microarrays can also be used as an evidential basis in a search for causes. Nor can we deny that, in many successful applications of blind methods, as in handwriting recognition, the search for causes is much less crucial.

So we need another justification of the effectiveness of blind methods, which, far from negating the appropriateness question, takes it seriously and challenges the assumption that classical science is the only appropriate pattern for good science. Such an approach cannot depend, of course, on the assumption that blind methods succeed because they perform appropriate optimization. This assumption merely displaces the problem, since optimization is only used by these methods as a device to find possible solutions.

A more promising response to the question of the appropriateness of blind methods might be that they succeed for the same reason as classical induction does: blind methods are indeed interpolation methods on input/output pairs, followed by analytical continuation, which is how induction works.
Of course, one could argue that induction itself is not logically sound. But could one really reject it as an appropriate method in science because of this? Is there another way to be empiricist, other than trusting induction? And can one really defend classical science without accepting some form of empiricism, as refined as it might be? We think all these questions should be answered in the negative, and therefore we believe the objection itself to be immaterial.

There are, however, two other important objections to this response.

The first is that it applies only to supervised methods, that is, methods based on the consideration of a training set on which interpolation is performed. It does not apply, at least not immediately, to unsupervised ones, where no sort of induction is present. However, this objection is superseded by noting that it is possible to reduce unsupervised methods to supervised ones through the auto-encoding regularization processes described above.

The second objection is more relevant. It consists in recognizing that, when forcing is at work, interpolation is restricted to a space of functions which is not selected by considering the specific nature of the relevant phenomenon, and that cannot be justified by any sort of induction. The choice of the functional space, rather, corresponds to a regularization of the data and it often modifies those data in a way that does not reflect, mathematically, the phenomenon itself.

This objection is not strong enough to force us to completely dismiss the induction response. But it makes it clear that advocating the power of induction cannot be enough to explain the success of agnostic science.
This general response is, at least, to be complemented by a more specific and stronger approach.

In the remainder of this section, we would like to offer the beginning of a new perspective, which is consistent with our interpretation of the structure of blind methods.

The basic idea is to stop looking at the appropriateness question as a question concerning some kind of correspondence between phenomena and specific solutions found by blind methods. The very use of forcing makes this perspective illusory. We should instead look at the question from a more abstract and structural perspective. Our conjecture, already advanced in [15, 16], is that we can find a significant correspondence between the structure of the algorithms used to solve problems, and the way in which phenomena of interest in data science are selected and conceived. We submit, indeed, that the (general) structure of blind methods, together with the formal features of the Microarray Paradigm, exert strong restrictions on the class of data sets that agnostic science deals with.

To justify the existence of these restrictions, we start by recalling a result of [16], that all blind methods share a common structure that conforms to the following prescription:

An algorithm that approaches a steady state in its output has found a solution to a problem, or needs to be replaced.

In [16] we called this prescription ‘Brandt’s Principle’, to reflect the fact that it was first expounded by Achi Brandt for the restricted class of multiscale algorithms ([2]).

As simple as Brandt’s Principle appears at first glance, in [16] we showed that this principle allows a comprehensive and coherent reflection on the structure of blind methods in agnostic science. First of all, Brandt’s Principle is implicit in forcing, since an integral idea in forcing is that if an algorithm does not work, another one is to be chosen. But, more specifically, the key to the power of this principle is that the steady state output of each algorithm, when it is reached, is chosen as input of the next algorithm, if a suitable solution to the initial problem has not yet been found.

Notably, deep learning architecture matches Brandt’s Principle, since the iteration of the gradient descent algorithm is generally stopped when the improvement of parameters reaches a steady state and, then, either the function that has been obtained is accepted and used for forecasting or problem-solving, or the algorithm is replaced by a new one, or at least re-applied starting from a new assignment of values to the initial parameters. More than that, all local optimization methods satisfy this principle. And since most algorithms in agnostic data science can be rewritten as local optimization methods, we can say that virtually all algorithms in agnostic data science do.

In [16] we argued that thinking about algorithms in terms of Brandt’s principle often sheds light on those characteristics of a specific method that are essential to its success.
For example, the success of deep learning algorithms, as we have seen in the previous section, relies in a fundamental way on two advances: (1) the use of the ReLU activation function that, thanks to its constant slope for nonzero arguments, allows the fast exploration of the parameter space with gradient descent; and (2) a well defined regularization obtained by stopping the gradient descent algorithm when error rates do not improve significantly anymore. Both these advances took a significant time to be identified as fundamental to the success of deep learning algorithms, perhaps precisely because of their deceptive simplicity, and yet both of them are naturally derived from Brandt’s principle.

There is, however, a possible objection against ascribing a decisive importance to this principle (one that is in the same vein as the one we discussed in section 2, considering an argument from [4]). This objection relies on the observation that, in practical applications, agnostic science works with floating-point computations, which require a finite set of floating-point numbers. The point, then, is that any iterative algorithm on a finite set of inputs reaches a limit cycle in finite time, in which case also steady states satisfying Brandt’s principle become trivial and uninformative about the nature of the underlying phenomenon.

However, blind methods that satisfy Brandt’s principle, such as boosting algorithms and neural networks, will usually converge to steady states after just a few thousand iterations. Limit cycles in an algorithm’s output due to the limitations of floating-point arithmetic will instead appear after a very large number of iterations, comparable to the size of the set of floating-point numbers.
Any practical application of Brandt’s principle needs to take this into consideration, by imposing, for example, that the number of iterations necessary to reach a steady state is at most linear in the size of the training set.

Regardless of this practical limitation in recognizing steady states, the real significance of Brandt’s principle for supervised learning algorithms is that it shifts the attention from the optimization of the fitting function $F$ to the study of the dynamics of the algorithms’ output. In this perspective, applying Brandt’s Principle depends on building sequences of steady-state fitting functions $\{F_j\}$ that are stable in their performance under the deformations induced by the algorithms that we choose to use during the implementation of the principle itself. This generates a space of fitting functions that are mapped into each other by the different algorithms.

A clear example of this process is provided by boosting, where the entire family of fitting functions (that is recursively found) is robust in its performance, once the classification error on the training set stabilizes. Moreover, fitting functions found by the algorithm at each recursive step are built by combining all those that were found earlier (by weighing and adding them suitably). Indeed, boosting is an instance of “ensemble methods”, where distinct fitting functions are combined to find a single final fitting function. It can be argued that a lot of the recent progress in data science has been due precisely to the recognition of the central role of ensemble methods, such as boosting, in greatly reducing error rates when solving problems ([7], Chapter 16).
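
As an illustration of how boosting fits this description, the following Python sketch runs a discrete AdaBoost-style procedure on one-dimensional data, using single-threshold binary decisions ("stumps") as the weak classifiers, and stops when the classification error on the training set has reached a steady state. It is a minimal sketch under assumptions made only for this example: the particular weighting rule, the `patience` parameter that defines the steady state, the function names and the toy data set are not taken from the sources cited above.

```python
import math

def stump(x, thr, pol):
    # A single binary decision: predict `pol` below the threshold, `-pol` otherwise.
    return pol if x < thr else -pol

def best_stump(xs, ys, w):
    # Return (weighted_error, thr, pol) for the stump with the least weighted error.
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    best = None
    for thr in thresholds:
        for pol in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w) if stump(x, thr, pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best

def ensemble_predict(x, stumps):
    # The fitting function: a weighted sum of all stumps found so far.
    score = sum(alpha * stump(x, thr, pol) for alpha, thr, pol in stumps)
    return 1 if score >= 0 else -1

def boost(xs, ys, patience=3, max_rounds=50):
    m = len(xs)
    w = [1.0 / m] * m
    stumps, errors = [], []
    for _ in range(max_rounds):
        err, thr, pol = best_stump(xs, ys, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        stumps.append((alpha, thr, pol))
        # Put more weight on the points misclassified by the latest stump.
        w = [wi * math.exp(-alpha * y * stump(x, thr, pol)) for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        mistakes = sum(1 for x, y in zip(xs, ys) if ensemble_predict(x, stumps) != y)
        errors.append(mistakes / m)
        # Steady state: the training error is unchanged over the last `patience` rounds.
        if len(errors) >= patience and len(set(errors[-patience:])) == 1:
            break
    return stumps, errors

# Toy data (hypothetical): no single threshold separates the two classes.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1]
stumps, errors = boost(xs, ys)
```

In the spirit of Brandt's Principle, once `errors` has stabilized one either accepts the combined fitting function (here, if the stabilized training error is acceptable) or replaces the algorithm, for instance by choosing a richer family of weak classifiers.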
Note that for ensemble methods to be trustworthy, eventually they must stabilize to a fitting function that does not significantly change with the addition of more sample points. Therefore, if we take the limit of an infinite number of sample points, generically distributed across the domain where the relevant problem is well defined, the fitting function is forced to be unique.

To understand the significance of this instance of forcing, we first rephrase the basic slogan of the Microarray Paradigm in more precise terms as a ‘Quantitative Microarray Paradigm’:

Given enough sample data points, and for a large and diverse enough set of variables $X$, the value of any other variable $Y$ relevant for the solution of a given problem can be generally calculated from the value of $X$ (via the fitting function $F(X) = Y$).

Once this assumption is admitted, the uniqueness of $F$ in the limit entails a form of analyticity on the fitting function which we call ‘asymptotic sample-analyticity’:

Let $N$ be the dimension of $X$; then, for $N$ sufficiently large, $F(X) = Y$ is uniquely determined, on an appropriate domain, by a generic infinite set of sample points.

In application, we will always have finite data, and we will not be able to choose $F$ uniquely to solve a problem. But suppose that the same solution is given by the entire class of asymptotically sample-analytic functions that are compatible with the available data. Do we trust that such a solution is reflective of a relevant actual property of the phenomenon at issue? The question shifts the attention from the nature of data sets to the nature of the space of functions defined on them, and on their assemblage by appropriate algorithms.

Our suggestion is that blind methods succeed when (and because) they select, in agreement with Brandt’s principle, appropriate classes of asymptotically sample-analytic functions, apt to robustly provide uniquely determinate solutions for the data problems at issue.

Despite this shift from data sets to functions on data sets, the properties of such functions enforce some general conditions on the data as well. First of all, the Quantitative Microarray Paradigm essentially requires variables in the data set to be strongly interdependent in the limit of large data sets. Second, we claimed at the end of Section 5.2 that Brandt’s Principle identifies a space (ensemble) of fitting functions that are stable in their ability to solve a given problem.
Such stability is possible only if the functional relations that can be defined on the variables are themselves robust, so that they persist across the large data sets that are required by the microarray paradigm. It is not important for this interdependence and robustness to be apparent, that is, we do not need to be able to identify the specific dependence of variables on each other. Nor is it important for robustness to warrant a long-term conservation of specific relations among the variables. What is relevant is that the interdependence is strong and persistent enough to allow iterative algorithms conforming to Brandt’s Principle to successively correct their outputs by building more and more convenient fitting functions.

The requirements of interdependence of variables and robustness of functional relations among them can now be used to discriminate data sets most suitable for the application of blind methods. For example, such requirements are believed to be satisfied by developmental biological systems (see [10]). Moreover, in [16] we gave evidence that social and economical systems satisfy a generalization of the ‘principle of Developmental Inertia’, an organizing principle for developmental biology (first proposed in [11]). It is therefore likely that the [...]

(Footnote: Note that sample-analyticity can be forced on any discrete variable in the problem after imposing continuity on such variables.)

In this paper we reviewed and extended a perspective on the methodological structure of data science which we have been building in a series of papers [13, 14, 15, 16]. The basic assumption of our approach is that data science is a coherent approach to empirical problems that in its most general form does not build understanding about phenomena. Because of this characteristic, we labelled this approach to empirical phenomena ‘agnostic science’, and called the methods that make up agnostic science ‘blind methods’.

The basic attitude underlying agnostic science is the belief that if enough and sufficiently diverse data are collected regarding a certain phenomenon, it is possible to answer all relevant questions about it. In Section 2 we referred to this belief as the microarray paradigm and we explored the specific ways it is used in the practice of machine learning.

We noted in Section 3 that not all computational methods dealing with large data sets are properly within the domain of agnostic science, and we gave the example of PageRank, an algorithm used to weight webpages. The convergence of this algorithm and the significance of its output are readily intelligible and therefore we argued that PageRank is not a blind method.

In Section 4.1 we explored how the microarray paradigm calls for a new type of mathematization in agnostic science, where mathematical methods are forced on the problem, i.e., they are applied to a specific problem only on the basis of their ability to reorganize the data for further analysis by general purpose techniques that are selected only on the basis of the richness of their mathematical structure, rather than by any particular relevance for the problem at hand. We then showed that optimization methods are used in data science as a form of forcing. This is particularly significant since virtually all methods of data science can be rephrased as a type of optimization method.
In particular, in Section 4.2 we argued that deep learning neural networks are best understood within the context of forcing optimality.

In Section 5 we moved to the broader question of the appropriateness of blind methods in solving problems. In Section 5.1 we argued that this question should not be interpreted as a search for a correspondence between phenomena and specific solutions found by blind methods. Rather, it is the internal structure of blind methods that should be understood, and its implications on the structure of the data sets that are most appropriate for such methods.

To this extent, we reviewed in Section 5.2 a simple prescription on algorithms, Brandt’s principle, which asserts that an algorithm that approaches a steady state in its output has found a solution to a problem, or needs to be replaced. One of our main claims in [16] was that Brandt’s principle is ideally suited to the understanding of the dynamical structure of blind methods. For example, in Section 5.2 we used Brandt’s principle to understand two of the significant innovations of deep learning neural networks: the use of the ReLU activation function in the network; and an efficient criterion for early stopping of the algorithm. Ensemble methods, where distinct fitting functions are combined to find a single final fitting function, can also be interpreted within the context of Brandt’s principle.

In Section 5.3 we showed that Brandt’s principle and the microarray paradigm force a specific type of analytical structure, which we call ‘sample-analyticity’, on the final fitting function found by ensemble methods. And we argued that sample-analyticity forces a shift from data sets to functions on data sets.
In turn, the properties of such functions enforce two general conditions on the data sets: a strong interconnectedness of the variables of the data set; and the robustness of the functional relations of such variables.

We finally speculated that blind methods are most appropriate for the solution of problems in biological, social and economical systems, since data sets arising from these systems are likely to satisfy the two conditions above.

Acknowledgments

We thank Maxime Darrin for reading and commenting on a preliminary version of this paper, and the anonymous referees for many detailed and useful remarks.

References

[1] Dhammika Amaratunga, Javier Cabrera, Ziv Shkedy. Exploration and Analysis of DNA Microarray and Other High-Dimensional Data. John Wiley & Sons, Hoboken, 2014.

[2] A. Brandt, Multiscale Scientific Computation: Review 2001, in T. J. Barth, T. F. Chan, R. Haimes (eds.), Multiscale and Multiresolution Methods: Theory and Applications, Springer Verlag, Berlin-Heidelberg, 2002, pp. 3-95.

[3] S. Brin & L. Page, The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998, pp. 107-117.

[4] C. Calude, G. Longo, The Deluge of Spurious Correlations in Big Data, Foundations of Science, 2017, pp. 595–612.

[5] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, The MIT Press, 2016.

[6] Ronald L. Graham, Bruce L. Rothschild, Joel H. Spencer. Ramsey Theory. John Wiley & Sons, Hoboken, 2015.

[7] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics), Springer, 2nd edition (2016).

[8] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks, 1991, issue 2, pp. 251–257.

[9] G. Masterton, E. J. Olsson & S. Angere, Linking as voting: how the Condorcet jury theorem in political science is relevant to webometrics. Scientometrics, 2016, 3, pp. 945-966.

[10] Minelli, Alessandro. The Development of Animal Form: Ontogeny, Morphology, and Evolution. Cambridge University Press, Cambridge, 2003.

[11] Minelli, Alessandro. 2011. A principle of Developmental Inertia. In Epigenetics: Linking Genotype and Phenotype in Development and Evolution, eds B. Hallgrimsson and B. K. Hall. Berkeley, CA: University of California Press.

[12] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg and Demis Hassabis. Human-level control through deep reinforcement learning. Nature volume 518, pages 529–533 (26 February 2015).

[13] D. Napoletani, M. Panza, and D.C. Struppa, Agnostic science. Towards a philosophy of data analysis, Foundations of Science, 2011, pp. 1–20.

[14] D. Napoletani, M. Panza, and D.C. Struppa. Processes rather than descriptions? Foundations of Science, August 2013, Volume 18, Issue 3, pp. 587–590.

[15] D. Napoletani, M. Panza, and D.C. Struppa, Is big data enough? A reflection on the changing role of mathematics in applications. Notices of the American Mathematical Society 61, 5, pp. 485–490, 2014.

[16] D. Napoletani, M. Panza, and D.C. Struppa, Forcing Optimality and Brandt’s Principle, in J. Lenhard and M. Carrier (ed.), Mathematics as a Tool, Boston Studies in the Philosophy and History of Science 327, Springer, 2017.

[17] D. Napoletani, E. Petricoin, D. C. Struppa. Geometric Path Integrals. A Language for Multiscale Biology and Systems Robustness. The Mathematical Legacy of Leon Ehrenpreis, Springer Proceedings in Mathematics, Volume 16, 2012, pp. 247-260.

[18] D. Napoletani, M. Signore, T. Sauer, L. Liotta, E. Petricoin. Homologous Control of Protein Signaling Networks. Journal of Theoretical Biology, Volume 279, Issue 1, 21, 2011.

[19] L. Page, S. Brin, R. Motwani, & T. Winograd, The PageRank citation ranking: bringing order in the Web. Manuscript to be found at http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf.

[20] M. Panza, De la nature épargnante aux forces généreuses. Le principe de moindre action entre mathématiques et métaphysique : Maupertuis et Euler (1740-1751), Revue d’Histoire des sciences, 1995, pp. 435-520.

[21] M. Panza, The Origins of Analytical Mechanics in 18th century, in H. N. Jahnke (ed.), A History of Analysis, American Mathematical Society and London Mathematical Society, s.l., 2003, pp. 137-153.

[22] J. Ramsay, B. W. Silverman,