Emergent limits of an indirect measurement from phase transitions of inference
Satoru Tokuda,1,2,∗ Kenji Nagata,3 and Masato Okada2,3

1 Mathematics for Advanced Materials – Open Innovation Laboratory, AIST, c/o Advanced Institute for Materials Research, Tohoku University, Sendai, Miyagi 980-8577, Japan
2 Department of Complexity Science and Engineering, The University of Tokyo, Kashiwa, Chiba 277-8561, Japan
3 Center for Materials Research by Information Integration, National Institute for Materials Science, Tsukuba, Ibaraki 305-0047, Japan

(Dated: January 7, 2020)

Measurements are inseparable from inference, where the estimation of signals of interest from other observations is called an indirect measurement. While a variety of measurement limits have been defined by the physical constraint on each setup, the fundamental limit of an indirect measurement is essentially the limit of inference. Here, we propose the concept of statistical limits on indirect measurement: the bounds of distinction between signals and noise and between a signal and another signal. By developing the asymptotic theory of Bayesian regression, we investigate the phenomenology of a typical indirect measurement and demonstrate the existence of these limits. Based on the connection between inference and statistical physics, we also provide a unified interpretation in which these limits emerge from phase transitions of inference. Our results could pave the way for novel experimental design, enabling assessment of the required quality of observations according to the assumed ground truth before the concerned indirect measurement is actually performed.

∗ [email protected]

I. INTRODUCTION
Measurements are the foundation of science and engineering. Until recently, it had been implicitly believed that measurements are bounded only by physical constraints. In reality, however, measurements are also inseparable from inference. The standard quantum limit had been regarded as a measurement limit. Some recent studies, however, showed that this limit can be surpassed, in terms of inference, by considering the quantum-mechanical nature of a photon, such as two-photon [1–5] and four-photon processes [6–8]. The diffraction limit had been regarded as a limit of an optical measurement. Some recent studies, however, showed that this limit can be surpassed, in terms of inference, by considering the kind of point spread function in the experiment [9–12].

The fundamental limit of an indirect measurement, which estimates signals of interest from other observations, is essentially the limit of inference. Compressed sensing involves an inference of sparse source signals from compressively mixed signals [13, 14]. Such an inference can surpass the Nyquist sampling rate, while it has a typical limit emerging from the phase transition of inference [15–20]. On the basis of the connection between statistical inference and statistical physics [21–24], it has been shown that many other high-dimensional statistical inferences, such as error-correcting codes [25], the perceptron [26], community detection [27, 28], and matrix factorization [29], also have typical limits emerging from phase transitions. Watanabe showed that essentially the same phenomena as phase transitions exist even in low-dimensional statistical inference and provided a different definition of phase transitions in statistical inference [30, 31].

In this study, we propose the concept of statistical limits for an indirect measurement.
The statistical limits are classified into two types: bounds of distinction between signals and noise (signal detection limit, SDL) and between a signal and another signal (signal resolution limit, SRL). To demonstrate the existence of these limits, we investigate the phenomenology of a typical indirect measurement by developing the asymptotic theory of Bayesian regression with conditionally independent observations. From the perspective of statistical physics, we also provide a unified interpretation in which these limits emerge from phase transitions of inference and show the phase diagram described by the magnitude of the noise and the degree of overlap between each signal.
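Before turning to the formal setup of Sec. II, the kind of indirect measurement analyzed below can be emulated in a few lines: Gaussian source signals are superposed and observed under additive noise, and the task of inference is to recover the sources. A minimal forward simulation (all parameter values here are hypothetical, not those of the paper's figures):

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, mu, rho):
    # A Gaussian source signal: phi(x; mu, rho) = exp(-rho * (x - mu)^2)
    return np.exp(-rho * (x - mu) ** 2)

def f(x, a, mu, rho):
    # Superposition of K source signals with amplitudes a_k
    return sum(a_k * phi(x, mu_k, rho_k) for a_k, mu_k, rho_k in zip(a, mu, rho))

# Hypothetical ground truth: two strongly overlapping source signals
a_star, mu_star, rho_star = [1.0, 1.0], [-0.25, 0.25], [1.0, 1.0]
b_star = 100.0                               # noise variance is 1 / b_star
n = 101
x = np.linspace(-5.0, 5.0, n)                # equally spaced observation points
noise = rng.normal(0.0, 1.0 / np.sqrt(b_star), size=n)
Y = f(x, a_star, mu_star, rho_star) + noise  # the observable mixed signals D_n
```

The inference problem is then to recover the number K of sources and their parameters from (x, Y) alone; the rest of the paper quantifies when this is typically possible.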
II. RESULTS

A. Setup
We study an indirect measurement of source signals via mixed signals (Fig. 1). Suppose the set {φ_k}_{k=1}^{K} of K source signals, defined as a Gaussian function

  φ_k(x; μ_k, ρ_k) := exp(−ρ_k (x − μ_k)²)   (1)

with x ∈ ℝ, μ_k ∈ ℝ, and ρ_k ≥ 0. This type of unimodal source signal is ubiquitous in various fields, e.g., the spectral line shape in spectroscopy, the pulse shape in electronics, and the point spread function in optics.

The set of n mixed signals {Y_i}_{i=1}^{n} with noise is observed as conditionally independent random variables subject to the conditional probability density function

  p(y_i | x_i, w, b) := √(b/(2π)) exp(−(b/2)(y_i − f(x_i; w))²)   (2)

with the observation noise variance b⁻¹ and x_i = x_{i−1} + Δx for Δx := |x_n − x_1|/(n − 1). The superposition of the K source signals is

  f(x; w) := Σ_{k=1}^{K} a_k φ_k(x; μ_k, ρ_k)   (3)

with a_k ≥ 0 and w := {a_k, ρ_k, μ_k}_{k=1}^{K}. Note that f(x; w) = 0 is defined for K = 0, which corresponds to the case of no source signal.

Suppose that D_n := {x_i, Y_i}_{i=1}^{n} is a sample taken from p(y_i | x_i, w∗, b∗) with the ground truth w = w∗, K = K∗ and b = b∗. In other words, the ground truth is unobservable while D_n is observable. It is considered that {φ_k}_{k=1}^{K∗} is indirectly observed by estimating w, K and b from D_n. By means of Bayesian inversion, w is estimated as the posterior distribution

  p(w | D_n, b, K) = (1/p(D_n | b, K)) Π_{i=1}^{n} p(Y_i | x_i, w, b) ϕ(w | K) ∝ exp(−nb E_n(w)) ϕ(w | K)   (4)

with the mean square error

  E_n(w) := (1/(2n)) Σ_{i=1}^{n} (Y_i − f(x_i; w))²,   (5)

where ϕ(w | K) is some prior distribution and p(D_n | b, K) is the marginal likelihood, with b and K as hyperparameters. This is a straightforward extension of the least squares method. In the same manner as in our previous work [32], b and K are estimated by maximizing p(D_n | b, K) or, equivalently, by minimizing the function F̃_n(b, K) := −log p(D_n | b, K), which is called the Bayes free energy.

B. Self-averaging property
There are two kinds of random variables, {Y_i}_{i=1}^{n} and w, in our setup. We first consider the typicality of our setup, which is independent of the realization of {Y_i}_{i=1}^{n}, by means of the conditional expectation over {Y_i}_{i=1}^{n} | {x_i}_{i=1}^{n}: [···] := ∫ (···) Π_{i=1}^{n} p(y_i | x_i, w∗, b∗) dy_i. We derived the relation

  E_n(w) = [E_n(w)] + O_p(1/(√n b∗))   (6)

with

  [E_n(w)] = 1/(2b∗) + (1/(2n)) Σ_{i=1}^{n} (f(x_i; w) − f(x_i; w∗))².   (7)

This relation means that E_n(w) = [E_n(w)] holds for n → ∞. This property is called self-averaging, a concept originally proposed in the statistical physics of disordered systems [33]. In practice, it is difficult to ignore the term O_p((√n b∗)⁻¹) as b∗ ≪ 1, i.e., when the observation is too noisy. Notably, [E_n(w∗)] = (2b∗)⁻¹ always holds, and [E_n(w∗)] = 0 especially holds for b∗ → ∞ as the noiseless limit.

We also proved that F̃_n(b, K) is self-averaging; i.e., F̃_n(b, K) = [F̃_n(b, K)] holds for n → ∞. This self-averaging property is precisely given by the asymptotic form

  F̃_n(b, K) = b′ F̃(b′, K; w∗) − (n/2) log(b/(2π)) + (b/(2b∗)) R_n   (8)

with

  F̃(b′, K; w∗) := −(1/b′) log ∫ exp(−(b′/2) H(w; w∗)) ϕ(w | K) dw   (9)

for b′ := b/Δx as the inverse of the effective noise variance and the Riemann integral

  H(w; w∗) := ∫_{x_1}^{x_n} (f(x; w) − f(x; w∗))² dx,   (10)

where R_n is a random variable following the chi-square distribution with n degrees of freedom; R_n/n → 1 holds for n → ∞, while [R_n] = n always holds. Note that H(w; w∗) can be expressed more explicitly (see Supplementary note 1). In the sense of a saddle point approximation for w = w∗ as n → ∞, we obtain F̃(b′, K; w∗) = 0 for ϕ(w∗ | K) > 0, and b = b∗ is a stationary point of F̃_n(b, K). This means that the b minimizing F̃_n(b, K) converges to b∗ as n → ∞; i.e., the estimator of b is consistent.

We should mention that the K minimizing F̃(b′, K; w∗) is equal to the K minimizing F̃_n(b, K) when b is fixed. There can be a point at the intersection of F̃(b′, K; w∗) with F̃(b′, K′; w∗) for K ≠ K′; a phase transition of statistical inference can occur with a variation in b′. Note that F̃(b′, K; w∗) enables us to access the typicality of such a phenomenon, which is independent of the realization of {Y_i}_{i=1}^{n} for n → ∞.

C. Bayes specific heat
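For a toy one-parameter model, the Bayes free energy and the posterior fluctuation that this section formalizes as a specific heat reduce to one-dimensional integrals that can be evaluated on a grid; a minimal numerical illustration (the model, the Gaussian prior, and all values are hypothetical; the tempered-posterior convention exp(−(b′/2)H) follows Eq. (9)):

```python
import numpy as np

# Toy one-parameter model f(x; a) = a * exp(-x^2) with hypothetical ground
# truth a* = 1: H(a; a*) reduces to c * (a - a*)^2 with c = \int phi(x)^2 dx.
x = np.linspace(-5.0, 5.0, 1001)
dx = x[1] - x[0]
a_star = 1.0
c = np.sum(np.exp(-2.0 * x ** 2)) * dx

a = np.linspace(-3.0, 3.0, 20001)            # parameter grid
da = a[1] - a[0]
H = c * (a - a_star) ** 2                    # H(a; a*), cf. Eq. (10)
prior = np.exp(-0.5 * a ** 2) / np.sqrt(2.0 * np.pi)  # assumed Gaussian prior

def free_energy_and_heat(bp):
    # Free energy -(1/b') log Z and the scaling function
    # Lambda = (b'/2)^2 (<H^2> - <H>^2), cf. Eqs. (9) and (13).
    weight = np.exp(-0.5 * bp * H) * prior
    Z = np.sum(weight) * da
    post = weight / Z
    mH = np.sum(post * H) * da
    mH2 = np.sum(post * H ** 2) * da
    return -np.log(Z) / bp, (0.5 * bp) ** 2 * (mH2 - mH ** 2)

# In the noisy regime the posterior follows the prior (Lambda ~ 0); in the
# regular regime Lambda approaches d/2 = 1/2 for this d = 1 model.
_, lam_noisy = free_energy_and_heat(1e-3)
_, lam_regular = free_energy_and_heat(1e4)
```

The large-b′ value d/2 for a d-dimensional regular model is the one-parameter analogue of the 3K/2 plateau discussed in this section.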
We quantify a kind of fluctuation in the indirect measurement of w via D_n by means of the conditional expectation over w | D_n: ⟨···⟩ := ∫ (···) p(w | D_n, b, K) dw. We introduced the Bayes specific heat, a quantity characterizing the phase of statistical inference:

  C̃_n(b, K) = (nb)² (⟨E_n(w)²⟩ − ⟨E_n(w)⟩²).   (11)

Note that Eq. (11) is derived from our general definition of the Bayes specific heat in statistical inference (see Supplementary note 2).

We proved that C̃_n(b, K) = [C̃_n(b, K)] holds especially for n → ∞ with b = O(b∗/log n); i.e., C̃_n(b, K) is conditionally self-averaging. This conditionally self-averaging property is precisely given by the finite-size scaling relation

  C̃_n(b, K) = Λ(b′, K; w∗) + O_p(max((nb)⁻¹, b/b∗, (n√b∗)⁻¹))   (12)

with the scaling function

  Λ(b′, K; w∗) := (b′/2)² (⟨H(w; w∗)²⟩ − ⟨H(w; w∗)⟩²).   (13)

Note that Eq. (12) means that the bivariate function of (n, b) is effectively reduced to the univariate function of b′ for n → ∞ with b = O(b∗/log n). The quantity C̃_n(b, K) is not necessarily self-averaging under the condition b = b∗, corresponding to the Nishimori line [34, 35]; i.e., C̃_n(b∗, K) = Λ(b∗/Δx, K; w∗) + O_p(1) holds. In practice, it is difficult to ignore the term O_p((n√b∗)⁻¹) as b∗ ≪ 1, i.e., when the observation is too noisy.

There is a remarkable junction between our physical insight and Watanabe's theory of statistical inference [31, 36–41]. The value of Λ(b′, K; w∗) ≥ 0 always holds, and Λ(b′, K; w∗) = 3K/2 holds if p(w | D_n, b, K) is sufficiently approximated by a Gaussian distribution (regular case) [31, 36–38]. We should mention that Λ(b′, K; w∗) enables us to characterize the typical phase of statistical inference, which is independent of the realization of {Y_i}_{i=1}^{n} for n → ∞, while Eq. (12) means that Λ(b′, K; w∗) is surely observable only for b = O(b∗/log n). This constraint is related to Watanabe's corollary [41], while the difference between the noise variance and temperature should also be considered (see Supplementary notes 2 and 3).

D. Signal detection limit
Here, we show a paradigmatic example of the SDL. Suppose that one source signal constitutes the ground truth (Fig. 2a), i.e., f(x; w∗) = a∗_1 φ_1(x; μ∗_1, ρ∗_1), and that the mixed signals, i.e., some realization of D_n, are obtained in the presence of observation noise of some magnitude. If the magnitude of the noise is large enough, no signals can be detected (Fig. 2b). If the magnitude of the noise is small enough, the source signal is successfully detected (Fig. 2c). These results are actually demonstrated by performing a Monte Carlo (MC) simulation (see Methods section) and imply the existence of an SDL as a critical magnitude of the noise, which specifies whether or not the source signal is detectable.

To elucidate that there typically exists an SDL, which is independent of the realization of {Y_i}_{i=1}^{n}, we calculated F̃(b′, K; w∗) and Λ(b′, K; w∗) for w∗ of the above example. Phenomenologically, the value of K that minimizes F̃(b′, K; w∗) changes from 0 to 1 around b′ = b′_sn (Fig. 3a), while Λ(b′, 1; w∗) has a peak around b′ = b′_sn (Fig. 3b); there is a phase transition of statistical inference, and b′_sn corresponds to the SDL. The magnitudes of the noise in the cases of Fig. 2b and Fig. 2c correspond to b′ = b′_1 and b′ = b′_2 (Fig. 3); these cases are in different phases, and there exists an SDL as the boundary between these phases under the condition b = b∗.

This phase transition is also understandable as a variation in the posterior distribution (Fig. 4). While p(w | D_n, b, 1) at b′ = b′_1 (Fig. 4a-4c) is fairly consistent with ϕ(w | 1), p(w | D_n, b, 1) at b′ = b′_2 is sufficiently approximated by a Gaussian distribution with mean w∗ (Fig. 4g-4i). In addition, p(w | D_n, b, 1) at b′ = b′_sn represents the intermediate state (Fig. 4d-4f) between these two states. Correspondingly, Λ(b′, 1; w∗) is fairly consistent with 0 and 1.5 (= 3K∗/2) for b′ ≪ b′_sn and b′ ≫ b′_sn, respectively (Fig. 3b); the case of K = 1 is regular at b′ = b′_2.

We should mention that this phenomenon is essentially the same as, but slightly different from, the phenomenon called freeze-out [46]. Freeze-out was investigated only in the case where the sample is independently and identically distributed; our setup corresponds to the conditionally independent case, where the self-averaging property needs to be considered. Freeze-out occurs with a variation in the sample size, which is generally different from a variation in the magnitude of the noise. One of our contributions in terms of the general aspect of statistical inference is the derivation of Eq. (12), i.e., the finite-size scaling relation between the sample size and the magnitude of the noise. In a strict sense, finite-size effects should also be considered (see Supplementary note 3).

E. Signal resolution limit
Here, we show a paradigmatic example of the SRL. Suppose that two strongly overlapping source signals constitute the ground truth, i.e., f(x; w∗) = a∗_1 φ_1(x; μ∗_1, ρ∗_1) + a∗_2 φ_2(x; μ∗_2, ρ∗_2), which appears almost as one (Fig. 5a), and that the mixed signals, i.e., some realization of D_n, are obtained in the presence of observation noise of some magnitude. If the magnitude of the noise is large enough, one source signal is detected (Fig. 5b), which is statistically optimal but a misestimation of the ground truth. If the magnitude of the noise is small enough, two source signals are successfully detected and resolved (Fig. 5c). These results are actually demonstrated by performing an MC simulation (see Methods section) and imply the existence of an SRL as the critical magnitude of the noise, which specifies whether or not the strongly overlapping source signals are resolvable.

To elucidate that there typically exists an SRL, which is independent of the realization of {Y_i}_{i=1}^{n}, we calculated F̃(b′, K; w∗) and Λ(b′, K; w∗) for w∗ of the above example. As in Sec. II D, the value of K that minimizes F̃(b′, K; w∗) changes from 0 to 1 around b′ = b′_sn (Fig. 6a), while Λ(b′, 1; w∗) has a peak around b′ = b′_sn (Fig. 6b); there is a phase transition of statistical inference, and b′_sn corresponds to the SDL. Unlike in Sec. II D, the value of K that minimizes F̃(b′, K; w∗) also changes from 1 to 2 around b′ = b′_ss (Fig. 6a), while Λ(b′, 2; w∗) has a peak around b′ = b′_ss (Fig. 6b); there is another phase transition of statistical inference, and b′_ss corresponds to the SRL. The magnitudes of the noise in the cases of Fig. 5b and Fig. 5c correspond to b′ = b′_1 and b′ = b′_2 (Fig. 6); these cases are in different phases, and there exists an SRL as the boundary between these phases under the condition b = b∗.

This phase transition is also understandable as a variation in the posterior distribution (Fig. 7). While p(w | D_n, b, 2) at b′ = b′_1 (Fig. 7a-7c) is far from a Gaussian distribution, p(w | D_n, b, 2) at b′ = b′_2 is sufficiently approximated by a Gaussian distribution with mean w∗ (Fig. 7g-7i). In addition, p(w | D_n, b, 2) at b′ = b′_ss represents the intermediate state (Fig. 7d-7f) between these two states. Correspondingly, Λ(b′, 2; w∗) is fairly consistent with 1.5 (= Λ(b′, 1; w∗)) and 3 (= 3K∗/2) at b′_sn ≪ b′ ≪ b′_ss and b′ ≫ b′_ss, respectively (Fig. 6b); the case of K = 2 is regular at b′ = b′_2.

We should explain why p(w | D_n, b, 2) at b′ = b′_1 (Fig. 7a-7c) exhibits such correlation in the parameters. Although the ground truth is w∗ with K∗ = 2, here, we consider the pseudo-ground truth w̃ := {ã_1, ρ̃_1, μ̃_1} with K̃ = 1 and the analytic set

  W̃ := {w | f(x_i; w) = f(x_i; w̃)} = W̃_1 ∪ W̃_2 ∪ W̃_3   (14)

of w with K = 2, where W̃_1 := {w | a_k + a_k′ = ã_1, ρ_k = ρ_k′ = ρ̃_1, μ_k = μ_k′ = μ̃_1}, W̃_2 := {w | a_k = ã_1, a_k′ = 0, ρ_k = ρ̃_1, μ_k = μ̃_1} and W̃_3 := {w | a_k = ã_1, ρ_k = ρ̃_1, μ_k = μ̃_1, φ_k′(x_i; μ_k′, ρ_k′) = 0} for i = 1, ···, n and k ≠ k′. Note that φ_k′(x_i; μ_k′, ρ_k′) = 0 holds for ρ_k′ → ∞ or μ_k′ → ±∞, where ρ_k′ ≠ 0 and μ_k′ ≠ x_i are necessary. Notably, H(w; w̃) = 0 holds for ∀w ∈ W̃; i.e., p(w | D_n, b, 2) can be relatively large around w ∈ W̃ for ϕ(w | 2) > 0. The above scenario of the pseudo-ground truth corresponds to p(w | D_n, b, 2) at b′ = b′_1 (Fig. 7a-7c); i.e., w ≃ w̃ is statistically optimal for the case of K = 2 at b′ = b′_1. This scenario qualitatively holds at any b′_sn ≪ b′ ≪ b′_ss, since Λ(b′, 2; w∗) is fairly consistent with 1.5 (= Λ(b′, 1; w∗)). Note that this result is from the situation where f(x; w∗) = a∗_1 φ_1(x; μ∗_1, ρ∗_1) + a∗_2 φ_2(x; μ∗_2, ρ∗_2) appears almost as one signal (Fig. 5a), where a crucial problem is implied in the context of spectroscopic measurements [32]; there is a risk of failure to recognize whether or not an energy level is degenerate.

F. Phase diagram
We also investigated the dependence of the phase of indirect measurements on the degree of overlap between each signal. We introduce δ := |μ∗_1 − μ∗_2|/2, defined by w∗ = {a∗_k, ρ∗_k, μ∗_k}_{k=1}^{K∗} with K∗ = 2, as the degree of overlap between φ_1(x; μ∗_1, ρ∗_1) and φ_2(x; μ∗_2, ρ∗_2). Note that w∗ is the same as the setting of Fig. 5a except for μ∗_1 and μ∗_2.

The phase diagram shows that there are three phases described by b′ and δ (Fig. 8). The boundaries corresponding to F̃(b′, 0; w∗) = F̃(b′, 1; w∗) and F̃(b′, 1; w∗) = F̃(b′, 2; w∗) are the SDL and SRL, respectively, under the condition b = b∗; the SRL is always a tighter bound than the SDL. There is a clear dependence such that the smaller δ is, the larger the critical b′ as the SRL. We should mention that the cases of Secs. II D and II E correspond to δ = 0 and δ = 0.25, respectively; the case of δ = 0 is essentially regarded as K∗ = 1, while the ground truth is surely K∗ = 2. In other words, there exists w̃ that satisfies f(x; w∗) = f(x; w̃) in the case of δ = 0 even if w∗ ≠ w̃. This situation corresponds to the nonidentifiable case in Watanabe's theory [38], where δ = 0 is a singularity; only in the case of δ = 0 does the SRL not emerge.

We should mention the relation between the peak positions of Λ(b′, K; w∗) and the SDL and SRL. In Secs. II D and II E, the phase of statistical inference was defined by the K minimizing F̃(b′, K; w∗) and was characterized by Λ(b′, K; w∗). Here, we consider this chicken-or-egg problem more precisely. First, F̃(b′, K; w∗) relatively characterizes a phase of statistical inference by comparison with F̃(b′, K′; w∗) for K ≠ K′, while Λ(b′, K; w∗) absolutely characterizes the phase. In other words, the former approach monitors the state of the hyperparameter optimization or model selection, while the latter approach monitors the state of the parameter estimation. In the latter approach, it can be said that there are only two phases at δ ≳ 1, where the two peak positions of Λ(b′, 2; w∗) merge. If two source signals are detectable, then they are also resolvable at the same time. In the region bounded by the SDL, the SRL, and δ ≳ 1, K = 1 is statistically optimal, but w∗ is not sufficiently estimated; i.e., p(w | D_n, b, 1) is far from a Gaussian distribution. From this viewpoint, the peak positions of Λ(b′, K; w∗) can be regarded as a tighter bound than the SDL and SRL under the condition b = b∗.

III. DISCUSSION
Our results can be utilized for experimental design in the form of virtual measurement analytics [48]; we can virtually check the noise immunity of an arbitrary indirect measurement by following our procedure. In other words, we can preliminarily consider the typically required quality of observations, i.e., b∗/Δx beyond the SDL and SRL, by emulating the indirect measurement with the assumed ground truth w∗ before the concerned measurement is actually performed. Note that our procedure can be applied not only to a Gaussian signal φ_k but also to any type of source signal without loss of generality.

Our results also contribute to the development of some new aspects of statistical inference. The results in Sec. II B can be regarded as a frontier of the asymptotic theory of the regression problem with conditionally independent observations. The point is that Eqs. (6)-(13) always hold without loss of generality, even if f(x; w) is replaced by any other regression function. This means that the self-averaging property has not been argued enough, at least in the context of Watanabe's theory [31]. The results in Sec. II C can be regarded as a frontier of statistical inference in a broader sense. The definition of the Bayes specific heat and the derivation of Eq. (11) are different from those of the quantity called the learning capacity [46] (see Supplementary notes 2 and 3). Our definition and derivation reflect the consequence of Eqs. (12) and (13); the limits n → ∞, b → ∞ and b∗ → ∞ are different.

It should be emphasized that we assumed the large-size limit n → ∞ of the sample but did not take the high-dimension limit dim(w) → ∞ of the parameter into account. This situation is different from the ordinary picture of physics; the phase transitions shown in Secs. II D-II F are outside the Ehrenfest classification but inside Watanabe's definition [31]. On the other hand, high-dimensional statistical inference in the limit n, dim(w) → ∞ such that n/dim(w) stays finite has been intensively studied [24–29]. In this case, there exist phase transitions characterized by the Ehrenfest classification. First-order phase transitions, associated with metastability, are related to algorithmic hardness [24, 27–29, 49, 50]. Further studies can be pursued at the junction of our viewpoint and algorithmic hardness.

Our results are also related to machine learning. As Watanabe showed [31, 38], the generalization loss is associated with Λ(b′, K; w∗). This means that a drastic change in Λ(b′, K; w∗), i.e., a phase transition of inference, causes a qualitative change in the generalization performance. The widely applicable information criterion (WAIC) [39, 40], as a criterion for the generalization loss, is meaningful, while the cross-validation loss is meaningless since our setup is based on conditionally independent observations [31, 51]. The point is that the generalization loss and WAIC are not self-averaging, while F̃_n(b, K) is self-averaging. From this viewpoint, our procedure may provide a perspective for the assessment of the typical generalization performance, which is independent of the realization of {Y_i}_{i=1}^{n} for n → ∞. However, we must consider the difference between an indirect measurement and machine learning; their purposes, i.e., inference and prediction, are essentially different.

IV. METHODS

A. Outline of the derivation of Eq. (6)
Here, we show an outline of the derivation of Eq. (6). The average [E_n(w)] (Eq. (7)) and the variance [E_n(w)²] − [E_n(w)]² = (1/(n²b∗)) Σ_i (f(x_i; w) − f(x_i; w∗))² + 1/(2n(b∗)²) were exactly obtained. The relation [E_n(w)²] − [E_n(w)]² = O((√n b∗)⁻²) holds, where Δx Σ_i (f(x_i; w) − f(x_i; w∗))² → H(w; w∗) for n → ∞. Then, we obtained Eq. (6).

B. Outline of the derivation of Eq. (8)
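The signal-noise decomposition at the heart of this derivation is an exact algebraic identity and can be checked directly; a toy numerical verification under the conventions E_n(w) = (1/(2n)) Σ_i (Y_i − f(x_i; w))² and N_i ∼ N(0, 1/b∗) (the ground-truth and alternative signals below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Check the identity 2n E_n(w) = sum_i s_i^2 + 2 sum_i s_i N_i + R_n / b*,
# where s_i(w) = f(x_i; w*) - f(x_i; w) and R_n = b* sum_i N_i^2.
n, b_star = 10_000, 4.0
x = np.linspace(-5.0, 5.0, n)
f_star = np.exp(-x ** 2)                    # hypothetical ground-truth signal
f_w = 0.8 * np.exp(-(x - 0.1) ** 2)         # the signal at some other w
N = rng.normal(0.0, 1.0 / np.sqrt(b_star), size=n)
Y = f_star + N

E_n = np.sum((Y - f_w) ** 2) / (2.0 * n)
s = f_star - f_w
R_n = b_star * np.sum(N ** 2)               # ~ chi-square with n dof
lhs = 2.0 * n * E_n
rhs = np.sum(s ** 2) + 2.0 * np.sum(s * N) + R_n / b_star
cross = np.sum(s * N) / n                   # vanishes as n grows
```

The identity lhs = rhs holds to machine precision, while the cross term is O_p(n^(−1/2)); its vanishing is what turns Jensen's inequality into an asymptotic equality in the derivation.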
Here, we show an outline of the derivation of Eq. (8). By considering the noise additivity, we divided Y_i into the signal and the noise, i.e., Y_i = f(x_i; w∗) + N_i, where N_i ∼ N(0, (b∗)⁻¹). Then, we obtained 2n E_n(w) = Σ_i s_i(w)² + 2 Σ_i s_i(w) N_i + R_n/b∗, where s_i(w) := f(x_i; w∗) − f(x_i; w) and R_n := b∗ Σ_i N_i². By using Jensen's inequality, [−log ∫ exp(−(b/2)(Σ_i s_i(w)² + 2 Σ_i s_i(w) N_i)) ϕ(w | K) dw] ≥ −log ∫ exp(−(b/2) Σ_i s_i(w)²) ϕ(w | K) dw holds, where the equality holds when Σ_i s_i(w) N_i = 0, which is asymptotically satisfied for n → ∞. Based on this asymptotic equality, Eq. (8) was obtained.

C. Outline of the derivation of Eq. (12)
Here, we show an outline of the derivation of Eq. (12), where the details are shown in Supplementary note 3. First, we obtained the asymptotic behaviour of the Bayes specific heat from the second derivative of the Bayes free energy, whose asymptotic form was obtained by Watanabe [31, 36–38]. As a result, we found that the Bayes specific heat is generally divided into two parts: the real log canonical threshold [42–45] as the average and a fluctuation of order O_p(1/log n). Second, we considered the specifics of our setup. By considering the noise additivity, C̃_n(b, K) was also divided into two parts: the average over D_n and a fluctuation due to D_n. We evaluated the order of the fluctuation part and found that there are terms of orders O_p(b/b∗) and O_p((n√b∗)⁻¹). Finally, we combined the general result and the specifics and obtained Eq. (12).

D. Monte Carlo simulation
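The sampler used in this section can be sketched generically; a minimal parallel-tempering (replica-exchange) Metropolis implementation for a toy double-well energy, not the authors' production code (the ladder, step size, and sweep counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def H(w):
    # Toy double-well energy standing in for H(w; w*)
    return (w ** 2 - 1.0) ** 2

bps = np.logspace(-2.0, 2.0, 20)            # ladder of b' values (replicas)
ws = rng.normal(size=bps.size)              # one chain per value of b'

def mh_step(w, bp, step=0.5):
    # Metropolis update targeting exp(-bp * H(w)) (flat prior for brevity)
    prop = w + step * rng.normal()
    if np.log(rng.random()) < bp * (H(w) - H(prop)):
        return prop
    return w

cold = []                                   # samples of the largest-b' chain
for sweep in range(2000):
    ws = np.array([mh_step(w, bp) for w, bp in zip(ws, bps)])
    for k in range(bps.size - 1):           # replica exchange, Metropolis criterion
        if np.log(rng.random()) < (bps[k + 1] - bps[k]) * (H(ws[k + 1]) - H(ws[k])):
            ws[k], ws[k + 1] = ws[k + 1], ws[k]
    if sweep >= 1500:
        cold.append(ws[-1])
```

Exchanges between neighbouring b′ values let the low-noise (large b′) chain escape local minima via the high-noise chains; the same ladder of b′ points also provides estimates across the whole b′ range in a single run.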
In Secs. II D-II F, we performed an MC simulation by using parallel tempering [52, 53] based on the Metropolis criterion. The variable b′ was discretized as 400 points consisting of b′ = 0 and 399 logarithmically spaced points in the interval [10^(−⋯), 10^(⋯)]. By sampling from p(w | D_n, b, K) ∝ exp(−(b′/2) H(w; w∗)) ϕ(w | K) with the prior ϕ(w | K) = Π_{k=1}^{K} exp(−a_k²/⋯ − ρ_k²/⋯ − μ_k²/⋯)/(500√π), we calculated F̃(b′, K; w∗) and Λ(b′, K; w∗) at each point of b′ > 0; bridge sampling [54, 55] was used to calculate F̃(b′, K; w∗). The error bars of F̃(b′, K; w∗) and Λ(b′, K; w∗) were calculated by bootstrap resampling [56]. Especially in Figs. 2 and 5, we sampled from p(w | D_n, b, K) ∝ exp(−nb E_n(w)) ϕ(w | K) with a realization of D_n and calculated F̃_n(b, K) in the same manner as in our previous work [32].

ACKNOWLEDGMENTS
The authors are grateful to Chihiro H. Nakajima, Koji Hukushima, Kouki Yonaga, Masayuki Ohzeki, Shotaro Akaho, Sumio Watanabe, Tomoyuki Obuchi and Yoshiyuki Kabashima for valuable discussions. M.O. was supported by a Grant-in-Aid for Scientific Research on Innovative Areas (No. 25120009) from the Japan Society for the Promotion of Science, the "Materials Research by Information Integration" Initiative (MI2I) project of the Support Program for Starting Up Innovation Hub from the Japan Science and Technology Agency (JST), and the Council for Science, Technology and Innovation (CSTI), Cross-ministerial Strategic Innovation Promotion Program (SIP), "Structural Materials for Innovation" (Funding agency: JST).
AUTHOR CONTRIBUTION
S.T. and M.O. conceived the project and the concrete setup. S.T. derived all the equations, performed all the numerical calculations, interpreted the results and wrote the manuscript. K.N. partially supervised the project. M.O. supervised the project.

[1] J. Rarity, P. Tapster, E. Jakeman, T. Larchuk, R. Campos, M. Teich, and B. Saleh, Physical Review Letters, 1348 (1990).
[2] A. Kuzmich and L. Mandel, Quantum and Semiclassical Optics: Journal of the European Optical Society Part B, 493 (1998).
[3] E. Fonseca, C. Monken, and S. Pádua, Physical Review Letters, 2868 (1999).
[4] K. Edamatsu, R. Shimizu, and T. Itoh, Physical Review Letters, 213601 (2002).
[5] H. Eisenberg, J. Hodelin, G. Khoury, and D. Bouwmeester, Physical Review Letters, 090502 (2005).
[6] T. Nagata, R. Okamoto, J. L. O'Brien, K. Sasaki, and S. Takeuchi, Science, 726 (2007).
[7] R. Okamoto, H. F. Hofmann, T. Nagata, J. L. O'Brien, K. Sasaki, and S. Takeuchi, New Journal of Physics, 073033 (2008).
[8] G.-Y. Xiang, B. L. Higgins, D. Berry, H. M. Wiseman, and G. Pryde, Nature Photonics, 43 (2011).
[9] T. A. Klar, S. Jakobs, M. Dyba, A. Egner, and S. W. Hell, Proceedings of the National Academy of Sciences, 8206 (2000).
[10] M. G. Gustafsson, Proceedings of the National Academy of Sciences, 13081 (2005).
[11] E. Betzig, G. H. Patterson, R. Sougrat, O. W. Lindwasser, S. Olenych, J. S. Bonifacino, M. W. Davidson, J. Lippincott-Schwartz, and H. F. Hess, Science, 1642 (2006).
[12] Y. Ashida and M. Ueda, Physical Review Letters, 095301 (2015).
[13] E. J. Candes and T. Tao, IEEE Transactions on Information Theory, 5406 (2006).
[14] D. L. Donoho et al., IEEE Transactions on Information Theory, 1289 (2006).
[15] D. L. Donoho, Discrete & Computational Geometry, 617 (2006).
[16] D. Donoho and J. Tanner, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 4273 (2009).
[17] Y. Kabashima, T. Wadayama, and T. Tanaka, Journal of Statistical Mechanics: Theory and Experiment, L09003 (2009).
[18] D. L. Donoho, A. Maleki, and A. Montanari, Proceedings of the National Academy of Sciences, 18914 (2009).
[19] S. Ganguli and H. Sompolinsky, Physical Review Letters, 188701 (2010).
[20] F. Krzakala, M. Mézard, F. Sausset, Y. Sun, and L. Zdeborová, Journal of Statistical Mechanics: Theory and Experiment, P08009 (2012).
[21] E. T. Jaynes, Physical Review, 620 (1957).
[22] E. T. Jaynes, Probability Theory: The Logic of Science (Cambridge University Press, 2003).
[23] V. Balasubramanian, Neural Computation, 349 (1997).
[24] L. Zdeborová and F. Krzakala, Advances in Physics, 453 (2016).
[25] N. Sourlas, Nature, 693 (1989).
[26] H. S. Seung, H. Sompolinsky, and N. Tishby, Physical Review A, 6056 (1992).
[27] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Physical Review Letters, 065701 (2011).
[28] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Physical Review E, 066106 (2011).
[29] Y. Kabashima, F. Krzakala, M. Mézard, A. Sakata, and L. Zdeborová, IEEE Transactions on Information Theory, 4228 (2016).
[30] S. Watanabe, in Seventh Workshop on Information Theoretic Methods in Science and Engineering (Citeseer, 2014) p. 27.
[31] S. Watanabe, Mathematical Theory of Bayesian Statistics (CRC Press, 2018).
[32] S. Tokuda, K. Nagata, and M. Okada, Journal of the Physical Society of Japan, 024001 (2016).
[33] I. M. Lifshitz, Soviet Physics Uspekhi, 549 (1965).
[34] H. Nishimori, Journal of Physics C: Solid State Physics, 4071 (1980).
[35] Y. Iba, Journal of Physics A: Mathematical and General, 3875 (1999).
[36] S. Watanabe, in International Conference on Algorithmic Learning Theory (Springer, 1999) pp. 39–50.
[37] S. Watanabe, Neural Computation, 899 (2001).
[38] S. Watanabe, Algebraic Geometry and Statistical Learning Theory, Vol. 25 (Cambridge University Press, 2009).
[39] S. Watanabe, Neural Networks, 20 (2010).
[40] S. Watanabe, Journal of Machine Learning Research, 3571 (2010).
[41] S. Watanabe, Journal of Machine Learning Research, 867 (2013).
[42] I. Bernshtein, Functional Analysis and Its Applications, 273 (1972).
[43] M. Sato and T. Shintani, Annals of Mathematics, 131 (1974).
[44] M. Kashiwara, Inventiones Mathematicae, 33 (1976).
[45] A. N. Varchenko, Functional Analysis and Its Applications, 175 (1976).
[46] C. H. LaMont and P. A. Wiggins, Physical Review E, 052140 (2019).
[47] S. Watanabe and S.-i. Amari, Neural Computation, 1013 (2003).
[48] K. Nagata, R. Muraoka, Y.-i. Mototake, T. Sasaki, and M. Okada, Journal of the Physical Society of Japan, 044003 (2019).
[49] F. Antenucci, S. Franz, P. Urbani, and L. Zdeborová, Physical Review X, 011020 (2019).
[50] F. Ricci-Tersenghi, G. Semerjian, and L. Zdeborová, Physical Review E, 042109 (2019).
[51] S. Watanabe, in Tenth Workshop on Information Theoretic Methods in Science and Engineering (2017) p. 38.
[52] C. J. Geyer, (1991).
[53] K. Hukushima and K. Nemoto, Journal of the Physical Society of Japan, 1604 (1996).
[54] X.-L. Meng and W. H. Wong, Statistica Sinica, 831 (1996).
[55] A. Gelman and X.-L. Meng, Statistical Science, 163 (1998).
[56] B. Efron et al., The Annals of Statistics, 1 (1979).
FIG. 1. Schematic of our setup. Mixed signals D_n := {x_i, Y_i}_{i=1}^n (black dots), i.e., the discretized f(x_i; w_0) (red dotted line) as a superposition of {φ_k}_{k=1}^{K_0} (blue solid lines) with statistical noise, are taken from the conditional probability density function p(y_i|x_i, w_0, b_0), where w_0 and b_0^{-1} are the ground truth of the parameter set and the noise variance, respectively. Inversely, {φ_k}_{k=1}^{K} are estimated from D_n in the form of the parameter set w taken from the posterior distribution p(w|D_n, b, K). The indirect measurement in our setup consists of these two processes. The problem here is that this indirect measurement is not always successfully performed due to the presence of statistical noise, as with noisy-channel coding; the SDL and SRL typically exist.

FIG. 2. Paradigmatic example of the signal detection limit. (a) Ground truth f(x; w_0) (red line) as the K_0 = 1 source signal. (b) A realization of D_n (black dots) at b = b'Δx for n = 101. It was simulated that K = 0 minimizes F̃_n(b, K) for this D_n. (c) A realization of D_n (black dots) at another value of b = b'Δx for n = 101 and f(x; ŵ) (red line) with the maximum a posteriori (MAP) estimator ŵ. It was simulated that K = 1 minimizes F̃_n(b, K) for this D_n.

FIG. 3. Emergence of the signal detection limit as a phase boundary of statistical inference. (a) Log-log plot of F̃(b', K; w_0) for K_0 = 1. The inequality F̃(b', 0; w_0) < F̃(b', 1; w_0) holds for b' < b'_sn (region in light blue), and F̃(b', 0; w_0) > F̃(b', 1; w_0) holds for b' > b'_sn (region in light pink). The corresponding Bayes factor exp(−b'(F̃(b', 1; w_0) − F̃(b', 0; w_0))) is shown in the inset. (b) Semi-log plot of Λ(b', K; w_0) for K = 1 and the asymptote 3K/2.

FIG. 4. Variation in the posterior distribution from the prior distribution with the magnitude of the noise. Histograms of the MC sample from p(w|D_n, b, 1) ∝ exp(−b'H(w; w_0)/2)φ(w|1) for K = 1 at (a-c) b' < b'_sn, (d-f) b' = b'_sn, and (g-i) b' > b'_sn. Each row corresponds to a marginal distribution; (a, d, g) corresponds to p(a|D_n, b, 1), (b, e, h) corresponds to p(μ|D_n, b, 1), and (c, f, i) corresponds to p(ρ|D_n, b, 1), together with the prior φ(w|1) (red dashed line).

FIG. 5. Paradigmatic example of the signal resolution limit. (a) Ground truth f(x; w_0) (red dotted line) as a superposition of K_0 = 2 overlapping source signals (blue solid lines). (b) A realization of D_n (black dots) at b = b'Δx for n = 101 and f(x; ŵ) (red solid line) with the MAP estimator ŵ. It was simulated that K = 1 minimizes F̃_n(b, K) for this D_n. (c) A realization of D_n (black dots) at another value of b = b'Δx for n = 101 and f(x; ŵ) (red dotted line) with the MAP estimator ŵ as a superposition of two source signals (blue solid lines). It was simulated that K = 2 minimizes F̃_n(b, K) for this D_n.

FIG. 6. Emergence of the signal resolution limit as a phase boundary of statistical inference. (a) Log-log plot of F̃(b', K; w_0) for K_0 = 2. The inequality F̃(b', 0; w_0) < F̃(b', 1; w_0) holds for b' < b'_sn (region in light blue), F̃(b', 1; w_0) < F̃(b', 2; w_0) holds for b'_sn < b' < b'_ss (region in light pink), and F̃(b', 1; w_0) > F̃(b', 2; w_0) holds for b' > b'_ss (region in light yellow). The corresponding Bayes factors are shown in the inset. (b) Semi-log plot of Λ(b', K; w_0) for K_0 = 2 and the asymptote 3K/2.

FIG. 7. Variation in the posterior distribution with the magnitude of the noise and the underlying correlation of the parameters. Two-dimensional histograms of the MC sample from p(w|D_n, b, 2) ∝ exp(−b'H(w; w_0)/2)φ(w|2) for K_0 = 2 at (a-c) b' < b'_ss, (d-f) b' = b'_ss, and (g-i) b' > b'_ss. Each row corresponds to a marginal distribution; (a, d, g) corresponds to p(a_1, a_2|D_n, b, 2), (b, e, h) corresponds to p(μ_1, μ_2|D_n, b, 2), and (c, f, i) corresponds to p(ρ_1, ρ_2|D_n, b, 2).

FIG. 8. Phase diagram of the indirect measurement with respect to the magnitude of the noise and the degree of overlap between each signal. Semi-log plot of the peak positions of Λ(b', K; w_0) (black dashed and grey solid lines) with the source positions μ*_{1,2} = ±δ as K_0 = 2. F̃(b', 0; w_0) is minimal for b' < b'_sn (region in light blue), F̃(b', 1; w_0) is minimal for b'_sn < b' < b'_ss (region in light pink), and F̃(b', 2; w_0) is minimal for b' > b'_ss (region in light yellow).

SUPPLEMENTARY NOTE 1: EXPLICIT EXPRESSION OF THE HAMILTONIAN
Here, we show a more explicit expression of Eq. (10):

H(w; w_0) = J(w, w) − 2J(w, w_0) + J(w_0, w_0)  (S1)

with

J(w, w') := Σ_{j=1}^{K} Σ_{k=1}^{K'} a_j a'_k √(2/(ρ_j + ρ'_k)) exp(−(μ_j − μ'_k)²/(2(ρ_j^{−1} + ρ'_k^{−1}))) J̃(ρ_j, μ_j, ρ'_k, μ'_k)  (S2)

for w' := {a'_k, ρ'_k, μ'_k}_{k=1}^{K'} as KK' > 0 and J(w, w') := 0 as KK' = 0, where

J̃(ρ_j, μ_j, ρ'_k, μ'_k) := (√π/2) erf(√((ρ_j + ρ'_k)/2) (x_n − (μ_jρ_j + μ'_kρ'_k)/(ρ_j + ρ'_k))) − (√π/2) erf(√((ρ_j + ρ'_k)/2) (x_1 − (μ_jρ_j + μ'_kρ'_k)/(ρ_j + ρ'_k))).  (S3)

Note that J̃(ρ_j, μ_j, ρ'_k, μ'_k) = √π holds for x_1 → −∞ and x_n → ∞ as a Gaussian integral. We also define

J̃(ρ, μ) := J̃(ρ, μ, ρ, μ) = (√π/2) erf(√ρ (x_n − μ)) − (√π/2) erf(√ρ (x_1 − μ))  (S4)

for convenience. Now, we obtain the average part of F̃_n(b, K),

b'F̃(b', K; w_0) = b'J(w_0, w_0)/2 − log ∫ exp(−b'H̃(w; w_0)/2) φ(w|K) dw  (S5)

with H̃(w; w_0) := J(w, w) − 2J(w, w_0), where b'J(w_0, w_0) depends exclusively on a typical observation of D_n and not on the inference of w, especially at b = b_0.

In the case of K_0 = 1 as in Sec. II D, we obtain

b'J(w_0, w_0) = (a*²b/(√ρ* Δx)) J̃(ρ*, μ*) ≤ √π a*²b/(√ρ* Δx)  (S6)

for w_0 := {a*, ρ*, μ*}, where the equality holds for x_1 → −∞ and x_n → ∞. This means that the typical quality of D_n depends on the signal-to-noise ratio a*√b, the signal-to-resolution ratio (√ρ* Δx)^{−1}, and the measurement interval [x_1, x_n].

In the case of K_0 = 2 as in Sec. II E, we obtain

b'J(w_0, w_0) = (a*²b/(√ρ* Δx)) (J̃(ρ*, μ*_1) + J̃(ρ*, μ*_2) + 2 exp(−δ²ρ*) J̃(ρ*, μ_c)) ≤ (2√π a*²b/(√ρ* Δx)) (1 + exp(−δ²ρ*))  (S7)

especially for w_0 := {a*_k, ρ*_k, μ*_k}_{k=1}^{2}, δ := |μ*_1 − μ*_2|/2 ≥ 0, μ_c := (μ*_1 + μ*_2)/2, a* := a*_1 = a*_2, and ρ* := ρ*_1 = ρ*_2, where the equality holds for x_1 → −∞ and x_n → ∞. This means that the typical quality of D_n depends on the signal-to-noise ratio a*√b/2, the signal-to-resolution ratio (√ρ* Δx)^{−1}, the measurement interval [x_1, x_n], and additionally the degree of overlap between each signal δ√ρ*. In Sec. II F, we adopted δ as the degree of overlap between each signal, where ρ* was fixed, since both (√ρ* Δx)^{−1} and δ√ρ* depend on ρ*. Note that the term exp(−δ²ρ*) ≤ 1, where the equality holds for δ = 0 and then μ_c = μ*_1 = μ*_2.

SUPPLEMENTARY NOTE 2: GENERAL DEFINITION OF THE BAYES SPECIFIC HEAT
Here, we derive Eq. (11) from a broader perspective of statistical inference beyond our setup. We start by introducing the conditional probability density

p(w|D_n, β, b, K) := (1/Z_n(β; b, K)) exp(−nβL_n(w; b)) φ(w|K)  (S8)

with the inverse temperature β ≥ 0 [38], the empirical log loss function

L_n(w; b) := −(1/n) Σ_{i=1}^{n} log p(Y_i|x_i, w, b),  (S9)

and the partition function

Z_n(β; b, K) := ∫ exp(−nβL_n(w; b)) φ(w|K) dw.  (S10)

Note that Eq. (S8) for β = 1 is just Bayes' theorem, where p(w|D_n, 1, b, K) and Z_n(1; b, K) are the posterior distribution and the marginal likelihood, respectively. If β → ∞ and φ(ŵ|K) > 0, then p(w|D_n, β, b, K) converges to δ(w − ŵ), where ŵ is the maximum likelihood estimator [38]. Notably, p(w|D_n, β, b, K) = φ(w|K) holds for β = 0.

Here, we define the specific heat

C_n(β; b, K) := ∂⟨nL_n(w; b)⟩_β / ∂β^{−1} = β² I_n(β; b, K)  (S11)

with the Fisher information

I_n(β; b, K) := ⟨(∂ log p(w|D_n, β, b, K)/∂β)²⟩_β = ⟨(nL_n(w))²⟩_β − ⟨nL_n(w)⟩_β²,  (S12)

where the average ⟨···⟩_β := ∫(···) p(w|D_n, β, b, K) dw. Considering the connection between statistical inference and statistical physics, ⟨nL_n(w; b)⟩_β is the internal energy for nL_n(w; b) as the Hamiltonian of a disordered system, and

F_n(β; b, K) := −(1/β) log Z_n(β; b, K)  (S13)

is the free energy. Then, we also obtain the relation

C_n(β; b, K) = −β² ∂²(βF_n)/∂β²  (S14)

as in statistical physics. As the Bayes free energy is defined by F̃_n(b, K) = F_n(1; b, K), we define the Bayes specific heat as C_n(1; b, K), where this definition can be applied not only to L_n(w; b) in our setup but also to any other empirical log loss function without loss of generality. We should compare the Bayes specific heat, especially in the form of Eq. (S14), with the learning capacity [46], which is defined by the second derivative of the Bayes free energy with respect to n as an approximation of the second-order finite difference. Notably, these are different quantities, as β and n are different, while the quantities conditionally show similar asymptotic behaviours for n → ∞. We show their similarities and differences in Supplementary Note 3.

Now, we consider the specifics of our setup, i.e., the relation

L_n(w; b) = bE_n(w) − (1/2) log(b/2π)  (S15)

with the mean squared error E_n(w) := (1/2n) Σ_{i=1}^{n} (Y_i − f(x_i; w))². Then, we obtain the scaling relations C_n(β; b, K) = C̃_n(β', K) and I_n(β; b, K) = b² Ĩ_n(β', K) for β' := βb, where the scaling functions are

C̃_n(β', K) := ∂⟨nE_n(w)⟩_{β'} / ∂β'^{−1} = β'² Ĩ_n(β', K)  (S16)

and

Ĩ_n(β', K) := ⟨(∂ log p(w|D_n, β, b, K)/∂β')²⟩_{β'} = ⟨(nE_n(w))²⟩_{β'} − ⟨nE_n(w)⟩_{β'}².  (S17)

Note that these scaling relations mean that the bivariate functions of (β, b) are exactly reduced to univariate functions of β', while F_n(β; b, K) does not satisfy such a scaling relation, as shown by

F_n(β; b, K) = −(1/β) log ∫ exp(−nβ'E_n(w)) φ(w|K) dw − (n/2) log(b/2π).  (S18)

Now, we take β = 1, i.e., β' = b, and then obtain C̃_n(b, K) in the form of Eq. (11) as the Bayes specific heat in our setup.

SUPPLEMENTARY NOTE 3: FINITE-SIZE SCALING
We show the derivation of Eq. (12), whose outline is mentioned in Sec. IV C, in more detail. We start from a broader perspective of statistical inference beyond our setup. The asymptotic behaviour of the free energy [38] is shown by

βF_n(β; b, K) = nβL_n(w'; b) + λ log nβ − (m − 1) log log nβ + O_p(1)  (S19)

for n → ∞, where w', λ > 0, and m ≥ 1 are the parameter w that minimizes the Kullback-Leibler distance from p(y_i|x_i, w_0, b_0) to p(y_i|x_i, w, b), a rational number called the real log canonical threshold, and a natural number, respectively. Note that w' = w_0 holds if b = b_0 and K = K_0. By following Eq. (S14), we obtain

C_n(β; b, K) = λ − (m − 1)(1/(log nβ) + 1/(log nβ)²) + o_p(1)  (S20)

for n → ∞. If we take β → ∞, then C_n(β; b, K) = λ + o_p(1) holds; the quantity C_n(β; b, K) is not necessarily self-averaging. The relation C_n(β; b, K) = λ holds for n → ∞ with β = O(1/log n), which corresponds to the condition shown in Watanabe's corollary [41]; the quantity C_n(β; b, K) is self-averaging under this condition. Here, we mention that the expectation of the learning capacity over the observation also converges to λ for n → ∞ [46], while the definition of the learning capacity as a random variable is not suitable for this type of scaling analysis. This is a definite difference between the learning capacity and the Bayes specific heat.

Now, we consider the specifics of our setup, i.e., the relation

C̃_n(β', K) = λ − (m − 1)(1/(log nβ') + 1/(log nβ')²) + o_p(1),  (S21)

where the correspondence of (β, b) and β' is considered. Then, we obtain

C̃_n(b, K) = λ − (m − 1)(1/(log nb) + 1/(log nb)²) + o_p(1)  (S22)

for β = 1.

To evaluate the term o_p(1) more tightly, we consider the additivity of statistical noise:

Y_i = f(x_i; w_0) + N_i,  (S23)

where N_i ∼ N(0, b_0^{−1}). Then, we obtain

nE_n(w) = (1/2) Σ_{i=1}^{n} s_i(w)² + Σ_{i=1}^{n} s_i(w) N_i + b^{−1} R_n,  (S24)

where s_i(w) := f(x_i; w_0) − f(x_i; w) and R_n := (b/2) Σ_{i=1}^{n} N_i². Following Eq. (11), we also obtain

C̃_n(b, K) = (b/2)² [⟨(Σ_{i=1}^{n} s_i(w)²)²⟩ − ⟨Σ_{i=1}^{n} s_i(w)²⟩²] + V_n(b, K) + Ṽ_n(b, K) + W_n(b, K) + W̃_n(b, K),  (S25)

where

V_n(b, K) := b² Σ_{i≠j} (⟨s_i(w)s_j(w)⟩ − ⟨s_i(w)⟩⟨s_j(w)⟩) N_i N_j,  (S26)

Ṽ_n(b, K) := b² Σ_{i=1}^{n} (⟨s_i(w)²⟩ − ⟨s_i(w)⟩²) N_i²,  (S27)

W_n(b, K) := b² Σ_{i≠j} (⟨s_i(w)²s_j(w)⟩ − ⟨s_i(w)²⟩⟨s_j(w)⟩) N_j,  (S28)

and

W̃_n(b, K) := b² Σ_{i=1}^{n} (⟨s_i(w)³⟩ − ⟨s_i(w)²⟩⟨s_i(w)⟩) N_i.  (S29)

We evaluate the order of each term in Eq. (S25) as n → ∞. First, we obtain

(b/2)² [⟨(Σ_{i=1}^{n} s_i(w)²)²⟩ − ⟨Σ_{i=1}^{n} s_i(w)²⟩²] = Λ(b', K; w_0)  (S30)

for n → ∞ as a Riemann integral.

Second, we evaluate the order of V_n(b, K) as n → ∞. Now, we obtain

[V_n(b, K)] = 0  (S31)

for n → ∞ such that

[(⟨s_i(w)s_j(w)⟩ − ⟨s_i(w)⟩⟨s_j(w)⟩) N_i N_j] = (⟨s_i(w)s_j(w)⟩ − ⟨s_i(w)⟩⟨s_j(w)⟩)[N_i][N_j]  (S32)

is satisfied, where [N_i] = 0. Then, we also obtain

[V_n(b, K)²] − [V_n(b, K)]² = (2b⁴/b_0²) Σ_{i≠j} (⟨s_i(w)s_j(w)⟩ − ⟨s_i(w)⟩⟨s_j(w)⟩)² = O(b²/b_0²)  (S33)

for n → ∞ such that

[(⟨s_i(w)s_j(w)⟩ − ⟨s_i(w)⟩⟨s_j(w)⟩)² N_i² N_j²] = (⟨s_i(w)s_j(w)⟩ − ⟨s_i(w)⟩⟨s_j(w)⟩)² [N_i²][N_j²]  (S34)

and

[(⟨s_i(w)s_j(w)⟩ − ⟨s_i(w)⟩⟨s_j(w)⟩)(⟨s_k(w)s_l(w)⟩ − ⟨s_k(w)⟩⟨s_l(w)⟩) N_i N_j N_k N_l] = (⟨s_i(w)s_j(w)⟩ − ⟨s_i(w)⟩⟨s_j(w)⟩)(⟨s_k(w)s_l(w)⟩ − ⟨s_k(w)⟩⟨s_l(w)⟩)[N_i][N_j][N_k][N_l]  (S35)

are satisfied, where [N_i] = 0, [N_i²] = b_0^{−1}, ⟨s_i(w)⟩² = O((nb)^{−1}), and ⟨s_i(w)s_j(w)⟩ = O((nb)^{−1}).
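As a quick sanity check of this bookkeeping: the decomposition (S24) is an exact algebraic identity once the additive noise (S23) is substituted into E_n(w) := (1/2n) Σ_i (Y_i − f(x_i; w))². The following sketch (ours, not part of the paper; the Gaussian basis f and all parameter values are placeholders) verifies it numerically with R_n := (b/2) Σ_i N_i².

```python
import math
import random

random.seed(1)

def f(x, w):
    """Single Gaussian source signal; a placeholder for the model's basis."""
    a, rho, mu = w
    return a * math.exp(-rho * (x - mu) ** 2 / 2)

n, b0, b = 101, 4.0, 3.0                     # sample size, true and assumed precisions
w0, w = (1.0, 2.0, 0.0), (0.8, 1.5, 0.3)     # arbitrary ground truth / trial parameters
xs = [4.0 * i / (n - 1) - 2.0 for i in range(n)]                  # grid on [-2, 2]
Ns = [random.gauss(0.0, 1.0 / math.sqrt(b0)) for _ in range(n)]   # N_i ~ N(0, b0^{-1})
Ys = [f(x, w0) + N for x, N in zip(xs, Ns)]                       # Eq. (S23)

# Left-hand side: n E_n(w) computed directly from the data.
nEn = 0.5 * sum((Y - f(x, w)) ** 2 for x, Y in zip(xs, Ys))

# Right-hand side: the decomposition of Eq. (S24).
ss = [f(x, w0) - f(x, w) for x in xs]        # s_i(w) := f(x_i; w0) - f(x_i; w)
Rn = 0.5 * b * sum(N * N for N in Ns)        # R_n := (b/2) sum N_i^2
rhs = 0.5 * sum(s * s for s in ss) + sum(s * N for s, N in zip(ss, Ns)) + Rn / b

print(abs(nEn - rhs))   # ≈ 0 (the identity holds to machine precision)
```

Since the identity is exact for every w, it carries over unchanged under the posterior average ⟨···⟩ used in the variance decomposition (S25)-(S29).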
In summary, we obtain

V_n(b, K) = [V_n(b, K)] + O_p(b/b_0) = O_p(b/b_0)  (S36)

for n → ∞.

In the same way, we also obtain

Ṽ_n(b, K) = [Ṽ_n(b, K)] + O_p(b/(√n b_0))  (S37)

with

[Ṽ_n(b, K)] = (b²/b_0) Σ_{i=1}^{n} (⟨s_i(w)²⟩ − ⟨s_i(w)⟩²) = O(b/b_0),  (S38)

where [N_i²] = b_0^{−1}, [N_i⁴] = 3/b_0², ⟨s_i(w)²⟩ = O((nb)^{−1}), and ⟨s_i(w)⁴⟩ = O((nb)^{−2}). This means that Ṽ_n is self-averaging; i.e., Ṽ_n = [Ṽ_n] ≥ 0 for n → ∞. Furthermore, we also obtain

W_n(b, K) = O_p(1/(n√b))  (S39)

and

W̃_n(b, K) = O_p(1/(n^{3/2}√b))  (S40)

for n → ∞, where [N_i] = 0, [N_i²] = b_0^{−1}, [W_n] = 0, [W̃_n] = 0, ⟨s_i(w)⟩² = O((nb)^{−1}), ⟨s_i(w)²⟩ = O((nb)^{−1}), ⟨s_i(w)²s_j(w)⟩ = O((nb)^{−3/2}), and ⟨s_i(w)³⟩ = O((nb)^{−3/2}). By considering the consistency between Eq. (S22) and Eq. (S25) with Eq. (S30) and Eqs. (S36-S40), we obtain Eq. (12), where the limit of Λ(b', K; w_0) is λ, the real log canonical threshold.

Here, we demonstrate the validity of Eq. (12). By performing the same simulation as in Figs. 2 and 5 based on 100 different realizations of D_n, for n = 101, taken from identical p(y_i|x_i, w_0, b_0), we calculated C̃_n(b, K) for each realization. Phenomenologically, the expectation of C̃_n(b, K) over the realizations is fairly consistent with Λ(b', K; w_0) at any b when K = 1 for K_0 = 1 (Fig. S1a), when K = 1 for K_0 = 2 (Fig. S1c), and when K = 2 for K_0 = 2 (Fig. S1d), where the standard deviation of C̃_n(b, K) for the realizations is small enough (Fig. S2); the quantity C̃_n(b, K) is considered to be self-averaging at any b in these cases without the condition b = O(b_0/log n). According to Eq. (12), these cases mean that the expectation of C̃_n(b, K) corresponds to the average term Λ(b', K; w_0), where the standard deviation of C̃_n(b, K) corresponds to the fluctuation term of order O_p(max((log nb)^{−1}, b/b_0, (n√b)^{−1})). Note that the standard deviation of C̃_n(b, K) as the fluctuation term shows a dependence on Λ(b', K; w_0) (Fig. S2), which is not predicted by Eq. (12).

In other cases, the expectation of C̃_n(b, K) is not consistent with Λ(b', K; w_0) at b ≳ b_0 (Fig. S1); finite-size effects on C̃_n(b, K) appear at b ≳ b_0, where the term of order O_p(b/b_0) is dominant. According to Eq. (S25) with Eqs. (S36-S40), the expectation of C̃_n(b, K) corresponds to Λ(b', K; w_0) + Ṽ_n(b, K) as n → ∞, where the standard deviation of C̃_n(b, K) corresponds to V_n(b, K) + W_n(b, K) + W̃_n(b, K) as n → ∞. Phenomenologically, the expectations and standard deviations of C̃_n(b, K) are roughly proportional to √b if b is large enough (Fig. S3) and also show a rough dependence on b_0. We conjecture the relation

C̃_n(b, K) = Λ(b', K; w_0) + O_p(√b/b_0)  (S41)

for n, b → ∞ in these cases, i.e., a special case of Eq. (12). Note that C̃_n(b, K) is considered to be self-averaging for b = O(b_0/log n) in all cases.

FIG. S1. Finite-size effects on the Bayes specific heat. Semi-log plots of Λ(b', K; w_0) (blue circles) and the expectations of C̃_n(b, K) over the realizations of D_n for three values of b_0 (red triangles, yellow squares, and purple crosses) when (a) K = 1 and (b) K = 2, in the same case as Fig. 2, and when (c) K = 1 and (d) K = 2, in the same case as Fig. 5. Note that Λ(b', K; w_0) is replotted from Figs. 3b and 6b, where b = b'Δx.

FIG. S2. Fluctuation of the Bayes specific heat for the realizations. Log-log plots of the standard deviation of C̃_n(b, K) over the realizations of D_n for K_0 = 1 (blue squares for K = 1) and K_0 = 2 (red squares for K = 1 and purple crosses for K = 2), where each case corresponds to Fig. S1.

FIG. S3. Scaling analyses of the Bayes specific heat. Log-log plots of the expectations (blue triangles for K = 1 and yellow squares for K = 2) of C̃_n(b, K), subtracting Λ(b', K; w_0) as the baseline, and the standard deviations (red triangles for K = 1 and yellow squares for K = 2) of C̃_n(b, K) over the realizations of D_n in the cases of (a) Fig. S1a, (b) Fig. S1b, and (c) Fig. S1d.
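As a closing numerical aside (a toy illustration of our own, not one of the simulations above): for a regular model, the asymptotics of Supplementary Note 3 predict that the Bayes specific heat converges to λ, which equals half the number of parameters. The sketch below estimates the specific heat of Eq. (S11) as β² times the posterior variance of nL_n(w) for a one-parameter Gaussian-mean model with a wide Gaussian prior (both our assumptions), whose tempered posterior is Gaussian and can therefore be sampled exactly; here λ = 1/2.

```python
import math
import random

random.seed(0)

# Toy regular model: Y_i ~ N(w0, 1) with a single parameter w, so lambda = 1/2.
n, w0 = 1000, 0.3
ys = [w0 + random.gauss(0.0, 1.0) for _ in range(n)]

def specific_heat(beta, n_samples=50000):
    """Estimate C_n(beta) = beta^2 * Var_posterior[n L_n(w)].

    With a Gaussian likelihood and a Gaussian prior N(0, 100), the tempered
    posterior of w is Gaussian, so it can be sampled without MCMC.
    """
    prior_var = 100.0
    ybar = sum(ys) / n
    prec = beta * n + 1.0 / prior_var        # posterior precision of w
    mean = beta * n * ybar / prec
    sd = math.sqrt(1.0 / prec)
    sse0 = sum((y - ybar) ** 2 for y in ys)  # sum of squares about the sample mean
    energies = []
    for _ in range(n_samples):
        w = random.gauss(mean, sd)
        # n L_n(w) up to a w-independent constant, which drops out of the variance
        energies.append(0.5 * (sse0 + n * (w - ybar) ** 2))
    m = sum(energies) / n_samples
    var = sum((e - m) ** 2 for e in energies) / n_samples
    return beta ** 2 * var

print(specific_heat(1.0))   # ≈ 0.5 = lambda
```

The estimate stays near λ = 1/2 for other values of β as well, reflecting the fact that in this conjugate setup the posterior variance of the energy scales as β^{−2}; in the singular spectral model of the paper, the same estimator is what underlies C̃_n(b, K) and its comparison with Λ(b', K; w_0).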