Publication


Featured research published by Katsuyuki Hagiwara.


Neural Networks | 2001

Upper bound of the expected training error of neural network regression for a Gaussian noise sequence

Katsuyuki Hagiwara; Taichi Hayasaka; Naohiro Toda; Shiro Usui; Kazuhiro Kuno

In neural network regression problems, which are often formulated as additive noise models, NIC (Network Information Criterion) has been proposed as a general model selection criterion for determining the optimal network size with high generalization performance. Although NIC is derived using an asymptotic expansion, it has been pointed out that this technique cannot be applied when the target function lies in the family of assumed networks but the family is not minimal for representing it, i.e. the overrealizable case; in this case NIC reduces to the well-known AIC (Akaike Information Criterion) and to related criteria, depending on the loss function. Because NIC is an unbiased estimator of the generalization error based on the training error, the expectations of these errors need to be derived for neural networks in such cases. This paper gives upper bounds on the expectation of the training error with respect to the distribution of the training data, which we call the expected training error, for several types of networks under the squared error loss. In the overrealizable case, because the errors are determined by how well a network fits the noise component contained in the data, the target data are taken to be a Gaussian noise sequence. For radial basis function networks and three-layered neural networks with a bell-shaped activation function in the hidden layer, the expected training error is bounded above by σ*² − 2nσ*² log T/T, where σ*² is the variance of the noise, n is the number of basis functions or hidden units, and T is the number of data. Furthermore, for three-layered neural networks with a sigmoidal activation function in the hidden layer, we obtain the upper bound σ*² − O(log T/T) when n > 2. If the number of data is large enough, these bounds on the expected training error are smaller than σ*² − N(n)σ*²/T as evaluated in NIC, where N(n) is the number of all network parameters.
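
For readability, the bounds quoted above can be transcribed into standard notation (a transcription of the abstract's formulas, not a re-derivation; σ*² is the noise variance, n the number of basis functions or hidden units, T the number of data, and N(n) the total number of network parameters):

```latex
% RBF networks and three-layered networks with bell-shaped hidden-unit activations:
\mathbb{E}[\text{training error}] \;\le\; \sigma_*^{2} - \frac{2 n \sigma_*^{2} \log T}{T}

% Three-layered networks with sigmoidal hidden-unit activations (n > 2):
\mathbb{E}[\text{training error}] \;\le\; \sigma_*^{2} - O\!\left(\frac{\log T}{T}\right)

% Evaluation of the expected training error underlying NIC, for comparison:
\sigma_*^{2} - \frac{N(n)\,\sigma_*^{2}}{T}
```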


Neurocomputing | 2002

Regularization learning, early stopping and biased estimator

Katsuyuki Hagiwara

In this article, we present a unified statistical interpretation of regularization learning and early stopping for linear networks in the context of statistical regression, i.e. the linear regression model. Firstly, the two concepts are shown to be equivalent to the use of a biased estimator aimed at constructing a network with lower generalization error than the least-squares estimator; this biased estimator turns out to be a shrinkage estimator. Secondly, we show that the optimal regularization parameter and the optimal stopping time with respect to the generalization error are obtained by resolving the bias/variance dilemma. Lastly, we give estimates of the optimal regularization parameter and the optimal stopping time based on the training data. Simple numerical simulations show that these estimates can improve the generalization error compared with the least-squares estimator. Additionally, we discuss the relationship between the Bayesian interpretation of the regularization parameter and the optimal regularization parameter that minimizes the generalization error.
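
The following is a minimal numerical sketch of the idea, assuming a plain linear regression model: ridge regression (regularization learning) and gradient descent stopped early both produce estimators that are shrunk relative to the least-squares solution. The data, the regularization parameter and the stopping time below are arbitrary choices for illustration, not values from the paper.

```python
# Sketch: regularization (ridge) vs. early stopping in a linear regression model.
# Both act as biased, shrunken versions of the least-squares estimator.
import numpy as np

rng = np.random.default_rng(0)
T, d = 100, 10                                   # sample size and number of weights
X = rng.normal(size=(T, d))
w_true = np.zeros(d); w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.5 * rng.normal(size=T)

# Regularization learning (ridge): w = (X'X + lam*I)^{-1} X'y
lam = 50.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Early stopping: gradient descent on the squared error, stopped after a few steps.
eta, t_stop = 1e-3, 10
w_es = np.zeros(d)
for _ in range(t_stop):
    w_es -= eta * (X.T @ (X @ w_es - y))

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]      # least-squares estimator
print("||w_ls||    =", np.linalg.norm(w_ls))
print("||w_ridge|| =", np.linalg.norm(w_ridge))  # shrunk toward zero
print("||w_es||    =", np.linalg.norm(w_es))     # also shrunk: a biased estimator
```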


International Symposium on Neural Networks | 2000

Regularization learning and early stopping in linear networks

Katsuyuki Hagiwara; Kazuhiro Kuno

Generally, learning is performed so as to minimize the sum of squared errors between network outputs and training data. Unfortunately, this procedure does not necessarily yield a network with good generalization ability when the number of connection weights is relatively large; in such a situation, overfitting to the training data occurs. To overcome this problem, there are several approaches such as regularization learning and early stopping, and it has been suggested that these two methods are closely related. In this article, we first give a unified interpretation of the relationship between the two methods through the analysis of linear networks in the context of statistical regression, i.e. the linear regression model. Several theoretical works have addressed the optimal regularization parameter and the optimal stopping time; here, we also consider this problem from the unified viewpoint mentioned above, which clarifies the statistical meaning of the optimality. We then present estimates of the optimal regularization parameter and the optimal stopping time and examine them through simple numerical simulations. Moreover, for the choice of the regularization parameter, we discuss the relationship between the Bayesian framework and the generalization error minimization framework.
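
A textbook-style way to make the correspondence concrete (not necessarily the notation or derivation used in the paper) is via the singular value decomposition X = U D V^T of the design matrix: both regularization and early stopping multiply the components of the least-squares solution by shrinkage factors in (0, 1).

```latex
% Write the least-squares solution in the SVD basis, with singular values d_i:
w_{\mathrm{LS}} = \sum_i v_i \, \frac{u_i^\top y}{d_i}

% Ridge / regularization learning with parameter \lambda:
w_\lambda = \sum_i \frac{d_i^2}{d_i^2 + \lambda}\; v_i \, \frac{u_i^\top y}{d_i}

% Gradient descent on the squared error, started at 0 and stopped after t steps
% with step size \eta:
w_t = \sum_i \bigl(1 - (1 - \eta d_i^2)^t\bigr)\; v_i \, \frac{u_i^\top y}{d_i}

% Matching d_i^2/(d_i^2+\lambda) \approx 1 - (1-\eta d_i^2)^t relates the
% regularization parameter \lambda to the stopping time t.
```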


Neural Networks | 2008

Relation between weight size and degree of over-fitting in neural network regression

Katsuyuki Hagiwara; Kenji Fukumizu

This paper investigates the relation between over-fitting and weight size in neural network regression, focusing on the over-fitting of a network to Gaussian noise. Using a re-parametrization, a network function is represented as a bounded function g multiplied by a coefficient c. The squared sum of the outputs of g at the given inputs is assumed to be bounded below by a positive constant δ(n); this restricts the weight size of the network and enables a probabilistic upper bound on the degree of over-fitting to be derived. The analysis reveals that the order of this probabilistic upper bound can change depending on δ(n). By applying the bound to the over-fitting behavior of a single Gaussian unit, it is shown that the probability of obtaining an extremely small value of the width parameter in training is close to one when the sample size is large.
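
Below is a crude numerical illustration of the setting in the last sentence, assuming a single Gaussian unit c·exp(−(x−a)²/s²) fitted by least squares to pure Gaussian noise. The paper's statement concerns the global minimizer; the random-restart local search used here only approximates it, and all settings are choices made for the illustration.

```python
# Sketch: fit one Gaussian unit to a Gaussian noise sequence and inspect the width.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
T = 200
x = np.sort(rng.uniform(-1.0, 1.0, size=T))
y = rng.normal(scale=1.0, size=T)                 # target is pure Gaussian noise

def sse(params):
    c, a, log_s = params
    s = np.exp(log_s)                             # keep the width positive
    return np.sum((y - c * np.exp(-(x - a) ** 2 / s ** 2)) ** 2)

best = None
for _ in range(50):                               # random restarts of a local search
    p0 = [rng.normal(), rng.uniform(-1, 1), np.log(rng.uniform(1e-3, 1.0))]
    res = minimize(sse, p0, method="Nelder-Mead")
    if best is None or res.fun < best.fun:
        best = res

c_hat, a_hat, log_s_hat = best.x
# The paper's result says the globally fitted width tends to be extremely small
# with probability close to one when the sample size is large.
print("fitted width s =", np.exp(log_s_hat))
print("training SSE   =", best.fun)
```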


International Symposium on Neural Networks | 2000

On the problem in model selection of neural network regression in overrealizable scenario

Katsuyuki Hagiwara; Kazuhiro Kuno; Shiro Usui

In this article, we analyze the expected training error and the expected generalization error in a special case of the overrealizable scenario, in which the output data are a Gaussian noise sequence. Firstly, we derive an upper bound on the expected training error of a network that is independent of the input probability distribution. Secondly, based on this result, we derive a lower bound on the expected generalization error of a network, provided that the inputs are not stochastic. The first result makes it clear that the degree of overfitting of a network to the noise component in the data should be evaluated as larger than the evaluation in NIC. From the second result, the expected generalization error, which is directly associated with the model selection criterion, is also larger than in NIC. These results suggest that a model selection criterion for the overrealizable scenario should be larger than NIC when the inputs are not stochastic. The results of numerical experiments agree with our theoretical results.
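
The following is a small Monte Carlo sketch of this setting, assuming a toy network of Gaussian units fitted to a pure Gaussian noise sequence; it simply reports the average training error next to the two evaluations quoted in the abstracts. Because the local optimizer only approximates the global fit, the measured training error over-estimates the true minimum, and the sizes and counts are arbitrary choices.

```python
# Sketch: fit a small Gaussian-unit network to pure Gaussian noise and compare the
# average training MSE with the NIC evaluation and the log T / T bound.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T, n, sigma2 = 100, 2, 1.0            # data size, hidden units, noise variance

def model(params, x):
    out = np.zeros_like(x)
    for k in range(n):                # each unit: output weight, centre, log-width
        c, a, log_s = params[3 * k:3 * k + 3]
        out += c * np.exp(-(x - a) ** 2 / np.exp(2 * log_s))
    return out

def fit_to_noise():
    x = np.sort(rng.uniform(-1, 1, size=T))
    y = rng.normal(scale=np.sqrt(sigma2), size=T)
    best = np.inf
    for _ in range(10):               # random restarts of a local search
        p0 = rng.normal(scale=0.5, size=3 * n)
        res = minimize(lambda p: np.mean((y - model(p, x)) ** 2), p0,
                       method="Nelder-Mead")
        best = min(best, res.fun)
    return best

avg_train_mse = np.mean([fit_to_noise() for _ in range(10)])
N_n = 3 * n                           # all parameters of this toy model
print("average training MSE            :", avg_train_mse)
print("NIC evaluation sigma^2(1-N(n)/T):", sigma2 * (1 - N_n / T))
print("bound sigma^2(1-2n log T / T)   :", sigma2 * (1 - 2 * n * np.log(T) / T))
```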


Neurocomputing | 2016

On scaling of soft-thresholding estimator

Katsuyuki Hagiwara

LASSO is known to suffer from excessive shrinkage at a sparse representation. To analyze this problem in detail, we consider in this paper a positive scaling for soft-thresholding estimators, which are the LASSO estimators in an orthogonal regression problem. In particular, we consider a non-parametric orthogonal regression problem that includes wavelet denoising. We first give the risk (generalization error) of LARS (least angle regression) based soft-thresholding with a single scaling parameter. We then show that the optimal scaling value that minimizes the risk under a sparseness condition is 1 + O(log n/n), where n is the number of samples. The important point is that the optimal scaling is larger than one: an expanded soft-thresholding estimator has better generalization performance than a naive soft-thresholding estimator, and the risk of LARS-based soft-thresholding with the optimal scaling is smaller than without scaling. We show that the difference between the two risks is O(log n/n), which demonstrates the effectiveness of introducing scaling. Through simple numerical experiments, we find that LARS-based soft-thresholding with scaling can improve both sparsity and generalization performance compared with naive soft-thresholding.
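
A minimal sketch of the scaling idea in an orthogonal (Gaussian sequence) model is given below. It uses plain soft-thresholding at a universal-style threshold and sweeps a scaling factor to estimate the risk, rather than the LARS-based estimator and the optimal scaling formula analyzed in the paper; the signal configuration and grid are arbitrary choices.

```python
# Sketch: scaled (expanded) soft-thresholding vs. naive soft-thresholding.
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 1024, 1.0
mu = np.zeros(n); mu[:20] = 5.0                  # sparse true coefficients (arbitrary)

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

lam = sigma * np.sqrt(2 * np.log(n))             # universal threshold (an assumption)
scales = np.linspace(1.0, 3.0, 41)
risk = np.zeros_like(scales)
trials = 200
for _ in range(trials):
    z = mu + sigma * rng.normal(size=n)          # noisy observations
    st = soft(z, lam)
    for i, c in enumerate(scales):
        risk[i] += np.mean((c * st - mu) ** 2)   # squared-error risk estimate
risk /= trials

print("estimated risk at c = 1 (naive):", risk[0])
print("risk-minimizing scaling c      :", scales[np.argmin(risk)])  # larger than one here
print("estimated risk at that scaling :", risk.min())
```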


IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences | 2006

On the Expected Prediction Error of Orthogonal Regression with Variable Components

Katsuyuki Hagiwara; Hiroshi Ishitani

In this article, we consider the asymptotic expectations of the prediction error and the fitting error of a regression model in which the component functions are chosen from a finite set of orthogonal functions. Under least-squares estimation, we show that the asymptotic bias in estimating the prediction error from the fitting error involves the true number of components, which is essentially unknown in practical applications. On the other hand, under a suitable shrinkage method, we show that an asymptotically unbiased estimate of the prediction error is given by the fitting error plus a known term, apart from the noise variance.
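
As a rough numerical illustration of why the fitting error is a biased estimate of the prediction error when the components are selected from the data, the sketch below selects the largest least-squares coefficients in an orthonormal dictionary and compares the average gap between prediction and fitting error with the fixed-design correction 2σ²k/T. This only illustrates the bias issue; the paper's asymptotic expressions are not reproduced here, and all sizes are arbitrary.

```python
# Sketch: fitting error vs. prediction error with data-driven component selection.
import numpy as np

rng = np.random.default_rng(4)
T, K, k_sel, sigma = 128, 32, 8, 1.0
U, _ = np.linalg.qr(rng.normal(size=(T, K)))     # orthonormal component vectors
theta = np.zeros(K); theta[:5] = 1.0             # true components (arbitrary choice)
f_true = U @ theta

gaps = []
for _ in range(500):
    y = f_true + sigma * rng.normal(size=T)
    z = U.T @ y                                  # least-squares coefficients
    keep = np.argsort(-np.abs(z))[:k_sel]        # variable components: keep the largest
    theta_hat = np.zeros(K); theta_hat[keep] = z[keep]
    fit_err = np.mean((y - U @ theta_hat) ** 2)
    pred_err = sigma ** 2 + np.mean((f_true - U @ theta_hat) ** 2)
    gaps.append(pred_err - fit_err)

print("average (prediction - fitting) error :", np.mean(gaps))
print("fixed-design correction 2*sigma^2*k/T:", 2 * sigma ** 2 * k_sel / T)
```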


International Conference on Neural Information Processing | 2014

Least Angle Regression in Orthogonal Case

Katsuyuki Hagiwara

LARS (least angle regression) is one of the sparse modeling methods. This article considers LARS under an orthogonal design matrix, which we refer to as LARSO. We show that LARSO reduces to a simple non-iterative algorithm: a greedy procedure with shrinkage estimation. Based on this result, we find that LARSO is exactly equivalent to a soft-thresholding method in which the threshold level at the kth step is the (k+1)th largest of the absolute values of the least-squares estimators. For LARSO, a Cp-type model selection criterion can be derived. It can be interpreted not only as a criterion for choosing the number of steps/coefficients in a regression problem but also as a criterion for determining an optimal threshold level in the LARSO-oriented soft-thresholding method, which may be useful especially in non-parametric regression problems. Furthermore, in the context of orthogonal non-parametric regression, we clarify the relationship between LARSO with the Cp-type criterion and several methods such as universal thresholding and SureShrink in wavelet denoising.
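
The non-iterative rule described above translates directly into code. The sketch below, assuming an orthonormal design so that the least-squares estimates are z = X^T y, builds the LARSO path by soft-thresholding the k largest |z| at the (k+1)th largest |z| at step k; the example data are arbitrary.

```python
# Sketch of LARS under an orthonormal design (LARSO), as described in the abstract.
import numpy as np

def larso_path(z):
    """z: least-squares estimates under an orthonormal design (z = X.T @ y)."""
    p = len(z)
    order = np.argsort(-np.abs(z))               # indices by decreasing |z|
    absz = np.abs(z[order])
    path = []
    for k in range(1, p + 1):
        thr = absz[k] if k < p else 0.0          # (k+1)-th largest |z|; 0 at the last step
        beta = np.zeros(p)
        idx = order[:k]                          # active set: k largest |z|
        beta[idx] = np.sign(z[idx]) * (np.abs(z[idx]) - thr)
        path.append(beta)
    return path

rng = np.random.default_rng(5)
z = rng.normal(size=6) + np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
for k, beta in enumerate(larso_path(z), start=1):
    print(f"step {k}: nonzero coefficients = {np.count_nonzero(beta)}")
```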


International Conference on Neural Information Processing | 2009

Orthogonalization and Thresholding Method for a Nonparametric Regression Problem

Katsuyuki Hagiwara

In this article, we propose training methods for improving the generalization capability of a learning machine that is defined by a weighted sum of many fixed basis functions and is used as a nonparametric regression method. In the proposed methods, the vectors of basis function outputs are orthogonalized, and the coefficients of the orthogonal vectors are estimated instead of the weights. A coefficient is set to zero if it falls below a predetermined threshold level that is theoretically justified under the assumption of Gaussian noise; the resulting weight vector is then obtained by transforming the thresholded coefficients. When an eigen-decomposition based orthogonalization procedure is applied, the method yields shrinkage estimators of the weights; when the Gram-Schmidt orthogonalization scheme is employed, it produces a sparse representation of the target function in terms of the basis functions. A simple numerical experiment demonstrates the validity of the proposed methods by comparison with alternative methods, including leave-one-out cross-validation.
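
A minimal sketch of the Gram-Schmidt variant is given below: orthogonalize the vectors of basis-function outputs with a QR decomposition, hard-threshold the coefficients of the orthogonal vectors, then map the surviving coefficients back to basis-function weights. The universal-style threshold, the known noise level, and the Gaussian basis are assumptions made for the illustration, not necessarily the choices in the paper.

```python
# Sketch: orthogonalization (QR / Gram-Schmidt) + thresholding + back-transform.
import numpy as np

rng = np.random.default_rng(6)
T, m, sigma = 100, 30, 0.2
x = np.sort(rng.uniform(-1, 1, size=T))
y = np.sin(3 * x) + sigma * rng.normal(size=T)          # toy regression data

centers = np.linspace(-1, 1, m)                          # fixed Gaussian basis functions
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * 0.1 ** 2))

Q, R = np.linalg.qr(Phi)                                 # Gram-Schmidt orthogonalization
c = Q.T @ y                                              # coefficients of orthogonal vectors
thr = sigma * np.sqrt(2 * np.log(m))                     # Gaussian-noise-based threshold
c_thr = np.where(np.abs(c) > thr, c, 0.0)                # zero out the small coefficients

w = np.linalg.solve(R, c_thr)                            # transform back to basis weights
print("kept orthogonal coefficients:", np.count_nonzero(c_thr), "of", m)
print("training MSE                :", np.mean((y - Phi @ w) ** 2))
```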


International Conference on Neural Information Processing | 2007

Orthogonal Shrinkage Methods for Nonparametric Regression under Gaussian Noise

Katsuyuki Hagiwara

In this article, we propose shrinkage methods for training a regression machine via a regularized cost function. The machine considered here is a linear combination of fixed basis functions in which the number of basis functions, or equivalently the number of adjustable weights, equals the number of training data; this setting can be viewed as a nonparametric regression method in statistics. In the regularized cost function, the error term is the sum of squared errors and the regularization term is a quadratic form of the weight vector. Assuming i.i.d. Gaussian noise, we propose three thresholding methods for the orthogonal components obtained by eigendecomposition of the Gram matrix of the vectors of basis function outputs. The final weight values are obtained by a linear transformation of the thresholded orthogonal components and are shrinkage estimators. The proposed methods are simple and automatic: it suffices to fix the regularization parameter at a small constant value. Simple numerical experiments show that, compared with leave-one-out cross-validation, the computational cost of the proposed methods is much lower, while the generalization capability of the trained machines is comparable when the number of data is relatively large.
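
The sketch below illustrates the eigendecomposition-based variant for the case where the number of basis functions equals the number of data points, using one Gaussian basis function per data point. It computes the eigen-pairs of the Gram matrix via an SVD of the design matrix (which is numerically equivalent), thresholds the orthogonal components at a Gaussian-noise level, and recovers shrinkage weights with a small fixed regularization constant. The specific threshold and constants are assumptions; the paper's three thresholding methods are not reproduced here.

```python
# Sketch: eigendecomposition-based shrinkage with one basis function per data point.
import numpy as np

rng = np.random.default_rng(7)
T, sigma, lam = 80, 0.2, 1e-6                     # lam: small, fixed regularization constant
x = np.sort(rng.uniform(-1, 1, size=T))
y = np.sin(3 * x) + sigma * rng.normal(size=T)

Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.2 ** 2))   # T x T design matrix

# SVD of Phi: s**2 and the columns of Vt.T are the eigenvalues / eigenvectors of the
# Gram matrix Phi.T @ Phi, and the columns of U are the orthogonal output directions.
U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
c = U.T @ y                                        # orthogonal components of the data
thr = sigma * np.sqrt(2 * np.log(T))               # Gaussian-noise-based threshold (assumed)
c_thr = np.where(np.abs(c) > thr, c, 0.0)

w = Vt.T @ (s / (s ** 2 + lam) * c_thr)            # shrinkage weights via a linear transform
print("kept orthogonal components:", np.count_nonzero(c_thr), "of", T)
print("training MSE              :", np.mean((y - Phi @ w) ** 2))
```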

Collaboration


Katsuyuki Hagiwara's main co-authors and their affiliations.

Shiro Usui

RIKEN Brain Science Institute

Taichi Hayasaka

Toyohashi University of Technology

Naohiro Toda

Toyohashi University of Technology
