The problem of perfect predictors in statistical spike train models
Sahand Farhoodi | Uri Eden

Department of Mathematics and Statistics, Boston University, Boston, MA, 02215, USA
Correspondence
Sahand Farhoodi, Department of Mathematics and Statistics, Boston University, Boston, MA, 02215, USA
Email: [email protected]
Funding information
NIH, NIMH, MH105174; Simons Foundation, Simons Collaboration on the Global Brain, 542971
Generalized Linear Models (GLMs) have been used extensively in statistical models of spike train data. However, the IRLS algorithm, which is often used to fit such models, can fail to converge in situations where response and non-response can be separated by a single predictor or a linear combination of multiple predictors. Such situations are likely to arise in many neural systems due to properties such as refractoriness and incomplete sampling of the signals that influence spiking. In this paper, we describe multiple classes of approaches to address this problem: standard IRLS with a fixed iteration limit, computing the maximum likelihood solution in the limit, Bayesian estimation, regularization, change of basis, and modifying the search parameters. We demonstrate a specific application of each of these methods to spiking data from rat somatosensory cortex and discuss the advantages and disadvantages of each. We also provide an example of a roadmap for selecting a method based on the problem's particular analysis issues and scientific goals.
KEYWORDS
neural coding, generalized linear models, model convergence, perfect predictors, complete separation

1 | INTRODUCTION
Generalized linear models (GLMs) provide a powerful tool for relating observed neural spiking data to the biological, behavioral, and stimulus signals that influence them [1, 2, 3]. These models express the conditional intensity of spiking, which defines the probability of observing a spike in any small time interval, in terms of a design matrix whose columns represent the signals, or covariates, of interest. GLMs possess a number of properties that make them well suited to spike train modeling [1, 4, 5, 6]. One important property is that GLMs can be formulated to guarantee that the likelihood of the spiking data is convex with respect to the model parameters, allowing for rapid computation of their maximum likelihood estimates (MLE) [7, 8]. Most statistical software packages implement maximum likelihood estimation for GLMs using the computationally efficient Iteratively Reweighted Least Squares (IRLS) algorithm [9, 10, 11]. Another useful property of GLMs is that in most cases, the parameter estimates have an asymptotically multivariate normal distribution with a covariance matrix determined by the Fisher information, which is computed as part of the IRLS algorithm [9, 11]. This makes it easy to compute confidence intervals (CI) about individual parameters or about the firing intensity, as well as providing for simple hypothesis tests, e.g. maximum likelihood ratio tests, about whether firing rates are influenced by specific sets of covariates [9, 11]. Accurate estimation of the Fisher information also helps determine the extent to which one signal or set of signals is confounded with another in its influence on neural spiking [9, 11].

For some GLMs, the IRLS algorithm may fail to converge because the likelihood is maximized in the limit as one or multiple parameters tend to ±∞. This problem is sometimes referred to as the monotone likelihood or non-existence of the MLE problem [12]. Another perspective is that this situation arises when the response and non-response can be separated by a single predictor or a linear combination of multiple predictors. For this reason, this phenomenon is more commonly referred to as the complete separation problem [13]. This problem is well studied in the case of logistic regression with small to medium-sized data [13, 14, 15, 16, 17, 18]. Heinze et al. explore a number of potential solutions for this case, including the omission of predictors that result in complete separation, fixing the parameter values for such predictors prior to fitting the model, changing the form of the model, and using an ad-hoc adjustment [19, 20]. They arrive at a preferred procedure using a modification proposed by Firth in order to reduce the bias of maximum likelihood estimates in GLMs [21, 22, 23].

The problem arises in the statistical literature on point processes such as Cox processes as well [24, 25], but is not discussed in detail in the neural modeling literature, despite the fact that it is likely to arise in many neural systems due to properties such as refractoriness and incomplete sampling of the signals that influence spikes. In these models, separation occurs when any covariate forces the system not to spike whenever it takes on a nonzero value. Such covariates are called perfect predictors, since one can perfectly predict no spiking when they are nonzero. For example, a model for a neuron with an absolute refractory period lasting longer than 1 ms will be perfectly predicted by an indicator for whether there has been a spike in the past 1 ms.
Perfect predictors can also appear in models when nonzero values of certain predictors are infrequently sampled, even in cases where a lot of data exists. For example, a model fit for a rat's hippocampal place field that includes an indicator for being in a corner of the environment that the rat tends to avoid may suggest that this variable is a perfect predictor, even though the neuron might occasionally fire in that region given enough time. We say that perfect predictors of the kind in the first example are structural, while ones of the kind in the second example are the result of sampling.

We posit that when these issues arise in statistical neural models, researchers adopt one of a set of ad-hoc methods to avoid dealing with them, and the issues do not make it into the discussion sections of the resulting papers. However, perfect prediction can lead to a variety of issues limiting the utility of inference from point process GLMs, which may not be alleviated by ad-hoc approaches. One issue is that the IRLS algorithm will not converge, and will typically only stop when some fixed iteration limit has been reached. This often involves an order of magnitude more computational time than fitting a GLM without perfect predictors. In some cases, the IRLS algorithm reaches a region of the likelihood surface where the computations become numerically unstable, and subsequent iterations can actually lead to a reduction in likelihood. Another important issue is that the computed Fisher information in these fit models does not accurately reflect the variance-covariance structure of the parameter estimates. In particular, the computed Fisher information is often close to singular, and if inappropriately used to construct confidence bounds and perform hypothesis tests, typically leads to extremely large standard error estimates for the perfect predictors and inaccurate covariance between the perfect predictors and all other signals [19, 26, 27].

In this paper, we explore a range of different approaches for dealing with perfect predictors in point process GLMs. These approaches fall into multiple categories: interpreting the IRLS output despite lack of convergence, fixing the parameter estimates for perfect predictors, Bayesian estimation methods, regularization methods, re-parameterizing the model to remove perfect predictors, and modifications to the parameter search domain or stopping criteria. In the Methods section, we introduce these different approaches and explain how they deal with the problem of separation. In the Results section, we select one specific method from each category and illustrate its application to modeling spiking data from a rat cortical neuron, in vitro. Furthermore, we compare these methods and illustrate how they deal with different problems associated with complete separation. Finally, in the Discussion section, we discuss the properties of these approaches, point out some advantages and disadvantages of each method, and introduce a potential roadmap to select the most appropriate method to use in various situations.

2 | METHODS

2.1 | Problem Formulation
Let X be the design matrix with p predictors of neural spiking (columns) and n observations (rows), and let Y = (y_1, ..., y_n)^T be the response vector of spike counts, where the superscript T denotes the transpose of a matrix. We use x_{ij}, x_{i·}, and x_{·j} respectively to denote the ij-th element, i-th row, and j-th column of the matrix X. Let β = (β_0, β_1, ..., β_p)^T be the parameter vector, θ = Xβ, and μ = E(Y | X) be the conditional expectation of Y given X. Let g(·) be the link function that relates the conditional expectation of Y to the linear combination of predictors, i.e. g(μ) = θ. Here, we work with the canonical link function for the Poisson point process, which is the log function [9]. The Poisson point process GLM is described by

$$ y_j \mid x_{j\cdot} \sim \text{Poisson}(\lambda_j) \tag{1a} $$
$$ \lambda_j = \exp\Big( \beta_0 + \sum_{i=1}^{p} x_{ji}\,\beta_i \Big) \tag{1b} $$

where λ_j is the firing rate for the j-th observation and 1 ≤ j ≤ n. The IRLS algorithm is widely used to find the maximum likelihood solution for any generalized linear model. This algorithm starts with an initial vector β^{(0)} and then computes β^{(s)} recursively using the update equation

$$ \beta^{(s+1)} = \beta^{(s)} + I\big(\beta^{(s)}\big)^{-1}\, U\big(\beta^{(s)}\big) \tag{2} $$

until convergence, where U(β) and I(β) respectively denote the score function and the Fisher information computed at β. For the process described in Eq. 1, the log-likelihood function l(β), the score function U(β), and the Fisher information I(β) are defined as

$$ l(\beta) = \sum_{i=1}^{n} y_i \log(\mu_i) - \sum_{i=1}^{n} \mu_i \tag{3a} $$
$$ U(\beta) = \frac{\partial l(\beta)}{\partial \beta} = X^T (Y - \mu) \tag{3b} $$
$$ I(\beta) = -\frac{\partial^2 l(\beta)}{\partial \beta^2} = X^T W X \tag{3c} $$

where W is the matrix of weights defined by W^{-1} = (dθ/dμ)^2 V and V is the variance function evaluated at μ [9]. For our choice of link function g(·), W will be a diagonal matrix with i-th diagonal element equal to μ_i.

The i-th column of X is a perfect predictor if having a nonzero value in this column leads to a zero spike count, i.e. for any j ∈ {1, ..., n}, x_{ji} ≠ 0 implies y_j = 0. Additionally, a set of columns i_1, ..., i_m generates a perfect predictor if some linear combination of them is a perfect predictor, i.e. there exists a set of weights a_1, ..., a_m such that whenever a_1 x_{j i_1} + ... + a_m x_{j i_m} ≠ 0, then y_j = 0. We let S be the set of column indices corresponding to perfect predictors or columns that generate perfect predictors, and define the set of perfect parameters as {β_i : i ∈ S}. These parameters diverge to ±∞ when the IRLS algorithm (described in Eq. 2) is used to fit the model. We say that a row is a perfect row if it has at least one nonzero value in a perfect column.

A naive way to deal with perfect predictors is to omit them from the model by removing all perfect columns and perfect rows from X. However, perfect predictors are extremely informative, and removing them can harm the model substantially in terms of goodness-of-fit and statistical power. Putting this naive approach aside, in what follows we briefly explore different families of approaches that can be used to deal with perfect predictors.
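To make the definition concrete, the following sketch (hypothetical code, not from the paper) scans a design matrix for single-column perfect predictors and the corresponding perfect rows. Detecting linear combinations of columns that generate a perfect predictor is a harder problem and is not attempted here.

```python
import numpy as np

def find_perfect_columns(X, y):
    """Indices i such that y_j = 0 whenever x_ji != 0 (single-column case)."""
    perfect = []
    for i in range(X.shape[1]):
        nonzero = X[:, i] != 0
        if nonzero.any() and np.all(y[nonzero] == 0):
            perfect.append(i)
    return perfect

def find_perfect_rows(X, perfect_cols):
    """Rows with at least one nonzero value in a perfect column."""
    if not perfect_cols:
        return np.array([], dtype=int)
    return np.where((X[:, perfect_cols] != 0).any(axis=1))[0]
```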
2.2 | Standard IRLS with Fixed Iteration Limit

The first approach for dealing with perfect predictors is to use the standard IRLS procedure to estimate the MLE without adjusting the model or estimation procedure. As pointed out before, in the presence of perfect predictors the actual maximum likelihood solution is achieved only when the perfect parameters are equal to ±∞, and therefore the parameter estimates will depend on when the algorithm terminates, typically after a fixed iteration limit is reached. We will call this approach "Standard IRLS" throughout the rest of this paper.
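A minimal IRLS loop for the Poisson GLM of Eqs. 1-3, with the fixed iteration limit that defines this approach, might look as follows (a sketch; the paper does not publish code). With perfect predictors present, the convergence test never triggers, the loop runs to max_iter, and the perfect parameters drift toward −∞.

```python
import numpy as np

def irls_poisson(X, y, max_iter=100, tol=1e-8):
    """Fit y ~ Poisson(exp(b0 + X @ b)) by the IRLS update of Eq. 2."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])             # prepend intercept column
    beta = np.zeros(p + 1)
    for _ in range(max_iter):
        mu = np.exp(Xd @ beta)                        # conditional mean, Eq. 1b
        U = Xd.T @ (y - mu)                           # score function, Eq. 3b
        I = Xd.T @ (mu[:, None] * Xd)                 # Fisher information, Eq. 3c
        step = np.linalg.lstsq(I, U, rcond=None)[0]   # I can be near-singular
        beta += step
        if np.max(np.abs(step)) < tol:                # never satisfied when
            break                                     # perfect predictors exist
    return beta
```

Most statistical packages expose the same cap; for example, statsmodels' GLM with a Poisson family accepts a maxiter argument in its fit() method.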
2.3 | Maximum Likelihood Limit

Eq. 1b can also be rewritten as

$$ \lambda_j = \exp\Big( \beta_0 + \sum_{i \notin S} x_{ji}\,\beta_i + \sum_{i \in S} |x_{ji}|\,\beta_i \Big). \tag{4} $$

For perfect parameters (i ∈ S), taking the absolute value of x_{ji} only affects the sign of β_i and does not change the predicted values of our model. Thus, the maximum likelihood solution is achieved when, for any i ∈ S, β_i is equal to its limiting value −∞. Therefore, one approach is to set the perfect parameters equal to their limiting value manually, and then use the IRLS algorithm to estimate the non-perfect parameters, after omitting the perfect columns and rows from X and Y. The final model then includes both the estimated parameters when perfect predictors are removed, and the perfect parameters set to −∞. In practice, a perfect parameter cannot be set numerically equal to −∞, so often a very large negative number is used instead.
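A sketch of this procedure under the definitions above (hypothetical code; it reuses the find_perfect_columns, find_perfect_rows, and irls_poisson sketches from earlier, and uses a large negative surrogate for −∞):

```python
import numpy as np

def ml_limit_fit(X, y, neg_inf=-1e2):
    """Remove perfect rows/columns, fit the reduced model, reassemble beta."""
    perfect = find_perfect_columns(X, y)
    keep_cols = np.array([i for i in range(X.shape[1]) if i not in perfect],
                         dtype=int)
    keep_rows = np.setdiff1d(np.arange(X.shape[0]),
                             find_perfect_rows(X, perfect))
    fit = irls_poisson(X[np.ix_(keep_rows, keep_cols)], y[keep_rows])
    beta = np.empty(X.shape[1] + 1)
    beta[0] = fit[0]                                  # intercept
    beta[1 + keep_cols] = fit[1:]                     # non-perfect parameters
    beta[1 + np.array(perfect, dtype=int)] = neg_inf  # stand-in for -infinity
    return beta
```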
2.4 | Bayesian Estimation

The next approach to deal with the problem of separation is using a Bayesian GLM. In this approach, we treat the parameters β as random variables, assign them prior distributions π, and compute and maximize their posterior distributions given the observed data. For the case that π is a zero-mean distribution with covariance matrix Σ, the posterior log-likelihood function, the posterior score function, and the posterior Fisher information are respectively given by

$$ l_b(\beta) = \sum_{i=1}^{n} y_i \log(\mu_i) - \sum_{i=1}^{n} \mu_i - \tfrac{1}{2}\,\beta^T \Sigma^{-1} \beta \tag{5a} $$
$$ U_b(\beta) = X^T (Y - \mu) - \Sigma^{-1} \beta \tag{5b} $$
$$ I_b(\beta) = X^T W X + \Sigma^{-1} \tag{5c} $$

where the subscript b stands for Bayesian. In order to use the IRLS algorithm to fit the Bayesian GLM, these equations can be used in Eq. 2 to maximize l_b(β). An appropriate choice of Σ can solve the perfect predictor problem in a couple of ways: the diagonal terms of Σ prevent any individual parameters from diverging to ±∞, and the off-diagonal terms of Σ can be used to impose smoothness between sets of parameters, so that sampling issues are less likely to cause perfect predictors.
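The posterior quantities of Eq. 5 drop into the same IRLS loop with two extra terms. A sketch (hypothetical code; placing a flat prior on the intercept and the zero-mean Gaussian prior with covariance Sigma on the remaining coefficients is our choice, not something the paper specifies):

```python
import numpy as np

def irls_poisson_map(X, y, Sigma, max_iter=100, tol=1e-8):
    """MAP estimate under a zero-mean Gaussian prior, via modified IRLS."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    Sinv = np.zeros((p + 1, p + 1))
    Sinv[1:, 1:] = np.linalg.inv(Sigma)       # prior precision; intercept flat
    beta = np.zeros(p + 1)
    for _ in range(max_iter):
        mu = np.exp(Xd @ beta)
        U = Xd.T @ (y - mu) - Sinv @ beta     # posterior score, Eq. 5b
        I = Xd.T @ (mu[:, None] * Xd) + Sinv  # posterior information, Eq. 5c
        step = np.linalg.solve(I, U)          # Sinv keeps I invertible
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```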
2.5 | Regularization Methods

Another approach that is commonly used to address the problem of separation is regularization. In this approach, instead of maximizing the log-likelihood function, a weighted average of the log-likelihood function and a penalty function, p(β), is maximized. This penalty function is intended to prevent parameters from diverging by penalizing larger values of the parameters. The Lasso and Ridge penalty functions are two of the most commonly used [28]. A penalty function that has been used specifically to deal with the problem of separation in logistic regression is Firth's penalty function [20]. In fitting a regularized GLM, the regularized log-likelihood function, the regularized score function, and the regularized Fisher information are computed according to

$$ l_r(\beta) = (1 - \Lambda)\, l(\beta) - \Lambda\, p(\beta) \tag{6a} $$
$$ U_r(\beta) = (1 - \Lambda)\, X^T (Y - \mu) - \Lambda\, \frac{dp(\beta)}{d\beta} \tag{6b} $$
$$ I_r(\beta) = (1 - \Lambda)\, X^T W X + \Lambda\, \frac{d^2 p(\beta)}{d\beta^2} \tag{6c} $$

where the subscript r stands for regularization and Λ is a variable determining the amount of penalization. These quantities are then used directly in the IRLS algorithm described by Eq. 2 to maximize the regularized log-likelihood function.
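For the ridge penalty p(β) = β^T β used later in the paper, Eq. 6 again amounts to small modifications of the IRLS loop. A sketch (hypothetical code; leaving the intercept unpenalized is a common convention rather than something the paper states):

```python
import numpy as np

def irls_poisson_ridge(X, y, lam, max_iter=100, tol=1e-8):
    """Maximize (1 - lam) * loglik - lam * beta^T beta by modified IRLS (Eq. 6)."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    P = 2.0 * lam * np.eye(p + 1)             # lam * d^2p/dbeta^2 for p = b^T b
    P[0, 0] = 0.0                             # leave the intercept unpenalized
    beta = np.zeros(p + 1)
    for _ in range(max_iter):
        mu = np.exp(Xd @ beta)
        U = (1 - lam) * (Xd.T @ (y - mu)) - P @ beta       # Eq. 6b
        I = (1 - lam) * (Xd.T @ (mu[:, None] * Xd)) + P    # Eq. 6c
        step = np.linalg.solve(I, U)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```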
2.6 | Change of Basis

The next family of methods focuses on reparameterizing the model in order to impose smoothness between specific sets of parameters. This method can be used when we believe that the receptive field has some smoothness properties that are not explicitly captured by the model structure. For example, it is common to use estimated spike rates in spatial bins to model place fields, despite the fact that this does not capture the fact that rates in adjacent bins tend to be close. Similarly, models of refractoriness or bursting often focus on autocovariance in lagged bins, ignoring the tendency of neurons to have smooth history dependence structure. One way of imposing smoothness on predictors is using basis functions such as smoothing splines [29, 30] or raised cosines [31]. Employing this approach, Eq. 1b is replaced by

$$ \lambda_j = \exp\Big( \tilde\beta_0 + \sum_{k=1}^{q} g_k(x_{j\cdot})\,\tilde\beta_k \Big) \tag{7} $$

where g_k is the k-th basis function and q denotes the number of basis functions. As before, the IRLS algorithm described in Eq. 2 can be used to fit this model. This approach is capable of alleviating issues related to both structural and sampling perfect predictors, by introducing a new, smoother predictor, g_k(x_{j·}), that integrates information over a range of the receptive field that is better sampled.
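Operationally, when the basis functions are linear in the original predictors (as in Eq. 10 of the Appendix), the change of basis replaces the n-by-p design matrix with an n-by-q matrix via a single matrix product. A sketch (hypothetical code; B stands for any p-by-q smoothing basis, e.g. the cardinal spline matrix S constructed in the Appendix, and the irls_poisson sketch from Section 2.2 is reused):

```python
import numpy as np

def fit_with_basis(X, y, B):
    """Fit Eq. 7 with g_k(x_j.) = (X @ B)[j, k]; map back to original space."""
    G = X @ B                          # n-by-q transformed design matrix
    beta_tilde = irls_poisson(G, y)    # fit the reparameterized model
    beta_orig = B @ beta_tilde[1:]     # beta_{1:p} = B @ beta_tilde_{1:q}
    return beta_tilde, beta_orig
```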
2.7 | Modified Search Criteria

Another approach is to modify the search or stopping criteria for the estimation algorithm to prevent parameters from diverging. One explicit way of doing so is to restrict the search space of the IRLS algorithm and force parameters to remain in that restricted space. Another way of preventing parameters from diverging by modifying the search criterion is to update β_i in Eq. 2 only if the corresponding component of the score function is greater than a threshold. This prevents the IRLS algorithm from making adjustments to β when the log-likelihood function is nearly flat, which is a characteristic of the likelihood in the presence of perfect predictors.
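Both variants can be grafted onto the IRLS loop. A sketch (hypothetical code with our own choices of threshold and projection): freeze coordinates whose score component is nearly zero, and project the coefficient vector back onto the ball Σ β_i² ≤ r after each step.

```python
import numpy as np

def irls_poisson_restricted(X, y, r, score_tol=1e-6, max_iter=100):
    """IRLS with a score threshold and a bounded search space."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta = np.zeros(p + 1)
    for _ in range(max_iter):
        mu = np.exp(Xd @ beta)
        U = Xd.T @ (y - mu)
        I = Xd.T @ (mu[:, None] * Xd)
        step = np.linalg.lstsq(I, U, rcond=None)[0]
        step[np.abs(U) < score_tol] = 0.0     # skip nearly flat directions
        beta += step
        sq = np.sum(beta[1:] ** 2)            # intercept excluded from bound
        if sq > r:
            beta[1:] *= np.sqrt(r / sq)       # project back into the ball
    return beta
```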
3 | RESULTS

3.1 | Data Set and the Specific Estimation Algorithms

In this section, we pick one specific method from each family of approaches introduced in the Methods section, and compare them for the problem of fitting spiking data collected from rat cortical neurons. These data have been previously used for challenge A of the Quantitative Single-Neuron Modeling Competition (2009), developed by researchers at Ecole Polytechnique Fédérale de Lausanne (EPFL), and are accessible on CRCNS.org. They contain spiking activity from the somatosensory cortex of a Wistar rat in response to somatic current stimulation. This dataset contains 13 repetitions of a 60-second stimulation protocol, where for the first 39 seconds both the injected current waveform in pA and the voltage responses are provided [32, 33]. We only use the first 39 seconds of these repetitions to construct our training set. Two repetitions are used to fit the model and the other repetitions are used for resampling to compute cross-validation statistics and to examine whether perfect predictors arise due to subsampling. Different pairs of repetitions are used to fit each model multiple times (10 times in total) and we report average statistics for memory usage and computational run times (see Table 1). See http://crcns.org/data-sets/challenges/ch-epfl-2009 for more details.
We denote the firing rate of the neuron at time t by λ(t), the spiking count in the interval (t, t + Δt) by ΔN_t, the firing history of the neuron up to time t by H_t, and let p be a fixed value representing the length of history which we believe can influence the current spike intensity. We divide the whole range of current stimulus levels into q disjoint intervals, and let I_i(t) be an indicator function that takes value 1 if the input current at time t falls into the i-th interval. We relate the firing rate at time t to the past history of spike counts and the input current using the following point process GLM:

$$ \lambda(t \mid H_t) = \exp\Big( \beta_0 + \sum_{i=1}^{p} \beta_i\,\Delta N_{t-i} + \sum_{i=1}^{q} \beta_{p+i}\, I_i(t) \Big) \tag{8} $$

where p and q are set to 200 ms and 6 respectively. In general, statistical model identification procedures, such as step-wise regression, can be used to select the order of a model. Here, our purpose is not to identify the most parsimonious model, but to introduce a model that can capture specific features of the data, and to compare multiple fitting procedures in their handling of perfect predictors.

When we fit this model for a single neuron using the IRLS algorithm directly, we obtain a fit with 31 perfect predictors. 20 of these perfect predictors are associated with the parameters that capture history dependence (the first sum of Eq. 8) occurring at small lags, due to the refractory period of the neuron. These likely reflect structural perfect predictors, in that we would not expect to see any spikes occur immediately after another, even if we collected much more data. There are also two perfect predictors corresponding to the two lowest current indicator functions, I_1(t) and I_2(t). It is not immediately clear if these are structural perfect predictors or sampling predictors (i.e. whether the neuron has a non-negligible probability of firing when the stimulus current is this low). In this case, when we increase the data size (by using the extra trials in the training set to fit the model), these perfect predictors remain, suggesting they may be structural as well. The remaining perfect predictors occurred either in the history component of the model, at specific lags whose neighbors were not perfect predictors, or in the stimulus response component, for stimulus levels that we expect could drive some amount of neural spiking. In order to analyze whether these perfect predictors are structural or due to sampling, we expand the training data by including additional trials and fit the same model described in Eq. 8. We observed that all of these perfect predictors vanish, and hence we categorize them as sampling perfect predictors, since they only existed in the first place because of limited sampling in the training data. Without using any of the methods described in the Methods section, all the perfect predictors in this example diverged to large negative numbers, limited only by the finite iteration limit of the IRLS algorithm.

In order to compare the general approaches defined in the Methods section, we selected one specific method from each family that, in our estimation, is the most natural option or is most likely to be used by neural data analysts when encountering the problem of separation. However, we focus only on those advantages and disadvantages of each method that can be generalized to all methods of the family from which it was selected. Below, we specify the method selected from each family of approaches.
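Before specifying the selected methods, it may help to make Eq. 8 concrete. A hypothetical construction of its design matrix from 1 ms binned data (the function and variable names, and the equal-width binning of the current range, are our illustrative choices, not the paper's):

```python
import numpy as np

def build_design(dN, current, p=200, q=6):
    """dN: spike counts per 1 ms bin; current: injected current (pA) per bin."""
    n = len(dN) - p                               # usable observations
    # history columns: column i holds the spike count at lag i (1..p ms)
    hist = np.column_stack([dN[p - i : p - i + n] for i in range(1, p + 1)])
    # stimulus columns: indicator of which of q current intervals is active
    edges = np.linspace(current.min(), current.max(), q + 1)
    level = np.clip(np.digitize(current[p:], edges) - 1, 0, q - 1)
    stim = np.eye(q)[level]                       # I_1(t), ..., I_q(t)
    return np.hstack([hist, stim]), dN[p:]
```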
Standard IRLS.
For this method, we set the iteration bound equal to 100, and run the IRLS algorithm without adjusting the model or estimation procedure.
Maximum Likelihood Limit.
In this method, the perfect rows and columns are removed from X and Y prior to fitting the model. The final model includes the estimated parameters when perfect predictors are removed, combined with the perfect parameters set to −∞.
Bayesian Estimation.
There are two types of parameters in the model described by Eq. 8: the ones corresponding to the history spike counts (β_{1:p}) and the ones corresponding to the input current (β_{p+1:p+q}). We consider separate prior distributions for these two groups, as we observed no correlation between their parameters. Each of these prior distributions is a zero-mean multivariate normal distribution with a covariance matrix Σ with elements Σ_{ij} = c^{|i−j|}, where c ∈ [0, 1]. This choice of Σ defines a temporal correlation on parameters that falls off geometrically as the distance between them increases, and therefore can deal with the problem of separation by imposing smoothness between nearby estimated parameters. We selected the parameter c using a leave-one-out cross-validation method. We note that more advanced methods for estimating this parameter are possible (e.g. treating it as a hyperparameter and building a hierarchical Bayesian model to estimate it).
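A sketch of this prior covariance (hypothetical code; the illustrative value of c below is arbitrary, since the paper selects c by leave-one-out cross-validation):

```python
import numpy as np
from scipy.linalg import block_diag

def geometric_cov(size, c):
    """Sigma_ij = c^|i-j|: correlation decays geometrically with distance."""
    idx = np.arange(size)
    return c ** np.abs(idx[:, None] - idx[None, :])

# separate blocks for the p = 200 history parameters and q = 6 stimulus
# parameters, reflecting the assumption of no correlation between the groups
Sigma = block_diag(geometric_cov(200, 0.5), geometric_cov(6, 0.5))
```

This Sigma can be passed directly to the irls_poisson_map sketch of Section 2.4.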
Regularization.
We used ridge regression, which uses an L2 penalty term, to fit the model parameters, because of its common usage in the neural data analysis literature. Therefore, the penalty function in Eq. 6 is p(β) = β^T β. The choice of Λ can affect the outcome of the regularization model significantly, and thus we use cross-validation to determine it, selecting the value that gave the best model fit. This specific method is called "Ridge GLM" throughout the rest of this paper.
Change of Basis.
We reparameterized the model using a cardinal spline basis expansion [29, 30] in place of the spike count and indicator functions in Eq. 8. As shown in the Appendix, this basis expansion method is capable of eliminating perfect predictors if the knots are chosen such that there is at least one non-perfect predictor between every two successive knots. However, by choosing knots too far from each other, a good amount of information in the data will be lost due to excessive imposed smoothness. This makes it challenging to determine the ideal number and location of the knots. One option, more complicated but proven to work, is to use Bayesian curve-fitting with free-knot spline algorithms [34]. Another, simpler option is letting the knots be distributed regularly and then using cross-validation only to select the number of knots. In this case study, we used the latter approach, where we pick the number of knots that gives the best goodness-of-fit. This specific method is called "Cubic Smoothing Spline" throughout the rest of this paper.
Modified Search Criterion.
As an example of a modified search criterion approach, we restrict the search space of the IRLS algorithm to R = {β : Σ_{i=1}^{p+q} β_i² ≤ r}. We set r = (p + q) d², where p + q is the total number of parameters (excluding the intercept) and d is the threshold for identifying whether a parameter is perfect or not, based on the output of the Standard IRLS method. We call this method "Bounded Search" in the rest of this paper.
3.2 | Comparison of Specified Methods

First, we compare the different methods based on their goodness-of-fit. One approach for investigating goodness-of-fit is to examine Kolmogorov-Smirnov (KS) plots comparing the empirical and theoretical distributions of transformed spike times based on the time-rescaling theorem [35, 36]. Fig. 1 depicts these plots for the different methods we specified in the previous section. The blue line represents the KS plot and the red lines represent global 95% bounds for a well-fit model. The larger the deviation outside of these bounds, the poorer the model fit to the data. The Cubic Smoothing Spline provides the best fit and Ridge GLM provides the poorest fit to the data from this neuron in terms of the KS analysis. The rest of the methods are fairly similar based on the KS plots.

FIGURE 1  KS plots for six different fitting approaches (panels: Standard IRLS, Maximum Likelihood Limit, Bayesian Estimation, Ridge GLM, Cubic Smoothing Spline, Bounded Search). The spike times were transformed via the time-rescaling theorem [35, 36] based on each fitted model. The blue line is the KS plot and the red lines represent global 95% confidence bounds for a well-fit model. The Cubic Smoothing Spline method shows the smallest deviation outside of the bounds and the Ridge GLM has the largest, suggesting the best and the worst performance respectively.
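A sketch of the KS construction behind Fig. 1 (hypothetical code): integrate the fitted conditional intensity between successive spikes, map the rescaled intervals to uniform variates, and compare their empirical quantiles with the uniform quantiles.

```python
import numpy as np

def ks_points(lam, spike_bins, dt=1e-3):
    """lam: fitted rate per bin (Hz); spike_bins: indices of bins with spikes."""
    Lambda = np.concatenate([[0.0], np.cumsum(lam * dt)])  # integrated intensity
    tau = np.diff(Lambda[spike_bins])                      # rescaled ISIs
    z = np.sort(1.0 - np.exp(-tau))                        # ~Uniform(0,1) if well fit
    k = len(z)
    uniform_q = (np.arange(1, k + 1) - 0.5) / k
    band = 1.36 / np.sqrt(k)                               # approx. global 95% bounds
    return uniform_q, z, band      # plot z against uniform_q, with +/- band lines
```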
Another approach for comparing goodness-of-fit between models is to examine the proportion of the deviance of a null model that is explained by each, which is expressed by

$$ R^2 = \frac{D_N - D_M}{D_N} \tag{9} $$

where D_M and D_N denote the deviance of the proposed model and the null model respectively. The value of R² shows the ratio of the deviance explained by the proposed model that the null model fails to explain. This measure does not attempt to account for over-fitting, and tends to be higher for larger models. Therefore, we also consider a cross-validated version of R², denoted by R²_CV, which can directly be used to assess the predictive power of the fitted models. A statistical comparison between different models based on their deviances requires taking into account the effective degrees of freedom (d.o.f.) of each model as well. Among the various measures introduced for the effective d.o.f., we use the approach that, to our knowledge, is the most prevalent. This approach defines the effective d.o.f. to be the trace of the hat matrix H, which satisfies HY = Ŷ, where Ŷ denotes the estimate of the response vector [28]. In order to make the comparison easier, we divide the effective d.o.f. for each method by that for Standard IRLS, and call it the effective d.o.f. ratio. Additionally, we compare the different methods based on the computational resources they require. To do so, we compute the relative run time and the relative peak memory of each method with respect to Standard IRLS on the same system. All models have been fit 10 times, each using a different pair of trials from the training set, and mean values are computed.

Table 1 summarizes the comparisons between the different methods, in terms of goodness-of-fit and consumed computational resources. Of all the methods, the Standard IRLS and Maximum Likelihood Limit methods have the largest R² values for the data used to fit the model, and the Cubic Smoothing Spline has the smallest. However, these methods do not generalize as well to other datasets, as they have the smallest R²_CV values. In fact, R²_CV for the Standard IRLS approach is on average negative, indicating a poorer fit than a null, homogeneous Poisson model of the data (see Eq. 9).
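A sketch of this computation (hypothetical code), using the Poisson deviance and an intercept-only homogeneous null model:

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance; the y*log(y/mu) term is defined as 0 where y = 0."""
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

def r2_deviance(y, mu_model):
    """Eq. 9: share of the null deviance explained by the fitted model."""
    mu_null = np.full(len(y), y.mean(), dtype=float)   # homogeneous null model
    D_N = poisson_deviance(y, mu_null)
    D_M = poisson_deviance(y, mu_model)
    return (D_N - D_M) / D_N
```

The cross-validated version, R²_CV, is obtained the same way with mu_model computed from parameters fit on held-out trials.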
TABLE 1  Summary of goodness-of-fit measures and computational resource consumption for the different estimation methods. All models have been fit 10 times, each time using a different pair of trials of the training set, and mean values are computed. Columns: Method, R², R²_CV, Number of Parameters, Effective d.o.f. Ratio, Relative Run Time, Relative Peak Memory; e.g. Standard IRLS has R² = 0.3404.

The Ridge GLM is the slowest method (largest relative run time), which is due to its exhaustive search to find an optimal value for the tuning parameter Λ. The Cubic Smoothing Spline, despite the fact that it has the smallest effective d.o.f. ratio, shows the largest relative peak memory after Standard IRLS. This is because this method needs to construct and store a new design matrix before running the IRLS algorithm.

Another issue that accompanies the complete separation or perfect prediction problem is that the observed Fisher information at the maximum likelihood solution typically does not accurately reflect the variance-covariance structure of the parameter estimates. In particular, the observed Fisher information is often close to singular, and if inappropriately used to construct confidence bounds and perform hypothesis tests, typically leads to extremely large standard error estimates for the perfect predictors and inaccurate covariance between the perfect predictors and all other signals [19, 26, 27]. Below, we examine how each of the methods deals with this problem.

Bootstrapping represents one alternative to the Fisher information for computing standard error estimates for perfect predictors for the Standard IRLS estimator. In this approach, random subsets of the original data are used to fit the model multiple times, and then the empirical distribution of estimated values is used to compute standard error estimates for all parameters. Bootstrapping is far more computationally expensive than the other proposed methods, but results in more robust standard error estimates for IRLS, and is therefore considered our benchmark here. The estimated values, β̂, along with their confidence intervals using the Standard IRLS, Bootstrapping, Bayesian Estimation, and Cubic Smoothing Spline approaches are shown in Fig. 2. As expected, the bootstrap approach results in no standard error estimates for structural perfect parameters, as they always diverge to −∞. The bootstrap intervals are tighter than those obtained by the Fisher information for IRLS for nearly all parameters, suggesting that the existence of perfect predictors can influence uncertainty estimates in the IRLS procedure even for non-perfect-predictor parameters. Both the Bayesian Estimation and Cubic Smoothing Spline methods result in relatively narrow confidence intervals, which provide smoother estimates than bootstrapping IRLS. We observed that Ridge GLM and Bounded Search obtain standard error estimates that are artificially larger than what was obtained by Bootstrapping, especially for structural perfect parameters. The Maximum Likelihood Limit approach does not provide an estimation procedure for the standard error of perfect parameters, as they are completely removed from the model prior to fitting.

FIGURE 2  Estimated parameter values, β̂, and estimated confidence intervals obtained by four different approaches: Standard IRLS, Bootstrapping, Bayesian Estimation, and Cubic Smoothing Splines. For each approach, the estimated influence of past spiking is shown in the left panel and the estimated influence of the input current is shown in the narrower right panel. Thick blue lines or dots represent the exponentiated parameter estimates and thin blue lines and bars represent the estimated 95% confidence bounds.
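A sketch of a bootstrap of this kind (hypothetical code; simple case resampling is shown for brevity, whereas resampling contiguous segments or whole trials would better respect the history structure of spike-train data):

```python
import numpy as np

def bootstrap_ci(X, y, fit_fn, n_boot=200, alpha=0.05, seed=0):
    """Percentile CIs from refits on resampled data; fit_fn(X, y) -> beta."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample observations
        draws.append(fit_fn(X[idx], y[idx]))
    draws = np.asarray(draws)
    lo = np.percentile(draws, 100 * alpha / 2, axis=0)
    hi = np.percentile(draws, 100 * (1 - alpha / 2), axis=0)
    return draws.mean(axis=0), lo, hi

# usage with the Standard IRLS sketch: bootstrap_ci(X, y, irls_poisson)
```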
In Fig. 3, in order to examine how the different methods estimate the correlation between parameters, we concentrate on the first 100 predictors, which are associated with the influence of past spike counts at lags of 1-100 ms, and illustrate the estimated correlation between one fixed predictor and all other predictors. For the fixed predictor, we consider a sampling perfect predictor (left column), a non-perfect predictor (middle column), and a putative structural perfect predictor associated with the neuron's refractory period (right column). We observe that for non-perfect predictors, all methods perform more or less similarly, with Bayesian Estimation providing a smoother version of the correlation structure estimated by the Ridge GLM and Standard IRLS methods (middle column). In this case, we can see a local correlation structure for parameters around the fixed predictor (the 60th predictor), which is expected; the effect of firing activity at lags i and i+1 is likely to be correlated. For the sampling perfect predictor (left column), we see that Standard IRLS results in zero correlation for all non-perfect predictors. However, we know that this predictor is perfect only due to sampling, and hence we expect to see a temporal structure similar to what we observed between two non-perfect predictors in the middle column. Ridge GLM shows this local correlation structure weakly, but with substantially reduced effect size and statistical power. On the other hand, the correlation estimated by the Bayesian GLM is much more aligned with what we expect to see, and clearly shows a temporal correlation structure similar to that between non-perfect predictors. For the putative structural perfect parameter, Standard IRLS again estimates all the correlations to non-perfect predictors to be zero. In this case, this might be the preferred inference; if it is impossible for the neuron to fire during its refractory period, this should be true regardless of the influence of a spike beyond the refractory period. In this case, Ridge GLM also gives very small correlations, but the Bayesian GLM still estimates a smooth correlation structure, consistent with the prior distribution. Note that we could use our knowledge of the refractory period of the neuron to adjust the prior to prevent this correlation. Though not shown in this figure, the Cubic Smoothing Spline estimates correlations very similar to the Bayesian Estimation method, and the Bounded Search estimates are very close to those obtained from Ridge GLM. The Maximum Likelihood Limit approach, again, does not provide correlation estimates between any perfect predictor and any other predictor, as it removes all perfect predictors from the model prior to fitting.

FIGURE 3  Estimated correlation between a fixed predictor and all other predictors associated with the past history of spike counts at lags 1-100 ms. This is shown for a sampling perfect predictor (49th predictor, left column), a non-perfect predictor (60th predictor, middle column), and a putative structural perfect predictor (16th predictor, right column), when Standard IRLS (first row), Ridge GLM (second row), and Bayesian Estimation (third row) were used. The red ticks in the top panel represent the lags associated with perfect predictors.

4 | DISCUSSION
In this paper, we investigated the problem of complete separation in GLMs, which leads to the failure of the IRLS algorithm to converge when a single predictor or a linear combination of multiple predictors completely separates a response from a non-response. This occurs frequently in point process GLMs due, first, to structural properties of neurons and their receptive fields that prevent a response, e.g. refractoriness, and second, to incomplete sampling of the signals that influence spiking, leading to no observed neural responses. This phenomenon can have a substantial impact on modeling and statistical inference from neural data. We broadly presented various classes of approaches to deal with this problem (Section 2), and illustrated how they compare in practice (Section 3). Here, we further discuss some of the advantages and disadvantages that are associated with each family of approaches.
Standard IRLS
Advantages. Already implemented in many statistical software packages. Describes the existing data set optimally. Inference for non-perfect parameters is typically reliable.
Disadvantages. Very slow; continues to run until the iteration limit is reached. Fitted models generalize poorly to other datasets. Poor estimates of the variance-covariance structure for sampling perfect parameters make inferences about these parameters challenging.

Maximum Likelihood Limit
Advantages. Fast. Describes the existing data optimally. Inference for non-perfect parameters is typically reliable.
Disadvantages. Fitted models generalize poorly to other datasets. Not typically implemented in statistical software packages. No explicit computation of the Fisher information for estimating the variance-covariance structure of the perfect parameters.

Bayesian Estimation
Advantages. Provides increased statistical power and prediction accuracy when appropriate priors are available. Allows for precise estimation of confidence bounds and covariance between parameters. The prior can be chosen to distinguish between covariates that may lead to structural or sampling perfect predictors.
Disadvantages. Requires methods for selecting appropriate priors. Results may be sensitive to the choice of prior.

Regularization Methods
Advantages. Already implemented in many software packages. Controls perfect parameters with minimal impact on non-perfect parameters.
Disadvantages. Sensitive to the tuning parameter, which typically involves an exhaustive search to find an optimal value; this makes the method relatively slow. Can result in diminished statistical power (compared to Bayesian Estimation and Change of Basis) since it treats structural and sampling perfect predictors identically.

Change of Basis
Advantages. Leads to more parsimonious models that take advantage of known structure in receptive field models. With a judicious choice of model structure, this can lead to a substantial increase in statistical power and model interpretability, and reductions in computational time. Prior knowledge can be used to select a model structure that distinguishes structural and sampling perfect predictors.
Disadvantages. Selecting an appropriate model structure can be challenging. Some reparameterizations may not solve the perfect prediction problem. The results may be quite sensitive to the choice of model structure. Inferences may require inversion of the transformations associated with reparameterization.

Modifying Search Criterion
Advantages. Has minimal impact on non-perfect parameters.
Disadvantages. Results may be sensitive to the choice of criterion. Can result in diminished statistical power (compared to Bayesian Estimation and Change of Basis) since it treats structural and sampling perfect predictors identically.
[Figure 4 is a flowchart. Decision questions: Is model generalizability important? Are inferences about perfect predictors needed? Is there prior knowledge about the covariability of parameters? Are all perfect predictors well sampled? Method boxes: Standard IRLS (easier implementation; more computational time); Maximum Likelihood Limit (harder implementation; less computational time); Bayesian Estimation (easier implementation; relatively slower for large data sets; requires selecting a suitable prior; inferences in the original model space); Regularization Methods (easier implementation; more computational time; increased statistical power); Change of Basis (harder implementation; model parsimony; sensitive to model structure; inferences in the reparameterized model space); Modifying Search Criterion (harder implementation; less computational time).]
FIGURE 4  Roadmap for selecting an approach for dealing with perfect prediction, based on the features and goals of the particular analysis question. Each path ends with two approaches to consider, with properties that are well aligned to the resulting analysis issues. The decision between these two can be made according to the pros and cons given for each method (yellow boxes).

Knowing the potential advantages and disadvantages of each of these approaches can help modelers select one, or a subset, of methods to use to explore their data. We can also use these to develop a sequence of questions that modelers might ask to help guide them to a decision about which methods might work best for a particular analysis problem. We present an example of such a modeling 'roadmap' in Fig. 4, in flowchart form. Note that this represents just one potential approach to selecting a method; there may be many other roadmaps, depending on what particular set of questions might arise in the modeling process. In the roadmap depicted in Fig. 4, the central questions focus on how well the signals that influence spiking are sampled, whether the fitted model needs to generalize to other datasets, whether the perfect predictors relate to the scientific questions to be addressed, and whether there is prior knowledge about receptive field structure to constrain the model. Each path ends with two approaches to consider, with properties that are well aligned to the resulting analysis issues. The decision between these two can be made according to the pros and cons given for each method (yellow boxes). For example, if we had a model for a place cell using indicator functions over space, the Bayesian Estimation approach might be most useful if we have a sensible prior and the scientific questions are focused on inferences about individual parameters that represent coding over a small subregion of the place field. If sensible priors are hard to come by, or if the scientific questions are focused more on global properties of the place field (e.g. its center and width), then a smoothed estimate based on a change to a spline basis might be more appropriate.

Perfect predictors can arise in neural data due to many different factors, such as structural properties of the receptive field, lack of data for some regions of the receptive field, or even features of the receptive field that cannot be sampled under some experimental conditions. In this paper, we focused on point process GLMs, but complete separation can also happen in other domains of neuroscience. In particular, GLMs are used for binomial and multinomial data, which occur quite often when one looks at behavioral measures in neuroscience experiments. The ideas presented in this paper are not limited to point process GLMs, and can directly be applied to other studies that involve other kinds of GLMs. Neuroscience experiments going forward are likely to yield larger data sets with many more neurons, and potentially with receptive field structures that involve many predictors simultaneously. As a result, the models needed to explain such large and complex data sets are more likely to encounter the problem of complete separation. We anticipate that the types of methods and decision procedures discussed in this paper are going to become more and more important as experiments and data sets evolve to capture more complex structures of the stimuli.

ACKNOWLEDGEMENTS
The work for this project was supported by the Simons Collaboration on the Global Brain, 542971, and the National Institute of Mental Health, MH105174. We also thank the CRCNS data sharing website, and Thomas Berger and Richard Naud in the laboratory of Henry Markram at the EPFL, for collecting the data for challenge A of the 2009 quantitative single-neuron modeling competition (the data used in this paper), and publishing it publicly on CRCNS.org.

REFERENCES
[1] Paninski L, Pillow J, Lewi J. Statistical models for neural encoding, decoding, and optimal stimulus design. Progress in Brain Research 2007;165:493–507.
[2] Paninski L, Brown EN, Iyengar S, Kass RE. Statistical models of spike trains. Stochastic Methods in Neuroscience 2008.
[3] Eden UT, Kass RE. Statistical models of spike train data. Neuroscience in the 21st Century 2016;2:3137–3151.
[4] Truccolo W, Eden UT, Fellows MR, Donoghue JP, Brown EN. A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology 2005;93(2):1074–1089.
[5] Chen Z, Brown EN. Generalized linear models for point process analyses of neural spiking activity. Jaeger D, Jung R (eds) Encyclopedia of Computational Neuroscience 2013;p. 1–4.
[6] Keat J, Reinagel P, Reid RC, Meister M. Predicting every spike: a model for the responses of visual neurons. Neuron 2001;30(3):803–817.
[7] Paninski L, Simoncelli EP, Pillow JW. Maximum likelihood estimation of a stochastic integrate-and-fire neural encoding model. Neural Comput 2004;16(12):2533–2561.
[8] Paninski L. Maximum likelihood estimation of cascade point-process neural encoding models. Network: Computation in Neural Systems 2004;15(4):248–262.
[9] McCullagh P, Nelder JA. Generalized Linear Models (second edition). Chapman and Hall 1989. ISBN 0-412-31760-5.
[10] Fisher RA. The case of zero survivors (Appendix to Bliss, C.I. (1935)). Ann Appl Biol 1935;22:164–165.
[11] Green PJ. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion). J R Statist Soc B 1984;46:149–192.
[12] Bryson MC, Johnson ME. The incidence of monotone likelihood in the Cox model. Technometrics 1981;23(4):381–383.
[13] Albert A, Anderson JA. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984;71(1):1–10.
[14] Santner TJ, Duffy DE. A note on A. Albert and J.A. Anderson's conditions for the existence of maximum likelihood estimates in logistic regression models. Biometrika 1986;73(3):755–758.
[15] Lesaffre E, Albert A. Partial separation in logistic discrimination. Journal of the Royal Statistical Society, Series B 1989;51(1):109–116.
[16] Clarkson DB, Jennrich RI. Computing extended maximum likelihood estimates for linear parameter models. Journal of the Royal Statistical Society, Series B 1991;53(2):417–426.
[17] Lesaffre E, Marx BD. Collinearity in generalized linear regression. Communications in Statistics—Theory and Methods 1993;22(7):1933–1952.
[18] Kolassa JE. Infinite parameter estimates in logistic regression, with application to approximate conditional inference. Scandinavian Journal of Statistics 1997;24(4):523–530.
[19] Heinze G. A comparative investigation of methods for logistic regression with separated or nearly separated data. Statist Med 2006;25:4216–4226.
[20] Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Statist Med 2002;21:2409–2419.
[21] Firth D. Bias reduction, the Jeffreys prior and GLIM. Fahrmeir L, Francis B, Gilchrist R, Tutz G (eds) Advances in GLIM and Statistical Modelling, Lecture Notes in Statistics 1992;78:91–100.
[22] Firth D. Generalized linear models and Jeffreys priors: an iterative weighted least-squares approach. Dodge Y, Whittaker J (eds) Computational Statistics 1992;1:553–557.
[23] Firth D. Bias reduction of maximum likelihood estimates. Biometrika 1993;80(1):27–38.
[24] Nagashima K, Sato Y. Information criteria for Firth's penalized partial likelihood approach in Cox regression models. Stat Med 2017;36(21):3422–3436.
[25] Heinze G, Schemper M. A solution to the problem of monotone likelihood in Cox regression. Biometrics 2001;57:114–119.
[26] Hauck Jr WW, Donner A. Wald's test as applied to hypotheses in logit analysis. Journal of the American Statistical Association 1977;72(360):851–853.
[27] Vaeth M. On the use of Wald's test in exponential families. International Statistical Review 1985;53(2):199–214.
[28] Friedman J, Hastie T, Tibshirani R. The Elements of Statistical Learning. Springer Series in Statistics 2001. ISBN 9780387848587.
[29] de Boor C. A Practical Guide to Splines. Applied Mathematical Sciences 1978;27. ISBN 978-0-387-95366-3.
[30] Lin Y, Zhang HH. Component selection and smoothing in smoothing spline analysis of variance models. Annals of Statistics 2006;34(5):2272–2297.
[31] Pillow JW, Shlens J, Paninski L, Sher A, Litke AM, Chichilnisky EJ, et al. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature 2008;454:995–999.
[32] Gerstner W, Naud R. How good are neuron models? Science 2009;326(5951):379–380.
[33] Gerstner W, Naud R, Carandini M, Sincich LC, Horton JC, Adams DL, et al. Quantitative Single-Neuron Modeling Competition for year 2009, developed by researchers at EPFL; data from Wistar rat cortical neurons and rhesus monkey lateral geniculate nucleus and retinal ganglion cells. CRCNS.org.
[34] DiMatteo I, Genovese CR, Kass RE. Bayesian curve-fitting with free-knot splines. Biometrika 2001;88(4):1055–1071.
[35] Papangelou F. Integrability of expected increments of point processes and a related random change of scale. Trans Amer Math Soc 1972;165:483–506.
[36] Brown EN, Barbieri R, Ventura V, Kass RE, Frank LM. The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation 2002;14(2):325–346.

5 | APPENDIX

5.1 | Cardinal Spline Reparameterization
In this section, we explore a specific example of basis expansion methods, cardinal spline reparameterization, and explain in detail how it can solve the problem of perfect predictors. This basis expansion method is perfectly suitable for cases where the predictors are a set of sequential indicator functions, e.g. in a history-dependent model where the k-th predictor is the spike count at lag k. Following the notation introduced in Eq. 7, the k-th basis function is given by

$$ g_k(x_{j\cdot}) = \sum_{i=1}^{p} x_{ji}\, s_{ik} \tag{10} $$

where s_{ik} is the ik-th element of a matrix S that is constructed as follows. First, an increasing sequence of length q − 4 from the interval (1, p) is augmented by 1, p, one value less than 1, and one value greater than p, to yield the sequence ζ_1 < ζ_2 < ... < ζ_q, which is called the set of control points, or knots. Then, in order to construct the j-th row of S, the index i is selected such that ζ_i ≤ j < ζ_{i+1}, and α = (j − ζ_i)/(ζ_{i+1} − ζ_i) is computed. All elements of S_{j·} are set to zero except the four consecutive elements {s_{j,i−1}, s_{j,i}, s_{j,i+1}, s_{j,i+2}}, which are given by

$$ \begin{bmatrix} s_{j,i-1} & s_{j,i} & s_{j,i+1} & s_{j,i+2} \end{bmatrix} = \begin{bmatrix} \alpha^3 & \alpha^2 & \alpha & 1 \end{bmatrix} \begin{bmatrix} -t & 2-t & t-2 & t \\ 2t & t-3 & 3-2t & -t \\ -t & 0 & t & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \tag{11} $$

where 0 ≤ t ≤ 1 is called the tension parameter and must be picked prior to constructing S. In order to find the relation between the original parameters (β_i in Eq. 1b) and the parameters after the change of basis (β̃_i in Eq. 7), we start by simplifying the right-hand side of Eq. 7:

$$ \lambda_j = \exp\Big( \tilde\beta_0 + \sum_{k=1}^{q} g_k(x_{j\cdot})\,\tilde\beta_k \Big) = \exp\Big( \tilde\beta_0 + \sum_{k=1}^{q} \tilde\beta_k \sum_{i=1}^{p} x_{ji}\, s_{ik} \Big) = \exp\Big( \tilde\beta_0 + \sum_{i=1}^{p} x_{ji} \sum_{k=1}^{q} s_{ik}\,\tilde\beta_k \Big). \tag{12} $$

Comparing Eq. 1b and Eq. 12, and noting that the inner sum in Eq. 12 is the i-th element of the vector Sβ̃, reveals the relation between β and β̃: β_0 = β̃_0 and β_{1:p} = S β̃_{1:q}.
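A sketch of this construction (hypothetical code; boundary indices are clipped so the four-column window stays inside S, which matches the stated knot conditions at both ends):

```python
import numpy as np

def cardinal_spline_matrix(p, knots, t=0.5):
    """Eqs. 10-11: p-by-q matrix S for knots zeta_1 < ... < zeta_q (ascending)."""
    M = np.array([[-t, 2 - t, t - 2, t],
                  [2 * t, t - 3, 3 - 2 * t, -t],
                  [-t, 0.0, t, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])               # tension matrix of Eq. 11
    q = len(knots)
    S = np.zeros((p, q))
    for j in range(1, p + 1):
        i = np.searchsorted(knots, j, side="right") - 1  # zeta_i <= j < zeta_{i+1}
        i = int(np.clip(i, 1, q - 3))                    # keep window in range
        a = (j - knots[i]) / (knots[i + 1] - knots[i])
        S[j - 1, i - 1 : i + 3] = np.array([a**3, a**2, a, 1.0]) @ M
    return S

# example: q = 8 knots for p = 200 history lags (regularly spaced interior knots)
# S = cardinal_spline_matrix(200, np.array([0., 1., 40., 80., 120., 160., 200., 201.]))
```

The resulting S can be passed directly as the basis matrix B in the fit_with_basis sketch of Section 2.6.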