Do NHL goalies get hot in the playoffs? A multilevel logistic regression analysis
Likang Ding, Ivor Cribben, Armann Ingolfsson, Monica Tran
University of Alberta, Edmonton, AB T6G 2R6
February 22, 2021
Abstract
The hot-hand theory posits that an athlete who has performed well in the recent past will perform better in the present. We use multilevel logistic regression to test this theory for National Hockey League playoff goaltenders, controlling for a variety of shot-related and game-related characteristics. Our data consists of 48,431 shots for 93 goaltenders in the 2008-2016 playoffs. Using a wide range of shot-based and time-based windows to quantify recent save performance, we consistently find that good recent save performance has a negative effect on the next-shot save probability, which contradicts the hot-hand theory.
Keywords:
Hot hand, ice hockey, goaltenders, National Hockey League, time-based window, shot-based window.
The hot-hand phenomenon generally refers to an athlete who has performed well in the recent past performing better in the present. Having a “hot goalie” is seen as crucial to success in the National Hockey League (NHL) playoffs. A goaltender who keeps all pucks out of the net for [...] negative slope coefficients for the variable of interest measuring the influence of the recent save performance on the probability of saving the next shot on goal; in other words, we have demonstrated that, contrary to the hot-hand theory, better past performance usually results in worse future performance. This negative impact of recent good performance is robust, according to our analysis, to both varying window sizes and defining the window size based on either time or number of shots.

The remainder of the paper is organized as follows: in Section 2, we review related literature; in Section 3, we describe our data set; in Section 4, we specify our regression models; and in Section 5, we present our results. Section 6 concludes.

We summarize five streams of related work addressing the following: (1) whether the hot hand is a real phenomenon or a fallacy, (2) whether statistical methods have sufficient power to detect a hot hand, (3) whether offensive and defensive adjustments reduce the impact of a hot hand, (4) estimation of a hot-hand effect for different positions in a variety of sports, and (5) specification of statistical models to estimate the hot hand.

(1) Is the hot hand a real phenomenon or a fallacy?
The hot hand was originally studied in the 1980s in the context of basketball shooting percentages (Gilovich, Vallone, and Tversky 1985; Tversky and Gilovich 1989a,b). These studies concluded that even though players, coaches, and fans all believed strongly in a hot-hand phenomenon, there was no convincing statistical evidence to support its existence. Instead, Gilovich, Vallone, and Tversky (1985) attributed beliefs in a hot hand to a psychological tendency to see patterns in random data, an explanation that has also been proposed for behavioral anomalies in various non-sports contexts, such as financial markets and gambling (Miller and Sanjurjo 2018). Contrary to these findings, recent papers by Miller and Sanjurjo (2018, 2019) demonstrate that the statistical methods used in the original studies were biased, and when their data is re-analyzed after correcting for the bias, strong evidence for a hot hand emerges.

(2) Do statistical methods have sufficient power to detect a hot hand?
The Gilovich, Vallone, and Tversky (1985) study analyzed players individually. This approach may lack sufficient statistical power to detect a hot hand, even if it exists (Wardrop 1995, 1999). Multivariate approaches that pool data for multiple players have been proposed to increase power (Arkes 2010). We follow this approach, by pooling data for multiple NHL goaltenders over multiple playoffs.

(3) Do offensive and defensive adjustments reduce the impact of a hot hand?
A hot hand, even if it is real, might not result in measurable improvement in performance if the hot player adapts by making riskier moves or if the opposing team adapts by defending the hot player more intensively. For example, a hot basketball player might attempt riskier shots and the opposing team might guard a player they believe to be hot more closely. The extent to which such adjustments can be made varies by sport, by position, and by the game situation. For example, there is little opportunity for such adjustments for basketball free throws (Gilovich, Vallone, and Tversky 1985) and there is less opportunity to shift defensive resources towards a single player in baseball than in basketball (Green and Zwiebel 2017), because the fielding team defends against a single batting team player at a time. An NHL goaltender must face the shots that are directed at him and thus has limited opportunities to make riskier moves if he feels that he is hot. Furthermore, the opposing team faces a single goaltender and therefore has limited opportunities to shift offensive resources away from other tasks and towards scoring on the goaltender. Therefore, NHL goaltenders provide an ideal setting in which to measure whether the hot-hand phenomenon occurs.

(4) Estimation of a hot-hand effect for different positions in a variety of sports.
In addition to basketball shooters, the list of sports positions for which hot-hand effects have been investigated includes baseball batters and pitchers (Green and Zwiebel 2017), soccer penalty shooters (Ötting and Groll 2019), dart players (Ötting et al. 2020), and golfers (Livingston 2012).

In ice hockey, a momentum effect has been investigated at the team level (Kniffin and Mihalek 2014). A hot-hand effect has been investigated for ice hockey shooters (Vesper 2015), but not for goaltenders, except for the study by Morrison and Schmittlein (1998). The latter study focused on the duration of NHL playoff series, noted a higher-than-expected number of short series, and proposed a goaltender hot-hand effect as a possible explanation. This study did not analyze shot-level data for goaltenders, as we do.

(5) Specification of statistical models to estimate the hot hand.
Hot-hand researchers have used two main approaches in specifying their statistical models: (1) analyze success rates, conditional on outcomes of previous attempts (Albright 1993; Green and Zwiebel 2017) or (2) incorporate a latent variable or “state” that quantifies “hotness” (Green and Zwiebel 2017; Ötting and Groll 2019). We follow the former approach. With that approach, the history of past performance is typically summarized over a “window” that is defined in terms of a fixed number of past attempts (the “window size”). It is not clear how to choose the window size. We vary the window size over a range that covers the window sizes used in past work. Furthermore, in addition to shot-based windows, we also use time-based windows, an approach that complicates data preparation and has not been used by other investigators.

We contribute to the hot-hand literature by investigating NHL goaltenders, a position that has not been studied previously, and which provides a setting in which there are limited opportunities for either team to adapt their strategies in reaction to a perception that a goaltender is hot. In terms of methodology, we use multilevel logistic regression, which allows us to pool data across goaltender-seasons to increase statistical power, and we use a wide range of both shot-based and time-based windows to quantify a goaltender’s recent save performance.
Our data set consists of information about all shots on goal in the NHL playoffs from 2008 to 2016. The season-level data is from Hockey Reference (2017) and the shot-level data is from corsica.hockey (Perry 2017). We have data for 48,431 shots, faced by 93 goaltenders, over 9 playoff seasons, with 91.64% of the shots resulting in a save. We divided the data into 224 groups, containing from 2 to 849 shot observations, based on combinations of goaltender and playoff season. The data set includes 1,662 shot observations for which one or more variables have missing values. Removing those observations changes the average save proportion from 91.64% to 91.61% and the number of groups from 224 to 223. We exclude observations with missing values from our regression analysis but we include these observations when computing the variable of interest (recent save performance), as discussed in Subsection 3.2.

The dependent variable, $y_{ij}$, equals 1 if shot $i$ in group $j$ resulted in a save and 0 if the shot resulted in a goal for the opposing team. A shot that hits the crossbar or one of the goalposts is not counted as a shot on goal and is not included in our data set.

The primary independent variable of interest, $x_{ij}$, is the recent save performance, immediately before shot $i$, in goaltender-season group $j$. It is not obvious how to quantify this variable and therefore we investigate several possibilities. In all cases, we define the recent save performance as the ratio of the number of saves to the number of shots faced by the goaltender, over some “window”. We compare shot-based windows, over the last 5, 10, 15, 30, 60, 90, 120, and 150 shots faced by the goaltender, and time-based windows, over the last 10, 20, 30, 60, 120, 180, 240, and 300 minutes played by the goaltender. We chose these window sizes to make the shot-based and time-based window sizes comparable, given that a goaltender in the NHL playoffs faces an average of 30 shots on goal per 60 minutes of playing time.
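As a minimal sketch of how the two window types could be computed for a single goaltender-season group (this is not the authors' code), the Python below takes a chronologically ordered list of shot outcomes (1 = save, 0 = goal) and, for the time-based version, the goaltender's cumulative minutes played at each shot. The function names, the input layout, and the choice to return `None` when a full window is not yet available are illustrative assumptions.

```python
from typing import List, Optional, Sequence


def shot_window_performance(saves: Sequence[int], k: int) -> List[Optional[float]]:
    """Save ratio over the k shots faced immediately before each shot.

    saves[i] is 1 if shot i was saved and 0 if it was a goal.
    Returns None for shots preceded by fewer than k shots.
    """
    prefix = [0]  # prefix[i] = number of saves among shots 0..i-1
    for s in saves:
        prefix.append(prefix[-1] + s)
    result: List[Optional[float]] = []
    for i in range(len(saves)):
        if i < k:
            result.append(None)  # not enough history for a full window
        else:
            result.append((prefix[i] - prefix[i - k]) / k)
    return result


def time_window_performance(
    minutes: Sequence[float], saves: Sequence[int], t: float
) -> List[Optional[float]]:
    """Save ratio over shots faced in the t minutes played before each shot.

    minutes[i] is the goaltender's cumulative playing time at shot i
    (non-decreasing). Returns None when fewer than t minutes have been
    played or when the window contains no shots.
    """
    result: List[Optional[float]] = []
    start = 0            # index of the oldest shot still inside the window
    saves_in_window = 0  # saves among shots start..i-1
    for i, now in enumerate(minutes):
        # Drop shots that are more than t minutes old.
        while start < i and minutes[start] < now - t:
            saves_in_window -= saves[start]
            start += 1
        shots_in_window = i - start
        if now < t or shots_in_window == 0:
            result.append(None)
        else:
            result.append(saves_in_window / shots_in_window)
        saves_in_window += saves[i]  # shot i becomes history for later shots
    return result
```

Measuring time in minutes played by the goaltender, rather than wall-clock time, is what lets a window span several games in this sketch; whether the window boundary is inclusive is a guess.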
The largest window sizes (150 shots and 300 minutes) correspond, roughly, to 5 games, an interval length that Green and Zwiebel (2017) suggested was needed to determine whether a baseball player was hot.

A window could include one or more intervals during which the group $j$ goaltender was replaced by a backup goaltender. Shots faced by the backup goaltender are excluded from the computation of $x_{ij}$. A window could include time periods between consecutive games, which could last several days.

For a given window, we exclude shot observations for which the number of shots or the time elapsed prior to the time of the shot is less than the window size. This reduces the number of shots used in the analysis by 20% to 50%, depending on the window size. As stated previously, we included shots with missing values in the other independent variables in the computation of $x_{ij}$ but we excluded those shots from the regression analysis that we describe in Section 4.

We include a vector, $Z_{ij}$, of six control variables for shot $i$ of group $j$ that we expect could have an impact on the shot outcome. The control variables are: Game score, Position, Home, Rebound, Strength, and Shot type. All of these variables are categorical, except Position, which can be expressed either as one categorical variable or as two numerical variables. In what follows, we elaborate on each of the control variables.
Game score indicates whether the goaltender’s team is Leading (base case), Tied, or Trailing in the game. Home is a binary variable indicating whether the goaltender was playing on home ice or not (base case). Rebound is a binary variable indicating whether the shot occurred within 2 seconds of uninterrupted game time of another shot from the same team (Perry 2017) or not (base case). Strength represents the difference between the number of players from the goaltender’s team on the ice and the number of players from the other team on the ice. Strength can take values of +2, +1, 0, −1, and −2. Shot type denotes the shot type: Backhand (base case), Deflected, Slap, Snap, Tip-in, Wrap-around, or Wrist.

The numerical specification for Position is based on a line from the shot origin to the midpoint of the crossbar of the goal. We use $d$ (distance) for the length in feet of this line and $\alpha$ for the angle that this line makes with a line connecting the midpoints of the crossbars of the two goals (Figure 1a). A limitation of this specification is that the save probability could depend on $d$ and $\alpha$ in a highly nonlinear manner. Rather than introduce nonlinear terms, we also investigate a categorical specification, which we describe next.

Figure 1: Position of shot origin. (a) Numerical specification. (b) Categorical specification.

For the categorical specification of Position, we divide the ice surface into three regions: Top, Slot, and Corner (base case) (Figure 1b), and categorize each shot based on the region from which the shot originated. We expect the save probability to be lower for shots from the Slot region than from the Top or Corner regions.

A limitation of both specifications for Position is that we do not have data on whether the shot originated from the left or right side of the ice.
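The numerical specification of Position can be illustrated with a short sketch. It assumes, hypothetically, that each shot comes with rink coordinates in feet, with the origin at centre ice, the x-axis running along the line between the two goals toward the defended goal, and the defended goal line 89 feet from centre ice. None of these conventions is stated in the paper. The sketch also works in the plane of the ice and ignores the height of the crossbar. Because left and right are not distinguished in the data, the angle is returned unsigned.

```python
import math
from typing import Tuple

# Assumed distance in feet from centre ice to the goal line.
GOAL_X_FEET = 89.0


def shot_distance_and_angle(
    x: float, y: float, goal_x: float = GOAL_X_FEET
) -> Tuple[float, float]:
    """Return (d, alpha) for a shot taken from rink position (x, y).

    d is the distance in feet from the shot origin to the goal midpoint.
    alpha is the unsigned angle in degrees between the shot line and the
    line joining the two goals (the x-axis in this coordinate system).
    """
    dx = goal_x - x                      # distance along the goal-to-goal line
    d = math.hypot(dx, y)                # straight-line distance to the goal
    alpha = math.degrees(math.atan2(abs(y), dx))
    return d, alpha
```

For example, under these assumptions a shot from 30 feet straight out, `shot_distance_and_angle(59.0, 0.0)`, has distance 30 and angle 0.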
We used multilevel logistic regression, with partial pooling, also referred to as mixed effects modelling. We center the variable of interest (Enders and Tofighi 2007; Hox and Roberts 2011) by subtracting the group mean, that is, we set $x^*_{ij} = x_{ij} - \bar{x}_j$, where $\bar{x}_j$ is the average of $x_{ij}$ in group $j$. The centered variable $x^*_{ij}$ represents the deviation in performance of the goaltender-season group $j$ from his average performance for the current playoffs. Our interest is in whether such deviations persist over time.

We allow the intercepts and the slope coefficients of the variables of interest to vary by group, but the control variable slope coefficients are the same for all groups, as shown in the following partial pooling specification:

$$\Pr(y_{ij} = 1) = \operatorname{logit}^{-1}(\alpha_j + \beta_j x^*_{ij} + \gamma Z_{ij}), \tag{1}$$

where $\operatorname{logit}(p) \equiv \ln(p/(1-p))$, $\alpha_j$ is the intercept for group $j$, $\beta_j$ is the slope for group $j$, and $\gamma$ is the global vector of coefficients for the control variables. We represent the intercept and slope of the variable of interest as the sum of a fixed effect and a random effect, that is: $\alpha_j = \bar{\alpha} + \alpha^*_j$ and $\beta_j = \bar{\beta} + \beta^*_j$.

All results that we report in Section 5 were obtained using Markov chain Monte Carlo (MCMC), using the rstan and rstanarm R packages. We used the default prior distributions for the rstanarm package. The default distributions are weakly informative: Normal distributions with mean 0 and scale 2.[...] lme4 R package, and found the MCMC and ML estimates to be nearly identical, except for a few instances where the ML estimation algorithm did not converge. The lack of ML estimation convergence in some instances is consistent with the findings in Eager and Roy (2017). MCMC is considered a good surrogate in situations where an ML estimation algorithm has not been established (Wollack et al. 2002).
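To make the two ingredients of the specification concrete, the following sketch shows group-mean centering and the evaluation of equation (1) for a single shot, given coefficient values. It is only an illustration of the formula, not the estimation procedure, which the paper carries out with rstanarm. The function names and the plain-list data layout are assumptions.

```python
import math
from collections import defaultdict
from typing import Hashable, List, Sequence


def center_by_group(x: Sequence[float], groups: Sequence[Hashable]) -> List[float]:
    """Subtract each group's mean from its observations: x*_ij = x_ij - xbar_j."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, g in zip(x, groups):
        totals[g] += value
        counts[g] += 1
    return [value - totals[g] / counts[g] for value, g in zip(x, groups)]


def inv_logit(eta: float) -> float:
    """Inverse of logit(p) = ln(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-eta))


def save_probability(
    alpha_j: float,
    beta_j: float,
    x_centered: float,
    gamma: Sequence[float],
    z: Sequence[float],
) -> float:
    """Equation (1): Pr(y_ij = 1) = logit^{-1}(alpha_j + beta_j x*_ij + gamma Z_ij).

    gamma and z are the control-variable coefficients and the corresponding
    (dummy-coded or numerical) control values for the shot.
    """
    eta = alpha_j + beta_j * x_centered + sum(c * v for c, v in zip(gamma, z))
    return inv_logit(eta)
```

With a negative `beta_j`, a shot that follows above-average recent performance (positive `x_centered`) gets a lower save probability than one that follows below-average performance, which is the pattern the paper reports.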
In this section, we first provide detailed results for the longest window sizes, using the categorical specification for Position. Second, we investigate the robustness of our main finding to the window size, window type, individual goaltender-seasons, and which specification was used for Position. Third, we illustrate the estimated impact of the control variables on the save probability for the next shot. Fourth, we provide diagnostics for the MCMC estimation.

Table 1

                                    k = 150 shots              t = 300 minutes
Variable                            Mean     2.5%     97.5%    Mean     2.5%     97.5%
Intercept fixed effect              2.10     1.64     2.57     2.13     1.67     2.??
[...] ($x^*_{ij}$) fixed effect     −?.??    −?.??    −?.??    −?.??    −?.??    −?.??
Game score: Tied                    −?.??    −?.16    0.??     −?.??    −?.15    0.??
Game score: Trailing                −?.??    −?.25    0.??     −?.??    −?.23    0.??
Home                                −?.??    −?.12    0.09     0.??     −?.08    0.??
Rebound                             −?.??    −?.??    −?.??    −?.??    −?.??    −?.??
Strength: +2                        1.??     −?.22    5.61     1.??     −?.09    5.??
Strength: +1                        0.76     0.24     1.28     0.73     0.22     1.??
Strength: 0                         ?.94     0.51     1.36     0.94     0.52     1.??
Strength: −1                        ?.45     0.00     0.87     0.45     0.03     0.??
Shot type: Deflected                −?.??    −?.??    −?.??    −?.??    −?.??    −?.??
Shot type: Slap                     −?.??    −?.24    0.??     −?.??    −?.34    0.??
Shot type: Snap                     −?.??    −?.25    0.??     −?.??    −?.32    0.??
Shot type: Tip-in                   −?.??    −?.??    −?.??    −?.??    −?.??    −?.??
Shot type: Wrap-around              ?.??     −?.10    0.93     0.??     −?.19    0.??
Shot type: Wrist                    ?.??     −?.15    0.??     −?.??    −?.21    0.??
Position: Top                       ?.70     0.51     0.89     0.68     0.49     0.??
Position: Slot                      −?.??    −?.??    −?.??    −?.??    −?.??    −?.??

Table 1 provides means and 95% credible intervals for the intercept and slope fixed effects and for the control variable coefficients, for our baseline models: the models with 150-shot and 300-minute windows, and a categorical specification for Position. The window sizes for the baseline models are comparable to those used by Green and Zwiebel (2017).

Our main finding from the baseline models is that a goaltender’s recent save performance has a negative and statistically significant fixed effect value for both window sizes, which is contrary to the hot-hand theory.

The two baseline models give consistent results for the control variables. The only two control variable coefficients that disagree in sign, Home and Shot type: Wrist, have 95% credible intervals that contain zero. The posterior mean values for the significant control variables have the same sign, have similar magnitude, and are in the direction we expect, for both window types. Specifically, the posterior mean values indicate that a goaltender performs better when his team has more skaters on the ice, when facing a wrap-around shot, and when facing a shot from the top region. A goaltender performs worse when facing a rebound shot, a deflected shot, a slap shot, a snap shot, a tip-in shot, or a shot from the slot region.

Figure 2: Fixed effect for recent save performance, for shot-based windows (in shots) and time-based windows (in minutes).
Our main finding, that recent save performance has a negative fixed effect value, holds for all window sizes and types (Figure 2). (Although not shown in the figure, all of the fixed effects are statistically significant.) The fixed effect magnitude increases with the window size.

The fact that the slope fixed effects, $\hat{\bar{\beta}}$, are negative leaves open the possibility that the slopes for some individual goaltender-seasons are positive, but Figures 3-4 show that this is not the case. These figures show that all of the individual-group slope coefficients, $\hat{\beta}_j$, for both the longest window sizes (Figure 3) and the shortest window sizes (Figure 4), are strongly negative. Figures 3-4 also show positive correlation between the slope estimates from the shot-based vs. the time-based window models.

The coefficients of the significant control variables remained consistent in sign and similar in magnitude across all window sizes and types (Figure 5). The control variable point estimates for all window types and sizes are within a 95% Bayesian confidence interval (or a credible interval) for the 300-minute baseline model. Furthermore, changing the specification for Position from categorical to numerical, for the t = 300 baseline model, resulted in a slope fixed effect that remained negative and was similar in magnitude. The signs of the coefficients for all statistically significant control variables in this model variant agreed with the t = 300 baseline model.

Figure 3: Estimated slopes ($\hat{\beta}_j$s) for all groups, for k = 150 window vs. t = 300 window. Trend line: y = 0.3158x − [...], R² = 0.4175.

Figure 4: Estimated slopes ($\hat{\beta}_j$s) for all groups, for k = 15 window vs. t = 30 window. Trend line: y = 0.5601x − [...], R² = 0.7383.

Figure 5: [...] t = 300-minute baseline model are included for comparison. Horizontal axis: the control variable coefficients, from Game score: Tied to Position: Top.

Figures 6–7 illustrate the impact of recent save performance and the control variables on the estimated save probability for the next shot, using the t = 300-minute baseline model. In creating Figure 6, we set all control variables to their baseline values. In creating Figure 7, for each panel, we set $x^*_{ij} = 0$ and we set all control variables except the one being varied in the panel to their baseline values.

From Figure 6, we see that as the deviation of a goaltender’s recent save performance from his current-playoff average increases from the 2.5th to the 97.5th percentile, his estimated save probability for the next shot decreases by 5 percentage points (pps). For comparison, this range is larger than the 3.7 pp range of average save percentages during the 2018-19 regular season (as discussed in Section 1) but smaller than the 8 pp range of average save percentages during the playoffs of the same season.

Figure 6: [...] $x^*_{ij}$, the recent save performance. Axes: estimated save probability (75% to 100%) and recent performance percentile (2.5%, 50.0%, 97.5%).

Given that we define $x^*_{ij}$ to be the deviation in performance of the goaltender-season group $j$ from his average performance for the current playoffs, the percentiles for $x^*_{ij}$ correspond to different recent save performances for different groups. To illustrate the effect in more concrete terms, consider the largest group: the 699 shots faced by Tim Thomas during the 2011 playoffs. The group average was $\bar{x}_j$ = 94%. We set each control variable category equal to its sample proportion in the group $j$ data. For shots where Thomas’ recent save performance was at the 2.5th percentile, corresponding to $x_{ij}$ = 91.[...] $x_{ij}$ = 97.[...] Home and Game score have minimal impact on the estimated save probability. Different values for Position, Rebound, Strength, and Shot type, in contrast, have a substantial impact on estimated save probability: a shot from the Slot is 16 pps less likely to be saved than a shot from the Top; a rebound shot is 15 pps less likely to be saved than a non-rebound shot; and a shot from a team that has a 2-man advantage is 9 pps less likely to be saved than a shot from a team that is 2 men short. For Shot type, the save probability decreases by 18 pps when moving from wrap-around (the shot type least likely to result in a goal) to a deflected shot (the type most likely to result in a goal).

Figure 7: Panels (a) Rebound, (b) Home, (c) Position, (d) Game score, (e) Strength, (f) Shot type.
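Because the model coefficients are on the logit scale, differences quoted in percentage points depend on the baseline they are measured from. The sketch below shows one way such a conversion could be done. The helper name and the numbers in the usage note are illustrative placeholders, not estimates from the paper.

```python
import math


def inv_logit(eta: float) -> float:
    """Map a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def change_in_pps(baseline_eta: float, coefficient: float) -> float:
    """Change in save probability, in percentage points, when `coefficient`
    is added to a baseline linear predictor `baseline_eta`."""
    return 100.0 * (inv_logit(baseline_eta + coefficient) - inv_logit(baseline_eta))
```

For instance, `change_in_pps(2.0, -0.5)` gives the percentage-point drop for a hypothetical coefficient of −0.5 measured from a hypothetical baseline of 2.0 on the logit scale. The same coefficient produces a different percentage-point change at a different baseline, which is why the paper fixes the other variables at their baseline values in each panel.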
We computed two diagnostic statistics: $\hat{R}$ and $n_{\text{eff}}$. To check whether a chain has converged to the equilibrium distribution we can compare the chain’s behavior to other randomly initialized chains. The potential scale reduction statistic, $\hat{R}$, allows us to perform this comparison, by computing the ratio of the average variance of draws within each chain to the variance of the pooled draws across chains; if all chains are at equilibrium, the two variances are equal and $\hat{R} = 1$, and this is what we found for all of our models.

The effective sample size, $n_{\text{eff}}$, is an estimate of the number of independent draws from the posterior distribution for the estimate of interest. The $n_{\text{eff}}$ metric computed by the rstan package is based on the ability of the draws to estimate the true mean value of the parameter. As the draws from a Markov chain are dependent, $n_{\text{eff}}$ is usually smaller than the total number of draws. Gelman et al. (2013) recommend running the simulation until $n_{\text{eff}}$ is at least 5 times the number of chains, or 5 × [...]

We used multilevel logistic regression to investigate whether the performance of NHL goaltenders during the playoffs is consistent with a hot-hand effect. We used data from the 2008–2016 NHL playoffs. We measured past performance using both shot-based windows (as has been done in past research) and time-based windows (which has not been done before). Our window sizes spanned a wide range: from, roughly, half a game to 5 games. We allowed the intercept and the slope with respect to recent save performance to vary across goaltender-season combinations.

We found a significant negative impact of recent save performance on the next-shot save probability. This finding was consistent across all window types, window sizes, and goaltender-season combinations.
This finding is inconsistent with a hot-hand effect and contrary to the findings for baseball in Green and Zwiebel (2017), who used a similar window size and hypothesized that skilled activity would generally demonstrate a hot-hand effect.

If a goaltender’s performance, after controlling for observable factors, was completely random, then we would expect a period of above-average or below-average recent save performance to be likely to be followed by a period of save performance that is closer to the average, because of regression to the mean. As we increase the sample size used to measure recent save performance (that is, increase the window size), we would expect the average amount by which performance moves toward the average to decrease. We observe the opposite (see Figure 2), which argues against our finding being driven by regression to the mean.

A motivation effect provides one possible explanation for our finding. That is, if a goaltender’s recent save performance has been below his average for the current playoffs, then his motivation increases, resulting in increased effort and focus, causing the next-shot save probability to be higher. Conversely, if the recent save performance has been above average, then the goaltender’s motivation, effort, and focus could decrease, leading to a lower next-shot save probability. Bélanger et al. (2013) found support for the first of these effects (greater performance after failure) for “obsessively passionate individuals” but did not find support for the second effect (worse performance after success) for such individuals.
The study found support for neither effect for “harmoniously passionate individuals.” These findings are consistent with Hall-of-Fame goaltender Ken Dryden’s (2019) sentiment that “if a shot beats you, make sure you stop the next one, even if it is harder to stop than the one before.” The psychological mechanisms underlying our finding could benefit from further study.

Although the estimated recent save performance coefficient is consistently negative, its magnitude varies and in particular, the magnitude increases sharply with the window size. We expect to see more reliable estimates with longer window sizes, but the increase in magnitude is surprising, given that we define the recent save performance as a scale-free save percentage.

One limitation of our study is that, in defining windows, we ignore the time that passes between games. Past research, such as Green and Zwiebel (2017), shares this limitation. This limitation could be particularly serious for backup goaltenders, for whom the interval between two successive appearances could be several days long.

References

Albright, S. Christian (1993). “A Statistical Analysis of Hitting Streaks in Baseball.” In: Journal of the American Statistical Association [...]
Arkes (2010). [...] In: Journal of Quantitative Analysis in Sports [...]
Bélanger et al. (2013). [...] In: Journal of Personality and Social Psychology [...]
Dryden, Ken (2019). Scotty: A Hockey Life Like No Other. McLelland & Stewart.
Eager, Christopher and Joseph Roy (2017). Mixed Effects Models are Sometimes Terrible. arXiv: [...]
Enders, Craig K. and Davood Tofighi (2007). “Centering Predictor Variables in Cross-Sectional Multilevel Models: A New Look at an Old Issue.” In: Psychological Methods [...]
[...] Prior Distributions for rstanarm Models. https://cran.r-project.org/web/packages/rstanarm/vignettes/priors.html.
Gelman, Andrew et al. (2013). Bayesian Data Analysis. 3rd edition. Chapman and Hall/CRC.
Gilovich, Thomas, Robert Vallone, and Amos Tversky (1985). “The Hot Hand in Basketball: On the Misperception of Random Sequences.” In: Cognitive Psychology [...]
Green and Zwiebel (2017). [...] In: Management Science [...]
Hockey Reference (2017). [...] Season Data. [...]
Hox, Joop and J. Kyle Roberts (2011). Handbook of Advanced Multilevel Analysis. Psychology Press.
Kniffin, Kevin M. and Vince Mihalek (2014). “Within-Series Momentum in Hockey: No Returns for Running up the Score.” In: Economics Letters [...]
Livingston (2012). [...] In: Journal of Economic Behavior & Organization [...]
Miller and Sanjurjo (2018). [...] In: Econometrica [...]
Miller and Sanjurjo (2019). [...] In: Journal of Economic Perspectives [...]
Morrison and Schmittlein (1998). [...] In: Chance [...]
Ötting and Groll (2019). [...] arXiv preprint arXiv:1911.08138.
Ötting, Marius et al. (2020). “The Hot Hand in Professional Darts.” In: Journal of the Royal Statistical Society: Series A (Statistics in Society) [...]
Perry (2017). Corsica Play-by-Play Data. [...]
Tversky, Amos and Thomas Gilovich (1989a). “The Cold Facts about the ‘Hot Hand’ in Basketball.” In: Chance [...]
Tversky and Gilovich (1989b). [...] In: Chance [...]
Vesper (2015). [...] In: Chance [...]
Wardrop (1995). [...] In: The American Statistician [...]
Wardrop (1999). “Statistical Tests for the Hot-Hand in Basketball in a Controlled Setting.” In: Unpublished manuscript 1, pp. 1–20.
Wollack, James A. et al. (2002). “Recovery of Item Parameters in the Nominal Response Model: A Comparison of Marginal Maximum Likelihood Estimation and Markov Chain Monte Carlo Estimation.” In: