[PDF] Modelling multi-scale state-switching functional data with hidden Markov models

Abstract

Data sets comprised of sequences of curves sampled at high frequencies in time are increasingly common in practice, but they can exhibit complicated dependence structures that cannot be modelled using common methods of Functional Data Analysis (FDA). We detail a hierarchical approach which treats the curves as observations from a hidden Markov model (HMM). The distribution of each curve is then defined by another fine-scale model which may involve auto-regression and require data transformations using moving-window summary statistics or Fourier analysis. This approach is broadly applicable to sequences of curves exhibiting intricate dependence structures. As a case study, we use this framework to model the fine-scale kinematic movement of a northern resident killer whale (Orcinus orca) off the coast of British Columbia, Canada. Through simulations, we show that our model produces more interpretable state estimation and more accurate parameter estimates compared to existing methods.

Full PDF

MM ODELLING MULTI - SCALE STATE - SWITCHING FUNCTIONALDATA WITH HIDDEN M ARKOV MODELS

A P

REPRINT

Evan Sidrow

Department of StatisticsUniversity of British ColumbiaVancouver, Canada [email protected]

Nancy Heckman

Department of StatisticsUniversity of British ColumbiaVancouver, Canada

Sarah M.E. Fortune

Marine Mammal Research UnitUniversity of British ColumbiaVancouver, Canada

Andrew W. Trites

Institute for the Oceans and FisheriesUniversity of British ColumbiaVancouver, Canada

Ian Murphy

Department of StatisticsUniversity of British ColumbiaVancouver, Canada

Marie Auger-M´eth´e

Department of StatisticsUniversity of British ColumbiaVancouver, CanadaJanuary 12, 2021 A BSTRACT

Data sets comprised of sequences of curves sampled at high frequencies in time are increasinglycommon in practice, but they can exhibit complicated dependence structures that cannot be mod-elled using common methods of Functional Data Analysis (FDA). We detail a hierarchical approachwhich treats the curves as observations from a hidden Markov model (HMM). The distributionof each curve is then deﬁned by another ﬁne-scale model which may involve auto-regression andrequire data transformations using moving-window summary statistics or Fourier analysis. This ap-proach is broadly applicable to sequences of curves exhibiting intricate dependence structures. Asa case study, we use this framework to model the ﬁne-scale kinematic movement of a northern res-ident killer whale (

Orcinus orca ) off the coast of British Columbia, Canada. Through simulations,we show that our model produces more interpretable state estimation and more accurate parameterestimates compared to existing methods. K eywords Accelerometer data · animal movement · biologging · diving behaviour · hierarchical modelling · killerwhales · state-switching · statistical ecology · time series Biologging technology now provides researchers with kinematic data collected almost continuously in time (Hootenet al., 2017). The collection and analysis of data from devices such as accelerometers have brought new insightsto areas ranging from monitoring machine health (Getman et al., 2009) to understanding physical activity levels inchildren (Morris et al., 2006). The study of animal movement in particular has been transformed by tracking devicesthat record kinematic information in a variety of environments (B¨orger et al., 2020; Jeanniard-du Dot et al., 2016).Tags can record over 50 observations per second, resulting in time series containing millions of observations over thecourse of several hours. These data contain a wealth of information about human and animal behaviour, but parsingthese large data sets poses a challenge for statisticians and biologists.Biologging data are frequently modelled as a set of curves analyzed by methods of Functional Data Analysis, or FDA– see, e.g., Ramsay and Silverman (2005). For example, Morris et al. (2006) view a child’s activity level as a set ofdaily curves where metabolic activity is a function of time. Similarly, Fu and Heckman (2019) view the dive proﬁle ofa southern elephant seal (

Mirounga leonina ) as a set of dive curves whereby the amplitude and phase variation of thedive is used to classify dive types. a r X i v : . [ s t a t . M E ] J a n PREPRINT - J

ANUARY

12, 2021FDA was originally developed to process curves assumed to be independent replicates (i.e., there is no between-curve dependence), and within-curve ﬁne-scale structure is not usually incorporated in FDA models. However, sets ofcurves often exhibit complex sequential dependencies both between curves and within curves, especially in the case ofbiologging data (Leos-Barajas et al., 2017). For example, dive proﬁles of marine animals show distinct dive “types”that cluster together in time (Tennessen et al., 2019a) while simultaneously displaying bouts of short-term periodicitywithin dives (Adam et al., 2019). Fine-scale periodicity within a larger process is also common in ﬁelds ranging frommachine health (Xin et al., 2018; Lucero et al., 2019) to speech recognition (Juang and Rabiner, 1991).Some FDA models account for between-curve dependence occurring when multiple curves arise from separate groupsof individuals, but these models are inadequate when modelling certain types of dependence in time. For example,previous studies have used multilevel models with random effects to model variation between and within individualsin daily activity levels of children (Morris et al., 2006) or in menstrual cycles of adults (Brumback and Rice, 1998).More recent work involving multilevel models includes that of Di et al. (2009), Crainiceanu et al. (2009), and Chenand M¨uller (2012). However, these multilevel models are not appropriate in many biologging applications since theydo not account for different curve types. Further, the ﬁrst two papers do not account for temporal between-curvedependence and the third paper’s model of temporal dependence is not appropriate in most biologging applications.In addition to multi-level models, FDA researchers have also used functional time series to model dependence in asequence of curves. Functional time series extend the ideas of classic time series to model the evolution of one curveinto the next (Kokoszka and Reimherr, 2018), but they do not account for sequences of time-series curves whosedistributions are determined by well-deﬁned hidden states.Traditional FDA techniques similarly fall short when modelling complicated within-curve data. In particular, within-curve structure is usually modelled by a generic smooth mean function and a covariance function (Yao et al., 2005) orwith random regression (Rice and Wu, 2001). However, time-series data exhibiting both sharp behavioural changesand periodic ﬁne-scale structure are difﬁcult to model with these classic FDA techniques.To accommodate both temporal dependence and changes in curve type, we turn to the ﬁeld of animal movementmodelling (Hooten et al., 2017), where one of the most prevalent techniques of late is the hidden Markov model, orHMM (Patterson et al., 2017; McClintock et al., 2020). HMMs interpret animal movement data as arising from asequence of unobserved behavioural states, allowing biologists to infer the underlying behaviour of an animal fromsequential observations of its position. While ubiquitous in ecology literature, HMMs have seen little use in non-parametric functional modelling with a few notable exceptions. In particular, Langrock et al. (2018) take a non-parametric approach to model the distributions of HMM observations with B-splines, but this approach does notaccount for certain types of temporal correlation.While useful, HMMs alone are also not sufﬁcient to model intricate ﬁne-scale time-series data for three primary rea-sons. Firstly, HMMs assume that subsequent observations are independent given an underlying hidden state process,but this is often not the case when observations are taken at extremely high frequencies. Several solutions have beenproposed in the ecology literature such as the hidden movement Markov model (Whoriskey et al., 2016) and the con-ditionally auto-regressive hidden Markov model, or CarHMM (Lawler et al., 2019). Secondly, classic HMMs fail tomodel simultaneous behavioural processes that occur at different time scales (i.e., both between and within curves). Toaddress this issue, statistical ecologists have employed hierarchical hidden Markov models (HHMMs) (Leos-Barajaset al., 2017; Adam et al., 2019), which model both scales with conditionally dependent HMMs. Thirdly, traditionalHMMs, CarHMMs, and HHMMs cannot easily capture complicated dependence structures at short time scales. Forexample, Adam et al. (2019) fail to capture ﬁne-scale periodic swimming pattern of horn sharks (

Heterodontus fran-cisci ) using a traditional HHMM. Heerah et al. (2017) successfully use Fourier analysis within an HMM to account fordaily behavioural cycles in marine mammals, and Fourier analysis has previously been used with accelerometer datato explain animal behaviour (Fehlmann et al., 2017; Alex Shorter et al., 2017). Thus, incorporating Fourier analysiswithin the structure of an HMM appears to be a promising approach to account for ﬁne-scale periodic structures.In this paper, we combine existing methods from statistical ecology literature in novel ways to account for complex,temporally dependent functional data. The resulting suite of methods makes up a “tool box” that can be used tobuild arbitrarily complex hierarchical models to explain multi-scale functional and time-series data with intricatedependence structure. We begin in Section 2 by describing HMMs as well as two variants, CarHMMs and HHMMs,and discuss how Fourier analysis can handle ﬁne-scale dependence structures. We also show how these methods canbe combined to analyze increasingly complex data. In Section 3 we ﬁt several candidate models to data from a killerwhale (

Orcinus orca ) from the threatened northern resident population off the coast of British Columbia, Canada.Section 4 details a simulation study based on these candidate models, and in Section 5 we discuss our results.2

PREPRINT - J

ANUARY

12, 2021

Consider a sequence of T curves, where any particular curve t is characterized by a curve-level (or coarse-scale) obser-vation Y t as well as a sequence of T ∗ t within-curve (or ﬁne-scale) observations Y ∗ t . Namely, Y ∗ t ≡ (cid:8) Y ∗ t, , . . . , Y ∗ t,T ∗ t (cid:9) is made up ﬁne-scale quantities derived from curve t indexed using t ∗ . Both Y t and Y ∗ t,t ∗ can be either vectors orscalars. We call the sequence of coarse-scale observations Y ≡ (cid:8) Y , . . . , Y T (cid:9) and the collection of all ﬁne-scaleobservations Y ∗ ≡ (cid:8) Y ∗ , . . . , Y ∗ T (cid:9) . To develop our model for this data, we detail the structure of a traditional HMMfollowed by three variations which generalize its base structure. We then show how each of these generalized HMMscan be synthesized to form a wide variety of more complicated models. Hidden Markov models describe state-switching Markovian processes in discrete time and are the core structure we useto model both Y and Y ∗ . For simplicity we focus on Y to introduce the model. An HMM is comprised of a sequenceof unobserved states X ≡ (cid:8) X , . . . , X T (cid:9) together with an observation sequence Y ≡ (cid:8) Y , . . . , Y T (cid:9) , where X t isassociated with the observation Y t . The Y t ’s are often referred to as “emissions” and the index t typically refers to time.The X t ’s form a Markov chain and can take integer values between and N . Their distribution is governed by thedistribution of the initial state X and the N × N transition probability matrix Γ , where Γ ij = Pr( X t +1 = j | X t = i ) .We assume that X follows the chain’s stationary distribution, which is denoted by an N -dimensional row vector δ ,where δ i = Pr( X = i ) . A Markov chain’s stationary distribution is determined by its probability transition matrixvia δ = δ Γ and (cid:80) Ni =1 δ i = 1 . The distribution of an emission Y t conditioned on the corresponding hidden state X t does not depend upon any other observation or hidden state. If X t = i then we denote the conditional density orprobability mass function of Y t as f ( i ) ( · ; θ ( i ) ) or simply f ( i ) ( · ) , where θ ( i ) is a state-dependent parameter describingthe emission distribution.Using observation emissions, denoted here as y ≡ { y , . . . , y T } , we can ﬁnd the maximum likelihood estimates ofthe parameters Γ and θ ≡ { θ (1) , . . . , θ ( N ) } . The likelihood L HMM can be evaluated using the well-known forwardalgorithm (Zucchini et al., 2016): L HMM ( θ, Γ; y ) = δP ( y ; θ ) T (cid:89) t =2 Γ P ( y t ; θ ) N , where N is an N -dimensional column vector of ones and P ( y t ; θ ) is an N × N diagonal matrix with ( i, i ) th entry f ( i ) ( y t ; θ ( i ) ) .Following Leos-Barajas et al. (2017), we reparameterize the N × N transition probability matrix Γ such that the entriesof the matrix are forced to be non-negative and the rows sum to 1: Γ ij = exp( η ij ) (cid:80) Nk =1 exp( η ik ) , where i, j = 1 , . . . , N and η ii is set to zero for identiﬁability. This simpliﬁes likelihood maximization by removingconstraints in the optimization problem. For simplicity, we will continue to use Γ in our notation, suppressing thereparameterization in terms of η . Figure 1a represents the dependence structure of an HMM. A conditionally auto-regressive hidden Markov model, or CarHMM (Lawler et al., 2019), is a generalization of anHMM which explicitly models auto-correlation in the observation sequence beyond the correlation induced by thehidden state process. Like a traditional HMM, a CarHMM is made up of a Markov chain of unobserved states X , . . . , X T that can take integer values between and N . A CarHMM also has a transition probability matrix Γ and initial distribution δ equal to the stationary distribution of Γ . Unlike a traditional HMM, the CarHMM assumesthat the distribution of Y t conditioned on X , . . . , X T and Y , . . . , Y t − depends on both X t and Y t − rather than only X t . The ﬁrst emission Y is treated as a ﬁxed initial value which does not depend upon X . We denote the conditionaldensity or probability mass function of Y t given Y t − = y t − and X t = i as f ( i ) ( ·| y t − ; θ ( i ) ) or simply f ( i ) ( ·| y t − ) .This model is highly general, as f ( i ) ( ·| y t − ) can be any valid density or probability mass function that depends uponthe parameters θ ( i ) and the previous observation Y t − . As a concrete example, if Y t is a scalar, then one may assume3 PREPRINT - J

ANUARY

12, 2021 (a) Hidden Markov Model (

HMM )(b) Conditionally Autoregressive HMM (

CarHMM )(c) Hierarchical HMM (

HHMM )(d) HMM with Discrete Fourier Transform (

HMM-DFT ) Figure 1: Dependence structures of a standard HMM (a) and three HMM variants (b, c, and d). Hidden state sequencesare denoted as X on the coarse-scale and X ∗ on the ﬁne scale. Likewise, observations (or emissions) are denoted by Y on the coarse scale and Y ∗ on the ﬁne scale. The HHMM shown here has a traditional HMM as both the coarse- andﬁne-scale model. In (d), observations are transformed using a moving window and denoted as ˜ Y ∗ with correspondinghidden states ˜ X ∗ .that Y t given X t = i is Normally distributed with parameters θ ( i ) = { µ ( i ) , σ ( i ) , φ ( i ) } , where: E ( Y t | Y t − = y t − , X t = i ) = φ ( i ) y t − + (1 − φ ( i ) ) µ ( i ) , V ( Y t | Y t − = y t − , X t = i ) = ( σ ( i ) ) . (1)A CarHMM which follows Equation (1) can be viewed as a discrete time version of a state-switching Ornstein-Uhlenbeck process (Michelot and Blackwell, 2019). This follows in the same way that an AR(1) process is thediscrete-time version of a traditional Ornstein-Uhlenbeck process.4 PREPRINT - J

ANUARY

12, 2021As previously, the likelihood corresponding to a general CarHMM can be easily calculated using the forward algo-rithm. If y is the sequence of observed emissions, then L CarHMM ( θ, Γ; y ) = δ T (cid:89) t =2 Γ P ( y t | y t − ; θ ) N , where P ( y t | y t − ; θ ) is an N × N diagonal matrix with ( i, i ) th entry equal to f ( i ) ( y t | y t − ; θ ( i ) ) . Figure 1b shows agraphical representation of the dependence structure of a CarHMM. A hierarchical hidden Markov model, or HHMM, accounts for processes occurring simultaneously at different scalesby modelling both the coarse-scale process and ﬁne-scale process with either HMMs (Leos-Barajas et al., 2017; Adamet al., 2019) or CarHMMs. The coarse-scale model is either an HMM and CarHMM as deﬁned in Sections 2.1 and 2.2,where X , . . . , X T make up an unobserved Markov chain with N possible states and Y , . . . , Y T are the correspondingobservations with state-dependent parameters θ ( i ) for i = 1 , . . . , N . In the hierarchical setting, however, each state X t also emits another sequence of ﬁne-scale unobserved states, X ∗ t ≡ { X ∗ t, , . . . , X t,T ∗ t } , which in turn emits a sequenceof ﬁne-scale observations Y ∗ t ≡ { Y ∗ t, , . . . , Y t,T ∗ t } . For each curve t , the ﬁne-scale process { X ∗ t , Y ∗ t } then followsanother HMM (or CarHMM) whose parameters depend on the value of X t . If X t = i , then the components of X ∗ t make up a Markov chain with N ∗ ( i ) possible states, an N ∗ ( i ) × N ∗ ( i ) transition probability matrix Γ ∗ ( i ) , and an initialdistribution δ ∗ ( i ) which we assume is equal to the stationary distribution of the chain. The distribution of Y ∗ t,t ∗ given Y ∗ t,t ∗ − = y ∗ t,t ∗ − , X ∗ t,t ∗ = i ∗ , and X t = i is governed by the parameter θ ∗ ( i,i ∗ ) and has density or probability massfunction denoted f ∗ ( i,i ∗ ) (cid:0) ·| y ∗ t,t ∗ − ; θ ∗ ( i,i ∗ ) (cid:1) or simply f ∗ ( i,i ∗ ) ( ·| y ∗ t,t ∗ − ) . We denote the set of ﬁne-scale emissionparameters corresponding to X t = i as θ ∗ ( i ) = (cid:8) θ ∗ ( i, , . . . , θ ∗ ( i,N ∗ ( i ) ) (cid:9) . In summary: { Y, X } follows a (Car)HMM with Γ ∈ R N × N , ( Y t | Y t − = y t − , X t = i ) has density f ( i ) ( ·| y t − ; θ ( i ) ) , { Y ∗ t , X ∗ t | X t = i } follows a (Car)HMM with Γ ∗ ( i ) ∈ R N ∗ ( i ) × N ∗ ( i ) , ( Y ∗ t,t ∗ | Y ∗ t,t ∗ − = y ∗ t,t ∗ − , X ∗ t,t ∗ = i ∗ , X t = i ) has density f ∗ ( i,i ∗ ) ( ·| y ∗ t,t ∗ − ; θ ( i,i ∗ ) ) . Given the coarse-scale hidden state sequence X , the T + 1 sets { X ∗ , Y ∗ } , . . . , { X ∗ T , Y ∗ T } , and { Y , . . . , Y T } areassumed to be independent of one another.Forcing certain parameters to be shared can reduce complexity and increase interpretability of the HHMM. For exam-ple, in our killer whale case study (see Section 3), we take N ∗ ( i ) = N ∗ for all i . We also share the ﬁne-scale emissionparameters across the N coarse-scale hidden states (cid:0) i.e., θ ∗ (1 ,i ∗ ) = · · · = θ ∗ ( N,i ∗ ) = θ ∗ ( · ,i ∗ ) for all i ∗ = 1 , . . . , N ∗ (cid:1) .Coarse-scale hidden states therefore differ only in their coarse-scale emission parameters θ ( i ) and ﬁne-scale probabilitytransition matrices Γ ∗ ( i ) .Due to the nested structure of the HHMM, the likelihood is easily calculated using the forward algorithm. Let y bethe sequence of observed coarse-scale emissions and y ∗ ≡ { y ∗ , . . . , y ∗ T } be the collection of T observed ﬁne-scaleemission vectors. In addition, let θ ∗ ≡ { θ ∗ (1) , . . . , θ ∗ ( N ) } denote the collection of all ﬁne-scale emission parametersand Γ ∗ ≡ { Γ ∗ (1) , . . . , Γ ∗ ( N ) } denote the collection of all ﬁne-scale transition probability matrices. The likelihood ofthe observed data is then L HHMM ( θ, θ ∗ , Γ , Γ ∗ ; y, y ∗ ) = δP m ( y , y ∗ ; θ, θ ∗ , Γ ∗ ) T (cid:89) t =2 Γ P m ( y t , y ∗ t | y t − ; θ, θ ∗ , Γ ∗ ) N (2)where P m ( y t , y ∗ t | y t − ; θ, θ ∗ , Γ ∗ ) is an N × N diagonal matrix whose exact structure depends upon the coarse- andﬁne-scale models. If the coarse-scale model is an HMM, P m ( y , y ∗ ; θ, θ ∗ , Γ ∗ ) and P m ( y t , y ∗ t | y t − ; θ, θ ∗ , Γ ∗ ) for t ≥ both have ( i, i ) th entries equal to f ( i ) ( y t ) L ﬁne (cid:0) θ ∗ ( i ) , Γ ∗ ( i ) ; y ∗ t (cid:1) . If the coarse-scale model is a CarHMM, P m ( y , y ∗ ; θ, θ ∗ , Γ ∗ ) has ( i, i ) th entry equal to L ﬁne (cid:0) θ ∗ ( i ) , Γ ∗ ( i ) ; y ∗ (cid:1) and P m ( y t , y ∗ t | y t − ; θ, θ ∗ , Γ ∗ ) for t ≥ has ( i, i ) th entry equal to f ( i ) ( y t | y t − ) L ﬁne (cid:0) θ ∗ ( i ) , Γ ∗ ( i ) ; y ∗ t (cid:1) . The ﬁne-scale likelihood L ﬁne corresponds to the likelihoodof the ﬁne-scale model, which can be either a CarHMM or an HMM. Figure 1c graphically displays the dependencestructure for an HHMM. 5 PREPRINT - J

ANUARY

12, 2021

In many applications where data are collected at high frequencies, intricate dependency structures arise within theﬁne-scale process that cannot be adequately modelled with the HMM-based models described thus far. To handlethese additional ﬁne-scale structures, we recommend replacing Y ∗ t = { Y ∗ t, , . . . , Y ∗ t,T ∗ t } with relevant statistics thatsummarize any non-Markovian behaviour. To maintain the temporal structure of the ﬁne-scale process, local summarystatistics can be calculated from a moving window over the elements of Y ∗ t . Subject matter experts are often requiredto determine the speciﬁc summary statistics employed as well as the optimal window size and stride length of themoving window. Stride length refers to the distance between the ﬁrst element of consecutive windows, so a stridelength of h indicates that the ﬁrst window starts at Y ∗ t, , the second at Y ∗ t,h +1 , and so on. Larger stride lengths result ina loss of information but also reduce the dimension of the ﬁne-scale process, which allows for faster model ﬁtting. Inaddition, setting the stride length equal to the window size avoids artiﬁcial residual correlation arising from overlappingwindows. For our case study, we use the discrete Fourier transform (DFT) of a moving forward window of width h and stride h across Y ∗ t . Namely: DF T { Y ∗ t,t ∗ , . . . , Y ∗ t,t ∗ + h − } ( k ) = h − (cid:88) n =0 Y ∗ t,t ∗ + n exp (cid:18) − i πh kn (cid:19) (3)for t ∗ = 1 , h + 1 , h + 1 , . . . and k = 0 , , . . . , h − with i = √− . If Y ∗ t,t ∗ is a vector then the DFT is takencomponent-wise. We omit the ﬁnal window if t ∗ + h − exceeds T ∗ t , denote the total number of windows as ˜ T ∗ t , andindex each window with ˜ t ∗ = 1 , . . . , ˜ T ∗ t . Next, we calculate transformed observations ˜ Y ∗ t, ˜ t ∗ ≡ { ˜ A ∗ t, ˜ t ∗ , ˜ W ∗ t, ˜ t ∗ } : ˜ A ∗ t, ˜ t ∗ ≡ h h (cid:88) n =1 Y ∗ t,h (˜ t ∗ − n , ˜ W ∗ t, ˜ t ∗ ≡ ˜ ω (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) DF T { Y ∗ t,h (˜ t ∗ − , . . . , Y ∗ t,h ˜ t ∗ } ( k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , (4)where ˜ ω ≤ h − is a problem-speciﬁc tuning parameter corresponding to the maximum recorded frequency withineach window. In words, ˜ A ∗ t, ˜ t ∗ is the average value of Y ∗ t within window ˜ t ∗ and ˜ W ∗ t, ˜ t ∗ is the squared two-norm ofthe component of the window that can be attributed to frequencies between one and ˜ ω periods per window. Moreintuitively, ˜ W ∗ t, ˜ t ∗ corresponds to the “wiggliness” of the ﬁne-scale data within curve t and window ˜ t ∗ .When performing this transformation, the ﬁne-scale HMM (or CarHMM) must be redeﬁned since ˜ Y ∗ t exists on acoarser scale than Y ∗ t itself. As a result, there are only ˜ T ∗ t hidden states associated with ˜ Y ∗ t = (cid:8) ˜ Y ∗ t, , . . . , ˜ Y ∗ t, ˜ T ∗ t (cid:9) ,which we denote as ˜ X ∗ t = (cid:8) ˜ X ∗ t, , . . . , ˜ X ∗ t, ˜ T ∗ t (cid:9) . The ﬁne-scale probability transition matrices Γ ∗ ( i ) and probabilitydensity functions f ∗ ( i,i ∗ ) are then applied directly to ˜ X ∗ t and ˜ Y ∗ t instead of X ∗ t and Y ∗ t . The likelihood of this modelis therefore identical to that of the original HMM or CarHMM as deﬁned in Sections 2.1 and 2.2, but Y ∗ is replacedwith ˜ Y ∗ and X ∗ is replaced with ˜ X ∗ . To clearly differentiate models, we refer to an HMM with ˜ Y ∗ as observationsand ˜ X ∗ as hidden states as an HMM-DFT. Figure 1d displays the dependence structure of an HMM-DFT. Traditional HHMMs treat both the coarse-scale and the ﬁne-scale processes as realizations of an HMM or CarHMM.However, the ﬁne-scale observations for a particular dive Y ∗ t can be modelled using a large variety of parametricmodels which admit easy-to-compute likelihoods or penalized likelihoods. As such, the ﬁne-scale HMM likelihoodterm L ﬁne in Equation (2) can be replaced by the likelihood of a general ﬁne-scale model whose parameters dependupon the coarse-scale hidden state. Possible candidates for the ﬁne-scale model include any of the models described inthe previous subsections in addition to many others not described here. For example, Bebbington (2007) and Borcherset al. (2013) investigate data sets with count onsets as observations, so they use variations of a Poisson process astheir ﬁne-scale model. If the ﬁne-scale model is a simple Poisson process, then this approach is equivalent to aMarkov-modulated Poisson process (Fischer and Meier-Hellstern, 1993). The ﬁne-scale process can also be modelledas in (Langrock et al., 2018), who use B-splines to model the emission distribution of an HMM. This non-parametricapproach uses a penalized likelihood term which can easily replace the usual ﬁne-scale likelihood term in Equation (2).Another class of ﬁne-scale models is the set of continuous time methods such as a continuous-time HMM (CTHMM)(Liu et al., 2015) or a state-switching Ornstein-Uhlenbeck process (Michelot and Blackwell, 2019). A continuous-timeHMM may be appropriate if observations are not equi-spaced in time (Liu et al., 2015). Xu et al. (2020) model high-frequency biologging accelerometer data of individuals by incorporating a CTHMM into a hierarchical model similarto ours. However, they assume that individuals are partitioned into subgroups a priori whereas we use an HMM toinfer the coarse-scale hidden states. 6 PREPRINT - J

ANUARY

12, 2021These examples include a few of many ﬁne-scale models that can act as initial building blocks in a practitioner’stoolbox to construct increasingly complex hierarchical models. A myriad of possible models can be built using thisframework, but these models can quickly become complicated and computationally expensive to ﬁt. Therefore, modelsshould be constructed with care to achieve an adequate ﬁt of the data while avoiding over-ﬁtting and high computa-tional costs.

To illustrate the process of constructing a model using these building blocks, we analyze the dive behaviour of anorthern resident killer whale in Queen Charlotte Sound off the coast of British Columbia, Canada, and constructseveral candidate models to categorize and describe its diving behaviour.Understanding animal behaviour is important for conservation efforts, as environmental changes caused by anthro-pogenic activity can directly impact animal behaviour (Sutherland, 1998). HMMs have been used to understand howdiving behaviours of various species are affected by disturbances (e.g., DeRuiter et al. (2017) and Isojunno et al.(2017)). For killer whales, we are interested in categorizing different diving behaviours to identify potential forag-ing dives. Northern resident killer whales feed almost exclusively on calorie-rich Chinook salmon (

Oncorhynchustshawytscha ) (Ford and Ellis, 2006), which typically occur deeper and are less numerous than smaller types of salmon(Ford et al., 2009). Northern resident killer whales therefore must expend signiﬁcant amounts of energy to captureChinook (Williams and Noren, 2009; Noren, 2011; Wright et al., 2017). Acceleration data can be used to estimatean animal’s energy expenditure (Green et al., 2009; Wilson et al., 2019), but the animal’s behavioural state must beaccounted for in order to obtain accurate estimates (Jeanniard du Dot et al., 2016). Therefore, understanding both thebehavioural state of the killer whale and the distribution of acceleration within each behavioural state is needed todetermine the true energetic requirements of the animal.

The data we use were collected on September 2, 2019 from 12:49 pm to 6:06 pm PDT and consist of depth and ac-celeration over time. Observations were collected at a rate of 50 Hz using a CATs biologger (Customizable AnimalTracking Solutions, ). Acceleration was measured in three dimensions, which together represent the com-plete range of movement of an animal (forward/backward, upward/downward, and right/left). Tri-axial accelerationreadings are common in these types of tags and are often used to infer animal behaviour such as foraging (Cade et al.,2018; Fehlmann et al., 2017; Wright et al., 2017). The act of attaching and detaching the tag caused anomalous be-haviour before 1:20 pm and after 6:00 pm, so observations taken during these time periods are ignored. There werealso periods of time when the tag failed to record observations, resulting in data gaps between 2:25 pm and 2:37 pmand between 4:07 pm and 5:07 pm. To preprocess the data, we smooth the depth and acceleration curves by taking amoving average within a window of / th of a second. We then deﬁne a killer whale “dive” as any continuous inter-val of data that occurs below 0.5 meters in depth and lasts for at least 10 seconds. Data are preprocessed in part withthe divebomb package in Python (Nunes, 2019). The preprocessed data contain a total of 267 dives, all of which aredisplayed in Figure 2. Each dive is treated as one curve, and the sequence of dives makes up the coarse-scale process.Speciﬁcally, the sequence of observed coarse-scale observations y = (cid:8) y , . . . , y (cid:9) is a time series of dive durationsin seconds. For dive t , the ﬁne-scale observations are contained in y ∗ t ≡ (cid:8) y ∗ t, , . . . , y ∗ t,T ∗ t (cid:9) , which is a sequence con-taining the within-dive acceleration data in units of meters per second squared. The collection of all acceleration datais denoted as y ∗ = (cid:8) y ∗ , . . . , y ∗ (cid:9) . Deﬁning a suitable model to describe this killer whale kinematic data involves selecting an appropriate number ofhidden states, model structure, and emission distributions for both the coarse- and ﬁne-scale observations.We do not use information criteria to select the number of dive types N since these metrics tend to overestimate thenumber of behavioural states in biological processes (Pohle et al., 2017). We instead plot the duration of each diveversus the duration of the dive preceding it ( y t versus y t − for t ≥ ). This type of visualization is known as a lagplot. If the emission distributions of the hidden states are well-separated, a lag plot should reveal N distinct patterns,where each pattern corresponds to one dive type (Lawler et al., 2019). This is unfortunately not the case for our killerwhale data, as there is one cluster of data centred at approximately y t = y t − = 30 seconds. However, longer divesappear to be characterized by bouts of less “wiggly” behaviour in the acceleration data compared to shorter dives, sowe choose N = 2 to differentiate these dive types. The absence of a more principled method to select N highlights theimportance of model validation techniques in lieu of information criteria (see Section 3.4). Prior to ﬁtting the model,7 PREPRINT - J

ANUARY

12, 2021Figure 2: Dive depth (top panel) and three-dimensional acceleration (bottom three panels) from a killer whale overapproximately 5 hours. An exact physical interpretation of each of component acceleration is difﬁcult due to variationsin tag orientation. There are data gaps occurring from around . to . hours and from around . to . hours. Bothdata gaps were excluded from the analysis.lag plots reveal no signiﬁcant auto-correlation between dive duration observations (see Figure 1 of the supplementarymaterial), and visual inspection shows no obvious complicated dependence. Therefore, we select a simple HMM tomodel the coarse-scale process since neither a CarHMM nor a moving-window transformation is called for. Given thatdive t is of type i , we assume that the dive duration Y t follows a Gamma distribution with unknown parameters µ ( i ) and σ ( i ) : E ( Y t | X t = i ) = µ ( i ) , V ( Y t | X t = i ) = (cid:16) σ ( i ) (cid:17) . This is consistent with previous studies, including that of Leos-Barajas et al. (2017).We then select a model corresponding to the ﬁne-scale observations of acceleration. Similarly to the coarse model, werely on lag plots and visual inspection to select N ∗ = 3 subdive states. Although N ∗ is selected heuristically, we testthe validity of this model in Section 3.4. In contrast to the coarse-scale observations, the ﬁne-scale acceleration dataexhibit signiﬁcant sinusoidal behaviour. Thus, we transform each ﬁne-scale observation sequence y ∗ t into ˜ y ∗ t usingEquation (4) with a window size of h = 100 (two seconds) and a maximum frequency of ˜ ω = 10 (5 Hz). We thenhave that ˜ y ∗ t, ˜ t ∗ = { ˜ a ∗ t, ˜ t ∗ , ˜ w ∗ t, ˜ t ∗ } , where ˜ a ∗ t, ˜ t ∗ is a three dimensional vector of component-wise average accelerationand ˜ w ∗ t, ˜ t ∗ is a scalar describing the “wiggliness” of a particular window. Even after transforming the raw accelerationdata, there is still strong auto-correlation within each component of ˜ a ∗ t, ˜ t ∗ prior to ﬁtting the model (see Figure 1 of thesupplementary material). Therefore, we choose a CarHMM as deﬁned in Section 2.2 as the ﬁne-scale model.8 PREPRINT - J

ANUARY

12, 2021Figure 3: Graphical representation of the conditionally auto-regressive hierarchical hidden Markov model with discreteFourier transform (

CarHHMM-DFT ) used in the simulation and case study. The type of dive t is denoted by X t and Y t represents the associated dive duration. The raw acceleration vector associated with dive t and time stamp t ∗ isdenoted by Y ∗ t,t ∗ . The subdive state of the killer whale during dive t and window ˜ t ∗ is denoted as ˜ X ∗ t, ˜ t ∗ , and thecorresponding transformed observation is denoted by ˜ Y ∗ t, ˜ t ∗ .We then select the speciﬁc emission distribution of ˜ Y ∗ t, ˜ t ∗ . First, we assume that ˜ W ∗ t, ˜ t ∗ and all three componentsof ˜ A ∗ t, ˜ t ∗ are independent of one another when conditioned on the dive types and subdive states. To reduce modelcomplexity, we also assume that the three sets of ﬁne-scale emission parameters are shared across the two divetypes (cid:0) θ ∗ (1 ,i ∗ ) = θ ∗ (2 ,i ∗ ) ≡ θ ∗ ( · ,i ∗ ) for i ∗ = 1 , , (cid:1) . This implies that the subdive states within dive type 1 have thesame interpretation as those within dive type 2. To specify the emission distribution of ˜ A ∗ t, ˜ t ∗ , consider the sequence (cid:8) ˜ A ∗ t, , . . . , ˜ A ∗ t,T ∗ t (cid:9) for a particular dive t . We assume that each of the three components of this sequence are Normallydistributed as in Equation (1) and all components are independent of one another when conditioned on the subdivestates (cid:8) ˜ X t, , . . . , ˜ X t, ˜ T ∗ t (cid:9) . Each component is assumed to have its own mean and variance parameters, but all com-ponents share the same auto-correlation parameter. Thus the distribution of ˜ A ∗ t, ˜ t ∗ given ˜ A ∗ t, ˜ t ∗ − and X ∗ t, ˜ t ∗ = i ∗ hasparameters µ ∗ ( · ,i ∗ ) A ∈ R , σ ∗ ( · ,i ∗ ) A ∈ R , and φ ∗ ( · ,i ∗ ) A ∈ [0 , . To specify the emission distribution of ˜ W ∗ t, ˜ t ∗ , we assumethat given ˜ X ∗ t, ˜ t ∗ = i ∗ , ˜ W ∗ t, ˜ t ∗ follows a Gamma distribution parameterized by its mean µ ∗ ( · ,i ∗ ) W and standard deviation σ ∗ ( · ,i ∗ ) W . In addition, ˜ W ∗ t, , . . . , ˜ W ∗ t,T ∗ t are assumed to be independent of one another given the subdive state sequence (cid:8) ˜ X t, , . . . , ˜ X t,T ∗ t (cid:9) . We do not include ˜ W ∗ t, ˜ t ∗ − in the distribution of ˜ W ∗ t, ˜ t ∗ because the auto-correlation evident fromthe lag plot is not severe and may be explained by subsequent observations occurring within the same subdive state.In total, the parameters to estimate are Γ , Γ ∗ = { Γ ∗ (1) , Γ ∗ (2) } (probability transition matrices) ,θ = { µ (1) , σ (1) , µ (2) , σ (2) } ( Y emission parameters), and θ ∗ = { θ ∗ ( · , , θ ∗ ( · , , θ ∗ ( · , } ( ˜ Y ∗ emission parameters), where θ ∗ ( · ,i ∗ ) = { µ ∗ ( · ,i ∗ ) A , σ ∗ ( · ,i ∗ ) A , φ ∗ ( · ,i ∗ ) A , µ ∗ ( · ,i ∗ ) W , σ ∗ ( · ,i ∗ ) W } . Recall that θ ∗ ( · ,i ∗ ) is the set of parameters describing the distribution of ˜ Y ∗ t, ˜ t ∗ conditioned on ˜ X ∗ t, ˜ t ∗ = i ∗ . We referto this ﬁnal model as the CarHHMM-DFT since it includes a CarHMM, hierarchical HMM, and DFT-based trans-formation. The likelihood of this model is easily calculated using the forward algorithm and can be maximized withrespect to the parameters above (see the appendix for details). Figure 3 shows the dependence structure of the fullCarHHMM-DFT. 9

PREPRINT - J

ANUARY

12, 2021Table 1: Estimates with standard errors for the parameters of the distributions of dive duration (cid:0) Y t (cid:1) , acceleration (cid:0) ˜ A ∗ t, ˜ t ∗ (cid:1) , and wiggliness (cid:0) ˜ W ∗ t, ˜ t ∗ (cid:1) of the killer whale kinematic data using the full CarHHMM-DFT. The parenthesesrefer to standard errors estimated using the observed information matrix.Feature Dive / Subdive Type Parameter Estimate (Standard Error) ˆ µ ˆ σ ˆ φ Dive Duration ( s ) – Y t .

68 (0 .

60) 9 .

57 (0 . —2 . .

4) 64 . . — x -Acc. ( m/s ) – (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) x .

020 (0 . .

034 (0 . .

976 (0 . .

244 (0 . .

079 (0 . .

886 (0 . .

218 (0 . .

265 (0 . .

626 (0 . y -Acc. ( m/s ) – (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) y .

469 (0 . .

044 (0 . .

976 (0 . .

436 (0 . .

082 (0 . .

886 (0 . .

384 (0 . .

321 (0 . .

626 (0 . z -Acc. ( m/s ) – (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) z − .

683 (0 . .

052 (0 . .

976 (0 . − .

593 (0 . .

096 (0 . .

886 (0 . − .

366 (0 . .

317 (0 . .

626 (0 . Wiggliness - ˜ W ∗ t, ˜ t ∗ .

34 (0 .

29) 12 .

95 (0 . —2 . .

2) 330 . . —3 —In addition to the CarHHMM-DFT, we consider three variations for comparison. As in the full model, each of thefollowing models assume that all components of ˜ Y ∗ t, ˜ t ∗ are conditionally independent of one another given the divetypes and subdive states:1. An HHMM-DFT , which models the coarse-scale observations with an HMM and transforms the ﬁne-scaleobservations using Equation (4), but models ˜ Y ∗ t, ˜ t ∗ as emissions from a simple HMM rather than a CarHMM.2. A CarHHMM , which models the coarse-scale observations with an HMM, transforms the ﬁne-scale obser-vations using Equation (4), and models ˜ A ∗ t, ˜ t ∗ as emissions of a CarHMM. However, the “wiggliness” ˜ W ∗ t, ˜ t ∗ is omitted from this model altogether.3. A CarHMM-DFT , which models the coarse-scale observations as an independent and identically distributedsequence of dives, transforms the ﬁne-scale observations using Equation (4), and models ˜ Y ∗ t, ˜ t ∗ as emissionsof a CarHMM. This model assumes that there is only one dive type.Each of the three candidate models above leaves out one important aspect of the full CarHHMM-DFT: the HHMM-DFT assumes there is no auto-correlation between ﬁne-scale observations, the CarHHMM does not incorporate “wig-gliness” (cid:0) ˜ W ∗ t, ˜ t ∗ (cid:1) , and the CarHMM-DFT lacks a hierarchical structure and thus does not distinguish between divetypes. To illustrate an application of this method and compare the candidate models, we ﬁt all four models to the data shownin Figure 2. We ﬁrst report the results from the full CarHHMM-DFT in detail and assess the quality of the ﬁt. We thencompare these results with those from the other candidate models.The coarse-scale parameter estimates suggest that the killer whale has at least two distinct dive behaviours (see Table1 and Figure 4a). Dive type 1 corresponds to shorter, shallower dives which likely reﬂect resting, travelling, and to alesser extent searching for prey. Dive type 2 is longer, deeper, and may be associated with behaviours such as hunting(Tennessen et al., 2019a), but it is unclear whether any of the dives in this study are successful foraging dives. Nodive in this data set has a maximum depth deeper than 30 meters and a study by Wright et al. (2017) of killer whalesin Johnstone Strait found that most prey captures occur at depths deeper than 100 meters. However, the killer whalestudied here was tagged north of Johnstone Strait in Queen Charlotte Sound, and signiﬁcant numbers of ﬁsh have beenobserved to be caught near the surface (Fortune and Trites, unpublished data). Dive type 2 could also be associatedwith behaviours such as socializing that can take place several meters below the surface (Tennessen et al., 2019a).The means of “wiggliness” (cid:0) ˜ W ∗ t, ˜ t ∗ (cid:1) associated with each subdive state are separated by an order of magnitude (seeTable 1 and Figure 4b). Subdive state 1 has the smallest mean corresponding to ˜ W ∗ t, ˜ t ∗ and the smallest variance10 PREPRINT - J

ANUARY

12, 2021 (a) Estimated Gamma probability density functions of a killer whale’s dive duration ( Y t ) corresponding to dive types 1 and 2.(b) Estimated Normal conditional densities of the three components of a killer whale’s acceleration (cid:16) ˜ A ∗ t, ˜ t ∗ | ˜ A ∗ t, ˜ t ∗ − = µ ∗ ( · ,i ∗ ) A (cid:17) plotted on a linear scale and estimated Gamma probability density of the killer whale’s wiggliness (cid:16) ˜ W ∗ t, ˜ t ∗ (cid:17) plotted on a log-log scale. Density functions correspond to subdive states i ∗ = 1 , , and . Density functions corresponding to acceleration areconditioned on ˜ A ∗ t, ˜ t ∗ − = µ ∗ ( · ,i ∗ ) A because the density function of ˜ A ∗ t, ˜ t ∗ itself depends on ˜ A ∗ t, ˜ t ∗ − . Figure 4: Estimated probability density functions for coarse-scale and ﬁne-scale emissions corresponding to obser-vations of killer whale behaviour. Densities are estimated by ﬁtting the CarHHMM-DFT to the case study data (seeTable 1).corresponding to ˜ A ∗ t, ˜ t ∗ . It also has the highest auto-correlation in ˜ A ∗ t, ˜ t ∗ . This implies less overall activity and moreconsistent acceleration compared to the other subdive states. Subdive state 2 has a mean “wiggliness” (cid:0) ˜ W ∗ t, ˜ t ∗ (cid:1) oneorder of magnitude higher than subdive state 1 and its acceleration has about twice the variance compared to subdive11 PREPRINT - J

ANUARY

12, 2021Figure 5: The x –component of acceleration (cid:0) y ∗ t,t ∗ (cid:1) x (top two panels) and dive depth (bottom two panels) of a northernresident killer whale for a sequence of six selected dives. Each panel is partitioned into dives by vertical black lines.The curve colours in the ﬁrst and third panels correspond to the estimated dive types while the curve colours of thesecond and fourth panels correspond the estimated subdive states. Both the dive types and subdive states are estimatedby ﬁtting the CarHHMM-DFT to the data and performing the forward-backward algorithm to determine the hiddenstate with the highest probability.state 1. The auto-correlation of acceleration is also slightly lower than subdive state 1. We therefore hypothesize thatsubdive state 2 corresponds to ﬂuking (active swimming), as strong sinusoidal behaviour in acceleration is character-istic of this behaviour in marine mammals (Simon et al., 2012). Finally, the mean of ˜ W ∗ t, ˜ t ∗ and variance of ˜ A ∗ t, ˜ t ∗ insubdive state 3 are both much higher than in the other two states, and the auto-correlation of ˜ A ∗ t, ˜ t ∗ is also much lower.This corresponds to vigorous swimming activity, especially as the killer whale begins or ends a dive (see Figure 5).The estimated probability transition matrices and associated stationary distributions on the coarse scale are ˆΓ = (cid:18) .

788 0 . .

809 0 . (cid:19) and ˆ δ = (0 .

792 0 . for the transitions between dives. The estimated probability transition matrices and stationary distributions on the ﬁnescale are ˆΓ ∗ (1) = (cid:32) .

679 0 .

321 0 . .

038 0 .

904 0 . .

000 0 .

232 0 . (cid:33) , ˆΓ ∗ (2) = (cid:32) .

859 0 .

141 0 . .

114 0 .

841 0 . .

000 0 .

216 0 . (cid:33) , ˆ δ ∗ (1) = (0 .

087 0 .

731 0 . , and ˆ δ ∗ (2) = (0 .

401 0 .

496 0 . PREPRINT - J

ANUARY

12, 2021 (a) Histogram of pseudoresiduals of ˜ W ∗ t, ˜ t ∗ (b) Empirical distribution of Y t Figure 6: Pseudoresiduals of wiggliness ( Φ − (cid:0) Pr( ˜ W ∗ t, ˜ t ∗ < ˜ w ∗ t, ˜ t ∗ | Y, ˜ Y ∗ / { ˜ Y ∗ t, ˜ t ∗ } ) (cid:1) , left) plotted over a standard normaldensity as well as a weighted empirical distribution of dive duration ( Y t , right) plotted over the corresponding ﬁttedGamma distributions. Both plots are generated by ﬁtting the CarHHMM-DFT to the killer whale case study data andperforming the forward-backward algorithm.for dive types 1 and 2. In summary, about 79% of dives are short dives of type 1, and the whale performs an averageof 4.72 short type 1 dives before switching to dive type 2 and an average of 1.24 longer type 2 dives before switchingback to dive type 1. This ﬁnding is consistent with those of Tennessen et al. (2019a) and Williams and Noren (2009),both of whom describe common bouts of short resting dives before a killer whale performs a longer, more energy-intensive deep dive. Further, this killer whale is in the less active subdive state 1 40% of the time during a dive of type1 compared to only only 9% of the time during a dive of type 2. Less active swimming behaviour is consistent with theneed for marine mammals to conserve energy when diving to depth and holding their breath for long periods (Williamset al., 1999; Hastie et al., 2006). Figure 5 shows the decoded dive behaviour of six selected dives, and Section 1.3 ofthe supplementary material also shows the probability of each dive type and subdive state given the data and the ﬁttedmodel. We use two visual tools to evaluate the CarHHMM-DFT: pseudoresidual plots and empirical histograms. The pseu-doresidual of a coarse-scale observation y t is equal to Φ − (cid:16) Pr( Y t < y t |{ Y , . . . , Y T , ˜ Y ∗ , . . . , ˜ Y ∗ T } / { Y t } ) (cid:17) and thepseudoresidual of a ﬁne-scale observation ˜ y ∗ t, ˜ t ∗ is Φ − (cid:16) Pr( ˜ Y ∗ t,t ∗ < ˜ y ∗ t, ˜ t ∗ |{ Y , . . . , Y T , ˜ Y ∗ , . . . , ˜ Y ∗ T } / { ˜ Y ∗ t, ˜ t } ) (cid:17) , where Φ is the cumulative distribution function of a standard Normal distribution. If the model is correct, then all pseu-doresiduals are independent and follow a standard Normal distribution. Histograms of the pseudoresiduals mostlysupport that the CarHHMM-DFT is well-speciﬁed. One exception is ˜ W ∗ t, ˜ t ∗ , whose pseudoresiduals are noticeablyright-skewed (see Figure 6a). This implies that the true distribution of ˜ W ∗ t, ˜ t ∗ may follow a heavier-tailed distributionthan the Gamma distribution used in the case study. See Sections 1.4 through 1.6 of the supplementary material forpseudoresidual plots corresponding to all observations and models.We also plot histograms of dive duration corresponding to each dive type in Figure 6b. Each observation of diveduration is weighted by the estimated probability that it corresponds to a particular dive type as decoded by the forward-backward algorithm (Zucchini et al., 2016). This procedure results in two histograms – one corresponding to dive type1 and another corresponding to dive type 2. Each histogram is then plotted together with the corresponding emissiondistribution estimated by the CarHHMM-DFT. Analogous histograms corresponding to the ﬁne-scale observations arecontained in Section 1 of the supplementary material. Our results mostly show that the CarHHMM-DFT explainsthe data well, but there are some exceptions. In histograms corresponding to subdive state 3, ˜ W ∗ t, ˜ t ∗ is right-skewedand ˜ A ∗ t, ˜ t ∗ has heavier tails compared to a normal distribution, indicating the existence of rare events corresponding toexceptionally sudden changes in acceleration of the killer whale. These outliers are potential subjects for future studyand may indicate biologically relevant phenomena such as prey capture (Tennessen et al., 2019b).13 PREPRINT - J

ANUARY

12, 2021

The HHMM-DFT, which ignores auto-correlation in acceleration, decodes dive types and subdive states similarly tothe CarHHMM-DFT, but it is less likely to categorize the behaviour at the beginning and end of dives as subdive state3 (see Figures 4 and 5 of the supplementary material). In addition, for all three components of each of σ ∗ ( · , A , σ ∗ ( · , A ,and σ ∗ ( · , A , the HHMM-DFT produces estimates which are approximately to percent larger than those of theCarHHMM-DFT. The estimated uncertainties of the three components of each of ˆ µ ∗ ( · , A , ˆ µ ∗ ( · , A , and ˆ µ ∗ ( · , A are alsoless than half of those for the CarHHMM-DFT (see Tables 1 and 2 of the supplementary material). This suggeststhat including auto-correlation in the model signiﬁcantly affects parameter estimates. Further, the pseudoresiduals ofthe HHMM-DFT are noticeably light-tailed and do not follow a standard normal distribution (see Figure 15 of thesupplementary material). These ﬁndings suggest that the HHMM-DFT is a signiﬁcantly worse ﬁt to these data thanthe full CarHHMM-DFT.The CarHHMM does not model the “wiggliness” of the acceleration data, so it regularly fails to pick up obviousbehavioural changes corresponding to the periodicity shown in Figure 5 (see Figure 6 of the supplementary mate-rial). These results essentially disqualify the CarHHMM as a viable model for this data set. The pseudoresiduals ofacceleration are also light-tailed relative to a Normal distribution (see Figure 16 of the supplementary material).Finally, the CarHMM-DFT, which lacks a hierarchical structure, produces ﬁne-scale parameter estimates and subdivestate estimates similar to those of the CarHHMM-DFT. However, its lack of hierarchical structure means that it failsto differentiate between short and long dives. This model therefore does not infer the dive-level Markov chain or therelationship between the dive and subdive levels. In particular, the CarHMM-DFT does not indicate that the whale ismore likely to be in subdive state 1 when engaged in longer dives compared to shorter dives.A more complete set of results for each of the candidate models is presented in Section 1 of the supplementary material. We perform a simulation study based on data generated from the full CarHHMM-DFT as deﬁned in Section 3.2 toevaluate each candidate model when the ground-truth is known. The parameters used to generate the data are basedon those estimated in the case study (see Table 1), with slight modiﬁcations made for simplicity. In particular, we setthe number of subdive states to N ∗ = 2 and ˜ A ∗ t, ˜ t ∗ to a scalar instead of a three dimensional vector. We then ﬁt all fourmodels to the simulated data. Metrics used to evaluate each model include decoding accuracy of hidden states, biasin parameter estimates, empirical standard errors of parameter estimates, and ﬁtting times. To assess the accuracy ofuncertainty estimates, we also compare the empirical standard errors of a given model’s parameter estimates with thestandard errors estimated using the inverse of the observed Fisher information. We generate 500 independent training data sets using the CarHHMM-DFT as a generative model. Each training dataset consists of a sequence of 100 curves which we call a sequence of killer whale dives. Each dive can be one of N = 2 dive types based on a Markov chain with probability transition matrix Γ = (cid:18) .

79 0 . .

81 0 . (cid:19) . Dive duration is Gamma distributed and the coarse-scale emission parameters are µ (1) = 25 . s, σ (1) = 9 . s, µ (2) = 104 . s, σ (2) = 64 . s. After generating the dive durations for all 100 dives in a data set, dive t is broken into a sequence of ˜ T ∗ t = (cid:98) Y t / (cid:99) two-second windows, where the last Y t − T ∗ t seconds of each simulated dive are ignored. Each two-second segment isassigned one of N ∗ = 2 behaviours according to a ﬁne-scale Markov chain ˜ X ∗ t ≡ (cid:8) ˜ X ∗ t, , . . . , ˜ X ∗ t, ˜ T ∗ t (cid:9) with probabilitytransition matrices Γ ∗ (1) = (cid:18) .

68 0 . .

05 0 . (cid:19) and Γ ∗ (2) = (cid:18) .

86 0 . .

15 0 . (cid:19) for dive types 1 and 2, respectively. Instead of generating the raw observations Y ∗ t,t ∗ , we directly simulate the ﬁne-scale transformed observations ˜ Y ∗ t, ˜ t ∗ = (cid:8) ˜ A ∗ t, ˜ t ∗ , ˜ W ∗ t, ˜ t ∗ (cid:9) . Recall from Section 3.2 that we must specify the mean,standard deviation, and auto-correlation parameters corresponding to (cid:8) ˜ A ∗ t, , . . . , ˜ A ∗ t, ˜ T ∗ t (cid:9) as well as the mean and14 PREPRINT - J

ANUARY

12, 2021Table 2: Average decoding accuracies and training times for all models used to categorize dive type and subdive statein the simulation study. Each of the four models was ﬁt to 500 training data sets comprised of 100 simulated divesand tested on test data sets also comprised of 100 simulated dives. Reported values are averages, and ± refers to thesample standard deviation across the 500 data sets. Rows labelled Both/Both correspond to overall average decodingaccuracy. Model Train Time (min) Dive Type Subdive State Dive Accuracy Subdive AccuracyCarHMM-DFT ± Both Both ————- . ± . . ± . . ± . . ± . . ± . HHMM-DFT ± Both Both . ± .

04 0 . ± . . ± .

03 0 . ± . . ± . . ± .

10 0 . ± . . ± . CarHHMM ± Both Both . ± .

17 0 . ± . . ± .

21 0 . ± . . ± . . ± .

14 0 . ± . . ± . CarHHMM-DFT ± Both Both . ± .

04 0 . ± . . ± .

03 0 . ± . . ± . . ± .

09 0 . ± . . ± . standard deviation parameters corresponding to (cid:8) ˜ W ∗ t, , . . . , ˜ W ∗ t, ˜ T ∗ t (cid:9) . We select the following parameters in line withthe results from the case study: µ ∗ ( · , A = 0 . s, σ ∗ ( · , A = 0 . s, φ ∗ ( · , A = 0 . ,µ ∗ ( · , A = 0 . s, σ ∗ ( · , A = 0 . s, φ ∗ ( · , A = 0 . ,µ ∗ ( · , W = 23 . , σ ∗ ( · , W = 13 . ,µ ∗ ( · , W = 301 . , σ ∗ ( · , W = 330 . . It is not possible to uniquely reconstruct the raw accelerometer data Y ∗ from ˜ Y ∗ alone, but we describe one possiblemapping from ˜ Y ∗ to Y ∗ in the appendix. Figure 21 of the supplementary material shows one realization of ˜ Y ∗ for ﬁvedives of one simulated data set along with the corresponding reconstructed realization of Y ∗ .The two simulated dive types differ in that dives of type 1 are much shorter on average (26 seconds) than dives of type2 (105 seconds). The two simulated subdive states differ primarily due to µ ∗ W and σ ∗ W since both are much higher forsubdive state 2 than for subdive state 1. These larger parameter values correspond to much more vigorous and variableperiodic behaviour in the acceleration data.We calculate maximum likelihood estimates { ˆ θ, ˆΓ , ˆ θ ∗ , ˆΓ ∗ } for all four candidate models for each of the 500 data setsusing the Cedar Compute Canada cluster with 1 CPU and 4 GB of dedicated memory per data set. For each of the500 training data sets, we simulate a test data set to assess how well each model predicts the hidden states, as follows.Each test data set consists of a sequence of 100 dives and is created from the generative model with true parameters { θ, Γ , θ ∗ , Γ ∗ } . To assess the coarse-scale hidden state prediction, we estimate p t ( i | y, ˜ y ∗ ) ≡ Pr( X t = i | Y = y, ˜ Y ∗ =˜ y ∗ ) , i = 1 , , t = 1 , . . . , using the test-set observations ( y, ˜ y ∗ ) and training-set maximum likelihood estimates.These estimates are found using the forward-backward algorithm (Zucchini et al., 2016). We compare these estimatedconditional probabilities to { x , . . . , x } , the true coarse-scale state realizations in the test data, by calculating the average dive decoding accuracy for a single training/test data set pair, (cid:80) t =1 ˆ p t ( x t | y, ˜ y ∗ ) / . We then report theaverage of these over the 500 training/test data set pairs. Analogously, to assess prediction of the ﬁne-scale states, weestimate p ∗ t, ˜ t ∗ ( i ∗ | y, ˜ y ∗ ) ≡ Pr( ˜ X ∗ t, ˜ t ∗ = i | Y = y, ˜ Y ∗ = ˜ y ∗ ) , i ∗ = 1 , , ˜ t ∗ = 1 , . . . , ˜ T ∗ t , t = 1 , . . . , , using thetest-set observations, the training-set maximum likelihood estimates, and the forward-backward algorithm. Denoting15 PREPRINT - J

ANUARY

12, 2021Figure 7: Estimated probabilities that each dive is of type 2 for ﬁve selected dives of a simulated data set of killerwhale dive behaviour. Each panel is partitioned into dives by vertical black lines. The colour of the curve correspondsto the true dive type while the colour of the background corresponds to the true subdive state. The CarHMM-DFT isomitted because it assumes that there is only one dive type.the true ﬁne-scale state realizations from the test data set and dive t as { ˜ x ∗ t, , . . . , ˜ x ∗ t, ˜ T ∗ t } , we deﬁne the overall averagesubdive decoding accuracy as the average value of ˆ p ∗ t, ˜ t ∗ (˜ x ∗ t, ˜ t ∗ | y, ˜ y ∗ ) across all simulated test data sets, dives, andwindows. The conditional probabilities are estimated according to one of the four models under study, using themaximum likelihood estimates from the training data set in conjunction with the forward-backward algorithm. The full CarHHMM-DFT is the best performing model of the four candidates since it is the generating model. Itsaverage dive decoding accuracy and average subdive decoding accuracy are both greater than 0.9. All parameterestimates of ﬁne-scale mean values and probability transition matrices, ˆ µ ∗ , ˆΓ ∗ and ˆΓ , respectively, have statisticallyinsigniﬁcant biases. In addition, the biases of ˆ σ , ˆ φ , and ˆ µ are either as small or smaller than the other three candidatemodels. The empirical standard errors of all parameter estimates ( ˆ θ , ˆΓ , ˆ θ ∗ , ˆΓ ∗ ) are well-approximated by the inverseof the observed Fisher information matrix, although the estimated standard errors tend to be slightly smaller than theempirical standard errors. This underestimation is especially noticeable for parameters associated with the wiggliness ˜ W ∗ t, ˜ t ∗ , where the empirical standard error can be up to twice as large as the estimated standard error. See Tables 5through 9 of the supplementary material for detailed results.The HHMM-DFT performs similarly to the CarHHMM-DFT in most respects. Its average dive decoding accuracyis comparable to the CarHHMM-DFT while its average subdive decoding accuracy is worse by approximately 5percentage points. Its parameter estimates are comparable to the CarHHMM-DFT with the notable exception thatit greatly overestimates σ ∗ ( · , A and σ ∗ ( · , A . In addition, the estimated standard errors of ˆ µ ∗ ( · , A , ˆ µ ∗ ( · , A , ˆ σ ∗ ( · , A , and16 PREPRINT - J

ANUARY

12, 2021Figure 8: Estimated probabilities that each window corresponds to subdive state 2 for ﬁve selected dives of a simulateddata set of killer whale dive behaviour. Each panel is partitioned into dives by vertical black lines. The colour of thecurve corresponds to the true subdive state while the colour of the background corresponds to the true dive type. ˆ σ ∗ ( · , A are much smaller than the associated empirical standard errors (see Table 7 of the supplementary material).These results suggest that the estimates of standard deviation can be too large and estimates of standard errors canbe too small when auto-correlation is ignored. This ﬁnding is consistent with the results of the case study, where theHHMM-DFT produced larger estimates of σ ∗ A and smaller estimates of standard error compared to the CarHHMM-DFT. When standard errors are underestimated, the associated conﬁdence intervals are too narrow, implying thatresearchers may be overconﬁdent in their parameter estimates.The CarHHMM is the worst-performing model in terms of accuracy, as its average dive decoding accuracy is below . and its average subdive decoding accuracy is below . . This result is consistent with expectations because theCarHHMM does not model the “wiggliness” of the ﬁne-scale process, which is the most distinct difference betweenthe subdive states. In addition to its poor average decoding accuracy, the CarHHMM is also the worst of the fourcandidate models at estimating parameters. Parameter estimates associated with subdive type 2 ( θ ∗ ( · , ) are especiallypoor. See Section 2 of the supplementary material for more detailed results.17 PREPRINT - J

ANUARY

12, 2021Finally, the CarHMM-DFT is nearly identical to the CarHHMM-DFT in terms of average subdive decoding accuracy,ﬁne-scale parameter biases, and both estimated and empirical standard error for the ﬁne-scale parameter estimates.In addition, the time required to ﬁt the CarHMM-DFT is less than half of that of the other models (see Table 2).However, this model cannot estimate dive type as it lacks any hierarchical structure. The CarHMM-DFT nonethelessﬁts a (misspeciﬁed) single Gamma distribution over the dive duration of all dives. The resulting parameter estimates( ˆ µ and ˆ σ ) are highly correlated (See Figure 25 of the supplementary material).Figures 7 and 8 display ﬁve dives of one simulated data set as well as the decoded dive types and subdive statesassociated with each model. The CarHHMM-DFT and CarHMM-DFT produce similar estimates of subdive state whilethe HHMM-DFT is slightly more likely to predict that a given window corresponds to subdive state 2. As expected,the CarHHMM is the least accurate model when predicting subdive state. The CarHHMM-DFT and HHMM-DFTyield similar estimates of dive type while the CarHHMM is less accurate and even misclassiﬁes the dive type of thesecond dive. The CarHMM-DFT does not estimate dive type. Current Functional Data Analysis literature addresses dependence between curves either with multilevel models (Chenand M¨uller, 2012; Di et al., 2009), which lack a time component, or with functional time series, which overlook thepossibility that curves have several distinct “types” (Kokoszka and Reimherr, 2018). Our work addresses these issuesand introduces a ﬂexible framework to model functional time-series data using HMMs. We suggest handling temporaldependence between curves by using either an HMM or a CarHMM to model the curve sequence. We then suggestviewing each individual curve as an HMM emission whose distribution is described by a ﬁne-scale model. Here weuse a CarHMM as the ﬁne-scale model, but there are a wide range of possible ﬁne-scale models, including a Poissonprocess or continuous time approach similar to that of Michelot and Blackwell (2019). We also incorporate a moving-window transformation at the ﬁne scale to capture intricate dependence structures on short time scales. Together, thecoarse- and ﬁne-scale models make up a hierarchical structure which can account for simultaneous processes takingplace at different time scales. Provided the construction is not overly complex, a hierarchical model created using ourmethod can be both ﬂexible and easy to ﬁt using maximum likelihood estimation.We demonstrate the usefulness of this framework using a biomechanical/ecological example, where we use HMMsto classify the coarse- and ﬁne-scale diving behaviour of a northern resident killer whale in Queen Charlotte Soundoff the coast of British Columbia, Canada. Our analysis gives a deeper understanding of a killer whale’s tri-axialmovement and thus its behaviour and energy expenditure (Gleiss et al., 2011; Qasem et al., 2012), both of whichare important for understanding the foraging ecology and nutritional status of northern resident killer whales (Noren,2011). Our model is also applicable to many diving animals such as sharks (Adam et al., 2019), seals (Jeanniard duDot et al., 2016), and porpoises (Leos-Barajas et al., 2017). In addition, since complicated state-switching processeswith temporal dependence are common in settings ranging from speech recognition (Juang and Rabiner, 1991) andneuroscience (Langrock et al., 2013) to oceanography (Bulla et al., 2012) and ecology (Adam et al., 2019), we believethat researchers can adapt our methodology for the analysis of a wide range of time series data in a variety of ﬁelds.

PREPRINT - J

ANUARY

12, 2021

References

Adam, T., Grifﬁths, C., Leos Barajas, V., Meese, E., Lowe, C., Blackwell, P., Righton, D., and Langrock, R. (2019).Joint modelling of multi-scale animal movement data using hierarchical hidden Markov models.

Methods in Ecologyand Evolution , 10(9):1536–1550.Alex Shorter, K., Shao, Y., Ojeda, L., Barton, K., Rocho-Levine, J., van der Hoop, J., and Moore, M. (2017). A day inthe life of a dolphin: Using bio-logging tags for improved animal health and well-being.

Marine Mammal Science ,33(3):785–802.Bebbington, M. S. (2007). Identifying volcanic regimes using hidden Markov models.

Geophysical Journal Interna-tional , 171(2):921–942.Borchers, D. L., Zucchini, W., Heide-Jørgensen, M. P., Ca˜nadas, A., and Langrock, R. (2013). Using hidden Markovmodels to deal with availability bias on line transect surveys.

Biometrics , 69(3):703–713.Brumback, B. A. and Rice, J. A. (1998). Smoothing spline models for the analysis of nested and crossed samples ofcurves.

Journal of the American Statistical Association , 93(443):961–976.Bulla, J., Lagona, F., Maruotti, A., and Picone, M. (2012). A multivariate hidden Markov model for the identiﬁcation ofsea regimes from incomplete skewed and circular time series.

Journal of Agricultural, Biological and EnvironmentalStatistics , 17:544–567.B¨orger, L., Bijleveld, A., Fayet, A., Machovsky-Capuska, G., Patrick, S., Street, G., and Wal, E. (2020). Biologgingspecial feature.

Journal of Animal Ecology , 89(1):6–15.Cade, D. E., Barr, K. R., Calambokidis, J., Friedlaender, A. S., and Goldbogen, J. A. (2018). Determining forwardspeed from accelerometer jiggle in aquatic environments.

Journal of Experimental Biology , 221(2):jeb170449.Chen, K. and M¨uller, H.-G. (2012). Modeling repeated functional observations.

Journal of the American StatisticalAssociation , 107(500):1599–1609.Crainiceanu, C. M., Staicu, A.-M., and Di, C.-Z. (2009). Generalized multilevel functional regression.

Journal of theAmerican Statistical Association , 104(488):1550–1561.DeRuiter, S. L., Langrock, R., Skirbutas, T., Goldbogen, J. A., Calambokidis, J., Friedlaender, A. S., and Southall,B. L. (2017). A multivariate mixed hidden Markov model for blue whale behaviour and responses to sound exposure.

The Annals of Applied Statistics , 11(1):362–392.Di, C.-Z., Crainiceanu, C. M., Caffo, B. S., and Punjabi, N. M. (2009). Multilevel functional principal componentanalysis.

The Annals of Applied Statistics , 3(1):458–488.Fehlmann, G., O’Riain, J., Hopkins, P. W., O’Sullivan, J., Holton, M. D., Shepard, E. L. C., and King, A. J. (2017).Identiﬁcation of behaviours from accelerometer data in a wild social primate.

Animal Biotelemetry , 5:6.Fischer, W. and Meier-Hellstern, K. S. (1993). The Markov-modulated Poisson process (MMPP) cookbook.

Perform.Evaluation , 18(2):149–171.Ford, J. K. B. and Ellis, G. M. (2006). Selective foraging by ﬁsh-eating killer whales

Orcinus orca in British Columbia.

Marine Ecology-Progress Series , 316:185–199.Ford, J. K. B., Ellis, G. M., Olesiuk, P. F., and Balcomb, K. C. (2009). Linking killer whale survival and preyabundance: Food limitation in the oceans’ apex predator?

Biology Letters , 6(1):139–42.Fu, E. and Heckman, N. (2019). Model-based curve registration via stochastic approximation EM algorithm.

Compu-tational Statistics and Data Analysis , 131:159–175.Getman, A., Cooper, C. D., Key, G., Zhou, H., and Frankle, N. (2009). Detection of mobile machine damage us-ing accelerometer data and prognostic health monitoring techniques. In , pages 101–104.Gleiss, A. C., Wilson, R. P., and Shepard, E. L. C. (2011). Making overall dynamic body acceleration work: On thetheory of acceleration as a proxy for energy expenditure.

Methods in Ecology and Evolution , 2(1):23–33.Green, J. A., Halsey, L. G., Wilson, R. P., and Frappell, P. B. (2009). Estimating energy expenditure of animalsusing the accelerometry technique: Activity, inactivity and comparison with the heart-rate technique.

Journal ofExperimental Biology , 212(5):745–746.Hastie, G. D., Rosen, D. A. S., and Trites, A. W. (2006). The inﬂuence of depth on a breath-hold diver: Predicting thediving metabolism of Steller sea lions (

Eumetopias jubatus ). Journal of Experimental Marine Biology and Ecology ,3:163–170.Heerah, K., Woillez, M., Fablet, R., Garren, F., Martin, S., and De Pontual, H. (2017). Coupling spectral analysis andhidden Markov models for the segmentation of behavioural patterns.

Movement Ecology , 5:20.19

PREPRINT - J

ANUARY

12, 2021Hooten, M. B., King, R., and Langrock, R. (2017). Guest editor’s introduction to the special issue on “animal move-ment modeling”.

Journal of Agricultural, Biological and Environmental Statistics , 22(3):224–231.Isojunno, S., Sadykova, D., DeRuiter, S., Cur´e, C., Visser, F., Thomas, L., Miller, P. J. O., and Harris, C. M. (2017).Individual, ecological, and anthropogenic inﬂuences on activity budgets of long-ﬁnned pilot whales.

Ecosphere ,8(12):e02044.Jeanniard-du Dot, T., Guinet, C., Arnould, J. P., Speakman, J. R., and Trites, A. W. (2016). Accelerometers canmeasure total and activity-speciﬁc energy expenditures in free-ranging marine mammals only if linked to time-activity budgets.

Functional Ecology , 31(2):377–386.Jeanniard du Dot, T., Trites, A. W., Arnould, J. P. Y., Speakman, J. R., and Guinet, C. (2016). Activity-speciﬁcmetabolic rates for diving, transiting, and resting at sea can be estimated from time-activity budgets in free-rangingmarine mammals.

Ecology and Evolution , 7(9):2969–2976.Juang, B. H. and Rabiner, L. R. (1991). Hidden Markov models for speech recognition.

Technometrics , 33(3):251–272.Kokoszka, P. and Reimherr, M. (2018).

Introduction to Functional Data Analysis , chapter 8. Chapman and Hall/CRC.Langrock, R., Adam, T., Leos-Barajas, V., Mews, S., Miller, D. L., and Papastamatiou, Y. P. (2018). Spline-basednonparametric inference in general state-switching models.

Statistica Neerlandica , 72(3):179–200.Langrock, R., Swihart, B. J., Caffo, B. S., Punjabi, N. M., and Crainiceanu, C. M. (2013). Combining hiddenMarkov models for comparing the dynamics of multiple sleep electroencephalograms.

Statistics in Medicine ,32(19):3342–3356.Lawler, E., Whoriskey, K., Aeberhard, W. H., Field, C., and Mills Flemming, J. (2019). The conditionally autoregres-sive hidden Markov model (CarHMM): Inferring behavioural states from animal tracking data exhibiting conditionalautocorrelation.

Journal of Agricultural, Biological and Environmental Statistics , 24(4):651–668.Leos-Barajas, V., Gangloff, E. J., Adam, T., Langrock, R., van Beest, F. M., Nabe-Nielsen, J., and Morales, J. M.(2017). Multi-scale modeling of animal movement and general behavior data using hidden Markov models withhierarchical structures.

Journal of Agricultural, Biological and Environmental Statistics , 22(3):232–248.Liu, Y.-Y., Li, S., Li, F., Song, L., and Rehg, J. M. (2015). Efﬁcient learning of continuous-time hidden Markovmodels for disease progression.

Advances in Neural Information Processing Systems , 28:3599–3607.Lucero, P., S´anchez, R., Macancela, J., Cabrera, D., Cerrada, M., Li, C., and Alonso, H. R. (2019). Accelerometerplacement comparison for crack detection in railway axles using vibration signals and machine learning. In , pages 291–296.McClintock, B. T., Langrock, R., Gimenez, O., Cam, E., Borchers, D. L., Glennie, R., and Patterson, T. A. (2020).Uncovering ecological state dynamics with hidden Markov models.

Ecology Letters , 23(12):1878–1903.Michelot, T. and Blackwell, P. G. (2019). State-switching continuous-time correlated random walks.

Methods inEcology and Evolution , 10(5):637–649.Morris, J. S., Arroyo, C., Coull, B. A., Ryan, L. M., Herrick, R., and Gortmaker, S. L. (2006). Using wavelet-basedfunctional mixed models to characterize population heterogeneity in accelerometer proﬁles.

Journal of the AmericanStatistical Association , 101(476):1352–1364.Noren, D. P. (2011). Estimated ﬁeld metabolic rates and prey requirements of resident killer whales.

Marine MammalScience , 27(1):60 – 77.Nunes, A. (2019). Divebomb version 1.0.7.

Ocean Tracking Network , gitlab.oceantrack.org/anunes/divebomb .Patterson, T. A., Parton, A., Langrock, R., Blackwell, P. G., Thomas, L., and King, R. (2017). Statistical modellingof individual animal movement: an overview of key methods and a discussion of practical challenges. Advances inStatistical Analysis , 101:399–438.Pohle, J., Langrock, R., van Beest, M., and Schmidt, N. M. (2017). Selecting the number of states in hidden Markovmodels: Pragmatic solutions illustrated using animal movement.

Journal of Agricultural, Biological and Environ-mental Statistics , 22:1–24.Qasem, L., Cardew, A., Wilson, A., Grifﬁths, I., Halsey, L. G., Shepard, E. L. C., Gleiss, A. C., and Wilson, R.(2012). Tri-axial dynamic acceleration as a proxy for animal energy expenditure; should we be summing values orcalculating the vector?

PloS one , 7(2):e31187.Ramsay, J. O. and Silverman, B. W. (2005).

Functional Data Analysis . Springer.Rice, J. A. and Wu, C. O. (2001). Nonparametric mixed effects models for unequally sampled noisy curves.

Biometrics ,57(1):253–9. 20

PREPRINT - J

ANUARY

12, 2021Simon, M., Johnson, M., and Madsen, P. T. (2012). Keeping momentum with a mouthful of water: Behavior andkinematics of humpback whale lunge feeding.

Journal of Experimental Biology , 215(21):3786–3798.Sutherland, W. J. (1998). The importance of behavioural studies in conservation biology.

Animal Behaviour ,56(4):801–809.Tennessen, J., Holt, M. M., Ward, E. J., Hanson, M. B., Emmons, C. K., Giles, D. A., and Hogan, J. T. (2019a). HiddenMarkov models reveal temporal patterns and sex differences in killer whale behavior.

Scientiﬁc Reports , 9:14951.Tennessen, J. B., Holt, M. M., Hanson, M. B., Emmons, C. K., Giles, D. A., and Hogan, J. T. (2019b). Kinematicsignatures of prey capture from archival tags reveal sex differences in killer whale foraging activity.

Journal ofExperimental Biology , 222(3).Whoriskey, K., Auger-M´eth´e, M., Albertsen, C. M., Whoriskey, F. G., Binder, T. R., Krueger, C. C., and Mills Flem-ming, J. (2016). A hidden Markov movement model for rapidly identifying behavioral states from animal tracks.

Ecology and Evolution , 7(7):2112–2121.Williams, R. and Noren, D. P. (2009). Swimming speed, respiration rate, and estimated cost of transport in adult killerwhales.

Marine Mammal Science , 25(2):327–350.Williams, T. M., Haun, J. E., and Friedl, W. A. (1999). The diving physiology of bottlenose dolphins (

Tursiopstruncatus ): I. balancing the demands of exercise for energy conservation at depth.

Journal of Experimental Biology ,202(20):2739–2748.Wilson, R. P., B¨orger, L., Holton, M. D., Scantlebury, D. M., G´omez-Laich, A., Quintana, F., Rosell, F., Graf, P. M.,Williams, H., Gunner, R., Hopkins, L., Marks, N., Geraldi, N. R., Duarte, C. M., Scott, R., Strano, M. S., Robotka,H., Eizaguirre, C., Fahlman, A., and Shepard, E. L. C. (2019). Estimates for energy expenditure in free-livinganimals using acceleration proxies: A reappraisal.

Journal of Animal Ecology , 89(1):161–172.Wright, B. M., Ford, J. K. B., Ellis, G. M., Deecke, V. B., Shapiro, A. D., Battaile, B. C., and Trites, A. W. (2017).Fine-scale foraging movements by ﬁsh-eating killer whales (

Orcinus orca ) relate to the vertical distributions andescape responses of salmonid prey (

Oncorhynchus spp.).

Movement Ecology , 5:3.Xin, G., Hamzaoui, N., and Antoni, J. (2018). Semi-automated diagnosis of bearing faults based on a hidden Markovmodel of the vibration signals.

Measurement , 127:141–166.Xu, Z., Laber, E. B., and Staicu, A.-M. (2020).

Hierarchical Continuous Time Hidden Markov Model, with Applicationin Zero-Inﬂated Accelerometer Data , pages 125–142. Springer International Publishing, Cham.Yao, F., M¨uller, H.-G., and Wang, J.-L. (2005). Functional data analysis for sparse longitudinal data.

Journal of theAmerican Statistical Association , 100(470):577–590.Zucchini, W., Macdonald, I. L., and Langrock, R. (2016).

Hidden Markov Models for Time Series - An IntroductionUsing R . CRC Press. 21

PREPRINT - J

ANUARY

12, 2021

A Detailed description of data simulation from Section 4.1

We easily simulate realizations of the coarse-scale HMM ( X and Y ) given the parameters Γ and θ . For each dive t , wealso easily generate the ﬁne-scale hidden Markov chain ˜ X ∗ t ≡ (cid:110) ˜ X ∗ t, , . . . , ˜ X ∗ t, ˜ T ∗ t (cid:111) according to one of the probabilitytransition matrices Γ ∗ (1) or Γ ∗ (2) , depending upon the value of X t . This determines the sequence of ﬁne-scale hiddenstates corresponding to each window. Recall that the ﬁne-scale model is based on a sequence of ˜ T ∗ t two-secondwindows, each containing 100 observations, and our model is formulated in terms of quantities derived from theraw data within each window (namely, the average acceleration, ˜ A ∗ t, ˜ t ∗ , and wiggliness, ˜ W ∗ t, ˜ t ∗ ). Generating the rawacceleration data from ˜ A ∗ t, ˜ t ∗ and ˜ W ∗ t, ˜ t ∗ is not straightforward. In our simulation study, we generate raw accelerationdata so that we can visualize our results in terms of the underlying data curves for each dive. Here, we explain howwe generate the acceleration curves so that ˜ A ∗ t, ˜ t ∗ and ˜ W ∗ t, ˜ t ∗ both follow the speciﬁed model. A key component is thediscrete Fourier transformation of the 100 raw acceleration values in window ˜ t ∗ of dive t : ˆ Y ∗ ( k ) t, ˜ t ∗ ≡ DF T (cid:110) Y ∗ t, t ∗ − , . . . , Y ∗ t, t ∗ (cid:111) ( k ) for k ≥ , as deﬁned in Equation (3).We simulate the raw acceleration data for dive t in three steps: (1) simulate the average acceleration within eachwindow (cid:16) ˆ Y ∗ (0) t, ˜ t ∗ (cid:17) , (2) simulate all other Fourier coefﬁcients within each window (cid:16) ˆ Y ∗ ( k ) t, ˜ t ∗ , k = 1 , . . . , (cid:17) , and (3) takethe inverse discrete Fourier transform of ˆ Y ∗ t, ˜ t ∗ : { Y ∗ t, t ∗ − , . . . , Y ∗ t, t ∗ } ≡ IDF T (cid:110) ˆ Y ∗ (0) t, ˜ t ∗ , . . . , ˆ Y ∗ (99) t, ˜ t ∗ (cid:111) for ˜ t ∗ = 1 , . . . , ˜ T ∗ t . The details of steps (1) and (2) are described below.For step (1), we generate ˆ Y ∗ (0) t, , . . . , ˆ Y ∗ (0) t, ˜ T ∗ t as a CarHMM with underlying Markov state sequence ˜ X ∗ t, , . . . , ˜ X ∗ t, ˜ T ∗ t andrandom ﬁrst emission. Speciﬁcally, we let ˆ Y ∗ (0) t, | ˜ X ∗ t, = i ∗ ∼ N (cid:18) , (cid:16) σ ∗ ( · ,i ∗ ) A (cid:17) (cid:19) andˆ Y ∗ (0) t, ˜ t ∗ | ˜ X ∗ t, ˜ t ∗ = i ∗ , ˆ Y ∗ (0) t, ˜ t ∗ − ∼ N (cid:18) φ ∗ ( · ,i ∗ ) A ˆ Y ∗ (0) t, ˜ t ∗ − , (cid:16) σ ∗ ( · ,i ∗ ) A (cid:17) (cid:19) , (5) ˜ t ∗ = 2 , . . . , ˜ T ∗ t where σ ∗ ( · , A = 0 . s , σ ∗ ( · , A = 0 . s . φ ∗ ( · , A = 0 . and φ ∗ ( · , A = 0 . .For step (2), we ﬁrst construct ˆ Y ∗ ( k ) t, ˜ t ∗ , k = 1 , . . . , , as ˆ Y ∗ ( k ) t, ˜ t ∗ = a ( k ) t, ˜ t ∗ i (cid:113) b ( k ) t, ˜ t ∗ (6)where the a ( k ) t, ˜ t ∗ ’s are independent and equal to either 1 or -1 each with probability 1/2 and i is the imaginary number.We include i in Equation (6) to force all variation within a window to take the form of a sine wave, which reducesthe variation between the endpoints of windows compared to a cosine wave. Given the ﬁne scale states, the b ( k ) t, ˜ t ∗ ’s areindependent and independent of the a ( k ) t, ˜ t ∗ ’s. The distribution of b ( k ) t, ˜ t ∗ is b ( k ) t, ˜ t ∗ | ˜ X ∗ t, ˜ t ∗ = 1 ∼ Gamma(16 . /k , . b ( k ) t, ˜ t ∗ | ˜ X ∗ t, ˜ t ∗ = 2 ∼ Gamma(4 . /k , . . (7)The ﬁrst argument of Gamma ( · , · ) is the shape parameter and the second is the scale parameter. The squared mag-nitude of the k th Fourier coefﬁcient is equal to b ( k ) t, ˜ t ∗ , which decays like /k to “smooth out” the raw accelerationdata.We then deﬁne the remaining 50 Fourier coefﬁcients: ˆ Y ∗ (50) t, ˜ t ∗ = 0 and ˆ Y ∗ ( k ) t, ˜ t ∗ = − ˆ Y ∗ (100 − k ) t, ˜ t ∗ , k = 51 , . . . , . PREPRINT - J

ANUARY

12, 2021This guarantees that the inverse discrete Fourier transform is real-valued.We now show that this construction of the raw acceleration data results in the distributions listed in Section 4.1. It suf-ﬁces to show that the construction of the discrete Fourier transformations, the ˆ Y ( k ) t, ˜ t ∗ ’s, yields the desired distributions.First, since ˆ Y (0) t, ˜ t ∗ = (cid:80) n =1 Y ∗ t, t ∗ − n = 100 ˜ A ∗ t, ˜ t ∗ , Equation (5) implies that ˜ A ∗ t, , . . . , ˜ A ∗ t, ˜ T ∗ follows a CarHMMwith Normal emissions distributions and parameters µ ∗ ( · , A = µ ∗ ( · , A = µ ∗ ( · ,i ∗ ) A = 0 , σ ∗ ( · , A = 0 . s , φ ∗ ( · , A = 0 . , σ ∗ ( · , A = 0 . s , and φ ∗ ( · , A = 0 . .From Equations (4) and (6), the wiggliness within window ˜ t ∗ of dive t is ˜ W ∗ t, ˜ t ∗ = ˜ ω (cid:88) k =1 (cid:12)(cid:12)(cid:12)(cid:12) ˆ Y ( k ) t, ˜ t ∗ (cid:12)(cid:12)(cid:12)(cid:12) = ˜ ω (cid:88) k =1 b ( k ) t, ˜ t ∗ . If ˜ ω < , then ˜ W ∗ t, ˜ t ∗ is the sum of independent Gamma-distributed random variables with identical scale parameters,so the distribution of ˜ W ∗ t, ˜ t ∗ is also Gamma. Thus, by Equation (7) ˜ W ∗ t, ˜ t ∗ | ˜ X ∗ t, ˜ t ∗ = 1 ∼ Gamma (cid:32) ˜ ω (cid:88) k =1 . /k , . (cid:33) and ˜ W ∗ t, ˜ t ∗ | ˜ X ∗ t, ˜ t ∗ = 2 ∼ Gamma (cid:32) ˜ ω (cid:88) k =1 . /k , . (cid:33) . Setting ˜ ω to 10 and carrying out simple calculation of the mean and variance of a Gamma distribution yields µ ∗ ( · , W =23 . , σ ∗ ( · , W = 13 . , µ ∗ ( · , W = 301 . , and σ ∗ ( · , W = 330 . . B Likelihood of CarHHMM-DFT

The overall likelihood of the CarHHMM-DFT model is as follows: L CarHHMM-DFT ( θ, θ ∗ , Γ , Γ ∗ ; y, ˜ y ∗ ) = δP ( y , ˜ y ∗ ; θ, θ ∗ , Γ ∗ ) T (cid:89) t =2 Γ P ( y t , ˜ y ∗ t ; θ, θ ∗ , Γ ∗ ) N . In particular, P ( y t , ˜ y ∗ t ; θ, θ ∗ , Γ ∗ ) = diag (cid:104) f (1) ( y t ; θ (1) ) L ﬁne (cid:16) θ ∗ , Γ ∗ (1) ; ˜ y ∗ t (cid:17) , . . . ,f ( N ) ( y t ; θ ( N ) ) L ﬁne (cid:16) θ ∗ , Γ ∗ ( N ) ; ˜ y ∗ t (cid:17) (cid:105) , where f ( i ) ( y t ; θ ( i ) ) is the emission distribution of dive duration given that X t = i . The likelihood L ﬁne correspondsto the ﬁne-scale model and is equal to the following: L ﬁne (cid:16) θ ∗ , Γ ∗ ( i ) ; ˜ y ∗ t (cid:17) = δ ∗ ( i ) ˜ T ∗ t (cid:89) ˜ t ∗ =2 Γ ∗ ( i ) P (˜ y ∗ t, ˜ t ∗ | ˜ y ∗ t, ˜ t ∗ − ; θ ∗ ) N ∗ , where P (˜ y ∗ t, ˜ t ∗ | ˜ y ∗ t, ˜ t ∗ − ; θ ∗ ) is an N ∗ × N ∗ diagonal matrix with ( i ∗ , i ∗ ) th entry equal to f ∗ ( · ,i ∗ ) (˜ y ∗ t, ˜ t ∗ | ˜ y ∗ t, ˜ t ∗ − ; θ ∗ ( · ,i ∗ ) ) .Recall that f ∗ ( · ,i ∗ ) ( ·| ˜ y ∗ t,t ∗ − ; θ ∗ ( · ,i ∗ ) ) is the probability density function of ˜ Y ∗ t, ˜ t ∗ when ˜ X ∗ t, ˜ t ∗ = i ∗ and ˜ Y ∗ t, ˜ t ∗ − =˜ y ∗ t, ˜ t ∗ − . 23 odelling state-switching functional and biologging data withhidden Markov models: supplementary material Evan Sidrow, Nancy Heckman, Sarah Fortune, Andrew W. Trites, Ian Murphy,and Marie Auger-M´eth´e

Figure 1: Collection of 5 lag plots where a given observation is on the x -axis and the subsequent observationis on the y -axis. Each plot corresponds to a feature in the killer whale dive data. Lag plots are used todecide whether to include auto-correlation in the HMMs used to model the killer whale’s kinematic data.They can also be used to determine the number of hidden states if multiple distinct patterns are visible.1 .2 Parameter Estimates Feature Dive / Subdive Type Parameter Estimate (Standard Error)ˆ µ ˆ σ ˆ φ Dive Duration ( s ) - Y t .

69 (0 .

60) 9 .

57 (0 .

51) —2 104 . .

4) 64 . .

5) — x -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) x .

020 (0 . .

034 (0 . .

976 (0 . .

244 (0 . .

079 (0 . .

886 (0 . .

218 (0 . .

265 (0 . .

626 (0 . y -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) y .

469 (0 . .

044 (0 . .

976 (0 . .

436 (0 . .

082 (0 . .

886 (0 . .

384 (0 . .

321 (0 . .

626 (0 . z -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) z − .

683 (0 . .

052 (0 . .

976 (0 . − .

593 (0 . .

096 (0 . .

886 (0 . − .

366 (0 . .

317 (0 . .

626 (0 . W ∗ t, ˜ t ∗ .

34 (0 .

28) 12 .

95 (0 .

27) —2 301 . .

2) 330 . .

2) —3 10200 (210) 15300 (350) —

Table 1: Estimates and predicted standard errors of emission parameters for dive duration ( Y t ), acceleration( ˜ A ∗ t, ˜ t ∗ ), and wiggliness ( ˜ W ∗ t, ˜ t ∗ ) of the killer whale case study data under the full CarHHMM-DFT. Theparentheses refer to the estimated standard error of the parameter estimates derived from the inverse of theobserved information matrix. Feature Dive / Subdive Type Parameter Estimate (Standard Error)ˆ µ ˆ σ ˆ φ Dive Duration ( s ) - Y t .

34 (0 .

60) 8 .

82 (0 .

51) —2 92 . .

7) 63 . .

5) — x -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) x − .

023 (0 . .

053 (0 . .

074 (0 . .

149 (0 . .

214 (0 . .

422 (0 . y -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) y .

342 (0 . .

064 (0 . .

408 (0 . .

091 (0 . .

320 (0 . .

341 (0 . z -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) z − .

418 (0 . .

104 (0 . − .

478 (0 . .

136 (0 . − .

346 (0 . .

359 (0 . W ∗ t, ˜ t ∗ .

32 (0 .

38) 15 .

94 (0 .

38) —2 252 . .

8) 296 . .

9) —3 8800 (150) 15580 (290) —

Table 2: Estimates and predicted standard errors of emission parameters for dive duration ( Y t ), acceleration( ˜ A ∗ t, ˜ t ∗ ), and wiggliness ( ˜ W ∗ t, ˜ t ∗ ) of the killer whale case study data under the HHMM-DFT. The parenthesesrefer to the estimated standard error of the parameter estimates derived from the inverse of the observedinformation matrix. 2 .2.3 CarHHMM Feature Dive / Subdive Type Parameter Estimate (Standard Error)ˆ µ ˆ σ ˆ φ Dive Duration ( s ) - Y t .

21 (0 .

67) 11 .

64 (0 .

70) —2 136 (11) 63 . .

4) — x -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) x − .

061 (0 . .

392 (0 . .

725 (0 . .

228 (0 . .

043 (0 . .

945 (0 . .

172 (0 . .

129 (0 . .

721 (0 . y -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) y .

38 (0 .

12) 0 .

552 (0 . .

725 (0 . .

387 (0 . .

052 (0 . .

945 (0 . .

441 (0 . .

118 (0 . .

721 (0 . z -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) z − .

44 (0 .

11) 0 .

506 (0 . .

725 (0 . − .

537 (0 . .

059 (0 . .

945 (0 . − .

561 (0 . .

135 (0 . .

721 (0 . Table 3: Estimates and predicted standard errors of emission parameters for dive duration ( Y t ) and accelera-tion ( ˜ A ∗ t, ˜ t ∗ ) of the killer whale case study data under the CarHHMM. The parentheses refer to the estimatedstandard error of the parameter estimates derived from the inverse of the observed information matrix. Feature Dive / Subdive Type Parameter Estimate (Standard Error)ˆ µ ˆ σ ˆ φ Dive Duration ( s ) - Y t . .

3) 31 . .

3) — x -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) x .

035 (0 . .

000 (0 . .

241 (0 . .

084 (0 . .

861 (0 . .

208 (0 . .

266 (0 . .

619 (0 . y -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) y .

046 (0 . .

000 (0 . .

440 (0 . .

086 (0 . .

861 (0 . .

383 (0 . .

322 (0 . .

619 (0 . z -Acceleration ( m/s ) - (cid:16) ˜ A ∗ t, ˜ t ∗ (cid:17) z .

053 (0 . .

000 (0 . − .

593 (0 . .

101 (0 . .

861 (0 . − .

355 (0 . .

318 (0 . .

619 (0 . W ∗ t, ˜ t ∗ .

07 (0 .

29) 13 .

61 (0 .

28) —2 291 . .

2) 308 . .

1) —3 10210 (200) 15010 (340) —

Table 4: Estimates and predicted standard errors of emission parameters for dive duration ( Y t ), acceleration( ˜ A ∗ t, ˜ t ∗ ), and wiggliness ( ˜ W ∗ t, ˜ t ∗ ) of the killer whale case study data under the CarHMM-DFT. The parenthesesrefer to the estimated standard error of the parameter estimates derived from the inverse of the observedinformation matrix. Note that ˆ µ ∗ ( · , A is undeﬁned because ˆ φ ∗ ( · , A = 1, indicating that the mean of ˜ A ∗ t, ˜ t ∗ isequal to ˜ A ∗ t, ˜ t ∗ − when in subdive state 1. 3 .3 Decoded dive states Figure 2: The x –component of acceleration (cid:0) y ∗ t,t ∗ (cid:1) x (top two panels) and dive depth (bottom two panels) ofa northern resident killer whale for a sequence of six selected dives. Each panel is partitioned into dives byvertical black lines. The curve colours in the ﬁrst and third panels correspond to the estimated dive typeswhile the curve colours of the second and fourth panels correspond the estimated subdive states. Both thedive types and subdive states are estimated by ﬁtting the CarHHMM-DFT to the data and performing theforward-backward algorithm to determine the hidden state with the highest probability.4igure 3: The probability that each dive is a particular type or that the whale is in a particular subdivestate, given the ﬁtted model. Each panel is partitioned into dives by vertical black lines. The curve coloursin the ﬁrst and third panels correspond to the estimated dive types while the curve colours of the second andfourth panels correspond the estimated subdive states. Both the dive types and subdive states are estimatedby ﬁtting the CarHHMM-DFT to the data and performing the forward-backward algorithm.5 .3.2 HHMM-DFT Figure 4: The x –component of acceleration (cid:0) y ∗ t,t ∗ (cid:1) x (top two panels) and dive depth (bottom two panels)of a northern resident killer whale for a sequence of six selected dives. Each panel is partitioned into divesby vertical black lines. The curve colours in the ﬁrst and third panels correspond to the estimated divetypes while the curve colours of the second and fourth panels correspond the estimated subdive states. Boththe dive types and subdive states are estimated by ﬁtting the HHMM-DFT to the data and performing theforward-backward algorithm to determine the hidden state with the highest probability.6igure 5: The probability that each dive is a particular type or that the whale is in a particular subdivestate, given the ﬁtted model. Each panel is partitioned into dives by vertical black lines. The curve coloursin the ﬁrst and third panels correspond to the estimated dive types while the curve colours of the second andfourth panels correspond the estimated subdive states. Both the dive types and subdive states are estimatedby ﬁtting the HHMM-DFT to the data and performing the forward-backward algorithm.7 .3.3 CarHHMM Figure 6: The x –component of acceleration (cid:0) y ∗ t,t ∗ (cid:1) x (top two panels) and dive depth (bottom two panels)of a northern resident killer whale for a sequence of six selected dives. Each panel is partitioned into divesby vertical black lines. The curve colours in the ﬁrst and third panels correspond to the estimated divetypes while the curve colours of the second and fourth panels correspond the estimated subdive states. Boththe dive types and subdive states are estimated by ﬁtting the CarHHMM to the data and performing theforward-backward algorithm to determine the hidden state with the highest probability.8igure 7: The probability that each dive is a particular type or that the whale is in a particular subdivestate, given the ﬁtted model. Each panel is partitioned into dives by vertical black lines. The curve coloursin the ﬁrst and third panels correspond to the estimated dive types while the curve colours of the second andfourth panels correspond the estimated subdive states. Both the dive types and subdive states are estimatedby ﬁtting the CarHHMM to the data and performing the forward-backward algorithm.9 .3.4 CarHMM-DFT Figure 8: The x –component of acceleration (cid:0) y ∗ t,t ∗ (cid:1) x (top two panels) and dive depth (bottom two panels) ofa northern resident killer whale for a sequence of six selected dives. Each panel is partitioned into dives byvertical black lines. The curve colours in the ﬁrst and third panels correspond to the estimated dive typeswhile the curve colours of the second and fourth panels correspond the estimated subdive states. Both thedive types and subdive states are estimated by ﬁtting the CarHMM-DFT to the data and performing theforward-backward algorithm to determine the hidden state with the highest probability.10igure 9: The probability that each dive is a particular type or that the whale is in a particular subdivestate, given the ﬁtted model. Each panel is partitioned into dives by vertical black lines. The curve coloursin the ﬁrst and third panels correspond to the estimated dive types while the curve colours of the second andfourth panels correspond the estimated subdive states. Both the dive types and subdive states are estimatedby ﬁtting the CarHMM-DFT to the data and performing the forward-backward algorithm.11 .4 Model checking - dive duration ( Y t ) Figure 10: Empirical histogram (left) and psuedoresiduals (right) of dive duration ( Y t ) plotted over theestimated emission distributions and a standard normal density, respectively. Both plots are generated usingthe ﬁtted CarHHMM-DFT and the killer whale case study data. Figure 11: Empirical histogram (left) and psuedoresiduals (right) of dive duration ( Y t ) plotted over theestimated emission distributions and a standard normal density, respectively. Both plots are generated usingthe ﬁtted HHMM-DFT and the killer whale case study data.12 .4.3 CarHHMM Figure 12: Empirical histogram (left) and psuedoresiduals (right) of dive duration ( Y t ) plotted over theestimated emission distributions and a standard normal density, respectively. Both plots are generated usingthe ﬁtted CarHHMM-DFT and the killer whale case study data. Figure 13: Empirical histogram (left) and psuedoresiduals (right) of dive duration ( Y t ) plotted over theestimated emission distribution (left) and a standard normal density (right), respectively. Both plots aregenerated using the ﬁtted CarHMM-DFT and the killer whale case study data.13 .5 Model checking - acceleration ( ˜ A ∗ t, ˜ t ∗ ) Figure 14: Empirical histograms (top three rows) and psuedoresiduals (bottom row) of acceleration ( ˜ A ∗ t, ˜ t ∗ )plotted over the estimated emission distributions and a standard normal density, respectively. Note that themean of acceleration at time ˜ t ∗ depends upon acceleration at time ˜ t ∗ −

1, so only the deviation from theconditional mean at each particular time step is plotted. All plots are generated using the ﬁtted CarHHMM-DFT and the killer whale case study data. 14 .5.2 HHMM-DFT

Figure 15: Empirical histograms (top three rows) and psuedoresiduals (bottom row) of acceleration ( ˜ A ∗ t, ˜ t ∗ )plotted over the estimated emission distributions and a standard normal density, respectively. All plots aregenerated using the ﬁtted HHMM-DFT and the killer whale case study data.15 .5.3 CarHHMM Figure 16: Empirical histograms (top three rows) and psuedoresiduals (bottom row) of acceleration ( ˜ A ∗ t, ˜ t ∗ )plotted over the estimated emission distributions and a standard normal density, respectively. Note that themean of acceleration at time ˜ t ∗ depends upon acceleration at time ˜ t ∗ −

1, so only the deviation from theconditional mean at each particular time step is plotted. All plots are generated using the ﬁtted CarHHMMand the killer whale case study data. 16 .5.4 CarHMM-DFT

Figure 17: Empirical histograms (top three rows) and psuedoresiduals (bottom row) of acceleration ( ˜ A ∗ t, ˜ t ∗ )plotted over the estimated emission distributions and a standard normal density, respectively. Note that themean of acceleration at time ˜ t ∗ depends upon acceleration at time ˜ t ∗ −

1, so only the deviation from theconditional mean at each particular time step is plotted. All plots are generated using the ﬁtted CarHMM-DFT and the killer whale case study data. 17 .6 Model checking - wiggliness ( ˜ W ∗ t, ˜ t ∗ ) Figure 18: Empirical histograms (top row) and psuedoresiduals (bottom row) of wiggliness ( ˜ W ∗ t, ˜ t ∗ ) plottedover the estimated emission distributions and a standard normal density, respectively. All plots are generatedusing the ﬁtted CarHHMM-DFT and the killer whale case study data. Figure 19: Empirical histograms (top row) and psuedoresiduals (bottom row) of wiggliness ( ˜ W ∗ t, ˜ t ∗ ) plotted18ver the estimated emission distributions and a standard normal density, respectively. All plots are generatedusing the ﬁtted HHMM-DFT and the killer whale case study data. The CarHHMM does not model wiggliness.

Figure 20: Empirical histograms (top row) and psuedoresiduals (bottom row) of wiggliness ( ˜ W ∗ t, ˜ t ∗ ) plottedover the estimated emission distributions and a standard normal density, respectively. All plots are generatedusing the ﬁtted CarHMM-DFT and the killer whale case study data.19 Simulation Study Results

Figure 21: Acceleration data for a ﬁve dives of a selected simulated data set. The raw accelerometer datais denoted as Y ∗ t, ˜ t ∗ which is recorded at 50 hertz while ˜ A ∗ t, ˜ t ∗ and ˜ W ∗ t, ˜ t ∗ are recorded at 0.5 hertz. Dives areseparated by vertical black lines. The colour of the curve corresponds to the true ﬁne-scale state, while thecolour of the background corresponds to the true dive type. Subdive state 1 can plausibly be interpreted as“gliding” while subdive state 2 can plausibly be interpreted as active swimming, or “ﬂuking”.20 .2 Accuracy comparisons Model Train Time (m) Dive Type Subdive Type Dive Accuracy Subdive AccuracyCarHMM-DFT 70 ±

11 Both Both ————- 0 . ± . . ± . . ± . . ± . . ± . ±

51 Both Both 0 . ± .

036 0 . ± . . ± .

030 0 . ± . . ± . . ± .

095 0 . ± . . ± . ±

52 Both Both 0 . ± .

165 0 . ± . . ± .

214 0 . ± . . ± . . ± .

136 0 . ± . . ± . ±

40 Both Both 0 . ± .

036 0 . ± . . ± .

031 0 . ± . . ± . . ± .

086 0 . ± . . ± . Table 5: Average decoding accuracies and training times for all models used to categorize dive type andsubdive state in the simulation study. Each of the four models was ﬁt to 500 training data sets comprised of100 simulated dives and tested on test data sets also comprised of 100 simulated dives. Reported values areaverages, and ± refers to the sample standard deviation across the 500 data sets. Rows labelled Both/Bothcorrespond to overall average decoding accuracy. 21 .3 Empirical joint distributions - dive duration ( Y t ) Model Parameter Dive Type Estimate Bias Empirical SE Observed Fischer SECarHMM-DFT µ .

161 — 4 .

255 2 . ± . σ .

119 — 5 .

393 2 . ± . µ .

776 0 .

096 ( p = 0 . .

181 0 . ± . .

589 3 .

989 ( p = 0 . .

726 13 . ± . σ . − .

081 ( p = 0 . .

952 0 . ± . . − .

186 ( p = 0 . .

388 11 . ± . µ .

877 2 .

197 ( p = 0 . .

693 0 . ± . . − .

177 ( p = 0 . .

095 12 . ± . σ .

954 0 .

384 ( p = 0 . .

950 0 . ± . . − .

409 ( p = 0 . .

432 10 . ± . µ .

719 0 .

039 ( p = 0 . .

170 0 . ± . .

404 2 .

804 ( p = 0 . .

535 12 . ± . σ . − .

134 ( p = 0 . .

930 0 . ± . . − .

065 ( p = 0 . .

225 10 . ± . Table 6: Estimates and standard errors for the parameters of the distribution of dive duration across allfour models in the simulation study. The “Estimate” column is the average across 500 simulations, and the“Observed Fisher SE” column is the median across 500 simulations. The ± refers to the inter-quartile range.SE stands for standard error. Reported p -values test if the observed bias is statistically signiﬁcant using aone-sample t -test. Figure 22: Kernel density estimate of the distribution of estimates of dive duration parameters (ˆ µ and ˆ σ )for each dive type. The red star represents the true values of µ and σ for both dive types. These estimatesare for the CarHHMM-DFT. 22 .3.2 HHMM-DFT Figure 23: Kernel density estimate of the distribution of estimates of dive duration parameters (ˆ µ and ˆ σ )for each dive type. The red star represents the true values of µ and σ for both dive types. These estimatesare for the HHMM-DFT. Figure 24: Kernel density estimate of the distribution of estimates of dive duration parameters (ˆ µ and ˆ σ )for each dive type. The red star represents the true values of µ and σ for both dive types. These estimatesare for the CarHHMM. 23 .3.4 CarHMM-DFT Figure 25: Kernel density estimate of the distribution of estimates of dive duration parameters (ˆ µ and ˆ σ ).The red star represents the true values of µ and σ for both dive types. These estimates are for the CarHMM-DFT. The CarHMM-DFT assumes that there is only one dive type. There is no red star which representsthe true values because the CarHMM-DFT assumes that there is only one dive type.24 .4 Empirical joint distributions - acceleration ( ˜ A ∗ t, ˜ t ∗ ) Model Parameter Subdive Type Estimate Bias Empirical SE Observed Fischer SECarHMM-DFT µ ∗ A − . − .

001 ( p = 0 . .

142 0 . ± . − . − .

000 ( p = 0 . .

018 0 . ± . σ ∗ A . − .

016 ( p = 0 . .

001 0 . ± . . − .

021 ( p = 0 . .

002 0 . ± . φ ∗ A . − .

012 ( p = 0 . .

010 0 . ± . . − .

082 ( p = 0 . .

016 0 . ± . µ ∗ A − . − .

001 ( p = 0 . .

027 0 . ± . .

000 0 .

000 ( p = 0 . .

012 0 . ± . σ ∗ A .

136 0 .

086 ( p = 0 . .

022 0 . ± . .

144 0 .

044 ( p = 0 . .

007 0 . ± . φ ∗ A µ ∗ A .

021 0 .

021 ( p = 0 . .

591 0 . ± . .

000 0 .

000 ( p = 0 . .

013 0 . ± . σ ∗ A .

065 0 .

015 ( p = 0 . .

022 0 . ± . .

187 0 .

087 ( p = 0 . .

018 0 . ± . φ ∗ A . − .

099 ( p = 0 . .

155 0 . ± . . − .

768 ( p = 0 . .

137 0 . ± . µ ∗ A − . − .

001 ( p = 0 . .

111 0 . ± . − . − .

000 ( p = 0 . .

018 0 . ± . σ ∗ A . − .

016 ( p = 0 . .

001 0 . ± . . − .

021 ( p = 0 . .

002 0 . ± . φ ∗ A . − .

012 ( p = 0 . .

010 0 . ± . . − .

083 ( p = 0 . .

016 0 . ± . Table 7: Estimates and standard errors for the parameters of the distribution of acceleration across all fourmodels in the simulation study. The “Estimate” column is the average across 500 simulations, and the“Observed Fisher SE” column is the median across 500 simulations. The ± refers to the inter-quartile range.SE stands for standard error. Reported p -values test if the observed bias is statistically signiﬁcant using aone-sample t -test. 25 .4.1 CarHHMM-DFT Figure 26: Kernel density estimate of the distribution of estimates of acceleration parameters (ˆ µ ∗ A , ˆ σ ∗ A , andˆ φ ∗ A ) for each subdive state. The red star represents the true values of µ ∗ A , σ ∗ A , and φ ∗ A for both subdivestates. These estimates are for the CarHHMM-DFT. Figure 27: Kernel density estimate of the distribution of estimates of acceleration parameters (ˆ µ ∗ A and ˆ σ ∗ A )for each subdive state. The red star represents the true values of µ ∗ A and σ ∗ A for both subdive states. Theseestimates are for the HHMM-DFT. Note that the HHMM-DFT assumes no auto-correlation in ˜ A ∗ t, ˜ t ∗ , soˆ φ ∗ ( · ,i ∗ ) A is absent from these plots. 26 .4.3 CarHHMM Figure 28: Kernel density estimate of the distribution of estimates of acceleration parameters (ˆ µ ∗ A , ˆ σ ∗ A , andˆ φ ∗ A ) for each subdive state. The red star represents the true values of µ ∗ A , σ ∗ A , and φ ∗ A for both subdivestates. These estimates are for the CarHHMM. Figure 29: Kernel density estimate of the distribution of estimates of acceleration parameters (ˆ µ ∗ A , ˆ σ ∗ A , andˆ φ ∗ A ) for each subdive state. The red star represents the true values of µ ∗ A , σ ∗ A , and φ ∗ A for both subdivestates. These estimates are for the CarHMM-DFT. 27 .5 Empirical joint distributions - wiggliness ( ˜ W ∗ t, ˜ t ∗ ) Model Parameter Subdive Type Estimate Bias Empirical SE Observed Fischer SECarHMM-DFT µ ∗ W .

362 0 .

022 ( p = 0 . .

582 0 . ± . . − .

511 ( p = 0 . .

144 4 . ± . σ ∗ W . − .

001 ( p = 0 . .

539 0 . ± . . − .

515 ( p = 0 . .

622 5 . ± . µ ∗ W .

353 0 .

013 ( p = 0 . .

620 0 . ± . . − .

759 ( p = 0 . .

438 4 . ± . σ ∗ W . − .

012 ( p = 0 . .

583 0 . ± . . − .

604 ( p = 0 . .

482 5 . ± . µ ∗ W σ ∗ W µ ∗ W .

341 0 .

001 ( p = 0 . .

568 0 . ± . .

209 0 .

009 ( p = 0 . .

869 4 . ± . σ ∗ W . − .

032 ( p = 0 . .

526 0 . ± . . − .

416 ( p = 0 . .

225 5 . ± . Table 8: Estimates and standard errors for the parameters of the distribution of wiggliness across all fourmodels in the simulation study. The “Estimate” column is the average across 500 simulations, and the“Observed Fisher SE” column is the median across 500 simulations. The ± refers to the inter-quartile range.SE stands for standard error. Reported p -values test if the observed bias is statistically signiﬁcant using aone-sample t -test. Figure 30: Kernel density estimate of the distribution of estimates of wiggliness parameters (ˆ µ ∗ W and ˆ σ ∗ W )for each subdive state. The red star represents the true values of µ ∗ W and σ ∗ W for both subdive states. Theseestimates are for the CarHHMM-DFT. 28 .5.2 HHMM-DFT Figure 31: Kernel density estimate of the distribution of estimates of wiggliness parameters (ˆ µ ∗ W and ˆ σ ∗ W )for each subdive state. The red star represents the true values of µ ∗ W and σ ∗ W for both subdive states. Theseestimates are for the HHMM-DFT. The CarHHMM did not use wiggliness to model the whale’s behaviour.

Figure 32: Kernel density estimate of the distribution of estimates of wiggliness parameters (ˆ µ ∗ W and ˆ σ ∗ W )for each subdive state. The red star represents the true values of µ ∗ W and σ ∗ W for both subdive states. Theseestimates are for the CarHMM-DFT. 29 .6 Empirical joint distributions - probability transition matrices ( Γ , Γ ∗ ) Model Parameter Estimate Bias Empirical SE Observed Fischer SECarHMM-DFT Γ —— —— —— ——Γ —— —— —— ——Γ ∗ (1)12 .

179 —— 0 .

018 0 . ± . ∗ (1)21 .

079 —— 0 .

011 0 . ± . ∗ (2)12 —— —— —— ——Γ ∗ (2)21 —— —— —— ——HHMM-DFT Γ . − .

006 ( p = 0 . .

063 0 . ± . .

822 0 .

013 ( p = 0 . .

119 0 . ± . ∗ (1)12 . − .

008 ( p = 0 . .

070 0 . ± . ∗ (1)21 . − .

003 ( p = 0 . .

014 0 . ± . ∗ (2)12 . − .

001 ( p = 0 . .

018 0 . ± . ∗ (2)21 . − .

004 ( p = 0 . .

025 0 . ± . .

307 0 .

095 ( p = 0 . .

257 0 . ± . . − .

098 ( p = 0 . .

280 0 . ± . ∗ (1)12 . − .

051 ( p = 0 . .

252 0 . ± . ∗ (1)21 .

134 0 .

084 ( p = 0 . .

081 0 . ± . ∗ (2)12 . − .

020 ( p = 0 . .

196 0 . ± . ∗ (2)21 . − .

006 ( p = 0 . .

042 0 . ± . . − .

002 ( p = 0 . .

064 0 . ± . .

818 0 .

009 ( p = 0 . .

117 0 . ± . ∗ (1)12 .

325 0 .

005 ( p = 0 . .

069 0 . ± . ∗ (1)21 .

050 0 .

000 ( p = 0 . .

012 0 . ± . ∗ (2)12 .

141 0 .

001 ( p = 0 . .

017 0 . ± . ∗ (2)21 .

152 0 .

002 ( p = 0 . .

020 0 . ± . Table 9: Estimates and standard errors for all probability transition matrices across all four models in thesimulation study. The “Estimate” column is the average across 500 simulations, and the “Observed FisherSE” column is the median across 500 simulations. The ± refers to the inter-quartile range. SE stands forstandard error. Reported p -values test if the observed bias is statistically signiﬁcant using a one-sample t -test. 30 .6.1 CarHHMM-DFT Figure 33: Kernel density estimate of the distributions ˆΓ, ˆΓ ∗ (1) , and ˆΓ ∗ (2) for the CarHHMM-DFT. Thereare two ﬁne-scale probability transition matrices- one corresponding to each dive type. We only plot theoﬀ-diagonal elements of the probability transition matrices since this uniquely deﬁnes a stochastic matrix.The red star represents the true values. 31 .6.2 HHMM-DFT Figure 34: Kernel density estimate of the distributions ˆΓ, ˆΓ ∗ (1) , and ˆΓ ∗ (2) for the HHMM-DFT. There aretwo ﬁne-scale probability transition matrices- one corresponding to each dive type. We only plot the oﬀ-diagonal elements of the probability transition matrices since this uniquely deﬁnes a stochastic matrix. Thered star represents the true values. 32 .6.3 CarHHMM Figure 35: Kernel density estimate of the distributions ˆΓ, ˆΓ ∗ (1) , and ˆΓ ∗ (2) for the CarHHMM. There are twoﬁne-scale probability transition matrices- one corresponding to each dive type. We only plot the oﬀ-diagonalelements of the probability transition matrices since this uniquely deﬁnes a stochastic matrix. The red starrepresents the true values. 33 .6.4 CarHMM-DFT Figure 36: Kernel density estimate of the distribution ˆΓ ∗ (1)(1)