Anomaly Detection in Energy Usage Patterns

Abstract

Energy usage monitoring on higher education campuses is an important step for providing satisfactory service, lowering costs and supporting the move to green energy. We present a collaboration between the Department of Statistics and Facilities Operations at an R1 research university to develop statistically based approaches for monitoring monthly energy usage and proportional yearly usage for several hundred utility accounts on campus. We compare the interpretability and power of model-free and model-based methods for detection of anomalous energy usage patterns in statistically similar groups of accounts. Ongoing conversation between the academic and operations teams enhances the practical utility of the project and enables implementation for the university. Our work highlights an application of thoughtful and continuing collaborative analysis using easy-to-understand statistical principles for real-world deployment.



Henry Linder, Nalini Ravishanker, Ming-Hui Chen, David McIntosh, and Stanley Nolan
Department of Statistics, University of Connecticut, Storrs, CT 06269
Utility Operations & Energy Management, Facilities Operations, University of Connecticut, Storrs, CT 06269
* Corresponding author: [email protected]


Keywords: Boxplots, Cluster analysis, Graphical interface, Proportions, Utility Management.

1 Introduction

Energy management at large academic institutions usually poses unique operational challenges (Cruz Rios et al., 2017). The energy infrastructure in a university can be complex across multiple dimensions: the number of buildings at a campus is large, the facilities exhibit diverse use-types, and the buildings vary in size, energy efficiency, and state of repair. For utility managers, close and in-depth scrutiny of energy usage across the entire campus must be balanced against time and manpower by using resources for the most critical maintenance problems. The balance between these priorities is the focus of Facilities Operations in an institution. Understanding and monitoring patterns in historical energy usage data and identifying anomalous behavior is an important step in this direction, and statistical methods can provide a systematic framework for implementing these.

This is the focus of an ongoing collaboration at the University of Connecticut between the Department of Statistics and Facilities Operations, working together in a concerted way to achieve systematic and statistically valid procedures to understand patterns and detect anomalies in energy usage, ultimately minimizing wasteful consumption. We describe the problem of managing and monitoring monthly energy usage in a large set of utility accounts located across the university's 4000 acre main campus. The number of accounts is prohibitively large to permit manual assessment by Facilities Operations engineers. Moreover, while each account individually represents only a small fraction of overall energy consumption on campus, abnormal consumption aggregated across several hundred accounts over the full fiscal year may pose a substantial financial cost for the university. Since energy use is a fundamental need for university operations, the mandate from Facilities Operations is to statistically distinguish abnormal and unnecessary usage from anticipated and essential usage.
We carry out an in-depth analysis of time series of monthly energy usage, as well as monthly proportions of the energy usage within each year. We have chosen to employ simple but effective statistical procedures that can be easily communicated to engineers and be jointly implemented with our collaborators from Facilities Operations on ready-to-use dashboards. The variety of building use-types on the university campus poses a unique challenge. An individual building's energy consumption profile depends directly on factors including use-type, square footage and building age. There also exist significant correlations between energy usage in different campus buildings due to common drivers such as student and faculty population sizes, climate variations within a year, and campus-wide energy initiatives.

Existing literature on methods for visualizing campus energy usage and understanding and modeling energy data varies in the level of detail and sophistication. Students at the Worcester Polytechnic Institute (O'Hara et al., 2020) discussed the physical infrastructure needs for real-time collection of electricity data, with a focus on the hardware metering technology used to measure demand. The Harvard Medical School implemented a real-time visualization dashboard for energy usage by building (Lieberman, 2010) to present graphics of historic energy usage, with a goal of increasing community awareness. Breyer et al. (2020) considered software solutions to encourage behavioral changes in campus energy consumers at the University of Michigan. They focused on public-facing tools to increase awareness among student, faculty, and staff communities to achieve structural changes in demand for energy. Ma et al. (2015) gave a descriptive analysis of aggregate energy consumption at seven universities across the world.
They considered the effects of population and building footprint size on energy consumption, and used this to compare an energy consumption index between universities to assess their relative energy usage.

To our knowledge, existing literature on anomaly detection methods for energy consumption seems to focus primarily on high-frequency demand/usage data, and these methods are not suitable for coarser-grained monthly data. The methods used in these papers ranged from using z-scores relating the mean and standard deviation for individual time series based on previous behavior (Seem, 2007) to looking for large discontinuities in usage (Zhao, 2014) to using reduced dimension processes and looking for large prediction errors (Ma et al., 2017) or distance based abnormality scores (Rashid and Singh, 2018).

The literature does not address statistical concepts such as power of the anomaly detection schemes or the probability of false positives, nor the mechanisms for anomaly detection in monthly proportions data. It also does not directly address the managerial problem of monitoring a large number of energy accounts, observed at a low frequency and with small sample sizes. Many existing methods are intended for monitoring a small number of high-frequency time series, but that type of data is only available for a small subset of buildings. Moreover, the Facilities Operations management problem exists even when high-frequency meter data is unavailable, so small data approaches are still necessary.

In this article, we describe our collaborative work with Facilities Operations, which highlights the value of statistical practice by academics that focuses on operational problem-solving. To identify candidate buildings that may exhibit aberrant energy usage, we consider energy consumption across the fiscal year. We also integrate data from external sources to normalize usage by relevant weather covariates, and then convert these into monthly usage proportions within the year.
We group buildings according to similar characteristics, and apply statistical clustering methods. This enables full utilization of the similarities in consumption profiles across accounts. By grouping buildings with similar energy consumption profiles, we obtain an aggregate benchmark against which we assess the usage characteristics of individual accounts within these buildings. We propose a graphical, model-free method for anomaly detection based on approximate, group-wise control limits on boxplots constructed for transformed monthly usage proportions data. We also propose an approach based on a linear model, which we use as a benchmark against which to compare our graphical method. Our model is motivated by a statistical process control approach described by Fu and Jeske (2014). They used historical in-control data to estimate nuisance parameters in their model, leading to an approximate Bartlett-type likelihood ratio (ABLR). We report to Facilities Operations engineers a list of accounts flagged as potential outliers. These anomalies can be investigated by Facilities Operations on an ongoing basis for physical malfunctions or other maintenance problems.

The format of the article follows. We first provide a description of a monthly gas consumption data set collected by Facilities Operations in Section 2. In Section 3, we analyze monthly proportional usage within a fiscal year. In Section 4, we identify anomalous accounts within known homogeneous groups, using two approaches. First, we consider a model-free approach based on the boxplot to identify energy accounts that are anomalous relative to similar accounts. In Section 4.3, we introduce a model-based approach to identify changes in mean value, and compare its properties to those of the less sophisticated, model-free method.
In Section 5, we describe an extension to the setting where the groups that partition accounts are not all known a priori and must be statistically determined. Finally, we end in Section 6 with a discussion of the interface that we provide to Facilities Operations.

The raw data consist of monthly measurements of natural gas consumption collected by Utility Operations and Energy Management in Facilities Operations at the university's main campus. A total of 245 separate utility accounts are available across 115 buildings, with the number of accounts varying between buildings. In some cases one building contains multiple accounts, such as an apartment building with several units. In what follows, we have denoted buildings by generic names due to privacy considerations.

The availability of historical data varies by building and account, with the earliest observation in February 2007 and the last in December 2018. Prior to the analysis, we omitted four accounts which have missing final observations, as this is an indication that the accounts are no longer actively used. We also omitted 1 account with over 10% of observations missing, and 2 other accounts known a priori to be substantial outliers. Table 1 shows the time series lengths in our dataset, most of which are either 82 or 143 months long. For example, of the 71 accounts in Apartment Complex A, 69 series were observed from February 2007, or for 143 months; 1 series for 137 months; and the last for only 82 months.

Year  Month      Freq.    Year  Month    Freq.
2007  February   128      2011  March    3
2007  May        1        2011  April    2
2007  August     1        2011  June     1
2008  May        1        2011  July     2
2009  April      1        2012  March    83
2009  December   1        2013  March    1
2010  March      3        2013  May      1
2010  April      1        2015  January  3
2010  September  1        2015  April    2
2010  October    1        2016  June     1

Table 1: Series start dates with observed frequencies. Values count the series beginning in a given month.

For the 238 accounts used in the analysis, monthly gas consumption was measured in units of hundred cubic feet ("CCF").
Each observation corresponds to a single utility bill in one account in one billing period, measured as the difference between two meter readings. The bills are grouped across accounts by a calendar-monthly billing period. The billing period provides a reference alignment of each utility bill, despite differences across accounts in the specific start and end dates in a given billing period. The monthly observations for two distinct residential complexes are shown in Figure 1(a). Specifically, five accounts in Apartment Complex A, Building C, and 12 accounts from Apartment Complex B are shown.

[Figure 1 panels: (a) Raw observations (CCF), with legend "Apt. Complex A, Bldg. 3" and "Apt. Complex B"; (b) Monthly degree day observations (degree days), with legend "Cooling degree days (CDD)" and "Heating degree days (HDD)".]

Figure 1: (a) Raw data observed in Building C of Apartment Complex A (red lines), and all of Apartment Complex B (black). Data in Complex A is available since 2007, compared with 2012 for Complex B. (b) Historical degree day observations, by heating and cooling.

It is possible that a value in any month for any account may be exactly 0, for one of two reasons. First, a building may be unoccupied or otherwise inactive for that month. For instance, Apartment Complex A, Building 6, has values of 0 in the same months across several years, July and August. This is because the residence hall is unoccupied in the summer. Second, the gas utility company only charges for integer-valued CCF values, so that a recorded value of 0 may actually represent a nonzero CCF value that is below 1.

In addition to energy usage, our analysis also includes weather covariates. We used daily weather data from the National Oceanic and Atmospheric Administration (NOAA). This data is made available through the Applied Climate Information System web API. Based on guidance from Facilities Operations, we obtained daily observations of heating degree days (HDD) and cooling degree days (CDD), defined as

\[ \mathrm{HDD} = \max(0, 65 - u), \quad \text{and} \quad \mathrm{CDD} = \max(0, u - 65), \tag{1} \]

where $u$ refers to the average (of the maximum and minimum temperatures) on any given day. Degree days provide a nonlinear measure of temperature deviation from a neutral balance point (65 degrees), and provide a robust context for temperature (Quayle and Diaz, 1980). It is worth noting that the daily value of a single degree day will generally be greater than 1. The degree day converts the one-day deviation from the balance point, $|u - 65|$, to $|u - 65|$ days of 1-degree deviation from the balance point. Therefore, the degree day measures the magnitude of a deviation from the balance point. To reckon the HDD (CDD) for a month, it is usual to sum the daily HDD (CDD) values.
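The degree day definitions in Equation (1) and the monthly summation can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names and the `(year, month, t_max, t_min)` record layout are assumptions made for the example.

```python
from collections import defaultdict

BALANCE_POINT = 65.0  # neutral balance point (degrees) stated in the text


def daily_degree_days(t_max, t_min, balance=BALANCE_POINT):
    """Return (HDD, CDD) for one day from its maximum and minimum temperature."""
    u = (t_max + t_min) / 2.0  # daily average temperature, as in Equation (1)
    hdd = max(0.0, balance - u)
    cdd = max(0.0, u - balance)
    return hdd, cdd


def monthly_degree_days(daily_records):
    """Sum daily HDD and CDD within each (year, month).

    daily_records: iterable of (year, month, t_max, t_min) tuples;
    this layout is hypothetical and chosen only for the sketch.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for year, month, t_max, t_min in daily_records:
        hdd, cdd = daily_degree_days(t_max, t_min)
        totals[(year, month)][0] += hdd
        totals[(year, month)][1] += cdd
    return {key: tuple(val) for key, val in totals.items()}
```

On any single day at most one of HDD and CDD is nonzero, so a month's total degree days (heating plus cooling) can be obtained by adding the two monthly sums.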
Figure 1(b) shows the historical observations of heating and cooling degree days.

In addition to the different number of calendar days in each month, the specific start and end dates for energy consumption within any month are not consistent across all accounts. We re-normalized the energy data to account for this problem in Section 3.1. We also adjusted for weather effects.

We calculated the average consumption per day for each observation, and then re-normalized all observations to represent a 30-day month. Furthermore, we applied a standard adjustment by dividing the CCF value by the area of the building (square footage), so that each observation represents the monthly average-per-square-foot for a 30-day month. For buildings that contain multiple accounts, we subdivided the square footage evenly across all accounts. For example, Apartment Complex A, Building C, has an area of 4521 square feet, and five gas accounts. Suppose the utility bill for one of these five accounts from 2007-01-19 to 2007-02-20 (for 32 days) is 76 CCF. We computed its normalized consumption value for February of 2007 to be

\[ 30 \times \frac{76}{32} \Big/ \frac{4521}{5} \approx 0.0788. \]
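The normalization described above can be checked with a short Python sketch that reproduces the worked example for Apartment Complex A, Building C. The function name and argument names are illustrative assumptions, not part of the original analysis.

```python
from datetime import date


def normalized_usage(ccf, start, end, building_sqft, n_accounts, month_days=30):
    """30-day CCF per square foot for one bill.

    The building's square footage is split evenly across its accounts,
    as described in the text.
    """
    billing_days = (end - start).days         # length of the billing period
    per_day = ccf / billing_days              # average consumption per day
    account_sqft = building_sqft / n_accounts
    return month_days * per_day / account_sqft


# Worked example from the text: 76 CCF billed over 32 days,
# 4521 square feet shared by five accounts.
value = normalized_usage(76, date(2007, 1, 19), date(2007, 2, 20), 4521, 5)
print(round(value, 4))
```

The example bill spans 32 days, so the 76 CCF is first scaled to 71.25 CCF for a 30-day month and then divided by the 904.2 square feet attributed to the account.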

Figure 2: (a) Normalized data, adjusted for nominal month billing period duration and per-unit square footage. (b) Degree day-adjusted re-normalized data, with imputed values.

3.2 Missing Data Imputation

Of the 238 accounts, 39 accounts have missing observations in the time series interior. The number of missing values in any of the 39 time series was at most 7, while 23 series were missing only a single value. We believe that it is reasonable to assume that these values are missing at random, as they typically arise from clerical oversight (Little and Rubin, 2019). We used the "imputeTS" package in R (Moritz and Bartz-Beielstein, 2017) to impute the missing values in the degree day-adjusted re-normalized observations. Based on a visual inspection of the output of the various interpolation, Kalman smoothing, and moving average options in this package, we selected the Kalman smoothing and structural time series model option. We note that, if an imputed value was negative, we replaced the imputed value with 0 (since consumption must be non-negative). The weather-adjusted re-normalized series with imputed values are shown in Figure 2(b).
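The post-processing rule above (replace negative imputed values with 0) can be illustrated with a small Python sketch. Note that this sketch swaps in plain linear interpolation as a stand-in for the imputation step; it is not the Kalman smoothing with a structural time series model from the R package "imputeTS" that the analysis used, and the function name is hypothetical.

```python
import numpy as np


def impute_interior(series):
    """Fill interior NaNs, then clip at zero.

    Assumption for this sketch: linear interpolation replaces the Kalman
    smoother used in the paper. Only the final clipping step mirrors the text.
    """
    y = np.asarray(series, dtype=float)
    idx = np.arange(y.size)
    observed = ~np.isnan(y)
    filled = y.copy()
    # Interpolate each missing position from its observed neighbours.
    filled[~observed] = np.interp(idx[~observed], idx[observed], y[observed])
    # Consumption must be non-negative, so negative imputations become 0.
    return np.clip(filled, 0.0, None)
```

With linear interpolation between non-negative observations the clipping step never triggers; it matters for smoothers, such as the one used in the paper, that can overshoot below zero.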

Table 2 gives the number of accounts (each corresponding to a monthly time series) for the seven utility services.

[Table 2 columns: Utility Service Type, Service Group, N]

... N = 70 accounts. The re-normalized data for these accounts are shown in the top of Figure 3. The pattern of these time series reflects the use of gas for heating in the winter. The accounts after degree day adjustment are shown in the bottom panel of Figure 3.

Figure 3: Historical data for 70 gas accounts in Apartment Complex A. Top: Normalized data, calculated as hundred cubic feet (CCF) per square foot, averaged over billing period duration and aggregated to a 30-day month. Bottom: Degree day-adjusted and normalized data, calculated as per-degree day, where degree days are the sum of cooling and heating degree days.

We grouped the data into yearly cycles based on the university's fiscal year, which begins on July 1 and continues until the following June 30. This grouping coincides with both the university's accounting schedule and the academic year. Student residents depart campus in May, so logistically, summer is a good time for Facilities Operations to address maintenance problems.
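The fiscal-year grouping can be made concrete with a small Python sketch that maps a calendar date to its fiscal year and to its position within that year. The helper names and the label format are assumptions for illustration only.

```python
from datetime import date


def fiscal_year_label(d):
    """Label of the fiscal year (July 1 to June 30) containing date d."""
    start = d.year if d.month >= 7 else d.year - 1
    return f"{start}-{start + 1}"


def fiscal_month_index(d):
    """Position of the month within the fiscal year: 1 for July, 12 for June."""
    return (d.month - 7) % 12 + 1
```

For example, January 2018 falls in fiscal year 2017-2018 and is the seventh month of that cycle.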

4 Detecting Anomalies in Accounts from Known Homogeneous Groups

Anomaly detection could refer to identifying an account with behavior that departs markedly from its own history, or from a normal pattern that is expected based on known characteristics for that account. For example, the normal pattern for each month could be represented by a median, or suitable quantiles of the expected usage of the accounts.

Based on observations from all 70 accounts in Apartment Complex A, we applied a moving window of two years and computed curves that represent the 2.5th and 97.5th percentiles across both years. We then considered using the curves as reference limits for normal behavior in the following year. Our choice to use a two year window was motivated by the high degree of variance we observed in the quantiles when considering quantiles for only the single preceding year.

We show six years (2012 to 2018) of weather-adjusted data in Figure 4, and the quantiles for the previous two years are shown in red. Our intention was to use the quantiles to identify accounts with an anomaly in some month, identified by a usage value falling outside the reference quantiles. The two year window accommodates systematic structural changes to the consumption pattern over time. This tendency is apparent in the shift of peak consumption from October in 2012–2013, to November in 2017–2018.

The reference quantiles derived from the previous two years represent aggregates across multiple accounts. For a specific building, deviations from those reference quantiles will be due to idiosyncratic factors of the new year: new occupants in a dormitory, new class schedules in academic buildings. Therefore, the reference quantiles may be assumed to be statistically independent of observations in the current year. We used this to directly flag outlying values in the data. Below the horizontal axis of each panel of Figure 4, we report the number of accounts above and below the reference quantiles.
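The quantile-based flagging can be sketched in Python with NumPy. This is a rough illustration under stated assumptions: the array layout (years by accounts by 12 fiscal months) and the function names are hypothetical, and pooling both preceding years and all accounts within each month is one reading of "percentiles across both years".

```python
import numpy as np


def reference_limits(prev_years, lower=2.5, upper=97.5):
    """Monthly reference limits from the preceding fiscal years.

    prev_years: array of shape (n_years, n_accounts, 12), assumed to hold
    degree day-adjusted usage with months ordered July to June.
    """
    pooled = np.asarray(prev_years, dtype=float).reshape(-1, 12)
    lo = np.percentile(pooled, lower, axis=0)
    hi = np.percentile(pooled, upper, axis=0)
    return lo, hi


def count_flags(current_year, lo, hi):
    """Per-month counts of accounts above and below the reference limits.

    current_year: array of shape (n_accounts, 12).
    """
    current = np.asarray(current_year, dtype=float)
    above = (current > hi).sum(axis=0)
    below = (current < lo).sum(axis=0)
    return above, below
```

The two returned count vectors correspond to the two rows of numbers reported beneath each panel of Figure 4.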
Figure 4: Fiscal year observations for Apartment Complex A. Red lines give previous two years' 2.5% and 97.5% quantiles. Below the chart is the number of accounts above the 97.5% quantile (line 1) and below the 2.5% quantile (line 2).

The reference quantiles captured the dominant central tendency of recent observations. This permits identification of anomalous accounts such as in 2015–2016, where a single account is consistently elevated above other series. This demonstrates the importance of grouping together buildings with similar consumption profiles: for a different building use-type, that series might not be anomalous. On the other hand, we flag a large number of accounts as "outliers" from July to October in 2017. This suggests that the accounts we flagged might not provide a meaningful summary of deviation from a "normal" baseline. In particular, the past quantiles do not reflect behavior in the current year.

Another important aspect of Figure 4 is the recurrence of level differences across accounts. This is apparent in the series with repeatedly high usage in October. Because the magnitude of this series dominates the other series, it is difficult to compare this series with the others. This may reflect a systematic difference in that account, but it also suggests that level shifts may pose an obstacle to applying these quantiles to the data.

4.2 Model-Free Anomaly Detection on Proportions

We identified two problems with the quantile-based flagging, namely, (1) inconsistent usage patterns from one year to the next, despite within-group homogeneity in a single year; and (2) differences in scale that may dominate usage patterns that are otherwise similar. To remedy these problems, we calculated proportional usage for each month within the fiscal year. Denoting energy usage in a given year for account $i$ in month $m$ by $x_{im}$, $i = 1, \ldots, N$, where $N = 70$, $m = 1, \ldots, 12$, we computed the proportion $p_{im}$ of usage in month $m$ relative to overall consumption:

\[ p_{im} = x_{im} \Big/ \sum_{k=1}^{12} x_{ik}. \tag{2} \]

The resulting sequence is of length 12.

The proportion transformation permits us to compare accounts with different magnitudes by removing level information about the series. It also allows us to address the variability in usage patterns between years, such as the shift of peak usage from October to November noted above in Section 4.1. This is accomplished by summarizing the relative magnitude of each month's observation, while also controlling for year-wide changes in temperature allocation, occupancy of residences, and other factors that may vary from year to year.

We considered applying Tukey's boxplot to the marginal monthly proportions across all accounts. The boxplot provides a convenient definition of an outlier, in terms of individual observations from a homogeneous group. Denote the lower and upper quartiles of the data (25% and 75% quantiles, respectively) by $Q_1$ and $Q_3$. The interquartile range is defined as $\mathrm{IQR} = (Q_3 - Q_1)$, which is a robust measure of the spread of the population.

Tukey's thresholds for outliers are based on the lower and upper fences, which are defined in Table 3 for moderate and severe outliers. The lower bound for non-outlier values is found as the smallest observed value larger than the lower fence. Likewise, the upper bound is the largest observed value smaller than the upper fence.

              Moderate outlier                     Severe outlier
Lower fence   $Q_1 - 1.5 \times \mathrm{IQR}$      $Q_1 - 3 \times \mathrm{IQR}$
Upper fence   $Q_3 + 1.5 \times \mathrm{IQR}$      $Q_3 + 3 \times \mathrm{IQR}$

Table 3: Lower and upper fences for outlier detection, based on Tukey's boxplot. The lower bound is the smallest value larger than the lower fence, and the upper bound is the largest value smaller than the upper fence. Observations below the lower bound or above the upper bound are flagged as outliers.

These threshold bounds assume a normally-distributed sample, which is clearly violated by proportion values. Therefore, we transformed each proportion to the logit scale, defined by the transformation

\[ \ell_{im} = \log\left( \frac{p_{im}}{1 - p_{im}} \right). \tag{3} \]

We then constructed boxplots within each month across all accounts. These boxplots are shown for three fiscal years in the top panel of Figure 5.

Figure 5: Boxplots on logit of monthly proportions of fiscal yearly consumption (top), and corresponding proportions (bottom). Bounds corresponding to the boxplot whiskers are shown in red.

The number of outliers is marked below the bottom panel of Figure 5. The boxplots identify two of the residential accounts as outliers in fiscal year 2017–2018. The proportions for these series are highlighted in Figure 6. One of the accounts exhibits elevated usage in October relative to other accounts in the group, and relatively low usage from December to April. The outliers are flagged in the months with decreased usage, although the specific decreases in the proportions may be due to a linear response to the exceptionally large usage in October. The other series exhibits low usage in November, and relatively high usage from December to March. The outlier data point itself is the low usage in November, and the elevated usage in the winter does not register as statistically significant. The proportions clearly demonstrate lower consumption in some months, although these same points appear far less abnormal in the non-proportion data.
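The model-free procedure of this section (Equation (2), Equation (3), and the fences of Table 3) can be sketched in Python with NumPy. This is a minimal sketch, not the implementation deployed for Facilities Operations: the function names are hypothetical, the quartiles use NumPy's default interpolation rule, and the handling of months with zero usage (whose logit is negative infinity) is left open because this section does not describe it.

```python
import numpy as np


def monthly_proportions(x):
    """Equation (2): each account's monthly share of its fiscal-year total.

    x: array of shape (n_accounts, 12) of non-negative usage values.
    """
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=1, keepdims=True)


def logit(p):
    """Equation (3): log(p / (1 - p)); a proportion of 0 maps to -inf."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def tukey_flags(values, k=1.5):
    """Flag entries outside Tukey's fences, month by month.

    values: array of shape (n_accounts, 12), e.g. logit proportions.
    k = 1.5 gives the moderate fences of Table 3, k = 3 the severe fences.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75], axis=0)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)
```

A typical call chain would be `tukey_flags(logit(monthly_proportions(x)))`. Flagging values outside the fences is equivalent to flagging values beyond the whisker bounds, since the bounds are the most extreme observations that lie inside the fences.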

Figure 6: Outlier observations from 2017–2018. Proportion data is shown at left, degree-day-adjusted data on the right. Series with outliers are highlighted, and outlier observations are marked.

We follow in detail the historical paths of these two accounts in Figure 7. In previous years, we observe that the behavior of Account 2 is comparable to the other accounts in Apartment Complex A, with a marked change in 2017–2018. In contrast, the large value in Account 1 recurs with the same pattern across years, and is consistently out-of-sync with the other buildings. Further investigation revealed that Account 1 is the only account in Apartment Complex A that is classified by the utility as "non-residential small general service", instead of "residential heating." Moreover, a Facilities Operations annotation for this account indicates this account is used for the laundry dryer in Building C.

Figure 7: Outlier time series from fiscal year 2017–2018.

Therefore, we understand Account 1 to represent behavior that is truly outlying, relative to the other accounts, because we may attribute differences in consumption to known external factors. We thus conjecture that there may be a real anomaly in Account 2, as well, because of incompatibility with its own past behavior, as well as the extreme low consumption in November. Further investigation by Facilities Operations will clarify the status of this account, and assess the reality of any anomaly there.

In light of the lack of publications that consider anomaly detection on proportions, we propose a linear model that offers an approach to monitor each new observation as it arrives, in an online fashion. We adapt the general monitoring framework of Fu and Jeske (2014), which implements hypothesis tests to monitor for mean changes of a given magnitude. These authors developed an integrated likelihood ratio (ILR) test to implement a procedure similar to Bartlett's sequential probability ratio test (SPRT). Since a "Bartlett-type likelihood ratio" (BLR) test would require multivariate integration across many dimensions as well as a maximization over nuisance parameters, they instead used historical in-control data to estimate the nuisance parameters, leading to an approximate Bartlett-type likelihood ratio (ABLR).

As above, we consider observations of monthly energy consumption data for $N$ utility accounts. We further consider $K$ historical years of data, and modify our notation accordingly. Denote the consumption in month $m$ of year $k$ for the $i$th account by $x_{ikm}$, $i = 1, \ldots, N$, $k = 1, \ldots, K$, and $m = 1, \ldots, 12$. The dataset contains $12KN$ observations. Each year of data is a 12-month cycle for each account. For each month $m$ in year $k$ for the $i$th account, we calculate the proportion of overall yearly consumption by

\[ p_{ikm} = \frac{x_{ikm}}{\sum_{\ell=1}^{12} x_{ik\ell}}. \tag{4} \]

Finally, we apply the logistic transformation to $p_{ikm}$, $m = 1, \ldots, 12$, yielding data

\[ y_{ikm} = \log\left( \frac{p_{ikm}}{1 - p_{ikm}} \right). \tag{5} \]

Let $y_{ik} = (y_{ik1}, \ldots, y_{ik,12})^\top$ denote the 12-dimensional vector of logit-transformed, yearly proportional consumptions. We assume

\[ y_{ik} = \beta_i + \epsilon_{ik}, \tag{6} \]

\[ \epsilon_{ik} = (\epsilon_{ik1}, \ldots, \epsilon_{ik,12})^\top \sim N(\mathbf{0}, \Sigma), \tag{7} \]

where $\beta_i = (\beta_{i1}, \ldots, \beta_{i,12})^\top$ is the mean of the logit transformed yearly proportional consumptions for the $i$th account for $i = 1, \ldots, N$.

Let $D_K = \{ y_{ik},\ i = 1, \ldots, N,\ k = 1, \ldots, K \}$ denote the "historical" data for these $N$ accounts. Write $\beta = (\beta_1^\top, \ldots, \beta_N^\top)^\top$. The likelihood function based on $D_K$ is

\[ L(\beta, \Sigma \mid D_K) \propto |\Sigma|^{-KN/2} \exp\left\{ -\frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - \beta_i)^\top \Sigma^{-1} (y_{ik} - \beta_i) \right\}. \tag{8} \]

Then, the maximum likelihood estimates of $\beta_1, \ldots, \beta_N$, and $\Sigma$ are

\[ \hat{\beta}_i = \frac{1}{K} \sum_{k=1}^{K} y_{ik}, \quad i = 1, \ldots, N, \tag{9} \]

\[ \hat{\Sigma} = \frac{1}{NK} \sum_{i=1}^{N} \sum_{k=1}^{K} (y_{ik} - \hat{\beta}_i)(y_{ik} - \hat{\beta}_i)^\top. \tag{10} \]

Let $y_{K+1} = (y_{1,K+1}^\top, \ldots, y_{N,K+1}^\top)^\top$ denote the logit transformed yearly proportional consumptions of $N$ accounts at year $K + 1$. Let $C = \mathrm{diag}(c_1, c_2, \ldots, c_{12})$ be a fixed $12 \times 12$ diagonal matrix with $C \neq I$, where $I$ is the $12 \times 12$ identity matrix. Here the values of the diagonal elements of $C$ specify the magnitudes of change in $\beta_i$ that indicate the underlying consumption is no longer in-control for the $i$th account. In other words, we test the hypothesis

\[ H_0: \beta_i = \beta_i^{(0)} \quad \text{vs.} \quad H_1: \beta_i = C \beta_i^{(0)} \tag{11} \]

as we observe the new value of $y_{i,K+1}$ for $i = 1, \ldots, N$.

The test statistic is

\[ T_i^{ABLR} = \frac{\exp\left\{ -\frac{1}{2} (y_{i,K+1} - C\beta_i^{(0)})^\top \Sigma^{-1} (y_{i,K+1} - C\beta_i^{(0)}) \right\}}{\exp\left\{ -\frac{1}{2} (y_{i,K+1} - \beta_i^{(0)})^\top \Sigma^{-1} (y_{i,K+1} - \beta_i^{(0)}) \right\}} \]

and

\[ \log\left( T_i^{ABLR} \right) = (\beta_i^{(0)})^\top (C - I) \Sigma^{-1} y_{i,K+1} - \frac{1}{2} (\beta_i^{(0)})^\top C \Sigma^{-1} C \beta_i^{(0)} + \frac{1}{2} (\beta_i^{(0)})^\top \Sigma^{-1} \beta_i^{(0)}. \]

Under $H_0$, we have

\[ \log\left( T_i^{ABLR} \right) \sim N\left( (\beta_i^{(0)})^\top (C - I) \Sigma^{-1} \beta_i^{(0)} - \frac{1}{2} (\beta_i^{(0)})^\top C \Sigma^{-1} C \beta_i^{(0)} + \frac{1}{2} (\beta_i^{(0)})^\top \Sigma^{-1} \beta_i^{(0)},\ (\beta_i^{(0)})^\top (C - I) \Sigma^{-1} (C - I) \beta_i^{(0)} \right). \tag{12} \]

Write

\[ Z_i(y_{i,K+1}, \beta_i^{(0)}, \Sigma) = \frac{\left| (\beta_i^{(0)})^\top (C - I) \Sigma^{-1} y_{i,K+1} - (\beta_i^{(0)})^\top (C - I) \Sigma^{-1} \beta_i^{(0)} \right|}{\left\{ (\beta_i^{(0)})^\top (C - I) \Sigma^{-1} (C - I) \beta_i^{(0)} \right\}^{1/2}}. \tag{13} \]

Let

\[ Z_{\max}(y_{K+1}, \beta^{(0)}, \Sigma) = \max\left\{ Z_1(y_{1,K+1}, \beta_1^{(0)}, \Sigma), \ldots, Z_N(y_{N,K+1}, \beta_N^{(0)}, \Sigma) \right\}, \tag{14} \]

where $\beta^{(0)} = ((\beta_1^{(0)})^\top, \ldots, (\beta_N^{(0)})^\top)^\top$. Note that under $H_0$, we have

\[ \frac{(\beta_i^{(0)})^\top (C - I) \Sigma^{-1} y_{i,K+1} - (\beta_i^{(0)})^\top (C - I) \Sigma^{-1} \beta_i^{(0)}}{\left\{ (\beta_i^{(0)})^\top (C - I) \Sigma^{-1} (C - I) \beta_i^{(0)} \right\}^{1/2}} \sim N(0, 1) \]

for $i = 1, \ldots, N$. Thus, under $H_0$ and for $z > 0$, we have

\[ P\left( Z_{\max}(y_{K+1}, \beta^{(0)}, \Sigma) < z \right) = [2\Phi(z) - 1]^N, \]

where $\Phi(\cdot)$ is the standard normal cumulative distribution function. For a given significance level $\alpha$, the rejection region is given by

\[ \left\{ Z_{\max}(y_{K+1}, \beta^{(0)}, \Sigma) \geq z_{(1 + \sqrt[N]{1 - \alpha})/2} \right\}, \tag{15} \]

where $z_{(1 + \sqrt[N]{1 - \alpha})/2}$ is the $((1 + \sqrt[N]{1 - \alpha})/2)$th quantile of $N(0, 1)$, that is, $\Phi(z_{(1 + \sqrt[N]{1 - \alpha})/2}) = (1 + \sqrt[N]{1 - \alpha})/2$.

Because $K$ is generally small, we also seek to account for the variability in the estimates of $\beta_i$ and $\Sigma$. Therefore, we adopt a noninformative Jeffreys-type prior distribution, namely,

\[ \pi(\beta, \Sigma) \propto |\Sigma|^{-1/2}. \tag{16} \]

In combination with the likelihood, this gives the posterior distribution

\[ \pi(\beta, \Sigma \mid D_K) \propto |\Sigma|^{-(KN+1)/2} \exp\left\{ -\frac{1}{2} \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - \beta_i)^\top \Sigma^{-1} (y_{ik} - \beta_i) \right\}. \tag{17} \]

We sample from this sequentially, with

\[ \beta_i \mid \Sigma \sim N(\hat{\beta}_i, \Sigma / K), \quad i = 1, \ldots, N, \tag{18} \]

\[ \Sigma^{-1} \sim W(\Psi^{-1}, NK - 12), \tag{19} \]

\[ \Psi = \sum_{k=1}^{K} \sum_{i=1}^{N} (y_{ik} - \beta_i)(y_{ik} - \beta_i)^\top, \tag{20} \]

where $W(\Psi, \nu)$ is a Wishart distribution with scale matrix $\Psi$ and $\nu$ degrees of freedom. This yields the following algorithm for online monitoring of the proportional usage data.

Yearly Monitoring Algorithm

Step 0

Set significance level $\alpha$.

Step 1

Compute credible level $\gamma_\alpha$.

• For each of $M$ replicates ($m = 1, \ldots, M$), perform the following simulations:
  – Simulate new data $y^{*(m)}_{i,K+1} \sim N(\hat\beta_i, \hat\Sigma)$ independently for $i = 1, \ldots, N$.
  – Perform $B$ replicates ($b = 1, \ldots, B$) of the following simulation:
    (i) Generate $\beta^{(0)}_{mb}$ and $\Sigma_{mb}$ from the "posterior" distribution

    $$\pi(\beta^{(0)}, \Sigma \mid D_K) \propto |\Sigma|^{-(KN+1)/2} \exp\left\{-\frac{1}{2}\sum_{k=1}^{K}\sum_{i=1}^{N} (y_{ik} - \beta_i^{(0)})'\Sigma^{-1}(y_{ik} - \beta_i^{(0)})\right\}. \tag{21}$$

    (ii) Calculate the indicator function

    $$\delta_b(y^{*(m)}_{K+1}, \beta^{(0)}_{mb}, \Sigma_{mb}) = \mathbb{1}\left\{Z_{\max}(y^{*(m)}_{K+1}, \beta^{(0)}_{mb}, \Sigma_{mb}) \geq z_{(1 + \sqrt[N]{1 - \alpha})/2}\right\}, \tag{22}$$

    where $y^{*(m)}_{K+1} = ((y^{*(m)}_{1,K+1})', \ldots, (y^{*(m)}_{N,K+1})')'$.
  – Calculate $\hat q_m = \frac{1}{B}\sum_{b=1}^{B} \delta_b(y^{*(m)}_{K+1}, \beta^{(0)}_{mb}, \Sigma_{mb})$.
• Compute $\gamma_\alpha$ as the $\alpha$th percentile of the empirical distribution of the $M$ values of $\hat q_m$.

Step 2

Monitor.

• Using the observation $y_{K+1}$ in the $K+1$ year, perform $B$ replicates ($b = 1, \ldots, B$) of the following:
    (i) Generate $\beta^{(0)}_b$ and $\Sigma_b$ from the "posterior" distribution

    $$\pi(\beta^{(0)}, \Sigma \mid D_K) \propto |\Sigma|^{-(KN+1)/2} \exp\left\{-\frac{1}{2}\sum_{k=1}^{K}\sum_{i=1}^{N} (y_{ik} - \beta_i^{(0)})'\Sigma^{-1}(y_{ik} - \beta_i^{(0)})\right\}. \tag{23}$$

    (ii) Calculate the indicator function

    $$\delta_b(y_{K+1}, \beta^{(0)}_b, \Sigma_b) = \mathbb{1}\left\{Z_{\max}(y_{K+1}, \beta^{(0)}_b, \Sigma_b) \geq z_{(1 + \sqrt[N]{1 - \alpha})/2}\right\}. \tag{24}$$

• Calculate $\hat q = \frac{1}{B}\sum_{b=1}^{B} \delta_b(y_{K+1}, \beta^{(0)}_b, \Sigma_b)$.
• If $\hat q \geq \gamma_\alpha$, do not reject $H_0$, i.e., no abnormal consumptions in the $K+1$ year.
• If $\hat q < \gamma_\alpha$, reject $H_0$, and report that there are abnormal logit transformed yearly proportional consumptions for at least one account, which corresponds to account $i^*$ such that $i^* = \mathrm{argmax}_{1 \leq i \leq N}\{Z_i(y_{i,K+1}, \hat\beta_i, \hat\Sigma)\}$.

A feature of the transformation in Equation 5 is that it applies separately to each of the 12 months of the year. This contrasts with the more traditional multiple logistic transformation that would express each of 11 months relative to a 12th, benchmark month, which omits one month in order to enforce the constraint that the 12 proportions sum to unity. As we observe in the boxplot-based method, a substantial shift in the consumption of a single month affects the proportion values for the other 11 months, as well, which introduces complicated dynamics in the relative behavior of the overall proportion vector. Expressed in the diagonal matrix $C$, this chaotic behavior would require search over a large space of possible values of $C$ to elaborate the wide range of possible distortions across the 12 monthly proportions.

Our method is far simpler, and far more interpretable: for a given month $m$ in a given account $i$, the alternative hypothesis specifies a multiplicative increase in the log odds of the proportion $p_{ikm}$.
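The quantities in Equations 4, 5, 9, 10, 13, 14 and 15 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes NumPy is available, that usage is stored as a positive array of shape (N, K, 12), and all function names (`logit_proportions`, `mle`, `z_stats`, `threshold`, `q_hat`) are hypothetical. The posterior sampler of Equations 18 to 20 is deliberately left out; `q_hat` simply accepts any iterable of posterior draws.

```python
import numpy as np
from statistics import NormalDist


def logit_proportions(x):
    """Eqs. 4-5: x has shape (N, K, 12) with positive monthly usage."""
    p = x / x.sum(axis=-1, keepdims=True)   # yearly proportions
    return np.log(p / (1.0 - p))            # month-wise logit


def mle(y):
    """Eqs. 9-10: per-account means and pooled covariance from y of shape (N, K, 12)."""
    n, k, _ = y.shape
    beta_hat = y.mean(axis=1)                                # (N, 12)
    resid = (y - beta_hat[:, None, :]).reshape(n * k, -1)    # (N*K, 12)
    sigma_hat = resid.T @ resid / (n * k)                    # (12, 12)
    return beta_hat, sigma_hat


def z_stats(y_new, beta0, sigma, c_diag):
    """Eq. 13 for every account; y_new and beta0 are (N, 12), c_diag is (12,)."""
    shift = (c_diag - 1.0) * beta0                  # rows are (C - I) beta_i
    a = np.linalg.solve(sigma, shift.T).T           # rows are Sigma^{-1} (C - I) beta_i
    num = np.abs(np.einsum("ij,ij->i", a, y_new - beta0))
    den = np.sqrt(np.einsum("ij,ij->i", a, shift))
    return num / den


def threshold(alpha, n_accounts):
    """Eq. 15: the ((1 + (1 - alpha)^(1/N)) / 2) standard normal quantile."""
    level = (1.0 + (1.0 - alpha) ** (1.0 / n_accounts)) / 2.0
    return NormalDist().inv_cdf(level)


def q_hat(y_new, draws, c_diag, alpha):
    """Share of posterior draws (beta0, sigma) whose Z_max (Eq. 14) meets the threshold."""
    cut = threshold(alpha, y_new.shape[0])
    hits = [z_stats(y_new, b0, s, c_diag).max() >= cut for b0, s in draws]
    return float(np.mean(hits))
```

Because $\Sigma$ is symmetric and $C$ is diagonal, solving $\Sigma a = (C - I)\beta_i^{(0)}$ once per account gives both the numerator and the denominator of Equation 13 without forming $\Sigma^{-1}$ explicitly.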
A disadvantage of this approach is that the inverse-logit transformed proportions do not sum to 1, but the benefit is interpretability of the change in a given month's proportion. Additionally, this approach sidesteps the possibility of chaotic behavior across the full proportion vector $y_{ik}$, in favor of a marginal effect within each month that does not directly impact the mean value in other months.

To demonstrate the relative performance of the two monitoring approaches, we proceed with a constructive data analysis. We start with accounts 1 and 2, identified by the model-free procedure as outliers. We also estimate the full statistical model across $K = 5$ years.

Suppose we know a priori that a given series $i$ is an outlier in the sense of being out-of-control in a given year. Within the model, this means the mean value of the new year $(K+1)$ is $C\beta_i$. Under $H_1$, we can calculate the effective nominal increase in magnitude for a given month $m$ as $\hat C_{im} = y_{i(K+1)m}/\beta_{im}$. The accounts with large values along the diagonal of $\hat C$ are of intuitive interest, so we consider two more accounts. First, we calculate the sum of the squared diagonal for the full matrix $\hat C_i$, namely, $\sum_{m=1}^{12} \hat C_{im}^2$, and consider the account with the largest such value. For the second account, we apply the same procedure to the subset of indices $m$ that correspond to the months October–April.

These four accounts are shown in Figure 8. In the top panel, we plot the fitted mean value based on $\hat\beta_i$ on the proportional scale, as well as the new data observation $y_i$. For reference, the historical data is shown in gray. The second row shows the same data, transformed to the logit scale. Finally, the bottom row shows the ratio of $y_{i(K+1)}$ to $\hat\beta_i$, which can be thought of as the nominal proportional increase $\hat C_i$ of the new observation to its mean, the most direct estimate of the true change from the in-control value.
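The ratio $\hat C_{im}$ and the two selection rules described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the July-to-June month ordering is an assumption read off the figure axes, and the names `MONTHS`, `COLD`, `c_hat`, `sum_sq` and `most_extreme` are hypothetical.

```python
# Assumed fiscal-year month order (July through June), inferred from the figure axes.
MONTHS = ["Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
          "Jan", "Feb", "Mar", "Apr", "May", "Jun"]
# Positions of the months October through April in that ordering.
COLD = [MONTHS.index(m) for m in ("Oct", "Nov", "Dec", "Jan", "Feb", "Mar", "Apr")]


def c_hat(y_new, beta_hat):
    """Month-wise ratio of the new logit value to the historical mean logit."""
    return [y / b for y, b in zip(y_new, beta_hat)]


def sum_sq(ratios, idx=None):
    """Sum of squared ratios over all months, or over the month positions in idx."""
    idx = range(len(ratios)) if idx is None else idx
    return sum(ratios[m] ** 2 for m in idx)


def most_extreme(y_new_by_acct, beta_by_acct, idx=None):
    """Account key with the largest sum of squared ratios."""
    scores = {
        acct: sum_sq(c_hat(y_new_by_acct[acct], beta_by_acct[acct]), idx)
        for acct in beta_by_acct
    }
    return max(scores, key=scores.get)
```

Calling `most_extreme(y_new, beta)` applies the first rule (all 12 months), and `most_extreme(y_new, beta, COLD)` applies the second rule restricted to October through April.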
[Figure 8: three rows of panels for Acct 1 through Acct 4, with month on the horizontal axis (tick labels Jul, Sep, Nov, Jan, Mar, May). Top row: proportion $p_i$, with legend "Estimate $\hat\beta_i$", "New data $y_{i(K+1)}$", "Historical $y_{ik}$". Middle row: logit($p_i$). Bottom row: $\hat C = y_{i(K+1)}/\hat\beta_i$.]

Figure 8: Top: Monthly proportional energy consumption $p_i$ for four accounts: accounts 1 and 2 identified using the model-free approach, and accounts 3 and 4 identified as exhibiting the largest ratio $\hat C_i = \sum_{m=1}^{12} y_{i(K+1)m}/\hat\beta_{im}$ across all the data (account 3) and across only the months October–April (account 4). Shown are in-control historical proportions and logit-transformed mean value. Hypothesis testing is performed for the single new observation in period $(K+1)$, or 2017-2018. Middle: Monthly logit-scale proportional usage values, showing estimated mean value and corresponding transformed data $y_i$ for each account. Estimation and hypothesis tests are performed on these transformed data points. Bottom: Empirical increase in monthly consumption in year $(K+1)$ as a percent of the mean value. Noted is the threshold $C = 1.5$, which reliably identifies large deviations from the mean value.

As noted, the first two accounts in Figure 8 were identified using the model-free procedure. Two scenarios are present. In the first account, the new observation in period $(K+1)$ is consistent with past behavior, and the model-free method likely identified the outlier based on deviations from the group-wide trends. In the second account, the observation in the new period exhibits substantial differences from the historical trend. Accordingly, the first example is not likely to be well-suited to identification with the model-based approach.

The figure also shows the transformed values of the proportions on the logit scale, as well as the empirical estimates of $\hat C$ for both accounts. For the first account, we see a stable ratio of observed, new values to the estimates based on historical data, and none of the ratios is much larger than 1. On the other hand, the second account exhibits a sharp increase in the mean parameter associated with November. The magnitude is large at over 1.5, and this is an intuitively appealing value. Moreover, the second is of the type of outlier the model is well-suited to identify: the new observation is not consistent with past behavior during the cold months, from November to March. The decrease in November may reflect the increase in December, a nonlinear relationship that is difficult to specify in terms of the matrix $C$. Nonetheless, the empirical estimates suggest a clear discontinuity between the current observation and estimates based on its past behavior.

We do note that the first outlier account exhibits one historical series with low numerical stability: in one year, July and August both exhibited very low consumption in that account, and this is exaggerated by the logit transformation into a very large, negative value.
The estimation procedure itself is robust to this large observation, and the estimate for that month is only somewhat influenced by these outliers.

In the first of the remaining two accounts, we observed large values of $\hat C_i$, but they occur in the edge months of July, August, and June. Although the magnitude is substantial, the real proportion of usage is quite small in these months. Although the differences may be statistically large, these deviations are not practically significant: they do not appear to reflect major behavioral changes in the series' yearly proportional pattern, so much as artifacts of the logit transformation. This suggests that, in order for the model-based approach to work, we should monitor for changes of magnitude around c = 1 . C . However, the types of out-of-control behavior come with caveats: we require an explicit magnitude $C$ of change in the mean parameters $\beta_i$, we should only consider specific months, and we must calibrate the values of all 12 diagonal elements $C_m$ to accommodate a change of specific magnitude in only one month. Similar exercises are necessary for changes in each of the other months that may be of interest.

Second, we observe that the model-based approach has low power to identify patterns in the proportions wherein a single account exhibits behavior out-of-sync with other, similar accounts. The grouping structure is only employed in the model-based approach to account for low sample sizes and to pool estimation of the variance. The model itself does not capture deviations in single accounts from their group-wise trends.

Despite its comparative simplicity, the model-free approach requires far fewer assumptions, and identifies a wider range of out-of-control behaviors in the data.

Because of small sample sizes, the model-based approach requires more detailed specification of the full scenarios, and the grouping of similar accounts is a necessary component of the model to permit pooling of information.
But this will only increase precision on estimates of individual accounts' mean profiles, which may miss accounts that deviate from group-wise patterns not captured by an account-level model.

For accounts such as those in Apartment Complex A, the homogeneous behavior we observe in the data reinforces known structural similarities between the accounts. But in cases such as the single account of service type "seasonal - commercial" in Table 2, such a grouping is unavailable. To this end, we also sought a statistical method to group accounts with similar behavior. This would be followed by flagging anomalous accounts and/or times using methods similar to Section 4. We describe hierarchical clustering based on the intra-year proportions introduced in Section 4.

We constructed fiscal year proportions for 233 accounts, removing one account with a negative observation; three accounts with missing observations; and one account in Apartment Complex B that is primarily 0. We considered two years of data, fiscal years 2016–2018. We computed the Euclidean distances between accounts based on 24 data points, under the assumption they are independent. For two vectors of 24 proportions (two years of proportions), $p_i$ and $p_j$, $i \neq j$, we calculated the distances

$$d(p_i, p_j) = \sqrt{\sum_{k=1}^{24} (p_{ik} - p_{jk})^2}.$$

We then applied hierarchical clustering (Johnson et al., 2002). We used Ward's method to cluster the accounts (Ward Jr, 1963), which minimizes the error sum-of-squares when agglomerating clusters. To select the number of clusters, we used the silhouette criterion. This offers a conventional procedure for identifying the number of clusters in a set of data by using the tightness of groups of observations to determine the optimal number of groups (Rousseeuw, 1987).

We first applied the clustering to the residential utility accounts, as defined in Section 2 but with some series removed, as previously discussed in this section.
We considered 69 accounts in Apartment Complex A and 11 accounts in Apartment Complex B. The clusters are shown in Figure 9. In particular, cluster 1 consists of 23 accounts from Apartment Complex A, while cluster 2 contains all of the Apartment Complex B accounts and the remaining 46 accounts in Complex A. The clusters are distinguished by comparatively high usage in June in cluster 1 and a smoother overall pattern during the academic year, and high November usage in cluster 2 with a greater decrease in January.
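The clustering steps described above (Euclidean distances on 24-dimensional proportion vectors, Ward linkage, and selection of the number of clusters by the silhouette criterion) can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes NumPy and SciPy are available, the function names `mean_silhouette` and `ward_clusters` and the cap `k_max` are hypothetical, and singleton clusters are given a silhouette width of 0 by convention.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def mean_silhouette(X, labels):
    """Average silhouette width under Euclidean distance."""
    D = squareform(pdist(X))              # pairwise Euclidean distances
    ids = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                   # exclude the point itself
        if not same.any():                # singleton cluster: width 0 by convention
            scores.append(0.0)
            continue
        a = D[i, same].mean()             # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


def ward_clusters(X, k_max=6):
    """Ward linkage on the rows of X; return (k, silhouette, labels) for the best k."""
    Z = linkage(X, method="ward")         # Ward's method, Euclidean metric
    best = None
    for k in range(2, k_max + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(np.unique(labels)) < 2:
            continue
        s = mean_silhouette(X, labels)
        if best is None or s > best[1]:
            best = (k, s, labels)
    return best
```

Here each row of `X` would hold one account's 24 fiscal-year proportions; the returned labels can then be used to form the groups to which the monitoring methods are applied.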

[Figure 9: two panels, "Cluster 1: 23 residential acct.: FY 2016-2018" and "Cluster 2: 57 residential acct.: FY 2016-2018". Vertical axis: Proportion. Horizontal axis: month across two fiscal years.]

Figure 9: Fiscal-yearly proportion vectors for clustered residential accounts (Apartment Complexes A and B).

To cluster the non-residential accounts, we used the service groups defined in Table 2, namely, small and medium-to-large accounts. Within each of these groups, we performed hierarchical clustering. We combined the partitions of all non-residential accounts, defined in terms of the account size and the two clusterings. Finally, we combined all degenerate singleton clusters.

The medium-to-large accounts were found to have seven clusters, two of which were degenerate, and the small accounts contained three clusters, one of which was degenerate. This yielded the final eight clusters shown in Figure 10. In addition to the visual homogeneity of these groups, we report in Table 4 the cluster assignment counts for all non-residential accounts for buildings containing more than one account. We note that accounts are generally grouped according to the known building labels, such as Apartment Complex F and Apartment Complex E, Building 4.

[Figure 10: eight panels of fiscal-year proportion vectors, FY 2016-2018. Vertical axis: Proportion. Horizontal axis: month across two fiscal years. Panel titles: Cluster 1: 15 medium to large non-residential acct.; Cluster 2: 7 small non-residential acct.; Cluster 3: 29 medium to large non-residential acct.; Cluster 4: 40 small non-residential acct.; Cluster 5: 54 small non-residential acct.; Cluster 6: 3 small non-residential acct.; Cluster 7: 2 small non-residential acct.; Cluster 8: 3 degenerate non-residential acct.]

Figure 10: Cluster partition of utility-designated "non-residential" accounts.

Building                         Cluster: 1   2   3   4   5   6   7   8
Ag 1                                      -   -   -   1   1   -   -   -
Ag 2                                      1   -   -   1   -   -   -   -
Dormitory 1                               -   -   -   -   -   1   1   -
Student Union                             -   -   2   -   -   -   -   -
Dormitory 2                               -   -   2   -   1   -   -   -
Dormitory 3                               -   -   -   -   7   -   -   -
Apartment Complex C                       -   -   7   -   -   -   -   -
Apartment Complex D                       -   -   4   -   3   -   -   -
Apartment Complex E, Bldg. 1              -   -   -   -   8   -   -   -
Apartment Complex E, Bldg. 2              -   -   -   1   7   -   -   -
Apartment Complex E, Bldg. 3              -   -   -   -   8   -   -   -
Apartment Complex E, Bldg. 4              -   -   -   -  12   -   -   -
Apartment Complex F                       -   -   1  20   -   -   -   -

Table 4: Cluster assignment for utility-designated "non-residential" accounts.

Within each of the homogeneous clusters, we then used the approach of Section 4. Under this procedure, we identify possible anomalies across all accounts, even those for which known homogeneous structure is unavailable.

Implementation for Management

Having specified a method for anomaly detection among fiscal yearly proportions, the final steps are implementation, integration into the Facilities Operations energy management workflow, and a mechanism to incorporate feedback from engineers into the statistical model. We pursued seamless interaction between the Department of Statistics and Facilities Operations through an intense collaboration on a weekly, or even more frequent, basis. These meetings occurred face-to-face as well as electronically, and provided opportunities for dynamic exchange of ideas.

Implementation requires, on the one hand, aggregation, storage, and analysis of the data; and on the other hand, an interface with which Facilities Operations may review the results of the anomaly detection as well as manage the information provided by the analysis.

Data management and analysis proceed from records of monthly utility usage, in terms of the specific dataset discussed above, as well as a suite of additional datasets collected on separate utility services including electric, water, and sewage. We implemented an automated software procedure to periodically download data from a state-wide reporting system used to track state utility records. After extracting utility data via a web API, we stored inputs and outputs in a database server. We also integrated into the same database system those datasets not yet available through the web-based procedure. The procedures were hosted on a Linux server and backed by a Microsoft SQL Server database. We implemented the automation procedures in a combination of shell scripts, Python, and R. These procedures included automatic merging of the utility data with NOAA weather data, normalization and degree day-adjustment, and the analysis detailed in the previous sections. A basic flow diagram of the various inputs is shown in Figure 11.

[Figure 11 boxes: State of CT reporting API for utility usage; NOAA weather data API; Additional utility usage data; SQL database; Data processing, analysis; Interactive web application.]

Figure 11: Flow diagram of data ingestion, storage, processing, and presentation to operators.

Following the pre-processing and analysis steps, we provided a web interface for dynamic interaction with the data and statistical outputs. These primarily consisted of a database with a record of which utility accounts were flagged as anomalous, which Facilities Operations engineers interact with through a dashboard interface implemented in R Shiny (Chang et al., 2018). This software implementation provides professional-quality, contemporary web technology, as well as back-end implementation that may be managed, customized, and extended by statisticians who were not previously trained in web development. We implemented the flagging system in such a way that Facilities Operations may dismiss flags after review of the physical facilities, and each dismissal is recorded in the database. The cluster results of the flagging interface are shown in Figure 12.

Figure 12: Cluster analysis output for flagged accounts. Account names and numbers are redacted for privacy purposes.

Aside from interaction with the flagging system, the primary mechanism for operator feedback is through the composition of the homogeneous groups to which the boxplot-based analyses are applied. Although the cluster analysis provides a good option in the absence of operator-derived groupings, it is improved by operation in tandem with the extensive domain experience of the Facilities Operations engineers. Therefore, Statistics and Facilities Operations can work in tandem to review and verify the validity of the groupings, as well as the sensibility of the resultant analysis. In some instances, such as the laundry dryer in Apartment Complex A, Building 3, it is necessary to manually adjust even the known homogeneous groups to more accurately reflect the ground truth of similarity in energy consumption profiles of accounts in the same group.

We have described the development and implementation of easy-to-understand statistical methods in an applied context. The tools and approaches described in this article offer a useful framework for analyzing and monitoring energy usage on a university campus through an ongoing collaboration between members of the Statistics Department and Facilities Operations staff. The energy monitoring has enabled us to gain deeper insight into energy usage at specific sites of interest and identify potential problems without requiring extensive manual searches. An attractive characteristic of this project is that an academic department in an R1 institution has coordinated a long-term problem solving collaboration with the university's administrative units.

We have proposed a model-free graphical procedure for detecting anomalous values both for monthly energy usage and for monthly proportions within a given year. We have also formulated a simple model-based approach in the Bayesian framework which is useful even with small sample sizes. These approaches have been automated and can also be easily modified for similar operational problems that can benefit from applying simple and informative statistical methods to a large number of observational units. Our project showcases well-informed statistical practice in collaboration with non-technical counterparts.

Acknowledgements

The authors would like to thank the Utility Operations & Energy Management team at Facilities Operations, especially Mark Bolduc and Brian McKeon, for their time, resources, and expertise. We would like to thank the Editor, the Associate Editor, and two reviewers for their helpful comments and suggestions, which led to an improved version of the paper.

References

Breyer, A., Etter, D., Friedrichs, M., Guerriero, R., Ho, M. N., and Kerns, N. (Published 2011. Accessed July, 2020). Assessing a campus energy monitoring system. http://graham.umich.edu/media/files/campus-course-reports/CEMS%20Final%20Report.pdf.

Chang, W., Cheng, J., Allaire, J., Xie, Y., and McPherson, J. (2018). shiny: Web application framework for R. R package version 1.2.0.

Cruz Rios, F., Naganathan, H., Chong, W. K., Lee, S., and Alves, A. (2017). Analyzing the impact of outside temperature on energy consumption and production patterns in high-performance research buildings in Arizona. Journal of Architectural Engineering, 23(3):C4017002.

Fu, Y. and Jeske, D. R. (2014). SPC methods for nonstationary correlated count data with application to network surveillance. Applied Stochastic Models in Business and Industry, 30(6):708–722.

Johnson, R. A., Wichern, D. W., et al. (2002). Applied Multivariate Statistical Analysis. Prentice Hall, Upper Saddle River, NJ.

Lieberman, E. (2010). Web-based display tracks campus energy use.

Little, R. J. and Rubin, D. B. (2019). Statistical Analysis with Missing Data, volume 793. Wiley.

Ma, Y., Lu, M., and Weng, J. (2015). Energy consumption status and characteristics analysis of university campus buildings. In . Atlantis Press.

Ma, Z., Song, J., and Zhang, J. (2017). A real-time detection method of abnormal building energy consumption data coupled POD-LSE and FCD. Procedia Engineering, 205:1657–1664.

Moritz, S. and Bartz-Beielstein, T. (2017). imputeTS: Time series missing value imputation in R. The R Journal, 9(1):207–218.

O'Hara, C., Hobson-Dupont, M., Hurgin, M., and Thierry, V. (Published 2007, Accessed July, 2020). Monitoring electricity consumption on the WPI campus. https://web.wpi.edu/Pubs/E-project/Available/E-project-060107-130245/unrestricted/iqpfinaldraft.pdf.

Quayle, R. G. and Diaz, H. F. (1980). Heating degree day data applied to residential heating energy consumption. Journal of Applied Meteorology, 19(3):241–246.

Rashid, H. and Singh, P. (2018). Monitor: An abnormality detection approach in buildings energy consumption. In , pages 16–25. IEEE.

Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Seem, J. E. (2007). Using intelligent data analysis to detect abnormal energy consumption in buildings. Energy and Buildings, 39(1):52–58.

Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244.

Zhao, L. (2014). A novel method for detecting abnormal energy data in building energy monitoring system.