Machine learning based automated identification of thunderstorms from anemometric records using shapelet transform

Abstract

Detection of thunderstorms is important to the wind hazard community to better understand extreme wind field characteristics and the associated wind-induced load effects on structures. This paper contributes to this effort by proposing a new course of research that uses machine learning techniques, independent of wind statistics-based parameters, to autonomously identify and separate thunderstorms from large databases containing high-frequency sampled continuous wind speed measurements. In this context, the use of the Shapelet transform is proposed to identify key individual attributes distinctive to extreme wind events based on the similarity of shape of their time series. This novel shape-based representation, when combined with machine learning algorithms, yields a practical event detection procedure that requires minimal domain expertise. In this paper, the shapelet transform along with a Random Forest classifier is employed for the identification of thunderstorms from 1 year of data from 14 ultrasonic anemometers that are part of an extensive in-situ wind monitoring network in the Northern Mediterranean ports. A collective total of 235 non-stationary records associated with thunderstorms were identified using this method. The results enlarge the pool of thunderstorm data, enabling a more comprehensive understanding of a wide variety of thunderstorms that have not previously been detected using conventional gust factor-based methods.


A Preprint

Monica Arul

NatHaz Modeling Laboratory, Department of Civil Engineering, University of Notre Dame, Notre Dame, IN 46556

Ahsan Kareem

NatHaz Modeling Laboratory, Department of Civil Engineering, University of Notre Dame, Notre Dame, IN 46556

Massimiliano Burlando

University of Genoa, Via Balbi, 5, 16126, Genoa, Italy

Giovanni Solari

University of Genoa, Via Balbi, 5, 16126, Genoa, Italy

Keywords: Thunderstorm detection · Time series shapelets · Shapelet Transform · Machine Learning · Wind monitoring network

A modern study of thunderstorms was first accomplished by the Thunderstorm Project [1], which was commissioned by the U.S. Congress in 1945 after a series of severe thunderstorm-related aircraft incidents. The report rendered a preliminary understanding of the life cycle, form, and distribution of thunderstorms and of their associated physical processes. These studies facilitated the burgeoning growth of thunderstorm research in the field of atmospheric sciences. This led to a meteorological line of research that used anemometers, radars, barometers, radiosondes, and instrumented aircraft to study the causes and attributes of thunderstorms. For instance, [2] used the data from anemometers, thermometers, barometers and hygrometers mounted along a 444 m high transmission tower to study an intense event that occurred in Oklahoma in May 1969. [3] analyzed the characteristics of 20 thunderstorm outflows using radar images combined with the data collected from a 461 m tower equipped with meteorological sensors. [4] carried out a detailed analysis of the life cycle of thunderstorm gust fronts using the data collected from Doppler radars, mesonet stations and rawinsondes. [5] described the passing of a thunderstorm downburst over a suburb of Brisbane, Australia, with the help of data collected from an instrumented tower along with radar images. [6] documented the varying thermodynamic and kinematic characteristics of two extreme thunderstorm outflows utilizing high-resolution near-surface observations. [7] analyzed a supercell thunderstorm that produced a rear-flank downdraft which passed over 7 monitored towers in Lubbock, Texas, in June 2002; they also examined data from Doppler radars and meteorological soundings.
[8] provided a deeper understanding of the characteristics of thunderstorm outflow winds and wind profiles using data collected for Project SCOUT using mobile Doppler radars.

During the same period when thunderstorms were researched with fervor by meteorologists, wind engineers also realized that several wind events in the mid-latitude region that caused catastrophic damage to the built environment were mainly due to thunderstorms [9]. This led to a surge in research in the field of wind engineering, guided by the one that took place in atmospheric science. The wind engineering line of research is mostly based on anemometric recordings, without analyzing the meteorological information related to the wind event or the weather scenarios out of which they developed or, at most, utilizing the correlation between some meteorological quantities or phenomena (e.g., rain, surface temperature, thunder, lightning) to infer the presence of thunderstorm cells and strong convection in the atmosphere. A majority of the initial research focused on identifying thunderstorms from a mixed-wind climate to enable separate analyses [10]. [11] studied the separation of thunderstorms and non-thunderstorms and also obtained distributions of gust wind speeds in Australia. [12] segregated thunderstorms from other extreme wind events using wind duration, presence of thunder or lightning, rainfall, and a sudden fall in temperature. [13] examined the statistics of thunderstorm and non-thunderstorm winds collected from various cities in the United States. [14] classified wind events as thunderstorms and non-thunderstorms based on whether thunder and rain were recorded. [15] separated thunderstorms from monsoon winds based on gust factor and found that the gust factor of thunderstorms is always higher than that of monsoon winds.
[16] claimed that in mid-latitude areas, thunderstorms cannot be clearly separated from synoptic depressions due to the existence of a third set of events, called gust fronts, which present intermediate properties. The extraction criterion in [16] was based on the evaluation of three parameters: mean wind speed, peak wind speed and gust factor. An automated method to extract and classify thunderstorms from non-thunderstorm wind data present in the Automated Surface Observing System (ASOS) was developed by [17] for extreme value analysis. The identification of thunderstorms was based on weather observations, peak wind data and thunderstorm start and end times reported by manual observers. [18] established a semi-automated method based on gust factor to separate and classify extra-tropical depressions, thunderstorms, and gust fronts using quantitative controls and qualitative judgments. Concurrently with these developments, models for capturing the load effects of thunderstorm outflows have been advanced, which can significantly benefit from the additional knowledge derived from the new data being collected [19, 20].

Over the years, the meteorological and wind engineering fields have witnessed a tremendous increase in the volume of data related to wind events due to the rapidly growing wind monitoring networks and stations that can continuously record wind field measurements with high sampling rates. This has led to massive amounts of high-dimensional data collected continuously over time and stored as time series. From a “big data” perspective, many of the meteorological and wind engineering techniques mentioned above hold little utility for mining massive wind databases that require indexing, predicting, classifying, clustering, segmenting, and identifying patterns from data. Thus, given the wide prevalence of big data, there has been increased attention to using machine learning techniques to automatically detect desired events from large databases [21].
In this paper, a new line of research is proposed that uses machine learning techniques to autonomously identify and separate desired wind events, thunderstorms in the present case, from large volumes of continuous data. Since the data from the wind monitoring networks are stored as time series, an efficient time series representation named “Shapelet transform” is combined in this paper with a machine learning algorithm to identify thunderstorms from anemometric records. The Shapelet Transform is a time series representation technique that is based solely on the time series shape [22]. For example, in terms of wind speed, each type of extreme wind event has its own unique time series signature as it passes over an instrumented station. The Shapelet Transform technique captures these unique time series shapes corresponding to each extreme event, and the machine learning algorithm uses these shapes to identify thunderstorms from a large database. Moreover, a time series shape-based approach can unravel unseen corners of random phenomena like wind, which might have been overlooked by analysts using conventional meteorological and wind engineering parameters.

Section 2 gives a general overview of the shapelet transform and elaborates on the five major stages in the transform with the help of illustrative examples. Section 3 describes the wind monitoring network that has been used in this study. The network has been installed in the ports of the High Tyrrhenian Sea for the “Wind and Ports” (WP) and “Wind, Ports and Sea” (WPS) projects funded by the European Cross-border program to investigate extreme wind events in port areas. Section 4 explains the preprocessing of the raw wind velocity data acquired from the monitoring network. In particular, a wavelet-based denoising method is employed to remove a significant amount of noise while retaining the important features in the signal even when the noise is non-uniform.
In Section 5, the different shapes discovered from the wind speed measurements are elaborated, and a comprehensive summary of the new thunderstorms detected using these shapes is given. Section 6 provides a synopsis of the shapelet transform-based thunderstorm detection method.

Wind speed measurements recorded by an anemometer during a thunderstorm and during normal periods are shown in Fig. 1. A sharp peak in wind velocity appears for a short duration in the thunderstorm time series, which differs significantly from the other time series. Such local shapes have high discriminatory power and can be used to distinguish between various types of time series. Thus, local shapes with high discriminatory power that may occur anywhere in a time series are termed shapelets. In the present case, wind velocity shapelets serve as a dominant attribute for identifying thunderstorms from a large anemometric database containing continuous records. The identified shapelets are used to transform the original wind speed data, where each attribute of the transformed dataset denotes the distance between a time series and a shapelet [22]. The transformed dataset is then passed to a machine learning algorithm that identifies thunderstorms.
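The idea that each transformed attribute is a distance between a time series and a shapelet can be illustrated with a minimal Python sketch. It assumes the distance is the smallest Euclidean distance between the shapelet and any equally long window of the series; the function names and the toy wind speed values are hypothetical and are not taken from the study's code or data.

```python
import math

def min_distance(series, shapelet):
    """Smallest Euclidean distance between `shapelet` and any
    equally long window of `series` (sliding one sample at a time)."""
    l = len(shapelet)
    best_sq = math.inf
    for start in range(len(series) - l + 1):
        sq = sum((series[start + i] - shapelet[i]) ** 2 for i in range(l))
        best_sq = min(best_sq, sq)
    return math.sqrt(best_sq)

def shapelet_transform(dataset, shapelets):
    """One row per time series, one column per shapelet."""
    return [[min_distance(ts, s) for s in shapelets] for ts in dataset]

# Toy data (illustrative values only): a short gust-like shapelet,
# one record that contains it and one calm record that does not.
gust = [2.0, 9.0, 3.0]
storm = [2.0, 2.1, 2.0, 9.0, 3.0, 2.5]
calm = [2.0, 2.1, 2.0, 1.9, 2.0, 2.1]

features = shapelet_transform([storm, calm], [gust])
```

In this toy case the record containing the gust-like shape receives a distance of zero to the shapelet, while the calm record receives a large distance, which is the contrast a classifier can then exploit.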

Figure 1: Time series of wind speed measurements for a thunderstorm (shapelet in red) vs a non-thunderstorm. [Two panels, Thunderstorm and Non-thunderstorm; axes: Time (s) vs Wind Velocity (m/s).]

Fig. 2 shows the main features of the “Wind and Ports” (WP) [23] and “Wind, Ports and Sea” (WPS) [24, 25] projects financed by the European Cross-border program “Italy–France Maritime 2007-2013”. These projects address safe wind management and risk assessment for selected ports in Italy and France with the help of an extensive in-situ wind monitoring network. WP consists of 23 ultrasonic anemometers distributed in the ports of Genoa, La Spezia, Livorno, Savona–Vado Ligure, and Bastia. WPS, an enhancement of the WP network, consists of five additional ultrasonic anemometers installed in the ports of Savona, La Spezia, Livorno, and L’Île-Rousse. The ultrasonic anemometers measure wind speed and direction with a precision of 0.01 m/s and 1°, respectively. The sampling rate of the anemometers is set to 10 Hz for the Ports of Genoa, La Spezia, and Livorno. One anemometer in the Port of Savona has a sampling frequency of 1 Hz while the others are set to 10 Hz. The anemometers in the Ports of Bastia and L’Île-Rousse have a sampling frequency of 2 Hz. Apart from anemometers, three Lidar (Light Detection and Ranging) wind profilers and three weather stations, each comprising an additional ultrasonic anemometer, a thermometer, a barometer, and a hygrometer, are also installed in the ports of Genoa, Savona, and Livorno. A detailed description of the installed Lidars can be found in [25].

Figure 2: Anemometric stations at the ports that are part of the “Wind and Ports” and “Wind, Ports and Sea” projects

The instruments are positioned uniformly across the port areas and are installed on high-rise towers or antenna masts atop buildings at 10 m above the ground level to record undisturbed wind speed measurements. Local servers located in the head office of each of these ports receive data from the instruments, and buffers accumulate the data over pre-defined intervals (10-min) for analyzing basic statistics such as average and peak wind speed and mean wind direction.
The servers then send the raw data and statistics to a central server located at the Department of Civil, Chemical and Environmental Engineering (DICCA) of the University of Genoa, where the data are systematically checked and validated before being stored in a database.

[26] used the semi-automatic, gust factor-based thunderstorm detection approach developed in [18] to identify 10-min, 1-hour and 10-hour thunderstorm records from 14 ultrasonic anemometers. These thunderstorms have been used as ground truth for the present study and are referred to here as “cataloged thunderstorms”. The analysis carried out in this paper is limited to the data gathered by the 14 ultrasonic anemometers used in [26]. Table 1 shows the main properties of the anemometers used in this study along with their periods of measurement.
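As a rough illustration of the interval statistics accumulated by the port servers (average and peak wind speed, mean wind direction), the following Python sketch computes them for one buffer of samples. Several choices here are assumptions rather than details from the source: the mean direction is computed as a circular mean, the gust factor is taken simply as the ratio of the peak sample to the interval mean (the detection procedure of [18] applies its own definition and thresholds), and the function name and sample values are hypothetical.

```python
import math

def interval_statistics(speeds, directions_deg):
    """Basic statistics for one buffer (e.g. a 10-min interval) of samples."""
    mean_speed = sum(speeds) / len(speeds)
    peak_speed = max(speeds)
    # Circular mean, so that 350 deg and 10 deg average to north, not south.
    s = sum(math.sin(math.radians(d)) for d in directions_deg)
    c = sum(math.cos(math.radians(d)) for d in directions_deg)
    mean_direction = math.degrees(math.atan2(s, c)) % 360.0
    # Assumed definition: peak sample divided by interval mean.
    gust_factor = peak_speed / mean_speed if mean_speed > 0 else math.inf
    return {
        "mean_speed": mean_speed,
        "peak_speed": peak_speed,
        "mean_direction": mean_direction,
        "gust_factor": gust_factor,
    }

# Illustrative values only.
stats = interval_statistics([4.0, 6.0, 5.0, 9.0], [350.0, 10.0, 0.0, 0.0])
```

A plain arithmetic mean of the directions in this example would give 90°, which is why a circular mean is used in the sketch.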

Close inspection of the data revealed several periods of measurement in which the recordings were extremely noisy and unreliable. It is thought that the majority of these noisy measurements were a result of interference from the radar on the approaching ships in the port areas. Excluding these measurements from the analysis would lead to a significant loss of valuable data. Hence, it is imperative to remove noise from the wind speed measurements to restore the completeness of the database. Canonical filters hold little utility in this case, as the noise characteristics are unknown. Moreover, great care needs to be taken during the denoising process so that important features of the signal, such as spikes, are preserved, as these sharp features can aid the shapelet transform in identifying the occurrence of thunderstorms. For this purpose, a robust denoising procedure based on the Stationary Wavelet Transform (SWT) [27] is used in this study, as wavelets localize features in the data to different scales and can retain signal features while removing noise.

Table 1: Details of the anemometers used in the study

SWT performs better in terms of denoising than the Discrete Wavelet Transform (DWT), as the former has the properties of shift and scale invariance, which are absent in the latter. The SWT-based denoising procedure is as follows:

• A suitable mother wavelet and decomposition level are determined for denoising.
• The wavelet transform is applied to the signal to obtain the wavelet coefficients.
• The wavelet transform concentrates the signal features in large-magnitude coefficients, and the small-magnitude coefficients are typically noise. A suitable threshold method and an appropriate threshold limit are applied to each level to remove noise.
• The signal is then reconstructed by applying the inverse wavelet transform to the thresholded coefficients.

Following the above-mentioned procedure, SWT is applied to the noisy signals using Daubechies db10 as the mother wavelet with a maximum of 8 levels of decomposition. Soft fixed-form thresholding is then applied to the wavelet coefficients to remove noise from the signal. Fig. 3 illustrates the effectiveness of this procedure. It can be seen from the figure that, in both cases, SWT-based denoising has removed a significant amount of the noise while retaining the major features in the signal, even when the noise is non-uniform.
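The four steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the configuration used in the study: it uses a single-level undecimated Haar transform with periodic extension instead of db10 with 8 levels, and it assumes that the fixed-form threshold is the universal threshold sigma * sqrt(2 ln N), with sigma estimated from the median absolute detail coefficient. All function names and sample values are hypothetical.

```python
import math
import statistics

def soft_threshold(coeff, thr):
    """Shrink a coefficient toward zero by `thr`; small ones become zero."""
    return math.copysign(max(abs(coeff) - thr, 0.0), coeff)

def universal_threshold(detail):
    """Assumed fixed-form threshold: sigma * sqrt(2 ln N), sigma from the MAD."""
    sigma = statistics.median([abs(d) for d in detail]) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(len(detail)))

def haar_swt_denoise(x):
    """One-level undecimated Haar transform, soft thresholding, inverse."""
    n = len(x)
    # Forward transform (no downsampling, periodic extension).
    approx = [(x[i] + x[(i + 1) % n]) / 2.0 for i in range(n)]
    detail = [(x[i] - x[(i + 1) % n]) / 2.0 for i in range(n)]
    # Threshold only the detail coefficients.
    thr = universal_threshold(detail)
    detail = [soft_threshold(d, thr) for d in detail]
    # Inverse: average the two redundant reconstructions of each sample.
    return [
        ((approx[i] + detail[i]) + (approx[i - 1] - detail[i - 1])) / 2.0
        for i in range(n)
    ]

# Illustrative values only: small fluctuations around 2 m/s with one sharp spike.
noisy = [2.0, 2.1, 1.9, 2.0, 9.0, 2.1, 1.9, 2.0]
denoised = haar_swt_denoise(noisy)
```

Because the transform is undecimated, every sample is reconstructed from two overlapping coefficient pairs, and soft thresholding only shrinks the large coefficients that carry the spike rather than removing them.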

The identification of thunderstorms involves three major stages as shown in Fig. 4. In the first stage, the preprocessed wind speed measurements are used to create a labeled time series learning set. The time series learning set is then utilized to transform the original data through five steps: generation of candidate shapelets, calculation of distance between a time series and a shapelet, assessment of shapelet quality, discovery of shapelets, and shapelet-based transformation of data. In a machine learning context, the features are the discovered shapelets and the instances are the time series.

Figure 3: Wavelet-based denoising procedure to remove noise from the raw anemometric records. [Panels: (a) wind speed measured at Livorno (LI) by anemometer no. 4 on Oct 1, 2010 at 02:00:00 AM; (b) wind speed measured at Livorno (LI) by anemometer no. 4 on Oct 1, 2010 at 08:00:00 AM. Each panel shows the signal (S), the denoised signal (DS) and the residual (R = S - DS); axes: Time (s) vs Wind Velocity (m/s).]

Figure 4: Methodology for thunderstorm identification using shapelet transform. [Flowchart stages: I. Shapelet transform (time series of wind speed measurement, generation of shapelet candidates, shapelet distance calculation, discovery of shapelets, shapelet transform); II. Training of classifier (shapelet transform with features and instances, training of classifier, trained classifier); III. Thunderstorm identification (new time series, shapelet transform, trained classifier, class labels “Thunderstorms” and “Other”).]

Thus, each element of the shapelet transform is the minimum Euclidean distance between a discovered shapelet and a time series in the learning set. In the second stage, the shapelet transform along with the associated class labels is used to train a random forest classifier to identify thunderstorms. The trained classifier is then employed in the third stage to detect thunderstorms from new wind speed measurements from the wind monitoring system. Each of these stages is explained in detail in the following sections.

As mentioned in Section 3, the thunderstorms identified by [26] using the semi-automated procedure developed by [18] are used as the ground truth. This procedure establishes a wind velocity threshold and uses gust factor to separate the dataset into cyclones, thunderstorms, and intermediate events; the expert judgment consists of a visual check of the wind velocity records. Using this procedure, a total of 198 thunderstorm events and 277 strongly non-stationary records corresponding to these events were identified from the 14 anemometers listed in Table 1. Of these, 120 records were detected by anemometers during 2010-2014 in the Port of Livorno, which is well known to experience frequent thunderstorms. These 120 wind speed records, along with other non-thunderstorm records, are used to train the shapelet algorithm to identify thunderstorms. The trained algorithm is then used to detect thunderstorms from all anemometers in the year 2015 (no measurements are available in 2015 for anemometer no. 1 in Genoa, so the year 2012 is considered) and the results are compared with the cataloged events.

Figure 5: Illustration of candidate shapelet generation for each time series. [Panels: creation of time series dataset (TS) with classes “Thunderstorm” and “Other”; generation of candidate shapelets (W).]

Let $TS = \{TS_1, TS_2, \ldots, TS_n\}$ be a time series dataset, where $TS_i = \langle t_{i,1}, t_{i,2}, \ldots, t_{i,m} \rangle$ is an individual time series containing wind speed measurements. Let the class labels for each time series be denoted by $C$. Here the class labels are “Thunderstorm” and “Other”. A set of input-output pairs is created using the individual time series and their associated labels, which serves as the learning set, $\Phi = (TS, C)$. In the present case, the learning set for the algorithm consists of 120 thunderstorm records labeled as “Thunderstorm” and 120 non-thunderstorm records labeled as “Other”. The learning set contains equal numbers of samples from both classes to avoid classifier bias during the identification of thunderstorms. The dataset is then randomly split into training (70%) and test (30%) sets, so the training set contains 168 time series records and the test set contains 72 records. The time series in the training set are used in the following steps to discover shapelets, and the efficacy of the method is tested on the time series in the test set.

As shown in Fig. 5, each subsequence of a time series is regarded as a prospective shapelet candidate. The smallest subsequence contains three data points, as it is the shortest possible length for a time series, and the largest subsequence is the length of the time series. Thus, if $m$ denotes the length of an individual time series and $l$ denotes the length of a subsequence, then there are $(m - l) + 1$ distinct subsequences of length $l$ in any individual time series. In the present study, the continuous wind speed measurements are broken down into 1-hour intervals sampled at 10 Hz. Thus, each time series in the training set has 36000 data points. Let us take the first time series in the training set for illustration. The set of all candidate shapelets for this time series is

$$W = \{w_3, w_4, \ldots, w_{35998}, w_{35999}, w_{36000}\} \qquad (1)$$

where $w_3$ (first three data points) is the shortest shapelet length and $w_{36000}$ (entire time series) is the longest shapelet length. Thus, the set $W$ contains 35998 different lengths of shapelets. In a similar way, shapelet candidates are generated from all of the time series in the learning set.

Figure 6: Illustration of calculation of Euclidean distance between a candidate shapelet and a time series. [Labels: candidate shapelet ($S_r$), time series ($TS_2$), $d_{S_r,2} = \min\{d_{S_r,2,1}, \ldots, d_{S_r,2,n}\}$.]

The similarity between a shapelet and a time series is measured using Euclidean distance. The squared Euclidean distance between two subsequences $X$ and $Y$ of the same length $l$ is $\sum_{i=1}^{l} (x_i - y_i)^2$. Consider a shapelet candidate $S_r$ and a time series $TS_2$ as shown in Fig. 6. The distance between $S_r$ and $TS_2$ is the minimum of the Euclidean distances calculated between $S_r$ and every subsequence of $TS_2$ of the same length. If the time series contains a shape very similar to the candidate shapelet, the Euclidean distance will be very low, and vice versa. Once the distance between a shapelet candidate and every time series in $TS$ is calculated this way, an orderline $D_S$ is created that contains the list of these distances along with their class labels. The orderline is sorted in ascending order of the distance value. Thus, if there are $n$ time series, the orderline $D_{S_r}$ for a shapelet candidate $S_r$ is given by $D_{S_r} = \langle d_{S_r,1}, d_{S_r,2}, \ldots, d_{S_r,n} \rangle$.

In the present study, each time series leads to the generation of 35998 shapelet candidates as seen in Eq. (1). Each of these 35998 shapelets is compared with the other time series using Euclidean distance. So, for the present case, the first candidate in $W$ is compared with the 167 other time series in the training set using the minimum Euclidean distance. The 167 distance values thus obtained are sorted in increasing order to create the orderline. Then the next candidate in $W$ is compared with the 167 other time series, and so on.

A huge volume of shapelet candidates is generated in the previous step. It is not computationally efficient to store all the shapes irrespective of their quality. To optimize the process, Information Gain [28] is used for testing the quality of the various captured shapes [29–31]. To illustrate the concept of information gain, consider a dataset $S$ which has two classes $A$ and $B$. The randomness in the dataset is measured in terms of entropy.
The entropy of $S$ is given by:

$$E(S) = -p(A)\log_2(p(A)) - p(B)\log_2(p(B)) \qquad (2)$$

where $p(A)$ and $p(B)$ are the proportions of objects in each of these classes. Entropy takes a value between 0 and 1. A high entropy suggests a low level of purity among the classes, and most machine learning algorithms aim to reduce the entropy. A metric that is used to measure the reduction in entropy after a dataset is split based on an attribute is called Information Gain (IG). Consider an attribute that splits the dataset $T$ into two datasets $T_A$ and $T_B$. With the entropy $E(\cdot)$ defined as in Eq. (2), the IG of this split is given by

$$IG = E(T) - \left( \frac{|T_A|}{|T|} E(T_A) + \frac{|T_B|}{|T|} E(T_B) \right) \qquad (3)$$

where $0 \le IG \le 1$. A high information gain denotes the high informative power of an attribute. This way the least informative attributes can be abandoned. Here, shapelets are the attributes, and the orderline ($D_S$) containing distances between the shapelet candidate and the time series is split in various ways, and the IGs of these splits are compared as shown in Fig. 7. An optimal shapelet generates small distance values for time series of its own class and large distance values otherwise. The best split in the orderline has all the distance values of a particular class on one side of the split and the rest of the distance values on the other side. This split produces the highest IG. In this way, the highest IG obtained by each of the shapelet candidates is calculated. The shapelets are then arranged in decreasing order of IG as shown in Fig. 8. Any shapelet that surpasses the minimum information gain threshold (0.05 in the present case) is retained and the other shapelets are discarded. This ensures that the selected shapelets are meaningful and have discriminatory power.

Figure 7: One-dimensional representation of the arrangement of time series objects by the distance to the candidate shapelet $S_1$. Information Gain is calculated for each possible split point on the orderline $D_{S_1}$. [Labels: Class A and Class B; candidate split points $IG_1, IG_2, \ldots, IG_n$; $IG_{S_1} = \max\{IG_1, IG_2, \ldots, IG_n\}$.]

Figure 8: Discovery of shapelets based on Information Gain. [Shapelets ranked by IG: $S_1$ with IG(max), $S_2$ with IG(max - 1), $S_3$ with IG(max - 2), $S_4$ with IG(max - 3), down to $S_r$ with IG(min).]

The shapelet algorithm was developed by [32] and the full workflow along with the code is available at [33]. The same algorithm has been used, with slight modifications to suit the datasets in the present study. Detailed information about the application of the algorithm is also available in [34]. The algorithm is straightforward, and no parameter tuning is required. The only requirement for the shapelet algorithm is the input time series ($TS$). For the present case, a total of 32 shapelets are discovered from the training set using the shapelet algorithm. The top 6 shapelets with the highest IG among the 32 are shown in Fig. 9. From shapelets 1-2 and 4-6, it can be seen that the peaks in the time series are extracted as shapelets. A section of time series from a non-thunderstorm wind speed measurement is also captured as a shapelet, as seen in Shapelet 3. As the current study involves two classes (thunderstorm and non-thunderstorm), two groups of shapelets are discovered, one for each class. Of the 32 shapelets, 11 correspond to non-thunderstorm records. This may raise the question of why two families of shapelets are required for a binary classification problem instead of just one set of shapelets. Using only one family of shapelets on datasets generated by erratic natural phenomena like the wind will affect the detection accuracy, as these datasets contain many instances of time series where a clear A vs B distinction cannot be established. Such time series can only be correctly identified if they are compared with both sets of shapelets.

It should be noted that the discovered shapelets are of various lengths. Predetermining the optimal length of a shapelet is impossible and unnecessary, as it hinders the detection accuracy of the algorithm. It is also very difficult to interpret the variety of shapelet lengths obtained from the algorithm, as these lengths have been chosen from several thousand shapelet lengths that were compared with several other time series. However, there is a provision in the shapelet algorithm to set the maximum and minimum shapelet length to achieve speedup. This provision should be used with care and should only be utilized in cases where only a certain range of shapelet lengths is of interest.
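The quality assessment of a single shapelet candidate, i.e. sorting its orderline and searching for the split with the highest information gain, can be sketched in Python as follows. This is an illustrative sketch and not the implementation of [32, 33]: the function names are hypothetical, base-2 logarithms are assumed for the entropy, the split point is assumed to lie midway between two adjacent distances, and the distances in the example are made up.

```python
import math

def entropy(labels):
    """Base-2 entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        h -= p * math.log2(p)
    return h

def best_information_gain(orderline):
    """orderline: list of (distance, label) pairs for one shapelet candidate.
    Returns the highest information gain over all split points and the
    corresponding split distance (midpoint between neighbours, an assumption)."""
    ordered = sorted(orderline, key=lambda pair: pair[0])
    labels = [label for _, label in ordered]
    n = len(labels)
    total = entropy(labels)
    best_gain, best_split = 0.0, None
    for k in range(1, n):
        left, right = labels[:k], labels[k:]
        gain = total - (len(left) / n * entropy(left)
                        + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain = gain
            best_split = (ordered[k - 1][0] + ordered[k][0]) / 2.0
    return best_gain, best_split

# Illustrative orderline: thunderstorm records lie close to the candidate.
orderline = [(0.4, "Thunderstorm"), (5.2, "Other"),
             (0.9, "Thunderstorm"), (6.1, "Other")]
gain, split = best_information_gain(orderline)

IG_THRESHOLD = 0.05  # minimum information gain quoted in the text
keep_candidate = gain >= IG_THRESHOLD
```

In this toy orderline the two classes are perfectly separated by distance, so one split leaves a pure class on each side and the candidate would be retained; a candidate whose distances interleave the two classes would score a much lower gain.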

Shapelet transform converts time series data into a local-shape space where each attribute denotes the distance between a time series and a shapelet [22, 35]. To train the machine learning algorithm for thunderstorm identification, the original time series in the training set (TRS) is transformed to a local-shape space using the 32 discovered shapelets. Thus a 168 x 32 matrix is constructed, where each element is the minimum Euclidean distance between a time series in the training set and a shapelet. In a similar way, the testing set (TTS) is also transformed using the discovered shapelets. In the present case, the testing set has 72 time series, and a 72 x 32 matrix is constructed. In the context of machine learning, the shapelets are the features, the individual time series are the instances, and the class labels are attached at the end of each instance, as shown in Fig. 10.

Figure 9: Shapelets discovered for the identification of thunderstorms (six panels, Shapelet 1 to Shapelet 6; x-axis: Time (s); y-axis: Wind Velocity (m/s))

Instances   S_1                 S_2                 ...   S_r                  Class
TTS_1       dist(TTS_1, S_1)    dist(TTS_1, S_2)    ...   dist(TTS_1, S_r)     Thunderstorm
TTS_2       dist(TTS_2, S_1)    dist(TTS_2, S_2)    ...   dist(TTS_2, S_r)     Thunderstorm
...
TTS_72      dist(TTS_72, S_1)   dist(TTS_72, S_2)   ...   dist(TTS_72, S_32)   Other

Figure 10: Shapelet Transform containing a matrix of Euclidean distance between the discovered shapelets and the time series in testing set (columns are the features, rows are the instances)

A Random Forest (RF) classifier [36] with 500 trees is used in the current study to identify thunderstorms from the shapelet-transformed testing set. The authors of [32, 35] compared the performance of shapelets using several standard classifiers and ensemble classifiers on a variety of datasets from the UCR time-series repository. According to their study, a random forest classifier with 500 trees is found to be optimal on a shapelet-transformed dataset. Hence the same has been adopted in the present study. It is also found that increasing the number of trees beyond 500 did not result in any significant increase in accuracy. Moreover, along with each prediction, the classifier also gives a class probability. For instance, if a time series is predicted as Thunderstorm, the classifier also gives the prediction probability, e.g., prob(Thunderstorm) = 77% and prob(Other) = 23%. More insight into this is provided in the following sections.
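As an illustration of how the transformed matrix is built, the following Python sketch computes the minimum Euclidean distance between each time series and each shapelet. It is a simplified sketch under stated assumptions, not the code used in the study: the names `min_distance` and `shapelet_transform` are hypothetical, and the per-window normalisation that published shapelet implementations commonly apply is omitted for brevity.

```python
import math


def min_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and every
    window of the series that has the same length as the shapelet."""
    m = len(shapelet)
    best = math.inf
    for start in range(len(series) - m + 1):
        d = math.sqrt(
            sum((series[start + k] - shapelet[k]) ** 2 for k in range(m))
        )
        best = min(best, d)
    return best


def shapelet_transform(dataset, shapelets):
    """Return a matrix with one row per time series (instance) and
    one column per shapelet (feature)."""
    return [[min_distance(ts, s) for s in shapelets] for ts in dataset]
```

With 72 testing series and 32 shapelets, `shapelet_transform` would return the 72 x 32 matrix described above; each row, together with its class label, is what a Random Forest classifier would then be given as one instance.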

The Random Forest classifier is first trained on the shapelet-transformed training dataset. No threshold is set for the depth of the trees in the Random Forest. The nodes are allowed to expand until all the leaves are pure. The average depth of the trained RF is 3 (obtained by taking the mean of the individual tree depths of all 500 trees). The trained classifier is then used on the transformed testing set to identify thunderstorms. A classification accuracy of 95% is obtained, with a precision of 97% for “Thunderstorm” records. The high classification accuracy shows the efficacy of the method in identifying thunderstorms. This trained classifier is then used on the wind speed measurements from all anemometers to identify thunderstorms.
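For reference, the two metrics quoted above are assumed here to follow their standard definitions, with TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives for the “Thunderstorm” class on the testing set. The individual counts are not reported in this section, so only the general expressions are given.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\qquad
\text{Precision}_{\text{Thunderstorm}} = \frac{TP}{TP + FP}
\]
```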

The shapelet-based Random Forest classifier is used to detect thunderstorm records from all 14 ultrasonic anemometers shown in Table 1. A summary of the detection performance of shapelets in terms of several metrics is provided in Table 2. The shapelet-based classifier was able to discover a total of 235 strongly non-stationary wind speed records, as shown in Table 2, that can be traced back to thunderstorm outflows. Apart from the 60 non-stationary catalog records that were originally identified based on the gust factor method, an additional 175 records were identified using shapelets. All the detections were cross-checked with the radar images to separate true positives from false positives. Figs. 11-14 show typical 1-hour records of strongly non-stationary and non-Gaussian events detected by shapelets, along with the radar images displaying the associated thunderstorms. In particular, each figure shows the mean velocity over a 1-hr period (horizontal dashed line), the peak velocity averaged over 1 s (red circle) and the position of each port on the radar image (white circle).

Table 2: Non-stationary records related to thunderstorms identified by the shapelet-based classifier

Figure 11: Thunderstorm outflow recorded by Anemometer 1, Port of Genoa, 29 August 2012 (a) wind velocity in a 1-hour period with peak velocity averaged on 1 s denoted by a red dot (b) Vertical Maximum Intensity (VMI) from radar reflectivity displaying the thunderstorm

Figure 12: Thunderstorm outflow recorded by Anemometer 4, Port of Livorno, 2 October 2015 (a) wind velocity in a 1-hour period with peak velocity averaged on 1 s denoted by a red dot (b) Vertical Maximum Intensity (VMI) from radar reflectivity displaying the thunderstorm

Figure 13: Thunderstorm outflow recorded by Anemometer 1, Port of Savona, 23 June 2015 (a) wind velocity in a 1-hour period with peak velocity averaged on 1 s denoted by a red dot (b) Vertical Maximum Intensity (VMI) from radar reflectivity displaying the thunderstorm

Figure 14: Thunderstorm outflow recorded by Anemometer 3, Port of La Spezia, 10 August 2015 (a) wind velocity in a 1-hour period with peak velocity averaged on 1 s denoted by a red dot (b) Vertical Maximum Intensity (VMI) from radar reflectivity displaying the thunderstorm

From the figures, it can be seen that the shapelet-based classifier was able to detect a large variety of thunderstorms with different durations and wind speeds, allowing it to generalize well to thunderstorms that are not similar to the ones in the training set. As mentioned before, the classifier also returned a probability for each of the predictions. Fig. 15 shows the number of event detections and their respective prediction probabilities for the Port of Livorno. Out of the 103 events (i.e., catalog detections plus new detections in 2015), 49 events were classified with a prediction probability between 90% and 100%. Prediction probability is the fraction of the total number of trees that vote for a specific class. The default probability threshold in Random Forest (RF) is 0.5. That is, in the case of binary classification (i.e., Class 0 and Class 1), if the probability of a data point being in Class 1 is calculated by RF to be greater than 0.5, the data point is assigned to Class 1, otherwise to Class 0. This probability threshold can be changed for problems with severe class imbalance. However, for the present case, the default threshold of 0.5 has been used for all examples.

Figure 15: Number of event detections for various intervals of prediction probabilities for the Port of Livorno (x-axis: Prediction Probability (%); y-axis: No. of detected events)

For the case of thunderstorm detection, a time series with a very high prediction probability (like 96%) for Class: Thunderstorm shows that the RF has made a confident decision that this time series contains a thunderstorm record. For a time series with a prediction probability around 50%, there is an equal chance that the time series may or may not contain thunderstorm records. For a time series with a very low prediction probability (say <10%), the RF has predicted with confidence that it does not contain any thunderstorm events. If a greater number of input samples are predicted with very high probability, then the prediction results can be accepted with more confidence, as in the present case. The cataloged and new thunderstorm events were predicted with a prediction probability greater than 90%.

The algorithm is tuned to be very sensitive to the slightest change in the shape of the wind velocity time series. Hence a total of 1594 false positives have been detected. These wind events had ramp-ups, ramp-downs or multiple peaks in wind velocity, but did not have the significant ramp-up, peak and ramp-down seen in thunderstorms. The false positives were detected with a prediction probability ranging from 50% to 70%. Thus, for future detections, the false positives can be reduced by carefully tuning the class probability thresholds of the model. A total of 41 false negatives were obtained, i.e., thunderstorms that were missed by the shapelet algorithm. A few illustrative examples of these false negatives are shown in Figs. 16 and 17. Compared to Figs. 11-14, the 1-hr non-stationary records in Figs. 16 and 17 have peaks with a very short ramp-up and ramp-down duration, which makes it challenging to identify these small shapes. However, these shapes appear more prominently in a 10-min time window [37]. One way to overcome this difficulty is to allow the shapelet algorithm to learn on 10-min wind velocity datasets instead of 1-hr datasets. But this will lead to a deluge of datasets, and thunderstorms with a longer duration cannot be effectively captured using this time window.
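The relation between tree votes, prediction probability and the decision threshold can be sketched as follows. This is a hedged illustration only: the helper names (`vote_probability`, `classify`, `count_detections`) and the probabilities in `example_probs` are hypothetical and are not taken from the study; the vote-fraction definition and the 0.5 default follow the description above.

```python
def vote_probability(tree_votes, positive="Thunderstorm"):
    """Fraction of trees that vote for the positive class."""
    return sum(1 for v in tree_votes if v == positive) / len(tree_votes)


def classify(prob, threshold=0.5):
    """Assign 'Thunderstorm' only if the probability exceeds the threshold."""
    return "Thunderstorm" if prob > threshold else "Other"


def count_detections(probs, threshold):
    """Number of records flagged as thunderstorms at a given threshold."""
    return sum(1 for p in probs if p > threshold)


# 385 of 500 trees voting 'Thunderstorm' gives the 77% example quoted earlier.
votes = ["Thunderstorm"] * 385 + ["Other"] * 115
probability = vote_probability(votes)

# Hypothetical probabilities: three low-confidence and two high-confidence records.
example_probs = [0.55, 0.62, 0.68, 0.93, 0.97]
at_default = count_detections(example_probs, 0.5)
at_raised = count_detections(example_probs, 0.7)
```

In this illustrative set, raising the threshold from 0.5 to 0.7 drops the three low-confidence records and keeps the two high-confidence ones, which is the kind of trade-off that threshold tuning would exploit; whether a raised threshold also discards genuine thunderstorms would have to be checked against the radar-verified records.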

Figure 16: False-negative: Thunderstorm outflow recorded by Anemometer 1, Port of Genoa, 11 November 2012 (a) wind velocity in a 1-hour period with peak velocity averaged on 1 s denoted by a red dot (b) Vertical Maximum Intensity (VMI) from radar reflectivity displaying the thunderstorm

Figure 17: False-negative: Thunderstorm outflow recorded by Anemometer 4, Port of Savona, 24 February 2015 (a) wind velocity in a 1-hour period with peak velocity averaged on 1 s denoted by a red dot (b) Vertical Maximum Intensity (VMI) from radar reflectivity displaying the thunderstorm

CONCLUDING REMARKS

This paper proposes a new line of research that uses machine learning techniques, independent of wind-based parameters, to autonomously identify and separate desired wind events, thunderstorms in the present case, from large volumes of continuous data. This would complement the meteorological and wind engineering methodologies used to identify thunderstorms. This objective is achieved using a relatively new and efficient time series representation named “Shapelet transform”, which is combined with a machine learning algorithm (Random Forest classifier) to identify thunderstorms from anemometric records. The Shapelet Transform is a time series representation technique that is based solely on the shape of the time series and provides a universal standard feature for detection, namely the distance between a shapelet and a time series. This method is designed to detect thunderstorms in large datasets of continuously recorded data. In this study, the method is used to identify thunderstorms from 1 year of data from 14 ultrasonic anemometers that are part of the “Wind and Ports” and “Wind, Ports and Sea” projects, an extensive in-situ monitoring network aimed at investigating extreme wind events in port areas of Italy. A total of 235 non-stationary records associated with a wide variety of thunderstorms with varying durations and wind velocities were identified using the shapelet transform method. The results lead to a comprehensive understanding of a wide variety of thunderstorms that have not been previously detected using conventional gust factor based methods.

ACKNOWLEDGEMENTS

This work was supported in part by the Robert M. Moran Professorship and National Science Foundation Grant (CMMI 1612843). The contribution of the third and fourth authors is funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 741273) for the project "THUNDERR - Detection, simulation, modelling and loading of thunderstorm outflows to design wind-safer and cost-efficient structures", through an Advanced Grant 2016. The data used for this research were recorded by the monitoring network set up as part of the European Projects “Wind and Ports” (grant No. B87E09000000007) and “Wind, Ports and Sea” (grant No. B82F13000100005), funded by the European Territorial Cooperation Objective, Cross-border program Italy-France Maritime 2007–2013.

Funding

This work was supported in part by the Robert M. Moran Professorship and National Science Foundation Grant (CMMI 1612843).

References

[1] Horace Robert Byers and Roscoe R Braham. The thunderstorm: report of the Thunderstorm Project. US Government Printing Office, 1949.
[2] Jess Charba. Application of gravity current model to analysis of squall-line gust front. Monthly Weather Review, 102(2):140–156, 1974.
[3] R Craig Goff. Vertical structure of thunderstorm outflows. Monthly Weather Review, 104(11):1429–1440, 1976.
[4] Roger M Wakimoto. The life cycle of thunderstorm gust fronts as viewed with doppler radar and rawinsonde data. Monthly Weather Review, 110(8):1060–1082, 1982.
[5] Douglas J Sherman. The passage of a weak thunderstorm downburst over an instrumented tower. Monthly Weather Review, 115(6):1193–1205, 1987.
[6] Kirsten Deann Gast. A comparison of extreme wind events as sampled in the 2002 thunderstorm outflow experiment. PhD thesis, Texas Tech University, 2003.
[7] KD Gast and JL Schroeder. Supercell rear-flank downdraft as sampled in the 2002 thunderstorm outflow experiment. In , pages 2233–2240, 2003.
[8] W Scott Gunter and John L Schroeder. High-resolution full-scale measurements of thunderstorm outflow winds. Journal of Wind Engineering and Industrial Aerodynamics, 138:13–26, 2015.
[9] CW Letchford, C Mans, and MT Chay. Thunderstorms—their importance in wind engineering (a case for the next generation wind tunnel). Journal of Wind Engineering and Industrial Aerodynamics, 90(12-15):1415–1433, 2002.
[10] L Gomes and BJ Vickery. Extreme wind speeds in mixed wind climates. Journal of Wind Engineering and Industrial Aerodynamics, 2(4):331–344, 1978.
[11] L Gomes. On thunderstorm wind gusts in australia. Civil Engg. Trans. IE Aust., 18:33–39, 1976.
[12] JD Riera and LF Nanni. Pilot study of extreme wind velocities in a mixed climate considering wind orientation. Journal of Wind Engineering and Industrial Aerodynamics, 32(1-2):11–20, 1989.
[13] Lawrence A Twisdale and Peter J Vickery. Research on thunderstorm wind design parameters. Journal of Wind Engineering and Industrial Aerodynamics, 41(1-3):545–556, 1992.
[14] Edmund CC Choi. Extreme wind characteristics over singapore–an area in the equatorial belt. Journal of Wind Engineering and Industrial Aerodynamics, 83(1-3):61–69, 1999.
[15] Edmund CC Choi and Ferry A Hidayat. Gust factors for thunderstorm and non-thunderstorm winds. Journal of Wind Engineering and Industrial Aerodynamics, 90(12-15):1683–1696, 2002.
[16] Michael Kasperski. A new wind zone map of germany. Journal of Wind Engineering and Industrial Aerodynamics, 90(11):1271–1287, 2002.
[17] Franklin T Lombardo, Joseph A Main, and Emil Simiu. Automated extraction and classification of thunderstorm and non-thunderstorm wind data for extreme-value analysis. Journal of Wind Engineering and Industrial Aerodynamics, 97(3-4):120–131, 2009.
[18] Patrizia De Gaetano, Maria Pia Repetto, Teresa Repetto, and Giovanni Solari. Separation and classification of extreme wind events from anemometric records. Journal of Wind Engineering and Industrial Aerodynamics, 126:132–143, 2014.
[19] Dae-Kun Kwon and Ahsan Kareem. Gust-front factor: New framework for wind load effects on structures. Journal of Structural Engineering, 135(6):717–732, 2009.
[20] Giovanni Solari, Massimiliano Burlando, and Maria Pia Repetto. Detection, simulation, modelling and loading of thunderstorm outflows to design wind-safer and cost-efficient structures. Journal of Wind Engineering and Industrial Aerodynamics, 200:104142, 2020.
[21] Guangzhao Chen and Franklin T Lombardo. An automated classification method of thunderstorm and non-thunderstorm wind data based on a convolutional neural network. Journal of Wind Engineering and Industrial Aerodynamics, 207:104407, 2020.
[22] Jason Lines, Luke M Davis, Jon Hills, and Anthony Bagnall. A shapelet transform for time series classification. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 289–297, 2012.
[23] Giovanni Solari, Maria Pia Repetto, Massimiliano Burlando, Patrizia De Gaetano, Marina Pizzo, Marco Tizzi, and Mattia Parodi. The wind forecast for safety management of port areas. Journal of Wind Engineering and Industrial Aerodynamics, 104:266–277, 2012.
[24] Maria Pia Repetto, M Burlando, G Solari, P De Gaetano, and M Pizzo. Integrated tools for improving the resilience of seaports under extreme wind events. Sustainable Cities and Society, 32:277–294, 2017.
[25] Maria Pia Repetto, Massimiliano Burlando, Giovanni Solari, Patrizia De Gaetano, Marina Pizzo, and Marco Tizzi. A web-based gis platform for the safe management and risk assessment of complex structural and infrastructural systems exposed to wind. Advances in Engineering Software, 117:29–45, 2018.
[26] Shi Zhang, Giovanni Solari, Patrizia De Gaetano, Massimiliano Burlando, and Maria Pia Repetto. A refined analysis of thunderstorm outflow characteristics relevant to the wind loading of structures. Probabilistic Engineering Mechanics, 54:9–24, 2018.
[27] Guy P Nason and Bernard W Silverman. The stationary wavelet transform and some statistical applications. In Wavelets and statistics, pages 281–299. Springer, 1995.
[28] Claude E Shannon and Warren Weaver. The mathematical theory of communication, 117 pp. Urbana: University of Illinois Press, 1949.
[29] Abdullah Mueen, Eamonn Keogh, and Neal Young. Logical-shapelets: an expressive primitive for time series classification. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1154–1162, 2011.
[30] Lexiang Ye and Eamonn Keogh. Time series shapelets: a new primitive for data mining. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 947–956, 2009.
[31] Lexiang Ye and Eamonn Keogh. Time series shapelets: a novel technique that allows accurate, interpretable and fast classification. Data Mining and Knowledge Discovery, 22(1-2):149–182, 2011.
[32] Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, and Eamonn Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660, 2017.
[33] Markus Löning, Anthony Bagnall, Sajaysurya Ganesh, Viktor Kazakov, Jason Lines, and Franz J Király. sktime: A unified interface for machine learning with time series. arXiv preprint arXiv:1909.07872, 2019.
[34] Monica Arul and Ahsan Kareem. Applications of shapelet transform to time series classification of earthquake, wind and wave data. Engineering Structures, 228:111564, 2021.
[35] Jon Hills, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery, 28(4):851–881, 2014.
[36] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[37] Massimiliano Burlando, Shi Zhang, and Giovanni Solari. Monitoring, cataloguing, and weather scenarios of thunderstorm outflows in the northern mediterranean.