Classification of tokamak plasma confinement states with convolutional recurrent neural networks
F. Matos, V. Menkovski, F. Felici, A. Pau, F. Jenko, the TCV Team‡ and the EUROfusion MST1 Team§
Max Planck Institute for Plasma Physics, Boltzmannstraße 2, 85748 Garching, Germany
Eindhoven University of Technology, 5612 AZ Eindhoven, Netherlands
École Polytechnique Fédérale de Lausanne (EPFL), Swiss Plasma Center (SPC), CH-1015 Lausanne, Switzerland
E-mail: [email protected]
Abstract.
During a tokamak discharge, the plasma can vary between different confinement regimes: Low (L), High (H) and, in some cases, a temporary intermediate state, called Dithering (D). In addition, while the plasma is in H mode, Edge Localized Modes (ELMs) can occur. The automatic detection of changes between these states, and of ELMs, is important for tokamak operation. Motivated by this, and by recent developments in Deep Learning (DL), we developed and compared two methods for automatic detection of the occurrence of L-D-H transitions and ELMs, applied on data from the TCV tokamak. These methods consist in a Convolutional Neural Network (CNN) and a Convolutional Long Short Term Memory Neural Network (Conv-LSTM). We measured our results with regards to ELMs using ROC curves and Youden's score index, and regarding state detection using Cohen's Kappa Index.
Keywords: CNN, LSTM, Deep Learning, ELM, H mode, L mode, Dither, Automated Detection
1. Introduction
In a fusion experiment, plasma can typically be described as being in one of two different confinement regimes or modes: Low (L) and High (H). Furthermore, the plasma can also sometimes be described as being in a third, additional mode, called the Intermediate or Dithering (D)[1] phase. In addition, when the plasma is in H mode, Edge Localized Modes (ELMs) can periodically occur.

Current tokamaks regularly run in H mode, which motivates the necessity for some measure of control (and therefore, detection) of ELMs and transitions between plasma modes. Furthermore, it is expected that future machines will also run in the same operating conditions[2]. Thus, the development of data-based approaches to automatically detect the occurrence of certain events would be useful for both existing and future tokamak experiments and operation. A detector would not only simplify and speed up the post-experimental, offline analysis of shots, but also (ideally) detect ELMs and plasma states rapidly enough to allow for its usage in the real-time control systems of a fusion experiment, for purposes of plasma control and real-time discharge monitoring and supervision[3].

Due to uncertainties in the scaling laws, it is difficult to determine, a priori, when, during a discharge, a switch between different plasma modes will occur[4]. Nevertheless, physicists can usually pinpoint, through a post-experimental visual analysis of several diagnostic signal time-traces, at what point in time any transitions between different modes did take place. Similarly to transitions between plasma modes, the occurrence of an ELM can usually be pinpointed by looking at the time-traces of several diagnostics from a plasma discharge post-shot.

‡ See author list of S. Coda et al 2019 Nucl. Fusion 59 112023
§ See author list of B. Labit et al 2019 Nucl. Fusion 59 086020
Yet, through an analysis of signals, some types of ELMs can be easily confused with dithers; a distinction between the two phenomena cannot always be clearly made[5].

Although the identification by an expert, through post-experimental visual analysis of signal time-traces, of a single ELM, or of a single transition between plasma modes, is relatively straightforward for a typical shot, it becomes much more cumbersome to carry out that analysis effectively for many shots, especially when the associated time-series data is long, and when a shot has many transitions between different modes.

Recent advances in the ML field with the introduction of Deep Learning (DL) approaches deal with exactly such challenges. In the past years, the field of Deep Learning has brought about significant advances in Computer Vision and Sequential Data Processing. Convolutional Neural Networks (CNNs) have proven adept at localization, recognition and detection tasks in both 2-dimensional[6, 7, 8, 9, 10] and 1-dimensional[11, 12, 13, 14, 15, 16] data (i.e. signal analysis) in many different fields of science. In addition, Long Short-Term Memory (LSTM) networks, which are one type of Recurrent Neural Network, have been successfully used for processing of sequential data where one expects correlations to exist across time, namely automatic translation, natural language modelling[17], traffic analysis[18], and automated video description[19]. These tasks are much akin to what one can expect to find in terms of processing fusion shot data.

Given this, a Deep Learning approach is well motivated to address this challenge. Specifically, deep neural network models offer particular advantages when modeling high-dimensional data as given in this setting. In this work, we develop an approach for automatic classification of L-D-H plasma states and detection of ELMs based on two deep neural network models.
The first model is based on a sliding-window feed-forward neural network, specifically a convolutional neural network (CNN). The second model is based on a recurrent neural network (RNN), specifically a long short-term memory network (LSTM) with convolutional layers. The first model captures the local correlations within the windows to classify the transitions between plasma states from the shape of the signals. The second model extends this to capture longer-term dependencies in the evolution of the states with the recurrent neural network layers. We empirically demonstrate the approach on data collected from the TCV tokamak, labelled by an ensemble of experts. The presented results demonstrate the effectiveness of the proposed models in detecting the states and events of the plasma. We further discuss the trade-offs between the increased precision and the increased complexity of both models.

This paper is organized as follows: Section 2 discusses related work and Section 3 describes the physical phenomena being analyzed. Section 4 formalizes our problem, details the data we have available, and explains our decisions regarding how we model the data and design and train the neural networks. Section 5 gives an overview of the metrics we used to evaluate our results and our rationale behind using those metrics. Section 6 gives an overview of the results achieved, and we wrap up with a discussion in Section 7.
2. Previous work
Several different approaches for automated detection of events in plasma experiments exist. One such approach is to use threshold-based detectors. This corresponds to defining a point or series of points (in time) at which a signal surpasses a certain amplitude as corresponding to a detection[20, 21, 22], with additional constraints such as an increasing probability of the occurrence of an ELM as time passes since the last one. These approaches are limited to simple thresholding and cannot compute complex patterns in the data. Other work builds upon methods such as Kalman Filters to model the expected characteristics of the signal over a period of time[23], whilst also keeping track (at each time point) of the current plasma mode, according to a pre-defined model. In both of these cases, a detection algorithm's performance depends on the extent to which the theoretical assumptions and mathematical descriptions of how the signals should behave are correct, whether those assumptions are exhaustive (i.e., whether there may be additional causes which are unaccounted for), and whether some of those assumptions are more important than others; in other words, it is difficult to design an exhaustive rule-based system to detect the occurrence of transitions between plasma modes, as well as to detect ELMs.

The alternative is to use a purely data-based, supervised Machine Learning (ML) approach, whereby a set of data, previously manually labeled by an expert (for example, through visual analysis), is used to train a detector. In this case, one does not specify which characteristics or correlations in the data are thought to correspond to the occurrence of an event; rather, it is expected that the algorithm can automatically learn what those correlations are, based on the labels, and then use the learned data features to make correct classifications on new data.
Examples of such work are the usage of Support Vector Machines (SVMs)[24, 25, 26, 27] and Multi-Layer Perceptron (MLP) Neural Networks[28] on data from several tokamaks for detection of L-H transitions, classification of L and H modes, and detection of ELMs.

This type of scenario is, indeed, well suited for the application of ML methods towards enabling automation. However, traditional ML methods such as SVMs and MLPs typically have limitations when faced with data with complex dynamics, such as the long sequences (i.e., signal time-series) present in this environment. SVMs typically depend on expert-defined feature engineering, which, while being superior to simple threshold-based detectors, is nevertheless insufficient when considering the complex data correlations which are observed in this setting. On the other hand, MLPs, while not requiring that sort of expert-defined input, are very inefficient when compared to modern Deep Learning models such as CNNs and RNNs, requiring much larger numbers of neurons and layers to perform the same task. These limitations are what motivate us to use Deep Learning approaches instead.
3. Background
When a discharge starts, the plasma is considered to be in Low (L) confinement mode. Once a certain threshold of input heating power to the plasma is reached[29], the plasma can spontaneously transition into High (H) confinement mode. Originally discovered at the ASDEX tokamak[30], H mode is nowadays regularly observed in almost all other machines[31]. H mode is characterized by the appearance, at the plasma edge, of a steep gradient in the electron density and the electron/ion temperatures, and a reduction in the transport of particles and energy. As a consequence of this edge transport barrier, the temperature and energy in the plasma core increase. When compared to L mode, H mode allows for a larger amount of stored plasma energy per input power, thus rendering the fusion process more efficient. Yet the actual input power threshold that triggers the transition between the two modes depends on many factors, such as, for example, the configuration of the magnetic field, the plasma density, and the plasma size[4]. Furthermore, when the input heating power passes the aforementioned threshold but a change from L to H mode does not immediately occur, the plasma can be considered to be in a Dithering (D)[1] phase. In this case, a temporary, weak edge transport barrier starts to develop at the plasma edge, only to collapse and reappear in rapid succession[29]. These oscillations then repeat themselves until the plasma transitions into L or H mode. The localization of transitions into, and out of, D mode can, however, be difficult to identify, and there are often disagreements between experts as to which periods of a shot are in a Dithering phase[32].
When the plasma enters H mode, the corresponding accumulation of energy and the large pressure gradient at the plasma edge can trigger the occurrence of Edge Localized Modes (ELMs). These consist of periodic bursts of particles and energy which, if a long amount of time passes between successive ELMs, can impose a significant power load on the divertor, potentially damaging it. However, ELMs also allow for the periodic removal of accumulated impurities from the plasma, and for a relaxation of the plasma density, which can otherwise increase as the H mode progresses, eventually triggering a disruption[33]. On the other hand, frequent, less energetic ELMs lower the power load on the divertor, at the cost of reduced plasma confinement. Thus, tokamak operation requires knowledge of the occurrence of ELMs, in particular for larger machines where ELMs may cause deterioration of in-vessel components. Although several different types of ELMs exist, for the purposes of this work, we did not make any distinctions between them; we train the models to detect all occurring ELMs equally, regardless of their subclass.
4. Methods
To develop a model for this task, we formulate the problem as follows. We observe a sequence of measurements x_t for 0 < t ≤ N from the sensors for each shot. These observations are conditioned on the state of the plasma z_t at the corresponding time t, where z_t ∈ Z and Z = {Low, Dither, High}. Our goal is to find the most likely sequence of states z^N, and the occurrences of ELMs e^N, that explain the observations x^N:

    ẑ^N = arg max_{z^N} Σ_t log p(z_t | x_t, z_{t−1})

    ê^N = arg max_{e^N} Σ_t log p(e_t | x_t)

For this purpose, we develop two models. The first model is trained to detect the transitions between the different states of the plasma, defined as q_t ∈ Q, where Q = {Low→Dither, Dither→Low, Low→High, High→Low, Dither→High, High→Dither, No transition}, and to detect the ELM events as e_t ∈ E, where E = {ELM, No ELM}.

We implement this model with a feed-forward CNN that processes a window of observations x_{t−w}, ..., x_t, ..., x_{t+w} and produces a probability distribution over the transitions, p(q_{z_{t−1}→z_t} | x_{t−w:t+w}), and over the presence of an ELM, p(ELM_t | x_{t−w:t+w}), at t. We now model the probability of transitioning to z_t after being in z_{t−1}, i.e. p(z_t | x_t, z_{t−1}), with p(q_{z_{t−1}→z_t} | x_{t−w:t+w}), where w is the number of observations around t; therefore:

    ẑ^N = arg max_{z^N} Σ_t log p(q_{z_{t−1}→z_t} | x_{t−w:t+w})

In practice, we implement the arg max given above as the state evolution of a finite state machine S_t(z^(a) → z^(b)), where z^(a) and z^(b) are elements of Z and the transition probabilities are given by p(q_{z_{t−1}→z_t} | x_{t−w:t+w}) at time t (see Figure 1).
The evolution of the state machine produces several possible sequences of states, and the one most likely to have generated the observed sequence of transitions can be found through an implementation of the Viterbi algorithm[34].
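To make the decoding step concrete, the following is a minimal numpy sketch of Viterbi decoding over per-step transition probabilities. The function name, the array layout, and the convention of starting in L mode are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def viterbi_states(log_p_trans, states=("Low", "Dither", "High")):
    """Most likely plasma-state sequence given per-step log-probabilities
    of the transitions q_{a->b}; the diagonal entries play the role of
    'no transition'.

    log_p_trans: array of shape (T, S, S), where log_p_trans[t, a, b]
    is the log-probability, at step t, of moving from state a to state b.
    The shot is assumed to start in Low mode (index 0).
    """
    T, S, _ = log_p_trans.shape
    score = np.full(S, -np.inf)
    score[0] = 0.0                       # start in Low
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(T):
        cand = score[:, None] + log_p_trans[t]   # cand[a, b]: from a to b
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]         # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return [states[i] for i in path]
```

In a real run, the (T, S, S) array would be filled from the CNN's per-window outputs p(q | x_{t−w:t+w}) before decoding.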
Figure 1: State machine for processing of the CNN outputs
Figure 2: Representation of how a CNN can be used to model the transitions between different plasma modes. The network's output prediction for a time slice t depends only on the data features in a defined region immediately surrounding t.

The first model can capture well the localized correlations in the signals that indicate a transition of the plasma state. However, it is incapable of capturing the longer-distance correlations that may be present in the signal. To generalize the approach further, we introduce a sequence model that models the full sequence of observations up to time t and produces a probability distribution p(z_t | x_{1:t}) for 0 < t ≤ N, as well as a distribution over the presence of ELMs, p(ELM_t | x_{1:t}). This model is implemented by extending the CNN with a recurrent (LSTM) neural network. In this case, the model now observes a sequence of sliding windows x_{t−w}, ..., x_t, ..., x_{t+w} for each t in the range {1, ..., N}.
Figure 3: Schematic representation of the flow of data inside a convolutional LSTM Neural Network. The network's prediction (i.e. output probability) at any time t of a shot depends not only on whatever features the convolutional layers have extracted from the points immediately around t, but also on features extracted in the past.

The first model has a lower computational complexity and can be trained more efficiently, as we only need windows of the signal with or without the different transitions, but it is limited to the information present in the given window (see Figure 2). Increasing the size of the window that forms the context increases both the complexity of the model and the likelihood of multiple transitions appearing within a single window. The second model addresses these challenges by modeling the sequence rather than a fixed window (see Figure 3). As a sequential model, it has an internal representation of the past observations x_1, ..., x_t, which enables it to weigh in the likelihood of a transition based on information in the more distant past[35]. The LSTM effectively assumes the role of the finite state machine, and so the model can directly model the state of the plasma rather than the transitions. However, it introduces a higher level of complexity, particularly for training, as we need to train on sequences rather than fixed-length windows.

For the purposes of this work, we have assembled a dataset based on the time-traces of four signals originating from the TCV tokamak[36, 37].
We opted, for the purposes of this work, to use the same, limited set of diagnostic signals that experimentalists use to determine, in post-shot analysis, the state of the plasma (Figure 4).
Figure 4: Switches between different plasma modes (Low, Dither and High), and time-traces of the collected signals, TCV shot
Figure 5: ELMs and L and H plasma modes, TCV shot
(i) Photodiode (PD) signal. Corresponds to the measurements given by the photodiode diagnostic at TCV along a vertical chord, measuring the line-integrated emitted visible radiation; the photodiode has an Hα filter which measures radiation at 656.3 nm. Transitions between different plasma states, as well as ELMs, can be most easily observed through analysis of the photodiode (PD) signal (Figure 5). Transitions from L to H mode are characterized by a sudden drop in the baseline value of the signal, whereas transitions back into L mode have the opposite trace, i.e., the baseline PD signal suddenly increases and remains at a steady level. ELMs are characterized by a sudden spike in the PD signal, followed by a relaxation that takes at most 2 ms. D modes generate rapid fluctuations in the signal (see Figure 7); they do not necessarily correspond to a change in the baseline signal value, unless they are followed by a transition into a different state from the one at the point where they started.

(ii) Interferometer (FIR) signal. The interferometers at TCV measure the line-integrated electron density in the plasma along 14 parallel, vertical lines of sight. Of these, we take the mean value, per time instant, of the 12 inner-most detectors. In the interferometer signal, the transition between L and H mode can most easily be seen as a sudden increase in the time derivative of the signal, while transitions back into L mode correspond to a decrease in the derivative. Similarly to what happens with the photodiode signal, ELMs may provoke short (albeit less pronounced) spikes in the FIR signal.

(iii)
Diamagnetic Loop (DML) signal. Refers to the measurement of the total toroidal magnetic flux of the plasma[38]. The derivative of the DML signal frequently switches sign when a transition occurs between L and H mode, as well as when an ELM occurs (Figure 6). Furthermore, the sign of this signal's derivative changes depending on the sign of the plasma current.

(iv)
Plasma Current (IP) signal. Refers to the total plasma electric current. For this work, we use the current value to determine when the actual classification of plasma states should begin. Specifically, we ignore, for classification purposes, time points where the absolute value of the current is lower than 50 kA.
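The gating on plasma current described above can be sketched as a one-line mask; the function name and array convention here are our own, only the 50 kA threshold comes from the text:

```python
import numpy as np

def classification_mask(ip_signal, threshold_amps=50e3):
    """Boolean mask of the time slices that should be classified:
    True wherever the absolute plasma current is at least 50 kA."""
    return np.abs(ip_signal) >= threshold_amps
```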
Figure 6: ELMs, and L and H modes from a section of TCV shot
Figure 7: L, D and H modes from a section of TCV shot
The two proposed models develop different maps. The first model is a map between a fixed window of observations and a distribution over transitions, while the second models a sequence of observations and produces a sequence of states (see Figure 8).

Accordingly, the training data has different arrangements. For transition classification, we need to prepare a dataset D_1 = {(x, q, e)}, where a training point consists of a section of the recorded signal (x_{t−w}, ..., x_t, ..., x_{t+w}), the corresponding label of one of the transitions q_t in Q, and the matching label e_t indicating the presence (or not) of an ELM. For the second model, D_2 = {(x, z, e)}, a training point consists of a sequence of windows of observations drawn from x_t to x_{t+l+w} (where l is a defined sequence length, and w is the window length), a sequence of state labels z_t in Z of length l, with each label corresponding to the state of the plasma at time t, and a sequence of labels e_t of length l corresponding to the presence of an ELM at time t. Figure 9 illustrates this in detail.

There is an inherent uncertainty in the labeling of the ELMs and plasma states, particularly when it comes to transitions into and out of dithers. The raw data only has hard, binary, one-hot encodings[39]; that is, a transition between two states, for example, is labeled as a sudden switch (from one time slice to the next) from one state to another. This means that it is easy to mistakenly label an event or transition in a slightly shifted time slice. This type of hard threshold also makes it difficult for a neural network to generalize outside of its training set[40]. Therefore, for the first model (CNN), we process the target time-series such that the probability of an ELM, or of a given state transition, is a continuous value, starting at zero and peaking at one, with several intermediate probabilities.
In practical terms, we apply to each event a Gaussian smoothing such that, if an ELM or state transition occurs at time t, its probability at that point is 1, and we define an interval Δt before and after t where the probability, respectively, smoothly increases and decreases. We defined these smoothing intervals as corresponding to 2 ms, which, at the defined sampling rate, translates to 20 time slices. We do the same with the states z_t for the second model (Conv-LSTM), such that a switch between two different states, from z_1 to z_2, does not happen immediately from one time slice to the next; rather, the probability of z_1 decreases, while that of z_2 increases, over a span of 20 time slices. This procedure not only models the uncertainty in the labeling process, but also acts as an automatic regularization for the neural network training process, i.e., it makes it easier for the neural network to generalize what it learns to unseen data[41].
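The label-smoothing step can be sketched as follows. This is a minimal numpy illustration under our own assumptions (in particular the choice of sigma, and taking the pointwise maximum where neighbouring events overlap), not the paper's exact code:

```python
import numpy as np

def smooth_event_labels(event_times, n_slices, half_width=20):
    """Turn hard one-hot event labels into smooth target probabilities.

    An event at slice t gets probability 1.0 at t, decaying as a Gaussian
    over roughly `half_width` slices on each side (20 slices = 2 ms at
    the 10 kHz sampling rate). Sigma is an illustrative choice.
    """
    target = np.zeros(n_slices)
    sigma = half_width / 3.0  # ~99.7% of the bump within +/- half_width
    offsets = np.arange(-half_width, half_width + 1)
    bump = np.exp(-0.5 * (offsets / sigma) ** 2)
    for t in event_times:
        lo, hi = max(0, t - half_width), min(n_slices, t + half_width + 1)
        seg = bump[lo - (t - half_width): hi - (t - half_width)]
        # overlapping events keep the larger probability
        target[lo:hi] = np.maximum(target[lo:hi], seg)
    return target
```

The same idea, applied to the state labels z_t, yields the cross-fading state probabilities used for the Conv-LSTM targets.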
Figure 8: Representation of the different types of encoding of the target "smooth" data distributions, to be learned by the two classifiers, from TCV shot
Figure 9: Representation of the sliding temporal windows fed to the CNN on top of the PD signal, and their corresponding ELM probability output. At inference time, these windows slide over the 4 signals across the whole shot, each of them rendering an output probability for a single time slice.

The choice of the size of the temporal windows with which the CNN is trained is a trade-off between the assumptions made about the data and computational feasibility. Larger windows contain more information and thus, intuitively, should make the classification at a particular time slice more precise, but they also make training and inference by the network slower. Smaller windows contain arguably less information, but can be processed faster. We opted to train the CNN with temporal windows with a length of 20 ms, which we judged to be a good compromise between those two requirements. At our sampling rate, these windows are 200 time slices long. This is illustrated in Figure 9: the green region represents a window of signals (in this case, only the PD signal) which is fed to the neural network, together with its associated targets, p(e_t | x_{t−w1:t+w2}) and p(q_{z_{t−1}→z_t} | x_{t−w1:t+w2}), where w1 = 180 and w2 = 20; the classified time slice t is thus offset from the end of the window. In practice, in a real-time setting, that offset would constitute a minimum delay between the occurrence of an event in the machine and a detection by the classifier. Once again, the size of this offset is a trade-off: a smaller offset is ideal for real-time applications because it gives more time for feedback control mechanisms, but it also contains less information for the network to accurately classify an event.

We train the Conv-LSTM not with windows, but with sequences of windows. The distinction is an important one, for it implies different assumptions about the data.

Figure 10: Example of a sequence fed to the LSTM.
At a 10 kHz sampling rate, it consists of 200 overlapping temporal windows of length 40. The output probability for a given window depends not only on what data features are present in that window, but also on the past windows in the sequence.

In the case of the windows fed to the CNN, it is assumed that each window is independent of the others. In the data fed to the Conv-LSTM, each sequence itself is composed of several windows, with future windows depending on past ones. We defined each of those sequences to consist of 200 windows (since that was also the length of the windows fed to the CNN). In this case, each of the individual windows has a length of 4 ms (40 time slices), with an offset of 2 ms, as in the data for the CNN (see Figure 10). The sequences have a stride[42] of 1: each window starts exactly 1 time slice after the previous one. Each of these sequences is randomly subsampled from the whole shots, and the corresponding targets for them are chosen randomly from one of the three labelers.

Although not all of these subsamples start in L mode, our expectation is that the network would learn by itself that an actual shot always begins in that state. There are several reasons for this. First, the network will learn to recognize any features in the subsequences that are consistent with the beginning of a shot, and learn that those features correlate with L mode. Second, even if some training sequences start in D or H mode, the network will statistically learn that these modes are more frequently the result of a transition from a previous mode.
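The construction of one such training sequence can be sketched as below; this is a minimal numpy version, with names and the array layout chosen by us for illustration:

```python
import numpy as np

def make_lstm_sequence(signals, start, n_windows=200, win_len=40):
    """Build one Conv-LSTM training sequence: `n_windows` overlapping
    windows of `win_len` slices each, advancing with a stride of 1 slice.

    signals: array of shape (n_slices, n_channels), e.g. the four
    channels PD, FIR, DML and IP.
    Returns an array of shape (n_windows, win_len, n_channels).
    """
    return np.stack([signals[start + i: start + i + win_len]
                     for i in range(n_windows)])
```

In training, `start` would be drawn at random from each shot, implementing the random subsampling described above.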
The architecture of the neural network used for transition detection starts with a 1-D convolutional input with four channels, which receive the values of the PD, FIR, IP and DML signals. This is followed by several convolutional layers, interspersed with pooling and dropout layers, which are trained for feature extraction, with deeper layers extracting higher-level data features (Figure 11). The last layers of the network are fully connected, and are responsible for receiving the extracted high-level features and producing an appropriate output, i.e., the desired classification. This model is loosely inspired by the VGG architecture for classification of images, where fixed-size filters are used[43].
Figure 11: Architecture of the Convolutional NN: Conv1D(64,3) → Conv1D(128,3) → Dropout(0.5) → Maxpool(2) → Conv1D(256,3) → Conv1D(256,3) → Conv1D(256,3) → Dropout(0.5) → Maxpool(2) → Conv1D(256,3) → Conv1D(256,3) → Conv1D(256,3) → Dropout(0.5) → Maxpool(2) → Dense(64) → Dense(16) → Dense(7)/Dense(2)

Our convolutional LSTM network builds on top of the CNN model that showed the best performance on the transition detection task. We add a recurrent layer that processes the output of the CNN to capture the longer-distance correlations in the data (Figure 12). We designed the networks using the Keras framework for Deep Learning[44]. Both networks used a categorical cross-entropy loss function, and were trained with the Adam optimizer[45] using the default learning rate value provided by Keras.
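For reference, the categorical cross-entropy loss that both networks minimize has the following form; this is a plain numpy sketch of the standard definition, not Keras's internal implementation:

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch.

    y_true: one-hot (or smoothed) target distributions, shape (n, k).
    y_pred: softmax network outputs, shape (n, k).
    Predictions are clipped away from 0 for numerical stability.
    """
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```

Note that because the targets here are the smoothed probabilities described earlier, y_true need not be strictly one-hot.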
Figure 12: Architecture of the convolutional LSTM: a time-distributed convolutional stack (Conv1D(64,3) → Conv1D(128,3) → Dropout(0.5) → Maxpool(2) → Conv1D(256,3) → Conv1D(256,3) → Conv1D(256,3) → Dropout(0.5) → Maxpool(2) → Conv1D(256,3) → Conv1D(256,3) → Conv1D(256,3) → Dropout(0.5) → Maxpool(2) → Dense(64) → Dense(16)), followed by LSTM(32) → LSTM(32) → Dense(32) → Dropout(0.5) → Dense(3)/Dense(2). All layers and nodes use ReLU activation functions, apart from the final output layer, which uses Softmax activation.
In total, we possessed 54 shots fully labeled by the three experts. In a typical Deep Learning setting, some sort of normalization[46] is usually applied to the available data. The most common procedure would have been to normalize across the entire dataset. However, because of the different calibrations of the PD signals and the consequent large variance and multimodal distribution associated with them, we decided, at this stage, to normalize each shot separately, dividing each signal in each shot by its own mean across the whole shot. For potential real-time applications, as any new shots could fall outside the normalization range, the procedure would require grouping and normalizing the shots with respect to the different signal gains and calibrations.

From these normalized full sequences, we draw batches of smaller temporal windows and subsequences to train the neural networks. There are several reasons for this subsampling. First, the full shot time-series are up to about 20,000 time slices long, but the actual length of a shot can vary significantly. Yet, for purposes of training the networks, we require batches of data of fixed length, which can be achieved by subsampling from the full sequences. Second, this method allows us to automatically perform data augmentation for training, since one long sequence contains many shorter subsequences and windows. Third, feeding very large temporal windows to a CNN would be computationally difficult, as the number of network parameters requiring training would grow considerably. Finally, the distribution of the data in the full sequences is highly unbalanced: in most shots, dithering phases are significantly shorter than L and H phases; only a few dozen transitions happen at most per shot; and some transitions tend to be more frequent than others. Training with whole sequences would significantly bias the networks towards the events and transitions that occur more frequently in the labeled data.
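The per-shot normalization described above amounts to the following (a short numpy sketch; the function name is our own):

```python
import numpy as np

def normalize_shot(signals):
    """Per-shot normalization: divide each channel by its own mean over
    the whole shot. signals: array of shape (n_slices, n_channels)."""
    return signals / signals.mean(axis=0, keepdims=True)
```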
Drawing subsequences allows us to control the data fed to the network such that this inherent bias is mitigated. To do this, the training data batches must be balanced, i.e., generated such that they contain roughly equal fractions of the different types of events and/or transitions of interest. For the CNN, there are 8 possible events of interest: LH, HL, HD, DH, LD, DL, ELM, and no transition. Generating batches for the CNN means that, for a batch containing n data samples, n/8 of those samples will correspond to each of those different event types. Similarly, for the Conv-LSTM, the batches are generated such that the three target distributions (L, D and H) each correspond to approximately 1/3 of the data samples.
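Balanced batch generation can be sketched as follows. The equal per-class fractions come from the text; sampling with replacement is our own assumption for coping with the rarer classes (e.g. dithers):

```python
import random

def balanced_batch(samples_by_class, batch_size):
    """Draw a batch with equal fractions per class.

    samples_by_class: dict mapping each class label (e.g. the 8 CNN
    events LH, HL, HD, DH, LD, DL, ELM, no-transition) to its list of
    training samples. Samples are drawn with replacement, since rare
    classes have far fewer examples than common ones.
    """
    classes = list(samples_by_class)
    per_class = batch_size // len(classes)
    batch = []
    for c in classes:
        batch += [random.choice(samples_by_class[c]) for _ in range(per_class)]
    random.shuffle(batch)
    return batch
```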
5. Evaluation metrics
We consider a detection of a single, discrete ELM by the networks to correspond to a point in time (in a shot) where the direct network output for the ELM probability ê_N reaches a local maximum. This is not necessarily a point where the output probability for an ELM is 1, but rather a point t where the output probability P(ELM_t) follows a series of strictly increasing probability values and precedes a series of strictly decreasing ones. Because we defined the length of the Gaussian smoothing of the probabilities as 20 time slices, we here consider a local maximum of P(ELM_t) within a 20-slice-wide interval to correspond to the detection of a single ELM, which we denote as a positive. The remaining points are considered non-detections, i.e., negatives. In addition, we defined different probability thresholds for what can be considered a detection of an ELM by the network. For example, a threshold of 50% implies that only ELM probability maxima above that value are considered positives.

Positives and negatives must then be compared to the labeled ELMs. To that end, we build the ELM confusion matrix, which defines several variables: negatives that match their label at the same point in time are True Negatives (TN), while those that do not are False Negatives (FN). Similarly, positives that match their label are True Positives (TP), and those that do not are False Positives (FP). Using this method to determine the points at which the network detects individual ELMs, one can then compute the True Positive Rate (TPR) and False Positive Rate (FPR) for different detection thresholds:

TPR = TP / (TP + FN)    (1)
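A possible implementation of this local-maximum criterion, assuming a 1-D array of smoothed per-slice ELM probabilities, is sketched below; the function is illustrative and not the actual detection code:

```python
import numpy as np

def detect_elms(elm_prob, window=20, threshold=0.5):
    """Mark time slices where the smoothed ELM probability is a strict
    local maximum within a `window`-wide interval and exceeds `threshold`.

    elm_prob: 1-D array of per-slice ELM probabilities.
    Returns a list of indices of detected ELMs (positives).
    """
    half = window // 2
    detections = []
    for t in range(1, len(elm_prob) - 1):
        lo, hi = max(0, t - half), min(len(elm_prob), t + half + 1)
        seg = elm_prob[lo:hi]
        # A positive: above threshold, the maximum of its window, and a
        # strict peak (rising before, falling after).
        if (elm_prob[t] >= threshold and elm_prob[t] == seg.max()
                and elm_prob[t] > elm_prob[t - 1]
                and elm_prob[t] > elm_prob[t + 1]):
            detections.append(t)
    return detections
```

Varying the `threshold` argument reproduces the family of detectors whose TPR/FPR trade-off is analyzed below.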
FPR = FP / (FP + TN)    (2)

Plotting the TPR versus the FPR for a series of different detection thresholds yields the classifier's ROC curve[47], which illustrates the network's capacity for discrimination at different detection thresholds. There are several ways to compute the ideal detection threshold based on the ROC curve, depending on the task at hand. In our case, we use the Youden index[48], whereby the best threshold is the value which maximizes the difference TPR − FPR, the maximum possible value being 1.
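The ROC construction and Youden-based threshold selection can be sketched as follows, given per-point detection scores and binary ELM labels; this is an illustrative sketch, not the evaluation code used in the paper:

```python
import numpy as np

def roc_and_youden(scores, labels, thresholds):
    """Compute (FPR, TPR) pairs for each threshold, plus the threshold
    that maximizes the Youden index TPR - FPR.

    scores: array of detection scores (e.g. ELM probability maxima).
    labels: binary array, 1 where a labeled ELM is present, 0 otherwise.
    Returns (curve, (best_youden, best_threshold)).
    """
    best = (-1.0, None)  # (Youden index, threshold)
    curve = []
    for th in thresholds:
        pred = scores >= th
        tp = np.sum(pred & (labels == 1))
        fn = np.sum(~pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        tn = np.sum(~pred & (labels == 0))
        tpr = tp / (tp + fn) if tp + fn else 0.0  # equation (1)
        fpr = fp / (fp + tn) if fp + tn else 0.0  # equation (2)
        curve.append((fpr, tpr))
        if tpr - fpr > best[0]:
            best = (tpr - fpr, th)
    return curve, best
```

A perfect classifier places some threshold at (FPR, TPR) = (0, 1), giving a Youden index of 1.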
To compare the models' accuracy with that of the human labelers, we use Cohen's kappa coefficient, which measures the agreement between two sets of categorical data[49], defined as

κ = (p − p_e) / (1 − p_e)    (3)

where p denotes the actual relative agreement between the two sets, and p_e denotes the probability of the two sets randomly agreeing with each other. Generically, the κ coefficient's values range between 0 and 1, the former indicating poor performance and the latter indicating perfect performance. In our case, given two sequences z and z′ of plasma states, Cohen's kappa measures the overlap between them: if z_t = z′_t for all time instants t, the metric will yield a score of 1; if there are mismatches between the two sequences, the score will go down.

The κ-statistic can be interpreted differently based on the sections of the data for which it is computed. For that reason, we now define several variables that allow us to interpret the κ-statistic scores. Recall that we possess labels drawn from three different experts; as such, the labeled shot state at each point in time t of a shot falls into one of three possible categories:

• No majority agreement, i.e., all labelers disagree as to what state the plasma is in, which we denote as category C_0.
• Majority agreement, i.e., two labelers agree on the state of the plasma, while one disagrees, which we denote as category C_1.
• Consensual agreement, i.e., all labelers agree as to what state the plasma is in, which we denote as category C_2.

We define the union of C_1 and C_2 as the ground truth (C_GT); these are the sections of shots where there is at least a majority opinion as to what state the plasma is in.
We also have, for each shot, the most likely sequence ẑ_N of states (given the observed data) produced by the neural networks, which we denote as C_N. Computing the κ-statistic score κ_l between the individual labelers' sequences and the ground truth C_GT gives an indication of the probability that a single labeler disagrees with the ground truth: a κ_l score of 1 would indicate that all labelers agree all the time, while a lower score would indicate that, at least some of the time, one labeler disagrees with the others. Similarly, computing the κ-statistic score κ_n between the sets C_N and C_GT gives an indication of the networks' performance with respect to the ground truth. In addition, we can directly compare κ_l and κ_n; this comparison allows us to test how a network and a single labeler compare against each other, on average, given the ground truth.

The κ coefficient is calculated separately for each of the three possible labels for the plasma state (L, D and H), and as a weighted mean across all three states. The weights of that mean are taken to be the relative frequencies of each individual state in the dataset, based on the ground-truth (C_GT) labels.
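For concreteness, the unweighted κ between two state sequences can be computed as below; this is a standard textbook implementation, not the authors' code, and the per-state and frequency-weighted variants used in the paper follow by restricting to, or averaging over, the individual states:

```python
from collections import Counter

def cohens_kappa(z1, z2):
    """Cohen's kappa between two equal-length sequences of categorical
    states: kappa = (p - p_e) / (1 - p_e), where p is the observed
    agreement and p_e the chance agreement."""
    assert len(z1) == len(z2)
    n = len(z1)
    p = sum(a == b for a, b in zip(z1, z2)) / n  # observed agreement
    c1, c2 = Counter(z1), Counter(z2)
    # Chance agreement: probability that two independent raters with
    # these marginal state frequencies pick the same state.
    p_e = sum(c1[s] * c2[s] for s in set(z1) | set(z2)) / n**2
    return (p - p_e) / (1 - p_e) if p_e != 1 else 1.0
```

Identical sequences score 1; note that the general definition can dip below 0 for worse-than-chance agreement, although the scores reported below all fall in [0, 1].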
6. Results
We performed several training runs using the data labeled by the three experts; we carried out experiments where we trained both models (CNN and Conv-LSTM) three times, each time randomizing the training and test shots, to test whether differences in the data could lead to different results. In a typical Deep Learning setting, the data is usually split so that approximately 80-90% is used for training and the remaining 10-20% is used for validation of the results, i.e., for testing the network's capability to accurately predict on data that was not used for training. In our case, we opted for a 50% training/test split: of the 54 shots, we used 27 for training and 27 for testing. The results that follow are the best results of those three experiments, for each model. We also experimented with varying offsets (see Figure 9) for the convolutional windows to see what effect that factor could have on the results; we settled on an offset value of 2 ms (20 time slices), as smaller offsets degraded the results, while larger ones did not improve them. We computed the metric scores on the training and test data at several points during training to control for overfitting[50], and we present the results from the epoch where the state detection results on the test data were highest. We ran the neural networks on an NVIDIA Quadro RTX 5000 GPU.

We computed the κ-statistic based on the regions defined in Subsection 5.2: scores based on the network output versus the ground truth (κ_n), and scores based on labeler disagreement versus the ground truth (κ_l). We computed the scores on a per-state (L, D and H) basis, and also computed a mean of the values obtained for each state.

We trained the CNN for 250 epochs, allowing the loss function to stabilize; each epoch consisted of 32 batches, with each batch containing 64 data samples. Upon completion of training, we tested the CNN's accuracy on both the training and test data. The model's results on ELM classification (ROC curve) can be seen in Figure 13.
Table 1 shows the scores κ_n and κ_l for the entire dataset, while Figure 14 contains histograms showing the distribution of κ_n on a per-shot basis.

            L      D      H      Mean
κ_n  Train  0.691  0.358  0.657  0.649
     Test   0.219  0.115  0.157  0.182
κ_l  Train  0.937  0.896  0.987  0.958
     Test   0.941  0.848  0.986  0.962

Table 1: κ-statistic scores (κ_n and κ_l) for each plasma mode and as a mean, on training and test data (values across all shots), for the CNN.

Figure 13: ROC curves for ELM detection for the CNN model, on (a) training and (b) test data. The Youden indexes are 0.993 and 0.99 for the two sets, respectively; using the ideal threshold for the training data (0.2) on the test data gives a slightly lower Youden index.
Figure 14: Distribution of the κ-statistic score (κ_n) on a per-shot basis, for the CNN, on (a) training and (b) test data.

We trained the convolutional LSTM for 400 epochs, allowing the loss function to stabilize. Each epoch consisted of 64 batches, with each batch containing 64 data samples. The results of computing the scores κ_l and κ_n, using the same definitions as for the CNN, can be seen in Table 2. The ROC curves detailing the results on ELM detection can be seen in Figure 15. Figure 16 contains histograms showing the κ_n scores on a per-shot basis.

            L      D      H      Mean
κ_n  Train  0.96   0.889  0.967  0.96
     Test   0.82   0.766  0.85   0.832
κ_l  Train  0.96   0.94   0.992  0.98
     Test   0.901  0.808  0.98   0.935

Table 2: κ-statistic scores (κ_n and κ_l) for each plasma mode on training and test data, for the Conv-LSTM.

Figure 15: ROC curves for ELM detection for the Conv-LSTM model, on (a) training and (b) test data. The Youden indexes are 0.977 and 0.969 for the two sets, respectively; using the ideal threshold for the training data (0.5) on the test data gives a slightly lower Youden index.
Figure 16: Distribution of the κ-statistic score (κ_n) on a per-shot basis, for the Conv-LSTM, on (a) training and (b) test data.

A comparison of the κ_n scores on training and test data for each classifier shows that the convolutional LSTM performs better than the CNN for all three plasma states. Furthermore, looking at the distribution of the mean κ_n scores on a per-shot basis through the histograms, one can see that the worst Conv-LSTM classifications do not score lower than 0.6 on training data, while for the CNN, even on training data, mean κ_n scores lower than 0.2 exist. For both classifiers, the performance on training data surpasses that on test data, both on a state-by-state basis and as a mean across all states, which indicates the occurrence of overfitting.

For both networks, an analysis of the κ_l scores on their training and test data indicates that human labeler disagreement is highest for dithers: the scores for that particular state are consistently lower. Interestingly, both networks also score their lowest results for dithers.

Comparing the Conv-LSTM's κ_l and κ_n scores shows that, at least on training data, the network behaves, on average, similarly to a single human labeler, making errors (or disagreeing with the ground truth) at approximately the same rate: the mean κ_l score for training data is 0.98, while the mean κ_n score for training data is 0.96. On test data, the Conv-LSTM performs slightly worse than a single human labeler, as seen by the fact that the network's mean κ_n score on test data is 0.832, while κ_l is 0.935.

As measured by the Youden index, we achieve excellent performance in the detection of ELMs on both training and test data using both models; the ideal detection thresholds generate true positive detection rates very close to 1, while bringing false positive detection rates essentially to 0. The Youden indexes for test data are only slightly lower than for training data, which suggests that overfitting is minimal.
Furthermore, for both models, on both training and test data, the ROC curves' points are mostly concentrated close to True Positive Rates of 1 and False Positive Rates of 0, which indicates that the choice of ELM detection threshold does not significantly change the behavior of the classifiers. Finally, the fact that the scores for ELMs are essentially the same for both models indicates that the features in the data which allow for the identification of ELMs are mostly local: the CNN, even without knowledge of long-term temporal correlations, performs excellent classification.

Because the Conv-LSTM has the highest κ_n scores, we made a case-by-case analysis of that network's classification of all our available shots. Broadly, the Conv-LSTM's results on state detection, on a per-shot basis, can be placed into six different categories:

(i) A (sometimes very) short detection of a dither that is not labeled in the data. Due to the way κ_n is computed, a mistaken dither classification by the network at a single time point (in a whole sequence), in a shot which has no regions where the ground truth is dithering, will bring the score for that state down to 0, even if the remainder of the shot is correctly classified (17 shots);
(ii) A clearly incorrect classification of a temporal region of a shot as being in a dithering state (4 shots);
(iii) A missed detection of an L-H transition (1 shot);
(iv) A missed detection of an H-L transition (2 shots);
(v) An overall bad detection across an entire shot (7 shots);
(vi) An overall good detection across an entire shot (23 shots).

Table 3 lists 6 shots which are representative of each of the types of results listed above. The table shows the computed κ_n scores for each of those shots on a per-state basis, as well as the score's mean value, and the fraction of time, in the ground truth of each shot, that a particular state is labeled. The table also lists which of the 6 cases above each shot is representative of.
Figures 17 to 22 are plots of those same shots, where the background color in the top plot denotes the state detected by the Conv-LSTM, and in the bottom plot denotes the ground truth label. Small gray areas in the bottom plot denote regions where the ground truth is not defined, i.e., where there is no majority agreement between the labelers.

Case  Shot ID  L (Fraction / Score)  D (Fraction / Score)  H (Fraction / Score)  Mean
1     57751    0.756 / 0.97          0     / 0             0.243 / 0.97          0.97
2     34010    0.679 / 0.856         0.073 / 0.232         0.248 / 0.602         0.748
3     58182    0.22  / 0.912         0.095 / 0.969         0.685 / 0.927         0.928
4     30197    0.951 / 0.384         0     / 1             0.049 / 0.384         0.384
5     33459    0.811 / 0.662         0     / 0             0.189 / 0.846         0.697
6     33942    0.455 / 0.953         0.183 / 0.884         0.412 / 0.997         0.962

Table 3: Kappa statistic (κ_n) scores for each plasma mode on training and test data for selected shots representative of each of the six result categories.
Figure 17: TCV shot 57751 (result category 1 in Table 3).

Figure 18: TCV shot 34010 (result category 2 in Table 3): the network at one point (incorrectly) switches back to dithering.

Figure 19: TCV shot 58182 (result category 3 in Table 3): the network incorrectly switches back to L mode and remains there until the first ELMs (spikes in the PD signal) appear.

Figure 20: TCV shot 30197 (result category 4 in Table 3).

Figure 21: TCV shot 33459 (result category 5 in Table 3): immediately after classifying a D mode, the network oscillates between L and H in quick succession, which to the naked eye might appear in this plot as a gray area; in reality, it is an artifact of the plot, with alternating red and green regions.

Figure 22: TCV shot 33942 (result category 6 in Table 3).
7. Conclusions
We have developed two Deep Learning-based classifiers to perform automatic detection of ELMs and classification of plasma modes. The task was two-fold: on the one hand, to perform a binary classification, for each time slice of a plasma shot, of whether an ELM is occurring or not; on the other, to automatically determine which plasma mode (or, alternatively, which transition between plasma modes) is occurring. One approach is to use a Convolutional Neural Network (CNN), which uses only local correlations in the data to perform classification. The second approach uses a Convolutional LSTM (Conv-LSTM) neural network, which also takes advantage of long-term temporal correlations in the data.

On ELM detection, the two networks achieve essentially equal results. On plasma state classification, a clear difference can be seen between the results obtained with the CNN and those obtained with the Conv-LSTM. Comparing the κ_n scores of each network shows that the Conv-LSTM's scores are clearly higher, which suggests that, at least when it comes to the detection of plasma modes, the processing of long-term correlations in the data facilitates accurate classification. There is some indication that overfitting occurred. However, our monitoring of the training progression indicated that, while the metric values for test data were always lower, they did nevertheless improve as training progressed. Thus, an overfitting-avoidance strategy such as early stopping would, in this case, not have helped achieve better test accuracy. While the results from the Conv-LSTM are better, that network is also more complex, with both training and inference taking longer.

Although this work used data from the TCV tokamak, it should also be possible to adapt it to other machines; as a matter of fact, the data sources used exist on most tokamaks.
As long as the data fed to the neural networks is from those same sources, this model could in principle be used for automatic labeling of shots from a number of different machines.

Acknowledgements
This work has been carried out within the framework of the EUROfusion Consortium and has received funding from the Euratom research and training programme 2014-2018 and 2019-2020 under grant agreement No 633053. The views and opinions expressed herein do not necessarily reflect those of the European Commission. We would like to express our gratitude to B. Labit, R. Maurizio and O. Sauter at SPC/EPFL for taking the time to manually label the data used for training. This work was supported in part by the Swiss National Science Foundation.

References

[1] Zhang T, Gao X, Zhang S, Wang Y, Han X, Liu Z, Ling B, Team E et al. Physics Letters A
[2] et al. Nuclear Fusion
[3] et al. Physics of Plasmas
[4] et al. Journal of Physics: Conference Series vol 123 (IOP Publishing) p 012033
[5] Ryter F, Buchl K, Fuchs C, Gehre O, Gruber O, Herrmann A, Kallenbach A, Kaufmann M, Koppendorfer W, Mast F et al. Plasma Physics and Controlled Fusion A99
[6] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V and Rabinovich A 2015 Going deeper with convolutions The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
[7] Ince T, Kiranyaz S, Eren L, Askar M and Gabbouj M 2016 IEEE Trans. Industrial Electronics
[8] Advances in Neural Information Processing Systems 25 ed Pereira F, Burges C J C, Bottou L and Weinberger K Q (Curran Associates, Inc.) pp 1097-1105
[9] Tompson J, Goroshin R, Jain A, LeCun Y and Bregler C 2015 Efficient object localization using convolutional networks The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
[10] Lähivaara T, Kärkkäinen L, Huttunen J M and Hesthaven J S 2018 The Journal of the Acoustical Society of America
[11] Computers in Biology and Medicine
[12] Journal of Sound and Vibration
[13] Journal of Chemometrics e2977
[14] Golik P, Tüske Z, Schlüter R and Ney H 2015 Convolutional neural networks for acoustic modeling of raw time signal in LVCSR Sixteenth Annual Conference of the International Speech Communication Association
[15] Kiranyaz S, Ince T and Gabbouj M 2015 IEEE Transactions on Biomedical Engineering
[16] Expert Systems with Applications
[17] Thirteenth Annual Conference of the International Speech Communication Association
[18] Ma X, Tao Z, Wang Y, Yu H and Wang Y 2015 Transportation Research Part C: Emerging Technologies
[19] arXiv preprint arXiv:1604.01729
[20] Webster A and Dendy R 2013 Physical Review Letters
[21] Plasma Physics and Controlled Fusion
[23] Shousha R et al.
[24] et al. Nuclear Fusion
[25] et al. Plasma Physics and Controlled Fusion
[26] Murari A, Vagliasindi G, Zedda M K, Felton R, Sammon C, Fortuna L and Arena P 2006 IEEE Transactions on Plasma Science
[27] Plasma Physics and Controlled Fusion
[28] et al. Plasma Physics and Controlled Fusion
[29] et al. Nuclear Fusion
[30] et al. Physical Review Letters
[31] Phys. Rev. Lett. (22) 2276-2279
[32] Basse N, Zoletnik S, Antar G, Baldzuhn J, Werner A et al. Plasma Physics and Controlled Fusion
[34] Speech and Language Processing, second edition (Pearson Education)
[35] Boulanger-Lewandowski N, Bengio Y and Vincent P 2013 High-dimensional sequence transduction (IEEE) pp 3178-3182
[36] Hofmann F, Lister J, Anton W, Barry S, Behn R, Bernel S, Besson G, Buhlmann F, Chavan R, Corboz M et al. Plasma Physics and Controlled Fusion B277
[37] Coda S et al. Nuclear Fusion
[38] Moret J M, Buhlmann F and Tonetti G 2003 Review of Scientific Instruments
[39] Digital Design and Computer Architecture; Computer Organization Bundle, VHDL Bundle (Elsevier Science) ISBN 9780080547060
[40] Szegedy C, Vanhoucke V, Ioffe S, Shlens J and Wojna Z 2016 Rethinking the inception architecture for computer vision Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp 2818-2826
[41] Zheng Q, Yang M, Yang J, Zhang Q and Zhang X 2018 IEEE Access
[42] stat
[43] arXiv preprint arXiv:1409.1556
[44] Chollet F et al. Keras https://keras.io
[45] Kingma D P and Ba J 2014 arXiv preprint arXiv:1412.6980
[46] Han J, Pei J and Kamber M 2011 Data mining: concepts and techniques (Elsevier)
[47] Fawcett T 2006