Prepared for submission to JINST
An Automatic Data Cleaning Procedure for Electron Cyclotron Emission Imaging on EAST Tokamak Using Machine Learning Algorithm
C. Li,a T. Lan,a Y. Wang,a J. Liu,a J. Xie,b T. Lan,b H. Li b and H. Qin a,c
a Department of Engineering and Applied Physics, School of Physical Sciences, University of Science and Technology of China, No. 96 Jinzhai Road, Hefei, China
b School of Physical Sciences, University of Science and Technology of China, No. 96 Jinzhai Road, Hefei, China
c Plasma Physics Laboratory, Princeton University, Princeton, NJ 08543, U.S.A.
E-mail: [email protected]
Abstract: A new data cleaning procedure for electron cyclotron emission imaging (ECEI) on the EAST tokamak is developed. Machine learning techniques, including Support Vector Machine (SVM) and decision tree, are applied to identify saturated, zero, and weak signals in raw ECEI data, which both reduces the effort researchers spend on data analysis and improves the accuracy of data preprocessing. Proper training sets are sampled from the massive raw ECEI data of the EAST tokamak. The optimal window size of the temporal signal, the kernel function, and other model parameters are obtained by model training. With the optimized parameters, the recognition rates of saturated, zero, and weak signals in raw data are 99.4%, 99.86%, and 99.9%, respectively, which demonstrates the accuracy of this procedure.

Keywords: ECEI, classification of data, window separation, support vector machine, decision tree

Electron Cyclotron Emission Imaging (ECEI) has been introduced as a useful diagnostic method for detecting the two-dimensional (vertical and horizontal) distribution of the electron temperature (T_e) in different tokamaks [1, 2]. Based on ECEI data, the formation of magnetic islands and MHD instabilities can be investigated [3, 4]. ECEI data thus play a critical role in the study of magnetically confined plasmas. However, due to the complexity of the measurement environment and the limited acquisition range, raw ECEI data contain many invalid signals, including saturated signals, zero signals, and weak signals. These invalid signals increase the complexity of data analysis, so data cleaning is indispensable before further analysis. Traditional approaches to classifying invalid ECEI data rely mainly on visual inspection, which is slow and subject to human error. On the EAST tokamak, one discharge generates 7.6 GB of ECEI data, and massive amounts of ECEI data have been accumulated.
An automatic data cleaning tool is needed.

As a powerful technique for data analysis, machine learning has entered the field of plasma physics. For example, the position of the magnetic probe and the inversion radius can be determined with neural network algorithms [5, 6]. Researchers are also seeking ways to predict plasma disruptions in large fusion devices via machine learning methods [7]. Compared with traditional methods, the results of machine learning techniques improve as the amount of training data increases. At the same time, human labor can be replaced by computers, which accelerates the process of discovery.

In this paper, a new data cleaning procedure using machine learning algorithms is developed for the ECEI system of the EAST tokamak. The system is a multi-channel system, and the similarity among the signals of a single shot is not obvious. Therefore, several machine learning methods are combined to identify different kinds of signals rather than using preference-based performance measures [8]. Our procedure starts from the window separation, which divides an ECEI signal into segments of uniform length. Then, Support Vector Machine (SVM) and decision tree are used to analyze these segments and classify the properties of the raw data. SVM has a unique advantage in high-dimensional pattern recognition [9], and decision tree is easy to use and powerful for addressing optimization problems [10]. We use SVM to identify saturated signals, and apply the decision tree method to classify zero signals and weak signals. In order to validate the models, a five-fold cross validation is used, which can efficiently ease the problem of over-fitting. To enhance the reliability of the procedure, proper training sets are sampled from the massive raw ECEI data of the EAST tokamak. System parameters, including the window size and the kernel function, are adjusted to optimize the model. It is found that the performance of the model is sensitive to the window size.
To be specific, a smaller window size implies larger computational complexity, and a larger window size reduces the effectiveness of the features. It is also found that an SVM with a polynomial kernel function can be trained effectively. With these techniques, the accuracies of identifying saturated signals, zero signals, and weak signals reach 99.4%, 99.86%, and 99.9%, respectively. The training time is reduced to several seconds if a GPU is used.

This paper is organized as follows. Section 2 briefly introduces the physical background of the ECEI data. In Section 3, details of the classification procedure are provided. Sample sets are shown in Section 4. Section 5 shows the results of the validation, and a summary is given in Section 6.

Each set of ECEI data in the EAST tokamak consists of 384 channels (24 vertical and 16 horizontal). Each channel detects the electron cyclotron radiation in the tokamak, and the radiation is mixed with the local oscillation (LO) frequency on the antenna array and down-converted to the IF [11, 12]. After the signal is received by the intermediate frequency system, it is amplified with band-pass filters and converted to an analog signal. The final radiation image is obtained through the data acquisition card.

The signal lasts for ten seconds and ranges from -1 V to 1 V. The range is 0 V to 2 V if zero drift is processed. Typically, the range of saturated signals is from -1 V to 1 V, and the range of zero signals and weak signals is from 0 V to 2 V. By analyzing the electron temperature fluctuation (σT_e), sawtooth instability can be studied [13–15]. In general, there are three types of invalid data: saturated signals, zero signals, and weak signals. According to previous experience, the T_e profiles of saturated signals exceed the range; saturated signals are generated by insufficient attenuation; and zero signals can be regarded as consisting almost entirely of noise.
The following facts are also characteristic of the invalid data: the T_e profiles of zero signals are very close to 0 V; zero signals are due to errors in the antenna route; and the signal-to-noise ratio of a weak signal is higher than that of a zero signal but lower than that of a normal signal.

Figure 1 shows typical patterns of saturated signals, normal signals, zero signals, and weak signals. The signal in figure 1(a) is considered a saturated signal because it reaches the limit of the range between 4 s and 6 s. The signal in figure 1(c) is a zero signal because its baseline is essentially at 0 V. The amplitude of the signal in figure 1(d) differs only slightly from that of the noise, so it is regarded as a weak signal caused by the poor signal-to-noise ratio.

[Figure 1: four panels of T_e/V versus t/s: (a) Saturated signal, (b) Normal signal, (c) Zero signal, (d) Weak signal.]

Figure 1. Typical patterns for four kinds of signals in the ECEI measurement: (a) the pattern of a saturated signal, (b) the pattern of a normal signal, (c) the pattern of a zero signal, and (d) the pattern of a weak signal.

To classify the four kinds of signals above, a classifier is set up. The flow chart of the classifier is shown in figure 2. Saturated signals, zero signals, and weak signals are identified in sequence. Saturated signals, zero signals, weak signals, and normal signals are marked as 1, 2, 3, and 0, respectively.

[Figure 2: flow chart. Saturated signal? Yes: label as 1. No: zero signal? Yes: label as 2. No: weak signal? Yes: label as 3. No: label as 0.]

Figure 2. The flow chart of the classifier.
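
The cascade in figure 2 can be summarized in a few lines of Python. This is a minimal sketch: the three detector callables are hypothetical stand-ins for the trained SVM and decision-tree models, and only the order of the tests and the labels 1, 2, 3, 0 are taken from the text.

```python
from typing import Callable, Sequence

Signal = Sequence[float]
Detector = Callable[[Signal], bool]  # hypothetical interface of a trained model

# Labels as given in the text.
NORMAL, SATURATED, ZERO, WEAK = 0, 1, 2, 3


def classify(signal: Signal,
             is_saturated: Detector,
             is_zero: Detector,
             is_weak: Detector) -> int:
    """Apply the three detectors in the order of the flow chart."""
    if is_saturated(signal):
        return SATURATED
    if is_zero(signal):
        return ZERO
    if is_weak(signal):
        return WEAK
    return NORMAL
```

Because the detectors are applied in sequence, a signal that would trigger both the saturated and the weak test is labeled as saturated, which matches the order in which the signals are removed from the sample set.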
The feature of saturated signals is that they contain a period of saturation. The minimum length of this period is an adjustable parameter; we set it to 0.6 seconds. The model for identifying saturated signals is trained by SVM. The distance between each signal and 1 V is defined as the parameter of the sample. The sample set is divided into two categories, saturated and unsaturated. Each sample is labeled: 1 means saturated and 0 means unsaturated.
Before training the model, the parameters of the signals are obtained by preprocessing. The signal is divided into fifty windows. For each window, a distance L relative to the range is defined in the form of a standard deviation,

$$ L = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(l_i - 1\right)^2}, \qquad (3.1) $$

A standard deviation measures the fluctuation relative to the mean; with the expectation in the formula set to 1 V, L instead represents the fluctuation relative to the range. In (3.1), l_i is the electron temperature at the i-th time step and n is the total number of time steps in each window.

[Figure 3: (a) T_e/V versus t/s; (b) distance versus window sequence number.]

Figure 3. (a) Raw signal of shot 42987, 1CH 2Row (saturated). (b) The signal of shot 42987, 1CH 2Row is divided into fifty windows; the panel shows the distance between each window and the range.
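
The window separation and the distance of Eq. (3.1) can be sketched as follows. The sketch reads Eq. (3.1) as the root-mean-square deviation of the samples in a window from the 1 V limit; the function names, and the choice to drop trailing samples that do not fill a window, are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def split_windows(signal: Sequence[float], n_windows: int = 50) -> List[Sequence[float]]:
    """Split a signal into n_windows segments of equal length.

    Assumption: trailing samples that do not fill a window are dropped.
    """
    size = len(signal) // n_windows
    if size == 0:
        raise ValueError("signal is shorter than the number of windows")
    return [signal[k * size:(k + 1) * size] for k in range(n_windows)]


def distance_to_range(window: Sequence[float], limit: float = 1.0) -> float:
    """Eq. (3.1): RMS deviation of the samples l_i from the limit (1 V)."""
    return math.sqrt(sum((l - limit) ** 2 for l in window) / len(window))


def window_distances(signal: Sequence[float], n_windows: int = 50,
                     limit: float = 1.0) -> List[float]:
    """Distance L of every window of one signal."""
    return [distance_to_range(w, limit) for w in split_windows(signal, n_windows)]
```

A window that sits entirely at 1 V gives L = 0, so small values of L mark candidate saturation periods.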
Figure 3(b) shows the distance of each window of shot 42987, 1CH 2Row. After sorting, several minimum distances are obtained. The feature-space of the SVM consists of the minimum distance and two adjacent distances.

3.1.2 Method of SVM for training data
A random sample of 20% of the instances is taken as the test set, and the rest are used for training. It is found that when polynomial kernel functions are used, the model is trained best and most of the saturated signals are successfully identified. Only a few predictions are wrong, and the corresponding signals are found to be very similar to saturated signals.
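
A minimal sketch of the feature construction and the 80/20 split is given below. It assumes that the "two adjacent distances" are the distances of the windows next in time to the window with the minimum distance; the text does not state this explicitly, and the function names are hypothetical. The resulting three-element vectors would then be passed to an SVM implementation with a polynomial kernel, which is not reproduced here.

```python
import random
from typing import List, Sequence, Tuple


def saturation_features(distances: Sequence[float]) -> List[float]:
    """Three consecutive window distances around the minimum distance.

    Assumption: "adjacent" means the neighbouring windows in time. If the
    minimum lies in the first or last window, the triple is shifted inward,
    so the minimum is then the first or last element instead of the middle.
    """
    if len(distances) < 3:
        raise ValueError("need at least three windows")
    k = min(range(len(distances)), key=distances.__getitem__)
    k = min(max(k, 1), len(distances) - 2)
    return [distances[k - 1], distances[k], distances[k + 1]]


def train_test_split(samples: Sequence, test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[list, list]:
    """Randomly hold out a fraction of the samples as the test set."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]
```

With fifty windows over a ten-second signal, three consecutive windows span 0.6 seconds, which is the saturation duration chosen above.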
The feature of zero signals is that the T_e profiles of the steady segment are close to 0 V. After the identification of saturated signals, the remaining signals are classified by the decision tree algorithm. The distance between the steady segment and the noise segment of the signal is selected as the feature-space of the decision tree algorithm. Signals are marked as zero signals and non-zero signals, recorded as 2 and 0, respectively.

Finding the steady segment of a signal is the precondition for the classification. In general, the steady segment is the smoothest segment within two seconds of the highest peak. The smoothness of a segment is represented by the standard deviation of its T_e profile. The procedure for finding the steady segment is to locate the highest peak first and then compare the smoothness of the segments around the highest peak. An example is given below.

As before, the signal is divided into fifty windows, and the time-averaged electron temperature of each window (<T_e>) is obtained. The largest <T_e> is found and the corresponding window is marked as "C". Three adjacent windows are then grouped as one set, giving 48 groups in all, labeled S_1, S_2, S_3, ..., S_48. Next, the smoothness of each group is calculated. Since the steady segment may appear on either side of the highest peak, it is necessary to start at the C-th window and compare the sum of the smoothness of the nine groups on the left with that of the nine groups on the right. The steady segment lies on the smoother side. Suppose that the right side is smoother. Then the minimum value among S_C, S_{C+1}, ..., S_{C+8} is found and identified as the steady segment of this signal.

Figure 4(b) shows the <T_e> profile of shot 52327, 4CH 9Row. It is obvious that the highest peak lies in the third second. Figure 4(c) shows the smoothness of the entire signal. Among the candidate groups, the smallest can be quickly found.
In Figure 4(b), the 12-th window, which is located at the red marker in Figure 4(c), is indeed the steady segment. For one shot there are 384 steady segments, one for each channel. In general, the steady segments of most channels are at the same position. The steady segment is composed of three adjacent windows. In order to reduce the error, the middle window is selected as the representative of the steady segment.

[Figure 4: (a) T_e/V versus t/s; (b) T_e/V per window; (c) smoothness distribution diagram.]

Figure 4. (a) Raw signal of shot 52327, 4CH 9Row, which is a typical zero signal. (b) After smoothing, the data of each window is averaged. (c) The smoothness distribution diagram is obtained by averaging over the fluctuation of each window; the 12-th window, which is located at the red marker, is the steady segment.
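
The search for the steady segment can be sketched as below. Several details are assumptions rather than statements from the text: the smoothness of a group is taken as the mean of the standard deviations of its three windows, the left candidates are the nine groups that start before window C, the right candidates are the nine groups that start at C, and the two sides are compared by their mean smoothness (equivalent to comparing sums when both sides contain nine groups) so that sides clipped at the signal boundary remain comparable.

```python
from statistics import fmean, pstdev
from typing import Sequence


def find_steady_window(signal: Sequence[float], n_windows: int = 50,
                       group: int = 3, span: int = 9) -> int:
    """Return the 0-based index of the middle window of the steady segment."""
    size = len(signal) // n_windows
    if size < 1:
        raise ValueError("signal is shorter than the number of windows")
    windows = [signal[k * size:(k + 1) * size] for k in range(n_windows)]
    means = [fmean(w) for w in windows]
    sigmas = [pstdev(w) for w in windows]

    # Window "C": the one with the largest time-averaged value.
    peak = max(range(n_windows), key=means.__getitem__)

    # Group k covers windows k .. k+group-1 (48 groups for 50 windows).
    n_groups = n_windows - group + 1
    smooth = [fmean(sigmas[k:k + group]) for k in range(n_groups)]

    left = list(range(max(0, peak - span), min(peak, n_groups)))
    right = list(range(min(peak, n_groups), min(peak + span, n_groups)))

    def side_score(indices: list) -> float:
        return fmean([smooth[k] for k in indices]) if indices else float("inf")

    side = left if side_score(left) < side_score(right) else right
    best = min(side, key=smooth.__getitem__)
    return best + group // 2  # middle window represents the segment
```

Returning the middle window follows the choice made in the text to reduce the error of the representative value.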
A parameter of the model is the distance between the steady segment and the noise segments, so it is also important to find the noise segments. At the end of each signal, there is a segment that contains almost only noise. Figure 5(a) is the raw signal of shot 49024, 4CH 9Row. Figure 5(b) shows the σT_e profile over the fifty windows. The 34-th window, which is marked by a red cross, is the steady segment, and the 46-th window, which is labeled by a black circle, is where the background noise begins. It can be seen from Figure 5(b) that a prominent peak sits between the steady segment and the background noise segments. This is not a coincidence, but an inevitable stage for each signal. Before 9 s there is active plasma emitting cyclotron radiation, and then the plasma terminates. Since this is a transient process, the σT_e profile changes very fast. This feature can be used to quickly identify the background noise of each signal.

[Figure 5: (a) T_e/V versus t/s; (b) fluctuation averaged over each of the 50 windows.]
Figure 5. (a) Raw signal of shot 49024, 4CH 9Row. (b) σT_e profile of shot 49024, 4CH 9Row. The 34-th window, which is marked by a red cross, is the steady segment, and the 46-th window, which is labeled by a black circle, is where the background noise begins.

According to the method described in Section 3.2.2, the background noise of each signal is found. The next step is to calculate the distance S between the steady segment and the background noise segments. In Figure 6, the blue marker is the time-averaged temperature of the steady segment (<T_e>). The red markers indicate the sum of the time-averaged temperature and the temperature fluctuation of the noise segments, i.e., <T_e> + σT_e. If the difference between <T_e> of the steady segment and <T_e> + σT_e of the noise segments is small enough, the signal is classified as a zero signal. This difference is therefore the parameter used by the decision tree to identify zero signals. In Figure 6, the distance S between the steady segment and the noise segments of shot 49024, 4CH 9Row is approximately 0.075 V, which is small. Thus, it is identified as a zero signal.

[Figure 6: T_e/V versus window number (windows 34 to 50): the distance between the stationary segment and the trailing noise segment.]

Figure 6. The distance S between the steady segment and the trailing noise segment (shot 49024, 4CH 9Row). The blue marker represents the average of the steady segment signal, i.e., <T_e>. The red markers represent the sum of the mean of the noise segment signal and the average of the fluctuation values, i.e., <T_e> + σT_e. If the difference between the two is small enough, the signal is identified as a zero signal.

The Signal-to-Noise Ratio (SNR) of weak signals lies between those of normal signals and zero signals. In the physical analysis of the ECEI data, the focus is on the relative temperature fluctuation of electrons (σT_e/<T_e>) [16]. In the actual measurement, σT_e profiles include both the normal electron temperature fluctuation and the background noise. If the SNR is too poor, the subsequent physical research will not make sense. Therefore it is necessary to identify weak signals. The noise of the system is generally inferred by comparing experiments with and without RF input [17]. Here it is preferable to use the noise sections at the end of the signal as the background noise. The electron temperature fluctuation of the steady segment divided by that of the noise segments (σT_e of the steady segment over σT_e of the noise segments) is selected as a parameter that represents the signal-to-noise ratio, and is used as the parameter for the decision tree.

Figure 7 shows the proportion of saturated signals, normal signals, weak signals, and zero signals in the sample set. Shots 42987, 42999, 49024, and 51064 are used as the sample set. They are tagged manually. Some of them are used for training, and the rest are used for testing. A total of 1536 samples are collected. Of these, 8% are saturated signals, 4.5% are zero signals, 12.5% are weak signals, and 75% are normal signals. The sample set is randomly distributed. The proportion of abnormal signals is moderate and representative. First, after saturated signals are classified, 1409 signals are left. They include 61 zero signals and 1348 non-zero signals. After zero signals are identified, the rest are sent for the classification of weak signals and non-weak signals.

[Figure 7: pie chart of the sample set: saturated signal 8%, zero signal 4.5%, weak signal 12.5%, normal signal 75%.]
Figure 7. The sample set for the classification.
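
Before turning to the results, the two decision-tree parameters defined in Section 3 can be written out explicitly. The sketch below is illustrative only: the function names are hypothetical, fixed thresholds stand in for the trained decision trees, `zero_threshold` has no value in the text and must be supplied by the caller, and the assumption that a ratio below the boundary of 1.299 means "weak" follows from the physical picture of a poor signal-to-noise ratio rather than from an explicit statement.

```python
from statistics import fmean, pstdev
from typing import Sequence


def zero_distance(steady: Sequence[float], noise: Sequence[float]) -> float:
    """S = <T_e>(steady) - (<T_e>(noise) + sigma_T_e(noise))."""
    return fmean(steady) - (fmean(noise) + pstdev(noise))


def fluctuation_ratio(steady: Sequence[float], noise: Sequence[float]) -> float:
    """sigma_T_e of the steady segment over sigma_T_e of the noise segments."""
    sigma_noise = pstdev(noise)
    if sigma_noise == 0:
        raise ValueError("noise segment has zero fluctuation")
    return pstdev(steady) / sigma_noise


def label_unsaturated(steady: Sequence[float], noise: Sequence[float],
                      zero_threshold: float,
                      weak_boundary: float = 1.299) -> int:
    """Label a signal that has already passed the saturation test.

    Returns 2 (zero), 3 (weak) or 0 (normal). Thresholds replace the
    trained decision trees in this sketch.
    """
    if zero_distance(steady, noise) < zero_threshold:
        return 2
    if fluctuation_ratio(steady, noise) < weak_boundary:
        return 3
    return 0
```

Both parameters are scalars, so each decision tree reduces to a comparison against a learned boundary, which is why a single threshold per test is enough for the illustration.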
After the SVM and decision tree algorithms are implemented, massive raw data from the ECEI experiments on the EAST tokamak are fed into the learner for training, through which the parameters of the model, such as the window size and the kernel function, are optimized. The SVM and decision tree models with the optimized parameters generate satisfactory classification results.
In multiple tests, the recognition rates of saturated signals reach 100%. One of the samples predicted wrongly is the signal of shot 42987, 22CH 3Row. It is found that this signal is very close to saturated signals. In order to test the effect of the model, a five-fold cross validation is adopted. The 1536 samples are randomly divided into five groups. One group is selected as the test sample and the other four are training samples. Sensitivity (Sen), specificity (Spe), and total accuracy (Q) are calculated for each validation as follows,
Sen = TP/(TP + FN), (5.1)
Spe = TN/(TN + FP), (5.2)
Q = (TP + TN)/(TP + FN + TN + FP). (5.3)

Table 1. Results of five-fold cross validation for saturated signals.

                     TP    FN    TN     FP   Sen(%)   Spe(%)   Q(%)
Cross-validation 1   26    0     280    1    100      99.64    99.67
Cross-validation 2   17    1     287    2    94.4     99.3     99
Cross-validation 3   24    1     282    1    96       99.65    99.35
Cross-validation 4   27    0     279    1    100      99.64    99.67
Cross-validation 5   26    0     279    2    100      99.29    99.35
total                120   2     1407   7    98.36    99.5     99.4

The results are listed in Table 1. Here, TP represents the number of saturated signals identified correctly, FN represents the number of saturated signals that are wrongly identified as unsaturated signals, TN is the number of unsaturated signals that are correctly classified, and FP is the number of unsaturated signals that are wrongly identified as saturated signals [18]. From table 1, Sen, Spe, and Q are all close to 100%, which shows that the classification model can accurately identify saturated signals.

Table 2. Results of five-fold cross validation for zero signals.

                     TP    FN    TN     FP   Sen(%)   Spe(%)   Q(%)
Cross-validation 1   17    0     264    0    100      100      100
Cross-validation 2   13    1     267    0    92.86    100      99.64
Cross-validation 3   10    0     271    0    100      100      100
Cross-validation 4   10    0     271    0    100      100      100
Cross-validation 5   9     1     275    0    90       100      99.65
total                59    2     1348   0    96.7     100      99.86
Table 3. Results of five-fold cross validation for weak signals.

                     TP    FN    TN     FP   Sen(%)   Spe(%)   Q(%)
Cross-validation 1   33    0     235    0    100      100      100
Cross-validation 2   27    0     241    0    100      100      100
Cross-validation 3   39    0     228    1    100      99.6     99.6
Cross-validation 4   30    0     238    0    100      100      100
Cross-validation 5   41    0     235    0    100      100      100
total                170   0     1177   1    100      99.9     99.9
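
As a cross-check, the "total" rows of Tables 1 to 3 can be recomputed from the definitions (5.1) to (5.3). The short sketch below uses only the counts given in the tables; the function and variable names are chosen for illustration.

```python
from typing import Dict, Tuple


def metrics(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float, float]:
    """Sensitivity, specificity and total accuracy, Eqs. (5.1)-(5.3)."""
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    q = (tp + tn) / (tp + fn + tn + fp)
    return sen, spe, q


# (TP, FN, TN, FP) from the "total" rows of Tables 1, 2 and 3.
totals: Dict[str, Tuple[int, int, int, int]] = {
    "saturated": (120, 2, 1407, 7),
    "zero": (59, 2, 1348, 0),
    "weak": (170, 0, 1177, 1),
}

for name, counts in totals.items():
    sen, spe, q = metrics(*counts)
    print(f"{name:9s} Sen={100 * sen:.2f}%  Spe={100 * spe:.2f}%  Q={100 * q:.2f}%")
```

The recomputed total accuracies are 99.41%, 99.86%, and 99.93%, in agreement with the rounded values of 99.4%, 99.86%, and 99.9% quoted in the tables.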
Based on the statistics, the boundary between weak signals and non-weak signals lies at a ratio of σT_e in the steady segment to σT_e in the noise segments of 1.299. From table 3, the results of the five-fold cross validation are close to 100%. One of the samples predicted wrongly is shot 49024, 23CH 11Row, whose ratio is 1.3004, which is very close to the boundary between weak signals and normal signals.

In summary, artificial intelligence technologies are applied to classify the massive ECEI data on the EAST tokamak. As a preprocessing step, the data are separated into segments. The SVM algorithm is used to identify saturated signals, and the decision tree method is applied to classify zero signals and weak signals. The models are trained using the massive ECEI data on the EAST tokamak with optimized model parameters. Cross-validation studies show that the model can identify saturated signals, zero signals, and weak signals with accuracies of 99.4%, 99.86%, and 99.9%, respectively. This shows that the model can be used in practice. In future work, similar automatic classification techniques will be developed and applied to identify physical modes, such as plasma instabilities, from valid ECEI signals.
Acknowledgments
This work was supported partly by the National Key Research and Development Program under Grant Nos. 2016YFA0400600, 2016YFA0400601 and 2016YFA0400602, by the Anhui Provincial Natural Science Foundation under Grant No. 1808085MA25, and by the Fundamental Research Funds for the Central Universities under Grant Nos. WK2150110008 and WK2030040098.
References

[1] Gao BX, Xie JL, Mao Z, Luo C, Zhu YL, Zhao ZL, Tong L, Liu WD, Luhmann NC, Domier CW and Tobias B, The electron cyclotron emission imaging system on EAST with continuous large observation area, Journal of Instrumentation (2018) P02009.
[2] Kim JB, Lee W, Yun GS, Park HK, Domier CW and Luhmann NC Jr, Data acquisition and processing system of the electron cyclotron emission imaging system of the KSTAR tokamak, Review of Scientific Instruments (2010) 10D931.
[3] Baonian Wan for the EAST and HT-7 Teams and International Collaborators, Recent experiments in the EAST and HT-7 superconducting tokamaks, Nuclear Fusion (2009) 104011.
[4] M. Becoulet, M. Kim, G. Yun et al., Non-linear MHD modelling of edge localized modes dynamics in KSTAR, Nuclear Fusion (2017) 116059.
[5] Bo Wang, Bingjia Xiao, Jiangang Li, Yong Guo and Zhengping Luo, Artificial Neural Networks for Data Analysis of Magnetic Measurements on East, Journal of Fusion Energy (2016) 390.
[6] N. Isei, A. Isayama, S. Ishida et al., Electron cyclotron emission measurements in JT-60U, Fusion Engineering and Design (2001) 213-220.
[7] A. Vannucci, K.A. Oliveira and T. Tajima, Forecast of TEXT plasma disruptions using soft X rays as input signal in a neural network, Nuclear Fusion (1999).
[8] T. Lan, J. Liu and H. Qin, Preference-based performance measures for Time-Domain Global Similarity method, JINST (2017) C12008.
[9] Shi-ping Li, Fang-chao Chen and Long Wang, Modulation recognition algorithm of digital signal based on support vector machine, Control and Decision Conference (CCDC) (2012).
[10] Hongyan Zhao, The analysis and application of the C4.5 algorithm in decision tree technology, Advanced Materials Research (2012) 754-757.
[11] Xu Xiao-Yuan, Wang Jun, Yu Yi et al., Electron temperature fluctuation in the HT-7 tokamak plasma observed by electron cyclotron emission imaging, Chinese Physics B (2009).
[12] Liu Yong, Ti Ang, Han Xiang et al., Present Status of the Electron Cyclotron Emission Measurements on HT-7 and EAST, Plasma Science and Technology (2011) 10090630.
[13] Zhao Z, Xie J, Qu C, Liao W, Li H, Lan T, Liu A, Zhuang G and Liu W, Analysis of sawtooth collapse time using electron cyclotron emission imaging on EAST tokamak, Radiation Effects and Defects in Solids (2017) 760-7.
[14] Azam Hussain, Zhenling Zhao, Jinlin Xie et al., Observations of compound sawteeth in ion cyclotron resonant heating plasma using ECE imaging on experimental advanced superconducting tokamak, Physics of Plasmas (2016) 042504.
[15] Azam Hussain, Gao Bing-Xi, Liu Wan-Dong and Xie Jin-Lin, Electron Cyclotron Emission Imaging Observations of m/n=1/1 and Higher Harmonic Modes during Sawtooth Oscillation in ICRF Heating Plasma on EAST, Chinese Physics Letters (2015) 065201.
[16] B.J. Tobias, R.L. Boivin, J.E. Boom et al., On the application of electron cyclotron emission imaging to the validation of theoretical models of magnetohydrodynamic activity, Physics of Plasmas (2011) 056107.
[17] X. Han, X. Liu, Y. Liu et al., Design and characterization of a 32-channel heterodyne radiometer for electron cyclotron emission measurements on experimental advanced superconducting tokamak, Review of Scientific Instruments (2014) 10897623.
[18] Dinesh V. Rojatkar, Krushna D. Chinchkhede and G.G. Sarate, Handwritten Devnagari consonants recognition using MLPNN with five fold cross validation, International Conference on Circuit, Power and Computing Technologies