MNRAS, 1–7 (2021) Preprint 8 February 2021 Compiled using MNRAS LaTeX style file v3.0
Deep reinforcement learning for smart calibration of radio telescopes
Sarod Yatawatta★ and Ian M. Avruch†
ASTRON, Oude Hoogeveensedijk 4, 7991 PD Dwingeloo, The Netherlands
★ E-mail: [email protected]
† E-mail: [email protected]
ABSTRACT
Modern radio telescopes produce unprecedented amounts of data, which are passed through many processing pipelines before the delivery of scientific results. Hyperparameters of these pipelines need to be tuned by hand to produce optimal results. Because many thousands of observations are taken during the lifetime of a telescope and because each observation will have its unique settings, the fine tuning of pipelines is a tedious task. In order to automate this process of hyperparameter selection in data calibration pipelines, we introduce the use of reinforcement learning. We use a reinforcement learning technique called twin delayed deep deterministic policy gradient (TD3) to train an autonomous agent to perform this fine tuning. For the sake of generalization, we consider the pipeline to be a black-box system where only an interpreted state of the pipeline is used by the agent. The autonomous agent trained in this manner is able to determine optimal settings for diverse observations and is therefore able to perform smart calibration, minimizing the need for human intervention.
Key words:
Instrumentation: interferometers; Methods: numerical; Techniques: interferometric
1 INTRODUCTION

Data processing pipelines play a crucial role for modern radio telescopes, where the incoming data flow is reduced in size by many orders of magnitude before science-ready data are produced. In order to achieve this, data are passed through many processing steps, for example, to mitigate radio frequency interference (RFI), to remove strong confusing sources and to correct for systematic errors. Calibration is one crucial step in these interferometric processing pipelines, not only to correct for systematic errors in the data, but also to subtract strong outlier sources such that weak RFI mitigation can be carried out. Most modern radio telescopes are general purpose instruments because they serve diverse science cases. Therefore, depending on observational parameters such as the direction in the sky, observing frequency, observing time (day or night) etc., the processing pipelines need to be adjusted to realize the potential quality of the data.

The best parameter settings for each pipeline are mostly determined by experienced astronomers. This requirement for hand tuning is problematic for radio telescopes that stream data uninterruptedly. In this paper, we propose the use of autonomous agents to replace the human aspect in fine-tuning pipeline settings. We focus on calibration pipelines, in particular, distributed calibration pipelines where data at multiple frequencies are calibrated together, in a distributed computer (Yatawatta 2015; Brossard et al. 2016; Yatawatta et al. 2018; Ollier et al. 2018; Yatawatta 2020). In this setting, the smoothness of systematic errors across frequency is exploited to get a better quality solution, as shown by results with real and simulated data (Patil et al. 2017; Mertens et al. 2020; Mevius et al. 2021).

The key hyperparameter that needs fine-tuning in a distributed calibration setting is the regularization factor for the smoothness of solutions across frequency. It is possible to select the regularization analytically (Yatawatta 2016); however, extending this to calibration along multiple directions (where each direction will have a unique regularization factor) is not optimal. In this paper, we train an autonomous agent to predict the regularization factors for a distributed, multi-directional calibration pipeline. The calibration pipeline will perform calibration with a cadence of a few seconds to a few minutes of data. Therefore, for an observation of long duration the calibration pipeline needs to be run many times. At the beginning of every observation the autonomous agent will recommend the regularization factors to be used in the pipeline based on the pipeline state and will update the factors at the required cadence.

We train the autonomous agent using reinforcement learning (RL) (Sutton & Barto 2018). Reinforcement learning is a branch of machine learning where an agent is trained by repeated interactions with its environment, with proper rewards (or penalties) awarded whenever a correct (or incorrect) decision (or action) is made. In our case, the environment or the observations made by the agent are i) the input data to the pipeline, ii) the input sky model used in calibration by the pipeline, iii) the output solutions, and iv) the output residual data from the pipeline.
However, we do not feed data directly to the agent; instead, only an interpretation of the pipeline state is fed to the agent. Because the calibration is done on data of a small time duration, the amount of data used to determine this state is also small (compared with the full observation). The determination of the reward is done considering two factors. First, the calibration should perform accurately, such that the variance of the residual data should be lower than that of the input data (scaled by the correction applied during calibration). Secondly, we should not overfit the data, which would suppress signals of scientific interest in the residual. In order to measure the overfitting, we use the influence function (Cook & Weisberg 1982; Koh & Liang 2017), which we can calculate in closed form (Yatawatta 2019).
Notation: Lower case bold letters refer to column vectors (e.g. y). Upper case bold letters refer to matrices (e.g. C). Unless otherwise stated, all parameters are complex numbers. The set of complex numbers is given as C and the set of real numbers as R. The matrix inverse, pseudo-inverse, transpose, Hermitian transpose, and conjugation are referred to as $(\cdot)^{-1}$, $(\cdot)^{\dagger}$, $(\cdot)^{T}$, $(\cdot)^{H}$, $(\cdot)^{\star}$, respectively. The matrix Kronecker product is given by $\otimes$. The vectorized representation of a matrix is given by vec(·). The identity matrix of size N is given by I_N. All logarithms are to the base e, unless stated otherwise. The Frobenius norm is given by $\|\cdot\|$ and the L1 norm by $\|\cdot\|_1$.

2 DISTRIBUTED CALIBRATION

In this section, we give a brief overview of the distributed, direction dependent calibration pipeline that is used by the RL agent for hyperparameter tuning. We refer the reader to Yatawatta (2015) for an in-depth overview of its operation.

The signal received by an interferometer is given by (Hamaker et al. 1996)
\[
\mathbf{V}_{pqf} = \sum_{k=1}^{K} \mathbf{J}_{kpf}\,\mathbf{C}_{kpqf}\,\mathbf{J}_{kqf}^{H} + \mathbf{N}_{pqf} \tag{1}
\]
where we have signals from K discrete sources in the sky being received. All items in (1), i.e., V_pqf, C_kpqf, J_kpf, J_kqf, N_pqf ∈ C^{2×2}, are implicitly time varying; subscripts p, q correspond to stations p and q, f corresponds to the receiving frequency, and k corresponds to the source index. The coherency of the k-th source at baseline p-q is given by C_kpqf and, for a known source, this can be pre-computed (Thompson et al. 2001). The noise N_pqf consists of the actual thermal noise as well as signals from sources in the sky that are not part of the model. The systematic errors due to the propagation path (ionosphere), receiver beam, and receiver electronics are represented by J_kpf and J_kqf. For each time and frequency, with N stations, we have N(N−1)/2 baselines p-q.

Calibration is the estimation of J_kpf for all p and f, using data within a small time interval (say T), within which it is assumed that J_kpf remains constant. An important distinction is that we only calibrate along K directions, but in real observations there are always signals from faint sources (including the Galaxy) that are not being modeled. We are not presenting a method of calibration in this paper, and therefore the actual method of calibration is not relevant. However, for the sake of simplicity, we follow a straightforward description. What we intend to achieve is to relate the hyperparameters used in the pipeline to the performance of calibration. Therefore, we represent J_kpf by 8 real parameters (for frequency f and direction k), giving the parameter vector θ_kf per direction and 8 × N × K real parameters in total for frequency f.

We represent (1) in vector form as
\[
\mathbf{v}_{pqf} = \sum_{k=1}^{K} \mathbf{s}_{kpqf}(\boldsymbol{\theta}_{kf}) + \mathbf{n}_{pqf} \tag{2}
\]
where $\mathbf{s}_{kpqf}(\boldsymbol{\theta}_{kf}) = \left(\mathbf{J}_{kqf}^{\star} \otimes \mathbf{J}_{kpf}\right)\mathrm{vec}(\mathbf{C}_{kpqf})$, v_pqf = vec(V_pqf), and n_pqf = vec(N_pqf).
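As a quick numerical sanity check of the vectorization identity behind (2) (a minimal illustrative sketch, not part of the pipeline), the snippet below verifies that vec(J_p C J_q^H) = (J_q^* ⊗ J_p) vec(C) for random 2×2 matrices, using column-major vectorization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 2x2 complex Jones matrices and source coherency.
J_p = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
J_q = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
C   = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))

vec = lambda M: M.flatten(order="F")  # column-major vectorization, vec(.)

# Left-hand side: vectorized model visibility J_p C J_q^H.
lhs = vec(J_p @ C @ J_q.conj().T)
# Right-hand side: (J_q^* kron J_p) vec(C), as in equation (2).
rhs = np.kron(J_q.conj(), J_p) @ vec(C)

print("identity holds:", np.allclose(lhs, rhs))
```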
We stack up the data and model vectors of (2), over all baselines p-q and the T time slots within which a single solution is obtained, as
\[
\mathbf{x}_f = \left[\mathrm{real}(\mathbf{v}_{pqf}^{T}), \mathrm{imag}(\mathbf{v}_{pqf}^{T}), \ldots\right]^{T}, \quad
\mathbf{s}_{kf}(\boldsymbol{\theta}_{kf}) = \left[\mathrm{real}\!\left(\mathbf{s}_{kpqf}(\boldsymbol{\theta}_{kf})^{T}\right), \mathrm{imag}\!\left(\mathbf{s}_{kpqf}(\boldsymbol{\theta}_{kf})^{T}\right), \ldots\right]^{T} \tag{3}
\]
which are vectors of size 8 × N(N−1)/2 × T, and we have
\[
\mathbf{x}_f = \sum_{k=1}^{K} \mathbf{s}_{kf}(\boldsymbol{\theta}_{kf}) + \mathbf{n}_f = \mathbf{s}_f(\boldsymbol{\theta}_f) + \mathbf{n}_f \tag{4}
\]
as our final measurement equation, with θ_kf ∈ R^{8N} and θ_f ∈ R^{8KN}.

In order to estimate θ_kf from (4), we define a loss function
\[
g_{kf}(\boldsymbol{\theta}_{kf}) = \mathrm{loss}(\mathbf{x}_f, \boldsymbol{\theta}_{kf}) \tag{5}
\]
depending on the noise model, for example, the mean squared error loss $\|\mathbf{x}_f - \mathbf{s}_f(\boldsymbol{\theta}_f)\|^2$. We calibrate data at many frequencies and we exploit the smoothness of θ_kf with f: in order to make the solutions smooth, we introduce the constraint
\[
\boldsymbol{\theta}_{kf} = \mathbf{Z}_k \mathbf{b}_f \tag{6}
\]
where b_f (∈ R^P) is a polynomial basis (P terms) evaluated at f and Z_k (∈ R^{8N×P}) is the global variable ensuring smoothness across frequency. With the use of Lagrange multipliers y_kf (∈ R^{8N}), we write the augmented Lagrangian as
\[
l(\boldsymbol{\theta}_{kf}, \mathbf{y}_{kf}, \mathbf{Z}_k : \forall k, f) = \sum_{k,f}\left( g_{kf}(\boldsymbol{\theta}_{kf}) + \mathbf{y}_{kf}^{T}\left(\boldsymbol{\theta}_{kf} - \mathbf{Z}_k \mathbf{b}_f\right) + \frac{\rho_k}{2}\left\|\boldsymbol{\theta}_{kf} - \mathbf{Z}_k \mathbf{b}_f\right\|^2 \right) \tag{7}
\]
which we minimize to find a smooth solution.

The hyperparameters in (7) are the regularization factors ρ_k (∈ R⁺; we have one regularization factor per direction k). Therefore, we need to select K positive, real valued hyperparameters to get the optimal solution to (7). We can consider the K directional calibration as K separate subproblems and find analytical solutions for the optimal ρ_k as done in Yatawatta (2016); however, this is not optimal. The alternative is to use handcrafted methods such as grid search, but because each observation is unique, performing this for every observation is not feasible. In the next section, we will present the use of RL to automatically find the best values for ρ_k.
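To make the role of ρ concrete, the following is a minimal toy sketch (our own illustrative example, not the SAGECal implementation) of the alternating updates that minimize an augmented Lagrangian of the form (7) for a single direction with a linear least-squares loss g_f(θ_f) = ‖x_f − A_f θ_f‖² and a known polynomial basis b_f; all sizes and the value of ρ here are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
F, M, P, rho = 8, 20, 3, 10.0                           # frequencies, parameters, basis terms, regularization

freqs = np.linspace(0.0, 1.0, F)
B = np.vstack([freqs**i for i in range(P)]).T           # b_f stacked row-wise, shape (F, P)
Z_true = rng.normal(size=(M, P))
theta_true = Z_true @ B.T                               # frequency-smooth ground truth, shape (M, F)

A = [rng.normal(size=(50, M)) for _ in range(F)]        # per-frequency design matrices
x = [A[f] @ theta_true[:, f] + 0.1 * rng.normal(size=50) for f in range(F)]

theta = np.zeros((M, F))
y = np.zeros((M, F))                                    # Lagrange multipliers
Z = np.zeros((M, P))

for it in range(50):
    # theta update: minimize ||x_f - A_f theta||^2 + y_f^T(theta - Z b_f) + rho/2 ||theta - Z b_f||^2
    for f in range(F):
        lhs = 2 * A[f].T @ A[f] + rho * np.eye(M)
        rhs = 2 * A[f].T @ x[f] - y[:, f] + rho * (Z @ B[f])
        theta[:, f] = np.linalg.solve(lhs, rhs)
    # Z update: least-squares fit of the polynomial model across frequency
    Z = (theta + y / rho) @ B @ np.linalg.inv(B.T @ B)
    # dual update of the Lagrange multipliers
    y += rho * (theta - Z @ B.T)

print("relative parameter error:",
      np.linalg.norm(theta - theta_true) / np.linalg.norm(theta_true))
```

A larger ρ pulls the per-frequency solutions harder towards the smooth consensus Z b_f, while a smaller ρ lets each frequency fit (and overfit) its own data; this is exactly the trade-off the RL agent is asked to tune.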
The RL agent interacts with the pipeline (the pipeline is solving (7)) to recommend the best values for ρ_k. In order to do this, the RL agent needs to have some information about the environment (i.e., the performance of the pipeline). In Fig. 2, we show the interactions between the agent and the pipeline. After each calibration, we calculate the residual as
\[
\mathbf{R}_{pqf} = \mathbf{V}_{pqf} - \sum_{k=1}^{K} \widehat{\mathbf{J}}_{kpf}\,\mathbf{C}_{kpqf}\,\widehat{\mathbf{J}}_{kqf}^{H} \tag{8}
\]
where Ĵ_kpf ∀ k, p, f are calculated using the solutions obtained for θ_kf.

We can rewrite (4) in a simplified form (dropping the subscripts k, f) as
\[
\mathbf{x} = \mathbf{s}(\boldsymbol{\theta}) + \mathbf{n} \tag{9}
\]
where x (∈ R^D) is the data, s(θ) is our model and n is the noise. We find a solution to θ by minimizing a loss g(θ), say θ̂. Thence, we calculate the residual as
\[
\mathbf{y} = \mathbf{x} - \mathbf{s}(\widehat{\boldsymbol{\theta}}). \tag{10}
\]
We have shown (Yatawatta 2019) that the probability density functions of x and y, i.e., p_X(x) and p_Y(y), are related as
\[
p_X(\mathbf{x}) = |\mathcal{J}|\, p_Y(\mathbf{y}) \tag{11}
\]
where |𝒥| is the Jacobian determinant of the mapping from x to y. Using (10), we can show that
\[
|\mathcal{J}| = \exp\left( \sum_{i=1}^{D} \log\left(1 + \lambda_i(\mathbf{A})\right) \right) \tag{12}
\]
where
\[
\mathbf{A} = \frac{\partial \mathbf{s}(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}^{T}} \left( \frac{\partial^2 g(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}^{T}} \right)^{-1} \frac{\partial^2 g(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\,\partial \mathbf{x}^{T}} \tag{13}
\]
evaluated at θ = θ̂. Ideally, we should have |𝒥| = 1, i.e., λ_i(A) = 0. But in practice, due to model error and degeneracies in the model, we have some non-zero values for λ_i(A).

Using (13), we can calculate E{∂R_pqf/∂x_{p'q'lf}} for the residual R_pqf in (8), where the expectation E{·} is taken over p', q' (excluding the tuple p, q) and l (the index over the elements of x_{p'q'f}). We call the image made in the usual radio astronomical image synthesis manner, replacing the residual R_pqf with E{∂R_pqf/∂x_{p'q'lf}}, the influence map, which we denote by I(θ). In Fig. 1, we show a montage of I(θ) generated in our simulations (see section 4). All the images shown in Fig. 1 are made using the same uv coverage, and the images are not deconvolved with the PSF. Therefore, the only reason for the differences in these images is the sky models (the true sky and the sky model used in calibration) and the performance of the calibration pipeline itself. Therefore, influence maps serve a useful purpose in diagnosing the performance of calibration (Yatawatta 2019).

We relate the influence map to the bias and variance of the calibration residual as follows. The bias is primarily caused by errors in the sky model (the difference between the true sky and the sky model used in calibration). Therefore, with low regularization, we have high bias, high influence and low variance. With high regularization, we have low bias, low influence and high variance. An example of this behavior is shown in Fig. 4. Note that we consider the bias and variance of the residual data, and not the bias and variance of the solutions of calibration. If we consider the solutions, with low regularization we will have low bias and high variance (overfitting compared to the ground truth), and with high regularization, we will have high bias and low variance (underfitting). The fine tuning of ρ_k will give us the optimal tradeoff between the bias and the variance of the residual data, and we achieve this by RL as described in section 3.
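As an illustration of (12) and (13) (a minimal sketch for an assumed ridge-regularized linear model, not the calibration pipeline itself), the snippet below computes the eigenvalues λ_i(A) and log |𝒥| for g(θ) = ‖x − Aθ‖² + ρ‖θ‖², showing how increasing ρ drives the eigenvalues towards zero, i.e., towards lower influence:

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 50, 10
A = rng.normal(size=(D, M))             # design matrix (the model Jacobian ds/dtheta^T)

def influence_spectrum(A, rho):
    # Equation (13) for g(theta) = ||x - A theta||^2 + rho ||theta||^2:
    #   ds/dtheta^T = A,  d^2 g/dtheta dtheta^T = 2(A^T A + rho I),  d^2 g/dtheta dx^T = -2 A^T,
    # so the influence matrix is  -A (A^T A + rho I)^{-1} A^T.
    infl = -A @ np.linalg.solve(A.T @ A + rho * np.eye(A.shape[1]), A.T)
    lam = np.linalg.eigvalsh(infl)      # the influence matrix is symmetric here
    log_det_jacobian = np.sum(np.log(1.0 + lam))   # equation (12)
    return lam, log_det_jacobian

for rho in (0.01, 1.0, 100.0):
    lam, logJ = influence_spectrum(A, rho)
    print(f"rho={rho:7.2f}  min(1+lam)={1 + lam.min():.3f}  log|J|={logJ:.2f}")
```

Low regularization pushes some 1 + λ_i(A) towards zero (large negative log |𝒥|, high influence), while high regularization pushes them towards one (low influence), mirroring the behavior of the influence maps described above.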
3 REINFORCEMENT LEARNING

In this section, we describe the basic concepts of reinforcement learning, especially tailored to our problem, and thereafter we go into specific details about our RL agent. An RL agent interacts with its environment via some observations and performs an action depending on these observations and the current state of the agent. The objective of the agent is to maximize its reward depending on the action it performs. As shown in Fig. 2, our agent interacts with our calibration pipeline, which is the environment. We describe these interactions as follows:

• At an instance indexed by t, the input to the agent is given by state s_t. This is a summary of the observation that we deem sufficient for the agent to evaluate its action. In our case, we give the sky model used in calibration as part of s_t. In addition, we create an influence map (an image) that summarizes the performance of calibration as part of s_t. Note that we do not feed the data or the metadata (such as the uv coverage) into the agent.

• Depending on s_t and its internal parameter values, the agent performs an action a_t. In our case, the action is the specification of the regularization factors ρ_k (K values), which is fed back to the pipeline.

• We determine the reward r_t (a real valued scalar) to the agent using two criteria. First, we compare the noise of the input data σ_0 and of the output residual σ_1 of the pipeline (by making images of the field being observed) and find the ratio σ_0/σ_1. We get a higher reward if the noise is reduced. Second, we look at the influence map and find its standard deviation, which we add to the reward. If we have high influence (high bias), we will get a smaller reward. With both these components constituting the reward, we aim to minimize both the bias and the variance of the pipeline.

Note that unlike in most RL scenarios, we do not have a terminal state (such as in a computer game where a player wins or loses). We terminate when we reach a certain number of iterations of refinement on ρ_k.

We give an overview of the inner workings of our RL agent; for a more thorough overview, we refer the reader to Silver et al. (2014); Lillicrap et al. (2015); Fujimoto et al. (2018). The major components of the agent are:

• Actor: the actor with policy π(·) will produce action a given the state s, π(s) → a.

• Critic: given the state s and action a, the critic will evaluate the expected return q (a scalar), Q(s, a) → q. The mapping Q(·, ·) is also called the state-action value function.

Note that neither the actor nor the critic depends on past states, but only on the current state and a measure of the state's favorability for future rewards, which is a characteristic of a Markov decision process (Sutton & Barto 2018). Looking into the future, the expected cumulative reward can be written as the Bellman equation
\[
Q^{\star}(s, a) = r + \gamma \max_{a' = \pi(s')} Q(s', a') \tag{14}
\]
where s', a' is the (optimal) state-action pair for the next step and γ is the discount factor (close to 1).
Figure 1.
A montage of influence maps, 128 × 128 pixels each, centered at the phase centre, generated with various calibration settings and sky models. All these maps are created using the same uv coverage. The values in each influence map are normalized to the range [ , ].
[Figure 2 schematic labels: Agent, Pipeline, Interpreter, Sky Model, Data In, Data Out, Solutions, Influence map, Reward and Regularization.]
Figure 2.
The interactions between the RL agent and its environment, i.e., the pipeline.
We model both the actor and the critic as deep neural networks, with trainable parameters φ for the actor and ξ for the critic. We train the actor and the critic as follows. Given the current state s, the actor tries to maximize J(φ) = Q(s, a)|_{a=π(s)} by choice of action. We perform gradient ascent on J(φ), with the gradient (using the chain rule) given as
\[
\nabla_{\phi} J(\phi) = \nabla_{a} Q(s, a)\big|_{a=\pi(s)} \times \nabla_{\phi}\pi(s). \tag{15}
\]
The critic is trained by minimizing the error between the current state-action value Q(s, a) and the expected cumulative reward under optimal settings (14),
\[
J(\xi) = \left\| Q_{\xi}(s, a) - \left( r + \gamma \max_{a'=\pi(s')} Q(s', a') \right) \right\|^2. \tag{16}
\]
In practice, RL is prone to instability and divergence, and therefore we follow van Hasselt et al. (2015); Lillicrap et al. (2015); Fujimoto et al. (2018) with the following refinements:

• Instead of having one critic, we have two online critics (Q_1(s, a) and Q_2(s, a)) and two target critics (Q'_1(s', a') and Q'_2(s', a')), and select the lower evaluation out of the two to evaluate (14).

• Rather than having just one actor, we also have a target actor network. When evaluating (14), we use the target network to evaluate max_{a'=π(s')} Q(s', a').

• The parameters of the target networks are updated by a weighted average between the online and target parameters, and for the target actor, the update is only performed at delayed intervals.

• We use random actions at the start of learning for exploration, which is called the warmup period.

• We keep a buffer to store past actions, current states, next states and rewards, which we re-use in mini-batch sampling during training (this is called the replay buffer).

See Algorithm 1 of Fujimoto et al. (2018) for a detailed description of the TD3 algorithm.
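The following is a minimal PyTorch sketch of one TD3 training step implementing (14)–(16) with the refinements above (clipped double critics, a target actor, delayed and soft target updates). The network modules, optimizers and mini-batch are placeholders passed in as arguments; this is not the exact code or architecture used in our pipeline:

```python
import torch
import torch.nn.functional as F

def td3_update(actor, actor_target, critic1, critic2, critic1_target, critic2_target,
               actor_opt, critic_opt, batch, step, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    """One TD3 update on a mini-batch sampled from the replay buffer."""
    s, a, r, s_next = batch  # tensors: state, action, reward, next state

    with torch.no_grad():
        # Target action with clipped noise (target policy smoothing); actions assumed in [-1, 1].
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_target(s_next) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q: take the smaller of the two target critics, as in (14).
        q_next = torch.min(critic1_target(s_next, a_next), critic2_target(s_next, a_next))
        q_target = r + gamma * q_next   # no terminal state in our setting

    # Critic loss, equation (16), applied to both online critics.
    critic_loss = F.mse_loss(critic1(s, a), q_target) + F.mse_loss(critic2(s, a), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed policy update, equation (15): maximize Q_1(s, pi(s)).
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Soft (weighted-average) update of all target networks.
        for online, target in [(actor, actor_target), (critic1, critic1_target),
                               (critic2, critic2_target)]:
            for p, p_t in zip(online.parameters(), target.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```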
4 SIMULATIONS

Simulations are required to train the RL agent. In this section, we present the details of our simulations and show that the agent is able to learn; in other words, with more training, the reward obtained by the agent increases.

Before we proceed to describing the agent used in radio interferometric calibration pipelines, we consider a much simpler problem. This is to test and demonstrate the efficacy of our reinforcement learning algorithm (TD3), which has already been proven to be superior in other environments; however, we did not find an analogue to hyperparameter tuning. Elastic net regression (Zou & Hastie 2005) observes x (∈ R^D) given the design matrix A (∈ R^{D×D}) and estimates the parameters θ (∈ R^D) of the linear model x = Aθ. The estimation is performed under regularization as
\[
\widehat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}}\left( \|\mathbf{x} - \mathbf{A}\boldsymbol{\theta}\|^2 + \rho_1 \|\boldsymbol{\theta}\|_1 + \rho_2 \|\boldsymbol{\theta}\|^2 \right) \tag{17}
\]
where we have hyperparameters ρ_1 and ρ_2. We can easily relate (17) to (7), which is our original problem.

Representing (17) as in (9) and using (13), for the elastic net problem we have
\[
\mathbf{A} = \mathbf{A}\left( \mathbf{A}^{T}\mathbf{A} + \rho_2 \mathbf{I} \right)^{-1}\left( -\mathbf{A}^{T} \right) \tag{18}
\]
which can be used to calculate the influence function. The RL agent for elastic net hyperparameter selection interacts with its environment as follows:

• State: the design matrix A and the eigenvalues 1 + λ(A).

• Action: the regularization factors ρ_1, ρ_2.

• Reward: ‖x‖ / ‖x − Aθ̂‖ + min(1 + λ(A)) / max(1 + λ(A)).

Note that the reward considers both the (inverse) residual error as well as the bias (the ratio of the eigenvalues of the influence matrix). In this manner, we can achieve a balance between the bias and the variance of the residual.
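A minimal sketch of such an environment, written against the classic openai.gym API, is given below. This is our own illustrative version: the proximal-gradient (ISTA) solver, the action range, the episode length and the initial ρ_2 used for the first observation are all assumptions, so the state, action and reward follow the bullets above but the details differ from our actual implementation:

```python
import numpy as np
import gym
from gym import spaces

class ElasticNetEnv(gym.Env):
    """Toy environment: the agent picks (rho1, rho2) for elastic net regression."""

    def __init__(self, D=20, noise=0.1, episode_len=10):
        super().__init__()
        self.D, self.noise, self.episode_len = D, noise, episode_len
        self.action_space = spaces.Box(low=0.0, high=10.0, shape=(2,), dtype=np.float32)
        # State: flattened design matrix A and the D values (1 + lambda) of the influence matrix.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(D * D + D,), dtype=np.float32)

    def reset(self):
        rng = np.random.default_rng()
        self.A = rng.normal(size=(self.D, self.D))
        theta = np.zeros(self.D)
        k = rng.integers(1, self.D + 1)                    # random number of non-zero entries
        theta[rng.choice(self.D, k, replace=False)] = rng.normal(size=k)
        self.x = self.A @ theta + self.noise * rng.normal(size=self.D)
        self.t = 0
        return self._observe(rho2=1.0)                     # arbitrary rho2 for the first observation

    def _solve(self, rho1, rho2, iters=500):
        # Proximal-gradient (ISTA) solver for ||x - A theta||^2 + rho1 ||theta||_1 + rho2 ||theta||^2.
        L = 2.0 * (np.linalg.norm(self.A, 2) ** 2 + rho2)  # Lipschitz constant of the smooth part
        theta = np.zeros(self.D)
        for _ in range(iters):
            grad = -2.0 * self.A.T @ (self.x - self.A @ theta) + 2.0 * rho2 * theta
            z = theta - grad / L
            theta = np.sign(z) * np.maximum(np.abs(z) - rho1 / L, 0.0)  # soft threshold
        return theta

    def _observe(self, rho2):
        # Influence matrix of equation (18) and its eigenvalues.
        infl = self.A @ np.linalg.solve(self.A.T @ self.A + rho2 * np.eye(self.D), -self.A.T)
        self.eig = 1.0 + np.linalg.eigvalsh(infl)
        return np.concatenate([self.A.ravel(), self.eig]).astype(np.float32)

    def step(self, action):
        rho1, rho2 = float(action[0]), max(float(action[1]), 1e-3)  # keep rho2 > 0 for (18)
        theta = self._solve(rho1, rho2)
        obs = self._observe(rho2)
        # Reward: inverse residual error plus the eigenvalue ratio of the influence matrix.
        reward = np.linalg.norm(self.x) / np.linalg.norm(self.x - self.A @ theta) \
                 + self.eig.min() / self.eig.max()
        self.t += 1
        return obs, float(reward), self.t >= self.episode_len, {}
```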
Figure 3. The smoothed score variation with episode for the agent selecting hyperparameters in elastic net regression.

We consider D = 20 and the agent is implemented in PyTorch (Paszke et al. 2017), within an openai.gym compatible environment (https://gym.openai.com/). The critic uses 2 linear layers each for the state and for the action, finally combined in one linear layer to produce a scalar output. The actor uses three linear layers to produce the 2 dimensional output. We use exponential linear unit activation (Clevert et al. 2015) except in the last layer of the actor, where we use hyperbolic tangent activation (which is scaled to the valid range afterwards). We use batch normalization (Ioffe & Szegedy 2015) between all layers except the last. We use the LBFGS algorithm (Yatawatta et al. 2019) to solve (17) and the Adam optimizer (Kingma & Ba 2014) is used to train the agent.

In each episode, we create A by filling its entries with unit variance, zero mean Gaussian random values. The ground truth value of the parameters, θ, is generated with the number of non-zero entries randomly generated in the range [ , D], and these non-zero entries are filled with unit variance, zero mean Gaussian entries. In order to get x, we add unit variance, zero mean Gaussian noise to Aθ with a signal to noise ratio of 0. [...]
[ , ] MHz. Calibration is performed for every T = 10 time samples (every 10 sec) and 6 such calibrations are performed to interact with the RL agent. In other words, 1 min of data is used by the agent to make a recommendation on the regularization parameters. In practice, the first 1 min of data is used to adjust the hyperparameters of the pipeline, and thereafter the pipeline is run unchanged for the full observation (which typically will last a few hours). We use SAGECal (http://sagecal.sourceforge.net/) for calibration and excon (https://sourceforge.net/projects/exconimager/) for imaging.

We simulate data for calibration along K = [...] directions, with J_kpf in (1) that are smooth both in time and in frequency, but have random initial values. One direction out of K represents the observed field, and its location is randomly chosen within a 1 deg radius from the field centre. The remaining K − [...] [ , ] Jy. The intrinsic intensities of the outlier sources are randomly selected within [ , ] Jy. An additional 200 weak sources (point sources and Gaussians) with intensities in [ . , . ] Jy are randomly positioned across a 16 × [...]
Note that the sky model used for calibration only includes the K sources (not the weak sources). For the calibration pipeline, the RL agent interacts with its environment as follows:

• State: the sky model used in calibration and the influence map I(θ) generated using 1 min of data.

• Action: the regularization factors ρ_k for the K directions.

• Reward: σ_0/σ_1 + B σ(I(θ)), where σ_0 and σ_1 are the image noise standard deviations of the input data and the output residual, respectively, B is a scaling constant (depending on T and N(N−1)/2), and σ(I(θ)) is the standard deviation of the influence map.

In Fig. 4, we show the residual image standard deviation, the influence map standard deviation and the calculated reward. The x-axis corresponds to the regularization parameters ρ_k, scaled up and down by various factors (starting from the regularization that gives the highest reward). For low regularization values, we see more or less constant residual image noise; however, we clearly see a high level of influence. This high influence on the left hand side of Fig. 4 implies high bias (due to the unmodeled sources in the sky) and, as a consequence, we should expect a high level of weak-signal suppression. On the other hand, on the right hand side of Fig. 4, for high regularization, we see increased residual image noise, meaning poor calibration. We compose our reward to reach a balance between the bias and the variance of the residual.

The RL agent and environment are quite similar to the ones used in elastic net regression. The critic uses 3 convolutional layers to process the influence map, and two linear layers to process the action and the sky model; these are combined at the final linear layer. The actor also uses 3 convolutional layers to process the influence map and two linear layers for the sky model; once again, they are combined at the final linear layer. The first three layers of both the actor and the critic use batch normalization. The activation functions are exponential linear units, except at the last layer of the actor, where we use hyperbolic tangent activation. The action is scaled to the required range by the environment.
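As an illustration of the actor architecture just described (a sketch under assumed sizes: a single-channel 128 × 128 influence map, a flattened sky-model vector of length n_sky, and K output regularization factors; the exact layer widths and kernel sizes in our implementation may differ):

```python
import torch
import torch.nn as nn

class CalibrationActor(nn.Module):
    """Actor: 3 conv layers for the influence map, 2 linear layers for the sky model,
    combined in a final linear layer with tanh output (rescaled to rho_k by the environment)."""

    def __init__(self, n_sky=40, K=4):
        super().__init__()
        self.conv = nn.Sequential(                       # influence map branch, input 1x128x128
            nn.Conv2d(1, 8, kernel_size=5, stride=2), nn.BatchNorm2d(8), nn.ELU(),
            nn.Conv2d(8, 16, kernel_size=5, stride=2), nn.BatchNorm2d(16), nn.ELU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.BatchNorm2d(32), nn.ELU(),
            nn.Flatten(),
        )
        conv_out = self.conv(torch.zeros(1, 1, 128, 128)).shape[1]  # infer flattened size
        self.sky = nn.Sequential(                        # sky model branch
            nn.Linear(n_sky, 64), nn.BatchNorm1d(64), nn.ELU(),
            nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ELU(),
        )
        self.head = nn.Linear(conv_out + 64, K)          # combined final layer

    def forward(self, influence_map, sky_model):
        z = torch.cat([self.conv(influence_map), self.sky(sky_model)], dim=1)
        return torch.tanh(self.head(z))                  # in [-1, 1], rescaled to rho_k by the env


# Example: a batch of 2 states -> 2 sets of K regularization factors.
actor = CalibrationActor(n_sky=40, K=4)
out = actor(torch.randn(2, 1, 128, 128), torch.randn(2, 40))
print(out.shape)  # torch.Size([2, 4])
```

The critic follows the same pattern, with the action vector processed by its own linear layers and the combined final layer producing a scalar Q value.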
Figure 4.
The residual noise, the influence map noise and the reward for various values of regularization. The regularization factors are scaled up and down from the optimal values.

Figure 5.
The smoothed score variation with episode for the agent in the calibration pipeline.

The training of our RL agent is done as follows. In each episode, we generate a random sky model and data as described previously. The pipeline is run with initial regularization factors determined analytically (Yatawatta 2016). The state information generated by the pipeline is fed into the RL agent to retrieve updated regularization factors. The pipeline is run again and we repeat the cycle. We limit the number of such iterations per episode to 10 (an arbitrary number). At the final iteration, we record the reward (or the score) to measure the progress of the agent. The smoothed score over a number of such episodes is shown in Fig. 5. We see a positive trend in the score in Fig. 5, indicating that the agent is able to learn and improve its recommendations for the regularization factors to use. We also see a small dip around the 35-th episode, where we switched from learning to warmup mode (exploration instead of exploitation): instead of choosing the action based on the output of the actor, we switched to random actions (because we re-started the training at episode 32).

Once we have a fully trained agent, we only have to run one iteration to get a recommendation from the agent. Note that we have used only 1 min of data for this and, afterwards, we can keep the pipeline running with the same hyperparameters for a much longer time (ideally for the full observation). This aspect needs more work with real data, which we intend to pursue as future work. In addition, we can develop similar agents for other data processing steps in radio astronomy, including RFI mitigation and image synthesis. Extending beyond radio astronomy, similar techniques can be applied to data processing pipelines in general machine learning problems.
5 CONCLUSIONS

We have introduced the use of reinforcement learning in hyperparameter tuning of radio astronomical calibration pipelines. By using both the input-output noise reduction as well as the influence to determine the reward offered to the RL agent, we can reach a balance between the bias and the variance introduced by the pipeline. As illustrated by the elastic-net regression example, we can also apply the same technique to many other data processing steps, not only in radio astronomy but also in other applications.
DATA AVAILABILITY
Ready to use software based on this work and test data are already available online (https://github.com/SarodYatawatta/smart-calibration).

REFERENCES
Brossard M., El Korso M. N., Pesavento M., Boyer R., Larzabal P., Wijnholds S. J., 2016, preprint (arXiv:1609.02448)
Clevert D.-A., Unterthiner T., Hochreiter S., 2015, arXiv e-prints
Cook R., Weisberg S., 1982, Residuals and Influence in Regression. Monographs on Statistics and Applied Probability, Chapman & Hall, http://books.google.nl/books?id=MVSqAAAAIAAJ
Fujimoto S., van Hoof H., Meger D., 2018, arXiv e-prints, p. arXiv:1802.09477
Geman S., Bienenstock E., Doursat R., 1992, Neural Computation, 4, 1
Hamaker J. P., Bregman J. D., Sault R. J., 1996, Astronomy and Astrophysics Supp., 117, 96
Ioffe S., Szegedy C., 2015, arXiv e-prints, p. arXiv:1502.03167
Kingma D. P., Ba J., 2014, preprint (arXiv:1412.6980)
Koh P. W., Liang P., 2017, in Precup D., Teh Y. W., eds, Proceedings of Machine Learning Research Vol. 70, Proceedings of the 34th International Conference on Machine Learning. PMLR, International Convention Centre, Sydney, Australia, pp 1885–1894, http://proceedings.mlr.press/v70/koh17a.html
LeCun Y., Bengio Y., Hinton G., 2015, Nature, 521, 436
Lillicrap T. P., Hunt J. J., Pritzel A., Heess N., Erez T., Tassa Y., Silver D., Wierstra D., 2015, arXiv e-prints, p. arXiv:1509.02971
Mertens F., et al., 2020, MNRAS, 493, 1662
Mevius M., et al., 2021, preprint
Mnih V., et al., 2015, Nature, 518, 529
Ollier V., Korso M. N. E., Ferrari A., Boyer R., Larzabal P., 2018, Signal Processing, 153, 348
Paszke A., et al., 2017, in NIPS-W
Patil A. H., et al., 2017, ApJ, 838, 65
Silver D., Lever G., Heess N., Degris T., Wierstra D., Riedmiller M., 2014, in Proceedings of the 31st International Conference on Machine Learning (ICML 2014). JMLR.org, pp 387–395
Sutton R. S., Barto A. G., 2018, Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA
Thompson A., Moran J., Swenson G., 2001, Interferometry and Synthesis in Radio Astronomy (3rd ed.). Wiley Interscience, New York
Yatawatta S., 2015, MNRAS, 449, 4506
Yatawatta S., 2016, in 24th European Signal Processing Conference (EUSIPCO), 2016
Yatawatta S., 2019, MNRAS, 486, 5646
Yatawatta S., 2020, MNRAS, 493, 6071
Yatawatta S., Diblen F., Spreeuw H., Koopmans L. V. E., 2018, MNRAS, 475, 708
Yatawatta S., De Clercq L., Spreeuw H., Diblen F., 2019, in 2019 IEEE Data Science Workshop (DSW), pp 208–212, doi:10.1109/DSW.2019.8755567
Zou H., Hastie T., 2005, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301
van Hasselt H., Guez A., Silver D., 2015, arXiv e-prints, p. arXiv:1509.06461

This paper has been typeset from a TeX/LaTeX file prepared by the author.