Deep Reinforcement Learning for Efficient Measurement of Quantum Devices
V. Nguyen†, S.B. Orbell†, D.T. Lennon, H. Moon, F. Vigneau, L.C. Camenzind, L. Yu, D.M. Zumbühl, G.A.D. Briggs, M.A. Osborne, D. Sejdinovic, and N. Ares

Department of Materials, University of Oxford, Oxford OX1 3PH, United Kingdom
Department of Engineering, University of Oxford, Oxford OX2 6ED, United Kingdom
Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
Department of Physics, University of Basel, 4056 Basel, Switzerland

† Both authors contributed equally and are displayed in decreasing order of seniority.
Deep reinforcement learning is an emerging machine learning approach which can teach a computer to learn from its actions and rewards, similar to the way humans learn from experience. It offers many advantages in automating decision processes to navigate large parameter spaces. This paper proposes a novel approach to the efficient measurement of quantum devices based on deep reinforcement learning. We focus on double quantum dot devices, demonstrating the fully automatic identification of specific transport features called bias triangles. Measurements targeting these features are difficult to automate, since bias triangles are found in otherwise featureless regions of the parameter space. Our algorithm identifies bias triangles in a mean time of less than 30 minutes, and sometimes as little as 1 minute. This approach, based on dueling deep Q-networks, can be adapted to a broad range of devices and target transport features. This is a crucial demonstration of the utility of deep reinforcement learning for decision making in the measurement and operation of quantum devices.
INTRODUCTION
Reinforcement learning (RL) is a neurobiologically-inspired machine learning paradigm where an RL agent will learn policies to successfully navigate or influence the environment. Neural network-based deep reinforcement learning (DRL) algorithms have proven to be very successful by surpassing human experts in domains such as the popular Atari 2600 games [1], chess [2], and Go [3]. RL algorithms are expected to advance the control of quantum devices [4–21], because the models can be robust against noise and stochastic elements present in many physical systems and they can be trained without labelled data. However, the potential of deep reinforcement learning for the efficient measurement of quantum devices is still unexplored.

Semiconductor quantum dot devices are a promising candidate technology for the development of scalable quantum computing architectures. Singlet-triplet qubits encoded in double quantum dots [22] have demonstrably long coherence times [23], as well as high one- and two-qubit gate fidelities [24]. But quantum dot devices are subject to variability, and many measurements are required to characterise each device and find the conditions for qubit operation. Machine learning has been used to automate the tuning of devices from scratch, known as super coarse tuning [25–27], the identification of single or double quantum dot regimes, known as coarse tuning [28, 29], and the tuning of the inter-dot tunnel couplings and other device parameters, referred to as fine tuning [30–32].

The efficient measurement and characterisation of quantum devices has been less explored so far. We have previously developed an efficient measurement algorithm for quantum dot devices combining a deep-generative model and an information-theoretic approach [33]. Other approaches have developed classification tools which are used in conjunction with numerical optimisation routines to navigate quantum dot current maps [28, 31, 34]. These methods, however, fail when there are large areas in parameter space that do not exhibit transport features. To perform efficient measurements in these areas, which are often good for qubit operation, requires prior knowledge of the measurement landscape and a procedure to avoid over-fitting, i.e. a regularisation method.

In this paper, we propose to use DRL for the efficient measurement of a double quantum dot device. Our algorithm is capable of finding specific transport features, in particular bias triangles, surrounded by featureless areas in a current map. The state-of-the-art DRL decision agent is embedded within an efficient algorithmic workflow, resulting in a significant reduction of the measurement time in comparison to existing methods. A convolutional neural network (CNN), a popular image classification tool [35, 36], is used to identify the bias triangles. This optimal decision process allows for the identification of promising areas of the parameter space without the need for human intervention. Fully automated approaches, such as the measurement algorithm presented here, could help to realise the full potential of spin qubits by addressing key difficulties in their scalability.

We focus on quantum dot devices that are electrostatically defined by Ti/Au gate electrodes fabricated on a GaAs/AlGaAs heterostructure (Fig. 1a) [37, 38]. All the experiments were performed using GaAs double quantum dot devices at dilution refrigerator temperatures of approximately 30 mK.
The two-dimensional electron gas created at the interface of the two semiconductor materials is depleted by applying negative voltages to the gate electrodes. The confinement potential defines a double quantum dot which is controlled by these gate voltages and coupled to the electron reservoirs (the source and drain contacts).
FIG. 1. Overview of device architecture and quantum dot environment. (a) False-colour SEM image of a GaAs double quantum dot device. Barrier gates, labelled V_B1 and V_B2, are highlighted in red. The arrow represents the flow of current through the device between source and drain contacts. (b) A current map. The white grid represents the blocks available for investigation by the DRL agent. The DRL agent is initiated in a random block (state) indicated by a filled white square. The filled orange blocks show the available action space for the DRL agent and the arrow shows a possible policy decision. (c) and (d) The nine sub-blocks defined within each block, a 32 mV × 32 mV window in gate voltage space, to calculate a statistical state vector. These sub-blocks are equal in gate voltage size; five of them are shown in (c) and four in (d). The red sub-block in (d) contains bias triangles.

Depending on the combination of gate voltages, the double quantum dot can be in the 'open', the 'pinch-off' or the 'single-electron transport' regime. In the 'open' regime, an unimpeded current flows through the device. Conversely, when the current is completely blocked, the device is said to be in the 'pinch-off' regime. In the 'single-electron transport' regime, the current is maximal when the electrochemical potentials of each quantum dot are within the bias voltage window V_bias between source and drain contacts.

Our algorithm interacts with a quantum dot environment within which our DRL decision agent operates to efficiently find the target transport features. The environment consists of states, defined by sets of measurements in gate voltage space, and a set of actions and rewards to navigate that space. This quantum dot environment has been developed based upon the OpenAI Gym interface [39] (see Supplementary Information A for further details of the quantum dot environment's state, action and reward). Manual identification and characterisation of transport features requires a high-resolution measurement of a current map defined by, for example, barrier gate voltages V_B1 and V_B2 while keeping other gate voltages fixed, an example of which is shown in Fig. 1b. A super coarse tuning algorithm allows us to choose a set of promising gate voltages and focus on exploring the current map as a function of two gates, for example the two barrier gates [27]. This is the gate voltage space our DRL agent will navigate.

Our DRL algorithm takes the gate voltage coordinates found by our previous super coarse tuning algorithm [27], and divides the gate voltage space corresponding to the unmeasured current map into blocks. The size of the blocks is chosen such that they can fully contain bias triangles (blocks are shown as a white grid in Fig. 1b). Devices with similar gate architectures often show bias triangles of similar sizes for a given V_bias. The DRL agent is initiated in a random block. The agent acquires a reduced number of current measurements from this block and makes a decision on whether a high-resolution measurement is required, and on which block to explore next if bias triangles are not observed. The agent has a possible action space represented by a vector of length six; this means the agent can decide to acquire measurements in any of the four contiguous blocks ('up', 'down', 'left' or 'right') or in the two diagonal blocks that permit the agent to efficiently move between the 'open' and the 'pinch-off' transport regimes. These blocks correspond to an increase or decrease of both gate voltages, which strongly modulates the current through the device. The remaining two diagonal blocks, which correspond to a decrease of one gate voltage and an increase of the other, do not often lead to such significant changes in the transport regime and are thus not included in the agent's action space, to maximise the efficiency of the algorithm. The DRL agent can be efficiently trained using current maps already recorded from many other devices.
This is because their transport features are sufficiently similar, even though the gate voltage values at which they are observed vary for different devices.

The decision of which block to explore next is based on the current measurements acquired by the DRL agent in a given block. The block is divided into nine sub-blocks (Fig. 1c and d) and the mean µ and standard deviation σ of the current measurements corresponding to each sub-block are calculated. These statistical values, constituting an 18-element vector, provide the agent with information about its probable location in the current map. The statistical state vector, or state representation vector, enables the DRL decision agent to abstract knowledge about the transport regime, distinguishing between 'open', 'pinch-off', and 'single-electron transport' regimes, with a reduced number of measurements. In this way, the state vector defines a state in the quantum dot environment.

This statistical approach, compared to the alternative of using CNNs to evaluate acquired measurements, makes the agent less prone to over-fitting during training and more robust to experimental noise. To decide whether the agent has found bias triangles in a given block, the algorithm uses a CNN as a binary classification tool. Combining a state representation based on measurement statistics and CNNs in a reinforcement learning framework which makes use of the experience of the agent navigating similar environments during training, our algorithm provides a decision process for efficient measurement without human intervention.

RESULTS

Description of the algorithm
The algorithm is comprised of different modules for classification and decision making (Fig. 2). In the initialisation stage, two low-resolution current traces are acquired by the algorithm as a function of V_B1 (V_B2) with V_B2 (V_B1) set to the maximum voltage given by the gate voltage window to be explored. The algorithm extracts from these measurements the maximum and minimum current values and their standard deviation, which will be used in a later stage by the classification modules. The gate voltage regions we explore are delimited by a 640 mV × 640 mV window centred in the gate voltage coordinates proposed by a super coarse tuning algorithm, as mentioned in the Introduction.

This window is divided into 32 mV × 32 mV blocks and the agent is initialised in a randomly selected block. The algorithm takes random pixel measurements of current within this block. Each pixel is 1 mV × 1 mV, and the algorithm computes µ and σ for each of the 9 sub-blocks into which the block is divided. Pixels are sampled randomly from the block until the statistics from the state representation have converged. Convergence is generally achieved after sampling fewer than 100 pixels, significantly less than the 1024 pixels in a block (see Supplementary Information B for the convergence curves and the convergence criterion).
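To make the state construction concrete, the sketch below estimates the 18-element state vector by random pixel sampling with a convergence check. It is a minimal illustration, assuming a hypothetical measure_pixel(vb1, vb2) current-readout function and the one-percent mean-change convergence criterion described in Supplementary Information B; the helper names and batch size are ours, not the authors' code.

    import numpy as np

    BLOCK = 32  # block size in pixels (1 mV per pixel)

    def state_vector(measure_pixel, origin, n_batch=10, tol=0.01, max_pixels=1024):
        """Estimate (mu, sigma) for each of the 9 sub-blocks of one block.

        measure_pixel(vb1, vb2) -> current; origin = block corner in mV.
        Pixels are sampled at random until the mean change of the state
        vector between batches drops below tol (1% in the paper).
        """
        rng = np.random.default_rng()
        samples = [[] for _ in range(9)]  # one list per 3x3 sub-block
        prev = None
        for n in range(n_batch, max_pixels + 1, n_batch):
            for _ in range(n_batch):
                i, j = rng.integers(0, BLOCK, size=2)
                sub = 3 * min(i // 11, 2) + min(j // 11, 2)  # sub-block index
                samples[sub].append(measure_pixel(origin[0] + i, origin[1] + j))
            state = np.array([[np.mean(s), np.std(s)] if s else [0.0, 0.0]
                              for s in samples]).ravel()  # 18 statistical values
            if prev is not None and np.mean(np.abs(state - prev)) < tol * (
                    np.mean(np.abs(prev)) + 1e-12):
                return state, n  # converged
            prev = state
        return state, max_pixels

Convergence is typically reached well before the full 1024-pixel grid scan, which is what makes the state evaluation cheap.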
The state vector is first evaluated by a pre-classification module. A block is considered to correspond to the single-electron transport regime if any of the µ values falls within threshold values derived from the current statistics acquired in the initialisation stage. Fig. 3a shows the blocks in a current map that would be identified by the pre-classifier as corresponding to the single-electron transport regime, while Fig. 3b shows the blocks that would be evaluated by the CNN binary classification to determine if bias triangles are observed (see Supplementary Information C for a summary of the CNN's architecture and its training).

If the pre-classifier considers the block to correspond to the 'open' or 'pinch-off' regimes, or if the CNN does not identify bias triangles within the block, the DRL agent has to decide which block to explore next. With this objective, the state vector is normalised using the variance and mean current values obtained in the initialisation stage, and fed into a deep neural network which controls the DRL decision agent. The agent will then propose an action which it expects will lead to the highest long-term reward. This action $a_t$, given by $a_t = \arg\max_{a'} Q^{\pi}(s_t, a')$, is the action which maximises the $Q$-function for the agent's stochastic policy $\pi$ in the state-action pair $(s_t, a')$ at time $t$. The $Q$-function measures the value of choosing an action $a'$ when in state $s_t$, and therefore the action $a_t$ represents the agent's prediction for the most efficient route to bias triangles. In our quantum dot environment setting, the action determines the next block to explore and the algorithm begins a new iteration.

The deep reinforcement learning agent
Our algorithm makes use of the deep Q-learning framework, which uses deep neural networks to approximate the $Q$-function [1]. The $Q$-function is defined by $Q^{\pi}(s_t, a_t) = \mathbb{E}[R_t \mid s = s_t, a = a_t, \pi]$, which gives an expected reward $R_t$ for a chosen action $a_t$ taken by an agent with a policy $\pi$ in the state $s_t$. This expected reward is defined as $R_t = \sum_{\tau=t}^{\infty} \gamma^{\tau - t} r_\tau$, where $\gamma \in [0, 1]$ is a discount factor that trades off the importance of immediate rewards $r_t$ and future rewards $r_{\tau > t}$. The agent aims to maximise $R_t$ via the $Q^{\pi}(s_t, a_t)$ learnt by the neural network. In particular, we chose to implement the dueling deep Q-network (dueling DQN) [40] architecture for our DRL decision agent. This architecture factors the neural network into two entirely separate estimators for the state-value function and the state-dependent action advantage function [40]. The state-value function, $V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(s_t)}[Q^{\pi}(s_t, a_t)]$, gives a measure for how valuable it is, for an agent with a stochastic policy $\pi$ in the search for a promising reward, to be in a given state $s_t$. The state-dependent action advantage function [40] gives a relative measure of the importance of each action, given by $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$.
FIG. 2.
Schematic depicting the algorithmic workflow. (See main text for a full description.) In the initialisation stage, starting from the gate voltage coordinates proposed by a coarse tuning algorithm, the algorithm measures low-resolution current traces as a function of V_B1 (V_B2) with V_B2 (V_B1) set to the maximum voltage given by the gate voltage window of interest (i). The algorithm then performs a random pixel measurement in the block corresponding to the proposed starting gate voltages (ii). In this measurement, mean current values and standard deviations are calculated for the 9 sub-blocks within the block until convergence. The statistical state representation vector (state vector) obtained is then assessed by the pre-classification stage (iii). If the mean current value corresponding to any of the sub-blocks falls within threshold values given by the initialisation stage, then the block is pre-classified as corresponding to a possible single-electron transport regime. In this case, the block is explored further by performing a high-resolution scan. This block measurement is normalised and input into a CNN binary classification algorithm (iv). If the CNN identifies bias triangles, then the algorithm terminates. If either the pre-classifier or the CNN classifier rejects a block, then the state vector is input into the DRL decision agent (v). The decision agent subsequently selects an action on the gate voltages, which determines the next block to measure via the random pixel method.

In dueling DQN, when combining the state-value function and the state-dependent action advantage function, it is crucial to ensure that given $Q$ we can recover $V^{\pi}(s_t)$ and $A^{\pi}(s_t, a_t)$ uniquely. For this purpose, the advantage function estimator is forced to be zero at the chosen action $a_t$ [40]. This approach allows the agent, through the estimation of $V^{\pi}(s_t)$, to learn the value of certain states in terms of their potential to guide the agent to a promising reward. This is particularly beneficial in our case, since different state vectors can correspond to the same transport regime and thus be equally valuable in the search for bias triangles. Consequently, the most beneficial action in these states would often coincide. For example, in most states corresponding to the 'pinch-off' regime, the most beneficial action is often to increase both gate voltages.
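As an illustration of this factorisation, the following sketch implements a dueling Q-network head in PyTorch for the 18-element state vector and the six-action space. The layer widths and the max-based aggregation, which zeroes the advantage at the greedy action [40], are our own assumptions for the sketch; this is not the authors' released code.

    import torch
    import torch.nn as nn

    class DuelingDQN(nn.Module):
        """Dueling Q-network: separate value V(s) and advantage A(s, a) streams.

        Q(s, a) = V(s) + A(s, a) - max_a' A(s, a'), so the advantage is zero
        at the greedy action and V and A are recoverable from Q [40].
        """
        def __init__(self, state_dim=18, n_actions=6, hidden=128):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
            self.value = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))
            self.advantage = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                           nn.Linear(hidden, n_actions))

        def forward(self, state):
            h = self.trunk(state)
            v = self.value(h)                      # (batch, 1)
            a = self.advantage(h)                  # (batch, n_actions)
            return v + a - a.max(dim=1, keepdim=True).values

    # Greedy action selection, a_t = argmax_a' Q(s_t, a'):
    net = DuelingDQN()
    q = net(torch.randn(1, 18))    # a normalised 18-element state vector
    action = int(q.argmax(dim=1))  # index into the six-action space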
To train the DRL agent, we designed a reward function to ensure that the agent would learn to efficiently locate bias triangles. To this end, during training, the agent is rewarded for the detection of bias triangles and penalised for the number of blocks explored or measured in a single algorithm run, N. The reward r = +10 is assigned to blocks exhibiting bias triangles. Other blocks are assigned r = −1. During training, the maximum number of blocks that can be measured in a given run, N_max, is set to 300. If after N_max block measurements the agent has not found bias triangles, the algorithm is terminated and the agent is punished with r = −10 (see Supplementary Information A for further details regarding the design of the reward function). In other words, N_max determines how far from the starting block the agent can reach in gate voltage space, as it can only explore contiguous blocks.

We trained the dueling DQN (DRL decision agent) using the prioritised experience replay method [41] from a memory buffer. This method ensures that successful policy decisions are replayed more frequently in the DRL agent's learning process. The agent does not benefit from an ordered sequence of episodes during learning, yet it is able to learn from rare but highly successful policy decisions and it is less likely to settle in local minima of the decision policy. We trained the agent over 10000 episodes (algorithm runs), each time initialised in a random block, for 4 different current maps which were previously recorded. The training takes less than an hour on a single CPU.

FIG. 3.
Classification tools. (a) Example of blocks considered by the pre-classifier as corresponding to the 'single-electron transport' regime, overlaid on the corresponding current map. The colour bar represents the number (M), out of nine, of sub-blocks which were not rejected by the pre-classification stage. (b) Blocks in (a), displaying features corresponding to the 'single-electron transport' regime, overlaid on the corresponding current map. Inset: a block displaying bias triangles and the corresponding output value of the CNN binary classifier.
Experimental results
We demonstrate the real-time ('online') performance of our algorithm in a double quantum dot device. The algorithm performance is evaluated according to the number of blocks explored in an algorithm run, N, which is equal to the number of blocks explored to successfully identify bias triangles unless N = N_max, and according to the laboratory time spent in this task. For training and testing the algorithm's performance we use different devices, both similar to the device shown in Fig. 1a. We ran the DRL algorithm in two different regions of gate voltage space, I and II, which are centred in the coordinates from our super coarse tuning algorithm [27]. We ran the algorithm 10 times in each region. The DRL agent was initiated in a different block for every run, sampled uniformly at random. From these repeated runs, we can estimate the median ¯N of the distribution of values of N obtained for a given region. We can also estimate (L, U), where L and U are the lower and upper deciles of the distribution. To identify bias triangles, the DRL agent required ¯N = 40 (9, 104) for region I and ¯N = 32 (10, 94) for region II. In both regions considered, our algorithm efficiently located bias triangles in a mean time of 30 minutes and, on one occasion, in less than 1 minute. This is an order of magnitude improvement in measurement efficiency compared to the laboratory time required to acquire a current map with the grid scan method, i.e. measuring the current while sweeping V_B2 and stepping V_B1.

When initiated in the 'pinch-off' regime, the agent learns to decrease the magnitude of the negative gate voltages (Fig. 4a). Conversely, when initiated in a transport regime corresponding to higher currents, the agent increases the magnitude of the negative voltage applied to the gate electrodes (Fig. 4b). The policy thus leads to block measurements in the areas of gate voltage space where bias triangles are usually located.

We have performed an ablation study. Ablation studies are used to identify the relative contribution of different algorithm components. In this case, our aim is to determine the benefit of using a DRL agent. We thus produced an algorithm in which the DRL decision agent was replaced with a random decision agent, and compared its performance with the DRL algorithm. The random agent selects an action, sampled uniformly at random. The QDE's action space is six-dimensional, except in instances where the agent is in a state (block) along the edges (five-dimensional action space) or in the corners (four-dimensional action space) of the gate voltage window considered. This measurement strategy is similar to a random walk within the gate voltage space but, unlike a pure random walk strategy, it will not measure the same block twice. The random decision agent's measurement run is terminated when the CNN classifies a block measurement as containing bias triangles. The random agent was initialised in the same random positions as the DRL agent so that a fair comparison could be made between their performances. We performed 10 runs of each algorithm in each of the two different regions of parameter space considered in this work, I and II (Fig. 4c and d). The DRL agent outperforms the random decision agent in the value of ¯N, and thus in the laboratory time required to successfully identify bias triangles. Note that the relation between ¯N and the laboratory time is not linear, as high-resolution block measurements are only performed for each block classified as corresponding to the single-electron transport regime by the pre-classification stage.

In region II, the random agent requires ¯N equal to 85 (50, 143), compared to the ¯N = 32 (10, 94) corresponding to the DRL agent (see Supplementary Information D for the value of ¯N in region I and corresponding lab times). The good performance of the random decision agent can be explained by its use of the pre-classifier, which makes the random search efficient.
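A minimal sketch of this random baseline is given below. It draws uniformly from the actions that stay inside the gate voltage window and prefers blocks that have not been measured before; the 20 × 20 grid (a 640 mV window of 32 mV blocks) and helper names are our own illustrative choices, not the paper's implementation.

    import random

    # Six actions: four contiguous moves plus the two diagonals that
    # increase or decrease both gate voltages together (see main text).
    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0),
               "right": (1, 0), "both+": (1, 1), "both-": (-1, -1)}

    def random_agent_step(block, visited, grid=(20, 20)):
        """Pick a uniformly random valid action; avoid re-measuring blocks."""
        valid = {a: (block[0] + d[0], block[1] + d[1])
                 for a, d in ACTIONS.items()
                 if 0 <= block[0] + d[0] < grid[0]
                 and 0 <= block[1] + d[1] < grid[1]}
        fresh = {a: b for a, b in valid.items() if b not in visited}
        choice = fresh or valid  # fall back if every neighbour was visited
        action = random.choice(list(choice))
        return action, choice[action]

At the window edges and corners the dictionary of valid actions shrinks to five or four entries, matching the five- and four-dimensional action spaces described above.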
FIG. 4. Performance benchmark. (a, b) Example trajectories of the DRL agent in gate voltage space. Panels (a) and (b) correspond to regions I and II, respectively. The trajectories are indicated by inverting the colour scale of the current map for the blocks measured by the algorithm. The current map measured by the grid scan method is displayed for illustrative purposes and is not seen by the DRL agent. The blue and red squares indicate the start and end of the trajectory, respectively. (c, d) Real-time performance corresponding to the grid scan method (green line), the algorithm with a random decision agent (blue) and the algorithm with a DRL decision agent (red). The box plots indicate the laboratory time and the corresponding number of blocks explored, N, for regions of the gate voltage space I and II in (c) and (d), respectively. The results of all 10 runs for both agents in each regime are plotted as points. The central line of the box plot corresponds to ¯N, while the upper and lower boundaries of the box display the upper (Q3) and lower (Q1) quartiles. The minimum and maximum whisker bars display (Q1 − 1.5 × IQR) and (Q3 + 1.5 × IQR) respectively, where IQR is the interquartile range. (e, f) Histograms of values of N for the random and DRL decision agents over the algorithm runs for each region, I (e) and II (f). This performance test was performed offline. The insets show the box plots, indicating the quartiles and ¯N values for the DRL and random agents. In the inset only the outlier points are plotted.

The random decision agent is an order of magnitude quicker than the grid scan method.

To test the statistical significance of the DRL agent's advantage, we have tested the performance of both algorithms in a much larger number of runs. To perform this statistical convergence test would have been too costly in laboratory time, so we used previously recorded current maps, which were measured by the grid scan method. We will call this performance test 'offline', as opposed to 'online' in the case of real-time measurements. By initiating both agents 1024 times in each of the blocks in I and II, we obtained a histogram of the N blocks measured to successfully terminate the algorithm (see Fig. 4e and f for I and II, respectively). We observe a higher number of runs for which the DRL algorithm performed fewer block measurements for successful termination. In region II, the DRL agent requires ¯N of 17 (lower decile 2), while ¯N for the random agent is 30 (lower decile 3) (see Supplementary Information D for the corresponding values in region I). Our results suggest that the DRL advantage is statistically significant.

The two-tailed Wilcoxon signed rank test [42] allows us to make a statistical comparison of the two distributions corresponding to the DRL and the random agent. We have applied this test to the offline performance for regions I and II (see Supplementary Information D for the results of the Wilcoxon signed rank test applied to the online performance test, for which critical values for the test threshold are used instead of assuming a normal approximation, given that the number of algorithm runs is below 20). The two-tailed Wilcoxon signed rank test yields a p-value < 0.001 for both regions.
This means that the null hypothesis, stating there is no difference in the median performance between the two agents, can be rejected. In addition, the median of the differences (¯N_DRL − ¯N_Random), estimated using the one-tailed Wilcoxon signed rank test, is less than zero. We can therefore confirm that the DRL agent offers a statistically significant advantage over the random agent.

To further illustrate the advantages of our algorithm, we have, for comparison, implemented a Nelder-Mead numerical optimisation method [28, 34], an alternative approach not based on reinforcement learning. To ensure a fair comparison with our reinforcement learning method, our implementation of the Nelder-Mead optimisation (see Supplementary Information E for further details) was terminated when the CNN classified a block as exhibiting bias triangles in the same way as our DRL algorithm, i.e. when the output value of the CNN classifier was greater than 0.5.
FIG. 5.
Offline performance distribution.
The performance of different algorithms is evaluated by initiating an algorithm run in each block and estimating N for regions I and II. Black areas indicate that the algorithm failed when initiated at those blocks. (a) Performance distribution (heat map) for the Nelder-Mead method, (b) the DRL decision agent, and (c) the algorithm with a random decision agent.
In the original implementation, stricter numerical stopping conditions must be met, thereby increasing the number of measurements performed before termination.

The Nelder-Mead, random decision, and DRL decision algorithms were compared offline. We have initiated the algorithms in each block within each gate voltage region and estimated ¯N, creating a performance distribution or heat map (Fig. 5). We observe that large areas of gate voltage space which do not exhibit transport features correspond to large flat areas in the optimisation landscape and thus severely limit the Nelder-Mead method. Often the simplex was initiated in these areas, and in those cases the Nelder-Mead algorithm just repeatedly measured the area around the initial simplex. On other occasions, the algorithm moved away from the initial simplex but then became trapped in other areas of the parameter space in which transport features are not present. The method only succeeded in locating bias triangles when it was initiated in the double dot regime. The DRL decision agent's performance is non-uniform, as the 'pinch-off' regime is less effectively characterised by the agent than the 'open' and 'single-electron transport' regimes. The performance of the random decision agent is also non-uniform, as it completes the tuning procedure more efficiently when initiated close to the target transport features.

The Nelder-Mead algorithm was also tested online under the same conditions as the DRL and random decision agents. None of 20 runs succeeded before reaching the predefined maximum number of measurements, N_max, and thus the results are not presented alongside the online results of the grid scan, random decision, and DRL decision algorithms in Fig. 4.

DISCUSSION
We have demonstrated efficient measurement of a quantum dot device using reinforcement learning. We are able to locate bias triangles fully automatically from a set of gate voltages defined by a super coarse tuning algorithm [27], and in as little as one minute. Our approach gives a 10 times speed-up in the median time required to locate bias triangles compared with grid scan methods. Our approach is less dependent on the transport regime in which the algorithm is initiated, compared to an algorithm based on a random agent and to a Nelder-Mead numerical optimisation method. We have also demonstrated the statistical advantage of a DRL decision agent over a random decision agent. Our DRL approach is also robust against featureless areas in the parameter space, which limit other approaches. While numerical optimisation methods require time-consuming measurements at each step of the optimisation process, our algorithm uses statistics calculated via pixel sampling to explore the transport landscape. This statistical state representation allows us to efficiently measure the transport regime (or the state of the environment in DRL terms) and avoid over-fitting during agent training. Other options for state representation that go beyond a statistical summary of current values could also be considered. The measurement time remains, however, the dominant contribution to the time required to identify transport features. Fast readout techniques such as radio-frequency reflectometry can be used to reduce measurement times [43–48].

Our method is inherently flexible and modular, such that it could be generalised to automate a variety of efficient measurement tasks. For example, the reward function could be modified so that the agent could learn to locate and score multiple bias triangles within the current map. Furthermore, by retraining the CNN classifier and the DRL agent, the method would be able to locate different types of transport features, such as those observed with charge sensing techniques [50]. Our algorithm could also incorporate other gate electrodes by increasing the action space and retraining. This approach would allow a significant speed-up for super coarse and coarse tuning algorithms. We also expect DRL approaches to scale better than random searches as the dimensionality of the problem increases.

An additional benefit of reinforcement learning is the capacity of the network's policy to be continuously updated. Thereby, the agent's policy can be updated in real time as the algorithm becomes familiar with a new device. This not only improves the general policy but also means that, over time, the pre-trained agent could learn the particularities of a specific device. To tune large quantum device arrays, due to the increasing dimensionality of the parameter space, DRL could offer a large advantage over conventional heuristic methods. Our quantum dot environment and algorithmic framework offer a valuable resource to develop and test other algorithms and decision agents for quantum device measurement and tuning. Additionally, our dueling deep Q-network methods can be translated to further applications in experimental research.
Acknowledgements
We acknowledge J. Zimmerman and A. C. Gossard for the growth of the AlGaAs/GaAs heterostructure. This work was supported by the Royal Society, the EPSRC National Quantum Technology Hub in Networked Quantum Information Technology (EP/M013243/1), Quantum Technology Capital (EP/N014995/1), EPSRC Platform Grant (EP/R029229/1), the European Research Council (grant agreement 818751), Graphcore, the Swiss NSF Project 179024, the Swiss Nanoscience Institute, the NCCR QSIT, the NCCR SPIN, and the EU H2020 European Microkelvin Platform EMP grant No. 824109. This publication was also made possible through support from Templeton World Charity Foundation and John Templeton Foundation. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of the Templeton Foundations.
Author Contributions
S.B.O., D.T.L., N.A. and the machine performed the experiments. F.V. contributed to the experiment. V.N. and S.B.O. developed the algorithm in collaboration with M.A.O. and D.S. The sample was fabricated by L.C.C., L.Y., and D.M.Z. The project was conceived by V.N. and N.A. G.A.D.B., V.N., S.B.O. and N.A. wrote the manuscript. All authors commented on and discussed the results.
Competing Interests
The authors declare that they have no competing interests.
Correspondence
Correspondence and requests for materials should be addressed to Natalia Ares (email: [email protected]).

[1] Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
[2] Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
[3] Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
[4] August, M. & Hernández-Lobato, J. M. Taking gradients through experiments: LSTMs and memory proximal policy optimization for black-box quantum control. Lecture Notes in Computer Science, 591–613 (2018).
[5] Fösel, T., Tighineanu, P., Weiss, T. & Marquardt, F. Reinforcement learning with neural networks for quantum feedback. Physical Review X 8, 031084 (2018).
[6] Bukov, M. et al. Reinforcement learning in different phases of quantum control. Physical Review X 8, 031086 (2018).
[7] Niu, M. Y., Boixo, S., Smelyanskiy, V. & Neven, H. Universal quantum control through deep reinforcement learning. AIAA Scitech 2019 Forum (2019).
[8] Xu, H. et al. Generalizable control for quantum parameter estimation through reinforcement learning. npj Quantum Information 5, 82 (2019).
[9] Daraeizadeh, S., Premaratne, S. P. & Matsuura, A. Y. Designing high-fidelity multi-qubit gates for semiconductor quantum dots through deep reinforcement learning. Preprint at http://arxiv.org/abs/2006.08813 (2020).
[10] Herbert, S. & Sengupta, A. Using reinforcement learning to find efficient qubit routing policies for deployment in near-term quantum computers. Preprint at http://arxiv.org/abs/1812.11619 (2018).
[11] Palittapongarnpim, P., Wittek, P., Zahedinejad, E., Vedaie, S. & Sanders, B. C. Learning in quantum control: High-dimensional global optimization for noisy quantum dynamics. Neurocomputing, 116–126 (2017).
[12] An, Z. & Zhou, D. L. Deep reinforcement learning for quantum gate control. Europhysics Letters (2019).
[13] Porotti, R., Tamascelli, D., Restelli, M. & Prati, E. Coherent transport of quantum states by deep reinforcement learning. Communications Physics (2019).
[14] Schuff, J., Fiderer, L. J. & Braun, D. Improving the dynamics of quantum sensors with reinforcement learning. New Journal of Physics (2020).
[15] Wang, T. et al. Benchmarking model-based reinforcement learning. Preprint at http://arxiv.org/abs/1907.02057 (2019).
[16] Wei, P., Li, N. & Xi, Z. Open quantum system control based on reinforcement learning. Chinese Control Conference, 6911–6916 (2019).
[17] Gao, X. & Duan, L. M. Efficient representation of quantum many-body states with deep neural networks. Nature Communications 8, 662 (2017).
[18] Barr, A., Gispen, W. & Lamacraft, A. Quantum ground states from reinforcement learning. Proceedings of Machine Learning Research, 635–653 (2020).
[19] Deng, D. L. Machine learning detection of Bell nonlocality in quantum many-body systems. Physical Review Letters 120, 240402 (2018).
[20] Carleo, G. & Troyer, M. Solving the quantum many-body problem with artificial neural networks. Science 355, 602–606 (2017).
[21] Sørdal, V. B. & Bergli, J. Deep reinforcement learning for quantum Szilard engine optimization. Physical Review A, 042314 (2019).
[22] Loss, D. & DiVincenzo, D. P. Quantum computation with quantum dots. Physical Review A 57, 120–126 (1998).
[23] Malinowski, F. K. et al. Notch filtering the nuclear environment of a spin qubit. Nature Nanotechnology 12, 16–20 (2017).
[24] Cerfontaine, P. et al. Closed-loop control of a GaAs-based singlet-triplet spin qubit with 99.5% gate fidelity and low leakage. Nature Communications 11 (2020).
[25] Baart, T. A., Eendebak, P. T., Reichl, C., Wegscheider, W. & Vandersypen, L. M. Computer-automated tuning of semiconductor double quantum dots into the single-electron regime. Applied Physics Letters 108, 213104 (2016).
[26] Darulová, J. et al. Autonomous tuning and charge state detection of gate defined quantum dots. Physical Review Applied 13, 054005 (2020).
[27] Moon, H. et al. Machine learning enables completely automatic tuning of a quantum device faster than human experts. Nature Communications 11, 4161 (2020).
[28] Zwolak, J. P. et al. QFlow lite dataset: A machine-learning approach to the charge states in quantum dot experiments. PLoS ONE 13(10), e0205844 (2018).
[29] Zwolak, J. P. et al. Autotuning of double-dot devices in situ with machine learning. Physical Review Applied 13, 034075 (2020).
[30] van Esbroeck, N. M. et al. Quantum device fine-tuning using unsupervised embedding learning. New Journal of Physics (2020).
[31] Teske, J. D. et al. A machine learning approach for automated fine-tuning of semiconductor spin qubits. Applied Physics Letters 114, 133102 (2019).
[32] Durrer, R. et al. Automated tuning of double quantum dots into specific charge states using neural networks. Preprint at http://arxiv.org/abs/1912.02777 (2019).
[33] Lennon, D. T. et al. Efficiently measuring a quantum device using machine learning. npj Quantum Information 5, 79 (2019).
[34] Kalantre, S. S. et al. Machine learning techniques for state recognition and auto-tuning in quantum dots. npj Quantum Information 5, 6 (2019).
[35] LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
[36] Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, 1097–1105 (2012).
[37] Camenzind, L. C. et al. Hyperfine-phonon spin relaxation in a single-electron GaAs quantum dot. Nature Communications 9, 3454 (2018).
[38] Camenzind, L. C. et al. Spectroscopy of quantum dot orbitals with in-plane magnetic fields. Physical Review Letters 122, 207701 (2019).
[39] Brockman, G. et al. OpenAI Gym. Preprint at http://arxiv.org/abs/1606.01540 (2016).
[40] Wang, Z. et al. Dueling network architectures for deep reinforcement learning. Proceedings of the 33rd International Conference on Machine Learning, 1995–2003 (2016).
[41] Schaul, T., Quan, J., Antonoglou, I. & Silver, D. Prioritized experience replay. Preprint at https://arxiv.org/abs/1511.05952 (2016).
[42] Wilcoxon, F. Individual comparisons by ranking methods. Biometrics Bulletin 1, 80–83 (1945).
[43] Crippa, A. et al. Level spectrum and charge relaxation in a silicon double quantum dot probed by dual-gate reflectometry. Nano Letters 17, 1001–1006 (2017).
[44] Schupp, F. J. et al. Sensitive radiofrequency readout of quantum dots using an ultra-low-noise SQUID amplifier. Journal of Applied Physics 127, 244503 (2020).
[45] Volk, C. et al. Loading a quantum-dot based "Qubyte" register. npj Quantum Information 5, 29 (2019).
[46] Ares, N. et al. Sensitive radio-frequency measurements of a quantum dot by tuning to perfect impedance matching. Physical Review Applied 5, 034011 (2016).
[47] De Jong, D. et al. Rapid detection of coherent tunneling in an InAs nanowire quantum dot through dispersive gate sensing. Physical Review Applied (2019).
[48] Jung, M., Schroer, M. D., Petersson, K. D. & Petta, J. R. Radio frequency charge sensing in InAs nanowire double quantum dots. Applied Physics Letters 100, 253508 (2012).
[49] Kingma, D. P. & Ba, J. L. Adam: A method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2015).
[50] Field, M. et al. Measurements of Coulomb blockade with a noninvasive voltage probe. Physical Review Letters 70, 1311–1314 (1993).

SUPPLEMENTARY INFORMATION

A. Quantum dot environment
In DRL, among the key components is the formal model of the environment with which the agent interacts. Therefore, we build an environment from the quantum dot device for training our algorithm and name it the quantum dot environment (QDE). The QDE was developed to be compatible with the OpenAI Gym interface [39]. This environment is ready to be used for benchmarking and training existing DRL algorithms. In addition, this environment is useful for interested DRL researchers to develop their methods for improving quantum technologies.

The voltage space of the QDE is delimited by a 640 mV × 640 mV window defined by the barrier gates. The window is divided into 32 mV × 32 mV blocks and the agent is initialised in a randomly selected block.
1. State
A state s is the statistics set of a block, consisting of the means µ and standard deviations σ of the current in each of the nine sub-blocks.

Instead of making densely overlapping blocks with a kernel moving horizontally and vertically, we propose to represent each state by 3 sub-blocks per dimension for simplicity. In other words, the image is represented by 3^d sub-blocks, where d is the dimension. The dimension d corresponds to the number of gates used. In our setting, each state in 2 gates includes 9 sub-blocks, giving a state vector of 9 × 2 = 18 statistical values (µ and σ for each sub-block).
2. Action
Our action space includes increasing (+) or decreasing (−) each gate voltage. We have specially designed two actions to modify both gates simultaneously. In a higher dimensional setting, such as controlling d > 2 gates, the size of the action space is 2 × d + 2.
3. Reward
The reward function is carefully constructed to force the agent to learn to navigate through the voltage landscape and identify bias triangles using the fewest N measurements. We follow the popular Taxi-v2 environment* to design the reward scores. We summarise the components of the reward function in Table I. We note that the utility of the agent is robust with respect to different magnitudes of these scores, provided that the detection of a pair of bias triangles receives a much higher score than block measurement steps.

TABLE I. Summary of the reward function.

Instance                  Reward   Termination
Each block measured       -1       False
Bias triangles detected   +10      True
N equal to N_max          -10      True

We assign a high reward to our target state of bias triangles. We encourage the algorithm to find the bias triangles using the fewest number of measurements by designing the reward score as follows. We assign the highest reward r = +10 to the bias-triangle location. Then, other states take r = −1. The maximum number of steps per episode during training is set to 300. Beyond this threshold, if the algorithm cannot find the bias triangles, it will terminate and assign r = −10. The maximum number of steps controls how far away from a starting point the device measurement can go.

* https://gym.openai.com/envs/Taxi-v2/
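For orientation, a skeleton of a Gym-compatible environment with this state, action and reward structure is sketched below. It is our own illustrative reconstruction of the QDE interface, not the authors' released environment; _measure_block stands in for the random pixel measurement, pre-classifier and CNN check described in the main text.

    import numpy as np
    import gym
    from gym import spaces

    class QuantumDotEnv(gym.Env):
        """Skeleton QDE: a 20x20 grid of 32 mV blocks in a 640 mV window."""

        def __init__(self, grid=20, n_max=300):
            super().__init__()
            self.action_space = spaces.Discrete(6)   # 4 moves + 2 diagonals
            self.observation_space = spaces.Box(     # 9 sub-blocks x (mu, sigma)
                low=-np.inf, high=np.inf, shape=(18,), dtype=np.float32)
            self.grid, self.n_max = grid, n_max
            # action index -> block displacement (dV_B1, dV_B2)
            self.moves = [(0, 1), (0, -1), (-1, 0), (1, 0), (1, 1), (-1, -1)]

        def reset(self):
            self.block = tuple(np.random.randint(0, self.grid, size=2))
            self.n = 0
            return self._measure_block(self.block)[0]

        def step(self, action):
            d = self.moves[action]
            self.block = (int(np.clip(self.block[0] + d[0], 0, self.grid - 1)),
                          int(np.clip(self.block[1] + d[1], 0, self.grid - 1)))
            self.n += 1
            state, found = self._measure_block(self.block)
            if found:                  # CNN sees bias triangles: r = +10
                return state, 10.0, True, {}
            if self.n >= self.n_max:   # episode budget exhausted: r = -10
                return state, -10.0, True, {}
            return state, -1.0, False, {}  # every other block: r = -1

        def _measure_block(self, block):
            # Placeholder: would run pixel sampling, pre-classifier and CNN.
            return np.zeros(18, dtype=np.float32), False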
FIG. 6. Pixel sampling convergence for the random pixel measurement in sub-blocks taken from pinch-off (low current), open (high current), single dot, and double dot regimes. The threshold values set for the pre-classification tools are indicated. Therefore, any sub-block within the grey shaded region will pass the pre-classification. The dashed lines represent the true means after measuring the sub-blocks using a full 1 mV × 1 mV resolution grid scan.
To assess the state of a block, the algorithm first conducts a random pixel measurement. Pixels are repeatedly sampled at random from the block and the statistics are calculated for each sub-block until convergence. The convergence of both the mean and standard deviation of each sub-block must be satisfied before the measurement is terminated. The convergence is accepted if the mean change in the values of the state representation is less than a threshold percentage (one percent for this paper) of the state representation prior to the update. The state vector is then assessed by the pre-classification stage. If the mean current value of any of the sub-blocks falls between the threshold values, calculated using the initialisation one-dimensional scans, then the block is pre-classified as a boundary region. The convergence of the normalised mean current of a sub-block in the random pixel sampling measurement method is shown in Fig. 6, for low and high current, as well as single and double dot sub-blocks. The sub-blocks with single and double dot transport features fall within the pre-classifier threshold values and therefore, in the full algorithm, would be measured using a grid scan and evaluated by the CNN binary classifier.

Satisfactory convergence for a block is achieved in fewer than 50 pixel measurements in all regimes, compared to the 1,024 pixels measured in a grid scan of the block. This represents a huge improvement in measurement efficiency and in the evaluation of a state of the DRL agent.
FIG. 7. CNN binary classifier architecture: Convolution (32@32×32), Max-Pool (32@16×16), Convolution (64@16×16), Max-Pool (64@8×8), Convolution (64@8×8), followed by Dense (1×64) and Dense (1×32) layers and a 1×1 output.
TABLE II. CNN binary classifier confusion matrix.

Confusion parameter   Value
True positive         18%
False positive        4%
True negative         76%
False negative        2%
F-measure             85%
Accuracy              94%
C. CNN Binary Classifier
A Convolutional Neural Network (CNN) [35, 36] is a multilayered neural network with a special architecture to detect complex patterns. To decide whether the agent has found bias triangles in a given block, the algorithm uses a CNN as a binary classification tool. If the CNN outputs a value greater than 0.5, the block is classified as containing bias triangles.
1. Network Architecture
We summarise in Fig. 7 the network architecture and hyperparameters used. There are a total of 320065 trainable parameters. The convolutional layers have a Rectified Linear Unit (ReLU) activation function, while the dense layers have an Exponential Linear Unit (ELU) activation function. The final (output) layer has a Sigmoid activation function.
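The following sketch reproduces this architecture in PyTorch from the layer shapes in Fig. 7. The kernel sizes are not stated in the paper; assuming 3 × 3 convolutions with 'same' padding reproduces the quoted total of 320065 trainable parameters exactly, which is why we adopt them here.

    import torch.nn as nn

    # Input: one normalised 32x32 current map block.
    cnn = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),   # 32@32x32
        nn.MaxPool2d(2),                                         # 32@16x16
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # 64@16x16
        nn.MaxPool2d(2),                                         # 64@8x8
        nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),  # 64@8x8
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, 64), nn.ELU(),                     # dense 1x64
        nn.Linear(64, 32), nn.ELU(),                             # dense 1x32
        nn.Linear(32, 1), nn.Sigmoid(),                          # triangle probability
    )

    n_params = sum(p.numel() for p in cnn.parameters() if p.requires_grad)
    assert n_params == 320065  # matches the count quoted in the text

Training would then follow the recipe in the next subsection: an Adam optimiser, a binary cross-entropy loss, and L2 weight regularisation.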
2. Confusion Matrix
The CNN was trained over 10 epochs using 11425 data points in the training set, 4896 in the validation set and 6994 in the test set. Data was augmented by applying rotations. We trained the network using an Adam optimiser [49] and a binary cross-entropy loss function. Regularisation was achieved using an L2 regulariser, set to 0.1. In Table II we present a summary of the prediction results on the test set of the binary classification problem.
The confusion parameter representation is useful for analysing the types of error that a classifier typically makes. The F-measure and accuracy are other commonly used metrics to analyse the efficacy of a binary classification tool.

FIG. 8. Example trajectories. (a, b) Example trajectories from Fig. 4 a, b, respectively, with the insets showing the bias triangles which triggered the stopping condition for the algorithm.
3. Positive Examples
As the DRL agent navigates through the environment, the algorithm evaluates each block using the pre-classification protocol. If the block passes the pre-classifier stage, a grid scan of the block is measured and the CNN binary classification tool is used to evaluate the block. If the CNN positively classifies the block as containing bias triangles, the algorithm is terminated and the run treated as successful. In Fig. 8 we show the blocks that were positively classified by the CNN, causing the algorithm to terminate during the real-time measurements.
TABLE III. Deep reinforcement learning architecture hyper-parameters.

Hyper-parameter                     Value used
Discount factor                     0.…
ε-greedy                            1e−…
Replay buffer                       20000
PER-β (start, final, no. steps)     (1.0, 0.…, …e−…)
Fully connected layers              128, …, …, …
FIG. 9. Summary of DRL framework. Our deep reinforcement learning framework using a statistical state representation.
D. DRL Decision Agent
1. Network Architecture
We first summarise the network architecture and hyperparameters used in Table III. We further illustrate the model architecture in Fig. 9.
2. Training
We present the pseudocode for the training of the DRL decision agent with prioritised experience replay in Fig. 10. The training process starts as follows. An agent initially makes random action choices to gain experiences, which are stored in a replay buffer B. From this buffer, data samples are randomly selected at a rate proportional to the temporal difference (TD) error. In particular, the method prefers to pick samples with unexpected transitions, since these contain more information to learn from than other samples. Then, the neural network is updated given such 'unexpected' samples to improve the network's policy for the next iterations.

We then illustrate the learning process of our DRL agent by showing the N measurements required to locate bias triangles as a function of the number of training episodes (Fig. 11). This test, which was run in different regions of gate voltage space to the ones explored in the main text, was performed to assess the learning rate of the DRL network. Bias triangles were labelled in advance by human experts, and the CNN and pre-classifier modules were not used. Unsurprisingly, when we test the performance of the agent in the same device in which it is trained, fewer learning episodes are required compared to when we perform the test run in a different device. However, in order to be robust against device variability, training and testing have to be run on different devices.

FIG. 10. Pseudocode. Training the dueling deep Q-network with prioritised experience replay.
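In outline, the training loop of Fig. 10 corresponds to the following sketch, which reuses the DuelingDQN module and the Gym-style environment sketched earlier. It implements proportional prioritised replay [41] under our own simplifications (fixed ε, importance-sampling weights omitted); it is not the paper's exact pseudocode.

    import copy, random
    import numpy as np
    import torch

    def train(env, net, episodes=10000, gamma=0.99, alpha=0.6,
              batch_size=32, eps=0.1, lr=1e-3):
        """Train a dueling DQN with proportional prioritised replay [41]."""
        target = copy.deepcopy(net)               # slowly-updated target network
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        buffer, prios, step = [], [], 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy action from the dueling Q-network
                if random.random() < eps:
                    a = env.action_space.sample()
                else:
                    with torch.no_grad():
                        a = int(net(torch.as_tensor(s).float()[None]).argmax())
                s2, r, done, _ = env.step(a)
                buffer.append((s, a, r, s2, done))
                prios.append(max(prios, default=1.0))  # new samples: max priority
                s = s2
                if len(buffer) < batch_size:
                    continue
                # replay transitions with probability ~ priority^alpha, so rare
                # but surprising ("unexpected") transitions are replayed more
                p = np.asarray(prios) ** alpha
                idx = np.random.choice(len(buffer), batch_size, p=p / p.sum())
                S, A, R, S2, D = map(np.array, zip(*[buffer[i] for i in idx]))
                q = net(torch.as_tensor(S).float()).gather(
                    1, torch.as_tensor(A).long()[:, None]).squeeze(1)
                with torch.no_grad():
                    q2 = target(torch.as_tensor(S2).float()).max(1).values
                    y = (torch.as_tensor(R).float()
                         + gamma * q2 * (1 - torch.as_tensor(D).float()))
                td = y - q                        # temporal-difference error
                loss = (td ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()
                for i, e in zip(idx, td.detach().abs().numpy()):
                    prios[i] = float(e) + 1e-6    # update replayed priorities
                step += 1
                if step % 500 == 0:               # periodic target refresh
                    target.load_state_dict(net.state_dict())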
3. Performance
We summarise the online performance in Table IV, and the offline performance in Table VI, of the DRL decision agent with respect to the random decision agent. We use the two-tailed Wilcoxon signed rank test [42] to assess the null hypothesis that the DRL and random agent's performances are drawn from the same distribution. The p-value, given in Table V and Table VII for online and offline tests respectively, represents the confidence in the null hypothesis. The null hypothesis can only be rejected with confidence, at a level of 2%, in the online results in the case of ¯N in region II. For the offline results, the null hypothesis can be rejected for ¯N in both regions. A one-tailed Wilcoxon signed rank test demonstrates that the median of the differences (¯N_DRL − ¯N_Random) is less than zero. We can therefore conclude that the DRL agent offers a statistically significant advantage over the random agent.
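As a concrete illustration, the paired comparison reported here can be run with scipy; the arrays below are placeholders standing in for the per-run N values of the two agents, not the measured data.

    from scipy.stats import wilcoxon

    # Paired per-run block counts (placeholder values, not the paper's data).
    n_drl    = [40, 12, 55, 9, 31, 70, 25, 18, 90, 44]
    n_random = [54, 20, 60, 22, 50, 85, 40, 30, 120, 66]

    # Two-tailed test of the null hypothesis of no median difference.
    stat, p_two = wilcoxon(n_drl, n_random, alternative="two-sided")

    # One-tailed test that the DRL agent measures fewer blocks
    # (median of the differences N_DRL - N_Random below zero).
    stat, p_less = wilcoxon(n_drl, n_random, alternative="less")
    print(p_two, p_less)

For small samples without ties, scipy evaluates the exact null distribution rather than a normal approximation, consistent with the use of critical values for the online runs described above.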
FIG. 11. Training convergence. We evaluate the performance of our DRL agent during a learning process. The region of gate voltage space is different to the ones explored in the main text. The bias triangles are labelled in advance by experts. Lines show the mean N, with uncertainty bounds, for different starting locations (pinch-off, open and single-dot regimes). The performance test in (a) was run on the same device on which training was performed, while in (b) the test was run on a different device to the one used for training.

TABLE IV. Summary of the online performance of the DRL and Random decision agents in the two parameter regions. The performance metrics used are the number of blocks measured, N, before identifying a pair of bias triangles, and the corresponding lab time.
Agent                                 DRL     Random
Region I   median lab time (s)       932     683
           10% percentile time (s)   228     222
           90% percentile time (s)   2430    4844
Region II  median lab time (s)       822     989
           10% percentile time (s)   181     349
           90% percentile time (s)   1766    1500
Region I   ¯N                        41      54
           10% percentile N          …       …
           90% percentile N          104     135
Region II  ¯N                        32      85
           10% percentile N          10      50
           90% percentile N          94      143

TABLE V. Summary of the Wilcoxon signed rank test analysis on the online performance of the DRL and Random decision agents in the two parameter regions. The performance metrics used are the number of blocks measured, N, before identifying a pair of bias triangles, and the corresponding lab time.

                        Wilcoxon signed rank p-value
Region I   lab time     …
Region II  lab time     …
Region I   N            …
Region II  N            …

TABLE VI. Summary of the offline performance of the DRL and Random decision agents in the two parameter regions. The performance metric used is the number of blocks explored, N, before identifying a pair of bias triangles (the lower the ¯N, the better the performance). For offline experiments, lab times are not a performance metric to be considered.

                               DRL agent   Random agent
Region I   ¯N                  64          46
Region II  ¯N                  17          30
           10% percentile N    2           3
           90% percentile N    31          101
4. Policy
In the reinforcement learning context, a policy defines what an agent does to accomplish a task. We present the optimal policies at different training stages in Fig. 12, wherein we use arrows to indicate the action, i.e., the direction to move in the gate voltage space to perform the next measurement. The algorithm learns that it should move towards more positive gate voltages if the state is pinch-off (low current), or towards more negative gate voltages if the state is the open regime (high current).
E. The Nelder-Mead numerical optimisation method
We construct a fitness function by taking the L-norm of a probability vector defining a difference metric between the current state, i.e. a given transport regime, and the target state or transport regime. In slight variance from the implementation in [28], we have defined the probability vector of the current state as $(p(s), 1 - p(s))^T$ and the target vector as $(1, 0)^T$. Here $s$ is a coordinate in gate voltage space, and this coordinate's fitness value is calculated, as above, by evaluating the CNN prediction $p(s)$ of the probability that a window (32 mV × 32 mV) defined around $s$ contains bias triangles. Thus, in single and double dot transport regimes, the value of $p(s)$ should be higher than in the pinch-off and open regimes. The value of the L-norm should have minima at the locations of bias triangles. The Nelder-Mead numerical optimisation method, with two gate voltages as free parameters, then automated the location of these minima. This method converges on local minima, in $n$ dimensions, by evaluating a set of $n + 1$ test coordinates within the optimisation landscape, called a simplex. We defined the initial simplex similarly to [28], as the fitness values of the starting coordinate $s$ and of two additional coordinates obtained by reducing the voltage on each of the barrier gates, one at a time, by 75 mV.

TABLE VII. Summary of the Wilcoxon signed rank test analysis on the offline performance of the DRL and Random decision agents in the two parameter regions. The performance metric used is the number of blocks measured, N. For offline experiments, lab times are not a performance metric to be considered.

                  Wilcoxon signed rank p-value
Region I   N      < 0.001
Region II  N      < 0.001
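A sketch of this procedure with scipy is shown below. The classifier call p_triangles is a stand-in for the CNN evaluation, the L2-norm is chosen for concreteness, and the initial simplex follows the 75 mV offsets described above; this is our illustration, not the authors' code.

    import numpy as np
    from scipy.optimize import minimize

    def fitness(s, p_triangles):
        """L2-norm between (p(s), 1 - p(s)) and the target vector (1, 0)."""
        p = p_triangles(s)  # CNN probability that a 32 mV window at s
        return float(np.linalg.norm([p - 1.0, 1.0 - p]))  # contains triangles

    def locate_triangles(s0, p_triangles):
        # Initial simplex: the starting coordinate plus two points obtained
        # by reducing each barrier-gate voltage in turn by 75 mV.
        simplex = np.array([s0,
                            [s0[0] - 75.0, s0[1]],
                            [s0[0], s0[1] - 75.0]])
        return minimize(fitness, s0, args=(p_triangles,),
                        method="Nelder-Mead",
                        options={"initial_simplex": simplex})

    # Example with a dummy classifier peaking near (-900, -560) mV:
    res = locate_triangles(
        np.array([-1000.0, -500.0]),
        lambda s: float(np.exp(-np.sum((np.asarray(s) - [-900, -560])**2) / 2e4)))
    print(res.x, res.fun)

In flat regions of the landscape, p(s) is essentially zero everywhere, so the fitness is constant across the simplex; this is exactly the failure mode discussed above for featureless areas of gate voltage space.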
FIG. 12. Optimal policies. We plot the optimal policies learned at an early training stage (top) and at a later training stage (bottom).