Bayesian optimisation of large-scale photonic reservoir computers
Piotr Antonik, Nicolas Marsal, Daniel Brunner, Damien Rontani
Abstract
Introduction.
Reservoir computing is a growing paradigm for simplified training of recurrent neural networks, with a high potential for hardware implementations. Numerous experiments in optics and electronics yield performance comparable to state-of-the-art digital algorithms. Many of the most recent works in the field focus on large-scale photonic systems, with tens of thousands of physical nodes and arbitrary interconnections. While this trend significantly expands the potential applications of photonic reservoir computing, it also complicates the optimisation of the high number of hyper-parameters of the system.
Methods.
In this work, we propose the use of Bayesian optimisation for efficient exploration of the hyper-parameter space in a minimum number of iterations.
Results.
We test this approach on a previously reported large-scale experimental system, compare it to the commonly used grid search, and report notable improvements in performance and in the number of experimental iterations required to optimise the hyper-parameters.
Conclusion.
Bayesian optimisation thus has the potential to become the standard method for tuning the hyper-parameters in photonic reservoir computing.

This work was supported by AFOSR (grants No. FA-9550-15-1-0279 and FA-9550-17-1-0072), the Région Grand-Est, and the Volkswagen Foundation via the NeuroQNet Project.

P. Antonik, N. Marsal, D. Rontani
LMOPS EA 4423 Laboratory
CentraleSupélec & Université de Lorraine
2 rue Edouard Belin, F-57070 Metz, France
Corresponding author: P. Antonik (E-mail: [email protected])

D. Brunner
FEMTO-ST Institute/Optics Department
CNRS & Université Bourgogne Franche-Comté
15B Avenue des Montboucons, F-25030 Besançon, France
Keywords
Bayesian optimisation · Photonic reservoir computing · Large-scale networks · Hyper-parameter optimisation
1 Introduction

Reservoir Computing (RC) is a set of machine learning methods for designing and training artificial neural networks [1,2]. The simple idea behind this concept is to exploit the dynamics of a random recurrent neural network to process time series and only train the linear output layer by solving a (relatively simple) system of linear equations [3]. The reservoir computing paradigm is particularly well-suited for hardware implementations, which has attracted much interest from the community in the past ten years. The performance of the numerous experimental implementations in electronics [4], optoelectronics [5,6,7,8,9], optics [10,11,12,13], and integrated on chip [14] is comparable to other digital algorithms on a series of benchmark tasks, such as wireless channel equalisation [1], phoneme recognition [15], and the prediction of the future evolution of financial time series [16].

While the idea of reservoir computing greatly simplifies the training of a recurrent neural network, the optimisation of the hyper-parameters of the network remains a full-size problem. In order to maximise performance, one has to carefully design the topology of the network, place it in the right dynamical regime (usually, task-dependent), and make sure that the scaling of the input signals is well chosen. These considerations yield a list of multiple hyper-parameters that need to be tuned simultaneously for each benchmark task.
Most experimental implementations of reservoir computing so far share two core characteristics. First, the topology of the network, i.e. the interconnections between the different neurons, is fixed by the hardware design – either a ring-like topology for time-delay systems [4,5,6,7,8,10,11,12,13], or a square mesh topology for integrated systems [14,17]. The size of the network is only fixed per se in integrated realisations of RC, but can be modified in time-delay setups. In practice, however, it is also commonly fixed to a constant value, to avoid multiple and non-trivial re-adjustments of the experimental setup. Since the network topology and size no longer need to be optimised, the list of hyper-parameters to tune is reduced to, typically, two variables: the input scaling and the feedback strength. Second, most benchmark tasks used by the RC community so far typically consist of training sets of several thousands of inputs. Such (relatively) short datasets can be processed by most time-delay and integrated experiments in a matter of seconds. That is, numerous evaluations of the system performance with different values of the hyper-parameters do not present any inconvenience. For this reason, the simple grid search has been the standard hyper-parameter optimisation method in experimental reservoir computers so far.

A recent trend in photonic reservoir computing is the design of parallel systems to facilitate the scalability of the network and increase the processing speed. This idea has been demonstrated through frequency-multiplexing of the reservoir nodes [13], and with free-space optics [18,19,20,21]. As a result, reservoir computers of unprecedented sizes – up to tens of thousands of physical nodes – have been reported, and could be applied to complex tasks in computer vision, such as hand-written digit recognition [19] and human action recognition in video streams [20]. While these advances expand the potential applications of reservoir computing, they also make the optimisation of the hyper-parameters more challenging. In contrast to the two factors above, (1) the topology of the network has to be optimised in parallel photonic experiments, which increases the number of hyper-parameters to tune, and (2) computer vision datasets (e.g. the MNIST set of handwritten digits or the KTH video database of human motions) typically consist of tens of thousands of inputs. Considering the typically low speed of certain free-space optical components, one evaluation of the experimental performance with a set of hyper-parameters can easily take from several hours to a day. In other words, the grid search is no longer suitable for these experiments.

Consequently, more efficient methods for the optimisation of the hyper-parameters are required, such as e.g. the evolutionary-inspired genetic algorithm [22]. In this work, we propose Bayesian optimisation [23,24,25,26] for this task. This idea has already been tried on time-delay [27], low-connectivity [28], and small-dimension [29] reservoir computers in numerical simulations. Here, we apply it to the more critical situation of an experimental large-scale reservoir computer, where the grid search is no longer a suitable option. The simple idea behind Bayesian optimisation is to build a surrogate model of the cost function using Gaussian Process (GP) regression [30], and then efficiently sample the hyper-parameter space by looking for regions with the most potential for improvement.

Specifically, we consider the photonic reservoir computer introduced in [20], together with the video-based human action classification task, and apply Bayesian optimisation to tune the hyper-parameters of the experiment. We perform numerical and experimental investigations, and report notable performance improvements in both cases. Furthermore, Bayesian optimisation offers a better understanding of the importance of the different hyper-parameters, i.e. it helps to differentiate the significant parameters from those that have little impact on the system performance. Considering the above advantages, Bayesian optimisation could become the standard hyper-parameter optimisation method for large-scale photonic reservoir computers.
2 Methods

We start by briefly reviewing the basic principles of reservoir computing (Sec. 2.1). Then, we present the experimental reservoir computer and its hyper-parameters (Sec. 2.2) and the human action classification task (Sec. 2.4), originally introduced in [20], used here to test the Bayesian optimisation approach, presented in Sec. 2.3.

2.1 Basic principles of reservoir computing

A typical discrete-time reservoir computer contains a large number N of internal variables x_{i \in 0,\ldots,N-1}(n) evolving in discrete time n \in \mathbb{Z}, as given by

x_i(n+1) = f_{nl}\left( \sum_{j=0}^{N-1} W_{ij}\, x_j(n) + \sum_{j=0}^{K-1} b_{ij}\, u_j(n) \right),    (1)

where f_{nl} is the nonlinear function, u_j(n) is the input signal of dimension K, b_{ij} is the N × K matrix of input weights, often referred to as the "input mask", and W_{ij} is the N × N matrix of interconnecting weights between the neurons of the neural network.

The reservoir computer produces an output signal y(n), given by a linear combination of the states of its internal variables

y(n) = \sum_{i=0}^{N-1} w_i\, x_i(n),    (2)

where w_i are the readout weights, trained either offline (using standard linear regression methods, such as the ridge regression algorithm [31] used here), or online [32], in order to minimise the Normalised Mean Square Error (NMSE) between the output signal y(n) and the target signal d(n), given by

NMSE = \frac{\left\langle (y(n) - d(n))^2 \right\rangle}{\left\langle (d(n) - \langle d(n) \rangle)^2 \right\rangle}.    (3)
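To make Eqs. (1)-(3) concrete, the following minimal Matlab sketch simulates a small software reservoir and trains the readout offline by ridge regression. It is an illustration only, not the experimental code: the reservoir size, the tanh nonlinearity (standing in for f_{nl}), the toy one-step-delay task and the regularisation parameter are all arbitrary assumptions.

```matlab
% Minimal reservoir computer (Eqs. 1-3). Illustrative sketch only:
% sizes, tanh nonlinearity and the toy task are arbitrary choices.
N = 100; K = 1; T = 1000;                  % nodes, input dim., time steps
u = rand(T, K);                            % input time series u_j(n)
d = circshift(u(:, 1), 1);                 % toy target: one-step delay
W = 0.1 * randn(N, N);                     % interconnection matrix W_ij
b = 2 * rand(N, K) - 1;                    % input mask b_ij in [-1, +1]
x = zeros(N, 1); X = zeros(T, N);
for n = 1:T
    x = tanh(W * x + b * u(n, :)');        % Eq. (1) with f_nl = tanh
    X(n, :) = x';                          % store the reservoir state
end
lambda = 1e-6;                             % ridge regularisation
w = (X' * X + lambda * eye(N)) \ (X' * d); % readout weights w_i (offline)
y = X * w;                                 % Eq. (2): linear readout
nmse = mean((y - d).^2) / mean((d - mean(d)).^2)   % Eq. (3)
```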
2.2 Photonic reservoir computer and its hyper-parameters

The experimental setup, introduced in [20], is schematised in Fig. 1. It is composed of a free-space optical arm and digital electronics, depicted in blue. The optical beam, generated by a green LED source at 530 nm (Thorlabs M530L3), is linearly polarised, collimated, and expanded to roughly 17 mm in diameter to evenly illuminate the 7.68 mm × 7.68 mm surface of the spatial light modulator (Meadowlark XY Phase P512 – 0532). The SLM is imaged by a high-speed camera (Allied Vision Mako U-130B) after focusing the light beam with the imaging lens and transforming phase modulation (induced by the liquid crystals of the SLM) into intensity modulation through a second polariser.

The experimental setup implements the nonlinear function f_{nl} in Eq. 1, which can be modelled as

f_{nl}(X_i(n)) = \left\lfloor I \sin^2\!\left( \lfloor X_i(n) \rfloor_{8} \right) \right\rfloor_{10},    (4)

where X_i(n) is the argument of the function, defined below, I is the intensity of the illuminating beam, and \lfloor \cdot \rfloor_{8} and \lfloor \cdot \rfloor_{10} are the 8-bit and 10-bit quantisations due to the SLM and the camera, respectively [19,20]. The rest of Eq. 1 is computed in Matlab. At each discrete timestep n, the input to the nonlinear function

X_i(n) = \sum_{j=0}^{N-1} W_{ij}\, x_j(n) + \sum_{j=0}^{K-1} b_{ij}\, u_j(n)    (5)

is computed, and the resulting matrix is loaded onto the SLM device. The polarisation-filtered SLM image, formed via an imaging lens, is recorded by the camera. The resulting data corresponds to the reservoir state x_i(n+1) = f_{nl}(X_i(n)).
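A short Matlab sketch of the nonlinearity of Eq. (4) follows, under two simplifying assumptions that are ours, not the paper's: the argument X_i(n) is taken as normalised to the SLM phase range, and both quantisations are plain roundings to uniformly spaced 8-bit and 10-bit levels (the real device response is calibrated rather than modelled this way).

```matlab
% Model of the SLM-camera nonlinearity of Eq. (4). Assumes the argument
% is normalised to [0, 1] and that the quantisations are plain roundings
% to 2^8 and 2^10 uniformly spaced levels (simplifying assumptions).
quant = @(v, bits) round(v * (2^bits - 1)) / (2^bits - 1);
I = 1.0;                                    % illumination intensity (a.u.)
f_nl = @(X) quant(I * sin(pi * quant(X, 8)).^2, 10);  % phase -> intensity
% Example of one reservoir update, with Eq. (5) as the argument:
% x_next = f_nl(W * x + b * u);
```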
Fig. 1 Scheme of the experimental setup, composed of an optical arm (top half) and digital electronics (bottom half, rendered in blue). The optical part is composed of a light source (green LED), a pair of lenses to expand the beam to match the surface of the SLM, and a linear polariser rotated according to the fast axis of the SLM. The spatial light modulator is imaged by a camera through an imaging lens and a second polariser, which transforms the phase modulation into intensity modulation. The electronics part is composed of a computer, running Matlab, that captures the reservoir states x_i from the camera, evaluates the outputs y(n), computes the inputs X_i to the SLM and loads them to the device.

The input mask b_{ij} and the interconnection matrix W_{ij} are generated randomly at the beginning of the experiment. The input mask b_{ij} is initially drawn from a uniform distribution over the interval [−1, +1] as in [33,5,10], and then multiplied by a global scaling factor β, called the input gain. The interconnection matrix W_{ij} is generated as follows. First, a diagonal matrix of size N × N is created and multiplied by a coefficient α, called the feedback gain, since the diagonal elements of W_{ij} are responsible for the feedback of the reservoir, i.e. the connection of each neuron to its past states. A fraction ρ of the off-diagonal elements of W_{ij} are assigned a fixed value γ, while all the others are set to zero. The off-diagonal elements correspond to the connections between different neurons. Therefore, the connectivity of the network is defined by two parameters: the interconnection density ρ and the interconnection gain γ. In summary, the dynamics of the reservoir computer depend on four hyper-parameters: α, β, γ, and ρ, recapped in Tab. 1.
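The generation of b_{ij} and W_{ij} described above can be sketched in a few lines of Matlab; the numerical values of the four hyper-parameters below are placeholders, not the optimised ones.

```matlab
% Generation of the input mask b_ij and interconnection matrix W_ij.
% The hyper-parameter values below are placeholders, not optimal ones.
N = 1024; K = 2000;                         % reservoir and input sizes
alpha = 0.9; beta = 0.1;                    % feedback and input gains
gamma = 1e-3; rho = 1e-2;                   % interconnection gain/density
b = beta * (2 * rand(N, K) - 1);            % uniform mask in [-1, +1]
W = alpha * eye(N);                         % diagonal: feedback terms
offdiag = (rand(N) < rho) & ~eye(N);        % fraction rho of off-diagonal
W(offdiag) = gamma;                         % fixed interconnection gain
```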
The processing speed of the system depends on two main factors: (i) the time Matlab requires to compute the next SLM matrix (which increases with the reservoir size) and (ii) the communication speed between Matlab and the SLM (which is independent of the reservoir size). The experiment is capable of processing 7 video frames per second with the smallest reservoir (N = 1,024 nodes) and significantly fewer with the largest one (N = 16,384 nodes), so that one evaluation of the system over the full dataset of tens of thousands of inputs (see Sec. 2.4) takes from approximately 2 to 7 hours.

2.3 Bayesian optimisation of hyper-parameters

Many optimisation problems in machine learning are "black-box" problems, where the objective function F(x) is unknown. In our study, F(x) is the performance of the reservoir computer on the KTH dataset (see Sec. 2.4), i.e. the classification accuracy, as a function of the four hyper-parameters (see Sec. 2.2): accuracy = F(α, β, γ, ρ). Finding the maximal value of F(α, β, γ, ρ) is a key step in obtaining the optimal performance from the reservoir computer.

If F is computationally cheap to evaluate, one can sample the hyper-parameter space at many points, e.g. via grid or random search. Grid search has been the standard approach in the photonic reservoir computing field so far [4,5,6,7,8,10,11,12,13,34]. However, if the function evaluation is expensive – such as in our case, where one experimental evaluation of F(α, β, γ, ρ) takes from 2 to 7 hours (see Sec. 2.2) – it is important to minimise the number of samples of F required to find its optimum.

The Bayesian optimisation technique attempts to find the global optimum in a minimum number of steps. It is an iterative approach that builds a surrogate model to approximate the objective function F, the former being much cheaper to evaluate. The sampling of new points in the hyper-parameter space is guided by an acquisition function, which estimates the hyper-parameter regions of most uncertainty and most gain, i.e. where an improvement over the current best observation (optimum) is the most likely.

A popular surrogate model for Bayesian optimisation is GP regression [30], a non-parametric kernel-based probabilistic model. Unlike linear regression, which seeks the best parameters that fit a linear model onto data, the GP approach finds a distribution over the possible functions f(x) that are consistent with the observed data. The Bayesian approach consists in building a starting GP model of F(α, β, γ, ρ) from an initial set of observations, and then updating the model as new data points are observed. An observation, in this context, is the performance of the RC (the accuracy) for a certain combination of hyper-parameters (α, β, γ, ρ). The set of possible functions f(x) for the GP model is defined by specifying their smoothness. This is achieved through a covariance, or kernel, function of the model. In practice, choosing a kernel function often requires an initial guess by the user, while approaches exist to optimise the kernel through cross-validation [35].

The acquisition function is evaluated over the GP prediction of the objective function F(α, β, γ, ρ). It takes into account the GP model with its uncertainties to evaluate the potential improvement over the current optimum at each point of the hyper-parameter space. Intuitively, it provides a trade-off between exploitation of the region close to the current optimum – i.e. observation of the neighbouring points – and exploration – the probing of different regions of the hyper-parameter space in search of another possible optimum. In this work, we use the expected improvement function, which evaluates the expected amount of improvement in the objective function, ignoring values that cause an increase in the objective (see [24] for a review of several acquisition functions).

The Bayesian optimisation of our reservoir computer can be summarised as follows:
1. Run the reservoir computer (numerically or experimentally) several times (typically 5-10) to build a starting set of observations.
2. Compute the GP model from the observation set.
3. Evaluate the acquisition function over the entire hyper-parameter space and find its maximum, which becomes the candidate for the next observation.
4. Set the new values of the hyper-parameters and run the reservoir computer. Add the resulting observation to the set.
5. Repeat steps 2 to 4 until the desired accuracy has been achieved.

The Bayesian optimisation is supported by Matlab and can be implemented using the provided functions. The GP regression is carried out by the fitrgp function. To avoid influencing the algorithm with any a priori knowledge we might possess, we let it choose its optimal parameters (such as the basis function, the kernel function and its parameters) through cross-validation by setting the option OptimizeHyperparameters to all [36]. For the acquisition function, we chose the expected-improvement function, readily implemented in Matlab [37] and defined by

EI(x) = \mathbb{E}\left[ \max\left(0, F_{best} - F(x)\right) \right],    (6)

where F_{best} is the optimal value of F(α, β, γ, ρ) observed so far. On top of the standard Bayesian optimisation procedure described above, we added a special modification to check that every new observation candidate is a previously unprobed, new set of parameters. This allows the algorithm to scan the hyper-parameter space faster by avoiding the repetition of identical parameter values. A minimal code sketch of the full loop is given at the end of this section.

Figure 2 illustrates the Bayesian optimisation algorithm in action on a toy example in one dimension. The objective function, displayed in black, was chosen to present one local and one global minimum. Red markers show the observations of the target function, the acquisition function is shown in green, and the GP model is shown in blue with shaded uncertainty. The green markers indicate the acquisition function's maxima, which are the candidates for the following observations. For visual clarity, the plot of the acquisition function was rescaled at each step. Figure 2(a) shows the stage after the two starting observations that were used to initialise the GP model. Figure 2(b) shows that the model has located the local minimum region after three additional observations, with a significant uncertainty in the right-hand region. In Fig. 2(c), after another four observations, the acquisition function forces the process away from the local minimum on the left in order to explore the right-hand side region. Finally, in Fig. 2(d), after a total of 12 observations, the model has found another minimum region. The low overall uncertainty indicates that it is the global minimum of the function, which can now be exploited to pin-point the exact optimal value.
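The five-step loop above reduces to a few lines of Matlab. The sketch below is a simplified stand-in for our procedure, written to minimise an error (maximising the accuracy is the same up to a sign): runReservoir is a hypothetical placeholder for one expensive evaluation of the experiment, faked here by a cheap analytic function so the sketch runs on its own; the search box, the random candidate set and the closed-form expected improvement under the GP are illustrative choices (Matlab's bayesopt function wraps all of these steps into a single call).

```matlab
% Sketch of the Bayesian optimisation loop (steps 2-4), minimising an
% error. runReservoir stands in for one expensive evaluation of the
% experiment with parameters p = [alpha beta gamma rho].
runReservoir = @(p) (p(1) - 1).^2 + (p(2) - 0.1).^2 + 0.01 * randn;
lb = [0 0 0 0]; ub = [2 1 1 1];             % assumed search box
obsX = lb + rand(8, 4) .* (ub - lb);        % 8 starting observations
obsY = arrayfun(@(i) runReservoir(obsX(i, :)), (1:8)');
cand = lb + rand(2000, 4) .* (ub - lb);     % candidate points to rank
for it = 1:30
    gpr = fitrgp(obsX, obsY);               % step 2: fit the GP model
    [mu, sigma] = predict(gpr, cand);       % GP mean and uncertainty
    z = (min(obsY) - mu) ./ max(sigma, eps);
    EI = (min(obsY) - mu) .* normcdf(z) + sigma .* normpdf(z);
    [~, idx] = max(EI);                     % step 3: best candidate
    obsX = [obsX; cand(idx, :)];            % step 4: evaluate it and
    obsY = [obsY; runReservoir(cand(idx, :))];  % extend the set
end
```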
2.4 KTH human action classification task

Similarly to Ref. [20], we used the KTH database of human actions [38], limited to the first "s1" scenario. The video database contains six types of human actions – walking, jogging, running, boxing, hand waving, and hand clapping (illustrated in Fig. 3) – performed 4 times by 25 subjects, for a total of 600 video sequences. They vary in length and contain between 24 and 239 frames. All videos were recorded over a homogeneous background with a static camera at 25 fps and downsampled to a spatial resolution of 160 × 120 pixels.

We used the Histograms of Oriented Gradients (HOG) algorithm [39,40] to extract the relevant features from the video frames. The main idea of this technique is that local object appearance and shape can often be expressed well enough by the distribution of local intensity gradients or edge directions [20]. The computation of HOG features was performed in Matlab, individually for each frame of every sequence, using the built-in extractHOGFeatures function with a cell size of 8 × 8 pixels and a block size of 2 × 2 cells. Given the frame size of 160 × 120, this yields 9,576 features per frame. We then applied principal component analysis (PCA) [41,42], based on the covariance method [43], to reduce the number of features down to 2,000, which preserve most of the total variance.

The reservoir computer was trained over a subset of 450 video sequences, and tested over the remaining 150 sequences. We trained 6 binary classifiers (i.e. the output nodes), one for each motion class, and applied the winner-takes-all approach between them. The final decision over the duration of a sequence was made by taking the most frequent class in the reservoir output. The classification accuracy is defined as the ratio of the correctly recognised video sequences in the testing set over the total number of 150 sequences.
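The feature extraction pipeline of this section can be sketched in Matlab as follows. The file name is hypothetical, and the built-in pca function stands in for the manual covariance method of [43]; with the default 2 × 2 block size, a 160 × 120 frame and 8 × 8 cells give 19 × 14 blocks × 4 cells × 9 orientation bins = 9,576 HOG features. Note that in the paper the PCA is computed over the whole dataset, whereas this sketch, for brevity, processes a single video.

```matlab
% Sketch of the feature extraction: HOG per frame, then PCA.
% 'person01_boxing_s1.avi' is a hypothetical file name; pca() stands
% in for the covariance method of [43] and is applied, for brevity,
% to a single video rather than the whole dataset.
v = VideoReader('person01_boxing_s1.avi');
feats = [];
while hasFrame(v)
    frame = readFrame(v);
    if size(frame, 3) == 3, frame = rgb2gray(frame); end
    h = extractHOGFeatures(frame, 'CellSize', [8 8]);  % 1 x 9,576 row
    feats = [feats; h];                     %#ok<AGROW>
end
[~, score] = pca(feats);                    % principal component analysis
u = score(:, 1:min(2000, size(score, 2)));  % keep up to 2,000 components
```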
3 Results

We tested reservoir computers of different sizes, ranging from N = 1,024 to N = 16,384 nodes. The optimisation of our four hyper-parameters without a priori knowledge of the regions of better performance is a non-trivial task for grid search, especially when the evaluation of one set of values can take up to several hours experimentally. In order to keep manageable experimental times, we had to severely restrict the grid search intervals down to 2-3 values for each parameter. This approach allows one to determine the right scaling of each parameter, but lacks the resolution to find the most optimal values. Table 1 contains the allowed values for each parameter used in the grid search.

The Bayesian optimisation is only affected by the dimensionality of the hyper-parameter space in the sense that it requires a larger starting set of observations to fit an accurate enough GP model onto the data and start sampling the right regions for improvement. Our trials have shown that 8 starting observations were enough to properly initialise the GP model. To truly test the Bayesian optimisation approach, we used a much larger and fine-grained hyper-parameter space than with the grid search. Moreover, despite gaining some intuition on the optimal parameters with the grid search, we made sure not to disclose any a priori knowledge to the GP model. That is, the starting observations were chosen from the extreme values in the hyper-parameter space, with one observation in the middle to force a non-linear fit. Table 1 shows the intervals used for the Bayesian optimisation.
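For contrast, the restricted grid search amounts to an exhaustive loop over the Table 1 values. In the sketch below the value lists are placeholders chosen only to reproduce a grid of 2-3 values per parameter (54 combinations in total, as quoted later in this section); runReservoir is the same hypothetical stand-in as in the sketch of Sec. 2.3.

```matlab
% Exhaustive grid search in the style of Tab. 1. The value lists are
% placeholders; 3 x 2 x 3 x 3 = 54 combinations, as quoted in the text.
runReservoir = @(p) (p(1) - 1).^2 + (p(2) - 0.1).^2;  % cheap stand-in
alphas = [0.5 1.0 1.5];  betas = [0.01 0.1];
gammas = [1e-4 1e-3 1e-2];  rhos = [1e-3 1e-2 1e-1];
bestErr = inf;
for a = alphas, for b = betas, for g = gammas, for r = rhos
    err = runReservoir([a b g r]);          % one (expensive) evaluation
    if err < bestErr, bestErr = err; bestP = [a b g r]; end
end, end, end, end
```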
Fig. 2 Illustration of the Bayesian optimisation on a toy 1D problem: (a) 2 observations, (b) 5 observations, (c) 9 observations, (d) 12 observations. The target function is shown in black, the red markers correspond to the observations, the acquisition function is rendered in green, and the GP model is shown in blue (the shade corresponds to the uncertainty).

Table 1 Hyper-parameter search intervals for the two optimisation approaches.

Parameter | Symbol | Search values (grid search) | Search interval (Bayesian optimisation)
Feedback gain | α | …, …, … | [… , …]
Input gain | β | …, … | [… , …]
Interconnection gain | γ | …, …, … | [… , …]
Interconnection density | ρ | …, …, … | [… , …]
Fig. 3 Examples of action frames from the KTH database, from left to right: boxing, hand clapping, hand waving, jogging, running, and walking. Six different subjects are illustrated out of the total of 25. All videos were taken outdoors over a homogeneous background, which corresponds to the "s1" subset of the full database.
Figure 4 shows the results obtained with both optimisation methods, experimentally (in red) and numerically (in green), with reservoirs of different sizes. Each point corresponds to the highest accuracy we could obtain with the corresponding approach. In each experiment, the Bayesian optimisation (plus markers and solid lines) either matches or outperforms the grid search (round markers and dotted lines). The improvement is quite significant in numerical simulations with a small reservoir (N = 1,024), where the accuracy rises to 86%, and in experiments with the largest reservoir (N = 16,384), where it rises to 90%. Compared to the state-of-the-art accuracy reported in [44], a 4% increase in performance is a valuable improvement.

Figure 5 illustrates the functioning of the Bayesian optimisation process on the hyper-parameters of the reservoir computer with N = 1,024 nodes. For the sake of visualisation, we fixed two parameters to their optimal values – the interconnection density ρ and the interconnection gain γ – and ran the optimisation of the two remaining hyper-parameters, the feedback gain α and the input gain β, so that the results could be plotted on a 3D graph. After collecting 5 observations of the cost function, we fit a GP model and evaluate the acquisition function (not displayed here) to guide the sampling of the hyper-parameter space (Fig. 5(a)). The geometry of the GP model's cost function – a rapidly rising fraction on the left-hand side and a moderately flat fraction on the right-hand side – suggests the exploration of the flat region with the most promising uncertainty. After 5 additional observations there, the process discovers a pit (Fig. 5(b)) and starts exploiting it (Fig. 5(c)). The optimal accuracy of 86% (as indicated in Fig. 4) is found after 19 iterations of the algorithm (Fig. 5(d)). As a side note, one should keep in mind that the shape of the objective function (i.e. the geometry of the GP model) highly depends on the task at hand (here, the classification of videos from the KTH dataset) and the definition of the accuracy (see Sec. 2.4).
Fig. 4 Performance of numerical (blue lines) and experimental (red lines) reservoir computers of different sizes, optimised with either the grid search (dotted lines and hollow markers) or the Bayesian approach (solid lines and cross markers), with search intervals and values given in Tab. 1. In most cases, the Bayesian optimisation outperforms the grid search (and matches it in the worst cases), with an accuracy increase of up to 4% (in the case of the largest experimental reservoir). This is a significant improvement in the field of classification tasks, where the last fractions of a percent are the hardest to gain.

Without a priori intuition on the optimal settings, the grid search needs to be executed over all 54 combinations of the hyper-parameter values from Tab. 1 in order to find the best composition. Bayesian optimisation, on the other hand, requires between 19 iterations (for a small reservoir with N = 1,024 nodes) and 39 iterations (for a large reservoir with N = 16,384 nodes), which corresponds to a 65% and 28% time gain, respectively (19/54 ≈ 35% and 39/54 ≈ 72% of the grid-search evaluations).

Another significant advantage of an iterative algorithm over the grid search is that it autonomously converges towards the optimal value, and could, if necessary, be stopped early, e.g. when the desired accuracy has been achieved. To illustrate this idea, consider the experimental results with the largest reservoir of 16,384 nodes: stopping the optimisation earlier would have roughly doubled the time gain, at the cost of an accuracy penalty of less than 1.5%.

The Bayesian approach also offers a deeper insight into the four hyper-parameters α, β, γ, and ρ. Grid search only evaluates the combinations that the user believes to be relevant, while Bayesian optimisation uses the acquisition function (see Sec. 2.3) to explore the entire hyper-parameter space and in particular the regions of high uncertainty, where an improvement could potentially be found. Therefore, by exploring the regions a user might not have thought of, it provides additional insights into the cost function's shape and the relative impact of the different dimensions.

Based on this approach, we learned through numerical simulations that the system is only sensitive to the first two parameters – the input scaling β and the feedback gain α. In other words, those parameters need to be accurately adjusted to maximise the accuracy, while the remaining two parameters – the interconnection gain γ and the interconnection density ρ – can take multiple (very) different values within the ranges we studied. To illustrate this result, we ran the Bayesian optimisation of a small numerical reservoir computer with N = 1,024 nodes for 500 iterations, to let it explore the hyper-parameter space.
Fig. 5 Illustration of the Bayesian optimisation on a small reservoir computer (N = 1,024 nodes), with panels (a), (b), (c) and (d) showing the state after 5, 10, 14 and 19 observations, respectively. Each panel plots the GP model of the accuracy as a function of the input gain (log scale) and the feedback gain. (a) The starting set of 5 observations (red marks), the fitted GP model (blue) and its lower uncertainty (light blue); the upper uncertainty is omitted, again, for the sake of clarity. (b) The process starts with the exploration of the flat region of the model with the lowest uncertainty. (c) The process discovers a pit in the cost function towards the lower values of the feedback gain and exploits it to find the optimum. (d) The optimum is found after 19 iterations, with the best accuracy of 86% (see Fig. 4).

Out of 500 observations, the maximum accuracy of 86% was obtained at 116 different points. However, these points only differ in the values of γ and ρ, which span wide intervals within the studied ranges, while the optimal input gain β and feedback gain α remain fixed (with α = 1). The same outcome appears with different reservoir sizes, as summarised in Tab. 2. The first two parameters, α and β, only slightly vary with different reservoir sizes, without a noticeable trend. In general, the system works best with a high feedback gain α (memory capacity is required to store the information from previous frames) and a relatively low input scaling β, which is mostly due to the large dimensionality of the input signal. As for γ and ρ, optimal performance can be obtained at multiple points of the (γ, ρ)-plane. In other words, reservoirs with different topologies, i.e. networks with different interconnection matrices, perform the classification equally well. On the other hand, the spectral radii of these matrices remain very similar, which is logical, since the spectral radius determines the dynamics of the system and how it processes the input information. These findings call for an in-depth study of the properties of the interconnection matrix W_{ij} in large-scale photonic reservoir computers, which we leave for future work.
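The observation about the spectral radii is straightforward to check numerically. The sketch below builds two interconnection matrices, as in Sec. 2.2, with different placeholder pairs (γ, ρ) sharing the same product γρ, and compares their spectral radii; the specific values are assumptions for illustration only.

```matlab
% Spectral radii of two interconnection matrices with different
% topologies (gamma, rho) but the same product gamma*rho. The values
% are placeholders for illustration.
N = 1024; alpha = 1.0;
makeW = @(gamma, rho) alpha * eye(N) ...
        + gamma * double((rand(N) < rho) & ~eye(N));
r1 = max(abs(eig(makeW(1e-3, 1e-1))));      % topology 1
r2 = max(abs(eig(makeW(1e-2, 1e-2))));      % topology 2
fprintf('spectral radii: %.3f vs %.3f\n', r1, r2);
```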
Table 2 Optimal values of the hyper-parameters obtained in numerical simulations for different reservoir sizes. For γ and ρ, the Bayesian optimisation finds intervals of equally optimal values rather than single points.

Reservoir size | Grid search (α, β, log γ, log ρ) | Bayesian optimisation (α, β, log γ, log ρ)
1,024 | … | …
4,096 | … | …
6,400 | … | …
9,216 | … | …
12,544 | … | …
16,384 | … | …

4 Conclusion

In this work, we proposed the Bayesian optimisation algorithm for tuning the hyper-parameters of large-scale photonic reservoir computers. We tested this approach on a previously reported experimental system, applied to a challenging task in computer vision, and compared the results to the grid search, commonly used by the RC community. We report improvements in terms of (1) the classification performance, with an accuracy increase of up to 4%, and (2) the convergence time to the optimal set of hyper-parameters, with a roughly 30% gain in time (that could be doubled for a less than 1.5% accuracy penalty). Taking into account the proximity of the accuracy of our photonic reservoir computer to the state-of-the-art results on this task, and the experimental hyper-parameter optimisation time measured in days, these improvements prove to be precious enhancements of the system performance. Furthermore, extensive exploration of the hyper-parameter space with the Bayesian method offers valuable insights into its underlying structure and the relative importance of the parameters. Considering all the advantages offered by the Bayesian optimisation algorithm, it may soon become the new standard approach for the optimisation of hyper-parameters in photonic reservoir computing.
Compliance with Ethical Standards

Conflict of Interest. The authors declare that they have no conflict of interest.
Ethical approval.
This article does not contain any studies with human participants or animals performed by any of the authors.
References
1. H. Jaeger, H. Haas, Science 304(5667), 78 (2004). DOI 10.1126/science.1091277
2. W. Maass, T. Natschläger, H. Markram, Neural Computation 14(11), 2531 (2002). DOI 10.1162/089976602760407955
3. M. Lukoševičius, H. Jaeger, Computer Science Review 3(3), 127 (2009). DOI 10.1016/j.cosrev.2009.03.005
4. L. Appeltant, M. Soriano, G. Van der Sande, J. Danckaert, S. Massar, J. Dambre, B. Schrauwen, C. Mirasso, I. Fischer, Nature Communications 2(1) (2011). DOI 10.1038/ncomms1476
5. Y. Paquot, F. Duport, A. Smerieri, J. Dambre, B. Schrauwen, M. Haelterman, S. Massar, Scientific Reports 2(1) (2012). DOI 10.1038/srep00287
6. L. Larger, M.C. Soriano, D. Brunner, L. Appeltant, J.M. Gutierrez, L. Pesquera, C.R. Mirasso, I. Fischer, Optics Express 20(3), 3241 (2012). DOI 10.1364/oe.20.003241
7. R. Martinenghi, S. Rybalko, M. Jacquot, Y.K. Chembo, L. Larger, Physical Review Letters 108(24) (2012). DOI 10.1103/physrevlett.108.244101
8. L. Larger, A. Baylón-Fuentes, R. Martinenghi, V.S. Udaltsov, Y.K. Chembo, M. Jacquot, Physical Review X 7(1) (2017). DOI 10.1103/physrevx.7.011015
9. P. Antonik, M. Haelterman, S. Massar, Cognitive Computation 9(3), 297 (2017). DOI 10.1007/s12559-017-9459-3
10. F. Duport, B. Schneider, A. Smerieri, M. Haelterman, S. Massar, Optics Express 20(20), 22783 (2012). DOI 10.1364/oe.20.022783
11. D. Brunner, M.C. Soriano, C.R. Mirasso, I. Fischer, Nature Communications 4(1) (2013). DOI 10.1038/ncomms2368
12. Q. Vinckier, F. Duport, A. Smerieri, K. Vandoorne, P. Bienstman, M. Haelterman, S. Massar, Optica 2(5), 438 (2015). DOI 10.1364/optica.2.000438
13. A. Akrout, A. Bouwens, F. Duport, Q. Vinckier, M. Haelterman, S. Massar, arXiv:1612.08606 (2016)
14. K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers, G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre, P. Bienstman, Nature Communications 5(1) (2014). DOI 10.1038/ncomms4541
15. F. Triefenbach, A. Jalalvand, B. Schrauwen, J.P. Martens, in Advances in Neural Information Processing Systems (2010), pp. 2307–2315
16. The 2006/07 forecasting competition for neural networks & computational intelligence (2006)
17. F.D.L. Coarer, M. Sciamanna, A. Katumba, M. Freiberger, J. Dambre, P. Bienstman, D. Rontani, IEEE Journal of Selected Topics in Quantum Electronics 24(6), 1 (2018). DOI 10.1109/jstqe.2018.2836985
18. J. Bueno, S. Maktoobi, L. Froehly, I. Fischer, M. Jacquot, L. Larger, D. Brunner, Optica 5(6), 756 (2018). DOI 10.1364/optica.5.000756
19. P. Antonik, N. Marsal, D. Rontani, IEEE Journal of Selected Topics in Quantum Electronics 26(1), 1 (2020). DOI 10.1109/jstqe.2019.2924138
20. P. Antonik, N. Marsal, D. Brunner, D. Rontani, Nature Machine Intelligence (2019). DOI 10.1038/s42256-019-0110-8
21. J. Dong, M. Rafayelyan, F. Krzakala, S. Gigan, IEEE Journal of Selected Topics in Quantum Electronics 26(1), 1 (2020). DOI 10.1109/jstqe.2019.2936281
22. B. Penkovsky, L. Larger, D. Brunner, Journal of Applied Physics 124(16), 162101 (2018). DOI 10.1063/1.5039826
23. J. Mockus, Journal of Global Optimization 4(4), 347 (1994). DOI 10.1007/bf01099263
24. E. Brochu, V.M. Cora, N. de Freitas, arXiv:1012.2599 (2010)
25. J. Mockus, Bayesian Approach to Global Optimization: Theory and Applications, vol. 37 (Springer Science & Business Media, 2012)
26. P.I. Frazier, arXiv:1807.02811 (2018)
27. J. Yperman, T. Becker, arXiv:1611.05193 (2016)
28. A. Griffith, A. Pomerance, D.J. Gauthier, Chaos: An Interdisciplinary Journal of Nonlinear Science 29(12), 123108 (2019). DOI 10.1063/1.5120710
29. L. Cerina, G. Franco, M.D. Santambrogio, in Proceedings of ESANN (2019)
30. C.E. Rasmussen, C.K. Williams, Gaussian Processes for Machine Learning (MIT Press, 2006)
31. A.N. Tikhonov, A. Goncharsky, V. Stepanov, A.G. Yagola, Numerical Methods for the Solution of Ill-Posed Problems, vol. 328 (Springer Netherlands, 1995)
32. P. Antonik, F. Duport, M. Hermans, A. Smerieri, M. Haelterman, S. Massar, IEEE Transactions on Neural Networks and Learning Systems 28(11), 2686 (2017). DOI 10.1109/tnnls.2016.2598655
33. A. Rodan, P. Tino, IEEE Transactions on Neural Networks 22(1), 131 (2011). DOI 10.1109/tnn.2010.2089641
34. G. Van der Sande, D. Brunner, M.C. Soriano, Nanophotonics 6(3) (2017). DOI 10.1515/nanoph-2016-0132
35. D.J.C. MacKay, Neural Computation 4(3), 448 (1992). DOI 10.1162/neco.1992.4.3.448
36. MathWorks, Gaussian process regression model. http://fr.mathworks.com/help/stats/fitrgp.html
37. MathWorks, Bayesian optimization algorithm. http://fr.mathworks.com/help/stats/bayesian-optimization-algorithm.html
38. C. Schuldt, I. Laptev, B. Caputo, in Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004) (IEEE, 2004). DOI 10.1109/icpr.2004.1334462
39. N. Dalal, B. Triggs, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2005). DOI 10.1109/cvpr.2005.177
40. H.E. Bahi, Z. Mahani, A. Zatni, S. Saoud, A robust system for printed and handwritten character recognition of images obtained by camera phone. Tech. rep. (2015)
41. K. Pearson, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559 (1901). DOI 10.1080/14786440109462720
42. H. Hotelling, Journal of Educational Psychology 24(6), 417 (1933). DOI 10.1037/h0071325
43. L.I. Smith, A tutorial on principal components analysis. Tech. rep. (2002)
44. Y. Shi, W. Zeng, T. Huang, Y. Wang, in 2015 IEEE International Conference on Multimedia and Expo (ICME) (IEEE, 2015)