Meta-learning for fast classifier adaptation to new users of Signature Verification systems
Luiz G. Hafemann, Robert Sabourin, Member, IEEE, and Luiz S. Oliveira
Abstract—Offline handwritten signature verification presents a challenging pattern recognition problem, where only knowledge of the positive class is available for training. While classifiers have access to a few genuine signatures for training, during generalization they also need to discriminate forgeries. This is particularly challenging for skilled forgeries, where a forger practices imitating the user's signature, and is often able to create forgeries visually close to the original signatures. Most work in the literature addresses this issue by training for a surrogate objective: discriminating genuine signatures of a user and random forgeries (signatures from other users). In this work, we propose a solution for this problem based on meta-learning, where there are two levels of learning: a task level (where a task is to learn a classifier for a given user) and a meta-level (learning across tasks). In particular, the meta-learner guides the adaptation (learning) of a classifier for each user, which is a lightweight operation that only requires genuine signatures. The meta-learning procedure learns what is common for the classification across different users. In a scenario where skilled forgeries from a subset of users are available, the meta-learner can guide classifiers to be discriminative of skilled forgeries even if the classifiers themselves do not use skilled forgeries for learning. Experiments conducted on the GPDS-960 dataset show improved performance compared to Writer-Independent systems, and results comparable to state-of-the-art Writer-Dependent systems in the regime of few samples per user (5 reference signatures).
Index Terms—Meta-Learning, Signature Verification, Biometrics
I. INTRODUCTION
Offline handwritten signature verification remains a challenging problem in the presence of skilled forgeries, where the forger has access to the user's signature and practices imitating it [1]. This problem is particularly challenging since, in a practical application scenario, we cannot expect to have access to skilled forgeries for every user in the system for training the classifiers.

This problem is mainly addressed in three ways in the literature: (i) training a classifier for each user using a surrogate objective, where the negative samples are genuine signatures from other users (called random forgeries in this context) [2], [3], [4]; (ii) training a one-class classifier for each user [5]; (iii) training a global, writer-independent classifier [6], [7], [8].
L. G. Hafemann and R. Sabourin are with the Laboratoire d'imagerie, de vision et d'intelligence artificielle, École de technologie supérieure, Université du Québec, Montreal, Canada (e-mail: [email protected], [email protected]). L. S. Oliveira is with the Department of Informatics, Federal University of Parana, Curitiba, Brazil (e-mail: [email protected]). This work was supported by the Fonds de recherche du Québec - Nature et technologies (FRQNT) and by a CNPq grant.
The first alternative (Writer-Dependent (WD) classification) optimizes a surrogate objective, which can therefore be sub-optimal. The second alternative (one-class Writer-Dependent classification) is an appropriate formulation of the problem, but empirical results show that this approach performs worse than the first. A possible reason is that for signature verification tasks we normally have only a small number of samples per user, which makes it hard to estimate the support (or probability density) of the positive class. For instance, recent work considers a high-dimensional feature space, while the number of signatures from one individual can be as low as 1-5 in practical applications [1], [4]. Lastly, the third alternative (Writer-Independent (WI) classification) alleviates the problem of a small number of samples per user by transforming the problem into a binary classification problem: comparing a query signature with a reference (template) signature, where the same classifier is used for all users [9], [7]. However, empirically these approaches also show worse performance than WD classification, at least when the number of signatures available for training (per user) is larger than 1 [4]. We hypothesize that a reason for this gap in performance is that the WI classifiers compare a query signature with a reference signature one at a time, while the WD classifiers are trained with multiple references at the same time, and therefore can better estimate the invariances in a person's signature (intra-class variation).

Considering the different approaches, WD classification (alternative (i) above) shows the best empirical performance [1]. However, this approach has other shortcomings compared to WI approaches: it requires training a classifier for each user, which is not desirable in some scenarios. For instance, when the number of users is very large and each user does not use the system often, many classifiers are trained but almost never used. Also, in the cases where features are learned from data (e.g. [4]), if we want to change the feature representation, for instance by training with new data, it is not straightforward to incorporate the new features without re-training all WD classifiers in the system, while a global (WI) classifier would not require any extra step. WI systems also naturally handle the issue of adding more signatures to the reference set.

In this work, we propose to formulate the task as a meta-learning problem, inspired by the work of a forensic handwriting expert: the expert acquires knowledge by examining genuine signatures and forgeries from several people along his/her training and work experience. For a new case, along with knowledge of signatures from the individual, this previous experience is also used when analyzing a signature of interest.

The main contributions of this paper are:
• We formulate the signature verification task as a meta-learning problem, considering a meta-learner that learns across tasks (classification for specific individuals) and that is subsequently adapted to a particular user in order to make a prediction on a query signature.
• We extend Model-Agnostic Meta-Learning (MAML) [10] to consider different loss functions during classifier adaptation and meta-learning, to address the issue of partial knowledge during training.
• The resulting system is as scalable as a WI system (there is a single meta-model), but is also adaptable to individual users with a lightweight operation (a few gradient descent steps). Additionally, contrary to other work that learns representations to train WD classifiers ([4]), not only the final classification layer is adapted to the new user, but the feature representation is also adapted.
• We evaluate the approach on four widely used datasets, achieving results comparable to the state-of-the-art on the GPDS-960 dataset. Finally, we discuss the limitations of the approach, most notably the requirement of data from a large number of users for training, and worse results when transferring the meta-learner to the other datasets. Code to reproduce the experiments can be found at https://github.com/luizgh/sigver.

The paper is organized as follows: section II reviews the related work on signature verification and meta-learning. Section III introduces the formulation of signature verification as a meta-learning problem, and the proposed algorithm. Section IV describes the experimental protocol, and section V presents and discusses our results. Finally, section VI concludes the paper.

II. RELATED WORK
The objective of signature verification systems is to classify a query signature as being genuine (produced by the claimed individual) or a forgery (produced by another person). In the pattern recognition community, different types of forgery are considered: random forgeries, in which the forger has no knowledge of the user's signature and uses his own signature instead; simple forgeries, in which the forger knows the person's name, but not their signature; and skilled forgeries, where the forger has access to the user's signature and practices imitating it. While the problem of distinguishing random and simple forgeries is relatively easy (i.e. state-of-the-art classifiers obtain low error rates), skilled forgeries still present a significant challenge for classification.

These systems can be broadly categorized as Writer-Dependent (WD, also called User-Dependent) and Writer-Independent (WI, also called User-Independent). For Writer-Dependent classifiers, we consider a dataset for each user {x_i, y_i}_{i=1}^n, where x are signatures and y indicates whether they are genuine signatures from the user (y = 1) or random forgeries (y = 0) [2], [3], [4]. Some works consider one-class WD classifiers, in which only genuine signatures from the user are used for training (only y = 1) [5]. For WI classifiers, there are two main approaches: training a single classifier in a dissimilarity space, and metric-learning approaches.

Fig. 1: Common dataset separation for feature learning followed by WD classification, on the GPDS dataset. Features are learned in D. Model selection is conducted in V_v. The system is evaluated by training WD classifiers for the exploitation set E. [4]

In the first case, the training samples are differences of feature vectors: |φ(x_1) − φ(x_2)|, with y = 1 if both signatures are from the same user, and y = 0 otherwise [6], [7]. The metric-learning approaches use a siamese network architecture [11], which takes two signatures (x_1, x_2) as input and outputs a metric (distance) between them.

Recent work on signature verification relies on feature learning methods [12], [4], [13], [14], [8], in which learning is conducted directly from signature pixels, instead of relying on handcrafted feature extractors. In this case, a function φ(x) is learned to extract features from signature images x, by training with a surrogate objective, e.g. dictionary learning [15], [14], or classifying the user that produced the signatures [4]. For instance, the SigNet model [4] is a Convolutional Neural Network trained with the following objective:

L = −∑_j y_ij log P(y_j | X_i)    (1)

where X_i is a signature and y_i is the user that wrote the signature. Therefore, the network is trained to obtain a representation space where signatures from different people are linearly separable [4]. This feature representation is learned on a development dataset D, which is then used to extract features and train Writer-Dependent classifiers for a disjoint set of users (exploitation set E); a diagram of the dataset separation is shown in Figure 1. While this approach achieved state-of-the-art verification performance, we note that the feature learning process does not directly optimize for the final objective of the system, which is to distinguish genuine signatures and forgeries. This is addressed to some extent in the SigNet-F model, by also classifying whether or not the signature is a forgery. However, in that case, the neuron classifying forgeries does not use a reference signature from the user. While this was shown to be helpful in obtaining a good feature representation, this neuron did not generalize to classifying forgeries for unseen users [4].

Fig. 2: Illustration of the data available for one task (user). Left: the reference (support) set. Right: query samples.

Such methods using feature learning followed by WD classification also have other shortcomings: they require training one classifier for each user, which may be an expensive operation (e.g. the best results in [12], [4] were reported with an SVM trained with the RBF kernel for each user). If the feature extractor is updated (e.g. trained with more data), then all classifiers need to be retrained. Also, these systems use a fixed representation for all users, and it is possible that adapting the representation for each user would yield improvements in classification performance.

It is also worth noting that, for WI classification, signature verification systems can be trained jointly (feature extraction and classification) [8]. Despite being jointly trained, such WI systems still perform worse than WD classifiers trained with features learned with surrogate objectives, at least when more than one reference signature is used [4].
A possible reason for this gap is the fact that WI systems compare the query signature to each reference individually (or to the centroid of the signatures), which is less powerful than training a classifier for the user in capturing the invariances of the person's signature.
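As an illustration of this family of approaches, the following is a minimal PyTorch sketch of the surrogate objective in Eq. (1): a network trained to classify the writer of each signature, whose penultimate layer later serves as the feature extractor φ(x). All names and sizes here (SigFeatureNet, feat_dim, the dummy batch) are illustrative assumptions for this sketch, not the implementation of [4].

```python
import torch
import torch.nn as nn

class SigFeatureNet(nn.Module):
    # Backbone trained to classify the writer of each signature; after
    # training, `features` is kept and used as the extractor phi(x).
    def __init__(self, n_users, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.user_head = nn.Linear(feat_dim, n_users)

    def forward(self, x):
        return self.user_head(self.features(x))

model = SigFeatureNet(n_users=531)
_ = model(torch.zeros(1, 1, 150, 220))      # materialize the lazy layer
criterion = nn.CrossEntropyLoss()           # the objective of Eq. (1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 1, 150, 220)             # a batch of signature images
y = torch.randint(0, 531, (8,))             # writer identity labels
loss = criterion(model(x), y)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```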
A. Meta-learning
In a broad sense, meta-learning is concerned with the problem of learning to learn, with origins in the 80's and 90's [16], [17]. More recently, algorithms based on meta-learning have achieved state-of-the-art results in tasks such as hyperparameter optimization [18], neural network architecture search [19], and few-shot learning [20], [10]. Few-shot learning considers a scenario where only a few samples from each class are available for training, which is similar to actual application scenarios in handwritten signature verification. The goal of these meta-learning approaches for few-shot learning is to train a model that can quickly (i.e. in a few iterations) adapt to a new task using only a few samples. A new task in this context refers, for instance, to classifying a new object for which only a few samples are known. Ravi and Larochelle [20] proposed learning an optimizer and initialization for the tasks (Meta Nets); they propose using a Long Short-Term Memory (LSTM) model to learn the update rule for adapting the network parameters to a new task. Finn et al. [10] proposed a Model-Agnostic Meta-Learning (MAML) procedure that does not require any extra parameters. This model optimizes the sensitivity of the weights, that is, it obtains a feature representation that is highly adaptive, such that a single (or a few) gradient descent iterations are sufficient to optimize for new tasks.

III. PROPOSED METHOD
In this work we propose a meta-learning approach for signature verification. This formulation considers a meta-learner that guides the adaptation of a classifier for each user. We consider that each user describes a task: discriminating between genuine signatures (created by the user) and forgeries. Figure 2 illustrates the data available for one task: we consider a reference (support) dataset that is used for training a classifier that can classify new queries as genuine or forgery.

In a meta-learning setting, we consider that training a classifier for a particular user is guided by a meta-learner that leverages data from multiple tasks for learning. For this we consider a dataset D_meta-train, and then evaluate the generalization performance on unseen users D_meta-test.

Fig. 3: Example of the meta-learning setup. Each user represents an episode, where D_u is used for classifier adaptation and D′_u is used for meta-update.

TABLE I: Table of symbols

  T             Distribution of tasks (i.e. users)
  T_u           Task for user u
  D_meta-train  Training set for the meta-learner
  D_meta-test   Testing set for the meta-learner
  D_u           Samples for weight adaptation for user u
  D′_u          Samples for meta-update for user u
  G_u           Genuine signatures for user u
  S_u           Skilled forgeries for user u
  θ             Network parameters
  θ(u)_k        Parameters adapted to user u after k descent steps
  L             Loss function for weight adaptation
  L′            Loss function for meta-update

We note that this approach has a direct correspondence to previous work that used feature learning followed by WD classification (section II), and here we make the association between the terminology in meta-learning research and previous work on signature verification. In both cases we use a separate set of users for feature learning (D_meta-train is analogous to the development set in section II), which is then used to train and test classifiers on a new set of users (D_meta-test is analogous to the exploitation set). The key differences of meta-learning are that: (i) the loss optimized for feature learning is directly related to the final objective (separating genuine signatures and forgeries); (ii) training a classifier for a new user is a lightweight process (a few gradient descent iterations); (iii) not only the classifier, but also the features are adapted for each user.

In the next section we formalize the problem of signature verification as a meta-learning task.

A. Problem formulation
We consider that each user describes a task T_u ∈ T, where the task consists in classifying a signature image as genuine (created by the user) or forgery (not created by the user). A collection of users therefore describes a distribution of tasks T, and the aim of the meta-learner is to explore the structure present in this distribution. We consider a dataset D_meta-train containing tasks from T, which is used for meta-learning. For each user we consider a set D_u, which is used to adapt the classifier, and a set D′_u, which is used for updating the meta-learner. Lastly, to verify the generalization to unseen users, we consider a set D_meta-test that contains data from a disjoint set of users (D_meta-train ∩ D_meta-test = ∅). Figure 3 illustrates the meta-learning setup, and the symbols used in this paper are listed in Table I for clarity.

Fig. 4: Overview of the meta-learning system for signature verification: meta-training (Alg. 1), classifier adaptation (Alg. 2) and classification, with meta-learned weights θ and adapted weights θ′.
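To illustrate this formulation, the sketch below shows how one episode (task) could be assembled, following Figure 3 and Table I. The data layout, field names and sampling sizes are assumptions for this sketch, not the authors' code; images are assumed to be tensors of shape (1, 150, 220).

```python
import torch
from dataclasses import dataclass

@dataclass
class Episode:
    adapt_x: torch.Tensor        # D_u: images for classifier adaptation
    adapt_y: torch.Tensor        # 1 = genuine, 0 = (random) forgery
    meta_x: torch.Tensor         # D'_u: images for the meta-update
    meta_y: torch.Tensor
    meta_skilled: torch.Tensor   # marks skilled forgeries S'_u within D'_u

def make_episode(genuine, rand_forgeries, skilled, n_ref=5, n_rand=5):
    # genuine: tensor (N, 1, H, W) of one user's signatures;
    # rand_forgeries: genuine signatures of other users; skilled: S_u.
    g = genuine[torch.randperm(len(genuine))]
    # D_u: references plus random forgeries (one-class adaptation would
    # simply omit the forgery part).
    adapt_x = torch.cat([g[:n_ref], rand_forgeries[:n_rand]])
    adapt_y = torch.cat([torch.ones(n_ref), torch.zeros(n_rand)])
    # D'_u: a disjoint set of genuine signatures, random forgeries and
    # (when available for this user) skilled forgeries.
    meta_x = torch.cat([g[n_ref:2 * n_ref],
                        rand_forgeries[n_rand:2 * n_rand], skilled])
    meta_y = torch.cat([torch.ones(n_ref),
                        torch.zeros(n_rand + len(skilled))])
    meta_skilled = torch.cat([torch.zeros(n_ref + n_rand),
                              torch.ones(len(skilled))]).bool()
    return Episode(adapt_x, adapt_y, meta_x, meta_y, meta_skilled)
```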
B. Model-Agnostic Meta-Learning for signature verification

In this work we propose an extended version of Model-Agnostic Meta-Learning (MAML) [10], considering different criteria for classifier adaptation and meta-learning. An overview of the system can be seen in Figure 4. We consider a development set for meta-training, which consists in learning the weights θ of a Convolutional Neural Network that are highly adaptable to new tasks. During generalization, for a user u, a reference set D_u is used to adapt the classifier to this user (using K gradient descent steps), obtaining weights θ(u)_K. This adapted classifier is then used to classify a query image x_q, obtaining P(y = 1 | x_q, θ(u)_K).

Algorithm 1 describes the full meta-training algorithm. Meta-training is conducted in episodes (Figure 3). In each episode, the classifier is adapted to a particular user using D_u (lines 7 to 10), and the adapted classifier is used to classify the set D′_u. The loss is then back-propagated through all intermediate steps of the classifier adaptation (lines 11 and 12), and is used to update the meta-learner weights θ (line 14). Therefore, instead of having a feature representation that is directly applicable to any user, the features are learned to work well for new users after K gradient descent steps on the user's signatures. For stability during training, we train on "mini-batches" of episodes, by accumulating the gradients for M episodes before updating θ.
Algorithm 1: Meta-Training algorithm

Input: M: meta-batch size
Input: K: number of gradient descent steps
Input: α, β: learning rates
Output: θ: meta-learned weights

 1: Randomly initialize θ
 2: while not done do
 3:   Sample a batch of tasks {T_u}_{u=1}^M ∼ T
 4:   θ_grad ← 0
 5:   for u ← 1 to M do
 6:     Sample D_u                                    ▷ Genuine only
 7:     θ′_0 ← θ
 8:     for k ← 1 to K do                             ▷ Adapt weights to u
 9:       θ′_k ← θ′_{k−1} − α ∇_{θ′_{k−1}} L(D_u, θ′_{k−1})
10:     end for
11:     Sample D′_u                                   ▷ Genuine and forgeries
12:     θ_grad ← θ_grad + (1/M) ∇_θ L′(D′_u, θ′_K)
13:   end for
14:   θ ← θ − β θ_grad                                ▷ Meta-update
15: end while
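The following PyTorch sketch illustrates one meta-training step of Algorithm 1, back-propagating through the K adaptation steps (create_graph=True keeps the inner-loop graph so that the outer backward pass reaches the initial θ). It assumes the Episode structure sketched in section III-A and treats the losses L and L′ as callables; it is a simplified illustration (e.g. without the multi-step loss stabilization described in section IV), not the authors' implementation.

```python
import torch
from torch.func import functional_call

def inner_adapt(model, params, x, y, alpha, K, adapt_loss):
    # K gradient descent steps on the adaptation loss L (lines 7-10 of
    # Algorithm 1). create_graph=True keeps the inner-loop graph so the
    # meta-update can backpropagate through the whole chain (Fig. 5).
    for _ in range(K):
        logits = functional_call(model, params, (x,))
        grads = torch.autograd.grad(adapt_loss(logits, y),
                                    tuple(params.values()),
                                    create_graph=True)
        params = {name: p - alpha * g
                  for (name, p), g in zip(params.items(), grads)}
    return params

def meta_step(model, meta_opt, episodes, alpha, K, adapt_loss, meta_loss):
    # One meta-update (lines 3-14 of Algorithm 1) over a meta-batch of
    # M episodes; the M backward() calls accumulate the gradient w.r.t.
    # the initial weights theta, which the optimizer then applies.
    meta_opt.zero_grad()
    for ep in episodes:
        params = dict(model.named_parameters())          # theta
        adapted = inner_adapt(model, params, ep.adapt_x, ep.adapt_y,
                              alpha, K, adapt_loss)      # theta'_K
        logits = functional_call(model, adapted, (ep.meta_x,))
        loss = meta_loss(logits, ep.meta_y, ep.meta_skilled)
        (loss / len(episodes)).backward()
    meta_opt.step()   # theta <- theta - beta * theta_grad

# meta_opt would be, e.g., torch.optim.SGD(model.parameters(), lr=beta).
```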
Figure 5 illustrates the classifier adaptation procedure. In this work, we adapt the MAML algorithm to use different loss functions for the classifier adaptation and for the final loss (used for the meta-update). In particular, we consider a loss function L that only requires genuine signatures (from the user of interest and, as random forgeries, from other users) for the classifier adaptation, and a loss function L′ that uses both genuine signatures and forgeries. Let D_u = G_u ∪ G_{i≠u} be the training set consisting of genuine signatures from the user (G_u) and random forgeries (G_{i≠u}). We consider the following loss for classifier adaptation:

L(D_u, θ) = −(1/|G_u|) ∑_{x∈G_u} log P(y|x, θ) − (1/|G_{i≠u}|) ∑_{x∈G_{i≠u}} log P(y|x, θ)    (2)

where |G_u| and |G_{i≠u}| are the numbers of signatures in each set, which are used to correct for the imbalance between the two classes.

Let D′_u = G′_u ∪ G′_{i≠u} ∪ S′_u be a disjoint set of signatures for user u: genuine signatures (G′_u), random forgeries (G′_{i≠u}) and, if available, skilled forgeries (S′_u). We define the loss function for the meta-update as follows:

L′(D′_u, θ) = −(1/|G′_u|) ∑_{x∈G′_u} log P(y|x, θ(u)_K) − (1/|G′_{i≠u}|) ∑_{x∈G′_{i≠u}} log P(y|x, θ(u)_K) − (1/|S′_u|) ∑_{x∈S′_u} log P(y|x, θ(u)_K)    (3)

During generalization, for a new user we first adapt the weights to this user using a set of reference signatures D_u, and then classify a new query signature using the adapted weights. Algorithm 2 describes the classifier adaptation to a new user.
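As an illustration, Eqs. (2) and (3) can be implemented as class-balanced cross-entropy terms, averaging the per-sample losses within each subset. The function names and the logit-based formulation are assumptions of this sketch; these are the `adapt_loss` and `meta_loss` callables assumed in the meta-training sketch above.

```python
import torch
import torch.nn.functional as F

def balanced_adapt_loss(logits, labels):
    # L (Eq. 2): average the per-sample cross-entropy separately over
    # genuine signatures (y=1) and random forgeries (y=0), so the two
    # unbalanced subsets contribute equally to the gradient.
    bce = F.binary_cross_entropy_with_logits(
        logits.view(-1), labels.float(), reduction='none')
    genuine, forgery = bce[labels == 1], bce[labels == 0]
    loss = genuine.mean()
    if forgery.numel() > 0:          # one-class adaptation: D_u = G_u
        loss = loss + forgery.mean()
    return loss

def balanced_meta_loss(logits, labels, is_skilled):
    # L' (Eq. 3): one balanced term each for genuine signatures G'_u,
    # random forgeries G'_{i!=u} and, when present, skilled forgeries S'_u.
    bce = F.binary_cross_entropy_with_logits(
        logits.view(-1), labels.float(), reduction='none')
    loss = bce[labels == 1].mean() + bce[(labels == 0) & ~is_skilled].mean()
    if is_skilled.any():
        loss = loss + bce[is_skilled].mean()
    return loss
```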
Fig. 5: Illustration of one iteration of meta-training for one task T_u. Starting with parameters θ, the weights are specialized for the task in K gradient descent steps. Each step involves computing the loss (1), back-propagating the loss w.r.t. θ′_{k−1} (2) and updating the weights (3). For the meta-update, the loss L′ is backpropagated through the entire chain (from L′ back to the initial θ), computing ∇_θ L′(D′_u, θ(u)_K).

Algorithm 2: Classifier adaptation

Input: K: number of gradient descent steps
Input: α: learning rate
Input: θ: meta-learned weights
Input: D_u: reference set for user u
Output: θ′_K: weights adapted to the user after K steps

1: θ′_0 ← θ
2: for k ← 1 to K do
3:   θ′_k ← θ′_{k−1} − α ∇_{θ′_{k−1}} L(D_u, θ′_{k−1})
4: end for

We note that only the loss function L is used, and therefore only genuine signatures are used when adapting a classifier for a new user.
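At verification time, Algorithm 2 followed by a single forward pass could look as follows, reusing the inner_adapt and balanced_adapt_loss helpers from the earlier sketches (hypothetical names; at test time the graph-retaining inner loop of the meta-training sketch is unnecessary overhead, kept here for brevity).

```python
import torch
from torch.func import functional_call

def verify(model, references, query, alpha=0.01, K=5, threshold=0.5):
    # Algorithm 2: adapt the meta-learned weights theta to the claimed
    # user with K steps on the reference set (genuine signatures only,
    # so only the loss L is needed), then score the query signature.
    params = dict(model.named_parameters())
    labels = torch.ones(len(references))     # one-class: D_u = G_u
    adapted = inner_adapt(model, params, references, labels,
                          alpha, K, balanced_adapt_loss)
    with torch.no_grad():
        logit = functional_call(model, adapted, (query.unsqueeze(0),))
        p_genuine = torch.sigmoid(logit).item()
    return p_genuine >= threshold, p_genuine
```

Note that the adapted weights can be cached per user, trading storage for the cost of re-running the K adaptation steps at every verification.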
C. Meta-learning for one-class classification

The approach defined above can also be extended to one-class classification, where the classifier adaptation is done with only genuine signatures from the user of interest. This is easily implemented by considering D_u = G_u. It is worth noting that similarity-based methods and one-class methods that involve feature learning often suffer from the problem of collapsing representations onto a single point [21]. This is often addressed by adding a penalty in the loss function that requires dissimilar items to be far apart in the feature space. In our formulation, while the user's classifier is only trained with data from one class, we observe that training does not collapse to a single point, since the meta-training procedure directly optimizes the performance on separating forgeries in D′_u.

IV. EXPERIMENTAL PROTOCOL
We conducted most experiments on the GPDS-960 dataset [22], which consists of 881 users, with 24 genuine signatures per user and 30 skilled forgeries. We follow the same dataset separation as previous work (Figure 1), with users 350-881 as D_meta-train, users 300-350 as D_meta-val and users 0-300 as D_meta-test. We used the same pre-processing method from previous work [12], [4]: removing the background noise using OTSU's method, centering the images in a fixed-size canvas and resizing them to the network input size (Table II).

We analyze the impact of the hyperparameters on the classifier's performance, measured on D_meta-val. We consider experiments varying the following factors:
• the number of gradient descent steps K in the classifier adaptation;
• one-class classification vs. adaptation using genuine signatures and random forgeries;
• the fraction of users with skilled forgeries available for training;
• performance as we vary the number of reference genuine signatures.

We compare the results on D_meta-val with a baseline using feature learning followed by WD classification [4]. As in [4], we evaluate each model with repeated random subsampling: we randomly partition the validation set into training (D_u) and testing (D′_u), repeating the experiment 10 times with different partitions. We report the mean and standard deviation of the metrics.

In all experiments, we train the meta-classifier for a total of 100 epochs, considering a meta-batch size M = 4. We consider an initial meta-learning rate β that is reduced (with cosine annealing) by the last epoch. We used early stopping, keeping the meta-learner weights that performed best on the validation set. Following [23], we used Multi-Step Loss optimization (MSL) to improve training stability. For the first 20 epochs, instead of computing the loss function L′ only after K steps (line 12 of Algorithm 1), we compute the loss function for all intermediate θ′_k and consider a weighted average of the losses. In the first epoch the loss using each θ′_k contributes equally to the loss function, and the weights are annealed to give more weight to the last step until epoch 20, after which only the loss function at the final step K contributes to the loss. We found this procedure effective in stabilizing training (measured by the variation in validation accuracy across epochs). We also attempted to use the learnable per-step task learning rates (LSLR) described in [23], without success. Empirically, we also noticed that when using only genuine signatures, the task learning rate needs to be larger than in the case where skilled forgeries are available for training: if the fraction of users with skilled forgeries was less than 10%, we used a larger task learning rate α than for the other experiments.
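The multi-step loss schedule described above might be implemented as in the sketch below. The exact annealing shape is an assumption of this sketch (the paper follows [23]); only the property stated in the text is guaranteed: equal weights in the first epoch, and all weight on the final step K from epoch 20 onward.

```python
import torch

def msl_weights(epoch, K, anneal_epochs=20):
    # Weight w[k] multiplies L'(D'_u, theta'_{k+1}); the weighted sum
    # replaces the final-step loss of line 12 in Algorithm 1 during
    # the first `anneal_epochs` epochs of meta-training.
    if epoch >= anneal_epochs:
        w = torch.zeros(K)
        w[-1] = 1.0                               # only the final step K
    else:
        frac = epoch / anneal_epochs
        w = torch.full((K,), (1.0 - frac) / K)    # equal shares decay...
        w[-1] += frac                             # ...as mass moves to K
    return w                                      # weights sum to 1
```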
TABLE II: Base architecture used in this work

  Layer                      Size
  Input                      1x150x220
  Convolution (C1)           32x5x5
  Max Pooling                32x5x5
  Convolution (C2)           32x5x5
  Pooling                    32x5x5
  Fully Connected (FC3)      1024
  Fully Connected (FC4)      256
  Fully Connected + Sigmoid  1

In order to evaluate the transferability of the features to other operating conditions, we conducted experiments on other datasets (that were collected in different regions and followed different collection processes): MCYT-75 [24], CEDAR [25] and the Brazilian PUC-PR [26]. We conducted two experiments: (i) using the meta-learner trained on GPDS directly for new users of these datasets; (ii) training a meta-learner with data from the four datasets. It is worth noting that, with the exception of GPDS, the datasets are relatively small, with 55, 75 and 60 users for CEDAR, MCYT and Brazilian PUC-PR, respectively. We observed that the formulations from this work require a large number of users for training, and for this reason we conducted 10-fold cross-validation. We divide each dataset into 10 folds (by users), and for each run we consider 1 fold as meta-test and the remaining folds for meta-training and validation. As in the previous experiments, we further use repeated subsampling for evaluating the adaptation to the new users. In total, for experiment (ii), we train 10 CNN models and perform 10 adaptations for each user. We report the mean error rates over all runs, and the standard deviation across the 10 different adaptations (each based on different train/test splits of the repeated subsampling).

The CNN architecture used in the experiments is listed in Table II. We found that using a smaller network, compared to previous work using feature learning followed by WD classification, was successful in the meta-learning setting. This network has a total of 1.4M weights and uses 0.1 GFLOPS for forward propagation, while SigNet [4] has 15.8M weights and uses 0.6 GFLOPS. That is, the CNN used in this work is 10x smaller and 6x faster.

We evaluate the performance using the following metrics: False Rejection Rate (FRR), the fraction of genuine signatures rejected as forgeries; False Acceptance Rate (FAR_random and FAR_skilled), the fraction of forgeries accepted as genuine (considering random forgeries and skilled forgeries, respectively). We also report the Equal Error Rate (EER), which is the error when FAR = FRR. We considered two ways of calculating the EER: EER global-τ, using a global decision threshold, and EER user-τ, using user-specific decision thresholds. In both cases, to calculate the Equal Error Rate we only considered skilled forgeries. For FRR and FAR, we report the values with a threshold of 0.5 (i.e. if P(y = 1 | x, θ′_K) ≥ 0.5 we consider the model to predict x as a genuine signature).
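These metrics can be computed from the classifier scores as in the sketch below (illustrative helper names; user-specific thresholds repeat the same EER search per user before averaging).

```python
import numpy as np

def frr_far(genuine_scores, forgery_scores, threshold=0.5):
    # FRR: fraction of genuine signatures rejected;
    # FAR: fraction of forgeries accepted.
    frr = np.mean(genuine_scores < threshold)
    far = np.mean(forgery_scores >= threshold)
    return frr, far

def eer_global(genuine_scores, skilled_scores):
    # EER with a global threshold: scan candidate thresholds over the
    # pooled scores and return the error where FAR and FRR are closest.
    best_gap, eer = np.inf, None
    for t in np.sort(np.concatenate([genuine_scores, skilled_scores])):
        frr, far = frr_far(genuine_scores, skilled_scores, t)
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer
```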
Fig. 6: ROC curves on D_meta-val comparing the one-class and two-class formulations with the baselines (EER: meta-learning one-class 3.48, meta-learning two-class 2.86, SigNet-F* 4.60, SigNet* 17.03).

V. RESULTS
A. System design
In this section we report the results on D_meta-val (GPDS users 300-350), considering the experiments defined in section IV. The objective is to evaluate different aspects of the system, such as the number of gradient steps (which trades off computational complexity and accuracy), as well as to investigate the performance of the model in different data scenarios.

In a first experiment we consider the results of the one-class formulation and the two-class formulation as we vary the number of random forgeries used for classifier adaptation (K = 5 gradient descent steps); for meta-training we considered that skilled forgeries were available on D_meta-train (users 350-881). Note that for validation, no skilled forgeries were used for training. Table III reports the results of these experiments. We observe similar verification performance for the two formulations. Note that the formulation using random forgeries is more computationally expensive, as the classifier adaptation involves a larger batch of images (e.g. the one-class adaptation loss uses only the 5 reference images, while the two-class adaptation additionally uses the r random forgeries). The baselines SigNet* and SigNet-F* used the same approach proposed in [4], but with the CNN architecture defined for this work (Table II). We note that the meta-learning formulation performed much better, while being a simpler model (a single model for all users). A comparison with the SigNet CNN architecture from [4] is conducted in section V-B, where we compare to the state-of-the-art. Figure 6 presents ROC curves for the one-class and two-class formulations, along with the baselines.

In a second experiment, we vary the number of update steps K (Figure 7). For each value of K, we meta-trained a network and evaluated its performance on D_meta-val. We observed improved performance with a larger number of steps, but with diminishing returns. We note a high variance of the errors in these experiments, and therefore we cannot determine a particular K as being optimal. As we increase the number of steps, we also increase the computational cost. If we consider that forward propagation and backward propagation have a similar cost, the classifier adaptation for a new user takes about 2K times the time of a single forward pass. A higher K also requires more memory (in the order of K) during meta-training, since the whole update sequence needs to be stored in memory in order to compute the gradient for the meta-update (as can be seen in Figure 5).

TABLE III: Performance on D_meta-val with one-class and two-class formulations (errors in %)

  Type                     #Genuine  #Random  FRR    FAR_random  FAR_skilled  EER (global τ)  EER (user τ)
  SigNet* + WD             5         7434     10.48  0.03        24.67        17.03           13.17
  SigNet-F* + WD           5         7434     18.08  0.16        1.55         4.6             3.08
  Meta-learning One-class  5         -        2.54   2.74        4.24         3.48            1.69
  Meta-learning Two-class  5         5        2.82   1.98        4.18         3.8             2.04
  Meta-learning Two-class  5         10       5.1    1.94        2.66         3.56            1.85
  Meta-learning Two-class  5         20       2.84   1.98        3.1          2.86            1.78
  Meta-learning Two-class  5         30       2.62   2.48        3.46         3.17            1.4

Fig. 7: Performance on D_meta-val as we vary the number of update steps K.

Fig. 8: Performance on D_meta-val as we vary the number of users in D_meta-train for which skilled forgeries are available.
Fig. 9: Performance on D_meta-val as we vary the number of users available for meta-training, with 0%, 10%, 50% and 100% of users having skilled forgeries. (a): one-class formulation; (b): two-class formulation.

In Figures 8 and 9 we analyze the impact on performance as we vary the size of the D_meta-train set. As noted in section III-B, if skilled forgeries from a subset of users are available, we can incorporate them into the meta-update loss function L′. In this experiment we considered that D_meta-train contains all 531 users, and varied the number of users for which skilled forgeries are available. For each case, we build a dataset consisting of genuine signatures for all users and skilled forgeries for the selected users, and train a model. Figure 8 shows the performance as we vary the number of users for which skilled forgeries are available. We re-iterate that we evaluate the performance on a disjoint set of users (D_meta-val), for which only genuine signatures are used.
TABLE IV: Comparison with the state-of-the-art on the GPDS dataset (errors in %)

  Reference            Type   Dataset   #Ref  Model                      EER
  Hafemann et al. [4]  WD     GPDS-300  5     SigNet-F (global τ)        5.25
  Hafemann et al. [4]  WD     GPDS-300  5     SigNet-F (user τ)          2.42
  Hafemann et al. [4]  WD     GPDS-300  12    SigNet-F (global τ)        3.74
  Hafemann et al. [4]  WD     GPDS-300  12    SigNet-F (user τ)          1.69
  Souza et al. [30]    WI     GPDS-300  5     SigNet (global τ)          9.05
  Souza et al. [30]    WI     GPDS-300  5     SigNet (user τ)            4.40
  Souza et al. [30]    WI     GPDS-300  12    SigNet (global τ)          7.96
  Souza et al. [30]    WI     GPDS-300  12    SigNet (user τ)            3.34
  Present work         WI/WD  GPDS-300  5     MAML one-class (global τ)  5.52
  Present work         WI/WD  GPDS-300  5     MAML one-class (user τ)    3.35
  Present work         WI/WD  GPDS-300  5     MAML two-class (global τ)  5.16
  Present work         WI/WD  GPDS-300  5     MAML two-class (user τ)    2.94
  Present work         WI/WD  GPDS-300  12    MAML one-class (global τ)  4.70
  Present work         WI/WD  GPDS-300  12    MAML one-class (user τ)    2.93
  Present work         WI/WD  GPDS-300  12    MAML two-class (global τ)  4.39
  Present work         WI/WD  GPDS-300  12    MAML two-class (user τ)    2.68

We observed that the meta-learning formulation of the problem is well suited to incorporating information from skilled forgeries (when they are available), and that this generalizes well to unseen users, for which we only have genuine signatures. However, we observed that the performance is not very good when there are only genuine signatures for meta-training: the one-class formulation achieves 14.15% EER when only genuine signatures are available, and 3.48% EER when skilled forgeries are available for all 531 users in meta-training.

In Figure 9, we evaluate the performance of the system as we vary the number of users in D_meta-train. We also consider 4 levels of availability of skilled forgeries in the meta-training set: 0% (genuine only), 10%, 50% and 100%, where the percentages refer to the number of users for which skilled forgeries are available (e.g. 10% with 100 users means that forgeries for 10 users are considered, while the remaining 90 users have only genuine signatures). For a given number of users and skilled-forgery percentage, we construct a dataset with randomly selected users (taken from the 531 users in the development set), with genuine signatures from all the selected users and skilled forgeries for a fraction of the users. We then use this dataset for meta-training a model, and evaluate its performance on D_meta-val. We observed improved performance both as more users are available for meta-training and as more knowledge of skilled forgeries is available. Most surprisingly, we observed that for the two-class formulation, a classifier trained with 100 users with 100% forgeries (i.e. forgeries for every user in meta-training) performed better than a model trained with 531 users with forgeries for only 100 users (comparing Figures 9b and 8): 6.07% EER vs 9.14% EER. We re-iterate that this measures the performance on discriminating genuine signatures and skilled forgeries; the model that has access to more users (with the same number of users with skilled forgeries) has better performance on discriminating random forgeries, since its optimization consisted mostly of this problem.

B. Comparison with the state-of-the-art
We now compare our results with the state-of-the-art on the GPDS-300 dataset. For these comparisons, we considered a model trained with the one-class formulation, and a model trained with the two-class formulation with r = 30 random forgeries. In both cases, we used the whole dataset D_meta-train for training the meta-classifier, and used 5 genuine signatures for classifier adaptation, with K = 5 updates. While training was conducted with 5 reference signatures, we evaluate the performance of the system with different numbers of references.

Table IV compares our results with the state-of-the-art. We observe an improved performance compared to other WI systems, achieving 5.16% EER (global τ) with 5 reference signatures, compared to 9.05% from [30]. Compared to WD systems, we observed similar performance in some scenarios (5 reference signatures), and worse results otherwise. With 12 reference signatures, the proposed system obtained 4.39% EER (global τ), compared to 3.74% for the WD system [4]. However, the proposed system is more scalable, as a single model is stored for all users.

Figure 10 shows the performance on GPDS-300 as we vary the number of reference samples available for each user. As commonly observed in WD systems (e.g. [4]), the performance greatly improves as more reference samples are available for training: for the one-class formulation, performance with a single reference is 9.09% EER (global τ) and 5.81% EER (user τ). With 12 references, we obtain 4.70% EER (global τ) and 2.93% EER (user τ).

Fig. 10: Performance on GPDS-300 (EER with global and user thresholds) as we vary the number of reference signatures available for each user. (a): one-class formulation; (b): two-class formulation.

TABLE V: Transfer performance to the other datasets
  Target Dataset    Training Dataset  EER (global)  EER (user)
  MCYT              GPDS              15.48
  MCYT              All datasets
  CEDAR             GPDS
  CEDAR             All datasets
  Brazilian PUC-PR  GPDS
  Brazilian PUC-PR  All datasets

TABLE VI: Comparison with the state-of-the-art on MCYT (errors in %)
  Reference                                EER (global τ)  EER (user τ)
  Hafemann et al. [4] (SigNet, WD)         3.58            2.87
  Present work (meta-learner, GPDS)        15.37           12.77
  Present work (meta-learner, all data)    14.50           12.44

C. Transfer to other datasets
We now consider results on the three other datasets: MCYT, CEDAR and the Brazilian PUC-PR. Table V shows the performance in two scenarios: (i) the meta-learner trained only on GPDS, evaluating its generalization to new operating conditions; and (ii) the meta-learner trained on all four datasets (using 10-fold cross-validation, as described in section IV). While the method generalized well to unseen GPDS users, we see that the generalization performance to other datasets is much worse. Furthermore, we notice that even when training with a subset of users from all datasets, the performance does not improve for all datasets. A possible explanation is that the GPDS dataset is still much larger (10 times larger than the others) and dominates training.

TABLE VII: Comparison with the state-of-the-art on CEDAR (errors in %)
  Reference                                EER (global τ)  EER (user τ)
  Present work (meta-learner, GPDS)        11.06           8.27
  Present work (meta-learner, all data)    10.21           7.07

TABLE VIII: Comparison with the state-of-the-art on the Brazilian PUC-PR dataset (errors in %)
  Reference              Type  #Refs  Features / Model  AER genuine + skilled / EER
  Bertolini et al. [37]  WI    15     Graphometric      8.32
  Batista et al. [38]    WD    30     Pixel density     10.5
  Rivard et al. [9]      WI    15     ESC + DPDF        11.08
  Eskander et al. [7]    WD    30     ESC + DPDF        10.67
  Hafemann et al. [4]    WD    5      SigNet (user τ)   2.92

Overall, this suggests that the proposed method requires a large amount of data from the target application, and is sensitive to changes in operating conditions. Finally, Tables VI, VII and VIII compare the results with the state-of-the-art on MCYT, CEDAR and the Brazilian PUC-PR, respectively.

It is worth noting that the meta-learner does generalize to new users of the GPDS dataset, as verified in sections V-A and V-B, since we evaluate on a D_meta-test that contains a set of users disjoint from those used to train the meta-learner. What we observed, however, is that this meta-learner does not transfer well to other datasets. This has been observed in more recent work on meta-learning [39], which shows that although these models perform well for new classes from the same distribution (e.g. the same dataset), the performance deteriorates when evaluating on new datasets (i.e. a shift in the task distribution). This is still an active area of research in meta-learning.

VI. CONCLUSION
In this paper we proposed to formulate signature verification as a meta-learning problem, where each user defines a task. This formulation enables directly optimizing for the objective (separating genuine signatures and forgeries) even when forgeries are not available for all users. The resulting system is scalable and yet adaptable to individual users: a single meta-classifier is learned and stored, and for the verification of a given signature, the classifier is adapted to the claimed user and subsequently used for verification. The proposed method is also able to naturally incorporate new reference signatures for a user, and enables adapting the representation as more training data becomes available. The drawbacks of this solution are twofold: increased computational cost and worse transferability to new conditions. The method is about 2K times slower when using K updates for the classifier adaptation, although it allows the option to trade storage for computational cost: the adapted weights for a given user can be stored for faster classification.

In our experiments with the GPDS-960 dataset, the proposed method obtains better results than WI systems in the literature, and approaches the performance of WD systems, especially when few samples are available for training. With 5 reference signatures, the proposed method obtains 5.16% EER (using a global threshold), compared to 9.05% for a WI system and 5.25% for a WD system. For a larger number of references the WD system still performs better, but the gap in performance is greatly reduced: considering 12 reference signatures, the method obtains 4.39% EER (with a global threshold) vs 3.74% for the WD system, while being more scalable (a single meta-classifier). Our experiments transferring the meta-learner to other datasets show reduced performance, highlighting the need for better adaptation to new conditions, which will be explored in future work. Future work also includes considering a dynamic scenario, where the meta-classifier is updated as new training data becomes available.

REFERENCES

[1] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Offline handwritten signature verification — Literature review," in International Conference on Image Processing Theory, Tools and Applications (IPTA), Nov. 2017, pp. 1–8.
[2] J. Vargas, C. Travieso, J. Alonso, and M. Ferrer, "Off-line Signature Verification Based on Gray Level Information Using Wavelet Transform and Texture Features," Nov. 2010, pp. 587–592.
[3] M. B. Yılmaz and B. Yanıkoğlu, "Score level fusion of classifiers in off-line signature verification," Information Fusion, vol. 32, Part B, pp. 109–119, Nov. 2016.
[4] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Learning features for offline handwritten signature verification using deep convolutional neural networks," Pattern Recognition, vol. 70, pp. 163–176, Oct. 2017.
[5] Y. Guerbai, Y. Chibani, and B. Hadjadji, "The effective use of the one-class SVM classifier for handwritten signature verification based on writer-independent parameters," Pattern Recognition, vol. 48, no. 1, pp. 103–113, Jan. 2015.
[6] R. Kumar, J. D. Sharma, and B. Chanda, "Writer-independent off-line signature verification using surroundedness feature," Pattern Recognition Letters, vol. 33, no. 3, pp. 301–308, Feb. 2012.
[7] G. Eskander, R. Sabourin, and E. Granger, "Hybrid writer-independent-writer-dependent offline signature verification system," IET Biometrics, vol. 2, no. 4, pp. 169–181, Dec. 2013.
[8] H. Rantzsch, H. Yang, and C. Meinel, "Signature embedding: Writer independent offline signature verification with deep metric learning," in Advances in Visual Computing, ser. Lecture Notes in Computer Science. Springer International Publishing, pp. 616–625, DOI: 10.1007/978-3-319-50832-0_60.
[9] D. Rivard, E. Granger, and R. Sabourin, "Multi-feature extraction and selection in writer-independent off-line signature verification," International Journal on Document Analysis and Recognition, vol. 16, no. 1, pp. 83–103, 2013.
[10] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proceedings of the 34th International Conference on Machine Learning, vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 1126–1135.
[11] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature Verification using a 'Siamese' Time Delay Neural Network," 1994.
[12] L. G. Hafemann, R. Sabourin, and L. S. Oliveira, "Writer-independent feature learning for offline signature verification using convolutional neural networks," in Neural Networks, The 2016 International Joint Conference on, 2016.
[13] L. G. Hafemann, L. S. Oliveira, and R. Sabourin, "Fixed-sized representation learning from offline handwritten signatures of different sizes," International Journal on Document Analysis and Recognition (IJDAR), pp. 1–14, 2018.
[14] E. N. Zois, M. Papagiannopoulou, D. Tsourounis, and G. Economou, "Hierarchical Dictionary Learning and Sparse Coding for Static Signature Verification," p. 11.
[15] E. N. Zois, I. Theodorakopoulos, D. Tsourounis, and G. Economou, "Parsimonious Coding and Verification of Offline Handwritten Signatures," Jul. 2017, pp. 636–645.
[16] J. Schmidhuber, "Evolutionary principles in self-referential learning," PhD Thesis, Technische Universität München, München, 1987.
[17] Y. Bengio, S. Bengio, and J. Cloutier, "Learning a synaptic learning rule," in IJCNN-91-Seattle International Joint Conference on Neural Networks, vol. ii, Jul. 1991, pp. 969 vol. 2.
[18] D. Maclaurin, D. Duvenaud, and R. Adams, "Gradient-based hyperparameter optimization through reversible learning," in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, France: PMLR, 07–09 Jul 2015, pp. 2113–2122.
[19] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing Neural Network Architectures using Reinforcement Learning," in International Conference on Learning Representations, 2017.
[20] S. Ravi and H. Larochelle, "Optimization as a model for few-shot learning," 2016.
[21] P. Perera and V. M. Patel, "Learning Deep Features for One-Class Classification," arXiv:1801.05365 [cs], Jan. 2018.
[22] J. Vargas, M. Ferrer, C. Travieso, and J. Alonso, "Off-line Handwritten Signature GPDS-960 Corpus," in Document Analysis and Recognition, 9th International Conference on, vol. 2, Sep. 2007, pp. 764–768.
[23] A. Antoniou, H. Edwards, and A. Storkey, "How to train your MAML," in International Conference on Learning Representations, 2019.
[24] J. Ortega-Garcia, J. Fierrez-Aguilar, D. Simon, J. Gonzalez, M. Faundez-Zanuy, V. Espinosa, A. Satue, I. Hernaez, J.-J. Igarza, C. Vivaracho, and others, "MCYT baseline corpus: a bimodal biometric database," IEE Proceedings - Vision, Image and Signal Processing, vol. 150, no. 6, pp. 395–401, 2003.
[25] M. K. Kalera, S. Srihari, and A. Xu, "Offline signature verification and identification using distance statistics," International Journal of Pattern Recognition and Artificial Intelligence, vol. 18, no. 07, pp. 1339–1360, Nov. 2004.
[26] C. Freitas, M. Morita, L. Oliveira, E. Justino, A. Yacoubi, E. Lethelier, F. Bortolozzi, and R. Sabourin, "Bases de dados de cheques bancarios brasileiros," in XXVI Conferencia Latinoamericana de Informatica, 2000.
[27] J. Hu and Y. Chen, "Offline Signature Verification Using Real Adaboost Classifier Combination of Pseudo-dynamic Features," in Document Analysis and Recognition, 12th International Conference on, Aug. 2013, pp. 1345–1349.
[28] Y. Serdouk, H. Nemmour, and Y. Chibani, "New gradient features for off-line handwritten signature verification," Sep. 2015, pp. 1–4.
[29] A. Soleimani, B. N. Araabi, and K. Fouladi, "Deep Multitask Metric Learning for Offline Signature Verification," Pattern Recognition Letters, vol. 80, pp. 84–90, Sep. 2016.
[30] V. L. F. Souza, A. L. I. Oliveira, and R. Sabourin, "A Writer-Independent Approach for Offline Signature Verification using Deep Convolutional Neural Networks Features," Oct. 2018, pp. 212–217.
[31] J. Wen, B. Fang, Y. Y. Tang, and T. Zhang, "Model-based signature verification with rotation invariant features," Pattern Recognition, vol. 42, no. 7, pp. 1458–1466, Jul. 2009.
[32] J. F. Vargas, M. A. Ferrer, C. M. Travieso, and J. B. Alonso, "Off-line signature verification based on grey level information using texture features," Pattern Recognition, vol. 44, no. 2, pp. 375–385, Feb. 2011.
[33] S. Y. Ooi, A. B. J. Teoh, Y. H. Pang, and B. Y. Hiew, "Image-based handwritten signature verification using hybrid methods of discrete Radon transform, principal component analysis and probabilistic neural network," Applied Soft Computing, vol. 40, pp. 274–282, 2016.
[34] S. Chen and S. Srihari, "A New Off-line Signature Verification Method based on Graph," vol. 2, 2006, pp. 869–872.
[35] R. Kumar, L. Kundu, B. Chanda, and J. D. Sharma, "A Writer-independent Off-line Signature Verification System Based on Signature Morphology," in Proceedings of the First International Conference on Intelligent Interactive Technologies and Multimedia, ser. IITM '10. New York, NY, USA: ACM, 2010, pp. 261–265.
[36] R. Bharathi and B. Shekar, "Off-line signature verification based on chain code histogram and Support Vector Machine," Aug. 2013, pp. 2063–2068.
[37] D. Bertolini, L. S. Oliveira, E. Justino, and R. Sabourin, "Reducing forgeries in writer-independent off-line signature verification through ensemble of classifiers," Pattern Recognition, vol. 43, no. 1, pp. 387–396, Jan. 2010.
[38] L. Batista, E. Granger, and R. Sabourin, "Dynamic selection of generative-discriminative ensembles for off-line signature verification," Pattern Recognition, vol. 45, no. 4, pp. 1326–1340, Apr. 2012.
[39] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, and H. Larochelle, "Meta-dataset: A dataset of datasets for learning to learn from few examples," arXiv preprint arXiv:1903.03096, 2019.
Luiz G. Hafemann received his B.S. degree in Computer Science in 2008 and his M.Sc. degree in Informatics in 2014, both from the Federal University of Paraná, Curitiba, PR, Brazil. He received his Ph.D. degree in Systems Engineering in 2019 from the École de Technologie Supérieure, Université du Québec, in Montreal, QC, Canada. He is currently a researcher at Sportlogiq, applying computer vision models for sports analytics. His current interests include meta-learning, adversarial machine learning and group activity recognition.

Luiz S. Oliveira received his B.S. degree in Computer Science from Unicenp, Curitiba, PR, Brazil, the M.Sc. degree in electrical engineering and industrial informatics from the Centro Federal de Educação Tecnológica do Paraná (CEFET-PR), Curitiba, PR, Brazil, and the Ph.D. degree in Computer Science from the École de Technologie Supérieure, Université du Québec, in 1995, 1998 and 2003, respectively. From 2004 to 2009 he was a professor of the Computer Science Department at the Pontifical Catholic University of Paraná, Curitiba, PR, Brazil. In 2009, he joined the Federal University of Paraná, Curitiba, PR, Brazil, where he is a professor of the Department of Informatics and head of the Graduate Program in Computer Science. His current interests include Pattern Recognition, Machine Learning, Image Analysis, and Evolutionary Computation.