Triplet loss based embeddings for forensic speaker identification in Spanish
Emmanuel Maqueda, Javier Alvarez-Jimenez, Carlos Mena, Ivan Meza
Emmanuel Maqueda
Facultad de Estudios Superiores Cuautitlán, Universidad Nacional Autónoma de México
[email protected]

Javier Alvarez-Jimenez
Universidad Abierta y a Distancia de México
[email protected]

Carlos Mena
University of Malta
[email protected]

Ivan Meza*
Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México
[email protected]
January 25, 2021

Abstract
With the advent of digital technology, it is increasingly common that crimes or legal disputes involve some form of speech recording in which the identity of a speaker is questioned [1]. In the face of this situation, the field of forensic speaker identification seeks to shed light on the problem by quantifying how likely it is that a speech recording belongs to a particular person in relation to a population. In this work, we explore the use of speech embeddings obtained by training a CNN with the triplet loss. In particular, we focus on the Spanish language, which has not been extensively studied. We propose extracting the embeddings from speech spectrogram samples, explore several configurations of such spectrograms, and quantify the quality of the embeddings. We also show some limitations of our data setting, which is predominantly composed of male speakers. Finally, we propose two approaches to calculate the Likelihood Ratio given our speech embeddings and we show that the triplet loss is a good alternative for creating speech embeddings for forensic speaker identification.

Keywords: Triplet Loss · Speaker Identification · Forensic · Spanish
The use of the triplet loss [2] was popularized with the introduction of the FaceNet architecture, which was aimed at face identification tasks. This loss makes it possible to train a neural network, commonly a Convolutional Neural Network (CNN), to produce a vector representation of an image. The goal is for the neural network to learn a mapping from face images to a Euclidean space such that distances between face image positions directly correspond to a measure of face similarity. Within this setting, a desired outcome is that images of the same face cluster together while images of different faces are separated by a margin. The triplet loss has been applied to different types of images: objects [3], person re-identification [4], and information retrieval [5]. Recently, the triplet loss has been proposed for speech tasks, for instance: speaker verification [6], speaker turn embedding [7], and speech emotion classification [8], among other tasks. However, its applicability to forensic speaker identification has not been explored, particularly for the Spanish language.
Forensic Speaker Identification (FSI) focuses on gathering and quantifying the evidence that will be presented in a court. FSI addresses the question of whether or not a specific recording registers speech produced by a specific person [9, 10]. The most basic scenario in FSI consists of two speech recordings, a reference sample and a questioned sample. For the reference sample, we always know the identity of the speaker. This certainty is guaranteed by the chain of custody: we know the conditions in which the recording was taken, including the identity of the speaker. On the other hand, for the questioned recording we are not sure about the identity of the person whose voice is in the recording. In a case that involves FSI, the identity of the speaker in the questioned recording is contested with regard to the identity in the reference recording; one of the involved parties affirms that the voices in the reference and the questioned recordings are the same (same-speaker hypothesis), while the other party affirms the contrary (different-speaker hypothesis).

The goal in an FSI case is not only to match the two recordings by their similarity, the same-speaker hypothesis; FSI requires a stronger legal standard which also makes it necessary to quantify the chances of the questioned sample being associated with other speakers of the population, which addresses the different-speaker hypothesis. This measurement is known as typicality. With these two measurements, similarity and typicality, it is common to calculate the Likelihood Ratio (LR) [11]. The LR offers a quantifiable measurement that updates the odds of one of the hypotheses; this information should be considered in the context of the other evidence regarding the case.

In this work, we propose to extract speech embeddings from speech spectrogram samples of the reference and questioned recordings in order to quantify the LR. We start by presenting related work in section 2. We continue by presenting the details of our neural model and the implementation of the triplet loss in section 3. We present two sets of results: first, in section 5.1, we measure the quality of the embeddings by proposing inner and outer speaker distance metrics, together with the well-established silhouette clustering metric. Second, in section 5.2, we propose two ways to calculate the LR in terms of the speech embedding distances. Once we show that speech embeddings are an option to be used in FSI, we discuss some ethical aspects to be considered in section 6. Finally, we summarise our main findings in section 7.
Forensic speech science has advanced over the last three decades, establishing different methodologies and techniques to face the speaker identification problem [10]. In the case of methodology, there has been a paradigm shift towards empirically grounded methods [12, 13]. This shift has been motivated by the requirements of admissibility of scientific evidence that have become a standard in some courts around the world [14, 15]. The main result of this shift has been the adoption of the Likelihood Ratio (LR) as a means to introduce the evidence in court. The LR is formulated in the following manner:

LR = p(E | H_s) / p(E | H_d)    (1)

where E represents the evidence, which in FSI is the quantification of speech properties in the questioned recorded sample, H_s corresponds to the same-speaker hypothesis, and H_d to the different-speaker hypothesis. The numerator can be considered a similarity score and the denominator a typicality score. The LR is not to be considered independent of the other facts of the case; on the contrary, its meaning depends on the strength or weakness of the rest of the evidence and its compatibility with either of the hypotheses.

On the other hand, from the point of view of the techniques, there have been several proposals that allow the quantification of p(E | H_s) and p(E | H_d). One approach that was extensively explored was statistical analysis assuming a Gaussian distribution of speech features [9]. A common approach is to measure specific phonetic and phonological speech properties (e.g., formants) in a specific context (e.g., a word [16, 17]). Motivated by the application of multivariate statistics in the forensic field [18], new approaches were suggested together with kernel approaches to improve the statistical analysis of the speech evidence [19, 20]. Another proposed method is the use of the Gaussian-Mixture-Model/Universal Background Model (GMM-UBM) [21].
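As a toy numeric illustration of Eq. (1), the sketch below assumes that both hypotheses are modelled by univariate Gaussian score distributions; the means and standard deviations are invented for illustration and are not part of the system described in this paper.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def likelihood_ratio(evidence, same_mean, same_std, diff_mean, diff_std):
    """LR = p(E | Hs) / p(E | Hd), Eq. (1), with Gaussian score models."""
    return (gaussian_pdf(evidence, same_mean, same_std)
            / gaussian_pdf(evidence, diff_mean, diff_std))

# A score that sits on the same-speaker distribution yields LR > 1,
# i.e. the evidence updates the odds towards the same-speaker hypothesis.
lr = likelihood_ratio(1.0, same_mean=1.0, same_std=0.5,
                      diff_mean=3.0, diff_std=0.5)
```

When both models are identical the ratio is exactly 1, i.e. the evidence does not update the odds in either direction.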
This methodology reflects a more generative machine learning approach that depends on data-intensive algorithms; for this reason, these methodologies do not require measuring a specific voice property. The GMM-UBM model is parametric and depends on a dataset of recordings that is used to estimate the model parameters, in this case the parameters of the Gaussian mixtures. A GMM speaker-specific model is created to quantify the similarity term, and a UBM general model, based on the population of possible speakers, is generated to quantify the typicality term. With the advent of different machine learning techniques, other machine learning-based approaches have been proposed, such as Support Vector Machines, boosting algorithms, and Random Forests [22, 23]. A fundamental requirement for adopting ML techniques is their appropriateness for calculating the LR [24]. Recently, there have been proposals that exploit the discriminative power of Neural Networks and their capability of producing representations; some examples of these approaches are DNN senone i-vectors [25], bottleneck features [26], and x-vectors [27].

On the other hand, with the progress of self-supervised methods for training deep learning networks, there have been advances in learning representations for embedding spaces. Contrastive methods have been proposed to compare a questioned sample to a set of samples of known speakers; this corresponds to a speaker verification setting, where there is access to samples of the speaker to identify and it is a matter of verifying whether the two sets of samples match [28, 27]. Of particular interest are the advances with the triplet loss, since it maps raw speech representations into a Euclidean space [2]. Using these embeddings for the speaker identification task is straightforward, since the distance between embeddings can be used to determine which speaker is closest to a questioned recording. It is important to notice that speaker identification does not quantify typicality, only similarity. This approach has been used in different scenarios: [29] uses a residual CNN and GRU to transform a spectrogram into a vector, proposes cosine similarity to guide the triplet loss, and evaluates accuracy and error in recognising the speaker; [30] presented a CNN Inception-ResNet to generate an embedding, focused on the L2 norm, and proposed validation and false accept rates to evaluate the system; [31] modifies the triplet loss to make it more efficient and also uses accuracy (top 1 and top 5) for evaluation. In all three cases, the resulting performances were superior to previous approaches.

Figure 1: Triplet loss applied to three speech segments which are transformed into three vectors. Possible relations among the A, P, and N inputs; the third case is a zero-loss case.

Of particular interest for our experimentation is the difference between female and male voices since, as will be presented further down, our dataset has an imbalance between these types of speakers. According to the medical notion, the voice originates in the throat of the speaker, specifically in the larynx. There is an understanding that the size of the larynx correlates with sexual characteristics, which in turn determine the sex (or pitch) of the voice as an acoustic event.
On average, the male larynx is larger than the female larynx and is naturally inclined to produce lower-pitched speech. However, this is not a rule, since male and female speech can overlap. In this regard, the sexual characteristics of the vocal tract, biologically determined, determine sex in speech; however, the identification phase is one of gender, since it is constituted by a subjective interpretation [32, 33, 34]. With this in mind, our experimentation will be based on perception, which means on gender.

The triplet loss compares an anchor input with two other inputs: a positive input which shares a property with the anchor (in our case, the identity of the speaker) and a negative input which does not share that property. The comparison is guided by the following formulation:

L(A, P, N) = max(D(A, P) − D(A, N) + m, 0)    (2)

where A, P, and N are vectors representing the anchor, positive, and negative inputs respectively, D is a distance metric, and m is a margin. In an ideal setting, the distance between A and P is expected to be less than the distance between A and N by at least the margin m; in that case the loss is zero and the network calculating the vectors is doing a good job. If this is not the case, the loss will be positive and, through backpropagation, the weights of the model that produces the vectors from raw information will be adjusted. Figure 1 shows the relation between the CNN model and the triplet loss; it is important to notice that the CNN blocks are the same neural network transforming all inputs, since the weights are shared. The figure also illustrates the three cases in terms of the distances among the embeddings and the margin. The L2 norm is used as the distance metric.

A common arrangement with the triplet loss is to use the same neural model to produce the vectors from raw inputs. In this work, we propose to use a CNN that receives segments of the spectrogram of speech recordings. Figure 2 shows this arrangement.
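A minimal sketch of Eq. (2), using the squared L2 distance for D (a FaceNet-style choice); the margin and example vectors are illustrative, not values from the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Eq. (2): max(D(A, P) - D(A, N) + m, 0), with D the squared
    L2 distance between embedding vectors."""
    d_ap = float(np.sum((anchor - positive) ** 2))
    d_an = float(np.sum((anchor - negative) ** 2))
    return max(d_ap - d_an + margin, 0.0)

A = np.array([0.0, 0.0])       # anchor embedding
P = np.array([0.1, 0.0])       # same speaker: close to the anchor
N_far = np.array([1.0, 0.0])   # different speaker, beyond the margin
N_near = np.array([0.2, 0.0])  # different speaker, violating the margin

zero_case = triplet_loss(A, P, N_far)       # 0.01 - 1.0 + 0.2 < 0 -> loss 0
positive_case = triplet_loss(A, P, N_near)  # 0.01 - 0.04 + 0.2 = 0.17
```

The first call is the zero-loss case of Figure 1; the second produces a positive loss, which in training would push the negative further from the anchor.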
We propose a simple CNN architecture; Figure 2 shows a diagram with the specific details of our model.

Figure 2: Convolutional network.
The input of our CNN is a segment of a spectrogram that represents a time slice (t, in milliseconds) and frequency information up to a cutoff high enough to characterise the human voice, always using a fixed number of frequency bins. All our recordings are downsampled, and the spectrogram is normalized and pre-emphasized. These specific choices follow typical pre-processing of speech signals. To obtain the spectrogram we use a Hann window, with variable window (w, in milliseconds) and hop (h, in milliseconds) sizes. The parameters t, w, and h allow us to generate an image patch (spectrogram segment) with a variable width and a constant height. This patch is the CNN's raw input, which is transformed into an embedding in all our experiments.

In this work we use the Spanish Voxforge dataset, which is entirely based on the recordings from the Voxforge Project. The Voxforge Project is a non-profit initiative that aims to collect transcribed speech for use with free and open source speech recognition engines. We chose Voxforge because the speakers always read the same prompt, a paragraph of El Quijote; this eliminates overfitting to the content of what is said. We expect our models to focus on properties of how things are said. The speakers donated a sample of their voice by registering on the project website, filling in a form with relevant information and reading some prompts directly through their computer microphone. Thanks to this mechanism, we know the following about every speaker, which is relevant for forensic purposes and our experimentation:

• Username: it could be left blank or be an alias
• Gender: Male / Female
• Age: Youth / Adult / Senior
• Native Speaker?: Yes / No
• Dialect: Country or Region

Since the beginning of the project, several languages have been added by the community, so we clarify that our corpus only contains the Spanish recordings collected up to the year of our snapshot. The original recordings were manually segmented into utterances. Table 1 shows the total number of male and female speakers and how they are classified in the corpus according to their nationality.
Country         Males   Females
Argentina         143        31
Chile              69         3
Latin America     148        10
Mexico             68         8
Spain
Unknown            45         4
Total

Table 1: Number of speakers in the Voxforge corpus and their nationalities.

The Spanish Voxforge dataset is composed of short recordings distributed in a mono format. For the experiments, we split the speakers into training, validation, and testing sets.
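The spectrogram patches of Section 3 can be sketched with plain NumPy; the 16 kHz sampling rate below is an assumption for illustration, and the normalization and pre-emphasis steps are omitted:

```python
import numpy as np

def spectrogram_patch(signal, sample_rate, t_ms, w_ms, h_ms):
    """Cut a t_ms slice of `signal` into a magnitude spectrogram using
    a Hann window of w_ms and a hop of h_ms (the Section 3 parameters)."""
    n = int(sample_rate * t_ms / 1000)    # samples in the time slice
    win = int(sample_rate * w_ms / 1000)  # window length in samples
    hop = int(sample_rate * h_ms / 1000)  # hop length in samples
    hann = np.hanning(win)
    frames = [np.abs(np.fft.rfft(signal[s:s + win] * hann))
              for s in range(0, n - win + 1, hop)]
    return np.stack(frames, axis=1)  # frequency bins x time frames

# e.g. a t = 2000 ms patch with w = 100 ms and h = 50 ms at a
# hypothetical 16 kHz sampling rate
sr = 16000
audio = np.random.default_rng(0).standard_normal(2 * sr)
patch = spectrogram_patch(audio, sr, t_ms=2000, w_ms=100, h_ms=50)
```

As in the text, t controls the patch width (number of frames) while the FFT size fixes the height (number of frequency bins).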
We performed two levels of experimentation. First, we explored the speech embeddings by measuring their quality and the clusters they generate on samples from the same speaker. Second, we evaluated the capability of the speech embeddings to be used to calculate an LR; we propose two approaches: one uses the distance between the reference and questioned samples directly as a proxy for the LR, and the other uses the ratio between that distance and the shortest distance to another speaker in the population.
To quantify the quality of the embeddings, we calculate the following metrics: the inner average distance between samples of the same speaker (IAD) and the outer average distance between a speaker's samples and the centroids of the other speakers (OAD). We expect a small IAD, signaling that samples from the same speaker land in the same region. For the OAD, we expect a large number, signaling that the samples of a speaker are far away from other speakers. With these two metrics, we calculate a distance ratio (DR) that tells us the relation between a speaker and the rest of the speakers; we expect a small DR for good speech embeddings. We also calculated the mean silhouette coefficient (MSC), which ranges from −1 (worst result) to 1 (best result); a value closer to 1 indicates less confusion among the clusters of embeddings from the same speaker's samples.

The first question we address is how long we have to train the network using the triplet loss. For these experiments we set the parameters to t = 2000 ms, w = 100 ms, and h = 50 ms, we also set a margin, and we trained the model for different numbers of days (exploratory experiments with shorter times showed that the minimum training time was a day). Table 2 reports the results. As can be seen, one day of training is enough to reach good results.
Table 2: Validation results for different training durations; one day of training gives the best results and more time does not change the behaviour of the network.

The second question we address is how large the CNN's input speech segment has to be. For this we explored different parameters of the spectrogram, h and t, while fixing w to a common window size for speech signal processing. Varying h allows us to control the amount of information that passes through in a segment. On the other hand, t controls the amount of signal that the CNN will 'see'; we did not try larger times since all recordings reached at least the largest value tried, but not necessarily longer. We set the margin to the value that in previous experiments had given a good compromise between DR and MSC. Table 3 summarises the main findings for different combinations of these parameters; each model was trained for a day. As can be seen, the more information in the patch, the smaller the DR (which is a good result). This suggests that a more informative patch is obtained with a larger segment and a small hop size.

Figure 3 shows the projection of embeddings. As can be seen, same-speaker embeddings cluster; in this case these clusters represent one speaker's recording from which we extracted the samples that became embeddings. The figure illustrates our best (t = 2000, h = 25) and worst (t = 1000, h = 50) models. In both projections there is an ordering of the speakers (clusters of speakers), but also some speakers that are close to each other. Our best model produces more organized positions, while for the worst model there is some confusion among the clusters in the middle.

As presented in section 4, there is an imbalance between female and male speakers, and between the nationalities of the speakers. To quantify the effect of these imbalances we performed further evaluations.
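A minimal sketch of the IAD, OAD, and DR metrics on synthetic embeddings; defining DR as IAD/OAD is our reading of the text, so the exact normalization is an assumption:

```python
import numpy as np

def embedding_metrics(embeddings, labels):
    """IAD: mean distance from each sample to its own speaker centroid.
    OAD: mean distance from each sample to the centroids of the OTHER
    speakers. DR = IAD / OAD: small values signal tight, well-separated
    speaker clusters."""
    labels = np.asarray(labels)
    speakers = np.unique(labels)
    centroids = {s: embeddings[labels == s].mean(axis=0) for s in speakers}
    iad = np.mean([np.linalg.norm(e - centroids[l])
                   for e, l in zip(embeddings, labels)])
    oad = np.mean([np.linalg.norm(e - centroids[s])
                   for e, l in zip(embeddings, labels)
                   for s in speakers if s != l])
    return iad, oad, iad / oad

# two tight, well-separated synthetic "speakers"
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0.0, 0.1, size=(20, 8)),
                 rng.normal(5.0, 0.1, size=(20, 8))])
iad, oad, dr = embedding_metrics(emb, [0] * 20 + [1] * 20)
```

On these well-separated clusters IAD is much smaller than OAD, so DR is close to zero, the regime the text associates with good embeddings.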
In the case of gender we created three models: one trained with only female recordings (F), one with only male recordings (M*), and one trained with an amount of male recordings comparable to the female recordings (M). For these experiments we set the parameters to t = 2000 ms, w = 100 ms, and h = 50 ms. Table 5 shows the main results on the corresponding validation datasets. As can be seen, there is an effect of gender mismatch; however, it is not severe when training with female speakers and evaluating the model on male speakers. In the other direction, we notice a severe drop in performance. As expected, the best setting is to train with the largest amount of recordings of a gender and to evaluate on that gender.

Table 6 shows our findings for nationality. Given the number of speakers, we decided to compare speakers from two regions: Latin America (L) and Spain (S), setting up two models, L and S. As we can see, there is no notable difference given the nationality; considering that the Latin America region packs more nationalities
in the dataset, we cannot see an effect of the regional accents on the capability of the triplet loss to produce good embeddings. Similarly to gender, a mismatch between the training and evaluation speakers produces drops in performance, but not as severe as one might expect.

Table 3: Validation results for different parameters of the speech segments (t, h, patch size, IAD, IOD, DR, MSC).

Figure 3: Projections of a recording per speaker in the validation set. The first projection corresponds to our best model (t = 2000, h = 25), the second to our worst model (t = 1000, h = 25) (same color, same speaker; 218 speakers).

In these experiments we aim to establish a way to calculate the LR based on the distances among embeddings. The most straightforward proposal is to use a normalized distance between the centroids of the reference and questioned sample embeddings; we call this approach distance based (D). The second proposal is to use the ratio between the same distance D and the distance to the closest speaker in the population, in order to account for typicality; we call this approach distance ratio (DR). The first proposal can be formalized as:

LR_D = D(q, r) / N    (3)

while the second as:

LR_DR = min_{p ∈ Population} D(q, p) / D(r, q)    (4)

where q is the centroid of the questioned embedding samples, r is the centroid of the reference samples, p is the centroid of a speaker from the population, and N is a normalizing factor.

For the experimentation we used two forensic speaker identification scenarios: genuine (where the same-speaker hypothesis is true) and impostor (where the different-speaker hypothesis is true). Per speaker, we randomly selected three recordings from which we sampled segments as a reference. In the genuine scenario we selected a fourth recording as the questioned source of samples, while in the impostor scenario we randomly selected an extra recording from a different speaker.
Additionally, as our population we randomly selected a set of different speakers from which we extracted the same number of samples as for our reference. It is important to remark that the recordings for these settings came from the validation split of the data. With these considerations, we have a set of genuine cases and a set of impostor cases. Figure 4 shows the LR scores for both approaches: distance based (D) and distance ratio (DR). For the case of the distance as a proxy for the LR (D), as expected, genuine cases are concentrated at lower values while impostor cases are located at higher values.
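Equations (3) and (4) can be sketched directly from embedding centroids; the toy centroids below illustrate a genuine case and are invented for illustration:

```python
import numpy as np

def lr_scores(questioned, reference, population, norm=1.0):
    """Eq. (3) and Eq. (4): distance-based (LR_D) and distance-ratio
    (LR_DR) proxies, computed from embedding centroids."""
    d = np.linalg.norm(questioned - reference)        # D(q, r)
    lr_d = d / norm                                   # Eq. (3)
    nearest = min(np.linalg.norm(questioned - p) for p in population)
    lr_dr = nearest / d                               # Eq. (4)
    return lr_d, lr_dr

q = np.array([0.0, 0.0])   # questioned-sample centroid
r = np.array([0.1, 0.0])   # reference centroid (genuine case)
pop = [np.array([5.0, 0.0]), np.array([0.0, 6.0])]  # population centroids

lr_d, lr_dr = lr_scores(q, r, pop)
# genuine case: q is much closer to r than to any population speaker,
# so LR_DR is large, supporting the same-speaker hypothesis
```

In the impostor case the questioned centroid drifts towards the population, shrinking the numerator of Eq. (4) and pushing LR_DR below 1.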
Table 4: Evaluation results with parameters t = 2000, h = 25, w = 100, and m = 2.

Table 5: Evaluations measuring the effect of gender imbalance. F (female) and M (male) are comparable since they have the same number of training speakers; M* is not directly comparable since it relies on a larger number of speakers.

For the case of the distance ratio (DR), we see that it corresponds better to the common interpretation of the LR, where the score indicates an update in the belief in the same-speaker hypothesis. In this case, lower scores are associated with impostor cases and larger scores correspond to genuine cases. We also measured the Equal Error Rate (EER) and a sensitivity index for both approaches.

Figure 4: Histogram of scores for positive (genuine) and negative (impostor) cases for both LR proposals.
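The EER can be computed from the genuine and impostor scores with a simple threshold sweep; the sketch below assumes higher scores support the same-speaker hypothesis, as with the DR proxy:

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep a decision threshold over all observed scores and return
    the operating point where the false match rate (FMR) and the false
    non-match rate (FNMR) are closest: the Equal Error Rate."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fnmr = np.array([np.mean(genuine < t) for t in thresholds])   # genuine rejected
    fmr = np.array([np.mean(impostor >= t) for t in thresholds])  # impostor accepted
    i = int(np.argmin(np.abs(fnmr - fmr)))
    return (fnmr[i] + fmr[i]) / 2.0

# perfectly separated scores give an EER of 0
eer = equal_error_rate([2.0, 3.0, 4.0], [0.0, 0.5, 1.0])
```

The FMR/FNMR pairs computed here are exactly the quantities plotted on the DET curves discussed below.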
Table 6: Evaluations measuring the effect of nationality imbalance, L (Latin America) and S (Spain).

Figure 5 shows the DET and ROC curves on a log scale for both scores. DET curves are common in the forensic field; in particular, they compare the False Match Rate (FMR) with the False Non-Match Rate (FNMR) of the system. For both DET curves we see that as the number of false positives grows, it is harder to mismatch a case; a better system produces a curve located towards the bottom-left corner. Our experiments show that both approaches, D and DR, share a lot of predictive power. It is important to take into account that DET curves are related to ROC curves, which are more common in the ML field. Within ROC curves it is common to measure the Area Under the Curve (AUC) to compare two systems; in this case we observe that the DR approach has a higher AUC score than D, which points to a slightly more robust discriminative power for the DR metric.

Figure 5: DET and ROC log curves for both proposals: distance (D) and distance ratio (DR).

In recent years, the topic of sharing personal information has become extremely relevant due to the intensive interaction between modern societies and technology, in particular technology able to take autonomous decisions. Nowadays, simple things like clicking a "like" button or typing some words into a search engine can trigger undesired advertisements or the risk of being scammed in creative ways [35]. As computer scientists, these observations raise the question of how datasets are shared through the internet in order to contribute to the advances of science while, at the same time, not harming anyone in any way, as can be understood in the Hippocratic oath for artificial intelligence practitioners [36].
In this work, the use of the Voxforge Spanish dataset raises some privacy concerns, since the donors accepted that their voices became part of a database destined to create language technologies while maintaining anonymity. On the other hand, due to the type of information provided by each donor (mainly a personalised username), it could be easy to identify some of them and perform actions against them, such as identity fraud in certain speech systems [37]. Fortunately, some of these concerns were contemplated by the developers of the Voxforge dataset, which offered an option to opt out; however, this could hardly be enforced for all distributed copies [38]. Following this concern, our experiments do not rely on the username field and we only release the models, not the recordings. We will also honor any opt-out petition made to the Voxforge dataset.

Another aspect of concern regarding this work is whether the proposed system is ready to be deployed and used in a real court case. From our perspective, even though the experiments show an interesting and promising performance, it is clear that our models are not ready for deployment in court. In particular, the selection of the population has to be done properly: in a real case, it must be transparent who is included in the population used to calculate typicality. Additionally, the results of our models have to be used in the context of other evidence; as previously mentioned, the LR has the goal of updating the odds of a legal hypothesis, but it is not a definitive test of identity. In this context, any attempt to use our models as a recognition system would be ill-advised.
In this work we explored the use of the triplet loss function and a convolutional neural network (CNN) for the forensic identification of speakers. We have shown that this setting offers an alternative to well-established approaches. Within this setting, the goal is to train a CNN to produce an embedding of a speech sample. Although previous research has shown that such embeddings can cluster speakers, in our experimentation we have shown that the distances among clusters can be used to approximate a Likelihood Ratio (LR), which is a common measurement to convey the plausibility of a legal forensic hypothesis.

Our experimentation points out that the CNN is able to produce a good embedding representation for long speech samples and a finer resolution of the spectrogram. We hypothesize this is because such a patch contains more information about a speaker's voice. In particular, we have focused on Spanish from Latin America and Spain, and we have shown that our approach is not affected by the variant of Spanish. However, gender has a strong effect on the performance. So far our results point out that models have to be gender dependent, but more research has to be done on this point (see [34]).

Finally, our results also showed that the triplet loss conveys good discriminative power when the distances are used to approximate the LR. We proposed two approaches for the calculation of the LR: the first one is solely based on the distance between the questioned and reference samples (D); this approach can be considered independent of the population. The second approach uses the ratio between the first approximation D and the distance of the questioned sample to the closest speaker from the population (DR). We conclude that this second option provides a better alternative, since it conveys information in two respects: first, numerically it corresponds to other LR scores, and second, it has better predictive power (a higher AUC); this makes it suitable for the forensic identification of speakers.
The authors thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies with the project Experimentos con voz, traducción y clasificación de textos (id. PAPTL 01-008). We also acknowledge Fernanda Hernandez and Sandra Vázquez, who were involved in early stages of the development of the code and experimentation.
Conflict of interest
The authors declare that they have no conflict of interest.
References

[1] Philip Rose. Identifying criminals by their voice: the emerging applied discipline of forensic phonetics. Australian Language Matters, 5(2):6–7, 1997.
[2] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
[3] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1386–1393, 2014.
[4] De Cheng, Yihong Gong, Sanping Zhou, Jinjun Wang, and Nanning Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2016.
[5] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
[6] Chunlei Zhang, Kazuhito Koishida, and John H. L. Hansen. Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(9):1633–1644, 2018.
[7] Hervé Bredin. TristouNet: Triplet loss for speaker turn embedding. In ICASSP, pages 5430–5434. IEEE, 2017.
[8] Jian Huang, Ya Li, Jianhua Tao, Zhen Lian, et al. Speech emotion recognition from variable-length inputs with triplet loss function. In Interspeech, pages 3673–3677, 2018.
[9] Phil Rose. Forensic Speaker Identification. CRC Press, 2002.
[10] Geoffrey Stewart Morrison, Cuiling Zhang, and Ewald Enzinger. Forensic speech science. The Bloomsbury Companion to Phonetics, pages 183–197, 2019.
[11] Irving J. Good. Weight of evidence and the Bayesian likelihood ratio. The Use of Statistics in Forensic Science, pages 85–106, 1991.
[12] Michael J. Saks and Jonathan J. Koehler. The coming paradigm shift in forensic identification science. Science, 309(5736):892–895, 2005.
[13] Geoffrey Stewart Morrison. Forensic voice comparison and the paradigm shift. Science & Justice, 49(4):298–308, 2009.
[14] Paul C. Giannelli. The admissibility of novel scientific evidence: Frye v. United States, a half-century later. Colum. L. Rev., 80:1197, 1980.
[15] Christophe Champod and Joëlle Vuille. Scientific evidence in Europe: admissibility, evaluation and equality of arms. International Commentary on Evidence, 9(1), 2011.
[16] Phil Rose. A forensic phonetic investigation into non-contemporaneous variation in the f-pattern of similar-sounding speakers. In Fifth International Conference on Spoken Language Processing, 1998.
[17] Phil Rose. Long- and short-term within-speaker differences in the formants of Australian "hello". Journal of the International Phonetic Association, pages 1–31, 1999.
[18] Colin G. G. Aitken and David Lucy. Evaluation of trace evidence in the form of multivariate data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 53(1):109–122, 2004.
[19] Philip Rose, D. Lucy, Takashi Osanai, et al. Linguistic-acoustic forensic speaker identification with likelihood ratios from a multivariate hierarchical random effects model: a non-idiot's Bayes' approach. Proceedings of the 10th Australian International Conference on Speech Science and Technology, 2004.
[20] Geoffrey Stewart Morrison. A comparison of procedures for the calculation of forensic likelihood ratios from acoustic–phonetic data: Multivariate kernel density (MVKD) versus Gaussian mixture model–universal background model (GMM–UBM).
Speech Communication , 53(2):242–256, 2011.[21] Frédéric Bimbot, Jean-François Bonastre, Corinne Fredouille, Guillaume Gravier, Ivan Magrin-Chagnolleau,Sylvain Meignier, Teva Merlin, Javier Ortega-García, Dijana Petrovska-Delacrétaz, and Douglas A Reynolds.A tutorial on text-independent speaker verification.
EURASIP Journal on Advances in Signal Processing ,2004(4):101962, 2004.[22] Pedro Univaso, Juan Maria Ale, and Jorge A Gurlekian. Data mining applied to forensic speaker identification.
IEEE Latin America Transactions , 13(4):1098–1111, 2015.[23] Geoffrey Stewart Morrison, Ewald Enzinger, Daniel Ramos, Joaquín González-Rodríguez, and Alicia Lozano-Díez.
Statistical models in forensic voice comparison . CRC Press LLC Boca Raton, Florida, 2020.[24] J. Gonzalez-Rodriguez, J. Fierrez-Aguilar, and J. Ortega-Garcia. Forensic identification reporting using automaticspeaker recognition systems. In , volume 2, pages II–93, 2003.10
PREPRINT - J
ANUARY
25, 2021[25] Daniel Garcia-Romero and Alan McCree. Supervised domain adaptation for i-vector based speaker recognition.In , pages 4047–4051.IEEE, 2014.[26] Sibel Yaman, Jason Pelecanos, and Ruhi Sarikaya. Bottleneck features for speaker recognition. In
Odyssey2012-The Speaker and Language Recognition Workshop , 2012.[27] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-vectors: Robustdnn embeddings for speaker recognition. In , pages 5329–5333. IEEE, 2018.[28] Georg Heigold, Ignacio Moreno, Samy Bengio, and Noam Shazeer. End-to-end text-dependent speaker verification.In , pages 5115–5119.IEEE, 2016.[29] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and ZhenyaoZhu. Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304 , 2017.[30] Chunlei Zhang and Kazuhito Koishida. End-to-end text-independent speaker verification with triplet loss on shortutterances. In
Interspeech , pages 1487–1491, 2017.[31] J Mo and L Xu. Self-attention networks for speaker identification with negative-focused triplet loss. In
Journal ofPhysics: Conference Series , volume 1601, page 052004. IOP Publishing, 2020.[32] David Azul. How do voices become gendered? a critical examination of everyday and medical constructions ofthe relationship between voice, sex, and gender identity. In
Challenging popular myths of sex, gender and biology ,pages 77–88. Springer, 2013.[33] Rose McDermott and Peter K Hatemi. Distinguishing sex and gender.
PS: Political Science & Politics , 44(1):89–92,2011.[34] Fatih Ertam. An effective gender recognition approach using voice data via deeper lstm networks.
AppliedAcoustics , 156:351–358, 2019.[35] Danesh Irani, Steve Webb, Kang Li, and Calton Pu. Modeling unintended personal-information leakage frommultiple online social networks.
IEEE Internet Computing , 15(3):13–19, 2011.[36] Oren Etzioni. A hippocratic oath for artificial intelligence practitioners. Technical report, 2018. Accessed01/21/2021.[37] Saeid Safavi, Hock Gan, Iosif Mporas, and Reza Sotudeh. Fraud detection in voice-based identity authenticationapplications and services. In ,pages 1074–1081. IEEE, 2016.[38] Michael Zimmer. “but the data is already public”: on the ethics of research in facebook.