A Real-time Robot-based Auxiliary System for Risk Evaluation of COVID-19 Infection
Wenqi Wei, Jianzong Wang*, Jiteng Ma, Ning Cheng, Jing Xiao
Ping An Technology (Shenzhen) Co., Ltd.
*Corresponding author: Jianzong Wang, [email protected]
Abstract
In this paper, we propose a real-time robot-based auxiliary system for risk evaluation of COVID-19 infection. It combines real-time speech recognition, temperature measurement, keyword detection, cough detection and other functions to convert live audio into actionable structured data and assess the risk of COVID-19 infection. To better evaluate this risk, we propose an end-to-end method for cough detection and classification within the system. It operates on real human-robot conversation data, processing the speech signal to detect coughs and classifying them when detected. The structure of our model is kept concise so that it can be deployed in real-time applications. We further embed the entire auxiliary diagnostic system in a robot, which is placed in communities, hospitals and supermarkets to support COVID-19 testing. The system can be further leveraged within a business rules engine, thus serving as a foundation for real-time supervision and assistance applications. Our model utilizes a pretrained, robust training environment that allows for efficient creation and customization of customer-specific health states.
Index Terms: COVID-19, cough detection, cough classification, real-time system, disease control
1. Introduction
The newly identified RNA betacoronavirus COVID-19 has rapidly spread to more than 200 countries, causing 4,370,306 infections and 294,217 deaths worldwide [1] and posing a great threat to public health. The severity of some COVID-19 cases mimics SARS-CoV [2] and the case fatality rate is high in some countries, but the early symptoms of COVID-19 (including cough, fever, and difficulty in breathing) are mild in many cases [3, 4, 5]. Although the medical diagnosis of COVID-19 requires a saliva-based viral nucleic acid test [6], the most common way of detecting potential infections in public areas is body temperature measurement. However, many factors may affect this measurement, e.g. the temperature of the environment. More importantly, body temperature measurement usually requires relatively close contact with potentially infected people, while professional protection is not easily available to security personnel, which raises the risk of spreading the virus. Currently there is no efficient method for diagnosing COVID-19 in a short time.

In this article, we propose a robot-based COVID-19 auxiliary diagnostic system for rapid diagnosis of users. It uses human-robot dialogue to avoid routine consultations with doctors, thereby preventing cross-infections caused by long stays in a hospital and easing the pressure caused by the shortage of medical resources [7]. Dialogues based on medical diagnosis are very valuable, but these assets are not fully utilized. For example, a cough in a conversation may be related to the user's current physical condition. To better diagnose COVID-19, we designed an algorithm for cough detection and classification for this system, which enables rapid diagnosis and reduces the workload of doctors.

A cough is the body's response when something irritates the throat or airway. The irritant stimulates nerves, and the brain responds by telling muscles in the chest and abdomen to push air out of the lungs, producing a cough. Studies have shown that the sound of a cough differs across different types of diseases [8, 9]. For instance, a common cold or flu commonly causes wet coughs, as there is inflammation and the body is pushing mucus out of the respiratory system, while upper respiratory infections such as COVID-19 often cause dry coughs. Recently, methods for diagnosing cough-related diseases based on acoustic signals have been developed [10, 11, 12, 13]. These methods extract multiple features, such as Mel Frequency Cepstrum Coefficients (MFCC), energy level, and dominant/maximum frequency. They target not only cough detection [14, 15] but also the classification of specific cough-related diseases such as pertussis [11], croup [13], and tuberculosis [12]. These studies have proved the effectiveness of acoustic signals in disease detection. However, they have two main drawbacks: using Artificial Neural Networks (ANN), Gaussian Mixture Models (GMM), or Support Vector Machines (SVM) as classifiers on very limited data may cause severe over-fitting; and those models require input of fixed length in the time domain, which neglects part of the input data and may discard information important for a specific symptom.

To solve these problems, we first designed a CNN-based network for the detection of cough events.
Due to the obvious differences in the duration of cough sounds, we use a multi-scale method in the cough detection network to better capture cough sounds of short duration and lower intensity. It can be pre-trained on large-scale datasets which contain more acoustic cough data. For screening people at high risk of COVID-19 infection, we also use a classifier based on attentional similarity [16]. It focuses on few-shot problems and can handle cough events of different durations. This fits well with the fact that COVID-19 patients produce few cough recordings and that the duration of a cough is not fixed. The data used in the experiments come from medically confirmed COVID-19 patients. Inspired by [17], the detection method and evaluation metric we propose can be jointly implemented as an auxiliary real-time system in a robot for remote screening of people with high infection risk. At the same time, we generate electronic cough cases for data storage and analysis. The result given by our evaluation method is not equal to a medical diagnosis, but it gives effective guidance on who shall be isolated and medically tested first, while avoiding any contact with potentially infected people.

2. System Architecture

The COVID-19 risk evaluation algorithm is embedded in the system as illustrated in Figure 1. First, the robot measures the user's body temperature using infrared imaging. The result of the body temperature measurement is stored in the electronic medical record together with the user's demographic information. Then, the robot simulates a regular doctor consultation by asking questions like
"Do you have any cold or fever symptoms in the last 14 days?" and "Have you been to high-risk public areas in the last 14 days?". The cough detection module records and supervises the whole encounter; as the user responds to the questions and shows a cough symptom, the cough event is immediately extracted from the corresponding frames and sent to the cough classification module. The cough classification module calculates the attentional similarity between the current cough and the coughs of various diseases and gives the final classification result. If no cough is detected throughout the consultation, the robot says "please cough naturally" at the end to help collect cough information.

At the end of the conversation, an electronic medical record is generated for the current user, containing body temperature, demographic information (gender, age), disease history (extracted from the transcribed text), a complete recording and transcription of the human-robot conversation, the user's cough audio, and the intelligent diagnosis results, together with an epidemic map. The map indicates the high-risk areas around the user's trace, the traces of confirmed cases nearby, and the information of the designated hospital. When the diagnosis result indicates COVID-19 positive, we send the result and the basic information to doctors for further confirmation and immediately inform the user to arrange a nucleic acid test and suggest self-isolation.
Figure 1: Structure of our system.
All electronic medical records are stored in the system and sent to doctors afterward. Doctors can use the exploration function of the system to compare different medical records. If a particular electronic medical record proves to be abnormal, the doctor can conduct a detailed review of all available information, such as the electronic medical record, the human-robot conversation, and the transcribed text.

It should be noted that we have set medical rules for this system in actual use. For example, if the body temperature reaches an abnormal value, the doctor is notified regardless of the result of the cough test, and the user is required to stay at home immediately. If the user does not cough during the whole process, the user is asked to cough for a cough diagnosis.
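To make the rule layer concrete, the following is a minimal Python sketch of the checks described above. The record fields, the fever threshold, and the action labels are illustrative assumptions, not the deployed configuration.

```python
# A minimal sketch of the post-consultation medical rules described
# above. Field names, the fever threshold, and the action labels are
# illustrative assumptions, not the deployed configuration.
from dataclasses import dataclass, field
from typing import List, Optional

FEVER_THRESHOLD_C = 37.3  # assumed cut-off for an "abnormal" temperature


@dataclass
class MedicalRecord:
    user_id: str
    temperature_c: float
    transcript: str
    cough_segments: List[bytes] = field(default_factory=list)
    diagnosis: Optional[str] = None  # filled in by the cough classifier


def apply_medical_rules(record: MedicalRecord) -> List[str]:
    """Return the follow-up actions triggered by a finished consultation."""
    actions = []
    if record.temperature_c >= FEVER_THRESHOLD_C:
        # An abnormal temperature notifies the doctor regardless of the
        # cough test result, and the user must stay at home.
        actions += ["notify_doctor", "advise_stay_home"]
    if not record.cough_segments:
        # No spontaneous cough: ask the user to cough for a diagnosis.
        actions.append("request_voluntary_cough")
    if record.diagnosis == "COVID-19":
        actions += ["notify_doctor", "arrange_nucleic_acid_test",
                    "advise_self_isolation"]
    return actions
```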
3. Dataset
The data is collected by volunteers interacting with robots placed in communities or hospitals. Due to privacy concerns, the dataset is only used for academic research.
Figure 2: Source of the dataset.
The dataset includes 1283 speech recording segments collected via human-robot conversations with 184 respondents aged from 6 to 80. Each conversation is partitioned into segments, and we only keep segments longer than 10 seconds. The recordings include 392 segments from 64 COVID-19 patients (36 male and 28 female). The control group contains 361 segments from 40 respondents (37 male and 2 female) with long smoking histories, 153 segments from 20 respondents (11 male and 9 female) with acute bronchitis, 109 segments from 20 respondents (7 male and 13 female) with chronic pharyngitis, 21 segments from 10 children with pertussis aged 6 to 12 recorded under the supervision of their parents, and 258 segments from 40 healthy people with no smoking habit and no confirmed disease. 21 of the COVID-19 patients have other chronic diseases (hypertension, diabetes, and heart-related diseases). Not all recording segments from the control group and the COVID-19 patients contain cough events. It should be noted that the dataset is extended every other week with the data of newly diagnosed patients who have used the robot, and the model is updated with the new dataset.

All of the COVID-19 patients are lab-confirmed by viral nucleic acid tests. As reported in [3], some infected individuals show no external symptoms and no chest X-ray / CT manifestations, so the viral nucleic acid test is currently the most authoritative method for COVID-19 diagnosis. The respondents with other diseases (acute bronchitis, chronic pharyngitis, and pertussis) were medically confirmed by doctors beforehand.
4. Methodology
The overall structure of our model is shown in Figure 1. The general procedure of our model is silence removal, feature extraction, cough detection, and cough classification. The details of each step are elaborated as follows.

4.1. Pre-processing
Before any processing, a sound detector can be used to remove silent segments to reduce the workload. This can be implemented by comparing the standard deviation of each frame to the mean standard deviation of the whole recording: by setting a threshold, the silent parts can be removed.

Prior to any detection or classification, all recordings are resampled to 16000 Hz, since all the required information is contained below 8000 Hz, which is half of the new sampling rate. The audio signals are then divided into frames of 320 ms with a 50% overlap between subsequent frames. Each frame is then converted into the frequency domain through a Fast Fourier Transform (FFT). Through this pre-processing procedure, several features can be obtained from each frame, including time-domain features, frequency-domain features and MFCC.
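The following Python sketch illustrates this pre-processing chain under the stated parameters (16 kHz resampling, 320 ms frames, 50% overlap, FFT). The silence threshold factor is an assumption, since the text only states that the per-frame standard deviation is compared against the recording mean.

```python
# A minimal sketch of the pre-processing chain under the stated
# parameters (resampling to 16 kHz, 320 ms frames, 50% overlap, FFT).
# The silence_factor value is an assumption; the paper only states that
# per-frame standard deviation is compared against the recording mean.
# Assumes the recording is at least one frame long.
import numpy as np
import librosa

TARGET_SR = 16000
FRAME_MS, OVERLAP = 320, 0.5


def preprocess(y: np.ndarray, sr: int, silence_factor: float = 0.5):
    y = librosa.resample(y, orig_sr=sr, target_sr=TARGET_SR)
    frame_len = int(TARGET_SR * FRAME_MS / 1000)   # 5120 samples per frame
    hop = int(frame_len * (1 - OVERLAP))           # 50% overlap
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    # Drop "silent" frames whose std falls below a fraction of the mean std.
    stds = frames.std(axis=1)
    frames = frames[stds >= silence_factor * stds.mean()]
    # Convert each retained frame to the frequency domain via an FFT.
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return frames, spectra
```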
4.2. Feature Extraction

As there are studies on the performance of different acoustic features for cough detection and further classification tasks [11], we utilize the top-performing ones for our model: MFCC [18], zero-crossing rate [19], crest factor, energy level, and the spectrogram. When extracting features from cough sounds, we use a frame size of N = 1024 samples, a Hamming window w(n), and a frame-to-frame overlap of 50%, followed by feature extraction to obtain the features listed above. The first four features are used as input for cough detection, and the spectrogram is used for the classification task.
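Below is one possible implementation of this feature extraction with librosa, using the stated 1024-sample Hamming window and 50% overlap. The number of MFCC coefficients (13) is an assumption not specified in the paper.

```python
# Sketch of the per-frame features listed above, using a 1024-sample
# Hamming window with 50% overlap. n_mfcc=13 is an assumption; the paper
# does not state the number of coefficients.
import numpy as np
import librosa

N, HOP = 1024, 512  # frame size and 50% overlap


def extract_features(y: np.ndarray, sr: int = 16000) -> dict:
    frames = librosa.util.frame(y, frame_length=N, hop_length=HOP)
    win = np.hamming(N)[:, None]
    rms = np.sqrt(np.mean((frames * win) ** 2, axis=0) + 1e-12)
    return {
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                     n_fft=N, hop_length=HOP,
                                     window="hamming"),
        "zcr": librosa.feature.zero_crossing_rate(y, frame_length=N,
                                                  hop_length=HOP),
        "crest_factor": np.abs(frames).max(axis=0) / rms,
        "energy": (frames ** 2).sum(axis=0),
        # Magnitude spectrogram, used by the classification network.
        "spectrogram": np.abs(librosa.stft(y, n_fft=N, hop_length=HOP,
                                           window="hamming")),
    }
```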
Figure 3: Examples of electronic medical records and epidemic maps.

4.3. Cough Detection

Pramono et al. [11] give a concise explanation of how a cough event is formed and analyze the feasibility of cough detection based on acoustic features. In our method, we found that for a given input speech recording, a cough event in the time domain can appear as a single pulse-like signal or as a series of continuous high-intensity signals. Considering this characteristic, we propose a novel convolutional neural network (CNN) based method for cough detection.

Inspired by the pyramid structure commonly used for multi-scale object detection, we implement a multi-layer CNN to detect coughs of different durations. Different layers in the CNN are designed to have different receptive fields, and at the end of the network we concatenate the outputs from each layer into a combined feature for the final classification. Specifically, the network is composed of multiple stacked blocks, each consisting of a 3×3 convolutional layer followed by batch normalization, a ReLU activation layer, and a pooling layer. In particular, we extract the output of each convolutional layer and stitch the shallow features together with the deep features to enhance their context. Note that shallow and deep features differ in size: due to the pooling layers, the feature map of each layer is twice the size of the next one, so deep features are upsampled by a factor of two per level before stitching. Finally, an SVM is used as the binary classifier; the initial feature space is mapped with a Gaussian kernel to maximize the linear separability between different sound events. If a given input segment is predicted to contain cough events, it is further pushed to the cough classification model.
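The sketch below shows one way to realize this multi-scale detection backbone in PyTorch. The channel width, the number of blocks, the nearest-neighbour upsampling, and the pooling factor of 2 (matching the factor-of-two size relation just described) are assumptions, and the RBF-kernel SVM stage is left to an external library.

```python
# A minimal sketch of the multi-scale detection backbone described above:
# stacked 3x3 conv blocks, with the output of every block upsampled back
# to the shallowest resolution and concatenated before the SVM. Channel
# width and block count are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleCoughNet(nn.Module):
    def __init__(self, in_ch: int = 1, width: int = 32, n_blocks: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList()
        ch = in_ch
        for _ in range(n_blocks):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(ch, width, kernel_size=3, padding=1),
                nn.BatchNorm2d(width),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),  # halves the feature map at each block
            ))
            ch = width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            # Upsample deeper (smaller) maps by 2**i back to the scale of
            # the first block's output so shallow and deep features align.
            feats.append(F.interpolate(x, scale_factor=2 ** i,
                                       mode="nearest"))
        fused = torch.cat(feats, dim=1)    # multi-scale combined feature
        return fused.flatten(start_dim=1)  # fed to an RBF-kernel SVM


# Usage: embeddings from this network are classified as cough / non-cough
# by e.g. sklearn.svm.SVC(kernel="rbf") trained on labelled frames.
```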
4.4. Cough Classification

After detecting cough events, we build a cough classifier aiming to identify COVID-19 coughs among events that reveal other respiratory diseases. A major difficulty in this task is that the number of cough events in our dataset that reveal COVID-19 is quite small and does not meet the requirements of a standard training task [20]. We therefore introduce the c-way k-shot few-shot learning algorithm proposed by Vinyals et al. [21]. This method solves the few-shot problem by iteratively building meta tasks on selected samples and can accommodate unseen classes with only a few samples. It is an iterative training strategy and fits our COVID-19 infection detection well, as the available data is very limited. During each training iteration, there is a training set D = {(X, Y) | X ∈ S, Y ∈ C_train}, where C_train is the set of classes selected for training and S is the support set. Each X ∈ {x_1, ..., x_{c×k}, x_q} is obtained from our feature learning model f_cnn(·); the support set consists of k randomly selected examples from each of c random classes, and x_q is a query example randomly chosen from the remaining samples of the c classes. This training strategy gives the model the capability to learn to compare commonalities and differences through the iterative training procedure [22], and thus provides the classifier with the ability to generalize to some extent.

Based on this few-shot training strategy, we implement the classification task with attentional similarity [16]. Different from other classifiers or similarity measures, it can take input features of different lengths and return a scalar similarity. Assume the two input features are X_i ∈ R^{D×T_i} and X_j ∈ R^{D×T_j}. Instead of using a pooling operation to compress the time dimension, attentional similarity weights the second-order (segment-by-segment) similarity with a learnt weight matrix:

f_att-sim(X_i, X_j) = Tr(X_j^T X_i W_ij)        (1)

where W_ij ∈ R^{T_i×T_j} is the weight matrix capturing the importance of segment-by-segment similarity; it can be approximated by the rank-1 factorization W_ij = A_i A_j^T with attention vectors A_i ∈ R^{T_i} and A_j ∈ R^{T_j}. The attentional similarity can thus be written as:

f_att-sim(X_i, X_j) = Tr(X_j^T X_i A_i A_j^T) = A_j^T X_j^T X_i A_i        (2)

where each attention vector A is computed by feeding the corresponding X through another stack of convolutional layers that learns to find the important segments.

Here the input X is the segment selected by the cough detection module, where the length of each X in the time dimension can vary across cough events. This method gives a quantitative evaluation of the risk of COVID-19 infection. The final classification is based on the similarity of a given X to the mean of each class in the support set.
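A sketch of Eq. (2) in PyTorch follows. The depth of the attention stack and its softmax normalization are assumptions (the exact design in [16] may differ); the essential property is that inputs of different lengths T_i and T_j still reduce to a scalar score.

```python
# A sketch of the attentional similarity of Eq. (2). The attention stack
# (two 1-D convolutions plus softmax) is an assumed design; what matters
# is that variable-length inputs reduce to a scalar similarity.
import torch
import torch.nn as nn


class AttentionalSimilarity(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Maps a (D, T) feature map to a length-T attention vector A.
        self.attend = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(dim, 1, kernel_size=1),
            nn.Softmax(dim=-1),
        )

    def forward(self, xi: torch.Tensor, xj: torch.Tensor) -> torch.Tensor:
        # xi: (D, T_i), xj: (D, T_j); T_i and T_j may differ.
        ai = self.attend(xi.unsqueeze(0)).reshape(-1)  # A_i, shape (T_i,)
        aj = self.attend(xj.unsqueeze(0)).reshape(-1)  # A_j, shape (T_j,)
        s = xj.t() @ xi        # (T_j, T_i) segment-by-segment similarity
        return aj @ s @ ai     # scalar A_j^T X_j^T X_i A_i, as in Eq. (2)
```

At inference, each detected cough is scored against the mean support feature of every class, and a softmax over these similarity scores yields the class probabilities used for the final decision.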
5. Experiment
The cough detection network and the classification network are trained separately. For the cough detection network, our backbone consists of a stack of blocks, each of which has a 3×3 convolutional layer followed by batch normalization, a ReLU activation layer and a 4×4 max-pooling layer. The output of each max-pooling layer is resized to a common resolution by upsampling, and then all the outputs are concatenated and sent to the SVM for cough detection; the related frames are marked correspondingly. The classification network has the same architecture as [16], with the spectrogram as its input feature. For optimization, we use stochastic gradient descent (SGD) with an initial learning rate of 0.01. The learning rate is divided by 10 every 20 epochs for annealing, and we set the maximal number of epochs to 60. Moreover, we set the weight decay to 1e-4 to avoid overfitting.
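The stated optimization schedule corresponds to the following PyTorch sketch; the momentum value and the placeholder model are assumptions, and the training-loop body is elided.

```python
# A minimal sketch of the stated optimization schedule: SGD at an initial
# learning rate of 0.01, divided by 10 every 20 epochs, 60 epochs total,
# weight decay 1e-4. Momentum and the placeholder model are assumptions.
import torch

model = torch.nn.Linear(128, 2)  # stand-in for the detection backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=20, gamma=0.1)

for epoch in range(60):
    # ... forward pass, loss, optimizer.step() over the training batches ...
    scheduler.step()  # divides the learning rate by 10 every 20 epochs
```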
Training:
As for the classification task, since the total amount of data is limited, we use cross-validation: the entire dataset is randomly separated into 10 segments, and we iteratively use one of them as the test set and the others as the training set for a 100-epoch training. The final statistics are based on the average over the 10 iterations. Using the clinical diagnosis as the reference standard, we then calculate performance measures such as sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). All these values are reported as percentages (%).
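A sketch of this evaluation protocol is shown below, assuming a scikit-learn-style classifier object; the metric formulas follow the standard confusion-matrix definitions.

```python
# Sketch of the 10-fold evaluation and the reported metrics. Sensitivity,
# specificity, PPV and NPV are derived from the confusion matrix; the
# classifier object is a placeholder, and X, y are numpy arrays.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix


def binary_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": 100 * tp / (tp + fn),  # true positive rate
        "specificity": 100 * tn / (tn + fp),  # true negative rate
        "PPV": 100 * tp / (tp + fp),          # positive predictive value
        "NPV": 100 * tn / (tn + fn),          # negative predictive value
    }


def cross_validate(clf, X, y, n_splits=10):
    folds = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True).split(X):
        clf.fit(X[train_idx], y[train_idx])
        folds.append(binary_metrics(y[test_idx], clf.predict(X[test_idx])))
    # The final statistics are averaged over the 10 iterations.
    return {k: np.mean([f[k] for f in folds]) for k in folds[0]}
```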
For the cough detection task, the results are shown in Table 1. The detection model we propose achieves the highest TPR, which means that our model can determine very accurately whether each frame of speech contains cough information. The low FPR also indicates that our model produces few misjudgments.
Table 1: Cough detection results of different models.
Group   True Positive Rate   False Positive Rate
ANN     94.27                5.5
SVM     95.20                5.73
GMM     81.87                0.32
Ours    98.8                 2.34

For the classification task, the results are shown in Table 2. Since only 4 types of diseases plus the healthy status are contained in the dataset, the evaluation criterion is Top-1 accuracy. It should be noted that during data collection, we made sure that each respondent has at most one cough-related disease; in other words, we deleted all data that did not meet this requirement. For the final classification, a cough event is assigned to the class with the highest probability given by the softmax transformation of the attentional similarities. Following this strategy, we also compared our classifier with other state-of-the-art methods [20, 21, 22, 23].
Table 2: Comparison of the classification results of cough events with different methods. (COVID-19 denotes the classification accuracy for the COVID-19 class only.)
Model         Depth   Param   Top-1   COVID-19
SVM           -       -       64      46
MLP           -       -       44      31
LSTM          4       1.8M    46      37
M&mnet [20]   4       2.5M    66      59
PN [22]       4       2.5M    66      58
SNN [23]      7       4.8M    67      60
MN [21]       8       6.1M    71      68
AttSim [16]   4       2.5M    79      76

Next, we conducted experiments on identifying individual diseases as several binary classification tasks using AttSim. The ratio of positive to negative samples for each disease is 1:1, where the positive samples are the data of the current disease and the negative samples are randomly sampled from the other diseases. The classification results are shown in Table 3. Pertussis and COVID-19 achieve high sensitivity, which indicates that our model can diagnose these two diseases very well. We notice that some chronic diseases, such as pharyngitis, are not diagnosed as well; this is because the symptoms of chronic diseases are not obvious and are easily confused with healthy coughs. Some acute conditions, such as bronchitis, are clearly distinguished because they are often accompanied by a large number of continuous coughs. It is worth noting that our model shows good performance in the diagnosis of COVID-19, which demonstrates that COVID-19 detection through cough is effective.
Table 3: Cough classification results on different groups.
Disease       Sensitivity   Specificity   PPV    NPV
bronchitis    94.3          91.2          93.4   89.6
pharyngitis   85.3          83.3          90.2   86.7
pertussis     99.8          95.8          96.3   90.2
healthy       96.3          92.1          93.7   95.4
COVID-19      98.7          94.7          94.5   91.7
6. Discussion and Conclusion
In this paper, we propose a robot-based COVID-19 infection risk evaluation system. The robot realizes a conventional interrogation function through voice interaction with the user. To handle possible coughing in daily conversations, we implement a CNN-based cough detector that detects cough events of variable length, and we further classify these coughs by computing the attentional similarity to the mean of each known disease. For COVID-19 in particular, this yields a Top-1 accuracy of 76%. At the same time, we structurally extract the important information in the human-robot dialogue and store it in the electronic medical record together with the cough recording. According to the location of the robot, we also generate a dedicated epidemic map to remind users to avoid high-risk areas. This system is of great value for virus control, as it requires no close contact with possibly infected people. Moreover, it can help doctors or government officers to allocate limited medical resources and strengthen isolation policies. Deployed in communities or hospitals, our system makes the whole procedure automatic and is thus of great medical value.

Future work will focus on designing an individual electronic pass based on the electronic medical records collected by our system. People with a safe passcode would have full access to public areas. With the help of our design, healthcare practitioners could trace people at high risk of COVID-19, potentially saving public resources.
7. Acknowledgements
This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB1003500, No. 2018YFB0204400 and No. 2017YFB1401202. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.
8. References

[1] E. Dong, H. Du, and L. Gardner, "An interactive web-based dashboard to track COVID-19 in real time," Lancet Infectious Diseases, 2020.
[2] D. Wang, B. Hu, C. Hu, F. Zhu, X. Liu, J. Zhang, B. Wang, H. Xiang, Z. Cheng, Y. Xiong et al., "Clinical characteristics of 138 hospitalized patients with 2019 novel coronavirus-infected pneumonia in Wuhan, China," JAMA, vol. 323, no. 11, pp. 1061–1069, 2020.
[3] W.-j. Guan, Z.-y. Ni, Y. Hu, W.-h. Liang, C.-q. Ou, J.-x. He, L. Liu, H. Shan, C.-l. Lei, D. S. Hui et al., "Clinical characteristics of 2019 novel coronavirus infection in China," medRxiv, 2020.
[4] J. T. Wu, K. Leung, and G. M. Leung, "Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study," The Lancet, 2020.
[5] C. Huang, Y. Wang, X. Li, L. Ren, J. Zhao, Y. Hu, L. Zhang, G. Fan, J. Xu, X. Gu et al., "Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China," The Lancet, 2020.
[6] N. Chen, M. Zhou, X. Dong, J. Qu, F. Gong, Y. Han, Y. Qiu, J. Wang, Y. Liu, Y. Wei et al., "Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study," The Lancet, 2020.
[7] V. M. Corman, O. Landt, M. Kaiser, R. Molenkamp, A. Meijer, D. K. Chu, T. Bleicker, S. Brünink, J. Schneider, M. L. Schmidt et al., "Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR," Eurosurveillance, vol. 25, no. 3, 2020.
[8] J. A. Smith, H. L. Ashurst, S. Jack, A. Woodcock, and J. E. Earis, "The description of cough sounds by healthcare professionals," Cough, vol. 2, no. 1, pp. 1–9, 2006.
[9] C. C. Grant, "Postinfectious cough and pertussis in primary care," The Lancet Respiratory Medicine, vol. 2, no. 1, pp. 2–3, 2014.
[10] S. J. Barry, A. D. Dane, A. H. Morice, and A. D. Walmsley, "The automatic recognition and counting of cough," Cough, vol. 2, no. 1, p. 8, 2006.
[11] R. X. A. Pramono, S. A. Imtiaz, and E. Rodriguez-Villegas, "A cough-based algorithm for automatic diagnosis of pertussis," PloS One, vol. 11, no. 9, 2016.
[12] G. Botha, G. Theron, R. Warren, M. Klopper, K. Dheda, P. Van Helden, and T. Niesler, "Detection of tuberculosis by automatic cough sound analysis," Physiological Measurement, vol. 39, no. 4, p. 045005, 2018.
[13] R. V. Sharan, U. R. Abeyratne, V. R. Swarnkar, and P. Porter, "Automatic croup diagnosis using cough sound recognition," IEEE Transactions on Biomedical Engineering, vol. 66, no. 2, pp. 485–495, 2018.
[14] B. H. Tracey, G. Comina, S. Larson, M. Bravard, J. W. López, and R. H. Gilman, "Cough detection algorithm for monitoring patient recovery from pulmonary tuberculosis," in Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2011, pp. 6017–6020.
[15] S. Matos, S. S. Birring, I. D. Pavord, and H. Evans, "Detection of cough signals in continuous audio recordings using hidden Markov models," IEEE Transactions on Biomedical Engineering, vol. 53, no. 6, pp. 1078–1083, 2006.
[16] S.-Y. Chou, K.-H. Cheng, J.-S. R. Jang, and Y.-H. Yang, "Learning to match transient sound events using attentional similarity for few-shot sound recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 26–30.
[17] J. Mizgajski, A. Szymczak, R. Głowski, P. Szymański, P. Żelasko, Ł. Augustyniak, M. Morzy, Y. Carmiel, J. Hodson, Ł. Wójciak, D. Smoczyk, A. Wróbel, B. Borowik, A. Artajew, M. Baran, C. Kwiatkowski, and M. Żyła-Hoppe, "Avaya Conversational Intelligence: A Real-Time System for Spoken Language Understanding in Human-Human Call Center Conversations," in Proc. Interspeech, 2019, pp. 3659–3660.
[18] C. Ittichaichareon, S. Suksri, and T. Yingthawornsuk, "Speech recognition using MFCC," in International Conference on Computer Graphics, Simulation and Modeling (ICGSM'2012), 2012, pp. 28–29.
[19] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2. IEEE, 1997, pp. 1331–1334.
[20] S.-Y. Chou, J.-S. R. Jang, and Y.-H. Yang, "Learning to recognize transient sound events using attentional supervision," in IJCAI, 2018, pp. 3336–3342.
[21] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., "Matching networks for one shot learning," in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
[22] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
[23] G. Koch, R. Zemel, and R. Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, 2015.