Exploring Semi-Supervised Learning for Predicting Listener Backchannels
Vidit Jain, Maitree Leekha, Rajiv Ratn Shah, Jainendra Shukla
Vidit Jain∗ — [email protected] — IIIT-Delhi, New Delhi, India
Maitree Leekha∗ — [email protected] — Delhi Technological University, New Delhi, India
Rajiv Ratn Shah†‡ — [email protected] — IIIT-Delhi, New Delhi, India
Jainendra Shukla† — [email protected] — IIIT-Delhi, New Delhi, India

ABSTRACT
Developing human-like conversational agents is a prime area in HCI research and subsumes many tasks. Predicting listener backchannels is one such actively-researched task. While many studies have used different approaches for backchannel prediction, they all have depended on manual annotations for a large dataset. This is a bottleneck impacting the scalability of development. To this end, we propose using semi-supervised techniques to automate the process of identifying backchannels, thereby easing the annotation process. To analyze our identification module's feasibility, we compared the backchannel prediction models trained on (a) manually-annotated and (b) semi-supervised labels. Quantitative analysis revealed that the proposed semi-supervised approach could attain 95% of the former's performance. Our user-study findings revealed that almost 60% of the participants found the backchannel responses predicted by the proposed model more natural. Finally, we also analyzed the impact of personality on the type of backchannel signals and validated our findings in the user study.
CCS CONCEPTS
• Human-centered computing → User studies; Empirical studies in HCI; • Computing methodologies → Semi-supervised learning settings.

KEYWORDS
Conversational Agents, Backchanneling, Multimodal analysis.

∗ The authors contributed equally, and wish to be regarded as joint first authors.
† Jainendra Shukla and Rajiv Ratn Shah are partly supported by the Infosys Center for AI and Center for Design and New Media at IIIT Delhi.
‡ Rajiv Ratn Shah is also partly supported by the ECRA Grant (ECR/2018/002776) by SERB, Government of India.
ACM Reference Format:
Vidit Jain, Maitree Leekha, Rajiv Ratn Shah, and Jainendra Shukla. 2021. Exploring Semi-Supervised Learning for Predicting Listener Backchannels. In CHI Conference on Human Factors in Computing Systems (CHI '21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3411764.3445449
1 INTRODUCTION
Human conversations, even the most casual ones, have a lot of complexity associated with them. Two people actively engaged in a conversation frequently respond to each other, not only with respect to the content of the conversation, but also to behavioral aspects, such as facial expressions and prosody. Developing Embodied Conversational Agents (ECAs) [9] and spoken dialogue systems capable of incorporating these complex elements to converse naturally is a challenging task and has been a constant focus of the Artificial Intelligence and Human-Computer Interaction research communities. Of these complex conversational constructs, dyadic components like listener backchannels are among the most crucial for modeling virtual humans and are also the main focus of this study. (A teaser video demonstrating our work can be found in the video figure section of the submission. It illustrates a virtual listener that emits backchannels to the speaker's context using the models proposed with this work.)

In a peer-to-peer conversation, a backchannel occurs when one of the participants is speaking, and the other (the listener) interjects a short response to the former [41]. These responses do not interrupt the flow of the conversation; rather, they convey the listener's state of mind about the speaker's dialog. They also reflect cooperation and understanding between the two parties [17]. Backchannels can be verbal, non-verbal (visual), or both. Vocalisations like 'hmm' or 'uh-huh', gestures such as head nods or head shakes, and combinations of verbal and non-verbal responses are common examples of backchannels.

In the past few years, the research community has shown a keen interest in modeling the listener's backchanneling behavior. A large number of such studies on backchannel prediction have focused on the use of rule-based classifiers. Ward [38], Truong et al. [36], and Ward and Tsukahara [39] utilized different acoustic features of the speaker, such as pitch and pausal information, to predict backchannel opportunities. A recent study by Park et al. [29] on backchannel prediction for children also used similar prosodic features and hand-crafted rules. Expanding the feature set, Moubayed et al. [1] used both visual and prosodic speaker features.

Transitioning towards data-driven automatic prediction from hand-crafted rules, Solorio et al. [34] used prosodic features with locally weighted linear regression to predict backchannel opportunities. Morency et al. [27] used multimodal (visual and acoustic) speaker features and a hidden Markov model to predict backchannels. More recently, many researchers have also used deep learning techniques for predicting backchannels. Ruede et al. [32] used a Long Short-Term Memory (LSTM) based model with acoustic features. Hara et al. [16] also used LSTMs to predict turn-taking, filler words, and backchannels in a multitask learning paradigm. In a recent study, Goswami et al. [15] used state-of-the-art machine learning and deep learning-based time-series classification techniques to model backchannel opportunities in children with multimodal features.
Inspired by prior work, we too explore the use of machine learning and deep learning based time-series models and include several similar multimodal features in our analysis.

All these studies, although well-founded, have the following limitations:
(1) Most of them have focused only on the backchannel opportunity prediction task, with very few involving the next step of the problem, i.e., predicting the type of backchanneling signals.
(2) All studies in the literature have relied on data with annotated backchannel instances during their modeling phase, where they develop a backchannel predictor using the speaker context. For instance, the datasets used by prior works, including the Switchboard Dialog Act Corpus (SwDA) [23], Iraqi Arabic [40], P2PStory [33], data collected by Morency et al. [27], and many others, had to be manually annotated for listener backchannels by multiple coders. This annotation process can be extremely time-consuming, depending on the amount of data present. Furthermore, this approach also does not scale well when trying to expand the scope of development by collecting more data. For instance, in many studies, including the ones cited above, the datasets used cover a relatively small population size. The size of the dataset used is primarily constrained by the amount of time it will take to annotate it, which further impacts the generalizability of the study. Similar challenges are faced when considering conversations from low-resource languages, which may be harder to annotate by virtue of their limited resources. This is one of the major issues we address in the present study, i.e., can we automate the process of identifying when and how the listener backchannels from the data, thereby easing the annotation process?

Our novel contributions include:
(1) Use of self-training based semi-supervision for labeling the instances in a dataset for the presence or absence of listener backchannels, and therefore, exploring computational techniques to guide the development of conversational agents. This identification model partly replaces the human annotator, and therefore uses the listener's multimodal features to detect his/her backchannel responses. In addition to identifying backchannel opportunities, this step also identifies the type of signals (verbal, visual, or both) associated with the backchannels. To analyze the feasibility of automating the labeling process via semi-supervision, we compare the backchannel prediction models trained on the labels assigned by the semi-supervised identification process with the models trained on the ground-truth labels (from the annotators).
(2) Inspired by Bevacqua et al.'s work [4] on personality-contingent listener backchannels, we statistically analyzed how people with varying personalities emit different types of backchannel responses. In particular, we study the impact of the extraversion trait of a subject on their preference of modality [5] for their backchannel response.
(3) Finally, unlike most prior works, in addition to predicting the backchannel opportunities, we also predict the signal for the listener agent. The signal prediction task itself is far more challenging than the opportunity prediction task, as the signals may vary significantly from person to person. We approach this task by first predicting the type of signal to emit (visual, verbal, or both).
Then, based on our findings in (2) and depending on what personality we want our virtual listener to embody, we select the exact signal combination for it to express.

With this work, unlike most of the past studies that have focused on English datasets, we are also amongst the first to use peer-to-peer conversations in Hindi, which is a low-resource language, for listener backchannels (this does not entail, however, that the techniques we propose cannot be used for datasets from other languages). In particular, we use the conversational dataset collected by Khan et al. [24], and annotate it for analyzing backchannels. Our quantitative and subjective evaluations reveal that:
(i) By leveraging semi-supervision for the identification of listener backchannels, we were able to detect the presence of backchannels ∼90% of the time, and the type of signals associated ∼85% of the time, with only a small subset (25%) of labeled data.
(ii) Comparing the prediction models trained on the labels generated by the identification models with those trained on manually-annotated labels: the former setting is able to reach ∼93% of the latter's performance in the case of opportunity prediction, and ∼96% for signal category prediction. Note that the cost is significantly lower for the former, as it needs only a small amount of labeled data, thereby substantially reducing the effort required in annotating the data.
(iii) Subjective analysis in the form of a user study with twenty-seven participants supported our quantitative observations. Approximately 75% of the subjects found the backchannels produced by our proposed model more or equally natural compared to the responses by the model trained on annotated labels. The participants also confirmed our observations of personality impacting the preference of modality for backchannel responses.
The rest of this paper is organized as follows: Section 2 discusses the dataset used in this work, the annotation process, and the initial data analysis. In Section 3, we formally describe the problem statements for the identification and prediction tasks and discuss how we model them. Additionally, this section also briefly discusses the features used for modeling these tasks. Section 4 begins by detailing the complete experimental setup and follows with a quantitative evaluation of all the tasks. The section also discusses the observations of a user study, performed to analyze the efficacy of our models in real time. Section 5 relates our findings to prior work and discusses limitations and ethical considerations. Finally, we conclude the paper with Section 6, which discusses the future scope of this work.
2 DATASET
In this work, we use Vyaktitv, a peer-to-peer Hindi conversations dataset curated by Khan et al. [24]. The dataset provides audio and video recordings of participants involved in a dyadic conversation. There are a total of 25 conversations (50 individual recordings), each lasting 16 minutes and 6 seconds on average. A total of 38 subjects (24 male, 14 female) were a part of the dataset. It also provides the Big Five personality traits [10] for all the subjects. Note that, to the best of our knowledge, no work has used Hindi conversations so far for analysing listener backchannels.
2.1 Annotation
The audio and visual feeds for the individual speakers were annotated for verbal and visual activity based backchannels. Specifically, three annotators used the ELAN annotation software [8] with a custom tier template to mark the onset and offset time for different backchannel signals, including nod, head-shake, mouth, eyebrow, and short-utterance (Table 1). The overall agreement amongst the annotators with respect to the presence of backchannels (i.e., if any backchannel activity was present in a particular time range) was near perfect, with a Fleiss' κ of 0.86. For the individual signals, substantial agreement was observed for nod, head-shake, mouth, eyebrow (0.45), and short-utterance. The value for the onset and offset time was taken as the average of the timestamps by the different annotators.

Figure 1: Sample depicting the consensus strategy adopted for combining the annotations from different coders.
BC Signal | Labels (Default) | N | Mean Freq. | Mean Dur. (s)
Nod | none, nod | 2037 | ~42 | –
Head-shake | none, head-shake | 207 | ~4 | –
Mouth | neutral, smile/laugh, frown | 227 | ~4 | –
Eyebrow | neutral, raise, frown | 27 | <1 | –
Short-utterance | none, short-utterance (e.g., "ohh", "okay") | 1161 | ~24 | –

Table 1: Backchannel signals and their descriptive statistics (after taking consensus). N represents the total number of a particular feedback signal observed across the complete dataset. Mean Freq. is the average number of times a participant emits a particular signal during a conversation. Mean Dur. is the average duration of the signals in seconds (s).
2.2 Descriptive Analysis
A total of 2781 backchannel instances were observed across all the participants in the dataset. In Table 1, we present a descriptive analysis for these instances in terms of the type of feedback signals. In particular, we record the total number (N), average frequency per participant (Mean Freq.), and the average duration (Mean Dur.) for each signal type. Note that a backchannel instance could have multiple feedback signals (e.g., nod and smile), and therefore, has been included in each of the possible signal categories. Observations from the analysis reflect how all the participants frequently used nods and short utterances. On the other hand, backchanneling via head-shakes, mouth, and eyebrow movements was not common. In particular, instances of backchanneling through eyebrow movements were scarce in the dataset (only 27), and therefore, we refrained from considering them in our predictive analysis. Furthermore, the average duration for all these backchanneling signals was around one second.

In Figure 2 (i), we aim to dig deeper by assessing the signals that co-occur frequently. In this study, we refer to a backchannel instance as multimodal if it has multiple associated signals; else, it is a unimodal instance. Note how unimodal nods and utterances were amongst the top frequent signals emitted.
Figure 2: (i) Frequency of different combinations of backchannel signals emitted together by the subjects (across all subjects). (ii) Probability density distribution plot depicting the relation between extraversion and the ratio of multimodal to unimodal backchannel instances (τ) emitted by the subjects.

Furthermore, most of the frequent multimodal instances included at least either a nod or an utterance. Multimodal signals where more than three types of signals co-occurred were infrequent.

Finally, we utilize the Big Five traits provided in the dataset to analyze whether personality traits influenced the type of signals (multimodal or unimodal) emitted by the subjects. For our analysis, we defined the variable τ as the ratio of the number of multimodal to unimodal backchannel instances emitted by a subject. Then, we used a two-sample Kolmogorov-Smirnov (K-S) test, with the null hypothesis that the distribution of τ for the two categories of participants (based on a particular personality trait) was similar. The test presented an interesting insight in terms of the subjects' preference for multimodal and unimodal feedback. Of the five traits, the tests suggested a significant difference in the distribution of τ when considering the extraversion trait. The probability density distribution in Figure 2 (ii) shows how subjects with low extraversion scores (introverts) tended to have a lower value for τ, indicating their preference for unimodal signals, while extroverts used multimodal feedback more often. Quantitatively, the average probabilities of an extrovert emitting multimodal or unimodal backchannels were 0.51 and 0.49, respectively. The same values for an introvert were 0.35 and 0.65, respectively. This observation has been utilized later in this study while deploying the backchannel prediction models to a virtual agent, thereby lending it a 'personality' based on which it can choose to emit multimodal or unimodal feedback in a probabilistic fashion (Section 4.3).
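For concreteness, the following is a minimal sketch of this analysis in Python, assuming per-subject backchannel counts and extraversion scores are already available; the example values and the median split into introverts/extroverts are illustrative assumptions, not the paper's exact procedure.

```python
from scipy.stats import ks_2samp


def tau(n_multimodal: int, n_unimodal: int) -> float:
    """Ratio of multimodal to unimodal backchannel instances for a subject."""
    return n_multimodal / max(n_unimodal, 1)

# (multimodal count, unimodal count, extraversion score) -- hypothetical values
subjects = [(12, 30, 2.1), (25, 24, 4.3), (8, 40, 1.8), (30, 28, 4.0)]

# Split subjects into two groups by extraversion (median split assumed here).
median_ext = sorted(s[2] for s in subjects)[len(subjects) // 2]
tau_intro = [tau(m, u) for m, u, e in subjects if e < median_ext]
tau_extra = [tau(m, u) for m, u, e in subjects if e >= median_ext]

# Null hypothesis: the two tau distributions are the same.
stat, p_value = ks_2samp(tau_intro, tau_extra)
print(f"K-S statistic={stat:.3f}, p={p_value:.3f}")
```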
2.3 Negative Samples
Negative samples are the instances from the conversations where the listener did not emit any backchannel. Specifically, we evaluated the following two conditions while extracting such instances: (1) the listener is not speaking, and (2) the listener is not backchanneling in that time frame. We used the Audacity tool to extract the voice activity of the listener. Combining the voice activity and the annotations, we extracted regions from the conversations that met the above two conditions. We sampled disjoint instances from these regions, where the length of each instance (in seconds) was taken as a random floating-point number in the range from 1.06 to roughly 5 seconds.
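A minimal sketch of this sampling step, assuming the eligible (non-speaking, non-backchanneling) regions have already been computed by intersecting the voice-activity output with the annotations; the exact upper bound of the length range was not fully recoverable above, so 5.0 seconds is used as a placeholder.

```python
import random


def sample_negative_instances(regions, min_len=1.06, max_len=5.0):
    """Sample disjoint no-backchannel instances from candidate regions.

    `regions` is a list of (start, end) times in seconds where the listener
    is neither speaking nor backchanneling.
    """
    instances = []
    for start, end in regions:
        t = start
        while t < end:
            length = random.uniform(min_len, max_len)
            if t + length > end:
                break
            instances.append((t, t + length))
            t += length  # advancing by the sampled length keeps instances disjoint
    return instances


# Example: one 20-second eligible region.
print(sample_negative_instances([(100.0, 120.0)]))
```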
3 METHODOLOGY
This section elaborates on the two-step methodology followed in the present study. We begin with the backchannel opportunity and signal identification module, which utilizes the listener's features to automatically classify instances into different categories based on the presence or absence of backchannel activity. This step employs a semi-supervised training paradigm and aims to simplify the manual annotation task. Next, we discuss the prediction module, which uses the speaker's contextual features to predict these backchannels. The training step of this latter module makes use of the labels generated by the former semi-supervised step. Figure 3 summarises this workflow.
3.1 Backchannel Opportunity and Signal Identification
Task Formulation. We begin this module by first formally defining the task of identifying backchannel opportunities and signals. Consider a time frame $T_{ij}$, which starts at the $i$-th second and ends at the $j$-th, and let $L_{ij}$ represent the listener's visual and acoustic features in that time frame. Then, our aim for the backchannel opportunity identification task is to learn a function $\mathcal{F}_{bc}$ mapping a time series in the listener's feature space to the corresponding backchannel opportunity label $BO_{ij}$ (a binary label signifying the presence or absence of backchannels in the time $T_{ij}$), i.e.,

$$\mathcal{F}_{bc}(L_{ij}) \mapsto BO_{ij} \quad (1)$$

We model (identify and predict) the backchannel signals differently from the literature. Instead of identifying the different signals (like nod, head-shake, etc.) individually, we categorize them into two types: visual (nod, mouth, head-shake) and verbal (utterances) backchannels. This is done because the task of predicting the exact backchannel signal that a listener must emit based on the speaker's context (i.e., the signal prediction task) is challenging, primarily because of the subjective nature of these signals. For instance, subject A may emit a smile in response to a particular speaker context, whereas another subject B may emit a nod in response to the same. Grouping signals together into visual and verbal categories simplifies the tasks at hand.

A backchannel instance could be associated with visual, verbal, or both kinds of signals. Therefore, the signal identification task aims at finding the type of signal the listener emits whenever s/he backchannels. The goal is to learn a mapping function $\mathcal{F}_{sig}$ from the listener's feature space to one of the three signal categories (verbal, visual, both) ($BS_{ij}$), i.e.,

$$\mathcal{F}_{sig}(L_{ij}) \mapsto BS_{ij} \quad (2)$$

Figure 3: Methodology: (i) Semi-supervised learning for identifying backchannels and the type of signals emitted, using a subset of labeled data. (ii) Learning to predict these instances and signals using the speaker's context.

It is important to note that both identification tasks make use of only the listener's features. This is a crucial distinction from the prediction module. The identification module identifies backchannel opportunities and signals much like a human annotator, by paying attention to the listener. On the other hand, as expected, the prediction module uses only the speaker's contextual features to predict these backchannels.

Modelling: Semi-Supervised Learning.
Annotating large datasets is a tedious task, and researchers have long been exploring ways to ease and automate this manual process [2, 6, 19, 31]. Even in the context of the present study, the annotation process took around 90 hours, where the three annotators viewed and labeled all the conversations. This indeed is a bottleneck! To the best of our knowledge, this challenge in developing ECAs, specifically for modeling the backchannel behavior of an active listener, is an open research gap that has not been investigated by prior literature. The identification module of our workflow is a step towards tackling this challenge.

Several AI techniques based on Semi-Supervision [14, 21] and Weak-Supervision [11, 26] have been utilized to decrease annotation costs in different application domains. Here we explore a self-training based semi-supervised learning paradigm [37] for identifying listener backchannel instances and the signals associated with them, using only a small subset of the manually annotated data. The following steps summarise the general approach to self-training based semi-supervision adopted here (Figure 3):
(1) We start with the labeled portion of the dataset (L) to train an initial classifier (C) that learns to identify the backchannel instances and the associated signals based on the listener's features.
(2) C is then used to predict the labels for the unlabeled data (U).
(3) Of these predictions, the instances which meet a specific selection criterion are removed from U and added, along with their predicted pseudo labels, to the training set. This updated training set, comprising the initially labeled (L) and the newly added pseudo-labeled instances, is used to train the classifier C again.
(4) The cycle continues until no new instance matches the selection criterion.

In the present study, we have the labels for the complete dataset (D). Therefore, in our context, L and U are disjoint subsets of D. Furthermore, as the selection criterion for Step 3, we use a high threshold value of 0.90 on the predicted class probability. Note that after Step 4, we use the trained classifier C to predict the pseudo labels for all the remaining instances from U. Our experiments revealed that such instances were very few in number.
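The loop below is a minimal scikit-learn sketch of these four steps, with a Random Forest standing in for the classifier C (Section 3.4 experiments with several choices); the 0.90 confidence threshold matches the selection criterion above, while the feature matrices are assumed to be the aggregated listener features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.90):
    """Self-training loop following steps (1)-(4) above."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    X_l, y_l, X_u = X_labeled, y_labeled, X_unlabeled
    while len(X_u) > 0:
        clf.fit(X_l, y_l)                        # Step 1 / re-training in Step 3
        proba = clf.predict_proba(X_u)           # Step 2: predict unlabeled data
        confident = proba.max(axis=1) >= threshold  # Step 3: selection criterion
        if not confident.any():                  # Step 4: stop when nothing qualifies
            break
        pseudo = proba[confident].argmax(axis=1)
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, clf.classes_[pseudo]])
        X_u = X_u[~confident]
    # Any remaining low-confidence instances still get pseudo labels at the end.
    leftover = clf.predict(X_u) if len(X_u) else np.array([])
    return clf, leftover
```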
3.2 Backchannel Opportunity and Signal Prediction
The formal definition for the backchannel opportunity prediction task is similar to the one proposed in prior literature [15]. Consider a time window $T_{ij}$, and let $S_{ij}$ be the speaker's visual and acoustic features for that period. The backchannel opportunity prediction task entails predicting whether the listener will backchannel after $T_{ij}$ (i.e., the label $BO_{ij+}$) using only the speaker's features (context). Similarly, the signal prediction task aims to predict the type of feedback signal (visual, verbal, or both) (i.e., $BS_{ij+}$) that will be emitted by the listener, using the speaker's context. The following function mappings represent these tasks:

$$\mathcal{G}_{bc}(S_{ij}) \mapsto BO_{ij+} \quad (3)$$
$$\mathcal{G}_{sig}(S_{ij}) \mapsto BS_{ij+} \quad (4)$$

Inspired by the literature [15, 28], we extract a 3-second context window before each instance in our dataset and use the speaker's visual and acoustic features for that time frame to predict listener backchannels.

The prediction tasks are performed in a supervised fashion, with the labels derived from the semi-supervision based identification module. Since the identification module used the listener's channel (features), the prediction module is essentially learning based on cross-channel semi-supervised labels (this notion becomes more apparent as we discuss the experimental setup in Section 3.4).

3.3 Features
This section elucidates all the features extracted from the dataset for modeling the identification and prediction tasks. Table 2 summarises these visual and prosodic features. In particular, inspired by prior work [15, 22], we use OpenFace [3] to extract 18 facial action units (FAUs), velocity and acceleration of gaze, translational and rotational head velocities and accelerations, blink rate, pupil location, and smile ratio. Additionally, we also find the gaze state as a categorical variable, taking up three values: left, right, or blinking (the following GitHub repository was used for the same: https://github.com/antoinelame/GazeTracking). These features from the listener's channel are used for the backchannel opportunity and signal identification tasks, and the prediction tasks utilize the corresponding features from the speaker's channel.

As prosodic features, in addition to the voice activity (discussed in Section 2.3), we also extract the fundamental frequency (F0), the energy, and the first 13 Mel-Frequency Cepstral Coefficients (MFCC) using the pyAudioAnalysis library [13]. All of them are used for the identification as well as the prediction tasks.
Features | Description

Visual Features
FAUs | Regression values of 18 Facial Action Units: AU01_r, AU02_r, AU04_r, AU05_r, AU06_r, AU07_r, AU09_r, AU10_r, AU12_r, AU14_r, AU15_r, AU17_r, AU20_r, AU23_r, AU25_r, AU26_r, AU28_r, AU45_r
gaze_vel, gaze_acc | Velocity and acceleration of eye gaze
gaze_state | Categorical feature signifying the direction of gaze as left, right, or blinking
head_vel_T, head_acc_T | Translational velocity and acceleration of the head
head_vel_R, head_acc_R | Rotational velocity and acceleration of the head
blink_rate | First-order differential of the Eye Aspect Ratio
pupil | Location of the pupils
smile_ratio | Stretch of the smile, calculated as the ratio of two characteristic dimensions of the mouth [1]

Prosodic Features
F0 | The fundamental frequency of the speech signal
energy | The sum of squares of the signal values, normalized by the respective frame length
mfcc | Mel-Frequency Cepstral Coefficients 1–13
voice_activity | Binary state characterizing whether the person is speaking or not, based on acoustic signals

Table 2: Visual and vocal prosodic features used in the study.
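As an illustration of the prosodic pipeline, the sketch below extracts energy and MFCCs with pyAudioAnalysis and forms the mean/standard-deviation aggregates used by the non-ResNet models; the 50 ms window and 25 ms step are assumptions (the paper does not report its frame sizes), and F0 would come from a separate pitch tracker, as pyAudioAnalysis does not provide it.

```python
import numpy as np
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

# Hypothetical per-channel audio file for one participant.
fs, signal = audioBasicIO.read_audio_file("listener_channel.wav")

# Frame-wise short-term features; rows are features, columns are frames.
feats, names = ShortTermFeatures.feature_extraction(
    signal, fs, int(0.050 * fs), int(0.025 * fs))

energy = feats[names.index("energy")]                            # frame energies
mfcc_rows = [i for i, n in enumerate(names) if n.startswith("mfcc")][:13]
mfcc = feats[mfcc_rows]                                          # MFCC 1-13

# Mean and standard deviation aggregates over the time window.
aggregates = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                             [energy.mean(), energy.std()]])
```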
Table 2: Visual and vocal prosodic features used in the study. a part of the scikit-learn python library [30]) for semi-supervisedlearning as a baseline. All but for the ResNet model were trainedusing the mean and standard deviation aggregates of the time se-ries based listener features. ResNet was trained using the detailedtime series features for the complete time window. Finally, for theproportion ( 𝑥 ) of the dataset D taken as the initial ‘labeled’ data( L ) for training C , we experimented with all values in the range ( , ) with a step size of 5%. Although semi-supervision isapplicable only when the amount of unlabelled data exceeds thelabeled, experimenting with the whole range helps in the analyzingthe models’ sensitivity, i.e. , how the performance changes when weapproach the fully supervised setting (by increasing 𝑥 ).Manifold evaluation for the identification tasks can be easilyunderstood from Figure 4. First, we randomly create 5 folds fromthe data ( D ). Four of these folds are used for training C via semi-supervision; i.e., a random 𝑥 % sample of the data from these fourfolds serve as the initial ‘labeled’ set ( L ), while the rest is termed as‘unlabelled’ ( U ). Once the model has been trained, it is evaluatedon the 5th fold, i.e. , the pseudo labels produced by the identificationmodels are compared against the ground truth labels. To ensure thatour models do not over-fit to a particular random split, we run 10simulations of training and evaluation for each pair of values of C and 𝑥 , and report the average results across all the simulations . Asevaluation metrics, we use the weighted average precision, recall,F1-score, and overall accuracy, for both the opportunity and signalidentification tasks. The results for each simulation were taken as the average of all folds. xploring Semi-Supervised Learningfor Predicting Listener Backchannels CHI ’21, May 8–13, 2021, Yokohama, Japan Figure 4: 5-Fold evaluation of semi-supervised models forbackchannel opportunity and signal identification.Figure 5: Multimodal RNN Fusion based architecture forbackchannel opportunity prediction model.
The signal prediction task was more complex, as its labels also suffered from a serious class imbalance across the visual, verbal, and both categories (the 'both' category had 835 instances), which was adversely impacting the predictive performance of the models (though the same data was also used for the signal identification task, the imbalance did not cause a degradation in performance there). Therefore, we use the mean and standard deviation aggregates of the time-series based speaker-context features, along with SVM-SMOTE [7], to handle this imbalance. SMOTE takes data of the shape (n_samples, n_features) as input, and therefore the time-series data had to be aggregated; this also eliminated the utility of models like ResNet for this task. Only the training set was upsampled, i.e., the test set remained as is. Note that signal prediction was made using only the feature set (unimodal/multimodal) that performed the best for the opportunity prediction task.
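A short sketch of this upsampling step with the imbalanced-learn implementation of SVM-SMOTE; X_train, y_train, and X_test are assumed to hold the aggregated speaker-context features, and only the training split is resampled, as noted above.

```python
from imblearn.over_sampling import SVMSMOTE
from sklearn.ensemble import AdaBoostClassifier

# Oversample minority signal categories in the training split only.
X_res, y_res = SVMSMOTE(random_state=0).fit_resample(X_train, y_train)

# ADA was the best-performing signal prediction model by F1-score (Table 6).
ada = AdaBoostClassifier(random_state=0).fit(X_res, y_res)
preds = ada.predict(X_test)   # the test set remains un-resampled
```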
Using the upsampled data, we tried several models for signal prediction, including Random Forests (RF), a Support Vector Machine Classifier (SVC), AdaBoost (ADA), K-Nearest Neighbors (KNN), and a Multi-Layered Perceptron model (MLP).

The next set of experiments analyzes the feasibility and the performance-to-cost trade-off when adopting semi-supervision to label the dataset. For this, we compare the prediction tasks in two paradigms:
(1) A completely supervised setting, using the true labels provided by the annotators during the training and evaluation phases. Note that the previous evaluations of the prediction tasks already produced the results for this setting.
(2) The proposed setting, using labels generated by the semi-supervised identification modules for training, while evaluating on the true labels. For this, we use only the best feature set (unimodal/multimodal) found via the previous evaluations, along with the pseudo labels produced by the best-found pair of C and x from the identification tasks.
With these two paradigms, we assessed how far semi-supervised backchannel identification gets in comparison to human annotations for the downstream prediction tasks.

For evaluating the backchannel prediction models, several recent works have used the leave-one-subject-out approach [22]. In the present study, we use a slightly modified version of this technique to evaluate our prediction models in the different paradigms. Instead of having the data from each subject comprise a test fold, we divide the subjects into 6 groups, and the data from each of these groups forms a test fold, i.e., leave-one-group-out. This modification was primarily done because some subjects had very few backchannel instances in their conversations. Combining instances from multiple subjects to form a test set helped in better analyzing the model predictions and in keeping the number of folds tractable. As metrics for the backchannel opportunity prediction task, we report the positive class' precision, recall, and F1-score, as well as the overall accuracy. For signal prediction, we report the weighted average precision, recall, and F1-score, the overall accuracy, and the confusion matrices. (Please refer to the supplementary material for details on the hyper-parameters used for all the different models in the identification and prediction tasks.)

4 RESULTS
4.1 Backchannel Identification
In Figure 6, we present the sensitivity analysis, in terms of accuracy, of all the models we tried for the backchannel opportunity and signal identification tasks. Each curve in these plots represents how the performance of a particular classifier (C) for the identification task changes as we increase the amount of initial labeled data available (x) for semi-supervision. Using this information, we looked for the best possible pair of values for C and x. However, in addition to achieving high performance, we also wanted to limit x. For this, we manually found the elbow point for the best performing classifier in each task. The elbow value for x is one where decreasing it further would cause a drastic drop in performance, while increasing it would not change the performance significantly. Ideally, we also wanted x to be less than 50%, so that we could have more unlabelled than labeled data for semi-supervision, thereby improving the annotation cost trade-off.
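The curves in Figure 6 can be produced with a sweep of the following shape, reusing the self_train sketch from Section 3.1; the 5 folds, 5% step size, and 10 simulations follow the setup above, while data loading and the classifier choice are assumed.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score


def sensitivity_sweep(X, y, n_sim=10):
    """Mean/std accuracy for each initial labeled fraction x (in 5% steps)."""
    results = {}
    for x_pct in range(5, 100, 5):
        scores = []
        for sim in range(n_sim):
            rng = np.random.default_rng(sim)
            for train_idx, test_idx in KFold(5, shuffle=True,
                                             random_state=sim).split(X):
                # A random x% of the training folds acts as the seed set L.
                n_seed = int(len(train_idx) * x_pct / 100)
                seed = rng.choice(train_idx, size=n_seed, replace=False)
                rest = np.setdiff1d(train_idx, seed)
                clf, _ = self_train(X[seed], y[seed], X[rest])
                scores.append(accuracy_score(y[test_idx],
                                             clf.predict(X[test_idx])))
        results[x_pct] = (np.mean(scores), np.std(scores))
    return results
```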
Figure 6: Sensitivity analysis using accuracy as the metric for the semi-supervised backchannel opportunity and signal identification models. The shaded portion represents the standard deviation.
Model | Opportunity Identification: P / R / F1 / Acc | Signal Identification: P / R / F1 / Acc
LSpread | 0.66 / 0.66 / 0.66 / 0.66 | 0.67 / 0.65 / 0.66 / –
SVC | 0.56 / 0.52 / 0.56 / 0.54 | 0.69 / 0.70 / 0.68 / –
KNN | 0.63 / 0.63 / 0.63 / 0.63 | 0.63 / 0.67 / 0.62 / –
ADA | 0.74 / 0.74 / 0.74 / 0.74 | 0.75 / 0.75 / 0.69 / –
RF | 0.76 / 0.76 / 0.76 / – | – / – / – / –
ResNet | 0.81 / 0.84 / 0.83 / – | – / – / – / –

Table 3: Backchannel opportunity and signal identification: detailed results for x = 25%.

Figure 6 (i) shows how ResNet outperforms all the other models in terms of the listener backchannel opportunity identification task. Observe how increasing x beyond 25% does not change the performance a lot. It also meets all the other criteria we discussed above. Therefore, we choose x = 25% as the final seed value (used to initialize the amount of labeled data) for the opportunity identification task. For the signal identification task (Figure 6 (ii)), we observe that ResNet and Random Forests both have somewhat overlapping accuracy for different values of x; for values of x below 25%, however, Random Forests holds up better, and we therefore select it, again with x = 25%, for the signal identification task. The detailed results with x set to 25% for both the opportunity and signal identification tasks are shown in Table 3.

4.2 Backchannel Prediction
We begin discussing the prediction models with Table 4, which records the results of our backchannel opportunity prediction task performed using the annotated labels for training, while experimenting with different subsets of input features. The unimodal-audio model, with a positive-class F1 of 0.74 and an overall accuracy of 0.70, outperforms the unimodal-video model, with corresponding metric values of 0.71 and 0.66, respectively. The multimodal model utilizing both the video and audio features beats the audio model by a small margin, attaining an F1 of 0.75 and 0.72 accuracy.
Feature set | Precision | Recall | F1-score | Accuracy
Video | 0.61 | 0.86 | 0.71 | 0.66
Audio | 0.64 | 0.90 | 0.74 | 0.70
Video + Audio | 0.66 | 0.89 | 0.75 | 0.72

Table 4: Backchannel opportunity prediction model trained across different sets of features, using supervised (manually-annotated) labels for training. Overall, the multimodal feature set performed the best.
Labels used | Precision | Recall | F1-score | Accuracy
Supervised | 0.66 | 0.89 | 0.75 | 0.72
Semi-Supervised | 0.62 | 0.82 | 0.70 | 0.66

Table 5: Backchannel opportunity prediction: comparison of the model trained using supervised manually-annotated labels with the one using labels generated by the identification module.
Model | Supervised Labels: P / R / F1 / Acc | Semi-Supervised Labels: P / R / F1 / Acc
SVC | 0.80 / 0.80 / 0.80 / 0.80 | 0.75 / 0.75 / 0.75 / –
KNN | 0.74 / 0.74 / 0.73 / 0.74 | 0.72 / 0.72 / 0.71 / –
ADA | 0.81 / – / 0.81 / – | 0.78 / – / 0.78 / –
RF | 0.78 / 0.78 / 0.78 / 0.78 | 0.76 / 0.75 / 0.75 / –
MLP | 0.80 / 0.81 / – / 0.87 | 0.75 / 0.74 / – / –

Table 6: Backchannel signal prediction: comparison of models trained using supervised (manually-annotated) labels and those using labels generated by the semi-supervised signal identification Random Forest model (with 25% initial labeled data).
(i) True \ Predicted | C1 | C2 | C3
C1 | 0.71 | 0.07 | 0.22
C2 | 0.07 | 0.87 | 0.06
C3 | 0.27 | 0.05 | 0.68

(ii) True \ Predicted | C1 | C2 | C3
C1 | 0.70 | 0.09 | 0.21
C2 | 0.08 | 0.80 | 0.12
C3 | 0.23 | 0.06 | 0.71

Table 7: Confusion matrices for the best backchannel signal prediction model (i) trained on manually-annotated labels, and (ii) trained using labels generated by the signal identification models. Here, C1, C2, and C3 refer to the 'visual', 'verbal', and 'both' class labels, respectively.

Since this feature set has overall the best performance for the baseline opportunity prediction task with supervised labels, we use the same to train and analyze models for the subsequent tasks as well.

Continuing with opportunity prediction, we now discuss the results obtained using the semi-supervised labels generated by the opportunity identification model for training (the evaluation/test set used the same supervised labels). Table 5 records these results. Evidently, with the labels generated using just 25% of the annotated data (in the identification step), we are able to achieve nearly 93% of the supervised F1 and 92% of the supervised accuracy scores on the opportunity prediction task. The values of these metrics observed here are 0.70 and 0.66, respectively.
Moving on to backchannel signal prediction, for determining the category of signals for the listener to emit, Table 6 shows the results obtained for all the models trained in the different settings. When using the manually-annotated labels for training, we observe that the MLP model has the best accuracy of 0.87; however, in terms of the F1-score, the ADA model outperforms all the others. Given the data imbalance in the signal prediction task, we use the F1-score as the deciding metric and choose ADA as the best performing model, with an F1-score of 0.81. We observe a similar performance trend when using the semi-supervised signal identification model to generate the training labels. Here as well, ADA is the best performing model in terms of F1-score, with a value of 0.78, which is 96% of the corresponding metric for the supervised setting.
In Table 7, we report the confusion matrices for the best (ADA) signal prediction models found for the two settings. We used the worst-case confusion matrix [25], computed using all the matrices generated across the manifold evaluation of the model. The performance for the 'visual' (C1) class remained almost unchanged in the two settings. The false positives between the 'visual' and the 'both' (C3) classes reduced when using semi-supervised labels. The 'verbal' (C2) class performance dropped slightly. On the other hand, we performed better for the class C3, but that came at the cost of a slight increase in the false positives from C2. Overall, the performance of the model trained using semi-supervised labels was indeed comparable with that of the one trained on manually-annotated labels.

4.3 User Study
For problems like predicting listener backchannels, quantitative evaluations based on different metrics may not be sufficient to analyze the models' efficacy. For such cases, qualitative assessments become extremely important. Many prior studies have used ECAs or robots to deploy their models and assess the backchannel predictions. However, in the present study, we follow a slightly different approach, which is partly inspired by [15] and [28]. Specifically, using Apple's Memoji feature, we created two virtual avatars: Arjun (introvert) and Karan (extrovert) (the names used are hypothetical, and do not compromise anonymity). We used the same Memoji technology to record (as short clips) the neutral state and the different combinations of backchannel signals, mentioned in Table 8, for each of them. Furthermore, we prepared a short video compilation (nearly 4 minutes long) of a few speakers from the dataset used in this work. Using our prediction models (backchannel opportunity prediction was done at a fixed interval; depending on the output, the signal categories were predicted; we also ensured that the models used for the different speakers in the compilation were ones where the speaker belonged to the test set), along with a personality-contingent signal combination sampling technique, we recorded the backchanneling responses that Arjun and Karan would emit for the speakers in the compilation. Although the sampling is not the main focus of this work, we briefly describe below the steps followed, for re-use by future works:
• After getting the predictions for backchannel opportunity and the signal category from our models, we use inverse transform sampling to decide whether to emit unimodal or multimodal signals (in the case of the visual and verbal categories only), based on the personality being modeled (extrovert or introvert). Note that the probabilities used here are calculated from the data and are mentioned in Section 2.2.
• Now, we choose the exact signals to emit from Table 8, again using inverse transform sampling, where the probabilities for each signal combination are calculated from the data (a sketch of this two-step sampling follows the hypotheses below). The only step that is contingent on personality is deciding between unimodal and multimodal signals; the second step of signal sampling uses common probabilities for both characters.

Category | Unimodal | Multimodal
Visual | nod (–), … | nod + smile (–), …
Verbal | utter: top short utterances like okay, hmm, haan (yes), etc., probabilistically sampled | nod + utter (–), …
Both | – | …

Table 8: Signal combinations from the different categories used in the present study, sampled with normalized probabilities inferred from the data.

We recorded six response videos, three each for Arjun and Karan, using the following different prediction policies:
(1) Random Prediction Model (Random policy): A baseline model that predicted backchannel opportunities and the signal categories at random.
(2) Supervised Model (MA policy): The best opportunity (multimodal) and signal category (ADA) prediction models trained on the manually-annotated labels.
(3) Semi-supervised Model (SSL policy): The opportunity (multimodal) and signal category (ADA) prediction models trained with the labels generated from the semi-supervised identification models (ResNet and RF, respectively, with 25% data) used in this work.

With these short clips recording the backchannel responses as predicted by the different models for Arjun and Karan, we wanted to test the following three hypotheses:
[H1]: As perceived by a human watching the response videos, the backchannels predicted by our models look more natural, and therefore better, than those predicted by the random policy model.
[H2]: From a user's perspective, there is no significant difference in the quality of backchannel responses emitted by the models in settings 2 and 3 above, i.e., using semi-supervised learning with a small subset of labelled data does not impact the quality of backchannel responses generated by the final prediction models.
[H3]: Finally, to assess the extent to which backchannel responses depend on personality, and the utility of using such personality-contingent signal sampling: a human can judge the extraversion personality trait for Arjun and Karan based on their response videos.
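The two sampling steps referenced in the bullets above can be sketched as follows; the modality probabilities come from Section 2.2, while the per-combination probability tables are placeholders, since the exact Table 8 values were not recoverable here.

```python
import random


def sample_signal(category: str, extravert: bool) -> str:
    """Two-step sampling for a predicted signal category ('visual', 'verbal',
    or 'both'), conditioned on the virtual listener's personality."""
    # Step 1: unimodal vs. multimodal, personality-contingent (Section 2.2).
    # 'both' instances are multimodal by definition, so the choice applies
    # only to the visual and verbal categories.
    p_multimodal = 0.51 if extravert else 0.35
    if category == "both" or random.random() < p_multimodal:
        modality = "multimodal"
    else:
        modality = "unimodal"

    # Step 2: inverse transform sampling over the signal combinations.
    # These probability tables are placeholders, not the paper's values.
    combos = {
        ("visual", "unimodal"): [("nod", 0.9), ("smile", 0.1)],
        ("visual", "multimodal"): [("nod + smile", 1.0)],
        ("verbal", "unimodal"): [("okay", 0.4), ("hmm", 0.4), ("haan", 0.2)],
        ("verbal", "multimodal"): [("nod + utter", 1.0)],
        ("both", "multimodal"): [("nod + utter", 0.6),
                                 ("nod + smile + utter", 0.4)],
    }
    r, cumulative = random.random(), 0.0
    for signal, p in combos[(category, modality)]:
        cumulative += p
        if r <= cumulative:
            return signal
    return signal  # fallback for floating-point rounding
```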
Figure 7: Stills from two of our user study videos: (i) Karan, and (ii) Arjun.

Figure 8: Box & Whisker plot for the Likert scale ratings (where 1 is very poor and 5 is very good) given by the participants to each of the three policies, based on (i) placement & frequency, and (ii) signal type selection.

Twenty-seven students were recruited as participants from a university's mailing list. They were all native Hindi speakers. As an introduction, the participants were acquainted with the concept of listener backchannels. They were instructed to watch all six response videos and carefully observe the virtual listeners' backchannel responses in each one. It was a blind study, i.e., the participants knew neither about the prediction policies nor about the listeners' personality traits in the videos.

Furthermore, to ensure that the appearance of the two avatars did not bias the participants' judgement, we switched the avatars (with everything else as is) for half the participants. In other words, for half of the participants, Arjun acted as an extrovert, and vice versa. With this, we were able to analyze whether backchannels indeed depend on the extraversion personality trait.

After watching all the videos, the participants were presented with a questionnaire meant to test the three hypotheses mentioned above. In particular, we asked the participants to:
• Rank the three policies in order based on the quality of backchannels, taking into consideration factors like the placement, frequency, and selection of backchannel signals.
• Rate each one on a Likert scale based on the above parameters.
• Identify the personality traits of the avatars based on the videos.

Post-survey analysis of the participants' responses led to some exciting findings which supported our hypotheses:
(1) Models from both the manually-annotated and semi-supervised settings performed better than random: both were rated significantly better than the random policy, in terms of placement & frequency as well as signal selection. We used the Wilcoxon signed-rank test (a minimal sketch follows these findings), and the corresponding p-values are [P - placement & frequency, S - signal selection, * indicates a significant difference at the 95% confidence interval]:
• MA > Random: 0.0019 (P*); 0.0013 (S*)
• SSL > Random: 0.00011 (P*); 0.0001 (S*)
A box-plot for these ratings is shown in Figure 8. Notice how the boxes for MA and SSL do not overlap with Random. This also indicates that the former two certainly received better ratings than the latter. This validates our first hypothesis [H1], i.e., that with a data-driven approach we were able to emit more natural and human-like backchannels.
(2) Prediction models trained on semi-supervised labels produced more natural backchannels: Comparing the ratings provided for the MA and the SSL policies, we found that in terms of the frequency and placement (P), there was no significant difference between the two. In terms of the signal selection (S), however, the ratings suggested that SSL was significantly better than MA. In Figure 8 (i), the boxes for MA and SSL overlap, with the median line for SSL slightly above MA's. This indicates that there is likely to be a difference between the two sets of ratings, even though it may not be significant (as found from the test). For signal selection, the boxes do not overlap at all, indicating a difference.
Furthermore, as a part of the questionnaire, we also asked the participants to compare the response videos generated by these two prediction models. Of the total, 60% of the participants found the backchannel responses emitted by the SSL model more natural than the MA model (SSL > MA), and 15% observed no perceptible difference between the two (SSL ∼ MA). This also aligns with our quantitative prediction results, where the proposed SSL model was able to reach ∼95% of the latter's performance. Thus, the two models were, both qualitatively and quantitatively, very similar. This confirms our second hypothesis [H2], with most of the participants finding the proposed model (SSL) similar (or more natural) compared to the MA model.
(3) Karan and Arjun's extraversion traits were perceptible: 80% of the participants were able to accurately identify which of the two virtual-human characters was the introvert and which the extrovert. This confirms our third and final hypothesis [H3], that the type of backchannel signals emitted by an individual indeed depends on their personality.
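For reference, the pairwise policy comparisons above can be reproduced with SciPy's paired Wilcoxon signed-rank test; the rating arrays below are hypothetical stand-ins for the per-participant Likert ratings.

```python
from scipy.stats import wilcoxon

# Per-participant Likert ratings for two policies (hypothetical values).
ma_ratings = [4, 4, 3, 5, 4, 3, 4]
random_ratings = [2, 3, 2, 2, 3, 1, 2]

# One-sided paired test: is MA rated higher than Random?
stat, p = wilcoxon(ma_ratings, random_ratings, alternative="greater")
print(f"MA > Random: p={p:.4f}")   # significant at the 95% level if p < 0.05
```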
5 DISCUSSION
Relation to prior work and some interesting findings: Our quantitative and qualitative evaluations in the previous section strongly validate how semi-supervision can be extremely useful in designing human-like ECAs, focusing on the task of listener backchannel prediction. We found that with just 25% of the manually annotated data (∼175 minutes), we were able to train a backchannel prediction system that performed comparably well as, and in terms of some parameters even better than, the one trained using 100% of the data. Most of the prior works, even the most recent ones, including [15, 16, 29, 32], have depended on a large amount of annotation. We believe that this observation can significantly benefit the HCI community. Furthermore, studies have shown that backchannel responses vary greatly with culture [18, 42], and most of the prior studies have focused primarily on the American and European populations in this regard. In this work, we worked with subjects of Indian origin, and therefore, our work holds cultural significance as well.

A particularly interesting finding from Section 4.3 was that most participants found the backchannel responses generated by the SSL policy more 'natural' than the MA policy. We hypothesise this could be attributed to some form of label noise: after annotation, we only took those positive instances where at least two raters agreed, and the rest were discarded. Our negative samples overlapped with those instances where only one of the raters had annotated a positive BC. The SSL model could be learning to predict these backchannels. In fact, in our demo videos, we found two such instances. Starting with a small amount of seed data, the SSL model could have been learning these instances as well (given there are some hints of BC), which could explain the observation that even though SSL performed only comparably with MA quantitatively, it seemed more natural to the participants in the subjective study.

Limitations: We would also like to highlight some limitations of our work, which can form the basis for future studies. First, we only used the participants' visual and acoustic features and did not include the content of the conversation itself for predicting backchannels. The main reason for this was that the primary language used in the Vyaktitv dataset was Hindi. English translations for the dialogues were not available, making the use of state-of-the-art NLP techniques non-trivial. In another aspect, our qualitative user study involved a short 4-minute compilation of speakers from the Vyaktitv dataset. A more rigorous analysis could follow by deploying the models and the avatars as a real-time system. Finally, we did not annotate eye blinks/gaze [20] as backchannel signals, primarily because they were not as apparent in the dataset. This could be a cultural difference as well.
Ethical Consideration: When developing data-driven systems leveraging data from human subjects, it becomes imperative that we respect their privacy boundaries. We want to state that no Personally Identifiable Information (PII) was used while training or evaluating the system. We also complied with the agreements in the Vyaktitv dataset to ensure the safe use of the data.
6 CONCLUSION AND FUTURE WORK
In this work, we confirmed the feasibility of using semi-supervised learning to (semi-)automate the process of identifying and labeling listener backchannel instances (both the opportunities and the associated signals) from conversations. We used a Hindi peer-to-peer conversation-based multimodal dataset, Vyaktitv, for our experiments. However, the methodology proposed in the study is general and can be adapted for other conversational datasets as well. Quantitative evaluation alongside a subjective analysis in the form of a user study strongly validated our hypothesis that prediction models trained using semi-supervised labels perform comparably with those using manually annotated labels. Furthermore, we statistically and qualitatively confirmed that the type of backchannel signals emitted is intimately linked to an individual's personality (extraversion in particular).

Future work directions include validating the scope of semi-supervised learning for listener backchannel prediction on other datasets. Other parallel tasks, like listener disengagement prediction, can also be similarly performed. Methodologically, future studies could explore devising heuristics and using weak-supervision based techniques to identify backchannels. Our observations from analyzing the impact of personality on the type (modality) of backchannel response also open some new research questions; for instance, can we also analyze whether the frequency of backchannels emitted by different individuals depends on personality [4]? Furthermore, can we use these findings to further embed personality into a virtual human? These are some exciting lines future researchers can look into by conducting more extensive analysis.
ACKNOWLEDGEMENTS
Jainendra Shukla and Rajiv Ratn Shah are partly supported by the Infosys Center for AI and the Center for Design and New Media at IIIT Delhi. In addition, Rajiv Ratn Shah is also partly supported by the ECRA Grant (ECR/2018/002776) by SERB, Government of India. Finally, we would like to thank the annotators, and all the participants who took part in our study, for their time and efforts.
REFERENCES
[1] Samer Al Moubayed, M. Baklouti, Mohamed Chetouani, Thierry Dutoit, Ammar Mahdhaoui, Jean-Claude Martin, Stanislav Ondáš, Catherine Pelachaud, Jerome Urbain, and M. Yilmaz. 2009. Generating Robot/Agent backchannels during a storytelling experiment. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA '09). 3749–3754. https://doi.org/10.1109/ROBOT.2009.5152572
[2] Stéphane Ayache and Georges Quénot. 2008. Video corpus annotation using active learning. In European Conference on Information Retrieval. Springer Berlin Heidelberg, Berlin, Heidelberg, 187–198.
[3] Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. OpenFace 2.0: Facial behavior analysis toolkit. In 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018).
[4] Elisabetta Bevacqua, Etienne de Sevin, Sylwia Julia Hyniewska, and Catherine Pelachaud. 2012. A listener model: introducing personality traits. Journal on Multimodal User Interfaces 6, 1-2 (2012), 27–38.
[5] Elisabetta Bevacqua, Maurizio Mancini, and Catherine Pelachaud. 2008. A listening agent exhibiting variable behaviour. In International Workshop on Intelligent Virtual Agents. Springer Berlin Heidelberg, Berlin, Heidelberg, 262–269.
[6] Patrícia Bota, Joana Silva, Duarte Folgado, and Hugo Gamboa. 2019. A semi-automatic annotation approach for human activity recognition. Sensors 19, 3 (2019), 501.
[7] Kevin W. Bowyer, Nitesh V. Chawla, Lawrence O. Hall, and W. Philip Kegelmeyer. 2011. SMOTE: Synthetic Minority Over-sampling Technique. CoRR abs/1106.1813 (2011), 321–357. arXiv:1106.1813 http://arxiv.org/abs/1106.1813
[8] Hennie Brugman and Albert Russel. 2004. Annotating Multi-media/Multi-modal Resources with ELAN. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04). 2065–2068.
[9] Justine Cassell, Joseph Sullivan, Elizabeth Churchill, and Scott Prevost. 2000. Embodied conversational agents. MIT Press, Cambridge, MA, United States.
[10] John M. Digman. 1990. Personality structure: Emergence of the five-factor model. Annual Review of Psychology 41, 1 (1990), 417–440.
[11] Jared A. Dunnmon, Alexander J. Ratner, Khaled Saab, Nishith Khandwala, Matthew Markert, Hersh Sagreiya, Roger Goldman, Christopher Lee-Messer, Matthew P. Lungren, Daniel L. Rubin, et al. 2020. Cross-modal data programming enables rapid medical machine learning. Patterns (2020).
[12] Hassan Ismail Fawaz, Germain Forestier, Jonathan Weber, Lhassane Idoumghar, and Pierre-Alain Muller. 2019. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery 33, 4 (2019), 917–963.
[13] Theodoros Giannakopoulos. 2015. pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis. PLoS ONE 10 (2015), 1–17.
[14] Mononito Goswami, Lujie Chen, and Artur Dubrawski. 2020. Discriminating Cognitive Disequilibrium and Flow in Problem Solving: A Semi-Supervised Approach Using Involuntary Dynamic Behavioral Signals. Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020), 420–427.
[15] Mononito Goswami, Minkush Manuja, and Maitree Leekha. 2020. Towards Social & Engaging Peer Learning: Predicting Backchanneling and Disengagement in Children. arXiv:2007.11346 [cs.HC]
[16] K. Hara, K. Inoue, K. Takanashi, and Tatsuya Kawahara. 2018. Prediction of Turn-taking Using Multitask Learning with Prediction of Backchannels and Fillers. In INTERSPEECH 2018.
[17] Journal of Pragmatics 35, 7 (2003), 1113–1142.
[18] Bettina Heinz. 2003. Backchannel responses as strategic responses in bilingual speakers' conversations. Journal of Pragmatics 35, 7 (2003), 1113–1142. https://doi.org/10.1016/S0378-2166(02)00190-X
[19] Réka Hollandi, Ákos Diósdi, Gábor Hollandi, Nikita Moshkov, and Péter Horváth. 2020. AnnotatorJ: an ImageJ plugin to ease hand-annotation of cellular compartments. Molecular Biology of the Cell 31 (2020), mbc–E20.
[20] Paul Hömke, Judith Holler, and Stephen C. Levinson. 2018. Eye blinks are perceived as communicative signals in human face-to-face interaction.
PloS one
ISPRS Journal of Photogrammetry and RemoteSensing
167 (2020), 12–23.[22] Rajni Jindal, Maitree Leekha, Minkush Manuja, and Mononito Goswami. 2020.What makes a Better Companion? Towards Social & Engaging Peer Learning.
Proceedings of the Twenty-forth European Conference on Artificial Intelligence. IOSPress
325 of Frontiers in Artificial Intelligence and Applications (2020), 482–489.[23] Daniel Jurafsky, Carol Van Ess-dykema, et al. 1997. Switchboard discourselanguage modeling project.[24] S. N. Khan, M. Leekha, J. Shukla, and R. R. Shah. 2020. Vyaktitv: A MultimodalPeer-to-Peer Hindi Conversations based Dataset for Personality Assessment.
Pro-ceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Autonomous Agentsand Multi-Agent Systems
20 (01 2010), 70–84. https://doi.org/10.1007/s10458-009-9092-y[28] Markus Mueller, David Leuschner, Lars Briem, Maria Schmidt, Kevin Kilgour,Sebastian Stueker, and Alex Waibel. 2015. Using Neural Networks for Data-DrivenBackchannel Prediction: A Survey on Input Features and Training Techniques.In
Human-Computer Interaction: Interaction Technologies , Masaaki Kurosu (Ed.).Springer International Publishing, Cham, 329–340.[29] Hae Won Park, Mirko Gelsomini, Jin Lee, and Cynthia Breazeal. 2017. TellingStories to Robots: The Effect of Backchanneling on a Child’s Storytelling. In
Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot In-teraction . Association for Computing Machinery, New York, NY, USA, 100–108.https://doi.org/10.1145/2909824.3020245[30] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M.Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cour-napeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: MachineLearning in Python.
Journal of Machine Learning Research
12 (2011), 2825–2830.[31] Michael Rubinstein, Ce Liu, and William T Freeman. 2012. Annotation propa-gation in large image databases via dense image correspondence. In
EuropeanConference on Computer Vision . Springer Berlin Heidelberg, Berlin, Heidelberg,85–99.[32] Robin Ruede, Markus Müller, Sebastian Stüker, and Alex Waibel. 2019.
Yeah,Right, Uh-Huh: A Deep Learning Backchannel Predictor . Springer InternationalPublishing, Cham, 247–258. https://doi.org/10.1007/978-3-319-92108-2_25[33] Nikhita Singh, Jin Joo Lee, Ishaan Grover, and Cynthia Breazeal. 2018. P2PSTORY:Dataset of Children as Storytellers and Listeners in Peer-to-Peer Interactions. In
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems .Association for Computing Machinery, New York, NY, United States, 1–11. [34] T. Solorio, O. Fuentes, Nigel G. Ward, and Yaffa Al Bayyari. 2006. Prosodic featuregeneration for back-channel prediction.
INTERSPEECH . Association for Computing Machinery, New York, NY, USA, 95–104.https://doi.org/10.1145/3340555.3353750[36] Khiet P Truong, Ronald Poppe, and Dirk Heylen. 2010. A rule-based backchannelprediction model using pitch and pause information.
Eleventh Annual Conferenceof the International Speech Communication Association
Machine Learning
Proceeding of Fourth International Conference on Spoken LanguageProcessing. ICSLP ’96
Journal of pragmatics
32, 8 (2000),1177–1207.[40] Nigel G Ward and Yaffa Al Bayyari. 2006. A case study in the identification ofprosodic cues to turn-taking: Back-channeling in Arabic.
Ninth InternationalConference on Spoken Language Processing
Language in society
18, 1 (1989), 59–76.[42] Sheida White. 1989. Backchannels across Cultures: A Study of Americans andJapanese.