Brain-computer interface with rapid serial multimodal presentation using artificial facial images and voice
Highlights

• We proposed a P300-based RSMP BCI that uses artificial face and voice stimuli.
• Audiovisual stimuli enhanced the classification accuracy for the RSMP BCI.
• LPP at Pz contributed to the classification of the BCI.

Brain-computer interface with rapid serial multimodal presentation using artificial facial images and voice

A. Onishi a,b,∗ (Assistant Professor)

a Department of Electronic Systems Engineering, National Institute of Technology, Kagawa College, 551, Kohda, Takuma-cho, Mitoyo-shi, Kagawa, 769-1192, Japan
b Center for Frontier Medical Engineering, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba, Japan

⋆ This document presents the results of a research project funded by JSPS KAKENHI grant 18K17667.
∗ Corresponding author: [email protected] (A. Onishi), http://onishi.starfree.jp/
ARTICLE INFO

Keywords: BCI, P300, RSMP, RSVP, multimodal, audiovisual

ABSTRACT
Electroencephalography (EEG) signals elicited by multimodal stimuli can drive brain-computer interfaces (BCIs), and research has demonstrated that visual and auditory stimuli can be employed simultaneously to improve BCI performance. However, no studies have investigated the effect of multimodal stimuli in rapid serial visual presentation (RSVP) BCIs. In the present study, we propose a rapid serial multimodal presentation (RSMP) BCI that incorporates artificial facial images and artificial voice stimuli. To clarify the effect of audiovisual stimuli on the RSMP BCI, scrambled images and masked sounds were applied instead of the visual and auditory stimuli, respectively. Our findings indicated that the audiovisual stimuli improved the performance of the RSMP BCI, and that the late positive potential (LPP) at Pz contributed to classification accuracy. Taken together, these findings may aid in the development of better gaze-independent BCI systems.
1. Introduction
Brain-computer interfaces (BCIs) measure brain signals, which are then decoded into commands for controlling an external device [26], making them valuable for individuals with disabilities. Several BCIs that rely on electroencephalography (EEG) have been proposed. One well-studied BCI utilizes the P300 component of the event-related potential (ERP) (i.e., the P300- or ERP-based BCI), which appears in response to rare stimuli [7].

P300-based BCIs can be driven by visual, auditory, or tactile stimuli. In early studies, visual stimuli consisted of a matrix of letters, referred to as the P300 speller or the Farwell and Donchin speller [7, 16, 15]. The P300 speller turns a row or column of gray letters on the matrix white, and users can spell a desired letter by silently counting each time that letter turns white. Similarly, an auditory P300-based BCI that can select "Yes", "No", "Pass", and "End" has been investigated [18]. Furthermore, tactors attached to a participant's waist can be used instead of visual or auditory stimuli [4]. These three sensory modalities are associated with different pathways to the brain; thus, even with impairment in one modality, P300-based BCIs can be effective.

BCI performance depends on the stimulus, because the ERP is modulated by stimulus modality and content, and because the ERP, including the P300, is used as a feature for classification. P300 amplitude and latency differ across modalities, yielding differences in classification accuracy [2, 23]. In addition, complex visual and auditory stimuli that contain rich information have been applied to BCIs. In one previous study, a P300-based BCI with
a green/blue flicker matrix exhibited a higher accuracy than one with a white/gray flicker matrix [20]. Another study reported improved performance using a P300-based BCI that presents facial images instead of color changes [10, 13, 9]. Several studies have also investigated the applicability of auditory stimuli. Spatial auditory stimuli from speakers around a user help to increase the accuracy of P300-based BCIs [17]. Furthermore, natural auditory stimuli, such as animal sounds (e.g., frog, seagull), elicit distinctive ERP waveforms [19]. These sensory modalities can also be applied to a BCI simultaneously.

Research has demonstrated that visual and auditory stimuli can be employed simultaneously in BCIs [18]. For example, an audiovisual BCI that responds yes or no has been proposed [24]. Another study indicated that a bimodal P300-based BCI combining visual and tactile stimuli exhibited better performance than unimodal BCIs [4, 23]. Furthermore, auditory stimuli delivered via bone-conduction headphones have been combined with tactile stimuli in multimodal BCI systems [14]. A bimodal, direction-congruent BCI with spatial auditory stimuli and corresponding tactile stimuli also exhibited better performance than unimodal BCIs [27]. Taken together, these findings indicate that multimodal stimuli improve the classification accuracy of BCIs. Thus, multimodal BCIs are advantageous in that they can not only use multiple sensory pathways but also improve performance, likely via sensory integration.

Unlike the letter matrices utilized in traditional visual P300-based BCIs, rapid serial visual presentation (RSVP) involves the rapid presentation of stimuli at the center of the monitor, one by one, in a random order [1]. RSVP is advantageous for P300-based systems because it does not require eye gaze movements to drive the BCI [23]. However, the effect of multimodal stimuli on the performance of such BCIs remains unclear.

Therefore, in the present study, we propose a P300-based BCI with rapid serial multimodal presentation (the RSMP BCI), which utilizes artificial facial images and an artificial voice. The stimuli represented five Japanese vowels, and each stimulus indicated a single vowel. We hypothesized that audiovisual stimuli would also be effective for the P300-based RSMP BCI, given the congruence of the stimuli and the ERP components elicited by the facial images. To clarify the effect of audiovisual stimuli in BCI systems, we prepared and compared facial images with phase scrambling as well as masked sounds.
Figure 1: Types of visual stimuli. Artificial facial images representing the Japanese vowels あ (a), い (i), う (u), え (e), and お (o) were rendered and then trimmed into an ellipse. These stimuli were used in the AV and V conditions. In addition, phase scrambling was applied to the facial images used in the A condition.
2. Methods
Eleven healthy participants were included in this study. Their mean age was … years. Two of the participants were female, and the others were male. All participants provided written informed consent prior to the experiment. This study was conducted in accordance with the guidelines of the Internal Ethics Committee at Chiba University.

We prepared an RSMP BCI system that can select one of five Japanese vowels using brain signals. Artificial facial images and vowel sounds were applied as BCI stimuli. We examined the effect of the audiovisual stimuli on the RSMP BCI by comparing the following three conditions: audiovisual (AV), visual (V), and auditory (A). In the AV condition, the artificial facial images shown in Fig. 1 and the corresponding artificial voice were presented simultaneously. In the V condition, the artificial voice stimuli were masked and presented with the artificial facial images. In the A condition, facial images with phase scrambling (see Fig. 1) were presented together with the artificial voice. Note that both visual and auditory stimuli were presented simultaneously even in the V and A conditions.

To input a cued Japanese vowel via this BCI system, a participant was asked to count the appearances of the instructed vowel within a series of stimuli. Figure 2 shows an example of the task. At the beginning, a cue was presented to the participant. Next, audiovisual stimuli were presented in random order.
Figure 2:
Example of the BCI task (AV condition). First, the vowel to count (the target vowel) was cued with audiovisual stimuli and an instruction message. Second, a facial image and the corresponding sound were presented for 500 ms, after which they disappeared for 500 ms. Stimuli thus appeared every 1,000 ms, changing vowels in a pseudo-random order. When the stimuli represented the cued vowel, the participant silently counted their appearances (up to 15 times). After stimulus presentation, the EEG signals recorded during the above task were translated into a vowel and fed back to the participant if the classifier had been trained.

During stimulus presentation, participants silently counted the appearances of the cued stimuli. Finally, the EEG signals recorded during the task were analyzed and translated into a vowel output. Note that all participants were informed of the cue type, which could be recognized from the auditory or visual stimulus in the A and V conditions.

Each stimulus lasted 500 ms, and the inter-stimulus interval was 500 ms; in other words, the stimulus onset asynchrony (SOA) was 1,000 ms. All five stimuli were repeatedly presented 15 times in a trial, and each run included two trials. Runs were repeated five times, in a pseudo-random order, for each stimulus condition. The probability of target appearance was 1/5. During the experiment, the output was not indicated, in order to save experimental time and to avoid fatiguing participants. Classification accuracy for each stimulus condition was calculated via offline cross-validation. Before the BCI experiments, the sound pressure level was adjusted to 20 dB SL using the method of limits.

The BCI system consisted of a PC for managing the experiment and EEG recordings (HP ProBook 430 G3, HP Inc., CA), a PC for presenting the stimuli (handmade PC, G31M-ES2L, Windows 10), a display monitor (E178FPc, Dell Inc., TX), an EEG amplifier (Polymate Mini AP108, Miyuki Giken Co., Ltd., Japan), an audio interface (UCA222, Behringer GmbH, Germany), headphones (ATH-M20x, Audio-Technica Co., Ltd., Japan), and an AD converter (AIO-160802AY-USB, Contec Co., Ltd., Japan). The sound pressure level was adjusted using an attenuator (FX-AUDIO AT-01J, North Flat Japan Co., Ltd., Japan).
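As an illustration of the trial structure described above, the following minimal sketch (Python; the blockwise shuffling and the function name are our assumptions, not taken from the original system) generates a pseudo-random stimulus order in which each of the five vowels appears once per repetition block, giving a target probability of 1/5:

```python
import random

def make_trial_sequence(n_stimuli=5, n_repetitions=15, seed=None):
    """Generate a pseudo-random stimulus order for one trial.

    Each of the five vowel stimuli appears exactly once per repetition
    block, so every stimulus is presented n_repetitions times in total
    and the probability of a target at any onset is 1/5.
    """
    rng = random.Random(seed)
    sequence = []
    for _ in range(n_repetitions):
        block = list(range(n_stimuli))  # stimulus indices 0..4 -> (a, i, u, e, o)
        rng.shuffle(block)              # randomize order within the block
        sequence.extend(block)
    return sequence

# One trial: 5 stimuli x 15 repetitions = 75 onsets, one every 1,000 ms (SOA)
order = make_trial_sequence(seed=0)
onsets_ms = [k * 1000 for k in range(len(order))]
```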
The above-mentioned audiovisual stimuli were generated as follows. For the facial images, a three-dimensional (3D) human model, TY2, was rendered using Poser 10 (Smith Micro Inc., CA). The facial images were then trimmed elliptically using Corel PHOTO-PAINT Essentials X8 (Corel Inc., Canada). For the contrast conditions, the images were scrambled via Fourier phase scrambling using MATLAB 2016a (MathWorks Inc., MA). Artificial voice stimuli were generated with CeVIO Creative Studio S 6 (Frontier Works Inc., Japan) using the ONE voice model. All vowels were the musical note C4 and lasted 500 ms. The sounds were trimmed and their RMS levels were equalized using MATLAB; these sounds were employed in the AV and A conditions. Sounds masked by Gaussian noise were also generated in MATLAB for use in the V condition.
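For reference, Fourier phase scrambling of a grayscale image can be sketched as follows (Python/NumPy; this illustrates the standard technique rather than the authors' MATLAB implementation):

```python
import numpy as np

def phase_scramble(image, seed=None):
    """Fourier phase scrambling of a 2-D grayscale image.

    The amplitude spectrum of the image is kept, while the phase
    spectrum is replaced by that of white noise, which destroys the
    recognizable face structure but preserves low-level image
    statistics. Color images can be processed per channel.
    """
    rng = np.random.default_rng(seed)
    noise = rng.random(image.shape)
    random_phase = np.angle(np.fft.fft2(noise))   # random phase spectrum
    magnitude = np.abs(np.fft.fft2(image))        # original amplitude spectrum
    scrambled = np.fft.ifft2(magnitude * np.exp(1j * random_phase))
    return np.real(scrambled)
```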
EEG signals were recorded from C3, Cz, C4, P3, Pz, P4, O1, and O2; the ground and reference electrodes were placed on the forehead and the right mastoid, respectively. The sampling rate was 500 Hz. A hardware low-pass filter (cut-off frequency: 30 Hz) and high-pass filter (time constant: 1.5 s) were applied, in addition to a notch filter (50 Hz).
Offline classification accuracy in the three stimulus conditions was estimated via leave-one-out cross-validation. Recorded EEG signals were trimmed to 1 s, and baseline correction was performed using the 0.1 s of EEG preceding stimulus onset. Before classification, a notch filter (50 Hz), a Savitzky-Golay filter (3rd order, 69 samples), and downsampling (to 140 samples) were applied. The multichannel EEG data were then vectorized. Principal component analysis (PCA) was applied to reduce the number of dimensions, with a contribution threshold of 0.9999. Finally, linear discriminant analysis (LDA) was applied to classify the signal.

The output was determined as follows. The vowels "a," "i," "u," "e," and "o" were labeled 1 to 5, respectively. Given the stimulus set $I = \{1, 2, \ldots, 5\}$, the number of stimulus repetitions $R$, the preprocessed and vectorized testing data $\mathbf{x}_{r,i}$ in response to stimulus $i$ of the $r$-th stimulus repetition, and the trained LDA weight vector $\mathbf{w}$, the input was estimated by finding the stimulus $i$ yielding the maximum summation of the inner product of the data and the LDA weight vector:

$$\hat{i} = \arg\max_{i \in I} \sum_{r=1}^{R} \mathbf{w} \cdot \mathbf{x}_{r,i}. \quad (1)$$

The estimated stimulus $\hat{i}$ was the output of the trial; for example, the vowel "u" was the output if $\hat{i} = 3$. If the output was equal to the cued stimulus, the estimation was correct. Classification accuracy was computed as the number of correct outputs divided by the total number of outputs. During the offline analysis, the number of stimulus repetitions $R$ was varied from 1 to 15 to estimate the accuracy for each number of repetitions. Offline accuracy was statistically analyzed using a two-way repeated-measures analysis of variance (ANOVA), with stimulus condition and the number of stimulus repetitions as factors. In addition, post-hoc pairwise t-tests were applied, with p-values corrected using Bonferroni's method.

To gain insight into the contribution of the EEG waveforms, grand-averaged EEG waveforms were visualized. As in the offline classification, baseline correction, a notch filter, and a Savitzky-Golay filter were applied; note that downsampling and PCA were not applied when computing the grand-averaged waveforms.

In addition, the point-biserial correlation coefficients, or $r$-values, were computed [3, 12, 21]. The $r$-value of a time sample in a channel can be calculated as follows: given the numbers of target and nontarget epochs $N_1$ and $N_2$, the mean values of the target and nontarget classes $\mu_1$ and $\mu_2$, and the standard deviation $\sigma$, the $r$-value is

$$r := \frac{\sqrt{N_1 \cdot N_2}}{N_1 + N_2} \cdot \frac{\mu_1 - \mu_2}{\sigma}. \quad (2)$$

The $r$-value equals the Pearson correlation between ERP amplitude and class label, which implies that a statistical test for the Pearson correlation is applicable. We applied a test of no correlation, with p-values corrected using Bonferroni's method. The $r$-value was squared ($r^2$-value) for visualization. The $r^2$-value increases as the mean values of the target and nontarget classes separate, and as the standard deviation decreases. Note that downsampling was applied when computing $r$-values, in addition to the preprocessing applied to the grand-averaged EEG waveforms, but PCA was not applied.
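The classification and r-value computations can be sketched as follows (Python; scikit-learn's PCA and LDA stand in for the MATLAB implementation, and the epoch array layout and the pooled standard deviation in Eq. (2) are our assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_pipeline(X_train, y_train):
    """Fit PCA (keeping 99.99% of the variance) followed by LDA.

    X_train: (n_epochs, n_features) vectorized EEG epochs
    y_train: binary labels (1 = target, 0 = nontarget)
    """
    pca = PCA(n_components=0.9999)
    Z = pca.fit_transform(X_train)
    lda = LinearDiscriminantAnalysis().fit(Z, y_train)
    return pca, lda

def decide_vowel(epochs, pca, lda):
    """Eq. (1): pick the stimulus with the largest summed LDA score.

    epochs: (R, 5, n_features) -- one epoch per stimulus and repetition.
    Returns i_hat in {0, ..., 4}; adding 1 gives the vowel label 1..5.
    """
    R, I, F = epochs.shape
    scores = lda.decision_function(pca.transform(epochs.reshape(-1, F)))
    return int(np.argmax(scores.reshape(R, I).sum(axis=0)))  # sum over repetitions r

def signed_r(target, nontarget):
    """Eq. (2): point-biserial r for one channel/time sample."""
    n1, n2 = len(target), len(nontarget)
    sigma = np.concatenate([target, nontarget]).std()  # pooled std (assumed)
    return np.sqrt(n1 * n2) / (n1 + n2) * (target.mean() - nontarget.mean()) / sigma
```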
3. Results
To clarify the effect of audiovisual stimuli on the RSMP BCI, we calculated and compared classification accuracy among the AV, V, and A conditions. The offline classification accuracy for these three conditions is shown in Fig. 3 and Table 1. The highest mean accuracy was observed in the AV condition, followed by the V condition, when stimuli were presented more than three times. The lowest mean accuracy for all repetitions was observed in the A condition. At best, the AV, V, and A conditions reached 72.7%, 67.3%, and 51.8%, respectively.

A two-way repeated-measures ANOVA revealed significant main effects of stimulus condition (F(2, 20) = …, p < …) and repetition (F(14, 140) = …, p < …). Post-hoc pairwise t-tests revealed significant differences between all pairs of stimulus conditions (p < …).

Grand-averaged EEG waveforms for the AV, V, and A conditions are shown in Figs. 4, 5, and 6, respectively. Target EEG signals were broadly similar to nontarget EEG signals in all conditions. In the AV condition, target and nontarget signals differed at P4 and O2.
Table 1
Offline classification accuracy [%] for each stimulus condition and number of repetitions.

Condition Subject |   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
AV        1       |  30   20   20   50   50   70   60   80   50   60   60   60   60   40   40
AV        2       |  20   50   50   50   50   60   40   40   50   40   50   50   50   50   50
AV        3       |  50   50   40   70   70   60   80   60   60   70   70   80   80   70   80
AV        4       |  30   50   50   50   80   90   80   90   90   90   90  100  100  100  100
AV        5       |  60   70   90   90   90   90   90   90   90   90   90   90   90   90   90
AV        6       |  20   30   40   40   60   60   70   50   40   50   50   20   30   30   30
AV        7       |  20   10   20   20   20   50   40   40   30   40   30   40   30   40   30
AV        8       |  70   90   80   80   80   80   80   70   70   70   70   80   80   80   90
AV        9       |  70   50   80   90  100   90   90   90  100  100  100  100  100  100  100
AV        10      |  50   30   50   70   70   70   80   90   90   80   80   80   80   80   80
AV        11      |  60   60   70   60   70   80   70   80   80   80   80   80   80   90   90
AV        Mean    | 43.6 46.4 53.6 60.9 67.3 72.7 70.9 70.9 68.2 70.0 70.0 70.9 70.9 70.0 70.9
V         1       |  30   20   10   10    0   10   20    0    0   10   20   10   30   30   30
V         2       |  10   20   10   30   20   20   40   40   40   50   60   30   40   50   50
V         3       |  50   80   70   60   80   80   80   80   80   80   90   80   90   90  100
V         4       |  20   40   60   80   80   80   90  100  100  100  100  100  100  100  100
V         5       |  30   50   70   70   80   60   70   80   70   60   60   70   80   80   80
V         6       |  20   40   50   30   50   50   40   40   40   30   50   50   40   60   60
V         7       |  40   30   20   30   30   40   30   30   30   50   30   40   40   40   40
V         8       |  30   70   70   60   80   80   70   80   80   80   90   90   90   90   90
V         9       |  60   50   60   60   60   60   60   50   60   50   50   40   40   40   50
V         10      |  20   30   50   50   70   60   60   60   60   60   60   60   50   50   40
V         11      |  60   90  100   70   80   90  100  100  100  100  100  100  100  100  100
V         Mean    | 33.6 47.3 51.8 50.0 57.3 57.3 60.0 60.0 60.0 60.9 64.5 60.9 63.6 66.4 67.3
A         1       |  30   30   20   10   30   20   10   20   20   10   10    0    0    0    0
A         2       |  30   40   20   60   30   30   20   20   20   20   30   20   30   40   40
A         3       |  10   40   20   10   10   10   10   20   20   20   30   20   20   30   30
A         4       |  20   10   40   20   30   30   30   30   20   10   30   50   30   20   40
A         5       |  40   70   60   40   60   60   60   50   60   50   50   40   60   70   60
A         6       |  30   50   40   40   50   60   50   60   40   40   40   50   40   40   50
A         7       |  10   50   30   30   30   30   50   60   50   60   50   50   60   60   70
A         8       |  20   50   70   40   50   50   60   50   50   60   60   70   70   60   60
A         9       |  50   40   30   40   50   40   50   50   60   60   70   80   70   70   70
A         10      |  40   30   40   50   70   70   50   60   50   50   40   40   50   50   50
A         11      |  20   60   70   70  100  100  100  100  100  100  100  100  100  100  100
A         Mean    | 27.3 42.7 40.0 37.3 46.4 45.5 44.5 47.3 44.5 43.6 46.4 47.3 48.2 49.1 51.8
In the V condition, there were large differences at P3 and P4 between target and nontarget signals. In the A condition, we observed differences at C3 and P3 between target and nontarget signals.

The point-biserial correlation coefficients for each stimulus condition are shown in Figs. 7, 8, and 9, respectively. Significant $r^2$-values are shown in bright colors, while non-significant values are shown in black (zero). In the AV condition, significant $r^2$-values were observed around 0.5 s at P3 and Pz. In addition, significant $r^2$-values appeared around 0.5 s at P3, Pz, P4, and O2 in the V condition. In contrast, no significant $r^2$-values were found in the A condition.
4. Discussion
In the present study, we investigated the effect of audiovisual content in an RSMP BCI system. Our findings indicated that the highest classification accuracy occurred in the AV condition. This result implies that sensory integration of auditory and visual content increases classification accuracy even for RSMP BCIs. Since RSMP can be used in gaze-independent BCIs, our study provides insight into methods for developing new gaze-independent BCIs.

Our findings indicated that the AV condition was associated with the best performance among the three conditions, in accordance with the findings of previous P300-based BCI studies. Wang et al. reported that an audiovisual P300-based BCI exhibited better performance than a visual-only or auditory-only BCI [24]. However, this may have been due to the absence of visual or auditory stimuli. Even though the current study included both visual and auditory stimuli in all conditions, the AV condition yielded the greatest classification accuracy.
Figure 3:
Offline mean classification accuracy for the audiovisual (AV), visual (V), and auditory (A) conditions.
Figure 4:
Grand-averaged EEG waveforms in the AV condition.

Thurlings et al. examined a visual-tactile BCI under congruent and incongruent conditions, showing that the congruent condition yielded performance improvements [22]. These results imply that absent stimuli, or stimuli that are present yet incomprehensible, do not contribute to increases in BCI performance. In other words, pairs of multimodal stimuli that can be integrated easily may help to increase BCI accuracy.

Target and nontarget ERP waveforms were quite similar in the present study, which implies that stimulus intensity influences both target and nontarget responses. One reason is that target and nontarget stimuli are similar in RSVP and in auditory paradigms. A bimodal P300-based BCI incorporating both tactile and visual stimuli exhibited a target/nontarget difference, in addition to slight differences in the ERP signals, when compared with a unimodal BCI [4]. Another study reported enhancement of the N1 component and reduction of P300 amplitude in a visual-tactile P300-based BCI [23]. However, in these studies, the BCIs used visual stimuli that appeared at different locations on the monitor (i.e., non-RSMP), whereas visual stimuli were presented only at the center of the monitor in the current study.
Figure 5:
Grand-averaged EEG waveforms in the V condition.
Figure 6:
Grand-averaged EEG waveforms in the A condition.

This difference in presentation may explain the similarity between the target and nontarget ERP waveforms; by contrast, enhancement of the P300 at Fz has been reported when using auditory-tactile stimuli [8]. In our study, P300 amplitude was not enhanced by the audiovisual stimuli. In addition, the SOA was longer in our study (1,000 ms) than in previous RSVP BCI studies. Some RSVP BCI studies have reported SOAs of only 100 ms [28, 25], which leads to a large target/nontarget difference. In our study, the ERPs elicited by a stimulus lasted approximately 1,000 ms, and the temporal overlap between target and nontarget stimuli was short. As a result, exogenous ERP components remained in both averaged ERP waveforms. An exogenous ERP component will look small if the SOA is short, because the overlap between successive stimuli is large. Researchers have also examined the influence of presentation speed on RSVP, reporting that the area under the curve decreases as the SOA increases [11]. However, no studies have investigated SOAs longer than 500 ms.

In the current study, we observed no face-specific ERP components, in contrast to previous studies reporting face specificity for the N170 and N400 [9]. This may be explained by (1) the effect of audiovisual stimuli, (2) the effect of artificial facial images, or (3) the effect of the manner of stimulus presentation.
Figure 7: $r^2$-values in the AV condition. Only significant values are shown in color (above zero).

Figure 8: $r^2$-values in the V condition. Only significant values are shown in color (above zero).

Figure 9: $r^2$-values in the A condition. Only significant values are shown in color (above zero).

However, face-specific ERP components have been observed when visual stimuli were presented without auditory stimuli, indicating that these components may be altered by auditory stimuli. Alternatively, such components may not appear when artificial facial images are used. One study reported that both dummy facial images and real facial images can evoke similar ERP components [6]. Another study reported that an RSVP BCI using facial images exhibited N170 components [5]. Therefore, the use of audiovisual stimuli may have resulted in the disappearance of face-specific ERP components.

According to the $r^2$-values and ERP waveforms observed in the current study, the ERP component that appears at approximately 500 ms at Pz, the late positive potential (LPP), is important for classification in the AV and V conditions. However, P300 enhancement was not observed, in contrast to the findings of some previous multimodal P300-based BCI studies [24, 23, 8]. Nonetheless, other researchers have reported decreases in P300 [23]. Complex multimodal stimuli may not always enhance the P300 component, although enhancement of other components can be observed when facial images are applied in a typical visual P300-based BCI.

A wider range of significant $r^2$-values was observed in the V condition than in the AV condition, while accuracy was higher in the AV condition than in the V condition. Although $r^2$-values have been used to explain the contribution of ERP components in previous BCI studies, our results suggest that $r^2$-values do not explain the important ERP components very well. This is a limitation of $r^2$-values, which do not directly take classifiers and dimensionality reduction methods into account.
5. Conclusion
In the present study, we proposed an RSMP BCI that utilizes artificial facial images and an artificial voice. To clarify the effect of the audiovisual stimuli on the BCI, scrambled images and masked sounds were employed as control stimuli. Our results indicated that audiovisual stimuli without scrambling or masking yielded the highest classification accuracy for the RSMP BCI. These results suggest the feasibility of audiovisual stimuli for use with RSMP BCIs and may help to improve gaze-independent BCI systems.
CRediT authorship contribution statement
A Onishi:
Conceptualization of this study, Methodology, Experiments, Data analysis, Writing.
References

[1] Acqualagna, L., Treder, M.S., Schreuder, M., Blankertz, B., 2010. A novel brain-computer interface based on the rapid serial visual presentation paradigm. 2010 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBC'10, 2686–2689.
[2] Belitski, A., Farquhar, J., Desain, P., 2011. P300 audio-visual speller. Journal of Neural Engineering 8, 025022.
[3] Blankertz, B., Lemm, S., Treder, M., Haufe, S., Müller, K.R., 2011. Single-trial analysis and classification of ERP components—a tutorial. NeuroImage 56, 814–825.
[4] Brouwer, A.M., van Erp, J.B., 2010. A tactile P300 brain-computer interface. Frontiers in Neuroscience 4.
[5] Cai, B., Xiao, S., Jiang, L., Wang, Y., Zheng, X., 2013. A rapid face recognition BCI system using single-trial ERP, in: 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), IEEE, pp. 89–92.
[6] Chen, L., Jin, J., Zhang, Y., Wang, X., Cichocki, A., 2015. A survey of the dummy face and human face stimuli used in BCI paradigm. Journal of Neuroscience Methods 239, 18–27.
[7] Farwell, L.A., Donchin, E., 1988. Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials. Electroencephalography and Clinical Neurophysiology 70, 510–523.
[8] Jiang, J., Zhang, B., Yin, E., Wang, C., Deng, B., 2019. A novel auditory-tactile P300-based BCI paradigm. 2019 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications, CIVEMSA 2019 - Proceedings.
[9] Jin, J., Allison, B.Z., Kaufmann, T., Kübler, A., Zhang, Y., Wang, X., Cichocki, A., 2012. The changing face of P300 BCIs: A comparison of stimulus changes in a P300 BCI involving faces, emotion, and movement. PLoS ONE 7.
[10] Kaufmann, T., Schulz, S.M., Grünzinger, C., Kübler, A., 2011. Flashing characters with famous faces improves ERP-based brain-computer interface performance. Journal of Neural Engineering 8.
[11] Lees, S., McCullagh, P., Payne, P., Maguire, L., Lotte, F., Coyle, D., 2020. Speed of rapid serial visual presentation of pictures, numbers and words affects event-related potential-based detection accuracy. IEEE Transactions on Neural Systems and Rehabilitation Engineering 28, 113–122.
[12] Onishi, A., Takano, K., Kawase, T., Ora, H., Kansaku, K., 2017. Affective stimuli for an auditory P300 brain-computer interface. Frontiers in Neuroscience 11, 522.
[13] Onishi, A., Zhang, Y., Zhao, Q., Cichocki, A., 2011. Fast and reliable P300-based BCI with facial images, in: Proceedings of the 5th International Brain-Computer Interface Conference, pp. 192–195.
[14] Rutkowski, T.M., Mori, H., 2015. Tactile and bone-conduction auditory brain computer interface for vision and hearing impaired users. Journal of Neuroscience Methods 244, 45–51.
[15] Salvaris, M., Sepulveda, F., 2009. Visual modifications on the P300 speller BCI paradigm. Journal of Neural Engineering 6.
[16] Schalk, G., McFarland, D.J., Hinterberger, T., Birbaumer, N., Wolpaw, J.R., 2004. BCI2000: a general-purpose brain-computer interface (BCI) system. IEEE Transactions on Biomedical Engineering 51, 1034–1043.
[17] Schreuder, M., Blankertz, B., Tangermann, M., 2010. A new auditory multi-class brain-computer interface paradigm: Spatial hearing as an informative cue. PLoS ONE 5.
[18] Sellers, E.W., Donchin, E., 2006. A P300-based brain-computer interface: Initial tests by ALS patients. Clinical Neurophysiology 117, 538–548.
[19] Simon, N., Käthner, I., Ruf, C.A., Pasqualotto, E., Kübler, A., Halder, S., 2015. An auditory multiclass brain-computer interface with natural stimuli: usability evaluation with healthy participants and a motor impaired end user. Frontiers in Human Neuroscience 8, 1039.
[20] Takano, K., Komatsu, T., Hata, N., Nakajima, Y., Kansaku, K., 2009. Visual stimuli for the P300 brain-computer interface: A comparison of white/gray and green/blue flicker matrices. Clinical Neurophysiology 120, 1562–1566.
[21] Tate, R.F., 1954. Correlation between a discrete and a continuous variable. Point-biserial correlation. The Annals of Mathematical Statistics 25, 603–607.
[22] Thurlings, M.E., Brouwer, A.M., van Erp, J.B., Werkhoven, P., 2014. Gaze-independent ERP-BCIs: Augmenting performance through location-congruent bimodal stimuli. Frontiers in Systems Neuroscience 8, 1–14.
[23] Thurlings, M.E., Brouwer, A.M., Van Erp, J.B., Blankertz, B., Werkhoven, P.J., 2012. Does bimodal stimulus presentation increase ERP components usable in BCIs? Journal of Neural Engineering 9.
[24] Wang, F., He, Y., Pan, J., Xie, Q., Yu, R., Zhang, R., Li, Y., 2015. A novel audiovisual brain-computer interface and its application in awareness detection. Scientific Reports 5, 1–12.
[25] Wei, W., Qiu, S., Ma, X., Li, D., Wang, B., He, H., 2020. Reducing calibration efforts in RSVP tasks with multi-source adversarial domain adaptation. IEEE Transactions on Neural Systems and Rehabilitation Engineering 28, 2344–2355.
[26] Wolpaw, J.R., Birbaumer, N., McFarland, D.J., Pfurtscheller, G., Vaughan, T.M., 2002. Brain-computer interfaces for communication and control. Clinical Neurophysiology 113, 767–791.
[27] Yin, E., Zeyl, T., Saab, R., Hu, D., Zhou, Z., Chau, T., 2016. An auditory-tactile visual saccade-independent P300 brain-computer interface. International Journal of Neural Systems 26, 1650001.
[28] Zheng, L., Sun, S., Zhao, H., Pei, W., Chen, H., Gao, X., Zhang, L., Wang, Y., 2020. A cross-session dataset for collaborative brain-computer interfaces based on rapid serial visual presentation. Frontiers in Neuroscience 14, 1–12.