TapNet: The Design, Training, Implementation, and Applications of a Multi-Task Learning CNN for Off-Screen Mobile Input
Michael Xuelin Huang, Yang Li, Nazneen Nazneen, Alexander Chao, Shumin Zhai
Google, Inc., Mountain View, CA, USA
{mxhuang,liyang,fnazneen,alexanderchao,zhai}@google.com
ABSTRACT
To make off-screen interaction without specialized hardware practical, we investigate using deep learning methods to process the common built-in IMU sensor (accelerometers and gyroscopes) on mobile phones into a useful set of one-handed interaction events. We present the design, training, implementation and applications of TapNet, a multi-task network that detects tapping on the smartphone. With phone form factor as auxiliary information, TapNet can jointly learn from data across devices and simultaneously recognize multiple tap properties, including tap direction and tap location. We developed two datasets consisting of over 135K training samples, 38K testing samples, and 32 participants in total. Experimental evaluation demonstrated the effectiveness of the TapNet design and its significant improvement over the state of the art. Along with the datasets, codebase, and extensive experiments, TapNet establishes a new technical foundation for off-screen mobile input.

CCS CONCEPTS
• Human-centered computing → Gestural input.

KEYWORDS
Back-of-device, gesture recognition, IMU
ACM Reference Format:
Michael Xuelin Huang, Yang Li, Nazneen Nazneen, Alexander Chao, Shumin Zhai. 2021. TapNet: The Design, Training, Implementation, and Applications of a Multi-Task Learning CNN for Off-Screen Mobile Input. In CHI Conference on Human Factors in Computing Systems (CHI '21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3411764.3445626
Touchscreen as the sole method of mobile device input is increasingly challenged by its inherent limitations. The difficulty of one-handed use and the visual occlusion by the operating finger are two of them. Beyond research explorations, we have begun to see off-screen, occlusion-free alternatives in mainstream products, such as double pressing the power button to activate the camera, squeezing the Active Edge [23] of Google Pixel phones, and long-pressing the power button [30] of iPhones for voice command invocation.
Figure 1: TapNet is a one-model-for-all solution that jointly learns from cross-device data and performs multiple tap recognition tasks at a time.
In addition to a tap event (the presence of a tap) alone, potentially useful tap properties include tap direction [20, 34] (i.e., whether the tap is on the front, back, or side of the phone), tap location [17, 21] (i.e., which region of the device the tap falls on), and tap finger part [9] (i.e., tapping with the finger pad vs. the fingernail). Prior studies have focused on recognizing a single tap property, based on a limited amount of data as a proof of concept. This paper aims to address the needs of practice by predicting comprehensive tap properties for diverse application purposes and advancing the state of the art of off-screen interaction towards a practical level of performance. Note that the state of the art for this topic can only be advanced if there are well-established benchmarks with accessible datasets and codebases. Our work is set to help create such a benchmark through extensive dataset development, neural network model design, model training, and experimentation.

After experimenting with multiple methods and architectures for tap detection, we found that a multi-input, multi-output (MIMO) convolutional neural network (CNN) gave the best results. The resulting method, TapNet, enables joint learning on cross-device data and joint prediction of multiple tap properties. Compared with the solution of one model per task, a joint model (see Figure 1) can exploit the interrelation among multiple tap properties during training and also saves run-time computation and memory. TapNet contains shared convolutional layers for task-agnostic knowledge extraction, while each of its output branches retains task-specific information. TapNet uses the inertial measurement unit (IMU) signals as primary input and the phone form factor as "auxiliary information" [28] for cross-device training, i.e., joint training on cross-device data. Together, TapNet offers a one-model-for-all solution [13] across devices.

Our technical investigation and evaluation shed light on two important aspects of machine learning (ML) in HCI development: methods and data. First, thanks to the MIMO architecture, TapNet increased tap property recognition accuracy over prior art methods, particularly for the more difficult tasks such as tap location recognition. Specifically, TapNet yields a mean distance error of around 10% of the screen diagonal, similar to the between-icon distance on a mobile phone home screen. Although this resolution is much less precise than the touchscreen's, it can already enlarge the interaction design space, for example by enabling users to perform selection by tapping on the back of the phone, even when they wear gloves.

Second, we discuss an efficient data strategy for developing an ML-based HCI application. Different from most computer vision tasks that heavily rely on multi-person data to achieve user generalizability [12, 44], a well-performing ML model for HCI (interactive) tasks may not necessarily demand multi-person training data. The key is to ensure the diversity of the training data. To test this hypothesis, we collected a one-person dataset for training and a separate multi-person dataset for testing (and optionally model adaptation). The results show that the one-person model could achieve a comparable level of user generalizability to a model tuned on multi-person data.

Our contribution is four-fold.
We 1) developed a multi-task network that can be jointly trained across devices; the network can simultaneously recognize a set of tap properties, including tap direction and location; 2) developed two datasets and conducted extensive evaluations that advanced the state of the art; 3) offered new perspectives on alleviating the data hurdle in ML-based HCI research; and 4) established a benchmark with open-source code and datasets for off-screen tapping recognition, which will be reproducible and accessible to others.
This study is related to off-screen interaction, in particular, tap-based interaction, as well as multi-task neural networks [3].
Back-of-device (BoD) [5, 6, 14, 16] and edge-of-device interactions [10, 16, 37, 41] have attracted much attention; however, most of them relied on specialized sensors that are not readily available on phones. For instance, BackXPress studied finger pressure for BoD interaction using a sandwiched smartphone [5]. InfiniTouch recognized finger location from capacitive images outside the touchscreen [16]. Some exploited small widgets to enable off-screen gesture sensing. Wearing a magnet allows the magnetometer to track 3D finger movement [4, 24]. Adding a mirror lets the camera detect finger location on the back [33] or around the device [38]. Acoustic sensing exploits sound propagation properties to recognize BoD gestures [29], grip force [32], and contact finger part [9], and electric field sensing detects around-device gestures in a non-intrusive manner [45, 47]. Unlike the IMU-based approach, these detection methods require additional hardware installed on the already very compact mobile devices, increasing manufacturing and material cost and potentially reducing the space available for the largest possible battery.
Despite the line of research on detecting tapping from motion signals captured by the built-in IMU sensors on smartphones [8, 26, 39, 40], there is room for improvement. SecTap allowed users to move a cursor by tilting and to select by back tapping [19]. Bezel-Tap detected taps on the edge of the device [27]. Granell and Leiva conducted feature engineering for tap detection [8]. In contrast to detecting the tap event alone, BackPat recognized tapping in three locations based on gyroscope and microphone signals [26]. BackTap [40] and BeyondTouch [39] classified four-corner taps based on accelerometer, gyroscope, and microphone signals. These studies relied on simple features (e.g., mean, kurtosis, and skewness) and statistical methods, typically based on shallow neural networks [8, 19, 26, 39, 40], and thus achieved only limited precision in tap location detection.

More recently, researchers started to explore neural networks for tap location classification for PIN code inference in the field of privacy and security [17, 21]. PINlogger classified tapping on ten buttons using a single-layer neural network [21]. Liang et al. applied a two-layer CNN model to estimate tap location from the z-axis signal of the accelerometer [17]. Although these studies yielded promising results, their network capacity was relatively small and the reported accuracy was still far from a practical level.
To improve tap recognition, we develop a multi-input and multi-output neural network that can estimate multiple tap properties. This section reviews two related concepts: multi-task learning [3] and learning with auxiliary information. Multi-task learning with a neural network leads to a multi-branch architecture. Each branch addresses one of the recognition tasks, such as different targets in tracking [22], languages in translation [13], and head poses in gaze estimation [43]. The shared layers of all branches extract the common knowledge across tasks, whose generalizability is ensured by the regularization effect across tasks. Multi-task models employ multiple loss functions and thus more supervision during training, which can potentially help learning. To our knowledge, no attempt has been made to apply multi-task learning to off-screen tap recognition, and the multi-task architecture can help each tap recognition task learn from the interrelationship among the tasks.

We are also the first to exploit the form factor of the phone as auxiliary information, which has been shown to be beneficial for training. Zhang et al. investigated the benefit of using auxiliary information in training and found that it can improve performance in testing even without using the auxiliary information as input [42]. Stephenson et al. pointed out that conditioning on auxiliary information can achieve higher robustness than appending auxiliary information directly to the main features [28]. Liao et al. showed that integrating simple but essential auxiliary information can increase prediction accuracy [18].
Figure 2: Accelerometer and gyroscope signals are processed in the Gating component and passed to TapNet if they meet the set criteria. Using the phone form factor (embedded in the device vector) as auxiliary information, TapNet jointly recognizes multiple tap properties, including tap event, direction, finger part, and location.
Moving beyond prior work in this space, this project aims at developing IMU-based input methods that meet the requirements of practical applications, by means of deep neural network design and training. The key objective is to achieve the five recognition tasks (i.e., the five network outputs) with one network (TapNet), as shown in Table 1. Each task aims to recognize one tap property, such as direction or location. We first describe the pipeline overview, followed by the core of the method (a MIMO network). We then discuss the gating component and signal filtering for recognition.
Task         Paradigm     TapNet output
Event        2-class      tap, non-tap
Direction    6-class      front, back, left, right, top, bottom
Finger part  2-class      finger pad, finger nail
Location     35-class     the region ID in a 5x7 grid
Location     regression   x- and y-ratio relative to the screen width and height

Table 1: The five tasks of TapNet: four classification tasks to detect different tap properties, such as tap direction and location, and one regression task to estimate tap location.
As shown in Figure 2, the system listens to the six-channel IMU (three-axis accelerometer and three-axis gyroscope) data and maintains a 150-ms data window. To avoid unnecessary down-stream computation for recognizing tap properties, a heuristic-based gating mechanism using the z-axis signal of the accelerometer is applied to reject obvious non-tap motions. If the signal passes the gating, the six-channel IMU signals within a 120-ms feature window are concatenated into a one-dimensional feature vector. The feature window is identified and aligned across tap samples by the tap-induced peak in the z-axis signal. TapNet, a multi-input, multi-output convolutional neural network, takes this feature vector and a device vector as input. The device vector describes the phone size, the IMU pointing direction, and the IMU installation location relative to the upper left corner of the device housing. TapNet then outputs predictions of the five tap properties at a time, as shown in Table 1.
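To make this pipeline concrete, the sketch below shows how a rolling six-channel buffer could be gated and packed into the two model inputs. The sampling rate matches the value reported later in the paper, but the helper names (passes_gate, build_inputs), the simple thresholding, the centering of the feature window, and the placeholder device vector are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

SAMPLE_RATE_HZ = 416                       # approximate IMU rate reported in the paper
WINDOW_N = int(0.150 * SAMPLE_RATE_HZ)     # 150-ms rolling data window
FEATURE_N = int(0.120 * SAMPLE_RATE_HZ)    # 120-ms feature window

def passes_gate(accel_z, threshold=1.0):
    """Cheap heuristic gate on the accelerometer z-axis: only run TapNet
    when a tap-like peak is present (threshold value is an assumption)."""
    return np.max(np.abs(accel_z)) > threshold

def build_inputs(imu_window, device_vector):
    """imu_window: (WINDOW_N, 6) array of accel x/y/z and gyro x/y/z.
    Align the feature window to the tap-induced z-axis peak and concatenate
    the six channels into one 1-D vector, TapNet's primary input."""
    peak = int(np.argmax(np.abs(imu_window[:, 2])))            # accel z-axis peak
    start = int(np.clip(peak - FEATURE_N // 2, 0, WINDOW_N - FEATURE_N))
    feature = imu_window[start:start + FEATURE_N, :]           # (FEATURE_N, 6)
    signal_vec = feature.T.reshape(-1)                         # channel after channel
    return signal_vec, np.asarray(device_vector, dtype=float)
```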
TapNet uses a multi-input and multi-output architecture. We use the IMU signals as the primary input and a device vector that describes the phone form factor as the auxiliary input. Using this auxiliary information helps accommodate differences in device form factor and enables cross-device training.

The output of TapNet contains multiple branches. Since different tap properties have a confounded impact on the IMU responses, we jointly learn the mappings from IMU signals onto these interrelated tap properties using a multi-layer CNN. In this architecture, the different tap-related tasks share four convolutional layers, which extract the common shape patterns that are indicative across tap properties. Following the shared layers, there are branches of fully connected layers, each of which extracts property-specific patterns for an individual task. In practice, shared layers for tap properties also mean shared computation. Therefore, TapNet requires less computation and memory than multiple networks each trained for a single task. Training over an intersection of multiple tap tasks confines the learning to a restrained feature embedding space and thus allows it to converge to a solution for all related tasks [13]. Multi-task learning allows for good alignment between the feature embeddings of different tasks. Such additional guidance or supervision can potentially prevent over-fitting, especially given few training samples.
As illustrated in Figure 2, a one-channel CNN is used for tap recognition. We exploit convolutional layers to extract shape features from the IMU signals, because the convolutional filter offers temporal locality to capture signal dynamics and shared weights to reduce trainable parameters. In general, a convolutional filter in early layers describes basic shape features, such as a peak, a valley, or a certain degree of slope. A convolutional filter in later layers, with a larger receptive field, is more likely to capture shape semantics, such as a large-magnitude peak or an impulse with double peaks.

Regarding the choice between a one-channel and a multi-channel network, we see that the one-channel network can be more efficient as it allows for filter reuse across IMU channels. Conventional methods applied multi-channel CNNs to describe signal alignment across channels, and this approach has been widely used to handle EEG [25] and IMU [36] data. However, a large number of filters are required to depict the fine-grained signal alignment combinations of the six-channel signals, which increases the model training difficulty and the demand for data. In contrast, we propose to concatenate the six-channel data into a one-dimensional vector and apply one-channel convolutional layers in TapNet. By doing so, each one-channel convolutional filter can be reused to detect shape features across channels, and the fully connected layers then draw decisions by associating filter activations in different parts of the signal. Separating shape extraction from alignment analysis conceivably mitigates the model training difficulty and achieves an efficient use of data. We also evaluated this design in an experiment.
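To make the architecture concrete, below is a minimal Keras sketch of a multi-input, multi-output network in the spirit described above: one-channel convolutions over the concatenated six-channel signal, a device-vector auxiliary input, shared layers, and one output head per task. The input length, device-vector size, filter counts, head sizes, and loss choices are illustrative assumptions, not the published TapNet configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

FEATURE_LEN = 6 * 50      # six channels concatenated; length is an assumption
DEVICE_DIM = 5            # size of the device (form factor) vector; assumption

signal_in = layers.Input(shape=(FEATURE_LEN, 1), name="imu_signal")
device_in = layers.Input(shape=(DEVICE_DIM,), name="device_vector")

# Shared one-channel convolutional stack (task-agnostic shape features).
x = signal_in
for filters in (32, 32, 64, 64):          # four shared conv layers; sizes assumed
    x = layers.Conv1D(filters, kernel_size=5, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling1D(2)(x)
x = layers.Flatten()(x)

# Condition the shared representation on the auxiliary device vector.
shared = layers.Concatenate()([x, device_in])

def head(name, units, activation):
    """Task-specific fully connected branch."""
    h = layers.Dense(64, activation="relu")(shared)
    return layers.Dense(units, activation=activation, name=name)(h)

outputs = [
    head("event", 2, "softmax"),          # tap vs. non-tap
    head("direction", 6, "softmax"),      # front/back/left/right/top/bottom
    head("finger_part", 2, "softmax"),    # finger pad vs. fingernail
    head("location_cls", 35, "softmax"),  # 5x7 grid region
    head("location_reg", 2, None),        # x/y ratio of screen width/height
]

model = Model([signal_in, device_in], outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),
    loss={
        "event": "sparse_categorical_crossentropy",
        "direction": "sparse_categorical_crossentropy",
        "finger_part": "sparse_categorical_crossentropy",
        "location_cls": "sparse_categorical_crossentropy",
        "location_reg": "mae",
    },
)
```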
Although TapNet is rather lightweight compared with vision-based convolutional networks, performing recognition on every frame generates unnecessary overhead. Our observation, as well as previous findings [17], suggests that a peak or valley in the accelerometer z-axis output is a necessary (high recall) but not sufficient (low precision) signal of a tapping event on the device. As such, we use the z-axis signal of the accelerometer as a gating signal for the CNN in order to minimize the amount of computation (and hence power consumption) on the mobile device. Although the gating signal alone is not enough for accurate recognition of tap properties, it is sufficient for tap-like (high recall, low precision) event detection, and thus qualifies as a gate for the CNN recognition. Therefore, we only perform CNN recognition when we observe a tap-like signal in the gating signal.

Specifically, we first perform a simple, linear-complexity peak detection on the gating signal at the per-frame level (Figure 2, the pink region). This peak detection yields a set of peaks and valleys and their corresponding timestamps. We define a tap-like signal as one that has an impulse containing at least one peak, and an impulse as a group of peaks in which the interval between each pair is less than a threshold, $T_v$, empirically set to 80 ms. We also apply a magnitude threshold on the peaks and valleys to control the sensitivity of the gating component.

Sensor systems on smartphones provide motion data in a number of formats, including the raw signal, its gravity-excluded counterpart, and the rotation vector. As we aim to detect tap events and identify tap properties in an orientation-invariant manner, we leverage the raw accelerometer and gyroscope data and compute their first-order derivatives. These signals therefore represent the change of force on the housing of the phone and the resulting change of rotation velocity, and they stay at zero when the phone is stationary. In addition to orientation invariance, we also observe that the first-order derivatives of the accelerometer and gyroscope signals have high shape similarity. As such, using this representation conduces to the reuse of convolutional filters and alleviates network training difficulty. All tap feature vectors are temporally aligned. Specifically, the first major peak in the accelerometer z-axis is aligned to a designated frame (the 106th) in the feature vector, and this determines the feature window location. Then 50 sensor samples (∼120 ms) of the six channels form the feature vector.
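The gating rule and the derivative-based signal representation described above can be sketched as follows. Only the 80-ms impulse-grouping interval and the use of first-order derivatives come from the text; the peak-list representation, function names, and the example magnitude threshold are illustrative assumptions.

```python
import numpy as np

T_V_MS = 80            # maximum interval between peaks within one impulse (from the text)
MAG_THRESHOLD = 1.0    # sensitivity threshold on peak magnitude (illustrative value)

def first_order_derivative(raw_imu):
    """raw_imu: (n_samples, 6) raw accelerometer + gyroscope readings.
    The first-order derivative stays near zero while the phone is stationary,
    making the representation orientation-invariant."""
    return np.diff(raw_imu, axis=0, prepend=raw_imu[:1])

def group_impulses(peak_times_ms, peak_mags):
    """Group peaks (assumed sorted by time) into impulses: consecutive peaks
    whose inter-peak interval is below T_V_MS belong to the same impulse."""
    impulses, current = [], []
    for t, m in zip(peak_times_ms, peak_mags):
        if current and t - current[-1][0] >= T_V_MS:
            impulses.append(current)
            current = []
        current.append((t, m))
    if current:
        impulses.append(current)
    return impulses

def is_tap_like(peak_times_ms, peak_mags):
    """A tap-like signal has at least one impulse containing a peak whose
    magnitude exceeds the sensitivity threshold."""
    return any(
        any(abs(m) > MAG_THRESHOLD for _, m in impulse)
        for impulse in group_impulses(peak_times_ms, peak_mags)
    )
```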
This section describes two datasets: a one-person dataset (135,260 samples) used for TapNet training only (not testing) and a multi-person dataset (38,545 samples) used for performance testing and user-generalizability tuning. Our hypothesis is that a well-designed data collection protocol can ensure training data diversity even from one person and be sufficient for some tap recognition tasks. When we need to improve the model, it is relatively easy to enlarge the one-person training set, evaluate on the test set, and repeat this process in an iterative fashion. This is more efficient than collecting additional multi-person data.

We used a single-researcher training strategy. One of us produced the entire training dataset following a comprehensive data collection protocol, which aims to cover the diversity of real use. Tap data were collected through multiple sessions on Phone A (Pixel 3) and Phone B (Pixel 3 XL). Each session aimed to collect taps in a specific condition described by a set of characteristics of the phone, the grip gesture, and the tapping itself (see Table 2). A visual indicator was presented on the screen to guide the experimenter. The experimenter could examine the finger alignment with the target location by turning the phone back and forth to ensure annotation correctness. Please refer to the supplemental file for a detailed description.

The advantages of this strategy are three-fold: 1) it is low-cost and convenient for rapid development iteration, 2) it produces systematic and diverse data, and 3) the data quality is manageable and assessable. This strategy allows us to use the researcher's intuition about the problem space to push the recognition envelope to the fullest degree, for example by including training data with different grip gestures and forces. The risks of this strategy include the potential lack of diverse tapping patterns that a single person cannot fully represent, as well as the impact of personal biomechanics. We mitigated these risks by evaluating on a separate multi-person dataset, and we repeated data collection under the same conditions to mitigate bias.

In the end, the one-person data collection took over 30 sessions (around half an hour each) over 69 days. In total, the one-person training dataset contains 109,200 tapping actions, including 45,500 front taps, 48,450 back taps, 4,020 left taps, 4,020 right taps, 3,610 top taps, and 3,600 bottom taps. Among these tapping actions, 94,764 were collected on Phone A and 14,440 on Phone B, and 85.7% of them were performed with the finger pad and the rest with the fingernail.
Condition            #   Options
Phone size           2   Phone A (5.5"), Phone B (6.3")
Phone case           2   with case, without case
Orientation          4   portrait, portrait downward, landscape +90°, landscape -90°
Grip and tap manner  7   tapping with one hand (the four combinations of holding in the left/right hand and tapping with the left/right hand), both hands with thumbs, tapping on the phone lying on a solid surface, and on a soft one
Grip force           3   phone rests on the palm, grasping the phone with normal force, grasping with strong force
Grip location        5   bottom, mid-bottom, middle, mid-top, top
Grip gesture         4   grasping the phone against the palm with one, two, three, or four fingers
Thumb gesture        2   resting on the screen edge, hovering above the screen
Tap event            2   tap, non-tap
Tap finger part      2   finger pad, finger nail
Tap direction        6   front, back, left, right, top, bottom
Tap location         35  5x7 grid on the housing of the phone
Tap force            5   extremely gentle, gentle, normal, strong, extremely strong
Table 2: The one-person dataset contains samples in different conditions of phone, grip gestures, and tapping actions.

We also collected non-tap motions (26,060 in total) in a number of scenarios: grasping the phone (4,700), rubbing the phone case (6,000), releasing a tap (6,160), knocking on the surface the phone was placed on (4,400), and shaking and moving with the phone in a pocket or bag (4,800). This dataset (135,260 samples including tap and non-tap) was used solely for training (or "teaching"). None of it was used in any testing, to ensure the validity and generalizability of the efficacy measures reported later.
To evaluate TapNet's performance, we collected a multi-person (n=31) dataset. We focused on collecting natural tapping actions to simulate the tapping data distribution in real use, in particular with the common fingers (thumb and index), within the fingers' comfort range, and for the four most common one-handed and two-handed grip gestures [11, 15], as shown in Figure 3.
Taps with these four grip gestures were collected under combinations of phone, grip, and tapping conditions, as shown in Table 3. To keep the data collection natural, we avoided continuous and extensive tapping. For every condition, participants were asked to tap five to ten times and to rest at various intervals to avoid fatigue. In addition, the data collection was informed by research on the most common hand grips and finger placements [11, 15]. Unrealistic reaches for targets on the back surface of the phone were excluded.
Condition            Options
Phone size           Phone A (61%), Phone B (39%)
Phone condition      with case (58%), without case (42%)
Grip and tap manner  one-handed/portrait (thumb & index), two-handed/portrait (thumbs & indexes), two-handed/landscape (thumbs & indexes), two-handed/portrait (index)
Handedness           right (90%), left (10%)
Hand size            small (32%), medium (48%), large (20%)
Tap direction        front (54%), back (28%), left (4%), right (4%), top (5%), bottom (5%)
Tap location         comfort range in the 5x7 grid of the screen/back
Table 3: The multi-person dataset contains samples in different conditions of phone, grip gestures, and tapping actions. The number in parentheses shows the data percentage.

Figure 3: Four common phone grip gestures investigated in the multi-person dataset. Only natural tapping actions with these grip gestures were collected for testing.
We used a 5.5" Phone A and a 6.3" Phone Bthroughout the collection. To provide the most natural feeling,we assigned phone size in the collection based on participants’ per-sonal phone size. To achieve a balance ratio, we selected participantsaccording to their personal phone size in recruitment.
We recruited 31 participants (13 female, age: 18-54). We recorded a number of factors that may affect tapping signal responses, including finger length, fingernail length, hand size, and handedness. During data collection, the experiment interface adapted to the handedness of each participant to ensure the comfort range was valid. Overall, the multi-person dataset contains 38,545 taps, including 20,615 front taps, 10,505 back, 1,705 left, 1,705 right, 1,975 top, and 2,040 bottom taps, among which 58.3% were collected on phones with rubber cases and the rest without.
This section evaluates the overall performance of TapNet, followed by a comparison between one- and multi-channel CNNs as well as ablation studies on the MIMO network design. We measured classification performance by F1 score and regression performance by mean absolute error (MAE) and r² score. The F1 score is a weighted average metric of precision and recall, ranging from 0 to 1. The MAE of tap location, i.e., $\sqrt{[(g_x - t_x)/w]^2 + [(g_y - t_y)/h]^2}$, is computed as the Euclidean distance between the ground truth location $(g_x, g_y)$ and the TapNet output $(t_x, t_y)$, normalized by the screen width $w$ and height $h$. Its range is [0, 1.414], and a baseline that always predicts the screen center scores 0.707. The r² score measures how well the regression fits the data; its best value is 1, and a model no better than predicting the mean yields 0 or below.

In the implementation, training to recognize the presence of a tap event requires both tap and non-tap data (e.g., phone shaking and rubbing motions). As the non-tap data does not have annotations of tap finger part, direction, and location, it cannot be used to train the rest of the network. As such, the two parts of the network (tap event vs. the rest of the tap properties) were trained in turn: the tap event branch was trained once after every ten epochs of training the other tap property branches.

We applied ReLU and batch normalization after each convolutional layer and used the Adam optimizer with a learning rate of 1e-4 and a momentum decay of 1e-6. Each model was trained for a sufficient number of epochs until the validation loss (on 5% of the data) converged, and training was done using TensorFlow [1] on a GPU with 4 GB of memory. The datasets were collected on Phone A and Phone B running Android 9, and the IMU sampling rate was approximately 416 Hz. At run time, one TapNet inference took 0.56 ms on the Phone B main CPU. A simplified TapNet can also run in real time (9 ms) on the embedded DSP using TensorFlow Lite for Microcontrollers [31], but this is beyond the scope of this paper. The major latency (105 ms) of the whole pipeline lies in waiting to observe the complete tap signal in the feature window.

Due to space limitations, we have put some of the implementation evaluations in the supplemental file. Interesting findings include: 1) a high sensor sampling rate is crucial to tap recognition accuracy; 2) the use of the gyroscope in our task configuration is important; and 3) data augmentation by temporally shifting the samples contributes to the performance gain, but scaling the samples does not.
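For reference, the normalized tap-location error used in the evaluations below can be computed as in this minimal sketch; the function name and the (x, y) array layout are illustrative assumptions.

```python
import numpy as np

def normalized_location_mae(gt_xy, pred_xy, screen_w, screen_h):
    """Mean Euclidean distance between ground-truth tap locations (g_x, g_y)
    and predicted locations (t_x, t_y), with each axis normalized by the
    screen width and height. The range is [0, 1.414]; the paper cites 0.707
    as the baseline of always predicting the screen center."""
    gt = np.asarray(gt_xy, dtype=float)
    pred = np.asarray(pred_xy, dtype=float)
    dx = (gt[:, 0] - pred[:, 0]) / screen_w
    dy = (gt[:, 1] - pred[:, 1]) / screen_h
    return float(np.mean(np.sqrt(dx ** 2 + dy ** 2)))
```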
This section evaluates TapNet's improvement over the prior art and its user generalizability when trained on one-person data.

We implemented and trained four ML algorithms, including TapNet, to compare their relative performance. The overall performance, measured by the weighted average F1 score and MAE across participants and devices, is shown in Table 4. Support Vector Machine (SVM) represents a line of studies [19, 39] that used traditional machine learning methods. TinyCNN is a replication of Liang et al.'s two-layer CNN [17]. SISO refers to a truncated single-input and single-output (SISO) TapNet. It uses the same training configuration (Adam optimizer and learning rate) as the MIMO TapNet, and it gives a performance reference for TapNet configured for a single task. All the methods were trained on the one-person dataset and tested on the multi-person dataset.

Overall, MIMO TapNet significantly outperforms the prior art [17, 19, 39]. Compared with the best performance among SVM and TinyCNN, TapNet achieved considerable improvements on tap direction (by 51.0%) and location (classification: 161.5%; regression: 30%). The SISO TapNet variant also outperforms related work by a marked margin. TapNet generally outperforms its SISO variant, implying that the MIMO architecture also contributes to the performance improvement, in addition to the computation and memory benefits.
Method         Event (2-class)   Finger part (2-class)   Direction (6-class)   Location (35-class)   Location regression, MAE (r²)
SVM [19, 39]   .73               .93                     .55                   .13                   .20 (.28)
TinyCNN [17]   .89               .93                     .47                   .03                   .40 (-2.4)
SISO TapNet    .88               .89                     .82                   .30                   .19 (.12)
TapNet         .87               .94                     .85                   .34                   .14 (.49)

Table 4: Weighted average F1 scores of the four classification tasks, and MAE and r² score of the tap location regression task (last column). SVM and TinyCNN are our re-implementations of related works [17, 19, 39]. SISO TapNet is the truncated single-input and single-output variant. TapNet is the proposed MIMO method. Note that we built one model per task for the single-output methods, whereas TapNet is a multi-task network that gives predictions for all five tasks at a time. Finger part, direction, and location are not conditional on the event detection. Higher is better for the F1 and r² scores; lower is better for MAE.

A close examination of the 35-class tap location classification reveals that TapNet can be usable despite its seemingly limited F1 score (0.34). Figure 4(a) shows the normalized confusion matrix of tap location classification. A neighboring region above or below the target in the 5x7 grid has an index offset of five. The three parallel lines along the diagonal with a five-cell offset in the confusion matrix therefore indicate that TapNet either predicts correctly or predicts a nearby region (see Figure 4(c)). This is in good agreement with the regression results (MAE: .14, i.e., ∼10% of the screen diagonal). Such a location error is similar to the distance between two icons on the phone home screen (see the supplemental video figure). It thus indicates that IMU-based tap location recognition can be viable and useful in situations that do not require very high resolution and where capacitive sensing is inadequate (e.g., wearing gloves or under water).
Figure 4: (a) Normalized confusion matrix of the 35-class location classification when training on one person and testing on multiple. (b) An enlarged part of the confusion matrix. (c) The region IDs in a five-by-seven grid. Each region refers to a cell in the grid. Neighboring areas above or below the target have an index offset of five. Predicting to the nearby areas therefore leads to three parallel lines along the diagonal in the confusion matrix.

However, the performance difference on the simple tasks (event and finger part classification) among the different methods is marginal. This suggests that the minimal model capacity can be task-dependent; simple tasks may not benefit from an increase in model capacity, but the more complicated tasks can.
We hypothesized that training on diverse but one-person data can still achieve generalizability to unseen users in some tap recognition tasks. To verify this, we performed the evaluation in two paradigms:
• One-to-n-participant evaluation: TapNet was trained on the one-person dataset and tested on the multi-person dataset. This is the same as the previous evaluation.
• Leave-one-participant-out cross-validation: TapNet was pre-trained on the one-person dataset, fine-tuned on n-1 participants from the multi-person dataset, and tested on the left-out participant in an iterative fashion.
By comparing the one-to-n-participant evaluation with the leave-one-out evaluation (see Table 5), we see that training on one-person data can still be effective. As expected, fine-tuning on the n-1 participants further improved the performance, but some of the improvements are relatively moderate, for example, 1% for tap event and 3.2% for finger part classification. This corroborates the hypothesis that training on diverse data, even from a single person, can achieve viable user generalizability for tasks such as event (F1: .92), finger part (F1: .93), and direction classification (F1: .85). That said, we also learned that tap location recognition relies on very subtle IMU responses, which may relate to personal biomechanics and are hard to simulate by a single person.
Evaluation paradigm   Event (2-class)   Finger part (2-class)   Direction (6-class)   Location (35-class)   Location regression, MAE (r²)
one-to-n              .92               .93                     .85                   .42                   .15 (.52)
leave-one-out         .93               .96                     .92                   .54                   .11 (.73)

Table 5: Performance comparison between the one-to-n-participant and leave-one-participant-out evaluations. The first four columns are classification tasks (F1 score); the last column is the tap location regression task.
This section presents the ablation studies on the TapNet architecture: the use of multi-task learning (i.e., multi-output) and cross-device training with auxiliary information (i.e., multi-input). We evaluated on tap direction classification as a representative task.
We first present the evaluation of multi-task learning. We compared against the single-output counterpart (SISO, a truncated TapNet variant) and recent related works, including an SVM [19, 39] and a tiny CNN model [17]. Limited by the tap samples on Phone B (∼

Figure 5: Weighted average F1 score of tap direction classification using multi-task learning (SIMO TapNet) and single-task learning (SISO TapNet). TapNet (either SIMO or its truncated variant) considerably outperforms an SVM [19, 39] and a tiny CNN model [17]. TapNet with multi-task learning shows an advantage given small training data (e.g., 3K samples).

To investigate cross-device model training, we evaluated model performance with incremental training samples from 1K to 15K, as previous experiments suggest that TapNet converges with around 15K training samples. Note that in the case of training with 15K samples, TapNet was jointly trained on 7.5K samples from Phone A and another 7.5K samples from Phone B. Similarly, its counterparts were pre-trained on 7.5K samples from one device and fine-tuned on the other. As baselines, we also evaluated training on the 7.5K device-specific samples alone for each device.

Figure 6 gives the performance comparison. A+B represents TapNet jointly trained on cross-device data; B->A and A->B denote tuning the model pre-trained on one device on the other; and A and B indicate models trained on device-specific data alone. The overall performance of the different models increases with the number of training samples and flattens out after 4.5K samples per device. More interestingly, cross-device training with the sensor location as auxiliary information (A+B) can approach the performance upper bound earlier (with 1.5K samples per device) than the rest of the models.
Figure 6: Performance comparison of tap direction classification with auxiliary information (A+B) against fine-tuning (B->A, A->B) and training on device-specific data (A, B) alone. The x-axis shows the number of training samples per device and the y-axis the weighted average F1 score. The jointly trained TapNet yields performance comparable to that of the models fine-tuned on specific devices, and it outperforms the models trained on device-specific data alone.
To verify the efficacy of the one-channel CNN for tap recognition, we performed a comparison on tap direction classification, which is a representative task and was demonstrated to require subtle signal alignment across channels. As the input data dimension affects the number of trainable parameters (i.e., model capacity) of a given network architecture, we evaluated similar architectures with commensurate numbers of trainable parameters, so as to compare models of comparable capacity. This experiment was conducted on the Phone A data using the one-to-n evaluation paradigm.

Figure 7 shows the performance comparison of the one-channel CNN against its multi-channel counterparts. The x-axis shows the number of training samples, and the y-axis the weighted average F1 score. The solid lines denote the performance of large-capacity models with 163K (one-channel) and 144K (six-channel) trainable parameters, while the dashed lines denote those of small-capacity models with 11K (one-channel) and 9K (six-channel) trainable parameters. Most importantly, the small-capacity one-channel CNN (purple dashed) and the large-capacity six-channel CNN (blue) have almost equivalent performance with 6K-12K training samples. Further, comparing the solid and dashed lines, large-capacity models generally outperform small-capacity models with a similar architecture and input format. Taken together, we conclude that the one-channel CNN can address cross-channel signal alignment more efficiently than its multi-channel counterpart for tap property recognition.
Figure 7: Performance comparison of the one-channel CNN (TapNet) against its six-channel counterpart at two levels of model capacity (large models: 163K and 144K trainable parameters; small models: 11K and 9K). These are average results over three runs to cancel out training randomness. The one-channel CNN with 11K parameters achieves performance comparable to the six-channel CNN with a larger number of parameters (144K). We tuned the number of filters and maintained their ratios across layers to roughly match the total number of trainable parameters between the one-channel and six-channel models.

We designed, developed, and evaluated a set of deep learning methods for on-device, IMU-signal-based off-screen input, in particular TapNet, a multi-task network that allows for cross-device training with phone form factor as auxiliary information and joint prediction of tap-related tasks, including tap event, direction, finger part, and location classification, as well as the regression of tap location. This architecture not only shares knowledge across tap properties and devices during training, but also shares computation and memory at run time. Some of the TapNet building blocks are not novel in fields such as computer vision, but bringing them to a practical performance level for an interaction gesture required original research. TapNet achieved a marked improvement over the prior art [17, 19, 39] on tap direction (by 51.0%) and location (161.5%) classification as well as tap location regression (30%).

We also discovered encouraging generalizability across users when the model was trained (or "taught") by one person. We tested TapNet on people who had not been in any set of the data collection and encouraged them to explore the effects of TapNet freely on their own devices in daily use (for functions such as taking a screenshot). Their experience matched the leave-one-participant-out test results reported here. This high generalizability is probably because the inter-class differences (e.g., finger pad vs. nail in the finger part classification task) are generally much greater than the inter-person differences, such that personal biomechanics impose only a negligible effect.

Another advantage of our approach is that an expert designer can intentionally push the variations of an intentional tap gesture in terms of speed, strength, angle, and hand posture. However, it is possible that certain recognition tasks, such as higher-resolution tap location classification, could demand more fine-grained information only available in person-specific data. Training on incremental one-person data increases performance up to a certain level and then plateaus (see Figure 8). Further adapting the model to multi-person data may teach the model to understand the artifacts of personal biomechanics, and thus further improve the performance until its next plateau.
Figure 8: Conceptual relation between performance and training strategy with one-person and multi-person data.

This performance gain comes from knowing how much people can vary. To reach the ideal performance, it becomes necessary to know the person-specific information, i.e., how exactly the target user acts. In practice, especially for learning-based gesture studies, it can be beneficial to identify the curve in Figure 8 and then decide the number of training participants needed for specific tasks.

TapNet opens up new interaction opportunities, such as one-handed interaction that uses off-screen tapping or designated tap gestures. TapNet is robust to phone cases and fabric, and thus can be used for wearable interaction without specialized smart fabrics [7]. This study also offers new potential for other research domains, including biometrics. For instance, it is possible to improve continuous and passive authentication [2, 35, 46] by using tap properties that contain biometric information, i.e., where there is a clear gap between the first two plateaus (the orange area) in Figure 8.

Although the evaluation results demonstrated that TapNet benefited from cross-device training, we do not expect the current TapNet trained on these two devices to apply directly to unseen devices without further training, i.e., to be a fully cross-device model. However, the joint training architecture has been shown to be effective, and this is a promising step toward a device-adaptive model. We have set up the infrastructure, and we plan to further investigate along this line by adding new devices and open-sourcing our implementation.
Without adding new sensor hardware, TapNet enables many new input possibilities on smartphones. In addition to providing shortcuts for quick activation of apps or functions such as the camera and screenshots, this section sketches out four applications enabled by TapNet's capability of measuring multiple properties in addition to the presence of a tap event, such as direction and location. Please refer to our supplemental video for these applications in action.
AssistiveTap allows users to complete a number of system interactions using back taps and tilting. Users can perform a back double tap to invoke the AssistiveTap interface (see Figure 9a), then tilt the phone (based on the same IMU signal input to TapNet) to select a gesture, and back tap to perform the selected gesture (e.g., the 'back' gesture, scrolling, or app switching). In other words, AssistiveTap provides back-of-device alternatives for the most commonly used on-screen gestures, using the combination of back tap and tilting. It can be beneficial in situations where one-handed interaction is preferable or even required.

Figure 9: Four example use cases enabled by TapNet. AssistiveTap uses back tap and tilting for one-handed interaction. ExplorativeTap introduces a new two-sided interaction paradigm for users with visual impairments. Interactive wallpapers enable interaction with UI background objects. Inertial Touch senses tapping by force and angle changes and provides auxiliary information for capacitive sensing.
ExplorativeTap is a two-handed interaction method that exploits signals from both the front (touch) and the back (tap) of the phone (see Figure 9b). Users can glide the exploration finger on the screen over on-screen objects, hear them, and perform a back tap to confirm a selection. This is an improvement for visual impairment accessibility modes, such as VoiceOver on iOS and TalkBack on Android. These accessibility modes occupy the common on-screen gestures (e.g., swipe and touch), and their users need to learn a more complicated set of system navigation gestures. In contrast, ExplorativeTap starts the accessibility mode with a double back tap, selects an object with a single back tap, and exits when the exploration finger is lifted. It therefore resolves the conflicts with system navigation gestures, and thus can be helpful for users with low vision and print disabilities who need quick and temporary access to the accessibility mode.
Off-screen tap recognition makes it possible to interact with objects living in different interface layers, such as background targets that do not react to on-screen touch. For instance, TapNet can enable users to interact with flashcards or news feeds shown in the wallpaper (see Figure 9c). They can back tap to change the flashcard or news feed, or edge tap to switch to a different set, and this can be done even on the lock screen. It enables users to utilize bits and pieces of time for their favorite spare-time activity.
Inertial Touch estimates tap force from the IMU signals and tap location from the TapNet output. We can define an Inertial Touch as a fast touch event with a strong tapping momentum. As TapNet estimates tap location from force and angle changes, it can function even when capacitive sensing fails (Figure 9d). Common use situations include when users are wearing gloves, have long fingernails, or when there are water drops on the touchscreen.
The aforementioned use cases can be particularly useful in challenging situations where one-handed or non-contact interactions are preferable or required. For example, AssistiveTap exploits back tap and tilting to partially address the issue of the limited thumb reach area during one-handed interaction. ExplorativeTap allows for coordination between on-screen and off-screen interactions and thus avoids the need to learn an additional set of on-screen gestures. Inertial Touch offers auxiliary impact information beyond capacitive sensing. These are just a few examples of applications tapping into TapNet's ability to detect multiple tap properties.
This paper presents the design, training, implementation, and applications of TapNet for off-screen mobile input. TapNet employs a multi-task convolutional neural network with motion signals as primary input and phone form factor as auxiliary information. It allows for joint learning on data across devices and simultaneous estimation of multiple tap properties. In comparison to many alternatives, this neural network architecture worked the best towards our goal of increasing the UI design space of off-screen interactions. To train and optimize this and other alternative ML models, we developed a one-person dataset for training and a multi-person dataset for testing. The evaluation results show that 1) TapNet significantly outperformed the state of the art, especially in difficult recognition tasks such as tap location estimation; 2) multi-task learning is more data efficient, showing greater advantage particularly with a limited amount of training data; 3) cross-device training with phone form factor increases the efficiency of data utilization; and 4) a one-channel CNN can achieve cross-channel signal alignment more efficiently than its multi-channel counterpart. We verified the hypothesis that training on one-person data can still generalize well across users if the diversity of the training data is ensured. This sheds light on the conceptual relation between model performance and ML training strategy in IMU-based input systems. We demonstrated that many new interaction use cases can be enabled by TapNet. Taken together, the TapNet project made significant progress towards practically enabling and enlarging the off-screen interaction design space by deep learning from on-device IMU signals, establishing new benchmarks with reproducible results, datasets, and codebase.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, USA, 265–283.
[2] Sara Amini, Vahid Noroozi, Sara Bahaadini, S Yu Philip, and Chris Kanich. 2018. DeepFP: A Deep Learning Framework For User Fingerprinting via Mobile Motion Sensors. In . IEEE, USA, 84–91.
[3] Rich Caruana. 1997. Multitask learning. Machine Learning 28, 1 (1997), 41–75.
[4] Ke-Yu Chen, Shwetak N Patel, and Sean Keller. 2016. Finexus: Tracking precise motions of multiple fingertips using magnetic sensing. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, USA, 1504–1514.
[5] Christian Corsten, Bjoern Daehlmann, Simon Voelker, and Jan Borchers. 2017. BackXPress: Using back-of-device finger pressure to augment touchscreen input on smartphones. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, USA, 4654–4666.
[6] Alexander De Luca, Emanuel Von Zezschwitz, Ngo Dieu Huong Nguyen, Max-Emanuel Maurer, Elisa Rubegni, Marcello Paolo Scipioni, and Marc Langheinrich. 2013. Back-of-device authentication on smartphones. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, USA, 2389–2398.
[7] David Dobbelstein, Christian Winkler, Gabriel Haas, and Enrico Rukzio. 2017. PocketThumb: a Wearable Dual-Sided Touch Interface for Cursor-based Control of Smart-Eyewear. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 2 (2017), 9.
[8] Emilio Granell and Luis A Leiva. 2016. Less is more: Efficient back-of-device tap input detection using built-in smartphone sensors. In Proceedings of the 2016 ACM International Conference on Interactive Surfaces and Spaces. ACM, USA, 5–11.
[9] Chris Harrison, Julia Schwarz, and Scott E Hudson. 2011. TapSense: enhancing finger interaction on touch surfaces. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, USA, 627–636.
[10] David Holman, Andreas Hollatz, Amartya Banerjee, and Roel Vertegaal. 2013. Unifone: designing for auxiliary finger input in one-handed mobile interactions. In Proceedings of the 7th International Conference on Tangible, Embedded and Embodied Interaction. ACM, USA, 177–184.
[11] Steven Hoober. 2013. How do users really hold mobile devices. 18 (2013), 2327–4662.
[12] Michael Xuelin Huang, Jiajia Li, Grace Ngai, and Hong Va Leong. 2018. Quick Bootstrapping of a Personalized Gaze Model from Real-Use Interactions. ACM Transactions on Intelligent Systems and Technology (TIST) 9, 4 (2018), 43.
[13] Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. One Model To Learn Them All. arXiv preprint arXiv:1706.05137 (2017), 1–10.
[14] Huy Viet Le, Patrick Bader, Thomas Kosch, and Niels Henze. 2016. Investigating Screen Shifting Techniques to Improve One-Handed Smartphone Usage. In Proceedings of the 9th Nordic Conference on Human-Computer Interaction. ACM, USA, 27.
[15] Huy Viet Le, Sven Mayer, Patrick Bader, and Niels Henze. 2018. Fingers' Range and Comfortable Area for One-Handed Smartphone Interaction Beyond the Touchscreen. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, USA, 31.
[16] Huy Viet Le, Sven Mayer, and Niels Henze. 2018. InfiniTouch: Finger-aware interaction on fully touch sensitive smartphones. In The 31st Annual ACM Symposium on User Interface Software and Technology. ACM, USA, 779–792.
[17] Yi Liang, Zhipeng Cai, Jiguo Yu, Qilong Han, and Yingshu Li. 2018. Deep learning based inference of private information using embedded sensors in smart devices. IEEE Network 32, 4 (2018), 8–14.
[18] Binbing Liao, Jingqing Zhang, Chao Wu, Douglas McIlwraith, Tong Chen, Shengwen Yang, Yike Guo, and Fei Wu. 2018. Deep sequence learning with auxiliary information for traffic prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, USA, 537–546.
[19] Zhen Ling, Junzhou Luo, Yaowen Liu, Ming Yang, Kui Wu, and Xinwen Fu. 2018. SecTap: Secure Back of Device Input System for Mobile Devices. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications. IEEE, USA, 1520–1528.
[20] William McGrath and Yang Li. 2014. Detecting tapping motion on the side of mobile devices by probabilistically combining hand postures. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology. ACM, USA, 215–219.
[21] Maryam Mehrnezhad, Ehsan Toreini, Siamak F Shahandashti, and Feng Hao. 2018. Stealing PINs via mobile sensors: actual risk versus user perception. International Journal of Information Security 17, 3 (2018), 291–313.
[22] Hyeonseob Nam and Bohyung Han. 2016. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, USA, 4293–4302.
[23] Philip Quinn, Seungyon Claire Lee, Melissa Barnhart, and Shumin Zhai. 2019. Active Edge: Designing Squeeze Gestures for the Google Pixel 2. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, USA, 274.
[24] Gabriel Reyes, Jason Wu, Nikita Juneja, Maxim Goldshtein, W Keith Edwards, Gregory D Abowd, and Thad Starner. 2018. SynchroWatch: One-handed synchronous smartwatch gestures using correlation and magnetic sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1, 4 (2018), 158.
[25] Robin Tibor Schirrmeister, Jost Tobias Springenberg, Lukas Dominique Josef Fiederer, Martin Glasstetter, Katharina Eggensperger, Michael Tangermann, Frank Hutter, Wolfram Burgard, and Tonio Ball. 2017. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping.
[26] Karsten Seipp and Kate Devlin. 2014. BackPat: one-handed off-screen patting gestures. In Proceedings of the 16th International Conference on Human-Computer Interaction with Mobile Devices & Services. ACM, USA, 77–80.
[27] Marcos Serrano, Eric Lecolinet, and Yves Guiard. 2013. Bezel-Tap gestures: quick activation of commands from sleep mode on tablets. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, USA, 3027–3036.
[28] Todd A Stephenson, Mathew Magimai Doss, and Hervé Bourlard. 2004. Speech recognition with auxiliary information. IEEE Transactions on Speech and Audio Processing 12, 3 (2004), 189–203.
[29] Ke Sun, Ting Zhao, Wei Wang, and Lei Xie. 2018. Vskin: Sensing touch gestures on surfaces of mobile devices using acoustic signals. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking.
Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services. ACM, USA, 277–289.
[33] Pui Chung Wong, Hongbo Fu, and Kening Zhu. 2016. Back-Mirror: back-of-device one-handed interaction on smartphones. In SIGGRAPH ASIA 2016 Mobile Graphics and Interactive Applications.
SOUPS. ACM, USA, 187–198.
[36] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. 2015. Deep convolutional neural networks on multichannel time series for human activity recognition. In Twenty-Fourth International Joint Conference on Artificial Intelligence. AAAI Press, USA, 3995–4001.
[37] Hui-Shyong Yeo, Juyoung Lee, Andrea Bianchi, and Aaron Quigley. 2016. Sidetap & slingshot gestures on unmodified smartwatches. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, USA, 189–190.
[38] Chun Yu, Xiaoying Wei, Shubh Vachher, Yue Qin, Chen Liang, Yueting Weng, Yizheng Gu, and Yuanchun Shi. 2019. HandSee: Enabling Full Hand Interaction on Smartphone with Front Camera-based Stereo Vision. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, USA, 705.
[39] Cheng Zhang, Anhong Guo, Dingtian Zhang, Yang Li, Caleb Southern, Rosa I Arriaga, and Gregory D Abowd. 2016. Beyond the touchscreen: an exploration of extending interactions on commodity smartphones. ACM Transactions on Interactive Intelligent Systems (TiiS) 6, 2 (2016), 16.
[40] Cheng Zhang, Aman Parnami, Caleb Southern, Edison Thomaz, Gabriel Reyes, Rosa Arriaga, and Gregory D Abowd. 2013. BackTap: robust four-point tapping on the back of an off-the-shelf smartphone. In Proceedings of the Adjunct Publication of the 26th Annual ACM Symposium on User Interface Software and Technology. ACM, USA, 111–112.
[41] Cheng Zhang, Junrui Yang, Caleb Southern, Thad E Starner, and Gregory D Abowd. 2016. WatchOut: extending interactions on a smartwatch with inertial sensing. In Proceedings of the 2016 ACM International Symposium on Wearable Computers. ACM, USA, 136–143.
[42] Qilin Zhang, Gang Hua, Wei Liu, Zicheng Liu, and Zhengyou Zhang. 2014. Can visual recognition benefit from auxiliary information in training?. In Asian Conference on Computer Vision. Springer, USA, 65–80.
[43] Xucong Zhang, Michael Xuelin Huang, Yusuke Sugano, and Andreas Bulling. 2018. Training person-specific gaze estimators from user interactions with multiple devices. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, USA, 624.
[44] Xucong Zhang, Yusuke Sugano, Mario Fritz, and Andreas Bulling. 2017. MPIIGaze: Real-world dataset and deep appearance-based gaze estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 1 (2017), 162–175.
[45] Yang Zhang, Gierad Laput, and Chris Harrison. 2017. Electrick: Low-cost touch sensing using electric field tomography. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, USA, 1–14.
[46] Nan Zheng, Kun Bai, Hai Huang, and Haining Wang. 2014. You are how you touch: User verification on smartphones via tapping behaviors. In . IEEE, USA, 221–232.
[47] Junhan Zhou, Yang Zhang, Gierad Laput, and Chris Harrison. 2016. AuraSense: enabling expressive around-smartwatch interactions with electric field sensing. In