Augmenting Mobile Phone Interaction with Face-Engaged Gestures
Jian Zhao, Ricardo Jota, Daniel Wigdor, Ravin Balakrishnan
Department of Computer Science, University of Toronto

ABSTRACT
The movement of a user's face, easily detected by a smartphone's front camera, is an underexploited input modality for mobile interactions. We introduce three sets of face-engaged interaction techniques for augmenting traditional mobile inputs, which leverage combinations of head movements with touch gestures and device motions, all sensed via the phone's built-in sensors. We systematically present the space of design considerations for mobile interactions using one or more of the three input modalities (i.e., touch, motion, and head). The additional affordances of the proposed techniques expand the mobile interaction vocabulary, and can facilitate unique usage scenarios such as one-handed or touch-free interaction. An initial evaluation was conducted and users had positive reactions to the new techniques, indicating the promise of an intuitive and convenient user experience.
Index Terms:
H.5.2 [Information Interfaces and Presentation]: User Interfaces—Interaction Styles
INTRODUCTION
In recent years, mobile devices, such as smartphones, have become increasingly popular in our everyday life. Touch input, including tapping and flicking, is currently the leading interaction mechanism for high-end mobile phones. However, there are many situations where touch is limited. For instance, when outside during a cold winter, due to limitations of capacitive sensors, users have to take off their gloves to touch the screen, e.g., in order to change the playback of songs; when one hand is otherwise encumbered, users have trouble performing zoom (pinch) actions on their phones, e.g., when navigating maps. Under these circumstances, users could benefit from mechanisms that augment touch input, although they may not use the augmented gestures all the time.

Modern smartphones support many touch gestures, but also incorporate a myriad of sensors, including accelerometers, gyroscopes, and cameras, that can enable additional interaction affordances. There have been some previous attempts at leveraging these sensors to augment traditional touch input with device motion gestures (e.g., shaking and swiping) [4, 13, 21, 26]. However, there is another input channel—the movements of a user's face—that is easily detected by a phone's front camera, because people spend much time looking at their phones while using them, yet it is currently not effectively utilized as an input source.

With fast hardware enabling real-time face tracking on mobile devices, several techniques have been proposed to use such information to support more natural interaction, such as auto screen rotation [6] and multi-perspective panoramas [17]. However, in the previous work, the head is typically used as a static reference to compute the relative screen orientation of the phone, rather than as a highly interactive input modality. Although some off-the-shelf techniques, such as smart scroll [1], have also explored the use of head gestures, they are limited to utilizing the head input channel alone. In contrast, we believe that there is significant potential in leveraging the rich space of a user's head movements to enhance the expressiveness of existing touch and/or motion gestures when designing novel interaction techniques.

As a first step, we explore such techniques for a subset of all possible movements of a user's head, i.e., translations and rotations that keep the screen content right-side up and in front of the eyes while the head or phone is moved (Figure 1). We focus on face-engaged input methods that can be detected by the built-in sensors of a phone, including the touch sensor, accelerometer, gyroscope, and camera. It is possible that other types of head motions (e.g., turning of the face) or combinations of body-part movements (e.g., whole-body and midair hand gestures) could be involved. However, they require either extra hardware (e.g., infrared markers) or are computationally heavy in order to correctly distinguish computer vision features (rather than human front faces, for which there are highly optimized detectors, in some cases built into the handset).

In this paper, we propose three sets of novel face-engaged interaction techniques for mobile phones by combining a user's head movements with existing input methods (i.e., touches and device motions).
We also present a space of design considerations for mobile interactions using the three input channels, which frames the existing methods and our techniques in a systematic manner, shedding light on future research. A basic system implementation is described, for tracking a user's face and fusing multiple sensors to detect gestural inputs with combinations of the three input modalities. The face-engaged techniques can provide extra affordances and improve many of our everyday uses of mobile devices, offering a secondary input approach when necessary. We demonstrate these techniques with a number of common mobile phone interaction tasks, such as scrolling, document browsing, map navigation, and menu selection. As a start to exploring the rich space of leveraging face gestures for augmenting traditional inputs, we believe that face-engaged interactions offer a convenient and natural user experience, which could be very useful in some special situations (e.g., no-touch and one-hand).

Figure 1: Possible head movements when front facing.
DESIGN CONSIDERATIONS

Our goal is to add face movements to traditional touch or motion gestures to augment the current mobile input vocabulary. We thus consider the interaction design systematically based on how the three input modalities, i.e., touch, motion, and face, are used for controlling parameters in mobile interaction: discrete control, continuous control, or not used [14]. We define discrete control as mapping the sensor input into discrete events, or modes, such as menu triggering [21]. Continuous control indicates real-time updates of the interface, based on continuous sensor input, such as scrolling with tilting angles [10].

As shown in Table 1, we frame previous work and the techniques proposed in this paper with the above design considerations of parameter control methods. The first column of Table 1 groups techniques sharing the same attributes; some techniques may contain one or more attributes, given that there are multiple modes of controlling parameters both continuously and discretely. The proposed face-engaged techniques, marked "(this paper)" in Table 1, are deliberately chosen to illustrate open opportunities and populate this space, and are grouped into three sub-categories: face-engaged touch, face-engaged motion, and face-engaged touch & motion.

As a start, we focus on a subset of all possible head movements (Figure 1), and the example techniques only demonstrate certain aspects of using face gestures to augment traditional mobile input methods. However, Table 1 presents a clear classification of the techniques, and suggests a much wider space for future mobile interaction design where other degrees of face input can be substantially explored.
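To make the classification concrete, the design space of Table 1 can be written down as a small data structure. The following Swift sketch is only an illustrative encoding; the type and entry names are ours and are not part of any system described in this paper.

```swift
import Foundation

/// How one input modality participates in controlling an interface parameter.
enum ParameterControl: Hashable {
    case discrete     // mapped to discrete events or modes, e.g., menu triggering [21]
    case continuous   // real-time interface updates, e.g., scrolling with tilt angles [10]
}

/// One row of the design space: how face, motion, and touch are each used
/// (an empty set means the modality is not used).
struct TechniqueProfile {
    let name: String
    let face: Set<ParameterControl>
    let motion: Set<ParameterControl>
    let touch: Set<ParameterControl>
}

// Example entry, following the description of multi-scale scrolling later in the paper:
// the face distance continuously controls the rate while the finger continuously scrolls.
let multiScaleScrolling = TechniqueProfile(
    name: "Multi-scale scrolling",
    face: [.continuous], motion: [], touch: [.continuous])
```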
RELATED WORK

In this section, we summarize relevant mobile interaction techniques, framed by the design considerations in Table 1.

For touch-only input techniques, one of the major issues is the imprecision and occlusion of touch interfaces, also known as the "fat finger" problem. For example, Offset Cursor displays the cursor at a fixed distance on the screen to avoid finger occlusion [24]; similarly, Shift applies a hybrid approach where a normal coarse direct touch selection is followed by a precise position adjustment [30]. Along this line, Roudaut et al. proposed TapTap and MagStick, which further improve the efficiency and accuracy of small target selection [27]. Another relevant area concerns one-handed interaction. Karlson et al. have shown many cases in which users prefer one-handed use of mobile devices [19]. Some gestures for performing one-handed zooming have been widely used in commercial systems, such as "double tap" on iOS and "tap then slide" in the Google Maps mobile app. Other techniques such as "rubbing" [23] or "rolling" [28] have also been proposed to facilitate single-hand touch manipulations.

However, touch is not always convenient for a user. Examples include when one is wearing gloves or visually impaired, leading some researchers to leverage device motions to design novel gestures for mobile phone interaction. Based on a user's spatial awareness, Virtual Shelves supports triggering shortcuts mapped to a hemispherical space in front of the body [21]; however, external sensors for tracking the phone are required. The accelerometer and gyroscope are widely used internal sensors for detecting motion gestures, for instance: auto screen rotation based on a device's orientation [12], rolling a phone on different axes to scroll or issue commands [14], tapping the back of a device or jerking it [26], and whacking a phone with hard contact forces [15]. Also, through coupling the accelerometer data and touch events on an interactive surface, PhoneTouch allows for direct target selection using the phone and a pick&drop style of data transfer [29].
Technique(s) and how face, motion, and touch are used (D: discrete, C: continuous):

Rubbing [23], Microrolls [28] | Touch: D
Offset Cursor [24], Shift [30] | Touch: C
TapTap and MagStick [27] | Touch: D, C
Virtual Shelves [21], Auto screen rotation [12], Whacking gestures [15], PhoneTouch [29] | Motion: D
Rock'n'scroll [4] | Motion: C
TimeTilt [26] | Motion: D, C
TiltText [33], Chucking [11], Wrist angles [25] | Motion: D | Touch: D
Tilt Scrolling [10], Boom Chameleon [32], Expressive typing [16] | Motion: C | Touch: D
ScatterDice Mobile [31], TouchProjector [5], Spilling [22] | Motion: C | Touch: D, C
Sensor Synaesthesia [13] | Motion: D, C | Touch: D, C
iOS 7 switch control [2], Smart stay and Smart pause [1] | Face: D
Gaze scrolling [20], Image zoom viewer [9], Smart scroll [1] | Face: C
Multi-scale scrolling (this paper) | Face: C | Touch: C
Coarse-to-fine text edit (this paper) | Face: D, C | Touch: D
iRotate [6], Smart rotate [1] | Face: D | Motion: D
3D map viewer (this paper) | Face: C | Motion: D
Panorama viewing [17], Smart scroll (tilting device) [1] | Face: C | Motion: C
Touch-free menu (this paper) | Face: C | Motion: C
Expressive flicking (this paper) | Face: D | Motion: D | Touch: D, C
One-hand navigator (this paper) | Face: C | Motion: D, C | Touch: D, C
Table 1: Design considerations of mobile interaction based on how touch, motion, and face are used for parameter controls (D: discrete control, C: continuous control; a modality not listed for a technique is not used). Techniques proposed in this paper are marked "(this paper)".

Combining touch and motion further extends our interaction vocabulary with mobile phones. Hinckley et al. present a summary of such techniques and how the creation of novel gestures is possible with touch and motion sensors [13]. Device orientation inferred from motion sensors is one important input to be used simultaneously with touch. TiltText employs tilting angles to resolve text input ambiguities of the traditional phone keypad [33]. Other usages of tilting and touch include measuring wrist deflection angles [25] and scrolling [10]. ScatterDice Mobile supports the exploration of multi-dimensional data on mobile phones by mapping its orientation to different chart viewing perspectives [31].

In addition to orientation, device movement can provide more input modalities. For example, Boom Chameleon demonstrates a display tracked with 6 degrees of freedom in space as a window into a virtual 3D world [32]. TouchProjector [5] and Spilling [22] allow the user to manipulate an object by direct touch on the screen augmented by movement of the phone. Another type of motion involves hard force gestures such as shaking, tapping, and striking the device. Examples include using the accelerometer to measure typing pressure [16] and the "chucking" technique [11]. Hinckley et al. also present a number of techniques in this category, such as "hold and shake" and "hard drag" [13].

In contrast to the above research, our goal is to design techniques from a viewer's perspective based on the relationship between the device and the user's face (as opposed to the world position). There have been some attempts at employing face tracking techniques to provide extra affordances for mobile interfaces. Kumar et al. utilize a user's gaze information for natural scrolling, but additional hardware needs to be added [20]. Hansen et al. [9] describe the "mixed interaction space" between the user and the device and propose using the face to perform image navigation, similar to Image Zoom Viewer [7]. Along the same lines, face tracking techniques have been applied in panorama viewing [17] and screen rotation with mobile phones [6]. Recently, head gestures have emerged in some off-the-shelf systems, such as Samsung's "smart screen" technologies [1] and the accessibility switch control on iOS 7 [2]. However, none of the above work has systematically explored how face movements can be combined with existing touch and motion gestures to augment traditional input methods with new interaction techniques, enabling a more ubiquitous and natural usage for next-generation mobile platforms.
DETECTING FACE-ENGAGED GESTURES
To enable the engagement of face gestures with touch and motion, we developed a centralized gesture recognizer to collect, process, and integrate various sources of input from the native phone sensors, including the capacitive touch screen, accelerometer, gyroscope, and front-facing camera.

All techniques were implemented on an iPhone 5. As provided by the iOS SDK, different touch events such as "touch beginning", "touch moving", and "touch ending" were employed to detect touch gestures or used as gesture delimiters in some techniques. The accelerometer and gyroscope data were sampled at 60 Hz in order to compute phone orientation in real time and detect high-frequency motion gestures such as shaking.

Building on top of the face detection APIs in iOS, we implemented a face input processor to recognize large-angle face rotations, where image sequences from the front camera were sent to it for detecting faces at 0, -45, and 45 degrees of rotation. Based on the detection result sequence, the processor also generates different face events, such as "face entering", "face moving", and "face exiting". Face detection and processing was the most computationally heavy part of the central gesture recognizer, with a frame rate of around 16 fps on 480 × 640 pixel camera images.
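The paper does not include source code; as a rough illustration of how such a face input processor could be realized on the iOS face detection APIs, the Swift sketch below derives face events and parameters from CIDetector output. The type and event names are ours, and the handling of rotated faces and performance tuning are omitted.

```swift
import CoreGraphics
import CoreImage
import Foundation

/// Face events analogous to touch events ("face entering", "face moving", "face exiting").
enum FaceEvent {
    case entering(position: CGPoint)
    case moving(position: CGPoint, scale: Double, angle: Double)
    case exiting
}

/// A minimal face input processor built on CIDetector (structure and names are ours).
final class FaceInputProcessor {
    private let detector = CIDetector(ofType: CIDetectorTypeFace,
                                      context: nil,
                                      options: [CIDetectorAccuracy: CIDetectorAccuracyLow])
    private var facePresent = false

    /// Processes one front-camera frame and emits a face event, if any.
    func process(frame: CIImage) -> FaceEvent? {
        let wasPresent = facePresent
        guard let face = detector?.features(in: frame).first as? CIFaceFeature else {
            facePresent = false
            return wasPresent ? .exiting : nil
        }
        facePresent = true

        // Face position (Fx, Fy): centre of the detected face bounds.
        let position = CGPoint(x: face.bounds.midX, y: face.bounds.midY)

        // Face scale Fs (eye distance in pixels) and face angle Fa (roll of the eye line, degrees).
        var scale = Double(face.bounds.width)
        var angle = 0.0
        if face.hasLeftEyePosition && face.hasRightEyePosition {
            let dx = Double(face.rightEyePosition.x - face.leftEyePosition.x)
            let dy = Double(face.rightEyePosition.y - face.leftEyePosition.y)
            scale = hypot(dx, dy)
            angle = atan2(dy, dx) * 180 / .pi
        }
        return wasPresent ? .moving(position: position, scale: scale, angle: angle)
                          : .entering(position: position)
    }
}
```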
Based on our experiments, this face tracking method was adequate for our techniques, although using external markers could be faster.

In summary, our centralized gesture recognizer coordinates the input data and events from the touch, motion, and face channels as described above at a fixed clock rate, including:
• each finger's touch-screen coordinates, (Tx, Ty), from the touch sensor;
• three-axis acceleration values, (Ax, Ay, Az), from the accelerometer;
• angular rotation rates, (Rx, Ry, Rz), from the gyroscope;
• and the following parameters via our face input processor from the front camera: a) face position (Fx, Fy), b) face scale Fs estimated from the eye distance, and c) face angle Fa computed from the eye positions.

The face scale was used to infer the distance between the head and the phone screen. As per the geometry shown in Figure 2, at any time the product of the face scale Fs and the face-to-screen distance d equals a constant, which is determined by the user's eye distance d_eye and the camera's focal length d_image. Screenshots of the debugging output of the recognizer are shown in Figure 3.

Figure 2: Representing face-to-phone distance with face scale, where for an individual, Fs1 · d1 = Fs2 · d2 = d_eye · d_image = constant.

Figure 3: Sample of debugging outputs.

The above face events provide the flexibility for others to build more complicated recognizers, e.g., integrating facial expressions. However, higher-level APIs need to be developed to support programmers by encapsulating the various parameters above, in a similar way to the gesture recognizers in iOS.
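As a worked example of the geometry in Figure 2 (our own illustration, not code from the described system): because Fs · d stays constant for a given user, relative changes in the face-to-screen distance can be recovered from the face scale alone, without knowing d_eye or d_image.

```swift
import Foundation

/// Pinhole-camera relation: Fs * d = d_eye * d_image = constant for a given user, where
/// Fs is the eye distance measured in the camera image (pixels), d the face-to-screen
/// distance, d_eye the user's physical eye separation, and d_image the focal length.
///
/// Absolute distance would need d_eye and d_image, but the *relative* change in
/// distance between two moments needs neither:  d2 / d1 = Fs1 / Fs2.
func relativeDistanceChange(fromScale fs1: Double, toScale fs2: Double) -> Double {
    precondition(fs1 > 0 && fs2 > 0, "face scale must be positive")
    return fs1 / fs2
}

// Example: if the measured eye distance shrinks from 120 px to 80 px,
// the face has moved 1.5 times farther from the screen.
let change = relativeDistanceChange(fromScale: 120, toScale: 80)  // 1.5
```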
FACE-ENGAGED INTERACTION TECHNIQUES
We now turn our attention to various techniques enabled by our implementation. We divide the discussion into the following three categories by sensing modality.
Face-Engaged Touch

Our first category of techniques combines the touch sensor and face movement to enhance traditional touch input in many daily usage scenarios.
Multi-scale Scrolling

Scrolling is necessitated by displaying large content on small screens, and is thus a common interaction performed by mobile phone users when browsing documents, searching in a list, or viewing videos. One important issue in scrolling interfaces is rate control, i.e., the mapping between virtual content scrolling distance and finger moving distance. Users demand different scrolling speeds depending on the size of the content and the browsing task. For example, the iOS video player uses the perpendicular distance between the touch position and the scrollbar to control the scrolling speed for multi-scale navigation (similar to [3]). However, it may suffer from hand occlusion problems on touch screens. We propose a more natural technique that uses the face-to-screen distance to govern the rate of scrolling—the closer the distance, the slower the scrolling speed—using the metaphor that people move text closer to their face when they want to read more carefully. With this technique, users may conveniently adjust the scrolling of content, such as videos and documents, in a spontaneous manner by just moving the device or head. In addition, the representation of the content can be modified according to the face-to-screen distance, for example, displaying stock price charts at different scales (by day, week, etc.) while providing different scrolling rates.
Implementation Details.
Two different multi-scale scrolling mechanisms, absolute and relative scaling, were implemented for this technique. Absolute scaling directly maps the scrolling speed to the quantity of the face scale Fs, which is another way of inferring the face-to-screen distance without extra sensors. In this method, each face scale value has a fixed scrolling rate. Relative scaling adjusts the current scrolling speed based on relative changes of the face scale while the user is actively scrolling. If the device or face is moved during the interaction, the system modifies the scrolling speed based on the direction and distance traveled; if the user does not perform any interaction, the scrolling speed remains the same even when the face-to-screen distance changes. Through pilot studies, we found that relative scaling was preferred, because users have various face sizes and different habits of phone-holding distance. To improve its stability in situations where the user might not be able to hold the device still, we discretized the possible range of Fs into 6 levels, within which the scrolling speed remains the same. We defined the active scrolling status as: 1) the finger is on the screen, and 2) the time interval between two scrolling actions does not exceed 0.5 seconds.
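A minimal sketch of the relative-scaling rule described above, assuming the face scale has already been discretized into the six levels mentioned; the rate constant and the code structure are illustrative, not the authors' tuned values.

```swift
import Foundation

/// Relative scaling for multi-scale scrolling: the scroll rate is adjusted by the
/// *change* in discretized face-scale level while the user is actively scrolling.
/// Level 0 is the closest (largest Fs, slowest), level 5 the farthest (fastest).
struct MultiScaleScroller {
    /// Multiplicative rate change per level the face scale moves (assumed value).
    var ratePerLevel: Double = 1.5
    private(set) var scrollRate: Double = 1.0
    private var lastLevel: Int?
    private var lastScrollTime: TimeInterval = -.infinity

    /// "Active scrolling": finger on screen and no more than 0.5 s since the last scroll action.
    func isActivelyScrolling(fingerDown: Bool, now: TimeInterval) -> Bool {
        return fingerDown && (now - lastScrollTime) <= 0.5
    }

    /// Call on every scroll action with the current discretized face-scale level (0...5).
    mutating func update(faceScaleLevel level: Int, fingerDown: Bool, now: TimeInterval) {
        if isActivelyScrolling(fingerDown: fingerDown, now: now), let previous = lastLevel {
            // Only relative changes matter; holding the phone still leaves the rate unchanged.
            let delta = level - previous
            scrollRate *= pow(ratePerLevel, Double(delta))
        }
        lastLevel = level
        lastScrollTime = now
    }

    /// Content offset change for a finger movement of `fingerDelta` points.
    func contentDelta(forFingerDelta fingerDelta: Double) -> Double {
        return fingerDelta * scrollRate
    }
}
```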
Coarse-to-fine Text Edit

Certain text editing tasks can be difficult to perform on smartphone touch screens due to the limited screen space and the imprecision of finger input. For example, for cursor positioning, most commercial devices apply the "finger hold" gesture to trigger a virtual magnification lens with a fixed offset from the touch point, allowing the user to see beneath their finger (similar to [30]). While this is functionally complete, it reduces the context of the text surrounding the cursor. Also, it can be difficult to make cursor adjustments near the edges of the screen, because the finger easily slips off the screen and thus cannot be sensed.

By leveraging the relative orientation between the face and the phone, we propose a technique that augments the classic method of cursor positioning to overcome the above problems (Figure 4). To place the cursor at the desired location, a user can first tap on the screen normally to give the cursor an approximate position (Figure 4-ab), and then lean her head left or right to further move the cursor character by character (Figure 4-cd). Since the second step of fine-level position adjustment does not need touches, it removes the need for the magnifying lens, and hence possible occlusions, and preserves the context while moving the cursor. This technique can be embedded into any text editing application, providing an alternative cursor-manipulation method with minimal hand occlusion. Also note that it does not conflict with the magnifying lens, which can be added onto traditional interfaces to move the cursor at different granularities. Further, this technique could also be applied to text selection, where the starting and ending cursors can be manipulated in a similar way.

Figure 4: Coarse-to-fine text edit: (a)(b) first touching the screen to set a rough cursor position, (c)(d) then using head gestures to move the cursor at a finer level.

Implementation Details.
For activating the face-driven cursor movement, we used a threshold of a 15-degree angle between the perpendicular axes of the screen and the face. This angle can be computed from the face angle Fa in camera image space and the device screen orientation inferred from the system. During the movement, the cursor shifts at a constant speed (200 ms per character, tuned through pilot studies) in the direction controlled by the head, and it stops moving when the face is within the [-15, 15] degree range. One limitation of this technique may be that a user cannot move the head freely to perform such interaction when side-lying on a bed.
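The head-lean cursor control can be summarized in a few lines; the sketch below is our reading of that logic, with the 15-degree activation threshold and the 200 ms step interval taken from the text and everything else (names, sign convention) assumed.

```swift
import Foundation

/// Fine-grained cursor stepping driven by head lean, used after a coarse touch placement.
struct HeadLeanCursor {
    /// Leans within ±15 degrees (face angle relative to the screen's vertical axis) do nothing.
    let activationAngle: Double = 15
    /// The cursor steps one character every 200 ms while the lean is held.
    let stepInterval: TimeInterval = 0.2
    private var lastStepTime: TimeInterval = -.infinity

    /// Returns how many characters to move the cursor (-1, 0, or +1) for this frame.
    /// `faceAngle` is Fa in degrees; positive means leaning right (assumed sign convention).
    mutating func step(faceAngle: Double, now: TimeInterval) -> Int {
        guard abs(faceAngle) > activationAngle else { return 0 }     // inside the dead zone
        guard now - lastStepTime >= stepInterval else { return 0 }   // rate-limit to constant speed
        lastStepTime = now
        return faceAngle > 0 ? 1 : -1
    }
}

// Example: feed per-frame face angles; the cursor advances one character per 200 ms of sustained lean.
var cursor = HeadLeanCursor()
let delta = cursor.step(faceAngle: 22, now: Date().timeIntervalSinceReferenceDate)
```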
Face-Engaged Motion

In addition to face-engaged touch inputs, leveraging the sensed position of the face in the camera image together with the device orientation relative to the ground allows for further interactions.

3D Map Viewer

Nowadays, mobile geo-map applications usually support not only a traditional 2D map layout but also an immersive 3D view of streets. To have a better navigation experience, people often need to switch between these two views. Current off-the-shelf implementations require users to either explicitly tap a "mode" button or drag the map vertically using two fingers. The former is less intuitive and the latter needs the coordination of a two-point touch, which is not always possible as discussed before. Thus, we leverage the phone tilting angle to enable a more effective way of transforming between 2D and 3D modes, inspired by our common habits of viewing perspectives of 3D models. Specifically, a user just naturally rotates the phone to a roughly 45-degree angle relative to the ground from a vertical or horizontal holding of the phone, in order to change from 2D to 3D, or vice versa. Figure 5-ab shows one possible interaction.

In addition, when a user is exploring the 3D map view, it can often be useful to quickly peek at the right or left side of the current viewpoint to gain more context of the location. This, however, has not been addressed in any of the current mobile map applications. With face gestures, a user can rotate the head left or right to control the direction and angle of the glimpse, and if the user rolls the head back to its original position, the view angle goes back to straight ahead (Figure 5-cd), which offers a quick and easy Glimpse-like [8] interaction for 3D map exploration. When using the above two techniques together, we believe that such face-engaged motion interaction can enhance the user experience of map navigation on mobile phones, and it can also be naturally built into traditional map viewing interactions. Examples of use include virtual sightseeing, first-person perspective gaming, or GPS apps for quick previews of the streets ahead at intersections.
Implementation Details.
Through iterative design, we tuned the various parameters to optimize this technique. The motion for changing to the 3D view mode is initiated when the tilting angle of the device falls within a range of 45 ± 10 degrees. To detect head leaning, we employed both the face horizontal position Fx and the face angle Fa provided by our system. For example, a right leaning gesture was initiated when Fx fell in the right half of the screen while Fa was rotated in the clockwise direction. This dual-variable approach was used to decrease the chance of false positives. We further set two minimum thresholds, for Fx (80 pixels relative to the camera image center) and Fa (10 degrees relative to the vertical direction), to increase stability. Once the 3D map view was activated by tilting the phone, we used Fa to govern the view peeking angle in the corresponding direction, where three levels were provided at 10, 20, and 30 degrees, mapped to glimpsing angles of 45, 90, and 135 degrees. Similar to the multi-scale scrolling technique, this discretization of the continuous parameter control was intended to accommodate the noisy input.

Figure 5: 3D map viewer: (a) normal 2D viewing, (b) rotating the phone to enter the 3D view mode, and (c)(d) moving the head to glimpse the left or right side of the 3D buildings.
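To make these parameter choices concrete, here is a small sketch of the mode-switch and glimpse logic as we read it from the description above; the thresholds come from the text, while the names and structure are ours.

```swift
import Foundation

/// 3D map viewer gestures: tilt the phone to toggle 2D/3D, lean the head to glimpse sideways.
struct MapViewerGestures {
    enum ViewMode { case flat2D, perspective3D }
    private(set) var mode: ViewMode = .flat2D
    private var inTiltBand = false

    /// The 2D/3D toggle fires when the device tilt enters the 45 ± 10 degree band.
    mutating func updateMode(deviceTiltDegrees tilt: Double) {
        let nowInBand = abs(tilt - 45) <= 10
        if nowInBand && !inTiltBand {            // toggle only on entering the band
            mode = (mode == .flat2D) ? .perspective3D : .flat2D
        }
        inTiltBand = nowInBand
    }

    /// Glimpse angle in the 3D view. The face position Fx (pixels from the camera image centre)
    /// and the face angle Fa (degrees from vertical) must agree on a direction, which reduces
    /// false positives. |Fx| >= 80 px and |Fa| >= 10 degrees are the minimum thresholds; Fa is
    /// then discretized into three levels (10, 20, 30 degrees mapped to 45, 90, 135 degrees).
    func glimpseAngle(faceOffsetX fx: Double, faceAngle fa: Double) -> Double {
        guard mode == .perspective3D, abs(fx) >= 80, abs(fa) >= 10,
              (fx > 0) == (fa > 0) else { return 0 }
        let level = min(3, Int(abs(fa) / 10))        // 1, 2, or 3
        let magnitude = [0.0, 45, 90, 135][level]
        return fa > 0 ? magnitude : -magnitude
    }
}
```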
Touch-free Menu

Using the face & motion input modalities opens the possibility of touch-free interaction. As Zarek et al. [34] describe, a number of mobile phone use scenarios do not allow for capacitive touch input. Touch-free interaction can be very useful in those scenarios as a quick and easy manner of achieving necessary tasks. Our goal here is accessory-free input, as opposed to leveraging external sensors for tracking, such as [21]. Though several projects have examined motion sensing techniques using the accelerometer, gyroscope, or camera [10, 11, 13, 15], few have explored applying face tracking to enable touch-free interaction.

We propose a pie-menu selection technique that employs the relative angle between the face and the device. A user can freely rotate the device or her face to navigate through menu items, where the currently selected item is always aligned with the vertical axis of the face (Figure 6). To confirm a selection, a timeout can be applied. Triggering of the pie menu can be contextual (e.g., presented when a phone call is incoming, to "answer on speaker" or "ignore"), or user-initiated via actuated buttons (e.g., double-clicking the "home" key on an iPhone).

One useful scenario is to augment the music play control on the locked screen with touch-free manipulation of settings during cold weather (e.g., playback, volume, and sound EQ mixing, as the icons in Figure 6 show). This face-engaged pie menu selection is more flexible than motion-based techniques (e.g., [21]), because it can be used even when a user is lying down, and it always keeps the eyes on the screen for displaying additional information.

Figure 6: Touch-free menu: using the relative orientation between the device and the face angle to select menu items.
Implementation Details.
To demonstrate the basic functions, we implemented a pie menu with 8 menu items and a selection timeout of 3 seconds. Each of these was set through iterative design. We found that 8 items (and thus 45 degrees per item) was the smallest discretization of the relative orientation of head and phone that could be easily controlled via head motions alone (though turning the phone with the hands did allow for a finer grain). To compute the relative angle of the face to the phone, we used the face angle Fa in image space and the phone orientation inferred from the accelerometer data (Ax, Ay, Az). We also found that 2 seconds was the fastest timeout that would ensure a very low false activation rate, thus reducing user frustration from accidental selections. However, we could also employ the face-to-device distance to achieve the selection in this technique in a faster way.
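A hypothetical sketch of the pie-menu selection logic: the relative face-to-device angle picks one of eight 45-degree sectors, and the highlighted item is confirmed after a dwell timeout (the text reports values of 2 to 3 seconds; the constant here is a placeholder).

```swift
import Foundation

/// Touch-free pie menu: the item aligned with the vertical axis of the face is selected
/// after a dwell timeout. The relative angle is face angle Fa minus device orientation.
struct TouchFreePieMenu {
    let itemCount = 8                        // 45 degrees per item
    let dwellTimeout: TimeInterval = 3.0     // placeholder dwell time
    private var highlightedItem: Int?
    private var highlightStart: TimeInterval = 0

    /// Index of the item currently aligned with the face's vertical axis.
    func item(forRelativeAngle degrees: Double) -> Int {
        let step = 360.0 / Double(itemCount)
        let normalized = (degrees.truncatingRemainder(dividingBy: 360) + 360)
            .truncatingRemainder(dividingBy: 360)
        return Int((normalized + step / 2) / step) % itemCount
    }

    /// Feed the relative angle each frame; returns the item index once the dwell completes
    /// (the caller is expected to dismiss the menu on selection).
    mutating func update(relativeAngle: Double, now: TimeInterval) -> Int? {
        let current = item(forRelativeAngle: relativeAngle)
        if current != highlightedItem {
            highlightedItem = current
            highlightStart = now
            return nil
        }
        return (now - highlightStart >= dwellTimeout) ? current : nil
    }
}
```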
Face-Engaged Touch & Motion

Our final combination uses all three input modalities: the movement of the face, touch input, and device motion.

Expressive Flicking

Flicking is a widely used interaction on touch screens that uses pseudo-momentum to reduce the physical work needed to scroll. However, users sometimes demand richer scrolling interactions for navigating content with semantics. For example, when reading books, people usually want to jump between chapters or flip pages back and forth, which must be done in multiple steps with normal flicking or tapping gestures.

Through the combination of motion sensing and touch, we introduce expressive flicking, which contains a series of new flicking gestures: phone swipe, hold-and-swipe, and flick-and-swipe (Figure 7). In phone swipe, the user quickly moves the phone left or right starting from the position in front of her face; in hold-and-swipe, the user performs a similar phone swipe but with a finger held on the touch screen; and in flick-and-swipe, the user initiates a flicking gesture while swiping the phone in the same direction.

Together with the normal flick, these gestures are ordered by increasing intensity of action, which can express different interaction semantics, e.g., scrolling by various distances such as a page, multiple pages, a section, or a chapter when reading documents. Compared to traditional flicking, this technique can improve the efficiency of multi-step tasks by completing them in one operation, and it embeds more expressive physical metaphors into the interactions. Unlike the previous techniques designed with only touch and motion, we employ face detection as a mode indicator for whether to activate the gesture recognizers. More specifically, the new flicking gestures are only available when initially performed with the face in front of the screen, to prevent the user from triggering gestures by accidentally touching or shaking the phone.

Figure 7: Expressive flicking: (a) normal flick, (b) phone swipe, (c) hold-and-swipe, and (d) flick-and-swipe.
Implementation Details.
There are some technical considerations in the implementation. First, one may not be able to hold her finger in exactly the same location when swiping the phone. Thus we added a 15-pixel tolerance for the touch position (Tx, Ty) in hold-and-swipe. The second consideration is the synchronization of touch and motion gestures in flick-and-swipe. We look at the distance traveled by the finger touch from the beginning to the end of a phone swipe, regardless of when the user put down or released her finger, where the latter could be in the middle of or after the swipe. However, this may result in only part of the flicking distance being utilized for gesture recognition. Lastly, as mentioned above, these gestures (except the normal flick) must be performed starting with the face being detected, so the command may not execute correctly given false negatives of face detection, e.g., in a poor lighting condition.
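The gesture vocabulary above can be summarized as a simple classifier over the fused observations; this is an illustrative sketch (the predicate names and the 15-pixel tolerance helper are ours), not the recognizer used in the study.

```swift
import Foundation

/// The four expressive-flicking variants, ordered by increasing intensity of action.
enum FlickGesture {
    case normalFlick      // finger flick only
    case phoneSwipe       // quick sideways movement of the phone
    case holdAndSwipe     // phone swipe with a finger held on the screen
    case flickAndSwipe    // finger flick and phone swipe in the same direction
}

/// Classifies one completed gesture from fused touch and motion observations.
/// The face acts as a mode indicator: only the plain finger flick is available
/// without the face having been detected in front of the screen at the start.
func classifyFlick(faceWasPresent: Bool,
                   phoneSwiped: Bool,
                   fingerFlicked: Bool,
                   fingerHeld: Bool,
                   sameDirection: Bool) -> FlickGesture? {
    guard faceWasPresent else {
        return (fingerFlicked && !phoneSwiped) ? .normalFlick : nil
    }
    switch (phoneSwiped, fingerFlicked, fingerHeld) {
    case (false, true, _):      return .normalFlick
    case (true, true, _):       return sameDirection ? .flickAndSwipe : nil
    case (true, false, true):   return .holdAndSwipe
    case (true, false, false):  return .phoneSwipe
    default:                    return nil
    }
}

/// Hold-and-swipe tolerates small finger drift (about 15 px) while the phone moves.
func fingerHeldSteadily(start: (x: Double, y: Double), end: (x: Double, y: Double)) -> Bool {
    return hypot(end.x - start.x, end.y - start.y) <= 15
}
```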
One-hand Navigator

One-handed usage of mobile devices is another common scenario in our daily interactions [19]. Although there have been some attempts to support panning and zooming with one-handed interaction [13, 17, 23, 27], few have addressed rotation, which is commonly used in mapping, image editing and viewing, and graphics design.

A demonstration of the technique is shown in Figure 8. Similar to previous work [7, 9], we utilize the distance between the phone and the face to control the zooming scale, and define the anchor point as the finger touch position on the screen. As discussed for multi-scale scrolling, we apply the relative scaling mechanism for zooming, i.e., the zoom level is adjusted according to the difference between the starting and ending face-to-phone distances (Figure 8-b). Compared to former techniques (e.g., tilt-to-zoom), such interactions are face-centric and can be executed in any situation, even when the user is lying down. To rotate the view, one can put her finger down to set an anchor point and then rotate the device. The original relative orientation between the view and the face remains the same while the device is rotating, thus rotating the content onscreen (Figure 8-c). Similar to zooming, the rotation action stops once the user releases the finger. As for the panning operation, a user can achieve it with the traditional sliding gesture when the zooming or rotation modes are not activated (Figure 8-a). If the user performs panning during those two modes, only the anchor points are adjusted accordingly.

Compared to prior work, this technique enables one-handed manipulation of maps or images with a more complete set of operations, including pan, zoom, and rotate, all interleaved in a fluid manner. It is useful for the quite common everyday cases in which a user's other hand is occupied.

Figure 8: One-hand navigator: (a) panning with normal finger sliding, (b) zooming by face-to-phone distance, and (c) rotating by face-to-phone orientation.

Implementation Details.
Considering the implementation issues of the aforementioned techniques, we applied the same method of discretizing the continuous control in zooming and rotation to reduce the effect of noise. Iterative tuning determined that it was optimal to use 6 levels for zooming over the reasonable range of face scale Fs (which represents the face-to-phone distance) and 20 levels for rotation over the 360-degree range. Within the same level, where the phone may move slightly, the zooming level or rotation degree is not adjusted. However, other advanced rate control techniques (such as in [17]) could be employed to provide a smoother user experience.
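A compact sketch of the zooming and rotation rules as described, with six face-scale levels and twenty 18-degree rotation steps taken from the text; the zoom factor per level and the sign conventions are assumptions.

```swift
import Foundation

/// One-hand navigator: zoom follows the relative face-to-phone distance, and rotation keeps
/// the content fixed relative to the face while the device rotates. Both are discretized.
struct OneHandNavigator {
    private(set) var zoom: Double = 1.0
    private(set) var rotationDegrees: Double = 0

    private var zoomAnchorLevel: Int?
    private var zoomAtAnchor: Double = 1.0
    private var rotationAnchorDevice: Double?
    private var rotationAnchorContent: Double = 0

    /// Zooming: 6 discrete face-scale levels; the zoom changes with the *difference* between
    /// the level at touch-down and the current level (relative scaling, factor assumed).
    mutating func updateZoom(faceScaleLevel level: Int, fingerDown: Bool,
                             zoomPerLevel: Double = 1.25) {
        guard fingerDown else { zoomAnchorLevel = nil; return }
        if zoomAnchorLevel == nil { zoomAnchorLevel = level; zoomAtAnchor = zoom }
        let delta = level - (zoomAnchorLevel ?? level)
        zoom = zoomAtAnchor * pow(zoomPerLevel, Double(delta))
    }

    /// Rotation: while the finger anchors the content, rotating the device by an angle rotates
    /// the on-screen content the opposite way so it stays fixed relative to the face,
    /// quantized into 20 steps per 360 degrees.
    mutating func updateRotation(deviceAngleDegrees device: Double, fingerDown: Bool) {
        guard fingerDown else { rotationAnchorDevice = nil; return }
        if rotationAnchorDevice == nil {
            rotationAnchorDevice = device
            rotationAnchorContent = rotationDegrees
        }
        let step = 360.0 / 20.0                                // 18-degree increments
        let rawDelta = device - (rotationAnchorDevice ?? device)
        let quantized = (rawDelta / step).rounded() * step
        rotationDegrees = rotationAnchorContent - quantized    // counter-rotate the content
    }
}
```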
INITIAL EVALUATION
Apart from the many small-scale iterative design sessions described throughout the paper, we conducted an initial evaluation to collect user feedback on the six face-engaged mobile interactions. The purpose of this study was to assess the value of these techniques in practice, considering both their strengths and weaknesses. We also aimed to observe user interactions and collect their qualitative feedback, since many of the techniques do not have a comparable baseline (e.g., it does not make sense to compare the speed and accuracy of touch selection and the touch-free menu). Thus we valued a fluid and convenient user experience in the co-existence of the techniques and traditional inputs, as well as in the special situations that some of the techniques were designed for.
We recruited 10 participants, including 6 males and 4 females, aged 23-28, all right-handed, from a university network. All participants were daily touch-screen phone users, which allowed us to adequately compare our techniques with their everyday experience. Each participant received $10 as compensation after the study.
Participants were asked to try the proposed techniques on an iPhone 5 weighing 112 grams, with a 4.0-inch diagonal display at a resolution of 640 × 1136 pixels, used throughout the whole experiment. Custom software was developed to present each technique as described in the previous sections.
For each of the six face-engaged interaction techniques, the experimenter first demonstrated its usage and participants then spent several minutes getting familiar with the gestures and better understanding the system. Next, participants were given some instructions to explore and try the features of the technique for about 6 to 10 minutes. After trying each technique, participants filled out a questionnaire of 5 questions using a 1-11 Likert scale (from strongly disagree to strongly agree). The order of techniques presented to participants was randomized. When all the techniques had been explored, we conducted a short informal interview to collect general comments, such as the best and worst aspects of each technique. At the end of the study, participants were asked to rank all six techniques based on their overall preference.
RESULTS
Overall, the ratings of the Likert questions for each technique indicate that participants had positive reactions to the proposed face-engaged interactions with mobile phones (Table 2). Users thought all the techniques were generally easy to learn, where the scores of Q1 were all above 9.0, except for expressive flicking. Some participants commented that they could not distinguish hold-and-swipe and flick-and-swipe in action: "I can easily do another gesture instead". Some usability issues were observed for techniques requiring the coordination of multiple objects simultaneously, such as rotating the phone while moving the head (in touch-free menu) or performing touch gestures while swiping the phone (in expressive flicking).
Table 2: Likert questionnaire results, where each cell gives mean (std). The questions are—Q1: the technique is easy to use, Q2: the technique is easy to learn, Q3: the technique is useful in daily life, Q4: the technique is more efficient than traditional methods, and Q5: I'd like to have the technique on my phone.

Figure 9: The average rankings of the techniques, where T1-T6 represent the techniques in the same order as Table 2.

However, participants felt that it was worth taking some effort to control the techniques through a bit of practice, because they can "do a multi-step task with only one move".

Of all the techniques, one-hand navigator and touch-free menu were rated as the two most useful ones, as some said "this solves many problems in my everyday life". Users were eager to have those techniques available on their phones. When asked to compare the same task with the standard technique, participants generally favored the proposed techniques; especially when only one hand or no hand is available, they agreed that the face-engaged interaction techniques were convenient, intuitive, and efficient.

The user preference rankings (Figure 9) indicate the need to support effective one-hand and no-touch interactions in our daily life, such as the map navigation and menu item selection introduced in this paper. Seven out of eight participants mentioned one-hand navigator as their favorite technique. One user liked multi-scale scrolling the most since "it seems very natural to scroll slower when it is far away and it is useful for video browsing". Expressive flicking was the least favorite technique, because it took so much physical effort and was difficult to control without training. Some participants were also worried that shaking could easily damage their phones.
In general, participants enjoyed the process of trying all face-engaged techniques and thought these interactions were useful and natural. They also indicated the experience was smooth and the face tracking was efficient in responding to their face-engaged gestures.
Most of the techniques seemed to be intuitive, and the real-world metaphors used in many techniques helped users learn the interactions. For example, participants liked the fact that the direction of leaning the head corresponded to the direction of the operation, as in moving the cursor in coarse-to-fine text edit. Some mentioned that "it makes sense to take more physical effort to achieve more complicated tasks" in expressive flicking, although it was rated as not very easy to learn because users had to manipulate the device simultaneously with touch.

Participants particularly loved one-hand navigator, which enables panning, zooming, and rotating of objects all with one hand. They all thought that using the relative face-to-screen distance and orientation to control the zoom levels and rotation angles was very similar to their experiences in the physical world. One suggested that the rotation gesture could be more intuitive if the image orientation remained the same regardless of the head movement. Similar to the zooming mechanism, multi-scale scrolling employs the distance between the face and the phone to govern the scrolling speed, which was appreciated by most of the participants as well. Only one user mentioned that it was a little strange that the directions of the actual scrolling and its speed control were perpendicular.

Another aspect we observed was that users needed to get accustomed to the rate control of parameters in many face-engaged interaction techniques. Due to the noisy input of face tracking with the front camera, we set thresholds and discrete levels in the interaction space for head movements to increase stability at the sacrifice of continuity. Participants noticed that interface parameter values sometimes appeared jerky and unstable, which required additional time to get familiar with the system settings and some practice to manipulate the interface more precisely. However, integrating advanced face tracking algorithms or using external sensors may enhance the user experience.
All participants agreed that the proposed techniques were useful under certain situations, as our goal is to augment traditional interactions rather than replace them. For expressive flicking, some users indicated that they would use it for browsing a large number of images or extensive documents, but normally they do not perform those tasks on mobile phones. One participant intended to apply coarse-to-fine text edit for texting: "it releases my hands to do other stuff and brings the cursor back to the desktop [manner]".

Some participants even indicated that they would like to use multi-scale scrolling, touch-free menu, and one-hand navigator in daily interactions. One commented that "[one-hand navigator] is easier to mix zooming and rotation. [...] I want to have it for my Google Map". Another user added a number of scenarios for the touch-free menu, such as when she is cooking and her hands are dirty. Some said that multi-scale scrolling could be very useful on tablets for reading books or skimming online videos.
Although the face-engaged interaction techniques were generally considered natural and intuitive by participants, the form of gripping the device, the motion range of the wrist, and the movement space of the head should be carefully considered in an appropriate gesture design. Participants thought that it was much easier to rotate the phone than to rotate the head. Some suggested that the touch-free pie menu should span 270 degrees instead of a full circle, as "it is hard to choose the lower end [of the menu]". Given much freedom, participants were sometimes confused about whether to rotate their heads or the phone to complete a task. We also observed that some users felt a little uncomfortable holding the head stable while rotating, e.g., for coarse-to-fine text edit. But participants indicated that they could manage it, because they would use such interactions in special situations which are not likely to happen all the time.

Moreover, several users thought that the benefit of doing complicated tasks in one single action might be decreased by the physical effort needed for the gesture, e.g., flick-and-swipe in expressive flicking. Participants said that in most cases, the traditional ways might be preferred even though they took longer, despite the fact that they liked the physical metaphors, interaction semantics, and intuitions applied in the techniques.

Another issue revealed during the interviews was that people might feel awkward performing the face-engaged interaction techniques in front of others, especially those requiring rotation of the head (e.g., touch-free menu). Also, one mentioned that when it is crowded, there might not be enough space to do actions such as swiping the phone or moving the phone far from the face.
DISCUSSION
From the study, although these face-engaged techniques were considered to be intuitive and convenient in many usage scenarios, they limit interactions to a restricted space where the face must be captured and tracked. In our system, the user has to be front-facing and hold the phone within a certain distance range. However, having a wider view-angle camera may relieve these issues. During the experiment, we did not limit users to certain standing postures or facing directions, and the sensing seemed relatively stable throughout the study. However, lighting conditions might be different in outside environments, which could affect the tracking.

Moreover, some face-engaged gestures that require frequent head rotations (e.g., coarse-to-fine text edit) might cause fatigue. However, we argue that those techniques are not designed for constant daily usage but for just-in-time and auxiliary use to achieve tasks in special scenarios that normally cannot be handled.

Another limitation is that the face tracking input is usually noisy (especially in non-stationary environments) and has high latency, so it cannot be used for precise and high-frequency parameter adjustment. Also, false positives or the loss of face tracking may interrupt the interaction, especially in continuous control. Nonetheless, from the experiments, our current implementations together with some technique-specific adjustments (e.g., discretization of continuous parameters, and dual-variable gesture detection) seem to be adequate for simple everyday tasks. But the development of better face tracking algorithms and faster hardware on the phone would significantly enhance the user experience. It would also be interesting to implement temporal filtering techniques such as the Kalman filter [18] to stabilize the input.

From the observations and interviews, we also identified an interesting point about the trade-off of speed and convenience versus effort and learning. While these face-engaged gestures provide a rich interaction vocabulary to do complicated tasks more efficiently, users need practice to best manage the new interactions. Sometimes users have to expend more physical effort, such as in expressive flicking, to overcome the inefficiency of common techniques. People may not always prefer face-engaged interactions on a daily basis, but many of our techniques are designed to augment traditional inputs, especially in unusual situations, e.g., when another hand or touch is not available.
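The discussion above mentions temporal filtering such as a Kalman filter [18] to stabilize the noisy face-tracking input; a minimal one-dimensional filter over a scalar such as the face scale might look like the following sketch, where all noise constants are placeholders to be tuned.

```swift
import Foundation

/// A 1-D Kalman filter for smoothing a noisy scalar such as the face scale Fs.
/// Model: the true value is roughly constant between frames (random-walk process).
struct ScalarKalmanFilter {
    var processNoise: Double = 1e-3      // how much the true value may drift per frame
    var measurementNoise: Double = 4.0   // variance of the face-detector output
    private var estimate: Double?
    private var errorCovariance: Double = 1.0

    mutating func update(measurement: Double) -> Double {
        guard let previous = estimate else {
            estimate = measurement           // initialize with the first observation
            return measurement
        }
        // Predict: the state carries over, and uncertainty grows by the process noise.
        var predicted = previous
        errorCovariance += processNoise
        // Update: blend prediction and measurement according to the Kalman gain.
        let gain = errorCovariance / (errorCovariance + measurementNoise)
        predicted += gain * (measurement - predicted)
        errorCovariance *= (1 - gain)
        estimate = predicted
        return predicted
    }
}

// Example: filtering raw per-frame face-scale readings before mapping them to zoom levels.
var filter = ScalarKalmanFilter()
let smoothed = [118.0, 131.0, 122.0, 125.0].map { filter.update(measurement: $0) }
```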
CONCLUSION AND FUTURE DIRECTIONS
We have explored novel interaction techniques for mobile devices by augmenting the existing touch and motion gesture paradigm with face movements, which facilitates many special usage scenarios in our daily life. Three groups, in total six techniques, were discussed based on their sensing methods: face-engaged touch, face-engaged motion, and face-engaged touch & motion. From user reactions in the study, we conclude that these new techniques are intuitive and efficient in many situations compared to traditional methods, and offer a richer interaction vocabulary with extra affordances. Further, by framing the previous work and these techniques in a conceptual structure based on the parameter control of the three input modalities, we believe that face-engaged interactions can provide inspiration and implications for future research to better exploit face input, an underutilized channel.

In the future, we would like to conduct more experiments and further extend the highly-rated techniques. It is also interesting to continually populate the design space by designing and testing other novel face-engaged interaction techniques. Finally, it is promising to enhance the current gesture recognizer by exploring more advanced face tracking algorithms and the use of other sensors.

REFERENCES
[1] Samsung Galaxy S4, 2012.
[2] iOS 7, 2013.
[3] C. Appert and J.-D. Fekete. OrthoZoom Scroller: 1D multi-scale navigation. In Proc. of ACM CHI, pages 21-30, 2006.
[4] J. F. Bartlett. Rock 'n' scroll is here to stay. IEEE Comput. Graph. Appl., 20(3):40-45, 2000.
[5] S. Boring, D. Baur, A. Butz, S. Gustafson, and P. Baudisch. Touch projector: mobile interaction through video. In Proc. of ACM CHI, pages 2287-2296, 2010.
[6] L.-P. Cheng, F.-I. Hsiao, Y.-T. Liu, and M. Y. Chen. iRotate: automatic screen rotation based on face orientation. In Proc. of ACM CHI, pages 2203-2210, 2012.
[7] E. Eriksson, T. R. Hansen, and A. Lykke-Olesen. Movement-based interaction in camera spaces: a conceptual framework. Personal Ubiquitous Comput., 11(8):621-632, 2007.
[8] C. Forlines, C. Shen, and B. Buxton. Glimpse: a novel input model for multi-level devices. In CHI '05 Extended Abstracts, pages 1375-1378, 2005.
[9] T. R. Hansen, E. Eriksson, and A. Lykke-Olesen. Use your head: exploring face tracking for mobile interaction. In CHI '06 Extended Abstracts, pages 845-850, 2006.
[10] B. L. Harrison, K. P. Fishkin, A. Gujar, C. Mochon, and R. Want. Squeeze me, hold me, tilt me! An exploration of manipulative user interfaces. In Proc. of ACM CHI, pages 17-24, 1998.
[11] N. Hassan, M. M. Rahman, P. Irani, and P. Graham. Chucking: a one-handed document sharing technique. In Proc. of INTERACT, pages 264-278, 2009.
[12] K. Hinckley, J. Pierce, M. Sinclair, and E. Horvitz. Sensing techniques for mobile interaction. In Proc. of ACM UIST, pages 91-100, 2000.
[13] K. Hinckley and H. Song. Sensor synaesthesia: touch in motion, and motion in touch. In Proc. of ACM CHI, pages 801-810, 2011.
[14] K. Hinckley and D. Wigdor. Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies, and Emerging Applications, chapter 6. CRC Press, 2012.
[15] S. E. Hudson, C. Harrison, B. L. Harrison, and A. LaMarca. Whack gestures: inexact and inattentive interaction with mobile devices. In Proc. of the Int'l Conf. on Tangible, Embedded, and Embodied Interaction, pages 109-112, 2010.
[16] K. Iwasaki, T. Miyaki, and J. Rekimoto. Expressive typing: a new way to sense typing pressure and its applications. In CHI '09 Extended Abstracts, pages 4369-4374, 2009.
[17] N. Joshi, A. Kar, and M. Cohen. Looking at you: fused gyro and face tracking for viewing large imagery on mobile devices. In Proc. of ACM CHI, pages 2211-2220, 2012.
[18] R. E. Kalman. A new approach to linear filtering and prediction problems. J. Fluids Eng., 20(1):35-45, 1960.
[19] A. K. Karlson, B. B. Bederson, and J. L. Contreras-Vidal. Understanding one-handed use of mobile devices. In J. Lumsden, editor, Handbook of Research on User Interface Design and Evaluation for Mobile Technology, chapter VI, pages 86-101. Information Science Reference, 2008.
[20] M. Kumar and T. Winograd. Gaze-enhanced scrolling techniques. In Proc. of ACM UIST, pages 213-216, 2007.
[21] F. C. Y. Li, D. Dearman, and K. N. Truong. Virtual Shelves: interactions with orientation aware devices. In Proc. of ACM UIST, pages 125-128, 2009.
[22] D. Olsen, J. Clement, and A. Pace. Spilling: expanding handheld interaction to touch table displays. In IEEE Int'l Workshop on TABLETOP, pages 163-170, 2007.
[23] A. Olwal, S. Feiner, and S. Heyman. Rubbing and tapping for precise and rapid selection on touch-screen displays. In Proc. of ACM CHI, pages 295-304, 2008.
[24] R. L. Potter, L. J. Weldon, and B. Shneiderman. Improving the accuracy of touch screens: an experimental evaluation of three strategies. In Proc. of ACM CHI, pages 27-32, 1988.
[25] M. Rahman, S. Gustafson, P. Irani, and S. Subramanian. Tilt techniques: investigating the dexterity of wrist-based input. In Proc. of ACM CHI, pages 1943-1952, 2009.
[26] A. Roudaut, M. Baglioni, and E. Lecolinet. TimeTilt: using sensor-based gestures to travel through multiple applications on a mobile device. In Proc. of INTERACT, pages 830-834, 2009.
[27] A. Roudaut, S. Huot, and E. Lecolinet. TapTap and MagStick: improving one-handed target acquisition on small touch-screens. In Proc. of AVI, pages 146-153, 2008.
[28] A. Roudaut, E. Lecolinet, and Y. Guiard. MicroRolls: expanding touch-screen input vocabulary by distinguishing rolls vs. slides of the thumb. In Proc. of ACM CHI, pages 927-936, 2009.
[29] D. Schmidt, F. Chehimi, E. Rukzio, and H. Gellersen. PhoneTouch: a technique for direct phone interaction on surfaces. In Proc. of ACM UIST, pages 13-16, 2010.
[30] A. Sears and B. Shneiderman. High precision touchscreens: design strategies and comparisons with a mouse. Int. J. Man-Mach. Stud., 34(4):593-613, 1991.
[31] J. Thomason and J. Wang. Exploring multi-dimensional data on mobile devices with single hand motion and orientation gestures. In Proc. of MobileHCI, pages 173-176, 2012.
[32] M. Tsang, G. W. Fitzmaurice, G. Kurtenbach, A. Khan, and B. Buxton. Boom Chameleon: simultaneous capture of 3D viewpoint, voice and gesture annotations on a spatially-aware display. ACM Trans. Graph., 22(3):698-698, 2003.
[33] D. Wigdor and R. Balakrishnan. TiltText: using tilt for text input to mobile phones. In Proc. of ACM UIST, pages 81-90, 2003.
[34] A. Zarek, D. Wigdor, and K. Singh. SNOUT: one-handed use of capacitive touch devices. In Proc. of AVI, 2012.