GO-Finder: A Registration-Free Wearable System for Assisting Users in Finding Lost Objects via Hand-Held Object Discovery
Takuma Yagi, Takumi Nishiyasu, Kunimasa Kawasaki, Moe Matsuki, Yoichi Sato
Takuma Yagi, [email protected], University of Tokyo, Tokyo, Japan
Takumi Nishiyasu, [email protected], University of Tokyo, Tokyo, Japan
Kunimasa Kawasaki∗, [email protected], FUJITSU LABORATORIES LTD., Kanagawa, Japan
Moe Matsuki†, [email protected], SoftBank Corp., Tokyo, Japan
Yoichi Sato, [email protected], University of Tokyo, Tokyo, Japan
ABSTRACT
People spend an enormous amount of time and effort looking for lost objects. To help remind people of the location of lost objects, various computational systems that provide information on their locations have been developed. However, prior systems for assisting people in finding objects require users to register the target objects in advance. This requirement imposes a cumbersome burden on the users, and the system cannot help remind them of unexpectedly lost objects. We propose GO-Finder ("Generic Object Finder"), a registration-free wearable-camera-based system for assisting people in finding an arbitrary number of objects, based on two key features: automatic discovery of hand-held objects and image-based candidate selection. Given a video taken from a wearable camera, GO-Finder automatically detects and groups hand-held objects to form a visual timeline of the objects. Users can retrieve the last appearance of an object by browsing the timeline through a smartphone app. We conducted a user study to investigate how users benefit from using GO-Finder and confirmed improved accuracy and reduced mental load in the object search task, achieved by providing clear visual cues on object locations.
CCS CONCEPTS
• Human-centered computing → Ubiquitous and mobile computing systems and tools.
KEYWORDS
Memory aid, lost objects, wearable camera, object discovery, hand-object interaction
ACM Reference Format:
Takuma Yagi, Takumi Nishiyasu, Kunimasa Kawasaki, Moe Matsuki, and Yoichi Sato. 2021. GO-Finder: A Registration-Free Wearable System for Assisting Users in Finding Lost Objects via Hand-Held Object Discovery. In IUI '21, April 14–17, 2021, College Station, TX, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3397481.3450664

∗ This research was conducted independently of Fujitsu (Laboratory), and the results are not representative of Fujitsu.
† This research was conducted independently of SoftBank Corp., and the results are not representative of SoftBank.

IUI '21, April 14–17, 2021, College Station, TX, USA
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published at https://doi.org/10.1145/3397481.3450664.
Looking for an object we do not remember leaving somewhere occurs frequently and is considered a recurring problem regardless of age [30]. One survey reported that people waste 2.5 days a year looking for misplaced objects [37]. Thus, technological support to assist users in finding lost objects is in demand [11, 30].

Ubiquitous computing tackles this problem by collecting and providing cues on where objects are located. Placing external sensors on the target object [18, 34] and detecting objects with visual sensors [5, 39, 41] have been proposed as major solutions for keeping track of object locations on behalf of users. Such prior systems are designed to track a small number of important objects and ask the user to register the target objects in advance. When looking for an object, the user searches a list of the registered objects (e.g., a list of object names) to select which object to look for.

However, the objects we lose are not necessarily registered. We often lose unique objects such as important documents or a new item bought the day before. Since such objects are not usually registered, the system cannot help users find them. To deal with such losses, one might think of automatically registering all the objects appearing around the user. However, this produces an enormous number of candidates, making it impossible for the user to find an object within a reasonable amount of time. Moreover, assigning a unique name to each object becomes unrealistic as the number of objects grows. To support finding arbitrary objects, we need not only to track potential objects to be lost but also to eliminate the burden of registration.

We introduce two key ideas to overcome these issues. First, instead of tracking all the objects appearing around the user, we limit the search range to objects handled by hands. Since most portable objects we want to look for are handled with our hands, we can significantly reduce the number of candidate objects by limiting the scope to hand-held objects. A reduced number of candidates enables users to look for the target object in a realistic amount of time.

Another key idea is to use the object image as a query to select which object to look for from the candidates.
Figure 1: GO-Finder assists users in finding lost objects by showing the last scene in which the user handled them: (1) the user forgets the location of an object, (2) queries it using a thumbnail image, and (3) finds the location from the frame of its last appearance. The user looks through a list of object images to select the object of interest.

Instead of assigning unique names to objects, we display to users a list of object images from which to select the object to look for. Visual information about objects enables the user to identify the target object instantly without assigning a unique name to it.

Based on these ideas, we propose GO-Finder ("Generic Object Finder"), a registration-free wearable system for assisting users in finding arbitrary hand-held objects. GO-Finder only requires a video captured from the wearable camera, does not require any registration, and can handle arbitrary hand-held objects automatically. When finding objects with GO-Finder, users first skim through a list of object thumbnails, called the hand-held object timeline, to tell the system which object to search for. Given the selected object, GO-Finder presents an image of the last scene in which it appeared (Figure 1). This is achieved by a fully automatic process of hand-held object discovery, which detects and clusters hand-held objects.

To validate the effectiveness of GO-Finder, we conducted a user study in a laboratory setting, mimicking a situation of finding an object. We confirm that users can successfully find objects by using GO-Finder and reduce their mental load in performing the object-search task compared to the unaided condition. Participant feedback suggests that it is feasible to find arbitrary hand-held objects using our proposed hand-held object timeline, which significantly broadens the coverage of objects to look for.
Various types of sensors, such as wireless tags [5, 26, 28, 36, 40], Bluetooth [19, 29, 32], stationary cameras [8, 41], and wearable cameras [12, 13, 39], have been studied for systems that assist users in finding lost objects. Active and passive radio-frequency identification (RFID) tags are frequently deployed by attaching them to target objects. While RFID tags are effective in indoor environments, they cannot locate an object taken outside the search range. To expand the search range, a combination of Bluetooth and GNSS is adopted in some commercial products (e.g., Tile [17]). Although these systems can provide the angle and distance from the tag, their guidance is less intuitive, and attaching an external tag to each object is a major bottleneck to tracking a large number of objects.

Alternatively, camera-based systems have the merit of not requiring external sensors attached to objects. Butz et al. [8] used augmented reality (AR) markers to search for objects in an office environment. Xie et al. [41] proposed a dual-camera system for indoor object retrieval. However, stationary cameras do not solve the problem of the search range and are weak against occlusions when objects are hidden by other entities.

Wearable-camera-based systems mitigate these problems by capturing images from the user's viewpoint. Since the camera moves along with the user, the system captures a close-up of the surrounding environment and can be carried around, significantly expanding the search range. Similar to GO-Finder, Ueoka et al. [39] developed a wearable-camera-based object retrieval system based on object detection. The system consists of head-mounted RGB and infrared cameras for capturing pre-registered objects, and assists in object search by showing the last scene in which the target object was detected. We adopt the same strategy in this work. Unlike [39], however, our system automatically groups all the hand-held objects appearing around the user, eliminating the registration operation.

Different from all the above works, which assume a small number of manually registered items, we tackle the challenging problem of fully automatic hand-held object tracking. We also provide users with a way to efficiently select the object of interest from a list of automatically tracked objects.
Camera-based systems have also been used to mitigate memory problems other than losing objects, since visual information conveys a larger amount of information than textual information [6]. Tran et al. [38] proposed a system to monitor the progress of a cooking activity. Hodges et al. [15] proposed a wearable-camera-based system called SenseCam, which takes wide-angle pictures periodically (e.g., one shot every 30 s) to remind users of past events. Li et al. [24] proposed FMT, a wearable memory-assistance system for remembering the state of objects (e.g., the last time the plant was watered). While their hardware configuration is similar to ours in using a neck-mounted wearable camera, they aim to recall past interactions with a small number of daily-used objects and ask users to attach AR markers to each object. In contrast, GO-Finder aims to expand the range of objects that can be searched for by removing the registration operation.
GO-Finder executes hand-held object detection and grouping to discover objects appearing in first-person videos. Discovering objects in first-person videos is a difficult problem since the object categories appearing in daily life are massive, diverse, and individual-dependent. To this end, various methods have been proposed to discover objects in first-person videos [2, 4, 23, 31]. Lee et al. [23] developed a model to discover important object regions using multiple first-person saliency cues. Lu et al. [27] proposed an object-clustering-based method for personal-object discovery. Their system exploits object-scene distributions based on the assumption that personal objects appear in different scenes while non-personal objects typically remain in similar scenes.

Since objects appearing in first-person videos are typically handled by hands, hand information has been used to improve object detection. Lee et al. [21, 22] proposed using hands as a guide to identify an object of interest in a photo taken by people with visual impairment. Shan et al. [33] collected a large-scale dataset of hand-object interaction along with annotated bounding boxes of hands and the objects in contact with them. Their proposed system can detect hands and in-contact objects from an image. Our aim is not only to detect hand-held objects but also to discover hand-held-object instances from first-person videos, which reduces the number of candidates to be registered.
GO-Finder requires a wearable camera, a processing server, and a smartphone for browsing the location of the objects the user is looking for (see Figure 2). The procedure is divided into an observation phase and a retrieval phase.

In the observation phase, a user wears a camera around their neck. The camera continuously captures first-person images and sends them to the processing server. The server processes the received images to detect and track hand-held objects. Finally, the object images are clustered by their appearance to discover groups of object instances.

In the retrieval phase, users use a smartphone-based interface (see Figure 3) to receive the processed results through a wireless connection. First, users select which object to look for through the hand-held object timeline (Figure 3, left). Then, they find the target object by viewing the pop-up screen showing its last appearance (Figure 3, right).
GO-Finder attempts to detect hand-held objects and discover groups of object instances from the first-person video. By discovering object instances, we can acquire the last appearance of each object, which is used to find it. Figure 4 shows a rough sketch of how the last appearance of an object is acquired. An object detector detects hand-held objects from first-person video frames. From all the detected object images, we apply tracking and clustering (see Section 4 for details) to discover groups of cropped object images, clustered by instance. Since we are interested in finding the last location of the object, we only use the last thumbnail image and the last frame for our user interface.
GO-Finder automatically discovers hand-held object instances and registers them as candidates. In this case, searching for objects by their names becomes unrealistic, since it requires an association between each object name and its appearance. We therefore propose the hand-held object timeline, which lets users select the target object by browsing thumbnail images of the objects (see Figure 3, left). Thumbnails of the objects are sorted by the last time they appeared, in descending order. By skimming through the timeline, users select a thumbnail of the target object to retrieve its last appearance. We adopt the image timeline as a metaphor for a photo album, which is widely accepted in existing smartphone-based interfaces.

Note that the obtained object timeline can also be used as a trigger to remind the user of an object's location. The timeline acts as a concise history of what the user has handled in the past. Even before arriving at the target object, the user can be reminded of past actions by looking back at the timeline.
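As a rough illustration (not GO-Finder's actual implementation), the following sketch assumes each discovered cluster is a list of dicts carrying a cropped thumbnail, a frame identifier, and a timestamp attached during discovery; these field names are hypothetical. It assembles the timeline by keeping each cluster's most recent detection and sorting the entries newest-first, as described above.

```python
def build_timeline(clusters):
    """Build the hand-held object timeline: one entry per discovered cluster,
    keeping the most recent detection as its thumbnail, sorted newest-first."""
    entries = []
    for cluster in clusters:
        last = max(cluster, key=lambda m: m["timestamp"])  # last appearance
        entries.append({
            "thumbnail": last["crop"],       # cropped object image shown in the timeline
            "last_frame": last["frame_id"],  # full frame shown on the pop-up screen
            "last_seen": last["timestamp"],
        })
    # most recently handled objects come first, as in the app
    return sorted(entries, key=lambda e: e["last_seen"], reverse=True)
```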
By tapping a thumbnail in the object timeline, a pop-up screen appears that shows the appearance of the object and the time (see Figure 3, right). Since the pop-up screen shows the critical moment of leaving an object, the user can instantly be reminded of the object's location by looking at the surrounding environment.
We now introduce the details of the hand-held object discovery algorithm used in GO-Finder (see Figure 5).
We use a state-of-the-art hand-held object detection algorithm [33] trained on a large-scale image dataset of hand-object interaction collected from first-person video datasets [9, 25, 35]. Given a video frame, it produces bounding boxes of the hands, their contact state (self-contact, other person, portable object, or static object), and the objects they manipulate (see Figure 5 (a)). It detects arbitrary types of objects in contact with hands while rejecting objects not handled by them. Since we are interested only in portable objects, we keep the detections whose predicted contact state is portable object. Furthermore, detections that occupy more than half the side length of the frame are considered noise and are excluded.
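A minimal sketch of this post-filtering step is shown below (the detector itself is the off-the-shelf model of [33]); the `Detection` container and the "portable" label string are placeholders for illustration, not the detector's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple           # (x1, y1, x2, y2) in pixels
    contact_state: str   # "self", "other-person", "portable", or "static" (assumed labels)

def filter_hand_held(detections, frame_w, frame_h, max_side_ratio=0.5):
    """Keep detections of portable hand-held objects and drop boxes that
    span more than half the frame's side length (treated as noise)."""
    kept = []
    for d in detections:
        if d.contact_state != "portable":
            continue  # reject static objects, self-contact, and other people
        x1, y1, x2, y2 = d.box
        if (x2 - x1) > max_side_ratio * frame_w or (y2 - y1) > max_side_ratio * frame_h:
            continue  # oversized boxes are excluded as noise
        kept.append(d)
    return kept
```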
Using the detected bounding boxes, we cluster them into a set of instances based on their appearance features. Every detection should be assigned to a single cluster, and re-appearing objects should be merged into existing clusters. To this end, we adopt a combination of local and global matching, which consists of three stages.
Stage 1: Frame-wise Tracking.
We first apply a visual tracker to the detected hand-held objects. If the tracker successfully associates consecutive detections, we assign the new detection to the same cluster as the previous one (Figure 5 (c)). Since first-person videos include large camera motion, we use an appearance-based tracker [3], which performs similarity matching. The assignment cost matrix is calculated from the intersection-over-union between all the tracker's predictions and the actual detections, and the optimal assignment is obtained with the Hungarian algorithm [20].
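A minimal sketch of this association step is given below, assuming axis-aligned boxes and using SciPy's Hungarian solver; the minimum-IoU cutoff is an assumed parameter, not a value reported in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detected_boxes, min_iou=0.3):
    """Match tracker predictions to current detections by maximizing total IoU.
    Returns (prediction_index, detection_index) pairs for accepted matches."""
    cost = np.zeros((len(predicted_boxes), len(detected_boxes)))
    for i, p in enumerate(predicted_boxes):
        for j, d in enumerate(detected_boxes):
            cost[i, j] = -iou(p, d)  # negate: linear_sum_assignment minimizes cost
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= min_iou]
```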
Figure 2: The user wears a camera around their neck. During the observation phase, first-person images are sent to the processing server to discover hand-held objects. At retrieval, the processed object information is sent from the server over Wi-Fi, and the user retrieves the last frame of each object through the smartphone app.
Figure 3: Interface of the smartphone app. (Left) Hand-held object timeline. (Right) Pop-up screen.
Figure 4: Given first-person video frames, the system detects hand-held objects and groups them to discover a cluster of cropped object images for each object. Since we are interested in providing the last location where an object appeared, we use the last thumbnail image and the last frame in which the object appeared to help the user find a specific object.
Stage 2: Local Feature Matching.
When tracking fails, we apply local matching between the latest detection and the existing clusters based on the object's appearance. We use pre-trained convolutional neural network (CNN) features to find similar objects in the existing clusters. For every detection, a 2048-dimensional feature vector is first extracted from the layer before the final layer of an ImageNet-pretrained ResNet-50 [14]. We then calculate the cosine similarity between the new detection and all the detections in each cluster (see Figure 5 (d), top). Next, for each cluster, if both the maximum and the median of the similarity scores are above certain thresholds, the new detection is merged into that cluster. We check the median score to avoid false associations. If none of the clusters meets the condition, a new cluster is created.
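A sketch of this max/median rule, assuming each cluster is kept as a plain list of 2048-dimensional feature vectors (the actual data structures are not specified in the paper); the default thresholds match the values reported in the implementation details (0.8 and 0.7).

```python
import numpy as np

def cosine_similarities(query, feats):
    """Cosine similarity between one feature vector and a matrix of vectors."""
    q = query / (np.linalg.norm(query) + 1e-9)
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-9)
    return f @ q

def assign_detection(query_feat, clusters, max_thr=0.8, median_thr=0.7):
    """Merge a new detection into a cluster whose maximum and median similarity
    both exceed the thresholds; otherwise open a new cluster."""
    for cluster in clusters:
        feats = np.stack(cluster)  # members' 2048-d feature vectors
        sims = cosine_similarities(query_feat, feats)
        if sims.max() >= max_thr and np.median(sims) >= median_thr:
            cluster.append(query_feat)
            return cluster
    clusters.append([query_feat])  # no match found: start a new cluster
    return clusters[-1]
```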
Stage 3: Global Cluster-wise Merging.
Since the previous stage matches against a single detection, it tends to form a new cluster if the viewpoint or boundary of the latest detection fluctuates, even when it should be merged. To deal with such incorrectly segmented clusters, we try to merge clusters by global cluster-wise merging. Given a pair of clusters, we calculate the sample-wise cosine similarity between them, forming a similarity matrix (see Figure 5 (d), bottom). Note that the scores calculated at stage 2 can be reused in this stage. If the maximum and the median of the similarity matrix exceed certain thresholds, the two clusters are merged.

However, this merging process is time-consuming and should not be repeated every time. To reduce the number of trials, we re-try merging only if the number of similarity matrix elements is more than two times that of the last trial.
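One merge attempt between two clusters could be sketched as follows, under the same assumptions and thresholds as the local-matching sketch above.

```python
import numpy as np

def try_merge(cluster_a, cluster_b, max_thr=0.8, median_thr=0.7):
    """Compute the pairwise cosine-similarity matrix between two clusters and
    merge cluster_b into cluster_a when both its maximum and median entries
    exceed the thresholds. Returns True if the merge happened."""
    fa = np.stack(cluster_a).astype(float)
    fb = np.stack(cluster_b).astype(float)
    fa /= np.linalg.norm(fa, axis=1, keepdims=True) + 1e-9
    fb /= np.linalg.norm(fb, axis=1, keepdims=True) + 1e-9
    sim = fa @ fb.T  # |A| x |B| similarity matrix
    if sim.max() >= max_thr and np.median(sim) >= median_thr:
        cluster_a.extend(cluster_b)  # merge B into A
        return True
    return False
```

As described above, the merge is re-attempted only after the number of similarity-matrix elements has more than doubled since the last attempt, which keeps this quadratic comparison infrequent.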
Figure 5: Overview of the hand-held object discovery algorithm. (a) Input frames. (b) Example of hand-held object detection; yellow and red boxes denote detected objects and hands, respectively. (c) Tracked detections; tracks are typically segmented due to tracking failure or re-appearance. (d) Local matching between the latest detection and an existing cluster (top), and global matching between two existing clusters (bottom). (e) Segments are clustered by instance. The last appeared scenes (images with a red frame) are displayed in the user interface.
Determining Similarity Thresholds.
Changing the hyperparameters (the maximum and median similarity thresholds) may affect the user experience. Stricter thresholds produce over-segmented and more numerous clusters while achieving higher recall on the discovered target objects; this makes it more difficult for the user to select the object of interest from the candidates. In contrast, looser thresholds result in a smaller number of clusters, with the risk of missing objects due to wrong associations. A reduced number of clusters may make it easier for the user to select the target object, but it may be impossible to find it if it is incorrectly merged with other objects. While we empirically selected these parameters during the study, we further introduce additional heuristics to explicitly suppress false associations.
Constrained Clustering using First-person Cues.
During similarity calculation, the hand-held object discovery algorithm sometimes shows a high similarity to a different object due to the appearance of the hand or similar textures, producing false associations. Therefore, we introduce several heuristics to suppress such false associations. If a detected bounding box, or a pair of them, satisfies any of the following conditions (sketched after the list), the similarity of that pair is set to zero.
• Aspect ratio between two boxes: the ratio of the two bounding-box aspect ratios is larger than 1.5.
• Ratio of skin color: the ratio of the skin-colored region (calculated using a color histogram) is larger than 0.3.
• Area ratio of the object to the corresponding hand: the ratio of the two area ratios (area of the object bounding box to that of the hand bounding box) is larger than 1.5.
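A minimal sketch of these checks is shown below. It assumes the skin-color ratio and the associated hand box have been computed upstream (e.g., by the color-histogram test and the hand detector) and attached to each detection as dict fields; these names are hypothetical, not the paper's data structures.

```python
def similarity_suppressed(det_a, det_b, aspect_thr=1.5, skin_thr=0.3, area_thr=1.5):
    """Return True if the pair's appearance similarity should be forced to zero.
    Each detection is assumed to carry 'box', 'hand_box', and 'skin_ratio'."""
    def aspect(box):
        x1, y1, x2, y2 = box
        return (x2 - x1) / max(y2 - y1, 1e-9)

    def area(box):
        x1, y1, x2, y2 = box
        return max(x2 - x1, 0.0) * max(y2 - y1, 0.0)

    # 1) bounding-box aspect ratios differ by more than 1.5x
    r = aspect(det_a["box"]) / max(aspect(det_b["box"]), 1e-9)
    if max(r, 1.0 / r) > aspect_thr:
        return True
    # 2) either crop is dominated by skin-colored pixels
    if det_a["skin_ratio"] > skin_thr or det_b["skin_ratio"] > skin_thr:
        return True
    # 3) object-to-hand area ratios differ by more than 1.5x
    ra = area(det_a["box"]) / max(area(det_a["hand_box"]), 1e-9)
    rb = area(det_b["box"]) / max(area(det_b["hand_box"]), 1e-9)
    rr = ra / max(rb, 1e-9)
    return max(rr, 1.0 / rr) > area_thr
```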
Implementation Details.
We sampled video frames at 10 fps and resized them to VGA resolution before processing. While a lower frame rate is enough to capture the moment of leaving an object, we found that a higher frame rate tracks objects more stably. We set the maximum-similarity threshold to 0.8 and the median-similarity threshold to 0.7.
We conducted an in-lab experiment to determine (i) whether GO-Finder can correctly discover hand-held objects from the video and (ii) whether users can use the system to find target objects. We hypothesized that by using GO-Finder, users can find objects correctly and quickly with less mental load.
We recruited 12 volunteers (10 males and 2 females) aged 18 to 28. All were familiar with using smartphones. The experiment was conducted in a room in our lab. The task was a hide-and-seek task performed by the participants. First, participants filled out a pre-study questionnaire on their past experience of looking for lost objects. After an introduction to the task, each participant was asked to hide a set of objects inside a room (arrangement phase), conduct a surrogate task to forget the locations of the objects (forgetting phase), and later to correctly retrieve a subset of them (retrieval phase). The trial was repeated three times, changing the experimental condition. Conditions were randomly shuffled to eliminate order effects. After all the trials, participants filled out a post-study questionnaire on the usability of the interface. Finally, we conducted a semi-structured interview to gather further insights. One participant (P05) was excluded from the analysis due to a misunderstanding of the instructions.

Arrangement Phase.
First, the participant went to the room and was asked to hide a set of objects prepared by the experimenter. The locations to hide the objects were specified with pink tags, and the participants were informed about them in advance (see Figure 6, left). The participants carried a basket along with the objects. During the experiment, participants wore a GoPro HERO 7 camera ( ° diagonal field of view) to record first-person videos.
Figure 6: (Left) Object arrangement: participants hide objects at specified locations while wearing the camera around their neck. (Middle) Objects used in the study. (Right) Example timeline of the frame-based system, ordered from latest to oldest.

Forgetting Phase.
The participants moved to another room for a 15-min interval to forget the arrangements. During the interval, the participants were asked to solve as many simple calculation problems as possible.
Retrieval Phase.
The participants came back to the room and were asked to bring back a subset of the hidden objects. The list of objects to bring back was shown in a photo. In addition to the neck-mounted camera, the participants wore a smartphone around their neck to use the system. Under each condition, participants were given instructions on how to use the system and familiarized themselves with the interface by browsing the result of a sample video containing a few objects. They were not forced to use the system; they used it only when they needed to.
We compared three conditions:
• No aid: The participant searches for objects without any assistance.
• Frame-based aid: The participant is shown a timeline of images extracted every 5 sec.
• Object-based aid (GO-Finder): Our proposed system with the hand-held object timeline and pop-up screen.
The frame-based aid condition resembled automatic image capture devices such as SenseCam [15]. We hypothesized that past images would help the participants remember their arrangement of objects. Considering the duration of the task, we showed images taken every 5 sec (see Figure 6, right), which is denser than typical devices (e.g., 30 sec).

We used a laptop PC to run the object discovery algorithm. The connection between the laptop PC and the smartphone was established via Wi-Fi, as shown in Figure 2. In every trial, participants hid 16 objects in a choice of 20 locations and were asked to retrieve 6 of them. We used a different object set for each trial, resulting in 48 objects in total (see Figure 6, right). The objects differed in color and shape, and sometimes included multiple instances of the same category.
Localization Rate.
To measure how well the hand-held object discovery algorithm can discover target objects, we counted the number of objects whose locations could be identified by a third person who had no memory of the arrangement, using only our system. We define the localization rate as the ratio of the number of identifiable objects to the total number of target objects. One of the authors manually calculated this metric by using the app. We counted a success only if a close-up of the object was visible in a thumbnail of the timeline and the object location could be correctly determined from the pop-up screen without difficulty. This metric acts as an expected recall of the system.
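Written as a formula, the metric is

\[
\text{localization rate} = \frac{\#\,\{\text{target objects whose location is identifiable via the app}\}}{\#\,\{\text{target objects hidden in the trial}\}}.
\]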
Number of Clusters.
We also measured the number of clusters formed by the hand-held object discovery algorithm and analyzed the contents of the timeline. We ran the algorithm for all 36 trials (12 participants × 3 conditions).

Correctness of Retrieval.
We calculated the mean precision of each trial. We counted a retrieval as correct when the user found an object listed on the target list and as incorrect when the user opened a location containing an incorrect object or no object. We compared the three pairs of conditions using paired t-tests on the difference of mean scores.
Table 1: Localization rate of each object set (%).
(Mean ± SD, Min, Max)
Set 1: 84.9 ±
Set 2: 83.3 ±
Set 3: 88.5 ±

Task Completion Time.
We expected a shorter task completion time when using the system. We compared the three pairs of conditions using the Wilcoxon signed-rank test on the difference in mean task completion times.
System Usage Time.
Since GO-Finder can search for objects directly through the hand-held object timeline, we expected a shorter usage time with GO-Finder than with the frame-based aid condition. We measured the number of times participants used the system and the usage time per trial from the recorded videos. (We counted one use each time the user attempted to search a location after using the system.)

Questionnaire.
After all the trials, participants answered questions on each condition. First, participants were asked, "How do you rate the difficulty of completing the task?" on a seven-point scale (easy = 1, difficult = 7). We used the Wilcoxon signed-rank test on the difference of means. Regarding the features of the interface, we asked whether they agreed with the following statements on a five-point scale: Q1) The timeline is easy to view. Q2) The timeline is intuitive to use. Q3) The timeline helped me look for objects. Q4) The pop-up screen is easy to view. Q5) The pop-up screen helped me look for objects. Q6) The timeline (under each condition) gave me a clue on the location of the target object. Q7) I could be reminded of the locations of the objects by using the system (under each condition).
Observation and Interview.
We observed how the participants searched for objects. During the interviews, we asked what they thought during the retrieval task. To collect insights on using this system in daily life, we also asked "What do you recommend to improve the interface?" and "How do you feel about wearing a camera in private/public places?".

Localization Rate.
Table 1 shows the localization rate of the hand-held object discovery algorithm. The average score was 84.9, 83.3, and 88.5% for each object set, and the overall average was 85.6%. These results indicate that GO-Finder could correctly display 13.6 out of 16 objects per session on average. The minimum rate across participants was 62.5%.

We found differences in performance among objects. While several objects were discovered in all 12 trials (green cup, wood glue, electric bulb, futon pincher, green cloth, and teddy bear), some objects were difficult to discover (waiter's corkscrew: 16.7%, medicine bottle: 58.3%, black wallet and spray bottle: 66.7%). We found that small and black objects were difficult to discover correctly due to occlusion and texture-less regions.
Number of Clusters.
The number of clusters (objects) that appeared in the hand-held object timeline was 108.6 (𝑆𝐷 = . ) on average. Although it is not trivial to count the number of valid objects that should be discovered, the estimated number of valid objects (including furniture, drawers, and baskets) was about 20 to 30, including the 16 target objects. Thus, we conclude that the algorithm over-segments an object into 4 to 5 clusters on average.

Qualitative Analysis.
Figure 7 shows an example of the obtained hand-held object timeline. We annotated the thumbnail images that contain close-ups of the target objects with green boxes. Forty out of 110 clusters contained 15 of the target objects. In addition to the target objects, GO-Finder discovered various valid objects and false positives. Examples of valid objects were chairs, baskets, and drawers, while untouched furniture, the participant's body, and other people were discovered as false positives. While most objects were easily identifiable from the thumbnail images, some thumbnails were difficult to identify due to occlusion, shadow, and irregular views (e.g., the green cup in Figure 7, bottom-left).
Tables 2 and 3 show the results of the object retrieval task under each condition. We report the average precision and its 95% confidence interval (CI) under each condition. As expected, GO-Finder showed better precision with less variance than the other two conditions. The paired t-test revealed significance only between the frame-based aid and object-based aid conditions (𝑝 = . , 𝑝 = . , and 𝑝 = . , respectively). However, both the no aid/object-based aid and frame-based aid/object-based aid comparisons showed large effect sizes (𝑑 = . and 𝑑 = . , respectively), indicating a positive effect of using the proposed system. In contrast, we did not observe a marked difference between the no aid and frame-based aid conditions (𝑑 = . ).

Tables 4 and 5 show the results for task completion time under each condition. We did not observe an improvement in task completion time from using GO-Finder. The paired t-test did not show any significant difference (𝑝 = . , 𝑝 = . , and 𝑝 = . ). The average time and 95% confidence interval of the arrangement phase was 223 ± 16 sec.
Usage Time.
During the 12 sessions under the frame-based aid and object-based aid conditions, participants used the interface 32 and 35 times, respectively. The mean (median) usage times were 28.1 sec (23.0 sec) and 16.1 sec (12.5 sec), respectively. The paired t-test revealed a significant difference with a medium effect size in the mean times (𝑝 = . , 𝑑 = . ). This suggests that participants were able to browse the timeline more efficiently under the object-based aid condition than under the frame-based aid condition.

Ease of Task.
Figure 8 and Table 6 show the results on ease of the task. Surprisingly, the participants rated the frame-based aid condition as the most difficult, and the object-based aid condition as the easiest among the three conditions. Based on the Wilcoxon signed-rank test, we found a significant difference in the mean scores between the frame-based aid and object-based aid conditions (𝑝 = . , 𝑝 = . , and 𝑝 = . ). However, we observed medium effect sizes in all the combinations (𝑟 = . , 𝑟 = . , and 𝑟 = . ). This suggests that the participants' subjective mental load decreased by using GO-Finder.
Figure 7: Example results of the hand-held object timeline (localization rate = 0.9375, …). Thumbnails are ordered from latest (left) to oldest (right).
Table 2: Average precision of each condition (mean ± 95% CI).
No aid: 0.728 ±

Table 3: Results of paired t-tests on the difference in mean precisions.
No aid / frame-based aid: t = -0.104, p = 0.918, 95% CI [-0.193, 0.175], d = 0.04
No aid / object-based aid: t = -2.012, p = 0.069, 95% CI [-0.407, 0.018]
Frame-based aid / object-based aid: t = -2.339, 95% CI [-0.360, -0.011]
Table 4: Task completion time (sec), mean ± 95% CI.
No aid: 216 ±

Table 5: Results of paired t-tests on the difference in mean task completion times.
No aid / frame-based aid: t = -0.316, p = 0.379, 95% CI [-173, 130], d = 0.12
No aid / object-based aid: t = 0.513, p = 0.309, 95% CI [-128, 206], d = 0.23
Frame-based aid / object-based aid: t = 1.530, p = 0.077, 95% CI [-27, 148]

Functionality of Interface.
Figure 9 shows the results for questions Q1–Q7. For Q1–Q5, participants reported positive impressions of the proposed system. The Wilcoxon signed-rank test revealed a significant difference between the frame-based aid and object-based aid conditions for Q6 (𝑝 = . ) but not for Q7 (𝑝 = . ). However, Q6 and Q7 showed large and medium effect sizes (𝑟 = . and 𝑟 = . ), respectively, suggesting that GO-Finder was more useful for finding object locations than the frame-based aid condition.

Video Observation.
In general, participants first looked for objects they remembered and used the system when they were not confident about the location. When using the system, they looked for thumbnails showing the target object, inferred the location from the pop-up screen, and successfully retrieved the object. Two users persisted in using GO-Finder even when the system failed to discover the target objects (P06, P10).
Usefulness of Object-based Aid Condition.
Eleven out of the 12 participants said that GO-Finder was convenient to use. With only a brief instruction, they were able to retrieve the forgotten locations with GO-Finder. They appreciated the intuitiveness of the hand-held object timeline: "The function I wanted most was there. Because the objects were highlighted and zoomed in, I could see the target objects and retrieve their last appearance by tapping the thumbnail" (P08). They felt more secure and confident at retrieval: "Since I don't have to rely on my intuition, I looked at the smartphone once I felt lost. By using the system, I often felt confident about the location" (P10). A few participants trusted the system's output rather than their memory: "I arrived at the wrong location since I relied on the system. I didn't rely on my memory but inferred the location from the pop-up screen and got it wrong" (P13).
Comparison to Frame-based Aid.
In contrast, nine participants gave negative feedback regarding the frame-based aid condition. They mainly complained that the timeline often did not capture the exact moment of leaving objects. Difficulty in finding critical scenes in the large field-of-view images was also reported: "Since the images often don't capture the scene when holding objects, I found myself zooming into the image but found nothing several times" (P10).
Figure 8: Ease of task (easy = 1, difficult = 7). Error bars denote 95% confidence intervals.
Figure 9: Results of the questionnaire. Error bars denote 95% confidence intervals.

Table 6: Results of Wilcoxon signed-rank tests on ease of task.
No aid / frame-based aid: Z = -1.857, p = 0.063, 95% CI [-2.581, 0.081]
No aid / object-based aid: Z = -1.501, p = 0.133, 95% CI [-0.596, 2.263]
Frame-based aid / object-based aid: Z = -2.027

Under the frame-based aid condition, the user additionally has to remember how they left the objects during the arrangement, and was sometimes confused by their own behavior: "I was deluded by my own behavior, attempting to leave the object once but actually doing it afterward" (P03). One participant preferred the frame-based timeline because its thumbnails were evenly sampled in chronological order: "I preferred that (the frame-based timeline) because the entire timeline was available and I could infer how I searched by looking at an image and the image next to it" (P06).
On Interface of System.
Participants appreciated thumbnail images taken from their own point of view: "The objects were shown by image, and were taken when I lost the object. The system was convenient since critical moments were captured in the timeline" (P09). While participants gave positive feedback for every component of the interface (Q1–Q7), they gave lower scores on the ease of using the hand-held object timeline (Q1). First, over-segmentation of the objects confused some participants: "Regarding four thumbnails showing a tennis ball, I had no idea which one to press [...]" (P01). The quality of the thumbnail images (brightness, occlusion, contrast, and viewpoint) also made it difficult for participants to find the object of interest: "[...] the thumbnail image of the last scene was difficult to identify. For example, I had to zoom in (to the thumbnail) when I looked for the pouch" (P02).
Privacy Concerns.
While we expected negative feedback on capturing images, six participants reported that they were not concerned about recording videos, while three participants raised specific concerns: "I don't feel any discomfort since I know what the system does; maybe because I know the system only collects information on objects. It might be different if the system captures people's faces" (P12). Some participants changed their behavior because they were aware of being recorded, even though we did not give them any warning: "I thought it was better not to hide the camera" (P02).
Suggestions on Improvement.
Two participants suggested playing a video snippet instead of a static image on the pop-up screen: "I think it'll be easier to remember if I can view the before and after of the last scene" (P09). Regarding real-world use, participants suggested querying by background (P01, P04) and by time (P12). They stated that the object itself is not always the key to remembering a scene and requested functionality to filter the candidates by themselves.
In the user study, we confirmed that GO-Finder enabled the participants to retrieve hidden objects with less mental load. Quantitative and qualitative feedback suggests that users gained confidence by immediately accessing the last moment when the objects were seen. The frame-based timeline showing uniformly sampled video frames was not effective for this task. Since the object-finding task should be solved as quickly as possible, users requested direct access to the location rather than having to keep relying on their memory.

The system evaluation showed that the hand-held object discovery algorithm of GO-Finder successfully extracted hand-held object instances while efficiently excluding unrelated objects. GO-Finder worked without explicit registration, making it easy for users to start using it.
Figure 10: System failure under severe occlusion.
Regarding the idea of using object images as a query, participants adapted quickly to browsing objects using the hand-held object timeline (Q3). The timeline, shown as a list of images, was evaluated as intuitive, and participants were able to access the object of interest immediately. Since the thumbnail images are captured from almost the same view as the participant's, the timeline also worked as a cue for remembering (Q6).

While most participants did not report any problem with querying by object images, some suggested using background and time information as additional cues for retrieval. This feedback suggests that participants wanted to be reminded of the scene using their incomplete memory when browsing target objects through the timeline. In this study, we did not provide any features to narrow down the candidates by contextual information such as scene, location, or time. Adding such features would give users options for reaching the target object in their most familiar way.
The use of wearable cameras raises privacy concerns for real-world use [10, 16]. Although GO-Finder's contents are not shared with other users, the privacy and comfort of bystanders must be ensured. One way out of this difficulty is to filter out sensitive content and store only the information relevant to hand-held objects. Since GO-Finder only requires the last scene of an object, other frames are no longer needed once we store the feature vectors of the object detections. Last scenes are updated as objects re-appear, so images would not be kept permanently. Additionally, we can remove identity information by running an off-the-shelf face detector, since we do not need bystanders' information. Supported by the positive comments, we believe that GO-Finder can be used with minimal interference.
Object Re-identification.
The proposed system failed to discover objects under severe occlusion by hands (e.g., the waiter's corkscrew, see Figure 10). In this example, the participant gripped the corkscrew so that it was severely occluded by the hand. This confused both the detector and the clustering algorithm in determining the correct bounding box and appearance feature to identify the object, resulting in over-segmentation or false cluster merges. As reported in Section 6.1, small objects tend to be occluded and would be problematic in real-world use. One potential solution is to use the object's appearance before or after, but not during, the manipulation in which it is occluded by hands.
Long-term Evaluation.
We evaluated GO-Finder using short video sequences of around 4 min. Although we showed that GO-Finder can handle around 20 objects without registration, in a real situation the video length will be hours or days, and the number of objects will increase as we record more. To avoid having to look through a massive number of candidate objects, additional features to omit unimportant detections are necessary. For instance, we could use object-scene distributions [27] to eliminate non-personal, static objects such as doors and furniture.
Multi-user Scenarios.
We assume each object is manipulated by a single user. However, in multi-user scenarios, we cannot track objects if they are moved by other people. One plausible solution is to share object information among users, which would involve a trade-off with privacy protection.
We presented GO-Finder, a registration-free wearable-camera-based system for assisting users in finding lost objects. It supports finding an arbitrary number of objects based on two key ideas: hand-held object discovery and image-based candidate selection. The user study revealed that by using GO-Finder, users can find the location of lost objects correctly with a reduced mental load. Even though the objects were registered automatically without user intervention, users were able to identify the target object using the image-based hand-held object timeline. Going beyond tracking only a few selected objects, GO-Finder could be used as a practical tool to help find various unexpectedly lost objects in daily life.

Our future work includes developing a computationally efficient object discovery algorithm and candidate filtering based on context, as well as conducting a long-term user evaluation in naturalistic situations of losing objects.
ACKNOWLEDGEMENTS
This work was supported by JST AIP Acceleration Research Grant Number JPMJCR20U1 and the Masason Foundation.
REFERENCES
[1] Aaron Bangor, Philip Kortum, and James Miller. 2009. Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies 4, 3, 114–123.
[2] Gedas Bertasius, Hyun Soo Park, Stella X. Yu, and Jianbo Shi. 2017. Unsupervised learning of important objects from first-person videos. In Proc. IEEE International Conference on Computer Vision. 1956–1964.
[3] Luca Bertinetto, Jack Valmadre, Joao F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. 2016. Fully-convolutional siamese networks for object tracking. In Proc. European Conference on Computer Vision Workshops. 850–865.
[4] Marc Bolaños and Petia Radeva. 2015. Ego-object discovery. arXiv preprint arXiv:1504.01639.
[5] Gaetano Borriello, Waylon Brunette, Matthew Hall, Carl Hartung, and Cameron Tangney. 2004. Reminding about tagged objects using passive RFIDs. In Proc. ACM International Conference on Ubiquitous Computing. 36–53.
[6] William F. Brewer. 1988. Qualitative analysis of the recalls of randomly sampled autobiographical events. In Practical Aspects of Memory: Current Research and Issues, MM Gruneberg, PE Morris, and RN Sykes (Eds.). Vol. 1. Wiley, 263–268.
[7] John Brooke. 1996. SUS: A "quick and dirty" usability scale. Usability Evaluation in Industry, 189.
[8] Andreas Butz, Michael Schneider, and Mira Spassova. 2004. SearchLight – a lightweight search function for pervasive environments. In Proc. IEEE International Conference on Pervasive Computing. 351–356.
[9] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. 2018. Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proc. European Conference on Computer Vision. 720–736.
[10] Tamara Denning, Zakariya Dehlawi, and Tadayoshi Kohno. 2014. In situ with bystanders of augmented reality glasses: Perspectives on recording and privacy-mediating technologies. In Proc. SIGCHI Conference on Human Factors in Computing Systems. 2377–2386.
[11] Margery Eldridge, Abigail Sellen, and Debra Bekerian. 1992. Memory Problems at Work: Their Range, Frequency and Severity. Technical Report EPC–92–129. Rank Xerox EuroPARC.
[12] Markus Funk, Robin Boldt, Bastian Pfleging, Max Pfeiffer, Niels Henze, and Albrecht Schmidt. 2014. Representing indoor location of objects on wearable computers with head-mounted displays. In Proc. 5th Augmented Human International Conference. 1–4.
[13] Markus Funk, Albrecht Schmidt, and Lars Erik Holmquist. 2013. Antonius: A mobile search engine for the physical world. In Proc. ACM Conference on Pervasive and Ubiquitous Computing Adjunct. 179–182.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[15] Steve Hodges, Lyndsay Williams, Emma Berry, Shahram Izadi, James Srinivasan, Alex Butler, Gavin Smyth, Narinder Kapur, and Ken Wood. 2006. SenseCam: A retrospective memory aid. In Proc. ACM International Conference on Ubiquitous Computing. 177–193.
[16] Roberto Hoyle, Robert Templeman, Steven Armes, Denise Anthony, David Crandall, and Apu Kapadia. 2014. Privacy behaviors of lifeloggers using wearable cameras. In Proc. 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing.
[18] Personal and Ubiquitous Computing 11, 4, 287–298.
[19] Julie A. Kientz, Shwetak N. Patel, Arwa Z. Tyebkhan, Brian Gane, Jennifer Wiley, and Gregory D. Abowd. 2006. Where's my stuff? Design and evaluation of a mobile system for locating lost items for the visually impaired. In Proc. 8th ACM Conference on Computers and Accessibility. 103–110.
[20] Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 1–2, 83–97.
[21] Kyungjun Lee and Hernisa Kacorri. 2019. Hands holding clues for object recognition in teachable machines. In Proc. ACM CHI Conference on Human Factors in Computing Systems. 1–12.
[22] Kyungjun Lee, Abhinav Shrivastava, and Hernisa Kacorri. 2020. Hand-priming in object localization for assistive egocentric vision. In Proc. IEEE Winter Conference on Applications of Computer Vision. 3422–3432.
[23] Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. 2012. Discovering important people and objects for egocentric video summarization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 1346–1353.
[24] Franklin Mingzhe Li, Di Laura Chen, Mingming Fan, and Khai N. Truong. 2019. FMT: A wearable camera-based object tracking memory aid for older adults. Proc. ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3, 3, 1–25.
[25] Yin Li, Miao Liu, and James M. Rehg. 2018. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proc. European Conference on Computer Vision. 619–635.
[26] Xiaotao Liu, Mark D. Corner, and Prashant Shenoy. 2006. Ferret: RFID localization for pervasive multimedia. In Proc. ACM International Conference on Ubiquitous Computing. 422–440.
[27] Cewu Lu, Renjie Liao, and Jiaya Jia. 2015. Personal object discovery in first-person videos. IEEE Transactions on Image Processing 24, 12, 5789–5799.
[28] Robert J. Orr, Ronald Raymond, Joshua Berman, and A. Fleming Seay. 1999. A System for Finding Frequently Lost Objects in the Home. Technical Report GIT-GVU-99-24. Georgia Institute of Technology.
[29] Ling Pei, Ruizhi Chen, Jingbin Liu, Tomi Tenhunen, Heidi Kuusniemi, and Yuwei Chen. 2010. Inquiry-based Bluetooth indoor positioning via RSSI probability distributions. In Proc. International Conference on Advances in Satellite and Space Communications. 151–156.
[30] Rodney E. Peters, Richard Pak, Gregory D. Abowd, Arthur D. Fisk, and Wendy A. Rogers. 2004. Finding Lost Objects: Informing the Design of Ubiquitous Computing Services for the Home. Technical Report GIT-GVU-04-01. Georgia Institute of Technology.
[31] Cristian Reyes, Eva Mohedano, Kevin McGuinness, Noel E. O'Connor, and Xavier Giro-i Nieto. 2016. Where is my phone? Personal object retrieval from egocentric images. In Proc. First Workshop on Lifelogging Tools and Applications. 55–62.
[32] David Schwarz, Max Schwarz, Jörg Stückler, and Sven Behnke. 2014. Cosero, find my keys! Object localization and retrieval using Bluetooth Low Energy tags. In Robot Soccer World Cup. 195–206.
[33] Dandan Shan, Jiaqi Geng, Michelle Shu, and David F. Fouhey. 2020. Understanding human hands in contact at Internet scale. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 9869–9878.
[34] Makoto Shinnishi. 1999. Hide and seek: Physical real artifacts which responds to the user. In Proc. World Multiconference on Systemics, Cybernetics and Informatics, Vol. 4. 84–88.
[35] Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. 2018. Actor and observer: Joint modeling of first and third-person videos. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 7396–7404.
[36] Masaya Tanbo, Ryoma Nojiri, Yuusuke Kawakita, and Haruhisa Ichikawa. 2017. Active RFID attached object clustering method with new evaluation criterion for finding lost objects. Mobile Information Systems.
[38] Proc. International Conference on Home-Oriented Informatics and Telematics. 15–32.
[39] Takahiro Ueoka, Tatsuyuki Kawamura, Yasuyuki Kono, and Masatsugu Kidode. 2003. I'm here!: A wearable object remembrance support system. In Proc. ACM International Conference on Mobile Human-Computer Interaction. 422–427.
[40] Paul Wilson, Daniel Prashanth, and Hamid Aghajan. 2007. Utilizing RFID signaling scheme for localization of stationary objects and speed estimation of mobile objects. In Proc. IEEE International Conference on RFID. 94–99.
[41] Dan Xie, Tingxin Yan, Deepak Ganesan, and Allen Hanson. 2008. Design and implementation of a dual-camera wireless sensor network for object retrieval. In Proc. IEEE International Conference on Information Processing in Sensor Networks. 469–480.
Table 7: Result of the SUS test.
ID    Gender    SUS score (rank)
P01   Female    60 (D)
P02   Male      65 (C)
P03   Female    80 (A)
P04   Male      47.5 (F)
P06   Male      85 (C)
P07   Male      77.5 (B+)
P08   Male      77.5 (B+)
P09   Male      77.5 (B+)
P10   Male      87.5 (A+)
P11   Male      100 (A+)
P12   Male      70 (C)
P13   Male      77.5 (B+)
A DATASET DETAILS
Figure 11 shows the other two object sets (set 2 and set 3) used in the study. To avoid confusion across object sets, we used different types of daily objects between trials.
B ADDITIONAL RESULTS
B.1 Usability Test
In addition to the main results, we asked the participants to answer the System Usability Scale (SUS) test [7]. Table 7 summarizes the SUS score of each participant. The average score and its 95% confidence interval among all the participants were 75.4 ±

B.2 Comfort of the Neck-mounted Camera
We asked the question, "How comfortable was the neck-mounted camera?" on a seven-point scale (unpleasant = 1, comfortable = 7). The participants reported slightly positive feedback on average regarding the comfort of the camera (mean and 95% CI: 4.6 ±

B.3 Object Discovery Examples
Figures 12 and 13 show additional examples of the obtained hand-held object timeline.