Model enhancement and personalization using weakly supervised learning for multi-modal mobile sensing
Diyan Teng, Rashmi Kulkarni, and Justin McGloin
Qualcomm Technologies Inc.
Abstract
Always-on sensing of a mobile device user's contextual information is critical to many intelligent use cases nowadays, such as healthcare, drive assistance, and voice UI. State-of-the-art approaches for predicting user context have proved the value of leveraging multiple sensing modalities for better accuracy. However, the context inference algorithms that run on the application processor today tend to drain a heavy amount of power, making them unsuitable for an always-on implementation. We claim that not every sensing modality is suitable to be activated all the time, and that it remains challenging to build an inference engine using only power-friendly sensing modalities. Meanwhile, due to the diversity of the population, we find it challenging to learn a context inference model that generalizes well with limited training data, especially when using only always-on low-power sensors. In this work, we propose an approach that leverages the opportunistically-on counterparts in the device to improve the always-on prediction model, leading to a personalized solution. We model this problem using a weakly supervised learning framework and provide both theoretical and experimental results to validate our design. The proposed framework achieves satisfying results in the IMU-based activity recognition application we consider.
1 Introduction

With the rapid growth of the computational capability of mobile chips, more and more artificial intelligence (AI) applications have started to move away from cloud-based solutions to explore on-device methods. The trend is not a surprise, as mobile device users gain a better understanding of data privacy [1, 2] and prefer a high-reliability, low-latency experience [3, 4, 5]. Meanwhile, many research works [6, 7] have indicated that the performance of on-device AI applications can be enhanced if the correct contextual information about the user is leveraged. For example, speaker recognition [8] can switch to a localized model if the user context shows him/her in a meeting room or on a bus. GPS can be pre-activated and smartphone messaging can be disabled when the context shows that the user starts to drive [9]. Gesture/speech recognition [10, 11] can map to different intentions given that the user's location context shows in a car, in a bedroom, or in a restaurant.

Nevertheless, the contextual information becomes worthless if the cost of computation is far higher than the value it brings. In practice, this cost maps to both the infrastructure cost, which includes sensor price, battery consumption, and component size, and the algorithm cost, which includes runtime complexity, memory requirements, and training data gathering effort. For example, GPS is an informative source of information widely used in the recognition of mode of transportation [12]. However, the on-time power consumption when GPS is in tracking mode is typically on the order of tens of mA, assuming a one-second EPOCH. This discourages designers from including GPS in an always-on transportation mode recognition engine in mobile devices. More importantly, we observe that context information typically does not vary frequently in time.
Thus, always-on low-power solutions that leverage simple sensors and adopt low-complexity algorithms are preferred.

Even though enabling the high-cost sensing modalities in always-on inference is not a viable solution, we claim that they can still be leveraged in an opportunistic fashion to maximize system efficiency. Currently, we are exploring two approaches to incorporate this information opportunistically. On one hand, we can use the information as a confirmation or correction to the always-on prediction result. On the other hand, we consider leveraging the information, whenever it is available, to perform on-device learning that personalizes and enhances the always-on inference model for each particular user. In this work, we focus on the second route. Specifically, we model the problem as a weakly supervised learning task, where the annotation is provided opportunistically through the high-cost, non-always-on sensors. In mobile sensing, personalizing the inference model to the target user can lead to significant improvement. In other words, it turns out to be extremely challenging in most mobile sensing use cases to build a model with good interpersonal generalization performance, and this has become the primary motivation of this work.

The rest of this work is organized as follows. In Section 2, we review related arts on weakly supervised learning to serve as background. In Section 3, we propose our main algorithm, which leverages ideas from weakly supervised learning to learn the target distribution of interest. In Section 4, we show that the proposed algorithm is statistically consistent and provide some insight that relates the performance of the algorithm to the noise statistics in the annotation. In Section 5, we provide both a synthetic example and an application to validate our theory. Finally, we conclude and point out a few problems that are worth further research.

2 Weakly supervised learning

The definition of weakly supervised learning varies in the literature.
To the best of our knowledge, state-of-the-art works have focused on three types of weakly supervised learning problems. The first scenario considers that only part of the data are labeled. This is also known as semi-supervised learning in some works. The objective in this case is to leverage prior knowledge, such as the geometry of the data, to optimize prediction power on the labeled instances while encouraging the prior knowledge to be satisfied on both labeled and unlabeled instances. The second scenario is known as positive-unlabeled (PU) learning. Under this category, only part of the instances from the positive hypothesis are labeled. The challenge is to properly handle the negative instances, without explicit knowledge of the label, so that the algorithm does not overfit due to label imbalance. The last set of problems can be categorized as learning with label noise. In other words, even though labeled, the training instances may not be perfectly supervised. Instead, some noise-adding process is considered, so that the learner has no direct access to the ground truth. There can be many practical reasons for this to happen, for example, privacy requirements or adversarial attacks.

In this paper, we consider the last type of problem, however, under a slightly different setup. Instead of directly assuming some label noise behavior, we consider the label to be obtained through a separate inference process, which is imperfect. The graphical model in Figure 1 illustrates the concept.

Figure 1: Graphical model for the weakly supervised framework.

Here, y ∈ {s_1, s_2, ..., s_K} denotes the ground truth class label, which takes value in a discrete, finite alphabet. x and z represent two independent measurements which contain class-conditional information about y. The objective here is to learn the statistical relationship between x and y. In this work, we consider learning the generative distribution p(x | y) and assume the prior distribution p(y) is given.
However, we are not directly given the pairs {x_n, y_n}, n = 1, ..., N. Instead, there is an inference process g(·) which takes z as input and predicts y as ỹ. In other words, we have no access to the generative distributions p(x | y) and p(z | y), but only the pairs {x_n, ỹ_n}, n = 1, ..., N. We start by assuming that the predictive model g(·) is trained separately beforehand and fixed. Moreover, the experts who trained the model g(·) can also provide the confusion statistics Π that capture its performance for predicting y using ỹ. Specifically, the (i, j)th element of the confusion matrix Π represents the probability Pr(y = s_i | ỹ = s_j). Later, we will consider that g(·) can be flexible and discuss its effect on the learning process.

From a practical system design perspective, this model captures a few common use cases in mobile sensing. For example, modality-x can be heavily user dependent, such as speech, face image, or motion kinetics, while modality-z can be invariant to user identity, such as speed of travel in traffic, altitude change, or illumination level. Therefore, we may enhance and personalize the prediction model for modality-x. Also, the price of obtaining the label ỹ can itself differ. For example, we can always ask the user to provide an annotation, which is the most accurate but expensive option. In comparison, if we design an annotator using modality-z that runs in the background, it will not interfere with the user experience, but the label will become imperfect as a consequence. It is also worth mentioning that, typically, the predictive power using modality-z is worse than that using modality-x, or modality-z tends to be more power hungry; otherwise the problem statement becomes trivial, and there would be no value in improving the model for x.

3 Proposed algorithm

In this section, we describe our algorithm for recovering the generative distribution p(x | y) with access only to the pairs {x_n, ỹ_n}, n = 1, ..., N.
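As a concrete picture of the data the learner actually gets to store, the generation process above can be sketched in a few lines. This is an illustrative simulation, not the paper's system: the class prior, the Gaussian modality-x, and the forward confusion of the z-based annotator below are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-class setup: y is the true context label, x is the
# always-on (low-power) measurement, and the opportunistic modality-z
# pipeline g(.) emits a noisy label y_tilde.
K = 3
prior = np.array([0.5, 0.3, 0.2])            # p(y), assumed known

# Forward confusion of the z-based annotator: F[i, j] = Pr(y_tilde = s_i | y = s_j)
F = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

N = 10_000
y = rng.choice(K, size=N, p=prior)               # latent ground truth
x = rng.normal(loc=y.astype(float), scale=1.0)   # class-conditional x | y
# y_tilde is drawn column-wise from F, emulating the imperfect g(z)
y_tilde = np.array([rng.choice(K, p=F[:, yi]) for yi in y])

# The learner only ever stores (x, y_tilde); y stays hidden.
print("label agreement Pr(y_tilde == y):", np.mean(y_tilde == y))
```

With the 0.8 diagonal assumed above, roughly 80% of the stored labels agree with the hidden ground truth; everything that follows is about undoing the remaining 20%.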
We start with the following assumptions on the confusion matrix Π.

Assumption 1. The confusion matrix Π is a proper left stochastic (Markov) matrix, i.e., $\sum_i \Pi_{(i,j)} = 1$.

Assumption 2. The confusion matrix Π is invertible, i.e., $\det(\Pi) \neq 0$.

Assumption 3.
The inference algorithm g(·): z → ỹ is deterministic.

The first assumption can be easily met. Typically, the confusion matrix is provided by the designer of the inference system using modality-z through empirical evaluation on some validation dataset; therefore, it requires only proper column normalization. Later, we will see that the second assumption ensures that recovery of the generative distribution p(x | y) is feasible. Intuitively, if the confusion matrix is not full rank, then information about some alphabet in y will be lost in modality-z, and therefore becomes unrecoverable. Moreover, we assume the inference algorithm g(·) to be deterministic, which will most likely be the case in a practical system due to complexity concerns. Therefore, we can simplify the graphical model to Figure 2. (Alternatively, one can also define the forward confusion matrix, which represents Pr(ỹ = s_i | y = s_j).)

From Figure 2 we observe that, since y is unobserved, the generative distribution p(x | ỹ) can be written as:

$$p(x \mid \tilde{y}) = \sum_{y} p(x, y \mid \tilde{y}) = \sum_{y} p(x \mid y)\, p(y \mid \tilde{y}) \tag{1}$$

If we write it in matrix format, we have:

$$\begin{bmatrix} p(x \mid \tilde{y} = s_1) & p(x \mid \tilde{y} = s_2) & \cdots & p(x \mid \tilde{y} = s_K) \end{bmatrix} = \begin{bmatrix} p(x \mid y = s_1) & p(x \mid y = s_2) & \cdots & p(x \mid y = s_K) \end{bmatrix} \cdot \Pi \tag{2}$$

Finally, the generative distribution p(x | y) can be recovered by inverting the confusion matrix:

$$\begin{bmatrix} p(x \mid y = s_1) & \cdots & p(x \mid y = s_K) \end{bmatrix} = \begin{bmatrix} p(x \mid \tilde{y} = s_1) & \cdots & p(x \mid \tilde{y} = s_K) \end{bmatrix} \cdot \Pi^{-1} \tag{3}$$

In practice, we also have no access to the true distribution p(x | ỹ); it has to be learned from the stored training instances {x_n, ỹ_n}, n = 1, ..., N.
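At the population level, the recovery in (3) is exact whenever Assumption 2 holds. This can be checked numerically when x is discrete, so that each conditional is a finite probability vector; the bin count, class count, and the particular Π below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete sanity check of Eqs. (2)-(3): x takes B values, y takes K values.
# Columns of P are the clean conditionals p(x | y = s_j); Pi is the left
# stochastic confusion matrix with Pi[i, j] = Pr(y = s_i | y_tilde = s_j).
K, B = 3, 5
P = rng.dirichlet(np.ones(B), size=K).T          # shape (B, K); columns sum to 1

Pi = np.array([[0.7, 0.2, 0.1],
               [0.2, 0.7, 0.1],
               [0.1, 0.1, 0.8]])                 # columns sum to 1, invertible

P_tilde = P @ Pi                                 # Eq. (2): the observable conditionals
P_recovered = P_tilde @ np.linalg.inv(Pi)        # Eq. (3): un-mix with Pi^{-1}

# Recovery is exact up to floating point because Pi is invertible (Assumption 2).
print(np.max(np.abs(P_recovered - P)))
```

Replacing Pi with a singular (rank-deficient) matrix makes `np.linalg.inv` fail, which is the numerical face of the identifiability argument above.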
Denoting the estimator of p(x | ỹ) as q(x | ỹ), we can similarly calculate the noise-corrected estimator of p(x | y) as:

$$\begin{bmatrix} q(x \mid y = s_1) & \cdots & q(x \mid y = s_K) \end{bmatrix} = \begin{bmatrix} q(x \mid \tilde{y} = s_1) & \cdots & q(x \mid \tilde{y} = s_K) \end{bmatrix} \cdot \Pi^{-1} \tag{4}$$

which simplifies to

$$q(x \mid y = s_i) = \sum_{j} q(x \mid \tilde{y} = s_j) \cdot \Pi^{-1}_{(j,i)} \tag{5}$$

However, we need to be aware that the estimator (5) may not be a proper probability distribution. Even though it automatically satisfies the condition that the integral over x equals one, there can be regions where this function takes negative values. Therefore, it becomes challenging to construct q(x | ỹ) when x is a continuous random vector. In the next section, we will prove that this noise-corrected estimator is indeed lossless when the number of stored training instances goes to infinity. Similarly, the posterior probability p(y | x) can be estimated by following this method. We have, for the posterior probability:

$$p(\tilde{y} \mid x) = \sum_{y} p(y, \tilde{y} \mid x) = \sum_{y} p(y \mid x)\, p(\tilde{y} \mid y, x) = \sum_{y} p(y \mid x)\, p(\tilde{y} \mid y)$$

Thus we have:

$$\begin{bmatrix} p(y = s_1 \mid x) & \cdots & p(y = s_K \mid x) \end{bmatrix} = \begin{bmatrix} p(\tilde{y} = s_1 \mid x) & \cdots & p(\tilde{y} = s_K \mid x) \end{bmatrix} \cdot \Pi_R^{-1}$$

where Π_R in this case represents the right stochastic matrix whose (i, j)th element measures Pr(ỹ = s_j | y = s_i).

Here, we also observe that estimator (5) is interestingly related to the method of unbiased estimators proposed by Natarajan et al. [13].
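The finite-sample caveat above can be seen directly: plugging histogram estimates into (5) keeps the total mass at one but can push individual bins negative. The setup below (binary labels, Gaussian x, a symmetric 0.7/0.3 confusion) is an assumed toy example; note that with a symmetric confusion matrix and balanced classes the forward and backward confusion matrices coincide, so one Π serves both roles.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy demonstration of estimator (5) with a small sample: histogram
# estimates of q(x | y_tilde) are corrected by Pi^{-1}.  The corrected
# columns still sum to one, but individual bins may dip below zero (the
# caveat noted above); whether they do depends on the particular draw.
Pi = np.array([[0.7, 0.3],
               [0.3, 0.7]])                        # left stochastic, symmetric

N = 200                                            # deliberately small sample
y = rng.integers(0, 2, size=N)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)       # class-conditional x | y
y_tilde = np.where(rng.random(N) < 0.7, y, 1 - y)  # symmetric 30% label noise

edges = np.linspace(-6.0, 6.0, 13)                 # 12 bins covering the data
q_tilde = np.stack(
    [np.histogram(x[y_tilde == j], bins=edges)[0] / (y_tilde == j).sum()
     for j in range(2)], axis=1)                   # column j: q(x | y_tilde = s_j)

q_corrected = q_tilde @ np.linalg.inv(Pi)          # estimator (5)

print("column sums:", q_corrected.sum(axis=0))     # each stays at 1
print("min corrected bin:", q_corrected.min())     # can be negative for small N
```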
Specifically, they considered a binary classification problem where y ∈ {+1, −1}, with the following label flipping probabilities:

$$\kappa_{+1} = \Pr(\tilde{y} = +1 \mid y = -1), \qquad \kappa_{-1} = \Pr(\tilde{y} = -1 \mid y = +1) \tag{6}$$

satisfying the constraint:

$$\kappa_{+1} + \kappa_{-1} < 1 \tag{7}$$

and constructed the unbiased surrogate loss:

$$\tilde{l}(f(x), y) = \frac{(1 - \kappa_{-y})\, l(f(x), y) - \kappa_{y}\, l(f(x), -y)}{1 - \kappa_{+1} - \kappa_{-1}} \tag{8}$$

Their result can indeed be understood as the frequentists' counterpart of the probabilistic framework we propose here. Additionally, we can generalize the theory to a multiclass setting and construct:

$$K \cdot \begin{bmatrix} \tilde{l}(f(x), \tilde{y} = s_1) & \cdots & \tilde{l}(f(x), \tilde{y} = s_K) \end{bmatrix}^T = \begin{bmatrix} l(f(x), y = s_1) & \cdots & l(f(x), y = s_K) \end{bmatrix}^T \tag{9}$$

where K in this case is the right stochastic matrix with K_(i,j) = Pr(ỹ = s_j | y = s_i). Similarly, we assume the forward confusion matrix K to be invertible, i.e., det(K) ≠ 0. Therefore, the multiclass unbiased risk can be calculated as:

$$\begin{bmatrix} \tilde{l}(f(x), \tilde{y} = s_1) & \cdots & \tilde{l}(f(x), \tilde{y} = s_K) \end{bmatrix}^T = K^{-1} \cdot \begin{bmatrix} l(f(x), y = s_1) & \cdots & l(f(x), y = s_K) \end{bmatrix}^T \tag{10}$$

which simplifies to

$$\tilde{l}(f(x), \tilde{y} = s_i) = \sum_{j} K^{-1}_{(i,j)}\, l(f(x), y = s_j) \tag{11}$$

One interesting observation here is that the original requirement κ_{+1} + κ_{−1} < 1 can be relaxed to κ_{+1} + κ_{−1} ≠ 1. This matches the properties of the receiver operating characteristic (ROC) curve in decision theory. In principle, the ROC curve is always above the straight line passing through P_D = P_FA = 0 and P_D = P_FA = 1, which corresponds to the Bernoulli random-guess decision; otherwise, the decision rule can be flipped to achieve that. In (10), the matrix inverse operation automatically handles the decision flipping procedure when κ_{+1} + κ_{−1} > 1.

4 Consistency analysis

We analyze the consistency of our estimator defined in (4) in this section. The key challenge in (4) is to analyze the effect of the inverse stochastic matrix Π^{-1} on the density estimator. We start by establishing a theorem that guarantees estimator (4) can recover the true generative distribution under certain conditions.

Theorem 1.
Let $D^{(i)}_{KL}(p_i \,\|\, q_i)$ be the Kullback–Leibler (KL) divergence that measures the discrepancy between the two distributions p(x | y = s_i) and q(x | y = s_i). Specifically,

$$D^{(i)}_{KL}(p_i \,\|\, q_i) = \int p(x \mid y = s_i) \log \frac{p(x \mid y = s_i)}{q(x \mid y = s_i)}\, dx \tag{12}$$

Similarly, let $\tilde{D}^{(i)}_{KL}(\tilde{p}_i \,\|\, \tilde{q}_i)$ denote the KL divergence between p(x | ỹ = s_i) and q(x | ỹ = s_i). Suppose both q(x | y) and q(x | ỹ) are valid distribution functions and Assumptions 1–3 hold. Then we have

$$\sum_{i} D^{(i)}_{KL}(p_i \,\|\, q_i) = 0 \tag{13}$$

if and only if

$$\sum_{i} \tilde{D}^{(i)}_{KL}(\tilde{p}_i \,\|\, \tilde{q}_i) = 0 \tag{14}$$

Proof.
To prove the necessity part, observe that

$$\sum_i \tilde{D}^{(i)}_{KL}(\tilde{p}_i \,\|\, \tilde{q}_i) = \sum_i \int p(x \mid \tilde{y} = s_i) \log \frac{p(x \mid \tilde{y} = s_i)}{q(x \mid \tilde{y} = s_i)}\, dx = \sum_i \int \underbrace{\sum_j p(x \mid y = s_j)\, \Pi_{(j,i)} \log \frac{\sum_j p(x \mid y = s_j)\, \Pi_{(j,i)}}{\sum_j q(x \mid y = s_j)\, \Pi_{(j,i)}}}_{(*)}\, dx$$

Applying the log-sum inequality to (∗) yields

$$(*) \le \sum_j p(x \mid y = s_j)\, \Pi_{(j,i)} \log \frac{p(x \mid y = s_j)\, \Pi_{(j,i)}}{q(x \mid y = s_j)\, \Pi_{(j,i)}} = \sum_j p(x \mid y = s_j)\, \Pi_{(j,i)} \log \frac{p(x \mid y = s_j)}{q(x \mid y = s_j)}$$

Thus

$$\sum_i \tilde{D}^{(i)}_{KL}(\tilde{p}_i \,\|\, \tilde{q}_i) \le \sum_{i,j} \Pi_{(j,i)} \int p(x \mid y = s_j) \log \frac{p(x \mid y = s_j)}{q(x \mid y = s_j)}\, dx = \sum_{i,j} \Pi_{(j,i)}\, D^{(j)}_{KL}(p_j \,\|\, q_j)$$

Necessity follows, since if $D^{(i)}_{KL}(p_i \,\|\, q_i) = 0$ for all i, then $\tilde{D}^{(i)}_{KL}(\tilde{p}_i \,\|\, \tilde{q}_i) = 0$ for all i.

For sufficiency, the log-sum property cannot be used, since $\Pi^{-1}_{(j,i)}$ may be negative valued. Instead, we evaluate

$$D^{(i)}_{KL}(p_i \,\|\, q_i) = \int \sum_j p(x \mid \tilde{y} = s_j)\, \Pi^{-1}_{(j,i)} \log \frac{\sum_j p(x \mid \tilde{y} = s_j)\, \Pi^{-1}_{(j,i)}}{\sum_j q(x \mid \tilde{y} = s_j)\, \Pi^{-1}_{(j,i)}}\, dx$$

by directly observing that $p(x \mid \tilde{y} = s_j) \stackrel{a.e.}{=} q(x \mid \tilde{y} = s_j)$ implies $p(x \mid \tilde{y} = s_j)\, \Pi^{-1}_{(j,i)} \stackrel{a.e.}{=} q(x \mid \tilde{y} = s_j)\, \Pi^{-1}_{(j,i)}$ for finite $\Pi^{-1}$. Therefore, the integral remains zero.

Theorem 1 guarantees that, when the sample size goes to infinity, a perfect estimator for p̃ yields a perfect estimator for p. Next, we discuss how the confusion matrix affects the convergence rate.

Theorem 2.
Let $G^{(i)}_n$ denote the following empirical process:

$$G^{(i)}_n = \int (\log q_i - \log p_i)\, d(\bar{P}^{(i)} - \bar{P}^{(i)}_n) = \int \Big( \log \sum_j \tilde{q}_j \Pi^{-1}_{(j,i)} - \log \sum_j \tilde{p}_j \Pi^{-1}_{(j,i)} \Big)\, d(\bar{P}^{(i)} - \bar{P}^{(i)}_n) \tag{15}$$

where $\bar{P}^{(i)}$ denotes the true distribution of x | y = s_i and $\bar{P}^{(i)}_n$ is its empirical version. Then the following condition holds for $G^{(i)}_n$:

$$G^{(i)}_n = O\Big( \log \frac{\lambda_{\max}}{\lambda_{\min}} \cdot n^{-1/2} \,\vee\, \max_j \tilde{G}^{(j)}_n \Big) \tag{16}$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are, respectively, the maximal and minimal eigenvalues of $\Pi^{-1} \Pi^{-T}$, and

$$\tilde{G}^{(j)}_n = \int (\log \tilde{q}_j - \log \tilde{p}_j)\, d(\tilde{P}^{(j)} - \tilde{P}^{(j)}_n)$$

denotes the empirical process that measures the convergence of the individual density estimator.

Proof. Rewrite $G^{(i)}_n$ as

$$G^{(i)}_n = \int \log \frac{\sum_j \tilde{q}_j \Pi^{-1}_{(j,i)}}{\sum_j \tilde{p}_j \Pi^{-1}_{(j,i)}}\, d(\bar{P}^{(i)} - \bar{P}^{(i)}_n) = \frac{1}{2} \int \log \frac{\big(\sum_j \tilde{q}_j \Pi^{-1}_{(j,i)}\big)^2}{\big(\sum_j \tilde{p}_j \Pi^{-1}_{(j,i)}\big)^2}\, d(\bar{P}^{(i)} - \bar{P}^{(i)}_n) = \frac{1}{2} \int \log \frac{\tilde{q}^T \Pi^{-1}_{(\cdot,i)} \Pi^{-T}_{(\cdot,i)} \tilde{q}}{\tilde{p}^T \Pi^{-1}_{(\cdot,i)} \Pi^{-T}_{(\cdot,i)} \tilde{p}}\, d(\bar{P}^{(i)} - \bar{P}^{(i)}_n)$$

$$\le \frac{1}{2} \int \log \Big( \frac{\lambda_{\max}}{\lambda_{\min}} \cdot \frac{\|\tilde{q}\|^2}{\|\tilde{p}\|^2} \Big)\, d(\bar{P}^{(i)} - \bar{P}^{(i)}_n) = \frac{1}{2} \int \log \frac{\lambda_{\max}}{\lambda_{\min}}\, d(\bar{P}^{(i)} - \bar{P}^{(i)}_n) + \frac{1}{2} \int \log \frac{\|\tilde{q}\|^2}{\|\tilde{p}\|^2}\, d(\bar{P}^{(i)} - \bar{P}^{(i)}_n)$$

where the inequality is based on the fact that the eigenvalues of Π satisfy $0 < \sqrt{|\lambda_i|} \le 1$. For the second term, we observe that the rate for $\sum_j \tilde{q}_j$ to converge to $\sum_j \tilde{p}_j$ is determined by the slowest term, since each individual term is strictly positive; therefore, it is upper bounded by $\max_j \tilde{G}^{(j)}_n$. For the first term, from the central limit theorem we have $\int \log \frac{\lambda_{\max}}{\lambda_{\min}}\, d(\bar{P}^{(i)} - \bar{P}^{(i)}_n) = O(\log \frac{\lambda_{\max}}{\lambda_{\min}} \cdot n^{-1/2})$.

Theorem 2 provides insight into how the confusion process affects the learning rate. Specifically, in addition to the learning rate of each individual density estimator, an extra cost has to be paid based on how much information is lost during the confusion process. To give one example, when the confusion matrix is any permutation matrix, there is no loss of information; only deterministic label swapping is performed. Thus, we have $\lambda_{\max} = \lambda_{\min} = 1$ and no additional cost will be paid. In contrast, if we have a close-to-singular confusion matrix, the loss of information will be high, because multiple ỹ now represent highly similar information about y. In this case, $\lambda_{\max}$ is large.

Finally, the risk function constructed in (10) can be proven to be unbiased.

Theorem 3.
The risk function estimator defined in (10) is an unbiased esti-mator of l ( f ( x , y )) : E ˜ y [˜ l ( f ( x , ˜ y ))] = l ( x , y ) (17)9 roof. For every y = s i , we haveE ˜ y | y = s i [˜ l ( f ( x , ˜ y ))] = (cid:88) ˜ y Pr(˜ y | y = s i )˜ l ( f ( x , ˜ y )) (18)To have (18) to be an unbiased estimator of l ( f ( x , y = s i )) for every s i , we needthe following equalities to hold KKK · (cid:2) ˜ l ( f ( x ) , ˜ y = s ) · · · ˜ l ( f ( x ) , ˜ y = s K ) (cid:3) T = (cid:2) l ( f ( x ) , y = s ) · · · l ( f ( x ) , y = s K ) (cid:3) T (19)which gives (10) when K is invertible. In this section, we provide synthetic example and real world applications for ourtheory. We start by considering a synthetic example. We draw binomial sam-ples from three classes with success parameters [0 . , . , .
08] respectively.The class annotations are corrupted using a confusion matrix ΠΠΠ. We selectthe confusion matrix to have identical diagonal components and identical off-diagonal components to simplify the experiment. And the noise level in thiscase can be controlled by adjusting the off-diagonal elements while maintainingthe rows sums to one. We evaluate the performance of estimator (5) by com-puting the sum of KL divergence between the estimated distribution and thetrue distribution, using the analytic form.10 .4 0.6 0.8 1 1.2 1.4 1.6 1.800.010.020.030.040.050.060.07 K L d i ve r g e n ce D ( p || q ) K L d i ve r g e n ce D ( p || q ) Log a r i t h m o f K L d i ve r g e n ce D ( p || q ) Figure 3: Convergence in sum of KL divergence: (a) as λ max λ min increases; (b) assample size increases; (c) Figure (b) in logarithm scale. Solid square indicatesthe mean over 2000 test runs. Error bar show the one sided standard deviation.11e observe the convergence behavior in terms of log λ max λ min is close to linearfor each fixed sample size. And the convergence behavior in sample size in thisexample can be theoretically proven to be O ( n − ) using central limit theorem.Next we apply the proposed algorithm in a real application. We consideractivity recognition using smartphone accelerometer and gyroscope sensors. Aswe noticed for many users, it is a challenging task to distinguish phone call and slow walk ( ≤ . (cid:88)(cid:88)(cid:88)(cid:88)(cid:88)(cid:88)(cid:88)(cid:88) Truth Result call (fidget) slow walk bikecall (fidget) 0.76 0.24 0slow walk 0.28 0.72 0bike 0 0 1
Table 1: Confusion statistics using GPS speed readings. (Fidget here denotes holding the phone close to the ear while speaking; people tend to pace very slowly around without moving toward a particular destination.)

Figure 5: Activity recognition model personalization experiment. Rows 1–2: accelerometer and gyroscope signals (order of events: call → slow walk → bike). Row 3: inference using the baseline model. Row 4: inference using the personalized model (ground truth annotation). Row 5: inference using the personalized model (GPS-based annotation). Probability mass is computed by marginalizing stochastic sequence realizations (samples from the HsMM posterior distribution) at each time slot.

As we may observe, the baseline model does not correctly recognize call as fidget; instead, it creates some confusion between slow walk and bike. Subsequently, after we personalize the HsMM emission model using the separate collection paired with ground truth annotation, the model gains significant confidence to correctly recognize call as a fidget event. Finally, the model personalized with GPS-based annotation also achieves a satisfying recognition result. A detailed empirical Bayes error rate (BER) comparison for this experiment is provided in Table 2.

Model type   Mdl1   Mdl2   Mdl3
BER          0.22   0.04   0.05
Table 2: Empirical BER comparison. Mdl1: baseline model. Mdl2: personalized model (ground truth annotation). Mdl3: personalized model (GPS-based annotation).
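The synthetic study above can be re-created end to end in a few lines. The class success parameters and the noise level below are assumed for illustration (the original values are only partially legible); the point is that the Π^{-1}-corrected estimate drives the summed KL divergence toward zero, while the uncorrected one retains a systematic mixing bias.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)

# Three binomial classes, labels corrupted by a symmetric confusion
# matrix, conditionals re-estimated from (x, y_tilde) and corrected via
# Pi^{-1}, error scored by the sum of KL divergences against the
# analytic pmfs.  All parameters here are assumed.
n_trials, probs, K = 20, np.array([0.2, 0.5, 0.8]), 3
eps = 0.1                                          # off-diagonal noise level
Pi = np.full((K, K), eps) + (1.0 - K * eps) * np.eye(K)  # rows/cols sum to 1

def binom_pmf(p):
    return np.array([comb(n_trials, k) * p**k * (1 - p)**(n_trials - k)
                     for k in range(n_trials + 1)])

P = np.stack([binom_pmf(p) for p in probs], axis=1)      # true p(x | y)

N = 50_000
y = rng.integers(0, K, size=N)                           # uniform prior
x = rng.binomial(n_trials, probs[y])
# Pi is symmetric and doubly stochastic and the prior is uniform, so the
# forward and backward confusion matrices coincide: keep the label with
# probability 1 - (K-1)*eps, otherwise flip uniformly to another class.
keep = rng.random(N) < 1.0 - (K - 1) * eps
shift = rng.integers(1, K, size=N)
y_tilde = np.where(keep, y, (y + shift) % K)

q_tilde = np.stack([np.bincount(x[y_tilde == j], minlength=n_trials + 1)
                    / (y_tilde == j).sum() for j in range(K)], axis=1)
q = q_tilde @ np.linalg.inv(Pi)                          # estimator (5)

def kl_sum(Q):
    Q = np.clip(Q, 1e-12, None)                          # guard negative bins
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

print("summed KL, corrected  :", kl_sum(q))
print("summed KL, uncorrected:", kl_sum(q_tilde))
```

Increasing eps toward 1/K makes Π nearly singular, and the corrected estimate degrades exactly as the eigenvalue-ratio term in Theorem 2 predicts.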
6 Conclusion

In this work, we proposed an automated annotation method for the personalization of always-on mobile sensing models. The proposed algorithm leverages the non-always-on sensing modalities opportunistically. Synthetic results show that our algorithm can find the correct generative model given enough data. Our application shows that the model can indeed help improve smartphone-based human activity recognition performance in some cases.

Nevertheless, some problems remain open. First, it is still challenging to construct the estimated generative model and verify that it satisfies basic probability measure properties, especially for high-dimensional and continuous random variables. Second, since the convergence rate is governed by both the sample size and the eigenvalue structure of the confusion matrix, it is worth investigating whether some tradeoff can be defined to perform sample selection. For example, if in addition to the noisy annotation we are also provided a confidence measure for that annotation, it is interesting to consider subsampling the data for re-training: rejecting samples with low confidence can lead to cleaner confusion statistics, but it reduces the number of samples available to learn the generative distribution. Also, in situations where training needs to happen on the edge, it is important for mobile devices to store as little data as possible due to storage constraints.
References

[1] J. Zhao, R. Mortier, J. Crowcroft, and L. Wang, "Privacy-preserving machine learning based data analytics on edge devices," in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. ACM, 2018, pp. 341–346.

[2] S. A. Osia, A. S. Shamsabadi, A. Taheri, K. Katevas, S. Sajadmanesh, H. R. Rabiee, N. D. Lane, and H. Haddadi, "A hybrid deep learning architecture for privacy-preserving mobile analytics," arXiv preprint arXiv:1703.02952, 2017.

[3] B. Varghese, N. Wang, S. Barbhuiya, P. Kilpatrick, and D. S. Nikolopoulos, "Challenges and opportunities in edge computing," IEEE, 2016, pp. 20–26.

[4] H. Li, K. Ota, and M. Dong, "Learning IoT in edge: deep learning for the internet of things with edge computing," IEEE Network, vol. 32, no. 1, pp. 96–101, 2018.

[5] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, and F. Kawsar, "An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices," in Proceedings of the 2015 International Workshop on Internet of Things towards Applications. ACM, 2015, pp. 7–12.

[6] H. Zhu, E. Chen, H. Xiong, K. Yu, H. Cao, and J. Tian, "Mining mobile user preferences for personalized context-aware recommendation," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 5, no. 4, p. 58, 2015.

[7] X. Wang, D. Rosenblum, and Y. Wang, "Context-aware mobile music recommendation for daily activities," in Proceedings of the 20th ACM International Conference on Multimedia. ACM, 2012, pp. 99–108.

[8] I. Bisio, C. Garibotto, A. Grattarola, F. Lavagetto, and A. Sciarrone, "Smart and robust speaker recognition for context-aware in-vehicle applications," IEEE Transactions on Vehicular Technology, vol. 67, no. 9, pp. 8808–8821, 2018.

[9] H. Park, D. Ahn, T. Park, and K. G. Shin, "Automatic identification of driver's smartphone exploiting common vehicle-riding actions," IEEE Transactions on Mobile Computing, vol. 17, no. 2, pp. 265–278, 2017.

[10] L. P. Heck, M. Chinthakunta, D. Mitby, and L. Stifelman, "Location based conversational understanding," U.S. Patent 9,244,984, Jan. 26, 2016.

[11] J. Grubert, T. Langlotz, S. Zollmann, and H. Regenbrecht, "Towards pervasive augmented reality: Context-awareness in augmented reality," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 6, pp. 1706–1724, 2016.

[12] T. Feng and H. J. Timmermans, "Transportation mode recognition using GPS and accelerometer data," Transportation Research Part C: Emerging Technologies, vol. 37, pp. 118–130, 2013.

[13] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari, "Learning with noisy labels," in Advances in Neural Information Processing Systems, 2013.