Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities
Kaixuan Chen, Dalin Zhang, Lina Yao, Bin Guo, Zhiwen Yu, Yunhao Liu
KAIXUAN CHEN∗, University of New South Wales, Australia
DALIN ZHANG∗, University of New South Wales, Australia
LINA YAO, University of New South Wales, Australia
BIN GUO, Northwestern Polytechnical University, China
ZHIWEN YU, Northwestern Polytechnical University, China
YUNHAO LIU, Michigan State University, USA

∗Both authors contributed equally to the paper.

The vast proliferation of sensor devices and the Internet of Things enables applications of sensor-based activity recognition. However, there exist substantial challenges that could influence the performance of a recognition system in practical scenarios. Recently, as deep learning has demonstrated its effectiveness in many areas, plenty of deep methods have been investigated to address the challenges in activity recognition. In this study, we present a survey of the state-of-the-art deep learning methods for sensor-based human activity recognition. We first introduce the multi-modality of the sensory data and provide information on public datasets that can be used for evaluation in different challenge tasks. We then propose a new taxonomy to structure the deep methods by challenges. Challenges and challenge-related deep methods are summarized and analyzed to form an overview of the current research progress. At the end of this work, we discuss the open issues and provide some insights into future directions.

CCS Concepts: • General and reference → Surveys and overviews; • Hardware → Sensor devices and platforms; • Computer systems organization → Neural networks.

Additional Key Words and Phrases: activity recognition, deep learning, sensors
ACM Reference Format:
Kaixuan Chen, Dalin Zhang, Lina Yao, Bin Guo, Zhiwen Yu, and Yunhao Liu. 2018. Deep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities. J. ACM 37, 4, Article 111 (August 2018), 38 pages. https://doi.org/10.1145/1122445.1122456
Recent advances in human activity recognition have enabled myriad applications such as smart homes [63], healthcare [85], and enhanced manufacturing [46]. Activity recognition is essential to humanity since it records people's behaviors with data that allow computing systems to monitor, analyze, and assist their daily life. There are two mainstreams of human activity recognition systems: video-based systems and sensor-based systems. Video-based systems use cameras to take images or videos to recognize people's behaviors [8]. Sensor-based systems utilize on-body or ambient sensors to dead-reckon people's motion details or log their activity tracks. Considering the privacy issues of installing cameras in personal spaces, sensor-based systems have come to dominate applications that monitor daily activities. Besides, sensors have the advantage of pervasiveness. Thanks to the proliferation of smart devices and the Internet of Things, sensors can be embedded in portable devices such as phones, watches, and goggles, and in non-portable objects like cars, walls, and furniture. Sensors are widely embedded around us, uninterruptedly and non-intrusively logging human motion information.

Many machine learning methods have been employed in human activity recognition. However, this field still faces many technical challenges. Some of the challenges are shared with other pattern recognition fields such as computer vision and natural language processing, while some are unique to sensor-based activity recognition and require dedicated methods for real-life applications. Below we list a few categories of challenges to which the community of activity recognition should respond.

• The first challenge is the difficulty in feature extraction. Activity recognition is basically a classification task, so it shares a common challenge with other classification problems, that is, feature extraction. For sensor-based activity recognition, feature extraction is more difficult because of inter-activity similarity [21]: different activities may have similar characteristics (e.g., walking and running).
Therefore, it is difficult to produce distinguishable features that represent activities uniquely.

• Training and evaluating learning techniques require large annotated data samples, but collecting and annotating sensory activity data is expensive and time-consuming. Annotation scarcity is therefore a remarkable challenge for sensor-based activity recognition. Besides, data for some emergent or unexpected activities (e.g., accidental falls) is especially hard to obtain, which leads to another challenge called class imbalance.

• Human activity recognition involves three factors: users, time, and sensors. First, activity patterns are person-dependent: different users may have diverse activity styles. Second, activity concepts vary over time; the assumption that users' activity patterns remain unchanged over a long period is impractical. Moreover, novel activities are likely to emerge while a system is in use. Third, diverse sensor devices are opportunistically configured on human bodies or in environments, and the composition and layout of the sensors dramatically influence the data stimulated by activities. All three factors lead to heterogeneity of the sensory data for activity recognition, which urgently needs to be mitigated.

• The complexity of data association is another reason that makes recognition challenging. Data association refers to how many users and how many activities the data is associated with, and sophisticated data association drives several specific challenges in activity recognition. The first can be seen in composite activities. Most activity recognition tasks are based on simple activities, like walking and sitting; however, a more meaningful way to log human daily routines is through composite activities that comprise a sequence of atomic activities. For example, "washing hands" can be represented as {turning on the tap, soaping, rubbing hands, turning off the tap}. A second challenge, stimulated by composite activities, is data segmentation: since a composite activity is defined as a sequence of activities, accurate activity recognition highly relies on precise data segmentation techniques. Concurrent activities pose a third challenge; they occur when a user participates in more than one activity simultaneously, such as answering a phone call while watching TV.
Multi-occupant activities are also associated with the complexity of data association. Recognition is arduous when multiple users engage in a set of activities, which usually happens in multi-resident scenarios.

• Another factor that needs to be considered is the feasibility of the human activity recognition system. Since human activity recognition is quite close to human daily life, efforts need to be devoted to making the system acceptable to a vast number of users, which is twofold. First, the system should be resource-efficient so that it fits portable devices and is able to give an instant response; thus, the computational cost issue should be addressed. Second, as the recognition system records users' lives continuously, there are risks of personal information disclosure. Therefore, privacy is another issue that should be dealt with to make the system feasible in private spaces.

• Unlike images or texts, sensory data for activity recognition is complex and unreadable by humans. Moreover, sensory data inevitably includes much noise on account of the inherent imperfections of sensors. Reliable recognition solutions should therefore offer interpretability on sensory data: the capability of understanding which parts of the data facilitate recognition and which parts deteriorate it.
Numerous previous works have adopted machine learning methods for human activity recognition [79]. They rely heavily on feature extraction techniques including time-frequency transformation [62], statistical approaches [21], and symbolic representation [88]. However, the extracted features are carefully engineered and heuristic; there have been no universal or systematic feature extraction approaches that effectively capture distinguishable features of human activities.

In recent years, deep learning has achieved conspicuous success in modeling high-level abstractions from intricate data [112] in many areas such as computer vision, natural language processing, and speech processing. After early works including [55, 77, 164] examined the effectiveness of deep learning in human activity recognition, related studies sprang up in this area. Along with this development, the latest works address the specific challenges of the field. However, deep learning is still confronted with reluctant acceptance by researchers owing to its abrupt success, bustling innovation, and lack of theoretical support. Therefore, it is necessary to demonstrate the reasons behind the feasibility and success of deep learning in human activity recognition despite the challenges.

• The most attractive characteristic of deep learning is "depth". Layer-by-layer structures allow deep models to learn features scalably, from simple to abstract. Also, advanced computing resources like GPUs provide deep models with a powerful ability to learn descriptive features from complex data. This outstanding learning ability enables an activity recognition system to deeply analyze multimodal sensory data for accurate recognition.

• Diverse structures of deep neural networks encode features from multiple perspectives. For example, convolutional neural networks (CNNs) are competent at capturing the local connections of multimodal sensory data, and the translational invariance introduced by locality leads to accurate recognition [57]. Recurrent neural networks (RNNs) extract temporal dependencies and incrementally learn information through time, so they are appropriate for the streaming sensory data of human activity recognition.

• Deep neural networks are detachable and can be flexibly composed into unified networks with one overall optimization function, which makes allowance for miscellaneous deep learning techniques including deep transfer learning [2], deep active learning [50], the deep attention mechanism [101],
and other less systematic but equally effective solutions [64, 93]. Works adopting these techniques cater to the various challenges of deep learning-based activity recognition.
Hundreds of deep learning methods have been explored for human activity recognition in recent years, yet very few works aim at giving a comprehensive review of current progress. Wang et al. [154] surveyed a number of deep learning methods for sensor-based human activity recognition. Nweke et al. [104] presented a survey on mobile and wearable sensor-based activity recognition and categorized the deep learning methods into generative, discriminative, and hybrid methods. Li et al. [84] introduced different deep neural networks for radar-based activity recognition. However, no existing work reviews the state-of-the-art from the perspective of the challenges of human activity recognition and of how deep learning models and techniques are motivated and developed to be challenge-specific. Compared with the previous surveys, the key contributions of this work can be summarized as follows:

• We conduct a comprehensive survey of deep learning approaches for sensor-based human activity recognition. Our work provides a panorama of current progress and an in-depth analysis of the reviewed methods to serve both novices and experienced researchers.

• We propose a new taxonomy of deep learning methods from the viewpoint of the challenges of activity recognition. Challenges stimulated by different causes are presented so that readers can scan for the research direction of interest. We summarize the state-of-the-art and analyze how specific deep networks or deep techniques can be applied to address the challenges. Moreover, we provide information on available public datasets and their extensions for evaluating specific challenges. The new taxonomy builds a problem-solution structure with the hope of suggesting a rough guideline for readers selecting research topics or developing their approaches.

• We discuss some rarely explored issues in this field and point out potential future research directions.
The performance of an activity recognition system depends crucially on the sensor modality used. In this section, we classify sensor modalities into four categories: wearable sensors, ambient sensors, object sensors, and other modalities.
As wearable sensors can directly and efficiently capture body movements, they are the most commonly used for human activity recognition. These sensors can be freely integrated into smartphones, watches, bands, and even clothes.
Accelerometer.
An accelerometer is a device used to measure acceleration, the rate of change of the velocity of an object. The measuring unit is meters per second squared (m/s²) or g-force (g). The sampling frequency is usually in the range of tens to hundreds of Hz. For recognizing human activities, accelerometers can be mounted on various parts of the body, such as the waist [7], arm [173], ankle [10], and wrist [61]. A commonly used accelerometer has three axes, so it yields a tri-variate time series.

Gyroscope.
A gyroscope is a device that measures orientation and angular velocity. Angular velocity is measured in degrees per second (°/s). The sampling rate is also from tens to hundreds of Hz. A gyroscope is usually integrated with an accelerometer and mounted on the same body parts. A gyroscope also has three axes and consequently provides three time series as well.
Magnetometer.
A magnetometer is another widely used wearable sensor for activity recognition; it is generally assembled with an accelerometer and a gyroscope into an inertial measurement unit. It measures the change of a magnetic field at a particular location. The measurement unit is Tesla (T), and the sampling rate is from tens to hundreds of Hz as well. Likewise, a magnetometer usually has three axes.

Electromyography (EMG).
An EMG sensor is used to evaluate and record the electrical activity produced by skeletal muscles. Different from the above three kinds of sensors, EMG sensors need to be attached directly to human skin. As a result, they are less commonly used in conventional scenarios but are more suitable for fine-grained motions such as hand [194] or arm [159] movements and facial expressions. An EMG sensor provides a univariate time series of signal amplitudes.
Electrocardiography (ECG).
ECG is another biometric tool for activity recognition; it measures the electrical activity generated by the heart. It also requires the sensor to contact the human skin directly. As different people's hearts beat in significantly different ways, ECG signals are difficult to process under subject variation. An ECG sensor provides univariate time series data.
Ambient sensors are usually embedded in the environment to capture the interactions between humans and the environment. Unlike wearable sensors, a unique advantage of ambient sensors is that they can be used to detect multi-occupant activities. In addition, ambient sensors can also be adopted for indoor localization, which is difficult for wearable sensors to achieve.
WiFi.
WiFi is a local-area wireless network technology in which a transmitter sends signals to a receiver. The basis of WiFi-based human activity recognition is that human movements and locations interfere with the signals' propagation paths from the transmitter to the receiver, including both the direct path and the reflected paths. The received signal strength (RSS) of WiFi signals is the most easily measured metric for activity recognition [48]. However, RSS is not stable even when there is no dynamic change in the environment. A more advanced WiFi signal metric, channel state information (CSI), has recently been widely explored for activity recognition from both amplitude and phase aspects [171]. In addition to coarse activities like walking and jogging, CSI can also be used to recognize small movements like lip movements [152], keystrokes [5], and heartbeats [156].
Radio-frequency identification (RFID).
RFID uses electromagnetic fields to automatically identify and track tags attached to objects; the tags contain electronically stored information. There are two kinds of RFID tags: active and passive. Active tags rely on a local power source (such as a battery) to continuously broadcast their signals, which can be detected hundreds of meters away by an RFID reader. In contrast, passive RFID tags collect energy from a nearby RFID reader's interrogating radio waves to send their stored information. Thus, passive RFID tags are much cheaper and lighter. RSS is the most commonly adopted metric for RFID-based activity recognition [85, 167]. The working mechanism is that human movements change the signal strength received by the RFID reader.
Radar.
Different from WiFi and RFID, whose transmitters and receivers need to be placed on opposite sides, radar transmitters and antennas are mounted on the same side of users. The Doppler effect is the basis of radar-based systems [84, 134]. Current research mostly adopts spectrograms to represent the Doppler effect and utilizes deep learning methods to process the spectrograms [134, 165].
Wearable and ambient sensors target the motions of humans themselves. However, besides simple activities (e.g., walking, sitting, and jogging), humans perform composite activities (e.g., drinking, eating, cooking, and playing) through continuous interaction with their surroundings in practical scenarios. As a result, incorporating information on object usage is crucial for recognizing more complex human activities.
Radio-frequency identification (RFID).
Owing to their cost-efficiency, reliability, and easy implementation, RFID sensors are the most widely used for identifying object usage. When acting as object sensors rather than ambient sensors, RFID tags need to be attached to the target objects, such as mugs, books, computers, and toothpaste [20]. In the detection phase, a worn RFID reader is also needed. Considering both convenience and efficiency, wrist-worn RFID readers are among the most adopted alternatives [41, 138]. Since each object needs a unique RFID tag, and a user is generally close to an object when using it, most research favors passive RFID tags [20]. The readings of an object sensor are processed into binary marks indicating whether the object is in use.
In addition to the above sensor modalities, there are other modalities that have particular applications.
Audio Sensor.
Modern mobile devices normally have a built-in speaker and microphone pair, which can be used to recognize human activities. The speaker transmits ultrasound signals, and the microphone receives them. The basis is that the ultrasound is modified by human movements and thus reflects motion information. This modality is particularly suitable for recognizing fine-grained movements as control commands for mobile devices, since no external devices or signals are required [128]. Other potential applications have also been exploited; for example, Lee et al. proposed to use the ultrasound signals from a worn speaker-microphone pair to recognize chewing activities [81].
Pressure Sensor.
Unlike the above ambient sensing modalities, which use electromagnetic or sound waves to grasp human activities, the pressure sensor depends on mechanical mechanisms and requires direct physical contact. It can be embedded in either smart environments or wearable equipment. When implanted in a smart environment, pressure sensors can be deposited at diverse places, such as a chair [31], a table [31], a bed [42], and the floor [117]. Due to the characteristic of physical contact, small movements and various static postures can be detected. Therefore, pressure sensors are suitable for particular scenarios like exercise monitoring (pressure sensors under a fitness mat) [31] and writing posture correction [80]. When working as wearable devices, pressure sensors can be used for energy harvesting and thus enable self-powered applications [71]. They are usually installed in shoes [131] and wrist bands [69] as well as on the chest [98].
There is a number of publicly available human activity recognition datasets for various research purposes. We summarize some of the most popular ones in Table 1, which lists the data acquisition context, number of subjects, number of activities, sensor types, and the potential challenge tasks they can be used for. In the data acquisition context, "daily living" refers to subjects performing common daily living activities under instructions. The challenges are explained in detail in Section 3.
While progress has been made, human activity recognition remains a challenging task. This is partly due to the broad range of human activities as well as the rich variation in how a given activity can be performed. Using features that clearly separate activities is crucial. Feature extraction is one of the key steps in human activity recognition since it captures the relevant information needed to differentiate activities. The accuracy of activity recognition approaches dramatically depends on the
features extracted from raw signals. The most popular features that have been leveraged for activity recognition are temporal features. Moving beyond time-domain features, researchers have also investigated the feasibility of other features for activity recognition, including multimodal and statistical features.

Table 1. Public Datasets for Human Activity Recognition

| Dataset | Context | Subjects | Activities | Sensor Types | Challenges |
|---|---|---|---|---|---|
| WISDM Activity Prediction [ ] | Daily Living | | | Wearable | Class Imbalance |
| UCI HAR [ ] | Daily Living | | | Wearable | Multimodal |
| OPPORTUNITY [ , ] | Daily Living | | | Wearable, Object, Ambient | Multimodal, Composite Activity |
| Skoda Checkpoint [ ] | Car Maintenance | | | Wearable | Simple |
| Daphnet Freezing of Gait [ ] | Patients of Parkinson's Disease | | | Wearable | Simple |
| Berkeley MHAD | Daily Living | | | Wearable, Ambient | Multimodal |
| PAMAP2 [ ] | Daily Living | | | Wearable | Multimodal |
| SHO [ ] | Daily Living | | | Wearable | Simple |
| UCI HAPT [ ] | Daily Living with activity transitions | | | Wearable | Multimodal |
| UTD-MHAD [ ] | Controlled Conditions | | | Wearable | Multimodal |
| HHAR [ ] | Daily Living | | | Wearable | Multimodal, Heterogeneity |
| ARAS [ ] | Real-world Home Living | | | Ambient, Object | Multimodal, Multi-occupant |
| Ambient Kitchen [ ] | Food Preparation | | | Object | Simple |
| USC-HAD [ ] | Daily Living | | | Wearable | Multimodal |
| MHEALTH [ ] | Real-world Home Living | | | Wearable | Multimodal |
| BIDMC Congestive Heart Failure [ ] | Heart failure | | | Wearable | Class Imbalance |
| DSADS [ ] | Daily Living and Sports | | | Wearable | Multimodal |
| CASAS [ ] | Real-world Home Living | | | Object, Ambient | Multi-occupant, Composite Activity, Multimodal |
| Smartwatch/Notch/Farseeing [ ] | Daily Living & Fall Detection | | ADL & Fall | Wearable | Class Imbalance |
| Darmstadt Daily Routines [ ] | Real-world Routines | | | Wearable | Class Imbalance |
| MotionSense [ ] | Daily Living | | | Wearable | Simple |
| MobiAct/MobiFall [ ] | Daily Living & Fall Detection | | ADL & Fall | Wearable | Multimodal |
| Van Kasteren benchmark [ ] | Real-world Home Living | | | Object | Simple |
| ActiveMiles^a | Real-world Routines | | | Wearable | Multimodal |
| ActRecTut [ ] | Hand Gesture & Playing Tennis | | | Wearable | Multimodal |

^a http://hamlyn.doc.ic.ac.uk/activemiles/datasets.html
Typically, a human activity is a combination of several continuous basic movements and can last from a few seconds up to several minutes. Therefore, considering the relatively high sensing frequency (tens to hundreds of Hz), human activity data is represented by time-series signals. In this context, the basic streaming movements are likely to exhibit a smooth fluctuation, while, in contrast, the transitions between consecutive basic movements may induce substantial changes. In order to capture such signal characteristics of human activities, it is essential to extract useful temporal features both within and between successive basic movements.

Some researchers adopt traditional methods to extract temporal features and use deep learning techniques for the subsequent activity recognition. Basic signal statistics and waveform traits, such as the mean and variance of time-series signals, are commonly applied handcrafted features for early-stage deep learning activity recognition [150]. This kind of feature is coarse and lacks scalability. A more advanced temporal feature extraction approach is to exploit the spectral power changes as time evolves by converting the time series from the time domain to the frequency domain. A general example structure is shown in Figure 1 (a), where a 2D-CNN is usually used to process the spectral features. In [67], Jiang and Yin applied the Short-time Discrete Fourier Transform (STDFT) to time-series signals and constructed a time-frequency-spectral image. Then, a CNN is utilized to handle the image for recognizing simple daily activities like walking and standing. More recently, Laput and Harrison [78] developed a fine-grained hand activity sensing system through the combination of time-frequency-spectral features and CNNs. They demonstrated 95.2% classification accuracy over 25 atomic hand activities of 12 people. Spectral features can be used not only for wearable-sensor activity recognition but also for device-free activity recognition. Fan et al. [40] proposed to develop time-angle spectrum frames to represent the spectral power variations over time in different spatial angles of the RFID signals.

Since one of the most favorable advantages of deep learning technology is its impressive power of automatic feature learning, extracting temporal features with a neural network is favorable for constructing an end-to-end deep learning model. The end-to-end learning manner facilitates the training procedure and mutually promotes the feature learning and recognition processes. Various deep learning approaches have been applied for temporal information extraction, including RNNs, temporal CNNs, and their variants. The RNN is a widely applied deep temporal feature extraction approach in many fields [96, 182]. Traditional RNN cells suffer from vanishing/exploding gradient problems, which limits their application to long sequences. Long Short-Term Memory (LSTM) units, which overcome this issue, are usually used to build an RNN for temporal feature extraction [45]. The depth of an effective LSTM-based RNN needs to be at least two when processing sequential data [70]. As sensor signals are continuous streaming data, a sliding window is generally used to segment the raw data into individual pieces, each of which is the input of an RNN cell [30]. A typical LSTM-based structure for temporal feature extraction is illustrated in Figure 1 (b).
The length and moving step of the sliding window are hyper-parameters that need to be carefully tuned for achieving satisfying performance. Besides the early application of the basic LSTM network, diverse RNN variants are also being investigated in the human activity recognition field. The Bidirectional LSTM (Bi-LSTM) structure, which has two conventional LSTM layers extracting temporal dynamics from both the forward and backward directions, is an important variant of the RNN in various domains including human activity recognition [63]. In addition, Guan and Plötz [49] proposed an ensemble approach of multiple deep LSTM networks and demonstrated superior performance to individual networks on three benchmark datasets. Aside from variants of the RNN structure, some researchers also studied different RNN cells. For example, Yao et al. [169] leveraged Gated Recurrent Units (GRUs) instead of LSTM cells to construct an RNN and applied it to activity recognition. However, some studies revealed that other sorts of RNN cells could not provide notably superior performance to the conventional LSTM cell concerning classification accuracy [45]. On the other hand, due to their computational efficiency, GRUs are more suitable for mobile devices where computation resources are limited.
Fig. 1. Example structures for temporal feature extraction: (a) time-frequency-spectral features processed by a 2D-CNN; (b) an RNN over sliding-window segments; (c) a 1D-CNN over raw signals.
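To make the segment-then-recognize pipeline concrete, below is a minimal sketch in PyTorch; the window length, step, and layer sizes are illustrative assumptions rather than values prescribed by the surveyed works. It slices a multivariate sensor stream into overlapping windows and classifies each window with a two-layer LSTM, mirroring the structure of Figure 1 (b).

```python
import torch
import torch.nn as nn

def sliding_windows(signal, win_len=128, step=64):
    """Segment a (time, channels) sensor stream into overlapping windows.

    Returns a tensor of shape (num_windows, win_len, channels).
    """
    windows = [signal[s:s + win_len]
               for s in range(0, signal.shape[0] - win_len + 1, step)]
    return torch.stack(windows)

class LSTMRecognizer(nn.Module):
    """Two-layer LSTM (depth >= 2, as suggested in [70]) with a linear head."""
    def __init__(self, n_channels=6, hidden=64, n_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, win_len, channels)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # classify from the last hidden state

stream = torch.randn(10_000, 6)        # e.g., 3-axis accelerometer + gyroscope
batch = sliding_windows(stream)        # (num_windows, 128, 6)
logits = LSTMRecognizer()(batch)       # (num_windows, 6)
```

As noted above, the window length and step would be tuned per dataset and sampling rate.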
The CNN is another favorable deep learning architecture for temporal feature extraction. Unlike an RNN, a temporal CNN does not need a sliding window for segmenting streaming data. Convolution operations with small kernels are directly applied along the temporal dimension of sensor signals so that local temporal dependencies can be captured. Some works employed one-dimensional (1D) convolutions on individual univariate time-series signals for temporal feature extraction [37, 46, 125, 126, 164]. When there are multiple sensors or multiple axes, multivariate time series are yielded, thus requiring the 1D convolutions to be applied separately. Figure 1 (c) presents a typical 1D-CNN structure for temporal feature handling. Conventional 1D CNNs usually have a fixed kernel size and thus can only discover signal fluctuations within a fixed temporal range. Considering this gap, Lee et al. [82] combined multiple CNN structures with different kernel sizes to obtain temporal features at different time scales. However, the multi-kernel CNN structure consumes more computational resources, and the temporal scale that a pure CNN can explore is inadequate as well. Furthermore, if a large time scale is desirable, a pooling operation is commonly used between two CNN layers, which causes information loss. Xi et al. [160] applied a deep dilated CNN to time series to solve these issues. The dilated CNN uses dilated convolution kernels instead of standard convolution kernels to expand the convolution receptive field (i.e., time length) with no loss of resolution. Because the dilated kernel only inserts empty elements between the elements of a conventional convolution kernel, it does not require extra computational cost. In addition to the consideration of various temporal scales, the temporal disparity of different sensing modalities (e.g., different sensors, axes, or channels) is also a critical concern, since the commonly used CNN treats different modalities in the same way. To resolve this concern, Ha and Choi [54] presented a new CNN structure with specific 1D CNNs for different modalities to learn modality-specific temporal characteristics. With the development of CNNs, other CNN variants have also been considered for effectively embedding temporal features. Shen et al. [135] utilized a gated CNN for daily activity recognition from audio signals and showed superior accuracy to the naive CNN. Long et al. adopted residual blocks to build a two-stream CNN structure dealing with different time scales.
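The dilated-convolution idea can be illustrated with the following sketch (channel counts and dilation rates are assumptions for illustration, not taken from [160]): stacking dilated 1D convolutions grows the receptive field exponentially without any pooling and thus without resolution loss.

```python
import torch
import torch.nn as nn

class DilatedTemporalCNN(nn.Module):
    """Stacked dilated 1D convolutions: with kernel size 3 and dilations
    1, 2, 4, 8 the receptive field spans 31 time steps without pooling."""
    def __init__(self, n_channels=6, hidden=32, n_classes=6):
        super().__init__()
        layers, in_ch = [], n_channels
        for d in (1, 2, 4, 8):
            layers += [nn.Conv1d(in_ch, hidden, kernel_size=3,
                                 dilation=d, padding=d),  # length-preserving
                       nn.ReLU()]
            in_ch = hidden
        self.backbone = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, channels, time)
        h = self.backbone(x)              # (batch, hidden, time)
        return self.head(h.mean(dim=-1))  # global average pooling over time

logits = DilatedTemporalCNN()(torch.randn(8, 6, 128))  # (8, 6)
```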
Developing a deep hybrid model to explore different views of temporal dynamics is another attractive trend in the human activity recognition community. In light of the advantages of CNNs and RNNs, Ordóñez and Roggen [106] proposed to combine CNNs and LSTMs for both local and global temporal feature extraction. Xu et al. [162] adopted the advanced Inception CNN structure for local temporal feature extraction at different scales and took GRUs for efficient global temporal representations. Yuki et al. [172] employed a dual-stream ConvLSTM network, with one stream handling smaller time lengths and the other handling more substantial time lengths, to analyze more complex temporal hierarchies. Zou et al. [195] introduced an autoencoder to first enhance feature extraction and then applied a cascaded CNN-LSTM to extract local and global features for WiFi-based activity recognition. On the other hand, Gumaei et al. [51] proposed a hybrid model of different types of recurrent units (SRUs and GRUs) for handling different aspects of temporal information.
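A simplified sketch of the CNN-plus-LSTM hybrid pattern follows, in the spirit of [106] but not a faithful reproduction of any specific published architecture (layer sizes are assumed): convolutions extract local motion patterns, and an LSTM models their long-range ordering.

```python
import torch
import torch.nn as nn

class ConvLSTMHybrid(nn.Module):
    """Local features via 1D convolutions, global dynamics via an LSTM."""
    def __init__(self, n_channels=6, conv_ch=64, hidden=128, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, conv_ch, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(conv_ch, conv_ch, kernel_size=5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(conv_ch, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)   # (batch, time, conv_ch)
        out, _ = self.lstm(h)
        return self.head(out[:, -1])       # classify from the final state

logits = ConvLSTMHybrid()(torch.randn(8, 6, 128))  # (8, 6)
```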
Fig. 2. Multi-modality fusion strategies: (a) feature fusion; (b) classifier ensemble.
Current research on human activity recognition usually relies on multiple different sensors, such as accelerometers, gyroscopes, and magnetometers. Research has further demonstrated that the combination of diverse sensing modalities can obtain better results than one particular sensor alone [52]. As a result, learning the inter-modality correlations along with the intra-modality information is a major challenge in the field of deep learning-based human activity recognition. Sensing modality fusion can be performed following two strategies: Feature Fusion (Figure 2 (a)), which combines different modalities to produce a single feature vector for classification; and Classifier Ensemble (Figure 2 (b)), in which the outputs of classifiers operating only on the features of one modality are blended together.

Münzner et al. [100] investigated the feature fusion manner of deep neural networks for multimodal activity recognition. They organized the fusion manners into four categories according to the fusion stage within a network. However, their study focuses on CNN-based architectures only. Here, we extend their definitions of feature fusion manners to all deep learning architectures and manage to reveal more insights and specific considerations.
Early Fusion (EF).
This manner fuses the data of all sources at the beginning, irrespective of sensing modalities. It is attractive in terms of simplicity as a strategy, though it is at risk of missing detailed correlations. A simple fusion approach in [82] transformed the raw x, y, and z acceleration data into a magnitude vector by calculating the Euclidean norm of the x, y, and z values. Gu et al. [47] stacked the time-series signals of different modalities horizontally into a single 1D vector and utilized a denoising autoencoder to learn robust representations; the output of the intermediate layer was used to feed the final softmax classifier. In contrast, Ha et al. [55] proposed to vertically stack all signal sequences to form a 2D matrix and directly applied 2D-CNNs to simultaneously capture both local dependencies over time and spatial dependencies over modalities. In [53], the authors preprocessed the raw signal sequence of a single modality into a 2D format by simple reorganization and stacked all modalities along the depth dimension to finally achieve 3D data matrices. Afterwards, they applied a 3D-CNN to exploit the inter- and intra-modality features. However, a conventional CNN is restricted to exploring the correlations of adjacently arranged modalities and thus misses the relations between nonadjacent modalities. To solve this issue, instead of naturally organizing the various data sources, Jiang and Yin [67] assembled the signal sequences of different modalities into a novel arrangement in which every signal sequence has the chance to be adjacent to every other sequence. This organization facilitates the DCNN to extract elaborate correlations of individual sensing axes. Dilated convolution is another solution for exploiting nonadjacent modalities without information loss or extra computational expense [161]. In addition to wearable sensors, RFID-based activity recognition requires the fusion of multiple RFID signals as well, and CNNs are also commonly used in the early fusion manner [86].
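As a concrete illustration of early fusion (a sketch under assumed window shapes, not a specific published model), the code below computes the magnitude vector of [82] and, separately, stacks all axes of all modalities into one 2D matrix for a 2D-CNN in the manner of [55]:

```python
import torch
import torch.nn as nn

acc = torch.randn(128, 3)   # accelerometer window: (time, xyz)
gyr = torch.randn(128, 3)   # gyroscope window

# Magnitude vector as in [82]: Euclidean norm over the three axes.
acc_magnitude = acc.norm(dim=1)              # (128,)

# Early fusion: vertically stack all signal sequences into one 2D matrix,
# then let a 2D-CNN see time and modality dimensions jointly.
fused = torch.cat([acc, gyr], dim=1).T       # (6 axes, 128 time steps)
x = fused.unsqueeze(0).unsqueeze(0)          # (batch=1, channel=1, 6, 128)

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=(3, 5), padding=(1, 2)), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 6))
logits = cnn(x)                              # (1, 6)
```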
Sensor-based Fusion (SF).
In contrast to EF, SF first considers each modality individually and then fuses the different modalities afterwards. Such an architecture not only extracts modality-specific information from the various sensors but also allows a flexible complexity distribution, since the structures of the modality-specific branches can differ. In [115, 116], Radu et al. proposed a fully-connected deep neural network (DNN) architecture to facilitate intra-modality learning. Independent DNN branches are assigned to each sensor modality, and a unifying cross-sensor layer merges all the branches to uncover the inter-modality information. Yao et al. [169] vertically stacked all axes of a sensor to form 2D matrices and designed individual CNNs for each 2D matrix to learn the intra-modality relations. The sensor-specific features of the different sensors are then flattened and stacked into a new 2D matrix before being fed into a merge CNN that further extracts the interactions among the different sensors. A more advanced fusion approach was proposed by Choi et al. [34] to efficiently fuse different modalities by regulating the level of contribution of each sensor. The authors designed a confidence calculation layer that automatically determines the confidence score of a sensing modality; the confidence score is then normalized and multiplied with pre-processed features for the following feature fusion by addition. Instead of fusing sensor-specific features only at the late stage, Ha and Choi [54] proposed to also create a vector of the different modalities at the early stage and to extract the common characteristics across modalities along with the sensor-specific characteristics; both kinds of features are fused in the later part of the model.
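A minimal sketch of sensor-based fusion follows, assuming three tri-axial modalities and illustrative layer sizes (a generic branch-and-merge layout, not the architecture of [115, 116] or [169]): each modality gets its own branch, and a merge network uncovers the inter-modality information.

```python
import torch
import torch.nn as nn

class SensorBasedFusion(nn.Module):
    """One branch per modality, then a cross-sensor merge network."""
    def __init__(self, modality_channels=(3, 3, 3), hidden=32, n_classes=6):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(ch, hidden, kernel_size=5, padding=2),
                          nn.ReLU(), nn.AdaptiveAvgPool1d(1), nn.Flatten())
            for ch in modality_channels])
        self.merge = nn.Sequential(
            nn.Linear(hidden * len(modality_channels), hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes))

    def forward(self, modalities):  # list of (batch, channels, time) tensors
        feats = [branch(m) for branch, m in zip(self.branches, modalities)]
        return self.merge(torch.cat(feats, dim=1))

# Accelerometer, gyroscope, and magnetometer windows of 128 samples each.
inputs = [torch.randn(8, 3, 128) for _ in range(3)]
logits = SensorBasedFusion()(inputs)  # (8, 6)
```

The branches here are identical for brevity; as noted above, SF allows each branch to have a different structure and capacity.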
Axis-based Fusion (AF).
This manner treats the signal sources in more detail by handling each sensor axis separately; in this way, the interference between different sensor axes is removed. [100] referred to this manner as Channel-based late fusion (CB-LF). Nevertheless, the sensor channel may be confused with the "channel" in CNNs, so we use the term "axis" instead in this paper. A commonly used AF strategy is to design a specific neural network for each univariate time series of each sensing axis [176, 191]. The information representations from all axes are concatenated at last as the input of a final classification network. 1D-CNNs are widely used as the feature learning network of each sensing axis. Dong and Han [36] proposed to use separable convolution operations to extract the specific temporal features of each axis and to concatenate all the features before feeding a fully-connected layer. In studies applying deep learning to hand-crafted features, the axis-specific process is a requirement. For instance, in [64], temporal features of acceleration and gyroscope signals are first represented as FFT spectrogram images and then vertically combined into a larger image for the following DCNN to learn inter-modality features. Furthermore, some research combined the spectrogram images along the depth dimension
to establish a 3D format [78], which can be easily handled by 2D CNNs with the depth dimension as the CNN input channel.
Shared-filter Fusion (SFF).
Like the AF approach, this manner processes the univariate time series of each sensor axis independently. However, the same filter is applied to all time sequences, so the filters are influenced by all input members. Compared to the AF manner, SFF is simpler and contains fewer trainable parameters. The most popular SFF approach is to organize the raw sensing sequences into a 2D matrix by stacking along the modality dimension and then to apply a 2D-CNN with 1D filters to the matrix [37, 164, 174]. As a result, the architecture is equivalent to applying identical 1D-CNNs to the different univariate time series. Although the features of the sensing modalities are not merged explicitly, they communicate with each other through the shared 1D filters.
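The 2D-matrix-with-1D-filters arrangement can be sketched as follows (shapes and channel counts are illustrative assumptions): a 1 x k kernel slides over every axis of the stacked (axis, time) matrix independently while its weights are shared across all axes.

```python
import torch
import torch.nn as nn

# Shared-filter fusion: stack all axes into a 2D (axis, time) matrix and
# convolve with 1 x k kernels, so the same temporal filter is applied to
# every axis and is trained by all of them.
x = torch.randn(8, 1, 6, 128)          # (batch, channel, axes, time)

sff = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(1, 5), padding=(0, 2)), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=(1, 5), padding=(0, 2)), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6))

logits = sff(x)                        # (8, 6)
```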
Fig. 3. Strategies for feature fusion: (a) early fusion; (b) sensor-based fusion; (c) axis-based fusion; (d) shared-filter fusion.
Classifier Ensemble.
In addition to fusing features before inference, the integration of multiple modalities can also be achieved by blending the recognition results from each modality. A range of ensemble approaches has been developed for fusing recognition results to yield an overall inference. For example, Guo et al. [52] proposed to use MLPs to create a base classifier for each sensing modality and to incorporate all classifiers by assigning ensemble weights at the classifier level. When building the base classifiers, the authors not only considered the recognition accuracy but also emphasized the diversity of the base classifiers by introducing diversity measures. Thus, the diversity of the different modalities is preserved, which is critical to overcoming over-fitting and improving the overall generalization ability. Beyond the conventional classifier ensemble, Khan et al. [73] targeted the fall detection problem and introduced an ensemble of the reconstruction errors from the autoencoder of each sensing modality.

The most attractive benefit of the classifier ensemble method is its scalability to additional sensors. A well-developed model for a specific sensing modality can be easily merged into an existing system by configuring the ensemble part only. Conversely, when a sensor is removed from a system, the recognition model can be freely adapted to this hardware change. Nevertheless, an intrinsic shortcoming of ensemble fusion is that the inter-modality correlations may be underestimated due to the late fusion stage.
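A schematic of the classifier-ensemble manner follows; the per-modality classifiers and the fixed ensemble weights are placeholders for illustration, not those learned in [52].

```python
import torch
import torch.nn as nn

# Classifier ensemble: one independent classifier per modality; the final
# prediction blends the per-modality class probabilities with ensemble weights.
n_classes = 6
classifiers = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 128, n_classes))
               for _ in range(3)]        # one simple MLP per modality
weights = torch.tensor([0.5, 0.3, 0.2])  # assumed ensemble weights

modalities = [torch.randn(8, 3, 128) for _ in range(3)]
probs = [torch.softmax(clf(m), dim=1) for clf, m in zip(classifiers, modalities)]
blended = sum(w * p for w, p in zip(weights, probs))  # (8, n_classes)
prediction = blended.argmax(dim=1)

# Adding a new sensor only requires appending a classifier and re-tuning the
# weights; the existing per-modality models stay untouched.
```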
Different from deep learning-based feature extraction, feature engineering-based methods are able to extract meaningful features, such as statistical information. However, domain knowledge is usually required for manually designing such features. Recently, Qian et al. [114] managed to develop a Distribution-Embedded Deep Neural Network (DDNN) that integrates the statistical feature extraction process into an end-to-end framework for activity recognition. It encodes the idea of kernel embedding of distributions into a deep architecture, such that statistical moments of all orders can be extracted as features to represent each segment of sensor readings and further used for activity classification in an end-to-end training manner. To be specific, the authors aim to design a network f that learns statistical features from multiple kernels without manual parameter tuning, i.e., f(X) = ϕ(X), where X is the sensor data and ϕ is a feature mapping function that maps the d-dimensional data space to a high-dimensional or even infinite-dimensional Hilbert space H. As the technique of kernel embedding for representing an arbitrary distribution requires the feature mapping to be injective, the neural network should satisfy f⁻¹(f(X)) = X for all possible X. Therefore, the authors utilized an autoencoder to guarantee the injectivity of the feature mapping. They also introduced an extra loss function based on the MMD distance to force the autoencoder to learn good feature representations of the inputs. Extensive experiments on four datasets demonstrated the effectiveness of the statistical feature extraction method. Although extracting statistical features has been explored in a deep-learning-based way, more reasonable and meaningful explanations of the extracted features remain undeveloped.

Section 3.1 surveys the recent deep learning methods for extracting distinguishable features from sensory data. We can see that most of them are supervised methods. A main characteristic of supervised learning methods is the need for a mass of labeled data to train the discriminative models. However, such a substantial amount of reliably labeled data is not always available, for two reasons. First, the annotation process is expensive, time-consuming, and very tedious. Second, labels are subject to various sources of noise, such as sensor noise, segmentation issues, and the variation of activities across different people, which makes the annotation process error-prone. Therefore, researchers have begun to investigate unsupervised and semi-supervised learning approaches to reduce the dependence on massive annotated data.
Unsupervised learning is mainly used for exploratory data analysis to discover patterns in data. In [83], the authors examined the feasibility of incorporating unsupervised learning methods in activity recognition. In [144], the Expectation-Maximization algorithm and Hidden Markov Model regression are used to analyze temporal acceleration data. Nevertheless, the community of activity recognition still needs more effective methods to deal with high-dimensional and heterogeneous sensory data.

Recently, deep generative models such as Deep Belief Networks (DBNs) and autoencoders have become dominant in unsupervised learning. DBNs and autoencoders are composed of multiple layers of hidden units. They are useful for extracting features and finding patterns in massive data. Also, deep generative models are more robust against overfitting than discriminative models [97]. Researchers therefore tend to use them for feature extraction so as to exploit unlabeled data, which is easy and cheap to collect for activity datasets. According to Erhan et al. [39], generative pretraining of a deep model guides the discriminative training toward better-generalizing solutions. Pretraining a deep network on large-scale unlabeled datasets in an unsupervised fashion thus became very common. The whole recognition process can be divided into two parts. First, the input data are fed to feature extractors, usually deep generative models, for pretraining in order to extract features. Second, a top layer or another classifier is added and trained with labeled data in a supervised fashion for classification. During the supervised training, the weights in the feature extractor may be fine-tuned. For example, DBN-based activity recognition models are implemented in [6]; the unsupervised pretraining is followed by fine-tuning the learned weights in an up-down manner with the available labeled samples.
In [56], the same pretraining process is conducted, but Restricted Boltzmann Machines (RBMs) are applied to learn a generative model of the input features. In another work [111], Plötz et al. proposed to use autoencoders for unsupervised feature learning as an alternative to Principal Component Analysis (PCA) for activity recognition in ubiquitous computing. The authors in [33, 47, 177] employed variants of autoencoders, such as stacked autoencoders [33], stacked denoising autoencoders [47], and CNN autoencoders [177], to combine automatic feature learning and dimensionality reduction in one integrated neural network for activity recognition. In a recent work [11], Bai et al. proposed a method called Motion2Vector to convert a time period of activity data into a movement vector embedding within a multidimensional space. To fit the context of activity recognition, they used a bidirectional LSTM to encode the input blocks of temporal wrist-sensing data. The two generated hidden states are concatenated to form the embedded vectors, which can be considered an appropriate representation of the input movement. Classifiers such as C4.5, K-nearest neighbors, and random forests are trained afterwards for classification. The experiments showed that this method can achieve accuracy higher than 87% when tested on public datasets.

Despite the success of deep generative models in unsupervised learning for human activity recognition, unsupervised learning still cannot undertake activity recognition tasks independently, since it is not capable of identifying the true labels of activities without any labeled samples presenting the ground truth. Therefore, the aforementioned methods should be considered semi-supervised learning, in which both labeled data and unlabeled data are leveraged for training the neural networks.
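The pretrain-then-fine-tune recipe described above can be sketched as follows (an illustrative fully-connected autoencoder; the sizes, epoch counts, and random data are assumptions, not any specific surveyed model):

```python
import torch
import torch.nn as nn

# Stage 1: unsupervised pretraining of an autoencoder on unlabeled windows.
encoder = nn.Sequential(nn.Linear(6 * 128, 256), nn.ReLU(), nn.Linear(256, 64))
decoder = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 6 * 128))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

unlabeled = torch.randn(1024, 6 * 128)       # flattened sensor windows
for _ in range(10):                          # reconstruction-loss pretraining
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(unlabeled)), unlabeled)
    loss.backward()
    opt.step()

# Stage 2: add a classifier head and fine-tune on the few labeled samples.
model = nn.Sequential(encoder, nn.Linear(64, 6))
opt2 = torch.optim.Adam(model.parameters())  # encoder weights fine-tuned too
labeled, labels = torch.randn(64, 6 * 128), torch.randint(0, 6, (64,))
for _ in range(10):
    opt2.zero_grad()
    loss = nn.functional.cross_entropy(model(labeled), labels)
    loss.backward()
    opt2.step()
```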
Fig. 4. Co-training (a) and active learning (b) for annotation scarcity.

Semi-supervised learning has become a recent trend in activity recognition because of the difficulty of obtaining labeled data. A semi-supervised method requires only a small amount of labeled data together with massive unlabeled data for training. How to utilize unlabeled data to reinforce the recognition system has become a point of interest. As deep learning is powerful in capturing patterns from data, various semi-supervised learning approaches have been incorporated into activity recognition, such as co-training, active learning, and data augmentation.
Co-training was proposed by Blum and Mitchell in 1998 [17] as an extension of self-learning. In self-learning approaches, a weak classifier is first trained with a small amount of labeled data. This classifier is then used to classify the unlabeled samples; the samples predicted with high confidence are labeled and added to the labeled set for re-training the classifier. In co-training, multiple classifiers are employed, each of which is trained on one individual view of the training data. Likewise, the classifiers select unlabeled samples to add to the labeled set by confidence score or majority voting. The whole process of co-training is illustrated in Figure 4 (a). With the training set augmented, the classifiers are enhanced. Blum and Mitchell [17] suggested that co-training is fully
effective under three conditions: (a) the multiple views of the training data are not strongly correlated, (b) each view contains sufficient information for learning a weak classifier, and (c) the views are mutually redundant. With respect to sensor-based human activity recognition, co-training is compatible because multiple modalities can be regarded as multiple views. Chen et al. [28] applied co-training with multiple classifiers on different modalities of the data: three classifiers are trained on acceleration, angular velocity, and magnetism, respectively. The learned classifiers are used to predict the unlabeled data after each training round. If most of the classifiers reach an agreement on an unlabeled sample, this sample is labeled and moved to the labeled set for the next training round. The training loop is repeated until no confident samples can be labeled or the unlabeled set is empty. Then a new classifier is trained on the final labeled set with all modalities.

The process of co-training resembles human learning. People learn new knowledge from existing experience, and new knowledge can be used to summarize and accumulate experience; experience and knowledge constantly interact with each other. Similarly, co-training uses the current models to select new samples that they can learn from, and the samples help to train the models for the next selection. However, automatic labeling may introduce errors, and acquiring correct labels can improve accuracy.
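The co-training loop described above can be sketched as follows, using scikit-learn-style classifiers as stand-ins for the per-modality models (the agreement threshold and number of rounds are illustrative assumptions, not the settings of [28]):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(views_labeled, y, views_unlabeled, rounds=10, agree=2):
    """Schematic co-training over modality views (e.g., acceleration,
    angular velocity, magnetism): each view trains its own classifier, and
    unlabeled samples on which at least `agree` classifiers concur are
    pseudo-labeled and moved to the labeled set."""
    clfs = []
    for _ in range(rounds):
        clfs = [LogisticRegression(max_iter=1000).fit(Xv, y)
                for Xv in views_labeled]
        if len(views_unlabeled[0]) == 0:
            break                               # unlabeled pool exhausted
        preds = np.stack([clf.predict(Xv) for clf, Xv
                          in zip(clfs, views_unlabeled)]).astype(int)
        majority = np.array([np.bincount(col).argmax() for col in preds.T])
        votes = np.array([np.bincount(col).max() for col in preds.T])
        keep = votes >= agree
        if not keep.any():
            break                               # no more confident samples
        y = np.concatenate([y, majority[keep]])
        views_labeled = [np.vstack([Xl, Xu[keep]])
                         for Xl, Xu in zip(views_labeled, views_unlabeled)]
        views_unlabeled = [Xu[~keep] for Xu in views_unlabeled]
    return clfs
```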
Active learning is another category in semi-supervised learning. Different from self-learningand co-learning which label the unlabeled samples automatically, active learning requires annotatorswho are usually experts or users to label the data manually. In order to lighten the burden of labeling,the goal of active learning is to select the most informative unlabeled instances for annotators tolabel and improve the classifiers with these data so that minimal human supervision is needed. Herethe most informative instances denote the instances that bring the most enormous impact on themodel if their labels are available. A general framework of active learning can be seen in Figure 4 (b).It includes a classifier, a query strategy, and an annotator. The classifier learns from a small amountof labeled data, selects one or a set of the most useful unlabeled samples via query strategy, ask theannotator for true labels, and utilize the new labels for further training and next query. The activelearning process is also a loop. It stops when it meets the stop criteria. There are two commonquery strategies for selecting the most profitable samples which are uncertainty and diversity.Uncertainty can be measured by information entropy. Larger entropy means higher uncertaintyand better informativeness. Diversity means that the queried samples should be comprehensive,and the information provided by them are non-repetitive and non-redundant. In [140], the authorsapplied two query strategies. One of them is to select samples with lowest prediction confidence,and the other one resort to the idea of co-training, but it oppositely selects samples with highdisagreement among classifiers.Deep active learning approaches are deployed in activity recognition [59, 60]. Hossain et al. [59]considered that traditional active learning methods merely choose the most informative sampleswhich only occupy a small fraction of the available data pool. In this way, a large number ofsamples that are not selected are discarded. Although the selected samples are vital for training,the discarded samples are also of value on account of the substantial amount. Therefore, theyproposed a new method to combine active learning and deep learning in which not only the mostinformative unlabeled samples are queried but the less necessary samples are also leveraged. Thedata is first clustered with K-means clustering. While the intuitive idea is to query the optimalsamples such as the centroids of the clusters, in this work, the neighboring samples are also queried.The experiments show that the proposed method can achieve the optimal results by labeling 10%of the data.Hossain and Roy [60] further investigated two problems of deep active learning and humanactivity recognition. The first problem is that outliers can be easily mistaken for important samples.When entropy is calculated for selection, apart from informativeness, larger entropy may also mean
Hossain and Roy [60] further investigated two problems at the intersection of deep active learning and human activity recognition. The first problem is that outliers can easily be mistaken for important samples: when entropy is calculated for selection, larger entropy may indicate not only informativeness but also outliers, because outliers belong to none of the classes. Therefore, a joint loss function was proposed in [60] to address this problem: the cross-entropy loss and an information loss are jointly minimized to reduce the entropy of outliers. The second problem considered in this work is how to reduce the workload of annotators, as annotators are required to master domain knowledge to provide accurate labels. Multiple annotators are employed in this work; they are selected from people close to the users. The annotator selection is made by a reinforcement learning algorithm according to the heterogeneity and the relations of users, where contextual similarity is used to measure the relations among users and annotators. The experimental results show an 8% improvement in accuracy and a higher convergence rate.
Co-training and active learning are based on the same idea of rebuilding the model upon labels of unlabeled data. Different from these, another method is to synthesize new activity data, which can be applied when data collection is challenging in specific scenarios such as resource-limited or high-risk scenarios.
Data augmentation with synthesized data refers to generating massive amounts of artificial data from a small amount of real data so that the artificial data can facilitate training the models. One popular tool is the Generative Adversarial Network (GAN), first introduced in [44], which is powerful in synthesizing data that follow the distribution of the training data. A GAN is composed of two parts, a generator and a discriminator. The generator creates synthetic data, and the discriminator evaluates them for authenticity. The goal of the generator is to generate data that are genuine enough to cheat the discriminator, while the goal of the discriminator is to identify the samples created by the generator as fake. Training proceeds in an adversarial manner based on a min-max game: during training, the generator and the discriminator mutually improve their performance in generation and discrimination. Variants of GANs have been applied to different fields such as language generation [113] and image generation [193].
The first work on data augmentation with synthesized sensory data for activity recognition is SensoryGANs [153]. As sensory data is heterogeneous, a unified GAN may not be able to depict the complex distributions of different activities, so Wang et al. employed three activity-specific GANs for three activities. After generation, the synthetic data are fed into the classifiers together with the original data. We should note that although this work uses deep generative networks, the generation process depends on labels, so the process is not unsupervised. Zhang et al. [188] proposed to use a semi-supervised GAN for activity recognition. Different from a regular GAN, the discriminator in a semi-supervised GAN makes a (K+1)-class prediction, where the first K classes correspond to the real activities and the extra class marks generated samples, so unlabeled real data can also be exploited during training.
Table 2. Deep Learning Works for Annotation Scarcity in Activity Recognition
Training Fashion | Learning Approach | References
Unsupervised | Pretraining | Alsheikh et al. 2016 [6]; Hammerla et al. 2015 [56]; Plötz et al. 2011 [111]; Chikhaoui and Gouineau [33]; Gu et al. 2018 [47]; Zeng et al. 2017 [177]; Bai et al. 2019 [11]
Semi-Supervised | Co-training | Chen et al. 2019 [28]
Semi-Supervised | Active Learning | Hossain et al. 2018 [59]; Hossain and Roy 2019 [60]
Semi-Supervised | Synthesizing Data | Wang et al. 2018 [153]; Zhang et al. 2019 [188]
The primary contributor to the success of deep learning techniques is the availability of a large volume of training data, enabled by modern information technology. Most existing research on human activity recognition follows a supervised learning manner, which requires a significant amount of labeled data to train a deep model. However, sensor data of some specific activities are challenging to obtain, such as those related to falls of older people. In addition, raw data recorded under unconstrained conditions are naturally class-imbalanced. When trained on an imbalanced dataset, conventional models tend to predict the class with the majority of training samples while ignoring the classes with few available training samples. Therefore, it is pressing to address the class imbalance issue when developing an effective activity recognition model.
The most intuitive path to tackling the imbalance problem is to sub-sample the class with the largest number of samples. However, such a method risks reducing the total amount of training samples and omitting critical samples with featured characteristics. In contrast, augmenting new samples for the classes with a minority of samples not only keeps all original samples but also enhances the model's robustness. Grzeszick et al. [46] utilized two augmentation methods, Gaussian noise perturbation and interpolation, to tackle the problem of class imbalance. The augmentation approaches preserve the coarse structure of the data while simulating a random time jitter in the sensor's sampling process. They created a larger number of samples for the under-represented classes and ensured that each class has at least a certain percentage of data in the training set. Another direction for solving the imbalance concern is to modify the model-building strategy instead of directly balancing the training dataset. In [49], Guan and Plötz utilized the F1-score, which weights all classes equally, to guide model building.
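The two augmentation operations attributed to [46] can be sketched as follows; the noise scale, the interpolation scheme, and the oversampling loop are our own illustrative choices.

```python
# Oversampling an under-represented activity class with Gaussian noise
# perturbation and time interpolation (parameters are assumptions).
import numpy as np

def jitter(x, sigma=0.05):
    """Gaussian noise perturbation of a sensor window x with shape (T, C)."""
    return x + np.random.normal(0.0, sigma, x.shape)

def time_interpolate(x):
    """Linear interpolation between consecutive samples, simulating a random
    jitter in the sampling instants (output is one step shorter)."""
    shift = np.random.uniform(0.0, 1.0)
    return (1.0 - shift) * x[:-1] + shift * x[1:]

def oversample(windows, target_count):
    """Augment a list of (T, C) windows of one minority class until it
    reaches target_count samples."""
    out = list(windows)
    while len(out) < target_count:
        x = windows[np.random.randint(len(windows))]
        aug = jitter(x) if np.random.rand() < 0.5 else time_interpolate(x)
        out.append(aug)
    return out
```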
Many state-of-the-art approaches for human activity recognition assume that the training data and the test data are independent and identically distributed (i.i.d.). However, this assumption is impractical since sensory data for activity recognition is heterogeneous. The heterogeneity of sensory data can be divided into three categories. The first is the heterogeneity with users, which stems from the different motion patterns of different people performing the same activities. The second is the heterogeneity with time: in a dynamic streaming environment, the data distributions of activities change over time, and new activities may also emerge. The third is the heterogeneity with sensors: sensors used for human activity recognition are usually sensitive, and a small variation in the sensors can cause a significant disturbance in the sensory data. The factors that may bring about heterogeneity with sensors include sensor instances, types, positions, and layouts in the environment. Considering these three categories of heterogeneity, in real-world scenarios where sensing devices are deployed without restrictions, distribution divergence can be observed between the training data and the test data, and the resulting sudden drop in recognition accuracy raises concerns.
Before taking a closer look at the factors that cause heterogeneity in sensory data, we briefly introduce transfer learning. Transfer learning is a common machine learning technique that transfers the classification ability of a learning model from one predefined setting to a dynamic setting. It is particularly effective in solving heterogeneity problems, as it avoids the decline in the performance of learning models when the training data and the test data follow different distributions. In the activity recognition context, this problem appears when activity recognition models are deployed in a configuration different from the one in which they were trained. In transfer learning, the source domain refers to domains that contain massive annotated data and knowledge, and the goal is to leverage the information from the source domain to annotate the samples in the target domain. Regarding activity recognition, the source domain corresponds to the original configuration, and the target domain denotes the new deployment that the system has never encountered (e.g., new activities, new users, new sensors). In the following sections, we introduce the three categories of heterogeneity in detail and show how state-of-the-art approaches manage to mitigate them. Most of these approaches are based on transfer learning.
Heterogeneity with Users.
Owing to biological and environmental factors, the same activity can be performed differently by different individuals. For example, some people walk slowly while others prefer to walk faster and more dynamically. Since people have diverse behavior patterns, data from different users are distributed differently. Usually, if the models are trained and tested with data collected from a single specific user, the accuracy can be rather high. However, this setting is impractical: in practical human activity recognition scenarios, while a certain number of participants' data can be collected and annotated for training, the target users are usually unseen by the systems. Thus, the distribution divergence between the training data and the test data emerges as a challenge in human activity recognition, and the performance of the models falls dramatically across users. Research on personalized learning models for a specific user is therefore significant. Model personalization for a specific user with only a small amount of data from the target user has proved to be valid in [157]. Recently, personalized deep learning models for heterogeneity with users in activity recognition have been explored. Woo et al. [158] proposed an approach that builds an RNN model for each individual. Learning Hidden Unit Contributions (LHUC) were applied in [94], where a particular layer with few parameters is inserted between every two hidden layers of a CNN, and these parameters are trained using a small amount of data (see the sketch below). Rokni et al. [124] proposed to personalize their models with transfer learning: in the training phase, a CNN is first trained with data collected from a few participants (source domain); in the test phase, only the top layers of the CNN are fine-tuned with a small amount of data from the target users (target domain). Annotation for target users is required. GAN is also serviceable for addressing heterogeneity with users. In [139], the authors generated data of the target domain directly from the source domain with GANs to enhance the training of the classifier. Chen et al. [26] further defined person-specific discrepancy and task-specific consistency for people-centric sensing applications. Person-specific discrepancy means the distribution divergence of data collected from different people, and task-specific consistency denotes the inherent similarity of the same activity. Their learned features not only reduce person-specific discrepancy but also preserve task-specific consistency, guaranteeing the recognition accuracy after transferring.
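A minimal sketch of an LHUC-style personalization layer in the spirit of [94] is given below: a small vector of per-user parameters rescales the hidden units, and only these parameters are trained on the target user's data. The 2·sigmoid squashing follows the common LHUC formulation; the backbone and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class LHUC(nn.Module):
    """Per-user scaling of hidden units; r = 0 gives the identity (scale 1)."""
    def __init__(self, n_channels):
        super().__init__()
        self.r = nn.Parameter(torch.zeros(n_channels))  # the only per-user parameters

    def forward(self, h):                        # h: (N, C, T) conv feature maps
        return h * (2.0 * torch.sigmoid(self.r)).view(1, -1, 1)

backbone = nn.Sequential(
    nn.Conv1d(3, 32, kernel_size=5), nn.ReLU(), LHUC(32),
    nn.Conv1d(32, 64, kernel_size=5), nn.ReLU(), LHUC(64),
)

# Personalization: freeze the shared backbone, train only the LHUC vectors
# on the small amount of data from the target user.
for p in backbone.parameters():
    p.requires_grad = False
for m in backbone.modules():
    if isinstance(m, LHUC):
        m.r.requires_grad = True
```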
Fig. 5. Heterogeneity with Time: (a) Concept Drift; (b) Concept Evolution; (c) Open-Set.
Heterogeneity with Time.
Human activity recognition systems collect dynamic, streaming data that log people's motions. In a real-world recognition system, initial training data portraying a set of activities is collected to train the original model, and the model is then configured for future activity recognition. In long-term systems that run for months or even years, a natural characteristic to consider is that the streaming sensory data change over time. Three problems can be derived from the heterogeneity with time, according to the extent of change and the need to recognize the new concepts in the data: the concept drift problem, the concept evolution problem, and the open-set problem.
Concept Drift.
Figure 5(a) shows the first problem of heterogeneity with time in activity recognition, called concept drift [132]. It denotes the distribution shift between the training domain and the test domain (or the source domain and the target domain). Concept drift can be abrupt or gradual [1]. To accommodate the drift, deep learning models should incorporate incremental training to continuously learn new concepts of human activities from newly arriving data. For example, an ensemble classifier termed multi-column bi-directional LSTM was proposed in [143]; the model leverages new training samples gradually via incremental learning. Active learning is a special type of incremental learning. In streaming data systems, active learning is able to query the ground truth for some samples when change is detected in the data streams. It encourages selecting the most efficient samples to update the models with the new concepts, which is why active learning can help deep learning models mitigate the heterogeneity with time of streaming sensory data [50, 130]. For example, Gudur et al. [50] proposed a deep Bayesian CNN with dropout to obtain the model's uncertainties and select the most informative data points to be queried according to the uncertainty query strategy (see the sketch below). Owing to active learning, the model can be updated continuously and capture the changes in the data over time.
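A minimal sketch of the uncertainty estimate behind such a Bayesian treatment is given below: dropout is kept active at inference, and the entropy of the averaged stochastic predictions drives the query strategy. The network, the number of passes, and the query size are assumptions, not the exact configuration of [50].

```python
# Monte Carlo dropout sketch: the spread of several stochastic forward passes
# provides the predictive uncertainty used to query new labels.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128, 128), nn.ReLU(),
                    nn.Dropout(0.5), nn.Linear(128, 6))

def mc_dropout_entropy(net, x, passes=20):
    net.train()  # keep dropout stochastic even when predicting
    with torch.no_grad():
        probs = torch.stack([torch.softmax(net(x), dim=-1)
                             for _ in range(passes)]).mean(0)
    return -(probs * probs.clamp_min(1e-12).log()).sum(-1)  # per-sample entropy

x_pool = torch.randn(256, 3, 128)        # illustrative unlabeled pool
scores = mc_dropout_entropy(net, x_pool)
query_idx = scores.topk(16).indices      # samples with the highest uncertainty
```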
Concept Evolution.
Figure 5(b) represents the distribution of concept evolution. Concept evolution denotes the emergence of new activities in the streaming data. It appears because collecting labeled data for all kinds of activities in the initial learning phase is impractical. First, despite the effort, the initial training set of an activity recognition system can only contain a limited number of activities. Second, people can perform new activities that they never did before the initial training of the system (e.g., learning to play guitar for the first time). Third, data of certain activities, such as people falling down, are difficult to collect. However, these activities may still appear in the test or application phase, so the concepts of the new activities still need to be learned. It is essential to study activity recognition systems that can recognize new activities in streaming data settings. Nevertheless, this is difficult due to the restricted access to annotated data in the application phase. One approach is to decompose activities into mid-level features such as arm up, arm down, leg up, and leg down. This method demands experts to define the mid-level attributes for further training,
and its capability is limited when new activities composed of new attributes appear [102]. Other deep learning methods for activity concept evolution are still less explored, so some researchers take a step back and study the open-set problem.
Open-Set.
The open-set problem is currently a trending topic. Previously, most state-of-the-art works addressed "closed-set" problems, where the training set and the test set contain the same set of activities. The open-set problem also originates from the fact that we can never collect sufficiently many kinds of activities in the initial training phase. However, compared with concept evolution, solutions to open-set problems only need to identify whether the test samples belong to the target activities, rather than exactly recognize the new activities. Figure 5(c) represents the distribution of open-set problems, where the shadow marks the space in which new activities may emerge. An intuitive solution to open-set problems is to build a negative set so that they can be treated in a closed-set way. A deep model based on GAN is proposed in [165]: the authors generate fake samples with a GAN to construct the negative set, and the discriminator of the GAN can be seamlessly used as the open-set classifier.
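The negative-set idea can be sketched as follows: samples from a stand-in generator are labeled as an extra class K, so a standard (K+1)-way classifier can reject unseen activities. This is an illustrative reduction of the approach in [165], not its exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 6                                        # number of known activities
clf = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128, 128), nn.ReLU(),
                    nn.Linear(128, K + 1))   # class K = "unknown/fake"
generator = nn.Sequential(nn.Linear(64, 3 * 128), nn.Tanh())  # stand-in generator

def train_step(x_real, y_real, opt):
    """x_real: (B, 3, 128) windows; y_real: (B,) labels in 0..K-1."""
    B = len(x_real)
    x_fake = generator(torch.randn(B, 64)).detach().view(B, 3, 128)
    x = torch.cat([x_real, x_fake])
    y = torch.cat([y_real, torch.full((B,), K, dtype=torch.long)])  # negative set
    loss = F.cross_entropy(clf(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```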
Heterogeneity with Sensors.
Sensors used for activity recognition include wearable and ambient sensors. Due to the sensitivity of sensors, a tiny variation in the sensors may lead to substantial changes in the data they collect or transmit. The influential factors include sensor instances, types, positions, and layouts in the environment. To illustrate: different instances of sensors may have different parameters such as the sampling rate; different types of sensors collect entirely different types of data with varying shapes, frequencies, and scales; wearable sensors attached to positions of the human body can only record motions of the corresponding body parts; and the environmental layouts of device-free sensors influence the propagation of signals. All of these factors may cause drops in recognition accuracy when the classifiers are not trained for the specific device deployments. Therefore, seamless deep learning models for activity recognition in the wild are necessary. [99] proves that features learned by deep learning models are transferable across sensor types and sensor deployments for activity recognition, especially the features extracted in the lower layers, which agrees with previous conclusions in [170].
Sensor Instances.
Even when data is collected in the same setting and only the sensor instances differ, for example, when a person replaces his smartphone with a new one, the recognition accuracy still declines. Both the hardware and the software are responsible. In fact, owing to imperfections in the production process, sensor chips show variation under the same conditions [35]. Also, the performance of devices differs across software platforms [18]; for example, APIs, resolutions, and other factors all influence the performance of sensors. A few works have developed deep learning models to address heterogeneity problems caused by different sensor instances. One notable work is data augmentation with GANs [93]. Data augmentation is a solution for enriching training sets so that both the size and the quality of the training sets meet the requirements of training a powerful deep learning model. A heterogeneity generator that synthesizes heterogeneous data from different sensor instances under various degrees of disturbance is developed in [93]. The aim is to replenish the training set with sufficient heterogeneity. Moreover, the authors deploy a heterogeneity pipeline with two parameters that control the heterogeneity of the training set. This approach tackles the challenge of heterogeneity with device instances.
Sensor Types and Positions.
In this section, we introduce the heterogeneity of sensory data caused by different sensor types and positions on human bodies, because these two factors usually appear together. Thanks to the pervasiveness of wearable sensors and IoT equipment, people can wear more than one smart device to assist their daily life. It is also common for users to replace their smart devices or buy new electronic products. Since some devices are based on the
same platforms (e.g., iPhone and Apple Watch), people prefer the activity recognition system to seamlessly recognize activities observed by the new device using models trained with the old devices. In terms of positions, devices are attached to different body positions according to their types. For example, a smartwatch should be attached to the user's wrist, while a smartphone can be put in a trouser pocket or a shirt pocket. It is obvious that different body positions of the devices will lead to tremendous changes in the collected signals, because the signals are stimulated by the motions of the corresponding body parts. Therefore, such changes raise two issues that urgently need to be considered to address the heterogeneity with sensor types and positions. First, massive data from the new sensors or new positions is required so that the new distribution can be estimated rather completely. Second, most of the existing works still characterize the old data and the new data with the same features, which is impractical when sensor types and positions are not fixed; for instance, in [72], the KL divergence is minimized between the parameters of CNNs trained on the old data and the new data, respectively. To address this issue, Akbari and Jafari [2] designed stochastic features that are not only discriminative for classification but also able to preserve the inherent structures of the sensory data. The stochastic feature extraction model is based on a generative autoencoder.
Wang et al. [155] further posed the question of how to select the best source positions for transfer when multiple source positions are available. This question is pragmatic since smart devices can be placed in diverse positions such as in the hand, in a pocket, or on the nose (e.g., goggles), and an inappropriate selection may lead to negative transfer. [43] proves that the similarity between domains in transfer learning is determinative, and [155] suggests that higher similarity indicates better transfer performance between two domains. Chen et al. [29] accordingly assumed that data samples of the same activities are aggregated in the distribution space even when they come from different sensors. They propose a stratified, class-wise distance to measure the distances between domains (see the sketch below). Wang et al. [155] proposed a semantic distance and a kinetic distance to measure domain distances, where the semantic distance involves spatial relationships between data collected from two positions, and the kinetic distance concerns the relationships of motion kinetic energy between two domains.
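A minimal, illustrative stand-in for such a class-wise distance is sketched below: it averages the per-activity distance between feature means of two domains, and a smaller value suggests a more suitable source position. The exact metrics of [29] and [155] differ in detail.

```python
# Class-wise ("stratified") domain distance sketch; an illustrative stand-in.
import numpy as np

def stratified_distance(feat_a, y_a, feat_b, y_b):
    """feat_*: (n, d) features; y_*: (n,) activity labels of two domains."""
    classes = np.intersect1d(np.unique(y_a), np.unique(y_b))
    dists = [np.linalg.norm(feat_a[y_a == c].mean(0) - feat_b[y_b == c].mean(0))
             for c in classes]
    return float(np.mean(dists))

# Pick the source position with the smallest class-wise distance to the target:
# best = min(sources, key=lambda s: stratified_distance(s.feat, s.y, tgt.feat, tgt.y))
```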
Sensor Layouts and Environments.
Sensor layouts concern device-free sensors such as WiFi and RFID. The signals collected by the receivers are usually considerably influenced by the layouts and the environments. The reason is that, during transmission, the signals are inevitably reflected, refracted, and diffracted by media and barriers such as air, glass, and walls; the spatial positions of the receivers also play a role. Despite the maturity of classification models for device-free activity recognition, very few works focus on how to obtain equally accurate recognition performance when sensors are configured in the wild. One example is [66], where an adversarial network is incorporated with deep feature extraction models to remove the environment-specific information and extract environment-independent features.
It should be noted that all the aforementioned methods need either labeled or unlabeled data from the target domain to update their models. In real activity recognition systems, a one-fits-all model that requires only one-time training and is general enough to fit all scenarios is indispensable. Zheng et al. [192] defined a new Body-coordinate Velocity Profile (BVP) to capture domain-independent features. The features represent power distributions over different velocities of the participating body parts and are unique to individual activities. The experimental results show that BVP is advantageous in cross-domain learning and fits all kinds of domain factors, including users, sensor types, and sensor layouts. One-fits-all is a new direction for researchers seeking to mitigate the heterogeneity problem in activity recognition.
In conclusion, we review three categories of heterogeneity in activity recognition, caused by different users, streaming time, and sensor deployments. They are further categorized
according to the extent of change or the main reason for the changes. Table 3 summarizes all the deep learning approaches for heterogeneity in activity recognition introduced in this section.
Table 3. Deep Learning Works for Heterogeneity in Activity Recognition
Heterogeneity | Subclass | References
User | - | Rokni et al. 2018 [124]; Chen et al. 2019 [26]; Soleimani and Nazerfard 2019 [139]; Woo et al. 2016 [158]; Matsui et al. 2017 [94]; Saeedi et al. 2017 [130]; Morales et al. 2016 [99]; Jiang et al. 2018 [66]; Khan et al. 2018 [72]; Zheng et al. 2019 [192]; Gjoreski et al. 2019 [43]
Time | Concept Drift | Tao et al. 2016 [143]; Gudur et al. 2019 [50]
Time | Concept Evolution | Nair et al. 2019 [102]
Time | Open-Set | Yang et al. 2019 [165]
Sensor | Instance | Mathur et al. 2018 [93]
Sensor | Type and Position | Morales et al. 2016 [99]; Khan et al. 2018 [72]
Sensor | Layout and Environment | Jiang et al. 2018 [66]; Zheng et al. 2019 [192]
Despite the success of applying a variety of deep learning models to recognizing human activities, the majority of existing research focuses on simple activities like walking, standing, and jogging, which are usually characterized by repeated actions or a single body posture. Simple activities are basic and thus possess lower-level semantics. In contrast, composite activities may contain a sequence of simple actions and have higher-level semantics, e.g., working, having dinner, and preparing coffee, which better reflect people's daily life. As a result, it is desirable to recognize more complicated, high-level human activities for most practical human-computer interaction scenarios. Since not only human body movements but also context information of the surrounding environments is required for composite activity recognition, it is a more challenging task than recognizing simple activities. In addition, designing effective experiments for collecting sensor data of composite activities is also challenging, requiring rich experience in using diverse sorts of sensors and in planning human-computer interaction applications. Therefore, composite activity recognition is much less explored than simple activity recognition.
Existing studies on composite activity recognition can be categorized into two streams. The first one mixes composite and simple activities and tries to create a unified model to recognize both kinds of activities. For example, experiments were designed in [150] to collect data of both simple and composite activities of in-home daily living. Although the authors used wrist-worn sensors, they could capture information about ambient environments, body movement, as well as human locations. There are twenty-two simple and composite activities attributed to four categories: 1) locomotive (e.g., walk indoors, run indoors); 2) semantic (e.g., clean utensils and cooking); 3) transitional (e.g., indoor to outdoor and walk upstairs); and 4) postural/relatively stationary (e.g., standing and lying in bed). A simple multi-layer feedforward neural network was created to recognize all the activities with a high average test accuracy of 90%. However, the results were obtained in the subject-dependent setting, where training and test samples come from the same subject, which limits the proposed method's adaptability.
The second strategy is to consider composite activities separately from simple ones and to further regard a composite activity as the combination of a series of simple activities. This hierarchical manner is more intuitive and attracts stronger research interest. However, applying deep learning
techniques in this area is still underexplored. One of the few deep learning works is [107], where the authors developed a multi-task learning approach to recognize both simple and composite activities simultaneously. Concretely, the authors divided a composite activity into multiple simple activities represented by a series of sequential sensor signal segments. The signal segments are first input into CNNs to extract representations of low-level activities, which are then loaded into a softmax classifier for recognizing simple activities. At the same time, the CNN-extracted features of all segments are fed into an LSTM network to exploit their correlations, which consequently yields a high-level semantic activity classification. In such a way, the prior knowledge that simple activities are the components of a composite activity is utilized through the shared deep feature extractor. Different from this joint learning manner, [32] infers a sequence of simple activities and its corresponding composite activity by using two conditional probabilistic models alternately. The authors used an estimated action sequence to infer the composite activity, where the temporal correlations of simple actions are extracted for the composite activity classification. In reverse, the predicted composite activity is utilized to help derive the simple activity sequence at the next time step. As a result, the predictions of the sequence of simple activities and of the composite activity are mutually updated based on each other during inference. The deep learning technique was used for feature extraction from raw signals. The experiment results showed increasing accuracy as a composite activity evolved. Even though these works have demonstrated promising solutions to recognizing composite activities, a major concern remains: properly cutting a raw time-series signal into segments of individual simple actions is the basis for success.
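The hierarchical multi-task design described for [107] can be sketched as follows; all layer sizes, the pooling choice, and the class counts are assumptions.

```python
# Multi-task sketch: a shared CNN encodes each signal segment; a per-segment
# head predicts the simple activity, while an LSTM over the segment features
# predicts the composite activity.
import torch
import torch.nn as nn

class CompositeHAR(nn.Module):
    def __init__(self, n_simple=10, n_composite=5):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(3, 32, 5), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.simple_head = nn.Linear(32, n_simple)
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.composite_head = nn.Linear(64, n_composite)

    def forward(self, segments):            # segments: (B, S, C=3, T)
        B, S, C, T = segments.shape
        feats = self.cnn(segments.reshape(B * S, C, T)).reshape(B, S, 32)
        simple_logits = self.simple_head(feats)          # (B, S, n_simple)
        _, (h, _) = self.lstm(feats)                     # correlations across segments
        composite_logits = self.composite_head(h[-1])    # (B, n_composite)
        return simple_logits, composite_logits
```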
As original sensor data is represented by continuously streaming signals, a fixed-size window is usually used to partition raw sensor data sequences into segments that are input into a model for activity recognition. This is essential to overcome the limitation that a sample at a single time step cannot provide adequate information about an activity. Ideally, one partitioned data segment contains only one activity, and thus a model predicts a single label for all the samples within a single window. However, the samples in one window may not always share the same label when an activity transition occurs in the middle of the window. Therefore, an optimal segmentation approach is critical to increasing activity recognition accuracy.
An intuitive manner is to attempt various fixed window sizes empirically. Nevertheless, although a larger window size provides richer information, it increases the possibility that a transition occurs in the middle of windows; on the contrary, a smaller window size cannot afford enough information. In light of this issue, [3] reports a hierarchical signal segmentation method, which initially uses a large window size and gradually narrows down the segmentation until only one activity is in a sub-window. The narrowing criterion is that two consecutive windows have different labels or the confidence of the classifier is less than a threshold. Different from the hierarchical framework, some researchers explored directly assigning a label to each time step instead of predicting a window as a whole [168, 190]. Inspired by semantic segmentation in the computer vision community, the authors employed fully convolutional networks (FCNs) to achieve this goal. In an FCN, data from a large window is input, and a 1D CNN layer replaces the final fully-connected softmax layer, where the length of the feature map equals the number of time steps and the number of feature maps equals the number of activity classes, so as to predict a label for each time step. Therefore, the FCN can use not only the information of the corresponding time step itself but also the information of its neighboring time steps.
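A minimal sketch of such a dense-labeling FCN is given below: the final fully-connected softmax layer is replaced by a 1D convolution so that every time step receives its own activity logits. Layer sizes and padding are assumptions.

```python
# Dense (per-time-step) labeling sketch; 'same' padding keeps the label
# sequence aligned with the input window.
import torch
import torch.nn as nn

n_classes = 6
fcn = nn.Sequential(
    nn.Conv1d(3, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(64, n_classes, kernel_size=1),   # one logit vector per time step
)

x = torch.randn(8, 3, 256)            # batch of windows: (B, channels, T)
logits = fcn(x)                       # (B, n_classes, T): a label for every step
per_step_pred = logits.argmax(dim=1)  # (B, T)
```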
In [148], Varamin et al. designed a multi-label architecture to simultaneously predict the number of ongoing activities and the occurrence probability of each alternative activity within a window. Using the optimal parameters learned from the training dataset, a Maximum A Posteriori (MAP) inference was adopted to output the most likely activity set by combining the multi-label outputs.
In real-world scenarios, in addition to performing activities one after another in a sequential fashion, a person may carry out more than one activity at the same time, which is called concurrent activities. For instance, one may make a phone call while watching TV. From the angle of sensor signals, a piece of data may correspond to multiple ground-truth labels. Therefore, concurrent activity recognition can be abstracted as a multi-label task. Note that the concurrent activities are executed by a single subject.
Zhang et al. [189] designed an individual fully-connected network for each candidate activity on top of shared multimodal fusion features. The final decision-making layer classified each activity independently by independent softmax layers. A key drawback of this kind of structure is that the computational cost increases considerably as the number of activities rises. To resolve this issue, the authors further proposed to use a single neuron with the sigmoid activation to make a binary classification (performed or not) for each activity [87]. Okita and Inoue [105] also targeted concurrent activity recognition and suggested a multi-layer LSTM framework that classifies each activity from each LSTM layer. The pace of exploring deep learning methods for concurrent activity recognition is still slow, and there is large room for improvement.
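Casting concurrent activities as a multi-label task amounts to one sigmoid output per activity trained with a binary cross-entropy loss, as in the sketch below; the network and the 0.5 decision threshold are assumptions.

```python
# Concurrent activities as multi-label classification: each activity is its
# own independent binary (performed or not) decision.
import torch
import torch.nn as nn

n_activities = 8
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 128, 128), nn.ReLU(),
                      nn.Linear(128, n_activities))     # one logit per activity
criterion = nn.BCEWithLogitsLoss()

x = torch.randn(16, 3, 128)                             # a batch of windows
y = torch.randint(0, 2, (16, n_activities)).float()     # multi-hot ground truth
logits = model(x)
loss = criterion(logits, y)
performed = torch.sigmoid(logits) > 0.5                 # per-activity decisions
```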
Most state-of-the-art human activity recognition research focuses on monitoring and assisting people in single-occupant settings. Nevertheless, living and working spaces are usually occupied by multiple subjects; hence, designing solutions for handling multiple occupants is of notable practical significance. There are mainly two types of multi-occupant activities: parallel activities, where occupants perform activities individually, such as one occupant eating while the other watches TV, and collaborative activities, where multiple occupants collaborate to perform the same activity, such as two subjects playing table tennis [16]. For parallel activity recognition, when only wearable sensors are used, the task can be divided into multiple single-occupant activity recognition tasks and solved by conventional solutions; when ambient or object sensors are used, data association, i.e., mapping the sensed signals to the occupant who actually causes the generation of the data, becomes the major challenge, which gets more serious as the number of occupants in the space increases. The problem of data association is crucial to the multi-occupant scenario: without it, the data would be useless and could even endanger the life of residents in telehealth applications. For collaborative activities, human interactions and instruments are generally involved; thus, context and object-use information play vital roles in designing recognition solutions. Although multi-occupant activity recognition is of great importance, its deep learning-based research is still limited.
In [127], both wearable and ambient sensors were used to recognize group activities of two occupants. The ambient sensors were leveraged for extracting context information represented by disparate functional indoor areas. The sensor data of different occupants was input into different RBMs separately and then merged into a sequential network, a DBN, and an MLP for the inference of the group activity. Accuracy of nearly 100% was achieved; however, the target scenarios were constrained to two occupants performing the same activity together. In contrast, Tran et al. [146] did not restrict the occupants to acting together and aimed at recognizing activities for each occupant separately. A multi-label RNN was created with each RNN cell responsible for the activity recognition of one occupant. Nevertheless, the authors only used ambient sensors and did not propose a specific solution to the data association issue.
Although deep learning models have shown dominant accuracy in the sensor-based human activity recognition community, they are typically resource-intensive. For example, the early DCNN architecture AlexNet [74], which has five CNN layers and three fully-connected layers, holds 61M parameters (249MB of memory) and performs 1.5B high-precision operations to make a prediction. For non-portable applications, Graphics Processing Units (GPUs) are usually leveraged to accelerate computation. However, GPUs are expensive and power-hungry and thus not suitable for real-time applications on mobile devices. Moreover, current research has demonstrated that making a neural network deeper by introducing additional layers and nodes is a critical approach to improving model performance, which inevitably increases computational complexity. Therefore, it is essential yet challenging to resolve the issue of high computation cost so as to realize real-time and reliable human activity recognition on mobile devices with deep learning models.
Considering that deep neural networks are more effective in feature extraction than shallow ones, a combination of hand-crafted and deep features is a potential solution to lowering computation cost. In [119], the authors incorporated spectrogram features with only one CNN layer and two fully-connected layers for human activity recognition. The hybrid architecture showed recognition accuracy comparable to state-of-the-art methods through evaluation on four benchmark datasets. To validate the feasibility of real-time usage, the authors implemented the proposed method on three different mobile platforms, including two smartphones and one on-node unit. The results revealed computation times of milliseconds to tens of milliseconds per prediction, suggesting the possibility of real-time applications. [110] also demonstrates that the combination of hand-crafted features and a neural network is a viable plan for achieving real-time activity recognition on a mobile device. In addition to the cascade structure of hand-crafted and deep learning features, [118] proposed to arrange the deep learning features and hand-crafted features in parallel before feeding them into a fully-connected classifier. This structure could increase recognition accuracy with only a small increase in computational consumption.
Optimizing basic neural network cells and structures is another intuitive scheme for decreasing computational complexity. In [151], Vu et al. used a self-gated recurrent neural network (SGRNN) cell to not only reduce the complexity of a standard LSTM but also prevent the gradient vanishing problem. Their experiments displayed computation efficiency superior to LSTM and GRU in terms of running time and model size. However, the running time was still in the order of hundreds of milliseconds, and no real-world evaluation on mobile devices was carried out to demonstrate a possible real-time implementation. For CNN-based methods, reducing the filter size is an effective means to reduce the model size and thus optimize the memory consumption and the number of computation operations. For example, [118] utilized 1D-CNNs instead of 2D-CNNs to control the model size. A more insightful strategy for dealing with both the storage and computational problems is network quantization [58]. This scheme constrains the weights and the outputs of activation functions to only two values (e.g., -1, +1) instead of continuous numbers.
Network quantization brings three major benefits to resource cost: 1) the memory usage and model size are greatly reduced compared to full-precision networks; 2) bitwise operations are considerably more efficient than conventional floating- or fixed-point arithmetic; 3) if bitwise operations are used, most multiply-accumulate operations (which require at least hundreds of logic gates) can be replaced by popcount-XNOR operations (which require only a single logic gate), which are especially well suited for FPGAs and ASICs [166]. In [166], Yang et al. explored a 2-bit CNN with weights and activations constrained to {-0.5, 0, 0.5} for efficient activity recognition. The proposed model achieved a favorable tradeoff, with recognition accuracy close to that of the full-precision counterpart together with considerable acceleration on CPUs and memory savings. Edel and Köppe [38] also studied network quantization for constructing a lightweight and fast deep learning model. Their Binarized-Bidirectional LSTM network obtained recognition accuracy only 2% lower, but saved 75% of the computation time, compared to its full-precision counterpart.
The main application of human activity recognition is to monitor human behaviors, so the sensors need to capture the activities of a user continuously. Since the way an activity is carried out varies among users (due to age, gender, weight, and so on), it is possible for an adversary to infer sensitive user information such as age from the time-series sensor data. Specifically, the black-box characteristic of deep learning may risk revealing user-discriminative features unintentionally. In [65], the authors investigated the privacy issue of using CNN features for human activity recognition. Their empirical studies revealed that although a CNN is trained with a cross-entropy loss targeting only activity classification, the obtained CNN features still show powerful user-discriminative ability: a simple logistic regressor achieved a high user-classification accuracy of 84.7% when using the CNN features extracted for activity recognition, while the same classifier obtained only 35.2% user-classification accuracy on raw sensor data. Therefore, it is essential to address the privacy leakage potential of a deep learning model originally built for human activity recognition.
To address this concern, some researchers explored utilizing an adversarial loss function to minimize the discriminative accuracy for specific private information during the training process. For example, Iwasawa et al. [65] proposed to integrate an adversarial loss with the standard activity classification loss to minimize the user identification accuracy. The authors of [92] and [91] adopted a similar idea to prevent privacy leakage; their experiment results show an effective reduction of the inference accuracy for sensitive information. However, an adversarial loss function can only protect one kind of private information, such as user identity or gender. In addition, the adversarial loss complicates the end-to-end training process, making it hard to converge stably. Considering this gap, [179] borrowed the idea of image style transformation from the computer vision community to protect all private information at once. The authors creatively viewed raw sensor signals from two aspects: a "style" aspect that describes how a user performs an activity and is influenced by the user's identity information such as age, weight, gender, and height; and a "content" aspect that describes what activity the user performs. They proposed to transform raw sensor data so that the "content" is unchanged but the "style" resembles random noise. Therefore, the method has the potential to protect all sensitive information at once. Besides the data transformation strategy, data perturbation is another popular approach to resolving the privacy issue. For example, Lyu et al. proposed to tailor two kinds of data perturbation mechanisms, Random Projection and repeated Gompertz, to achieve a better tradeoff between privacy and recognition accuracy [89]. Recently, differential privacy has gained increasing research attention due to its strong theoretical privacy guarantee. Phan et al. [109] proposed to perturb the objective functions of the traditional deep auto-encoder to enforce ϵ-differential privacy.
In addition to the privacy preservation in the feature extraction layers, an ϵ-differential-privacy-preserving softmax layer was also developed for either classification or prediction. Different from the above approaches, this method provides theoretical privacy guarantees and error bounds.
Sensory data for human activity is high-dimensional and unreadable. A data sample may include diverse modalities (e.g., acceleration, angular velocity) from multiple positions (e.g., wrist, ankle) of the subject's body in a time window. However, only a few modalities from specific positions
contribute to identifying certain activities [76]. For example, lying is distinguishable when people are horizontal (magnetism), and ascending stairs can be recognized by the forward and upward acceleration of people's ankles. Unrelated modalities can introduce noise and deteriorate the recognition performance. Moreover, the significance of modalities changes over time. For instance, in a Parkinson's disease detection system, anomalies appear in the gait only during short periods rather than over the entire time window [175]. Intuitively, the modalities show greater significance when the corresponding parts of the body are actively moving.
Despite the success of deep learning in activity recognition, the inner mechanisms of deep learning networks still remain unrevealed. Considering the varying salience of modalities and time intervals, it is necessary to interpret the neural networks to explore the factors behind the models' decisions. For example, when a deep learning model identifies that the user is walking, we want to know which modality from which time interval is the determinant. Therefore, the interpretability of deep learning methods has become a new trend in human activity recognition.
The basic idea of interpretable deep learning methods is to automatically decide the importance of each part of the input data, and to achieve high accuracy by omitting the unimportant parts and focusing on the salient parts. In fact, standard fully connected layers already possess such a capacity, as they automatically reduce the weights of less important neurons during training. Li et al. [85] therefore proposed to use additional pooling layers to remove neurons with lower weights. However, this is largely insufficient since deep models may still encode some noise such as irrelevant modalities [175]. Some researchers [19, 163] visualized the features extracted by neural networks; salient features are sent to the subsequent models after the authors identify their relationships to the activities from the visualization [163]. Nutter et al. [103] transformed sensory data into images so that visualization tools can be applied to the sensory data for more direct interpretability.
The attention mechanism has recently become popular in deep learning. Attention is originally a concept in biology and psychology that illustrates how we restrict our attention to something crucial for better cognitive results. Inspired by this, researchers apply neural attention mechanisms to deep learning to give neural networks the capability of concentrating on the subset of inputs that really matters. Since the principle of deep attention models is to weigh input components, components with higher weights are assumed to be more tightly related to the recognition task and to show greater influence over the models' decisions [133]. Some works employed attention mechanisms to interpret deep model behaviors [178, 181, 183]. Back to human activity recognition, the attention mechanism not only highlights the most distinguishable modalities and time intervals but also informs us of the modalities and body parts that contribute most to specific activities. Deep attention approaches can be categorized into soft attention and hard attention based on their differentiability.
Soft Attention.
In machine learning, "soft" means differentiable. Soft attention assigns a weight from 0 to 1 to each element of the inputs, deciding how much attention to pay to each element. Soft attention uses softmax functions in the attention layers to compute the weights, so the whole model is a fully differentiable deterministic mechanism where gradients can be propagated to other parts of the network as well as through the soft attention mechanism [180]. Attention layers are inserted into sequence-to-sequence LSTMs for feature extraction in [142]. They are also inserted into neural networks to tune the weights of all samples [101] in sliding windows, since samples at different time points contribute differently to activity recognition. Shen et al. [135] also considered the temporal context: they designed a segment-level attention approach to decide which time segments contain more information. Combined with a gated CNN, the segment-level attention better extracts temporal dependencies. Zeng et al. [175] developed attention mechanisms from two perspectives: they first apply sensor attention to the inputs to extract the salient sensory modalities and then apply temporal attention to an LSTM to filter out the inactive data segments. Spatial and temporal
attention mechanisms are also employed in [90]. In particular, the spatial dependencies are extracted by fusing the modalities with self-attention.
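A minimal soft temporal attention module is sketched below: a softmax over per-step scores yields differentiable weights, and the weights themselves can be inspected for interpretability. The scoring function and sizes are assumptions.

```python
import torch
import torch.nn as nn

class SoftTemporalAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(d, 1)          # one scalar score per time step

    def forward(self, h):                     # h: (B, T, d) step-wise features
        w = torch.softmax(self.score(h).squeeze(-1), dim=1)   # weights in [0, 1]
        context = (w.unsqueeze(-1) * h).sum(dim=1)            # (B, d)
        return context, w   # w can be inspected for interpretability

h = torch.randn(4, 50, 64)                    # e.g., LSTM outputs over 50 steps
context, weights = SoftTemporalAttention(64)(h)
```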
Hard Attention.
Hard attention determines whether or not to attend to a part of the inputs. The weight it assigns to an input part is either 0 or 1, so the problem is non-differentiable. The process involves making a sequence of selections about which part to attend to. For example, the model attends to a part of the input to obtain information and decides where to attend in the next step based on the known information. The selection can be output by a neural network. However, since there is no ground truth indicating the correct selection policy, hard attention has to be represented as a stochastic process. This is where deep reinforcement learning comes in: it tackles the selection problems in deep learning and allows the models to propagate gradients in the space of selection policies [184]. With deep reinforcement learning, hard attention can be trained with softmax-based policies and standard gradient descent with backpropagation.
Different reinforcement learning techniques can be applied to hard attention mechanisms in human activity recognition. Zhang et al. [187] used dueling deep Q networks as the core of hard attention to focus on the salient parts of multimodal sensory data. Chen et al. [25, 28] mined important modalities and elided undesirable features with policy gradients. The attention is embedded into an LSTM to make selections step by step, because the LSTM incrementally learns information within an episode. Chen et al. [27] further considered the intrinsic relations between activities and the sub-motions of human body parts. They employ multiple agents to concentrate on the modalities related to sub-motions, and the agents coordinate to portray the activities. The visualization of the selected modalities and body parts validates that the attention mechanism provides insights into how sensory data elements affect the models' predictions of activities.
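A minimal sketch of policy-gradient-based hard selection is given below: a policy samples a binary keep/drop mask over modalities and is updated with the REINFORCE estimator using the downstream classifier's reward. The policy network, the reward definition, and the modality count are assumptions.

```python
import torch
import torch.nn as nn

N_MODALITIES = 9
policy = nn.Sequential(nn.Linear(N_MODALITIES, N_MODALITIES), nn.Sigmoid())

def reinforce_step(features, reward_fn, optimizer):
    """features: (B, N_MODALITIES) per-modality summaries; reward_fn maps the
    masked features to a scalar reward (e.g., minus the classification loss)."""
    probs = policy(features)                      # keep-probability per modality
    dist = torch.distributions.Bernoulli(probs)
    mask = dist.sample()                          # hard, non-differentiable selection
    reward = reward_fn(features * mask).detach()  # reward from the downstream model
    loss = -(dist.log_prob(mask).sum(dim=1) * reward).mean()  # REINFORCE estimator
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```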
Table 4. Deep Learning Works for Interpretability of Deep Learning Models in Sensory Data
Method | Subclass | References
Traditional | - | Li et al. 2015 [85]; Nutter et al. 2018 [103]; Xue et al. 2018 [163]; Brophy et al. 2018 [19]
Attention Mechanism | Soft Attention | Tang et al. 2016 [142]; Murahari and Plötz 2018 [101]; Zeng et al. 2018 [175]; Shen et al. 2018 [135]; Ma et al. 2019 [90]
Attention Mechanism | Hard Attention | Zhang et al. 2018 [187]; Chen et al. 2019 [28]; Chen et al. 2018 [25]; Chen et al. 2019 [27]
To develop the full potential of deep learning in human activity recognition, some future research directions are worthy of further investigation. Future directions can be stimulated by the challenges summarized in this work. Despite the effort devoted to these challenges, some of them are still not fully explored, such as class imbalance, composite activities, and concurrent activities. Although current research works still lack comprehensive and reliable solutions for the challenges, they lay concrete foundations and provide guidance for future directions. Moreover, there are other research directions that have rarely been explored before. We outline several key research directions that urgently need to be exploited as follows.
• Independent unsupervised methods.
Human activity recognition needs a sufficient amount of annotated samples to train deep learning models. Unsupervised learning can help mitigate this requirement. So far, deep unsupervised models used for human activity recognition are mainly used for extracting features but are not able to identify activities because there is no
ground truth. Therefore, one potential way for unsupervised learning to infer true labels is to seek other knowledge, which leads us to a popular method, deep unsupervised transfer learning [15]. Another way is to resort to knowledge-driven methods such as ontology [122].
• Identifying new activities.
Identifying novel activities that have never been seen by the models is a big challenge in human activity recognition. A reliable model should be able to learn the new knowledge online and achieve accurate recognition without any ground truth. A promising way is to learn features that are scalable to diverse activities. While [102] enlightens us that mid-level attributes can be used to depict activities with a set of characteristics, disentangled features [145] may be another serviceable solution for representing novel activities.
• Future activity prediction.
Future activity prediction is an extension of activity recognition. Unlike activity recognition, an activity prediction system can forecast users' behaviors in advance. Such a prediction system is useful for detecting human intention, so it can be applied to smart services, crime detection, and driver behavior prediction. In many common behavioral tasks, the activities occur in a certain order. Therefore, modeling the temporal dependencies across activities is beneficial for predicting future activities.
LSTMs [9] are suitable for such tasks, but for long-span activities, LSTMs cannot capture such long dependencies. In this case, intention recognition based on brain signals [185] can help inspire activity prediction.
• A standardization of the state-of-the-art.
While hundreds of works have investigated deep learning for sensor-based human activity recognition, there is no standardization of the state-of-the-art for fair comparison. The experiment settings and evaluation metrics for assessing the performance of activity recognition vary from paper to paper. As deep learning relies heavily on the training data, the division into training/test/validation sets influences the recognition results. Other factors, including data processing and the implementation platforms, also lead to skewed comparisons. Therefore, a mature standardization for all researchers is pressing. It is noteworthy that some other areas do not suffer from this issue; for example, the ImageNet Challenge [129] meticulously defines the details of the experiment setting to ensure impartial comparison. Jordao et al. [68] implemented and evaluated a set of existing works with standardized settings, but there is still no rigorous and well-recognized standardization in the field of human activity recognition.
This work aims at providing a rough guideline for novices and experienced researchers who are interested in deep learning methods for sensor-based human activity recognition. We have presented a comprehensive survey of the current deep learning methods for sensor-based human activity recognition. We first introduced the multi-modality of the sensory data, the available public datasets, and their extensive utilization in different challenges. We then summarized the challenges in human activity recognition according to their causes and analyzed how existing deep methods are adapted to address them. At the end of this work, we discussed the open issues and provided some insights for future directions.
REFERENCES
[1] Zahraa S Abdallah, Mohamed Medhat Gaber, Bala Srinivasan, and Shonali Krishnaswamy. 2018. Activity recognition with evolving data streams: A review. ACM Computing Surveys (CSUR) 51, 4 (2018), 71.
[2] Ali Akbari and Roozbeh Jafari. 2019. Transferring activity recognition models for new wearable sensors with deep generative domain adaptation. In Proceedings of the 18th International Conference on Information Processing in Sensor Networks. ACM, 85–96.
[3] Ali Akbari, Jian Wu, Reese Grimsley, and Roozbeh Jafari. 2018. Hierarchical signal segmentation and classification for accurate activity recognition. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers. ACM, 1596–1605.
[4] Hande Alemdar, Halil Ertan, Ozlem Durmaz Incel, and Cem Ersoy. 2013. ARAS human activity datasets in multiple homes with multiple residents. In Proceedings of the 7th International Conference on Pervasive Computing Technologies for Healthcare. ICST, 232–235.
[5] Kamran Ali, Alex X Liu, Wei Wang, and Muhammad Shahzad. 2015. Keystroke recognition using WiFi signals. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking. ACM, 90–102.
[6] Mohammad Abu Alsheikh, Ahmed Selim, Dusit Niyato, Linda Doyle, Shaowei Lin, and Hwee-Pink Tan. 2016. Deep activity recognition models with triaxial accelerometers. In Workshops at the Thirtieth AAAI Conference on Artificial Intelligence.
[7] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. 2013. A public domain dataset for human activity recognition using smartphones. In ESANN.
[8] Sina Mokhtarzadeh Azar, Mina Ghadimi Atigh, Ahmad Nickabadi, and Alexandre Alahi. 2019. Convolutional Relational Machine for Group Activity Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7892–7901.
[9] Moez Baccouche, Franck Mamalet, Christian Wolf, Christophe Garcia, and Atilla Baskurt. 2010. Action classification in soccer videos with long short-term memory recurrent neural networks. In International Conference on Artificial Neural Networks. Springer, 154–159.
[10] Marc Bachlin, Meir Plotnik, Daniel Roggen, Inbal Maidan, Jeffrey M Hausdorff, Nir Giladi, and Gerhard Troster. 2010. Wearable assistant for Parkinson's disease patients with the freezing of gait symptom. IEEE Transactions on Information Technology in Biomedicine 14, 2 (2010), 436–446.
[11] Lu Bai, Chris Yeung, Christos Efstratiou, and Moyra Chikomo. 2019. Motion2Vector: unsupervised learning in human activity recognition using wrist-sensing data. In Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers. ACM, 537–542.
[12] Donald S Baim, Wilson S Colucci, E Scott Monrad, Harton S Smith, Richard F Wright, Alyce Lanoue, Diane F Gauthier, Bernard J Ransil, William Grossman, and Eugene Braunwald. 1986. Survival of patients with severe congestive heart failure treated with oral milrinone. Journal of the American College of Cardiology 7, 3 (1986), 661–670.
[13] Oresti Banos, Rafael Garcia, Juan A Holgado-Terriza, Miguel Damas, Hector Pomares, Ignacio Rojas, Alejandro Saez, and Claudia Villalonga. 2014. mHealthDroid: a novel framework for agile development of mobile health applications. In International Workshop on Ambient Assisted Living. Springer, 91–98.
[14] Billur Barshan and Murat Cihan Yüksek. 2014. Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units. Comput. J. 57, 11 (2014), 1649–1667.
[15] Yoshua Bengio. 2012. Deep learning of representations for unsupervised and transfer learning. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning. 17–36.
[16] Asma Benmansour, Abdelhamid Bouchachia, and Mohammed Feham. 2015. Multioccupant activity recognition in pervasive smart home environments. ACM Computing Surveys (CSUR) 48, 3 (2015), 1–36.
[17] Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory. ACM, 92–100.
[18] Henrik Blunck, Niels Olof Bouvin, Tobias Franke, Kaj Grønbæk, Mikkel B Kjaergaard, Paul Lukowicz, and Markus Wüstenberg. 2013. On heterogeneity in mobile sensing applications aiming at representative data collection. In Proceedings of the 2013 ACM Conference on Pervasive and Ubiquitous Computing Adjunct Publication. ACM, 1087–1098.
[19] Eoin Brophy, José Juan Dominguez Veiga, Zhengwei Wang, Alan F Smeaton, and Tomas E Ward. 2018. An Interpretable Machine Vision Approach to Human Activity Recognition using Photoplethysmograph Sensor Data. arXiv preprint arXiv:1812.00668 (2018).
[20] Michael Buettner, Richa Prasad, Matthai Philipose, and David Wetherall. 2009. Recognizing daily activities with RFID-based sensors. In
Proceedings of the 11th international conference on Ubiquitous computing . ACM, 51–60.[21] Andreas Bulling, Ulf Blanke, and Bernt Schiele. 2014. A tutorial on human activity recognition using body-worninertial sensors.
ACM Computing Surveys (CSUR)
46, 3 (2014), 33.[22] Andreas Bulling, Ulf Blanke, and Bernt Schiele. 2014. A Tutorial on Human Activity Recognition Using Body-wornInertial Sensors.
Comput. Surveys
46, 3 (2014), 33:1–33:33. https://doi.org/10.1145/2499621[23] Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, Sundara Tejaswi Digumarti, Gerhard Tröster, José del R Millán,and Daniel Roggen. 2013. The Opportunity challenge: A benchmark database for on-body sensor-based activityrecognition.
Pattern Recognition Letters
34, 15 (2013), 2033–2042.[24] Chen Chen, Roozbeh Jafari, and Nasser Kehtarnavaz. 2015. UTD-MHAD: A multimodal dataset for human actionrecognition utilizing a depth camera and a wearable inertial sensor. In . IEEE, 168–172.[25] Kaixuan Chen, Lina Yao, Xianzhi Wang, Dalin Zhang, Tao Gu, Zhiwen Yu, and Zheng Yang. 2018. Interpretable parallelrecurrent neural networks with convolutional attentions for multi-modality activity modeling. In . IEEE, 1–8.J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018. eep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities 111:31 [26] Kaixuan Chen, Lina Yao, Dalin Zhang, Xiaojun Chang, Guodong Long, and Sen Wang. 2019. Distributionally RobustSemi-Supervised Learning for People-Centric Sensing. In
The Thirty-Third AAAI Conference on Artificial Intelligence,AAAI, Honolulu, Hawaii USA, January 27âĂŞFebruary 1, 2019 . 3321–3328.[27] Kaixuan Chen, Lina Yao, Dalin Zhang, Bin Guo, and Zhiwen Yu. 2019. Multi-agent Attentional Activity Recognition.In
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI, Macao, China, August10-16, 2019 . 1344–1350.[28] Kaixuan Chen, Lina Yao, Dalin Zhang, Xianzhi Wang, Xiaojun Chang, and Feiping Nie. 2019. A semisupervisedrecurrent convolutional attention model for human activity recognition.
IEEE transactions on neural networks andlearning systems (2019).[29] Yiqiang Chen, Jindong Wang, Meiyu Huang, and Han Yu. 2019. Cross-position activity recognition with stratifiedtransfer learning.
Pervasive and Mobile Computing
57 (2019), 1–13.[30] Yuwen Chen, Kunhua Zhong, Ju Zhang, Qilong Sun, and Xueliang Zhao. 2016. LSTM networks for mobile humanactivity recognition. In . AtlantisPress.[31] Jingyuan Cheng, Mathias Sundholm, Bo Zhou, Marco Hirsch, and Paul Lukowicz. 2016. Smart-surface: Large scaletextile pressure sensors arrays for activity recognition.
Pervasive and Mobile Computing
30 (2016), 97–112.[32] Weihao Cheng, Sarah M Erfani, Rui Zhang, and Ramamohanarao Kotagiri. 2018. Predicting Complex Activitiesfrom Ongoing Multivariate Time Series.. In
Twenty-Seventh International Joint Conference on Artificial Intelligence .3322–3328.[33] Belkacem Chikhaoui and Frank Gouineau. 2017. Towards automatic feature extraction for activity recognition fromwearable sensors: a deep learning approach. In . IEEE, 693–702.[34] Jun-Ho Choi and Jong-Seok Lee. 2018. Confidence-based deep multimodal fusion for activity recognition. In
Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive andUbiquitous Computing and Wearable Computers . ACM, 1548–1556.[35] Sanorita Dey, Nirupam Roy, Wenyuan Xu, Romit Roy Choudhury, and Srihari Nelakuditi. 2014. AccelPrint: Imperfec-tions of Accelerometers Make Smartphones Trackable.. In
NDSS .[36] Mingtao Dong, Jindong Han, Yuan He, and Xiaojun Jing. 2018. HAR-Net: Fusing Deep Representation and Hand-Crafted Features for Human Activity Recognition. In
International Conference On Signal And Information Processing,Networking And Computers . Springer, 32–40.[37] Stefan Duffner, Samuel Berlemont, Grégoire Lefebvre, and Christophe Garcia. 2014. 3D gesture classification withconvolutional neural networks. In . IEEE, 5432–5436.[38] Marcus Edel and Enrico Köppe. 2016. Binarized-blstm-rnn based human activity recognition. In . IEEE, 1–7.[39] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010.Why does unsupervised pre-training help deep learning?
Journal of Machine Learning Research
11, Feb (2010),625–660.[40] Xiaoyi Fan, Wei Gong, and Jiangchuan Liu. 2018. TagFree Activity Identification with RFIDs.
Proceedings of the ACMon Interactive, Mobile, Wearable and Ubiquitous Technologies
2, 1 (2018), 7.[41] Kenneth P Fishkin, Matthai Philipose, and Adam Rea. 2005. Hands-on RFID: Wireless wearables for detecting use ofobjects. In
Ninth IEEE International Symposium on Wearable Computers (ISWC’05) . IEEE, 38–41.[42] Nicholas Foubert, Anita M McKee, Rafik A Goubran, and Frank Knoefel. 2012. Lying and sitting posture recognitionand transition detection using a pressure sensor array. In . IEEE, 1–6.[43] Martin Gjoreski, Stefan Kalabakov, Mitja Luštrek, and Hristijan Gjoreski. 2019. Cross-dataset deep transfer learningfor activity recognition. In
Proceedings of the 2019 ACM International Joint Conference on Pervasive and UbiquitousComputing and Proceedings of the 2019 ACM International Symposium on Wearable Computers . ACM, 714–718.[44] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, andYoshua Bengio. 2014. Generative adversarial nets. In
Advances in neural information processing systems . 2672–2680.[45] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. 2016. LSTM: A searchspace odyssey.
IEEE transactions on neural networks and learning systems
28, 10 (2016), 2222–2232.[46] Rene Grzeszick, Jan Marius Lenk, Fernando Moya Rueda, Gernot A Fink, Sascha Feldhorst, and Michael ten Hompel.2017. Deep neural network based human activity recognition for the order picking process. In
Proceedings of the 4thinternational Workshop on Sensor-based Activity Recognition and Interaction . ACM, 14.[47] Fuqiang Gu, Kourosh Khoshelham, Shahrokh Valaee, Jianga Shang, and Rui Zhang. 2018. Locomotion activityrecognition using stacked denoising autoencoders.
IEEE Internet of Things Journal
5, 3 (2018), 2085–2093.J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018. [48] Yu Gu, Lianghu Quan, and Fuji Ren. 2014. Wifi-assisted human activity recognition. In . IEEE, 60–65.[49] Yu Guan and Thomas Plötz. 2017. Ensembles of deep lstm learners for activity recognition using wearables.
Proceedingsof the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
1, 2 (2017), 11.[50] Gautham Krishna Gudur, Prahalathan Sundaramoorthy, and Venkatesh Umaashankar. 2019. ActiveHARNet: TowardsOn-Device Deep Bayesian Active Learning for Human Activity Recognition. arXiv preprint arXiv:1906.00108 (2019).[51] Abdu Gumaei, Mohammad Mehedi Hassan, Abdulhameed Alelaiwi, and Hussain Alsalman. 2019. A hybrid deeplearning model for human activity recognition using multimodal body sensing data.
IEEE Access
Proceedings of the 2016 ACM International JointConference on Pervasive and Ubiquitous Computing . ACM, 1112–1123.[53] Quang-Do Ha and Minh-Triet Tran. 2017. Activity Recognition from Inertial Sensors with Convolutional NeuralNetworks. In
International Conference on Future Data and Security Engineering . Springer, 285–298.[54] Sojeong Ha and Seungjin Choi. 2016. Convolutional neural networks for human activity recognition using multipleaccelerometer and gyroscope sensors. In . IEEE,381–388.[55] Sojeong Ha, Jeong-Min Yun, and Seungjin Choi. 2015. Multi-modal convolutional neural networks for activityrecognition. In . IEEE, 3017–3022.[56] Nils Yannick Hammerla, James Fisher, Peter Andras, Lynn Rochester, Richard Walker, and Thomas Plötz. 2015. PDdisease state assessment in naturalistic environments using deep learning. In
Twenty-Ninth AAAI Conference onArtificial Intelligence .[57] Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. 2016. Deep, Convolutional, and Recurrent Models for HumanActivity Recognition Using Wearables. In
Twenty-Fifth International Joint Conference on Artificial Intelligence . 1533–1540.[58] Song Han, Huizi Mao, and William J. Dally. 2016. Deep Compression: Compressing Deep Neural Networks withPruning, Trained Quantization and Huffman Coding. In
International Conference on Learning Representation .[59] HM Hossain, MD Al Haiz Khan, and Nirmalya Roy. 2018. DeActive: scaling activity recognition with active deeplearning.
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
2, 2 (2018), 66.[60] HM Hossain and Nirmalya Roy. 2019. Active Deep Learning for Activity Recognition with Context Aware AnnotatorSelection. In
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining .ACM, 1862–1870.[61] Tâm Huynh, Mario Fritz, and Bernt Schiele. 2008. Discovery of activity patterns using topic models. In
UbiComp ,Vol. 8. 10–19.[62] Tâm Huynh and Bernt Schiele. 2005. Analyzing features for activity recognition. In
Proceedings of the 2005 jointconference on Smart objects and ambient intelligence: innovative context-aware services: usages and technologies . ACM,159–163.[63] Shoya Ishimaru, Kensuke Hoshika, Kai Kunze, Koichi Kise, and Andreas Dengel. 2017. Towards reading trackersin the wild: detecting reading activities by EOG glasses and deep neural networks. In
Proceedings of the 2017 ACMInternational Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2017 ACM InternationalSymposium on Wearable Computers . ACM, 704–711.[64] Chihiro Ito, Xin Cao, Masaki Shuzo, and Eisaku Maeda. 2018. Application of CNN for human activity recognitionwith FFT spectrogram of acceleration and gyro sensors. In
Proceedings of the 2018 ACM International Joint Conferenceand 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers . ACM, 1503–1510.[65] Yusuke Iwasawa, Kotaro Nakayama, Ikuko Yairi, and Yutaka Matsuo. 2017. Privacy Issues Regarding the Applicationof DNNs to Activity-Recognition using Wearables and Its Countermeasures by Use of Adversarial Training.. In
Twenty-Sixth International Joint Conference on Artificial Intelligence . 1930–1936.[66] Wenjun Jiang, Chenglin Miao, Fenglong Ma, Shuochao Yao, Yaqing Wang, Ye Yuan, Hongfei Xue, Chen Song, Xin Ma,Dimitrios Koutsonikolas, et al. 2018. Towards Environment Independent Device Free Human Activity Recognition. In
Proceedings of the 24th Annual International Conference on Mobile Computing and Networking . ACM, 289–304.[67] Wenchao Jiang and Zhaozheng Yin. 2015. Human activity recognition using wearable sensors by deep convolutionalneural networks. In
Proceedings of the 23rd ACM international conference on Multimedia . Acm, 1307–1310.[68] Artur Jordao, Antonio C Nazare Jr, Jessica Sena, and William Robson Schwartz. 2018. Human activity recognitionbased on wearable sensor data: A standardization of the state-of-the-art. arXiv preprint arXiv:1806.05226 (2018).[69] Pyeong-Gook Jung, Gukchan Lim, Seonghyok Kim, and Kyoungchul Kong. 2015. A wearable gesture recognitiondevice for detecting muscular activities based on air-pressure sensors.
IEEE Transactions on Industrial Informatics eep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities 111:33 [70] Andrej Karpathy, Justin Johnson, and Li Fei-Fei. 2016. Visualizing and understanding recurrent networks. In
The 4thInternational Conference on Learning Representations Workshop .[71] Sara Khalifa, Mahbub Hassan, Aruna Seneviratne, and Sajal K Das. 2015. Energy-harvesting wearables for activity-aware services.
IEEE internet computing
19, 5 (2015), 8–16.[72] Md Abdullah Al Hafiz Khan, Nirmalya Roy, and Archan Misra. 2018. Scaling human activity recognition via deeplearning-based domain adaptation. In . IEEE, 1–9.[73] Shehroz S Khan and Babak Taati. 2017. Detecting unseen falls from wearable devices using channel-wise ensemble ofautoencoders.
Expert Systems with Applications
87 (2017), 280–290.[74] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neuralnetworks. In
Advances in neural information processing systems . 1097–1105.[75] Jennifer R Kwapisz, Gary M Weiss, and Samuel A Moore. 2011. Activity recognition using cell phone accelerometers.
ACM SigKDD Explorations Newsletter
12, 2 (2011), 74–82.[76] Yongjin Kwon, Kyuchang Kang, and Changseok Bae. 2015. Analysis and evaluation of smartphone-based humanactivity recognition using a neural network approach. In . IEEE, 1–5.[77] Nicholas D Lane and Petko Georgiev. 2015. Can deep learning revolutionize mobile sensing?. In
Proceedings of the16th International Workshop on Mobile Computing Systems and Applications . ACM, 117–122.[78] Gierad Laput and Chris Harrison. 2019. Sensing Fine-Grained Hand Activity with Smartwatches. In
Proceedings of the2019 CHI Conference on Human Factors in Computing Systems . ACM, 338.[79] Oscar D Lara and Miguel A Labrador. 2013. A survey on human activity recognition using wearable sensors.
IEEECommunications Surveys & Tutorials
15, 3 (2013), 1192–1209.[80] Dong-Eun Lee, Sang-Min Seo, Hee-Soon Woo, and Sung-Yun Won. 2018. Analysis of body imbalance in variouswriting sitting postures using sitting pressure measurement.
Journal of physical therapy science
30, 2 (2018), 343–346.[81] Ki-Seung Lee. 2019. Joint Audio-ultrasound food recognition for noisy environments.
IEEE journal of biomedical andhealth informatics (2019).[82] Song-Mi Lee, Sang Min Yoon, and Heeryon Cho. 2017. Human activity recognition from accelerometer data usingConvolutional Neural Network. In .IEEE, 131–134.[83] Fei Li and Schahram Dustdar. 2011. Incorporating unsupervised learning in activity recognition. In
Workshops at theTwenty-Fifth AAAI Conference on Artificial Intelligence .[84] Xinyu Li, Yuan He, and Xiaojun Jing. 2019. A Survey of Deep Learning-Based Human Activity Recognition in Radar.
Remote Sensing
11, 9 (2019), 1068.[85] Xinyu Li, Yanyi Zhang, Mengzhu Li, Ivan Marsic, JaeWon Yang, and Randall S Burd. 2016. Deep neural networkfor RFID-based activity recognition. In
Proceedings of the Eighth Wireless of the Students, by the Students, and for theStudents Workshop, S3@MobiCom 2016 . ACM, 24–26.[86] Xinyu Li, Yanyi Zhang, Ivan Marsic, Aleksandra Sarcevic, and Randall S Burd. 2016. Deep learning for rfid-basedactivity recognition. In
Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM . ACM,164–175.[87] Xinyu Li, Yanyi Zhang, Jianyu Zhang, Shuhong Chen, Ivan Marsic, Richard A Farneth, and Randall S Burd. 2017.Concurrent activity recognition with multimodal cnn-lstm structure. arXiv preprint arXiv:1702.01638 (2017).[88] Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, withimplications for streaming algorithms. In
Proceedings of the 8th ACM SIGMOD workshop on Research issues in datamining and knowledge discovery . ACM, 2–11.[89] Lingjuan Lyu, Xuanli He, Yee Wei Law, and Marimuthu Palaniswami. 2017. Privacy-preserving collaborative deeplearning with application to human activity recognition. In
Proceedings of the 2017 ACM on Conference on Informationand Knowledge Management . ACM, 1219–1228.[90] Haojie Ma, Wenzhong Li, Xiao Zhang, Songcheng Gao, and Sanglu Lu. 2019. AttnSense: Multi-level AttentionMechanism For Multimodal Human Activity Recognition. In
Proceedings of the Twenty-Eighth International JointConference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019 . 3109–3115.[91] Mohammad Malekzadeh, Richard G Clegg, Andrea Cavallaro, and Hamed Haddadi. 2018. Protecting sensory dataagainst sensitive inferences. In
Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems . ACM, 2.[92] Mohammad Malekzadeh, Richard G Clegg, Andrea Cavallaro, and Hamed Haddadi. 2019. Mobile sensor dataanonymization. In
Proceedings of the International Conference on Internet of Things Design and Implementation . 49–58.[93] Akhil Mathur, Tianlin Zhang, Sourav Bhattacharya, Petar Veličković, Leonid Joffe, Nicholas D Lane, Fahim Kawsar,and Pietro Lió. 2018. Using deep data augmentation training to address software and hardware heterogeneities inwearable and smartphone sensing devices. In
Proceedings of the 17th ACM/IEEE International Conference on Information
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018.
Processing in Sensor Networks . IEEE Press, 200–211.[94] Shinya Matsui, Nakamasa Inoue, Yuko Akagi, Goshu Nagino, and Koichi Shinoda. 2017. User adaptation of convolu-tional neural network for human activity recognition. In .IEEE, 753–757.[95] Taylor Mauldin, Marc Canby, Vangelis Metsis, Anne Ngu, and Coralys Rivera. 2018. SmartFall: A smartwatch-basedfall detection system using deep learning.
Sensors
18, 10 (2018), 3363.[96] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černock`y, and Sanjeev Khudanpur. 2010. Recurrent neuralnetwork based language model. In
Eleventh annual conference of the international speech communication association .[97] Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. 2011. Acoustic modeling using deep belief networks.
IEEE transactions on audio, speech, and language processing
20, 1 (2011), 14–22.[98] A Moncada-Torres, K Leuenberger, R Gonzenbach, A Luft, and Roger Gassert. 2014. Activity classification based oninertial and barometric pressure sensors at different anatomical locations.
Physiological measurement
35, 7 (2014),1245.[99] Francisco Javier Ordóñez Morales and Daniel Roggen. 2016. Deep convolutional feature transfer across mobile activityrecognition domains, sensor modalities and locations. In
Proceedings of the 2016 ACM International Symposium onWearable Computers . ACM, 92–99.[100] Sebastian Münzner, Philip Schmidt, Attila Reiss, Michael Hanselmann, Rainer Stiefelhagen, and Robert Dürichen.2017. CNN-based sensor fusion techniques for multimodal human activity recognition. In
Proceedings of the 2017ACM International Symposium on Wearable Computers . ACM, 158–165.[101] Vishvak S Murahari and Thomas Plötz. 2018. On attention models for human activity recognition. In
Proceedings ofthe 2018 ACM International Symposium on Wearable Computers . ACM, 100–103.[102] Harideep Nair, Cathy Tan, Ming Zeng, Ole J Mengshoel, and John Paul Shen. 2019. AttriNet: learning mid-levelfeatures for human activity recognition with deep belief networks. In
Proceedings of the 2019 ACM International JointConference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium onWearable Computers . ACM, 510–517.[103] Mark Nutter, Catherine H Crawford, and Jorge Ortiz. 2018. Design of Novel Deep Learning Models for Real-timeHuman Activity Recognition with Mobile Phones. In .IEEE, 1–8.[104] Henry Friday Nweke, Ying Wah Teh, Mohammed Ali Al-Garadi, and Uzoma Rita Alo. 2018. Deep learning algorithmsfor human activity recognition using mobile and wearable sensor networks: State of the art and research challenges.
Expert Systems with Applications
105 (2018), 233–261.[105] Tsuyoshi Okita and Sozo Inoue. 2017. Recognition of multiple overlapping activities using compositional CNN-LSTMmodel. In
Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing andProceedings of the 2017 ACM International Symposium on Wearable Computers . ACM, 165–168.[106] Francisco Ordóñez and Daniel Roggen. 2016. Deep convolutional and lstm recurrent neural networks for multimodalwearable activity recognition.
Sensors
16, 1 (2016), 115.[107] Liangying Peng, Ling Chen, Zhenan Ye, and Yi Zhang. 2018. AROMA: A Deep Multi-Task Learning Based Simpleand Complex Human Activity Recognition Method Using Wearable Sensors.
Proceedings of the ACM on Interactive,Mobile, Wearable and Ubiquitous Technologies
2, 2 (2018), 74.[108] Cuong Pham and Patrick Olivier. 2009. Slice&dice: Recognizing food preparation activities using embedded ac-celerometers. In
European Conference on Ambient Intelligence . Springer, 34–43.[109] NhatHai Phan, Yue Wang, Xintao Wu, and Dejing Dou. 2016. Differential privacy preservation for deep auto-encoders:an application of human behavior prediction. In
Thirtieth AAAI Conference on Artificial Intelligence .[110] Ivan Miguel Pires, Nuno Pombo, Nuno M Garcia, and Francisco Flórez-Revuelta. 2018. Multi-Sensor Mobile Platformfor the Recognition of Activities of Daily Living and their Environments based on Artificial Neural Networks.. In
Twenty-Seventh International Joint Conference on Artificial Intelligence . 5850–5852.[111] Thomas Plötz, Nils Y Hammerla, and Patrick L Olivier. 2011. Feature learning for activity recognition in ubiquitouscomputing. In
Twenty-Second International Joint Conference on Artificial Intelligence .[112] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria Presa Reyes, Mei-Ling Shyu, Shu-ChingChen, and SS Iyengar. 2018. A survey on deep learning: Algorithms, techniques, and applications.
ACM ComputingSurveys (CSUR)
51, 5 (2018), 92.[113] Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. 2017. Language generation with recurrent generativeadversarial networks without pre-training. arXiv preprint arXiv:1706.01399 (2017).[114] Hangwei Qian, Sinno Jialin Pan, Bingshui Da, and Chunyan Miao. 2019. A Novel Distribution-Embedded NeuralNetwork for Sensor-Based Activity Recognition. In
Proceedings of the Twenty-Eighth International Joint Conference onArtificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019 . 5614–5620.J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018. eep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities 111:35 [115] Valentin Radu, Nicholas D Lane, Sourav Bhattacharya, Cecilia Mascolo, Mahesh K Marina, and Fahim Kawsar.2016. Towards multimodal deep learning for activity recognition on mobile devices. In
Proceedings of the 2016 ACMInternational Joint Conference on Pervasive and Ubiquitous Computing: Adjunct . ACM, 185–188.[116] Valentin Radu, Catherine Tong, Sourav Bhattacharya, Nicholas D Lane, Cecilia Mascolo, Mahesh K Marina, and FahimKawsar. 2018. Multimodal deep learning for activity and context recognition.
Proceedings of the ACM on Interactive,Mobile, Wearable and Ubiquitous Technologies
1, 4 (2018), 157.[117] Sankar Rangarajan, Assegid Kidane, Gang Qian, Stjepan Rajko, and David Birchfield. 2007. The design of a pressuresensing floor for movement-based human computer interaction. In
European Conference on Smart Sensing and Context .Springer, 46–61.[118] Daniele Ravi, Charence Wong, Benny Lo, and Guang-Zhong Yang. 2016. A deep learning approach to on-node sensordata analytics for mobile or wearable devices.
IEEE journal of biomedical and health informatics
21, 1 (2016), 56–64.[119] Daniele Ravi, Charence Wong, Benny Lo, and Guang-Zhong Yang. 2016. Deep learning for human activity recognition:A resource efficient implementation on low-power devices. In . IEEE, 71–76.[120] Attila Reiss and Didier Stricker. 2012. Introducing a new benchmarked dataset for activity monitoring. In . IEEE, 108–109.[121] Jorge-L Reyes-Ortiz, Luca Oneto, Albert Samà, Xavier Parra, and Davide Anguita. 2016. Transition-aware humanactivity recognition using smartphones.
Neurocomputing
171 (2016), 754–767.[122] Daniele Riboni, Linda Pareschi, Laura Radaelli, and Claudio Bettini. 2011. Is ontology-based activity recognition reallyeffective?. In . IEEE, 427–431.[123] Daniel Roggen, Alberto Calatroni, Mirco Rossi, Thomas Holleczek, Kilian Förster, Gerhard Tröster, Paul Lukowicz,David Bannach, Gerald Pirkl, Alois Ferscha, et al. 2010. Collecting complex activity datasets in highly rich networkedsensor environments. In . IEEE, 233–240.[124] Seyed Ali Rokni, Marjan Nourollahi, and Hassan Ghasemzadeh. 2018. Personalized Human Activity RecognitionUsing Convolutional Neural Networks. In
Thirty-Second AAAI Conference on Artificial Intelligence .[125] Charissa Ann Ronao and Sung-Bae Cho. 2015. Deep convolutional neural networks for human activity recognitionwith smartphone sensors. In
International Conference on Neural Information Processing . Springer, 46–53.[126] Charissa Ann Ronao and Sung-Bae Cho. 2016. Human activity recognition with smartphone sensors using deeplearning neural networks.
Expert systems with applications
59 (2016), 235–244.[127] Silvia Rossi, Roberto Capasso, Giovanni Acampora, and Mariacarla Staffa. 2018. A Multimodal Deep Learning Networkfor Group Activity Recognition. In . IEEE, 1–6.[128] Wenjie Ruan, Quan Z Sheng, Peipei Xu, Lei Yang, Tao Gu, and Longfei Shangguan. 2017. Making sense of Dopplereffect for multi-modal hand motion detection.
IEEE Transactions on Mobile Computing
17, 9 (2017), 2087–2100.[129] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge.
International journalof computer vision . IEEE,473–479.[131] Edward S Sazonov, George Fulk, James Hill, Yves Schutz, and Raymond Browning. 2010. Monitoring of postureallocations and activities by a shoe-based wearable sensor.
IEEE Transactions on Biomedical Engineering
58, 4 (2010),983–990.[132] Jeffrey C Schlimmer and Richard H Granger. 1986. Incremental learning from noisy data.
Machine learning
1, 3 (1986),317–354.[133] Sofia Serrano and Noah A. Smith. 2019. Is Attention Interpretable?. In
Proceedings of the 57th Conference of theAssociation for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers .2931–2951.[134] Mehmet Saygın Seyfioğlu, Ahmet Murat Özbayoğlu, and Sevgi Zubeyde Gürbüz. 2018. Deep convolutional autoencoderfor radar-based classification of similar aided and unaided human activities.
IEEE Trans. Aerospace Electron. Systems
54, 4 (2018), 1709–1723.[135] Yu-Han Shen, Ke-Xin He, and Wei-Qiang Zhang. 2018. SAM-GCNN: A Gated Convolutional Neural Network withSegment-Level Attention Mechanism for Home Activity Monitoring. In . IEEE, 679–684.[136] Muhammad Shoaib, Stephan Bosch, Ozlem Incel, Hans Scholten, and Paul Havinga. 2014. Fusion of smartphonemotion sensors for physical activity recognition.
Sensors
14, 6 (2014), 10146–10176.J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018. [137] Geetika Singla, Diane J Cook, and Maureen Schmitter-Edgecombe. 2010. Recognizing independent and joint activitiesamong multiple residents in smart environments.
Journal of ambient intelligence and humanized computing
1, 1 (2010),57–63.[138] Joshua R Smith, Kenneth P Fishkin, Bing Jiang, Alexander Mamishev, Matthai Philipose, Adam D Rea, Sumit Roy,and Kishore Sundara-Rajan. 2005. RFID-based techniques for human-activity detection.
Commun. ACM
48, 9 (2005),39–44.[139] Elnaz Soleimani and Ehsan Nazerfard. 2019. Cross-Subject Transfer Learning in Human Activity Recognition Systemsusing Generative Adversarial Networks. arXiv preprint arXiv:1903.12489 (2019).[140] Maja Stikic, Kristof Van Laerhoven, and Bernt Schiele. 2008. Exploring semi-supervised and active learning foractivity recognition. In . IEEE, 81–88.[141] Allan Stisen, Henrik Blunck, Sourav Bhattacharya, Thor Siiger Prentow, Mikkel Baun Kjærgaard, Anind Dey, TobiasSonne, and Mads Møller Jensen. 2015. Smart devices are different: Assessing and mitigatingmobile sensing hetero-geneities for activity recognition. In
Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems .ACM, 127–140.[142] Yujin Tang, Jianfeng Xu, Kazunori Matsumoto, and Chihiro Ono. 2016. Sequence-to-sequence model with attentionfor time series classification. In . IEEE,503–510.[143] Dapeng Tao, Yonggang Wen, and Richang Hong. 2016. Multicolumn bidirectional long short-term memory for mobiledevices-based human activity recognition.
IEEE Internet of Things Journal
3, 6 (2016), 1124–1134.[144] Dorra Trabelsi, Samer Mohammed, Faicel Chamroukhi, Latifa Oukhellou, and Yacine Amirat. 2013. An unsupervisedapproach for automatic activity recognition based on hidden Markov model regression.
IEEE Transactions onautomation science and engineering
10, 3 (2013), 829–835.[145] Luan Tran, Xi Yin, and Xiaoming Liu. 2017. Disentangled representation learning gan for pose-invariant facerecognition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition . 1415–1424.[146] Son N Tran, Qing Zhang, Vanessa Smallbon, and Mohan Karunanithi. 2018. Multi-Resident Activity Monitoringin Smart Homes: A Case Study. In . IEEE, 698–703.[147] Tim LM van Kasteren, Gwenn Englebienne, and Ben JA Kröse. 2011. Human activity recognition from wirelesssensor network data: Benchmark and software. In
Activity recognition in pervasive intelligent environments . Springer,165–186.[148] Alireza Abedin Varamin, Ehsan Abbasnejad, Qinfeng Shi, Damith C Ranasinghe, and Hamid Rezatofighi. 2018. DeepAuto-Set: A Deep Auto-Encoder-Set Network for Activity Recognition Using Wearables. In
Proceedings of the 15thEAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services . ACM, 246–253.[149] George Vavoulas, Charikleia Chatzaki, Thodoris Malliotakis, Matthew Pediaditis, and Manolis Tsiknakis. 2016. TheMobiAct Dataset: Recognition of Activities of Daily Living using Smartphones.. In
ICT4AgeingWell . 143–151.[150] Praneeth Vepakomma, Debraj De, Sajal K Das, and Shekhar Bhansali. 2015. A-Wristocracy: Deep learning on wrist-worn sensing for recognition of user complex activities. In . 1–6.[151] Toan H Vu, An Dang, Le Dung, and Jia-Ching Wang. 2017. Self-gated recurrent neural networks for human activityrecognition on wearable devices. In
Proceedings of the on Thematic Workshops of ACM Multimedia 2017 . ACM, 179–185.[152] Guanhua Wang, Yongpan Zou, Zimu Zhou, Kaishun Wu, and Lionel M Ni. 2016. We can hear you with wi-fi!
IEEETransactions on Mobile Computing
15, 11 (2016), 2907–2920.[153] Jiwei Wang, Yiqiang Chen, Yang Gu, Yunlong Xiao, and Haonan Pan. 2018. SensoryGANs: An Effective GenerativeAdversarial Framework for Sensor-based Human Activity Recognition. In . IEEE, 1–8.[154] Jindong Wang, Yiqiang Chen, Shuji Hao, Xiaohui Peng, and Lisha Hu. 2019. Deep learning for sensor-based activityrecognition: A survey.
Pattern Recognition Letters
119 (2019), 3–11.[155] Jindong Wang, Vincent W Zheng, Yiqiang Chen, and Meiyu Huang. 2018. Deep transfer learning for cross-domainactivity recognition. In
Proceedings of the 3rd International Conference on Crowd Science and Engineering . ACM, 16.[156] Xuyu Wang, Chao Yang, and Shiwen Mao. 2017. PhaseBeat: Exploiting CSI phase data for vital sign monitoring withcommodity WiFi devices. In . IEEE,1230–1239.[157] Gary Mitchell Weiss and Jeffrey Lockhart. 2012. The impact of personalization on smartphone-based activityrecognition. In
Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence .[158] Sungpil Woo, Jaewook Byun, Seonghoon Kim, Hoang Minh Nguyen, Janggwan Im, and Daeyoung Kim. 2016.RNN-Based Personalized Activity Recognition in Multi-person Environment Using RFID. In . IEEE, 708–715.J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018. eep Learning for Sensor-based Human Activity Recognition: Overview, Challenges and Opportunities 111:37 [159] Jian Wu, Zhongjun Tian, Lu Sun, Leonardo Estevez, and Roozbeh Jafari. 2015. Real-time American sign languagerecognition using wrist-worn motion and surface EMG sensors. In . IEEE, 1–6.[160] Rui Xi, Mengshu Hou, Mingsheng Fu, Hong Qu, and Daibo Liu. 2018. Deep dilated convolution on multimodality timeseries for human activity recognition. In . IEEE, 1–8.[161] Rui Xi, Ming Li, Mengshu Hou, Mingsheng Fu, Hong Qu, Daibo Liu, and Charles R Haruna. 2018. Deep dilation onmultimodality time series for human activity recognition.
IEEE Access
IEEE Access arXiv preprint arXiv:1805.07020 (2018).[164] Jianbo Yang, Minh Nhut Nguyen, Phyo Phyo San, Xiao Li Li, and Shonali Krishnaswamy. 2015. Deep convolutionalneural networks on multichannel time series for human activity recognition. In
Twenty-Fourth International JointConference on Artificial Intelligence .[165] Yang Yang, Chunping Hou, Yue Lang, Dai Guan, Danyang Huang, and Jinchen Xu. 2019. Open-set human activityrecognition based on micro-Doppler signatures.
Pattern Recognition
85 (2019), 60–69.[166] Zhan Yang, Osolo Ian Raymond, Chengyuan Zhang, Ying Wan, and Jun Long. 2018. DFTerNet: towards 2-bit dynamicfusion networks for accurate human activity recognition.
IEEE Access
IEEE Transactions on MobileComputing
17, 2 (2017), 293–306.[168] Rui Yao, Guosheng Lin, Qinfeng Shi, and Damith C Ranasinghe. 2018. Efficient dense labelling of human activitysequences from wearables using fully convolutional networks.
Pattern Recognition
78 (2018), 252–266.[169] Shuochao Yao, Shaohan Hu, Yiran Zhao, Aston Zhang, and Tarek Abdelzaher. 2017. Deepsense: A unified deeplearning framework for time-series mobile sensing data processing. In
Proceedings of the 26th International Conferenceon World Wide Web . International World Wide Web Conferences Steering Committee, 351–360.[170] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neuralnetworks?. In
Advances in neural information processing systems . 3320–3328.[171] Siamak Yousefi, Hirokazu Narui, Sankalp Dayal, Stefano Ermon, and Shahrokh Valaee. 2017. A survey on behaviorrecognition using wifi channel state information.
IEEE Communications Magazine
55, 10 (2017), 98–104.[172] Yuta Yuki, Junto Nozaki, Kei Hiroi, Katsuhiko Kaji, and Nobuo Kawaguchi. 2018. Activity Recognition using Dual-ConvLSTM Extracting Local and Global Features for SHL Recognition Challenge. In
Proceedings of the 2018 ACMInternational Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and WearableComputers . ACM, 1643–1651.[173] Piero Zappi, Clemens Lombriser, Thomas Stiefmeier, Elisabetta Farella, Daniel Roggen, Luca Benini, and GerhardTröster. 2008. Activity recognition from on-body sensors: accuracy-power trade-off by dynamic sensor selection. In
European Conference on Wireless Sensor Networks . Springer, 17–33.[174] Tahmina Zebin, Patricia J Scully, and Krikor B Ozanyan. 2016. Human activity recognition with inertial sensors usinga deep learning approach. In . IEEE, 1–3.[175] Ming Zeng, Haoxiang Gao, Tong Yu, Ole J Mengshoel, Helge Langseth, Ian Lane, and Xiaobing Liu. 2018. Understandingand improving recurrent networks for human activity recognition by continuous attention. In
Proceedings of the 2018ACM International Symposium on Wearable Computers . ACM, 56–63.[176] Ming Zeng, Le T Nguyen, Bo Yu, Ole J Mengshoel, Jiang Zhu, Pang Wu, and Joy Zhang. 2014. Convolutional neuralnetworks for human activity recognition using mobile sensors. In . IEEE, 197–205.[177] Ming Zeng, Tong Yu, Xiao Wang, Le T Nguyen, Ole J Mengshoel, and Ian Lane. 2017. Semi-supervised convolutionalneural networks for human activity recognition. In . IEEE,522–529.[178] Dalin Zhang, Kaixuan Chen, Debao Jian, and Lina Yao. 2020. Motor Imagery Classification via TemporalAttentionCues of Graph Embedded EEG Signals.
IEEE Journal of Biomedical and Health Informatics (2020).[179] Dalin Zhang, Lina Yao, Kaixuan Chen, Guodong Long, and Sen Wang. 2019. Collective Protection: Preventing SensitiveInferences via Integrative Transformation. In
The 19th IEEE International Conference on Data Mining (ICDM) . IEEE,1–6.[180] Dalin Zhang, Lina Yao, Kaixuan Chen, and Jessica Monaghan. 2019. A convolutional recurrent attention model forsubject-independent eeg signal analysis.
IEEE Signal Processing Letters
26, 5 (2019), 715–719.[181] Dalin Zhang, Lina Yao, Kaixuan Chen, and Sen Wang. 2018. Ready for Use: Subject-Independent Movement IntentionRecognition via a Convolutional Attention Model. In
Proceedings of the 27th ACM International Conference on
J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018.
Information and Knowledge Management(CIKM) . ACM, 1763–1766.[182] Dalin Zhang, Lina Yao, Kaixuan Chen, Sen Wang, Xiaojun Chang, and Yunhao Liu. 2019. Making sense of spatio-temporal preserving representations for EEG-based human intention recognition.
IEEE transactions on cybernetics (2019).[183] Dalin Zhang, Lina Yao, Kaixuan Chen, Sen Wang, Pari Delir Haghighi, and Caley Sullivan. 2019. A Graph-BasedHierarchical Attention Model for Movement Intention Detection from EEG Signals.
IEEE Transactions on NeuralSystems and Rehabilitation Engineering
27, 11 (2019), 2247–2253.[184] Dalin Zhang, Lina Yao, Sen Wang, Kaixuan Chen, Zheng Yang, and Boualem Benatallah. 2018. Fuzzy integraloptimization with deep q-network for eeg-based intention recognition. In
Pacific-Asia Conference on KnowledgeDiscovery and Data Mining . Springer, 156–168.[185] Dalin Zhang, Lina Yao, Xiang Zhang, Sen Wang, Weitong Chen, and Robert Boots. 2018. Cascade and parallelconvolutional recurrent neural networks on EEG-based intention recognition for brain computer interface. In
Thirty-Second AAAI Conference on Artificial Intelligence (AAAI) .[186] Mi Zhang and Alexander A Sawchuk. 2012. USC-HAD: a daily activity dataset for ubiquitous activity recognitionusing wearable sensors. In
Proceedings of the 2012 ACM Conference on Ubiquitous Computing . ACM, 1036–1043.[187] Xiang Zhang, Lina Yao, Chaoran Huang, Sen Wang, Mingkui Tan, Guodong Long, and Can Wang. 2018. Multi-modality sensor data classification with selective attention. In
Twenty-Seventh International Joint Conference onArtificial Intelligence .[188] Xiang Zhang, Lina Yao, and Feng Yuan. 2019. Adversarial Variational Embedding for Robust Semi-supervised Learning.(2019), 139–147.[189] Yanyi Zhang, Xinyu Li, Jianyu Zhang, Shuhong Chen, Moliang Zhou, Richard A Farneth, Ivan Marsic, and Randall SBurd. 2017. Car-a deep learning structure for concurrent activity recognition. In . IEEE, 299–300.[190] Yong Zhang, Yu Zhang, Zhao Zhang, Jie Bao, and Yunpeng Song. 2018. Human activity recognition based on timeseries analysis using U-Net. arXiv preprint arXiv:1809.08113 (2018).[191] Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J Leon Zhao. 2014. Time series classification using multi-channels deepconvolutional neural networks. In
International Conference on Web-Age Information Management . Springer, 298–310.[192] Yue Zheng, Yi Zhang, Kun Qian, Guidong Zhang, Yunhao Liu, Chenshu Wu, and Zheng Yang. 2019. Zero-EffortCross-Domain Gesture Recognition with Wi-Fi. In
Proceedings of the 17th Annual International Conference on MobileSystems, Applications, and Services . ACM, 313–325.[193] Jun-Yan Zhu and Jim Foley. 2019. Learning to Synthesize and Manipulate Natural Images.
IEEE computer graphicsand applications
39, 2 (2019), 14–23.[194] Muhammad Zia ur Rehman, Asim Waris, Syed Gilani, Mads Jochumsen, Imran Niazi, Mohsin Jamil, Dario Farina, andErnest Kamavuako. 2018. Multiday EMG-based classification of hand motions with deep learning techniques.
Sensors
18, 8 (2018), 2497.[195] Han Zou, Yuxun Zhou, Jianfei Yang, Hao Jiang, Lihua Xie, and Costas J Spanos. 2018. Deepsense: Device-free humanactivity recognition via autoencoder long-term recurrent convolutional network. In2018 IEEE International Conferenceon Communications (ICC)