PFirewall: Semantics-Aware Customizable Data Flow Control for Home Automation Systems

Abstract

Emerging Internet of Things (IoT) platforms provide a convenient solution for integrating heterogeneous IoT devices and deploying home automation applications. However, serious privacy threats arise as device data now flow out to the IoT platforms, which may be subject to various attacks. We observe two privacy-unfriendly practices in emerging home automation systems: first, the majority of data flowing to the platform are superfluous in the sense that they do not trigger any home automation; second, home owners currently have nearly zero control over their data. We present PFirewall, a customizable data-flow control system to enhance user privacy. PFirewall analyzes the automation apps to extract their semantics, which are automatically transformed into data-minimization policies; these policies only send minimized data flows to the platform for app execution, such that the ability of attackers to infer user privacy is significantly impaired. In addition, PFirewall provides capabilities and interfaces for users to define and enforce customizable policies based on individual privacy preferences. PFirewall adopts an elegant man-in-the-middle design, transparently executing data-minimization and user-defined policies to process raw data flows and mediating the processed data between IoT devices and the platform (via the hub), without requiring modifications of the platform or IoT devices. We implement PFirewall to work with two popular platforms, SmartThings and openHAB, and set up two real-world testbeds to evaluate its performance. The evaluation results show that PFirewall is very effective: it reduces IoT data sent to the platform by 97% and enforces user-defined policies successfully.



Haotian Chi∗, Qiang Zeng†, Xiaojiang Du∗, Lannan Luo†
∗ Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA
† Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA
Email: {htchi, dux}@temple.edu, {zeng1, lluo}@cse.sc.edu

I. INTRODUCTION

With the prosperity of the Internet of Things (IoT), smart systems (e.g., smart homes, factories, and hospitals) have become a reality and are expanding at an ever-increasing speed [1]. IoT platforms, such as SmartThings, Wink, and openHAB, allow smart home users to connect heterogeneous IoT devices (e.g., sensors, actuators, appliances) to a platform-provided hub and to install applications on the platform to create automatic interactions among devices, i.e., home automation.

As IoT device data flow to the platform, protecting user privacy becomes critical [2], [3]. Existing work protects user privacy by resolving threats caused by malicious automation applications [4], [5], [6], [7] or handling attacks that eavesdrop on IoT device traffic [8], [9], [10], [11]. Surprisingly, none investigates privacy protection at the platform architectural level, even though the platform receives huge amounts of data from smart homes and has full data access privileges. Indeed, it is baseless to assume the platform is secure and trustworthy. A platform could be compromised by both inside attackers [12] and remote attackers that exploit the vulnerabilities of its hub and cloud [13]. Compared to clouds that have suffered many notorious attacks, an IoT platform has a much larger attack surface involving not only its cloud but also the hub and user control interfaces (e.g., web and mobile app). Moreover, many IoT platforms share users’ data with partners (e.g., advertisers) for the expansion of businesses [14], [15], [16]; any improper protection may expose private data to third parties.

(An earlier version of this paper was submitted to USENIX Security on November 15th, 2018. This version contains some minor modifications based on that submission.)

Our investigation of popular smart home platforms shows that these platforms are in fact overprivileged to access real-time data streams from connected devices, although most of the data do not trigger any automation. This deviates from the principle of “data minimisation” in the European General Data Protection Regulation (GDPR) [17] and of “least privilege” in access control systems [18]. We also find that no capabilities are provided for users to control the leakage of private device data to the platform, failing to realize user-centric authorization. Therefore, our goals are to minimize the data sent to the platform and allow users to define customizable data flow control policies for individual privacy preferences.

Multiple challenges arise in attaining these goals. First, the data minimization should not adversely affect the functionality of home automation. We observe that the semantics of home automation apps can be represented as rules, each following an event-condition-action model, and that state-of-the-art code analysis techniques [19], [20], [7] have proved effective in extracting rule semantics from apps.
Our insight is that by finding the minimum data flows required by these rule semantics, we can properly generate and enforce data flow control policies without affecting home automation. For example, suppose a rule has the semantics “when a motion is detected, if the indoor temperature is higher than 79°F, turn on the A/C”. We can convert it into a data-minimization policy, such that if the indoor temperature is not higher than 79°F, no data is sent to the platform; besides, if the A/C is already on (that is, the rule execution does not change anything), no data is sent even if the temperature is higher than 79°F. Optionally, users can have the system fuzz the data, such that even if the policy execution determines that the temperature should be sent, a random value larger than 79 is reported.

Second, many platforms are closed systems that do not allow platform-level modifications, and it is probably unrealistic to expect a platform to cooperate to enforce data minimization. Thus, how to enforce data-protection policies before data leave the home network is a challenge. Intuitively, one may propose to circumvent this challenge by building a new purely local platform, such that no data have to flow out of a home; or, one can simply cut the network cable of a local gateway [21] and enforce most of the home automation locally. However, a large number of existing platforms have been deployed in homes, and it might be infeasible to convince users to switch to a new platform they are not familiar with; moreover, a purely local platform means that many highly desired Internet-based services (e.g., messaging, storage, and remote management) will be cut off. Therefore, how to enforce data protection on the existing platform architecture without sacrificing the value of Internet-based services imposes extra difficulties.

We incorporate multiple system-building ideas into our system, named PFirewall. First, we build PFirewall as a data mediator, which sits between IoT devices and the hub to transparently filter data based on privacy-protection policies. The advantage is that neither IoT devices nor the platform needs to be modified. A further challenge is that the original communication between IoT devices and the hub is encrypted, which prevents PFirewall from understanding and then filtering data. We overcome this difficulty with a man-in-the-middle approach: the data mediator presents itself as a hub to pair with all the devices, and meanwhile it creates the same number of virtual devices to connect to the hub.

Furthermore, we borrow the idea of a DMZ (demilitarized zone) when designing PFirewall. A DMZ exposes certain external-facing services (e.g., web) to the Internet, while the organization’s local area network (LAN) is segregated by a firewall. This way, even if a node in the DMZ is compromised, attackers need to bypass the firewall to reach the LAN. We propose to place the hub in a DMZ, and set up an extremely simple firewall between the DMZ and PFirewall: the external world cannot initiate connections to PFirewall, and any inbound traffic, unless it targets those virtual devices, should be discarded immediately by PFirewall.

We demonstrate the ideas by implementing PFirewall to work with two representative platforms: Samsung SmartThings and openHAB, which are among the most popular cloud-based and gateway-based IoT platforms, respectively. We evaluate PFirewall in two real-world deployments. The results suggest that PFirewall reduces the amount of data sent to the platform by 97% based on data-minimization policies. Our case study shows that the data reduction heavily impairs the attacker’s ability to infer privacy-sensitive behaviors, e.g., bathroom usage and the arrival and departure times of home members. The user-specified policies provide extra fine-grained data control to address personalized privacy preferences and concerns.

The contributions of this work are summarized as follows.
• We reveal the fact that most smart home platforms employ a simple trust-by-default model between home devices and the platforms, resulting in over-leakage of sensitive IoT device data. We find several channels through which the collected data could be revealed, demonstrating the severe privacy risks. Despite the clear need for user-centric data flow control, we find that most leading platforms do not support it.

• We design an effective data flow control system to enhance user privacy in home automation. On one hand, data-minimization policies are automatically generated based on the installed automation apps, reporting minimally necessary data for app execution and obfuscating the reported data for further protection. On the other hand, users are offered capabilities to prioritize policies specified by themselves to customize data flow control for individual privacy preferences and concerns.
• A man-in-the-middle style enforcement mechanism for closed-source smart home systems is designed. A proxy device mediates the communication between IoT devices and the hub, without modifying the devices or the hub.
• We implement a proof-of-concept prototype to work with two platforms: SmartThings and openHAB. Through the evaluation in two real-world scenarios, a two-bedroom apartment and a public workplace, we demonstrate that our system significantly reduces the privacy risks due to data leakage and introduces negligible latency to home automation. A user study is conducted to learn users’ attitude and capabilities towards defining privacy-protection policies with mobile interfaces.

Fig. 1: Smart home platform architecture. [Figure labels: Internet; Hub; IoT Devices; Core Framework; Messaging/Storage; Cloud-Based Platform; Gateway-Based Platform; Companion and Third-Party Apps]

II. BACKGROUND: SMART HOME PLATFORMS

Smart home platforms can be categorized into two types: cloud-based platforms (CBPs) and gateway/hub-based platforms (GBPs), according to whether the core framework of a platform is hosted in a remote cloud or on a gateway/hub device located at home (as shown in Fig. 1); the two types are otherwise similar. Note that a gateway running the core framework at home does not resolve the privacy leakage threats, as the gateway connects to the Internet and is under the full control of the platform administrator. Once the platform is compromised, the attacker gains equivalent capabilities of obtaining user data. We choose a CBP, SmartThings, one of the most popular and full-fledged platforms, as an example to describe the key components in a smart home system.
• Hub. A CBP hub connects IoT devices through distinct short/medium-range wireless radios (ZigBee, Z-Wave, etc.). The hub plays a key role in ensuring the interconnectivity and interoperability of heterogeneous IoT devices. A GBP also has a hub-like device, which not only connects IoT devices but also hosts the core framework (described below). Note that the hub or gateway device, though physically located at home, is conceptually regarded as a part of the platform in terms of data privacy protection, in that it is under the full control of the platform administrator. We call the GBP device a gateway to distinguish it from a CBP hub.
• Cloud. The backend cloud of a CBP hosts the core framework and provides cloud messaging, storage, and any other services necessary for the platform to function. The cloud in a GBP is typically responsible for messaging and storage. The cloud messaging service facilitates some critical functionalities, such as notification, third-party application integration, and remote monitoring and control. Many Internet-based services depend on the cloud.
• Core Framework. The core framework runs the major functionalities of a platform, including home automation. Take SmartThings as an example. It provides a sandboxed runtime environment for running device handlers and SmartApps. Device handlers are software wrappers of physical devices, which abstract the physical devices as a set of capabilities and handle the underlying protocol-specific communications between the core framework and the physical devices. They expose uniform interfaces for SmartApps to interact with devices.
• Companion and Third-Party Apps. To provide a convenient user interface (UI) for users to manage their hubs, IoT devices and apps, a platform usually provides a smartphone companion app. For instance, in the SmartThings companion app, users can install and configure a SmartApp. Current platforms also expose interfaces (mostly RESTful cloud APIs) to incorporate third-party services/applications (e.g., mobile apps, IFTTT [22], webCoRE [23]).
Therefore, a smart home platform has a large attack surface involving the hub, cloud, core framework services, companion app, and APIs for third parties, let alone inside attacks. It is dangerous and unnecessary for users to grant unlimited trust to it by allowing all the data to flow to the platform.

III. MOTIVATION AND THREAT MODEL

In this section, we first reveal two facts we have observed, and then present the threat model.

A. Privacy Concerns about Platforms

1) Trust By Default: In smart home systems, the platforms are typically fully trusted. That is, after being installed, a platform gains access privileges to all connected home devices, technically by design and legally through its terms and conditions or privacy policy. To reduce development complexity and the time to market, most emerging platforms do not provide access control between home devices and their hubs to avoid accessing unnecessary data; instead, they simply collect all data streams reported by devices for further processing. We studied the privacy-related practices in popular smart home platforms and show the details in Appendix A. In this section, we use SmartThings as an example.

Are home data flowing out of homes silently? To answer this question, we connected four types of ZigBee devices (a multipurpose sensor, a motion sensor, an arrival sensor and an outlet) and a Z-Wave sensor (Aeotec Multisensor 6) to a SmartThings hub and inserted log.debug code into the parse methods of the device handlers, which are used by the core framework to parse the received IoT payload and generate in-system events. In this way, we obtained all data received by the SmartThings cloud via its hub on the live logging interface [24]. We did not install any automation apps and did not operate any SmartThings-provided interfaces; we only interacted with the devices physically. We found that the platform cloud still kept receiving device attribute data (e.g., motion, switch, temperature) from the above devices, indicating that device data flow out via the hub even if they are not subscribed to or requested by any service.

This trust-by-default model introduces severe data leakage risks to smart homes, since attackers may gain unauthorized access to home data by compromising the hub device, cloud infrastructure, or the companion app [25]. Vulnerabilities in IoT platforms and clouds have been demonstrated by recent works. For instance, Fernandes et al. [18] and Zuo et al. [26] respectively revealed that the abuse of OAuth tokens and cloud API tokens in mobile apps imposes significant security and privacy threats, including unauthorized access to the platform. An inside attacker can also access all the data.

Fig. 2: The architecture of PFirewall. [Figure labels: IoT Apps; Rule Extractor (Code Analysis, Configuration Collection); Policy Manager (Policy Generation, Conflict Detection, Policy Engine); Data Flow Mediator (Device Takeover, Virtual Device Manager); IoT Devices; Virtual Devices; Platform]

2) Limited User Capabilities: User visibility and control help mitigate risks. However, users have few capabilities and interfaces to inspect or control what their devices send to the Internet [25]. They only have a binary choice: whether or not to connect a device to the platform; once connected, the device keeps reporting data to the hub continuously and opaquely.

B. Threat Model

We consider that the platform may be exploited by attackers to access private user data and infer privacy-sensitive user behaviors. Attacks that exploit home IoT device hardware vulnerabilities, side channels, or home local networks to steal private data are out of the scope of this work. We assume the home automation apps are not malicious (note that how to detect and handle malicious automation apps is a separate problem and has been well studied, e.g., [7], [20], [22]).

IV. PFIREWALL SYSTEM OVERVIEW

To mitigate data leakage, we propose to introduce access control before data leave the control of users. In this way, privacy-oriented metrics can be applied to provide the data exposure with certain privacy guarantees (i.e., data minimization in this paper), and end-user controls also become feasible to satisfy personal privacy preferences. However, it is challenging to attain these goals for the following reasons. First, the data filtering, if not carefully performed, may accidentally affect home automation. Thus, how to precisely analyze apps and convert them into privacy-protection policies correctly is a challenge (Section V-A). Second, most smart home platforms are closed systems and do not allow platform-level modifications. Moreover, the traffic between IoT devices and the hub is encrypted. How to perform the data filtering without modifying the device, hub, or platform framework is challenging (Section V-B). Third, how to provide interfaces for non-expert users to define their own privacy-protection policies is non-trivial (Section V-A2).

For interoperability, the wireless protocols in IoT devices are mostly open-source standard ones such as ZigBee, Z-Wave, LAN, etc., which makes it possible to place a man-in-the-middle device (named mediator) between IoT devices and the hub to intervene in the communication between them. On top of the mediator, it becomes possible to process the raw data flows before forwarding them to the hub. With this insight, we build PFirewall, a system that enforces carefully generated data flow control policies before data are reported to the backend platform for home automation. As shown in Fig. 2, PFirewall comprises the following modules:
• Rule Extractor extracts the home automation rules from rule creation interfaces, e.g., IoT apps, webpages, smartphone apps, etc. When rules are initially installed, the rule extractor obtains rule semantics and rule-device binding information. In appified IoT systems, the rule extractor comprises a code analysis component to extract rule semantics from apps and a configuration collection component to collect rule-device binding information. The rule semantics and rule-device bindings constitute the complete automation logic.
• Policy Manager generates and manages data flow policies used for protecting IoT data. Policy generation, on one hand, interacts with the rule extractor to generate semantics-based data-minimization policies; on the other hand, it takes in user-specified policies from the user interfaces and converts them into an executable format. Conflict detection inspects whether a user-specified policy conflicts with existing data-minimization policies and thus affects home automation; when conflicts are detected, it reports them to the user for a decision. Policy engine interprets and executes the above policies over the incoming raw data from IoT devices.
• Data Flow Mediator is a proxy that mediates the communication between IoT devices and the hub. The mediator, on behalf of the hub, talks with IoT devices via device-dependent protocols (e.g., ZigBee, Z-Wave, WiFi) and forwards the raw device data to the policy engine for processing. On the other hand, the mediator creates a virtual device instance on behalf of each real device to send the processed data to the hub. All virtual device instances use a uniform communication protocol supported by the target platform (e.g., LAN in SmartThings [27] and MQTT in openHAB [28]). Besides, the virtual devices receive device control commands from the hub, which are then translated into protocol-specific commands and forwarded to the corresponding real device. The data mediation is transparent to the platform, and therefore the platform works exactly the same way.

V. DESIGN AND IMPLEMENTATION

In this section, we present the detailed design and implementation of PFirewall. We choose Samsung’s SmartThings, one of the most mature and comprehensive smart home platforms, as the underlying platform to describe the implementation of PFirewall. We first describe our policy generation and management for contextually controlling IoT data flows. Then, we present how we enforce policies in existing IoT systems by introducing a data flow mediator. To show the applicability of PFirewall, we also present how we integrate PFirewall with another platform, openHAB, by adapting the platform-specific components.

    TRIGGER: {
        match    (:type).(:subject).(:attribute)
        satisfy  (:operator)->(:value)
        [fetch1] (:type).(:subject).(:attribute*)
        [branch] (:operator1)->(:value)
        run      (:method)(:parameters)(:delay)
        [else]   (:method1)(:parameters1)(:delay1)
    }
    CHECK: [{
        fetch    (:type).(:subject).(:attribute)
        satisfy  (:operator)->(:value)
        [fetch1] (:type).(:subject).(:attribute*)
        [branch] (:operator)->(:value)
        run      (:method)(:parameters)
        [else]   (:method1)(:parameters1)
    }, ...]

Listing 1: Context-aware policy format
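As a rough illustration of the format in Listing 1, the sketch below encodes a policy as Python dataclasses. The class names (Selector, Constraint, Action, Section, Policy), the field name `otherwise` for the [else] field, and the example device IDs are assumptions made for this example; the paper does not describe how PFirewall stores policies internally.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple

# Hypothetical in-memory encoding of the Listing 1 policy format.
# All class and field names are illustrative, not PFirewall's actual code.

@dataclass(frozen=True)
class Selector:
    type: str        # e.g. "device" or "time"
    subject: str     # device ID
    attribute: str   # e.g. "motion", "temperature"

@dataclass(frozen=True)
class Constraint:
    operator: str    # e.g. "==", ">"
    value: Any

@dataclass(frozen=True)
class Action:
    method: str                       # e.g. "keep", "block"
    parameters: Tuple[Any, ...] = ()
    delay: float = 0.0                # seconds before reporting

@dataclass
class Section:
    """A TRIGGER section (match ...) or one CHECK item (fetch ...)."""
    selector: Selector
    satisfy: Constraint
    run: Action
    fetch1: Optional[Selector] = None      # optional [fetch1]
    branch: Optional[Constraint] = None    # optional [branch]
    otherwise: Optional[Action] = None     # optional [else]

@dataclass
class Policy:
    trigger: Section
    check: List[Section] = field(default_factory=list)

# Example: report a motion event only while the A/C is off.
motion = Selector("device", "motion-1", "motion")
ac = Selector("device", "ac-1", "switch")
policy = Policy(
    trigger=Section(motion, Constraint("==", "active"), Action("keep")),
    check=[Section(ac, Constraint("==", "off"), Action("block"))],
)
```

Keeping the optional fields as `None` by default mirrors the bracketed fields of Listing 1: a section without [fetch1] and [branch] simply always uses its `run` action.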

A. Data Flow Control Policies

1) Policy Definition and Execution: Home automation is context-aware: a rule executes a command when it is triggered by an event and meanwhile the smart home is under the prescribed condition. Note that the event and condition are slightly different: an event describes a context change (e.g., the motion sensor’s reading changes from “inactive” to “active”, which indicates that a motion is detected) while a condition indicates a collection of static statuses (e.g., the motion sensor’s latest reading is “active”). To precisely filter raw IoT data flows for data minimization without interfering with the execution of automation rules, data flows need to be processed contextually. To this end, we define a context-aware policy format.

Formally, we define a data flow policy as P = (T, C), where T and C denote the TRIGGER and CHECK sections in a policy as shown in Listing 1. TRIGGER defines the incoming event that triggers the execution of P, and CHECK encapsulates a list of items, each of which indicates a constraint that must be satisfied for the policy to indeed perform actions. type indicates whether the event is fired by a device or is a time change, etc.; subject identifies a specific IoT device (i.e., device ID); attribute specifies the attribute of a device (which may have multiple attributes) or the time-related feature (e.g., time of day, date, timer). type, subject and attribute are used to check whether an incoming data item matches the event that triggers the policy in TRIGGER, and to query the smart home status for constraint checking in CHECK. operator and value denote a constraint that the incoming event or smart home status must satisfy for the policy context to be evaluated as true. A policy action is defined in the run fields, where method and parameters define how to process the raw data and delay controls the timing for reporting the processed data to the platform. Besides, there are three optional fields marked with “[]” that form an extended TRIGGER section or CHECK item. [fetch1] and [branch] evaluate an extra constraint on the fetched data; if it is true, the action defined in run is executed; otherwise, the action in else is executed instead.

Policies are executed by a policy engine. The policy engine listens to all the incoming raw data from the IoT devices and to time-related information if registered. When receiving a new data item D (a.k.a. an event), the engine uses D to evaluate the maintained data flow policies one by one. Algorithm 1 shows the general workflow of how the engine evaluates and executes a policy P. Specifically, it first checks whether D matches the type, subject, and attribute in TRIGGER, and then examines whether the value of D satisfies the constraint specified by operator and value. If so, P is triggered and proceeds to execute. Then the engine evaluates all items specified in CHECK. Since the data required for evaluating the CHECK items are not newly captured events but the current smart home status (e.g., the device working status), the policy engine fetches the information indexed by type, subject and attribute from a database DB, which stores the latest attribute values of all connected devices and updates them when devices report any change. Only when the constraints defined in all CHECK items are satisfied is the policy finally executed, performing the actions defined in the run or else fields. During the above process, a policy terminates if there is any event mismatch or constraint violation. Besides, the policy engine also maintains another database DB∗ to keep a record of the latest reported data for each device attribute.

Algorithm 1: The algorithm for executing a policy

    Input:  D ← new data item, P ← a privacy policy,
            DB ← newest device status database,
            DB∗ ← newest reported data database
    Output: Privacy-aware data set DS

    if match(D.source, P.TRIGGER.(type, subject, attribute)) and
       satisfy(D.value, P.TRIGGER.(operator, value)) then
        foreach checkitem ∈ P.CHECK do
            val ← fetch(DB, checkitem.(type, subject, attribute))
            if !satisfy(val, checkitem.(operator, value)) then
                return
        if P.TRIGGER.contains([branch]) then
            val∗ ← fetch(DB∗, P.TRIGGER.(type, subject, attribute∗))
            if satisfy(val∗, P.TRIGGER.(operator1, value)) then
                DS ← run P.TRIGGER.(method, parameters, delay)
            else
                DS ← run P.TRIGGER.(method1, parameters1, delay1)
        else
            DS ← run P.TRIGGER.(method, parameters, delay)
        foreach checkitem ∈ P.CHECK do
            if checkitem.contains([branch]) then
                val∗ ← fetch(DB∗, checkitem.(type, subject, attribute∗))
                if satisfy(val∗, checkitem.(operator, value)) then
                    DS ← run checkitem.(method, parameters)
                else
                    DS ← run checkitem.(method1, parameters1)
            else
                DS ← run checkitem.(method, parameters)
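The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: policies are plain dicts keyed by the field names of Listing 1, device attributes are identified by (type, subject, attribute) tuples, both databases are dicts, and the function returns the selected actions instead of running them. The names OPS, satisfy, choose_action, execute_policy, and the example policy for a presence/temperature/fan rule are all hypothetical.

```python
import operator as op

# Illustrative sketch of Algorithm 1. The dict layout, the OPS table and the
# function names are assumptions made for this example.
OPS = {"==": op.eq, "!=": op.ne, ">": op.gt, "<": op.lt, ">=": op.ge, "<=": op.le}

def satisfy(val, constraint):
    """Return True if val meets an (operator, value) constraint."""
    oper, ref = constraint
    return val is not None and OPS[oper](val, ref)

def choose_action(section, db_reported):
    """Pick the run or else action of a TRIGGER section or CHECK item."""
    if "branch" in section:
        last_reported = db_reported.get(section["fetch1"])
        if satisfy(last_reported, section["branch"]):
            return section["run"]
        return section["else"]
    return section["run"]

def execute_policy(event, policy, db_status, db_reported):
    """Return the (attribute, action) pairs to perform, or [] if skipped.

    event:       {"source": (type, subject, attribute), "value": ...}
    db_status:   latest attribute values of all devices (DB)
    db_reported: latest values reported to the platform (DB*)
    """
    trigger = policy["TRIGGER"]
    if event["source"] != trigger["match"]:
        return []
    if not satisfy(event["value"], trigger["satisfy"]):
        return []
    # Every CHECK constraint must hold against the current home status.
    for item in policy["CHECK"]:
        if not satisfy(db_status.get(item["fetch"]), item["satisfy"]):
            return []
    actions = [(trigger["match"], choose_action(trigger, db_reported))]
    for item in policy["CHECK"]:
        actions.append((item["fetch"], choose_action(item, db_reported)))
    return actions

# Hypothetical policy: presence sensor ps, temperature sensor ts, fan f.
PS = ("device", "ps", "presence")
TS = ("device", "ts", "temperature")
FAN = ("device", "f", "switch")
policy_r = {
    "TRIGGER": {"match": PS, "satisfy": ("==", "present"),
                "fetch1": PS, "branch": ("==", "present"),
                "run": ("diffKeep", (), 0), "else": ("keep", (), 0)},
    "CHECK": [
        {"fetch": TS, "satisfy": (">", 86),
         "fetch1": TS, "branch": (">", 86),
         "run": ("block", ()), "else": ("randomize", (86, "MAX"))},
        {"fetch": FAN, "satisfy": ("!=", "ON"), "run": ("block", ())},
    ],
}

event = {"source": PS, "value": "present"}
result = execute_policy(event, policy_r, {TS: 90, FAN: "OFF"},
                        {PS: "not present", TS: 80})
```

In this run the platform last saw "not present" and 80, so the sketch selects keep for the presence event, randomize(86, MAX) for the temperature, and block for the fan state. Separating the CHECK pass from the action-selection pass follows the two loops of Algorithm 1: no action is chosen until every constraint has been confirmed.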

2) Policy Generation: PFirewall generates two types of policies: automation-based data-minimization policies (APs) and user-specified policies (UPs). To achieve data minimization, i.e., to report only the minimum amount of data necessary for home automation, rules are extracted from installed automation apps and analyzed to find the minimum data flows required for the rules to execute. UPs are generated from user interfaces and work alongside APs; they are an important supplement for expressing privacy preferences that cannot be learned from home automation.

Fig. 3: The policy derivation from an automation rule. [Figure content: an automation rule (Event: presence sensor == "present"; Condition: temperature sensor > 86; Action: turn on the fan) is mapped to a data flow policy whose TRIGGER section (match, satisfy, fetch1, branch, run, else) and CHECK items (fetch, satisfy, fetch1, branch, run, else) are filled in from the rule.]

Automation Rule Extraction. Rule extraction is the first step for AP generation. Automation rules follow an event-condition-action model and are installed by installing IoT apps or selecting rule templates on web or mobile app interfaces. Rule extraction for both methods has been widely studied in the literature. Code analysis has proved to be an effective way to extract rule semantics from IoT apps. For example, by applying Abstract Syntax Tree (AST) analysis to SmartApps, [29] identifies requested and used capabilities in SmartApps, [7], [30] break down SmartApps and extract rule information, and [31], [32], [33] build Deterministic Finite Automata (DFAs) from SmartApps. Symbolic execution is a more powerful technique to analyze rule semantics from apps [19], [20]. Text data crawling and natural language processing (NLP) are used for rule extraction from web pages and mobile apps [32], [34].

Rather than design another code analyzer, in this paper we adapt the solution provided in [19] to implement our rule extractor, since it not only implements a complete symbolic executor with API modeling but also provides an app-device binding collection approach. We obtained the source code from the authors and verified its effectiveness on 86 SmartApps from the SmartThings market. The executor works on the AST representation of a SmartApp; the rule extraction starts from an event subscription method subscribe() (the event that triggers a rule) and traces into the entry point of the event handler method. All paths branching at if-else statements (rule conditions) are explored until a sink (rule action) is found; expressions (e.g., value assignments) and APIs (e.g., device access methods, device control commands) along the paths are modelled. (Due to page limits, we refer interested readers to [19] for more details.) The combination of control flow analysis and data flow analysis allows us to extract the rule context (event and condition) and command (action) from a SmartApp. The right column of Fig. 3 shows the extracted rule from a temperature control SmartApp that defines a rule R: “when a presence sensor ps becomes present, if the reading of a temperature sensor ts is higher than 86°F, turn on the fan f”.

Data-Minimization Policy Generation. Consider the example rule R. By default, the platform continuously receives and stores data streams from devices (presence sensor, temperature sensor, fan). However, we observe that these data are not all required for executing R in the following cases:
(1) The presence sensor ps does not send any event;
(2) ps sends a “not present” event;
(3) The indoor temperature measured by ts is lower than 86°F;
(4) The fan f is “ON”;
(5) ps sends a “present” event and the last reported temperature by ts is higher than 86°F.
In cases (1)-(4), there is no need to report any data from ps and ts to the platform; in case (5), it is unnecessary to report temperature data since the temperature value stored in the platform database already satisfies the rule condition check; in no case is the ON/OFF state of f useful for executing R. From this example, we can conclude that only sporadic items in the data streams of devices are required for home automation, which motivates us to encode highly structured automation rules into data-minimization policies. An example of generating an AP from R is shown in Fig. 3. The TRIGGER of the AP is derived from the Event of R, and CHECK is derived from the Condition and Action of R, respectively. According to the policy definition and execution algorithm presented in Section V-A1, the derived AP expresses multi-faceted information for PFirewall to process data:
1) Context: when and only when an incoming event of ps is “present” and meanwhile the latest received reading of ts is higher than 86°F and the state of f is not “ON”, some data will be reported; otherwise, the policy will be skipped and no data will be reported at all;
2) Event reporting: if the latest reported value of ps is “present”, use the diffKeep() method to process the current value for reporting; otherwise, use keep();
3) CHECK data reporting: if the latest reported value of ts is higher than 86°F, use the block() method to process the current value of ts; otherwise, use randomize(86, MAX); use block() to process the state data of f.

Table I shows a summary of all the methods used in the run and else fields. In the default setting, binary sensors such as the presence sensor report binary values alternately; thus, SmartThings only fires an event when observing a value change. Our data flow control breaks the alternation of “present” and “not present” values in the data stream of ps. Thus, when the platform receives “present” but finds the last value is also “present”, it will not issue a “present” event in its framework and R cannot be triggered. Hence, the derived AP uses diffKeep() rather than keep() to address this issue; diffKeep() reports “not present” followed by “present” with a time delay T, which ensures a “present” event is fired. It is worth mentioning that the selection of T is non-trivial for guaranteeing the normal execution of home automation, because it must allow time for the platform to update received data in its database. Similarly, SmartThings must have updated the temperature value (if necessary) in its database before it issues a “present” event to R; otherwise, the app will fail the temperature condition check when triggered by the event.
We manually observed app execution while tuning T and found that a value as small as 100 milliseconds caused no failure in 1000 trials.

TABLE I: Summary of methods used in data flow policies

Method                 Description
keep()                 Report the original value
block()                Do not report
diffKeep()             Report a different value and then the original value
randomize(MIN,MAX)     Report a random value ∈ (MIN, MAX)
pickOther(CUR,ENUM)    Randomly pick a value (≠ CUR) from the set ENUM

TABLE II: Boundary values for randomizing different attributes

Attribute      Min    Max       Unit
Temperature    -50    150       °F
Illuminance    0      100000    Lux
Humidity       0      100       %
Power          0      1800      Watt

block() discards data without sending it. randomize() randomizes float-valued attribute data (e.g., temperature). In the example, the temperature is compared with a threshold (86 °F), so a random value between 86 °F and the upper limit MAX of the temperature attribute is sufficient for the condition check. MAX and MIN denote the upper and lower boundaries of a specific attribute (see Table II). We obtain such information from the SmartThings Capabilities Reference [35]. In addition, we present how PFirewall handles time/timer-related automation in Appendix B.
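To make the processing methods of Table I concrete, the following is a minimal Python sketch of how they could be applied to the AP derived from rule R. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `process_rule_r` helper, its parameters, and the dictionary layout of the result are all hypothetical, and the delay T used by diffKeep() is not modeled.

```python
import random

# Attribute boundaries copied from Table II: (MIN, MAX).
BOUNDS = {
    "temperature": (-50, 150),    # °F
    "illuminance": (0, 100000),   # Lux
    "humidity": (0, 100),         # %
    "power": (0, 1800),           # Watt
}
PRESENCE = ["present", "not present"]

def keep(cur):
    """Report the original value."""
    return [cur]

def block(cur):
    """Do not report anything."""
    return []

def pick_other(cur, enum, rng=random):
    """Randomly pick a value different from cur out of enum."""
    return rng.choice([v for v in enum if v != cur])

def diff_keep(cur, enum, rng=random):
    """Report a different value first, then the original value.
    The delay T between the two reports is not modeled here."""
    return [pick_other(cur, enum, rng), cur]

def randomize(lo, hi, rng=random):
    """Report a random value strictly inside the open interval (lo, hi)."""
    value = rng.uniform(lo, hi)
    while value <= lo or value >= hi:   # uniform() may return an endpoint
        value = rng.uniform(lo, hi)
    return [value]

def process_rule_r(ps_event, cur_ts, fan_state, reported_ps, reported_ts):
    """Hypothetical evaluation of the AP derived from rule R.

    ps_event, cur_ts, fan_state: current raw values seen by the mediator.
    reported_ps, reported_ts: values most recently reported to the platform.
    Returns the values to report per data source ({} if the policy is skipped).
    """
    # Context: skip the policy unless R would actually fire.
    if not (ps_event == "present" and cur_ts > 86 and fan_state != "on"):
        return {}
    hi = BOUNDS["temperature"][1]
    return {
        # Force a value change on the platform if it already stores "present".
        "ps": diff_keep(ps_event, PRESENCE) if reported_ps == "present"
              else keep(ps_event),
        # The platform only needs *some* temperature above the threshold.
        "ts": block(cur_ts) if reported_ts is not None and reported_ts > 86
              else randomize(86, hi),
        # The fan state is never useful for executing R.
        "f": block(fan_state),
    }
```

For example, `process_rule_r("present", 90, "off", "present", 70)` reports "not present" followed by "present" for ps, a random temperature in (86, 150) for ts, and nothing for f, so the real temperature reading never leaves the home.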

User-Specified Policy Generation

We propose an interactive approach for users to specify data flow control policies. This is motivated by three reasons: 1) users have individual privacy preferences that cannot be derived from automation rules; for example, users might prioritize privacy over automation functionality for some device types during a time period or under certain situations; 2) the platform may integrate a third-party service for which no rule extractor is available to extract semantics; 3) users have the right to control the use of their data. In principle, UPs have higher priority than APs in controlling data.

We develop a mobile app for end users to specify policies. As shown in Fig. 4(a), information is displayed to help users understand what privacy issues each device and its data may imply. With the templates in Fig. 4(b), users can configure whitelist, blacklist and conditional control policies for a specified time period or under certain contexts. Finally, UPs are encoded into the policy format in Listing 1 for execution. See Appendix E for the user survey we conducted to evaluate the policy templates.

Fig. 4: Screenshots of the PFirewall mobile app, panels (a) and (b). The app provides an information tab showing users what data every device type generates and the corresponding privacy implications, and a policy tab that allows users to define context-aware data control policies.

3) Policy Conflicts:

A user is likely to define UPs that conflict with existing APs and hinder automation, since UPs are designed to override APs. Nevertheless, users need a warning that shows them which conflicts are introduced and which automation rules are affected. Therefore, automated policy conflict detection is necessary. Two policies P_1 and P_2 conflict if the following requirements are satisfied: (1) P_1 and P_2 are triggered simultaneously, i.e., an event makes both constraints c_T^1 and c_T^2 (defined in the TRIGGER fields of P_1 and P_2, respectively) hold; (2) both policies are finally executed, i.e., all the constraints c_i^1 and c_i^2 in the CHECK fields of both policies evaluate to true; (3) the two policies define different actions (i.e., data processing methods, parameters, or delays) for the same data. Formally, let S(C) denote the set of all possible contexts that satisfy the set of constraints C, and let O(a) and E(a) denote the object (i.e., the controlled data) and the effects of a certain action a (defined in both the TRIGGER and CHECK fields). A conflict occurs when the following formula holds:

\begin{equation}
\begin{cases}
S(c_T^1) \cap S(c_T^2) \neq \emptyset, \\
S(c_1^1, c_2^1, \cdots) \cap S(c_1^2, c_2^2, \cdots) \neq \emptyset, \\
\exists\, i, j:\; O(a_i^1) = O(a_j^2),\; E(a_i^1) \neq E(a_j^2).
\end{cases}
\tag{1}
\end{equation}

We detect policy conflicts for each newly submitted UP against all APs. To calculate the constraint overlap in the first two formulas of Equation 1, we encode each constraint in a policy as a quantifier-free first-order formula of the form

(type[.subject[.attribute]]) (operator) (value),

where the first component identifies the data source and type. Thus, the constraint overlap is transformed into a constraint satisfaction problem that can be solved by a constraint programming (CP) solver. In our implementation, we use a JavaScript linear solver, javascript-lp-solver [36]. If the constraint satisfaction problem is solvable, the two policies will be executed simultaneously. We then check whether the two policies perform different actions (by looking at the methods and parameters in the run and else fields) on the same data flow; if so, the new UP conflicts with an existing AP. The automation app from which the AP was derived would be affected and is displayed to users so that they can make a decision.
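The three conditions of Equation 1 can be illustrated with a small Python sketch. Note the substitution: instead of the linear solver used in the implementation, this sketch checks satisfiability by intersecting numeric intervals per data source, which only covers conjunctions of simple comparisons. The `Constraint`, `Action` and `Policy` classes, the source names such as `sensor.ts.temperature`, and the effect strings are hypothetical stand-ins for the policy format.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    source: str     # hypothetical "type.subject.attribute" identifier
    op: str         # one of "<", "<=", ">", ">=", "=="
    value: float

@dataclass(frozen=True)
class Action:
    target: str     # O(a): the data flow the action controls
    effect: str     # E(a): e.g. "block()" or "keep()"

@dataclass(frozen=True)
class Policy:
    trigger: tuple  # constraints in the TRIGGER field
    check: tuple    # constraints in the CHECK field
    actions: tuple

def feasible(constraints):
    """True if some context satisfies every constraint (numeric values only)."""
    lower, upper = {}, {}   # per source: (bound, is_strict)
    for c in constraints:
        lo = lower.get(c.source, (-math.inf, False))
        hi = upper.get(c.source, (math.inf, False))
        if c.op in (">", ">=", "=="):
            cand = (c.value, c.op == ">")
            if cand[0] > lo[0] or (cand[0] == lo[0] and cand[1]):
                lo = cand   # tighter lower bound
        if c.op in ("<", "<=", "=="):
            cand = (c.value, c.op == "<")
            if cand[0] < hi[0] or (cand[0] == hi[0] and cand[1]):
                hi = cand   # tighter upper bound
        lower[c.source], upper[c.source] = lo, hi
    for source in lower:
        lo, hi = lower[source], upper[source]
        if lo[0] > hi[0] or (lo[0] == hi[0] and (lo[1] or hi[1])):
            return False    # empty interval for this source
    return True

def conflicts(p1, p2):
    """Check the three conditions of Equation 1 for two policies."""
    if not feasible(p1.trigger + p2.trigger):   # S(c_T^1) ∩ S(c_T^2) ≠ ∅
        return False
    if not feasible(p1.check + p2.check):       # CHECK constraints overlap
        return False
    return any(a.target == b.target and a.effect != b.effect
               for a in p1.actions for b in p2.actions)
```

As a usage example, an AP that randomizes `sensor.ts.temperature` when the temperature exceeds 86 and a UP that blocks the same data flow between hours 17 and 22 would be reported as conflicting, whereas a UP whose CHECK requires the temperature to be below 80 would not, since the two CHECK fields can never hold together.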

B. Data Flow Mediation

To enforce data flow policies in a closed-source IoT system, we introduce a data flow mediator for relaying the communication between IoT devices and the hub, as shown in Fig. 5. To this end, the mediator needs to (1) act as a hub to interact with IoT devices and (2) generate a virtual device to interact with the original hub on behalf of each real device.

Fig. 5: The workflow of the data flow mediator. Components: IoT device, home gateway (virtual device manager and virtual device instances), and IoT hub. Original flow: the device and the hub exchange join/leave messages, raw data and commands directly. PFirewall flow: the device exchanges join/leave messages, raw data and commands with the home gateway, which creates or removes virtual device instances; these instances join or leave the hub, send it privacy-aware data, and receive its commands.

1) Connecting IoT Devices:

To play the role of a hub, the mediator needs to handle three major interactions with IoT devices: 1) devices join or leave the hub-led network; 2) devices report attribute data to the hub; 3) the hub forwards commands from the platform to devices. The hub functionality is provided by many open-source platforms, e.g., openHAB [37] and Mozilla IoT [21], which allow developers to write add-ons for integrating various IoT devices that use different communication techniques. To date, openHAB supports 275 bindings that have been tested to work with hundreds of commercial IoT devices, and Mozilla IoT has also been tested with more than 100 mainstream devices. In our implementation, we adapt the source code of Mozilla IoT to connect with ZigBee and Z-Wave devices, since the two techniques are widely used by IoT devices; specifically, the mediator is built on a Raspberry Pi with a Digi XStick USB dongle (ZB mesh version) and an Aeotec Z-Stick (Gen5) to provide ZigBee and Z-Wave capabilities, respectively.

C. Connecting the Hub and Platform

To interact with a target platform on behalf of a real device, the mediator creates a virtual device that can (1) talk to the hub using a communication technique the hub supports, and (2) be identified as a compatible device by the platform framework. Most emerging platforms support various connectivity protocols for developers to build customized network devices; for example, SmartThings supports LAN- and cloud-based device integration [27], openHAB supports the Message Queuing Telemetry Transport (MQTT) protocol [38], Mozilla IoT provides a REST-based Web Things framework and APIs [39], and Wink allows creating RESTful API devices [40]. This feature reduces the workload of interfacing with a target platform. We implement the mediator to work with two representative platforms: SmartThings and openHAB. Due to the page limit, we present the openHAB part in Appendix C1.

Interfacing with SmartThings

We choose LAN as the protocol for communicating with the SmartThings hub since PFirewall is designed to be segregated from a DMZ by a firewall; thus, attackers cannot initiate any connection to PFirewall to obtain data. SmartThings provides a device handler (see Section II) for abstracting each supported device type; accordingly, we build a virtual device (VD) type for each device handler (DH) that originally supports ZigBee or Z-Wave devices, as shown in Fig. 6. We develop a service manager SmartApp on SmartThings that uses SSDP (Simple Service Discovery Protocol) to discover VD instances on the LAN. To be treated as different devices (SmartThings uses the IP address and port to uniquely identify a device), each VD instance is launched on a different port. After discovering a device, the service manager adds it as a child device. When a child device is added, SmartThings automatically selects a DH to abstract it according to the model property of the child device; thus, we set the model property of the VD instance (the child device) to the name of the target DH that is used to represent the corresponding real device. After the initial connection, a VD instance on the mediator side interacts with a DH instance on the SmartThings side using the UPnP (Universal Plug and Play) protocol, which uses SOAP (Simple Object Access Protocol) messages.

Additionally, we adapt all DHs for ZigBee/Z-Wave devices available in the SmartThings IDE. In each DH, we add a subscribe() function that accomplishes the SUBSCRIBE step of UPnP communication; when a DH is instantiated (which means that a VD instance has been created and a child device has been added), it uses the IP and port to send a SUBSCRIBE SOAP message to the VD instance, providing its own IP and port information. Moreover, in each DH we replace the code in parse and the command-related functions, which receive ZigBee/Z-Wave data and send ZigBee/Z-Wave commands, respectively, with code for receiving and sending SOAP messages. Thus, the VD and DH instances become addressable to each other and realize subscribe/publish-based UPnP communication to report data and send commands.

Fig. 6: Overview of interfacing with SmartThings. On the PFirewall side, the virtual device manager hosts virtual devices (e.g., SmartOutlet, MotionSensor); on the SmartThings cloud side, the service manager and the corresponding device handlers communicate with them via SSDP (discovery) and UPnP (subscribe, command, data).

VI. EVALUATION

A. Evaluation Setup

We build two real-world testbeds for evaluating the performance of PFirewall: an office with 5 members (T1) and a two-bedroom apartment with 1 member (T2), as shown in Fig. 7. In each testbed, we deployed two parallel systems (SYS1 and SYS2) by placing two identical devices at each position in Fig. 7; SYS1 and SYS2 have the same device types, numbers, placement and app configuration, as shown in Table III, Fig. 7 and Table IV. The only difference is that SYS1 is a standard SmartThings deployment whereas SYS2 introduces PFirewall. We bind SYS1 and SYS2 in each testbed to two different SmartThings accounts and run them simultaneously but independently. We choose SmartThings for the real-world testbeds because SmartThings provides official apps in its app store, while openHAB requires users to write automation apps and provides no market apps. Instead, we perform micro-benchmark tests for evaluating openHAB (see Appendix C2).

TABLE III: Devices in the two real-world testbeds

Testbed           Device (Abbreviation)         Attribute                        Number
Office (T1)       SmartThings hub v2 (HUB)      -                                1
                  Multipurpose sensor (MU)      contact, temperature             1
                  Motion sensor (MO)            motion, temperature              1
                  Smart outlet (OL)             switch, power                    2
                  Smart bulb (SL)               switch                           1
                  Smartphone (SP)               presence                         5
Apartment (T2)    SmartThings hub v2 (HUB)      -                                1
                  Multipurpose sensor (MU)      contact, temperature             3
                  Motion sensor (MO)            motion, temperature              2
                  Smart outlet (OL)             switch, power                    2
                  Smart bulb (SL)               switch                           4
                  Aeotec MultiSensor (AM)       motion, humidity, illuminance    2
                  Smartphone (SP)               presence                         1

Fig. 7: The layout and device placement in the two testbeds. (a) The office: MO1, MU1, SL1, OL1, OL2, SP1-SP5. (b) The apartment: MU2-MU4, SL2-SL5, AM1, AM2, OL3, OL4, MO2, MO3, SP6.

B. Performance of Data Mediating

To test the correctness of the PFirewall mediator, we disable data filtering in SYS2 of both testbeds, i.e., the mediator simply forwards the IoT data to SmartThings without executing policies. To capture the data received by SmartThings, we insert log.debug code into the parse methods of all device handlers for the tested devices, which allows us to record the event logs per device in the SmartThings web IDE. We observe duplicate events in the captured SmartThings event logs, so we remove duplicates before the analysis; consecutive events that have the same modality (the same device, attribute and value) and very close timestamps (no more than 1 second apart) are regarded as duplicates. We run the above setting in SYS2 of both testbeds for 10 days and compare the data sequence of each device received by the PFirewall mediator with that received by SmartThings. Table V shows the total number of data items received per device and the number of inconsistencies in the data sequences. The result shows that our mediator relays the received data to the platform effectively and correctly.
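The duplicate-removal rule described above can be sketched in a few lines of Python. This is only an illustration: the `Event` record and the function name are hypothetical, and the sketch assumes that each event is compared with the immediately preceding event of the same device and attribute (whether or not that event was itself kept), which is one possible reading of "consecutive".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    device: str
    attribute: str
    value: str
    timestamp: float   # seconds

def remove_duplicates(events, window=1.0):
    """Drop events that repeat the previous event of the same stream.

    An event is a duplicate when the preceding event with the same device and
    attribute carries the same value and is at most `window` seconds older.
    """
    kept = []
    last_seen = {}   # (device, attribute) -> previous event in that stream
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.device, ev.attribute)
        prev = last_seen.get(key)
        is_duplicate = (prev is not None
                        and prev.value == ev.value
                        and ev.timestamp - prev.timestamp <= window)
        if not is_duplicate:
            kept.append(ev)
        last_seen[key] = ev
    return kept
```

With this rule, two "active" motion events logged 0.4 seconds apart collapse into one, while two "inactive" events logged 2 seconds apart are both kept.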

C. Performance of Policy System

To test the performance of our policy system, we set up a comparative experiment by running SYS1 and SYS2 simultaneously in both testbeds for another 10 days. We enable data filtering in SYS2, so SYS2 in this experiment runs the data-minimization policies. In addition, we define two extra user-specified policies: UP1 (DO NOT report MO1.motion data between 5pm and 10pm) in T1 and UP2 (DO NOT report MU2.contact data between 8am and 6pm) in T2.

1) Correctness and Reliability:

Comparing the received data sequences is meaningless since data are filtered in SYS2, so we test the correctness of the execution of SmartApps. To capture the execution of apps, we manually insert logging code into the installed SmartApps to record the method calls for controlling devices and sending notifications. We compare the method call sequences of each app in SYS1 and SYS2 and calculate the number of inconsistencies. We summarize the result in Table VI.

TABLE IV: SmartApp and device settings for the evaluation environments. O: official app, C: custom app. Each entry lists SmartApp (Abbreviation) (Source), followed by the description and device bindings.

Testbed T1:
UndeadEarlyWarning (UEW) (O): When door (MU1) is opened, turn on light (SL1).
LightsOffWithNoMotionAndPresence (LON) (O): When no motion (MO1) or presence (SP1∼) is detected for 5 minutes, turn off light (SL1).
MyAutoCoffee (MAC) (C): When presence (SP1) becomes present, if time is before 12am, turn on coffee machine (OL1).
MyAutoHeater (MAH) (C): When motion (MO1) detected, if temperature (MU1) < °F, turn on heater (OL2).
MyFitnessNotification (MFN) (C): When motion (MO1) is active for longer than 60 minutes, send a message to alert.
StrangerNotification (STN) (C): When door (MU1) is open, if no presence (SP2∼), send a message.

Testbed T2:
UndeadEarlyWarning (UEW) (O): When door is opened (MU2), turn on light (SL2).
SmartLights (SML) (O): When motion (MO2) active if illuminance (AM1) < LUX, turn on light (SL3).
TurnOnOnlyIfIArriveAfterSunset (TOO) (O): When presence (SP6) becomes present if between 5-8pm, turn on oven (OL3).
TextMeWhenThere'sMotionAndI'mNotHere (TMW) (O): When motion (MO2) active if not presence (SP6), send a notification.
LetThereBeLight! (LTB) (O): When wardrobe door (MU4) open, turn on light (SL5); when door (MU4) close, turn off light (SL5).
VirtualThermostat (VIT) (O): When motion (MO2) is detected if temperature (MU3) < °F, turn on heater (OL4); when motion (MO2) inactive for 20 minutes, turn off heater (OL4).
SmartBedroomLight (SBL) (C): When door (MU3) opened, turn on light (SL4); when door (MU3) closed if motion (MO3) inactive for 5 minutes, turn off light (SL4).
NotifyMeWhenSomeoneFaints (NMW) (C): When humidity (AM2) exceeds 85% if motion (AM2) active but motion (MO3) keeps inactive for 30 minutes, send a notification.

TABLE V: Statistics of the data received by the PFirewall mediator and that received by SmartThings for the evaluation of data mediating. Due to page limits, we only present the result of one device for each device type. Total: the total data volume received by the SmartThings cloud and the PFirewall mediator, respectively.

Testbed    Device    Attribute      Total         Inconsistency
T1         MU1       contact        1960, 1960    0
T1         MU1       temperature    174, 174      0
T1         MO1       motion         2198, 2198    0
T1         MO1       temperature    325, 325      0
T1         OL1       switch         38, 38        0
T1         SL1       switch         24, 24        0
T1         SP1       presence       62, 62        0
T2         AM1       motion         384, 384      0
T2         AM1       humidity       656, 656      0
T2         AM1       illuminance    927, 927      0

TABLE VI: Statistics of SmartApp method call logs (data control results). MC: the number of method calls; INC: the number of inconsistencies; INCA: the number of inconsistencies after eliminating redundant method calls.

Testbed    App    MC in SYS1    MC in SYS2    INC    INCA
T1         UEW    971           11            960    0
T1         LON    11            11            0      0
T1         MAC
T1         MAH
T1         MFN    13            11            2      2
T1         STN
T2         UEW    26            10            16     3
T2         SML    41            41            0      0
T2         TOO    11            7             4      0
T2         TMW
T2         LTB    42            42            0      0
T2         VIT    235           49            186    0
T2         SBL    164           55            109    0
T2         NMW

We find that the INC values of some apps are large. This is because SmartThings apps do not check a device's current status before sending it a command and thus make redundant method calls, whereas PFirewall by design suppresses redundant automation commands to reduce the reported data. For instance, the app UEW calls the light turn-on method every time the door (MU1) is opened, no matter whether the light is "on" or "off"; however, our data flow policies avoid the redundant method calls if the light's status is already "on". Thus, inconsistencies are detected in some apps. To eliminate the impact of redundant automation on the evaluation of automation accuracy, we capture and remove the redundant method calls from the method call sequences in SYS1 by analyzing app and device logs; specifically, if the effect of a method call is to change a device to a state that the device is already in, the method call is identified as redundant and removed from the sequence. We then recalculate the inconsistencies, denoted as INCA. As shown in Table VI, INCA is 0 for most apps; the exceptions are MAC and MFN in T1 and UEW in T2.

We manually analyze the causes of these inconsistencies by examining the device event and method call logs. We find that the event log of SP1 in SYS2 has one more "present" event than that of SYS1. This is because SmartThings detects presence by monitoring the distance of a smartphone (GPS data) from the in-home hub, while PFirewall scans the home WiFi network to determine whether a smartphone enters or leaves; when SP1 moves around, the two methods detect different presence statuses due to their distinct detection ranges, leading to the inconsistency in MAC. The inconsistencies in MFN and UEW appear because the user-specified policies UP1 and UP2 block MO1.motion and MU2.contact data during certain periods, respectively. We verify that the 2 inconsistencies in MFN occur during 5pm-10pm and the 3 inconsistencies in UEW occur during 8am-6pm. We also observe that no MO1.motion or MU2.contact data are received by SmartThings in SYS2 during the periods specified in UP1 and UP2, respectively. The above result shows the correctness of our policy-based data flow control in enforcing user-specified policies and in preserving home automation functionality through the generated data-minimization policies.
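The elimination of redundant method calls used to obtain INCA can be sketched as follows. This is a simplified illustration, not the authors' analysis script: the function name and the `(device, command)` tuple format are hypothetical, and the sketch assumes that device state changes only through the logged commands, whereas the evaluation also consults device logs.

```python
def remove_redundant_calls(calls, initial_state=None):
    """Remove method calls that would not change the device state.

    calls: list of (device, command) tuples in time order, where the command
           names the state it sets (e.g. "on" or "off").
    initial_state: optional dict mapping device -> state before the first call.
    """
    state = dict(initial_state or {})
    kept = []
    for device, command in calls:
        if state.get(device) == command:
            continue               # device is already in that state: redundant
        state[device] = command
        kept.append((device, command))
    return kept
```

For instance, if UEW turns on SL1 three times in a row before the light is switched off and on again, only the first "on", the "off" and the final "on" remain, which is the sequence a deployment that suppresses redundant commands would be expected to produce.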

2) Latency:

We show the efficiency of PFirewall by measuring the automation latency it introduces (mediating delay plus policy execution delay). We obtain the result by computing the timestamp difference of the same command in the two command sequences (SYS1 and SYS2). To reduce the influence of network delay and cloud response latency on the result, we exclude from our calculation the outliers where the command in SYS1 is issued even later than that in SYS2. We calculate the automation latency for each SmartApp in both testbeds and show the result in Fig. 8. The automation latency ranges from 124.7 to 486.4 milliseconds. An average latency of 210.6 milliseconds is the tradeoff for using PFirewall to mitigate privacy leakage, and this latency is completely acceptable for most automation apps.

Fig. 8: Automation latency introduced by PFirewall (x-axis: deployed SmartApps; y-axis: latency in ms). The boxes show the maximum, quartile, average and minimum values of the majority of latencies per app. The blue squares are outliers.

TABLE VII: Comparison of reported data volume per device before and after the deployment of PFirewall. VOL: volume of reported data in SYS1 and SYS2, respectively; RR: data reduction rate. We present the result for a subset of devices. See Appendix D for the result of all deployed devices.

Dev    Attr           VOL         RR      Attr           VOL       RR
MU1    contact        1924, 22    0.98    temperature    142, 6    0.96
MO1    motion         2266, 47    0.98    temperature    307, 0    1
OL1    switch         29, 0       1
SL1    switch         22, 0       1
SP1    presence       34, 24      0.29
MU2    contact        52, 24      0.54    temperature    118, 0    1
MO2    motion         364, 68     0.81    temperature    173, 0    1
OL3    switch         44, 0       1
SL2    switch         60, 0       1
AM1    motion         364, 0      1
AM1    illuminance    1039, 1     0.99    humidity       668, 0    1
SP6    presence       28, 12      0.57

3) Reduction of Data Leakage:

To show the effectiveness of data filtering, we compare the data volume reported by each device in SYS1 and SYS2 of both testbeds. As shown in Table VII, PFirewall blocks 96.87% of IoT data on average. More than 99% of float-value sensor readings and device states (e.g., ON/OFF states of coffee machines, setpoints of thermostats, locked/unlocked states of smart locks) are blocked; thus, PFirewall prevents the smart home platforms and potential attackers from learning private information about smart homes and homeowners based on float-value sensors and household appliances. PFirewall also reduces the reporting of binary-value sensor attributes (contact, motion, presence) to varying extents, depending on the specific automation app semantics and app-device bindings. The reduction rate RR of binary-value attributes is generally smaller than that of float-value attributes, since binary attributes are used for triggering the execution of automation apps in most cases and hence cannot be totally blocked.

4) Privacy Gain:

To show how privacy preservation is achieved by reducing data leakage, we compare the potential privacy leakage under several inference attacks with and without PFirewall.

Office members and events profiling. By analyzing the presence sensor (SP1∼5) data in the research lab testbed (T1), the working hours of the 5 members (persons 1∼5), each of whom carries a presence sensor, can be learned from their entering and leaving times, as shown in Fig. 9(a). In addition to monitoring user presence in real time, the attacker could also learn personal working preferences and group events. For example, person 1 may leave for classes each Tuesday and Wednesday; person 3 works fewer hours than persons 1 and 2 during weekdays but shows up more on weekends; person 4 has a more regular routine throughout the weekdays; person 5 works fewer hours (4 or so) every day, and the hours tend to be in the afternoon; moreover, the members may leave for a group meeting on Friday morning. When PFirewall is deployed, most presence data are filtered, since only the "present" events before 12am from SP1 are required to turn on the coffee machine outlet (see app MAC). The presence sensor data of the other persons are never sent because their values are kept as "not present" in the platform database, and "not present" events from SP1 are sent only when the last person leaves, so that the app LON passes its condition check; this hides the real leaving time of person 1. Therefore, an attacker could only correctly learn when person 1 arrives at the lab room (see Fig. 9(b)).

Bathroom usage monitoring. By accessing the motion and humidity data of the Aeotec MultiSensor (AM2) in the apartment testbed (T2), an attacker can learn the bathroom usage habits. As depicted in Fig. 10(a), the attacker simply pairs each "active" event with the next "inactive" event to obtain the start and end time of a bathroom usage. Moreover, the attacker can also use the humidity data (see Fig. 10(c)) as additional information to help recognize "having shower" activities in the bathroom. In the experiment, the attacker identifies 4 "having shower" activities by comparing the humidity values with a common-sense threshold (i.e., 85%). When PFirewall is applied, the humidity data is rarely sent (only for executing the anomalous activity detection app NMW), and motion "active" (AM2) is reported only once to keep the motion value "active" in the platform database. As shown in Fig. 10(b) and 10(d), the humidity and motion data are each sent only once in our one-week experiment, preventing the attacker from monitoring and learning the bathroom usage habits.
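The bathroom usage inference described above is simple enough to sketch in Python. The sketch is illustrative only: the function names and input formats are hypothetical, timestamps are plain numbers (e.g., minutes), and labeling a usage interval as a shower when any humidity reading inside it exceeds the 85% threshold is an assumption about how the two data streams are combined.

```python
def usage_intervals(motion_events):
    """Pair each "active" event with the next "inactive" event.

    motion_events: list of (timestamp, value) tuples sorted by time, where
                   value is "active" or "inactive".
    Returns a list of (start, end) bathroom usage intervals.
    """
    intervals, start = [], None
    for timestamp, value in motion_events:
        if value == "active" and start is None:
            start = timestamp
        elif value == "inactive" and start is not None:
            intervals.append((start, timestamp))
            start = None
    return intervals

def shower_intervals(intervals, humidity_readings, threshold=85.0):
    """Keep the intervals containing a humidity reading above the threshold.

    humidity_readings: list of (timestamp, relative_humidity_percent) tuples.
    """
    return [(start, end) for start, end in intervals
            if any(start <= t <= end and h > threshold
                   for t, h in humidity_readings)]
```

When the platform only ever receives a single "active" event, as in the deployment with PFirewall, `usage_intervals` returns an empty list, which is why the attack yields nothing.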

Appliance monitoring. Non-intrusive load monitoring (NILM) techniques can infer appliance events from electricity data, causing privacy concerns [41], [42]. We set up another experiment to learn how attackers are prevented from inferring appliance working status and user activities when power data are protected. We connect a microwave, a kettle and a stove to a smart outlet and install an automation app that turns off the outlet when a user leaves home to avoid fire accidents. Although the app only needs presence sensor data to operate, the outlet also measures real-time power data and reports it to the outside. To study the incurred privacy risk, we collect the reported raw power data (see Fig. 11(a)) for 3 days and perform inference attacks. The attack process includes data pre-processing, clustering and mapping (Fig. 11(b)-11(d)). The inference achieves 95.7% precision and 92% recall in identifying appliance activities when compared with the manually collected ground truth. When PFirewall operates, all power data are withheld from the platform because they are not needed for running this app, and hence no user privacy can be inferred from power data.

VII. DISCUSSION AND LIMITATIONS

Can PFirewall perform home automation and thus get rid of the cloud? Note that PFirewall has access to all device data and rule semantics from IoT apps. Theoretically, PFirewall is capable of running a rule engine to execute the extracted semantics, so that no data is sent to the cloud at all. However, we did not adopt this design due to practical considerations. (1) The kick-cloud-out strategy may raise ethical or legal concerns that our research team cannot tackle. The SmartThings cloud can easily verify whether it is talking to a real SmartThings hub and cut off all services if not. This means that, while PFirewall may provide home automation, all other cloud-based services (messaging, storage, and remote management) will be lost. (2) Huge engineering efforts are needed to implement an equivalent rule engine that supports the same programming framework and APIs and to maintain it in the long run. Therefore, we strategically separate the data flow control policy engine from the rule engine; PFirewall only deals with data filtering.

Fig. 9: Inferred user working hours within 10 days with and without data flow control in testbed T1: (a) without data flow control; (b) with data flow control (x-axis: time of day in hours; y-axis: day; one series per person). For simplicity of illustration, we round all presence data timestamps to the nearest hour.

Fig. 10: Bathroom motion and humidity data received by the platform with and without data flow control: (a) motion data without control; (b) motion data with control; (c) humidity data without control; (d) humidity data with control (x-axis: time of day in hours). For a clearer display, motion data that indicate bathroom activities shorter than 3 minutes are omitted in (a).

User efforts. In PFirewall, users pair IoT devices with the mediator on the PFirewall web interface and add the virtual device instances to SmartThings with its companion mobile app; thus, the user operations for connecting devices are doubled. We design SmartThings-like pairing interfaces on the PFirewall side, which makes pairing on both sides similar and reduces potential confusion. Moreover, we use the browser automation framework Selenium to develop a Python script that periodically checks for new SmartApps and devices and installs the corresponding instrumented SmartApps (for rule extraction) and custom device handlers (for PFirewall mediation), respectively. Users only provide their SmartThings accounts to the script, and no other operations are required.

Fig. 11: Appliance usage inference over 3-day power data without data flow control: (a) raw data; (b) slicing; (c) K-means clustering (power in watts versus duration in minutes, three clusters); (d) mapping clusters to appliances.

Generality. Although our implementation targets SmartThings and openHAB, the presented approach can potentially be adapted to other ecosystems. As discussed in Section V-B, it is completely practical to realize a man-in-the-middle mediator in most systems. On the one hand, the mediator can be extended to work with as wide a variety of IoT devices as an open-source platform; on the other hand, the mediator can interface with many platforms via a connectivity technique that these platforms provide for creating and integrating software services and hardware devices as "things". Moreover, approaches for extracting automation rules from IoT apps [31], [7], [32], [20], [19] and mobile/web interfaces [32], [34], [43] have been broadly studied. We envision that the community will develop tools for extracting rule semantics from more platforms, so that data-minimization policies can be generated for them.

VIII. RELATED WORK

A. Privacy in Smart Home Platforms

Besides security, privacy is also an important research topic in smart home ecosystems. Zheng et al. [2] studied smart home owners' perceptions of privacy risks and the actions taken to protect their privacy; the study found that users are unaware of privacy risks from inference algorithms operating on data from their IoT devices, and that they expect device manufacturers to protect their privacy, though this is not the case. Celik et al. [4] provided a tool for tracking sensitive data flows in programming frameworks and identified that 138 out of 230 apps in SmartThings transmit at least one kind of sensitive data over platform-provided APIs, which means malicious apps have the capability to steal user data collected by the platform. References [18] and [31] also present app-level attacks that can breach user privacy. Closest to our work, FlowFence [6] enforced a data flow control mechanism for sensitive data protection. However, FlowFence protects sensitive data from unauthorized apps rather than from the platform, so sensitive data protection still fails against other attacks; moreover, FlowFence requires cooperation from the platforms and app developers to operate.

B. In-hub Security and Privacy Enforcement

Many in-hub schemes have been proposed to enforce security and privacy policies in the IoT domain. Simpson et al. design an in-hub security manager built atop the smart home hub to patch vulnerable IoT devices and strengthen authentication. The security manager is deployed in an open-source system, HomeOS. FACT [44] and HanGuard [45] enforce access controls in the middle by implementing controllers on an open-source hub and a programmable WiFi router, respectively. These schemes rely on a programmable hub (gateway, router) that can intercept and control the communication between the home area network and the Internet. However, in cloud-based smart home platforms like SmartThings, communications between the commercial hub and the backend cloud are encrypted [46] and hence the router can neither decrypt nor modify the packets on demand. PFirewall controls the communication between IoT devices and the hub in a unified, backward-compatible way, regardless of the specific communication protocol employed by the hub and cloud.

IX. CONCLUSION

We presented PFirewall, a semantics-aware customizable data flow control system for smart homes, which filters data generated by IoT devices. PFirewall can automatically generate application-dependent policies based on installed automation apps to block unnecessary data flows and only report the minimum amount of data required for home automation. Furthermore, PFirewall allows users to customize individual policies according to their own privacy preferences.

We overcame many challenges and designed an elegant man-in-the-middle proxy based system, which enforces these policies without modifying the platform or IoT devices. We implemented a prototype of PFirewall and evaluated it in two real-world testbeds. The evaluation results demonstrated that PFirewall can effectively and efficiently reduce sensitive data leakage without interfering with home automation. It heavily impairs an attacker's ability to monitor and infer user privacy-sensitive behaviors. In addition to smart homes, the system can also significantly enhance privacy protection in many other environments, such as smart factories and offices, that leverage smart platforms for IoT device interaction automation and other platform-provided services.

REFERENCES

[2] … Proceedings of the ACM on Human-Computer Interaction, vol. 2, no. CSCW, p. 200, 2018.
[3] E. Zeng, S. Mare, and F. Roesner, “End user security & privacy concerns with smart homes,” in Symposium on Usable Privacy and Security (SOUPS), 2017.
[4] Z. B. Celik, L. Babun, A. K. Sikder, H. Aksu, G. Tan, P. McDaniel, and A. S. Uluagac, “Sensitive information tracking in commodity iot,” in USENIX Security 2018.
[5] I. Bastys, M. Balliu, and A. Sabelfeld, “If this then what?: Controlling flows in iot apps,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2018, pp. 1102–1119.
[6] E. Fernandes, J. Paupore, A. Rahmati, D. Simionato, M. Conti, and A. Prakash, “Flowfence: Practical data protection for emerging iot application frameworks,” in USENIX Security Symposium, 2016, pp. 531–548.
[7] Y. Tian, N. Zhang, Y.-H. Lin, X. Wang, B. Ur, X. Guo, and P. Tague, “Smartauth: User-centered authorization for the internet of things,” in … USENIX Association, 2017, pp. 361–378.
[8] A. Acar, H. Fereidooni, T. Abera, A. K. Sikder, M. Miettinen, H. Aksu, M. Conti, A.-R. Sadeghi, and A. S. Uluagac, “Peek-a-boo: I see your smart home activities, even encrypted!” arXiv preprint arXiv:1808.02741, 2018.
[9] T. Datta, N. Apthorpe, and N. Feamster, “A developer-friendly library for smart home iot privacy-preserving traffic obfuscation,” in Proceedings of the 2018 Workshop on IoT Security and Privacy. ACM, 2018, pp. 43–48.
[10] N. Apthorpe, D. Reisman, and N. Feamster, “Closing the blinds: Four strategies for protecting smart home privacy from network observers,” arXiv preprint arXiv:1705.06809, 2017.
[11] N. Apthorpe, D. Reisman, S. Sundaresan, A. Narayanan, and N. Feamster, “Spying on the smart home: Privacy attacks and defenses on encrypted iot traffic,” arXiv preprint arXiv:1708.05044 …
… IEEE Symposium on Security and Privacy 2016.
[19] H. Chi, Q. Zeng, X. Du, and J. Yu, “Cross-app threats in smart homes: Categorization, detection and handling,” arXiv preprint arXiv:1808.02125, 2018.
[20] Z. B. Celik, P. McDaniel, and G. Tan, “Soteria: Automated iot safety and security analysis,” in USENIX Security 2018.
… Proceedings of the IEEE Symposium on Security and Privacy (S&P). https://doi.org/10.1109/SP, 2019.
[26] C. Zuo, Z. Lin, and Y. Zhang, “Why does your data leak? uncovering the data leakage in cloud from mobile apps,” in IEEE Symposium on Security and Privacy 2019.
[27] “Lan-connected devices,” https://docs.smartthings.com/en/latest/cloud-and-lan-connected-device-types-developers-guide/index.html, 2018.
[28] “MQTT,” https://http://mqtt.org/, 2019.
[29] E. Fernandes, A. Rahmati, K. Eykholt, and A. Prakash, “Internet of things security research: A rehash of old ideas or new intellectual challenges?” IEEE Security & Privacy, vol. 15, no. 4, pp. 79–84, 2017.
[30] W. Ding and H. Hu, “On the safety of iot device physical interaction control,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2018, pp. 832–846.
[31] Y. J. Jia, Q. A. Chen, S. Wang, A. Rahmati, E. Fernandes, Z. M. Mao, and A. Prakash, “Contexiot: Towards providing contextual integrity to appified iot platforms,” in Proceedings of The Network and Distributed System Security Symposium, 2017.
[32] W. Zhang, Y. Meng, Y. Liu, X. Zhang, Y. Zhang, and H. Zhu, “Homonit: Monitoring smart home apps from encrypted traffic,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2018, pp. 1074–1088.
[33] Z. B. Celik, G. Tan, and P. McDaniel, “IoTGuard: Dynamic enforcement of security and safety policy in commodity iot,” 2019.
[34] I. Hwang, M. Kim, and H. J. Ahn, “Data pipeline for generation and recommendation of the iot rules based on open text data,” in IEEE WAINA …
… Proceedings of the 18th ACM conference on Computer and communications security. ACM, 2011, pp. 87–98.
[42] M. Lisovich and S. Wicker, “Privacy concerns in upcoming residential and commercial demand-response systems.”
[43] D. T. Nguyen, C. Song, Z. Qian, S. V. Krishnamurthy, E. J. Colbert, and P. McDaniel, “Iotsan: fortifying the safety of iot systems,” in Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies. ACM, 2018, pp. 191–203.
[44] S. Lee, J. Choi, J. Kim, B. Cho, S. Lee, H. Kim, and J. Kim, “Fact: Functionality-centric access control system for iot programming frameworks,” in Proceedings of the 22nd ACM on Symposium on Access Control Models and Technologies. ACM, 2017, pp. 43–54.
[45] S. Demetriou, N. Zhang, Y. Lee, X. Wang, C. A. Gunter, X. Zhou, and M. Grace, “Hanguard: Sdn-driven protection of smart home wifi devices from malicious mobile apps,” in Proceedings of the 10th ACM Conference on Security and Privacy in Wireless and Mobile Networks …

APPENDIX

A. Investigation on Popular Smart Home Platforms

We study the privacy policies and practices of 7 popular cloud-based smart home platforms and 3 platforms that use other architectures for comparison. A brief summary is shown in Table VIII.

The first five columns of the table describe what each privacy policy claims. “Easy to access?” shows whether a privacy policy is explicitly displayed or prompted during the installation of the platform's products (especially apps). “Collect device data?” shows whether a privacy policy claims that the platform accesses users' devices during the services. “Expose data to partners?”, “Restrict data use on 3rd parties?” and “Privacy techniques” show whether the platform claims to share users' data with third parties, whether it claims to restrict how third parties can legally use these data, and what techniques it employs to protect user privacy during data sharing.

The remaining five columns describe facts. “Collect personal info.?” shows whether a platform collects personally identifiable information from users during the registration process. “Access device data?” shows whether the platform accesses device data while providing services. “Expose data to partners?” shows whether the platform provides device data to third parties, including integrated third-party services. “Access control before hub?” and “User controllable?” indicate whether any access control mechanism is enforced before the platform's hub accesses device data and whether users can control the access between devices and the platform's hub.

Some privacy policies fail to raise users' awareness of sensitive data collection because they fail to 1) be easily accessible, 2) use jargon-free words, or 3) state sensitive data collection explicitly. Some policies, although they claim to share data with third parties, do not state any data protection techniques or any restrictions imposed on the third parties. On the other hand, we found that most of the studied platforms request personally identifiable information from users during registration, access sensitive data from IoT devices, and share data with business partners. However, most platforms do not have mechanisms to minimize the data accessed from users and do not provide interfaces for users to exercise fine-grained control over their sensitive data. Users are only capable of choosing whether to agree with the privacy policy. Once a device is connected to the platform, they cannot further decide how their devices report data to the platform.

B. Time/Timer-related Automation

PFirewall also deals with time-related automations. For instance, if a rule is defined as “when the door is opened if time is after 18:00, turn on TV”, the derived policy needs to fetch the system time for condition checking. When it comes to a timer-related automation, e.g., “when motion sensor becomes inactive for 5 minutes, turn off the light”, multiple policies are bundled to operate by calling the methods for starting, stopping and firing a timer. Fig. 13 illustrates the workflow of how PFirewall handles this example.
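The two rule examples above can be sketched in Python as follows. This is a minimal illustration and not the PFirewall implementation: the class `TimerRegistry`, its `tick` method, the explicit `now` timestamps (used instead of a real clock so that the behaviour is deterministic) and the helper `door_event_needed` are assumptions. Only the four operations (start, stop, fire, add callback) are taken from the method names that the paper lists for timer handling.

```python
from datetime import time
from typing import Callable, Dict, List


def door_event_needed(event: str, now: time) -> bool:
    """Time-related rule: a door 'open' event matters only from 18:00 on."""
    return event == "open" and now >= time(18, 0)


class TimerRegistry:
    """Toy timers driven by explicit timestamps in seconds (an assumption)."""

    def __init__(self, threshold_s: float) -> None:
        self.threshold_s = threshold_s
        self.started_at: Dict[str, float] = {}
        self.callbacks: Dict[str, List[Callable[[], None]]] = {}

    def add_callback(self, timer_id: str, action: Callable[[], None]) -> None:
        # Add an action to the callbacks of the timer.
        self.callbacks.setdefault(timer_id, []).append(action)

    def start_timer(self, timer_id: str, now: float) -> None:
        # Create the timer, or reset it if it is already running.
        self.started_at[timer_id] = now

    def stop_timer(self, timer_id: str) -> None:
        # Stop and reset the timer; unknown ids are ignored.
        self.started_at.pop(timer_id, None)

    def fire_timer(self, timer_id: str) -> None:
        # Execute the registered actions once, then stop the timer.
        for action in self.callbacks.get(timer_id, []):
            action()
        self.stop_timer(timer_id)

    def tick(self, now: float) -> None:
        # Fire every running timer whose duration reached the threshold.
        for timer_id, t0 in list(self.started_at.items()):
            if now - t0 >= self.threshold_s:
                self.fire_timer(timer_id)


if __name__ == "__main__":
    reported: List[str] = []
    timers = TimerRegistry(threshold_s=5 * 60)
    # The callback reports "inactive" to the platform (here: a list).
    timers.add_callback("id1", lambda: reported.append("inactive"))
    timers.start_timer("id1", now=0)   # motion sensor became inactive
    timers.tick(now=120)               # 2 minutes: nothing reported yet
    timers.tick(now=300)               # 5 minutes: callback fires
    print(reported)
```

In this sketch an "inactive" reading is withheld until it has lasted for the full five minutes; if the sensor becomes active earlier, calling `stop_timer` cancels the pending report.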

C. Interfacing with openHAB

1) Implementation:

We use MQTT, which openHAB supports, to interface with openHAB because it is a general connectivity protocol that allows any device type to be virtualized flexibly. Fig. 14 shows the high-level architecture of the integration. openHAB provides an embedded MQTT broker, so our work is to realize each virtual device (VD) as an MQTT client and create a Generic MQTT thing (supported by the MQTT binding) in openHAB for the real device represented by the VD. A thing in openHAB has channels (equivalent to the concept “attribute” in SmartThings, e.g., motion, temperature, etc.) and each channel can be linked to an item (used for displaying values received by the linked channel and used as an interface for automation rules to interact with the real device). In openHAB,

Fig. 12: The PFirewall Survey mobile app used in the user survey (panels (a) to (d)).

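TABLE VIII: A summary of privacy policies and facts in some well-known platforms. AGG: aggregation; ANO: anonymization. ✓: yes; ✗: no.

| Platform | Policy: Easy to access? | Policy: Collect device data? | Policy: Expose data to partners? | Policy: Restrict data use on 3rd parties? | Policy: Privacy techniques | Fact: Collect personal info.? | Fact: Access device data? | Fact: Expose data to partners? | Fact: Access control before hub? | Fact: User controllable? |
|---|---|---|---|---|---|---|---|---|---|---|
| Wink | ✓ | ✓ | ✓ | ✗ | AGG | ✓ | ✓ | ✓ | ✗ | ✗ |
| Iris | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| Vera | ✓ | ✓ | ✓ | ✓ | AGG, ANO | ✓ | ✓ | ✓ | ✗ | ✗ |
| Lutron | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| Thingsee | ✓ | ✓ | ✓ | ✓ | AGG, ANO | ✓ | ✓ | ✓ | ✗ | ✗ |
| SmartThings | ✗ | ✓ | ✓ | ✓ | AGG, ANO | ✓ | ✓ | ✓ | ✗ | ✗ |
| EVRYTHNG | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| openHAB | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| Mozilla IoT | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Apple HomeKit | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |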

Fig. 13: The workflow of how PFirewall handles a timer-related rule example. The methods are shown in Table IX. action1 is defined to report “inactive” to the platform with method keep and zero delay. Each timer maintains a list of actions which will be called when the timer's duration satisfies a certain constraint. Diagram elements: Motion Sensor (Active, Inactive); Timer (id1, duration); StartTimer(id1); StopTimer(id); addCallback(id1, action1); if duration > 5min, fireTimer(id1); action1.

TABLE IX: Methods for dealing with timer-related automation

| Method | Description |
|---|---|
| startTimer(id) | Create or reset a timer with identity id |
| stopTimer(id) | Stop and reset a timer with identity id |
| fireTimer(id) | Fire a timer id and execute actions in its callbacks |
| addCallback(id, act) | Add an action act to the callbacks of timer id |

Fig. 14: Overview of how the mediator interfaces with openHAB. Diagram elements: PFirewall (Virtual Device Manager, Virtual Device-SmartOutlet, Virtual Device-MotionSensor); openHAB (MQTT Embedded Broker; things SmartOutlet and Motion Sensor with channels switch and motion, linked to items item_s and item_m); MQTT topics T1 to T4, including T1: data/id_outlet/switch, T2: cmd/id_outlet/switch, T3: data/id_motion/motion.

TABLE X: Comparison of reported data volume per device before and after the deployment of PFirewall. VOL: volume of reported data in SYS1 and SYS2, respectively; RR: relative reduction rate. We present the result for each device type. See Appendix D for the complete result of all deployed devices.

| Dev | Attr | VOL | RR | Attr | VOL | RR |
|---|---|---|---|---|---|---|
| MU1 | contact | 1924, 22 | 0.98 | temperature | 142, 6 | 0.96 |
| MO1 | motion | 2266, 47 | 0.98 | temperature | 307, 0 | 1 |
| OL1 | switch | 29, 0 | 1 | | | |
| OL2 | switch | 19, 0 | 1 | | | |
| SL1 | switch | 22, 0 | 1 | | | |
| SP1 | presence | 34, 24 | 0.29 | | | |
| SP2 | presence | 36, 1 | 0.97 | | | |
| SP3 | presence | 30, 1 | 0.96 | | | |
| SP4 | presence | 28, 1 | 0.96 | | | |
| SP5 | presence | 26, 1 | 0.96 | | | |
| MU2 | contact | 52, 24 | 0.54 | temperature | 118, 0 | 1 |
| MU3 | contact | 268, 58 | 0.78 | temperature | 131, 8 | 0.94 |
| MU4 | contact | 42, 42 | 0 | temperature | 109, 0 | 1 |
| MO2 | motion | 364, 68 | 0.81 | temperature | 173, 0 | 1 |
| MO3 | motion | 564, 21 | 0.96 | temperature | 157, 0 | 1 |
| OL3 | switch | 44, 0 | 1 | | | |
| OL4 | switch | 49, 0 | 1 | | | |
| SL2 | switch | 60, 0 | 1 | | | |
| SL3 | switch | 68, 0 | 1 | | | |
| SL4 | switch | 70, 0 | 1 | | | |
| SL5 | switch | 42, 0 | 1 | | | |
| AM1 | motion | 364, 0 | 1 | | | |
| AM1 | illuminance | 1039, 1 | 0.99 | humidity | 668, 0 | 1 |
| AM2 | motion | 462, 0 | 1 | | | |
| AM2 | illuminance | 1384, 0 | 1 | humidity | 893, 1 | 0.99 |
| SP6 | presence | 28, 12 | 0.57 | | | |

each MQTT thing channel can be configured as an MQTT client. By subscribing to the same MQTT topic (essentially a path-like string), MQTT clients can publish/receive data to/from the topic.

When a new device is added to PFirewall, a VD instance is created. If the real device is a sensor (e.g., the motion sensor in Fig. 14), the VD instance uses a topic data/{device id}/{attribute} (e.g., data/12345/motion) for publishing data, where the device id is generated randomly by PFirewall; if the real device is an actuator (e.g., a smart outlet), the VD instance uses a topic data/{device id}/{attribute} for publishing data and subscribes to a topic cmd/{device id}/{attribute} for receiving commands. The MQTT binding in openHAB does not provide a device discovery function. To automatically add a thing and its channel in openHAB, there are two choices: operating on the web interfaces or adding a configuration file in the openhab/conf/things/ directory. We choose the latter approach to automate the process. By populating a string template with the same device id, attribute and topic information as the VD instance, PFirewall creates an MQTT thing by adding a thing file to the openHAB directory through an FTP service. Thus, the created MQTT thing can receive data from or send commands to the VD by subscribing to the same topics.
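The topic scheme used for the openHAB integration (a data/… topic per attribute, plus a cmd/… topic for actuators) can be illustrated with a short Python sketch. The names `new_device_id`, `make_topics` and `InMemoryBroker` are hypothetical; the broker is a plain in-memory stand-in that only matches topics exactly and is not an MQTT implementation. The thing-file template and the FTP upload are not reproduced here because the paper does not give their contents.

```python
import secrets
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List


def new_device_id() -> str:
    """Random identifier for a virtual device (the format is an assumption)."""
    return secrets.token_hex(4)


def make_topics(device_id: str, attribute: str, actuator: bool) -> Dict[str, str]:
    """Build the topic names shared by a virtual device and its thing."""
    topics = {"data": f"data/{device_id}/{attribute}"}
    if actuator:
        # Only actuators need a channel for commands from the platform.
        topics["cmd"] = f"cmd/{device_id}/{attribute}"
    return topics


class InMemoryBroker:
    """Stand-in for a broker: exact-match topics, no wildcards, no QoS."""

    def __init__(self) -> None:
        self.subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, payload: str) -> None:
        for handler in self.subscribers.get(topic, []):
            handler(payload)


if __name__ == "__main__":
    broker = InMemoryBroker()
    topics = make_topics(new_device_id(), "switch", actuator=True)
    # The thing listens for data; the virtual device listens for commands.
    broker.subscribe(topics["data"], lambda p: print("thing received", p))
    broker.subscribe(topics["cmd"], lambda p: print("VD received", p))
    broker.publish(topics["data"], "on")    # VD reports a state
    broker.publish(topics["cmd"], "off")    # platform sends a command
```

Because both sides derive the topic names from the same device id and attribute, no discovery step is needed: agreeing on the strings is enough for the two clients to find each other.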

2) Evaluation:

openHAB allows users to write automation apps with a domain specific language (DSL), which is adapted from Xbase [47]. However, openHAB does not provide official apps for installation. To test our openHAB integration, we develop 13 apps implementing the same rule semantics to work with the same devices, as shown in Table IV. We manually operate the real devices to trigger each rule 20 times and find that all apps are executed correctly.

D. Complete Evaluation Result of Data Volume Reduction

Due to page limits, we only present the result of one device for each device type in Table VII in Section VI-C3. Table X shows the complete list of all deployed devices in both testbeds.
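Table X does not spell out how RR is computed. A natural reading, assumed here, is RR = (V1 - V2) / V1, where V1 and V2 are the volumes reported in SYS1 and SYS2. The Python sketch below applies this assumed formula to three rows of the table; the function name `relative_reduction` is hypothetical.

```python
def relative_reduction(before: int, after: int) -> float:
    """Assumed definition of RR: the fraction of reports that is removed."""
    if before <= 0:
        raise ValueError("the volume before deployment must be positive")
    return (before - after) / before


if __name__ == "__main__":
    # (device, attribute, volume in SYS1, volume in SYS2), taken from Table X
    rows = [
        ("MU2", "contact", 52, 24),
        ("MU4", "contact", 42, 42),
        ("OL1", "switch", 29, 0),
    ]
    for dev, attr, v1, v2 in rows:
        print(dev, attr, round(relative_reduction(v1, v2), 2))
```

Under this reading, a device whose reports are all still needed by some rule (MU4 contact) has RR = 0, while a device whose reports trigger no automation (OL1 switch) has RR = 1.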

E. User Study

1) Setup:

We conducted a user survey to study users' attitudes and abilities towards defining customized data flow control policies with our policy templates (Section V-A2). We recruited from our institutions 20 adult participants who are knowledgeable about the concepts “home automation”, “smart home” or “IoT”. Participants completed the trial tasks of our “PFirewall Survey” app in our lab using smartphones we provided and after that answered several questions (see Section E3).

We asked the participants to get familiar with a smart home setting where 10 automation rules (Fig. 12(b)) are configured to work with 15 devices (Fig. 12(a)). The app provides a page (Fig. 12(c)) to illustrate the architecture of the system and the potential risks of data leakage; we did not explain the content of this page or ask questions about it, to avoid influencing the end users' understanding by factors other than the interface itself. The app also provides an interface showing the list of 15 devices; when a device is selected, the app switches to a device detail page (e.g., Fig. 12(d)) showing what data the device generates and what privacy risks are imposed if the data are leaked. In addition, policy templates (as shown in Fig. 4(b)) were provided for participants to define their own policies. After a 30-minute trial, participants were asked to answer questions.

2) Results:

All 20 participants cared about their data privacy and thought it useful to define their own data flow policies for protecting privacy. However, 2 participants thought they would not spend time defining policies even if an app were available. We counted the participants who had privacy concerns about each listed device. Cameras and smart speakers were the top two devices whose data were considered sensitive by the participants (19 and 16, respectively); half or more of the participants had concerns about the status data of smart locks, doors and windows (11, 13 and 10, respectively); fewer than 3 participants were concerned about each of humidity sensors, heaters, lights, power devices and coffee makers. Beyond the listed devices, the participants also cared about the data privacy of smart TVs, smart window blinds and smart outlets.

Regarding the usability of our policy templates, 8 participants thought the templates were “very easy” to use and 12 participants thought them “easy” to use. 3 participants found that the templates did not let them specify policies that combine multiple conditions, for example, an event together with a specified time period. Following this feedback, we addressed the issue by allowing users to select another condition after a condition has been specified.

Overall, participants care about data privacy and hold a positive attitude towards defining their own policies with our templates. The result also shows that participants may overlook the privacy risks of some devices such as humidity sensors and power devices, which we have discussed in Section VI-C4. Hence, data-minimization policies and user-specified policies could work together to achieve better privacy protection.

3) Questions in the user study:

1) Do you care about your data privacy if you use a smart home system?
   A. Yes  B. No
2) List the device(s) (from the given device list in our “PFirewall Survey” app) about which you have privacy concerns if the device data are leaked.
3) Do you think it is useful in general to control your own data to reduce privacy leakage risks?
   A. Yes  B. No
4) Would you spend time defining your own policies to control data if an app like “PFirewall Survey” is available for you to do so?
   A. Yes  B. No
5) Recall how our app guides you to define your own policies. Are the provided policy templates easy to understand and use?
   A. Easy  B. Somewhat challenging but still able to use  C. Not usable
6) Is there any policy that you think is useful but that the given templates fail to let you define? If any, please list it.