Learning from Online Regrets: From Deleted Posts to Risk Awareness in Social Network Sites
Nicolás Emilio Díaz Ferreyra
Rene Meis
Maritta Heisel
ABSTRACT
Social Network Sites (SNSs) like Facebook or Instagram are spaces where people expose their lives to wide and diverse audiences. This practice can lead to unwanted incidents such as reputation damage, job loss or harassment when pieces of private information reach unintended recipients. As a consequence, users often regret having posted private information on these platforms and proceed to delete such content after a negative experience. Risk awareness is a strategy that can be used to persuade users towards safer privacy decisions. However, many risk-awareness technologies for SNSs assume that information about risks is retrieved and measured by an expert in the field. Consequently, risk estimation is an activity that is often passed over despite its importance. In this work we introduce an approach that employs deleted posts as risk-information vehicles to measure the frequency and consequence level of self-disclosure patterns in SNSs. In this method, consequence is reported by the users through an ordinal scale and used later on to compute a risk criticality index. We thereupon show how this index can serve in the design of adaptive privacy nudges for SNSs.
CCS CONCEPTS
• Security and privacy → Privacy protections; Social aspects of security and privacy; Usability in security and privacy; • Information systems → Users and interactive retrieval; • Human-centered computing → Human computer interaction (HCI).

KEYWORDS
adaptive privacy, privacy nudges, self-disclosure, awareness, social network sites, risk management
ACM Reference Format:
Nicolás Emilio Díaz Ferreyra, Rene Meis, and Maritta Heisel. 2019. Learning from Online Regrets: From Deleted Posts to Risk Awareness in Social Network Sites. In UMAP’19 Adjunct, June 9–12, 2019, Larnaca, Cyprus. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3314183.3323849
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
UMAP’19 Adjunct, June 9–12, 2019, Larnaca, Cyprus
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6711-0/19/06...$15.00
https://doi.org/10.1145/3314183.3323849
1 INTRODUCTION
Social Network Sites (SNSs) like Facebook or Twitter allow users to create and maintain social connections with a wide spectrum of online communities which represent (in many cases) the different facets of their lives. User-generated content plays a major role in this process since posts, comments, videos and photos are the vehicles that allow users to relate with each other within these platforms. Nevertheless, disclosing private and sensitive information through these communication channels can result in unwanted incidents such as reputation damage, unjustified discrimination or even job loss when such content reaches an unintended audience [1, 23]. Consequently, users very often regret having posted private information in SNSs because they were unable to anticipate the negative consequences of their actions [23].

Risk awareness is key for making better and more informed decisions in our daily lives. For instance, being aware of the risks of smoking can discourage people from engaging in tobacco consumption [10]. Likewise, nutrition labels can support people in improving their eating habits [7]. However, users of SNSs receive very little (if any) information about the risks of online interaction, neither as part of the platform's layout nor in the body of the privacy policy [5]. Moreover, SNSs very often present themselves as spheres free of intrusions and privacy risks. This lack of information modulates the perceived severity of privacy risks in favour of information disclosure and, consequently, in benefit of the service providers [19, 22].

Privacy scholars have developed a wide variety of preventative technologies [4, 9, 24] to induce changes in the privacy decisions made by the users of SNSs. For instance, these technologies provide cues about the semantics of the content being shared by the users (i.e. whether a post contains private information or not) in order to persuade them towards safer privacy practices [5].
Such approaches can be improved by personalizing these cues with risk information associated with self-disclosure patterns in SNSs [4, 21]. That is, providing a personalised assessment of the risks that may take place if private information is revealed in a post. A prerequisite for performing this task is to have a repository of unwanted incidents, together with their respective frequencies and severity levels [6]. However, to the best of our knowledge, not many efforts have been made on defining the necessary mechanisms to collect and process such information. In other words, the information necessary to generate a proper risk estimation often remains an assumption for preventative technologies.

In this work, we introduce a method to estimate the risks of information disclosure in SNSs using deleted posts as indicators of online regrets. Since users often delete their publications after living a negative experience in SNSs [23], we propose to leverage such deleted content to (i) retrieve information about unwanted incidents, and (ii) estimate their respective frequency and consequence level. For this purpose, we introduce an interface for collecting this information in which the perceived consequence level of the incidents can be entered by the users using ordinal values (i.e. catastrophic, major, moderate, minor and insignificant). In line with this, we describe how this information can be used later on to compute a risk criticality index and integrate it into an adaptive awareness mechanism.

The rest of the paper is organized as follows. In the next section we discuss related work in the area of adaptive privacy awareness. Following, section 3 introduces the theoretical foundations of this paper. In particular, we discuss the use of heuristics for the generation of awareness together with a risk estimation approach.
Section 4 elaborates on a method for collecting evidence of recurrent unwanted incidents in SNSs using deleted posts as risk-information vehicles. Next, section 5 introduces an algorithm for the generation of adaptive privacy awareness. This algorithm combines the output produced by the method of section 4 together with the risk index introduced in section 3. Following, section 6 analyses the strengths and limitations of our approach, and section 7 describes the corresponding design and evaluation plan. Finally, in section 8 we outline the conclusions of this paper and give directions for future work.

2 RELATED WORK
Privacy in SNSs is a multifaceted issue that has caught the attention of many researchers across different disciplines [3]. In the particular case of online self-disclosure, several Preventative Technologies (PTs) have been proposed for nudging the users towards a safer privacy behavior [5]. Basically, PTs generate interventions (i.e. warning messages or suggestions) when users attempt to publish private or sensitive information in their profiles [5]. This way, users are induced to reflect on the content they are about to share and the negative consequences that may occur after posting such content. Although this is a well-grounded persuasive strategy, warnings are sometimes perceived as too invasive or annoying by the users. This happens basically because not all users have the same privacy attitudes or concerns and, consequently, adopt different privacy strategies [20]. For instance, some users are more willing to disclose private information without much concern about the consequences, while others rather keep such information away from unwanted recipients. This suggests that PTs should generate interventions aligned with the users' privacy attitudes in order to engage them in a continuous learning process [6].
In other words, PTs should incorporate adaptivity principles into their design.

At a glance, the adaptive awareness process of PTs can be described as a loop consisting of two main activities, as shown in Fig. 1. In the first step, knowledge extraction, the information that is necessary for the generation of adaptive interventions is gathered and stored inside a Knowledge Base (KB). The second step, knowledge application, consists of querying the information inside the KB to shape personalized interventions. This two-phase process repeats itself after some time in order to update the information inside the KB and thereby improve the quality of the interventions. For instance, Misra et al. [15] developed an approach in which the emerging communities inside a user's ego network (i.e. the network of connections between his/her friends) are used to build personalized access-control lists (i.e. black lists of information recipients). In this case, knowledge extraction consists of retrieving the user's ego-network, and knowledge application involves the use of this information for generating a personalized access-control list. This process repeats itself when the topology of the user's ego-network changes (i.e. when contacts or links between contacts are added/removed).
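The two-phase process of Fig. 1 can be sketched in a few lines. In the sketch below (our own illustration, not part of any cited approach), the extract and apply callables stand for any concrete knowledge-extraction and knowledge-application strategy, such as ego-network retrieval and access-control list generation in Misra et al.:

```python
# Minimal sketch of the adaptive privacy loop of Fig. 1. The callables
# `extract` and `apply_kb` are placeholders for concrete strategies.

def adaptive_privacy_loop(extract, apply_kb, rounds):
    kb = {}                                   # the Knowledge Base (KB)
    interventions = []
    for _ in range(rounds):
        kb.update(extract())                  # step 1: knowledge extraction
        interventions.append(apply_kb(kb))    # step 2: knowledge application
    return interventions
```

Each iteration refreshes the KB before shaping the next intervention, which mirrors how the loop improves intervention quality over time.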
Figure 1: Adaptive Privacy Loop
Combining adaptive awareness with risk management features is a promising approach for promoting safer privacy decisions among the users of SNSs. As mentioned by Díaz Ferreyra et al. [6] and Acquisti et al. [19], information about the potential risks of a self-disclosure act can nudge the users towards more proactive privacy decisions. In line with this premise (i.e. more risk information, better privacy decisions), De et al. [2] developed an approach using attack trees to inform the users of SNSs about the privacy risks that may result from their privacy settings (e.g. the risks of having a public profile). Despite its novelty, assumptions were made with regard to the information used for the estimation of such risks. That is, a prerequisite for the application of this approach is to have a KB containing information about common unwanted incidents, their frequency and consequence levels. In order to endow this and other PTs with the information necessary for computing privacy risks, we propose a method for collecting such risk-related information using deleted posts. Likewise, we describe how this information can be aggregated and used to define an adaptive mechanism of risk awareness.
3 THEORETICAL FOUNDATIONS
In this section we introduce the theoretical foundations of this paper. Particularly, we discuss how regrettable online experiences can be translated into patterns of information disclosure and used thereafter for the generation of privacy awareness inside SNSs. In line with this, we discuss the importance that the estimation of risk values has for PTs and introduce an approach that can be used to carry out this task.
3.1 Privacy Heuristics
A privacy regret in SNSs can be defined as a “feeling of sadness, repentance or disappointment which occurs when a piece of sensitive information reaches an unintended audience and results in an unwanted incident” [6]. For instance, the regrettable scenario of Fig. 2 illustrates a situation in which a user gets in trouble with her employer after posting a negative comment about her workplace. Probably, this scenario has been experienced by more than one user while interacting inside a SNS. In this case, such a scenario can be abstracted into a pattern of information disclosure that represents this and other regrettable scenarios with similar characteristics. Díaz Ferreyra et al. [6] propose to describe recurrent self-disclosure scenarios using privacy heuristics (PHs). Basically, PHs model patterns of information disclosure as a tuple ⟨PAs, Audience, Risk⟩, where PAs is a set of private attributes, Audience is a collection of recipients (e.g. Facebook friends), and Risk is a characterization of the severity (i.e. a measure of the consequence and frequency) of an Unwanted Incident (UIN). Hence, the corresponding PH for the scenario of Fig. 2 models the severity of job loss when a negative comment about one's workplace is disclosed to an audience composed of work colleagues.

The knowledge inside PHs can be used by PTs to communicate the risks associated with disclosing certain patterns of private information in SNSs. This can be done by checking the Risk information inside a PH whose PAs match the ones disclosed in a new post [6]. For instance, if the same user of Fig. 2 (or any other user) attempts to disclose a negative comment about her workplace inside a new post, the PT could inform her that this can lead to job loss if seen by her work colleagues. This information flow corresponds to the knowledge application step described in Fig. 1 for a PT whose KB consists of a collection of PHs. Such a Privacy Heuristics Data Base (PHDB) can be engineered through the elicitation of regrettable scenarios and their later encoding into PHs [4, 6]. In principle, one could shape a PH out of the experience of a single user (e.g. extracting the PAs, the Audience and the Risks in a face-to-face interview or through an online questionnaire) [4]. However, the consequence level of an UIN is subjective (i.e. it varies from individual to individual) and the same UIN can be perceived as insignificant by one user and catastrophic by others. Moreover, a single occurrence of an UIN is not enough to estimate its frequency. Therefore, one must have multiple sources of evidence of an UIN to generate a good risk estimator.
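As a rough sketch of how a PH and its matching against a new post might be represented, the following Python fragment models the tuple described above. All names and types are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical representation of the <PAs, Audience, Risk> tuple behind a
# privacy heuristic (PH); names and types are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class PrivacyHeuristic:
    pas: frozenset        # private attributes, e.g. {"work location", ...}
    audience: str         # collection of recipients, e.g. "work colleagues"
    risk: dict = field(default_factory=dict)  # UIN -> severity estimate

def matching_heuristics(post_pas, phdb):
    # A PH is relevant to a new post when all of its private attributes
    # appear among the ones disclosed in the post.
    return [ph for ph in phdb if ph.pas <= frozenset(post_pas)]
```

Querying the PHDB with the attributes of a new post then returns the heuristics whose Risk information should be surfaced to the user.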
USER’S POST: “A typical day at the office. Lots of complaints and bad mood. Cannot wait for the day to be over...!”
Actual Audience: PUBLIC.
Unintended Audience: The user’s work colleagues.
Unwanted Incidents: Reputation damage; job loss.

Figure 2: Example of self-disclosure scenario
3.2 Risk Estimation
Estimating the severity of privacy risks is an important step towards the generation of privacy awareness. Basically, this allows prioritizing the communication of those risks with a high severity level over those risks whose severity level is low. One way to estimate such risk levels is through a risk index that aggregates instances of elementary risk evidence measured through quantitative or qualitative data [8]. Since ordinal scales such as insignificant, minor, moderate, major and catastrophic are convenient when measuring the consequence of unwanted incidents [14], it is desirable that a risk index can deal with ordinal variables. One approach that takes this aspect into account is the Criticality Index (CI) introduced by Facchinetti et al. [8], which generates a normalized risk value I taking as input the frequency of the values used to measure the consequence of an unwanted incident. That is, given a categorical random variable X with ordered categories x_k that represent decreasing consequence levels k = 1, 2, ..., K, a value of I closer to 0 indicates that the severity of a risk event is likely to be low, whereas values closer to 1 indicate that the severity is likely to be high. An estimator of the risk index I can be obtained out of a sample of size n of the categorical variable X with the following equation [8]:

  Î = ( Σ_{k=1}^{K} F̃_k − 1 ) / (K − 1)    (1)

where, given K consequence levels, the values k = 1 and k = K correspond to the highest and lowest consequence values of an unwanted incident, respectively. Likewise, F̃_k corresponds to the empirical distribution function of the random variable X, which for a category x_k is computed from the number of observations r_l in the sample with consequence levels between 1 and k:

  F̃_k = Σ_{l=1}^{k} r_l / n    for k = 1, 2, ..., K

Eq. 1 aggregates evidence about the consequence of an unwanted incident to determine the severity of its corresponding risk event. Consequently, Î can be used to instantiate the Risk component of a PH, and thereby to generate risk awareness on well-known patterns of information disclosure in SNSs. Under this premise, we will describe in the next two sections (i) a method for collecting evidence of unwanted incidents in SNSs using deleted posts and (ii) how to instantiate the Risks of PHs for the later generation of adaptive risk awareness in SNSs. This second task, which involves the generation of a confidence interval that contains the real value of I, is addressed in Section 5.

4 COLLECTING EVIDENCE OF UNWANTED INCIDENTS
As we discussed in Section 3, PHs are promising instruments for the generation of risk awareness in SNSs. However, in order to put these instruments into practice, one must properly estimate the severity of the privacy risks that are associated with them. The CI discussed in section 3.2 is an adequate instrument to perform such an estimation; however, this requires empirical evidence about the frequency of UINs. In order to gather the evidence necessary to estimate those privacy risks that are associated with PHs, we have defined a method consisting of five steps: Analyse Post Content, Elicit Unwanted Incident, Match Existing Heuristics, Add New Heuristic, and Update Contingency Table. As depicted in Fig. 4, each stage of the method draws on different external inputs and generates the outputs for the next step. The final output of the method is an updated version of a contingency table, which is a data structure used to summarize the frequency of the different UINs that are associated with a PH.
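Each cell of such a contingency table feeds the Criticality Index of Eq. 1 directly. As a minimal Python sketch (the function name and input convention are our assumptions, not part of the original method):

```python
# Sketch of the Criticality Index (CI) estimator of Eq. 1 (Facchinetti et al.).
# `freqs` holds the observation counts r_1..r_K for one unwanted incident,
# ordered from the highest consequence level (catastrophic, k = 1) down to
# the lowest (insignificant, k = K).

def criticality_index(freqs):
    n = sum(freqs)              # sample size
    K = len(freqs)              # number of consequence levels
    if n == 0:
        raise ValueError("no reported occurrences for this unwanted incident")
    # Empirical distribution function: F~_k = (r_1 + ... + r_k) / n
    cumulative, ecdf = 0, []
    for r in freqs:
        cumulative += r
        ecdf.append(cumulative / n)
    # Eq. 1: I^ = (sum_k F~_k - 1) / (K - 1), normalized to [0, 1]
    return (sum(ecdf) - 1) / (K - 1)
```

A sample concentrated in the catastrophic category yields a value near 1, while a sample concentrated in the insignificant category yields a value near 0, matching the interpretation of I given in section 3.2.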
Step 1: Content Analysis
The method starts when a post with private information is deleted by the user. Basically, an event of such characteristics is likely to occur when the user has lived a regrettable experience after disclosing sensitive data to the wrong audience. Therefore, a deleted post which encloses this type of information can be used as a vehicle for gathering information about UINs that result from self-disclosure actions in SNSs. To start with the identification of private information inside deleted posts, one must have a taxonomy of attributes of sensitive nature. This is often a challenge in itself due to the multiple definitions of private information, and the influence that the context in which such information is disclosed can have on this type of analysis. Consequently, different taxonomies of private/sensitive attributes have been proposed by scholars, each of them based on different interpretations of private information [17]. Since the goal of this step is to identify regrettable posts, the taxonomy of attributes used for this task should be aligned with this purpose. Thus, such a taxonomy must include attributes for which there is evidence of regret when they have been disclosed inside SNSs.

Based on a study of regrets conducted by Wang et al. [23], Díaz Ferreyra et al. [4] proposed a taxonomy of surveillance attributes (SAs) that provides an intuitive representation of different aspects of the users' private information in SNSs. As shown in Table 1, this taxonomy organizes a collection of personal attributes (i.e. the SAs) around a number of high-level categories called “self-disclosure dimensions”, which arrange them into demographics, sexual profile, political attitudes, religious beliefs, health factors and condition, location, administrative, contact, and sentiment. We will adopt this taxonomy for the identification of private information, meaning that this step can be summarized as the identification of SAs inside deleted posts. Taking the example of Fig. 2, the SAs disclosed in the post are work location and employment status, together with a negative sentiment. In principle, these and the rest of the taxonomy's SAs could be automatically identified using methods and techniques for Natural Language Processing (NLP) such as regular expressions, named-entity recognition or deep learning algorithms (e.g. Nguyen-Son et al. [16] applied support vector machines for the identification of private information in SNSs). This task goes beyond the scope of this paper; therefore, we rely on the assumption that this step can be automated.

Dimension | Surveillance Attributes
Demographics | Age, Gender, Nationality, Racial origin, Ethnicity, Literacy level, Employment status, Income level, Family status
Sexual Profile | Sexual preference
Political Attitudes | Supported party, Political ideology
Religious Beliefs | Supported religion
Health Factors and Condition | Smoking, Alcohol drinking, Drug use, Chronic diseases, Disabilities, Other health factors
Location | Home location, Work location, Favorite places, Visited places
Administrative | Personal Identification Number
Contact | Email address, Phone number
Sentiment | Negative, Neutral, Positive

Table 1: The “self-disclosure” dimensions.
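To make the assumed automation of Step 1 tangible, the following fragment shows a deliberately trivial keyword baseline, far simpler than the NER or learned classifiers mentioned above. The keyword lists are invented for this example:

```python
# Illustrative keyword baseline for spotting surveillance attributes (SAs)
# in a post; the keyword lists are invented, and a real detector would rely
# on NER or learned classifiers.
import re

SA_PATTERNS = {
    "work location": re.compile(r"\b(office|workplace|at work)\b", re.I),
    "employment status": re.compile(r"\b(my job|employer|unemployed)\b", re.I),
    "negative sentiment": re.compile(r"\b(complaints|bad mood|awful|hate)\b", re.I),
}

def extract_sas(post_text):
    # Step 1 in miniature: return the set of SAs disclosed in the post.
    return {sa for sa, pattern in SA_PATTERNS.items() if pattern.search(post_text)}
```

Applied to the post of Fig. 2, this toy detector already picks up the work location and the negative sentiment, which is the kind of output the next step consumes.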
Step 2: Elicit Unwanted Incident
After identifying a set of SAs inside the deleted post, the next step is to gather information about the reasons that led the user to delete such content. In other words, we must (i) confirm that the post was deleted by the user because its publication led to an UIN, and (ii) if true, determine which particular UIN took place, together with its consequence level. For this purpose, we have designed the interface of Fig. 3, which is displayed to the user after she deletes a post which contains SAs. Basically, this interface asks the user whether the post being deleted caused her an unpleasant experience and asks her to provide further details. Particularly, the user can specify the unwanted incident that took place using a pre-defined list of UINs (i.e. using the list-box with the label “Unwanted Incident”) or adding a new item if it is not in the list (i.e. using the “Other” button). Likewise, the user can select the audience that should not have seen the post using a pre-defined list of social circles (i.e. using the list-box with the label “Unintended Audience”) or adding a new item if it is not in the list (i.e. using the “Other” button). Finally, the consequence level of the unwanted incident can be specified using the list-box with the label “Consequence Level”. Fig. 3 illustrates the report corresponding to the scenario of Fig. 2, in which the user describes the wake-up call from her superior as an UIN with a moderate consequence level. After completing all the fields on the interface, the user can submit the report on the deleted post by clicking on “Submit”.
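The report collected through this interface could be modelled roughly as follows; the field names and validation are our own assumptions for illustration:

```python
# Hypothetical shape of the report submitted through the interface of Fig. 3.
from dataclasses import dataclass

CONSEQUENCE_LEVELS = ("catastrophic", "major", "moderate", "minor", "insignificant")

@dataclass
class IncidentReport:
    deleted_post_sas: frozenset   # SAs identified in the deleted post (Step 1)
    unwanted_incident: str        # picked from the UIN list or a new "Other" entry
    unintended_audience: str      # e.g. "work colleagues"
    consequence_level: str        # one of the five ordinal values

    def __post_init__(self):
        if self.consequence_level not in CONSEQUENCE_LEVELS:
            raise ValueError(f"unknown consequence level: {self.consequence_level}")
```

Restricting the consequence level to the five ordinal values keeps the collected evidence compatible with the Criticality Index of section 3.2.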
Figure 3: Submit Unwanted Incidents Interface

Step 3: Match Existing Heuristics
As mentioned in section 3.1, a PHDB is a KB that contains a collection of PHs which represent different regrettable self-disclosure scenarios. Basically, these scenarios are patterns that repeat themselves when different posts with similar information result in the same UIN after reaching the same social circle (i.e. the same Audience). For instance, a user who posts “I hate my job at this company but damn, it pays the rent!” is revealing the same set of SAs as the ones revealed in the post of Fig. 2. Moreover, this user can suffer the same UIN as the user of Fig. 2 if the post is seen by her colleagues from work. Consequently, the same pattern (i.e. the same PH) can be extracted from two or more different posts. When this happens, deleted posts act as evidence of the same UIN and, therefore, can be used in the estimation of privacy risks.

Figure 4: Update Contingency Table Method
The goal of this step is to identify whether the regrettable scenario elicited in Steps 1 and 2 of the method corresponds to a pre-existing PH_i inside the PHDB. Overall, there are three ways in which a PH_i can match the elicited scenario. The first one is when the SAs, Audience and UIN associated with PH_i are equal to the SAs, Audience and UIN elicited in Steps 1 and 2 of the method (i.e. PH_i.SAs = post.SAs, PH_i.Audience = post.Audience and PH_i.UIN = post.UIN). For instance, a user reporting job loss after posting “My job at this company is like the coffee they serve...awful!” is a scenario very similar to the one of Fig. 2. Basically, in both cases the same UIN (i.e. job loss) takes place after disclosing the same set of SAs (i.e. work location, employment status and a negative sentiment) to the same Audience (i.e. work colleagues). If we analyse a regrettable scenario in terms of sufficient and necessary conditions, one can say that disclosing a set of SAs to a specific Audience is a sufficient condition for an UIN to (eventually) occur (i.e. SAs, Audience ⇒ UIN). In this sense, both scenarios have the same sufficient and necessary conditions. Consequently, if the sufficient and necessary conditions of PH_i are equal to the ones of the elicited scenario, then PH_i matches the elicited scenario.

The second matching case occurs when the UIN reported by the user differs from the one of PH_i, but the Audience and SAs stay equal (i.e. PH_i.SAs = post.SAs, PH_i.Audience = post.Audience and PH_i.UIN ≠ post.UIN). Let us consider again the previous example in which the user compares her workplace with the quality of her coffee. Let us imagine for a moment that this time the user has indicated harassment as the UIN instead of job loss. According to the matching criterion just introduced, the PH_i which models the scenario of Fig. 2 does not match this new scenario. However, both scenarios describe the same sufficient conditions (i.e. SAs and Audience) under which these UINs may take place. Hence, the new scenario can be modelled just by adding a new UIN (i.e. harassment) to PH_i (as we show in Step 5). Therefore, if only the sufficient conditions of PH_i are equal to the ones of the elicited scenario, then PH_i matches the elicited scenario after adding the elicited UIN to it.

The third matching case occurs when the SAs extracted in Step 1 of the method are a superset of the ones of PH_i but the Audience and the UIN reported by the user stay equal (i.e. PH_i.SAs ⊂ post.SAs, PH_i.Audience = post.Audience and PH_i.UIN = post.UIN). For instance, a user reporting job loss after posting “Moving to Boston was definitely not a good idea...the weather in this town sucks and so my job at this company!” is also a scenario very similar to the one of Fig. 2. In both cases the Audience is work colleagues and the UIN is job loss; however, an additional SA is disclosed in this new scenario: home location. According to the matching criteria introduced so far, the PH_i which models the scenario of Fig. 2 does not match the new scenario since their sufficient conditions are not equal. Nevertheless, the SAs disclosed in the new scenario are a superset of the ones disclosed in Fig. 2. This means that, according to PH_i, job loss can already occur when revealing fewer SAs to an audience composed of work colleagues. In this case, we say that the sufficient conditions of PH_i absorb the ones of the elicited scenario. Therefore, if the sufficient conditions of PH_i absorb the ones of the elicited scenario, then PH_i matches the elicited scenario when the necessary conditions of both stay equal.

If there is a PH that follows any of the matching criteria previously described, then it is retrieved from the PHDB and handed to Step 5. Otherwise, a new PH which represents the regrettable scenario being reported by the user must be created and added to the PHDB.
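The three matching cases above can be summarized in a small predicate. The representation (dictionaries with sas, audience and uins fields) is an assumption made for illustration:

```python
# Sketch of the three matching cases of Step 3: a stored PH and an elicited
# scenario are both given as a set of SAs, an audience, and a UIN.

def matches(ph, post_sas, post_audience, post_uin):
    if ph["audience"] != post_audience:
        return False                      # the necessary conditions must agree
    if ph["sas"] == post_sas:
        # Case 1 (same UIN) and case 2 (new UIN, to be added in Step 5):
        # identical sufficient conditions always match.
        return True
    # Case 3: the PH's sufficient conditions absorb the scenario's, i.e. the
    # same UIN already occurs when fewer SAs reach the same audience.
    return ph["sas"] < post_sas and post_uin in ph["uins"]
```

Note that case 2 matches on the sufficient conditions alone; recording the new UIN is deferred to the contingency-table update.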
Step 4: Add New Heuristic
This task, which corresponds to Step 4 of the method, takes care of generating the corresponding PH_i, taking as input the SAs of the deleted post together with the Audience and UIN entered by the user. The result in this case is a new PH and, consequently, an updated version of the PHDB, which corresponds to PHDB* in Fig. 4.

Step 5: Update Contingency Table
Regrettable self-disclosure scenarios can result in more than one UIN. For instance, the scenario of Fig. 2 may sometimes result in job loss and in other cases in reputation damage, depending on the situation reported by each user. Likewise, an UIN can be the consequence of different regrettable scenarios. For instance, disclosing one's sexual orientation or religious beliefs can lead in both cases to unjustified discrimination. Therefore, a PH can be associated with different UINs, and different UINs can be associated with more than one PH. Furthermore, some UINs are likely to occur more often than others. This means that, for the PH associated with the scenario of Fig. 2, reputation damage can be reported by the users more frequently than job loss (or vice versa). Moreover, certain consequence values of a particular UIN can be reported more often than others. For instance, reputation damage can be perceived in most cases as an event of a minor magnitude and rarely as a catastrophic event. Therefore, some consequence values may have a higher frequency than others.

A Contingency Table (CT) is a structure which organizes the information about the frequency of the UINs associated with a PH. Basically, it is a double-entry table in which each cell describes the number of times that an UIN_j has been reported as the consequence of the scenario modelled by a PH_i. This is expressed through a tuple of five elements representing the frequency of the values catastrophic, major, moderate, minor and insignificant. For instance, according to Table 2, UIN_2 has been reported 108 times as a negative consequence of the scenario modelled by PH_1. From these 108 occurrences, 50 were reported as events of catastrophic magnitude, 48 as major, and 10 as moderate. The goal of this step is to update the CT with the information provided by the user in Step 2. Basically, this consists of incrementing by 1 the consequence value of the UIN reported by the user for the PHs identified in Step 3 (or the PH created in Step 4). For instance, let us assume that one of the PHs identified was PH_1 and the user has reported UIN_2 as the UIN with a moderate consequence level. Then, the output of this step is an updated version of the CT in which the tuple that corresponds to PH_1 and UIN_2 contains now the values {50,48,11,0,0}. In case a new PH has been created as a result of Step 4 (or the user has specified a new UIN in Step 2), a new row (or column) of {0,0,0,0,0} corresponding to such PH (or UIN) must be added to the CT prior to the incrementation of the UIN's frequency.

       UIN_1            UIN_2            UIN_3
PH_1   {0,0,0,0,0}      {50,48,10,0,0}   {0,0,44,188,90}
PH_2   {0,0,0,0,0}      {0,0,79,55,0}    {0,0,0,0,0}
PH_3   {0,0,0,0,0}      {0,0,0,0,0}      {120,88,7,0,0}
PH_4   {300,33,0,0,0}   {0,0,0,0,0}      {0,0,0,0,0}
PH_5   {0,0,0,0,0}      {0,310,70,0,0}   {0,0,0,0,0}

Table 2: Contingency Table
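A minimal sketch of the Step 5 update follows; keying the table by (PH, UIN) pairs is our implementation choice, not the paper's:

```python
# Sketch of the contingency table (CT) of Table 2: each (PH, UIN) cell is a
# 5-tuple of counts ordered catastrophic..insignificant.
LEVELS = ("catastrophic", "major", "moderate", "minor", "insignificant")

def update_ct(ct, ph_id, uin, level):
    # Missing rows/columns start as {0,0,0,0,0}; Step 5 then increments by 1.
    counts = list(ct.get((ph_id, uin), (0, 0, 0, 0, 0)))
    counts[LEVELS.index(level)] += 1
    ct[(ph_id, uin)] = tuple(counts)
    return ct
```

Replaying the paper's example, a moderate report of UIN_2 for PH_1 turns {50,48,10,0,0} into {50,48,11,0,0}.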
5 ADAPTIVE RISK AWARENESS
As we mentioned previously, PHs represent knowledge on recurrent self-disclosure scenarios that often lead to regrettable experiences. Therefore, they can be used to (i) detect potentially regrettable scenarios, and (ii) alert on the risks associated with such scenarios. The method we have just described is an instrument for collecting and organizing evidence on UINs which are associated with different PHs. Therefore, the content inside the CT can help us to estimate and communicate the severity of the different risks that may arise when disclosing certain patterns of private information in SNSs. In this section we describe how the information contained inside the CT can be applied to the generation of adaptive privacy warnings. Concretely, we introduce an algorithm which computes the severity of privacy risks associated with the publication of a post and informs the user about such risks. In order to regulate the frequency of such interventions, the algorithm incorporates a mechanism based on the action taken by the user after the warning is triggered (i.e. whether the user publishes the post in the end or not).
Algorithm 1 describes a process in which the information insidethe CT is used to communicate the risks associated with a self-disclosure act. Basically, this consists of the generation of a warningmessage wMSG describing the risks that may occur if the userposts a message with private information. For this, the function
GenerateWarningMSG is invoked when she attempts to share a post P in a SNS. First, function GetSAs (line 3) analyses the informationdisclosed inside P and extracts the SAs from it (i.e. like in Step 1 ofthe method introduced in section 4). The result (i.e. a set of SAs)is assigned to postSAs and used thereafter to compute a set of PHswhich can provide information about potential privacy risks. Thisbasically consists of collecting those PHs whose SAs are includedin the ones of the post. In other words, PH i is retrieved from the PHDB when PH i . SAs ⊆ postSAs . This step is performed by thefunction GetPHs and its result then assigned to postPHs (line 4). If postPHs (cid:44) ∅ , it means that there is evidence about privacy risksthat might occur after the publication of P . That being the case, thenext step is to estimate the severity of such risks and communicatethem to the user. Algorithm 1
Adaptive awareness pseudo-code

function GenerateWarningMSG(Post P)
    WarningMSG wMSG;
    Set<SA> postSAs := GetSAs(P);
    Set<PH> postPHs := GetPHs(postSAs, PHDB);
    for each PH_i ∈ postPHs do
        Audience Au := GetAudience(PH_i);
        Set<UIN> postUINs := GetUINs(ContTbl, PH_i);
        for each UIN_j ∈ postUINs do
            ConsFreq F := GetConsFreq(ContTbl, PH_i, UIN_j);
            CritIndex Î_ij := ComputeCritIndex(F);
            if Î_ij > φ then
                wMSG.addRisk(UIN_j, Au);
            end if
        end for
    end for
    RaiseWarning(wMSG);
    Action usrAction := WaitForUsrAction();
    UpdateRiskThreshold(usrAction);
end function

In order to estimate the privacy risks of the post, we must first determine the frequency of those UINs that can take place if the post is shared. For this, we iterate through each PH_i inside postPHs (line 5) and extract (i) its audience, and (ii) a list of the UINs that are associated with it (lines 6 and 7). The information about the audience is extracted by the function GetAudience, assigned to the variable Au and used later for the generation of the warning message (line 6). On the other hand, the function GetUINs queries
ContTbl (i.e. the CT) to gather those UINs associated with PH_i whose frequency is greater than zero. Its outcome (i.e. a set of UINs) is assigned to postUINs and used thereafter for estimating the privacy risks of the post (line 7). Basically, the estimation of the risks consists of computing the CI of those UINs that are associated with any of the PHs inside postPHs. For this, we must first iterate through each UIN_j in postUINs (line 8) and obtain the frequency of its consequence values from the CT (line 9). This is done by the function GetConsFreq, which retrieves from the CT the cell corresponding to PH_i and UIN_j. After its execution, its outcome is assigned to F and handed to the next step, which is the computation of the CI. The function
ComputeCritIndex takes the frequency F of UIN_j and estimates its risk severity (line 10). For this, it uses the approach described in section 3.2, which consists of computing a CI using Eq. 1. To illustrate this step, let us assume that the user writes a post similar to the one in Fig. 1 and that the SAs disclosed inside the post match the ones of PH_1. Let us also assume that the information inside the CT is the one illustrated in Table 2. Then, we have two UINs (i.e. UIN_1 for job loss and UIN_2 for reputation damage) that are likely to occur if the post is shared by the user. In order to estimate the risks of this post we apply Eq. 1 using the frequency values corresponding to UIN_1 and UIN_2 for PH_1, which yields the estimates Î_1 and Î_2, the CIs of PH_1 for UIN_1 and UIN_2, respectively. According to these values, the severity of job loss is higher than the one of reputation damage. However, as mentioned in section 3.2, these values are an estimation of the CI based on a sample. Consequently, we must build a confidence interval containing the real parameter I with a certain confidence level. For this, we must first estimate the variance of Î according to the following equation:

Var(Î) = 1 / (n(K−1)²) · [ Σ_{k=1}^{K−1} (K−k)² p_k (1−p_k) − 2 Σ_{k=2}^{K−1} (K−k) p_k Σ_{l=1}^{k−1} (K−l) p_l ]   (2)

where n is the size of the sample, K the number of consequence levels, and p_k the proportion of observations in the sample corresponding to the category k. Using this equation, a confidence interval for Î can be obtained as:

Î − Z_{α/2} · S(Î) ≤ I ≤ Î + Z_{α/2} · S(Î)   (3)

where S(Î) is the standard deviation of Î, α the significance level, and Z_{α/2} the standard score for α/2. Consequently, for a significance level α = 0.05, the confidence intervals for I_1 and I_2 can be derived. Following a conservative criterion, the outcome of the function ComputeCritIndex will be the upper bound of the confidence interval created for each UIN. Therefore, the function will return 0.904 in the case of UIN_1, and 0.249 for UIN_2.

As mentioned in section 3, not all users are equally concerned about their privacy. There are users who are willing to expose themselves more in SNSs and users who would rather keep their private information away from public disclosure. In other words, some users take higher privacy risks than others when making privacy decisions. In line with this premise, Algorithm 1 introduces a privacy risk threshold φ which is used to determine which UINs should (or should not) be communicated to the user. Basically, it consists of a value between 0 and 1 which is tested against the risk criticality index returned by ComputeCritIndex. If this value is equal to or higher than φ, it means that the risk is unacceptable for the user and, therefore, the corresponding UIN should be communicated. Conversely, if this value is lower than φ, the risk is acceptable and the UIN should not be reported (line 11). To illustrate this mechanism, let us assume a value of φ lying between the two indices Î_1 and Î_2. Since Î_1 = 0.904 > φ, UIN_1 is added to the body of the warning message wMSG together with its corresponding audience Au (line 12). On the other hand, since the risk criticality index Î_2 = 0.249 < φ, UIN_2 is not included in the body of the warning message.

Figure 5: Envisioned Interface
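The computation performed by ComputeCritIndex and the threshold test of line 11 can be sketched in Python as follows. The sketch assumes that Eq. 1 is the ordinal risk index Î = Σ_{k=1}^{K−1} (K−k) p_k / (K−1) of Facchinetti and Osmetti [8], with category k = 1 denoting the most severe consequence level, and it hard-codes Z_{α/2} = 1.96 for α = 0.05; the function and variable names are illustrative assumptions rather than the paper's implementation.

```python
from math import sqrt

def compute_crit_index(freq, z=1.96):
    """Upper confidence bound of the criticality index for one (PH, UIN) cell.

    freq[k-1] is the number of reports with consequence level k,
    where k = 1 is the most severe of the K ordinal levels.
    """
    n = sum(freq)                          # sample size
    K = len(freq)                          # number of consequence levels
    p = [f / n for f in freq]              # sample proportions p_k
    w = [K - k for k in range(1, K + 1)]   # weights K-1, ..., 1, 0
    i_hat = sum(wk * pk for wk, pk in zip(w, p)) / (K - 1)  # Eq. 1 (point estimate)
    # Eq. 2: variance of a weighted sum of multinomial proportions
    var = (sum(w[k] ** 2 * p[k] * (1 - p[k]) for k in range(K))
           - 2 * sum(w[k] * p[k] * sum(w[l] * p[l] for l in range(k))
                     for k in range(1, K))) / (n * (K - 1) ** 2)
    # Eq. 3, conservative criterion: return the interval's upper bound
    return min(i_hat + z * sqrt(max(var, 0.0)), 1.0)

def risks_to_report(indexed_uins, phi):
    """Keep only the UINs whose index reaches the threshold phi (line 11)."""
    return [uin for uin, idx in indexed_uins if idx >= phi]
```

For instance, with a threshold φ = 0.5, a UIN whose index is 0.904 would be added to the warning message while one at 0.249 would be suppressed.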
After the content of the warning message has been defined, the function RaiseWarning takes over the task of communicating the message to the user (line 16). This can be done using a pop-up message like the one illustrated in Fig. 5. At this point, the user has the chance to re-think the content of her post or proceed with its publication. It may happen that, after some time, the number of rejected warnings is higher/lower than the number of accepted warnings. That is, one may observe that, after a time frame τ, the user tends to ignore (or accept) the warnings and proceed (or not proceed) with the publication of her posts. In order to regulate the frequency of the interventions, the value of φ is adjusted at the end of each τ interval according to the actions taken by the user. Basically, this consists of decreasing/increasing the value of φ depending on the number of times the user has accepted/rejected an intervention. For this, the function WaitForUsrAction waits for the user's decision and forwards it to the function UpdateRiskThreshold, which takes care of updating the value of φ (lines 17 and 18). This function keeps track of the number of times the user has ignored/followed the warnings within a τ period of time. After each τ period, if ignored > accepted, then the value of φ is increased by δ (i.e. φ_{τ+1} := φ_τ + δ). Conversely, if ignored < accepted, then φ is decreased by δ (i.e. φ_{τ+1} := φ_τ − δ).

DISCUSSION
Although our approach is devoid of assumptions related to the estimation of privacy risks, there are limitations that should be acknowledged and considered. One of these issues is related to the identification of SAs inside the users' posts. As we mentioned in section 4, there are different NLP methods that could be applied for the automatic identification of such SAs. However, the use of sarcasm or irony (which is common inside SNSs) can significantly modify the meaning of a post and, therefore, hinder its analysis (e.g. a post could be classified as negative, when in fact it is sarcastic) [11, 12]. Another issue is related to posts which are not self-referential. For instance, a post like “Working at Google may sound great...but I am sure that it can be a very competitive and hostile work environment” expresses a negative opinion about working for Google; however, it does not say that the user who wrote it works for this company. Therefore, whatever method one defines for the identification of SAs inside posts, it should be aware of these variations in order to identify regrettable self-disclosure scenarios correctly.

Another aspect to be considered is related to the values that are assigned to the parameters of Algorithm 1. For instance, one must assign an initial value to φ, which can result in more or fewer interventions at the initial phases of the awareness process: a value of φ closer to 0 would result in a higher intervention frequency, whereas a value closer to 1 would generate a lower amount of interventions. In line with this, the value assigned to τ can impact the values adopted by φ at the adaptation phase. That is, a small τ would limit the amount of evidence gathered on warnings being accepted/ignored by the user. Hence, the value of φ would probably not reflect the user's privacy behaviour. Likewise, a big τ would result in values of φ that stay invariant for long periods of time.
Consequently, the frequency of the interventions would not be reactive enough with regard to the user's privacy decisions. Both parameters, φ and τ, should be chosen with care in order to guarantee sustained awareness support to the user.

Although the awareness system described in this paper stands for the protection of the users' privacy, it is ultimately a recommender system. Moreover, it is a system which requires analysing and processing personal information for shaping its recommendations. Hence, its benefits come along with the privacy concerns that are characteristic of recommender systems. That is, issues related to algorithmic transparency, fairness and trust that can jeopardize the users' privacy rights. This calls for a Data Privacy Impact Assessment (DPIA) of the different software artefacts and information flows described throughout this paper (i.e. PHDB, UPDB, post analysis, etc.). The notion of DPIA has been introduced in the EU General Data Protection Regulation [18] and is basically an assessment that service providers must conduct in order to identify and minimize the risks that data processing may bring to the privacy rights of data subjects (i.e. the users). Although this analysis goes beyond the scope of this work, it is a critical point that must be taken into consideration and further elaborated.

Evaluating the approach introduced in this paper brings up a series of challenges related to the technical and human resources that are necessary for setting up an experimental environment. Basically, our approach requires input from a large number of users in order to estimate the risk value of a set of self-disclosure scenarios. Moreover, enough evidence on each particular scenario inside the PHDB is necessary to perform such estimation.
Hence, one must rely not only on a large number of participants, but these participants should also provide enough input on each of the PHs stored in the PHDB. In principle, gathering such an extent of user input could be possible through the development of an SNS plugin that implements the interface of Fig. 3. However, it would still take some time until enough information on each PH is collected and, consequently, until the corresponding risk index could be computed.

One way of gathering the information necessary for the estimation of privacy risks is through an online questionnaire. For instance, one can propose a questionnaire consisting of a set of self-disclosure scenarios and ask the participants to rate the severity level of each of them using an ordinal scale. Each scenario can afterwards be represented as a PH, and the information collected from the questionnaires inserted in the CT. By doing this, we simplify the process of collecting heuristic-related information through deleted posts and simulate the knowledge extraction process described in section 3. On the other hand, the process of knowledge application can be carried out by implementing the warning system described in this paper as a mobile application. That is, an application from which (i) users can post messages using their SNS accounts (e.g. via the Facebook or Twitter APIs), and which (ii) intervenes following the approach described in section 5. This app can afterwards be used in an experiment in which the participants are requested to install it and use it for a certain time. Thereafter, the effectiveness of the interventions generated by the app can be evaluated by conducting structured interviews with the participants.
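As a rough sketch of this simulation strategy, the ordinal severity ratings collected for one scenario can be collapsed into the frequency vector that a CT cell stores. The function name, the default scale size and the mapping of one questionnaire scenario to one (PH, UIN) cell are assumptions for illustration:

```python
from collections import Counter

def ratings_to_ct_cell(ratings, num_levels=5):
    """Collapse a list of ordinal severity ratings (1..num_levels)
    into the per-level frequency vector stored in one CT cell."""
    counts = Counter(ratings)
    return [counts.get(level, 0) for level in range(1, num_levels + 1)]

# Ratings from, e.g., ten participants for one (PH, UIN) pair
cell = ratings_to_ct_cell([1, 2, 2, 3, 3, 3, 4, 4, 5, 5])
```

Each resulting vector can then be fed to the risk-estimation step in place of the frequencies harvested from deleted posts.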
CONCLUSION

Risk communication and management is a valuable instrument for helping the users of SNSs to make better and more informed privacy decisions. In particular, incorporating adaptive risk awareness features into the design of PTs can have a positive impact on their engagement levels [6]. As we mentioned in section 3, engineering such adaptive solutions requires the definition of processes related to knowledge extraction and application. In this work we have addressed both activities through (i) the definition of a method for collecting information about UINs and (ii) an algorithm for generating adaptive interventions. The envisioned interface illustrated in Fig. 5 shows how these two instruments can work in cooperation in order to endow SNSs with user-centred privacy awareness features.

Adapting the risk information and frequency of interventions is a step towards more effective PTs. However, a study by Kaptein et al. [13] suggests that framing the style of an intervention can also improve its effectiveness. In their study, persuasive messages for promoting healthier eating habits were framed using either an authoritarian style (e.g. “The World Health Organization advises not to snack. Snacking is not good for you”) or a consensus style (e.g. “Everybody agrees: not snacking between meals helps you to stay healthy”). The outcome of this experiment suggests that persuasive messages are more effective when tailored to the user's preferred persuasive style. This phenomenon was also observed in an experiment conducted by Schäwel et al. [21], in which warning messages were used to promote safer self-disclosure decisions among the users of SNSs. In particular, an intervention framed using an authoritarian style such as “Rethink what you are going to provide. Privacy researchers from Harvard University identify such information as highly sensitive!” can be more effective than another one framed using a consensus style like “Everybody agrees: providing sensitive information can result in privacy risks!”, and vice versa. Adapting the persuasive style of interventions is an aspect that will be further investigated in our future publications.
ACKNOWLEDGMENTS
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant No. GRK 2167, Research Training Group “User-Centred Social Media”.
REFERENCES

[1] Emily Christofides, Amy Muise, and Serge Desmarais. 2012. Risky Disclosures on Facebook: The Effect of Having a Bad Experience on Online Behavior. Journal of Adolescent Research 27, 6 (2012), 714–731.
[2] Sourya Joyee De and Daniel Le Métayer. 2018. Privacy Risk Analysis to Enable Informed Privacy Settings. In 2018 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). 95–102. https://doi.org/10.1109/EuroSPW.2018.00019
[3] Claudia Diaz and Seda Gürses. 2012. Understanding the landscape of privacy technologies (extended abstract). In Proceedings of the Information Security Summit, ISS 2012. 58–63.
[4] Nicolás E. Díaz Ferreyra, Rene Meis, and Maritta Heisel. 2017. Online Self-disclosure: From Users' Regrets to Instructional Awareness. In International Cross-Domain Conference for Machine Learning and Knowledge Extraction. Springer, 83–102.
[5] Nicolás E. Díaz Ferreyra, Rene Meis, and Maritta Heisel. 2017. Should User-generated Content be a Matter of Privacy Awareness? A position paper. In Proceedings of the 9th International Conference On Knowledge Management and Information Sharing (KMIS 2017), Kecheng Liu, Ana Carolina Salgado, Jorge Bernardino, and Joaquim Filipe (Eds.). Vol. 3. SciTePress, 212–216.
[6] Nicolás E. Díaz Ferreyra, Rene Meis, and Maritta Heisel. 2018. At Your Own Risk: Shaping Privacy Heuristics for Online Self-disclosure. In Proceedings of the 16th Annual Conference on Privacy, Security and Trust (PST). IEEE, 1–10.
[7] Julie S Downs, George Loewenstein, and Jessica Wisdom. 2009. Strategies for Promoting Healthier Food Choices. American Economic Review 99, 2 (2009), 159–64.
[8] Silvia Facchinetti and Silvia Angela Osmetti. 2018. A Risk Index for Ordinal Variables and its Statistical Properties: A Priority of Intervention Indicator in Quality Control Framework. Quality and Reliability Engineering International.
[9] In Proceedings of the 19th International Conference on World Wide Web. ACM, 351–360. https://doi.org/10.1145/1772690.1772727
[10] Heikki Hiilamo, Eric Crosbie, and Stanton A Glantz. 2014. The evolution of health warning labels on cigarette packs: the role of precedents, and tobacco industry strategies to block diffusion. Tobacco Control 23, 1 (2014), e2–e2.
[11] Aditya Joshi, Pushpak Bhattacharyya, and Mark J Carman. 2017. Automatic Sarcasm Detection: A Survey. ACM Computing Surveys (CSUR) 50, 5 (2017), 73.
[12] Aditya Joshi, Vaibhav Tripathi, Pushpak Bhattacharyya, Mark Carman, Meghna Singh, Jaya Saraswati, and Rajita Shukla. 2016. How Challenging is Sarcasm versus Irony Classification?: A Study With a Dataset from English Literature. In Proceedings of the 14th Australasian Language Technology Association Workshop. 123–127.
[13] Maurits Kaptein, Boris De Ruyter, Panos Markopoulos, and Emile Aarts. 2012. Adaptive persuasive systems: A Study of Tailored Persuasive Text Messages to Reduce Snacking. ACM Transactions on Interactive Intelligent Systems 2, 2 (2012), 10.
[14] Mass Soldal Lund, Bjørnar Solhaug, and Ketil Stølen. 2010. Model-Driven Risk Analysis: The CORAS Approach. Springer Science & Business Media.
[15] Gaurav Misra, Jose M. Such, and Hamed Balogun. 2016. Non-Sharing Communities? An Empirical Study of Community Detection for Access Control Decisions. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). 49–56.
[16] Hoang-Quoc Nguyen-Son, Minh-Triet Tran, Hiroshi Yoshiura, Noboru Sonehara, and Isao Echizen. 2015. Anonymizing Personal Text Messages Posted in Online Social Networks and Detecting Disclosures of Personal Information. IEICE Transactions on Information and Systems 98, 1 (January 2015), 78–88.
[17] Georgios Petkos, Symeon Papadopoulos, and Yiannis Kompatsiaris. 2015. PScore: A Framework for Enhancing Privacy Awareness in Online Social Networks. In Proceedings of the 10th International Conference on Availability, Reliability and Security, ARES 2015. IEEE, 592–600.
[18] General Data Protection Regulation. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union (OJ) 59 (2016), 1–88.
[19] Sonam Samat and Alessandro Acquisti. 2017. Format vs. Content: The Impact of Risk and Presentation on Disclosure Decisions. In Thirteenth Symposium on Usable Privacy and Security (SOUPS 2017). USENIX Association, 377–384.
[20] Johanna Schäwel. 2017. Paving the Way for Technical Privacy Support: A Qualitative Study on Users' Intentions to Engage in Privacy Protection. In The 67th Annual Conference of the International Communication Association.
[21] Johanna Schäwel and Nicole Krämer. 2018. Do You Really Want to Disclose?: Examining Psychological Variables that Influence the Effects of Persuasive Prompts for Reducing Online Privacy Risks. (2018). Forschungsreferat beim 51. Kongress der Deutschen Gesellschaft für Psychologie (DGPs).
[22] Luke Stark. 2016. The Emotional Context of Information Privacy. The Information Society 32, 1 (January 2016), 14–27.
[23] Yang Wang, Gregory Norcie, Saranga Komanduri, Alessandro Acquisti, Pedro Giovanni Leon, and Lorrie Faith Cranor. 2011. “I regretted the minute I pressed share”: A Qualitative Study of Regrets on Facebook. In Proceedings of the 7th Symposium on Usable Privacy and Security, SOUPS 2011. ACM, 1–16.
[24] Jan Henrik Ziegeldorf, Martin Henze, René Hummen, and Klaus Wehrle. 2015. Comparison-based privacy: nudging privacy in social media (position paper). In