Adversarial Stylometry in the Wild: Transferable Lexical Substitution Attacks on Author Profiling
Chris Emmery
CSAI, Tilburg University / CLiPS, University of Antwerp
[email protected]

Ákos Kádár
Borealis AI [email protected]
Grzegorz Chrupała
CSAI, Tilburg University [email protected]
Abstract
Written language contains stylistic cues that can be exploited to automatically infer a variety of potentially sensitive author information. Adversarial stylometry intends to attack such models by rewriting an author's text. Our research proposes several components to facilitate deployment of these adversarial attacks in the wild, where neither data nor target models are accessible. We introduce a transformer-based extension of a lexical replacement attack, and show it achieves high transferability when trained on a weakly labeled corpus, decreasing target model performance below chance. While not completely inconspicuous, our more successful attacks also prove notably less detectable by humans. Our framework therefore provides a promising direction for future privacy-preserving adversarial attacks.
1 Introduction

The widespread use of machine learning on consumer devices, and its application to their data, has prompted security and privacy researchers alike to investigate the correct handling of sensitive information (Edwards and Storkey, 2016; Abadi et al., 2016b). Natural Language Processing (NLP) is no exception (Fernandes et al., 2019; Li et al., 2018); written text can contain a plethora of author information, either consciously shared or inferable through stylometric analysis (Rao et al., 2000; Adams, 2006). This characteristic is fundamental to author profiling (Koppel et al., 2002), and while the field's main interest pertains to the study of sociolinguistic and stylometric features that underpin our language use (Daelemans, 2013), herein simultaneously lie its dual-use problems. Author profiling can, often with high accuracy, infer an extensive set of (sensitive) personal information, such as age, gender, education, socio-economic status, and mental health issues (Eisenstein et al., 2011; Alowibdi et al., 2013; Volkova et al., 2014; Plank and Hovy, 2015; Volkova and Bachrach, 2016). It therefore potentially exposes anyone sharing written online content to unauthorized information collection through their writing style. This can prove particularly harmful to individuals in a vulnerable position regarding, e.g., race, political affiliation, or mental health.

Privacy-preserving defenses against such inferences can be found in the field of adversarial stylometry. Our research concerns the obfuscation subtask, where the aim is to rewrite an input text such that the style changes, and stylometric predictions fail. It is part of a growing body of research into adversarial attacks on NLP (Smith, 2012), to which various modern models have proven vulnerable; e.g., in neural machine translation (Ebrahimi et al., 2018), summarization (Cheng et al., 2020), and text classification (Liang et al., 2018). Note that these are adversarial attacks on models making stylometric predictions, not to be confused with adversarial learning.

Adversarial attacks on NLP are predominantly aimed at demonstrating vulnerabilities in existing algorithms or models, such that they might be fixed, or explicitly improved through adversarial training. Consequently, most related work focuses on white or black-box settings, where all or part of the target model is accessible (e.g., its predictions, data, parameters, gradients, or probability distribution) to fit an attack. The current research, however, does not intend to improve the targeted models; rather, we want to provide the attacks as tools to protect online privacy. This introduces several constraints over other NLP-based adversarial attacks, as it calls for a realistic, in-the-wild scenario of application. All code, data, and materials to fully reproduce the experiments are openly available at https://github.com/cmry/reap.

Firstly, authors seeking to protect themselves from stylometric analysis cannot be assumed to be knowledgeable about the target architecture, nor to have access to suitable training data (as the target could have been trained on any domain). Hence, we cannot optimally tailor attacks to the target, and need an accessible method of mimicking it to evaluate the obfuscation success. To facilitate this, we use a so-called substitute model, which for our purposes is an author profiling classifier trained in isolation (with its own data and architecture) that informs our attacks.
Attacks fitted on substitute models have been shown to transfer their success when targeting models with different architectures, or trained on other data, in a variety of machine learning tasks (Papernot et al., 2016). The effectiveness of an attack fitted on a substitute model when targeting a 'real' model is then referred to as transferability, which we will measure for the obfuscation methods proposed in the current research.

Secondly, for an obfuscation attack to work in practice (e.g., given a limited post history), it should suggest relevant changes to the author's writing on a domain of their choice. This implies the substitute models should be fitted locally, and therefore need to meet two criteria: reliable access to labeled data, and being relatively fast and easy to train. To meet the first criterion, the current research focuses on gender prediction, as: i) Twitter corpora annotated with this variable are by far the largest (and most common), ii) author profiling methods typically use similar architectures for different attributes; therefore, the generalization of attacks to other author attributes can be assumed to a large extent, and, most importantly, iii) Beller et al. (2014) and Emmery et al. (2017) have shown that through distant labeling, a representative corpus for this task can be collected in under a day. This allows us to measure transferability of attacks fitted using realistically collected distant corpora to models using high-quality, hand-labeled corpora.

As for the attacks, we focus on lexical substitution of content words strongly related to a given label, as those have been shown to explain a significant portion of the accuracy of stylometric models (see e.g., Rao et al., 2000; Burger et al., 2011; Sap et al., 2014; Rangel et al., 2016). To that effect, we extend the substitution attack of Jin et al. (2020) and apply it to author attribute obfuscation. Specifically, we explore the potential of training a simple (so as to meet the speed criterion), non-neural substitute model f' to indicate relevant words to perturb, where retaining the original meaning is prioritized.

Figure 1: Obfuscation scenario: model f' trains on tweet batches, an omission score is used to determine and rank the words according to their classification contribution. These are then passed to either TextFooler, Masked BERT, or Dropout BERT to suggest top-k replacement candidates. From these, a selection is made based on their class probability change on f'(D). Finally, f is evaluated on the perturbed tweets D_ADV.

Two transformer-based models are introduced to the framework to propose and rank lexical substitutions towards a change in the predictions of f'. We evaluate if the attacks on f' transfer across corpora, architectures, and a separately trained target model f (see Figure 1). Finally, we measure the quality of changes using automatic evaluation metrics, and conduct a human evaluation that focuses on detection accuracy of the attacks.

2 Related Work

Stylometry, the study of (predominantly) writing style, dates back several decades (Mosteller and Wallace, 1963), and has seen increased accessibility through the introduction of statistical models (see surveys by Holmes, 1998; Neal et al., 2017) and machine learning (e.g., Matthews and Merriam, 1993; Merriam and Matthews, 1994).
Computational stylometry distinguishes several subtasks, such as determining (Baayen et al., 2002) and verifying author identity (Koppel and Schler, 2004), and author profiling (Argamon et al., 2005); e.g., predicting demographic attributes. Adversarial stylometry (as conceptualized by Brennan et al., 2012) intends to subvert these inferences by changing an author's text through imitation, or, as pertains to our research, the obfuscation of writing style (Kacmarcik and Gamon, 2006; Caliskan et al., 2018; Le et al., 2015; Xu et al., 2019).

These changes, or perturbations, can be produced in several ways, and the task is therefore often conflated with paraphrasing (Reddy and Knight, 2016), style transfer (Kabbara and Cheung, 2016), and generating adversarial samples or triggers (Zhang et al., 2020b). Regardless of the employed method, the main challenge of obfuscation lies in retaining the original meaning of an input text; its written language medium limits any perturbations to discrete outputs, and unnatural discrepancies are significantly easier for humans to discern than, say, a few pixel changes in an image. An additional, persistent limitation is the absence of evaluation metrics that guarantee complete preservation of the original meaning of the input while changes remain unnoticed (Potthast et al., 2016). This inhibits automatic evaluation not only of obfuscation, but of all natural language generation research (Novikova et al., 2017), placing an emphasis on human evaluation (van der Lee et al., 2019).

It is perhaps for this reason that most obfuscation work uses heuristically-driven, controlled changes, such as splitting or merging words or sentences, removing stop words, or changing spelling, punctuation, or casing (see e.g., Karadzhov et al., 2017; Eger et al., 2019). These specific attacks are typically easier to mitigate through preprocessing (Juola and Vescovi, 2011). Obfuscation through lexical substitution (Mansoorizadeh et al., 2016; Bevendorff et al., 2019, 2020) provides a middle ground of control, semantic preservation, and attack effectiveness; however, it might prove less effective against models relying on deeper stylistic features (e.g., word order, part-of-speech (POS) tags, or reading complexity scores). End-to-end systems have been employed for similar purposes (Shetty et al., 2018; Saedi and Dras, 2020), or to rewrite entire phrases (Emmery et al., 2018; Bo et al., 2019) using (adversarially-driven) autoencoders. Such attacks seem less common, and provide less control over the perturbations and semantic consistency.

Our work does not assume the attacks run end-to-end, but with a hypothetical human in the loop. We further opt for techniques that are more likely to find strong semantic mirrors of the original text while making minimal changes. A substitute model (the algorithm, hyper-parameters, and output of which an author can manipulate as desired) is employed to indicate candidate replacement words, and our attacks suggest and rank those against this substitute. Moreover, prior work typically attacks adversaries trained on the same data, whereas we add a transferability measure. Lastly, while author identification has been investigated in the wild (Stolerman et al., 2013), our work is, to our knowledge, the first to make a conscious effort towards realistic applicability of obfuscation techniques.
3 Attack Framework

Our attack framework extends TextFooler (TF, Jin et al., 2020) in several ways. First, a substitute gender classifier is trained, from which the logit output given a document is used to rank words by their prediction importance through an omission score (Section 3.1). For the top most important words, substitute candidates are proposed, for which we add two additional techniques (Section 3.2). These candidates can be checked and filtered on consistency with the original words (by their POS tags, for example), accepted as-is, or re-ranked (Section 3.3). For the latter, we add a scoring method. Finally, the remaining candidates are used for iterative substitution until TF's stopping criterion is met (i.e., the prediction changes, or candidates run out).
3.1 Word Importance Ranking

We are given a target classifier f, a substitute classifier f', a document D consisting of tokens D_i, and a target label y. Here, f' is trained on some corpus X, and receives an author's new input text D, where the author provides label y. We denote a class label as ȳ if f'(D) predicts anything but y. Our perturbations form adversarial input D_ADV that intends to produce f'(D_ADV) = ȳ, and thereby implicitly f(D_ADV) = ȳ. Note that we only submit D to f for evaluating the attack effectiveness; it is never used to fit the attack itself.

To create D_ADV, a minimum number of edits is preferred, and thus we rank all words in D by their omission score (similar to e.g., Kádár et al., 2017) according to f' (omission_score in Algorithm 1). Let D∖i denote the document after deleting D_i, and o_y(D) the logit score for class y by f'. The omission score is then given by o_y(D) − o_y(D∖i), and used in an importance score I of token D_i, as:

\[
I_{D_i} =
\begin{cases}
o_y(D) - o_y(D_{\setminus i}), & \text{if } f'(D) = f'(D_{\setminus i}) = y\\
\big(o_y(D) - o_y(D_{\setminus i})\big) + \big(o_{\bar{y}}(D_{\setminus i}) - o_{\bar{y}}(D)\big), & \text{if } f'(D) = y,\ f'(D_{\setminus i}) = \bar{y},\ y \neq \bar{y}
\end{cases}
\tag{1}
\]

With I_{D_i} calculated for all words in D, the top-k ranked tokens are chosen as target words T.
Algorithm 1: Obfuscation by lexical replacement.

    Input:  f' – substitute model
            D = {w_1, w_2, ..., w_n} – document
            y – target label
            checks – apply checks (bool)
            k – target max k-amount words
    Output: D_ADV – obfuscated document

    for D_i ∈ D do
        I_{D_i} ← omission_score(f', y)            // via Equation 1
    T ← top_k(argsort_desc(D, I_{D_i} scores), k)
    D_ADV = D
    for t ∈ T do
        C_t ← candidates(t)                        // substitution attack on t
        A = {(D_ADV_{1:i−1}, C_{t,j}, D_ADV_{i+1:n})} for 1 ≤ j ≤ |C_t|
        Ā ← filter/rank(D, A; t, checks)
        for D' ∈ Ā do                              // test attack success on f'
            if argmax_c o_c(D') ≠ y then
                return D_ADV = D'
            else if o_y(D') < o_y(D_ADV) then
                replace t in D_ADV with c from D'
    return D_ADV
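To make the ranking step concrete, the following is a minimal sketch of the omission score in Equation 1, assuming a binary scikit-learn-style substitute pipeline (here named f_sub) that exposes decision_function and predict; the whitespace joining and helper names are ours, not the paper's code.

```python
# Minimal sketch of the omission score (Equation 1); f_sub is assumed
# to be a fitted scikit-learn pipeline returning one logit per document.
import numpy as np

def logit(f_sub, tokens, label):
    """Signed logit for `label` (binary case: positive favors class 1)."""
    score = f_sub.decision_function([" ".join(tokens)])[0]
    return score if label == 1 else -score

def rank_targets(f_sub, tokens, y, k=50):
    """Return indices of the top-k most important tokens."""
    base = logit(f_sub, tokens, y)
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        score = base - logit(f_sub, reduced, y)
        # If omission flips the prediction, add the opposite class gain.
        if f_sub.predict([" ".join(reduced)])[0] != y:
            score += logit(f_sub, reduced, 1 - y) - logit(f_sub, tokens, 1 - y)
        scores.append(score)
    return np.argsort(scores)[::-1][:k]  # most important first
```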
3.2 Substitution Methods

Four approaches to perturb a target word t ∈ T are considered in our experiments. These operations are referred to as candidates in Algorithm 1.

Synonym Substitution (WS)
This TF-based substitution embeds t as a vector using a pre-trained embedding matrix V. C_t is selected by computing the cosine similarity between t and all available word embeddings w ∈ V. We denote cosine similarity with Λ(t, w). A threshold δ is used to keep only reliable candidates, with Λ(t, w) > δ.
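A sketch of this candidate selection step follows, assuming the counter-fitted vectors are loaded into a word-to-vector dictionary (emb); the helper names and the pre-normalized matrix are ours.

```python
# Sketch of synonym candidate selection over counter-fitted vectors;
# `emb` is assumed to map words to numpy arrays.
import numpy as np

def build_index(emb):
    vocab = list(emb)
    mat = np.stack([emb[w] for w in vocab])
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit vectors
    return vocab, mat

def synonym_candidates(t, emb, vocab, mat, n=50, delta=0.7):
    """Top-n words w with cosine similarity Λ(t, w) > δ to target t."""
    if t not in emb:
        return []
    v = emb[t] / np.linalg.norm(emb[t])
    sims = mat @ v
    best = np.argsort(sims)[::-1][1:n + 1]  # skip t itself
    return [vocab[i] for i in best if sims[i] > delta]
```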
Masked Substitution (MB)

The embedding-based substitutions can be replaced by a language model predicting the contextually most likely token. BERT (Devlin et al., 2019), a bi-directional encoder (Vaswani et al., 2017) trained through masked language modeling and next-sentence prediction, makes this fairly trivial. By replacing t with a mask, BERT produces a top-k most likely C_t for that position. Implementing this in TF does imply each previous substitution of t might be included in the context of the current one. This method of contextual replacement has two drawbacks: i) semantic consistency with the original word is not guaranteed (as the model has no knowledge of t), and ii) the replaced context means semantic drift can occur, as all subsequent substitutions follow the new, possibly incorrect context.
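The following sketch illustrates masked substitution via the Hugging Face fill-mask pipeline; the model handle (bert-base-uncased) and whitespace tokenization are simplifying assumptions, not the paper's exact setup.

```python
# Sketch of masked substitution: mask the target position and read
# BERT's top-k most likely fillers.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased", top_k=10)

def masked_candidates(tokens, i):
    """Replace token i with [MASK] and return the top-k fillers."""
    masked = tokens[:i] + [fill.tokenizer.mask_token] + tokens[i + 1:]
    return [r["token_str"] for r in fill(" ".join(masked))]

# e.g., masked_candidates("i need another job asap".split(), 3)
```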
Dropout Substitution (DB)

A method to circumvent the former (i.e., BERT's masked prediction limitations for lexical substitution) was presented by Zhou et al. (2019). They apply dropout (Srivastava et al., 2014) to BERT's internal embedding of target word t before it is passed to the transformer, zeroing part of the weights with some probability. The assumption is that C_t (BERT's top-k) will contain candidates closer to the original t than the masked suggestions.
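A sketch of dropout substitution under the same assumptions (bert-base-uncased; one subword per word, so whitespace token i maps to position i + 1 after [CLS]); p = 0.3 follows Zhou et al. (2019), as adopted later in our setup.

```python
# Sketch of dropout substitution: zero part of the target token's
# input embedding, then read the model's top-k predictions there.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def dropout_candidates(text, i, p=0.3, k=10):
    enc = tok(text, return_tensors="pt")
    pos = i + 1  # offset for the [CLS] token (one subword per word assumed)
    with torch.no_grad():
        embeds = mlm.get_input_embeddings()(enc["input_ids"])
        embeds[0, pos] = torch.nn.functional.dropout(
            embeds[0, pos], p=p, training=True)
        logits = mlm(inputs_embeds=embeds,
                     attention_mask=enc["attention_mask"]).logits
    return tok.convert_ids_to_tokens(logits[0, pos].topk(k).indices.tolist())
```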
Heuristic Substitution

To evaluate the relative performance of the techniques described before, we employ several heuristic attacks as baselines. In the order of Table 3: 1337-speak converts characters to their leetspeak variants, in a similar vein to, e.g., diacritic conversion (Belinkov and Bisk, 2018). Character flip inverts two characters in the middle of a word, which was shown to least affect readability (Rayner et al., 2006). Random spaces splits a token into two at a random position.
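Sketches of the three heuristics; the leetspeak mapping shown is a small illustrative subset rather than the paper's full conversion table.

```python
# Sketches of the heuristic baselines: leetspeak, middle character
# flip, and random space insertion.
import random

LEET = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def leetspeak(word):
    return "".join(LEET.get(c, c) for c in word)

def char_flip(word):
    """Swap the two middle characters, leaving first/last intact."""
    if len(word) < 4:
        return word
    m = len(word) // 2
    return word[:m - 1] + word[m] + word[m - 1] + word[m + 1:]

def random_space(word):
    """Split a token into two at a random position."""
    if len(word) < 2:
        return word
    i = random.randint(1, len(word) - 1)
    return word[:i] + " " + word[i:]
```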
3.3 Candidate Filtering and Ranking

Given C_t, either all, or only the highest ranked, candidates can be accepted as-is. Alternatively, all D' can be filtered by submitting them to checks, or re-ranked based on their semantic consistency with D. These operations are referred to as filter/rank in Algorithm 1; both can be executed.

Part-of-Speech and Document Encoding
TF employs two checking components: first, it removes any c that has a different POS tag than t. If multiple D' exist so that f'(D') = ȳ, it selects the document D' which has the highest cosine similarity to the Universal Sentence Encoder (USE) embedding (Cer et al., 2018) of the original document D. If not, the D' with the lowest target word omission score is chosen (as per TF's method).
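A sketch of both checks, assuming spaCy's en_core_web_sm model for POS tags and the TF-Hub Universal Sentence Encoder; both handles are assumptions, and the index-based POS comparison presumes aligned tokenization between original and candidate.

```python
# Sketch of the POS consistency check and USE similarity ranking.
import numpy as np
import spacy
import tensorflow_hub as hub

nlp = spacy.load("en_core_web_sm")
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def pos_match(original, candidate, i):
    """Keep a candidate document only if token i keeps its POS tag."""
    return nlp(original)[i].pos_ == nlp(candidate)[i].pos_

def use_rank(original, candidates):
    """Order candidate documents by USE cosine similarity to D."""
    vecs = use([original] + candidates).numpy()
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs[1:] @ vecs[0]
    return [candidates[j] for j in np.argsort(sims)[::-1]]
```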
BERT Similarity

Zhou et al. (2019) use the concatenation of the last four layers in BERT as a sentence's contextualized representation h. We apply this in both Masked (MB) and Dropout (DB) BERT to re-rank all possible D' by embedding them. Given document D, target t, and perturbation candidate document D', C_t would be ranked via an embedding similarity score:
\[
\mathrm{SIM}(D, D'; t) = \sum_{i=1}^{n} w_{i,t} \times \Lambda\big(h(D_i \mid D),\ h(D'_i \mid D')\big)
\tag{2}
\]

where h(D_i | D) is BERT's contextualized representation of the i-th token in D, and w_{i,t} is the average self-attention score of all heads in all layers from the i-th token with respect to t in D. Note that Zhou et al. (2019) additionally use a proposal score for finding T, which we replaced with the omission score.
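A sketch of Equation 2 with Hugging Face BERT, where h concatenates the last four hidden layers and w_{i,t} averages attention from token i to the target position t over all layers and heads; t here indexes BERT's subword positions, a simplification of the paper's setup.

```python
# Sketch of the BERT similarity score (Equation 2).
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased",
                                 output_hidden_states=True,
                                 output_attentions=True).eval()

def embed(text):
    with torch.no_grad():
        out = bert(**tok(text, return_tensors="pt"))
    h = torch.cat(out.hidden_states[-4:], dim=-1)[0]       # (seq, 4*768)
    att = torch.stack(out.attentions).mean(dim=(0, 2))[0]  # (seq, seq)
    return h, att

def sim(doc, cand, t):
    """Attention-weighted cosine similarity between token pairs."""
    h_d, att = embed(doc)
    h_c, _ = embed(cand)
    n = min(len(h_d), len(h_c))
    cos = torch.nn.functional.cosine_similarity(h_d[:n], h_c[:n], dim=-1)
    return (att[:n, t] * cos).sum().item()
```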
                AUTHORS      TWEETS   FEMALE    MALE   TRAIN    TEST       TOKENS      TYPES  AVG SIZE
Huang et al.     37,929      47,211   26,758  20,453  30,602   7,651      935,062     46,600        28
Emmery et al.     6,610  16,788,612   61,736  32,900  75,918  18,718  146,736,657  9,942,399       301
Volkova et al.    4,620  12,226,859   32,376  26,708  47,298  11,777   67,186,535  7,836,539       269
Table 1: Corpus statistics indicating the number of authors, tweets, female and male labels, the size of the train and test splits, the number of types (unique words) and tokens (total words), and the average number of tokens per document (avg size).

4 Experimental Setup

4.1 Data

We use three author profiling sets (see Table 1 for statistics) that are annotated for binary gender classification (male or female). The first is that of Volkova et al. (2015), which was collected by annotating 5,000 English Twitter profiles through crowdsourcing via Mechanical Turk. This can be considered a 'random' sample of Twitter profiles, and it is therefore the most unbiased set of the three. Hence, we consider it the most representative of an author profiling set, and employ it as training split (80%) for f, and as test split for our attacks (20%).

The second is the English portion of the Multilingual Hate Speech Fairness corpus of Huang et al. (2020), which was collected with a different objective than author profiling. It was aggregated from existing hate speech corpora (by Waseem and Hovy, 2016; Waseem, 2016; Founta et al., 2018), which were largely bootstrapped with look-up terms, selection of frequently abusive users, etc., and annotated post-hoc with demographic information. The collection did not focus on profiles, and most authors are only associated with a single tweet. This can cause a significant domain shift compared to general author profiling. However, it can be seen as freely available (noisy) data.

Lastly, we include a weakly labeled author profiling corpus by Emmery et al. (2017), collected through English keyword look-up for self-reports, similar to Beller et al. (2014). This corpus likely includes incorrect labels, but was collected in less than a day, making it an ideal candidate for realistic access to (new) data to fit the substitute model. Note that profile counts in the current work differ from those previously reported due to collection limitations (e.g., removed accounts).
Preprocessing & Sampling
All three corpora were tokenized using spaCy (Honnibal and Montani, 2017; https://spacy.io). Other than lowercasing and allocating special tokens to user mentions and hashtags, the text was left unaltered.
From the test split, 200 instances were sampled for the attack (110 male, 90 female). While fairly small, this sample does reflect a realistic attack duration and timeline size, as the attacks would be executed for a single profile. (As the datasets are not shuffled, to avoid overfitting on author-specific features, a few documents of the same author might spill from the train into the test split; sampling as described avoids incorporating those in our attack sample.)
Implementation

For the extension of TF, we re-implemented the code by Jin et al. (2020) (https://github.com/jind11/TextFooler) to work with scikit-learn (Pedregosa et al., 2011; https://scikit-learn.org). For their synonym substitution component, we similarly used the counter-fitted embeddings by Mrkšić et al. (2016) trained on SimLex-999 (Hill et al., 2015). The USE (Cer et al., 2018) implementation uses TensorFlow (Abadi et al., 2016a; https://tensorflow.org) as back-end, and all BERT variants were implemented in Hugging Face's Transformers library (Wolf et al., 2020; https://huggingface.co) with PyTorch (Paszke et al., 2019; https://pytorch.org) as back-end.

We adopt the same parameter settings as Jin et al. (2020) throughout our TF experiments: they set N (considered synonyms) and δ (cosine similarity minimum) empirically to 50 and 0.7, respectively. For MB and DB, we capped T at 50 and top-k at 10 (to improve speed). For DB, we follow Zhou et al. (2019) and set the dropout probability to 0.3.

data         Huang, Emmery, Volkova
importance   Omission score
attack       Heuristics, TextFooler, Masked BERT, Dropout BERT
model        Logistic Regression, N-GrAM
ranking      None, POS + USE, BERT Sim

Table 2: Grid of possible experimental configurations.
4.2 Models

For f and f' we require (preferably fast) pipelines that achieve high accuracy on author profiling tasks, and that are sufficiently distinct to gauge how well our attacks transfer across architectures, rather than solely across corpora. As state-of-the-art algorithms have not yet proven to be sufficiently effective for author profiling (Joo et al., 2019), we opt for common n-gram features and linear models.

Logistic Regression
Logistic Regression (LR) trained on tf·idf-weighted uni- and bi-gram features proved a strong baseline in author profiling in prior work. The simplicity of this classifier also makes it a substitute model that can realistically be run by an author. No tuning was performed; C was left at scikit-learn's default of 1.0.
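A sketch of such a substitute pipeline follows; vectorizer settings beyond the uni-/bi-gram range are scikit-learn defaults and thus assumptions.

```python
# Sketch of the LR substitute model: uni-/bi-gram tf-idf features into
# Logistic Regression with untuned C.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

f_sub = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # uni- and bi-gram tf-idf
    LogisticRegression(C=1.0),            # scikit-learn default
)
# f_sub.fit(train_texts, train_labels); f_sub.decision_function(texts)
```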
N-GrAM

The New Groningen Author-profiling Model (N-GrAM) of Basile et al. (2018) was proposed as a highly effective, simple model that outperforms more complex (neural) alternatives on author profiling with little to no tuning. It uses tf·idf-weighted uni- and bi-gram token features, character hexa-grams, and sublinearly scaled tf (1 + log(tf)). These features are then passed to a Linear Support Vector Machine (Cortes and Vapnik, 1995; Fan et al., 2008), where C = 1.
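An approximation of this pipeline in scikit-learn; parameters beyond those named in the text (n-gram ranges, sublinear tf, C = 1) are left at defaults and should be taken as assumptions rather than N-GrAM's exact configuration.

```python
# Sketch of an N-GrAM-style pipeline: word uni-/bi-gram tf-idf plus
# sublinear character 6-gram tf-idf into a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

features = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(6, 6),
                             sublinear_tf=True)),
])
n_gram = make_pipeline(features, LinearSVC(C=1))
```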
To summarize (see Table 2), the experiment is conducted as follows: the substitute model (f'), LR for all experiments, is fit on a given corpus. The real target model (f, either LR or N-GrAM) is always fit on the corpus of Volkova et al. (2015). To evaluate the attacks, a 200-instance sample is used. Target words are ranked via omission scores from f', and fed to either our Heuristic, TF, MB, or DB attacks. The heuristics directly change the target words, while the rest output a ranked set of replacement candidates. The latter can either be evaluated against f' through the TF pipeline, or the top-1 candidate is returned. Filtering can be applied through POS/USE semantic similarity and POS compatibility checks (Check), or not (No Check).

test = Volkova et al.
                      f' → Huang et al.    Emmery et al.    Volkova et al.
f →                     LR      NG           LR      NG        LR      NG
none                   .885    .940         .885    .940      .885    .940
Heuristic  1337
           flip
           space
Top-1      WS          .825    .930         .805    .890      .750    .915
           MB          .655    .905         .595    .785      .145    .410
           DB          .625    .895         .575    .785      .210    .530
No Check   WS          .540    .855         .355    .670      .000    .009
           MB          .415    .790         .120    .420      .000    .085
           DB          .430    .775         .175    .430      .000    .085
Check      TF          .705    .920         .780    .910      .375    .700
           TF + MB     .640    .880         .760    .890      .380    .725
           TF + DB     .650    .885         .755    .890      .435    .715

Table 3: Post-attack accuracy scores (below chance (55%) = better) of f on a test sample from the Volkova et al. corpus. Left, the attack conditions: heuristics, top-1 synonym, applying POS and USE similarity checks (Check), or not applying those checks (No Check). Training corpora are noted per column for f' (always Logistic Regression (LR)). As target model, either LR or N-GrAM (NG) was used. The substitution attacks are TextFooler (TF), Masked (MB), and Dropout BERT (DB). If TF's stopping criterion was used, TF + is noted. Word Similarity (WS) reflects the TF pipeline without checks.

Note that we are predominantly interested in transferability, and would therefore like to test as many combinations of data and architecture access limitations as possible. If we assume an author does not have access to the data, the substitute classifier is trained on any data other than the Volkova et al. corpus. If we assume the author does not know the target model architecture, the target model is N-GrAM (rather than LR). A full model transfer setting (in both data and architecture) will therefore be, e.g.: data f' = Emmery et al., data f = Volkova et al., f' = LR, and f = N-GrAM. Finally, for comparison to an optimal situation, we test a setting where we do have access to the adversary's data.

Evaluation

The obfuscation success is measured as any accuracy score below chance-level performance, which given our test sample is 55%. We would argue that random performance is preferred in scenarios where the prediction of the opposite label is undesired: if an attack drops accuracy to 0%, this effectively flips (in case of a binary label) the label, which might also be undesired by the author (e.g., being classified as having polar opposite political views), implying that the target model being maximally unsure about the classification is desirable. For the current task, however, any accuracy drop to around or lower than chance level satisfies the conditions for successful obfuscation.
Table 4: Gender prediction accuracies of the substitutemodels f (cid:48) on train and test splits of f . cation. To evaluate the semantic preservation ofthe attacked sentences, we calculate both
METEOR (Banerjee and Lavie, 2005; Lavie and Denkowski, 2009), using nltk, and BERTScore (Zhang et al., 2020a) between D and D_ADV. METEOR captures flexible uni-gram token overlap, including morphological variants, and BERTScore calculates similarities with respect to the sentence context.
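A sketch of this evaluation for a single document pair, assuming nltk (with its wordnet data installed) for METEOR and the bert-score package for BERTScore.

```python
# Sketch of the automatic evaluation between an original document and
# its adversarial rewrite.
from bert_score import score as bert_score
from nltk.translate.meteor_score import meteor_score

def evaluate_pair(original, adversarial):
    met = meteor_score([original.split()], adversarial.split())
    _, _, f1 = bert_score([adversarial], [original], lang="en")
    return met, f1.item()
```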
Human Evaluation
For the human evaluation, we sampled 20 document pieces (one or more tweets) for each attack type in the best performing experimental configuration. A piece was chosen if it satisfied these criteria: i) contains changes for all three attacks, ii) consists of at least 15 words (excluding emojis and tags), and iii) does not contain obvious profanity. (To avoid exposing the raters to overly toxic content, blatant examples were filtered using a keyword list; some minor examples remained, for which we added a disclaimer.) All 60 document pieces of the three models were shuffled, and the 20 original versions were appended at the end (so that 'correct' pieces were seen last). Each substitute model therefore has 80 items for evaluation.

While in prior work it is common to rate semantic consistency, fluency, and label a text (see e.g., Potthast et al., 2016; Jin et al., 2020), our Twitter data are too noisy (including many spelling and grammar errors in the originals), and document batches too long, to make this a feasible task. Instead, our six participants (three per substitute) were asked to indicate: a) whether a sentence was artificially changed, and if so, b) one word that raised their suspicion. This way, we can evaluate which attack produces the most natural sentences, and the least obvious changes to the input.

The items were rated individually; the human evaluators did not know beforehand that different versions of the same sentences were repeated, nor that the originals were shown at the end. All participants have a university-level education, a high English proficiency, and are familiar with the domain of the data. Several example ratings of the same sentence can be found in Table 6.
5 Results

As we alluded to in Section 4.1, both corpora used to train our substitute models were in fact not reference corpora for author profiling, and can therefore be considered suboptimal, disjoint domains. The Huang et al. corpus in particular shows a strong domain shift (see Table 4) for both the training and test sets. The distantly labeled Emmery et al. corpus achieves 7.5% more accuracy on the train split of Volkova et al., and its test performance is significantly higher (27%). We might therefore expect better obfuscation performance from the latter.
The results for all attacks are shown in Table 3. Note that these are performances for f; therefore, when no attacks are applied (none), the performance for both substitute corpora stays the same (as those only influence the attacks). For the heuristic attacks, 1337 seems to be the more robust baseline, outperforming some of the other settings, even on transferability. A surface-level advantage is that this attack has a minor impact on readability (when applied conservatively) and does not change semantics; however, the heuristic attacks are fairly simple to mitigate in preprocessing (Juola and Vescovi, 2011) and through character features (as shown by the performance of the N-GrAM model). For transferability, we evidently need to do more than simply trying to convert words to be out-of-vocabulary (OOV) with noise. While it can be argued that the heuristics could change all words, shifting everything OOV would not be robust; the target model side could easily spot the anomalous input and might act (e.g., reject) accordingly.

Transferability can be assessed by comparing the LR and N-GrAM (NG) columns. Globally, it can be observed that the substitute models trained on the Emmery et al. corpus systematically outperform those trained on Huang et al., both for the settings where the adversary's architecture is known (LR) and where it is unknown (NG). This matches our expectations from the observed domain shift.
Figure 2: Rewrite metrics for the three different attacks: TextFooler (TF), Masked (MB) and Dropout BERT (DB), split by corpora Huang et al. (H) and Emmery et al. (Q). Shown are the relative number of changes (by document length), BERTScore (F1), and METEOR (MET) with respect to the original document.

Our results also show that a noticeable decrease in obfuscation performance occurs (10-30% increased target model performance) when the attacks are transferred to different data and another model. In contrast, as can be observed from the last two columns in Table 3, in the practically unrealistic setting where the model and data are available, the obfuscation is specifically tailored to known weaknesses and therefore highly effective.

Looking at the Top-1, No Check, and Check brackets (Table 3), other than the BERT-based models showing higher transferability than TF, they also retain obfuscation success, deteriorating the target model's performance to lower than chance level (55%) for the settings not using additional checks. This comparison also demonstrates that the synonym ranking works (Top-1 vs. No Check and Check), and that the Check condition is too restrictive, attaining lower attack power and low transferability. This is further illustrated by the %-changes shown in Figure 2. Comparing the MB and DB variants, their performance seems almost identical, with masking having a slight advantage. As Zhou et al. (2019) argued, applying dropout should produce words that are closer to the original (compared to MB), which might affect obfuscation performance. Additionally, the BERT similarity ranking (described in Section 3.3) applied to the Masked substitution candidates could have some beneficial effect. This will have to be studied in more detail using the output evaluations. Note that Jin et al. (2020) found similar drops to 0% accuracy with a comparable percentage of changes for word-level models.
              ORG    Huang et al.           Emmery et al.
                     TF     MB     DB       TF     MB     DB
ALTERED      .888   .967   .633   .783     .950   .617   .633
WORD           -    .950   .583   .700     .867   .433   .433
Table 5: Human accuracy scores of predicting whether a text was altered, and of guessing the attacked word (lower is better). All substitute models are those with the No Check setting, trained on different corpora (i.e., different words are attacked per training corpus). ORG indicates correct prediction of the originals.
Rewrite Metrics
The metrics in Figure 2 show a common initial limitation in their application to this task: the more frequently an attack makes no changes, the higher the automatic evaluation metrics (BERTScore, METEOR). Hence, to compare models, these scores need to be considered in light of the obfuscation performance and related work. It can be observed that, with consistently more changes, MB and DB score lower on semantic consistency than TF. However, between MB and DB, and TF for the Emmery et al. corpus, these differences are minor. Furthermore, despite being fit on a different domain, these scores are comparable to prior obfuscation work (e.g., Shetty et al. (2018) show METEOR scores between 0.69 and 0.79).
Human Evaluation
The results in Table 5 reflect the same trend that can be observed in Table 3: high obfuscation success seems to result in higher human error when predicting whether a sentence was obfuscated. Conversely, it seems that despite higher semantic consistency scores, the original TF pipeline is easier to detect. This can be attributed to the number of spelling and grammar errors the model makes without its additional checks. Furthermore, the 11% error in identifying the original sentences also reflects some expected margin of error in this task, as our Twitter data is inherently noisy. Finally, while these results are in line with the obfuscation success, and are lower than detectability scores in related work (Mahmood et al., 2020), they also indicate that the models are still detectable above chance level. Given three alternatives (including the original), performance should be 25% or lower to indicate no intrusive changes are made to the text (i.e., changes that are not semantically coherent or not inconspicuous enough; both metrics used by Potthast et al., 2016). Therefore, while the presented approaches are effective, and realistically transferable, there is room for improvement for practical applicability.

ORG  ready to go home already . a better relationship with god . i need another job asap .
HTF  loan to go houses already . a improved relations with jesus . i should another labour asap .
HMB  ready to go on already . a better relationship with god . i need another guy man .
HDB  ready to go somewhere already . a better relationship with god . i need another position vs .

ORG  trump criticizes kim jong un after missile launch : ' does this guy have anything better to do ? ' .
HTF  tramp criticized kam yung jt after rocket start : ' does this boyfriend have anything best to do ? ' .
HMB  trump criticizes ha woman congressman after campaign launch : ' does this book have anything else to do ? ' .
HDB  trump criticizes in at sin after bomb launch : ' does this kid have anything less to do ? ' .

Table 6: Example ratings of different attacks (not shown together to the human evaluators) on two sentences with varying semantic consistency and human detection accuracy. In the first example, HMB was marked unaltered by all raters, HDB by the majority, and HTF by none. In the second, only HDB was marked unaltered, by only one rater. Attacked words are marked in bold; guessing any one of these would count as correctly identifying the attack.
6 Discussion

We have demonstrated the performance of author attribute obfuscation under a realistic setting. Using a simple Logistic Regression model for candidate suggestion, trained on a weakly labeled corpus collected in a day, the attacks successfully transferred to different data and architectures. This is a promising result for future adversarial work on this task, and for its practical implementation.

It remains challenging to automatically evaluate how invasive the required number of changes is for successful obfuscation, particularly to an author's message consistency as a whole. However, in practice such considerations could be left up to the author. In this human-in-the-loop scenario, a more extensive set of candidates could be suggested, and their effect on the substitute model shown interactively. This way, the attacks can be manually tuned to find a balance of effectiveness and inconspicuousness, and to guarantee semantic consistency. It would also show authors how their writing style affects potential future inferences.

Regarding the performance of the attacks: we demonstrated the general effectiveness of contextual language models in retrieving candidate suggestions. However, the quality of those candidates might be improved with more extensive rule-based checks, e.g., through deeper analyses using parsing. Nevertheless, such avenues leave us with a core limitation of rewriting language, and therefore of NLP more broadly: while the Masked attacks seemed more successful in our experiments, after manual inspection of the perturbations, Dropout was often found to be semantically closer (see also Table 6), which was not reflected in the human evaluation. This begs the question whether any automated approach, evaluated under the current limitations of semantic consistency metrics, could realistically optimize for both obfuscation and inconspicuousness.

As such, we would argue that future work should focus on making as few perturbations as possible, retaining only the minimum required for obfuscation success. Given this, the other constraints become less relevant; one could generate short sentences (e.g., a single tweet) that might be semantically or contextually incorrect, but if such a sentence is one message in a long post history, it will hardly be detectable or intrusive. This would require certain triggers (as demonstrated by Wallace et al. (2019), for example), and ascertaining how well they transfer.
7 Conclusion

In our work, we argued that realistic adversarial stylometry should be tested on transferability, in settings where there is no access to the target model's data or architecture. We extended previous adversarial text classification work with two transformer-based models, and studied their obfuscation success in such a setting. We showed them to reliably drop target model performance below chance, though human detectability of the attacks remained above chance. Future work could focus on further minimizing this detection under our realistic constraints.
Acknowledgments
Our research strongly relied on openly available resources. We thank all whose work we could use. We would also like to thank the anonymous reviewers, Bertrand Higy, Bram Willemsen, and Chris van der Lee for their valuable feedback. Ákos Kádár contributed to this research independently. This work does not reflect Borealis AI's views, nor any information Ákos Kádár may have learned while employed by Borealis AI.

References
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016a. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, pages 265–283. USENIX Association.
Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016b. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318.
Carlisle Adams. 2006. A classification for privacy techniques. University of Ottawa Law & Technology Journal, 3:35.
Jalal S. Alowibdi, Ugo A. Buy, and Philip Yu. 2013. Empirical evaluation of profile characteristics for gender classification on Twitter. In 2013 12th International Conference on Machine Learning and Applications (ICMLA), volume 1, pages 365–369. IEEE.
Shlomo Argamon, Sushant Dhawle, Moshe Koppel, and James W. Pennebaker. 2005. Lexical predictors of personality type. In Proceedings of the 2005 Joint Annual Meeting of the Interface and the Classification Society of North America, pages 1–16.
Harald Baayen, Hans van Halteren, Anneke Neijt, and Fiona Tweedie. 2002. An experiment in authorship attribution. In Journées internationales d'Analyse statistique des Données Textuelles (JADT), volume 1, pages 69–75.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
Angelo Basile, Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and Malvina Nissim. 2018. Simply the best: Minimalist system trumps complex models in author profiling. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 143–156. Springer.
Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.
Charley Beller, Rebecca Knowles, Craig Harman, Shane Bergsma, Margaret Mitchell, and Benjamin Van Durme. 2014. I'm a belieber: Social roles via self-identification and conceptual attributes. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 181–186.
Janek Bevendorff, Martin Potthast, Matthias Hagen, and Benno Stein. 2019. Heuristic authorship obfuscation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1098–1108.
Janek Bevendorff, Tobias Wenzel, Martin Potthast, Matthias Hagen, and Benno Stein. 2020. On divergence-based author obfuscation: An attack on the state of the art in statistical authorship verification. it - Information Technology, 62(2):99–115.
Haohan Bo, Steven H. H. Ding, Benjamin Fung, and Farkhund Iqbal. 2019. ER-AE: Differentially-private text generation for authorship anonymization. arXiv preprint arXiv:1907.08736.
Michael Brennan, Sadia Afroz, and Rachel Greenstadt. 2012. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security (TISSEC), 15(3):1–22.
John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics.
Aylin Caliskan, Fabian Yamaguchi, Edwin Dauber, Richard E. Harang, Konrad Rieck, Rachel Greenstadt, and Arvind Narayanan. 2018. When coding style survives compilation: De-anonymizing programmers from executable binaries. In Network and Distributed System Security Symposium (NDSS). The Internet Society.
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium. Association for Computational Linguistics.
Minhao Cheng, Jinfeng Yi, Pin-Yu Chen, Huan Zhang, and Cho-Jui Hsieh. 2020. Seq2Sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 3601–3608. AAAI Press.
Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.
Walter Daelemans. 2013. Explanation in computational stylometry. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 451–462. Springer.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
Javid Ebrahimi, Daniel Lowd, and Dejing Dou. 2018. On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 653–663.
Harrison Edwards and Amos J. Storkey. 2016. Censoring representations with an adversary. In International Conference on Learning Representations.
Steffen Eger, Gözde Gül Şahin, Andreas Rücklé, Ji-Ung Lee, Claudia Schulz, Mohsen Mesgar, Krishnkant Swarnkar, Edwin Simpson, and Iryna Gurevych. 2019. Text processing like humans do: Visually attacking and shielding NLP systems. In Proceedings of NAACL-HLT, pages 1634–1647.
Jacob Eisenstein, Noah A. Smith, and Eric P. Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 1365–1374. Association for Computational Linguistics.
Chris Emmery, Grzegorz Chrupała, and Walter Daelemans. 2017. Simple queries as distant labels for predicting gender on Twitter. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 50–55.
Chris Emmery, Enrique Manjavacas, and Grzegorz Chrupała. 2018. Style obfuscation by invariance. In Proceedings of the 27th International Conference on Computational Linguistics, pages 984–996.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874.
Natasha Fernandes, Mark Dras, and Annabelle McIver. 2019. Generalised differential privacy for text document processing. In International Conference on Principles of Security and Trust, pages 123–148. Springer, Cham.
Antigoni Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of Twitter abusive behavior. In Twelfth International AAAI Conference on Web and Social Media.
Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.
David I. Holmes. 1998. The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3):111–117.
Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 7.
Xiaolei Huang, Linzi Xing, Franck Dernoncourt, and Michael J. Paul. 2020. Multilingual Twitter corpus and baselines for evaluating demographic bias in hate speech recognition. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020), Marseille, France. European Language Resources Association (ELRA).
Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, pages 8018–8025. AAAI Press.
Youngjun Joo, Inchon Hwang, L. Cappellato, N. Ferro, D. Losada, and H. Müller. 2019. Author profiling on social media: An ensemble learning model using various features. Notebook for PAN at CLEF.
Patrick Juola and Darren Vescovi. 2011. Analyzing stylometric approaches to author obfuscation. In IFIP International Conference on Digital Forensics, pages 115–125. Springer.
Jad Kabbara and Jackie Chi Kit Cheung. 2016. Stylistic transfer in natural language generation systems using recurrent neural networks. In Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods, pages 43–47.
Gary Kacmarcik and Michael Gamon. 2006. Obfuscating document stylometry to preserve author anonymity. In Proceedings of the COLING/ACL on Main Conference Poster Sessions, pages 444–451. Association for Computational Linguistics.
Ákos Kádár, Grzegorz Chrupała, and Afra Alishahi. 2017. Representation of linguistic form and function in recurrent neural networks. Computational Linguistics, 43(4):761–780.
Georgi Karadzhov, Tsvetomila Mihaylova, Yasen Kiprov, Georgi Georgiev, Ivan Koychev, and Preslav Nakov. 2017. The case for being average: A mediocrity approach to style masking and author obfuscation. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 173–185. Springer.
Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.
Moshe Koppel and Jonathan Schler. 2004. Authorship verification as a one-class classification problem. In Proceedings of the Twenty-first International Conference on Machine Learning, page 62.
Alon Lavie and Michael J. Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23(2-3):105–115.
Hoi Le, Reihaneh Safavi-Naini, and Asadullah Galib. 2015. Secure obfuscation of authoring style. In IFIP International Conference on Information Security Theory and Practice, pages 88–103. Springer.
Chris van der Lee, Albert Gatt, Emiel van Miltenburg, Sander Wubben, and Emiel Krahmer. 2019. Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, pages 355–368.
Yitong Li, Timothy Baldwin, and Trevor Cohn. 2018. Towards robust and privacy-preserving text representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 25–30, Melbourne, Australia. Association for Computational Linguistics.
Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2018. Deep text classification can be fooled. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 4208–4215.
Asad Mahmood, Zubair Shafiq, and Padmini Srinivasan. 2020. A girl has a name: Detecting authorship obfuscation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2235–2245. Association for Computational Linguistics.
Muharram Mansoorizadeh, Taher Rahgooy, Mohammad Aminiyan, and Mahdy Eskandari. 2016. Author obfuscation using WordNet and language models. Notebook for PAN at CLEF 2016. In CLEF 2016 Evaluation Labs and Workshop - Working Notes Papers, pages 5–8.
Robert A. J. Matthews and Thomas V. N. Merriam. 1993. Neural computation in stylometry I: An application to the works of Shakespeare and Fletcher. Literary and Linguistic Computing, 8(4):203–209.
Thomas V. N. Merriam and Robert A. J. Matthews. 1994. Neural computation in stylometry II: An application to the works of Shakespeare and Marlowe. Literary and Linguistic Computing, 9(1):1–6.
Frederick Mosteller and David L. Wallace. 1963. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. Journal of the American Statistical Association, 58(302):275–309.
Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–148.
Tempestt Neal, Kalaivani Sundararajan, Aneez Fatima, Yiming Yan, Yingfei Xiang, and Damon Woodard. 2017. Surveying stylometry techniques and applications. ACM Computing Surveys (CSUR), 50(6):1–36.
Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2231–2242. Association for Computational Linguistics.
Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. 2016. Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.
Barbara Plank and Dirk Hovy. 2015. Personality traits on Twitter—or—how to get 1,500 personality tests in a week. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 92–98.
Martin Potthast, Matthias Hagen, and Benno Stein. 2016. Author obfuscation: Attacking the state of the art in authorship verification. In CLEF (Working Notes), pages 716–749.
Francisco Rangel, Paolo Rosso, Ben Verhoeven, Walter Daelemans, Martin Potthast, and Benno Stein. 2016. Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations. Working Notes Papers of the CLEF.
Josyula R. Rao, Pankaj Rohatgi, et al. 2000. Can pseudonymity really guarantee privacy? In USENIX Security Symposium, pages 85–96.
K. Rayner, S. J. White, R. L. Johnson, and S. P. Liversedge. 2006. Raeding wrods with jubmled lettres: There is a cost. Psychological Science, 17(3):192.
Sravana Reddy and Kevin Knight. 2016. Obfuscating gender in social media writing. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 17–26.
Chakaveh Saedi and Mark Dras. 2020. Large scale author obfuscation using Siamese variational auto-encoder: The SiamAO system. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 179–189, Barcelona, Spain (Online). Association for Computational Linguistics.
Maarten Sap, Gregory Park, Johannes Eichstaedt, Margaret Kern, David Stillwell, Michal Kosinski, Lyle Ungar, and H. Andrew Schwartz. 2014. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1146–1151.
Rakshith Shetty, Bernt Schiele, and Mario Fritz. 2018. A4NT: Author attribute anonymity by adversarial training of neural machine translation. In Proceedings of the 27th USENIX Conference on Security Symposium, pages 1633–1650.
Noah A. Smith. 2012. Adversarial evaluation for models of natural language. arXiv preprint arXiv:1207.0245.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
Ariel Stolerman, Rebekah Overdorf, Sadia Afroz, and Rachel Greenstadt. 2013. Classify, but verify: Breaking the closed-world assumption in stylometric authorship attribution. In IFIP Working Group, volume 11, page 64.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010. Curran Associates Inc.
Svitlana Volkova and Yoram Bachrach. 2016. Inferring perceived demographics from user emotional tone and user-environment emotional contrast. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL.
Svitlana Volkova, Yoram Bachrach, Michael Armstrong, and Vijay Sharma. 2015. Inferring latent user properties from texts published in social media. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Svitlana Volkova, Glen Coppersmith, and Benjamin Van Durme. 2014. Inferring user political preferences from streaming communications. In ACL (1), pages 186–196.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162.
Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142.
Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45. Association for Computational Linguistics.
Qiongkai Xu, Chenchen Xu, and Lizhen Qu. 2019. ALTER: Auxiliary text rewriting tool for natural language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 13–18, Hong Kong, China. Association for Computational Linguistics.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020a. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. 2020b. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41.
Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, and Ming Zhou. 2019. BERT-based lexical substitution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.