A Dataset of State-Censored Tweets
Tuğrulcan Elmas, Rebekah Overdorf, Karl Aberer
EPFL
tugrulcan.elmas@epfl.ch, rebekah.overdorf@epfl.ch, karl.aberer@epfl.ch
Abstract
Many governments impose traditional censorship methods on social media platforms. Instead of removing such content completely, many social media companies, including Twitter, only withhold the content from the requesting country. This makes the content still accessible outside of the censored region, providing an excellent setting in which to study government censorship on social media. We mine such content using the Internet Archive's Twitter Stream Grab. We release a dataset of 583,437 tweets by 155,715 users that were censored between July 2012 and July 2020. We also release 4,301 accounts that were censored in their entirety. Additionally, we release a set of 22,083,759 supplemental tweets made up of all tweets by users with at least one censored tweet, as well as instances of other users retweeting the censored user. We provide an exploratory analysis of this dataset. Our dataset will not only aid in the study of government censorship but will also aid in studying hate speech detection and the effect of censorship on social media users. The dataset is publicly available at https://doi.org/10.5281/zenodo.4439509.
Introduction

While there are many disagreements and debates about the extent to which a government should censor and which content should be allowed to be censored, many governments do impose some limitations on content that can be distributed within their borders or by their citizens and residents. The specific reasons for censorship vary widely depending on the type of censored content and the context of the censor, but the overall goal is always the same: to suppress speech and information.

Traditional Internet censorship works by filtering by IP address, DNS filtering, or similar means. The rise of social media platforms has lessened the impact of such techniques. As such, modern censors often instead issue take-down requests to social media platforms. As a practice, Twitter does not normally remove content that does not violate its terms of service, but instead censors the content by withholding it in a country if the respective government sends a legal request (Jeremy Kessel 2017). As such, these tweets are still accessible on the platform from outside of the censored region, which provides a setting to study the censorship policies of certain governments. In this paper, we release an extensive dataset of tweets that were withheld between July 2012 and July 2020. Aside from an initiative by BuzzFeed journalists who collected a smaller dataset of users that were observed to be censored between October 2017 and January 2018, such a dataset has never been collected and released. We also release the uncensored tweets of users who were censored at least once. As such, our dataset will help not only in studying government censorship but also the effect of censorship on users and the public reaction to censored tweets.

The main characteristics of the dataset are as follows:
1. 583,437 censored tweets from 155,715 unique users.
2. 4,301 censored accounts.
3. 22,083,759 additional tweets from users who posted at least one censored tweet, and retweets of those tweets.

We made the dataset available at Zenodo: https://doi.org/10.5281/zenodo.4439509. The dataset only consists of tweet ids and user ids in order to comply with the Twitter Terms of Service. For the documentation and the code to reproduce the pipeline, please refer to https://github.com/tugrulz/CensoredTweets.

This paper is structured as follows. Section 2 discusses the related work. Section 3 describes the collection pipeline. Section 4 provides an exploratory analysis of the dataset. Section 5 describes the possible use cases of the dataset. Finally, Section 6 discusses the caveats: biases of the dataset, ethical considerations, and compliance with the FAIR principles.
Related Work

Traditional Internet censorship blocks access to an entire website within a country by filtering by IP address, DNS filtering, or similar means. Such methods, however, are heavy-handed in the current web ecosystem, in which a lot of content is hosted under the same domain. That is, if a government wants to block one Wikipedia page, it must block every Wikipedia page from being accessible within its borders. Such unintended blockings, so-called "casualties," are tolerated by some censors, who balance the tradeoff between how much benign content is blocked and how much targeted content is blocked. Some countries have gone so far as to ban entire popular websites based on some content. Turkey famously blocked Wikipedia from 2017 through January 2020 (Anonymous 2017) in response to Wikipedia articles that it deemed critical of the government. Several countries, including Turkey, China, and Kazakhstan, have intermittently blocked WordPress, on which about 39% of the Internet is built (https://kinsta.com/wordpress-market-share/).

To study this type of censorship, researchers, activists, and organizations conduct measurement studies to determine which websites are blocked or available in different regions. Primarily, this tracking relies on volunteers running software from inside the censored region, such as the tool provided by the Open Observatory of Network Interference (OONI, https://ooni.org/).

The rise of social media platforms further complicates censorship techniques. In order to censor a single post, or even a group or profile, the number of casualties is by necessity very high, up to every social media profile and page. In response, some censors now request that social media companies remove content, or at least block certain content within their borders. To a large extent, social media companies comply with these requests. For example, in 2019, Reddit, which annually publishes a report (Reddit Inc. 2020) on takedown requests, received 110 requests to restrict or remove content and complied with 41.
In the first half of 2020, Twitter received 42.2k takedown requests and complied with 31.2% of them. Twitter states that it censored 3,215 accounts and 28,370 individual tweets between 2012 and July 2020. It also removed 97,987 accounts from the platform altogether after a legal request instead of censoring them (Twitter Inc. 2021). Measuring this type of targeted censorship requires new data and new methodologies. This paper takes the first step in this direction by providing a dataset of regionally censored tweets.

To the best of our knowledge, only one similar dataset has been publicly shared: journalists at BuzzFeed manually curated and shared a list of 1,714 censored accounts (Silverman and Singer-Vine 2018). This dataset covers users whose entire profile was observed to be censored between October 2017 and January 2018, but does not cover individual censored tweets. The Lumen Database stores the legal demands to censor content sent to Twitter in PDF format, which includes the court order and the URLs of the censored tweets. Although the database is open to the public, one has to send a request for every legal demand, solve a captcha, download, and open a PDF to access a single censored tweet, which makes the database inaccessible and not interoperable. Although their data is not publicly available, previous research using a dataset of censored tweets from October 2014 to January 2015 in Turkey found that most were political and critical of the government (Tanash et al. 2015; Tanash et al. 2016) and that Twitter underreports the censored content. Subsequent work reported a decline in government censorship (Tanash et al. 2017). Additionally, (Varol 2016) analyzed 100,000 censored tweets between 2013 and 2015 and reported that the censorship does not prevent the content from reaching a broader audience. To the best of our knowledge, there exists no other robust, data-driven study of censored Twitter content.
This is likely due in part to the difficulty of collecting a dataset of censored tweets and accounts. Our dataset aims to facilitate such work.

Collection Pipeline

This section describes the process by which we collect censored tweets and users, which is summarized in Figure 1. In brief, data retrieved via the Twitter API is structured as follows: tweets are instantiated in a Tweet object, which includes, among other attributes, another object that instantiates the tweet's author (a User object). If the tweet is a retweet, it also includes the tweet object of the retweeted tweet, which in turn includes the user object of the retweeted user. If a tweet is censored, its Tweet object includes a "withheld in countries" field which lists the countries the tweet is censored in. If an entire profile is censored, the User object includes the same field. In order to build a dataset of censored content, our objective is to find all Tweet and User objects with a "withheld in countries" field.

To this end, we first mine the Twitter Stream Grab, a dataset containing a 1% sample of all tweets between September 2011 and July 2020. This dataset does not indicate whether an entire profile is censored. As such, we collect the censored users from up-to-date Twitter data using the User Lookup API endpoint. We infer whether a user was censored in the past by a simple procedure we devised. We then extend the dataset of censored users by exploiting their social connections. We lastly mine a supplementary dataset consisting of non-censored tweets by users who were censored at least once. The whole collection process is summarized in Figure 1. We now describe the process in detail.
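The object traversal described above can be sketched as follows. This is a minimal illustration, not the paper's actual code; it assumes the standard Twitter API v1.1 JSON field names (`withheld_in_countries` on both Tweet and User payloads, `retweeted_status` for retweets), and the helper name is ours.

```python
def censored_countries(tweet: dict) -> set:
    """Collect every country code in which this tweet, its author, or
    (for retweets) the original tweet and its author are withheld.
    Field names follow the Twitter API v1.1 JSON payload."""
    countries = set(tweet.get("withheld_in_countries", []))
    # The nested User object may carry the field if the whole profile is censored.
    countries |= set(tweet.get("user", {}).get("withheld_in_countries", []))
    # Retweets embed the original tweet; recurse into it as well.
    retweeted = tweet.get("retweeted_status")
    if retweeted:
        countries |= censored_countries(retweeted)
    return countries
```

A tweet (or any tweet it embeds) is considered censored whenever this returns a non-empty set.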
Mining the Twitter Stream Grab

Twitter features an API endpoint that provides 1% of all tweets posted in real time, sampled randomly. The Internet Archive collects and publicly publishes all data in this sample starting from September 2011. They name this dataset the Twitter Stream Grab (Team 2020). At the time of this analysis, the dataset consisted of tweets through June 2020. Although the dataset is publicly available, only a few studies (Tekumalla, Asl, and Banda 2020; Elmas et al. 2020a; Elmas et al. 2020b) have mined the entire dataset, due to the cumbersome process of storing and efficiently processing the data. We mine all the tweets in this dataset to find tweets and users with the "withheld in countries" field and, thus, find the censored tweets and users.

We process all the tweets between September 2011 and June 2020 in this dataset and retain those with the "withheld in countries" field. We found 583,437 tweets from 155,715 users in total. Of these, 378,286 were retweets. The tweet ids in this dataset can be found in tweets.csv and their authors' user ids in all_users.csv.

Figure 1: Overview of the data collection process. We first mined the entire Twitter Stream Grab (published by the Internet Archive), which consists of a 1% sample of all tweets, and selected the tweets that were denoted as being censored. This resulted in 583k censored tweets. We next created a list of accounts which had at least one censored tweet and determined, via the user lookup endpoint, which accounts were fully censored and which only had selective tweets censored. We then inferred which of these accounts were censored in the past (see Section 3.3). We next extended the censored users dataset by exploiting the users' social connections. Finally, we mined the Twitter Stream Grab in order to collect the other (non-censored) tweets of users who had been censored at least once.
Collecting Censored Users

The User objects in the Twitter Stream Grab data do not include a "withheld in countries" field due to the limits of the API endpoint it uses. Twitter provides this field in the data returned by the User Lookup endpoint. As the Internet Archive does not store past data provided by this endpoint, we collected censored users by querying this endpoint in December 2020. Precisely, we used the list of users with at least one censored tweet and determined which of these users' entire profiles were censored as of December 2020. The caveat is that Twitter only provides the data of users that still exist on the platform. Out of 155,715 candidate users, 114,800 (73.7%) still existed on the platform as of December 2020. 62.2% of the remaining 40,915 users were suspended by Twitter. Of the users that still exist, we found that 1,458 had their entire profile censored.
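The bookkeeping after the lookup can be sketched as follows. The function operates on the hydrated User objects the users/lookup endpoint returns (it silently omits deleted, suspended, and deactivated accounts); the helper name and the partitioning into lists are our own illustration.

```python
def split_lookup_results(candidate_ids, lookup_results):
    """Partition candidate user ids using the User objects returned by
    the users/lookup endpoint. Ids absent from the response no longer
    exist on the platform; present users are fully censored iff their
    object carries a non-empty "withheld_in_countries" field."""
    found = {u["id"]: u for u in lookup_results}
    missing = [uid for uid in candidate_ids if uid not in found]
    censored = [uid for uid, u in found.items()
                if u.get("withheld_in_countries")]
    return censored, missing
```

The `missing` list corresponds to the users whose past censorship status must be inferred, as described in the next subsection.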
Inferring Accounts Censored in the Past

For the users that have at least one censored tweet but no longer existed on the platform as of December 2020, we do not know if their entire profile was censored. Additionally, for those users that have at least one censored tweet, but whose accounts were not found to be censored after retrieving the up-to-date content, we cannot know if their entire profile was censored in the past. We devised a strategy to infer the users whose entire profile was once censored. We observe that some users' censored tweets were in fact retweets, while the original tweets that they retweeted were not censored. This is either because the legal request was sent for the retweet and not the original tweet, so Twitter only censored the retweet, or because all the tweets (including the retweets) of the same account were censored automatically because the account itself was censored. We believe the former is unrealistic, primarily because censoring the retweet does not block the visibility of the content that is against the law of the requesting country. Additionally, we observed retweeted tweets for which it was unlikely that any government had sent a legal request (or that Twitter would accept such a request), such as those posted by @jack, Twitter's founder. Thus, when we observe a censored retweet of a non-censored tweet, we assume the retweeting account was censored as a whole. By this reasoning, we inferred 3,063 accounts that were censored in the past. Of those 3,063 users, 1,531 were still on the platform. 319 of them are no longer censored, while this method missed 326 users who were actually censored, achieving 77.6% recall. This increased the number of censored users to 3,389. We provide the list of users found to be censored only by this procedure in users_inferred.csv, in case one prefers to forgo using it.
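The inference rule above can be sketched as a single pass over the censored tweets. This is an illustration under the v1.1 field names; the function name is ours.

```python
def infer_censored_accounts(censored_tweets):
    """Apply the inference rule: if a censored tweet is a retweet whose
    embedded original tweet carries no "withheld_in_countries" field,
    attribute the censorship to the retweeting account as a whole."""
    inferred = set()
    for tweet in censored_tweets:
        original = tweet.get("retweeted_status")
        if original and not original.get("withheld_in_countries"):
            inferred.add(tweet["user"]["id"])
    return inferred
```

Censored retweets whose originals are themselves withheld do not trigger the rule, since there the censorship plausibly targets the content rather than the account.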
Extending the Dataset Using Social Connections

Up to this point, the number of users whose entire profile was censored is 3,389, which is larger than the BuzzFeed dataset of 1,714 users. However, when we evaluated our dataset's recall on BuzzFeed's dataset, we found that we only captured 65% of the users in that dataset. To increase the recall as much as we can, we extend our dataset by exploiting the connections of the censored users we have. Precisely, we collect the friends of the accounts in our dataset whose tweets were censored and the Twitter lists they are added to. For the latter, we then collect the members of these lists. We collected 3,233,554 friends and the members of 70,969 lists in total. We found 494 additional censored users, increasing the recall to 80.3% when evaluated using BuzzFeed's dataset. We merge the extended dataset with the BuzzFeed dataset, which is separately stored in buzzfeed_users.csv. We collected 4,301 users whose profiles are entirely censored in total. The resulting dataset can be found in users.csv.

Supplementary Dataset

Finally, we mine the Twitter Stream Grab for the tweets of the users with at least one censored tweet and for retweets of those users by others. We found 22,083,759 such tweets. These tweets include non-censored tweets and might serve as negatives for studies which consider the censored tweets as positives. They can be found in supplement.csv.

Exploratory Analysis

We first begin by reporting the statistics of the tweets per their current status. Of the 583,437 censored tweets, 328,873 (56.3%) were still present on (not deleted or removed from) the platform as of December 2020. Of those that remained, 4,716 tweets were no longer censored, nor were their authors. Of those that were still censored, 154,572 tweets were posted by accounts that were censored as a whole, and 168,234 tweets were posted by accounts that were not censored.

We continue with the per-country analysis of the censorship actions.
Although the dataset features 13 different countries, we found that 572,095 tweets (98%) were censored in only five countries: Turkey, Germany, France, India, and Russia. Figure 2 shows the statistics with respect to these countries.

To better understand what this dataset consists of, we perform a basic temporal and topical analysis for each of these five countries. Researchers can use this analysis as a starting point. We perform the topical analysis by computing the most popular hashtags, mentions, and URLs and then inspecting those entities. We measure popularity by the number of unique accounts mentioning each entity. We use this metric instead of the tweet count in order to account for the same users using the same entities over and over. We do not include retweets in this analysis. Table 1 shows those entities and their popularity. We additionally report the most frequent tweeting languages (measured by the number of tweets) and the most reported locations (measured by the number of users) with respect to the censoring countries in Table 2. We perform the temporal analysis by computing the number of users censored tweet-wise or account-wise for the first time per month in Figure 3. We now briefly describe our findings.

We found that censored users are mostly based in a country other than the one they are censored in, beyond the reach of law enforcement of the censoring country. In the case of Turkey, the users appear likely to be people who have emigrated to, e.g., Germany or the United States, as the users based in those countries mostly tweet in Turkish. For India and Russia, censored users are mostly based in neighbouring and/or hostile countries (such as Ukraine and Pakistan). They may be locals of those countries, as they mostly tweet in the local language (Ukrainian and Urdu) of the countries they are based in. For the case of Germany and France, it appears that most of the censored users are also residents of foreign countries.
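The popularity metric described above (distinct authors per entity, retweets excluded) can be sketched as follows, again assuming the v1.1 `entities` payload; the function name is ours.

```python
from collections import defaultdict

def entity_popularity(tweets, entity="hashtags"):
    """Rank entities (hashtags, user_mentions, urls) by the number of
    distinct authors using them, skipping retweets so that a single
    user repeating an entity counts only once."""
    users_per_entity = defaultdict(set)
    for tweet in tweets:
        if "retweeted_status" in tweet:
            continue  # popularity is measured over original tweets only
        for ent in tweet.get("entities", {}).get(entity, []):
            # hashtags/mentions carry "text"; urls carry "expanded_url"
            key = ent.get("text") or ent.get("expanded_url")
            users_per_entity[key].add(tweet["user"]["id"])
    return sorted(((len(users), e) for e, users in users_per_entity.items()),
                  reverse=True)
```

Applying this per censoring country reproduces the kind of ranking reported in Table 1.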
However, these countries of residence (e.g., the United States and Portugal) are not neighbouring and/or hostile.

The temporal activity of the censorship actions appears to follow the domestic politics and regional conflicts involving Turkey, Russia, and India. This is further corroborated by the most popular hashtags. The biggest spike in users censored in Turkey is during Operation Olive Branch, in which the Turkish army captured Afrin from the YPG, one of the combatants of the Syrian Civil War based in Rojava, which is recognized as a terrorist organization by Turkey (Winter 2018). The most popular hashtags among censored users were all related to this event. For the case of India, the most popular hashtags are about the Kashmir dispute. We could not identify any major event around the time of the spike in the number of censored accounts, but the dispute is still ongoing. The biggest spike in users censored in Russia coincides with the series of persecutions of Hizb ut-Tahrir members, designated as a terrorist group by Russia (Fishman 2020). The accounts censored in Germany and France appear to be promoting extreme-right content, such as ideas related to white supremacy and Islamophobia, judging by the hashtags they frequently use. We could not identify any major event related to the spike in censorship observed in these countries. However, we found that the censored accounts were newly registered to the platform when they were censored. 11.2% of users censored by Germany and 9.1% of users censored by France were first censored within one month of account creation, while this is only 4.6% for Turkey, 1% for Russia, and 0 for India. Figure 4 shows that the creation of the accounts censored in Germany and France follows a trend similar to the time of censorship of the accounts in these two countries. Precisely, there is a surge of new accounts in late 2016 and 2017 which started to be censored in mid-2017.
This might be due to the German and French governments' reaction to the many far-right accounts that started to actively campaign on Twitter in late 2016 and 2017. Germany introduced the Network Enforcement Act to combat fake news and hate speech on social media in June 2017 (Gesley 2017).
Use Cases

The main motivation for creating this dataset was to study social media censorship by governments. Here we provide possible use cases related to this topic, such as analysis of censorship policies and the effect of censorship. We also provide use cases with a non-censorship focus, such as studying disinformation and hate speech. Note that this list is not exhaustive.
Analysis of Censorship Policies
This dataset provides an excellent ground to study the censorship policies of countries. Not only that, but it could also reveal where the platform's policies and censorship policies align, given that some accounts are later removed by Twitter.

Figure 2: The statistics of the tweets and retweets with respect to countries and the existence of tweets on the platform as of December 2020.

Table 1: Entities used per number of distinct users with respect to the countries they are censored in.

Table 2: The most frequent self-reported locations (measured by number of users) and most frequent tweeting languages (measured by number of tweets) with respect to the countries they are censored in.

Country | Locations | Languages
Turkey | United States (482), Germany (470), İstanbul (297), Turkey (274), Kürdistan (204) | tr (73959), en (26101), de (1378), fr (1237), ar (1115)
Germany | United States (3544), Texas (692), Florida (608), California (519), Portugal (341) | en (33742), tr (3999), de (3160), es (2327), ja (1050)
France | United States (2817), Texas (517), Florida (470), California (400), Portugal (337) | en (18526), fr (632), ja (527), es (483), fi (263)
India | Pakistan (827), Lahore (389), Karachi (319), Islamabad (292), Punjab (235) | en (1719), ur (2262), ar (252), hi (73), in (48)
Russia | Ukraine (217), Indonesia (47), Istanbul (30), Ankara (28), Turkey (20) | en (3968), uk (2441), tr (1166), ru (1030), ar (880)
Effect of Censorship
The dataset could be used to measure the effect of censorship on censored users' behaviour. Do users forgo using their accounts after being censored, or does the censorship backfire? Furthermore, what is the effect of censorship on other users, e.g., does the public engage more with censored tweets? Studies tackling these questions will shed light on the effect of censorship.
Hate Speech Detection
Hate speech detection lacks ground truth, as hate speech is rare and often removed by Twitter before researchers can collect it; e.g., (Davidson et al. 2017) sampled tweets based on a hate-oriented lexicon and found only 5% to contain hate speech. This dataset could help in building a hate speech dataset, as many of the tweets censored by Germany and France might involve hate speech. Researchers can collect those tweets, annotate whether they include hate speech, and build a new dataset of tweets containing hate speech. Such a dataset might be useful for hate speech detection in French and German. It could also be used to evaluate existing hate speech detection methods.
News and Disinformation
Some users who are censored appear to be dissidents actively campaigning against the governments censoring them. Such users propagate news and claims at high rates. We believe it would be interesting to see which news and claims are censored by the government despite not being removed by Twitter. It would also be interesting to see what portion of this news was fake news and which claims were debunked.
Caveats

Biases

We collected our dataset from an extensive dataset which consists of a 1% random sample of all tweets. Due to the nature of random sampling, we assume the latter dataset to be unbiased. However, some tweets in the random sample are retweets of other tweets. The inclusion of tweets that are retweeted makes the dataset biased towards popular tweets and users, which are more likely to be retweeted. Biased datasets can impact studies which report who or what content is more likely to be censored, or which countries censor more often. To overcome this issue, we also include an unbiased subset of the original dataset. This subset is collected by mining the Twitter Stream Grab without collecting the retweets. It consists of 39,913 tweets and is stored in tweets_debiased.csv.

Our dataset is mined from the 1% sample of all tweets on Twitter. As we do not have access to the full sample, we acknowledge that this dataset is not exhaustive and advise researchers to take this fact into account.
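The debiasing step above amounts to dropping every tweet that embeds a retweeted status before counting. A minimal sketch (function name ours):

```python
def debias(tweets):
    """Keep only original (non-retweet) censored tweets, so the sample
    is not skewed toward frequently retweeted content and users."""
    return [t for t in tweets if "retweeted_status" not in t]
```

Studies comparing censorship rates across users or countries should run on this debiased subset.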
Ethical Considerations

Our dataset consists only of tweet ids and user ids. This ensures that users who chose to delete their accounts or who protected their accounts will not have their data exposed. Sharing only the ids also complies with Twitter's Terms of Service. Although our supplementary dataset is massive, Twitter permits academic researchers to share an unlimited number of tweet and/or user ids for the sole purpose of non-commercial research (Twitter Inc. 2020).

We also acknowledge that, as with any dataset, there is the possibility of misuse. By observing our dataset, malicious third parties can learn which content is censored and, e.g., learn to counter the censor. However, we also note that our dataset is retrospective and not real-time. There is a span of at least six months between the publication of this dataset and the censorship of the users it contains. Additionally, Twitter sends notifications to the accounts that are censored (Jeremy Kessel 2017), so our dataset does not unduly inform a user of their own censure. Censored users may deactivate their accounts to avoid their public data being collected. Another possible misuse is that a government could use this dataset for a political objective, e.g., to automatically detect users they should censor. While this is indeed a concern, we believe that this is an issue that must be addressed through governance and not further data withholding.

Figure 3: The temporal activity of the censored tweets, measured by the number of users censored for the first time, with respect to countries. The censorship actions by Turkey, Russia, and India appear to follow the domestic politics and regional conflicts involving those countries. The censorship actions by Germany and France appear to follow the account creation dates of the censored users. We did not identify any major event around the time of those censorship actions.

Figure 4: Account creation dates of the users censored at least once by Germany and/or France. We observe an increasing trend of new accounts in late 2016 and 2017. The same trend can be observed during the period of censorship depicted in Figure 3.
FAIR Principles

The FAIR principles state that data should be findable, accessible, interoperable, and reusable. Our dataset is findable, as we made it publicly available via Zenodo. By choosing Zenodo, we also ensure it is accessible to everyone who wishes to download it, regardless of university or industry status. The data is interoperable, as it is in the .csv format, which can be processed by any system. Finally, it is reusable as long as Twitter and/or the Internet Archive's Twitter Stream Grab exist. To further enhance its reusability, we share the code necessary to reproduce the dataset in a public GitHub repository.
References

[Anonymous 2017] Anonymous. 2017. Turkish authorities block Wikipedia without giving reason. BBC News.
[Davidson et al. 2017] Davidson, T.; Warmsley, D.; Macy, M.; and Weber, I. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media, volume 11.
[Elmas et al. 2020a] Elmas, T.; Overdorf, R.; Akgül, Ö. F.; and Aberer, K. 2020a. Misleading repurposing on Twitter. arXiv preprint arXiv:2010.10600.
[Elmas et al. 2020b] Elmas, T.; Overdorf, R.; Özkalay, A. F.; and Aberer, K. 2020b. The power of deletions: Ephemeral astroturfing attacks on Twitter trends. arXiv preprint arXiv:1910.07783.
[Fishman 2020] Fishman, D. 2020. November 2020: Hizb ut-Tahrir, RosDerzhava, FBK, Anna Korovushkina. Institute of Modern Russia.
[Gesley 2017] Gesley, J. 2017. Germany: Social media platforms to be held accountable for hosted content under "Facebook Act". Global Legal Monitor.
[Jeremy Kessel 2017] Kessel, J. 2017. Increased transparency on country withheld content. Accessed 2021-01-13.
[Reddit Inc. 2020] Reddit Inc. 2020. Reddit transparency report. Accessed 2021-01-13.
[Silverman and Singer-Vine 2018] Silverman, C., and Singer-Vine, J. 2018. Here's who's been blocked by Twitter's country-specific censorship program.
[Tanash et al. 2015] Tanash, R. S.; Chen, Z.; Thakur, T.; Wallach, D. S.; and Subramanian, D. 2015. Known unknowns: An analysis of Twitter censorship in Turkey. In Proceedings of the 14th ACM Workshop on Privacy in the Electronic Society, 11–20.
[Tanash et al. 2016] Tanash, R. S.; Aydogan, A.; Chen, Z.; Wallach, D.; Marschall, M.; Subramanian, D.; and Bronk, C. 2016. Detecting influential users and communities in censored tweets using data-flow graphs. In Proceedings of the 33rd Annual Meeting of the Society for Political Methodology (POLMETH), Houston, TX.
[Tanash et al. 2017] Tanash, R.; Chen, Z.; Wallach, D.; and Marschall, M. 2017. The decline of social media censorship and the rise of self-censorship after the 2016 failed Turkish coup. In USENIX Workshop on Free and Open Communications on the Internet (FOCI).
[Team 2020] Archive Team. 2020. The Twitter Stream Grab. Accessed 2020-12-01.
[Tekumalla, Asl, and Banda 2020] Tekumalla, R.; Asl, J. R.; and Banda, J. M. 2020. Mining archive.org's Twitter Stream Grab for pharmacovigilance research gold. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, 909–917.
[Twitter Inc. 2020] Twitter Inc. 2020. Developer agreement and policy. Accessed 2021-01-13.
[Twitter Inc. 2021] Twitter Inc. 2021. Twitter transparency report. Accessed 2021-01-13.
[Varol 2016] Varol, O. 2016. Spatiotemporal analysis of censored content on Twitter. In