Mining the online infosphere: A survey
Sayantan Adak, Souvic Chakraborty, Paramtia Das, Mithun Das, Abhisek Dash, Rima Hazra, Binny Mathew, Punyajoy Saha, Soumya Sarkar, Animesh Mukherjee
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India – 721302
Advanced Technology Development Center, Indian Institute of Technology, Kharagpur, West Bengal, India – 721302

Abstract
The evolution of AI-based systems and applications has pervaded everyday life to make decisions that have a momentous impact on individuals and society. With the staggering growth of online data, often termed the online infosphere, it has become paramount to monitor the infosphere to ensure social good, as AI-based decisions are severely dependent on it. The goal of this survey is to provide a comprehensive review of some of the most important research areas related to the infosphere, focusing on the technical challenges and potential solutions. The survey also outlines some important future directions. We begin with discussions focused on the collaborative systems that have emerged within the infosphere, with a special thrust on Wikipedia. Following this, we demonstrate how the infosphere has been instrumental in the growth of scientific citations and collaborations, thus fuelling interdisciplinary research. Finally, we illustrate the issues related to the governance of the infosphere, such as the tackling of (a) rising hateful and abusive behaviour and (b) bias and discrimination in different online platforms and in news reporting.
The online infosphere is the term corresponding to the Internet becoming a virtual parallel world formed from billions of networks of artificial life at different scales, ranging from tiny pieces of software to massive AI tools running a factory or driving a car. The motivations for this are diverse, seeking both to help mankind and to harm it.

In this article, we shall attempt to portray some of the areas that are increasingly gaining importance in research related to the evolution of this infosphere. In particular, we begin with the infosphere as a collaborative platform, Wikipedia being the prime point of discussion. As a next step, we discuss how the infosphere has influenced the evolution of scientific citations and collaborations. Finally, we outline the emerging research interest in the governance of this infosphere to eradicate discrimination, bias, abuse and hate.

∗ The first nine authors have been arranged based on family names and have equal contributions.
https://en.wikipedia.org/wiki/Infosphere

1 Infosphere as a collaborative platform

The infosphere hosts numerous collaborative platforms, including question answering sites, folksonomies, microblogging sites and, above all, encyclopedias. In this survey we shall focus on Wikipedia, which is one of the largest online collaborative encyclopedias. We shall primarily discuss two of the most important aspects of Wikipedia – (a) the quality of an article and its indicators and (b) the collaboration dynamics of the Wikipedia editors who constitute the backbone of this massive initiative. Under the first topic we shall identify the different features of an article, like its language, structure and stability, as well as how they indicate its quality [150, 53, 120]. We shall further summarise attempts that have been made to automatically predict the quality of an article [52, 91].
Within the second topic we shall briefly describe various issues related to the community of editors, including anomalies, vandalism and edit wars [71, 135]. Finally, we shall talk about ways of enabling the retention of editors on the platform [54, 104, 132].
Citations play a crucial role in shaping the evolution of a scientific discipline. With the exponential growth of research publications in various disciplines, it has become very important for researchers and scientists to grasp different concepts within a short period of time. We explore how the infosphere has influenced the growth and interaction of different scientific disciplines over the last few years by investigating several different aspects of citation networks. Our survey includes – (a) a detailed account of how the basic sciences and computer science have interacted with each other over the years, resulting in an interdisciplinary research landscape [58, 103], (b) the temporal dynamics of citations [129], (c) ways of assessing article quality, and finally (d) a brief account of anomalous citation flows.
The stupendous growth of the infosphere has resulted in the emergence of various online communities that have massively started infusing bias, discrimination, hatred and abuse, often resulting in violence in the offline world. In this segment, we shall primarily focus our discussion on the following topics – (a) the analysis, spread, detection and mitigation of online hate speech and (b) the biases that manifest across news media and in traditional recommendation systems. Within the first topic we shall motivate the need to tackle online hate speech by citing some of its adverse consequences. In particular, we shall see how unmoderated hate speech spreads in a social network [93, 94], what the challenges are in automatically detecting online hate speech [49, 2] and the possible techniques to combat this problem [95, 96]. Within the second topic we shall discuss two important forms of bias. The first corresponds to political biases that manifest due to the massive production of unverified (and in many cases false) news generated in the form of news/blogs/tweets etc. We shall also discuss the difficulties faced by modern machine learning techniques in preventing the infusion of such biases. The second narrates the idea of the formation of filter bubbles [107] in traditional recommendation systems, followed by a discussion on the need to systematically audit such systems [28, 27].
Wikipedia models a hypertext collaborative platform along with an open and free knowledge base, catering to a large variety of information ranging from history, arts, culture and politics to science, technology and many more fields. Being one of the most widely viewed sites in the world (within the top ten) since 2007, it spans over 208 languages with a copious number of articles in each edition. For example, the English-language Wikipedia, the largest in volume, contains more than 2 million articles as of February 2020. Owing to the collaborative nature and the open-access policy announced by Wikipedia as “anyone can edit”, a number of challenges have cropped up in maintaining the veracity–quality balance of the content. Although Wikipedia has enforced several rules and strict administrative policies to protect the encyclopedia from malicious activities, the lack of authoritative vigilance limits its trustworthiness in academics. In contrast, Wikipedia’s structured, complete and detailed evolution history receives increasing attention from the research community for discovering automated solutions (e.g., bots, software, APIs etc.) to meet the goal of quality management.
The elementary purpose of Wikipedia is free, unbiased and accurate information curation. To achieve this objective, the Wikimedia Foundation, which is the governing body of the platform, has developed a labyrinth of guidelines that editors are expected to follow so that the highest encyclopedic standards are maintained. These guidelines also enhance the accessibility of Wikipedia articles to a broad community of netizens. We enumerate these guidelines in three categories as discussed below.
Expressions describing a subject should be neutral. Promotional words such as renowned, visionary, iconic and virtuoso should not be used; the importance of a subject should instead be demonstrated using facts and attribution. Prose should be in the active voice. Jargon needs to be elaborated or substantiated with references. Any effort to propagate myths or contentious content should be curtailed. An example of this is the addition of the prefix pseudo- or the suffix -gate, which encourages the reader to assume that the subject is fictitious or scandalous, respectively. Euphemisms (e.g., passed away, collateral damage) and cliches (e.g., lion’s share, tip of the iceberg) prevent the direct presentation of prose and hence are restricted. Any unnecessary emphasis in the form of italics, quotations etc. is discouraged. For a complete list of content guidelines for the English Wikipedia we refer the reader to the Wikipedia manual of style.

The structural guidelines concern the proper formatting of a Wikipedia article in terms of section headings, infoboxes, the article name, section organisation etc. The lead section should not be of arbitrary length. The following sections should not be exorbitant in size, and bigger sections should be broken into coherent smaller ones. Another requirement is the proper positioning of images with captions and references. In order to alleviate the manual labour involved in improving article structure, there have been some automated approaches leveraging advances in machine learning [63].
These guidelines denote the stability of the article, i.e., the respective article should not be the subject of frequent edit wars. There should not be abusive language exchanged among editors, and discussions toward improving article quality should organically reach consensus. This is the most difficult objective in collaborative content creation, and generally the onus lies with senior editors and moderators to ensure smooth conflict arbitration.
Although Wikipedia has grown significantly in terms of volume and veracity over the last decade, the quality of articles is not uniform [138]. The quality of Wikipedia articles is monitored through a rating system where each article is assigned one of several class indicators, the major ones being FA, GA, B, C, Start and Stub. The most complete and dependable content is annotated with an FA (aka featured article) tag, while the lowest quality content is annotated with a Stub tag. The intention behind this elaborate scheme is to notify editors regarding the current state of an article and the extent of effort needed to escalate it to encyclopedic standards. The editors are expected to rigorously follow the aforementioned guidelines. As is evident, the guidelines are circuitous and often require experience to implement. Such strict policy adherence has also sometimes been a barrier to the onboarding of new editors on Wikipedia, which has led to a decline of newcomers over the past decade [132, 55]. Since it is nontrivial to discern the qualifying differences between articles manually, automated techniques based on machine learning models have emerged.

https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Words_to_watch
https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style

Automatic article assessment is one of the key research agendas of the Wikimedia Foundation. One of the preliminary approaches [56] to this problem extracted structural features such as the presence of an infobox, references, level 2 headings etc. as indicators of article quality. [26] proposed the first application of deep neural networks to the quality assessment task, employing distributional representations of documents [80] without using manual features. The authors in [123] introduce a hybrid approach, where the textual content of a Wikipedia article is encoded using a BiLSTM model. The hidden representation captured by the sequence model is further augmented with handcrafted features, and the concatenated feature vector is used for the final classification. [151] is an edit history based approach where every version of an article is represented by a fixed set of handcrafted features; an article with k versions is thus represented by a matrix with k rows, and this length-k sequence is passed through a stacked LSTM to obtain the final representation used for classification.
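As a concrete illustration of the feature-based line of work, the sketch below extracts a few structural indicators (infobox, references, level 2 headings, images) from raw wikitext. The feature names and patterns are illustrative assumptions, not the exact feature set of any cited paper; the resulting vectors would then be fed to an off-the-shelf classifier.

```python
import re

def structural_features(wikitext):
    """Extract simple structural quality indicators from raw wikitext,
    in the spirit of feature-based quality assessment (illustrative only)."""
    return {
        "has_infobox": int("{{Infobox" in wikitext),
        # <ref>...</ref> and <ref name=...> tags approximate the reference count
        "num_references": len(re.findall(r"<ref[ >]", wikitext)),
        # level 2 headings look like "== Title ==" on their own line
        "num_level2_headings": len(re.findall(r"^==[^=].*?==\s*$", wikitext, re.M)),
        "num_images": len(re.findall(r"\[\[(?:File|Image):", wikitext)),
        "article_length": len(wikitext),
    }
```

For example, a stub with no infobox and no references would score zero on the first two features, while an FA-class article would typically have all indicators present in large numbers.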
[124] proposed a multimodal information fusion approach where embeddings obtained both from the article text and from the HTML rendering of the article webpage are used for the final classification. [52] proposed the first approach which incorporates information from three modes for quality assessment, i.e., the article text, article images and the article talk page. [52] improves over the approach of [125] and achieves the SOTA result. A complementary direction of exploration has been put forward by [84, 31], where the correlation between article quality and structural properties of the co-editor network and the editor–article network has been exploited. An orthogonal direction of research looks into edit level quality prediction, which is a fine-grained approach toward article content management [120].

The workhorse behind the success story of Wikipedia is the large pool of its voluntary editors; an encouragement toward global collaboration influences people to contribute on almost all wikipages. These groups of people maintain Wikipedia pages behind the scenes, which includes creating new pages, adding facts and graphics, citing references, keeping the wording and formatting appropriate etc., to lead the articles to the highest level of quality. The achievement of any open collaborative project hinges on the continued and active participation of its collaborators, and hence Wikipedia needs to manage its voluntarily contributing editor community carefully. In these days of extreme socio-cultural polarization, algorithmically crafted filter bubbles and fake information presented as fact, editors are highly motivated to contribute to the largest non-biased knowledge sharing platform, although their work is not financially compensated most of the time [88].
In this line there have been several works [145, 104] which attempt to understand the dynamics of the interaction behaviours of the community in sustaining its health.

wiki/Wikipedia:WikiProjectWikipedia/Assessment

While investigating the editing behaviours of editors in a general context, researchers have worked out a taxonomy of the semantic intentions behind edits, with conflict and controversy being inherent components of the classification. Wikipedia owes its success to several factors, and openness is one of those pillars. Sometimes, this very openness misguides editors into violating Wikipedia’s strict guideline of the neutral point of view (NPOV), and their disruptive edits cause various kinds of anomalies. We describe the two dominant forms of dispute produced by such damaging edits as follows.
Vandalism: With the freedom of editing anything by anyone, Wikipedia has to struggle to stop the malicious practice of contaminating articles through intentional bad faith edits. Popular pages, such as those of famous celebrities or on controversial topics, become frequent targets of vandalism, where vandals try to mislead readers through additions, deletions or modifications, which can be termed a hoax. Wikipedia has enforced several strict policies, such as blocking and banning vandals (registered/unregistered editors), patrolling recent changes through watch-lists, and protecting articles (e.g., semi-protected pages) from new editors, random IP addresses etc. In addition to these administrative measures, bots are employed to detect and revert vandalism automatically and to warn the offending editors without human intervention. Researchers have proposed various automated ways [131, 71], i.e., state-of-the-art techniques based on machine learning [75, 133] and deep neural methods [92, 135], of protecting Wikipedia from vandalism.

Edit war: Apart from the intended malpractice of vandalism, editors often engage in disagreements which further influence them to override each other’s contributions instead of resolving the dispute. Any such action violating the three-revert rule is coined as edit warring in Wikipedia, and it promotes a toxic environment in the community. Ultimately, in the long term, the integrity of the encyclopedia is affected significantly by the damaging effects of edit wars [117]. Although Wikipedia encourages editors to be bold, a constant refusal to get the point is not entertained either.

Historically, Wikipedia managed the number of its volunteers quite successfully; however, experts [54] note that it is in danger of a sharp decline in its active editors due to the lack of socialization effort. Editors may choose to leave the platform for personal reasons as well as for their disagreements/conflicts with fellow editors.
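A minimal sketch of the machine-learning flavour of vandalism detection is given below: a handful of hand-crafted features over a single revision, scored by a naive linear rule. The lexicon, feature names and thresholds are all illustrative assumptions; deployed systems such as ClueBot NG use far richer features and trained models.

```python
import re

# Toy lexicon and thresholds; real systems learn these from labelled edits.
BAD_WORDS = {"idiot", "stupid", "fake"}

def edit_features(old_text, new_text):
    """Hand-crafted features over a single revision (old -> new)."""
    added = new_text[len(old_text):] if new_text.startswith(old_text) else new_text
    tokens = re.findall(r"[a-z']+", added.lower())
    return {
        "chars_added": max(len(new_text) - len(old_text), 0),
        "chars_removed": max(len(old_text) - len(new_text), 0),
        "upper_ratio": sum(c.isupper() for c in added) / max(len(added), 1),
        "bad_word_hits": sum(t in BAD_WORDS for t in tokens),
    }

def looks_like_vandalism(feats):
    """Naive linear scoring rule standing in for a trained classifier."""
    score = (2.0 * feats["bad_word_hits"]
             + 3.0 * (feats["upper_ratio"] > 0.7)
             + 1.0 * (feats["chars_removed"] > 500))
    return score >= 2.0
```

In practice these features would be one row of a training matrix, with labels derived from reverted revisions.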
The damage happens both ways – when new editors fail to internalize the rules and policies, they easily become upset and eventually leave; experienced editors, on the other hand, can get discouraged by the continuous updating of policies meant to retain newcomers, or even by the nuisance caused by the newbies. Two-way efforts are being made to combat this problem – researchers are coming up with various approaches (see [101, 102, 147] and the references therein) while Wikipedia itself runs several wikiprojects to proactively retain its contributors.

Future directions: Due to the enormous volume of data publicly available from various multilingual wikiprojects, several interesting future directions can be explored. One of the directions is ... for a comprehensive take on this emerging research scope.

https://en.wikipedia.org/wiki/User:ClueBot_NG
https://en.wikipedia.org/wiki/Wikipedia:Edit_warring
https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Editor_Retention
https://en.wikipedia.org/wiki/Wikipedia:Expert_retention

Research on citation networks has always remained essential for solving various problems such as predicting emerging topics, early citation prediction and modelling evolving citation networks. A citation network is a directed graph where nodes can be authors/papers/journals and edges are the (weighted or unweighted) citation flows from one node to another. Using citation networks, one can predict which field/topic could be the ‘most attractive one to work on’ in the immediate future. The research dynamics of various fields over the years can be analyzed with the help of the underlying citation network. Various studies uncover the emergence of new fields/sub-fields by investigating the citation flows from papers of one field to papers of another field over time. In recent years, the citation count prediction task has also played an important role in fund allocation and rewards.
Researchers are also interested in building models for automatic citation recommendation while drafting an article. Apart from these, some anomalous practices in exchanging citations were first exposed in the late ’90s. Now, such malpractices are becoming more common among researchers and journals (mostly low ranked ones). In the rest of this section, we shall discuss each of the above issues in detail.
Various research questions, such as “which field will collaborate with which field in the future?” or “which field will receive more citations from recently published papers?”, can be addressed with the help of the underlying citation networks among articles. Nowadays, research is performed by combining ideas from multiple disciplines. In [58] the authors have analyzed the interdisciplinarity between two basic science fields – Mathematics and Physics – and one fast growing field – Computer Science. Further, they observe how citations from papers of one discipline flow to papers of another discipline over the years. They observe that in the initial years a huge volume of citations flowed from Physics to Mathematics and vice versa. Over the years, Computer Science started gaining citations from Mathematics. In recent years, both basic science fields have tended to massively cite papers from Computer Science. They also observe how the popularity of some topics decreases over time. They found that Computer Science papers mostly cited the quantum physics sub-field over a long time span. In the late ’90s, Physics mostly cited information theory papers from Computer Science, but in recent times it mostly cites papers from the machine learning and social & information networks domains. Further, interdisciplinarity has been studied in different fields including biology [103], mathematics [103], cognitive science [9, 76], social science [109] and the humanities [109]. Various studies [7, 121, 9] have attempted to propose novel metrics to measure the degree of interdisciplinarity based on researchers’ scientific impact, collaborators’ knowledge, publication history, etc. In addition, a metric for measuring the interdisciplinarity of an article has been proposed [9], where the authors’ research areas and publications in different domains are used to define the metric.

https://meta.wikimedia.org/wiki/Research:Index

Several studies have been carried out in the past to model the temporal dynamics of citation networks.
In order to model the temporal dynamics of citation networks, researchers traditionally used preferential attachment [1] and copying based [74] models. In [137], the authors investigated the citation behavior of older papers in various fields. They concluded that older articles receive more citations over the years; it was observed that in 2013, 36% of citations flowed toward papers at least ten years old. However, a re-investigation of this study showed that the observations are only partly true, since the authors did not take into account the accelerating volume of publications over time. In order to tackle the tug-of-war between obsolescence and entrenchment, the authors in [129] recently proposed a model based on the idea of relay-linking, where an older article relays a citation to a recently published article. This model has very few parameters and fits the real data much better than the traditional models. Yet another novel citation growth model, called RefOrCite [106], has been proposed recently, where the authors allow copying from both the references (out-edges) and the citations (in-edges) of an article (as opposed to only the references in the traditional setup). It is observed that the RefOrCite model fits real data well compared to the previous models.
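To make the modelling setup concrete, here is a minimal sketch of the classical preferential-attachment baseline that relay-linking and RefOrCite refine: each new paper cites existing papers with probability proportional to their current citation count plus one. The parameters and graph representation are illustrative only, not those of any cited model.

```python
import random

def grow_citation_network(n_papers, refs_per_paper=3, seed=0):
    """Minimal preferential-attachment citation growth: each new paper cites
    existing papers with probability proportional to (citations + 1)."""
    rng = random.Random(seed)
    citations = [0] * n_papers      # in-degree (citation count) per paper
    edges = []                      # (citing, cited) pairs
    for new in range(1, n_papers):
        weights = [citations[p] + 1 for p in range(new)]
        k = min(refs_per_paper, new)
        # sample k candidate references, deduplicated
        cited_set = {rng.choices(range(new), weights=weights)[0] for _ in range(k)}
        for cited in cited_set:
            edges.append((new, cited))
            citations[cited] += 1
    return citations, edges
```

Running this for a few hundred papers produces the heavy-tailed citation distribution that entrenchment implies, which is exactly the behaviour the obsolescence-aware models above try to temper.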
Citation count prediction
Predicting the future impact of scientific articles is important for decision making in fund allocation (by funding agencies), recruitment etc. Various works [82, 143, 149, 83] have been carried out in the past to automatically estimate the citation count of scientific articles. In this article, we shall mainly focus on the recent literature. In 2015, the authors in [82] proposed the Trend-based Citation Count Prediction (T-CCP) model, where the model first learns the type of citation trend of an article and then predicts the citation count for that trend. All articles were categorized into five citation trend categories based on the “burst” time (the time when the paper gets maximum citations) – early burst, middle burst, late burst, multi bursts, and no bursts. Two types of features were used – (a) publication related features such as author centric features (i.e., h-index, number of papers published, citation count, number of collaborators) and publication venue features (average citation count, impact factor etc.), and (b) reinforcement features, which are graph based features (i.e., PageRank, HITS etc.) calculated from the weighted citation network among authors. In their model, they mainly use SVR and SVM (LibLinear) for the citation count prediction and classification tasks respectively. In [128], the authors found that knowledge gathered from the citation context within the article can help to predict the future citation count. The number of occurrences of citations to a paper within the citing article and the average number of words in the citation context are derived from this citation context knowledge. Further, they categorized articles into six citation profiles (PeakInit, PeakMul, PeakLate, MonDec, MonIncr, Oth) and found that the above two citation context based features are able to nicely distinguish these six categories.
In [127], the authors observed that the long term citation count of an article depends on the citations it receives in its early years (within one or two years of its publication date). The authors who cite an article in its early years are called early citers. Early citers, based on whether or not they are influential, affect the long term citation count of the article; in most cases, influential authors negatively affect the long term citation count. In [143], the authors proposed a novel point process method to predict the citations of individual articles, capturing two properties – the “rich gets richer” effect and the recency effect. The authors in [149] used four factors – the intrinsic quality (citation count) of a paper, the aging effect, the Matthew effect and the recency effect – to derive a model called long term individual level citation count prediction (LT-CCP), in which they mainly use an RNN with LSTM units. It is observed that the LT-CCP model achieves better performance than existing models. The authors in [83] proposed a neural model for predicting citation counts with the help of peer review text. They mainly learn two deep features – (a) an abstract-review match mechanism (in order to learn an abstract aware review representation) and (b) a cross review match from the peer review text.
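As a toy illustration of the citation-trend categories used in T-CCP-style models, the function below classifies a paper's yearly citation series by where its peak falls. The category boundaries are illustrative assumptions, not those used in [82] or [128].

```python
def burst_category(yearly_citations):
    """Classify a citation time series by the position of its peak.
    Category names follow the T-CCP scheme; the 1/3 and 2/3 boundaries
    are illustrative assumptions."""
    if not yearly_citations or max(yearly_citations) == 0:
        return "no burst"
    peak = max(yearly_citations)
    peak_years = [i for i, c in enumerate(yearly_citations) if c == peak]
    if len(peak_years) > 1:
        return "multi bursts"
    frac = peak_years[0] / max(len(yearly_citations) - 1, 1)
    if frac < 0.33:
        return "early burst"
    if frac < 0.66:
        return "middle burst"
    return "late burst"
```

A trend-aware predictor would first assign each training paper to one of these categories and then fit a separate regressor (e.g., SVR) per category.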
New researchers often face difficulties in finding appropriate published research papers while exploring the domain literature and deciding what to cite. Citation recommendation is a technique that recommends appropriate published articles for a given text/sentence. The sentences present around the reference placeholder are called context sentences. The citation recommendation task can be divided into two parts – (i) local citation recommendation and (ii) global citation recommendation. In local citation recommendation, only the context sentences are used. In global citation recommendation, the whole article is used as input and the system outputs a list of published papers. In [10] the authors proposed a model for the global citation recommendation task where they embedded the textual information (i.e., the title and the abstract) of the candidate citations in a vector space and considered the nearest neighbors of the target document as the candidate citations. Further, re-ranking of the candidate citations was performed. They used the DBLP (50K articles with an average of 5 citations per article) and PubMed (45K articles with an average of 17 citations per article) datasets, and also introduced a new dataset, OpenCorpus (7 million articles). They showed that their model achieved state-of-the-art performance without using metadata (authors, publication venues, keyphrases). In [64], the authors proposed a deep learning model (consisting of a context encoder and a citation encoder) and used the dataset of [59] for context aware citation recommendation. A pre-trained BERT [34] model was used to learn the embeddings of the context sentences, and a GCN was employed to learn the citation graph embedding from the paper–paper citation graph. They mainly revised two existing datasets – AAN and FullTextPeerRead (a revised version of PeerRead). They showed that their model performed three times better than the SOTA approaches (CACR etc.). The authors in [110]
The authors in [110]proposed a novel method – ConvCN – based on the citation knowledge graph embedding.
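The embed-and-rank idea behind global citation recommendation can be sketched with a toy bag-of-words "embedding" and cosine similarity. Real systems use learned text encoders over titles and abstracts; the corpus and scoring below are purely illustrative.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; real systems use learned encoders."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(query_abstract, corpus, k=2):
    """Rank candidate papers (title -> abstract text) by similarity to the
    query document and return the top-k as candidate citations."""
    q = embed(query_abstract)
    ranked = sorted(corpus, key=lambda t: cosine(q, embed(corpus[t])), reverse=True)
    return ranked[:k]
```

In a full pipeline the top-k neighbours would then be re-ranked by a stronger model, mirroring the two-stage retrieve-then-rerank design described above.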
Various anomalous citation patterns have emerged over the years. The ways of maliciously increasing one’s citations include self-citations, citation stacking among journals, and citation cartels. Nowadays, authors are increasingly concerned about their position in academia, publication pressure etc., and this leads many of them to adopt unfair means to increase their citations. The citation cartel is one such anomalous citation pattern, first reported in the late ’90s. A citation cartel is formed by a group of authors/editors/journals who cite each other heavily for mutual benefit. The relationships in a citation cartel can be author–author, editor–author, journal–journal etc. A few cases have been found where a journal’s impact factor increased rapidly due to this anomalous behavior.
Cell Transplantation is a medical journal whose impact factor rapidly increased between 2006 and 2010 (from 3.48 to 6.20). After an investigation carried out by the JCR publisher, it was found that one review article published in the journal Medical Science Monitor had cited almost 91% of the papers published in Cell Transplantation in the 2008–2009 time bucket – exactly the bucket over which the impact factor of Cell Transplantation was calculated. Surprisingly, the authors (three out of four) are from the editorial board of this journal.

https://science.sciencemag.org/content/286/5437/53

In [40] the authors tried to detect citation cartels. They defined a citation cartel as a group of authors citing each other excessively more than they cite other authors’ works in the same domain. They observed that there can be multiple reasons behind establishing such cartels, such as academic pressure, the “publish or perish” culture in academia, fear of losing one’s job, and scientific competition. It has been observed that such unfair means are mostly adopted by low ranked researchers [41]. In their work, they prepared a multilayer graph comprising a paper–paper citation network (a directed graph), an authors’ collaboration network and an authors’ citation network (a weighted directed graph). Citation cartels were then captured from the authors’ citation network: cartels were discovered using the Resource Description Framework (RDF) and the RDF query language, with a threshold set to identify the existence of a citation cartel among authors. The authors in [72] proposed a novel algorithm – Citation Donors and REcipients (CIDRE) – to detect citation cartels among journals that cite each other disproportionately in order to increase their impact factors. The CIDRE algorithm first distinguishes between normal and malicious citation exchanges with the help of a few parameters, namely the similarity of research areas and the citation inflow and outflow. A weighted citation network among 48K journals was constructed from a dataset collected from MAG. With the help of the algorithm, more than half of the malicious journals (those actually suspended by Thomson Reuters) were detected in the same year. In addition, the CIDRE algorithm detected a few malicious journal groups in 2019 whose journals received 30% of their in-flow citations from journals in the same group.
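The core intuition – flagging pairs that direct a disproportionate share of their outgoing citations at each other – can be sketched as follows. The threshold and data layout are illustrative assumptions, not the parameters of CIDRE or of [40].

```python
def mutual_citation_ratio(citations, a, b):
    """Fraction of a's outgoing citations that go to b, and vice versa.
    `citations` maps (src, dst) pairs to citation counts."""
    out_total = {}
    for (src, _dst), w in citations.items():
        out_total[src] = out_total.get(src, 0) + w
    ab = citations.get((a, b), 0) / max(out_total.get(a, 0), 1)
    ba = citations.get((b, a), 0) / max(out_total.get(b, 0), 1)
    return ab, ba

def flag_cartel_pairs(citations, threshold=0.5):
    """Flag pairs where both sides send >= threshold of their out-citations
    to each other; a stand-in for group-level cartel detection."""
    nodes = {x for edge in citations for x in edge}
    flagged = []
    for a in sorted(nodes):
        for b in sorted(nodes):
            if a < b:
                ab, ba = mutual_citation_ratio(citations, a, b)
                if ab >= threshold and ba >= threshold:
                    flagged.append((a, b))
    return flagged
```

Group-level methods generalise this pairwise test by comparing observed intra-group citation flow against what a null model of normal citation exchange would predict.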
Such anomalous citations help to grow the impact factors of these journals over the years. In [62], the authors studied how malicious journals are proliferating in the Indian research community while evading proper rules and regulations. The analysis was carried out on the Indian publishing group OMICS (considered predatory by the research community). Surprisingly, they observed that such malicious journals share very similar characteristics with various reputed journals.

Future directions: In order to gather more citations, malpractice among journals is rapidly increasing. More research is required to build mechanisms that can automatically identify such (predatory) journals (depending on the topics of the journal). In the case of citation recommendation, there is a need to improve recommendation systems so that they can recommend papers that are conceptually similar or that exhibit conflicting claims [42]. Also, prioritizing the recommended citations would further help authors stay within the page limits imposed by many conferences [42].
As noted in the introduction, this section is laid out in two major parts. The former part centers around the spread, automatic detection and containment of hate speech. The latter part deals with bias in media outlets and online recommendation platforms.

https://scholarlykitchen.sspnet.org/2012/04/10/emergence-of-a-citation-cartel/

Hate speech

The Internet is one of the greatest innovations of mankind, bringing together people of every race, religion, and nationality. Social media sites such as Twitter and Facebook have connected billions of people and allowed them to share their ideas and opinions instantly. That being said, there are several ill consequences as well, such as online harassment, trolling, cyber-bullying, and hate speech.

The rise of hate speech: Hate speech has recently received a lot of research attention, with several works focusing on detecting hate speech in online social media [29, 33, 4, 119, 73]. Even though several governments and social media sites are trying to curb all forms of hate speech, it is still plaguing our society. With hate crimes increasing in several states, there is an urgent need to better understand how users spread hateful posts in online social media. Companies like Facebook have been accused of instigating anti-Muslim mob violence in Sri Lanka that left three people dead, and a United Nations report blamed them for playing a leading role in the possible genocide of the Rohingya community in Myanmar by spreading hate speech. In response to the UN report, Facebook later banned several accounts belonging to Myanmar military officials for spreading hate speech. In the recent Pittsburgh synagogue shooting, the sole suspect, Robert Gregory Bowers, maintained an account (@onedingo) on Gab and posted his final message before the shooting. Inspection of his Gab account shows months of anti-semitic and racist posts that were endorsed by a lot of users on Gab.
Understanding the spread of hate speech: We perform the first study that looks into the diffusion dynamics of posts by hateful users on Gab [93]. We choose Gab for all our analysis. This choice is primarily motivated by the nature of Gab, which allows users to post content that may be hateful in nature without any fear of repercussion. This provides a unique opportunity to study how hateful content would spread in the online medium if there were no restrictions. To this end, we crawl the Gab platform and acquire 21M posts by 341K users over a period of 20 months (October 2016 to June 2018). Our analysis reveals that posts by hateful users tend to spread faster, farther, and wider as compared to those by normal users. We find that the hateful users in our dataset (who constitute 0.67% of the total number of users) are very densely connected and are responsible for 26.80% of the posts generated on Gab.

We also study the temporal effect of hate speech on the users and the platform [94]. To understand the temporal characteristics, we needed data from consecutive time points in Gab. As a first step, using a heuristic [98], we generate successive graphs that capture different time snapshots of Gab at one-month intervals. Then, using the DeGroot model [32], we assign a hate intensity score to every user in each temporal snapshot and categorize them based on their degrees of hate. We then perform several linguistic and network studies on these users across the different time snapshots. We find that the amount of hate speech in Gab is consistently increasing.

Automatic detection of hate speech: Due to the massive scale of online social media, methods that automatically detect hate speech are required. In this section, we explore a few of the methods that try to automatically detect hate speech. Some of the approaches utilize keyword-based techniques, machine learning models, etc.
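The DeGroot-style propagation of hate intensity scores described above can be sketched in a few lines: every user repeatedly updates their score to the average of the scores of the accounts they follow. The follow graph, seed scores and iteration count below are hypothetical toy values; the cited work [94] runs this on much larger temporal snapshots of Gab.

```python
def degroot_step(scores, neighbors):
    """One DeGroot update: each user's score becomes the average of the
    scores of the accounts they follow, plus their own (a self-loop keeps
    the user's current opinion in the mix)."""
    new = {}
    for u in scores:
        peers = neighbors.get(u, []) + [u]
        new[u] = sum(scores[p] for p in peers) / len(peers)
    return new

def hate_intensity(seed_scores, neighbors, iterations=10):
    """Iterate DeGroot updates starting from seed scores
    (1.0 = known hateful user, 0.0 = unknown)."""
    scores = dict(seed_scores)
    for _ in range(iterations):
        scores = degroot_step(scores, neighbors)
    return scores

# Hypothetical 4-user follow graph: A is a known hateful seed;
# B follows only A, C follows A and D, D follows nobody.
follows = {"B": ["A"], "C": ["A", "D"], "D": []}
seeds = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}
final = hate_intensity(seeds, follows, iterations=20)
# Users following the hateful seed drift toward higher intensity than D.
```

Thresholding the converged scores then yields the categorization of users by their degrees of hate.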
Choosing the right features can be one of the most challenging tasks when using machine learning for a classification problem. Surface-level features such as bag of words, uni-grams and larger n-grams [18, 136] have been used for this problem; however, since hate speech detection is usually applied to small pieces of text, one may face a data sparsity problem. Lately, neural network based distributed word/paragraph representations, also referred to as word embeddings/paragraph embeddings [36], have been proposed. Using a large (unlabelled) text corpus, a vector representation is induced for each word or paragraph [81] that can eventually be used as classification features, replacing binary features indicating the presence or frequency of particular words.

Hate speech detection is a task that cannot always be solved by using only lexicon-based features or word embeddings. For instance, '6 Million Wasn't Enough' may not be regarded as some form of hate speech when observed in isolation. However, given the context that the utterance is directed by white supremacists and neo-Nazis toward Jewish people who were killed in the Holocaust, one could infer that this is hate speech against Jews. The above example shows that whether a message is hateful or not can be highly dependent on world knowledge. In [46], the authors annotated context-dependent hate speech and showed that incorporating context information improved the overall performance of the model.

Apart from world knowledge, meta-information (i.e., background information about the user of a post, the number of posts by a user, geographical origin) can be used as features to improve the hate speech detection task. Since the data commonly comes from online social media platforms, a variety of meta-information about a post can be collected while crawling the data. A user who is known to post mostly hateful content may continue to do so in the future.
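A minimal sketch of how the embedding-based text features and user meta-information features described above can be combined into one feature vector for a classifier. The embedding table, lexicon and data are all hypothetical toy values; real embeddings come from a model trained on a large unlabelled corpus [81].

```python
# Toy 2-dimensional word embeddings (hypothetical values).
EMBEDDINGS = {"hate": [0.9, 0.1], "love": [0.0, 0.9], "them": [0.3, 0.3]}
UNK = [0.0, 0.0]                    # vector for out-of-vocabulary words
LEXICON = {"hate", "scum"}          # toy profanity/hate lexicon

def text_features(post):
    """Average the word vectors of a post into a fixed-length representation,
    mitigating the sparsity of n-gram features on short texts."""
    vecs = [EMBEDDINGS.get(w, UNK) for w in post.lower().split()]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(UNK))]

def user_features(history):
    """Meta-information features: number of past posts by the author and
    the fraction of them containing lexicon words."""
    flagged = sum(any(w in LEXICON for w in p.lower().split()) for p in history)
    return [len(history), flagged / len(history) if history else 0.0]

# Final feature vector for one post = text features + author meta features.
post, history = "hate them", ["hate everyone", "nice weather today"]
features = text_features(post) + user_features(history)
```

Any off-the-shelf classifier (logistic regression, SVM, etc.) can then be trained on such concatenated vectors.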
Existing research [93] has found that a high number of hateful messages are generated by a small number of users. It has also been observed that men are more likely to spread hate speech than women [139]. The number of profane words in the post history of a user has also been used as a feature for the hate speech classification task [24].

Nowadays, posts containing images, audio, and video content are being shared more and more on social media platforms. In [49], the authors explored the textual and visual information of images for the hate speech detection task.

Most of the methods explored earlier were supervised and heavily dependent on annotated data. Off-the-shelf classifiers such as logistic regression and support vector machines have been extensively used. Recently, deep neural models are being used extensively for the classification task; in [2], the authors explored several models such as CNN-GRU and BERT for multilingual hate speech detection.

Mitigation of hate speech: After the detection of hate speech, we need proper mitigation strategies to stop it from going viral. Current methods largely depend on blocking or suspending users, deleting tweets, etc. This is performed mostly by moderators, for whom it is a tedious task given the information rate. Many companies like Facebook have started to automate this process, but both these methods run the risk of violating free speech. The work in [69] identifies various pitfalls with respect to trust, fairness and bias of such algorithms. A more promising direction could be to counter hate with speech, popularly known as counter speech. Specifically, counter speech is a direct non-hostile response/comment that counters hateful or harmful speech [116]. The idea that 'more speech' is a remedy for dangerous speech has been familiar in liberal democratic thought at least since U.S. Supreme Court Justice Louis Brandeis declared it in 1927. There are several initiatives with the aim of using counter speech to tackle hate speech.
For example, UNESCO released a study [44] titled 'Countering Online Hate Speech' to help countries deal with this problem.

Frameworks for mitigating hate speech using counter speech follow two schools of thought. One is to develop a fully automatic counter speech generation system, which can output contextually relevant counter speech given a piece of hate speech. Since generating contextual replies to a text is still a nascent area in natural language processing, generating counter speech is expected to be even more difficult for AI systems due to the variety of socio-political variables present in it. Hence, a more practical approach could be to recruit a task force of moderators who can suitably edit system-generated counter speech for large-scale use.

One of the earliest computational studies attempted to identify hate and counter users on Twitter and further observe how they interact in online social media [95]. The authors created a small dataset of hate-counter reply pairs and observed how different communities responded differently to the hate speech targeting them. The paper further tried to build a machine learning model to classify a user as hate or counter. A follow-up study [96] on YouTube comments was conducted to further understand how different communities attempted to respond to hate speech. Taking YouTube videos that contain hateful content toward three target communities, namely Jews, African-Americans (Blacks) and the LGBT community, the authors collected user comments to create a dataset containing counter speech. The dataset is rich since it has not only the counter/non-counter binary classes but also a detailed fine-grained classification of the counter class as described in [142], with a slight modification to the 'Tone' category. The authors observed that the LGBT community usually responded to hate speech via "humour", whereas the Jewish community used messages with a "positive tone". The work further adds a classification model that can identify a text as counter speech along with its type. The authors in [111] generated a large-scale dataset of hate speech and the corresponding counter replies. They further used this dataset for counter speech generation using seq2seq models and variational autoencoders. One of the limitations of this paper was that the counter speech data annotated through crowd-sourcing was very generic in nature. Hence, a later work [21] took help from experts from an NGO to curate a dataset of counter speech toward Islamophobic statements from social media. While this dataset provides diverse and to-the-point replies to hate speech, it largely depends on expert availability. These challenges were compiled in [134], where the authors showed how data collection and counter speech generation depend on the assistance of experts. The same paper also highlighted the weakness of the generation models, with only around 10% of the automatic responses being proper responses to the given hate speech. This reinforces the fact that current generation systems are not capable of understanding the hidden nuances required to generate proper counter speech. Another important question that lurks in the research community concerns the "effect of counter speech". While in the case of banning or suspension the effect, i.e., removal of the tweet/user, is directly visible, the effect of counter speech is rather subjective in nature.
In a recent work [47], the authors used a classifier to identify 100,000 hate-counter speech pairs and found a reduction in hate speech following organised counter speech. Whether this was caused solely by the counter speech or by some other factor is still an open question for the research community. Overall, counter speech research shows promise but has several unanswered questions about its data collection, execution and effect. Nevertheless, fighting hate speech in this way has some benefits: it is faster, more flexible and responsive, capable of dealing with extremism from anywhere and in any language, and it does not form a barrier against the principle of a free and open public space for debate. We hope that through counter speech it will slowly become possible to move toward an online world where differences can exist without divisions.
Future directions:
Increased polarization appears to be spreading hate speech further. Most of the current models have been developed for the English language; there is a need for larger and better hate speech datasets for other languages as well. Using transfer learning to improve the task is another direction, and zero- or few-shot learning would make it possible to build models for low-resource languages. An orthogonal but very interesting direction is to understand the complexity of the annotation task itself: due to the subjective nature of the task, the perception of hate speech differs across people belonging to different demographics. Another important direction is to integrate the detection and counter systems to build an end-to-end framework for effective hate speech detection and countering, as shown in Figure 1.
Figure 1: An overview of the hate speech framework.
As outlined in the introduction, in this section we shall talk about two important types of bias (out of the many that dwell in the online world): bias in news reporting and bias in online recommendation systems.
As we hit 53.6% Internet penetration worldwide, compounded by an exponential growth in the number of social media users in developing countries fuelled by cheap data rates and smartphone-based accessibility, the news media continues to play a significant role in shaping political discourse and influencing national priorities. Of late, while on the one hand it has become easier to produce news without adequate references, on the other hand most news readers share news without any verification [43]. In many instances, even mainstream media houses have been accused of copying and distributing news from other media houses with little or no verification. While being informed about news from sources other than direct correspondents is a common practice, distributing that news without verification is indeed a worrying trend. In many cases, this has led to fake news propagation by the most reputed media moguls. So, quantifying media bias, and defining the abstract idea of bias in this context, is an important area of research.

Genesis: In Manufacturing Consent [60] (1988), Edward Herman and Noam Chomsky saw the news media as a propagandist that will find ways to propagate the "filtered" message of the rich and powerful to the ordinary masses. They hypothesize that, sooner or later in any system, the news medium will get concentrated in the hands of a few people of power and money, and will get manipulated either through ownership or by filtering out news that is not beneficial for the people in power. Following this model, researchers have hypothesized different kinds of biases, and there have been numerous studies examining bias in media, especially in the US and the European context. While the term "bias" still remains abstract, some studies have made an effort to distinguish between the computational sense of bias and the journalistic sense of it, making it more scientifically definable and quantifiable.
Journalistic and linguistic studies mostly discuss selection/coverage bias, confirmation/statement bias [78, 86, 105, 118] and psychological/cognitive biases [15, 112]. Recently, a lot of work has been done in which researchers formulate a computational basis for investigating bias. Some works focus on specific kinds of bias, such as gender [11, 89, 152] and race [20]. Politics, in particular, is a widely studied and discussed topic. Researchers seek to find the ideological political bias of users in social networks [22, 66, 141], news media [5, 12, 77, 79, 113] and user comments [114, 148]. D'Alessio and Allen [25] list three kinds of media bias as the most widely studied: coverage/visibility bias [37], gatekeeping bias/selectivity [61] or selection bias [51] (sometimes referred to as agenda bias [37]), and statement bias/tonality bias/presentation bias [37, 51].
Document level bias in reporting: A news article may choose to cover some aspects of a story and filter out other aspects so as to bias the sentiments of the readers toward/against a specific political party or interest group. Researchers have annotated such sentiment leanings of news articles at the article level [70, 12] or the sentence level [85], building document sentiment prediction models based on the annotated data. Following previous research on sentiment prediction of documents [70, 19, 16], dominated by BERT [35] based methods, Longformer [8] is shown to be the best choice for the prediction of document level bias.
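One reason long-document models such as Longformer help here is that BERT-style encoders accept a fixed maximum input length (typically 512 subword tokens), while news articles are often much longer. A common workaround, sketched below (a hypothetical pipeline, not the cited papers' exact method), is to score overlapping windows of the article and aggregate the window scores into one document-level prediction:

```python
def sliding_windows(tokens, max_len=512, stride=256):
    """Split a long article into overlapping windows that each fit a
    fixed-length encoder such as BERT."""
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows

def document_score(tokens, window_scorer, max_len=512, stride=256):
    """Aggregate per-window bias scores into a document-level score by
    simple averaging (one of several possible aggregation choices)."""
    scores = [window_scorer(w) for w in sliding_windows(tokens, max_len, stride)]
    return sum(scores) / len(scores)
```

A long-context model like Longformer instead attends over the whole article at once, avoiding the window-and-aggregate step entirely.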
Media bias is topic and demographics dependent: While machine learning models can achieve

Source level bias: Predicting the topical/political bias of individual news outlets is as critical to media profiling as determining factuality. With the advent of user-generated content and the exponential rise of digital media, scaling the process of determining the media bias and factuality of reporting of media houses has become more and more important, as everybody who shares an article (or a screenshot of one) from any source is a news provider now. While measuring the factuality or bias of each piece of news is a hard task that requires world knowledge, predicting the aggregate factuality or bias of a news media house is relatively straightforward. Political bias and factuality of reporting have a linguistic aspect (what was written) along with a social context (who read it). So, the authors in [6] crawled relevant data from Twitter, Facebook, YouTube and Wikipedia and studied the impact of different metadata extracted from these sources when classifying the media sources. The evaluation results showed that what was written matters most, and that putting all information sources together yields huge improvements over the current state of the art. On the other hand, in [113], the authors studied the demographics of the US population interested in a media source with a specific political inclination, using the Facebook AdSense tool, to understand the leanings of the audience of the news media houses, and report high accuracy by doing just that. Along with political bias, they were also able to identify the demographic biases in the consumer population of any media house in a zero-shot setting (i.e., using no training data). A demo of their application can be found at https://twitter-app.mpi-sws.org/media-bias-monitor/. TIMME [144] trains a special form of GCNs on annotated data to identify the political bias of each Twitter user, and is able to identify the geographic distribution of Twitter users with a particular bias, which correlates well with the voting pattern of American citizens. The authors were able to use the same algorithm to identify the bias of each news media house by gathering its Twitter data.

Future directions: News media is widely cited as the fourth pillar of democracy. While the health of a democratic institution depends heavily on fair coverage of the institution by the media houses, studies on how computational measures of media bias relate to the health of democracies are lacking. Moreover, most studies on media bias concentrate on two-party systems and are conducted on American and European demographics for English-language media, while other democracies face the same problem and deserve similar attention. Further research is needed to understand the challenges faced in multi-party systems, for other languages and in different demographics. Also, the event space in media changes very rapidly, so research in an online learning setup is needed to further enhance media bias prediction accuracy over time.
The digital platform is full of choices. To help users make intelligent choices, different information filtering systems are deployed on online platforms. Recommendation systems (RSs) are one such class of information filtering systems.

Filter bubble and evolution of fairness in RSs: Traditionally, RSs, like other information filtering systems, are keyed to relevance [87, 130]. However, over-dependency on relevance has led to differential services to different users or user groups. To describe these effects succinctly, Eli Pariser coined the term 'filter bubble' [107]. The filter bubble problem is the concern that personalization technologies, including RSs, narrow and bias the topics of information provided to people, without the people themselves noticing. To account for these effects, the field of fairness in recommendation first evolved as 'information neutrality in RSs'. Focusing on customer fairness (information neutrality toward customers), Kamishima et al. [67, 68] tried to solve the unfairness issue in RSs by adding a regularization term that enforces demographic parity. Such objectives penalize the differences among the average predicted ratings of user groups (based on sensitive attributes, e.g., gender). However, demographic parity is only appropriate when preferences are unrelated to the sensitive features; in tasks such as recommendation, user preferences are indeed influenced by sensitive features such as gender, race, and age [17, 30]. Taking a leaf out of the progress in the fairness literature in supervised machine learning [57, 99], Yao et al. [146] put forward fairness notions to bridge this gap. They formulated different customer fairness metrics and showed their effectiveness in improving customer fairness in recommendation [146].
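The demographic-parity style penalty discussed above, U_par = |E_g[y] − E_¬g[y]| [146], simply measures the absolute gap between the average predicted rating for the protected group and for everyone else. A minimal sketch, with hypothetical toy ratings and a binary sensitive attribute:

```python
def demographic_parity_gap(predictions, group):
    """U_par = |E_g[y] - E_not_g[y]|: absolute difference between the mean
    predicted rating for users in the protected group and for the rest.
    Adding this as a regularization term penalizes group-level disparity."""
    in_g  = [y for y, g in zip(predictions, group) if g]
    out_g = [y for y, g in zip(predictions, group) if not g]
    return abs(sum(in_g) / len(in_g) - sum(out_g) / len(out_g))

# Hypothetical predicted ratings and a binary sensitive attribute.
y_hat = [4.0, 3.0, 5.0, 2.0]
is_group = [True, True, False, False]
gap = demographic_parity_gap(y_hat, is_group)  # |3.5 - 3.5| = 0.0
```

In training, this term would be added (with a weight) to the recommendation loss, pushing the two group means together.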
Following their footsteps, a number of works focused on 'group fairness' in personalized recommendations [153, 38]: they first quantified the biases that recommendation algorithms induce toward socially salient groups and then proposed methodologies to mitigate such biases. However, a major drawback of many of these works was their neglect of one of the major stakeholders in RSs, i.e., the producers of the items/services. This led to a second school of thought, when Burke et al. [13, 14] first advocated for fairness toward both customers and providers in a recommendation framework. Considering RSs as a two-sided affair, many nuanced algorithms came into existence that consider fairness toward both customers and producers, thus taking a giant step toward a fair marketplace [100, 108, 48].

Auditing RSs: While the fairness community seems to have covered different forms of biases, there is a lack of understanding of existing online recommendation systems and the biases thereof. Understanding these systems is especially important today due to the emergence of different private label products (and in-house products) on e-commerce (and OTT) platforms [140, 39, 3]. A private label product is often produced and sold under the retailer's brand name, providing enough monetary incentive for the platform to discriminate against several other products (or producers) on the platform. Note that no third-party (3P) regulator can quantify such biases, because of the lack of access to the exact underlying algorithms and the exact user-item interaction details. To enable such 3P audits, in one of our works we presented a novel network-based technique that enabled us to extract important parameters for auditing RSs by considering them as black boxes [28]. With detailed analyses of three different existing online RSs, we first proposed ways to quantify their induced diversity and extent of information segregation [28].
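A toy sketch of such a network-centric, black-box audit: an external auditor can crawl the publicly visible recommendation lists, treat them as a directed item-item graph, and compute simple structural statistics. The data, function names and the entropy-based diversity proxy below are illustrative assumptions, not the exact metrics of [28].

```python
import math
from collections import Counter

def recommendation_network(rec_lists):
    """Treat each item's crawled recommendation list as directed edges
    item -> recommended item, yielding a graph a third-party auditor can
    build without access to the platform's internal algorithm."""
    return {src: set(dst) for src, dst in rec_lists.items()}

def exposure_entropy(graph):
    """Shannon entropy of how often items appear as recommendation targets.
    Low entropy means exposure is concentrated on a few items, one simple
    proxy for low induced diversity."""
    counts = Counter(t for targets in graph.values() for t in targets)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical crawled 'customers also viewed' lists.
recs = {"A": ["B", "C"], "B": ["C"], "C": ["B"]}
g = recommendation_network(recs)
```

Comparing such statistics across item groups (e.g., private label vs. third-party products) is what lets an auditor flag potential platform-induced bias from the outside.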
The usefulness of such a framework is manifold: (a) it sheds light on how recommendations are formed between items based on different item-centric properties, and (b) it can be used as a tool for quantifying and auditing different consumer-focused metrics; e.g., U_par = |E_g[y] − E_¬g[y]| can be an instantiation of such a demographic parity based regularizer [146].

Future directions:
While the fairness community seems to have covered different forms of biases in recommendation frameworks, it has overlooked the special relationships that may exist between a digital marketplace and a subset of its stakeholders, and the biases thereof. Hence, studies of unfairness discovery and mitigation that account for these special relationships of platforms remain an under-explored avenue of research to date. The introduction of sponsored search and recommendations complicates the scenario even further. Policies that allow sponsored results to deviate from organic results while adhering to the fairness of the marketplace can be another interesting broad direction for further research.
In this survey we have presented a critical rundown of the evolution of the online infosphere by depicting some of the research areas that are becoming crucial at the current time. We started our discussion with a view of the infosphere as a collaborative platform, with a dedicated focus on Wikipedia. Wikipedia, freely available and one of the largest knowledge bases, containing a wide variety of information, has been a primary focus of extensive research so far. In this survey we have presented a detailed account of the works on article quality monitoring, editor behaviour and retention, and malicious activities like vandalism.

In the next section we detailed the growth of citations and collaborations within and across various scientific disciplines that have their roots in the infosphere. In fact, this has resulted in the birth of many new interdisciplinary landscapes. We also discussed how machine learning algorithms can be used to predict future citations as well as to recommend citations. Finally, we touched upon various issues related to anomalous citation flows and their behaviour.

Finally, we summarised the research drives for patrolling the infosphere to suppress the rising volume of harmful content. The discussion started by analyzing the concept of hate speech, its growth over the past few years on online social media platforms, and its adverse impact on both the online and the real world. We have shown the massive effort the research community has put into the detection and mitigation of such hateful behaviour. However, a lot of issues still remain open problems and need immediate attention. By crawling the Gab platform and analyzing its data, we observed dense connectivity in the network of hateful users and found that a significant fraction of the posts on the platform are generated by these hateful users.
Additionally, we studied the temporal effect of hate speech on users using the Gab data and found an increasing rate of hateful users in the social network. After reviewing the recent literature, we pointed out that incorporating knowledge-based context information for a given post improves the overall performance of hate speech detection compared to analyzing only the textual information. Apart from that, we observed that adding the user information of a given post can further improve the hate speech detection task. The last segment of the discussion dealt with the bias and discrimination that are becoming pervasive across different online environments like recommendation and news media platforms.

To conclude, the research aspects related to the online infosphere mentioned above have attracted a lot of attention across the board, including scientific communities, industry stakeholders and policy-makers. This paper discusses the technical challenges and possible solutions in a direction that utilizes the immense power of AI for solving real-world problems while also considering the societal implications of these solutions. As a final note, we can see that the problems related to anomalies in scientific collaborations and citations, hate speech detection and mitigation in social media, and bias and unfairness in news media and recommendation systems have a lot of open ends, thus enabling exciting opportunities for future research.
References [1]
Albert, R., and Barab´asi, A.-L.
Statistical mechanics of complex networks.
Rev.Mod. Phys. 74 (Jan 2002), 47–97.[2]
Aluru, S. S., Mathew, B., Saha, P., and Mukherjee, A.
Deep learning modelsfor multilingual hate speech detection. arXiv preprint arXiv:2004.06465 (2020).[3]
Amazon . Online platforms and market power, part 2: Innovation and entrepreneur-ship. https://docs.house.gov/meetings/JU/JU05/20190716/109793/HHRG-116-JU05-20190716-SD038.pdf, 2019.[4]
Badjatiya, P., Gupta, S., Gupta, M., and Varma, V.
Deep learning for hatespeech detection in tweets. WWW, pp. 759–760.[5]
Baly, R., Karadzhov, G., Alexandrov, D., Glass, J., and Nakov, P.
Predict-ing factuality of reporting and bias of news media sources. arXiv preprint arXiv:1810.01765 (2018). 186]
Baly, R., Karadzhov, G., An, J., Kwak, H., Dinkov, Y., Ali, A., Glass,J., and Nakov, P.
What was written vs. who read it: News media profiling using textanalysis and social media context. In
Proceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics (Online, July 2020), Association for ComputationalLinguistics, pp. 3364–3374.[7]
Barthel, R., and Seidl, R.
Interdisciplinary collaboration between natural and socialsciences – status and trends exemplified in groundwater research.
PLOS ONE 12 , 1 (012017), 1–27.[8]
Beltagy, I., Peters, M. E., and Cohan, A.
Longformer: The long-documenttransformer, 2020.[9]
Bergmann, T., Dale, R., Sattari, N., Heit, E., and Bhat, H. S.
The in-terdisciplinarity of collaborations in cognitive science.
Cognitive science 41 , 5 (2017),1412–1418.[10]
Bhagavatula, C., Feldman, S., Power, R., and Ammar, W.
Content-basedcitation recommendation. In
Proceedings of the 2018 Conference of the North AmericanChapter of the Association for Computational Linguistics: Human Language Technologies,Volume 1 (Long Papers) (New Orleans, Louisiana, June 2018), Association for Computa-tional Linguistics, pp. 238–251.[11]
Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., and Kalai, A.
Manis to computer programmer as woman is to homemaker? debiasing word embeddings.In
Proceedings of the 30th International Conference on Neural Information ProcessingSystems (USA, 2016), NIPS’16, Curran Associates Inc., pp. 4356–4364.[12]
Budak, C., Goel, S., and Rao, J. M.
Fair and balanced? quantifying media biasthrough crowdsourced content analysis.
Public Opinion Quarterly 80 , S1 (2016), 250–271.[13]
Burke, R.
Multisided fairness for recommendation. arXiv preprint arXiv:1707.00093 (2017).[14]
Burke, R., Sonboli, N., and Ordonez-Gauger, A.
Balanced neighborhoods formulti-sided fairness in recommendation. In
ACM FAccT (2018).[15]
Caliskan, A., Bryson, J. J., and Narayanan, A.
Semantics derived automaticallyfrom language corpora contain human-like biases.
Science 356 , 6334 (2017), 183–186.[16]
Chakraborty, S., Goyal, P., and Mukherjee, A.
Aspect-based sentiment anal-ysis of scientific reviews. In
Proceedings of the ACM/IEEE Joint Conference on DigitalLibraries in 2020 (New York, NY, USA, 2020), JCDL ’20, Association for ComputingMachinery, p. 207–216.[17]
Chausson, O.
Who watches what?: assessing the impact of gender and personal-ity on film preferences.
Paper published online on the MyPersonality project websitehttp://mypersonality. org/wiki/doku. php (2010).[18]
Chen, Y., Zhou, Y., Zhu, S., and Xu, H.
Detecting offensive language in socialmedia to protect adolescent online safety. pp. 71–80.[19]
Choi, G., Oh, S., and Kim, H.
Improving document-level sentiment classificationusing importance of sentences.
Entropy 22 , 12 (2020), 1336.1920]
Chouldechova, A.
Fair prediction with disparate impact: A study of bias in recidivismprediction instruments.
Big data 5 , 2 (2017), 153–163.[21]
Chung, Y.-L., Kuzmenko, E., Tekiroglu, S. S., and Guerini, M.
CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Florence, Italy, July 2019), Association for Computational Linguistics, pp. 2819–2829.
[22] Conover, M. D., Gonçalves, B., Ratkiewicz, J., Flammini, A., and Menczer, F. Predicting the political alignment of Twitter users. In (2011), IEEE, pp. 192–199.
[23] Cremisini, A., Aguilar, D., and Finlayson, M. A Challenging Dataset for Bias Detection: The Case of the Crisis in the Ukraine. 06 2019, pp. 173–183.
[24] Dadvar, M., Trieschnigg, D., Ordelman, R., and de Jong, F. Improving cyberbullying detection with user context. pp. 693–696.
[25] D'Alessio, D., and Allen, M. Media bias in presidential elections: A meta-analysis. Journal of Communication 50, 4 (2000), 133–156.
[26] Dang, Q. V., and Ignat, C.-L. Quality assessment of Wikipedia articles without feature engineering. In Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (2016), pp. 27–30.
[27] Dash, A., Chakraborty, A., Ghosh, S., Mukherjee, A., and Gummadi, K. P. When the umpire is also a player: Bias in private label product recommendations on e-commerce marketplaces. In ACM FAccT (2021).
[28] Dash, A., Mukherjee, A., and Ghosh, S. A network-centric framework for auditing recommendation systems. In IEEE INFOCOM (2019).
[29] Davidson, T., Warmsley, D., Macy, M., and Weber, I. Automated hate speech detection and the problem of offensive language.
[30] Daymont, T. N., and Andrisani, P. J. Job preferences, college major, and the gender gap in earnings. Journal of Human Resources (1984).
[31] de La Robertie, B., Pitarch, Y., and Teste, O. Measuring article quality in Wikipedia using the collaboration network. In (2015), IEEE, pp. 464–471.
[32] DeGroot, M. H. Reaching a consensus. Journal of the American Statistical Association 69, 345 (1974), 118–121.
[33] Del Vigna, F., Cimino, A., Dell'Orletta, F., Petrocchi, M., and Tesconi, M. Hate me, hate me not: Hate speech detection on Facebook.
[34] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (Minneapolis, Minnesota, June 2019), Association for Computational Linguistics, pp. 4171–4186.
[35] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (Minneapolis, Minnesota, June 2019), Association for Computational Linguistics, pp. 4171–4186.
[36]
Djuric, N., Zhou, J., Morris, R., Grbovic, M., Radosavljevic, V., and Bhamidipati, N. Hate speech detection with comment embeddings. pp. 29–30.
[37] Eberl, J.-M., Boomgaarden, H. G., and Wagner, M. One bias fits all? Three types of media bias and their effects on party preferences. Communication Research 44, 8 (2017), 1125–1148.
[38] Edizel, B., Bonchi, F., Hajian, S., Panisson, A., and Tassa, T. Fairecsys: Mitigating algorithmic bias in recommender systems. Springer International Journal of Data Science and Analytics (2019).
[39] Faherty, E., Huang, K., and Land, R. The Amazon monopoly: Is Amazon's private label business the tipping point? Munich Personal RePEc Archive (2017).
[40] Fister, I., Fister, I., and Perc, M. Toward the discovery of citation cartels in citation networks. Frontiers in Physics 4 (2016), 49.
[41] Fister, I., Mlakar, U., and Brest, J. A new population-based nature-inspired algorithm every month: Is the current era coming to the end? In StuCoSReC: Proceedings of the 2016 3rd Student Computer Science Research Conference (2016).
[42] Färber, M., and Jatowt, A. Citation recommendation: Approaches and datasets. International Journal on Digital Libraries 21, 4 (Aug 2020), 375–405.
[43] Gabielkov, M., Ramachandran, A., Chaintreau, A., and Legout, A. Social Clicks: What and Who Gets Read on Twitter? In IFIP Perf. (Antibes Juan-les-Pins, France, June 2016).
[44] Gagliardone, I., Gal, D., Alves, T., and Martinez, G. Countering online hate speech. UNESCO Publishing, 2015.
[45] Ganguly, S., Kulshrestha, J., An, J., and Kwak, H. Empirical evaluation of three common assumptions in building political media bias datasets. Proceedings of the International AAAI Conference on Web and Social Media 14, 1 (May 2020), 939–943.
[46] Gao, L., and Huang, R. Detecting online hate speech using context aware models. pp. 260–266.
[47] Garland, J., Ghazi-Zahedi, K., Young, J.-G., Hébert-Dufresne, L., and Galesic, M. Countering hate on social media: Large scale classification of hate and counter speech. arXiv preprint arXiv:2006.01974 (2020).
[48] Geyik, S. C., Ambler, S., and Kenthapadi, K. Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search.
[49]
Gomez, R., Gibert, J., Gomez, L., and Karatzas, D. Exploring hate speech detection in multimodal publications. pp. 1459–1467.
[50] Gomez-Uribe, C. A., and Hunt, N. The Netflix recommender system: Algorithms, business value, and innovation. ACM TMIS 6, 4 (2016).
[51] Groeling, T. Media bias by the numbers: Challenges and opportunities in the empirical study of partisan news. Annual Review of Political Science 16, 1 (2013), 129–151.
[52] Guda, B. P. R., Seelaboyina, S. B., Sarkar, S., and Mukherjee, A. NwQM: A neural quality assessment framework for Wikipedia. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020), pp. 8396–8406.
[53] Halfaker, A., and Geiger, R. S. ORES: Lowering barriers with participatory machine learning in Wikipedia. arXiv preprint arXiv:1909.05189 (2019).
[54] Halfaker, A., Geiger, R. S., Morgan, J. T., and Riedl, J. The rise and decline of an open collaboration system: How Wikipedia's reaction to popularity is causing its decline. American Behavioral Scientist 57, 5 (2013), 664–688.
[55] Halfaker, A., Kittur, A., and Riedl, J. Don't bite the newbies: How reverts affect the quantity and quality of Wikipedia work. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration (2011), pp. 163–172.
[56] Halfaker, A., and Taraborelli, D. Artificial intelligence service "ORES" gives Wikipedians x-ray specs to see through bad edits. Wikimedia Blog (2015).
[57] Hardt, M., Price, E., Srebro, N., et al. Equality of opportunity in supervised learning. In NIPS (2016).
[58] Hazra, R., Singh, M., Goyal, P., Adhikari, B., and Mukherjee, A. The rise and rise of interdisciplinary research: Understanding the interaction dynamics of three major fields - physics, mathematics and computer science. In Digital Libraries at the Crossroads of Digital Information for the Future - 21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019, Kuala Lumpur, Malaysia, November 4-7, 2019, Proceedings (2019), pp. 71–77.
[59] He, Q., Pei, J., Kifer, D., Mitra, P., and Giles, L. Context-aware citation recommendation. In Proceedings of the 19th International Conference on World Wide Web (New York, NY, USA, 2010), WWW '10, Association for Computing Machinery, pp. 421–430.
[60] Herman, E. S., and Chomsky, N. Manufacturing consent: The political economy of the mass media. Pantheon Books, 1988.
[61] Hofstetter, C. R., and Buss, T. F. Bias in television news coverage of political events: A methodological analysis. Journal of Broadcasting 22, 4 (1978), 517–530.
[62] Jain, N., and Singh, M. The evolving ecosystem of predatory journals: A case study in Indian perspective. In International Conference on Asian Digital Libraries (2019), Springer, pp. 78–92.
[63] Jana, A., Kanojiya, P., Goyal, P., and Mukherjee, A. WikiRef: Wikilinks as a route to recommending appropriate references for scientific Wikipedia pages. arXiv preprint arXiv:1806.04092 (2018).
[64]
Jeong, C., Jang, S., Shin, H., Park, E., and Choi, S. A context-aware citation recommendation model with BERT and graph convolutional networks, 2019.
[65] Johnson, I., Lemmerich, F., Sáez-Trumper, D., West, R., Strohmaier, M., and Zia, L. Global gender differences in Wikipedia readership. arXiv preprint arXiv:2007.10403 (2020).
[66] Johnson, K., and Goldwasser, D. Identifying stance by analyzing political discourse on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science (2016), pp. 66–75.
[67] Kamishima, T., Akaho, S., Asoh, H., and Sakuma, J. Enhancement of the neutrality in recommendation. In Decisions@RecSys (2012).
[68] Kamishima, T., Akaho, S., and Sakuma, J. Fairness-aware learning through regularization approach. In IEEE ICDMW (2011).
[69] Katzenbach, C., Binns, R., and Gorwa, R. Algorithmic content moderation: Technical and political challenges in the automation of platform governance. Big Data and Society 7, 1 (2020).
[70] Kiesel, J., Mestre, M., Shukla, R., Vincent, E., Adineh, P., Corney, D., Stein, B., and Potthast, M. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation (Minneapolis, Minnesota, USA, June 2019), Association for Computational Linguistics, pp. 829–839.
[71] Kiesel, J., Potthast, M., Hagen, M., and Stein, B. Spatio-temporal analysis of reverted Wikipedia edits. In ICWSM (2017).
[72] Kojaku, S., Livan, G., and Masuda, N. Detecting citation cartels in journal networks, 2020.
[73] Kshirsagar, R., Cukuvac, T., McKeown, K., and McGregor, S. Predictive embeddings for hate speech detection on Twitter. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2) (2018), pp. 26–32.
[74] Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A., and Upfal, E. Stochastic models for the web graph. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science (USA, 2000), FOCS '00, IEEE Computer Society, p. 57.
[75] Kumar, S., Spezzano, F., and Subrahmanian, V. VEWS: A Wikipedia vandal early warning system. ArXiv abs/1507.01272 (2015).
[76] Kwon, S., Solomon, G. E. A., Youtie, J., and Porter, A. L. A measure of knowledge flow between specific fields: Implications of interdisciplinarity for impact and funding. PLOS ONE 12, 10 (10 2017), 1–16.
[77] Laver, M., Benoit, K., and Garry, J. Extracting policy positions from political texts using words as data. American Political Science Review 97, 2 (2003), 311–331.
[78] Lazaridou, K., Krestel, R., and Naumann, F. Identifying media bias by analyzing reported speech. In (2017), IEEE, pp. 943–948.
[79]
Le, H. T. T., Shafiq, Z., and Srinivasan, P. Scalable news slant measurement using Twitter. In Eleventh International AAAI Conference on Web and Social Media (2017).
[80] Le, Q., and Mikolov, T. Distributed representations of sentences and documents. In International Conference on Machine Learning (2014), pp. 1188–1196.
[81] Le, Q., and Mikolov, T. Distributed representations of sentences and documents. (05 2014).
[82] Li, C.-T., Lin, Y.-J., Yan, R., and Yeh, M.-Y. Trend-based citation count prediction for research articles. In Advances in Knowledge Discovery and Data Mining (Cham, 2015), T. Cao, E.-P. Lim, Z.-H. Zhou, T.-B. Ho, D. Cheung, and H. Motoda, Eds., Springer International Publishing, pp. 659–671.
[83] Li, S., Zhao, W. X., Yin, E. J., and Wen, J.-R. A neural citation count prediction model based on peer review text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (Hong Kong, China, Nov. 2019), Association for Computational Linguistics, pp. 4914–4924.
[84] Li, X., Tang, J., Wang, T., Luo, Z., and De Rijke, M. Automatically assessing Wikipedia article quality by exploiting article–editor networks. In European Conference on Information Retrieval (2015), Springer, pp. 574–580.
[85] Lim, S., Jatowt, A., and Masatoshi, Y. Creating a dataset for fine-grained bias detection in news articles. Forum on Data Engineering and Information Management 12 (May 2020), 1–35.
[86] Lin, Y.-R., Bagrow, J. P., and Lazer, D. More voices than ever? Quantifying media bias in networks. In Fifth International AAAI Conference on Weblogs and Social Media (2011).
[87] Linden, G., Smith, B., and York, J. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing (2003).
[88] Littlejohn, A., and Hood, N. Becoming an online editor: Perceived roles and responsibilities of Wikipedia editors. Information Research 23, 1 (2018).
[89] Madaan, N., Mehta, S., Agrawaal, T., Malhotra, V., Aggarwal, A., Gupta, Y., and Saxena, M. Analyze, detect and remove gender stereotyping from Bollywood movies. In Conference on Fairness, Accountability and Transparency (2018), pp. 92–105.
[90] Maity, S. K., Chakraborty, A., Goyal, P., and Mukherjee, A. Detection of sockpuppets in social media. In Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (2017), pp. 243–246.
[91] Marrese-Taylor, E., Loyola, P., and Matsuo, Y. An edit-centric approach for Wikipedia article quality assessment. arXiv preprint arXiv:1909.08880 (2019).
[92] Martinez-Rico, J. R., Martínez-Romo, J., and Araujo, L. Can deep learning techniques improve classification performance of vandalism detection in Wikipedia? Eng. Appl. Artif. Intell. 78 (2019), 248–259.
[93]
Mathew, B., Dutt, R., Goyal, P., and Mukherjee, A. Spread of hate speech in online social media. pp. 173–182.
[94] Mathew, B., Illendula, A., Saha, P., Sarkar, S., Goyal, P., and Mukherjee, A. Hate begets hate: A temporal study of hate speech. Proc. ACM Hum.-Comput. Interact. 4, CSCW2 (Oct. 2020).
[95] Mathew, B., Kumar, N., Goyal, P., and Mukherjee, A. Interaction dynamics between hate and counter users on Twitter. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD. 2020, pp. 116–124.
[96] Mathew, B., Saha, P., Tharad, H., Rajgaria, S., Singhania, P., Maity, S. K., Goyal, P., and Mukherjee, A. Thou shalt not hate: Countering online hate speech. In Proceedings of the International AAAI Conference on Web and Social Media (2019), vol. 13, pp. 369–380.
[97] Mathew, B., Saha, P., Yimam, S. M., Biemann, C., Goyal, P., and Mukherjee, A. HateXplain: A benchmark dataset for explainable hate speech detection. arXiv preprint arXiv:2012.10289 (2020).
[98] Meeder, B., Karrer, B., Sayedi, A., Ravi, R., Borgs, C., and Chayes, J. We know who you followed last summer: Inferring social link creation times in Twitter. In Proceedings of the 20th International Conference on World Wide Web (2011), pp. 517–526.
[99] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635 (2019).
[100] Mehrotra, R., McInerney, J., Bouchard, H., Lalmas, M., and Diaz, F. Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness & satisfaction in recommendation systems. In ACM CIKM (2018).
[101] Morgan, J., Bouterse, S., Walls, H., and Stierch, S. Tea and sympathy: Crafting positive new user experiences on Wikipedia. In CSCW '13 (2013).
[102] Morgan, J., and Halfaker, A. Evaluating the impact of the Wikipedia Teahouse on newcomer socialization and retention. Proceedings of the 14th International Symposium on Open Collaboration (2018).
[103] Morillo, F., Bordons, M., and Gómez, I. Interdisciplinarity in science: A tentative typology of disciplines and research areas. J. Am. Soc. Inf. Sci. Technol. 54, 13 (Nov. 2003), 1237–1249.
[104] Murić, G., Abeliuk, A., Lerman, K., and Ferrara, E. Collaboration drives individual productivity. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–24.
[105] Nickerson, R. S. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology 2, 2 (1998), 175–220.
[106] Pandey, P. K., Singh, M., Goyal, P., Mukherjee, A., and Chakrabarti, S. Analysis of reference and citation copying in evolving bibliographic networks. Journal of Informetrics 14, 1 (2020), 101003.
[107] Pariser, E. The filter bubble: What the Internet is hiding from you. Penguin UK, 2011.
[108]
Patro, G. K., Biswas, A., Ganguly, N., Gummadi, K. P., and Chakraborty, A. FairRec: Two-sided fairness for personalized recommendations in two-sided platforms. In WWW (2020).
[109] Pedersen, D. B. Integrating social sciences and humanities in interdisciplinary research. Palgrave Communications 2 (2016), 16036.
[110] Pornprasit, C., Liu, X., Kertkeidkachorn, N., Kim, K.-S., Noraset, T., and Tuarob, S. ConvCN: A CNN-based citation network embedding algorithm towards citation recommendation. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (New York, NY, USA, 2020), JCDL '20, Association for Computing Machinery, pp. 433–436.
[111] Qian, J., Bethke, A., Liu, Y., Belding, E., and Wang, W. Y. A benchmark dataset for learning to intervene in online hate speech. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (Hong Kong, China, Nov. 2019), Association for Computational Linguistics, pp. 4755–4764.
[112] Recasens, M., Danescu-Niculescu-Mizil, C., and Jurafsky, D. Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2013), pp. 1650–1659.
[113] Ribeiro, F. N., Henrique, L., Benevenuto, F., Chakraborty, A., Kulshrestha, J., Babaei, M., and Gummadi, K. P. Media bias monitor: Quantifying biases of social media news outlets at large-scale. In Twelfth International AAAI Conference on Web and Social Media (2018).
[114] Ribeiro, M. H., Calais, P. H., Almeida, V. A., and Meira Jr, W. "Everything I disagree with is #FakeNews": Correlating political polarization and spread of misinformation. arXiv preprint arXiv:1706.05924 (2017).
[115] Ribeiro, M. H., Gligorić, K., Peyrard, M., Lemmerich, F., Strohmaier, M., and West, R. Sudden attention shifts on Wikipedia following COVID-19 mobility restrictions. arXiv preprint arXiv:2005.08505 (2020).
[116] Richards, R. D., and Calvert, C. Counterspeech 2000: A new look at the old remedy for bad speech. BYU L. Rev. (2000), 553.
[117] Ruprechter, T., Santos, T., and Helic, D. Relating Wikipedia article quality to edit behavior and link structure. Applied Network Science 5 (2020), 1–20.
[118] Saez-Trumper, D., Castillo, C., and Lalmas, M. Social media news communities: Gatekeeping, coverage, and statement bias. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (New York, NY, USA, 2013), CIKM '13, ACM, pp. 1679–1684.
[119] Saha, P., Mathew, B., Goyal, P., and Mukherjee, A. Hateminers: Detecting hate speech against women. arXiv preprint arXiv:1812.06700 (2018).
[120] Sarkar, S., Reddy, B. P., Sikdar, S., and Mukherjee, A. StRE: Self attentive edit quality prediction in Wikipedia. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019), pp. 3962–3972.
[121]
Sayama, H., and Akaishi, J. Characterizing interdisciplinarity of researchers and research topics using web search engines. PLOS ONE 7, 6 (06 2012), 1–9.
[122] Sharma, A., Hofman, J. M., and Watts, D. J. Estimating the causal impact of recommendation systems from observational data. In ACM EC (2015).
[123] Shen, A., Qi, J., and Baldwin, T. A hybrid model for quality assessment of Wikipedia articles. In Proceedings of the Australasian Language Technology Association Workshop 2017 (2017), pp. 43–52.
[124] Shen, A., Salehi, B., Baldwin, T., and Qi, J. A joint model for multimodal document quality assessment. In (2019), IEEE, pp. 107–110.
[125] Shen, A., Salehi, B., Qi, J., and Baldwin, T. Feature-guided neural model training for supervised document representation learning. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association (2019), pp. 47–51.
[126] Shiji, C., Yves, G., Clément, A., and Vincent, L. Interdisciplinarity patterns of highly-cited papers: A cross-disciplinary analysis. Proceedings of the American Society for Information Science and Technology 51, 1 (2015), 1–4.
[127] Singh, M., Jaiswal, A., Shree, P., Pal, A., Mukherjee, A., and Goyal, P. Understanding the impact of early citers on long-term scientific impact. In (2017), pp. 1–10.
[128] Singh, M., Patidar, V., Kumar, S., Chakraborty, T., Mukherjee, A., and Goyal, P. The role of citation context in predicting long-term citation profiles: An experimental study based on a massive bibliographic text dataset. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (New York, NY, USA, 2015), CIKM '15, Association for Computing Machinery, pp. 1271–1280.
[129] Singh, M., Sarkar, R., Goyal, P., Mukherjee, A., and Chakrabarti, S. Relay-linking models for prominence and obsolescence in evolving networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2017), KDD '17, Association for Computing Machinery, pp. 1077–1086.
[130] Smith, B., and Linden, G. Two decades of recommender systems at Amazon.com. IEEE Internet Computing 21, 3 (2017).
[131] Spezzano, F., Suyehira, K., and Gundala, L. A. Detecting pages to protect in Wikipedia across multiple languages. Social Network Analysis and Mining 9 (2019), 1–16.
[132] Steinmacher, I., Conte, T., Gerosa, M. A., and Redmiles, D. Social barriers faced by newcomers placing their first contribution in open source software projects. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (2015), pp. 1379–1392.
[133] Susuri, A., Hamiti, M., and Dika, A. Machine learning based detection of vandalism in Wikipedia across languages. (2016), 446–451.
[134]
Tekiroglu, S. S., Chung, Y.-L., and Guerini, M. Generating counter narratives against online hate speech: Data and strategies. arXiv preprint arXiv:2004.04216 (2020).
[135] Tran, K., and Christen, P. Cross-language learning from bots and users to detect vandalism on Wikipedia. IEEE Transactions on Knowledge and Data Engineering 27 (2015), 673–685.
[136] Van Hee, C., Lefever, E., Verhoeven, B., Mennes, J., Desmet, B., Pauw, G., Daelemans, W., and Hoste, V. Detection and fine-grained classification of cyberbullying events.
[137] Verstak, A., Acharya, A., Suzuki, H., Henderson, S., Iakhiaev, M., Lin, C. C. Y., and Shetty, N. On the shoulders of giants: The growing impact of older articles, 2014.
[138] Warncke-Wang, M., Ayukaev, V. R., Hecht, B., and Terveen, L. G. The success and failure of quality improvement projects in peer production communities. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (2015), pp. 743–756.
[139] Waseem, Z., and Hovy, D. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. pp. 88–93.
[140] Wired.
[141] Wong, F. M. F., Tan, C. W., Sen, S., and Chiang, M. Quantifying political leaning from tweets, retweets, and retweeters. IEEE Transactions on Knowledge and Data Engineering 28, 8 (2016), 2158–2172.
[142] Wright, L., Ruths, D., Dillon, K. P., Saleem, H. M., and Benesch, S. Vectors for counterspeech on Twitter. In Proceedings of the First Workshop on Abusive Language Online (Vancouver, BC, Canada, Aug. 2017), Association for Computational Linguistics, pp. 57–62.
[143] Xiao, S., Yan, J., Li, C., Jin, B., Wang, X., Yang, X., Chu, S. M., and Zhu, H. On modeling and predicting individual paper citation count over time. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (2016), IJCAI'16, AAAI Press, pp. 2676–2682.
[144] Xiao, Z., Song, W., Xu, H., Ren, Z., and Sun, Y. TIMME: Twitter ideology-detection via multi-task multi-relational embedding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York, NY, USA, 2020), KDD '20, Association for Computing Machinery, pp. 2258–2268.
[145] Yang, D., Halfaker, A., Kraut, R. E., and Hovy, E. H. Who did what: Editor role identification in Wikipedia. In ICWSM (2016), pp. 446–455.
[146] Yao, S., and Huang, B. Beyond parity: Fairness objectives for collaborative filtering. In NeurIPS (2017).
[147] Yazdanian, R., Zia, L., Morgan, J., Mansurov, B., and West, R. Eliciting new Wikipedia users' interests via automatically mined questionnaires: For a warm welcome, not a cold start. In Proceedings of the International AAAI Conference on Web and Social Media (2019), vol. 13, pp. 537–547.
[148]
Yigit-Sert, S., Altingovde, I. S., and Ulusoy, O. Towards detecting media bias by utilizing user comments. In Proceedings of the 8th ACM Conference on Web Science (New York, NY, USA, 2016), WebSci '16, ACM, pp. 374–375.
[149] Yuan, S., Tang, J., Zhang, Y., Wang, Y., and Xiao, T. Modeling and predicting citation count via recurrent neural network with long short-term memory. ArXiv abs/1811.02129 (2018).
[150] Zhang, H., Ren, Y., and Kraut, R. E. Mining and predicting temporal patterns in the quality evolution of Wikipedia articles. In HICSS (2020), pp. 1–10.
[151] Zhang, S., Hu, Z., Zhang, C., and Yu, K. History-based article quality assessment on Wikipedia. In (2018), IEEE, pp. 1–8.
[152] Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457 (2017).
[153] Zhu, Z., Hu, X., and Caverlee, J. Fairness-aware tensor-based recommendation. In