Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries
Stephanie Hirmer, Alycia Leonard, Josephine Tumwesige, Costanza Conforti
Energy and Power Group, University of Oxford · Rural Senses Ltd. · Language Technology Lab, University of Cambridge
[email protected]
Abstract
Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when making modeling and system design decisions, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.
1 Introduction

The exponentially increasing popularity of supervised Machine Learning (ML) in the past decade has made the availability of data crucial to the development of the Natural Language Processing (NLP) field. As a result, much NLP research has focused on developing rigorous processes for collecting large corpora suitable for training ML systems. We observe, however, that many best practices for quality data collection make two implicit assumptions: that speakers have internet access and that they are literate (i.e. able to read and often write text effortlessly). For example, input from speakers is often taken in writing, in response to a written stimulus which must be read. Such assumptions might be reasonable in the context of most High-Income Countries (HICs) (UNESCO, 2018). However, in Low-Income Countries (LICs), and especially in sub-Saharan Africa (SSA), such assumptions may not hold, particularly in rural developing areas where the bulk of the population lives (Roser and Ortiz-Ospina (2016), Figure 1). As a consequence, common data collection techniques, designed for use in HICs, fail to capture data from a vast portion of the population when applied to LICs. Such techniques include, for example, crowdsourcing (Packham, 2016), scraping social media (Le et al., 2016) or other websites (Roy et al., 2020), collecting articles from local newspapers (Marivate et al., 2020), or interviewing experts from international organizations (Friedman et al., 2017). While these techniques are important to easily build large corpora, they implicitly rely on the above-mentioned assumptions (i.e. internet access and literacy), and might result in demographic misrepresentation (Hovy and Spruit, 2016). In this paper, we make a first step towards addressing how to build representative corpora in LICs from illiterate speakers. We believe that this is a currently unaddressed topic within NLP research.
It aligns with previous work investigating sources of bias resulting from the under-representation of specific demographic groups in NLP corpora (such as women (Hovy, 2015), youth (Hovy and Søgaard, 2015), or ethnic minorities (Groenwold et al., 2020)). In this paper, we make the following contributions: (i) we introduce the challenges of collecting data from illiterate speakers (§ 2); (ii) we identify potential biases and ethical issues that might arise in such contexts (§ 3); finally, (iii) drawing on years of experience in data collection in LICs, we outline practical countermeasures to address these issues (§ 4).

Figure 1: Literacy, urban population, and internet usage in African countries: (a) adult literacy (% ages 15+, UNESCO (2018)); (b) urban population (% of total, UN-DESA (2018)); (c) internet usage (% of total, ITU (2019)). Note that countries with more rural populations tend to have lower literacy and fewer internet users. These countries are likely to be under-represented in corpora generated using common data collection methods that assume literacy and internet access (grey: no data).
2 Related Work

In recent years, developing corpora that encompass as many human languages as possible has been recognised as important in the NLP community. In this context, widely translated texts (such as the Bible (Mueller et al., 2020) or the Human Rights declaration (King, 2015)) are often used as a source of data. However, these texts tend to be quite short and domain-specific. Moreover, while the Internet constitutes a powerful data collection tool which is more representative of real language use than the previously-mentioned texts, it excludes illiterate communities, as well as speakers who lack reliable internet access (as is often the case in rural developing settings, Figure 1).

Given the obstacles to using these common language data collection methods in LIC contexts, the NLP community can learn from methodologies adopted in other fields. Researchers from fields such as sustainable development (SD, Gleitsmann et al. (2007)), African studies (Adams, 2014), and ethnology (Skinner et al., 2013) tend to rely heavily on qualitative data from oral interviews, transcribed verbatim. Collecting such data in rural developing areas is considerably more difficult than in developed or urban contexts. In addition to high illiteracy levels, researchers face challenges such as seasonal roads and low population densities. To our knowledge, there are very few NLP works which explicitly focus on building corpora from rural and illiterate communities: of those works that exist, some present clear priming-effect issues (Abraham et al., 2020), while others focus on application (Conforti et al., 2020). A detailed description of best practices for data collection remains a notable research gap.
3 Potential Biases and Ethical Issues

Guided by research in medicine (Pannucci and Wilkins, 2010), sociology (Berk, 1983), and psychology (Gilovich et al., 2002), NLP has experienced increasing interest in ethics and bias mitigation to minimise unintentional demographic misrepresentation and harm (Hovy and Spruit, 2016). While there are many stages where bias may enter the NLP pipeline (Shah et al., 2019), we focus on those pertinent to data collection from rural illiterate communities in LICs, leaving the study of biases in model development for future work.

Biases in data collection are inevitable (Marshall, 1996) but can be minimised when known to the researcher (Trembley, 1957). We identify various biases that can emerge when collecting language data in rural developing contexts, which fall under three broad categories: sampling, observer, and response bias. Sampling determines who is studied, the interviewer (or observer) determines what information is sought and how it is interpreted, and the interviewee (or respondent) determines which information is revealed (Woodhouse, 1998). These categories span the entire data collection process and can affect the quality and quantity of language data obtained. (Note: this paper does not focus on a particular NLP application, as once the data has been collected from illiterate communities it can be annotated for virtually any specific task.)

3.2 Sampling or selection bias
Sampling bias occurs when observations are drawn from an unrepresentative subset of the population being studied (Marshall, 1996) and applied more widely. In our context, this might arise when selecting communities from which to collect language data, or specific individuals within each community. When sampling communities, bias can be introduced if convenience is prioritized. Communities which are easier to access may not produce language data representative of a larger area or group. This can be illustrated through Uganda's refugee response, which consists of 13 settlements (including the 2nd largest in the world) hosted in 12 districts (UNHCR, 2020). Data collection may be easier in one of the older, established settlements; however, such data cannot be generalised over the entire refugee response due to different cultural backgrounds, length of stay of refugees in different areas, and the varied stages along the humanitarian chain (emergency, recovery or development) found therein (Winter, 1983; OECD, 2019). Prioritizing convenience in this case may result in corpora which over-represent the cultural and economic contexts of more established, longer-term refugees. When sampling interviewees, bias can be introduced when certain sub-sets of a community have more data collected than others (Bryman, 2012). This is seen when data is collected only from men in a community due to cultural norms (Nadal, 2017), or only from wealthier people in cell-phone-based surveys (Labrique et al., 2017).
3.3 Observer bias

Observer bias occurs when there are systematic errors in how data is recorded, which may stem from observer viewpoints and predispositions (Gonsamo and D'Odorico, 2014). We identify three key observer biases relevant to our context.

Firstly, confirmation bias, which refers to the tendency to look for information which confirms one's preconceptions or hypotheses (Nickerson, 1998). Researchers collecting data in LICs may expect interviewees to express needs or hardships based on their preconceptions. As Kumar (1987) points out, "often they hear what they want to hear and ignore what they do not want to hear". A team conducting a needs assessment for a rural electrification project, for instance, may expect a need for electricity, and thus consciously or subconsciously seek data which confirms this, interpret potentially unrelated data as electricity-motivated (Hirmer and Guthrie, 2017), or omit data which contradicts their hypothesis (Peters, 2020). Using such data to train NLP models may introduce unintentional bias towards the original expectations of the researchers instead of accurately representing the community.

Secondly, the interviewer's understanding and interpretation of the speaker's utterances might be influenced by their class, culture and language. Note that, particularly in countries without strong language standardisation policies, consistent semantic shifts can happen even between varieties spoken in neighboring regions (Gordon, 2019), which may result in systematic misunderstanding (Sayer, 2013). For example, in the neighboring Ugandan tribes of Toro and Bunyoro, the same word omunyoro means, respectively, husband and a member of the tribe. Language data collected in such contexts, if not properly handled, may contain inaccuracies which lead to NLP models that misrepresent these tribes. Rich information communicated through gesture, expression, and tone (i.e. nonverbal data, Oliver et al.
(2005)) may also be systematically lost during verbatim transcription, causing inadvertent inconsistencies in the corpora.

Thirdly, interviewer bias, which refers to the subjectivity unconsciously introduced into data gathering by the worldview of the interviewer (Frey, 2018). For instance, a deeply religious interviewer may unintentionally frame questions through religious language (e.g. it is God's will, thank God, etc.), or may perceive certain emotions (e.g. thankfulness) as inherently religious, and record language data including this perception. The researcher's attitude and behaviour may also influence responses (Silverman, 2013); for instance, when interviewers take longer to deliver questions, interviewees tend to provide longer responses (Matarazzo et al., 1963). Unlike in internet-based language data collection, where all speakers are exposed to uniform, text-based interfaces, collecting data from illiterate communities necessitates the presence of an interviewer, who cannot always be the same person due to scalability constraints, introducing this inevitable variability and subsequent data bias.

3.4 Response bias

Response bias occurs when speakers provide inaccurate or false responses to questions. This is particularly important when working in rural settings, where the majority of data collection is currently related to SD projects. The majority of existing data is biased by the projects for which it has been collected, and any newly collected data for NLP uses is also likely to be used in decision making for SD. This inherent link of data collection to material development outcomes inevitably affects what is communicated. There are five key response biases relevant to our context.

Firstly, recall bias, where speakers recall only certain events or omit details (Coughlin, 1990). This is often a result of external influences, such as the presence of a data collector who is new to the community.
Recall can also be affected by the distortion or amplification of traumatic memories (Strange and Takarangi, 2015); if data is collected around a topic a speaker may find traumatic, recall bias may be unintentionally introduced.

Secondly, social desirability bias, which refers to the tendency of interviewees to provide socially desirable/acceptable responses rather than honest responses, particularly in certain interview contexts (Bergen and Labonté, 2020). In tight-knit rural communities, it may be difficult to deviate from traditional social norms, leading to biased data. As an illustrative example, researchers in Nepal found that interviewer gender affected the detail in responses to some sensitive questions (e.g. sex and contraception): participants provided less detail to male interviewers (Axinn, 1991). Social desirability bias can produce corpora which misrepresent community social dynamics or under-represent sensitive topics.

Thirdly, recency effect (or serial-position effect), which is the tendency of a person to recall the first and last items in a series best, and the middle items worst (Troyer, 2011). This can greatly impact the content of language data. For instance, in the context of data collection to guide development work, it is important to understand current needs and values (Hirmer and Guthrie, 2016); however, if only the most recent needs are discussed, long-term needs may be overlooked. To illustrate, while a community which has just experienced a poor agricultural season may tend to express the importance of improving agricultural output, other needs which are less top-of-mind (e.g. healthcare, education) may be equally important despite being expressed less frequently. If data containing recency bias is used to develop NLP models, particularly for sustainable development applications (such as for Automatic UPV Classification, Conforti et al. (2020)), these may amplify current needs and under-represent long-term needs.

Fourthly, acquiescence bias, also known as "yea"-saying (Laajaj and Macours, 2017), which can occur in rural developing contexts when interviewees perceive that certain (possibly false) responses will please a data collector and bring benefits to their community. For example, a group undertaking data collection with a stated desire to build a school may be more likely to hear about how much education is valued.

Finally, priming effect, or the ability of a presented stimulus to influence one's response to a subsequent stimulus (Lavrakas, 2008). Priming is problematic in data collection to inform SD projects; it can be difficult to collect data on the relative importance of simultaneous (or conflicting) needs if the community is primed to focus on one (Veltkamp et al., 2011). An example is shown in Figure 2a; respondents may be drawn to speak more about the most dominant prompts presented in the chart. This is typical of a broader failure in SD to uncover beneficiary priorities without introducing project bias (Watkins et al., 2012). Needs assessments, like the one referenced above linked to a rural electrification project, tend to focus explicitly on project-related needs instead of more broadly identifying what may be most important to communities (Masangwi, 2015; USAID, 2006). As speakers will usually know why data is being collected in such cases, they may be biased towards stating the project aim as a need, thereby skewing the corpora to over-represent this aim.
3.5 Ethical considerations

Certain ethical codes of conduct must be followed when collecting data from illiterate speakers in rural communities in LICs (Musoke et al., 2020). Unethical data collection may harm communities, treat them without dignity, disrupt their lives, damage intra-community or external relationships, and disregard community norms (Thorley and Henrion, 2019). This is particularly critical in rural developing regions, as these areas are home to some of the world's poorest people, who are among the most vulnerable to exploitation (Christiaensen and Subbarao, 2005; de Cenival, 2008). Unethical data collection can replicate extractive colonial relationships whereby data is extracted from communities with no mutual benefit or ownership (Dunbar and Scrimgeour, 2006). It can lead to a lack of trust between data collector and interviewees, and unwillingness to participate in future research (Clark, 2008). These phenomena can bias data or reduce data availability. Ethical data collection practices in rural developing regions with high illiteracy include: obtaining consent (McAdam, 2004), accounting for cultural differences (Silverman, 2013), ensuring anonymity and confidentiality (Bryman, 2012), respecting existing community or leadership structures (Harding et al., 2012), and making the community the owner of the data. While the latter is not often currently practiced, it is an important consideration for community empowerment, with indigenous data sovereignty efforts (Rainie et al., 2019) already setting precedent.
4 Mitigation Strategies

Drawing on existing literature and years of field experience collecting spoken data in LICs, below we outline a number of practical data collection strategies to minimise the previously-outlined challenges (§ 3).

4.1 Preparation

Here, we outline practical preparation steps for careful planning, which can minimise error and reduce fieldwork duration (Tukey, 1980).
Local Context. A thorough understanding of local context is key to successful data collection (Hentschel, 1999; Bukenya et al., 2012; Launiala and Kulmala, 2006). Local context is broadly defined as facts, concepts, beliefs, values, and perceptions used by local people to interpret the world around them, and is shaped by their surroundings (i.e. their worldview, Vasconcellos and Vasconcellos Sobrinho (2014)). It is important to consider local context when preparing to collect data in rural developing areas, as common data collection methods may be inappropriate due to contextual linguistic differences and deep-rooted social and cultural norms (Walker and Hamilton, 2011; Mafuta et al., 2016; Nikulina et al., 2019; Wang et al.). Selecting a contextually-appropriate data collection method is critical in mitigating social desirability bias in the collected data, among other challenges. Researchers should review socio-economic surveys and/or consult local stakeholders who can offer valuable insights on practices and social norms. These stakeholders can also highlight current or historical matters of concern to the area, which may be unfamiliar to researchers, and reveal local, traditional, and indigenous knowledge which may impact the data being collected (Wu, 2014) and result in recency effect. It is good practice to identify local conflicts and segmentation within a community, especially in a rural context, where the population is vulnerable and systematically unheard (Dudwick et al., 2006; Mallick et al., 2011).
Case sampling. In qualitative research, sample cases are often strategically selected based on the research question (i.e. systematic or purposive sampling, Bryman (2012)) and on characteristics or circumstances relevant to the topic of study (Yach, 1992). If data collected in such research is used beyond its original scope, sampling bias may result. So, while data collected in previous research should be re-used to expand NLP corpora where possible, it is important to be cognizant of the purposive sampling underlying existing data. A comprehensive dataset characterisation (Bender and Friedman, 2018; Gebru et al., 2018) can help researchers understand whether an existing dataset is appropriate to use in new or different research, such as in training new NLP models, and can highlight the potential ethical concerns of data re-use.

Participant sampling. Interviewees should be selected to represent the diverse interests of a community or sampling group (e.g. occupation, age, gender, religion, ethnicity, or male/female household heads (Bryman, 2012)) to reduce sampling bias (Kitzinger, 1994). To ensure representativeness in collected data, sampling should be random, i.e. every subject should have an equal probability of being included (Etikan et al., 2016). There may be certain societal subsets that are concealed from view (e.g. as a result of embarrassment about disabilities or physical differences) based on cultural norms in less inclusive societies (Vesper, 2019); particular care should be exercised to ensure such subsets are represented.
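Where a sampling frame (e.g. a household roster) is available, the equal-probability selection described above can be sketched in a few lines. The following is a minimal illustration, not a procedure from this paper: the roster, the gender strata, and the per-stratum quota are all invented for the example.

```python
import random

# Hypothetical sampling frame: (household_id, gender, occupation) records.
frame = [
    ("hh01", "female", "farmer"), ("hh02", "male", "farmer"),
    ("hh03", "female", "trader"), ("hh04", "male", "fisher"),
    ("hh05", "female", "farmer"), ("hh06", "male", "trader"),
]

def stratified_sample(frame, key, per_stratum, seed=0):
    """Draw an equal-probability sample within each stratum, so that each
    sub-group (here: gender) is represented, reducing sampling bias."""
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    strata = {}
    for record in frame:
        strata.setdefault(key(record), []).append(record)
    sample = []
    for group in strata.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

participants = stratified_sample(frame, key=lambda r: r[1], per_stratum=2)
# Every stratum contributes (up to) the same number of interviewees.
```

Stratifying before sampling is one simple way to guarantee that concealed or minority subsets are not dropped by chance in small samples; plain random sampling over the whole frame is the limiting case with a single stratum.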
Group composition. Participant sampling best practices vary by data collection method, with particular care being necessary in group settings. In traditional societies where strong power dynamics exist, attention should be paid to group composition and interaction to prevent some voices from being silenced or over-represented (Stewart et al., 2007). For example, in Uganda, female interviewees may be less likely to voice opinions in the presence of male interviewees (FIDH, 2012; Axinn, 1991), introducing a form of social desirability bias in resulting corpora. To minimise this risk of data bias, relations and power dynamics must be considered during data collection planning (Hirmer, 2018). It may be necessary to exclude, for instance, close relatives, governmental officials, and village leaders from group discussions where data is being collected, and instead engage such stakeholders in separate activities to ensure that their voices are included in the corpora without biasing the data collected from others.

Table 1: Sources of potential bias in data collection when operating in rural and illiterate settings in developing countries, and key countermeasures that can help mitigate them.

Sampling
- Community: An unrepresentative sample set is generalised over the entire case being studied. Countermeasures: select representative communities & only apply data within the same scope (i.e. consult data statements).
- Participant: Certain sub-sets of a community have more data collected from them than others. Countermeasures: select representative participants, only apply data within the same scope & avoid tempting rewards.

Observer
- Confirmation: Looking for information that confirms one's preconceptions or hypotheses about a topic/research/sector. Countermeasures: employ interviewers that are impartial to the topic/research/sector investigated.
- Misunderstanding: Data is incorrectly transcribed or categorised as a result of class, cultural, or linguistic differences. Countermeasures: employ local people & minimise layers of linguistic interpretation.
- Interviewer: Unconscious subjectivity introduced into data gathering by interviewers' worldview. Countermeasures: undertake training to minimise influence exerted from questions, technology, & attitudes.

Response
- Recall: Tendency of speakers to recall only certain events or omit details. Countermeasures: collect support data (e.g. from socio-economic data or local stakeholders) to compare with interviews.
- Social-desirability: Tendency of participants to provide socially desirable/acceptable responses rather than to respond honestly. Countermeasures: select interviewers & design interview processes to account for known norms which might skew responses.
- Recency effect: Tendency to recall first or last items in a series best, & middle items worst. Countermeasures: minimise external influence on participants throughout data gathering (e.g. technologies, people, perceptions).
- Acquiescence: Respondents perceive certain, perhaps false, answers may please data collectors, bringing community benefits. Countermeasures: gather non-sectoral holistic insights (e.g. from socio-economic data or local stakeholders).
- Priming effect: Ability of a presented stimulus to influence one's response to a subsequent stimulus. Countermeasures: use appropriate visual prompts (graphically similar), language and technology.
Interviewer selection. The interviewer has a significant opportunity to introduce observer and response biases in collected data (Salazar, 1990). Interviewers familiar with the local language, including community-specific dialects, should be selected wherever possible. Moreover, to reduce misunderstanding and recall biases in collected data, it is useful to have the same person who conducts the interviews also transcribe them. This minimizes the layers of linguistic interpretation affecting the final dataset and can increase accuracy through familiarity with the interview content. If the interviewer is unavailable, the transcriber must be properly trained and briefed on the interviews, and made aware of the level of detail needed during transcription (Parcell and Rafferty, 2017).
Study design. In rural LIC communities, qualitative data like natural language is usually collected by observation, interview, and/or focus group discussion (or a combination, known as mixed methods), which are transcribed verbatim (Moser and Korstjens, 2018). Prompts are often used to spark discussion. Whether visual prompts (Hirmer, 2018) or verbalised question prompts are used during data collection, these should be designed to: (i) accommodate illiteracy; (ii) account for disabilities (e.g. visual impairment; both could cause sampling bias); and (iii) minimise bias towards a topic or sector (e.g. minimising acquiescence bias and confirmation bias). For instance, visual prompts should be graphically similar and contain only visuals familiar to the respondents. This is analogous to the uniform interface with which speakers interact during text-based online data collection, where the platform used is graphically the same for all users inputting data. Using varied graphical styles or unfamiliar images may result in priming (Figure 2a). To minimise recall bias or recency effect in collected data, socio-economic data can be integrated in data analysis, for example to better understand whether the assertions made in collected data reference recent events. Such data should be non-sector-specific, to gain holistic insights and to minimise acquiescence bias and confirmation bias.

4.2 Engagement

Here, we outline practical steps for successful community engagement to achieve ethical and high-quality data collection.
Defining community. Defining a community in an open and participatory manner is critical to meaningful engagement (Dyer et al., 2014). By understanding the community the way they understand themselves, misunderstandings and tensions that affect data quality can be minimized. The definition of the community (MacQueen et al., 2001), coupled with the requirements and use-cases for the collected data, determines the data collection methodology and style which will be most appropriate (e.g. interview-based community consultation vs. collaborative co-design for mutual learning).
Follow formal structures. Researchers entering a community where they have no background to collect data should endeavour to know the community prior to commencing any work (Diallo et al., 2005). This could entail visiting the community and mapping its hierarchies of authority and decision-making pathways, which can guide the research team on how to interact respectfully with the community (Tindana et al., 2011). This process should also illuminate whether knowledgeable community members should facilitate entry by performing introductions and assisting the external data collection team. Following formal community structures is vital, especially in developing communities, where traditional rules and social conventions are strongly held yet often not articulated explicitly or documented. Approaching community leaders in the traditional way can help to build a positive long-term relationship, removing suspicion about the nature and motivation of the researchers' activities, explaining their presence in the community, and most importantly building trust as they are granted permission to engage the community by its leadership (Tindana et al., 2007).
Verbalising consent. Data ethics is paramount for research involving human participants (Accenture, 2016; Tindana et al., 2007), including any collection of personal and identifiable data, such as natural language. Genuine (i.e. voluntary and informed) consent must be obtained from interviewees to prevent use of data which is illegal, coercive, or for a purpose other than that which has been agreed (McAdam, 2004). The Nuffield Council on Bioethics (2002) caution that in LICs, misunderstandings may occur due to cultural differences, lower socio-economic status, and illiteracy (McMillan et al., 2004), which can call into question the legitimacy of consent obtained. Researchers must understand that methods such as long information forms and consent forms which must be signed may be inappropriate for the cultural context of LICs and can be more likely to confuse than to inform (Tekola et al., 2009). The authors advise that consent should be verbal instead of written, with wording familiar to the interviewees and appropriate to their level of comprehension (Tekola et al., 2009). For example, to speak of data storage on a password-protected computer while obtaining consent in a rural community without access to electricity or information technology is unfitting. Innovative ways to record consent can be employed in such contexts (e.g. video taping or recording), as signing an official document may be "viewed with suspicion or even outright hostility" (Upjohn and Wells, 2016), or seen as "committing ... to something other than answering questions". Researchers new to qualitative data collection should seek advice from experienced researchers and approval from their ethics committee before implementing consent processes.
Approaching participants. Despite having gained permission from community authorities and obtained consent to collect data, researchers must be cautious when approaching participants (Irabor and Omonzejele, 2009; Diallo et al., 2005) to ensure they do not violate cultural norms. For example, in some cultures a senior family member must be present for another household member to be interviewed, or a female must be accompanied by a male counterpart during data collection. Insensitivity to such norms may compromise the data collection process; so, they should be carefully noted when researching local context (see Local Context, above).

Minimise external influence.
Researchers mustbe aware of how external influences can affect datacollection (Ramakrishnan et al., 2012). We findthree main levels of external influence: (i) tech-nologies unfamiliar to a rural developing countrycontext may induce social desirability bias or prim-ing (e.g. if a researcher arrives to a community inn expensive vehicle or uses a tablet for data col-lection); (ii) intergroup context, which accordingto Abrams (2010) refers to when “people in differ-ent social groups view members of other groups”and may feel prejudiced or threatened by thesedifferences. This can occur, for instance, when anewcomer arrives and speaks loudly relative to theindigenous community, which may be perceived asoverpowering; (iii) there is the risk of a researcherover-incentivizing the data collection process, us-ing leading questions and judgemental framing ( in-terviewer bias or confirmation bias ). To overcomethese influences, researchers must be cognizantof their influence and minimise it by hiring localmediators where possible alongside employing ap-propriate technology, mannerisms, and language. Here, we detail practical steps to minimise chal-lenges during the actual data collection.
Interview settings. People have personal values and drivers that may change in specific settings. For example, in the Ugandan Buganda and Busoga tribes, it is culturally appropriate for the male head, if present, to speak on behalf of his wife and children. This could lead to corpora where input from the husband is over-represented compared to the rest of the family. To account for this, it is important to collect data in multiple interview settings (e.g. individual, group male/female/mixed; Figures 2b, 2c). Additionally, the inputs of individuals in group settings should be considered independently to ensure all participants have an equal say, regardless of their position within the group (Barry et al., 2008; Gallagher et al., 1993). This helps to avoid social desirability bias in the data and is particularly important in various developing contexts where stereotypical gender roles are prominent (Hirmer, 2018). During interviews, verbal information can be supplemented through the observation of tone, cadence, gestures, and facial expressions (Narayanasamy, 2009; Hess et al., 2009), which could enrich the collected data with an additional layer of annotation.
Working with multiple interviewers. Arguably, one of the biggest challenges in data collection is ensuring consistency when working with multiple interviewers. Some may report word-for-word what is being said, while others may summarise or misreport, resulting in systematic misunderstanding. Despite these risks, employing multiple interviewers is often unavoidable when collecting data in rural areas of developing countries, where languages often exhibit a high number of regional, non-mutually intelligible varieties. This is particularly prominent across SSA. For example, 41 languages are spoken in Uganda (Nakayiza, 2016); English, the official language, is fluently spoken by only ∼5% of the population, despite being widely used among researchers and NGOs (Katushemererwe and Nerbonne, 2015). To minimise data inconsistency, researchers should: (i) undertake interviewer training workshops to communicate data requirements and practice data collection processes through mock field interviews; (ii) pilot the data collection process and seek feedback to spot early deviation from data requirements; (iii) regularly spot-check interview notes; (iv) support written notes with audio recordings (relying only on audio recording is risky: equipment can fail or run out of battery, which is not easily remedied in rural off-grid regions, and seasonal factors, such as noise from rain on the corrugated iron sheets commonly used for roofing in SSA, can make recordings inaudible (Hirmer, 2018)); and (v) offer quality-based incentives to data collectors.

Participant remuneration. While it is common to offer interviewees some form of remuneration for their time, the decision surrounding payment is ethically charged and widely contested (Hammett and Sporton, 2012). Rewards may tempt people to participate in data collection against their judgement. They can introduce sampling bias or create power dynamics resulting in acquiescence bias (Largent and Lynch, 2017). Barbour (2013) offers three practical solutions: (i) do not advertise payment; (ii) omit the amount being offered; or (iii) offer non-financial incentives (e.g. products that are desirable but difficult to obtain in an area). The decision whether or not to remunerate should not be based upon the researcher's own ethical beliefs and resources, but instead made by considering the specific context, interviewee expectations, precedents set by previous researchers, and local norms (Hammett and Sporton, 2012). Representatives from local organisations (such as NGOs or governmental authorities) may be able to offer advice.

(While participants granted permission to be photographed, the photos in Figure 2 were pixelised to protect identity.)
In rural Uganda, for example, politicians commonly engage in vote buying by distributing gifts such as soap or alcohol (Blattman et al., 2019). Offering such items as remuneration may therefore carry unwanted connotations, which can only be avoided if these local precedents are known.
Figure 2: Collecting oral data in rural Uganda. Figure 2a shows a priming effect (note the word “Energy” in the poster's title and the visual differences between prompt items). By contrast, Figures 2b and 2c show minimal priming; note also that different demographics (a women's group, single men) are interviewed separately to avoid social desirability bias.

Here, we discuss practical strategies to mitigate ethical issues surrounding the management and stewardship of collected data.
Anonymisation. To protect participants' identity and data privacy, locations, proper names, and culturally explicit aspects (such as tribe names) in collected data should be anonymised (Sweeney, 2000; Kirilova and Karcher, 2017). This is particularly important in countries with security issues and low levels of democracy.
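One way to operationalise this is consistent pseudonymisation: every mention of a sensitive term is replaced by the same stable placeholder, so the anonymised corpus remains internally coherent. The sketch below is a minimal illustration under stated assumptions: the term lists and names are hypothetical, and a real pipeline would combine named-entity recognition with manual review by someone familiar with the local context.

```python
import re
from collections import defaultdict

# Hypothetical lists of sensitive terms, for illustration only.
SENSITIVE = {
    "PERSON": ["Nakato", "Okello"],
    "LOCATION": ["Kampala", "Gulu"],
    "TRIBE": ["Buganda", "Busoga"],
}

def build_table():
    """Map every sensitive term to a stable placeholder, e.g. PERSON-1."""
    table = {}
    counters = defaultdict(int)
    for category, terms in SENSITIVE.items():
        for term in terms:
            counters[category] += 1
            table[term] = f"{category}-{counters[category]}"
    return table

def anonymise(text, table):
    """Replace whole-word mentions so repeated references stay consistent."""
    for term, placeholder in table.items():
        text = re.sub(rf"\b{re.escape(term)}\b", placeholder, text)
    return text

table = build_table()
print(anonymise("Nakato moved from Gulu to Kampala.", table))
# PERSON-1 moved from LOCATION-2 to LOCATION-1.
```

Because the placeholder table is built once and reused, cross-references between documents (e.g. the same village named in two interviews) survive anonymisation, which matters for downstream corpus analysis.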
Safeguarding data. A primary responsibility of the researcher is to safeguard participants' data (Kirilova and Karcher, 2017). In addition to anonymising data, mechanisms for data management include in-place handling and storage of data (UKRI, 2020a). Whatever data management plan is adopted, it must be clearly articulated to participants before the start of the interview, i.e. as part of the consent process (Silverman, 2013), as discussed in § Verbalising consent.

Withdrawing consent. Participants should have the ability to withdraw from research within a specified time frame. This is known as withdrawal of consent and is commonly done by phone or email (UKRI, 2020b). As people in rural illiterate communities have limited means and technology access, a local phone number and the contact details of a responsible person in the area should be provided to facilitate consent withdrawal.
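On the corpus side, honoring a withdrawal means purging every record tied to the withdrawing participant. A minimal sketch, assuming records are keyed by the pseudonymous speaker ID assigned at collection time (the IDs and records below are hypothetical):

```python
def withdraw(corpus, participant_id):
    """Return the corpus with every record from the withdrawing participant
    removed. Derived artefacts (audio files, annotations, backups) would
    need the same sweep in a real data management plan."""
    return [record for record in corpus if record["speaker_id"] != participant_id]

corpus = [
    {"speaker_id": "p01", "text": "first response"},
    {"speaker_id": "p02", "text": "second response"},
    {"speaker_id": "p01", "text": "follow-up"},
]
corpus = withdraw(corpus, "p01")
print(len(corpus))  # 1
```

Keeping a single pseudonymous key per participant, rather than scattering identifying details across records, is what makes such a sweep tractable.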
Communication and research fatigue. While researchers frequently extract knowledge and data from communities, only rarely are findings fed back to communities in a way that can be useful to them. Whatever the research outcomes, researchers should share the results with participating communities in an appropriate manner. In illiterate communities, for instance, murals (Jimenez, 2020), artwork, speeches, or song could be used to communicate findings. Not communicating findings may result in research fatigue, as people in over-studied communities are no longer willing to participate in data collection. This is common “where repeated engagements do not lead to any experience of change [...]” (Clark, 2008). Patel et al. (2020) offer practical guidance to minimise research fatigue by: (i) increasing transparency of research purpose at the beginning of the research, and (ii) engaging with gatekeeper or oversight bodies to minimise the number of engagements per participant. Failure to restrict the number of times that people are asked to participate in studies risks poor future participation (Patel et al., 2020), which can also lead to sampling bias.

In this paper, we provided a first step towards defining best practices for data collection in rural and illiterate communities in Low-Income Countries to create globally representative corpora. We proposed a comprehensive classification of sources of bias and unethical practices that might arise in the data collection process, and discussed practical steps to minimise their negative effects. We hope that this work will motivate NLP practitioners to include input from rural illiterate communities in their research, and facilitate smooth and respectful interaction with communities during data collection. Importantly, despite the challenges that working in such contexts might bring, the effort to build substantial and high-quality corpora which represent this subset of the population can result in considerable sustainable development outcomes.

Acknowledgments
We thank the anonymous reviewers for their constructive feedback. We are also grateful to Claire McAlpine, as well as Malcolm McCulloch and other members of the Energy and Power Group (University of Oxford), for providing valuable feedback on early versions of this paper. This research was carried out as part of the Oxford Martin Programme on Integrating Renewable Energy. Finally, we are grateful to the Rural Senses team for sharing experiences on data collection.
References
Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyoti, Sunayana Sitaram, and Vivek Seshadri. 2020. Crowdsourcing speech data for low-resource languages from low-income workers. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2819–2826, Marseille, France. European Language Resources Association.
Dominic Abrams. 2010. Processes of prejudice: Theory, evidence and intervention. Equalities and Human Rights Commission.
Accenture. 2016. Building digital trust: The role of data ethics in the digital age. Accenture Labs.
Glenn Adams. 2014. Decolonizing methods: African studies and qualitative research. Journal of Social and Personal Relationships, 31(4):467–474.
William G. Axinn. 1991. The influence of interviewer sex on responses to sensitive questions in Nepal. Social Science Research, 20(3):303–318.
Rosaline Barbour. 2013. Introducing Qualitative Research: A Student's Guide. Sage.
Marie-Louise Barry, Herman Steyn, and Alan Brent. 2008. Determining the most important factors for sustainable energy technology selection in Africa: Application of the focus group technique. In PICMET '08 - 2008 Portland International Conference on Management of Engineering & Technology, pages 181–187. IEEE.
Emily M. Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.
Nicole Bergen and Ronald Labonté. 2020. “Everything is perfect, and we have no problems”: Detecting and limiting social desirability bias in qualitative research. Qualitative Health Research, 30(5):783–792.
Richard A. Berk. 1983. An introduction to sample selection bias in sociological data. American Sociological Review, pages 386–398.
The Nuffield Council on Bioethics. 2002. The ethics of research related to healthcare in developing countries. The Nuffield Council on Bioethics is funded jointly by the Medical Research Council, the Nuffield Foundation and the Wellcome Trust.
Christopher Blattman, Horacio Larreguy, Benjamin Marx, and Otis R. Reid. 2019. Eat widely, vote wisely? Lessons from a campaign against vote buying in Uganda. Technical report, National Bureau of Economic Research.
Alan Bryman. 2012. Mixed methods research: combining quantitative and qualitative research. In Alan Bryman, editor, Social Research Methods, fourth edition, chapter 27, pages 628–652. Oxford University Press, New York.
Badru Bukenya, Sam Hickey, and Sophie King. 2012. Understanding the role of context in shaping social accountability interventions: towards an evidence-based approach. Manchester: Institute for Development Policy and Management, University of Manchester.
M. de Cenival. 2008. Ethics of research: the freedom to withdraw. Bulletin de la Société de Pathologie Exotique, 101(2):98–101.
Luc J. Christiaensen and Kalanidhi Subbarao. 2005. Towards an understanding of household vulnerability in rural Kenya. Journal of African Economies, 14(4):520–558.
Tom Clark. 2008. 'We're over-researched here!' Exploring accounts of research fatigue within qualitative research engagements. Sociology, 42(5):953–970.
Costanza Conforti, Stephanie Hirmer, Dai Morgan, Marco Basaldella, and Yau Ben Or. 2020. Natural language processing for achieving sustainable development: the case of neural labelling to enhance community profiling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8427–8444, Online. Association for Computational Linguistics.
Steven S. Coughlin. 1990. Recall bias in epidemiologic studies. Journal of Clinical Epidemiology, 43(1):87–91.
D. A. Diallo, O. K. Doumbo, C. V. Plowe, T. E. Wellems, E. J. Emanuel, and S. A. Hurst. 2005. Community permission for medical research in developing countries. Infectious Diseases Society of America, 41(2):255–259.
Nora Dudwick, Kathleen Kuehnast, Veronica N. Jones, and Michael Woolcock. 2006. Analyzing social capital in context: A guide to using qualitative methods and data.
Terry Dunbar and Margaret Scrimgeour. 2006. Ethics in indigenous research: connecting with community. Journal of Bioethical Inquiry, 3(3):179–185.
J. Dyer, L. C. Stringer, A. J. Dougill, J. Leventon, M. Nshimbi, F. Chama, A. Kafwifwi, J. I. Muledi, J.-M. K. Kaumbu, M. Falcao, S. Muhorro, F. Munyemba, G. M. Kalaba, and S. Syampungani. 2014. Assessing participatory practices in community-based natural resource management: Experiences in community engagement from southern Africa. Journal of Environmental Management, 137:137–145.
Ilker Etikan, Sulaiman Abubakar Musa, and Rukayya Sunusi Alkassim. 2016. Comparison of convenience sampling and purposive sampling. American Journal of Theoretical and Applied Statistics, 5(1):1–4.
FIDH. 2012. Women's rights in Uganda: gaps between policy and practice. Technical report, International Federation for Human Rights, Paris.
Bruce B. Frey. 2018. The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation. Sage Publications.
Batya Friedman, Lisa P. Nathan, and Daisy Yoo. 2017. Multi-lifespan information system design in support of transitional justice: Evolving situated design principles for the long(er) term. Interacting with Computers, 29(1):80–96.
Morris Gallagher, Tim Hares, John Spencer, Colin Bradshaw, and Ian Webb. 1993. The nominal group technique: a research tool for general practice? Family Practice, 10(1):76–81.
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010.
Thomas Gilovich, Dale Griffin, and Daniel Kahneman. 2002. Heuristics and Biases: The Psychology of Intuitive Judgment. Cambridge University Press.
Brett A. Gleitsmann, Margaret M. Kroma, and Tammo Steenhuis. 2007. Analysis of a rural water supply project in three communities in Mali: Participation and sustainability. In Natural Resources Forum, volume 31, pages 142–150. Wiley Online Library.
Alemu Gonsamo and Petra D'Odorico. 2014. Citizen science: best practices to remove observer bias in trend analysis. International Journal of Biometeorology, 58(10):2159–2163.
Matthew J. Gordon. 2019. Language variation and change in rural communities. Annual Review of Linguistics, 5:435–453.
Sophie Groenwold, Lily Ou, Aesha Parekh, Samhita Honnavalli, Sharon Levy, Diba Mirza, and William Yang Wang. 2020. Investigating African-American Vernacular English in transformer-based text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5877–5883, Online. Association for Computational Linguistics.
Daniel Hammett and Deborah Sporton. 2012. Paying for interviews? Negotiating ethics, power and expectation. Area, 44(4):496–502.
Anna Harding, Barbara Harper, Dave Stone, Catherine O'Neill, Patricia Berger, Stuart Harris, and Jamie Donatuto. 2012. Conducting research with tribal communities: Sovereignty, ethics, and data-sharing issues. Environmental Health Perspectives, 120(1):6–10.
J. Hentschel. 1999. Contextuality and data collection methods: A framework and application to health service utilisation. Journal of Development Studies, 35(4):64–94.
Ursula Hess, R. B. Adams Jr., and Robert E. Kleck. 2009. Intergroup misunderstandings in emotion communication. Intergroup Misunderstandings: Impact of Divergent Social Realities, pages 85–100.
Stephanie Hirmer. 2018. Improving the Sustainability of Rural Electrification Schemes: Capturing Value for Rural Communities in Uganda. Ph.D. thesis, University of Cambridge.
Stephanie Hirmer and Peter Guthrie. 2016. Identifying the needs of communities in rural Uganda: A method for determining the 'user-perceived value' of rural electrification initiatives. Renewable and Sustainable Energy Reviews, 66:476–486.
Stephanie Hirmer and Peter Guthrie. 2017. The benefits of energy appliances in the off-grid energy sector based on seven off-grid initiatives in rural Uganda. Renewable and Sustainable Energy Reviews, 79:924–934.
Dirk Hovy. 2015. Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 752–762, Beijing, China. Association for Computational Linguistics.
Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 483–488, Beijing, China. Association for Computational Linguistics.
Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany. Association for Computational Linguistics.
O. Irabor and P. Omonzejele. 2009. Local attitudes, moral obligation, customary obedience and other cultural practices: their influence on the process of gaining informed consent for surgery in a tertiary institution in a developing country. Developing World Bioethics, 9(1):34–42.
ITU. 2019. International Telecommunication Union World Telecommunication/ICT Indicators Database. World Bank Open Data.
Stephany Jimenez. 2020. Creatively Communicating through Visual and Verbal Art: Poetry and Murals. Yale National Initiative.
Fridah Katushemererwe and John Nerbonne. 2015. Computer-assisted language learning (CALL) in support of (re-)learning native languages: the case of Runyakitara. Computer Assisted Language Learning, 28(2):112–129.
Benjamin Philip King. 2015. Practical Natural Language Processing for Low-Resource Languages. Ph.D. thesis, University of Michigan.
Dessi Kirilova and Sebastian Karcher. 2017. Rethinking data sharing and human participant protection in social science research: Applications from the qualitative realm. Data Science Journal, 16.
Jenny Kitzinger. 1994. The methodology of focus groups: the importance of interaction between research participants. Sociology of Health and Illness, 16(1):103–121.
Krishna Kumar. 1987. Conducting Group Interviews in Developing Countries. US Agency for International Development, Washington, DC.
Rachid Laajaj and Karen Macours. 2017. Measuring Skills in Developing Countries. The World Bank.
Alain Labrique, Emily Blynn, Saifuddin Ahmed, Dustin Gibson, George Pariyo, and Adnan A. Hyder. 2017. Health surveys using mobile phones in developing countries: automated active strata monitoring and other statistical considerations for improving precision and reducing biases. Journal of Medical Internet Research, 19(5):e121.
Emily A. Largent and Holly Fernandez Lynch. 2017. Paying research participants: regulatory uncertainty, conceptual confusion, and a path forward. Yale Journal of Health Policy, Law, and Ethics, 17(1):61.
A. Launiala and T. Kulmala. 2006. The importance of understanding the local context: Women's perceptions and knowledge concerning malaria in pregnancy in rural Malawi. Acta Tropica, 98(2):111–117.
Paul J. Lavrakas. 2008. Encyclopedia of Survey Research Methods. Sage Publications.
Tuan Anh Le, David Moeljadi, Yasuhide Miura, and Tomoko Ohkuma. 2016. Sentiment analysis for low resource languages: A study on informal Indonesian tweets. In Proceedings of the 12th Workshop on Asian Language Resources (ALR12), pages 123–131.
K. M. MacQueen, E. McLellan, D. S. Metzger, S. Kegeles, R. P. Strauss, et al. 2001. What is community? An evidence-based definition for participatory public health. American Journal of Public Health, 91:1929–1938.
Eric M. Mafuta, Lisanne Hogema, Thérèse N. M. Mambu, Pontien B. Kiyimbi, Berthys P. Indebe, Patrick K. Kayembe, Tjard De Cock Buning, and Marjolein A. Dieleman. 2016. Understanding the local context and its possible influences on shaping, implementing and running social accountability initiatives for maternal health services in rural Democratic Republic of the Congo: a contextual factor analysis. Science Advances, 16(1):1–13.
B. Mallick, K. Rubayet Rahaman, and J. Vogt. 2011. Social vulnerability analysis for sustainable disaster mitigation planning in coastal Bangladesh. Disaster Prevention and Management, 20(3):220–237.
Vukosi Marivate, Tshephisho Sefara, Vongani Chabalala, Keamogetswe Makhaya, Tumisho Mokgonyane, Rethabile Mokoena, and Abiodun Modupe. 2020. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. arXiv preprint arXiv:2003.04986.
M. N. Marshall. 1996. Sampling for qualitative research. Family Practice, 13(6):522–525.
Salule J. Masangwi. 2015. Methodology for Solar PV Needs Assessment in Chikwawa, Southern Malawi. Technical report, Malawi Renewable Energy Acceleration Programme (MREAP): Renewable Energy Capacity Building Programme (RECBP).
Joseph D. Matarazzo, Morris Weitman, George Saslow, and Arthur N. Wiens. 1963. Interviewer influence on durations of interviewee speech. Journal of Verbal Learning and Verbal Behavior, 1(6):451–458.
Keith McAdam. 2004. The ethics of research related to healthcare in developing countries. Acta Bioethica, 10(1):49–55.
J. R. McMillan, C. Conlon, and Nuffield Council on Bioethics. 2004. The ethics of research related to healthcare in developing countries. Journal of Medical Ethics, 30:204–206.
Albine Moser and Irene Korstjens. 2018. Series: Practical guidance to qualitative research. Part 3: Sampling, data collection and analysis. European Journal of General Practice, 24(1):9–18.
Aaron Mueller, Garrett Nicolai, Arya D. McCarthy, Dylan Lewis, Winston Wu, and David Yarowsky. 2020. An analysis of massively multilingual neural machine translation for low-resource languages. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3710–3718.
David Musoke, Charles Ssemugabo, Rawlance Ndejjo, Sassy Molyneux, and Elizabeth Ekirapa-Kiracho. 2020. Ethical practice in my work: community health workers' perspectives using photovoice in Wakiso district, Uganda. BMC Medical Ethics, 21(1):1–10.
Kevin L. Nadal. 2017. Sampling bias and gender. In The SAGE Encyclopedia of Psychology and Gender.
Judith Nakayiza. 2016. The sociolinguistic situation of English in Uganda. In Ugandan English, pages 75–94. John Benjamins, Amsterdam/Philadelphia.
N. Narayanasamy. 2009. Participatory Rural Appraisal: Principles, Methods and Application, first edition. SAGE Publications India Pvt Ltd, New Delhi.
Raymond S. Nickerson. 1998. Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2):175–220.
V. Nikulina, J. Larson Lindal, H. Baumann, D. Simon, and H. Ny. 2019. Lost in translation: A framework for analysing complexity of co-production settings in relation to epistemic communities, linguistic diversities and culture. Futures, 113(102442):1–13.
OECD. 2019. Survey of refugees and humanitarian staff in Uganda. Joint effort by Ground Truth Solutions (GTS) and the Organisation for Economic Co-operation and Development (OECD) Secretariat with financial support from the United Kingdom's Department for International Development (DFID).
Daniel G. Oliver, Julianne M. Serovich, and Tina L. Mason. 2005. Constraints and opportunities with interview transcription: Towards reflection in qualitative research. Social Forces, 84(2):1273–1289.
Sean Packham. 2016. Crowdsourcing a text corpus for a low resource language. Ph.D. thesis, University of Cape Town.
Christopher J. Pannucci and Edwin G. Wilkins. 2010. Identifying and avoiding bias in research. Plastic and Reconstructive Surgery, 126(2):619.
Erin S. Parcell and Katherine A. Rafferty. 2017. Interviews, recording and transcribing. In Mike Allen, editor, The SAGE Encyclopedia of Communication Research Methods, pages 800–803. SAGE Publications, Thousand Oaks.
Sonny S. Patel, Rebecca K. Webster, Neil Greenberg, Dale Weston, and Samantha K. Brooks. 2020. Research fatigue in COVID-19 pandemic and post-disaster research: causes, consequences and recommendations. Disaster Prevention and Management: An International Journal.
Uwe Peters. 2020. What is the function of confirmation bias? Erkenntnis, pages 1–26.
Stephanie Carroll Rainie, Tahu Kukutai, Maggie Walter, Oscar Luis Figueroa-Rodríguez, Jennifer Walker, and Per Axelsson. 2019. Indigenous data sovereignty.
Thiagarajan Ramakrishnan, Mary C. Jones, and Anna Sidorova. 2012. Factors influencing business intelligence (BI) data collection strategies: An empirical investigation. Decision Support Systems, 52:486–496.
Max Roser and Esteban Ortiz-Ospina. 2016. Literacy. Our World in Data.
Dwaipayan Roy, Sumit Bhatia, and Prateek Jain. 2020. A topic-aligned multilingual corpus of Wikipedia articles for studying information asymmetry in low resource languages. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2373–2380, Marseille, France. European Language Resources Association.
Mary Kathryn Salazar. 1990. Interviewer bias: How it affects survey research. AAOHN Journal, 38(12):567–572.
Inaad Mutlib Sayer. 2013. Misunderstanding and language comprehension. Procedia - Social and Behavioral Sciences, 70:738–748.
Deven Shah, H. Andrew Schwartz, and Dirk Hovy. 2019. Predictive biases in natural language processing models: A conceptual framework and overview.
David Silverman. 2013. Doing Qualitative Research, fourth edition. SAGE Publications Inc.
Jonathan Skinner et al. 2013. The Interview: An Ethnographic Approach, volume 49. A&C Black.
David Stewart, Prem Shamdasani, and Dennis Rook. 2007. Group dynamics and focus group research. In David Stewart, Prem Shamdasani, and Dennis Rook, editors, Focus Groups: Theory and Practice, chapter 2, pages 19–36. SAGE Publications, Thousand Oaks.
Deryn Strange and Melanie K. T. Takarangi. 2015. Memory distortion for traumatic events: the role of mental imagery. Frontiers in Psychiatry, 6:27.
Latanya Sweeney. 2000. Simple demographics often identify people uniquely. Health (San Francisco), 671(2000):1–34.
Fasil Tekola, Susan J. Bull, Bobbie Farsides, Melanie J. Newport, Adebowale Adeyemo, Charles N. Rotimi, and Gail Davey. 2009. Tailoring consent to context: Designing an appropriate consent process for a biomedical study in a low income setting. PLOS Neglected Tropical Diseases, 3:7.
Lisa Thorley and Emma Henrion. 2019. DFID ethical guidance for research, evaluation and monitoring activities. Prepared for the UK Department for International Development.
Paulina O. Tindana, Linda Rozmovits, Renaud F. Boulanger, Sunita V. S. Bandewar, Raymond A. Aborigo, Abraham V. O. Hodgson, Pamela Kolopack, and James V. Lavery. 2011. Aligning community engagement with traditional authority structures in global health research: A case study from northern Ghana. American Journal of Public Health, 101:1857–1867.
Paulina O. Tindana, Jerome A. Singh, C. Shawn Tracy, Ross E. G. Upshur, Abdallah S. Daar, Peter A. Singer, Janet Frohlich, and James V. Lavery. 2007. Grand challenges in global health: Community engagement in research in developing countries. PLOS Medicine, 4(9).
Marc-Adèlard Trembley. 1957. The Key Informant Technique: A Nonethnographic Application. Technical report, Cornell University, New York.
Angela K. Troyer. 2011. Serial Position Effect, pages 2263–2264. Springer, New York, NY.
J. W. Tukey. 1980. We need both exploratory and confirmatory. American Statistician, 34(1):23–25.
UKRI. 2020a. Data protection guidance. Economic and Social Research Council.
UKRI. 2020b. Do participants have a right to withdraw consent? Economic and Social Research Council.
UNDESA. 2018. United Nations Department of Economic and Social Affairs, Population Division: World Urbanization Prospects. World Bank Open Data.
UNESCO. 2018. UNESCO Institute for Statistics Adult Literacy Rate. World Bank Open Data.
UNHCR. 2020. Refugees and Nationals by District. Uganda Comprehensive Refugee Response Portal.
Melissa Upjohn and Kimberly Wells. 2016. Challenges associated with informed consent in low- and low-middle-income countries. Frontiers in Veterinary Science, 3:92.
USAID. 2006. Powering Health: Electrification Options for Rural Health Centers. Technical report, USAID, Washington DC, USA.
Ana Maria de Albuquerque Vasconcellos and Mário Vasconcellos Sobrinho. 2014. Knowledge and culture: two significant issues for local level development programme analysis. Interações (Campo Grande), 15(2):285–300.
Martijn Veltkamp, Ruud Custers, and Henk Aarts. 2011. Motivating consumer behavior by subliminal conditioning in the absence of basic needs: Striking even while the iron is cold. Journal of Consumer Psychology, 21(1):49–56.
Inga Vesper. 2019. Facts & Figures: Disabilities in developing countries. SciDev.Net.
Robert S. Walker and Marcus J. Hamilton. 2011. Social complexity and linguistic diversity in the Austronesian and Bantu population expansions. Proceedings of the Royal Society B: Biological Sciences, 278(1710):1399–1404.
Ke Wang, Steven Goldstein, Madeleine Bleasdale, Bernard Clist, Koen Bostoen, Paul Bakwa-Lufu, Laura T. Buck, Alison Crowther, Alioune Dème, Roderick J. McIntosh, et al. Ancient genomes reveal complex patterns of population movement, interaction, and replacement in sub-Saharan Africa.
Christopher D. Watkins, Lisa M. DeBruine, Anthony C. Little, David R. Feinberg, and Benedict C. Jones. 2012. Priming concerns about pathogen threat versus resource scarcity: dissociable effects on women's perceptions of men's attractiveness and dominance. Behavioral Ecology and Sociobiology, 66(12):1549–1556.
Roger P. Winter. 1983. Uganda: creating a refugee crisis. Cultural Survival Quarterly, 7(2).
Philip Woodhouse. 1998. Thinking with People and Organizations. In Alan Thomas, Joanna Chataway, and Marc Wuyts, editors, Finding Out Fast: Investigative Skills for Policy and Development, first edition, Part III, pages 127–146. SAGE Publications Inc, Milton Keynes.
B. Wu. 2014. Embedding research in local context: local knowledge, stakeholders' participation and fieldwork design. Field Research Method Lab at LSE.
Derek Yach. 1992. The use and value of qualitative methods in health research in developing countries.