A Digital Corpus of St. Lawrence Island Yupik

Abstract

St. Lawrence Island Yupik (ISO 639-3: ess) is an endangered polysynthetic language in the Inuit-Yupik language family indigenous to Alaska and Chukotka. This work presents a step-by-step pipeline for the digitization of written texts, and the first publicly available digital corpus for St. Lawrence Island Yupik, created using that pipeline. This corpus has great potential for future linguistic inquiry and research in NLP. It was also developed for use in Yupik language education and revitalization, with a primary goal of enabling easy access to Yupik texts by educators and by members of the Yupik community. A secondary goal is to support development of language technology such as spell-checkers, text-completion systems, interactive e-books, and language learning apps for use by the Yupik community.

Lane Schwartz
Department of Linguistics, University of Illinois

Emily Chen
Department of Linguistics, University of Illinois

Hyunji Hayley Park
Department of Linguistics, University of Illinois

Edward Jahn

Sylvia L.R. Schreiner
Linguistics Program, Department of English, George Mason University

St. Lawrence Island Yupik (ISO 639-3: ess) is an endangered polysynthetic language in the Inuit-Yupik language family (see Figure 1). It is spoken on St. Lawrence Island, Alaska and the Chukotka Peninsula of Russia. This work presents the first publicly available digital corpus of written texts in St. Lawrence Island Yupik, as well as the step-by-step process by which it was created. We refer to this process as our digitization pipeline, which can be readily adapted to any other language with any amount of written text.

The public release of the digital corpus has been coordinated with various stakeholders in the St. Lawrence Island community, including the Native Village of Gambell, the Bering Strait School District, the Alaska Native Language Center at the University of Alaska Fairbanks, and Wycliffe Bible Translators.

The digital corpus is now available in plain-text format at https://github.com/SaintLawrenceIslandYupik/digital_corpus under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License. Searchable PDF files are being archived at the Alaska Native Language Archive. A mobile-friendly web-accessible version of the corpus will subsequently be developed to allow convenient on- or offline access to the corpus by members of the St. Lawrence Island Yupik community.

• Inuit languages: Greenland; Canada; Alaska
• Sirenik: Chukotka, Russia
• St. Lawrence Island Yupik: Alaska; Chukotka, Russia
• Naukan Yupik: Chukotka, Russia
• Central Alaskan Yup’ik: Western Alaska
• Alutiiq Alaskan Yupik: Southwest Alaska

Figure 1: Inuit-Yupik language family (Fortescue et al., 2010; Krauss et al., 2011)

While the vast majority of St. Lawrence Islanders born in or prior to 1980 are fluent L1 Yupik speakers (Krauss, 1980), rapid language shift is underway among younger generations, especially in Russia where language shift is even further advanced (Morgounova, 2007). As a result, many members of the Yupik community have stated a desire for substantially strengthened Yupik instruction in the schools, ideally in the form of a Yupik language immersion program. One obstacle to this, however, is that many Yupik-language texts, as well as the pedagogical materials that were developed in the Soviet Union in the early 20th century (Krauss, 1971; Krupnik and Chlenov, 2013) and in Alaska in the late 20th century (Krauss, 1971; Koonooka, 2005), are not easily or broadly accessible. Many materials are also archived at the Alaska Native Language Archive at the University of Alaska Fairbanks and at the Materials Development Center in the Gambell school. Therefore, a primary goal for the development and release of this digital corpus is to strengthen opportunities for Yupik language revitalization and education by enabling easy access to existing Yupik-language texts by educators and by members of the Yupik community. A related secondary goal is to support the development of language technologies such as spell-checkers, text-completion systems, interactive e-books, and language learning apps for use by the community.

We introduce in this section the digitization pipeline used in the creation of the digital corpus. It consists of the following three steps, and can be easily replicated for other languages, since very few aspects of the pipeline were specially tailored to Yupik:

1. scanning
2. image processing
3. optical character recognition

All of the texts that appear in the digital corpus are in UTF-8 plain-text format.

In the United States, Yupik is written using a Latin-derived orthography, while in Russia Yupik is written in a modified Cyrillic orthography. The steps described in this section were applied to Yupik documents created in Alaska and written in the Latin-derived Yupik alphabet. A substantial number of unscanned Cyrillic-orthography Yupik documents were gathered from Soviet libraries and archived at the Alaska Native Language Archive by Krauss (1971); when the global COVID-19 pandemic situation once again allows for safe travel, we plan to scan and process these Cyrillic-orthography Yupik documents using essentially this same pipeline.

During fieldwork visits to Gambell in 2017–2019, we identified and digitized a significant portion of the existing Yupik-language texts. Priority was given to material most likely to be immediately useful in Yupik education efforts and in the development of Yupik language technologies, such as bilingual Yupik-English storybooks.

We gathered all Yupik language materials that could be found in the Gambell school library and Materials Development Center. Most texts were scanned one page at a time using flatbed scanning equipment, while others were scanned using a sheet-fed scanner with an automatic page feeder in the Gambell school office. A number of texts located at the Alaska Native Language Archive in Fairbanks were not found in Gambell, and those were scanned on-site in early 2019. Texts were scanned at a resolution of 600 DPI and, whenever possible, saved in TIFF image format.

The raw TIFF image files were processed before optical character recognition was performed. Any images that contained two physical pages were split into two separate files. Next, images were deskewed, despeckled, and cropped. In most cases, these steps were performed using ScanTailor (https://github.com/scantailor/scantailor), an open source program designed for such image processing. More recently, we have begun performing these image processing steps directly in ABBYY FineReader (https://pdf.abbyy.com), a commercial application that we also use for performing optical character recognition.

Figure 2: Regions of images and text are identified by ABBYY (left: red and green rectangles, respectively). Low-confidence characters are highlighted during OCR (right: cyan highlights).

Optical character recognition (OCR) is the process of converting an image into text. As we began the process of scanning Yupik texts in 2016 and 2017, we first attempted to make use of the open source Tesseract OCR software (https://github.com/tesseract-ocr/tesseract) to convert the scanned images into text. While Tesseract models can be trained for new languages, such training requires existing digitized texts. This resulted in a bootstrapping dilemma: without existing Yupik digital texts, we could not train Tesseract models for Yupik.

After poor initial results with Tesseract, we made the decision to switch to ABBYY FineReader (hereafter ABBYY), a state-of-the-art commercial OCR application, for converting our processed image files to plain text. This software was available to us through our respective libraries at the University of Illinois and George Mason University. ABBYY FineReader includes pre-trained OCR models for the broader Inuit-Yupik language family in both Latin and Cyrillic orthographies. Initial work was performed using ABBYY version 12, with later work performed using ABBYY version 14. Unlike our early attempts with Tesseract, OCR quality in both versions of ABBYY FineReader was acceptable.

To begin an ABBYY OCR project, all of the TIFF images associated with a document are imported, analyzed, and partitioned into regions that contain text and regions that contain images or illustrations. These regions must be verified, and can be corrected where necessary.

For each text region, we use ABBYY’s built-in support for texts written in “Eskimo Latin” to perform OCR, and rely on this language setting for all Latin-orthography Yupik documents. We have observed good OCR results even for documents that have deteriorated somewhat over time. ABBYY will nevertheless identify low-confidence characters in the recognized text, and present them to the user for validation. For each section of text that includes one or more low-confidence characters, the TIFF image associated with that section is presented to the user beside a pre-populated text box, in which low-confidence characters are highlighted. The user then confirms or corrects each of these characters. These aspects of image analysis and OCR are shown in Figure 2.

Each fully OCR’d document is then saved in three file formats:

• Microsoft Word DOCX
• searchable PDF/A
• UTF-8 plain text

The Microsoft Word documents will be shared with instructional staff at the St. Lawrence Island schools. Searchable PDF/A files will be archived at the Alaska Native Language Archive, and plain text files are included in our digital corpus.

    NAGATEK
    Ulimakat Nuum Agencym Mumiqhquqhyiqani ,
    Nuum Alaskami 99762
    Atughqaaluki Title VII-nem Maalghustun
    akuzillghestun liinnaqfiganun Bureau of
    Indian Affairenun .

Figure 3: Sample plain text file of the Yupik front matter from the elementary reader Nagatek ‘Listen’.

    Nagaten .
    Nagaqughsigu-u ?
    Nakaa .
    Sangaawa ?
    Esghaqaghhuqun tazigna .
    Maaten nagaqughaqa .
    Enta aqfaatelta tazingavek .
    Hilikaptera .

Figure 4: Sample plain text file of the Yupik content from the elementary reader Nagatek ‘Listen’.

    LISTEN
    Written and Designed by Myra Poage
    Resource Staff/Translators
    Raymond Oozevaseuk
    Henry Silook
    Linda S. Gologerqen
    Illustrated by Michael S. Apatiki
    A producation of the Nome Agency
    Bilingual Education Resource Center,
    P.O. Box 1108
    Nome, Alaska 99762
    for the Title VII Bilingual Education
    Program of the Bureau of Indian Affairs
    Siberian Yupik
    Printed at the GSA Printing Plant
    P.O. Box 1612, Juneau, Alaska 99802
    May 1975
    150 copies

Figure 5: Sample plain text file of the English front matter from the elementary reader Nagatek ‘Listen’.

    1. Listen.
    Do you hear it?
    2. No.
    What is it?
    3. Look over there.
    4. Now I hear it.
    Let’s run over there.
    5. Helicopter.

Figure 6: Sample plain text file of the English content from the elementary reader Nagatek ‘Listen’.

                               Yupik                         English
    Corpus                     Sentences  Tokens   Types     Sentences  Tokens   Types
    Elementary Readers         [...]      [...]    [...]     [...]      [...]    [...]
      (front- & back-matter)   818        2,364    1,053     962        7,903    2,429
    Oral Narratives            9,818      64,696   24,883    10,374     120,194  12,516
      (front- & back-matter)   275        1,149    760       886        12,909   4,581
    Jacobson Exercises         307        907      772       307        2,372    764
    New Testament              16,440     121,425  34,069

Table 1: Counts of sentences, word tokens, and word types for texts included in the digital corpus. Some elementary readers and oral narratives include front-matter and/or back-matter. Note that the number of sentences (and tokens) in the Yupik and English corpora is not directly comparable: in the Yupik texts, lines containing multiple sentences have been split apart and punctuation has been tokenized; in the English texts, neither of these steps has yet been performed.

Lastly, the plain text files are separated and saved as four individual files. The first file contains any Yupik-language front-matter and back-matter, including title page, table of contents, and appendices (Figure 3). The second file contains the main body of the Yupik text, excluding any front-matter and back-matter (Figure 4). The third file contains the English front-matter and back-matter, if any (Figure 5), and the fourth file contains the English translation of the main body of the text, if any (Figure 6). Furthermore, each sentence of a file appears on its own line, a blank line is used to delimit paragraphs, and punctuation marks are separated out from each line of text. We formatted the files of the digital corpus in this way not only to ensure that there exists a record of the full text, but also to facilitate any NLP work that uses this corpus as a source of data. Separating each text into four individual files enables researchers to easily access the desired data, which typically does not include front and back matter. The formatting of each individual file is likewise intended to facilitate text processing.
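The plain-text layout described above (one sentence per line, a blank line between paragraphs, punctuation separated from words) can be read with a few lines of code. The following is a minimal sketch and not part of the released corpus tooling: the function name `read_corpus_text` is hypothetical, and the sample string reuses sentences from Figure 4 with a paragraph break inserted purely for illustration.

```python
def read_corpus_text(text):
    """Parse corpus-style text into paragraphs, each a list of tokenized sentences."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        tokens = line.split()  # punctuation is already whitespace-separated
        if tokens:
            current.append(tokens)
        elif current:
            # A blank line closes the paragraph collected so far.
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


# Sentences taken from Figure 4; the blank line is an illustrative assumption.
sample = "Nagaten .\nNagaqughsigu-u ?\n\nNakaa .\nSangaawa ?\n"
paragraphs = read_corpus_text(sample)
print(len(paragraphs))
print(paragraphs[0])
```

Because punctuation is already set off by spaces, a plain whitespace split is enough here; text that has not been tokenized this way (such as the English files, per the Table 1 caption) would need a real tokenizer.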

To date, we have digitized most of the existing Yupik-language texts using the digitization pipeline introduced herein. We have scanned 90 mostly comb-bound Yupik elementary readers, 7 collections of Yupik oral narratives, the end-of-chapter exercises from the Jacobson (2001) grammar, and 14 collections of Yupik-language hymns and other religious texts. Table 1 summarizes the distribution of sentences, word types, and word tokens across the current digital corpus. We describe each type of text in the following sections.
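Counts of the kind reported in Table 1 can be approximated directly from a one-sentence-per-line file. The sketch below is an assumption about how such counts might be computed, not the authors' script: the name `corpus_counts` is hypothetical, punctuation marks are counted as tokens, and no case folding is applied, whereas the paper does not state which conventions it used.

```python
def corpus_counts(text):
    """Count sentences (non-blank lines), tokens, and distinct token types."""
    sentences = [line.split() for line in text.splitlines() if line.strip()]
    tokens = [token for sentence in sentences for token in sentence]
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": len(set(tokens)),  # case-sensitive; punctuation included
    }


# Three sentences from the sample Yupik content shown in Figure 4.
sample = "Nagaten .\nNagaqughsigu-u ?\nNakaa .\n"
counts = corpus_counts(sample)
print(counts)
```

Changing either assumption (for example, dropping punctuation tokens before counting) would shift the token and type totals, so figures produced this way should not be expected to match Table 1 exactly.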

In the 1970s, a set of elementary-level readers was developed by the Nome Agency Bilingual Education Resource Center at the Bureau of Indian Affairs and by the Alaska Native Language Center at the University of Alaska Fairbanks. In the 1980s, additional readers were developed by the Bering Strait School District’s Bilingual Materials Development Center (MDC) at the Gambell School on St. Lawrence Island. In the early 1990s, a series of five bilingual Yupik-English readers was planned for use by the St. Lawrence Island Schools in grades 4-8 (Apassingok et al., 1993). Only the first three books in the series (Apassingok et al., 1993, 1994, 1995) were actually produced.

To date, 90 of these elementary-level readers have been scanned. Of those, 68 have been fully digitized and are included in the digital corpus, including the bilingual grade 4-6 readers. Processing of the remaining 22 elementary-level readers is ongoing. As seen in Figures 7 and 8, the elementary-level readers account for about one third of the Yupik sentences in the digital corpus (33.53%) but a smaller share of the word types (29.97%). This is to be expected, given the nature of these texts; since they were originally intended for language learning, one would expect them to frequently repeat words.

In the late 1970s, two books collecting St. Lawrence Island legends were produced by the National Bilingual Materials Development Center at the University of Alaska Anchorage (Slwooko, 1977, 1979). In the 1980s, a set of Yupik oral narratives was recorded on cassette tape, a subset of which were transcribed, translated, and collected into a series of three books constituting the Lore of St. Lawrence Island collection (Apassingok et al., 1985, 1987, 1989). In the late 1990s, a collection of oral narratives was recorded in Savoonga, Alaska as part of fieldwork conducted by Japanese linguist Kayo Nagai. Transcriptions of these narratives, along with interlinear glosses and free translations, were later published in book form (Nagai, 2001). In the early 2000s, a 20th-century collection of Yupik oral narratives from Chukotka was transliterated from Cyrillic into the Latin Yupik orthography and published with English translations (Koonooka, 2003). The oral narratives in these collections include short stories, legends and folktales orally narrated by Yupik elders.

To date, five of these collections of Yupik oral narratives have been fully digitized and are included in the digital corpus, while processing of the remaining two (Slwooko, 1977, 1979) is ongoing. As seen in Figures 7 and 8, the oral narratives contribute roughly one quarter of the sentences (24.57%) and just under 30% of the word types (29.17%) in the digital corpus.

Figure 7: Distribution of total Yupik sentences per collection, excluding front- & back-matter and English content. Elementary Readers 33.53%; Oral Narratives 24.57%; New Testament 41.13%; Jacobson 0.77%.

Figure 8: Distribution of total Yupik word types per collection, excluding front- & back-matter and English content. Elementary Readers 29.97%; Oral Narratives 29.17%; New Testament 39.95%; Jacobson 0.91%.

The most thorough source of language documentation for Yupik is the grammar of Jacobson (2001). The grammar is written at the level of an undergraduate college text, and appears to be designed for an audience of L1 Yupik speakers studying their own language at the college level. Chapters 3–17 of the grammar each include end-of-chapter sample Yupik sentences. These sentences are designed for the reader to practice the aspects of Yupik grammar presented in each respective chapter by translating the sentences into English. These end-of-chapter sample sentences have been fully digitized and are included in the digital corpus, though they comprise only a small portion of the corpus as seen in Figures 7 and 8. As part of our Yupik research, we elicited English reference translations of the Jacobson sample sentences, which we also include in the digital corpus.

A Yupik translation of the New Testament was published in 2018, completing a nearly 60-year collaborative translation project by Wycliffe Bible Translators and Yupik translators on St. Lawrence Island. There are 14 additional Yupik religious texts (including a collection of hymns) from the Alaska Native Language Archive that we have scanned but not yet fully processed.

Given Yupik’s status as an understudied language, there is no doubt much to be learned linguistically from analyzing this digital corpus. While the corpus has yet to be annotated, preliminary work has yielded several remarkable facts of the language that were not known to us previously, particularly with respect to its morphology.

The morphology of Yupik is perhaps one of the more well-documented aspects of the language. The Yupik lexicon is broadly composed of three types of morphemes: roots, derivational morphemes, and inflectional morphemes. Since Yupik is strictly suffixing with the exception of one prefix, words typically have the following form:

    root - derivational morpheme(s) - inflectional morpheme

Most roots are nominal or verbal, such as alquutagh- ‘spoon’ and qepghagh- ‘to work’ respectively. This results in four types of derivational morphemes:

• N → N, which suffix to nominal roots and yield nominal stems
• N → V, which suffix to nominal roots and yield verbal stems
• V → V, which suffix to verbal roots and yield verbal stems
• V → N, which suffix to verbal roots and yield nominal stems

Upon suffixation, there are several morphophonological processes that can occur depending on the phonological shape of the root and the morpheme being suffixed. For instance, in (1), suffixing the derivational morpheme -peragh- results in the deletion of root-final consonant -g- in the root atkug-.

(1) atkuperaq
    atkug-peragh-Ø
    parka-makeshift-ABS.sg
    ‘makeshift parka’ (Badten et al., 2008, p. 661)

The Badten et al. (2008) Yupik-English dictionary comprehensively documents all of the morphemes that have been identified to date, while the Jacobson (2001) reference grammar and de Reuse (1994) overview the known ordering constraints (e.g. V → V derivational morphemes may only suffix to verbal roots and stems) and all of the morphophonological processes that can occur upon suffixation. The corpus, however, has demonstrated several exceptions to this documentation.

For example, the derivational morpheme -pag- is an augmentative V → V suffix meaning ‘to V intensively, excessively’. As such, it is attested to suffix to verbal stems only. One would thus expect it to yield the word seen in (2) but not in (3), where the stem alquutagh- ‘spoon’ is a nominal stem.

(2) qepghaghpagtuq
    qepghagh-pag-tu-q
    work-AUG-IND.INTR-3sg
    ‘he worked hard’ (Badten et al., 2008, p. 659)

(3) alquutaghpaget
    alquutagh-pag-et
    spoon-AUG-ABS.pl
    ‘heaping tablespoons’ (Nagai, 2001, p. 103)

Nevertheless, (3) is a valid, attested form in our digital corpus, and -pag- frequently appears suffixed to other nominal stems as well (sagneghpaget ‘large bowls’, neqekrangllaghpagni ‘great smell of bread’, suupeliighpagni ‘great smell of stew’ (Apassingok et al., 1993)). This suggests that perhaps -pag- is not only a V → V suffix, but also an N → N suffix, which is not attested in the existing documentation.

A second interpretation is that there exist fewer constraints on morpheme ordering than previously believed, which would permit V → V suffixes to affix to nominal roots. This is also supported by substantial evidence of verbal roots being inflected for nominal inflectional morphemes in our corpus, as seen in (4).

(4) yuvghiiq
    yuvghiiq-Ø
    examine-ABS.sg
    ‘look!’ (NABERC, 1975)

In the same way one would not expect the V → V suffix -pag- to suffix to a nominal root, one would not expect a verbal root to inflect for nominal inflectional endings. In this way, the digital corpus has opened up a rich area of inquiry.

The digital corpus also speaks to the influence of Yupik’s oral tradition. Since many of the texts in our corpus were originally oral narrations, there is considerable speaker variation, which has resulted in transcriptions of morphemes that differ greatly from their attested forms in the Badten et al. (2008) dictionary. These variations are apparent in the morphophonology as well, particularly with regard to allomorphy, as seen in (5).

(5) maklaguugut
    maklagu-u-gu-t
    bearded.seal.intestines-be-IND.INTR-3pl
    literal trans. they are bearded seal’s intestines (Nagai, 2001, p. 191)

(6) maklagunguut
    maklagu-ngu-u-t
    bearded.seal.intestines-be-IND.INTR-3pl
    literal trans. they are bearded seal’s intestines

Whereas the word form in the digital corpus is maklaguugut, the word form predicted by the attested morphophonological processes of Yupik is maklagunguut, seen in (6). Detailed analysis of word forms such as these would contribute to an increased understanding of Yupik morphophonology, and remedy gaps in the existing documentation.

Our digital corpus thus offers many possibilities for future research in and documentation of Yupik morphology. It would not only facilitate studies on morpheme ordering constraints and morphophonological variation, but would also allow for the potential discovery of novel, previously unattested morphemes. Beyond the level of the word, the corpus will be of use for the study both of morphemes in context and of phenomena at the level of the sentence or the discourse. For example, Yupik boasts a large number of morphemes with meanings related to tense, aspect, mood, and modality. The meanings and uses of these morphemes are not well documented to begin with; in addition, their use often depends on factors outside of the word or sentence they are in. Sentential and discourse-level context is essential for understanding and analyzing many other syntactic and semantic phenomena, as well.
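The ordering constraint discussed above (a derivational suffix attaches only to stems of its source category) can be made concrete with a small check. This is an illustrative sketch under stated assumptions, not the authors' analyzer: the names `Suffix` and `ordering_violations` are hypothetical, categories are reduced to "N" and "V", and -pag- is entered as V → V because that is how the text says the existing documentation describes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suffix:
    form: str
    source: str  # category the suffix is documented to attach to: "N" or "V"
    target: str  # category of the stem it yields


def ordering_violations(root_category, suffixes):
    """Return (suffix form, stem category) pairs where the documented
    source category of a suffix does not match the stem it attaches to."""
    category = root_category
    violations = []
    for suffix in suffixes:
        if suffix.source != category:
            violations.append((suffix.form, category))
        # Assumption: continue with the suffix's documented target category.
        category = suffix.target
    return violations


PAG = Suffix("-pag-", "V", "V")  # augmentative, documented as V -> V

print(ordering_violations("V", [PAG]))  # qepghagh- 'to work', as in (2)
print(ordering_violations("N", [PAG]))  # alquutagh- 'spoon', as in (3)
```

Under the documented constraint the form in (2) raises no violation while the form in (3) does, which is the mismatch the corpus brings to light; adding a second hypothetical entry such as `Suffix("-pag-", "N", "N")` would correspond to the first interpretation offered in the text.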

Many polysynthetic languages, such as Yupik, are low-resource and under-researched within the field of NLP. The availability of the digital corpus for Yupik now enables researchers to utilize a written dataset that was otherwise inaccessible. For our own purposes the digital corpus has had an immediate impact on two of our projects related to NLP, that is, our ongoing development of a morphological analyzer and a dependency treebank for Yupik.

Given Yupik’s rich morphology, the implementation of a morphological analyzer is an essential step in the development of more complex language technologies. While two iterations of a rule-based morphological analyzer have already been implemented (Chen and Schwartz, 2018; Chen et al., 2020), neither achieves full coverage, that is, neither provides an analysis for every input item. The digital corpus, however, offers a means of understanding the shortcomings of our existing analyzers.

We have already begun a detailed error analysis of the words in the corpus that cannot be analyzed, and are working on identifying the prominent patterns in these errors. For instance, as described in the previous section, our corpus has demonstrated that constraints on morpheme ordering are perhaps more lax than has been initially documented. Knowing this allows us to appropriately modify the analyzer to take this phenomenon into account. All of our findings from studying the digital corpus will subsequently be used to improve the existing analyzers.

The digital corpus is also currently being used to create the first Universal Dependencies (UD) (Nivre et al., 2016) treebank for Yupik. UD provides a crosslingual framework for consistent annotation of dependency grammars across different natural languages. However, the framework has not often been utilized for annotating polysynthetic languages like Yupik. By annotating the digital corpus within the UD framework, we hope to contribute to expanding the framework to annotate other polysynthetic languages. It would further allow us to utilize existing UD tools (e.g. multilingual UD parsers) for comparative linguistic research as well as other NLP tasks like syntactic parsing.

A second goal for the UD treebank project is to better understand the syntactic properties of Yupik and to utilize such knowledge for future NLP tasks. In particular, we can use the treebank to create novel sentences in Yupik, thereby augmenting existing textual data. This would greatly assist those NLP tasks that require considerable quantities of data, such as neural language modeling.

In summary, the digital corpus can help us achieve a better understanding of Yupik morphology and syntax, which in turn would result in the building of more robust computational models. These computational models would then support the development of educational applications for Yupik revitalization, such as spell-checkers, text-completion systems, interactive e-books, and language learning apps.
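The error analysis described in this section starts from two quantities: the share of corpus words that an analyzer can analyze (its coverage), and the list of words it cannot. A minimal sketch of that bookkeeping follows. It assumes only that an analyzer can be wrapped as a function returning a possibly empty list of analyses; `coverage_report` and `toy_analyze` are hypothetical names, and the toy lexicon says nothing about what the actual analyzers of Chen and Schwartz (2018) or Chen et al. (2020) accept.

```python
from collections import Counter


def coverage_report(words, analyze):
    """Return (coverage, failures): the fraction of word tokens that receive
    at least one analysis, and a Counter of the tokens that receive none."""
    failures = Counter()
    analyzed = 0
    for word in words:
        if analyze(word):
            analyzed += 1
        else:
            failures[word] += 1
    coverage = analyzed / len(words) if words else 0.0
    return coverage, failures


def toy_analyze(word):
    """Hypothetical stand-in for a real analyzer: looks words up in a tiny table."""
    known = {"qepghaghpagtuq": ["qepghagh-pag-tu-q"]}
    return known.get(word, [])


words = ["qepghaghpagtuq", "alquutaghpaget", "alquutaghpaget", "qepghaghpagtuq"]
coverage, failures = coverage_report(words, toy_analyze)
print(coverage)
print(failures.most_common(1))
```

Sorting the failures by frequency is one simple way to surface prominent patterns first; grouping them by suffix or by source text would be natural refinements.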

To date, we have digitized all of the Yupik-language materials at the Gambell school and a portion of those archived at the Alaska Native Language Archive (ANLA). There are a number of other materials located in the ANLA, however, that have not yet been included in the digital corpus.

After visiting the ANLA in early 2019, we have identified approximately 65 documents indexed under St. Lawrence Island Yupik or Chaplinski Yupik that remain to be scanned. We have also confirmed that there is a substantial amount of Yupik material at the ANLA that has neither been indexed nor scanned, most of which are Soviet-era Yupik texts (primarily in Cyrillic orthography) collected by Michael Krauss during visits to various libraries in the Soviet Union (Krauss, 1971).

Furthermore, the Yupik examples in Shinen (1982), Silook et al. (1983), de Reuse (1994), and Shutt et al. (2014) have not been digitized, nor have the examples in Soviet-era Yupik language documentation (Menovshchikov, 1960, 1962, 1967, 1983). The latter are written in Cyrillic orthography with descriptions in Russian. Future work will entail digitizing all of these materials.

A second objective for the digital corpus is text verification. While ABBYY is the state-of-the-art software for OCR work, errors may still have occurred during the OCR process. As such, we plan to have all digitized texts verified by native speakers.

Lastly, we intend to use the digital corpus to eventually build a parallel corpus of Yupik texts and their translations. Many of the texts included in the digital corpus have English translations, while many of the Soviet-era works that we plan to include have Russian translations. One challenge, however, is the fact that many of the translations do not have a one-to-one correspondence with Yupik sentences. In such cases, a single Yupik sentence may be translated as more than one English sentence, or vice versa. The intended parallel corpus will map Yupik sentences one-to-one to their translations, which would facilitate various projects and endeavors in NLP.
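One common way to obtain one-to-one pairs when a sentence on one side corresponds to several on the other is to record the correspondence as groups of sentence indices and merge each group into a single segment. The sketch below illustrates that idea only; it is an assumption about how the mapping could be represented, not the authors' planned method. The name `merge_groups` is hypothetical, and the sentences are placeholders rather than corpus data.

```python
def merge_groups(yupik, english, groups):
    """Merge aligned groups of sentence indices into one-to-one segment pairs.

    groups is a list of (yupik_indices, english_indices) tuples that must
    cover every sentence on both sides exactly once, in order.
    """
    used_yupik = [i for y_idx, _ in groups for i in y_idx]
    used_english = [j for _, e_idx in groups for j in e_idx]
    if used_yupik != list(range(len(yupik))):
        raise ValueError("groups must cover every Yupik sentence once, in order")
    if used_english != list(range(len(english))):
        raise ValueError("groups must cover every English sentence once, in order")
    return [
        (" ".join(yupik[i] for i in y_idx), " ".join(english[j] for j in e_idx))
        for y_idx, e_idx in groups
    ]


# Placeholder sentences: one Yupik sentence rendered as two English sentences,
# then two Yupik sentences rendered as one English sentence.
yupik = ["Y1 .", "Y2 .", "Y3 ."]
english = ["E1.", "E2.", "E3."]
groups = [([0], [0, 1]), ([1, 2], [2])]
pairs = merge_groups(yupik, english, groups)
print(pairs)
```

The groups themselves would still have to come from somewhere, for example manual annotation or an automatic sentence aligner; this sketch covers only the merging step.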

The Yupik language and the corpus of Yupik written texts described herein represent important components of the linguistic and cultural heritage of the St. Lawrence Island Yupik people. While many of the existing Yupik-language texts have already been fully digitized and are present in our digital corpus, there remains much ongoing and future work. As it stands, however, the digital corpus already lends itself to general linguistic inquiry and research related to NLP. We believe it to be a valuable source of data that would greatly contribute to our understanding of the Yupik language, and moreover, to the field of NLP, as computational research on polysynthetic languages is still relatively scarce. Above all, however, is the fact that the digital corpus has broadened the accessibility of Yupik language materials, which is a pivotal step towards establishing a program for Yupik language education and revitalization.

The Yupik language is a critical part of the cultural heritage of the Yupik people. We offer our deep gratitude to the people of St. Lawrence Island who have trusted us to work with this material. Special thanks to the Yupik speakers whose words are recorded in this corpus. We wish to thank everyone who assisted in scanning, proofreading, and digitizing this material. Thanks to the board members and staff of the Native Village of Gambell, the City of Gambell, and Sivuqaq, Inc. Thanks to the Bering Strait School District, the faculty and staff of ANLC and ANLA, Dave and Mitzi Shinen of Wycliffe Bible Translators, Steven A. Jacobson, Kayo Nagai, Willem de Reuse, the staff of Gambell Lodge, Iyaaka (Anders Apassingok), Taayqa (Michael James, RIP), Rob Taylor, Petuwaq (Chris Koonooka) and the current and former faculty and staff of Gambell and Savoonga Schools who developed so many wonderful materials over the years and who supported us in this project. This work was supported by NSF Awards 1761680 and 1760977.

Igamsiqanaghhalek!

References

Anders Apassingok (Iyaaka), Jessie Uglowook (Ayuqliq), Lorena Koonooka (Inyiyngaawen), and Edward Tennant (Tengutkalek), editors. 1993. Kallagneghet / Drumbeats. Bering Strait School District, Unalakleet, Alaska.

Anders Apassingok (Iyaaka), Jessie Uglowook (Ayuqliq), Lorena Koonooka (Inyiyngaawen), and Edward Tennant (Tengutkalek), editors. 1994. Akiingqwaghneghet / Echoes. Bering Strait School District, Unalakleet, Alaska.

Anders Apassingok (Iyaaka), Jessie Uglowook (Ayuqliq), Lorena Koonooka (Inyiyngaawen), and Edward Tennant (Tengutkalek), editors. 1995. Suluwet / Whisperings. Bering Strait School District, Unalakleet, Alaska.

Anders Apassingok (Iyaaka), Willis Walunga (Kepelgu), and Edward Tennant (Tengutkalek), editors. 1985. Sivuqam Nangaghnegha — Siivanllemta Ungipaqellghat / Lore of St. Lawrence Island — Echoes of our Eskimo Elders, volume 1: Gambell. Bering Strait School District, Unalakleet, Alaska.

Anders Apassingok (Iyaaka), Willis Walunga (Kepelgu), and Edward Tennant (Tengutkalek), editors. 1987. Sivuqam Nangaghnegha — Siivanllemta Ungipaqellghat / Lore of St. Lawrence Island — Echoes of our Eskimo Elders, volume 2: Savoonga. Bering Strait School District, Unalakleet, Alaska.

Anders Apassingok (Iyaaka), Willis Walunga (Kepelgu), and Edward Tennant (Tengutkalek), editors. 1989. Sivuqam Nangaghnegha — Siivanllemta Ungipaqellghat / Lore of St. Lawrence Island — Echoes of our Eskimo Elders, volume 3: Southwest Cape. Bering Strait School District, Unalakleet, Alaska.

Linda Womkon Badten, Vera Oovi Kaneshiro, Marie Oovi, and Christopher Koonooka. 2008.
St. Lawrence Island / Siberian Yupik Eskimo Dictionary. Alaska Native Language Center, University of Alaska Fairbanks.

Emily Chen, Hyunji Hayley Park, and Lane Schwartz. 2020. Improving finite-state morphological analysis for St. Lawrence Island Yupik with paradigm function morphology.

Emily Chen and Lane Schwartz. 2018. A morphological analyzer for St. Lawrence Island / Central Siberian Yupik. In Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan.

Michael Fortescue, Steven Jacobson, and Lawrence Kaplan. 2010. Comparative Eskimo Dictionary with Aleut Cognates, 2nd edition. Alaska Native Language Center, Fairbanks, Alaska.

Steven A. Jacobson. 2001. A Practical Grammar of the St. Lawrence Island / Siberian Yupik Eskimo Language, Preliminary Edition, 2nd edition. Alaska Native Language Center, Fairbanks, Alaska.

Christopher Koonooka (Petuwaq). 2005. Yupik language instruction in Gambell (St. Lawrence Island, Alaska). Études/Inuit/Studies, 29(1/2):251–266.

Christopher Koonooka (Petuwaq). 2003. Ungipaghaghlanga: Let Me Tell You A Story. Alaska Native Language Center.

Michael Krauss. 1971. Developing a literature in the language of the Eskimos of St. Lawrence Island. Alaska Native Language Archive Identifier SY970K1971c.

Michael Krauss. 1980. Alaska Native languages: Past, present and future. ANLC Research Papers, 4.

Michael Krauss, Gary Holton, Jim Kerr, and Colin T. West. 2011. Indigenous peoples and languages of Alaska. Alaska Native Language Archive Identifier G961K2010.

Igor Krupnik and Michael Chlenov. 2013.
Yupik Transitions — Change and Survival at Bering Strait, 1900-1960. University of Alaska Press, Fairbanks, Alaska.

G. A. Menovshchikov. 1960. Eskimosskii iazyk. Gosudarstvennoe uchebno-pedagogicheskoe izdatel’stvo, Leningrad. Pedagogical grammar, similar in scope and level to Jacobson (2001).

G. A. Menovshchikov. 1962. Grammatika iazyka aziatskikh eskimosov (Grammar of the language of Asian Eskimos), volume 1. Izdatel’stvo akademii Nauk (Academy of Sciences of the USSR), Moscow and Leningrad.

G. A. Menovshchikov. 1967. Grammatika iazyka aziatskikh eskimosov, volume 2. Izdatel’stvo akademii Nauk, Moscow and Leningrad. Major reference grammar.

G. A. Menovshchikov. 1983. Slovar’ eskimossko-russkiy i russko-eskimosskiy. Prosveshchenie, Leningrad. Yupik to Russian and Russian to Yupik school dictionary.

Daria Morgounova. 2007. Language, identities and ideologies of the past and present Chukotka. Études/Inuit/Studies, 31(1-2):183–200.

Kayo Nagai. 2001.
Mrs. Della Waghiyi’s St. Lawrence Island Yupik Texts with Grammatical Analysis. Number A2-006 in Endangered Languages of the Pacific Rim. Nakanishi Printing, Kyoto, Japan.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666, Portorož, Slovenia. European Language Resources Association (ELRA).

Nome Agency Bilingual Education Resource Center. 1975. Aghnaaneq Neghsaq Teghigniqelghii. GSA Printing Plant, Juneau, AK.

Willem J. de Reuse. 1994. Siberian Yupik Eskimo — The Language and Its Contacts with Chukchi. Studies in Indigenous Languages of the Americas. University of Utah Press, Salt Lake City, Utah.

David C. Shinen. 1982. Some beginning conversational phrases in St. Lawrence Island Yupik Eskimo. Alaska Native Language Center Identifier SY960S1982.

Lauren Shutt, Dawn Biddison, and Christopher Koonooka. 2014. Listen & Learn: St. Lawrence Island Yupik Language and Culture Video Lessons. Arctic Studies Center, Smithsonian Institution, Anchorage, Alaska.

Roger Silook, Adelinda Womkon Badten, Helen Slwooko Carius, Vera Oovi Kaneshiro, Elinor Oozeva, David Shinen, and Grace Slwooko. 1983. Sivuqam Anglinghhaan Akuzisii / St. Lawrence Island Junior Dictionary. National Bilingual Materials Development Center, Rural Education Affairs, University of Alaska, Anchorage, Alaska. Yupik-to-English school dictionary.

Grace Slwooko. 1977. Sivuqam Ungipaghaatangi I. University of Alaska, Anchorage, AK.

Grace Slwooko. 1979. Sivuqam Ungipaghaatangi II. University of Alaska, Anchorage, AK.

Wycliffe. 2018.