CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment
Si-Ioi Ng†, Cymie Wing-Yee Ng†, Jiarui Wang, Tan Lee, Kathy Yuet-Sheung Lee, Michael Chi-Fai Tong

Department of Electronic Engineering, The Chinese University of Hong Kong
Department of Otorhinolaryngology, Head & Neck Surgery, The Chinese University of Hong Kong
{siioing, cymieng, jiaruiwang}@link.cuhk.edu.hk, [email protected], {leeys, mtong}@ent.cuhk.edu.hk
† Equal contribution

Abstract
This paper describes the design and development of CUCHILD, a large-scale Cantonese corpus of child speech. The corpus contains spoken words collected from 1,986 child speakers aged from 3 to 6 years old. The speech materials include words of 1 to 4 syllables in length. The speakers cover both typically developing (TD) children and children with speech disorder. The intended use of the corpus is to support scientific and clinical research, as well as technology development related to child speech assessment. The design of the corpus, including the selection of words, participant recruitment, the data acquisition process and data pre-processing, is described in detail. Results of acoustical analysis are presented to illustrate the properties of child speech. Potential applications of the corpus in automatic speech recognition, phonological error detection and speaker diarization are also discussed.

Index Terms: speech corpus, child speech, Cantonese, speech sound disorder
1. Introduction
Speech is one of the most common media of human communication. Natural speech can be captured and recorded as an acoustic signal for subsequent analysis. The recorded speech data are stored in a structured database known as a speech corpus. Speech data carry information about the acoustic properties of speech, the linguistic usage of the language concerned, and the characteristics of the speakers and recording conditions. With a sufficient amount of speech data, statistical analysis can be performed to investigate and understand the properties of speech from different perspectives. Statistical modeling of speech data also enables the development of a wide range of speech technologies and applications. In short, speech data play an important role in multi-disciplinary research on speech communication.

The development of a specific speech technology needs to consider the target speakers and choose a suitable speech corpus according to the nature and scope of the intended applications. Nowadays, mainstream speech technologies are built mainly with adult speech data, and they often show significantly degraded performance on child speakers, who account for a large part of the population. The degradation is clearly due to the many differences between adult and child speech. While abundant resources of well-annotated adult speech are available and continually accumulated in the public domain, corpora of child speech are far less common. Part of this issue comes from privacy-related concerns of parents, and part from the difficulty of data collection given the limited attention span of child subjects. Despite the challenges, a number of child speech corpora have been developed over the years, e.g., the OGI Kids' Speech corpus [1], the University of Colorado's Kids' Speech Corpus [2] and the CID children's speech corpus [3]. These corpora target healthy child and adolescent speakers whose ages range from 5;0 to 17;11, and they are useful resources for both speech technology development [4] and acoustical analysis [3].

In population studies of speech acquisition, children of younger age tend to make more mistakes in producing target words [5]. The mistakes are caused by the underdeveloped vocal tract and motor skills involved in producing speech sounds, as well as by still-developing phonological abilities. This implies that, when a large-scale collection of child speech is carried out, there is a high chance that speech errors will be included. Incorporating erroneous speech into the corpus design, with detailed annotation of the relevant errors, opens a new way for error analysis and for the development of systems targeting problems of child speech acquisition. There have been a few related corpus development efforts in recent years, e.g., [6][7], but a large part of these corpora are free from speech errors and serve general speech technology development.

In this paper, we present the CUCHILD child speech corpus, the outcome of close collaboration between the Department of Electronic Engineering and the Department of Otorhinolaryngology, Head and Neck Surgery of the Chinese University of Hong Kong. The primary goal of this effort is to provide data resources to support acoustic analysis and the identification of children with speech sound disorder, targeting native Cantonese-speaking children aged 3;0 to 6;11. The speech data are also useful for research on automatic speech recognition, speaker diarization and other speech technologies.

The rest of the paper is organised as follows. Section 2 introduces the background of Hong Kong Cantonese and describes the design of the CUCHILD corpus. Section 3 presents the results of spectral and duration analyses of Cantonese vowels produced by children. Section 4 discusses potential applications of CUCHILD, followed by a short conclusion in Section 5.
2. Design of Corpus
Cantonese, a traditional prestige variety of the Yue Chinese dialect group, is a major Chinese dialect widely spoken by about 68 million native speakers in Hong Kong, Macau, the Guangdong and Guangxi provinces of Mainland China, as well as overseas Chinese communities. It is a monosyllabic and tonal language: each Chinese character is pronounced as a single syllable carrying a lexical tone. A Cantonese syllable can be divided into an onset and a rime. The onset is a consonant, while the rime contains a nucleus, optionally followed by a coda. The nucleus can be a vowel or a diphthong, and the coda is a final consonant. There are 19 initial consonants, 11 vowels, 11 diphthongs, 6 final consonants and 6 distinct lexical tones (plus 3 allotones). The tones are characterised by different pitch and duration patterns. Present-day Cantonese uses over 600 legitimate base syllables; if tone differences are taken into account, the number of distinct syllables exceeds 1,800 [8][9].

The speech samples in the CUCHILD corpus were collected from 1,986 Hong Kong pre-school children (1,006 female, 980 male, aged 3;0 to 6;11) during the period from February 2017 to January 2018. All speakers use Hong Kong Cantonese as their first language (L1). The children were students in grades K1 to K3, recruited via mainstream local kindergartens that use Cantonese as their medium of teaching; children from special child care centres were not included. Parental consent was obtained for each participating child. Age and gender information was collected and is summarized in Table 1. Seventeen kindergartens from different districts of Hong Kong participated in the study, and their distribution is presented in Table 2.

Table 1: Number of participants in different age groups.

Age (years;months)   3;0-3;11   4;0-4;11   5;0-5;11   6;0-6;11
Male                 227        300        340        113
Female               202        368        341        95

Table 2: Distribution of participants/kindergartens in different districts.

District         Hong Kong Island   New Territories   Kowloon
Kindergartens    -                  -                 -
Participants     740                761               485
Each participant was seen individually in a separate area inside the kindergarten, seated face-to-face with a research assistant, with a mini-game setting to engage his/her attention. A digital recorder (TASCAM DR-44WL) was placed in front of the child's mouth. As environmental noise (reverberation, school bells, people walking around, etc.) was unavoidable, the gain of the recorder was adjusted, with best effort, to keep the background noise level below a target level relative to the maximum input level. The sampling rate was set to 44.1 kHz with two-channel stereo recording.

For each child subject, the recording contains an interactive conversation between the child and the research assistant. The research assistants were student clinicians from speech therapy programmes in local universities. A technician, a student with an engineering background, was responsible for monitoring the recording devices. As children lose concentration easily, sufficient break time was allowed during each session. With previous experience of working with children, the research assistants were able to engage the participants with the mini-game and elicit the targeted verbal outputs. The majority of the participants were co-operative in the recording process.

Each recording session consisted of two major parts involving three stimuli booklets. In the first part, a single-word articulation test, the Hong Kong Cantonese Articulation Test (HKCAT) [10], was used to obtain information about the child's speech sound ability at the single-word level. In the second part, the subject was asked to read aloud two stimuli booklets with pictures illustrating 130 Cantonese words (223 syllables) of 1 to 4 syllables.

The HKCAT is a standardized single-word articulation test commonly used by qualified speech therapists in Hong Kong. It provides information about the speech sound inventory, speech sound errors and error patterns of the participant. All research assistants had received proper training on the use of the HKCAT and the transcription of Cantonese speech sounds. The data collection procedure was monitored by the supervisor, a qualified speech therapist with more than 10 years of clinical experience in working with children with speech sound disorders. The HKCAT results were transcribed instantly on recording forms by the research assistants.

After the HKCAT, the child subject was asked to name the 130 Cantonese words one by one. When a subject failed to name a picture, the research assistant would provide a direct model for the child to repeat and imitate. The target words were selected with consideration of their age-adequacy and are illustrated with child-friendly colorful drawings; samples of the stimuli are shown in Figure 1(a)-(d). The words were selected to elicit speech samples covering all Cantonese phonemes in words of different lengths, with different syllable structures (CV, CVV, CVC) and at different syllable positions. The initial consonants include plosives, affricates, nasals, fricatives, approximants and lateral approximants. The initial consonant [n] was not included, as it is commonly regarded as an allophone of [l] in Hong Kong Cantonese. Seven long vowels, four short vowels, eleven diphthongs, six final consonants and six lexical tones are all covered in the 223 syllables. The phonemes are summarized in Table 3 below.
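The exact microphone distance and noise-level threshold did not survive in our copy of the text, but the kind of level check described in the recording setup above can be sketched as follows. This is a minimal sketch under assumptions: the file name and the -30 dBFS noise floor are illustrative, not values from the paper.

```python
# Minimal sketch of a session-level check on a recording: confirm the 44.1 kHz
# two-channel setting and estimate the background noise floor relative to full
# scale. The file name and -30 dBFS threshold are illustrative assumptions.
import numpy as np
import soundfile as sf

def check_recording(path, noise_floor_dbfs=-30.0):
    audio, rate = sf.read(path)                       # float samples in [-1, 1]
    channels = 1 if audio.ndim == 1 else audio.shape[1]
    mono = audio if audio.ndim == 1 else audio.mean(axis=1)
    win = int(0.1 * rate)                             # 100 ms analysis windows
    frames = mono[: len(mono) // win * win].reshape(-1, win)
    rms = np.sqrt((frames ** 2).mean(axis=1)) + 1e-12
    floor_db = 20 * np.log10(np.percentile(rms, 10))  # quietest 10% of windows
    print(f"rate={rate} Hz, channels={channels}, noise floor ~ {floor_db:.1f} dBFS")
    return rate == 44100 and channels == 2 and floor_db < noise_floor_dbfs

# check_recording("session_K01_child0001.wav")        # hypothetical file name
```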
Table 3: Cantonese phonemes included in CUCHILD.

Initial consonants   p pʰ t tʰ k kʰ kʷ kʷʰ ts tsʰ m ŋ f s h w j l
Long vowels          aː iː ɛː œː ɔː uː yː
Short vowels         ɐ ɪ ʊ ɵ
Diphthongs           ai ei ɐi ui ɔi au ɐu iu ou ɵy ɛu
Final consonants     -p -t -k -m -n -ŋ
Tones                High-level, Mid-rising, Mid-level, Mid-falling, Low-rising, Low-level
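To make the syllable structure described above concrete, the sketch below splits a transcribed syllable into onset, nucleus, coda and tone. The inventories follow Table 3 (the short-vowel row reflects our reading of it); the parser itself is an illustration, not a tool shipped with the corpus.

```python
# Sketch: splitting a transcribed Cantonese syllable (Table 3 notation) into
# onset + nucleus + coda + tone. Illustrative only; not part of CUCHILD.
import re

ONSETS = ["kʷʰ", "tsʰ", "kʷ", "pʰ", "tʰ", "kʰ", "ts",
          "p", "t", "k", "m", "ŋ", "f", "s", "h", "w", "j", "l"]
NUCLEI = ["aː", "iː", "ɛː", "œː", "ɔː", "uː", "yː",                          # long vowels
          "ai", "ei", "ɐi", "ui", "ɔi", "au", "ɐu", "iu", "ou", "ɵy", "ɛu",  # diphthongs
          "ɐ", "ɪ", "ʊ", "ɵ"]                                                # short vowels
CODAS = ["p", "t", "k", "m", "n", "ŋ", ""]

def parse_syllable(s):
    """Return (onset, nucleus, coda, tone) or None if s is not well formed."""
    m = re.match(r"^(.*?)(\d{2})$", s)       # tone = trailing two digits, e.g. 55, 25
    if not m:
        return None
    base, tone = m.groups()
    for onset in sorted(ONSETS, key=len, reverse=True) + [""]:
        if not base.startswith(onset):
            continue
        rest = base[len(onset):]
        for nucleus in sorted(NUCLEI, key=len, reverse=True):
            if rest.startswith(nucleus) and rest[len(nucleus):] in CODAS:
                return onset, nucleus, rest[len(nucleus):], tone

print(parse_syllable("tʰɔːŋ25"))   # 'candy' from Figure 1(a) -> ('tʰ', 'ɔː', 'ŋ', '25')
```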
Upon the collection of the speech samples, the HKCAT results charted by the research assistants were validated by the supervisor, partially onsite and entirely at the laboratory with reference to the audio and audio-visual recordings. An analysis of the screening results from face-to-face analysis, audio recordings and audio-visual recordings found no significant difference in HKCAT scores among the different modes of judgement [11]. The HKCAT scores provide important information about the children's speech sound inventory, speech sound errors and patterns, and serve as a reference transcription of the collected speech data for the 223 syllables in CUCHILD. Age-appropriate errors made by typically developing children, age-inappropriate phonological processes produced by children with suspected speech sound disorder, and articulation errors are all included. The reference transcription is used to categorise the collected speech data into accurate pronunciations and expected erroneous speech collected from typically developing (TD) children, and unexpected erroneous speech collected from children with disordered speech (DS). Thus, in addition to full coverage of all Cantonese phonemes, the speech data in CUCHILD exhibit a spectrum of phonological processes and articulation errors that are typically or atypically found in Cantonese-speaking children aged 3;0 to 6;11. The pre-processed information and manual annotation of the speech data can be used as training data for speech recognition systems as well as for the other proposed functions and applications.

Figure 1: Samples of stimuli: Cantonese words with 1-4 syllables. (a) "tʰɔːŋ25" (candy); (b) "fʊŋ55 siːn33" (fan); (c) "hɔːn33 pou25 paːu55" (hamburger); (d) "tsʰiːu55 kʰɐp55 siː tsʰœːŋ21" (supermarket).

Figure 2: Formant analysis of the Cantonese vowels over different age groups, illustrated by F1-F2 scatter plots: (a) age 3;0-3;11; (b) age 4;0-4;11; (c) age 5;0-5;11; (d) age 6;0-6;11. The legend (from top to bottom) represents the vowels [iː uː œː yː ɔː ɛː].
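To make the annotation scheme concrete, the sketch below shows one way a per-utterance record could be organized. The field names and values are our assumptions for illustration, not the corpus's released format.

```python
# Sketch of a per-utterance record mirroring the annotation described above.
# Field names/values are illustrative assumptions, not the released format.
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    speaker_id: str
    age: str                       # "years;months", e.g. "4;7"
    gender: str                    # "M" or "F"
    group: str                     # "TD" or "DS" (disordered speech)
    target_word: str               # one of the 130 stimulus words
    target_syllables: List[str]    # reference transcription (223-syllable set)
    produced_syllables: List[str]  # the child's actual production, per the raters
    category: str                  # "accurate" | "expected_error" | "unexpected_error"

rec = Utterance("C0123", "4;7", "F", "TD", "candy",
                ["tʰɔːŋ25"], ["tʰɔːŋ25"], "accurate")
```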
3. Acoustical Analysis
Acoustical analysis of child speech aims to provide a better understanding of the developmental changes in acoustic patterns.

Table 4: Number of speakers used in the acoustic analysis.

Age (years;months)   3;0-3;11   4;0-4;11   5;0-5;11   6;0-6;11
Male                 -          -          -          -
Female               13         32         34         20
Clinically, the findings can provide a reference for each speaker group for assessment purposes. In this section, we measure the fundamental frequency (F0) and the first two formants (F1, F2) of the Cantonese long vowels [iː yː ɛː œː aː ɔː uː] using automatic F0 and formant tracking algorithms. The long vowels are taken from monosyllabic words with the syllable structure (C)Vː. The effect of lexical tone is not considered in this study.

A subset of speech data from TD speakers is selected, as summarized in Table 4. The audio signals are down-sampled from 44.1 kHz to 16 kHz and converted to single-channel signals. Each target word in the recording is manually segmented and transcribed by trained research assistants using the software WaveSurfer [12]. To locate the vowel segments for subsequent analysis, forced alignment is applied to the speech data with a GMM-HMM triphone acoustic model. The acoustic model is trained on 13-dimensional Mel-frequency cepstral coefficients (MFCCs) and their first- and second-order derivatives, extracted every 10 ms with a 25 ms Hamming window. Linear discriminant analysis (LDA), the semi-tied covariance (STC) transform and feature-space maximum likelihood linear regression (fMLLR) are also applied in triphone model training [13][14][15]. The acoustic modeling and forced alignment are implemented with the Kaldi speech recognition toolkit [16]. Vowel segments shorter than a minimum duration are excluded from the analysis. F0 and formant frequencies are estimated with Praat, using the autocorrelation method and linear predictive analysis with Burg's algorithm respectively [17][18][19][20].

Child speech is known to have higher F0 and formant frequencies than adult speech, and the wide spacing of harmonic peaks makes the analysis more difficult [21]. To avoid erroneous estimation of formant frequencies, the ceiling values of the formant search range are set empirically for the front vowels [iː yː ɛː œː], the central vowel [aː] and the back vowels [ɔː uː] separately, and a maximum of five formants (F1-F5) is estimated in each analysis frame. For F0 estimation, the pitch floor is set to suit child voices.

Each vowel segment consists of a number of analysis frames, from each of which F0 and formant frequencies are extracted; the median values over all frames represent the whole segment. The mean F0 values of male and female speakers are shown in Figure 3. As age increases, child speakers of both genders show a declining trend in F0. Boys generally have lower F0 than girls, but the difference is very small at every age.

Figure 3: Results of fundamental frequency analysis for different age groups and genders.

Estimation of formant frequencies exhibits frequent errors, especially when closely located formants cannot be separately identified. A data-cleansing procedure is therefore applied to make the statistical analysis more meaningful. Estimated raw values for each formant (F1-F3) are grouped according to vowel identity, age and gender; for each group, the mean and standard deviation are computed, and any measured value deviating from the group mean by more than a fixed multiple of the standard deviation is removed. The F1-F2 plots for the different age ranges are shown in Figure 2, with different vowels marked in different colors and vowel ellipses drawn to represent a confidence interval. It is known that the F1 value is closely related to tongue height, whereas F2 is determined mainly by the frontness or backness of the tongue body. The mean values of F1 and F2, as well as the mean durations, of the vowels [iː ɛː aː ɔː uː] for a younger and an older age group are shown in Table 5. Comparing the two age groups, there is a trend of decreasing F1 for all vowels as age increases; a similar observation applies to F2, except for [uː]. The vowel durations of the younger group are slightly longer than those of the older group.
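As an illustration of the measurement pipeline above, the following minimal sketch resamples a recording to 16 kHz mono and uses Praat through the Parselmouth interface [20] to estimate F0 with the autocorrelation method and formants with Burg's algorithm, taking per-segment medians. The file name, vowel boundaries, pitch range and single formant ceiling are assumptions for illustration; in the study itself, segment boundaries came from forced alignment and vowel-dependent ceilings were used.

```python
# Sketch of the F0/formant measurement described above. File name, segment
# times, pitch range and the single formant ceiling are assumptions; the
# actual study used forced-alignment boundaries and vowel-dependent ceilings.
import numpy as np
import librosa
import parselmouth

y, sr = librosa.load("child_0001.wav", sr=16000, mono=True)  # 44.1 kHz stereo -> 16 kHz mono
snd = parselmouth.Sound(y, sampling_frequency=sr)

# F0 via Praat's autocorrelation method; the range is chosen for child voices.
pitch = snd.to_pitch_ac(pitch_floor=120.0, pitch_ceiling=650.0)

# Formants via LPC (Burg's algorithm); at most five formants below the ceiling.
formant = snd.to_formant_burg(max_number_of_formants=5.0, maximum_formant=8000.0)

def segment_medians(t0, t1):
    """Median F0/F1/F2 over the analysis frames inside one vowel segment."""
    f0 = pitch.selected_array["frequency"]
    times = pitch.xs()
    voiced = f0[(times >= t0) & (times <= t1) & (f0 > 0)]
    grid = np.arange(t0, t1, 0.01)               # sample the formant tracks every 10 ms
    f1 = [formant.get_value_at_time(1, t) for t in grid]
    f2 = [formant.get_value_at_time(2, t) for t in grid]
    return (np.median(voiced) if voiced.size else float("nan"),
            np.nanmedian(f1), np.nanmedian(f2))

print(segment_medians(0.52, 0.71))               # hypothetical vowel boundaries (seconds)
```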
Table 5: Formant values and durations of five long vowels [iː ɛː aː ɔː uː], which are commonly used to represent the vowel loop (each cell: younger group / older group).

Vowel   F1 (Hz)     F2 (Hz)     Duration (s)
[iː]    - / 485     3129 / -    - / -
[ɛː]    - / 755     2777 / -    - / -
[aː]    - / -       - / -       - / -
[ɔː]    - / 768     1353 / -    - / -
[uː]    - / 534     1306 / -    - / -
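The data-cleansing step described before Table 5 lends itself to a short sketch: within each (vowel, age, gender) group, measurements far from the group mean are dropped. The column names and the 1.5-standard-deviation threshold are assumptions, since the exact multiple did not survive in our copy of the text.

```python
# Sketch of the per-group outlier removal applied to raw formant estimates.
# DataFrame columns and the 1.5-sigma threshold are illustrative assumptions.
import pandas as pd

def clean_formants(df, cols=("F1", "F2", "F3"), k=1.5):
    def keep(group):
        mask = pd.Series(True, index=group.index)
        for c in cols:
            mu, sd = group[c].mean(), group[c].std()
            mask &= (group[c] - mu).abs() <= k * sd
        return group[mask]
    return df.groupby(["vowel", "age", "gender"], group_keys=False).apply(keep)

# df_clean = clean_formants(pd.read_csv("formant_measurements.csv"))  # hypothetical file
```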
4. Applications of CUCHILD
In automatic speech recognition (ASR), the high diversity of acoustic properties and the limited language proficiency of child speech explain why statistical models trained on adult speech are not directly applicable to children. Child speech data are therefore necessary for developing ASR systems for child users, and the CUCHILD corpus is expected to address this need by providing a large amount of such data. Speaker diarization (SD), which aims to solve the "who speaks when" problem, is another important research topic with practical significance. Current SD systems are commonly trained on adult speech, and the spontaneity and phonetic variation of child speech make the extraction of speaker information difficult [22]. A high-performance SD system for child speech is expected to bring benefits in different aspects. For instance, an SD system can be used to analyze adult-child interaction and extract the child's speech from a conversation [23]. The extracted child speech can provide training data for ASR system development [24] or support the development of clinical assessment tools [25]. In addition, the analysis of adult-child speech interaction would be helpful for understanding children's typical or atypical social behaviours [26].
Speech sound disorder (SSD) is diagnosed when a child shows difficulties in the acquisition, production and perception of speech, making pronunciation errors that do not match the normal variation expected for his/her age [27]. Poor speech sound production skills are found to have significant impacts on social, emotional and academic development [28], and are associated with lower literacy outcomes [29][30] and a greater likelihood of reading disorders [31]. With a large amount of child speech data, automatic detection of phonological and articulation errors becomes feasible using machine learning approaches. Automatic detection tools are expected to accelerate the screening of children at risk of SSD, enabling early identification and intervention. In the long run, early intervention can have positive impacts on children's development, and thus reduce the load on the current healthcare system for children with special education needs. CUCHILD includes recordings of accurate productions and expected erroneous speech produced by TD children, as well as unexpected erroneous speech produced by children with disordered speech, and is designed to support the development and evaluation of such detection systems. Relevant works can be found in [32][33].
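As a toy illustration of the detection task (the systems in [32][33] are model-based; this is only a string-alignment sketch with hypothetical phoneme sequences), one can flag where a production deviates from the target transcription:

```python
# Toy sketch: flag substitutions/deletions/insertions by aligning a target
# phoneme sequence with the produced one. Hypothetical sequences; the real
# detection systems [32][33] are statistical, not rule-based.
from difflib import SequenceMatcher

def flag_errors(target, produced):
    """Return the edit operations that turn the target into the production."""
    ops = SequenceMatcher(a=target, b=produced).get_opcodes()
    return [(tag, target[i1:i2], produced[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]

# Velar fronting, a common phonological process: target /kʰ/ produced as [tʰ].
print(flag_errors(["kʰ", "ɐu"], ["tʰ", "ɐu"]))    # -> [('replace', ['kʰ'], ['tʰ'])]
```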
Children's acquisition of speech sounds can be investigated through large-scale population studies, as in [5][34]. Using an articulation test, subject-level statistics describe the overall picture of phonological acquisition and indicate developmental error patterns. Such studies often involve huge demands on manpower and professional costs, and take a long time for data collection, validation and drawing conclusions. Alternatively, child speech can be collected and analysed from the acoustic signal, which captures rich linguistic and speaker information. Findings from studies of acoustic features can bring new insights into the developmental changes of child speech, and inspire new approaches to differentiating atypical from healthy speech with automated systems. CUCHILD serves these purposes and supports the study of the acoustic properties of pre-school child speech.
5. Conclusion
In this paper, CUCHILD, a large-scale corpus of Cantonese child speech collected from 1,986 children aged 3;0 to 6;11, was presented. The corpus includes recordings of both typically developing children and children with disordered speech reading 130 Cantonese words of 1 to 4 syllables. All initial consonants, vowels, diphthongs, final consonants and lexical tones of Cantonese are covered. Acoustical analysis of a subset of the speech samples, including measurement of the fundamental frequency and the first three formants, was also presented. Future work with the corpus includes child speech recognition, speaker diarization, detection of speech sound errors and further spectral analysis.
6. Acknowledgements
This research was partially supported by a direct grant and a Research Sustainability Fund from the Research Committee of the Chinese University of Hong Kong, as well as financial support from the Hear Talk Foundation under the project titled "Speech Analysis for Cantonese Speaking Children".
7. References

[1] K. Shobaki, J.-P. Hosom, and R. A. Cole, "The OGI Kids' Speech corpus and recognizers," in Sixth International Conference on Spoken Language Processing, 2000.
[2] R. Cole, P. Hosom, and B. Pellom, "University of Colorado prompted and read children's speech corpus," Technical Report TR-CSLR-2006-02, University of Colorado, 2006.
[3] S. Lee, A. Potamianos, and S. Narayanan, "Acoustics of children's speech: Developmental changes of temporal and spectral parameters," The Journal of the Acoustical Society of America, vol. 105, no. 3, pp. 1455–1468, 1999.
[4] G. Yeung and A. Alwan, "On the difficulties of automatic speech recognition for kindergarten-aged children," in Proc. of Interspeech, 2018, pp. 1661–1665.
[5] C. K. To, P. S. Cheung, and S. McLeod, "A population study of children's acquisition of Hong Kong Cantonese consonants, vowels, and tones," Journal of Speech, Language, and Hearing Research, 2013.
[6] P. Kothalkar, J. Rudolph, C. Dollaghan, J. McGlothlin, T. Campbell, and J. H. Hansen, "Fusing text-dependent word-level i-vector models to screen at-risk child speech," in Proc. of Interspeech, 2018, pp. 1681–1685.
[7] P. B. Ramteke, S. Supanekar, P. Hegde, H. Nelson, V. Aithal, and S. G. Koolagudi, "NITK Kids' Speech Corpus," in Proc. of Interspeech, 2019, pp. 331–335.
[8] R. S. Bauer and P. K. Benedict, Modern Cantonese Phonology. Walter de Gruyter, 2011, vol. 102.
[9] T. Lee, W. K. Lo, P. Ching, and H. Meng, "Spoken language resources for Cantonese speech processing," Speech Communication, vol. 36, no. 3-4, pp. 327–342, 2002.
[10] P. Cheung, A. Ng, and C. To, "Hong Kong Cantonese Articulation Test," Hong Kong: Language Information Sciences Research Centre, 2006.
[11] W. Y. Ng, Y. S. Lee, T. Lee, and C. F. Tong, "A comparison of the screening results of speech sound disorders with asynchronous versus face-to-face analysis," in Hong Kong Speech and Hearing Symposium (HKSHS), 2018.
[12] K. Sjölander and J. Beskow, "WaveSurfer: an open source speech tool," in Sixth International Conference on Spoken Language Processing, 2000.
[13] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, 2012.
[14] M. J. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 272–281, 1999.
[15] M. J. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.
[16] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[17] P. Boersma, "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound," in Proceedings of the Institute of Phonetic Sciences, vol. 17, no. 1193. Amsterdam, 1993, pp. 97–110.
[18] N. Andersen, "On the calculation of filter coefficients for maximum entropy spectral analysis," Geophysics, vol. 39, no. 1, pp. 69–72, 1974.
[19] P. Boersma and D. Weenink, "Praat: doing phonetics by computer [computer program], version 6.0.37."
[20] Y. Jadoul, B. Thompson, and B. De Boer, "Introducing Parselmouth: A Python interface to Praat," Journal of Phonetics, vol. 71, pp. 1–15, 2018.
[21] R. D. Kent and H. K. Vorperian, "Static measurements of vowel formant frequencies and bandwidths: A review," Journal of Communication Disorders, vol. 74, pp. 74–97, 2018.
[22] J. Xie, L. P. Garcia-Perera, D. Povey, and S. Khudanpur, "Multi-PLDA diarization on children's speech," in Proc. of Interspeech, 2019, pp. 376–380.
[23] P. V. Kothalkar, D. Irvin, Y. Luo, J. Rojas, J. Nash, B. Rous, and J. H. Hansen, "Tagging child-adult interactions in naturalistic, noisy, daylong school environments using i-vector based diarization system," in Proc. SLaTE, 2019, pp. 89–93.
[24] J. Wang, S. I. Ng, D. Tao, W. Y. Ng, and T. Lee, "A study on acoustic modeling for child speech based on multi-task learning," in Proc. of ISCSLP, 2018, pp. 389–393.
[25] M. Shahin, U. Zafar, and B. Ahmed, "The automatic detection of speech disorders in children: Challenges, opportunities and preliminary results," IEEE Journal of Selected Topics in Signal Processing, 2019.
[26] J. H. Hansen, M. Najafian, R. Lileikyte, D. Irvin, and B. Rous, "Speech and language processing for assessing child-adult interaction based on diarization and location," International Journal of Speech Technology, vol. 22, no. 3, pp. 697–709, 2019.
[27] World Health Organization, "International classification of diseases 11th revision: The global standard for diagnostic health information," https://icd.who.int/en/, 2010.
[28] E. R. Hitchcock, D. Harel, and T. M. Byun, "Social, emotional, and academic impact of residual speech errors in school-aged children: A survey study," in Seminars in Speech and Language, vol. 36, no. 04. Thieme Medical Publishers, 2015, pp. 283–294.
[29] M. S. Overby, G. Trainin, A. B. Smit, J. E. Bernthal, and R. Nelson, "Preliteracy speech sound production skill and later literacy outcomes: A study using the Templin archive," Language, Speech, and Hearing Services in Schools, 2012.
[30] B. A. Lewis, A. A. Avrich, L. A. Freebairn, A. J. Hansen, L. E. Sucheston, I. Kuo, H. G. Taylor, S. K. Iyengar, and C. M. Stein, "Literacy outcomes of children with early childhood speech sound disorders: Impact of endophenotypes," Journal of Speech, Language, and Hearing Research, 2011.
[31] R. L. Peterson, B. F. Pennington, L. D. Shriberg, and R. Boada, "What influences literacy outcome in children with speech sound disorder?" Journal of Speech, Language, and Hearing Research, 2009.
[32] S. I. Ng, D. Tao, J. Wang, Y. Jiang, W. Y. Ng, and T. Lee, "An automated assessment tool for child speech disorders," in Proc. of ISCSLP, 2018, pp. 493–494.
[33] J. Wang, Y. Qin, Z. Peng, and T. Lee, "Child speech disorder detection with Siamese recurrent network using speech attribute features," in Proc. of Interspeech, 2019, pp. 3885–3889.
[34] L. K. So and B. J. Dodd, "The acquisition of phonology by Cantonese-speaking children," Journal of Child Language, vol. 22, no. 3, pp. 473–495, 1995.