CASS: Towards Building a Social-Support Chatbot for Online Health Community
Liuping Wang, Dakuo Wang, Feng Tian, Zhenhui Peng, Xiangmin Fan, Zhan Zhang, Shuai Ma, Mo Yu, Xiaojuan Ma, Hongan Wang
LIUPING WANG∗, Institute of Software, Chinese Academy of Sciences and University of Chinese Academy of Sciences, China
DAKUO WANG∗, IBM Research, USA
FENG TIAN†, Institute of Software, Chinese Academy of Sciences, China
ZHENHUI PENG, The Hong Kong University of Science and Technology, Hong Kong
XIANGMIN FAN, Institute of Software, Chinese Academy of Sciences, China
ZHAN ZHANG, Pace University, USA
SHUAI MA, The Hong Kong University of Science and Technology, Hong Kong
MO YU, IBM Research, USA
XIAOJUAN MA, The Hong Kong University of Science and Technology, Hong Kong
HONGAN WANG, Institute of Software, Chinese Academy of Sciences, China

Chatbot systems, despite their popularity in today's HCI and CSCW research, fall short for one of two reasons: 1) many of the systems use a rule-based dialog flow, so they can only respond to a limited number of pre-defined inputs with pre-scripted responses; or 2) they are designed with a focus on single-user scenarios, so it is unclear how these systems may affect other users or the community. In this paper, we develop a generalizable chatbot architecture (CASS) to provide social support for community members in an online health community. The CASS architecture is based on advanced neural network algorithms, so it can handle new inputs from users and generate a variety of responses to them. CASS is also generalizable, as it can be easily migrated to other online communities. With a follow-up field experiment, CASS proves useful in supporting individual members who seek emotional support. Our work also contributes to filling the research gap on how a chatbot may influence the whole community's engagement.

CCS Concepts: •
Human-centered computing → Computer supported cooperative work.

Additional Key Words and Phrases: chatbot; bot; pregnancy; healthcare; AI deployment; online community; social support; peer support; emotional support; machine learning; neural network; system building; conversational agent; human AI collaboration; human AI interaction; explainable AI; trustworthy AI

∗Both authors contributed equally to this research.
†Corresponding author

© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
https://doi.org/10.1145/3449083
Proc. ACM Hum.-Comput. Interact., Vol. 5, No. CSCW1, Article 9. Publication date: April 2021.
ACM Reference Format:
Liuping Wang, Dakuo Wang, Feng Tian, Zhenhui Peng, Xiangmin Fan, Zhan Zhang, Shuai Ma, Mo Yu, Xiaojuan Ma, and Hongan Wang. 2021. CASS: Towards Building a Social-Support Chatbot for Online Health Community.
Proc. ACM Hum.-Comput. Interact. 5, CSCW1, Article 9 (April 2021), 31 pages. https://doi.org/10.1145/3449083
Chatbot systems have been increasingly adopted in many fields (e.g., healthcare [30], human resources (HR) [64], and customer service [122]), since the first chatbot system—ELIZA—emerged in 1964 to provide consulting sessions as a computer therapist [115]. In recent years, an increasing number of chatbot systems are being developed in various research labs and companies with a premise that these systems can have more powerful capabilities and support more user scenarios [21, 44, 103]. For example, Hu et al. [44] built an experimental chatbot system that can understand the tones in a text input (e.g., sad or polite) and generate responses with an appropriate tone.

Following these system development efforts, many recent Human-Computer Interaction (HCI) and Computer-Supported Cooperative Work (CSCW) studies have examined various aspects of chatbots from the end users' perspective, such as human-in-the-loop chatbot design [21], user perception of chatbots [18], playful usage of chatbots [64], and human trust in chatbots [49]. However, most of these studies have inherent limitations: 1) many chatbot systems (e.g., [64, 123]) use a rule-based architecture, which makes the chatbot capable of understanding only a limited number of user inputs and responding with pre-scripted sentences, hindering its generalization; and 2) most chatbots are deployed and tested only in single-user scenarios, and how these systems interact with and impact a group of users (or a community) is understudied.

The first limitation of current chatbots – only returning a pre-defined list of responses to a user – is partially caused by the use of traditional heuristic rule-based algorithms or information retrieval techniques [75]. Even with the help of advanced chatbot-development toolkits (e.g., Microsoft Cognitive Service [74] and IBM Watson [47]), a chatbot can only use neural-network-based (NN) approaches to understand the text, but its responding function is still limited to a rule-based selection process.
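To make this limitation concrete, the rule-based selection process can be sketched in a few lines; the keywords and replies below are invented for illustration and come from no deployed system:

```python
# Toy rule-based responder: intent keywords map to pre-scripted replies,
# so any input outside the rule list falls through to a canned default.
RULES = {
    ("sad", "stressed", "worried"): "I'm sorry to hear that. Stay strong!",
    ("due date", "trimester"): "Please see the pregnancy timeline FAQ.",
}
DEFAULT_REPLY = "Sorry, I don't understand."

def respond(text: str) -> str:
    lowered = text.lower()
    for keywords, reply in RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return reply
    return DEFAULT_REPLY

print(respond("I feel so stressed today"))  # matches the first rule
print(respond("Look at my new stroller!"))  # unseen input: generic default
```

However elaborate the rule table becomes, the responder can only select from it; an NN-based generator, by contrast, can produce a novel response conditioned on the input text.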
In this work, we propose an NN-based chatbot architecture with which a chatbot system can accurately handle unseen questions and generate various forms of responses with the same meaning. This architecture is designed to have high scalability and generalizability, so that other researchers and developers can take our code (we have made the code repository open source: https://github.com/liupingw/CASS-Framework), provide it with a cleaned and labeled training dataset, retrain it, and deploy it for different online communities. Inspired by previous literature [117], we also build a human-in-the-loop module so a human operator can monitor and intervene in the fully automated architecture, if needed.

The second research gap is that most of today's HCI and CSCW research primarily focuses on an individual user's interaction with and perception of a chatbot system [31, 46, 58, 64, 94, 122], and it is unclear how a chatbot system may affect a group of users or a community. In this paper, we aim to address this research gap by building and deploying a social-support chatbot system, CASS (ChAtbot for Social Support), and evaluating its impact on the individuals who need social support as well as on the other members in the community. Motivated by existing literature showing that online communities often suffer low member engagement because many posts cannot get a timely response, our chatbot's primary function is to engage in conversations with those un-replied social-support-needing posts. (In this paper, we consider functional bots, e.g., Twitter bots [34] and Wikipedia bots [50], out of our research scope, as these bots' primary function focuses on completing tasks, e.g., broadcasting information or editing an article, as opposed to communicating with users through a flow of conversation, which is the primary feature of chatbots.) CASS automates the entire end-to-end process: retrieving
new posts from the community, classifying these posts as with (or without) social-support needs, and generating appropriate responses to the posts with social-support needs.

This study is exploratory in nature. We choose an online pregnancy healthcare community as our research site (
YouBaoBao), because this type of online community has been extensively studied in recent research [4, 37], allowing us to leverage this existing knowledge to inform the system design and implementation. The whole research project consists of three parts, and we organize this paper following the order of these three studies.

Study 1 is an empirical exploration, where we conduct both qualitative content analysis and descriptive statistical analysis to understand the context of the research site. A qualitative open coding of the posts suggests that there are three primary categories of posts in YouBaoBao: Emotional-Support Seeking, Informational-Support Seeking, and Sharing Daily Life.
The quantitative statistical analysis also shows that half of the posts in this community cannot get a timely response, which may put the already-stressful community members (i.e., pregnant women) at higher risk [37].

To provide social support to this community, in Study 2 we build the CASS chatbot system using the proposed generalizable NN-based architecture. The CASS design is tailored based on the findings from Study 1, so that it can understand which posts need social support, and which posts count as un-replied cases that need immediate intervention. In Study 2, we also adopt standard evaluation practices (using both automated metrics and human evaluators) to show that the core NN-based algorithms' performance is satisfactory.

Finally, in Study 3, CASS is deployed back to the YouBaoBao community, with a human-in-the-loop module to filter inappropriate AI-generated responses. Through a 7-day field experiment, the results show that the CASS system can provide the desired emotional support to the individual community members who need help. In addition, we find evidence that the deployment of CASS indeed has positive impacts on other members and the entire community, as other community members are more likely to participate in the conversations in which the chatbot intervenes.
This finding suggests that such a social-support chatbot can not only support individual members' well-being, but also improve the community engagement level, which is an important dimension according to McGrath's TIP theory of groups [73].

In summary, this paper makes the following contributions:

• An empirical understanding of the challenges and user needs in an online pregnancy healthcare community, and how these findings can be used to tailor a chatbot system design;
• A scalable and generalizable chatbot development architecture, with which researchers and developers can easily build a fully automated chatbot system with NN-based models to be deployed in another online community;
• Insights and recommendations for designing and deploying future chatbot systems to interact or collaborate with humans in the context of online communities.
The literature review is divided into three subsections: we first review selected HCI work on social support scenarios in online health communities. Then, we focus on the literature about human and chatbot interaction. Lastly, we turn to the literature that specifically addresses challenges and issues of chatbot system deployment in the real world.

https://youbaobao.meiyou.com/

Online communities have been a longstanding research topic for HCI and CSCW researchers (e.g., [10, 20, 27, 37, 72, 99, 104, 105, 114, 118, 125, 127, 134]). Existing studies have looked at a variety of topics, including community structure [54, 88, 95], community activities [7, 37, 45, 54, 65, 71, 116], members' commitment and contribution [87, 124], engaging newcomers [53, 55, 86], rewarding mechanism design [40, 54, 88], and the cold-start problem [88]. Recently, a number of studies have focused on a special type of online community – the online health community for pregnant women [23, 24, 28, 37, 42]. This is a special group of users. In addition to the significant changes in their bodies, their mental state also changes a lot over the trajectory of their pregnancy. They often have a much higher stress level than before getting pregnant, so their mental health is at high stake [12, 48, 121, 128]. Banti et al. [11] reported that 12.4% of pregnant women presented some depression symptoms during pregnancy, and 9.6% of them encountered depression in the postpartum period.

It is known that pregnant women often go to online health communities to seek social support from peers [14, 29, 57]. Previous literature reveals that members in such health communities actively seek social support from others, and they are also willing to volunteer their time to provide social support to other help seekers [25, 126].
Prior research roughly divided social support into two categories: informational social support and emotional social support [15, 22, 114]. Informational support refers to posters seeking information or knowledge about the course of their disease, treatments, side effects, communication with physicians, and other burdens (e.g., financial problems) [114]. Emotional support refers to posters seeking encouragement and empathy when experiencing an emotional disturbance [37]. In this project, as an illustration, our chatbot system focuses on providing non-informational social support to community members.

It is often difficult to motivate community members to actively reply to other members' posts in a timely manner [3, 61, 70, 77, 107]. Seminal research has explored various ways to solve this problem [84]. In more traditional online communities (e.g., Wikipedia), researchers have attempted to stimulate members' intrinsic and extrinsic motivations [3] with monetary rewards [61] or virtual badges and reputation rewards [70]. In online health communities, when a user publishes a support-seeking post, he/she is recommended to use simpler language and express the needs more explicitly [8]. It is also suggested that posts with more detailed user profile information and a photo are more likely to get replies [9]. Even so, many posts may never get a reply. For example, Wang et al. [113] reported that at least 10% of posts in an online community never received a response.

When a pregnant woman publishes a support-seeking post and never gets a response, it may cause more severe harm to the user and to the community. Because pregnant women are already stressed, overlooking their support seeking may make things worse [62, 79].
Furthermore, when community members constantly fail to get the needed social support, they are less likely to contribute to the community, so the community engagement level decreases accordingly; even worse, the members may leave the community over time [55, 114, 132].

In this paper, we will illustrate how to leverage the latest AI technology to build a chatbot system that can automatically detect non-informational support-seeking posts, and respond to them with appropriate sentences. The system architecture is scalable and generalizable, so it can be easily migrated to other online communities.
Chatbots are an increasingly popular research topic in recent years. Most of today's chatbot systems are built to interact with a single user [31, 46, 58, 64, 94, 122]. For example, the famous ELIZA
and its successors can provide individual cognitive therapy sessions to users with the purpose of relieving their stress and anxiety, as well as helping them gain self-compassion [31, 58, 94]. Many commercial chatbots in the customer service domain are designed to answer customers' frequently asked questions or perform a simple function (e.g., checking a bank account balance) [46, 122]. There are also human resources (HR) chatbots serving as process guides to lead new employees through their onboarding process [64]. There are also some chatbot applications in the healthcare domain that can help patients better understand their symptoms [30].

Only recently have a number of researchers started to explore and build chatbots that can interact with a group of people [2, 21, 66, 96, 98, 103, 130]. For example, Toxtli et al. [103] built a chatbot as a group chat facilitator to assign tasks to group members. Zhang et al. [130] developed a chatbot for an online communication application to automatically summarize group chat messages. Cranshaw et al. [21] developed a chatbot system that can serve as an assistant to coordinate meetings for multiple people via email. All these studies expand the chatbot user scenario from supporting a single user to multiple users. Notably, Seering et al. [96] recently built a chatbot in an online gaming community on Twitch. They designed four different versions of the chatbot – "baby", "toddler", "adolescent", and "teenager" – to simulate a chatbot's growing-up process in a 3-week deployment.
However, the user interaction with the chatbot is quite primitive, in that users need to input command-line text (e.g., "@Babybot" or "!feed"); thus, it is more like a digital pet (tamagotchi) [85] than a chatbot.

Along this research line, our work builds a chatbot system that can provide emotional support to pregnant women in an online community; we also design a field experiment to reveal findings on how such a deployment of the chatbot can impact the whole community. Different from Seering's work [96], where they deployed the "digital pet" into a Twitch community and users can have chitchat with the system, we aim to build a chatbot to meet community members' existing needs – social-support seeking. We hope our chatbot can provide functional benefits to the users through conversational communication. Another difference is that Seering's system [96] used an information retrieval (IR) approach, and in turn, it could only return limited responses that were pre-defined by the researchers. The rule-based approach (e.g., IR) constrains the potential of a chatbot in providing social support for a community [63], and users may perceive the responses from the chatbot as not as useful as those from people [75]. In contrast, we build an architecture that leverages state-of-the-art NN-based models for the development of more powerful chatbot systems.
While reviewing recent work on building and deploying chatbots [32, 38, 93], we found that although many of these chatbots were designed with good will, the deployment of certain systems may negatively impact the stakeholders or the intended users.

One example is the chatbot system developed by [96] and deployed on Twitch. It designs a novel interaction approach in which users can "raise" the chatbot as a pet through a number of commands, such as "!feeding". However, the deployment of such a chatbot may distract users' attention from their original goal of using the platform, which is to watch videos and socialize with the host and the other community members. Thus, with the chatbot, users may engage with the platform but not with each other. Such close bonding with a chatbot may even hurt the individual user's benefits in the long term and also negatively impact the community's engagement level [131]. (A report says some users have interacted with Microsoft's XiaoIce for more than six hours without rest, and treated XiaoIce as a girlfriend [131].)
Unexpected consequences of a chatbot's deployment are not uncommon when putting AI and machine learning systems to practical use. Often, today's NN-based algorithm research requires a large amount of data. Such data may come from an online community (e.g., Reddit), and they need to be tagged with ground-truth labels by human annotators. However, the benefit of the original human annotators may be neglected during the training and deployment of the algorithm, as the algorithm's performance and optimization is the developer's most important objective. For example, developing a functional customer service chatbot needs a significant amount of training data labeled by human customer service experts or obtained from their practices (e.g., [44, 122]). But the deployment of such chatbots may cause companies to hire fewer customer service workers in the future, which may in turn reduce the sources of training data.

Fortunately, some HCI researchers have noticed this challenge and they jointly work on an emerging research topic –
Human-Centered AI – that aims to take an algorithm's impact on human users into consideration in the algorithm design [17, 38, 39, 59, 90, 119, 133]. For example, Woodruff et al. interviewed 44 participants from several marginalized populations in the United States. Participants indicated that algorithmic fairness (or lack thereof) could substantially affect their trust in a company or product, if such an application were deployed [119]. Thus, such fairness considerations should be taken into account during the algorithm design [59, 90]. In addition to fairness, various other design considerations may also influence the eventual consequences of an AI system in real-world deployment, such as stakeholders' tacit knowledge [133] and community involvement [60]. It is also important to build an in-depth understanding of user needs or even involve users in the design process, as exemplified by a few recent works adopting participatory design research methods [17, 38, 60].

In addition to incorporating these various considerations into an AI system's design, there are also some good practices that one can follow in the deployment of an AI system [32, 38, 39]. For example, a group of researchers designed and developed ORES – an algorithmic scoring service that supports real-time scoring of wiki edits using multiple independent classifiers trained on different datasets. That work offers an example of deploying AI algorithms in an online community, but their algorithm is less explicit to the users, compared to the chatbot systems that we aim to develop and deploy in this paper.

In this paper, we design and develop a chatbot system that can provide social support for an online health community. Besides the objective of ensuring good functional performance, we are also interested in evaluating its potential impacts on the online community after its deployment. This work joins the recent
Human-AI Collaboration research effort [30, 106, 109, 123] that aims to develop and deploy AI systems that can work together with people, instead of replacing people. It differs from the Human-AI Interaction discussion [6], as it goes beyond the usability and interactive design of AI systems, and focuses more on the cooperative nature of AI systems with human partners and their context (e.g., [36, 43]).
In this subsection, we first provide an overview of the research context—YouBaoBao—one of the largest and most popular online health communities for pregnancy and parenting discussions in China. Users are often pregnant women or couples that are expecting a baby. They can discuss a variety of topics on this platform, such as pregnancy, childbirth, childcare, and early education. This community has 164 sub-forums (similar to "sub-reddits" in Reddit), such as specific sub-forums for different stages of pregnancy: "First Trimester", "Second Trimester", and "Third Trimester". YouBaoBao is a self-organized health community, where community members can publish a post to
Fig. 1. User interface of the YouBaoBao community. It has the original post section at the top, a response section in the middle, and an input field for typing a response at the bottom. The content is translated into English in this figure.

seek advice and information, and they can reply to a post to provide support and advice. Community members have the option to disclose their current pregnancy status, which is always displayed next to their user name. There are three pre-defined categories of pregnancy status from which a user can choose: preparation stage (e.g., "planning to have a baby"), three-trimester stage (e.g., "due in 8 months"), and postpartum stage (e.g., "having a 4-month-old baby"). It is worth noting that members can post, browse, and reply in any of the sub-forums regardless of their labeled pregnancy status. For example, we saw instances of new moms posting and replying in the "first-trimester" sub-forum, even though they had already delivered.

A post or response has minimum and maximum word limits—the content has to be between 6 and 3000 words. When a post is created, the system can automatically recommend a sub-forum for it according to its content and the user's pregnancy status. Users can also manually modify the recommended sub-forum. Fig. 1 illustrates a typical post and the responses it received (a direct response to the post and a second-level response). The user profile avatar, user name, and pregnancy status are displayed right above the content. The system also shows the total number of responses for a post, and the number of second-level responses for a first-level response.
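The post and response structures described above can be captured in a small data-model sketch; all class and field names here are our own illustrative choices, not YouBaoBao's actual API:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical model of the community structures described above.

@dataclass
class Response:
    author: str
    pregnancy_status: Optional[str]  # disclosure is optional, e.g., "due in 8 months"
    content: str
    sub_responses: List["Response"] = field(default_factory=list)  # second-level replies

@dataclass
class Post:
    author: str
    pregnancy_status: Optional[str]
    sub_forum: str  # e.g., "First Trimester"; recommended by the system, editable by the user
    content: str    # platform rules: between 6 and 3000 words
    responses: List[Response] = field(default_factory=list)

    def total_responses(self) -> int:
        # Count first-level responses plus their second-level responses,
        # mirroring the counts shown in the UI.
        return len(self.responses) + sum(len(r.sub_responses) for r in self.responses)
```

A post with one first-level response that itself has one second-level response would report a total of 2 responses.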
To better understand what challenges the community faces and where a chatbot can help, we first conducted both descriptive statistical analysis and qualitative content analysis (following the method in [37]) to understand the characteristics of support-seeking activities in this community. Specifically, we explored the following research questions: What types of support do users (e.g., pregnant women) seek in this online community? How do community members respond to those support-seeking posts? What issues or barriers exist in communications and interactions in the community?
In order to gain a comprehensive view of support-seeking across different stages of pregnancy, we chose five sub-forums, each focusing on one specific pregnancy stage: "pregnancy preparation", "first-trimester", "second-trimester", "third-trimester", and "having a baby less than 6-month-old". We collected six months of data, from September 2018 to March 2019, through the community's application programming interface (API), and organized them into post-response pairs (N = 220,000). All the data we collected were publicly available, but we removed all identifiable personal information (e.g., user names and status) in consideration of research ethics. We also replaced images with a special icon if there were any images in a post's content. This data collection method is standard in the HCI [37, 125] and NLP research literature [101, 112]. This study was approved by the first author's university Institutional Review Board (IRB).

As this labeled dataset was later used for model training purposes, two coders removed malicious posts, such as advertisements, before performing any detailed analysis. We followed an iterative coding process and conducted axial coding: two coders first independently coded a group of 200 randomly-selected posts to identify 11 themes and organized these themes into categories. Then, the research team met and discussed the coding schema and differences in coding to resolve all disagreements. After that, the same coders conducted another round of independent coding with a group of 300 randomly-sampled new posts, and discussed newly emerged categories and disagreements in coding. They repeated this process until there were no newly emerged categories or inconsistent codings. In total, they repeated three iterations, each time analyzing 300 new posts (200 + 3*300 = 1,100 posts). In the end, we randomly sampled another 2,300 posts for computing the coding reliability score.
Based on the agreed coding schema, the two coders independently coded these 2,300 new posts and reached a high inter-rater reliability (Cohen's kappa = 0.86).
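For reference, Cohen's kappa can be computed directly from the two coders' parallel label lists; a minimal sketch, where the toy labels are illustrative and not the actual coded data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items the coders labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each coder's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy labels using the three Study 1 categories (illustrative only).
coder1 = ["emotional", "informational", "daily", "informational", "emotional"]
coder2 = ["emotional", "informational", "daily", "emotional", "emotional"]
print(cohens_kappa(coder1, coder2))
```

Kappa discounts the agreement two coders would reach by chance, which is why it is preferred over raw percent agreement for reporting coding reliability.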
The descriptive statistical analysis helped us understand the baseline of user behaviors in the community. These findings were later used to guide the design decisions of the CASS system. More specifically, we observed that, on average, the users in this community posted 5,000 posts per day, and each post had an average of 6 responses. However, approximately 18% of the posts did not receive any response. To further investigate what these posts are and how they are replied to, we randomly sampled another 60,000 posts out of the collected 220,000 samples for further analysis. We calculated the
Interval Time between when the original post was published and when the first response was made. As shown in Fig. 2, the median interval time is 10 minutes. Later, we used this number in Study 3 as a threshold to evaluate whether a post gets a timely response after the deployment of our chatbot system. We also analyzed the time span from when the original post was published to when its last comment was made. This indicator reflects the retention of a post and its discussion thread in an online community (e.g., how long a post remains "alive" to the community members). We name this the
Lifespan of a post. We found that the average lifespan for a post thread is about
Fig. 2. Distribution of the Interval Time between when a post is published and when its first response is published. The median is 10 minutes.
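The two indicators above can be computed from timestamped post-response pairs; a minimal sketch, where the pair layout and the timestamps (in seconds) are invented for illustration:

```python
from statistics import median

# Illustrative post-response pairs: (post_id, post_timestamp, response_timestamp).
pairs = [
    ("p1", 0, 300), ("p1", 0, 4200),
    ("p2", 100, 700),
    ("p3", 50, 650), ("p3", 50, 86450),
]

post_time, first_resp, last_resp = {}, {}, {}
for post_id, post_ts, resp_ts in pairs:
    post_time[post_id] = post_ts
    first_resp[post_id] = min(resp_ts, first_resp.get(post_id, resp_ts))
    last_resp[post_id] = max(resp_ts, last_resp.get(post_id, resp_ts))

# Interval Time: minutes until a post's first response (the Study 3 threshold).
intervals = [(first_resp[p] - post_time[p]) / 60 for p in post_time]
# Lifespan: minutes from a post to its last response.
lifespans = [(last_resp[p] - post_time[p]) / 60 for p in post_time]

print(median(intervals), median(lifespans))  # → 10.0 70.0
```

The same aggregation over the 60,000 sampled posts yields the distribution shown in Fig. 2.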
The content analysis of posts led to the identification of 11 concepts, which were then grouped into 3 high-level categories, including
Informational-Support Seeking, Emotional-Support Seeking, and Sharing Daily Life. As shown in Table 1, 27.45% of the posts were related to seeking emotional support or sharing emotional status, 45.95% of the posts were about seeking informational support, and 26.60% of them focused on sharing posters' daily life.

Emotional-Support Seeking refers to posters expressing their frustration and stress (e.g., complaining about something that happened in their work or life), or sharing positive news (e.g., announcing the confirmation of pregnancy) to seek encouragement or empathy. Informational-Support Seeking refers to seeking information, knowledge, advice, and suggestions from others to manage a situation (e.g., how due dates are calculated). Sharing Daily Life refers to posts where community members post a photo of food or exercise in a gym, or share updates and progress of their pregnancy.

The resulting categories are similar to previous findings [37, 125], confirming that our research site (YouBaoBao) has similar characteristics to other popular pregnancy-related online communities (such as BabyCenter and TheBump). More importantly, the analyses of posts helped us understand the types and nature of emotional support sought by users. Thus, our primary goal for the chatbot is to provide non-informational support (both emotional support and sharing daily life) to the posting users. Later in Section 4.1.2, we will illustrate how we translate this learned contextual knowledge into parameters, and feed the labeled data to train the algorithm models in the system architecture.

Based on the literature [37, 97, 102] and the knowledge gained from Study 1, we learned that one major challenge faced by online health communities is that many posts cannot get a timely response (e.g., more than 18% never got a response); for the posts that have responses, it often takes more than 10 minutes to get the first response.
In the YouBaoBao context, this challenge is particularly frustrating considering that the pregnant women are already at a high stress level. To address this prominent challenge, we built a chatbot system to explore the feasibility of a chatbot for providing timely responses to support seekers. Because the responses to the emotional-support-seeking posts and to the sharing-daily-life posts overlap (both can be answered with compliments or empathy), our text generation module can handle them in the same way. We combine these two non-informational support-seeking categories, and train the models to distinguish them from the informational-support-seeking posts.

Table 1. Axial coding result with 3 Categories and 11 Concepts for original posts. (N = 2,300, Cohen's kappa = 0.86)

Category: Emotional support seeking and emotion sharing (N = 631, 27.45%)
• Complain about work, life, family, etc.
• Have a fantasy conversation with the baby to be born
• Share happiness
• Make a good wish

Category: Informational support seeking (N = 1,057, 45.95%)
• Seek health-related information
• Seek personal opinions from peers about pregnancy
• Seek opinions and suggestions on baby-related issues

Category: Sharing daily life (N = 612, 26.60%)
• Share food and cuisine
• Share sports and other activities
• Share updates of her body along with pregnancy
• Share updates and progress of the baby or fetus

In this section, we will introduce how we build the system, what design decisions we made along the implementation process, and lastly how it performs on common natural language processing (NLP) performance metrics and human judgement in an offline evaluation. (Offline evaluation is in contrast to an online evaluation approach: offline evaluation uses machine learning testing or a holdout subset of the data to evaluate the performance of a model or system, whereas online evaluation means deploying the model back to the application environment.)

In this study, we propose a system architecture that works in the following five steps (Fig. 3), and we build a chatbot system, CASS, using this architecture:

• Step 1. Collecting and monitoring new posts (Input interface: an API URL to fetch data from the community);
• Step 2. Classifying posts into the emotional-support-seeking category and the informational-support-seeking category using a trained CNN model (Input: a labeled dataset with posts, responses, and each post's category out of the 3 coded categories);
• Step 3. Randomly dispatching half of the emotional-support-seeking posts into a control group and monitoring them; this is only for the field experiment in Study 3 (Parameter: 10 minutes, derived from Study 1, as the threshold for deciding whether the chatbot needs to respond to an overlooked post or not);

Proc. ACM Hum.-Comput. Interact., Vol. 5, No. CSCW1, Article 9. Publication date: April 2021.
CASS: Towards Building a Social-Support Chatbot for Online Health Community 9:11
Fig. 3. CASS system architecture and workflow. It has 3 modules and 5 steps. Steps 1, 3, and 5 are implemented via the control logic module; step 2 applies the neural network (NN)-based text classification module; and step 4 applies the neural network (NN)-based text generation module.

• Step 4. Generating responses for posts in the experimental group using a trained LSTM model (Input: the same input as in Step 2, the labeled dataset);
• Step 5. A human-in-the-loop verification and intervention user interface, before posting the response back to the forum (Output interface: an API URL to post the response back to the community).

This chatbot architecture and workflow is quite generic and easy to generalize to other online health forums: a developer or researcher only needs to alter the two API URL parameters that control how the system reads in original posts from the community and how it publishes generated responses back to the community, and to feed in a labeled dataset as input. In the following subsections, we present how we use the architecture to build the CASS system with 3 modules: a control logic module, a text classification module, and a text generation module.

We collectively refer to the posts that can not get a response within 10 minutes as overlooked posts, and the others as replied posts. CASS's control logic module is created to identify and track those posts, dispatch them to the corresponding system modules for text classification, text generation, or human-in-the-loop intervention, and finally publish the responses back to the community. In Fig. 3, the control logic module is responsible for the implementation of
Steps 1, 3, and 5. Steps 1 and 5 are responsible for collecting posts from the community (i.e., input) and publishing responses to the community (i.e., output), respectively. Step 1 is also responsible for distinguishing whether a post is timely-replied (yellow box in Step 1 in Fig. 3; no further interventions were provided for this type of post) or overlooked (green box in Step 1 in Fig. 3).

In addition to the automated nature of CASS's output step, we also built a Human-in-the-Loop (HITL) module in Step 5. We hypothesized that there may be malfunctions of the CASS system, as it is fully automated after the model is trained and the system is deployed. Thus, we designed a user interface console, which can be used by a human experimenter to track and monitor each AI-generated response in real time. Every response generated by the text generation module in Step 4 shows up in the console for 10 seconds before being published to the community. The human experimenter has two options for action: he/she can choose to do nothing, and then the response is published; or he/she can hit the "Enter" key to suspend the publishing process. If the latter action is triggered, the human experimenter is prompted to type in a new sentence in the console, and that new sentence replaces the AI-generated response and is then sent to the community.

In the actual online field experiment in Study 3, two experiment operators carefully monitored the performance of the chatbot in generating responses from 8 a.m. to midnight (16 hours) using the HITL module. But they did not find any inappropriate AI-generated responses that needed intervention. This may be because the data collected from YouBaoBao and used to train the models were clean and contained nearly no malicious speech. This may also be attributed to the friendly nature of the research site [37].
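The HITL step can be sketched as a small review loop. The following is a minimal, hypothetical reconstruction: the function name, the queue-based operator channel, and the configurable window are our assumptions (the actual console is a custom UI), but the behavior matches the description — hold each response for the review window, publish the operator's replacement if one arrives, otherwise publish the AI-generated response unchanged.

```python
import queue

REVIEW_WINDOW_SECONDS = 10  # each AI response is held for 10 seconds before publishing

def review_and_publish(ai_response, publish, operator_queue, window=REVIEW_WINDOW_SECONDS):
    """Hold one AI-generated response for the review window.

    If the operator submits a replacement sentence within the window
    (modeled here as a message on operator_queue), publish that instead;
    otherwise the AI-generated response is published unchanged.
    """
    try:
        replacement = operator_queue.get(timeout=window)
    except queue.Empty:
        replacement = None
    final_text = replacement if replacement is not None else ai_response
    publish(final_text)
    return final_text
```

In the deployed system no intervention was ever triggered, so in practice every response would have taken the timeout branch.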
Step 3 is a control logic specifically designed for the field experiment in Study 3. We want to note that Step 3 is not needed in a real deployment of the CASS system in the future. This step randomly assigns half of the qualified posts (i.e., overlooked emotional-support-seeking posts) into an experimental group, and the other half into a control group. For the experimental-group posts, CASS generates responses, sends them through the HITL step, publishes them on the forum, and keeps tracking the post owner's (i.e., poster) and other community members' (i.e., commenter) actions for 7 days. For the control-group posts, CASS simply tracks the user actions for 7 days without intervention for comparison purposes.
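The random assignment in Step 3 amounts to a coin flip per qualified post; a minimal sketch (the function name and data shapes are hypothetical):

```python
import random

def dispatch_overlooked_posts(post_ids, seed=None):
    """Randomly split overlooked emotional-support-seeking posts into the
    experiment group (CASS replies, then tracks for 7 days) and the
    control group (tracked for 7 days with no intervention)."""
    rng = random.Random(seed)
    experiment, control = [], []
    for post_id in post_ids:
        (experiment if rng.random() < 0.5 else control).append(post_id)
    return experiment, control
```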
Our results from Study 1 suggest that a user typically seeks two types of support: informational and emotional support. For informational support, posters look for accurate responses rather than diversity in responses [114]. Thus, this type of utilitarian need can be satisfied by various rule-based algorithms or existing chatbot platforms (e.g., [64, 122]); this is not the focus of this study. As such, we build a text classification module based on a convolutional neural network (CNN) (
Step 2 in Fig. 3) to distinguish informational-support-seeking posts from non-informational-support-seeking posts (i.e., both emotional support seeking and sharing daily life), so that the chatbot can provide timely responses to non-informational-support-seeking posts only.

CNN models have been widely used in the computer vision field in recent years [56], and some variations have been extended to text classification tasks [51]. This normally requires preprocessing each data instance (i.e., posts in this paper) into word vectors with the same dimension, then feeding the word vectors into a CNN network; the output of the network is a classification of the input post into one of the three categories. In our study, we do not need to distinguish the detailed categories; we only need to identify posts that are informational support seeking, so that our chatbot logic module can ignore those posts. To train this model, we used the labeled data (N=2,300) from Study 1 as training data. Out of the 2,300 labeled posts, 45.95% of them were seeking informational support. To address the data imbalance issue, we oversampled the data [13] and adjusted the ratio of informational-support-seeking posts to emotional-support-seeking posts to 1:1 (in total 3,000 posts). After the training dataset is prepared, we first preprocess the training data and encode each post as word vectors using TensorFlow [1] (vocabulary size=5000, embedding dimension=64, sequence length=600), followed by feeding these vectors into the CNN model. Our CNN model uses a simple structure, including one convolutional layer, two full-connection (FC) layers, one input layer, and one output layer. The input layer has 600 neurons. The convolutional layer consists of three consecutive operations/layers: convolution with kernels, a non-linear activation function, and max pooling.
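The preprocessing just described — balance the two classes, then encode each post as a fixed-length id sequence — can be sketched as follows. This is a simplified, hypothetical reconstruction: the paper uses TensorFlow and Chinese text, so real tokenization would involve word segmentation rather than the whitespace split used here.

```python
import random
from collections import Counter

VOCAB_SIZE, SEQ_LEN = 5000, 600  # vocabulary size and sequence length from the text

def oversample_to_balance(minority, majority, seed=0):
    """Randomly duplicate minority-class posts until the two classes are 1:1."""
    rng = random.Random(seed)
    balanced = list(minority)
    while len(balanced) < len(majority):
        balanced.append(rng.choice(minority))
    return balanced

def build_vocab(posts):
    """Index the most frequent tokens; id 0 is padding and id 1 is out-of-vocabulary."""
    counts = Counter(tok for post in posts for tok in post.split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(VOCAB_SIZE - 2))}

def encode(post, vocab):
    """Map a post to a fixed-length sequence of word ids (truncate/pad to SEQ_LEN)."""
    ids = [vocab.get(tok, 1) for tok in post.split()][:SEQ_LEN]
    return ids + [0] * (SEQ_LEN - len(ids))
```

The resulting id sequences are what the embedding layer of the CNN consumes.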
The convolutional layer contains 256 kernels, and the FC layer contains 128 units with a dropout rate of 0.5. In training, we optimize the model with AdamOptimizer using a learning rate of 0.001, optimizing for accuracy. After training, we conduct 5-fold cross-validation. The overall cross-validation accuracy is 0.86 and the F1 score is 0.87. This accuracy is good enough for our following experiments.

In Step 4, we build a text generation module. Its task can be framed as follows: for a given input sequence of words (i.e., a post), we need to find another sequence of
words (i.e., a response) that best suits the input. Thus, we can conceptualize this as an NLP translation task which can be solved using a variation of machine translation networks. To that end, we use OpenNMT [52], a state-of-the-art open-source toolkit for neural machine translation (NMT) tasks, to build a model that generates responses for a given post. The model is derived from a sequence-to-sequence model with an attention mechanism [69]. In addition, we build an information retrieval (IR) model with the BM25 ranking function [91] as a baseline model.

The training data is 220,000 post-response pairs from the 5 sub-forums. For each post-response pair, the data contains the post content, a picture icon if the content contains pictures, a post ID,

In this subsection, we present the offline evaluation of CASS performance. Offline evaluation refers to evaluation that is done before deploying the system for people to test in the real-world context. There are two types of evaluation approaches: automated performance evaluation and human evaluation. In this study, we adopt both of them to conduct an offline evaluation of our chatbot system.

Automated performance evaluation is an established mechanism to automatically and quickly evaluate the performance of an AI system in the field of machine learning and AI. It often first reserves a subset of the training data as holdout data, whose input (i.e., post) and output (i.e., human response) are both known. Then, we feed the input into the trained model, which automatically generates a predicted output.
We can automatically compare the predicted output with the known human-response output to evaluate the model performance.

Human evaluation is a well-known practice in both the HCI and machine learning communities. Often, we can define a variety of dimensions and ask human users to rate the AI-predicted outputs. Depending on the task, the human graders/coders do not necessarily have to be the actual intended users, as long as these graders are capable of completing the task. Thus, many such human evaluations are done by crowd-workers [122]. (OpenNMT: https://opennmt.net/)

Fig. 4. Some examples of human users' responses as ground truth (Human-Response) and our text generation module's generated responses (AI-Response). These sentences are all in Chinese and we translate them into English for this paper.
In NLP research, there are many metrics to describe the performance of a machine learning system, with each metric suitable for different tasks. In our system, the final output is a predicted/generated response based on an input (i.e., a post). This is a Natural Language Generation task, which can be evaluated by a widely used metric called the BLEU (Bilingual Evaluation Understudy) score [80]. The BLEU score represents the similarity between a CASS-generated response and the human ground truth in the training dataset via n-gram matching. It is a number between 0 and 1; 1 indicates that the generated response is exactly the same as the ground truth. We randomly selected 2,000 post-response pairs and reserved them as holdout data before training. The BLEU score on this holdout dataset was 0.23, which is a fairly acceptable performance score [122, 129]. As a comparison, the BLEU score of the IR-based baseline model is 0.03.
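As an illustration of how BLEU works, the following is a bare-bones sentence-level version (clipped n-gram precisions combined by a geometric mean, times a brevity penalty); real evaluations normally use a corpus-level, smoothed implementation, so this sketch is illustrative only.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU of a candidate against a single reference."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        overlap = sum((cand_ngrams & ngram_counts(ref, n)).values())  # clipped matches
        if overlap == 0:
            return 0.0  # unsmoothed: any zero precision collapses the score
        precisions.append(overlap / sum(cand_ngrams.values()))
    # brevity penalty discourages candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate scores 1.0, and a candidate sharing no words with the reference scores 0.0, matching the bounds described above.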
The BLEU score simply describes the lexical differences between two sentences. It, however, does not reflect grammatical correctness or semantic meaning. It is possible that CASS generates a much shorter sentence with the same meaning as the original human response, but the BLEU score is low due to the length difference. For example, in the second row in Fig. 4, the AI-generated response is a valid and even better response to the post, but the BLEU score is low because it is quite different from the human response. Thus, we recruit human coders to evaluate the generated results. Based on previous literature [5], and for the research interest of this study, we introduce four dimensions for evaluating AI-generated responses:

• Grammar Correctness. The human coders are asked to evaluate the generated response's grammar correctness with a 5-point Likert scale ranging from strongly disagree (-2) to strongly agree (+2) that the grammar is correct [89].
• Relevance. The response may not be relevant to the given input post, so we decided to evaluate the relevance of AI-generated responses. It is also done using a 5-point Likert scale ranging from strongly disagree (-2) to strongly agree (+2) that the response is relevant to the post's topic [89].
• Willing-to-Reply. A response's content may be correct and relevant, but users in an online community may not feel engaged, so they tend to ignore the AI-generated response. In this question, we ask human coders to rate how likely they would comment on the automatically generated response. Again, a 5-point Likert scale is used, with -2 indicating strongly unlikely and +2 representing strongly likely.
• Emotional Support. This is a dimension designed specifically for our research context. We are interested in examining the extent to which the human coders perceive that the AI-generated response provides the desired emotional support. Similar to the above dimensions, a 5-point Likert scale is used.

We randomly select 200 post-response pairs from the holdout dataset (the same dataset used in the machine evaluation). Each post has both a human response that we collected from the community as a ground truth, and an AI-generated response. Thus, we end up with a total of 400 pairs of post-response. We ask human coders to evaluate both the AI-generated responses and the human ground truth for comparison.
We recruit five human coders to rate each of the 400 pairs; each coder provides 4 scores corresponding to the grammar correctness, relevance, willing-to-reply, and emotional support dimensions. Intra-class correlations ICC(3,1) for the four dimensions range from 0.74 to 0.99, which indicates good inter-coder consistency. Each response thus has four dimension scores, and we use the average of the five coders' scores. Then we perform a t-test to compare the quality of the AI-generated responses and the human ground truth on each of the 4 dimensions (as shown in the AI-Response and Human-Response groups in Fig. 5).

Fig. 5. Comparison of the quality of AI-generated responses and human ground-truth responses, rated by 5 human coders on 4 dimensions. We also performed t-tests and post-hoc analyses. Grammar correctness: 1.37 vs 1.52, t(199)=-4.393, p > 0.1; Relevance: 0.38 vs 1.16, t(199)=-12.062, p < 0.01; Willing-to-reply: 0.23 vs 0.91, t(199)=-10.657, p < 0.01; Emotional support: 0.28 vs 0.34, t(199)=-7.186, p < 0.01.

Fig. 5 shows that the AI-responses' grammar correctness is rated high and is not that different from the human responses' (1.37 vs 1.52, t(199)=-4.393, p > 0.1). However, the AI-responses' quality is not as good as the human responses' in topical relevance, willing-to-reply, and emotional support, suggesting that the chatbot should be improved on those dimensions. But building a system to outperform humans in writing responses was never the goal of this study. Despite the difference, all the evaluation scores are above 0, meaning that the 5 human coders agree that the AI-generated responses are grammatically correct, of high relevance to the topic, engaging enough for users to reply, and provide a considerate level of emotional support to the poster.

The results of the automated performance evaluation and the human evaluation confirm that our chatbot system has a fairly good performance in providing emotional support. Therefore, we deploy CASS back to the research site to provide timely responses and emotional support to overlooked posts. We will describe our field deployment and experiment in the next section.

In the previous two sections, we presented an exploratory study on the YouBaoBao community to build a contextual understanding of the study site, from which we identified what types of posts the community members post, and highlighted the challenge that some of their emotional-support-seeking posts can not get a timely response (Section 3.3). Then, we presented the CASS system with a fully automated end-to-end architecture that can read in posts, identify overlooked posts, classify emotional-support-seeking posts, generate responses for those overlooked posts, and publish the generated responses back to the community (Fig. 3). Our preliminary evaluations showed that the NN-based models have a satisfying performance score, so the CASS system is ready for deployment. In this section, we present how we deployed the CASS system back to the community following a field experiment research setup, and how we evaluated its impacts on individual members and the community.

Field deployment is a well-established HCI research practice [100]. It can reveal users' "naturalistic usage" of newly introduced system functionality. We follow the guideline in [100] on how to define the beginning and the end of the field deployment, and its ethical considerations of engaging users.

Moreover, to quantify the support that CASS brings to individual members as well as to the community, we also adopted an experimental study setup [33]. We design a "control group" and an "experiment group" to measure the difference in impact between users exposed to the intervention and the ones that are not. An experiment design in a real-world deployment has its unique challenges.
For example, to ensure that users behave in the most naturalistic way while interacting with the intervention, they should not know they are part of an experiment, as long as such disguise is not harmful to the users. We follow the field experiment research method described in the textbook by Terveen et al. [102] to "maximize realism of the context, while still affording some experimental control".
Our research goal is to deploy CASS to provide timely social support for the individual members who seek support. In addition, we are also interested in evaluating its impact on other community members. We deployed the system in August 2019, and conducted the field experiment for 7 days. The reason for deploying the system for 7 days is that the lifespan of a post thread is on average 7 days (as reported in Study 1, Section 3.3). After we deactivated the CASS system, we still kept tracking the comments and user activities for another 7 days to ensure we gathered 7 days of data for each of the posts.

To evaluate whether CASS helps individual support seekers, our unit of analysis is each individual user who published the original post. We define measurements to capture their online behaviors, and their emotion changes after the post was responded to by CASS. To evaluate CASS's impact on other members of the community, the unit of analysis is changed to each individual post and its comments as a thread. We also define measurements (e.g., how many members participated under a post thread) to reflect other community members' behaviors caused by the CASS intervention.

As shown in Study 1 in Section 3.3, the median interval time for a post to get its first response is 10 minutes; and in Study 2 in Section 4.1.1, we define a post as an overlooked post if it does not get a human response within 10 minutes. Thus, during the deployment, we design CASS to only track the overlooked non-informational support-seeking posts (Section 4.1.2). In total, CASS tracked 3,445 overlooked emotional-support-seeking posts during the 7-day field experiment.

These overlooked posts are then randomly split into two groups: a baseline condition, where we only passively track the original poster's and the other members' activities without CASS intervention; and an experiment condition, where CASS responds to the overlooked post once 10 minutes have passed, and we track activities and measure the poster interactions. During the 7-day field experiment, 1,717 overlooked posts are assigned to the baseline condition, and another 1,728 overlooked posts are in the experiment condition.
As instructed in the textbook [33], when possible, participants should not be aware that they are in a field experiment setting. In addition, a number of recent works [49, 68, 75] have highlighted that users' behaviors and perceptions may change if they know or believe they are interacting with an AI system, which would further compromise the experiment results. Therefore, during this experimental study, we disguised CASS as a normal community member with a pseudo user name and a user avatar. We disclosed the chatbot's real identity to all the users who had interacted with it via the community's built-in private messages after the study was completed. This is a common practice in psychology experiments [33] and clinical trial experiments [26].
In this section, we present the measurements to quantify the effectiveness of our chatbot system in providing social support. In McGrath's seminal work, "Time, Interaction, and Performance" [73], he proposes a three-dimensional framework to describe teamwork: production, member support, and group well-being. Grudin argues that a successful CSCW system should be able to provide productivity benefits not only for the individual members, but also for the group dynamics [35]. Inspired by this literature, we thus define a set of measurements to quantify CASS's impact on other community members.

The three measurements for determining CASS's effectiveness in generating timely social support include:

• The count of posts with no response.
This variable describes how many posts have no response at all in 7 days (in short Post-NoResp-N). If the AI-Response group has a significantly smaller number in this measurement, that suggests CASS is effective in addressing no-response posts. When calculating this score, CASS's responses are not counted, as the chatbot replies to every post in the AI-Response group.

• The time interval between the original post and the 1st response, and the time interval between the 1st response and the 2nd response.
Here we calculate two variables, both time intervals in minutes, to describe how fast a post gets replied to (in short Post-1Resp-Time), and how fast the 1st response could attract the 2nd response (in short 1Resp-2Resp-Time). If the AI-Response group has smaller time interval measurements, that suggests CASS indeed supports the poster to get a response in a more timely manner.

Then, we use a set of four measurements to reflect whether the CASS-generated response helps the original support-seeking individual.

• The count of follow-up comments from the original poster.
After the original poster publishes the post and some other users come into the thread and reply to the post, will the original poster come back to publish follow-up comments and form an active discussion? If so, we count the number of comments from the original poster, in short Poster-Comm-N. The higher this number is, the more active the poster is, suggesting that the support-seeking poster found an emotional outlet under her post and leveraged it to engage more with other community members. If the poster never came back, we count it as 0.

• The poster's original emotional state, updated emotional state, and the difference between these two states.
It is difficult to measure the posters' emotional states without directly asking them; thus we employ the theory of emotional valence [67], and recruit two coders to annotate an original post's emotional valence score as a proxy for the original poster's initial emotional state (in short Poster-Emo-Val-Orig). More specifically, there are three states of emotional valence: positive, neutral, and negative. Then, if the original poster publishes a follow-up comment, we use this comment's emotional valence score as a proxy to represent her updated emotional state (in short Poster-Emo-Val-Updt). We also calculate the change of emotional states (in short Poster-Emo-Val-Chng). If an original poster never came back to post a follow-up comment, we exclude this data point. In summary, we have 759 data points in the experiment condition (N=1,728), and 536 data points in the baseline condition (N=1,717). We refer to these data points as AI-Response and Human-Response data points. Inter-rater reliability between the two coders on the emotional valence scores of these 1319 data points reaches 0.88 (Cohen's kappa), which indicates a high consistency.

Lastly, we define three measurements to reflect CASS's impact on other members in the community.

• The count of total responses to an original post.
This variable reflects the participation level under a post. A higher average number for this variable suggests the community has a higher community contribution level, and the community is healthier (in short Post-Resp-N).

• The count of how many members participated and commented under a post thread. This variable also reflects the participation level under a post. A higher average number for this variable suggests that on average more community members engaged in each post's discussion. It reflects a high community commitment level, and the community is healthier (in short Post-Member-N).

• The time interval between each pair of adjacent responses under a post.
This variable (in short Adj-Comm-Time) represents on average how fast a post thread can get a new comment. If this number is smaller in the AI-Response group, it suggests that CASS not only helps the original poster to get a timely response, but also activates the liveliness of the community (e.g., other members also participate in a post's discussion in a more active fashion).
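Given one thread's timestamps, the measurements above reduce to simple arithmetic. A sketch (the data layout is hypothetical: timestamps in minutes, comments sorted by time, with the chatbot's own reply excluded where the measure's definition requires):

```python
def thread_measures(post_time, comments):
    """Compute per-thread engagement measures.

    post_time: timestamp (in minutes) of the original post.
    comments:  list of (commenter_id, timestamp) pairs sorted by time.
    """
    times = [t for _, t in comments]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "Post-Resp-N": len(comments),                                   # total responses
        "Post-Member-N": len({commenter for commenter, _ in comments}), # distinct participants
        "Post-1Resp-Time": times[0] - post_time if times else None,
        "1Resp-2Resp-Time": times[1] - times[0] if len(times) > 1 else None,
        "Adj-Comm-Time": sum(gaps) / len(gaps) if gaps else None,       # mean adjacent gap
    }
```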
We organize the results section into three subsections to report that 1) CASS is effective in providing timely social support to overlooked posts; 2) CASS-generated responses can improve the individual support seeker's emotional status; and 3) CASS can also positively impact other community members' participation.

The measurements in this category represent how CASS's functionalities effectively mitigate the primary challenge that an emotional-support-seeking post can not get a timely response.
Post-NoResp-N: The results show that in the baseline condition, 595 out of 1,717 posts never received any response within 7 days of the post being published; in the experiment condition, out of 1,728 posts, this number was 433 (we exclude responses from the original poster and from the CASS chatbot when calculating this number). A Chi-Square test suggests that the difference is significant (χ²(1)=39.65, p<0.01). That is, CASS can effectively generate timely social-support messages in response to the support-seeking posts and significantly reduce the number of overlooked posts.

Post-1Resp-Time & 1Resp-2Resp-Time: These two variables are used for calculating the time interval between the original post and the 1st response, and the interval between the 1st response and the 2nd response. The results show that Post-1Resp-Time in the baseline condition is 254 minutes on average, and in the experiment condition it is 10 minutes. These are not directly comparable, because we simply set the CASS logic to identify and reply to posts in 10 minutes. Thus, a more comparable measure is 1Resp-2Resp-Time, which reflects how long it takes for a second response to come in after the 1st response was published. In the baseline condition, the time interval between the 1st response and the 2nd response is on average 462 minutes, whereas in the experiment condition that number is 349 minutes. An independent-sample t-test suggests the difference is significant (t(1577)=-1.433, p<0.05). This result means that CASS not only posts a timely response to the original post, but also accelerates the time until a following response is posted by another user.

In summary, CASS's functionality can help effectively mitigate the primary problem, where some posts can not get a response in a timely manner, by reducing the number of posts with no response (smaller Post-NoResp-N), and speeding up the time until a post is responded to (smaller 1Resp-2Resp-Time).

As introduced in Section 5.1.3, we defined and calculated four measurements to quantify the impact of the CASS-generated response on an individual support seeker.
Poster-Comm-N: The results show that original support seekers come back to their post thread and leave on average 1.36 follow-up comments in the baseline condition (N = 1,717). In contrast, the posters posted on average 1.78 follow-up comments in the experiment condition (N = 1,728). An independent t-test (t(3443) = 2.616, p<0.1) shows a marginally significant effect and suggests that CASS indeed made the original support seekers follow up with more comments, and interact more actively with other community members. These behaviors arguably are beneficial for releasing their stress.
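The group comparison above is a standard independent-samples t-test. For reference, the pooled-variance version can be computed as follows (a generic sketch; the exact test variant used in the analysis is not specified in the text):

```python
import math

def pooled_t(sample_a, sample_b):
    """Independent-samples t statistic with pooled variance
    (degrees of freedom: len(sample_a) + len(sample_b) - 2)."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = sum(sample_a) / na, sum(sample_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))
```

With the per-poster comment counts of the two conditions as inputs, the degrees of freedom 1,717 + 1,728 - 2 = 3,443 match the reported t(3443).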
Poster-Emo-Val-Orig : As shown in Fig. 6, the results show that for the experiment condition(N = 759), 22% of the original posts are coded by human coders as having positive emotion valencescore, 29% as neutral, and 49% have negative emotion valence score. In contrast, in the baselinecondition (N = 536), the numbers become 20%, 36%, and 44% for positive, neutral, and negativeemotion scores, respectively. As these post-response pairs are only the ones on which the originalsupport seeker left a follow-up comment; thus, to differentiate them from the entire dataset, we referto them as the AI-Response group (N = 759) and the Human-Response group (N = 536), respectively.
Poster-Emo-Val-Updt: In the AI-Response group (N = 759), the original poster's 1st comment is her reply to the CASS-generated response. 25% of these replies are coded as expressing positive emotion, 62% as neutral, and 13% as negative. In the Human-Response group (N = 536), human coders rate 10%, 76%, and 14% of the replies as expressing positive, neutral, and negative emotion, respectively.
These results are shown in Fig. 6. From the chart, we can observe that there is no notable difference between the AI-Response group and the Human-Response group in the emotional valence scores of the original posts, but there is a clear difference in the emotional valence scores of the posters' 1st replies.
Fig. 6. The poster's emotional valence and its changes, measured by two proxy variables: the emotional valence score of her original post, and that of her first comment to others' responses under her own post.
Poster-Emo-Val-Chng: It is interesting to see in Fig. 6 that the emotional valence scores in both groups changed from the support seeker's original post to her 1st reply. In the AI-Response group, many posters' originally negative posts turn neutral (the neutral share rises from 29% to 62%), and some neutral ones turn positive (the positive share rises from 22% to 25%). Many community members' emotional valence becomes more positive, as in the example below:
Poster's original post:
I may have insomnia. I still can't fall asleep and it's already 3:00 am in the morning.
AI-generated response:
Have a good sleep, and don’t push yourself too hard.
Poster’s 1st comment:
Wow!!! I want to hug you!
In contrast, in the Human-Response group, many posters' emotional valence scores shift toward neutral (the neutral share rises from 36% to 76%). This suggests that some of the responses posted by other community members may adversely affect some posters' emotions.
To statistically analyze posters' emotional change, we further calculate the percentage of posts whose emotional valence score goes up (e.g., neutral to positive), goes down (e.g., positive to neutral), or remains unchanged. In the AI-Response group, 48% of the posts' scores go up, 38% remain unchanged, and 14% go down. In the Human-Response group, the percentages are 38%, 44%, and 18%, respectively. We perform Chi-square tests to compare the percentages across the two groups. There are significantly more posters with an increased emotional valence score in the AI-Response group than in the Human-Response group (χ²(1) = 14.35, p < 0.01). Similarly, there are significantly fewer posters with a decreased emotional valence score in the AI-Response group than in the Human-Response group (χ²(1) = 5.242, p < 0.05). This suggests that the AI-generated response can motivate posters to become more positive, and prevent them from experiencing frustration or similar negative emotions, compared to the human-generated response.
Fig. 7. Time intervals for each pair of adjacent responses under a post. In this chart, we only show the first 5 pairs, but the trend persists for the remaining pairs. This result suggests the community is more active in the AI-Response group than in the Human-Response group.
In summary, our results show that in the AI-Response group, more original support seekers (i.e., posters) revisit and comment under their own posts to interact with other community members. In addition, more posters shift to a positive emotional valence, and fewer shift to a negative one, compared to the posters in the Human-Response group. In this sense, CASS can help individual members use their published post as an emotional outlet to chat with others, and may further improve their emotional states.
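The Chi-square comparison above can be sketched in a few lines of pure Python. The cell counts here are reconstructed from the reported percentages and group sizes (48% of 759 posts with rising valence vs. 38% of 536), so due to rounding the statistic only approximates the reported χ²(1) = 14.35; it still clears the 3.84 critical value for p < 0.05 at 1 degree of freedom.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 d.f., no continuity correction)
    for a 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Counts reconstructed (approximately) from the reported percentages.
up_ai = round(0.48 * 759)   # posts with rising valence, AI-Response group
up_h = round(0.38 * 536)    # posts with rising valence, Human-Response group
chi2 = chi_square_2x2(up_ai, 759 - up_ai, up_h, 536 - up_h)
```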
In addition to CASS's utilitarian effectiveness in supporting individual members, we are also interested in examining how it may improve the well-being of the community as a whole.
Post-Resp-N: The results show that in the experiment condition (N = 1,728), there are on average 7.63 responses (S.D. = 34.25) for each post (N = 997); in contrast, in the baseline condition (N = 1,717), each post has an average of 6.3 responses (N = 1,120, S.D. = 35.14). We exclude the responses published by CASS or by the original poster, so this variable is a pure indicator of other community members' participation. This result suggests that other community members in the experiment condition participate at a higher level, but an independent-sample t-test suggests the difference is not significant (t(2115) = 0.879, p = 0.228).
Post-Member-N: In the experiment condition, an average of 5 other community members (S.D. = 20.97) participate in each post thread (N = 997). In contrast, about 4 members (S.D. = 23.35) participate in each post thread in the baseline condition (N = 1,120). This result suggests that each post in the experiment condition attracts more community members and reflects a higher level of community commitment compared to the baseline condition. However, an independent-sample t-test suggests the difference is not significant (t(2115) = 0.911, p = 0.232).
Adj-Comm-Time: We also calculate the time intervals between each pair of adjacent responses in a post thread. In Section 5.2.1, we reported that Post-1Resp-Time is 10 minutes in the experiment condition and 254 minutes in the baseline condition, and that the interval between the 1st and 2nd responses is 349 minutes and 462 minutes, respectively. Here, we continue to calculate the time intervals up to five follow-up responses. As Fig. 7 illustrates, the AI-Response group's time intervals are consistently shorter than those in the Human-Response group. This suggests that CASS not only helps the original poster get a quicker response, but also energizes other members in the community and accelerates their participation in a post's discussion.
In summary, we find that in the experiment condition, community members participate more actively in a post thread. Results also suggest that with the presence of our chatbot system, each
post has a higher level of community participation, with both more people and more responses, even though the difference is not statistically significant.
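The timing measures used in this section (Post-1Resp-Time and the adjacent-response intervals behind Adj-Comm-Time) can be derived directly from thread timestamps. A minimal sketch, using a hypothetical thread rather than our actual data:

```python
from datetime import datetime, timedelta

def response_intervals(post_time, response_times):
    """Given a post's timestamp and the timestamps of its responses,
    return Post-1Resp-Time (post -> 1st response) followed by the
    intervals between each pair of adjacent responses, in minutes."""
    times = [post_time] + sorted(response_times)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]

# Hypothetical thread: a first reply after 10 minutes, a second reply
# 349 minutes after the first.
t0 = datetime(2020, 5, 1, 9, 0)
intervals = response_intervals(
    t0, [t0 + timedelta(minutes=10), t0 + timedelta(minutes=359)])
# intervals == [10.0, 349.0]
```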
We start our exploration with a detailed content analysis of the posts on the pregnancy forum. Despite the difference in granularity between the two coding schemas, our categories and concepts echo previous literature [37]. In [37], the researchers propose 6 categories from coding 683 online posts; in contrast, we identify 3 top-level and 11 detailed-level concepts after coding 2,300 posts. Our "informational-support seeking" category is quite similar to a combination of their "advice", "formal knowledge", and "informal knowledge seeking" categories. Our "emotional support seeking and sharing" is similar to a combination of their "emotional support" and "reassurance".
We find a novel post category in which community members share their own daily life, such as posting a photo of food or of exercise in a gym. This indicates that community members gain support and contribute to the community at the same time. For our research purposes, we do not attempt to distinguish the emotional-support-seeking category from the daily-life-sharing category, because the responses to these two categories can overlap in compliments or empathy, and our text generation module can handle them the same way, as long as they are not seeking factual information.
Study 1 suggests that approximately 18% of the posts did not receive any response, signifying an urgent need to address the "post-with-no-response" issue. Community members' engagement is a key factor influencing the number of responses, and maintaining the active engagement of members is a key challenge for online communities. To address this challenge, some communities employ incentive schemes to motivate member participation, such as gamification [19] or monetary rewards [120]. While these are effective to some extent, such solutions come with relatively high implementation and maintenance costs [87].
With the advances in AI and chatbot technology, some platforms have started experimenting with chatbot services to more closely connect members, for instance, by recommending members to reach out to others [82] or by suggesting further actions based on participation status [81].
Our work takes a different route: building a chatbot that acts as an active member to boost the community's engagement level. This result resembles the catfish effect (see https://en.wikipedia.org/wiki/Catfish_effect): an active catfish in a tank of sardines can stimulate the sardines to be more active and live longer during transport. CASS is an "active member" that generates diverse responses to other members' posts, and its existence energizes other members to post more responses to the support-seeking post. We speculate this is one plausible explanation for our results in Study 3.
To provide social support successfully in an online community, an AI system needs to meet high standards with respect to its post classification and response generation capabilities. From the original support seeker's perspective, they need authentic and helpful responses in a timely manner; otherwise the responses may even negatively affect their mood. From the other community members' perspective, if a post's first response is of high quality and highly relevant to the post, they may follow the lead and be more likely to contribute their own high-quality comments to the post thread.
Therefore, CASS needs to be able to generate authentic, diverse, and high-quality responses. However, existing rule-based and IR-based chatbots cannot sufficiently meet such needs [83]. An NN-based chatbot can be a promising solution in this case, but it may be affected by the noisy-label issue in training data.
We mitigate this potential risk with three methods. First, our research site is a friendly community: when cleaning up the data, we only see some advertisements (e.g., which brand of milk powder is better for the baby) and do not find any offensive or hateful language. Second, we build a CNN-based classification model to filter out informational-seeking posts, as CASS is not designed to generate responses about medical information. Third, we design a user interface console following a human-in-the-loop AI design, which a human experimenter can use to track and monitor each AI-generated response in real time. From data collection to response generation and publication, we employ rigorous methods to avoid the potential risks of such an NN-based chatbot architecture.
However, an online community is fluid, and its norms and topics may evolve over time [110]. AI-generated responses rely on the style and topics of the training dataset. Thus, after a period of deployment, the NN-based chatbot may need to be re-trained with the latest community data to keep up to date with popular memes and language norms. This process may sound labor-intensive, but it is much easier than re-engineering and refining a rule-based chatbot system.
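The three safeguards above can be summarized as a pipeline sketch: classify, generate, then queue for human review before anything is published. This is a hypothetical illustration only; `is_informational` and `generate_response` are keyword and canned-text stand-ins for the CNN classifier and neural generator described in the paper, not the actual CASS implementation.

```python
def is_informational(post: str) -> bool:
    # Stand-in for the CNN post classifier (assumption, not the real model).
    p = post.lower()
    return "?" in p and any(w in p for w in ("which", "what", "how"))

def generate_response(post: str) -> str:
    # Stand-in for the neural response generator.
    return "Have a good rest, and don't push yourself too hard."

def pipeline(post: str, review_queue: list) -> None:
    if is_informational(post):
        return  # CASS does not answer medical/informational questions
    # Every draft waits for the human experimenter's approval on the console.
    review_queue.append((post, generate_response(post)))

queue = []
pipeline("Which brand of milk powder is better for the baby?", queue)
pipeline("I may have insomnia, still awake at 3am.", queue)
# Only the emotional-support post reaches the review queue.
```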
Previous work on human interaction with chatbots often focused only on a chatbot's impact on an individual user; little is known about how a chatbot system may affect a group or a community. Our results provide evidence that an NN-based AI system can provide timely social support to individual members, as well as boost other community members' engagement level. Existing literature suggests that chatbots can play a significant role in boosting the popularity of post content [34, 82]. Seering et al. [96] also find that the existence of a chatbot sparks both human-chatbot and human-human interactions. The results of our Study 3 suggest that CASS can help individual members use their published post as an emotional outlet, interact more with other community members, and ultimately improve their emotional states. These results extend prior work and provide valuable insights into the design of future chatbot systems for online communities.
In addition, we find that CASS can prompt other community members to engage more with a post. For example, results from Study 3 show that community members who did not interact with CASS directly were also affected by it: CASS made them more engaged and more active in replying to support-seeking posts. We speculate that, because our research site is a Chinese online community, Chinese culture [16] may contribute to this effect. For example, people may not want to be the first to express their opinion in a public discourse [16]. Being the first commenter seems to require more courage and a stronger desire to express oneself; if the post has already been replied to by someone (in this case, by the chatbot), the psychological pressure to reply is lower. So when CASS replied to a post, other members may have been more willing to participate in the discussion under lower pressure.
For HCI and AI researchers, this suggests a new research direction: how different cultures interplay with user behaviors when people and AI systems co-exist in an online community.
More importantly, with more and more chatbots and AI-based moderation applications being built and adopted in various online communities, we argue that in the future it will be common for online communities to have a human-AI co-existence ecosystem. To support this new paradigm, we
need to extend the current human-in-the-loop AI design philosophy [21] and the mixed-initiative design approach [43]. An AI system with high accuracy is not enough; the AI system should also be designed to cooperate with its primary target users and the other users in the community. A cooperative AI system should seamlessly fit into the context and bring benefits to the community (e.g., supporting stressed community members and increasing community engagement), and the users who interact with it should feel comfortable having the AI system in the community, instead of feeling threatened or intimidated. This "Human-AI Collaboration" future [108] is the ultimate research goal that we are aiming for.
One limitation of our work is that the CASS system was designed to handle only emotional-support-seeking posts, neglecting informational-support-seeking posts. This is because the expected responses are different: for emotional support, users prefer diverse and lively content and timeliness matters; whereas responses to informational-support-seeking posts require accurate and factual answers, so diversity is not a critical concern [37]. That being said, NN-based approaches can also help with the latter task, though not with the same seq-to-seq architecture we used for emotional support. Open-domain Q&A [111] is a promising NLP technique that can parse a whole textbook and automatically generate answers to any given question about it. With this technique, a chatbot could reply to users' informational-support-seeking questions by generating answers from medical textbooks, Wikipedia, or other authentic data sources. The CASS system can still follow the architecture with 5 steps and 3 modules; only inside the response generation module would CASS need a hybrid architecture that applies the appropriate NLP technique to each incoming post. Theoretically, this is easy to do, but it requires careful design considerations, solid data preparation, and rigorous model training and experimentation. We will explore this direction in our future work.
We did not disclose CASS's identity before or during the experiment for two main reasons. First, as suggested by previous literature [49, 68, 75, 76, 92], we believe that disclosing the chatbot's identity would affect the results of the experiment, attracting too much attention [96] or biasing the responses [49]; for the purpose of this study, we wanted to avoid that. Second, the potential impact of disclosing a chatbot's identity on community users' behaviors was not within the scope of this study. We plan to thoroughly examine this topic in future work.
Whether to disclose the chatbot's identity also relates to an important design question: the appropriateness of a chatbot's language. In our study, the appropriateness of the generated language is evaluated based on its accuracy and whether it answers the given input. However, we anticipate that some wording may be perceived as appropriate when posted by pregnant human members but inappropriate when posted by an AI algorithm. For example, in the Fig. 4 example, a pregnant woman complains about her baby causing her sleeping issues, and our chatbot answers "same here". The answer is entirely appropriate coming from another human member of the community, but some people may perceive it as odd, since the AI is suggesting that "her baby" also caused "her" similar troubles. Thus, there may be a need for a novel dimension characterizing the generated language's AI-appropriateness.
Another limitation of our chatbot is that in the field experiment, we limited its engagement with the poster to one round. This is not because our chatbot cannot handle multi-round conversations; rather, it is a semi-controlled experiment design consideration: if our chatbot could reply to the poster for an arbitrary number of rounds, it would introduce more confounding variables into the main behavioral measurements of interest. In the future, we can extend CASS with multi-round dialogue skills [78] and engage with community members in a more natural manner.
Our work rests on the assumption that the sooner support-seeking posts get answers, the better for the support seeker. However, based on Figure 1, it is clear that most posts either quickly receive a response or never receive one. The introduction of the proposed chatbot may significantly change such community dynamics, and this change may have unexpected negative implications; further study is required to fully understand its impact in a longer-term deployment.
Our deployment in this particular healthcare community may limit the generalizability of the empirical findings: findings specific to a community of pregnant women may not apply to other online communities, and the Chinese cultural context may hinder generalization to other cultures. However, we believe the scalable and generalizable chatbot architecture can be easily customized for other online community contexts. We therefore welcome other researchers to join our effort, so that together we can replicate and extend this study design in other community contexts.
In this paper, we present a comprehensive research project consisting of three studies, in which we developed and deployed a chatbot system to automatically generate and post responses to emotional-support-seeking posts in an online health community for pregnant women. Our studies show that the neural-network-based chatbot architecture is a promising solution for building chatbots that generate timely responses to posts with no replies. In addition, we present evidence of the chatbot's positive impact on the support seeker's emotional status and on encouraging other community members to be more engaged.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In
Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '15). Association for Computing Machinery, New York, NY, USA, 839–851. https://doi.org/10.1145/2675133.2675208
[3] Mark S Ackerman and David W McDonald. 1996. Answer Garden 2: merging organizational memory with collaborative help. In Proceedings of the 1996 ACM conference on Computer supported cooperative work. 97–105.
[4] Teresa Almeida, Rob Comber, and Madeline Balaam. 2016. HCI and Intimate Care as an Agenda for Change in Women's Health. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). Association for Computing Machinery, New York, NY, USA, 2599–2611. https://doi.org/10.1145/2858036.2858187
[5] Tim Althoff, Kevin Clark, and Jure Leskovec. 2016. Large-scale analysis of counseling conversations: An application of natural language processing to mental health. Transactions of the Association for Computational Linguistics.
Proceedings of the 2019 CHI conference on human factors in computing systems. 1–13.
[7] Tawfiq Ammari, Sarita Schoenebeck, and Daniel M. Romero. 2018. Pseudonymous Parents: Comparing Parenting Roles and Identities on the Mommit and Daddit Subreddits. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3173574.3174063
[8] Jaime Arguello, Brian S. Butler, Elisabeth Joyce, Robert Kraut, Kimberly S. Ling, Carolyn Rosé, and Xiaoqing Wang. 2006. Talk to Me: Foundations for Successful Individual-group Interactions in Online Communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '06). ACM, New York, NY, USA, 959–968. https://doi.org/10.1145/1124772.1124916
[9] Saeideh Bakhshi, David A. Shamma, and Eric Gilbert. 2014. Faces Engage Us: Photos with Faces Attract More Likes and Comments on Instagram. In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '14). ACM, New York, NY, USA, 965–974. https://doi.org/10.1145/2556288.2557403
[10] Antonina D. Bambina. 2007. Online Social Support: The Interplay of Social Networks and Computer-Mediated Communication. Cambria Press.
[11] S Banti, Mauro Mauri, A Oppo, C Borri, C Rambelli, D Ramacciotti, M S Montagnani, V Camilleri, S Cortopassi, Paola Rucci, et al. 2011. From the third month of pregnancy to 1 year postpartum. Prevalence, incidence, recurrence, and new onset of depression. Results from the Perinatal Depression-Research & Screening Unit study. Comprehensive Psychiatry 52, 4 (2011), 343–351.
[12] Marguerite Barry, Kevin Doherty, Jose Marcano Belisario, Josip Car, Cecily Morrison, and Gavin Doherty. 2017. mHealth for Maternal Mental Health: Everyday Wisdom in Ethical Design. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). ACM, New York, NY, USA, 2708–2756. https://doi.org/10.1145/3025453.3025918
[13] Sukarna Barua, Md Monirul Islam, Xin Yao, and Kazuyuki Murase. 2012. MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering 26, 2 (2012), 405–425.
[14] Natalya N. Bazarova, Yoon Hyung Choi, Victoria Schwanda Sosik, Dan Cosley, and Janis Whitlock. 2015. Social Sharing of Emotions on Facebook: Channel Differences, Satisfaction, and Replies. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '15). Association for Computing Machinery, New York, NY, USA, 154–164. https://doi.org/10.1145/2675133.2675297
[15] Prakhar Biyani, Cornelia Caragea, Prasenjit Mitra, and John Yen. 2014. Identifying emotional and informational support in online health communities. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. 827–836.
[16] Michael Harris Bond. 1996. The handbook of Chinese psychology. Oxford University Press Hong Kong.
[17] Tone Bratteteig and Guri Verne. 2018. Does AI make PD obsolete? exploring challenges from artificial intelligence to participatory design. In Proceedings of the 15th Participatory Design Conference: Short Papers, Situated Actions, Workshops and Tutorial - Volume 2. 1–5.
[18] Heloisa Candello, Claudio Pinhanez, and Flavio Figueiredo. 2017. Typefaces and the Perception of Humanness in Natural Language Chatbots. In
Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI '17). Association for Computing Machinery, New York, NY, USA, 3476–3487. https://doi.org/10.1145/3025453.3025919
[19] Huseyin Cavusoglu, Zhuolun Li, and Ke-Wei Huang. 2015. Can Gamification Motivate Voluntary Contributions? The Case of StackOverflow Q&A Community. In Proceedings of the 18th ACM Conference Companion on Computer Supported Cooperative Work & Social Computing (CSCW '15 Companion). Association for Computing Machinery, New York, NY, USA, 171–174. https://doi.org/10.1145/2685553.2698999
[20] Stevie Chancellor, Andrea Hu, and Munmun De Choudhury. 2018. Norms Matter: Contrasting Social Support Around Behavior Change in Online Weight Loss Communities. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM, New York, NY, USA, Article 666, 14 pages. https://doi.org/10.1145/3173574.3174240
[21] Justin Cranshaw, Emad Elwany, Todd Newman, Rafal Kocielnik, Bowen Yu, Sandeep Soni, Jaime Teevan, and Andrés Monroy-Hernández. 2017. Calendar.help: Designing a workflow-based scheduling agent with humans in the loop. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 2382–2393.
[22] Carolyn E Cutrona and Julie A Suhr. 1994. Social support communication in the context of marriage: an analysis of couples' supportive interactions. (1994).
[23] Munmun De Choudhury, Scott Counts, and Eric Horvitz. 2013. Predicting Postpartum Changes in Emotion and Behavior via Social Media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '13). Association for Computing Machinery, New York, NY, USA, 3267–3276. https://doi.org/10.1145/2470654.2466447
[24] Munmun De Choudhury, Scott Counts, Eric J. Horvitz, and Aaron Hoff. 2014. Characterizing and Predicting Postpartum Depression from Shared Facebook Data. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '14). Association for Computing Machinery, New York, NY, USA, 626–638. https://doi.org/10.1145/2531602.2531675
[25] Munmun De Choudhury and Sushovan De. 2014. Mental health discourse on reddit: Self-disclosure, social support, and anonymity. In Eighth international AAAI conference on weblogs and social media.
[26] Rebecca DerSimonian and Nan Laird. 1986. Meta-analysis in clinical trials. Controlled clinical trials 7, 3 (1986), 177–188.
[27] Judith S Donath. 2002. Identity and deception in the virtual community. In Communities in cyberspace. Routledge, 37–68.
[28] Patricia Drentea and Jennifer L. Moren-Cross. 2005. Social capital and social support on the web: the case of an internet mother site. Sociology of Health & Illness 27, 7 (2005), 920–943. https://doi.org/10.1111/j.1467-9566.2005.00464.x
[29] Marilyn Evans, Lorie Donelle, and Laurie Hume-Loveland. 2012. Social support and online postpartum depression discussion groups: A content analysis. Patient Education and Counseling 87, 3 (2012), 405–410. https://doi.org/10.1016/j.pec.2011.09.011
[30] Xiangmin Fan, Daren Chao, Zhan Zhang, Dakuo Wang, Xiaohua Li, and Feng Tian. 2020. Utilization of Self-Diagnosis Health Chatbots in Real-World Settings: Case Study. Journal of Medical Internet Research 22 (2020).
[31] Kathleen Kara Fitzpatrick, Alison Darcy, and Molly Vierhile. 2017. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Mental Health.
CoRR abs/1810.09590 (2018). arXiv:1810.09590 http://arxiv.org/abs/1810.09590
[33] Darren Gergle and Desney S Tan. 2014. Experimental research in HCI. In Ways of Knowing in HCI. Springer, 191–227.
[34] Zafar Gilani, Reza Farahbakhsh, and Jon Crowcroft. 2017. Do Bots Impact Twitter Activity?. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 781–782. https://doi.org/10.1145/3041021.3054255
[35] Jonathan Grudin. 2008. McGrath and the Behaviors of Groups (BOGs). (2008).
[36] Jonathan Grudin. 2017. From tool to partner: The evolution of human-computer interaction. Synthesis Lectures on Human-Centered Interaction 10, 1 (2017), i–183.
[37] Xinning Gui, Yu Chen, Yubo Kou, Katie Pine, and Yunan Chen. 2017. Investigating Support Seeking from Peers for Pregnancy in Online Health Communities. Proc. ACM Hum.-Comput. Interact. 1, CSCW, Article 50 (Dec. 2017), 19 pages. https://doi.org/10.1145/3134685
[38] Aaron Halfaker and R Stuart Geiger. 2019. ORES: Lowering Barriers with Participatory Machine Learning in Wikipedia. arXiv preprint arXiv:1909.05189 (2019).
[39] Aaron Halfaker, R Stuart Geiger, and Loren G Terveen. 2014. Snuggle: Designing for efficient socialization and ideological critique. In
Proceedings of the SIGCHI conference on human factors in computing systems . 311–320.[40] F. Maxwell Harper, Daphne Raban, Sheizaf Rafaeli, and Joseph A. Konstan. 2008. Predictors of Answer Qualityin Online Q&A Sites. In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’08) .Association for Computing Machinery, New York, NY, USA, 865–874. https://doi.org/10.1145/1357054.1357191[41] Sepp Hochreiter and Jürgen Schmidhuber. [n.d.]. Long Short-Term Memory.
Neural Computation
9, 8 ([n. d.]),1735–1780.[42] Bree Holtz, Andrew Smock, and David Reyes-Gastelum. 2015. Connected Motherhood: Social Support for Moms andMoms-to-Be on Facebook.
Telemedicine and e-Health
21, 5 (2015), 415–421. https://doi.org/10.1089/tmj.2014.0118arXiv:https://doi.org/10.1089/tmj.2014.0118 PMID: 25665177.[43] Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. In
Proceedings of the SIGCHI conference on HumanFactors in Computing Systems . 159–166.[44] Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and Rama Akkiraju. 2018. Touchyour heart: a tone-aware chatbot for customer care on social media. In
Proceedings of the 2018 CHI Conference onHuman Factors in Computing Systems . 1–12.[45] Jina Huh and Mark S. Ackerman. 2012. Collaborative Help in Chronic Disease Management: Supporting IndividualizedProblems. In
Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (CSCW ’12)
Turk psikiyatri dergisi = Turkish journal of psychiatry
17, 4 (2006),243–251.[49] Maurice Jakesch, Megan French, Xiao Ma, Jeffrey T. Hancock, and Mor Naaman. 2019. AI-Mediated Communication:How the Perception That Profile Text Was Written by AI Affects Trustworthiness. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19). ACM, New York, NY, USA, Article 239, 13 pages. https://doi.org/10.1145/3290605.3300469
[50] Isaac L. Johnson, Yilun Lin, Toby Jia-Jun Li, Andrew Hall, Aaron Halfaker, Johannes Schöning, and Brent Hecht. 2016. Not at Home on the Range: Peer Production and the Urban/Rural Divide. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). ACM, New York, NY, USA, 13–25. https://doi.org/10.1145/2858036.
[51] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1408.5882 (2014).
[52] Guillaume Klein, Yoon Kim, Yuntian Deng, Vincent Nguyen, and Alexander M. Rush. 2018. OpenNMT: Neural Machine Translation Toolkit. (2018).
[53] Robert Kraut, Moira Burke, John Riedl, and Paul Resnick. 2010. Dealing with newcomers.
Evidence-based Social Design: Mining the Social Sciences to Build Online Communities.
[54] Building successful online communities: Evidence-based social design (2011), 21–76.
[55] Robert E Kraut and Paul Resnick. 2012. Building successful online communities: Evidence-based social design. MIT Press.
[56] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks.
Advances in Neural Information Processing Systems 25, 2 (2012).
[57] Neha Kumar and Richard J. Anderson. 2015. Mobile Phones for Maternal Health in Rural India. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). Association for Computing Machinery, New York, NY, USA, 427–436. https://doi.org/10.1145/2702123.2702258
[58] Minha Lee, Sander Ackermans, Nena van As, Hanwen Chang, Enzo Lucas, and Wijnand IJsselsteijn. 2019. Caring for Vincent: A Chatbot for Self-Compassion. In
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19). ACM, New York, NY, USA, Article 702, 13 pages. https://doi.org/10.1145/3290605.3300932
[59] Min Kyung Lee, Ji Tae Kim, and Leah Lizarondo. 2017. A human-centered approach to algorithmic services: Considerations for fair and motivating smart community service management that allocates donations to non-profit organizations. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 3365–3376.
[60] Min Kyung Lee, Daniel Kusbit, Anson Kahng, Ji Tae Kim, Xinran Yuan, Allissa Chan, Daniel See, Ritesh Noothigattu, Siheon Lee, Alexandros Psomas, et al. 2019. WeBuildAI: Participatory framework for algorithmic governance.
Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–35.
[61] Uichin Lee, Jihyoung Kim, Eunhee Yi, Juyup Sung, and Mario Gerla. 2013. Analyzing crowd workers in mobile pay-for-answer q&a. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 533–542.
[62] Bronwyn Leigh and Jeannette Milgrom. 2008. Risk factors for antenatal depression, postnatal depression and parenting stress. BMC Psychiatry (2008).
[63] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep Reinforcement Learning for Dialogue Generation. arXiv preprint arXiv:1606.01541 (2016).
[64] Q Vera Liao, Muhammed Mas-ud Hussain, Praveen Chandar, Matthew Davis, Yasaman Khazaeni, Marco Patricio Crasso, Dakuo Wang, Michael Muller, N Sadat Shami, and Werner Geyer. 2018. All Work and No Play?. In
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–13.
[65] John Logie, Joseph Weinberg, F. Maxwell Harper, and Joseph A. Konstan. 2011. Asked and Answered: On Qualities and Quantities of Answers in Online Q&A Sites. In The Social Mobile Web.
[66] Kiel Long, John Vines, Selina Sutton, Phillip Brooker, Tom Feltwell, Ben Kirman, Julie Barnett, and Shaun Lawson. 2017. “Could You Define That in Bot Terms”? Requesting, Creating and Using Bots on Reddit. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17). Association for Computing Machinery, New York, NY, USA, 3488–3500. https://doi.org/10.1145/3025453.3025830
[67] Nurul Lubis, Sakriani Sakti, Graham Neubig, Koichiro Yoshino, Tomoki Toda, and Satoshi Nakamura. 2015. A study of social-affective communication: Automatic prediction of emotion triggers and responses in television talk shows. In . IEEE, 777–783.
[68] Xueming Luo, Siliang Tong, Zheng Fang, and Zhe Qu. 2019. Machines Versus Humans: The Impact of AI Chatbot Disclosure on Customer Purchases. Marketing Science 38, 6 (2019).
[69] Computer Science (2015).
[70] Lena Mamykina, Bella Manoim, Manas Mittal, George Hripcsak, and Björn Hartmann. 2011. Design lessons from the fastest q&a site in the west. In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2857–2866.
[71] Lena Mamykina, Drashko Nakikj, and Noemie Elhadad. 2015. Collective Sensemaking in Online Health Forums. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). Association for Computing Machinery, New York, NY, USA, 3217–3226. https://doi.org/10.1145/2702123.2702566
[72] Elijah Mayfield, Miaomiao Wen, Mitch Golant, and Carolyn Penstein Rosé. 2012. Discovering Habits of Effective Online Support Group Chatrooms. In Proceedings of the 17th ACM International Conference on Supporting Group Work (GROUP ’12). Association for Computing Machinery, New York, NY, USA, 263–272. https://doi.org/10.1145/2389176.2389216
[73] Joseph E McGrath. 1991. Time, interaction, and performance (TIP): A theory of groups. Small Group Research (1991).
[75] Robert R. Morris, Kareem Kouddous, Rohan Kshirsagar, and Stephen Matthew Schueller. 2018. Towards an Artificially Empathic Conversational Agent for Mental Health Applications: System Design and User Perceptions. Journal of medical Internet research.
[76] Alessandro Murgia, Daan Janssens, Serge Demeyer, and Bogdan Vasilescu. 2016. Among the Machines: Human-Bot Interaction on Social Q&A Websites. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA ’16). ACM, New York, NY, USA, 1272–1279. https://doi.org/10.1145/2851581.2892311
[77] Kevin Kyung Nam, Mark S Ackerman, and Lada A Adamic. 2009. Questions in, knowledge in? A study of Naver’s question answering community. In
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 779–788.
[78] Nan Qiu and Haofen Wang. 2018. State machine based context-sensitive system for managing multi-round dialog. US Patent App. 15/694,917.
[79] Kathleen O’Leary, Arpita Bhattacharya, Sean A. Munson, Jacob O. Wobbrock, and Wanda Pratt. 2017. Design Opportunities for Mental Health Peer Support Technologies. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW ’17). ACM, New York, NY, USA, 1470–1484. https://doi.org/10.1145/2998181.2998349
[80] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[81] Zhenhui Peng, Taewook Kim, and Xiaojuan Ma. 2019. GremoBot: Exploring Emotion Regulation in Group Chat. In Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing (CSCW ’19). Association for Computing Machinery, New York, NY, USA, 335–340. https://doi.org/10.1145/3311957.3359472
[82] Zhenhui Peng and Xiaojuan Ma. 2019. Exploring how software developers work with mention bot in GitHub.
CCF Transactions on Pervasive Computing and Interaction (05 Sep 2019). https://doi.org/10.1007/s42486-019-00013-2
[83] Zhenhui Peng and Xiaojuan Ma. 2019. A survey on construction and enhancement methods in service chatbots design. CCF Transactions on Pervasive Computing and Interaction 1, 3 (2019), 204–223.
[84] Trevor Perrier, Nicola Dell, Brian DeRenzi, Richard Anderson, John Kinuthia, Jennifer Unger, and Grace John-Stewart. 2015. Engaging Pregnant Women in Kenya with a Hybrid Computer-Human SMS Communication System. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI ’15). Association for Computing Machinery, New York, NY, USA, 1429–1438. https://doi.org/10.1145/2702123.2702124
[85] Dominic Pettman. 2009. Love in the Time of Tamagotchi.
Theory, Culture & Society 26, 2-3 (2009), 189–208.
[86] Yuqing Ren, F Maxwell Harper, Sara Drenner, Loren Terveen, Sara Kiesler, John Riedl, and Robert E Kraut. 2012. Building Member Attachment in Online Communities: Applying Theories of Group Identity and Interpersonal Bonds. MIS Quarterly 36, 3 (2012), 841–864.
[87] Yuqing Ren, Robert Kraut, Sara Kiesler, and Paul Resnick. 2012. Encouraging commitment in online communities. Building successful online communities: Evidence-based social design (2012), 77–124.
[88] Paul Resnick, Joseph Konstan, Yan Chen, and Robert E Kraut. 2012. Starting new online communities. Building successful online communities: Evidence-based social design 231 (2012).
[89] Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-Driven Response Generation in Social Media. In
Conference on Empirical Methods in Natural Language Processing.
[90] Lionel P Robert, Casey Pierce, Liz Marquis, Sangmi Kim, and Rasha Alahmad. 2020. Designing fair AI for managing employees in organizations: a review, critique, and design agenda. Human–Computer Interaction (2020), 1–31.
[91] Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP 109 (1995), 109.
[92] Saiph Savage, Andres Monroy-Hernandez, and Tobias Höllerer. 2016. Botivist: Calling Volunteers to Action Using Online Bots. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (CSCW ’16). ACM, New York, NY, USA, 813–822. https://doi.org/10.1145/2818048.2819985
[93] Ari Schlesinger, Kenton P. O’Hara, and Alex S. Taylor. 2018. Let’s Talk About Race: Identity, Chatbots, and AI. In
Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, USA, Article 315, 14 pages. https://doi.org/10.1145/3173574.3173889
[94] Jessica Schroeder, Chelsey Wilkes, Kael Rowan, Arturo Toledo, Ann Paradiso, Mary Czerwinski, Gloria Mark, and Marsha M. Linehan. 2018. Pocket Skills: A Conversational Mobile Web App To Support Dialectical Behavioral Therapy. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, USA, Article 398, 15 pages. https://doi.org/10.1145/3173574.3173972
[95] Joseph Seering, Robert Kraut, and Laura Dabbish. 2017. Shaping Pro and Anti-Social Behavior on Twitch Through Moderation and Example-Setting. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW ’17). Association for Computing Machinery, New York, NY, USA, 111–125. https://doi.org/10.1145/2998181.2998277
[96] Joseph Seering, Michal Luria, Connie Ye, Geoff Kaufman, and Jessica Hammer. 2020. It Takes a Village: Integrating an Adaptive Chatbot into an Online Gaming Community. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376708
[97] Joseph Seering, Tony Wang, Jina Yoon, and Geoff Kaufman. 2019. Moderator engagement and community development in the age of algorithms. New Media & Society
21, 7 (2019), 1417–1443.
[98] Ameneh Shamekhi, Q. Vera Liao, Dakuo Wang, Rachel K. E. Bellamy, and Thomas Erickson. 2018. Face Value? Exploring the Effects of Embodiment for a Group Facilitation Agent. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, USA, Article 391, 13 pages. https://doi.org/10.1145/3173574.3173965
[99] Eva Sharma and Munmun De Choudhury. 2018. Mental Health Support and Its Relationship to Linguistic Accommodation in Online Communities. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, USA, Article 641, 13 pages. https://doi.org/10.1145/3173574.3174215
[100] Katie A Siek, Gillian R Hayes, Mark W Newman, and John C Tang. 2014. Field deployments: Knowing from using in context. In Ways of Knowing in HCI. Springer, 119–142.
[101] Ming Tan, Dakuo Wang, Yupeng Gao, Haoyu Wang, Saloni Potdar, Xiaoxiao Guo, Shiyu Chang, and Mo Yu. 2019. Context-Aware Conversation Thread Detection in Multi-Party Chat. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 6457–6462.
[102] Loren Terveen, Joseph A Konstan, and Cliff Lampe. 2014. Study, Build, Repeat: Using Online Communities as a Research Platform. In Ways of Knowing in HCI. Springer, 95–117.
[103] Carlos Toxtli, Andrés Monroy-Hernández, and Justin Cranshaw. 2018. Understanding Chatbot-mediated Task Management. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). ACM, New York, NY, USA, Article 58, 6 pages. https://doi.org/10.1145/3173574.3173632
[104] Cornelia F. van Uden-Kraan, Constance H. C. Drossaert, Erik Taal, Bret R. Shaw, Erwin R. Seydel, and Mart A. F. J. van de Laar. 2008. Empowering Processes and Outcomes of Participation in Online Support Groups for Patients With Breast Cancer, Arthritis, or Fibromyalgia.
Qualitative Health Research 18, 3 (2008), 405–417. https://doi.org/10.1177/1049732307313429 PMID: 18235163.
[105] Joseph Walther and Saxon Boyd. 2002. Attraction to computer-mediated social support. Communication Technology and Society: Audience Adoption and Uses (01 2002), 153–188.
[106] Dakuo Wang, Josh Andres, Justin Weisz, Erick Oduor, and Casey Dugan. 2021. AutoDS: Towards Human-Centered Automation of Data Science. In Proceedings of CHI 2021.
[107] Dakuo Wang, Youyang Hou, Lin Luo, and Yingxin Pan. 2016. Answerer engagement in an enterprise social question & answering system. iConference 2016 Proceedings (2016).
[108] Dakuo Wang, Liuping Wang, Zhan Zhang, Ding Wang, Haiyi Zhu, Yvonne Gao, Xiangmin Fan, and Feng Tian. 2021. Brilliant AI Doctor in Rural China: Tensions and Challenges in AI-Powered CDSS Deployment. In Proceedings of CHI 2021.
[109] Dakuo Wang, Justin D. Weisz, Michael Muller, Parikshit Ram, Werner Geyer, Casey Dugan, Yla Tausczik, Horst Samulowitz, and Alexander Gray. 2019. Human-AI Collaboration in Data Science: Exploring Data Scientists’ Perceptions of Automated AI. To appear in Computer Supported Cooperative Work (CSCW) (2019).
[110] Minjuan Wang, Christina Sierra, and Terre Folger. 2003. Building a dynamic online learning community among adult learners.
Educational Media International 40, 1-2 (2003), 49–62.
[111] Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R^3: Reinforced ranker-reader for open-domain question answering. In Thirty-Second AAAI Conference on Artificial Intelligence.
[112] Yi-Chia Wang, Mahesh Joshi, William W Cohen, and Carolyn Penstein Rosé. 2008. Recovering Implicit Thread Structure in Newsgroup Style Conversations. In ICWSM.
[113] Yi-Chia Wang, Robert Kraut, and John Levine. 2015. Eliciting and Receiving Online Support: Using Computer-Aided Content Analysis to Examine the Dynamics of Online Social Support. Journal of medical Internet research 17 (04 2015), e99. https://doi.org/10.2196/jmir.3558
[114] Yi-Chia Wang, Robert Kraut, and John M. Levine. 2012. To Stay or Leave?: The Relationship of Emotional and Informational Support to Commitment in Online Health Support Groups. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (CSCW ’12). ACM, New York, NY, USA, 833–842. https://doi.org/10.1145/2145204.2145329
[115] Joseph Weizenbaum. 1966. ELIZA—a Computer Program for the Study of Natural Language Communication Between Man and Machine. Commun. ACM 9, 1 (Jan. 1966), 36–45. https://doi.org/10.1145/365153.365168
[116] Miaomiao Wen and Carolyn Penstein Rosé. 2012. Understanding participant behavior trajectories in online health support groups using automatic extraction methods. In
Proceedings of the 17th ACM International Conference on Supporting Group Work. 179–188.
[117] Marty J Wolf, K Miller, and Frances S Grodzinsky. 2017. Why we should have seen that coming: comments on Microsoft’s “Tay” experiment, and wider implications.
ACM SIGCAS Computers and Society 47, 3 (2017), 54–64.
[118] Marisol Wong-Villacres, Neha Kumar, and Betsy DiSalvo. 2019. The Parenting Actor-Network of Latino Immigrants in the United States. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300914
[119] Allison Woodruff, Sarah E Fox, Steven Rousso-Schindler, and Jeffrey Warshaw. 2018. A qualitative exploration of perceptions of algorithmic fairness. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. 1–14.
[120] Ziming Wu and Xiaojuan Ma. 2017. Money as a Social Currency to Manage Group Dynamics: Red Packet Gifting in Chinese Online Communities. In CHI Conference Extended Abstracts on Human Factors in Computing Systems.
[121] Mi Xiang, Zhiruo Zhang, and Huigang Liang. 2020. Sedentary behavior relates to mental distress of pregnant women differently across trimesters: An observational study in China. Journal of affective disorders 260 (2020), 187–193.
[122] Anbang Xu, Zhe Liu, Yufan Guo, Vibha Sinha, and Rama Akkiraju. 2017. A new chatbot for customer service on social media. In
Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. 3506–3510.
[123] Ying Xu, Dakuo Wang, Penelope Collins, Hyelim Lee, and Mark Warschauer. 2021. Same benefits, different communication patterns: Comparing Children’s reading with a conversational agent vs. a human partner. Computers & Education 161 (2021), 104059.
[124] Diyi Yang, Robert Kraut, and John M. Levine. 2017. Commitment of Newcomers and Old-Timers to Online Health Support Communities. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17). Association for Computing Machinery, New York, NY, USA, 6363–6375. https://doi.org/10.1145/3025453.3026008
[125] Diyi Yang, Robert E. Kraut, Tenbroeck Smith, Elijah Mayfield, and Dan Jurafsky. 2019. Seekers, Providers, Welcomers, and Storytellers: Modeling Social Roles in Online Health Communities. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19). ACM, New York, NY, USA, Article 344, 14 pages. https://doi.org/10.1145/3290605.3300574
[126] Diyi Yang, Zheng Yao, and Robert Kraut. 2017. Self-disclosure and channel difference in online health support groups. In Eleventh International AAAI Conference on Web and Social Media.
[127] Diyi Yang, Zheng Yao, Joseph Seering, and Robert Kraut. 2019. The Channel Matters: Self-disclosure, Reciprocity and Social Support in Online Cancer Support Groups. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19). ACM, New York, NY, USA, Article 31, 15 pages. https://doi.org/10.1145/3290605.3300261
[128] Esra Yazici, Tulay Sati Kirkan, Puren Akcali Aslan, Nazan Aydin, and A Yazici. 2015. Untreated depression in the first trimester of pregnancy leads to postpartum depression: High rates from a natural follow-up study. Neuropsychiatric Disease and Treatment 11 (02 2015), 405–11. https://doi.org/10.2147/ndt.s77194
[129] Zi Yin, Keng-hao Chang, and Ruofei Zhang. 2017. DeepProbe: Information Directed Sequence Understanding and Chatbot Design via Recurrent Neural Networks. In
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17). Association for Computing Machinery, New York, NY, USA, 2131–2139. https://doi.org/10.1145/3097983.3098148
[130] Amy X. Zhang and Justin Cranshaw. 2018. Making Sense of Group Chat Through Collaborative Tagging and Summarization. Proc. ACM Hum.-Comput. Interact. 2, CSCW, Article 196 (Nov. 2018), 27 pages. https://doi.org/10.1145/3274465
[131] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2020. The design and implementation of XiaoIce, an empathetic social chatbot. Computational Linguistics 46, 1 (2020), 53–93.
[132] Haiyi Zhu, Robert E Kraut, and Aniket Kittur. 2014. The impact of membership overlap on the survival of online communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 281–290.
[133] Haiyi Zhu, Bowen Yu, Aaron Halfaker, and Loren Terveen. 2018. Value-sensitive algorithm design: Method, case study, and lessons. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 1–23.
[134] Haiyi Zhu, Amy Zhang, Jiping He, Robert E Kraut, and Aniket Kittur. 2013. Effects of peer feedback on contribution: a field experiment in Wikipedia. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 2253–2262.