Automatic Monitoring Social Dynamics During Big Incidences: A Case Study of COVID-19 in Bangladesh
Fahim Shahriar and Md Abul Bashar
Comilla University, Cumilla, Bangladesh
Queensland University of Technology, Brisbane, Australia
Abstract.
Newspapers are trustworthy media from which people get more reliable and credible information than from other sources. Social media, on the other hand, often spread rumors and misleading news to attract more traffic and attention. Careful characterization, evaluation, and interpretation of newspaper data can provide insight into intriguing and emotive social issues and help monitor any big social incidence. This study analyzed a large set of spatio-temporal Bangladeshi newspaper data related to the COVID-19 pandemic. The methodology included volume analysis, topic analysis, automated classification, and sentiment analysis of news articles to gain insight into the COVID-19 pandemic in different sectors and regions of Bangladesh over a period of time. This analysis will help the government and other organizations identify the challenges that have arisen in society due to this pandemic, the steps that should be taken immediately and in the post-pandemic period, and how the government and its allies can come together to address such crises in the future with these problems in mind.
Keywords:
Topic Analysis · LDA Topic Model · Dynamic Topic Modeling · Time Series Decomposition · Bengali Text Dataset · Newspaper · Text Classification · RNN · LSTM · Sentiment Analysis · CNN-BiLSTM
The outbreak of COVID-19 has brought serious health and economic consequences to society. It triggered one of the largest recessions in the world. Travel and currency companies lost billions of dollars, global stock markets plummeted, schools were closed, and health care systems were exhausted. Mental and social problems arose as people started to worry about infection, losing friends and family, losing their jobs, or isolation.

Bangladesh has not been spared by this virus. It has had a major impact on people’s lives and significantly degraded their quality of life. There were significant numbers of infections and deaths. Hospitals did not have adequate treatment facilities, including doctors, beds, and emergency supplies. Besides the health crisis, people have suffered enormous economic losses. Many people lost their jobs, companies lost revenue, and many of them went bankrupt. The most affected were day laborers and low-income workers. The lockdown during the pandemic suppressed their income, and many workers starved because their livelihood was cut off. Working people took to the streets in search of a livelihood and protested for relief. Seeing their plight, many people, including the government, came forward to help them. Because of the lockdown, the international transport system was shut down, which stopped imports and exports. As a result, the country’s industry suffered badly. Objective monitoring and analysis of social dynamics during such a big incident can help the government and other authorities decide where initiatives are required and act on them. This research proposes using articles published in newspapers to objectively monitor and analyze social dynamics during a big incidence, such as the COVID-19 pandemic.

Newspapers are one of the most popular mass media in our daily life.
Newspapers provide information on a country’s financial, political, social, and environmental affairs. Whether it is a public campaign, an emergency, or a provocation, newspapers are a great resource for keeping track of internal and external events and stories. This mass medium generally provides authentic information, whereas social media such as Facebook and Twitter often spread rumors and cannot be relied upon for authentic news. Effective classification, analysis, and interpretation of newspaper data can provide a deep understanding of any big incident in a society.

In this research, we analyzed a large spatio-temporal dataset of Bangladeshi daily newspapers related to COVID-19. The approach incorporated volume analysis, topic analysis, automatic classification of news articles, and sentiment analysis to better understand the COVID-19 pandemic in Bangladesh’s divisions and districts over time. The experimental results and analysis give an objective insight into the COVID-19 pandemic in Bangladesh that will help the government and other authorities distribute resources. In particular, this paper shows how to use automatic techniques for monitoring social dynamics in big incidents such as a pandemic, natural disaster, or social unrest.

This research makes the following main contributions. (1) It collects, manually classifies, and publishes a large collection of COVID-19 related Bangladeshi news articles in Bengali and English. (2) It investigates the topics discussed during the COVID-19 pandemic in Bangladesh and how they have changed over time using manual and automatic techniques. (3) It designs a CNN-BiLSTM architecture for analyzing sentiment in Bengali text. (4) It analyzes COVID-19 related sentiments in the community over time and space.
(5) It automatically categorizes documents into classes of interest for social monitoring.

The rest of the paper is organized as follows: Section 2 discusses related work, Section 3 describes the methodology and data collection, Section 4 presents experimental results, and Section 5 concludes the paper.
In this section, we discuss related work by other researchers. We divide it into four parts: Static Topic Modeling, Dynamic Topic Modeling, Text Classification, and Sentiment Analysis.
Topic modeling is a process of discovering hidden topics in a collection of texts (Bashar et al., 2020a; Balasubramaniam et al., 2020). It can be viewed as a statistical representation of topics obtained through text mining. One of the most popular topic modeling techniques, Latent Dirichlet Allocation (LDA) (Blei et al., 2003; Bashar et al., 2020a), discovers topics based on word co-occurrence in a set of documents. LDA is very useful for finding a reasonably accurate mixture of topics within a given document.

Topic modeling has been well studied for English text mining. For instance, Zhao et al. (2011) used unsupervised topic modeling to compare the content of Twitter with that of a traditional news medium, the New York Times. They used the Twitter-LDA model to find topics from a representative sample of the entire Twitter and then used text mining techniques to compare these Twitter topics with the New York Times’ topics, taking into account the topic category and type. Wang and Blei (2011) developed an algorithm to recommend scientific articles to users in online communities. Their method, collaborative topic modeling, combines the advantages of traditional collaborative filtering and probabilistic topic modeling. Wayasti et al. (2018) applied Latent Dirichlet Allocation to extract topics from ride-hailing customers’ posts on Twitter. They tried 40 parameter combinations of LDA to obtain the best combination of topics. According to the perplexity value, the customers discussed 9 topics in their posts, each described by a set of keywords. Tong and Zhang (2016) presented two experiments that build topic models on Wikipedia articles and Twitter users’ tweets.

However, unlike English text mining, topic modeling has not been well studied for Bengali text mining. Das and Bandyopadhyay (2010b) performed topic-wise opinion summarization of Bengali text. They applied K-Means clustering and a document-level theme relational graph representation. However, they did not use any topic modeling technique such as LDA. Rakshit et al. (2015) applied a multi-class SVM classifier to analyze Bengali poetry and poet relations. They performed a subject-wise classification of poems into predefined categories. Hasan et al. (2019) compared the performance of the LDA and LDA2vec topic models on Bengali newspapers. Al Helal and Mouhoub (2018) used LDA to detect the primary topics of a Bengali news corpus. However, they did not apply LDA directly to the Bengali text. Instead, they translated the Bengali text into English and then applied LDA to detect the topics. Rahman et al. (2019) used lexical analysis for sentence-wise topic modeling. Their topic modeling was based on sentiment analysis. None of the existing works used Bengali text topic modeling for monitoring a pandemic or a major event.

Topic modeling has also been studied in languages other than English and Bengali. De Santis et al. (2020) presented a system that uses NLP pipelines, a theoretical framework for content aging to determine the qualitative parameters of tweets, and co-occurrence analysis to build topic maps, which are then split to identify topics in posts from Italian Twitter users. Han et al. (2020) extracted topics related to COVID-19 from a text dataset of Sina Weibo (a Chinese microblogging website) using the LDA topic model.
The dynamic topic model is a cumulative model that can be used to analyze changes in a document collection over time (Bashar et al., 2020a). There are many studies on dynamic topic modeling for the English language. For example, AlSumait et al. (2008) showed that the LDA model can be extended to an online version by incrementally updating the current model with new data, and that the resulting model can capture dynamic changes in the topics. Dieng et al. (2019) evaluated D-ETM on three datasets and presented the word probabilities of eight different topics that D-ETM learned over time. Nguyen et al. (2020) discovered latent topics in the financial reports of listed companies in the United States and studied the evolution of the discovered themes through dynamic topic modeling. Marjanen et al. (2020) discussed the role of humanistic interpretation in analyzing discourse dynamics through topic models of historical newspapers. Bashar et al. (2020a) extracted five COVID-19 related topics from a Twitter dataset through LDA topic modeling and showed how the extracted topics changed over time. For the Bengali language, however, there has so far been no research on dynamic topic modeling. In this study, we examine the evolution of the extracted COVID-19 related topics over time using dynamic topic modeling.
Text classification, also known as text labeling or text categorization, is the task of assigning text to organized groups (Bashar et al., 2020b; Bashar and Nayak, 2020; Bashar et al., 2018). Using NLP, classifiers can automatically analyze text and then assign a set of predefined labels or categories based on its content.

Many researchers have worked on text classification in English. For example, Patil and Pawar (2012) used the Naive Bayes algorithm to classify website content. They divided the website content into ten categories, and the average accuracy over the ten categories was almost 80%. Bijalwan et al. (2014) used K-Nearest Neighbors, Naive Bayes, and Term-gram to classify text and showed that K-Nearest Neighbors was more accurate than Naive Bayes and Term-gram. Tam et al. (2002) showed that K-Nearest Neighbors was superior to NNet and Naive Bayes for English documents. Pawar and Gawande (2012) showed that the performance of Support Vector Machines is far superior to that of Decision Trees, Naive Bayes, K-Nearest Neighbors, Rocchio’s algorithm, and backpropagation networks. Liu et al. (2010) showed that Support Vector Machines outperform K-Nearest Neighbors and Naive Bayes.

In addition to English text classification, some researchers have also classified Bengali text. For example, Mandal and Sen (2014) applied four supervised learning methods (Naive Bayes, K-Nearest Neighbors, Decision Tree, and Support Vector Machine) to labeled web documents. They classified the documents into five categories: Business, Sports, Health, Technology, and Education. Chy et al. (2014) applied a Naive Bayes classifier to categorize Bengali news. Pal et al. (2015) described a Naive Bayes classifier for Bengali sentence classification. They used over 1747 sentences in their experiment and obtained an accuracy of 84%. Kabir et al. (2015) used a Stochastic Gradient Descent (SGD) classifier to categorize Bengali documents. Eshan and Hasan (2017) created an application that identifies abusive texts in Bengali. They applied Naive Bayes, Random Forest, and Support Vector Machines (SVM) with Radial Basis Function (RBF), linear, polynomial, and sigmoid kernels to classify the texts and compared the results. Islam et al. (2017) applied SVM, Naive Bayes, and Stochastic Gradient Descent (SGD) to classify Bengali documents and compared the results of those classifiers. However, none of the existing works used Bengali text classification for monitoring a pandemic or a major event.
Sentiment analysis refers to computationally recognizing and categorizing opinions expressed in a piece of text. It is successfully used in commerce, where companies track online discussions to gauge public sentiment toward their brand, product, or service.

A lot of research has been done on sentiment analysis for the English language. For example, Cui et al. (2006) analyzed about 100,000 product reviews from various websites. They divided reviews into two main categories: positive and negative. Jagtap and Dhotre (2014) applied a Support Vector Machine and a Hidden Markov Model and found that the hybrid classification model is well suited for extracting teacher feedback and evaluating sentiments. Alm et al. (2005) grouped seven emotion categories into three polarity classes (positive emotion, negative emotion, and neutral); with tuned Winnow parameters, their method reached 63% accuracy. To extract Twitter sentiment, Agarwal et al. (2011) applied a unigram model, a tree model, and a feature-based model. Bashar et al. (2020a) used a Convolutional Neural Network to extract sentiments related to COVID-19 from a Twitter dataset.

Some research has applied sentiment analysis to Bengali texts. For instance, Das and Bandyopadhyay (2010a) classified emotions into six categories: Happy, Sad, Anger, Disgust, Fear, and Surprise. Chowdhury and Chowdhury (2014) applied sentiment analysis to Bangla microblog posts using a semi-supervised bootstrapping method with SVM and Maximum Entropy. Hasan et al. (2014) proposed a strategy to identify sentiments in Bengali texts by Contextual Valency Analysis, employing a POS tagger in their approach. Hassan et al. (2016) used recurrent neural networks to analyze sentiment in Bangla and Romanized Bangla texts, using the Bangla and Romanized Bangla Text (BRBT) dataset in their experiments. For sentiment analysis of Bangla microblogs, Asimuzzaman et al. (2017) used an Adaptive Neuro-Fuzzy Inference System to predict polarity, applying fuzzy rules to semantic rules. Mahtab et al. (2018) designed a model for sentiment analysis of Bangladesh cricket news. They applied TF-IDF and a Support Vector Machine (SVM) and obtained 64.596% accuracy. Tripto and Ali (2018) applied sentiment analysis to YouTube comments. They built a deep learning model that classifies a Bengali text into three sentiment classes and into five sentiment classes. Tabassum and Khan (2019) used the Random Forest algorithm to classify sentiments in Bengali texts. Tuhin et al. (2019) applied Naive Bayes and a topic modeling approach to design an automated system for sentiment analysis of Bengali text. Their system classifies emotions into six categories: happy, sad, tender, excited, angry, and scared. However, none of the existing works used Bengali text sentiment analysis for monitoring a pandemic or a major event.
The pandemic has changed society and the country by a significant margin. Major sectors of the nation, such as the economic, social, and political sectors, have been affected massively, and the education system has been hit particularly hard. This research aims to automatically analyze the daily newspapers of Bangladesh to reveal what is going on in society and to understand the underlying topics (or subjects) and sentiments that arise and evolve in the discussion.

This study conducts topic and sentiment analysis on a large collection of COVID-19 related news articles published in Bangladesh in both Bengali and English. The analysis covers both spatial and temporal dimensions. In the topic analysis, we used LDA-based topic modeling and dynamic topic modeling to find the topics, their evolution over time, and their distribution over space (location). We also analyzed what impact each topic had on particular areas. Then we analyzed the sentiment distribution over time and space to identify social sentiment. The experimental workflow of this study is shown in Figure 1.

First, we manually gathered a large collection of COVID-19 related news articles from the six most widely circulated daily newspapers in Bangladesh. Along with the news text, the collection contains geospatial and temporal information about the news. The dataset was then preprocessed by removing HTML, markers, and other non-relevant information such as adverts.

Then, we manually organized the news articles into a set of classes and sub-classes, and we extracted the topics and subtopics from the dataset. We used these classes and sub-classes to perform basic analysis, such as comparing similarity and diversity in the news. These classes and sub-classes were also used to qualitatively evaluate the accuracy of the topics discovered by LDA and the labels predicted by classifiers before LDA and the classifiers were employed for detailed analysis.
[Figure 1: flowchart with the stages Bengali Newspaper Dataset Collection; Preprocessing and Data Preparation; Topic Analysis; Volume Analysis; Temporal Analysis of Volume; Spatial Analysis of Volume; Topic Extraction; Dynamic Topic Modeling: Temporal Trends of Topics; Spatial Distribution of Topics; Automatic Classification of News Articles; Sentiment Analysis]
Fig. 1. Experimental Workflow
These publicly available news articles related to COVID-19 were collected from the six most popular newspapers in Bangladesh from 21 January 2020 to 19 May 2020. The six newspapers are The Daily Prothom Alo, Bangladesh Pratidin, Kaler Kantho, The Daily Star, New Age, and The Daily Observer. A total of 15,565 news articles were collected from these six newspapers. From every news article, we extracted the news title, the main body of the news, a summary of the news (i.e., the first few lines of the news body), the published date, and the location of the news incident. We used Python’s BeautifulSoup and Newspaper3k tools to extract the news content. BeautifulSoup is a popular Python package for parsing HTML and XML documents and one of the most popular web scraping tools. Newspaper3k is a user-friendly library for scraping news articles and related data from newspaper portals. It is built on the requests library and uses lxml for parsing. It is an improved version of the Newspaper module and serves the same purpose. Table 1 summarizes the statistics of the collection. We call this collection the Comilla University COVID-19 News Collection (CoU-CNC). We made it available online for further analysis.

Article Source            Language   Article Count
The Daily Prothom Alo     Bengali    4169
Bangladesh Pratidin       Bengali    5584
Kaler Kantho              Bengali    1160
The Daily Star            English    1278
The Daily Observer        English    1191
New Age                   English    2183

Table 1. CoU-CNC Dataset Statistics
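The extraction step can be illustrated with a small, dependency-free sketch. It is only a stand-in for the BeautifulSoup and Newspaper3k tools used in this study: it relies on Python's standard html.parser module and assumes, hypothetically, that the headline sits in an h1 element and the body text in p elements. The names ArticleParser, extract_article, and summary_paragraphs are illustrative and are not part of the original pipeline.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <h1> headline and <p> paragraphs of a news page.

    Assumption: headline in <h1>, body in <p>; real portals differ.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None        # currently open h1/p element, if any
        self._skip_depth = 0    # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("h1", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style content
        if self._tag == "h1":
            self.title += data
        elif self._tag == "p":
            self.paragraphs[-1] += data


def extract_article(html, summary_paragraphs=2):
    """Return title, body, and a summary built from the first paragraphs."""
    parser = ArticleParser()
    parser.feed(html)
    body = [p.strip() for p in parser.paragraphs if p.strip()]
    return {
        "title": parser.title.strip(),
        "body": "\n".join(body),
        "summary": " ".join(body[:summary_paragraphs]),
    }


if __name__ == "__main__":
    page = "<h1>Sample headline</h1><p>First line.</p><p>Second line.</p>"
    print(extract_article(page))
```

The published date and incident location mentioned above are not handled here, since where they appear depends on each newspaper's page layout.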
Of these six newspapers, three (The Daily Prothom Alo, Bangladesh Pratidin, Kaler Kantho) publish in Bengali, and the other three (The Daily Star, The Daily Observer, New Age) publish in English. There are 10,913 news articles in Bengali, and the remaining 4,652 news articles are in English.

Because we wanted all articles in the same language for consistent parsing, we translated the 4,652 English articles into Bengali with Python’s googletrans module. After translation, all the articles are in Bengali. Then, we applied tokenization to split the text into smaller tokens: the news articles are split into sentences, and the sentences are tokenized into words. Then, we applied noise removal (e.g., removing HTML tags, extra white spaces, special characters, and numbers) to clean up the text. Then, we removed the stopwords from the documents. As NLTK has no built-in stopword list for Bengali, we manually created a stopword list and made it available online. Then, we expanded contractions. We set the minimum letter length to 6 and removed all words below this minimum length. There are no good resources for stemming and lemmatization in the Bengali language, so we applied stemming and lemmatization to the tokens with our own process. After removing all the stopwords and other noise, there were a total of 80,693 tokens. The Bengali language has some specific suffixes, and we removed these suffixes from the words with the help of Python. We used the Bangla Steamer.Steamer library of Python to improve accuracy. However, it did not give the expected results, as the library is effective only for a small number of Bengali words. To increase the accuracy of this lemmatized dictionary of 80,693 tokens, we manually verified about 30,000 of the most frequent tokens. During this check, we lemmatized manually where needed and corrected incorrect and misspelled words. We have published the verified Bengali words online under the title “Modified Bengali Words” for further analysis.

To compare the number of news articles published with the COVID-19 cases in Bangladesh, we collected an open-source dataset of confirmed COVID-19 cases and deaths in Bangladesh from March 8 to May 19. We also collected another open-source dataset of confirmed cases by division and district of Bangladesh from March 8 to May 19.

CoU-CNC Dataset: https://cutt.ly/djGILi2
Bengali-Stopwords: https://cutt.ly/2jXbDRB

Class Distribution in News Articles. After collecting the news articles, we first analyzed them manually. In this process, we extracted eight classes (shown in Table 2) and 19 sub-classes from the news articles. The eight classes and the hierarchical organization of the sub-classes are shown in Figure 2. The distribution of the extracted classes over news articles is shown in Figure 3, and the distribution of the extracted sub-classes over news articles is shown in Figure 4.
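The text preprocessing described earlier in this section (noise removal, sentence splitting, tokenization, stopword removal, and minimum-length filtering) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the regular expressions, the function name preprocess, and the one-word stopword set are hypothetical, and translation, contraction expansion, and stemming are left out.

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")
DIGITS = re.compile(r"[0-9\u09E6-\u09EF]")      # ASCII and Bengali digits
SENTENCE_END = re.compile(r"[\u0964?!]")         # danda, question mark, exclamation
BENGALI_WORD = re.compile(r"[\u0980-\u09FF]+")   # runs of Bengali-block characters


def preprocess(text, stopwords, min_len):
    """Clean raw article text and return a flat list of Bengali tokens.

    Assumption: a token is a maximal run of characters from the Bengali
    Unicode block; punctuation and Latin text are dropped as noise.
    """
    text = HTML_TAG.sub(" ", text)   # strip HTML tags
    text = DIGITS.sub(" ", text)     # strip numbers
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    tokens = []
    for sentence in sentences:
        for token in BENGALI_WORD.findall(sentence):
            # len() counts Unicode code points, so vowel signs count too
            if token not in stopwords and len(token) >= min_len:
                tokens.append(token)
    return tokens


if __name__ == "__main__":
    sample = "<p>করোনা ও লকডাউন ২০২০। নতুন খবর</p>"
    print(preprocess(sample, stopwords={"ও"}, min_len=2))
```

The study reports a minimum length of 6; the example passes 2 only to keep the toy input short. Texts that contain zero-width joiners would need extra handling, because those characters lie outside the Bengali block and would split a token.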
Table 2. Eight Classes Extracted from the Collected News Articles
(1) Statistics, (2) Social Information, (3) COVID-19 Effects, (4) COVID-19 Responses and Preventive Measure, (5) Government Announcement and Responses, (6) Solidarity and Cooperation, (7) International Information, and (8) Health Organization Responses
Modified Bengali Words: https://cutt.ly/8jE6GIC
https://data.humdata.org/dataset/district-wise-quarantine-for-covid-19

[Figure 2: hierarchy of the eight classes (Statistics; Social Information; COVID-19 Effects; COVID-19 Responses and Preventive measure; Government Announcement and responses; Solidarity and cooperation; Health Organizations Responses; International Information) and their sub-classes: Social impact; Public unawareness; Protestation; Sanction; Spread of rumors or misinformation on corona virus; Positive patient symptoms and identification; Severe health outcomes and deaths; Transmission patterns and risks; Negative cases and Corona virus recovery stories; Strategic preparedness and response plan; Global economic impact of Corona virus (shown twice); Protective products and machines; Corona virus treatment and Vaccine; Government guidelines, instructions and mobilization; Policy inconsistency; External support, Aids; Trip and transportation; Repatriation]
Fig. 2. Manually Extracted Classes and Sub-classes

[Figure 3: pie chart, slice labels as printed: COVID-19 Effects 26.3%; COVID-19 … 17.4%; Social Information 13.7%; Statistics 13.1%; International … 10.2%; Government … 8.9%; Health … 5.5%; Solidarity and … 4.9%]
Fig. 3. Distribution of Manually Extracted Classes over News Articles

[Figure 4: pie chart, slice labels as printed: Strategic … 19.1%; Positive patient … 11.5%; Social impact 11.0%; Severe health … 9.4%; Global political … 8.4%; Government … 8.2%; Negative cases and … 4.9%; Global economic … 4.7%; Policy inconsistency 3.4%; Public unawareness 3.0%; Transmission … 3.0%; Repatriation 2.4%; Sanction 2.4%; Corona virus … 2.3%; Protestation 1.6%; Trip and … 1.5%; Protective products … 1.4%; External support, … 1.0%; Spread of rumors or … 0.8%]
Fig. 4. Distribution of Manually Extracted Sub-classes over News Articles

Time series (temporal) analysis of newspaper articles is used to observe how the news volume changed during the pandemic. Time series decomposition treats a series as a combination of four components along the time dimension: level, trend, seasonality, and noise. Level is the average value of the series, trend is the increasing or decreasing tendency of the series, seasonality is the repeating short-term cycle in the series, and noise is the random variation in the series. Decomposition provides a useful model for thinking about time series and for better understanding problems during time series analysis and decision making. The additive model (Dagum, 2010) assumes that the components are added together:

$$y(x) = l(x) + t(x) + s(x) + n(x) \tag{1}$$

where $y(x)$ represents the additive model, $l(x)$ represents the observed level, $t(x)$ represents the trend, $s(x)$ represents the seasonality, and $n(x)$ represents the noise or residual in the signal $x$. This model is linear: changes over time are consistently made by the same amount. A linear trend is a straight line, and a linear seasonality has the same frequency and amplitude. On the other hand, a multiplicative model (Dagum, 2010) assumes that the components are multiplied together:

$$y(x) = l(x) \times t(x) \times s(x) \times n(x) \tag{2}$$

where $y(x)$ represents the multiplicative model, $l(x)$ represents the observed level, $t(x)$ represents the trend, $s(x)$ represents the seasonality, and $n(x)$ represents the noise in the signal $x$. A multiplicative model is nonlinear, for example quadratic or exponential: changes increase or decrease over time. A nonlinear trend is a curved line. In this study, we decomposed the time series using the multiplicative model.

For spatial analysis, we used Tableau software to compare geographically the number of news articles published and the number of confirmed COVID-19 cases.

Analyzing the topics of news articles published during a major incident or a pandemic like COVID-19 can help monitor the situation and understand public concerns, which is critical for government authorities and charity organizations to distribute the required resources and aid. However, in such a situation, a large number of news articles are published in various newspapers. We observed that as the situation deteriorated during the pandemic, newspapers published a large volume of news on various topics. Manually analyzing the topics by reading a large number of articles is time-consuming and expensive.
We used two unsupervised machine learning techniques: (a) LDA (Blei et al., 2003), a popular topic modeling technique, as static topic modeling to automatically find the topics of articles published in newspapers, and (b) dynamic topic modeling (Blei and Lafferty, 2006) to see how those topics evolve over time.

LDA is a Bayesian probabilistic model that discovers topics and provides a topic distribution for each document and a word distribution for each topic. It has two parts: (a) the first models each document as a mixture of topics, and (b) the second models each topic as a mixture of words. LDA uses word co-occurrences within documents to discover topics in a document collection. Words occurring in the same document are likely to come from the same topics, and documents containing similar words are likely to cover similar topics. In this research, the Gensim package in Python was used to run the LDA model. We treated every news article as a document in the topic modeling. Before applying the LDA topic model, we manually organized the documents into general classes and sub-classes in order to assess the quality of the fine-grained topics extracted by LDA.

Then, we analyzed the temporal trend of each LDA-extracted topic to see when a topic was discussed or published more in the newspapers. Finally, we analyzed each topic’s spatial distribution to see what effect each topic had in a particular place. We used Tableau software to analyze the spatial distribution of each topic.
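To make the two distributions that LDA produces concrete, the following is a minimal, dependency-free sketch. It is not the Gensim implementation used in this study: Gensim fits LDA with online variational Bayes, whereas this sketch uses collapsed Gibbs sampling, chosen only because it fits in a few lines. The function name lda_gibbs, the hyperparameter values, and the toy documents are assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (vocab, theta, phi): theta[d][k] is the share of topic k in
    document d, and phi[k][v] is the probability of vocab[v] in topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]             # counts per doc
    topic_word = [[0] * n_vocab for _ in range(num_topics)]  # counts per topic
    topic_total = [0] * num_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                # Remove the token, then resample its topic.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_vocab * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return vocab, theta, phi


if __name__ == "__main__":
    toy_docs = [["lockdown", "relief", "lockdown"], ["vaccine", "hospital"],
                ["relief", "lockdown"], ["hospital", "vaccine", "vaccine"]]
    vocab, theta, phi = lda_gibbs(toy_docs, num_topics=2)
    for k, row in enumerate(phi):
        top = sorted(zip(row, vocab), reverse=True)[:2]
        print(k, [w for _, w in top])
```

Each row of theta corresponds to the "document as a mixture of topics" view, and each row of phi to the "topic as a mixture of words" view described above.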
Static topic modeling treats words as exchangeable and also treats documents as exchangeable. However, the assumption of exchangeable documents is unrealistic for collections that accumulate over time, such as tweets, news articles, and scholarly articles, whose content evolves over time. The topics in a newspaper article collection evolve, and it is important to model the dynamics of the underlying topics explicitly.

Dynamic topic modeling extends static topic modeling by capturing the evolution of topics in a sequentially organized collection of news articles. In this research, the articles are grouped by week. We used the dynamic topic model to analyze discussion topics and topic changes over time.
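Grouping the articles by week can be sketched as below. The helper weekly_time_slices is hypothetical; it counts articles per consecutive 7-day window from a chosen start date. A list of per-period document counts of this kind is what, for example, the time_slice argument of Gensim's LdaSeqModel expects, with the corpus sorted chronologically. Whether the authors used that class, and how they defined week boundaries, is not stated in the text, so both are assumptions here.

```python
from collections import Counter
from datetime import date


def weekly_time_slices(pub_dates, start):
    """Count articles in consecutive 7-day windows beginning at `start`.

    Returns a list whose i-th entry is the number of articles published
    in week i. Weeks without articles are kept as zeros.
    """
    if any(d < start for d in pub_dates):
        raise ValueError("all publication dates must be on or after start")
    per_week = Counter((d - start).days // 7 for d in pub_dates)
    n_weeks = max(per_week) + 1 if per_week else 0
    return [per_week.get(week, 0) for week in range(n_weeks)]


if __name__ == "__main__":
    # Toy dates inside the collection period (21 January to 19 May 2020).
    dates = [date(2020, 1, 21), date(2020, 1, 22),
             date(2020, 1, 28), date(2020, 2, 11)]
    print(weekly_time_slices(dates, start=date(2020, 1, 21)))
```

Some dynamic topic model implementations may not accept empty periods, so weeks with zero articles might need to be merged with a neighboring week before fitting.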
Then we built a text classifier to verify their performance and predict the class,sub-class, and topics in the unknown (upcoming in the future) news articles.Such classification is important when we need to monitor a specific category (orclass or group) of news. We made Long Short-Term Memory (LSTM) RecurrentNeural Network (RNN) models in Python utilizing Keras deep learning libraryfor text classification. RNN is a special kind of neural network where the previ-ous step’s output will be used as the current step’s input. In a traditional neuralnetwork, not all inputs and outputs are interdependent. However, interdepen-dence is an important part of text data. In such cases, the model needs to predictthe next word given the previous words, so the previous word must be stored.Thus RNN was born, which solved the problem with the help of hidden layers.The primary function of RNN is also essential, namely the hidden state . It canremember some information about the sequence.RNN is a neural feedback network that operates on the internal memory.Since the RNN has a similar function for each piece of information and thecurrent range’s output is based on the last count, the RNN is essentially recur-sive. When there is an output, it is copied and sent back to the relay network.The current input and the output of the previous input are taken into accountin determining the prediction of the next word. Unlike direct feedback neuralnetworks, RNNs can use their internal state (memory) to manage the input el-ements’ interdependence. That makes them useful for text data, handwritingrecognition, or speech recognition. The architecture of an unrolled recurrentneural network is shown in Figure 5.In Figure 5, first the model gets x from the input sequence. Then it produces h , which is used in the next input to the model along with x . That is, both h and x become inputs to the next step. Then, h and x are input to the nextstep, and so on. 
In this way, the RNN keeps summarizing the context in the hidden state during training. It then uses the summarized hidden state to classify the sequence (Bashar et al., 2020b).
Fig. 5. Unrolled Recurrent Neural Network
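The unrolled recurrence can be sketched in a few lines of plain Python. This is an illustrative toy, not the Keras LSTM used in the study: it assumes the simplest (Elman-style) update h_t = tanh(W_x x_t + W_h h_{t-1} + b), and the weights and inputs below are arbitrary example values.

```python
import math


def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrence step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_rnn(sequence, W_x, W_h, b):
    """Feed the sequence one element at a time, carrying the hidden state forward."""
    h = [0.0] * len(b)  # initial hidden state
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
    return h  # final hidden state summarizing the whole sequence


# Arbitrary example weights (assumed values, for illustration only).
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]
final_state = run_rnn([[1.0, 0.0], [0.0, 1.0]], W_x, W_h, b)
```

Because each step reuses the previous hidden state, feeding the same elements in a different order gives a different final state, which is the property that makes the hidden state a summary of the sequence rather than of a bag of inputs.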
We proposed a hybrid neural network model based on a Convolutional Neural Network (CNN) and LSTM for sentiment analysis of Bengali texts. Hybrid models are used to solve various vision and NLP problems and can improve on the performance of a single model. The following subsections provide an overview of the LSTM and CNN components. LSTM was described in subsection 3.6. In this research, we used a two-layer Bi-LSTM that takes the word embeddings of the words in a news article and outputs its sentiment.
The other part of our proposed architecture is a Convolutional Neural Network (CNN). CNNs have been very successful in various image processing and NLP tasks in recent years. They are effective at learning local patterns and regularities in data. To use a CNN for text classification, the embeddings of the words in a sentence (or paragraph) are usually stacked to form a two-dimensional matrix, and convolution filters of different lengths are applied to windows of words to produce new feature representations. Max-pooling is then applied to the new features, and the pooled features from the different filters are concatenated to form a hidden representation. Fully connected layers follow this representation to make the final prediction. The architecture of our CNN-BiLSTM hybrid network model is shown in Figure 6.
We built a sequential model by adding layers one after another. In the first layer, we applied a Conv1D layer with 200 filters for the CNN. After that, we applied two Bi-LSTM layers as the second and third layers with a dropout of 0.5. Then we applied a dense network in the remaining layers. We used Adam as the optimizer with tuned hyperparameters and applied L2 regularization to reduce overfitting as much as possible. We used only five epochs, as more epochs resulted in overfitting, and kept the batch size at 256, as it worked very well.
Fig. 6. The architecture of the CNN-BiLSTM Hybrid network for Sentiment Identification. Stages shown for an example Bengali input sentence: Document Matrix, Convolution, Features Map, LSTM Layers, Fully Connected Layers, Output (Positive/Negative).
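The convolution and max-pooling steps described above can be sketched without any deep learning library. The sketch applies a single filter to a tiny document matrix; the model in the paper uses 200 filters in a Keras Conv1D layer. The filter width, the ReLU activation, and all numbers here are assumptions made for illustration.

```python
def conv1d_valid(doc_matrix, kernel, bias=0.0):
    """Slide one filter over consecutive word windows; return one feature per window."""
    width = len(kernel)  # number of consecutive words covered by the filter
    features = []
    for start in range(len(doc_matrix) - width + 1):
        total = bias
        for k in range(width):
            for d in range(len(kernel[k])):
                total += kernel[k][d] * doc_matrix[start + k][d]
        features.append(max(0.0, total))  # ReLU activation (assumed)
    return features


def global_max_pool(features):
    """Keep only the strongest response of the filter over the document."""
    return max(features)


# Toy document: 4 words, each a 2-dimensional embedding (hypothetical values).
doc = [[1, 0], [0, 1], [1, 1], [0, 0]]
# One filter spanning 2 consecutive words.
kernel = [[1, 0], [0, 1]]
feature_map = conv1d_valid(doc, kernel)
pooled = global_max_pool(feature_map)
```

In a hybrid model such as the one described, the feature map (rather than the pooled value) would be passed on to the recurrent layers so that word order is preserved.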
The time series volume analysis of newspapers is shown in Figure 7. The figure has four plots: the observed level, the trend, the seasonal component, and the noise or residual. The first plot, Figure 7a, shows the original volume, i.e., the number of COVID-19 related news articles at each time point. It shows that the curve began to rise in January, when COVID-19 cases were found in China and other countries. The volume increased sharply in early March, when the first COVID-19 cases were identified in Bangladesh, and remained high afterwards with some fluctuations. The second plot, Figure 7b, shows the trend of the COVID-19 related news volume. It shows that COVID-19 related news started trending upward by the end of January and that the trend increased significantly in early March. The trend stayed high for the rest of the period with some fluctuations. The third plot, Figure 7c, shows the seasonal (cyclical) change in the volume, and the fourth plot, Figure 7d, shows the residual or random variation in the volume.
To see how newspapers reacted during the COVID-19 pandemic, we tracked COVID-19 cases, deaths from COVID-19, and COVID-19 news volume in Figure 8. The figure shows that the newspapers were vigilant from the beginning of the pandemic. Journalists increased COVID-19 related news coverage rapidly as soon as COVID-19 cases were found in Bangladesh in early March. The news volume continued increasing until the last quarter of March.
Fig. 7. Time Series Decomposition: (a) Observed, (b) Trend, (c) Seasonal, (d) Residual. In each panel, the horizontal axis is the date (February to May 2020) and the vertical axis is the number of news articles.
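The decomposition of the daily news volume into trend, seasonal, and residual components can be illustrated with a small sketch. It assumes an additive model (observed = trend + seasonal + residual) and a 7-day cycle; the paper does not state which model or period was used, and the series below is synthetic.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal, and residual parts (additive model).

    Assumes an odd `period` so that the moving-average window is centred.
    """
    n = len(series)
    half = period // 2

    # Trend: centred moving average over one full period (undefined at the edges).
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = sum(series[i - half:i + half + 1]) / period

    # Seasonal: average detrended value at each position in the cycle, centred on zero.
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    means = [sum(b) / len(b) for b in buckets]
    centre = sum(means) / period
    seasonal = [means[i % period] - centre for i in range(n)]

    # Residual: whatever is left after removing trend and seasonal parts.
    residual = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, residual


# Synthetic daily counts: a linear rise plus a repeating weekly pattern.
weekly_pattern = [3, -1, -2, 0, 1, -1, 0]
observed = [10 + 0.5 * day + weekly_pattern[day % 7] for day in range(28)]
trend, seasonal, residual = decompose_additive(observed, period=7)
```

On this synthetic input the residual is essentially zero wherever the trend is defined, because the series contains no noise; on real news counts the residual would capture the irregular day-to-day variation.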
This part of the news volume curve shows that the newspapers reacted to COVID-19 from the very early stage of the pandemic and covered it extensively during the early period of COVID-19 cases.
The number of identified cases increased significantly by the second quarter of April and continued to increase. However, the number of COVID-19 related news articles did not increase during this time; in some periods, the news article volume even decreased marginally. Possible reasons are: (a) Because Bangladesh is a developing country, people had to think more about earning a living than about the pandemic in order to survive. As a result, pandemic news did not attract more attention, and newspapers did not publish more COVID-19 related articles. (b) Some other big incidents gained more attention than COVID-19. (c) The newspapers had already reached the space they allocate to pandemic news.
Fig. 8. Comparison of Daily News Article Counts and Daily Cases (21 January 2020 - 19 May 2020)
Spatial Analysis of Volume
The spatial distribution of Bengali newspaper articles is shown in Figure 9a. The news articles were concentrated on the central part of Bangladesh, mainly Dhaka, Narayanganj, and Gazipur. More than 6000 COVID-19 related news articles published in Bangladeshi newspapers were related to the central part of Bangladesh. More than 2000 news articles were related to the southern part of Bangladesh, mainly Chittagong and Cox's Bazar.
The spatial distribution of confirmed COVID-19 cases is shown in Figure 9b. The central part of Bangladesh was the most affected area. More than 10,000 COVID-19 patients were identified in Dhaka during this time. Outbreaks were also reported in the areas surrounding Dhaka, mainly Narayanganj and Gazipur. After the Dhaka division, the highest infection rate was in the southern part of Bangladesh, mainly Chittagong. Figure 9 shows a correlation between the number of confirmed COVID-19 cases in an area and the volume of published news related to that area. This suggests that automatic monitoring of news article volume can give an indication of the severity of a pandemic or other big incident in a society.
Fig. 9. Spatial Analysis of News Article Volume
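One simple way to quantify the relationship between regional case counts and regional news volume is the Pearson correlation coefficient. The paper reports the correlation visually and does not say how, or whether, it was computed numerically, so the sketch below is only one possible check. The per-region numbers are hypothetical placeholders, not values from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical per-region values (one entry per region), for illustration only.
confirmed_cases = [120, 45, 30, 10, 5]
news_articles = [80, 30, 25, 12, 6]
r = pearson(confirmed_cases, news_articles)
```

A value of r close to 1 would indicate that regions with more confirmed cases also tend to have more news coverage.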
Fig. 10. District-wise Distribution of News Articles (districts shown: Dhaka, Chittagong, Narayanganj, Gazipur, Comilla, Khulna, Barisal, Sylhet, Kishoreganj, Tangail)
Fig. 11. Division-wise Distribution of News Articles (divisions shown: Chittagong, Khulna, Mymensingh, Sylhet, Dhaka, Rajshahi, Rangpur, Barisal)
The district-wise breakdown of published news articles, for districts with significant volume, is shown in Figure 10, and the division-wise breakdown in Figure 11. The figures show that most published news was related to the Dhaka district and the Dhaka division. More than 57% of the published news was related to the Dhaka division. After Dhaka, most news was published about Chittagong: more than 19% of the news was related to the Chittagong division. The geospatial and temporal distributions of newspaper articles are shown in Figure 12. The figure shows that the volume of published news articles related to each location changed significantly over time: it was low before and at the beginning of the pandemic and increased significantly during the pandemic.
Fig. 12. Geo-spatial and Temporal Distributions of News Articles Published over Time. The horizontal axis shows consecutive weeks in the study period, and the vertical axis shows the volume (count of news articles).
For topic analysis with the LDA topic model, it is essential to decide the optimal number of topics. We gave much thought to finding an appropriate number of LDA topics and interpretations for examining the relationship between the COVID-19 emergency and the news articles. We used the coherence score and the perplexity score to assess the choice of the number of topics. After preprocessing the data, we applied the LDA model to discover hidden topics in the news articles. To determine the optimal number of topics, we examined the coherence score and perplexity score graphs shown in Figure 13. Figure 13a shows the coherence score graph, and Figure 13b shows the perplexity score graph.
From the coherence score graph (Figure 13a), we obtained the highest coherence score (0.5077) when the number of topics was set to 9. From the perplexity score graph (Figure 13b), we obtained the highest perplexity score (-7.59) when the number of topics was set to 24. Of the two criteria, we chose the coherence score, because the number of topics it suggests, 9, is very close to the number of manually extracted classes, 8, shown in Table 2. So we set the number of topics for LDA topic extraction to 9. The word clouds of the top words (i.e., keywords) in each of the nine topics are shown in Figure 14. The weights and appearance counts of the keywords in each topic are shown in Figure 15. A visualization of the document clusters in 2D space using the t-SNE (t-distributed stochastic neighbor embedding) algorithm is shown in Figure 16. In Figure 17, the inter-topic distance map and the 30 most relevant keywords are displayed for each topic. The nine discovered topics are listed in Table 3.
Fig. 13. Determining the optimal number of topics: (a) Coherence Score, (b) Perplexity Score. In both panels, the horizontal axis is the number of topics (5 to 30).
Table 3. Nine Topics Discovered by LDA
(1) Economic Crisis and Incentives
(2) Epidemic Situation and Outbreak
(3) Vaccine and Treatment
(4) Demonstration for Wages and Relief
(5) Medical Care and Health Organization Responses
(6) Repatriation and International Situations
(7) Daily Infected, Death, and Recovered Cases
(8) Strategic Preparedness
(9) Government Announcement and Responses
Figure 18 shows the topic frequency ratio in the document collection (news articles). The figure shows that Topic 8 (Strategic Preparedness) is the most frequent of the nine topics discovered by LDA, accounting for 26.3% of the collection. The second most frequent topic is Topic 2 (Epidemic Situation and Outbreak), which accounted for 20.1%. Topic 9 (Government Announcement and Responses) and Topic 7 (Daily Infected, Death, and Recovered Cases) are the third and fourth most frequent topics, at 13.6% and 11.7%, respectively. Topic 5 (Medical Care and Health Organization Responses), Topic 3 (Vaccine and Treatment), and Topic 4 (Demonstration for Wages and Relief) are in fifth, sixth, and seventh positions, accounting for 9.8%, 5.7%, and 5.2%, respectively. Finally, Topic 6 (Repatriation and International Situations) and Topic 1 (Economic Crisis and Incentives) are the least frequent topics, each with a proportion of less than 5%. By reviewing these topics and their analysis, we can gain insight into the pandemic or any other important incident in a society.
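A topic frequency ratio of this kind can be computed by counting how often each topic occurs across the articles. The sketch below assumes that every article is assigned its single most probable (dominant) LDA topic; the paper does not describe the exact procedure, so this assignment rule and the toy labels are assumptions.

```python
from collections import Counter


def topic_shares(dominant_topics):
    """Percentage of articles per topic, ordered from most to least frequent."""
    counts = Counter(dominant_topics)
    total = len(dominant_topics)
    return {topic: 100.0 * count / total for topic, count in counts.most_common()}


# Hypothetical dominant-topic labels for four articles.
shares = topic_shares([8, 8, 2, 9])
```

Applied to the full collection, the resulting percentages would correspond to the proportions plotted in Figure 18.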
Fig. 14. Word Clouds for nine topics: panels (a) to (i) show the word clouds of Topic 1 to Topic 9.
Fig. 15. Word Counts for nine topics: panels (a) to (i) show the word count and importance (weight) of the keywords of Topic 1 to Topic 9.
Fig. 16. t-SNE clustering chart
Fig. 17. Interactive Topic Visualization: intertopic distance map (via multidimensional scaling) with the marginal topic distribution, and the top-30 most salient terms with their overall term frequency and estimated term frequency within the selected topic. Saliency is defined as saliency(term w) = frequency(w) * [sum_t p(t | w) * log(p(t | w)/p(t))] for topics t (Chuang et al., 2012), and relevance as relevance(term w | topic t) = λ * p(w | t) + (1 - λ) * p(w | t)/p(w) (Sievert & Shirley, 2014).
Fig. 18. Proportion of Topic Frequency Distribution in the News Collection
Fig. 19. Topics Evolution of nine topics: panels (a) to (i) show Topic 1 (Economic Crisis & Incentives) to Topic 9 (Government Announcement and Responses). In each panel, the horizontal axis is the week number of the year 2020 and the vertical axis is the frequency.
Dynamic Topic Modeling: Temporal Trends of Topics
To show the evolution of topics over time during the pandemic, we used Dynamic Topic Modeling (Blei and Lafferty, 2006). Figure 19 shows the evolution of the nine topics over the weeks of the pandemic. The figure shows how the popularity of each topic and the top words (i.e., keywords) in the topic changed over time. The overall temporal trend of these topics is shown in Figure 20.
Topic 1: “Economic Crisis and Incentives” climbed from early March and stopped rising at the end of March. The curve then declined until the beginning of April and rose slowly in the middle of April. It declined again until the beginning of May and finally reached a peak in the middle of May.
Topic 2: “Epidemic Situation and Outbreak” climbed from the beginning of March and stopped rising at the end of April. The curve then declined steadily until the beginning of April and rose slowly until the middle of April. It decreased again until the beginning of May and finally rose to a peak in the middle of May.
Topic 3: “Vaccine and Treatment” climbed from the end of February and stopped growing at the beginning of March. The curve then declined steadily until the middle of March, started to rise again, and reached a peak at the end of March. It then declined until the middle of May, after which it fluctuated until the end of the period.
Topic 4: “Demonstration for wages and relief” climbed from the beginning of March and stopped rising in the middle of March. The curve fluctuated between the middle of March and the middle of April, when it reached a peak. It then declined steadily until the beginning of May and rose slowly in the middle of May.
Topic 5: “Medical Care and Health Organization Responses” climbed from the beginning of February and stopped rising in early March. The curve then started to rise again from the beginning of March and reached its peak in the middle of April, after which it gradually declined.
Topic 6: “Repatriation and International Situations” climbed from the beginning of February and stopped rising early in March. The curve then declined steadily and rose again from early March to the middle of March. After holding steady for a while, it reached its peak at the end of March. It then declined steadily until the beginning of April and fluctuated until the end of the period.
Topic 7: “Daily infected, death and recovered cases” climbed from the beginning of February. Because the numbers of infections and deaths were growing every day, the reported daily counts of infected and death cases also rose, and so did the curve. The curve reached its peak at the end of April and then declined until the beginning of May. After that, the curve started to rise again and kept rising until the end of the period.
Topic 8: “Strategic Preparedness” climbed from the beginning of March and stopped rising at the end of March. The curve then declined until the end of March, after which it started to rise again. It dipped suddenly for a while, rose again, and then peaked. It then declined steadily until the beginning of May and rose slowly in the middle of May.
Topic 9: “Government Announcement and Responses” climbed from the beginning of March and reached a peak in the middle of March. The curve then declined until the middle of April, after which it fluctuated. It then declined steadily until the end of April and finally rose slowly until the end of the period.
Spatial Distribution of Topics
This subsection details the experimental results on the spatial distribution of the topics.
Topic 1: “Economic Crisis and Incentives” was mainly concentrated in the Dhaka division, as shown in Figure 21a. A large number of people in the Dhaka division lost their jobs in that period. This also happened in Chittagong to a smaller extent. The government and various agencies provided relief and incentives to those affected. The areas that received the most relief were the Dhaka and Chittagong divisions.
Topic 2: “Epidemic Situation and Outbreak” mainly focused on the central and southern parts of Bangladesh, as shown in Figure 21b. Dhaka was the most affected city in Bangladesh at that time: most COVID-19 patients were identified in Dhaka, more deaths were reported there, and the situation and prevalence were much worse than in other districts and divisions. Apart from Dhaka, Narayanganj, Gazipur, and Chittagong were also heavily affected.
Topic 3: “Vaccine and Treatment” mainly focused on the Dhaka, Chittagong, Gazipur, and Narayanganj areas of Bangladesh, as shown in Figure 21c. Because the prevalence of COVID-19 was higher in Dhaka, in its adjoining districts such as Narayanganj and Gazipur, and in Chittagong, treatment was discussed comparatively more for these areas than for other districts. A COVID-19 vaccine was also being studied in Dhaka.
Topic 4: “Demonstration for wages and relief” was spread all over the country, as shown in Figure 21d. The situation was terrible all over the country due to COVID-19. Day laborers lost their jobs and became destitute. To survive, they had to leave their homes and go out on the streets to provide food for their families. In every district of the Dhaka, Chittagong, Rajshahi, Barisal, Khulna, Mymensingh, Sylhet, and Rangpur divisions, people took to the streets to protest for survival.
Topic 5: “Medical Care and Health Organization Responses” was, like Topic 4, spread all over the country, as shown in Figure 21e. The state of the health system in the whole country was deplorable, and health organizations were in a very critical situation.
Topic 6: “Repatriation and International Situations” mainly concentrated on China, the USA, Italy, Russia, and other countries, as shown in Figure 21f. This topic covers the situation in foreign countries and the migrants who wanted to return to Bangladesh.
Fig. 20. Overall Temporal Trends of Topics: panels (a) to (i) show the temporal trends of Topic 1 (Economic Crisis & Incentives) to Topic 9 (Government Announcement and Responses). In each panel, the horizontal axis is time (February or March to May 2020) and the vertical axis is the number of news articles.
Topic 7: “Daily infected, death and recovered cases” mainly focused on the central region, as shown in Figure 21g. The Dhaka division had the highest number of COVID-19 infections as well as deaths. Within the Dhaka division, Dhaka city, Narayanganj, and Gazipur had the most cases. After Dhaka, most cases were found in the Chittagong division.
Topic 8: “Strategic Preparedness” covered the whole country, as shown in Figure 21h. Lockdown, isolation, home quarantine, and social distancing were imposed across the country.
Topic 9: “Government Announcement and Responses” is shown in Figure 21i. It applied almost everywhere, especially in Dhaka.
We built a text classifier to automatically categorize new, incoming news articles into classes, sub-classes, and topics. We implemented an LSTM recurrent neural network model in Python using the Keras deep learning library for this classification.
We split our data into 80% for training, 10% for validation, and 10% for testing. In the data preparation, we first cleaned the text by removing unnecessary characters and stopwords. After cleaning the text, we tokenized the data using the Keras Tokenizer and built a word index from it. Then we vectorized the Bengali text by turning each text into a vector. We limited the vocabulary to the top 50,000 words and set the maximum number of words in each article to 1000. After that, we padded and truncated the data so that all input sequences had the same length for modeling.
After cleaning the data, we selected pre-trained word embeddings. A word embedding maps each word in the vocabulary to a vector of real numbers. We used these pre-trained word embeddings in the embedding layer of our LSTM model.
In our classification model, the first layer is the embedding layer, which uses vectors of length 300 to represent each word. The second layer is an LSTM layer with 100 hidden units. The final layer is a dense layer, also known as the output layer. This final layer has a length of 8, 19, and 9 for the classes, sub-classes, and LDA-discovered topics, respectively. Softmax is used as the activation function for multi-class classification in the final layer. We used categorical cross-entropy as the loss function, Adam as the optimizer, and a batch size of 32. We used only five epochs, as it worked quite well.
The experimental results for Precision, Accuracy, F1 score, and Recall are given in Table 4. For the eight classes, Precision, Accuracy, F1 score, and Recall are 47.80%, 44.39%, 45.13%, and 42.80%, respectively. For the 19 sub-classes, they are 47.33%, 38.51%, 37.20%, and 30.82%, respectively. We also computed the performance on the 9 LDA topics, for which Precision, Accuracy, F1 score, and Recall are 81.37%, 79.55%, 79.67%, and 78.10%, respectively.
Keras: https://keras.io/
Bengali Word Embeddings: https://cutt.ly/KjXmEio
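The data preparation steps (word index, vocabulary limit, padding, and truncation) can be sketched in plain Python. This is an illustration of the idea rather than the Keras Tokenizer itself: the helper names are hypothetical, and padding and truncating at the front of the sequence is an assumption that mirrors the default behaviour of the Keras `pad_sequences` utility. The study used a 50,000-word vocabulary and a maximum length of 1000; the example uses tiny values.

```python
from collections import Counter


def build_word_index(tokenized_docs, max_words):
    """Map the max_words most frequent tokens to integer ids starting at 1."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Id 0 is reserved for padding, so word ids start at 1.
    return {word: i for i, (word, _) in enumerate(counts.most_common(max_words), start=1)}


def to_padded_sequence(tokens, word_index, max_len):
    """Convert tokens to ids, drop out-of-vocabulary tokens, then truncate and pad."""
    ids = [word_index[t] for t in tokens if t in word_index]
    ids = ids[-max_len:]                     # keep the last max_len ids
    return [0] * (max_len - len(ids)) + ids  # left-pad with zeros


# Toy corpus with placeholder tokens (hypothetical, for illustration only).
docs = [["a", "b", "a"], ["a", "c"]]
word_index = build_word_index(docs, max_words=2)
short = to_padded_sequence(["a", "c", "b"], word_index, max_len=5)
long = to_padded_sequence(["a", "b", "a", "b"], word_index, max_len=3)
```

Every article then becomes a fixed-length list of integers, which is the input form an embedding layer expects.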
Fig. 21. Spatial Distribution of Topics

Table 4. Performance in Classes, Sub-classes, and Topics
Criteria     Classes    Sub-classes    Topics
Precision    47.80%     47.33%         81.37%
Accuracy     44.39%     38.51%         79.55%
F1 Score     45.13%     37.20%         79.67%
Recall       42.80%     30.82%         78.10%
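For reference, the four reported criteria can be computed from true and predicted labels as sketched below. The paper does not say how the per-class scores were averaged, so macro averaging (an unweighted mean over classes) is an assumption here, and the function name `macro_metrics` and the toy labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Toy labels, for illustration only.
scores = macro_metrics(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"])
```

With many imbalanced classes, as in the 19 sub-classes, macro-averaged recall can fall well below accuracy because rare classes count as much as frequent ones.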
We analyzed sentiment in the COVID-19 related news articles to see how positively and negatively society was affected by COVID-19 (or any big incident). We also analyzed the effectiveness of a hybrid CNN-BiLSTM model in identifying sentiment in Bengali texts. First, we manually labeled each news article according to the positive or negative sentiment it expresses. Then we trained the CNN-BiLSTM model to detect the sentiment of new, incoming articles. After labeling the sentiment of the articles as positive or negative, we visualized the results in Figure 22. The figure shows that there were more news articles with negative sentiment than with positive sentiment.
Fig. 22. Proportion of Positive and Negative Sentiments
We split our labeled news collection into 80% for training, 10% for validation, and 10% for testing for sentiment analysis. After cleaning the dataset, we tokenized the data using the Keras Tokenizer. Then we built a word index and vectorized each text. We restricted the vocabulary to the top 60,000 words and set the maximum number of words in each article to 200 using feature selection. We padded and truncated the data so that all input sequences had the same length for modeling.
After data preparation, we built our model. The first layer of the model is the embedding layer, with an embedding dimension of 300 for each word. In the second layer, we used a Conv1D layer with 200 filters for the CNN. In the third and fourth layers, we applied two Bi-LSTM layers with a dropout of 0.5. In the final layer, we used a dense network. We used Adam as the optimizer with finely tuned hyperparameters and applied L2 regularization to reduce overfitting. We kept the batch size at 256, as it worked quite well. We used only five epochs, which gave us reasonably good results.
After calculating Precision, Accuracy, F1 score, and Recall, we present the performance of the sentiment classification in Table 5. The Precision, Accuracy, F1 score, and Recall are 74.89%, 74.94%, 74.88%, and 74.89%, respectively, for sentiment analysis.
Fig. 23. Spatial and Temporal Distribution of Number of Positive and Negative Sentiment News Articles
Table 5. Performance of Sentiment Analysis
Criteria     Performance
Precision    74.89%
Accuracy     74.94%
F1 Score     74.88%
Recall       74.89%
Figure 23 shows the spatial and temporal distribution of the number of positive and negative sentiment news articles during the pandemic. It shows how sentiment changed across the eight divisions over 20 weeks.
This study presented an in-depth analysis of Bangladeshi daily newspaper reports from the onset of the COVID-19 pandemic. After collecting the news articles, we investigated and manually classified them into eight classes and nineteen sub-classes. We used LDA to extract nine COVID-19 related topics from the news articles. We used the dynamic topic model to see the evolution of topics over time. We also provided the spatial distribution of the topics. We created a text classifier that automatically sorts incoming articles into classes, sub-classes, and topics. We also did a spatial and temporal analysis of news article volume. In the temporal analysis of volume, we decomposed the time series into four components: observed, trend, seasonal, and residual. In addition, we analyzed daily news article counts and daily infected and death cases in the temporal and spatial dimensions. Finally, we analyzed the sentiment in the news articles related to COVID-19 using a CNN-BiLSTM architecture to understand the positive and negative impacts of events and initiatives during the COVID-19 pandemic.
In a period of big social incidence, continuous analysis of newspaper articles is essential to ensure public well-being, maintain social consensus, and save lives. The automatic analysis techniques and the analysis outcomes presented in this study will help government and crisis response personnel improve public understanding, assessment, and preparedness, speed up emergency response, and support post-incident management.
Bibliography
Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca J Passonneau. Sentiment analysis of Twitter data. In Proceedings of the workshop on language in social media (LSM 2011), pages 30–38, 2011.
Mustakim Al Helal and Malek Mouhoub. Topic modelling in Bangla language: An LDA approach to optimize topics and news classification. Computer and Information Science, 11(4), 2018.
Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. Emotions from text: machine learning for text-based emotion prediction. In Proceedings of human language technology conference and conference on empirical methods in natural language processing, pages 579–586, 2005.
Loulwah AlSumait, Daniel Barbará, and Carlotta Domeniconi. On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. Pages 3–12. IEEE, 2008.
Md Asimuzzaman, Pinku Deb Nath, Farah Hossain, Asif Hossain, and Rashedur M Rahman. Sentiment analysis of Bangla microblogs using adaptive neuro fuzzy system. Pages 1631–1638. IEEE, 2017.
Thirunavukarasu Balasubramaniam, Richi Nayak, and Md Abul Bashar. Understanding the spatio-temporal topic dynamics of COVID-19 using nonnegative tensor factorization: A case study. arXiv preprint arXiv:2009.09253, 2020.
Md Abul Bashar and Richi Nayak. QutNocturnal@HASOC'19: CNN for hate speech and offensive content identification in Hindi language. arXiv preprint arXiv:2008.12448, 2020.
Md Abul Bashar, Richi Nayak, Nicolas Suzor, and Bridget Weir. Misogynistic tweet detection: Modelling CNN with small datasets. In Australasian Conference on Data Mining, pages 3–16. Springer, 2018.
Md Abul Bashar, Richi Nayak, and Thirunavukarasu Balasubramaniam. Topic, sentiment and impact analysis: COVID19 information seeking on social media. arXiv preprint arXiv:2008.12435, 2020a.
Md Abul Bashar, Richi Nayak, and Nicolas Suzor. Regularising LSTM classifier by transfer learning for detecting misogynistic tweets with small training set. Knowledge and Information Systems, 62(10):4029–4054, 2020b.
Vishwanath Bijalwan, Vinay Kumar, Pinki Kumari, and Jordan Pascual. KNN based machine learning approach for text and document mining. International Journal of Database Theory and Application, 7(1):61–70, 2014.
David M Blei and John D Lafferty. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120, 2006.
David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
Shaika Chowdhury and Wasifa Chowdhury. Performing sentiment analysis in Bangla microblog posts. Pages 1–6. IEEE, 2014.
Abu Nowshed Chy, Md Hanif Seddiqui, and Sowmitra Das. Bangla news classification using naive Bayes classifier. Pages 366–371. IEEE, 2014.
Hang Cui, Vibhu Mittal, and Mayur Datar. Comparative experiments on sentiment classification for online product reviews. In AAAI, volume 6, page 30, 2006.
Estela Bee Dagum. Time series modeling and decomposition. Statistica, 70(4):433–457, 2010.
Amitava Das and Sivaji Bandyopadhyay. SentiWordNet for Bangla. Knowledge Sharing Event-4: Task, 2:1–8, 2010a.
Amitava Das and Sivaji Bandyopadhyay. Topic-based Bengali opinion summarization. In Coling 2010: Posters, pages 232–240, 2010b.
Enrico De Santis, Alessio Martino, and Antonello Rizzi. An infoveillance system for detecting and tracking relevant topics from Italian tweets during the COVID-19 event. IEEE Access, 8:132527–132538, 2020.
Adji B Dieng, Francisco JR Ruiz, and David M Blei. The dynamic embedded topic model. arXiv preprint arXiv:1907.05545, 2019.
Shahnoor C Eshan and Mohammad S Hasan. An application of machine learning to detect abusive Bengali text. Pages 1–6. IEEE, 2017.
Xuehua Han, Juanle Wang, Min Zhang, and Xiaojie Wang. Using social media to mine and analyze public opinion related to COVID-19 in China. International Journal of Environmental Research and Public Health, 17(8):2788, 2020.
KM Azharul Hasan, Mosiur Rahman, et al. Sentiment detection from Bangla text using contextual valency analysis. Pages 292–295. IEEE, 2014.
M. Hasan, M. M. Hossain, A. Ahmed, and M. S. Rahman. Topic modelling: A comparison of the performance of latent Dirichlet allocation and lda2vec model on Bangla newspaper. Pages 1–5. IEEE, 2019.
Asif Hassan, Mohammad Rashedul Amin, Abul Kalam Al Azad, and Nabeel Mohammed. Sentiment analysis on Bangla and romanized Bangla text using deep recurrent models. Pages 51–56. IEEE, 2016.
Md Islam, Fazla Elahi Md Jubayer, Syed Ikhtiar Ahmed, et al. A comparative study on different types of approaches to Bengali document categorization. arXiv preprint arXiv:1701.08694, 2017.
Balaji Jagtap and Virendrakumar Dhotre. SVM and HMM based hybrid approach of sentiment analysis for teacher feedback assessment.
International Journalof Emerging Trends & Technology in Computer Science (IJETTCS) , 3(3):229–232, 2014.
Fasihul Kabir, Sabbir Siddique, Mohammed Rokibul Alam Kotwal, and Mo-hammad Nurul Huda. Bangla text document categorization using stochasticgradient descent (sgd) classifier. In , pages 1–4. IEEE, 2015.Zhijie Liu, Xueqiang Lv, Kun Liu, and Shuicai Shi. Study on svm compared withthe other text classification methods. In , volume 1, pages 219–222. IEEE,2010.Shamsul Arafin Mahtab, Nazmul Islam, and Md Mahfuzur Rahaman. Sentimentanalysis on bangladesh cricket with support vector machine. In ,pages 1–4. IEEE, 2018.Ashis Kumar Mandal and Rikta Sen. Supervised learning methods for banglaweb document categorization. arXiv preprint arXiv:1410.2045 , 2014.Jani Marjanen, Elaine Zosa, Simon Hengchen, Lidia Pivovarova, and Mikko Tolo-nen. Topic modelling discourse dynamics in historical newspapers. arXivpreprint arXiv:2011.10428 , 2020.Ba-Hung Nguyen, Shirai Kiyoaki, and Van-Nam Huynh. Topics in financialfilings and bankruptcy prediction with distributed representations of textualdata. In
Proceedings of ECML-PKDD , 2020.Alok Ranjan Pal, Diganta Saha, and Niladri Sekhar Dash. Automatic classifica-tion of bengali sentences based on sense definitions present in bengali wordnet. arXiv preprint arXiv:1508.01349 , 2015.Ajay S Patil and BV Pawar. Automated classification of web sites using naivebayesian algorithm. In
Proceedings of the international multiconference ofengineers and computer scientists , volume 1, pages 519–523. Citeseer, 2012.Pratiksha Y Pawar and SH Gawande. A comparative study on different types ofapproaches to text categorization.
International Journal of Machine Learningand Computing , 2(4):423, 2012.Shahinur Rahman, Sheikh Abujar, SM Mazharul Hoque Chowdhury, Mohd Sai-fuzzaman, and Syed Akhter Hossain. Sentence-based topic modeling usinglexical analysis. In
Emerging Technologies in Data Mining and InformationSecurity , pages 487–494. Springer, 2019.Geetanjali Rakshit, Anupam Ghosh, Pushpak Bhattacharyya, and GholamrezaHaffari. Automated analysis of bangla poetry for classification and poet iden-tification. In
Proceedings of the 12th International Conference on NaturalLanguage Processing , pages 247–253, 2015.Nusrath Tabassum and Muhammad Ibrahim Khan. Design an empirical frame-work for sentiment analysis from bangla text using machine learning. In , pages 1–5. IEEE, 2019.Vincent Tam, Ardi Santoso, and Rudy Setiono. A comparative study of centroid-based, neighborhood-based and statistical approaches for effective documentcategorization. In
Object recognition supported by user interaction for servicerobots , volume 4, pages 235–238. IEEE, 2002. utomatic Monitoring Social Dynamics During Big Incidences 37
Zhou Tong and Haiyi Zhang. A text mining research based on lda topic mod-elling. In
International Conference on Computer Science, Engineering andInformation Technology , pages 201–210, 2016.Nafis Irtiza Tripto and Mohammed Eunus Ali. Detecting multilabel sentimentand emotions from bangla youtube comments. In , pages 1–6. IEEE,2018.Rashedul Amin Tuhin, Bechitra Kumar Paul, Faria Nawrine, Mahbuba Akter,and Amit Kumar Das. An automated system of sentiment analysis from banglatext using supervised learning techniques. In , pages 360–364. IEEE, 2019.Chong Wang and David M Blei. Collaborative topic modeling for recommendingscientific articles. In
Proceedings of the 17th ACM SIGKDD internationalconference on Knowledge discovery and data mining , pages 448–456, 2011.Reggia Aldiana Wayasti, Isti Surjandari, et al. Mining customer opinion fortopic modeling purpose: Case study of ride-hailing service provider. In , pages 305–309. IEEE, 2018.Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan,and Xiaoming Li. Comparing twitter and traditional media using topic models.In