A Text Mining Discovery of Similarities and Dissimilarities Among Sacred Scriptures
Younous Mofenjou Peuriekeu, Victoire Djimna Noyum, Cyrille Feudjio, Alkan Goktug, Ernest Fokoue
Younous Mofenjou Peuriekeu a,∗, Victoire Djimna Noyum a, Cyrille Feudjio a, Alkan Göktug c and Ernest Fokoué b
a School of Mathematical Sciences, African Institute for Mathematical Sciences, Crystal Garden, Limbe, Cameroon
b School of Mathematical Sciences, Rochester Institute of Technology, Rochester, NY 14623, USA
c School of Mathematical Sciences, ETH Zurich, Rämistrasse 101, 8092 Zurich, Switzerland
ARTICLE INFO
Keywords: Sacred text, Text Mining, DTM, Distance, Classification
ABSTRACT
The careful examination of sacred texts gives valuable insights into human psychology, into different ideas regarding the organization of societies, and into terms like truth and God. To improve and deepen our understanding of sacred texts, their comparison and their separation are crucial. For this purpose, we use a data set of nine sacred scriptures. This work deals with the separation of the Quran, the Asian scriptures Tao-Te-Ching, Buddhism, the Yogasutras and the Upanishads, as well as four books from the Bible, namely the Book of Proverbs, the Book of Ecclesiastes, the Book of Ecclesiasticus and the Book of Wisdom. These scriptures are analyzed with natural language processing (NLP), which creates a mathematical representation of the corpus in terms of frequencies called the document term matrix (DTM). After this analysis, supervised and unsupervised machine learning methods are applied to perform classification. Here we use the Multinomial Naive Bayes (MNB), the Support Vector Machine (SVM), the Random Forest (RF) and the K-nearest Neighbors (KNN). We find that, among these methods, MNB is able to predict the class of a sacred text with an accuracy of about 85.84 %.
⋆ This document is the result of the research project funded by the National Science Foundation.
∗ Corresponding author: Y.M. Peuriekeu.
YOUNOUS MOFENJOU PEURIEKEU et al.: Preprint submitted to Elsevier
arXiv [stat.OT]

1. Introduction
The progress in transportation and communication that has brought all the people of the world into one global village has also brought the religions of the world into close contact. To know what is unique or specific about a religion and how religion has gained importance in human life, it is helpful to understand the structural patterns of its texts. Generally, when we talk about sacred or holy texts, we refer to the religious context. Sacred scriptures are passages from the religious traditions. Often, these scriptural passages support a common theme. Organizing them by theme allows each topic to be addressed with the resources of many different traditions, often providing a broader and deeper understanding of the topic than considering only the resources of a single tradition. Each religion has much of value to contribute to humankind's understanding of truth, which transcends any particular expression. The proper description of religion is an area of study that has been tackled from different standpoints. Some of the common approaches to the study of religion are through history, anthropology, psychology and sociology.
• Historians have been interested in religion as a social movement and have traced the development of various religions.
• Anthropologists have been interested in the genetic approach in the study of man, in both its physiological and psychological aspects.
• Sociologists are essentially interested in the institutional and ritualistic aspects of religion.
The transcendentalist approach shows that religion concerns values and ideas. This approach differs from the other disciplines of the social sciences because, while the latter study religion only as a structure, the former studies the religious values constituting the inner core of religion.
There are many different fields studying religions. These studies are always based on investigations of the scriptures. Hence, understanding a religion requires an analysis of its religious scripture.
• Quran: It is the single-authored, central religious text of Islam, written in the eastern Arabian dialect of Classical Arabic. The Quran has 30 divisions (Juz) and 114 chapters (Surah), ordered according to the length of the surahs rather than by when they were revealed or by subject matter. Each Surah is subdivided into verses (Ayat). The Quran contains about 6,236 verses with 77,477 words. The Quran is believed to have been orally revealed by God through the archangel Gabriel to Muhammad, considered the final prophet by Muslims. These three corpora are highly unstructured and do not adhere to a structured or regular syntax and patterns. Before applying any statistical technique or machine learning algorithm, the corpus needs to be converted into a structured text or a vector format that these techniques and algorithms can work with (Sah and Fokoué, 2019).
• Asian texts:
  – Yogasutras (from India): This book contains the essence of wisdom. It states that humans think of themselves as living a purely physical life in their material bodies. The central claim is that, in reality, they have gone far indeed from pure physical life; for ages, their life has been psychical, and they have been centered and immersed in the psychic nature (Sah and Fokoué, 2019).
  – Buddhism: This book teaches the so-called four noble truths. Each of these truths entails a duty: stress is to be comprehended, the origination of stress is to be abandoned, the cessation of stress is to be realized, and the path to the cessation of stress is to be developed. When all of these duties have been fully performed, it is believed that the mind gains total release (Sah and Fokoué, 2019).
  – Tao Te Ching (from China): The Tao Te Ching is a Chinese classic text traditionally credited to the [ ]th century BC sage Laozi. It is a short text in two parts, the Tao Ching (chapters 1-37) and the Te Ching (chapters 38-81), which may have been edited together into the received text, possibly reversed from an original Te Tao Ching. The chapters talk about staying detached, letting go and keeping things simple.
  – The Upanishads: This book represents the loftiest heights of ancient Indo-Aryan thought and culture. The Upanishads form the wisdom portion, or Gnana-Kanda, of the Vedas, as contrasted with the Karma-Kanda, or sacrificial portion. In each of the four great Vedas, known as Rik, Yajur, Sama and Atharva, there is a large portion which deals predominantly with rituals and ceremonials aiming to teach humans how they can prepare themselves for higher attainment (Sah and Fokoué, 2019).
• Christianity books: Their origin is from Central Asia/America.
  – Book of Proverbs: The Book of Proverbs contains the wise sentences regulating the morals of men and directing them to virtue.
  – Book of Ecclesiasticus: The Book of Ecclesiasticus brings out remarkable lessons on the virtues.
  – Book of Ecclesiastes: Also called the Preacher, the Book of Ecclesiastes is one of the Christian books in which Solomon plays a very important role as a preacher.
  – Book of Wisdom: The Book of Wisdom abounds with instructions and exhortations to kings and all magistrates to minister justice in the commonwealth, teaching all kinds of virtues under the general names of justice and wisdom (Sah and Fokoué, 2019).
Text analysis is the technique used to model and structure the information content of textual sources for investigation, exploratory data analysis, and research. It relies on a set of statistical and machine learning approaches, and it is very useful in text mining because it simplifies data analysis, research, and investigation (Hobbs, Walker and Amsler, 1982). Text mining is an instance of text analysis applications.
Text analysis also refers to the set of processes that analyze a very large amount of unstructured text data to derive attributes such as topics and keywords. Hence, Natural Language Processing (NLP), Natural Language Understanding (NLU) and Data Mining (DM) are overlapping areas which include techniques to process large corpora and extract useful information (Elton, Turakhia, Reddy, Boukouvalas, Fuge, Doherty and Chung, 2019). Extraction techniques include text pre-processing, text normalization, text categorization, text clustering and text similarity. A corpus is a collection of sentences made of phrases and words. For given unstructured text data, the foundation of an analysis is text pre-processing and text normalization, which convert the raw corpus to structured data, whereas text similarity checks the degree of closeness of two corpora and text clustering groups similar documents.
Text mining is an approach that combines several fields of research and several software tools. According to Ignatow and Mihalcea (2017), it is a technique that has begun to develop in social sciences such as anthropology, communication (Ezzeldin and El-Dakhakhni, 2020), economics (Levenberg, Pulman, Moilanen, Simpson and Roberts, 2014), education (Evison, 2013), and psychology (Sklad, Diekstra, Ritter, Ben and Gravesteijn, 2012). Social scientists have spent many decades studying transcribed interviews, newspaper articles, speeches, and other forms of textual data. After discovering the newer, more sophisticated and faster methods of text mining, they have begun to adapt this approach to different forms of textual data analysis. Text mining also draws on computer science. So, in our study, we will not limit ourselves to the use of this new analysis technique, but also study its extension to natural language mining.
The multiplicity of religious teachings is due to the fact that sacred texts differ from one religion to another. These differences can be explained by the lessons the texts teach their followers, their period of appearance, and their geographical location. In this study, we deepen the exploration of sacred texts by using text mining techniques.
Besides analyzing the differences between the books, it is also crucial to extract their similarities. This work attempts to find these similarities using text mining techniques.
Automated lexical analysis techniques are used nowadays to retrieve useful information from large amounts of unstructured text. Several studies have examined different religions. Sindler (2011) wrote a thesis on a comparative study of Christian, Jewish, and Islamic theodicy. No automated technique was used to analyze the lexical content of the religious texts, and only three religions were considered in that study. Altogether, very little effort has been made in the past to automate the analysis of important religious texts.
McDonald (2014) presented a method for the automated extraction of topics from nine religious texts, forming a self-organizing map to find relationships between these texts. The drawback of this study is that only nine texts were taken into consideration, leaving out important world religions such as Buddhism, Jainism and Sikhism.
Qahl (2014) developed an automatic similarity detection engine using the Bible and the Quran as corpus to explore the performance of various feature extraction and machine learning techniques. Only two religious texts were taken into consideration, and the study did not give deeper insight into differences regarding the ideas and beliefs proposed by these two religions.
The most recent work was done by Sah and Fokoué (2019), where eight sacred texts were used, four from the Bible and four from Asian religions, to present a statistical machine learning approach and to analyze many aspects of sacred texts from both the Asian and Biblical scriptures. In our study, we extend the set of religious books by adding the Islamic sacred scripture and then use natural language processing (NLP) to extract high-quality information from those books. Based on this, we explore the performance of various feature extraction methods and machine learning techniques.
2. Text analysis and learning
After obtaining the different sacred texts used in this study, we analyze them in a structured manner. We begin with Part-Of-Speech (POS) tagging, which is the process of marking up each word in a corpus (text) as corresponding to a particular part of speech based on both its definition and its context. For our analysis we use the nltk package in Python. After presenting Information Retrieval, we give an overview of Natural Language Processing. Then we explain the different steps of the pre-processing.
Before describing the pre-processing, we need to introduce some concepts related to knowledge discovery. Information Retrieval (IR) means finding documents that contain information about a certain term or keyword (Manning, Manning and Schütze, 1999). For instance, Google performs this kind of retrieval in its search engine, which uses query-based algorithms to track trends and attain more significant results. IR also has many everyday uses, such as email search or searching for a file on a personal computer.
Natural language processing (NLP) refers to the automatic processing and analysis of unstructured text data. NLP performs different types of analysis, such as Named Entity Recognition (NER), which extracts abbreviations and their synonyms to find the relationships among them (Laxman and Sujatha, 2013). With this approach, we can identify all the instances of specified objects in a group of documents. Moreover, it allows the identification of relationships and other information to capture their key concepts. In the real world, a single entity can be referred to by several terms, like TV and Television.
Pre-processing is the conversion of raw text data to a structured sequence of linguistic components in the form of a standard vector. Examples of pre-processing techniques are:
• Tokenization: Tokenization means the process of splitting words, sentences or texts into a list of tokens. The components with special syntax and semantics are called "tokens". We distinguish two tokenization techniques, namely word tokenization and sentence tokenization. Word tokenization is the technique of splitting sentences into their constituent words, and sentence tokenization is the process of splitting the text corpus into meaningful sentences.
• Stemming: Stemming means the reduction of words to their roots so that, for instance, the different grammatical forms or declinations of verbs are identified and counted as the same word (Nisbet, Elder and Miner, 2009). This step is important for information retrieval and text analysis applications such as clustering and automatic text processing.
• Lemmatization: Lemmatization is the process of resolving words to their dictionary form. In fact, the lemma of a word is its dictionary form. To resolve a word to its lemma, its part of speech is needed. It is often harder to create a lemmatizer for a new language than a stemmer.
By using text pre-processing, we can extract from the text data (corpus) the knowledge needed for a good analysis and for a high accuracy of the classifiers. These steps are shown in figure 1.
Figure 1: The different steps of pre-processing.
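The pre-processing steps listed above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer, and none of the function names come from the paper or from nltk.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in", "it", "as"}

def sentence_tokenize(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_tokenize(sentence):
    """Lowercase a sentence and keep alphabetic runs only.

    This drops punctuation and integers in the same step.
    """
    return re.findall(r"[a-z]+", sentence.lower())

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Sentence tokenization, word tokenization, stop-word removal, stemming."""
    tokens = []
    for sentence in sentence_tokenize(text):
        for word in word_tokenize(sentence):
            if word not in STOP_WORDS:
                tokens.append(naive_stem(word))
    return tokens

sample = "The wise man seeks wisdom. Fools despise teaching and walked away!"
print(preprocess(sample))
```

A lemmatizer is left out on purpose: as noted above, it needs part-of-speech information and a dictionary, which a short self-contained sketch cannot provide.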
To make the unstructured text data useful, we need to transform it into a vector space, which can be done with the Bag-Of-Words (BOW) representation. The BOW converts the text data into a matrix format, which simplifies the statistical analysis of the data. The obtained matrix is called the Document Term Matrix (DTM). The rows of this matrix represent the documents and the columns represent the terms. The DTM gives the number of occurrences of each term in each document.
In this study, we use nine different books ($T = 9$), with 704 chapters/documents ($n = 704$) and 5131 tokens/words ($p = 5131$). Hence, we have a $704 \times 5131$ DTM. When we consider the $t$-th sacred scripture separately by means of its own DTM $X^{(t)}$, its representation is as follows:
\[
X^{(t)} =
\begin{bmatrix}
X^{(t)}_{1,1} & X^{(t)}_{1,2} & \cdots & X^{(t)}_{1,j_t} & \cdots & X^{(t)}_{1,p_t} \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
X^{(t)}_{i_t,1} & X^{(t)}_{i_t,2} & \cdots & X^{(t)}_{i_t,j_t} & \cdots & X^{(t)}_{i_t,p_t} \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
X^{(t)}_{n_t,1} & X^{(t)}_{n_t,2} & \cdots & X^{(t)}_{n_t,j_t} & \cdots & X^{(t)}_{n_t,p_t}
\end{bmatrix}
\tag{1}
\]
In this matrix, each column $X^{(t)}_{\cdot,j_t}$ represents a term and each row $X^{(t)}_{i_t,\cdot}$ represents a document/chapter.
If we consider the whole corpus of sacred books, the Document Term Matrix stacks the chapters of the $T$ books over the $p$ terms:
\[
X =
\begin{bmatrix}
X^{(1)}_{1,1} & X^{(1)}_{1,2} & \cdots & X^{(1)}_{1,j} & \cdots & X^{(1)}_{1,p} \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
X^{(1)}_{n_1,1} & X^{(1)}_{n_1,2} & \cdots & X^{(1)}_{n_1,j} & \cdots & X^{(1)}_{n_1,p} \\
\vdots & \vdots & & \vdots & & \vdots \\
X^{(t)}_{1,1} & X^{(t)}_{1,2} & \cdots & X^{(t)}_{1,j} & \cdots & X^{(t)}_{1,p} \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
X^{(t)}_{i_t,1} & X^{(t)}_{i_t,2} & \cdots & X^{(t)}_{i_t,j} & \cdots & X^{(t)}_{i_t,p} \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
X^{(t)}_{n_t,1} & X^{(t)}_{n_t,2} & \cdots & X^{(t)}_{n_t,j} & \cdots & X^{(t)}_{n_t,p} \\
\vdots & \vdots & & \vdots & & \vdots \\
X^{(T)}_{1,1} & X^{(T)}_{1,2} & \cdots & X^{(T)}_{1,j} & \cdots & X^{(T)}_{1,p} \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
X^{(T)}_{n_T,1} & X^{(T)}_{n_T,2} & \cdots & X^{(T)}_{n_T,j} & \cdots & X^{(T)}_{n_T,p}
\end{bmatrix}
\tag{2}
\]
After converting the documents into term vectors, the similarity between them can be estimated by comparing the corresponding vectors. Various algorithms can be used to distinguish the documents precisely. In this part, we describe some of the standard distances used in text analysis, such as the cosine similarity, the Euclidean distance and the Manhattan distance (Qahl, 2014).
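To make the bag-of-words construction concrete, the following minimal sketch builds a small document term matrix from already tokenized documents. The function name `build_dtm` and the toy documents are hypothetical; the paper's own 704 x 5131 matrix is not reproduced here.

```python
from collections import Counter

def build_dtm(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i.

    `documents` is a list of token lists; rows are documents, columns are terms,
    and the vocabulary is sorted so that column order is reproducible.
    """
    counts = [Counter(doc) for doc in documents]
    vocabulary = sorted(set().union(*documents))
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

# Three toy "chapters" (illustrative tokens only).
docs = [
    ["god", "lord", "god"],
    ["path", "mind", "path", "mind"],
    ["lord", "wisdom"],
]
vocab, dtm = build_dtm(docs)
for row in dtm:
    print(row)
```

Stacking the rows of several books in this way gives the whole-corpus matrix of equation 2, provided all books share the same vocabulary.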
The most common term-weighting scheme used for document similarity is term frequency-inverse document frequency (TF-IDF). The Term Frequency (TF) means that the more frequently a term appears in a document, the more its weight increases. With the inverse document frequency (IDF), terms occurring in a greater number of documents are relatively less relevant and are weighted less. The term frequency is given by equation 3, where $freq_{i,j}$ represents the number of occurrences of the word $j$ in file $i$, the maximum is taken over the words $k$ of file $i$, and $W_{i,j}$ denotes the weight of word $j$ in file $i$.
\[
tf_{i,j} = \frac{freq_{i,j}}{\max_{k} freq_{i,k}}
\tag{3}
\]
In equation 4, $m$ represents the number of files and $m_j$ represents the number of files containing the word $j$.
\[
idf_{i,j} = \log \frac{m}{m_j}
\tag{4}
\]
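As a hedged illustration, equations 3 and 4 can be evaluated directly on a count matrix, and their product gives the TF-IDF weight of each word in each file. The function name `tf_idf`, the use of the natural logarithm, and the toy matrix are assumptions made for this sketch.

```python
import math

def tf_idf(dtm):
    """Weight a count matrix with tf (eq. 3) times idf (eq. 4).

    dtm[i][j] is the number of occurrences of word j in file i.
    """
    m = len(dtm)       # number of files
    p = len(dtm[0])    # number of words
    # m_j: number of files that contain word j
    doc_freq = [sum(1 for row in dtm if row[j] > 0) for j in range(p)]
    weights = []
    for row in dtm:
        max_freq = max(row) or 1  # guard against an empty file
        weights.append([
            (row[j] / max_freq) * math.log(m / doc_freq[j]) if doc_freq[j] else 0.0
            for j in range(p)
        ])
    return weights

toy_dtm = [
    [2, 1, 0],
    [0, 1, 3],
    [0, 1, 0],
]
w = tf_idf(toy_dtm)
print(w)
```

In this toy matrix the second word occurs in every file, so its idf is log(3/3) = 0 and its weight vanishes everywhere, which is exactly the down-weighting of common terms described above.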
The weight is calculated according to equation 5:
\[
W_{i,j} = tf_{i,j} \cdot idf_{i,j}
\tag{5}
\]
In this part of our study, we explore the similarity and dissimilarity between chapters and documents by applying non-standard distances like the cosine similarity and also some standard distances such as the Euclidean distance, the Manhattan distance and the Jaccard similarity (Sah and Fokoué, 2019). Notice that not every distance measure is a metric. A metric must satisfy four conditions. Let $A$ and $B$ be any two documents and $d(A, B)$ be the distance between them (Maher and Joshi, 2016). The conditions read as follows:
• The distance between any two given points must be nonnegative: $d(A, B) \geq 0$.
• The distance between two points is equal to zero if and only if the two points are the same: $d(A, B) = 0$ iff $A = B$.
• The distance must be symmetric, which means that the distance from $A$ to $B$ is the same as the distance from $B$ to $A$: $d(A, B) = d(B, A)$.
• The distance measure must satisfy the triangle inequality 6 for any third document $C$:
\[
d(A, C) \leq d(A, B) + d(B, C)
\tag{6}
\]
The different distance and similarity measures, which facilitate the interpretation of the similarity or dissimilarity between documents, can also be applied to the corpus. In text mining, we deal with high-dimensional data, and the sparsity of the raw data makes the study more complex.
• The Euclidean distance measure: The Euclidean distance can be defined as the shortest straight-line distance between two points. It is part of the Minkowski family, a class of metric distances on the Euclidean space (Deza and Deza, 2006). The mathematical representation of the Euclidean distance is given by equation 7.
\[
d_E(X^{(a)}_l, X^{(b)}_m) = \sqrt{\sum_{j=1}^{p} \left( X^{(a)}_{l,j} - X^{(b)}_{m,j} \right)^2}
\tag{7}
\]
• The Jaccard similarity measure: The Jaccard similarity measure is an approach used to measure the similarity between two chapters/documents by taking their intersection and dividing it by their union (Zahrotun, 2016). The Jaccard coefficient between two documents $X^{(a)}_l$ and $X^{(b)}_m$ is mathematically defined by:
\[
sim(X^{(a)}_l, X^{(b)}_m) \equiv sim(X_l, X_m) = \frac{\sum_{j=1}^{p} \min\{X_{lj}, X_{mj}\}}{\sum_{k=1}^{p} \max\{X_{lk}, X_{mk}\}}
\tag{8}
\]
Thus, using equation 8, the Jaccard distance between two documents is given by equation 9:
\[
d_J(X^{(a)}_l, X^{(b)}_m) \equiv d_J(X_l, X_m) = 1 - sim(X_l, X_m)
\tag{9}
\]
• The Manhattan distance measure: The Manhattan distance is the sum of the absolute differences between two points across all dimensions.
\[
d_M(X^{(a)}_l, X^{(b)}_m) = \sum_{j=1}^{p} \left| X^{(a)}_{l,j} - X^{(b)}_{m,j} \right|
\]
• The Cosine similarity measure: The cosine similarity measure considers the correlation between the vectors. It is also the most popular similarity measure applied to text documents (Maher and Joshi, 2016). It is defined by:
\[
d_{Cos}(X^{(a)}_l, X^{(b)}_m) \equiv d_{Cos}(X_l, X_m) = \frac{X_l^T X_m}{\sqrt{X_l^T X_l}\,\sqrt{X_m^T X_m}}
\tag{10}
\]
Having introduced the distance and similarity measures, we can use them to explore the similarity between books and between documents.
First of all, we compare the chapters within the same book. For instance, in book $X^{(\alpha)}$, the distances between the chapters are collected in the matrix
\[
D^{(\alpha)} =
\begin{bmatrix}
d^{(\alpha)}_{1,1} & d^{(\alpha)}_{1,2} & \cdots & d^{(\alpha)}_{1,n_\alpha} \\
d^{(\alpha)}_{2,1} & d^{(\alpha)}_{2,2} & \cdots & d^{(\alpha)}_{2,n_\alpha} \\
\vdots & \vdots & \ddots & \vdots \\
d^{(\alpha)}_{n_\alpha,1} & d^{(\alpha)}_{n_\alpha,2} & \cdots & d^{(\alpha)}_{n_\alpha,n_\alpha}
\end{bmatrix}
\]
Notice that $D^{(\alpha)}$ is a nonnegative $n_\alpha \times n_\alpha$ matrix, $D^{(\alpha)} \in \mathbb{R}_{+}^{n_\alpha \times n_\alpha}$. Its components are defined as $d^{(\alpha)}_{l,m} \equiv d(X^{(\alpha)}_l, X^{(\alpha)}_m)$, the distance between the $l$-th chapter and the $m$-th chapter of the $\alpha$-th book $X^{(\alpha)}$. This helps us evaluate the relationship between the various chapters within the same book.
If we want to discover the relationship between books $X^{(\alpha)}$ and $X^{(\beta)}$, we can use the matrix
\[
D =
\begin{bmatrix}
d_{1,1} & d_{1,2} & \cdots & d_{1,n_\beta} \\
d_{2,1} & d_{2,2} & \cdots & d_{2,n_\beta} \\
\vdots & \vdots & \ddots & \vdots \\
d_{n_\alpha,1} & d_{n_\alpha,2} & \cdots & d_{n_\alpha,n_\beta}
\end{bmatrix}
\]
This $n_\alpha \times n_\beta$ matrix is also nonnegative, $D \in \mathbb{R}_{+}^{n_\alpha \times n_\beta}$. Here $d_{l,m} \equiv d(X^{(\alpha)}_l, X^{(\beta)}_m)$ is the distance between the $l$-th chapter of book $X^{(\alpha)}$ and the $m$-th chapter of book $X^{(\beta)}$. This helps us evaluate the relationship between the chapters of two different books.
There are four approaches to studying the similarity between the sacred texts:
1. The first approach for obtaining the distance between books is named single or minimum linkage and is given by relation 11. The distance between two books, for instance $X^{(\alpha)}$ and $X^{(\beta)}$, corresponds to the smallest value of the $n_\alpha \times n_\beta$ matrix of distances between their chapters. Mathematically, it is defined as:
\[
\Delta(X^{(\alpha)}, X^{(\beta)}) = \min_{l \in [n_\alpha],\, m \in [n_\beta]} \left\{ d(X^{(\alpha)}_l, X^{(\beta)}_m) \right\}
\tag{11}
\]
2. The second approach for measuring distances between books is the maximum linkage given by relation 12. In this method the distance between two books, for instance $X^{(\alpha)}$ and $X^{(\beta)}$, corresponds to the largest value of the $n_\alpha \times n_\beta$ distance matrix of their chapters. Mathematically, it is defined as:
\[
\Delta(X^{(\alpha)}, X^{(\beta)}) = \max_{l \in [n_\alpha],\, m \in [n_\beta]} \left\{ d(X^{(\alpha)}_l, X^{(\beta)}_m) \right\}
\tag{12}
\]
3. The third approach is the average linkage, defined by relation 13. Here, the distance between two books, for instance $X^{(\alpha)}$ and $X^{(\beta)}$, corresponds to the mean value of the $n_\alpha \times n_\beta$ distance matrix of their chapters. It is written as:
\[
\Delta(X^{(\alpha)}, X^{(\beta)}) = \operatorname{mean}_{l \in [n_\alpha],\, m \in [n_\beta]} \left\{ d(X^{(\alpha)}_l, X^{(\beta)}_m) \right\}
\tag{13}
\]
4. The last approach is the median distance, defined by relation 14, which reads:
\[
\Delta(X^{(\alpha)}, X^{(\beta)}) = \operatorname{median}_{l \in [n_\alpha],\, m \in [n_\beta]} \left\{ d(X^{(\alpha)}_l, X^{(\beta)}_m) \right\}
\tag{14}
\]
Having $B$ sacred books, the distances between the whole set of books are collected in
\[
\Delta =
\begin{bmatrix}
\Delta_{1,1} & \Delta_{1,2} & \cdots & \Delta_{1,B} \\
\Delta_{2,1} & \Delta_{2,2} & \cdots & \Delta_{2,B} \\
\vdots & \vdots & \ddots & \vdots \\
\Delta_{B,1} & \Delta_{B,2} & \cdots & \Delta_{B,B}
\end{bmatrix}
\]
where $\Delta_{\alpha,\beta} \equiv \Delta(X^{(\alpha)}, X^{(\beta)})$ corresponds to the distance between book $\alpha$ and book $\beta$.
For this research work, we use the Asian scriptures, the Bible and the Quran as our collection of texts (data), also called the corpus in this context. When we want to determine the origin of a text or of a text fragment, we can use machine learning to make predictions about the source of these texts. Thus, in what follows, we describe unsupervised and supervised machine learning applied to text mining.
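The distance measures and the four linkage rules defined in this section can be sketched as follows. This is a minimal illustration in standard-library Python; the function names, the convention of returning 0 for empty documents, and the two toy "books" are assumptions of the sketch rather than the authors' implementation. The cosine formula uses the usual normalization by the two vector norms.

```python
import math
from statistics import mean, median

def euclidean(x, y):
    """Straight-line distance between two term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard_distance(x, y):
    """One minus the ratio of summed minima to summed maxima."""
    den = sum(max(a, b) for a, b in zip(x, y))
    if den == 0:
        return 0.0  # two empty documents are treated as identical (assumption)
    return 1.0 - sum(min(a, b) for a, b in zip(x, y)) / den

def cosine_similarity(x, y):
    """Inner product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def book_distance(book_a, book_b, dist=euclidean, linkage=min):
    """Aggregate all chapter-to-chapter distances between two books.

    `linkage` may be min (single), max (maximum), mean (average) or median.
    """
    return linkage([dist(x, y) for x in book_a for y in book_b])

# Two toy books, each a list of chapter vectors over the same three terms.
book_a = [[1, 0, 2], [0, 1, 2]]
book_b = [[1, 0, 0], [3, 0, 2]]

for name, rule in [("single", min), ("maximum", max), ("average", mean), ("median", median)]:
    print(name, book_distance(book_a, book_b, linkage=rule))
```

Passing the linkage as a function keeps the four book-level distances in one place; filling a B x B table with `book_distance` for every pair of books would give the matrix of book distances described above.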
In machine learning, a machine is trained on data and, based on this training, learns patterns in the data. For text analytics, machine learning involves a set of statistical techniques for identifying various characteristics and patterns among the considered texts. Machine learning is divided into supervised and unsupervised learning, which are described in the following. Because text data can have hundreds of thousands of dimensions (sentences or words), it requires a particular approach to machine learning.
• Unsupervised learning: Unsupervised machine learning algorithms infer patterns from a dataset without reference to known or labeled outcomes. Unlike supervised methods, unsupervised methods cannot be directly applied to a regression or classification problem because the labels are missing. The model is trained without pre-tagging or annotating.
One of the most interesting questions is whether it is possible to classify texts according to the sacred books that they originate from. As far as the relationships among entire sacred scriptures are concerned, we briefly describe here how we use cluster analysis to tackle that overarching question.
• Supervised learning: In supervised learning the goal is to predict the label of a data sample. In the context of sacred text classification, we use labelled documents to train a model. The goal of a supervised algorithm is to achieve a high prediction accuracy after training.
Several algorithms can be applied to the labeled corpus. Among others, we have classification algorithms such as K-nearest neighbors (KNN) and Random Forest (RF), as well as the Support Vector Machine (SVM), which can be either linear or non-linear.
  – The K-Nearest Neighbors (KNN): There are a number of methods and techniques for categorizing texts. Given a new document, the KNN method assigns to it the label that occurs most often among its closest neighbors.
  – The Support Vector Machine (SVM): The support vector machine (SVM) is a supervised learning method that generates input-output mapping functions from a set of labeled training data (Wang, 2005). To perform classification, the input data is projected to a higher-dimensional feature space through nonlinear kernel functions that enforce the separability of the input data. In addition, the SVM creates a margin between the separating hyperplane and the data.
The equation of the hyperplane is:
\[
w^T x - b = 0
\]
The equations of the margin hyperplanes are given by:
\[
\begin{cases}
w^T x - b = 1 \\
w^T x - b = -1
\end{cases}
\tag{15}
\]
The first instance is the Hard Margin. The goal here is to maximize the margin $\frac{2}{\|w\|}$, which is equivalent to minimizing $\|w\|$. We must have two constraints:
\[
\begin{cases}
w^T x^{(i)} - b \geq 1 & \text{if } y^{(i)} = 1 \\
w^T x^{(i)} - b \leq -1 & \text{if } y^{(i)} = -1
\end{cases}
\tag{16}
\]
Combining the constraints, we have:
\[
y^{(i)} \left( w^T x^{(i)} - b \right) \geq 1 \quad \text{for } i = 1, \ldots, m
\]
For this first instance, the optimization problem is defined as:
\[
w^{*} = \arg\min_{w} \|w\|
\]
The second instance is the Soft Margin. It allows points to lie inside the margin. The loss function becomes the hinge loss, which is used when one deals with discrete labels. The objective is defined by:
\[
\mathcal{L}(w) = \frac{1}{m} \sum_{i=1}^{m} l(w; x^{(i)}, y^{(i)}) + \lambda \|w\|^2
\tag{17}
\]
with
\[
l(w; x^{(i)}, y^{(i)}) = \max\left[ 0,\; 1 - y^{(i)} \left( w^T x^{(i)} - b \right) \right]
\]
Here, the optimization problem is defined as:
\[
w^{*} = \arg\min_{w} \mathcal{L}(w)
\]
The third instance is the non-linear classification. In this case, the feature $x_i$ is mapped to the feature space through $\phi$, and the kernel trick is given by:
\[
K(x, x') = \phi^T(x) \cdot \phi(x')
\]
The weight is equal to:
\[
w = \sum_{i=1}^{m} \alpha_i y^{(i)} \phi(x^{(i)}), \qquad
w^T = \sum_{i=1}^{m} \alpha_i y^{(i)} \phi^T(x^{(i)}), \qquad
w^T \phi(x') = \sum_{i=1}^{m} \alpha_i y^{(i)} K(x^{(i)}, x')
\]
  – The Random Forest (RF): Random Forest is a tree-based classifier which can be used for the classification of text data. Because it is capable of extracting similarity between features, it is also used in text mining as an embedded feature selection method.
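As a small worked illustration of the soft-margin case, the sketch below evaluates the hinge loss and the regularized objective for a fixed linear classifier. It assumes the standard hinge loss max[0, 1 - y(w^T x - b)] and a squared-norm penalty; the function names, the value of lambda and the toy data are hypothetical.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss of one labelled point (y is +1 or -1)."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
    return max(0.0, 1.0 - margin)

def soft_margin_objective(w, b, X, Y, lam):
    """Average hinge loss plus lam times the squared norm of w."""
    m = len(X)
    data_term = sum(hinge_loss(w, b, x, y) for x, y in zip(X, Y)) / m
    penalty = lam * sum(wi * wi for wi in w)
    return data_term + penalty

# Toy data: the third point lies inside the margin of the chosen hyperplane.
X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
Y = [1, -1, 1]
w, b = [1.0, -1.0], 0.0

losses = [hinge_loss(w, b, x, y) for x, y in zip(X, Y)]
objective = soft_margin_objective(w, b, X, Y, lam=0.1)
print(losses, objective)
```

Only the point inside the margin contributes a non-zero loss, which is the behaviour the soft margin is meant to allow while still penalizing it.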
Let 𝑥 be the vector of word frequency corresponding to a chapter from one of the nine sacred texts. The probabilitythat this chapter 𝑥 comes from the 𝑡 𝑡ℎ book is denoted as ℙ [ 𝑦 = 𝑠 𝑡 | 𝑥 ] where 𝑦 is the variable indicating the sacredbook.The function 𝑓 according to which classification is done reads: 𝑓 ( 𝑥 ) = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑡 ∈[ 𝑇 ] { ℙ [ 𝑦 = 𝑠 𝑡 | 𝑥 ]} (18)For the pure purpose of scripture authentication, we build the classifier from the data ̂𝑓 ( 𝑥 ) . For instance, the classifierfunction for K-Nearest-Neighbors is given by: ̂𝑓 ( 𝐾𝑁𝑁 ) ( 𝑥 ) = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑡 ∈ 𝑇 { 𝑘 𝑛 ∑ 𝑗 =1 𝕀 ( 𝑦 𝑗 = 𝑡 ) 𝕀 ( 𝑥 𝑗 ∈ 𝑘 ( 𝑥 )) } To evaluate the performance of our model, we will Cross validate. Another method that is used for the selectionof the best hyperparameter configuration is the Grid search.• m-Fold Cross validationInspecting the test error, one can separate the data into a training and test set. However, this process may leadaccuracies that are very different from one test to another using the same algorithm. To avoid this issue, we willuse the m-Fold Cross-Validation where 𝑚 is a indicating the number of folds. In this process, we divide thedata into 𝑚 folds. A training set is represented by 𝑚 − 1 set while the testing set is the remaining. After eachiteration it shuffle the data and create a new training and testing sets.The cross-validation estimate of prediction error is given by 𝐶𝑉 ( ̂𝑓 ) = 1 𝑁 𝑁 ∑ 𝑖 =1 𝐿 ( 𝑦 𝑖 , ̂𝑓 − 𝑘 ( 𝑖 ) ( 𝑥 𝑖 )) where 𝑘 ( 𝑖 ) means, the model f is trained without the training patterns in the same partition of the dataset aspattern 𝑖 . YOUNOUS MOFENJOU PEURIEKEU et al.:
• Grid search
Hyperparameter optimization is essential in machine learning because models such as neural networks are difficult to configure and have many parameters that need to be set. Setting the hyperparameters manually can consume a lot of time. Hence, we use the grid search method for this purpose. Grid search is the process of scanning a predefined grid of hyperparameter values to find the optimal configuration for a given model. Depending on the type of model that is used, the grid of candidate parameters needs to be adjusted. Grid search fits a model for every parameter combination and selects the optimal configuration for our model. The general formulation to compute the accuracy is given by:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (19)$$
where TP, TN, FP and FN denote True Positive, True Negative, False Positive and False Negative, respectively.
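The evaluation procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this work: it assumes a majority-vote KNN with Euclidean distance as the model, a grid over the number of neighbors $k$ only, a single shuffle before splitting into folds, and made-up two-dimensional toy data. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(train_x)),
                   key=lambda j: euclidean(train_x[j], x))
    votes = Counter(train_y[j] for j in order[:k])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(xs, ys, k, m=5, seed=0):
    """m-fold CV: each fold is held out once; returns the mean accuracy."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # shuffled once (assumption)
    folds = [idx[f::m] for f in range(m)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        tr_x = [xs[i] for i in idx if i not in held]
        tr_y = [ys[i] for i in idx if i not in held]
        correct += sum(knn_predict(tr_x, tr_y, xs[i], k) == ys[i]
                       for i in held_out)
    return correct / len(xs)


def grid_search(xs, ys, k_grid, m=5):
    """Score every candidate k by cross-validation and keep the best."""
    scores = {k: cross_val_accuracy(xs, ys, k, m) for k in k_grid}
    return max(scores, key=scores.get), scores


def accuracy(tp, tn, fp, fn):
    """Equation (19) for a binary confusion table."""
    return (tp + tn) / (tp + tn + fp + fn)


# Toy data: two well-separated clusters standing in for two books.
xs = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2),
      (10, 10), (10, 11), (11, 10), (11, 11), (10, 12)]
ys = ["A"] * 5 + ["B"] * 5
best_k, scores = grid_search(xs, ys, k_grid=[1, 3], m=5)
```

With the 0-1 loss, the cross-validation error $CV(\hat{f})$ is one minus the value returned by `cross_val_accuracy`.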
3. Results
We now present the results of our analysis. We start by describing our data, continue with the distribution of the words, then perform some text analysis, and end by presenting the distance measurements.
Before starting the pre-processing, we downloaded the English version of the Quran. We then transformed the text into tokens, which yields 156110 tokens in total. This step is followed by the removal of uninformative tokens: punctuation marks (e.g. - , : ""), stop words (e.g. as, it, very, own, any, only, off) and integers (e.g. 1, 2, 3) are removed. Finally, we obtain cleaner data ready to be used.
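The cleaning steps above can be sketched as follows. This is a minimal illustration using only the standard library rather than NLTK; the stop-word set is a small illustrative subset (an assumption, since the full list is not given here), the sample sentence is made up, and `preprocess` is a hypothetical helper name.

```python
import re

# Illustrative subset only; the actual stop-word list is not given here.
STOP_WORDS = {"as", "it", "very", "own", "any", "only", "off",
              "the", "and", "of", "to", "in", "is", "we"}


def preprocess(text):
    """Lowercase and tokenize, then drop punctuation, integers, stop words."""
    # The pattern keeps alphabetic runs and digit runs; punctuation is
    # never matched, so it disappears at this stage.
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [t for t in tokens if not t.isdigit() and t not in STOP_WORDS]


sample = "1. In the name of God, the Lord of mercy: it is only God we worship."
clean_tokens = preprocess(sample)
```

Stemming, which the token `believ` in Table 1 suggests was applied, is left out of this sketch.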
The frequency distribution gives, for each token, the ratio between its number of occurrences and the total number of tokens. It is shown in Figure 2.
Figure 2:
Frequency distribution of words in text data.
In this graph, we can see that the word god is the most frequently used word. It is followed by the words lord and believe.
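The ratio plotted in such a frequency distribution can be computed as in the following minimal sketch. The token list is made up for illustration and `relative_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each token to count / total, most frequent token first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.most_common()}


# Made-up token list, for illustration only.
tokens = ["god", "lord", "god", "believ", "god", "lord", "allah", "said"]
freqs = relative_frequencies(tokens)
```

The returned dictionary preserves the descending order of `most_common`, so its first key is the most frequent token.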
In this section, we present the Part-Of-Speech (POS) tags that we obtain from the list of tokens. We then present the frequency plot of these POS tags.
The following table shows a few examples of POS tags from our list of tokens.

Table 1: The Part-Of-Speech tagging for the data

Object    Label prefix
god       NN
lord      NN
said      VBD
believ    NN
allah     NN
may       MD
Figure 3:
Frequency of POS tagging in text data.
From subsection 3.1 to subsection 3.3, we repeat the same method for the Asian books and the Bible.
• Document Term Matrix of the Quran
The following table describes the labeled document term matrix of the Quran. From this table, we see that the sacred book has 114 chapters/documents and 4284 terms/tokens. The value in each cell represents the number of times that the word corresponding to that column occurs in the document corresponding to that row.
• Document Term Matrix of the Asian books and the Bible
The following table describes the labeled document term matrix of the Asian books and the Holy Bible. The table shows that these sacred books together have 590 chapters/documents and 1024 terms/tokens.
• Document Term Matrix of both sets of documents
When we put all the sacred texts together, we get a DTM with 704 chapters/documents and 5131 terms/tokens. We notice that there are some missing values, which are not needed for our analysis. Thus, the next step consists of removing them. After removing all the missing values, we get the table below. It still has 704 chapters/documents and 5131 terms/tokens. This DTM is used for the several analyses that we carry out on the sacred scriptures.
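The construction of a labeled document term matrix can be sketched as follows. This is a minimal illustration with made-up chapters and tokens; `build_dtm` is a hypothetical helper name. In this sketch a term that does not occur in a chapter simply receives the count 0, which is an assumption and may differ from the handling of missing values described above.

```python
from collections import Counter


def build_dtm(documents):
    """documents: dict mapping a chapter label to its list of tokens.

    Returns (labels, vocabulary, matrix), where matrix[i][j] is the number
    of times vocabulary[j] occurs in the chapter labels[i].
    """
    labels = list(documents)
    vocab = sorted({tok for toks in documents.values() for tok in toks})
    counts = [Counter(documents[label]) for label in labels]
    # A Counter returns 0 for absent terms, so no entry is left undefined.
    matrix = [[c[term] for term in vocab] for c in counts]
    return labels, vocab, matrix


# Made-up chapters from two hypothetical books.
docs = {
    "book_A_ch1": ["god", "lord", "god", "mercy"],
    "book_A_ch2": ["lord", "believ"],
    "book_B_ch1": ["sage", "way", "way"],
}
labels, vocab, dtm = build_dtm(docs)
```

Because the vocabulary is the union of the terms of all chapters, stacking chapters from different books yields rows of equal length, which is what the distance measures and classifiers require.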
Since we already have the matrix representation of the different sacred texts, namely the document term matrix (DTM) above, we proceed with the study of the distances/similarities between the chapters of the same book, between the chapters of one sacred scripture and those of the other sacred books, and finally between the different books without dividing them into chapters.
After converting the sacred books into term vectors, the distances/similarities between two documents can be estimated by comparing the corresponding vectors. In this part, we apply the Euclidean distance, the Manhattan distance, the cosine similarity and the Jaccard distance.
• Euclidean distance
Using the Euclidean distance, we can visualize the distance matrix by its corresponding heatmap.
Figure 4:
Heatmap of the Euclidean distance between the chapters in both sacred books.
This heat map shows, for instance, that there is a very large distance between the 10th document, corresponding to chapter 9 of the Quran, and the document corresponding to chapter 14 of the Book of Wisdom.
• Cosine distance
Analogously, applying the cosine distance yields a distance matrix, whose heatmap representation is depicted below.
Figure 5:
Heatmap of the cosine distance between the chapters in both sacred books.
This heat map shows, for instance, that there is a considerable distance between the chapters of the books. This is the reason why we see several blocks of color in the heat map.
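The four measures applied to the rows of the DTM can be sketched as follows. This is a minimal illustration under stated assumptions: the cosine distance is taken as one minus the cosine similarity, the Jaccard distance is computed on the sets of terms present in each document, and the vectors are assumed to be non-zero. The function names and the two toy vectors are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """1 - cos(u, v); assumes neither vector is all zeros."""
    inner = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - inner / norms


def jaccard_distance(u, v):
    """1 - |A & B| / |A | B|, with A and B the sets of terms present."""
    a = {i for i, x in enumerate(u) if x > 0}
    b = {i for i, x in enumerate(v) if x > 0}
    return 1.0 - len(a & b) / len(a | b)


def distance_matrix(rows, dist):
    """Square matrix of pairwise distances, the input of a heatmap."""
    return [[dist(r, s) for s in rows] for r in rows]


# Two made-up term-count vectors over a three-term vocabulary.
u, v = [3, 4, 0], [0, 4, 3]
D = distance_matrix([u, v], manhattan)
```

The Euclidean and Manhattan distances depend on the raw counts and therefore on document length, whereas the cosine and Jaccard distances do not, which is one reason the measures can rank pairs of documents differently.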
In this part, we show the different distance measures between the sacred books.
• Euclidean distance
Figure 6:
Heatmap of the Euclidean distance between sacred books.
Labels and books: Quran, Buddhism, Tao-Te-Ching, Upanishad, Yogasutra, Book of Proverbs, Book of Ecclesiasticus, Book of Ecclesiastes, Book of Wisdom.
From this heat map, we can say that there is a large distance between the Quran and the other sacred books, whereas the Book of Ecclesiastes and the Tao Te Ching are similar.
• Cosine distance
Figure 7:
Heatmap of the cosine distance between the sacred books.
Here, the heat map shows that there is still a distance between the Quran and some of the other sacred books. We also see that some of the Asian books (such as Buddhism and Yogasutra) are not similar to the books of the Bible.
In this section, we check whether the different distance measures are based on the same properties. To do this, we compute the correlation between the distance measures applied before.
Figure 8:
Correlation matrix between the values obtained from the different distance measures.
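One way to obtain such a correlation is sketched below: each measure is evaluated on every pair of documents, and the Pearson correlation is computed between the resulting lists of values. The use of the Pearson coefficient is an assumption, since the type of correlation is not stated here; the toy count vectors and function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    inner = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - inner / norms


def pairwise(rows, dist):
    """Distance for every unordered pair of rows (flattened upper triangle)."""
    return [dist(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up count vectors; the second is the first doubled (same direction).
rows = [[2, 0, 1], [4, 0, 2], [0, 3, 0], [1, 1, 1]]
e = pairwise(rows, euclidean)
m = pairwise(rows, manhattan)
c = pairwise(rows, cosine_distance)
r_em, r_ec = pearson(e, m), pearson(e, c)
```

On this toy data the Euclidean and Manhattan values are more strongly correlated with each other than with the cosine values. This only illustrates the mechanics and is not a result of this work.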
Here, we used four different models for the prediction: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Random Forest (RF) and Multinomial Naive Bayes (MNB). Grid search is used to find the hyperparameters of each model that give the highest accuracy. We used m = 10 folds in this work.
We obtain an accuracy of 0.7370 for the RF, 0.7781 for the SVM classifier, 0.6128 for the KNN and 0.8584 for the Multinomial Naive Bayes classifier.
Figure 9:
The confusion matrix of the Random Forest and Support Vector Machine
Figure 10:
The confusion matrix of the K-Nearest Neighbors and Multinomial Naive Bayes
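To illustrate how a Multinomial Naive Bayes classifier operates on DTM rows and how a confusion matrix is tabulated, the following minimal sketch may help. It is not the implementation used in this work: it assumes Laplace smoothing with a hypothetical parameter `alpha = 1`, and the tiny training and test sets are made up.

```python
import math
from collections import Counter


class MultinomialNB:
    """Minimal multinomial naive Bayes on term-count vectors."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing, an assumption

    def fit(self, X, y):
        n_terms = len(X[0])
        self.classes = sorted(set(y))
        n_per_class = Counter(y)
        self.log_prior = {c: math.log(n_per_class[c] / len(y))
                          for c in self.classes}
        self.log_lik = {}
        for c in self.classes:
            totals = [self.alpha] * n_terms
            for row, label in zip(X, y):
                if label == c:
                    totals = [t + v for t, v in zip(totals, row)]
            denom = sum(totals)
            self.log_lik[c] = [math.log(t / denom) for t in totals]
        return self

    def predict(self, row):
        """Class with the largest log prior plus count-weighted log likelihood."""
        def score(c):
            return self.log_prior[c] + sum(
                v * w for v, w in zip(row, self.log_lik[c]))
        return max(self.classes, key=score)


def confusion_matrix(y_true, y_pred, classes):
    """cm[i][j] counts documents of true class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm


# Made-up counts over a four-term vocabulary, two hypothetical books.
X_train = [[3, 2, 0, 0], [2, 1, 0, 1], [0, 0, 2, 3], [0, 1, 3, 2]]
y_train = ["A", "A", "B", "B"]
X_test = [[4, 1, 0, 0], [0, 0, 1, 4], [2, 1, 1, 0]]
y_test = ["A", "B", "B"]

model = MultinomialNB().fit(X_train, y_train)
preds = [model.predict(row) for row in X_test]
cm = confusion_matrix(y_test, preds, model.classes)
```

The diagonal of `cm` counts the correctly classified chapters, so the accuracy is the trace divided by the total number of test documents.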
4. Discussion
When we carried out the natural language processing using the NLTK library, we focused on three things that are essential in text mining. We applied the pre-processing steps as well as the Part-Of-Speech tagging to obtain appropriate text data for the analysis. Finally, we worked out the document term matrix representation of the text data and analyzed it with respect to several distance measures.
Looking at Figure 6, the Euclidean measure tells us that the distance between the Quran and the other sacred books seems to be high. The Asian books are almost similar to the sacred books from the Bible. Looking at Figure 7, however, we notice that the distance between the Quran and some sacred books is not very high. The books from the Bible and the Asian religions are almost similar. This leads us to wonder why the distance measures do not give the same result, which is why it is interesting to check the correlation between these approaches.
When we look at Figure 8, the Euclidean distance and the Manhattan distance are strongly correlated, while we do not observe a high correlation with the Jaccard and cosine distances. Based on this result, we conclude that not all of these methods use the same properties to determine the distance between documents.
To analyze whether the high similarities between documents obtained with the different measures are related to similar meanings among these documents, we consider the following fragments as examples:
1- Tao Te Ching: Not to value and employ men of superior ability is the way to keep the people from rivalry among themselves; not to prize articles which are difficult to procure is the way to keep them from becoming thieves; not to show them what is likely to excite their desires is the way to keep their minds from disorder. Therefore the sage, in the exercise of his government, empties their minds, fills their bellies, weakens their wills, and strengthens their bones. He constantly (tries to) keep them without knowledge and without desire, and where there are those who have knowledge, to keep them from presuming to act (on it). When there is this abstinence from action, good order is universal.

2- Book of Wisdom: - the works of the hands of men. 13:1. But all men are vain, in whom there is not the knowledge of God: and who by these good things that are seen, could not understand him that is, neither by attending to the works have acknowledged who was the workman: 13:2. But have imagined either the fire, or the wind, or the swift air, or the circle of the stars, or the great water, or the sun and moon, to be the gods that rule the world. 13:3.

3- Tao Te Ching: - The skilful traveller leaves no traces of his wheels or footsteps; the skilful speaker says nothing that can be found fault with or blamed; the skilful reckoner uses no tallies; the skilful closer needs no bolts or bars, while to open what he has shut will be impossible; the skilful binder uses no strings or knots, while to unloose what he has bound will be impossible. In the same way the sage is always skilful at saving men, and so he does not cast away any man; he is always skilful at saving things, and so he does not cast away anything. This is called ‘Hiding the light of his procedure.’ Therefore the man of skill is a master (to be looked up to) by him who has not the skill; and he who has not the skill is the helper of (the reputation of) him who has the skill. If the one did not honour his master, and the other did not rejoice in his helper, an (observer), though intelligent, might greatly err about them. This is called ‘The utmost degree of mystery.’

4- Upanishads: - The disciple said: I do not think I know It well, nor do I think that I do not know It. He among us who knows It truly, knows (what is meant by) "I know" and also what is meant by "I know It not." This appears to be contradictory, but it is not. In the previous chapter we learned that Brahman is "distinct from the known" and "beyond the unknown." The disciple, realizing this, says: "So far as mortal conception is concerned, I do not think I know, because I understand that It is beyond mind and speech; yet from the higher point of view, I cannot say that I do not know; for the very fact that I exist, that I can seek It, shows that I know; for It is the source of my being.
Philosophically speaking, all the documents differ in the content they present. The first document deals with the duties of a government in creating a peaceful society. The second one is about the existence of God and the belief in God. The third document deals with interactions among individuals and describes virtue. The last fragment is an abstract discussion about what knowledge is and can be classified as an epistemological text.
Despite the similarities obtained with the Euclidean and Manhattan measures, we see that there is no large overlap in the content of these documents. This prompts us to claim that these two methods do not focus on the meaning of the documents. Using the Jaccard and cosine approaches, we also obtain high similarities, which leads us to the hypothesis that these measures do not focus on the semantics either. Thus, we claim that the similarities are mostly based on the syntactical structures of the documents.
Regarding the classification, our findings reveal that the Multinomial Naive Bayes model yields the best prediction performance, as depicted in Figures 9 and 10. The Quran and the Bible have the largest number of chapters in the corpus, and the Multinomial Naive Bayes is able to predict most of them, with an overall accuracy of around 0.8584. Random Forest (RF) reaches an accuracy of 0.737. K-Nearest Neighbors (KNN) and Support Vector Machine (SVM) fail to distinguish the majority of chapters from the Asian sacred books.
Future research:
As future work, we plan to extend our corpus by taking into consideration many other religions around the world and, therefore, more books. This could include, for instance:
• Inclusion of more books from Asian and Christian religions;
• Inclusion of the scriptures of African traditions, and also the consideration of religions that are now obsolete.
It will be very interesting to deepen this study by exploring all the different properties that the distance measures use to analyze the meaning or semantic aspect of the sacred scriptures.
In addition, since the focus of this study has been on lexical analysis, a further approach could include semantics. Given, for instance, the intrinsic dependence between words in raw texts, much more informative feature engineering could be done beforehand. That could include deriving, from the bag of words, groups of words/tokens as features instead of single tokens. Subsequently, deep learning approaches including Long Short-Term Memory (LSTM) and Convolutional Recurrent Neural Network (CRNN) could be used to fit the data and extract, for instance, the contexts inside the chapters of the books, categorize them, and run a contextual comparison analysis between (chapters of) different books. After effective training, these models could also be used to complete some of the sentences in the actual religious books.
References
Deza, M.M., Deza, E., 2006. Dictionary of distances. Elsevier.
Elton, D.C., Turakhia, D., Reddy, N., Boukouvalas, Z., Fuge, M.D., Doherty, R.M., Chung, P.W., 2019. Using natural language processing techniques to extract information on the properties and functionalities of energetic materials from large text corpora. arXiv preprint arXiv:1903.00415.
Evison, J., 2013. A corpus linguistic analysis of turn-openings in spoken academic discourse: Understanding discursive specialisation. English Profile Journal 3.
Ezzeldin, M., El-Dakhakhni, W., 2020. Metaresearching structural engineering using text mining: Trend identifications and knowledge gap discoveries. Journal of Structural Engineering 146, 04020061.
Hobbs, J.R., Walker, D.E., Amsler, R.A., 1982. Natural language access to structured text, in: Proceedings of the 9th conference on Computational linguistics-Volume 1, Academia Praha. pp. 127–132.
Ignatow, G., Mihalcea, R., 2017. An introduction to text mining: Research design, data collection, and analysis. Sage Publications.
Laxman, B., Sujatha, D., 2013. Improved method for pattern discovery in text mining. International Journal of Research in Engineering and Technology 2, 2321–2328.
Levenberg, A., Pulman, S., Moilanen, K., Simpson, E., Roberts, S., 2014. Predicting economic indicators from web text using sentiment composition. International Journal of Computer and Communication Engineering 3, 109–115.
Maher, K., Joshi, M.S., 2016. Effectiveness of different similarity measures for text classification and clustering. International Journal of Computer Science and Information Technologies 7, 1715–1720.
Manning, C.D., Schütze, H., 1999. Foundations of statistical natural language processing. MIT Press.
McDonald, D., 2014. A text mining analysis of religious texts. The Journal of Business Inquiry 13, 27–47.
Nisbet, R., Elder, J., Miner, G., 2009. Handbook of statistical analysis and data mining applications. Academic Press.
Qahl, S.H.M., 2014. An automatic similarity detection engine between sacred texts using text mining and similarity measures.
Sah, P., Fokoué, E., 2019. What do asian and non-asian scriptures have in common? an applied statistical machine learning inquiry. Math. Appl 8, 151–171.
Sah, P., Fokoué, E., 2019. What do asian religions have in common? an unsupervised text analytics exploration. arXiv preprint arXiv:1912.10847.
Sindler, F.L., 2011. Comparative Study of Christian, Jewish, and Islamic Theodicy. Ph.D. thesis. Reformed Theological Seminary, Virtual Campus.
Sklad, M., Diekstra, R., Ritter, M.d., Ben, J., Gravesteijn, C., 2012. Effectiveness of school-based universal social, emotional, and behavioral programs: Do they enhance students’ development in the area of skill, behavior, and adjustment? Psychology in the Schools 49, 892–909.
Wang, L., 2005. Support vector machines: theory and applications. volume 177. Springer Science & Business Media.
Zahrotun, L., 2016. Comparison jaccard similarity, cosine similarity and combined both of the data clustering with shared nearest neighbor method. Computer Engineering and Applications Journal 5, 11–18.