Classification of Pedagogical Content Using Conventional Machine Learning and Deep Learning Models

A Preprint

Vedat Apuk, Krenare Pireva Nuçi
Department of Computer Science and Engineering
University for Business and Technology
10000 Prishtina, Kosovo
[email protected], [email protected]

January 20, 2021

Abstract
The advent of the Internet and a large number of digital technologies has brought with it many different challenges. A large amount of data is found on the web, which in most cases is unstructured and unorganized, and this makes the use and manipulation of this data quite a difficult process. Due to this fact, the use of different machine learning and deep learning techniques for text classification has gained importance, which has improved this discipline and made it more interesting for scientists and researchers to study further. This paper aims to classify pedagogical content using two different models: the K-Nearest Neighbor (KNN) from the conventional models and the Long Short-Term Memory (LSTM) recurrent neural network from the deep learning models. The results indicate that the accuracy of classifying the pedagogical content reaches 92.52% using the KNN model and 87.71% using the LSTM model.

Keywords: Document Classification · KNN · LSTM · Coursera dataset · education · text classification · deep learning models · machine learning models

1 Introduction

Billions of users create a large amount of data every day, coming from various types of sources. This data is in most cases unorganized and unclassified and is presented in various formats such as text, video, audio, or images. Processing and analyzing this data is a major challenge that we face every day. The problem of unstructured and unorganized text dates back to ancient times, but text classification as a discipline first appeared in the early 60s; 30 years later, interest in it increased across various spheres [1], and it began to be applied in various domains and applications such as movie reviews [2], document classification [3], e-commerce [4], social media [5], online courses [6, 7], etc. As interest has grown over the years, these applications have started solving problems with more accurate results in more flexible ways.
Knowledge Engineering (KE) was one of the applications of text classification in the late 80s, where the process took place by manually defining rules, based on expert knowledge, for assigning a document to a particular category [1]. After this time, there was a great wave of use of various modern and advanced methods for text classification, all of which improved this discipline and made it more interesting for scientists and researchers, more specifically the use of machine learning techniques. These techniques bring a lot of advantages, as they now exist in very large numbers and provide solutions to almost every problem we may encounter.

The need for education and learning dates back to ancient times, as people constantly try to improve and gain as much knowledge as possible. There are various sources of learning available today, including MOOC platforms such as Coursera, Khan Academy, Udemy, Udacity, and edX, to name a few, and as technology has evolved it has contributed to better methods of acquiring knowledge that facilitate this process. The data coming from these sources is in most cases in digital form, more specifically in the form of video and text lessons. The platforms that contain these lessons are called Massive Open Online Courses (MOOCs), where in addition to the video lesson, each lesson also has a textual representation called a transcript. The duration of a video lesson depends on several parameters, such as the category of the video material, the platform on which the lesson is provided, the complexity of the topic, the number of instructors, and the group of lesson attendants. The duration of a lesson indirectly dictates how long the transcript will be, in other words how many words it contains. The category shows the nature of the video and the topics presented in it. Since each video lesson belongs to a certain category, or to a group of categories, so does its transcript. Given this, we can conclude that text classification is becoming quite extensive as a discipline, and its use can solve many challenging problems in every domain, and specifically in the education domain.

The aim of this paper is to investigate two classification techniques used to classify pedagogical content, with a focus on comparing conventional machine learning models with deep learning models, by selecting the KNN algorithm for the first approach and the LSTM architecture for the latter.

To better present our idea, the paper is divided into several sections, as follows: as part of the literature review, the main processes of classifying documents are explained, continuing with related work conducted so far in this area.
In the experimental section, the design of the conventional machine learning model and the deep learning model is elaborated, and the results for each of the architectures are presented using a number of evaluation metrics (recall, precision, F-score, accuracy). The paper closes with conclusions and future work.
2 Literature Review

Text mining, or text analytics, is one of the artificial intelligence techniques that uses Natural Language Processing (NLP) to transform unorganized and unstructured text into an appropriately structured format that makes it easier to process and analyze the data. For businesses and other corporations, generating large amounts of data has become a daily routine. Analysis of this data helps companies gain smarter and more creative insights regarding their services or products, collected from a variety of sources in an automated manner. But this analysis step requires processing a huge amount of data, where the data needs to be prepared, and this is in most cases the cause of various problems. NLP is made up of five steps or phases: Lexical Analysis, Syntax Analysis, Semantic Analysis, Pragmatics, and Discourse [8].

Figure 1: Natural Language Processing steps.

Figure 1 shows the steps within NLP; each of them is briefly described below to convey the main concepts:

1. Lexical Analysis - involves identifying the structure of a sentence in order to separate words from the text and create individual words, sentences, or paragraphs, which also includes separating punctuation from words.
2. Syntax Analysis - involves parsing words and arranging them in a sentence so that they have a certain meaning and relationship between them; it is based exclusively on grammar.
3. Semantic Analysis - analyzes the grammatical structure of a word and seeks a specific meaning in that word. Semantic analysis makes it possible to understand the relationship between lexical items.
4. Pragmatics - concerns how the interpretation of a sentence is affected by its use in different situations, to understand what it means and encompasses.
5. Discourse - points out that the current sentence may depend on the previous sentence, and can also affect the meaning of the sentence that comes after it.
So, the goal of text classification, or text analysis, is to structure and classify data to facilitate the analysis process. Today, as shown in Figure 2, in order to perform text classification on existing data, we follow the four phases emphasized by [9]:

1. Feature Extraction
2. Dimension Reduction
3. Classifier Selection
4. Evaluation

Figure 2: Four-phase model of a text classification system.
2.1 Feature Extraction

As shown in Figure 2, with feature extraction as the initial phase, a piece of text or a document is converted into a so-called structured feature space, which is useful when using a classifier. But prior to this, data cleaning needs to be performed: taking care of missing data and removing unnecessary characters or letters, in order to bring the data into an appropriate shape for extracting the features. Omitting the data cleaning can directly and negatively affect the performance and the accuracy of the final results.

Figure 3: Techniques of the data preprocessing phase.

Emphasizing the importance of pre-processing data, Figure 3 depicts a number of processes that are followed to clean the data and prepare it for further processing [9]:

• Tokenization - the process of separating a piece of text into smaller units called tokens. The way a token is formed is based on a delimiter, which in most cases is a space. Tokens can be words or sub-words, but also, at a lower level, characters.
• Stop Words - words that are commonly used in a language but are unnecessary in the data processing part, and in most cases are ignored because they take up more space in the database and lead to longer processing times. In English, stop words are words like "a", "the", "an", "it", "in", "because", and "what", to name a few.
• Capitalization - the part where it is necessary to identify the correct capitalization of a word, where the first word in a sentence is automatically capitalized.
• Noise Removal - the process of removing characters, numbers, and parts of text that affect the analysis. These can be special characters, punctuation, source code, HTML code, unique characters that represent a particular word, numbers, and many other identifiers.
• Spelling Correction - addresses the problem where a particular word is misspelled and loses its meaning. This problem can be solved in two ways: with edit distance, or with overlap using k-grams.
• Stemming - the process of reducing the morphological variants of a word to the base word, or the so-called root word. For example, "likes", "liked", "liking", and "likely" are morphological variants of the root word "like".
• Lemmatization - in this technique, words are replaced with root words or words that have a similar meaning, called lemmas.
• Syntactic Word Representation (such as N-Grams) - an n-gram is a contiguous sequence of n items from a piece of text.
• Syntactic N-Grams - n-grams that are constructed using paths in syntactic trees.
• Weighted Words (such as TF and TF-IDF)
• Word Embeddings (such as Word2Vec, GloVe, FastText)
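A minimal sketch of whitespace tokenization and word n-gram extraction, as described above; the sample sentence is an illustrative assumption, not taken from the dataset.

```python
# Whitespace tokenization followed by word n-gram extraction.
def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "text classification with machine learning".split()  # tokenization
print(ngrams(tokens, 2))
# ['text classification', 'classification with', 'with machine', 'machine learning']
```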
2.2 Dimension Reduction

As the name itself suggests, the goal of this step is to transform the data from a high-dimensional space to a low-dimensional space. The reason for this is to improve performance, speed up processing, and reduce memory complexity. There are a number of algorithms or techniques that can be implemented in this step, such as: (i) Principal Component Analysis (PCA), (ii) Non-negative Matrix Factorization (NMF), (iii) Linear Discriminant Analysis (LDA), and (iv) Kernel PCA.

Figure 4: Categorization of dimension reduction algorithms.
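As a sketch of this step, the following applies TruncatedSVD (a PCA variant that works directly on sparse matrices) to a tf-idf feature matrix; the four-document corpus is an illustrative assumption, not the paper's dataset.

```python
# Dimension reduction: from a high-dimensional sparse tf-idf space
# to a low-dimensional dense space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "machine learning on text",
    "deep learning for text classification",
    "history of ancient art",
    "classification of modern paintings",
]
X = TfidfVectorizer().fit_transform(docs)        # high-dimensional sparse space
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)                 # low-dimensional dense space
print(X.shape, "->", X_reduced.shape)
```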
2.3 Classifier Selection

One of the main concerns is choosing the right classifier model, one that can perform well on a given dataset and achieve the desired results. Choosing the right classifier model is not an easy task, and it is a challenge that is also referred to in the literature as the Algorithm Selection Problem (ASP). Every day we come across applications that use classification algorithms in some way. The results of the task depend on choosing the right algorithm, one that completes a particular job while showing very good performance and problem optimization. In general, there is no single algorithm that works for every type of problem and can learn all tasks while remaining efficient; this phenomenon is also known as performance complementarity [10]. Many factors affect the performance of a particular algorithm, some of which are the amount of data assigned to it for testing and training, the operating system on which it is executed, the specifications of the machine on which the algorithm runs, and many other factors that directly or indirectly affect the selection of the algorithm.

Some of the algorithms used for text classification are: Logistic Regression, Naive Bayes, K-Nearest Neighbor (KNN), Support Vector Machines (SVM), Decision Trees, Random Forests, neural network algorithms (such as DNN, CNN, RNN), and combination techniques. In our experiment we have used the K-Nearest Neighbor (KNN) algorithm from the conventional models and the LSTM recurrent neural network from the deep learning models.
2.4 Evaluation

One of the most important steps when creating a model for text classification is the evaluation phase. In this phase, algorithms are analyzed and scored to assess how efficiently they performed. It should also be noted that comparing different parameters or metrics is not an easy task. The so-called confusion matrix (see Figure 5) is a table in which classification counts such as True Positives (TPs), False Positives (FPs), False Negatives (FNs), and True Negatives (TNs) are calculated and presented [11].

Figure 5: Confusion Matrix

Figure 5 shows a confusion matrix in which the prediction results are displayed horizontally, while the positive or negative label is shown vertically. Another evaluation metric that has lately been used is the F-score. In this paper, in order to evaluate the experimental models, precision, recall, F-score, and accuracy are used.
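As a worked example of these metrics, the snippet below derives the confusion-matrix counts and precision, recall, F-score, and accuracy from two made-up label vectors (illustrative values, not results from the paper's experiment).

```python
# Confusion-matrix counts (TP, FP, FN, TN) and the derived metrics.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")               # TP=3 FP=1 FN=1 TN=3
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy_score(y_true, y_pred):.2f}")
```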
2.5 Related Work

The various technologies available today have drastically improved the way people gain new knowledge. Technology has greatly influenced the improvement of this process and at the same time contributed to the development of systems that enable a more efficient and easier learning process. With this, the use of various Massive Open Online Courses (MOOCs) has begun to increase, bringing with it various opportunities but also challenges. Attempts to identify and analyze the opportunities and challenges of MOOCs, from both pedagogical and business standpoints, have led to an understanding of how some of the very well known and successful platforms like Coursera, edX, and Udacity have improved their business models through various aspects, using models such as the certification model, freemium model, advertising model, job-matching model, and subcontractor model [12]. During the analysis of these platforms, the authors in [9] concluded that quite a low number of students actually take assessment exams at the end of a MOOC, which makes it difficult to assess whether students joining a MOOC are actually learning the content, and hence whether the MOOC is achieving its goal. One of the main components of these platforms is Learning Objects (LOs). Various techniques for representing Learning Objects, which contain pedagogical values, are presented in [13]. Using the representation features of Learning Objects provides possibilities to personalize and customize content when presenting it to learners, along with the ability to choose an individual learning path that best suits them, aiming to maximize the learning outcome, as claimed in [13]. There are plenty of examples where K-Means, Decision Trees, Deep Neural Networks (DNNs), and other machine learning techniques have been used for classification purposes [14], as eLearning platforms become more accessible with the main goal of providing a smarter way of learning.
The new paradigm of e-learning, also known as Cloud eLearning, aims to offer personalized learning using Cloud resources, where the main challenge is the process of classifying content and matching it with learners' preferences. As part of this work, the author in [15] integrated, as a middle layer, a recommendation system using a hierarchical clustering technique to recommend to learners the courses or materials most similar to their needs, before proposing a learning path using an artificial-intelligence automated planner. Paper [16] also contributes to classification systems for pedagogical content, with the main focus on content classification of video lectures. The authors recommended a visual content classification system (VCCS) for multimedia lecture videos, to classify the content displayed on the blackboard. Through this recommended model, the authors showed over several stages how lecture videos are processed and then, with a combination of support vector machines (SVM) and optical character recognition (OCR), how visual content, text, and equations are classified [16]. Furthermore, in [17], researchers presented the classification and organization of pedagogical documents using domain ontology.

In one of the previous studies [18], the authors presented a technique for automatic classification of MOOC videos, where the first step is to extract transcripts from the videos and then convert them into an image representation using a statistical co-occurrence transform. After that, a CNN model was trained on a dataset collected from Khan Academy, with a total of 2545 videos, in order to evaluate the technique presented in the paper. Based on label accuracy, the best results were achieved with the CNN model, with a value of 97.87%. Similar work has been carried out by Imran, Kurti and Kastrati in [19], where they proposed a video classification framework consisting of three main modules: pre-processing, transcript representation, and classifier.
In that paper, it was concluded that much better classification results were achieved at the general level than at the specific level, which the authors attribute to the class overlap that the specific-level categories contain.

This paper aims to classify pedagogical content using two different algorithms: K-Nearest Neighbor as a conventional machine learning model and Long Short-Term Memory (LSTM), an artificial recurrent neural network architecture used in deep learning.
3 Methodology

This section presents the methodology used during the research and the experimental part. Initially, a brief introduction to the dataset is given, continuing with an explanation of the architectures modelled to classify pedagogical content. Python is used for the whole experiment: the KNN model is implemented using the built-in functions and modules of the scikit-learn library, whereas the RNN model is implemented using the Keras library, which runs on top of TensorFlow. In the following subsections, the dataset used in this experiment is described in detail, followed by both models, KNN and LSTM.
3.1 Dataset

The process of collecting and reviewing data is not an easy task, and in most cases requires a lot of research to find relevant data for achieving the desired results. The dataset [20] used in this paper for the experiments was also used in [19], where it was modelled from scratch. This dataset consists of a total of 12,032 videos collected from the Coursera platform from more than 200 different courses. Coursera categorizes courses into a hierarchical structure from the general level to the fine-grained level: the general level consists of 8 categories, the specific level of 40 categories, and the course level of a total of 200 categories. In addition to these three levels, each course entry also includes a video lesson transcript.

Figure 6 presents the top five most frequent categories, while Figure 7 presents the top five least frequent categories by the number of transcripts that these categories contain. In order for the data to be in the correct format for further analysis and modelling, it needs to go through a pre-processing phase in which it is prepared, cleaned, and transformed into the desired shape. The data preparation and preprocessing steps depend on the given dataset; in our case, the first step after the review is to remove the noisy data (such as '[MUSIC]', which occurs very frequently in all transcript records). Following the steps depicted in Figure 3, the entire textual content of each transcript is converted into lowercase and the non-letter characters are removed. Further, the stop words are removed from the transcripts, which helps reduce the derived words to their particular word stem or root word, as explained in 2.1. The dataset is finally transformed into the desired shape after finishing the lemmatization process, and it is ready to be used for both architectures that we have modelled, KNN and LSTM, described in the following subsections.
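The cleaning steps just described can be sketched as follows; the small stop-word set and the naive plural-stripping stand-in for stemming/lemmatization are simplifying assumptions, not the actual resources used in the experiment.

```python
import re

# Sketch of transcript cleaning: noisy-marker removal, lowercasing,
# non-letter removal, stop-word filtering, and a crude root-word step.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "and", "is", "this", "we"}

def clean_transcript(transcript):
    text = transcript.replace("[MUSIC]", " ")     # remove the noisy marker
    text = text.lower()                           # lowercase the content
    text = re.sub(r"[^a-z\s]", " ", text)         # drop non-letter characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # naive plural stripping as a stand-in for stemming/lemmatization
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(clean_transcript("[MUSIC] In this lesson, we cover Neural Networks."))
# ['lesson', 'cover', 'neural', 'network']
```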
3.2 K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a technique used in both classification and regression. KNN has no model other than the entire stored dataset, so there is no separate learning step. Predictions for a new data point are made by searching the entire dataset for the K most similar instances (the so-called neighbors) and summarizing the output variable of those K instances [21].
Figure 6: Top five most frequent categories for all three levels.
Figure 7: Top five least frequent categories for all three levels.

The KNN algorithm goes through a number of steps:

1. Set K to the chosen number of neighbors.
2. Calculate the distance between the query example and the available data examples.
3. Sort the calculated distances.
4. Get the labels of the top K entries.
5. Generate the prediction result for the test case.

In this experiment, while implementing the KNN model, immediately after the process of cleaning and preparing the data, a dictionary of features is built, which transforms documents into feature vectors, and the transcripts of the documents are converted to a matrix of token counts using the CountVectorizer method. Then, the count matrix is transformed to a normalized tf-idf representation using the TfidfTransformer method. After this, the exact number of neighbors is identified, which in our case resulted in 7 neighbors. To train the classifier, the dataset is divided into two subsets: 80% for training and 20% for testing, where the latter subset is used to predict the category for each input text record.
Figure 8: Implementation of the KNN classifier.

Figure 8 shows a screenshot of the implementation of our KNN classifier using Python and the scikit-learn library, as mentioned in Section 3.
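Since Figure 8 is a screenshot in the original, the described pipeline can be sketched as below. CountVectorizer, TfidfTransformer, the 7 neighbors, and the 80/20 split follow the text; the ten toy "transcripts" and their two labels are illustrative assumptions, not the Coursera data.

```python
# Hedged sketch of the described KNN text-classification pipeline.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

transcripts = [
    "neural networks and deep learning", "gradient descent optimization",
    "supervised machine learning basics", "convolutional networks for vision",
    "recurrent networks for text data", "history of renaissance art",
    "impressionist painting techniques", "sculpture in ancient greece",
    "baroque art and architecture", "modern abstract painting styles",
]
labels = ["cs"] * 5 + ["art"] * 5

knn_clf = Pipeline([
    ("vect", CountVectorizer()),                  # transcripts -> token counts
    ("tfidf", TfidfTransformer()),                # counts -> normalized tf-idf
    ("knn", KNeighborsClassifier(n_neighbors=7)),
])

X_train, X_test, y_train, y_test = train_test_split(
    transcripts, labels, test_size=0.2, random_state=42)   # 80/20 split
knn_clf.fit(X_train, y_train)
print(knn_clf.score(X_test, y_test))              # accuracy on the test split
```

On the real dataset, the 20% test subset is then used to predict the category of each transcript, evaluated with precision, recall, F-score, and accuracy.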
3.3 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are types of artificial neural networks that allow previous outputs to be used as inputs while maintaining hidden states [22]. These algorithms are mostly used in fields such as Natural Language Processing (NLP), speech recognition, robot control, machine translation, music composition, and grammar learning, among many others. Typically, a feedforward network maps one input to one output, but the inputs and outputs of recurrent neural networks can vary in length, and different types of networks are used for different examples and applications [23].

Figure 9: Implementation of the LSTM model.

Figure 9 shows the implementation of our LSTM model. In this experiment, in order to implement the RNN model, we used the LSTM architecture, which remembers values over arbitrary intervals. As part of this architecture, first a Sequential model is created as the input layer to our network, then an Embedding layer is added, which encodes the textual input data into integer values; as a result of this layer, each word is represented by a unique integer. For this layer, we have specified three required parameters with their respective values:

• Maximum number of words - which in our case is 50,000.
• Embedding dimension - 100.
• Input length - the shape of the X value, which for us is 3002.

Further, hidden and visible units between the layers in the network are dropped out with a dropout rate of 0.2; the same value is used for the recurrent dropout as well. This is followed by the LSTM layer and a Dense layer, to which we passed as the first parameter the number of units denoting the dimensionality of the output space, which in our case depends on the number of categories selected for classification, and as the second parameter the activation function, in this case the softmax function. As a final step, categorical_crossentropy is used as the loss function and Adam as the optimizer of the network. To prevent underfitting or overfitting of the network, and to select the appropriate number of training epochs, EarlyStopping is used with 'val_loss' as the monitored metric and a patience of 3 epochs.
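Since Figure 9 is a screenshot in the original, the described model can be sketched in Keras as below. The 50,000-word vocabulary, embedding dimension 100, input length 3002, dropout rates, softmax output, loss, optimizer, and EarlyStopping follow the text; the number of LSTM units (100) and the 8 output categories (the general level) are assumptions where the paper does not state them.

```python
# Hedged reconstruction of the described LSTM classifier.
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Embedding, Input, LSTM
from tensorflow.keras.models import Sequential

MAX_WORDS, EMBEDDING_DIM, INPUT_LEN, NUM_CLASSES = 50000, 100, 3002, 8

model = Sequential([
    Input(shape=(INPUT_LEN,)),                    # padded integer sequences
    Embedding(MAX_WORDS, EMBEDDING_DIM),          # word index -> dense vector
    LSTM(100, dropout=0.2, recurrent_dropout=0.2),
    Dense(NUM_CLASSES, activation="softmax"),     # one unit per category
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# Stop training once validation loss stops improving for 3 epochs:
early_stop = EarlyStopping(monitor="val_loss", patience=3)
# model.fit(X_train, y_train, validation_split=0.1, callbacks=[early_stop])
```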
4 Results

Table 1 shows the classification results of the conventional model using the K-Nearest Neighbors algorithm. As shown in Table 1, at the general level the model reaches a precision of 92.63% and an accuracy of 92.52%, whereas at the specific level the precision is 87.89% and the accuracy 87.58%. At the course level, the precision reaches 78.59%. Analyzing the results for all three levels, we notice that the accuracy decreases from the upper level (general level) down to the lower level (course level). In our case, the general level consists of 8 sub-categories, the specific level of 40 sub-categories, and the course level of 200 sub-categories. From this we can infer that the number of sub-categories per level, by which the videos are classified on the Coursera platform, differs at each level.

Table 1: Classification results with K-Nearest Neighbors.

Category         Precision (%)   Recall (%)   F1 Score (%)   Accuracy (%)
General Level    92.63           92.52        92.53          92.52
Specific Level   87.89           87.58        87.49          87.58
Course Level     78.59           76.73        76.11          76.73

Table 2 shows the classification results of the recurrent neural network, more specifically the Long Short-Term Memory (LSTM) architecture. Using the LSTM classifier, the general level reaches a precision of 88.22% and an accuracy of 87.71%, whereas the specific level reaches 72.31% precision and 69.93% accuracy. Finally, at the course level, the results show 59.49% precision and 52.91% accuracy. Analyzing the results of the LSTM architecture, the highest accuracy is achieved at the general level, followed by the specific level, while the lowest accuracy is achieved at the course level.

Table 2: Classification results with Recurrent Neural Networks.

Category         Precision (%)   Recall (%)   F1 Score (%)   Accuracy (%)
General Level    88.22           87.71        87.68          87.71
Specific Level   72.31           69.93        70.13          69.93
Course Level     59.49           52.91        53.99          52.91
5 Conclusion and Future Work

In this paper the classification results of the conducted experiment were presented and discussed for all three category levels (general, specific, and course level) using both architectures, KNN and LSTM. We can conclude that better results are achieved for levels with a smaller number of categories than for levels with a larger number of categories: in our case, as the number of categories increased, the results decreased. With this, we claim that the classification results are directly affected by the number of categories that each level contains. From the results shown in Table 1 and Table 2, KNN reached 92.52% accuracy compared to LSTM's 87.71% at the general level, 87.58% compared to 69.93% at the specific level, and finally 76.73% compared to 52.91% at the course level. The results could be affected by several factors. First, the quantity of data required for the LSTM: a large number of categories increases the complexity of the problem and thus requires more data to train the model. The results could also have been affected by the high similarity between different transcripts: many of the transcripts that belonged to different classes at the course level had many similarities in terms of sentences and keywords, so the model could not properly distinguish to which class a transcript belonged. However, the final results give us a spark for future work: to investigate recurrent neural networks further by applying hyperparameter tuning, or even to expand the number of architectures used to investigate pedagogical content classification.
References

[1] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, 2002.
[2] Cicero Dos Santos and Maira Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69–78, 2014.
[3] Zenun Kastrati, Ali Shariq Imran, and Sule Yildirim Yayilgan. A general framework for text document classification using semcon and acvsr. In International Conference on Human Interface and the Management of Information, pages 310–319. Springer, 2015.
[4] Arfinda Ilmania, Samuel Cahyawijaya, Ayu Purwarianti, et al. Aspect detection and sentiment classification using deep neural network for indonesian aspect-based sentiment analysis. In , pages 62–67. IEEE, 2018.
[5] Ali Shariq Imran, Sher Muhammad Daudpota, Zenun Kastrati, and Rakhi Batra. Cross-cultural polarity and emotion detection using sentiment analysis and deep learning on COVID-19 related tweets. IEEE Access, 8:181074–181090, 2020.
[6] Zenun Kastrati, Ali Shariq Imran, and Arianit Kurti. Weakly supervised framework for aspect-based sentiment analysis on students' reviews of MOOCs. IEEE Access, 8:106799–106810, 2020.
[7] Alya Itani, Laurent Brisson, and Serge Garlatti. Understanding learner's drop-out in MOOCs. In International Conference on Intelligent Data Engineering and Automated Learning, pages 233–244. Springer, 2018.
[8] Hannes Max Hapke, Hobson Lane, and Cole Howard. Natural language processing in action, 2019.
[9] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. Text classification algorithms: A survey. Information, 10(4):150, 2019.
[10] Irfan Khan, Xianchao Zhang, Mobashar Rehman, and Rahman Ali. A literature survey and empirical study of meta-learning for classifier selection. IEEE Access, 8:10262–10281, 2020.
[11] Jake Lever, Martin Krzywinski, and Naomi Altman. Erratum: Corrigendum: Classification evaluation. Nature Methods, 13(10):890–890, 2016.
[12] Fisnik Dalipi, Sule Yayilgan, Ali Shariq Imran, and Zenun Kastrati. Towards understanding the MOOC trend: pedagogical challenges and business opportunities. In International Conference on Learning and Collaboration Technologies, pages 281–291. Springer, 2016.
[13] Krenare Pireva, Ali Shariq Imran, and Fisnik Dalipi. User behaviour analysis on LMS and MOOC. In , pages 21–26. IEEE, 2015.
[14] Fisnik Dalipi, Ali Shariq Imran, and Zenun Kastrati. MOOC dropout prediction using machine learning techniques: Review and research challenges. In , pages 1007–1014. IEEE, 2018.
[15] Krenare Pireva and Petros Kefalas. A recommender system based on hierarchical clustering for cloud e-learning. In International Symposium on Intelligent and Distributed Computing, pages 235–245. Springer, 2017.
[16] Ali Shariq Imran and Faouzi Alaya Cheikh. Blackboard content classification for lecture videos. In , pages 2989–2992. IEEE, 2011.
[17] Ali Shariq Imran and Zenun Kastrati. Pedagogical document classification and organization using domain ontology. In International Conference on Learning and Collaboration Technologies, pages 499–509. Springer, 2016.
[18] Houssem Chatbri, Kevin McGuinness, Suzanne Little, Jiang Zhou, Keisuke Kameyama, Paul Kwan, and Noel E O'Connor. Automatic MOOC video classification using transcript features and convolutional neural networks. In Proceedings of the 2017 ACM Workshop on Multimedia-based Educational and Knowledge Technologies for Personalized and Social Online Training, pages 21–26, 2017.
[19] Zenun Kastrati, Ali Shariq Imran, and Arianit Kurti. Integrating word embeddings and document topics with deep learning in a video classification framework. Pattern Recognition Letters, 128:85–92, 2019.
[20] Zenun Kastrati, Arianit Kurti, and Ali Shariq Imran. WET: Word embedding-topic distribution vectors for MOOC video lectures dataset. Data in Brief, 28:105090, 2020.
[21] Jason Brownlee. Master Machine Learning Algorithms: discover how they work and implement them from scratch. Machine Learning Mastery, 2016.
[22] Afshine Amidi and Shervine Amidi. VIP cheatsheet: Recurrent neural networks, 2018.
[23] Larry Medsker and Lakhmi C Jain.