Multi-split Optimized Bagging Ensemble Model Selection for Multi-class Educational Data Mining
MohammadNoor Injadat, Abdallah Moubayed, Ali Bou Nassif, Abdallah Shami
Applied Intelligence manuscript No. (will be inserted by the editor)
Abstract
Predicting students' academic performance has been a research area of interest in recent years, with many institutions focusing on improving students' performance and education quality. The analysis and prediction of students' performance can be achieved using various data mining techniques. Moreover, such techniques allow instructors to determine possible factors that may affect the students' final marks. To that end, this work analyzes two different undergraduate datasets at two different universities. Furthermore, this work aims to predict the students' performance at two stages of course delivery (20% and 50%, respectively). This analysis allows for properly choosing the appropriate machine learning algorithms to use as well as optimizing the algorithms' parameters. Furthermore, this work adopts a systematic multi-split approach based on the Gini index and p-value. This is done by optimizing a suitable bagging ensemble learner that is built from any combination of six potential base machine learning algorithms. Experimental results show that the posited bagging ensemble models achieve high accuracy for the target group in both datasets.

Keywords e-Learning · Student Performance Prediction · Optimized Bagging Ensemble Learning Model Selection · Gini Index
MohammadNoor Injadat, Abdallah Moubayed, Abdallah Shami
Electrical & Computer Engineering Dept., University of Western Ontario, London, ON, Canada
E-mail: [email protected], [email protected], [email protected]

Ali Bou Nassif
Computer Engineering Dept., University of Sharjah, Sharjah, UAE, and
Electrical & Computer Engineering Dept., University of Western Ontario, London, ON, Canada
E-mail: [email protected]
1 Introduction

Data mining is rapidly becoming a part of software engineering projects, and standard methods are constantly revisited to integrate the software engineering point of view. Data mining can be defined as the extraction of data from a dataset and the discovery of useful information from it [34][28]. This is followed by the analysis of the collected data in order to enhance the decision-making process [17]. Data mining uses different algorithms and tries to uncover certain patterns from data [1]. These techniques have proved to be effective solutions in a variety of fields including education, network security, and business [29, 50, 66]. Hence, they have the potential to also be effective in other fields such as medicine and education.

Educational Data Mining (EDM), a sub-field of data mining, has emerged that specializes in educational data with the goal of better understanding students' behavior and improving their performance [12][22]. Moreover, this sub-field also aims at enhancing the learning and teaching processes [17]. EDM often takes into consideration various types of data, such as administrative data, student performance data, and student activity data, to gain insights and provide appropriate recommendations [35][48].

The rapid growth of technology and the Internet has introduced interactive opportunities to help the education field improve the teaching and learning processes. In turn, this has led to the emergence of the field of e-learning. This field can be defined as the use of computer network technology, primarily over an intranet or through the Internet, to deliver information and instruction to individuals [61][33]. There are various challenges facing e-learning platforms and environments [49]. These include the assorted styles of learning and challenges arising from cultural differences [16]. Other challenges also exist, such as pedagogical e-learning, technological and technical training, and e-learning time management [38].
To this end, personalized learning has emerged as a necessity in order to better cater to learners' needs [30]. Accordingly, this personalization process has become a challenging task [13], as it requires adapting courses to meet different individuals' needs. This calls for adaptive techniques to be implemented [14], [8]. This can be done by automatically collecting data from the e-learning environment [8] and analyzing the learner's profile to customize the course according to the participant's needs and constraints, such as his/her location, language, currency, seasons, etc. [8], [46], [44].

Many of the previous works in the literature focused on predicting the performance of students by adopting a binary classification model. However, some educators prefer to identify not only two classes of students (i.e., Good vs. Weak); instead, they divide the students into several groups and consider the associated multi-class classification problem [58]. This is usually done because the binary model often identifies a large number of weak students, many of whom are not truly at risk of failing the course. Accordingly, this work considers two datasets at two different stages of the course, namely at 20% and 50% of the coursework, and divides the students into three groups: Weak, Fair, and Good students. Accordingly, the datasets are analyzed as a set of multi-class classification problems.

Multi-class classification problems can be solved by naturally extending the binary classification techniques for some algorithms [3]. In this work, we consider various classification algorithms, compare their performances, and use Machine Learning (ML) techniques aiming to predict the students' performance in the most accurate way.
Indeed, we consider k-nearest neighbor (k-NN), random forest (RF), Support Vector Machine (SVM), Multinomial Logistic Regression (LR), Naïve Bayes (NB), and Neural Networks (NN), and use an optimized systematic ensemble model selection approach coupled with ML hyper-parameter tuning using grid search optimization.

In this paper, we produced a bagging of each type of model, and the bagging was used for the ensembles as opposed to single models. Bagging is itself an ensemble algorithm, as it consists of grouping several models of the same type and defining a linear combination of the individual predictions as the final prediction on an external test sample, as explained in Section 6. Bagging is one of the best procedures to improve the performance of classifiers, as it helps reduce the variance in many hard decision problems [10][52]. The empirical fact that bagging improves classifiers' performance is widely documented [9], and in fact ensemble methods placed first in many prestigious ML competitions, such as the Netflix Competition [54], KDD 2009 [24], and Kaggle [32]. Furthermore, a multi-split framework is considered for the studied datasets in order to reduce the bias of the ML models investigated as part of the bagging ensemble models.

The main disadvantage of bagging, and other ensemble algorithms, is the lack of interpretation. For instance, a linear combination of decision trees is much harder to interpret than a single tree. In the same way, bagging several variable selections gives little clue about which of the predictor variables are actually important. In this paper, in order to have a rough idea of which variables are the best predictors for each algorithm, we decided to average, for each variable, its importance in every model; this average is assigned to the variable and defined to be its averaged importance.
This was done in order to better highlight the features that are truly important across the multiple splits under consideration.

The remainder of this paper is organized as follows: Section 2 presents some of the previous related work and its limitations; Section 3 summarizes the research contributions of this work; Section 4 describes the datasets under consideration and defines the corresponding target variables for both datasets; Section 5 describes the performance measurement approach adopted; Section 6 presents the methodology used to choose the best classifiers for the multi-class classification problem; Section 7 discusses the architecture used for training NN and shows the features' importance for each classifier for each dataset; Section 8 presents and discusses the experimental results both in terms of Gini Indices (also called Gini coefficient) and by using confusion matrices; and finally, Section 9 lists the research limitations, proposes multiple future research opportunities, and concludes the paper.

2 Related Work

Several different factors improve the knowledge gained and the skills of learners, and make educational institutions offer a better learning experience with highly qualified students or trainees [60]. Several researchers have explored the use of data mining techniques in an educational setting. The authors of [37] used data mining techniques to analyze learners' web usage and content-based profiles to build an online automatic recommendation system. In contrast, Chang et al. proposed a k-NN classification model to classify learners' styles [11]. The results of this model were used to help the educational institution's management and faculties improve course contents to satisfy learners' needs [11]. Another related study, which used simple linear regression to check the effect of the student's mother's education level and the family's income on the learner's academic level, was presented in
[26]. On the other hand, Baradwaj and Pal used classification methods to evaluate students' performance using decision trees [6]. The study was conducted using data collected from previous years' databases to predict student results at the end of the current semester. Their study aimed to provide a prediction that would help the next term's instructors identify students who may need help.

Other researchers [7] applied the Naïve Bayes classification algorithm to predict students' grades based on their previous performance and other important factors. The authors discovered that, other than students' efforts, factors such as residency, the qualification standards of the mother, hobbies and activities, the total income of the family, and the state of the family had a significant effect on students' performance. Later, the same authors used the Iterative Dichotomiser 3 (ID3) decision tree algorithm and if-then rules to accurately predict the performance of students at the end of the semester [56] based on different variables such as Previous Semester Marks, Class Test Grades, Seminar Performance, Assignments, Attendance, Lab Work, General Proficiency, and End Semester Marks.

Similarly, Moubayed et al. [51, 53] studied student engagement levels using the K-means algorithm and derived a set of rules that related student engagement to academic performance using the Apriori association rules algorithm. The results showed a positive correlation between students' engagement level and their academic performance in an e-learning environment. Prasad et al. [57] used the J48 (C4.5) algorithm and concluded that it is the best choice for making decisions about students' performance. The algorithm was also preferred because of its accuracy and speed. Ahmed and Elaraby conducted a similar study in 2014 [2] using classification rules. They analyzed data from a course program across 6 years and were able to predict students' final grades. In similar fashion, Khan et al.
[36] used the J48 (C4.5) algorithm to predict the final grade of secondary school students based on their previous marks. Kotsiantis et al. [40] proposed an incremental majority voting-based ensemble classifier based on 3 base classifiers, namely the NB, k-NN, and Winnow algorithms. The authors' experimental results showed that the proposed ensemble model outperformed the single base models in a binary classification environment.

Saxena [62] used the k-means clustering and J48 (C4.5) algorithms and compared their performance in predicting students' grades. The author concluded that the J48 (C4.5) algorithm is more efficient, since it gave higher accuracy values than the k-means algorithm. The authors in [59] used and compared the K-means and hierarchical clustering algorithms. They concluded that the K-means algorithm is preferred over hierarchical clustering due to its better performance and faster model building time.

Wang et al. proposed an e-learning recommendation framework using a deep learning neural network model [65]. Their experiments showed that the proposed framework offered a better personalized e-learning experience. Similarly, Fok et al. proposed a deep learning model using TensorFlow to predict the performance of students using both academic and non-academic subjects [21]. Experimental results showed that the proposed model had a high accuracy in terms of student performance prediction.

Asogbon et al. proposed a multi-class SVM model to correctly predict students' performance in order to admit them into the appropriate faculty program [4]. The performance of the model was examined using an educational dataset collected at the University of Lagos, Nigeria. Experimental results showed that the proposed model adequately predicted the performance of students across all categories [4].
In a similar fashion, Athani et al. also proposed the use of a multi-class SVM model to predict the performance of high school students and classify them into one of five letter grades, A-F [5]. The goal was to predict student performance to provide a better illustration of the education level of the schools based on their students' failure rate. The authors used a Portuguese high school dataset consisting mostly of the students' socio-economic descriptors as features. Their experiments showed that the proposed multi-class SVM model achieved a high prediction accuracy close to 89% [5].

Jain and Solanki proposed a comparative study between four tree-based models to predict the performance of students based on a three-class output [31]. Similar to the work of Athani et al., the authors in this work also considered the Portuguese high school dataset consisting mostly of the students' socio-economic descriptors as features. Experimental results showed that the proposed tree-based model also achieved high prediction accuracy with a low execution time [31].

2.2 Limitations of Related Work

The limitations of the related work can be summarized as follows:

– They do not analyze the features before applying any ML model. A classification model is directly applied without studying the nature of the data being considered.
– They mostly consider the binary classification case. Such cases often lead to identifying too many students who are not truly in danger of failing the course and hence would not need as much help and attention. Even when multi-class models were considered, the features used were mostly focused on students' socio-economic status rather than their performance in different educational tasks.
– They often use a single classification model or an ensemble model built upon a randomly chosen group of base classifiers. Moreover, to the best of our knowledge, only majority voting-based ensemble models are considered.
– They often predict the performance of students from one course to the next or from one year to the next. Performance prediction is rarely considered during course delivery.
– They often use the default parameters of the utilized algorithms/techniques without optimization.

3 Research Contributions

To overcome the limitations presented in Section 2.2, our research aims to predict students' performance during the course delivery, as opposed to previous works that perform the prediction at the end of the course. The multi-class classification problem assumes that there is a proportional relationship between the students' efforts and seriousness in the course and their final course performance and grade. More specifically, our work aims to:

– Analyze the collected datasets and visualize the corresponding features by applying different graphical and quantitative techniques (e.g., dataset distribution visualization, target variable distribution, and feature importance).
– Optimize the hyper-parameters of the different ML algorithms under consideration using the grid search algorithm.
– Propose a systematic approach to build a multi-split-based (to reduce bias) bagging ensemble (to reduce variance) learner to select the most suitable model depending on multiple performance metrics, namely the Gini index (for better statistical significance and robustness) and the target class score.
– Study the performance of the proposed ensemble learning classification model on multi-class datasets.
– Evaluate the performance of the proposed bagging ensemble learner in comparison with classical classification techniques.

Note that in this work, the term
Gini index refers to the Gini coefficient that is calculated based on the Lorenz curve and area-under-the-curve terms [43]. Therefore, the remainder of this work adopts the term Gini index.

4 Dataset Description

An exploratory analysis is conducted to better visualize the considered datasets.

– Dataset 1: The experiment was conducted at the University of Genoa on a group of 115 first-year engineering students [63]. The dataset consists of data collected using a simulation environment named Deeds (Digital Electronics Education and Design Suite). This e-learning platform allows students to access the course contents using a special browser and asks the students to solve problems of different complexity levels. Table 1 shows a summary of the different tasks for which the data was collected. It is worth mentioning that 52 students out of the original 115 registered students were able to complete the course. The 20% stage consists of the grades of tasks ES 1.1 to ES 3.5, while the 50% stage consists of tasks ES 1.1 to ES 5.1. To improve the accuracy of the classification model, empty marks were replaced with a 0. Moreover, all task marks were converted to a scale out of 100. Furthermore, all decimal-point marks were rounded to the nearest 1 to maintain consistency.
– Dataset 2: This dataset was collected at the University of Western Ontario for a second-year undergraduate Science course. The dataset is composed of two main parts. The first part is an event log of the 486 enrolled students, consisting of 305,933 records. The other part, which is under consideration in this research, contains the grades of the 486 students in the different evaluated tasks, including assignments, quizzes, and exams. Table 2 summarizes the different tasks evaluated within this course. The 20% stage consists of the results of Assignment 01 and Quiz 01.
On the other hand, the 50% stage consists of the grades of Quiz 01, Assignments 01 and 02, and the midterm exam. Similar to Dataset 1, all empty marks were replaced with a value of 0 for better classification accuracy. Moreover, all marks were scaled out of 100. Additionally, decimal-point marks were rounded to the nearest 1.

Table 1: Dataset 1 - Features
Feature     | Description        | Type    | Value/s
Id          | Student Id.        | Nominal | Std. 1,..,Std. 52
ES 1.1      | Exc. 1.1 Mark      | Numeric | 0..2
ES 1.2      | Exc. 1.2 Mark      | Numeric | 0..3
ES 2.1      | Exc. 2.1 Mark      | Numeric | 0..2
ES 2.2      | Exc. 2.2 Mark      | Numeric | 0..3
ES 3.1      | Exc. 3.1 Mark      | Numeric | 0..1
ES 3.2      | Exc. 3.2 Mark      | Numeric | 0..2
ES 3.3      | Exc. 3.3 Mark      | Numeric | 0..2
ES 3.4      | Exc. 3.4 Mark      | Numeric | 0..2
ES 3.5      | Exc. 3.5 Mark      | Numeric | 0..3
ES 4.1      | Exc. 4.1 Mark      | Numeric | 0..15
ES 4.2      | Exc. 4.2 Mark      | Numeric | 0..10
ES 5.1      | Exc. 5.1 Mark      | Numeric | 0..2
ES 5.2      | Exc. 5.2 Mark      | Numeric | 0..10
ES 5.3      | Exc. 5.3 Mark      | Numeric | 0..3
ES 6.1      | Exc. 6.1 Mark      | Numeric | 0..25
ES 6.2      | Exc. 6.2 Mark      | Numeric | 0..15
Final Grade | Total Final Mark   | Numeric | 0..100
Total       | Final Course Grade | Nominal | G, F, W
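The cleaning steps described for both datasets (empty marks replaced with 0, every task rescaled out of 100, and decimal marks rounded to the nearest 1) could be sketched as below. The helper and the sample rows are illustrative, not the authors' code; the maximum marks are taken from Table 1.

```python
import pandas as pd

def clean_marks(df, max_marks):
    """Missing marks -> 0, rescale each task out of 100, round to nearest 1."""
    df = df.copy()
    for task, max_mark in max_marks.items():
        df[task] = (df[task].fillna(0) / max_mark * 100).round().astype(int)
    return df

# Illustrative rows using two Dataset 1 tasks (max marks 2 and 3, per Table 1)
raw = pd.DataFrame({"ES 1.1": [1.6, None], "ES 1.2": [2.25, 3.0]})
clean = clean_marks(raw, {"ES 1.1": 2, "ES 1.2": 3})
print(clean["ES 1.1"].tolist())  # [80, 0]
```

The same helper would be applied to the Dataset 2 columns of Table 2 with their respective maximum marks.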
Table 2: Dataset 2 - Features
Feature     | Description      | Type    | Value/s
Id          | Student Id.      | Nominal | std000,..,std485
Quiz01      | Quiz 1 Mark      | Numeric | 0..10
Assign.01   | Assign. 01 Mark  | Numeric | 0..8
Midterm     | Midterm Mark     | Numeric | 0..20
Assign.02   | Assign. 02 Mark  | Numeric | 0..12
Assign.03   | Assign. 03 Mark  | Numeric | 0..25
Final Exam  | Final Exam Mark  | Numeric | 0..35
Final Grade | Total Final Mark | Numeric | 0..100
Total       | Final Grade      | Nominal | G, F, W

The distribution of the target variable is highly imbalanced, with Dataset 2 having only 8 Weak students out of 486 students. To better visualize the three classes, we applied PCA to the datasets (both considered at the 50% stage), as shown in Figures 2 and 3. Looking at these two figures, we note that it is possible to draw a boundary separating Weak students from the rest of the students, whereas Fair and Good students are too close together to be separated by a boundary. We will see in the following sections that the performance of the models is affected by this distribution and
that most of the algorithms fail in distinguishing between Fair and Good students, especially for Dataset 1.

Fig. 1: Dataset 1 and Dataset 2 - Target Variables
Fig. 2: Dataset 1 - multi-class target visualization
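The two-dimensional projections shown in Figures 2 and 3 can be reproduced with a standard PCA; the sketch below uses synthetic marks as a stand-in for the real task-mark matrices, which are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a task-mark matrix: 52 students x 13 task marks (0..100)
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(52, 13))

# Project onto the first two principal components for visualization
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape)  # (52, 2)
```

Each row of `X2` can then be plotted as a point colored by its class (Weak, Fair, Good) to obtain figures analogous to Figures 2 and 3.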
5 Performance Measurement Approach

In general, there are two standard approaches to choosing multi-class performance measures [3], [25]. One approach, namely OVA (one-versus-all), is to reduce the problem of classifying among N classes into N binary problems. In this case, every class is discriminated from the other classes. In the second approach, called AVA (all-versus-all), each class
Fig. 3: Dataset 2 - multi-class target visualization

is compared to each other class. In other words, it is necessary to build a classifier for every pair of classes, i.e., building N(N − 1)/2 classifiers, while discarding the rest of the classes. Due to the size of our datasets, we chose to follow the first method as opposed to the second one. In fact, if we were to use the second approach for Dataset 1, we would need to train three binary models, one for each pair of classes: (G,F), (F,W), and (G,W). In particular, the subset of data for the (F,W) model would consist of only 28 students, which would be split into a Training sample (70%) and a Test sample (30%). This corresponds to training a model using 20 students and testing it using only 8 students. Due to the relatively small size of the (F,W) model, we determined that the AVA approach would not be suitable for accurate prediction.

It is well known that the Gini Index metric, as well as the other metrics (Accuracy, ROC curve, etc.), can be generalized to the multi-class classification problem. In particular, we choose the Gini Index metric instead of the Accuracy because the latter depends on the choice of a threshold whereas the Gini Index metric does not. This makes it statistically more significant and robust than the accuracy, particularly given that it provides a measure of the statistical dispersion of the classes [27]. In particular, we implemented a generalization of the Gini index metric during the training phase that computes the Gini Index of each of the three binary classifications and optimizes (i.e., maximizes) the average of the 3 performances, i.e., the performances corresponding to classes G, F, and W.

6 Methodology

For the multi-class classification problem we used several algorithms. More specifically, we explored RF, SVM-RBF, k-NN, NB, LR, and NN with 1, 2, and 3 layers (i.e.,
3 different NN models), for a total of eight classifiers per dataset. In order to achieve better performance, we did not build only one individual model for each algorithm; instead, we constructed baggings of classifiers. In fact, as explained in the previous section, bagging reduces the variance.

We built a bagging of models for each algorithm in the following way: we started by splitting each dataset into Training and Test samples in proportions 70%-30%, then we used the Training sample to build baggings of models. More precisely, the Training sample was split 200 times into sub-Training and sub-Test samples randomly, but forcing the percentages of Fair, Good, and Weak students to be the same as the ones in the entire datasets. The models resulting from the 200 splits were trained on the sub-Training samples and inferred on the corresponding sub-Test samples. If the Average Gini Index was above a certain fixed threshold (the lowest acceptable Gini Index), then the model was kept; otherwise, it was discarded. For each algorithm, we obtained in this way a set of models having the best performances, and we averaged their scores on the (external) Test sample, class by class. This procedure is illustrated in Figure 4.

Fig. 4: Bagging Ensemble Model Building Methodology

Once we had the eight baggings of models (one for each algorithm), we considered all the possible ensembles that could be constructed with them and compared their performances in terms of Gini Index, as explained in Section 5. Moreover, for each dataset, we computed the p-values corresponding to each of the 256 possible ensembles and aimed to choose as the final ensemble the one that had the best Gini Index and, at the same time, was statistically significant.

The Gini Index, also commonly referred to as the Gini coefficient, can be seen geometrically as the area between the Lorenz curve [43] and the diagonal line representing perfect equality.
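The multi-split filtering procedure illustrated in Figure 4 can be sketched as follows; the split count, Gini threshold, and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

def average_gini(y_true, scores, classes):
    # One-vs-all Gini (2*AUC - 1) per class, averaged over the classes
    return float(np.mean([2 * roc_auc_score(y_true == c, scores[:, i]) - 1
                          for i, c in enumerate(classes)]))

def build_bagging(base_model, X_tr, y_tr, n_splits=200, gini_threshold=0.5):
    """Split the Training sample n_splits times (stratified, so the class
    percentages are preserved) and keep only the models whose average Gini
    on the sub-Test sample clears the threshold."""
    classes = np.unique(y_tr)  # matches sklearn's predict_proba column order
    splits = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3, random_state=0)
    kept = []
    for sub_tr, sub_te in splits.split(X_tr, y_tr):
        model = clone(base_model).fit(X_tr[sub_tr], y_tr[sub_tr])
        g = average_gini(y_tr[sub_te], model.predict_proba(X_tr[sub_te]), classes)
        if g >= gini_threshold:
            kept.append(model)
    return kept

def bagging_scores(kept, X_te):
    # Average the kept models' class scores on the external Test sample
    return np.mean([m.predict_proba(X_te) for m in kept], axis=0)
```

Running `build_bagging` once per base algorithm yields the eight baggings whose class-by-class averaged scores are then combined into candidate ensembles.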
The higher the Gini Index, the better the performance of the model. Formally, the Gini index is defined as follows: let F(z) be the cumulative distribution of z, and let a and b be the highest and the lowest values of z, respectively; then we can calculate half of Gini's expected mean difference as:

2 ∫_b^a F(z)[1 − F(z)] dz    (1)

Alternatively, the Gini index can be calculated as 2 × (Area Under Curve) − 1.

A p-value measures the strength of the evidence against a null hypothesis made about a population. An alternative hypothesis is the one you would believe if the null hypothesis is concluded to be untrue. A small p-value (≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis. For our purposes, the null hypothesis states that the Gini Indices were obtained by chance. We generated 1 million random scores from a normal distribution and calculated the p-value. The ensemble learners selected have a p-value ≤ 0.05, indicating that there is strong evidence against the null hypothesis. Therefore, choosing an ensemble model using a combination of the Gini Index and the p-value allows us to have a more statistically significant and robust model.

The classifiers were inferred on the Test sample, giving as output three vectors of predictions to be analyzed. These three vectors express the chance that each student is classified as Weak, Fair, or Good. In order to build the confusion matrices, we fixed a threshold for each class, namely τ_F, τ_G, and τ_W. To determine each threshold, a one-vs-all method is considered for each class, with the threshold chosen as the score for which the point on the ROC curve is closest to the top-left corner (commonly referred to as the Youden Index) [20]. This is done in order to find the point that simultaneously maximizes the sensitivity and specificity. For each student belonging to the Test sample, we defined the predicted class according to the following steps:

1. The 3 scores corresponding to the 3 classes were normalized in order to make them comparable.
2. For each class, if the probability is higher than the corresponding threshold, then the target variable for the binary classification problem associated with that class is predicted to be 1; otherwise, it is 0.
3. In this way, we obtained a 3-column matrix taking values 1 and 0. Comparing the 3 predictions, if a student has only one possible outcome (i.e., only one 1 and two 0's), then the student is predicted to belong to the corresponding class. Otherwise, if there is uncertainty about the prediction because more than one 1 is predicted for the student, then the class with the highest score is chosen as the predicted one.

For instance, consider the following example:
Example 1
Suppose we have trained a classifier using 70% of Dataset 1. When we infer the model on the Test sample (the remaining 30%, consisting of 15 students), we obtain 3 vectors of scores, one for each class, and we can compute their Gini Indices; see Figure 5.
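The Averaged Gini Index computation of Figure 5 can be sketched as follows; the scores below are made-up values for illustration, not the actual ones from the figure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def averaged_gini(y_true, scores, classes):
    """One-vs-all Gini index (2*AUC - 1) per class, then the average."""
    per_class = {c: 2 * roc_auc_score(np.asarray(y_true) == c, scores[:, i]) - 1
                 for i, c in enumerate(classes)}
    return per_class, float(np.mean(list(per_class.values())))

# Illustrative scores for 6 hypothetical test students (columns: F, G, W)
y_true = ["F", "F", "G", "G", "W", "W"]
scores = np.array([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
                   [0.05, 0.05, 0.9], [0.1, 0.2, 0.7]])
per_class, avg = averaged_gini(y_true, scores, ["F", "G", "W"])
print(per_class, avg)  # perfect separation here gives Gini 1.0 per class
```

With the figure's values (97.2%, 76.8%, 98%) the same average would come out to 90.7%, as stated below.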
Fig. 5: Example - Averaged Gini Index Computation

Table 3: Example - Predicting Classes
ID | Actual Class | score_F | score_G | score_W | Max Pred. | F | G | W | Predicted Class
In this example, the Gini Indices of Classes F, G, and W are 97.2%, 76.8%, and 98%, respectively; hence the Averaged Gini Index is 90.7%. We map the three scores linearly to the interval [0, 1], obtaining score_F, score_G, and score_W. Column Actual Class corresponds to the actual target variable that we aim to predict. Treating each score as if it were the score associated with a binary classification problem, we need to set a threshold for each class such that if the score is greater than the threshold then the student belongs to that class; otherwise, he/she does not (i.e., he/she belongs to one of the other two classes). Therefore, we set three thresholds τ_F, τ_G, and τ_W for Classes F, G, and W, respectively. For instance, once the thresholds are fixed, student 1's score_F is greater than or equal to τ_F, whereas his/her probabilities of belonging to classes G and W are less than τ_G and τ_W, respectively. In conclusion, once the three thresholds are set, we can claim that student 1 is a Fair student. Student 6 has score_F ≥ τ_F and score_G ≥ τ_G, so he/she belongs either to Class F or to Class G. Since the scores are normalized and comparable, we set the predicted class to be the one corresponding to the highest score; hence we predict student 6 to be in Class G. For students 2, 7, and 14, note that the three scores are all below the thresholds, so the predicted class is the one corresponding to the greatest score, i.e., the student is predicted as Weak.

The max probability associated with each student is given in column Max Pred., and if we compare this column with column Actual Class, we note that taking the max score as the predicted class would not have been a good strategy. By setting the three thresholds τ_F, τ_G, and τ_W and considering the max score only in case of uncertainty, we obtained a predicted class for each student, shown in column Predicted Class. If we compare the actual class with the predicted class, we can build the corresponding confusion matrix:
Table 4: Example - Confusion Matrix
Actual \ Predicted | F | G | W
F                  |   |   |
G                  |   |   |
W                  |   |   |

The threshold for class W in Dataset 1 is typically higher than those for the other two classes due to a combination of two reasons. The first is that the Test sample is fairly small. The second is that the number of class W instances is also small. Given that the threshold is determined by finding the score that yields the point on the ROC curve closest to the top-left corner, the threshold has to be high in order to make sure that the class W points are identified correctly. Since the number of class W points is low, missing one of them would result in a significant drop in sensitivity and specificity; thus, the optimal threshold should be high in order to identify and classify them correctly.
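Steps 1-3 above, including the max-score fallback used for students 2, 6, 7, and 14 in the example, can be sketched as follows. The sum-based normalization and the threshold values are illustrative assumptions; the paper derives its thresholds from the ROC curves.

```python
import numpy as np

def predict_with_thresholds(scores, thresholds, classes):
    """Per-class thresholding with a max-score fallback.

    scores:     (n_students, n_classes) matrix of class scores
    thresholds: one threshold per class, e.g. chosen at the ROC point
                closest to the top-left corner (Youden index)
    """
    # Step 1: normalize each student's scores so they are comparable
    norm = scores / scores.sum(axis=1, keepdims=True)
    # Step 2: one binary prediction per class
    binary = norm >= np.asarray(thresholds)
    # Step 3: a unique positive decides the class; otherwise take the max score
    preds = []
    for flags, row in zip(binary, norm):
        hits = np.flatnonzero(flags)
        idx = hits[0] if len(hits) == 1 else int(np.argmax(row))
        preds.append(classes[idx])
    return preds

# Illustrative thresholds and scores (columns: F, G, W)
scores = np.array([[0.8, 0.1, 0.1],    # only F clears its threshold -> F
                   [0.2, 0.3, 0.5],    # only W clears its threshold -> W
                   [0.4, 0.45, 0.15]]) # nothing clears -> max score -> G
print(predict_with_thresholds(scores, [0.5, 0.5, 0.5], ["F", "G", "W"]))
# ['F', 'W', 'G']
```

Comparing the resulting predictions with the actual classes yields the confusion matrix above.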
7 Model Training and Feature Importance

We chose one algorithm from each area of ML, aiming to cover all types of classification methods, including tree-based (RF), vector-based (SVM-RBF), distance-based (k-NN), regression-based (LR), probabilistic (NB), and neural network-based (NN1, NN2, and NN3, with 5 neurons per layer). The corresponding bagging ensemble models consist of all possible combinations of the aforementioned base models. In Section 7.1, we explain how we train a NN. In the following sections, for each dataset, we show the impact of each variable on the performance of each classifier. As explained in Section 1, in order to understand which variables are the best predictors for each algorithm, we decided to average, for each variable, its importance in every model; this average is assigned to the variable and defined to be its averaged importance. In Section 8, we will show that the most important variables affect the performances of some classifiers.

7.1 Neural Network Tuning

Finding the optimal number of neurons for a NN is still an open field of research and requires substantial computational resources. The authors in [64] summarize formulas for computing the optimal number of hidden neurons N_h, such as N_h = √(N_i · N_o), where N_i is the number of input neurons (the number of variables) and N_o is the number of output neurons (3 classes). Applying these formulas to our datasets at the two different stages, we obtained a number of neurons between 2 and 6. Considering that we adopted the early stopping technique in order to prevent over-fitting and reduce variance, we decided to choose this number in the high range of the interval [2, 6] and set it equal to 5, instead of performing a full optimization (i.e., brute-force searching). The results obtained by using 1 hidden layer with 5 neurons were so promising that we decided to stress our hypothesis about early stopping and tried NNs with 2 and 3 hidden layers with 5 neurons each, obtaining similar results. The NN models we built are shown in Figures 6, 7, and 8.
Fig. 6: NN with 1 hidden layer
Fig. 7: NN with 2 hidden layers
Fig. 8: NN with 3 hidden layers
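For illustration, the three architectures of Figures 6-8 can be approximated in scikit-learn. This is a sketch under substitute assumptions: the paper trains its NNs in Matlab with Levenberg-Marquardt, which scikit-learn does not implement, so a standard gradient-based solver with early stopping stands in here.

```python
# Sketch of the three architectures: 1, 2, and 3 hidden layers with 5 neurons
# each, trained with early stopping (a fraction of the training data is held
# out and training stops when the validation score no longer improves).
from sklearn.neural_network import MLPClassifier

def make_nn(n_hidden_layers):
    return MLPClassifier(
        hidden_layer_sizes=(5,) * n_hidden_layers,  # 5 neurons per hidden layer
        early_stopping=True,                        # prevent over-fitting, as in the text
        max_iter=2000,
        random_state=0,
    )

nn1, nn2, nn3 = (make_nn(k) for k in (1, 2, 3))
```

Fixing 5 neurons per layer, as the text explains, avoids a full brute-force search over the topology.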
The initialization of the weights of the neural networks was implemented using the Nguyen-Widrow initialization method [55], whose goal is to speed up the training process by choosing the initial weights instead of generating them randomly. Simply put, this method assigns to each hidden node its own interval at the start of the training phase. By doing so, during training each hidden layer has to adjust its interval size and location less than if the initial weights were chosen randomly. Consequently, the computational cost is reduced. Levenberg-Marquardt backpropagation was used to train the models: this algorithm was introduced by Levenberg and Marquardt in [47], and is derived from Newton's method, which was designed for minimizing functions that are sums of squares of nonlinear functions [45]. This method is confirmed to be the best choice in various learning scenarios, both in terms of time spent and performance achieved [15]. Moreover, the datasets were normalized in input by mapping linearly to [−1, 1] (the activation function used in the input layer is the hyperbolic tangent) and in output to [0, 1] (the activation function in the output layer is linear), in order to avoid saturation of neurons and make the training smoother and faster.

7.2 ML Algorithms' Parameter Tuning

Hyper-parameter tuning has become an essential step in improving the performance of ML algorithms. This is due to the fact that each ML algorithm is governed by a set of parameters that dictate its predictive performance [39]. Several methods have been proposed in the literature to optimize and tune these parameters, such as the grid search algorithm, random search, evolutionary algorithms, and the Bayesian optimization method [39, 29]. This work adopts the grid search method to perform hyper-parameter tuning. Grid search is a well-known optimization method often used to tune the hyper-parameters of ML classification techniques. Simply put, it discretizes the values of the techniques' parameter sets [39]. For every possible combination of parameters, the corresponding classification models are trained and assessed. Mathematically speaking, this can be formulated as follows:

max_parm f(parm) (2)

where f is an objective function to be maximized (typically the accuracy of the model) and parm is the set of parameters to be tuned. Although this may seem computationally heavy, the grid search method benefits from the ability to perform the optimization in parallel, which results in a lower computational complexity [39]. In contrast to traditional hyper-parameter tuning algorithms that perform the optimization with the objective of maximizing the accuracy of the ML model, this work tunes the parameters of each model using the grid search optimization method to maximize the average Gini index (for more statistical significance and robustness [27]) over multiple splits [42].
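The grid search of Eq. (2) — discretize each parameter, evaluate the objective on every combination, keep the argmax — reduces to a few lines. `grid_search` and its arguments are illustrative names, not the paper's R implementation:

```python
# Exhaustive grid search over a discretized parameter space, as in Eq. (2).
from itertools import product

def grid_search(param_grid, objective):
    """param_grid: dict mapping parameter name -> list of candidate values.
    objective: callable taking a {name: value} dict and returning a score.
    Returns (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # every possible combination of parameter values
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = objective(params)          # e.g. average Gini index over splits
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, the k-NN grid k = 5, 7, ..., 43 from Table 5 can be passed as `{"k": list(range(5, 44, 2))}`; since each combination is evaluated independently, the loop parallelizes trivially, which is the advantage noted above.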
More specifically, the objective function is:

max_parm Average Gini Index = max_parm (1/N) Σ_{i=1}^{N} Gini Index_i(parm) (3)

where parm is the set of parameters to be tuned for each ML algorithm and N is the number of different splits considered. For example, in the case of the k-NN algorithm, parm = {K}, which is the number of neighbors used to determine the class of a data point. R was used to implement the eight classifiers and the corresponding ensemble learners. As mentioned above, the eight classifiers considered in this work are SVM-RBF, LR, NB, k-NN, RF, NN1, NN2, and NN3. All the classifiers were trained using all the available variables. Moreover, the parameters of the algorithms were tuned by maximizing the Gini index of each split. Furthermore, 200 different splits of the data were used to reduce the bias of the models under consideration. Table 5 summarizes the range of values for the parameters of the different ML algorithms considered in this work.

Table 5: Grid Search Parameter Tuning Range
Algorithm | Parameter Range in Dataset 1 | Parameter Range in Dataset 2
SVM-RBF | C = [0.25, 0.5, 1] & sigma = [0.05-0.25] | C = [0.25, 0.5, 1] & sigma = [0.5-3.5]
NB | usekernel = [True, False] | usekernel = [True, False]
K-NN | k = [5, 7, 9, ..., 43] | k = [5, 7, 9, ..., 43]
RF | mtry = [2, 3, ..., 12] | mtry = [2, 3, 4]

Note the following:
– For the NB algorithm, the density estimator used by the algorithm is represented by the usekernel parameter. In particular, usekernel = false means that the data distribution is assumed to be Gaussian, whereas usekernel = true means that the data distribution is assumed to be non-Gaussian.
– The LR algorithm is not included in the table because it has no parameters to optimize. The sigmoid function, which is the default function, was used by the grid search method to maximize the Gini index.
– The NN method is not included in the table because it was explained in Section 7.1.

The features are ordered according to their importance. This is done for the two datasets and for each of the algorithms used, which provides better insight into which features are important for each algorithm and each dataset. The importance of the features is determined using the CARET package available for the R language [41]. Depending on the classification model adopted, the importance is calculated in one of several ways. For example, when using the RF method, the prediction accuracy on the out-of-bag portion of the data is recorded. This is done iteratively after permuting each predictor variable. The difference between the two accuracy values is then averaged over all trees and normalized by the standard error [41]. In contrast, when the k-NN method is used, the difference between the class centroid and the overall centroid is used to measure the variable influence. Accordingly, the separation between the classes is larger whenever the difference between the class centroids is larger [41].
On the other hand, when using the NN method, the CARET package uses the same feature importance method proposed by Gevrey et al., which uses combinations of the absolute values of the weights [23]. This importance is reflected in the weights calculated for each feature in each classification model, with more important features contributing more towards the prediction. The final step consists of selecting the most suitable bagging ensemble learner for both datasets at the two course delivery stages.

7.3 Features importance: Dataset 1 - Stage 20%

– RF: The variables' importance in terms of predictivity is described in Table 6, which shows that the most relevant features are ES2.2 and ES3.3.
– SVM-RBF: The variables' importance for SVM is described in Table 6, which shows that the most relevant features are ES2.2 and ES3.3.
– NN1: For NN1, the variables' importance in terms of predictivity is described in Table 6, which shows that the most relevant features are ES2.2 and ES3.5.
– NN2: The most important variables for NN2 are ES2.2 and ES3.2, as shown in Table 6.
Table 6: Dataset 1 - Stage 20% - Features' importance for Different Base Classifiers

Ranking | RF | SVM-RBF | NN1 | NN2 | NN3 | k-NN | LR | NB

– NN3: The variables' importance in terms of predictivity is described in Table 6, which shows that the most relevant features are ES2.2 and ES3.2.
– k-NN: Table 6 shows that the most relevant features for k-NN are ES2.2 and ES3.3.
– LR: The variables' importance in terms of predictivity is described in Table 6, which shows that the most relevant features are ES1.1 and ES1.2.
– NB: Table 6 shows that the most relevant features are ES2.2 and ES3.3.

7.4 Features importance: Dataset 1 - Stage 50%
It is important to point out that, for Dataset 1 at stage 50%, features ES4.1 and ES4.2 are the most important for every classifier.
– RF: For RF, the variables' importance in terms of predictivity is described in Table 7, which shows that the most relevant features are ES4.1 and ES4.2.
– SVM-RBF: The variables' importance in terms of predictivity is described in Table 7, which shows that the most relevant features are ES4.1 and ES4.2.
– NN1: The variables' importance in terms of predictivity is described in Table 7, which shows that the most relevant features are ES4.1 and ES4.2.
– NN2: The variables' importance in terms of predictivity is described in Table 7, which shows that the most relevant features are ES4.1 and ES4.2.
– NN3: The variables' importance in terms of predictivity is described in Table 7, which shows that the most relevant features are ES4.1 and ES4.2.
– k-NN: Table 7 shows that the most relevant features for k-NN are ES4.1 and ES4.2.
– LR: Table 7 shows that the most relevant features for LR are ES4.1 and ES4.2.
– NB: The variables' importance in terms of predictivity is described in Table 7, which shows that the most relevant features are ES4.1 and ES4.2.
In general, the most important features for almost all the classifiers are ES4.1 and ES4.2. These features correspond to the
Evaluate category as per Bloom's taxonomy, which represents one of the highest levels of comprehension of the course material from the educational point of view. Therefore, it makes sense for these features to be suitable indicators and predictors of student performance.
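The permutation-based RF importance described in the previous subsection (record accuracy, permute one predictor, record the drop) can be sketched as follows. This simplified version scores on a held-out set rather than caret's per-tree out-of-bag samples, and omits the normalization by the standard error:

```python
# Simplified permutation importance: shuffle one predictor at a time and
# record the drop in accuracy relative to the unpermuted baseline.
import numpy as np

def permutation_importance(model, X_val, y_val, rng=None):
    rng = rng or np.random.default_rng(0)
    base = np.mean(model.predict(X_val) == y_val)   # baseline accuracy
    drops = []
    for j in range(X_val.shape[1]):
        Xp = X_val.copy()
        rng.shuffle(Xp[:, j])                       # break the feature/label link
        drops.append(base - np.mean(model.predict(Xp) == y_val))
    return np.array(drops)                          # larger drop -> more important
```

Ranked this way, a feature like ES4.1 whose permutation destroys most of the accuracy would land at the top of Table 7, matching the per-classifier rankings above.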
Table 7: Dataset 1 - Stage 50% - Features' importance for Different Base Classifiers

Ranking | RF | SVM-RBF | NN1 | NN2 | NN3 | k-NN | LR | NB
Table 8: Dataset 2 - Stage 20% - Features’ importance
Ranking Feature
Since Dataset 2 at stage 20% has only two variables, we can represent it graphically in order to better understand the situation and explain why all the algorithms agree that Assignment01 is the most important predictor.
Figure 9 shows that it is straightforward to identify the categories of students by setting some thresholds on the Assignment01 feature. For instance, most of the Weak students have a grade of zero in Assignment01.

Fig. 9: Dataset 2 - Stage 20% - scatter plot
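As a toy illustration of such thresholds, a rule-based classifier on Assignment01 alone might look like the following; the cut-off value is hypothetical, not one fitted in the paper:

```python
# Illustrative threshold rule suggested by Figure 9: most Weak students scored
# zero on Assignment01, so simple cut-offs on that single feature already
# separate the three classes reasonably well. The fair_cutoff value is a
# hypothetical placeholder, not a value from the paper.
def classify_by_assignment01(grade, fair_cutoff=60.0):
    if grade == 0:
        return "W"                       # Weak: zero grade on Assignment01
    return "F" if grade < fair_cutoff else "G"   # Fair below cut-off, else Good
```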
Table 9: Dataset 2 - Stage 50% - NN1, NN2, k-NN, and NB, Features’ importance
Ranking Feature
Table 10: Dataset 2 - Stage 50% - NN3, LR, RF, and SVM-RBF, Features’ importance
Ranking Feature
Based on the aforementioned results, it can be seen that assignments are better indicators of student performance. This can be attributed to several factors. The first is the fact that assignments typically allow instructors to assess the three higher levels of cognition as per Bloom's taxonomy, namely analysis, synthesis, and evaluation [18]. As such, assignments provide a better indicator of the learning level that a student has achieved and consequently can give insights about his/her potential performance in the class overall. Another factor is that students tend to have more time to complete assignments. Moreover, they are often allowed to discuss issues and problems among themselves. Thus, students not performing well in the assignments may be an indication that they have not fully comprehended the material, which can result in a lower overall final course grade.

Matlab 2018 was used to build the neural network classifiers, whereas all the other models were built using R. All possible combinations of ensembles of the eight baggings of models (256 in total) were computed for the initial Train-Test split and for 5 extra splits. For each dataset, the average of the performances, namely the averaged Gini Index, over the 6 splits was used to select the most robust ensemble learner. In addition, we computed the p-values of all the ensembles for all the splits, aiming to select the ensemble learner with the highest averaged Gini index that was also statistically significant on every split. Note that the contribution of each feature is determined by the base learner model being used in the ensemble, as per the ranking determined for each dataset at each stage. For example, if the RF learner is part of the ensemble being considered for Dataset 1 at the 50% stage, the first split is done over feature ES4.1, the second split over feature ES4.2, and so on. In the following sections we present the results obtained for the two datasets at each stage.
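The selection procedure above can be sketched as follows. This is a simplification: the ensemble's averaged Gini is approximated here by the mean of its members' per-split Gini indices, whereas the paper computes it from the ensemble's averaged scores, and the `gini` and `pval` lookups are assumed inputs:

```python
# Enumerate every non-empty combination of the eight base baggings, average
# each combination's Gini index over the splits, and keep the best one that
# is statistically significant on every split.
from itertools import combinations

BASE = ["RF", "SVM-RBF", "k-NN", "LR", "NB", "NN1", "NN2", "NN3"]

def select_ensemble(gini, pval, splits, alpha=0.03):
    """gini, pval: dicts mapping (model, split) -> value."""
    best, best_avg = None, float("-inf")
    for r in range(1, len(BASE) + 1):            # all 255 non-empty combinations
        for combo in combinations(BASE, r):
            # simplified ensemble score per split: mean of member Gini indices
            avg = sum(
                sum(gini[(m, s)] for m in combo) / len(combo) for s in splits
            ) / len(splits)
            significant = all(
                pval[(m, s)] < alpha for m in combo for s in splits
            )
            if significant and avg > best_avg:
                best, best_avg = combo, avg
    return best, best_avg
```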
Table 11: Dataset 1 - Stage 20% Ensemble (NN2) Confusion Matrix

The ensemble learner selected for Dataset 1 at Stage 20% is formed by a bagging of NN2 models. Figure 10 shows the results obtained by inferring the ensemble on the initial test sample.

Fig. 10: Dataset 1 - Stage 20% - Ensemble Learner
Classes G, F, and W have Gini indices equal to 46.4%, 38.9%, and 94.0% respectively; hence, the averaged Gini Index is 59.8%. On average, over the test sample and the 5 extra splits, the averaged Gini Index is 62.1%. The corresponding p-values are all less than 0.03. The confusion matrix for the test sample (consisting of 15 students), obtained as explained in Section 6, is shown in Table 11. Table 12 illustrates the performance of the ensemble learner in terms of precision, recall, F-measure, and false positive rate per class and on average. These quantities depend on the thresholds τ_F, τ_G, and τ_W and on the way we defined the predictions. The accuracy is 66.7%. Although this may seem low, it actually outperforms all of the base learners used to create the bagging ensemble. Note that the low accuracy may be attributed to the fact that the dataset itself is small and hence did not have enough instances to learn from.

Table 12: Dataset 1 - Stage 20% - Ensemble Performances
Precision Recall F-measure False Positive Rate (rows: F, G, W, Avg)

8.2 Results: Dataset 1 - Stage 50%

For Dataset 1 at stage 50%, none of the ensembles we constructed were statistically significant, even though their averaged Gini indices are on average higher than the ones obtained for Dataset 1 at Stage 20%. In fact, the performance for class F gets worse when we add the three variables. More precisely, when we add features ES4.1, ES4.2, and ES5.1 to Dataset 1 at stage 20%, obtaining Dataset 1 at stage 50%, they end up being the ones that have the main impact on the predictions. These variables help distinguish between W and G, and indeed the performance corresponding to these two classes improves. However, since the Fair students class is closely correlated with the Good students class, the classifier becomes less confident in predicting the Fair students. The best ensemble in terms of performance is the one obtained from a bagging of NB and k-NN. The averaged Gini Index over the 6 splits is 74.9%, and on the initial test sample the averaged Gini Index is 86.5%. Figure 11 shows the performance obtained on Split 1, with an averaged Gini Index equal to 50% and Gini indices of -22.2%, 76.8%, and 86.0% on classes F, G, and W respectively. On a different split, the ensemble formed by a bagging of NB and k-NN on Dataset 1 at stage 20% gives Gini indices of 77.8%, 53.6%, and 48.0% on classes F, G, and W respectively. This shows that the performance heavily depends on the split. In general, when we add the three new features (obtaining Dataset 1 at stage 50%), the performance improves on classes G and W, whereas it gets much worse for class F.

Fig. 11: Dataset 1 - Stage 50% - Ensemble Learner
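The per-class Gini indices quoted here, including the negative value for class F, are consistent with the usual Gini = 2·AUC − 1 relationship computed one-vs-rest; a minimal numpy sketch under that assumption (no tie correction):

```python
import numpy as np

def auc_rank(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation; assumes no tied scores."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)
    ranks = np.empty(len(order))
    ranks[order] = np.arange(1, len(order) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def gini_index(y_true, scores):
    # Gini ranges over [-1, 1]: 1 = perfect ranking, 0 = random,
    # negative = worse than random (as seen for class F on Split 1).
    return 2.0 * auc_rank(y_true, scores) - 1.0
```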
The confusion matrix obtained is shown in Table 13.
Table 14 illustrates the performance of the ensemble learner in terms of precision, recall, F-measure, and false positive rate per class and on average. These quantities depend on the thresholds τ_F, τ_G, and τ_W and on the way we defined the predictions. The accuracy is 66.7%. Again, although it may seem low, the bagging ensemble outperforms all of the base learners used to create it. This is due to the fact that the dataset itself is small and hence did not have enough instances for the ensemble to learn from. Note that we cannot compute the F-measure for class F, as its precision and recall are both zero.

Table 13: Dataset 1 - Stage 50% Ensemble (NB and k-NN) Confusion Matrix

Table 14: Dataset 1 - Stage 50% - Ensemble Performances
Precision Recall F-measure False Positive Rate (rows: F, G, W, Avg)

The low average Gini index can be attributed to two main reasons:
– This dataset is a small dataset.
– The Fair class is highly correlated with the Good students class, which causes some confusion for the models being trained.
This is further highlighted by the large false positive rate obtained for the Fair class.

8.3 Results: Dataset 2 - Stage 20%

The ensemble learner selected for Dataset 2 at Stage 20% is formed by a bagging of NB, k-NN, LR, NN2, and SVM-RBF. As an example, we show the results corresponding to the initial test sample. For each class, we normalized the scores obtained by the five baggings of models on the test sample in order to make these probabilities comparable, and then we averaged them. The performances obtained are shown in Figure 12. Classes G, F, and W have Gini indices equal to 48.1%, 38.6%, and 99.7% respectively. The associated confusion matrix is shown in Table 15.
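The normalize-then-average step can be sketched as follows; the exact normalization is not spelled out in the text, so a per-class min-max scaling of each bagging's scores is assumed here:

```python
import numpy as np

def ensemble_scores(score_list):
    """score_list: list of (n_samples, n_classes) score arrays, one per bagging.
    Min-max normalize each model's scores per class so they are comparable
    (one plausible reading of the normalization step), then average across models."""
    normed = []
    for S in score_list:
        S = np.asarray(S, dtype=float)
        lo, hi = S.min(axis=0), S.max(axis=0)
        normed.append((S - lo) / np.where(hi > lo, hi - lo, 1.0))
    return np.mean(normed, axis=0)       # averaged, comparable class scores
```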
Table 15: Dataset 2 - Stage 20% Ensemble (NB, k-NN, LR, NN2, SVM) Confusion Matrix

Furthermore, Table 16 illustrates the performance of the ensemble learner in terms of precision, recall, F-measure, and false positive rate per class and on average. These quantities depend on the thresholds τ_F, τ_G, and τ_W and on the way we defined the predictions. The accuracy is 88.2%, which is very good compared with the performance obtained for Dataset 1. As with Dataset 1, the bagging ensemble outperforms the base learners in terms of classification accuracy.

Fig. 12: Dataset 2 - Stage 20% - Ensemble Learner

Table 16: Dataset 2 - Stage 20% - Ensemble Performances
Precision Recall F-measure False Positive Rate (rows: F, G, W, Avg)
8.4 Results: Dataset 2 - Stage 50%

Table 17: Dataset 2 - Stage 50% Ensemble (LR) Confusion Matrix

Table 18 illustrates the performance of the ensemble learner in terms of precision, recall, F-measure, and false positive rate per class and on average. These quantities depend on the thresholds τ_F, τ_G, and τ_W and on the way we defined the predictions. The accuracy is 93.1%. Again, the bagging ensemble at this stage also outperforms the base learners in terms of classification accuracy.

Fig. 13: Dataset 2 - Stage 50% - Ensemble Learner

Table 18: Dataset 2 - Stage 50% - Ensemble Performances
Precision Recall F-measure False Positive Rate (rows: F, G, W, Avg)

Table 19 compares the accuracy of the different base learners with the average accuracy of the bagging ensemble across the splits. It can be seen that the bagging ensemble on average outperforms all of the base learners at the two course delivery stages for both datasets. This is despite the fact that some of the splits may have had a poor distribution, which often leads to lower classification accuracy of the ensemble. This further highlights and emphasizes the effectiveness of the proposed ensemble in accurately predicting and identifying students who may need help.

Table 19: Performance of Bagging Ensemble and Base Learners
Accuracy
Technique | Dataset 1: Stage 20% | Dataset 1: Stage 50% | Dataset 2: Stage 20% | Dataset 2: Stage 50%
RF | 46.7% | 66.6% | 82.8% | 89.0%
NN | 66.7% | 60% | 86.2% | 91.7%
K-NN | 60% | 66.6% | 86.2% | 89.0%
NB | 53.3% | 66.6% | 85.5% | 85.5%
LR | 53.3% | 53.3% | 86.9% | 90.3%
SVM | 46.7% | 33.3% | 86.2% | 90.3%
Ensemble | – | – | – | –

The behavior of the two datasets is quite different. For Dataset 1, the models' performances depend strongly on the splits. For instance, the same ensemble might perform very well on certain splits but have a very low averaged Gini Index on others, due to a negative Gini index on class F. Moreover, only 25% of the ensembles for Dataset 1 at the 20% stage had an averaged Gini Index above 50%, and of all the ensembles only one is statistically significant, namely the one corresponding to a bagging of NN2 models. Although the evidence shows that this ensemble performs decently on each split we considered in our experiments, we cannot assume that this holds on every other possible split we might have chosen instead. The problem is so dependent on the selected split that even the ensemble we chose suffers from a lack of robustness and poor performance. For Dataset 1 at stage 50%, the averaged Gini Index is in general higher than at stage 20%, because the Gini indices corresponding to class G (Good students) and class W (Weak students) improve when we add the three features ES4.1, ES4.2, and ES5.1. Since the Fair students class is highly correlated with the Good students class, the consequence is that when we add the best predictors, they predict the Fair students incorrectly. Consequently, the Gini Index for class F for each ensemble and for almost every split is negative or very low, leading to statistically insignificant results. In particular, there is no ensemble among the 256 constructed such that the p-value corresponding to class F is lower than 0.
03 on every split. Note that the ensemble chosen at the 50% stage is a bagging of NB and k-NN. Although the ensemble was not statistically significant due to class F, it was statistically significant for the target class W. For this reason, even though for completeness we show the results for Dataset 1 at both stages, it is important to point out that if we were aiming to classify the students of Dataset 1 correctly and to use the classifier in real-world applications, we should not include the last three features, i.e., we should use Dataset 1 at stage 20%. Dataset 2 was easier to deal with, and the choice of the best ensemble was straightforward. For Dataset 2 at stage 50%, 88% of the ensembles have averaged Gini indices above 90%, and 96% of the ensembles were statistically significant. For Dataset 2, the highest averaged Gini Index led us to choose:
– the ensemble of baggings of NB, k-NN, LR, and NN2 for the 20% stage.
– the ensemble consisting of a bagging of LR for the 50% stage.
Note that, in general, it is better to perform the prediction at the 50% stage rather than at the 20% stage. This is because more features are collected at the 50% stage, so the learners are able to gain more information. Although this observation was not evident for Dataset 1, that is due to the dataset being small, with only a few instances of the F class that were at the border between the G and W classes. Nevertheless, the F-measure was high at both stages for the target class W. For Dataset 2, the results showed that predicting at the 50% stage is indeed better, since the performance of the ensemble improved with the added number of features. However, the results at the 20% stage were still valuable, as they helped provide vital insights at an extremely early stage of the course delivery, as evidenced by the F-measure being close to 0.7 at that stage.
In this paper, we investigated the problem of identifying students who may need help during course delivery time in an e-Learning environment. The goal was to predict the students' performance by classifying them into one of three possible classes, namely Good, Fair, and Weak. In particular, we tackled this multi-class classification problem for two educational datasets at two different course delivery stages, namely at the 20% and 50% marks. We trained eight baggings of models for each dataset and considered all the possible ensembles that could be generated by considering the scores produced by inferring them on a test sample. We compared the performances and concluded that the ensemble learners to be selected are formed by:
– a bagging of NN2 models for Dataset 1 at stage 20%.
– a bagging of NB and k-NN models for Dataset 1 at stage 50%.
– a bagging of NB, k-NN, LR, NN2, and SVM-RBF for Dataset 2 at stage 20%.
– a bagging of LR models for Dataset 2 at stage 50%.
However, it was not possible to select a good ensemble for Dataset 1 at stage 50%, as none of the ensembles was statistically significant. The results are good for Dataset 2 both in terms of averaged Gini Index and p-values, especially considering the issues encountered, mainly the small size of Dataset 1 and the unbalanced nature of Dataset 2. In turn, these issues make the multi-class classification problem more complex. This was evident from the fact that it was impossible to find a good classifier for Dataset 1 at stage 50% and that the performance obtained for Dataset 1 at stage 20% was poor. Based on the aforementioned research limitations, below are some suggestions for our future work:
– The best way to address the dataset size issue would be to have more data available, by collecting training and testing datasets every time the course is offered.
– We also suggest performing several additional splits for Dataset 1 at Stage 20% to check the robustness of the model as well as its statistical significance.
– It might be worth optimizing the topology of the neural network with a dedicated algorithm. Even though our choice was based on recent literature, it is unlikely that we reached the optimum. One could consider trying, for instance, all the possible combinations with 1, 2, and 3 layers and 1, ..., 20 neurons in each layer. If we considered all such combinations we would have 20 + 20^2 + 20^3 = 8420 NNs to train. Of course, this would not be computationally viable and would probably result in massive over-fitting. However, there are several approaches proven to be effective in this kind of task, such as genetic optimization or pre-trained models capable of predicting the optimal topology of a network for a given problem, considering parameters such as the dimension of the dataset and the intensity of the noise [19].

Datasets' Permissions
– Dataset 1: The dataset is publicly available at: https://sites.google.com/site/learninganalyticsforall/data-sets/epm-dataset. Use of this dataset in publications was acknowledged by referencing [63].
– Dataset 2: All permissions to use this dataset were obtained through The University of Western Ontario's Research Ethics Office, which approved the use of this dataset for research purposes.

Acknowledgments
This study was funded by Ontario Graduate Scholarship (OGS) Program.
Conflict of Interest
The authors declare that they have no conflict of interest.
Informed Consent
This study does not involve any experiments on animals.
References
1. Abdul Aziz A, Ismail NH, Ahmad F (2013) Mining students’ academic performance. Journal of Theo-retical and Applied Information Technology 53(3):485–4852. Ahmed ABED, Elaraby IS (2014) Data mining: A prediction for student’s performance using classifica-tion method. World Journal of Computer Application and Technology 2(2):43–473. Aly M (2005) Survey on multiclass classification methods. Neural Network 19:1–94. Asogbon MG, Samuel OW, Omisore MO, Ojokoh BA (2016) A multi-class support vector machineapproach for students academic performance prediction. Int J of Multidisciplinary and Current research45. Athani SS, Kodli SA, Banavasi MN, Hiremath PS (2017) Student performance predictor using multi-class support vector classification algorithm. In: 2017 International Conference on Signal Processingand Communication (ICSPC), IEEE, pp 341–3466. Baradwaj BK, Pal S (2012) Mining educational data to analyze students’ performance. arXiv preprintarXiv:120134177. Bhardwaj BK, Pal S (2012) Data mining: A prediction for performance improvement using classification.arXiv preprint arXiv:120134188. Bu ff ardi K, Edwards SH (2014) Introducing codeworkout: An adaptive and social learning environ-ment. In: Proceedings of the 45th ACM Technical Symposium on Computer Science Education, ACM,SIGCSE ’14, pp 724–724, DOI 10.1145 / http://doi.acm.org/10.1145/2538862.2544317
9. B¨uhlmann P (2012) Bagging, boosting and ensemble methods. In: Handbook of Computational Statis-tics, Springer, pp 985–102210. B¨uhlmann P, Yu B, et al. (2002) Analyzing bagging. The Annals of Statistics 30(4):927–96111. Chang YC, Kao WY, Chu CP, Chiu CH (2009) A learning style classification mechanism for e-learning.Computers & Education 53(2):273–28512. Chen X, Vorvoreanu M, Madhavan K (2014) Mining social media data for understanding students’ learn-ing experiences. IEEE Transactions on Learning Technologies 7(3):246–259, DOI 10.1109 / TLT.2013.229652013. Daniel J, V´azquez Cano E, Gisbert Cervera M (2015) The future of moocs: Adaptive learning or businessmodel? International Journal of Educational Technology in Higher Education 12(1):64–73, DOI 10.7238 / rusc.v12i1.247514. Daradoumis T, Bassi R, Xhafa F, Caballe S (2013) A review on massive e-learning (mooc) design,delivery and assessment. In: 2013 Eighth International Conference on P2P, Parallel, Grid, Cloud andInternet Computing, pp 208–21315. Dhar V, Tickoo A, Koul R, Dubey B (2010) Comparative performance of some popular artificial neuralnetwork algorithms on benchmark and function approximation problems. Pramana 74(2):307–32416. Essalmi F, Ayed LJB, Jemni M, Graf S, Kinshuk (2015) Generalized metrics for the analysis of e-learningpersonalization strategies. Computers in Human Behavior 48:310 – 322, DOI https: // doi.org / / j.chb.2014.12.05017. Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) From data mining to knowledge discovery in databases.AI magazine 17(3):37–3718. Feldman L (2006) Designing homework assignments: from theory to design. age 4:119. Fiszelew A, Britos P, Ochoa A, Merlino H, Fern´andez E, Garc´ıa-Mart´ınez R (2007) Finding optimalneural network architecture using genetic algorithms. Advances in computer science and engineeringresearch in computing science 27:15–2420. 
Fluss R, Faraggi D, Reiser B (2005) Estimation of the youden index and its associated cuto ff point.Biometrical Journal: Journal of Mathematical Methods in Biosciences 47(4):458–47221. Fok WW, He Y, Yeung HA, Law K, Cheung K, Ai Y, Ho P (2018) Prediction model for students’ futuredevelopment by deep learning and tensorflow artificial intelligence engine. In: 2018 4th InternationalConference on Information Management (ICIM), IEEE, pp 103–10622. Fujita H, et al. (2019) Neural-fuzzy with representative sets for prediction of student performance. Ap-plied Intelligence 49(1):172–18723. Gevrey M, Dimopoulos I, Lek S (2003) Review and comparison of methods to study the contribution ofvariables in artificial neural network models. Ecological modelling 160(3):249–26424. Guyon I, Lemaire V, Boull´e M, Dror G, Vogel D (2010) Design and analysis of the kdd cup 2009: fastscoring on a large orange customer database. ACM SIGKDD Explorations Newsletter 11(2):68–7625. Hand DJ, Till RJ (2001) A simple generalisation of the area under the roc curve for multiple classclassification problems. Machine learning 45(2):171–1868 MohammadNoor Injadat et al.26. Hijazi ST, Naqvi S (2006) Factors a ff ecting students’performance. Bangladesh e-journal of Sociology3(1)27. Hosseinzadeh A, Izadi M, Verma A, Precup D, Buckeridge D (2013) Assessing the predictability ofhospital readmission using machine learning. In: Twenty-Fifth IAAI Conference28. Injadat M, Salo F, Nassif AB (2016) Data mining techniques in social media: A survey. Neurocomputing214:654 – 67029. Injadat M, Salo F, Nassif AB, Essex A, Shami A (2018) Bayesian optimization with machine learningalgorithms towards anomaly detection. In: 2018 IEEE Global Communications Conference (GLOBE-COM), pp 1–6, DOI 10.1109 / GLOCOM.2018.864771430. Injadat M, Moubayed A, Nassif AB, Shami A (2020) Systematic ensemble model selection ap-proach for educational data mining. 
Knowledge-Based Systems 200:105992, DOI https: // doi.org / / j.knosys.2020.105992, URL
31. Jain A, Solanki S (2019) An e ffi cient approach for multiclass student performance prediction basedupon machine learning. In: 2019 International Conference on Communication and Electronics Systems(ICCES), IEEE, pp 1457–146232. Kaggle Inc (2019) Kaggle. URL
33. Karaci A (2019) Intelligent tutoring system model based on fuzzy logic and constraint-based student model. Neural Computing and Applications 31(8):3619–3628, DOI 10.1007/s00521-017-3311-2
34. Kaur G, Singh W (2016) Prediction of student performance using WEKA tool. An International Journal of Engineering Sciences 17:8–16
35. Kehrwald B (2008) Understanding social presence in text-based online learning environments. Distance Education 29(1):89–106
38. Klamma R, Chatti MA, Duval E, Hummel H, Hvannberg ET, Kravcik M, Law E, Naeve A, Scott P (2007) Social software for life-long learning. Journal of Educational Technology & Society 10(3):72–83
39. Koch P, Wujek B, Golovidov O, Gardner S (2017) Automated hyperparameter tuning for effective machine learning. In: Proceedings of the SAS Global Forum 2017 Conference, pp 1–23
40. Kotsiantis S, Patriarcheas K, Xenos M (2010) A combinational incremental ensemble of classifiers as a technique for predicting students' performance in distance education. Knowledge-Based Systems 23(6):529–535
41. Kuhn M, et al. (2008) Building predictive models in R using the caret package. Journal of Statistical Software 28(5):1–26
42. Lerman RI, Yitzhaki S (1984) A note on the calculation and interpretation of the Gini index. Economics Letters 15(3-4):363–368
43. Lorenz MO (1905) Methods of measuring the concentration of wealth. Publications of the American Statistical Association 9(70):209–219
44. Luan J (2002) Data mining and its applications in higher education. New Directions for Institutional Research 2002(113):17–36, DOI 10.1002/ir.35
45. Lv C, Xing Y, Zhang J, Na X, Li Y, Liu T, Cao D, Wang FY (2017) Levenberg–Marquardt backpropagation training of multilayer neural networks for state estimation of a safety-critical cyber-physical system. IEEE Transactions on Industrial Informatics 14(8):3436–3446
46. Ma Y, Liu B, Wong CK, Yu PS, Lee SM (2000) Targeting the right students using data mining. In: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 457–464
47. Marquardt DW (1963) An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics 11(2):431–441
48. Márquez-Vera C, Cano A, Romero C, Ventura S (2013) Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Applied Intelligence 38(3):315–330
49. Moubayed A, Injadat M, Nassif AB, Lutfiyya H, Shami A (2018) E-learning: Challenges and research opportunities using machine learning & data analytics. IEEE Access 6:39117–39138, DOI 10.1109/ACCESS.2018.2851790
50. Moubayed A, Injadat M, Shami A, Lutfiyya H (2018) DNS typo-squatting domain detection: A data analytics & machine learning based approach. In: 2018 IEEE Global Communications Conference (GLOBECOM), IEEE, pp 1–7
51. Moubayed A, Injadat M, Shami A, Lutfiyya H (2018) Relationship between student engagement and performance in e-learning environment using association rules. In: 2018 IEEE World Engineering Education Conference (EDUNINE), pp 1–6, DOI 10.1109/EDUNINE.2018.8451005
52. Moubayed A, Aqeeli E, Shami A (2020) Ensemble-based feature selection and classification model for DNS typo-squatting detection. In: 33rd Canadian Conference on Electrical and Computer Engineering (CCECE'20), IEEE, pp 1–6
53. Moubayed A, Injadat M, Shami A, Lutfiyya H (2020) Student engagement level in e-learning environment: Clustering using K-means. American Journal of Distance Education, DOI 10.1080/
55. Nguyen D, Widrow B (1990) Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights. In: 1990 IJCNN International Joint Conference on Neural Networks, IEEE, pp 21–26
56. Pal S (2012) Mining educational data to reduce dropout rates of engineering students. International Journal of Information Engineering and Electronic Business 4(2):1
57. Prasad GNR, Babu AV (2013) Mining previous marks data to predict students performance in their final year examinations. International Journal of Engineering Research and Technology 2(2):1–4
58. Ramaswami M (2014) Validating predictive performance of classifier models for multiclass problem in educational data mining. International Journal of Computer Science Issues (IJCSI) 11(5):86
59. Rana S, Garg R (2016) Evaluation of students' performance of an institute using clustering algorithms. International Journal of Applied Engineering Research 11(5):3605–3609
60. Romero C, Ventura S (2007) Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications 33(1):135–146
61. Rosenberg MJ, Foshay R (2002) E-learning: Strategies for delivering knowledge in the digital age. Performance Improvement 41(5):50–51, DOI 10.1002/pfi.4140410512, URL https://onlinelibrary.wiley.com/doi/abs/10.1002/pfi.4140410512
62. Saxena R (2015) Educational data mining: Performance evaluation of decision tree and clustering techniques using WEKA platform. International Journal of Computer Science and Business Informatics 15(2):26–37
63. Vahdat M, Oneto L, Anguita D, Funk M, Rauterberg M (2015) A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: Design for Teaching and Learning in a Networked World, Springer International Publishing, Cham, pp 352–366
64. Vujicic T, Matijevic T, Ljucovic J, Balota A, Sevarac Z (2016) Comparative analysis of methods for determining number of hidden neurons in artificial neural network. In: Central European Conference on Information and Intelligent Systems, Faculty of Organization and Informatics Varazdin, p 219
65. Wang X, Zhang Y, Yu S, Liu X, Yuan Y, Wang F (2017) E-learning recommendation framework based on deep learning. In: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp 455–460, DOI 10.1109/