XRDS: Crossroads, The ACM Magazine for Students | Spring 2021 | Vol. 27, No. 3
Predictive analytics
Abstract
In the past few decades, predictive technology has gained traction across industry and academia alike for gathering valuable insights from data. From weather patterns to financial scores, researchers apply predictive analytics to many variables with increasing accuracy. This article briefly introduces the concepts involved in predictive analytics, along with pointers to implementing such algorithms in R and Python.

The first step in predictive analytics is data collection. Whether or not there is an existing hypothesis, the data essentially guides what can be predicted; this makes predictive analytics inductive in nature. A good data set can generate more accurate projections.

Once the data set is obtained, analysts need to clean and configure the data so that mathematical operations can be run on it. For example, textual responses need to be converted into numbers and then classified into variable types.

The third step is exploring the available data set to find relationships between variables that suit the research problem. Analysts often start with an existing hypothesis and try to collect data that would help them answer a particular research question. In predictive analysis, however, it is more common for data exploration to lead toward hypothesis development, based on patterns revealed in the existing data. An important component of data exploration is visualization: to gain meaningful insights, analysts must find an accurate way to represent data visually so that they can see patterns and form hypotheses. Both Python and R have several packages that can be used to visualize data, such as R's ggplot2 [1]. Once a hypothesis is formed, we need to develop a model that fits the relationships between the variables.
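The cleaning-and-configuring step can be sketched briefly in Python. The column names and response values below are hypothetical, invented purely for illustration; the point is converting textual responses into typed, numeric variables, here using pandas.

```python
import pandas as pd

# Hypothetical survey responses (invented data, for illustration only)
raw = pd.DataFrame({
    "satisfaction": ["high", "low", "medium", "high", "low"],
    "hours_online": ["3", "1", "2", "4", "1"],   # numbers stored as text
})

# Convert text to typed, numeric columns so mathematical
# operations can be run on them
raw["hours_online"] = pd.to_numeric(raw["hours_online"])
raw["satisfaction"] = pd.Categorical(
    raw["satisfaction"], categories=["low", "medium", "high"], ordered=True
)
# Ordered categories map to integer codes: low=0, medium=1, high=2
raw["satisfaction_code"] = raw["satisfaction"].cat.codes

print(raw.dtypes)
print(raw["satisfaction_code"].tolist())
```

After this step, every column has a well-defined variable type, which is what the later modeling stages assume.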
To develop a statistical model that can predict, we first need a portion of our existing data set to train the model. This training data is used to develop a statistical relationship between the desired variables. For example, if our variables are x and y, the model can be a fitted linear relationship y = mx + c. We then test the model on the remaining data set to check whether it correctly predicts the values of y when given the corresponding values of x. The model's accuracy (for classification models, often summarized with a "confusion matrix") is assessed to determine whether its deviance is acceptable. Several statistical and machine-learning approaches can be used to generate a model. The most common statistical method is regression, and a range of supervised and unsupervised machine-learning approaches have been developed and implemented in languages like R [2] and Python (for example, in the scikit-learn package) [3].

Once a suitable model is developed, it is deployed as part of a software system to generate real-time predictions. For example, financial institutions use several variables to generate credit scores; as the values of those variables change, new scores are generated by projecting the newly available data.

Finally, the accuracy of the model is tested post-deployment through feedback. In this way, the statistical models are validated through continuous testing to check whether they overfit or underfit the available data, and the feedback is used to improve future iterations. Thus, predictive modeling can generate meaningful information from data and serve as a beneficial supplement for research and development. However, predictive analytics goes only as far as the data used for modeling: data quality, source, and variable count all affect such models, and they are not free from biases.
There are also several ethical considerations that need to be taken into account as these algorithms are implemented in real-world settings.