Journal of Statistical Software
MMMMMM YYYY, Volume VV, Issue II. doi: 10.18637/jss.v000.i00

ProcData: An R Package for Process Data Analysis

Xueying Tang (University of Arizona)
Susu Zhang (Columbia University)
Zhi Wang (Columbia University)
Jingchen Liu (Columbia University)
Zhiliang Ying (Columbia University)
Abstract
Process data refer to data recorded in the log files of computer-based items. These data, represented as timestamped action sequences, keep track of respondents' response processes of solving the items. Process data analysis aims at enhancing educational assessment accuracy and serving other assessment purposes by utilizing the rich information contained in response processes. The R package ProcData presented in this article is designed to provide tools for processing, describing, and analyzing process data. We define an S3 class 'proc' for organizing process data and extend the generic methods summary and print for 'proc' objects. Two feature extraction methods for process data are implemented in the package for compressing the information in the irregular response processes into regular numeric vectors. ProcData also provides functions for fitting and making predictions from a neural-network-based sequence model. These functions call relevant functions in the package keras for constructing and training neural networks. In addition, several response process generators and a real dataset of response processes of the climate control item in the 2012 Programme for International Student Assessment are included in the package.
Keywords: process data analysis, multidimensional scaling, autoencoder, sequence model.
1. Introduction
With the advancement of technology, computer-based assessments have become popular in measuring complex human skills such as problem solving skills. In these assessments, participants are often asked to fulfill one or more real-life tasks in a simulated environment. As a participant interacts with a computer to complete the tasks, the entire interaction process is recorded in log files. Each recorded process includes the actions, such as mouse clicks and keystrokes, taken by the participant and the timestamps at which these actions took place. Such data describe the process of responding to an item and are thus called response process data, or, in short, process data.

The climate control item described below is an example of a computer-based item from which response process data are collected. It belongs to the 2012 survey of the Programme for International Student Assessment (PISA) for assessing students' problem solving skills. Figure 1 is a screenshot of the item interface. In the simulated environment, there is a new air conditioner with no instructions. Students are asked to figure out which climate variable, temperature or humidity, each of the three controls on the air conditioner influences. They can slide the control bars through the simulation interface and read how the temperature and humidity change. In the exploration process, how the controls are moved and the buttons are clicked is recorded in the log files. For example, if a student clicked the "APPLY" button after moving the top control to "+" and the middle control to "−−", then the action "1_-2_0", along with the time elapsed since the start of the item, say 5.4 seconds, is recorded. A recorded process with actions "1_0_0", "RESET", "0_0_-2" and timestamps 4.9, 6.3, 10.6 indicates that the student moved the top control to "+" and clicked "APPLY" 4.9 seconds after the item started.
The positions of the three controls were reset by a click of the "RESET" button 1.4 seconds later. Then the bottom control was moved to "−−" and "APPLY" was clicked again 10.6 seconds after the item started. This sequence of actions constitutes a response process of the climate control item.

Figure 1: Climate control item in PISA 2012.

An observation O = (S, T) of process data consists of two sequences, an action sequence S = (s_1, ..., s_L) and a timestamp sequence T = (t_1, ..., t_L). The action sequence S records, in order, the actions taken by the respondent to solve the item. Each element is an action in the set of possible actions A = {a_1, ..., a_N} for the item. Elements in the timestamp sequence T record the time of the actions in S from inception, and therefore 0 ≤ t_1 ≤ t_2 ≤ ... ≤ t_L. For a set of response processes O_1, ..., O_n of n respondents, the length of a process is likely to vary across observations. We write O_i = (S_i, T_i), S_i = (s_{i1}, ..., s_{iL_i}), and T_i = (t_{i1}, ..., t_{iL_i}), where L_i denotes the length of the response process of respondent i, i = 1, ..., n.

One of the goals of process data analysis is to study how students' response processes are related to their observed characteristics, such as demographics and education history, and to their latent traits, such as problem solving ability and personality. Traditionally, the relation between covariates and response variables is explored through linear or generalized linear models. The effect of latent attributes on item responses, such as final polytomous or dichotomous scores, can be modeled through latent variable models, including item response theory models (Lord and Novick 1968; Lord 1980) and cognitive diagnosis models (Rupp, Templin, and Henson 2010).
However, because of the complex structure of response processes, most of the traditional methods cannot be directly applied to process data.

In the past few years, various methods have been proposed for analyzing process data. He and von Davier (2016) and Qiao and Jiao (2018) found that n-grams of action sequences are useful for predicting item performance and clustering respondents. Chen, Li, Liu, and Ying (2019) analyzed process data through an event history analysis approach. More recently, two feature extraction methods (Tang, Wang, He, Liu, and Ying 2019a; Tang, Wang, Liu, and Ying 2019b) were developed to automatically construct informative features from process data. One is based on multidimensional scaling while the other is based on a type of neural network called the autoencoder. Besides autoencoders, other neural networks such as recurrent neural networks have also been found potentially useful for analyzing process data (Tang, Peterson, and Pardos 2016). The detailed response information contained in process data has been utilized to enhance education measurement, compare behavior patterns in successful and unsuccessful responses, and detect abnormal behaviors (Stadler, Fischer, and Greiff 2019; Ren, Luo, Ren, Bai, Li, and Liu 2019; Wang, Xu, Shang, and Kuncel 2018).

Although the methodology development has been prosperous, software development for analyzing process data has not kept pace. Packages and toolboxes for managing and analyzing continuous time series data abound in the literature, but few have been developed for categorical sequence data similar to process data. TraMineR (Gabadinho, Ritschard, Müller, and Studer 2011) is an R (R Core Team 2019) package designed for describing and analyzing discrete sequence data. However, it cannot be applied directly to process data coming from the log files of computer-based items.
Process data are recorded as timestamped action sequences whose lengths vary greatly in practice, whereas the sequences handled by TraMineR consist of states observed at time points that are the same across observations. It is very difficult to organize process data as data frames that can be used by TraMineR. Moreover, process data usually have many more possible states (actions) and are much noisier than the sequence data dealt with by TraMineR. Although neural networks have been shown to be useful for analyzing process data and more general education data, building a properly working neural network model usually requires a significant amount of effort even with the help of high-level neural network application programming interfaces (APIs) such
as the Python library Keras and its R interface keras (Allaire and Chollet 2019). The lack of software for practitioners' usage hinders the wide application of the state-of-the-art methods and the discovery of scientific results.

We try to fill this gap by developing an R package, ProcData. This package provides easy-to-use tools for processing, describing, and analyzing process data. It includes functions implementing the state-of-the-art methods for process data analysis. It also provides functions that wrap a series of keras functions to facilitate the implementation of neural networks used for analyzing process data.

Topic                       Objects
Process data                proc, print.proc, summary.proc, cc_data, seq_gen, seq_gen2, seq_gen3†
Process input and output    read.seqs, write.seqs
Process manipulation        remove_repeat, remove_action, replace_action, combine_actions
Feature extraction          seq2feature_mds, chooseK_mds, seq2feature_seq2seq†, chooseK_seq2seq†
Sequence model              seqm†, predict.seqm†

Table 1: Summary of ProcData features. The functions marked by † depend on the keras package.
ProcData is available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=ProcData and under development on GitHub at https://github.com/xytangtang/ProcData.

The remaining sections of this article are organized as follows. We introduce the features of the ProcData package in Section 2. A case study of the climate control item in PISA 2012 is presented in Section 3 to demonstrate the usage of the package. A summary is given in Section 4.
2. The ProcData Package
ProcData provides tools for analyzing process data. It defines an S3 class 'proc' for organizing process data. A summary of the features of ProcData is given in Table 1. The details of these features are described in the sequel.

Some features of ProcData require the construction and training of neural networks. ProcData relies on the R package keras for this functionality. The keras package is an R interface to Keras, which is a high-level neural network API developed for fast experimentation with neural networks in Python. The functions in ProcData that are built on functions in keras are marked by † in Table 1. The installation guide of the keras package can be found at https://keras.rstudio.com/. If keras is not installed properly, calling these marked functions will lead to an error, while other functions in ProcData can still be used normally.
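Because only the dagger-marked functions require keras, a session can check for the installation before calling them. A small base-R sketch (the messages are illustrative, not package output):

```r
# Check whether the keras namespace is available before calling the
# keras-dependent functions (seqm(), seq2feature_seq2seq(), etc.).
# requireNamespace() is base R; the messages below are illustrative.
if (requireNamespace("keras", quietly = TRUE)) {
  message("keras is available; all ProcData functions can be used.")
} else {
  message("keras is not installed; the dagger-marked functions will error.")
}
```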
We include in this package the climate control data from PISA 2012. The dataset contains the response processes and the dichotomous response outcomes of 16,763 students. The data can be loaded by
R> library("ProcData")
R> data("cc_data")
The data object cc_data is a list of two elements, seqs and responses. The response outcomes are contained in responses as a numeric vector. The response processes are stored in seqs as an object of class 'proc', which will be specified in Section 2.2.

Three functions, seq_gen(), seq_gen2(), and seq_gen3(), are included in this package to simulate process data. The values of these functions (both action and timestamp sequences) are objects of class 'proc'. All three functions generate the inter-arrival times (time elapsed between two consecutive actions) independently from a provided distribution if include_time = TRUE. Each of them has a different underlying model for generating action sequences.

The values of seq_gen() resemble those from an item similar to the climate control item. In particular, participants are asked to answer a question by running simulated experiments in which two conditions can be controlled. A simulated experiment can be run by setting the two conditions to one of the given choices and clicking the "RUN" button. An example of an action sequence generated by seq_gen() is "Start, OPT1_3, OPT2_2, RUN, OPT1_1, OPT2_2, RUN, OPT1_1, OPT2_1, RUN, CHECK_D, End".

Function seq_gen2() generates action sequences according to a Markov model with a provided transition probability matrix. Given an N × N transition probability matrix P = (p_ij) among N actions, an action sequence is generated by setting S_1 = a_1; the remaining actions are generated according to P(S_{t+1} = a_j | S_t = a_i) = p_ij for t ≥
1. The action sequence terminates once the predetermined terminating action (one of the N actions) is generated.

Function seq_gen3() generates action sequences according to a recurrent neural network (RNN) (Goodfellow, Bengio, and Courville 2016, Chapter 10). More specifically, an RNN with parameter η is used to define the distribution of the action to be taken, S_t, given all previous actions S_1, ..., S_{t−1}. An action sequence is generated recursively in seq_gen3() according to this distribution until the predetermined terminating action appears. The parameter vector η in the RNN can either be provided by the user or be randomly generated in seq_gen3().

The S3 class 'proc'

ProcData defines an S3 class 'proc' for organizing process data. An object of class 'proc' is a list of two elements, action_seqs and time_seqs. In a 'proc' object storing response processes O_1, ..., O_n, action_seqs is a list with elements S_1, ..., S_n and time_seqs is a list with elements T_1, ..., T_n. The names of the elements in action_seqs and time_seqs are the identities of the respondents. If the timestamp sequences are not available, time_seqs is set to NULL. The seqs element in cc_data and the objects returned by the three process generators are 'proc' objects.
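The layout just described can be mimicked with plain base-R structures. The toy sketch below only illustrates the shape of a 'proc' object (made-up IDs, actions, and times; real objects come from the package, whose constructor and validation are not shown here):

```r
# A toy object with the layout of class 'proc': parallel lists of action
# sequences and timestamp sequences, named by respondent ID (made-up data).
toy_proc <- list(
  action_seqs = list(id1 = c("start", "run", "end"),
                     id2 = c("start", "end")),
  time_seqs   = list(id1 = c(0.0, 2.3, 5.1),
                     id2 = c(0.0, 1.2))
)
class(toy_proc) <- "proc"   # tag the class for illustration only
names(toy_proc$action_seqs)
```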
R> class(cc_data$seqs)

[1] "proc"
R> seqs <- seq_gen(100)
R> class(seqs)
[1] "proc"

An object of class 'proc' is printed by the print method for class 'proc', print.proc(). Note that the 'proc' object seqs below does not contain timestamp sequences.

R> seqs

'proc' object of 100 processes
First 5 processes:
1
      Step 1 Step 2
Event Start  End
2
      Step 1 Step 2
Event Start  End
3
      Step 1 Step 2 Step 3 Step 4 Step 5
Event Start  OPT1_2 OPT2_2 RUN    End
4
      Step 1 Step 2  Step 3
Event Start  CHECK_B End
5
      Step 1 Step 2  Step 3
Event Start  CHECK_C End
R> print(cc_data$seqs, index=3)

'proc' object of 16763 processes
ARE000000300079
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  1_1_1  reset  0_0_1  reset  0_1_0  reset  1_0_0  end
Time  0.0    113.2  119.1  122.0  135.4  138.5  147.8  149.8  157.0
We also extend the summary method for class 'proc'. Function summary.proc() returns a list containing the following components:

• n_seq: the number of response processes;
• n_action: the number of distinct actions;
• actions: A, the set of all possible actions;
• seq_length: a numeric vector of length n_seq containing the sequence length of each process;
• action_freq: a numeric vector of length n_action containing the action frequencies;
• action_seqfreq: a numeric vector of length n_action recording the number of sequences in which each action appears;
• trans_count: an n_action × n_action matrix whose ij-th element is the count of the bigram "actions[i], actions[j]";
• total_time: a summary of the response time of the n_seq response processes;
• mean_react_time: a summary of the mean reaction time (response time divided by sequence length) of the n_seq response processes.

Process data are usually stored as comma-separated values (CSV) files.
ProcData provides functions read.seqs() and write.seqs() to read process data from a CSV file as a 'proc' object and to write a 'proc' object to a CSV file. The functions accommodate CSV files of two styles, "single" and "multiple". In both styles, all the action sequences form one column (the action column) and all the timestamp sequences form another column (the time column). In the "single" style, the process of one respondent takes up one row in the CSV file: the entire action sequence of the respondent is stored as one entry of the action column, and the timestamp sequence of the respondent is stored in the corresponding entry of the time column. Figure 2 presents a CSV file storing five response processes in the "single" style. In the "multiple" style, each action in a process and its timestamp occupy one row in the CSV file, so a process of length L takes up L consecutive rows. A CSV file that stores two response processes in the "multiple" style is displayed in Figure 3; the two processes are the same as the first two shown in Figure 2. In both read.seqs() and write.seqs(), the style can be specified by setting the argument style.

R> write.seqs(cc_data$seqs, file = "seqs_format_multiple.csv",
+   style = "multiple", id_var = "ID", action_var = "Action",
+   time_var = "Time")
R> write.seqs(cc_data$seqs, file = "seqs_format_single.csv",
+   style = "single", id_var = "ID", action_var = "Action",
+   time_var = "Time", step_sep = ",")
R> seqs_multiple <- read.seqs("seqs_format_multiple.csv",
+   style = "multiple", id_var = "ID",
+   action_var = "Action", time_var = "Time")
R> seqs_single <- read.seqs("seqs_format_single.csv", style = "single",
+   id_var = "ID", action_var = "Action",
+   time_var = "Time", step_sep = ",")
R> all.equal(seqs_multiple, seqs_single)
[1] TRUE

Figure 2: Screenshot of a CSV file storing response processes in the "single" style.

Figure 3: Screenshot of a CSV file storing response processes in the "multiple" style.
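The relation between the two styles can be illustrated with a small base-R sketch that collapses a "multiple"-style table into a "single"-style one (toy data; for real files, read.seqs() and write.seqs() handle the conversion):

```r
# Collapse a "multiple"-style table (one action per row) into a
# "single"-style table (one process per row, steps comma-separated).
multiple <- data.frame(
  ID     = c("A", "A", "A", "B", "B"),
  Action = c("start", "run", "end", "start", "end"),
  Time   = c(0.0, 2.1, 5.3, 0.0, 1.8)
)
single <- data.frame(
  ID     = sort(unique(multiple$ID)),
  Action = tapply(multiple$Action, multiple$ID, paste, collapse = ","),
  Time   = tapply(multiple$Time,   multiple$ID, paste, collapse = ",")
)
single$Action
```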
ProcData provides functions for response process manipulation. These functions can be used for process data cleaning. Function sub_seqs() subsets a set of response processes.
R> seqs <- sub_seqs(cc_data$seqs, 4*1:10)
R> print(seqs, 1)

'proc' object of 10 processes
First 1 processes:
ARE000000400093
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  1_0_0  0_0_0  reset  0_0_0  0_0_1  0_0_2  reset  0_1_0
Time  0.0    36.9   41.2   43.3   44.5   50.5   61.9   66.1   68.7
      Step 10 Step 11 Step 12 Step 13 Step 14 Step 15 Step 16 Step 17
Event 0_2_0   -1_2_0  reset   2_0_0   reset   2_0_0   2_0_0   2_0_0
Time  72.3    92.1    94.5    97.2    110.3   112.9   116.5   117.8
      Step 18 Step 19 Step 20 Step 21 Step 22 Step 23 Step 24 Step 25
Event reset   0_2_0   0_2_0   0_2_0   reset   0_0_2   0_0_2   0_0_2
Time  121.1   123.6   124.8   125.6   127.5   129.4   130.0   130.7
      Step 26 Step 27 Step 28
Event reset   1_1_1   end
Time  131.7   147.8   171.6

Function remove_repeat() removes consecutive repeated actions and their timestamps.
R> seqs1 <- remove_repeat(seqs)
R> print(seqs1, 1)

'proc' object of 10 processes
First 1 processes:
ARE000000400093
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  1_0_0  0_0_0  reset  0_0_0  0_0_1  0_0_2  reset  0_1_0
Time  0.0    36.9   41.2   43.3   44.5   50.5   61.9   66.1   68.7
      Step 10 Step 11 Step 12 Step 13 Step 14 Step 15 Step 16 Step 17
Event 0_2_0   -1_2_0  reset   2_0_0   reset   2_0_0   reset   0_2_0
Time  72.3    92.1    94.5    97.2    110.3   112.9   121.1   123.6
      Step 18 Step 19 Step 20 Step 21 Step 22
Event reset   0_0_2   reset   1_1_1   end
Time  127.5   129.4   131.7   147.8   171.6
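The consecutive-repeat removal above amounts to keeping only the positions where the action changes. A base-R sketch of the idea (illustrative data; not the package implementation):

```r
# Keep an action only if it differs from the previous one; the timestamp
# of the first occurrence in each run of repeats is retained.
actions <- c("start", "2_0_0", "2_0_0", "2_0_0", "reset", "end")
times   <- c(0.0, 112.9, 116.5, 117.8, 121.1, 123.6)
keep <- c(TRUE, actions[-1] != actions[-length(actions)])
actions[keep]
times[keep]
```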
Function remove_action() removes a particular set of actions and their timestamps.
R> seqs2 <- remove_action(seqs1, "0_0_0")
R> print(seqs2, 1)

'proc' object of 10 processes
First 1 processes:
ARE000000400093
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  1_0_0  reset  0_0_1  0_0_2  reset  0_1_0  0_2_0  -1_2_0
Time  0.0    36.9   43.3   50.5   61.9   66.1   68.7   72.3   92.1
      Step 10 Step 11 Step 12 Step 13 Step 14 Step 15 Step 16 Step 17
Event reset   2_0_0   reset   2_0_0   reset   0_2_0   reset   0_0_2
Time  94.5    97.2    110.3   112.9   121.1   123.6   127.5   129.4
      Step 18 Step 19 Step 20
Event reset   1_1_1   end
Time  131.7   147.8   171.6
Function replace_action() renames an action.
R> seqs3 <- replace_action(seqs2, "reset", "RESET")
R> print(seqs3, 1)

'proc' object of 10 processes
First 1 processes:
ARE000000400093
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  1_0_0  RESET  0_0_1  0_0_2  RESET  0_1_0  0_2_0  -1_2_0
Time  0.0    36.9   43.3   50.5   61.9   66.1   68.7   72.3   92.1
      Step 10 Step 11 Step 12 Step 13 Step 14 Step 15 Step 16 Step 17
Event RESET   2_0_0   RESET   2_0_0   RESET   0_2_0   RESET   0_0_2
Time  94.5    97.2    110.3   112.9   121.1   123.6   127.5   129.4
      Step 18 Step 19 Step 20
Event RESET   1_1_1   end
Time  131.7   147.8   171.6
Function combine_actions() combines a given pattern of consecutive actions into a single action.
R> seqs4 <- combine_actions(seqs2, c("0_0_1", "0_0_2"), "BOTTOM_MOVE_ONE")
R> print(seqs4, 1)

'proc' object of 10 processes
First 1 processes:
ARE000000400093
      Step 1 Step 2 Step 3 Step 4          Step 5 Step 6 Step 7
Event start  1_0_0  reset  BOTTOM_MOVE_ONE reset  0_1_0  0_2_0
Time  0.0    36.9   43.3   50.5            66.1   68.7   72.3
      Step 8 Step 9 Step 10 Step 11 Step 12 Step 13 Step 14 Step 15
Event -1_2_0 reset  2_0_0   reset   2_0_0   reset   0_2_0   reset
Time  92.1   94.5   97.2    110.3   112.9   121.1   123.6   127.5
      Step 16 Step 17 Step 18 Step 19
Event 0_0_2   reset   1_1_1   end
Time  129.4   131.7   147.8   171.6
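The pattern replacement performed by combine_actions() can be sketched in base R as a simple left-to-right scan (a simplified illustration, not the package implementation):

```r
# Replace each occurrence of a given pattern of consecutive actions
# with a single new action, scanning the sequence left to right.
combine_pattern <- function(seq, pattern, new_action) {
  out <- character(0)
  i <- 1
  while (i <= length(seq)) {
    if (i + length(pattern) - 1 <= length(seq) &&
        all(seq[i:(i + length(pattern) - 1)] == pattern)) {
      out <- c(out, new_action)     # pattern matched: emit the new action
      i <- i + length(pattern)
    } else {
      out <- c(out, seq[i])         # no match: keep the current action
      i <- i + 1
    }
  }
  out
}

combine_pattern(c("start", "0_0_1", "0_0_2", "end"),
                c("0_0_1", "0_0_2"), "BOTTOM_MOVE_ONE")
```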
A major technical difficulty in process data analysis is that response processes cannot easily be organized as a matrix. The nonstandard format prevents the direct application of many traditional statistical methods to process data. Feature extraction methods have been shown to be useful for exploratory process data analysis. These methods circumvent the nonstandard-format difficulty by compressing response processes into fixed-dimension vectors that can easily be incorporated in traditional statistical models such as regression models. Two feature extraction methods, one based on multidimensional scaling (Tang et al. 2019a) and the other based on sequence-to-sequence autoencoders (Tang et al. 2019b), are implemented in ProcData.

Feature Extraction via Multidimensional Scaling
Multidimensional scaling (MDS) (Borg and Groenen 2005) is a technique often used for dimension reduction and data visualization. Its goal is to embed objects in a space in such a way that the dissimilarity between objects is approximated by the distance between their embedded features. Similar objects are located close together while dissimilar objects are far apart. Given a dissimilarity measure that comprehensively characterizes the difference between objects, the object coordinates obtained from MDS can be viewed as features describing the latent attributes of the objects. Given a dissimilarity measure d and the latent feature dimension K, the features of a set of response processes O_1, ..., O_n are a solution to the optimization problem

    min_{θ_1, ..., θ_n ∈ R^K} Σ_{1 ≤ i < j ≤ n} (d_ij − ‖θ_i − θ_j‖)^2,    (1)

where d_ij = d(O_i, O_j). The basic usage of seq2feature_mds() is demonstrated below.
R> seqs <- seq_gen(100)
R> mds_res <- seq2feature_mds(seqs = seqs, K = 10)
R> str(mds_res)

List of 1
 $ theta: num [1:100, 1:10] -0.195 -0.195 0.272 -0.162 -0.167 ...
 ..- attr(*, "dimnames")=List of 2
 .. ..$ : NULL
 .. ..$ : NULL
Classical MDS (implemented as cmdscale() in R) approximates the solution to (1) by performing an eigenvalue decomposition of the matrix −(1/2) J D^(2) J, where J = I − (1/n) 1 1^T and D^(2) = (d_ij^2). The computational complexity of the algorithm is O(n^3), which is very expensive if the number of processes n is large. Moreover, the n × n dissimilarity matrix D consumes a large amount of memory. To accommodate a large n, the algorithm proposed in Paradis (2018) is implemented in seq2feature_mds(). This algorithm first chooses a small subset Ω of the objects and obtains θ̂_i, i ∈ Ω, by performing classical MDS on this subset. Then it minimizes

    F(θ_i) = Σ_{j ∈ Ω} (d_ij − ‖θ_i − θ̂_j‖)^2    (2)

by the BFGS method (Broyden 1970; Fletcher 1970; Goldfarb 1970; Shanno 1970) for each i ∉ Ω. In this way, only the dissimilarities for O(mn) pairs of objects are calculated, where m is the subset size, and the eigenvalue decomposition of a large matrix is avoided.

In seq2feature_mds(), argument method specifies the algorithm for the feature extraction. If method = "small", cmdscale() is called to perform classical MDS. If method = "large", the algorithm for large datasets is used. By default (method = "auto"), seq2feature_mds() selects the method according to the sample size.

Two choices of dissimilarity measures are implemented in seq2feature_mds(). One is the optimal symbol similarity (OSS) measure (dist_type = "oss_action") proposed in Gómez-Alonso and Valls (2008). It takes into account the difference between two action sequences S_i and S_j. The other choice (dist_type = "oss_both") is a time-weighted version of the OSS measure. It takes the differences in both the action sequences and the timestamp sequences into consideration. If a dissimilarity measure other than these two choices is desired, a dissimilarity matrix can be pre-computed and passed to seq2feature_mds() via seqs.

Following the recommendation in Tang et al. (2019a), seq2feature_mds() performs principal component analysis on the extracted features to enhance the interpretability of the features. This step can be turned off by setting pca = FALSE. By default, seq2feature_mds() returns a list containing the latent features (theta) and the value of the optimized objective function (loss). The dissimilarity matrix (dist_mat) can be returned by setting return_dist = TRUE.

ProcData provides a function chooseK_mds() to select the latent feature dimension (K) by k-fold cross-validation. The basic usage is demonstrated in the example below.

R> seqs <- seq_gen(100)
R> cv_res <- chooseK_mds(seqs = seqs, K_cand = 5:10, n_fold = 5,
+   return_dist = TRUE)
R> str(cv_res)
List of 4
 $ K       : int 8
 $ K_cand  : int [1:6] 5 6 7 8 9 10
 $ cv_loss : num [1:6, 1] 0.00399 0.00296 0.00265 0.00239 0.00242 ...
 $ dist_mat: num [1:100, 1:100] 0 0 0.514 0.267 0.267 ...

R> mds_res <- seq2feature_mds(seqs = cv_res$dist_mat, K = cv_res$K)
R> str(mds_res)

List of 1
 $ theta: num [1:100, 1:8] -0.195 -0.195 0.272 -0.162 -0.167 ...
 ..- attr(*, "dimnames")=List of 2
 .. ..$ : NULL
 .. ..$ : NULL
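The two-step strategy behind method = "large" can be sketched with base R alone: classical MDS on an anchor subset via cmdscale(), followed by BFGS placement of each remaining object by minimizing an objective of the form (2) with optim(). The toy data below are illustrative; this is not the package implementation:

```r
# Classical MDS on an anchor subset, then place the remaining points by
# minimizing the squared misfit of their distances to the anchors (BFGS).
set.seed(1)
x <- matrix(rnorm(60 * 2), 60, 2)     # toy objects with known 2-D structure
D <- as.matrix(dist(x))               # full dissimilarity matrix (toy only)
anchor <- 1:20
theta_hat <- cmdscale(D[anchor, anchor], k = 2)   # features for the subset

place_point <- function(i) {
  f <- function(theta)                # misfit of distances to the anchors
    sum((D[i, anchor] - sqrt(colSums((t(theta_hat) - theta)^2)))^2)
  optim(c(0, 0), f, method = "BFGS")$par
}
theta_rest <- t(vapply(setdiff(1:60, anchor), place_point, numeric(2)))
```

Only the 40 × 20 block of dissimilarities between new points and anchors is needed, which mirrors the O(mn) cost noted above.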
Sequence-to-Sequence Autoencoder Feature Extraction
An autoencoder (Goodfellow et al. 2016) is a neural network that compresses an input into a low-dimensional feature vector through an encoder and then reconstructs the input from the feature vector through a decoder (Figure 4). In ProcData, seq2feature_seq2seq() extracts K (K =) features from a given set of response processes (seqs =) by fitting a sequence autoencoder (SeqAE).

Figure 4: The structure of autoencoders.

The SeqAEs constructed in seq2feature_seq2seq() can be classified into action SeqAE, time SeqAE, and action-time SeqAE depending on the input process. As the names suggest, an action SeqAE only deals with action sequences, a time SeqAE only deals with timestamp sequences, and an action-time SeqAE deals with both the action and the timestamp sequences in response processes. The desired type of SeqAE can be specified through argument ae_type in the function.

The structure of an action SeqAE is depicted in Figure 5. The encoder of an action SeqAE consists of three steps. In the first step, each unique action in A is associated with a numeric vector (embedding) in R^K so that an input action sequence is transformed into a sequence of K-dimensional embeddings. In the second step, an RNN is used to sequentially summarize the information in the embedding sequence up to a time step. In the last step, a feature vector is constructed from the output vectors of the encoder RNN. The decoder of an action SeqAE also consists of three steps. In the first step, the feature vector is repeated to form a sequence of vectors, which are then passed into another RNN in the second step to obtain another sequence of vectors. Each vector in this latter sequence contains the information of the action at the corresponding time step. In the last step, a fully connected layer with a softmax activation (multinomial logistic model) is used to construct a probability distribution ŝ_l = (ŝ_{l1}, ŝ_{l2}, ..., ŝ_{lN}) on A at each time step l from the corresponding vector obtained from the decoder RNN.

The structure of a time SeqAE (Figure 6) is similar to that of an action SeqAE. RNNs are used in both the encoder and the decoder to summarize the sequential information, and the feature vector is obtained from the output vectors of the encoder RNN. However, since timestamps are numeric, embedding is not needed in the encoder.
Also, in the last step of the decoder, the timestamps are reconstructed through a fully connected layer with a ReLU activation instead of a softmax activation.

The structure of an action-time SeqAE is a combination of an action SeqAE and a time SeqAE, as shown in Figure 7. Given a response process of length L, the encoder of an action-time SeqAE first transforms the action sequence into a K-dimensional embedding sequence. The embedding sequence is then combined with the timestamp sequence to form a (K+1)-dimensional sequence that serves as the input of the encoder RNN. The encoder RNN produces a sequence of L output vectors, each of which is K-dimensional. The feature vector is then computed from the output of the encoder RNN. The first two steps of the decoder of an action-time SeqAE are the same as those of an action SeqAE. Once the output vectors from the decoder RNN are obtained, they are passed into a fully connected layer with a softmax activation to construct a probability distribution on A and into another fully connected layer with a ReLU activation to reproduce the timestamps.

Two methods can be used to construct the feature vector from the output vectors of the encoder RNN for all three types of SeqAEs. In the first method (method = "last"), the feature vector is the last output vector of the encoder RNN. In the other method (method = "avg"), the feature vector is the average of the L output vectors.

By default, the original timestamp sequences are used in a time SeqAE and an action-time SeqAE. The inter-arrival time sequences can be used in place of the timestamp sequences by setting cumulative = FALSE. The natural logarithm of the timestamps or the inter-arrival times is used if log = TRUE.
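The two feature-construction methods amount to simple operations on the L × K matrix of encoder RNN outputs; a toy illustration (the matrix below merely stands in for RNN outputs):

```r
# method = "last" takes the final output vector; method = "avg" averages
# the L output vectors (toy matrix standing in for RNN outputs).
outputs <- matrix(1:12, nrow = 4, ncol = 3)  # L = 4 time steps, K = 3
feature_last <- outputs[nrow(outputs), ]
feature_avg  <- colMeans(outputs)
```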
There are two choices of recurrent units in the encoder and decoder RNNs: the long short-term memory (LSTM) unit (Hochreiter and Schmidhuber 1997) (rnn_type = "lstm") and the gated recurrent unit (GRU) (Cho, Van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk, and Bengio 2014) (rnn_type = "gru").

Figure 5: The structure of action autoencoders.

Figure 6: The structure of time autoencoders.

Figure 7: The structure of action-time autoencoders.

R> cv_res <- chooseK_seq2seq(seqs = seqs, ae_type = "action",
+   K_cand = c(5, 10), n_epoch = 10,
+   verbose = FALSE)
R> str(cv_res)
List of 3
 $ K      : num 10
 $ K_cand : num [1:2] 5 10
 $ cv_loss: num [1:2, 1] 12 11.4
R> samples_train <- sample(1:100, 80)
R> samples_valid <- setdiff(1:100, samples_train)
R> seq2seq_res <- seq2feature_seq2seq(seqs = seqs, ae_type = "action",
+   K = cv_res$K,
+   samples_train = samples_train,
+   samples_valid = samples_valid,
+   verbose = FALSE)
R> str(seq2seq_res)
List of 3
 $ train_loss: num [1:50] 2.48 2.47 2.46 2.44 2.42 ...
 $ valid_loss: num [1:50] 2.47 2.46 2.45 2.44 2.42 ...
 $ theta     : num [1:100, 1:10] -3.21 -3.21 -3.52 -4.21 -3.72 ...
 ..- attr(*, "dimnames")=List of 2
 .. ..$ : NULL
 .. ..$ : chr [1:10] "PC1" "PC2" "PC3" "PC4" ...

The previous feature extraction methods do not have a particular target prediction variable; the features mainly capture variations among response processes. We further present a function that extracts features targeting a particular variable. Function seqm() fits a sequence model that relates a response process (seqs =) and covariates (X =) with a binary (response_type = "binary") or numeric (response_type = "scale") response variable (response =). A sequence model is essentially a neural network, whose architecture is summarized in Figure 8. Given a response process, a sequence model built by seqm() first transforms the action sequence into a sequence of K-dimensional numeric vectors through an embedding step. The dimension of the embeddings is specified through K_emb. The embedding sequence is then fed into an RNN to process the information sequentially. The dimension of the output vectors of the RNN is specified through K_rnn. The last output vector from the RNN is used as the input of a single-layer or multi-layer feedforward neural network whose output is the target response variable. To construct a feedforward neural network with more than one layer, set n_hidden to the number of intermediate layers (the total number of layers minus one) and specify the dimensions of the intermediate layers as a vector through K_hidden.

If include_time = TRUE and timestamp sequences are available in the response processes, the embedding sequence is combined with the timestamp sequence or the inter-arrival time sequence (time_interval = TRUE) to form a sequence of (K+1)-dimensional numeric vectors before being fed into the RNN. The logarithms of the time-related sequences are used by default. To use the time-related sequences in their original scale, set log_time = FALSE.

If covariates are provided (X =), the covariate vector is concatenated to the last output vector from the RNN. The resulting vector is then used as the input of the feedforward neural network. Similar to seq2feature_seq2seq(), the parameters in the sequence model are estimated by stochastic approximation. The optimizer, the (baseline) step size, and the number of epochs to be run are specified through arguments optimizer_name, step_size, and n_epoch, respectively. The training and validation sets are specified by providing either the indices of the validation samples or the proportion of validation samples through index_valid. In the latter case, the validation samples are randomly selected.

Function seqm() returns an object of class 'seqm', which is a list containing the neural network architecture, the estimated parameters, and other information about model fitting. The key elements of a 'seqm' object are

• structure: a character string describing the neural network structure;
• coefficients: a list containing the estimated parameters;
• history: a matrix with two columns giving the training (column 1) and validation (column 2) loss at the end of each epoch.

Once a sequence model is fit, predictions can be made by predict.seqm(). The inputs of predict.seqm() are a 'seqm' object returned by seqm() (object =), a 'proc' object containing a set of new response processes (new_seqs =), and a covariate matrix (new_X =) of the new response processes.
Predictions are produced by evaluating the fitted sequence model at the new response processes and covariates. The outputs of predict.seqm() are the probabilities of the response being one for binary responses and the expected value of the response for numeric responses. The code below exemplifies the prediction of the binary final response on the climate control item using the problem-solving processes.

Figure 8: The structure of sequence models (action sequence, embedding, RNN).
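The embedding step described above can be illustrated in plain R. The sketch below is conceptual, not ProcData code: the embedding matrix values are arbitrary here, whereas seqm() learns them during training.

```r
# Conceptual sketch of the embedding step in seqm(): each action is
# mapped to a row of an embedding matrix with K_emb columns.
actions <- c("start", "1_0_0", "reset", "end")
K_emb <- 5
set.seed(3)
emb <- matrix(rnorm(length(actions) * K_emb),
              nrow = length(actions),
              dimnames = list(actions, NULL))
seq_a <- c("start", "1_0_0", "1_0_0", "reset", "end")
emb_seq <- emb[seq_a, ]   # one K_emb-dimensional vector per action
dim(emb_seq)              # sequence length by K_emb, i.e., 5 by 5
```

The resulting matrix, one row per action taken, is what the RNN then processes step by step.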
R> n <- 100
R> data(cc_data)
R> samples <- sample(1:length(cc_data$responses), n)
R> seqs <- sub_seqs(cc_data$seqs, samples)
R> y <- cc_data$responses[samples]
R> x <- matrix(rnorm(n * 2), ncol = 2)
R> index_test <- 91:100
R> index_train <- 1:90
R> seqs_train <- sub_seqs(seqs, index_train)
R> seqs_test <- sub_seqs(seqs, index_test)
R> actions <- unique(unlist(seqs$action_seqs))
3. Examples
In this section, we demonstrate the
ProcData package through a case study of the climate control item in PISA 2012. The dataset cc_data is included in the package. In the case study, we explore how well the response outcome can be predicted from the response processes. In particular, we hope to understand which behavior patterns in response processes are closely related to answering the item correctly.
We randomly choose a subset of size 3000 from dataset cc_data for our analysis.
R> set.seed(12345)
R> n <- 3000
R> idx <- sample(1:length(cc_data$responses), n)
R> seqs <- sub_seqs(cc_data$seqs, idx)
R> y <- cc_data$responses[idx]
In the subset, about 53% of the respondents answered the item correctly.
R> table(y) / n
y
        0         1
0.4656667 0.5343333
We take a glance at the response processes by summarizing them through the summary method for ‘ proc ’ objects.
R> seqs_summary <- summary(seqs)
R> seqs_summary$n_action
[1] 128
R> seqs_summary$actions
  [1] "-1_-1_-1" "-1_-1_-2" "-1_-1_0"  "-1_-1_1"  "-1_-1_2"
  [6] "-1_-2_-1" "-1_-2_-2" "-1_-2_0"  "-1_-2_1"  "-1_-2_2"
 [11] "-1_0_-1"  "-1_0_-2"  "-1_0_0"   "-1_0_1"   "-1_0_2"
 [16] "-1_1_-1"  "-1_1_-2"  "-1_1_0"   "-1_1_1"   "-1_1_2"
 [21] "-1_2_-1"  "-1_2_-2"  "-1_2_0"   "-1_2_1"   "-1_2_2"
 [26] "-2_-1_-1" "-2_-1_-2" "-2_-1_0"  "-2_-1_1"  "-2_-1_2"
 [31] "-2_-2_-1" "-2_-2_-2" "-2_-2_0"  "-2_-2_1"  "-2_-2_2"
 [36] "-2_0_-1"  "-2_0_-2"  "-2_0_0"   "-2_0_1"   "-2_0_2"
 [41] "-2_1_-1"  "-2_1_-2"  "-2_1_0"   "-2_1_1"   "-2_1_2"
 [46] "-2_2_-1"  "-2_2_-2"  "-2_2_0"   "-2_2_1"   "-2_2_2"
 [51] "0_-1_-1"  "0_-1_-2"  "0_-1_0"   "0_-1_1"   "0_-1_2"
 [56] "0_-2_-1"  "0_-2_-2"  "0_-2_0"   "0_-2_1"   "0_-2_2"
 [61] "0_0_-1"   "0_0_-2"   "0_0_0"    "0_0_1"    "0_0_2"
 [66] "0_1_-1"   "0_1_-2"   "0_1_0"    "0_1_1"    "0_1_2"
 [71] "0_2_-1"   "0_2_-2"   "0_2_0"    "0_2_1"    "0_2_2"
 [76] "1_-1_-1"  "1_-1_-2"  "1_-1_0"   "1_-1_1"   "1_-1_2"
 [81] "1_-2_-1"  "1_-2_-2"  "1_-2_0"   "1_-2_1"   "1_-2_2"
 [86] "1_0_-1"   "1_0_-2"   "1_0_0"    "1_0_1"    "1_0_2"
 [91] "1_1_-1"   "1_1_-2"   "1_1_0"    "1_1_1"    "1_1_2"
 [96] "1_2_-1"   "1_2_-2"   "1_2_0"    "1_2_1"    "1_2_2"
[101] "2_-1_-1"  "2_-1_-2"  "2_-1_0"   "2_-1_1"   "2_-1_2"
[106] "2_-2_-1"  "2_-2_-2"  "2_-2_0"   "2_-2_1"   "2_-2_2"
[111] "2_0_-1"   "2_0_-2"   "2_0_0"    "2_0_1"    "2_0_2"
[116] "2_1_-1"   "2_1_-2"   "2_1_0"    "2_1_1"    "2_1_2"
[121] "2_2_-1"   "2_2_-2"   "2_2_0"    "2_2_1"    "2_2_2"
[126] "end"      "reset"    "start"

R> range(seqs_summary$seq_length)
[1]   3 183
Figure 9 presents a 25 × 25 submatrix of the 128 × 128 action transition probability matrix trans_mat . The 25 actions corresponding to the submatrix are randomly selected from the action set. The dark diagonal of the submatrix indicates that respondents tend to repeat an action several times before taking a different action. To avoid the repeated actions diluting the information in other actions and to make the response processes less noisy, we remove the repeated actions with function remove_repeat() .

R> seqs <- remove_repeat(seqs)
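What remove_repeat() does within each process can be sketched in base R. The helper below, collapse_repeats, is a hypothetical illustration written for this sketch, not a ProcData function:

```r
# Conceptual sketch: collapse consecutive repeats of an action,
# as remove_repeat() does for each response process.
collapse_repeats <- function(actions) {
  # keep the first element and any element that differs from its predecessor
  keep <- c(TRUE, actions[-1] != actions[-length(actions)])
  actions[keep]
}
collapse_repeats(c("start", "1_0_0", "1_0_0", "1_0_0", "reset", "end"))
# "start" "1_0_0" "reset" "end"
```

Only consecutive duplicates are removed; an action that reappears later in the process is kept.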
Figure 9: Action transition probability matrix.

As we mentioned earlier, response processes are in a nonstandard format and thus cannot be easily incorporated in traditional statistical models such as linear models and generalized linear models. To make use of these classical tools for studying the relation between response processes and response outcomes, we first compress the information in the response processes into features, which are in a standard matrix format, and then use them as covariates to fit a logistic regression.

We start by extracting features through multidimensional scaling. The number of features to be extracted is chosen by five-fold cross-validation with chooseK_mds() . The candidate values of the number of features are 10, 20, . . . , 100, and the dissimilarity between action sequences is computed with dist_type = "oss_action" . Note that by setting return_dist = TRUE , we ask chooseK_mds() to return the dissimilarity matrix for future use to avoid repeated calculations. The returned dissimilarity matrix and the selected number of features are then passed to seq2feature_mds() for feature extraction.

R> mds_K_res <- chooseK_mds(seqs, 1:10*10, dist_type = "oss_action",
+   return_dist = TRUE)
R> K <- mds_K_res$K
R> K
[1] 60
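The idea behind MDS feature extraction can be illustrated with classical multidimensional scaling in base R. This is a toy sketch on synthetic data, not the package's implementation, which is tailored to large dissimilarity matrices:

```r
# Toy sketch: classical MDS turns a pairwise dissimilarity matrix
# into K numeric features per object (here, one row per "process").
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)   # 10 synthetic objects
d <- dist(x)                       # pairwise dissimilarities
theta <- cmdscale(d, k = 2)        # K = 2 features per object
dim(theta)                         # 10  2
```

The rows of theta play the same role as the feature matrix returned by seq2feature_mds(): objects with small dissimilarity end up with nearby feature vectors.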
R> mds_res <- seq2feature_mds(mds_K_res$dist_mat, K = K)

The 60 extracted features can then be used as covariates in logistic regression. To evaluate the out-of-sample prediction performance, the dataset is split into training, validation, and test sets.
R> n_train <- 2000
R> n_valid <- 500
R> n_test <- 500
R> index_train <- sample(1:n, n_train)
R> index_valid <- sample(setdiff(1:n, index_train), n_valid)
R> index_test <- setdiff(1:n, c(index_train, index_valid))
The fitting and prediction of logistic regression models can be achieved by the glm() function and the predict method as usual.
R> mds_data <- data.frame(y = y, x = mds_res$theta)
R> mds_glm_res <- glm(y ~ ., family = "binomial",
+   subset = c(index_train, index_valid), data = mds_data)
R> yhat_test <- as.numeric(predict(mds_glm_res,
+   newdata = mds_data[index_test, ]) > 0)
R> mean(y[index_test] == yhat_test)
[1] 0.7981735
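The thresholding predict(...) > 0 above works on the link (log-odds) scale, where 0 corresponds to a predicted probability of 0.5. A small sketch on synthetic data (not cc_data) makes the equivalence explicit:

```r
# For a binomial glm, predict(fit) returns log-odds by default, so
# eta > 0 is equivalent to a predicted probability above 0.5.
set.seed(2)
df <- data.frame(x = rnorm(50))
df$y <- rbinom(50, 1, plogis(df$x))
fit <- glm(y ~ x, family = "binomial", data = df)
eta <- predict(fit)                     # link scale (log-odds)
p   <- predict(fit, type = "response")  # probability scale
all((eta > 0) == (p > 0.5))             # TRUE
```

Either scale can therefore be used to turn fitted values into 0/1 predictions.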
R> summary(mds_glm_res)
Call:
glm(formula = y ~ ., family = "binomial", data = mds_data,
    subset = c(index_train, index_valid))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.7806  -0.6561   0.1878   0.7678   2.2290

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.23119    0.05255   4.399 1.09e-05 ***
x.1         -9.87064    0.41848 -23.587  < 2e-16 ***
x.2         -0.73587    0.37248  -1.976 0.048201 *
x.3         -1.68177    0.48546  -3.464 0.000532 ***
x.4          5.42484    0.49103  11.048  < 2e-16 ***
x.5          1.38222    0.51211   2.699 0.006954 **
x.6         -1.10816    0.63259  -1.752 0.079814 .
x.7          1.25793    0.66810   1.883 0.059721 .
x.8         -3.68047    0.68458  -5.376 7.61e-08 ***
x.9          1.64550    0.82766   1.988 0.046796 *
x.10         2.18376    0.83807   2.606 0.009169 **
(output omitted)
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3457.4  on 2499  degrees of freedom
Residual deviance: 2382.7  on 2439  degrees of freedom
AIC: 2504.7

Number of Fisher Scoring iterations: 5

Since the out-of-sample prediction accuracy (0.80) is much higher than the benchmark (0.54), which is the accuracy if we naively predict that all respondents in the test set answered the item correctly, it is reasonable to believe that some behavior patterns in the response processes are closely related to the response outcomes. By inspecting the glm output and refitting a logistic regression with the first feature as the only covariate, we find that including only the first feature can already produce a prediction accuracy of 0.77.
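The benchmark accuracy is simply what one obtains by always predicting the majority class. With class proportions mirroring our subset (about 53% correct responses), the computation is:

```r
# Majority-class benchmark: predict the more frequent outcome for all.
# The toy proportions below mirror those in the cc_data subset.
y_toy <- c(rep(1, 534), rep(0, 466))
benchmark <- max(table(y_toy)) / length(y_toy)
benchmark   # 0.534
```

Any model worth keeping should beat this baseline on held-out data.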
R> mds_glm_res1 <- glm(y ~ x.1, family = "binomial",
+   subset = c(index_train, index_valid), data = mds_data)
R> yhat_test1 <- as.numeric(predict(mds_glm_res1,
+   newdata = mds_data[index_test, ]) > 0)
R> mean(y[index_test] == yhat_test1)
[1] 0.7730594
Now we examine the behavior patterns associated with the first multidimensional scaling feature. We order the response processes according to the value of their first feature and use the print method for class ‘ proc ’ to display the response processes with the highest and lowest values.
R> o_mds1 <- order(mds_res$theta[,1])
R>
'proc' object of 3000 processes

FRA000019103869
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  1_0_0  2_0_0  -1_0_0 -2_0_0 0_1_0  0_2_0  0_-1_0 0_-2_0
Time  0.0    36.6   40.9   46.6   50.3   59.4   64.0   66.3   69.4
      Step 10 Step 11 Step 12 Step 13 Step 14
Event 0_0_1   0_0_2   0_0_-1  0_0_-2  end
Time  77.7    80.3    83.9    86.0    93.2

AUT000007001735
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  1_0_0  2_0_0  -2_0_0 -1_0_0 0_1_0  0_2_0  0_-1_0 0_-2_0
Time  0.0    57.4   60.6   62.8   67.1   79.6   81.1   83.3   85.3
      Step 10 Step 11 Step 12 Step 13 Step 14
Event 0_0_1   0_0_2   0_0_-1  0_0_-2  end
Time  97.2    98.6    100.7   102.5   119.5

PRT000009602769
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  1_0_0  2_0_0  -1_0_0 -2_0_0 0_1_0  0_2_0  0_-1_0 0_0_1
Time  0.0    73.1   82.5   89.2   94.3   109.5  114.8  119.8  128.8
      Step 10 Step 11
Event 0_0_2   end
Time  131.4   139.7

BRA000083118946
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  1_0_0  2_0_0  -1_0_0 -2_0_0 reset  0_1_0  0_2_0  0_-1_0
Time  0.0    141.7  151.2  157.7  164.6  182.1  185.1  188.8  194.2
      Step 10 Step 11 Step 12 Step 13 Step 14 Step 15 Step 16
Event 0_-2_0  reset   0_0_1   0_0_2   0_0_-1  0_0_-2  end
Time  196.3   204.0   206.6   209.4   215.9   218.6   240.4

KOR000006202016
      Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Step 8 Step 9
Event start  2_0_0  1_0_0  -1_0_0 -2_0_0 reset  0_1_0  0_2_0  0_-1_0
Time  0.0    46.1   52.4   56.4   59.2   62.5   67.3   69.3   70.3
      Step 10 Step 11 Step 12 Step 13 Step 14 Step 15 Step 16
Event 0_-2_0  reset   0_0_1   0_0_2   0_0_-1  0_0_-2  end
Time  72.4    78.9    81.1    84.1    86.6    89.0    94.6
R>
'proc' object of 3000 processes

KOR000010803463
      Step 1 Step 2 Step 3
Event start  0_0_0  end
Time  0.0    50.6   63.9

MYS000009803099
      Step 1 Step 2 Step 3
Event start  0_0_0  end
Time  0.0    38.4   114.8

URY000004601383
      Step 1 Step 2 Step 3
Event start  0_0_0  end
Time  0.0    61.0   66.9

MYS000001800562
      Step 1 Step 2 Step 3
Event start  0_0_0  end
Time  0.0    70.4   84.5

MYS000015604929
      Step 1 Step 2 Step 3
Event start  0_0_0  end
Time  0.0    71.5   107.1
The response processes corresponding to the highest feature values are often very short, meaning that little meaningful interaction with the computer interface takes place. Respondents with this type of behavior are unlikely to answer the question correctly. The response processes corresponding to the lowest feature values are often longer. However, their lengths are not the longest according to Figure 10, where the value of the first feature is plotted against the logarithm of the length of the corresponding process. A closer look at the response processes with the lowest values of the feature reveals that these respondents often explore the function of one control bar at a time. This “varying-one-thing-at-a-time” strategy is an efficient way to reach the correct answer.

We can also extract features by constructing a sequence autoencoder. Below is an example of extracting action sequence autoencoder features with seq2feature_seq2seq() in ProcData .

R> ae_a_res <- seq2feature_seq2seq(seqs, ae_type = "action", K = K,
+   rnn_type = "gru", samples_train = index_train,
+   samples_valid = index_valid, verbose = FALSE)
The timestamp sequence can easily be taken into account by setting ae_type = "both" .

R> ae_at_res <- seq2feature_seq2seq(seqs, ae_type = "both", K = K,
+   rnn_type = "gru", samples_train = index_train,
+   samples_valid = index_valid, verbose = FALSE)
The logarithm of the inter-arrival time sequence is used here since, by default, cumulative = FALSE and log = TRUE . Similar to the multidimensional scaling features, features extracted by sequence autoencoders can be used as covariates in classical models. In our case, we can fit logistic regression models for predicting response outcomes.
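For instance, for the recorded process from the introduction with timestamps 4.9, 6.3, and 10.6 seconds, the inter-arrival times and their logarithms (the default time input) can be computed as:

```r
# Inter-arrival times and their logarithms from a timestamp sequence
# (the default time input when cumulative = FALSE and log = TRUE).
tseq <- c(0.0, 4.9, 6.3, 10.6)   # item start plus three recorded actions
gaps <- diff(tseq)               # time between consecutive actions
log_gaps <- log(gaps)
round(gaps, 1)                   # 4.9 1.4 4.3
```

Taking logarithms compresses the long right tail that raw inter-arrival times typically have.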
Figure 10: First MDS feature against the logarithm of process length.

R> ae_a_data <- data.frame(y = y, x = ae_a_res$theta)
R> ae_glm_a_res <- glm(y ~ ., family = "binomial",
+   subset = c(index_train, index_valid), data = ae_a_data)
R> yhat_test_a <- as.numeric(predict(ae_glm_a_res,
+   newdata = ae_a_data[index_test, ]) > 0)
R> mean(y[index_test] == yhat_test_a)
[1] 0.8401826
R> ae_at_data <- data.frame(y = y, x = ae_at_res$theta)
R> ae_glm_at_res <- glm(y ~ ., family = "binomial",
+   subset = c(index_train, index_valid), data = ae_at_data)
R> yhat_test_at <- as.numeric(predict(ae_glm_at_res,
+   newdata = ae_at_data[index_test, ]) > 0)
R> mean(y[index_test] == yhat_test_at)
[1] 0.8424658
The prediction accuracy is about 5% higher than that obtained by the MDS features. Important patterns in response processes can be examined similarly. We refer interested readers to Tang et al. (2019b).
If prediction is the only goal of exploring the relationship between response processes and a binary or numeric response variable, one can achieve it by fitting a sequence model in ProcData. We split the data into training, validation, and test sets as before. The training and validation sets are used for fitting the sequence model while the test set is used for evaluating its prediction performance. The response processes used for model fitting and testing are obtained by subsetting the ‘ proc ’ object seqs with function sub_seqs() in ProcData .

R> seqs_train <- sub_seqs(seqs, c(index_train, index_valid))
R> seqs_test <- sub_seqs(seqs, index_test)
R> y_train <- y[c(index_train, index_valid)]
R> y_test <- y[index_test]
We first consider the task of predicting the response outcome from the action sequence in a response process. The sequence model that fulfills this task can be fitted by calling function seqm() . Since the response outcome is a binary variable, we specify response_type = "binary" . Argument n_epoch is the total number of epochs to be run for estimating the parameters. The fitted model is the one that produces the lowest loss on the validation set, which is specified by passing the indices of the validation processes through index_valid .

R> seqm_res <- seqm(seqs_train, y_train, response_type = "binary",
+   K_emb = 5, K_rnn = 5, n_epoch = 20,
+   max_len = max(seqs_summary$seq_length),
+   index_valid = n_train + 1:n_valid)
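Note that in R the sequence operator binds tighter than addition, so n_train + 1:n_valid picks out the positions of the validation processes within seqs_train, which stacks the training processes first and the validation processes after them:

```r
# ':' has higher precedence than '+', so this is 2000 + (1:500),
# i.e., indices 2001 to 2500 -- the validation block in seqs_train.
n_train <- 2000
n_valid <- 500
idx <- n_train + 1:n_valid
range(idx)   # 2001 2500
```

This precedence rule is why no parentheses are needed in the seqm() call above.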
The convergence of the training process can be examined through a plot of the value of the loss function at the end of each epoch, stored in seqm_res$history (Figure 11). Predictions based on the fitted model can be made by calling the predict method for ‘ seqm ’ objects.
R> seq_pred_res <- predict(seqm_res, new_seqs = seqs_test)
R> mean(as.numeric(seq_pred_res > 0.5) == y_test)
[1] 0.847032
The prediction accuracy is slightly higher than that obtained by logistic regression on extracted features.

A sequence model incorporating the information in the timestamp sequences can be fitted by setting include_time = TRUE in seqm() . Predictions from the fitted model can be made in the same way as before.

R> seqm_res2 <- seqm(seqs_train, y_train, response_type = "binary",
+   include_time = TRUE, K_emb = 5, K_rnn = 5, n_epoch = 20,
+   max_len = max(seqs_summary$seq_length),
+   index_valid = n_train + 1:n_valid)
R> seq_pred_res2 <- predict(seqm_res2, new_seqs = seqs_test)
R> mean(as.numeric(seq_pred_res2 > 0.5) == y_test)
[1] 0.656621

Figure 11: Trace of training and validation loss.

Adding timestamp sequences impairs the prediction accuracy, indicating that the timestamp sequences of the climate control item do not contain much additional information about the response outcomes.

In many cases, both the response processes and background information such as age, education level, and employment status are considered in prediction. This background information can be added to the sequence model as covariates supplied through covariates in seqm() . Covariates for the test set should be supplied to predict.seqm() through new_covariates when we make predictions. Since background information is not available in the climate control dataset cc_data , we demonstrate this by adding the first five MDS features.

R> X <- mds_res$theta[, 1:5]
R> X_train <- X[c(index_train, index_valid), ]
R> X_test <- X[index_test, ]
R> seqm_res3 <- seqm(seqs_train, y_train, covariates = X_train,
+   response_type = "binary", K_emb = 5, K_rnn = 5,
+   n_epoch = 20, max_len = max(seqs_summary$seq_length),
+   index_valid = n_train + 1:n_valid)
R> seq_pred_res3 <- predict(seqm_res3, new_seqs = seqs_test,
+   new_covariates = X_test)
R> mean(as.numeric(seq_pred_res3 > 0.5) == y_test)
[1] 0.847032
Prediction accuracy does not change significantly after the MDS features are incorporated.
4. Summary
To the authors’ best knowledge,
ProcData is the first package designed for process data analysis. It can be used for analyzing process data generated from a wide range of human-computer interfaces. ProcData includes an S3 class ‘ proc ’ for organizing process data as well as tools for processing, describing, and analyzing process data. These tools are easy to use: users do not need to handle the construction and training of neural networks themselves when they use the neural-network-based models. Process data analysis is an active and quickly rising field. We envision including more state-of-the-art methods for analyzing response processes in future versions of ProcData.

Acknowledgments
This research is supported in part by NSF IIS-1633360 and NSF SES-1826540.
References
Allaire J, Chollet F (2019). keras: R Interface to 'Keras'. R package version 2.2.4.1, URL https://CRAN.R-project.org/package=keras.

Borg I, Groenen PJ (2005). Modern Multidimensional Scaling: Theory and Applications. Springer Science & Business Media, New York, NY. doi:10.1007/0-387-28981-X.

Broyden CG (1970). "The Convergence of a Class of Double-Rank Minimization Algorithms 1. General Considerations." IMA Journal of Applied Mathematics, (1), 76-90.

Chen Y, Li X, Liu J, Ying Z (2019). "Statistical Analysis of Complex Problem-Solving Process Data: An Event History Analysis Approach." Frontiers in Psychology, 486. doi:10.3389/fpsyg.2019.00486.

Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014). "Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation." In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1724-1734. Association for Computational Linguistics. doi:10.3115/v1/D14-1179.

Fletcher R (1970). "A New Approach to Variable Metric Algorithms." The Computer Journal, (3), 317-322.

Gabadinho A, Ritschard G, Müller NS, Studer M (2011). "Analyzing and Visualizing State Sequences in R with TraMineR." Journal of Statistical Software, (4), 1-37. doi:10.18637/jss.v040.i04.

Goldfarb D (1970). "A Family of Variable-Metric Methods Derived by Variational Means." Mathematics of Computation, (109), 23-26.

Gómez-Alonso C, Valls A (2008). "A Similarity Measure for Sequences of Categorical Data Based on the Ordering of Common Elements." In V Torra, Y Narukawa (eds.), Modeling Decisions for Artificial Intelligence, pp. 134-145. Springer-Verlag, Berlin, Heidelberg. doi:10.1007/978-3-540-88269-5_13.

Goodfellow I, Bengio Y, Courville A (2016). Deep Learning. MIT Press.

He Q, von Davier M (2016). Analyzing Process Data from Problem-Solving Items with N-grams: Insights from a Computer-Based Large-Scale Assessment, pp. 749-776. Information Science Reference, Hershey, PA. doi:10.4018/978-1-4666-9441-5.ch029.

Hinton G, Srivastava N, Swersky K (2014). "RMSProp: Divide the Gradient by a Running Average of Its Recent Magnitude."

Hochreiter S, Schmidhuber J (1997). "Long Short-Term Memory." Neural Computation, (8), 1735-1780. doi:10.1162/neco.1997.9.8.1735.

Kingma DP, Ba J (2015). "Adam: A Method for Stochastic Optimization." In Proceedings of the 3rd International Conference on Learning Representations.

Lord FM (1980). Applications of Item Response Theory to Practical Testing Problems. Routledge, New York, NY.

Lord FM, Novick MR (1968). Statistical Theories of Mental Test Scores. Addison-Wesley, Reading, MA.

Paradis E (2018). "Multidimensional Scaling with Very Large Datasets." Journal of Computational and Graphical Statistics, (4), 935-939.

Qiao X, Jiao H (2018). "Data Mining Techniques in Analyzing Process Data: A Didactic." Frontiers in Psychology, 2231. doi:10.3389/fpsyg.2018.02231.

R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Ren Y, Luo F, Ren P, Bai D, Li X, Liu H (2019). "Exploring Multiple Goals Balancing in Complex Problem Solving Based on Log Data." Frontiers in Psychology, 1975. doi:10.3389/fpsyg.2019.01975.

Robbins H, Monro S (1951). "A Stochastic Approximation Method." The Annals of Mathematical Statistics, (3), 400-407. doi:10.1214/aoms/1177729586.

Rupp AA, Templin J, Henson RA (2010). Diagnostic Measurement: Theory, Methods, and Applications. Guilford Press, New York, NY.

Shanno DF (1970). "Conditioning of Quasi-Newton Methods for Function Minimization." Mathematics of Computation, (111), 647-656.

Frontiers in Psychology, 777. doi:10.3389/fpsyg.2019.00777.

Tang S, Peterson JC, Pardos ZA (2016). "Deep Neural Networks and How They Apply to Sequential Education Data." In Proceedings of the Third (2016) ACM Conference on Learning @ Scale, pp. 321-324. ACM. doi:10.1145/2876034.2893444.

Tang X, Wang Z, He Q, Liu J, Ying Z (2019a). "Latent Feature Extraction for Process Data via Multidimensional Scaling." arXiv preprint arXiv:1904.09699.

Tang X, Wang Z, Liu J, Ying Z (2019b). "An Exploration of Process Data by Action Sequence Autoencoder." arXiv preprint arXiv:1908.06075.

Wang C, Xu G, Shang Z, Kuncel N (2018). "Detecting Aberrant Behavior and Item Preknowledge: A Comparison of Mixture Modeling Method and Residual Method." Journal of Educational and Behavioral Statistics, (4), 469-501. doi:10.3102/1076998618767123.

Zeiler MD (2012). "ADADELTA: An Adaptive Learning Rate Method." arXiv preprint arXiv:1212.5701.

Affiliation:
Xueying Tang
Department of Mathematics
University of Arizona
Address: 617 N. Santa Rita Avenue, Tucson, AZ 85721
E-mail:
URL: http://websites.math.arizona.edu/xueyingtang/

Susu Zhang, Zhi Wang, Jingchen Liu, Zhiliang Ying
Department of Statistics
Columbia University
Address: 1255 Amsterdam Avenue, New York, NY 10027