Corpus Linguistics and Linguistic Theory | 2019
On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement
Abstract
This paper discusses methodological problems that (can) arise when multifactorial data are analyzed with tree-based or forest-based classifiers in (corpus) linguistics. I showcase a data set that highlights where such methods can fail to provide optimal results and then discuss solutions to this problem as well as the interpretation of random forests more generally.