Coupling Machine Learning and Crop Modeling Improves Crop Yield Prediction in the US Corn Belt

Mohsen Shahhosseini, Guiping Hu, Sotirios V. Archontoulis, Isaiah Huber

Department of Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, Iowa, USA
Department of Agronomy, Iowa State University, Ames, Iowa, USA
* Corresponding author e-mail: [email protected]
Abstract
This study investigates whether coupling crop modeling and machine learning (ML) improves corn yield predictions in the US Corn Belt. The main objectives are to explore whether a hybrid approach (crop modeling + ML) results in better predictions, to investigate which combinations of hybrid models provide the most accurate predictions, and to determine which features from the crop modeling are most effective when integrated with ML for corn yield prediction. Five ML models and six ensemble models were designed to address the research question. The results suggest that adding simulated crop model (APSIM) variables as input features to ML models can make a significant difference in their performance, boosting ML performance by up to 29%. Furthermore, we investigated partial inclusion of APSIM features in the ML prediction models and found that soil- and weather-related APSIM variables are most influential on the ML predictions, followed by crop-related and phenology-related variables. Finally, based on feature importance measures, the simulated APSIM average drought stress and average water table depth during the growing season are the most important APSIM inputs to ML. This result indicates that weather information alone is not sufficient and that ML models need additional hydrological inputs to make improved yield predictions.

Introduction
Advances in machine learning and simulation crop modeling have created new opportunities to improve prediction in agriculture [1-4]. These technologies have each provided unique capabilities and significant advancements in prediction performance; however, they have mainly been studied separately. Given the strengths of each of these technologies, and that each may benefit from additional information for making more informed predictions, there may be benefits in integrating them to further increase prediction accuracy. Simulation crop models make agricultural predictions such as yield, flowering time, and water stress using management, crop cultivar, and environmental inputs, as well as science-based equations of crop physiology, hydrology, and soil C and N cycling [5-7]. Numerous studies have used simulation crop models for forecasting applications. For instance, Dumont et al. compared the within-season yield predictive performance of two simulation crop models, one based on stochastically generated climatic data and the other on mean climate data. The results show similar performance of both models, with a relative root mean square error (RRMSE) of 10% in 90% of the climatic situations; however, the model based on mean climate data had a far shorter running time. Togliatti et al. used APSIM maize and soybean models to forecast phenology and yields with and without weather forecast data, and found that inclusion of the weather forecast did not improve prediction accuracy. There are many other examples in the literature in which simulation crop modeling was used to forecast various outputs of cropping systems [10-12]. On the other hand, machine learning (ML) makes predictions by finding connections between input and response variables. A wide variety of studies have addressed agronomic prediction using ML algorithms. Drummond et al. applied stepwise multiple linear regression (SMLR), projection pursuit regression (PPR), and several types of neural networks to a data set of soil properties and topographic characteristics for 10 site-years with the purpose of predicting grain yields; they found that neural network models outperformed SMLR and PPR in every site-year. Khaki and Wang designed residual neural network models to predict corn yield. Khaki et al. developed a CNN-RNN framework to predict corn and soybean yields for 13 states in the US Corn Belt; their model outperformed random forest, deep fully connected neural networks (DFNN), and LASSO models, achieving an RRMSE of 9% and 8% for corn and soybean prediction, respectively. Jiang et al. devised a long short-term memory (LSTM) model that incorporates heterogeneous crop phenology, meteorology, and remote sensing data to predict county-level corn yields; this model outperformed LASSO and random forest and explained 76% of yield variation across the Corn Belt. Mupangwa et al. evaluated the performance of several ML models in predicting maize grain yields under conservation agriculture, formulating the problem as classification of unseen observations' agro-ecologies (highlands or lowlands); they found that linear discriminant analysis (LDA) performed better than the other trained models, including logistic regression, K-nearest neighbors, decision tree, naïve Bayes, and support vector machines (SVM), with a prediction accuracy of 61%. We hypothesized that merging these prediction tools, namely simulation crop models and machine learning models, will improve prediction.
It should be noted that there have not been many studies in this area, other than a few papers combining crop models with simple regression. The main approach has been the use of regression analysis to incorporate yield technology trends into the crop model simulations [18-21]. Some studies have used simulation crop model outputs as inputs to a multiple linear regression model, forming a hybrid simulation crop-regression framework to predict yields [22-24]. However, only two recent studies created hybrid simulation crop modeling-ML models for yield prediction. Everingham et al. used simulated biomass from the APSIM sugarcane crop model, seasonal climate prediction indices, observed rainfall, maximum and minimum temperature, and radiation as input variables of a random forest regression algorithm to predict annual variation in regional sugarcane yields in northeastern Australia. The hybrid model was capable of making decent yield predictions, explaining 67%, 72%, and 79% of the total variability in yield when predictions were made on September 1st, January 1st, and March 1st, respectively. In another recent study, Feng et al. showed that incorporating machine learning with a biophysical model can improve the evaluation of the impact of climate extremes on wheat yield in south-eastern Australia. To this end, they designed a framework that used APSIM model outputs and growth stage-specific extreme climate event (ECE) indicators to predict wheat yield with a random forest (RF) model. The developed hybrid APSIM + RF model outperformed both the benchmark (hybrid APSIM + multiple linear regression (MLR)) and the APSIM model alone, improving prediction accuracy by 19% and 33% relative to APSIM + MLR and APSIM alone, respectively. None of these studies compared the performance of various ML models and their ensembles in creating hybrid simulation crop modeling-ML frameworks, and partial inclusion of the simulation crop modeling outputs has not been studied in the literature. The goal of this paper is to comprehensively investigate the effect of coupling process-based modeling with machine learning algorithms towards improved crop yield prediction. The specific research objectives are to:
1. Explore whether a hybrid approach (simulation crop modeling + ML) results in better corn yield predictions in the US Corn Belt;
2. Investigate which combinations of hybrid models (various ML x crop model) provide the most accurate predictions;
3. Determine the features from the crop modeling that are most relevant for use by ML for corn yield prediction.
Figure 1 depicts the conceptual framework of this paper.
Figure 1: Conceptual framework of this study’s objective
The remainder of this paper is organized as follows. Section 2 describes the methodology and materials used in this study; Section 3 presents and discusses the results and possible improvements; Section 4 discusses the analysis and findings; and Section 5 concludes the paper.

Materials and Methods
Since the main objective is to evaluate the performance of a hybrid simulation-machine learning framework in predicting corn yield, this section is split into two parts: the first describes the Agricultural Production Systems sIMulator (APSIM), and the second the machine learning (ML) algorithms. Each part explains the details of the corresponding prediction/forecasting framework, including the model inputs, the data processing tasks, the selected predictive models, and the evaluation metrics used to compare the results.
Agricultural Production Systems sIMulator (APSIM)
APSIM run details
The Agricultural Production Systems sIMulator (APSIM) is an open-source advanced simulator of cropping systems. It includes many crop models along with soil water, C, N, and crop residue modules, which all interact on a daily time step. In this project, we used APSIM maize version 7.9, and in particular the model version calibrated for US Corn Belt environments as outlined by Archontoulis et al., which includes shallow water tables and inhibition of root growth due to excess water stress and waterlogging functions. Within APSIM we used the following modules: maize, SWIM soil water, soil N and carbon, surface residue [32, 33], and soil temperature. pSIMS is a platform for generating simulations and running point-based agricultural models across large geographical regions. The simulations used in this study were created on a 5-arcminute grid across Iowa, Illinois, and Indiana, considering only cropland area when creating soil profiles. Soil profiles for these simulations were created from SSURGO, a soil database based on soil survey information collected by the National Cooperative Soil Survey. Climate information used by the simulations came from a weather database based on NASA POWER (https://power.larc.nasa.gov) and the Iowa Environmental Mesonet (https://mesonet.agron.iastate.edu). Current management APSIM model input databases include changes in plant density, planting dates, cultivar characteristics, and N fertilization rate to corn from 1984 to 2019. Planting date and plant density data were derived from USDA-NASS. Cultivar trait data were derived through regional-scale model calibration. N fertilizer data were derived from a combined analysis of USDA-NASS and Cao et al., including N rates to corn by county and by year. Over the historical period, 1984-2019, APSIM captured 78% of the variability in the NASS yields, with an RMSE of 1 Mg/ha and RRMSE of 10% (see Figure 2). This version of the model is used to provide outputs to the machine learning.

Figure 2: Measured (USDA-NASS) corn yields vs. simulated corn yields at the state level (Illinois, Indiana, Iowa) from 1984 to 2019 using the pSIMS-APSIM framework (y = 0.9601x + 0.812; RMSE = 1000 kg/ha; RRMSE = 10%; R² = 0.7853).
APSIM output variables used as inputs to ML models
The first step in combining the developed data set with APSIM variables was to extract all APSIM simulations from its outputs and prepare the obtained data to be added to the data set. The APSIM outputs include 22 variables (detailed in Table 1). The granularity of the APSIM variables differed from the USDA-obtained data, as the APSIM simulations were made on a 5-arcminute grid (approximately 40 fields within a county). Therefore, to calculate a county-level value for each variable, the median of all corresponding values was used. The median was used instead of a simple average to reduce the impact of outliers on yields: among the roughly 40 fields per county × ~300 counties × 35 years, there were some model failures or zero yields that would bias county-level yield values.
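As a small, illustrative sketch of this median aggregation (made-up field values, not the paper's data), the median damps the failed simulations that would drag down a simple mean:

```python
import numpy as np

def county_value(field_values):
    """Aggregate ~40 field-level APSIM simulations within a county to a
    single county-level value; the median damps model failures / zero yields."""
    return float(np.median(np.asarray(field_values, dtype=float)))

# Hypothetical county with two failed simulations (zero yield), in kg/ha:
fields = [11000, 10800, 0, 11200, 0, 10900, 11100]
median_yield = county_value(fields)     # robust to the zeros
mean_yield = float(np.mean(fields))     # dragged down by the failures
```

Here the median stays near the typical simulated yield while the mean drops by roughly a quarter, which is the motivation for using the median.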
Table 1: Description of all APSIM outputs added to the developed data set for building ML models
Acronym: Description

Crop Yield: Crop yield (kg/ha)
Biomass: Crop above-ground biomass (kg/ha)
Root Depth: Maximum root depth (mm)
Flower Date: Flowering time (doy)
Maturity Date: Maturity time (doy)
LAI maximum: Maximum leaf area index (m²/m²)
ET Annual: Actual evapotranspiration (mm)
Crop Transpiration: Crop transpiration (mm)
Total Nupt: Above-ground crop N uptake (kg N/ha)
Grain Nupt: Grain N uptake (kg N/ha)
Avg Drought Stress: Average drought stress on leaf development (0-1)
Avg Excessive Stress: Average excess moisture stress on photosynthesis (0-1)
Avg N Stress: Average N stress on grain growth (0-1)
Avg WT Inseason: Depth to water table during the growing season (mm)
Runoff Annual: Runoff (mm)
Drainage: Drainage from tiles and below 1.5 m (mm)
Gross Miner: Soil gross N mineralization (kg N/ha)
Nloss Total: Total N loss (denitrification and leaching) (kg N/ha)
Avg WT: Depth to water table during the entire year (mm)
SWtoDUL30Inseason: Growing-season average soil water to field capacity ratio at 30 cm
SWtoDUL60Inseason: Same as above, at 60 cm
SWtoDUL90Inseason: Same as above, at 90 cm
All 22 APSIM output values were prepared and added to the developed data set. The pre-processing tasks performed on the APSIM data were:
- Imputing zero values with the average of the other values of the same feature
- Removing rows with missing values
- Normalizing the data to be between 0 and 1
- Cross-referencing the new data with the developed data set
Then, all feature selection procedures explained in section 2.2.2 were executed on the newly created data set, which resulted in eliminating some of the APSIM variable features (AvgWT, CropYield, SWtoDUL60Inseason, SWtoDUL90Inseason, MaturityDate, and CropTranspiration for the case of year 2018 as the test set). The data from two years, 2017 and 2018, are used as test data; for each scenario, the training data is set to the years from 1984 to the year before the test year (2017 or 2018).
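The listed preprocessing steps can be sketched as follows (a simplified numpy version; the actual pipeline and column handling in the study may differ):

```python
import numpy as np

def preprocess_apsim(X):
    """Sketch of the APSIM-data preprocessing: zero imputation,
    removal of rows with missing values, and 0-1 normalization."""
    X = np.array(X, dtype=float)
    # 1) Impute zero values with the mean of the other values of that feature.
    for j in range(X.shape[1]):
        col = X[:, j]
        nonzero = col[(col != 0) & ~np.isnan(col)]
        if nonzero.size and (col == 0).any():
            col[col == 0] = nonzero.mean()
    # 2) Remove rows with missing values.
    X = X[~np.isnan(X).any(axis=1)]
    # 3) Normalize each feature to [0, 1].
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

X_clean = preprocess_apsim([[0.0, 2.0], [4.0, np.nan], [8.0, 6.0]])
```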
Machine Learning (ML)
The machine learning models are developed using a data set spanning 1984 to 2018 to predict corn yield in three US Corn Belt states (Illinois, Indiana, and Iowa). The data set comprises environment (soil and weather) and management input variables, and actual corn yields for the period under study as the target variable. The input data are selected so that they are agronomically relevant for yield prediction. Environment data include several soil parameters at a 5 km resolution and weather data at a 1 km resolution.

Data set
The county-level historical corn yields were downloaded from the USDA National Agricultural Statistics Service (USDA-NASS) for the years 1984-2018. A data set including observed information on environment, management, and yields was developed, consisting of 10,016 observations of yearly average corn yields for 293 counties. The factors that mainly affect crop yields are generally considered to be environment, genotype, and management. To this end, weather and soil were included as environmental features, and plant population and planting progress as management features. It should be noted that the data preprocessing was designed to address the increasing trends in yields due to technological and genotypic advances over the years, mainly because no genotype data set is publicly available. The data set, with 598 variables (including the target variable), is described below.
• Plant population: one feature describing the population of plants measured in plants per acre, obtained from USDA-NASS
• Planting progress (planting date): 52 features explaining the weekly cumulative percentage of corn planted within each state
• Weather: seven weather variables accumulated weekly (364 features), obtained from Daymet:
1. Daily minimum air temperature (°C)
2. Daily maximum air temperature (°C)
3. Daily total precipitation (mm/day)
4. Shortwave radiation (W/m²)
5. Water vapor pressure (Pa)
6. Snow water equivalent (kg/m²)
7. Day length (s/day)
• Soil: soil organic matter, sand content, clay content, soil pH, soil bulk density, wilting point, field capacity, and saturation point. Because these properties change across the soil profile, different values for different soil layers were used, giving 180 soil features for the locations under study, obtained from the Web Soil Survey
• Corn Yield: yearly corn yield data in bushels per acre, collected from USDA-NASS

Data pre-processing
Several pre-processing tasks were conducted to ensure the data are ready for fitting machine learning models. The first task was to normalize the data inputs, scaling them between 0 and 1 so that differences in magnitude do not falsely inflate the importance of some features as discovered by the machine learning models. The remaining pre-processing tasks were adding yearly trend features and feature selection.
• Add yearly trends feature
Figure 3 suggests an increasing trend in the yields over time. It is evident that there is no input feature in the developed data set that can explain this observed increasing trend in the corn yields. This trend is commonly described as the effect of technological gains over time, such as improvements in genetics (cultivars), management, equipment, and other technological advances.
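The trend construction described next, a per-location linear regression of yield on year fit on training data only, can be sketched as follows (idealized data, illustrative only):

```python
import numpy as np

def yield_trend(years_train, yields_train, test_year):
    """Fit Y = b0 + b1*YEAR on training data for one location and return
    the extrapolated trend value used as the yield_trend feature for the
    test year (training data only, to avoid leakage)."""
    b1, b0 = np.polyfit(years_train, yields_train, deg=1)  # slope, intercept
    return b0 + b1 * test_year

years = np.arange(1984, 2018)                 # training years
yields = 6000.0 + 100.0 * (years - 1984)      # idealized 100 kg/ha/yr gain
trend_2018 = yield_trend(years, yields, 2018)
```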
Therefore, to account for the trend mentioned above, two measures were taken.
1) A new feature (yield_trend) was constructed that explains only the observed trend in corn yields. To build this feature, a linear regression model was fit for each location, as the trend tends to differ by site. The year (YEAR) and yield (Y) formed the independent and dependent variables of this regression, respectively, and the predicted value for each data point (Ŷ) was added as a new input variable explaining the increasing annual trend in the target variable. Only training data were used to fit this regression, and the values of the new feature for the test set were set to the model's predictions for that year. The following equation shows the trend value (Ŷ_i) calculated for each location i:

Ŷ_i = b_{0,i} + b_{1,i} · YEAR_i    (1)

2) In addition, another new feature (yield_avg) was added to the data set, explaining the average yield of each year for each state, computed on training data. Equation (2) shows how the average yield of each state j is added as a new feature:

yield_avg_j = average(yield_j)    (2)

The value of this feature for unseen test observations is set to the average of the previous year.

Figure 3: Aggregated annual yields for all locations under study and the average yields per year. The figure shows a visible, increasing trend in yields; the blue line shows the yearly increasing trend.
• Feature selection
Since the developed data set has a large number of input variables and is prone to overfitting, feature selection is necessary to build generalizable machine learning models. A three-stage feature selection procedure was performed to select the most essential features and prevent the machine learning models from overfitting the high-dimensional training data. The stages were feature selection based on expert knowledge, permutation feature selection using random forest, and correlation-based feature selection, respectively.
i. Feature selection based on expert knowledge
Using expert knowledge, the weather features were reduced by removing features for the period between the end of harvest and the beginning of the next year's planting. Additionally, the number of planting progress features was lowered by eliminating the cumulative planting progress for the weeks before planting, as they did not include useful information. Feature selection based on expert knowledge reduced the number of features from 597 to 383.
ii. Permutation feature selection with random forest
Strobl et al. pointed out that the default random forest variable importance (impurity-based) is not reliable when independent variables have different scales of measurement or different numbers of categories. This is especially important for biological and genomic studies, where independent variables are often a combination of categorical and numeric features with varying scales. Therefore, to overcome this bias and obtain decisive importance rankings of the input features, permutation feature importance was used. Permutation feature importance measures the importance of an input feature by calculating the decrease in the model's prediction error when that feature is made unavailable. To simulate the unavailability of a feature, it is permuted in the validation or test set, that is, its values are shuffled, and the effect of this permutation on the quality of the predictions is measured. If permutation increases the model error, the permuted feature is considered important, as the model relies on it for prediction; if permutation does not change the prediction error significantly, the feature is considered unimportant, as the model ignores it when making predictions. The second stage of feature selection, and likely the most effective one, consists of fitting a random forest with 100 trees as the base model and calculating the permutation importance of the input features with 100 repetitions under a random 35-fold cross-validation scheme. Afterward, the top 100 input features were selected.
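The permutation procedure just described can be demonstrated end-to-end on synthetic data; for brevity, an ordinary least-squares model stands in for the random forest base model, so the numbers are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends strongly on x0, weakly on x1, and not on x2.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=n)

# A least-squares linear model stands in for the paper's random forest.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def model_mse(X_):
    return float(np.mean((y - X_ @ beta) ** 2))

baseline = model_mse(X)
importances = []
for j in range(X.shape[1]):
    # Shuffle one feature at a time; the error increase is its importance.
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    importances.append(model_mse(X_perm) - baseline)
```

Permuting the strongly predictive feature inflates the error the most, while permuting the irrelevant feature barely changes it.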
iii. Correlation-based feature selection
Lastly, a filter-based feature selection based on Pearson correlation values was performed to avoid multicollinearity among the independent variables. At this stage, the relationship between independent variables was assumed to be linear, and one of each pair of highly correlated features (correlation higher than 0.9) was removed to prevent multicollinearity.
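A minimal sketch of this correlation filter, with a greedy keep-the-first rule (the paper does not specify which member of a correlated pair is dropped; data and names here are made up):

```python
import numpy as np

def drop_correlated(X, names, threshold=0.9):
    """Keep the first feature of each highly correlated pair and drop the
    other (absolute Pearson correlation above the threshold)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.01, size=200)   # nearly a duplicate of a
c = rng.normal(size=200)                   # independent feature
X_filtered, kept = drop_correlated(np.column_stack([a, b, c]), ["a", "b", "c"])
```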
Model selection
Tuning the hyperparameters of machine learning models and selecting the best models with optimal hyperparameter values is necessary to achieve high prediction accuracy. Cross-validation is commonly used to evaluate the predictive performance of fitted models by dividing the training set into train and validation subsets. Here, we use a random 35-fold cross-validation method to tune the hyperparameters of the ML models. Grid search is an exhaustive search method that tries all possible combinations of hyperparameter settings to find the optimal selection; it is both computationally expensive and generally dependent on the initial values specified by the user. Bayesian search addresses both issues and is capable of tuning hyperparameters faster and over continuous ranges of values. Bayesian search assumes an unknown underlying distribution and approximates the unknown function with surrogate models such as a Gaussian process. Bayesian optimization incorporates a prior belief about the underlying function and updates it with new observations, which makes it faster for tuning hyperparameters and helps find a better solution, given enough observations. In each iteration, Bayesian optimization gathers the observations with the highest information content, balancing exploration (exploring uncertain hyperparameters) and exploitation (gathering observations from hyperparameters close to the optimum). Accordingly, Bayesian search with 20 iterations under the 35-fold cross-validation procedure was selected to tune the hyperparameters.

Predictive models
In this study, we combine diverse models in different ways and create ensemble models to build a robust and precise machine learning model. One prerequisite for well-performing ensemble models is that the base learners show a degree of diversity in their predictions while preserving excellent individual performance. Thus, several base learners built with different procedures were selected and trained: linear regression, LASSO regression, Extreme Gradient Boosting (XGBoost), LightGBM, and random forest. Moreover, an average weighted ensemble that assigns equal weights to all base learners is the simplest ensemble model created. Additionally, the optimized weighted ensemble method proposed in Shahhosseini et al. was applied here to test its predictive performance. Several two-level stacking ensembles, namely stacked regression, stacked LASSO, stacked random forest, and stacked LightGBM, were also built and are expected to demonstrate excellent performance. The details of each model can be found in Shahhosseini et al.

Linear regression
Linear regression predicts a measurable response using multiple predictors. It assumes a linear relationship between the predictors and the response variable, normality, no multicollinearity, and homoscedasticity.

LASSO regression
LASSO is a regularization method with built-in feature selection; it can exclude some variables by setting their coefficients to zero. Specifically, it adds a penalty term to the linear regression loss function that shrinks coefficients towards zero (L1 regularization).

XGBoost and LightGBM
XGBoost and LightGBM are two implementations of gradient boosting, a tree-based ensemble method. These methods make predictions sequentially, combining weak predictive tree models and learning from their mistakes. XGBoost was proposed in 2016 with new features such as handling sparse data and using an approximation algorithm for better speed, while LightGBM was published in 2017 by Microsoft, with improvements in performance and computational time.

Random forest
Random forest is a tree-based ensemble model built on the concept of bagging. Bagging reduces prediction variance by averaging predictions made on samples drawn with replacement. Random forest adds a further element: randomly choosing a subset of features, constructing a tree with them, repeating this procedure many times, and averaging the predictions of all trees. Therefore, random forest addresses both the bias and variance components of the error and has proved to be powerful.

Optimized weighted ensemble
An optimization model was proposed in Shahhosseini et al. that accounts for the trade-off between the bias and variance of the predictions, as it uses the mean squared error (MSE) to form the objective function of the optimization problem. In addition, out-of-bag predictions generated by k-fold cross-validation are used as emulators of unseen test observations to create the input matrices of the optimization problem, which are the out-of-bag predictions made by each base learner. The optimization problem, a nonlinear convex problem, is as follows:

Min (1/n) Σ_{i=1}^{n} ( y_i - Σ_{j=1}^{k} w_j ŷ_{ij} )²    (3)
s.t. Σ_{j=1}^{k} w_j = 1,  w_j ≥ 0,  ∀j = 1, …, k,

where w_j is the weight corresponding to base model j (j = 1, …, k), n is the total number of instances, y_i is the actual value of observation i, and ŷ_{ij} is the prediction of observation i by base model j.
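As a concrete illustration of problem (3): for k = 2 base learners the simplex constraint reduces it to a one-dimensional quadratic with a closed-form solution (a hedged special case; the general k-learner problem requires a convex solver, and the predictions below are made up):

```python
import numpy as np

def optimal_weights_two(y, p1, p2):
    """Solve problem (3) for k = 2: with w2 = 1 - w1 the objective is a
    1-D quadratic in w1, minimized in closed form and projected onto [0, 1]."""
    d = p1 - p2
    r = y - p2
    w1 = float(np.dot(d, r) / np.dot(d, d))  # unconstrained minimizer
    w1 = min(max(w1, 0.0), 1.0)              # enforce w1 in [0, 1]
    return np.array([w1, 1.0 - w1])

y  = np.array([10.0, 12.0, 11.0, 13.0])
p1 = y + 0.5   # base learner 1 (out-of-bag predictions), biased high
p2 = y - 0.5   # base learner 2, biased low
w = optimal_weights_two(y, p1, p2)
```

With one learner biased high and the other equally biased low, the optimal weights split evenly and the weighted ensemble cancels the bias.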
Average weighted ensemble
The average weighted ensemble, which we call the "average ensemble", is a simple average of the out-of-bag predictions made by each base learner. The average ensemble can perform well when the base learners are diverse enough.

Stacked generalization
Stacked generalization combines multiple base learners by performing at least one more level of learning, which uses the out-of-bag predictions of each base learner as inputs and the actual target values of the training data as outputs. The out-of-bag predictions are generated through k-fold cross-validation and have the same size as the original training set. The steps to design a stacked generalization ensemble are as follows.
a) Learn first-level machine learning models and generate out-of-bag predictions for each of them using k-fold cross-validation.
b) Create a new data set with the out-of-bag predictions as the input variables and the actual response values of the training set as the response variable.
c) Learn a second-level machine learning model on the created data set and make predictions for unseen test observations.
Considering four predictive models as second-level learners, four stacking ensemble models were created: stacked regression, stacked LASSO, stacked random forest, and stacked LightGBM.
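Steps (a)-(c) can be sketched with numpy alone; here a mean predictor and an ordinary least-squares model serve as illustrative first-level learners, with a linear second-level learner (the study's actual base learners are the ML models described above):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=n)

def fit_linear(X_tr, y_tr):
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    return np.linalg.lstsq(A, y_tr, rcond=None)[0]

def predict_linear(beta, X_te):
    return np.column_stack([np.ones(len(X_te)), X_te]) @ beta

# Step (a): out-of-bag predictions for each base learner via k-fold CV.
k = 5
folds = np.array_split(np.arange(n), k)
oof = np.zeros((n, 2))                  # one column per base learner
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    oof[test_idx, 0] = np.mean(y[train_idx])          # learner 1: mean predictor
    beta = fit_linear(X[train_idx], y[train_idx])     # learner 2: linear model
    oof[test_idx, 1] = predict_linear(beta, X[test_idx])

# Steps (b)-(c): second-level model trained on the out-of-bag predictions.
meta = fit_linear(oof, y)
stacked = predict_linear(meta, oof)
```

The second-level model learns to weight the informative base learner heavily, so the stacked predictions beat the weaker base learner.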
Performance metrics
To evaluate the performance of the developed machine learning models, four statistical performance metrics were used.
- Root Mean Squared Error (RMSE): the square root of the average squared deviation of predictions from actual values
- Relative Root Mean Squared Error (RRMSE): RMSE normalized by the mean of the actual values
- Mean Bias Error (MBE): a measure describing the average bias in the predictions
- Coefficient of determination (R²): the proportion of the variance in the dependent variable that is explained by the independent variables
Together, these metrics provide estimates of the error (RMSE, RRMSE, MBE) and of the variance explained by the models (R²).
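The four metrics can be computed directly from their definitions; a small self-contained sketch with made-up yield values:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """RMSE, RRMSE, MBE, and R² as defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
    rrmse = rmse / float(np.mean(y_true))          # RMSE relative to the mean
    mbe = float(np.mean(y_pred - y_true))          # average bias
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, rrmse, mbe, r2

# Toy county yields in kg/ha (illustrative numbers only):
rmse, rrmse, mbe, r2 = evaluate([10000, 11000, 12000], [10100, 10900, 12000])
```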
Results

Numerical results of the hybrid simulation-ML framework
Table 2 shows the test set prediction errors of the developed ML models for the benchmark case (no APSIM variables added to the data set) and the hybrid simulation-ML case (all 22 APSIM outputs added). The relative RMSE (RRMSE) is calculated using the average corn yield of the test set (see Table 3).
Table 2: Test set prediction errors of ML models for benchmark and hybrid cases
ML model | Benchmark (no APSIM variables): RMSE (kg/ha), RRMSE (%), MBE (kg/ha), R² (%) | Hybrid simulation-ML (all 22 APSIM variables): RMSE (kg/ha), RRMSE (%), MBE (kg/ha), R² (%) | % decrease in RMSE

Training set: 1984-2017, Test set: 2018
LASSO
XGBoost
LightGBM
Random forest
Linear regression
Optimized weighted ens.
Average ensemble
Stacked regression ens.
Stacked LASSO ensemble
Stacked Random f. ens.
Stacked LightGBM ens.
Training set: 1984-2016, Test set: 2017
LASSO                    913  7.67%     6  60.59%  |  827  6.94%    77  67.70%  |  9.47%
XGBoost
LightGBM                 961  8.07%  -454  56.32%  |  717  6.02%    31  75.71%  |  25.43%
Random forest            943  7.92%  -431  57.97%  |  823  6.92%  -195  67.96%  |  12.69%
Linear regression        888  7.46%   114  62.75%  |  926  7.78%   474  59.44%  |  -4.34%
Optimized weighted ens.  874  7.34%  -235  63.92%  |  724  6.08%   116  75.21%  |  17.12%
Average ensemble         894  7.51%  -328  62.20%  |  739  6.21%   -29  74.16%  |  17.33%
Stacked regression ens.  888  7.46%  -214  62.75%  |  734  6.17%   165  74.50%  |  17.25%
Stacked LASSO ensemble   876  7.36%  -196  63.67%  |  741  6.22%   185  74.04%  |  15.46%
Stacked Random f. ens.   953  8.01%  -228  56.98%  |  826  6.94%   -50  67.75%  |  13.42%
Stacked LightGBM ens.    923  7.76%  -222  59.68%  |  795  6.68%   -72  70.11%  |  13.90%
Adding APSIM variables as input features to ML models can make a significant difference in the performance of the developed ML models, boosting ML performance by up to 29%. Looking at the average test results (Figure 4), adding APSIM-simulated values improves all designed ML models. Another observation is the superiority of the optimized weighted ensemble model compared to the other ML models. It should be noted that the negative R² value of the linear regression model when no APSIM variables are included shows that this model's predictions are worse than taking the mean value as the prediction.

Table 3: Test data summary statistics
Test year | Mean (kg/ha) | Standard deviation (kg/ha) | Number of counties
On average, XGBoost and LightGBM benefit the most from the inclusion of APSIM outputs when predicting corn yields. The optimized weighted ensemble performed well and outperformed the average ensemble in both cases. The stacking ensemble models also made good use of the newly added features, offering modest decreases in prediction error after adding the APSIM outputs. Considering the Mean Bias Estimate (MBE) values of the ML models, almost all models produced less biased predictions once APSIM information was included among their inputs.
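As a rough sketch of how an optimized weighted ensemble can be formed, the snippet below searches for non-negative weights summing to one that minimize the ensemble RMSE on a tuning set. This is an illustration only, not the authors' exact procedure, and the helper name `optimal_weights` is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_weights(preds, y_true):
    """Convex combination weights (non-negative, summing to 1) that
    minimize the RMSE of the weighted ensemble prediction.

    preds: (n_models, n_samples) array of base-model predictions.
    """
    preds = np.asarray(preds, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    k = preds.shape[0]

    def ensemble_rmse(w):
        return np.sqrt(np.mean((w @ preds - y_true) ** 2))

    res = minimize(
        ensemble_rmse,
        x0=np.full(k, 1.0 / k),  # start from the simple average ensemble
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
    )
    return res.x
```

With these weights, the ensemble prediction is simply `w @ preds`; a strongly biased base model is assigned little weight, which is consistent with the weighted ensemble outperforming the plain average ensemble.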
Figure 4: Comparing average test RRMSE of benchmark and hybrid developed ML models
Figure 5 compares X-Y plots of several of the designed ML models for the benchmark and hybrid cases, where the X-axis shows actual corn yields for 2018 and the Y-axis shows the yield predictions made by each ML model with 2018 as the test year. The advantage of including APSIM variables in the machine learning algorithms is evident in this figure.
Figure 5: X-Y plots of some of the designed models for benchmark (top) and hybrid (bottom) cases for test year 2018. The intensity of the colors shows the accumulation of the data points
Summary statistics of some of the best performing ML models are shown in Figure 6, which depicts their probability density functions along with the ground truth. The plot suggests that the ML models reproduce the probability density function of the observed yields to a reasonable extent.
Figure 6: Probability density function of ground truth compared to some of designed ML models
Partial inclusion of APSIM variables
This section investigates the effect of partially including APSIM variables, considering three different scenarios for the test year 2018: (1) include only phenology-related APSIM variables; (2) include only crop-related APSIM variables; and (3) include only soil and weather-related APSIM variables. The results indicate the relative importance of each group of APSIM variables to the prediction performance of the designed ML models.
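The scenario comparison can be mimicked with a minimal sketch: train the same model on the shared inputs plus one APSIM variable group at a time, then compare test RRMSE across groups. Everything below (the synthetic data, the group names and sizes, and the ordinary least-squares learner) is purely illustrative of the protocol, not the study's data or models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: shared base features plus three APSIM groups
n, n_train = 400, 300
base = rng.normal(size=(n, 5))
groups = {
    "phenology": rng.normal(size=(n, 2)),
    "crop": rng.normal(size=(n, 4)),
    "soil_weather": rng.normal(size=(n, 6)),
}
# Toy yield: driven by the base features and one soil/weather variable
y = (100.0 + base @ rng.normal(size=5)
     + 2.0 * groups["soil_weather"][:, 0]
     + rng.normal(scale=0.1, size=n))

results = {}
for name, block in groups.items():
    X = np.column_stack([np.ones(n), base, block])  # intercept + features
    coef, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    pred = X[n_train:] @ coef
    rmse = np.sqrt(np.mean((pred - y[n_train:]) ** 2))
    results[name] = 100.0 * rmse / np.mean(y[n_train:])  # test RRMSE (%)
```

In this toy setup only the soil/weather group carries extra signal, so that scenario achieves the lowest test RRMSE, mirroring the structure of the comparison reported in Table 4.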
Including only phenology-related APSIM variables
Phenology-related APSIM variables consist of two variables: silking date and physiological maturity date. These variables reflect the time available for plant growth, a key indicator of yield. Results demonstrate that LightGBM makes the best predictions, while the least biased predictions come from linear regression, when only phenology-related APSIM variables are included.
Including only crop-related APSIM variables
This group of APSIM variables includes crop yield, biomass, maximum rooting depth, maximum leaf area index, cumulative transpiration, crop N uptake, grain N uptake, season average water stress (both drought and excessive water), and season average nitrogen stress. These variables reflect how much dry matter the crop accumulates per day and what limits dry matter accumulation. Based on the results, random forest and linear regression make the best and the least biased predictions, respectively, when crop-related APSIM variables are used as ML inputs.
Including only soil and weather-related APSIM variables
These variables provide an additional characterization of the environmental performance of each field. The APSIM variables included are: annual evapotranspiration, growing season average depth to the water table, annual runoff, annual drainage (in subsurface tiles located at about 1 m depth and below that layer), annual gross N mineralization, total N loss accounting for leaching and denitrification, annual average water table depth, and the ratio of soil water to field capacity during the growing season at 30, 60, and 90 cm profile depths. Results show that XGBoost makes the best predictions, with the lowest prediction error as well as the least bias, when the soil and weather-related APSIM variables are used as ML inputs.
Table 4: Test set prediction errors of ML models for partial inclusion of APSIM variables (Test set is set to be the data for the year 2018)
ML model | Phenology-related: RMSE (kg/ha), RRMSE (%), MBE (kg/ha), R (%) | Crop-related: RMSE (kg/ha), RRMSE (%), MBE (kg/ha), R (%) | Soil and weather-related: RMSE (kg/ha), RRMSE (%), MBE (kg/ha), R (%)
LASSO
XGBoost
LightGBM
Random forest
Linear regression
Optimized w. ens.
Average ensemble
Stacked reg. ens.
Stacked LASSO ens.
Stacked Random f. ens.
Stacked LightGBM ens.
Table 4 presents the test set prediction errors of the designed ML models for all three scenarios of partial inclusion of APSIM variables. Based on these results, soil and weather-related APSIM variables have the most significant influence on the predictions made by ML, followed by crop-related and then phenology-related variables. This is partially explained by the fact that ML already accounts, to some degree, for phenology-related parameters, which are largely weather-driven, while the soil-related parameters are more complex quantities that ML alone cannot see. This is more evident in Figure 7. Furthermore, random forest, XGBoost, and the optimized weighted ensemble made better predictions than the other models.
Figure 7: Comparing test errors of three scenarios of partial APSIM variables inclusion. (Test data is set to be the data from the year 2018)
Variable importance
The permutation importance (see also section 2.2.2) of each of the five individual base models (LASSO regression, XGBoost, LightGBM, random forest, and linear regression) was calculated using the test data for the year 2018. Figure 8 depicts the top-10 normalized average permutation importances of the five individual ML models. Note that, because of the black-box nature of the ensemble models, only the individual learners were used to calculate permutation importance.
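The permutation importance computation can be sketched model-agnostically: permute one input column at a time and record how much the prediction RMSE increases. This is a generic illustration under stated assumptions, not the study's own implementation (see section 2.2.2); `model_predict` stands in for any fitted model's prediction function.

```python
import numpy as np

def permutation_importance(model_predict, X, y, n_repeats=10, seed=0):
    """Model-agnostic permutation importance: the average increase in RMSE
    when a single feature column is randomly permuted (n_repeats times),
    normalized so the importances sum to 1, as in Figures 8 and 9.
    Assumes the summed raw importances are positive."""
    rng = np.random.default_rng(seed)
    base_rmse = np.sqrt(np.mean((model_predict(X) - y) ** 2))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break feature-target link
            rmse = np.sqrt(np.mean((model_predict(Xp) - y) ** 2))
            increases.append(rmse - base_rmse)
        importances[j] = np.mean(increases)
    return importances / importances.sum()
```

For example, with a least-squares linear model `predict = lambda Z: Z @ coef`, features with larger true effects on the target receive larger normalized importances.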
Figure 8: Top-10 normalized average permutation importance of five individual ML models for test year 2018. Refer to Table 1 for explanation of the variables
As Figure 8 indicates, the two features created to explain the yield trend are the most important input features for the ML models. Of the next 8 most important input variables, 6 are APSIM variables added to the developed data set, while the other two are soil and weather input variables. Of those 6 APSIM variables, 4 belong to the soil and weather-related group and the other 2 to the crop-related group. This is in line with the results of the partial inclusion of APSIM variables discussed in section 3.2. To determine which APSIM features were most influential in predicting yields, the average permutation importance of the five individual models (linear regression, LASSO regression, LightGBM, XGBoost, and random forest) was calculated for each test year. Figure 9 shows the ranking of the 16 APSIM features that remained in the data set after feature selection. The ranking of features is almost identical for both test years. AvgDroughtStress and AvgWTInseason were the most important features for the machine learning models, while FlowerDate and AvgNStress were the least important APSIM features. Most of the top features are water-related, suggesting the importance of soil hydrology for crop yield prediction in the US Corn Belt.
Figure 9: Average normalized permutation importance of APSIM features for all test years. Refer to Table 1 for explanation of the variables
Discussion
We proposed a hybrid simulation-machine learning approach that provided significant improvements in predicted corn yields. To the best of our knowledge, this is the first study to design ensemble models to increase corn yield predictability. This study demonstrated that introducing APSIM variables into machine learning models and utilizing them as inputs to a prediction task can increase prediction accuracy by up to 29%. In addition, the predictions made by the hybrid model show less bias relative to actual yields. Other studies in this area have mainly been limited to coupling the simplest statistical models, i.e. linear regression variants, with simulation crop models; apart from two recent studies [25, 26], there has been no study combining machine learning and simulation crop models.
In addition to the prediction advantages achieved by coupling ML and simulation crop modeling, we investigated the effect of different types of APSIM variables on the quality of predictions and found that soil and weather-related APSIM variables contribute the most to predicting corn yields. Furthermore, the permutation importance procedure conducted in this study provided a tool to compare the importance of each ML input feature for predicting corn yields. Designing a method that enables the ML models to capture the yearly increasing trend in corn yields was the main challenge of this work. To address this challenge, two innovative features were constructed that could explain the trend to a great extent; as the variable importance results showed, they are by far the most important input features for predicting corn yields. The significant merits of coupling ML and simulation crop models shown in this study raise the question of whether the ML models can further benefit from additional input features from other sources. Hence, a possible extension of this study could be the inclusion of remote sensing data in the ML prediction task and an investigation of the level of importance each data source exhibits. It should also be acknowledged that the APSIM simulations used as inputs to the ML models leveraged the full weather of each test year. In real-world applications, the weather will be unknown and the APSIM model would need to run in a forecasting mode [9, 59, 1], introducing some additional uncertainty. This is something to be explored further in the future.
Conclusion
We demonstrated significant improvements in yield prediction accuracy (up to 29%) across all designed ML models when additional inputs from a simulation cropping systems model (APSIM) were included. Among the several APSIM variables that can be used as inputs to ML, the analysis suggested that the most important ones were those related to soil water, in particular growing season average drought stress and average depth to the water table. We conclude that the inclusion of additional soil water-related variables (whether from a simulation model, remote sensing, or other sources) could further improve ML yield prediction in the central US Corn Belt.
References
1. Archontoulis, S. V., Castellano, M. J., Licht, M. A., Nichols, V., Baum, M., Huber, I., et al. (2020). Predicting crop yields and soil-plant nitrogen dynamics in the US Corn Belt. Crop Science, 60(2), 721-738.
2. Bogard, M., Biddulph, B., Zheng, B., Hayden, M., Kuchel, H., Mullan, D., et al. (2020). Linking genetic maps and simulation to optimize breeding for wheat flowering time in current and future climates. Crop Science, 60(2), 678-699.
3. Ersoz, E. S., Martin, N. F., & Stapleton, A. E. (2020). On to the next chapter for crop breeding: Convergence with data science. Crop Science, 60(2), 639-655.
4. Washburn, J. D., Burch, M. B., & Franco, J. A. V. (2020). Predictive breeding for maize: Making use of molecular phenotypes, machine learning, and physiological crop models. Crop Science, 60(2), 622-638.
5. Asseng, S., Zhu, Y., Basso, B., Wilson, T., & Cammarano, D. (2014). Simulation modeling: Applications in cropping systems. In N. K. Van Alfen (Ed.), Encyclopedia of Agriculture and Food Systems (pp. 102-112). Oxford: Academic Press.
6. Basso, B., & Liu, L. (2019). Seasonal crop yield forecast: Methods, applications, and accuracies. In D. L. Sparks (Ed.), Advances in Agronomy (Vol. 154, pp. 201-255). Academic Press.
7. Shahhosseini, M., Martinez-Feria, R. A., Hu, G., & Archontoulis, S. V. (2019). Maize yield and nitrate loss prediction with machine learning algorithms. Environmental Research Letters, 14(12), 124026.
8. Dumont, B., Basso, B., Leemans, V., Bodson, B., Destain, J. P., & Destain, M. F. (2015). A comparison of within-season yield prediction algorithms based on crop model behaviour analysis. Agricultural and Forest Meteorology, 204, 10-21.
9. Togliatti, K., Archontoulis, S. V., Dietzel, R., Puntel, L., & VanLoocke, A. (2017). How does inclusion of weather forecasting impact in-season crop model predictions? Field Crops Research, 214, 261-272.
10. Li, Z., Song, M., Feng, H., & Zhao, Y. (2016). Within-season yield prediction with different nitrogen inputs under rain-fed condition using CERES-Wheat model in the northwest of China. Journal of the Science of Food and Agriculture, 96(8), 2906-2916.
11. Mishra, A., Hansen, J. W., Dingkuhn, M., Baron, C., Traoré, S. B., Ndiaye, O., et al. (2008). Sorghum yield prediction from seasonal rainfall forecasts in Burkina Faso. Agricultural and Forest Meteorology, 148(11), 1798-1814.
12. Manatsa, D., Nyakudya, I. W., Mukwada, G., & Matsikwa, H. (2011). Maize yield forecasting for Zimbabwe farming sectors using satellite rainfall estimates. Natural Hazards, 59(1), 447-463.
13. Drummond, S. T., Sudduth, K. A., Joshi, A., Birrell, S. J., & Kitchen, N. R. (2003). Statistical and neural methods for site-specific yield prediction. Transactions of the ASAE, 46(1), 5.
14. Khaki, S., & Wang, L. (2019). Crop yield prediction using deep neural networks. Frontiers in Plant Science, 10, 621.
15. Khaki, S., Wang, L., & Archontoulis, S. V. (2020). A CNN-RNN framework for crop yield prediction. Frontiers in Plant Science, 10, 1750.
16. Jiang, H., Hu, H., Zhong, R., Xu, J., Xu, J., Huang, J., et al. (2020). A deep learning approach to conflating heterogeneous geospatial data for corn yield estimation: A case study of the US Corn Belt at the county level. Global Change Biology, 26(3), 1754-1766.
17. Mupangwa, W., Chipindu, L., Nyagumbo, I., Mkuhlani, S., & Sisito, G. (2020). Evaluating machine learning algorithms for predicting maize yield under conservation agriculture in Eastern and Southern Africa. SN Applied Sciences, 2(5), 952.
18. Supit, I. (1997). Predicting national wheat yields using a crop simulation and trend models. Agricultural and Forest Meteorology, 88(1), 199-214.
19. Nain, A. S., Dadhwal, V. K., & Singh, T. P. (2002). Real time wheat yield assessment using technology trend and crop simulation model with minimal data set. Current Science, 82(10), 1255-1258.
20. Nain, A. S., Dadhwal, V. K., & Singh, T. P. (2004). Use of CERES-Wheat model for wheat yield forecast in central Indo-Gangetic plains of India. Journal of Agricultural Science, 142(1), 59-70.
21. Chipanshi, A., Zhang, Y., Kouadio, L., Newlands, N., Davidson, A., Hill, H., et al. (2015). Evaluation of the Integrated Canadian Crop Yield Forecaster (ICCYF) model for in-season prediction of crop yield across the Canadian agricultural landscape. Agricultural and Forest Meteorology, 206, 137-150.
22. Mavromatis, T. (2016). Spatial resolution effects on crop yield forecasts: An application to rainfed wheat yield in north Greece with CERES-Wheat. Agricultural Systems, 143, 38-48.
23. Busetto, L., Casteleyn, S., Granell, C., Pepe, M., Barbieri, M., Campos-Taberner, M., et al. (2017). Downstream services for rice crop monitoring in Europe: From regional to local scale. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(12), 5423-5441.
24. Pagani, V., Stella, T., Guarneri, T., Finotto, G., van den Berg, M., Marin, F. R., et al. (2017). Forecasting sugarcane yields using agro-climatic indicators and Canegro model: A case study in the main production region in Brazil. Agricultural Systems, 154, 45-52.
25. Everingham, Y., Sexton, J., Skocaj, D., & Inman-Bamber, G. (2016). Accurate prediction of sugarcane yield using a random forest algorithm. Agronomy for Sustainable Development, 36(2), 27.
26. Feng, P., Wang, B., Liu, D. L., Waters, C., & Yu, Q. (2019). Incorporating machine learning with biophysical model can improve the evaluation of climate extremes impacts on wheat yield in south-eastern Australia. Agricultural and Forest Meteorology, 275, 100-113.
27. Holzworth, D. P., Huth, N. I., deVoil, P. G., Zurcher, E. J., Herrmann, N. I., McLean, G., et al. (2014). APSIM – Evolution towards a new generation of agricultural systems simulation. Environmental Modelling & Software, 62, 327-350.
28. Ebrahimi-Mollabashi, E., Huth, N. I., Holzwoth, D. P., Ordóñez, R. A., Hatfield, J. L., Huber, I., et al. (2019). Enhancing APSIM to simulate excessive moisture effects on root growth. Field Crops Research, 236, 58-67.
29. Pasley, H. R., Huber, I., Castellano, M. J., & Archontoulis, S. V. (2020). Modeling flood-induced stress in soybeans. Frontiers in Plant Science, 11, 62.
30. Keating, B. A., Carberry, P. S., Hammer, G. L., Probert, M. E., Robertson, M. J., Holzworth, D., et al. (2003). An overview of APSIM, a model designed for farming systems simulation. European Journal of Agronomy, 18(3), 267-288.
31. Huth, N. I., Bristow, K. L., & Verburg, K. (2012). SWIM3: Model use, calibration, and validation. Transactions of the ASABE, 55(4), 1303-1313.
32. Probert, M., Dimes, J., Keating, B., Dalal, R., & Strong, W. (1998). APSIM's water and nitrogen modules and simulation of the dynamics of water and nitrogen in fallow systems. Agricultural Systems, 56(1), 1-28.
33. Thorburn, P. J., Meier, E. A., & Probert, M. E. (2005). Modelling nitrogen dynamics in sugarcane systems: Recent advances and applications. Field Crops Research, 92(2), 337-351.
34. Campbell, G. S. (1985). Soil physics with BASIC: Transport models for soil-plant systems. Elsevier.
35. Elliott, J., Kelly, D., Chryssanthacopoulos, J., Glotter, M., Jhunjhnuwala, K., Best, N., et al. (2014). The parallel system for integrating impact models and sectors (pSIMS). Environmental Modelling & Software, 62, 509-516.
36. Soil Survey Staff, Natural Resources Conservation Service, United States Department of Agriculture. (2019). Web Soil Survey. https://websoilsurvey.nrcs.usda.gov/
37. NASS, U. (2019). Surveys. National Agricultural Statistics Service, U.S. Department of Agriculture.
38. Cao, P., Lu, C., & Yu, Z. (2018). Historical nitrogen fertilizer use in agricultural ecosystems of the contiguous United States during 1850-2015: Application rate, timing, and fertilizer types. Earth Syst. Sci. Data, 10(2), 969-984.
39. Thornton, P. E., Thornton, M. M., Mayer, B. W., Wilhelmi, N., Wei, Y., Devarakonda, R., et al. (2012). Daymet: Daily surface weather on a 1 km grid for North America, 1980-2008. Oak Ridge National Laboratory (ORNL) Distributed Active Archive Center for Biogeochemical Dynamics (DAAC).
40. Strobl, C., Boulesteix, A.-L., Zeileis, A., & Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1), 25.
41. Altmann, A., Toloşi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: A corrected feature importance measure. Bioinformatics, 26(10), 1340-1347.
42. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
43. Molnar, C. (2020). Interpretable machine learning. Lulu.com.
44. Farrar, D. E., & Glauber, R. R. (1967). Multicollinearity in regression analysis: The problem revisited. The Review of Economics and Statistics, 49(1), 92-107.
45. Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 2951-2959.
46. Brown, G. (2017). Ensemble learning. In C. Sammut & G. I. Webb (Eds.), Encyclopedia of Machine Learning and Data Mining (pp. 393-402). Boston, MA: Springer US.
47. Shahhosseini, M., Hu, G., & Pham, H. (2020). Optimizing ensemble weights for machine learning models: A case study for housing price prediction. Paper presented at the Smart Service Systems, Operations Management, and Analytics, Cham.
48. Shahhosseini, M., Hu, G., & Archontoulis, S. V. (2020). Forecasting corn yield with machine learning ensembles. arXiv preprint arXiv:2001.09055.
49. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer.
50. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.
51. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. https://doi.org/10.1145/2939672.2939785
52. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 3146-3154.
53. Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
54. Cutler, D. R., Edwards Jr., T. C., Beard, K. H., Cutler, A., Hess, K. T., Gibson, J., et al. (2007). Random forests for classification in ecology. Ecology, 88(11), 2783-2792.
55. Peykani, P., Mohammadi, E., Saen, R. F., Sadjadi, S. J., & Rostamy-Malkhalifeh, M. Data envelopment analysis and robust optimization: A review. Expert Systems, n/a(n/a), e12534.
56. Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259.
57. Cai, Y., Moore, K., Pellegrini, A., Elhaddad, A., Lessel, J., Townsend, C., et al. (2017). Crop yield predictions - high resolution statistical model for intra-season forecasts applied to corn in the US. Paper presented at the 2017 Fall Meeting.
58. Zheng, A. (2015). Evaluating machine learning models: A beginner's guide to key concepts and pitfalls. O'Reilly Media.
59. Carberry, P. S., Hochman, Z., Hunt, J. R., Dalgliesh, N. P., McCown, R. L., Whish, J. P. M., et al. (2009). Re-inventing model-based decision support with Australian dryland farmers. 3. Relevance of APSIM to commercial crops. Crop and Pasture Science, 60(11), 1044-1056.
Author Contribution Statement
MS is the lead author; he conducted the research and wrote the first draft of the manuscript. GH secured funding for this study, oversaw the research, and reviewed and edited the manuscript. SA provided the data and guidance for the research, and also reviewed and edited the manuscript. IH prepared the APSIM data.
Legends
Figures
Figure 1. Conceptual framework of this study's objective. This study investigates the effect of coupling process-based modeling with machine learning algorithms towards improved crop yield prediction.
Figure 2. Measured (USDA-NASS) corn yields vs. simulated corn yields at the state level from 1984 to 2019 using the pSIMS-APSIM framework.
Figure 3. Aggregated annual yields for all locations under study and the average yields per year. The blue line shows the increasing yearly trend in the yields.
Figure 4. Comparing average test RRMSE of benchmark and hybrid developed ML models. All developed models reveal the superiority of the hybrid models compared to the benchmark.
Figure 5. X-Y plots of some of the designed models for benchmark (top) and hybrid (bottom) cases for test year 2018. The intensity of the colors shows the accumulation of the data points.
Figure 6. Probability density function of the ground truth compared to some of the designed ML models. The ML models can maintain the same probability density function to some extent.
Figure 7. Comparing test errors of three scenarios of partial APSIM variable inclusion (test data is set to be the data from the year 2018).
Figure 8. Top-10 normalized average permutation importance of five individual ML models for test year 2018. Refer to Table 1 for explanation of the variables.
Figure 9. Average normalized permutation importance of APSIM features for all test years. Refer to Table 1 for explanation of the variables.
Tables
Table 1. Description of all APSIM outputs added to the developed data set for building ML models.
Table 2. Test set prediction errors of ML models for benchmark and hybrid cases.
Table 3. Test data summary statistics.