DETAILED DESCRIPTION OF THE RESEARCH PLAN
1.1 Background and rationale
High-resolution spatial soil information with known accuracy and precision is crucially important, not only for addressing global issues such as food security and land degradation (McBratney et al., 2003; Shepherd et al., 2015), but also at community and local levels, where it helps extension agents and smallholder farmers to take decisions about land management interventions (Stoorvogel et al., 2015). The successes of digital soil mapping (DSM) in providing such information are ascribed to many factors, including recent technological and computational advances, the availability of high-resolution remote sensing and other auxiliary data, the advancement of proximal soil sensing methods (such as diffuse reflectance spectroscopy) for soil measurements, and the development of machine learning (ML) algorithms (Minasny and McBratney, 2016; Hengl et al., 2017).
DSM studies have been carried out extensively in many parts of the globe, with a geographical focus on Europe and Asia (Lamichhane et al., 2019). Few DSM studies have been conducted in Africa, probably due to limited resources (Hengl et al., 2015; van Zijl, 2019). Most DSM studies ignore the fact that soil measurements are not perfect, and only consider the limited predictive power of environmental covariates and spatial interpolation error as sources of uncertainty. Despite the potential of proximal soil sensing (PSS) in generating soil measurements, with several advantages over conventional laboratory soil analyses (Shepherd and Walsh, 2007; Viscarra Rossel and Bouma, 2016), soil spectral data are associated with measurement errors that lower their accuracy and may affect the quality of DSM outputs (Heuvelink, 2018; Somarathna et al., 2018). In addition, spectral data ultimately still depend on the analytical data used for calibration of PSS prediction models, which also suffer from measurement errors, both within and between laboratories (Shepherd and Walsh, 2002; Viscarra et al., 2016a).
Further development of DSM approaches that account for uncertainties in soil measurements (both analytical and spectral) is therefore required to improve the reported accuracies of the final outputs, especially in tropical mosaic landscapes such as those of West and Central Africa, where accurate soil information is lacking (Akpa et al., 2014; Hengl et al., 2015). Quantification of the uncertainty in DSM products is very important for decision makers and land users, as decisions based on inaccurate soil information can have extensive and profound impacts on the design and implementation of site-specific land management strategies (Bone et al., 2014; Takoutsing et al., 2017) and soil productivity improvement measures, such as fertilizer application.
1.2 Research problem, research objectives and research questions
1.2.1 Research problem
The increasing demand for accurate soil spatial information to efficiently manage agronomic recommendations such as fertilizer application has led to the increased use of proximal soil sensing (PSS) methods (e.g. spectral data) for the development and application of DSM (Stoorvogel et al., 2015). During the procedural steps that generate the soil measurements used in DSM, errors can be introduced and then propagated through the mapping process to the final predictions. With this recognition, the analysis of uncertainties in soil measurements has become a subject of interest (Minasny and McBratney, 2016); failing to consider this issue may lead to suboptimal models and systematic overestimation of map accuracy (Heuvelink and Burrough, 1993; Malone et al., 2015; Poggio et al., 2016; Heuvelink, 2018).
In addition, the spatial support of most DSM maps often does not match that required by decision makers and end users, who typically need predictions at a larger ‘block’ support (e.g., fields), whereas most soil measurements are taken at ‘point’ support. DSM models have been applied successfully at different supports and in different types of landscapes, regardless of the methods used (Hengl et al., 2015; van Zijl, 2019), but it has not yet been analysed how changes in support affect the uncertainties of DSM outputs. In sub-Saharan Africa, agricultural landscapes are managed under “blanket” agronomic recommendations, even though there is often considerable intrinsic spatial micro-variation in soil nutrients due to several factors, in particular historic management (Tittonell et al., 2015). This increases the risk of over- and under-application of fertilizers, leading both to undesirable environmental effects and to increased production costs.
The current growth in technological and computational power and the availability of low-cost, high-resolution environmental covariates are expected to improve the quality of DSM model outputs by providing additional information about the spatial distribution of soil properties (Behrens et al., 2018). It is expected that these advances will lead to more accurate and precise (less uncertain) models. It is important to determine to what extent uncertainties in soil measurements are propagated through a DSM model, and to assess their effects on the accuracy of the final products and on end users’ decision-making processes aimed at improving land management interventions and enhancing sustainable agricultural intensification.
1.2.2 Research Objectives
The main objective of this research is to incorporate uncertainties in soil measurements in state-of-the-art DSM approaches and to assess the usefulness of a realistic quantification of the accuracy of soil spatial information in improving sustainable agricultural intensification. Generating soil spatial information with associated accuracies entails a combination of soil sampling, laboratory analyses, proximal soil measurements (spectral data), selection of environmental covariates, geostatistical modelling, interpolation, and delivery of soil information to end users at various spatial scales. These requirements result in several specific objectives, which are to:
1) Assess the usefulness of Mid Infrared Reflectance Spectroscopy (MIRS) in supporting kriging with external drift DSM approaches with uncertain data and evaluate the trade-off between a limited number of accurate samples versus many samples with high uncertainties in predictive DSM models;
2) Extend the calibration and prediction of DSM models using uncertain soil measurements from linear kriging with external drift to non-linear Machine Learning Algorithms-based DSM models;
3) Analyse and evaluate how a change in spatial support affects the uncertainty of the DSM model outputs in the context of uncertain input data;
4) Analyse the required accuracy level of site-specific soil information for improving soil productivity and enhancing sustainable agricultural intensification in smallholder farming systems.
1.3 Research methodology
1.3.1 Study site
The research will be implemented in two climatically contrasting landscapes located in Central African countries: 1) the western highlands of Cameroon; and 2) the Sahelian savannah zone of Chad.
a) Western Highlands of Cameroon (WHC)
The study site is a largely agrarian area which spans 1,053 km² in the western highlands of Cameroon. The area is dominated by subsistence agricultural systems, where smallholder farmers grow a range of annual and perennial crops. Annual rainfall varies from 1500 to 1800 mm, with a rainy season from March to October that peaks in August, and a dry season between November and February. The topography is undulating, with altitude ranging between 1200 and 1800 masl. The vegetation is predominantly savannah with patches of gallery and montane forests. The mean daily minimum and maximum temperatures are 18 and 28 °C, respectively. The main soil type is a relatively infertile Ferralsol (IUSS Working Group WRB, 2015).
b) Sahelian savannah zone of Chad (SSC)
The site spans 19,000 km² and is located in the Sahelian zone of the Republic of Chad, often termed the Sudanese agroecological zone. The production system is extensive and based on subsistence farming on small traditional family farms of 2 to 5 hectares. Cereals represent the largest share of food crops grown, but production levels are low and highly dependent on climatic conditions, which are increasingly variable. The climate is of continental type with a dry and hot season. The rainy season is from June to September, with an average rainfall between 300 and 800 mm, characterized by high spatio-temporal variation. The average monthly temperature varies between 28 and 45 °C. The dominant soil type is Vertisol (IUSS Working Group WRB, 2015).
1.3.2 Experimental design and soil sampling
The data were obtained using a spatially stratified sampling approach based on the concept of sentinel sites (10 km × 10 km) as developed within the AfSIS project (http://africasoils.net/). The site locations were established using convenience sampling, while accounting for the major Koeppen-Geiger climate zones of the study area to capture the variation of landscape conditions (i.e., feature space coverage). Each sentinel site is subdivided into 16 square sampling units (‘grid cells’), within which 10 sampling plots (1000 m²) are randomly established, leading to 160 sampling plots per sentinel site (Vågen et al., 2010). Soil samples are collected within each plot at four locations and pooled together to obtain composite samples at 0–20 cm depth. The sampling strategy is strongly spatially clustered into sentinel sites. This has disadvantages from a modelling perspective but was done because of logistical, accessibility and safety constraints. To address this shortcoming, the targeted area for predictions in each of the landscapes will be carefully delineated to minimize extrapolation during the mapping exercise. To avoid severe extrapolation, we will make use of the regressor variable hull concept (Steinbuch et al., 2016). Since statistical validation of DSM maps greatly benefits from probability sampling, validation data will be collected for the study area in Chad using stratified random sampling.
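The layout of one sentinel site can be illustrated with a minimal Python sketch. It assumes that the 16 grid cells form a regular 4 × 4 arrangement of 2.5 km cells and that plot centres are drawn uniformly at random within each cell; both are illustrative assumptions, and the function name and coordinate system are hypothetical. Plot extent, accessibility and safety constraints are ignored.

```python
import random

SITE_SIZE_M = 10_000   # edge length of a sentinel site (10 km)
CELLS_PER_SIDE = 4     # assumption: the 16 grid cells form a 4 x 4 layout
PLOTS_PER_CELL = 10    # sampling plots per grid cell


def draw_sentinel_site_plots(seed=0):
    """Return a list of (cell_id, x, y) plot centres in site coordinates (metres)."""
    rng = random.Random(seed)
    cell_size = SITE_SIZE_M / CELLS_PER_SIDE
    plots = []
    for row in range(CELLS_PER_SIDE):
        for col in range(CELLS_PER_SIDE):
            cell_id = row * CELLS_PER_SIDE + col
            for _ in range(PLOTS_PER_CELL):
                # Uniform random location inside the current grid cell.
                x = col * cell_size + rng.uniform(0, cell_size)
                y = row * cell_size + rng.uniform(0, cell_size)
                plots.append((cell_id, x, y))
    return plots


plots = draw_sentinel_site_plots(seed=42)
print(len(plots))  # 16 cells x 10 plots
```

Keeping the cell identifier with each plot makes it straightforward to hold out whole grid cells or whole sentinel sites later, during cross-validation.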
1.3.2 Analytical methods
A subset of the calibration samples (approximately 10%) is subjected to conventional laboratory analyses for determination of the targeted soil properties. These properties include texture, pH, soil organic matter and nitrogen. Next, all soil samples (n = 160 × number of sentinel sites) are processed to obtain fine earth samples (<2 mm), which are ground to <100 μm with an agate mortar and pestle and analysed by Mid Infrared diffuse Reflectance Spectroscopy (MIRS) following standard procedures described in Terhoeven-Urselmans et al. (2010). The data are split into training and test sets, using a k-fold cross-validation approach. Regression models are used to predict targeted soil properties for all samples based on the calibration samples (Sila et al., 2016). Soil measurements obtained through conventional laboratory methods are referred to as analytical data, while soil values derived from MIR spectroscopy are referred to as spectral data.
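The k-fold split of the calibration samples can be sketched as follows. This is a minimal illustration in plain Python, not the procedure of the cited studies; the function name, the number of folds and the sample count (16, i.e. roughly 10% of the 160 plots of one sentinel site) are assumptions chosen for the example.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs that partition the samples."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


splits = k_fold_indices(16, k=4, seed=1)
for train, test in splits:
    print(len(train), len(test))
```

Each sample is held out exactly once, so the residuals of the held-out predictions can be pooled over the folds to summarise the error of the spectral calibration.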
1.3.3 Remote sensing data and environmental covariates
Environmental covariates will be carefully selected to characterise the environmental conditions and soil forming factors (climate, terrain, vegetation and parent material) in the study areas and prepared following procedures described in Hengl et al. (2017). Emphasis will be laid on globally and publicly available remote sensing-based covariates at finer resolutions (30–250 m). The remote sensing images will include the SRTM and/or ALOS W3D Digital Elevation Model (DEM) at 30 m, MODIS, Landsat 7 and 8 satellite images, Landsat-based Global Surface Water (GSW), and the USGS’s global bare surface image. Examples of covariate layers derived from the images are elevation, annual average temperature, annual average precipitation, annual average evaporation, moisture index, slope gradient, topographic wetness index, Normalized Difference Vegetation Index, parent material and soil information from SoilGrids250m (ISRIC-World Soil Information).
1.3.4 Methodologies for the various objectives
Objective 1: Assess the usefulness of Mid Infrared Reflectance Spectroscopy (MIRS) in supporting kriging with external drift DSM approaches and evaluate how the resulting soil maps are affected by uncertainties in soil measurements
We will use partial least squares regression (PLSR) to establish relationships between laboratory and soil spectral data to generate spectral soil measurements together with their associated measurement error variances. The uncertainties in soil measurements will be quantified by probability distributions. Uncertainties in the laboratory data will be quantified using either the measurement error variances stated by the laboratory or the results of the Wageningen Evaluating Programs for Analytical Laboratories. Uncertainties in the spectral data will be quantified using the residual error variance of the PLSR model and incorporated in the variogram or covariance structure of the spatial model. Initially, spatial and cross-correlations in measurement errors will be ignored, so that including measurement error boils down to adding the error variances to the diagonal of the spatial covariance matrix (Somarathna et al., 2018). It is thus assumed that uncertainties of soil measurements are summarized by their variances, enabling the predictor to “filter out” the effect of measurement errors from the data (Delhomme, 1978; Wadoux et al., 2019). In other words, kriging with external drift (KED) will weight measurements according to their accuracy, and both the predictions and the prediction error variances at prediction locations will account for measurement error.
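The effect of adding the error variances to the diagonal can be written out for the standard universal kriging (KED) equations. The notation below is introduced here for illustration only and is an assumption, not taken from the cited works: z is the vector of n measurements, X the matrix of covariates at the measurement locations, C the covariance matrix of the spatially correlated residual, and \sigma_i^2 the measurement error variance of observation i. The target is the error-free soil property at a prediction location s_0.

```latex
% Measurement model: trend + spatially correlated residual + independent measurement error
z(s_i) = \mathbf{x}(s_i)^{\top}\boldsymbol{\beta} + \varepsilon(s_i) + \delta_i,
\qquad \delta_i \sim \mathcal{N}(0, \sigma_i^2)

% Covariance of the data: measurement error variances enter on the diagonal only
\boldsymbol{\Sigma} = \mathbf{C} + \mathbf{D},
\qquad \mathbf{D} = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2)

% Generalised least squares estimate of the trend coefficients
\hat{\boldsymbol{\beta}} =
\left(\mathbf{X}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{X}\right)^{-1}
\mathbf{X}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{z}

% Prediction at s_0, with c_0 the covariances between \varepsilon(s_0) and \varepsilon(s_i)
\hat{y}(s_0) = \mathbf{x}_0^{\top}\hat{\boldsymbol{\beta}}
+ \mathbf{c}_0^{\top}\boldsymbol{\Sigma}^{-1}\left(\mathbf{z} - \mathbf{X}\hat{\boldsymbol{\beta}}\right)

% Prediction error variance, with c(0) the variance of \varepsilon(s_0)
\sigma^2(s_0) = c(0) - \mathbf{c}_0^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{c}_0
+ \left(\mathbf{x}_0 - \mathbf{X}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{c}_0\right)^{\top}
\left(\mathbf{X}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{X}\right)^{-1}
\left(\mathbf{x}_0 - \mathbf{X}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{c}_0\right)
```

An observation with a large \sigma_i^2 inflates its own diagonal entry of \Sigma and therefore receives less weight, which is the sense in which measurements are weighted according to their accuracy. Setting D = 0 recovers the usual equations that ignore measurement error.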
Since the accuracy of the spatial predictions can also be influenced by the technique used for model parameter estimation, variogram parameters and regression coefficients of the spatial trend will be estimated directly from the data, using the residual maximum likelihood (REML) method (Viscarra et al., 2016b). Here, measurement errors in the soil data will also be accounted for. Next, the measurement errors will be incorporated in KED and applied to predict soil properties in the targeted landscape using predictors from various sources, including remote sensing-based and DEM-based covariates. The kriging results obtained will be compared under two scenarios: 1) soil property map based on analytical and spectral data whilst ignoring measurement errors, and 2) soil property map based on analytical and spectral data taking measurement error into account.
The quality of the predictions will be assessed using the KED variance and cross-validation, whereby the accuracy of the predictions is assessed by comparing predicted values with actual measurements at validation points. Standard cross-validation, leave-grid-cell-out cross-validation and leave-sentinel-site-out cross-validation will all be used. This is done because we expect that standard cross-validation will produce over-optimistic results, while leave-sentinel-site-out cross-validation will produce over-pessimistic results. Thus, comparison of cross-validation statistics between approaches will provide valuable insights. Three common statistical validation measures will be used: (1) the mean prediction error (ME); (2) the root mean squared error (RMSE); and (3) the Modelling Efficiency (Janssen and Heuberger, 1995). These indices are affected by errors in the validation data, which will be accounted for when evaluating the accuracy of the predictions (Heuvelink, 1998b). Accuracy plots will also be derived to visualize the quality of the estimated prediction uncertainty (Goovaerts, 2001; Wadoux et al., 2018).
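The three validation measures and the grouped hold-out schemes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the error is defined as predicted minus observed (a sign convention chosen here), the example values are made up, and the function names are hypothetical. No correction for measurement error in the validation data is included.

```python
import math


def validation_metrics(observed, predicted):
    """Return ME, RMSE and modelling efficiency (MEC) for paired values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]  # predicted minus observed
    sum_sq_err = sum(e * e for e in errors)
    mean_obs = sum(observed) / n
    sum_sq_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum_sq_err / n),
        "MEC": 1.0 - sum_sq_err / sum_sq_tot,
    }


def leave_group_out_splits(groups):
    """Hold out one group (e.g. a grid cell or a sentinel site) at a time."""
    splits = []
    for group in sorted(set(groups)):
        test = [i for i, g in enumerate(groups) if g == group]
        train = [i for i, g in enumerate(groups) if g != group]
        splits.append((group, train, test))
    return splits


observed = [2.0, 4.0, 6.0, 8.0]      # illustrative values only
predicted = [2.5, 3.5, 6.5, 8.5]
metrics = validation_metrics(observed, predicted)
print(metrics)

site_of_sample = ["A", "A", "B", "C", "C"]  # hypothetical sentinel-site labels
group_splits = leave_group_out_splits(site_of_sample)
```

Passing grid-cell labels instead of sentinel-site labels to `leave_group_out_splits` gives the leave-grid-cell-out variant, so the same code covers both grouped schemes.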
Objective 2: Extend the calibration and prediction of DSM models under soil input data uncertainty from linear kriging with external drift to ML-based DSM models
Under Objective 2, the study will extend the calibration and prediction of DSM models with uncertain soil measurements from kriging with external drift to non-linear machine learning (ML) regression methods. The focus will be on Random Forest Regression (RFR), applied with several environmental covariates as predictors, while accounting for multicollinearity, i.e. overlap of information among the predictors. To minimize this, principal component analysis may be applied, so that uncorrelated transformed covariates are used in place of the original predictors. The innovative element of this study is that RFR for DSM will be extended to the case in which the calibration data have measurement errors. One way of doing that is to use Monte Carlo simulation (Malone et al., 2015; Heuvelink, 2018), although more efficient approaches will also be explored.
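One simple Monte Carlo variant can be sketched as follows: the calibration values are repeatedly perturbed with noise drawn from their measurement error distributions, the model is refitted, and the spread of the resulting predictions is summarised. This is only a sketch under stated assumptions. A least-squares line on a single covariate stands in for the random forest so that the example needs no external library; independent Gaussian errors are assumed; and all names and data are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """Placeholder model (NOT a random forest): ordinary least-squares line."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x_new: intercept + slope * x_new


def monte_carlo_predictions(x, y, error_sd, x_new, fit=fit_line, n_sim=200, seed=0):
    """Refit the model on perturbed calibration data; return mean and sd of predictions."""
    rng = random.Random(seed)
    simulations = []
    for _ in range(n_sim):
        # Perturb every calibration value with its own measurement error sd.
        y_sim = [yi + rng.gauss(0.0, sd) for yi, sd in zip(y, error_sd)]
        model = fit(x, y_sim)
        simulations.append([model(xn) for xn in x_new])
    columns = list(zip(*simulations))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return means, sds


x = [0.0, 1.0, 2.0, 3.0, 4.0]        # illustrative covariate values
y = [1.0, 3.0, 5.0, 7.0, 9.0]        # illustrative soil measurements
error_sd = [0.5] * len(x)            # assumed measurement error standard deviations
means, sds = monte_carlo_predictions(x, y, error_sd, x_new=[5.0])
print(means, sds)
```

Any model with the same `fit(x, y)` interface could be substituted for `fit_line`. The spread obtained this way reflects only the contribution of measurement error in the calibration data, not the other sources of prediction uncertainty.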
Cross-validation of the final maps and their associated uncertainty will be carried out, i.e. the standard RFR model that ignores measurement error will be applied and compared to the model that accounts for uncertainty in soil measurements. This will be done using cross-validation methods similar to those used in Objective 1.
Objective 3: Analyse and evaluate how change of support affects the uncertainty of the model outputs
The motivation for Objective 3 is to generate soil spatial information that meets the requirements of various stakeholders, such as farmers, extension workers, and policy makers. Soil measurements are composite soil samples from four points within a 1000 m² plot, while users require information at a larger support. We will apply spatial upscaling (aggregation) using block kriging approaches, or arithmetic averaging of predictions within the spatial ‘block’ when using RFR. Performing kriging on a larger area (block) rather than on a smaller one (point or plot) will lead to smoothing and will decrease the associated prediction error variances. Maps resulting from various block kriging applications (with and without accounting for uncertainties of soil measurements) will be produced for each of the spatial supports, and the effects of change of support on the uncertainties of the model outputs quantified.
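How much averaging within a block reduces the prediction error variance depends on how strongly the prediction errors within the block are correlated. The minimal sketch below computes the block mean and the variance of its error from a covariance matrix of the point prediction errors. It assumes that such a matrix is available, which is generally not the case for RFR without further modelling; the function name and all values are illustrative.

```python
def block_mean_and_variance(predictions, error_cov):
    """Average point predictions in a block and propagate their error covariance.

    The variance of the block-mean error is the sum of all entries of the
    error covariance matrix divided by n squared.
    """
    n = len(predictions)
    block_mean = sum(predictions) / n
    block_var = sum(sum(row) for row in error_cov) / (n * n)
    return block_mean, block_var


predictions = [1.0, 2.0, 3.0, 4.0]   # illustrative point predictions in one block

# Case 1: independent prediction errors, each with variance 4.
independent = [[4.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
mean_ind, var_ind = block_mean_and_variance(predictions, independent)

# Case 2: perfectly correlated prediction errors, each with variance 4.
correlated = [[4.0 for _ in range(4)] for _ in range(4)]
mean_cor, var_cor = block_mean_and_variance(predictions, correlated)

print(var_ind, var_cor)
```

The two cases bracket the effect: with independent errors the variance drops by a factor equal to the number of points, whereas with perfectly correlated errors averaging brings no reduction at all.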
Accuracies of maps resulting from block kriging will be assessed using independent validation. For this, a probability sample of block support validation data will be collected. Here, block support validation data will be estimated using a composite sample from each validation block, where the sampling locations within the block are again derived using probability sampling. This sampling error will be taken into account when the population MSE is estimated, and the associated estimation error quantified. Moreover, measurement errors in the validation data will be accounted for in the validation analysis.
Objective 4: Analyse the required accuracy level of site-specific soil information for improving soil productivity and enhancing sustainable agricultural intensification in smallholder farming systems.
Objective 4 will make use of the soil spatial information generated in previous objectives with their associated uncertainties to formulate agronomic recommendations that support sustainable agricultural intensification. The information in the form of maps will be reported at various spatial supports to ease usage and interpretation at different scales. Stakeholders will be informed on the effects of a realistic quantification of uncertainties in soil information on land management practices and crop productivity. Analyses of the level of accuracy of soil maps that is required to generate useful soil information, as well as the analysis of crop yields, will be carried out using a decision support tool capable of accommodating maps at various supports (Dias et al., 2016). The Decision Support System for Agrotechnology Transfer (DSSAT) is one such model that has been widely used to simulate crop growth, soil water balance, soil carbon and soil nitrogen dynamics under different crop systems, management practices and climatic conditions (He et al., 2016; Jiang et al., 2019; Malik et al., 2019). This will be followed by an assessment of how uncertainties in soil maps propagate through the decision support tool using Monte Carlo simulation (Heuvelink, 1998a). Thus, N possible realities are drawn from the uncertain DSSAT soil input using spatial stochastic simulation, DSSAT is run for each of these realisations, and the propagated uncertainty is derived from the variability in the N DSSAT outputs. All this will be done under two scenarios: (1) soil input maps derived while ignoring measurement errors; and (2) soil input maps derived while accounting for measurement errors.
The DSSAT model also requires non-soil input parameters known to affect crop responses to fertilizer application, such as climate, rainfall distribution and crop management conditions (e.g., cropping system, planting time, tillage, fertilization and irrigation) (MacCarthy et al., 2018). These will also be assembled for a selected study area in Chad. In addition, DSSAT model parameters will be derived from data gathered in field experiments, i.e. a split-plot experimental design with replications.
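The Monte Carlo propagation loop can be sketched as follows. The sketch is deliberately simplified and rests on several assumptions: a hypothetical toy yield function stands in for a DSSAT run, a single soil input at a single location is drawn from an independent Gaussian distribution instead of a spatial stochastic simulation, and all numbers are illustrative only, with no claim about which scenario yields the larger uncertainty in practice.

```python
import random
import statistics


def toy_crop_model(soil_organic_carbon, nitrogen_rate):
    """Hypothetical stand-in for one crop model run; this is NOT DSSAT."""
    return 1.0 + 0.8 * soil_organic_carbon + 0.01 * nitrogen_rate


def propagate_soil_uncertainty(pred_mean, pred_sd, model, n_sim=500, seed=0, **model_args):
    """Draw n_sim soil input realisations, run the model on each, summarise the outputs."""
    rng = random.Random(seed)
    outputs = sorted(
        model(rng.gauss(pred_mean, pred_sd), **model_args) for _ in range(n_sim)
    )
    return {
        "mean": statistics.fmean(outputs),
        "sd": statistics.stdev(outputs),
        "p05": outputs[int(0.05 * (n_sim - 1))],   # approximate 5th percentile
        "p95": outputs[int(0.95 * (n_sim - 1))],   # approximate 95th percentile
    }


# Two illustrative soil inputs with the same predicted value but different
# prediction standard deviations (made-up numbers).
narrow = propagate_soil_uncertainty(2.0, 0.2, toy_crop_model, nitrogen_rate=50.0)
wide = propagate_soil_uncertainty(2.0, 0.3, toy_crop_model, nitrogen_rate=50.0)
print(narrow["sd"], wide["sd"])
```

Running the same loop on the soil inputs of the two scenarios allows the resulting output distributions to be compared directly, for instance through their standard deviations or 90% intervals.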
Model simulated outcomes will be compared under the above-mentioned scenarios and interpreted based on the effects on site-specific land management and fertilizer application. This is expected to demonstrate how soil spatial information with associated uncertainties can be used in decision tools to support policies that enhance sustainable agricultural intensification practices.