1. Introduction
Wheat is one of the most important crops globally, with approximately 35%-40% of the world's population relying on it as a primary food source. It contributes approximately 21% of food energy and 20% of protein intake. Against a backdrop of population growth and climate change, early and accurate estimation of wheat yield is essential for safeguarding national food security and maintaining living standards [1,2]. Conventionally, yield prediction has depended primarily on field observation and investigation, which is time-consuming and laborious, susceptible to subjective bias, and can even damage the crop [3]. In recent years, remote sensing technology has been widely applied in agricultural monitoring. It enables the effective acquisition of canopy spectral data from aerial platforms, thereby facilitating the estimation of crop yields [4,5]. Furthermore, unmanned aerial vehicle (UAV)-based remote sensing has developed rapidly owing to its flexibility and high resolution [6].
Vegetation indices (VIs) derived from UAV images have proven effective in predicting crop yields. Spectral, structural, thermal infrared (TIR), and texture features extracted from UAV sensor data can be used to assess various plant traits and structures [7]. For instance, low-altitude UAVs were used to capture RGB images of potato canopies at two distinct growth stages to predict yield [8]. The normalized difference vegetation index (NDVI), monitored rapidly with a multispectral (MS) UAV platform during the wheat grain-filling stage, correlated strongly with wheat grain yield [9]. Texture information extracted from UAV images can effectively reflect the spatial variations in pixel intensity, thereby emphasizing the structural and geometric characteristics of the plant canopy [10]. The potential of UAV TIR imaging for assessing crop water stress and predicting grain yield in different wheat varieties has also been validated [11]. However, most studies rely on data from a single sensor to estimate crop yields, overlooking the advantages of combining multiple sensors. For example, combining features derived from MS, RGB, and TIR imaging can significantly improve the accuracy of soybean yield prediction [7]. Combining canopy TIR information with spectral and structural characteristics can improve the robustness of crop yield prediction across diverse climatic conditions and developmental stages [12]. In particular, applying machine learning (ML) techniques to multi-sensor data collected by UAVs can significantly enhance the accuracy of crop yield predictions [13]. On this basis, ML algorithms can be combined with VIs extracted from sensor spectral images to build yield prediction models, providing strong support for precision agriculture practice [14,15].
At present, a variety of ML methods have been applied to yield prediction, such as random forest (RF) [16], partial least squares (PLS) [17], ridge regression (RR) [18], K-nearest neighbor (KNN) [19], and eXtreme Gradient Boosting (XGBoost) [20]. However, the predictions of the same model may vary significantly across crops and environments, primarily owing to data quality, model representation, and the dependencies between input and target variables in the collected dataset [21]. If the data are biased or the chosen model overfits the dataset, the model will not perform accurately [22]. Ensemble learning, an active research topic, has been proposed to address these challenges. Its objective is to integrate data fusion, data modeling, and data mining into a cohesive framework [23]. The ensemble learning paradigm known as stacked regression linearly combines various predictors to enhance prediction accuracy [24,25]. The feature-weighted ensemble method assigns weights according to the correlation of features and estimates the degree of correlation between each feature and the extracted output model [26,27,28,29]. In this study, we employ a feature-weighted ensemble learning approach that assigns weights to the training dataset generated by the primary learners, based on the prediction accuracy of each individual learner. The meta-learner is then trained on these weighted data to enhance the overall model's learning efficiency. To further improve model performance, we introduce a third ensemble layer, the simple average ensemble method. This method averages the test-set predictions of the stacking ensemble and the feature-weighted ensemble and compares the average with the measured values to assess the effect of the third-layer ensemble.
The primary objective of this study was to explore the use of UAV-based remote sensing data acquired 21 days after wheat flowering to predict wheat yield. Specifically, we aimed to (1) evaluate and fuse UAV-based yield prediction data from RGB, MS, texture, and TIR sources; and (2) compare the accuracy of the base learners (RF, PLS, RR, KNN, and XGBoost) and three ensemble methods (stacking, feature-weighted, and simple average) for yield prediction, and then select the optimal approach.
2. Materials and Methods
2.1. Experiment Location and Design
Two hundred and seventy recombinant inbred lines (RILs) from the cross Zhongmai 578/Jimai 22 were planted at the research site of the Chinese Academy of Agricultural Sciences (35°18′0″N, 113°52′0″E) in Xinxiang, Henan Province, China, during the 2021-2022 growing season. The experiment used a randomized complete block design with three replications under full and limited irrigation treatments. Both treatments were irrigated at the seedling and overwintering stages; the full irrigation treatment was additionally flood-irrigated at the greening, jointing, and early grain-filling stages. Each plot measured 3.6 m² (1.2 m × 3 m) and comprised six rows spaced 0.20 m apart. Planting density was maintained at 270 plants/m², and agronomic management followed local practice. After maturity, the plots were harvested with a combine harvester. The grain was weighed after drying to a moisture content of less than 12.5%.
2.2. Multi-Sensor Image Acquisition and Processing Based on UAV
Data for all traits were acquired with an M210 UAV platform (SZ DJI Technology Co., Shenzhen, China). RGB and TIR images were captured by the same sensor (Zenmuse XT2 camera, SZ DJI Technology Co., Shenzhen, China) at resolutions of 4000 × 3000 and 640 × 512 pixels, respectively. The MS sensor (Red-Edge MX camera, MicaSense, Seattle, USA) captured 1280 × 960 pixel images in five bands: blue, green, red, red edge, and near infrared (NIR), with wavelengths of 475 nm, 560 nm, 668 nm, 717 nm, and 842 nm, respectively. The aerial surveys were carried out 21 days post-anthesis because of the proven high accuracy of yield predictions during this period [13]. All flights were conducted between 10:00 and 14:00 under clear skies, with route parameters set in DJI Pilot software as follows: forward and side overlap of 90% and 85%, respectively, and a flight altitude of 30 m.
Pix4D Mapper Pro 4.5.6 (Pix4D, Lausanne, Switzerland) was used to perform radiometric correction and image stitching on the UAV RGB, TIR, and MS images, yielding visible and TIR orthophotos and a five-band orthophoto reflectance map. The reflectance images were imported into ArcGIS 10.8.1 (Environmental Systems Research Institute, Inc., Redlands, USA) for cropping; each plot was delineated as a region of interest, and features were extracted to calculate the VIs used in this study. The detailed process is shown in Figure 1. To minimize the impact of noise and improve the efficiency of subsequent processing, non-target areas were excluded from the acquired MS images. Pix4D Mapper was used to perform image stitching, shading correction, and digital number (DN) processing on the filtered MS data, converting them into TIFF images of spectral reflectance. Radiometric calibration was conducted before and after each flight using a dedicated calibration panel. The TIR data were then calibrated against a blackbody reference to determine the temperature corresponding to each pixel value in the TIR imagery.
2.3. Extraction of Vegetation and Texture Index
As a metric for evaluating crop physiological parameters, VIs can effectively reflect the real-time growth status of crops [30]. Ten color indices and eleven MS VIs were selected, as shown in Table 1.
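As an illustration of how a VI is computed from band reflectance, the sketch below calculates the NDVI mentioned in the Introduction from NIR and red reflectance and averages it over one plot. The reflectance values and the names `ndvi`, `nir_plot`, and `red_plot` are hypothetical; the indices actually used in this study are those listed in Table 1.

```python
import numpy as np

def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red), computed element-wise on reflectance arrays."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    total = nir + red
    # Pixels where both bands are zero are assigned 0 instead of dividing by zero.
    return np.divide(nir - red, total, out=np.zeros_like(total), where=total != 0)

# Hypothetical 2 x 2 reflectance patches for one plot (NIR and red bands).
nir_plot = np.array([[0.50, 0.45], [0.48, 0.52]])
red_plot = np.array([[0.05, 0.06], [0.04, 0.05]])

# One feature value per plot: the mean NDVI over the region of interest.
plot_mean_ndvi = float(ndvi(nir_plot, red_plot).mean())
print(round(plot_mean_ndvi, 3))
```

Averaging over the region of interest gives a single value per plot, which is the form in which such indices would enter a yield model.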
In addition to spectral information, texture features are another important source of remote sensing information and are less susceptible to external environmental factors. They reflect the gray-level properties of an image and their spatial relationships, and can thereby improve the inversion accuracy of single spectral information sources that may suffer from saturation. Texture features also enhance, to a certain extent, the potential for inverting physicochemical parameters [31]. In ENVI 5.3, the widely used gray level co-occurrence matrix (GLCM) was applied to extract 40 texture features from the R, G, and B bands of the RGB images and the red-edge and NIR bands of the MS images. The region of interest was then delineated on the texture feature images of each band in ArcGIS 10.8.1 (Figure 1).
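To make the GLCM idea concrete, the following minimal sketch builds a symmetric, normalized co-occurrence matrix for a horizontal one-pixel offset and derives two common texture statistics from it. It is not the ENVI implementation: the toy image, the offset, the number of gray levels, and the choice of contrast and homogeneity are all illustrative assumptions, since this section does not list which statistics were extracted.

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized gray level co-occurrence matrix for one pixel offset."""
    matrix = np.zeros((levels, levels), dtype=float)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                # Count the pair (gray level here, gray level at the offset pixel).
                matrix[image[r, c], image[r2, c2]] += 1
    matrix += matrix.T           # count each pair in both directions
    return matrix / matrix.sum() # convert counts to joint probabilities

def contrast(p):
    """Large when neighboring pixels differ strongly in gray level."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))

def homogeneity(p):
    """Large when neighboring pixels have similar gray levels."""
    i, j = np.indices(p.shape)
    return float(np.sum(p / (1.0 + (i - j) ** 2)))

# Hypothetical 4 x 4 image already quantized to four gray levels (0..3).
toy = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
p = glcm(toy, levels=4)
print(contrast(p), homogeneity(p))
```

In practice the matrix is computed in a moving window over each band, so every statistic becomes a texture image from which plot-level values can be extracted.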
Principal component analysis (PCA) is a data mining technique from multivariate statistics. It converts high-dimensional data into low-dimensional data through dimensionality reduction while preserving most of the information in the data [32]. Using PCA, we transformed the initial 40 texture features into three principal components, each a linear combination of the original features. Each principal component captures a portion of the information in the original features, so the data can be represented in a lower-dimensional space while preserving as much of their variance as possible. These three principal components were therefore taken as representative of the most significant texture features in the dataset (Figure 1).
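The reduction from 40 texture features to three components can be sketched with a singular value decomposition, as below. The data are synthetic, the function name `pca_reduce` is hypothetical, and standardizing the features before PCA is an assumption; the section does not state which preprocessing or software was used for this step.

```python
import numpy as np

def pca_reduce(features, n_components=3):
    """Project standardized features onto their leading principal components."""
    # Standardize each column so features on large scales do not dominate.
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = z @ vt[:n_components].T
    # Share of total variance carried by each retained component.
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
texture = rng.normal(size=(270, 40))  # synthetic stand-in: 270 plots x 40 texture features
scores, explained = pca_reduce(texture, n_components=3)
print(scores.shape)
```

The columns of `scores` are mutually uncorrelated, which is one reason principal components are convenient inputs for regression models.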
2.3. Ensemble Learning Framework
In ML, each algorithm has its own strengths. Ensemble learning achieves superior generalization performance by combining the advantages of various ML algorithms [51]. Three ensemble methods were used in this study. The first was stacking regression, a heterogeneous ensemble learning model first introduced by Wolpert in 1992 [52]. Here it integrated the predictive strengths of five base models: RF, PLS, RR, KNN, and XGBoost. First, the dataset was partitioned into an 80% training subset and a 20% testing subset. Each base model was trained independently on the training subset using 10-fold cross-validation, and its predictions were generated for the testing subset. These predictions were then used as input features for the meta-model. RR served as the meta-model, learning how to combine the predictions of the base models into a final ensemble prediction. During training, cross-validation was used to tune the hyperparameters of the meta-model in order to improve its generalization. After training, the stacking model was used to predict the test set, enabling evaluation of the model's overall performance (Figure 2).
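The stacking procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: only two base learners (a closed-form ridge regression and a KNN regressor) stand in for the five models, the data are synthetic, and the meta-model is fitted on out-of-fold predictions from 10-fold cross-validation, which is the common stacking recipe. All function names are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef

def ridge_predict(model, X):
    coef, intercept = model
    return X @ coef + intercept

def knn_fit(X, y, k=5):
    """KNN regression has no training step; it just stores the data."""
    return X, y, k

def knn_predict(model, X):
    X_train, y_train, k = model
    dists = np.linalg.norm(X[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Stand-ins for the five base learners named in the text.
BASE_LEARNERS = [(ridge_fit, ridge_predict), (knn_fit, knn_predict)]

def stack(X_train, y_train, X_test, n_folds=10, seed=0):
    """Fit base learners, build out-of-fold meta-features, fit a ridge meta-model."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_train)), n_folds)
    oof = np.zeros((len(y_train), len(BASE_LEARNERS)))
    test_preds = np.zeros((len(X_test), len(BASE_LEARNERS)))
    for j, (fit, predict) in enumerate(BASE_LEARNERS):
        for held_out in folds:
            mask = np.ones(len(y_train), dtype=bool)
            mask[held_out] = False
            # Predict each fold with a model that never saw that fold.
            model = fit(X_train[mask], y_train[mask])
            oof[held_out, j] = predict(model, X_train[held_out])
        # Refit on the full training subset to predict the testing subset.
        test_preds[:, j] = predict(fit(X_train, y_train), X_test)
    meta = ridge_fit(oof, y_train)  # ridge regression as the meta-model
    return ridge_predict(meta, test_preds)

# Synthetic data with an 80% / 20% split.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
split = int(0.8 * len(y))
y_pred = stack(X[:split], y[:split], X[split:])
print(y_pred.shape)
```

Fitting the meta-model on out-of-fold predictions keeps it from rewarding base learners that merely memorized the training data.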
The second approach was feature-weighted ensemble learning. Its essence lay in assigning a distinct weight to each base learner according to its predictive performance. Each base model was trained on the training set, and its coefficient of determination (R²) was computed on the testing set. The R² values then served as the basis for allocating weights (Figure 2).
The third approach proposed in this study was simple average ensemble learning, where the predictions obtained from stacking regression and the feature-weighted ensemble method on the testing set were averaged. Then, the R² score was computed between the averaged predictions and the true values of the testing set (Figure 2).
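The third layer amounts to an element-wise mean of two prediction vectors followed by scoring. A minimal sketch is given below; the yield values and both prediction vectors are hypothetical, and R² is assumed to be the standard coefficient of determination.

```python
import numpy as np

# Hypothetical measured yields and test-set predictions from the two ensembles.
y_true = np.array([6.0, 7.0, 8.0, 9.0])
stacking_pred = np.array([6.4, 6.8, 8.2, 8.6])
weighted_pred = np.array([6.0, 7.4, 7.6, 9.0])

# Third layer: simple average of the two ensemble predictions.
averaged_pred = (stacking_pred + weighted_pred) / 2.0

# R² of the averaged predictions against the measured values.
ss_res = np.sum((y_true - averaged_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_average = float(1.0 - ss_res / ss_tot)
print(round(r2_average, 2))
```

In this toy case the errors of the two ensembles partly cancel, which is the effect simple averaging relies on.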
2.4. Model Performance Evaluation
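In this study, R², root-mean-square error (RMSE), and normalized root-mean-square error (NRMSE) were selected as the indices for evaluating the prediction accuracy of the base learners. The formulas are as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\bar{y}} \times 100\%$$
where $y_i$ and $\hat{y}_i$ are the measured and predicted values of wheat yield, respectively, $\bar{y}$ is the mean value of measured yield, and $n$ is the sample size.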
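The three metrics can be computed as in the sketch below. It assumes that R² is the standard coefficient of determination and that NRMSE is the RMSE normalized by the mean measured yield and expressed as a percentage; the function names and the yield values are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as the yield."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def nrmse(y_true, y_pred):
    """RMSE as a percentage of the mean measured value (assumed normalization)."""
    return 100.0 * rmse(y_true, y_pred) / float(np.mean(y_true))

# Hypothetical measured and predicted yields.
measured = [6.0, 7.0, 8.0, 9.0]
predicted = [6.5, 6.5, 8.5, 8.5]
print(r2_score(measured, predicted), rmse(measured, predicted), nrmse(measured, predicted))
```

RMSE is unit-dependent, whereas NRMSE allows errors to be compared across datasets with different mean yields.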
The weight allocation formula is as follows:
$$w_i = \frac{R_i^2}{\sum_{j=1}^{T} R_j^2}$$
where $w_i$ is the weight of the $i$-th primary learner, $i = 1, 2, \ldots, T$; $T$ is the number of primary learners; $R_i^2$ is the R² of the $i$-th primary learner; and $R_j^2$ is the R² of the $j$-th primary learner.
This formula transforms the R² scores of the base models into weights that sum to 1. A base model with stronger predictive performance therefore receives a higher weight and contributes a larger proportion to the ensemble prediction.
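A short numerical sketch of the weighting follows. The R² values and base-learner predictions are hypothetical, and scaling each learner's prediction column by its weight before training the meta-learner is one plausible reading of "weighted data"; the exact weighting step is not spelled out in this section. The sketch also assumes every R² is positive, since negative scores would yield meaningless weights.

```python
import numpy as np

def r2_weights(r2_scores):
    """Normalize positive R² scores so that the weights sum to 1."""
    scores = np.asarray(r2_scores, dtype=float)
    if np.any(scores <= 0):
        raise ValueError("weights are only defined here for positive R² scores")
    return scores / scores.sum()

# Hypothetical testing-set R² for the five base learners.
r2_by_model = {"RF": 0.60, "PLS": 0.55, "RR": 0.50, "KNN": 0.45, "XGBoost": 0.40}
weights = r2_weights(list(r2_by_model.values()))

# Hypothetical base-learner predictions: one row per plot, one column per learner.
base_predictions = np.array([[7.0, 6.8, 7.2, 6.5, 7.1],
                             [8.1, 8.0, 7.9, 8.4, 8.2]])

# Assumed weighting step: scale each learner's column by its weight.
weighted_features = base_predictions * weights
print(dict(zip(r2_by_model, np.round(weights, 2))))
```

With these values the weights are 0.24, 0.22, 0.20, 0.18, and 0.16, so the learner with the highest R² contributes most.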