2.1. Sample preparation
The biomass samples were collected from the Terai low flatland and mid-hill regions of Nepal, at altitudes ranging from 86 to 1,940 meters above sea level. The study included five fast-growing species: (1) Alnus nepalensis, (2) Pinus roxburghii, (3) Bambusa vulgaris, (4) Bombax ceiba, and (5) Eucalyptus camaldulensis. Also included were five agricultural residues: (1) Zea mays (cob), (2) Zea mays (shell), (3) Zea mays (stover), (4) Oryza sativa, and (5) Saccharum officinarum. Alnus nepalensis and Pinus roxburghii were collected from the mid-hill region; Bombax ceiba, Eucalyptus camaldulensis, and Saccharum officinarum were collected from the Terai region; and Zea mays (cob, shell, stover), Bambusa vulgaris, and Oryza sativa were collected from both the Terai and mid-hill regions.
During preparation, all collected samples except Oryza sativa were manually chopped into pieces smaller than 30 mm × 15 mm (Figure 2a), dried in the open sun, and stored in airtight aluminum bags to preserve their properties by preventing the exchange of air and moisture during transport to the Near-Infrared Spectroscopy Research Center for Agricultural Product and Food at the School of Engineering, King Mongkut’s Institute of Technology Ladkrabang, Thailand. The samples were ground using a multi-functional high-speed disintegrator (WF-04, Thai grinder, Thailand). The particle size of the ground biomass was evaluated at the Scientific and Technological Research Equipment Center (STREC), Chulalongkorn University, Bangkok, Thailand, using a Mastersizer 3000 (MAL1099267, Hydro MV).
Figure 3 shows the representative particle size distribution of the ground biomass used in this research, ranging from 0.01 to 3080 µm. The ground samples were stored in airtight plastic zip lock bags before and during the experiment.
2.2. Spectral data collection
As shown in Figure 2c,d, the ground biomass samples were placed in a glass vial (20 mm diameter and 48 mm height) and scanned using an FT-NIR spectrometer (MPA, Bruker, Ettlingen, Germany) in transflectance mode at a controlled temperature of 25 ± 2 °C. The spectrometer operated at a resolution of 16 cm⁻¹, with both the background and the sample measurements averaged over 32 scans, logging absorbance as log(1/R) over the wavenumber range of 3,595 to 12,489 cm⁻¹, where R is the diffuse reflectance detected from the ground biomass sample. Prior to scanning, the FT-NIR spectrometer was normalized by performing a background scan on a gold plate. A background scan was performed for every new ground sample, primarily to compensate for instrumental drift and for ambient influences on the measurement setup such as temperature, light, and relative humidity [12].
All the ground samples were scanned twice without changing their positions, with no NIR leakage occurring during scanning. The average absorbance of each sample at each wavenumber was used as the spectroscopic data for model development. Figure 4a shows the raw spectra of the ten different ground biomasses over the wavenumber range of 3,595 to 12,489 cm⁻¹, which were used to evaluate HHV and ultimate analysis parameters.
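The conversion to absorbance and the averaging of the two replicate scans can be sketched as follows. This is a minimal illustration, assuming that log(1/R) denotes the base-10 logarithm and that each scan is available as an array of reflectance values; the function names and the sample values are hypothetical.

```python
import numpy as np

def reflectance_to_absorbance(reflectance):
    """Convert diffuse reflectance R (0 < R <= 1) to absorbance log10(1/R)."""
    reflectance = np.asarray(reflectance, dtype=float)
    return np.log10(1.0 / reflectance)

def average_replicates(scans):
    """Average replicate scans of one sample; scans has shape (n_scans, n_wavenumbers)."""
    return np.asarray(scans, dtype=float).mean(axis=0)

# Two hypothetical replicate scans of one sample at three wavenumbers
scan_1 = reflectance_to_absorbance([0.50, 0.25, 0.10])
scan_2 = reflectance_to_absorbance([0.50, 0.20, 0.10])
mean_spectrum = average_replicates([scan_1, scan_2])
```

The averaged vector plays the role of one row of the spectral matrix used for calibration.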
2.4. Spectral preprocessing
Spectral preprocessing is an important component of NIR calibration. Spectral data were collected from ten different varieties of ground biomass, whose physical, chemical, and biological properties may vary from sample to sample. Although the raw spectra of all the biomass samples appear similar, instrumental errors, variations in light scattering during scanning, and a large number of redundant and interfering variables can introduce unwanted signals into the spectrum (Figure 4a). NIR spectral preprocessing is therefore necessary before model development to enhance spectral features, remove noise, resolve overlapping peaks and baseline shifts, handle collinearity within the spectral data, and ease data interpretation for calibration [23,24].
In this study, the raw spectrum was pre-treated with two approaches. The first was a traditional approach that used the entire raw spectrum, i.e. the full wavenumber range from 3,595 to 12,489 cm⁻¹, with either no preprocessing or one of the traditional preprocessing methods. The second was a novel multi-preprocessing approach, in which the entire wavenumber range was divided into sections and each section was pretreated with a method drawn from a combination set of preprocessing methods.
For the traditional approach, ten different types of spectrum pretreatment methods were used for calibration models. These included (1) first derivative (segment=5 and gap=5), (2) second derivative (segment=5 and gap=5), (3) constant offset, (4) SNV, (5) MSC, (6) vector normalization, (7) min-max normalization, (8) mean centering, (9) first derivative (segment=5 and gap=5) + vector normalization, and (10) first derivative (segment=5 and gap=5) + MSC.
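For orientation, the pretreatments listed above can be sketched in their common textbook forms. The definitions below are assumptions and may differ from those of the software used in the study: in particular, min-max normalization is taken as scaling each spectrum to [0, 1], constant offset as subtracting each spectrum's minimum, and the derivative as a simple gap-segment (Norris-type) scheme in which a moving average over `segment` points is followed by a centered difference whose half-width is `gap // 2`. All function names are hypothetical.

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(ref, row, 1)  # fit row ~ intercept + slope * ref
        corrected[i] = (row - intercept) / slope
    return corrected

def constant_offset(X):
    """Shift each spectrum so that its minimum is zero (assumed definition)."""
    X = np.asarray(X, dtype=float)
    return X - X.min(axis=1, keepdims=True)

def vector_normalization(X):
    """Mean-center each spectrum, then scale it to unit Euclidean length."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

def min_max_normalization(X):
    """Scale each spectrum to the range [0, 1] (assumed definition)."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    return (X - lo) / (hi - lo)

def mean_centering(X):
    """Subtract the mean spectrum of the data set from every spectrum."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

def gap_segment_derivative(X, segment=5, gap=5):
    """First derivative: moving-average smoothing, then a centered difference.

    Values within a few points of either end are affected by zero padding
    and by the difference window, and should be discarded.
    """
    X = np.asarray(X, dtype=float)
    kernel = np.ones(segment) / segment
    smoothed = np.apply_along_axis(np.convolve, 1, X, kernel, mode="same")
    h = max(1, gap // 2)
    deriv = np.zeros_like(smoothed)
    deriv[:, h:-h] = smoothed[:, 2 * h:] - smoothed[:, :-2 * h]
    return deriv

# Hypothetical spectra: one band shape with different scatter (scale and offset)
base = np.sin(np.linspace(0, 3, 60)) + 2.0
X = np.vstack([a * base + b for a, b in [(1.0, 0.0), (1.2, 0.1), (0.8, -0.05)]])

# Combination (9): first derivative followed by vector normalization
d1_vn = vector_normalization(gap_segment_derivative(X))
```

A second derivative can be obtained in this sketch by applying `gap_segment_derivative` twice. Because the three hypothetical spectra differ only in scale and offset, `msc(X)` maps all of them onto the mean spectrum, which is the behavior MSC is designed to produce.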
For the multi-preprocessing approach, the entire wavenumber range was divided into different sections and pre-treated with various pretreatment combination sets obtained from seven different preprocessing methods and marked as follows: 0 = empty (all the absorbance values = 0), 1 = raw spectra, 2 = SNV, 3 = MSC, 4 = first derivative (5,5), 5 = second derivative (5,5) and 6 = constant offset.
For the multi-preprocessing 5-range method, the entire wavenumber range was divided equally into five sections: 3,625.72–5,392.30 cm⁻¹, 5,400.02–7,166.59 cm⁻¹, 7,174.31–8,940.89 cm⁻¹, 8,948.60–10,715 cm⁻¹, and 10,722.9–12,489.48 cm⁻¹. Because the 1,154 variables spanning 3,595 to 12,489 cm⁻¹ are not evenly divisible by 5, the last four independent variables were removed from the dataset, leaving 1,150 of the 1,154 variables for model development. Similarly, for the multi-preprocessing 3-range method, the entire wavenumber range was divided into three sections: 3,594.87–5,492.59 cm⁻¹, 5,500.30–7,498.314 cm⁻¹, and 7,506.02–12,489.48 cm⁻¹.
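The arithmetic behind the 5-range split can be checked with a short sketch. It assumes an evenly spaced wavenumber axis between the quoted end points, and it assumes that the four removed variables lie at the low-wavenumber end, which is what the listed section boundaries imply (the first section starts at 3,625.72 cm⁻¹ rather than 3,594.87 cm⁻¹). The variable names are hypothetical.

```python
import numpy as np

N_VARIABLES = 1154               # variables in the full range (from the text)
N_SECTIONS = 5
START, STOP = 3594.87, 12489.48  # cm^-1, end points quoted in the text

# Assumption: the wavenumber axis is evenly spaced between the quoted end points.
wavenumbers = np.linspace(START, STOP, N_VARIABLES)

# 1154 is not a multiple of 5, so the remainder has to be dropped.
n_drop = N_VARIABLES % N_SECTIONS

# Assumption: the dropped variables are the ones at the low-wavenumber end.
kept = wavenumbers[n_drop:]
sections = np.split(kept, N_SECTIONS)  # five sections of equal size

for k, sec in enumerate(sections, start=1):
    print(f"section {k}: {sec[0]:.2f} to {sec[-1]:.2f} cm^-1 ({sec.size} variables)")
```

Under these assumptions each section holds 230 variables, and the computed boundaries agree with the listed ones to within rounding.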
Figure 4b,c show the spectra of the ground biomass obtained from the multi-preprocessing method with (b) the 5-range and (c) the 3-range method, respectively. In Figure 4b, the raw spectrum was pre-treated with the preprocessing combination set 0, 5, 1, 6, and 0, i.e. empty (zero absorbance) over 3,625.72–5,392.30 cm⁻¹, second derivative over 5,400.02–7,166.59 cm⁻¹, raw spectra over 7,174.31–8,940.89 cm⁻¹, constant offset over 8,948.60–10,715 cm⁻¹, and empty (zero absorbance) over 10,722.9–12,489.48 cm⁻¹. Similarly, in Figure 4c, the raw spectrum was pre-treated with the preprocessing combination set 0, 4, and 1, i.e. empty (zero absorbance) over 3,594.87–5,492.59 cm⁻¹, first derivative over 5,500.30–7,498.314 cm⁻¹, and raw spectrum over 7,506.02–12,489.48 cm⁻¹. The best combination set for multi-preprocessing was determined from the optimum number of LVs obtained from full cross-validation.
MATLAB-R2020b (MathWorks, USA) built-in code was used to select the optimal combination set of multi-preprocessing methods for developing a PLSR calibration model.
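To make the combination-set idea concrete, the following sketch applies one preprocessing code per section and concatenates the results. It is not the MATLAB code used in the study. It assumes equal-width sections, assumes that each method is applied within its own section, and uses `np.gradient` as a simple stand-in for the (5,5) derivatives; the names `METHODS` and `apply_combination` are hypothetical.

```python
import itertools
import numpy as np

def snv(X):
    """Standard normal variate applied row by row."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def msc(X):
    """Multiplicative scatter correction against the mean spectrum of the block."""
    ref = X.mean(axis=0)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(ref, row, 1)
        out[i] = (row - intercept) / slope
    return out

# Codes follow the numbering in the text; the derivatives are stand-ins.
METHODS = {
    0: lambda X: np.zeros_like(X),                             # empty
    1: lambda X: X.copy(),                                     # raw spectra
    2: snv,                                                    # SNV
    3: msc,                                                    # MSC
    4: lambda X: np.gradient(X, axis=1),                       # first derivative
    5: lambda X: np.gradient(np.gradient(X, axis=1), axis=1),  # second derivative
    6: lambda X: X - X.min(axis=1, keepdims=True),             # constant offset
}

def apply_combination(X, combo):
    """Split the columns of X into len(combo) equal sections and preprocess each."""
    blocks = np.split(X, len(combo), axis=1)
    return np.hstack([METHODS[code](block) for code, block in zip(combo, blocks)])

# Hypothetical data: 6 spectra with 1150 variables
rng = np.random.default_rng(0)
X = rng.uniform(0.2, 1.0, size=(6, 1150))

# The 5-range combination set 0, 5, 1, 6, 0 described for Figure 4b
X_multi = apply_combination(X, (0, 5, 1, 6, 0))

# Every candidate combination set for five sections and seven codes
all_combos = list(itertools.product(METHODS, repeat=5))
```

With seven codes and five sections there are 7⁵ = 16,807 candidate sets. In a search of this kind, each candidate would be scored by cross-validated PLSR and the best-scoring set retained.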
2.5. Model development
Model accuracy is one of the major concerns in NIRS. It can be improved by using different spectral pretreatments and appropriate data analysis methods. Various research articles on NIR modeling have concluded that PLSR is an effective and commonly used quantitative analysis technique [14,25,26,27]. Therefore, this study proposes PLSR-based models, which can handle highly collinear spectroscopic data [28], for the assessment of ground biomass properties. To meet the study objectives, the following models were developed: (1) full wavenumber range PLSR with no preprocessing and with traditional preprocessing techniques, (2) multi-preprocessing PLSR 3-range method, (3) multi-preprocessing PLSR 5-range method, (4) GA-PLSR, and (5) SPA-PLSR.
To develop the PLSR models, the total data set obtained after the removal of outliers was manually divided into an 80% calibration set and a 20% validation set, as shown in Figure 1. The calibration set was designed to include the maximum and minimum reference values, so that it spans a wider range for generating the regression model [24]. The calibration set was first subjected to full cross-validation to select the optimal number of LVs. This number gives the smallest possible standard error: too few LVs lead to underfitting, and too many lead to overfitting. If several numbers of LVs showed similar or comparable model performance, the smallest was selected for model development [29]. The PLSR models for assessing biomass properties for energy usage were created using in-house code in MATLAB-R2020b (MathWorks, USA).
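The splitting and LV-selection steps can be sketched as follows. This is an illustrative NumPy version, not the in-house MATLAB code. It assumes that the validation samples are spread evenly over the sorted reference values (the study divided the data manually), that full cross-validation means leave-one-out, and it uses a textbook NIPALS PLS1 fit. The `tolerance` parameter, which expresses "similar performance" as a relative margin on RMSECV, is hypothetical, as are the function names and the synthetic data.

```python
import numpy as np

def split_calibration_validation(y, validation_fraction=0.2):
    """Return (cal_idx, val_idx); the min and max of y always stay in calibration."""
    order = np.argsort(y)
    interior = order[1:-1]  # excludes the samples with the min and max reference value
    n_val = int(round(validation_fraction * len(y)))
    # Assumption: validation samples are spread evenly over the sorted values.
    picks = np.linspace(0, len(interior) - 1, n_val).round().astype(int)
    val_idx = interior[picks]
    cal_idx = np.setdiff1d(order, val_idx)
    return cal_idx, val_idx

def pls1_fit(X, y, n_lv):
    """Fit PLS1 by NIPALS; returns (coefficients, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        w /= np.linalg.norm(w)       # weight vector
        t = E @ w                    # scores
        p = E.T @ t / (t @ t)        # X loadings
        c = f @ t / (t @ t)          # y loading
        E = E - np.outer(t, p)       # deflate X
        f = f - c * t                # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean

def select_n_lv(X, y, max_lv, tolerance=0.0):
    """Leave-one-out CV; smallest LV count whose RMSECV is within tolerance of the best."""
    n = len(y)
    rmsecv = []
    for n_lv in range(1, max_lv + 1):
        residuals = []
        for i in range(n):
            keep = np.arange(n) != i
            model = pls1_fit(X[keep], y[keep], n_lv)
            residuals.append(y[i] - pls1_predict(model, X[i:i + 1])[0])
        rmsecv.append(np.sqrt(np.mean(np.square(residuals))))
    rmsecv = np.array(rmsecv)
    best = int(np.argmax(rmsecv <= rmsecv.min() * (1.0 + tolerance))) + 1
    return best, rmsecv

# Synthetic stand-in data: 40 samples, 30 variables, two latent factors
rng = np.random.default_rng(0)
scores = rng.normal(size=(40, 2))
X = scores @ rng.normal(size=(2, 30)) + 0.01 * rng.normal(size=(40, 30))
y = scores @ np.array([1.0, -0.5]) + 0.01 * rng.normal(size=40)

cal_idx, val_idx = split_calibration_validation(y)
n_lv, rmsecv = select_n_lv(X[cal_idx], y[cal_idx], max_lv=6, tolerance=0.05)
model = pls1_fit(X[cal_idx], y[cal_idx], n_lv=2)
rmsep = np.sqrt(np.mean((y[val_idx] - pls1_predict(model, X[val_idx])) ** 2))
```

Choosing the smallest LV count within a margin of the minimum RMSECV is one way to encode the parsimony rule described above; a margin of zero reduces it to picking the minimum.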
GA and SPA are wavelength selection methods that select the most influential wavenumbers from the spectra. Combined with PLSR, they have been shown to perform better than PLSR on the full wavenumber range alone and thus help to avoid overfitting [30,31,32]. SPA selects the variables with minimum collinearity and assesses them based on the root mean square error obtained from the validation set. In SPA, uninformative variables are eliminated until the model’s performance no longer improves [33]. GA selects variables with a minimum amount of redundant information, starting with one variable and adding a new one in each iteration so as to maximize its fitness. The model developed with GA-PLSR shows the lowest prediction error as it maximizes the fitness and covariance between the spectral and reference data [34,35]. In GA-PLSR and SPA-PLSR, the new calibration dataset was processed through full cross-validation to select the optimum number of LVs, which was then used for PLSR model development.
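The minimum-collinearity idea behind SPA can be illustrated with its projection phase alone. The sketch below is a simplified version under stated assumptions: it starts from a user-chosen column and selects a fixed number of variables, whereas a full SPA also compares candidate subsets by their validation error, a step omitted here. The function name and the toy data are hypothetical, and GA is not sketched.

```python
import numpy as np

def spa_select(X, n_select, start=0):
    """Projection phase of SPA: pick n_select columns with low mutual collinearity."""
    Xp = np.asarray(X, dtype=float)
    Xp = Xp - Xp.mean(axis=0)  # column-centered working copy
    selected = [start]
    for _ in range(n_select - 1):
        v = Xp[:, selected[-1]]
        # Project all columns onto the subspace orthogonal to the last selected one
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0  # never select the same column twice
        # The column with the largest remaining norm carries the most new information
        selected.append(int(np.argmax(norms)))
    return selected

# Toy example: column 1 is an exact multiple of column 0, column 2 is independent
rng = np.random.default_rng(1)
a, b = rng.normal(size=50), rng.normal(size=50)
X = np.column_stack([a, 2.0 * a, b])
chosen = spa_select(X, n_select=2, start=0)
```

Starting from column 0, the collinear column 1 is projected to (numerically) zero, so the independent column 2 is chosen next.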
The accuracy of an NIR model should be compared with that of the reference method. Therefore, the performance of the model was evaluated in terms of R²c, RMSEC, R²v, RMSEP, RPD, and bias [36]. These parameters are calculated from the following quantities: y is the measured value, ŷ is the predicted value, the subscript i denotes the sample number, ȳ is the mean of the measured values, N_T is the number of samples, SD is the standard deviation of the measured values of the validation set, and n is the number of samples in the validation set.
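For reference, commonly used forms of these statistics are sketched below. The exact expressions used in the study are not reproduced here, so the details are assumptions: R² is taken as 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², RMSE as the square root of the mean squared residual (RMSEC on the calibration set, RMSEP on the validation set), bias as the mean of predicted minus measured values (the sign convention varies between authors), and RPD as SD / RMSEP with the sample standard deviation. The function names and data are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def rmse(y, y_hat):
    """Root mean square error (RMSEC or RMSEP, depending on the data set)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def bias(y, y_hat):
    """Mean of predicted minus measured values (assumed sign convention)."""
    return float(np.mean(np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float)))

def rpd(y, y_hat):
    """Ratio of the SD of the measured validation values to RMSEP."""
    return np.std(np.asarray(y, dtype=float), ddof=1) / rmse(y, y_hat)

# Hypothetical validation set
y_measured = np.array([1.0, 2.0, 3.0, 4.0])
y_predicted = np.array([1.1, 1.9, 3.2, 3.8])

stats = {
    "R2v": r_squared(y_measured, y_predicted),
    "RMSEP": rmse(y_measured, y_predicted),
    "RPD": rpd(y_measured, y_predicted),
    "bias": bias(y_measured, y_predicted),
}
```

Applying the same `r_squared` and `rmse` functions to the calibration set gives R²c and RMSEC.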
The better model was selected based on the trade-off between the highest R²c, R²v, and RPD and the lowest RMSEC, RMSEP, and bias. In this study, the R² and RPD values were interpreted based on the recommendations of Williams et al. (2019) [37] and Zornoza et al. (2008) [38], respectively.
As per the recommendations of Williams et al. (2019), R² values up to 0.25 are not usable for NIRS calibration; 0.26–0.49 indicates poor calibration, and the reasons for this should be researched; 0.50–0.64 is considered okay for rough screening; 0.66–0.81 is okay for rough screening and some other appropriate calibrations; 0.83–0.90 is usable with caution for most applications, including research; 0.92–0.96 is usable in most applications, including quality assurance; and 0.98+ is excellent and can be used in any application [37]. Similarly, according to Zornoza et al. (2008), an RPD value of less than 2 is considered insufficient for applications; an RPD between 2 and 2.5 makes approximate quantitative predictions possible; an RPD between 2.5 and 3 is considered good for prediction; and an RPD greater than 3 indicates excellent prediction [38].
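These interpretation bands can be encoded as a small lookup, as in the sketch below. The labels paraphrase the guidelines above. Two choices are assumptions: R² values that fall in the gaps between the listed bands (for example 0.65 or 0.97) are assigned to the lower band, and an RPD of exactly 2.5 or 3 is assigned to the lower of the two adjacent classes. The function names are hypothetical.

```python
# Lower bounds of the R² bands, highest first, with paraphrased labels
R2_CLASSES = [
    (0.98, "excellent; usable in any application"),
    (0.92, "usable in most applications, including quality assurance"),
    (0.83, "usable with caution for most applications, including research"),
    (0.66, "rough screening and some other appropriate calibrations"),
    (0.50, "rough screening"),
    (0.26, "poor calibration"),
]

def interpret_r2(r2):
    """Map an R² value to its band; values in a gap fall into the lower band."""
    r2 = round(r2, 2)
    for lower_bound, label in R2_CLASSES:
        if r2 >= lower_bound:
            return label
    return "not usable"

def interpret_rpd(rpd):
    """Map an RPD value to its class; boundary values go to the lower class."""
    if rpd < 2.0:
        return "insufficient"
    if rpd <= 2.5:
        return "approximate quantitative prediction"
    if rpd <= 3.0:
        return "good prediction"
    return "excellent prediction"

# Hypothetical model results
summary = (interpret_r2(0.95), interpret_rpd(3.2))
```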