1. Introduction
Fire investigation is considered one of the most challenging forensic science disciplines. Current gas chromatographic and spectroscopic analytical methods in fire investigation cannot discriminate or individualize petrol sources based on class compounds within petrol samples from the fire scene that indicate the country of origin, refinery (source), natural weathering/age and/or fire exposure. As petrol is one of the Petroleum Products (PP) most commonly used as an Ignitable Liquid (IL) in fire investigation, it was the primary concern of this study [1]. The characterization and identification of petrol samples is a crucial challenge in the scientific investigation of fire, as the current reference data relating to petrol do not reflect the broad range of petrol composition [2]. Identifying individual compounds in petrol contributes to understanding its complex chemical composition and its additives and blending agents, as refineries do not reveal the exact composition of their petrol. Those compounds have not been previously identified due to their volatility and trace amounts in the petrol mixture. Gas Chromatography (GC)-Mass Spectrometry (MS) analysis of ILs, using chemometric analysis to compare unevaporated, evaporated and “on substrate” petrol samples from stations across the UK, displayed very similar chromatographic patterns regardless of petrol grade or type; hence, discrimination by grade, type or brand could be very challenging [3]. The author used Principal Component Analysis (PCA) to target C2-C4 alkylbenzenes; the PCA achieved grouping of petrol brands based on their grade (premium and regular). Hierarchical Cluster Analysis (HCA) was applied to the data and revealed no substantial clustering based on petrol type or brand. However, the HCA dendrogram demonstrated a linkage of the samples according to their degree of evaporation [3].
A method based on Gas Chromatography (GC)-Flame Ionization Detector (FID) analysis combined with an artificial neural network (ANN) algorithm was explored for discrimination of petrol brands from five petrol stations in Spain based on the entire chromatogram [2]. It was concluded that, although there were no significant variations in the chromatograms, the different petrol samples could be classified mathematically according to their brand. The author suggested that the difference that potentially contributed to the discrimination was the content of oxygenates and of hydrocarbon groups such as aromatics and olefins. In that experiment, only native petrol samples were considered for identification purposes and no identification of specific compounds was made [2].
Research by Monfreda and Gregori [4] offered promising results where unevaporated samples from different petrol sources were correctly grouped based on aromatic compounds. In addition, Barrett, et al. [5] used Direct Analysis in Real Time-Mass Spectrometry (DART-MS) combined with a Partial Least Squares-Discriminant Analysis (PLS-DA) model to classify petrol sources on different substrates; however, the petrol samples were assigned to already identified classes rather than to unknown classes.
Even though many spectroscopic and chromatographic techniques have been considered, it can be concluded that identifying the source of ILs recovered from a fire scene remains a challenging and ongoing research area. Therefore, there is a need for individualization and classification of petrol sources to enhance evidential value.
Nuclear Magnetic Resonance (NMR) is a spectroscopic method that studies the nuclei of atoms within a molecule and their chemical environment. NMR spectroscopy can be sufficient to completely determine the structure of an unknown molecule and to differentiate between isomers or related compounds, which can be difficult using GC-MS. Various NMR pulse sequences allow complex spectra to be dissected by focusing on individual small spectral regions and extracting the spectra of those coupled spin systems that have a resonance within that region, even when their spectra are severely overlapped. Therefore, NMR spectroscopy is capable of extracting the sub-spectra of an individual component from highly complex spectra without prior separation [6].
A simple 1H NMR method has been proven successful in determining petrol composition and some individual compounds with rapid and accurate analysis. Further investigation of NMR applications in the petroleum industry showed that 1H NMR coupled with PCA, k-NN (k-Nearest Neighbors), HCA and SIMCA (Soft Independent Modeling of Class Analogy) is a useful tool for categorizing petrol samples with adulteration (solvents), fuel additives and blends, petroleum mixtures (kerosene and diesel mixtures) and petrol samples with different octane numbers [7,8,9,10,11]. The primary application of high-field NMR spectroscopy in the petroleum industry was quality control of hydrocarbon classes in a sample rather than of individual compounds in the overly crowded, complex spectra. 1H NMR methods coupled with clustering and multivariate classification techniques were used for the successful identification of adulteration between two types of samples. The potential of NMR spectroscopy for structural elucidation of petrol components in a sample is established.
Considering the application of NMR in various scientific fields, forensic NMR is still in the early stages of development, with a particular focus on the chemical composition of single compounds. A 1H NMR method has been combined with statistical analysis to identify the chemical “fingerprint” of cocaine samples and to link cocaine samples based on this information. It was concluded that the NMR method could establish a link between seized samples obtained at different locations or in the possession of different individuals. The relative ratios of the minor components in coca leaf are closely associated with plant varietal, cultivar and agronomic differences that can be exploited for the assignment of geographical origin, at least when suitable authentic databases are available [12]. One disadvantage of 1H NMR is that it is generally less selective than MS. Peak overlaps from multiple detected compounds pose major challenges in the complex 1H NMR spectrum of petrol. Therefore, band-selective sequences, including selective (sel) TOCSY and pure shift, are recommended; these use tailored pulses that narrow the excitation bandwidth to the region of interest in order to obtain information for a single spin system.
Machine learning has proven beneficial in many fields of forensic science, such as public safety, image and video analysis, image recognition, gunshot detection, firearms identification, 3D crime scene reconstruction, analysis of large volumes of digital data, building statistical evidence, handwriting identification, time-since-death estimation, dental age estimation and personal identification through dental findings [13], sex determination of skeletal remains, 3D facial reconstruction from unidentified skulls, cybercrime and digital evidence detection [14], bloodstain pattern analysis [15] and pattern recognition involving pattern evidence such as bite marks, lip prints, bullet marks, tool marks, shoe prints and fingerprint comparison and identification, with greater accuracy and ultimately higher speed than human experts [16,17].
The objective of this work, using high-field (600 MHz) NMR spectroscopy, was to uniquely individualize and discriminate aliquot petrol sources based on 1) source (origin of the crude oil), 2) refinery processes and procedures (blending agents) and 3) brand (additive package). Within forensic science, the identification and classification of petrol sources could help police forces in the investigation of various fuel offenses, including arson, motor vehicle incidents, environmental spillage, fuel smuggling and petrol bomb related incidents. Therefore, the objective of the study also included individualization and discrimination of weathered (evaporated) and Ignitable Liquid Residue (ILR) samples (fire debris residues), to reflect the petrol samples collected at a fire scene. This study develops an automated classification model that uses machine learning to individualize and classify an unknown native or fire debris petrol sample based on the class characteristics of a source.
An automated hierarchical classification model, using local classifiers for each leaf to predict petrol sources, is described in this paper, and the experimental results and limitations of the model are discussed. The key contributions of this paper are: 1) developing an automated classification model that can successfully classify petrol sources; 2) providing machine learning and statistical analysis results to support opinion-based decision making when identifying petrol samples in fire debris analysis; and 3) creating a new dataset of different petrol sources from the UK and Ireland.
2. Materials and Methods
The main steps of the study methodology were NMR analysis of petrol; data acquisition; data pre-processing; feature selection; and the design, training, optimization, and evaluation of the classification model.
2.1. Materials
This study used 58 petrol samples that represented British Petroleum (Mainland (M) and Scotland (S)), Jet, Esso, Texaco, and Shell sources across petrol stations in the UK and Ireland. To address the issues associated with evaporation and matrix interferences, the experimental protocol included the analysis of 1) evaporated petrol samples (per the laboratory protocol described below) and 2) simulated fire debris petrol samples burnt to 50% of the original weight. For each petrol brand collected, a set of three evaporated samples was generated. In a dry bath at approximately 25 °C (room temperature), 10 mL of neat petrol from each source was pipetted, in triplicate, into 15 mL plastic tubes and placed under a nitrogen stream until the evaporation percentages were approximately 25%, 50%, 75% and 90%, corresponding to volume reductions of 2.5 mL, 5.0 mL, 7.5 mL, and 9.0 mL, respectively. The samples were prepared for analysis by diluting in non-deuterated cyclohexane. Finally, petrol samples (2 mL) were burnt to 50% of their original weight, both on their own and on a substrate (flooring material, carpets, fabrics, and paper materials), and subsequently extracted by immersing the substrate in cyclohexane. To impartially compare the NMR method for the discrimination of neat, weathered and burnt petrol samples with the current laboratory method, Automated Thermal Desorption (ATD)-Gas Chromatography-Mass Spectrometry (GC-MS) (an in-house method developed and used by Eurofins Forensic Services to analyze ILs and their residues for interpretation of volatile compounds and ignitable liquids), a set of neat, evaporated, burnt and fire debris samples was created. The samples were prepared by an independent laboratory examiner/analyst and included different brands of neat petrol samples, weathered petrol samples, and burnt petrol samples on different substrates. The neat and extracted weathered petrol samples were deposited into a glass vial and sealed. The corresponding burnt-on-substrate samples were collected and packed into a control nylon bag (Table 1).
2.2. Data Acquisition
The data in this paper were acquired using a Bruker high-field 600 MHz NMR spectrometer with a 5 mm diameter broadband inverse probe. IconNMR software was used to set up the NMR experiments and control data acquisition. Two one-dimensional data sets were acquired: 1) neat petrol, using a simple single-pulse sequence (zg30 from the Bruker library), and 2) evaporated (due to limited volume) and burnt petrol samples in cyclohexane, using a solvent suppression pulse sequence (NOESY). A pulse sequence program (seldigpzs from the Bruker library) was used for the acquisition of 1H sel (selective) TOCSY. Data were collected with 64k points as the size of the free induction decay (fid), a spectral width of 20.0 ppm, a mixing time of 0.06 s, an acquisition time of 2.7 s, a pre-scan delay of 6.5 s and a minimum of 16 scans for neat petrol samples. The acquisition parameters are based on the default pulse sequences in the Bruker library. The 1H selTOCSY was performed on the following bands of chemical shift: 4.65ppm-4.72ppm (olefin set 1), 4.73ppm-4.85ppm (olefin set 2), 4.95ppm-5.10ppm (olefin set 3) and 5.10ppm-5.35ppm (olefin set 4). The couplings are resolved and provide assignment of the chemical species. The four discriminative sets of olefins were identified as 3-methyl-1-butene by irradiating the signal at 4.64ppm-4.72ppm, a mixture of 3-methyl-1-butene and 1-pentene by irradiating the signal at 4.73ppm-4.85ppm, 2-methyl-2-butene by irradiating the signal at 4.95ppm-5.10ppm and a mixture of cis- and trans-2-pentene by irradiating the signal at 5.10ppm-5.35ppm.
For the double-blind study, the exhibits were analyzed by headspace-ATD-GC-MS with a Tenax TA sorbent sampling tube. A 1 mL headspace sample was taken from within the packaging after a period of incubation at circa 100 °C. Interpretation of results was based on pattern recognition and comparison of the chromatograms obtained from evidential items with reference standards. Where possible, comparison against a reference of the relevant liquid was preferable, but if not possible, the sample was compared to the laboratory reference database or published literature.
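As a quick consistency check on the acquisition parameters above, the reported acquisition time can be recomputed from the number of points and the spectral width. This is only a sketch: it assumes that "64k" means 65536 points counted as real plus imaginary (the usual Bruker TD convention) and a nominal 1H frequency of exactly 600 MHz; the function names are illustrative.

```python
# Sketch: cross-check the reported acquisition time (about 2.7 s).
# Assumptions (not stated in the text): 64k = 65536 points counted as
# real + imaginary, and a nominal 1H frequency of 600 MHz.

def spectral_width_hz(sw_ppm, spectrometer_mhz):
    """Convert a spectral width in ppm to Hz (1 ppm = spectrometer_mhz Hz)."""
    return sw_ppm * spectrometer_mhz


def acquisition_time_s(td_points, sw_hz):
    """Acquisition time for td_points total points: AQ = TD / (2 * SW)."""
    return td_points / (2.0 * sw_hz)


sw_hz = spectral_width_hz(20.0, 600.0)      # 20 ppm spectral width
aq = acquisition_time_s(64 * 1024, sw_hz)   # 64k points
print(f"SW = {sw_hz:.0f} Hz, AQ = {aq:.2f} s")  # SW = 12000 Hz, AQ = 2.73 s
```

Under these assumptions the result agrees with the acquisition time of 2.7 s quoted for the neat petrol samples.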
2.3. Data Pre-Treatment and Pre-Processing
The 1H NMR spectrum of petrol is complex, consisting of multiple detectable and overlapping peaks. The position, intensity, and spectral width of the peaks of interest significantly affect the quality of the NMR spectrum and its subsequent interpretation. The acquired 1H NMR and 1H TOCSY data were processed with MestReNova (version 10.1.0 LITE-SE) software, where different processing parameters were applied to achieve the most efficient data set. Processing included 1) chemical shift referencing, 2) phasing, 3) baseline correction, 4) sub-spectral selection and filtering, 5) normalization and 6) binning (Figure 1).
To achieve optimal and robust chemical shift referencing, an internal reference was applied to the single protonated peak at 7.05 ppm of the benzene ring using Bruker Topspin software version 3.6.5. This peak was chosen because it is a single peak, lies at the end of the aromatic region and is clearly resolved from other signals of interest. Phase correction was performed in MestReNova. Auto phasing was selected, which performs a zero-order phase correction on the whole spectrum through the PH0 algorithm in the processing parameters of the software. Thereafter, all NMR spectra were manually inspected for any phase distortions. Several baseline correction algorithms were available in MestReNova. The most frequently recommended automated baseline algorithm is the Bernstein polynomial fit, in which the baseline is modelled by a polynomial curve [18]. For this experiment, the automatic baseline correction function in MestReNova, which applies a Bernstein polynomial fit algorithm to the frequency domain of the NMR data, was selected. Drift correction was used to remove a baseline offset of the spectrum resulting from a non-zero integral of the fid, which produces zero-frequency spikes in the spectrum. MestReNova applied this automatically, by default, using the common procedure of averaging the last 5% of the points in the fid and subtracting this average from the rest of the fid.
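The drift correction described above can be illustrated with a minimal sketch. It is not the MestReNova implementation: the function name, the synthetic signal and the choice to subtract the offset from every point are assumptions made for illustration.

```python
import math


def drift_correct(fid, tail_fraction=0.05):
    """Subtract the mean of the last `tail_fraction` of points from every point.

    Illustrative only: assumes the signal has decayed to noise in the tail,
    so the tail mean estimates the constant (DC) offset.
    """
    n_tail = max(1, int(len(fid) * tail_fraction))
    offset = sum(fid[-n_tail:]) / n_tail
    return [x - offset for x in fid]


# Synthetic decaying signal with an artificial DC offset of 0.3 (hypothetical data)
fid = [math.exp(-t / 20.0) * math.cos(0.5 * t) + 0.3 for t in range(400)]
corrected = drift_correct(fid)

# After correction, the tail of the signal averages to approximately zero
tail_mean = sum(corrected[-20:]) / 20
```

Removing the offset in the time domain avoids a spike at zero frequency after Fourier transformation.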
Subsequently, the 1H NMR and 1H TOCSY data were binned with a 0.01ppm bin size using MestReNova for every petrol sample, and the binned data were saved as comma-separated values (*.csv) files. However, not all the variables in the binned data were relevant for discrimination. Visual inspection of the 1H NMR and 1H TOCSY spectra showed ranges of the spectrum that did not contain any information; these were treated as background noise to be filtered out of the binned data. The chemical shift ranges that did not contain any spectral information, i.e. no couplings, were 2.70ppm-3.10ppm, 3.50ppm-3.90ppm, 4.20ppm-4.50ppm and 5.70ppm-6.50ppm. The position of the relevant couplings in the binned data was significant for the investigation of the variables; for that reason, the spectrum-free regions of the data were not omitted or deleted but conditioned so that they were not considered by the classification model. To filter out the noise and spectrum-free regions, any bin with a numerical value <1 was set to 0; otherwise the bin kept its actual value.
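The binning and filtering rule can be sketched as follows. This is a simplified illustration rather than the MestReNova procedure: it assumes that a bin value is the sum of the intensities falling inside the bin, and the function names and data points are hypothetical.

```python
import math


def bin_spectrum(ppm, intensity, start, stop, width=0.01):
    """Sum intensities into fixed-width bins covering [start, stop)."""
    n_bins = int(round((stop - start) / width))
    bins = [0.0] * n_bins
    for x, y in zip(ppm, intensity):
        idx = math.floor((x - start) / width)
        if 0 <= idx < n_bins:  # points outside the range are ignored
            bins[idx] += y
    return bins


def filter_bins(bins, threshold=1.0):
    """Set bins below the threshold to 0 but keep every bin position."""
    return [v if v >= threshold else 0.0 for v in bins]


# Hypothetical points inside the 4.65ppm-4.72ppm band (olefin set 1)
ppm = [4.652, 4.657, 4.683, 4.701]
intensity = [3.0, 2.5, 0.4, 0.2]

bins = bin_spectrum(ppm, intensity, start=4.65, stop=4.72)
filtered = filter_bins(bins)
print(filtered)  # [5.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Because low-value bins are zeroed rather than deleted, every spectrum keeps the same number of variables and each bin still maps to the same chemical shift.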
The normalization process includes rescaling and/or transforming the raw data in such a way that each attribute makes a uniform contribution [19]. Normalizing techniques set data values in a range of 0-1. The main advantages of applying normalization to the raw data are overcoming outliers and controlling data attributes [20]. Total area sum normalization consists of normalizing the intensity of each individual spectral bin to the total intensity of each spectrum. This type of normalization procedure is performed automatically in the MetaboAnalyst version 5.0 web platform or by applying a Log function in MS Excel. Single peak normalization is performed by dividing each data point by the amplitude of a specific peak of interest. Usually, the intensity of the internal standard peak is used, as it is consistent for all samples. However, if an internal standard is not added to the sample mixture, a peak within the sample can be used for single peak normalization. This peak is usually the highest intensity peak in each sample. Single peak normalization has limitations when no internal standard is present: the relative abundance among samples may be skewed because the selected peak may not exhibit the same intensity in all samples [21]. Therefore, the total area sum normalization process was applied to the NMR data.
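The two normalization schemes compared above can be written out in a few lines. This is a minimal sketch with hypothetical bin values; the function names are illustrative and do not correspond to MetaboAnalyst or Excel functions.

```python
def total_area_normalize(bins):
    """Divide every bin by the summed intensity of the spectrum."""
    total = sum(bins)
    if total == 0:
        raise ValueError("spectrum has zero total intensity")
    return [v / total for v in bins]


def single_peak_normalize(bins, peak_index=None):
    """Divide every bin by one reference peak (default: the highest bin)."""
    if peak_index is None:
        peak_index = max(range(len(bins)), key=bins.__getitem__)
    reference = bins[peak_index]
    if reference == 0:
        raise ValueError("reference peak has zero intensity")
    return [v / reference for v in bins]


spectrum = [2.0, 6.0, 0.0, 12.0]            # hypothetical binned intensities
by_area = total_area_normalize(spectrum)    # [0.1, 0.3, 0.0, 0.6]
by_peak = single_peak_normalize(spectrum)   # highest bin becomes 1.0
```

With total area sum normalization the values of each spectrum add up to 1, so no single reference peak has to be equally intense across samples.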
2.4. Feature Selection
The datasets underwent unsupervised analysis by applying PCA and supervised analysis by applying PLS-DA in MetaboAnalyst. PCA was chosen as the exploratory tool for the pre-processed data to display any natural groupings. The score plots were a visual representation of the clustering between groups, and a loading plot displays how strongly each characteristic influences a principal component. PLS-DA was then used for classification and feature selection of the variables, using cross-validation to select an optimal number of components for classification. The bins that contained important variable information for classification were identified by the PLS-DA Variable Importance in Projection (VIP) scores. The VIP score is a measure of a variable’s importance in the PLS-DA model [22]. It summarizes the contribution a variable makes to the model and is calculated as a weighted sum of the squared correlations between the PLS-DA components and the original variable. Statistical analysis was performed using the real-time interactive web-based application MetaboAnalyst. First, the non-targeting approach, considering all the spectral information, was explored for classification purposes; then the targeting approach, using the four sets of olefins, was evaluated for achieving better clustering. The targeting approach used the dataset from the 1H selTOCSY spectral data, which edits out many NMR peaks by filtering out all signals that do not have a component of their spin system in the selectively excited region.
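For reference, the VIP score described above is commonly written in the following form. The symbols are not defined in the original text and are introduced here for illustration only: p is the number of variables (spectral bins), A the number of PLS-DA components, w_{ja} the weight of variable j in component a, and SS_a the sum of squares of the response explained by component a.

```latex
\mathrm{VIP}_j \;=\; \sqrt{\, p \,
  \frac{\sum_{a=1}^{A} \mathrm{SS}_a \left( \dfrac{w_{ja}}{\lVert \mathbf{w}_a \rVert} \right)^{2}}
       {\sum_{a=1}^{A} \mathrm{SS}_a} }
```

With this definition the squared VIP scores average to 1 over all variables, which is why bins with a VIP score above 1 are often treated as candidates for important features; the cut-off actually applied in this study is not stated here.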
2.5. Classification Model
For the first time, this study used machine learning techniques to automatically individualize and classify petrol sources from native, evaporated and fire debris samples. The techniques were implemented in MATLAB (R2019b) using the Classification Learner app. The model was evaluated over selected datasets, which were essential to the development of the experimental model.
The classification model research design is outlined as follows:
Step 1- data collection: The datasets evaluated in this research were as follows: 1) entire 1H NMR spectrum of neat petrol samples; 2) 1H selTOCSY spectrum of the four olefins of neat petrol samples; 3) 1H selTOCSY spectrum of the four olefins of neat and evaporated petrol samples; 4) 1H selTOCSY spectrum of the four olefins of neat, evaporated and fire debris residue samples.
The datasets were divided into i) training data, comprising non-targeting datasets (containing all the NMR spectrum information) and targeting datasets (1H selTOCSY spectra consisting of selected features recognized as important for discrimination purposes), and ii) a blind study testing dataset (for the practical validation of the model on a real-world dataset). The training and testing datasets were used to determine the best classifier model for classification of petrol brands based on the NMR spectrum.
Step 2- reduction of data dimensionality: only a subset of measured features (predictor variables) was selected, either by creating a cluster model through PCA or by feature selection (using the featured chemical shift bins from the PLS-DA VIP scores). The PCA function was enabled with a component reduction criterion of 80% explained variance, as this represents sufficient information variance; typically, the first few PCs correspond to cumulative eigenvalues accounting for 80% or more of the variation within the data set and are sufficient to describe most of the variability in the dataset, thus reducing the dimensionality [3]. For optimal results, the study aimed to choose a classifier model with a minimum validation accuracy of 60%.
Step 3- dataset optimization: the effect of the pre-processing parameters on the classification training model for discrimination of petrol samples was tested. Two pre-processing parameters were investigated in this study: filtering of the redundant spectral bins and normalization. Data filtering was applied to set any spectral bin value less than 1 to 0. For normalization, the dataset was (i) single peak normalized (to the highest peak of the spectrum) and (ii) normalized with the total area sum normalization (LOG function).
Step 4- dataset splitting: datasets were split into training and testing datasets using the cross-validation function with K folds. Cross-validation with 5 and 10 folds was investigated.
Step 5- evaluating different classifier models such as Decision Trees, Discriminant Analysis (DA), Support Vector Machines (SVM), Logistic Regression, k-Nearest Neighbour (k-NN), Naïve Bayes, Ensembles, and Artificial Neural Networks (ANN).
Step 6- model selection: after multiple models were trained, their performance was compared and the most robust and effective classification model was chosen. The Classification Learner app displayed the results of the validated model. Performance measures, such as model accuracy, and visual representations, such as the confusion matrix chart, reflect the validated model results. The confusion matrix displayed the six petrol brands as true classes in rows and predicted classes in columns.
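The 80% explained-variance criterion in Step 2 can be sketched as below. This is not the Classification Learner implementation; the function name and the eigenvalues are hypothetical and serve only to show how the number of retained PCs follows from the cumulative variance.

```python
def components_for_variance(eigenvalues, target=0.80):
    """Smallest number of leading PCs whose cumulative variance reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, eigenvalue in enumerate(ordered, start=1):
        cumulative += eigenvalue
        if cumulative / total >= target:
            return k
    return len(ordered)


# Hypothetical PCA eigenvalues: cumulative explained variance is
# 52%, 73%, 87%, 95%, 100%, so three PCs are needed to reach 80%.
eigenvalues = [5.2, 2.1, 1.4, 0.8, 0.5]
n_components = components_for_variance(eigenvalues)
print(n_components)  # 3
```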
The goal of the classification model method was to investigate different datasets of native, evaporated and fire debris petrol samples with different pre-treatment techniques including data filtering and normalization to identify the most desirable classifier which provides the highest classification accuracy.
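Steps 4 to 6 (K-fold splitting, training a classifier and reading the accuracy and confusion matrix) can be illustrated with a self-contained sketch. It is not the MATLAB workflow used in the study: a plain 1-nearest-neighbour classifier stands in for the models listed in Step 5, and the features, brand labels and function names are all hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbour_predict(train_x, train_y, x):
    """1-NN: label of the closest training sample (squared Euclidean distance)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]


def cross_validate(features, labels, k=5, seed=0):
    """Return (accuracy, confusion), where confusion[true][predicted] is a count."""
    classes = sorted(set(labels))
    confusion = {t: {p: 0 for p in classes} for t in classes}
    for fold in k_fold_indices(len(features), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(features)) if i not in held_out]
        train_x = [features[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in fold:  # each sample is predicted exactly once
            predicted = nearest_neighbour_predict(train_x, train_y, features[i])
            confusion[labels[i]][predicted] += 1
    correct = sum(confusion[c][c] for c in classes)
    return correct / len(features), confusion


# Two well-separated hypothetical "brands" described by two binned features
features = ([[0.01 * i, 1.0] for i in range(10)]
            + [[5.0 + 0.01 * i, 0.0] for i in range(10)])
labels = ["brand_A"] * 10 + ["brand_B"] * 10

accuracy, confusion = cross_validate(features, labels, k=5)
```

Rows of `confusion` are true classes and columns are predicted classes, matching the layout described in Step 6; a candidate model would be kept only if its validation accuracy met the 60% minimum set in Step 2.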