1. Introduction
The global COVID-19 pandemic has had a huge impact on world health and the economy [1]. The fast-spreading virus SARS-CoV-2 is the cause, and detecting the virus in the human population is crucial for curbing the epidemic [2]. Traditional detection approaches are mainly Nucleic Acid Amplification Tests (NAAT) [3] and antigen detection [4]. Currently, the mainstream technique is quantitative Polymerase Chain Reaction (qPCR) [5], a kind of NAAT that has high sensitivity and specificity but requires a clean environment, bulky and expensive equipment, and trained personnel. Therefore, qPCR is not suitable for onsite detection, fast turnaround, or population-scale screening, which are often required in pandemic control scenarios [6]. To complement qPCR, antigen detection based on lateral flow [7] has also been employed for home use and self-testing. However, antigen detection is limited in sensitivity and specificity, which hinders its efficacy in fighting a pandemic [8]. There is still a lack of rapid, accurate, and low-cost detection techniques that can be deployed onsite for population-scale epidemic screening and/or surveillance [9], especially in regions with limited resources [10].
Biosensors have been proposed for detection of SARS-CoV-2 [11]. Biosensor technologies offer high sensitivity, good specificity, fast turnaround, ease of operation, low cost, and onsite deployment capability [12, 13]. We have previously proposed a photonic biosensor for fast onsite detection of SARS-CoV-2 with high sensitivity and specificity [14, 15]. The biosensor is based on nanoporous silicon fabricated via a CMOS-compatible silicon process, and on the nanophotonic working principles of Localized Surface Plasmon Resonance (LSPR) [16] and Tamm Plasmon Polariton (TPP) [17, 18]. The biosensor is read out by reflection spectroscopy [14].
We have also developed handheld and high-throughput detection systems [19] that collect the reflection spectra of biosensors and process the spectral data to determine the detection results efficiently. The high-throughput detection system is suitable for population-scale screening of infection, and the handheld detection system is for home use or self-testing. The spectral data processing algorithm recognizes the characteristic resonant valleys in the reflection spectrum of the biosensor and determines the detection result by judging whether these valleys are red-shifted. This is the commonly used, so-called “find peaks” technique, with its name originating from the MATLAB function findpeaks() (find_peaks() in SciPy). This technique can also be implemented on a Field Programmable Gate Array (FPGA) for fast and efficient processing of signals from an array of biosensors [20]. In addition, researchers have proposed the Interferogram Average over Wavelength (IAW) technique to process signals of optical biosensors that depend on the spectral shift of characteristic resonant features, which can enhance sensitivity compared with direct spectral shift detection [21]. Detecting the change in reflection intensity caused by the shift of spectral features has also been used to detect biomolecules in real time [22]. However, both the IAW and light intensity measurement techniques are subject to spectral amplitude fluctuations and thus require highly stable spectroscopy systems, such as a stable light source and a spectrometer with a high signal-to-noise ratio.
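The valley-tracking procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes SciPy's `scipy.signal.find_peaks`, a synthetic single-dip spectrum, and a hypothetical 1 nm decision threshold, all chosen only for illustration.

```python
import numpy as np
from scipy.signal import find_peaks


def resonant_valley(wavelengths, reflectance, prominence=0.1):
    """Return the wavelength of the most prominent valley in a reflection spectrum."""
    # find_peaks locates maxima, so the spectrum is negated to turn valleys into peaks.
    idx, props = find_peaks(-reflectance, prominence=prominence)
    if idx.size == 0:
        raise ValueError("no valley found; try a lower prominence")
    deepest = idx[np.argmax(props["prominences"])]
    return wavelengths[deepest]


def synthetic_spectrum(wavelengths, center_nm, depth=0.6, width_nm=15.0):
    """Toy spectrum: flat baseline with one Lorentzian dip (illustrative only)."""
    return 1.0 - depth / (1.0 + ((wavelengths - center_nm) / width_nm) ** 2)


wl = np.linspace(500.0, 800.0, 601)      # 0.5 nm sampling grid (assumed)
before = synthetic_spectrum(wl, 650.0)   # spectrum before exposure to the sample
after = synthetic_spectrum(wl, 654.0)    # spectrum after exposure, dip moved by 4 nm

shift_nm = resonant_valley(wl, after) - resonant_valley(wl, before)
THRESHOLD_NM = 1.0                       # hypothetical decision threshold
is_positive = shift_nm > THRESHOLD_NM
print(f"shift = {shift_nm:.1f} nm, positive = {is_positive}")
```

The `prominence` argument is the kind of parameter that typically has to be tuned by hand for each sensor and spectrometer, which is one practical weakness of this approach.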
In this work, we demonstrate that it is advantageous to use artificial intelligence technology, more specifically machine learning (ML) algorithms, to process the spectral data of the biosensor [23]. Instead of being explicitly programmed, an ML algorithm is learnt from a large volume of data [24]. Machine learning has been used for computer vision [25], face recognition [26], autonomous driving [27], auxiliary decision-making [28], brain-machine interfaces [29], and games [30]. It includes supervised learning, unsupervised learning, and reinforcement learning [31]. Supervised learning (SL) learns from a large amount of labeled data and generates prediction models that can assign labels to new data. SL algorithms include Support Vector Machine (SVM) [32], Multilayer Perceptron (MLP) [33], Linear Regression [34], Linear Discriminant Analysis [35], K-nearest Neighbor [36], Decision Tree [37], and Naïve Bayes [38]. In this work, we demonstrate that SVM and MLP can be used to process the photonic biosensor signals and dataset. Compared with previously proposed techniques, the ML technique has the following advantages: 1) there is no need to tune algorithm parameters, e.g. those of the find_peaks() function, by trial and error to guarantee accurate recognition of spectral features; 2) there is no need to discriminate between redshift and blueshift, which can be an extra issue in algorithm design; 3) it is not sensitive to spectral amplitude fluctuations, so the requirements on stable and expensive hardware are relaxed; 4) it is generalizable to all kinds of sensors whose response signals have salient features that serve as the basis for discriminating between positive and negative responses.
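As a rough illustration of how SVM and MLP classifiers can be applied directly to spectra, the following sketch trains both on synthetic data. Everything here is an assumption made for illustration: the use of scikit-learn, the toy Lorentzian spectra, the random gain that mimics amplitude fluctuation, and the model hyperparameters. None of it is taken from the authors' dataset or pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
wl = np.linspace(500.0, 800.0, 301)


def toy_spectrum(center_nm):
    """One Lorentzian dip with random amplitude scaling and additive noise."""
    dip = 1.0 - 0.6 / (1.0 + ((wl - center_nm) / 15.0) ** 2)
    gain = rng.uniform(0.8, 1.2)  # mimics light-source amplitude fluctuation
    return gain * dip + rng.normal(0.0, 0.01, wl.size)


# Hypothetical data: negatives keep the dip near 650 nm, positives are red-shifted.
negatives = [toy_spectrum(650.0 + rng.normal(0.0, 0.5)) for _ in range(100)]
positives = [toy_spectrum(656.0 + rng.normal(0.0, 0.5)) for _ in range(100)]
X = np.array(negatives + positives)
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "MLP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # learn from labeled spectra
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out spectra
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```

No peak-finding parameter appears anywhere in this sketch: the classifiers receive the whole spectrum as a feature vector and learn the decision boundary from the labels.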
Data visualization can help us understand the distribution of a dataset and assess its distinguishability. T-distributed Stochastic Neighbor Embedding (t-SNE) is a prevalent approach for mapping high-dimensional data to a low-dimensional embedding [39]. In this contribution, we also applied t-SNE to the SARS-CoV-2 detection dataset to clarify the distinguishability of the biosensor dataset, so that the data processing and ML prediction performance can be better understood.
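A t-SNE projection of spectral data can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `TSNE`, synthetic spectra in place of the real detection dataset, and an arbitrarily chosen perplexity.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
wl = np.linspace(500.0, 800.0, 301)


def toy_spectrum(center_nm):
    """Synthetic reflection spectrum with one Lorentzian dip plus noise."""
    dip = 1.0 - 0.6 / (1.0 + ((wl - center_nm) / 15.0) ** 2)
    return dip + rng.normal(0.0, 0.01, wl.size)


# Hypothetical dataset: 40 negative and 40 positive (red-shifted) spectra.
X = np.array(
    [toy_spectrum(650.0) for _ in range(40)]
    + [toy_spectrum(656.0) for _ in range(40)]
)
labels = np.array([0] * 40 + [1] * 40)

# Map each 301-dimensional spectrum to a point in 2D.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

for label, name in ((0, "negative"), (1, "positive")):
    centroid = embedding[labels == label].mean(axis=0)
    print(f"{name} centroid in t-SNE space: {centroid}")
```

Plotting `embedding` colored by `labels` gives the kind of 2D scatter view used to judge whether the two classes form separate clusters.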
3. Results and Discussion
In the experiments, we used SVM and MLP models to test the raw data processing and feature engineering methods. Two performance metrics are considered: sensitivity (SEN) and specificity (SPE), which are defined as
$$\mathrm{SEN} = \frac{TP}{TP + FN}, \qquad \mathrm{SPE} = \frac{TN}{TN + FP},$$
where TP, FN, TN, and FP stand for true positive, false negative, true negative, and false positive, respectively.
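For concreteness, sensitivity is the fraction of actual positives that are predicted positive, TP / (TP + FN), and specificity is the fraction of actual negatives that are predicted negative, TN / (TN + FP). The short sketch below computes both from label lists. The labels are invented for illustration and are not results from this work.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute (SEN, SPE) from binary labels, where 1 = positive and 0 = negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up example: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sen, spe = sensitivity_specificity(y_true, y_pred)
print(f"SEN = {sen:.3f}, SPE = {spe:.3f}")
```

Here one positive is missed and one negative is misclassified, so SEN = 3/4 and SPE = 5/6.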
Table 1 shows the performance of the ML model predictions. Perfect performance is achieved for both the raw data and feature engineering methods, combined with either the SVM or the MLP model. The last row in Table 1 shows the performance of the models on the control experiment dataset. The performance is very poor because the biosensors have not been functionalized with specific antibodies and thus cannot detect the SARS-CoV-2 virus effectively.
Figure 4 (a) shows the data distribution of the raw dataset in 2D space obtained by t-SNE. The positive and negative samples from the dataset of valid detection experiments form clusters without any overlap. Thus, the valid experimental dataset is distinguishable.
Figure 4 (b) shows the data distribution of features extracted from the dataset of Figure 4 (a). The extracted features change the data distribution while maintaining the distinguishability, because the samples remain separated into different clusters.
Figure 4 (c) shows the data distribution of the dataset obtained from control experiments, in which the biosensors are not functionalized with specific antibodies. Negative samples overlap with positive samples, and the dataset is indistinguishable according to the visualization results.
Figure 4 (d) shows the data distribution of features extracted from the dataset of Figure 4 (c). The distribution of the feature dataset is still mixed, so feature engineering cannot help the dataset to be classified effectively. These dataset distribution results help to interpret the performance comparisons in Table 1.
Table 2 demonstrates the advantages of the ML data processing technique compared with other techniques. It can be seen that the general advantages of ML hold, in addition to the eased hardware requirements.
To verify the efficacy of the ML data processing technique for biosensors, detection experiments for inactivated SARS-CoV-2 were carried out in vaccination sites of the Hangzhou Center for Disease Control and Prevention (CDC), and the detection results were compared with those of the gold standard, reverse transcription qPCR. The environmental specimens were collected from various locations in different vaccination sites, delivered to Hangzhou CDC within 4 hours, and analyzed simultaneously by both techniques.
Table 3 shows that the biosensors, together with ML data processing, generate detection results that are consistent with the qPCR results. Note that qPCR provides semi-quantitative results dependent on the Ct value [5], while ML processing of biosensor data provides only qualitative results. This comparative study demonstrates that the ML technique is an effective tool for biosensor signal and data processing,