1. Introduction
Accidental fires threaten both personal safety and property. According to the National Fire Protection Association, US fire departments responded to an estimated 1.5 million fires in 2022, which resulted in 3,790 civilian deaths, 13,250 civilian injuries, and an estimated $18 billion in property damage [1]. The damage that fires cause to the natural environment cannot be ignored either. In 2023, the total area burned by Canada’s wildfires surpassed 156,000 square kilometers, exceeding the record set in 1995. The record blazes released airborne pollutants and greenhouse gases, contributing significantly to climate change [2]. After a fire breaks out, time is crucial: timely detection of the fire and of victims can effectively reduce the harm. Traditional fire alarm systems, such as photoionization smoke detectors, infrared thermal imagers, flame gas sensors, and smoke gas sensors, suffer from delayed response and restricted sensor density. In open spaces especially, airflow and other conditions may impede accurate detection [3].
Early vision-based systems for detecting fire and smoke relied on techniques such as color detection, moving object detection, and motion and flicker analysis using Fourier and wavelet transforms, among others [4]. B. Uğur Töreyin et al. implemented a real-time fire and flame detection method based on color and motion analysis, applying wavelet-domain analysis to flames and flame flicker in video data [5]. P. V. Koerich Borges et al. achieved fire detection by evaluating inter-frame variations of features such as color, area size, and texture in potential fire zones and feeding them to a Bayesian classifier [6]. Yusuf Hakan Habiboğlu et al. proposed a flame detection system based on spatiotemporal covariance matrices of video data, which captured the flickering and irregular characteristics of flames by dividing the video into spatiotemporal blocks and computing covariance features from these blocks [7]. Although numerous physical and mathematical methods have been used to extract features such as the color, texture, flicker frequency, and contour of fire and smoke, these early methods have limited feature representation capability because their feature extractors are designed by hand. In addition, they adapt poorly to complex scene changes, dynamic backgrounds, and lighting changes, leading to high missed detection rates and weak generalization ability [8].
With the rapid development and increasing maturity of neural networks, Convolutional Neural Networks (CNNs) have distinguished themselves in computer vision through their capacity to extract rich and discriminative features from extensive data [8,9,10], attracting attention from many researchers. CNN-based object detection algorithms are increasingly applied to fire and smoke detection [3,8,9,10,11]. According to their processing procedures and structures, they can be roughly divided into two categories: one-stage algorithms and two-stage algorithms. One-stage methods directly estimate target location and category from input images, without first detecting potential target regions. These algorithms operate by dividing the image into grids, generating diverse bounding boxes based on anchor points in each grid, and employing non-maximum suppression (NMS) to eliminate redundant, overlapping bounding boxes. The representative one-stage algorithms are the YOLO series [12,13,14,15,16,17,18]. Two-stage algorithms complete object detection in two main stages: candidate box generation and object detection. First, a candidate box generator (such as Selective Search [19] or a Region Proposal Network [21]) produces candidate boxes that may contain targets in the input image. Then, these candidate boxes are filtered with NMS and their features are extracted, followed by classification and box regression in the corresponding heads. Algorithms such as Fast R-CNN [20], Faster R-CNN [21], Cascade R-CNN [22], and Sparse R-CNN [23] exemplify this category. Although two-stage algorithms are generally more accurate than one-stage methods, their high computational complexity imposes higher hardware requirements and makes real-time operation difficult [24].
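The NMS step mentioned above can be illustrated with a minimal greedy sketch. It is not taken from any of the cited detectors; the box format `(x1, y1, x2, y2)`, the function names, and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: repeatedly keep the best-scoring box and drop its heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Retain only boxes that do not overlap the kept box too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two near-duplicate detections and one separate detection (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the surviving boxes
```

Because the result depends on a hand-set threshold and a sequential loop, NMS is one of the components that end-to-end detectors aim to remove.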
A novel detection method, DETR, has recently emerged for object detection, achieving results comparable to the mature Faster R-CNN on the COCO dataset [25]. DETR is built on the transformer, an architecture that was initially adopted in fields like natural language processing and speech recognition, where it brought substantial advances, and was subsequently introduced to computer vision. DETR was the first to enable end-to-end object detection, meaning it directly predicts bounding box coordinates and class labels without relying on anchor boxes or region proposal techniques. This simplifies the object detection pipeline and eliminates the need for complex components like NMS, anchor generation, and anchor matching. The end-to-end nature of DETR makes it more efficient and easier to implement than traditional algorithms. Zhu, X. et al. improved DETR and proposed a new model called Deformable DETR. Compared with DETR, Deformable DETR has better detection performance, lower computational complexity, and faster convergence. Notably, Deformable DETR performs exceptionally well in detecting small objects [26]. In the early stages of a fire, smoke and flames tend to be concentrated in a small area [27]. The advantage of Deformable DETR in detecting small objects therefore helps detect small flames and smoke in time, before the fire spreads. Additionally, Deformable DETR borrows the idea of deformable convolution [28]: its attention module selects only a few sampling points near the reference point as keys in the self-attention calculation. This approach not only speeds up the convergence of the model but also improves its computational efficiency, allowing it to detect irregular flames and smoke more effectively.
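The idea of attending to only a few sampled points around a reference point can be sketched as follows. This is a deliberately simplified, single-head, single-scale, single-channel illustration and not the Deformable DETR implementation: in the real model the offsets and attention logits are predicted from the query by learned linear layers, whereas here they are passed in as plain arguments, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def bilinear_sample(feat: List[List[float]], x: float, y: float) -> float:
    """Bilinearly interpolate a single-channel map at (x, y); outside counts as 0."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= xi < w and 0 <= yi < h:
                weight = (1 - abs(x - xi)) * (1 - abs(y - yi))
                value += weight * feat[yi][xi]
    return value


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention(
    feat: List[List[float]],
    reference: Tuple[float, float],
    offsets: Sequence[Tuple[float, float]],
    logits: Sequence[float],
) -> float:
    """Weighted sum over K sampled points near the reference point.

    The cost grows with K (the number of offsets), not with the H*W
    positions that dense self-attention would have to visit.
    """
    weights = softmax(logits)
    rx, ry = reference
    return sum(
        w * bilinear_sample(feat, rx + dx, ry + dy)
        for w, (dx, dy) in zip(weights, offsets)
    )


# Toy 4x4 feature map with value x + 10*y at pixel (x, y); made-up inputs.
feat = [[float(x + 10 * y) for x in range(4)] for y in range(4)]
offsets = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
out = deformable_attention(feat, (1.5, 1.5), offsets, [0.0, 0.0, 0.0, 0.0])
```

With four sampling points the query touches four interpolated locations instead of all sixteen pixels, which is where the computational saving comes from on realistic feature map sizes.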
However, using Deformable DETR for object detection still faces substantial challenges. Although Deformable DETR shows excellent prediction accuracy on the COCO dataset, its heavy computation and slow inference are significant disadvantages that cannot be neglected. They make it difficult to deploy the model on resource-constrained devices and to meet the requirements of real-time detection. To address these issues, we make several improvements to the model. First, the original ResNet backbone is replaced by the more advanced ConvNeXt, improving the network’s ability to extract complex features related to fire and smoke. Second, because the high computational complexity of the Deformable DETR encoder hinders deployment on resource-constrained detection devices, we simplify the encoder structure while enhancing detection accuracy. Third, because the GIoU loss used in Deformable DETR limits convergence speed and detection accuracy, PIoU v2 is introduced as the new loss function. Finally, we add humans as a detection target, which will help firefighters develop response strategies to assist victims in fires. Our contributions can be summarized as follows:
- (1)
In response to complex and dynamic fire environments, ConvNeXt is applied as the backbone to enhance the algorithm’s ability to extract features at different scales.
- (2)
CCFM is introduced and SSFI is proposed. Together, these modules form the Mixed encoder, which improves detection accuracy and greatly reduces the amount of computation.
- (3)
The loss function is modified to PIoU v2, which solves the slow convergence issue and improves the model’s stability in complex fire scenarios.
- (4)
Humans are added as a detection class, which helps firefighters identify potential victims in fires and improves the practical value of the model.
This paper is structured as follows. In
Section 2, we review related works and discuss their strengths and limitations.
Section 3 details the overall architecture and improvement methods of our proposed model.
Section 4 introduces the experimental setup, including the dataset used, evaluation methods, and experimental environment. In
Section 5, to demonstrate the detection performance and characteristics of our model, visual examples, qualitative analysis, and comparison with other methods are provided. Particularly in
Section 5.3, we conduct a series of ablation experiments to explore the impact of different components and design choices of the proposed model on its performance.
Section 6 summarizes the entire study and provides prospects for future work.
2. Related Works
In recent years, there have been significant advancements in the field of object detection. We review the relevant work below, grouped by the type of algorithm used for fire and smoke detection.
One-stage algorithms: Given their fast inference speed and low hardware requirements, one-stage algorithms are preferred for most fire and smoke detection tasks. Lei Zhao et al. proposed the Fire YOLO model, an adaptation of YOLOv3 that incorporates the EfficientNet technique and mobile inverted bottleneck convolution (MBConv) to boost the sensitivity of the feature extraction network and enhance the detection efficiency of small targets [24]. Xin Geng et al. proposed YOLOFM, an improved fire and smoke detection algorithm based on YOLOv5. Using techniques such as the FocalNext network, QAHARp-FPN, NADH, and Focal SIoU Loss, they improved the accuracy and recall of the baseline network, yielding a more reliable and precise solution for fire and smoke detection tasks [29]. Hu et al. proposed MVMNet, a model based on YOLOv5 that employs a novel detection method called multioriented detection. During training, a parameter describing the angle was added to address tilted smoke and the misdetection of similar objects in forest fire smoke detection. In addition, the SPP module in YOLOv5 was replaced by a Soft-SPP module, and a value conversion attention mechanism module (VAM) was created specifically to extract smoke color and texture. Finally, a mixed non-maximum suppression method comprising DIoU-NMS and Sketch NMS was employed to address false alarms and missed detections in smoke detection [27]. Gong Chen et al. proposed a lightweight forest fire smoke detection model based on YOLOv7. First, Ghost Shuffle Convolution (GSConv) replaced standard convolution to reduce the model size and ease deployment on edge devices. Next, coordinate attention (CA) was embedded in the backbone to improve its ability to extract smoke information and reduce background interference. Then, Content-Aware Reassembly of Features (CARAFE) replaced the nearest-neighbor interpolation upsampling in YOLOv7, expanding the receptive field of the feature fusion network. Finally, the SIoU loss function was employed [30]. Although one-stage algorithms are simple, fast, and capable of real-time object detection, their detection accuracy still falls short of some two-stage algorithms [31]. Moreover, the YOLO series is not ideal for detecting small objects [32], which puts it at a disadvantage in detecting the early signs of a fire.
Two-stage algorithms: P. Barmpoutis et al. introduced a fire detection approach integrating deep learning networks and linear dynamic systems (LDS) for multi-dimensional texture analysis. Initially, the Faster R-CNN network detects potential fire regions within the image. The regions are then projected onto a Grassmannian space. The vector of locally aggregated descriptors (VLAD) approach is used to aggregate the Grassmannian points based on a locality criterion on the manifold. Finally, an SVM classifies the results [33]. Chaoxia et al. improved the anchor generation of Faster R-CNN with a color-guided anchoring strategy, while also constructing a Global Information Network (GIN) to obtain global image information, enhancing the efficiency and accuracy of flame detection [34]. Pan J. et al. used knowledge distillation to make Faster R-CNN lightweight and proposed a weakly supervised fine-segmentation method for detection and classification. A fuzzy system was introduced to construct a fire and smoke rating framework [31]. Nevertheless, mainstream two-stage methods show low accuracy in small object detection [32]. More critically, anchor-based methods like Faster R-CNN struggle to locate objects with diverse shapes [35], which is a drawback for detecting amorphous fire and smoke.
DETR-based algorithms: The one-stage and two-stage algorithms above are anchor-based methods. According to recent research, the detection performance of anchor-based algorithms depends to some extent on the preset number of anchors [36]. Both too many and too few anchors can lead to poor results, and excessive anchors also increase computational complexity. Moreover, these algorithms rely on NMS during detection, yet not all edge devices support NMS (for example, edge computing devices that only support integer operations) [37]. To solve these problems and to dispense with manual intervention and prior knowledge, researchers have begun to turn their attention to the transformer-based DETR. Li, Y. et al. applied a lightweight DETR to fire and smoke detection, reducing the number of encoder layers and incorporating a multi-scale deformable attention mechanism. They also used ResNeXt50 as the backbone and added the normalization-based attention module (NAM) to improve the model’s feature extraction ability [39]. Mardani, K. et al. simplified DETR by removing unnecessary components such as bipartite matching and bounding box heads, and added masked or linear layers composed of multi-head attention layers to complete different tasks, achieving the best accuracy on the specified datasets [38]. Huang, J. et al. used Deformable DETR as the baseline and combined a Multi-scale Context Controlled Local Feature Module (MCCL) and a Dense Pyramid Pooling Module (DPPM) to improve the detection of small smoke regions [40].
Recent improvements to DETR have mainly focused on the decoder. For instance, Conditional DETR decouples the cross-attention function of the DETR decoder and proposes conditional spatial embedding, which accelerates the model’s convergence [41]. DAB-DETR uses dynamically updated box coordinates as queries in the decoder, improving both model accuracy and convergence speed [42]. Recent research indicates that low-level features account for 75% of all tokens in the encoder but contribute little to the overall detection accuracy [43]. Therefore, we focus on improving the rarely studied encoder block in this article. Compared with the baseline, Deformable DETR, we reduce the number of encoder layers from 6 to 2, decreasing the computational complexity. At the same time, Separate Self-Attention and CCFM replace the Multi-scale Deformable Attention function in the encoder block. Finally, we replace the backbone with ConvNeXt, a more advanced architecture with stronger feature extraction capabilities than the traditional ResNet. Overall, the result is a model with good performance and reduced computational burden.
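The 75% figure can be checked with simple arithmetic under one common assumption: the encoder receives four feature levels at strides 8, 16, 32, and 64, each halving the resolution of the previous one. The sketch below counts the tokens per level for a hypothetical 1024 × 1024 input; the function name and the input size are illustrative and are not taken from [43].

```python
import math
from typing import Dict, Sequence


def token_counts(height: int, width: int,
                 strides: Sequence[int] = (8, 16, 32, 64)) -> Dict[int, int]:
    """Number of encoder tokens contributed by each feature level.

    Assumes one token per spatial position of each feature map, with
    feature map size ceil(H / stride) x ceil(W / stride).
    """
    return {s: math.ceil(height / s) * math.ceil(width / s) for s in strides}


counts = token_counts(1024, 1024)  # hypothetical input resolution
total = sum(counts.values())
shares = {s: n / total for s, n in counts.items()}

for stride, share in shares.items():
    print(f"stride {stride:>2}: {counts[stride]:>6} tokens ({share:.1%})")
```

Under these assumptions the stride-8 level alone contributes 16,384 of 21,760 tokens, roughly three quarters of the encoder input, which is why simplifying the treatment of that level has a large effect on encoder cost.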
5. Result Analysis
This section presents a comprehensive analysis of the results obtained from our experiments on fire and smoke detection. By analyzing the effectiveness of individual blocks in the model, visualizing the results, conducting ablation experiments, and comparing with other models, we demonstrate the superiority of our proposed model.
5.1. Effectiveness of Backbone
The backbone architecture acts as a feature extractor, and its ability to capture discriminative features directly influences the model’s ability to detect fire and smoke instances accurately. To demonstrate the effectiveness of the backbone, we considered several popular backbone architectures, including ResNet [45], EfficientNet [57], and ConvNeXt v2 [58], to extract features from the input images for fire and smoke detection. Each backbone architecture has its own characteristics and capabilities in capturing and representing visual patterns. We trained and evaluated our fire and smoke detection models with different backbone architectures while keeping other hyperparameters and training procedures consistent. The detection results of the baseline with different backbones are shown in Table 3. The results show that the choice of backbone affects the detection accuracy of the model. In particular, using ConvNeXt-tiny as the backbone not only reduces the parameter count and computational complexity but also significantly enhances the detection accuracy for fire and smoke.
5.2. Effectiveness of PIoU v2
In this subsection, we conducted experiments to verify the effectiveness of PIoU v2 by comparing it with other IoU-based loss functions. The experimental results are presented in Table 4. These results show that PIoU v2 improves the detection accuracy and significantly reduces the training time of the model for the same number of training epochs.
To better understand the effectiveness of PIoU v2, we visualize the training process using different IoU loss functions. It is worth noting that the pre-trained model provided by mmdetection is used for parameter initialization. Therefore, the mAP of the model does not start from 0 in the early stages of training. From Figure 7, it is evident that the model using PIoU v2 as the loss function converges fastest, while DIoU converges slowest. After 50 epochs, all IoU losses tend to converge and reach roughly the same accuracy. However, PIoU v2 achieves a slightly higher mAP than the other losses.
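For readers unfamiliar with the baseline losses in this comparison, the sketch below computes the standard GIoU and DIoU losses for a pair of boxes. It is an illustration of the textbook definitions only: PIoU v2 itself is not reproduced here, and the box format `(x1, y1, x2, y2)` and the function names are assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def _area(b: Box) -> float:
    """Area of a box; empty (inverted) boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def _iou_union_enclosure(a: Box, b: Box) -> Tuple[float, float, Box]:
    """IoU, union area, and smallest enclosing box (inputs must be non-degenerate)."""
    inter = _area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = _area(a) + _area(b) - inter
    enclosing = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    return inter / union, union, enclosing


def giou_loss(pred: Box, gt: Box) -> float:
    """1 - GIoU, where GIoU = IoU - (|C| - |A U B|) / |C| and C encloses both boxes."""
    iou, union, c = _iou_union_enclosure(pred, gt)
    c_area = _area(c)
    return 1.0 - (iou - (c_area - union) / c_area)


def diou_loss(pred: Box, gt: Box) -> float:
    """1 - DIoU, where DIoU = IoU - (center distance)^2 / (enclosing diagonal)^2."""
    iou, _, c = _iou_union_enclosure(pred, gt)
    dx = (pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    diagonal_sq = (c[2] - c[0]) ** 2 + (c[3] - c[1]) ** 2
    return 1.0 - (iou - (dx * dx + dy * dy) / diagonal_sq)


# Two disjoint boxes (made-up values): plain IoU is 0 however far apart they are.
pred, gt = (0.0, 0.0, 2.0, 2.0), (3.0, 0.0, 5.0, 2.0)
loss_giou = giou_loss(pred, gt)
loss_diou = diou_loss(pred, gt)
```

Both losses still vary with the distance between non-overlapping boxes, which plain IoU cannot do; the differences in how strongly each penalty term guides the regression are one reason such losses converge at different speeds.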
5.3. Ablation Experiments
In the field of object detection, ablation experiments are a widely used evaluation method aimed at analyzing the significance and mutual influence of different components in model design. Through ablation experiments, researchers gradually eliminate or modify different components of the model, observe the impact of these changes on the performance, and gain a deeper understanding of the working mechanism and key factors.
This subsection aims to comprehensively analyze and evaluate the remarkable performance of our proposed model through a series of ablation experiments. We will focus on studying the contribution of different components to the model’s performance and exploring their role in the object detection process.
Table 5 displays the results of multiple ablation experiments, where √ denotes that relevant improvement methods have been applied to the baseline, while × denotes that no relevant improvement methods have been applied.
- (1)
Comparing the first and fourth experimental groups shows that using PIoU v2 as the loss function slightly improves the detection precision of the algorithm but has almost no effect on the parameter count or computational complexity.
- (2)
Comparing the first and second experiments reveals that ConvNeXt significantly reduces the number of parameters while improving $Accuracy_{fire}$, $Accuracy_{smoke}$, $Accuracy_{human}$, and mAP.
- (3)
Comparing the first and third groups of experiments shows that upgrading the original encoder to the Mixed encoder reduces the computational complexity but increases the number of parameters and slightly reduces $Accuracy_{smoke}$ and $Accuracy_{human}$.
- (4)
Comparing the sixth and seventh groups of experiments shows that although the Mixed encoder is the main reason for the increase in the model’s parameter count, it also improves the model’s accuracy in detecting fires and humans, as well as its mAP.
5.4. Comparison with Other Models
In addition to evaluating the performance of our method, we compare it with existing representative object detection algorithms, such as YOLO v7, YOLO v8, and DAB-DETR. Benchmarking our results against these approaches shows the advances achieved by our proposed method. We discuss the performance of the different methods under multiple indicators, emphasizing where our model outperforms existing methods. All the experiments are performed on the dataset introduced in Section 4. The results are presented in Table 6, with the best results highlighted in bold. According to the results, FCM-DETR achieved the highest mAP among all the algorithms. Moreover, its remaining indicators also exceed those of the other DETR-series algorithms. In small-scale object detection, it delivers results second only to RTMDet [59]. Additionally, in large-scale object detection, its detection accuracy reaches 71.6%, which is much higher than that of the one-stage algorithms.
To provide a more intuitive demonstration of the superiority of our algorithm, we have selected detection results from various scenarios and presented them in Figure 8. Detected flames are marked in chartreuse, while Schönbrunn yellow and royal blue are used for smoke and human detections, respectively. In the dark scene, our FCM-DETR algorithm performs better than the other algorithms, detecting more targets with higher accuracy. In the bright scene, FCM-DETR also detects more small-scale targets than the other algorithms. Notably, our method distinguishes negative samples well, reducing the false alarm rate of fire and smoke detection.
6. Conclusion
With the rapid development of deep learning technology, object detection techniques are increasingly used in fields such as forest fire surveillance, fire emergency identification, and industrial safety. Nevertheless, there is still considerable room for improvement in this technology. FCM-DETR, our proposed model, takes the advanced Deformable DETR as its baseline to accurately identify and localize fire and smoke instances in images. By employing ConvNeXt as a powerful backbone, the lightweight FCM-DETR extracts richer and more comprehensive feature information. A Mixed encoder comprising the SSFI and CCFM modules is then developed, reducing the computational complexity of the original encoder while maintaining high accuracy in fire and smoke detection. Lastly, the recent PIoU v2 loss is introduced, which not only accelerates convergence and improves robustness in complex fire scenarios but also raises the detection accuracy. Extensive experiments and evaluation demonstrate the effectiveness and potential of our approach. The model ultimately attained an mAP of 66.7%, outperforming the compared models.
Moving forward, considering that DETR is still a novel technology, several avenues for future work can build upon the findings of this study. Extending our approach to real-time video-based fire and smoke detection is an important direction. In future work, we will improve the real-time processing capability of the model and further reduce its computational complexity while preserving its effectiveness, with the aim of practical deployment.