Submitted:
08 September 2025
Posted:
09 September 2025
Abstract
Keywords:
1. Introduction
2. Related Works
3. Materials and Methods Used for the Proposed System
3.1. Flight Module
3.2. Data Acquisition Module
- Main MPU: It features a 32-bit, dual-core Tensilica Xtensa LX7 processor that operates at speeds of up to 240 MHz. It is also equipped with 8 MB of PSRAM and 8 MB of Flash memory, which provides ample storage for processing sensor data and managing video streams.
- Camera: Although the ESP32-S3 module ships with an OV2640 camera sensor, a superior OV5640 camera module has been added to enhance image quality and transmission speed. This enables higher-resolution capture and faster access times, providing the analysis module with higher-quality visual data at a cost-effective price.
- Distance sensor (LIDAR): A time-of-flight (ToF) laser distance sensor (VL53LDK) is used, with a maximum measurement range of 2 meters and an accuracy of ±3%. Although this sensor is not designed for large-scale mapping, it enables proximity awareness and estimation of the terrain profile immediately below the drone. This data is essential for low-altitude flight and for the victim location estimation algorithm.
- GPS module: A GY-NEO6MV2 module is integrated to provide positioning data with a horizontal accuracy of 2.5 meters (circular error probable) and an update rate of 1 Hz.
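The GY-NEO6MV2 reports position as NMEA 0183 sentences over a serial link. As an illustration only (the function names are ours, and the sentence below uses synthetic values), a minimal parser for the $GPGGA sentence that the ground station could use might look like:

```python
def nmea_to_decimal(raw: str, hemi: str) -> float:
    """Convert NMEA ddmm.mmmm (or dddmm.mmmm) to signed decimal degrees."""
    dot = raw.index(".")
    degrees = float(raw[:dot - 2])   # digits before the two minute digits
    minutes = float(raw[dot - 2:])
    value = degrees + minutes / 60.0
    return -value if hemi in ("S", "W") else value

def parse_gga(sentence: str):
    """Extract (lat, lon, altitude_m) from a $GPGGA sentence, or None if no fix."""
    fields = sentence.split(",")
    if not fields[0].endswith("GGA") or fields[6] == "0":  # fix quality 0 = no fix
        return None
    lat = nmea_to_decimal(fields[2], fields[3])
    lon = nmea_to_decimal(fields[4], fields[5])
    return lat, lon, float(fields[9])

# Synthetic example sentence (not a real receiver log)
gga = "$GPGGA,123519,4546.338,N,02208.640,E,1,08,0.9,545.4,M,46.9,M,,*47"
print(parse_gga(gga))  # approximately (45.7723, 22.144, 545.4)
```

Note that NMEA encodes minutes, not decimal degrees, which is a common source of positioning errors if parsed naively.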
3.3. Delivery Module
3.4. Drone Command Center
- Human-Machine Interface (HMI): The front end gives the human operator complete situational awareness. A key feature is the interactive map, which displays the drone's position in real time, the precise location of any detected individuals and other relevant geospatial data. At the same time, the operator can view the live video stream annotated in real time by the detection module which highlights victims and classifies their status (e.g., walking, standing, or lying down). Figure 6 shows the visual interface of the central application, which consolidates these data streams.
- Perceptual and cognitive processing: One of the fundamental architectural decisions behind our system is to decouple AI-intensive processing from the aerial platform and host it at the GCS level. The drone acts as an advanced data collection platform, transmitting video and telemetry data to the ground station. Here, the backend takes this data and performs two critical tasks:
  - Visual detection: Uses the YOLO11 object detection model to analyze the video stream and extract semantic information. This off-board approach allows the use of complex, high-precision computational models that would otherwise exceed the hardware capabilities of resource-limited aerial platforms.
  - Agentic reasoning: The GCS hosts the entire cognitive-agentic architecture, detailed in Section 6. All interactions between AI agents, including contextual analysis, risk assessment and recommendation generation, take place at the ground server level.
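The off-board division of labor described above can be sketched as a small pipeline. This is an illustrative skeleton, not the paper's implementation: the detector is a pluggable stub standing in for the YOLO11 model, and all names are ours.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Telemetry:
    lat: float
    lon: float
    height_m: float
    heading_deg: float

@dataclass
class Detection:
    label: str        # e.g. "sit", "walk", "stand"
    confidence: float
    box: Tuple[int, int, int, int]  # x1, y1, x2, y2 in image pixels

def process_frame(frame, telemetry: Telemetry,
                  detect: Callable[[object], List[Detection]]) -> dict:
    """GCS backend step: run detection off-board and attach telemetry context."""
    detections = detect(frame)
    return {
        "telemetry": telemetry,
        "detections": detections,
        # Immobile individuals are forwarded to the risk-assessment agent.
        "candidates": [d for d in detections if d.label == "sit"],
    }

# Stub detector standing in for the YOLO11 model
def stub_detector(frame):
    return [Detection("sit", 0.91, (100, 120, 160, 200)),
            Detection("walk", 0.88, (300, 80, 340, 190))]

report = process_frame(None, Telemetry(45.7723, 22.144, 27, 30), stub_detector)
print(len(report["candidates"]))  # one "sit" detection flagged
```

Because the drone only transmits frames and telemetry, the detector can be swapped or upgraded at the GCS without touching the flight hardware.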
4. Visual Detection and Classification with YOLO
4.1. Construction of the Dataset and Class Taxonomy
- C2A Dataset: Human Detection in Disaster Scenarios [27] - This dataset is designed to improve human detection in disaster contexts.
- NTUT 4K Drone Photo Dataset for Human Detection [28] – A dataset designed for identifying human behavior, including detailed annotations for classifying human actions.
- States of Mobility and Action: This category helps the system distinguish between individuals who are mobile and likely at lower risk, versus those who may be incapacitated.
  - High-Priority/Immobile (sit): This is the most critical class for SAR scenarios. It is used to identify individuals who are in a static, low-profile position, which includes not only sitting but also lying on the ground or collapsed. An entity classified as sit is treated as a potential victim requiring urgent assessment.
  - Low-Priority/Mobile (stand, walk): These classes identify individuals who are upright and mobile. They are considered a lower immediate priority but are essential for contextual analysis and for distinguishing active survivors from incapacitated victims.
  - General Actions (person, push, riding, watchphone): Derived largely from the NTUT 4K dataset, these classes help the model build a more robust and generalized understanding of human presence and posture, improving overall detection reliability even if these specific actions are less common in disaster zones.
- Visibility and Occlusion: This category quantifies how visible a person is, which is crucial for assessing whether a victim might be trapped or partially buried.
  - Occlusion Levels (block25, block50, block75): These classes indicate that approximately 25%, 50%, or 75% of the person is obscured from view.
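The triage logic implied by this taxonomy can be captured in a simple lookup table. The tier names and sort order below are illustrative, not taken from the paper:

```python
# Priority tiers derived from the class taxonomy (tier names are ours)
CLASS_PRIORITY = {
    "sit": "high",                    # static/low-profile: potential victim
    "stand": "low", "walk": "low",    # upright and mobile
    "person": "general", "push": "general",
    "riding": "general", "watchphone": "general",
}
OCCLUSION_FRACTION = {"block25": 0.25, "block50": 0.50, "block75": 0.75}

def triage(labels):
    """Return detected labels sorted so high-priority classes come first."""
    order = {"high": 0, "low": 1, "general": 2}
    return sorted(labels, key=lambda l: order.get(CLASS_PRIORITY.get(l, "general"), 2))

print(triage(["walk", "person", "sit"]))  # ['sit', 'walk', 'person']
```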
4.2. Data Preprocessing and Hyperparameter Optimization
- Data augmentation: The techniques applied consisted exclusively of fixed rotations at 90°, random rotations between -15° and +15°, and shear deformations of ±10°. The purpose of these transformations was to artificially simulate the variety of viewing angles and target orientations that naturally occur in dynamic landscapes filmed by a drone, forcing the model to learn features that are invariant to rotation and perspective.
- Hyperparameter optimization: Through a process of evolutionary tuning consisting of 100 iterations, we determined the optimal set of hyperparameters for the final training. This method automatically explores the configuration space to find the combination that maximizes performance. The resulting hyperparameters, which define everything from the learning rate to the loss weights and augmentation strategies, are shown in Figure 7.
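The geometric transforms described in the augmentation policy compose naturally as 2×2 affine matrices. The sketch below (angle and shear ranges taken from the text; the function names and composition order are our assumptions) shows how one random augmentation could be sampled:

```python
import random
import numpy as np

def rotation(deg: float) -> np.ndarray:
    """2x2 rotation matrix for the given angle in degrees."""
    t = np.deg2rad(deg)
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def shear(deg: float) -> np.ndarray:
    """2x2 horizontal shear matrix for the given angle in degrees."""
    return np.array([[1.0, np.tan(np.deg2rad(deg))], [0.0, 1.0]])

def sample_augmentation(rng: random.Random) -> np.ndarray:
    """One random transform per the stated policy: a fixed 90-degree rotation,
    a random rotation in [-15, +15] degrees, and a shear of up to +/-10 degrees."""
    m = rotation(90 * rng.randrange(4))      # fixed 90-degree rotations
    m = rotation(rng.uniform(-15, 15)) @ m   # random fine rotation
    m = shear(rng.uniform(-10, 10)) @ m      # shear deformation
    return m

rng = random.Random(0)
pt = np.array([10.0, 5.0])
print(sample_augmentation(rng) @ pt)  # augmented point coordinates
```

The same matrix is applied to the bounding-box corners so the annotations stay aligned with the transformed image.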
4.3. Experimental Setup and Validation of Overall Performance
- Precision (0.9586 / 95.86%) indicates the proportion of correct detections out of the total number of detections performed. The high value reflects a low probability of erroneous predictions and high confidence in the model's results.
- Recall (0.4025 / 40.25%) expresses the ability to identify objects present in the image. The relatively low value indicates that about 60% of objects are missed. This is explained by the increased complexity of the training set resulting from the concatenation of two datasets, which reduces raw performance but increases the generalization and versatility of the model.
- mAP@0.5 (0.6134) combines Precision and Recall, evaluating correct detections at an intersection over union (IoU) threshold of 50%. The result indicates balanced performance, despite the low Recall.
- mAP@0.5:0.95 (0.3539) measures the accuracy of localization across stricter IoU thresholds (50%–95%). The value, significantly lower than mAP@0.5, suggests that although the model detects objects, the generated bounding boxes are not always tightly fitted to their contours.
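For reference, the quantities behind these metrics can be computed from box overlaps. This is a simplified single-class sketch with greedy matching, not the full mAP integration over confidence thresholds; the example data are synthetic:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def precision_recall(preds, truths, iou_thr=0.5):
    """Greedy one-to-one matching of predictions to ground-truth boxes."""
    unmatched = list(truths)
    tp = 0
    for p in preds:
        hit = next((t for t in unmatched if iou(p, t) >= iou_thr), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    fp, fn = len(preds) - tp, len(unmatched)
    return tp / (tp + fp), tp / (tp + fn)

# One accurate detection, two missed objects: precision 1.0, recall ~0.33,
# mirroring the high-precision / low-recall pattern reported above.
preds = [(0, 0, 10, 10)]
truths = [(1, 1, 10, 10), (50, 50, 60, 60), (70, 70, 80, 80)]
print(precision_recall(preds, truths))
```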
4.4. Granular Analysis: From Aggregate Metrics to Error Patterns
4.4.1. Classes Performance
4.4.2. Diagnosing Error Patterns with the Confusion Matrix
5. Geolocation of Detected Targets
5.1. Methodology for Estimating Position
- (lat_d, lon_d): The geographical coordinates (latitude, longitude) of the drone, provided by the GPS module;
- h: The height of the drone above the ground;
- θ: The vertical deviation angle, representing the angle between the vertical axis of the camera and the line of sight to the person. This is calculated based on the vertical position of the person in the image and the vertical field of view (FOV) of the camera;
- ψ: The drone's heading (azimuth). This value is provided by the magnetometer integrated into the flight controller's inertial measurement unit (IMU) and is crucial for defining the projection direction of the visual vector [30];
- (lat_p, lon_p): The estimated geographical coordinates of the person, the final result of the calculation.
- Flight height (h): 27 m;
- GPS position of the drone (lat_d, lon_d): (45.7723°, 22.144°);
- Drone orientation (ψ): 30° (azimuth measured from North);
- Vertical deviation angle (θ): 25° (calculated from the target position in the image).
- Earth's radius (R): 6,371,000 m;
- Angular distance (δ): δ = d/R ≈ 1.98 × 10⁻⁶ rad, where d = h·tan(θ) ≈ 12.59 m is the ground distance from the drone to the target;
- Azimuth (ψ): 30° ≈ 0.5236 rad;
- Drone latitude (lat_d): 45.7723° ≈ 0.7989 rad;
- Drone longitude (lon_d): 22.144° ≈ 0.3865 rad.
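Putting the worked example together, the position can be estimated with the standard direct geodesic ("destination point") formula under a flat-terrain assumption. This is a sketch of that standard computation; the paper's exact formulation may differ in details such as camera tilt handling:

```python
from math import radians, degrees, sin, cos, asin, atan2, tan

R = 6_371_000  # Earth's radius (m)

def locate_target(lat_d, lon_d, h, heading_deg, dev_deg):
    """Estimate target (lat, lon) from drone GPS position, height above
    ground, heading (azimuth from North) and vertical deviation angle."""
    d = h * tan(radians(dev_deg))   # ground distance drone -> target
    delta = d / R                   # angular distance
    psi = radians(heading_deg)
    lat1, lon1 = radians(lat_d), radians(lon_d)
    lat2 = asin(sin(lat1) * cos(delta) + cos(lat1) * sin(delta) * cos(psi))
    lon2 = lon1 + atan2(sin(psi) * sin(delta) * cos(lat1),
                        cos(delta) - sin(lat1) * sin(lat2))
    return degrees(lat2), degrees(lon2)

# Values from the worked example: h = 27 m, heading 30 deg, deviation 25 deg
lat_p, lon_p = locate_target(45.7723, 22.144, 27, 30, 25)
print(round(lat_p, 6), round(lon_p, 6))  # a point roughly 12.6 m NNE of the drone
```

At these short ranges a simple planar offset would give nearly the same answer; the spherical form avoids degradation when the method is reused at longer standoff distances.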
6. Cognitive Agent Architecture
6.1. Components of the Proposed Cognitive-Agentic Architecture
6.1.2. MAg – Memory Agent
6.1.3. RaAg – Reasoning Agent
6.1.4. RAsAg – Risk Assessment Agent
- Rule_ID: A unique identifier for traceability (e.g., R-MED-01).
- Condition (C): The description of the prerequisites that must be met for the rule to act.
- Action (A): The process of generating the structured risk report, specifying the Type, Severity, and Justification.
- Severity (S): A numerical value that dictates the order of execution in case multiple rules are triggered simultaneously. A higher value indicates a higher priority.
- Type of risk
- Severity level: a numerical value, where 1 means low risk and 10 means high risk
- Entities involved: ID of the person or area affected or coordinates
- Justification: A brief explanation of the rules that led to the assessment
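The rule table and report format above could be realized along the following lines. This is a minimal sketch: the condition predicates are stand-ins for the real detectors, and the thresholds (e.g. 120 s of inactivity) are illustrative assumptions, not values from the paper:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    rule_id: str                       # e.g. "R-MED-01", for traceability
    severity: int                      # 1 = low risk ... 10 = high risk
    risk_type: str
    condition: Callable[[dict], bool]  # prerequisite over the current facts

RULES = [
    Rule("R-MED-01", 10, "Sudden Collapse",
         lambda f: f.get("prev_state") == "walk" and f.get("state") == "sit"),
    Rule("R-VUL-01", 8, "Trapped Victim",
         lambda f: f.get("occlusion", 0) >= 0.75 and f.get("inactive_s", 0) > 120),
]

def assess(facts: dict) -> Optional[dict]:
    """Fire all matching rules; report the highest-severity one first."""
    fired = sorted((r for r in RULES if r.condition(facts)),
                   key=lambda r: r.severity, reverse=True)
    if not fired:
        return None
    top = fired[0]
    return {"type": top.risk_type, "severity": top.severity,
            "entities": facts.get("person_id"),
            "justification": f"Rule {top.rule_id} triggered"}

report = assess({"person_id": 7, "prev_state": "walk", "state": "sit"})
print(report["type"], report["severity"])  # Sudden Collapse 10
```

Sorting by severity implements the execution-order rule: when several rules fire simultaneously, the highest-severity report is surfaced first.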
6.1.5. ReAg – Recommendation Agent
6.1.6. Consolidate and Interpret (Orchestrator Agent)
6.1.7. Inter-Agent Communication and Operational Flow
7. Results
7.1. Validation Methodology
- Information Integrity: The ability to maintain the consistency and accuracy of data as it moves through the cognitive cycle (from PAg to the final report).
- Temporal Coherence (Memory): The effectiveness of the Memory Agent (MAg) in maintaining a persistent state of the environment to avoid redundant alerts and adapt to evolving situations.
- Accuracy of Hazard Identification: The accuracy of the system in identifying the most significant threat in a given context.
- Self-Correction Capability: The system's ability to detect and rectify internal logical inconsistencies, a key feature of the Orchestrator Agent.
7.2. First Scenario – Low Risk Situation
7.2.1. Initial Data
7.2.2. Contextual Enrichment (RaAg)
7.2.4. Determining the Response Protocol (ReAg)
7.2.5. Final Validation and Display in GCS
- context: “group of people on a marked path”;
- assessment of “low risk (1/10)”;
- selected response protocol.
- The pins corresponding to individuals were marked in green;
- The operator received an informative, non-intrusive notification.
7.3. Second Scenario – High Risk and Self-Correction
7.3.1. Description of the Initial Situation
7.3.2. Deliberate Error and System Response
7.3.3. Self-Correcting Mechanism
7.3.4. Final Result
- marking the victim on the interactive map with a red pin;
- displaying the detailed report in a prominent window;
- enabling immediate intervention by the operator.
7.4. Third Scenario – Demonstration of Adaptation
- An update to an existing event
- A completely new situation
- Reducing operator cognitive fatigue by limiting unnecessary alerts;
- Increasing operational accuracy by consolidating information;
- Improving information continuity in long-term missions.
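The update-versus-new decision can be sketched as a nearest-event lookup in the Memory Agent's store. The 10 m association radius and planar coordinates below are assumed parameters for illustration, not values from the paper:

```python
from math import hypot

class MemoryAgent:
    """Keeps persistent events; merges detections that match a known event."""
    def __init__(self, radius_m: float = 10.0):  # association radius (assumed)
        self.radius_m = radius_m
        self.events = []  # each: {"xy": (x, y), "state": str, "updates": int}

    def observe(self, xy, state):
        """Classify a detection as an update to a known event or a new one."""
        for ev in self.events:
            if hypot(ev["xy"][0] - xy[0], ev["xy"][1] - xy[1]) <= self.radius_m:
                ev.update(xy=xy, state=state, updates=ev["updates"] + 1)
                return "update"   # existing event: no new alert raised
        self.events.append({"xy": xy, "state": state, "updates": 0})
        return "new"              # genuinely new situation: alert the operator

mag = MemoryAgent()
print(mag.observe((100.0, 50.0), "sit"))   # new
print(mag.observe((103.0, 52.0), "sit"))   # update (same victim, small drift)
print(mag.observe((400.0, 90.0), "walk"))  # new
```

Suppressing alerts for "update" observations is what limits redundant notifications and, in turn, operator cognitive fatigue.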
7.5. Cognitive Performance Analysis
7.6. Computational Cost and Reliability in Time-Critical Operations
A. Computational Cost and Architectural Choices
B. Reliability and Mitigation of Hallucinations
8. Discussion
9. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Sun, J.; Li, B.; Jiang, Y.; Wen, C.-y. A Camera-Based Target Detection and Positioning UAV System for Search and Rescue (SAR) Purposes. Sensors 2016, 16, 1778. [CrossRef]
- Pensieri, Maria Gaia, Mauro Garau, and Pier Matteo Barone. "Drones as an integral part of remote sensing technologies to help missing people." Drones 4.2 (2020): 15. [CrossRef]
- Ashish, Naveen, et al. "Situational awareness technologies for disaster response." Terrorism informatics: Knowledge management and data mining for homeland security. Boston, MA: Springer US, 2008. 517-544.
- Kutpanova, Zarina, et al. "Multi-UAV path planning for multiple emergency payloads delivery in natural disaster scenarios." Journal of Electronic Science and Technology 23.2 (2025): 100303. [CrossRef]
- Kang, Dae Kun, et al. "Optimising disaster response: opportunities and challenges with Uncrewed Aircraft System (UAS) technology in response to the 2020 Labour Day wildfires in Oregon, USA." International Journal of Wildland Fire 33.8 (2024). [CrossRef]
- R. Arnold, J. Jablonski, B. Abruzzo and E. Mezzacappa, "Heterogeneous UAV Multi-Role Swarming Behaviors for Search and Rescue," 2020 IEEE Conference on Cognitive and Computational Aspects of Situation Management (CogSIMA), Victoria, BC, Canada, 2020, pp. 122-128. [CrossRef]
- Alotaibi, Ebtehal Turki, Shahad Saleh Alqefari, and Anis Koubaa. "LSAR: Multi-UAV collaboration for search and rescue missions." IEEE Access 7 (2019): 55817-55832. [CrossRef]
- Zak, Yuval, Yisrael Parmet, and Tal Oron-Gilad. "Facilitating the work of unmanned aerial vehicle operators using artificial intelligence: an intelligent filter for command-and-control maps to reduce cognitive workload." Human Factors 65.7 (2023): 1345-1360. [CrossRef]
- Zhang, Wenjuan, et al. "Unmanned aerial vehicle control interface design and cognitive workload: A constrained review and research framework." 2016 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, 2016.
- Jiang, Peiyuan, et al. "A Review of Yolo algorithm developments." Procedia computer science 199 (2022): 1066-1073. [CrossRef]
- Sapkota, Ranjan, Konstantinos I. Roumeliotis, and Manoj Karkee. "UAVs Meet Agentic AI: A Multidomain Survey of Autonomous Aerial Intelligence and Agentic UAVs." arXiv preprint arXiv:2506.08045 (2025). [CrossRef]
- Jones, Brennan, Anthony Tang, and Carman Neustaedter. "RescueCASTR: Exploring Photos and Live Streaming to Support Contextual Awareness in the Wilderness Search and Rescue Command Post." Proceedings of the ACM on Human-Computer Interaction 6.CSCW1 (2022): 1-32. [CrossRef]
- Volpi, Michele, and Vittorio Ferrari. "Semantic segmentation of urban scenes by learning local class interactions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2015.
- Kutpanova, Zarina, et al. "Multi-UAV path planning for multiple emergency payloads delivery in natural disaster scenarios." Journal of Electronic Science and Technology 23.2 (2025): 100303. [CrossRef]
- Gaitan, N.C.; Batinas, B.I.; Ursu, C.; Crainiciuc, F.N. Integrating Artificial Intelligence into an Automated Irrigation System. Sensors 2025, 25, 1199. [CrossRef]
- M. Atif, R. Ahmad, W. Ahmad, L. Zhao and J. J. P. C. Rodrigues, "UAV-Assisted Wireless Localization for Search and Rescue," in IEEE Systems Journal, vol. 15, no. 3, pp. 3261-3272, Sept. 2021. [CrossRef]
- S. Hayat, E. Yanmaz, T. X. Brown and C. Bettstetter, "Multi-objective UAV path planning for search and rescue," 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 5569-5574. [CrossRef]
- Liu, C.; Szirányi, T. Real-Time Human Detection and Gesture Recognition for On-Board UAV Rescue. Sensors 2021, 21, 2180. [CrossRef]
- D. Cavaliere, S. Senatore and V. Loia, "Proactive UAVs for Cognitive Contextual Awareness," in IEEE Systems Journal, vol. 13, no. 3, pp. 3568-3579, Sept. 2019. [CrossRef]
- Al-Haddad, Luttfi Ahmed, et al. "Energy consumption and efficiency degradation predictive analysis in unmanned aerial vehicle batteries using deep neural networks." Adv. Sci. Technol. Res. J 19.5 (2025): 21-30. [CrossRef]
- T. Toschi et al., Evaluation of DJI Matrice 300 RTK Performance in Photogrammetric Surveys with Zenmuse P1 and L1 Sensors, ISPRS Archives, Vol. XLIII-B1-2022, pp. 339–346, 2022. [CrossRef]
- T. Toschi et al., Evaluation of DJI Matrice 300 RTK Performance in Photogrammetric Surveys with Zenmuse P1 and L1 Sensors, ISPRS Archives, Vol. XLIII-B1-2022, pp. 339–346, 2022. [CrossRef]
- Liang, Junbiao. "A review of the development of YOLO object detection algorithm." Appl. Comput. Eng 71.1 (2024): 39-46. [CrossRef]
- Wang, Xin, et al. "Yolo-erf: lightweight object detector for uav aerial images." Multimedia Systems 29.6 (2023): 3329-3339. [CrossRef]
- Terven, Juan, Diana-Margarita Córdova-Esparza, and Julio-Alejandro Romero-González. "A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas." Machine learning and knowledge extraction 5.4 (2023): 1680-1716. [CrossRef]
- Ragab, Mohammed Gamal, et al. "A comprehensive systematic review of YOLO for medical object detection (2018 to 2023)." IEEE Access 12 (2024): 57815-57836. [CrossRef]
- Nihal, Ragib Amin, et al. "UAV-Enhanced Combination to Application: Comprehensive analysis and benchmarking of a human detection dataset for disaster scenarios." International Conference on Pattern Recognition. Cham: Springer Nature Switzerland, 2024. 145-162.
- https://www.kaggle.com/datasets/kuantinglai/ntut-4k-drone-photo-dataset-for-human-detection/data.
- Zhao, Xiaoyue, et al. "Detection, tracking, and geolocation of moving vehicle from uav using monocular camera." IEEE Access 7 (2019): 101160-101170. [CrossRef]
- Mallick, Mahendra. "Geolocation using video sensor measurements." 2007 10th International Conference on Information Fusion. IEEE, 2007.
- Cai, Y.; Zhou, Y.; Zhang, H.; Xia, Y.; Qiao, P.; Zhao, J. Review of Target Geo-Location Algorithms for Aerial Remote Sensing Cameras without Control Points. Appl. Sci. 2022, 12, 12689. [CrossRef]
- Thrun, Sebastian. "Toward a framework for human-robot interaction." Human–Computer Interaction 19.1-2 (2004): 9-24.
- Brooks, Rodney A. "Intelligence without representation." Artificial intelligence 47.1-3 (1991): 139-159. [CrossRef]
- Gat, Erann, R. Peter Bonnasso, and Robin Murphy. "On three-layer architectures." Artificial intelligence and mobile robots 195 (1998): 210.
- Laird, John E., Christian Lebiere, and Paul S. Rosenbloom. "A standard model of the mind: Toward a common computational framework across artificial intelligence, cognitive science, neuroscience, and robotics." Ai Magazine 38.4 (2017): 13-26. [CrossRef]
- Karoudis, Konstantinos, and George D. Magoulas. "An architecture for smart lifelong learning design." Innovations in smart learning. Singapore: Springer Singapore, 2016. 113-118.
- Webb, Taylor, Keith J. Holyoak, and Hongjing Lu. "Emergent analogical reasoning in large language models." Nature Human Behaviour 7.9 (2023): 1526-1541. [CrossRef]
- Cavaliere, Danilo, Sabrina Senatore, and Vincenzo Loia. "Proactive UAVs for cognitive contextual awareness." IEEE Systems Journal 13.3 (2018): 3568-3579. [CrossRef]
- Zheng, Yu, Yujia Zhu, and Lingfeng Wang. "Consensus of heterogeneous multi-agent systems." IET Control Theory & Applications 5.16 (2011): 1881-1888. [CrossRef]
- Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall, 1995.
- Bousetouane, Fouad. "Physical AI Agents: Integrating Cognitive Intelligence with Real-World Action." arXiv preprint arXiv:2501.08944 (2025). [CrossRef]
- Romero, Marcos Lima, and Ricardo Suyama. “Agentic AI for Intent-Based Industrial Automation.” arXiv preprint arXiv:2506.04980 (2025). [CrossRef]
- Weiss, Michael, and Franz Stetter. "A hierarchical blackboard architecture for distributed AI systems." Proceedings Fourth International Conference on Software Engineering and Knowledge Engineering. IEEE, 1992.
- Yao, Shunyu, et al. "React: Synergizing reasoning and acting in language models." International Conference on Learning Representations (ICLR). 2023. [CrossRef]
- Wang, Lei, et al. "A survey on large language model based autonomous agents." Frontiers of Computer Science 18.6 (2024): 186345. [CrossRef]
- Park, Joon Sung, et al. "Generative agents: Interactive simulacra of human behavior." Proceedings of the 36th annual acm symposium on user interface software and technology. 2023.


















| Comparison criterion | The proposed drone | DJI Mini 2 [20] | DJI Matrice 300 RTK [21,22] |
| --- | --- | --- | --- |
| Drone type | Custom-built multi-motor (quadcopter) | Multi-motor (quadcopter) | Multi-motor (quadcopter) |
| Payload (kg) | 0.5–1.0 (delivery module 0.3–0.7) | ~0.1 | 2.7 |
| Hardware configuration | 920 KV motors, 3650 mAh LiPo battery, STM32F405 flight controller, camera sensors, LiDAR, GPS | Electric motors with 2S 2250 mAh battery | Coaxial electric motors, compatible with Zenmuse P1/L1 sensors |
| Autonomy (min) | 20–22 | 16–18 | 55 |
| Artificial intelligence (AI) capabilities | YOLO11 + cognitive-agentic architecture (LLM) | Integrated AI for photography and videography purposes | Integrated AI functions for mapping and automatic inspection |
| Maximum range (km) | 0.5–1.0 | 50–200 | 15 |
| Rule_ID | Severity | Condition | Action |
| --- | --- | --- | --- |
| R-MED-01 | 10 | Sudden Collapse: An abrupt change from an active to an inactive state. | |
| R-MED-02 | 9 | Critical Inactivity: A person in a vulnerable position for an extended period. | |
| R-ENV-01 | 8 | Environmental Hazard: A person located in a known natural danger zone. | |
| R-BHV-01 | 7 | Mass Evacuation: Coordinated fleeing behavior of a group. | |
| R-VUL-01 | 8 | Trapped Victim: High occlusion and prolonged inactivity suggest entrapment. | |
| Cognitive Metrics | Scenario 1 (Low Risk) | Scenario 2 (High Risk) | Scenario 3 (With Forced Error) |
| --- | --- | --- | --- |
| Total decision time (seconds) | 11 | 14 | 18 |
| Risk assessment accuracy | Correct (1/10) | Correct (9/10) | Initially incorrect, corrected to 9/10 |
| Self-correction success | N/A | N/A | N/A |
| Long-term memory calls (MAg) | 1 | 2 | 3 |
| Final report fidelity vs. situation | High | High | High (after correction) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).