Submitted: 21 May 2026
Posted: 22 May 2026
Abstract
Keywords:
1. Introduction
1.1. Scope and Focus of the Review
1.2. Literature Search Process
1.3. Outline of the Review
2. Foundations
2.1. Large Language Models
2.2. The Self-Evolving Agent Harness
1. Observation: The agent reads tool results, EHR data, or clinical input from the environment.
2. Reasoning: The LLM analyzes the current state against the clinical goal, generating an internal reasoning trace that references accumulated memory and skills.
3. Action: The agent executes a tool call (medical database query, code execution, image analysis) or generates a clinical output.
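The observation, reasoning, and action steps above can be sketched as a small loop. This is a minimal illustration rather than the framework's implementation: `run_loop`, `stub_reason`, and the `lab_lookup` tool are hypothetical names, the reasoning step is a stub standing in for an LLM call, and memory and skill retrieval are left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (thought, action name, action argument); the action "final" ends the loop.
Decision = Tuple[str, str, str]


def run_loop(
    goal: str,
    clinical_input: str,
    reason: Callable[[str, str, List[str]], Decision],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Run observe -> reason -> act until a final output or the step budget."""
    observation = clinical_input           # Observation: initial clinical input
    trace: List[str] = []                  # internal reasoning trace
    for _ in range(max_steps):
        thought, action, arg = reason(goal, observation, trace)  # Reasoning
        trace.append(thought)
        if action == "final":              # Action: emit a clinical output
            return arg, trace
        observation = tools[action](arg)   # Action: tool call feeds the next observation
    return None, trace                     # step budget exhausted, no output


def stub_reason(goal: str, observation: str, trace: List[str]) -> Decision:
    """Hypothetical stand-in for the LLM: one tool call, then a final answer."""
    if not trace:
        return ("need the potassium value", "lab_lookup", "potassium")
    return (f"observed {observation}", "final", f"summary: {observation}")


stub_tools = {"lab_lookup": lambda name: f"{name}=4.1 mmol/L"}
output, trace = run_loop("summarise labs", "new patient note", stub_reason, stub_tools)
```

With the stub reasoner the loop makes one tool call and then returns, so `trace` holds two thoughts. The `max_steps` bound is an added safeguard, not something the source specifies.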
- Tier 1: Environment and preference memory. Persistent configuration files store facts about the clinical environment (EHR configuration, encountered edge cases) and clinician preferences (communication style, specialty focus). These are loaded as a frozen snapshot at session start; updates written during a session take effect in the next session, ensuring within-session consistency while accumulating knowledge across sessions.
- Tier 2: Skill memory. When the agent completes a complex clinical task, it automatically synthesizes the experience into a reusable skill document stored in a persistent local skill library. Skills are loaded hierarchically: Level 0 loads only skill names and descriptions; Level 1 loads the full skill specification on demand; Level 2 loads specific reference files. This lazy loading minimizes token overhead while maintaining a rich, searchable clinical skill library.
- Tier 3: Session search index. All historical conversations are indexed in a full-text search database, enabling the agent to retrieve relevant precedents from past clinical interactions. This tier supports long-range continuity across hundreds of clinical sessions.
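A rough sketch of the three tiers follows, assuming in-memory dictionaries in place of the persistent files, skill library, and full-text database the text describes. The class `TieredMemory` and all method names are hypothetical; the keyword search is a deliberately naive substitute for a real full-text index.

```python
import copy
from typing import Dict, List


class TieredMemory:
    """Illustrative three-tier memory; storage is in-process only."""

    def __init__(self, config: Dict[str, str], skills: Dict[str, dict]):
        self._stored_config = dict(config)  # Tier 1: persistent store
        self._skills = skills               # Tier 2: name -> description/spec/refs
        self._sessions: List[str] = []      # Tier 3: historical conversations
        self.snapshot: Dict[str, str] = {}  # frozen view used within a session

    # Tier 1: frozen snapshot at session start.
    def start_session(self) -> None:
        self.snapshot = copy.deepcopy(self._stored_config)

    def write_config(self, key: str, value: str) -> None:
        # Written to the store only; visible after the next start_session().
        self._stored_config[key] = value

    # Tier 2: lazy, hierarchical skill loading.
    def skill_index(self) -> Dict[str, str]:
        """Level 0: names and descriptions only."""
        return {name: s["description"] for name, s in self._skills.items()}

    def load_skill(self, name: str) -> str:
        """Level 1: full skill specification, on demand."""
        return self._skills[name]["spec"]

    def load_reference(self, name: str, ref: str) -> str:
        """Level 2: one specific reference file."""
        return self._skills[name]["refs"][ref]

    # Tier 3: search over past sessions (naive all-terms match).
    def log_session(self, text: str) -> None:
        self._sessions.append(text)

    def search(self, query: str) -> List[str]:
        terms = query.lower().split()
        return [s for s in self._sessions if all(t in s.lower() for t in terms)]
```

The point of the snapshot copy is that a write made mid-session cannot change what the agent already sees, which is the within-session consistency property described for Tier 1.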
1. The agent executes clinical tasks (e.g., differential diagnosis, radiology report review) and records full interaction trajectories.
2. A distributed RL coordinator collects trajectories and routes them to task-specific verifiers.
3. Domain-specific verifiers (medical knowledge checkers, clinical guideline validators) score trajectories via rejection sampling, filtering for high-quality clinical reasoning.
4. A training service receives filtered trajectories and performs parameter-efficient fine-tuning (e.g., LoRA [68]) of the backbone model, updating clinical reasoning capabilities.
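Steps 2 and 3 of this pipeline, routing each trajectory to a task-specific verifier and keeping only those that score well, can be sketched as below. The names `Trajectory` and `filter_trajectories`, the `min_score` parameter, and the toy verifier are all assumptions for illustration; the fine-tuning step itself is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    task_type: str      # e.g. "diagnosis"; used to pick the verifier
    steps: List[str]    # recorded interaction steps


def filter_trajectories(
    trajectories: List[Trajectory],
    verifiers: Dict[str, Callable[[Trajectory], float]],
    min_score: float,
) -> List[Trajectory]:
    """Route each trajectory to its verifier and keep those scoring >= min_score."""
    accepted: List[Trajectory] = []
    for traj in trajectories:
        verifier = verifiers.get(traj.task_type)
        if verifier is None:
            continue  # no verifier for this task type: reject rather than guess
        if verifier(traj) >= min_score:
            accepted.append(traj)
    return accepted


# Toy verifier (hypothetical): scores by trajectory length only.
toy_verifiers = {"diagnosis": lambda t: 30.0 * len(t.steps)}
batch = [
    Trajectory("diagnosis", ["history", "labs", "differential"]),
    Trajectory("diagnosis", ["guess"]),
    Trajectory("unverified_task", ["step"]),
]
kept = filter_trajectories(batch, toy_verifiers, min_score=80.0)
```

Only the first trajectory survives here: the second scores too low and the third has no verifier. The accepted list is what a training service would receive in step 4.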
2.3. Reference Implementation: The Hermes Framework
3. Self-Evolving Agent Engineering: Taxonomy of Mechanisms
3.1. Memory Engineering
1. Novelty trigger: No existing skill matches the current task (cosine similarity below threshold against all skill descriptions). The agent reasons from general medical knowledge, retrieves relevant references via MCP tools, and, if a domain-specific clinical verifier scores the output above 72, synthesizes a new skill document following the agentskills.io standard and indexes it in the local skill library.
2. Performance trigger: An existing skill is used but the verifier scores the output below the skill’s historical best by more than a configurable margin (default: 10 points). The agent generates an improved skill variant incorporating newly identified reasoning steps or clinical pathways; the updated version replaces the prior skill only if it scores higher on a held-out validation set of similar cases.
3. Contradiction trigger: A new clinical guideline fetched via PubMed or UpToDate MCP tools contradicts an assumption embedded in an existing skill. The agent flags the skill for revision and generates an updated version citing the new guideline as authority.
4. Consolidation trigger: Every N tasks (configurable, default N = 50), the agent reviews its skill library for redundancy: pairs with cosine similarity above 0.85 on their description embeddings are merged into a single unified skill. This prevents skill library bloat and maintains retrieval efficiency.
5. RL trigger: High-quality trajectories (verifier score above the RL threshold) are submitted to the Atropos RL pipeline, where Tinker performs LoRA fine-tuning on the Hermes backbone. This improves the underlying reasoning quality that drives all subsequent skill synthesis, creating a compounding improvement cycle.
6. Boundary trigger: A skill is flagged for refinement when repeated invocations on similar clinical inputs produce inconsistent outcomes under a fixed policy (some succeeding, others failing) rather than systematically scoring above or below a fixed threshold. This stochastic-instability signal, recently introduced in the agent-data co-evolution framework of Yang et al. [71], where it accounts for >45% of detected weaknesses on tool-use benchmarks, captures decision-boundary cases that the deterministic performance trigger misses. In a clinical context, boundary signals identify skills whose recommendations are unstable across superficially similar cases (e.g., a chest-radiograph skill that correctly handles obvious consolidation but flips between findings on borderline opacities), prioritising precisely the skill regions most in need of additional training data.
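The six triggers can be read as independent predicates over a task outcome. The sketch below encodes them that way, as one possible interpretation. The function name and argument names are hypothetical; `match_threshold` and `rl_threshold` have no values in the text and must be supplied by the caller; the 72-point novelty bar, 10-point margin, and 50-task consolidation interval are the figures stated in the source.

```python
from typing import List, Optional, Sequence


def fired_triggers(
    best_match_similarity: float,       # highest cosine similarity to any skill description
    verifier_score: float,
    skill_best_score: Optional[float],  # historical best of the matched skill, if any
    repeat_outcomes: Sequence[bool],    # pass/fail of repeated runs on similar inputs
    guideline_conflict: bool,           # a fetched guideline contradicts the skill
    tasks_completed: int,
    *,
    match_threshold: float,             # assumption: value not given in the text
    rl_threshold: float,                # assumption: value not given in the text
    novelty_min_score: float = 72.0,
    margin: float = 10.0,
    consolidate_every: int = 50,
) -> List[str]:
    """Return the names of all triggers that fire for one completed task."""
    fired: List[str] = []
    matched = best_match_similarity >= match_threshold
    if not matched and verifier_score > novelty_min_score:
        fired.append("novelty")
    if matched and skill_best_score is not None and skill_best_score - verifier_score > margin:
        fired.append("performance")
    if guideline_conflict:
        fired.append("contradiction")
    if tasks_completed > 0 and tasks_completed % consolidate_every == 0:
        fired.append("consolidation")
    if verifier_score >= rl_threshold:
        fired.append("rl")
    if len(set(repeat_outcomes)) > 1:   # mixed successes and failures
        fired.append("boundary")
    return fired
```

Note how the boundary check ignores the score entirely: it fires on disagreement between repeated runs, which is why it can catch cases the score-based performance trigger does not.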
1. Query the skill library (L0 lookup: skill names and descriptions, ≈3K tokens).
2. If a matching skill is found (similarity above the match threshold), load the full skill document (L1) and execute the task with skill guidance.
3. Evaluate the output with the relevant domain verifier. If the verifier score exceeds the skill’s recorded best, log the trajectory and update the skill’s performance history. If the score falls below the recorded best by more than 10 points, initiate skill refinement (performance trigger).
4. If no matching skill is found, execute with general clinical reasoning, retrieve supporting evidence via MCP tools, and, if the verifier score exceeds 72, synthesise a new skill document (novelty trigger).
5. After every 50 tasks, run the consolidation pass to merge skill documents with similarity above 0.85 (consolidation trigger).
6. All trajectories with a verifier score above the RL threshold are queued for RL fine-tuning of the backbone (RL queue).
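Two pieces of this loop are easy to make concrete: the similarity lookup in steps 1 and 2 and the consolidation pass in step 5. The sketch below uses plain Python lists as stand-ins for description embeddings. The function names are hypothetical, `match_threshold` is caller-supplied because the text gives no value, and merging transitively linked pairs into one group is an assumption about how "pairs ... are merged" would be applied.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def best_match(
    query: Sequence[float],
    skill_vecs: Dict[str, Sequence[float]],
    match_threshold: float,
) -> Optional[str]:
    """Name of the most similar skill, or None if nothing reaches the threshold."""
    best_name, best_sim = None, -1.0
    for name, vec in skill_vecs.items():
        sim = cosine(query, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= match_threshold else None


def consolidation_groups(
    skill_vecs: Dict[str, Sequence[float]],
    merge_threshold: float = 0.85,
) -> List[List[str]]:
    """Groups of skills linked by pairwise similarity above merge_threshold."""
    names = sorted(skill_vecs)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(skill_vecs[a], skill_vecs[b]) > merge_threshold:
                parent[find(b)] = find(a)

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return [g for g in groups.values() if len(g) > 1]


toy_vecs = {"cxr_a": [1.0, 0.0], "cxr_b": [0.9, 0.1], "ecg": [0.0, 1.0]}
```

On the toy vectors, `cxr_a` and `cxr_b` are nearly parallel and would be grouped for merging, while `ecg` stays separate. The pairwise scan is quadratic in library size, which is acceptable for a periodic pass over a modest skill library but would need an index for a large one.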
3.2. Tool and Multi-Agent Engineering
- EHR connectors: Structured queries against HL7 FHIR endpoints for patient history, lab results, and medication lists.
- Medical knowledge bases: Integration with PubMed, UpToDate, and clinical guideline repositories via API.
- Imaging tools: DICOM viewers, segmentation models, and radiology-specific analysis libraries.
- Code execution: The execute_code tool enables the agent to perform biostatistical analyses, generate clinical visualizations, and run diagnostic algorithms within a single inference call.
- Simultaneous differential diagnosis exploration across disease categories
- Parallel review of imaging findings, laboratory results, and clinical history
- Independent pharmacological interaction checking and dosage verification
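Assuming the three items above describe work delegated to concurrent sub-agents, the fan-out can be sketched with a thread pool from the standard library. The helper `run_parallel` and the three stub sub-agents are hypothetical; real sub-agents would be LLM or tool calls rather than one-line functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def run_parallel(
    case: Dict[str, Any],
    subagents: Dict[str, Callable[[Dict[str, Any]], Any]],
    max_workers: int = 3,
) -> Dict[str, Any]:
    """Run every sub-agent on the same case concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn, case) for name, fn in subagents.items()}
        # result() blocks until that sub-agent finishes and re-raises its exception, if any.
        return {name: fut.result() for name, fut in futures.items()}


# Stub sub-agents (hypothetical) mirroring the three parallel activities.
stub_subagents = {
    "differential": lambda c: sorted(c["symptoms"]),
    "findings_review": lambda c: len(c["labs"]),
    "interaction_check": lambda c: {"warfarin", "aspirin"} <= set(c["medications"]),
}
example_case = {
    "symptoms": ["fever", "cough"],
    "labs": ["CRP", "WBC"],
    "medications": ["warfarin", "aspirin"],
}
results = run_parallel(example_case, stub_subagents)
```

Threads suit this pattern when each sub-agent spends most of its time waiting on network calls; the sub-agents here only read the shared case, so no locking is needed.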
3.3. Comparison: Self-Evolving Harness Vs. Stateless Approaches
4. Applications in Healthcare
| Reference | Task | Agent Type | Dataset | Highlight | SEAE |
|---|---|---|---|---|---|
| Classification and diagnosis tasks (§4.1) | | | | | |
| MDAgents [5] | Medical decision making | Adaptive multi-agent | MedQA, MMLU-Medical | Dynamic collaboration depth | – |
| Agent Hospital [14] | Disease diagnosis | Frozen-LLM evolvable agent | MedQA (USMLE) + 20K-patient simulacrum | 88.2→92.2% MedQA (GPT-4o); 77.0→95.3% internal diagnosis after 20K-patient evolution | P·S |
| ClinicalAgents [17] | Clinical diagnosis | Dual-memory multi-agent | Clinical EHR | MCTS orchestration + experience memory | P·S |
| GPT4MIA [95] | Medical image classification | Plug-and-play GPT-4 (non-agentic) | RetinaMNIST, FractureMNIST3D | Plug-and-play transductive inference | – |
| Generation tasks (§4.2) | | | | | |
| Best Practices Radiology [3] | Report generation guidance | Practitioner guidance (review) | — (no benchmarked dataset) | Best-practice principles for LLM use in radiology reporting | – |
| Lyu et al. [96] | Radiology report plain-language translation | GPT-4 agent | Radiology report corpora | Translates technical radiology reports into patient-readable language | – |
| R2GenGPT [35] | Radiology report generation from CXR | Frozen-LLM + visual alignment (≈5M trainable params) | IU-Xray, MIMIC-CXR | Delta-tuning aligns Swin-Transformer features to a frozen LLM; near-SOTA BLEU/ROUGE at 0.07% of total parameters | – |
| NapSS [97] | Paragraph-level medical text simplification | Two-stage seq2seq (non-agentic) | English medical simplification corpus (Cochrane-style) | Summarize→simplify with narrative prompts; +3–4 pts SARI over seq2seq baseline | – |
| TissueLab (Co-evolving AI) [98] | Medical image analysis with expert-in-the-loop | Co-evolving agentic system | Medical imaging tasks | Continuous learning from clinician feedback; explainable workflows | P·S·R |
| Radiologist Copilot [99] | 3D radiology report generation | Agentic system with QC feedback | Radiology benchmarks | Integrated observation, template selection, and feedback-driven iterative refinement | – |
| Detection tasks (§4.3) | | | | | |
| MACRO [79] | Medical imaging analysis | Self-skill discovery agent | Diverse medical imaging datasets | Composite tool synthesis from execution histories | P·S·R |
| Augmentation tasks (§4.4) | | | | | |
| Agent Hospital [14] | Synthetic patient generation | Simulacrum agents | 20K virtual patients × 21 departments | Retrieval-based self-evolution via simulated encounters | P·S |
| MEDIMP [100] | Medical image representation learning | Contrastive multi-modal model (non-agentic) | DCE MRI of renal transplants + clinical tabular data | Contrastive learning with LLM-generated prompts; cited as a representation that agents can use | – |
| AugGPT [101] | Text data augmentation | ChatGPT-based rephrasing (non-agentic) | Few-shot text classification benchmarks | ChatGPT rephrasing for few-shot classification; cited as a data-augmentation tool | – |
| Hermes RL Trajectories† | Agent-generated training data (general) | Self-improving agent framework | Atropos-verified trajectories (general; no benchmarked medical run) | Framework that generates its own RL training data; clinical applicability is conjectural | P·S·R |
| Question answering tasks (§4.5) | | | | | |
| Survey: Agents in Medicine§ [4] | Medical QA benchmark | Survey (multiple agents) | MedQA, MedMCQA, USMLE | Systematic evaluation | — |
| RadioRAG [12] | Radiology-specific QA | RAG agent | RSNA 80Q + 24 expert-curated | Real-time domain-specific retrieval | – |
| KG-Agent⋄ [102] | General KG-augmented QA (non-medical) | Knowledge-augmented agent | General KG benchmarks | Multi-hop KG traversal; 10K-sample tuning outperforms larger LLMs | – |
| Clinical QA System [103] | Interactive clinical QA over notes | Single-turn extractive QA agent | Unstructured clinical notes | Extractive answers with in-note highlighting for traceability | – |
| Dynamic Clinical QA [104] | Document-based QA | RAG agent | Clinical documents | Real-time RAG from clinical records | – |
| Inference tasks (§4.6) | | | | | |
| ClinicalAgents [17] | Complex diagnosis | MCTS multi-agent | EHR | Monte Carlo Tree Search for clinical reasoning | P·S |
| Reinventing Clinical Dialogue§ [105] | Survey of agentic clinical-dialogue paradigms | Survey (taxonomy of 4 archetypes) | — (survey, no benchmarked dataset) | First-principles taxonomy of agentic clinical-dialogue systems | — |
| AI Clinician‡ [106] | Sepsis treatment optimization | RL agent | MIMIC-III ICU | Optimal policy learning from 17K patient trajectories | R |
| DiagAgent [107] | Multi-turn diagnostic inference | RL-trained agent | DiagBench (2.2K cases) | +11.20% accuracy, +17.58% exam F1 over 11 LLMs | R |
| DoctorAgent-RL [108] | Clinical dialogue | Multi-agent RL | MTMedDialog | RL-refined questioning strategy for diagnosis | R |
| MedAide [109] | Multi-intent medical QA | Rotation multi-agent | 4 medical benchmarks | Intent-aware multi-agent routing | – |
| Medical Data Inference [110] | End-to-end clinical inference | Multi-agent pipeline (7 specialised agents) | Geriatrics, palliative care, colonoscopy imaging | Ingestion, anonymisation, feature extraction, model-data matching, preprocessing, and inference stages | – |
4.1. Classification and Diagnosis Tasks
4.2. Generation Tasks
4.3. Detection Tasks
4.4. Augmentation Tasks
4.5. Question Answering Tasks
4.6. Inference Tasks
4.7. Recap: Self-Evolving Agent Engineering Across Clinical Tasks
1. Persistence compounds performance (in simulation/benchmark): Where self-evolving systems are present, accumulated experience can improve performance over session-bounded baselines, though this advantage is currently demonstrated in a minority of reviewed papers and primarily in simulated or benchmark settings. Agent Hospital’s reported diagnosis-accuracy gain (77.0%→95.3% on its internal 21-department simulacrum after 20K-patient evolution, with a smaller but consistent +4 pt MedQA gain on GPT-4o) and co-evolving AI’s reported imaging-quality gains over a static baseline are the clearest examples; prospective clinical validation remains open.
2. Tool integration multiplies capability: Agent systems with well-engineered tool access (RadioRAG’s real-time retrieval; KG-Agent’s knowledge graph traversal, shown on general rather than medical benchmarks) are reported to outperform larger memoryless models, suggesting that tool engineering can matter as much as model scaling.
3. Multi-agent coordination for complex cases: Complex clinical tasks consistently benefit from multi-agent architectures (MDAgents, ClinicalAgents) that decompose the diagnostic problem across specialist agents.
4. Self-generated training data closes the data gap: Agent Hospital’s simulacrum and self-generated RL trajectories offer a route to training data that does not require access to real patients, which could ease the data scarcity that constrains medical AI; for the RL trajectories, clinical applicability remains conjectural.
5. Challenges and Future Directions
5.1. Challenges in Self-Evolving Agent Engineering for Healthcare
5.2. Future Directions
6. Conclusions
6.1. Summary
6.2. Limitations
6.3. Key Contributions
6.4. Recommendations for Future Research
Author Contributions: Dengzhe Hou
Acknowledgments
Conflicts of Interest
References
- Shortliffe, E.H.; Davis, R.; Axline, S.G.; Buchanan, B.G.; Green, C.C.; Cohen, S.N. Computer-based consultations in clinical therapeutics: Explanation and rule acquisition capabilities of the MYCIN system. Computers and Biomedical Research 1975, 8, 303–320. [CrossRef]
- Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning, 2017, [1711.05225].
- Bluethgen, C.; Veen, D.V.; Zakka, C.; Link, K.; Fanous, A.; Daneshjou, R.; Frauenfelder, T.; Langlotz, C.; Gatidis, S.; Chaudhari, A. Best Practices for Large Language Models in Radiology, 2024, [2412.01233].
- Wang, W.; Ma, Z.; Wang, Z.; Wu, C.; Ji, J.; Chen, W.; Li, X.; Yuan, Y. A Survey of LLM-based Agents in Medicine: How Far Are We from Baymax? Findings of the Association for Computational Linguistics: ACL 2025 2025, pp. 10345–10359.
- Kim, Y.; Park, C.; Jeong, H.; Chan, Y.S.; Xu, X.; McDuff, D.; Lee, H.; Ghassemi, M.; Breazeal, C.; Park, H.W. MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making, 2024, [2404.15155].
- OpenAI. GPT-4 Technical Report, 2023, [2303.08774].
- Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. Anthropic model card. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024.
- Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; et al. Gemini: A Family of Highly Capable Multimodal Models, 2023, [2312.11805].
- Wang, J.; Shi, E.; Yu, S.; Wu, Z.; Hu, H.; Ma, C.; Dai, H.; Yang, Q.; Kang, Y.; Wu, J.; et al. Prompt engineering for healthcare: Methodologies and applications. Meta-Radiology 2025, p. 100190. [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022, Vol. 35, pp. 24824–24837.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2020, Vol. 33, pp. 1877–1901.
- Arasteh, S.T.; Lotfinia, M.; Bressem, K.; Siepmann, R.; Adams, L.; Ferber, D.; Kuhl, C.; Kather, J.N.; Nebelung, S.; Truhn, D. RadioRAG: Online Retrieval-Augmented Generation for Radiology Question Answering, 2024, [2407.15621].
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey, 2024, [2312.10997].
- Li, J.; Lai, Y.; Li, W.; Ren, J.; Zhang, M.; Kang, X.; Wang, S.; Li, P.; Zhang, Y.Q.; Ma, W.; et al. Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents, 2024, [2405.02957].
- Packer, C.; Wooders, S.; Lin, K.; Fang, V.; Patil, S.G.; Stoica, I.; Gonzalez, J.E. MemGPT: Towards LLMs as Operating Systems, 2023. [CrossRef]
- Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; Wang, Y. MemoryBank: Enhancing Large Language Models with Long-Term Memory, 2023. [CrossRef]
- Ge, Z.; Li, H.; Wang, Y.; Hu, N.; Zhang, C.J.; Li, Q. ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory, 2026, [2603.26182].
- Du, P. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers, 2026, [2603.07670].
- Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023, Vol. 36, pp. 8634–8652.
- Topol, E.J. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again; Basic Books, 2019.
- Rajpurkar, P.; Chen, E.; Banerjee, O.; Topol, E.J. AI in health and medicine. Nature Medicine 2022, 28, 31–38. [CrossRef]
- Gorenshtein, A.; Omar, M.; Glicksberg, B.S.; Nadkarni, G.N.; Klang, E. AI agents in clinical medicine: A systematic review. medRxiv 2025. [CrossRef]
- Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model Based Autonomous Agents, 2023. [CrossRef]
- Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large Language Model Based Multi-Agents: A Survey of Progress and Challenges, 2024. [CrossRef]
- He, K.; Mao, R.; Lin, Q.; Ruan, Y.; Lan, X.; Feng, M.; Cambria, E. A Survey of Large Language Models for Healthcare: From Data, Technology, and Applications to Accountability and Ethics, 2023. [CrossRef]
- Khosravi, B.; Rouzrokh, P.; Akinci D’Antonoli, T.; Moassefi, M.; Faghani, S.; Mansuri, A.; Bressem, K.; Tejani, A.; Gichoya, J. Agentic AI in Radiology: Evolution from Large Language Models to Future Clinical Integration. Radiology: Artificial Intelligence 2026, 8, e250651. [CrossRef]
- Koçak, B.; Meşe, İ. AI agents in radiology: Toward autonomous and adaptive intelligence. Diagnostic and Interventional Radiology 2025. [CrossRef]
- Bluethgen, C.; Veen, D.V.; Truhn, D.; Kather, J.N.; Moor, M.; Polacin, M.; Chaudhari, A.; Frauenfelder, T.; Langlotz, C.P.; Krauthammer, M.; et al. Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges, 2025, [2510.09404].
- Collaco, B.G.; Haider, S.A.; Prabha, S.; Gomez-Cabello, C.A.; Genovese, A.; Wood, N.G.; Bagaria, S.P.; Gopala, N.; Tao, C.; Forte, A.J. The role of agentic artificial intelligence in healthcare: A scoping review. npj Digital Medicine 2026, 9, 345. [CrossRef]
- Gao, H.a.; et al. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence, 2025, [2507.21046].
- Fang, J.; et al. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems, 2025, [2508.07407].
- Jiang, P.; et al. Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills, 2025, [2512.16301].
- Wu, Z.; Xu, S.; Chen, B.; Wan, S.; Li, Y.; Ruan, W.; Lyu, Y.; Li, S.; Zhu, D.; Liu, T.; et al. Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work, 2026, [2604.23674].
- Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Annals of Internal Medicine 2018, 169, 467–473. [CrossRef]
- Wang, Z.; Liu, L.; Wang, L.; Zhou, L. R2GenGPT: Radiology Report Generation with Frozen LLMs. Meta-Radiology 2023, 1, 100033. [CrossRef]
- Moore, S.M.; Maffitt, D.R.; Smith, K.E.; Kirby, J.S.; Clark, K.W.; Freymann, J.B.; Vendt, B.A.; Tarbox, L.R.; Prior, F.W. De-identification of Medical Images with Retention of Scientific Research Value. RadioGraphics 2015, 35, 727–735. [CrossRef]
- Zech, J.R.; Badgeley, M.A.; Liu, M.; Costa, A.B.; Titano, J.J.; Oermann, E.K. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine 2018, 15, e1002683. [CrossRef]
- Brodeur, P.G.; Buckley, T.A.; Kanjee, Z.; Goh, E.; Ling, E.B.; Jain, P.; Cabral, S.; Abdulnour, R.E.; Haimovich, A.D.; Freed, J.A.; et al. Performance of a large language model on the reasoning tasks of a physician. Science 2026, 392, 524–527. [CrossRef]
- Hou, D.; Jiang, L.; Li, D.; Li, Z.; Lin, F.; Yamada, K.D. WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking, 2026, [2603.27343].
- Xu, W.; Liang, Z.; Anthony, H.; Ibrahim, Y.; Cohen, F.; Yang, G.; Kamnitsas, K. You Point, I Learn: Online Adaptation of Interactive Segmentation Models for Handling Distribution Shifts in Medical Imaging. International Conference on Learning Representations (ICLR), 2026.
- Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The Rise and Potential of Large Language Model Based Agents: A Survey, 2023. [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017, Vol. 30, pp. 5998–6008. [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; et al. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022, Vol. 35, pp. 27730–27744.
- Nori, H.; King, N.; McKinney, S.M.; Carignan, D.; Horvitz, E. Capabilities of GPT-4 on Medical Challenge Problems, 2023. [CrossRef]
- Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; et al. Towards Expert-Level Medical Question Answering with Large Language Models, 2023. [CrossRef]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; et al. LLaMA: Open and Efficient Foundation Language Models, 2023, [2302.13971].
- Grattafiori, A.; Dubey, A.; Jauhri, A.; et al. The Llama 3 Herd of Models, 2024. [CrossRef]
- Teknium, R.; Quesnelle, J.; Guang, C. Hermes 3 Technical Report, 2024. [CrossRef]
- Teknium, R.; Jin, R.; Suphavadeeprasit, J.; Mahan, D.; Quesnelle, J.; Li, J.; Guang, C.; Sands, S.; Malhotra, K. Hermes 4 Technical Report, 2025, [2508.18255].
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019; pp. 4171–4186. [CrossRef]
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining, 2019. [CrossRef]
- Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, 2021. [CrossRef]
- Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission, 2019. [CrossRef]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways, 2022. [CrossRef]
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large Language Models Encode Clinical Knowledge, 2022. [CrossRef]
- Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.Y. BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining, 2022. [CrossRef]
- Wu, C.; Lin, W.; Zhang, X.; Zhang, Y.; Xie, W.; Wang, Y. PMC-LLaMA: Toward Building Open-Source Language Models for Medicine, 2023. [CrossRef]
- Xie, Q.; Chen, Q.; Chen, A.; Peng, C.; Hu, Y.; Lin, F.; Peng, X.; Huang, J.; Zhang, J.; Keloth, V.; et al. Me-LLaMA: Foundation Large Language Models for Medical Applications, 2024. [CrossRef]
- Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023, Vol. 36, pp. 53728–53741. [CrossRef]
- Chen, Z.; Cano, A.H.; Romanou, A.; Bonnet, A.; Matoba, K.; Salvi, F.; Pagliardini, M.; Fan, S.; Köpf, A.; Mohtashami, A.; et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079 2023. [CrossRef]
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.R.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. The Eleventh International Conference on Learning Representations (ICLR 2023), OpenReview.net, 2023.
- Yu, Y.; Yao, L.; Xie, Y.; Tan, Q.; Feng, J.; Li, Y.; Wu, L. Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents, 2026, [2601.01885].
- Zhang, X.; Wang, G.; Cui, Y.; Qiu, W.; Li, Z.; Zhu, B.; He, P. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents, 2026, [2604.15877].
- Anthropic. Model Context Protocol: An open standard for connecting LLM applications to data sources and tools. https://modelcontextprotocol.io, 2024. Open protocol specification, accessed May 2026.
- Zhao, A.; Huang, D.; Xu, Q.; Lin, M.; Liu, Y.J.; Huang, G. ExpeL: LLM Agents Are Experiential Learners. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024, Vol. 38, pp. 19632–19642.
- Liu, Y.; Si, C.; Narasimhan, K.R.; Yao, S. Contextual Experience Replay for Self-Improvement of Language Agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 14179–14198. [CrossRef]
- Wu, R.; Wang, X.; Mei, J.; Cai, P.; Fu, D.; Yang, C.; Wen, L.; Yang, X.; Shen, Y.; Wang, Y.; et al. EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle, 2025, [2510.16079].
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models, 2021, [2106.09685]. [CrossRef]
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, 2023. [CrossRef]
- Anokhin, P.; Semenov, N.; Sorokin, A.; Evseev, D.; Kravchenko, A.; Burtsev, M.; Burnaev, E. AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents, 2024. [CrossRef]
- Yang, S.; Ma, Z.; Huang, T.; Hu, Y.; Wang, Y.; Chu, X. CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution, 2026, [2604.15840].
- Ni, J.; Liu, Y.; Liu, X.; Sun, Y.; Zhou, M.; Cheng, P.; Wang, D.; Zhao, E.; Jiang, X.; Jiang, G. Trace2skill: Distill trajectory-local lessons into transferable agent skills, 2026, [2603.25158].
- Zhang, H.; Fan, S.; Zou, H.P.; Chen, Y.; Wang, Z.; Zhou, J.; Li, C.; Huang, W.C.; Yao, Y.; Zheng, K.; et al. CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification, 2026, [2604.01687].
- Xia, P.; Chen, J.; Yang, X.; Tu, H.; Liu, J.; Xiong, K.; Han, S.; Qiu, S.; Ji, H.; Zhou, Y.; et al. MetaClaw: Just Talk — An Agent That Meta-Learns and Evolves in the Wild, 2026, [2603.17187].
- Ma, Z.; Yang, S.; Ji, Y.; Wang, X.; Wang, Y.; Hu, Y.; Huang, T.; Chu, X. SkillClaw: Let Skills Evolve Collectively with Agentic Evolver, 2026, [2604.08377]. [CrossRef]
- Xia, P.; Chen, J.; Wang, H.; Liu, J.; Zeng, K.; Wang, Y.; Han, S.; Zhou, Y.; Zhao, X.; Chen, H.; et al. SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning, 2026, [2602.08234].
- Wang, H.; Wang, G.; Xiao, H.; et al. Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents, 2026, [2604.10674].
- Ge, T.; Peng, B.; Cheng, H.; Gao, J. Synthetic Computers at Scale for Long-Horizon Productivity Simulation, 2026, [2604.28181].
- Fan, L.; Dai, P.; Deng, Z.; Wang, H.; Gong, X.; Zheng, Y.; Ou, Y. Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery, 2026, [2603.05860]. [CrossRef]
- Lin, J.; Liu, S.; Pan, C.; Lin, L.; Dou, S.; Huang, X.; Yan, H.; Han, Z.; Gui, T. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses, 2026, [2604.25850].
- Ding, L. AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling, 2026, [2603.21357].
- Li, X.; et al. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, 2026, [2602.12670].
- Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023, Vol. 36, pp. 68539–68551.
- Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; et al. AgentBench: Evaluating LLMs as Agents, 2023. [CrossRef]
- Xu, R.; Yan, Y. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward, 2026, [2602.12430].
- Gao, S.; Zhu, R.; Sui, P.; Kong, Z.; Aldogom, S.; Huang, Y.; Noori, A.; Shamji, R.; Parvataneni, K.; Tsiligkaridis, T.; et al. Democratizing AI Scientists Using ToolUniverse, 2025, [2509.23426].
- Gao, S.; Zhu, R.; Kong, Z.; Noori, A.; Su, X.; Ginder, C.; Tsiligkaridis, T.; Zitnik, M. TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools, 2025, [2503.10970].
- Huang, K.; Zhang, S.; Wang, H.; Qu, Y.; Lu, Y.; Roohani, Y.; Li, R.; Qiu, L.; Li, G.; Zhang, J.; et al. Biomni: A General-Purpose Biomedical AI Agent. bioRxiv 2025. [CrossRef]
- Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework, 2023. [CrossRef]
- Jin, D.; Pan, E.; Oufattole, N.; Weng, W.H.; Fang, H.; Szolovits, P. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams, 2021. [CrossRef]
- Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.W.; Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering, 2019. [CrossRef]
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2022. [CrossRef]
- Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models, 2023. [CrossRef]
- Hu, S.; Lu, C.; Clune, J. Automated Design of Agentic Systems. International Conference on Learning Representations (ICLR), 2025.
- Zhang, Y.; Chen, D.Z. GPT4MIA: Utilizing Generative Pre-Trained Transformer (GPT-4) as a Plug-and-Play Transductive Model for Medical Image Analysis. In Proceedings of the Workshop Proceedings of MICCAI 2023 (MedAGI/DeCaF), 2023, pp. 151–160. [CrossRef]
- Lyu, Q.; Tan, J.; Zapadka, M.E.; Ponnat, J.; Niu, C.; Wang, G.; Whitlow, C.T. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: Results, limitations, and potential. Visual Computing for Industry, Biomedicine, and Art 2023, 6, 9. [CrossRef]
- Lu, J.; Li, J.; Wallace, B.C.; He, Y.; Pergola, G. NapSS: Paragraph-level Medical Text Simplification via Narrative Prompting and Sentence-matching Summarization. In Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 1079–1091. [CrossRef]
- Li, S.; Xu, J.; Bao, T.; Liu, Y.; Liu, Y.; Liu, Y.; Wang, L.; Lei, W.; Wang, S.; Xu, Y.; et al. A Co-Evolving Agentic AI System for Medical Imaging Analysis, 2025, [2509.20279].
- Yu, Y.; Huang, Z.; Mu, L.; Zhang, S.; Zhang, X. Radiologist Copilot: An Agentic Framework Orchestrating Specialized Tools for Reliable Radiology Reporting, 2025, [2512.02814].
- Milecki, L.; Kalogeiton, V.; Bodard, S.; Anglicheau, D.; Correas, J.M.; Timsit, M.O.; Vakalopoulou, M. MEDIMP: 3D Medical Images with clinical Prompts from limited tabular data for renal transplantation, 2023, [2303.12445]. [CrossRef]
- Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; et al. ChatAug: Leveraging ChatGPT for Text Data Augmentation, 2023, [2302.13007].
- Jiang, J.; Zhou, K.; Zhao, W.X.; Song, Y.; Zhu, C.; Zhu, H.; Wen, J.R. KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph, 2024, [2402.11163].
- Albassam, D. Toward Human-Centered Interactive Clinical Question Answering System, 2025, [2505.18928].
- Elgedawy, R.; Danciu, I.; Mahbub, M.; Srinivasan, S. Dynamic Question-Answering of Clinical Documents using Retrieval Augmented Generation, 2024, [2401.10733].
- Zhi, X.; Zhao, H.; Wu, L.; Zhao, C.; Zhu, H. Reinventing Clinical Dialogue: Agentic Paradigms for LLM-Enabled Healthcare Communication, 2025, [2512.01453].
- Komorowski, M.; Celi, L.A.; Badawi, O.; Gordon, A.C.; Faisal, A.A. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine 2018, 24, 1716–1720. [CrossRef]
- Qiu, P.; Wu, C.; Liu, J.; Zheng, Q.; Liao, Y.; Wang, H.; Yue, Y.; Fan, Q.; Zhen, S.; Wang, J.; et al. Evolving Interactive Diagnostic Agents in a Virtual Clinical Environment, 2025, [2510.24654].
- Feng, Y.; Wang, J.; Zhou, L.; Zheng, Y.; Lei, Z.; Li, Y. Real-World Doctor Agent with Proactive Consultation through Multi-Agent Reinforcement Learning, 2025, [2505.19630].
- Yang, D.; Wei, J.; Li, M.; Liu, J.; Liu, L.; Hu, M.; He, J.; Ju, Y.; Zhou, W.; Liu, Y.; et al. MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration, 2024, [2410.12532]. [CrossRef]
- Shimgekar, S.R.; Vassef, S.; Goyal, A.; Kumar, N.; Saha, K. Agentic AI Framework for End-to-End Medical Data Inference, 2025, [2507.18115].
- Tang, X.; Zou, A.; Zhang, Z.; Li, Z.; Zhao, Y.; Zhang, X.; Cohan, A.; Gerstein, M. MedAgents: Large language models as collaborators for zero-shot medical reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 599–621.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the International Conference on Computer Vision, 2023, pp. 4015–4026. [CrossRef]
- Cheng, J.; Ye, J.; Deng, Z.; Chen, J.; Li, T.; Wang, H.; Su, Y.; Huang, Z.; Chen, J.; Jiang, L.; et al. SAM-Med2D. arXiv preprint arXiv:2308.16184 2023.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, 2023. [CrossRef]
- Zhou, H.Y.; Acosta, J.N.; Adithan, S.; Datta, S.; Topol, E.J.; Rajpurkar, P. MedVersa: A generalist foundation model for medical image interpretation. arXiv preprint arXiv:2405.07988 2024.
- Lu, M.Y.; Chen, B.; Williamson, D.F.; Chen, R.J.; Ikamura, K.; Gerber, G.; Liang, I.; Le, L.P.; Ding, T.; Parwani, A.V.; et al. A foundational multimodal vision language AI assistant for human pathology. arXiv preprint arXiv:2312.07814 2023.
- Park, J.S.; O’Brien, J.C.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023, pp. 1–22. [CrossRef]
- Johnson, A.E.W.; Pollard, T.J.; Shen, L.; Lehman, L.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a Freely Accessible Critical Care Database. Scientific Data 2016, 3, 160035. [CrossRef]
- Patel, A.; Hofmarcher, M.; Leoveanu-Condrei, C.; Dinu, M.C.; Callison-Burch, C.; Hochreiter, S. Large Language Models Can Self-Improve At Web Agent Tasks, 2024. [CrossRef]
- Boiko, D.A.; MacKnight, R.; Kline, B.; Gomes, G. Autonomous Chemical Research with Large Language Models, 2023. [CrossRef]
- Rezaei, M.R.; Fard, R.S.; Parker, J.L.; Krishnan, R.G.; Lankarany, M. Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge, 2025, [2502.13010].
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Dai, W.; Madotto, A.; et al. Survey of Hallucination in Natural Language Generation, 2022. [CrossRef]
- Salehi, S.; Singh, Y.; Horst, K.K.; Hathaway, Q.A.; Erickson, B.J. Agentic AI and Large Language Models in Radiology: Opportunities and Hallucination Challenges. Bioengineering 2025, 12, 1303. [CrossRef]
- Pal, A.; Umapathi, L.K.; Sankarasubbu, M. Med-HALT: Medical Domain Hallucination Test for Large Language Models, 2023. [CrossRef]
- Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI Feedback, 2022. [CrossRef]
- Wornow, M.; Xu, Y.; Thapa, R.; Patel, B.; Steinberg, E.; Fleming, S.; Pfeffer, M.A.; Fries, J.; Shah, N.H. The shaky foundations of large language models and foundation models for electronic health records. npj Digital Medicine 2023, 6, 135. [CrossRef]
- Maiti, S. Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare, 2026, [2603.17419].
- Dong, S.; Xu, S.; He, P.; Li, Y.; Tang, J.; Liu, T.; Liu, H.; Xiang, Z. Memory Injection Attacks on LLM Agents via Query-Only Interaction, 2025, [2503.03704].
- Sunil, B.D.; Sinha, I.; Maheshwari, P.; Todmal, S.; Mallik, S.; Mishra, S. Memory Poisoning Attack and Defense on Memory Based LLM-Agents, 2026, [2601.05504]. [CrossRef]
- Azarafrooz, A. Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms, 2026, [2604.21131].
- Lin, Z.; Li, C.; Chen, K. A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty, 2026, [2604.16548].
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 2017, 114, 3521–3526. [CrossRef]
- Shi, H.; Xu, Z.; Wang, H.; Qin, W.; Wang, W.; Wang, Y.; Wang, Z.; Ebrahimi, S.; Wang, H. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys 2025, 58, 1–42. [CrossRef]
- Goddard, K.; Roudsari, A.; Wyatt, J.C. Automation Bias: A Systematic Review of Frequency, Effect Mediators, and Mitigators. Journal of the American Medical Informatics Association 2012, 19, 121–127. [CrossRef]
- Lyell, D.; Coiera, E. Automation Bias and Verification Complexity: A Systematic Review. Journal of the American Medical Informatics Association 2017, 24, 423–431. [CrossRef]
- Zhang, D. AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering, 2026, [2601.04620].
- Dietrich, N. Agentic AI in radiology: Emerging potential and unresolved challenges. British Journal of Radiology 2025, 98, 1582–1584. [CrossRef]
- Tu, T.; Azizi, S.; Driess, D.; Schaekermann, M.; Amin, M.; Chang, P.C.; Carroll, A.; Lau, C.; Tanno, R.; Ktena, I.; et al. Towards generalist biomedical AI. NEJM AI 2024, 1, AIoa2300138.
- Cosentino, J.; Belyaeva, A.; Liu, X.; Furlotte, N.A.; Yang, Z.; Lee, C.; Schenck, E.; Patel, Y.; Cui, J.; Schneider, L.D.; et al. Towards a personal health large language model. arXiv preprint arXiv:2406.06474 2024.
- Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021, 596, 583–589. [CrossRef]
- Joshi, M.; Pal, A.; Sankarasubbu, M. Federated Learning for Healthcare Domain – Pipeline, Applications and Challenges, 2022, [2211.07893].
- Saha, P.; Strong, J.; Mishra, D.; Ouyang, C.; Noble, J.A. FedAgentBench: Towards Automating Real-world Federated Medical Image Analysis with Server-Client LLM Agents, 2025, [2509.23803].
- Qu, A.; Zheng, H.; Zhou, Z.; Liang, P.P.; et al. CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery, 2026, [2604.01658]. [CrossRef]
- Pal, A.; Umapathi, L.K.; Sankarasubbu, M. MedMCQA: A Large-Scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, 2022. [CrossRef]
- Zhong, S.; Lu, Y.; Ning, J.; et al. SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks, 2026, [2604.20087].
- Zhao, Y.; Yuan, B.; Huang, J.; Yuan, H.; Yu, Z.; Xu, H.; Hu, L.; Shankarampeta, A.; Huang, Z.; Ni, W.; et al. AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications, 2026, [2602.22769]. [CrossRef]
- Ogdu, C.U.; Gurbuz, S.; Karakose, M.; Hanoglu, E. Medical Implications of LLM Based Clinical Decision Support Systems in Healthcare. In Proceedings of the 2025 29th International Conference on Information Technology (IT). IEEE, 2025, pp. 1–4. [CrossRef]
| 1 | Industry references on the Agent = Model + Harness decomposition: Anthropic, “Effective harnesses for long-running agents,” 2025, https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents; V. Trivedy, “The anatomy of an agent harness,” LangChain Blog, 2026, https://blog.langchain.com/the-anatomy-of-an-agent-harness/; B. Böckeler, “Harness engineering for coding agent users,” martinfowler.com, 2026, https://martinfowler.com/articles/harness-engineering.html. |
| 2 | NousResearch, Hermes Agent project documentation, 2026: https://hermes-agent.nousresearch.com and https://github.com/NousResearch/hermes-agent. |

| Trigger | Activation condition | Concurrent support |
|---|---|---|
| Novelty | No matching skill, verifier score | Trace2Skill [72] |
| Performance | Score >10 pts below skill’s recorded best | CoEvoSkills [73], MetaClaw [74] |
| Contradiction | New retrieved guideline conflicts with skill assumption | (knowledge update; conceptual, no direct exemplar in this survey) |
| Consolidation | Every 50 tasks, similarity >0.85 between skills | SkillClaw [75] |
| RL | Verifier score (high-quality trajectory) | SkillRL [76], Skill-SD [77], Ge et al. [78] |
| Boundary | Inconsistent outcomes on similar inputs under fixed policy | CoEvolve [71] |
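
The activation conditions in the trigger table above can be read as a simple rule check run after each task. The following Python sketch is illustrative only and is not taken from any of the cited systems. The names `TaskOutcome` and `fired_triggers` are hypothetical, as is the assumption that the verifier score and the skill's recorded best share one scale. The table gives no numeric threshold for the Novelty and RL score conditions, so the sketch takes them as caller-supplied predicates rather than guessing values. The Consolidation rule is interpreted here as a check made at every 50th task.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Thresholds stated in the table.
PERFORMANCE_DROP_PTS = 10.0      # score more than 10 pts below the skill's recorded best
CONSOLIDATION_PERIOD = 50        # consolidation is checked every 50 tasks
CONSOLIDATION_SIMILARITY = 0.85  # similarity between skills must exceed this value

# Hypothetical type: the Novelty and RL score thresholds are not given in the
# table, so the caller supplies them as predicates on the verifier score.
ScoreTest = Callable[[float], bool]


@dataclass
class TaskOutcome:
    """Hypothetical summary of one completed task."""
    verifier_score: float
    matched_skill_best: Optional[float] = None  # None means no matching skill
    guideline_conflict: bool = False            # retrieved guideline contradicts a skill assumption
    inconsistent_outcomes: bool = False         # similar inputs, different outcomes, fixed policy


def fired_triggers(
    outcome: TaskOutcome,
    tasks_completed: int,
    max_skill_similarity: float,
    novelty_score_test: ScoreTest,
    rl_score_test: ScoreTest,
) -> List[str]:
    """Return the names of the triggers whose activation condition holds."""
    fired: List[str] = []
    score = outcome.verifier_score
    best = outcome.matched_skill_best

    if best is None and novelty_score_test(score):
        fired.append("novelty")
    if best is not None and best - score > PERFORMANCE_DROP_PTS:
        fired.append("performance")
    if outcome.guideline_conflict:
        fired.append("contradiction")
    if (
        tasks_completed > 0
        and tasks_completed % CONSOLIDATION_PERIOD == 0
        and max_skill_similarity > CONSOLIDATION_SIMILARITY
    ):
        fired.append("consolidation")
    if rl_score_test(score):
        fired.append("rl")
    if outcome.inconsistent_outcomes:
        fired.append("boundary")
    return fired
```

For example, a task that scores 70 against a recorded best of 85, evaluated at task 50 with a maximum pairwise skill similarity of 0.9, would fire the performance and consolidation triggers under this reading. Several triggers can fire on the same task, so a real harness would also need a priority or scheduling rule, which the table does not specify.
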

| Harness dimension | Prompt engineering | RAG-augmented | Stateless harness† | Self-evolving harness |
|---|---|---|---|---|
| Tool integration | Limited | Retrieval only | Full (MCP/API) | Full (MCP/API) |
| Multi-agent coordination | ✗ | ✗ | ✓ | ✓ |
| Orchestration (ReAct) | ✗ | ✗ | ✓ | ✓ |
| Persistent memory | ✗ | ✗ | ✗ | ✓ |
| Skill accumulation | ✗ | ✗ | ✗ | ✓ |
| RL self-improvement | ✗ | ✗ | ✗ | ✓ |
| Setup complexity | Low | Medium | Medium | High |
| Deployment infrastructure | Minimal | Vector DB | Agent server | Agent server + RL |

| Challenge | Description | Possible solutions |
|---|---|---|
| Hallucination & safety | Agents generate confident but incorrect clinical facts | Clinical verifiers; uncertainty quantification; agent checkpoints |
| Data privacy & HIPAA | Persistent memory may retain PHI across sessions | Memory scanning; de-identification; local deployment |
| Domain adaptation | Medical terminology varies by specialty, institution | Domain-specific skill libraries; local guideline integration |
| Interpretability & trust | Clinical decisions require auditable reasoning chains | Reasoning trace logging; explainability tools; clinician oversight |
| Memory security | Persistent memory is vulnerable to injection attacks | Entry scanning; injection detection; access control |
| Scalability | Multi-agent orchestration is computationally expensive | Subagent pooling; cached skill retrieval; model distillation |
| Regulatory compliance | FDA, CE, and HIPAA impose strict validation requirements | Deterministic audit modes; clinical validation protocols |
| Autonomy–safety tension | Autonomous skill evolution may propagate systematic errors | Graduated autonomy; confidence-gated escalation |
| Memory fragmentation | Skill library grows unwieldy at scale | Hierarchical taxonomies; adaptive retrieval; periodic pruning |
| Catastrophic forgetting | RL fine-tuning degrades prior capabilities | EWC; experience replay; modular LoRA adapters |
| Over-reliance & de-skilling | Clinicians may accept polished agent outputs without scrutiny, eroding analytical capacity | Mandatory review of high-stakes outputs; reasoning-chain exposure; periodic audit |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).