Self-Evolving Agent Engineering for Healthcare: Methodologies and Applications

Dengzhe Hou; Zihao Wu; Yuwen Zeng; Lingyu Jiang; Fangzhou Lin; Kazunori Yamada

doi:10.20944/preprints202605.1547.v1

Submitted: 21 May 2026

Posted: 22 May 2026

Abstract

Clinical large language model (LLM) agents are increasingly engineered as systems that combine a language model backbone with memory, tools, orchestration loops, and feedback mechanisms. In this setting, the key engineering question is no longer only what the backbone model can answer, but how the surrounding harness stores experience, retrieves context, orchestrates tools, and converts feedback into reusable knowledge. Existing reviews of LLM agents in healthcare primarily emphasise prompting strategies, task capabilities, and benchmark performance, leaving these harness-level mechanisms insufficiently synthesised. This review addresses that gap by organising the emerging literature under the term Self-Evolving Agent Engineering (SEAE), defined here as a harness-level design paradigm centred on three recurring mechanisms: persistent cross-session memory, autonomous skill or experience synthesis, and closed-loop feedback-driven improvement. We review 148 references and 23 representative clinical systems across six task categories, using radiology as the main translational focus. Rather than treating these systems as isolated applications, we map how persistent memory, skill synthesis, tool orchestration, and feedback-driven improvement are implemented across current healthcare agents, and examine the technical, clinical, and regulatory challenges that arise when clinical agents are designed to evolve across sessions.

Keywords: self-evolving agent; agent engineering; LLM agent; agentic AI; persistent memory; clinical decision support

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

A radiologist who sees thousands of chest X-rays accumulates pattern recognition that no textbook can fully encode. A clinician who manages hundreds of diabetic patients develops an intuition for subtle signs of decompensation that formal guidelines miss. This accumulated, experience-driven expertise is central to clinical excellence, and it is precisely what most current large language model (LLM)-based healthcare tools, once deployed, lack. Clinical AI has a long history, from early rule-based expert systems such as MYCIN [1] to deep learning systems achieving radiologist-level diagnostic accuracy [2]; however, neither these earlier systems nor the vast majority of currently deployed clinical AI systems have acquired the capacity to improve from their own accumulated clinical experience.

LLMs have been applied widely to clinical tasks, from radiology report generation [3] and medical question answering [4] to diagnostic classification [5], but virtually all deployed systems operate without cross-session state. Each clinical session starts from scratch: the model has no memory of prior patient encounters, no accumulated clinical skills, and no way to improve through use. This is an architectural property, not a question of model scale. Even the most capable LLMs (GPT-4 [6], Claude [7], Gemini [8]) reset their clinical context at every session boundary, discarding case experience that would be useful for subsequent patients.

A recent review [9] has mapped how prompt engineering adapts LLMs to clinical tasks; however, the dominant input-design paradigm [10,11] cannot resolve this amnesia. Retrieval-augmented generation (RAG) [12,13] improves knowledge currency by retrieving from updated databases, but retrieval is passive: the agent does not learn from what it retrieves, and each session begins with the same retrieval capability regardless of experience. The present review extends that foundation by asking what happens when the agent’s infrastructure itself evolves through clinical use.

Several independent research threads point toward a new design paradigm for clinical AI. Agent Hospital [14] shows that LLM-powered doctor agents improve through simulated clinical experience, accumulating diagnostic patterns across thousands of encounters. Memory-augmented LLM systems [15,16] and clinical agents [17,18] show that persistent memory (working memory plus experience memory) enables context-aware reasoning that single-session prompting cannot achieve. The Reflexion framework [19] and related self-improvement methods show that agents synthesising experience into reusable reasoning patterns outperform one-shot inference. Together, these threads point to a shared set of design principles: persistent cross-session memory, experience-driven skill synthesis, and closed-loop self-improvement.

In this review, we adopt the term Self-Evolving Agent Engineering (SEAE) to label this converging design space and synthesise a reference architecture from the literature rather than propose a new system. Building on the widely used Agent = Model + Harness decomposition (footnote 1), where the harness denotes all infrastructure surrounding the LLM (memory, tools, orchestration, training), we characterise a “self-evolving harness” as one that satisfies three core principles inferred across the surveyed systems: (1) persistent cross-session memory that accumulates clinical context across sessions; (2) automatic skill synthesis, wherein the agent converts task experience into reusable clinical protocols; and (3) a closed-loop RL pipeline that uses agent-generated interaction trajectories to fine-tune the next model generation. The NousResearch Hermes Agent framework (footnote 2) provides one concrete instantiation of these principles (using MEMORY.md, agentskills.io skill documents, and the Atropos/Tinker training pipeline); however, the paradigm is framework-agnostic. The surveyed systems are examined in terms of whether they instantiate one or more of these core principles, regardless of the specific harness implementation used.

1.1. Scope and Focus of the Review

Radiology is a particularly natural testbed for SEAE [20,21]: radiologists accumulate pattern libraries through years of longitudinal case review, imaging protocols and reporting standards evolve continuously, and high-throughput reading environments generate structured feedback that can drive autonomous skill evolution. The systems we survey instantiate at least one of the three SEAE principles, with stateless prompting or single-session RAG systems included as comparison baselines; because fully realised persistent clinical agents remain rare, we map an emerging design space rather than catalogue complete deployments.

Several recent reviews survey LLM agents in medicine from complementary perspectives: Wang et al. [4] provide a broad taxonomy of LLM-based medical agent capabilities; the clinical agents systematic review [22] catalogues deployed systems; Wang et al. [23] and Guo et al. [24] survey autonomous LLM agents and multi-agent frameworks more broadly; He et al. [25] survey LLMs for healthcare; and the recent surge of agentic AI in radiology [26,27,28] and healthcare more broadly [29] further underscores the timeliness of this review. Two concurrent surveys [30,31] provide broad taxonomies of self-evolving agent techniques across general domains, and Jiang et al. [32] survey post-training, memory, and skills as the three axes of agentic AI adaptation. At the opposite end of the autonomy spectrum, Wu et al. [33] articulate “Vibe Medicine,” a human-in-the-loop paradigm in which clinicians retain the role of research director while natural-language interfaces direct skill-augmented agents. The present review sits at the high engineering-autonomy end of this spectrum (the agent autonomously manages its own skills, memory, and training trajectories) while remaining low on the clinical-autonomy axis: throughout, SEAE agents are positioned as decision-support (“second-opinion”) systems requiring physician adjudication, not as replacements for clinical judgement. The prompt engineering review published in this journal [9] serves as both the structural reference and the conceptual starting point: where that review maps input-design strategies for clinical LLMs, the present review examines what lies beyond input design, namely harness-level engineering that enables agents to persist, accumulate, and evolve.

1.2. Literature Search Process

We conducted a scoping review of LLM agents and self-evolving methodologies in healthcare, following the spirit of PRISMA-ScR [34] but adapted to a fast-moving preprint-heavy literature.

Search strategy. We searched arXiv, PubMed, IEEE Xplore, ACM Digital Library, and Google Scholar covering 2019–April 2026, using keyword combinations including “LLM agent healthcare,” “medical LLM agent,” “autonomous clinical AI,” “persistent memory agent medical,” “self-improving medical agent,” “agent harness,” and “multi-agent clinical decision.” Foundational pre-2019 works (MYCIN, automation bias) were back-cited from primary papers for historical context. Searches were last refreshed in April 2026.

Eligibility criteria. Inclusion: (i) describes an LLM-based agent applied to a healthcare or clinical task; (ii) instantiates at least one of the three SEAE principles (persistent cross-session memory, autonomous skill or experience synthesis, or closed-loop self-improvement); OR (iii) serves as a stateless agent baseline that is informative for the SEAE comparison; OR (iv) provides foundational methodology (backbone LLMs, RL techniques, security frameworks) directly referenced in the SEAE architecture. Exclusion: (a) non-agent stateless LLM applications without harness components; (b) clinical AI work using only traditional supervised learning without LLM agents; (c) duplicate publications; (d) papers without retrievable full text.

Screening and counts. The keyword search returned 312 candidate papers. After title/abstract screening for relevance (148 excluded as off-topic, duplicates, or non-LLM-agent) and full-text review (10 excluded for insufficient methodological detail or non-retrievable text), 154 records from the search were retained, plus 4 foundational pre-2019 works added by back-citation from primary papers (158 records included after screening). A subsequent revision-time citation audit (May 2026) yielded a net reduction of 5 in this pool: 11 stale citations were removed (9 orphan entries no longer cited after content trimming, plus 2 wrong-context citations identified by an independent cross-model audit) and 6 revision-time additions were made (4 venue-aligned records: R2GenGPT [35], Moore et al. [36] on DICOM de-identification, Zech et al. [37] on CXR domain shift, and Brodeur et al. [38] on LLM physician-reasoning benchmarks; plus 2 supporting empirical references, Hou et al. [39] on LLM working memory in support of §3.1 and Xu et al. [40] on online adaptation under medical-imaging distribution shift in support of §5), bringing the cited count after audit to 153 (158 − 11 + 6 = 153). A subsequent formatting pass moved 4 non-academic sources (industry blog posts on the Agent = Model + Harness decomposition, and the NousResearch Hermes Agent framework documentation) from the formal bibliography to footnotes, and the removal of the original case-study chapter eliminated one further citation (a chest-radiograph dataset reference), leaving 148 papers in the final reference list. The primary search window (2022–2025) shows a peak of 36 papers in 2023, reflecting the rapid emergence of this field following the release of GPT-4-class models; 2026 data are partial.
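The screening arithmetic above can be re-derived step by step. The short check below only restates the numbers reported in the paragraph; the variable names are ours and purely illustrative.

```python
# Re-derive the screening counts reported above; variable names are illustrative.
candidates = 312
excluded_title_abstract = 148
excluded_full_text = 10
retained = candidates - excluded_title_abstract - excluded_full_text

back_cited = 4
included = retained + back_cited

removed_in_audit = 9 + 2   # orphan entries + wrong-context citations
added_in_audit = 4 + 2     # venue-aligned records + supporting empirical references
after_audit = included - removed_in_audit + added_in_audit

moved_to_footnotes = 4
removed_with_case_study = 1
final_references = after_audit - moved_to_footnotes - removed_with_case_study

print(retained, included, after_audit, final_references)  # 154 158 153 148
```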

Evidence-tier policy. References are classified into four tiers, with the tier surfaced explicitly in the unified application table (Table 3) where it bears on interpretation: (T1) peer-reviewed articles in journals or conference proceedings; (T2) preprints (e.g., arXiv) with verifiable methodology and reproducible artefacts; (T3) framework documentation or system technical reports (e.g., the Hermes Atropos/Tinker pipeline); and (T4) historical non-LLM comparators retained for cross-paradigm reference. Tier markers are used in tables and the recap to ensure that performance claims are not pooled across heterogeneous evidence types. The summary primarily reports benchmarked agent systems (T1/T2); the T3 framework-documentation row (Hermes RL Trajectories) and the T4 historical non-LLM RL comparator (AI Clinician) are flagged explicitly with daggers in Table 3 and the recap, so that any aggregate count involving them can be re-evaluated under the strict T1/T2-only criterion (§4.7). An interactive chart of the annual and cumulative publication distribution is available at https://dengzhe-hou.github.io/self-evolving-agent-engineering/.

1.3. Outline of the Review

Section 2 presents the architectural foundations of the harness, surveying backbone LLMs and the four layers of the self-evolving harness, with the NousResearch Hermes framework as a reference implementation. Section 3 details the engineering principles (memory, tool and multi-agent, RL trajectory engineering) and compares SEAE with prompt engineering and RAG. Section 4 surveys SEAE-instantiating systems across six clinical task categories. Section 5 examines principal challenges and future directions; Section 6 concludes.

2. Foundations

Building on the Agent = Model + Harness decomposition introduced in §1, the harness corresponds to what Wang et al. [23] term the “action” and “memory” modules and what Xi et al. [41] call the “capability” layer surrounding the LLM core; it provides the infrastructure for memory, tool integration, orchestration, and training. In a clinical context, the model determines what an agent can reason about; the harness determines whether that reasoning persists, accumulates, and improves across sessions. This section surveys backbone LLMs (§2.1), presents the self-evolving harness architecture (§2.2), and discusses the NousResearch Hermes framework as a reference implementation (§2.3).

2.1. Large Language Models

SEAE is built upon large pre-trained language models that serve as the agent’s cognitive backbone, responsible for understanding clinical context, reasoning over patient data, and generating structured outputs.

Transformers [42] introduced the self-attention mechanism underlying all modern LLMs, enabling parallel sequence processing and long-range dependency modeling at scale.

GPT series and instruction-tuned models [11,43] established the dominant paradigm for generative clinical LLMs: GPT-3 demonstrated few-shot clinical reasoning through in-context learning; InstructGPT introduced RLHF alignment; and GPT-4 achieved near-physician accuracy on the USMLE medical licensing examination [44], forming the backbone of many medical agent systems [5]. Med-PaLM 2 [45] achieved performance competitive with physician specialists on clinical vignette benchmarks, and Brodeur et al. [38] extend this trajectory: across five reasoning experiments (differential diagnosis, display of diagnostic reasoning, triage differential, probabilistic reasoning, and management reasoning), the o1-preview model meets or exceeds physician expert adjudication on NEJM CPC cases (e.g., agreement on 120/143 = 84% of differential-diagnosis cases), establishing that LLM-scale reasoning is clinically viable.

LLaMA series [46,47] provided open-weight alternatives enabling domain-specific fine-tuning and local deployment without proprietary API dependency, a critical requirement for clinical settings subject to patient data-protection regulations such as the U.S. Health Insurance Portability and Accountability Act (HIPAA), the EU General Data Protection Regulation (GDPR), and equivalent national frameworks. LLaMA 3.1 also serves as the backbone for the Hermes-3 and Hermes-4 agentic models [48,49].

Domain-specific biomedical models have adapted these general architectures to clinical language. On the encoder side, BERT [50] established the pre-train–fine-tune paradigm; BioBERT [51], PubMedBERT [52], and ClinicalBERT [53] extended it to biomedical and clinical corpora, achieving substantial gains on named-entity recognition and relation extraction benchmarks. On the generative side, PaLM [54] and its medical adaptation Med-PaLM [55] demonstrated that scaling and domain prompting can reach clinician-level question answering, while open-weight alternatives such as BioGPT [56], PMC-LLaMA [57], and MeLLaMA [58] enable local deployment and continual pre-training on institutional data. These domain-specific models serve as candidate backbones for self-evolving clinical agents, particularly for settings where data governance precludes API-based models.

Hermes-3/4 models [48,49] are purpose-built agentic fine-tunes of LLaMA 3.1, trained with SFT, DPO [59], and large-scale RL on filtered reasoning trajectories; we adopt them as the backbone of the reference implementation in §2.3, where the training pipeline is detailed.

MEDITRON [60] further shows that continued medical pre-training on LLaMA substantially improves clinical benchmark performance, illustrating that backbone choice (general API model, domain-pre-trained open weight, or agentic fine-tune) is itself a clinical engineering decision constrained by data governance and update flexibility.

2.2. The Self-Evolving Agent Harness

The “agent harness” is the infrastructure layer that wraps around the LLM backbone to provide memory, tools, orchestration, and training capabilities (see footnote 1). Most current clinical agent harnesses are “stateless”: every session starts from scratch, discarding accumulated context. A “self-evolving harness”, by contrast, persists knowledge across sessions, synthesizes reusable skills from experience, and generates its own training data to improve the underlying model. As shown in Figure 1, the self-evolving harness consists of four tightly integrated layers.

Orchestration Layer: Prompt Assembly and ReAct Loop. The orchestration layer manages each inference step of the clinical agent. At each iteration, it assembles the system prompt from multiple persistent sources:

$$P_{sys} = P_{persona} \,\|\, M_{env} \,\|\, M_{user} \,\|\, S_{active} \,\|\, C_{ctx} \qquad (1)$$

where $\|$ denotes sequential concatenation; $P_{persona}$ defines the agent’s clinical persona and safety constraints; $M_{env}$ contains accumulated environmental facts; $M_{user}$ captures clinician preferences; $S_{active}$ denotes lazily loaded skill documents; and $C_{ctx}$ contains project-level context. This assembly process ensures that every clinical interaction is informed by accumulated context, a fundamental departure from memoryless prompting.
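As a minimal sketch of the concatenation in Eq. (1), the snippet below joins the five components in the stated order. The `PromptSources` container, the `assemble_system_prompt` function, the separator, and the choice to skip empty components are all assumptions made for illustration; they are not taken from any surveyed framework.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSources:
    """Hypothetical container for the five components of Eq. (1)."""
    persona: str                  # P_persona: clinical persona and safety constraints
    env_memory: str               # M_env: accumulated environmental facts
    user_memory: str              # M_user: clinician preferences
    active_skills: list[str] = field(default_factory=list)  # S_active: loaded skills
    project_context: str = ""     # C_ctx: project-level context


def assemble_system_prompt(src: PromptSources, sep: str = "\n\n") -> str:
    """Concatenate the components in the order of Eq. (1), skipping empty ones."""
    parts = [src.persona, src.env_memory, src.user_memory,
             *src.active_skills, src.project_context]
    return sep.join(p.strip() for p in parts if p.strip())
```

Keeping the order fixed means the prefix of the prompt stays identical across turns as long as the underlying sources are unchanged.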

The orchestration layer implements the ReAct (Reasoning + Acting) paradigm [61] as its core execution loop, adapted for clinical contexts:

1. Observation: The agent reads tool results, EHR data, or clinical input from the environment.
2. Reasoning: The LLM analyzes the current state against the clinical goal, generating an internal reasoning trace that references accumulated memory and skills.
3. Action: The agent executes a tool call (medical database query, code execution, image analysis) or generates a clinical output.

The loop is managed by an iteration budget that prevents runaway execution, while an adaptive context compression module preserves clinically critical information while discarding redundant history.
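The observe, reason, act cycle and its iteration budget can be sketched as follows. This is an illustrative skeleton under our own assumptions: the `step` callable stands in for the LLM call, the `"final"` action name and the default budget of 8 are hypothetical, and context compression is omitted.

```python
from typing import Callable

# Hypothetical step function: takes the latest observation and returns
# (reasoning_trace, action_name, action_argument). A real agent would call an LLM here.
StepFn = Callable[[str], tuple[str, str, str]]


def run_react_loop(step: StepFn, tools: dict[str, Callable[[str], str]],
                   first_observation: str, max_iterations: int = 8) -> dict:
    """Observe, reason, act until the agent emits 'final' or the budget runs out."""
    observation = first_observation
    trace = []
    for i in range(max_iterations):
        reasoning, action, argument = step(observation)
        trace.append({"iteration": i, "reasoning": reasoning, "action": action})
        if action == "final":
            return {"output": argument, "trace": trace, "budget_exhausted": False}
        if action not in tools:
            # Feed the error back as the next observation instead of crashing.
            observation = f"error: unknown tool {action!r}"
            continue
        observation = tools[action](argument)
    return {"output": None, "trace": trace, "budget_exhausted": True}
```

Returning an explicit `budget_exhausted` flag lets the caller escalate to a clinician rather than silently returning a partial answer.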

Memory Layer: Three-Tier Persistent Memory. The memory layer is the central architectural innovation that distinguishes self-evolving harnesses from session-bounded ones. Building on prior work on long-term memory for LLMs [15,16,18,62] and the experience compression hierarchy of Zhang et al. [63], the self-evolving harness operationalizes a hierarchical, self-managed memory that persists across clinical sessions.

Tier 1: Environment and preference memory. Persistent configuration files store facts about the clinical environment (EHR configuration, encountered edge cases) and clinician preferences (communication style, specialty focus). These are loaded as a frozen snapshot at session start; updates written during a session take effect in the next session, ensuring within-session consistency while accumulating knowledge across sessions.
Tier 2: Skill memory. When the agent completes a complex clinical task, it automatically synthesizes the experience into a reusable skill document stored in a persistent local skill library. Skills are loaded hierarchically: Level 0 loads only skill names and descriptions; Level 1 loads the full skill specification on demand; Level 2 loads specific reference files. This lazy loading minimizes token overhead while maintaining a rich, searchable clinical skill library.
Tier 3: Session search index. All historical conversations are indexed in a full-text search database, enabling the agent to retrieve relevant precedents from past clinical interactions. This tier supports long-range continuity across hundreds of clinical sessions.
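The Tier 1 rule that in-session updates only become visible in the next session can be illustrated with a small sketch. The `SnapshotMemory` class and the JSON file format are assumptions chosen for brevity, not the storage format of any surveyed system.

```python
import json
import tempfile
from pathlib import Path


class SnapshotMemory:
    """Tier 1 sketch: reads see a frozen snapshot; writes land on disk for the next session."""

    def __init__(self, path: Path):
        self._path = path
        stored = json.loads(path.read_text()) if path.exists() else []
        self._snapshot = tuple(stored)   # immutable view used throughout this session
        self._pending = list(stored)     # working copy that receives new entries

    def entries(self) -> tuple:
        return self._snapshot

    def remember(self, fact: str) -> None:
        if fact not in self._pending:
            self._pending.append(fact)
            self._path.write_text(json.dumps(self._pending))


# Usage: a fact written in session 1 is only visible from session 2 onwards.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "memory.json"
    session_1 = SnapshotMemory(store)
    session_1.remember("EHR export omits units for creatinine")
    seen_in_session_1 = session_1.entries()
    seen_in_session_2 = SnapshotMemory(store).entries()
```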

Security is built into the memory layer: entries are scanned for prompt injection and data exfiltration patterns (including invisible Unicode characters) before being accepted into persistent storage, a critical requirement for protecting protected health information (PHI).

Tool Layer: MCP-registered Clinical Tools. The tool layer exposes external clinical resources (EHR endpoints, medical knowledge bases, imaging tools, code execution sandboxes) to the agent through a uniform interface. We adopt the Model Context Protocol (MCP) [64] as the standard registration mechanism so that institutional plugins (FHIR endpoints, PubMed connectors, DICOM viewers) are discovered and integrated without framework modification. The engineering specifics of tool registration, security auditing, and subagent spawning are detailed in §3.

Training Layer: RL Closed Loop. The most distinctive capability of the self-evolving harness is its RL closed loop: the agent not only performs clinical tasks but also generates its own training data to improve the next model generation. Prior work on experiential learning agents [65], contextual experience replay [66], and experience-driven agent lifecycles [67] has demonstrated the feasibility of this approach in general domains; the self-evolving harness specializes it for clinical deployment. This creates a compounding improvement cycle particularly well suited to clinical domains where labeled data is scarce.

The loop operates as follows:

1. The agent executes clinical tasks (e.g., differential diagnosis, radiology report review) and records full interaction trajectories.
2. A distributed RL coordinator collects trajectories and routes them to task-specific verifiers.
3. Domain-specific verifiers (medical knowledge checkers, clinical guideline validators) score trajectories via rejection sampling, filtering for high-quality clinical reasoning.
4. A training service receives filtered trajectories and performs parameter-efficient fine-tuning (e.g., LoRA [68]) of the backbone model, updating clinical reasoning capabilities.

This closed loop transforms clinical deployment from a one-time fine-tuning event into a continuously improving system: the agent accumulates clinical expertise through use, analogous to a resident who improves through supervised practice. Concrete clinical instantiations of this principle are surveyed in §4.
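Steps 2 and 3 of the loop (routing to task-specific verifiers and filtering by score) can be sketched as a simple threshold filter, which is one elementary form of rejection sampling. The `Trajectory` type, the verifier signature, the [0, 1] score scale, and the default threshold are assumptions for illustration; the fine-tuning step itself is not shown.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Trajectory:
    task_type: str      # e.g. "radiology_report_review" (hypothetical label)
    transcript: str     # full interaction record


# Assumed verifier interface: returns a quality score in [0, 1].
Verifier = Callable[[Trajectory], float]


def filter_trajectories(trajectories, verifiers: dict[str, Verifier],
                        threshold: float = 0.85) -> list[Trajectory]:
    """Keep only trajectories whose task-specific verifier score exceeds the threshold."""
    kept = []
    for traj in trajectories:
        verifier = verifiers.get(traj.task_type)
        if verifier is None:
            continue  # no verifier for this task type: never train on unscored data
        if verifier(traj) > threshold:
            kept.append(traj)
    return kept
```

Dropping unscored trajectories, rather than passing them through, is a conservative design choice: in a clinical setting an unverified trajectory should not influence the next model generation.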

2.3. Reference Implementation: The Hermes Framework

The NousResearch Hermes Agent framework (footnote 2) provides a concrete, open-source instantiation of the four-layer architecture: orchestration assembles the system prompt from SOUL.md (persona), MEMORY.md (≈800 tokens), and USER.md (≈500 tokens) in ChatML format; the memory layer maps to bounded MEMORY.md/USER.md files (Tier 1), agentskills.io skill documents loaded hierarchically (Tier 2, ≈3K tokens at Level 0 for 40+ skills), and a SQLite FTS5 session index (Tier 3); the training layer combines the Atropos RL framework (≈1,000 task-specific verifiers) with the Tinker service performing LoRA fine-tuning on the Hermes-3/4 (LLaMA 3.1-based) backbone. The architecture itself is framework-agnostic: generic agent frameworks such as AutoGen [69], LangChain, or CrewAI could implement the same four layers with different engineering choices, including alternative persistent-memory designs (memory-graph stores or token-compressed tool-output pipelines) rather than the bounded MEMORY.md convention of Hermes. Hermes is used as a reference throughout because it is, to our knowledge, the most complete open-source instantiation of all three self-evolving principles simultaneously.

3. Self-Evolving Agent Engineering: Taxonomy of Mechanisms

This section turns from architecture to operational design principles. SEAE organises these into three categories, each addressing a distinct dimension of clinical persistence: memory engineering, tool and multi-agent engineering, and RL trajectory engineering.

3.1. Memory Engineering

Memory engineering addresses how clinical agents accumulate, organize, and retrieve knowledge across sessions. SEAE proposes three complementary memory modalities that together enable clinically grounded persistence: short-term context management within a session, long-term persistent memory across sessions, and skill memory that encodes procedural clinical knowledge. Knowledge graph-based approaches [70] complement these modalities by encoding relational clinical knowledge for multi-hop reasoning. The capacity bounds these modalities impose are not arbitrary engineering choices: recent probing work shows that current LLMs degrade systematically as the depth of cumulative state tracking grows [39], with that depth rather than arithmetic or entity-tracking complexity dominating difficulty, which motivates both the bounded MEMORY.md convention (§2.3) and the hierarchical L0/L1/L2 skill-loading scheme described below.

Short-term Context Management. Short-term memory in self-evolving agents corresponds to the active context window, the LLM’s immediate working memory during a clinical session. Effective short-term memory engineering involves:

Context compression: An adaptive summarization mechanism activates when the session context approaches the model’s maximum token length. Clinical summaries prioritize diagnosis-critical information (patient history, abnormal findings, differential diagnoses) while compressing repetitive or administrative content, so that long encounter histories remain manageable without losing the clinically load-bearing detail.

Prompt caching: The frozen system prompt (≈1,300 tokens) is cached across turns to avoid redundant encoding, reducing inference costs in high-volume clinical deployments.

Iteration budgeting: The IterationBudget mechanism limits the number of reasoning steps per query, preventing runaway inference loops that could delay time-sensitive clinical decisions.
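To make the context-compression idea concrete, the sketch below uses priority-based pruning as a much simpler stand-in for adaptive summarisation: once the estimated token count nears the limit, clinically critical messages are kept verbatim and everything else collapses into a one-line placeholder. The message schema, the `CRITICAL_KINDS` labels, the four-characters-per-token estimate, and the 0.9 trigger ratio are all assumptions.

```python
CRITICAL_KINDS = {"history", "finding", "differential"}  # hypothetical message labels


def estimate_tokens(text: str) -> int:
    """Crude length-based token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def compress_context(messages: list[dict], max_tokens: int,
                     trigger_ratio: float = 0.9) -> list[dict]:
    """Keep critical messages verbatim once the context nears the token limit."""
    total = sum(estimate_tokens(m["text"]) for m in messages)
    if total < trigger_ratio * max_tokens:
        return list(messages)  # still comfortably within budget: nothing to do
    kept = [m for m in messages if m["kind"] in CRITICAL_KINDS]
    dropped = len(messages) - len(kept)
    if dropped:
        kept.append({"kind": "summary",
                     "text": f"[{dropped} non-critical messages compressed]"})
    return kept
```

A production system would replace the placeholder with an LLM-generated summary and would also need a fallback for the case where the critical messages alone exceed the limit, which this sketch does not handle.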

Long-term Persistent Memory Design. Long-term memory in self-evolving agents is stored in persistent configuration files such as MEMORY.md and USER.md. Effective long-term memory engineering requires careful curation:

Memory capacity management: In the Hermes Agent reference configuration (footnote 2), MEMORY.md is capped at 2,200 characters (≈800 tokens) and USER.md at 1,375 characters (≈500 tokens); the running system-prompt header surfaces current usage as a percentage so the agent can proactively consolidate before overflow. Once the bound is reached, the agent merges or evicts entries under a self-managed replacement policy that prioritizes clinically novel information over redundant facts. This bounded curation prevents the “stale knowledge” problem observed in unbounded memory systems [18].
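A bounded store with a usage indicator might look like the following sketch. The 2,200-character default mirrors the MEMORY.md cap quoted above; the `BoundedMemory` class, the numeric novelty score, and the evict-lowest-novelty policy are our assumptions, since the replacement policy is described only as self-managed.

```python
class BoundedMemory:
    """Character-capped memory that evicts the least novel entries when full."""

    def __init__(self, char_limit: int = 2200):
        self.char_limit = char_limit
        self.entries: list[tuple[float, str]] = []   # (novelty score, text)

    def used_chars(self) -> int:
        return sum(len(text) for _, text in self.entries)

    def usage_percent(self) -> float:
        """Value that could be surfaced in the system-prompt header."""
        return 100.0 * self.used_chars() / self.char_limit

    def add(self, text: str, novelty: float) -> bool:
        """Insert an entry; return False if it is too large or loses the eviction contest."""
        if len(text) > self.char_limit:
            return False
        ranked = sorted(self.entries + [(novelty, text)],
                        key=lambda e: e[0], reverse=True)
        while sum(len(t) for _, t in ranked) > self.char_limit:
            ranked.pop()   # drop the lowest-novelty entry
        self.entries = ranked
        return (novelty, text) in ranked
```

Note that a new entry with low novelty can itself be the one evicted, so the store never displaces more informative facts in favour of redundant ones.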

Clinical content guidelines: Effective MEMORY.md entries for healthcare include: EHR system configuration and quirks, local clinical guideline variants, recurring patient population characteristics, previously encountered diagnostic edge cases, and tool-specific clinical workflow patterns. USER.md captures clinician-specific preferences: preferred terminology, report formatting style, specialty focus, and communication preferences.

Security and injection prevention: All proposed memory entries are scanned for prompt injection patterns (adversarial instructions disguised as clinical facts) and PHI exfiltration attempts before acceptance. Invisible Unicode characters, a common injection vector, are explicitly detected and blocked.
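A minimal version of such a scan is sketched below. It flags Unicode format characters (category `Cf`, which includes zero-width spaces and bidirectional controls) and matches a couple of illustrative patterns. The pattern list and the `scan_memory_entry` function are hypothetical; a real deny-list would be far broader, and treating any URL as suspicious is a deliberately crude proxy for exfiltration detection.

```python
import re
import unicodedata

# Illustrative patterns only; these are assumptions, not a vetted deny-list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"https?://\S+", re.IGNORECASE),   # outbound URL as an exfiltration proxy
]


def scan_memory_entry(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the entry may be stored."""
    problems = []
    for ch in text:
        # Category "Cf" covers invisible format characters such as U+200B and U+202E.
        if unicodedata.category(ch) == "Cf":
            problems.append(f"invisible character U+{ord(ch):04X}")
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            problems.append(f"matched pattern: {pattern.pattern}")
    return problems
```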

External memory plugins: For healthcare deployments requiring richer memory semantics, SEAE supports external memory provider plugins covering capabilities such as knowledge graph construction (Mem0), user behavior modeling (Honcho), automatic fact extraction from session transcripts, semantic vector search across historical cases, and cross-session user modeling (Supermemory). These plugins extend base memory capacity without modifying the core agent runtime.

Skill Memory: Experience Synthesis. Skill memory is the most clinically significant innovation of SEAE. When the agent completes a complex task, such as a differential diagnosis across multiple imaging modalities or a multi-step medication reconciliation, it synthesises the experience into a persistent skill document. A defining requirement is full autonomy in the evolution process: the agent decides when to create, update, or retire skills, and how to improve them, without human intervention between sessions, which distinguishes self-evolving agents from systems that merely accumulate data for human-supervised retraining [65,67]. The autonomous skill evolution process is driven by six distinct triggers:

1. Novelty trigger: No existing skill matches the current task (cosine similarity below threshold against all skill descriptions). The agent reasons from general medical knowledge, retrieves relevant references via MCP tools, and if a domain-specific clinical verifier scores the output above 72, synthesizes a new skill document following the agentskills.io standard and indexes it in the local skill library.
2. Performance trigger: An existing skill is used but the verifier scores the output below the skill’s historical best by more than a configurable margin (default: 10 points). The agent generates an improved skill variant incorporating newly identified reasoning steps or clinical pathways; the updated version replaces the prior skill only if it scores higher on a held-out validation set of similar cases.
3. Contradiction trigger: A new clinical guideline fetched via PubMed or UpToDate MCP tools contradicts an assumption embedded in an existing skill. The agent flags the skill for revision and generates an updated version citing the new guideline as authority.
4. Consolidation trigger: Every $N$ tasks (configurable, default $N = 50$), the agent reviews its skill library for redundancy: pairs with cosine similarity above 0.85 on their description embeddings are merged into a single unified skill. This prevents skill library bloat and maintains retrieval efficiency.
5. RL trigger: High-quality trajectories (verifier score $> 85\%$) are submitted to the Atropos RL pipeline, where Tinker performs LoRA fine-tuning on the Hermes backbone. This improves the underlying reasoning quality that drives all subsequent skill synthesis, creating a compounding improvement cycle.
6. Boundary trigger: A skill is flagged for refinement when repeated invocations on similar clinical inputs produce inconsistent outcomes under a fixed policy (some succeeding, others failing) rather than systematically scoring above or below a fixed threshold. This stochastic-instability signal, recently introduced in the agent-data co-evolution framework of Yang et al. [71] where it accounts for >45% of detected weaknesses on tool-use benchmarks, captures decision-boundary cases that the deterministic performance trigger misses. In a clinical context, boundary signals identify skills whose recommendations are unstable across superficially similar cases (e.g., a chest-radiograph skill that correctly handles obvious consolidation but flips between findings on borderline opacities), prioritising precisely the skill regions most in need of additional training data.
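Five of the six triggers reduce to simple numeric checks and can be sketched as a single dispatch function. The function name, its signature, the 0 to 100 score scale, and the novelty similarity threshold of 0.5 are assumptions (the text does not state the novelty threshold); the values 72, 10, 50, and 85 are the defaults quoted above. The contradiction trigger is omitted because it requires comparing guideline text against skill content.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fired_triggers(task_vec, skill_vecs, score, best_score=None, recent_outcomes=(),
                   tasks_completed=0, novelty_threshold=0.5, create_score=72,
                   margin=10, consolidate_every=50, rl_score=85) -> list[str]:
    """Return the names of the triggers that fire for one completed task."""
    best_sim = max((cosine(task_vec, v) for v in skill_vecs.values()), default=0.0)
    matched = best_sim >= novelty_threshold
    fired = []
    if not matched and score > create_score:
        fired.append("novelty")          # no matching skill, output good enough to keep
    if matched and best_score is not None and best_score - score > margin:
        fired.append("performance")      # regression against the skill's historical best
    if tasks_completed > 0 and tasks_completed % consolidate_every == 0:
        fired.append("consolidation")    # periodic redundancy review
    if score > rl_score:
        fired.append("rl")               # trajectory is a fine-tuning candidate
    if True in recent_outcomes and False in recent_outcomes:
        fired.append("boundary")         # mixed outcomes on similar inputs
    return fired
```

Several triggers can fire for the same task; how a harness prioritises between them is a design decision this sketch leaves open.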

Table 1 consolidates the six triggers, mapping each to its definition and the concurrent literature that supports it.

Skill synthesis is particularly valuable for radiology, where high-dimensional imaging inputs make accumulated pattern knowledge difficult to encode in static prompts. An agent that has processed many chest radiographs may, in principle, synthesise imaging-specific skills (e.g., distinguishing true opacities from overlapping rib shadows) that static prompt engineering cannot easily replicate. MACRO [79] provides direct empirical evidence of this self-skill-discovery behaviour for medical imaging agents. In addition, a skill document for one task (“pneumonia diagnosis from chest X-ray”) transfers partially to a structurally similar one (“COVID-19 infiltrate assessment”) rather than requiring re-derivation.

Verification of skill updates. Triggers determine when and what to evolve, but a self-evolving harness in a clinical setting also needs an auditable mechanism for verifying that an update is an improvement rather than a regression. Three complementary mechanisms recur in the recent literature and can be combined within SEAE. (i) Component-level versioning treats every skill document, tool registration, and memory entry as a versioned file-level artifact so that each update is reversible and attributable to a specific harness component [80] (the Hermes Agent reference of footnote 2 adopts the same pattern). (ii) Trajectory distillation archives the rollout trajectories that triggered each update and exposes them as a layered, drill-down evidence corpus for later audit, rather than storing only the final skill text [80,81]. (iii) Falsifiable predictions pair every proposed update with an explicit prediction about its expected effect on a next-round verifier score (or on a held-out validation set of similar cases), and only accept updates whose predictions are subsequently verified [73,80]. The third mechanism is what the performance trigger’s “replace only if scored higher on a held-out validation set” rule above instantiates; the first two represent natural extensions that are particularly relevant for HIPAA-compliant deployment, where regulators expect every model-affecting change to be auditable.
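A minimal sketch of how versioning and falsifiable predictions could fit together is given below. The class names, fields, and the acceptance rule (the measured held-out gain must be positive and at least as large as the predicted gain) are assumptions chosen for illustration; the cited works may define "verified" differently.

```python
from dataclasses import dataclass, field


@dataclass
class SkillVersion:
    version: int
    text: str
    validation_score: float  # score on a held-out validation set (hypothetical scale 0-100)


@dataclass
class SkillRecord:
    name: str
    history: list = field(default_factory=list)  # append-only version log

    @property
    def current(self):
        return self.history[-1]

    def propose(self, text, predicted_gain, measured_score):
        """Accept an update only if the measured score confirms the prediction."""
        baseline = self.current.validation_score
        confirmed = (measured_score > baseline
                     and measured_score - baseline >= predicted_gain)
        if confirmed:
            self.history.append(
                SkillVersion(self.current.version + 1, text, measured_score))
        return confirmed

    def rollback(self):
        """Revert to the previous version; the first version is never removed."""
        if len(self.history) > 1:
            return self.history.pop()
        return None


record = SkillRecord("cxr_opacity_triage", [SkillVersion(1, "initial skill text", 70.0)])
rejected = record.propose("revision A", predicted_gain=5.0, measured_score=73.0)
accepted = record.propose("revision B", predicted_gain=5.0, measured_score=76.0)
```

Because the history is append-only, every accepted change stays attributable to a specific version, and `rollback()` provides the reversibility that component-level versioning calls for.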

Composing the six triggers into one decision loop. In aggregate, the triggers compose into a single per-task decision loop that any SEAE-compliant harness can implement:

1.: Query the skill library (L0 lookup: skill names and descriptions, ≈3K tokens).
2.: If a matching skill is found (similarity $> 0.75$), load the full skill document (L1) and execute the task with skill guidance.
3.: Evaluate the output with the relevant domain verifier. If the verifier score exceeds the skill’s recorded best, log the trajectory and update the skill’s performance history. If the score falls below the recorded best by more than 10 points, initiate skill refinement (performance trigger).
4.: If no matching skill is found, execute with general clinical reasoning, retrieve supporting evidence via MCP tools, and, if verifier score $> 72$, synthesise a new skill document (novelty trigger).
5.: After every 50 tasks, run the consolidation pass to merge skill documents with similarity $> 0.85$ (consolidation trigger).
6.: All trajectories with verifier score $> 85\%$ are queued for RL fine-tuning of the backbone (RL queue).

The contradiction and boundary triggers fire orthogonally to this main loop, on retrieval events (newly fetched guideline disagrees with an existing skill) and on outcome-variance audits (the same input yields divergent outputs across repeated invocations).
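The six steps of the main loop can be sketched as a single decision function. This is a simplified illustration, not a reference implementation: the thresholds are the defaults quoted above, while the function signature, the action labels, and the assumption that task execution and verification happen elsewhere are all hypothetical.

```python
from dataclasses import dataclass

MATCH_THRESHOLD = 0.75    # step 2: skill-match similarity
REFINE_MARGIN = 10        # step 3: points below the recorded best
NOVELTY_THRESHOLD = 72    # step 4: verifier score needed to synthesise a new skill
RL_THRESHOLD = 85         # step 6: verifier score needed for the RL queue
CONSOLIDATE_EVERY = 50    # step 5: tasks between consolidation passes


@dataclass
class Skill:
    name: str
    best_score: float


def decide(task_index, match, score):
    """Return the actions the loop would take for one completed task.

    task_index: 1-based task counter.
    match: (Skill, similarity) from the L0 lookup, or None.
    score: verifier score for the task output, on a 0-100 scale.
    """
    actions = []
    if match is not None and match[1] > MATCH_THRESHOLD:
        skill = match[0]
        if score > skill.best_score:
            skill.best_score = score
            actions.append("log_trajectory")
        elif skill.best_score - score > REFINE_MARGIN:
            actions.append("refine_skill")        # performance trigger
    elif score > NOVELTY_THRESHOLD:
        actions.append("synthesise_skill")        # novelty trigger
    if score > RL_THRESHOLD:
        actions.append("queue_for_rl")            # RL queue
    if task_index % CONSOLIDATE_EVERY == 0:
        actions.append("consolidate")             # consolidation trigger
    return actions


cxr = Skill("cxr_pneumonia", best_score=80.0)
first = decide(1, (cxr, 0.90), 90.0)
second = decide(2, (cxr, 0.90), 75.0)
third = decide(3, None, 80.0)
fiftieth = decide(50, None, 60.0)
```

The contradiction and boundary triggers are deliberately absent from the sketch, since they fire on retrieval events and outcome-variance audits rather than inside the per-task loop.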

Convergence evidence from concurrent work. Beyond the per-trigger references summarised in Table 1, two recent works are particularly informative for the clinical and quantitative case. MACRO [79] brings self-skill discovery directly to medical imaging, synthesising verified multi-step tool sequences into composite primitives and grounding selection via image-feature memory; this is the first concrete evidence that skill synthesis extends to the radiology domain. SkillsBench [82], evaluating 86 tasks across 11 domains, finds that curated skills raise success rates by 16.2 percentage points on average and by 51.9 points in the healthcare domain specifically, the largest gain of any domain tested. Lin et al. [80] formalise observability-driven closed-loop harness evolution for coding agents under the name Agentic Harness Engineering (AHE): ten agent-driven iterations lift Terminal-Bench 2 pass@1 from 69.7% to 77.0%, with a component ablation localising the gain to tools, middleware, and long-term memory rather than the system prompt, and the evolved harness transferring to alternate base-model families with gains of $+5.1$ to $+10.1$ pp without re-evolution. Together, this convergence across independent research groups, spanning general-purpose, productivity, coding, and medical imaging domains, points to self-evolving harness architectures as a promising direction for overcoming the limitations of session-bounded agents.

3.2. Tool and Multi-Agent Engineering

Tool Integration Via MCP. Self-evolving agents integrate with clinical tools through the Model Context Protocol (MCP) [64], an open standard for LLM tool connectivity. The ability of language models to teach themselves to use tools [83] and to act as general-purpose agents across diverse tool environments [84] is well established. Xu and Yan [85] survey the emerging ecosystem of composable agent skills (including MCP integration, agentskills.io-standard skill packages, and RL-based skill acquisition) and note that 26.1% of community-contributed skills contain security vulnerabilities, which reinforces the need for the permission governance mechanisms described in Section 5. SEAE specializes this capability for healthcare by providing curated, security-audited clinical tool registries. In healthcare deployments, MCP-registered tools include:

EHR connectors: Structured queries against HL7 FHIR endpoints for patient history, lab results, and medication lists.
Medical knowledge bases: Integration with PubMed, UpToDate, and clinical guideline repositories via API.
Imaging tools: DICOM viewers, segmentation models, and radiology-specific analysis libraries.
Code execution: The execute_code tool enables the agent to perform biostatistical analyses, generate clinical visualizations, and run diagnostic algorithms within a single inference call.

The MCP architecture allows editor plugins and institutional EHR systems to register their own tools, which the harness automatically discovers and integrates, enabling deployment in heterogeneous clinical IT environments without framework modification.
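The registration-and-discovery pattern described above can be illustrated with a plain-Python registry. This sketch does not use any MCP SDK and does not reproduce the MCP wire protocol; `ToolSpec`, `ToolRegistry`, the `audited` flag, and the stub `fhir_lab_results` tool are hypothetical names introduced only to show how a curated, security-audited registry might gate what the harness can discover.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    handler: Callable[..., object]
    audited: bool = False  # assumption: set by an institutional security review


class ToolRegistry:
    """In-memory registry that only admits security-audited tools."""

    def __init__(self):
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if not spec.audited:
            raise PermissionError(f"tool {spec.name!r} has not passed security audit")
        self._tools[spec.name] = spec

    def discover(self):
        """List the tool names the harness may call."""
        return sorted(self._tools)

    def call(self, name, **kwargs):
        return self._tools[name].handler(**kwargs)


def fhir_lab_results(patient_id):
    """Stub standing in for a FHIR query; returns canned data only."""
    return {"patient_id": patient_id, "labs": []}


registry = ToolRegistry()
registry.register(ToolSpec("fhir_lab_results", "Lab results via FHIR (stub)",
                           fhir_lab_results, audited=True))

try:
    registry.register(ToolSpec("unvetted_plugin", "Community tool", lambda: None))
    unvetted_rejected = False
except PermissionError:
    unvetted_rejected = True
```

Rejecting unaudited tools at registration time, rather than at call time, keeps them out of the discovery list entirely, so the agent never plans around a tool it is not permitted to use.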

Recent biomedical agent platforms further demonstrate that large, curated tool ecosystems are now feasible at scale. ToolUniverse [86] unifies access to scientific databases, analysis libraries, and execution environments under a single agent-callable interface; TxAgent [87] specializes this capability for therapeutic reasoning across a broad universe of tools; and Biomni [88] demonstrates a general-purpose biomedical agent that orchestrates literature search, omics pipelines, and computational analysis as composable units. These systems provide concrete evidence that the tool-integration layer of the agent harness can support the scale and heterogeneity of real biomedical workflows; the self-evolving extension proposed here is to combine such ecosystems with persistent skill memory and closed-loop training so that effective tool-invocation patterns are accumulated, refined, and reused across sessions rather than re-derived each time.

Subagent Spawning for Parallel Clinical Workstreams. Complex clinical tasks often require parallel investigation of multiple hypotheses. SEAE supports this through subagent spawning: the primary agent can instantiate isolated subagents with independent conversations, terminals, and tool access. Each subagent operates at zero context cost to the parent, enabling:

Simultaneous differential diagnosis exploration across disease categories
Parallel review of imaging findings, laboratory results, and clinical history
Independent pharmacological interaction checking and dosage verification

Subagents return their results to the parent via Python RPC, and the parent consolidates the findings to generate the final clinical recommendation. This architecture mirrors the multi-specialty consultation model in clinical medicine [14], where multiple specialists contribute independent assessments to a unified treatment decision. General-purpose multi-agent frameworks such as AutoGen [69] and MetaGPT [89] have demonstrated the effectiveness of agent-to-agent coordination for complex task decomposition; SEAE specializes this coordination for clinical reasoning chains.
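The fan-out and consolidation pattern can be sketched with Python's standard `concurrent.futures` module. Threads stand in here for isolated subagents, and a plain function return stands in for the RPC channel; `run_subagent`, `consult`, and the toy case record are hypothetical and carry no real clinical logic.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(workstream, case):
    """Stand-in for an isolated subagent.

    Only a short summary is returned, so the parent's context does not grow
    with the subagent's intermediate reasoning.
    """
    items = case[workstream]
    return {"workstream": workstream,
            "summary": f"{workstream}: reviewed {len(items)} items"}


def consult(case):
    """Fan out one subagent per workstream, then consolidate in the parent."""
    workstreams = sorted(case)
    with ThreadPoolExecutor(max_workers=len(workstreams)) as pool:
        results = list(pool.map(lambda w: run_subagent(w, case), workstreams))
    return {r["workstream"]: r["summary"] for r in results}


# Synthetic case record, for illustration only.
case = {
    "imaging": ["cxr_frontal", "cxr_lateral"],
    "labs": ["cbc", "crp", "procalcitonin"],
    "history": ["prior_admission"],
}
findings = consult(case)
```

`pool.map` returns results in input order, so the consolidated dictionary is deterministic even though the workstreams run concurrently.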

RL Trajectory Engineering for Clinical Self-Improvement. RL trajectory engineering is the mechanism by which self-evolving agents improve through clinical use. The key engineering decisions are:

Clinical verifier design: Task-specific verifiers evaluate whether an agent’s clinical reasoning trajectory is correct. For medical QA, verifiers compare agent answers against validated clinical references such as MedQA [90] and PubMedQA [91]. For diagnostic tasks, verifiers assess alignment with clinical guidelines and expert physician labels. Effective verifier design requires domain expertise: a verifier that rewards diagnostic confidence may inadvertently penalize appropriate clinical uncertainty. Reasoning enhancement techniques such as self-consistency [92] and tree-of-thought [93] can be integrated as trajectory scoring methods to identify high-quality clinical reasoning chains for RL training.

Trajectory filtering: The Atropos framework applies rejection sampling across trajectories: only trajectories that satisfy verifier criteria above a quality threshold are retained for training. This filtering is critical in clinical contexts where unsafe reasoning trajectories (e.g., a confident but incorrect diagnosis) must be excluded regardless of the agent’s expressed certainty. Complementarily, AgentHER [81] recycles failed trajectories rather than discarding them: a four-stage pipeline (failure classification, outcome extraction, LLM-guided relabeling with confidence gating, and data packaging) converts failures into SFT/DPO training data. The authors report gains of 7.6–11.4 percentage points over success-only training across four models, relabeling precision rising from 94.1% to 97.1% on WebArena (96.0% on ToolBench), and a halving of the number of successful demonstrations required to reach baseline performance. This suggests that clinical trajectory pipelines should archive failures as well as successes.

Closed-loop training-data evolution. A further refinement, demonstrated in CoEvolve [71], treats the training distribution itself as a co-evolving artifact rather than a fixed corpus. Three signal types extracted from rollout trajectories during GRPO training, namely forgetting (regression on previously-mastered tasks), boundary (high outcome variance under fixed policy), and rare-pattern (under-explored action sequences), guide LLM-driven environment re-exploration; the resulting interactions are abstracted into task–solution pairs, validated by execution, and appended to the training set. Without any human supervision, this loop yields absolute gains of $15.6$–$19.4$ pp across three Qwen backbones (Qwen2.5-7B: $+19.4$; Qwen3-4B: $+15.6$; Qwen3-30B-A3B: $+18.1$, taken from the paper’s Avg. column which averages five evaluation splits: AppWorld TestN/TestC × TGC/SGC together with BFCL-V3 Multi-turn; individual splits range from $+2$ pp to $+48$ pp) at only ∼10% additional compute cost over standard GRPO, and produces strong cross-domain transfer (AppWorld→BFCL: $26.50 \to 45.00$, i.e., $+18.5$ pp zero-shot, Qwen3-4B). Adapted to clinical settings, an analogous mechanism would prioritise trajectory generation around clinical decision boundaries (e.g., borderline imaging findings, atypical presentations) where verifier disagreement is highest, rather than over-sampling cases the agent already handles reliably; the same authors caution that such autonomous data evolution requires explicit safety filters and risk-triggered review before synthesised cases are admitted into training, a constraint of particular force in healthcare deployments.

Privacy-preserving trajectory generation: Clinical trajectories may contain PHI. SEAE requires de-identification of all trajectories before training data submission. Synthetic patient simulacra (inspired by Agent Hospital [14]) can generate de-identified training trajectories at scale, enabling RL improvement without accessing real patient data.
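The filtering, failure-archiving, and boundary-prioritisation steps described above can be sketched together as follows. The record fields (`verifier_score`, `unsafe`, `deidentified`), the 85-point threshold borrowed from the RL trigger, and the variance cut-offs are assumptions for illustration; the Bernoulli variance $p(1-p)$ is used here as one simple proxy for "high outcome variance under fixed policy" and is not claimed to be the measure used in CoEvolve or Atropos.

```python
from collections import defaultdict


def partition_trajectories(trajectories, threshold=85.0):
    """Split trajectories into training data, a failure archive, and a held set.

    Records not yet de-identified are held back from both other sets.
    """
    training, archive, held = [], [], []
    for t in trajectories:
        if not t["deidentified"]:
            held.append(t)
        elif t["unsafe"] or t["verifier_score"] <= threshold:
            archive.append(t)   # kept for later relabeling, not discarded
        else:
            training.append(t)
    return training, archive, held


def boundary_cases(rollouts, min_runs=4, min_variance=0.15):
    """Flag case groups whose outcomes are unstable under a fixed policy.

    rollouts: iterable of (case_id, success) pairs from repeated invocations.
    Returns {case_id: outcome variance} for groups at or above min_variance.
    """
    by_case = defaultdict(list)
    for case_id, success in rollouts:
        by_case[case_id].append(1 if success else 0)
    flagged = {}
    for case_id, outcomes in by_case.items():
        if len(outcomes) < min_runs:
            continue  # too few repeats to judge stability
        p = sum(outcomes) / len(outcomes)
        variance = p * (1 - p)
        if variance >= min_variance:
            flagged[case_id] = variance
    return flagged


trajectories = [
    {"id": "a", "verifier_score": 92.0, "unsafe": False, "deidentified": True},
    {"id": "b", "verifier_score": 95.0, "unsafe": True, "deidentified": True},
    {"id": "c", "verifier_score": 60.0, "unsafe": False, "deidentified": True},
    {"id": "d", "verifier_score": 90.0, "unsafe": False, "deidentified": False},
]
training, archive, held = partition_trajectories(trajectories)

rollouts = (
    [("borderline_opacity", s) for s in (True, False, True, False)]
    + [("obvious_consolidation", True)] * 4
    + [("rare_presentation", True), ("rare_presentation", False)]
)
unstable = boundary_cases(rollouts)
```

Note that trajectory "b" is archived despite its high score because it is marked unsafe, which mirrors the requirement that unsafe reasoning be excluded regardless of expressed certainty.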

3.3. Comparison: Self-Evolving Harness Vs. Stateless Approaches

The clinical AI landscape offers a spectrum of deployment paradigms, which can be understood as progressive levels of harness sophistication. Table 2 compares four levels: prompt engineering (no harness), RAG-augmented LLM (retrieval-only harness), stateless agent harness (full tool and orchestration capabilities but no cross-session state), and self-evolving agent harness (persistent memory, skill synthesis, and RL self-improvement).

Table 2 positions SEAE against the lower-capability levels of the clinical AI deployment spectrum. Prompt engineering [9] and RAG [12] cannot accumulate clinical knowledge across sessions; RAG is best understood as a retrieval technique incorporable into any harness level rather than a paradigm in its own right. Stateless agent harnesses (ReAct [61], AutoGen [69], MetaGPT [89]; meta-level architecture optimisation in [94]) add tool use, multi-agent coordination, and structured reasoning, but such a harness that has processed 10,000 radiology cases performs identically to one processing its first. Self-evolving harnesses add the memory and training layers of §2.2, subsuming the prior levels: a self-evolving harness incorporates prompt engineering for persona/skill-document design and RAG as one of many registered tools.

4. Applications in Healthcare

This section surveys clinical AI systems that embody one or more of the three SEAE principles across six task categories: classification/diagnosis, generation, detection, augmentation, question answering, and inference. The unified summary table (Table 3) lists each system’s SEAE pillars in a single SEAE column, using three letter tags joined by “·” where applicable: P (persistent cross-session memory: the system explicitly stores and reuses state across distinct interaction sessions, beyond in-context caching within a single session), S (skill/experience synthesis: the system autonomously extracts reusable protocols, templates, or skill documents from task experience, beyond simple retrieval from a fixed knowledge base), and R (RL or closed-loop self-improvement: the system uses interaction outcomes to update model parameters or a reward-driven policy, beyond prompt-level adaptation). A pillar is listed only when the corresponding mechanism is explicitly described and evidenced in the cited work; “–” marks stateless comparison baselines. Table 3 provides per-task evidence in detail, grouped by the six task categories; §4.7 aggregates the per-task evidence into a cross-cutting written summary.
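For readers who wish to encode the tagging scheme programmatically, one possible representation uses a flag enumeration. The `Pillar` class and `seae_tag` helper are hypothetical conveniences, not part of any cited system.

```python
from enum import Flag, auto


class Pillar(Flag):
    NONE = 0
    P = auto()  # persistent cross-session memory
    S = auto()  # skill/experience synthesis
    R = auto()  # RL or closed-loop self-improvement


def seae_tag(pillars):
    """Render a pillar set in the table's notation, e.g. 'P·S' or '–'."""
    letters = [p.name for p in (Pillar.P, Pillar.S, Pillar.R) if p in pillars]
    return "·".join(letters) if letters else "–"


agent_hospital = seae_tag(Pillar.P | Pillar.S)
macro = seae_tag(Pillar.P | Pillar.S | Pillar.R)
stateless = seae_tag(Pillar.NONE)
```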

Table 3. Applications of self-evolving agent principles across the six clinical task categories. Subheader rows group entries by task type. The SEAE column lists active pillars: P (persistent cross-session memory), S (skill/experience synthesis), R (RL or closed-loop self-improvement), joined by “·” when multiple apply; “–” marks stateless comparison baselines. ^† Framework/system documentation (T3). ^‡ Historical non-LLM RL comparator (T4). ^§ Survey paper (not a single benchmarked system). ^⋄ Original evaluation on general-domain knowledge-graph benchmarks; included as methodological reference.

Reference	Task	Agent Type	Dataset	Highlight	SEAE
Classification and diagnosis tasks (§4.1)
MDAgents [5]	Medical decision making	Adaptive multi-agent	MedQA, MMLU-Medical	Dynamic collaboration depth	–
Agent Hospital [14]	Disease diagnosis	Frozen-LLM evolvable agent	MedQA (USMLE) + 20K-patient simulacrum	88.2→92.2% MedQA (GPT-4o); 77.0→95.3% internal diagnosis after 20K-patient evolution	P·S
ClinicalAgents [17]	Clinical diagnosis	Dual-memory multi-agent	Clinical EHR	MCTS orchestration + experience memory	P·S
GPT4MIA [95]	Medical image classification	Plug-and-play GPT-4 (non-agentic)	RetinaMNIST, FractureMNIST3D	Plug-and-play transductive inference	–
Generation tasks (§4.2)
Best Practices Radiology [3]	Report generation guidance	Practitioner guidance (review)	— (no benchmarked dataset)	Best-practice principles for LLM use in radiology reporting	–
Lyu et al. [96]	Radiology report plain-language translation	GPT-4 agent	Radiology report corpora	Translates technical radiology reports into patient-readable language	–
R2GenGPT [35]	Radiology report generation from CXR	Frozen-LLM + visual alignment (≈5M trainable params)	IU-Xray, MIMIC-CXR	Delta-tuning aligns Swin-Transformer features to a frozen LLM; near-SOTA BLEU/ROUGE at 0.07% of total parameters	–
NapSS [97]	Paragraph-level medical text simplification	Two-stage seq2seq (non-agentic)	English medical simplification corpus (Cochrane-style)	Summarize→simplify with narrative prompts; +3–4 pts SARI over seq2seq baseline	–
TissueLab (Co-evolving AI) [98]	Medical image analysis with expert-in-the-loop	Co-evolving agentic system	Medical imaging tasks	Continuous learning from clinician feedback; explainable workflows	P·S·R
Radiologist Copilot [99]	3D radiology report generation	Agentic system with QC feedback	Radiology benchmarks	Integrated observation, template selection, and feedback-driven iterative refinement	–
Detection tasks (§4.3)
MACRO [79]	Medical imaging analysis	Self-skill discovery agent	Diverse medical imaging datasets	Composite tool synthesis from execution histories	P·S·R
Augmentation tasks (§4.4)
Agent Hospital [14]	Synthetic patient generation	Simulacrum agents	20K virtual patients × 21 departments	Retrieval-based self-evolution via simulated encounters	P·S
MEDIMP [100]	Medical image representation learning	Contrastive multi-modal model (non-agentic)	DCE MRI of renal transplants + clinical tabular data	Contrastive learning with LLM-generated prompts; cited as a representation that agents can use	–
AugGPT [101]	Text data augmentation	ChatGPT-based rephrasing (non-agentic)	Few-shot text classification benchmarks	ChatGPT rephrasing for few-shot classification; cited as a data-augmentation tool	–
Hermes RL Trajectories^†	Agent-generated training data (general)	Self-improving agent framework	Atropos-verified trajectories (general; no benchmarked medical run)	Framework that generates its own RL training data; clinical applicability is conjectural	P·S·R
Question answering tasks (§4.5)
Survey: Agents in Medicine^§ [4]	Medical QA benchmark	Survey (multiple agents)	MedQA, MedMCQA, USMLE	Systematic evaluation	—
RadioRAG [12]	Radiology-specific QA	RAG agent	RSNA 80Q + 24 expert-curated	Real-time domain-specific retrieval	–
KG-Agent^⋄ [102]	General KG-augmented QA (non-medical)	Knowledge-augmented agent	General KG benchmarks	Multi-hop KG traversal; 10K-sample tuning outperforms larger LLMs	–
Clinical QA System [103]	Interactive clinical QA over notes	Single-turn extractive QA agent	Unstructured clinical notes	Extractive answers with in-note highlighting for traceability	–
Dynamic Clinical QA [104]	Document-based QA	RAG agent	Clinical documents	Real-time RAG from clinical records	–
Inference tasks (§4.6)
ClinicalAgents [17]	Complex diagnosis	MCTS multi-agent	EHR	Monte Carlo Tree Search for clinical reasoning	P·S
Reinventing Clinical Dialogue^§ [105]	Survey of agentic clinical-dialogue paradigms	Survey (taxonomy of 4 archetypes)	— (survey, no benchmarked dataset)	First-principles taxonomy of agentic clinical-dialogue systems	—
AI Clinician^‡ [106]	Sepsis treatment optimization	RL agent	MIMIC-III ICU	Optimal policy learning from 17K patient trajectories	R
DiagAgent [107]	Multi-turn diagnostic inference	RL-trained agent	DiagBench (2.2K cases)	+11.20% accuracy, +17.58% exam F1 over 11 LLMs	R
DoctorAgent-RL [108]	Clinical dialogue	Multi-agent RL	MTMedDialog	RL-refined questioning strategy for diagnosis	R
MedAide [109]	Multi-intent medical QA	Rotation multi-agent	4 medical benchmarks	Intent-aware multi-agent routing	–
Medical Data Inference [110]	End-to-end clinical inference	Multi-agent pipeline (7 specialised agents)	Geriatrics, palliative care, colonoscopy imaging	Ingestion, anonymisation, feature extraction, model-data matching, preprocessing, and inference stages	–

4.1. Classification and Diagnosis Tasks

SEAE addresses classification and diagnosis by combining the agent’s accumulated disease pattern memory with real-time tool access to EHR data and clinical guidelines. As shown in the Classification and diagnosis block of Table 3, several representative systems demonstrate the effectiveness of persistent agent approaches for clinical classification.

Self-evolving diagnostic agents: A widely cited demonstration of the persistent-memory principle in simulation is Agent Hospital [14], in which doctor agents (with their base LLM frozen) accumulate a retrievable medical case base and experience text by treating synthetic patient agents. After evolving on 20,000 virtual patients per clinical department, internal diagnosis accuracy rises from 77.0% to 95.3% (mean across 21 departments), and the resulting MedAgent-Zero method outperforms strong prompting baselines on MedQA across all tested backbones (e.g., 88.22→92.22% with GPT-4o; +16 pts over Direct prompting on GPT-3.5). The gains arise purely from retrieval-augmented experience accumulation rather than parameter updates, illustrating that “closed-loop self-improvement” in this regime is memory-driven rather than RL-driven. The result is consistent with the SEAE thesis that persistent agents can grow more capable through use, but the effect has not been demonstrated outside simulation.

Collaborative multi-agent reasoning: MedAgents [111] demonstrates that zero-shot multi-agent medical reasoning, where LLM agents collaborate to resolve diagnostic disagreements, outperforms single-agent approaches without any fine-tuning, establishing the multi-agent paradigm as applicable even to resource-constrained clinical deployments.

Adaptive multi-agent diagnosis: MDAgents [5] assigns clinical complexity scores to each case and adaptively assembles collaborative agent groups: simple cases use a single-agent ReAct loop, while complex cases trigger a multi-agent group consultation with specialist agents for different organ systems. This dynamic collaboration mirrors the tiered consultation model in clinical medicine and significantly outperforms fixed-composition agent groups on MedQA.

Dual-memory clinical agents: ClinicalAgents [17] introduces a dual-memory architecture: a mutable Working Memory that tracks the evolving patient state and a static Experience Memory that encodes case templates from prior successful diagnoses. Orchestration is modeled as a Monte Carlo Tree Search (MCTS) process, enabling the agent to explore diagnostic branches and backtrack from dead ends. This MCTS-memory combination handles the non-linear reasoning paths characteristic of complex clinical diagnosis.

4.2. Generation Tasks

Generation tasks combine tool-augmented retrieval with prompt-engineered style guidance; most reviewed systems in this category remain session-bounded, with co-evolving multi-agent setups as the main self-evolving exception. The Generation block of Table 3 summarises representative applications.

Radiology report generation: Bluethgen et al. [3] survey deployment considerations for LLMs in radiology, articulating recommended practices for input design, output verification, and clinical oversight; the article is a guidance/perspective piece rather than a benchmarked system. The Radiologist Copilot [99] complements this on the system side: it organises the reporting workflow as an agentic pipeline with image observation, template selection, draft generation, and a dedicated quality-control stage that performs feedback-driven iterative refinement of the draft report before finalisation.

Self-improving generation via co-evolution: The TissueLab co-evolving agentic AI system [98] applies the continuous self-improvement principle to medical imaging analysis: the agent system plans and generates explainable analysis workflows while allowing clinical experts to visualise intermediate results and refine them in real time, with the system continuously learning from this expert feedback. The reported gains over static-configuration baselines, while encouraging, come from a single benchmark study and have not been validated in prospective clinical deployment.

Adding persistence to frozen-LLM radiology VLMs: R2GenGPT [35] exemplifies a parameter-efficient design pattern that is now widespread in clinical VLMs: a lightweight visual alignment module (Swin-Transformer features projected into the LLM’s embedding space, ≈5M trainable parameters) is delta-tuned while all LLM parameters remain frozen, achieving near state-of-the-art BLEU/ROUGE on IU-Xray and MIMIC-CXR. The same design choice that makes R2GenGPT efficient, however, also makes it static at inference time: once trained, the model cannot persist clinician corrections or accumulate institution-specific reading patterns across deployment, and any drift in imaging protocols requires full re-training of the alignment module. Wrapping a frozen-LLM VLM such as R2GenGPT in a self-evolving harness directly addresses this gap: the agent reads with the frozen backbone, but its session memory tracks accumulated case-level preferences, novelty-triggered skill documents encode emerging reading conventions (for example, a clinician-confirmed correction of a missed micro-nodule’s location), and the RL closed loop queues high-confidence trajectories to retrain the visual alignment module without disturbing the frozen LLM. The resulting system retains the parameter efficiency of frozen-VLM designs while gaining the cross-session learning that purely supervised report-generation systems lack.

4.3. Detection Tasks

Medical detection tasks, identifying abnormal findings in images or text, benefit from self-evolving agents’ ability to accumulate case-specific pattern knowledge across sessions. The Detection block of Table 3 lists the most directly relevant SEAE-instantiating system in this category (MACRO); generic vision foundation models are discussed below as the broader tool layer that detection agents register.

Self-evolving medical imaging detection: MACRO [79] introduces experience-driven self-skill discovery for medical imaging agents: the system identifies effective multi-step tool sequences from verified execution histories, synthesizes them into reusable composite tool primitives, and registers them in a growing behavioral repertoire. A lightweight image-feature memory grounds tool selection in visual-clinical context, while GRPO-like RL training reinforces reliable invocation of discovered composites. Across diverse medical imaging datasets, MACRO consistently improves multi-step orchestration accuracy and cross-domain generalization compared to static-configuration baselines, directly instantiating all three SEAE principles in a radiology-native deployment.
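The idea of promoting recurring tool sequences to composite primitives can be illustrated with a simple frequency count over verified execution histories. This contiguous n-gram mining is a stand-in chosen for brevity and is not MACRO's published algorithm; the function name, tool names, and support threshold are hypothetical.

```python
from collections import Counter


def discover_composites(histories, length=2, min_support=2):
    """Return contiguous tool sequences that recur across verified histories.

    histories: list of tool-name lists, one per verified execution.
    A sequence is counted at most once per history.
    """
    counts = Counter()
    for tools in histories:
        seen = {tuple(tools[i:i + length])
                for i in range(len(tools) - length + 1)}
        counts.update(seen)
    return sorted(seq for seq, n in counts.items() if n >= min_support)


verified_histories = [
    ["load_dicom", "segment_lung", "measure_opacity"],
    ["load_dicom", "segment_lung", "classify"],
    ["load_dicom", "classify"],
]
composites = discover_composites(verified_histories)
```

Here only the `load_dicom` then `segment_lung` pair appears in at least two histories, so it is the single candidate for registration as a composite tool.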

Medical image segmentation: Vision foundation models provide self-evolving agents with powerful image analysis capabilities. The Segment Anything Model (SAM) [112] and its medical adaptation SAM-Med2D [113] enable agents to identify anatomical structures and lesions from diverse imaging modalities, while CLIP-based representations [114] enable cross-modal retrieval linking imaging findings to clinical text. Multimodal biomedical assistants such as LLaVA-Med [115] and MedVersa [116] provide general-purpose biomedical visual question answering, and PathChat [117] extends this capability to histopathology, together forming a multimodal clinical sensing toolkit that persistent agents can register as tools.

4.4. Augmentation Tasks

Data scarcity is among the most persistent challenges in clinical AI. Self-evolving agents address this through automated clinical data augmentation, generating synthetic examples that preserve clinical semantics while expanding training sets. The Augmentation block of Table 3 summarises key approaches.

Simulacrum-based data generation: Agent Hospital [14] introduces a clinical simulacrum, building on the generative agents paradigm of Park et al. [118], where patient agents present symptoms drawn from disease distributions, and doctor agents practice diagnosis and treatment. The simulacrum spans 21 clinical departments with up to 20,000 virtual patients per department, providing richly labeled training material at a scale that would be infeasible with real patients. De-identified real-world clinical databases such as MIMIC-III [119] remain the natural ground-truth anchor for such simulacra, against which disease prevalence and symptom presentations can be calibrated to preserve clinical realism. This approach is particularly valuable for rare disease training data, where real cases are too infrequent to build adequate datasets.

RL trajectory generation: When a self-evolving agent successfully navigates a complex clinical reasoning chain, the full thought-action-observation trajectory is captured, verified by domain-specific checkers, and exported as fine-tuning data. Trajectory-based self-training is generalisable across domains [120,121]: Ge et al. [78] run month-long agent simulations across 1,000 persona-driven synthetic computer environments and extract experiential signals into occupation-specific skills, raising mean rubric score from 61.6% to 68.6% with gains scaling monotonically in the number of training simulations. Transposing this scale to clinical simulacra (synthetic longitudinal EHRs, multi-encounter timelines) is a natural extension of Agent Hospital’s approach, yielding a compounding data advantage in which clinical deployment generates the training data that improves the next model generation.

4.5. Question Answering Tasks

Clinical question answering spans a range from factual medical knowledge retrieval to complex clinical reasoning across multiple patient data sources. The Question answering block of Table 3 summarises applications.

Knowledge graph-augmented agents: KG-Agent [102] integrates a knowledge graph as a persistent tool and trains the agent to actively traverse the graph during multi-hop reasoning; although the original evaluation is on general-domain KG-QA benchmarks rather than medical knowledge graphs, the design pattern (using only 10K tuning samples on LLaMA-7B to outperform larger LLMs) directly motivates equivalent medical-KG-Agent deployments, where clinical concept relationships (symptoms, diagnoses, treatments) could be traversed analogously. Within the medical domain, AMG-RAG [122] explicitly automates medical knowledge graph construction and continuous updating via PubMed and WikiSearch retrieval, bridging the gap between static knowledge graphs and evolving medical literature and achieving 74.1% F1 on MedQA without increasing computational overhead. By our taxonomy AMG-RAG is a stateless retrieval system (SEAE pillars: none) rather than a self-evolving agent: the retrieval source updates over time, but the agent itself does not synthesise reusable skill documents from prior interactions, illustrating how retrieval currency can complement rather than substitute for self-evolving mechanisms.
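To make the multi-hop traversal pattern concrete, the sketch below runs a bounded breadth-first search over a toy graph. In KG-Agent the traversal is driven by the LLM through tool calls, so the fixed search used here is only a stand-in; the graph contents, relation names, and `multi_hop` function are illustrative assumptions and carry no clinical authority.

```python
from collections import deque


def multi_hop(graph, start, target, max_hops=3):
    """Return an alternating node/relation path from start to target, or None.

    graph: dict mapping node -> list of (relation, neighbour) pairs.
    """
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) // 2 >= max_hops:  # path holds nodes and relations
            continue
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [relation, neighbour]))
    return None


# Toy graph, for illustration only.
toy_graph = {
    "cough": [("symptom_of", "pneumonia")],
    "pneumonia": [("treated_with", "antibiotics")],
}
two_hop_path = multi_hop(toy_graph, "cough", "antibiotics")
```

Returning the relations along with the nodes keeps the reasoning chain inspectable, which matters when a traversal result is later cited as evidence for an answer.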

4.6. Inference Tasks

Clinical inference tasks require agents to derive clinical conclusions from incomplete or ambiguous evidence, the most cognitively demanding category. The Inference block of Table 3 summarises representative systems.

Optimal treatment strategy learning (historical RL comparator): The AI Clinician [106] predates LLM-based agents and is included in the Inference block of Table 3 as a non-LLM RL comparator (T4) rather than as a direct SEAE exemplar; it learns optimal sepsis treatment strategies from historical ICU data and offers an early instantiation of the closed-loop improvement principle in a high-stakes clinical context. While its results are encouraging, claims of clinical safety in real intensive care deployments require prospective validation that this work does not provide.

RL-trained diagnostic agents in virtual clinical environments: DiagAgent [107] trains a diagnostic agent through reinforcement learning in DiagGym, a world model built from electronic health records that simulates multi-turn patient interactions. Unlike static instruction-tuning, DiagAgent learns dynamic examination-selection and diagnosis policies through outcome-based feedback, achieving an 11.20% diagnostic accuracy improvement and a 17.58% examination-recommendation F1 boost over 11 state-of-the-art LLMs on DiagBench’s 2,200 physician-validated cases; this makes it the closest existing analogue to the RL closed loop described in §3. DoctorAgent-RL [108] extends this paradigm to multi-turn clinical dialogue: a doctor agent learns to autonomously refine its questioning strategy through multi-agent RL on MTMedDialog, avoiding the pitfall of requiring patients to describe all symptoms upfront and progressively aligning the agent’s questioning behaviour with clinical reasoning.

Agentic clinical-dialogue taxonomy: Reinventing Clinical Dialogue [105] is a recent survey that proposes a first-principles taxonomy of agentic paradigms for LLM-enabled clinical communication, organising the design space into four archetypes spanning strategic planning, memory management, action execution, collaboration, and evolution. Although it is not itself a benchmarked system, it usefully maps where existing clinical-dialogue agents fall along the persistence/skill/RL axes used in this review.

4.7. Recap: Self-Evolving Agent Engineering Across Clinical Tasks

Across the six task categories, the surviving row-instances of benchmarked agent systems (excluding two survey rows, marked ^§ in the Question answering and Inference blocks of Table 3, and one general-domain methodological reference, marked ^⋄) divide into 3 fully self-evolving row-instances (TissueLab/Co-evolving AI, MACRO, and Hermes RL Trajectories^† as a framework-level reference), 4 $P + S$ row-instances without RL (Agent Hospital in classification and in augmentation, where the base LLM is frozen and improvement is retrieval-driven; ClinicalAgents in classification and inference), 3 RL-only systems (AI Clinician^‡, DiagAgent, DoctorAgent-RL, all in inference), and the remaining rows as session-bounded baselines. Of the benchmarked rows, the large majority are peer-reviewed papers or preprints (tiers T1/T2 in §1.2); 1 is framework documentation (T3, Hermes RL Trajectories), and 1 is a historical non-LLM RL comparator (T4, AI Clinician). Restricting attention to T1/T2 LLM-agent evidence, roughly one-third of row-instances instantiate at least one self-evolving pillar; the rest serve as stateless comparison baselines. The cross-cutting findings are:

1. Persistence compounds performance (in simulation/benchmark): Where self-evolving systems are present, accumulated experience can improve performance over session-bounded baselines, though this advantage is currently demonstrated in a minority of reviewed papers and primarily in simulated or benchmark settings. Agent Hospital’s reported diagnosis-accuracy gain (77.0%→95.3% on its internal 21-department simulacrum after 20K-patient evolution, with a smaller but consistent +4 pt MedQA gain on GPT-4o) and co-evolving AI’s reported imaging-quality gains over a static baseline are the clearest examples; prospective clinical validation remains open.
2. Tool integration multiplies capability: Agent systems with well-engineered clinical tool access (RadioRAG’s real-time retrieval, KG-Agent’s knowledge graph traversal) outperform larger memoryless models, confirming that tool engineering is as important as model scaling.
3. Multi-agent coordination for complex cases: Complex clinical tasks consistently benefit from multi-agent architectures (MDAgents, ClinicalAgents) that decompose the diagnostic problem across specialist agents.
4. Self-generated training data closes the data gap: Agent Hospital’s simulacrum and self-evolving RL trajectory generation provide clinically valuable training data without accessing real patients, addressing the fundamental data scarcity challenge of medical AI.

The convergence across diverse architectures and clinical domains supports the paradigm, but the prospective clinical validation needed to confirm it remains an open challenge addressed in Section 5.

5. Challenges and Future Directions

This section synthesises lessons from the surveyed literature (Section 4). Deploying persistent, self-evolving agents in clinical settings introduces challenges that are more severe than those encountered in general-purpose agent deployments.

5.1. Challenges in Self-Evolving Agent Engineering for Healthcare

The key challenges are summarised in Table 4.

Hallucination and clinical safety. LLM hallucination is a well-documented phenomenon in natural language generation [123], and hallucination rates of 8–15% reported across current systems in medical imaging contexts [124] are unacceptable in safety-critical decisions. The Med-HALT benchmark [125] specifically targets medical hallucination, identifying failure modes including false reasoning chains and confabulated citations. Persistent agents introduce a compounding risk: a hallucinated fact stored in MEMORY.md propagates across future sessions, potentially corrupting downstream reasoning. SEAE addresses this through multi-layer validation: clinical verifiers in the Atropos RL pipeline reject trajectories containing factual errors, agent checkpoints enable cross-agent fact verification, and memory injection scanning prevents corrupted entries from persisting. Constitutional AI approaches [126] provide a principled framework for embedding clinical safety constraints into agent self-revision loops. Future work should develop clinical hallucination benchmarks specific to persistent agent systems.

Data privacy and HIPAA compliance. Persistent memory accumulates sensitive clinical information that may constitute PHI under HIPAA regulations. The foundational challenges of deploying LLMs and foundation models on electronic health records have been critically documented [127], highlighting that EHR-specific issues including temporal reasoning, cross-institutional variation, and PHI handling require careful engineering beyond general LLM capabilities. The three-layer memory architecture presents distinct privacy challenges at each layer: MEMORY.md may contain patient-specific clinical patterns; skill documents may encode de-identified but re-identifiable case experiences; and the SQLite session index may retain verbatim clinical dialogue. A radiology-specific issue compounds this: DICOM image files carry PHI not only in pixel data (burned-in patient identifiers on the image itself) but also in dozens of structured metadata fields scattered across the DICOM header, including private vendor tags whose semantics are not standardised across manufacturers. Compliance with DICOM Supplement 55 / PS3.15 Basic Confidentiality Profile [36] requires not just stripping the eighteen HIPAA Safe Harbor identifiers but also auditing private tags and pixel regions where text overlays may have been added by the modality, and any agent that persists imaging context in long-term memory must enforce equivalent de-identification on every cached image excerpt or metadata snippet before the memory layer accepts it. Caging the Agents [128] demonstrates a production deployment strategy using kernel-level workload isolation (gVisor on Kubernetes), credential proxy sidecars, and network egress policies to contain PHI within hospital infrastructure. Local-first deployment, running the agent server within the hospital firewall with no external API calls, provides the strongest privacy guarantee but requires significant on-premise compute investment.
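
The requirement that the memory layer enforce de-identification before accepting imaging context can be sketched as an allowlist gate. This is a hedged illustration only: the metadata is modelled as a plain dictionary rather than a parsed DICOM object, the allowlist contents are an assumption that a real deployment would need to review against the PS3.15 profile, and passing this gate does not by itself establish HIPAA or PS3.15 compliance. The sketch does rely on two DICOM facts: private vendor tags are identified by group and element numbers rather than standard keywords, and the `BurnedInAnnotation` attribute takes the values `YES` or `NO`.

```python
# Hypothetical allowlist of acquisition fields considered safe to cache.
ALLOWED_KEYWORDS = {"Modality", "SliceThickness", "KVP", "Manufacturer"}


def gate_for_memory(metadata: dict) -> dict:
    """Return only allowlisted fields, or raise if pixel text cannot be ruled out.

    Keys are either standard keyword strings or (group, element) tuples for
    private vendor tags; anything not on the allowlist is dropped, so unknown
    private tags never reach persistent memory.
    """
    # Fail closed: a missing flag is treated the same as an explicit "YES".
    if metadata.get("BurnedInAnnotation", "YES") != "NO":
        raise ValueError("pixel data may contain burned-in text; refusing to cache")
    return {k: v for k, v in metadata.items() if k in ALLOWED_KEYWORDS}
```

An allowlist is used instead of a blocklist because the text notes that private tag semantics are not standardised across manufacturers, so the set of fields to strip cannot be enumerated in advance.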

Domain adaptation and transferability. Medical terminology and clinical protocols vary substantially across specialties, institutions, and geographies, and the imaging domain adds a second axis of variation that is largely invisible to a non-imaging agent: image-level shifts driven by scanner manufacturer (e.g., GE versus Siemens versus Philips), acquisition protocol (slice thickness, kVp/mAs, reconstruction kernel), and processing pipeline (vendor-specific DICOM post-processing, lossy compression at PACS export). Zech et al. [37] provide the canonical demonstration of this problem: a pneumonia-detection CNN trained on combined NIH and Mount Sinai chest radiographs reached AUC 0.931 internally but only AUC 0.815 when tested at Indiana University, and a separate CNN could identify the source hospital with 99.95–99.98% accuracy from the radiograph alone, indicating that hospital-system signature was a larger source of variance than the pathology under study. For a self-evolving radiology agent, this has two operational consequences. First, a skill document derived from one institution’s reading practice may be miscalibrated for another’s image distribution even if the underlying clinical concept transfers, so skill documents should carry an explicit “derived-from” provenance tag (institution, scanner family, acquisition parameters) and the retrieval layer should down-weight skills whose provenance does not match the current case. Second, the performance trigger described in §3 provides a natural cross-vendor canary: a systematic drop in verifier score that correlates with scanner family, rather than with case difficulty, is exactly the signal of an unannounced PACS or modality upgrade, and surfaces a domain-shift incident as a skill-deficiency event rather than as silent performance degradation. 
SEAE therefore addresses domain shift not only through specialty-specific skill libraries, with institution-specific guideline files registered as context providers via MCP, but also through provenance-aware skill metadata that the trigger machinery can monitor. A recent empirical exemplar of online deployment-time adaptation in medical imaging is OAIMS [40], which updates an interactive segmentation model from clinician click feedback either after each case (post-interaction) or incrementally within a case (mid-interaction) using user-refined outputs as pseudo-ground-truth, explicitly to mitigate distribution shift introduced by unseen imaging modalities and pathologies at the deployment site. From an SEAE perspective, OAIMS operationalises the R-pillar (closed-loop self-improvement) directly on model weights rather than on retrieved memory, providing a clean counterpoint to the memory-only adaptation strategies surveyed in §4.
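
The two operational consequences above (provenance-aware down-weighting and a per-scanner-family canary) can be sketched together. Everything named here is an assumption for illustration: the three `Provenance` fields, the linear down-weighting rule, and the `floor` parameter are not taken from any surveyed system.

```python
from collections import defaultdict
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Provenance:
    institution: str
    scanner_family: str
    protocol: str


def provenance_match(skill: Provenance, case: Provenance) -> float:
    """Fraction of provenance fields on which skill and case agree, in [0, 1]."""
    names = [f.name for f in fields(Provenance)]
    agree = sum(getattr(skill, n) == getattr(case, n) for n in names)
    return agree / len(names)


def adjusted_score(similarity: float, skill: Provenance, case: Provenance,
                   floor: float = 0.5) -> float:
    """Scale retrieval similarity by provenance agreement.

    `floor` (hypothetical) keeps fully mismatched skills retrievable at a
    reduced weight instead of excluding them outright.
    """
    return similarity * (floor + (1.0 - floor) * provenance_match(skill, case))


def mean_score_by_family(records):
    """Average verifier score per scanner family from (family, score) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for family, score in records:
        totals[family][0] += score
        totals[family][1] += 1
    return {family: s / n for family, (s, n) in totals.items()}
```

A gap between families in the output of `mean_score_by_family` that is not explained by case difficulty is the kind of signal the text describes as a cross-vendor canary; deciding when such a gap is significant is left open here.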

Interpretability and clinical trust. In clinical medicine, an unexplained AI recommendation is clinically unusable: clinicians require audit trails that connect recommendations to evidence. Self-evolving agents maintain full reasoning traces through the ReAct loop’s thought-action-observation log, providing an inherently interpretable record of the agent’s decision-making process. The verification-layer mechanisms introduced in Section 3 (component-level versioning, trajectory distillation, and falsifiable predictions paired with each update) directly support this requirement by making every skill update reversible, attributable, and falsifiable. However, current implementations log reasoning traces only at the session level; future work should develop standardised clinical audit formats that enable retrospective review of multi-session reasoning chains for regulatory compliance.

Memory security and injection. Persistent memory is a novel attack surface. MINJA [129] shows that an adversary can compromise an LLM agent’s memory through normal interactions alone, achieving >98% injection and >70% end-to-end attack success on GPT-4/4o; Devarangadi et al. [130] validate these findings on clinical data and propose input/output moderation with trust scoring and temporal-decay sanitisation. Cross-session attacks [131] split malicious payloads across sessions to evade memoryless guardrails. Lin et al. [132] organise threats across six lifecycle phases (Write, Store, Retrieve, Execute, Share, Forget/Rollback) and introduce “mnemonic sovereignty” as the governance standard clinical deployments must meet; the harness memory security layer of the Hermes reference (footnote 2) addresses only the Write phase, leaving the rest as open challenges.

Autonomy–safety tension and the autonomy spectrum. Clinical agent designs occupy a spectrum from human-directed tool-augmented assistants (the “Vibe Medicine” paradigm of Wu et al. [33]), to semi-autonomous configurations with human review at decision boundaries, to fully autonomous self-evolving agents that update their own skills and parameters between sessions. SEAE sits at the high-autonomy end and is not universally preferable: an agent that autonomously creates and modifies its own diagnostic protocols can propagate systematic errors that compound across sessions in a way single-session human-directed systems do not. The case for high autonomy rests on regimes where between-session continuity is the binding constraint (longitudinal radiology follow-up, chronic disease management, and rare-event accumulation), in which the human-in-the-loop bottleneck of lower-autonomy designs limits the very capability the system is meant to provide. Reconciling these design choices requires graduated autonomy mechanisms in which agents operate autonomously on routine tasks but escalate to human review for novel diagnoses, high-stakes decisions, or whenever verifier confidence falls below a critical threshold; the appropriate threshold is itself a deployment parameter rather than a fixed architectural property.
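
The escalation rule described above reduces to a small routing function. The sketch below is one possible reading under stated assumptions: the `Route` names, the boolean flags, and the default threshold of 0.9 are hypothetical, and the threshold is exposed as an argument precisely because the text treats it as a deployment parameter.

```python
from enum import Enum


class Route(Enum):
    AUTONOMOUS = "autonomous"
    HUMAN_REVIEW = "human_review"


def route_task(verifier_confidence: float,
               is_novel_diagnosis: bool,
               is_high_stakes: bool,
               threshold: float = 0.9) -> Route:
    """Escalate on novelty, high stakes, or low verifier confidence."""
    if is_novel_diagnosis or is_high_stakes:
        return Route.HUMAN_REVIEW
    if verifier_confidence < threshold:
        return Route.HUMAN_REVIEW
    return Route.AUTONOMOUS
```

Novelty and stakes are checked before confidence so that a highly confident verifier can never override the escalation of a novel or high-stakes case.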

Memory fragmentation and skill library scalability. When a self-evolving agent processes tens of thousands of clinical cases over months or years of deployment, the skill library may grow to hundreds or thousands of entries. At this scale, the L0 skill lookup (loading all skill names and descriptions into the context window) becomes a bottleneck: retrieval precision degrades as semantically similar skills compete for activation, and the token overhead of the skill index may exceed practical limits. The consolidation trigger (merging skills with similarity >0.85) mitigates this partially, but a more principled approach to skill library management (including hierarchical skill taxonomies, specialty-based partitioning, and adaptive retrieval strategies) is needed for long-term clinical deployments. A complementary mechanism, recently demonstrated in productivity-domain agents at scale [78], is distillation-driven reset: once a skill’s content has been successfully internalized by the backbone via RL training (verified by retained performance on held-out cases when the explicit skill document is withheld), the skill document is retired from the active library, and library capacity is freed for new skills. This treats the skill library as a transient learning buffer rather than an indefinitely accumulating store, and provides a principled exit path from the unbounded-growth regime that the consolidation trigger alone cannot fully resolve.
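
The consolidation trigger (merging skills with similarity >0.85) can be sketched as a greedy grouping over skill embeddings. The choice of cosine similarity, the greedy first-match strategy, and the function names are assumptions made for illustration; the source specifies only the 0.85 threshold.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def consolidation_groups(embeddings: dict[str, list[float]],
                         threshold: float = 0.85) -> list[list[str]]:
    """Group skills whose similarity to a group's first member exceeds threshold.

    Returns only groups with at least two members, i.e. merge candidates.
    """
    groups: list[list[str]] = []
    for name, vec in embeddings.items():
        for group in groups:
            if cosine(vec, embeddings[group[0]]) > threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no similar group found: start a new one
    return [g for g in groups if len(g) > 1]
```

Comparing against only the first member of each group keeps the pass linear in the number of groups, at the cost of order sensitivity; a hierarchical taxonomy of the kind the text calls for would avoid that dependence.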

Catastrophic forgetting in RL closed-loop training. The RL closed loop continuously fine-tunes the backbone model on new clinical trajectories. A well-documented risk in continual learning is catastrophic forgetting [133,134]: fine-tuning on radiology-specific trajectories may degrade the model’s general medical reasoning or text comprehension capabilities. SEAE must incorporate continual learning safeguards such as elastic weight consolidation (EWC), experience replay buffers that periodically re-expose the model to diverse task trajectories, or modular LoRA adapters that isolate specialty-specific updates from the general backbone. Monitoring for capability drift across clinical domains during RL training is an open engineering challenge.
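
For reference, the elastic weight consolidation safeguard mentioned above is usually written as a quadratic penalty on parameter drift. The form below is the standard one from the continual-learning literature and is not specific to SEAE; the symbols are introduced here rather than taken from this review.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{new}}(\theta)
  \;+\; \frac{\lambda}{2} \sum_{i} F_i \,\bigl(\theta_i - \theta_i^{*}\bigr)^{2}
```

Here $\mathcal{L}_{\text{new}}$ is the loss on the new (for example, radiology-specific) trajectories, $\theta^{*}$ are the parameters before the new fine-tuning round, $F_i$ is the diagonal Fisher information of parameter $i$ estimated on the earlier tasks, and $\lambda$ sets how strongly earlier capabilities are protected. Parameters that mattered for earlier tasks (large $F_i$) are held close to their previous values, while unimportant ones remain free to adapt.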

Over-reliance and clinical de-skilling. A challenge that intensifies as agent autonomy increases is epistemological rather than purely technical: when clinicians repeatedly accept fluent, well-structured agent outputs without independent verification, their own analytical capacity for the underlying reasoning may erode over time. Automation bias, the tendency to over-rely on automated decision support and to under-attend to disconfirming evidence, is well documented in clinical decision support systems [135,136] and becomes more acute, not less, as outputs grow more polished. Self-evolving agents amplify this risk in two specific ways: (1) accumulated skill documents make outputs increasingly confident and stylistically consistent, which can be misread as increasing correctness; and (2) the autonomous skill-evolution loop removes the natural friction points (manual prompt revision, explicit retrieval inspection) at which clinicians historically encountered the agent’s reasoning. Mitigations include mandatory clinician review for high-stakes outputs, interface designs that surface the agent’s full reasoning chain rather than only its conclusion, periodic blind-comparison audits in which clinician judgments are recorded before agent recommendations are revealed, and training programs that explicitly target critical evaluation of agent outputs. This risk is particularly acute for SEAE deployments precisely because their value proposition, namely continuous and low-friction skill accumulation, directly trades against the friction that historically preserved clinician engagement with the reasoning process.

The engineering- vs. clinical-autonomy boundary. The technical autonomy described throughout this review, namely autonomous skill synthesis, self-managed memory, and closed-loop RL fine-tuning, must not be conflated with clinical decision autonomy. Even a fully self-evolving SEAE deployment is intended to surface every clinically actionable output as a second opinion: flagged for physician review, accompanied by retrieval provenance and uncertainty estimates, and never auto-acted on. Regulatory frameworks reinforce this boundary: the FDA’s risk-tiered guidance on Software as a Medical Device and on AI/ML-enabled functions, the EU MDR Class IIb/III device regime, and PMDA’s evolving guidance on continuously-learning AI all expect that diagnostic and therapeutic decisions remain under qualified-physician adjudication, irrespective of how autonomously the underlying agent manages its own skills, memory, or training. A higher engineering-autonomy agent does not entail a higher clinical-autonomy authorisation; the two axes must be co-engineered but kept conceptually separate. Concretely, this implies that SEAE harnesses should expose interface affordances for physician override at every clinically actionable step, log the provenance of each recommendation back to the specific skill document and trajectory that produced it, and treat any drift toward clinical-autonomy framing in product or research communications as itself a deployment risk to be managed.

Scalability of multi-agent orchestration. Multi-agent clinical systems (MDAgents, ClinicalAgents, MedAide) demonstrate strong diagnostic performance but introduce significant computational overhead. A 6-agent consultation for a complex differential diagnosis may require 30–50 LLM inference calls, increasing latency beyond clinical tolerability for acute care settings. SEAE addresses this through subagent pooling (reusing agent contexts across queries), cached skill retrieval (loading skill descriptions rather than full documents), and background RL training that does not interrupt real-time clinical inference.

Regulatory compliance. Medical AI deployment is subject to regulatory approval requirements (FDA 510(k) in the US, CE marking in Europe) that require clinical validation across diverse patient populations. Persistent, self-evolving agents introduce a regulatory challenge not anticipated by current frameworks: an agent that improves through use is not a fixed software artifact but a continuously changing system. Treating agent updates as release engineering artifacts with regression-aware gating [137] offers one promising direction. Future regulatory frameworks must address validation of learning agents, including requirements for continuous monitoring, performance drift detection, and rollback mechanisms.
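
One way to read regression-aware gating is as a release check that compares a candidate agent version against the deployed baseline on a fixed regression suite. The sketch below is an assumption-laden illustration and not the method of [137]: the metric names, the absolute tolerance, and the rule that a missing metric counts as a regression are all hypothetical.

```python
def gate_update(baseline: dict[str, float],
                candidate: dict[str, float],
                tolerance: float = 0.01):
    """Accept a candidate only if no baseline metric drops by more than tolerance.

    Returns (accepted, regressions), where regressions maps each failing
    metric name to its (baseline_value, candidate_value) pair. A metric absent
    from the candidate is treated as a regression so that coverage cannot
    silently shrink between releases.
    """
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions[name] = (base_value, new_value)
    return (not regressions, regressions)
```

When `accepted` is false the deployed version is simply retained, which is the rollback behaviour the paragraph asks future frameworks to require; the returned `regressions` mapping gives the audit record of why the update was blocked.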

5.2. Future Directions

The current research directions in self-evolving agent engineering for healthcare are diverse, with active work across five principal areas:

Multi-modal agent integration. Healthcare data is inherently multi-modal (imaging, text, genomics, wearable sensor data), and radiology is where this challenge is most acute: DICOM metadata, structured radiology reports, and longitudinal comparison imaging must all be integrated into a coherent self-evolving memory. Current self-evolving agent deployments primarily handle text and structured data; extending persistent skill synthesis to DICOM-grounded imaging workflows, longitudinal follow-up imaging, and multi-modality radiology (CT/MRI/PET) is the most clinically urgent frontier [98,138]. Towards-generalist biomedical AI systems [139] and personal health LLMs [140] demonstrate the feasibility of unified multi-modal clinical reasoning, pointing toward future self-evolving agents that accumulate cross-modal clinical skills. Structural biology tools such as AlphaFold [141] exemplify the kind of specialized domain tools that self-evolving agents could integrate via MCP for genomics and drug discovery applications.

Federated agent training. Federated learning approaches [142] enable model training across distributed hospital data without centralizing patient records. FedAgentBench [143] is the first benchmark for LLM-agent-driven federated medical image analysis: server and client agents autonomously coordinate 40 FL algorithms across 201 curated datasets spanning six imaging modalities. It finds that frontier models (GPT-4.1, DeepSeek V3) can automate many FL pipeline stages but that complex interdependent tasks remain challenging, which pinpoints the coordination bottlenecks that future federated SEAE pipelines must address. Integrating federated RL trajectory collection with the Atropos training pipeline would enable cross-institutional agent improvement while maintaining data locality.

Multi-agent co-evolution. While the self-evolving principles described in this review focus primarily on single-agent loops, the next frontier lies in multi-agent co-evolution. The CORAL framework [144] demonstrates that multiple long-running agents exploring the same problem through shared persistent memory, asynchronous execution, and mutual critique can achieve 3–10× higher improvement rates than fixed evolutionary search. In a clinical context, an ecosystem of specialized agents (e.g., a radiology sub-agent and an oncology sub-agent) could continuously refine their shared skill libraries and collaborative protocols without central human orchestration, extending the self-evolving paradigm from individual agent improvement to collective clinical intelligence.

Clinical agent evaluation frameworks. Robust evaluation of persistent agents requires benchmarks that assess cross-session memory retention, skill generalisation, and improvement over time, dimensions not captured by existing medical QA benchmarks such as MedQA [90], PubMedQA [91], and MedMCQA [145]. Two recent benchmarks target this gap: SkillLearnBench [146] targets continual skill learning and finds that “self-feedback alone induces recursive drift,” motivating cross-model LLM-as-judge evaluation and rejection-sampling pipelines as safeguards; AMA-Bench [147] evaluates long-horizon memory and finds that current memory systems lack causality and are constrained by lossy similarity-based retrieval, the failure modes that the three-tier memory architecture (with FTS session index and self-managed replacement policies) is designed to address. Beyond benchmark accuracy, clinical decision support frameworks [22,148] highlight the lack of standardised evaluation protocols for reliability and safety under real clinical workflow conditions as a critical gap.

Agent safety and alignment. The Caging the Agents framework [128] demonstrates production-grade safety for clinical agent deployment but was developed for a specific institutional context. Generalizable clinical agent safety standards, covering prompt integrity, credential isolation, network containment, and audit logging, are needed for broad adoption.

Translational opportunities. If the technical and regulatory frontiers above are addressed, SEAE opens several translational opportunities: continuous learning systems that absorb evolving clinical guidelines in real time and address AI staleness in fast-moving domains such as oncology and infectious disease; personalised clinical AI companions tailored to individual clinician workflows through accumulated USER-level preferences and specialty-specific skills; cross-institutional skill sharing via de-identified agentskills.io documents, enabling a clinical AI commons; and automated guideline compliance through institutional guidelines and drug formularies embedded as persistent skill documents and MCP-registered tools. Realising these opportunities will require as much work on clinical governance and regulatory frameworks as on the technical capabilities themselves.

6. Conclusions

6.1. Summary

The main finding of this review is that Self-Evolving Agent Engineering (SEAE) represents a clear shift from session-bounded LLM deployment: where implemented, clinical agents that accumulate knowledge, synthesise experience into reusable skills, and improve through a closed RL loop can grow more capable through use, much like the development of clinical expertise. This property is absent from prompt engineering and RAG approaches as typically deployed, and it is what defines the SEAE paradigm.

6.2. Limitations

Paradigm framing. SEAE is not yet a mature or widely standardised paradigm in healthcare: while the Agent = Model + Harness decomposition is increasingly adopted in industry (see footnote 1), the self-evolving extension is still emergent and most surveyed systems were developed independently rather than under this label. This review should be read as a synthesis of a converging design space rather than a survey of standardised implementations.

Partial instantiation of principles. Because fully realized persistent clinical agents remain rare, many surveyed systems instantiate only one or two of the three core principles (persistent memory, skill synthesis, RL closed loop), rather than all three simultaneously. Their inclusion is intentional: the purpose of this review is to analyze how these principles are emerging across healthcare agent research, not to imply that the field has already reached a complete architectural consensus.

Evidence base and deployment reality. Much of the evidence for compounding self-improvement comes from simulation environments or benchmark settings rather than real healthcare deployment. Questions of prospective clinical validation, workflow integration, regulatory approval, and governance of continuously evolving systems are not resolved by the present review; rather, they define the translational barriers that future work must address before persistent clinical agents can be safely deployed.

6.3. Key Contributions

This review makes two distinctive contributions: (i) an application taxonomy (Table 3) that classifies 23 representative clinical systems across six task categories (22 benchmarked systems and baselines, including both agentic systems and explicitly non-agentic session-bounded comparators, plus the historical non-LLM AI Clinician comparator) using the P/S/R pillar tags, drawn from a broader survey of 148 references (with 4 non-academic sources moved to footnotes); and (ii) a challenge map (Table 4) with eleven principal deployment challenges and concrete solution directions.

6.4. Recommendations for Future Research

We recommend four priorities for future research: (1) longitudinal clinical validation (six-month to one-year prospective trials tracking diagnostic accuracy and error rates); (2) privacy-preserving persistent memory with formal PHI protection guarantees; (3) standardised persistent-agent benchmarks for cross-session retention, skill generalisation, and improvement trajectory, including controlled memory-ablation studies, longer-horizon runs (≥50 tasks) that exercise the consolidation and contradiction triggers, and radiologist-validated multimodal benchmarks; and (4) regulatory frameworks for continuously improving clinical agents. Whether SEAE ultimately delivers on its clinical promise depends on progress along these four axes.

Author Contributions

Dengzhe Hou: Conceptualization; Methodology; Formal analysis; Investigation; Data curation; Validation; Visualization; Project administration; Writing – original draft; Writing – review & editing. Zihao Wu: Methodology; Investigation; Writing – review & editing. Yuwen Zeng: Investigation; Writing – review & editing. Lingyu Jiang: Investigation; Writing – review & editing. Fangzhou Lin: Methodology; Validation; Supervision; Writing – review & editing. Kazunori D Yamada: Conceptualization; Resources; Supervision; Project administration; Writing – review & editing.

Acknowledgments

The authors thank colleagues at the participating institutions for helpful discussions on early drafts of this manuscript.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Shortliffe, E.H.; Davis, R.; Axline, S.G.; Buchanan, B.G.; Green, C.C.; Cohen, S.N. Computer-based consultations in clinical therapeutics: Explanation and rule acquisition capabilities of the MYCIN system. Computers and Biomedical Research 1975, 8, 303–320.
2. Rajpurkar, P.; Irvin, J.; Ball, K.; Zhu, R.; Yang, B.; et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning, 2017, [1711.05225].
3. Bluethgen, C.; Veen, D.V.; Zakka, C.; Link, K.; Fanous, A.; Daneshjou, R.; Frauenfelder, T.; Langlotz, C.; Gatidis, S.; Chaudhari, A. Best Practices for Large Language Models in Radiology, 2024, [2412.01233].
4. Wang, W.; Ma, Z.; Wang, Z.; Wu, C.; Ji, J.; Chen, W.; Li, X.; Yuan, Y. A survey of LLM-based agents in medicine: How far are we from Baymax? Findings of the Association for Computational Linguistics: ACL 2025 2025, pp. 10345–10359.
5. Kim, Y.; Park, C.; Jeong, H.; Chan, Y.S.; Xu, X.; McDuff, D.; Lee, H.; Ghassemi, M.; Breazeal, C.; Park, H.W. MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making, 2024, [2404.15155].
6. OpenAI. GPT-4 Technical Report, 2023, [2303.08774].
7. Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. Anthropic model card. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024.
8. Gemini Team; Anil, R.; Borgeaud, S.; Alayrac, J.B.; et al. Gemini: A Family of Highly Capable Multimodal Models, 2023, [2312.11805].
9. Wang, J.; Shi, E.; Yu, S.; Wu, Z.; Hu, H.; Ma, C.; Dai, H.; Yang, Q.; Kang, Y.; Wu, J.; et al. Prompt engineering for healthcare: Methodologies and applications. Meta-Radiology 2025, p. 100190.
10. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022, Vol. 35, pp. 24824–24837.
11. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2020, Vol. 33, pp. 1877–1901.
12. Arasteh, S.T.; Lotfinia, M.; Bressem, K.; Siepmann, R.; Adams, L.; Ferber, D.; Kuhl, C.; Kather, J.N.; Nebelung, S.; Truhn, D. RadioRAG: Online Retrieval-Augmented Generation for Radiology Question Answering, 2024, [2407.15621].
13. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey, 2024, [2312.10997].
14. Li, J.; Lai, Y.; Li, W.; Ren, J.; Zhang, M.; Kang, X.; Wang, S.; Li, P.; Zhang, Y.Q.; Ma, W.; et al. Agent Hospital: A Simulacrum of Hospital with Evolvable Medical Agents, 2024, [2405.02957].
15. Packer, C.; Wooders, S.; Lin, K.; Fang, V.; Patil, S.G.; Stoica, I.; Gonzalez, J.E. MemGPT: Towards LLMs as Operating Systems, 2023.
16. Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; Wang, Y. MemoryBank: Enhancing Large Language Models with Long-Term Memory, 2023.
17. Ge, Z.; Li, H.; Wang, Y.; Hu, N.; Zhang, C.J.; Li, Q. ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory, 2026, [2603.26182].
18. Du, P. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers, 2026, [2603.07670].
19. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023, Vol. 36, pp. 8634–8652.
20. Topol, E.J. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again; Basic Books, 2019.
21. Rajpurkar, P.; Chen, E.; Banerjee, O.; Topol, E.J. AI in health and medicine. Nature Medicine 2022, 28, 31–38.
22. Gorenshtein, A.; Omar, M.; Glicksberg, B.S.; Nadkarni, G.N.; Klang, E. AI agents in clinical medicine: A systematic review. medRxiv 2025.
23. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model Based Autonomous Agents, 2023.
Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large Language Model Based Multi-Agents: A Survey of Progress and Challenges, 2024. [CrossRef]
He, K.; Mao, R.; Lin, Q.; Ruan, Y.; Lan, X.; Feng, M.; Cambria, E. A Survey of Large Language Models for Healthcare: From Data, Technology, and Applications to Accountability and Ethics, 2023. [CrossRef]
Khosravi, B.; Rouzrokh, P.; Akinci D’Antonoli, T.; Moassefi, M.; Faghani, S.; Mansuri, A.; Bressem, K.; Tejani, A.; Gichoya, J. Agentic AI in Radiology: Evolution from Large Language Models to Future Clinical Integration. Radiology: Artificial Intelligence 2026, 8, e250651. [CrossRef]
Koçak, B.; Meşe, İ. AI agents in radiology: Toward autonomous and adaptive intelligence. Diagnostic and Interventional Radiology 2025. [CrossRef]
Bluethgen, C.; Veen, D.V.; Truhn, D.; Kather, J.N.; Moor, M.; Polacin, M.; Chaudhari, A.; Frauenfelder, T.; Langlotz, C.P.; Krauthammer, M.; et al. Agentic Systems in Radiology: Design, Applications, Evaluation, and Challenges, 2025, [2510.09404].
Collaco, B.G.; Haider, S.A.; Prabha, S.; Gomez-Cabello, C.A.; Genovese, A.; Wood, N.G.; Bagaria, S.P.; Gopala, N.; Tao, C.; Forte, A.J. The role of agentic artificial intelligence in healthcare: A scoping review. npj Digital Medicine 2026, 9, 345. [CrossRef]
Gao, H.A.; et al. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence, 2025, [2507.21046].
Fang, J.; et al. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems, 2025, [2508.07407].
Jiang, P.; et al. Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills, 2025, [2512.16301].
Wu, Z.; Xu, S.; Chen, B.; Wan, S.; Li, Y.; Ruan, W.; Lyu, Y.; Li, S.; Zhu, D.; Liu, T.; et al. Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work, 2026, [2604.23674].
Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Annals of Internal Medicine 2018, 169, 467–473. [CrossRef]
Wang, Z.; Liu, L.; Wang, L.; Zhou, L. R2GenGPT: Radiology Report Generation with Frozen LLMs. Meta-Radiology 2023, 1, 100033. [CrossRef]
Moore, S.M.; Maffitt, D.R.; Smith, K.E.; Kirby, J.S.; Clark, K.W.; Freymann, J.B.; Vendt, B.A.; Tarbox, L.R.; Prior, F.W. De-identification of Medical Images with Retention of Scientific Research Value. RadioGraphics 2015, 35, 727–735. [CrossRef]
Zech, J.R.; Badgeley, M.A.; Liu, M.; Costa, A.B.; Titano, J.J.; Oermann, E.K. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLOS Medicine 2018, 15, e1002683. [CrossRef]
Brodeur, P.G.; Buckley, T.A.; Kanjee, Z.; Goh, E.; Ling, E.B.; Jain, P.; Cabral, S.; Abdulnour, R.E.; Haimovich, A.D.; Freed, J.A.; et al. Performance of a large language model on the reasoning tasks of a physician. Science 2026, 392, 524–527. [CrossRef]
Hou, D.; Jiang, L.; Li, D.; Li, Z.; Lin, F.; Yamada, K.D. WMF-AM: Probing LLM Working Memory via Depth-Parameterized Cumulative State Tracking, 2026, [2603.27343].
Xu, W.; Liang, Z.; Anthony, H.; Ibrahim, Y.; Cohen, F.; Yang, G.; Kamnitsas, K. You Point, I Learn: Online Adaptation of Interactive Segmentation Models for Handling Distribution Shifts in Medical Imaging. International Conference on Learning Representations (ICLR), 2026.
Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The Rise and Potential of Large Language Model Based Agents: A Survey, 2023. [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017, Vol. 30, pp. 5998–6008. [CrossRef]
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; et al. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2022, Vol. 35, pp. 27730–27744.
Nori, H.; King, N.; McKinney, S.M.; Carignan, D.; Horvitz, E. Capabilities of GPT-4 on Medical Challenge Problems, 2023. [CrossRef]
Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; et al. Towards Expert-Level Medical Question Answering with Large Language Models, 2023. [CrossRef]
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; et al. LLaMA: Open and Efficient Foundation Language Models, 2023, [2302.13971].
Grattafiori, A.; Dubey, A.; Jauhri, A.; et al. The Llama 3 Herd of Models, 2024. [CrossRef]
Teknium, R.; Quesnelle, J.; Guang, C. Hermes 3 Technical Report, 2024. [CrossRef]
Teknium, R.; Jin, R.; Suphavadeeprasit, J.; Mahan, D.; Quesnelle, J.; Li, J.; Guang, C.; Sands, S.; Malhotra, K. Hermes 4 Technical Report, 2025, [2508.18255].
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019; pp. 4171–4186. [CrossRef]
Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining, 2019. [CrossRef]
Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, 2021. [CrossRef]
Huang, K.; Altosaar, J.; Ranganath, R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission, 2019. [CrossRef]
Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways, 2022. [CrossRef]
Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large Language Models Encode Clinical Knowledge, 2022. [CrossRef]
Luo, R.; Sun, L.; Xia, Y.; Qin, T.; Zhang, S.; Poon, H.; Liu, T.Y. BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining, 2022. [CrossRef]
Wu, C.; Lin, W.; Zhang, X.; Zhang, Y.; Xie, W.; Wang, Y. PMC-LLaMA: Toward Building Open-Source Language Models for Medicine, 2023. [CrossRef]
Xie, Q.; Chen, Q.; Chen, A.; Peng, C.; Hu, Y.; Lin, F.; Peng, X.; Huang, J.; Zhang, J.; Keloth, V.; et al. Me-LLaMA: Foundation Large Language Models for Medical Applications, 2024. [CrossRef]
Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2023, Vol. 36, pp. 53728–53741. [CrossRef]
Chen, Z.; Cano, A.H.; Romanou, A.; Bonnet, A.; Matoba, K.; Salvi, F.; Pagliardini, M.; Fan, S.; Köpf, A.; Mohtashami, A.; et al. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models, 2023, [2311.16079]. [CrossRef]
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.R.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. The Eleventh International Conference on Learning Representations (ICLR 2023), OpenReview.net, 2023.
Yu, Y.; Yao, L.; Xie, Y.; Tan, Q.; Feng, J.; Li, Y.; Wu, L. Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents, 2026, [2601.01885].
Zhang, X.; Wang, G.; Cui, Y.; Qiu, W.; Li, Z.; Zhu, B.; He, P. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents, 2026, [2604.15877].
Anthropic. Model Context Protocol: An open standard for connecting LLM applications to data sources and tools. https://modelcontextprotocol.io, 2024. Open protocol specification, accessed May 2026.
Zhao, A.; Huang, D.; Xu, Q.; Lin, M.; Liu, Y.J.; Huang, G. ExpeL: LLM Agents Are Experiential Learners. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024, Vol. 38, pp. 19632–19642.
Liu, Y.; Si, C.; Narasimhan, K.R.; Yao, S. Contextual Experience Replay for Self-Improvement of Language Agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 14179–14198. [CrossRef]
Wu, R.; Wang, X.; Mei, J.; Cai, P.; Fu, D.; Yang, C.; Wen, L.; Yang, X.; Shen, Y.; Wang, Y.; et al. EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle, 2025, [2510.16079].
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models, 2021, [2106.09685]. [CrossRef]
Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, 2023. [CrossRef]
Anokhin, P.; Semenov, N.; Sorokin, A.; Evseev, D.; Kravchenko, A.; Burtsev, M.; Burnaev, E. AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents, 2024. [CrossRef]
Yang, S.; Ma, Z.; Huang, T.; Hu, Y.; Wang, Y.; Chu, X. CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution, 2026, [2604.15840].
Ni, J.; Liu, Y.; Liu, X.; Sun, Y.; Zhou, M.; Cheng, P.; Wang, D.; Zhao, E.; Jiang, X.; Jiang, G. Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills, 2026, [2603.25158].
Zhang, H.; Fan, S.; Zou, H.P.; Chen, Y.; Wang, Z.; Zhou, J.; Li, C.; Huang, W.C.; Yao, Y.; Zheng, K.; et al. CoEvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification, 2026, [2604.01687].
Xia, P.; Chen, J.; Yang, X.; Tu, H.; Liu, J.; Xiong, K.; Han, S.; Qiu, S.; Ji, H.; Zhou, Y.; et al. MetaClaw: Just Talk — An Agent That Meta-Learns and Evolves in the Wild, 2026, [2603.17187].
Ma, Z.; Yang, S.; Ji, Y.; Wang, X.; Wang, Y.; Hu, Y.; Huang, T.; Chu, X. SkillClaw: Let Skills Evolve Collectively with Agentic Evolver, 2026, [2604.08377]. [CrossRef]
Xia, P.; Chen, J.; Wang, H.; Liu, J.; Zeng, K.; Wang, Y.; Han, S.; Zhou, Y.; Zhao, X.; Chen, H.; et al. SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning, 2026, [2602.08234].
Wang, H.; Wang, G.; Xiao, H.; et al. Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents, 2026, [2604.10674].
Ge, T.; Peng, B.; Cheng, H.; Gao, J. Synthetic Computers at Scale for Long-Horizon Productivity Simulation, 2026, [2604.28181].
Fan, L.; Dai, P.; Deng, Z.; Wang, H.; Gong, X.; Zheng, Y.; Ou, Y. Evolving Medical Imaging Agents via Experience-driven Self-skill Discovery, 2026, [2603.05860]. [CrossRef]
Lin, J.; Liu, S.; Pan, C.; Lin, L.; Dou, S.; Huang, X.; Yan, H.; Han, Z.; Gui, T. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses, 2026, [2604.25850].
Ding, L. AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling, 2026, [2603.21357].
Li, X.; et al. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, 2026, [2602.12670].
Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems 2023, 36, 68539–68551.
Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; et al. AgentBench: Evaluating LLMs as Agents, 2023. [CrossRef]
Xu, R.; Yan, Y. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward, 2026, [2602.12430].
Gao, S.; Zhu, R.; Sui, P.; Kong, Z.; Aldogom, S.; Huang, Y.; Noori, A.; Shamji, R.; Parvataneni, K.; Tsiligkaridis, T.; et al. Democratizing AI Scientists Using ToolUniverse, 2025, [2509.23426].
Gao, S.; Zhu, R.; Kong, Z.; Noori, A.; Su, X.; Ginder, C.; Tsiligkaridis, T.; Zitnik, M. TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools, 2025, [2503.10970].
Huang, K.; Zhang, S.; Wang, H.; Qu, Y.; Lu, Y.; Roohani, Y.; Li, R.; Qiu, L.; Li, G.; Zhang, J.; et al. Biomni: A General-Purpose Biomedical AI Agent. bioRxiv 2025. [CrossRef]
Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework, 2023. [CrossRef]
Jin, D.; Pan, E.; Oufattole, N.; Weng, W.H.; Fang, H.; Szolovits, P. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams, 2021. [CrossRef]
Jin, Q.; Dhingra, B.; Liu, Z.; Cohen, W.W.; Lu, X. PubMedQA: A Dataset for Biomedical Research Question Answering, 2019. [CrossRef]
Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2022. [CrossRef]
Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models, 2023. [CrossRef]
Hu, S.; Lu, C.; Clune, J. Automated Design of Agentic Systems. International Conference on Learning Representations (ICLR), 2025.
Zhang, Y.; Chen, D.Z. GPT4MIA: Utilizing Generative Pre-Trained Transformer (GPT-4) as a Plug-and-Play Transductive Model for Medical Image Analysis. In Proceedings of the Workshop Proceedings of MICCAI 2023 (MedAGI/DeCaF), 2023, pp. 151–160. [CrossRef]
Lyu, Q.; Tan, J.; Zapadka, M.E.; Ponnat, J.; Niu, C.; Wang, G.; Whitlow, C.T. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: Results, limitations, and potential. Visual Computing for Industry, Biomedicine, and Art 2023, 6, 9. [CrossRef]
Lu, J.; Li, J.; Wallace, B.C.; He, Y.; Pergola, G. NapSS: Paragraph-level Medical Text Simplification via Narrative Prompting and Sentence-matching Summarization. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 1079–1091. [CrossRef]
Li, S.; Xu, J.; Bao, T.; Liu, Y.; Liu, Y.; Liu, Y.; Wang, L.; Lei, W.; Wang, S.; Xu, Y.; et al. A Co-Evolving Agentic AI System for Medical Imaging Analysis, 2025, [2509.20279].
Yu, Y.; Huang, Z.; Mu, L.; Zhang, S.; Zhang, X. Radiologist Copilot: An Agentic Framework Orchestrating Specialized Tools for Reliable Radiology Reporting, 2025, [2512.02814].
Milecki, L.; Kalogeiton, V.; Bodard, S.; Anglicheau, D.; Correas, J.M.; Timsit, M.O.; Vakalopoulou, M. MEDIMP: 3D Medical Images with clinical Prompts from limited tabular data for renal transplantation, 2023, [2303.12445]. [CrossRef]
Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; et al. ChatAug: Leveraging ChatGPT for Text Data Augmentation, 2023, [2302.13007].
Jiang, J.; Zhou, K.; Zhao, W.X.; Song, Y.; Zhu, C.; Zhu, H.; Wen, J.R. KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph, 2024, [2402.11163].
Albassam, D. Toward Human-Centered Interactive Clinical Question Answering System, 2025, [2505.18928].
Elgedawy, R.; Danciu, I.; Mahbub, M.; Srinivasan, S. Dynamic Question-Answering of Clinical Documents using Retrieval Augmented Generation, 2024, [2401.10733].
Zhi, X.; Zhao, H.; Wu, L.; Zhao, C.; Zhu, H. Reinventing Clinical Dialogue: Agentic Paradigms for LLM-Enabled Healthcare Communication, 2025, [2512.01453].
Komorowski, M.; Celi, L.A.; Badawi, O.; Gordon, A.C.; Faisal, A.A. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine 2018, 24, 1716–1720. [CrossRef]
Qiu, P.; Wu, C.; Liu, J.; Zheng, Q.; Liao, Y.; Wang, H.; Yue, Y.; Fan, Q.; Zhen, S.; Wang, J.; et al. Evolving Interactive Diagnostic Agents in a Virtual Clinical Environment, 2025, [2510.24654].
Feng, Y.; Wang, J.; Zhou, L.; Zheng, Y.; Lei, Z.; Li, Y. Real-World Doctor Agent with Proactive Consultation through Multi-Agent Reinforcement Learning, 2025, [2505.19630].
Yang, D.; Wei, J.; Li, M.; Liu, J.; Liu, L.; Hu, M.; He, J.; Ju, Y.; Zhou, W.; Liu, Y.; et al. MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration, 2024, [2410.12532]. [CrossRef]
Shimgekar, S.R.; Vassef, S.; Goyal, A.; Kumar, N.; Saha, K. Agentic AI Framework for End-to-End Medical Data Inference, 2025, [2507.18115].
Tang, X.; Zou, A.; Zhang, Z.; Li, Z.; Zhao, Y.; Zhang, X.; Cohan, A.; Gerstein, M. MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 599–621.
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the International Conference on Computer Vision, 2023, pp. 4015–4026. [CrossRef]
Cheng, J.; Ye, J.; Deng, Z.; Chen, J.; Li, T.; Wang, H.; Su, Y.; Huang, Z.; Chen, J.; Jiang, L.; et al. SAM-Med2D, 2023, [2308.16184].
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, 2023. [CrossRef]
Zhou, H.Y.; Acosta, J.N.; Adithan, S.; Datta, S.; Topol, E.J.; Rajpurkar, P. MedVersa: A Generalist Foundation Model for Medical Image Interpretation, 2024, [2405.07988].
Lu, M.Y.; Chen, B.; Williamson, D.F.; Chen, R.J.; Ikamura, K.; Gerber, G.; Liang, I.; Le, L.P.; Ding, T.; Parwani, A.V.; et al. A Foundational Multimodal Vision Language AI Assistant for Human Pathology, 2023, [2312.07814].
Park, J.S.; O’Brien, J.C.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023, pp. 1–22. [CrossRef]
Johnson, A.E.W.; Pollard, T.J.; Shen, L.; Lehman, L.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a Freely Accessible Critical Care Database. Scientific Data 2016, 3, 160035. [CrossRef]
Patel, A.; Hofmarcher, M.; Leoveanu-Condrei, C.; Dinu, M.C.; Callison-Burch, C.; Hochreiter, S. Large Language Models Can Self-Improve At Web Agent Tasks, 2024. [CrossRef]
Boiko, D.A.; MacKnight, R.; Kline, B.; Gomes, G. Autonomous Chemical Research with Large Language Models, 2023. [CrossRef]
Rezaei, M.R.; Fard, R.S.; Parker, J.L.; Krishnan, R.G.; Lankarany, M. Agentic Medical Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge, 2025, [2502.13010].
Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Dai, W.; Madotto, A.; et al. Survey of Hallucination in Natural Language Generation, 2022. [CrossRef]
Salehi, S.; Singh, Y.; Horst, K.K.; Hathaway, Q.A.; Erickson, B.J. Agentic AI and Large Language Models in Radiology: Opportunities and Hallucination Challenges. Bioengineering 2025, 12, 1303. [CrossRef]
Pal, A.; Umapathi, L.K.; Sankarasubbu, M. Med-HALT: Medical Domain Hallucination Test for Large Language Models, 2023. [CrossRef]
Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI Feedback, 2022. [CrossRef]
Wornow, M.; Xu, Y.; Thapa, R.; Patel, B.; Steinberg, E.; Fleming, S.; Pfeffer, M.A.; Fries, J.; Shah, N.H. The shaky foundations of large language models and foundation models for electronic health records. npj Digital Medicine 2023, 6, 135. [CrossRef]
Maiti, S. Caging the Agents: A Zero Trust Security Architecture for Autonomous AI in Healthcare, 2026, [2603.17419].
Dong, S.; Xu, S.; He, P.; Li, Y.; Tang, J.; Liu, T.; Liu, H.; Xiang, Z. Memory Injection Attacks on LLM Agents via Query-Only Interaction, 2025, [2503.03704].
Sunil, B.D.; Sinha, I.; Maheshwari, P.; Todmal, S.; Mallik, S.; Mishra, S. Memory Poisoning Attack and Defense on Memory Based LLM-Agents, 2026, [2601.05504]. [CrossRef]
Azarafrooz, A. Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms, 2026, [2604.21131].
Lin, Z.; Li, C.; Chen, K. A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty, 2026, [2604.16548].
Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 2017, 114, 3521–3526. [CrossRef]
Shi, H.; Xu, Z.; Wang, H.; Qin, W.; Wang, W.; Wang, Y.; Wang, Z.; Ebrahimi, S.; Wang, H. Continual learning of large language models: A comprehensive survey. ACM Computing Surveys 2025, 58, 1–42. [CrossRef]
Goddard, K.; Roudsari, A.; Wyatt, J.C. Automation Bias: A Systematic Review of Frequency, Effect Mediators, and Mitigators. Journal of the American Medical Informatics Association 2012, 19, 121–127. [CrossRef]
Lyell, D.; Coiera, E. Automation Bias and Verification Complexity: A Systematic Review. Journal of the American Medical Informatics Association 2017, 24, 423–431. [CrossRef]
Zhang, D. AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering, 2026, [2601.04620].
Dietrich, N. Agentic AI in radiology: Emerging potential and unresolved challenges. British Journal of Radiology 2025, 98, 1582–1584. [CrossRef]
Tu, T.; Azizi, S.; Driess, D.; Schaekermann, M.; Amin, M.; Chang, P.C.; Carroll, A.; Lau, C.; Tanno, R.; Ktena, I.; et al. Towards generalist biomedical AI. NEJM AI 2024, 1, AIoa2300138.
Cosentino, J.; Belyaeva, A.; Liu, X.; Furlotte, N.A.; Yang, Z.; Lee, C.; Schenck, E.; Patel, Y.; Cui, J.; Schneider, L.D.; et al. Towards a Personal Health Large Language Model, 2024, [2406.06474].
Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A.; et al. Highly Accurate Protein Structure Prediction with AlphaFold. Nature 2021, 596, 583–589. [CrossRef]
Joshi, M.; Pal, A.; Sankarasubbu, M. Federated Learning for Healthcare Domain – Pipeline, Applications and Challenges, 2022, [2211.07893].
Saha, P.; Strong, J.; Mishra, D.; Ouyang, C.; Noble, J.A. FedAgentBench: Towards Automating Real-world Federated Medical Image Analysis with Server-Client LLM Agents, 2025, [2509.23803].
Qu, A.; Zheng, H.; Zhou, Z.; Liang, P.P.; et al. CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery, 2026, [2604.01658]. [CrossRef]
Pal, A.; Umapathi, L.K.; Sankarasubbu, M. MedMCQA: A Large-Scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, 2022. [CrossRef]
Zhong, S.; Lu, Y.; Ning, J.; et al. SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks, 2026, [2604.20087].
Zhao, Y.; Yuan, B.; Huang, J.; Yuan, H.; Yu, Z.; Xu, H.; Hu, L.; Shankarampeta, A.; Huang, Z.; Ni, W.; et al. AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications, 2026, [2602.22769]. [CrossRef]
Ogdu, C.U.; Gurbuz, S.; Karakose, M.; Hanoglu, E. Medical Implications of LLM Based Clinical Decision Support Systems in Healthcare. In Proceedings of the 2025 29th International Conference on Information Technology (IT). IEEE, 2025, pp. 1–4. [CrossRef]

1. Industry references on the Agent = Model + Harness decomposition: Anthropic, “Effective harnesses for long-running agents,” 2025, https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents; V. Trivedy, “The anatomy of an agent harness,” LangChain Blog, 2026, https://blog.langchain.com/the-anatomy-of-an-agent-harness/; B. Böckeler, “Harness engineering for coding agent users,” martinfowler.com, 2026, https://martinfowler.com/articles/harness-engineering.html.

2. NousResearch, Hermes Agent project documentation, 2026: https://hermes-agent.nousresearch.com and https://github.com/NousResearch/hermes-agent.

Figure 1. The self-evolving agent harness architecture for healthcare. Four harness layers (persistent memory, orchestration via a ReAct loop, tool integration, and RL training) wrap around the LLM backbone, enabling clinical agents to accumulate knowledge and improve through use.

Table 1. Summary of the six skill-evolution triggers in Self-Evolving Agent Engineering.

Trigger	Activation condition	Concurrent support
Novelty	No matching skill; verifier score > 72	Trace2Skill [72]
Performance	Score more than 10 points below the skill’s recorded best	CoEvoSkills [73], MetaClaw [74]
Contradiction	Newly retrieved guideline conflicts with a skill assumption	(knowledge update; conceptual, no direct exemplar in this survey)
Consolidation	Every 50 tasks; similarity > 0.85 between skills	SkillClaw [75]
RL	Verifier score > 85% (high-quality trajectory)	SkillRL [76], Skill-SD [77], Ge et al. [78]
Boundary	Inconsistent outcomes on similar inputs under a fixed policy	CoEvolve [71]
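
As an illustration only, the activation conditions in Table 1 can be read as a set of independent predicates evaluated after each task. The sketch below is not taken from any of the cited systems: the `TaskOutcome` record, its field names, and the `fired_triggers` function are hypothetical, and it assumes that all verifier scores share a 0 to 100 scale (the table gives the Novelty threshold without a unit) and that the Contradiction and Boundary conditions are detected upstream and passed in as flags.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskOutcome:
    """Hypothetical per-task summary; all field names are illustrative."""
    verifier_score: float                        # assumed 0-100 scale
    matched_skill_best: Optional[float] = None   # None means no skill matched
    guideline_conflict: bool = False             # assumed to be detected upstream
    inconsistent_on_similar_inputs: bool = False # assumed to be detected upstream
    tasks_completed: int = 0                     # running task counter
    max_skill_similarity: float = 0.0            # highest pairwise skill similarity


def fired_triggers(o: TaskOutcome) -> List[str]:
    """Return the Table 1 triggers whose activation condition holds."""
    fired = []
    # Novelty: no matching skill and verifier score > 72
    if o.matched_skill_best is None and o.verifier_score > 72:
        fired.append("novelty")
    # Performance: score more than 10 points below the skill's recorded best
    if (o.matched_skill_best is not None
            and o.verifier_score < o.matched_skill_best - 10):
        fired.append("performance")
    # Contradiction: a retrieved guideline conflicts with a skill assumption
    if o.guideline_conflict:
        fired.append("contradiction")
    # Consolidation: checked every 50 tasks, requires similarity > 0.85
    if (o.tasks_completed > 0 and o.tasks_completed % 50 == 0
            and o.max_skill_similarity > 0.85):
        fired.append("consolidation")
    # RL: verifier score > 85 marks a high-quality trajectory
    if o.verifier_score > 85:
        fired.append("rl")
    # Boundary: inconsistent outcomes on similar inputs under a fixed policy
    if o.inconsistent_on_similar_inputs:
        fired.append("boundary")
    return fired


if __name__ == "__main__":
    print(fired_triggers(TaskOutcome(verifier_score=80.0)))
    print(fired_triggers(TaskOutcome(verifier_score=60.0, matched_skill_best=75.0)))
```

Evaluating the triggers independently, rather than as an if/else chain, lets a single task fire several of them at once; whether a real harness would prioritise or deduplicate them is a design decision the table does not specify.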

Table 2. Comparison of clinical AI deployment paradigms by harness capability. RAG is a retrieval technique that can be incorporated into any harness level; it is shown separately to highlight that retrieval alone does not constitute a self-evolving capability.

Harness dimension	Prompt Eng.	RAG-augmented	Stateless harness^†	Self-evolving harness
Tool integration	Limited	Retrieval only	Full (MCP/API)	Full (MCP/API)
Multi-agent coordination	✗	✗	✓	✓
Orchestration (ReAct)	✗	✗	✓	✓
Persistent memory	✗	✗	✗	✓
Skill accumulation	✗	✗	✗	✓
RL self-improvement	✗	✗	✗	✓
Setup complexity	Low	Medium	Medium	High
Deployment infrastructure	Minimal	Vector DB	Agent server	Agent server + RL

^†Stateless harness examples include ReAct agents [61], AutoGen [69], and LangChain-based agent pipelines.

Table 4. Key challenges in self-evolving agent engineering for healthcare and possible solutions.

Challenge	Description	Possible Solutions
Hallucination & safety	Agents generate confident but incorrect clinical facts	Clinical verifiers; uncertainty quantification; agent checkpoints
Data privacy & HIPAA	Persistent memory may retain PHI across sessions	Memory scanning; de-identification; local deployment
Domain adaptation	Medical terminology varies by specialty, institution	Domain-specific skill libraries; local guideline integration
Interpretability & trust	Clinical decisions require auditable reasoning chains	Reasoning trace logging; explainability tools; clinician oversight
Memory security	Persistent memory is vulnerable to injection attacks	Entry scanning; injection detection; access control
Scalability	Multi-agent orchestration is computationally expensive	Subagent pooling; cached skill retrieval; model distillation
Regulatory compliance	FDA, CE, and HIPAA impose strict validation requirements	Deterministic audit modes; clinical validation protocols
Autonomy–safety tension	Autonomous skill evolution may propagate systematic errors	Graduated autonomy; confidence-gated escalation
Memory fragmentation	Skill library grows unwieldy at scale	Hierarchical taxonomies; adaptive retrieval; periodic pruning
Catastrophic forgetting	RL fine-tuning degrades prior capabilities	EWC; experience replay; modular LoRA adapters
Over-reliance & de-skilling	Clinicians may accept polished agent outputs without scrutiny, eroding analytical capacity	Mandatory review of high-stakes outputs; reasoning-chain exposure; periodic audit
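
To make the "graduated autonomy; confidence-gated escalation" entry in Table 4 more concrete, the following minimal sketch routes a proposed skill update to one of three outcomes. Everything in it is an assumption for illustration: the `Route` enum, the `route_skill_update` function, the 0 to 1 confidence scale, and both threshold values are hypothetical and are not drawn from any surveyed system or regulatory guidance.

```python
from enum import Enum


class Route(Enum):
    """Hypothetical outcomes for a proposed skill update."""
    AUTO_APPLY = "auto_apply"
    CLINICIAN_REVIEW = "clinician_review"
    REJECT = "reject"


def route_skill_update(confidence: float,
                       high_stakes: bool,
                       auto_threshold: float = 0.9,
                       review_threshold: float = 0.6) -> Route:
    """Gate a skill update on verifier confidence and clinical stakes.

    Thresholds are illustrative defaults, not validated values.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    # Low-confidence updates are discarded rather than stored.
    if confidence < review_threshold:
        return Route.REJECT
    # High-stakes updates always go to a clinician, whatever the confidence.
    if high_stakes or confidence < auto_threshold:
        return Route.CLINICIAN_REVIEW
    # Only confident, low-stakes updates are applied autonomously.
    return Route.AUTO_APPLY


if __name__ == "__main__":
    print(route_skill_update(0.95, high_stakes=False))
    print(route_skill_update(0.95, high_stakes=True))
```

The ordering of the checks encodes the design choice: autonomy is the exception that must be earned by both high confidence and low stakes, while every other case escalates to a human or is dropped.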

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.