Introduction
Open-ended discovery—the search for novel, high-quality solutions in domains where the solution space lacks clear structure and evaluation may be expensive or sparse—remains one of the hardest challenges in automated scientific reasoning. Unlike constrained optimization, where gradients or convexity guide the search, open-ended problems demand sustained exploration, accumulation of partial insights, and the ability to redirect effort when progress stalls. Mathematical conjecture proving, systems-level code optimization, and combinatorial design all fall squarely in this category.
The emergence of large language model (LLM)-driven evolutionary search has begun to change what is possible. FunSearch (Romera-Paredes et al., 2024) demonstrated that an LLM could mutate programs within an evolving population, discovering new results in combinatorics and combinatorial optimization. AlphaEvolve (Novikov et al., 2025) extended this idea with MAP-Elites archiving and island-model parallelism, achieving notable advances in matrix multiplication and graph algorithms. Yet both systems share a fundamental limitation: the search itself is governed by fixed heuristics. Which parent to mutate, how to construct the mutation prompt, when to evaluate, and what knowledge to carry forward are all determined by pre-written rules. The LLM functions as a proposal engine embedded in a rigid loop; it cannot decide to run a local test before submitting, nor can it pause to write down an insight for later reuse.
The core insight behind CORAL (Qu et al., 2026) is that delegating more search decisions to autonomous agents, rather than pre-defining them as fixed procedures, unlocks substantially better performance. Where FunSearch hard-codes a selection rule, a CORAL agent decides what to read based on its own reasoning. Where AlphaEvolve invokes the evaluator after every proposal, a CORAL agent may choose to validate locally first, iterate on a draft, and only call the external evaluator when confidence is high. Where traditional evolutionary search discards knowledge between runs, CORAL agents accumulate observations, strategies, and reusable tools in a shared persistent memory that is visible to every agent and carries over across evaluations.
CORAL introduces three mechanisms that make this autonomy practical at scale: shared persistent memory provides a filesystem-based knowledge repository that all agents read from and write to; asynchronous multi-agent organization enables agents to explore in parallel without any direct message passing; and heartbeat-based interventions inject structured reflection, consolidation, and redirection prompts at configurable intervals, preventing agents from getting stuck in unproductive loops. Evaluated across eleven tasks spanning mathematical optimization and systems engineering, CORAL achieves the best final score on every task and establishes eight new state-of-the-art results. Its improvement rate exceeds fixed evolutionary baselines by 3–10×, and it typically converges in 5–20 evaluations where baselines require 60–100. On Anthropic’s kernel engineering benchmark, four co-evolving agents push the best known score from 1,363 to 1,103 cycles—a 19% improvement—without any web search.
Problem Formulation
An open-ended discovery task is defined as a pair (T, E), where T is a task description and E is an evaluator function. For a candidate solution x, the evaluator returns a score and optional feedback:

E(x) = (s, f)

Here s is a scalar score (to be maximized or minimized depending on the task) and f is auxiliary feedback, which may take the form of sub-score decompositions, textual criticism from an LLM-based judge, or execution traces. The feedback signal is richer than a single number, but it is not a gradient; it does not directly indicate how to improve x.
Each improvement step in CORAL follows a four-stage cycle:
- RETRIEVE: Construct a working context C_t from shared persistent memory M_t, selecting relevant prior attempts, notes, and skills.
- PROPOSE: Generate a candidate solution x_t conditioned on the task T and the retrieved context C_t.
- EVALUATE: Obtain score s_t and feedback f_t from the external evaluator E.
- UPDATE: Integrate the new information into shared persistent memory, producing M_{t+1} (a minimal code sketch of the full cycle follows).
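The sketch below shows the shape of one such step in Python. It is illustrative only: the four stages are passed in as callables (retrieve, propose, evaluate, update) standing in for the agent's actual tool calls, which are not named in the paper.

```python
from typing import Any, Callable, Tuple


def improvement_step(
    task: str,
    retrieve: Callable[[str], Any],                # RETRIEVE: build a working context from shared memory
    propose: Callable[[str, Any], str],            # PROPOSE: draft a candidate from task + context
    evaluate: Callable[[str], Tuple[float, str]],  # EVALUATE: external evaluator E(x) -> (s, f)
    update: Callable[[str, float, str], None],     # UPDATE: write the attempt, notes, skills back to memory
) -> float:
    """One RETRIEVE -> PROPOSE -> EVALUATE -> UPDATE step (illustrative sketch)."""
    context = retrieve(task)
    candidate = propose(task, context)
    score, feedback = evaluate(candidate)
    update(candidate, score, feedback)
    return score
```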
The critical question is who decides at each stage. The table below contrasts three paradigms:
| Stage | Fixed Evolutionary Search | Autonomous Single-Agent | Autonomous Multi-Agent |
|---|---|---|---|
| Retrieve | Fixed selection rule (e.g., top-k by score) | Agent decides what to read | Each agent independently decides |
| Propose | Single LLM forward pass per candidate | Agent may iterate, test locally, refine | Multiple agents explore in parallel |
| Evaluate | External call after every proposal | Agent decides when to call evaluator | Shared evaluator, agents decide timing |
| Update | Fixed rule (e.g., replace worst in population) | Agent decides what knowledge to write | Asynchronous writes to shared memory |
| Communication | None | None | Indirect, via shared persistent memory |
In fixed evolutionary search, the LLM is a passive proposal engine. In autonomous single-agent evolution, it becomes an active optimizer that plans its own search trajectory. In autonomous multi-agent evolution, multiple such optimizers collaborate implicitly through a shared knowledge store, achieving both diversity and cumulative progress.
Core Mechanisms
3.1 Shared Persistent Memory
CORAL organizes shared knowledge as a filesystem with three root directories, each mapped into every agent’s workspace via symbolic links:
attempts/ stores historical evaluations. Each attempt is a JSON record keyed by commit hash, containing the solution snapshot, score, status (improved / baseline / regressed / crashed / timeout), parent hash, timestamp, and evaluator feedback. Agents browse high-performing solutions, compare approaches, and trace lineage through the parent_hash field.
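For concreteness, an attempt record might look like the following. The field names follow the description above (plus the shared_state_hash mentioned later in this section); the values and exact key spellings are illustrative, not taken from CORAL's codebase.

```python
attempt = {
    "commit_hash": "a3f9c12",           # key of this record: the commit that produced the solution
    "parent_hash": "7b21e04",           # lineage link used to trace how solutions evolved
    "score": 1350.0,                     # evaluator score (here: cycle count, lower is better)
    "status": "improved",                # improved / baseline / regressed / crashed / timeout
    "timestamp": "2026-04-01T12:34:56Z",
    "shared_state_hash": "d4e8f77",     # snapshot of shared memory at evaluation time
    "feedback": "passes correctness checks; VALU pipeline is the bottleneck",
}
```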
notes/ captures observations, learnings, and reflections in Markdown files with YAML frontmatter. Agents decide what to record and where to file it. Special subdirectories support collective knowledge: _synthesis/ holds cross-cutting summaries produced by consolidation heartbeats, _connections.md maps patterns across categories, and _open-questions.md tracks unresolved gaps and contradictions. On demanding tasks like kernel engineering, agents create directories such as “what NEVER worked” to catalog dead ends—a practice that emerges organically rather than being prescribed.
skills/ records reusable procedures, tools, and scripts. Each skill consists of a natural-language description (SKILL.md) paired with executable artifacts (functions and example scripts). The system provides a built-in skill_creator skill that guides agents through the create-test-refine workflow for producing new skills.
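Because a skill is just a directory, discovering available skills reduces to a filesystem scan. A minimal sketch, assuming the skills/&lt;name&gt;/SKILL.md layout described above:

```python
from pathlib import Path


def list_skills(skills_root: Path) -> dict:
    """Map each skill's directory name to the first line of its SKILL.md description."""
    skills = {}
    for skill_md in sorted(skills_root.glob("*/SKILL.md")):
        text = skill_md.read_text(encoding="utf-8").strip()
        first_line = text.splitlines()[0] if text else ""
        skills[skill_md.parent.name] = first_line.lstrip("# ").strip()
    return skills
```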
The filesystem-as-memory design has practical consequences. Agents access memory through CLI tools (coral notes, coral skills) or direct Bash file reads, both of which are natural operations for code agents. Concurrency safety comes from atomic writes using the temp-file-then-rename pattern, and because each attempt writes to a uniquely named file (keyed by commit hash), no explicit locking is needed. Git version control on the shared directory provides an audit trail: every Attempt record includes a shared_state_hash linking it to the exact memory snapshot at evaluation time.
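The temp-file-then-rename pattern mentioned above is a standard POSIX idiom; a sketch of how an attempt record could be written safely (not CORAL's exact code) looks like this:

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, record: dict) -> None:
    """Write a JSON record so that readers never observe a partially written file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(record, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())     # make sure the bytes are on disk before the rename
        os.replace(tmp_name, path)      # atomic on POSIX: readers see the old file or the new one
    except BaseException:
        os.unlink(tmp_name)
        raise
```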
3.2 Asynchronous Multi-Agent Organization
Multiple agents run asynchronously, each maintaining its own local context and operating in an isolated workspace. Each agent has an independent git worktree on its own branch with its own Python virtual environment (.venv). The worktrees share the underlying repository object database, keeping disk usage modest, while branch isolation ensures that one agent’s experimental changes never corrupt another’s workspace.
The defining design choice is that agents do not communicate directly. Coordination happens entirely through shared persistent memory: when one agent writes an attempt, note, or skill to the shared store, another agent may retrieve it when constructing its working context in a subsequent step. This indirect coordination has three desirable properties. First, it increases exploration diversity: each agent follows its own reasoning about what to try next, producing search trajectories that overlap only partially (pairwise Jaccard similarity of 0.31–0.43 on pressure-test tasks). Second, it enables shared accumulation: discoveries by one agent—whether a high-scoring solution or a useful insight—become immediately available to all others. Third, it eliminates the need for any message-passing protocol, avoiding the engineering complexity of queuing, ordering, and consensus that plagues explicit communication frameworks.
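Because coordination is just file reads, picking a parent to build on (including one written by a different agent) can be as simple as scanning the shared attempts directory. A sketch, assuming the illustrative JSON layout shown in the attempts/ example above:

```python
import json
from pathlib import Path


def best_attempts(attempts_dir: Path, n: int = 5, lower_is_better: bool = True) -> list:
    """Return the top-n attempt records written by any agent (illustrative)."""
    records = []
    for attempt_file in attempts_dir.glob("*.json"):
        record = json.loads(attempt_file.read_text(encoding="utf-8"))
        if "score" in record:
            records.append(record)
    records.sort(key=lambda r: r["score"], reverse=not lower_is_better)
    return records[:n]
```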
A critical security measure is evaluator separation. The grader code resides in .coral/private/eval/, which is excluded from the agent’s worktree via gitignore rules and directory permissions. Agents cannot read the scoring function, preventing them from gaming the evaluator—a real concern when agents are powerful enough to manipulate code.
3.3 Heartbeat Mechanism
Left to their own devices, autonomous agents can fall into unproductive patterns: obsessing over minor variations of the same approach, failing to recognize when a strategy is exhausted, or neglecting to consolidate scattered insights. CORAL’s heartbeat mechanism addresses this through structured interventions that inject targeted prompts at configurable intervals. Three heartbeat types are defined in the paper, with a fourth discovered in the codebase:
Reflect (every 1 evaluation, local scope). After each evaluation, the agent is prompted to anchor its reflection on the concrete result, check for unexpected findings, analyze causes, assess confidence, and plan the next experiment. This keeps the agent grounded in evidence rather than drifting into speculation.
Consolidate (every 10 evaluations, global scope). At regular intervals, the system prompts the agent to synthesize knowledge across all agents’ work, producing entries in notes/_synthesis/, updating notes/_connections.md with cross-category patterns, and refreshing notes/_open-questions.md with current gaps. Consolidation heartbeats ensure that partial discoveries do not languish in isolation.
Pivot (triggered after 5 consecutive evaluations without improvement, local scope). When an agent’s progress plateaus, the pivot heartbeat directs it to diagnose the ceiling, study what distinguishes top performers, select a new direction (a different algorithm family, problem formulation, or representation), restart from a strong foundation, and quickly submit a minimal implementation. Plateau detection tracks evals_since_improvement with a cooldown mechanism to avoid triggering on every marginal regression.
Lint Wiki (every 10 evaluations, global scope). This fourth heartbeat, present in the codebase but not discussed in the paper, spawns a librarian sub-agent that performs health checks on the knowledge base: detecting contradictions between notes, deduplicating overlapping entries, and restructuring the directory hierarchy. It reflects CORAL’s philosophy that knowledge quality requires active maintenance, not just passive accumulation.
The triggering mechanism is worth understanding in detail. The agent manager runs a monitoring loop that polls .coral/public/attempts/ every 5 seconds. For each new attempt, it updates per-agent tracking state: local evaluation count, best score, and consecutive evaluations without improvement. When a heartbeat condition is met, the manager sends a SIGINT to interrupt the agent’s current operation, composes a combined prompt (original task context plus heartbeat directive), and resumes the agent with this enriched context. Agents can also create heartbeat events themselves using coral heartbeat set, giving them a degree of self-regulation.
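The decision logic alone can be compressed into a few lines. The sketch below uses the thresholds quoted in the text (reflect every evaluation, consolidate every 10, pivot after 5 evaluations without improvement) and assumes lower scores are better; it omits the SIGINT/resume machinery and the lint-wiki heartbeat, and the cooldown shown is a simplification.

```python
from dataclasses import dataclass


@dataclass
class AgentStats:
    evals: int = 0
    best_score: float = float("inf")        # this sketch assumes lower scores are better
    evals_since_improvement: int = 0


def on_new_attempt(stats: AgentStats, score: float) -> str:
    """Update per-agent tracking state and decide which heartbeat to fire."""
    stats.evals += 1
    if score < stats.best_score:
        stats.best_score = score
        stats.evals_since_improvement = 0
    else:
        stats.evals_since_improvement += 1

    if stats.evals_since_improvement >= 5:   # plateau detected: redirect the agent
        stats.evals_since_improvement = 0    # crude cooldown so pivot does not re-fire immediately
        return "pivot"
    if stats.evals % 10 == 0:                # periodic global synthesis
        return "consolidate"
    return "reflect"                         # reflect runs after every evaluation
```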
Implementation Insights from Code
The paper presents CORAL’s mechanisms at a conceptual level. Reading the source code reveals several design decisions that are essential to making autonomous multi-agent evolution work in practice but receive little or no discussion in the paper itself. This section documents those insights.
4.1 The Filesystem as Message Bus
CORAL’s most striking architectural choice is the absence of any centralized coordination service. All inter-agent communication flows through the filesystem: agents read and write files in the shared memory directory, and the manager monitors the filesystem to trigger heartbeats. There is no message queue, no RPC framework, no database. This design is both simple and robust. Atomic writes via the temp-file-then-rename pattern guarantee that no agent ever reads a partially written file. Git versioning on the shared directory means that every state change is auditable and revertible. The Attempt record’s shared_state_hash field creates a snapshot link between each evaluation and the memory state at that moment, enabling post-hoc analysis of exactly what information was available to each agent.
4.2 Crash Recovery as a First-Class Concern
Autonomous agents crash. They run out of context windows, encounter Python import errors, or produce output that the LLM cannot parse. CORAL treats crash recovery as a first-class concern rather than an afterthought. The exit classifier categorizes every agent termination into three types: clean (normal exit), no_result (ran but produced no evaluation), and session_error (crash or timeout). A crash circuit breaker monitors failure frequency: if three crashes occur within a short window, the system pauses the agent for five minutes before restarting, preventing rapid crash loops from wasting API credits. An important nuance is the evaluator queue exemption: if an agent is waiting for an evaluator response, the manager does not count this as a stall, avoiding false-positive kill signals during long-running evaluations.
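Conceptually, the circuit breaker is a sliding-window counter. A sketch with the three-crashes and five-minute-pause values from the text; the window length is not specified in the paper, so the ten-minute figure below is an assumption.

```python
import time
from collections import deque


class CrashCircuitBreaker:
    """Pause an agent after repeated crashes instead of restarting it in a tight loop."""

    def __init__(self, max_crashes: int = 3, window_s: float = 600.0, pause_s: float = 300.0):
        self.max_crashes = max_crashes   # three crashes trip the breaker (from the text)
        self.window_s = window_s         # "short window" is unspecified; 10 minutes assumed here
        self.pause_s = pause_s           # five-minute pause before restarting (from the text)
        self.crashes = deque()

    def record_crash(self) -> float:
        """Register a crash; return seconds to pause before restarting (0 = restart now)."""
        now = time.monotonic()
        self.crashes.append(now)
        while self.crashes and now - self.crashes[0] > self.window_s:
            self.crashes.popleft()
        return self.pause_s if len(self.crashes) >= self.max_crashes else 0.0
```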
The evaluator itself runs as an independent subprocess with a hard timeout (default 300 seconds, configurable per task). If the grader does not return within the limit, it is killed with SIGKILL—no graceful shutdown, no chance to hang indefinitely. This hard boundary ensures that a buggy or adversarial solution cannot monopolize evaluation resources.
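Running a grader under a hard timeout is straightforward with the standard library; the sketch below uses the 300-second default from the text, with a placeholder command supplied by the caller.

```python
import subprocess


def run_evaluator(cmd: list, timeout_s: float = 300.0):
    """Run the grader in a subprocess; on timeout, kill it outright rather than waiting."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    try:
        output, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, output
    except subprocess.TimeoutExpired:
        proc.kill()              # SIGKILL on POSIX: no graceful shutdown, no chance to hang
        proc.communicate()       # reap the dead process and drain its pipes
        return None, "evaluation timed out"
```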
4.3 The Agent-as-Optimizer Philosophy
The CORAL.md template file, injected into every agent’s workspace as its primary instruction set, encodes a distinctive philosophy about how agents should approach search. Three directives stand out:
“Eval early and often.” The template urges agents to submit solutions for external evaluation rather than over-optimizing locally. The reasoning is that the external evaluator provides the only reliable signal; local tests may be incomplete or misleading.
“Bias toward speed.” A rough but evaluable solution is preferable to a perfect but untested one. This directive combats the tendency of LLM agents to refine indefinitely without ever checking whether their refinements actually improve the score.
“Every eval should produce at least one note or skill update.” Knowledge accumulation is not optional. Even a failed evaluation should generate an insight—what was tried, why it failed, what should be avoided next time. This rule ensures that the shared memory grows monotonically, benefiting all agents.
Git operations are entirely managed by the framework. Agents never run git commit or git add directly; instead, they call coral eval -m "message", which stages all changes, commits, evaluates, and records the attempt atomically. This prevents agents from accidentally corrupting the repository state and ensures that every evaluation corresponds to a clean commit.
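The effect of coral eval -m "message" can be approximated by the following sequence. This is a sketch of the workflow described above, not CORAL's implementation; the evaluate and record_attempt callables are placeholders for the grader call and the attempt-record write.

```python
import subprocess


def coral_eval(message: str, evaluate, record_attempt) -> float:
    """Stage, commit, evaluate, and record one attempt as a single unit (illustrative)."""
    subprocess.run(["git", "add", "-A"], check=True)                 # stage every pending change
    subprocess.run(["git", "commit", "-m", message], check=True)     # one clean commit per evaluation
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True, check=True).stdout.strip()
    score, feedback = evaluate()                                      # call the external grader
    record_attempt(commit=commit, score=score, feedback=feedback)     # persist the attempt record
    return score
```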
4.4 The Sub-Agent System
CORAL deploys specialized sub-agents for tasks that benefit from focused expertise:
Deep-researcher performs structured literature review. When the warm-start option is enabled, this sub-agent surveys relevant web resources before the main agent begins coding, providing an initial knowledge base that accelerates early progress.
Librarian conducts knowledge base health checks during lint-wiki heartbeats. It scans shared notes for contradictions (e.g., two notes claiming opposite conclusions about the same technique), identifies redundant entries covering the same ground, and restructures the directory hierarchy when the organization becomes unwieldy.
Skill-creator is a meta-skill: it guides agents through the process of creating, testing, and refining new skills. When an agent discovers a reusable procedure—say, a particular code transformation pattern that consistently reduces cycle count in kernel engineering—it can invoke the skill-creator to package this procedure into a properly documented and tested skill that other agents can discover and apply.
Experimental Analysis
5.1 Single-Agent Results
CORAL was evaluated on eleven tasks: six mathematical optimization problems (circle packing, signal processing, Erdos minimum overlap, MMD-16-2, MMD-14-3, 3rd-order autocorrelation inequality) and five systems optimization problems (EPLB, PRISM, LLM-SQL, transaction scheduling, Cloudcast). All results are averaged over four independent trials with a budget of 3 hours wall-clock time or 100 iterations (whichever is longer).
Single-agent CORAL achieves the best final score on all eleven tasks against three baselines—OpenEvolve, ShinkaEvolve, and EvoX (the strongest competitor with its meta-evolutionary search strategy)—and establishes new state-of-the-art results on eight. The improvement rate, defined as the fraction of evaluations that produce a strictly better score, is 3–10× higher than baselines across tasks. Perhaps more striking is the evaluation efficiency: CORAL typically converges in 5–20 evaluations where baselines require 60–100. On circle packing, CORAL matches the SOTA in just 11 evaluations (OpenEvolve needs 100); on MMD-16-2, it reaches the known optimum in 6 evaluations (EvoX requires 18).
The efficiency gain is not accidental. Because autonomous agents can validate locally before calling the external evaluator, a significant fraction of submissions are already pre-screened. On kernel engineering, 57% of evaluations are preceded by a local test, and 47% of locally tested submissions produce an improvement. The agent is not guessing; it is making informed proposals.
5.2 Multi-Agent Results
The multi-agent setting reveals CORAL’s most impressive results. On Anthropic’s kernel engineering task, four co-evolving agents (Claude Code with Opus 4.6) achieve 1,103 cycles, compared to 1,350 for single-agent CORAL and 2,740 for OpenEvolve. The four agents collectively produce 596 evaluations with a 9% improvement rate. Cross-pollination is critical: 66% of new records originate from a cross-agent parent—a solution proposed by one agent that another agent picks up and improves. On polyominoes packing, four agents reach 84.2% coverage (versus 80.2% for single-agent), and with web search enabled, CORAL attains 89.4%, surpassing the prior SOTA of 87%.
The multi-agent advantage is not limited to proprietary models. Using the fully open-source stack (OpenCode + MiniMax M2.5), four-agent CORAL consistently outperforms its single-agent counterpart across all mathematical and systems tasks, with gains ranging from 0.15% to 20.8%.
5.3 Why Autonomous Evolution Works
Three mechanisms explain the performance gap:
Local verification. Agents test solutions locally before submitting them for external evaluation. The local test rate varies by task: 57% on kernel engineering (where compilation and cycle counting can be done locally), 61% on transaction scheduling, but 0% on PRISM (where the evaluator generates random test cases that cannot be reproduced locally). Where local testing is feasible, it acts as a high-pass filter, catching compilation failures and obvious regressions before they consume evaluation budget.
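Where a local harness exists, the pre-screening guard can be very simple; the command below is a placeholder for whatever task-specific test the agent builds, and tasks like PRISM simply have no such harness.

```python
import subprocess


def worth_submitting(local_test_cmd: list, timeout_s: float = 60.0) -> bool:
    """Cheap high-pass filter: only spend external evaluation budget if the local test passes."""
    try:
        result = subprocess.run(local_test_cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False             # a hanging local test counts as a failure
    return result.returncode == 0
```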
Knowledge accumulation. On demanding tasks, agents create 0.55–0.68 knowledge artifacts per attempt, compared to 0.05 on standard tasks. This tenfold difference reflects a qualitative shift: on standard tasks, notes tend to be lightweight progress logs (“tried parameter X, got score Y”), while on hard tasks they capture reusable insights (“identified VALU architecture bottleneck at depth-0 XOR; switching to per-lane ALU saves 64 VALU at cost of 512 ALU”). Knowledge access correlates with improvement: on kernel engineering, 55% of evaluations that access prior knowledge produce an improvement, versus 9% overall.
Cross-agent information transfer. In the four-agent kernel engineering run, 36% of attempts use another agent’s commit as a parent. Cross-agent parents achieve a 17% improvement rate (versus 9% overall), and 66% of new records trace back to cross-agent lineages. The transfer modes differ by task: kernel engineering favors direct code reuse (agents copy and modify promising commits), while polyominoes packing favors knowledge transfer (87% of rounds reference another agent’s notes or skills). These complementary patterns emerge organically from the shared memory architecture.
5.4 Ablation Studies
Disabling knowledge accumulation (removing notes and skills) degrades performance on all tested tasks. The effect is largest on kernel engineering, where scores regress from 1,350 to 1,601 cycles—an 18.6% setback. On polyominoes and transaction scheduling, the regression is 3.6% and 2.7% respectively. Knowledge is not a nice-to-have; it is load-bearing.
Separating co-evolution from independent execution isolates the value of shared memory. Four co-evolving agents achieve 1,103 cycles on kernel engineering, while the best of four independently run agents reaches only 1,180—a 6.5% gap that cannot be explained by additional computation alone. On polyominoes, the gap is 4.2%. The shared memory enables a form of soft coordination that makes the whole greater than the sum of its parts.
Discussion
6.1 From Fixed to Autonomous Evolution
Fixed evolutionary search treats the LLM as a sophisticated mutator embedded in a rigid loop. The search strategy—what to mutate, how to select parents, when to evaluate—is determined entirely by hand-coded heuristics. This approach works when the problem structure aligns with the heuristics, but it fundamentally limits the LLM’s capacity for planning and strategic decision-making. CORAL’s shift to autonomous evolution asks a different question: rather than “how should we orchestrate the LLM?”, it asks “what decisions should we let the LLM make for itself?”
The answer, it turns out, is “most of them.” Agents that can choose what to read, when to test, what to record, and when to pivot consistently outperform fixed procedures. The heartbeat mechanism is key to making this work: it provides soft guidance (reflection prompts, consolidation triggers, redirection cues) rather than hard constraints (fixed selection rules, mandatory evaluation after every proposal). The agent retains autonomy over its search trajectory while benefiting from periodic nudges that prevent common failure modes.
6.2 Implicit Protocols in Multi-Agent Collaboration
CORAL’s multi-agent organization deliberately avoids explicit communication. There is no message-passing protocol, no shared plan, no role assignment. Yet the agents develop what can be called implicit protocols: patterns of coordination that emerge from shared memory access. The Jaccard similarity of 0.31–0.43 between agents’ attempted strategies indicates that more than half of each agent’s search vocabulary is unique, providing genuine exploration diversity. At the same time, the 36% cross-agent parent rate on kernel engineering shows that agents are effectively building on each other’s discoveries. The result is a system that combines the breadth of independent exploration with the depth of shared accumulation.
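The diversity figure cited here is an ordinary pairwise Jaccard similarity over each agent's set of attempted strategies; as a reminder of the computation:

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; a value of 0.31-0.43 means most of the union is not shared."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# e.g. jaccard({"greedy", "annealing", "ilp"}, {"greedy", "ilp", "beam-search"}) == 0.5
```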
This horizontal-parallel architecture contrasts with vertical-sequential frameworks like MetaGPT, where agents assume fixed roles (product manager, architect, engineer) and pass artifacts through a predefined pipeline. For open-ended problems, where the optimal division of labor is unknown in advance, horizontal parallelism with implicit coordination is more appropriate: it allows the discovery process itself to determine what each agent should work on.
6.3 Limitations
CORAL’s approach carries three notable limitations. First, it depends on frontier foundation models capable of handling complex coding agent workflows; single-agent runs cost approximately $30–60 per 3-hour session, and four-agent runs are roughly four times that amount. Deploying on smaller, locally-hostable models remains an open challenge. Second, all agents are initialized identically, with the same task prompt and the same CORAL.md instructions. Injecting heterogeneous personalities, roles, or private information at initialization could further increase exploration diversity, but how to do this systematically is not yet understood. Third, the framework assumes the availability of a reliable evaluator. For problems where evaluation is itself expensive, incomplete, or ambiguous—a common situation in real-world scientific discovery—the evaluator may need to co-evolve with the solution, a direction that CORAL does not currently explore.
References
- Romera-Paredes, B. et al., 2024. Mathematical discoveries from program search with large language models. Nature, 625, pp.468–475.
- Novikov, A. et al., 2025. AlphaEvolve: A coding agent for scientific and algorithmic discovery. Google DeepMind Technical Report.
- Sharma, R., 2025. OpenEvolve: An open-source implementation of evolutionary search with LLMs. GitHub Repository.
- Lange, R. et al., 2025. ShinkaEvolve: Accelerating evolutionary search with diversity-guided sampling. Preprint.
- Liu, Y. et al., 2026. EvoX: Meta-evolutionary search for open-ended discovery. Preprint.
- Qu, A., Zheng, H., Zhou, Z. et al., 2026. CORAL: Towards autonomous multi-agent evolution for open-ended discovery. arXiv:2604.01658.