Union.ai

Flyte

Agentic AI

Tutorial

June 25, 2026


Long horizon Agents on a Durable AI Runtime

Niels Bantilan

Local development agent harnesses like Claude Code, OpenCode, and Pi make building an agent look easy. You wire up a few tools, point it at a model, ship a demo, and it works beautifully.

But then you try to productionize the same functionality and deploy a long-horizon agent that runs on your cloud, in the background, and the questions start. What happens when a tool call OOMs partway through a four-hour batch? How do you keep the agent from re-running two hundred dollars of LLM calls after a process crash? Can it compile and run untrusted code without paying for it on the wrong machine? This is when a production-grade AI runtime matters.

The problem isn't necessarily the LLM. Agents break in multiple ways:

Semantic hallucinations (this one is the most well-known)
Logical bugs (e.g. tool code bugs and malformed tool call payloads)
Network rate-limits and timeouts (solution: sorry, just try again in a few seconds!)
Infrastructure failures like OOM (the agent says "I can't provision any more compute")

A lot of engineering focus goes into the first three, but in my experience people rarely think about infrastructure-level concerns, like whether the agent can survive an OOM kill on the ninety-ninth of a hundred steps and pick up where it left off. In this post, we'll cover how Flyte handles all four sets of concerns and talk about how it's uniquely positioned to handle infrastructure-level agent failures.

What's an AI runtime?

We've been seeing this term pop up in various contexts lately, and our take is that an AI runtime consists of three fundamental pillars:

Durable orchestration: the substrate that connects data, models, and compute as tasks composed into fault-tolerant workflows.
Real-time serving: long-running services that serve model endpoints, MCPs, REST-like applications, and human-facing UI applications.
Multi-silicon infrastructure: the backend that provisions compute that an agent needs to get the right amount of CPUs, memory, GPUs, and XPUs, in whichever cloud it can find.

Different use cases and applications will draw from one, two, or all three of these pillars, and in this post we'll see how they apply to building long-horizon agents.

Agents on a production-grade AI runtime

Over the last few weeks we've shipped three production-shaped agents using Flyte-native primitives, which are exposed in the `flyte.ai.agents` package: an `Agent` harness, tools that run with variable compute requirements via Flyte's `@env.task`, a durable `MemoryStore`, an on-device sandbox for untrusted code, and task-level retries, traces, and resource overrides. The three agents are:

The parallelized autoresearch agent: autonomous ML research that proposes TinyGPT configurations, edits `train.py` directly in memory, and fans out parallel training batches on the karpathy/climbmix corpus.
The AutoSec researcher: autonomous security-ops researcher that scans C source for memory-corruption bugs, hypothesizes a vulnerability, builds a proof-of-concept payload, and validates it by triggering a segmentation fault in a sandbox.
The drug molecule screening agent: medicinal-chemistry strategist that derives a target profile from a natural-language brief, runs an RDKit virtual screen, reads the funnel, and rescreens once if needed.

These aren't variations on a theme. They're three very different problems that all take on the same shape, because production-grade agents converge on the same sets of concerns.

The pattern, in one block

At a high level, Flyte-native agents share a design with many of the agent frameworks out there. The main difference is that the `Agent` itself and the tools it runs can each live on different machines with highly variable compute requirements.

import flyte
from flyte.ai.agents import Agent, tool

@tool
@env.task
async def do_thing(...): ...

agent = Agent(
    name="my-agent",
    instructions=INSTRUCTIONS,
    model="claude-haiku-4-5",
    tools=[do_thing, ...],
    max_turns=12,
)

@env.task(report=True)
async def run(brief: str) -> str:
    result = await agent.run.aio(brief)
    return result.summary or result.error or ""

Tools are `@env.task` functions decorated with `@tool`, and each one runs in its own container with its own resources, retry policy, and caching policy. The agent is configured with a model and an instruction set. Everything beyond that (memory, sandbox, OOM recovery, resource sizing, and fan-out) is opt-in. The three agents above pick up different subsets because their workloads are different.

Five things change when you go from demo to production.

1. Untrusted code execution

When an agent produces executable output (Python it wrote to train a model, a C exploit payload it generated, a SQL query against your warehouse), you have to figure out where that code runs.

The parallelized autoresearch agent runs in `code_mode` against `claude-sonnet-4-6`. In each iteration it proposes a batch of TinyGPT experiments (depth, width, optimizer, schedule), edits `train.py` to express each one, and asks Flyte to train them. The training corpus is the karpathy/climbmix BPE shard set; the eval metric is validation bits-per-byte (`val_bpb`). Each agent-edited script is staged into a sandbox work directory along with a driver, and executed:

async def run_train_in_sandbox(work_dir, train_py, *, title, time_budget_sec, config_overrides):
    """Stage agent-edited train.py + driver, run in a userns sandbox, return parsed metrics."""
    from union import sandbox as sb

    driver_path = stage_sandbox_files(
        work_dir, train_py,
        title=title,
        time_budget_sec=time_budget_sec,
        config_overrides=config_overrides,
    )

    async with sb.on_device.session(backend="userns", host_work_dir=work_dir) as sbx:
        proc = await sbx.run(
            f"python {driver_path}",
            stdout=True, stderr=True,
            network_mode="blocked",
            timeout_s=time_budget_sec + 180,
        )
        stdout, stderr = await proc.communicate_text()
    if is_oom(stderr, proc.returncode, stdout=stdout):
        return {"success": False, "oom": True, "title": title, ...}
    return {"success": True, **parse_metrics(stdout)}

The AutoSec researcher is a more adversarial case. It generates a buffer-overflow payload (`"A" * (buffer_size + 64)`) against vulnerable C source, then compiles and runs the binary inside the same sandbox shape to confirm the crash:

@env.task(retries=2, timeout=300)
async def validate_in_sandbox(source: str, poc: dict) -> dict:
    import tempfile
    from union import sandbox as sb

    with tempfile.TemporaryDirectory() as work:
        async with sb.on_device.session(host_work_dir=work, backend="userns") as sbx:
            await sbx.put_bytes(f"{work}/target.c", source.encode())
            compile_proc = await sbx.run(
                f"gcc -fno-stack-protector -w -o {work}/target {work}/target.c",
                stdout=True, stderr=True, timeout_s=60,
            )
            _, compile_err = await compile_proc.communicate_text()
            if "error" in compile_err.lower():
                return {"triggered": False, "log": f"COMPILE_FAILED\n{compile_err}"}

            payload = "A" * int(poc["payload_len"])
            run_proc = await sbx.run(
                f"{work}/target {payload}",
                stdout=True, stderr=True, timeout_s=60,
            )
            run_out, run_err = await run_proc.communicate_text()
            log = run_out + "\n" + run_err
            return {"triggered": "SIGSEGV" in log, "log": log}

The `-fno-stack-protector` flag is intentional. The targets under `autosec_research_agent/targets/` are written to be vulnerable, and the validation criterion is whether the binary actually segfaults (indicated by `SIGSEGV`), so stack protection would mask the bug we want to confirm. The resulting agent run produces a report that displays the vulnerabilities for each script and the agent's reasoning behind each one.

The AutoSec agent statically analyses `n` files for vulnerabilities, develops a hypothesis for how to build a payload to exploit that vulnerability, and validates the hypothesis in the network-blocked and filesystem-limited sandbox.

Both agents use `union.sandbox` with the `userns` backend, which opens a Linux user-namespace sandbox on the same node (no VM provisioning), mounts a host work directory, blocks network egress, and tears the session down in `__aexit__`. If the inner process segfaults, hangs, or attempts an outbound connection, the orchestrator records the run and contains the failure inside the sandbox.

Running these on the same node is roughly an order of magnitude faster than spinning up a VM, which matters at the rates these agents call the sandbox. The autoresearch agent fans out four experiments per iteration; AutoSec fans out one investigation per target file. Per-call VM provisioning would dominate wall-clock time, and isolation would stop being affordable as a default. For workloads that need stronger isolation than user namespaces or `bubblewrap` provide (protection against kernel-level escapes, for instance), you can swap in something like gVisor without changing the call sites.

2. Self-healing through right-sizing

The autoresearch agent runs experiments with very different memory footprints in the same batch. A two-layer TinyGPT with `n_embd=128` fits in a couple of gigabytes; a 24-layer model with `n_embd=1024` and a large device batch size needs closer to sixteen. Picking one resource shape for every experiment wastes compute on the small ones and OOMs the large ones, and the agent is the only thing in the loop that actually knows what each experiment looks like.

The runtime gives the agent a way to size each call individually through a `right_size` `call_handler`:

@tool(call_handler=tools.right_size)
@experiment_env.task
async def run_experiment(
    title: str,
    time_budget_sec: int = 45,
    memory_key: str = tools.MEMORY_KEY_FANOUT,
) -> dict:
    """Train using agent-edited train.py with LLM right-sizing and OOM self-healing."""
    ...

The handler does two things. First, it asks a small capacity-planning LLM to propose `cpu`/`memory`/`disk` for the specific call. The system prompt makes the model reason in Kubernetes terms about the work implied by the arguments:

RESOURCE_SIZING_SYSTEM_PROMPT = """\
You are a Kubernetes capacity planner for Flyte autoresearch sandbox training runs.
...
Reason about the work implied by the arguments:
- TinyGPT training is memory-bound: scale with model width/depth and batch size.
- ...
Return a single JSON object with optional keys: cpu, memory, disk.
"""

The LLM's estimate is treated as a hypothesis, not a contract. The handler clamps it to a fixed range before dispatching:

RESOURCE_FLOOR   = flyte.Resources(cpu=2,  memory="2Gi")
RESOURCE_CEILING = flyte.Resources(cpu=16, memory="32Gi")

def _cap_resources(resources: flyte.Resources) -> flyte.Resources:
    # 2 <= cpu <= 16, 2Gi <= memory <= 32Gi
    ...

If the planner asks for 128Gi, the runtime gives it 32Gi. If it asks for 256Mi, the runtime gives it 2Gi.
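
The body of `_cap_resources` is elided above, so here is a minimal sketch of the clamping arithmetic in plain Python, without the Flyte SDK. The helper names (`parse_mib`, `clamp`, `cap_request`) are hypothetical and only the `Mi` and `Gi` suffixes are handled; the real handler may normalize quantities differently.

```python
import re

# Assumption: only the two binary suffixes used in this post are supported.
_UNITS_MIB = {"Mi": 1, "Gi": 1024}


def parse_mib(quantity: str) -> int:
    """Convert a memory quantity such as '2Gi' or '512Mi' to whole MiB."""
    match = re.fullmatch(r"(\d+)(Mi|Gi)", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS_MIB[match.group(2)]


def clamp(value: int, low: int, high: int) -> int:
    """Pin value into the closed interval [low, high]."""
    return max(low, min(high, value))


def cap_request(cpu: int, memory: str) -> dict:
    """Clamp an estimate to 2 <= cpu <= 16 and 2Gi <= memory <= 32Gi."""
    mem_mib = clamp(parse_mib(memory), parse_mib("2Gi"), parse_mib("32Gi"))
    return {"cpu": clamp(cpu, 2, 16), "memory": f"{mem_mib}Mi"}
```

With these bounds, `cap_request(64, "128Gi")` comes back as 16 CPUs and 32768Mi, and `cap_request(1, "256Mi")` comes back as 2 CPUs and 2048Mi.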

Second, if the experiment OOMs anyway, the handler catches the failure and retries with deterministically bumped memory, up to `MAX_OOM_RETRIES`:

MAX_OOM_RETRIES = 3

async def execute_with_right_sizing(call_llm, target_task, **kwargs):
    resources = await estimate_resources(call_llm, target_task, kwargs)
    attempt = 0
    while True:
        try:
            return await target_task.override(resources=resources).aio(**kwargs)
        except flyte.errors.OOMError:
            if attempt >= MAX_OOM_RETRIES:
                raise
            resources = bump_memory(resources)  # min(ceiling, max(prev*2, prev+2048))
            attempt += 1

OOM detection has to inspect two layers because the training process runs inside the sandbox. If the pod itself dies, Flyte raises `flyte.errors.OOMError`. If only the inner training process dies, the sandbox returns exit code 137 or surfaces an OOM marker in stderr:

OOM_MARKERS = (
    "out of memory", "oom", "cannot allocate memory",
    "memoryerror", "killed", "signal 9", "std::bad_alloc", ...
)

def is_oom(stderr: str, returncode: int | None, *, stdout: str = "") -> bool:
    if returncode in (137, -9):
        return True
    text = f"{stderr}\n{stdout}".lower()
    return any(marker in text for marker in OOM_MARKERS)

The result row carries `oom_retries`, so downstream code (and the agent reading the leaderboard) can distinguish an experiment that landed on its first allocation from one that needed a bump. In a four-experiment batch I ran, `val_bpb` improved from 3.130 to 2.880. Three experiments completed on the first allocation, and the fourth came back from the planner with 8Gi, OOM'd, was retried at 16Gi, and finished. The agent observed a successful experiment with `oom_retries=1`.

The same recovery shape shows up in the AutoSec researcher's static-scan stage, but without the capacity-planning LLM. The tool catches OOM and re-dispatches itself with hardcoded larger resources and a narrower scope:

@env.task(retries=2, timeout=30)
async def scan_static(source: str, scope: str = "whole") -> str:
    try:
        return _grep_dangerous_calls(source) or "(no dangerous-call sites found)"
    except flyte.errors.OOMError as exc:
        print(f"[scan_static] {exc}; escalating resources + narrowing scope")
        return await scan_static.override(
            short_name="scan_static_more_resources",
            resources=flyte.Resources(cpu=2, memory="4Gi"),
        )(source, scope="file")

The two sizing strategies (planner-driven for the autoresearch agent, hardcoded fallback for AutoSec) share one underlying primitive: a tool body catches `flyte.errors.OOMError` and re-dispatches the same task via `.override(resources=...)`. The agent doesn't have to model capacity correctly on the first try, and infrastructure failure doesn't propagate out of the tool boundary.

In this run, the agent executes an experiment that fails with an OOM error, but the agent loop logic catches the error and re-runs the experiment with more memory to succeed.

3. Durable state across runs

A research run that loses its leaderboard when the pod restarts is not a research run. It's four hours of GPU compute you can't reuse.

The autoresearch agent persists the leaderboard, the agent-edited `train.py` files (one per experiment at `memory/code/{slug}.py`), the hypotheses recorded before each batch, the per-experiment config overrides, and a config-signature index. All of it lives in `MemoryStore`.

The agent writes to memory through a tool in its own catalog. When it proposes a batch, it calls `edit_train_code_batch` to atomically save every edit:

@tool
@agent_env.task
async def edit_train_code_batch(
    edits: list[dict[str, Any]],
    memory_key: str = MEMORY_KEY_FANOUT,
) -> dict:
    """Save multiple edited train.py files in one atomic memory write.

    Each edit must include title + change_summary, plus either train_py
    (full source) or config_overrides (e.g. {"n_layer": 6}).
    """
    if not edits:
        return {"count": 0, "titles": [], "edits": [], "errors": [...]}
    return await _persist_train_edits(memory_key, edits, ...)

At the start of every session, the orchestrator hydrates a directive from prior research:

memory = await MemoryStore.get_or_create.aio(key="parallelized-autoresearch")
history = await load_research_history("parallelized-autoresearch")
directive = build_directive(n_experiments=8, batch_size=4, prior=history)
result = await agent.run.aio(directive, memory=memory)
await memory.save.aio()

If you re-run with the same `memory_key`, the agent sees its earlier work in the directive. It doesn't redo experiments 1-5 to figure out the best so far; it reads the leaderboard, picks the current frontrunner, and proposes experiment 6. The cost of stopping and restarting is a session boundary, not a redo.
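
To make the resume step concrete, here is a rough sketch of how a persisted leaderboard could be turned into "the frontrunner plus the next experiment number". It uses a plain JSON string rather than `MemoryStore`, and the row fields (`title`, `success`, `val_bpb`) and the function name are assumptions for illustration, not the tutorial's schema.

```python
import json


def next_step(leaderboard_json: str) -> dict:
    """Pick the best finished run and the number of the next experiment."""
    rows = json.loads(leaderboard_json)
    finished = [row for row in rows if row.get("success")]
    # Lower val_bpb is better; default=None covers an empty or all-failed history.
    best = min(finished, key=lambda row: row["val_bpb"], default=None)
    return {"next_experiment": len(rows) + 1, "baseline": best}
```

Given five stored rows, this returns `next_experiment == 6` and the lowest-`val_bpb` successful row as the baseline, which is the information the directive needs in order to avoid redoing earlier work.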

`MemoryStore` is doing something other than what most agent frameworks call memory. It isn't a conversation buffer. Agent-written code lives in it. The leaderboard lives in it. Resuming the agent is closer to handing it back a lab notebook than restoring a chat session.

The parallelized autoresearch agent keeps its memory in object storage, so subsequent runs with the same memory key can resume where prior runs left off.

4. Bounded loops

The most common production failure mode is a loop the agent doesn't escape. The model proposes a change, the metric moves a hair, it proposes another change, and you wake up to a sixty-dollar bill. Each of the three agents enforces some form of bounded reflection in its instructions.

The drug molecule screening agent uses the simplest version, declared in the system prompt:

SCREENING_AGENT_INSTRUCTIONS = """\
You are a medicinal chemistry screening strategist. ...

3. Read the summary returned by generate_report. Reflect:
   - If all_criteria_met == 0: relax exactly ONE profile bound by ~10-20%
     and re-run screen_candidates then generate_report only, reusing the
     same molecule_dir and properties_json.
   - Maximum ONE rescreen iteration.
4. Finish with plain text: top candidate, rationale, funnel interpretation,
   suggested next steps (docking, ADMET).
"""

One rescreen, one relaxed bound, then the agent has to stop. The contract is in the prompt.

The AutoSec researcher layers two bounds. The hypothesis stage is an inner `Agent` with `max_turns=6`, wrapped in an outer Flyte task with `retries=3`:

hypothesis_agent = Agent(
    name="autosec-hypothesis",
    instructions=ANALYSIS_INSTRUCTIONS,
    model=MODEL,                            # claude-haiku-4-5
    tools=[scan_static, build_poc, validate_in_sandbox],
    max_turns=6,
)

@env.task(retries=3, timeout=20)
async def hypothesize(source: str, static_findings: str) -> dict:
    result = await hypothesis_agent.run.aio(_prompt(source, static_findings), memory=[])
    hyp = _extract_json(result.summary or "")   # malformed JSON -> raises -> task retries
    if hyp.get("vulnerable") and "buffer_size" not in hyp:
        raise ValueError(f"vulnerable hypothesis missing buffer_size: {hyp}")
    return hyp

The inner loop terminates on max turns. The outer terminates on retry budget. Malformed JSON from the inner agent raises in the outer task, which retries, so a bad turn shows up as a clean dict on the next attempt rather than as an unbounded retry storm.
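
The outer bound can be sketched in plain `asyncio` to show why the retry budget is finite. This is not how Flyte implements `retries`; `run_with_retries` and `extract_json` are hypothetical stand-ins for the task-level retry and the `_extract_json` helper.

```python
import asyncio
import json


def extract_json(text: str) -> dict:
    """Raise ValueError on malformed output so the outer layer retries."""
    return json.loads(text)  # json.JSONDecodeError is a subclass of ValueError


async def run_with_retries(make_attempt, retries: int):
    """Await make_attempt() at most retries + 1 times, re-raising the last error."""
    for attempt in range(retries + 1):
        try:
            return await make_attempt()
        except ValueError:
            if attempt == retries:
                raise


async def demo() -> dict:
    # Simulated agent summaries: the first is malformed, the second parses.
    replies = iter(["I think it is vulnerable", '{"vulnerable": false}'])

    async def attempt() -> dict:
        return extract_json(next(replies))

    return await run_with_retries(attempt, retries=3)
```

Running `asyncio.run(demo())` yields the parsed dict from the second attempt. If every reply were malformed, the fourth failure would propagate instead of looping forever.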

The autoresearch agent's case is the most nuanced because plateaus are a real signal in ML experimentation. Its directive includes a plateau rule that switches strategies after three batches of negligible improvement:

PLATEAU RULE: if 3 consecutive batches fail to improve val_bpb by >0.01,
switch from config-overrides to substantive train.py refactoring
(LR schedule, optimizer, weight decay, grad clip, etc.), not more
config tweaks.
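
The plateau rule is an instruction to the model, not code. For readers who want the arithmetic spelled out, one deterministic reading of it is sketched below, assuming the input is the best `val_bpb` of each batch in order and that lower is better. The function name and this interpretation of "consecutive" are assumptions.

```python
def plateaued(batch_best: list[float], window: int = 3, min_gain: float = 0.01) -> bool:
    """True if none of the last `window` batches beat the running best by more than min_gain."""
    if len(batch_best) < window + 1:
        return False  # not enough history to compare against
    best = min(batch_best[:-window])
    for score in batch_best[-window:]:
        if best - score > min_gain:
            return False  # this batch made real progress
        best = min(best, score)
    return True
```

For example, a history of 3.10, 2.90, 2.895, 2.893, 2.892 counts as a plateau, because the last three batches each gained less than 0.01 over the running best.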

There is also a config-signature index in `MemoryStore` that hashes proposed configurations, so the agent can't accidentally re-run an equivalent experiment under a different title. Three failure-stop conditions exist for free: max turns at the agent level, retry budget at the task level, plateau pivot at the planning level. Each is declared in plain language and enforced by the runtime.
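
A config-signature index of this kind can be sketched with a canonical JSON dump and a hash. The tutorial's actual hashing scheme isn't shown here, so treat `config_signature` and `is_duplicate` as hypothetical names and SHA-256 over sorted-key JSON as an assumption.

```python
import hashlib
import json


def config_signature(config: dict) -> str:
    """Hash a config so the result does not depend on key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def is_duplicate(config: dict, seen: set[str]) -> bool:
    """Record the signature and report whether it was already present."""
    signature = config_signature(config)
    if signature in seen:
        return True
    seen.add(signature)
    return False
```

Because the title is not part of the hashed config, `{"n_layer": 6, "n_embd": 256}` and `{"n_embd": 256, "n_layer": 6}` collide regardless of what the experiments are called. Note that this sketch treats `6` and `6.0` as different values, so a real index would likely normalize numeric types first.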

5. Composable, auditable tools

When tools are real Flyte tasks (typed, cached, in their own containers, each carrying its own resources and trace lineage), composition is also an audit-trail decision.

The drug molecule screening agent is the cleanest illustration. Four tools, each `@tool` `@env.task`, each with the durability flag that fits the work:

@tool
@env.task(cache="auto")                     # one-time SMILES parsing
async def load_molecules(molecules_json: str = "") -> flyte.io.Dir:
    """Parse SMILES, validate with RDKit, generate 2D depictions."""

@tool
@env.task(report=True)                      # streams an HTML card per call
async def compute_properties(molecule_dir: flyte.io.Dir) -> str:
    """MW, LogP, HBD, HBA, TPSA, QED + Lipinski compliance flags."""

@tool
@env.task(report=True)
async def screen_candidates(properties_json: str, target_profile: str = "") -> str:
    """Score molecules; rank by composite score; funnel + Tanimoto similarity."""

@tool
@env.task(report=True)
async def generate_report(
    molecule_dir: flyte.io.Dir, properties_json: str, screening_json: str,
) -> str:
    """Final report: top-3 spotlights, distributions, diversity, recommendation."""

The library ships with fifteen real drugs: Aspirin, Ibuprofen, Caffeine, Metformin, Paracetamol, Penicillin G, and others. The chemical properties come from RDKit: molecular weight, LogP, hydrogen-bond donors and acceptors, TPSA, QED, Lipinski compliance flags. None of these are LLM outputs. `load_molecules` is cached, so a chemist iterating on the target profile only pays for screening and report generation on each turn. It outputs a `flyte.io.Dir`, which is an object-store-backed directory containing images and other files relating to the molecules in question.

Then, the three downstream tools have `report=True`, so each stage streams an HTML card to the UI as the agent calls it.

The drug screening agent is given tools to compute the properties of different drugs, screen them based on standard criteria, generate reports, and iteratively perform rescreenings if necessary.

The more interesting property is in the instructions: the agent passes tool outputs verbatim between steps. `screen_candidates` returns a JSON blob; the agent forwards that blob unmodified to `generate_report`. The runtime backs this up with a parser that rejects rewritten input:

def _parse_screening_json(screening_json: str) -> dict:
    """Parse screening JSON from screen_candidates, with safe defaults."""
    screening = json.loads(screening_json)
    if "ranked_molecules" not in screening:
        raise ValueError(
            "screening_json must be the exact JSON string returned by "
            "screen_candidates (missing 'ranked_molecules'). Do not construct, "
            "truncate, or summarize tool output."
        )
    ...

If the agent reformats or summarizes the JSON, the next tool refuses it. The runtime trace ends up showing the exact bytes flowing between tools, which is what a chemist or auditor needs to replay a run.

The other agents display the same property. Every LLM call, every sandbox session, every retry, every resource override is a typed step in the run record. The AutoSec orchestrator is small (an `asyncio.gather` over the targets directory), but each per-target investigation is its own Flyte action with independent checkpoints:

@env.task(report=True)
async def run_autosec_agent() -> dict:
    targets = {p.name: p.read_text() for p in sorted(TARGETS_DIR.glob("*.c"))}
    findings = list(await asyncio.gather(
        *(analyze_target(name, src) for name, src in targets.items())
    ))
    await flyte.report.replace.aio(_render_report_html(findings))
    return {
        "targets_analyzed": len(findings),
        "triggered": sum(1 for f in findings if f["verdict"].get("triggered")),
        "findings": findings,
    }

Three agents, one runtime

To summarize, we looked at three agents in very different domains:

AI research: The parallelized autoresearch agent runs autonomous ML research over the karpathy/climbmix corpus. The agent edits TinyGPT `train.py` in durable memory, fans out experiment batches via `flyte.map`, sizes each call through an LLM capacity planner with floor/ceiling clamps, runs the experiment in a userns sandbox, and self-heals OOM at two layers (pod and inner process). It exercises most of the runtime: `Agent` + `code_mode=True`, `@tool @env.task`, `MemoryStore`, `flyte.map.aio`, `union.sandbox`, `@tool(call_handler=right_size)` with `task.override(resources=...)`, `@env.task(report=True)`. Tutorial · Code

Security research: The AutoSec researcher finds memory-corruption bugs in C programs and validates them by triggering the crash inside the same userns sandbox shape. It fans out across `targets/*.c` with `asyncio.gather`; the hypothesis stage uses an inner `Agent(max_turns=6)` wrapped in a Flyte task that retries on malformed JSON. The tutorial ships with `AUTOSEC_FORCE_*` env vars (OOM on static scan, 600s LLM timeout, hallucinated tool call) so you can watch the healing happen in the run graph rather than take it on faith. Tutorial · Code

Drug screening: The drug molecule screening agent drives an RDKit virtual screen over fifteen real drugs from a natural-language brief, runs `load_molecules → compute_properties → screen_candidates → generate_report`, and rescreens at most once if the funnel is empty. It is the lightest of the three; it shows that the same `Agent` + `@tool` primitives compose into clean pipelines without sandbox or memory overhead when the workload doesn't need them. Tutorial · Code

Conclusion: make the runtime robust to LLM failures

Most agent frameworks treat the LLM as the durable component and the surrounding infrastructure as glue code. Flyte 2 inverts that. The runtime is the durable layer (tasks, memory, sandbox, retries, traces, reports, resource overrides). The LLM is the flaky component the runtime is built to recover from.

You can implement any of these agents with raw `asyncio` and `httpx`. The control flow isn't difficult. The work is in the long tail: transient API failures, OOMs on a specific config, mid-run crashes that lose leaderboards, untrusted code that has to be isolated and torn down regardless of outcome, agent-proposed resource estimates that need to stay inside a sane envelope. Most teams handle each of these with a different vendor: a scheduler, a secrets manager, an observability tool, a sandbox, a retry wrapper, a capacity-planning service.

With Flyte, you get all of the core primitives you need for building production-grade agents:

Construct	Use case
`Agent`	A Flyte-native agent harness that implements the loop
`@env.task`	Tools that run in their own containers
`task(retries=...)`	Recover from ephemeral network issues
`cache="auto"`	Save work done by tools to avoid recomputation
`task.override(resources=...)`	Core mechanism for the agent to hook into the infrastructure layer
`@tool(call_handler=right_size)`	Reusable components for dynamically modifying tool call behavior
`MemoryStore`	Object-store-backed state store for the agent to persist its work
`flyte.map.aio`	A utility for fanning out to an arbitrary number of tool calls
`union.sandbox`	Securely execute LLM-generated or otherwise untrusted code

At the beginning of this post, we said that an AI runtime is composed of three things: durable orchestration, real-time serving, and multi-silicon infrastructure. The agents we talked about here leverage the first and third pieces. In subsequent posts, we’ll talk more about how you can self-host LLMs (typically small language models or SLMs) on your own infrastructure to perform tasks that don’t require a state-of-the-art LLM to complete.

Lastly, because Flyte provides a pure Python SDK for building agents, you can bring your own agent framework, like LangGraph, Pydantic AI, and more. In future posts, we’ll also cover how you can use Flyte to productionize existing agent harnesses built in these frameworks and get the same durability guarantees that we showed here.

Try it

The fastest way to feel the difference is to run any of the three on a Flyte DevBox, a managed Flyte cluster you can stand up in about a minute:

git clone https://github.com/unionai/unionai-examples
cd unionai-examples

# Parallelized autoresearch: 6 experiments, batches of 3
uv run --script v2/tutorials/parallelized_autoresearch/parallelized_autoresearch.py \
    --n-experiments 6 --batch-size 3 --num-shards 1

# AutoSec researcher: fans out across the bundled vulnerable + secure C targets
uv run --script v2/tutorials/autosec_research_agent/main.py

# Drug molecule screening: agent over an RDKit pipeline, ~15 default molecules
uv run --script v2/tutorials/drug_molecule_screening/drug_molecule_screening.py

You'll need an Anthropic API key as a Flyte secret. Reports show up in the Union UI. Watch an experiment in the autoresearch agent get sized at 8Gi, OOM, and finish at 16Gi without the agent intervening. Force a hallucinated tool call in the AutoSec researcher with `AUTOSEC_FORCE_BAD_TOOL_CALL=1` and watch the parser raise, the task retry, and the agent finish the investigation. Run the drug molecule screening agent against a strict target profile and watch it rescreen once and stop.

When something goes wrong in any of them, the run record tells you what happened, the agent didn't bill you for the failure, and the next turn keeps moving. That's the part of the agent stack that's worth getting right.
