Implications of object storage retention or lifecycle policies on the data plane bucket.

# Data retention policies

The data plane uses an object store bucket (S3, GCS, or ABS) to hold the **raw data** of your workflows — files, directories, dataframes, models, and other large values offloaded from task inputs and outputs. Bucket-level retention and lifecycle policies (such as S3 lifecycle rules) purge this raw data over time and can break historical executions, caches, and trace-based recovery.

## Where metadata vs. raw data lives

It helps to be precise about what "metadata" means here, because the term is used in very different ways across the industry.

**Metadata in Union.ai lives in the control plane database.** It includes:

- Task, trigger, and app definitions.
- Execution status, history, schedules, and audit trail.
- The literal values of task inputs and outputs — primitives (ints, strings, etc.), JSON-serializable dataclasses, and similar small values that are passed *by value* between tasks. The DB stores these values directly.

**Raw data lives in the data plane object-store bucket.** It includes:

- `flyte.io.File` / `flyte.io.Dir` contents.
- `flyte.io.DataFrame` payloads.
- Models and other pickled / large objects.
- `Deck` data and artifact payloads.
- [Trace](https://www.union.ai/docs/v2/union/deployment/user-guide/task-programming/traces) checkpoints.

When a task input or output is *too large to be passed inline as a literal value*, it is offloaded: the bytes are written to the bucket as raw data, and the literal that the DB stores becomes a **pointer** (URI) into the bucket. The DB still holds the canonical record of the input/output — it just holds a reference instead of the value.
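The inline-vs-offloaded split can be sketched as follows. This is an illustrative model only, not the actual Union.ai wire format: the `Literal` class, the `INLINE_LIMIT` threshold, and the URIs are all hypothetical, standing in for the real literal representation the control plane stores.

```python
# Illustrative sketch: a literal either carries a small value inline (stored
# directly in the control-plane DB) or a pointer (URI) to offloaded raw data
# in the data-plane bucket. All names and the size threshold are hypothetical.
from dataclasses import dataclass
from typing import Any, Optional

INLINE_LIMIT = 1024  # hypothetical inline-size threshold, in bytes


@dataclass
class Literal:
    value: Optional[Any] = None          # small values: stored by value in the DB
    offloaded_uri: Optional[str] = None  # large values: pointer into the bucket


def make_literal(payload: bytes, uri_for_offload: str) -> Literal:
    if len(payload) <= INLINE_LIMIT:
        return Literal(value=payload)  # DB stores the bytes directly
    # Too large: raw bytes are written to the bucket; the DB keeps only the URI.
    return Literal(offloaded_uri=uri_for_offload)


small = make_literal(b"42", "s3://data-plane-bucket/run-1/out/0")
large = make_literal(b"x" * 10_000, "s3://data-plane-bucket/run-1/out/1")
```

Under this model, a lifecycle policy that deletes the object behind `large.offloaded_uri` leaves the DB record intact but makes the pointer dangling, which is exactly the failure mode described in the next section.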

## Impact of raw data loss

A retention policy that purges raw data leaves the metadata in the control plane database intact, but the references it holds to offloaded values become **dangling pointers**. The effects:

| Area | Impact |
|------|--------|
| **UI and APIs** | Execution detail views still render (status, timing, structure all come from the DB), but input/output previews for offloaded values, `Deck` views, and artifact payload links resolve to purged blobs and fail with "resource not found." |
| **Execution engine** | Re-runs and downstream tasks that consume a purged upstream output fail at runtime; in-flight tasks that depend on a node whose output has just been purged fail as well. |
| **Caching** | A cache hit may resolve to a pointer whose underlying raw data has been purged, producing cache misses, task re-execution, or failure. |
| **Traces** | [Trace](https://www.union.ai/docs/v2/union/deployment/user-guide/task-programming/traces) checkpoints used by `@flyte.trace` for fine-grained recovery are stored in the bucket; if purged, resume-from-checkpoint is not possible for affected executions. |
| **Operations** | The DB record of what ran, when, and with what *small literal* inputs/outputs is preserved. The record of what *large offloaded* inputs/outputs each task produced is lost wherever the raw data has been purged. |
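The dangling-pointer condition in the table above amounts to a set difference between what the DB references and what still exists in the bucket. A minimal audit sketch, assuming you can list both sides (the function name and URIs are hypothetical, not a Union.ai API):

```python
# Hypothetical audit sketch: given the offloaded URIs the control-plane DB
# references and the objects that still exist in the bucket, find the dangling
# pointers left behind by a retention policy.
def find_dangling(db_refs: set[str], bucket_objects: set[str]) -> set[str]:
    """URIs recorded in metadata whose underlying raw data has been purged."""
    return db_refs - bucket_objects


db_refs = {
    "s3://bucket/run-1/out/model.pkl",   # old run: raw data aged out
    "s3://bucket/run-2/out/df.parquet",  # recent run: raw data still present
}
bucket_objects = {"s3://bucket/run-2/out/df.parquet"}

dangling = find_dangling(db_refs, bucket_objects)
```

Any URI in `dangling` corresponds to an execution whose previews, cache hits, or checkpoint-based recovery will fail as described above.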

## Applying retention deliberately

Retention on the raw-data bucket is a legitimate cost-management strategy as long as you accept the trade-off: historical executions that referenced purged data will no longer be re-runnable from cache or recoverable from trace checkpoints, and their UI views will show missing blobs. New executions are unaffected.

In particular, be aware of the trade-offs for:

- **Cached task outputs** — purging the cached raw data invalidates the cache; affected tasks re-execute on the next call.
- **Trace checkpoints** — purging prevents resume-from-checkpoint for executions whose checkpoints have aged out.
- **Historical execution previews** — purged raw data will show as "resource not found" in the UI even though the DB still has the rest of the record.

Data correctness is not silently violated: re-runs read current raw data, and the DB record remains the source of truth for what executed. What you give up is the ability to recover or inspect old offloaded values.

## Designing lifecycle rules

The Union.ai data plane uses a **single object-store bucket** for all execution data. Inside that bucket, content is split by **prefix**, controlled by two settings in the engine configuration:

- `config.core.propeller.metadata-prefix` (default `metadata/propeller`) — where the engine writes its per-execution working files: `inputs.pb`, `outputs.pb`, `error.pb`, `futures.pb`, and `deck.html`. These are required for in-flight workflows to complete and for historical-execution input/output and Deck previews to render. **Despite the name, this is not Union.ai metadata in the customer-facing sense** — that lives in the control plane database (see [above](#where-metadata-vs-raw-data-lives)). The prefix name reflects Union's internal terminology.
- `config.core.propeller.rawoutput-prefix` (defaults to the bucket root) — where raw outputs land: `flyte.io.File` / `flyte.io.Dir` contents, `flyte.io.DataFrame` payloads, checkpoint data, and other offloaded values. This is the prefix to which retention can safely be applied.

When designing S3 lifecycle rules (or the GCS/ABS equivalent), **scope expiration to prefixes other than `metadata/propeller/`** so the engine working state stays durable. Typical patterns are rules scoped to domain/project prefixes (or specific subpaths under the bucket root) rather than a bucket-wide rule.
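As a concrete sketch, here is the shape of an S3 lifecycle configuration (the payload format accepted by the S3 `PutBucketLifecycleConfiguration` API, e.g. via boto3) that expires raw data under a project-scoped prefix while leaving `metadata/propeller/` untouched. The rule ID, prefix, and retention window are illustrative, not recommendations:

```python
# Sketch of an S3 lifecycle configuration payload. The expiration rule is
# scoped to a raw-data prefix ("project-a/" here, hypothetical) so that the
# engine's working files under metadata/propeller/ are never matched.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-raw-data-project-a",
            "Status": "Enabled",
            # Scope expiration to a raw-data prefix, NOT metadata/propeller/.
            "Filter": {"Prefix": "project-a/"},
            "Expiration": {"Days": 90},
        }
    ]
}

# Sanity check before applying: no rule's prefix may cover the engine
# working-state prefix (a key matches a rule when it starts with the prefix).
assert all(
    not "metadata/propeller/".startswith(rule["Filter"]["Prefix"])
    for rule in lifecycle["Rules"]
)
```

The same check generalizes: an empty `Prefix` (a bucket-wide rule) would trip the assertion, which is exactly the pattern to avoid.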

Validate any retention rule in a non-production environment before applying it broadly.

## Per-run customization

The raw-data location can be overridden **per run** via [`flyte.with_runcontext()`](https://www.union.ai/docs/v2/union/deployment/user-guide/task-deployment/run-context#storage) using the property `raw_data_path`. This property defines the prefix for offloaded raw data (`flyte.io.File`, `flyte.io.Dir`, `flyte.io.DataFrame`, checkpoints, etc.).
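For example, a run whose outputs must outlive the bucket's default retention can be pointed at a prefix with a longer-lived (or no) lifecycle rule. The `raw_data_path` parameter is documented behavior, but the bucket name, prefix, and task below are placeholders:

```python
# Sketch: routing one run's offloaded raw data to a longer-retention prefix.
# "data-plane-bucket" and "long-retention/experiments" are hypothetical names.
raw_data_path = "s3://data-plane-bucket/long-retention/experiments"

# With the flyte SDK available, a run would pass it like this (my_task is a
# placeholder for any flyte task):
#
#   import flyte
#   run = flyte.with_runcontext(raw_data_path=raw_data_path).run(my_task, x=1)
#
# Offloaded File/Dir/DataFrame payloads and checkpoints for that run then land
# under this prefix instead of the configured rawoutput-prefix:
expected_prefix = raw_data_path + "/"
```

Pairing this with lifecycle rules scoped per prefix gives per-run control over how long raw data survives.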

---
**Source**: https://github.com/unionai/unionai-docs/blob/main/content/deployment/selfmanaged/configuration/data-retention.md
**HTML**: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/data-retention/
