Serve your LLM with MAX Serve

MAX Serve is a high-performance inference server for deploying large language models. In this tutorial, we cache a model from Hugging Face and serve it with MAX Serve on Union Serving.

Once you have a Union account, install union:

pip install union

Export the following environment variable to build and push images to your own container registry:

# replace with your registry name
export IMAGE_SPEC_REGISTRY="<your-container-registry>"

Then run the following commands to clone the examples repository and execute the workflow:

$ git clone https://github.com/unionai/unionai-examples
$ cd unionai-examples
$ union run --remote <path/to/file.py> <workflow_name> <params>

The source code for this example can be found here.

Managing Dependencies

First, we import the dependencies needed to define the Union App:

import os

from flytekit.extras.accelerators import L4
from union import Artifact, ImageSpec, Resources
from union.app import App, Input

To define the image, we install union-runtime into Modular's base image with the ImageSpec image builder. Set the IMAGE_SPEC_REGISTRY environment variable to a public registry you can push to. With python_exec="/opt/venv/bin/python", we configure the image builder to install any new packages into the base image's existing Python environment.

image = ImageSpec(
    name="modular-max",
    base_image="modular/max-nvidia-base:25.4.0.dev2025050705",
    builder="default",
    packages=["union-runtime>=0.1.18"],
    entrypoint=["/bin/bash"],
    python_exec="/opt/venv/bin/python",
    registry=os.environ.get("IMAGE_SPEC_REGISTRY"),
)

Defining the Union App

The workflow in cache_model.py caches the Qwen2.5 model from Hugging Face as a Union Artifact. Here we use that same Artifact as an Input to the Union App; it is downloaded to the path given by mount="/root/qwen-0-5". The args are set to the MAX Serve entrypoint, where --model-path=/root/qwen-0-5 configures MAX Serve to load the model from /root/qwen-0-5.

Qwen_Coder_Artifact = Artifact(name="Qwen2.5-Coder-0.5B")
modular_model = App(
    name="modular-qwen-0-5-coder",
    container_image=image,
    inputs=[Input(name="model", value=Qwen_Coder_Artifact.query(), env_var="MODEL", mount="/root/qwen-0-5")],
    args=[
        "python",
        "-m",
        "max.entrypoints.pipelines",
        "serve",
        "--model-path=/root/qwen-0-5",
        "--device-memory-utilization",
        "0.7",
        "--max-length",
        "2048",
    ],
    port=8000,
    requests=Resources(cpu="7", mem="20Gi", gpu="1", ephemeral_storage="20Gi"),
    accelerator=L4,
    scaledown_after=300,
)
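
A few notes on this configuration: accelerator=L4 schedules the app on an NVIDIA L4 GPU, and requests reserves the CPU, memory, and ephemeral storage the server needs alongside it. The --device-memory-utilization 0.7 and --max-length 2048 flags help keep MAX Serve's GPU memory footprint within the L4's 24 GiB, and scaledown_after=300 lets the app scale back down after roughly five minutes without traffic.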

Caching and Deploying the App

Run the workflow to cache the LLM:

union run --remote cache_model.py cache_model
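
The cache_model workflow itself lives in the examples repository. For orientation, here is a minimal sketch of what such a caching workflow can look like, assuming huggingface_hub is used for the download; the repo_id default, image spec, and resource requests are illustrative assumptions, and the actual cache_model.py may differ.

import os
from typing import Annotated

from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory
from huggingface_hub import snapshot_download
from union import Artifact, ImageSpec, Resources

# The same named Artifact that the App queries as its "model" input
Qwen_Coder_Artifact = Artifact(name="Qwen2.5-Coder-0.5B")

# Hypothetical image for the download task; it only needs huggingface_hub
cache_image = ImageSpec(
    name="qwen-cache",
    packages=["huggingface_hub", "union"],
    registry=os.environ.get("IMAGE_SPEC_REGISTRY"),
)

@task(
    container_image=cache_image,
    requests=Resources(cpu="2", mem="8Gi", ephemeral_storage="20Gi"),
)
def download_model(repo_id: str) -> Annotated[FlyteDirectory, Qwen_Coder_Artifact]:
    # Pull the full model snapshot from Hugging Face into a local directory;
    # returning it through the annotated output stores it as the named Artifact.
    local_dir = "/tmp/qwen-0-5"
    snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return FlyteDirectory(local_dir)

@workflow
def cache_model(repo_id: str = "Qwen/Qwen2.5-Coder-0.5B") -> FlyteDirectory:
    return download_model(repo_id=repo_id)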

Deploy the Union App backed by MAX Serve:

union deploy apps max_serve.py modular-qwen-0-5-coder
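
Once the app is running, you can send it requests from any HTTP client. MAX Serve exposes an OpenAI-compatible API, so the sketch below uses the openai Python client; the base_url placeholder and the model identifier are assumptions here, so substitute the endpoint URL that union deploy prints for the app and the model name the server reports.

from openai import OpenAI

# Replace the placeholder with the app endpoint URL printed by `union deploy`
client = OpenAI(base_url="https://<your-app-endpoint>/v1", api_key="EMPTY")

# The model identifier is assumed to match the serving path; check the
# server's model listing if this name is rejected.
response = client.chat.completions.create(
    model="/root/qwen-0-5",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(response.choices[0].message.content)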