Serving LlamaIndex RAG with FastAPI and LanceDB
This tutorial shows you how to serve a LlamaIndex RAG application with FastAPI, using the Wikipedia page for Star Wars as the data source.
Once you have a Union account, install union:
pip install union
Export the following environment variable to build and push images to your own container registry:
# replace with your registry name
export IMAGE_SPEC_REGISTRY="<your-container-registry>"
Then run the following commands to run the workflow:
$ git clone https://github.com/unionai/unionai-examples
$ cd unionai-examples
$ union run --remote <path/to/file.py> <workflow_name> <params>
The source code for this example can be found here.
With Union’s App Serving SDK, we define three applications:
- A VLLM-backed app hosting the granite-embedding model to embed our data.
- A VLLM-backed app hosting the Qwen2.5 LLM to process the context.
- A FastAPI app with a single /ask endpoint that uses the above two apps to answer questions about Star Wars.
We start by importing the necessary libraries and modules for this example:
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding
from llama_index.core import VectorStoreIndex, Settings, StorageContext
from llama_index.readers.web import SimpleWebPageReader
from llama_index.vector_stores.lancedb import LanceDBVectorStore
from fastapi import FastAPI
from pydantic import BaseModel
from contextlib import asynccontextmanager
from union.app.llm import VLLMApp
from union.app import App
from union import Resources, ImageSpec
from flytekit.extras.accelerators import L4
import os
Caching From HuggingFace
First, we cache the model weights from HuggingFace, which requires a Hugging Face API key stored as a Union secret:
- Generate an API key: Obtain your API key from the Hugging Face website.
- Create a Secret: Use the Union CLI to create the secret:
$ union create secret hf-api-key
With the secret in place, we cache the two models by running:
$ union cache model-from-hf ibm-granite/granite-embedding-125m-english --hf-token-key hf-api-key \
    --union-api-key EAGER_API_KEY --cpu 2 --mem 6Gi --wait
$ union cache model-from-hf Qwen/Qwen2.5-0.5B-Instruct --hf-token-key hf-api-key \
    --union-api-key EAGER_API_KEY --cpu 2 --mem 6Gi --wait
Each command returns an artifact URI. Copy your own URIs and replace the values below:
EMBEDDING_ARTIFACT = "flyte://av0.2/demo/thomasjpfan/development/granite-embedding-125m-english-20@2abf07bdc0d2bcd17715afeeb293ebfb"
LLM_ARTIFACT = "flyte://av0.2/demo/thomasjpfan/development/Qwen2_5-0_5B-Instruct-20@0ed210eb88c94c16f17d4697a0a91212"
Configuring the VLLM Apps
We configure the two VLLM apps to serve the embedding model and the LLM with VLLMApp, defining their resources such as CPU, memory, and accelerator type. With stream_model=True, the model is streamed from the object store directly to the GPU.
image = "ghcr.io/unionai-oss/serving-vllm:0.1.17"
vllm_embedding_app = VLLMApp(
    name="granite-embedding",
    container_image=image,
    model=EMBEDDING_ARTIFACT,
    model_id="text-embedding-granite-embedding-125m-english",
    limits=Resources(cpu=7, mem="25Gi", gpu=1),
    requests=Resources(cpu=7, mem="25Gi", gpu=1),
    stream_model=False,
    extra_args="--dtype=half --max-model-len 512",
    scaledown_after=500,
    accelerator=L4,
)
vllm_chat_app = VLLMApp(
    name="qwen-app",
    container_image=image,
    requests=Resources(cpu=7, mem="24Gi", gpu="1"),
    accelerator=L4,
    model=LLM_ARTIFACT,
    model_id="qwen2",
    extra_args="--dtype=half --enable-auto-tool-choice --tool-call-parser hermes",
    scaledown_after=500,
    stream_model=True,
    port=8084,
)
Defining the FastAPI App
The FastAPI app uses a lifespan to define its startup behavior. Here we create a LlamaIndex query engine, using the embedder to convert the Wikipedia page into vectors and store them in a LanceDB database.
models = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    embedding = OpenAILikeEmbedding(
        model_name="text-embedding-granite-embedding-125m-english",
        api_key="XYZ",
        api_base=f"{vllm_embedding_app.endpoint}/v1",
        additional_kwargs={"extra_body": {"truncate_prompt_tokens": 512}},
    )
    source_url = "https://en.wikipedia.org/wiki/Star_Wars"
    llm = OpenAILike(
        api_base=f"{vllm_chat_app.endpoint}/v1",
        api_version="v1",
        model="qwen2",
        api_key="XYZ",
        is_chat_model=True,
        is_function_calling_model=True,
        max_tokens=2056,
    )
    Settings.llm = llm
    Settings.embed_model = embedding

    documents = SimpleWebPageReader(html_to_text=True).load_data([source_url])
    vector_store = LanceDBVectorStore(uri="lancedb-serving")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
    query_engine = index.as_chat_engine()
    models["query_engine"] = query_engine
    yield


app = FastAPI(lifespan=lifespan)
Here we define a simple /ask endpoint that queries the RAG and returns the response.
class MessageInput(BaseModel):
    content: str


@app.post("/ask")
async def ask(message: MessageInput) -> str:
    query_engine = models["query_engine"]
    chat_response = await query_engine.achat(message.content)
    return chat_response.response
Configuring the Image for FastAPI
Next, we use an ImageSpec to build an image with the dependencies defined in requirements.txt. In your environment, set IMAGE_SPEC_REGISTRY to your container registry:
fast_api_image = ImageSpec(
    name="simple-rag-serving",
    requirements="requirements.txt",
    builder="default",
    registry=os.environ.get("IMAGE_SPEC_REGISTRY"),
)
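The contents of requirements.txt are not reproduced in this tutorial; based on the imports above, it likely contains packages along these lines (a sketch, not the exact pinned file from the example repository):
# hypothetical requirements.txt sketch based on the imports used above
union
fastapi
llama-index-core
llama-index-llms-openai-like
llama-index-embeddings-openai-like
llama-index-readers-web
llama-index-vector-stores-lancedb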
We declare that the FastAPI app depends on both the embedding app and the LLM app by setting the dependencies parameter. The framework_app parameter configures the App object to load the FastAPI app as the entry point:
fast_app = App(
    name="llama-index-fastapi",
    container_image=fast_api_image,
    requests=Resources(cpu=2, mem="2Gi"),
    framework_app=app,
    dependencies=[vllm_embedding_app, vllm_chat_app],
    requires_auth=False,
    env={"LANCE_CPU_THREADS": "2"},
    scaledown_after=500,
)
Deploy the app by running:
$ union deploy apps llama_index_rag.py llama-index-fastapi
...
🚀 Deployed Endpoint: https://<union-url>
To query the endpoint with curl, run:
$ curl -X 'POST' \
    'https://<union-url>/ask' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
      "content": "What was the title of the first Star Wars Movie?"
    }'
This returns an answer to the query:
"The first Star Wars movie was titled \"A New Hope.\""