What is llm-d?
llm-d is an open source solution for managing high-scale, high-performance Large Language Model (LLM) deployments. LLMs are at the heart of generative AI – so when you chat with ChatGPT or Gemini, you’re talking to an LLM.
Simple LLM deployments, where an LLM is deployed to a single server, can suffer from latency issues, even with just one user. This can be because of a lack of memory bandwidth on the server, or because of KV cache pressure on memory. This means that you’re kept waiting for the LLM to respond to your question or instruction, which can really drag, and nobody likes to be kept waiting.
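To get a feel for why memory bandwidth matters so much, here’s a rough back-of-envelope sketch. The model size and bandwidth figures are illustrative assumptions, not measurements from any particular server:

```python
# Rough roofline estimate of decode speed for a single, unbatched request.
# During decode, each new token requires streaming (at least) all model
# weights from GPU memory, so memory bandwidth caps tokens per second.
# All numbers below are illustrative assumptions.

model_params = 70e9        # assume a 70B-parameter model
bytes_per_param = 2        # FP16/BF16 weights
weight_bytes = model_params * bytes_per_param   # ~140 GB of weights

hbm_bandwidth = 2e12       # assume ~2 TB/s of GPU memory bandwidth

# Upper bound on decode speed for one sequence (ignores KV-cache reads,
# batching, and multi-GPU sharding, all of which change the picture):
tokens_per_second = hbm_bandwidth / weight_bytes
print(f"~{tokens_per_second:.0f} tokens/s per sequence")   # ~14 tokens/s
```

Batching many requests together amortizes those weight reads, which is one reason the token-generation stage benefits from being managed and scaled separately, as described below.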
llm-d tries to solve this by splitting up the LLM deployment (disaggregating it) and separating the different components of the architecture onto dedicated hardware, so that the various parts of the system can be managed and scaled independently. llm-d is a cloud-native system that uses Kubernetes as the orchestration engine for all of this, so managing the necessary resources can be done automatically using Kubernetes’ automation features.
Why run your own LLM service?
For some organizations, sovereignty (that is, keeping things under your own control, governance and oversight) is imperative. That’s true for sensitive data, and also for sensitive data processing, like the things folks do with LLMs, such as building Retrieval-Augmented Generation (RAG) systems or agentic workflows. So for those organizations, there’s no question that they’re going to want to run an LLM service under their own watch, on their own systems, even in their own data center. With open-weight large language models like Kimi-2.5 and GM-5 becoming available that can hold their own against Gemini Pro, Claude Sonnet and Grok, there’s never been a better time to run a sovereign AI Factory. And that’s where llm-d comes into its own.
Architecture of llm-d

At a high level, llm-d is composed of four major components, all of which run on a Kubernetes cluster (a simplified request-flow sketch follows the list):
- Inference scheduler – this part is an adaptive load balancer, responsible for intelligently routing user questions to worker nodes that have already cached context relevant to the user’s question. It uses metrics pulled from a Prometheus metrics endpoint to make routing decisions.
- Cache manager – this part is responsible for coordinating LLM key-value (KV) caches. Getting caching right is critical to achieving the best possible LLM performance.
- Prefill worker – llm-d splits the actual LLM workload in two. The prefill component performs the heavy, compute-intensive processing of prompts and can be scaled independently.
- Decode worker – the decode component performs the memory-bandwidth-bound task of generating tokens (this is the part responsible for writing the answer to the user’s question).
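To make the division of labour more concrete, here’s a minimal, purely illustrative sketch of how a single request might flow through these components. Every class and function name here is hypothetical (the real components are separate services communicating over the network, and the real scheduler uses much richer signals), but the shape of the flow is the same: schedule, prefill, then decode.

```python
from dataclasses import dataclass, field

@dataclass
class DecodeWorker:
    """Hypothetical decode worker: memory-bandwidth-bound token generation."""
    name: str
    kv_cache: set = field(default_factory=set)   # prefix hashes it has cached

    def decode(self, prompt: str, kv_state: str) -> str:
        self.kv_cache.add(hash(prompt))           # remember this context
        return f"<answer generated by {self.name} using {kv_state}>"

def prefill(prompt: str) -> str:
    """Hypothetical prefill worker: compute-intensive prompt processing that
    produces the KV state the decode stage will read from."""
    return f"kv-state({len(prompt)} prompt characters)"

def schedule(prompt: str, workers: list[DecodeWorker]) -> DecodeWorker:
    """Hypothetical inference scheduler: prefer a worker that already has the
    prompt's context cached (in llm-d this decision is driven by metrics
    scraped from Prometheus endpoints)."""
    for w in workers:
        if hash(prompt) in w.kv_cache:
            return w                                   # cache hit: route here
    return min(workers, key=lambda w: len(w.kv_cache))  # else: least loaded

workers = [DecodeWorker("decode-0"), DecodeWorker("decode-1")]
prompt = "Explain what llm-d is in one sentence."

target = schedule(prompt, workers)      # 1. scheduler picks a decode worker
kv_state = prefill(prompt)              # 2. prefill does the heavy prompt work
print(target.decode(prompt, kv_state))  # 3. decode streams out the answer
```

In the real deployment, prefill and decode run as separate workers so each can be scaled to match its own bottleneck (compute versus memory bandwidth), with the cache manager coordinating where the KV caches live.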
llm-d is designed to work with the very high-performance hardware these setups need, like servers with enterprise-grade GPUs and InfiniBand network switching, a high-bandwidth alternative to classic Ethernet.
Getting hands-on
I find that the best way to learn about something in depth is to work with it. So to that end, I put together some Juju charms for Ubuntu to get a better understanding of how llm-d works for myself. They let you deploy an LLM to an llm-d setup in a clean, straightforward way, without needing to be a Kubernetes guru.
I’ve made the source code available on GitHub, along with some instructions on how to build the code and get things up and running. Note that I was just playing around with Juju charms when I built these: there may be bugs, and they are not supported by Canonical, so use them at your own risk.
The diagram below illustrates how the various Juju charms that manage the system are integrated.

Juju charms offer a clean approach to DevOps, and the system has primitives for both cloud infrastructure and Kubernetes. If you’d like to dig in and learn more about how to develop or use Juju charms, head over to our Juju page.
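If you haven’t come across a charm before, the skeleton below gives a flavour of what one looks like. It follows the ops framework’s standard pattern of observing Juju lifecycle events; the charm name and event handlers are placeholders rather than code taken from my llm-d charms.

```python
import ops

class LlmWorkerCharm(ops.CharmBase):
    """Placeholder charm skeleton, not taken from the actual llm-d charms."""

    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        # React to the standard Juju lifecycle events.
        framework.observe(self.on.install, self._on_install)
        framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_install(self, event: ops.InstallEvent) -> None:
        self.unit.status = ops.MaintenanceStatus("installing workload")

    def _on_config_changed(self, event: ops.ConfigChangedEvent) -> None:
        # Typically you would read config options here and (re)configure
        # the workload before reporting the unit as ready.
        self.unit.status = ops.ActiveStatus()

if __name__ == "__main__":
    ops.main(LlmWorkerCharm)
```

A charm like this gets packed with charmcraft and deployed with juju deploy; Juju then drives it through its lifecycle and wires it up to the other charms it’s integrated with.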
Further reading