Description
As a cluster operator managing traffic for generative models
I want to route prompt traffic within my cluster based on generative model request criteria
So that I can build a system to host multiple generative models.
Background
Generative AI models have an entirely different traffic pattern from typical HTTP REST, gRPC, or even L4 traffic. The goal of the Gateway API inference extension is to address generative AI model routing use cases that are unique to this type of traffic, and by doing so, increase GPU utilization and reduce request latency.
The inference extension works as a literal extension to the Gateway API.

The major features/use cases available in the inference extension today are as follows (example manifests illustrating several of them follow the list):
- Model-aware routing: instead of simply routing based on the path of the request, an inference gateway allows you to route to models based on the model name in the request. This is enabled by support for GenAI inference API specifications (such as the OpenAI API) in gateway implementations such as Envoy Proxy. Model-aware routing also extends to Low-Rank Adaptation (LoRA) fine-tuned models.
- Serving priority: an inference gateway allows you to specify the serving priority of your models. For example, you can give models serving online chat tasks (which are more latency-sensitive) a higher Criticality than models serving latency-tolerant tasks such as summarization.
- Model rollouts: an inference gateway allows you to incrementally roll out new model versions through traffic-splitting definitions based on model names.
- Extensibility for Inference Services: an inference gateway defines an extensibility pattern that lets additional inference services provide bespoke routing capabilities when the out-of-the-box solutions do not fit your needs.
- Customizable Load Balancing for Inference: an inference gateway defines a pattern for customizable load balancing and request routing that is optimized for inference. An inference gateway provides a reference implementation of model endpoint picking that leverages metrics emitted by the model servers; this endpoint-picking mechanism can be used in lieu of traditional load balancing mechanisms. Model server-aware load balancing ("smart" load balancing, as it is sometimes referred to in this repo) has been shown to reduce serving latency and improve accelerator utilization in your clusters.
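As a rough sketch of how model-aware routing is wired up (assuming the upstream `inference.networking.x-k8s.io/v1alpha2` API; the Gateway and pool names here are illustrative), an HTTPRoute forwards matching traffic to an InferencePool backend, and the gateway resolves the target model from the OpenAI-style request body:

```yaml
# Sketch only: names are illustrative and the API version may change upstream.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway          # hypothetical Gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct  # pool of model server Pods
```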
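For serving priority, a minimal sketch (again assuming the v1alpha2 CRDs; the model and pool names are made up) is a pair of InferenceModel resources that map client-facing model names to different Criticality levels on the same pool:

```yaml
# Sketch: criticality values follow the upstream v1alpha2 CRD; names are illustrative.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat
spec:
  modelName: chat                    # model name sent in the request body
  criticality: Critical              # latency-sensitive online chat traffic
  poolRef:
    name: vllm-llama3-8b-instruct
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: summarizer
spec:
  modelName: summarizer
  criticality: Sheddable             # latency-tolerant; may be shed under load
  poolRef:
    name: vllm-llama3-8b-instruct
```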
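Model rollouts could then look something like the following sketch, where targetModels weights split traffic between an existing and a new model version (or LoRA adapter) behind a single client-facing model name; the weights and names are illustrative:

```yaml
# Sketch: weights and names are illustrative.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: tweet-summary
spec:
  modelName: tweet-summary           # name clients request
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:
  - name: tweet-summary-v1
    weight: 90                       # existing version keeps 90% of traffic
  - name: tweet-summary-v2
    weight: 10                       # new version receives 10% of traffic
```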
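Finally, for the customizable load balancing piece, the InferencePool is where the endpoint picker plugs in. The sketch below (same assumptions about API version; selector labels, port, and Service name are illustrative) delegates endpoint selection to an endpoint picker Service via extensionRef, which then chooses a backend Pod using metrics emitted by the model servers instead of a traditional load balancing algorithm:

```yaml
# Sketch: selector labels, port, and the endpoint picker Service name are illustrative.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000             # port the model servers listen on
  selector:
    app: vllm-llama3-8b-instruct     # labels selecting the model server Pods
  extensionRef:
    name: vllm-llama3-8b-instruct-epp  # endpoint picker (EPP) Service
```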
After the initial implementation, we plan to keep up with and contribute to the inference extension working group in order to support these use cases for OSS users and NGINX customers.
Acceptance Criteria
- TBD