Description
As a cluster operator managing traffic for generative models
I want to route prompt traffic within my cluster based on generative model request criteria
So that I can build a system to host multiple generative models.
Background
Generative AI models have an entirely different traffic pattern from typical HTTP REST, gRPC, or even L4 traffic. The goal of the Gateway API inference extension is to address generative AI model routing use cases that are unique to this type of traffic, and by doing so, increase GPU utilization and reduce request latency.
The inference extension works as a literal extension to the Gateway API.

The major features/use cases available in the inference extension today are as follows (example manifests illustrating several of them follow the list):
- Model-aware routing: instead of simply routing based on the path of the request, an inference gateway allows you to route to models based on the model name in the request. This is enabled by support for GenAI inference API specifications (such as the OpenAI API) in gateway implementations such as Envoy Proxy. Model-aware routing also extends to Low-Rank Adaptation (LoRA) fine-tuned models.
- Serving priority: an inference gateway allows you to specify the serving priority of your models. For example, you can give models serving online chat tasks (which are more latency-sensitive) a higher Criticality than models serving latency-tolerant tasks such as summarization.
- Model rollouts: an inference gateway allows you to incrementally roll out new model versions through traffic-splitting definitions based on model names.
- Extensibility for Inference Services: an inference gateway defines an extensibility pattern that lets additional inference services provide bespoke routing capabilities when the out-of-the-box solutions do not fit your needs.
- Customizable Load Balancing for Inference: an inference gateway defines a pattern for customizable load balancing and request routing that is optimized for inference. An inference gateway provides a reference implementation of model endpoint picking that leverages metrics emitted by the model servers; this endpoint-picking mechanism can be used in lieu of traditional load balancing mechanisms. Model server-aware load balancing ("smart" load balancing, as it is sometimes referred to in this repo) has been shown to reduce serving latency and improve accelerator utilization in your clusters.
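As a rough sketch of how model-aware routing is wired up (assuming the upstream `inference.networking.x-k8s.io/v1alpha2` API; the Gateway and pool names here are illustrative), an HTTPRoute forwards matching traffic to an InferencePool backend, and the gateway resolves the target model from the OpenAI-style request body:

```yaml
# Sketch only: names are illustrative and the API version may change upstream.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway          # hypothetical Gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct  # pool of model server Pods
```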
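For serving priority, a minimal sketch (again assuming the v1alpha2 CRDs; the model and pool names are made up) is a pair of InferenceModel resources that map client-facing model names to different Criticality levels on the same pool:

```yaml
# Sketch: criticality values follow the upstream v1alpha2 CRD; names are illustrative.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat
spec:
  modelName: chat                    # model name sent in the request body
  criticality: Critical              # latency-sensitive online chat traffic
  poolRef:
    name: vllm-llama3-8b-instruct
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: summarizer
spec:
  modelName: summarizer
  criticality: Sheddable             # latency-tolerant; may be shed under load
  poolRef:
    name: vllm-llama3-8b-instruct
```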
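Model rollouts could then look something like the following sketch, where targetModels weights split traffic between an existing and a new model version (or LoRA adapter) behind a single client-facing model name; the weights and names are illustrative:

```yaml
# Sketch: weights and names are illustrative.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: tweet-summary
spec:
  modelName: tweet-summary           # name clients request
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:
  - name: tweet-summary-v1
    weight: 90                       # existing version keeps 90% of traffic
  - name: tweet-summary-v2
    weight: 10                       # new version receives 10% of traffic
```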
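Finally, for the customizable load balancing piece, the InferencePool is where the endpoint picker plugs in. The sketch below (same assumptions about API version; selector labels, port, and Service name are illustrative) delegates endpoint selection to an endpoint picker Service via extensionRef, which then chooses a backend Pod using metrics emitted by the model servers instead of a traditional load balancing algorithm:

```yaml
# Sketch: selector labels, port, and the endpoint picker Service name are illustrative.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000             # port the model servers listen on
  selector:
    app: vllm-llama3-8b-instruct     # labels selecting the model server Pods
  extensionRef:
    name: vllm-llama3-8b-instruct-epp  # endpoint picker (EPP) Service
```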
After the initial implementation, we plan to keep up with and contribute to the inference extension working group in order to support these use cases for OSS users and NGINX customers.
Acceptance Criteria
- TBD