Virtual key management
Issue API keys with per-key token budgets and cost tracking (also known as virtual keys).
About
Virtual key management is a common feature in AI gateway solutions that allows you to issue API keys to users or applications, each with independent token budgets and cost tracking. Competitors like LiteLLM and Portkey offer this as a single “virtual keys” abstraction.
Agentgateway achieves the same outcome by composing three existing capabilities:
- API key authentication: Identify incoming requests by API key
- Token-based rate limiting: Enforce per-key token budgets
- Observability metrics: Track per-key spending and usage
This composable approach gives you more flexibility in how you configure and apply virtual key management policies, while maintaining compatibility with standard Kubernetes patterns.
How virtual keys work
Virtual keys combine authentication, rate limiting, and observability to create isolated token budgets for each API key:
flowchart TD
A[Request arrives with API key] --> B[Validate API key]
B --> C[Extract user ID]
C --> D[Check user's token budget]
D --> E{Budget available?}
E -->|Yes| F[Forward to LLM]
F --> G[Track token usage]
G --> H[Deduct from budget]
E -->|No| I[Reject with 429]
subgraph refill["Budget refills periodically"]
H
end
When a request arrives:
- Agentgateway validates the API key
- The user ID is read from the metadata of the matched API key
- The request is checked against the user’s token budget
- If budget is available, the request proceeds to the LLM
- Token usage is tracked and deducted from the user’s budget
- If budget is exhausted, the request is rejected with a 429 status code
- Budgets refill at the configured interval (daily, hourly, etc.)
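The steps above can be sketched as a small simulation. This is a minimal, hypothetical model and not agentgateway's implementation: the `TokenBudget` class, the `handle_request` function, the in-memory `KEYS` table, and the 401 status for unknown keys are all assumptions made for illustration. It shows the one behavior that is easy to miss: usage is deducted after the response, so the request that crosses the budget still completes.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-in for the API key Secret (key -> user_id).
KEYS = {"sk-alice-abc123def456": "alice", "sk-bob-xyz789uvw012": "bob"}


@dataclass
class TokenBudget:
    """Fixed-window token budget for one user (illustrative only)."""
    limit: int            # tokens allowed per window
    window_seconds: int   # refill interval, for example 86400 for daily
    used: int = 0
    window_start: float = 0.0

    def has_budget(self, now: float) -> bool:
        # Start a fresh window (refill) once the interval has elapsed.
        if now - self.window_start >= self.window_seconds:
            self.used = 0
            self.window_start = now
        return self.used < self.limit

    def deduct(self, tokens: int) -> None:
        self.used += tokens


def handle_request(budgets, api_key, response_tokens, now):
    """Return the HTTP status that this simplified gateway would send."""
    user_id = KEYS.get(api_key)
    if user_id is None:
        # Assumed status code; authentication runs first, so no quota is consumed.
        return 401
    budget = budgets[user_id]
    if not budget.has_budget(now):
        return 429
    # The request is forwarded and its token usage is deducted afterwards,
    # so the request that crosses the limit still completes.
    budget.deduct(response_tokens)
    return 200


budgets = {user: TokenBudget(limit=100, window_seconds=86_400) for user in KEYS.values()}
alice_statuses = [
    handle_request(budgets, "sk-alice-abc123def456", 19, now=float(i)) for i in range(8)
]
print(alice_statuses)
```

With 19 tokens per response and a 100-token budget, the first six requests in this model succeed (the sixth starts at 95 tokens used) and the rest return 429 until the window resets.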
More considerations
Evaluation order: Rate limiting is evaluated before prompt guards (content safety checks). This means that requests rejected by guardrails (403 Forbidden) still consume quota from the user’s token budget. In contrast, authentication (JWT/OPA) is evaluated before rate limiting, so unauthenticated requests do not consume quota.
Multiple policies: When multiple AgentgatewayPolicy resources target the same Gateway or HTTPRoute, one policy silently overwrites the other based on creation order, even though both report ACCEPTED/ATTACHED status. There is no error to indicate that one policy’s settings are not taking effect. To avoid this conflict, combine the settings that apply to the same target into a single policy. For example, this guide puts API key authentication and per-key rate limiting in one policy rather than two.
Before you begin
Set up virtual keys
This example creates two virtual keys (for Alice and Bob) with independent daily token budgets. The budget is deliberately small (100 tokens per day) so that you can exhaust it in a few requests and see the enforcement in action. For production-sized budgets, see Advanced configuration.
Create API keys for users
Create an API key secret that stores keys and metadata for each user.
kubectl apply -f- <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: llm-api-keys
  namespace: agentgateway-system
type: Opaque
stringData:
  alice: |
    {
      "key": "sk-alice-abc123def456",
      "metadata": {
        "user_id": "alice"
      }
    }
  bob: |
    {
      "key": "sk-bob-xyz789uvw012",
      "metadata": {
        "user_id": "bob"
      }
    }
EOF
Review the following table to understand this configuration.
| Setting | Description |
|---|---|
| stringData.<name> | Each key in stringData represents a user. The value is a JSON object containing the API key and metadata. |
| key | The API key value that users include in their Authorization: Bearer header. |
| metadata.user_id | The user identifier extracted by rate limiting policies to enforce per-user budgets. |
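If you issue many keys, you might generate the Secret entries with a script instead of writing them by hand. The following is a minimal sketch: the `make_key_entry` helper is hypothetical, and the `sk-<user>-<random>` key format only mirrors the example above; it is an assumption for illustration and not a format the gateway is known to require.

```python
import json
import secrets


def make_key_entry(user_id: str, **extra_metadata: str) -> str:
    """Return the JSON value for one stringData entry (hypothetical helper)."""
    entry = {
        # 16 random bytes rendered as 32 hex characters; the format is an assumption.
        "key": f"sk-{user_id}-{secrets.token_hex(16)}",
        "metadata": {"user_id": user_id, **extra_metadata},
    }
    return json.dumps(entry, indent=2)


# One entry per user, keyed the same way as stringData in the Secret.
string_data = {user: make_key_entry(user) for user in ("alice", "bob")}
print(string_data["alice"])
```

Each value in `string_data` can be pasted under the matching name in `stringData`. Extra keyword arguments, such as `tier="premium"`, are stored as additional metadata fields.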
Configure API key authentication
Create an AgentgatewayPolicy that requires API key authentication for all requests to the gateway. You can source the API keys from a single Secret with secretRef, or from multiple Secrets selected by label with secretSelector. Use secretSelector when you want to spread keys across many Secrets, such as one Secret per team or tenant, instead of maintaining a single Secret.
Reference a single Secret by name. This example uses the llm-api-keys Secret that you created in the previous step.
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: api-key-auth
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    apiKeyAuthentication:
      mode: Strict
      secretRef:
        name: llm-api-keys
EOF
Review the following table to understand this configuration.
| Setting | Description |
|---|---|
| targetRefs | Apply the policy to the entire Gateway so all routes require API keys. |
| apiKeyAuthentication.mode | Set to Strict to require a valid API key for all requests. |
| secretRef.name | References a single Secret containing API keys and user metadata. Use this or secretSelector, not both. |
| secretSelector.matchLabels | Selects all Secrets that carry the given labels, combining their keys. Use instead of secretRef when keys are spread across multiple Secrets. Secret-only. |
Configure per-key token budgets
Update the api-key-auth AgentgatewayPolicy from the previous step to also enforce a per-user token budget.
The policy sends a per-user token cost to the rate limit server. It extracts the user_id from each API key and reports the token usage of each response under that descriptor. The rate limit server holds the actual budget (100 tokens per day per user), which you deploy in the next step.
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: api-key-auth
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  traffic:
    apiKeyAuthentication:
      mode: Strict
      secretRef:
        name: llm-api-keys
    rateLimit:
      global:
        domain: agentgateway
        backendRef:
          kind: Service
          name: ratelimit
          namespace: ratelimit
          port: 8081
        descriptors:
        - entries:
          - name: user_id
            expression: 'apiKey.user_id'
          unit: Tokens
EOF
This example keeps the secretRef authentication from the previous step. If you used secretSelector instead, keep your secretSelector block in place of secretRef.
Review the following table to understand this configuration.
| Setting | Description |
|---|---|
| apiKeyAuthentication | The API key authentication from the previous step. Keeping it in the same policy as the rate limit avoids the silent conflict that occurs when two policies target the same Gateway. |
| rateLimit.global | Use global rate limiting to enforce limits across all agentgateway instances. |
| domain | The rate limit domain. Must match the domain in the rate limit server configuration (agentgateway). |
| backendRef | References the rate limit server Service. Must include kind, name, namespace, and port. This example points at the ratelimit Service in the ratelimit namespace that you deploy in the next step. |
| descriptors[].entries[].name | The name of the descriptor entry. Must match a key in the rate limit server config. Set to user_id to rate limit per user. |
| descriptors[].entries[].expression | CEL expression to extract the user ID from the API key’s metadata. |
| descriptors[].unit | Set to Tokens so the gateway reports each response’s token count as the cost. The rate limit server subtracts that cost from the user’s budget. |
Deploy the rate limit server
Global rate limiting requires an external rate limit server that stores the budgets and maintains the counters. Deploy Redis and the rate limit service as described in Deploy the rate limit service in the global rate limiting guide. That example deploys a ratelimit Service in the ratelimit namespace (the target of the backendRef in the previous step) and configures it with the user_id token-budget descriptor that this guide relies on:
# Excerpt from the rate limit server ConfigMap
domain: agentgateway
descriptors:
- key: user_id
  rate_limit:
    unit: day
    requests_per_unit: 100 # 100 tokens per day per user
The key (user_id) matches the descriptor name in your token budget policy, and the domain (agentgateway) matches the policy’s domain. The requests_per_unit value is the per-user token budget, because the policy reports token usage with unit: Tokens. To change the budget, edit requests_per_unit in the server config; to change the window, edit unit (second, minute, hour, or day).
Set up an LLM backend
Create an AgentgatewayBackend that connects to your LLM provider.
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayBackend
metadata:
  name: openai
  namespace: agentgateway-system
spec:
  ai:
    provider:
      openai:
        model: gpt-3.5-turbo
  policies:
    auth:
      secretRef:
        name: openai-secret
EOF
For detailed instructions on creating backends and storing provider API keys, see the API keys guide.
Create a route to the backend
Create an HTTPRoute that routes requests to your LLM backend.
kubectl apply -f- <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: openai
  namespace: agentgateway-system
spec:
  parentRefs:
  - name: agentgateway-proxy
    namespace: agentgateway-system
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /openai
    backendRefs:
    - name: openai
      namespace: agentgateway-system
      group: agentgateway.dev
      kind: AgentgatewayBackend
EOF
Test the virtual keys
Send a request with Alice’s API key. Verify that the request succeeds.
curl "$INGRESS_GW_ADDRESS/openai" \
  -H "Authorization: Bearer sk-alice-abc123def456" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Example successful response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "gpt-3.5-turbo",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 9,
    "total_tokens": 19
  }
}
Send several more requests with Alice’s API key until her 100-token daily budget is exhausted. Because the LLM provider returns roughly 20-30 tokens per response, a handful of requests pushes Alice over the budget. The request that crosses the budget still completes; subsequent requests are rejected with a 429 status code.
for i in $(seq 1 10); do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
    "$INGRESS_GW_ADDRESS/openai" \
    -H "Authorization: Bearer sk-alice-abc123def456" \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hello!"}]}')
  echo "Request $i: HTTP $STATUS"
done
Example 429 response:
HTTP/1.1 429 Too Many Requests
x-ratelimit-limit: 100
x-ratelimit-remaining: 0
x-ratelimit-reset: 43200

rate limit exceeded
Verify that Bob can still send requests with his own budget, independent of Alice’s usage.
curl "$INGRESS_GW_ADDRESS/openai" \
  -H "Authorization: Bearer sk-bob-xyz789uvw012" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
Bob’s requests succeed because he has his own independent budget.
Monitor per-key spending
Track token usage and spending for each virtual key by using Prometheus metrics.
By default, the agentgateway token usage metric (agentgateway_gen_ai_client_token_usage) is broken down by dimensions such as the model and token type, but not by user. To attribute usage to each virtual key, add a user_id label to the metrics with a metrics policy, then query Prometheus.
Before you begin
Set up a Prometheus instance to scrape agentgateway metrics. The OpenTelemetry stack guide walks you through the full setup; at a minimum, complete the Prometheus step. The following steps assume the kube-prometheus-stack release exists in the telemetry namespace, as deployed by that guide.
Add a per-user metric label
Create an AgentgatewayPolicy that adds the user_id from each API key as a label on all Prometheus metrics. The frontend.metrics field can only be set on a policy that targets the Gateway.
kubectl apply -f- <<EOF
apiVersion: agentgateway.dev/v1alpha1
kind: AgentgatewayPolicy
metadata:
  name: per-user-metrics
  namespace: agentgateway-system
spec:
  targetRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: agentgateway-proxy
  frontend:
    metrics:
      attributes:
        add:
        - name: user_id
          expression: 'apiKey.user_id'
EOF
Review the following table to understand this configuration.
| Setting | Description |
|---|---|
| frontend.metrics.attributes.add[].name | The name of the Prometheus label to add (user_id). |
| frontend.metrics.attributes.add[].expression | A CEL expression that is evaluated per request. Use apiKey.user_id to read the user_id from the authenticated API key. If the expression fails to evaluate (for example, on an unauthenticated request), the label value is set to unknown. |
The user_id label is high cardinality: every unique value creates a new metric series, which increases Prometheus memory and storage. This is acceptable for tens or hundreds of keys, but avoid attaching unbounded identifiers (such as raw end-user IDs) to metrics at large scale. Prefer lower-cardinality dimensions like tier or team when possible.
Send a few requests with each virtual key so that the metrics have per-user data to report. You can reuse the requests from Test the virtual keys.
Query per-key usage
Port-forward the Prometheus server from the OpenTelemetry stack.
kubectl port-forward -n telemetry svc/kube-prometheus-stack-prometheus 9090:9090
Then open the Prometheus UI at http://localhost:9090/graph and run the following queries, or send them to the HTTP API with curl. For example:
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum by (user_id) (agentgateway_gen_ai_client_token_usage_sum)'
Example output:
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782410561.391,"720"]},{"metric":{"user_id":"bob"},"value":[1782410561.391,"448"]},{"metric":{"user_id":"alice"},"value":[1782410561.391,"448"]}]}}
Query token usage broken down by user ID. The token usage metric carries a separate series per token type (input, output, input_cache_read), so match both the input and output types in a single selector and sum them, rather than adding two selectors together.
# Total tokens consumed by user over the last 24 hours
sum by (user_id) (
  increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h])
)
# Percentage of a 100-token daily budget used (adjust the divisor to match your budget)
(sum by (user_id) (
  increase(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type=~"input|output"}[24h])
) / 100) * 100
Each result series is labeled with a user_id, such as alice and bob. If a key is missing the user_id field, or the request is not attributed to a key, its usage appears under user_id="unknown".
Example output:
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782411002.488,"0"]},{"metric":{"user_id":"bob"},"value":[1782411002.488,"372.2787929364588"]},{"metric":{"user_id":"alice"},"value":[1782411002.488,"309.56920815395927"]}]}}
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782411059.867,"0"]},{"metric":{"user_id":"bob"},"value":[1782411059.867,"370.95800165527817"]},{"metric":{"user_id":"alice"},"value":[1782411059.867,"307.9427844448483"]}]}}
increase() and rate() need at least two samples within the time range to report a value, so a brand-new user_id series shows no result until it has been scraped a few times under continuous traffic. For a quick instant check, query the cumulative counter directly: sum by (user_id) (agentgateway_gen_ai_client_token_usage_sum).
Calculate costs per user by multiplying token counts by your provider’s pricing. Input and output tokens are usually priced differently, so reduce each token type to a per-user series with sum by (user_id) before adding them, which keeps the two sides matchable.
# Average cost per second per user over the last 24 hours (assuming $0.50 per 1M input tokens, $1.50 per 1M output tokens)
# rate() returns a per-second average; multiply the result by 86400, or use increase() in place of rate(), for the 24-hour total
sum by (user_id) (rate(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="input"}[24h])) / 1000000 * 0.50
+
sum by (user_id) (rate(agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output"}[24h])) / 1000000 * 1.50
Example output:
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1782410758.432,"0"]},{"metric":{"user_id":"bob"},"value":[1782410758.432,"6.101636101191084e-09"]},{"metric":{"user_id":"alice"},"value":[1782410758.432,"5.106526900820178e-09"]}]}}
For more information on cost tracking, see the cost tracking guide.
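To sanity-check the numbers that Prometheus returns, the same arithmetic can be done by hand. The following is a minimal sketch under stated assumptions: the prices are the example values from the query comment, the per-user token counts are made up for illustration, and the `cost_usd` and `budget_used_percent` helpers are hypothetical names.

```python
from typing import Mapping

# Example prices in USD per 1M tokens, taken from the cost query's comment.
PRICE_PER_MILLION = {"input": 0.50, "output": 1.50}


def cost_usd(tokens_by_type: Mapping[str, float]) -> float:
    """Price each token type separately, then add the results."""
    return sum(
        tokens_by_type.get(token_type, 0) / 1_000_000 * price
        for token_type, price in PRICE_PER_MILLION.items()
    )


def budget_used_percent(tokens_by_type: Mapping[str, float], budget_tokens: int) -> float:
    """Share of a token budget consumed by input plus output tokens."""
    total = tokens_by_type.get("input", 0) + tokens_by_type.get("output", 0)
    return total / budget_tokens * 100


# Hypothetical 24-hour token totals per user (not real measurements).
usage = {
    "alice": {"input": 300, "output": 148},
    "bob": {"input": 250, "output": 198},
}
for user_id, tokens in usage.items():
    print(f"{user_id}: ${cost_usd(tokens):.6f}, "
          f"{budget_used_percent(tokens, 100):.0f}% of a 100-token budget")
```

The token counts here are totals for the window, which is what an increase() query reports. A rate() query reports a per-second average, so its result has to be scaled by the window length before it is comparable.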
Advanced configuration
Tiered budgets based on user type
Provide different budget tiers for free, standard, and premium users.
Add tier metadata to each API key in the Secret.
apiVersion: v1
kind: Secret
metadata:
  name: llm-api-keys
  namespace: agentgateway-system
type: Opaque
stringData:
  alice: |
    {
      "key": "sk-alice-abc123def456",
      "metadata": {
        "user_id": "alice",
        "tier": "premium"
      }
    }
  charlie: |
    {
      "key": "sk-charlie-ghi345jkl678",
      "metadata": {
        "user_id": "charlie",
        "tier": "free"
      }
    }
Configure rate limiting to use the tier and user_id from API key metadata.
traffic:
  rateLimit:
    global:
      domain: agentgateway
      backendRef:
        kind: Service
        name: ratelimit
        namespace: ratelimit
        port: 8081
      descriptors:
      - entries:
        - name: tier
          expression: 'apiKey.tier'
        - name: user_id
          expression: 'apiKey.user_id'
        unit: Tokens
Configure the rate limit server with tier-based budgets.
domain: agentgateway
descriptors:
- key: tier
  value: "free"
  descriptors:
  - key: user_id
    rate_limit:
      unit: day
      requests_per_unit: 10000 # 10K tokens/day for free tier
- key: tier
  value: "standard"
  descriptors:
  - key: user_id
    rate_limit:
      unit: day
      requests_per_unit: 100000 # 100K tokens/day for standard tier
- key: tier
  value: "premium"
  descriptors:
  - key: user_id
    rate_limit:
      unit: day
      requests_per_unit: 500000 # 500K tokens/day for premium tier
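To see how the ordered descriptor entries (tier, then user_id) select a budget from the nested server config, the lookup can be modeled in a few lines. This is a simplified sketch and not the rate limit server's actual matching code: the `find_limit` function is hypothetical, the config is the tier example rewritten as a Python dict, and the model assumes that a config descriptor without a `value` matches any value for its key.

```python
# The tier-based server config from this section, expressed as a Python dict.
RATE_LIMIT_CONFIG = {
    "domain": "agentgateway",
    "descriptors": [
        {"key": "tier", "value": tier, "descriptors": [
            {"key": "user_id", "rate_limit": {"unit": "day", "requests_per_unit": budget}},
        ]}
        for tier, budget in (("free", 10_000), ("standard", 100_000), ("premium", 500_000))
    ],
}


def find_limit(config_descriptors, entries):
    """Walk the nested config, matching (key, value) entries in order.

    Simplifying assumption: a config descriptor with no "value" field
    matches any value, which gives each user_id its own counter.
    """
    if not entries:
        return None
    (key, value), rest = entries[0], entries[1:]
    for descriptor in config_descriptors:
        if descriptor["key"] != key:
            continue
        if "value" in descriptor and descriptor["value"] != value:
            continue
        if not rest:
            return descriptor.get("rate_limit")
        found = find_limit(descriptor.get("descriptors", []), rest)
        if found is not None:
            return found
    return None


# Entries as the policy would report them for Alice's premium key.
alice_limit = find_limit(
    RATE_LIMIT_CONFIG["descriptors"], [("tier", "premium"), ("user_id", "alice")]
)
print(alice_limit)
```

In this model, a key whose tier has no matching entry resolves to no limit at all, so it is worth checking that every tier value you store in key metadata also appears in the server config.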
Hourly budget limits
Set a smaller budget that refreshes every hour for tighter cost control.
# In the ratelimit-config ConfigMap
domain: agentgateway
descriptors:
- key: user_id
  rate_limit:
    unit: hour
    requests_per_unit: 10000 # 10,000 tokens per hour
Multi-tenant virtual keys
Create virtual keys scoped to both user and tenant for multi-tenant applications. Add tenant_id to the API key metadata.
# In the AgentgatewayPolicy
descriptors:
- entries:
  - name: tenant_id
    expression: 'apiKey.tenant_id'
  - name: user_id
    expression: 'apiKey.user_id'
  unit: Tokens
# In the ratelimit-config ConfigMap
domain: agentgateway
descriptors:
- key: tenant_id
  descriptors:
  - key: user_id
    rate_limit:
      unit: day
      requests_per_unit: 50000
For more advanced rate limiting patterns, see the budget and spend limits guide.
Cleanup
You can remove the resources that you created in this guide.
kubectl delete AgentgatewayPolicy api-key-auth per-user-metrics -n agentgateway-system --ignore-not-found
kubectl delete secret llm-api-keys -n agentgateway-system
kubectl delete httproute openai -n agentgateway-system
kubectl delete AgentgatewayBackend openai -n agentgateway-system
To remove the rate limit server, follow the cleanup steps in the global rate limiting guide.
What’s next
- Manage API keys for detailed authentication configuration
- Budget and spend limits for advanced rate limiting patterns
- Track costs per request for cost calculation and monitoring
- Set up observability to view token usage metrics and logs