
Data plane does not sync upstream server IPs #3626

@baburciu

Description

Describe the bug
With the new data plane architecture, we're seeing the data plane configuration use old/obsolete pod IPs as upstream servers.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a Gateway and HTTPRoute using NGF v2.0.2 (same issue seen with v2.0.1) in front of some backend service pods
  2. Recycle the backend service pods so that they get new IPs (see the sketch after this list)
  3. Access the endpoint exposed by the API Gateway and get 50x gateway errors, which on inspection are caused by old IPs (those of the previous pods, before they were deleted) still being present in the nginx -T output for the service's upstream block.
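
For step 2, the recycling looks roughly like the following; the backend Deployment/Service names and the dev namespace are inferred from our transcript and the upstream name dev_gateforce-guardian_80, so treat them as assumptions:

$ kubectl rollout restart deployment gateforce-guardian-dev-api                          ## backend pods come back with new IPs
$ kubectl get endpointslices -l kubernetes.io/service-name=gateforce-guardian            ## the EndpointSlice already lists the new IPs
$ kubectl exec -it <nginx-data-plane-pod> -c nginx -- nginx -T | rg "upstream dev_gateforce-guardian" -A10    ## the data plane still lists the old IPs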

Recycling the data plane pod (deleting it and waiting for k8s to create a new one) solves it: the new pod's nginx -T output has the current backend pod IP as upstream server. Recycling the control plane pod(s) makes no difference.
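
As a stop-gap we restart the whole data plane Deployment rather than deleting pods one by one; a rough sketch, assuming the Deployment is named gw-ingress-dev-nginx-dev as the pod names suggest:

$ kubectl rollout restart deployment gw-ingress-dev-nginx-dev
$ kubectl rollout status deployment gw-ingress-dev-nginx-dev
## every new replica then renders the current endpoint IPs
$ kubectl exec -it deployments/gw-ingress-dev-nginx-dev -c nginx -- nginx -T | rg "upstream dev_gateforce-guardian" -A10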

Examples:

  • GET: https://api-dev.foo.company.network/api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= returns 504 Gateway Timeout (502 Bad Gateway in Postman):
$ k logs gw-ingress-dev-nginx-dev-546d4cdb8d-v42l4 --since 5m | rg 504 -C2
Defaulted container "nginx" out of: nginx, init (init)
2025/07/15 08:50:15 [info] 5789#5789: *992 client canceled stream 1 while connecting to upstream, client: 128.127.114.197, server: api-dev.foo.company.network, request: "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/2.0", upstream: "http://10.22.0.88:8080/api/v3/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString=", host: "api-dev.foo.company.network"
2025/07/15 08:51:15 [error] 5789#5789: *994 upstream timed out (110: Operation timed out) while connecting to upstream, client: 128.127.114.197, server: api-dev.foo.company.network, request: "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/2.0", upstream: "http://10.22.0.88:8080/api/v3/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString=", host: "api-dev.foo.company.network"
128.127.114.197 - - [15/Jul/2025:08:51:15 +0000] "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/2.0" 504 562 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0"
$ k logs gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk --since 5m | rg 504 -C2
Defaulted container "nginx" out of: nginx, init (init)
2025/07/15 08:50:37 [error] 4559#4559: *1387 upstream timed out (110: Operation timed out) while connecting to upstream, client: 102.89.22.199, server: api-dev.foo.company.network, request: "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/1.1", upstream: "http://10.22.0.88:8080/api/v3/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString=", host: "api-dev.foo.company.network"
102.89.22.199 - - [15/Jul/2025:08:50:37 +0000] "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/1.1" 504 160 "-" "PostmanRuntime/7.44.1"
$
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk -c nginx -- nginx -T | rg 10.22.0.88 -C7

upstream dev_gateforce-guardian-canary_80 {
    random two least_conn;
    zone dev_gateforce-guardian-canary_80 512k;


    server 10.22.0.88:8080;




}

upstream dev_gateforce-guardian_80 {
    random two least_conn;
    zone dev_gateforce-guardian_80 512k;


    server 10.22.0.88:8080;




}
$
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk -c nginx -- cat /etc/nginx/conf.d/http.conf  | rg 10.22.0.88 -C10

upstream dev_gateforce-guardian-canary_80 {
    random two least_conn;
    zone dev_gateforce-guardian-canary_80 512k;


    server 10.22.0.88:8080;




}

upstream dev_gateforce-guardian_80 {
    random two least_conn;
    zone dev_gateforce-guardian_80 512k;


    server 10.22.0.88:8080;




}
## but there's no pod with IP 10.22.0.88
$ kubectl get pod -A -o wide | rg 10.22.0.88
$
## 10.22.0.93 is the IP of the svc endpoint
$ kubectl get pod -l app.kubernetes.io/instance=gateforce-guardian-dev -o wide
NAME                                                     READY   STATUS      RESTARTS   AGE    IP           NODE                                 NOMINATED NODE   READINESS GATES
gateforce-guardian-dev-api-7dbf7b5785-mz7k6              1/1     Running     0          160m   10.22.0.93   aks-spot01ded5-19828094-vmss00008k   <none>           <none>    # this is the IP of the svc endpoint
$
## recycling a data plane pod updates the config for that particular pod
$ kubectl delete pod gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk
pod "gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk" deleted
$
$ kubectl get pod -l app.kubernetes.io/instance=nginx-gateway-fabric-dev
NAME                                        READY   STATUS    RESTARTS   AGE
gw-ingress-dev-nginx-dev-546d4cdb8d-7g8s9   1/1     Running   0          14s
gw-ingress-dev-nginx-dev-546d4cdb8d-v42l4   1/1     Running   0          20h
gw-ingress-dev-nginx-dev-546d4cdb8d-flszc   1/1     Running   0          20h
nginx-gateway-fabric-dev-c49b64446-8qswk    1/1     Running   0          3m49s
nginx-gateway-fabric-dev-c49b64446-fzbf2    1/1     Running   0          3m49s
nginx-gateway-fabric-dev-c49b64446-nxbfm    1/1     Running   0          3m49s
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-7g8s9 -c nginx -- nginx -T | rg 10.22.0.88 -C10
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-7g8s9 -c nginx -- nginx -T | rg "upstream dev_gateforce-guardian" -A10
upstream dev_gateforce-guardian-canary_80 {
    random two least_conn;
    zone dev_gateforce-guardian-canary_80 512k;


    server 10.22.0.93:8080;



    keepalive_timeout 90;
}
--
  upstream dev_gateforce-guardian_80 {
    random two least_conn;
    zone dev_gateforce-guardian_80 512k;


    server 10.22.0.93:8080;           # we see the pod IP updated as the correct upstream server for the NGF data plane pod we recycled



    keepalive_timeout 90;
}
$ 
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-v42l4 -c nginx -- nginx -T | rg "upstream dev_gateforce-guardian" -A10
upstream dev_gateforce-guardian-canary_80 {
    random two least_conn;
    zone dev_gateforce-guardian-canary_80 512k;


    server 10.22.0.88:8080;



    keepalive_timeout 90;
}
--
  upstream dev_gateforce-guardian_80 {
    random two least_conn;
    zone dev_gateforce-guardian_80 512k;


    server 10.22.0.88:8080;



    keepalive_timeout 90;
}
$

ngf-data-plane-out-of-sync.md

Expected behavior
I'd expect the data plane configuration to use the backend Service's current endpoint (pod) IPs as upstream servers.
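
In other words, each upstream block's server entries should track the Service's EndpointSlice. A quick consistency check along these lines (namespace and Service name inferred from the upstream name dev_gateforce-guardian_80, so an assumption) should report the same IP that nginx -T renders:

$ kubectl -n dev get endpointslices -l kubernetes.io/service-name=gateforce-guardian
## in our case this shows 10.22.0.93, while nginx -T on the stale data plane pods still says 10.22.0.88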

Your environment
We're on AKS v1.30.12 with Ubuntu 22.04.5 LTS nodes (kernel 5.15.0-1090-azure, container runtime containerd://1.7.27-1).

  • Version of the NGINX Gateway Fabric: "version":"2.0.2","commit":"283a21813b30de2fb6de34bae2dbfad8a4d40963"
  • Version of Kubernetes: v1.30.12
  • Kubernetes platform (e.g. Minikube or GCP): AKS
  • Details on how you expose the NGINX Gateway Fabric Pod: Service of type LoadBalancer provided by Azure Load Balancer in Standard SKU
  • Logs of NGINX container: kubectl -n <nginx-deployment-namespace> logs deployments/<nginx-deployment>
  • NGINX Configuration: kubectl -n <nginx-deployment-namespace> exec -it deployments/<nginx-deployment> -- nginx -T

Additional context
From what is depicted in https://github.com/sarthyparty/nginx-gateway-fabric/blob/20d27b39a3373d95fd2a4782469de8ac7cc342d0/docs/architecture/configuration-flow.md#detailed-configuration-flow, the control plane pod configures the nginx-agent component of the data plane pod over gRPC, so there is no ConfigMap/Secret we could watch for changes as a trigger to recycle the data plane pod and work around this. We're not really sure how to debug further, as I understand #3503 blocks enabling debug logging.
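
Until this is fixed, the stop-gap we're considering is a periodic check (e.g. run from a CronJob) that compares the rendered nginx config with the Service's current endpoint IPs and restarts the data plane when they diverge. This is purely our own sketch, not anything NGF provides, and the namespace/Service/Deployment names are assumptions taken from our install:

#!/usr/bin/env bash
## stop-gap sketch, not an NGF feature: restart the data plane when its rendered
## config no longer contains the Service's current endpoint IPs.
## NS, SVC and DP_DEPLOY are our values; adjust for your install.
set -euo pipefail

NS=dev
SVC=gateforce-guardian                 ## backend Service behind the HTTPRoute
DP_DEPLOY=gw-ingress-dev-nginx-dev     ## NGF data plane Deployment

## rendered config from one data plane replica (exec picks a single pod)
conf=$(kubectl -n "$NS" exec deploy/"$DP_DEPLOY" -c nginx -- nginx -T 2>/dev/null)

## current endpoint IPs for the Service, taken from its EndpointSlices
ips=$(kubectl -n "$NS" get endpointslices -l kubernetes.io/service-name="$SVC" \
        -o jsonpath='{.items[*].endpoints[*].addresses[0]}')

stale=0
for ip in $ips; do
  grep -q "server ${ip}:" <<<"$conf" || stale=1
done

if [ "$stale" -eq 1 ]; then
  echo "data plane config is missing current endpoint IPs; restarting ${DP_DEPLOY}"
  kubectl -n "$NS" rollout restart deployment "$DP_DEPLOY"
fi

It only inspects one replica and only checks that the current IPs are present, but that would have been enough to catch the drift shown above.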
Many thanks in advance for any pointers you may have.

Labels: bug (Something isn't working), community, refined (Requirements are refined and the issue is ready to be implemented)
