Data plane does not sync upstream servers IPs

**Describe the bug**
With the new data plane architecture, we're seeing data plane configuration using old/obsolete pod IPs as upstream servers.

**To Reproduce**
Steps to reproduce the behavior:
1. Deploy a Gateway and HTTPRoute using NGF v2.0.2 (same issues seen with v2.0.1) in front of some backend services pods
2. Recycle the backend services pods so that they get new IPs
3. Access the endpoint exposed by API Gateway and get 50x Gateway errors, which upon inspection are caused by old IPs (previous pods' IPs, before them being deleted) still present in `nginx -T` output for block `upstream` of the service. 

Recycling (deleting and waiting for k8s to create a new one) the data plane pod solves it, its `nginx -T` output has the new backend pod IP as upstream server. Recycling the control plane pod(s) makes no difference.

Examples:
- `GET: https://api-dev.foo.company.network/api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString=` returns `504 Gateway Timeout` (`502 Bad Gateway` in Postman):
```shell
$ k logs gw-ingress-dev-nginx-dev-546d4cdb8d-v42l4 --since 5m | rg 504 -C2
Defaulted container "nginx" out of: nginx, init (init)
2025/07/15 08:50:15 [info] 5789#5789: *992 client canceled stream 1 while connecting to upstream, client: 128.127.114.197, server: api-dev.foo.company.network, request: "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/2.0", upstream: "http://10.22.0.88:8080/api/v3/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString=", host: "api-dev.foo.company.network"
2025/07/15 08:51:15 [error] 5789#5789: *994 upstream timed out (110: Operation timed out) while connecting to upstream, client: 128.127.114.197, server: api-dev.foo.company.network, request: "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/2.0", upstream: "http://10.22.0.88:8080/api/v3/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString=", host: "api-dev.foo.company.network"
128.127.114.197 - - [15/Jul/2025:08:51:15 +0000] "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/2.0" 504 562 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.0.0 Safari/537.36 Edg/138.0.0.0"
$ k logs gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk --since 5m | rg 504 -C2
Defaulted container "nginx" out of: nginx, init (init)
2025/07/15 08:50:37 [error] 4559#4559: *1387 upstream timed out (110: Operation timed out) while connecting to upstream, client: 102.89.22.199, server: api-dev.foo.company.network, request: "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/1.1", upstream: "http://10.22.0.88:8080/api/v3/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString=", host: "api-dev.foo.company.network"
102.89.22.199 - - [15/Jul/2025:08:50:37 +0000] "GET /api/v3/gu/guardian/loan-application/list?LoanType=0&PageNumber=1&PageSize=10&searchBy=&searchString= HTTP/1.1" 504 160 "-" "PostmanRuntime/7.44.1"
$
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk -c nginx -- nginx -T | rg 10.22.0.88 -C7

upstream dev_gateforce-guardian-canary_80 {
    random two least_conn;
    zone dev_gateforce-guardian-canary_80 512k;


    server 10.22.0.88:8080;




}

upstream dev_gateforce-guardian_80 {
    random two least_conn;
    zone dev_gateforce-guardian_80 512k;


    server 10.22.0.88:8080;




}
$
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk -c nginx -- cat /etc/nginx/conf.d/http.conf  | rg 10.22.0.88 -C10

upstream dev_gateforce-guardian-canary_80 {
    random two least_conn;
    zone dev_gateforce-guardian-canary_80 512k;


    server 10.22.0.88:8080;




}

upstream dev_gateforce-guardian_80 {
    random two least_conn;
    zone dev_gateforce-guardian_80 512k;


    server 10.22.0.88:8080;




}
## but there's no pod with IP 10.22.0.88
$ kubectl get pod -A -o wide | rg 10.22.0.88
$
## 10.22.0.93 is the IP of the svc endpoint
$ kubectl get pod -l app.kubernetes.io/instance=gateforce-guardian-dev -o wide
NAME                                                     READY   STATUS      RESTARTS   AGE    IP           NODE                                 NOMINATED NODE   READINESS GATES
gateforce-guardian-dev-api-7dbf7b5785-mz7k6              1/1     Running     0          160m   10.22.0.93   aks-spot01ded5-19828094-vmss00008k   <none>           <none>    # this is the IP of the svc endpoint
$
## recycling a data plane pod updates the config for that particular pod
$ kubectl delete pod gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk
pod "gw-ingress-dev-nginx-dev-546d4cdb8d-v9qbk" deleted
$
$ kubectl get pod -l app.kubernetes.io/instance=nginx-gateway-fabric-dev
NAME                                        READY   STATUS    RESTARTS   AGE
gw-ingress-dev-nginx-dev-546d4cdb8d-7g8s9   1/1     Running   0          14s
gw-ingress-dev-nginx-dev-546d4cdb8d-v42l4   1/1     Running   0          20h
gw-ingress-dev-nginx-dev-546d4cdb8d-flszc   1/1     Running   0          20h
nginx-gateway-fabric-dev-c49b64446-8qswk    1/1     Running   0          3m49s
nginx-gateway-fabric-dev-c49b64446-fzbf2    1/1     Running   0          3m49s
nginx-gateway-fabric-dev-c49b64446-nxbfm    1/1     Running   0          3m49s
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-7g8s9 -c nginx -- nginx -T | rg 10.22.0.88 -C10
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-7g8s9 -c nginx -- nginx -T | rg "upstream dev_gateforce-guardian" -A10
upstream dev_gateforce-guardian-canary_80 {
    random two least_conn;
    zone dev_gateforce-guardian-canary_80 512k;


    server 10.22.0.93:8080;



    keepalive_timeout 90;
}
--
  upstream dev_gateforce-guardian_80 {
    random two least_conn;
    zone dev_gateforce-guardian_80 512k;


    server 10.22.0.93:8080;           # we see the pod IP updated as the correct upstream server for the NGF data plane pod we recycled



    keepalive_timeout 90;
}
$ 
$ kubectl exec -it gw-ingress-dev-nginx-dev-546d4cdb8d-v42l4 -c nginx -- nginx -T | rg "upstream dev_gateforce-guardian" -A10
upstream dev_gateforce-guardian-canary_80 {
    random two least_conn;
    zone dev_gateforce-guardian-canary_80 512k;


    server 10.22.0.88:8080;



    keepalive_timeout 90;
}
--
  upstream dev_gateforce-guardian_80 {
    random two least_conn;
    zone dev_gateforce-guardian_80 512k;


    server 10.22.0.88:8080;



    keepalive_timeout 90;
}
$
```
- another example in attached file, where the svc endpoint is not used as upstream server, even though it exists, falling back to `unix:/var/run/nginx/nginx-503-server.sock`, similar to what https://github.com/nginx/nginx-gateway-fabric/issues/3139 reports. 

[ngf-data-plane-out-of-sync.md](https://github.com/user-attachments/files/21237902/ngf-data-plane-out-of-sync.md)


**Expected behavior**
I'd expect to see the backend services endpoint current (pod) IPs used as upstream servers, in the data plane configuration.

**Your environment**
We're on `AKS v1.30.12` with `Ubuntu 22.04.5 LTS` & kernel `5.15.0-1090-azure` & container runtime `containerd://1.7.27-1` nodes.
* Version of the NGINX Gateway Fabric: `"version":"2.0.2","commit":"283a21813b30de2fb6de34bae2dbfad8a4d40963"`
* Version of Kubernetes: ` v1.30.12`
* Kubernetes platform (e.g. Mini-kube or GCP): AKS
* Details on how you expose the NGINX Gateway Fabric Pod: Service of type LoadBalancer provided by Azure Load Balancer in Standard SKU
* Logs of NGINX container: `kubectl -n <nginx-deployment-namespace> logs deployments/<nginx-deployment>`
* NGINX Configuration: `kubectl -n <nginx-deployment-namespace> exec -it deployments/<nginx-deployment> -- nginx -T`

**Additional context**
From what is depicted in https://github.com/sarthyparty/nginx-gateway-fabric/blob/20d27b39a3373d95fd2a4782469de8ac7cc342d0/docs/architecture/configuration-flow.md#detailed-configuration-flow, the control plane pod configures the nginx-agent component of the data plane pod through gRPC, so there isn't any ConfigMap/Secret we could follow for changes as trigger to recycle the data plane pod to workaround this. Not really sure how to debug further, as I understand https://github.com/nginx/nginx-gateway-fabric/issues/3503 blocks debug logging enablement. 
Many thanks in advance for any pointers you may have.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Data plane does not sync upstream servers IPs #3626

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Data plane does not sync upstream servers IPs #3626

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions