This article explains why you shouldn’t enable cross-zone load balancing on your NLBs when using instance target groups with client IP preservation.
Types of load balancers
In the context of HTTP traffic:
- Application Load Balancer: operates at the application layer (layer 7), is aware of HTTP traffic, and can direct traffic based on criteria like path and headers.
- Network Load Balancer: operates at the transport layer (layer 4), is aware of TCP traffic, and has no awareness of upper-layer protocols like HTTP.
Availability zones and load balancers
It’s best practice to run your applications in multiple availability zones. Load balancers allow you to distribute traffic over target instances hosted in multiple availability zones.
NLB cross-zone load balancing
Cross-zone load balancing means that each NLB endpoint can direct traffic to target instances in a different availability zone.
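By default, it’s disabled: each NLB endpoint only forwards traffic to target instances in its own zone.
When enabled, each NLB endpoint can forward traffic to target instances in any zone.
Note that, with instance target groups, NLBs preserve the source IP and port of the ingress traffic by default, so the target sees the connection as coming directly from the client.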
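To make the difference concrete, here is a minimal sketch of which targets an endpoint may choose from in each mode. The `Target` class, the instance IDs and the zone names are made up for illustration, and the random pick is only a stand-in for the flow hash a real NLB uses.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    instance_id: str  # hypothetical instance ID
    zone: str         # hypothetical zone name


def eligible_targets(endpoint_zone, targets, cross_zone):
    """Return the targets an NLB endpoint in `endpoint_zone` may forward to."""
    if cross_zone:
        # Cross-zone enabled: every registered target is a candidate.
        return list(targets)
    # Cross-zone disabled: only targets in the endpoint's own zone.
    return [t for t in targets if t.zone == endpoint_zone]


def pick_target(endpoint_zone, targets, cross_zone, rng=random):
    """Pick one candidate; random choice stands in for the real flow hash."""
    return rng.choice(eligible_targets(endpoint_zone, targets, cross_zone))


TARGETS = [Target("i-a1", "zone-a"), Target("i-b1", "zone-b")]

if __name__ == "__main__":
    for cross_zone in (False, True):
        ids = [t.instance_id for t in eligible_targets("zone-a", TARGETS, cross_zone)]
        print(f"endpoint in zone-a, cross_zone={cross_zone}: {ids}")
```

With cross-zone disabled, the endpoint in zone A can only ever reach the instance in zone A; with it enabled, both instances are candidates.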
The problem with NLB cross-zone load balancing
Suppose your TCP client is behind a NAT gateway of some sort, and is making plenty of requests to your service.
For each request:
- The client resolves the NLB CNAME to an IP address.
- The client opens a TCP connection to that NLB endpoint’s IP address.
- The NLB endpoint forwards the TCP connection to a target instance. Since cross-zone load balancing is enabled, this target instance could be in zone A or B.
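The client-side half of these steps can be sketched as follows. This is an illustration, not how any particular client library behaves: the function names are invented here, the resolver is injectable so the logic can be exercised without a network, and the random pick among resolved addresses is an assumption about the client.

```python
import random
import socket


def resolve_endpoints(hostname, port, resolver=socket.getaddrinfo):
    """Return the distinct IPv4 addresses the NLB name currently resolves to."""
    infos = resolver(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (ip, port).
    return sorted({info[4][0] for info in infos})


def connect_once(hostname, port, resolver=socket.getaddrinfo, timeout=5.0):
    """Resolve the NLB name, pick one endpoint IP, and open a TCP connection."""
    ip = random.choice(resolve_endpoints(hostname, port, resolver))
    # The local source port is chosen by the OS; behind a NAT gateway it is
    # the gateway that decides which source port the NLB ends up seeing.
    return socket.create_connection((ip, port), timeout=timeout)
```

Each call to `connect_once` may land on a different NLB endpoint, which is what sets up the scenario described next.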
Everything could be fine, until:
- TCP client makes request 1 and resolves the CNAME to NLB IP 220.127.116.11 (endpoint A).
- TCP client makes request 2 and resolves the CNAME to NLB IP 18.104.22.168 (endpoint B).
- The NAT gateway on the TCP client’s network assigns the same source port to both TCP connections, which is legitimate because their destination IPs differ.
- Request 1 arrives at NLB endpoint A and request 2 arrives at NLB endpoint B.
- NLB endpoint A forwards request 1 to instance A. NLB endpoint B forwards request 2 to the exact same instance A.
- The source IP and source TCP port of requests 1 and 2 are identical, and so are the destination IP and port, since both land on instance A. How can the target instance differentiate them? It cannot. The target will reset the second connection. Surprise: you have a low-probability race condition that causes TCP connections to be reset at random.
There are two ways to avoid this:
- Disable cross-zone load balancing. Then you’re left with load balancing based on DNS, which should be fine in most cases.
- Use a target group of type IP instead of instance, or disable client IP preservation on the instance-based target group. If you do that, the target sees the NLB endpoint’s private IP as the source IP instead of the client’s.
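The collision and the effect of both fixes can be checked with a small model of the connection 4-tuple as the target sees it. This is a sketch under stated assumptions: all addresses, the port number and the function names are hypothetical, and the model always takes the worst case in which endpoint B picks the instance in zone A whenever cross-zone load balancing allows it.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# Hypothetical addresses, for illustration only.
NAT_IP = "203.0.113.7"
ENDPOINT_PRIVATE_IP = {"zone-a": "10.0.1.5", "zone-b": "10.0.2.5"}
TARGET_IP = {"zone-a": "10.0.1.20", "zone-b": "10.0.2.20"}


def flow_at_target(src_port, endpoint_zone, target_zone, preserve_client_ip, dst_port=443):
    """Return the 4-tuple the target instance sees for one forwarded connection."""
    if preserve_client_ip:
        src_ip = NAT_IP  # the target sees the client's (NAT) address
    else:
        # The target sees the forwarding endpoint's private address. In practice
        # the NLB also chooses its own source port; keeping the client's port
        # here shows that the differing source IP is already enough.
        src_ip = ENDPOINT_PRIVATE_IP[endpoint_zone]
    return Flow(src_ip, src_port, TARGET_IP[target_zone], dst_port)


def collide(cross_zone, preserve_client_ip, nat_port=40000):
    """True if requests 1 and 2 look identical to the instance receiving them."""
    # Request 1 enters via endpoint A, request 2 via endpoint B, and the NAT
    # gateway has reused the same source port for both.
    target_for_b = "zone-a" if cross_zone else "zone-b"  # worst case when allowed
    first = flow_at_target(nat_port, "zone-a", "zone-a", preserve_client_ip)
    second = flow_at_target(nat_port, "zone-b", target_for_b, preserve_client_ip)
    return first == second


if __name__ == "__main__":
    for cross_zone in (True, False):
        for preserve in (True, False):
            print(f"cross_zone={cross_zone}, preserve_client_ip={preserve}: "
                  f"collision={collide(cross_zone, preserve)}")
```

Under these assumptions, only the combination of cross-zone load balancing and client IP preservation yields two identical tuples; turning off either one makes the two connections distinguishable again.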