Every time you use Netflix, visit Amazon, or open your banking app, your request quietly passes through one of the most important — and least visible — pieces of infrastructure in modern computing: a load balancer. Without it, a single server would collapse under the weight of millions of simultaneous users. With it, traffic flows intelligently across dozens or hundreds of servers, and the failure of any one machine goes completely unnoticed by the people relying on the service.
This article explains load balancers from first principles — what problem they solve, how they work mechanically, what types exist, and how each of the major load-balancing algorithms decides where to send the next request. Every concept is illustrated with a diagram so the logic is immediately clear, not just theoretically understood.
1. What Is a Load Balancer?
A load balancer is a device or piece of software that distributes incoming network requests across a pool of backend servers. It acts as the single point of contact for clients — users, browsers, mobile apps — and routes each request to whichever server in the pool is best positioned to handle it at that moment.
The core job of a load balancer is captured in its name: balancing the load. Load in this context means computational work — CPU cycles, memory consumption, network bandwidth, and open connections. If one server is processing 1,000 requests and another is idle, the load is unbalanced. The load balancer’s job is to prevent that scenario by distributing work intelligently.
Figure 1 — Basic Load Balancer Architecture
Clients send requests to a single address; the load balancer distributes them across the server pool
2. Why Do We Need Load Balancers?
The problem load balancers solve is straightforward: a single server has a finite capacity. It can handle some number of requests per second before it runs out of CPU, RAM, or network bandwidth. When traffic exceeds that limit, requests start timing out, pages stop loading, and users leave.
The solution is to run multiple servers and divide traffic among them. But dividing traffic intelligently requires something that understands the current state of each server and can make routing decisions in real time. That something is the load balancer.
Beyond raw capacity, load balancers solve four additional problems:
- High availability — if one server fails, the load balancer automatically stops sending it traffic and routes requests to the remaining healthy servers. Users experience no interruption.
- Scalability — adding capacity is as simple as adding a new server to the pool and registering it with the load balancer. No downtime, no reconfiguration of clients.
- Maintenance windows — a server can be gracefully removed from the pool for updates, drained of existing connections, updated, tested, and returned to service — all without any visible impact to users.
- Performance — by ensuring no single server is overloaded, all servers remain responsive and latency stays low across the entire user base.
3. How a Load Balancer Works
When a user’s browser sends a request to your website, the following sequence occurs:
- DNS resolution: the browser looks up your domain (e.g., app.example.com) and receives the IP address of the load balancer, not of any individual server.
- Connection to load balancer: the browser opens a TCP connection to the load balancer’s IP address and sends its HTTP request.
- Algorithm decision: the load balancer applies its configured algorithm (Round Robin, Least Connections, etc.) to select which backend server should handle this request.
- Forwarding the request: the load balancer forwards the request to the selected server, either by proxying it (opening its own connection to the server) or by rewriting the destination address (NAT-based balancing).
- Server response: the backend server processes the request and returns the response to the load balancer.
- Response delivery: the load balancer delivers the response to the client. From the client’s perspective, they communicated with a single address the whole time.
Figure 2 — Request and Response Flow
A sequence diagram showing the full journey of one HTTP request through a load balancer
4. Types of Load Balancers
4.1 Layer 4 Load Balancers (Transport Layer)
A Layer 4 load balancer operates at the transport layer of the OSI model — the level of TCP and UDP. It makes routing decisions based purely on IP addresses and port numbers. It does not read the content of the packets; it simply sees a TCP connection arriving on port 443 and decides which server gets that connection, then forwards all subsequent packets in that connection to the same destination.
How it works: The load balancer performs Network Address Translation (NAT), rewriting the destination IP address in the packet header from the load balancer’s IP to the chosen backend server’s IP. The server receives the packet as if it came directly from the client.
When to use Layer 4: When you need very high throughput with minimal latency overhead. Because the load balancer doesn’t need to inspect packet contents, it processes packets extremely quickly. Good for: game servers, streaming media, financial trading systems, any application where microseconds matter.
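The per-connection behaviour described above can be sketched as a small flow table. This is a toy model only: the `Layer4Balancer` class, its method names, and the addresses are hypothetical, and real Layer 4 balancers do this work in the kernel or in hardware rather than in application code.

```python
from itertools import cycle
from typing import Dict, Iterator, List, Tuple

# One flow is identified here by client IP, client port, and protocol.
FlowKey = Tuple[str, int, str]


class Layer4Balancer:
    """Toy model: choose a backend once per connection, then reuse it."""

    def __init__(self, backends: List[str]) -> None:
        self._rotation: Iterator[str] = cycle(backends)
        self._flows: Dict[FlowKey, str] = {}

    def route_packet(self, client_ip: str, client_port: int, protocol: str = "tcp") -> str:
        key = (client_ip, client_port, protocol)
        if key not in self._flows:
            # First packet of a new connection: pick a backend by simple rotation.
            self._flows[key] = next(self._rotation)
        # Every later packet of the same connection reaches the same backend.
        return self._flows[key]

    def close(self, client_ip: str, client_port: int, protocol: str = "tcp") -> None:
        # Forget the flow once the connection ends.
        self._flows.pop((client_ip, client_port, protocol), None)


lb = Layer4Balancer(["10.0.0.11", "10.0.0.12"])
first = lb.route_packet("203.0.113.7", 50123)
again = lb.route_packet("203.0.113.7", 50123)   # same connection
other = lb.route_packet("203.0.113.7", 50124)   # new connection from the same client
```

Note that the decision uses only addresses, ports, and protocol; nothing in the packet payload is read.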
Figure 3 — Layer 4 vs Layer 7 Load Balancing
Layer 4 LBs route by TCP/UDP info; Layer 7 LBs read HTTP content for smarter decisions
4.2 Layer 7 Load Balancers (Application Layer)
A Layer 7 load balancer operates at the application layer of the OSI model — the level of HTTP, HTTPS, WebSockets, and gRPC. It can read the full content of each request before making a routing decision. This means it can route based on:
- URL path: send /api/* requests to API servers and /images/* requests to a CDN or image-specific servers
- HTTP headers: route mobile user-agent strings to mobile-optimised servers, or route requests with a specific Authorization header to an authenticated backend
- Cookies: use session cookies to route a returning user to the same server that holds their session data
- Query parameters: send A/B test groups to different server pools
- Request content: parse the body of a POST request and route based on content type
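Content-based rules like these can be sketched as a small routing function. The pool names, the rule order, and the "Mobile" substring test are illustrative assumptions, not the behaviour of any particular product.

```python
from typing import Dict


def choose_pool(path: str, headers: Dict[str, str]) -> str:
    """Return the name of the (hypothetical) server pool for one HTTP request."""
    # Path rules are checked first, so they win over header rules.
    if path.startswith("/api/"):
        return "api"
    if path.startswith("/images/"):
        return "images"
    # Header rule: a crude check of the User-Agent string.
    if "Mobile" in headers.get("User-Agent", ""):
        return "mobile"
    return "default"


api_pool = choose_pool("/api/orders", {})
mobile_pool = choose_pool("/home", {"User-Agent": "Mozilla/5.0 (iPhone) Mobile Safari"})
```

A Layer 4 balancer could not apply any of these rules, because the path and headers are only visible once the HTTP request has been parsed.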
Layer 7 load balancers are also the right place for SSL/TLS termination — decrypting HTTPS traffic so that backend servers receive plain HTTP and don’t need to handle encryption overhead themselves.
Examples of Layer 7 load balancers: Nginx, HAProxy (in HTTP mode), AWS Application Load Balancer (ALB), Google Cloud HTTP(S) Load Balancer, Azure Application Gateway, Traefik, Envoy.
4.3 Hardware vs Software vs Cloud Load Balancers
- Hardware LB: dedicated physical appliances (F5 BIG-IP, Citrix ADC). Highest throughput, often 100 Gbps+. Very expensive. Common in large banks, telcos, legacy enterprises. Cost: SAR 50,000–500,000+.
- Software LB: Nginx, HAProxy, Traefik, running on standard Linux servers. Highly configurable. Used by most modern web applications and startups. Cost: that of a VM (SAR 100–500/mo).
- Cloud LB: AWS ALB/NLB, GCP Cloud Load Balancing, Azure Load Balancer. Managed service, global scale, auto-scaling. Pay-per-use with minimal ops overhead. Cost: ~SAR 60–300/mo + traffic.
4.4 DNS Load Balancing
DNS load balancing is conceptually different: instead of a single device routing requests, DNS returns different IP addresses for the same hostname in response to different DNS queries. Round-robin DNS returns Server 1’s IP to the first lookup, Server 2’s IP to the second, and so on.
DNS load balancing is extremely simple to implement but has significant limitations: DNS responses are cached for the duration of the TTL (often minutes to hours), so a failed server may still receive requests until the DNS cache expires. It also has no awareness of server health or load. Modern systems use dedicated load balancers for application traffic and use DNS primarily for geographic routing (sending users to the nearest data centre).
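The caching limitation can be illustrated with a small simulation. The `RoundRobinDNS` and `CachingClient` classes are hypothetical stand-ins for an authoritative server and a resolver cache, the addresses come from a documentation range, and time is passed in explicitly so the example does not depend on a real clock.

```python
from itertools import cycle
from typing import List, Optional, Tuple


class RoundRobinDNS:
    """Hands out the next IP in rotation for every query it actually receives."""

    def __init__(self, ips: List[str], ttl_seconds: float) -> None:
        self._rotation = cycle(ips)
        self.ttl_seconds = ttl_seconds

    def resolve(self) -> Tuple[str, float]:
        return next(self._rotation), self.ttl_seconds


class CachingClient:
    """Keeps the last answer until its TTL expires, as resolvers and browsers do."""

    def __init__(self, dns: RoundRobinDNS) -> None:
        self._dns = dns
        self._ip: Optional[str] = None
        self._expires_at = 0.0

    def lookup(self, now: float) -> str:
        if self._ip is None or now >= self._expires_at:
            self._ip, ttl = self._dns.resolve()
            self._expires_at = now + ttl
        return self._ip


dns = RoundRobinDNS(["192.0.2.1", "192.0.2.2"], ttl_seconds=300)
client = CachingClient(dns)
at_start = client.lookup(now=0)
still_cached = client.lookup(now=299)   # same answer, even if that server has failed
after_expiry = client.lookup(now=300)   # only now does the client ask again
```

Until the TTL runs out, the client keeps using the cached address regardless of what has happened to the server behind it.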
5. Load Balancing Algorithms
The algorithm is the core of a load balancer’s decision-making. For each request, the algorithm determines which server gets the work. Different algorithms make different trade-offs between simplicity, fairness, and responsiveness to real-world server state.
5.1 Round Robin
Round Robin is the simplest possible algorithm. Each new request goes to the next server in a fixed rotation. After the last server, it cycles back to the first.
Figure 4 — Round Robin Algorithm
Requests cycle sequentially: R1→S1, R2→S2, R3→S3, R4→S1, R5→S2…
Best for: Servers of identical capacity handling requests of similar processing cost. Stateless applications where any server can handle any request.
Weakness: If requests have very different processing costs (one request takes 10ms, another takes 2 seconds), some servers will become overloaded while others are idle — because Round Robin counts requests, not workload.
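A minimal sketch of the rotation, assuming three servers with the placeholder names S1 to S3:

```python
from itertools import cycle

servers = ["S1", "S2", "S3"]
rotation = cycle(servers)  # endless S1, S2, S3, S1, ...


def next_server() -> str:
    """Return the next server in the fixed rotation."""
    return next(rotation)


# Five requests in a row: the fourth wraps around to the first server.
assignments = [next_server() for _ in range(5)]
```

The function never looks at how busy a server is, which is exactly why uneven request costs can leave one server overloaded.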
5.2 Weighted Round Robin
Weighted Round Robin assigns each server a relative weight. Servers with higher weights receive proportionally more requests. A server with weight 3 receives three requests for every one request sent to a server with weight 1.
Figure 5 — Weighted Round Robin (Weight 3:2:1)
Higher-capacity servers receive proportionally more requests based on their assigned weight
Best for: Heterogeneous server pools where servers have different hardware capabilities. Also useful during gradual rollouts — give a new server weight 1 while existing servers have weight 5, and slowly increase the new server’s weight as confidence builds.
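One way to implement the weighting is the "smooth" variant sketched below, which interleaves the heavier server with the others instead of sending it three requests in a row. This is one possible approach, not the only one; the class name and the 3:2:1 weights are illustrative.

```python
from typing import Dict


class SmoothWeightedRoundRobin:
    """Each pick: add every weight to a running score, choose the highest
    score, then subtract the total weight from the chosen server."""

    def __init__(self, weights: Dict[str, int]) -> None:
        self._weights = dict(weights)
        self._current = {name: 0 for name in weights}
        self._total = sum(weights.values())

    def pick(self) -> str:
        for name, weight in self._weights.items():
            self._current[name] += weight
        # On a tie, max() returns the first server in insertion order.
        chosen = max(self._current, key=self._current.get)
        self._current[chosen] -= self._total
        return chosen


wrr = SmoothWeightedRoundRobin({"S1": 3, "S2": 2, "S3": 1})
sequence = [wrr.pick() for _ in range(6)]
counts = {name: sequence.count(name) for name in ("S1", "S2", "S3")}
```

Over any six consecutive picks the counts match the 3:2:1 weights, and the running scores return to zero so the cycle repeats.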
5.3 Least Connections
Least Connections routes each new request to the server that currently has the fewest active connections. The load balancer maintains a real-time count of open connections per server and updates it as connections open and close.
Figure 6 — Least Connections Algorithm
The next request always goes to the server with the fewest open connections at that moment
Best for: Applications with long-lived connections or requests of highly variable processing duration. Database connections, WebSocket applications, file upload/download services. Significantly outperforms Round Robin when request durations vary widely.
Weakness: A server with few connections is not the same as a server with low CPU usage. A server handling 5 extremely CPU-intensive operations may be more saturated than a server handling 50 trivial requests.
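The bookkeeping can be sketched as a counter per server that goes up when a connection opens and down when it closes. The `LeastConnections` class and its method names are hypothetical.

```python
from typing import Dict, List


class LeastConnections:
    def __init__(self, servers: List[str]) -> None:
        self.active: Dict[str, int] = {name: 0 for name in servers}

    def acquire(self) -> str:
        """Pick the server with the fewest open connections and count the new one."""
        server = min(self.active, key=self.active.get)  # first server wins a tie
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        """Call when a connection to `server` closes."""
        if self.active[server] > 0:
            self.active[server] -= 1


lc = LeastConnections(["S1", "S2", "S3"])
a, b, c = lc.acquire(), lc.acquire(), lc.acquire()  # one connection each
lc.release(b)                                       # S2's connection finishes early
d = lc.acquire()                                    # S2 now has the fewest
```

Because the counter only tracks open connections, it says nothing about how expensive each connection is, which is the weakness noted above.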
5.4 Weighted Least Connections
Combines the concepts of Least Connections and Weighted Round Robin. The routing decision is based on the ratio of current connections to the server’s weight. A server with 10 connections and weight 2 is treated as having 5 “effective” connections — better than a server with 8 connections and weight 1.
The selection formula: score = active_connections / weight. The server with the lowest score receives the next request.
Server 1: 30 connections, weight 3 → score = 30/3 = 10.0 ← CHOSEN
Server 2: 12 connections, weight 1 → score = 12/1 = 12.0
Server 3: 20 connections, weight 2 → score = 20/2 = 10.0 (tie → first one wins)
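The same calculation in code, using the three servers from the example. The tie-break relies on `min()` returning the first lowest-scoring entry in insertion order.

```python
from typing import Dict, Tuple

# name -> (active_connections, weight), values taken from the example above
servers: Dict[str, Tuple[int, int]] = {
    "Server 1": (30, 3),
    "Server 2": (12, 1),
    "Server 3": (20, 2),
}


def score(active_connections: int, weight: int) -> float:
    return active_connections / weight


scores = {name: score(*stats) for name, stats in servers.items()}
chosen = min(scores, key=scores.get)  # lowest score; first entry wins a tie
```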
5.5 IP Hash (Sticky Sessions by IP)
IP Hash generates a hash from the client’s IP address and uses that hash to deterministically select a server. The same client IP always maps to the same server (as long as the server pool doesn’t change).
The mapping is calculated as: server_index = hash(client_ip) % number_of_servers
Figure 7 — IP Hash Algorithm (Persistent Routing)
The same IP always maps to the same server — session state is preserved without cookies
Best for: Applications that store session data on the server (not in a shared cache) and need all requests from the same user to reach the same server. E-commerce checkout flows, applications using server-side sessions without Redis.
Weakness: If a server goes down, all sessions pinned to it are lost. Also, traffic distribution can be uneven if some IP ranges generate significantly more traffic than others (e.g., if many users are behind a corporate NAT and share one IP).
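A sketch of the mapping follows. SHA-256 is an arbitrary choice of stable hash made for this example (Python’s built-in `hash()` is salted per process for strings, so it would not give the same answer across restarts), and the client addresses are from a documentation range.

```python
import hashlib


def server_index(client_ip: str, number_of_servers: int) -> int:
    """server_index = hash(client_ip) % number_of_servers, with a stable hash."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % number_of_servers


# The same IP always lands on the same index while the pool size is unchanged.
same_twice = server_index("198.51.100.7", 3) == server_index("198.51.100.7", 3)

# Shrinking the pool from 4 servers to 3 changes the modulus, so many clients move.
clients = [f"198.51.100.{i}" for i in range(1, 201)]
before = {ip: server_index(ip, 4) for ip in clients}
after = {ip: server_index(ip, 3) for ip in clients}
moved = sum(1 for ip in clients if before[ip] != after[ip])
```

The `moved` count shows why the "as long as the server pool doesn’t change" caveat matters: with plain modulo hashing, resizing the pool reassigns clients that were never on the removed server.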
5.6 Least Response Time
The most intelligent of the common algorithms, Least Response Time routes each request to the server with the combination of fewest active connections AND lowest average response time. The load balancer continuously measures how long each server takes to respond and weights that into its routing decision.
Score formula: score = active_connections × avg_response_time_ms. Lower score wins.
Server 1: 10 conns × 50ms avg = 500 ← CHOSEN (fastest effective throughput)
Server 2: 3 conns × 200ms avg = 600
Server 3: 2 conns × 350ms avg = 700
This formula favours Server 1, which carries more connections but answers much faster than Servers 2 and 3. Least Connections would have sent the next request to Server 3, the slowest of the three, and Round Robin would have ignored both signals and simply picked whichever server was next in rotation.
Best for: Production systems with mixed workloads where server performance varies dynamically. Requires the load balancer to actively measure response times, which adds a small amount of overhead but significantly improves overall system performance.
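The scoring can be reproduced with the numbers from the example. The comparison against a plain connection count is included only to show how the two rules disagree.

```python
from typing import Dict, Tuple

# name -> (active_connections, avg_response_time_ms), values from the example above
servers: Dict[str, Tuple[int, float]] = {
    "Server 1": (10, 50.0),
    "Server 2": (3, 200.0),
    "Server 3": (2, 350.0),
}


def score(active_connections: int, avg_response_time_ms: float) -> float:
    return active_connections * avg_response_time_ms


scores = {name: score(*stats) for name, stats in servers.items()}
chosen = min(scores, key=scores.get)

# What a plain Least Connections rule would pick from the same data.
least_connections_choice = min(servers, key=lambda name: servers[name][0])
```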
5.7 Random
Random selection picks a backend server completely at random for each request. At scale and with large numbers of requests, Random approaches the distribution of Round Robin by the law of large numbers. It has the advantage of being stateless — the load balancer doesn’t need to track any state about previous routing decisions.
Best for: Simplicity in distributed load balancing scenarios, particularly in service mesh architectures (like Envoy/Istio) where multiple load balancers operate simultaneously and maintaining shared state about the “current” server in a Round Robin cycle would require coordination. Random with 2 choices (pick 2 random servers, choose the one with fewer connections) is a well-studied technique called “Power of Two Random Choices” that achieves near-optimal load distribution.
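A rough sketch of the two-choices idea, assuming each balancer only knows its own local connection counts. The function name, the ten placeholder servers, and the fixed seed are illustrative.

```python
import random
from typing import Dict, Optional


def pick_two_choices(active: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Sample two distinct servers at random and return the less loaded one."""
    source = rng if rng is not None else random
    first, second = source.sample(list(active), 2)
    return first if active[first] <= active[second] else second


# Simulate 1,000 requests that never finish, to see how evenly they spread.
rng = random.Random(42)
active = {f"S{i}": 0 for i in range(1, 11)}
for _ in range(1000):
    active[pick_two_choices(active, rng)] += 1
spread = max(active.values()) - min(active.values())
```

Inspecting `spread` after the loop gives a feel for how close the counts stay to each other; replacing the function with a single random pick makes a useful comparison.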
6. Health Checks and Failover
A load balancer is only useful if it routes traffic to servers that are actually working. Health checks are the mechanism by which the load balancer continuously verifies that each backend server is alive and capable of serving requests.
Figure 8 — Health Check and Automatic Failover
When Server 2 fails health checks, it is automatically removed from the pool — users see no disruption
The three types of health checks commonly used:
- TCP health check: attempts to open a TCP connection to the server’s port. If the connection succeeds, the server is considered alive. Fast, but only verifies the TCP stack is responding, not the application itself.
- HTTP health check: sends an HTTP GET to a specific health endpoint (e.g., /health or /ping) and expects a 200 OK response. Verifies the application is running and able to process requests.
- HTTPS health check: same as HTTP but over TLS, verifying both application health and certificate validity.
Health checks are typically configured with a failure threshold (e.g., fail 3 consecutive checks before removing) and a recovery threshold (e.g., pass 2 consecutive checks before re-adding). This prevents flapping: rapid add/remove cycles caused by a server that intermittently fails.
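The threshold logic can be sketched as a small state machine that is fed one check result at a time. The `HealthTracker` class is hypothetical, and the defaults of 3 and 2 are simply the example values from the text.

```python
class HealthTracker:
    """Flip a server's state only after enough consecutive contrary results."""

    def __init__(self, fail_threshold: int = 3, recovery_threshold: int = 2) -> None:
        self.fail_threshold = fail_threshold
        self.recovery_threshold = recovery_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with the current state
            return self.healthy
        self._streak += 1
        limit = self.fail_threshold if self.healthy else self.recovery_threshold
        if self._streak >= limit:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
results = [False, False, True, False, False, False, True, True]
states = [tracker.record(passed) for passed in results]
```

In this run the first two failures are forgiven because a pass interrupts them; only three failures in a row remove the server, and two passes in a row bring it back.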
7. Session Persistence (Sticky Sessions)
Traditional load balancing routes each request independently — a user’s second request might go to a different server than their first. For stateless applications (where no information is stored on the server between requests), this is fine. For stateful applications (where the server stores session data in memory), it causes problems: the second server doesn’t know about the user’s session.
Session persistence (also called sticky sessions) ensures that all requests from a specific user always go to the same backend server. There are three approaches:
- Cookie-based stickiness: the load balancer injects a cookie (e.g., SERVERID=server2) into the first response. Subsequent requests include that cookie, and the load balancer routes them to the specified server. Most flexible and doesn’t require same IP.
- IP hash stickiness: as described above, the client’s IP address determines the server. Simple but breaks when users share IPs (NAT, corporate proxies).
- Server-side session tracking: the load balancer maintains an internal table mapping session IDs to servers. More accurate than IP hash but requires the load balancer to parse session tokens.
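The cookie-based approach can be sketched as follows. The `route` function, the server names, and the fallback to simple rotation are assumptions for illustration; real load balancers differ in cookie format and in how they handle a pinned server that has gone down.

```python
from itertools import cycle
from typing import Dict, Optional, Set, Tuple

SERVERS = ["server1", "server2", "server3"]
_rotation = cycle(SERVERS)


def route(cookies: Dict[str, str], healthy: Set[str]) -> Tuple[str, Optional[str]]:
    """Return (server, cookie_to_set). cookie_to_set is None if the pin still holds."""
    pinned = cookies.get("SERVERID")
    if pinned in healthy:
        return pinned, None
    # No cookie yet, or the pinned server is down: pick a healthy server and pin it.
    for _ in range(len(SERVERS)):
        candidate = next(_rotation)
        if candidate in healthy:
            return candidate, f"SERVERID={candidate}"
    raise RuntimeError("no healthy backend servers")


healthy = set(SERVERS)
first_server, first_cookie = route({}, healthy)                              # new visitor
repeat_server, repeat_cookie = route({"SERVERID": first_server}, healthy)    # returning visitor
```

When the pinned server is unhealthy the user is moved to another server and re-pinned, but any session data held only in the old server’s memory does not move with them.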
The modern architectural best practice is to avoid sticky sessions entirely by storing session data in a shared cache (Redis, Memcached) rather than server memory. Any server can then handle any request because session state is always accessible centrally. This enables full horizontal scaling without stickiness constraints.
8. Choosing the Right Algorithm
9. Frequently Asked Questions
What is the difference between a load balancer and a reverse proxy?
A reverse proxy sits in front of servers and forwards client requests to them, and a load balancer is a type of reverse proxy. All load balancers are reverse proxies, but not all reverse proxies are load balancers. A reverse proxy that forwards all traffic to a single backend server is not load balancing. Tools like Nginx can act as both a reverse proxy and a load balancer depending on configuration.
How do you prevent the load balancer itself from becoming a single point of failure?
By running multiple load balancers in an active-active or active-passive configuration. In active-active, all load balancers share traffic via Anycast routing or DNS round-robin. In active-passive, one handles traffic while the other stands by, and a virtual IP (managed by a protocol such as VRRP or HSRP) floats between them: if the active one fails, the passive one claims the VIP and takes over within seconds. Cloud load balancers (AWS ALB, GCP LB) handle this automatically; the cloud infrastructure is inherently redundant.
Can a load balancer handle SSL/TLS encryption?
Yes, and this is called SSL termination. The load balancer handles the SSL handshake and decryption, then communicates with backend servers over plain HTTP (or re-encrypted HTTPS over an internal network). SSL termination reduces CPU load on backend servers, centralises certificate management (you renew one certificate at the load balancer rather than on every server), and allows the load balancer to inspect and route based on HTTP content that would otherwise be encrypted.
Which load balancer should I use for applications serving users in Saudi Arabia?
For cloud-hosted applications: AWS Application Load Balancer (ALB) if on AWS, or Azure Application Gateway if on Azure. They are available from the Bahrain and UAE regions respectively, giving low-latency connectivity for Saudi users. For on-premise or hybrid deployments: Nginx or HAProxy running on Ubuntu Server are the industry standard, highly configurable, and free. For maximum flexibility and features: Traefik is excellent for containerised (Docker/Kubernetes) environments. For high-traffic production systems requiring NCA ECC compliance, hardware load balancers (F5 BIG-IP, Citrix ADC) are deployed by large government and banking institutions.
What is the difference between local and global load balancing?
Local load balancing distributes traffic across servers within a single data centre or availability zone. Global load balancing (also called GSLB, Global Server Load Balancing) distributes traffic across multiple data centres in different geographic regions. GSLB typically uses DNS-based routing to direct users to the nearest or best-performing data centre. Most enterprises use both: GSLB to route to the right region, then local load balancing to distribute within that region.
How do load balancers handle WebSocket connections?
WebSocket connections are long-lived and stateful: once established, all messages in a conversation must go through the same server. Load balancers handle WebSockets by treating the initial HTTP upgrade handshake as a request and pinning the resulting WebSocket connection to the chosen server for its full lifetime. Layer 7 load balancers (Nginx, HAProxy, AWS ALB) support WebSocket pass-through. You must ensure your load balancer’s connection timeout is long enough to accommodate long-lived WebSocket sessions.
Conclusion
Load balancers are the invisible traffic directors of the modern internet — present in almost every production web application, yet rarely noticed when working correctly. Understanding how they work, what types exist, and how each algorithm makes its decisions gives you the knowledge to design systems that are scalable, highly available, and resilient to failure.
The key takeaways: use Round Robin for simplicity with uniform workloads; use Least Connections when request processing times vary; use Least Response Time when performance is critical and you need the best possible latency; use IP Hash only when you need session affinity without a shared session store; and use a Layer 7 load balancer when you need content-based routing, SSL termination, or header manipulation. Always configure health checks — a load balancer without health checks will route traffic to dead servers.