LoadBalancer
LoadBalancer is the most common way of exposing backend applications to the outside world. Its API is very similar to that of NodePort, the only difference being spec.type: LoadBalancer. At the very least, a user is expected to define which ports to expose and a label selector to match backend Pods:
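A minimal manifest might look like the one below (the Service name, port and label selector are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web              # hypothetical Service name
spec:
  type: LoadBalancer
  ports:
  - port: 80             # port exposed on the external IP
    protocol: TCP
  selector:
    app: web             # label selector matching backend Pods
```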
From the networking point of view, a LoadBalancer Service is expected to accomplish three things:
- Allocate a new, externally routable IP from a pool of addresses and release it when a Service is deleted.
- Make sure the packets for this IP get delivered to one of the Kubernetes Nodes.
- Program Node-local data plane to deliver the incoming traffic to one of the healthy backend Endpoints.
By default, Kubernetes will only take care of the last item, i.e. kube-proxy (or its equivalent) will program a Node-local data plane to enable external reachability; most of the work to enable this is already done by the NodePort implementation. However, the most challenging part, IP allocation and reachability, is left to external implementations. What this means is that in a vanilla Kubernetes cluster, LoadBalancer Services will remain in a "pending" state, i.e. they will have no external IP and will not be reachable from the outside:
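For example, a freshly-created Service in such a cluster would look similar to this (Service name and addresses are illustrative, exact output may vary):

```bash
$ kubectl get svc web
NAME   TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
web    LoadBalancer   10.96.54.11   <pending>     80:31389/TCP   10s
```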
However, as soon as a LoadBalancer controller gets installed, it collects all "pending" Services and allocates a unique external IP from its own pool of addresses. It then updates the Service status with the allocated IP and configures external infrastructure to deliver incoming packets to (by default) all Kubernetes Nodes.
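Once a controller has processed the Service, the allocated address shows up in the status.loadBalancer field, e.g. (the IP is illustrative):

```yaml
status:
  loadBalancer:
    ingress:
    - ip: 198.51.100.0   # external IP allocated by the controller
```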
As with anything involving orchestration of external infrastructure, the mode of operation of a LoadBalancer controller depends on its environment:
Both on-prem and public cloud-based clusters can use existing cloud L4 load balancers, e.g. Network Load Balancer (NLB) for Amazon Elastic Kubernetes Service (EKS), Standard Load Balancer for Azure Kubernetes Service (AKS), Cloud Load Balancer for Google Kubernetes Engine (GKE), the LBaaS plugin for OpenStack or NSX ALB for VMware. The in-cluster component responsible for load balancer orchestration is called cloud-controller-manager and is usually deployed next to the kube-controller-manager as part of the Kubernetes control plane.
On-prem clusters can have multiple configuration options, depending on the requirements and what infrastructure may already be available in a data centre:
- Existing load-balancer appliances from incumbent vendors like F5 can be integrated with on-prem clusters allowing for the same appliance instance to be re-used for multiple purposes.
- If direct interaction with the physical network is possible, load-balancing can be performed by one of the many cluster add-ons, utilising either gratuitous ARP (for L2 integration) or BGP (for L3 integration) protocols to advertise external IPs and attract traffic for those IPs to cluster Nodes.
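As an example of the L3 integration approach, MetalLB's BGP mode can be configured with a ConfigMap similar to the one below (the peer address, ASNs and address pool are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    peers:
    - peer-address: 192.0.2.1   # hypothetical ToR switch address
      peer-asn: 64500           # ASN of the physical network
      my-asn: 64501             # ASN used by the MetalLB speakers
    address-pools:
    - name: default
      protocol: bgp
      addresses:
      - 198.51.100.0/24         # pool of external IPs to allocate from
```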
There are many implementations of these cluster add-ons ranging from simple controllers, designed to work in isolated environments, all the way to feature-rich and production-grade projects. This is a relatively active area of development with new projects appearing almost every year. The table below is an attempt to summarise some of the currently available solutions along with their notable features:
| Name | Description |
|---|---|
| MetalLB | One of the most mature projects today. Supports both ARP and BGP modes via custom userspace implementations. Currently it can only be configured via ConfigMaps, with a CRD-based operator in the works. |
| OpenELB | Developed as a part of a wider project called Kubesphere. Supports both ARP and BGP modes, with BGP implementation built on top of GoBGP. Configured via CRDs. |
| Kube-vip | Started as a solution for Kubernetes control plane high availability and got extended to function as a LoadBalancer controller. Supports both L2 and GoBGP-based L3 modes. Can be configured via flags, env vars and ConfigMaps. |
| PureLB | Fork of MetalLB with reworked ARP and BGP implementations. Uses BIRD for BGP and can be configured via CRDs. |
| Klipper | An integrated LB controller for K3S clusters. Exposes LoadBalancer Services as hostPorts on all cluster Nodes. |
| Akrobateo | Extends the idea borrowed from klipper to work on any general-purpose Kubernetes Node (not just K3S). Like klipper, it doesn't use any extra protocol and simply relies on the fact that Node IPs are reachable from the rest of the network. The project is no longer active. |
Each of the above projects has its own pros and cons, but I deliberately didn't want to make a decision matrix. Instead, I'll provide a list of things worth considering when choosing a LoadBalancer add-on:
- IPv6 support: despite IPv6-only and dual-stack networking being supported for internal Kubernetes addressing, IPv6 support for external IPs is still quite patchy among many of the projects.
- Community: this applies to most CNCF projects; having an active community is a sign of a healthy project that is useful to more than just its main contributors.
- Control plane HA LB: load-balancing access to the control plane itself is very often left out of scope; however, it is still a problem that needs to be solved, especially for external access.
- Proprietary vs existing routing implementation: although the former may be an easier choice to implement (only a small subset of the ARP and BGP protocols is needed), troubleshooting may become an issue if the control plane is abstracted away, and extending its functionality is a lot more challenging compared to just turning a knob in one of the existing routing daemons.
- CRD vs ConfigMap: CRDs provide an easier and more Kubernetes-friendly way of configuring in-cluster resources.
Finally, it's very important to understand why LoadBalancer Services are also assigned a unique NodePort (the previous chapter explains how this happens). As we'll see in the lab scenarios below, NodePort is not really needed if we use direct network integration via BGP or ARP. In these cases, the underlying physical network is aware of both the external IP, as learned from BGP's NLRI or ARP's SIP/DIP fields, and its next-hop, learned from BGP's next-hop or ARP's source MAC fields. This information is advertised throughout the network so that every device knows where to send these packets.
However, the same does not apply to environments where the L4 load balancer is located multiple hops away from the cluster Nodes. In these cases, intermediate network devices are not aware of external IPs and only know how to forward packets to Node IPs. This is why an external load balancer will DNAT incoming packets to one of the Node IPs and will use the NodePort as a unique identifier of the target Service.
The last point is summarised in the following high-level diagram, showing how load balancers operate in two different scenarios:
Lab
The lab will demonstrate how MetalLB operates in an L3 mode. We'll start with a control plane (BGP) overview and demonstrate three different modes of data plane operation:
- IPTables orchestrated by kube-proxy
- IPVS orchestrated by kube-proxy
- eBPF orchestrated by Cilium
Preparation
Refer to the respective chapters for instructions on how to set up the IPTables, IPVS or eBPF data planes. Once the required data plane is configured, set up a test deployment with 3 Pods and expose it via a LoadBalancer Service:
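If no lab automation is available, one way to set this up manually is shown below (the deployment name and image are illustrative):

```bash
kubectl create deployment web --image=nginx --replicas=3
kubectl expose deployment web --port=80 --type=LoadBalancer
```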
The above command will also deploy a standalone container called frr. This container is attached to the same bridge as the lab Nodes and runs a BGP routing daemon (as a part of FRR) that will act as a top-of-rack (TOR) switch in a physical data centre. It is pre-configured to listen for incoming BGP connection requests, automatically peer with any speaker that connects and install any received routes into its local routing table.
Confirm the assigned LoadBalancer IP, e.g. 198.51.100.0 in the output below:
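Assuming the Service is called web, the check could look like this (exact output may vary):

```bash
$ kubectl get svc web
NAME   TYPE           CLUSTER-IP    CLUSTER-IP     PORT(S)        AGE
web    LoadBalancer   10.96.54.11   198.51.100.0   80:31389/TCP   2m
```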
To verify that the LoadBalancer Service is functioning, try connecting to the deployment from a virtual TOR container running the BGP daemon:
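Assuming the container is called frr and the assigned IP is 198.51.100.0, a quick connectivity test could be:

```bash
docker exec frr curl -s http://198.51.100.0
```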
Finally, setup the following command aliases:
Control plane overview
First, let's check how the routing table looks inside of our virtual TOR switch:
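Assuming FRR installs received BGP routes into the kernel, the result could look similar to this (Node IPs are illustrative):

```bash
$ docker exec frr ip route show 198.51.100.0
198.51.100.0 proto bgp metric 20
        nexthop via 172.18.0.2 dev eth0 weight 1
        nexthop via 172.18.0.3 dev eth0 weight 1
        nexthop via 172.18.0.4 dev eth0 weight 1
```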
We see a single host route with three equal-cost next-hops. This route is the result of BGP updates received from three MetalLB speakers:
Let's see how these updates look inside of the BGP database:
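Inside FRR's vtysh this could look similar to the abbreviated output below (addresses and the ASN are illustrative):

```bash
$ docker exec frr vtysh -c 'show bgp ipv4 unicast 198.51.100.0/32'
BGP routing table entry for 198.51.100.0/32
Paths: (3 available, best #1, table default)
  64501
    172.18.0.2 from 172.18.0.2 (172.18.0.2)
      Origin IGP, valid, external, multipath, best
  ...
```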
Unlike standard BGP daemons, MetalLB BGP speakers do not accept any incoming updates, so there's no way to influence their outbound routing. However, just sending the updates out while setting the next-hop to self is enough to establish external reachability. In a normal network, these updates will propagate throughout the fabric and within seconds the entire data centre will be aware of the new IP and where to forward packets for it.
Note
Since MetalLB implements both L2 and L3 modes in custom userspace code and doesn't interact with the kernel FIB, there's very limited visibility into the control plane state of the speakers; they will only log certain life-cycle events (e.g. BGP session state changes), which can be viewed with kubectl logs.
IPTables data plane
As soon as a LoadBalancer controller publishes an external IP in the status.loadBalancer field, kube-proxy, which watches all Services, gets notified and inserts a KUBE-FW-* chain right next to the ClusterIP entry of the same Service. So somewhere inside the KUBE-SERVICES chain, you will see a rule that matches the external IP:
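Such a rule might look like this (the chain hash, Service name, IP and port are illustrative):

```bash
-A KUBE-SERVICES -d 198.51.100.0/32 -p tcp -m comment --comment "default/web loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-XXXXXXXXXXXXXXXX
```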
Inside the KUBE-FW-* chain, packets get marked for IP masquerading (SNAT to the incoming interface's address) and get redirected to the KUBE-SVC-* chain. The final KUBE-MARK-DROP entry only comes into play when spec.loadBalancerSourceRanges is defined, in order to drop packets from unlisted prefixes:
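An illustrative version of this chain (hashes and the Service name are hypothetical):

```bash
-A KUBE-FW-XXXXXXXXXXXXXXXX -m comment --comment "default/web loadbalancer IP" -j KUBE-MARK-MASQ
-A KUBE-FW-XXXXXXXXXXXXXXXX -m comment --comment "default/web loadbalancer IP" -j KUBE-SVC-XXXXXXXXXXXXXXXX
-A KUBE-FW-XXXXXXXXXXXXXXXX -m comment --comment "default/web loadbalancer IP" -j KUBE-MARK-DROP
```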
The KUBE-SVC-* chain is the same as the one used for ClusterIP Services: one of the Endpoints gets chosen randomly and incoming packets get DNAT'ed to its address inside one of the KUBE-SEP-* chains:
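For example (chain hashes, the match probability and the Pod address are illustrative):

```bash
-A KUBE-SVC-XXXXXXXXXXXXXXXX -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-AAAAAAAAAAAAAAAA
-A KUBE-SEP-AAAAAAAAAAAAAAAA -p tcp -m tcp -j DNAT --to-destination 10.244.1.5:80
```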
See the IPTables chapter for more details.
IPVS data plane
The IPVS implementation of LoadBalancer is very similar to NodePort. The first rule of the KUBE-SERVICES chain intercepts all packets with a matching destination IP:
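In kube-proxy's IPVS mode this interception is done with an ipset match, along these lines:

```bash
-A KUBE-SERVICES -m set --match-set KUBE-LOAD-BALANCER dst,dst -j KUBE-LOAD-BALANCER
```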
The KUBE-LOAD-BALANCER ipset contains all external IPs allocated to LoadBalancer Services:
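Its contents can be viewed with the ipset CLI; an abbreviated, illustrative example:

```bash
$ ipset list KUBE-LOAD-BALANCER
Name: KUBE-LOAD-BALANCER
Type: hash:ip,port
Members:
198.51.100.0,tcp:80
```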
All matched packets get marked for SNAT, which is explained in more detail in the IPTables chapter:
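The marking itself is a single rule in the KUBE-LOAD-BALANCER chain:

```bash
-A KUBE-LOAD-BALANCER -j KUBE-MARK-MASQ
```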
The IPVS configuration contains one entry per external IP with all healthy backend Endpoints selected in a round-robin fashion:
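Viewed with ipvsadm, this could look similar to the abbreviated output below (Pod IPs are illustrative):

```bash
$ ipvsadm -ln
TCP  198.51.100.0:80 rr
  -> 10.244.1.5:80    Masq    1   0   0
  -> 10.244.2.7:80    Masq    1   0   0
  -> 10.244.3.9:80    Masq    1   0   0
```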
Cilium eBPF data plane
Cilium treats LoadBalancer Services the same way as NodePort ones. All of the code walkthroughs from the eBPF section of the NodePort chapter apply to this use case, i.e. incoming packets get intercepted as they ingress one of the external interfaces and get matched against a list of configured Services:
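One way to peek at this Service list is via the cilium CLI inside the agent Pod (output omitted here as it is environment-specific):

```bash
kubectl -n kube-system exec ds/cilium -- cilium service list
```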
If a match is found, packets go through destination NAT and, optionally, source address translation (for Services with spec.externalTrafficPolicy set to Cluster) and get redirected straight to the Pod's veth interface. See the NodePort chapter for more details and a code overview.
Caveats
For a very long time, Kubernetes only supported a single LoadBalancer controller per cluster. Support for running multiple controllers was introduced relatively recently (via the spec.loadBalancerClass field), however controller implementations are still catching up.
Most of the big public cloud providers do not support the BYO controller model, so cluster add-ons that rely on L2 or L3 integration will only work in some clouds (e.g. Packet) but not in others (e.g. AWS, Azure, GCP). However, it's still possible to use controllers that rely on hostPort (e.g. klipper, akrobateo).