eBPF
eBPF has emerged as a new alternative to the IPTables and IPVS mechanisms implemented by kube-proxy, with the promise of reducing CPU utilization and latency, improving throughput and increasing scale.
As of today, there are two implementations of Kubernetes Service’s data plane in eBPF – one from Calico and one from Cilium.
Since Cilium was the first product to introduce a kube-proxy-less data plane, we’ll focus on its implementation in this chapter. However, it should be noted that there is no “standard” way to implement the Services data plane in eBPF, so Calico’s approach may be different.
Cilium’s kube-proxy replacement is called Host-Reachable Services and it makes any ClusterIP reachable from the host (Kubernetes Node). It does that by attaching eBPF programs to cgroup hooks, intercepting socket-related system calls and transparently modifying the ones that are destined to ClusterIP VIPs. Since Cilium attaches these programs to the root cgroup, they affect all sockets of all processes on the host. As of today, Cilium’s implementation supports the following syscalls, which cover most of the use cases but depend on the underlying Linux kernel version:
This is what typically happens when a client, e.g. a process inside a Pod, tries to communicate with a remote ClusterIP:
- Client’s network application invokes one of the syscalls.
- eBPF program attached to this syscall’s hook is executed.
- The input to this eBPF program contains a number of socket parameters like destination IP and port number.
- These input details are compared to existing ClusterIP Services and if no match is found, control flow is returned to the Linux kernel.
- If one of the existing Services does match, the eBPF program selects one of the backend Endpoints and “redirects” the syscall to it by modifying its destination address, before passing it back to the Linux kernel.
- Subsequent data is exchanged over the opened socket by calling read() and write() without any involvement from the eBPF program.
It’s very important to understand that in this case, the destination NAT translation happens at the syscall level, before the packet is even built by the kernel. What this means is that the first packet to leave the client network namespace already has the right destination IP and port number and can be forwarded by a separate data plane managed by a CNI plugin (in most cases though the entire data plane is managed by the same plugin).
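The flow above can be modelled in a few lines of userspace Python. This is purely illustrative — the real logic lives in eBPF programs and maps, the map layout is different, and only the ClusterIP and the 10.0.0.27 backend address appear in the lab below (the other two backend addresses are made up for the example):

```python
import random

# Illustrative stand-in for Cilium's Services map:
# (ClusterIP, port) -> list of backend Endpoints.
SERVICES = {
    ("10.96.32.28", 80): [("10.0.0.27", 80), ("10.0.0.28", 80), ("10.0.0.29", 80)],
}

def connect4_hook(dst_ip, dst_port):
    """Model of the eBPF program attached to the connect() hook: rewrite
    the destination before the kernel ever builds a packet."""
    backends = SERVICES.get((dst_ip, dst_port))
    if backends is None:
        # No matching ClusterIP -- hand control back to the kernel as-is.
        return dst_ip, dst_port
    # Match found -- pick a backend and "redirect" the syscall to it.
    return random.choice(backends)

# The first packet leaving the client already carries a backend Pod IP:
print(connect4_hook("10.96.32.28", 80))
```

Note that a destination that doesn’t match any Service is returned unmodified, which corresponds to handing control flow straight back to the kernel.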
Info
Below is a high-level diagram of what happens when a Pod on Node worker-2 tries to communicate with a ClusterIP 10.96.32.28:80. See section below for a detailed code walkthrough.
Lab
Preparation
Assuming that the lab environment is already set up, Cilium can be enabled with the following command:
Wait for Cilium daemonset to initialize:
Now we need to “kick” all Pods to restart and pick up the new CNI plugin:
To make sure there is no interference from kube-proxy, we’ll remove it completely along with any IPTables rules set up by it:
Check that Cilium is healthy:
In order to have a working ClusterIP to test against, create a deployment with 3 nginx Pods and examine the assigned ClusterIP and IPs of the backend Pods:
Now let’s see what happens when a client tries to communicate with this Service.
A day in the life of a Packet
First, let’s take a look at the first few packets of a client session. Keep a close eye on the destination IP of the captured packets:
The first TCP packet sent at 20:11:29.780374 already contains the destination IP of one of the backend Pods. This kind of behaviour can benefit, but also easily trip up, applications that rely on traffic interception.
Now let’s take a close look at the “happy path” of the eBPF program responsible for this. The above curl command would try to connect to an IPv4 address and would invoke the connect() syscall, to which the connect4 eBPF program is attached (source).
Most of the processing is done inside the __sock4_xlate_fwd function; we’ll break it down into multiple parts for simplicity and omit some of the less important bits that cover special use cases like sessionAffinity and externalTrafficPolicy. Note that regardless of what happens in the above function, the returned value is always SYS_PROCEED, which returns the control flow back to the kernel.
The first thing that happens inside this function is the Services map lookup based on the destination IP and port:
Kubernetes Services can have an arbitrary number of Endpoints, depending on the number of matching Pods; however, eBPF map values have a fixed size, so storing a variable-sized list of Endpoints is not possible. To overcome this, the lookup process is broken into two steps:
- The first lookup is done just with the destination IP and port and the returned value tells how many Endpoints are currently associated with the Service.
- The second lookup is done with the same destination IP and port plus an additional field called backend_slot, which corresponds to one of the backend Endpoints.
During the first lookup backend_slot is set to 0. The returned value contains a number of fields but the most important one at this stage is count – the total number of Endpoints for this Service.
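The two-step lookup can be sketched with an ordinary dictionary standing in for the eBPF map. The key and value layout below is a simplification of the real C structs, and the assignment of backend IDs to slots is made up for the example (only the count of 3 and the IDs 9, 10 and 11 come from the map contents examined below):

```python
# A dict standing in for Cilium's lb4 services map; the real key is a C
# struct with address, port and backend_slot fields, and the value holds
# more fields than shown here.
LB4_SERVICES = {
    # (ip, port, backend_slot): (backend_id, count)
    ("10.96.32.28", 80, 0): (0, 3),   # slot 0: "master" entry holding the count
    ("10.96.32.28", 80, 1): (11, 0),  # slots 1..count: one entry per Endpoint
    ("10.96.32.28", 80, 2): (9, 0),
    ("10.96.32.28", 80, 3): (10, 0),
}

def lookup_service(ip, port, backend_slot=0):
    return LB4_SERVICES.get((ip, port, backend_slot))

# Step 1: slot 0 tells us how many Endpoints the Service has ...
backend_id, count = lookup_service("10.96.32.28", 80)
print(count)  # → 3
# Step 2: ... and a concrete slot resolves to a non-zero backend_id.
backend_id, _ = lookup_service("10.96.32.28", 80, backend_slot=3)
print(backend_id)
```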
Let’s look inside the eBPF map and see what entries match the last two octets of our ClusterIP 10.96.32.28:
If the backend_slot is set to 0, the key would only contain the IP and port of the Service, so that second line would match the first lookup and the returned value can be interpreted as:
backend_id = 0, count = 3
Now the eBPF program knows that the total number of Endpoints is 3, but it still hasn’t picked one. The control returns to the __sock4_xlate_fwd function, where the count information is used to update the key.backend_slot field:
This is where the backend selection takes place either randomly (for TCP) or based on the socket cookie (for UDP):
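A simplified Python model of this selection — not the actual eBPF source; the real code uses kernel helpers for randomness and the socket cookie, and the exact cookie-to-slot mapping below is an assumption for illustration:

```python
import random

def select_backend_slot(count, protocol, socket_cookie=0):
    """Pick a backend_slot in 1..count: randomly for TCP, or derived from
    the socket cookie for UDP so that all datagrams sent over the same
    socket keep hitting the same backend. (Simplified model.)"""
    if protocol == "udp":
        return (socket_cookie % count) + 1  # hypothetical mapping
    return random.randint(1, count)

print(select_backend_slot(3, "tcp"))                   # some slot in 1..3
print(select_backend_slot(3, "udp", socket_cookie=7))  # stable per socket
```

The design rationale is that TCP performs the translation once per connection, so a random pick is fine, while connectionless UDP must deterministically reach the same backend for every send on a given socket.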
The second lookup is performed in the same map, but now the key contains the previously selected backend_slot:
The lookup result will contain either one of the values from rows 1, 3 or 4 and will have a non-zero value for backend_id – 0b 00, 09 00 or 0a 00:
Using this value we can now extract IP and port details of the backend Pod:
Let’s assume that the backend_id that got chosen before was 0a 00 and look up the details in the eBPF map:
The returned value can be interpreted as:
- Address = 10.0.0.27
- Port = 80
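That raw map value can be decoded by hand. The byte layout assumed here — four IPv4 address bytes followed by a two-byte network-order port — is a simplification; the real backend struct carries additional fields:

```python
def parse_backend_value(value: bytes):
    """Decode a simplified backends-map value: 4 address bytes followed
    by 2 port bytes in network byte order (simplified layout)."""
    address = ".".join(str(b) for b in value[:4])
    port = int.from_bytes(value[4:6], "big")
    return address, port

# 0a 00 00 1b -> 10.0.0.27, 00 50 -> port 80
print(parse_backend_value(bytes([0x0A, 0x00, 0x00, 0x1B, 0x00, 0x50])))
# → ('10.0.0.27', 80)
```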
Finally, the eBPF program does the socket-based NAT translation, i.e. re-writing of the destination IP and port with the values returned from the earlier lookup:
At this stage, the eBPF program returns and execution flow continues inside the Linux kernel networking stack all the way until the packet is built and sent out of the egress interface. The packet continues along the path built by the CNI portion of Cilium.
This is all that’s required to replace the biggest part of kube-proxy’s functionality. One big difference from the kube-proxy implementation is that NAT translation only happens for traffic originating from one of the Kubernetes Nodes, i.e. externally originated ClusterIP traffic is not currently supported. This is why we haven’t considered the Any-to-Service communication use case, as we did for IPTables and IPVS.
Info
Due to a known issue with kind, make sure to run make cilium-unhook when you’re finished with this Cilium lab to detach eBPF programs from the host cgroup.