eBPF
eBPF has emerged as a new alternative to the IPTables and IPVS mechanisms implemented by kube-proxy, with the promise of reducing CPU utilization and latency, improving throughput and increasing scale.
As of today, there are two implementations of Kubernetes Service’s data plane in eBPF – one from Calico and one from Cilium.
Since Cilium was the first product to introduce a kube-proxy-less data plane, we’ll focus on its implementation in this chapter. However, it should be noted that there is no “standard” way to implement the Services data plane in eBPF, so Calico’s approach may be different.
Cilium’s kube-proxy replacement is called Host-Reachable Services and it makes any ClusterIP reachable from the host (Kubernetes Node). It does that by attaching eBPF programs to cgroup hooks, intercepting socket-related system calls and transparently modifying the ones that are destined to ClusterIP VIPs. Since Cilium attaches these programs to the root cgroup, they affect all sockets of all processes on the host. As of today, Cilium’s implementation supports the following syscalls, which cover most of the use cases but depend on the underlying Linux kernel version:
This is what typically happens when a client, e.g. a process inside a Pod, tries to communicate with a remote ClusterIP:
- Client’s network application invokes one of the syscalls.
- eBPF program attached to this syscall’s hook is executed.
- The input to this eBPF program contains a number of socket parameters like destination IP and port number.
- These input details are compared to existing ClusterIP Services and if no match is found, control flow is returned to the Linux kernel.
- If one of the existing Services does match, the eBPF program selects one of the backend Endpoints and “redirects” the syscall to it by modifying its destination address, before passing it back to the Linux kernel.
- Subsequent data is exchanged over the opened socket by calling read() and write() without any involvement from the eBPF program.
It’s very important to understand that in this case, the destination NAT translation happens at the syscall level, before the packet is even built by the kernel. What this means is that the first packet to leave the client network namespace already has the right destination IP and port number and can be forwarded by a separate data plane managed by a CNI plugin (in most cases though the entire data plane is managed by the same plugin).
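The flow above can be modelled in a few lines of userspace Python. This is purely illustrative — the real logic lives in eBPF programs and maps, the map layout is different, and only the ClusterIP and the 10.0.0.27 backend address appear in the lab below (the other two backend addresses are made up for the example):

```python
import random

# Illustrative stand-in for Cilium's Services map:
# (ClusterIP, port) -> list of backend Endpoints.
SERVICES = {
    ("10.96.32.28", 80): [("10.0.0.27", 80), ("10.0.0.28", 80), ("10.0.0.29", 80)],
}

def connect4_hook(dst_ip, dst_port):
    """Model of the eBPF program attached to the connect() hook: rewrite
    the destination before the kernel ever builds a packet."""
    backends = SERVICES.get((dst_ip, dst_port))
    if backends is None:
        # No matching ClusterIP -- hand control back to the kernel as-is.
        return dst_ip, dst_port
    # Match found -- pick a backend and "redirect" the syscall to it.
    return random.choice(backends)

# The first packet leaving the client already carries a backend Pod IP:
print(connect4_hook("10.96.32.28", 80))
```

Note that a destination that doesn’t match any Service is returned unmodified, which corresponds to handing control flow straight back to the kernel.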
Info
Below is a high-level diagram of what happens when a Pod on Node worker-2 tries to communicate with a ClusterIP 10.96.32.28:80. See section below for a detailed code walkthrough.
Lab
Preparation
Assuming that the lab environment is already set up, Cilium can be enabled with the following command:
Wait for Cilium daemonset to initialize:
Now we need to “kick” all Pods to restart and pick up the new CNI plugin:
To make sure there is no interference from kube-proxy, we’ll remove it completely along with any IPTables rules set up by it:
Check that Cilium is healthy:
In order to have a working ClusterIP to test against, create a deployment with 3 nginx Pods and examine the assigned ClusterIP and IPs of the backend Pods:
Now let’s see what happens when a client tries to communicate with this Service.
A day in the life of a Packet
First, let’s take a look at the first few packets of a client session. Keep a close eye on the destination IP of the captured packets:
The first TCP packet sent at 20:11:29.780374 already contains the destination IP of one of the backend Pods. This kind of behaviour can benefit, but also easily trip up, applications that rely on traffic interception.
Now let’s take a close look at the “happy path” of the eBPF program responsible for this. The above curl command would try to connect to an IPv4 address and would invoke the connect() syscall, to which the connect4 eBPF program is attached (source).
Most of the processing is done inside the __sock4_xlate_fwd function; we’ll break it down into multiple parts for simplicity and omit some of the less important bits that cover special use cases like sessionAffinity and externalTrafficPolicy. Note that regardless of what happens in the above function, the returned value is always SYS_PROCEED, which returns the control flow back to the kernel.
The first thing that happens inside this function is the Services map lookup based on the destination IP and port:
Kubernetes Services can have an arbitrary number of Endpoints, depending on the number of matching Pods; however, eBPF map values have a fixed size, so storing a variable-sized list of Endpoints is not possible. To overcome this, the lookup process is broken into two steps:
- The first lookup is done just with the destination IP and port and the returned value tells how many Endpoints are currently associated with the Service.
- The second lookup is done with the same destination IP and port plus an additional field called backend_slot, which corresponds to one of the backend Endpoints.
During the first lookup backend_slot is set to 0. The returned value contains a number of fields but the most important one at this stage is count – the total number of Endpoints for this Service.
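The two-step lookup can be sketched with an ordinary dictionary standing in for the eBPF map. The key and value layout below is a simplification of the real C structs, and the assignment of backend IDs to slots is made up for the example (only the count of 3 and the IDs 9, 10 and 11 come from the map contents examined below):

```python
# A dict standing in for Cilium's lb4 services map; the real key is a C
# struct with address, port and backend_slot fields, and the value holds
# more fields than shown here.
LB4_SERVICES = {
    # (ip, port, backend_slot): (backend_id, count)
    ("10.96.32.28", 80, 0): (0, 3),   # slot 0: "master" entry holding the count
    ("10.96.32.28", 80, 1): (11, 0),  # slots 1..count: one entry per Endpoint
    ("10.96.32.28", 80, 2): (9, 0),
    ("10.96.32.28", 80, 3): (10, 0),
}

def lookup_service(ip, port, backend_slot=0):
    return LB4_SERVICES.get((ip, port, backend_slot))

# Step 1: slot 0 tells us how many Endpoints the Service has ...
backend_id, count = lookup_service("10.96.32.28", 80)
print(count)  # → 3
# Step 2: ... and a concrete slot resolves to a non-zero backend_id.
backend_id, _ = lookup_service("10.96.32.28", 80, backend_slot=3)
print(backend_id)
```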
Let’s look inside the eBPF map and see what entries match the last two octets of our ClusterIP 10.96.32.28:
If the backend_slot is set to 0, the key would only contain the IP and port of the Service, so that second line would match the first lookup and the returned value can be interpreted as:
backend_id = 0, count = 3
Now the eBPF program knows that the total number of Endpoints is 3, but it still hasn’t picked one. The control returns to the __sock4_xlate_fwd function, where the count information is used to update the key.backend_slot field:
This is where the backend selection takes place either randomly (for TCP) or based on the socket cookie (for UDP):
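A simplified Python model of this selection — not the actual eBPF source; the real code uses kernel helpers for randomness and the socket cookie, and the exact cookie-to-slot mapping below is an assumption for illustration:

```python
import random

def select_backend_slot(count, protocol, socket_cookie=0):
    """Pick a backend_slot in 1..count: randomly for TCP, or derived from
    the socket cookie for UDP so that all datagrams sent over the same
    socket keep hitting the same backend. (Simplified model.)"""
    if protocol == "udp":
        return (socket_cookie % count) + 1  # hypothetical mapping
    return random.randint(1, count)

print(select_backend_slot(3, "tcp"))                   # some slot in 1..3
print(select_backend_slot(3, "udp", socket_cookie=7))  # stable per socket
```

The design rationale is that TCP performs the translation once per connection, so a random pick is fine, while connectionless UDP must deterministically reach the same backend for every send on a given socket.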
The second lookup is performed in the same map, but now the key contains the previously selected backend_slot:
The lookup result will contain either one of the values from rows 1, 3 or 4 and will have a non-zero value for backend_id – 0b 00, 09 00 or 0a 00:
Using this value we can now extract IP and port details of the backend Pod:
Let’s assume that the backend_id that got chosen before was 0a 00 and look up the details in the eBPF map:
The returned value can be interpreted as:
- Address = 10.0.0.27
- Port = 80
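That raw map value can be decoded by hand. The byte layout assumed here — four IPv4 address bytes followed by a two-byte network-order port — is a simplification; the real backend struct carries additional fields:

```python
def parse_backend_value(value: bytes):
    """Decode a simplified backends-map value: 4 address bytes followed
    by 2 port bytes in network byte order (simplified layout)."""
    address = ".".join(str(b) for b in value[:4])
    port = int.from_bytes(value[4:6], "big")
    return address, port

# 0a 00 00 1b -> 10.0.0.27, 00 50 -> port 80
print(parse_backend_value(bytes([0x0A, 0x00, 0x00, 0x1B, 0x00, 0x50])))
# → ('10.0.0.27', 80)
```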
Finally, the eBPF program does the socket-based NAT translation, i.e. re-writing of the destination IP and port with the values returned from the earlier lookup:
At this stage, the eBPF program returns and execution flow continues inside the Linux kernel networking stack all the way until the packet is built and sent out of the egress interface. The packet continues along the path built by the CNI portion of Cilium.
This is all that’s required to replace the biggest part of kube-proxy’s functionality. One big difference from the kube-proxy implementation is that NAT translation only happens for traffic originating from one of the Kubernetes Nodes, i.e. externally originated ClusterIP traffic is not currently supported. This is why we haven’t considered the Any-to-Service communication use case, as we did for IPTables and IPVS.
Info
Due to a known issue with kind, make sure to run make cilium-unhook when you’re finished with this Cilium lab to detach eBPF programs from the host cgroup.