Cilium
Cilium is one of the most advanced and powerful Kubernetes networking solutions. At its core, it uses eBPF to deliver a wide range of functionality, from traffic filtering for NetworkPolicies all the way to CNI and kube-proxy replacement. Arguably, CNI is the least important part of Cilium, as it doesn't add as much value as, say, Host-Reachable Services, and is often dropped in favour of other CNI plugins (see CNI chaining). However, it still exists and satisfies the Kubernetes network model requirements in a unique way, which is why it is worth exploring separately from the rest of Cilium's functionality.
- Connectivity is set up by creating a `veth` link and moving one side of that link into a Pod's namespace. The other side of the link is left dangling in the node's root namespace. Cilium attaches eBPF programs to the ingress TC hooks of these links in order to intercept all incoming packets for further processing.
Note
One thing to note is that veth links in the root namespace do not have any IP address configured and most of the network connectivity and forwarding is performed within eBPF programs.
Reachability is implemented differently, depending on Cilium’s configuration:
- In the `tunnel` mode, Cilium sets up a number of VXLAN or Geneve interfaces and forwards traffic over them.
- In the `native-routing` mode, Cilium does nothing to set up reachability, assuming that it will be provided externally. This is normally done either by the underlying SDN (for cloud use cases) or by native OS routing (for on-prem use cases), which can be orchestrated with static routes or BGP.
For demonstration purposes, we’ll use a VXLAN-based configuration option and the following network topology:
Lab
Preparation
Assuming that the lab environment is already set up, Cilium can be enabled with the following command:
Wait for Cilium daemonset to initialize:
Now we need to “kick” all Pods to restart and pick up the new CNI plugin:
To make sure there’s is no interference from kube-proxy we’ll remove it completely along with any IPTables rules set up by it:
Check that Cilium is healthy:
Walkthrough
Here’s how the information from the above diagram can be validated (using worker2 as an example):
1. Pod IP and default route
The default route has its nexthop statically pinned to eth0@if24, which is also where ARP requests are sent:
As mentioned above, the peer side of eth0@if24 does not have any IP configured, so ARP resolution requires a bit of eBPF magic, described below.
2. Node’s eBPF programs:
Find out the name of the Cilium agent running on Node worker-2:
Each Cilium agent contains a copy of bpftool, which can be used to retrieve the list of eBPF programs along with their attachment points:
Each interface is listed together with its link index, so it’s easy to spot the program attached to eth0@if24.
Info
Attached eBPF programs can also be discovered using tc filter show dev lxc473b3117af85 ingress command.
Use bpftool prog show id to view additional information about a program, including a list of attached eBPF maps:
The program itself can be found on Cilium’s Github page in bpf/bpf_lxc.c. The code is very readable and easy to follow even for people not familiar with C. Below is an abridged version of the from-container program, showing only the relevant code paths:
Inside the handle_xgress function, the packet's Ethernet protocol number is examined to determine what to do with it next. Following the path an ARP packet would take, the next call is to the CILIUM_CALL_ARP program:
This leads to cilium/bpf/lib/arp.h, where an ARP reply is first prepared and then sent back out of the ingress interface using the redirect action:
As with all stateless ARP responders, a reply is crafted out of the original packet by swapping some of the fields and populating the others with well-known information (e.g. the source MAC):
3. Node’s eBPF maps.
bpftool is also helpful for viewing the list of eBPF maps together with their persistent pinned locations. The following command returns a structured list of all lpm_trie-type eBPF maps:
One of the most interesting maps in the above list is IPCACHE, which is used to perform efficient IP Longest-Prefix Match lookups. Examine the contents of this map:
The key for the lookup is based on the ipcache_key data structure:
The returned value is based on the remote_endpoint_info data structure:
4. Control plane information.
eBPF maps are populated by Cilium agents running as a DaemonSet, and every agent posts information about its local environment to custom Kubernetes resources. For example, Cilium Endpoints can be viewed like this:
A Day in the Life of a Packet
Now let’s track what happens when Pod-1 tries to talk to Pod-3.
Note
We’ll assume that the ARP and MAC tables are converged and fully populated and we’re tracing the first packet of a flow with no active conntrack entries.
Set up pointer variables for Pod-1, Pod-3 and the Cilium agents running on the egress and ingress Nodes:
1. Check the routing table of Pod-1:
2. Check the interface indices of Pod-1’s veth link:
3. Find the eBPF program attached to that interface:
4. eBPF Packet processing on egress Node
Note
For the sake of brevity, code walkthrough is reduced to a sequence of function calls only stopping at points when packet forwarding decisions are made.
- Packet’s header information is passed to the
handle_xgress, defined inbpf/bpf_lxc.c, where its Ethertype is checked. - All IPv4 packets are dispatched to
handle_ipv4_from_lxcvia an intermediatetail_handle_ipv4function. - Most of the packet processing decisions are made inside
handle_ipv4_from_lxc. At some point the execution flow reaches this part of the function where destination IP lookup is triggered. - The
lookup_ip4_remote_endpointfunction is defined insidebpf/lib/eps.hand usesIPCACHEeBPF map to look up information about a remote endpoint:
To simulate a map lookup, we can use the bpftool map lookup command and point it at the pinned location of the IPCACHE map. The key is based on the ipcache_key struct, with the destination IP 10.0.1.110, prefix length and ENDPOINT_KEY_IPV4 values specified:
The result contains one important value which will be used later to build an outer IP header:
- Target Node IP – 172.18.0.5 (from `0xac 0x12 0x00 0x05`)
- Once the lookup results are processed, execution continues in the `handle_ipv4_from_lxc` function and eventually reaches this encapsulation directive.
- All encapsulation-related functions are defined inside `cilium/bpf/lib/encap.h`; here the packet gets VXLAN-encapsulated and redirected straight to the egress VXLAN interface.
- At this point the packet has all the necessary headers and is delivered to the ingress Node by the underlay (in our case it's Docker's Linux bridge).
5. eBPF packet processing on ingress Node
- Once the VXLAN packet reaches the target Node, it triggers another eBPF hook:
- This time it’s the
from-overlayprogram located insidebpf/bpf_overlay.c. - All IPv4 packets get processed by the
handle_ipv4function. - Inside this function execution flow reaches the point where another map lookup is triggered. This lookup is needed to identify the local interface that’s supposed to receive this packet and build the correct Ethernet header.
- The
lookup_ip4_endpointfunction is defined insidebpf/lib/eps.h:
The ENDPOINTS_MAP is pinned in a file called cilium_lxc, which can be found next to the IPCACHE map in the /sys/fs/bpf/tc/globals/ directory. The key for the lookup can be built from the endpoint_key data structure by plugging in the values of the destination IP (10.0.1.110) and the IPv4 address family. The resulting lookup will look similar to this:
The value gets read into the endpoint_info struct and contains the following information:
- Interface index of the host side of the veth link – `0x0b`
- MAC address of the host side of the veth link – `3e:43:da:fb:c7`
- MAC address of the Pod side of the veth link – `6a:8c:ee:3a:73:d5`
- Endpoint ID (`lxc_id`), which is used in the dynamic egress policy lookup – `0xd6 0x07`
- At this point the lookup result gets passed to `ipv4_local_delivery`, which does two things:
  - Populates the source and destination MAC addresses and decrements the TTL.
  - Makes a tail-call to another eBPF program identified by the `lxc_id`.
- The last call is made to the `to-container` program, which passes the packet's context through `ipv4_policy` where, finally, it gets redirected to the destination `veth` interface.
SNAT functionality
Although Cilium supports eBPF-based masquerading, in the current lab this functionality had to be disabled due to its reliance on the Host-Reachable Services feature, which is known to have problems with kind.
In our case Cilium falls back to traditional IPTables-based masquerading of external traffic:
Info
Due to a known issue with kind, make sure to run make cilium-unhook when you’re finished with this Cilium lab to detach eBPF programs from the host cgroup.
Caveats and Gotchas
- Cilium's kube-proxy-free functionality depends on recent Linux kernel versions and comes with a number of known limitations.
- Since eBPF programs get loaded into the kernel, simulating a cluster on a shared kernel (e.g. with kind) may lead to unexpected issues. For full functionality testing, it is recommended to run each node in a dedicated VM, e.g. with something like Firecracker and Ignite.
Additional Reading
Cilium Code Walk Through Series including Life of a Packet in Cilium.
Cilium Datapath from the official documentation site.