Cilium
Cilium is one of the most advanced and powerful Kubernetes networking solutions. At its core, it uses eBPF to deliver a wide range of functionality, from traffic filtering for NetworkPolicies all the way to CNI and kube-proxy replacement. Arguably, CNI is the least important part of Cilium, as it doesn't add as much value as, say, Host-Reachable Services, and is often dropped in favour of other CNI plugins (see CNI chaining). However, it still exists and satisfies the Kubernetes network model requirements in a unique way, which is why it is worth exploring separately from the rest of Cilium's functionality.
- Connectivity is set up by creating a `veth` link and moving one side of that link into a Pod's namespace. The other side of the link is left dangling in the node's root namespace. Cilium attaches eBPF programs to the ingress TC hooks of these links in order to intercept all incoming packets for further processing.
Note
One thing to note is that veth links in the root namespace do not have any IP address configured and most of the network connectivity and forwarding is performed within eBPF programs.
Reachability is implemented differently, depending on Cilium’s configuration:
- In the `tunnel` mode, Cilium sets up a number of VXLAN or Geneve interfaces and forwards traffic over them.
- In the `native-routing` mode, Cilium does nothing to set up reachability, assuming that it will be provided externally. This is normally done either by the underlying SDN (for cloud use cases) or by native OS routing (for on-prem use cases), which can be orchestrated with static routes or BGP.
For demonstration purposes, we’ll use a VXLAN-based configuration option and the following network topology:
Lab
Preparation
Assuming that the lab environment is already set up, Cilium can be enabled with the following command:
Wait for Cilium daemonset to initialize:
Now we need to “kick” all Pods to restart and pick up the new CNI plugin:
To make sure there’s is no interference from kube-proxy we’ll remove it completely along with any IPTables rules set up by it:
Check that Cilium is healthy:
Walkthrough
Here’s how the information from the above diagram can be validated (using worker2 as an example):
1. Pod IP and default route
The default route has its nexthop statically pinned to eth0@if24, which is also where ARP requests are sent:
As mentioned above, the peer side of eth0@if24 does not have any IP configured, so ARP resolution requires a bit of eBPF magic, described below.
2. Node’s eBPF programs:
Find out the name of the Cilium agent running on Node worker-2:
Each Cilium agent contains a copy of bpftool, which can be used to retrieve the list of eBPF programs along with their attachment points:
Each interface is listed together with its link index, so it’s easy to spot the program attached to eth0@if24.
Info
Attached eBPF programs can also be discovered using tc filter show dev lxc473b3117af85 ingress command.
Use bpftool prog show id to view additional information about a program, including a list of attached eBPF maps:
The program itself can be found on Cilium’s Github page in bpf/bpf_lxc.c. The code is very readable and easy to follow even for people not familiar with C. Below is an abridged version of the from-container program, showing only the relevant code paths:
Inside the handle_xgress function, the packet's Ethernet protocol number is examined to determine what to do with it next. Following the path an ARP packet would take, the next call is to the CILIUM_CALL_ARP program:
This leads to cilium/bpf/lib/arp.h, where an ARP reply is first prepared and then sent back out of the ingress interface using the redirect action:
As with all stateless ARP responders, a reply is crafted out of the original packet by swapping some of the fields and populating the others with well-known information (e.g. the source MAC):
3. Node’s eBPF maps.
bpftool is also helpful for viewing the list of eBPF maps together with their persistent pinned locations. The following command returns a structured list of all lpm_trie-type eBPF maps:
One of the most interesting maps in the above list is IPCACHE, which is used to perform efficient IP Longest-Prefix Match lookups. Examine the contents of this map:
The key for the lookup is based on the ipcache_key data structure:
The returned value is based on the remote_endpoint_info data structure:
4. Control plane information.
eBPF maps are populated by Cilium agents running as a DaemonSet, and every agent posts information about its local environment to custom Kubernetes resources. For example, Cilium Endpoints can be viewed like this:
A Day in the Life of a Packet
Now let’s track what happens when Pod-1 tries to talk to Pod-3.
Note
We’ll assume that the ARP and MAC tables are converged and fully populated and we’re tracing the first packet of a flow with no active conntrack entries.
Set up pointer variables for Pod-1, Pod-3 and the Cilium agents running on the egress and ingress Nodes:
1. Check the routing table of Pod-1:
2. Check the interface indices of Pod-1’s veth link:
3. Find the eBPF program attached to that interface:
4. eBPF Packet processing on egress Node
Note
For the sake of brevity, code walkthrough is reduced to a sequence of function calls only stopping at points when packet forwarding decisions are made.
- Packet’s header information is passed to the
handle_xgress, defined inbpf/bpf_lxc.c, where its Ethertype is checked. - All IPv4 packets are dispatched to
handle_ipv4_from_lxcvia an intermediatetail_handle_ipv4function. - Most of the packet processing decisions are made inside
handle_ipv4_from_lxc. At some point the execution flow reaches this part of the function where destination IP lookup is triggered. - The
lookup_ip4_remote_endpointfunction is defined insidebpf/lib/eps.hand usesIPCACHEeBPF map to look up information about a remote endpoint:
To simulate a map lookup, we can use the bpftool map lookup command and point it at the pinned location of the IPCACHE map. The key is based on the ipcache_key struct, with the destination IP 10.0.1.110, prefix length and ENDPOINT_KEY_IPV4 values specified:
The result contains one important value which will be used later to build an outer IP header:
- Target Node IP – 172.18.0.5 (from `0xac 0x12 0x00 0x05`)
- Once the lookup results are processed, execution continues in the `handle_ipv4_from_lxc` function and eventually reaches this encapsulation directive.
- All encapsulation-related functions are defined inside `cilium/bpf/lib/encap.h`; here the packet gets VXLAN-encapsulated and redirected straight to the egress VXLAN interface.
- At this point the packet has all the necessary headers and is delivered to the ingress Node by the underlay (in our case it's Docker's Linux bridge).
5. eBPF packet processing on ingress Node
- Once the VXLAN packet reaches the target Node, it triggers another eBPF hook:
- This time it’s the
from-overlayprogram located insidebpf/bpf_overlay.c. - All IPv4 packets get processed by the
handle_ipv4function. - Inside this function execution flow reaches the point where another map lookup is triggered. This lookup is needed to identify the local interface that’s supposed to receive this packet and build the correct Ethernet header.
- The
lookup_ip4_endpointfunction is defined insidebpf/lib/eps.h:
The ENDPOINTS_MAP is pinned in a file called cilium_lxc, which can be found next to the IPCACHE map in the /sys/fs/bpf/tc/globals/ directory. The key for the lookup can be built from the endpoint_key data structure by plugging in the values of the destination IP (10.0.1.110) and the IPv4 address family. The resulting lookup will look similar to this:
The value gets read into the endpoint_info struct and contains the following information:
- Interface index of the host side of the veth link – `0x0b`
- MAC address of the host side of the veth link – `3e:43:da:fb:c7`
- MAC address of the Pod side of the veth link – `6a:8c:ee:3a:73:d5`
- Endpoint ID (`lxc_id`), which is used in the dynamic egress policy lookup – `0xd6 0x07`
- At this point the lookup result gets passed to `ipv4_local_delivery`, which does two things:
  - Populates the source and destination MAC addresses and decrements the TTL.
  - Makes a tail-call to another eBPF program identified by the `lxc_id`.
- The last call is made to the `to-container` program, which passes the packet's context through `ipv4_policy` where, finally, it gets redirected to the destination `veth` interface.
SNAT functionality
Although Cilium supports eBPF-based masquerading, in the current lab this functionality had to be disabled due to its reliance on the Host-Reachable Services feature, which is known to have problems with kind.
In our case Cilium falls back to traditional IPTables-based masquerading of external traffic:
Info
Due to a known issue with kind, make sure to run make cilium-unhook when you’re finished with this Cilium lab to detach eBPF programs from the host cgroup.
Caveats and Gotchas
- Cilium's kube-proxy-free functionality depends on recent Linux kernel versions and comes with a number of known limitations.
- Since eBPF programs get loaded into the kernel, simulating a cluster on a shared kernel (e.g. with kind) may lead to unexpected issues. For full functionality testing, it is recommended to run each node in a dedicated VM, e.g. with something like Firecracker and Ignite.
Additional Reading
Cilium Code Walk Through Series including Life of a Packet in Cilium.
Cilium Datapath from the official documentation site.