IPTABLES
Most of the focus of this section will be on the standard node-local proxy implementation called kube-proxy. It is used by default by most Kubernetes orchestrators and is installed as a daemonset on top of a newly bootstrapped cluster:
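Its presence can be confirmed like this (assuming a kubeadm-style bootstrap; the daemonset name may differ on managed clusters):

```shell
# kube-proxy runs as a daemonset in the kube-system namespace,
# one pod per node
kubectl get daemonset kube-proxy -n kube-system
```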
The default mode of operation for kube-proxy is iptables, as it supports a wider set of operating systems without requiring extra kernel modules and has “good enough” performance characteristics for the majority of small to medium-sized clusters.
This area of Kubernetes networking is one of the most poorly documented. On the one hand, there are blog posts that cover parts of the kube-proxy dataplane; on the other hand, there’s an amazing diagram created by Tim Hockin that shows the complete logical flow of packet forwarding decisions but provides very little context and is quite difficult to trace for specific flows. The goal of this article is to bridge the gap between these two extremes, providing a high level of detail while maintaining an easily consumable format.
For demonstration purposes, we’ll use the following topology with a “web” deployment and two pods scheduled on different worker nodes. The packet forwarding logic for ClusterIP-type services has two distinct paths within the dataplane, which is what we’ll focus on next:
- Pod-to-Service communication (purple packets) – implemented entirely within an egress node and relies on CNI for pod-to-pod reachability.
- Any-to-Service communication (grey packets) – includes any externally-originated and, most notably, node-to-service traffic flows.
The above diagram shows a slightly simplified sequence of match/set actions implemented inside Netfilter’s NAT table. The lab section below will show a more detailed view of this dataplane along with verification commands.
Note
One key thing to remember is that none of the ClusterIPs implemented this way are visible in the Linux routing table. The whole dataplane is implemented entirely within the iptables NAT table, which makes it both very flexible and, at the same time, extremely difficult to troubleshoot.
Lab Setup
To make sure the lab is in the right state, reset it to a blank state:
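The exact reset procedure depends on how the lab was built; if your lab tooling doesn’t provide its own reset command, one generic way to clean up the resources used below is:

```shell
# Remove the "web" deployment and service if they exist from a
# previous run (generic cleanup; lab tooling may offer a dedicated
# reset command instead)
kubectl delete deployment web --ignore-not-found
kubectl delete service web --ignore-not-found
```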
Now let’s spin up a new deployment and expose it with a ClusterIP service:
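A minimal sketch of those two commands (nginx is an arbitrary image choice for the “web” deployment):

```shell
# Create the "web" deployment with two replicas
kubectl create deployment web --image=nginx --replicas=2

# Expose it on port 80 with a service of the default type, ClusterIP
kubectl expose deployment web --port=80
```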
The result of the above two commands can be verified like this:
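For example (kubectl create deployment labels the pods with app=web):

```shell
# Both pods should be Running and scheduled on different worker nodes
kubectl get pods -l app=web -o wide

# The service should have a ClusterIP assigned
kubectl get service web
```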
The simplest way to test connectivity would be to connect to the assigned ClusterIP 10.96.94.225 from one of the nodes, e.g.:
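Assuming a kind-based lab where each node is a Docker container, this could look like:

```shell
# Connect to the ClusterIP from inside the k8s-guide-worker node;
# any of the two backend pods may answer
docker exec k8s-guide-worker curl -s http://10.96.94.225
```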
One last thing before moving on: let’s set up the following bash alias as a shortcut to k8s-guide-worker’s NAT table:
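For example (the alias name is arbitrary; this again assumes kind-style nodes running as Docker containers):

```shell
# -t nat selects the NAT table, -n skips DNS lookups, -v shows packet counters
alias nat="docker exec k8s-guide-worker iptables -t nat -nvL"
```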
Use case #1: Pod-to-Service communication
According to Tim’s diagram, all Pod-to-Service packets get intercepted by the PREROUTING chain:
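The interception is a single rule installed by kube-proxy at the top of the PREROUTING chain, shown here in iptables-save format:

```
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
```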
These packets get redirected to the KUBE-SERVICES chain, where they get matched against all configured ClusterIPs, eventually reaching these lines:
Since the source IP of the packet belongs to a Pod (10.244.0.0/16 is the PodCIDR range), the second line gets matched and the lookup continues in the service-specific chain. Here we have two Pods matching the same label selector (--replicas=2) and both endpoint chains are configured with equal distribution probability:
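For our service, these rules look roughly as follows (the KUBE-SVC-/KUBE-SEP- hash suffixes below are made up for illustration; the real ones are derived from the service and endpoint names). The first KUBE-SEP jump is taken with probability 0.5 and the second one catches everything else, which yields the equal split:

```
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.96.94.225/32 -p tcp -m comment --comment "default/web cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.94.225/32 -p tcp -m comment --comment "default/web cluster IP" -m tcp --dport 80 -j KUBE-SVC-LOLE4ISW44XBNF3G
-A KUBE-SVC-LOLE4ISW44XBNF3G -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-HYO6UEYSXJEQFPBT
-A KUBE-SVC-LOLE4ISW44XBNF3G -j KUBE-SEP-GFPAJ7EGCNTMSHH3
```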
Let’s assume that in this case the first rule gets matched, so our packet continues to the next chain, where it gets DNAT’ed to the IP of the destination Pod (10.244.1.3):
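The endpoint-specific chain might look like this (same hypothetical hash suffix as above; the first rule only matches hairpin traffic, i.e. a Pod talking to itself via the service):

```
-A KUBE-SEP-HYO6UEYSXJEQFPBT -s 10.244.1.3/32 -m comment --comment "default/web" -j KUBE-MARK-MASQ
-A KUBE-SEP-HYO6UEYSXJEQFPBT -p tcp -m comment --comment "default/web" -m tcp -j DNAT --to-destination 10.244.1.3:80
```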
From here on our packet remains unmodified and continues along its forwarding path set up by a CNI plugin until it reaches the target Node and gets sent directly to the destination Pod.
Use case #2: Any-to-Service communication
Let’s assume that the k8s-guide-worker node (IP 172.18.0.12) is sending a packet to our ClusterIP service. This packet gets intercepted in the OUTPUT chain and continues to KUBE-SERVICES, where it gets redirected via the KUBE-MARK-MASQ chain:
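The node’s IP (172.18.0.12) is not in the PodCIDR range, so this time the first, negated-source rule matches:

```
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.96.94.225/32 -p tcp -m comment --comment "default/web cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
```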
The purpose of this chain is to mark all packets that will need to get SNAT’ed before they get sent to the final destination:
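The chain itself contains a single rule that sets the 0x4000 bit in the packet’s mark:

```
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
```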
Since MARK is not a terminating target, the lookup continues down the KUBE-SERVICES chain, where our packet gets DNAT’ed to one of the randomly selected backend endpoints (as shown above).
However, this time, before it gets sent to its final destination, the packet gets another detour via the KUBE-POSTROUTING chain:
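The detour starts with a jump installed at the top of the standard POSTROUTING chain:

```
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
```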
Here all packets with a special SNAT mark (0x4000) fall through to the last rule and get SNAT’ed to the IP of the outgoing interface, which in this case is the veth interface connected to the Pod:
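In recent kube-proxy versions, this chain looks roughly like the following: unmarked packets RETURN early, while marked packets have the mark bit cleared and then get MASQUERADE’d:

```
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE
```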
The final MASQUERADE action ensures that the return packets follow the same path back, even if they originated outside of the Kubernetes cluster.
Info
The above sequence of lookups may look long and inefficient, but bear in mind that it is only performed once, for the first packet of a flow; the rest of the session gets offloaded to Netfilter’s connection tracking system.