IPVS
IPTables was the first implementation of kube-proxy’s dataplane; however, over time its limitations have become more pronounced, especially when operating at scale. Implementing a proxy with something that was designed to be a firewall has several side-effects, the main one being a limited set of data structures. This manifests itself in every ClusterIP Service needing a unique entry; these entries cannot be grouped and have to be processed sequentially as chains of rules. As a result, any dataplane lookup or create/update/delete operation needs to traverse the chain until a match is found which, at a large enough scale, can add minutes of processing time.
Note
Detailed performance analysis and measurement results of running iptables at scale can be found in the Additional Reading section at the bottom of the page.
All this led to ipvs being added as an enhancement proposal and eventually graduating to GA in Kubernetes version 1.11. The new dataplane implementation offers a number of improvements over the existing iptables mode:
- All Service load-balancing is migrated to IPVS, which can perform in-kernel lookups and masquerading in constant time, regardless of the number of configured Services or Endpoints.
- The remaining iptables rules have been re-engineered to make use of ipset, making the lookups more efficient.
- Multiple additional load-balancer scheduling modes are now available, with the default one being a simple round-robin.
On the surface, this makes the decision to use ipvs an obvious one, however, since iptables has been the default mode for so long, some of its quirks and undocumented side-effects have become the standard. One of the fortunate side-effects of the iptables mode is that a ClusterIP is never bound to any kernel interface and remains completely virtual (as a NAT rule). When ipvs changed this behaviour by introducing a dummy kube-ipvs0 interface, it made it possible for processes inside Pods to access any host-local services bound to 0.0.0.0 by targeting any existing ClusterIP. Although this does make ipvs less safe by default, it doesn’t mean that these risks can’t be mitigated (e.g. by not binding to 0.0.0.0).
The diagram below is a high-level and simplified view of two distinct datapaths for the same ClusterIP virtual service – one from a remote Pod and one from a host-local interface.
Lab Setup
Assuming that the lab environment is already set up, ipvs can be enabled with the following command:
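In a kubeadm-style cluster, one way to do this is to patch kube-proxy’s ConfigMap; the exact key and default value below are an assumption and may differ in your lab environment:

```shell
# Switch the proxier mode in kube-proxy's ConfigMap
# (assumes the default kubeadm layout where mode is an empty string)
kubectl -n kube-system get configmap kube-proxy -o yaml | \
  sed 's/mode: ""/mode: "ipvs"/' | \
  kubectl -n kube-system apply -f -
```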
Under the covers, the above command updates the proxier mode in kube-proxy’s ConfigMap, so in order for this change to get picked up, we need to restart all of the agents and flush out any existing iptables rules:
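Assuming kube-proxy runs as a DaemonSet (as it does in kubeadm clusters), the restart and flush could look like this sketch:

```shell
# Restart every kube-proxy agent so the new mode is picked up
kubectl -n kube-system rollout restart daemonset kube-proxy

# On each Node, flush the NAT rules left behind by the iptables mode
iptables -t nat -F && iptables -t nat -X
```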
Check the logs to make sure kube-proxy has loaded all of the required kernel modules. In case of a failure, the following error will be present in the logs and kube-proxy will fall back to the iptables mode:
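One way to inspect those logs (the label selector is an assumption based on the default kube-proxy DaemonSet):

```shell
# Look for ipvs-related messages; a failure to load the required kernel
# modules will show up here, along with the fallback to iptables
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i ipvs
```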
Another way to confirm that the change has succeeded is to check that Nodes now have a new dummy ipvs device:
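For example, on any Node:

```shell
# The dummy interface holds every ClusterIP, so the kernel
# accepts packets destined to them as local traffic
ip addr show kube-ipvs0
```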
Note
One thing to remember when migrating from iptables to ipvs on an existing cluster (as opposed to rebuilding it from scratch) is that all of the KUBE-SVC/KUBE-SEP chains will still be there, at least until they are cleaned up manually or the node is rebooted.
Spin up a test deployment and expose it as a ClusterIP Service:
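For example (the Deployment name and image below are arbitrary choices):

```shell
# Two replicas so IPVS has more than one backend to balance across
kubectl create deployment web --image=nginx --replicas=2
kubectl expose deployment web --port=80
```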
Check that all Pods are up and note the IP allocated to our Service:
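Assuming the Deployment from the previous step is called web:

```shell
kubectl get pods -o wide
kubectl get service web    # the CLUSTER-IP column holds the allocated VIP
```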
Before we move forward, there are a couple of dependencies we need to satisfy:
- Pick one of the Nodes hosting a test deployment and install the following packages:
- On the same Node set up the following set of aliases to simplify access to iptables, ipvs and ipset:
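The two steps above might look like the following sketch (package names assume a Debian-based Node; the alias names are just a convenience):

```shell
# Userspace tools for inspecting ipvs and ipset state
apt-get update && apt-get install -y ipvsadm ipset

# Shortcuts used throughout the rest of this walkthrough
alias ipt='iptables -t nat -nvL'
alias ipv='ipvsadm -ln'
alias ips='ipset list'
```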
Use case #1: Pod-to-Service communication
Any packet leaving a Pod will first pass through the PREROUTING chain which is where kube-proxy intercepts all Service-bound traffic:
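This can be seen by listing the PREROUTING chain of the nat table:

```shell
# In ipvs mode this chain contains a jump to KUBE-SERVICES
iptables -t nat -nvL PREROUTING
```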
The size of the KUBE-SERVICES chain is reduced compared to the iptables mode and the lookup stops once the destination IP is matched against the KUBE-CLUSTER-IP ipset:
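Listing the chain shows the difference:

```shell
# Lookups stop at the ipset match instead of traversing
# one chain per Service as in the iptables mode
iptables -t nat -nvL KUBE-SERVICES
```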
This ipset contains all existing ClusterIPs and the lookup is performed in O(1) time:
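The set can be dumped on any Node:

```shell
# One entry per ClusterIP; hash-based sets make membership checks O(1)
ipset list KUBE-CLUSTER-IP
```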
Following the lookup in the PREROUTING chain, our packet gets to the routing decision stage which is where it gets intercepted by Netfilter’s NF_INET_LOCAL_IN hook and redirected to IPVS.
This is where the packet gets DNAT’ed to the IP of one of the selected backend Pods (10.244.1.6 in our case) and continues on to its destination unmodified, following the forwarding path built by a CNI plugin.
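The IPVS virtual server table, including the selected backend, can be viewed with:

```shell
# Each ClusterIP:port appears as a virtual service with its backend
# Pods listed underneath (Masq denotes the NAT forwarding method)
ipvsadm -ln
```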
Use case #2: Any-to-Service communication
Any host-local service trying to communicate with a ClusterIP will first get its packet through OUTPUT and KUBE-SERVICES chains:
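The OUTPUT chain can be inspected the same way as PREROUTING:

```shell
# Locally-generated traffic hits OUTPUT, which jumps to KUBE-SERVICES
iptables -t nat -nvL OUTPUT
```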
Since the source IP does not belong to the PodCIDR range, our packet takes a detour via the KUBE-MARK-MASQ chain:
Here the packet gets marked for future SNAT, to make sure it will have a return path from the Pod:
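The marking rule can be seen with:

```shell
# KUBE-MARK-MASQ sets the 0x4000 masquerade mark (kube-proxy's default),
# which is matched later in the POSTROUTING stage
iptables -t nat -nvL KUBE-MARK-MASQ
```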
The following few steps are exactly the same as described for the previous use case:
- The packet reaches the end of the KUBE-SERVICES chain.
- The routing lookup returns a local dummy ipvs interface.
- IPVS intercepts the packet, performs the backend selection and DNATs the destination IP address.
The modified packet metadata continues along the forwarding path until it hits the egress veth interface where it gets picked up by the POSTROUTING chain:
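The POSTROUTING chain can be viewed the same way:

```shell
# Contains a jump to KUBE-POSTROUTING, where packets carrying
# the masquerade mark get SNAT'ed
iptables -t nat -nvL POSTROUTING
```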
This is where the source IP of the packet gets modified to match that of the egress interface, so the destination Pod knows where to send a reply:
The final masquerading action is performed if the destination IP and port match one of the local Endpoints, which are stored in the KUBE-LOOP-BACK ipset:
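This set can be inspected just like the KUBE-CLUSTER-IP one:

```shell
# Entries describe local Endpoints that require SNAT so that a Pod
# does not see its own IP as the source of hairpinned traffic
ipset list KUBE-LOOP-BACK
```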
Info
It should be noted that, similar to the iptables mode, all of the above lookups are only performed for the first packet of the session and all subsequent packets follow a much shorter path in the conntrack subsystem.
Additional reading
Scaling Kubernetes to Support 50,000 Services