Calico
Calico is another example of a full-blown Kubernetes “networking solution” with functionality including a network policy controller, kube-proxy replacement, and network traffic observability. CNI functionality is still the core element of Calico, and the focus of this chapter will be on how it satisfies the Kubernetes network model requirements.
- Connectivity is set up by creating a veth link and moving one side of that link into the Pod’s namespace. The other side of the link is left dangling in the node’s root namespace. For each local Pod, Calico sets up a /32 host-route for the Pod’s IP, pointing over the veth link.
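The per-Pod plumbing described above can be approximated with standard iproute2 commands. This is a hedged sketch only: the interface name, namespace name and Pod IP are made-up values, and the real work is done by the Calico CNI plugin during the CNI ADD flow, not by a shell script.

```shell
# Illustrative sketch (requires root); names and IPs are hypothetical.
ip netns add pod1                              # stand-in for the Pod's network namespace
ip link add cali12345 type veth peer name eth0 netns pod1

ip -n pod1 addr add 10.244.192.5/32 dev eth0   # example Pod IP
ip -n pod1 link set eth0 up
ip link set cali12345 up                       # node end stays unnumbered

# /32 host-route on the node, pointing at the Pod over the veth link
ip route add 10.244.192.5/32 dev cali12345 scope link
```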
Note
One oddity of Calico’s CNI plugin is that the node end of the veth link does not have an IP address. In order to provide Pod-to-node egress connectivity, each veth link is set up with proxy_arp, which makes the root namespace respond to any ARP request coming from the Pod (assuming that the node itself has a default route).
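Sketched with an assumed interface name (real Calico interfaces look like cali0ab1c2d3e4f), the relevant setting is a single per-interface sysctl:

```shell
# Hypothetical node-side interface name
IFACE=cali12345

# Calico enables proxy ARP on the node end of every veth link
echo 1 > /proc/sys/net/ipv4/conf/${IFACE}/proxy_arp

# With proxy_arp=1 and a default route present in the root namespace,
# the kernel answers the Pod's ARP request for 169.254.1.1 with the
# MAC address of the cali* interface itself
cat /proc/sys/net/ipv4/conf/${IFACE}/proxy_arp
```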
Reachability can be established in two different ways:
Static routes and overlays – Calico supports IPIP and VXLAN and has an option to only set up tunnels for traffic crossing an L3 subnet boundary.
BGP – the most popular choice for on-prem deployments; it works by configuring a Bird BGP speaker on every node and setting up peerings to ensure that reachability information gets propagated to every node. There are several options for how to set up this peering, including a full mesh between nodes, a dedicated route-reflector node and external peering with the physical network.
Info
The above two modes are not mutually exclusive; for example, BGP can be used together with IPIP in public cloud environments. For a complete list of networking options for both on-prem and public cloud environments, refer to this guide.
For demonstration purposes, we’ll use a BGP-based configuration with an external, off-cluster route reflector. The fully converged and populated IP and MAC tables will look like this:
Lab
Assuming that the lab environment is already set up, Calico can be enabled with the following commands:
Check that the calico-node daemonset has all pods in READY state:
Now we need to “kick” all Pods to restart and pick up the new CNI plugin:
To make sure kube-proxy and calico set up the right set of NAT rules, existing NAT tables need to be flushed and re-populated:
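Collected as shell commands, the four steps above might look like this. The manifest URL and version are assumptions, and the iptables flush assumes shell access to each node; adjust both for your own lab:

```shell
# 1. Install Calico (manifest URL/version is an assumption -- check the docs)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml

# 2. Wait for the calico-node daemonset to have all pods READY
kubectl -n kube-system rollout status ds/calico-node

# 3. "Kick" all Pods so they get re-attached via the new CNI plugin
kubectl delete pods --all --all-namespaces

# 4. On each node: flush NAT rules; kube-proxy and calico-node re-create them
iptables -t nat -F
```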
Build and start a GoBGP-based route reflector:
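A minimal GoBGP route-reflector configuration could look like the TOML below; the ASN, router-id and neighbor address are placeholder values, and a real config would contain one neighbors block per cluster node:

```toml
[global.config]
  as = 64512
  router-id = "198.51.100.1"

[[neighbors]]
  [neighbors.config]
    neighbor-address = "192.168.224.2"   # example node address
    peer-as = 64512                      # iBGP: same ASN as the nodes
  [neighbors.route-reflector.config]
    route-reflector-client = true
    route-reflector-cluster-id = "198.51.100.1"
```

The daemon would then be started with this file, e.g. gobgpd -f gobgpd.conf.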
Finally, reconfigure Calico’s BGP daemonset to peer with the GoBGP route reflector:
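Using the projectcalico.org/v3 API, this change can be sketched as two resources: disable the default node-to-node full mesh and add the route reflector as a global BGP peer (peer IP and ASN are example values):

```yaml
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false   # stop peering every node with every other node
  asNumber: 64512
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: gobgp-rr
spec:
  peerIP: 198.51.100.1   # example route-reflector address
  asNumber: 64512        # iBGP: same ASN as the nodes
```

These resources are applied with calicoctl apply -f (or kubectl, depending on how the Calico API is exposed in the cluster).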
Here’s how the information from the diagram can be validated (using worker2 as an example):
- Pod IP and default route
Note how the default route is pointing to the fake next-hop address 169.254.1.1. This will be the same for all Pods and this IP will resolve to the same MAC address configured on all veth links:
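Both facts are easy to verify from inside a Pod (the Pod name is an example):

```shell
# Default route points at the link-local next-hop
kubectl exec net-tshoot-rg2lp -- ip route show default

# The next-hop resolves to the same MAC in every Pod; Calico sets
# ee:ee:ee:ee:ee:ee on all node-side cali* interfaces
kubectl exec net-tshoot-rg2lp -- ip neigh show 169.254.1.1
```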
- Node’s routing table
A few interesting things to note in the above output:
- The 2 x /24 routes programmed by bird are the PodCIDR ranges of the other two nodes.
- The blackhole /24 route is the PodCIDR of the local node.
- Inside the local PodCIDR there’s a /32 host-route configured for each running Pod.
- BGP RIB of the GoBGP route reflector
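Assuming the gobgp CLI is available on the route-reflector host, the RIB and the peering sessions can be inspected with:

```shell
# All prefixes learned from the cluster nodes (the PodCIDR ranges)
gobgp global rib

# State of each BGP session towards the nodes
gobgp neighbor
```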
A Day in the Life of a Packet
Let’s track what happens when Pod-1 (actual name is net-tshoot-rg2lp) tries to talk to Pod-3 (net-tshoot-6wszq).
Note
We’ll assume that the ARP and MAC tables are converged and fully populated. In order to do that, issue a ping from Pod-1 to Pod-3’s IP (10.244.236.0).
- Check the peer interface index of the veth link of Pod-1:
This information (if14) will be used in step 2 to identify the node side of the veth link.
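A sketch of that check (the Pod name is the example used above; the second command assumes shell access to the node):

```shell
# Inside the Pod: the "@if14"-style suffix is the ifindex of the peer interface
kubectl exec net-tshoot-rg2lp -- ip link show dev eth0

# On the node: find the cali* interface with that ifindex (14 is an example)
ip link show | grep '^14:'
```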
- Pod-1 wants to send a packet to 10.244.236.0. Its network stack performs a route lookup:
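This lookup can be reproduced with ip route get (the Pod name is an example):

```shell
# Which route does Pod-1's stack pick for Pod-3's IP?
kubectl exec net-tshoot-rg2lp -- ip route get 10.244.236.0
```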
- The next-hop IP is 169.254.1.1 on eth0, so an ARP table lookup is needed to get the destination MAC:
As mentioned above, the node side of the veth link doesn’t have any IP configured:
So in order to respond to an ARP request for 169.254.1.1, all veth links have proxy ARP enabled:
- The packet reaches the root namespace of the sending node, where another L3 lookup takes place:
- The packet is sent to the target node where another FIB lookup is performed:
The target IP is reachable over the veth link so ARP is used to determine the destination MAC address:
- Finally, the packet gets delivered to the eth0 interface of the target Pod:
SNAT functionality
SNAT functionality for traffic egressing the cluster is done in two stages:
- The cali-POSTROUTING chain is inserted at the top of the POSTROUTING chain.
- Inside that chain, cali-nat-outgoing SNATs all egress traffic originating from cali40masq-ipam-pools.
Calico configures all IPAM pools as ipsets for more efficient matching within iptables. These pools can be viewed on each individual node:
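On a node, the rule and the ipset behind it can be examined like this (chain and ipset names are taken from the text above):

```shell
# The SNAT rule matches on the ipset instead of individual CIDRs
iptables -t nat -S cali-nat-outgoing

# Members of the ipset are the configured IPAM pools
ipset list cali40masq-ipam-pools
```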
Caveats and Gotchas
- Calico supports GoBGP-based routing, but only as an experimental feature.
- BGP configs are generated from templates based on the contents of the Calico datastore. This makes the customization of the generated BGP config very problematic.