Egress
Egress is a very loosely defined term in the Kubernetes ecosystem. Unlike its ingress counterpart, egress traffic is not controlled by any standard Kubernetes API or proxy. This is because most egress traffic is not revenue-generating and, in fact, can be completely optional. When a Pod needs to communicate with an external service, it often makes sense to do this via an API gateway rather than allow direct communication, and most service meshes provide this functionality, e.g. Consul’s Terminating Gateway or OSM’s Egress Policy API. However, we still need a way to allow Pod-initiated external communication without a service mesh integration, and this can be done in one of two ways:
- By default, traffic leaving a Pod will follow the default route out of a Node and will get masqueraded (SNAT’ed) to the address of the outgoing interface. This is normally provisioned by a CNI plugin option, e.g. the ipMasq option of the bridge plugin, or by a separate agent, e.g. ip-masq-agent.
- For security reasons, some or all egress traffic can get redirected to an “egress gateway” deployed on a subset of Kubernetes Nodes. The operation, UX and redirection mechanism are implementation-specific and can work at the application level, e.g. Istio’s Egress Gateway, or at the IP level, e.g. Cilium’s Egress Gateway.
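On a kind-based lab, the default masquerading behaviour can be inspected directly on a Node. The Node name below is kind’s default and the exact chain layout is CNI-specific, so treat this as a sketch:

```shell
# Hypothetical node name (kind default); the exact chain layout varies by CNI.
# Pod-sourced traffic leaving the Node matches a MASQUERADE rule in the nat table.
docker exec kind-worker iptables -t nat -S POSTROUTING | grep -i masquerade
```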
In both cases, the end result is that a packet leaves one of the Kubernetes Nodes, SNAT’ed to the address of the egress interface. The rest of the forwarding is done by the underlying network.
Lab
The way direct local egress works has already been described in the CNI part of this guide. Refer to the respective sections of the kindnet, flannel, weave, calico and cilium chapters for more details.
For this lab exercise, we’ll focus on how Cilium implements the Egress Gateway functionality via a custom resource called CiliumEgressNATPolicy.
Preparation
Assuming that the lab environment is already set up, Cilium can be enabled with the following command:
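The exact command depends on how the lab environment is built; one plausible Helm-based invocation (all values here are assumptions, reflecting the requirements of the Cilium 1.10-era egress gateway feature) would be:

```shell
helm repo add cilium https://helm.cilium.io
helm install cilium cilium/cilium --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set bpf.masquerade=true \
  --set egressGateway.enabled=true \
  --set tunnel=vxlan
```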
Wait for the Cilium daemonset to initialize:
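One way to do this (assuming the daemonset keeps its default name, cilium):

```shell
kubectl -n kube-system rollout status daemonset/cilium --timeout=120s
```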
Now we need to “kick” all Pods to restart and pick up the new CNI plugin:
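A simple, if blunt, way to do this is to delete every Pod and let the owning controllers recreate them; a sketch:

```shell
# Deleting the Pods forces them to be re-created with the new CNI plugin.
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  kubectl -n "$ns" delete pods --all
done
```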
To make sure there is no interference from kube-proxy, we’ll remove it completely along with any IPTables rules it has set up:
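On a kind cluster this could look as follows (the Node names are kind’s defaults and may differ):

```shell
kubectl -n kube-system delete daemonset kube-proxy
# Flush the KUBE-* iptables rules kube-proxy left behind on each Node.
for node in kind-control-plane kind-worker kind-worker2; do
  docker exec "$node" sh -c "iptables-save | grep -v KUBE | iptables-restore"
done
```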
Deploy an “external” echo server that will be used to check the source IP of the incoming request:
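One way to run such a server is as a plain Docker container attached to the same network as the kind Nodes; the image choice and network name below are assumptions:

```shell
# mendhak/http-https-echo replies with a JSON body that includes the client IP;
# "kind" is the default name of the docker network created by kind.
docker run -d --rm --name echo-server --net kind mendhak/http-https-echo
```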
By default, we should have a net-tshoot daemonset running on all Nodes:
We can use these Pods to verify the (default) local egress behaviour by sending an HTTP GET to the echo server:
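Assuming the daemonset’s Pods carry a name=net-tshoot label and the echo server’s address is known (both are assumptions here), the check could look like this:

```shell
ECHO_IP=172.18.0.100   # hypothetical address of the external echo server
for pod in $(kubectl get pods -l name=net-tshoot -o jsonpath='{.items[*].metadata.name}'); do
  # The "ip" field of the echo response contains the source IP seen by the server.
  kubectl exec "$pod" -- curl -s "http://$ECHO_IP:8080/" | grep '"ip"'
done
```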
These are the same IPs that are assigned to our lab Nodes:
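The Node addresses can be confirmed from the INTERNAL-IP column of:

```shell
kubectl get nodes -o wide
```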
Finally, we can enable the CiliumEgressNATPolicy that will NAT all traffic from Pods in the default namespace to the IP of the control-plane node:
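A policy matching the behaviour described above could look like this (a sketch based on the v2alpha1 API as it existed around Cilium 1.10; the field values are taken from this lab):

```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumEgressNATPolicy
metadata:
  name: egress-sample
spec:
  egress:
  - podSelector:
      matchLabels:
        # Special label that matches on the Pod's namespace.
        io.kubernetes.pod.namespace: default
  destinationCIDRs:
  - 172.18.0.0/16
  # IP of the control-plane Node, used as the SNAT source address.
  egressSourceIP: 172.18.0.3
```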
This can be verified by re-running the earlier command:
We can see that now all three requests appear to have come from the same Node.
Walkthrough
Now let’s briefly walk through how Cilium implements the above NAT policy. The Cilium CNI chapter explains how certain eBPF programs get attached to different interfaces. In our case, we’re looking at a program attached to all lxc interfaces and processing packets coming from a Pod, called from-container. Inside this program, a packet goes through several functions before it eventually gets to the handle_ipv4_from_lxc function (source), which does the bulk of the work in IPv4 packet processing. The relevant part of this function is this one:
Here, our packet’s source and destination IPs get passed to the lookup_ip4_egress_endpoint function, which performs a lookup in the following map:
The above can be translated as the following:
- Match all packets with a source IP of 10.0.0.174, 10.0.2.86 or 10.0.1.42 (all Pods in the default namespace) and a destination prefix of 172.18.0.0/16.
- Return the value with an egress IP of 172.18.0.3 and a tunnel endpoint of 172.18.0.3.
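The contents of this map can also be dumped from the cilium CLI inside any agent Pod; the subcommand below exists in the Cilium versions that ship this feature, but treat the exact output format as an assumption:

```shell
CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system exec "$CILIUM_POD" -- cilium bpf egress list
```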
The returned value is used in the encap_and_redirect_lxc function call that encapsulates the packet and forwards it to the Node with IP 172.18.0.3.
On the egress Node, our packet gets processed by the from-overlay function (source) and eventually falls through to the local network stack. The stack’s default route points out of the eth0 interface, which is where our packet gets forwarded next.
At this point, Cilium applies its configured IP masquerade policy using either IPTables or eBPF translation. The eBPF masquerading is implemented as a part of the to-netdev (source) program attached to the egress direction of the eth0 interface.
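The attachment itself can be seen on the egress Node with tc (the Node name is kind’s default, and the filter name may render differently across versions):

```shell
# Shows the bpf program (to-netdev) attached to the egress side of eth0.
docker exec kind-control-plane tc filter show dev eth0 egress
```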
From the handle_nat_fwd function (source), the processing goes through tail_handle_nat_fwd_ipv4 and nodeport_nat_ipv4_fwd, and eventually gets to the snat_v4_process function (source), where all of the NAT translations take place. All new packets fall through to the snat_v4_new_mapping function, where a new random source port gets allocated to the packet:
Finally, once the new source port has been selected and the connection tracking entry for subsequent packets has been set up, the packet gets its headers updated before being sent out of the egress interface: