NodePort
NodePort builds on top of the ClusterIP Service and provides a way to expose a group of Pods to the outside world. At the API level, the only difference from a ClusterIP Service is the mandatory type field, which has to be set to NodePort; the rest of the values can remain the same.
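For reference, a minimal NodePort Service manifest might look like this; the name, selector and port numbers below are hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web            # hypothetical name
spec:
  type: NodePort       # the only change relative to a ClusterIP Service
  selector:
    app: web
  ports:
  - port: 80           # ClusterIP port
    targetPort: 8080   # container port
    # nodePort: 30510  # optional; allocated automatically when omitted
```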
Whenever a new Kubernetes cluster gets built, one of the available configuration parameters is service-node-port-range, which defines the range of ports to use for NodePort allocation and usually defaults to 30000-32767. One interesting thing about NodePort allocation is that it is not managed by a controller. The configured port range value eventually gets passed to the kube-apiserver as an argument, and allocation happens as the API server saves a Service resource into its persistent storage (e.g. an etcd cluster); a unique port is allocated for both NodePort and LoadBalancer Services. So by the time the Service definition makes it to the persistent storage, it already contains a couple of extra fields:
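The allocation logic itself can be approximated with a short sketch; this is a simplification for illustration, not the actual kube-apiserver code:

```python
import random

# default value of the service-node-port-range apiserver argument
NODE_PORT_RANGE = range(30000, 32768)

def allocate_node_port(allocated, requested=None):
    """Pick a free port from the range, honouring an explicit
    spec.ports[].nodePort request when one is provided."""
    if requested is not None:
        if requested not in NODE_PORT_RANGE or requested in allocated:
            raise ValueError(f"port {requested} is out of range or already taken")
        allocated.add(requested)
        return requested
    free = [p for p in NODE_PORT_RANGE if p not in allocated]
    if not free:
        raise RuntimeError("service-node-port-range exhausted")
    port = random.choice(free)
    allocated.add(port)
    return port
```

A requested port outside the range (or one that is already in use) is rejected, which mirrors the validation error you would get from the API server.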
One of the side-effects of this kind of behaviour is that ClusterIP and NodePort values are immutable: they cannot be changed throughout the lifecycle of an object. The only way to change or update an existing Service is to provide the right metadata and omit both ClusterIP and NodePort values from the spec.
From the networking point of view, NodePort's implementation is very easy to understand:
- For each port in the NodePort Service, the API server allocates a unique port from the service-node-port-range.
- This port is programmed in the dataplane of each Node by kube-proxy (or its equivalent); the most common implementations, based on IPTables, IPVS and eBPF, are covered in the Lab section below.
- Any incoming packet matching one of the configured NodePorts will get destination NAT'ed to one of the healthy Endpoints and source NAT'ed (via masquerade/overload) to the address of the incoming interface.
- The reply packet coming from the Pod will get reverse NAT'ed using the connection tracking entry set up by the incoming packet.
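The NAT steps above can be modelled with a toy connection tracker; all addresses below are hypothetical and the code only sketches the behaviour, not any real implementation:

```python
import random

class ToyConntrack:
    """Toy model of the NodePort DNAT/SNAT behaviour described above."""

    def __init__(self, endpoints, node_ip):
        self.endpoints = endpoints  # healthy backend (ip, port) pairs
        self.node_ip = node_ip      # address of the incoming interface
        self.flows = {}             # (client, nodeport_dst) -> chosen endpoint

    def inbound(self, client, dst):
        """DNAT to one healthy Endpoint, SNAT to the incoming interface.
        An existing flow keeps its endpoint thanks to the conntrack entry."""
        endpoint = self.flows.setdefault((client, dst), random.choice(self.endpoints))
        return (self.node_ip, endpoint)  # translated (src, dst)

    def reply(self, client, dst):
        """Reverse NAT for the reply packet of an established flow."""
        assert (client, dst) in self.flows, "no conntrack entry"
        return (dst, client)  # the client sees the original NodePort address
```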
Note
Both DNAT and SNAT can be avoided by using Direct Server Return (DSR) and service.spec.externalTrafficPolicy respectively. This is discussed in the Optimisations chapter.
The following diagram shows network connectivity for a couple of hypothetical NodePort Services.
Note
One important thing worth remembering is that a NodePort Service is rarely used on its own. Most of the time, you'd use a LoadBalancer type service which builds on top of the NodePort. That being said, NodePort services can be quite useful on their own in environments where a LoadBalancer is not available or in more static setups utilising spec.externalIPs.
Lab
To demonstrate the different modes of dataplane operation, we'll use three different scenarios:
- IPTables as orchestrated by kube-proxy
- IPVS as orchestrated by kube-proxy
- eBPF as orchestrated by Cilium
Preparation
Refer to the respective chapters for instructions on how to set up the IPTables, IPVS or Cilium eBPF data planes. Once the required data plane is configured, set up a test deployment with 3 Pods and expose it via a NodePort Service:
Confirm the assigned NodePort (e.g. 30510 in the output below) and take note of the Endpoint addresses:
To verify that a NodePort Service is functioning, first determine the IPs of each of the cluster Nodes:
Combine each IP with the assigned NodePort value and check that there is external reachability from your host OS:
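This check can also be scripted; the sketch below simply attempts a TCP connection to each ip:nodeport pair (the addresses in the comment are hypothetical):

```python
import socket

def node_port_reachable(ip, port, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. check every Node with the port reported by `kubectl get svc`:
# for ip in ["172.18.0.2", "172.18.0.3", "172.18.0.4"]:
#     print(ip, node_port_reachable(ip, 30510))
```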
Finally, set up the following command aliases:
IPTables Implementation
According to Tim's IPTables diagram, external packets are first intercepted in the PREROUTING chain and redirected to the KUBE-SERVICES chain:
The KUBE-NODEPORTS chain is appended to the bottom of the KUBE-SERVICES chain and uses ADDRTYPE to only match packets that are destined to one of the locally configured addresses:
Each of the configured NodePort Services will have two entries: one to enable SNAT masquerading in the KUBE-POSTROUTING chain (see the ClusterIP walkthrough for more details) and another one for Endpoint-specific DNAT actions:
Inside the KUBE-SVC-* chain there will be one entry for each healthy backend Endpoint, with a random match probability to ensure equal traffic distribution:
This is where the final destination NAT translation takes place: each of the above chains translates the original destination IP and NodePort to the address of one of the Endpoints:
You may have noticed the presence of KUBE-MARK-MASQ in the above chains. This rule accounts for a corner case of a Pod talking to its own Service via a ClusterIP (i.e. the Pod itself is a part of the Service it's trying to talk to) with the random distribution selecting it as the destination. In this case, both source and destination IPs will be the same, so this rule ensures that the packets get SNAT'ed to prevent them from being dropped.
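The per-Endpoint probabilities kube-proxy installs (1/n for the first rule, 1/(n-1) for the next, down to an unconditional last rule) can be simulated to show why they produce a uniform spread; this is a sketch of the rule cascade, not kube-proxy code:

```python
import random
from collections import Counter

def pick_endpoint(endpoints, rng=random):
    """Walk a KUBE-SVC-* style rule cascade: rule i matches with
    probability 1/(n-i); the last rule matches unconditionally."""
    n = len(endpoints)
    for i, ep in enumerate(endpoints):
        if i == n - 1 or rng.random() < 1.0 / (n - i):
            return ep

rng = random.Random(0)  # seeded for reproducibility
counts = Counter(pick_endpoint(["ep1", "ep2", "ep3"], rng) for _ in range(30_000))
# each endpoint ends up receiving roughly a third of the flows
```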
IPVS Implementation
The IPVS data plane still relies on IPTables for a number of corner cases, which is why we can see a similar rule matching all LOCAL packets and redirecting them to the KUBE-NODE-PORT chain:
However, it is implemented slightly differently and makes use of IP sets, reducing the time complexity of a lookup across N configured Services from O(N) to O(1):
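The complexity difference can be illustrated with a sketch: matching against one rule per Service is a linear scan, while an ipset membership test is a single hash lookup (the port numbers here are illustrative):

```python
# iptables-style: one rule per Service, evaluated sequentially -> O(N)
rules = [("tcp", 30000 + i) for i in range(1000)]

def match_rules(proto, port):
    for rule in rules:          # linear scan over every installed rule
        if rule == (proto, port):
            return True
    return False

# ipset-style: all NodePorts in one hash set, checked by a single rule -> O(1)
node_ports = {port for _, port in rules}   # analogue of KUBE-NODE-PORT-TCP

def match_ipset(proto, port):
    return proto == "tcp" and port in node_ports
```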
All configured NodePorts are kept inside the KUBE-NODE-PORT-TCP ipset:
Assuming we've got 30064 allocated as a NodePort, we can see all interfaces that are listening for incoming packets for this Service:
The IPVS configuration for each individual listener is the same and contains a set of backend Endpoint addresses with the default round-robin traffic distribution:
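The default rr scheduler can be sketched as a simple cycle over the Endpoint list; this is a toy model of per-connection round-robin, not the kernel's IPVS code, and the addresses are hypothetical:

```python
import itertools

class RoundRobin:
    """Toy model of IPVS's default round-robin scheduler: each new
    connection is assigned the next Endpoint in the list."""

    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        return next(self._cycle)
```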
Cilium eBPF Implementation
The way Cilium deals with NodePort Services is quite complicated, so we'll try to focus only on the relevant "happy" code paths, ignoring corner cases and interaction with other features, like firewalling or encryption.
At boot time, Cilium attaches a pair of eBPF programs to a set of the Node's external network interfaces (they can be picked automatically or defined in the configuration). In our case, we only have one external interface, eth0, and we can see the eBPF programs attached to it using bpftool:
Let's focus on the ingress part and walk through the source code of the from-netdev program. During the first few steps, the SKB data structure is passed to the handle_netdev function (source) and on to the do_netdev function (source), which handles IPSec, security identity and logging operations. At the end, a tail call transfers control to the handle_ipv4 function (source), which is where most of the forwarding decisions take place.
One of the first things that happen inside handle_ipv4 is the following check which confirms that Cilium was configured to process NodePort Services and the packet is coming from an external source, in which case the SKB context is passed over to the nodeport_lb4 function:
The nodeport_lb4 function (source) deals with anything related to NodePort Service load-balancing and address translation. Initially, it builds a 4-tuple which will be used for internal connection tracking and attempts to extract a Service map lookup key:
The key is built from the destination IP and L4 port of the ingress packet. Similar to Cilium's ClusterIP implementation (and for the same reasons), the lookup is performed in two stages, and the first one is only used to determine the total number of backend Endpoints (svc->count):
For example, this is how a map lookup for a packet going to 172.18.0.6:30171 would look like:
The returned result sets the count to the number of healthy backend Endpoints (0x03 in our case) which is then used in the second lookup inside the lb4_local function (source):
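The two-stage lookup can be sketched as a pair of map reads. The destination 172.18.0.6:30171 and the count of 3 come from the example above; the backend IDs and Endpoint addresses are hypothetical stand-ins for Cilium's real BPF map entries:

```python
# (dest_ip, dest_port, backend_slot) -> (backend_id, count)
# slot 0 is the "master" entry that only carries the backend count
lb4_services = {
    ("172.18.0.6", 30171, 0): (0, 3),   # master entry: 3 healthy backends
    ("172.18.0.6", 30171, 1): (7, 0),
    ("172.18.0.6", 30171, 2): (8, 0),
    ("172.18.0.6", 30171, 3): (9, 0),
}
# backend_id -> Endpoint (ip, port)
lb4_backends = {7: ("10.0.0.1", 8080), 8: ("10.0.0.2", 8080), 9: ("10.0.0.3", 8080)}

def lookup(ip, port, slot_picker):
    _, count = lb4_services[(ip, port, 0)]          # stage 1: learn the count
    slot = slot_picker(count)                       # 1..count, random or Maglev
    backend_id, _ = lb4_services[(ip, port, slot)]  # stage 2: read the slot
    return lb4_backends[backend_id]                 # -> target Endpoint
```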
This time, the exact backend_id is determined either randomly or using a Maglev hash lookup. The value of backend_id is used to look up the destination IP and port of the target Endpoint:
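Maglev selection itself can be sketched as follows, building per-backend permutations and filling the lookup table as described in the original Maglev paper; the hash functions and (prime) table size here are illustrative simplifications of Cilium's implementation:

```python
import hashlib

def _h(name, seed):
    """Deterministic 64-bit hash of a backend name (illustrative only)."""
    return int.from_bytes(hashlib.sha256(f"{seed}:{name}".encode()).digest()[:8], "big")

def maglev_table(backends, m=13):
    """Fill a lookup table of prime size m; each backend claims slots by
    walking its own permutation (offset + j*skip mod m) in turn."""
    offsets = {b: _h(b, "offset") % m for b in backends}
    skips = {b: _h(b, "skip") % (m - 1) + 1 for b in backends}
    table = [None] * m
    next_idx = {b: 0 for b in backends}
    filled = 0
    while filled < m:
        for b in backends:
            while True:  # advance until this backend finds a free slot
                slot = (offsets[b] + next_idx[b] * skips[b]) % m
                next_idx[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == m:
                break
    return table

def pick_backend(table, flow_hash):
    """Consistent selection: hash the flow tuple into a table slot."""
    return table[flow_hash % len(table)]
```

Because the table is deterministic for a given backend set, every Node picks the same backend for the same flow hash, which is what makes Maglev attractive for distributed load-balancing.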
With this information in hand, the control flow is passed from lb4_local to the lb4_xlate function:
As its name suggests, lb4_xlate (source) performs L4 header re-writes and checksum updates to finish the translation of the original packet, which now has the destination IP and port of one of the backend Endpoints:
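The checksum part of such a rewrite is typically done incrementally; the sketch below implements RFC 1624's update equation for a single 16-bit field change, as an illustration rather than Cilium's actual code:

```python
def csum16_add(a, b):
    """One's-complement 16-bit addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def incremental_update(check, old, new):
    """RFC 1624: HC' = ~(~HC + ~m + m'), updating a checksum after
    one 16-bit header field changes from `old` to `new`."""
    c = csum16_add(csum16_add(~check & 0xFFFF, ~old & 0xFFFF), new)
    return ~c & 0xFFFF
```

This avoids re-reading the whole packet: only the changed field (e.g. the destination port) and the old checksum are needed to produce the new one.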
At this point, with the packet fully translated and connection tracking entries updated, the control flow returns to the handle_ipv4 function where a Cilium endpoint is looked up and its details are used to call the bpf_redirect_neigh eBPF helper function to redirect the packet straight to the target interface, similar to how it was described in the Cilium CNI chapter: