Kubernetes – Container networking

Networking is a vital concern for production-level operations. At a service level, we need a reliable way for our application components to find and communicate with each other. Introducing containers and clustering into the mix makes things more complex, as we now have multiple networking namespaces to bear in mind. Communication and discovery now become a feat that must navigate container IP space, host networking, and sometimes even multiple data center network topologies.

Kubernetes benefits here from its ancestry in the clustering tools that Google has used for the past decade. Networking is one area where Google has outpaced the competition with one of the largest networks on the planet. Along the way, Google built its own hardware switches and Software-defined Networking (SDN) to gain more control, redundancy, and efficiency in its day-to-day network operations. Many of the lessons learned from running and networking two billion containers per week have been distilled into Kubernetes and inform how K8s networking is done.

The Docker approach

In order to understand the motivation behind the K8s networking model, let’s review Docker’s approach to container networking.

Docker default networks

The following are some of Docker’s default networks:

  • Bridge network: In a non-Swarm scenario, Docker uses the bridge network driver (called bridge) to allow standalone containers to speak to each other. You can think of the bridge as a link layer device that forwards network traffic between segments. If containers are connected to the same bridge network, they can communicate; if they’re not connected, they can’t. The bridge network is the default choice unless otherwise specified. In this mode, Docker creates a virtual bridge called docker0 on the host and allocates a private IP address range for the containers attached to it; each container gets a virtual Ethernet (veth) device that appears inside the container as eth0. The container therefore has its own networking namespace and is bridged via virtual interfaces to the host (or node, in the case of K8s) network. Because each host’s bridge is isolated, containers on two different hosts can use the same IP range, so service communication requires some additional port mapping through the host side of the network interfaces.
  • Host based: Docker also offers host-based networking for standalone containers, in which the container shares the host’s networking namespace and IP address directly. Performance benefits greatly, since this removes a level of network virtualization; however, you lose the security of having an isolated network namespace. Additionally, port usage must be managed more carefully since all containers share the host’s IP.

There’s also a none network, which creates a container with no external interface. Only a loopback device is shown if you inspect the network interfaces.

In all of these scenarios, we are still on a single machine, and outside of host mode, the container IP space is not available outside that machine. Connecting containers across two machines requires NAT and port mapping for communication.
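As a quick, hedged illustration of these defaults (assuming a standalone Docker host with the nginx and alpine images available; container names and ports are illustrative), the built-in networks can be listed and exercised directly from the Docker CLI:

    # List the networks Docker creates by default on a standalone host
    docker network ls

    # Bridge (the default): the container gets its own namespace behind docker0,
    # so reaching it from outside the host requires a published port
    docker run -d --name web-bridge -p 8080:80 nginx

    # Host: the container shares the host's network namespace and IP,
    # so nginx binds port 80 directly on the host and no -p mapping is needed
    docker run -d --name web-host --network host nginx

    # None: only a loopback device is present inside the container
    docker run --rm --network none alpine ip addr

Note how only the bridge case requires explicit port mapping; this is exactly the per-host bookkeeping that the Kubernetes model sets out to avoid.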

Docker user-defined networks

In order to address the cross-machine communication issue and allow greater flexibility, Docker also supports user-defined networks via network plugins. These networks exist independently of the containers themselves. In this way, containers can join the same pre-existing networks. Through the new plugin architecture, various drivers can be provided for different network use cases, such as the following:

  • Swarm: In a clustered situation with Swarm, the default behavior is an overlay network, which allows you to connect multiple Docker daemons running on multiple machines. In order to coordinate across multiple hosts, all containers and daemons must agree on the available networks and their topologies. Overlay networking introduces a significant amount of complexity with dynamic port mapping that Kubernetes avoids.
You can read more about overlay networks here: https://docs.docker.com/network/overlay/.
  • Macvlan: Docker also provides macvlan addressing, which is most similar to the networking model that Kubernetes provides, as it assigns each Docker container a MAC address that makes it appear as a physical device on your network. Macvlan offers more efficient network virtualization and isolation, as it bypasses the Linux bridge. It is important to note that, as of this book’s publishing, macvlan isn’t supported by most cloud providers.

As a result of these options, Docker must manage complex port allocation on a per-machine basis for each host IP, and that information must be maintained and propagated to all other machines in the cluster. Docker uses a gossip protocol to manage the forwarding and proxying of ports to other containers.
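The following sketch shows how such user-defined networks are created and joined (the overlay case assumes a Swarm-enabled daemon, and the macvlan subnet, gateway, and parent interface eth0 are assumptions about the host’s network):

    # Overlay: requires Swarm mode so that the daemons can agree on the network
    docker swarm init
    docker network create -d overlay --attachable my-overlay
    docker run -d --name svc-a --network my-overlay nginx

    # Macvlan: the container appears as a physical device on the parent interface
    docker network create -d macvlan \
      --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
      -o parent=eth0 my-macvlan
    docker run -d --name svc-b --network my-macvlan nginx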

The Kubernetes approach

Kubernetes’ approach to networking differs from Docker’s, so let’s see how. We can understand Kubernetes networking by considering four major communication concerns in cluster scheduling and orchestration:

  • Decoupling container-to-container communication by providing pods, not containers, with an IP address space
  • Pod-to-pod communication as the dominant communication paradigm within the Kubernetes networking model
  • Pod-to-service and external-to-service communications, which are provided by the Service object

These considerations are a meaningful simplification for the Kubernetes networking model, as there’s no dynamic port mapping to track. IP addressing is scoped at the pod level, which means that each pod gets its own IP address. All containers in a given pod share that IP address and are considered to be in the same network namespace. We’ll explore how to manage this shared IP resource when we discuss internal and external services later in this chapter. Kubernetes facilitates pod-to-pod communication by not allowing the use of network address translation (NAT) for container-to-container or container-to-node (minion) traffic; the internal container IP address must match the IP address that is used to communicate with it. This underlines the Kubernetes assumption that all pods are able to communicate with all other pods regardless of the host they’ve landed on. Within a pod, containers share a local IP address space, so all containers in a given pod can communicate with each other on their reserved ports via localhost. This un-NATed, flat IP space simplifies networking changes when you begin scaling to thousands of pods.
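A quick way to see this flat, per-pod IP space for yourself (a sketch only; it assumes a running cluster, a pod named web whose image ships basic networking tools, and a second pod whose IP you substitute for the placeholder):

    # Each pod gets its own routable IP, shown alongside the node it landed on
    kubectl get pods -o wide

    # The IP a pod sees for itself on eth0 is the same IP other pods use to reach it
    kubectl exec web -- ip addr show eth0

    # Another pod is reachable directly on its pod IP -- no NAT, no port mapping
    kubectl exec web -- wget -qO- http://<other-pod-ip>:80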

These rules keep much of the complexity out of our networking stack and ease the design of our applications. Furthermore, they eliminate the need to redesign network communication in legacy applications that are migrated from existing infrastructure. In greenfield applications, they allow for greater scale, handling hundreds or even thousands of services and application communications.

Astute readers may also have noticed that this creates a model that’s backwards compatible with VMs and physical hosts, which have an IP architecture similar to that of pods: a single address per VM or physical host. This means you don’t have to change your approach to service discovery, load balancing, application configuration, and port management, and you can port over your application management workflows when working with Kubernetes.

K8s achieves this pod-wide IP magic by using a placeholder container. Remember the pause container that we saw in Chapter 1, Introduction to Kubernetes, in the Services running on the master section; often referred to as a pod infrastructure container, it has the important job of reserving the network resources for our application containers that will be started later on. Essentially, the pause container holds the networking namespace and IP address for the entire pod, which can then be used by all the containers running within it. The pause container starts first and holds the namespace, and the subsequent containers in the pod join it when they start up using Docker’s --net=container:%ID% option.

If you’d like to look over the code in the pause container, it’s right here: https://github.com/kubernetes/kubernetes/blob/master/build/pause/pause.c.
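You can imitate the same mechanism by hand with plain Docker to make the namespace sharing concrete (a sketch only; the pause image tag is an assumption and varies between cluster versions):

    # Start a placeholder container that does nothing but hold the network namespace
    docker run -d --name my-pause k8s.gcr.io/pause:3.1

    # Join an application container to that namespace, just as the kubelet does
    # via --net=container:%ID%; both now share one IP and one set of interfaces
    docker run -d --name my-app --net=container:my-pause alpine sleep 3600

    # The application container reports the pause container's interfaces as its own
    docker exec my-app ip addr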

Kubernetes can achieve the preceding feature set using either CNI plugins for production workloads or kubenet networking for simplified cluster communication. Kubernetes can also be used when your cluster relies on the logical partitioning provided by a cloud service provider’s security groups or network access control lists (NACLs). Let’s dig into the specific networking options now.

Networking options

There are two approaches to implementing the networking model that we have just described. First, you can use one of the CNI plugins that exist in the ecosystem. These include solutions that work with the native networking layers of AWS, GCP, and Azure, as well as overlay-friendly plugins, which we’ll cover in the next section. CNI is meant to be a common plugin architecture for containers, and it’s currently supported by several orchestration tools such as Kubernetes, Mesos, and Cloud Foundry.

Network plugins are considered to be in alpha, and therefore their capabilities, content, and configuration may change rapidly.

If you’re looking for a simpler alternative for testing and smaller clusters, you can use the kubenet plugin, which uses the bridge and host-local CNI plugins with a straightforward implementation of the cbr0 bridge. This plugin is only available on Linux and doesn’t provide any advanced features such as network policy; because it’s typically used to supplement a cloud provider’s networking, it also does not handle cross-node networking on its own.
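On the node itself, the choice between the two is wired up through kubelet flags, roughly as sketched below (flag names are version-dependent and have been subject to deprecation, so treat this as an outline rather than a recipe):

    # Simpler option: kubenet builds cbr0 from the bridge and host-local plugins
    kubelet --network-plugin=kubenet --pod-cidr=10.200.1.0/24 ...   # other kubelet flags omitted

    # Production option: delegate to whichever CNI plugin is configured on the node
    kubelet --network-plugin=cni \
      --cni-conf-dir=/etc/cni/net.d \
      --cni-bin-dir=/opt/cni/bin ...                                # other kubelet flags omitted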

Just as with CPU, memory, and storage, Kubernetes takes advantage of network namespaces, each with its own iptables rules, interfaces, and route tables. Kubernetes uses iptables and NAT to manage multiple logical addresses that sit behind a single physical address, though you have the option to provide your cluster with multiple physical interfaces (NICs). Most people will find themselves generating multiple logical interfaces and using technologies such as multiplexing, virtual bridges, and hardware switching with SR-IOV in order to create multiple devices.

You can find out more information at https://github.com/containernetworking/cni.
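For example, a minimal CNI configuration dropped into the conf directory on a node might look like the following (a sketch using the bridge and host-local plugins from the containernetworking project; the subnet is an assumption about your cluster’s address plan):

    cat <<'EOF' > /etc/cni/net.d/10-bridge.conf
    {
      "cniVersion": "0.3.1",
      "name": "mynet",
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "10.22.0.0/16",
        "routes": [ { "dst": "0.0.0.0/0" } ]
      }
    }
    EOF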

Always refer to the Kubernetes documentation for the latest and full list of supported networking options.

Networking comparisons

To get a better understanding of networking in containers, it can be instructive to look at the popular choices for container networking. The following approaches do not make an exhaustive list, but should give a taste of the options available.

Weave

Weave provides an overlay network for Docker containers. It can be used as a plugin with the new Docker network plugin interface, and it is also compatible with Kubernetes through a CNI plugin. As with many overlay networks, critics point to the performance impact of the encapsulation overhead. Note that Weave has recently added a preview release with Virtual Extensible LAN (VXLAN) encapsulation support, which greatly improves performance. For more information, visit http://blog.weave.works/2015/06/12/weave-fast-datapath/.

Flannel

Flannel comes from CoreOS and is an etcd-backed overlay. Flannel gives a full subnet to each host/node, enabling a similar pattern to the Kubernetes practice of a routable IP per pod or group of containers. Flannel includes an in-kernel VXLAN encapsulation mode for better performance and has an experimental multi-network mode similar to the overlay Docker plugin. For more information, visit https://github.com/coreos/flannel.

Project Calico

Project Calico is a layer 3-based networking model that uses the built-in routing functions of the Linux kernel. Routes are propagated to virtual routers on each host via Border Gateway Protocol (BGP). Calico can be used for anything from small-scale deployments to large internet-scale installations. Because it works at a lower level on the network stack, there is no need for additional NAT, tunneling, or overlays; it can interact directly with the underlying network infrastructure. Additionally, it has support for network-level ACLs to provide additional isolation and security. For more information, visit http://www.projectcalico.org/.

Canal

Canal merges Calico (for network policy) and Flannel (for the overlay) into one solution. It supports both Calico- and Flannel-type overlays and uses Calico’s policy enforcement logic. Users can choose from overlay and non-overlay options with this setup, as it combines the features of the preceding two projects. For more information, visit https://github.com/tigera/canal.

Kube-router

The Kube-router option is a purpose-built networking solution that aims to provide high performance and ease of use. It uses the Linux LVS/IPVS kernel load-balancing technologies as a service proxy, kernel-based networking for pod connectivity, and iptables as the network policy enforcer. Since it doesn’t use an overlay technology, it’s potentially a high-performance option for the future. For more information, visit https://github.com/cloudnativelabs/kube-router.

Balanced design

It’s important to point out the balance that Kubernetes is trying to achieve by placing the IP at the pod level. Using unique IP addresses at the host level is problematic as the number of containers grows. Ports must be used to expose services on specific containers and allow external communication. In addition to this, the complexity of running multiple services that may or may not know about each other (and their custom ports) and managing the port space becomes a big issue.

However, assigning an IP address to each container can be overkill. In cases of sizable scale, overlay networks and NATs are needed in order to address each container. Overlay networks add latency, and IP addresses would be taken up by backend services as well since they need to communicate with their frontend counterparts.

Here, we really see an advantage in the abstractions that Kubernetes provides at the application and service level. If we have a web server and a database, we can keep them in the same pod and use a single IP address. The web server and database can use the local interface and standard ports to communicate, and no custom setup is required. Furthermore, services on the backend are not needlessly exposed to other application stacks running elsewhere in the cluster (but possibly on the same host). Since the pod sees the same IP address that the applications running within it see, service discovery does not require any additional translation.
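A hedged sketch of that pattern follows (the image names, ports, and credentials are purely illustrative): a single pod manifest with two containers that share one IP and talk over localhost.

    kubectl apply -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: web-with-db
    spec:
      containers:
      - name: web
        image: nginx              # reaches the database at localhost:5432
        ports:
        - containerPort: 80
      - name: db
        image: postgres:11        # listens on 5432 inside the shared namespace
        env:
        - name: POSTGRES_PASSWORD
          value: example          # illustrative only
    EOF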

If you need the flexibility of an overlay network, you can still use an overlay at the pod level. Weave, Flannel, and Project Calico can be used with Kubernetes as well as a plethora of other plugins and overlays that are available.

This is also very helpful in the context of scheduling workloads. It is key to have a simple and standard structure for the scheduler to match constraints against and to understand where space exists on the cluster’s network at any given time. This is a dynamic environment with a variety of applications and tasks running, so any additional complexity here will have rippling effects.

There are also implications for service discovery. New services coming online must determine and register an IP address on which the rest of the world, or at least a cluster, can reach them. If NAT is used, the services will need an additional mechanism to learn their externally facing IP.
