Different load balancing and reverse proxying strategies to use in Production K8s Deployments to expose services to outside traffic

In this post, I’m going to tackle a topic that any K8s novice would start to think about once they have cleared the basic concepts: **How would one go about exposing the services deployed inside a K8s cluster to outside traffic?** The content and some of the diagrams I’ve used in the post are from an internal tech talk I conducted at WSO2.

Let’s define some terms!

Before we move on to the actual discussion, let’s define and agree on a few terms, as they can be confusing if each is not pinned down first. Let’s also define a sample deployment and refer to parts of it later in the discussion.

A basic understanding of Pods, Services, and the usage of Service types in K8s is expected for the topics discussed in this article.

  • K8s cluster — The boundary within which Pods, Services, and other K8s constructs are created and maintained. This includes K8s Master and the Nodes where the actual Pods are scheduled.
  • Node (with upper case N) — These are the K8s Nodes that all Pods, Volumes, and other computational constructs will be spawned in. They are not managed by K8s, and reside within the private network that is assigned to them by the IaaS provider when the virtual machines are spawned.
  • Node private network — The K8s Nodes reside in a private network in the Cloud Service Provider (e.g.: AWS VPC). For the sample deployment, let’s define this private network CIDR as 10.128.0.0/24. There will be routing set up so that traffic is properly routed to and from this private network (e.g.: with the use of Internet and NAT Gateways).
  • Container Networking — The networking boundary within which the K8s constructs are created. This is a Software Defined Network, commonly implemented with a CNI plugin such as Flannel. How routing works within this network can vary between Cloud Service Providers, but generally every private IP assigned within this space can talk to every other without NAT in between. This network is manipulated by kube-proxy whenever new Services and Pods are created. There are usually two CIDRs defined within this space, the Pod Network CIDR and the Service Network CIDR.
  • Pod Network CIDR — The IP address range from which Pods spawned within the K8s cluster will get IP addresses. For the sample deployment, let’s keep this at 10.244.0.0/16.
  • Service Network CIDR — The IP address range from which the Services created within the K8s cluster will get IP addresses. For the sample deployment, let’s keep this at 10.250.0.0/20.
  • (K8s) Service (with upper case S) — The K8s Service construct that this article focuses on.
  • service (with lower case s) — The application service that a particular set of Pods provides.

The following is an excerpt from a single Service definition, myservice.yml. The .spec.type field can have values of ClusterIP, NodePort, or LoadBalancer (or ExternalName, which is outside the scope of this article). This Service exposes port 8080, mapped to the Pod port 8080. The name of the Service will be myservice.

apiVersion: v1
kind: Service
metadata:
  name: myservice
...
spec:
  type: <type>
  ...
  ports:
  - name: myserviceport1
    port: 8080
    targetPort: 8080
    protocol: TCP
...

The Problem

Pods are ephemeral.

In K8s, Pods are the basic unit of computation. However, Pods are not guaranteed to live for as long as the user intends them to. In fact, the only contract that K8s adheres to when it comes to the liveness of Pods is that the desired count of Pods (a.k.a. Replicas) will be maintained. Pods are ephemeral, and are vulnerable to kill signals from K8s during occasions such as scaling, memory or CPU overuse, or rescheduling for more efficient resource use, or even to downtime caused by outside factors (e.g.: the whole K8s Node going down). The IP address that is assigned to a Pod at its creation will not survive such events.

Therefore, although each Pod has a cluster-wide routable IP address, those Pod IP addresses are not usable as direct service endpoints, simply because at any given time there is a chance that one or more of them will stop being responsive. There should be a construct that stands as a single, fixed service endpoint, acting as a reverse proxy for a given set of Pods.

This is where K8s Services come into play.

A K8s Service is fixed in terms of reachability from within the cluster (and from outside the cluster, which is what this article addresses). Once it is assigned an internal cluster-wide IP address, that address can be used to refer to the Service until it is intentionally removed. The Service can also be referred to from within the K8s cluster using the Service name.

You may notice that the problem statement and the primary functionality of a K8s Service as described above are not unique to K8s deployments. Bare-metal or virtualized deployments also need some kind of reverse proxying for a given set of compute instances that offer a particular service. For example, in AWS, a set of EC2 compute instances managed by an Autoscaling Group would have an ELB/ALB that acts as both a fixed referable address and a load balancing mechanism. K8s Services offer the same type of functionality, except that in K8s, Services are not compute instances that are managed by the provider or by an end user. They are more a product of Software Defined Networking and Linux Kernel based packet filtering than a single process of computation running on some CPU.

Although a K8s Service does basic load balancing, as you will see in the following sections, when advanced load balancing and reverse proxying features (e.g.: L7 features like path based routing) are needed, those features are offloaded to a real load balancing process running as a compute process, either internal to the K8s cluster as Pods, or as an external managed service. K8s Services themselves only perform L3/L4 routing.

However, the set of Pods handled by the Service is still not accessible from a client outside the K8s internal network. Every IP address discussed above is within the CIDR(s) defined when creating the K8s Overlay Network, and therefore these are addresses in the private space. They will not be routed to the K8s cluster if a client tries to invoke them from outside the network, because the routing rules for them are set up inside the Overlay Network.

The Solution: Service Types

To address this problem, there are several Service types that can be leveraged to allow ingress of external traffic to a Pod with a private IP address. Let’s explore each type and see whether any of them achieves the goal: ending up with a public IP address that a domain name can be mapped to.

type: ClusterIP

This is the default type of Service that is created if the type field is not explicitly specified in a Service definition. A cluster-wide but internal IP address, which is part of the Service Network CIDR divided up when setting up Container Networking, will be provided as the fixed IP address of the Service. This IP address (and the Service name) is routable from anywhere within the K8s Overlay Network.

# invoke by the Service name, from anywhere within the Overlay Network
curl http://myservice:8080/
# or resolve the ClusterIP first and invoke it directly
curl http://$(kubectl get svc myservice --template='{{.spec.clusterIP}}'):8080/

However, as mentioned above, this IP address is (sort of) useless on its own for the scope of this article. It is not routable from outside the cluster network unless you can somehow implement the following series of steps.

  1. Watch and detect new Service creations at the K8s Master — doable, easy
  2. Get the ClusterIP addresses assigned for each Service — doable, easy
  3. Create and update routing tables from all the possible clients up to the K8s Nodes to route the requests to the cluster network — not so much

You can also expose a Service to outside traffic through the use of the kubectl proxy command. However, this exposes the complete K8s API Server and is not a recommended approach for production systems.
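For completeness, the following is a rough sketch of that approach, making use of the standard Service proxy path on the API Server (the default namespace is assumed for myservice):

# start a local proxy to the K8s API Server
kubectl proxy --port=8001

# in another terminal, reach myservice through the API Server's Service proxy
# (assumes myservice lives in the default namespace)
curl http://localhost:8001/api/v1/namespaces/default/services/myservice:myserviceport1/proxy/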

Pros:

  • None that is specific to this type

Cons:

  • Not an easy task to expose the Services
  • Doing so with workarounds such as kubectl proxy would potentially expose the largest possible attack surface

Moving on!

type: NodePort

In the order in which this article explores the Service types, this is the first that allows us to expose Services in a meaningful manner in a production deployment. And as will become apparent later, it is the basis of the other mechanisms that map a Service port to a physical port.

A Service of NodePort type allocates a physical port on the K8s Node for each .spec.ports[*] port entry defined in the Service definition. Furthermore, it does so on all the K8s Nodes in the cluster (as opposed to only on the Node(s) where the Pods handled by the Service are spawned). This way, the Service port is mapped to a port opened on the eth0 interface of every Node.

If myservice.yml is created with .spec.type=NodePort, there will be a random (but consistent) physical port opened on each of the Nodes in the K8s cluster. This port is picked from a range that can be modified when creating the K8s cluster (--service-node-port-range, which defaults to 30000–32767). The port assigned to each .spec.ports[*] entry can be checked in the output of kubectl get svc <svc_name>, where under the PORT(S) column there will be a value after a : for each port, as shown below.
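For example, the output would look roughly like the following (the values are illustrative; the ClusterIP is drawn from the Service Network CIDR defined earlier, and the NodePort is randomly assigned):

$ kubectl get svc myservice
NAME        TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
myservice   NodePort   10.250.0.175   <none>        8080:31644/TCP   5m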

If, for example, myserviceport1 in myservice above (port 8080) is assigned the NodePort 31644, then external clients with access to the Node private network can directly access the Service using any of the Node IP addresses and the NodePort. Assuming one of these Node IP addresses is 10.128.0.13 (from the 10.128.0.0/24 range defined earlier), we can invoke myservice port 8080 using the following curl command from within the Node private network.

curl http://10.128.0.13:31644/

Additionally, if the random assignment of ports is something to worry about, fixed port values (within the above mentioned range) can be specified using the .spec.ports[*].nodePort value for each entry. The downside of predefined NodePorts is that they have to be coordinated across all the Services that will be deployed, so that one Service does not clash with another.
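As a minimal sketch, pinning the NodePort from the example above would look like the following. The selector label is an assumption added here to make the manifest complete; the actual label depends on the Pods backing the Service.

apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  type: NodePort
  selector:
    app: myservice      # hypothetical Pod label
  ports:
  - name: myserviceport1
    port: 8080          # Service port inside the Overlay Network
    targetPort: 8080    # Pod port
    nodePort: 31644     # fixed physical port opened on every Node; must be within --service-node-port-range
    protocol: TCP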

As long as there are mechanisms set up for ingress and egress to and from the Node private network, any external client will also be able to access myservice using the Node public IP addresses.

As the next step, a reverse proxy can be set up in the IaaS public address space, sitting within the same network grouping (e.g.: VPC) as the K8s Nodes, so that external traffic is routed to (and load balanced between) the NodePorts -> Service Ports -> Pod Ports.

Pros:

  • Easiest method to expose internal Services to outside traffic
  • Enables greater freedom when it comes to setting up external load balancing and reverse proxying
  • Service-provided L3 load balancing functionality (e.g.: .spec.sessionAffinity) is available
  • Straight-forward, easy-to-untangle mechanism that would help when troubleshooting (especially with predefined NodePorts)

Cons:

  • Have to manage load balancing and reverse proxying externally from K8s management
  • Have to coordinate NodePorts among Services

This is the basis of all other types of reverse proxying discussed from this point onward. Any kind of traffic that has to come inside the K8s Cluster Network has to come through some kind of a NodePort. All variations try to address other factors such as dynamic management of external load balancers with K8s directives, but NodePort is the bridge that connects the Node Private Network with the Service Network.

type: LoadBalancer

Also known as the Cloud Load Balancer approach, specifying this as the .spec.type when K8s is deployed in a supported Cloud Service Provider results in a load balancer being provisioned in the Cloud to proxy the particular Service.

For example, in a K8s cluster deployed in AWS, this results in an ELB instance being provisioned to proxy traffic for the Service inside the K8s cluster. The traffic that reaches the ELB reaches the Overlay Network through random NodePorts, exposed to ferry the traffic from the Cloud Service Provider private network to the K8s Overlay Network.

The exact details of how this flow is implemented can vary depending on the Cloud Service Provider. However, the overall idea is to quickly provision a load balancer without having to go through the Cloud Service Provider APIs separately. K8s translates the Service definition and performs the API calls on behalf of the user to set up the required resources (e.g.: load balancer, static public IP address, routing on the Cloud Service Provider).
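A minimal sketch of the same Service as a LoadBalancer type follows (the selector label is again a hypothetical assumption). Once the Cloud Service Provider finishes provisioning, the assigned address shows up in the Service's status.loadBalancer.ingress field, and in the EXTERNAL-IP column of kubectl get svc.

apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  type: LoadBalancer
  selector:
    app: myservice      # hypothetical Pod label
  ports:
  - name: myserviceport1
    port: 8080          # port exposed on the provisioned Cloud load balancer
    targetPort: 8080    # Pod port
    protocol: TCP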

Pros:

  • Provisioning load balancing and reverse proxying with minimum effort
  • NodePort management is done without the intervention of the user
  • Do not have to manage load balancing facilities outside of K8s domain

Cons:

  • A single load balancer proxies a single Service, which makes this a costly approach
  • The implementation details are sometimes opaque and require manual investigation to understand and troubleshoot
  • Sets up Network Load Balancers most of the time, and therefore L7 features like path based routing and TLS/SSL termination are off the table
  • It usually takes time for the Cloud Service Provider to complete provisioning the load balancer

The above are the different Service types available that provide different methods to expose a Service outside the K8s cluster. However, on their own they do not get the task to a complete state (e.g.: NodePort just exposes the ports, LoadBalancer is not flexible). Therefore, certain “patterns” that combine the above with more application level functionality have come up. The following are two such methods.

Bare Metal Service Load Balancer Pattern

Before K8s v1.1, the Bare Metal Service Load Balancer was the preferred solution to tackle the shortcomings of the LoadBalancer Service type described above. It makes use of NodePorts and a set of HAProxy Pods that act as an L7 reverse proxy for the rest of the Pods. The solution is roughly the following.

  1. A single-container Pod contains an HAProxy deployment along with an additional binary called service_loadbalancer
  2. This Pod is deployed as a DaemonSet where only a single Pod is scheduled per Node
  3. The service_loadbalancer binary constantly watches the K8s API Server and retrieves the details of Services. With the use of Service Annotation metadata, each Service can indicate the load balancing details to be adopted by any third party load balancer (TLS/SSL termination, virtual host names etc.; see the sketch after this list)
  4. With the details retrieved, it rewrites the HAProxy configuration file, filling the backend and frontend section details with Pod IP addresses for each Service
  5. After the HAProxy configuration file is written, service_loadbalancer does a soft reload of the HAProxy process.
  6. HAProxy exposes ports 80 and 443. These are then exposed to outside traffic as NodePorts
  7. The NodePorts can be exposed to outside through a public Load Balancer
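As a sketch, a Service opting into this setup would carry its load balancing details as annotations, roughly like the following. Note that the annotation keys shown here are hypothetical stand-ins; the exact keys depend on the service_loadbalancer build in use.

apiVersion: v1
kind: Service
metadata:
  name: myservice
  annotations:
    # hypothetical annotation keys; the actual keys are defined by the service_loadbalancer build
    serviceloadbalancer/lb.host: "myservice.example.com"   # virtual host name to route on
    serviceloadbalancer/lb.sslTerm: "true"                 # terminate TLS/SSL at HAProxy
spec:
  type: ClusterIP
  selector:
    app: myservice      # hypothetical Pod label
  ports:
  - name: myserviceport1
    port: 8080
    targetPort: 8080
    protocol: TCP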

In this approach, any incoming traffic reaches a Pod in the following order:

  1. Traffic is routed to the public Load Balancer via the public IP address
  2. Load Balancer forwards the traffic to the NodePorts
  3. Once traffic reaches the NodePorts, HAProxy starts L7 deconstruction and does host or path based routing based on the Service Annotation details
  4. Once the routing decision is taken, the traffic is directly forwarded to the Pod IP

Notice that once traffic reaches the NodePorts (#2 above), it is inside the Overlay Network. Therefore, load balancing between direct Pod IP addresses can be done, based on any L7 decisions. In other words, the Service K8s construct is not involved in the load balancing decision, other than providing a grouping mechanism in the beginning. This is in contrast to the approaches discussed so far, where load balancing happens outside the K8s Overlay Network and traffic is usually pushed to the Service virtual IP.

In projects where I have used this approach personally, the main driver for choosing it has been its customizability. Some use cases involve specific requirements like combinations of host and/or path based routing, mutual authentication, etc. With this approach (and with some Go coding competency), these customizations were easy to implement.

Pros:

  • Can make L7 decisions
  • Can make use of specific load balancer features such as cookie based Session Stickiness that may not be possible with Cloud Load Balancer approach
  • Has more control over how load balancing should be scaled
  • Load balancing details are managed with K8s constructs such as Service annotations
  • Is more customizable when it comes to different use cases
  • Economical since only one Cloud Load Balancer would be provisioned for a complete K8s cluster
  • Transparent configuration mechanism, since past the service_loadbalancer binary the changes involved are HAProxy specific
  • Changes are propagated quickly, as the service_loadbalancer picks them up within a short polling interval

Cons:

  • Complex to set up and troubleshoot
  • Could result in a single point of failure if the number of Nodes (or affinity specified Nodes) is limited
  • service_loadbalancer only has support for HAProxy (although in theory it could support other reverse proxies, with considerable code changes)

After K8s v1.1, the kubernetes/contrib repository marked the service_loadbalancer folder as deprecated, in favor of the HAProxy Ingress Controller.

Traefik, which combines the above mentioned service_loadbalancer logic into a lightweight reverse proxy, also popped up after the Service Load Balancer pattern was deprecated. There is an Enterprise Edition that seems to be recommended for production. However, I have not personally seen it in action in production yet.

Ingress

It is fair to say that the concept of Ingress and the associated Ingress Controller evolved out of the Bare Metal Service Load Balancer pattern discussed above. This approach is only available after K8s v1.1.

In the Service Load Balancer pattern, the load balancing and reverse proxying intent was tied to the Service declaration, and the implementation was tied to the service_loadbalancer code. The annotations are implementation specific and cannot be trusted to be standard. Adding support for a new type of load balancer requires multiple changes in the existing code.

Ingress takes these concerns out into their own K8s API objects.

An Ingress is an intent of exposing a route from the outside to a certain set of Pods proxied by a Service. In this intent, load balancing specifics such as TLS termination, host and/or path based routing, and load distribution policies can be specified as rules. This is in contrast to the implementation specific annotations the Service Load Balancer pattern followed. Ingress rules are part of the Ingress spec section, and are thus standard constructs of the K8s API, not some arbitrary string that only a custom implementation would be able to make meaning out of.

The implementation of the Ingress (the intent) is done by an Ingress Controller. This is a load balancer specific implementation of a contract that configures a given load balancer (e.g.: Nginx, HAProxy, AWS ALB) according to the details provided by the Ingress resources. Which Ingress Controller to use for a certain Ingress resource is specified by the kubernetes.io/ingress.class annotation under metadata.annotations. The value here could be nginx, gce, or any other Ingress Controller implementation identifier.
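For example, a minimal Ingress that routes a host and a path to myservice could look like the following. The host name is illustrative, and the extensions/v1beta1 API version matches the K8s versions current at the time of writing.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: myservice-ingress
  annotations:
    kubernetes.io/ingress.class: nginx   # the Ingress Controller that should implement this intent
spec:
  rules:
  - host: myservice.example.com          # host based routing rule
    http:
      paths:
      - path: /api                       # path based routing rule
        backend:
          serviceName: myservice
          servicePort: 8080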

The details of this implementation could vary between different Ingress Controllers. They only have to adhere to the standards defined by the Ingress resource API.

For example, the Nginx Ingress Controller deploys an Nginx proxy as a Pod and configures it according to the Ingress resources defined in the cluster. The AWS ALB Ingress Controller provisions an AWS ALB with the routing details of the Ingress rules. The same is true for the GCE Ingress Controller, where a GCP L7 load balancer is provisioned. Both the AWS ALB and GCE Ingress Controller spawned external load balancers forward traffic to Pods through the Service Cluster IP exposed as a NodePort.

This might look similar to the Cloud Load Balancer (type: LoadBalancer) approach discussed above; however, there are key differences between Ingress and LoadBalancer type Services.

  1. The Cloud Load Balancer approach most of the time provisions Network Load Balancers, with no control over L7 constructs. Ingress does not do this; its resulting load balancers are mostly L7.
  2. The Cloud Load Balancer approach is dictated by the Service declaration, whereas Ingress is a separate declaration that does not depend on the Service type.
  3. There would be one Cloud Service Provider load balancer per Service in the former approach, whereas with Ingress multiple Services and backends can be managed with a single load balancer.

Pros:

  • Standardized approach to provisioning external load balancers
  • Support for HTTP and L7 features like path based routing
  • K8s managed load balancer behavior
  • Can manage multiple Services with a single load balancer
  • Can easily plug in different load balancer options

Cons:

  • Ingress Controller implementations could be buggy (e.g.: I have had a fair share of the GCE Ingress Controller not properly picking up readiness and liveness probes to detect healthy backends)
  • Implementation control is in the hands of the Ingress Controller, which might restrict certain customizations that otherwise can be done with the Service Load Balancer pattern
  • K8s managed load balancer configuration could mean less control over the Cloud Service Provider load balancer through means that were available previously (perhaps in the older deployment architecture)

Gathering It All Together

The above options are the basic ones available OOTB at the moment. However, many more patterns can be created by combining these approaches or custom implementations together.

When choosing the right approach for a deployment, the following factors can be used as bases of comparison.

Cost

What would be the infrastructure and operational cost of each approach? Would there be multiple instances of load balancers and associated resources (static IP addresses, storage, firewall rules, etc.), or would they be compressed to the workable minimum?

Would operating with a specific approach mean hiring a new engineer with Go competency? Or can you manage with knowledge of just YAML and the K8s API?

Complexity and Customizability

Does the approach reasonably explain the inner workings of the implementations? Are there multiple abstractions that lack adequate documentation and make the underlying implementation too opaque? How easy is it to troubleshoot the path between a client which is outside the K8s Pod network and a backend inside it?

How complex is it to incorporate project/team/organization specific customizations into the approach? Do changes have to be pushed upstream to the wider project to be approved before being added to the “official” repositories/image registries?

Latency

How many routing hops should a request go through after entering the K8s Overlay Network before hitting the actual backend? Do these hops introduce considerable latency? Are implementations following proper load testing to identify saturation points?

I have seen deployments with the Service Load Balancer implementation where HAProxy started to respawn under high in-flight request counts, because the default artifacts provided in the kubernetes/contrib repository had not adequately specified the CPU and memory limits and requests.

Operational Transition

What is the degree of freedom allowed to modify resources created as part of the approach? Can the older processes and methods of infrastructure management be used with the new approach for resources created outside the K8s realm? Or would Ops tasks be drastically disrupted?

For example, managing an AWS ALB instance created as part of an AWS ALB Ingress Controller with methods such as Terraform could be tricky, as the Ingress Controller itself could see outside changes as intrusive. On the other hand, if a basic NodePort setup is proxied by the ALB, then Terraform resources could very well be involved as management artifacts.

At the end of the day, exposing internal services to clients outside the K8s network is all about creating a bridge between the Node private network and the Pod network. This bridge is usually a NodePort. As long as this bridge is set up, any external load balancer with a public IP address can see and collect that port as a forwardable backend. Except for the ClusterIP Service type, all the other approaches discussed here are about how to manage this bridge, a.k.a. the NodePort. As long as this concept is understood, figuring out the reverse proxying strategy for your K8s cluster is a simple task.


Written on February 28, 2019 by Chamila de Alwis.

Originally published on Medium