
009: Deployment stack and routing with Traefik

Traefik is an open-source HTTP reverse proxy and load balancer that is easy to use. Point Traefik at your orchestrator and you are ready to go.

The daily mood

A fellow peer explained our approach to software deployment to me. Instead of getting lost in tons of information sources, there is real value in being told the history and having my silly questions answered. In this case, that opportunity definitely raised my level of understanding, comfort and motivation.

Project history

Following our adoption of Helm for immutable Kubernetes packaging, our SRE team leveraged familiar tools for configuration management (Ansible) and pipeline automation (Jenkins). In a few years our platform grew to a hundred services developed by 12 teams, staged across 4 environments and different Cloud infrastructure providers. At some point, we not only needed more DevOps automation, but also process standardization and parametrization (e.g. via templating). Ansible quickly reached its limits. We adopted Terraform for non VM-based environments as a replacement for AWS CloudFormation and Azure Resource Manager (ARM) templates. But as Kubernetes became our strategic platform for operations, we still had to put some Cloud Native solutions in place.

Contributions

A Kubernetes development cluster was created to let engineering teams transition from Docker Compose and to work on a small subset of the platform inside their own namespace. As QA required some simple tooling to bootstrap and sandbox their test campaigns, we also started to experiment with and evaluate different scenarios for application composition and deployment orchestration:
  1. Helmfile
  2. Custom Helm plugin wrapping our existing Ansible scripts
  3. GitOps via Flux Helm operator *
Helmfile was deemed a better option than Ansible. Yet for some reason it was then abandoned in favor of a custom Helm plugin, while nobody had a clue yet how to implement Flux. Today's post is about the solution currently in place, i.e. the custom Helm plugin, but we'll have a look at the other alternatives in future posts.

We are currently using Helm 2 (while Helm 3 has already been released). It will introduce breaking changes to the way we build charts. The current study may also become obsolete with the new concept of Library Charts, which allows more sharing and less replication of chart configurations.
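For the record, Helm 3 marks such shared charts with a dedicated chart type in Chart.yaml; a minimal declaration (the chart name here is made up) looks roughly like this:
apiVersion: v2
name: common-templates     # shared templates, pulled in as a dependency by application charts
type: library
version: 0.1.0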

* The very first time I heard about Flux, I thought of the Flux capacitor, a fictional invention by Dr. Emmett Brown which makes time travel possible in the film Back to the Future (1985). There could be no better name to strive for with a modern operational model.

Custom Helm Plugin

The project basically consists of a repository of Helm charts and a repository of Ansible scripts. The custom Helm plugin was developed separately until it was merged with the Ansible scripts.
It allows for creating, configuring and deploying collections of Helm charts instead of handling them one by one. Such a collection is referred to as a deployment stack. Configurations can be defined at different levels of a hierarchy of environments (e.g. dev, qa), stacks, products and charts.
From an architecture point of view, wrapping legacy code had the nice advantage of minimizing the migration effort and cost while adding new functionality on top, e.g. commands for adding default repositories and an ingress controller, creating pre-configured Helm charts and provisioning user accounts.
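To give an idea of that hierarchy (with made-up names, since the real layout is internal; only the group_vars path below appears elsewhere in this post), the layered configuration might look roughly like this, each level overriding the one above:
environments/
  dev/
    group_vars/arch/vars.yml    # environment-level defaults, e.g. released chart versions
    stacks/stackXY/vars.yml     # stack-level overrides owned by a team
charts/
  <chart-name>/values.yaml      # chart-level defaults shipped with the chart itself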

Deployment routine

The following pattern shows our plugin in action:
helm <plugin_name> deploy <stack_name> --released-charts --skip-deps --verbose
Command parameters:
  • --released-charts pulls chart and subchart packages from the chart repository instead of re-building them from source. In this case, the Docker image versions specified as values are obviously ignored.
  • --skip-deps pulls subchart packages from the chart repository instead of re-building them from source. This option is equivalent to a helm dep up per chart. It is ignored if --released-charts is in use.
  • --verbose displays all subsequent logs. This option is equivalent to a helm <command> --debug.
Because of the different command options and the multiple levels of values configuration in the stack hierarchy (environment, stack, Helm chart version, etc.), it is fairly easy to end up with a non-working configuration. So I had to make mistakes and learn how to troubleshoot errors.

Troubleshooting configuration

I started with the deployment of one of our simplest stacks (5 charts, 17 pods) and actually forgot the --released-charts option, so that all charts had to be built from source in their current development state before they could be deployed to my cluster. The process took some time before it threw the following error:
validation failed: unable to recognize "":
no matches for kind "IngressRouteTCP" in version "traefik.containo.us/v1alpha1" 
In a nutshell, the error means that some object definition is not supported by my cluster.
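A quick way to verify that on any cluster (plain kubectl, nothing specific to our plugin) is to check which API groups and custom resource definitions are actually served:
kubectl api-versions | grep traefik            # is the traefik.containo.us group available?
kubectl get crd | grep traefik.containo.us     # are the Traefik CRDs installed?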

What is Traefik?

We chose Traefik as our default Kubernetes ingress controller for routing microservice communications. Indeed, Traefik routers fit our requirement to implement a domain-based naming convention well, dynamically creating routes according to the different team namespaces.
We are currently using v1 (while v2 has already been released). If I understood correctly, Traefik v1 does not allow creating such "dynamic routes" for backend services which do not speak plain HTTP, such as Kafka and some of our databases. For that reason, we had to configure those components so that they are accessed through a given endpoint created at Traefik launch, with a static host-matching rule applied via a Kubernetes Custom Resource Definition (CRD) object. This is precisely the configuration part where my error appeared.
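For reference, the kind of object my cluster refused to recognize looks roughly like the following generic sketch (placeholder names and port, not our actual chart template):
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: kafka
spec:
  entryPoints:
    - kafka                  # entry point defined at Traefik launch
  routes:
    - match: HostSNI(`*`)    # static matching rule for a non-HTTP backend
      services:
        - name: kafka
          port: 9092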

Solving the issue

I was advised to disable the corresponding flags in my Helm values, which allowed my configuration to validate. But not understanding why confused me a bit. Am I missing some kind of object definition, running a cluster API version different from the one required by the resource definition (cf. kubectl api-versions), or would it even be possible to upgrade to Traefik v2 at all? So many open questions started to fill my head, in an almost hopeless and useless way.

One step further, I figured out that I was missing the access permission for pulling binary images from our private registry, an Artifactory open to our internal network only. So I requested permission from SRE (on Slack) and DEVOPS (on Jira), then created personal access keys.
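Wiring those credentials into the cluster then comes down to something like the following sketch (placeholder names), after which the secret can be referenced as an imagePullSecret in the pod spec or attached to the namespace's service account:
kubectl create secret docker-registry artifactory-pull \
  --docker-server=<artifactory_host> \
  --docker-username=<user> \
  --docker-password=<access_key> \
  -n <my_namespace>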

Deployment result

The deployment routine finally completed and I could monitor the pods while they were being created, pulling images and so on.
watch -n 3 kubectl get pods
Later on, another colleague pointed me to K9s, which is an even more stylish and convenient way to watch cluster resources.
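If K9s is installed, starting it scoped to my own namespace is as simple as:
k9s -n <my_namespace>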

Even though not all of the pods reached a ready state, my Traefik dashboard showed the hostnames mapped to Kubernetes pods, which my local browser resolved from /etc/hosts. IAM/IDP redirection also worked fine.
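For completeness, that local resolution is nothing more than a few /etc/hosts entries pointing the stack's hostnames at the cluster's ingress IP (made-up names and address below):
192.168.99.100   traefik.dev.local   service-a.dev.local   service-b.dev.local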

Troubleshooting execution

Once a pod in error status has been identified, in my case Kafka, the next step is to look at its details:
kubectl describe pod <kafka_pod>                   # get the name of the init container
kubectl logs -c <init_container> <kafka_pod>       # get the container log
It appeared that a Kafka init container depended on Zookeeper but couldn't reach it. After some time spent investigating the health-check command based on netcat (nc), the process finally completed, although once again it wasn't clear why...

I found out that I had to remove any optional parameters from that command (e.g. -w for timeout, -v for verbose) in order for it to work as expected in non-interactive mode. At first I thought this issue couldn't be related to my environment only, until I remembered we had actually disabled the custom ingress (Traefik) configuration for several components including Kafka, which may have had an impact on further network behaviour.
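For context, the check in question follows the classic wait-for-Zookeeper init container pattern, roughly like this generic sketch (not our actual Kafka chart):
initContainers:
  - name: wait-for-zookeeper
    image: busybox
    # a bare nc -z works non-interactively; extra flags such as -w or -v were what broke it in my case
    command: ['sh', '-c', 'until nc -z zookeeper 2181; do echo waiting for zookeeper; sleep 2; done']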

And so I cleaned up my Kubernetes cluster and Helm cache, started the process all over again with different kinds of command-line parameters, ran into the same error related to Traefik and, even worse, found some more issues in the stack trace:
render error in "<chart-X>/templates/ingress.yaml": 
Ingress's host must be specified
render error in "<chart-Y>/templates/secret.yaml": 
Encrypted database password is required
At that point it became hard for me to resist the temptation to look into every single detail and potentially touch the sources. I had actually been instructed not to do so, but to make changes within the wrapper values instead, which definitely makes sense to me since chart configuration values seem to be owned by different people than stack configuration values. But I couldn't really figure out what to change and how, so I finally stuck to my first configuration in order to move forward with my ramp-up.

UPDATE: A colleague finally reminded me that our wrapper project is a living asset whereas the released charts are immutable. Since I am obviously working on a recent Git revision, I was likely using an environment configuration (e.g. dev) which is more recent than the released chart versions defined by a given team stack (e.g. stackXY). In this case it is best practice to update those versions inside the team stack/folder (environments/dev/group_vars/arch/vars.yml), or to create a config dedicated to my own environment. Similarly, when I let the charts build from source, it is possible that some versions of the referenced Docker images are not yet available from the repository. The reason is that developers use their own local repo to build and test the applications described in a chart already committed in Git...
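Purely as an illustration (the real schema of that vars file is internal, so the structure below is guessed), pinning released chart versions for a stack could look like:
# environments/dev/group_vars/arch/vars.yml -- hypothetical structure
stack_charts:
  chart-x:
    version: 1.4.2    # released chart version expected by the team stack
  chart-y:
    version: 0.9.0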

Conclusions

What I learnt from this experience is that configuring and deploying collections of Helm charts can get extremely awkward.
My first deployment was not very satisfying, but it completed in under 3 minutes, provided that the Docker images had previously been pulled by the cluster.
Next, I will set up local Helm and Docker registries in order to better understand and control build and pull activities during deployments.
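As a starting point, such a local setup could be as simple as the stock Docker registry image plus Helm 2's built-in chart server (default ports, to be adapted):
docker run -d -p 5000:5000 --name registry registry:2    # local Docker image registry
helm serve &                                             # local Helm 2 chart repository on 127.0.0.1:8879
helm repo add local http://127.0.0.1:8879/charts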
