The FluxCD GitOps operator supports Kustomize for dynamically configuring Kubernetes resources, including HelmRelease objects. This post is a solution analysis.
The daily mood
My manager looked at my current activity on the Helm deployment toolchain and at the latest findings described in my previous post on Flux Helm Operator + Kustomize.
He generally supports my direction and approach, but also raised some concerns and questions that I shall answer in this solution draft.
Context
We currently operate hundreds of services in Kubernetes clusters. Without claiming to have a microservice architecture, we generally try to design small, standalone components that communicate over HTTP. We also try to develop and test with agility (Scrum), create immutable deliveries (Maven artifacts, Docker images, Helm charts), use resilient infrastructure, and operationalise for reliability. Our services are shared by a dozen different teams, combined into logical stacks on purpose (e.g. a product), and move across 5 promotion stages until they finally reach production. With this, we are serving a growing number of users across multiple regional clusters. That should sufficiently reveal our extensive need for configuration, governance and automation, mainly backed today by our System Reliability Engineers (SRE).
Issues
While Developers continuously deliver and tag Docker images, SREs consume them (by specifying a given version, or ~ for latest) using Helm charts driven by Ansible scripts. The overall volume and complexity of infrastructure code grows as new services are added, composed or changed, which makes the assets not only hard to maintain for existing teams, but also difficult to pick up for new projects and hires. Moreover, the recurring configuration and deployment of charts is a time-intensive and error-prone activity that the engineering organisation can barely capitalise on. In the end it is difficult for Developers to comply with the reliability model on one hand, and for Operations to support frequent change on the other. As reported by our QA, the later we are in the promotion pipeline, the more constraints we meet and the slower we can move. With this, our time-to-market and business growth are literally at risk.
Challenges
One year ago the architecture team put in place a development cluster together with new tooling (a custom Helm plugin as the bridge between Helm charts and Ansible) and guidelines (documentation, training sessions, support). Following that, they noticed growing demand for and commitment to dev-cluster configuration, but still limited value, mainly because only a few people were confident with it. At the same time, the SRE team started a new initiative based on GitOps as the new operational model, and picked FluxCD as the tool of choice for implementing it. Flux is compatible with Helm charts but not (reasonably) compatible with the aforementioned custom plugin and its Ansible-based approach, so the original discrepancies between Dev and Ops would remain.
Introducing Kustomize
Kustomize is a client-side tool for simplifying the configuration and deployment of Kubernetes manifests. It ships with kubectl (as an out-of-the-box subcommand) and is commonly used as an alternative to Helm. It is also supported by FluxCD. Unlike Helm, Kustomize supports neither templating nor immutable packaging, so we definitely want to keep Helm. But Kustomize offers a native and purely declarative (therefore comprehensible) approach which we could use to factorise configurations. Although Helm and Kustomize do not integrate directly with each other, we found that combining them could fit our purpose. We considered two different scenarios:
- Use Kustomize to customise Kubernetes manifests rendered by Helm, as described by Testingclouds in the article Customizing Upstream Helm Charts with Kustomize. In this case, however, we introduce a new custom step and a resource representation which reflects neither the code nor the cluster.
- Use Kustomize to customise Flux Helm Operator HelmRelease (CRD) objects, a scenario that we thought worth evaluating.
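To make the second scenario concrete, here is a minimal sketch of a base HelmRelease with a per-cluster overlay. The chart name, repository URL, paths and values are hypothetical, not taken from our actual repository:

```yaml
# base/helmrelease.yaml -- shared definition (hypothetical "portal" chart)
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: portal
spec:
  releaseName: portal
  chart:
    repository: https://charts.example.com/
    name: portal
    version: 1.2.3
  values:
    replicaCount: 1

# base/kustomization.yaml
resources:
  - helmrelease.yaml

# overlays/cluster-a/kustomization.yaml -- per-cluster specialisation
bases:
  - ../../base
namespace: portal-dev
patchesStrategicMerge:
  - values-patch.yaml

# overlays/cluster-a/values-patch.yaml -- only the values that differ
apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: portal
spec:
  values:
    replicaCount: 3
```

Running `kustomize build overlays/cluster-a` would then emit a single HelmRelease with the merged values, ready to be committed to the GitOps repository and picked up by Flux.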
Solution big picture
We assume the need for a layer atop Helm charts for abstracting configurations and deployments, as in Helmsman. From there we will discuss how a HelmRelease management solution compares with vs. without Kustomize.
Flux Helm Operator + Kustomize solution diagram:
In the top-left corner, the development sources. In the top-right, the GitOps repository: configuration assets are assembled iteratively via manual input and script-based manifest generation. At the bottom of the picture, the GitOps operator.
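Under these assumptions, the GitOps repository could be laid out roughly as follows (directory and cluster names are purely illustrative):

```
gitops-repo/
├── base/                        # shared HelmRelease definitions per service
│   ├── kustomization.yaml
│   └── portal-helmrelease.yaml
├── clusters/
│   ├── eu-west/
│   │   ├── kustomization.yaml   # points to ../../base, adds cluster patches
│   │   └── namespaces/
│   │       ├── dev/             # per-namespace overlays
│   │       └── qa/
│   └── us-east/
└── scripts/                     # generation scripts for replicated objects
```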
Strengths of the solution
- Native: Already part of Flux so no need for additional installation/maintenance.
- Grouping: Submit multiple HelmReleases at once instead of individual ones.
- Factorising: Limit replication and maintenance of identical assets for different namespaces.
- Accessibility: A template-free approach fosters adoption.
Weaknesses of the solution
- Structural complexity: Sharing at base level, multiplexing at overlay level, and working around the non-standard support for multi-overlay patches and dynamic values.
- New definition of a stack: Because HelmRelease composition is not supported at overlay level, an application stack is now composed of one or multiple bases.
- Process complexity: Through iterative configuration in a semi-automated workflow.
- GitOps circuit breaker: Unlike a Kustomize-free solution, cluster-state changes (e.g. a newer Docker image) are not likely to replicate back to the (shared) HelmReleases.
Factorising measurement
The only benefit that is directly measurable is factorisation.
And because of the Kustomize limitations previously mentioned, it cannot be an exact science. The following table is based on our example "portal" application.
|  | Without Kustomize | With Kustomize |
|---|---|---|
| #Lines base | = 463 | = 463 |
| + generated object(s) | | +7 = 470 |
| #Lines for 3 clusters | x3 = 1389 | +5x3 = 485 |
| + generated object(s) | | +10x3 = 515 |
| #Lines for 5 ns | x5 = 6945 | +40x5 = 715 |
| + generated object(s) | | +10x3x5 = 865 |
Because of the Kustomize limitations mentioned above, some objects need to be replicated automatically by the generation scripts in order to simulate inheritance. The drawback of this behaviour shows in the last line, with 2 multiplications at the 2nd overlay level, 3 at the 3rd, etc.
Even when including generated objects, we achieve a factor of 8 in the example above. If I am not mistaken, the factor can be approximated as:
( siblings_at_overlay_n x siblings_at_overlay_n+1 ) / number_of_overlays
I consider it satisfactory to spare that many lines of code. Moreover, generated objects are not maintenance-relevant lines anyway. Still, measuring only one benefit is not the holy grail.
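As a sanity check of the table and the factor-8 claim, here is a short Python sketch of the line-count model. The increments 7, 5, 10 and 40 are the per-overlay line counts read from the table above:

```python
# Rough line-count model for the "portal" example, reproducing the table.
base = 463
clusters, namespaces = 3, 5

# Without Kustomize: the whole base is copied per cluster, then per namespace.
without = base * clusters * namespaces

# With Kustomize: one shared base, plus small overlay patches,
# plus objects replicated by the generation scripts (the last term
# multiplies across both overlay levels, as noted above).
with_kustomize = (base + 7                      # base + generated object
                  + 5 * clusters               # per-cluster patches
                  + 10 * clusters              # generated per cluster
                  + 40 * namespaces            # per-namespace patches
                  + 10 * clusters * namespaces)  # generated per cluster x ns

print(without, with_kustomize, round(without / with_kustomize, 1))
# -> 6945 865 8.0
```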
Conclusion
I focused on some technical goals and approaches as part of my ramp-up. Following a first achievement (PoT, proof of technology), we came to discuss benefits (PoV, proof of value) with regard to a concrete business requirement. This draft is a preparation for a solution proposal, which we may submit should the value justify it. A simple approach is to put all pros and cons in a table or matrix (see also SWOT analysis). A larger evaluation of criteria (Excel) and feasibility (PoC) might then be required, followed by a pilot phase and finally a rollout until the solution gets widely adopted. To be clear, we could not show much value for now, so the solution has an improbable future. But as a former Solution Engineer, it is actually a great experience to see the whole internal process by which a new idea becomes reality.