008: Site Reliability Engineering

Site Reliability Engineering (SRE) is a particular approach to DevOps for keeping systems highly available (HA) while continuously releasing features.

The daily mood

I enjoy my work a lot. However, I am just starting to realize how important it is to stay consistent and reflective during my ramp-up. I do have directions, but nobody hands me a concrete plan and tasks, so every new working day starts with self-review, questioning and planning. In general, it is very important to read, learn and practice a lot now that I have time allocated to this, so that I can contribute later when I have more responsibility and less time.

One of my focus areas is the science of operationalization. I am making good progress in reading and understanding the Google SRE book: 11/32 chapters covered so far, mainly the introduction and principles.

Introduction

The Site Reliability Engineer (SRE) role truly emerged at Google around 2003, when the traditional sysadmin approach (block any change to prevent bugs) no longer fit modern cloud business and operations (foster change to improve service competitiveness and customer satisfaction): "SRE is what happens when you ask a software engineer to design an operations team". The era of distributed systems (e.g. Borg) and fully automated factories (e.g. Tesla Motors Inc.) had begun.

Principles

Google realized how critical Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR) are in a system's lifecycle, and how closely they are tied to the bottleneck of on-duty employees. They figured out that, in order to keep up with growing demand and capacity planning, no more than 50% of SRE time should be spent on aggregate "ops" work, meaning roughly half of the time goes to on-call and operational duties and the other half to design and optimisation.
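
To see for myself why both metrics matter, here is a minimal sketch of the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The function name and the example figures are my own assumptions for illustration, not numbers from the book.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given its mean time to failure
    and mean time to repair (both in hours)."""
    if mttf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTTF and MTTR must be non-negative")
    if mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR cannot both be zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical service: fails on average every 999 hours, takes 1 hour to repair.
baseline = steady_state_availability(mttf_hours=999, mttr_hours=1)

# Halving the repair time improves availability without touching the failure rate.
faster_repair = steady_state_availability(mttf_hours=999, mttr_hours=0.5)
```

The second call illustrates why repair time is tied to the people on duty: a faster repair raises availability even when failures are just as frequent.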

In order to reach that goal, there should be a culture of:

removing toil (the manual, repetitive tasks that will wear you down if they are not automated soon)
proactively monitoring services (e.g. SolarWinds Pingdom checks load and availability, PagerDuty creates incidents, Jira assigns tickets, etc.)
phasing rollouts (e.g. the release manager limits experiments to a small share of capacity and eventually adopts feature flipping)
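
The phased rollout idea can be sketched as a percentage-based feature flag. This is only one possible approach, assuming users are identified by a string ID; the function name, the feature name and the hashing scheme are hypothetical and not taken from the book or from any specific tool.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of traffic
    for the given feature. The decision is deterministic per user."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets (0.01% granularity).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Hypothetical usage: expose "new-dashboard" to 5% of users first,
# then raise the percentage as confidence grows.
enabled = in_rollout("user-42", "new-dashboard", percent=5)
```

Because the bucket depends only on the feature and the user, a user who is enabled at 5% stays enabled when the rollout is widened to 50%.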

Site Reliability Engineering also abandons the unrealistic idea that every service should be 100% available with one dedicated team responsible for it. Instead, the intentional use of an error budget (the permitted unavailability, e.g. 0.01%) should allow everybody to balance paying down technical debt against achieving maximal feature velocity.
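
To get a feeling for how small such a budget is, here is a minimal sketch that converts a permitted unavailability into minutes of downtime. The 30-day window and the function names are my own assumptions.

```python
def error_budget_minutes(permitted_unavailability_percent: float,
                         window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window for a given budget."""
    if not 0 <= permitted_unavailability_percent <= 100:
        raise ValueError("percentage must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return permitted_unavailability_percent / 100 * window_minutes


def remaining_budget_minutes(budget_minutes: float,
                             downtime_minutes: float) -> float:
    """Budget left after the downtime already consumed (never below zero)."""
    return max(budget_minutes - downtime_minutes, 0.0)


# A 0.01% budget over 30 days is roughly 4.32 minutes of allowed downtime.
budget = error_budget_minutes(0.01)

# After a hypothetical 3-minute incident, a bit more than a minute is left.
left = remaining_budget_minutes(budget, downtime_minutes=3)
```

As long as some budget is left, the team can keep shipping features; once it is spent, the focus shifts to stability.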

Our context

We initially learnt state-of-the-art DevOps when launching our first cloud services. Five years later we started an initiative aiming to put more SRE theory into practice, especially with a view to improving our Service Level Agreements (SLA = contracts) via well-defined Service Level Objectives (SLO = goals) and relevant Service Level Indicators (SLI = metrics). For that, SRE theory shouldn't be reserved for the SRE role but shared across the R&D organisation. A first workshop took place and I was able to watch the recording. For our team, the topic is tightly coupled with the track named Observability.
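
The relation between an SLI (what we measure) and an SLO (what we aim for) can be sketched as below. This assumes a simple request-based availability indicator; the class, the function names and the numbers are hypothetical and do not describe our actual services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySlo:
    """A goal for the share of successful requests, e.g. 0.999 for 99.9%."""
    name: str
    target: float


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Measured share of successful requests (the indicator)."""
    if total_requests < 0 or not 0 <= good_requests <= total_requests:
        raise ValueError("need 0 <= good_requests <= total_requests")
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli: float, slo: AvailabilitySlo) -> bool:
    """True when the measured indicator reaches the objective."""
    return sli >= slo.target


# Hypothetical measurement: 999,500 good requests out of 1,000,000.
slo = AvailabilitySlo(name="api-availability", target=0.999)
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
ok = meets_slo(sli, slo)
```

The SLA would then be the contract built on top of such objectives, typically with a looser target than the internal SLO.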

Take-away

The book helps me get familiar with operational infrastructure requirements.

In the first half, we are introduced to the challenges and principles of SRE.

In the second half, we get guidance on SRE practices (e.g. the hierarchy of tests in Chapter 17) and management.

The newbie cloud architect diary
