An introduction to the buzzword and the rationale for implementing it
What is Observability?
Before we dive into the waters, we need to define what observability is. Let’s go for some tweets first.
Monitoring is for operating software/systems
Instrumentation is for writing software
Observability is for understanding systems
Charity Majors (@mipsytipsy), September 23, 2017
What’s interesting though is her tweet before that.
Getting pretty annoyed with the very sloppy use of terms by people who then use them to bludgeon or lecture others. Here, I'll help:
Charity Majors (@mipsytipsy), September 23, 2017
Yes, the term has been thrown around with pretty much whatever personal meaning people choose to attach to it. It’s not a rad new word for your existing monitoring setup. It’s not a new way to do tracing. And it’s certainly not that flashy new dashboard someone just designed.
So what is Observability then?
Observability is the level of visibility that the system grants to an outside observer. It’s a property of a system, just like usability, availability, and scalability.
We talk about how a system is observable, just like how we talk about how a system is scalable.
So it should be clear from this point onward that we are not talking about just a new technology stack, but about how to enable this property for a system successfully.
Why talk about it?
Observability of a system is not an afterthought to be considered only after the deployment is done. If it is not designed in, then by that stage of the deployment lifecycle successful observability will happen by accident, if it’s achievable at all. Here the emphasis is on “successful” rather than “observability”. Even for a system designed without much focus on observability, it’s possible to manage some kind of high-level look into its dynamic state. However, that would not be a successful one that covers all the intended user stories for observability. It’s a matter of when rather than if an incident will happen without anyone knowing what caused it or (gasp!!) what the exact impact was.
There is another side to enabling a proper level of observability of a system: the run-time state is what a given desired state should be compared against. The properties collected about a running system speak about its actual state. If this is different from the desired state of the system, a normalizing process should be executed to bring the system to the desired state. Without proper observability, enforcing a process-based, codified desired state is accidental at best.
What enables successful observability for a given system is a collection of existing practices. These include monitoring, tracing, log aggregation, and alerting, which produce various metrics that provide a view into server performance, service availability, and capacity consumption. A well-designed, holistic approach to managing these insights results in an observable system with no blind spots.
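To make the desired-versus-actual comparison concrete, here is a minimal sketch in Python. It is only an illustration: the `ServiceState` fields and the `diff_state` helper are hypothetical names made up for this example, and a real normalizing process would act on far richer data collected from the running system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceState:
    """Hypothetical snapshot of one service; the fields are illustrative only."""
    replicas: int
    version: str


def diff_state(desired: ServiceState, observed: ServiceState) -> List[str]:
    """List the corrective actions that would move the observed state to the desired one."""
    actions = []
    if observed.version != desired.version:
        actions.append(f"roll out version {desired.version}")
    if observed.replicas < desired.replicas:
        actions.append(f"start {desired.replicas - observed.replicas} replica(s)")
    elif observed.replicas > desired.replicas:
        actions.append(f"stop {observed.replicas - desired.replicas} replica(s)")
    return actions


# The observed state is exactly what observability has to supply; without it
# there is nothing to compare the codified desired state against.
desired = ServiceState(replicas=3, version="1.4.0")
observed = ServiceState(replicas=2, version="1.4.0")
print(diff_state(desired, observed))  # ['start 1 replica(s)']
```

An empty list means the system already matches the desired state, so the normalizing process has nothing to do.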
Hindsight is a product of Foresight
One truth that I have found to be always reliable about systems is that incidents are unavoidable. It’s a part of running a system, and you manage incidents to keep qualities like availability, usability, and fault tolerance at an acceptable level. Sticking to Service Level Agreements (SLAs) is the primary function of any system. This implies an acceptable rate of errors and downtime that we measure against a pre-defined set of numbers.
Having accepted the fact that incidents are part of a system, it’s important to design the level of observability of a system to focus on incident resolution and root cause analysis. A system designed to collect information that, when analyzed, points clearly to the root causes of incidents is far better (and more manageable) than a system that tries to fanatically avoid incidents. Well-understood root cause analysis reports improve system SLAs over time far better than anticipating issues can. Incidents are by nature hard to anticipate; therefore, even for a cleverly designed system, the chance of some issue spiraling into a service level interruption is effectively 100% over the course of its uptime. Interestingly, this is why the sooner the volatility of a system is accepted, the better it can be designed to withstand it.
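As a rough illustration of measuring against a pre-defined number, the sketch below turns an availability target into a downtime allowance and checks observed downtime against it. The 99.9% target, the 30-day window, and both function names are assumptions chosen for the example, not figures from any particular SLA.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over the given period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def within_sla(sla_percent: float, observed_downtime_minutes: float,
               period_days: int = 30) -> bool:
    """True if the observed downtime still fits inside the budget."""
    return observed_downtime_minutes <= allowed_downtime_minutes(sla_percent, period_days)


# Assumed target: 99.9% availability over a 30-day window.
print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
print(within_sla(99.9, observed_downtime_minutes=30))  # True
print(within_sla(99.9, observed_downtime_minutes=60))  # False
```

The observed downtime figure is only as trustworthy as the data behind it, which is where observability comes in.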
Uptime is important. Post-mortems are importanter.
This is why designed observability of a system is more helpful than an accidental one. The quality of the view inside should be a product of the experience of running other systems (and their gaps). This, however, introduces the familiar vicious cycle of not having enough experience to design a system to gain enough experience. Therefore, a set of guidelines for observable systems can help organizations hit the ground running even with minimal experience of running successful or faulty systems.
Downfall of the Monolithic
Analyzing a monolithic application deployment is easy. This idea is relatively new, because until Container Cluster Management and Microservices laid waste to the then-established way of designing solutions, we thought otherwise.
Single servers with a predefined number of instances demand a static approach to make them observable. Introduce a monitoring tool, install agents, and maybe add a dashboard, and you’re done.
This changes drastically when it comes to distributed systems that tend to scale up and down with almost no advance notice in terms of capacity and the number of instances. A dynamic environment like this requires plans and tools that can understand the different demands of the system. For example, a dynamic system with low-cost horizontal scaling is less worried about individual instance CPU usage than a static system that depends on vertical scaling done perhaps once a year. The rationales and approaches for the new paradigms change so often that it’s worth having a set of best practices, guidelines, and, if possible, an implementation framework rather than a concrete solution for designing observable systems.
Therefore, organizations should strive to develop their own best practices when it comes to observable systems. This effort is to define a framework for doing so.
Written on August 21, 2018 by chamila de alwis.
Originally published on Medium