Implementing a Metrics Monitoring Stack

TK Checklist for an observable system

why monitor

Alerting, post-hoc analysis, dashboarding, capacity planning. Post-hoc analysis, dashboarding, and capacity planning can tolerate low collection frequency (of maybe hours or days), and can even be somewhat faulty, because the trends they rely on are not affected.

alerting

alerts should not be noise. developer/ops time is valuable and should not be needlessly disturbed, so it should be decided which alerts generate the most disturbance.

additional

a monitoring system with more emphasis on post-hoc analysis is more helpful for finding the root cause (and fixing it) than a system that tries to remediate the situation automatically (and in doing so hides the root cause). complex rules are too rigid to adapt; it is better to have separate simpler rules than interconnected complex ones.

email alerts are rarely acted upon, and when they are, they take too much time out of ops cycles because of brief downtimes or false positives.

it should be possible to easily enable monitoring for a new component of the architecture, e.g. an existing monitoring stack that only needs an agent added to the newly introduced cluster nodes.
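as a rough sketch of what “just add an agent” could look like, assuming a Prometheus-style pull model and the prometheus_client library (the metric name and port below are made up): the new component exposes a metrics endpoint, and the existing scraper picks it up as one extra target.

```python
# Minimal sketch: the new service exposes its own /metrics endpoint, so the
# existing monitoring stack only needs one extra scrape target, not new tooling.
# prometheus_client, the metric name, and the port are illustrative assumptions.
import time
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("newservice_requests_total",
                   "Requests handled by the newly introduced component")

def handle_request():
    REQUESTS.inc()       # instrument the hot path as work is done
    time.sleep(0.1)      # stand-in for real work

if __name__ == "__main__":
    start_http_server(9100)   # metrics become scrapeable at :9100/metrics
    while True:
        handle_request()
```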

staring at dashboards does not work. that has been a failure and a waste of resources.

there should be a monitoring “toolbox” and a monitoring “day pack” in an organization that can be used to quickly enable well-designed observability in a new deployment. these will accumulate fixes and best practices along the way with each deployment.

a monitoring system should answer “what” and “why” (maybe the immediate cause); that should lead to better post-mortems without having to start from a blank slate.

maybe: alerts and dashboards can be balanced. critical items that need attention become alerts; items that may need attention but won’t break functionality become dashboard features. white-box monitoring may mostly produce dashboard features; black-box monitoring may produce alerts, since it is the view from outside the system and most likely affects end users in the same way.
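a black-box check of that kind can be as small as probing the system from outside, the way a user would, and paging only when the user-visible behaviour breaks. a sketch; the URL and the page_oncall hook are hypothetical.

```python
# Sketch of a black-box probe: check the system from outside, as a user would.
# The URL and the page_oncall() hook are hypothetical placeholders.
import urllib.request
import urllib.error

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

def page_oncall(message: str) -> None:
    print("PAGE:", message)   # stand-in for a real paging integration

if not probe("https://example.com/healthz"):
    page_oncall("user-facing endpoint is down")
```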

need to cover critical hops and paths.

monitoring and SLA calculations are easy for a monolithic system because of its shared infrastructure. but that changes when approaching microservices-based deployments, where multi-tenancy seeps into the infrastructure decisions. system-level aggregates do not tell the complete story for these systems.
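one way to keep the per-service and per-tenant story visible under shared infrastructure is to label metrics at the source, so aggregates can still be broken down at query time. a sketch assuming prometheus_client; the metric and label names are made up.

```python
# Sketch: label metrics per owning service and tenant so that system-level
# aggregates can still be broken down. Names are illustrative assumptions.
from prometheus_client import Counter

REQUESTS = Counter(
    "platform_requests_total",
    "Requests, broken down by owning service and tenant",
    ["service", "tenant"],
)

# Each microservice reports under its own label values.
REQUESTS.labels(service="billing", tenant="acme").inc()
REQUESTS.labels(service="billing", tenant="globex").inc()

# The overall aggregate (sum over all labels) is still available at query time,
# but per-service and per-tenant SLAs remain visible.
```

the usual caveat: per-tenant labels can blow up cardinality, so the required scope matters here too.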

4 golden signals

latency, errors, traffic, saturation — these should also work with predictions (maybe with soft limits)?
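a sketch of instrumenting the four signals around a single request handler, assuming prometheus_client; the metric names and the simulated failure are made up.

```python
# Sketch: the four golden signals around one request handler.
# prometheus_client and all metric names are illustrative assumptions.
import random
import time
from prometheus_client import Counter, Gauge, Histogram

TRAFFIC    = Counter("app_requests_total", "Traffic: requests received")
ERRORS     = Counter("app_errors_total", "Errors: failed requests")
LATENCY    = Histogram("app_request_seconds", "Latency: time spent per request")
SATURATION = Gauge("app_inflight_requests", "Saturation: requests in flight")

def handle(request):
    TRAFFIC.inc()
    SATURATION.inc()
    started = time.time()
    try:
        time.sleep(0.05)                # stand-in for real work
        if random.random() < 0.01:      # simulated failure path
            raise RuntimeError("simulated failure")
    except RuntimeError:
        ERRORS.inc()
    finally:
        LATENCY.observe(time.time() - started)
        SATURATION.dec()
```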

??: easy to translate to a traditional/VM-based deployment architecture, but how does this translate to containers?

==========================

the mean won’t work most of the time. a histogram with a distribution of the stats over different ranges gives a better understanding on the dashboard than a single number that keeps changing. a single number, or even the temporal graph of that single number, doesn’t say much.
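a sketch of the difference, assuming prometheus_client; metric names and bucket boundaries are illustrative. a histogram records how many observations fell into each range, while a sum/count pair only ever yields an average.

```python
# Sketch: an average-only view versus a bucketed distribution of the same value.
# Metric names and bucket boundaries are illustrative assumptions.
from prometheus_client import Histogram, Summary

# A Summary exposes only a sum and a count, i.e. an average over time...
LATENCY_AVG = Summary("checkout_latency_avg_seconds", "Average-only view")

# ...while a Histogram records counts per range, so a dashboard can show the
# whole distribution (and percentiles) instead of one ever-changing number.
LATENCY_DIST = Histogram(
    "checkout_latency_seconds",
    "Distribution view of checkout latency",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def record(seconds: float) -> None:
    LATENCY_AVG.observe(seconds)
    LATENCY_DIST.observe(seconds)
```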

granularity of each metric is important

there should be a way to determine if a certain node or a path has a metric or alert attached to it

a loosely coupled, simple system with multiple components is better than a tightly coupled, rigid system that cannot adapt, or even worse, adopts every single one of the metrics. only the required scope should be taken into consideration when designing observability.

questions on a new metric

does it capture an otherwise uncaptured metric? when and how will I be able to ignore this? are there cases where this metric shows instances when users are not affected in reality (test runs etc.)? can this alert be attended to later? how urgent is it? can the action be automated? are others alerted too? could paging be useless?

You can’t debug systems with dashboards



Originally published on Medium