Implementing a Metrics Monitoring Stack
TK Checklist for an observable system
why monitor
Alerting, post-hoc analysis, dashboarding, and capacity planning can all tolerate low collection frequency (intervals of hours or even days), and the collection can even be somewhat faulty, because long-term trends are not affected.
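Capacity planning in particular works fine with sparse, trend-oriented samples. A minimal sketch of the idea, using hypothetical daily disk-usage readings and an assumed 80% soft limit, fits a linear trend and estimates when the limit will be crossed:

```python
# Sketch: predict when a resource hits a soft limit from sparse daily samples.
# The usage numbers and the 80% soft limit below are made up for illustration.

def fit_trend(samples):
    """Least-squares slope and intercept for (day, usage) pairs."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * u for d, u in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

def days_until(limit, samples):
    """Days (from day 0) until the fitted trend reaches `limit`, or None."""
    slope, intercept = fit_trend(samples)
    if slope <= 0:
        return None  # usage is flat or shrinking; no exhaustion in sight
    return (limit - intercept) / slope

# One disk-usage reading (in %) per day for a week.
usage = [(0, 52.0), (1, 53.1), (2, 54.2), (3, 55.0),
         (4, 56.2), (5, 57.1), (6, 58.3)]
print(round(days_until(80.0, usage)))  # → 27 (roughly four weeks out)
```

Even with one sample per day and a little noise, the trend is enough to answer the capacity question; higher-frequency collection would not change the answer much.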
alerting
Alerts should not be noise: developer/ops time is valuable and should not be needlessly disturbed. Which alerts generate the most disturbance should be decided deliberately.
additional
A monitoring system that emphasizes post-hoc analysis is more helpful for finding the root cause (and fixing it) than a system that tries to remediate the situation automatically, hiding the root cause in the process. Complex rules are too rigid to adapt; it is better to have separate, simpler rules than interconnected complex ones.
Email alerts are rarely acted upon, and when they are, they take too much time out of ops cycles because they often turn out to be brief downtimes or false positives.
It should be possible to easily enable monitoring for a new component of the architecture, e.g. an existing monitoring stack that only needs an agent installed on the newly introduced cluster nodes.
Staring at dashboards does not work; that approach has been a failure and a waste of resources.
There should be a monitoring “toolbox” and a monitoring “day pack” in an organization that can be used to quickly enable well-designed observability in a new deployment. These will collect fixes and best practices along the way with each deployment.
A monitoring system should answer “what” is broken and “why” (perhaps the immediate cause). That should lead to better post-mortems that do not have to start from a blank slate.
Perhaps alerts and dashboards can be balanced: critical issues that need attention become alerts; issues that may need attention but won’t break functionality become dashboard features. White-box monitoring may mostly produce dashboard features, while black-box monitoring may produce alerts, since it is the view from outside the system and most likely reflects what end users experience.
Critical hops and paths need to be covered.
Monitoring and SLA calculations are easy for a monolithic system because of its shared infrastructure, but that changes when approaching microservices-based deployments, where multi-tenancy seeps into infrastructure decisions. System-level aggregates do not tell the complete story for these systems.
4 golden signals
Latency, errors, traffic, and saturation. Saturation should also work with predictions (perhaps against soft limits)?
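A rough sketch of deriving the four golden signals from a batch of request records. The record shape (`duration_ms`, `status`), the window length, and the provisioned capacity are all assumptions for illustration:

```python
# Sketch: computing latency, errors, traffic, and saturation from request
# samples. Field names, thresholds, and capacity are hypothetical.

from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code

def golden_signals(requests, window_s, capacity_rps):
    latencies = sorted(r.duration_ms for r in requests)
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    errors = sum(1 for r in requests if r.status >= 500)
    traffic = len(requests) / window_s       # requests per second
    saturation = traffic / capacity_rps      # fraction of provisioned capacity
    return {
        "latency_p99_ms": p99,
        "error_rate": errors / len(requests),
        "traffic_rps": traffic,
        "saturation": saturation,
    }

# 95 fast, successful requests and 5 slow failures in a 10-second window.
reqs = [Request(12.0, 200)] * 95 + [Request(480.0, 500)] * 5
signals = golden_signals(reqs, window_s=10, capacity_rps=50)
print(signals["error_rate"], signals["saturation"])  # → 0.05 0.2
```

Note that latency is reported as a percentile rather than a mean, for the reasons discussed below, and saturation is the one signal that lends itself to prediction against a soft limit.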
??: This is easy to translate to traditional/VM-based deployment architectures, but how does it translate to containers?
==========================
The mean wouldn’t work most of the time. A histogram showing the distribution of the stat over different ranges gives a better understanding on a dashboard than a single number that keeps changing; a single number, or even the temporal graph of that single number, doesn’t say much.
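To make this concrete, here is a small sketch (bucket boundaries are arbitrary, chosen for illustration) where the mean looks unremarkable while the histogram exposes a slow tail:

```python
# Sketch: why a latency histogram beats a single mean on a dashboard.
# Bucket upper bounds (in ms) are arbitrary, picked for illustration.

import bisect
from collections import Counter

BUCKETS = [10, 25, 50, 100, 250, 500, 1000]  # upper bounds, ms

def histogram(latencies_ms):
    counts = Counter()
    for v in latencies_ms:
        i = bisect.bisect_left(BUCKETS, v)
        label = f"<={BUCKETS[i]}ms" if i < len(BUCKETS) else f">{BUCKETS[-1]}ms"
        counts[label] += 1
    return counts

# 90 fast requests and 10 pathological ones: the mean suggests everything is
# fine-ish, while the histogram makes the slow tail obvious.
samples = [15.0] * 90 + [900.0] * 10
mean = sum(samples) / len(samples)
print(round(mean, 1))            # → 103.5
print(dict(histogram(samples)))  # → {'<=25ms': 90, '<=1000ms': 10}
```

The same principle is why metrics systems typically expose latency as bucketed histograms or percentiles rather than averages.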
The granularity of each metric is important.
There should be a way to determine whether a certain node or path has a metric or alert attached to it.
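Such a coverage check could be as simple as diffing the known topology against a mapping of attached metrics and alerts. A minimal sketch, with hypothetical node names and coverage entries:

```python
# Sketch: finding nodes/paths with no metric or alert attached.
# Node names and the coverage mapping below are hypothetical.

def uncovered(nodes, coverage):
    """Return nodes with no metric or alert attached, in input order."""
    return [n for n in nodes if not coverage.get(n)]

nodes = ["lb", "api", "db", "cache", "worker"]
coverage = {
    "lb": ["alert:5xx_rate"],
    "api": ["metric:latency_p99", "alert:error_rate"],
    "db": ["metric:connections"],
}
print(uncovered(nodes, coverage))  # → ['cache', 'worker']
```

Run periodically against the deployment inventory, a check like this keeps newly introduced components from silently going unmonitored.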
A loosely coupled, simple system with multiple components is better than a tightly coupled, rigid system that cannot adapt, or even worse, adopts every single one of the metrics. Only the required scope should be taken into consideration when designing observability.
questions on a new metric
Does it capture an otherwise uncaptured signal?
When and how will I be able to ignore this?
Are there cases where this metric fires even though users are not affected in reality (test runs etc.)?
Can this alert be attended to later? How urgent is it?
Can the action be automated?
Are others alerted too? Could paging be useless?
You can’t debug systems with dashboards
Originally published on Medium