{"id":48,"date":"2021-11-05T15:10:00","date_gmt":"2021-11-05T15:10:00","guid":{"rendered":"https:\/\/baecke.io\/?p=48"},"modified":"2021-11-05T15:10:00","modified_gmt":"2021-11-05T15:10:00","slug":"application-health-business-frame","status":"publish","type":"post","link":"https:\/\/baecke.io\/?p=48","title":{"rendered":"Application Health in Cloud-Native Systems: Why Kubernetes Observability Needs a Business Frame"},"content":{"rendered":"<p>A cloud-native system can report perfect health while the business it runs is quietly failing.<\/p>\n<p>The dashboards are green. Pods are stable, CPU is comfortable, memory is fine, and not one infrastructure alert has fired. Meanwhile checkout is failing for one customer in twenty, and no one will know until the complaints arrive, because nothing being measured is actually about the customer. This is the central blind spot of cloud-native observability as most enterprises practise it: it measures whether the platform is functioning, not whether the application is delivering value, and it mistakes the first for the second.<\/p>\n<h2>Infrastructure Metrics Answer the Wrong Question<\/h2>\n<p>Pod restarts, CPU utilisation, memory pressure: these are real and worth watching, but they answer a question about the platform, not the product. They tell you the machinery is turning. They say nothing about whether the machinery is producing what the business needs, and in a distributed system the gap between those two is wide enough to drive an outage through.<\/p>\n<p>The reason this matters more in cloud-native systems than it did before is that the infrastructure and the experience have come apart. In a monolith on a server, a healthy server usually meant a healthy application. In a distributed system of many services and ephemeral pods, the infrastructure can be entirely healthy while a single failing dependency quietly breaks a business-critical path. Measuring the infrastructure now tells you less about the experience than it used to, precisely when the systems are more complex and the stakes are higher.<\/p>\n<h2>Start From the Business Transaction<\/h2>\n<p>A business-framed approach to health starts somewhere different: with the transactions the business actually cares about. Is checkout succeeding? Are payments completing? Is the core user journey working, right now, for real users? Business transaction success rate is the first-class signal, and everything else is supporting detail. It is the one number that, if wrong, means the system is failing regardless of how green the infrastructure looks.<\/p>\n<p>From there, latency stops being an abstract platform metric and becomes a user-experience commitment. Service level objectives should be defined by what the user actually experiences and what the business can tolerate, not by an infrastructure threshold chosen because it was easy to measure. A latency SLO tied to the user is a promise about the product. A latency threshold tied to the host is a fact about a machine, and the two are not the same promise.<\/p>\n<h2>Error Budgets Defined by Product Impact<\/h2>\n<p>The same reframing applies to error budgets. An error budget set by infrastructure thresholds treats all errors as equal, when they are not. An error in a background job that retries cleanly is not the same as an error in checkout, and a reliability model that cannot tell them apart will spend its attention in the wrong places. Defining error budgets by product impact aligns the organisation&#8217;s reliability effort with the things that actually matter to the business.<\/p>\n<p>Error budget is a piece of jargon worth unpacking, because the idea behind it is genuinely useful to a non-engineer. No system is perfect, so rather than pretend to aim for zero failures, you decide in advance how much unreliability is acceptable over a period, and that allowance is the budget. While budget remains, teams ship new features freely. When it runs out, they stop adding features and fix reliability instead. It turns an endless, unwinnable argument about how reliable is reliable enough into a simple rule with a number, and defining that number by business impact is what keeps the effort pointed at what actually matters.<\/p>\n<p>This changes the conversations as much as the tooling. When error budgets are framed by product impact, the decision about whether to ship or to stabilise becomes a business discussion with a shared language, rather than an argument between engineers about infrastructure metrics leadership cannot interpret. Reliability stops being an engineering preoccupation and becomes a business trade-off everyone can reason about.<\/p>\n<h2>The Three Layers, Read From the Top<\/h2>\n<p>It helps to see cloud-native observability as three layers: platform health, service health, and business health. Platform health is whether the clusters and infrastructure are sound. Service health is whether the individual services are meeting their technical targets. Business health is whether the application is delivering the outcomes the business depends on. All three matter, but most enterprises instrument from the bottom up and stop before they reach the top, which is exactly backwards.<\/p>\n<p>Reading from the top changes the priorities. Business health is the layer that tells you whether anything is actually wrong from the only perspective that pays the bills, and the lower layers exist to explain why. An organisation that starts at business health and drills down when it degrades will find problems before its customers do. One that starts at platform health and hopes it correlates with the experience will keep being surprised by outages its dashboards never saw.<\/p>\n<h2>Aligning Observability to the Business Changes Everything<\/h2>\n<p>Framing health around the business is not only a tooling choice, it is an organisational one. It changes what gets alerted on, what gets prioritised in an incident, and how reliability is discussed with leadership. It moves the reliability conversation onto ground the business shares, which is the only ground on which engineering and leadership can actually agree about what matters and what to do about it.<\/p>\n<p>The tooling decisions follow from the frame. Instrument the business transactions first, define SLOs and error budgets by user and product impact, and treat the infrastructure layers as the explanation rather than the headline. Done that way, observability stops being a wall of green that fails to predict outages and becomes an early-warning system aligned with the things the business would actually pay to protect.<\/p>\n<h2>Reliability the Business Can Actually Feel<\/h2>\n<p>The point of observability is not to prove the platform is running. It is to know, before your customers tell you, whether the application is doing its job. That requires starting from the business transaction, defining reliability in terms the user experiences, and budgeting errors by the damage they actually do. Green infrastructure that cannot see a failing checkout is not observability, it is decoration. The systems worth building are the ones that measure reliability the business can actually feel, because that is the only reliability the business was ever paying for.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Infrastructure metrics tell you the platform is running, not that the application is delivering value. Cloud-native health needs a business frame: transaction success, user-facing SLOs, and error budgets defined by product impact.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-48","post","type-post","status-publish","format-standard","hentry","category-architecture-observability"],"_links":{"self":[{"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/posts\/48","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/baecke.io\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=48"}],"version-history":[{"count":0,"href":"https:\/\/baecke.io\/index.php?rest_route=\/wp\/v2\/posts\/48\/revisions"}],"wp:attachment":[{"href":"https:\/\/baecke.io\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=48"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/baecke.io\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=48"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/baecke.io\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=48"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}