SRE (Site Reliability Engineering) Notes
- SLIs / SLOs / Error Budgets: Measure what matters (e.g., latency, availability) and use error budgets to balance innovation against reliability; ship new features only while error budget remains.
- Commit to clear promises that set service objectives, expectations, and levels.
- Assess those promises continuously, with metrics and budgetary limits.
- Toil Reduction: Automate repetitive ops work; aim for <50% of team time on manual tasks.
- Production Practices: Incident response, postmortems, and capacity planning as engineering disciplines.
- React quickly to keep and repair promises, be on-call, and guard autonomy to avoid new gatekeepers.
- On-Call and Automation: SREs code their way out of ops; leverage your infra skills for chaos engineering or canarying.
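The error-budget idea in the first bullet can be sketched as simple arithmetic. This is a minimal illustration, assuming a request-based availability SLI; the function name `error_budget` and the returned field names are hypothetical, not taken from the book or any library.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for one SLO window.

    slo is the target success ratio as a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("invalid request counts")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "sli": 1.0 - failed_requests / total_requests,  # measured success ratio
        "allowed_failures": allowed_failures,
        "budget_remaining": remaining,
        # Assumed policy: ship features only while some budget is left.
        "can_release": remaining > 0,
    }


if __name__ == "__main__":
    report = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=400)
    print(report["can_release"], round(report["budget_remaining"]))
```

With a 99.9% SLO over one million requests, roughly 1,000 failures are tolerated, so 400 failures leave budget to spend on releases, while 1,500 would exhaust it.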
SRE principles - 1st book
embracing risk (Chapter 3)
>> Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.
>> when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs.
>> In a sense, we view the availability target as both a minimum and a maximum. The key advantage of this framing is that it unlocks explicit, thoughtful risk-taking.
>> How can we use the service cost to help locate a service on the risk continuum?
>> Risk Tolerance of Services
- Identify the Risk Tolerance of Consumer Services
- Target level of availability
- What level of service will the users expect?
- Does this service tie directly to revenue (either our revenue, or our customers’ revenue)?
- Is this a paid service, or is it free?
- If there are competitors in the marketplace, what level of service do those competitors provide?
- Is this service targeted at consumers, or at enterprises?
- Types of failures
- Cost
- If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be?
- e.g. availability target: 99.9% → 99.99%; increase in availability: 0.09%; revenue: $1M; value of improved availability: $1M * 0.0009 = $900
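- If there is no simple translation function between reliability and revenue, one strategy is to consider the background error rate of the network (e.g., 0.1% packet loss): there is no value in driving the service's error rate below the background error rate, because those errors are lost in the noise of the user's own connection.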
- Does this additional revenue offset the cost of reaching that level of reliability?
- service level objectives (Chapter 4)
- eliminating toil (Chapter 5)
DevOps (Chapter 6)
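The cost questions under "embracing risk" can be sketched as a small calculation that reproduces the $900 example from the notes. The function names and the `background_error_rate` parameter are hypothetical, and the rule that a target below the background error rate adds no value is a simplification of the heuristic noted above, not a formula from the book.

```python
def value_of_added_availability(current: float, proposed: float, revenue: float) -> float:
    """Revenue attributable to the extra availability (targets given as fractions)."""
    if not 0.0 <= current <= proposed <= 1.0:
        raise ValueError("expected 0 <= current <= proposed <= 1")
    return (proposed - current) * revenue


def is_worth_it(current: float, proposed: float, revenue: float,
                cost: float, background_error_rate: float = 0.0) -> bool:
    """Assumed decision rule: invest only if the added value exceeds the cost
    and the proposed error rate is not already below the background noise."""
    proposed_error_rate = 1.0 - proposed
    if proposed_error_rate < background_error_rate:
        # Users cannot tell the difference, so the extra nine buys nothing.
        return False
    return value_of_added_availability(current, proposed, revenue) > cost


if __name__ == "__main__":
    # Example from the notes: 99.9% -> 99.99% on $1M of revenue.
    value = value_of_added_availability(0.999, 0.9999, 1_000_000)
    print(f"value of improved availability: ${value:,.0f}")
    print(is_worth_it(0.999, 0.9999, 1_000_000, cost=500))
    print(is_worth_it(0.999, 0.9999, 1_000_000, cost=500, background_error_rate=0.001))
```

Under these assumptions the extra nine pays off only if it costs less than about $900, and not at all when the network's background error rate (0.1% here) already exceeds the proposed error rate of 0.01%.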
Reading
- Book "Thinking in Promises"
>> The goal of Promise Theory is to reveal the behavior of a whole from the sum of its parts, taking the viewpoint of the parts rather than the whole.
>> A conditional promise cannot be assessed unless the assessor also sees that the condition itself is promised.
>> A conditional promise is not a promise unless the condition itself is also promised.
>> Promise Theory makes a simple prediction about services, which is possibly counterintuitive. It tells us that the responsibility for getting service ultimately lies with the client, not the server.
