Site Reliability Engineering by Betsy Beyer
Jan 28, 2017 · 8 minute read

A good and useful book. I will be rereading the chapters on monitoring and incidents, and more than once. But it has problems, just like the previous book, "How Google Tests Software": several chapters come across as outright advertising - look how cool our systems are (none of them are publicly available), how great we are at automating everything (without concrete examples), and so on. Or maybe I'm just envious.
Nevertheless, the book is worth it. By the way, it is now also available online: Site Reliability Engineering.
Notes on Google’s Site Reliability Engineering book - someone has taken the trouble to write a concise chapter-by-chapter summary.
Notes:
-
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.
-
Latency - the time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
-
Traffic - a measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
-
Errors - the rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
-
Saturation - how “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives?
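To make the four signals concrete, here is a minimal sketch of my own (not from the book) that derives latency, traffic, errors, and saturation from a batch of request records. The Request record, the one-second latency SLO, and the capacity numbers are illustrative assumptions only.

```python
# Hypothetical sketch: deriving the four golden signals from request records.
# The record format, the 1-second latency SLO, and the capacity numbers are
# illustrative assumptions, not anything prescribed by the book.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_s: float      # time taken to serve the request
    status: int           # protocol status code, e.g. HTTP 200 / 500
    correct: bool = True  # False models "HTTP 200 but wrong content"

LATENCY_SLO_S = 1.0  # policy: anything slower than 1 s counts as an error

def is_error(r: Request) -> bool:
    # Explicit (5xx), implicit (wrong content), or policy (too slow) failures.
    return r.status >= 500 or not r.correct or r.latency_s > LATENCY_SLO_S

def golden_signals(requests, window_s, used_capacity, total_capacity):
    ok = [r.latency_s for r in requests if not is_error(r)]
    bad = [r.latency_s for r in requests if is_error(r)]
    return {
        # Track success and error latency separately: a fast 500 would
        # otherwise drag the numbers down and hide real slowness.
        "latency_ok_p99_s": quantiles(ok, n=100)[98] if len(ok) > 1 else None,
        "latency_err_p99_s": quantiles(bad, n=100)[98] if len(bad) > 1 else None,
        "traffic_rps": len(requests) / window_s,
        "error_rate": len(bad) / len(requests) if requests else 0.0,
        # Saturation: utilization of the most constrained resource.
        "saturation": used_capacity / total_capacity,
    }

if __name__ == "__main__":
    reqs = [Request(0.05, 200), Request(0.02, 500), Request(1.3, 200),
            Request(0.04, 200, correct=False)] * 50
    print(golden_signals(reqs, window_s=60, used_capacity=70, total_capacity=100))
```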
-
Formally, we can think of the troubleshooting process as an application of the hypothetico-deductive method: given a set of observations about a system and a theoretical basis for understanding system behavior, we iteratively hypothesize potential causes for the failure and try to test those hypotheses. In an idealized model such as that in Figure 12-1, we’d start with a problem report telling us that something is wrong with the system. Then we can look at the system’s telemetry and logs to understand its current state. This information, combined with our knowledge of how the system is built, how it should operate, and its failure modes, enables us to identify some possible causes.
-
Common Pitfalls: Ineffective troubleshooting sessions are plagued by problems at the Triage, Examine, and Diagnose steps, often because of a lack of deep system understanding. The following are common pitfalls to avoid:
- Looking at symptoms that aren’t relevant or misunderstanding the meaning of system metrics. Wild goose chases often result.
- Misunderstanding how to change the system, its inputs, or its environment, so as to safely and effectively test hypotheses.
- Coming up with wildly improbable theories about what’s wrong, or latching on to causes of past problems, reasoning that since it happened once, it must be happening again.
- Hunting down spurious correlations that are actually coincidences or are correlated with shared causes.
-
Fixing the first and second common pitfalls is a matter of learning the system in question and becoming experienced with the common patterns used in distributed systems. The third trap is a set of logical fallacies that can be avoided by remembering that not all failures are equally probable—as doctors are taught, "when you hear hoofbeats, think of horses not zebras." Also remember that, all things being equal, we should prefer simpler explanations.
-
A Managed Incident: Now let’s examine how this incident might have played out if it were handled using principles of incident management. It’s 2 p.m., and Mary is into her third coffee of the day. The pager’s harsh tone surprises her, and she gulps the drink down. Problem: a datacenter has stopped serving traffic. She starts to investigate. Shortly afterward another alert fires, and the second of the five datacenters is out of order. Because this is a rapidly growing issue, she knows that she’ll benefit from the structure of her incident management framework. Mary snags Sabrina. “Can you take command?” Nodding her agreement, Sabrina quickly gets a rundown of what’s occurred thus far from Mary. She captures these details in an email that she sends to a prearranged mailing list. Sabrina recognizes that she can’t yet scope the impact of the incident, so she asks for Mary’s assessment. Mary responds, “Users have yet to be impacted; let’s just hope we don’t lose a third datacenter.” Sabrina records Mary’s response in a live incident document. When the third alert fires, Sabrina sees the alert among the debugging chatter on IRC and quickly follows up on the email thread with an update. The thread keeps VPs abreast of the high-level status without bogging them down in minutiae. Sabrina asks an external communications representative to start drafting user messaging. She then follows up with Mary to see if they should contact the developer on-call (currently Josephine). Receiving Mary’s approval, Sabrina loops in Josephine.
-
Best Practices for Incident Management:
- Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing.
- Prepare. Develop and document your incident management procedures in advance, in consultation with incident participants.
- Trust. Give full autonomy within the assigned role to all incident participants.
- Introspect. Pay attention to your emotional state while responding to an incident. If you start to feel panicky or overwhelmed, solicit more support.
- Consider alternatives. Periodically consider your options and re-evaluate whether it still makes sense to continue what you’re doing or whether you should be taking another tack in incident response.
- Practice. Use the process routinely so it becomes second nature.
- Change it around. Were you incident commander last time? Take on a different role this time. Encourage every team member to acquire familiarity with each role.
-
Simple Round Robin: One very simple approach to load balancing has each client send requests in round-robin fashion to each backend task in its subset to which it can successfully connect and which isn’t in lame duck state. For many years, this was our most common approach, and it’s still used by many services. Unfortunately, while Round Robin has the advantage of being very simple and performing significantly better than just selecting backend tasks randomly, the results of this policy can be very poor. While actual numbers depend on many factors, such as varying query cost and machine diversity, we’ve found that Round Robin can result in a spread of up to 2x in CPU consumption from the least to the most loaded task. Such a spread is extremely wasteful and occurs for a number of reasons, including:
- Small subsetting
- Varying query costs
- Machine diversity
- Unpredictable performance factors
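A tiny sketch of what such client-side Round Robin might look like (an illustration, not Google’s implementation): the client cycles through its subset of backend tasks, skipping any that are unreachable or in lame-duck state. The Backend type and its flags are hypothetical.

```python
# Hypothetical sketch of client-side round-robin over a backend subset.
# The Backend type and its connected/lame_duck flags are illustrative only.
from dataclasses import dataclass

@dataclass
class Backend:
    address: str
    connected: bool = True   # client can successfully connect
    lame_duck: bool = False  # backend asked not to receive new requests

class RoundRobinPicker:
    def __init__(self, subset):
        self.subset = list(subset)  # this client's subset of backend tasks
        self.next_index = 0

    def pick(self):
        # Walk at most once around the ring, skipping unusable tasks.
        for _ in range(len(self.subset)):
            backend = self.subset[self.next_index]
            self.next_index = (self.next_index + 1) % len(self.subset)
            if backend.connected and not backend.lame_duck:
                return backend
        raise RuntimeError("no healthy backend in subset")

picker = RoundRobinPicker([Backend("10.0.0.1:80"),
                           Backend("10.0.0.2:80", lame_duck=True),
                           Backend("10.0.0.3:80")])
print([picker.pick().address for _ in range(4)])
# -> ['10.0.0.1:80', '10.0.0.3:80', '10.0.0.1:80', '10.0.0.3:80']
```

Note how the round-robin pointer advances regardless of query cost, which is exactly why the load spread the excerpt mentions can appear.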
-
Paxos Overview: An Example Protocol. Paxos operates as a sequence of proposals, which may or may not be accepted by a majority of the processes in the system. If a proposal isn’t accepted, it fails. Each proposal has a sequence number, which imposes a strict ordering on all of the operations in the system. In the first phase of the protocol, the proposer sends a sequence number to the acceptors. Each acceptor will agree to accept the proposal only if it has not yet seen a proposal with a higher sequence number. Proposers can try again with a higher sequence number if necessary. Proposers must use unique sequence numbers (drawing from disjoint sets, or incorporating their hostname into the sequence number, for instance). If a proposer receives agreement from a majority of the acceptors, it can commit the proposal by sending a commit message with a value.
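Below is a deliberately simplified, in-memory sketch of just the behavior this excerpt describes (a single proposer, no persistence, failures, or competing proposals): acceptors promise only for sequence numbers higher than any they have seen, and a proposer that gathers a majority of promises commits a value. It is illustrative, not a complete Paxos implementation.

```python
# Heavily simplified illustration of the Paxos phases described above:
# one proposer, in-memory acceptors, no failure handling or value recovery.
class Acceptor:
    def __init__(self):
        self.highest_promised = -1  # highest sequence number seen so far
        self.accepted = None        # (sequence_number, value) once committed

    def prepare(self, seq):
        # Phase 1: promise only if we haven't seen a higher sequence number.
        if seq > self.highest_promised:
            self.highest_promised = seq
            return True
        return False

    def commit(self, seq, value):
        # Phase 2: accept the value if the promise still stands.
        if seq >= self.highest_promised:
            self.accepted = (seq, value)
            return True
        return False

def propose(acceptors, seq, value):
    # Proposers must use unique sequence numbers (e.g. drawn from
    # disjoint sets, or derived from their hostname).
    promises = [a for a in acceptors if a.prepare(seq)]
    if len(promises) <= len(acceptors) // 2:
        return False  # no majority: the proposal fails; retry with a higher seq
    for a in promises:
        a.commit(seq, value)
    return True

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, seq=1, value="lock-holder=task-42"))  # True
print(propose(acceptors, seq=1, value="lock-holder=task-7"))   # False: seq reused
```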
-
First Layer: Soft Deletion; Second Layer: Backups and Their Related Recovery Methods; Overarching Layer: Replication
-
Polarizing time means that when a person comes into work each day, they should know if they’re doing just project work or just interrupts. Polarizing their time in this way means they get to concentrate for longer periods of time on the task at hand. They don’t get stressed out because they’re being roped into tasks that drag them away from the work they’re supposed to be doing.
-
If you currently assign tickets randomly to victims on your team, stop. Doing so is extremely disrespectful of your team’s time, and works completely counter to the principle of not being interruptible as much as possible. Tickets should be a full-time role, for an amount of time that’s manageable for a person. If you happen to be in the unenviable position of having more tickets than can be closed by the primary and secondary on-call engineers combined, then structure your ticket rotation to have two people handling tickets at any given time. Don’t spread the load across the entire team. People are not machines, and you’re just causing context switches that impact valuable flow time.