Building Secure and Reliable Systems, Heather Adkins

Feb 20, 2020 · 3 minute read

A great book, just like the previous two. As before, the practices it describes mostly apply to fairly large companies. All in all, an excellent example of "how it should be done". You read it and feel a little sad.

Notes:

To quote an SRE maxim: hope is not a strategy.
The shared infrastructure is a natural place to provide shared defenses. Edge routers can throttle high-bandwidth attacks, protecting the backbone network. Network load balancers can throttle packet-flooding attacks to protect the application load balancers. Application load balancers can throttle application-specific attacks before the traffic reaches service frontends. Layering defenses tends to be cost-effective, since you only need to capacity-plan inner layers for styles of DoS attacks that can breach the defenses of outer layers.
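A rough way to see why layering is cost-effective is to model each layer as a simple cap on the traffic it passes inward. The sketch below is a toy model, not anything from the book: the `Layer` class, the `traffic_seen_by_each_layer` helper, and all the request rates are made-up assumptions, and real layers filter different kinds of attacks rather than applying one flat limit.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    limit_rps: int  # hypothetical cap on requests/sec this layer lets through


def traffic_seen_by_each_layer(offered_rps, layers):
    """Walk from the outermost layer inward.

    Returns a list of (layer name, requests/sec arriving at that layer)
    and the rate that finally reaches the service frontends.
    """
    arriving = offered_rps
    seen = []
    for layer in layers:
        seen.append((layer.name, arriving))
        # A layer never passes more than its own limit to the next one.
        arriving = min(arriving, layer.limit_rps)
    return seen, arriving


# Illustrative numbers only.
LAYERS = [
    Layer("edge router", 1_000_000),
    Layer("network load balancer", 100_000),
    Layer("application load balancer", 10_000),
]

seen, at_frontend = traffic_seen_by_each_layer(5_000_000, LAYERS)
for name, rps in seen:
    print(f"{name}: sees up to {rps} rps")
print(f"service frontends: see up to {at_frontend} rps")
```

In this model each inner layer only has to be sized for what the layer in front of it lets through, which is the capacity-planning point the quote makes.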
Providing a highly secure, reliable, and consistent software supply chain will likely require you to make many changes — from scripting your build steps, to implementing build provenance, to implementing configuration-as-code. Coordinating all of those changes may be difficult. Bugs or missing functionality in these controls can also pose a significant risk to engineering productivity. In the worst-case scenario, an error in these controls can potentially cause an outage for your service. You may be more successful if you focus on securing one particular aspect of the supply chain at a time.
Consider the very rare issue of memory corruption by bit flip. A modern error-correcting memory module has a less than 1% chance per year of encountering an uncorrectable bit flip that can crash a system. An engineer debugging an unexpected crash probably won’t think, “I bet this was caused by an extremely unlikely electrical malfunction in the memory chips!” However, at very large scale, these rarities become certainties. A hypothetical cloud service utilizing 25,000 machines might use memory across 400,000 RAM chips. Given the odds of a 0.1% yearly risk of uncorrectable errors per chip, the scale of the service could lead to 400 occurrences annually. People running the cloud service will likely observe a memory failure every day.
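The arithmetic behind that quote is easy to check. The sketch below uses the figures from the quote (400,000 chips, 0.1% yearly risk per chip). The final step is my own addition and assumes failures are independent and spread evenly over the year, so it treats the daily count as roughly Poisson.

```python
import math

chips = 400_000                 # from the quote
annual_risk_per_chip = 0.001    # 0.1% per chip per year, from the quote

# Expected number of uncorrectable errors across the fleet.
expected_per_year = chips * annual_risk_per_chip
expected_per_day = expected_per_year / 365

# Assumption: independent failures at a constant rate, so the number of
# failures per day is approximately Poisson with mean expected_per_day.
p_at_least_one_today = 1 - math.exp(-expected_per_day)

print(f"expected failures per year: {expected_per_year:.0f}")
print(f"expected failures per day:  {expected_per_day:.2f}")
print(f"chance of at least one failure on a given day: {p_at_least_one_today:.2f}")
```

That works out to about 400 failures a year, or a little over one per day on average. Under the independence assumption, roughly two days out of three see at least one failure, so "every day" holds as an average rather than as a guarantee.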
One system we worked on, used for phone support, allowed administrators to impersonate a user and to view the UI from their perspective. As a debugger, this system was wonderful; you could clearly and quickly reproduce a user’s problem. However, this type of system provides possibilities for abuse. Debugging endpoints — from impersonation to raw database access — need to be secured.
For many incidents, debugging unusual system behavior need not require access to user data. For example, when diagnosing TCP traffic problems, the speed and quality of bytes on the wire is often enough to diagnose issues. Encrypting data in transit can protect it from any possible attempt by third parties to observe it. This has the fortunate side effect of allowing more engineers access to packet dumps when needed. However, one possible mistake is to treat metadata as nonsensitive. A malicious actor can still learn a lot about a user from metadata by tracking correlated access patterns — for instance, by noting the same user accessing a divorce lawyer and a dating site in the same session. You should carefully assess the risks from treating metadata as nonsensitive.
Many excellent books cover the topic of communications thoroughly. For a deeper understanding of communications, we recommend Nick Morgan’s Can You Hear Me? (Harvard Business Review Press, 2018) and Alan Alda’s If I Understood You, Would I Have This Look on My Face?