Thomas Step



re:Invent: Beyond Five 9s: Lessons From Our Highest Available Data Planes

This is an overview of a session that I attended during re:Invent 2021. I start with the notes I took during the session and then give my own take and comments, if I have any, at the end.

Monday 13:45


There is also an Amazon Builders’ Library series for this.

10 things to go over:

1. Insist on the Highest Standards

2. Cattle vs. Pets (How To Manage Systems)

3. Limit the Blast Radius

4. Circuit Breakers

5. Raising the Bar in Testing

6. Lifecycle Management (Credentials Management)

7. Modular Separation

8. Static Stability (I missed some of what this actually is)

9. (Principle of) Constant Work

10. Retries

My notes:

I am 100% on board with Cattle vs. Pets. I think naming things gets finicky and makes it more difficult to replicate certain resources. My preference is to let CloudFormation generate names for as many resources as it can. For resources that require a name, I try to include a unique string, like the CloudFormation Stack’s name, in the resource’s name.
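To make the naming approach concrete, here is a minimal sketch of a CloudFormation template built as a Python dict. The logical IDs (`JobQueue`, `OrdersBus`) and the `-orders` suffix are made up for illustration. The queue omits `QueueName` so CloudFormation generates one, while the event bus, which requires a `Name`, derives it from the stack name with `Fn::Sub`.

```python
import json


def build_template() -> dict:
    """Return a small CloudFormation template as a plain dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # No QueueName property: CloudFormation generates a unique name,
            # so the same template can be deployed many times side by side.
            "JobQueue": {
                "Type": "AWS::SQS::Queue",
            },
            # An event bus requires a Name, so derive it from the stack name
            # to keep it unique per stack. "-orders" is a hypothetical suffix.
            "OrdersBus": {
                "Type": "AWS::Events::EventBus",
                "Properties": {
                    "Name": {"Fn::Sub": "${AWS::StackName}-orders"},
                },
            },
        },
    }


if __name__ == "__main__":
    # Print the template as JSON, ready to save to a file and deploy.
    print(json.dumps(build_template(), indent=2))
```

Deploying the printed template as two different stacks should give two independent copies of both resources without any name collisions.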

In Limit the Blast Radius they mentioned cell-based architecture, which was brought up in several of my sessions. I have a feeling that this is starting to become an AWS best practice in place of simple regional or zonal isolation. For a completely serverless architecture, implementing something like this might not be too difficult to reason through, but I can only imagine the cost increase that a cell-based architecture would bring to a workload using compute resources like Fargate or EC2.
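As a rough illustration of the routing piece of a cell-based design, the sketch below pins each customer to one cell by hashing the customer ID. This is an assumption about how such a router could look, not something shown in the session. The cell names and the `cell_for` function are hypothetical.

```python
import hashlib
from typing import Sequence

# Hypothetical cell names; each cell would be a full, isolated copy of the stack.
CELLS = ("cell-0", "cell-1", "cell-2", "cell-3")


def cell_for(customer_id: str, cells: Sequence[str] = CELLS) -> str:
    """Map a customer to exactly one cell using a stable hash."""
    if not cells:
        raise ValueError("at least one cell is required")
    # sha256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]
```

A failure in one cell then only affects the customers mapped to it. Plain modulo hashing moves many customers when the number of cells changes, so a real router would more likely keep an explicit customer-to-cell mapping.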

The two Circuit Breakers that they brought up, load shedders and bullet counters, were both new to me. I find it interesting that neither one is an AWS service yet, unless one exists that I do not know about. They talked about both as if they would need to be custom-built.
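Since a load shedder apparently has to be custom-built, here is a minimal sketch of one, assuming the simplest possible policy: reject new requests once a fixed number are already in flight. The `LoadShedder` class, the `handle` helper, and the `"rejected"` return value are all hypothetical, and the presenters may have had a more sophisticated policy in mind.

```python
import threading
from typing import Callable


class LoadShedder:
    """Reject new work once max_in_flight requests are already running."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True and take a slot, or False if the server is full."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Give a slot back once the request has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder: LoadShedder, work: Callable[[], str]) -> str:
    """Run work if there is capacity, otherwise fail fast."""
    if not shedder.try_acquire():
        return "rejected"  # a real service might return HTTP 503 here
    try:
        return work()
    finally:
        shedder.release()
```

The point of failing fast is that the requests that are accepted keep their normal latency instead of every request slowing down together.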

I do not completely agree with Raising the Bar in Testing. Thousands of unit tests sound like a nightmare. Personally, I would have invested more of that effort into functional or integration tests.

Credential Lifecycle Management is somehow still such a large problem in computing and distributed systems. Why have we not figured out a better, more uniform way of rotating credentials? Maybe that is something I can look further into or work on.

The Constant Work section was confusing to me. I must have missed something. Everyone always talks about autoscaling and efficiency, so it seemed strange to me that they were suggesting purposefully writing less efficient code.
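One possible reading of constant work, sketched here as an assumption and not as what the presenters described, is that a system does the same amount of work on every cycle regardless of how much has changed. For example, it could push a full, fixed-size configuration snapshot each time in place of only the changes. The `full_snapshot` function and the `CAPACITY` value are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

CAPACITY = 4  # hypothetical fixed number of slots sent on every cycle

Slot = Optional[Tuple[str, str]]


def full_snapshot(rules: Dict[str, str], capacity: int = CAPACITY) -> List[Slot]:
    """Build a payload that is always `capacity` slots long.

    Unused slots are padded with None, so a quiet cycle and a busy cycle
    produce payloads of the same size and cost the same to process.
    """
    if len(rules) > capacity:
        raise ValueError("more rules than available slots")
    slots: List[Slot] = list(sorted(rules.items()))
    padding: List[Slot] = [None] * (capacity - len(slots))
    return slots + padding
```

Under this reading the extra work on quiet cycles is the price of predictability: a sudden burst of changes cannot make the system do more work than it already does every cycle.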

Categories: aws