Chaos Engineering
“- Simulating the failure of an entire region or datacenter.
- Partially deleting Kafka topics over a variety of instances to recreate an issue that occurred in production.
- Injecting latency between services for a select percentage of traffic over a predetermined period of time.
- Function-based chaos (runtime injection): Randomly causing functions to throw exceptions.
- Code insertion: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions.
- Time travel: Forcing system clocks out of sync with each other.
- Executing a routine in driver code emulating I/O errors.
- Maxing out CPU cores on an Elasticsearch cluster.” (Source: https://www.oreilly.com/library/view/chaos-engineering/9781491988459/ch01.html)
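Two of the examples above (latency injected for a percentage of traffic, and functions that randomly throw exceptions) can be sketched in a few lines. This is a minimal illustration, not how Gremlin or any other tool implements it; the names `chaos`, `InjectedFault`, and `fetch_profile` are hypothetical and exist only for this sketch.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Raised by the chaos wrapper (hypothetical name for this sketch)."""


def chaos(failure_rate=0.0, latency_rate=0.0, latency_seconds=0.0,
          rng=None, sleep=time.sleep):
    """Wrap a function so a fraction of calls is delayed and/or fails.

    failure_rate and latency_rate are probabilities in [0, 1].
    rng and sleep are injectable so experiments can be made repeatable.
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Delay a select percentage of calls.
            if rng.random() < latency_rate:
                sleep(latency_seconds)
            # Make a select percentage of calls throw before the real work runs.
            if rng.random() < failure_rate:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Example: roughly 10% of calls fail and roughly 25% gain 200 ms of latency.
@chaos(failure_rate=0.1, latency_rate=0.25, latency_seconds=0.2)
def fetch_profile(user_id):
    return {"id": user_id}
```

Passing a seeded `random.Random` as `rng` makes a run reproducible, which helps when recreating a specific production issue.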
Gremlin
Build your own telemetry to supplement the metrics Gremlin provides
Resource
CPU
Memory
IO
Disk
State
Shutdown
Time Travel
Process Killer
Network
Blackhole: “Black-hole attacks occur when a router deletes all messages it is supposed to forward. From time to time, a router is misconfigured to offer a zero-cost route to every destination in the Internet. This causes all traffic to be sent to this router. Since no device can sustain such a load, the router fails.”
Latency
Packet Loss
DNS
Resiliency Testing
Failure Modes & Effect Analysis (FMEA)
Build Scenarios
Game Days
Practical Scenarios
Failure Modes | Test Mechanism(s) | Expected Results
Table-Top Exercises
Theoretical Scenarios
Failure Modes | Explore Hypothetically | Capture Preparedness
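One way to keep the two scenario formats above consistent is to record each scenario as a small structured entry. The sketch below assumes one record type per format, with fields taken from the column names in these notes; the class names and the sample entries are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GameDayScenario:
    """Practical scenario: a failure mode that is actually exercised."""
    failure_mode: str
    test_mechanisms: List[str]
    expected_results: str

    def as_row(self) -> str:
        return " | ".join(
            [self.failure_mode, ", ".join(self.test_mechanisms), self.expected_results]
        )


@dataclass
class TableTopScenario:
    """Theoretical scenario: a failure mode that is only talked through."""
    failure_mode: str
    hypothetical_exploration: str
    preparedness_notes: str

    def as_row(self) -> str:
        return " | ".join(
            [self.failure_mode, self.hypothetical_exploration, self.preparedness_notes]
        )


# Illustrative entries only; real scenarios come out of the FMEA.
game_day = GameDayScenario(
    failure_mode="DNS unavailable",
    test_mechanisms=["DNS attack", "Latency attack"],
    expected_results="Clients fall back to cached addresses",
)
table_top = TableTopScenario(
    failure_mode="Entire region lost",
    hypothetical_exploration="Walk through failover step by step",
    preparedness_notes="Runbook exists but has never been rehearsed",
)

print(game_day.as_row())
print(table_top.as_row())
```

Keeping both formats as rows with the same leading column makes it easy to promote a table-top scenario to a game day once a test mechanism exists for it.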