MFS outages at large services
Google App Engine Incident #19007
Amazon SimpleDB Service Disruption
Cassandra Overload because of hint pressure + MVs
The constant work principle
Reliability, constant work, and a good cup of coffee
How to handle retries
Fixing retries with token buckets and circuit breakers
What is Backoff For?
Useful articles on MFS
Metastable Failures in Distributed Systems
Metastable Failures in the Wild
Metastability and Distributed Systems