I ran into the AWS S3 outage today via Travis CI. The broadcast messages showed that some application providers could do nothing but pray, post on Twitter, and wait for AWS to restore S3 in the affected region (most failures were in US-East, North Virginia). Quora even had its core stack down, so its content was inaccessible for a while. As a content provider, one would hope that at least authentication and reads of hot/warm data could be kept alive with appropriate techniques such as static serving, CDNs, a hybrid mode for warm/hot data, etc. The reality was cruel. On the other hand, this brings us back to the Netflix Simian Army: based on public reports of their chaos testing, most people were confident Netflix would not be impacted.
S3 is not just a storage service; it also underpins many other well-known AWS services. That could explain why the AWS status dashboard was still showing green during the first couple of hours: the dashboard itself reportedly depended on S3. It is strange to build a watchdog on top of the very system it is watching. However, that is what happened.
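To make the watchdog point concrete, here is a minimal sketch of an out-of-band health probe: it is meant to run on infrastructure independent of the system it monitors, so a cloud-wide outage cannot also silence the alarm. The function and endpoint names are illustrative, not any real monitoring API; the opener is injectable only so the sketch can be exercised without a network.

```python
import urllib.request
import urllib.error

def probe(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> str:
    """Return 'green' if the endpoint answers with a 2xx status, else 'red'.

    Run this from a host OUTSIDE the monitored infrastructure; `opener`
    is injectable so the probe can be tested without real network calls.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return "green" if 200 <= resp.status < 300 else "red"
    except (urllib.error.URLError, OSError):
        return "red"
```

The point is the deployment location, not the code: the same probe hosted on the system under watch would have shown exactly the misleading green light described above.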
Travis CI failed to fetch build and test logs from the cloud for a while in the morning. Only the US-East region seemed to be impacted; my public resume and logbooks hosted in US-West still work fine. One note on naming: region names such as us-east-1 are fixed and identical for every account, but the availability-zone suffixes within a region (us-east-1a, us-east-1b, ...) are mapped to physical zones independently per account, so "zone a" for one developer may be "zone b" for another. That mapping does not change for a given account.
Detailed information can be found in the Travis CI broadcast updates.
The core stack (authentication, basic views, the dashboard, etc.) should stay alive even in a severe disaster. It is an essential element of the service, regardless of the techniques or serving bandwidth behind it. For a content provider, CDN caches should be closely monitored.
If a vertical module is identified as part of the core stack, it should run as a separate service. Even if I cannot write on Quora, search, or see the recommended streams, I still want a portal that offers a primitive channel to access the content.
As mentioned in the section above, the core UX parts, such as user login and authentication, dashboard rendering (even if the charts show red alerts), and reads of hot/warm data, are better cached on a static channel such as a CDN. I have no work experience at an internet company, so this section is only my best guess. Since the front-end development model is now sophisticated and strong enough, we can try to make resources as static as possible. Below are techniques we can grasp at first sight:
- Memcached and Redis in front of the database;
- Squid or Nginx (with plugins) as a caching layer;
- A CDN with load-balanced, geographically separated edge nodes (e.g. CloudFront);
- DNS prefetching, and a plan B for primary-domain failure;
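The first two items can be summarized by the cache-aside pattern. Below is a minimal, self-contained sketch under the assumption that a plain dict stands in for Redis/memcached and a callable stands in for the database or S3; the class and parameter names are illustrative. The twist relevant to this outage is the last branch: when the origin is unreachable, serve stale data rather than fail.

```python
import time

class CacheAside:
    """Cache-aside sketch: serve fresh cache hits, refresh from the
    origin store on a miss, and fall back to stale data when the
    origin (e.g. the DB or S3) is down.  A dict stands in for
    Redis/memcached to keep the example self-contained."""

    def __init__(self, store, ttl=60.0):
        self.store = store        # callable: key -> value; may raise on outage
        self.ttl = ttl            # seconds a cached value counts as fresh
        self.cache = {}           # key -> (value, fetched_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]                        # fresh cache hit
        try:
            value = self.store(key)                # refresh from origin
            self.cache[key] = (value, time.time())
            return value
        except Exception:
            if entry:                              # origin down: serve stale
                return entry[0]
            raise                                  # nothing cached: propagate
```

With this shape, an S3 outage degrades reads to "possibly stale" instead of "unavailable", which is exactly the behavior the core stack needs for hot/warm data.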
Further considerations include the traditional concerns around cache design patterns and the trade-off between consistency (CAP) and latency.
In some cases, we can design an alternative UX that activates automatically when a failure or outage of the dynamic solution is detected.
This gives end users and operations stakeholders a clear signal that the service is still up, with only some parts under recovery.
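A minimal sketch of that alternative UX, assuming a callable that renders the dynamic page and a pre-built static shell (both names are hypothetical): on any failure, the handler serves the static fallback with a recovery banner instead of an error page.

```python
def render_page(fetch_dynamic, static_fallback):
    """Serve the dynamic page when possible; on failure, fall back to a
    cached static shell plus a recovery banner, so the user sees the
    service is alive even while part of it recovers."""
    try:
        return fetch_dynamic()
    except Exception:
        banner = "<div class='alert'>Some features are recovering.</div>"
        return banner + static_fallback
```

In a real deployment the fallback would be a pre-rendered page pushed to the CDN, so it stays reachable even when the application tier itself is down.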
Since the core stack is separated, errors from the infrastructure should be captured and handled appropriately: an information page, a short message, or just a pager call to the operations staff, while avoiding a collapse of the whole application like some of the famous services we saw during this incident.
In short, the goal is to localize errors as much as possible.
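One simple way to localize errors is to wrap each non-core module so that its failure degrades only its own feature. The decorator below is an illustrative sketch, not a real library; the alert hook stands in for the pager call mentioned above.

```python
import functools
import logging

def localize(fallback, alert=logging.error):
    """Decorator sketch: contain a module's failure inside that module.
    On error it alerts operations (a stand-in for a pager call) and
    returns a fallback value instead of letting the exception take
    down the whole request."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                alert("module %s failed: %s", fn.__name__, exc)
                return fallback            # degrade locally, not globally
        return inner
    return wrap
```

For example, a `recommendations` panel wrapped with `@localize(fallback=[])` would simply render empty during an S3 outage, while login and the dashboard keep working.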
The Simian Army, described in a Netflix paper, is a means of creating realistic chaos within the staging system while the actual infrastructure is still working fine. For example, "Chaos Gorilla" is the tool that lets Dev and Ops test their mitigation plan for the outage of an entire AWS availability zone.
It randomly alters the host configuration of your application, so the product team can test against, and learn to live with, disasters from the infrastructure. With first-hand "realistic" data, self-healing or graceful recovery can be designed and integrated.
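The idea can be shown in a few lines. This is a sketch in the spirit of the Simian Army, not Netflix's actual tooling: a wrapper that makes a dependency call fail at a configurable rate in staging, forcing the team to exercise its recovery paths before a real outage does. The random source is injectable only so the sketch is testable.

```python
import random

def chaos(fn, failure_rate=0.1, rng=random.random):
    """Chaos-injection sketch: wrap a dependency call so that it
    randomly fails with probability `failure_rate`, simulating an
    infrastructure outage in the staging environment."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("chaos: simulated dependency outage")
        return fn(*args, **kwargs)
    return wrapped
```

Wrapping, say, the S3 client with `chaos(..., failure_rate=0.05)` in staging quickly reveals which pages collapse entirely and which degrade gracefully.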
The paper was drafted in 2011, and it is still worth reading every time we see a batch of internet services go down during an AWS flu.
Live with Murphy’s law, live with test.
In other words, as mentioned in an interview on the topic of CloudFormation, the first thing to consider about a CloudFormation-based stack is portability: whether it supports migrating your service to another provider, either as a standby cloud deployment or as a fast-growing plan B. I would rather build an ORM-style internal stack that works like CloudFormation but keeps an open interface, just in case. There should also be a deliberately designed, open-interfaced "DataFormat" to keep warm data duplicated.
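The "ORM-style internal stack" can be made concrete with a provider-agnostic storage interface. This is a sketch under my own naming (BlobStore, MirroredStore, etc. are illustrative, not a real library): application code targets the abstract API, a local or alternative-cloud backend implements the same API, and a mirrored wrapper keeps warm data duplicated, which is the "DataFormat" idea above.

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Provider-agnostic storage interface: application code depends on
    this API, so S3 can be swapped for another backend in a plan B."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(BlobStore):
    """Local backend; an S3- or GCS-backed class would implement the
    same two methods."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

class MirroredStore(BlobStore):
    """Writes go to a primary and a replica, keeping warm data
    duplicated across providers; reads fall back to the replica
    when the primary is unavailable."""
    def __init__(self, primary: BlobStore, replica: BlobStore):
        self.primary, self.replica = primary, replica
    def put(self, key, data):
        self.primary.put(key, data)
        self.replica.put(key, data)
    def get(self, key):
        try:
            return self.primary.get(key)
        except Exception:
            return self.replica.get(key)
```

The design choice is the same one an ORM makes for databases: pay a small abstraction cost every day so that a provider outage becomes a configuration change instead of a rewrite.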
The system is rented, but the design principles should treat the AWS infrastructure with an on-premises mindset. The AWS services you utilize become part of your system, not just a rental; the SDLC should still cover them from the architecture phase onward.
As applications migrate from business lines to end-user lines, from BS/CS architectures to the cloud, from cadenced CI to CD and even DevOps, the overall architecture and the systematic, whole-system thinking of the on-premises days is no longer popular. I remember the days at Lucent Technologies writing and reviewing system specifications for a security feature meant to support a series of follow-on candidate features. We drafted the ideas, built prototypes to test them, verified them with modeling tools, and viewed the system as a whole, top to bottom, and state by state within each layer.
2017-03-02, Max, added Simian Army section.