Part VI – Start at the beginning with Part I
While the business was busy stuffing their sorrys in a sack, our team was having some fun.
From July of 2010 to April of 2011 my role on the team was Architect, and it was probably one of the most productive stints I can remember. During that time I learned Infrastructure as Code by building servers for Artifactory, Confluence, Jira, Crowd, and a number of other products using Chef. There we were, four or five of us all learning Chef at the same time, reading the insanely poor documentation on the Chef site and trying to get something working. Chef is similar to Grails in my head: too many magic mushrooms growing everywhere. If something doesn’t work, it might be your code, but it might be some unknown configuration that you missed. However, once you know where the magic starts and ends, they can both be quite useful.
I spent days on end building and stripping down infrastructure in the AWS cloud. We learned about Availability Zones and Regions, and how to operate in multiple locations at once. We watched AWS go down two or three times in that period but managed to weather all those outages with a little luck and forethought. We talked to vendors and startups building tools for clouds. We talked to other consumer enterprises building high-scale websites for customers. We spent a lot of time reading High Scalability and the first edition of The Art of Scalability. We wrote up comparisons between Riak, Cassandra, MongoDB and HBase. We tried to decide what might work best for a new distributed item catalog. We guessed, and hoped we didn’t end up like the guy who picked Cassandra for Digg in 2010.
For better or worse, the Digg disaster and the good relationship we struck up with Basho led us to choose Riak for our first NoSQL system in late 2010. We had great collaboration with the Basho engineers. We were helping them find the bugs in their system, but the underlying technology was rock solid. In six years Riak never failed us; the only times we had problems were completely self-inflicted.
In the end we had numerous systems operating in AWS. The first was the failover site mentioned in Part I. If TWLER.com went down, we would switch over to the browse-only site in a few minutes. We got to exercise this capability more than once. The second was the build infrastructure: our Atlassian suite, Artifactory and Jenkins were all cloud deployed. What we learned running production systems in the AWS cloud gave us the confidence to push towards a whole new architecture for TWLER.com.
GOTO Part VII