A Digital Ecommerce Transformation – 2012, The First Cloud Holiday – Part XXII

This will be the last entry in this series, at least for the foreseeable future. Writing all this out has finally put this era of life firmly in the past, even though there were three more years of work to deliver an infinitely scalable cloud-based ecommerce system. This series really only scratched the surface of the amount of work, and of the amazing team, that delivered mobile and browser-based ecommerce to TWLER during a time of incredible hardship for the company. New CEOs, falling revenue, and insanely tough competition left many thinking TWLER was going to wind up as another retailer on the trash heap of history. But steady 30% growth online and an ecommerce platform that allowed for intra-day changes to the system gave the business teams the tools they needed to drive growth and make up for lost revenue in the stores. If you look at the revenue numbers for 2012-2014, store revenue declined, online revenue increased, and it all more or less evened out. Now on to the story.

There’s a running joke at TWLER that this Holiday is the most important Holiday ever! It’s funny because it is actually true: if you don’t deliver at Holiday, you go into oblivion like other failed electronics retailers such as Circuit City and HH Gregg.

We went into early November scorching hot. The cloud home page had ramped up through the summer to take 100% of traffic, and cloud PDPs were still ramping up as we went into November. We had adopted the strategy of shunting small amounts of traffic to our new systems, then turning up the dial with global load balancing as we learned how to operate them. Remember, everything was new. We were operating in AWS for the first time, and we had built all new UIs, controllers, caches, data services, and content management systems from scratch in about 10 months. We were also beating the hell out of Akamai as we load tested our systems and determined what we would cache in the CDN. Given how quickly inventory and pricing could change, we had to choose carefully what was cached in Akamai.
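
To make the "turning up the dial" idea concrete, here is a minimal Python sketch of a percentage-based traffic split. It is an illustration only, not how TWLER actually did it: the real ramp was done with global load balancing, while this sketch assumes a simple per-session hash. The names `route` and `cloud_share` and the `session-N` IDs are hypothetical.

```python
import hashlib


def route(session_id: str, cloud_percent: int) -> str:
    """Pick 'cloud' or 'legacy' for a session at the current dial setting (0-100).

    Assumption: sessions are bucketed by a stable hash, so a given session
    stays on the same side unless the dial moves past its bucket.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "cloud" if bucket < cloud_percent else "legacy"


def cloud_share(cloud_percent: int, sessions: int = 10_000) -> float:
    """Fraction of synthetic sessions that land on the cloud side."""
    hits = sum(
        route(f"session-{i}", cloud_percent) == "cloud" for i in range(sessions)
    )
    return hits / sessions
```

Because the bucket is derived from the session ID, raising the dial from 10 to 50 only moves additional sessions to the new system; nobody who was already on the cloud side gets bounced back to the old one.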

We were running three load tests a week trying to get the entire system up to a point that was 50% higher in traffic and transactions than the previous year. We figured 50% was a fairly safe bet, considering historical traffic trends showed that we would likely only get a 20% traffic increase. The only place to truly stress test for scale was in production, so we would run the tests starting after midnight and ramp them up to full scale by 3AM. Obviously that meant a lot of late nights for the testing teams and many of the development teams as problems were uncovered at higher and higher scale.

In the second week of November, the marketing team decided that the current pricing signage wasn’t good enough. It was going to be a highly promotional holiday, and we needed everyone to know our pricing was as low as anyone’s. The team wanted to change all price messaging to “Guaranteed lowest price”, replacing the existing wording, with info on how we would price match any major competitor. The UI team estimated that in the old system it would take 3 months to make those changes; in the new cloud home page and PDPs, it took us two hours, most of it testing. This, more than anything else, guaranteed that we would go into holiday with the cloud site taking full load, even though business teams still weren’t completely bought in to the new world.

By the week of Thanksgiving, we were seeing good load tests and meeting our estimated peak loads for the Black Friday sale, which started at 2AM on Thanksgiving morning. Everyone knew when the sale was supposed to drop, and most shoppers were willing to wait up until 2AM central time, buy the limited inventory items, and then go to bed.

As 2AM rolled around, we felt like we were ready. We estimated we would take 60% of the load in AWS for the home page and PDPs, and the remaining load for search and checkout would go to the ATG clusters in the datacenter. Caches were warmed, systems were scaled out to meet the load, and teams were in place to monitor everything that happened.

When the sale dropped, traffic ramped up instantly and kept rising minute over minute. We blew past the estimated 50% increase in traffic and approached a 150% increase in peak traffic. We figured out that the old systems simply couldn’t scale to demand, so in previous years the peak traffic had likely been depressed and spread out over more time as the system throttled; with systems that scaled elastically, the peak traffic just kept growing. This was OK in the cloud systems, but we were approaching scale way beyond what we expected in the commerce back end. As ATG servers heated up and the database was being pounded, we were minutes away from throttling traffic when the peak subsided and we had weathered the first storm.
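
To put those percentages side by side, here is a small worked calculation in Python. It is only a sketch: the previous year's peak is normalized to 1.0 because the post gives no absolute traffic figures, the 150% figure is approximate ("approached"), and the function name `peak_multiplier` is made up for illustration.

```python
def peak_multiplier(increase_percent: float) -> float:
    """Convert a percentage increase over last year's peak into a multiplier."""
    return 1 + increase_percent / 100


expected = peak_multiplier(20)     # historical trend: about 1.2x last year's peak
load_tested = peak_multiplier(50)  # load-test target: 1.5x last year's peak
observed = peak_multiplier(150)    # roughly what arrived: about 2.5x last year's peak

# How far the real peak overshot what had been load tested.
overshoot = observed / load_tested  # about 1.67x the tested load
```

In other words, a "50% safety margin" that looked generous against a 20% trend still left the real peak at roughly one and two-thirds times anything the back end had been tested against.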

Over the course of Thanksgiving, Black Friday and Cyber Monday, TWLER had its most successful online holiday ever, as competitor ecommerce sites melted down all around us. Virtually every retailer was down at some point over that time period, but we survived. Lots of things failed, including tax and inventory systems, but tax can be estimated and fixed before shipping, and you can guess at inventory for some amount of time until you run into trouble. While we remained up, the days and nights were spent fighting scaling issues, database issues, back end services issues, and network issues. When Cyber Monday ended successfully, the team felt like it had been fighting fires for five days straight.

The post mortem on the new cloud-based systems was that they saved TWLER from catastrophe. Previously skeptical business and tech team members saw the front ends take massive loads and scale up accordingly. All the problems occurred in the legacy systems and enterprise services. It wasn’t flawless, but it showed that the investment was worth it and the direction was correct. It was my third holiday, and certainly the most exciting one ever. The architecture that I envisioned was starting to come to life, the teams we created banded together and bonded over intense problem solving, and the cloud future of TWLER.com was cemented with a successful holiday showing.

The next three years continued the evolution from ATG to a cloud distributed architecture. While there were some rocky times and numerous outages during holidays, we survived them all and ended up at a fully automated future. Holidays became boring, and the only fun was watching daily revenue increase and pass $200M in a single day. Otherwise we just hung out and watched the metrics roll in. The mantra became “bring it on!” as we wished for higher traffic to truly test our systems.

And with that boredom came the desire for a new challenge, which eventually took me away from TWLER in 2016 to take on restructuring the architecture for an entire company. That story may appear here someday, but it is still in progress.

Thanks to everyone who worked on TWLER.com from 2010-2015. It was truly a journey worth taking.