Part XII of a multipart story, to start at the beginning goto Part 1.
Since we’re in the middle of Holiday 2017, I thought a digression on Holiday was in order.
If you have never worked in retail, than you’ve have missed out on the grand experience we call “Holiday”. On the other hand, you’ve probably actually enjoyed the time of year from mid-November to Christmas while you celebrate with your friends and family, and take advantage of thousands of days of deals from the many retailers trying to get their share of wallet from you.
Holiday, with a capital H, is something that has to be experienced to be believed. In my first Holiday at TWLER in 2010, I was on a team that had just started writing code and had very little in production leading into Thanksgiving. The only offering we supported was the failure site, if the main TWLER.com went down, we would quickly spin up the browse only site so consumers would be able to at least see what products we sold, and where our stores were located. In 2010, this was actually a pretty good thing since the ecommerce site was still less than 5% of revenue.
When you work in IT in a retailer, your entire year is judged on whether or not the systems you support survive the shopping onslaught of Holiday. In the online space, an ecommerce site might make 30% of its revenue in the five days from Thanksgiving to Cyber Monday. TWLER.com also experienced the third highest traffic of North American retailers during that time. This massive scale up to 20X normal daily traffic was largely accomplished without clouds in the 2000s. You had to take a really good guess as to how much infrastructure was needed, build it all out over the course of the year, and hope you weren’t overwhelmed by consumer behavior. You could easily receive 1M requests per second at the edge, and 100,000+ requests per second to your actual systems. If those requests were concentrated on the wrong systems, you could easily take down your site.
TWLER counts how long you’ve been at a company by the number of Holidays you’ve experienced. If someone asks how long you’ve worked there, you might say “four Holidays.” And every Holiday is the most important one yet, because those six weeks account for 50% or more of yearly revenue.
After a few Holidays, you realize the second the current year’s Holiday is over, you are immediately planning for the next one. There is no break. It’s like a giant tsunami that is slowly approaching, day by day. You can look over your shoulder and it’s always there, waiting to crash down on you and ruin your day. Once this year’s tsunami passes, you turn around and can see next year’s on the horizon.
In my six Holidays at TWLER, we experienced numerous outages, usually caused by either internal stupidity, or unexpected consumer behavior. In our first few years, we would purposely force our ecommerce site to use “enterprise” services because they were the “single source” for things like taxes, or inventory. This is a great notion, but only if the “enterprise” services were actually built to support the entire Enterprise. Since TWLER was store focused, this meant the “enterprise” services were often down at night for maintenance, or were not built to withstand massive surges in traffic. One million people refreshing a PDP to check for inventory on a big sale every few seconds quickly overwhelmed these services. So we often turned these services off and flew semi-blind, rather than have the site completely fail.
In other instances we tried to use various promotion functions embedded in our ATG commerce server. These seemed like useful things to easily setup a promotion like buy one get one. But when millions of people come looking for the sale, the vendor built commerce engines go down quickly by destroying their own database with the same exact calls, over and over again. They hadn’t heard of caching yet, I guess.
We would sometimes publish our starting times for various sales, saying a big sale is starting at 11AM and send out millions of customer emails. The marketing teams loved the starting times and the technology teams hated them. We warned that setting a hard start time is a sure route to failure. Yet we did it multiple times and incurred multiple failures as the traffic surge brought down the site. There are physical limits even in clouds, you can only spin things up so fast and 10M rqs will bring down most sites. After a few of these episodes, we did convince the marketing teams that it wasn’t the way to go and learned how to have sales with gradual ramp-ups in requests rather than massive surges.
Around 2013, the Black Friday shopping was so intense in the evening across the nation that the credit card networks themselves slowed down. Instead of taking a few seconds to auth a credit card, it started taking one or two minutes. This was across all retailers. However, the change in time caused threads to hang up inside our ecommerce systems and all of a sudden we ran out of threads as they were all tied up waiting for payments to happen. For the next year, we changed our payment process to go asynchronous so that would never happen again.
There are many more stories of failure, but from every failure we learned something and implemented fixes for the next year’s wave. This is why Holiday in retail is such fun, every year you get to test your mettle against the highest traffic the world can generate. You planned all year, you implemented new technologies and new solutions, but sometimes the consumer confounds you and does something totally unexpected.
The last story is one where the consumer behavior combined with new features took us down unexpectedly. In 2014 we implemented “Save for Later” lists where you could put your items on a list that you could access later and add them to your cart. As Thanksgiving rolled around and the Black Friday sale went out at around 2AM, our Add to Cart function started getting pounded at a rate far higher than we had tested it for. We were seeing 100K rqs in the first few minutes the sale was happening, it rapidly brought the Add to Cart function to its knees and we had to take a outage immediately to get systems back together and increase capacity.
This was completely unexpected consumer behavior so what happened? It turned out that customers used the Save for Later lists to pre-shop the Black Friday sale and add all the things they wanted to buy into the lists. Then when 2AM rolled around, they opened their Save for Later lists and started clicking the Add to Cart buttons one after the other. A single customer might click 5-10 Add to Cart buttons in a few seconds. With hundreds of thousands of customers figuring out the same method independently, it led to a massive spike in Add to Cart requests, we effectively DDOSed our Add to Cart function with simultaneous collective human behavior.
I feel like I could keep going on Holiday for another two pages, but that’s enough for this year, maybe we’ll do it again in the all important next year.
Goto Part XIII