I recently had the opportunity to sit down with Kelly Goetsch and Dirk Hoerig to record an interview on Target’s engineering culture for their Commerce Tomorrow podcast. Kelly was in town for the Open Source North conference, where we were both speaking, so we sat down in the Target recording studio to tape the show. I’ve been using that studio to create an internal podcast for Target Engineering, so we had an audio engineer on hand and figured out how to include Dirk from Germany.
Part XII of a multipart story; to start at the beginning, goto Part 1.
Since we’re in the middle of Holiday 2017, I thought a digression on Holiday was in order.
If you have never worked in retail, then you’ve missed out on the grand experience we call “Holiday”. On the other hand, you’ve probably actually enjoyed the time of year from mid-November to Christmas, celebrating with your friends and family and taking advantage of the thousands of deals from the many retailers trying to get their share of wallet from you.
Holiday, with a capital H, is something that has to be experienced to be believed. In my first Holiday at TWLER in 2010, I was on a team that had just started writing code and had very little in production leading into Thanksgiving. The only offering we supported was the failure site: if the main TWLER.com went down, we would quickly spin up the browse-only site so consumers would at least be able to see what products we sold and where our stores were located. In 2010 this was actually a pretty good thing, since the ecommerce site was still less than 5% of revenue.
When you work in IT at a retailer, your entire year is judged on whether or not the systems you support survive the shopping onslaught of Holiday. In the online space, an ecommerce site might make 30% of its revenue in the five days from Thanksgiving to Cyber Monday. TWLER.com also experienced the third highest traffic of North American retailers during that time. This massive scale-up to 20X normal daily traffic was largely accomplished without clouds in the 2000s. You had to take a really good guess as to how much infrastructure was needed, build it all out over the course of the year, and hope you weren’t overwhelmed by consumer behavior. You could easily receive 1M requests per second at the edge, and 100,000+ requests per second to your actual systems. If those requests were concentrated on the wrong systems, you could easily take down your site.
TWLER counts how long you’ve been at the company by the number of Holidays you’ve experienced. If someone asks how long you’ve worked there, you might say “four Holidays.” And every Holiday is the most important one yet, because those six weeks account for 50% or more of yearly revenue.
After a few Holidays, you realize the second the current year’s Holiday is over, you are immediately planning for the next one. There is no break. It’s like a giant tsunami that is slowly approaching, day by day. You can look over your shoulder and it’s always there, waiting to crash down on you and ruin your day. Once this year’s tsunami passes, you turn around and can see next year’s on the horizon.
In my six Holidays at TWLER, we experienced numerous outages, usually caused by either internal stupidity or unexpected consumer behavior. In our first few years, we would purposely force our ecommerce site to use “enterprise” services because they were the “single source” for things like taxes or inventory. This is a great notion, but only if the “enterprise” services are actually built to support the entire enterprise. Since TWLER was store focused, the “enterprise” services were often down at night for maintenance, or were not built to withstand massive surges in traffic. One million people refreshing a product detail page (PDP) every few seconds to check for inventory on a big sale quickly overwhelmed these services. So we often turned these services off and flew semi-blind, rather than have the site completely fail.
In other instances we tried to use various promotion functions embedded in our ATG commerce server. These seemed like useful things for easily setting up a promotion like buy-one-get-one. But when millions of people come looking for the sale, vendor-built commerce engines go down quickly, destroying their own database with the exact same calls, over and over again. They hadn’t heard of caching yet, I guess.
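A read-through cache in front of the promotion lookup would have absorbed almost all of those identical calls. Here is a minimal sketch of the idea; `fetch_promotion_from_db` is a hypothetical stand-in for the vendor engine’s database query, not anything from ATG:

```python
import time

class TTLCache:
    """Tiny read-through cache: identical promotion lookups hit the
    database once per TTL window instead of once per request."""

    def __init__(self, ttl_seconds, loader):
        self.ttl = ttl_seconds
        self.loader = loader          # function that does the real DB call
        self.store = {}               # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry and entry[0] > now:
            return entry[1]           # fresh cache hit, no DB call
        value = self.loader(key)      # cache miss: one real lookup
        self.store[key] = (now + self.ttl, value)
        return value

calls = []
def fetch_promotion_from_db(promo_id):    # hypothetical DB query
    calls.append(promo_id)
    return {"id": promo_id, "rule": "buy one get one"}

cache = TTLCache(ttl_seconds=30, loader=fetch_promotion_from_db)
for _ in range(1_000_000):                # a million identical requests...
    cache.get("bf-doorbuster")
# ...but only one trip to the database
```

A 30-second TTL means a promotion edit takes up to 30 seconds to appear, which is a trivial price for turning a million database hits into one.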
We would sometimes publish our starting times for various sales, saying a big sale would start at 11AM, and send out millions of customer emails. The marketing teams loved the starting times and the technology teams hated them. We warned that setting a hard start time is a sure route to failure, yet we did it multiple times and incurred multiple failures as the traffic surge brought down the site. There are physical limits even in clouds; you can only spin things up so fast, and 10M requests per second will bring down most sites. After a few of these episodes, we did convince the marketing teams that it wasn’t the way to go and learned how to run sales with gradual ramp-ups in requests rather than massive surges.
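One way to turn a hard 11AM start into a gradual ramp is to give each customer a small, deterministic delay, spreading the stampede over a few minutes. This is a sketch of the general idea, not our actual implementation; the function name and window size are illustrative:

```python
import hashlib

RAMP_WINDOW_SECONDS = 300   # spread the sale start over 5 minutes

def sale_start_offset(customer_id: str) -> int:
    """Deterministic per-customer jitter: hash the customer id into a
    delay in [0, RAMP_WINDOW_SECONDS). The same customer always gets
    the same offset, so the page stays consistent across refreshes."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % RAMP_WINDOW_SECONDS

# Instead of every email unlocking the sale at exactly 11:00:00, each
# customer sees it unlock at 11:00:00 plus their personal offset.
offsets = [sale_start_offset(f"customer-{i}") for i in range(10_000)]
```

Because the offset is a hash rather than a random draw, no server-side state is needed and every page render agrees on when that customer’s sale begins.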
Around 2013, Black Friday evening shopping was so intense across the nation that the credit card networks themselves slowed down. Instead of taking a few seconds to authorize a credit card, it started taking one or two minutes, across all retailers. The added latency caused threads to hang inside our ecommerce systems, and all of a sudden we ran out of threads as they were all tied up waiting for payments to complete. The next year, we changed our payment process to be asynchronous so that would never happen again.
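The asynchronous pattern decouples the request thread from the slow authorization call: the checkout handler enqueues the work and returns immediately, and a background worker is the only thing that ever blocks on the card network. This is a minimal sketch of the pattern, not our production code; all names are illustrative, and the `time.sleep` stands in for a network call that could take minutes:

```python
import queue
import threading
import time

auth_queue = queue.Queue()
results = {}

def authorize_with_network(order_id):
    """Stand-in for the card network call; imagine this taking
    one or two minutes on a bad Black Friday evening."""
    time.sleep(0.01)
    return "approved"

def payment_worker():
    """Background worker: the only place that blocks on the network."""
    while True:
        order_id = auth_queue.get()
        if order_id is None:              # shutdown sentinel
            break
        results[order_id] = authorize_with_network(order_id)
        auth_queue.task_done()

def handle_checkout(order_id):
    """Request thread: enqueue the auth and return immediately, so a
    slow card network can no longer exhaust the web thread pool."""
    auth_queue.put(order_id)
    return {"order": order_id, "payment": "pending"}

worker = threading.Thread(target=payment_worker, daemon=True)
worker.start()
responses = [handle_checkout(f"order-{i}") for i in range(50)]
auth_queue.join()                         # wait for the backlog to drain
auth_queue.put(None)                      # stop the worker
```

The trade-off is that checkout now returns “payment pending” rather than a final answer, so the order flow has to confirm the authorization later, by email or on the order status page.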
There are many more stories of failure, but from every failure we learned something and implemented fixes for the next year’s wave. This is why Holiday in retail is such fun: every year you get to test your mettle against the highest traffic the world can generate. You planned all year, you implemented new technologies and new solutions, but sometimes the consumer confounds you and does something totally unexpected.
The last story is one where consumer behavior combined with new features to take us down unexpectedly. In 2014 we implemented “Save for Later” lists, where you could put items on a list, access them later, and add them to your cart. As Thanksgiving rolled around and the Black Friday sale went out at around 2AM, our Add to Cart function started getting pounded at a rate far higher than we had tested it for. We were seeing 100K requests per second within the first few minutes of the sale; it rapidly brought the Add to Cart function to its knees, and we had to take an outage immediately to get systems back together and increase capacity.
This was completely unexpected consumer behavior, so what happened? It turned out that customers used the Save for Later lists to pre-shop the Black Friday sale, adding everything they wanted to buy to their lists. When 2AM rolled around, they opened their Save for Later lists and clicked the Add to Cart buttons one after another. A single customer might click 5-10 Add to Cart buttons in a few seconds. With hundreds of thousands of customers figuring out the same method independently, the result was a massive spike in Add to Cart requests: we effectively DDoSed our own Add to Cart function with simultaneous collective human behavior.
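One common guardrail against this kind of self-inflicted DDoS, offered here as a generic mitigation sketch rather than what we actually shipped that night, is a per-customer token bucket on the Add to Cart endpoint. It allows the natural burst of a shopper emptying a list while capping the sustained per-customer rate:

```python
import time

class TokenBucket:
    """Per-customer rate limit: allow a short burst of clicks, then
    throttle until tokens refill at a steady rate."""

    def __init__(self, capacity=5, refill_per_second=1.0):
        self.capacity = capacity
        self.refill = refill_per_second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # top up tokens for the time elapsed since the last call
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True                 # request goes through
        return False                    # throttled: ask the customer to retry

# One shopper hammering Add to Cart ten times in the same instant:
bucket = TokenBucket(capacity=5, refill_per_second=1.0)
burst = [bucket.allow() for _ in range(10)]
```

A burst of ten near-simultaneous clicks gets five through and throttles the rest; a second later, one more token has refilled. In production the buckets would live keyed by customer in a shared store rather than in process memory.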
I feel like I could keep going on Holiday for another two pages, but that’s enough for this year; maybe we’ll do it again in the all-important next year.
Goto Part XIII
What does it mean to start an open source project internal to an organization? Does that make any sense?
Many large organizations have very large systems within them, systems that are mission-critical to the delivery of their business model. These systems are often bottlenecks for development: in some fashion they cannot be avoided, and some work on them is needed to complete any enterprise-scale capability. They limit the rate of change available to an organization.
What if there were a way to unlock the capacity limit on these systems? There is: open source the project.
If you open source a project internal to a company, you are opening up the codebase for anyone in the company to work on. Instead of supplying a dedicated development team, you now need a dedicated team of system stewards: people who ensure the stability of the system and that the code being added meets the criteria of the project’s sponsors.
You can now do this fairly easily with Git-based source control: anyone in the company can write a module or patch and submit a pull request. The stewards review the pull request, judge whether the code takes the project in the direction of their roadmap, and potentially accept it into the main repo.
If done correctly, you’ve opened up the system to the teams with the greatest need while still maintaining control over the system and its direction. If done incorrectly, you’ll probably have the biggest mess of your life. To push an entire enterprise forward at higher velocity, the risk may be worth it.
We’ll be at MinneBar on April 12, 2014, which is again at the Best Buy campus this year. It’s always nice to spend a Saturday at work! My colleague Kannan Swaminathan and I will be presenting our Cassandra and Riak at BestBuy.com talk that we previously gave at CodeFreeze. Hopefully the Twin Cities conference attendees will not notice.
My article on the BestBuy.com Cloud Architecture appears in the March/April 2014 edition of IEEE Software Magazine. I’ve been busy writing that article instead of writing here.
I will be presenting with my colleague Kannan Swaminathan at this year’s Code Freeze at the University of Minnesota January 16, 2014. We will be doing the breakout sessions so you’ll have two chances to attend. It should be an informative talk on how Best Buy is using Cassandra and Riak. Hope to see you there!
BestBuy.com will be presenting at the University of Minnesota Computer Science department Tech Talk series on October 9th at noon. We will be presenting on the Architecture & Technology of BestBuy.com.
You probably have to be a student of the UMN to attend; CS students of Minnesota, I hope to see you there.
So why am I still talking about this?
In the early 2000s, the industry somehow got convinced that software was just another form of manufacturing: if you defined a process and applied it rigorously, little chunks of perfectly coded software would come spewing out the end of your assembly line. And since it was a manufacturing process, labor could be sourced from anywhere and we could all get our software faster, better, and at lower cost.
In 2001 my job got offshored, like those of many of us who worked through that period. However, my particular offshoring is remarkable in that I truly got offshored. The firm hired a company which had purchased an old cruise ship and parked it somewhere off the coast of San Diego, in international waters. Some poor sods from various countries were relegated to a permanent offshore vacation where they coded 24 hours a day. Yes! The CTO explained how, at the end of one person’s 12-hour shift, they would simply step aside and the next person would hop in the chair and pick up where they left off. All that work we kept trying to tell people would take 3 more months would be done in a few weeks!
It’s now 10+ years later and I haven’t heard about a massive flotilla of cruise ships blocking the entire western coastline of the USA, so I’m assuming this model didn’t catch on. Actually, I know it didn’t catch on, as the CTO absolutely failed to deliver any software at all after six months of trying. Since it was a startup, it then promptly disappeared.
In any case, the renaissance of the local software engineer took over a few years ago and shows no sign of stopping. Yet I still find myself in conversations regarding the commodity nature of developers. Does this happen to anyone else?
What’s the half life of the code you are writing today?
Half life (not the game) is the term used to describe the decay of radioactive isotopes. The longer the half life, the slower the decay. If you have a gram of radioactive material, half of it decays in one half life, half of the remainder in the next, and so on until essentially none of the radioactive material is left.
I like to think about the code we write as having a half life. Well written code in a slowly changing area of an application has a long half life. It doesn’t mean the code never changes, it just means only small changes occur over long periods of time. The half life of the code may be in years (Caesium 134, half life of about 2 years).
However, brand new code in a rapidly changing area, say the new UI of your brand new site, has a half life of days (Manganese 52, half life of 5 or 6 days). This would mean you’d expect half the code to be replaced in one work week. The next week another quarter of the original code would be replaced, and so on, until virtually no code from the original work is left.
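The arithmetic behind the analogy is plain exponential decay: after elapsed time t, the fraction of the original code still alive is 0.5^(t / half life). A few lines make the numbers concrete:

```python
def fraction_remaining(elapsed, half_life):
    """Exponential decay: fraction of the original code surviving
    after `elapsed` time units, given a half life in the same units."""
    return 0.5 ** (elapsed / half_life)

# UI code with a 5-day half life: half gone after one work week...
week_one = fraction_remaining(5, 5)       # 0.5
# ...a quarter left after two half lives
week_two = fraction_remaining(10, 5)      # 0.25
# Stable code with a 2-year half life barely moves in a month
stable = fraction_remaining(1 / 12, 2)    # ~0.97
```

The same formula covers both ends of the spectrum; only the half life parameter changes.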
Thinking about half life is useful because it tells you how much effort you should be devoting to testing and ensuring the code is rock solid. Long half life code should be well tested, documented and vetted for scalability. Short half life code should be thrown out with little testing and few thoughts about scalability or maintainability. Why? Because the code will be gone by next week.
Unlike isotopes, the half life of code changes once the code is complete and in production. Production marks a point where half life increases dramatically. In fact, you should be actively cranking up the half life by making the code clean and scalable.
Still, there’s a limit depending on the velocity of change in the various parts of the application. These days UIs evolve rapidly for consumer-driven applications. The half life is short and the amount of effort put into this code is low. It should still work, but may not be something you’re proud to say you wrote. Then again, you should be pleased, as you put forth the appropriate amount of effort.