Part II – Start at the beginning with Part I
In 2010, the cloud was not new, but it was largely ignored by large companies, particularly in non-technology-focused segments such as retail. It is clear now, but at the time large retailers had not yet awakened to the new reality that a company’s prowess in software might decide its future.
As an example, in my first year at TWLER (The World’s Largest Electronics Retailer), store sales during December were not going well. Since many retailers make 50% or more of their annual sales in November and December, this spelled impending disaster. Traffic was down in stores, and the company’s reaction was to propose that the digital channel stop its free shipping offer. This clearly showed the company leadership’s inability to fathom online shopping. The logic was that if a customer could buy a product online and have it shipped for free, they would not go to a TWLER store. That seems reasonable, but if the customer was shopping online and we did not offer free shipping, they would simply click over to the next store, probably Amazon, and buy their electronics there with free shipping. This customer was not going to a TWLER store that holiday season, ever. We were actually saving sales for the company; it just wasn’t understood.
A minor digression on the state of the ATG system is necessary to understand what we were dealing with. The original ATG system was built in 2003. At the time, it was an excellent decision for a mid-sized retailer building its first ecommerce engine. But over time, and through numerous one-off projects, the codebase had morphed, become intertwined, and been generally neglected and abused. As an example, one ongoing project when I arrived was to widen the product detail page (PDP) and move the Add To Cart button from the left side of the page to the right side. This seemed fairly innocuous, but it took six months and well over one million dollars to accomplish. This seemed ridiculous to me, so as an architect I dug into why it was happening.
It turned out there were multiple reasons why this project was practically impossible to complete. To start with, there were nine separate versions of the PDP, each made for a different category such as TVs, Music, Computers, etc. The nine PDPs all had common origins in some ancestral PDP, but after years of projects aimed at individual categories, they had strayed in different ways, including using different JavaScript frameworks and versions to implement dynamic page elements. These PDPs were written in JSP and JavaScript, and each was well over 10,000 lines long, with actual Java code intermixed into the JSPs themselves. Imagine trying to figure out how to change nine different pages, all implemented slightly differently in monstrous JSP files, with no test automation to tell you whether you broke anything in the process.
This sounds bad enough, but the executable for ATG was in the gigabyte range, built as an EAR file by a special Ant file that only one or two people understood (more on this later). It was necessary to build and run the entire EAR to determine whether the page changes worked, since the JSP code was so intertwined with the server-side code. However, it was impossible to run this EAR on a developer’s machine because it also required a full working copy of the Oracle database. No one had yet figured out how to make all of these pieces work on a single desktop or laptop machine.
Instead, there was one shared development server for the entire Dotcom division. This server was a large Unix box, but it was still too small to serve thousands of developers trying to build and run the ATG codebase. It routinely failed for lack of disk space and CPU. But it was the only place to build and run the code you had worked on in your IDE, so everyone had to deal with it.
If you weren’t crying yet, the next step was to deploy the code to the staging environment, skipping the integration environment altogether. The reality was that the front-end code only worked if it could access the Internet, because the page depended on so many externally downloaded components. So even though you had deployed the code to the shared developer environment, you couldn’t actually run it there. The staging build happened only once every night.
To sum this up, the normal front-end development cycle is: change some code, save it locally, have it automatically picked up by your running app server, and test it. This cycle time should be in the seconds range so you can work quickly and efficiently through all the little tweaks necessary to make a UI page look good and work as expected. The cycle at TWLER was: change some code, save it locally, do your best to make sure the page compiled, check in the code, push to your developer environment, do your best to check that it compiled, wait for the overnight staging push (go home), then come back the next morning and see if the change worked in staging (assuming the push didn’t fail, which it often did). Instead of a cycle time of seconds for each code change, the cycle time was one day. One entire 24-hour day!
Did I mention zero automated regression tests?
Now I bet you think that $1M was cheap. In fact, I still don’t know how anyone got any work done in these conditions, but I do know that the churn in the front-end development team was enormous.