
A Digital Ecommerce Transformation – Front End Madness – Part II

Part II – Start at the beginning with Part I

In 2010, the cloud was not new, but it was being ignored by large companies, particularly in non-technology segments such as retail. While it seems obvious now, at the time large retailers had not yet awakened to the new reality that a company's prowess in software might decide its future.

As an example, in my first year at TWLER (The World's Largest Electronics Retailer), store sales during December were not going well. Since many retailers make 50% or more of their annual sales in November and December, this spelled impending disaster. Traffic was down in stores, and the company's reaction was to propose that the digital channel stop its free shipping offer. This clearly showed the leadership's inability to fathom online shopping. The logic was that if a customer could buy something online and have it shipped for free, they would not go to a TWLER store. That seems reasonable, but if a customer shopping online did not find free shipping from us, they would simply click through to the next store, probably Amazon, and buy their electronics there with free shipping. That customer was never going to a TWLER store that holiday. We were actually saving sales for the company; it just wasn't understood.

A minor digression on the state of the ATG system is necessary to understand what we were dealing with. The original ATG system was built in 2003; at the time, it was an excellent decision for a mid-sized retailer building its first ecommerce engine. But over time, and through numerous one-off projects, the codebase had morphed, intertwined, and been generally neglected and abused. As an example, one ongoing project when I arrived was to widen the product detail page (PDP) and move the Add To Cart button from the left side of the page to the right side. This seemed fairly innocuous, but it took six months and well over one million dollars to accomplish. That seemed ridiculous to me, so as an architect, I dug into why it was happening.

It turned out there were multiple reasons why this project was practically impossible to complete. To start with, there were nine separate versions of the PDP, each built for a different category such as TVs, Music, Computers, and so on. The nine PDPs all had common origins in some ancestral PDP, but after years of projects aimed at individual categories, they had each strayed in different ways, including using different Javascript frameworks and versions to implement dynamic page elements. These PDPs were written in JSP/Javascript and were each well over 10,000 lines long, intermixing actual Java into the JSPs themselves. Imagine trying to figure out how to change nine different pages, all implemented slightly differently in monstrous JSP files, with no test automation to determine whether you broke anything in the process.
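To make the state of those pages concrete, here is a minimal, hypothetical sketch of the style of JSP in play. The class names, properties, and category check below are invented for illustration; the point is the pattern of Java scriptlets, business logic, and category-specific branches living directly in the page markup.

```jsp
<%-- Hypothetical pdp_tv.jsp: one of nine category-specific PDPs --%>
<%@ page import="com.example.catalog.Product" %>
<%
    // Business logic embedded directly in the page: pull the product
    // from a request attribute set somewhere upstream in the server code.
    Product product = (Product) request.getAttribute("product");
    boolean onSale = product.getSalePrice() != null;
%>
<div class="pdp">
  <h1><%= product.getName() %></h1>
  <% if ("TV".equals(product.getCategory())) { %>
    <%-- TV-only markup that drifted away from the other eight PDPs --%>
    <span class="screen-size"><%= product.getAttribute("screenSize") %></span>
  <% } %>
  <% if (onSale) { %>
    <span class="price sale"><%= product.getSalePrice() %></span>
  <% } else { %>
    <span class="price"><%= product.getListPrice() %></span>
  <% } %>
  <%-- The Add To Cart button whose move to the right side cost over $1M --%>
  <button id="add-to-cart">Add To Cart</button>
</div>
```

Multiply that by roughly 10,000 lines per page and nine diverging copies, with no automated tests, and the cost of moving one button starts to make a grim kind of sense.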

This sounds bad enough, but the ATG executable was in the gigabyte range, built as a single EAR with a special Ant file that only one or two people understood (more on this later). It was necessary to build and run the entire EAR to determine whether page changes worked, since the JSP code was so intertwined with the server-side code. However, it was impossible to actually run this EAR on a developer's machine, because it also required a full working copy of the Oracle database. No one had yet figured out how to make all these pieces work on a single desktop or laptop.

Instead, there was one shared development server for the entire Dotcom division. It was a large Unix box, but still too small to serve the thousands of developers trying to build and run the ATG codebase. The server routinely failed for lack of disk space and CPU. But it was the only place to build and run the code you had worked on in your IDE, so everyone had to deal with it.

If you weren't crying yet, the next step was to deploy the code to the staging environment (skipping the integration environment altogether), because the front-end code only worked if it could access the Internet; there were so many externally downloaded components on the page. Even though you deployed the code to the shared developer environment, you couldn't actually run it there. The staging build happened once every night.

To sum this up: the normal front-end development cycle is to change some code, save it locally, have it automatically picked up by your running app server, and test it. This cycle time should be in the seconds range so you can work quickly and efficiently through all the little tweaks necessary to make a UI page look good and work as expected. The cycle at TWLER was: change some code, save it locally, do your best to make sure the page compiled, check in the code, push to your developer environment, do your best to check that it compiled, wait for the overnight stage push (go home), and come back the next morning to see if the change worked in stage (assuming the push didn't fail, which it often did). Instead of a cycle time of seconds for each code change, the cycle time was one day. One entire 24-hour day!

Did I mention zero automated regression tests?

Now I bet you think that $1M was cheap. In fact, I still don't know how anyone actually got any work done under these conditions, but I do know that the churn on the front-end development team was enormous.

GOTO Part III

A Digital eCommerce Transformation – A Multipart Series

It took six long years, but we successfully transformed a monolithic, 10-year-old ATG-based commerce system into a completely distributed, infinitely scalable eCommerce platform. I'll call the place TWLER, or The World's Largest Electronics Retailer.

I joined TWLER in 2010 as a Hadoop Architect on the small team tasked with rewriting TWLER.com. I interviewed in May and was offered the role, but didn't start until July. About two weeks before I started, the VP called to let me know that the Hadoop project was cancelled and that if I declined the offer they would understand. I had spent the previous two years learning about Hadoop and installing one of the first Hadoop clusters in the Twin Cities, on a bunch of old Dell towers, for a company called Peoplenet. Back in 2008, the VP there had the idea of building a remote data collection system that could take in one million messages per second. I tried to start up a Hadoop consulting arm for the consulting company I was working for at the time, Object Partners. However, it was a bit too early in the Twin Cities for companies to be interested in Big Data and NoSQL. We spent a lot of time talking to companies about the technology, but no contracts were forthcoming. I put together an Introduction to Hadoop presentation and gave it numerous times, all to no avail. So when TWLER came calling with a Hadoop Architect position, I jumped at the chance to actually use Hadoop at a large company. All that to say, I was quite disappointed when the role fell through, and I seriously considered declining the position since I had not yet resigned; after two years of pushing the technology, I wanted to see Hadoop in action. But I was quite tired of my second stint in consulting and ready to move on.

So I arrived at TWLER with no assigned role. At the time, we were five architects under an Operations Vice President, and I had no idea of the organizational politics that were hindering a major system rewrite. The cancellation of one project should have been a clue to what was coming: funds were already being pulled from the team.

We started out under the tutelage of Michael N., a renowned architect who had worked on the initial version of TWLER.com. Our first task was to set up a modern development pipeline using Chef in AWS. Remember that this was 2010, when it was highly uncommon for a large retailer to do anything in AWS, and Chef was practically brand new.

The pipeline we stood up was fairly standard: Atlassian Stash for Git hosting, Crowd for user management, Confluence for knowledge management, Jira for issue tracking, Bamboo for continuous integration, and Artifactory for artifact storage and a Maven repository. We used Chef to completely automate the deployment of these tools into AWS. While none of these tools was new to TWLER, they had never been combined, made externally accessible, or offered to anyone in the company who wanted to use them.

We used this infrastructure to start a small AWS-based project: a lightweight site that could be used during outages. The site would allow customers to look up locations, products, and prices, but full commerce capabilities would not be available. It would reside in the cloud, ready to be deployed within minutes. It says something that the first thing we built was an outage site; TWLER.com was not a stable platform at the time.

GOTO Part II