Category Archives: Software Development

A Digital Ecommerce Transformation – Lucky to Have a Coach – Part XVIII

Part XVIII in a multipart series, to start at the beginning, goto Part I.

2012 was the year that I really began working for the VP of Operations at TWLER, Judy D. In 2011, the Chief Architect for TWLER decided to leave as the investment in the rewrite of the site was cut in half; see Part VII for more on that episode. Over the course of 2011, I became the main point of contact for the team, but didn’t completely take over till later that year.

To this point in my career, I had been leading many large Enterprise Java programs, mainly as a consultant, and had just finished an MBA. I had about 18 months of pure engineering management at United Health Group, having a team of about 25 writing online wellness programs. However, that management position didn’t end well, and I learned much and more about politics at large companies.

I did not have an extensive track record of management, but because I had spearheaded the funding effort and defined the vision for the future Ecommerce platform, I became the manager of the team and was promoted to Director. Much of the work we did is detailed here, or will be detailed here in the future. Today the discussion is about having a great manager as a coach.

I’m going to say, as a manager, I was very rough. During my time at UHG, I had four managers in 18 months, so I received very little direction or coaching. As the lead Architect and Engineer on many programs, I was only interested in software competency and delivering systems. If you couldn’t help me do that, you were dead to me. I was rightly accused of being blunt, straightforward and lacking empathy at this point in my career. I didn’t really care. I knew how to hire great engineers, build teams where everyone had fun while delivering more than asked, and be a total pain-in-the-ass to whoever was lucky enough to manage me.

But Judy decided to invest her time and started giving me feedback on how I was acting and how I was being perceived. It was the first time in my career that someone had explained that perception, not logic and data, is reality. In meetings with the VPs and Senior Directors in IT, I would get worked up because they knew how to push my buttons. Once you’re in attack mode, you’ve lost the room; everyone thinks that if you can’t manage yourself here, you must not be a good manager overall. Judy helped me understand how to deflect criticisms and attacks on my project’s direction by acknowledging the speaker’s point, and then presenting why I thought our direction would work. Never directly saying the other person was wrong, just that our current direction was working.

Learning the managerial arts of controlling meetings, protecting my reputation, and letting others be right (even when they were wrong), helped my project thrive by giving me the tools to fend off the IT team without offending them or giving them ammunition to take back to their leaders.

Judy always took the high road. She didn’t care if others were being assholes; it wasn’t an excuse to be an asshole too. She would always let me know she had heard about my latest escapades at meetings across the company. She no longer wanted to hear negative news; she only wanted to hear how I was helping people, nothing else.

This process was long; Judy would debrief me after meetings and point out exactly where I lost the room and when I went off on a rant. But I was willing to listen, and I corrected my mistakes. Over time, it was as if I had been in management my entire career. I had learned to manage my reactions, while also managing the outcomes of the meetings and the perception of my team and myself.

I can’t thank Judy enough for the coaching effort she put in to teach me to be a Director at TWLER, and later a VP for Fortune 100 companies.

A Digital Ecommerce Transformation – Lower Environment PaaS – Part XVII

Part XVII of a multipart series, to start at the beginning, goto Part I.

In the ongoing development of the site at TWLER (The World’s Largest Electronics Retailer), the actual process of building the system was severely neglected over the years. The first thing that we established as a team back in 2010 was a new set of build infrastructure (Git, Jenkins, Artifactory, Chef) so that engineers had actual continuous integration tools to work with.

Aside: Continuous Delivery was just appearing at this time. We had Jez Humble, author of the seminal Continuous Delivery book, in to talk to us about his ideas in early 2011. This was a small gathering of about 15 engineers and architects who were interested in the concept that we eventually implemented at TWLER.

The biggest problem with writing software at TWLER was the lower environments. As we were breaking up the monolithic ATG application into distributed components, the lower environments were still geared towards single ATG application delivery. We had DEV, TEST, INT, STAGE, BREAKFIX, and PROD environments. STAGE and BREAKFIX were marginally close replicas of PROD, but it was really not possible to create true replica environments. The environments were not well segregated either; on numerous occasions we found PROD applications accessing services and data from non-PROD environments. In the overwhelming realm of things to fix, this wasn’t even a top priority.

We decided we had to replace the DEV and TEST environments. DEV was basically non-functional: it was one large server that all 1000+ engineers were trying to use to integrate their code. Builds constantly failed, and even when they succeeded the results were often unusable. It could take you all day to just try and get one change built and hope the results were something you could work with. Most people just skipped DEV and went right to TEST, which was an actual managed environment.

We decided that a lower environment PaaS was the direction we wanted to go. We wanted every engineer at TWLER to be able to spin up every application at TWLER in their own private sandbox. That way engineers would have the ability to test their systems in isolation, while still using the latest component versions of everyone else’s system.

To achieve that we decided to build an OpenStack environment and a homegrown PaaS. But how to get the capital to create the OpenStack environment? At the time, mid-2011, I had excess dollars assigned as resource capital, which was meant to be used only for hiring engineers. In my projects at the time, we were still proving out the chosen architecture and were ending the year with excess resource capital. TWLER was extremely controlling of capital: you either had resource capital or hardware investment capital, but not both.

So I went on a reconnaissance mission to find someone with investment capital that needed resources to see if we could trade. After a couple weeks of hunting, and getting the financial managers of TWLER involved, we finally found a team that was willing to trade their investment capital for resources. We swapped $150k and used that money to buy three OpenCompute racks to build out our OpenStack environment. We estimated with high density OpenCompute, we could house about 5000 VMs on OpenStack. If we stuck with the company standard hardware, HP, we would only be able to build out an environment that housed 1000 VMs with those dollars.
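The hardware trade-off above comes down to simple arithmetic. A minimal sketch using only the figures quoted in this post ($150k, roughly 5000 VMs on OpenCompute versus roughly 1000 on the standard HP hardware); the function name is made up for illustration, and the VM counts are the estimates from the text, not measured numbers:

```python
def cost_per_vm(budget_dollars: float, vm_capacity: int) -> float:
    """Dollars of hardware budget per VM the environment can house."""
    return budget_dollars / vm_capacity


BUDGET = 150_000  # the capital that was swapped, per the post

open_compute = cost_per_vm(BUDGET, 5000)  # estimated OpenCompute capacity
hp_standard = cost_per_vm(BUDGET, 1000)   # estimated company-standard capacity

print(f"OpenCompute: ${open_compute:.0f} per VM")
print(f"HP standard: ${hp_standard:.0f} per VM")
print(f"Ratio: {hp_standard / open_compute:.0f}x")
```

On these estimates the same budget buys about five times as many VMs on the high-density racks.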

We knew that if we were successful, we would want to expand the OpenStack environment quickly. We decided on full automation from bare metal to operating OpenStack and proceeded down the path using Crowbar and Chef to build out the roll-in racks to OpenStack. At the same time we started building out the PaaS framework, Omnitank, and getting teams on-board to truly fully automate their deployments so their components could be included in the PaaS. This adventure was documented at OpenStack summit as one of the keynotes in front of 2000+ attendees.

This PaaS was so successful, we were able to shut down our DEV and TEST environments. We also allowed engineers from around the company to use the OpenStack environment. We actually had hundreds of forward thinking engineers from the enterprise side of TWLER using our OpenStack PaaS as it was the fastest way to get a test environment created.

It was so successful that when we had our first major outage and lost all environments, there were numerous enterprise teams that were down for weeks rebuilding their environments. We used this as a lesson to learn which teams actually automated their infrastructure, and which teams were polishing virtual chrome. But we did have to explain to IT VPs that this whole environment was still in beta, and it was use at your own risk.

Goto Part XVIII

A Digital Ecommerce Transformation – Modernizing ATG? – Part XVI

Part XVI of a multipart series, to start at the beginning, goto Part I.

As much as I love disparaging ATG, it was actually built as an all-inclusive platform, which, at the time, was a modern way to build systems. ATG had a built-in Inversion of Control framework prior to Spring. It had a set of tag libraries and did its best to help teams build ecommerce sites quickly in a J2EE application server manner. It just wasn’t ever meant to scale to the level we needed in 2012; the architecture was wrong. Everything you did triggered calls to the single database instance, and there was no caching, so a promotion calculation would be run every time a customer wanted to see the promotional price. With two million concurrent users, all interested in the latest promotions, this is a disaster. Any problem in any part of the system brought down the whole application. We needed to evolve from monolith to distributed.
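To make the caching point concrete, here is a minimal sketch of a time-based cache in front of an expensive promotion lookup. This is not ATG code or anything TWLER ran; the class, the function names, the SKU and the price are all hypothetical, and the "database" is just a counter so the effect is visible:

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny time-based cache: reuse a computed value until it expires."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, float]] = {}  # key -> (expires_at, value)

    def get_or_compute(self, key: str, compute: Callable[[], float]) -> float:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]              # still fresh: no backend call
        value = compute()              # expired or missing: hit the backend once
        self._store[key] = (now + self.ttl, value)
        return value


db_calls = 0


def promo_price_from_db(sku: str) -> float:
    """Stand-in for the expensive promotion calculation (hypothetical)."""
    global db_calls
    db_calls += 1
    return 79.99  # made-up price


cache = TTLCache(ttl_seconds=60)
for _ in range(1000):  # 1000 customers asking for the same promotional price
    price = cache.get_or_compute("sku-123", lambda: promo_price_from_db("sku-123"))

print(db_calls)  # one backend call instead of 1000
```

The trade-off is staleness: a price can be up to the TTL out of date, which is usually acceptable for a promotion that changes a few times a day.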

Similarly, by 2012, deploying massive EAR files had been a dead architecture choice for at least five years. Tying everything together into one massive deployment was the opposite direction of the Agile/DevOps movement springing up with the Continuous Deployment mania.

But this was the legacy we faced at TWLER (The World’s Largest Electronics Retailer). We were ten years into unchecked ATG development with thirteen applications deployed on the platform, all rolled up into that 2GB EAR. You can only imagine the quarterly deployment nightmare, literally a waking nightmare as all deployments were overnight affairs, but that’s the fodder for another post.

We dedicated a team to figuring out how to deconstruct our ATG implementation and code. The codebase was a series of interconnected, circularly coupled packages. There were five or ten main packages that were used by all the applications, so that a change to one package would inevitably break 2-3 applications. Thus, an eight week QA cycle was necessary to ensure all the integrated code from 1000s of engineers would result in thirteen working applications.

But if you can’t throw it out, and you can’t break functionality, what do you do?

The first course of action was to modernize the build. It wasn’t actually possible to build the ATG application on a developer machine. The only way to build the application was by kicking off a build in the developer integration environment. Thus, if you were an engineer working on ATG, you could make a bunch of changes to your code, check them in to the CVS source control system, and kick off a build. Imagine 1000 engineers all trying to do this at the same time.

The developer integration environment was overwhelmed by the number of build requests, so instead of running on demand, it would schedule 2-3 builds per day. So 1000 engineers trying to get their code in for 30 different projects to build together, guess the result of the build? If you guessed broken, then good job! You’ve worked in an enterprise environment in the mid-2000s!

Solving all these problems wasn’t easy, but the playbook was straightforward:

  • We had to automate the build so anyone could run it.
  • We had to separate the applications so you could just build the one you were working on.
  • We had to come up with a solution for the developer laptop, so you could build and run the ATG server locally.
  • We had to decide what portions of the ATG code to invest in and what to throw out.
  • We had to break the dependencies between the ATG packages so developers could work on things in isolation.
  • We had to automate the changes in the database so a local DB could be built and modified on the developer’s machine.
  • We had to fix the integration environment so it wasn’t constantly failing.

Many might think this was a throwaway investment and not worth the time or effort. We faced those challenges internally to TWLER as well. But, we were able to convince leadership that in order to move quickly in the future, we needed to move faster now in our legacy architecture. We would be using portions of ATG for the next three years at least, and we wouldn’t be able to complete our new work without augmenting the existing systems.

Looking back on it, I’m happy to report that the decision to invest in ATG as we dismantled it was a good one.

Goto Part XVII

A Digital Ecommerce Transformation – Cloud Home Page – Part XV

Part XV in a multipart story.  To start at the beginning goto Part I.

Our goals for 2012 were to deliver two things: a new browse architecture for the site and a Holiday without issues. Not a small task for a team saddled with a giant monolithic application and a mandate to deliver features and not affect the business during the rewrite.

We started in on the new browse architecture. It was our feeling that if we could simply scale up the browse capabilities of the site, we had a chance of making our aging ATG monolithic application survive another Holiday.

About 98% of traffic on an ecommerce system is people browsing the site; the other 2% is people actually trying to buy stuff. If you are planning for a 10X increase in traffic for one week out of the year, then attacking the 98% seemed like a good place to start. Also, since we were dealing with a monolith, any traffic we removed from the 150 node ATG cluster was more power towards the checkout process. In fact, as we did the math, if we took off 70% of the traffic in the first year at Holiday, we’d actually have 3X the capacity we needed for the checkout process and remaining components left on the ATG servers.
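A back-of-the-envelope version of that math, under a deliberately simple assumption that is not from the original planning: load on the ATG cluster scales linearly with the share of traffic it still serves. The function name is made up for illustration.

```python
def headroom_multiplier(offloaded_fraction: float) -> float:
    """How many times more capacity the cluster has relative to the load
    that remains, assuming load scales linearly with traffic share."""
    if not 0 <= offloaded_fraction < 1:
        raise ValueError("offloaded_fraction must be in [0, 1)")
    return 1.0 / (1.0 - offloaded_fraction)


# Moving 70% of traffic off the cluster leaves 30% of the original load,
# so the same servers have a bit over 3X the capacity that load needs.
print(round(headroom_multiplier(0.70), 2))
```

Real clusters are not perfectly linear (checkout requests are heavier than browse requests), so this is only a sanity check on the order of magnitude.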

We started with a project we called Cloud Home Page. The home page is the most served page on the site and at Holiday we had to make it static and cache it at the CDN. The business teams didn’t like this because, without dynamic content, there wasn’t any way to adjust what people were seeing as they landed on the site and lead them towards the new sales events. The Cloud Home Page plan was for a dynamic, cloud-based home page, with minimal personalization but modifiable within a 15 minute window.

Since we were coming from a J2EE style JSP on top of Servlets architecture, we first wanted to upgrade our front-end. The new architecture specified a thin and dynamic UI layer with zero coupling to the back end. That meant only HTML, CSS and Javascript were allowed. All data communication was done with JSON, and we tried to get the number of calls to the backend down to one.

That is, during a request for the home page, the controller would make only a single request to the backend for all the data it needed to build the page. The request was handled by a service aggregator that would then manage the 20-50 service requests for data and build a JSON response in the specified SLA of less than one second (for the home page). By specifying a single request, we could let the service aggregator determine what calls to cut off and what data to serve from cache to meet the page SLA. We also had a single point where we could add or remove functionality depending on the load. If the load was light, we could add a few more personalization services such as recommendations. If we were at peak loads, we could turn off all personalization and even some of the dynamic page elements to lessen the load on our servers. We created a highly dynamic Home Page that was tuneable given the system load.
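The aggregator idea can be sketched in a few lines. This is an illustration in Python with asyncio, not the Java system described here: the service names, the cache contents and the 0.2 second budget are all invented, and a real aggregator would also handle load-based feature toggles.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def aggregate(services: Dict[str, Callable[[], Awaitable[dict]]],
                    cache: Dict[str, dict],
                    budget_seconds: float) -> Dict[str, dict]:
    """Fan out to every service, wait at most budget_seconds, then build the
    page from whatever answered in time, falling back to cached data."""
    tasks = {name: asyncio.create_task(call()) for name, call in services.items()}
    done, pending = await asyncio.wait(tasks.values(), timeout=budget_seconds)

    # Cut off calls that blew the budget and let the cancellations settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    page: Dict[str, dict] = {}
    for name, task in tasks.items():
        if task in done and task.exception() is None:
            page[name] = task.result()   # fresh data, arrived within the SLA
        elif name in cache:
            page[name] = cache[name]     # too slow or failed: serve from cache
        # otherwise the page element is simply left out
    return page


# Hypothetical backend services with very different latencies.
async def catalog() -> dict:
    await asyncio.sleep(0.01)
    return {"items": 3}


async def recommendations() -> dict:
    await asyncio.sleep(5)
    return {"recs": ["fresh"]}


async def banner() -> dict:
    await asyncio.sleep(5)
    return {"text": "sale"}


stale = {"recommendations": {"recs": []}}
page = asyncio.run(aggregate(
    {"catalog": catalog, "recommendations": recommendations, "banner": banner},
    stale,
    budget_seconds=0.2,
))
print(page)
```

Because the page controller only ever sees one response, the decision about what to drop under load lives in a single place.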

However, to build just the home page we needed many teams to work together and deliver simultaneously. We had a new front-end team designing the Backbone based page structure and components. We had a team building a front end controller framework to drive the UX. Another team was building the service aggregation and distributed service management layer (sadly this was right before Hystrix became available). A team was building up a new distributed caching layer to gather together all the data needed to drive the home page. We continued the team that was extracting the item catalog from ATG and distributing it across our datacenters and AWS. And we imagined a team building a new customer management system we called Customer Graph, although we only had an architect working on that project. To ease the gridlock in our lower environments, we kicked off an OpenStack based lower environment PaaS just because we didn’t have enough other things going on. Topping it all off, our digital operations team was learning how to manage systems in the AWS cloud.

We did have one final team whose job was to Mavenize the ATG build. This might sound trivial but it was the most harrowing undertaking of them all. We felt we needed to try to modernize the ATG deployment process to allow teams to move faster as we estimated it would take at least three years to exit the ATG platform. To make that palatable, we had to automate the entire ATG deployment process. Given that the current build process was a 10,000 line recursive ANT script, we put some of our most masochistic personalities on it. Besides automating ATG, they had to separate the thirteen intertwined applications that were deployed together as an EAR file. A 2GB EAR file.

Goto Part XVI

A Digital Ecommerce Transformation – At the end of 2017, A look back at 2011 – Part XIII

Part XIII of a multipart series, to start at the beginning goto Part I.

If you’re keeping track at Part XII, we’ve gone through half of 2010 and all of 2011. It’s currently the last day of 2017, the temperature is -10F in Minneapolis going down to something around -20F later tonight. Whatever plans we might have come up with to celebrate the New Year got shelved for the option of staying home and being warm. Instead, let’s recap the story so far…

If you remember, I came to TWLER (The World’s Largest Electronics Retailer) in mid-2010 ostensibly to join a team rewriting the site. I took a role as a Hadoop Architect because for the previous two years I had been trying to start a Big Data consulting practice within the consulting company I was working for at the time. It was a few years too early for Big Data consulting, so I jumped at the chance to put into practice everything I had been discussing with companies around town. But that didn’t happen because the Hadoop project was killed before I even started. Instead I worked on Chef and learned Infrastructure as Code in the AWS cloud while establishing an automated testing team using JBehave to try and put in place functional testing of the site before making massive changes to the codebase.

While everyone knew the site was failing and needed help, there was no clear path and no engineering team that owned the site. It had been operated as a dumping ground for external integrators’ code for so long that everyone and no one claimed ownership. A separate team had sprung up to write the mobile site and their express purpose was to obsolete the main site. The Enterprise IT team claimed ownership, but the best they could do was point to all the developers from Accenture and Wipro working on various projects.

In these conditions, our team of seven architects began to claim ownership by asserting that, as employees of TWLER, we were responsible for the build and operation of the site. There was little resistance at first. No one wanted the site in 2010; it was seen as a sideline business, a necessary evil that distracted from the true engine of the company, big box stores.

However, the journey was not destined to be easy or straight, and the architect that hired us all left in March of 2011 as the funding for our rewrite was cut in half. It was in this limbo state that I gained the support of the architecture team and we collaborated to put forth a plan to rewrite the site. I began selling the plan across TWLER and gained a small amount of interest. But when the CEO declared we would double our ecommerce revenue in three years, my plan took off because it was the only plan available.

As we start 2018, I’ll harken back to 2012 and the twists and turns required to put in place the first layered cloud distributed service oriented ecommerce platform built by a major retailer.

Goto Part XIV

A Digital Ecommerce Transformation – Holiday – This Year is Always the Most Important One Ever – Part XII

Part XII of a multipart story, to start at the beginning goto Part 1.

Since we’re in the middle of Holiday 2017, I thought a digression on Holiday was in order.

If you have never worked in retail, then you have missed out on the grand experience we call “Holiday”. On the other hand, you’ve probably actually enjoyed the time of year from mid-November to Christmas while you celebrate with your friends and family, and take advantage of thousands of days of deals from the many retailers trying to get their share of wallet from you.

Holiday, with a capital H, is something that has to be experienced to be believed. In my first Holiday at TWLER in 2010, I was on a team that had just started writing code and had very little in production leading into Thanksgiving. The only offering we supported was the failure site: if the main site went down, we would quickly spin up the browse only site so consumers would be able to at least see what products we sold, and where our stores were located. In 2010, this was actually a pretty good thing since the ecommerce site was still less than 5% of revenue.

When you work in IT in a retailer, your entire year is judged on whether or not the systems you support survive the shopping onslaught of Holiday. In the online space, an ecommerce site might make 30% of its revenue in the five days from Thanksgiving to Cyber Monday. The TWLER site also experienced the third highest traffic of North American retailers during that time. This massive scale up to 20X normal daily traffic was largely accomplished without clouds in the 2000s. You had to take a really good guess as to how much infrastructure was needed, build it all out over the course of the year, and hope you weren’t overwhelmed by consumer behavior. You could easily receive 1M requests per second at the edge, and 100,000+ requests per second to your actual systems. If those requests were concentrated on the wrong systems, you could easily take down your site.

TWLER counts how long you’ve been at a company by the number of Holidays you’ve experienced. If someone asks how long you’ve worked there, you might say “four Holidays.” And every Holiday is the most important one yet, because those six weeks account for 50% or more of yearly revenue.

After a few Holidays, you realize the second the current year’s Holiday is over, you are immediately planning for the next one. There is no break. It’s like a giant tsunami that is slowly approaching, day by day. You can look over your shoulder and it’s always there, waiting to crash down on you and ruin your day. Once this year’s tsunami passes, you turn around and can see next year’s on the horizon.

In my six Holidays at TWLER, we experienced numerous outages, usually caused by either internal stupidity, or unexpected consumer behavior. In our first few years, we would purposely force our ecommerce site to use “enterprise” services because they were the “single source” for things like taxes, or inventory. This is a great notion, but only if the “enterprise” services were actually built to support the entire Enterprise. Since TWLER was store focused, this meant the “enterprise” services were often down at night for maintenance, or were not built to withstand massive surges in traffic. One million people refreshing a PDP to check for inventory on a big sale every few seconds quickly overwhelmed these services. So we often turned these services off and flew semi-blind, rather than have the site completely fail.

In other instances we tried to use various promotion functions embedded in our ATG commerce server. These seemed like useful things to easily set up a promotion like buy one get one. But when millions of people come looking for the sale, the vendor built commerce engines go down quickly by destroying their own database with the same exact calls, over and over again. They hadn’t heard of caching yet, I guess.

We would sometimes publish our starting times for various sales, saying a big sale is starting at 11AM, and send out millions of customer emails. The marketing teams loved the starting times and the technology teams hated them. We warned that setting a hard start time is a sure route to failure. Yet we did it multiple times and incurred multiple failures as the traffic surge brought down the site. There are physical limits even in clouds: you can only spin things up so fast, and 10M rqs will bring down most sites. After a few of these episodes, we did convince the marketing teams that it wasn’t the way to go and learned how to have sales with gradual ramp-ups in requests rather than massive surges.

Around 2013, the Black Friday shopping was so intense in the evening across the nation that the credit card networks themselves slowed down. Instead of taking a few seconds to auth a credit card, it started taking one or two minutes. This was across all retailers. However, the change in time caused threads to hang up inside our ecommerce systems and all of a sudden we ran out of threads as they were all tied up waiting for payments to happen. For the next year, we changed our payment process to go asynchronous so that would never happen again.
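Why a slower payment network exhausts threads follows from Little's Law: the average number of requests in flight equals the arrival rate multiplied by the time each one takes. With synchronous calls, every in-flight payment holds a thread. A minimal sketch with made-up numbers (the 200 payments per second and the 1000 thread pool are illustrative assumptions, not TWLER figures):

```python
def threads_in_use(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's Law: average concurrent requests = arrival rate x latency.
    With blocking calls, each concurrent request pins one thread."""
    return requests_per_second * seconds_per_request


POOL_SIZE = 1000       # hypothetical thread pool size
PAYMENT_RATE = 200     # hypothetical payment authorizations per second

normal = threads_in_use(PAYMENT_RATE, 3)     # auth takes a few seconds
degraded = threads_in_use(PAYMENT_RATE, 90)  # auth takes one to two minutes

print(f"normal:   {normal:.0f} threads busy, pool exhausted: {normal > POOL_SIZE}")
print(f"degraded: {degraded:.0f} threads busy, pool exhausted: {degraded > POOL_SIZE}")
```

The request rate never changed in this example; only the latency did. Making the payment step asynchronous breaks the link between downstream latency and threads held.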

There are many more stories of failure, but from every failure we learned something and implemented fixes for the next year’s wave. This is why Holiday in retail is such fun, every year you get to test your mettle against the highest traffic the world can generate. You planned all year, you implemented new technologies and new solutions, but sometimes the consumer confounds you and does something totally unexpected.

The last story is one where the consumer behavior combined with new features took us down unexpectedly. In 2014 we implemented “Save for Later” lists where you could put your items on a list that you could access later and add them to your cart. As Thanksgiving rolled around and the Black Friday sale went out at around 2AM, our Add to Cart function started getting pounded at a rate far higher than we had tested it for. We were seeing 100K rqs in the first few minutes of the sale. It rapidly brought the Add to Cart function to its knees, and we had to take an outage immediately to get systems back together and increase capacity.

This was completely unexpected consumer behavior, so what happened? It turned out that customers used the Save for Later lists to pre-shop the Black Friday sale and add all the things they wanted to buy into the lists. Then when 2AM rolled around, they opened their Save for Later lists and started clicking the Add to Cart buttons one after the other. A single customer might click 5-10 Add to Cart buttons in a few seconds. With hundreds of thousands of customers figuring out the same method independently, it led to a massive spike in Add to Cart requests. We effectively DDoSed our Add to Cart function with simultaneous collective human behavior.

I feel like I could keep going on Holiday for another two pages, but that’s enough for this year, maybe we’ll do it again in the all important next year.

Goto Part XIII

A Digital Ecommerce Transformation – The Last Failed Holiday – Part XI

This is Part XI, to start at the beginning goto Part I.

It was mid-November of 2011 when we had finally secured the funding to start the rewrite of the site. But that timeframe is right before Holiday, the time between Thanksgiving and Christmas that generally defines the Holiday season for retailers, and can account for 50% or more of their annual revenue. In particular, there are three days that make or break the business: Thanksgiving, Black Friday and Cyber Monday. On those three days the traffic to an electronics retail site increases by 10x or 20x. TWLER is the third highest scaling eCommerce site in North America during this time period, behind only Amazon and Walmart. Designing systems that can survive this type of load is difficult, and operating them is even more difficult. But we had the courage to believe we could do it.

First we had to try and get through Holiday of 2011 with our antiquated ATG system and poorly defined architecture that often called Enterprise services that were never designed for web scale. Why we called the Enterprise services was one of those well intentioned but clearly incompetent EA decisions large companies make. Not understanding the demands of ecommerce, the EA team had forced the Dotcom team to use Enterprise Tax, Payment, and Inventory systems. This isn’t a bad idea, it’s great to have Enterprise services, but only if they scale. If they are not designed with the Internet in mind then you have serious problems. In Dotcom, we operate with zero downtime, we have traffic 24 hours a day, we have demanding latency requirements and we scale 10-20X for a few days a year. The Enterprise services could not handle any of those requirements, so forcing the digital teams to use them out of a desire for reuse and lower costs is idiotic. But that’s how non-digital EA teams think.

So this Holiday, like the last one, was marred with outages, all attributed to Enterprise services that failed under load. It was painful but a great lesson in future architecture principles. The new site would be designed to operate regardless of whether Enterprise services were available. It didn’t matter what we had to do; we would isolate ourselves from systems not designed for web scale. This was actually a good thing for everyone, but the EAs didn’t agree because it violated some outdated EA principle.

The only saving grace for TWLER at this point in the maturity of eCommerce was that people wanted to buy from TWLER. We had good prices on many things and, even though we suffered through the Holiday, people basically distribute themselves in these situations: they make up for your lack of scale by trying again at less popular times. You shouldn’t count on this, but human behavior can be one of your scaling algorithms.

The need for a rewrite was confirmed yet again. After we limped through Cyber Monday we started getting down to the business of building teams to implement our new architectural direction. We had great ideas and a solid plan, but without high quality engineers to execute it, we would just be wasting our time.

Goto Part XII

A Digital Ecommerce Transformation – The Agile eCommerce Platform – Part IX

Part IX – To start at the beginning goto Part I

For three months we shopped the walking deck with anyone at TWLER (The World’s Largest Electronics Retailer) who would listen. We were seeking support and feedback on the direction. We were ensuring, at the least, that people had heard of us and knew we were actively pursuing a rewrite of the site. As we knew, there were three other teams that were trying to gain the necessary momentum to do the rewrite themselves, and we had to stand out and move faster. We made progress; occasionally teams would actually ask us to come review our plans for the site, or recommend others to hear the story.

I said I was bad at PowerPoint and now I’ll prove it. Here’s one of the first drawings of what we termed our Agile Ecommerce Platform made in 2011. Gradients seemed to be a thing in 2011 and the Platform drawing made great use of them.

But the major platform pieces were there and we had described the basic underpinnings of what an ecommerce platform delivered. The main idea was that the front end is loosely coupled to the platform, allowing the front end teams to move quickly and change the site content in real time if necessary. This was a break from the Java EE direction of JSP/Servlet architecture, and a break from standard Spring, where the UI was integrated all the way back to the database for ease of programming. Unfortunately, ease of programming also meant slow front ends and costly changes for a site of that size and at the speed at which the business needed to move. Instead, the front ends would be HTML, JavaScript and CSS, with data being transferred via JSON contracts.
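As an illustration of what a JSON contract between the platform and a front end can look like, here is a minimal sketch in Python. The `ProductSummary` shape and its field names are invented for the example and are not taken from the TWLER platform. The point is that both sides agree only on the fields and their types, so either side can change its internals freely.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProductSummary:
    """Hypothetical contract for the data a product page needs."""
    sku: str
    name: str
    price_cents: int
    in_stock: bool


# Field names and exact types that make up the contract.
CONTRACT = {"sku": str, "name": str, "price_cents": int, "in_stock": bool}


def to_json(product: ProductSummary) -> str:
    """Platform side: serialize the contract fields to JSON."""
    return json.dumps(asdict(product), sort_keys=True)


def from_json(payload: str) -> ProductSummary:
    """Consumer side: validate the payload against the contract."""
    data = json.loads(payload)
    for field, field_type in CONTRACT.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        # Exact type check, so a boolean is not accepted where an int is expected.
        if type(data[field]) is not field_type:
            raise ValueError(f"wrong type for field: {field}")
    # Unknown fields are ignored, so the platform can add fields
    # without breaking consumers written against the older contract.
    return ProductSummary(**{field: data[field] for field in CONTRACT})
```

Ignoring unknown fields is a deliberate choice here: it lets the platform evolve the payload without forcing every front end to release at the same time.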

There were other similarly horrible slides that animated the move from a monolith to a component-based system deployed into a cloud. Here’s one of the slides that showed the changes; sorry that the animations aren’t available and the translation to newer PPT didn’t work well.

Overall the quality of the slides didn’t matter; the ideas mattered and, more importantly, the credibility of the presenters was what ultimately made the difference.


A Digital Ecommerce Transformation – Making the New Mission: A Whole New Architecture – Part VIII

Part VIII – To start at the beginning goto Part I.

I’ll admit it, the deck I made was terrible. I’m not a master of PowerPoint, and the color scheme left a lot to be desired. I had crude animations showing how we would shift our monolithic application into the cloud, while retaining the customer data and checkout processes in the datacenter.

For about three weeks I worked mainly with another architect to take the many ideas we had discussed over the last year, and what we’d learned about operating in a cloud, and turn that into an architecture vision and implementation plan. We settled on three years to transform the ATG system to a distributed, service-oriented, layered cloud architecture. The deck outlined the current issues with the ATG system, the future state architecture and how we would get there, and the cost of the first year of development.

My colleague urged me to begin presenting the deck to interested parties to get feedback and learn what resonated with the various digital teams. He was instrumental in networking across the organization and arranging meetings with Directors, Senior Directors and VPs in Digital and Business teams.

The first presentations did not go well; the business leaders didn’t get much from a highly technical deck with $13M of capital tied to it in the first year. Mostly the feedback was, “We’ve heard this pitch multiple times over the last ten years, why should we believe you?” They had a point: numerous consulting firms had been through with grand plans to rewrite the site. It had already been attempted twice, the last attempt being a failed implementation of the Microsoft Commerce system that was relegated to powering the Canadian site and failing miserably even at that.

We regrouped and tried to determine what would make this a better presentation. We knew many of the core problems with the site and that the business teams had been unable to make changes in the homepage or product detail pages (PDPs) for years. There were a few decks kicking around that defined a UX-driven future for the site that would never be implemented due to technology failure. We decided to modify the deck and highlight that in the first year we would transform the homepage and PDPs into a new architecture that would allow fast changes and high scale, utilizing the CDN for more caching and isolating all calls to the cloud layer. In that way we would severely limit the number of calls making it back to the ATG commerce system running in the datacenter, allowing it to scale by relegating it to the Cart and Checkout functions.
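To show why caching in front of the commerce system limits the calls that reach it, here is a minimal time-to-live cache sketch in Python. It is a simplified stand-in for what a CDN or cloud caching layer does, not a description of the actual TWLER setup; `TtlCache`, `load` and the TTL value are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Hypothetical cache that serves repeat requests without calling the origin."""

    def __init__(self, load: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load            # the expensive call back to the origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)
        self.origin_calls = 0        # how many requests reached the origin

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: served from cache
        # Missing or expired: go to the origin once and remember the answer.
        self.origin_calls += 1
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With a 60 second TTL, a thousand requests for the same product page inside one minute cost the origin a single call, which is the effect that leaves the datacenter system free to handle Cart and Checkout.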

There wasn’t anything we could find that outlined a similar architecture so, as far as we knew, we were embarking on a bold new way to use clouds at scale.


A Digital Ecommerce Transformation – Disaster Strikes, Twice – Part VII

Part VII – Start at the beginning with Part I

In late 2010, as budgets were being prepared for 2011, TWLER’s (The World’s Largest Electronics Retailer) fortunes were taking a nosedive. The stock price was steadily declining from the 40s to the 30s on its way to the 10s in 2011. In that environment, increasing the funding for the web architecture team to continue its revamp of the site must not have appeared to be a viable project. The $7M in funding the VP Operations secured in 2010 was slashed to $3.5M for 2011. Disaster number one. It meant we had to cut back on projects and reduce headcount to meet that budget.

I was feeling restless with the limited ability to start new work and had found a CTO role at a local startup that I was considering. But even with the lowered funding, it was clear to me that the site had to be rewritten, and it was just a matter of time before someone got the funding to do it. Since there were three teams vying for that role, I felt that our team needed to make a larger effort toward securing the funding. I decided to pass on the CTO role in February of 2011 because I felt the team we had was able to accomplish the rewrite once the funding returned.

But, under the funding circumstances, the digital Chief Architect, Michael N., decided to pursue more interesting options, as it was clear to him that TWLER wasn’t serious about investing in the rewrite. That bombshell was dropped on us in March of 2011 at the Mongolian Barbecue place nearby, which earned a reputation as the exit announcement lunch spot after a few more exits were communicated there. The team was devastated and left to consider our options. Most of us were prepared to leave. I was told later there was a deadpool started on the business side and I was at the top of the list. However, I had just decided to rededicate myself to the goal of the rewrite and, even though I was now angry that I had let the CTO opportunity slip by, I still knew this had to happen.

As the VP Operations was deciding whom to put in the role to manage the team, I knew I had to step up and lead. I hadn’t had much interaction with the VP and was sure she didn’t know my background in managing teams and delivering large projects. I had also just finished an MBA and was itching to use the new knowledge.

The VP began trying to find a replacement and began looking at recruiting a Chief Architect level dude from Accenture. While that architect was quite good by the Accenture scale, we had worked with him enough to know he wasn’t cut out for the digital chief architect role and would have been a step down from any of us. We revolted en masse and made it clear we wouldn’t work for him. That left the VP without a candidate and we all decided to work that way for a few months.

During that time, I slowly started making team decisions and pushing direction with the rest of the team, and everyone decided to follow. I had extensive experience taking over teams without direct authority, as I had done it at every consulting gig over the previous 10 years; eventually the teams let me make the decisions. Additionally, I made sure to get time with the VP and let her know what decisions the team was making, to give her confidence we were making progress in this interim state.

But I knew the time had come to step up; it was time to put together a comprehensive plan to rewrite the site. I dropped all my current responsibilities and did what anyone at TWLER did when they were pursuing investment capital: made a deck.