Category Archives: Architecture

A Digital Ecommerce Transformation – Holiday – This Year is Always the Most Important One Ever – Part XII

Part XII of a multipart story, to start at the beginning goto Part 1.

Since we’re in the middle of Holiday 2017, I thought a digression on Holiday was in order.

If you have never worked in retail, then you’ve missed out on the grand experience we call “Holiday”. On the other hand, you’ve probably actually enjoyed the time of year from mid-November to Christmas while you celebrate with your friends and family, and take advantage of thousands of days of deals from the many retailers trying to get their share of wallet from you.

Holiday, with a capital H, is something that has to be experienced to be believed. In my first Holiday at TWLER in 2010, I was on a team that had just started writing code and had very little in production leading into Thanksgiving. The only offering we supported was the failure site: if the main TWLER.com went down, we would quickly spin up the browse-only site so consumers would be able to at least see what products we sold, and where our stores were located. In 2010, this was actually a pretty good thing since the ecommerce site was still less than 5% of revenue.

When you work in IT in a retailer, your entire year is judged on whether or not the systems you support survive the shopping onslaught of Holiday. In the online space, an ecommerce site might make 30% of its revenue in the five days from Thanksgiving to Cyber Monday. TWLER.com also experienced the third highest traffic of North American retailers during that time. This massive scale up to 20X normal daily traffic was largely accomplished without clouds in the 2000s. You had to take a really good guess as to how much infrastructure was needed, build it all out over the course of the year, and hope you weren’t overwhelmed by consumer behavior. You could easily receive 1M requests per second at the edge, and 100,000+ requests per second to your actual systems. If those requests were concentrated on the wrong systems, you could easily take down your site.

TWLER counts how long you’ve been at a company by the number of Holidays you’ve experienced. If someone asks how long you’ve worked there, you might say “four Holidays.” And every Holiday is the most important one yet, because those six weeks account for 50% or more of yearly revenue.

After a few Holidays, you realize the second the current year’s Holiday is over, you are immediately planning for the next one. There is no break. It’s like a giant tsunami that is slowly approaching, day by day. You can look over your shoulder and it’s always there, waiting to crash down on you and ruin your day. Once this year’s tsunami passes, you turn around and can see next year’s on the horizon.

In my six Holidays at TWLER, we experienced numerous outages, usually caused by either internal stupidity, or unexpected consumer behavior. In our first few years, we would purposely force our ecommerce site to use “enterprise” services because they were the “single source” for things like taxes, or inventory. This is a great notion, but only if the “enterprise” services were actually built to support the entire Enterprise. Since TWLER was store focused, this meant the “enterprise” services were often down at night for maintenance, or were not built to withstand massive surges in traffic. One million people refreshing a PDP to check for inventory on a big sale every few seconds quickly overwhelmed these services. So we often turned these services off and flew semi-blind, rather than have the site completely fail.

In other instances we tried to use various promotion functions embedded in our ATG commerce server. These seemed like a useful way to easily set up a promotion like buy one get one. But when millions of people come looking for the sale, the vendor-built commerce engines go down quickly by destroying their own database with the same exact calls, over and over again. They hadn’t heard of caching yet, I guess.
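The "same exact calls, over and over" problem is the textbook case for a read-through cache. Here is a minimal Python sketch of the idea, not what ATG or TWLER actually ran: `TTLCache`, `load_promotion`, and the 30-second TTL are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Read-through cache: repeat lookups within the TTL are served from memory."""

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh enough: no database call
        value = self._loader(key)  # miss or expired: load once and remember
        self._entries[key] = (now, value)
        return value


db_calls = 0


def load_promotion(promo_id: str) -> dict:
    """Hypothetical stand-in for the expensive promotion query."""
    global db_calls
    db_calls += 1
    return {"id": promo_id, "rule": "buy one get one"}


promotions = TTLCache(load_promotion, ttl_seconds=30.0)
for _ in range(10_000):
    promotions.get("bogo-tv")  # ten thousand identical lookups
print(db_calls)  # the loader ran once; the rest came from memory
```

The sketch is single-threaded for brevity. A real server would need locking around the miss path so that a burst of simultaneous misses doesn't all hit the database at once.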

We would sometimes publish our starting times for various sales, saying a big sale is starting at 11AM and send out millions of customer emails. The marketing teams loved the starting times and the technology teams hated them. We warned that setting a hard start time is a sure route to failure. Yet we did it multiple times and incurred multiple failures as the traffic surge brought down the site. There are physical limits even in clouds, you can only spin things up so fast and 10M rqs will bring down most sites. After a few of these episodes, we did convince the marketing teams that it wasn’t the way to go and learned how to have sales with gradual ramp-ups in requests rather than massive surges.

Around 2013, the Black Friday shopping was so intense in the evening across the nation that the credit card networks themselves slowed down. Instead of taking a few seconds to auth a credit card, it started taking one or two minutes. This was across all retailers. However, the change in time caused threads to hang up inside our ecommerce systems and all of a sudden we ran out of threads as they were all tied up waiting for payments to happen. For the next year, we changed our payment process to go asynchronous so that would never happen again.
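To make the thread problem concrete, here is a rough Python sketch of the asynchronous shape: the order is accepted right away and the card authorization runs on a small dedicated pool, so a slow card network can only tie up those workers. This is an illustration under assumptions, not the code we shipped; `payment_pool`, `authorize_card`, `place_order`, and `order_status` are hypothetical names.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict

# Dedicated pool: slow authorizations can only occupy these workers,
# never the threads that serve customer requests.
payment_pool = ThreadPoolExecutor(max_workers=4)
order_status: Dict[str, str] = {}


def authorize_card(order_id: str, delay_seconds: float) -> bool:
    """Hypothetical stand-in for the card network call; the sleep simulates a slow auth."""
    time.sleep(delay_seconds)
    return True


def _record_result(order_id: str, future: Future) -> None:
    """Runs when the authorization finishes and stores the final outcome."""
    ok = (not future.cancelled()
          and future.exception() is None
          and bool(future.result()))
    order_status[order_id] = "authorized" if ok else "payment_failed"


def place_order(order_id: str, auth_delay_seconds: float = 0.0) -> str:
    """Accept the order immediately; authorize the card in the background."""
    order_status[order_id] = "pending_payment"
    future = payment_pool.submit(authorize_card, order_id, auth_delay_seconds)
    future.add_done_callback(lambda f: _record_result(order_id, f))
    return "pending_payment"
```

Calling `place_order("order-1", auth_delay_seconds=120.0)` returns `"pending_payment"` at once, and `order_status["order-1"]` flips to `"authorized"` or `"payment_failed"` whenever the network finally answers. The trade-off is that the customer needs to hear about a declined card after the fact, for example by email.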

There are many more stories of failure, but from every failure we learned something and implemented fixes for the next year’s wave. This is why Holiday in retail is such fun, every year you get to test your mettle against the highest traffic the world can generate. You planned all year, you implemented new technologies and new solutions, but sometimes the consumer confounds you and does something totally unexpected.

The last story is one where the consumer behavior combined with new features took us down unexpectedly. In 2014 we implemented “Save for Later” lists where you could put your items on a list that you could access later and add them to your cart. As Thanksgiving rolled around and the Black Friday sale went out at around 2AM, our Add to Cart function started getting pounded at a rate far higher than we had tested it for. We were seeing 100K rqs in the first few minutes the sale was happening. It rapidly brought the Add to Cart function to its knees and we had to take an outage immediately to get systems back together and increase capacity.

This was completely unexpected consumer behavior so what happened? It turned out that customers used the Save for Later lists to pre-shop the Black Friday sale and add all the things they wanted to buy into the lists. Then when 2AM rolled around, they opened their Save for Later lists and started clicking the Add to Cart buttons one after the other. A single customer might click 5-10 Add to Cart buttons in a few seconds. With hundreds of thousands of customers figuring out the same method independently, it led to a massive spike in Add to Cart requests, we effectively DDOSed our Add to Cart function with simultaneous collective human behavior.

I feel like I could keep going on Holiday for another two pages, but that’s enough for this year, maybe we’ll do it again in the all important next year.

Goto Part XIII

A Digital Ecommerce Transformation – The Last Failed Holiday – Part XI

This is Part XI, to start at the beginning goto Part I.

It was mid-November of 2011 when we had finally secured the funding to start the rewrite of TWLER.com. But that timeframe is right before Holiday, the time between Thanksgiving and Christmas that generally defines the Holiday season for retailers, and can account for 50% or more of their annual revenue. Particularly, there are three days that make or break the business, Thanksgiving, Black Friday and Cyber Monday. On those three days the traffic to an electronics retail site increases by 10x or 20x. TWLER is the third highest scaling eCommerce site in North America during this time period, behind only Amazon and Walmart. Designing systems that can survive this type of load is difficult, and operating them is even more difficult. But we had the courage to believe we could do it.

First we had to try and get through Holiday of 2011 with our antiquated ATG system and poorly defined architecture that often called Enterprise services that were never designed for web scale. Why we called the Enterprise services was one of those well intentioned but clearly incompetent EA decisions large companies make. Not understanding the demands of ecommerce, the EA team had forced the Dotcom team to use Enterprise Tax, Payment, and Inventory systems. This isn’t a bad idea, it’s great to have Enterprise services, but only if they scale. If they are not designed with the Internet in mind then you have serious problems. In Dotcom, we operate with zero downtime, we have traffic 24 hours a day, we have demanding latency requirements and we scale 10-20X for a few days a year. The Enterprise services could not handle any of those requirements, so forcing the digital teams to use them out of a desire for reuse and lower costs is idiotic. But that’s how non-digital EA teams think.

So this Holiday, like the last one, was marred with outages, all attributed to Enterprise services that failed under load. It was painful but a great lesson in future architecture principles. The new TWLER.com would be designed to operate regardless of whether Enterprise services were available; it didn’t matter what we had to do, we would isolate ourselves from systems not designed for web scale. This was actually a good thing for everyone but the EAs didn’t agree because it violated some outdated EA principle.
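One common way to get that kind of isolation is to wrap every call to a fragile dependency so that a failure returns a degraded answer instead of an error, and to stop calling the dependency for a while after repeated failures. The Python sketch below shows that pattern, often called a circuit breaker with a fallback. It is an assumption about how such isolation can be built, not a description of the TWLER.com code; `IsolatedCall` and its parameters are hypothetical.

```python
import time
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class IsolatedCall(Generic[T]):
    """Wrap a dependency so its failures degrade the page instead of breaking it."""

    def __init__(self, call: Callable[[], T], fallback: T,
                 max_failures: int = 3, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._call = call
        self._fallback = fallback
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._open_until = float("-inf")  # circuit starts closed

    def get(self) -> T:
        now = self._clock()
        if now < self._open_until:
            return self._fallback  # circuit open: don't touch the dependency
        try:
            value = self._call()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                # Too many failures in a row: back off for the cooldown period.
                self._open_until = now + self._cooldown
                self._failures = 0
            return self._fallback
        self._failures = 0  # a success resets the failure count
        return value
```

A product page could then call something like `IsolatedCall(fetch_store_inventory, fallback="availability unknown").get()`, where `fetch_store_inventory` is whatever function talks to the enterprise service. The page flies semi-blind when the service is down, but it still renders.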

The only saving grace for TWLER at this point in the maturity of eCommerce, was that people wanted to buy from TWLER. We had good prices on many things and, even though we suffered through the Holiday, people basically distribute themselves in these situations, they make up for your lack of scale by trying again at less popular times. You shouldn’t count on this but human behavior can be one of your scaling algorithms.

The need for a TWLER.com rewrite was confirmed yet again. After we limped through Cyber Monday we started getting down to the business of building teams to implement our new architectural direction. We had great ideas and a solid plan, but without high quality engineers to execute it, we would just be wasting our time.

Goto Part XII

A Digital Ecommerce Transformation – CEO Inspires Panic – Part X

Part X – to start at the beginning goto Part 1

The major breakthrough for our effort to rewrite TWLER.com happened in 2011 when the CEO announced, without consulting anyone in digital, that TWLER.com would double its revenue in 3-5 years. It seems he had finally figured out that Wall Street wanted to see double-digit growth in ecommerce if it were going to believe TWLER had a future. When online retailers are eating up your market and showrooming becomes a verb applied towards your company, you know you have to make changes.

This announcement was met with panic internally. The digital team had been growing at a reasonable 10-12% per year for the last few years, and that level of growth was putting enormous strain on the site. Shifting to the 30% growth target to double within three years was unthinkable; they all knew the site would tank and that would be the end for everyone.
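The arithmetic behind the panic is easy to check. This small Python sketch assumes simple annual compounding, which is an assumption for illustration and not something stated in the announcement; the function names are hypothetical.

```python
import math


def years_to_double(annual_growth: float) -> float:
    """Years for revenue to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def growth_needed_to_double(years: float) -> float:
    """Constant annual growth rate required to double in the given number of years."""
    return 2 ** (1 / years) - 1


for rate in (0.10, 0.12, 0.30):
    print(f"{rate:.0%} growth doubles revenue in {years_to_double(rate):.1f} years")

for horizon in (3, 5):
    print(f"doubling in {horizon} years needs {growth_needed_to_double(horizon):.1%} per year")
```

At 10-12% a year, doubling takes roughly six to seven years. Doubling in three years needs about 26% a year, and even the five-year end of the CEO's range needs about 15%, so a 30% target leaves only a little margin on the three-year deadline.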

The CEO panic was our opportunity; many of these executives had seen our plans to rewrite TWLER.com and called us back in to review them again. Desperation was in the air and we were throwing them a lifeline. As it became clear people were now taking us seriously, my VP assigned a business leader to help us refine our pitch to make it more palatable for executives, and review the numbers to ensure they made sense. We settled on $13M for the first year growing to $20M for the subsequent years and ramping back down again in year four. The fever grew and we presented to the President of Digital, the CIO and finally made it to the people with the money, two EVPs that oversaw digital and stores.

As we prepped for that meeting my VP was determining who should give the presentation. She could give it, the more polished business manager could give it, or I could give it. In the end she decided to let me present it because she thought I showed the most authenticity, having written the deck and presented it at least 50 times in the past few months. I was grateful because I wanted to represent my ideas and make it clear that I was going to run this program.

The presentation went well, very well. The EVPs were excited to see what looked like a viable plan brought internally to them, rather than by consultants. They overlooked the raw slides and poorly done graphics and animations and granted us the $13M, asked if we wanted more, and said we should be done in 18 months, not three years. We agreed that 18 months was a better timeframe, but that was it. We all knew three years was a stretch, and 18 months was ridiculous if we wanted to evolve the site rather than build new. But if agreeing got us the capital, then we agreed.

One of the EVPs, still with TWLER to this day in 2017, performed what I consider the best management judo I’ve ever experienced. After the presentation she came around the table to chat with me about the project. She said, “I’ve seen a lot of presentations to fix TWLER.com, but this is the best one I’ve seen. Why? Because till now I didn’t believe the presenters could actually deliver, but I believe you can deliver.”

I’m sure she was just doing her management thing and making sure I knew she thought the project was important and that I was capable of doing it. But her statement motivated me for years to deliver TWLER.com so as not to disappoint her. Someday I’ll perform my own management judo on an unsuspecting engineer with a great idea and thank her for the lesson.

GOTO Part XI

A Digital Ecommerce Transformation – The Agile eCommerce Platform – Part IX

Part IX – To start at the beginning goto Part I

For three months we shopped the walking deck with anyone at TWLER (The World’s Largest Electronics Retailer) who would listen. We were seeking support and feedback on the direction. We were ensuring, at the least, that people had heard of us and knew we were actively pursuing a rewrite of TWLER.com. As we knew, there were three other teams that were trying to gain the necessary momentum to do the rewrite themselves, and we had to stand out and move faster. We made progress, occasionally teams would actually ask us to come review our plans for TWLER.com, or recommend others to hear the story.

I said I was bad at PowerPoint and now I’ll prove it. Here’s one of the first drawings of what we termed our Agile Ecommerce Platform made in 2011. Gradients seemed to be a thing in 2011 and the Platform drawing made great use of them.

But the major platform pieces were there and we had described the basic underpinnings of what an ecommerce platform delivered. The main idea was that the front end is loosely coupled to the platform, allowing the front end teams to move quickly and change the site content in real time if necessary. This was a break from the Java EE direction of JSP/Servlet architecture, and a break from standard Spring where UI was integrated all the way back to the database for ease of programming. Unfortunately, ease of programming also meant slow front ends and costly changes for a site the size of TWLER.com and the speed at which the business needed to move. Instead the front ends would be HTML, JavaScript and CSS with data being transferred via JSON contracts.

There were other similarly horrible slides that animated the move from monolith to component based system deployed into a cloud. Here’s one of the slides that showed the changes, sorry that the animations aren’t available and the translation to newer PPT didn’t work well.

Overall the quality of the slides didn’t matter, the ideas mattered and more importantly, the credibility of the presenters was what ultimately made the difference.

GOTO Part X

A Digital Ecommerce Transformation – Making the New Mission: A Whole New Architecture – Part VIII

Part VIII – To start at the beginning goto Part I.

I’ll admit it, the deck I made was terrible. I’m not a master of PowerPoint, and the color scheme left a lot to be desired. I had crude animations showing how we would shift our monolithic application into the cloud, while retaining the customer data and checkout processes in the datacenter.

For about three weeks I worked mainly with another architect to take the many ideas we had discussed over the last year, and what we’d learned about operating in a cloud, and turn that into an architecture vision and implementation plan. We settled on three years to transform the ATG system to a distributed service oriented layered cloud architecture. The deck outlined the current issues with the ATG system, the future state architecture and how we would get there, and the cost of the first year of development.

My colleague urged me to begin presenting the deck to interested parties to get feedback and learn what resonated with the various digital teams. He was instrumental in networking across the organization and arranging meetings with Directors, Senior Directors and VPs in Digital and Business teams.

The first presentations did not go well, the business leaders didn’t get much from a highly technical deck with $13M of capital tied to it in the first year. Mostly the feedback was that we’ve heard this pitch multiple times over the last ten years, why should we believe you? They had a point, numerous consulting firms had been through with grand plans to rewrite TWLER.com. It had already been attempted twice, the last attempt a failed implementation of the Microsoft Commerce system that was relegated to powering the Canadian site and failing miserably even at that effort.

We regrouped and tried to determine what would make this a better presentation. We knew many of the core problems with the site and that the business teams had been unable to make changes in the homepage or product detail pages (PDPs) for years. There were a few decks kicking around that defined the UX driven future of TWLER.com that would never be implemented due to technology failure. We decided to modify the deck and highlight that in the first year we would transform the homepage and PDPs into a new architecture that would allow fast changes and high scale utilizing the CDN for more caching and isolating all calls to the cloud layer. In that way we would severely limit the number of calls making it back to the ATG commerce system running in the datacenter allowing it to scale by relegating it to the Cart and Checkout functions.

There wasn’t anything we could find that outlined a similar architecture so, as far as we knew, we were embarking on a bold new way to use clouds at scale.

GOTO Part IX

A Digital Ecommerce Transformation – Disaster Strikes, Twice – Part VII

Part VII – Start at the beginning with Part I

In late 2010 as budgets were being prepared for 2011, TWLER’s (The World’s Largest Electronics Retailer) fortunes were taking a nosedive. The stock price was steadily declining from the 40s to the 30s on its way to the 10s in 2011. In that environment, increasing the funding for the web architecture team to continue its revamp of TWLER.com must not have appeared as a viable project. The $7M in funding the VP Operations secured in 2010 was slashed to $3.5M for 2011. Disaster number one. It meant we had to cut back on projects and reduce headcount to meet that budget.

I was feeling restless with the limited ability to start new work and had found a CTO role at a local startup that I was considering. But even with the lowered funding, it was clear to me that TWLER.com had to be rewritten, and it was just a matter of time before someone got the funding to do it. Since there were three teams vying for that role, I felt that our team needed to take a larger effort towards securing the funding. I decided to pass on the CTO role in February of 2011 as I felt the team we had had the ability to accomplish the rewrite once the funding returned.

But, under the funding circumstances, the digital Chief Architect, Michael N., decided to pursue more interesting options as it was clear to him that TWLER wasn’t serious about investing in the TWLER.com rewrite. That bombshell was dropped on us in March of 2011 at the Mongolian Barbecue place nearby, which claimed the reputation for the exit announcement lunch after a few more exits were communicated at that restaurant. The team was devastated and left to consider our options. Most of us were prepared to leave. I was told later there was a deadpool started in the business side and I was on top of the list. However, I had just decided to rededicate myself to the goal of the TWLER.com rewrite and, even though I was now angry I had let the CTO opportunity slip by, I still knew this had to happen.

As the VP Operations was deciding whom to put in the role to manage the team, I knew I had to step up and lead. I hadn’t had much interaction with the VP and was sure she didn’t know my background in managing teams and delivering large projects. I had also just finished an MBA and was itching to use the new knowledge.

The VP began trying to find a replacement and began looking at recruiting a Chief Architect level dude from Accenture. While that architect was quite good if using the Accenture scale, we had worked with him enough to know he wasn’t cut out for the digital chief architect role and would have been a step down from any of us. We revolted en masse and made it clear we wouldn’t work for him. That left the VP without a candidate and we all decided to work that way for a few months.

During that time, I slowly started making team decisions and pushing direction with the rest of the team, and everyone decided to follow. I had extensive experience taking over teams without direct authority as I did it at every consulting gig for the last 10 years; eventually the teams let me make the decisions. Additionally I made sure to get time with the VP and let her know what decisions the team was making to give her confidence we were making progress in this interim state.

But I knew the time had come to step up, it was time to put together a comprehensive plan to rewrite TWLER.com. I dropped all my current responsibilities and did what anyone at TWLER.com did when they were pursuing investment capital, made a deck.

GOTO Part VIII

A Digital Ecommerce Transformation – Fun With Clouds – Part VI

Part VI – Start at the beginning with Part I

While the business was busy stuffing their sorrys in a sack, our team was having some fun.

From July of 2010 to April of 2011 my role on the team was Architect, it was probably one of the most productive stints I can remember. During that time I learned Infrastructure as Code by building servers for Artifactory, Confluence, Jira, Crowd, and a number of other products using Chef. There we were reading the insanely poor documentation on the Chef site, 4-5 of us all learning Chef at the same time trying to get something working. Chef is similar to Grails in my head, too many magic mushrooms growing everywhere. If something doesn’t work, it might be your code, but it might be some unknown configuration that you missed. However, once you know where the magic starts and ends, they can both be quite useful.

I spent days on end building and stripping down infrastructure in the AWS cloud. We learned about Availability Zones, and Regions and how to operate in multiple locations at once. We watched AWS go down two or three times in that period but managed to weather all those outages with a little luck and forethought. We talked to vendors and startups building tools for clouds. We talked to other consumer enterprises building high scale websites for customers. We spent a lot of time reading High Scalability and the first edition of The Art of Scalability.  We wrote up comparisons between Riak, Cassandra, MongoDB and HBase. We tried to decide what might work best for a new distributed item catalog. We guessed, and hoped we didn’t end up like the guy that picked Cassandra for Digg in 2010.

For better or worse, the Digg disaster and the good relationship we struck up with Basho led us to choose Riak for our first NoSQL system in late 2010. We had great collaboration with the Basho engineers, we were helping them find the bugs in their system but the underlying technology was rock solid. In six years Riak never failed us, the only times we had problems were completely self-inflicted.

In the end we had numerous systems operating in AWS. The first was the failover site mentioned in Part I. If TWLER.com went down, we would switch over to the browse only site in a few minutes. We got to exercise this capability more than once. The second was the build infrastructure, our Atlassian suite, Artifactory and Jenkins were all cloud deployed. What we learned running production systems in the AWS cloud gave us the confidence to push towards a whole new architecture for TWLER.com.

GOTO Part VII

A Digital Ecommerce Transformation – The Business Viewpoint in 2010 – Part V

Part V – Start at the beginning with Part I

When I arrived at TWLER (The World’s Largest Electronics Retailer), it was clear that the digital business teams were sad, sad, sad, sad, sad, sad, sad, sad and very sad. After a couple weeks of reading through the ATG codebase I was also sad. Sad enough that I seriously considered searching for a new job because the code was such a mess. A massive mess. We were supposed to fix this?

The business teams were in charge of managing the site, adding new items, changing pricing, removing items from the site, fixing orders, making content, creating sale landing pages, emails, etc. Everything that keeps a large ecommerce site moving, usually referred to as site operations or business ops. The business teams had created a small shadow IT organization to try and maintain stability and make changes in the only way they could with IT controlling the ATG codebase. The slogans for the shadow IT teams were things like “Do more with less!” and “Any way to get it done!” The only recourse they had was to make changes to the UI via JavaScript and use a bypass of the deployment systems to post new JavaScript files directly onto the production servers. Since this was the largest known ATG cluster at more than 400 servers, this procedure was fraught with danger. Appalling, yes, but if that’s the only way to get something done then it falls within creative license.

The process to start a new project went something like this:

  1. Write an RFP for a new thing such as adding a marketplace to the browse and commerce portions of the site.
  2. Seek bids from the three IT integrators that were approved by IT.
  3. Receive bids back with one IT integrator, we’ll call them A, as the project manager and the other two IT integrators vying for delivery.
  4. Bids start at $1M and only go up. For something like a marketplace, $27M was closer to the mark.
  5. Sign the contract to start the work.
  6. Within a week, 20 onshore coordinators and 100 offshore developers magically appear and start wreaking havoc on the shared codebase.
  7. 9-19 months later, severely over budget, something resembling a marketplace appears and an attempt is made to merge it with the existing headstream, using a branch that started 9-19 months ago.
  8. Chaos ensues as every other project delivered between that time is broken and the IT integrator’s teams start fighting amongst themselves.
  9. After another two months, victory is declared, something buggy and barely working is deployed, the contract is finished and the 120 people disappear within a week.
  10. Bug fixes are now the responsibility of the shadow IT team mentioned above, to re-engage the IT integrators to fix all the problems they created needs a new RFP.
  11. Repeat this process until spirit is broken.

Surprisingly (sarcasm) the business teams were not very receptive to a new IT-like team coming in and telling them they were going to fix everything with Agile, DevOps, Cloud and really small engineering teams. As I was told numerous times when trying to engage the business to act as the SME for the Agile teams, “heard it before, new process, SOA architecture, will be able to work magic two years from now.” “Not buying it this time!”

There’s really only one solution to this problem (besides hiring a whole new business team) and that is to start delivering on your promises. That’s what we set out to do, but the environment made it unduly difficult for us.

GOTO Part VI

A Digital Ecommerce Transformation – First Architecture Forays – Part IV

Part IV – Start at the beginning with Part I

As the first team of architects to work for TWLER.com, we started mapping out the current situation and planning for the future. ATG was the base eCommerce engine and Oracle the RDBMS. Some attempt to provide enterprise services had been undertaken in the past and tax, inventory and payment had all been removed from the ATG codebase and were now called as enterprise services. There were a number of other integrations but these were the main services that caused issues in the digital world due to the disconnect in service levels that were present in the TWLER (The World’s Largest Electronics Retailer) environment.

The basic architecture of where we started is below, I take no responsibility for the IT side of the architecture:

 

It’s worth mapping out the organizational structure to start to understand the additional frictions present in TWLER’s attempts to run a digital eCommerce system.

The Digital teams were separated out from the IT teams many years ago and had remained in that state.  Digital was viewed as unimportant since stores drove the revenue, Dotcom was a sideline play, even in 2010.  Digital was run like a business and software development was matrixed out to the IT team. The Digital team had taken over its own operations at some point because the IT teams were unable to support 24×7 operations in their model. There were dotted lines between the Digital VP Operations and the IT VP Digital Portfolio, as well as between the Digital Chief Architect and IT Senior Director Digital.

The Business VPs in the Digital team drove the business projects, with input from the wider enterprise business teams in marketing and merchandising. Between the Digital Business and IT was a Business Relationship Manager, who was supposed to translate the business asks into IT requirements. The IT requirements were then shipped off to an enterprise integrator to spin up a new team and deliver the projects. Your success varies.

In addition, internally to the Digital team was an innovation team that had spun up a completely separate Mobile site on ATG, and managed the Mobile Apps. This team was quite clear in their goals to replace all of Dotcom with their mobile site at some point in the future. So now I knew that there were two teams with the goal to replace the main www.TWLER.com eCommerce site.

Finally, there was the Senior Director of Digital in the IT team who was mainly responsible for hiring Accenture to manage projects, and Wipro/TCS to execute projects. This was a very important role as someone had to manage our vendor partners and ensure quality delivery (since I can’t get sarcasm across in writing, read this last line with as much sarcasm as you can imagine). This Senior Director also had plans to rewrite the Dotcom site, but he had not gained support from the Digital VPs because they were extraordinarily unhappy with the quantity and quality delivered thus far.

To recap, in the first six months at TWLER I learned that there were at least three teams who felt they were responsible for rewriting TWLER.com. Additionally, the software development on the ATG platform was spread across multiple divisions, multiple vendors and multiple countries. All features were added by project teams who appeared when the money started, and vanished the second the money stopped, all support fell to the Digital VP of Operations. This VP was forced to contend with the mess of code and integrations that 1000s of developers were contributing to every day. I finally understood why she had moved forward with hiring her own team to rewrite TWLER.com. We owned the codebase but lacked capital, in year one we had $7M to get the team started. By the time 2011 rolled around, that amount was cut back to $3.5M.

GOTO Part V

A Digital Ecommerce Transformation – A Little More Background Before We Get Started – Part III

Part III – Start at the beginning with Part I

The TWLER.com (The World’s Largest Electronics Retailer) architecture team started with the chief architect and five additional people: myself, for high-scale Java applications and NoSQL; a TWLER consultant/architect who had lived through the entire life of TWLER.com and had recently converted to an employee; a second TWLER employee architect specializing in APIs and product catalogs; a cloud infrastructure architect; and another systems-thinking, high-scale Java architect. With a six-person team, we tasked ourselves with converting TWLER.com from its massive monolithic state to a new, not-yet-known state. What we decided early, or was decided for us, was that we were going to evolve out of the current state to a future state. It was decided for us by two constraints: not enough capital to build a new system separately while also running the current system, and the stipulation that business must continue unaffected by our efforts. At this time revenue was around $1.5 billion.

There were multiple projects that we started that first year: infrastructure automation of a modern build infrastructure deployed in a cloud, a cloud-based outage site, a distributed product catalog and QA automation. One more exploratory task was assigned to me: to look into the ATG Ant build and see whether there was any way of automating it with a dependency management framework.

The selection of these projects was built around the strategy to establish a modern engineering infrastructure that we could then leverage with all our future work. Having a robust Continuous Integration environment, universally available Git and SVN version control systems, artifact repository, wiki, task management and user authentication system would allow us to move quickly in the future. The distributed product catalog was built on the theory that any plan to exit ATG required a new distributed product catalog outside of the ATG system. QA automation was the final card needed to move past six-week manual release testing windows so that we could run thousands of regression tests in hours and speed up the release process.

My main project was QA automation, which was a successful disaster. It was successful in that we were able to automate many of the manual test cases using Selenium and JBehave, and a disaster in that the UI was rendered so inconsistently that all fields had to be accessed directly with XPath, and even then the pages sometimes rendered differently, causing the tests to fail. Additionally, we were unable to set up test data in a consistent manner, causing tests to fail randomly when underlying data was changed or deleted. In reality it was still better than the manual tests, but we struggled to maintain even 80% test coverage for more than a few days.

The most interesting work for me was delving into the ANT build file for ATG. I had plenty of experience with ANT in its heyday in the late 90s and early 2000s. I thought I knew ANT pretty well, but when faced with the 20+ ANT files that made up the build of 14 separate ATG applications, I had met my match. In my spare time I started tracing through the ANT files, determining how variables were set up and finding the main path through the files. I had to diagram out the actual workings of the files, as there were multiple instances of recursion occurring within the build process. My goal was to figure out how to convert the build over to Maven, which I’m sure many of you hate, but we felt that bringing in dependency management and forcing a standard file system layout were the most important considerations.
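For anyone facing a similar tangle, the kind of tracing I did by hand can be roughed out in a few lines of Python. This is only a sketch and not what we actually ran: the file names and XML below are made up for illustration, and it looks only at literal `<import file="...">` and `<ant antfile="...">` references. It does not expand `${property}` values, honor the `dir` attribute, or follow `<antcall>` recursion between targets inside a single file, all of which a real build would need.

```python
import xml.etree.ElementTree as ET


def build_file_graph(build_files):
    """Map each build file name to the set of build files it references.

    build_files is a dict of {file name: XML text}. Only literal
    <import file="..."> and <ant antfile="..."> references are followed.
    """
    graph = {}
    for name, xml_text in build_files.items():
        root = ET.fromstring(xml_text)
        refs = set()
        for elem in root.iter():
            if elem.tag == "import" and "file" in elem.attrib:
                refs.add(elem.attrib["file"])
            elif elem.tag == "ant" and "antfile" in elem.attrib:
                refs.add(elem.attrib["antfile"])
        graph[name] = refs
    return graph


def find_cycles(graph):
    """Return reference cycles, one per distinct set of files involved."""
    cycles = []
    seen = set()

    def visit(node, path):
        if node in path:
            # We came back to a file already on the current path: a cycle.
            cycle = path[path.index(node):]
            key = frozenset(cycle)
            if key not in seen:
                seen.add(key)
                cycles.append(cycle)
            return
        for nxt in sorted(graph.get(node, ())):
            visit(nxt, path + [node])

    for start in sorted(graph):
        visit(start, [])
    return cycles


# Hypothetical build files, invented for this example.
sample = {
    "build.xml": '<project><import file="common.xml"/>'
                 '<target name="all"><ant antfile="app.xml" target="dist"/></target></project>',
    "common.xml": '<project><target name="init"/></project>',
    "app.xml": '<project><target name="dist">'
               '<ant antfile="build.xml" target="all"/></target></project>',
}

for cycle in find_cycles(build_file_graph(sample)):
    print(" -> ".join(cycle + [cycle[0]]))
```

The depth-first walk revisits shared paths, which is fine for a couple dozen files but would need memoization for anything much larger.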

After a few weeks, it was my estimate that it would probably take a couple people 2-3 months to convert the ATG build from ANT to Maven, and that we could tackle it in smaller pieces by starting with the 13 builds that were not the main TWLER.com site. We shelved this idea for the time being, but when we did get back to it a year later, my estimate proved exceedingly optimistic.

GOTO Part IV