Category Archives: Minneapolis

A Digital Ecommerce Transformation – CEO Inspires Panic – Part X

Part X – To start at the beginning goto Part I

The major breakthrough for our effort to rewrite TWLER.com came in 2011, when the CEO announced, without consulting anyone in digital, that TWLER.com would double its revenue in 3-5 years. It seems he had finally figured out that Wall Street wanted to see double-digit growth in ecommerce if it were going to believe TWLER had a future. When online retailers are eating up your market and showrooming becomes a verb applied to your company, you know you have to make changes.

This announcement was met with panic internally. The digital team had been growing at a reasonable 10-12% per year for the last few years, and that level of growth was putting enormous strain on the site. Shifting to the 30% growth target to double within three years was unthinkable; they all knew the site would tank, and that would be the end for everyone.
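The jump from 10-12% to a doubling target is easier to appreciate with compound-growth arithmetic. Below is a minimal Python sketch (the function names are illustrative, not from any TWLER system) showing that doubling in three years takes roughly 26% a year compounded, so the 30% target leaves a little headroom, and doubling in five years still takes about 15% a year, while 10-12% doubles revenue only every six to seven years.

```python
import math


def required_annual_rate(multiple: float, years: float) -> float:
    """Compound annual growth rate needed to grow revenue by `multiple` in `years`."""
    return multiple ** (1.0 / years) - 1.0


def years_to_multiple(rate: float, multiple: float = 2.0) -> float:
    """Years needed to reach `multiple` at a constant compound annual `rate`."""
    return math.log(multiple) / math.log(1.0 + rate)


if __name__ == "__main__":
    for years in (3, 5):
        print(f"Double in {years} years: {required_annual_rate(2, years):.1%} per year")
    for rate in (0.10, 0.12, 0.30):
        print(f"At {rate:.0%} per year, doubling takes {years_to_multiple(rate):.1f} years")
```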

The CEO’s panic was our opportunity; many of these executives had seen our plans to rewrite TWLER.com and called us back in to review them again. Desperation was in the air, and we were throwing them a lifeline. As it became clear people were now taking us seriously, my VP assigned a business leader to help us refine our pitch to make it more palatable for executives and to review the numbers to ensure they made sense. We settled on $13M for the first year, growing to $20M for the subsequent years and ramping back down again in year four. The fever grew, and we presented to the President of Digital and the CIO, and finally made it to the people with the money: two EVPs who oversaw digital and stores.

As we prepped for that meeting, my VP was deciding who should give the presentation. She could give it, the more polished business manager could give it, or I could give it. In the end she decided to let me present because she thought I showed the most authenticity, having written the deck and presented it at least 50 times in the past few months. I was grateful because I wanted to represent my ideas and make it clear that I was going to run this program.

The presentation went well, very well. The EVPs were excited to see what looked like a viable plan brought to them internally rather than by consultants. They overlooked the raw slides and poorly done graphics and animations, granted us the $13M, asked if we wanted more, and said we should be done in 18 months, not three years. We agreed that 18 months was a better timeframe, though we all knew three years was already a stretch and 18 months was ridiculous if we wanted to evolve the site rather than build new. But if agreeing got us the capital, then we agreed.

One of the EVPs, still with TWLER to this day in 2017, performed what I consider the best management judo I’ve ever experienced. After the presentation she came around the table to chat with me about the project. She said, “I’ve seen a lot of presentations to fix TWLER.com, but this is the best one I’ve seen. Why? Because till now I didn’t believe the presenters could actually deliver, but I believe you can deliver.”

I’m sure she was just doing her management thing and making sure I knew she thought the project was important and that I was capable of doing it. But her statement motivated me for years to deliver TWLER.com so as not to disappoint her. Someday I’ll perform my own management judo on an unsuspecting engineer with a great idea and thank her for the lesson.

GOTO Part XI

A Digital Ecommerce Transformation – Making the New Mission: A Whole New Architecture – Part VIII

Part VIII – To start at the beginning goto Part I.

I’ll admit it: the deck I made was terrible. I’m not a master of PowerPoint, and the color scheme left a lot to be desired. I had crude animations showing how we would shift our monolithic application into the cloud while retaining the customer data and checkout processes in the datacenter.

For about three weeks I worked mainly with another architect to take the many ideas we had discussed over the last year, along with what we’d learned about operating in a cloud, and turn them into an architecture vision and implementation plan. We settled on three years to transform the ATG system into a distributed, service-oriented, layered cloud architecture. The deck outlined the current issues with the ATG system, the future state architecture and how we would get there, and the cost of the first year of development.

My colleague urged me to begin presenting the deck to interested parties to get feedback and learn what resonated with the various digital teams. He was instrumental in networking across the organization and arranging meetings with Directors, Senior Directors and VPs in Digital and Business teams.

The first presentations did not go well; the business leaders didn’t get much from a highly technical deck with $13M of capital tied to it in the first year. Mostly the feedback was: we’ve heard this pitch multiple times over the last ten years, so why should we believe you? They had a point. Numerous consulting firms had come through with grand plans to rewrite TWLER.com, and it had already been attempted twice. The last attempt was a failed implementation of the Microsoft Commerce system, which was relegated to powering the Canadian site and failed miserably even at that.

We regrouped and tried to determine what would make this a better presentation. We knew many of the core problems with the site, and we knew the business teams had been unable to make changes to the homepage or product detail pages (PDPs) for years. There were a few decks kicking around that defined a UX-driven future for TWLER.com that would never be implemented because the technology couldn’t support it. We decided to modify the deck to highlight that in the first year we would move the homepage and PDPs to a new architecture that allowed fast changes and high scale, using the CDN for more caching and isolating all calls in the cloud layer. That way we would severely limit the number of calls making it back to the ATG commerce system running in the datacenter, allowing it to scale by relegating it to the Cart and Checkout functions.
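A minimal Python sketch of the routing split described above, assuming a simple path-prefix rule: the `/cart` and `/checkout` prefixes, the `Route` type and the `route` function are hypothetical illustrations, not TWLER’s actual configuration. Browse traffic such as the homepage and PDPs stays in the cloud layer and is marked CDN-cacheable; only Cart and Checkout requests travel back to ATG in the datacenter.

```python
from dataclasses import dataclass

# Hypothetical prefixes: the post says only Cart and Checkout stay on ATG.
DATACENTER_PREFIXES = ("/cart", "/checkout")


@dataclass(frozen=True)
class Route:
    layer: str       # "cloud" or "datacenter"
    cacheable: bool  # whether the CDN may cache the response


def route(path: str) -> Route:
    """Pick the layer that serves a request path."""
    for prefix in DATACENTER_PREFIXES:
        # Match whole path segments so "/cartoon" is not treated as "/cart".
        if path == prefix or path.startswith(prefix + "/"):
            # Per-customer, transactional traffic goes back to ATG, uncached.
            return Route(layer="datacenter", cacheable=False)
    # Homepage, PDPs and other browse traffic stay in the cloud layer
    # and can be cached at the CDN edge.
    return Route(layer="cloud", cacheable=True)


def datacenter_share(paths: list[str]) -> float:
    """Fraction of requests that would still reach the datacenter."""
    if not paths:
        return 0.0
    return sum(route(p).layer == "datacenter" for p in paths) / len(paths)


if __name__ == "__main__":
    traffic = ["/", "/product/123", "/product/456", "/cart", "/checkout/payment"]
    print(datacenter_share(traffic))  # 0.4
```

With sample traffic of three browse requests and two transactional ones, `datacenter_share` reports 0.4: only the Cart and Checkout calls reach ATG. Matching on whole path segments keeps a path like `/cartoon` from being misrouted to the datacenter.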

There wasn’t anything we could find that outlined a similar architecture so, as far as we knew, we were embarking on a bold new way to use clouds at scale.

GOTO Part IX

A Digital Ecommerce Transformation – Disaster Strikes, Twice – Part VII

Part VII – Start at the beginning with Part I

In late 2010, as budgets were being prepared for 2011, TWLER’s (The World’s Largest Electronics Retailer) fortunes were taking a nosedive. The stock price was steadily declining from the 40s to the 30s on its way to the 10s in 2011. In that environment, increasing the funding for the web architecture team to continue its revamp of TWLER.com must not have looked like a viable project. The $7M in funding the VP Operations secured in 2010 was slashed to $3.5M for 2011. Disaster number one. It meant we had to cut back on projects and reduce headcount to meet that budget.

I was feeling restless with the limited ability to start new work and was considering a CTO role I had found at a local startup. But even with the lowered funding, it was clear to me that TWLER.com had to be rewritten, and it was just a matter of time before someone got the funding to do it. Since there were three teams vying for that role, I felt our team needed to make a bigger push to secure the funding. In February of 2011 I decided to pass on the CTO role, because I felt our team had the ability to accomplish the rewrite once the funding returned.

But under those funding circumstances, the digital Chief Architect, Michael N., decided to pursue more interesting options, as it was clear to him that TWLER wasn’t serious about investing in the TWLER.com rewrite. That bombshell was dropped on us in March of 2011 at the nearby Mongolian Barbecue place, which earned a reputation as the exit-announcement lunch spot after a few more departures were announced there. The team was devastated and left to consider our options. Most of us were prepared to leave. I was told later that someone on the business side had started a deadpool and I was at the top of the list. However, I had just rededicated myself to the goal of the TWLER.com rewrite and, even though I was now angry that I had let the CTO opportunity slip by, I still knew this had to happen.

As the VP Operations was deciding whom to put in the role to manage the team, I knew I had to step up and lead. I hadn’t had much interaction with the VP and was sure she didn’t know my background in managing teams and delivering large projects. I had also just finished an MBA and was itching to use the new knowledge.

The VP started looking for a replacement and considered recruiting a Chief Architect-level dude from Accenture. While that architect was quite good by Accenture standards, we had worked with him enough to know he wasn’t cut out for the digital chief architect role and would have been a step down from any of us. We revolted en masse and made it clear we wouldn’t work for him. That left the VP without a candidate, and we all agreed to work that way for a few months.

During that time, I slowly started making team decisions and setting direction with the rest of the team, and everyone decided to follow. I had extensive experience taking over teams without direct authority, since I had done it at every consulting gig for the previous 10 years; eventually the teams let me make the decisions. Additionally, I made sure to get time with the VP and let her know what decisions the team was making, to give her confidence we were making progress in this interim state.

But I knew the time had come to step up: it was time to put together a comprehensive plan to rewrite TWLER.com. I dropped all my current responsibilities and did what anyone at TWLER.com did when pursuing investment capital: I made a deck.

GOTO Part VIII

A Digital Ecommerce Transformation – Fun With Clouds – Part VI

Part VI – Start at the beginning with Part I

While the business was busy stuffing their sorries in a sack, our team was having some fun.

From July of 2010 to April of 2011 my role on the team was Architect, and it was probably one of the most productive stints I can remember. During that time I learned Infrastructure as Code by building servers for Artifactory, Confluence, Jira, Crowd, and a number of other products using Chef. There we were, reading the insanely poor documentation on the Chef site, 4-5 of us all learning Chef at the same time and trying to get something working. In my head, Chef is similar to Grails: too many magic mushrooms growing everywhere. If something doesn’t work, it might be your code, or it might be some unknown configuration that you missed. However, once you know where the magic starts and ends, both can be quite useful.

I spent days on end building and stripping down infrastructure in the AWS cloud. We learned about Availability Zones and Regions, and how to operate in multiple locations at once. We watched AWS go down two or three times in that period but managed to weather all those outages with a little luck and forethought. We talked to vendors and startups building tools for clouds. We talked to other consumer enterprises building high-scale websites for customers. We spent a lot of time reading High Scalability and the first edition of The Art of Scalability. We wrote up comparisons of Riak, Cassandra, MongoDB and HBase. We tried to decide what might work best for a new distributed item catalog. We guessed, and hoped we didn’t end up like the guy who picked Cassandra for Digg in 2010.

For better or worse, the Digg disaster and the good relationship we struck up with Basho led us to choose Riak for our first NoSQL system in late 2010. We had great collaboration with the Basho engineers; we were helping them find bugs in their system, but the underlying technology was rock solid. In six years Riak never failed us; the only problems we had were completely self-inflicted.

In the end we had numerous systems operating in AWS. The first was the failover site mentioned in Part I: if TWLER.com went down, we could switch over to the browse-only site in a few minutes. We got to exercise this capability more than once. The second was the build infrastructure: our Atlassian suite, Artifactory and Jenkins were all deployed in the cloud. What we learned running production systems in the AWS cloud gave us the confidence to push toward a whole new architecture for TWLER.com.
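The post doesn’t describe how the switch to the browse-only site was triggered. As one hedged illustration, the Python sketch below fails over after a run of consecutive failed health checks and fails back only after a longer run of successes; the thresholds and the `FailoverController` name are assumptions, and in a real deployment the returned value would drive something like a DNS or CDN origin change.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverController:
    """Chooses between the primary site and a browse-only outage site.

    Thresholds are illustrative assumptions, not TWLER's real settings.
    """
    fail_threshold: int = 3      # consecutive failed checks before failing over
    recover_threshold: int = 5   # consecutive healthy checks before failing back
    active: str = "primary"      # "primary" (datacenter) or "outage" (cloud, browse-only)
    _fails: int = field(default=0, init=False)
    _oks: int = field(default=0, init=False)

    def observe(self, healthy: bool) -> str:
        """Record one health-check result and return the site that should take traffic."""
        if healthy:
            self._oks += 1
            self._fails = 0
            if self.active == "outage" and self._oks >= self.recover_threshold:
                self.active = "primary"
        else:
            self._fails += 1
            self._oks = 0
            if self.active == "primary" and self._fails >= self.fail_threshold:
                self.active = "outage"
        return self.active


if __name__ == "__main__":
    ctl = FailoverController()
    checks = [True, False, False, False, True, True, True, True, True]
    print([ctl.observe(ok) for ok in checks])
```

Requiring more consecutive successes to fail back than failures to fail over is a deliberate hysteresis choice: it keeps a flapping primary from bouncing customers between the full site and the outage site.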

GOTO Part VII

Fall Speaking Schedule

I have two talks coming up this fall, both at excellent conferences, so I hope to see you there:

October 16, 2017: O’Reilly Software Architecture Conference in London. I’ll be giving a talk on platform architecture in the retail space. Having worked in retail for the past seven years, I find the number and breadth of systems required staggering. Applying a common architecture pattern to all systems is essential to allow reuse and rapid development in all aspects of retail.

Next I’ll be staying local for the Midwest Architecture Conference in Minneapolis on November 9, 2017. I’ll be giving one of the keynotes, on how to restructure Enterprise Architecture so that it might actually be useful. Let’s all support our local architecture conferences; there aren’t many left!

A Digital Ecommerce Transformation – A Little More Background Before We Get Started – Part III

Part III – Start at the beginning with Part I

The TWLER.com (The World’s Largest Electronics Retailer) architecture team started with the chief architect and five additional people: myself, for high-scale Java applications and NoSQL; a TWLER consultant/architect who had lived through the entire life of TWLER.com and had recently converted to an employee; a second TWLER employee architect specializing in APIs and product catalogs; a cloud infrastructure architect; and another systems-thinking, high-scale Java architect. With a six-person team, we tasked ourselves with converting TWLER.com from its massive monolithic state to a new, not yet known state. What we decided early, or what was decided for us, was that we were going to evolve out of the current state to a future state. It was decided for us by two constraints: not enough capital to build a new system separately while running the current one, and the stipulation that the business must continue unaffected by our efforts. At this time revenue was around $1.5 billion.

We started multiple projects that first year: automating a modern build infrastructure deployed in a cloud, a cloud-based outage site, a distributed product catalog, and QA automation. One more task was assigned to me as exploratory: look into the ATG Ant build and see if there was any way to automate it with a dependency management framework.

The selection of these projects was built around a strategy of establishing a modern engineering infrastructure that we could then leverage in all our future work. Having a robust Continuous Integration environment, universally available Git and SVN version control systems, an artifact repository, a wiki, task management and a user authentication system would allow us to move quickly in the future. The distributed product catalog was built on the theory that any plan to exit ATG required a new distributed product catalog outside of the ATG system. QA automation was the final card needed to move past six-week manual release testing windows, so that we could run thousands of regression tests in hours and speed up the release process.

My main project was QA automation, which was a successful disaster. Successful in that we were able to automate many of the manual test cases using Selenium and JBehave; a disaster in that the UI was so inconsistently rendered that all fields had to be accessed directly with XPath, and even then the pages sometimes rendered differently, causing the tests to fail. Additionally, we were unable to set up test data in a consistent manner, so tests failed randomly when underlying data was changed or deleted. In reality it was still better than the manual tests, but we struggled to maintain even 80% test coverage for more than a few days.

The most interesting work for me was delving into the Ant build for ATG. I had plenty of experience with Ant in its heyday in the late 90s and early 2000s. I thought I knew Ant pretty well, but when faced with the 20+ Ant files that made up the build of 14 separate ATG applications, I had met my match. In my spare time I started tracing through the Ant files, determining how variables were set up and finding the main path through the files. I had to diagram the actual workings of the files, as there were multiple instances of recursion occurring within the build process. My goal was to figure out how to convert the build over to Maven, which I’m sure many of you hate, but we felt that bringing in dependency management and forcing a standard file system layout were the most important considerations.

After a few weeks, I estimated that it would probably take a couple of people 2-3 months to convert the ATG build from Ant to Maven, and that we could tackle it in smaller pieces by starting with the 13 builds that were not the main TWLER.com site. We shelved this idea for the time being, but when we did get back to it a year later, my estimate proved exceedingly optimistic.
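Untangling recursion in a pile of build files is essentially cycle detection on a call graph. A minimal Python sketch, assuming the graph of which file invokes which (through Ant tasks such as `<ant>`, `<antcall>` or `<import>`) has already been extracted; the file names below are hypothetical, and parsing the XML is left out. The depth-first search reports each back edge it meets as a loop, which is enough to flag where a build recurses.

```python
def find_cycles(graph: dict[str, list[str]]) -> list[list[str]]:
    """Depth-first search that reports every back edge as a cycle.

    Each cycle is returned as the path from the repeated file back to itself.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color: dict[str, int] = {}
    path: list[str] = []
    cycles: list[list[str]] = []

    def visit(node: str) -> None:
        color[node] = GRAY
        path.append(node)
        for callee in graph.get(node, []):
            state = color.get(callee, WHITE)
            if state == GRAY:
                # callee is already on the current call path: recursion.
                cycles.append(path[path.index(callee):] + [callee])
            elif state == WHITE:
                visit(callee)
        path.pop()
        color[node] = BLACK

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            visit(node)
    return cycles


# Hypothetical call graph: which build file invokes which.
calls = {
    "build.xml": ["common.xml", "site.xml"],
    "common.xml": ["macros.xml"],
    "macros.xml": ["common.xml"],
    "site.xml": ["macros.xml", "deploy.xml"],
    "deploy.xml": [],
}

if __name__ == "__main__":
    print(find_cycles(calls))  # [['common.xml', 'macros.xml', 'common.xml']]
```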

GOTO Part IV

A Digital Ecommerce Transformation – Front End Madness – Part II

Part II – Start at the beginning with Part I

In 2010, the cloud was not new, but it was ignored by large companies, particularly in non-technology segments such as retail. While it is clear now, at the time large retailers had not yet awakened to the new reality that a company’s prowess in software might decide its future.

As an example, in my first year at TWLER (The World’s Largest Electronics Retailer), store sales during December were not going well. Since many retailers make 50% or more of their annual sales in November and December, this spelled impending disaster. Traffic was down in stores, and the company’s reaction was to propose that the digital channel stop its free shipping offer. This clearly showed the company leadership’s inability to fathom online shopping. The logic was that if a customer could buy it online and have it shipped for free, they would not go to a TWLER store. Seems reasonable, but if a customer shopping online found we did not offer free shipping, they would simply click to the next store, probably Amazon, and buy their electronics there with free shipping. That customer was not going to a TWLER store that holiday, ever. We were actually saving sales for the company; it just wasn’t understood.

A minor digression on the state of the ATG system is necessary to understand what we were dealing with. The original ATG system was built in 2003; at the time, it was an excellent choice for a mid-sized retailer building its first ecommerce engine. But over time, and through numerous one-off projects, the codebase had morphed, intertwined and been generally neglected and abused. As an example, one ongoing project when I arrived was to widen the product detail page (PDP) and move the Add To Cart button from the left side of the page to the right side. This seemed fairly innocuous, but it took six months and well over one million dollars to accomplish. That seemed ridiculous to me, so as an architect, I dug into why this was happening.

It turns out there were multiple reasons why this project was practically impossible to complete. To start with, there were nine separate versions of the PDP, each made for a different category such as TVs, Music, Computers, etc. The nine PDPs all had common origins in some ancestral PDP, but after years of projects aimed at individual categories, they had all strayed in different ways, including using different JavaScript frameworks and versions to implement dynamic page elements. These PDPs were written in JSP/JavaScript, and each was well over 10,000 lines long, with actual Java intermixed into the JSPs themselves. Imagine trying to figure out how to change nine different pages, all implemented slightly differently in monstrous JSP files, with no test automation to tell you if you broke anything in the process.

This sounds bad enough, but the ATG executable was in the GB range, built as an EAR with a special Ant file that only one or two people understood (more on this later). It was necessary to build and run the entire EAR to determine whether the page changes worked, since the JSP code was so intertwined with the server-side code. However, it was impossible to actually run this EAR on a developer’s machine because it also required a full working copy of the Oracle database. No one had yet figured out how to make all of this work on a single desktop or laptop.

Instead, there was one shared development server for the entire Dotcom division. This server was a large Unix box, but still too small to serve thousands of developers trying to build and run the ATG codebase. It routinely failed due to lack of disk space and insufficient CPU. But it was the only place to build and run code you had worked on in your IDE, so everyone had to deal with it.

If you weren’t crying yet, the next step was to actually deploy the code to the staging environment (skipping the integration environment altogether). The reality was that the front end code only worked if it could access the Internet, since so many components of the page were downloaded externally. So even though you deployed the code to the shared developer environment, you couldn’t actually run it there. The staging build happened once a night.

To sum this up, the normal front-end development cycle is: change some code, save it locally, have your running app server pick it up automatically, and test it. This cycle time should be in the seconds range so you can work quickly and efficiently through all the little tweaks necessary to make a UI page look good and work as expected. The cycle at TWLER was: change some code, save it locally, do your best to make sure the page compiled, check in the code, push to your developer environment, do your best to check it compiled, wait for the overnight stage push (go home), and come back the next morning to see if the change worked in stage (assuming the push didn’t fail, which it often did). Instead of a cycle time of seconds for each code change, the cycle time was one day. One entire 24-hour day!

Did I mention zero automated regression tests?

Now I bet you think that $1M was cheap. In fact, I still don’t know how anyone actually got any work done in these conditions, but I do know that churn in the front-end development team was enormous.

GOTO Part III

A Digital eCommerce Transformation – A Multipart Series

It took six long years, but we successfully transformed a monolithic, 10-year-old, ATG-based commerce system into a completely distributed, infinitely scaling eCommerce platform. I’ll call the place TWLER, short for The World’s Largest Electronics Retailer.

I joined TWLER in 2010 as a Hadoop Architect on the small team tasked with rewriting TWLER.com. I interviewed in May and was offered the role, but didn’t start until July. About two weeks before I started, the VP called me to let me know that the Hadoop project was cancelled and that if I declined the offer they would understand. I had spent the previous two years learning about Hadoop and installing one of the first Hadoop clusters in the Twin Cities, on a bunch of old Dell towers, for a company called Peoplenet. Back in 2008, the VP there had the idea of building a remote data collection system that could take in one million messages per second. I tried to start up a Hadoop consulting arm for the consulting company I was working for at the time, Object Partners. However, it was a bit too early in the Twin Cities for companies to be interested in Big Data and NoSQL. We spent a lot of time talking to companies about the technology, but no contracts were forthcoming. I put together an Introduction to Hadoop presentation and gave it numerous times, all to no avail. So when TWLER came calling with a Hadoop Architect position, I jumped at the chance to actually use Hadoop at a large company. All that is to say I was quite disappointed when the role fell through. I seriously considered declining the position, since I had not yet resigned, and after two years of pushing a technology I really did want to see Hadoop in action. But I was quite tired of my second stint in consulting and ready to move on.

So I arrived at TWLER with no assigned role. At the time we were five architects under an Operations Vice President, and I had no idea of the organizational politics that were hindering a major system rewrite. The cancellation of one project should have been a clue to the future: funds were already being pulled from this team.

We started out under the tutelage of Michael N., a renowned architect who had worked on the initial version of TWLER.com. Our first task was to set up a modern development pipeline using Chef in AWS. As a reminder, this was 2010, when a large retailer doing anything in AWS was highly uncommon and Chef was practically brand new.

The pipeline we stood up at the time was fairly standard: Atlassian Stash for Git, Crowd for user management, Confluence for knowledge management, Jira for issue tracking, Bamboo for continuous integration, and Artifactory for artifact storage and a Maven repo. We used Chef to completely automate the deployment of these tools into AWS. While none of these tools was new to TWLER, they had not been combined, made externally accessible, or offered to anyone in the company who wanted to use them.

We used this infrastructure to start a small AWS-based project: a site that could be used during outages. The site would allow customers to look up locations, products and prices, but full commerce capabilities would not be available. It would reside in the cloud, ready to be deployed within minutes. It says something that the first thing we built was an outage site: TWLER.com was not a stable platform at the time.

GOTO Part II

Peak Architecture

I believe peak architect occurred somewhere around 2013, the year with the most people ever employed with architect titles. Since then, we have been entering the post-big-architect era, where architect roles are quickly shrinking in the tech world. This is a good thing.

The number of roles with architect titles in industry was getting ridiculous. I can riff 20 titles in less than a minute if I try: Data, Technical, Web, Digital, Application, Cloud, Security, Infrastructure, Network, Business, User Experience, Information, Middleware, Enterprise, Solution, Sales, Product, Project, Principal and, my personal favorite, SharePoint. The question is, what did all those architects do?

My contention is that they completely messed up most companies by proclaiming they knew how to select and manage technology at large companies and forcing vast numbers of engineers to conform to a highly limited set of technologies in the name of fiscal conservatism. Then they retreated to their own worlds to rest and consider the next great set of technologies they would enshrine five years from now.

What is the optimum number of architects for a company?

The answer to this, like all math problems, is 0, 1 or some hard-to-derive number. Zero and one are actually quite good answers. Zero is complete engineering anarchy: you’ve decided to let the engineers do whatever they want. If they’re good engineers, they might actually make an amazing system by developing external APIs that hide all the technology choices from the outside world.

One is also a great answer: you’re basically saying someone needs to think about the company as a whole, and one insanely smart person can get a whole company pointed in the right direction, or they can spend a lot of time thinking and watching engineers do whatever they want. If one architect stays in touch with the engineers and spawns ideas, that just may be enough to get people pointed in the right direction.

A better number is some ratio to the number of engineers. But what ratio to pick? Too many architects and you have them running over each other to tell engineering teams what to do, something like the commissioned sales model: “that team was mine, I talked to them first.” Too few and you’re not much better than 0 or 1. My personal favorite ratio right now is 1:100, one architect for every 100 engineers. That seems to be enough to get some rational standardization (if you don’t like standardization, deal with it; every company beyond startup mode has some set of standards), while allowing engineers to fight the power and pick new technologies when necessary. Since technology is progressing faster and faster, it’s essential to keep new technologies flowing into an organization and old ones leaving (sometimes 6 months is old!). Relying on a few individuals to follow these trends and pick the right tools for everyone is impossible. But sourcing newness from engineers willing to push a new idea past a few intentional roadblocks will keep your organization current enough to remain relevant.

Peak architecture has passed, time to jump on the bandwagon and become engineers again!

Open Source North Rides Again

We had another fun and exciting rendition of Open Source North in June earlier this year (2016). It’s a conference that eventually will be recognized as the premier technology event of the year in the Twin Cities. Well, that’s my prediction at least.

I had a great time and got to give my presentation/group discussion on the state of Architecture in the Midwest and how the practice needs to be redefined. Architecture means nothing these days: Enterprise Architect in its traditional form is dying, and Engineers are taking over the space once occupied by Architects. Why have an architect if you can have an engineer do the same job and deliver the working software as well?

Hope to see you again next year!