The Chief Architect’s Purpose

A few years ago I was attending the O’Reilly Software Architecture conference in London where I had prepared a talk on Target’s new vision for the future, a Target Retail Platform. After the conference I had a couple days of vacation and I stopped by an art exhibit at the Saatchi Gallery on Alexander Calder. It was there that I learned that Calder defined the Chief Architect’s purpose better than anyone.

The Calder quote at the top says: “I think I am a realist, because I make what I see. It’s only the problem of seeing it. If you can imagine a thing, then you can make it and tout de suite you’re a realist. The universe is real, but you can’t see it. You have to imagine it. Once you imagine it, you can be realistic about producing it.”

This quote struck me because I had spent the last few days thinking about architecture, and about how rarely the architecture conference I’d been attending talked about architecture. I had attempted to create a presentation on Target’s architecture vision: nothing about how we would build it or the technologies used, but simply the vision, principles and structure that would guide a 3,000-engineer team to leave its current reality and build a new one, a retail platform. I would guess that 80% of the presentations at the conference were on microservices. It was 2017, and all I could think was that architects are behind the curve if this is their first introduction to microservices. However, the microservices presentations were far better attended than my own, so the joke was on me.

I loved this quote because I often used the term Architecture Realist when people would ask me about my architecture style. I was only interested in creating architectures that would and could be implemented. I had found that most Chief Architects rarely present anything that engineering teams can actually use to guide their development, and instead fall back to edicts on technologies.

But what really struck me was the simple definition of Calder’s artistic approach, which coalesced for me into the Chief Architect’s purpose.

“The universe is real, but you can’t see it. You have to imagine it.”

This is it, this is the purpose of being a Chief Architect. It’s your role to take yourself out of your current reality and imagine a new one.

“Once you imagine it, you can be realistic about producing it.”

Calder could then go about executing his vision via a painting or mobile. Producing what you imagine within a large enterprise is not such a straightforward task.

The presentation I had created, “Platform Architecture for Omnichannel Retail,” was my attempt to convey what I had imagined for Target. Communicating a vision to 3,000 individuals so that they follow it is the Chief Architect’s job, not its purpose. I prepared the presentation for O’Reilly to force myself to document the vision in a way that could be consumed by a technical workforce.

Over the ensuing three years I’ve given some version of this presentation hundreds of times to small and large groups within Target. In the real world, producing an architecture means understanding and engagement from teams, which is best done in small groups where people feel comfortable asking hard questions.

What you’ll find is that you don’t always have the right answers; your reality, which you thought was perfect, was half baked. But if your principles were sound, working through the hard questions to answers that follow the principles of the architecture builds the patterns and practices necessary to produce the new reality.

For Target in 2020, the new reality is here: we have a Retail Platform.

The Simple Formula to IT Modernization

Having instigated and led through the digital transformation of BestBuy.com, and then moving to Target where we have been completely modernizing IT across the entire company, I’ve learned the simple formula to IT success.  While the formula is simple, the execution is what is inherently difficult.

In my first year at Best Buy, back in 2010 and 2011, I spent months pitching how we would rebuild BestBuy.com and move it to the cloud.  At the time, this was a radical idea, as clouds were new and unstable and were considered highly insecure by corporate IT leaders.  The pitch was centered on the standard IT formula of People, Process and Technology.  This formula is good, but it also has a big miss: there is no emphasis on outcomes.  The pitch worked and I was granted $13M to start the rebuild of BestBuy.com, a project that took four years and well over $100M to complete.

While we continued to use the People, Process and Technology formula in our decks and communication to leadership, I managed my team of 200+ engineers to outcomes.  We had to deliver a flexible, maintainable, layered cloud ecommerce platform that scaled infinitely or we were failures.  We implemented Product Management and Agile and morphed into a Product Engineering team that brought BestBuy.com into the modern world and did our part in the overall turnaround of the company.

For more on that see my 22 part series on the Digital Transformation of BestBuy.com.

The People, Process, Technology formula was great for selling to VPs and EVPs, but the dissonance between the sales pitch and the implementation kept me wondering about a better way.  Then I moved to Target, which helped me understand that I had already found a better way; I just didn’t have a name for it yet.

At Target, our CIO is the most architecture centric CIO that’s ever existed.  Most CIOs pay lip service to architecture, but then hand it off to an Enterprise Architecture team and say “go implement architecture.”  But when the architects try, they are constantly overruled or ignored because all good architecture decisions require tradeoffs in feature delivery in the short term.  Without an overarching vision, no business or IT leaders will make feature tradeoffs.

With little support or understanding from the CIO, the various IT VPs are free to flout architecture rules or governance, and therefore go off and implement their locally optimized solutions.  This is the core cause of IT inconsistency and sprawl, and why every SOA ever designed failed at enterprise scale.

At Target, we’ve used a different formula consistently for the last four years:

  • Architecture First
  • Team Second
  • Value Third

How does this compare to People, Process and Technology?  Let’s analyze it.

People and Team sound the same but they have different connotations.  People is generic and generally boils down to something about only hiring A players, because B players hire C players out of insecurity.  This assumes you already somehow have a bunch of A players, and that you are an A player too.  This is obviously ridiculous.

Team, however, is about getting people of all types and capabilities to work together.  It’s about maximizing talent by the group coming together and creating something more than the sum of its parts.  At Target we’ve strived to create a learning culture where the most granular breakdown of the organization is the team rather than the individual.  Inside the team, there are ranges of capabilities, often based on experience.  The environment encourages helping each other through pairing or mobbing, increasing everyone’s capabilities.

Process isn’t even part of the new formula though it is important.  Process actually gets rolled up into Value.  Value is what you are striving for, it’s the outcome of working on features and technology.  But if a feature doesn’t resonate with the customer, no actual value is delivered, although we have learned something that didn’t work.  Value, in the end, is how the customer perceives it, and how you measure it.  Delivery of value uses a process, in Target’s case Product and Agile.  Making Value measurable is the hard part.  Saying you delivered value by adding a new payment type is great, but measuring the impact in incremental sales through an experiment which tests whether a new payment type actually increases sales is better.
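As a minimal sketch of what measuring value through an experiment could look like, the following compares the conversion rate of a control group with that of a treatment group that is offered the new payment type. The function name, the visitor counts and the order counts are all hypothetical, and the two-proportion z-test is just one common way to check whether a lift is more than noise. It is not a description of Target’s actual tooling.

```python
import math

def conversion_lift(control_orders, control_visitors,
                    treatment_orders, treatment_visitors):
    """Return (absolute lift, z-score) for treatment vs. control conversion."""
    p_control = control_orders / control_visitors
    p_treatment = treatment_orders / treatment_visitors
    # Pooled conversion rate under the assumption of "no real difference".
    pooled = (control_orders + treatment_orders) / (control_visitors + treatment_visitors)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / control_visitors + 1 / treatment_visitors)
    )
    lift = p_treatment - p_control
    return lift, lift / std_err

# Hypothetical numbers, for illustration only.
lift, z = conversion_lift(
    control_orders=1000, control_visitors=50000,
    treatment_orders=1100, treatment_visitors=50000,
)
print(f"absolute lift: {lift:.4f}, z-score: {z:.2f}")
```

A z-score above roughly 2 is a conventional hint that the difference is unlikely to be chance alone; multiplying the lift by traffic and average order value would turn it into an incremental sales estimate.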

Technology and Architecture are often tied together, but the reality is Architecture is technology independent.  Technologies are tools, or, as architects like to say, implementation details.  Architecture is the vision, strategy and principles underlying and overlaying how every system is built and how it fits into the larger picture.  Getting the architecture right gives every engineering team a place to fit their work into how Target’s guests benefit.  Too often, engineering teams have no idea how they fit into the enterprise, so they make choices and build solely to please themselves and their sponsors.  But if the team understands how they benefit the company, they have a higher calling and are willing to make architecture tradeoffs.  

Getting the architecture right allows the company to achieve both known and unknown outcomes.  If we learned anything from the last two months of COVID lockdowns, it is that a good architecture allows you to flex, scale and build new capabilities overnight.  It allows you to withstand an instant 30% channel shift from store customer to online customer.

Architecture, Team, Value is the simple formula to IT modernization.  Just look at the recent outcomes.  

Commerce Tomorrow Podcast

I recently had the opportunity to sit down with Kelly Goetsch and Dirk Hoerig to record an interview on Target’s engineering culture as part of their Commerce Tomorrow podcast.  Kelly was in town for the Open Source North conference which we were both speaking at.  We sat down in the Target recording studio to tape the show.  I’ve been using the studio to create an internal podcast for Target Engineering so we had an audio engineer and figured out how to include Dirk from Germany.

Click here to hear the podcast.

A Digital Ecommerce Transformation – Holiday – This Year is Always the Most Important One Ever – Part XII

Part XII of a multipart story. To start at the beginning, go to Part 1.

Since we’re in the middle of Holiday 2017, I thought a digression on Holiday was in order.

If you have never worked in retail, then you have missed out on the grand experience we call “Holiday”. On the other hand, you’ve probably actually enjoyed the time of year from mid-November to Christmas while you celebrate with your friends and family, and take advantage of thousands of days of deals from the many retailers trying to get their share of wallet from you.

Holiday, with a capital H, is something that has to be experienced to be believed. In my first Holiday at TWLER in 2010, I was on a team that had just started writing code and had very little in production leading into Thanksgiving. The only offering we supported was the failure site: if the main TWLER.com went down, we would quickly spin up the browse-only site so consumers would be able to at least see what products we sold, and where our stores were located. In 2010, this was actually a pretty good thing since the ecommerce site was still less than 5% of revenue.

When you work in IT in a retailer, your entire year is judged on whether or not the systems you support survive the shopping onslaught of Holiday. In the online space, an ecommerce site might make 30% of its revenue in the five days from Thanksgiving to Cyber Monday. TWLER.com also experienced the third highest traffic of North American retailers during that time. This massive scale up to 20X normal daily traffic was largely accomplished without clouds in the 2000s. You had to take a really good guess as to how much infrastructure was needed, build it all out over the course of the year, and hope you weren’t overwhelmed by consumer behavior. You could easily receive 1M requests per second at the edge, and 100,000+ requests per second to your actual systems. If those requests were concentrated on the wrong systems, you could easily take down your site.

TWLER counts how long you’ve been at a company by the number of Holidays you’ve experienced. If someone asks how long you’ve worked there, you might say “four Holidays.” And every Holiday is the most important one yet, because those six weeks account for 50% or more of yearly revenue.

After a few Holidays, you realize the second the current year’s Holiday is over, you are immediately planning for the next one. There is no break. It’s like a giant tsunami that is slowly approaching, day by day. You can look over your shoulder and it’s always there, waiting to crash down on you and ruin your day. Once this year’s tsunami passes, you turn around and can see next year’s on the horizon.

In my six Holidays at TWLER, we experienced numerous outages, usually caused by either internal stupidity, or unexpected consumer behavior. In our first few years, we would purposely force our ecommerce site to use “enterprise” services because they were the “single source” for things like taxes, or inventory. This is a great notion, but only if the “enterprise” services were actually built to support the entire Enterprise. Since TWLER was store focused, this meant the “enterprise” services were often down at night for maintenance, or were not built to withstand massive surges in traffic. One million people refreshing a PDP to check for inventory on a big sale every few seconds quickly overwhelmed these services. So we often turned these services off and flew semi-blind, rather than have the site completely fail.
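Turning a dependency off while keeping the page alive can be sketched as a kill switch with a degraded answer. This is only an illustration of the idea; the class, the `enabled` flag and the `enterprise_inventory` stand-in are hypothetical and not how the real site was wired.

```python
class InventoryLookup:
    """Wraps a backend call with a manual kill switch and a degraded answer."""

    def __init__(self, backend):
        self._backend = backend
        self.enabled = True  # operators flip this off during a surge or outage

    def availability(self, sku):
        if not self.enabled:
            return "unknown"  # flying semi-blind: the page still renders
        try:
            return self._backend(sku)
        except Exception:
            return "unknown"  # treat a failing backend like a disabled one


def enterprise_inventory(sku):
    """Hypothetical stand-in for the store-focused enterprise service."""
    return "in stock"


lookup = InventoryLookup(enterprise_inventory)
print(lookup.availability("sku-123"))  # answered by the backend
lookup.enabled = False
print(lookup.availability("sku-123"))  # degraded answer, backend untouched
```

The design choice is that the product page never depends on the enterprise service being up: the worst case is a less informative page rather than a failed one.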

In other instances we tried to use various promotion functions embedded in our ATG commerce server. These seemed like useful things to easily set up a promotion like buy one get one. But when millions of people come looking for the sale, the vendor-built commerce engines go down quickly by destroying their own database with the same exact calls, over and over again.  They hadn’t heard of caching yet, I guess.
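To make the caching point concrete, here is a minimal sketch of a time-limited cache in front of an expensive lookup. The `TTLCache` class, the `load_promotion` function and the 30-second expiry are hypothetical; this is not the ATG API, just an illustration of how identical calls can be answered without touching the database each time.

```python
import time

class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no database call
        value = self._loader(key)  # miss or expired: one call to the backing store
        self._entries[key] = (now + self._ttl, value)
        return value


database_calls = []  # records every call that reaches the "database"

def load_promotion(sku):
    """Hypothetical stand-in for the expensive promotion query."""
    database_calls.append(sku)
    return {"sku": sku, "promotion": "buy one get one"}


promotions = TTLCache(load_promotion, ttl_seconds=30.0)
for _ in range(1000):
    promotions.get("sku-123")
print(len(database_calls))  # prints 1: the other 999 lookups were cache hits
```

This sketch is single-threaded; a production cache would also need locking or request coalescing so that a burst of simultaneous misses does not stampede the database.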

We would sometimes publish our starting times for various sales, saying a big sale is starting at 11AM, and send out millions of customer emails. The marketing teams loved the starting times and the technology teams hated them. We warned that setting a hard start time is a sure route to failure. Yet we did it multiple times and incurred multiple failures as the traffic surge brought down the site. There are physical limits even in clouds: you can only spin things up so fast, and 10M requests per second will bring down most sites. After a few of these episodes, we did convince the marketing teams that it wasn’t the way to go and learned how to have sales with gradual ramp-ups in requests rather than massive surges.

Around 2013, the Black Friday shopping was so intense in the evening across the nation that the credit card networks themselves slowed down. Instead of taking a few seconds to auth a credit card, it started taking one or two minutes. This was across all retailers. However, the change in time caused threads to hang up inside our ecommerce systems and all of a sudden we ran out of threads as they were all tied up waiting for payments to happen. For the next year, we changed our payment process to go asynchronous so that would never happen again.
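One way to picture the asynchronous approach is a queue between order placement and card authorization: the order is accepted as pending right away, and background workers absorb the slow network. The sketch below uses Python’s `asyncio`; the function names, the worker count and the simulated delay are assumptions for illustration, not the design that was actually built.

```python
import asyncio

async def authorize_card(order_id, delay):
    """Hypothetical stand-in for a slow call to the card network."""
    await asyncio.sleep(delay)
    return "authorized"

async def payment_worker(queue, statuses, delay):
    # Drains the queue in the background; a slow network only delays this worker.
    while True:
        order_id = await queue.get()
        statuses[order_id] = await authorize_card(order_id, delay)
        queue.task_done()

async def place_order(order_id, queue, statuses):
    # Record the order and reply right away instead of waiting for the auth.
    statuses[order_id] = "pending"
    await queue.put(order_id)
    return "pending"

async def main(order_count=50, workers=5, delay=0.01):
    queue = asyncio.Queue()
    statuses = {}
    tasks = [
        asyncio.create_task(payment_worker(queue, statuses, delay))
        for _ in range(workers)
    ]
    replies = [await place_order(i, queue, statuses) for i in range(order_count)]
    await queue.join()  # for the demo only: wait until every auth has finished
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return replies, statuses

replies, statuses = asyncio.run(main())
print(replies[0], statuses[0])
```

The customer-facing call returns in the time it takes to enqueue the order, so a card network that takes minutes to answer ties up only the background workers, not the request threads.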

There are many more stories of failure, but from every failure we learned something and implemented fixes for the next year’s wave. This is why Holiday in retail is such fun: every year you get to test your mettle against the highest traffic the world can generate. You planned all year, you implemented new technologies and new solutions, but sometimes the consumer confounds you and does something totally unexpected.

The last story is one where consumer behavior combined with new features took us down unexpectedly. In 2014 we implemented “Save for Later” lists where you could put your items on a list that you could access later and add them to your cart. As Thanksgiving rolled around and the Black Friday sale went out at around 2AM, our Add to Cart function started getting pounded at a rate far higher than we had tested it for. We were seeing 100K requests per second in the first few minutes of the sale. It rapidly brought the Add to Cart function to its knees, and we had to take an outage immediately to get systems back together and increase capacity.

This was completely unexpected consumer behavior, so what happened? It turned out that customers used the Save for Later lists to pre-shop the Black Friday sale and add all the things they wanted to buy into the lists. Then when 2AM rolled around, they opened their Save for Later lists and started clicking the Add to Cart buttons one after the other. A single customer might click 5-10 Add to Cart buttons in a few seconds. With hundreds of thousands of customers figuring out the same method independently, it led to a massive spike in Add to Cart requests. We effectively DDOSed our Add to Cart function with simultaneous collective human behavior.
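A quick back-of-the-envelope calculation shows how fast this adds up. The customer count, clicks per customer and time window below are hypothetical values picked from within the ranges described above, not measured figures.

```python
def peak_request_rate(customers, clicks_per_customer, window_seconds):
    """Average requests per second if every click lands inside the window."""
    return customers * clicks_per_customer / window_seconds

# Hypothetical inputs: 200,000 customers, 5 clicks each, within 10 seconds.
print(peak_request_rate(customers=200_000, clicks_per_customer=5, window_seconds=10))
```

Even at the low end of 5 clicks per customer, a couple hundred thousand shoppers acting within the same few seconds produce a request rate on the order of 100,000 per second against a single function.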

I feel like I could keep going on Holiday for another two pages, but that’s enough for this year, maybe we’ll do it again in the all important next year.

Go to Part XIII

Half Life of Code

What’s the half life of the code you are writing today?

Half life (not the game) is the term used to describe the decay of radioactive isotopes.  The longer the half life, the slower the decay.  If you have a gram of radioactive material, it will change over time until eventually all the radioactive material decays.

I like to think about the code we write as having a half life.  Well written code in a slowly changing area of an application has a long half life.  It doesn’t mean the code never changes, it just means only small changes occur over long periods of time.  The half life of the code may be in years (Caesium 134, half life of about 2 years).

However, brand new code in a rapidly changing area, say the new UI of your brand new site, has a half life of days (Manganese 52, half life of 5 or 6 days).  This would mean you’d expect half the code to be replaced in one work week.  The next week half of the remaining code (one quarter of the original) would be replaced, and so on, until virtually no code from the original work is left.
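The decay described above follows the standard half-life formula: the fraction remaining after some elapsed time is 0.5 raised to the power of elapsed time divided by half life. A small sketch, with a hypothetical function name and a half life of one work week:

```python
def fraction_remaining(elapsed, half_life):
    """Fraction of the original code still present after `elapsed` time units."""
    return 0.5 ** (elapsed / half_life)

# Half life of one work week: how much of the original code survives each week.
for weeks in range(5):
    print(weeks, fraction_remaining(weeks, half_life=1))
```

After four half lives only 6.25% of the original code is left, which is why heavy investment in short half life code rarely pays off.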

Thinking about half life is useful because it tells you how much effort you should be devoting to testing and ensuring the code is rock solid.  Long half life code should be well tested, documented and vetted for scalability.  Short half life code should be pushed out with little testing and few thoughts about scalability or maintainability.  Why?  Because the code will be gone by next week.

Unlike isotopes, the half life of code changes once the code is complete and in production.  Production marks a point where half life increases dramatically.  In fact, you should be actively cranking up the half life by making the code clean and scalable.

Still, there’s a limit depending on the velocity of change in the various parts of the application.  These days UIs evolve rapidly for consumer driven applications.  The half life is short and the amount of effort put into this code is low.  It should still work, but may not be something you’re proud to say you wrote.  Then again, you should be pleased, as you put forth the appropriate amount of effort.

Layered Cloud versus Hybrid Cloud Architecture

We had a great week at Openstack Summit in Portland.  See the article in Wired magazine for a short summary.  Or watch the Best Buy Openstack keynote.

One thing I learned from three days at the Openstack Summit is that I have always misconstrued the definition of Hybrid cloud architecture.  When we started making plans for our cloud architecture, I always thought of it as a Hybrid cloud.  At Openstack, there were numerous presentations on Hybrid cloud and all of them revolved around using the cloud to provide additional scaling for an application that runs in the datacenter.  In all cases, the datacenter architecture stack was simply recreated in the cloud and used for peak load.  The database master is in the datacenter and a slave exists in the cloud.  The Hybrid cloud architecture simply means using a cloud to elastically horizontally scale an existing application.

When I originally thought about Hybrid cloud I thought of an application that has one or more layers in the cloud, and the remaining layers in the datacenter.  I now call this a Layered Cloud architecture.  In our case we built our new product browse capability in the cloud and kept the remaining application in the datacenter.  All the data in the cloud was non-secure, basically public data, so there were few to no security issues.  We are keeping the commerce pipeline in the datacenter simply because it is easier to keep the commerce data and transactions in our secure datacenter.

This is a good example of assumptions clouding my view of reality.  I’ve read plenty of articles and information about Hybrid cloud, but until I was sitting in a presentation having someone tell me about Hybrid cloud, I never noticed my definition was incorrect.  Then, after recognizing this, I watched every presentation to determine which definition was used more frequently.  Unfortunately for me, all the definitions were the same and they did not support my original view.

Building a Culture of Architecture

As we remake BestBuy.com into a new platform, we are building a culture of architecture at the same time.  Prior to 2010, BestBuy.com had no holistic architecture team guiding its development.  Instead, a long series of projects simply bolted on more and more functionality until the resulting system was impossible to change predictably.  With little test and low regression coverage, any change in the system often resulted in unintended consequences.

In 2010 an architecture team was built and claimed ownership over BestBuy.com.  We began to involve ourselves in projects that affected the site.  We built a path and vision to remake the site into a next generation eCommerce platform.  But above all, we established that architecture mattered, and agile architecture would be our culture.  Our group of architects share similar architecture values: high involvement in development, decoupled flexible systems, TDD, small focused teams, high quality engineers, and letting architects lead projects rather than delivery managers.

The path of architecture has worked: teams with projects come find us now and we are involved with all aspects of the site.  We are slowly working our way towards an infinitely scaling cloud/datacenter SOA.  It is the architects who intervene when necessary, set engineering direction and mediate between all parties.  To make it work, the culture of architecture must be in place first.