Caveat: By “leader” I mean leadership at the CTO level as well as experienced engineers at the individual contributor level. And make no mistake - I am an engineer at heart and have made many mistakes along the way. This article is a way for me to learn from them.
Your two main objectives as a leader
On an engineering level the world often looks bright. New technologies emerge on a daily basis: cool new languages and new paradigms that are supposed to solve all problems. New companies promote XaaS systems that revolutionize your technology landscape in no time - or so they say.
From a leadership level (CTO/VP) the world looks different. All those new concepts and ideas are cool and potentially of great benefit to you - but first and foremost you have two main objectives when it comes to a potentially new architecture:
- Your department and your technology must delight customers and deliver cool new stuff - fast.
- You have to keep costs at bay. Being frugal is a competitive advantage - especially in times like these (recession cough cough)
But how do you evaluate new cool architecture solutions and bring together the leadership and the engineering perspective?
Trust the engineers - and bring in the leadership perspective
Many companies trust their engineers to make decisions on the architecture level. That’s a very good idea. Leadership then makes sure that new choices actually deliver on the two main objectives: they make the department faster in delivery (or at least not slower) - and they are cost effective.
In one of my past assignments I worked with a very talented department of engineers. The company was quite old and had a very old AWS PostgreSQL cluster that was used to store log data (and a lot of other data).
The problems piled up as the logging table grew larger and larger. Restarting the PostgreSQL cluster took ages - and recovery would have been very painful. There were also other problems like accountability: Nobody knew who sent or consumed data from that table.
Time to look for alternatives.
The team came up with the following solution:
- A central Kafka server (AWS MSK)
- A microservice that controls (think authentication) who sends data to and consumes data from Kafka
- Kafka would then stream the log data to Elastic (AWS OpenSearch) where the data would end up.
This solution looked good. But is this the best possible solution?
If you want to answer that question on a quantitative basis you’d have to compare that solution to alternatives in real production over a couple of years. That may work in a lab, but in real life it is too costly to be practical.
But there is a qualitative way by checking some simple questions.
A checklist for architecture decisions
The following checklist can be used to drill down to the important questions. It is a qualitative way to evaluate whether a proposal actually delivers on its promises, and it is equally useful for leadership staff and engineers.
- Is there a zero effort alternative? In the example above with the PostgreSQL system we could have simply pruned the data automatically to curb the size. Some consumers might not have liked that, but that might have been manageable. No need for 3 new systems and months of engineering.
- Requirements, problems and premature optimization. Was the problem really solved - or did we just fall in love with a technology? In many architecture discussions I’ve seen, the original problems were not solved in a better way. Often new problems were created that we did not have in the first place. Not good.
- Alternatives. Did we evaluate alternative architectures with pros and cons? We don’t have to overdo this. But only looking at one solution is a trap.
- Cost. A running PostgreSQL server and an organization that knows how to run it is trivial and cheap. Three completely new systems (Kafka (MSK), OpenSearch / Elastic and a custom microservice) are a massive undertaking and will tie up a lot of people and resources. Don’t forget to factor in these costs and add a price tag. How many people will we need? What do they cost per year?
- Development speed. Ask yourself: Will the new setup make us faster (cycle time, deployment frequency)? This matters especially if you expect some engineers to be occupied with maintaining systems and fixing bugs in the new setup.
- Debugging. How difficult will problems be to debug? OpenSearch has thousands of configuration options - and so does Kafka. Expect long discussions about shards, retention in partitions and more. Is that really what you want?
- Ownership. Who owns the new system? Who does upgrades and maintains the system? Who reacts when there are bugs? Is there a team - or do we create single points of failure?
- Maintenance. Do we have a plan for how to maintain the systems - are things like downtime taken into account?
- Do a pre-mortem. Once you think you have a solution - evaluate what could potentially go wrong. This allows you to discover hidden challenges and manage the risk during the transition phase afterwards.
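To make the "add a price tag" step from the Cost item a bit more tangible, here is a minimal Python sketch that compares the zero effort alternative with the three-system proposal. Everything in it is an assumption for illustration: the `Option` class, the headcounts, the infrastructure figures and the cost per engineer are invented placeholders, not numbers from the project described above. Plug in your own estimates.

```python
from dataclasses import dataclass


@dataclass
class Option:
    """One architecture option with rough yearly cost drivers (hypothetical model)."""
    name: str
    engineers: float            # full-time engineers tied up running the option
    infra_cost_per_year: float  # hosting and licence costs per year

    def yearly_cost(self, cost_per_engineer: float) -> float:
        # People cost plus infrastructure cost for one year.
        return self.engineers * cost_per_engineer + self.infra_cost_per_year


# All figures below are invented placeholders, not data from a real project.
COST_PER_ENGINEER = 120_000

options = [
    Option("Keep PostgreSQL, prune old log rows",
           engineers=0.25, infra_cost_per_year=20_000),
    Option("Kafka (MSK) + microservice + OpenSearch",
           engineers=3, infra_cost_per_year=60_000),
]

# Print the options from cheapest to most expensive.
for option in sorted(options, key=lambda o: o.yearly_cost(COST_PER_ENGINEER)):
    print(f"{option.name}: {option.yearly_cost(COST_PER_ENGINEER):,.0f} per year")
```

The point is not the exact numbers but that every option, including "do almost nothing", gets a yearly price tag that can be discussed next to its benefits.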
The IT landscape of a company will evolve over time. It’s the duty of everyone - but especially of senior and leadership staff - to ensure that these changes make the company faster and are cost effective. The checklist introduced in this article can guide you to an answer in a qualitative way. Use it whenever you have to rework part of the architecture of your company.
- Awesome photo on top by Alex Wong