Your engineers are telling you that the monolith needs to be split into several microservices. Work has slowed to a snail’s pace. It takes forever to get new features out. Changes often introduce bugs in areas of the code that weren’t even touched. It seems that any step forward is a step backward.
Have you ever cleaned a messy kid’s room? For me and my children it often starts like this, “Daddy I need help cleaning my room, it’s too hard!” The mess that they have made while playing has become too much to handle on their own. Monoliths are sometimes like that, built feature by feature until one day it’s too hard to move forward. Thankfully, there is a formula to clean up the mess.
Before I get started with my child I have to determine if they are really ready to clean up. If they are too tired they will not be able to focus. Splitting a monolith involves the same inspection. Are you sure the engineers that built the monolith want to change? Microservices require an increased level of production readiness from the owning engineers. That muscle is going to take time to develop. This may require you to sell the change and potentially move or remove those opposed.
Also, it takes a certain type of engineer to do this work. I love this work. Give me a problem like this and I’ll go into my hobbit-hole for six months and return with the end product. Engineers that prefer greenfield work will be bored to tears (or to quitting) by refactoring the same method call across 8,000 lines. It may be better to reserve them for maintenance and enhancement of the extracted services.
If the engineers are on board there’s another hidden obstacle. When my children clean their room sometimes they try to push everything to the corners or pile it in the closet. The room looks somewhat clean but really the mess is just hidden better. Under the hood nothing has really changed. Are the engineers working on decomposition going to lead the system to the same state?
This hidden obstacle is very hard to avoid due to Conway’s law. Conway’s law states that organizations are constrained to produce systems that match the communication structure of the organization. I have witnessed this at multiple companies. Do you have a top-down organization? You will get a few god services that run the show. Do you have an office environment where one team can turn around and talk to another? You will get a tangled mess where services reach into other services to get work done. To fix this you may need to restructure teams and the building. Segregate teams and ensure they communicate through the proper channels.
Now it’s time to begin cleaning the room. When a mess is especially bad I help my kids by pushing every toy to the middle of the room. My son likes this because it creates a huge pile of toys. Sometimes I go as far as emptying every toy box onto the pile. This large pile is your monolith. Each function is a toy, individually distinguishable but together a large mess.
When the large pile is created the next step is to start defining sub-piles. I might know that there are Lincoln Logs, Legos, and toy cars in the pile. My son and I will create three piles and begin picking through the large pile for these items. Filtering the large pile is a lot of work and not all of the piles are known up front. We may discover that there should be a pile for toy trains as they are uncovered.
Decomposing a monolith is similar. You may have a general idea of the bounded contexts (sub-piles) within the code. As you start refactoring though you will discover more that need to be created. Some of these may even preempt work currently in progress. This can be very frustrating for those going through the process for the first time. They might expect that an architect, someone in charge, or even they themselves have it all mapped out. With a large monolith that’s practically impossible, and even where it is possible it is ill-advised.
Creating these bounded contexts cannot be done in a vacuum. It can be very tempting to put the engineer who knows the most about an area in charge. This can end in disaster. The engineer may say something like, “This process is great for an unknown system but is unnecessary in this case because I know it so well.” You may end up with microservices but they will be structured like the existing services. That would be like dividing the massive pile into smaller piles and placing each in a separate room. The term for that mess is a distributed monolith, and it is actually worse than a normal monolith. Toy trains are in each room now and it’s hard to play with them all at once. Distributed monoliths cause network traffic and costs to shoot through the roof.
To form a bounded context, put a team together to start the discovery process. The team will interview domain experts, which will help them determine whether they have a valid sub-pile. Those domain experts will be used throughout the decomposition process to validate the bounded context along the way. Have the team read Domain-Driven Design before they start. It is difficult but imperative that this stage is done right.
Once the bounded context is created your engineers can begin the refactor work. This is like me and my son digging through the large pile to categorize toys into their appropriate sub-piles. This is where resolve comes in on your part. Project owners and managers will be frustrated or confused. Their feature work is even slower now. They may ask, “Why are we investing in the monolith when we’re just going to throw the work out?”
In a way they have a point. It would be better if the room could be picked up and things could be placed in order without creating the large pile. In some systems this is possible. Maybe you have some well defined contexts and just a few things are out of place. In other systems the monolith is too far gone. Those require the discovery and subdivision before splitting. In extreme cases it may feel like it would be faster just to rewrite everything. Burn the room down and start fresh. This is known as a big-bang rewrite and almost never works because you lose out on the learnings of the past. Besides, the end goal is not to throw out work but to extract it.
On the other end of resolve you may have to slow down the engineers that want to get to the end state fast. I have seen this masked as “Getting to value.” They may want to skip some steps because it’s too much work right now. You may have to encourage some good engineering hygiene. The patterns and practices are extra work but in the long run create a more robust system.
Getting to value is a great mindset though. During the decomposition process the way to get to value fast is to work within the monolith. Build a well defined API using the bounded context discovery work. Then wire that well factored API up using the existing messy code underneath. This will make the engineers cringe but will prove out that the API is valuable and correct. The code underneath can then be straightened up to match the well defined API.
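To make that concrete, here is a minimal sketch of a clean API wired to messy code underneath. It assumes a Python monolith and an orders context; every name in it (legacy_create_order, OrdersApi, the status codes) is hypothetical and only there for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


# --- Existing messy monolith code (hypothetical stand-in) ---
def legacy_create_order(cust, items, flag1=None, do_email=True):
    """Imagine a long, tangled function that returns a loosely shaped dict."""
    total = sum(price * qty for _, price, qty in items)
    return {"ID": 1001, "cust": cust, "tot": total, "st": "N"}


# --- Well defined API shaped by the bounded context discovery work ---
@dataclass(frozen=True)
class OrderLine:
    sku: str
    unit_price_cents: int
    quantity: int


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: str
    total_cents: int
    status: str


class OrdersApi:
    """Facade: callers depend on this contract, not on the legacy internals."""

    # Assumed mapping from cryptic legacy status codes to readable names.
    _STATUS = {"N": "new", "S": "shipped", "C": "cancelled"}

    def place_order(self, customer_id: str, lines: list[OrderLine]) -> Order:
        if not lines:
            raise ValueError("an order needs at least one line")
        # Translate to the legacy shape, call the old code, translate back.
        raw = legacy_create_order(
            customer_id,
            [(line.sku, line.unit_price_cents, line.quantity) for line in lines],
        )
        return Order(
            order_id=raw["ID"],
            customer_id=raw["cust"],
            total_cents=raw["tot"],
            status=self._STATUS[raw["st"]],
        )
```

Callers only ever see OrdersApi, Order, and OrderLine. That is what lets the body of place_order be straightened up later, or moved out of the monolith, without touching them.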
Once my son and I have created some tidy sub-piles then we begin moving them to boxes. When a box is well worn toys may spill from the holes so it is important to inspect them before use. Similarly, ensure the new APIs have well defined walls or seams. One API should not reach into another to change its state. A leaky seam may look like the orders-api storing its data in an orders-data-api, or an orchestration-api reaching into multiple APIs to “set things up.” This is more art than science, and a good design looks to minimize the network traffic on each request to the system. Systems should act as sources of truth and work even if one component is down. Eventual consistency is key here.
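One way to picture a clean seam is a rough sketch like the one below. It assumes two services and an in-memory queue standing in for a real message broker; the class names and event shape are made up for illustration. The orders service is the only writer of its own data, and the shipping service builds its own view from events instead of reaching in.

```python
from collections import deque


class EventBus:
    """In-memory stand-in for a message broker (an assumption for this sketch)."""

    def __init__(self):
        self._queue = deque()
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        # Publishing only records the event; it never calls another service.
        self._queue.append(event)

    def deliver_pending(self):
        # Runs later, for example once a consumer that was down is back up.
        while self._queue:
            event = self._queue.popleft()
            for handler in self._subscribers:
                handler(event)


class OrdersService:
    """Source of truth for orders. Nothing else writes to _orders."""

    def __init__(self, bus):
        self._bus = bus
        self._orders = {}

    def place_order(self, order_id, sku):
        self._orders[order_id] = {"sku": sku, "status": "placed"}
        self._bus.publish({"type": "order_placed", "order_id": order_id, "sku": sku})

    def status(self, order_id):
        return self._orders[order_id]["status"]


class ShippingService:
    """Keeps its own view, built from events rather than from orders' storage."""

    def __init__(self, bus):
        self._to_ship = {}
        bus.subscribe(self.on_event)

    def on_event(self, event):
        if event["type"] == "order_placed":
            self._to_ship[event["order_id"]] = event["sku"]

    def pending(self):
        return dict(self._to_ship)


bus = EventBus()
orders = OrdersService(bus)
shipping = ShippingService(bus)

orders.place_order(1, "toy-train")   # succeeds even though shipping has not caught up
bus.deliver_pending()                # shipping becomes consistent once events arrive
```

Placing the order does not wait on shipping at all. Between place_order and deliver_pending the two services disagree for a moment, and that gap is the eventual consistency.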
Do you need to pause all feature work while the decomposition process is going on? No, but you do need communication. New feature work needs to go through the same discovery process as the existing work. Then when you are sure which sub-pile the new work belongs to you can either create the pile or add it to the existing work.
The largest killer of this whole process is lack of resolve. When my son begins playing with toys instead of sorting them I have to gently correct him so that we stay on focus. To you this might look like a priority shift. Perhaps one of the products is on fire and you need to shift resources to fight it. Don’t! The truth is this refactor work is likely a multi-year process for the first system. For every week that you disrupt a team you likely set them back two.
Another killer is rushing. When you hear an engineer say it’ll take six months they might be overconfident and it’ll really take a year. It will be tempting to put the engineer with the lowest estimate in charge. The decomposition process takes a long time; there is no way to speed it up.
A cousin to rushing is attempting to throw more people at the problem. If I only had more engineers this project would move faster! The truth is once the bounded contexts have been defined you may only need a single engineer to execute on the refactor. Adding more people just increases the lines of communication and slows the work down.
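The arithmetic behind those lines of communication is simple. If every engineer may need to talk to every other engineer, the number of pairs is n * (n - 1) / 2. The helper below is a small illustration of that standard formula; its name is my own.

```python
def communication_lines(people: int) -> int:
    """Number of distinct pairs among `people` team members: n * (n - 1) / 2."""
    if people < 0:
        raise ValueError("people must be non-negative")
    return people * (people - 1) // 2


for size in (1, 2, 5, 10):
    print(size, communication_lines(size))
# prints:
# 1 0
# 2 1
# 5 10
# 10 45
```

Going from five engineers to ten doubles the headcount but takes the possible conversations from 10 to 45.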
The truth is the time to pull systems out will decrease dramatically after each system is removed. The first may be a multi-year project but the second and third will be faster. This is why my son likes the large pile. The first few sub-piles take time but once the cruft is out of the way it gets faster to sort toys. The same thing happens with code. The first refactor actually touches on the second and third. Each pass is a bit faster than the last.
So is it worth it in the end? A monolith in and of itself is not bad. I have seen monoliths scale businesses to 100M+ in revenue. I have also seen new products formed from the extracted APIs and new life breathed into old companies.
It can be worth it. I love seeing my son’s happy dance at the end of the process, “There is so much room to play daddy!” The questions you need to answer are: “How stuck are my engineers?” and “How much resolve do I have?” Once you know the answers to both you can begin the process.