How do you successfully modernize a 30-year-old system with over 12,000 programs that is relied on by over 3 million users and runs on a mainframe?
This article dives into a real-world example of how VMware Tanzu Labs approaches migrating a business-critical system from a mainframe infrastructure to a microservices architecture. We describe the approach and considerations we took to strangle a domain from a monolith at a safe, yet fast pace.
Setting goals for mainframe modernization
Modernizing the mainframe is part of a high-level program for our customer. The following were their overall expected goals:
- Reduce infrastructure costs which are worth millions of euros per year.
- Bring business knowledge to the new generation of developers since there are fewer and fewer people understanding COBOL in the company.
- Improve the development lifecycle as it takes months to get a new feature into production.
- Simplify the code and architecture.
Understanding the business domain
One of the first questions we face in this kind of modernization is: how can we start decomposing such a monolith, especially when the existing system to modernize is so massive?
To answer this question, we conducted an event storming workshop (one of the tools in our SWIFT methodology) to learn more about the customer's system. A big benefit of event storming is that it breaks down silos in a company and encourages people to talk to and understand each other. We strongly recommend using it to create a better understanding of large systems.
By gathering business experts, developers, ops, architects and having everyone collaborate on describing their processes, we were able to capture all of the important events that occur within the business processes of the mainframe. The result of this exercise is a visual representation of the system that helps identify what Domain Driven Design (DDD) calls bounded contexts. Such bounded contexts are usually good candidates for being applications or services on their own.
The people involved in the workshop first identified system pain points and improvement opportunities, then prioritized the bounded contexts using a dot-voting exercise to select the first facet they wanted to modernize.
After understanding the business processes covered by the mainframe a bit better, we also needed to have a view of the desired notional architecture, and how we could integrate it smoothly within the existing system.
Indeed, we often consider that incremental rewriting is safer than a big-bang switchover because it gives us the possibility to:
- Develop new features in the existing system before the end of the rewriting.
- Rewrite the existing features slice by slice and deliver them earlier in production.
- Identify problems earlier and reduce the costs of development in case of failure.
One of the consequences of this approach, however, is that we had to find ways to make the existing and new systems coexist.
Here is the big picture of the existing system architecture we worked on:
This is a classic multitier architecture in which each tier is technology-oriented and managed by a dedicated team. The end users interact with one big front-end application that provides tons of cross-domain functionalities. This application communicates with a Java 2 Platform, Enterprise Edition (J2EE) backend that delegates the operations to the mainframe through an Enterprise Service Bus (Oracle ESB). Hence the backend acts as a facade and doesn't contain any business logic.
A Lightweight Directory Access Protocol (LDAP) directory is used to authenticate the end users, and it also holds the configuration of the routes from the backends to the bus. More specifically, the backend application uses WebLogic RMI with the EJB/T3 protocol to communicate with the ESB (EJBs are only used as a pipe to transport XML data that contains the information about the command to execute). This command is then converted by the bus into a COBOL transaction.
Here is a selection of XML that illustrates the format used (the real format was a bit more involved):
<Payload>
  <Transaction>
    <!-- ... -->
    <Domain>D01</Domain>
    <Name>T01</Name>
    <!-- ... -->
  </Transaction>
  <Transaction-Data>
    <!-- the data of the transaction were placed here -->
  </Transaction-Data>
</Payload>
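To make the role of the `Domain` and `Name` elements concrete, here is a minimal Python sketch of how a component could read them from such a payload to identify the target transaction. The sample payload and the `transaction_key` helper are assumptions for illustration; the real format was more involved, and this is not the customer's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample payload, modeled on the simplified format shown above.
SAMPLE_PAYLOAD = """
<Payload>
  <Transaction>
    <Domain>D01</Domain>
    <Name>T01</Name>
  </Transaction>
  <Transaction-Data>
    <!-- the data of the transaction would be placed here -->
  </Transaction-Data>
</Payload>
"""


def transaction_key(payload_xml: str) -> tuple:
    """Return the (domain, name) pair that identifies the transaction."""
    root = ET.fromstring(payload_xml.strip())
    domain = root.findtext("Transaction/Domain")
    name = root.findtext("Transaction/Name")
    if not domain or not name:
        raise ValueError("payload is missing <Domain> or <Name>")
    return domain.strip(), name.strip()


print(transaction_key(SAMPLE_PAYLOAD))  # ('D01', 'T01')
```

A pair like this is all a router needs to decide where a command should go, which is what the coexistence patterns discussed later in the article rely on.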
The bus calls the mainframe and converts the COBOL results into XML for the backend to consume. The mainframe is relying on two different databases: IBM Db2 and DL/I. DL/I is a hierarchical database, but during the engagement, we were only concerned with IBM Db2.
There were some imposed constraints that had an impact on the decisions we made during our engagement:
- The business could not be directly involved in the transformation project (we usually prefer the opposite but it was not possible in this context).
- Existing business rules must be migrated as they are, even when they would be considered obsolete by the domain experts.
- It is not possible to modify the COBOL code to integrate the new system.
Because of these constraints, we were restricted in the choice of implementation patterns supporting the coexistence of the mainframe with the new system.
There were a few attempts to modernize the mainframe done by the customer in the last decade with varying degrees of success (e.g., application full rewrite and COBOL translation). However, the results were not satisfactory because the attempts were either too expensive or resulted in poor code quality.
They wanted us to do the modernization using a lean approach, relying on our extreme programming practices to see if we could strangle their monolith incrementally.
A few months after the event storming, we started an engagement with this customer to modernize their mainframe. Prior to beginning the work, we built a team consisting of people from the customer’s side and people from VMware.
As stated by Conway's Law, architecture goes hand-in-hand with organization; therefore, it is very important to take this into account and not solely focus on the architecture and technical aspects.
To deliver software at a fast and safe pace, we believe that traditional organizations structured in silos are not well suited because of their communication structure. That's why, when we work with a client, we always advocate for building a balanced team: a stream-aligned team that is typically responsible for the flow of work for a given business domain.
The VMware collaborators who join the customer’s team play the role of an enabling team: during the short-term period of the engagement, we share our practices and provide guidance to the stream-aligned team. Our goal is to make the customer's team autonomous after this period and to make our presence no longer necessary for them to continue their work.
We paired with six people on the customer’s side: a product manager, a business analyst, a COBOL expert, an ops engineer, and two software engineers. During the engagement, they were enabled to learn our practices and ways of working.
A helpful approach during engagements is to use a platform to deploy the applications. For example, a PaaS is very effective in relieving developers from managing production-like environments, and it enables them to focus on writing code rather than dealing with infrastructure issues or a ticketing system.
Fortunately, this customer was using the VMware Tanzu Application Service, which saved us a lot of time because we were able to deploy our microservices and automate the path to production.
The first day
The first workshop of the engagement is what we call an Inception—we use it to meet the team and stakeholders and to define our goals for the upcoming weeks, as well as the potential risks. At the end of this workshop, we had garnered a few user stories in a backlog to start working on the modernization.
Here are the main goals that were identified during the Inception:
- Shut down a COBOL transaction.
- Introduce a subdomain-specific microservice and its database.
- Identify patterns that allow the current and new systems to coexist.
- Teach the team lean and extreme programming practices.
- Improve the path to production.
Our approach to architecture
When we build something new that hasn't been built before, we usually follow the Tracer Bullets principle (as defined in the book The Pragmatic Programmer), meaning, we build and deploy our apps as early as possible in order to validate or invalidate our assumptions. This gives us the opportunity to eventually pivot towards another direction if needed. Using this approach makes it easier and cheaper to change the architecture, as well as gives us confidence that what we are building is really solving the problems we identified.
Iteration 1: Decoupling old and new
We started the separation by introducing an API gateway component to strangle the monolith. The role of this gateway was to connect the legacy world with the new system and to serve as an entry point for the upcoming web applications that interact with the different subdomains.
After validation with the enterprise architects, our idea was to do the following:
We wanted to configure the ESB to redirect the transactions of the domain to our gateway, and make the gateway call the bus again to call back the mainframe, acting just as a pass-through in this first phase. This is a concrete application of the Strangler Fig pattern. It permitted us to go to production safely without having to migrate anything, while still allowing us to build the new system incrementally without breaking the existing one.
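The pass-through idea can be sketched in a few lines of Python. This is an illustration of the Strangler Fig routing principle only, not the gateway we built (which was based on Spring Cloud Gateway); `StranglerRouter`, `call_mainframe_via_bus`, and the handler signatures are hypothetical names.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[str], str]       # takes a payload, returns a response
TransactionKey = Tuple[str, str]     # (domain, transaction name)


class StranglerRouter:
    """Route each transaction to the legacy system unless it was migrated."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[TransactionKey, Handler] = {}

    def migrate(self, domain: str, name: str, handler: Handler) -> None:
        # Register a new-system handler for one transaction.
        self._migrated[(domain, name)] = handler

    def handle(self, domain: str, name: str, payload: str) -> str:
        # Default to the legacy path: with no migrated transactions,
        # the router is a pure pass-through.
        handler = self._migrated.get((domain, name), self._legacy)
        return handler(payload)


def call_mainframe_via_bus(payload: str) -> str:
    # Stand-in for the call back to the mainframe through the bus.
    return f"legacy:{payload}"


router = StranglerRouter(call_mainframe_via_bus)
print(router.handle("D01", "T01", "<Payload/>"))  # legacy:<Payload/>

# Later, one transaction is taken over by the new system.
router.migrate("D01", "T01", lambda payload: f"new:{payload}")
print(router.handle("D01", "T01", "<Payload/>"))  # new:<Payload/>
```

Because the default branch always falls back to the legacy call, the router can be deployed before any transaction is rewritten, and transactions can then be moved over one at a time.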
You might wonder why we introduced a gateway at this point instead of building the subdomain microservice and having the bus redirect the transactions directly to it (as in the following picture).
This is a perfectly valid idea in some situations, but in this case, the gateway option was directly chosen for (at least) a couple of reasons:
- There was a plan to have some applications talking to the new system and the gateway would be their interface to the “new world.”
- Modernizing the mainframe was also seen as an opportunity to get rid of the ESB, and introducing the gateway was our way to do it.
The team was not sure about which technology to employ for the gateway, so we evaluated spikes based on Spring Cloud Gateway and on a homemade Golang gateway (a spike is a development method where we run experiments to identify possible solutions to a problem).
The focus here was mainly on performance since we wanted the gateway to be able to properly accommodate the traffic currently supported by the bus. Additionally, we wanted to evaluate whether or not the latency introduced by the new component would impact the user experience.
After experiments and measurements, we found that the performance of the two solutions was quite similar; however, the team decided to employ Spring Cloud Gateway because they felt more comfortable with a Java solution with enterprise support than with a self-built Golang gateway (which would have required them to master a new technology).
When the bus team was asked about redirecting the traffic to our gateway, we were told that the wait would be at least six months before being able to do the appropriate modifications in the ESB. Of course, it was not possible to wait this long so we had to change our strategy in the next iteration.
Iteration 2: Bypassing the bus
To avoid the long waiting period of modifying the ESB, we decided to introduce a new component that would play the role we expected the bus to play: transforming EJB calls into REST calls.
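As an illustration of what such a translation step involves, the following Python sketch turns an XML command into a description of a REST request. The URL scheme, the `RestRequest` type, and the `CustomerId` field are assumptions made up for this example; the article does not describe the actual mapping used by the component.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class RestRequest:
    method: str
    path: str
    body: str


def to_rest_request(payload_xml: str) -> RestRequest:
    """Map an XML command to a REST request (hypothetical URL scheme)."""
    root = ET.fromstring(payload_xml.strip())
    domain = root.findtext("Transaction/Domain")
    name = root.findtext("Transaction/Name")
    if not domain or not name:
        raise ValueError("payload is missing <Domain> or <Name>")

    # Flatten the direct children of <Transaction-Data> into a JSON object.
    data = root.find("Transaction-Data")
    fields = {}
    if data is not None:
        fields = {child.tag: (child.text or "").strip() for child in data}

    path = f"/domains/{domain}/transactions/{name}"
    return RestRequest("POST", path, json.dumps(fields, sort_keys=True))


# Hypothetical payload; <CustomerId> is an invented field.
request = to_rest_request("""
<Payload>
  <Transaction><Domain>D01</Domain><Name>T01</Name></Transaction>
  <Transaction-Data><CustomerId>42</CustomerId></Transaction-Data>
</Payload>
""")
print(request.path)  # /domains/D01/transactions/T01
print(request.body)  # {"CustomerId": "42"}
```

Keeping the translation a pure function of the payload makes it easy to test in isolation, before any network call is involved.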
The LDAP was reconfigured to reroute all the backend application traffic to this new component.
Meanwhile, we also created a microservice for the subdomain we were working on.
During this phase, the microservice captured one read-only transaction (and was told to do nothing with it yet). Our purpose here was just to bootstrap and integrate these new components together.
The benefits of all of these architectural iterations began appearing here. We set up a new system that captured the backend application traffic, redirected it back to the legacy system when necessary, and gave us the ability to rewrite the features we were interested in.
All of this then went to production without breaking anything from an end user perspective.
Iteration 3: Reading mainframe data
Next, the team rewrote the business rules of the chosen transaction. Because this transaction was read-only, it was not necessary to introduce a new database yet. We just fetched the data from Db2 when the business rules needed it. The purpose of this phase was to see if we could access the Db2 world from our new system in the different environments.
Iteration 4: Writing new data
During this step, we captured another transaction that was responsible for writing data. This is the moment we decided to introduce a database for our microservice.
Because it was very hard to understand the business rules implemented in the mainframe, we chose to apply the Parallel Run pattern and wrote a Comparator module, as defined by Sam Newman in his book Monolith to Microservices. With this pattern, rather than calling either the old or the new system, we called both and compared the results to verify that they were equivalent. At this point, only the old system was considered to be the source of truth.
This approach allowed us to verify that the rules were correctly implemented before transferring ownership of the data to the new system.
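A parallel run can be sketched as follows in Python. This is a simplified illustration of the pattern, not the Comparator module we wrote; the function name, the callback signature, and the choice to swallow failures of the new system are assumptions.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("comparator")


def parallel_run(
    request: str,
    legacy_call: Callable[[str], T],
    new_call: Callable[[str], T],
    on_mismatch: Optional[Callable[[str, T, object], None]] = None,
) -> T:
    """Call both systems, compare the results, and return the legacy result."""
    legacy_result = legacy_call(request)
    try:
        new_result = new_call(request)
    except Exception:
        # A failure in the new system must never affect the caller,
        # because the old system is still the source of truth.
        logger.exception("new system failed for request %r", request)
        return legacy_result

    if new_result != legacy_result:
        logger.warning(
            "mismatch for %r: legacy=%r new=%r", request, legacy_result, new_result
        )
        if on_mismatch is not None:
            on_mismatch(request, legacy_result, new_result)
    return legacy_result


# Usage with stand-in systems that disagree on purpose.
mismatches = []
result = parallel_run(
    "T01",
    legacy_call=lambda req: {"status": "OK", "amount": 100},
    new_call=lambda req: {"status": "OK", "amount": 101},
    on_mismatch=lambda req, old, new: mismatches.append((req, old, new)),
)
print(result["amount"])  # 100
print(len(mismatches))   # 1
```

The caller always receives the legacy result, so a rule that was rewritten incorrectly shows up as a recorded mismatch rather than as a change in behavior for the end users.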
This was how the architecture looked at that point:
The Comparator module was embedded in the subdomain microservice first, leaving the option for the team to move it to a separate application later, if needed.
Of course, each of these components was fully implemented following extreme programming practices (e.g., test-driven development, pairing, continuous integration, etc.), and we wrote pipeline configurations to deploy them in all of the customer’s environments.
At the end of the nine weeks, the team was feeling very confident about continuing the migration on their own and about using our practices, adapted to fit their context. Using an incremental approach helped us a lot in building that confidence, even in a very challenging environment.
One key point we learned from this engagement was the importance of taking the context into account rather than trying to copy and paste what others have done in different situations.
In short, modernizing mainframe applications is complex and requires lengthy endeavors. The described approach:
- Involved the business in the modernization efforts to avoid rewriting unused features.
- Kicked off the customer’s rewriting journey within just a few weeks.
- Accelerated the path-to-production lead time from 1–3 months to only 30 minutes.
- Enabled six people on the customer’s side, who were then able to:
  - Follow our lean and extreme programming practices to continue their modernization journey.
  - Deploy the new system coexisting with the old one without any interruption of service.
  - Deliver and evolve the architecture incrementally without impacting the end users.
About the author: Fouad Hamdi