Enterprise Grade Platform Engineering at Charles Schwab

September 23, 2024 Michael Coté

This blog is based on "Navigating Market Storms by Leveraging VMware Tanzu and VMware Cloud Foundation at Scale," a panel with Schwab at Explore 2024.

Charles Schwab manages 35.6 million brokerage accounts and executes 5.49 million trades on an average trading day. The company manages $9.41 trillion in client assets. As you'd expect, they're a highly regulated company spanning several businesses including banking and personal investing. What's more, their customers expect their apps to be well designed and modern. That all sounds great to me because I've been a customer and frequent user for almost a decade. 

This all means Charles Schwab is a great example of how to run an enterprise application platform. They've used VMware Tanzu Platform for Cloud Foundry for several years. The practices they've adopted serve as a guide on the evolution of platform engineering. This is valuable for the many organizations putting platform engineering in place. One estimate says that the number of established platform engineering groups will almost double by 2026. The general vibes from that community match those numbers.

While platform engineering is just now starting to spread in large organizations, it's been in practice in small pockets for several years. Four members of Charles Schwab’s platform group recently spoke at Explore in a panel moderated by Nicky Pike. They discussed their experience with the Tanzu Platform for Cloud Foundry over the past four years.

The panel was composed of: Dor Anis, Director, Container Platform Organization; Christopher Murphy, Director, Container Platform Engineering; Rajesh Jaiswal, Sr. Manager, Container Platform Reliability; and Erik Scales, Senior Cloud Engineer.

Let's look at two things Charles Schwab is doing that make their platform engineering approach enterprise grade: scaling, and security and governance.

Scaling

During the week, every morning something predictable happens. The markets open up and people trade. See the chart above for what this looks like over the day. This is predictable, but occasionally the volume of trading spikes in unpredictable ways. This is the market storm. As Charles Schwab's Dor Anis put it: "If anyone remembers the Gamestop situation, that is a perfect example of a market storm when there is a run on the market in some way." The apps and services used to trade have to keep up with this intense volume of transactions.

To hear the panel tell it, a lot of what the platform team at Charles Schwab does is to ensure that they're ready for these market storms. They discuss several ways that they handle scaling. Let's look at one of them.

Ensuring platform scalability and resilience starts with making sure applications are architected appropriately. The Tanzu Platform for Cloud Foundry is built to support cloud native applications. 

The guidelines for creating cloud native applications are well-known and proven. "Generally speaking, follow the 12 factor app pattern, have a stateless application, and deploy it as a microservice to PCF, that is our guidance," Anis says. "It's pretty simple." (PCF, or Pivotal Cloud Foundry, is an earlier name for the Tanzu Platform for Cloud Foundry.)
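To make that guidance a little more concrete, here's a minimal sketch of one 12 factor habit: keeping configuration in the environment so the app itself stays stateless. This is my illustration, not Schwab's code. Cloud Foundry does hand apps a PORT variable and a VCAP_SERVICES JSON document describing bound services, but the service name "orders-db", the "uri" credential key, and the AppConfig class are assumptions made up for the example.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AppConfig:
    port: int
    database_url: Optional[str]


def load_config(env: Optional[Mapping[str, str]] = None) -> AppConfig:
    """Build all configuration from environment variables, never from local files."""
    env = os.environ if env is None else env

    # Cloud Foundry tells the app which port to listen on through PORT.
    port = int(env.get("PORT", "8080"))

    # Bound services arrive as JSON in VCAP_SERVICES, keyed by service label.
    # "orders-db" and the "uri" credential key are hypothetical names.
    services = json.loads(env.get("VCAP_SERVICES", "{}"))
    database_url = None
    for instances in services.values():
        for instance in instances:
            if instance.get("name") == "orders-db":
                database_url = instance.get("credentials", {}).get("uri")

    return AppConfig(port=port, database_url=database_url)
```

Because nothing is read from local disk or kept in memory between requests, the platform can start, stop, and multiply instances of an app like this without coordinating with the developers.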

This is a role for the platform engineering group that I think is underappreciated: they need to play a role in specifying what type of application architectures work on the platform. It's tempting to think in a more traditional way where the operations staff have to support a variety of application architectures that come their way. And, in large organizations built up of years of acquisitions, this is an inescapable reality for parts of their app portfolio. 

But, when the platform team can drive consistency in application architecture, they can start to make promises about resilience, reliability, supportability, scalability, and the other "ilities." A platform engineering team that specifies what types of architectures the platform supports is putting in place a contract. "Write your applications this way, and we can ensure that they run well in production." Site Reliability Engineering thinking brought this idea of "contracts" into enterprise operations, and the Schwab team talks about that way of thinking frequently.

That "contract" extends beyond the app architecture. One example of that is in the continuous integration and continuous deployment (CI/CD) pipeline. In contrast to traditional approaches where individual developers or development teams create their own build pipelines, many platform teams standardize the CI/CD pipeline. This allows platform teams to control how applications are built, configured, and ultimately deployed. For Charles Schwab, this kind of thinking is key. As Rajesh puts it, "today anything which goes to the platform is via automation." This allows the team to control app configuration and put in controls for things like quota, security groups, and other operations configuration.
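Here's a rough sketch of what one of those pipeline controls could look like. The panel doesn't describe how Schwab implements its checks, so everything here is hypothetical: the Guardrails limits, the buildpack allow-list, and the shape of the app dictionary are assumptions made up to show the idea of a standardized pipeline refusing a deployment that breaks the contract.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Guardrails:
    # Hypothetical limits a platform team might set; not Schwab's real values.
    max_memory_mb: int = 2048
    max_instances: int = 20
    allowed_buildpacks: frozenset = frozenset({"java_buildpack", "python_buildpack"})


def check_app(app: dict, rules: Guardrails = Guardrails()) -> List[str]:
    """Return a list of violations; an empty list means the deploy may proceed."""
    violations = []

    memory = app.get("memory_mb", 0)
    if memory > rules.max_memory_mb:
        violations.append(f"memory {memory}MB exceeds quota of {rules.max_memory_mb}MB")

    instances = app.get("instances", 1)
    if instances > rules.max_instances:
        violations.append(f"{instances} instances exceeds limit of {rules.max_instances}")

    for buildpack in app.get("buildpacks", []):
        if buildpack not in rules.allowed_buildpacks:
            violations.append(f"buildpack {buildpack} is not on the approved list")

    return violations
```

A pipeline step would run a check like this before pushing anything to the platform and fail the build when the list isn't empty. The point is that the rules live in one place, owned by the platform team, rather than in hundreds of hand-built pipelines.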

To me, what the Charles Schwab team is doing is making sure they have the controls in place to scale how they manage all those applications. This removes burden for the application developers, but also allows the platform team to manage the apps in production. Introducing this consistency comes in handy when the team needs to scale applications. "If you want to sync your apps, you don't have to reach out to the hundreds of application owners to do the deployments at a platform level," Rajesh says.

One of the fundamental principles of Cloud Foundry is that developers should not build and package the containers for their applications. Instead, developers use buildpacks to specify how their applications should be built and containerized. This allows the platform team to control and automate those application builds. In another talk from Explore, Scott Rosenberg from TeraSky gave a great overview of why this principle is a good idea. A lot of the benefits of using buildpacks are focused on security, but there are basic operations benefits as well.

The Charles Schwab panel cited two of them. First, the platform team doesn't need to work with each application team when there are configuration and scaling changes they need to make. Second, and often under-appreciated, you have a more reliable accounting of all of your apps, where they're running, what configuration they have, which teams are responsible - you know, all that CMDB stuff we've been chasing. Keeping an accurate record of all the apps and services running in an organization has historically been very difficult. Because the platform is building and deploying the applications, and in a consistent, standardized way, the platform can easily keep track of everything.

The approach of building apps for the developers also means you can find and improve poorly configured applications. The platform team can inspect what's in each application and make sure it's up-to-date and able to handle scaling requirements. "If you're not doing the right tech stack, if you're on the older version," Rajesh says, "we are able to take that action and raise the alarm early ..."
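Once the platform holds that inventory, raising the alarm early can be a simple query. The sketch below is my own illustration of the idea, not the tooling Schwab built. The inventory records, the field names, and the minimum versions are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical minimum supported runtime versions set by the platform team.
MINIMUM_VERSIONS: Dict[str, Tuple[int, ...]] = {"java": (17,), "python": (3, 10)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '11.0.2' into a comparable tuple (11, 0, 2)."""
    return tuple(int(part) for part in text.split("."))


def find_outdated(inventory: List[dict],
                  minimums: Dict[str, Tuple[int, ...]] = MINIMUM_VERSIONS) -> List[str]:
    """List the apps, with their owning teams, that run below the minimum version."""
    flagged = []
    for app in inventory:
        minimum = minimums.get(app["runtime"])
        if minimum is not None and parse_version(app["version"]) < minimum:
            flagged.append(f"{app['name']} ({app['team']})")
    return flagged


# Made-up inventory records of the kind a platform could collect at build time.
inventory = [
    {"name": "quotes", "team": "trading", "runtime": "java", "version": "17.0.9"},
    {"name": "statements", "team": "banking", "runtime": "java", "version": "11.0.2"},
    {"name": "alerts", "team": "mobile", "runtime": "python", "version": "3.9.7"},
]
print(find_outdated(inventory))
```

Since every record also carries the owning team, the platform team knows exactly who to talk to about each flagged app.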

Ensuring all the great "ilities" in production starts with the application and enforcing good architecture practices. The panel talks about other practices they follow, like frequently testing, or "practicing," failure conditions to see if applications are architected and configured correctly. "[We] try to identify apps which could be a blocker in this resiliency," Rajesh says. That principle and the tooling in the Tanzu Platform, he says, "has really helped us to build up tooling around it." There's more discussion of practicing failure in the talk if you're interested.

Security and Governance

That general principle of specifying and using build automation for enforcing how applications are architected and configured pays off handsomely for security, compliance and governance. Every organization wants to be secure but a financial firm like Charles Schwab that operates at such high volume is especially interested. Even more so, because of the role they play in the US economy, they're under very close regulatory attention.

I want to highlight two security and governance points the panel makes about how they use the Tanzu Platform.

Enforcing Policy Compliance Contracts

First, just as with app building and configuration, the platform team can specify and enforce security and compliance requirements for the application. Again, because the platform builds, configures, and deploys applications, the team can enforce compliant patterns, scan for approved application components, and put in place guardrails. 

How do you manage the human side of this, though? "I'd basically say at the very beginning, make sure you set the rules and guardrails that you want before you start letting people on-board," Erik Scales says. "Because once they're on-boarded, it's hard to get them to change their bad habits." That SRE notion of a contract with developers comes up here as well. "We've developed the terms and conditions," Dor Anis says, "and it's been really helpful to be able to point at it, and, [say] like, 'you're not following that.'"

We see this repeatedly with platform teams: they rely on the platform to enforce security and compliance requirements. There's still the need to work with developers to go over those requirements, but traditionally checking for and enforcing compliance has been difficult.

Shift-left for Compliance

Compliance auditors certainly want to trust you, but they're not going to take your word that you're following regulations. They'll want proof that you're doing it. Proving that you're following regulations is a large part of being compliant. 

Once again, a centralized platform like the Tanzu Platform helps by centralizing all of the relevant information in one place in one consistent way. This means that "it's easier to demonstrate all the auditable stuff that we need to do," Scales says, "like people logging in - everything's centralized in pretty much one place. So when we show the auditors, 'hey, this is how you get in and these are the people that can get in' - it's easier for the auditors to go and, and vet that out."

And, if you care about productivity, this adds to overall productivity for auditors, developers, and operations staff. There's much less time spent collecting all of that information from various teams, figuring out how to interpret the different records and apps, and so forth.

That's Just Two of Many Lessons Learned

There's a lot more to learn from this panel. If you're setting up and growing a platform engineering group in a large organization like Charles Schwab, it's worth checking out the whole thing.

In particular, there's a great discussion of how the platform teams are organized and staffed. Instead of one team, the platform teams are divided into a developer-facing team and then a platform operations team. I've long wondered how one platform engineering team can do all of platform engineering - product managing and building the platform, working with developers, and then running and trouble-shooting the actual platform. At the scale that Charles Schwab operates, I think this team sub-division is likely the solution.

Of course, I recommend starting with the Tanzu Platform, a full stack for building your private cloud PaaS. Whether you want to use Cloud Foundry or Kubernetes, it comes bundled with everything you'd expect from a PaaS. There's even a ready-to-go generative AI stack built in that can get you started on your enterprise AI journey right away. And because we're the ones who build and maintain the Spring Framework, most enterprises with Java applications will find it super-easy to start migrating and modernizing their existing apps to the next platform for running Spring apps.

If you are interested in learning more about the intersections of compliance, security governance, platform, and processes, please join us on October 17th at 11:30AM PT for "Go Fast and be More Secure: Lessons Learned from the Biggest Breaches of 2024," a webinar with guest speaker Sandy Carielli of Forrester Research.

About the Author

Michael Coté

Michael Coté works on the advocates team for VMware Tanzu. See https://cote.io for more.
