T-Mobile has been a vocal Tanzu Application Service user for years, and as that environment grows—along with its younger Kubernetes environment—the wireless provider is learning a lot of valuable lessons. In this session from our recent SpringOne event, Brendan Aye and James Webb—two of T-Mobile’s cloud native platform leaders—share their experiences and strategies on a topic that should resonate with everybody who’s ever had to manage an application platform: what to do when something goes wrong.
Aye and Webb get into specifics in the embedded video, but keep reading for some highlights from the talk that lay out the scope of T-Mobile’s cloud native footprint, as well as how their team approaches communicating about and resolving platform issues. For more information about the company’s journey over the past couple of years, scroll to the bottom.
Cloud native environments are important and growing fast
“We have 30 [Tanzu Application Service] foundations supporting 75,000 application instances. This is a fairly important platform to the business. Most of our middleware and a lot of our order-management and digital-facing channels run on the platform now. Our Kubernetes environment is about 90 clusters, 22,000 pods. We've been hovering right around the 100,000 [container] mark for a while...
“I know...100,000 containers does not sound like a lot to a lot of large companies, but for us it's a pretty big deal. These are not workloads that are running ephemerally, being spun up and spun down every day. It's more a place for our core applications that we used to be running on VMs or bare metal or other forms of infrastructure, and moving them to a platform that provides an agile, consistent experience for teams to run their workloads without having to go through all the pain of maintaining their infrastructure like they had to do in the past.” —James Webb
Slack has opened up communication
“Another big thing for us is Slack. We adopted Slack very early when we deployed these platforms. We have customer channels for both [TAS] and Kubernetes, where effectively all our customers are members of those channels. We broadcast all of our notifications there around platform upgrades, any kind of incidents we have, new features, breaking changes, things of that nature. So, we can very quickly interact with all of our customers in one place, instead of having to shoot out emails and have people reply-all, and everyone gets fatigued from that kind of stuff...
“Because of that, we have customers that report problems very quickly because that's our primary method to engage with [support]...So that's generally the fastest way that we find out about incidents that we haven't caught with our monitoring, is when we see one or two or three customers report similar behavior or chime in on someone else's report. We know pretty quickly what's important and what might be causing issues on our platform.” —Brendan Aye
“Everything’s production to us”
“The other thing that we do for our internal customers is we don't evaluate things in terms of production and non-production. Everything's production to us. All of our customers are important, whether it's just internal developers who are trying to meet deadlines for their project, or whether it's external customers who are interacting with the website to buy or upgrade a phone.
“Nothing is more frustrating to me than hearing someone say, ‘Well, it's just non-production; I don't care.’...As a culture on our team, we do not say that. Every customer is important to us.” —James Webb
Don’t blame, just fix
“A big thing that you see in many large corporations is a mean time to blame, where customers that have incidents for their applications want to shift blame to someone else as quickly as possible. We've tried to really not play that game. We don't want to be in a position to try to blame someone else for an issue, or to be embarrassed because our platform has a problem itself.
“If it's our fault, we accept responsibility. If it's not, we demonstrate why and make sure leadership knows, as well...We want to fix the issue, explain what went wrong, and talk about how we can try to prevent that in the future.
“[These are] new platforms for our company—a lot of new technologies, new architectures. So when we see customers doing things that will get them in trouble...we make sure that the customer knows about it...We will help them redo architecture, redo any kind of changes they want to make, because we want them to be successful on the platform. If their app dies and it's not our fault, it still is a bad look overall to see apps failing on the platforms that our team manages.” —Brendan Aye
More from T-Mobile
About the Author: Derrick Harris