Deep Dive: Chaos Engineering for Cloud Foundry Platform - Karun Chennuri & Ramesh Krishnaram, T-Mobi
Deep Dive: Chaos Engineering for Cloud Foundry Platform - Karun Chennuri & Ramesh Krishnaram, T-Mobile USA Inc Modern Internet-scale microservice architectures exhibit complex communication behavior and failure scenarios with chaotic behavior (a.k.a the Butterfly Effect) that may lead to large scale disruptive events. This complexity comes from the Cloud Foundry components, services running thereon, and the underlying infrastructure necessary to provide highly available compute, network, security, storage, persistence services. For a distributed microservice architecture to function ideally, these elements must all work in tandem and tolerate failure. To systematically verify that a system can tolerate failure, a disciplined approach is necessary. One such approach is “Chaos Engineering.” Cloud Foundry is key in T-Mobile’s infrastructure, undoubtedly one of the largest CF platforms in the world, running business critical operations with over 30,000+ containers. Building resilency, self-healing and High Availability in to systems and apps – is one of the core factors that decides the success of our group. This proposal demonstrates the approach and the custom tools T-Mobile has been working on to purposefully breaking systems, identifying weaknesses, taking corrective actions and preparing for Game Days. Here at T-Mobile we started addressing Chaos Engineering at 2 different levels - “Platform” & “App” level Chaos Engineering. In this talk, we would like to discuss the architecture details, drivers that we had opensourced to the community, Demo walk-through on features and future steps. As a part of this talk, Karun would like to demo the following features: Simulate App level attacks * Bad gateway errors at app level * Latency between service and database * Kill an app/service app is dependent on Simulate Platform attacks: * Terminate VM instances * Host level attacks – CPU, Memory hogs * Advanced Network Traffic attacks * Advanced Packet Level attacks Technology: Python, Go, Spring boot, Java, PCF, Linux All this put together helps any large technology company in a systematic approach to verifying reliability of the Cloud Foundry platform. About Karun Chennuri Karun Chennuri, is the Sr Engineer at T-Mobile, who currently leads DevSecOps efforts for Cloud Foundry and Kubernetes teams within T-Mobile. He is a Software Developer with Security Expertise and has about 14 years of experience handling various assignments dealing in Security Solution integration, Product security development, Cloud Security, SDN and security assignments in general.⠀ About Ramesh Krishnaram Ramesh Krishnaram is the Sr.Manager for Platform Engineering at T-Mobile. His team at T-Mobile is responsible for providing simple, secure, scalable services with which developers can rapidly build, test, deploy software to the cloud. Over the past few years, Ramesh has spent time building, managing teams that are responsible for a suite of services in the PaaS portfolio. As part of his professional experience, Ramesh also has experience serving an evangelist for Chaos engineering delivering on tech talks and also working with software development teams educating them on the principles of failure testing (how/why/what). https://www.cloudfoundry.org/