This blog is based on a session I co-presented with Lyft’s Yash Kumaraswamy and Centrica Hive’s Rob Fisher in Barcelona in November 2018. In my last blog, I described how Centrica Hive scaled their cloud services to serve one of the largest IoT offerings in Europe. In this blog, I’ll talk about the phenomenal growth of Lyft services and how Wavefront, a cloud-native monitoring platform, helped them. If you’d rather hear it live – check the video of our session here.
Lyft is one of the two leading U.S. companies in the business of Transportation as a Service. Lyft is trying to reduce the number of cars on the street by maximizing the utilization of Lyft cars already on the street. Lyft is currently operating in 300+ cities across the U.S. and Canada.
Lyft is moving away from a monolithic service to 200+ microservices running on 10,000+ EC2 instances. In short, this translates into a massive amount of metrics!
Explosive Cloud Service Growth Resulted in Graphite Meltdown
Lyft’s observability team of five has many responsibilities. They are tasked with architecting and maintaining all aspects of monitoring: the ingest pipeline and real-time aggregation, distributed tracing, PagerDuty interactions and integrations, the real-time business metric framework, dashboards, and the user experience of monitoring and alerting. In the beginning, the observability team faced the challenge of managing open-source tools with poor reliability, performance, and maintainability. They used Graphite with Grafana, but these tools failed to answer some fundamental questions:
- Was the release successful?
- Do we have 5xx spikes?
- For GMs and VPs: How many rides are being serviced at any given time?
- How are the services doing from the DevOps perspective?
The availability and performance of their cloud service, as well as their business metrics, are critical, so they wanted a high-performing observability platform with uptime in the high nines that works reliably and consistently. They faced many challenges at that time. Constant growth and a proportional increase in metrics made ingestion and query performance difficult. With a significant volume of metrics flowing in, monitoring microservices delivery was a difficult undertaking, compounded by the cardinality limits imposed for Graphite. Graphite was not able to handle the load and traffic. Graphite meltdowns resulted in lost engineering productivity and a lack of visibility into Lyft’s infrastructure.
For the observability team, open source monitoring tools became quite a burden because they required manual maintenance, which kept the team from expanding service features and forced them to spend most of their time supporting the monitoring infrastructure. It was hard to scale with Grafana and Graphite. Measured from the time a metric was generated to the time it became available on the front-end UI, the metric aggregation delay was roughly three minutes.
In early 2016, Lyft’s SRE team moved off these low-performing Graphite aggregation servers to a higher-performing StatsD service. They also got rid of the Etsy node.js/StatsD server that ran as a sidecar on these instances. They flushed metrics as they were generated to the central StatsD cluster, reducing the latency from metric generation to metric resolution to under two minutes, which was pretty good. In the process, the team increased the accuracy of higher percentiles (P99s) and of lower and upper values. That was a significant improvement. However, they wanted to improve their monitoring further.
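The session does not spell out why central aggregation improved percentile accuracy. One common reason, offered here as an assumption rather than as Lyft's stated explanation, is that a percentile computed per host and then averaged is not the percentile of the fleet. The Python sketch below uses made-up latency samples and hypothetical host names to show the gap between the two.

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-len(ordered) * p // 100))  # integer ceil(n * p / 100)
    return ordered[rank - 1]

# Made-up latency samples in milliseconds; host names are hypothetical.
samples_by_host = {
    "host_a": [10] * 1000,  # healthy, high traffic
    "host_b": [12] * 1000,  # healthy, high traffic
    "host_c": [900] * 50,   # slow, low traffic
}

# Per-host aggregation: each host reports its own P99, then the values are averaged.
per_host_p99 = [percentile(v, 99) for v in samples_by_host.values()]
averaged_p99 = sum(per_host_p99) / len(per_host_p99)

# Central aggregation: raw samples are pooled before the percentile is taken.
pooled = [x for v in samples_by_host.values() for x in v]
central_p99 = percentile(pooled, 99)

print("average of per-host P99s:", round(averaged_p99, 1))
print("P99 of pooled samples:", central_p99)
```

With these numbers the averaged figure lands well below the pooled P99, because averaging gives the same weight to every host regardless of how many requests it served.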
Wavefront to the Rescue
In 2016, Lyft completely cut over to Wavefront and replaced the node.js/StatsD service. They developed a high-performing, flexible, and scalable metrics pipeline that pumps metrics to Wavefront.
Lyft started collecting metrics using Collectd, which is similar to Telegraf. It runs on the instance and scrapes system stats: CPU, memory, file descriptor utilization, etc. There are also arbitrary custom scripts: the team developed bash functions that send metrics to the local stats agents. They also have application metrics, and they have core libraries that generate default metrics consumed by the stats pipeline. In addition, Lyft has scraper scripts that gather metrics from different sources such as CloudWatch or MongoLab. In the future, when Lyft moves to Kubernetes, they plan to gather Kubernetes (K8s) stats as well.
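To make the idea of a custom script sending a metric to a local stats agent concrete, here is a minimal Python sketch. It is not Lyft's tooling (their helpers were bash functions, and their agent setup is not described here). It assumes an agent that accepts the plain StatsD line format `name:value|type` over UDP on the conventional port 8125; the metric names and the `format_statsd` and `send_metric` helpers are hypothetical.

```python
import socket

def format_statsd(name, value, metric_type="c", sample_rate=None):
    """Build one StatsD line: c = counter, g = gauge, ms = timer."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line

def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local agent (address is an assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    send_metric(format_statsd("scraper.cloudwatch.runs", 1))
    send_metric(format_statsd("scraper.cloudwatch.duration", 240, "ms"))
```

UDP keeps the emitting script cheap: it never blocks on the agent, at the cost of silently dropping a metric if nothing is listening.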
Wavefront at Lyft Today: Serving 1,000+ Devs, 18,000+ Alerts
Today, Wavefront is Lyft’s engineering teams’ go-to observability platform for time series data and alarming. They use Wavefront for both cloud infrastructure monitoring (AWS) and monitoring of applications and microservices (written in Python and Golang). They ingest upwards of 500,000 metrics per second, peaking at 800,000 metrics per second. Lyft has 1,000+ developers actively using Wavefront for their day-to-day monitoring, alerting, and engineering development needs. There are 1,000+ Wavefront dashboards and 18,000+ smart alerts created across all the teams. Product managers also use Wavefront as a gauge of how successful feature rollouts are, “to understand our growth pattern, demand pattern, also to judge how effective those rollouts are.”
Business Metrics on Lyft Rides – Per Second!
In addition to engineering metrics, Lyft’s business executives, including GMs and VPs, use Wavefront metrics to know how their business is doing. Business metrics monitoring is pure gold for Lyft, providing real-time as well as historical insight into the true state of the business. Although business data flows through the same aggregation cluster as other metrics, its source is different: AWS Kinesis instead of the application. In Wavefront, business metrics are available all the time – per second, on every dashboard, visible to everyone, helping them make better, data-driven business decisions. Those business metrics include:
- Passenger metrics
– New user signups / installs / activations
– Current passengers with the app open
- Driver metrics
– New driver applications / activations
– Current drivers with the app open
- Ride metrics
– Rides requested / accepted / dropped off / canceled / lapsed
– Lyft Line rides dropped off
– Paid vs. Couponed rides dropped off
- Marketplace metrics
– Drivers available
– Drivers en route
– Driver utilization %
Streaming Mobile Client Metrics
The Wavefront Platform also collects information from the Lyft mobile apps (clients). In real time, the observability team can answer questions such as, “how many mobile app opens did we have in a given region, and how many did we have a week or two weeks before that?” These are critical KPIs that help them understand the state of the business as well as of the infrastructure serving mobile clients.
Global Microservices Monitoring with Envoy Metrics in Wavefront
Lyft built and uses Envoy, a highly popular, high-performance Layer 7 proxy that handles service-to-service communication. Envoy provides excellent observability, with out-of-the-box monitoring, metrics, logging, and tracing support. Envoy runs on every host/instance in Lyft’s fleet, and it generates extremely useful metrics. Using the Wavefront integration with Envoy, Lyft monitors the state of all Envoy instances.
Their engineers can also see how their microservices are performing: overall requests per second, success rates (non-5xx responses), and much more. The reason for running Envoy on every host, besides handling service-to-service communication, is that it generates consistent performance metrics across every step of the process. In a complicated distributed system like Lyft’s, it becomes extremely hard to track down bugs or the source of a problem when someone gets paged. By analyzing all these metrics in Wavefront, they can determine which link in the chain is causing issues or downtime.
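As a rough illustration of finding the weakest link from consistent per-service counters, the Python sketch below computes a non-5xx success rate for each service and picks the lowest. The counter names are modeled on Envoy's cluster statistics (`upstream_rq_total`, `upstream_rq_5xx`); the service names and numbers are invented, and in practice this kind of question would be answered with a query in the monitoring platform rather than a script.

```python
def success_rate(total, errors_5xx):
    """Fraction of requests that did not end in a 5xx; None when there was no traffic."""
    if total == 0:
        return None
    return (total - errors_5xx) / total

def worst_link(counters):
    """Return (service, success_rate) for the service with the lowest success rate."""
    rates = {}
    for service, stats in counters.items():
        rate = success_rate(stats["upstream_rq_total"], stats["upstream_rq_5xx"])
        if rate is not None:  # skip services that saw no requests
            rates[service] = rate
    if not rates:
        return None
    service = min(rates, key=rates.get)
    return service, rates[service]

# Invented counters for three hypothetical services in a request chain.
sample = {
    "api": {"upstream_rq_total": 12000, "upstream_rq_5xx": 12},
    "pricing": {"upstream_rq_total": 8000, "upstream_rq_5xx": 400},
    "dispatch": {"upstream_rq_total": 0, "upstream_rq_5xx": 0},
}

print(worst_link(sample))
```

Because every hop reports the same counters, the comparison needs no per-service special cases, which is the practical benefit of running the same proxy everywhere.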
We have recently released the Wavefront Distributed Tracing beta – you can try it here.
Wavefront is our single source of truth for triaging purposes. When someone gets paged, we want them to be a click away from resolving the issue, for example, no need to look at Amazon charts. Context switching is extremely expensive.
Dashboards and Alarms as Code
Lyft’s observability team also uses Wavefront for all their alarming. All their dashboards and alarms are checked into source control, giving a single source of truth for which alarms are available in production at any given time. It also supports a strong code review process and lets developers tweak alarms easily: they don’t have to first understand the entire infrastructure or the logic behind the alarms. Additional information attached to an alarm gives the on-call engineer more context, helping them troubleshoot more easily and quickly. The team can achieve a consistent look and feel across all 200+ microservices. Ownership is very distributed; no single team owns every aspect of dashboards and alarms.
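One way to picture alarms as code is a small, reviewable definition that is validated before it is rendered for deployment. The Python sketch below is purely illustrative: the `Alarm` fields, the validation rules, and the JSON output are assumptions, not Lyft's schema or Wavefront's API, and the query string is a placeholder rather than real query syntax.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm definition kept in source control."""
    name: str
    owner: str            # team that gets paged
    query: str            # placeholder text, not real query syntax
    threshold: float
    minutes_to_fire: int = 5
    runbook: str = ""     # extra context for the on-call engineer

def validate(alarm):
    """Return a list of problems a reviewer (or CI check) should block on."""
    problems = []
    if not alarm.owner:
        problems.append("owner is required")
    if alarm.minutes_to_fire < 1:
        problems.append("minutes_to_fire must be at least 1")
    if not alarm.runbook:
        problems.append("runbook is required")
    return problems

def render(alarms):
    """Serialize alarms deterministically so diffs in code review stay small."""
    ordered = sorted(alarms, key=lambda a: a.name)
    return json.dumps([asdict(a) for a in ordered], indent=2, sort_keys=True)

good = Alarm(
    name="pricing-5xx-rate",
    owner="pricing-team",
    query="<5xx rate for the pricing service>",
    threshold=0.01,
    runbook="<link to runbook>",
)
bad = Alarm(name="orphan-alarm", owner="", query="<some query>", threshold=1.0)

print(validate(bad))
print(render([good]))
```

Keeping the rendered output sorted and deterministic means a threshold tweak shows up as a one-line diff, which is what makes review by developers outside the observability team practical.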
Wavefront – The First Source of Truth
Wavefront helps Lyft unify all its metrics, becoming the single source of truth for all triaging purposes, speeding up troubleshooting and cutting costs. Wavefront provides a holistic window onto metrics from numerous sources, available in a single click, with real-time visibility into key services. Its highly efficient alert engine helps Lyft filter noise and capture anomalies. The Wavefront platform immediately reveals what has happened and what needs to be done to limit the impact of an error, helping teams find a needle in a haystack.
In conclusion, monitoring production code releases with metrics and smart alarms significantly helps decision making. Wavefront delivers intelligent alerting for proactive monitoring and an ultra-fast query language that can answer essential questions about the business. Altogether, Wavefront clears the road for a smooth ride for the Lyft business.
The post A Billion Lyft Rides – How to Scale a Cloud Service with Speed and Insights (2 of 2) appeared first on Wavefront by VMware.
About the author: Stela Udovicic