Yammer engineers created approximately 3,000 alerts with VMware Tanzu Observability
Shortened the time for troubleshooting, further protecting SLA commitments and meeting customer expectations
DEA team was able to catch more bugs before production, preventing site outages and costly SLA misses
Microsoft Yammer is a secure enterprise social network, internal to an organization, that enables people to connect and engage across their company. Employees can share knowledge, interests, and ideas from any device. They can also build online communities and connect with leaders.
Because Yammer is part of the Office 365 platform, you benefit from things like integrated apps and services across Office 365, as well as best-in-class control, security and compliance offered through the Microsoft Trust Center. Yammer also enables live and on-demand events for up to 10,000 attendees across the web and mobile apps.
Companies want a secure, private enterprise tool that helps employees work together, whereas users want a tool that’s easy to use. Yammer provides both, which is why companies such as Air France, Goodyear, MARS and Virgin Trains use Yammer to help employees connect and engage across the globe.
Over four years, the Yammer cloud service grew rapidly. As Yammer expanded and its engineering teams developed new features that required new services, the volume of telemetry increased significantly. The transition from monolith to microservices drove the need for even more measurement points.
Geographical expansion created additional challenges. Observability tooling had to serve multiple data centers spanning different continents. Yammer’s Data Engineering and Analytics (DEA) team understood early on that a powerful observability tool was a must if they were to exceed their global service-level agreements (SLAs). Only a fast and reliable modern platform could fulfill the engineering team’s expectations.
The open source monitoring tools that the DEA team used previously could hardly keep up with the transition from monolith to microservices. The team did not want to maintain and scale open source monitoring platforms, which would require significant engineering investment. The Yammer DEA team adopted Tanzu Observability, which scaled to meet the demand of a rapidly growing service and did not limit Yammer’s expansion. Because Tanzu Observability is a cloud service, it natively works with multiple data centers in different geographies.
“The reliability of [Tanzu Observability] is pretty impressive. We use a lot of hosted services, and obviously, Yammer itself is a hosted service, and I can appreciate, over the last 16 years in tech, how hard it is to keep the nines up in an SLA.”
Ben Freeman, Data Engineering and Analytics, Microsoft Yammer
In addition, Yammer engineering teams started relying on Tanzu Observability analytics-driven alerts to be able to hit their SLAs. They adopted a modern approach to alerting, mandating that internal customers avoid alerting on logs. “Metrics for everything,” they say. Metrics are quick, and Yammer gets better history out of the analytics, with fewer false positives. Compared to logs, metrics are ingested about 5–10 times faster and are also more reliable. Logs can often be delayed up to five minutes, which can have a detrimental impact on their SLAs. Any delay in outage detection increases the time to fix the issue, which could result in an SLA breach.
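One common way metric-based alerts cut down on false positives is to fire only when a threshold is breached for several consecutive samples rather than on a single spike. The sketch below illustrates that general idea in plain Python. It is not the Tanzu Observability alert API, and the class name, threshold, and window size are hypothetical values chosen for illustration.

```python
from collections import deque


class SustainedThresholdAlert:
    """Fires only after `window` consecutive samples exceed `threshold`.

    Illustrative sketch only; not how Tanzu Observability or Yammer
    actually implement alerting.
    """

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        # Keep only the most recent `window` breach flags.
        self._recent = deque(maxlen=window)

    def observe(self, value):
        """Record one metric sample and return True if the alert fires."""
        self._recent.append(value > self.threshold)
        return len(self._recent) == self.window and all(self._recent)


# Hypothetical error-rate samples, one per minute.
alert = SustainedThresholdAlert(threshold=0.05, window=3)
samples = [0.01, 0.09, 0.02, 0.08, 0.07, 0.06, 0.01]
fired = [alert.observe(v) for v in samples]
```

In this example the isolated spike at the second sample does not fire the alert; only the run of three consecutive breaches does.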
Today, Yammer’s team uses Tanzu Observability data to report SLAs and KPIs such as latency P99s to business stakeholders, helping them understand the health of their service around the clock.
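For readers unfamiliar with the term, a latency P99 is the value below which 99 percent of observed request latencies fall. A minimal sketch of the calculation, using the nearest-rank method, is shown here. The function name and sample data are hypothetical, and monitoring platforms may use a different percentile estimator.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of
    the samples at or below it. `pct` is a number in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, two slow.
latencies_ms = [10] * 98 + [500, 900]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```

With this data the median is 10 ms while the P99 is 500 ms, which is why tail percentiles are often reported alongside, or instead of, averages.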
Currently, use of Tanzu Observability is mandatory across Yammer development teams, and that goes for 100 percent of the engineers.
The Development team handles about 100 microservices, and they always have someone on call on a weekly rotation. Everyone on the team is expected to be able to troubleshoot. Developers own their code in full.
Tanzu Observability is also used to monitor the release pipelines. When a developer pushes up a pull request (PR) to deploy code and a coworker gives it a thumbs up, the code goes into the release pipeline and CI kicks off, emitting CI metrics along the way. The team watches particular metrics in that pipeline data, and if those exceed set parameters, the PR gets pushed back to the developer who issued it. By relying on Tanzu Observability data, this process significantly improves code quality and helps prevent bugs from going into production.
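The gating step described above can be sketched as a simple comparison of observed pipeline metrics against per-metric limits. This is a generic illustration rather than Yammer's actual pipeline code: the metric names, limits, and the choice to treat a missing metric as a violation are all assumptions.

```python
def find_violations(metrics, limits):
    """Return a list of human-readable violations.

    `metrics` maps metric name -> observed value from a CI run.
    `limits` maps metric name -> maximum allowed value.
    A metric with no data is treated as a violation (an assumption).
    """
    violations = []
    for name, limit in limits.items():
        observed = metrics.get(name)
        if observed is None:
            violations.append(f"{name}: no data")
        elif observed > limit:
            violations.append(f"{name}: {observed} exceeds limit {limit}")
    return violations


def gate_pull_request(metrics, limits):
    """Decide whether a PR proceeds or is pushed back to its author."""
    violations = find_violations(metrics, limits)
    decision = "push_back" if violations else "proceed"
    return decision, violations


# Hypothetical limits and CI results.
limits = {"p99_latency_ms": 250, "failed_tests": 0}
good_run = {"p99_latency_ms": 180, "failed_tests": 0}
bad_run = {"p99_latency_ms": 320, "failed_tests": 0}

good_decision, good_violations = gate_pull_request(good_run, limits)
bad_decision, bad_violations = gate_pull_request(bad_run, limits)
```

Returning the list of violations along with the decision gives the developer a concrete reason when a PR is pushed back.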
When engineers reach a stable release process, they don’t necessarily need to look at Tanzu Observability constantly. However, for any big code pushes, Yammer encourages everyone to watch Tanzu Observability dashboards. Yammer engineers have already created approximately 3,000 alerts in Tanzu Observability.
A best practice that the Yammer team has adopted in its daily use of Tanzu Observability is to have alerts for all critical microservices. If a new problem arises later, engineers should add appropriate metrics and alerts.
As for future plans, the Yammer team is currently working on GraphQL. They would also like to come up with a “templatized” alerting system that finds the appropriate service owner for every downstream service.