Last week, at VMworld Europe 2018, I was very fortunate to present a session with Rob Fisher from Centrica Hive and Yash Kumaraswamy from Lyft. Both customers shared impressive stories of how they reached a fantastic scale with Wavefront. In this two-part blog, I describe how using cloud-native analytics for monitoring helped each of them eliminate many of their observability pains. If you’d rather listen to the live recording of the session, you can find it here.
Centrica is a large energy group based in the UK, and Hive is its smart home offering. Hive started as a smart thermostat business and has grown to incorporate lighting, leak detection, security cameras, motion detectors and all sorts of other things, enough to keep 12 to 15 product teams busy, plus core services. From their central London garage startup “with everything going off the boss’s credit card” approach, they grew into the largest IoT platform in the UK, with over 500,000 customers.
Taking Control of the AWS Cloud Estate
Five years ago, when Hive set out with an entirely cloud-native infrastructure, its first concern was to get to market as quickly as possible, gain a competitive advantage and have it all work. That went fine, since there was no data center to sort out and no hardware to manage. Developers obtained everything they needed right away from the public cloud: VMs, storage, and databases. For a while it seemed even better, because the customers loved the product and everyone involved was pleased, except the SREs. The SREs had to deal with “the huge unknown mess of stuff in the Amazon cloud”.
Centrica needed to know, from the metrics, where the money was going and how efficient its efforts were. With so many VMs, services and accounts, each with separate API keys and permissions, they couldn’t tell. Without a global view, they could not even picture what they had. In effect, there was no control of the estate.
Fixing Broken Configuration Management
After trying open source monitoring tools that didn’t work for them, they quickly realized that Wavefront was their best choice. They introduced Puppet configuration management, installed collection agents and used Telegraf to feed metrics into Wavefront. With hundreds of VMs and loads of data in Wavefront, they expected that everybody was going to love it and that it was going to be great. Initially, though, Wavefront saw only about 50 hosts, and they thought Wavefront wasn’t working. It turned out that their Puppet implementation was broken, and the team didn’t know it: about 80% of the Puppet entities were failing. So the first thing Wavefront told them was that their configuration management was broken. Now they have a “lovely interface” between Puppet and Wavefront, with full visibility into the estate. With richly tagged data, they can see the instance types, availability zones, and all the things that one can easily lose track of in the cloud.
To avoid receiving customer criticism through Twitter about things not working as they should, a cloud-native company needs proper monitoring, and not the old type of monitoring from a past era, like Nagios. Nagios performs point-in-time checks by running scripts that report each possible failure throughout the system. Hive needed a more advanced monitoring solution with a built-in expectation of failure. In the cloud that’s normal: instances come and go all the time, and a cloud-native approach takes that into account. In the services the Hive team maintains, no host lives for more than a week. Wavefront proved able to manage that sort of transient estate. Rather than caring about individual hosts, they now keep an eye on the tier of an application, easily monitoring, for example, the average CPU load across the whole tier.
After gaining some experience with Wavefront, the Hive team started creating synthetic metrics to monitor their services. Every service knows what it’s supposed to do, and it emits a metric every ten seconds or so that says whether it’s working properly. The SREs can then simply monitor that metric.
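To make the idea of a synthetic metric concrete, here is a minimal Python sketch, not Hive’s actual code. It assumes the service reports a single 1/0 health value as a line in the Wavefront data format (`<metricName> <metricValue> [<timestamp>] source=<source> [pointTags]`). The metric name, the `tier` point tag, and the `check` and `send` callables are hypothetical placeholders; in practice `send` would write the line to a Wavefront proxy.

```python
import time


def health_line(service, healthy, source, tier, now=None):
    """Format one synthetic health point as a Wavefront data format line.

    The "<service>.health" metric name and the "tier" point tag are
    illustrative assumptions, not names taken from Hive's setup.
    """
    timestamp = int(now if now is not None else time.time())
    value = 1 if healthy else 0
    return f'{service}.health {value} {timestamp} source={source} tier="{tier}"'


def emit(check, send, service, source, tier, rounds, interval_s=10, sleep=time.sleep):
    """Run the service's own self-check `rounds` times, one point per interval.

    `check` returns True when the service is doing what it is supposed to do;
    `send` delivers one formatted line (for example to a proxy socket).
    """
    for _ in range(rounds):
        send(health_line(service, check(), source, tier))
        sleep(interval_s)
```

An alert can then be defined on the single health series per service, instead of on the many low-level host metrics behind it.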
The bottom line is that SREs really don’t care about the details of the system underneath. Does that mean that the underlying data, such as CPU usage, memory usage, and network traffic, isn’t important? It is, and the Hive team records it in great detail, but they don’t alert off it. They use it after there’s been a problem. Wavefront is very capable of correlating and presenting a lot of data from various sources. Rob Fisher shared how terrific Wavefront was in post-mortem situations as well. For instance, after an outage at 3 AM, an SRE can pull up all the data for all affected hosts and all affected services for, say, the previous 12 hours, and look at how things interacted and what happened before the outage. That’s the important piece: what led up to the outage, which can help them avoid future ones.
Security and Compliance
Security is very important for Centrica Hive. Because Centrica operates in a regulated industry, it gets audited and needs to be able to prove its reliability and high levels of security. Centrica Hive created a large compliance suite, a set of programs that performs tasks on every VM and runs every hour or any time that Puppet makes a change. There are typically 500-600 tasks per host, and they ensure that all Centrica hosts and VMs comply with specific rules.
A document from Centrica describes the rules and tests that all Centrica hosts and VMs must comply with. When Centrica gets audited, the compliance test results must be very easy to read, and it must be easy to determine which part of the Centrica document each result belongs to and whether it meets the requirements. The Hive team wrote a report that puts all of this information into Wavefront every time the compliance suite runs. The Centrica Hive team then presents the charts, so Centrica can show auditors what its tests are supposed to be doing and how the tests are doing it. The auditor looks at that and confirms that these tests prove Centrica is verifying everything that needs to be checked. The auditors love it because it’s really easy to read and to audit. Centrica product management loves it because they know that their developers are pushing applications that meet high security requirements.
In addition, Wavefront executive dashboards developed by the Hive teams show the CTO and CEO, in broad brush strokes, how their services are working, whether a particular service responds within the specified time, and whether their machines are 100% security compliant. The execs love that because it gives them real insight into what the technical part of the business is doing without having to wade through loads of technical detail or sit through meetings.
Managing Cloud Cost
When every team can have all the resources it wants at any time, as is the case after moving to the cloud, you often end up with all kinds of leftovers: everything you spun up and forgot to spin down once you had finished with it and no longer needed it.
Although an in-house tool can give some information on the overall amount spent in the cloud each month, it cannot show where the money goes when it shouldn’t, that is, on things you don’t need. That’s where Hive uses Wavefront. By looking at the charts and comparing what they have bought as prepaid Amazon reserved instances, which come with a considerable discount, against the instances they actually use and how heavily they use them, they can discern which reservations they should buy. Their people can then be advised to move from one service to another, or from one instance type to another. They feed in all that information continuously, and they have programmed alerts for when reservations are about to expire.
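The comparison described here can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not Hive’s tooling: reservations and running instances are reduced to plain lists of instance type names, the reservation records are hypothetical dictionaries with `id` and `end` keys, and the 30-day warning window is an arbitrary choice. It also ignores how heavily each instance is used.

```python
from collections import Counter
from datetime import datetime, timedelta


def reservation_gaps(reserved, running):
    """Return {instance_type: running - reserved}.

    A positive number means usage not covered by a reservation;
    a negative number means reservations that are paid for but unused.
    """
    reserved_counts = Counter(reserved)
    running_counts = Counter(running)
    types = set(reserved_counts) | set(running_counts)
    return {t: running_counts[t] - reserved_counts[t] for t in sorted(types)}


def expiring_soon(reservations, now, within_days=30):
    """Return the ids of reservations that end within the warning window."""
    cutoff = now + timedelta(days=within_days)
    return [r["id"] for r in reservations if now <= r["end"] <= cutoff]
```

Fed into a metrics platform as time series, the per-type gap would show which reservations to buy or drop, and the expiry list is the kind of condition an alert could be set on.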
Terminating a host that has a disk attached does not necessarily remove the disk. Hive sometimes ends up with hundreds of Elastic Block Store (EBS) volumes that are not in use but are still paid for. Hive has very clear Wavefront dashboards that show disk volumes that aren’t attached to an instance, load balancers with no instances behind them, and elastic IPs that are not in use. Hive knows how many elastic IPs it pays for, and that information goes into Wavefront. Wavefront then dynamically works out how many elastic IPs are actually in use and alerts the Hive team to any discrepancies. This functionality alone saved Hive tens of thousands of dollars each month.
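As a rough illustration of these checks, the Python sketch below finds unattached volumes and unused elastic IPs in inventory data. It is an assumption-laden example rather than Hive’s implementation: the input dictionaries are loosely modeled on the shape of the EC2 DescribeVolumes and DescribeAddresses responses (`VolumeId`, `Attachments`, `PublicIp`, `AssociationId`), and the function names are hypothetical.

```python
def unattached_volumes(volumes):
    """Return ids of volumes with no attachment, i.e. paid for but unused."""
    return [v["VolumeId"] for v in volumes if not v.get("Attachments")]


def unused_elastic_ips(addresses):
    """Return elastic IPs that are allocated but not associated with anything."""
    return [a["PublicIp"] for a in addresses if "AssociationId" not in a]


def elastic_ip_discrepancy(paid_for, addresses):
    """Return how many paid-for elastic IPs are not actually in use."""
    in_use = len(addresses) - len(unused_elastic_ips(addresses))
    return paid_for - in_use
```

Reporting the lengths of these lists as metrics on a schedule would give the dashboards and the discrepancy alert something to work from.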
All things considered, it’s no wonder that growing a successful SaaS such as Centrica’s is directly tied to having a new platform for analytics such as Wavefront. Wavefront provided Centrica Hive with its first, if not its only, single pane of glass: metrics from VMs, metrics from cloud services and Wavefront integrations, and Hive’s own telemetry generated from the Amazon APIs and the Wavefront Amazon SDK are all combined to provide real-time visibility into all aspects of their digital service.
In part 2 of this blog series, I will share Lyft’s story with Wavefront, based on the session we presented at VMworld 2018 EMEA.
In the meantime, if you’re ready to try Wavefront, check out our free trial.
About the Author: Stela Udovicic