AWS Lambda Serverless, Automatic Rollbacks, and Wavefront Monitoring

April 3, 2017 Conor Beverland

Automatic rollbacks are a way to quickly revert back to a previous version of your application to limit the impact to your end users. Typically, the application developer specifies what the conditions are to a monitoring system, and if those conditions are met, the application is reverted to a previous version that is known to work.

Your application always needs to roll forward for enhancements. But it also needs to roll backwards when necessary. If there is state in a database or a cache, your application will have to be designed to take that into consideration and handle it correctly. While out of the scope of this blog post, there are a variety of ways to handle this cautiously, and rollbacks won’t help in every case. However, when used with care, it can be a valuable tool in one’s toolbox to limit the impact of a bad code push on your users.

A Peek into Serverless

Here at Wavefront, we’re excited about the possibilities of serverless technology and all the implications around monitoring with it, and of it. A growing percentage of our customers are also starting to use it too. If you’re not familiar with the concept of serverless, for the purpose of this discussion, we simply mean AWS Lambda. The key point of “serverless” is not that there are no servers, but that you don’t have to worry about them anymore, i.e. they are handled by your cloud provider. All you focus on is the core logic of your function and the cloud provider takes care of running it somewhere for you.

Quick side note: we recently introduced a turn-key monitoring suite for AWS cloud services including for AWS Lambda, see its prepackaged dashboard below. The suite also applies analytics with real-time AWS pricing to shows you how much you can reduce your overall cloud services bill. If you’re attending the AWS Summit in San Francisco on April 18-19, come by the Wavefront booth for a product demo. Or we can get you started with a free trial immediately.

aws dashAutomatic Rollback Plugin

Let’s get back to automatic rollback. In the following example, we have a test program that we’re deploying using Serverless.com’s framework. It includes an API Gateway endpoint that in turn calls our test Lambda function that can be put into a mode where it returns a high error rate on the requests it’s processing, and reports these to the Wavefront Proxy.

module.exports.transaction = (event, context, callback) => {
let status = processTransaction();
if (status) {
sendWavefrontMetric(1, ‘demoapp.mymetric.success’, callback);
} else {
sendWavefrontMetric(1, ‘demoapp.mymetric.error’, callback);
}
};

We also have our serverless.yaml configuration specify that our automatic rollback plugin should be used and trigger a rollback if the 5 minute moving average error rate goes over 20%:

plugins:
– wavefront-serverless-rollback-plugin

custom:
wavefrontDebugMode: false
wavefrontForceDeploy: true
wavefrontApiKey: “xyzxyzyx-f706-4265-9acd-efbce085e82f”
wavefrontApiInstanceUrl: “https://try.wavefront.com”
wavefrontRollbackAlertTriggerThreshold: 2
wavefrontRollbackAlertCondition: ‘mavg(5m, sum(msum(5m, rate(ts(“demoapp.mymetric.error”))))/sum(msum(5m, rate(ts(“demoapp.mymetric.*”))))) > .2’

In the following run, we can see the automatic rollback happen after the Wavefront alert trigger fires.

$ watch -t -n0 curl -sq https://e42fywgyje.execute-api.us-east-1.amazonaws.com/dev/process
SENT to Wavefront: demoapp.mymetric.success 1 1490741766 source=demoapp-dev-transaction

As seen in the Wavefront graph above, the success rate (in percent) is high after the function is deployed as depicted by the orange line, with a tolerable amount of errors, shown by the blue line.

But then a newer faulty version is deployed that returns a higher rate of errors:

$ watch -t -n0 curl -sq https://e42fywgyje.execute-api.us-east-1.amazonaws.com/dev/process
SENT to Wavefront: demoapp.mymetric.error 1 1490743751 source=demoapp-dev-transaction

Seen above, the error rate keeps on climbing after the newer function has been deployed.

Thankfully, a Wavefront alert (configured below) is already set to catch this condition:

Alert definitionSo not long afterwards, the alert triggers a rollback to revert the faulty function to its previous version.

Now seen above, the success rate is coming back up after the alert triggered at the time marked by the highlighted event.

So, what happened here?

The Serverless.com plugin hooks into serverless’ deploy command. When the deploy command is invoked, it sets up an automated rollback as follows:

  1. It downloads the service’s previous CloudFormation template to be used as the rollback target.
  2. It uploads a rollback Lambda and sets up an API Gateway by which to trigger it.
  3. It creates a Wavefront webhook and alert to trigger the rollback.
  4. When the alert is triggered, it invokes the rollback Lambda via the API Gateway end point and executes the rollback using the previous version of the CloudFormation template.

The code is published as an npm and is also available here:

https://github.com/wavefrontHQ/wavefront-serverless-rollback-plugin

https://github.com/wavefrontHQ/wavefront-serverless-rollback-plugin-demo

Accuracy of Alerting Triggers

This plugin allows you to take advantage of Wavefront’s rich alerting expressions to specify when your application should be automatically rolled back. We feel that this flexibility and the ability to be accurate are really important to highlight. Alerting capabilities in most monitoring tools simply don’t have the range of analytics expressiveness to allow you to zero in on the exact conditions to alert on. Either they’re too broad with just simple percentages or even a single static value, or are too complex and add a level of indirection that requires time-consuming calibration. Alerting is a safety net, and if it’s not moldable into your real-world scenarios, or requires settling for levels of false alerts or missing alerts. then there will be significant gaps in its ability to support you.

One example of a real-world problem is having a sporadic volume service, or simply a service that has lulls in the early mornings, such as a web forum or perhaps a promo redemption service where the error rate percentage is higher. In the following example, we take the time of day into consideration. Between midnight and 2am, we allow for higher error rates since we know there is known noise.

if(between(hour(“US/Pacific“),0,2),
mavg(5m, sum(msum(5m, rate(ts(“demoapp.mymetric.error“))))/sum(msum(5m, rate(ts(“demoapp.mymetric.*“)))) > .3,
mavg(5m, sum(msum(5m, rate(ts(“demoapp.mymetric.error“))))/sum(msum(5m, rate((ts(“demoapp.mymetric.*“))))) > .2)

So now no one is needlessly woken up at 2am, because you were able to tailor the condition to a specific real world condition.

Rolling Back on Our Discussion

We hope this discussion has been of some use, and that the Serverless.com plugin is useful too. This is just one example of how Wavefront’s powerful, analytics-driven alerting can be integrated into the deployment process for an application and improve the end user experience by detecting a bad deployment before a human may be able to react. Automatic rollbacks are not appropriate in every deployment depending on the code of course, however when used with care, they’re a valuable tool in improving your end user experience when things go wrong.

Interested to learn more about how Wavefront can monitor your applications and cloud services on AWS like Lambda and more? Want to alert more intelligently using advanced analytics and dynamic queries to reduce alert noise, and automate more with confidence? Sign up now for a free hands-on demo and trial of the Wavefront service.

Thank you to Anggoro Dewanto at AccelByte Inc for developing the Serverless plugin.

 

Get Started with Wavefront Follow @conor_bev

The post AWS Lambda Serverless, Automatic Rollbacks, and Wavefront Monitoring appeared first on Wavefront by VMware.

Previous
Containers are Leaky Abstractions (and other truths I hide from my kids)
Containers are Leaky Abstractions (and other truths I hide from my kids)

I recently sat on a panel at the ContainerWorld 2017 conference, discussing the maturity model of container...

Next
Metrics vs. Logs Dilemma: Selecting the Right Platform for Monitoring Your Cloud Services (part 3 of 3)
Metrics vs. Logs Dilemma: Selecting the Right Platform for Monitoring Your Cloud Services (part 3 of 3)

The post Metrics vs. Logs Dilemma: Selecting the Right Platform for Monitoring Your Cloud Services (part 3 ...

Visionary in Gartner® Magic Quadrant™

Learn More