Application Metrics Collection in Kubernetes via Telegraf Sidecars and Wavefront

August 13, 2018 Bill Shetti

(This is a follow-up to the blog: Monitoring VMware Kubernetes Engine and Application Metrics with Wavefront)

Kubernetes is becoming the de facto management tool to run applications homogeneously across resources (bare metal, public cloud, or private cloud). The most widely deployed operational component in Kubernetes is monitoring. Prometheus is often used to aggregate metrics from a cluster, and Grafana is then used to graph those metrics. Most articles showcase Prometheus and Grafana with a focus on cluster-level (node, pod, etc.) metrics; rarely do they discuss application-level metrics. While Prometheus has exporters – e.g. for MySQL (see setup) and for Nginx – there are alternative mechanisms to export application metrics.

In this blog, I will explore the use of Telegraf as a sidecar to extract metrics from different application components such as Flask, Django, and MySQL.

  • Flask is a Python-based web framework used to build websites and API servers.
  • Django is a Python-based web framework, similar to Flask, but generally geared toward building complex, database-driven websites.
  • MySQL is an open-source relational database management system.

Telegraf has a wider range of plugins than Prometheus’ set of exporters, and it can send metrics to multiple destinations (e.g. Wavefront, Prometheus, etc.). In this configuration I will showcase Wavefront, which can aggregate metrics from all Kubernetes clusters. This differs from Prometheus, which generally displays metrics only for the specific cluster it’s deployed in.

Application & Cluster Metrics in Wavefront

Before walking through the detailed Telegraf setup with Wavefront, it’s useful to see the end product. Telegraf collects metrics from the Flask, Django, and MySQL containers and sends them to Wavefront; the following graphs show the output in Wavefront. In addition, Wavefront also shows the cluster metrics (node/pod/namespace stats). The creation and configuration of the sidecars are detailed in the next few sections.

Application Metrics

API-server metrics (Flask-based)

Metrics detailed above are generally added by the developer in Python for specific API calls in Flask. The two metrics on display are for a particular API call (i.e., get all signed-up users):

  • “Timer” per API call – several metrics such as Timer_stddev, Timer_mean, Timer_upper, etc. are displayed per call
  • Total number of times this call is made in any given period

The application emits these metrics via StatsD (port 8125), and they are collected by a Telegraf sidecar running in the same pod as the API-server.
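
To make the instrumentation concrete, here is a minimal sketch (not the actual Fitcycle source) of how a Flask handler can emit the counter and timer above using the third-party statsd client package. The route, metric names, and fetch_users_from_db helper are illustrative assumptions; because the Telegraf sidecar shares the pod’s network namespace, "localhost" on UDP 8125 reaches it.

import statsd
from flask import Flask, jsonify

app = Flask(__name__)

# The Telegraf sidecar listens on UDP 8125 inside the same pod,
# so "localhost" reaches it from the application container.
stats = statsd.StatsClient("localhost", 8125, prefix="apiserver")

def fetch_users_from_db():
    # Placeholder for the real MySQL query against fitcycle-mysql
    return [{"name": "example"}]

@app.route("/api/users")
def get_users():
    stats.incr("getusers.calls")          # total number of times the call is made
    with stats.timer("getusers.timer"):   # shows up as Timer_mean/_upper/_stddev, etc.
        return jsonify(fetch_users_from_db())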

MySQL Metrics

MySQL Metrics

MySQL metrics are pulled directly from MySQL by Telegraf configured as a MySQL collector; approximately 200+ metrics can be pulled this way.
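
For context, the Telegraf MySQL input essentially logs in to the database and reads the server’s status counters. Here is a minimal, hypothetical sketch of that kind of pull in Python using the third-party PyMySQL package; the host and credential environment variables mirror the ones used in the deployment later in this post.

import os
import pymysql

# Connect with the same environment variables the Fitcycle api-server uses
conn = pymysql.connect(
    host=os.environ.get("MYSQL_SERVER", "fitcycle-mysql"),
    user=os.environ.get("MYSQL_ID", "root"),
    password=os.environ["MYSQL_PASSWORD"],
)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS")
        status = dict(cur.fetchall())
    # A few of the counters Telegraf turns into mysql_* metrics
    for key in ("Threads_connected", "Questions", "Slow_queries"):
        print(key, status.get(key))
finally:
    conn.close()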

Web Server Metrics (Django based)

Django-based Application Metrics

Metrics detailed above are generally added by the developer in Python for specific views in Django. In this case, a form page is being measured. The two metrics on display are for a particular “view” (the form page):

  • “Timer” calculating the time it takes to insert data into the database from the form page – it includes several metrics such as Timer_stddev, Timer_mean, Timer_upper, etc.
  • The total number of times the form is filled out

The application emits these metrics via StatsD (port 8125), and they are collected by a Telegraf sidecar running in the same pod as the web-server.
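
As with the API-server, here is a minimal views.py-style sketch (not the actual Fitcycle source) of how a Django form view can emit these metrics with the third-party statsd client. The view name, metric names, and save_signup helper are hypothetical.

import statsd
from django.shortcuts import redirect, render

# The Telegraf sidecar in the web-server pod listens on UDP 8125
stats = statsd.StatsClient("localhost", 8125, prefix="webserver")

def save_signup(form_data):
    # Placeholder for the ORM insert into the signup table
    pass

def signup_form(request):
    if request.method == "POST":
        stats.incr("signup.submissions")        # how many times the form is filled out
        with stats.timer("signup.db_insert"):   # time to insert the form data into the database
            save_signup(request.POST)
        return redirect("/thanks/")
    return render(request, "signup.html")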

Cluster Metrics

In addition to application metrics, the full set of cluster metrics is also displayed. This is achieved using Heapster, with output sent to Wavefront. The following cluster metrics are generally shown:

  • Namespace-level metrics
  • Node-level metrics
  • Pod-level metrics
  • Pod container metrics

The following set of charts shows the standard Kubernetes dashboard in Wavefront.


Cluster Metrics

Sample Application (called Fitcycle)

In order to walk through the configuration, it’s important to understand the application. I built an application instrumented with StatsD output for Flask and Django and deployed it in Kubernetes. The sample app is called Fitcycle and is located here. You can run it on any Kubernetes platform (GKE, EKS, etc.). I specifically ran it in VMware Kubernetes Engine (VKE). Once deployed, the following services are available:

  • The main webpage and form page for Fitcycle are served by a Django server (the web-server pods)
  • The API is served by a Flask-based server (the API-server pods) – it has multiple replicas
  • MySQL is served by the MySQL pod
  • An Nginx ingress controller is preloaded by VMware Kubernetes Engine (not shown in the diagram below)
  • The Nginx ingress controller uses a URL-based routing rule to load balance between the API-server and web-server
Fitcycle Application

The application outputs the following metrics:

  • The API-server (Flask) and the web-server (Django) output metrics via StatsD to port 8125 in each pod (internally)
  • MySQL metrics can be accessed by logging in and polling the right tables

How do we collect and expose the metrics?

Creating a StatsD Collector Using Telegraf

Telegraf has a wide variety of inputs/outputs. In deploying Telegraf to collect the application metrics for Fitcycle, I created a StatsD container with the following configuration:

  • StatsD input plugin listening on port 8125 for metrics emitted by the main container in the API-server and web-server pods.
  • Wavefront output plugin to send the output to the Wavefront proxy service running in the cluster.
  • FULL LIST of Telegraf outputs

A detailed repo for building the container is located here. The container uses the Alpine version of Telegraf but replaces the standard telegraf.conf file with the following:

telegraf.conf
# Global tags can be specified here in key="value" format.
[global_tags]
  pod_name = "$POD_NAME"

# Configuration for telegraf agent
[agent]
  ## Default data collection interval for all inputs
  interval = "$INTERVAL"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will send metrics to outputs in batches of at
  ## most metric_batch_size metrics.
  metric_batch_size = 1000
  ## For failed writes, telegraf will cache metric_buffer_limit metrics for each
  ## output, and will flush this buffer on a successful write. Oldest metrics
  ## are dropped first when this buffer fills.
  metric_buffer_limit = 10000

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "0s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "$INTERVAL"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "0s"

  ## By default, precision will be set to the same timestamp order as the
  ## collection interval, with the maximum being 1s.
  ## Precision will NOT be used for service inputs, such as logparser and statsd.
  ## Valid values are "Nns", "Nus" (or "Nµs"), "Nms", "Ns".
  precision = ""
  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = false
  ## Override default hostname, if empty use os.Hostname()
  hostname = "$NODE_HOSTNAME"
  ## If set to true, do not set the "host" tag in the telegraf agent.
  omit_hostname = false

# Statsd Server
[[inputs.statsd]]
  ## Protocol, must be "tcp", "udp4", "udp6" or "udp" (default=udp)
  protocol = "udp"

  ## MaxTCPConnection - applicable when protocol is set to tcp (default=250)
  max_tcp_connections = 250

  ## Enable TCP keep alive probes (default=false)
  tcp_keep_alive = false

  ## Specifies the keep-alive period for an active network connection.
  ## Only applies to TCP sockets and will be ignored if tcp_keep_alive is false.
  ## Defaults to the OS configuration.
  # tcp_keep_alive_period = "2h"

  ## Address and port to host UDP listener on
  service_address = ":8125"

  ## The following configuration options control when telegraf clears its cache
  ## of previous values. If set to false, then telegraf will only clear its
  ## cache when the daemon is restarted.
  ## Reset gauges every interval (default=true)
  delete_gauges = true
  ## Reset counters every interval (default=true)
  delete_counters = true
  ## Reset sets every interval (default=true)
  delete_sets = true
  ## Reset timings & histograms every interval (default=true)
  delete_timings = true

  ## Percentiles to calculate for timing & histogram stats
  percentiles = [90]

  ## separator to use between elements of a statsd metric
  metric_separator = "_"

  ## Parses tags in the datadog statsd format
  ## http://docs.datadoghq.com/guides/dogstatsd/
  parse_data_dog_tags = false

  ## Statsd data translation templates, more info can be read here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md#graphite
  # templates = [
  #   "cpu.* measurement*"
  # ]

  ## Number of UDP messages allowed to queue up, once filled,
  ## the statsd server will start dropping packets
  allowed_pending_messages = 10000

  ## Number of timing/histogram values to track per-measurement in the
  ## calculation of percentiles. Raising this limit increases the accuracy
  ## of percentiles but also increases the memory usage and cpu time.
  percentile_limit = 1000

  # Specify optional tags to be applied to all metrics for this plugin
  # NOTE: Order matters, this needs to be at the end of the plugin definition
  # [[inputs.statsd.tags]]
  #   tag1 = "foo"
  #   tag2 = "bar"

# Configuration for Wavefront proxy to send metrics to
[[outputs.wavefront]]
  host = "$WAVEFRONT_PROXY"
  port = 2878
  metric_separator = "."
  source_override = ["hostname", "nodename"]
  convert_paths = true
  use_regex = false

As shown in the configuration above, two plugins are configured for Telegraf:

  • Input section (for StatsD)
  • Output section (for Wavefront, but can also be replaced with Prometheus)

There are several environment variables in the configuration above that are important to note:

  • $POD_NAME – the name of the pod, useful if you want to distinguish a particular pod (I will pass this in when using the container in Kubernetes as a sidecar)
  • $NODE_HOSTNAME – the node where the pod is running (I will get this from Kubernetes via the Downward API (spec.nodeName) when creating the sidecar container)
  • $INTERVAL – the collection interval
  • $WAVEFRONT_PROXY – the Kubernetes service name, DNS name, or IP of the Wavefront proxy

This telegraf.conf is used in the Dockerfile to create the container.

Dockerfile
# Telegraf agent configured for Wavefront output intended to be used in a sidecar config

FROM telegraf:alpine

ENV WAVEFRONT_PROXY="wavefront-proxy"
ENV INTERVAL="60s"

COPY telegraf.conf /etc/telegraf/telegraf.conf

CMD ["telegraf", "--config", "/etc/telegraf/telegraf.conf", 
"--config-directory", "/etc/telegraf/telegraf.d"]

Now simply run docker build -t telegraf-statsd . and push the image to your favorite registry. My version of the Telegraf-based StatsD container is available via Google Container Registry.

Kubernetes Configuration Using the Telegraf-StatsD Container

Now that the StatsD collector container is built and saved, I added it to several Kubernetes deployment YAML files (the api-server and web-server pods). I’ll walk through the api-server (Flask) Kubernetes deployment file to show how to configure the StatsD collector as a sidecar. The Django and MySQL configurations are similar; details are in my GitHub repo.

Here is the deployment YAML for the api-server:
apiVersion: apps/v1beta1 # for versions before 1.8.0 use apps/v1beta1
kind: Deployment
metadata:
  name: api-server
  labels:
    app: fitcycle
spec:
  selector:
    matchLabels:
      app: fitcycle
      tier: api
  strategy:
    type: Recreate
  replicas: 3
  template:
    metadata:
      labels:
        app: fitcycle
        tier: api
    spec:
      volumes:
      - name: "fitcycle-apiserver-data"
        emptyDir: {}
      containers:
      - image: gcr.io/learning-containers-187204/api-server-ml:latest
        name: api-server
        env:
        - name: MYSQL_ID
          value: "root"
        - name: MYSQL_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysql-pass
              key: password
        - name: MYSQL_SERVER
          value: fitcycle-mysql
        ports:
        - containerPort: 5000
          name: api-server
        volumeMounts:
        - mountPath: "/data"
          name: "fitcycle-apiserver-data"
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "500m"

      - image: gcr.io/learning-containers-187204/telegraf-statsd-sc:latest
        name: telegraf-statsd
        ports:
        - name: udp-statsd
          containerPort: 8125
          protocol: UDP
        - name: udp-8092
          containerPort: 8092
        - name: tcp-8094
          containerPort: 8094
        env:
        - name: WAVEFRONT_PROXY
          value: wavefront-proxy
        - name: INTERVAL
          value: 60s
        - name: METRIC_SOURCE_NAME
          # This can be changed to use the Deployment / StatefulSet name instead as a simple value
          # The metric source name should not be an ephemeral value
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: NODE_HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        resources:
          requests:
            memory: 30Mi
            cpu: 100m
          limits:
            memory: 50Mi
            cpu: 200m

Key items to note in the configuration:

- name: NODE_HOSTNAME
  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName

  • spec.nodeName returns the name of the node the pod is scheduled on, and is used to populate NODE_HOSTNAME (shown above)
  • The collection INTERVAL is set to 60s for Wavefront
  • WAVEFRONT_PROXY is set to the service name of the Wavefront proxy running in the Kubernetes cluster. Installation Notes Here.
  • Port 8125 is exposed, on which the sidecar listens for the StatsD output from the API-server
To deploy it, run:

kubectl create -f api-server-deployment.yaml

Follow the instructions in the GitHub repo for the Django and MySQL configurations.
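
As a quick sanity check (optional, and not part of the original repo), you can exec into the application container and send a raw StatsD packet to the sidecar; if everything is wired up, a hypothetical fitcycle.smoke_test counter should appear in Wavefront within one collection interval:

import socket

# StatsD counters are plain UDP datagrams of the form "name:value|c".
packet = b"fitcycle.smoke_test:1|c"
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(packet, ("127.0.0.1", 8125))  # Telegraf sidecar in the same pod
sock.close()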

Sample Application (Fitcycle) with Telegraf Sidecars

Now that the sidecars are deployed, we also need to deploy the Wavefront proxy (see instructions in the GitHub repo) and the Wavefront Heapster deployment. The application with sidecars now looks as follows:

App (Fitcycle) with Telegraf sidecars

You can view the output in Wavefront, shown at the beginning of this blog.


Click the links below for more information on Wavefront, Telegraf, and VKE.

Try Wavefront by VMware Free for 30 Days

Telegraf

VMware Kubernetes Engine (VKE)


