Why care about Dopplers
You might be wondering what a Doppler is (and why you should care about it). Doppler VMs are a core component of log and metrics transport, one that you probably won’t care about until it stops working. Insufficiently scaled Doppler VMs are a frequent source of log loss in Cloud Foundry. This post aims to be a resource for an operator whose platform is dropping logs and who is trying to determine whether Doppler is the reason.
Anyone who has worked with the logging pipeline in Cloud Foundry has probably heard of Dopplers, but how many actually know what they do? Dopplers are an intermediary point on a log’s journey through Cloud Foundry. A log or metric begins its journey when it is emitted by a Cloud Foundry application or component. The emitted logs and metrics are sent to a logging agent on the same VM, which then forwards them to the Dopplers. Dopplers receive logs and metrics and make them available to consumers. Common consumers include the various nozzles that pull from the Firehose, like Splunk and Datadog, and Pivotal products, like Healthwatch and Metrics.
Is Doppler Under-Scaled?
There are some key scaling indicators that can help determine whether Doppler is the problem, and if it is, what to do about it.
First, take a look at your logging system metrics.
The most universally accessible method for users of Cloud Foundry to get at these metrics is to query the Log Cache. The section below lists a few metrics to look at. For each metric, a PromQL query is given that can be issued against the Log Cache to get the relevant information.
There are two methods to execute a PromQL query against Log Cache: using the cf query command from the log-cache-cli plugin, or curling the Log Cache directly:
CF Query (requires CLI plugin)
$ cf query "<PromQL>"
Curling Log Cache
$ curl -v https://log-cache.<cf system domain>/api/v1/query --data-urlencode "query=<PromQL>" -H "Authorization: $(cf oauth-token)"
Metrics to Inspect
Check the following metrics for the conditions noted. If any of these conditions are true, then Doppler is probably under-scaled. (A short example of running these checks with cf query follows the list.)
- The doppler.ingress metric is higher than 16,000 per second per Doppler.
  - PromQL: rate(ingress{source_id='doppler'}[5m])
- CPU usage on the Doppler VMs is consistently above 75%.
- The doppler.dropped metric with tag {direction=ingress} is greater than zero.
  - PromQL: sum(max_over_time(dropped{source_id='doppler', direction='ingress'}[<some time>])) by (index) > 0
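As a concrete example, the queries above can be plugged directly into the cf query command shown earlier. This is only a sketch: it assumes the log-cache-cli plugin is installed and uses a one-hour window in place of <some time>; pick whatever window matches the period you are investigating.

Per-Doppler ingress rate (compare each value against 16,000)
$ cf query "rate(ingress{source_id='doppler'}[5m])"

Dopplers that dropped envelopes on ingress over the last hour
$ cf query "sum(max_over_time(dropped{source_id='doppler', direction='ingress'}[1h])) by (index) > 0"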
If none of these metrics are outside their expected ranges, it is very likely that the problem is somewhere else in the logging pipeline.
Scaling Up
Now that we know the ‘what’, we can handle the ‘what to do’.
In general, adding more Dopplers is recommended. This is because, regardless of the amount of CPU allocated, Dopplers have a hard limit of 16,000 logs per second per Doppler. The amount to scale varies by foundation. A good strategy is to look at the rate of the ingress metric, figure out the minimum number of Dopplers by dividing by 16,000, and then add another 20% on top of that.
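To make that arithmetic concrete, here is a hypothetical sizing pass. The 200,000-envelopes-per-second figure is purely illustrative, and the aggregated query is an assumption that your Log Cache accepts a plain sum() over the same rate() expression used above; if it does not, sum the per-Doppler values by hand.

Total ingress rate across all Dopplers
$ cf query "sum(rate(ingress{source_id='doppler'}[5m]))"

Suppose that rate peaks at roughly 200,000 logs per second:
minimum Dopplers: ceil(200,000 / 16,000) = 13
with 20% headroom: ceil(13 * 1.2) = 16 Dopplers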
Once Dopplers have been scaled, check for dropped logs and scale further if necessary.
The exception to this horizontal scaling rule is that a foundation can have at most 40 Dopplers before overall logging function degrades and horizontal scaling has no effect. When a foundation is at 40 Dopplers and still dropping logs, it is possible to try scaling CPU as a buffering measure, but ultimately it is likely that the foundation will need to be sharded into smaller foundations.
Other Problems
It is worth mentioning that there is another metric, doppler.dropped with tag {direction=egress}, that indicates slow consumers of Dopplers. A non-zero value for this metric can be an indication that logs are dropping because of a slow consumer rather than an under-scaled Doppler. Slow consumers are another big topic that will be covered in their own future post.
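As a quick check, the egress direction can be queried the same way as the ingress drops above. The query below is modeled on the ingress example; the one-hour window and the exact query shape are assumptions rather than anything this post prescribes.

Dopplers dropping envelopes toward consumers over the last hour
$ cf query "sum(max_over_time(dropped{source_id='doppler', direction='egress'}[1h])) by (index) > 0"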