Three months ago, we launched VMware Tanzu RabbitMQ for Kubernetes to automate high-performance messaging on demand with our cluster Operator.* Since then, customers have approached us with higher-level needs that inspired us to extend and improve Tanzu RabbitMQ. You’ve spoken, and we’ve listened. In version 1.1, we go well beyond automating cluster operations: we orchestrate complex topologies, add alerts, and preview active-passive replication. So what does this all mean?
Messaging topology operator
This new Operator takes the concept of Tanzu RabbitMQ infrastructure-as-code another step forward, allowing platform or service operators and developers to quickly create users, permissions, queues, exchanges, and queue policies and parameters. The new Operator wraps RabbitMQ APIs in Kubernetes APIs so users can create and modify complex messaging topologies simply by modifying YAML. We’ll provide a deeper dive in an upcoming blog post.
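As a rough illustration of what "topology as YAML" looks like in practice, the sketch below builds a manifest for a single quorum queue and prints it as JSON, which `kubectl apply -f -` accepts alongside YAML. The field names follow the `Queue` resource of the open source messaging topology Operator as we understand it, so check them against the version you have installed. The queue name `orders` and the cluster name `my-rabbit` are hypothetical.

```python
import json


def queue_manifest(queue_name, cluster_name, namespace="default"):
    """Build a Kubernetes manifest describing one durable quorum queue.

    Assumption: the apiVersion, kind, and spec fields mirror the Queue
    custom resource of the messaging topology Operator; verify them
    against the CRDs installed in your cluster.
    """
    return {
        "apiVersion": "rabbitmq.com/v1beta1",
        "kind": "Queue",
        "metadata": {"name": queue_name, "namespace": namespace},
        "spec": {
            "name": queue_name,        # queue name inside RabbitMQ
            "type": "quorum",          # replicated queue type
            "durable": True,
            "autoDelete": False,
            # The RabbitMQ cluster (managed by the cluster Operator)
            # in which the queue should be declared.
            "rabbitmqClusterReference": {"name": cluster_name},
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(json.dumps(queue_manifest("orders", "my-rabbit"), indent=2))
```

Piping the output to `kubectl apply -f -` would ask the Operator to declare the queue. Changing a field and re-applying is how the topology is modified.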
Tanzu RabbitMQ clusters already include a Prometheus endpoint, which lets monitoring tools automatically detect and scrape labeled metrics. In version 1.1, we’ve added a set of alert definitions to run on Prometheus Alert Manager or any compatible tool. These alerts cover Kubernetes- and infrastructure-related problems, as well as RabbitMQ-specific problems. When something fails, users receive an alert with actionable information. We believe this will dramatically reduce mean time to resolution. You can read more about monitoring and alerting in RabbitMQ in a recent blog post on the open source RabbitMQ site.
Active-passive replication for disaster recovery
We’re also introducing another new feature as a technology preview only (it’s not recommended for production use at this time). This enhancement builds on the capabilities already provided by the schema replication plugin, adding safe data replication between one active site and one or more backup sites. Should the primary site fail, the backup site is ready with the unconsumed messages to serve the application.
This new experimental Tanzu RabbitMQ capability enables full replication of both metadata and messages in quorum queues to other clusters in remote sites.
How is it different from DIY solutions? Most people who try to protect their RabbitMQ messages do so by publishing each message to a federated exchange. With a federated exchange, the message is published both to the active cluster and to a remote passive cluster. Keeping the remote cluster from being overrun with messages requires setting a specific time-to-live (TTL). But estimating the proper TTL setting means speculating on the pace of message consumption on the active site.
This approach can work well if messages are flowing normally. In that case, they will be available in the passive site queues almost as soon as they are available on the active site queues. Should the primary site fail, you can simply point both consumers and producers of the application to the alternative site.
Unfortunately, we can’t assume messages will flow correctly for purposes of disaster recovery.
What happens if a network glitch makes the passive site unavailable for a few minutes? Should the application wait? Should it assume that these messages have already been acked on the primary site? Even worse, what happens if the messages expire on the remote site before they have been processed on the active site because a consumer got disconnected or the pace of consumption slowed down?
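To make that last failure mode concrete, here is a small, self-contained simulation. It is a toy model, not RabbitMQ code: it assumes every message reaches the passive site at the moment it is published and is dropped there exactly `ttl` seconds later. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    msg_id: int
    published_at: float  # seconds since the start of the simulation


def lost_on_failover(messages, ttl, consumed_at, failure_time):
    """Return ids of messages that a TTL-based passive copy cannot recover.

    A message is lost when, at the moment the active site fails, it is
    still unconsumed on the active site but has already expired on the
    passive site.

    consumed_at maps msg_id -> time the message was consumed on the
    active site; a missing entry means it was never consumed.
    """
    lost = []
    for m in messages:
        if m.published_at > failure_time:
            continue  # published after the failure, out of scope
        done = consumed_at.get(m.msg_id)
        unconsumed = done is None or done > failure_time
        expired_on_passive = m.published_at + ttl <= failure_time
        if unconsumed and expired_on_passive:
            lost.append(m.msg_id)
    return lost


# One message every 10 seconds, TTL of 30 seconds on the passive site.
messages = [Message(i, i * 10.0) for i in range(6)]

# The consumer handles the first two messages and then stalls.
stalled = {0: 1.0, 1: 11.0}

# The active site fails at t=60: messages 2 and 3 are unconsumed there
# and have already expired on the passive site.
print(lost_on_failover(messages, ttl=30.0, consumed_at=stalled, failure_time=60.0))
```

With a healthy consumer (every message consumed within a second of publication) the same function returns an empty list. This is why the TTL approach looks fine until consumption slows down.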
Rather than speculating about how long a message will spend in the queue, we can now tell the remote site when to purge it. With our new active-passive capability, we can park messages for longer at the remote site without breaking the cluster. This provides an efficient way for one or more remote sites to replicate data faster than with a federated exchange or a shovel.
To synchronize the schema (messaging topologies and RBAC configuration) between the active and remote sites, the active-passive solution uses an existing commercial plugin. This removes another burden from the human operator, who previously had to rely on a DIY solution to synchronize users, permissions, and messaging topologies between the clusters.
Last but not least, this solution orchestrates the replication for the operator. With a single command, the user can decide which queues on the active site will be replicated and which remote sites will follow and replicate the data.
If you combine this capability with the core capability of Tanzu RabbitMQ to automate the provisioning of the clusters and messaging topology objects, you can deploy a production topology of Tanzu RabbitMQ with a disaster recovery standby site in a matter of minutes.
If you are a Tanzu RabbitMQ user, you can download the release and follow the quick-start guide.
*We use the upper-cased “Operator” to denote Kubernetes Operators vs. platform or service human operators.