VMware RabbitMQ for Kubernetes v1.4.1 DR Configuration Guide

April 20, 2023

Co-written by Xiaoqing Wang and Chunyi Lyu.

In the previous article we walked through installing and deploying VMware RabbitMQ for Kubernetes. Building on that, this article describes how to configure Disaster Recovery (DR) between VMware RabbitMQ clusters.

The commercial edition of VMware RabbitMQ provides a DR (Disaster Recovery) capability that replicates schema definitions and message data between clusters (or nodes) across a WAN, enabling multi-site replication; in the official documentation this feature is also called Warm Standby Replication. In the VMware RabbitMQ DR architecture, the cluster that normally serves messaging traffic is called the upstream cluster, and the standby cluster is called the downstream cluster. One upstream cluster can be configured with one or more downstream clusters, but a downstream cluster can have only one upstream cluster. When the upstream cluster fails, the downstream cluster can be promoted by command and take over the messaging service. The downstream cluster holds exactly the same schema definitions (vhosts/exchanges/queues) as the upstream cluster, but before it is promoted no messages are visible in its queues. Messages synchronized from the upstream cluster are stored in internal Stream queues on the downstream cluster; only when the downstream cluster is promoted are these messages pushed into the corresponding queues, and by default only messages that were not consumed (or not acknowledged) on the upstream cluster are pushed, which minimizes duplicate consumption. For more about VMware RabbitMQ Warm Standby Replication, see the official documentation: Warm Standby Replication (vmware.com)

Synchronization between the upstream and downstream VMware RabbitMQ clusters is implemented by two plugins (a quick verification sketch follows this list):

  • rabbitmq_schema_definition_sync: replicates schema definitions from the upstream cluster to the downstream cluster. Schema definitions include vhosts, exchanges, queues, users, policies, and so on; all schema definitions on the upstream cluster are replicated to the downstream cluster. The plugin uses the AMQP 1.0 protocol to communicate between upstream and downstream.
  • rabbitmq_standby_replication: replicates messages from the upstream cluster to the downstream cluster. Message replication is configured per vhost; the plugin automatically replicates the messages of all queues under a given vhost (or of a subset of queues selected with a regular expression) to the downstream cluster. In the current version only quorum queues are supported. The plugin uses the Stream protocol to communicate between upstream and downstream.
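
If you want to confirm from the command line that these plugins are enabled once the clusters described below are running, a minimal check could look like this (the pod and namespace names are the ones used later in this article):

# list the enabled plugins on one upstream node; rabbitmq_schema_definition_sync and rabbitmq_standby_replication should appear
kubectl exec -it upstream-cluster-server-0 -n vmware-rabbitmq -- rabbitmq-plugins list -e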

Below, using two 3-node RabbitMQ clusters as an example, we describe the method and steps for configuring DR (Disaster Recovery) between RabbitMQ clusters.

The main steps covered in this article are:

  1. Create the upstream and downstream clusters.
  2. Configure Schema Replication on the upstream and downstream clusters.
  3. Configure Standby Replication on the upstream and downstream clusters.
  4. Verify the DR configuration and status of the upstream and downstream clusters.
  5. Promote the downstream cluster.
  6. After promoting the downstream cluster.

Creating the Upstream and Downstream Clusters

In this example, the upstream RabbitMQ cluster is deployed in the namespace "vmware-rabbitmq" with the cluster name "upstream-cluster"; the downstream RabbitMQ cluster is deployed in the namespace "tanzu-rabbitmq" with the cluster name "downstream-cluster".

1. Create the namespaces and imagePullSecrets

Use the following commands to create the upstream and downstream namespaces in Kubernetes:

# for the upstream RabbitMQ cluster
kubectl create namespace vmware-rabbitmq
# for the downstream RabbitMQ cluster
kubectl create namespace tanzu-rabbitmq

Create an imagePullSecret in both namespaces

An imagePullSecret must be created in both namespaces; it contains the credentials for accessing Tanzu Network. In the previous article we already created a secret named "tanzu-rabbitmq-registry-creds" in the "secrets-ns" namespace and exported it to all namespaces, and we also created a placeholder secret named "tanzu-rabbitmq-registry-creds" (pointing to "secrets-ns/tanzu-rabbitmq-registry-creds") in the "vmware-rabbitmq" namespace.

Use the following command to confirm whether the "tanzu-rabbitmq-registry-creds" object already exists in the "vmware-rabbitmq" namespace:

kubectl get secret -n vmware-rabbitmq

If it does not, run kubectl apply -f upstream-cluster-registry-creds.yml to create it. The definition of upstream-cluster-registry-creds.yml is as follows:

vi upstream-cluster-registry-creds.yml

apiVersion: v1
kind: Secret
metadata:
  name: tanzu-rabbitmq-registry-creds
  namespace: vmware-rabbitmq  # replace with your namespace.
  annotations:
    secretgen.carvel.dev/image-pull-secret: "secrets-ns/tanzu-rabbitmq-registry-creds"
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: e30K

In the "tanzu-rabbitmq" namespace, run kubectl apply -f downstream-cluster-registry-creds.yml to create the Secret object. The definition of downstream-cluster-registry-creds.yml is as follows:

vi downstream-cluster-registry-creds.yml

apiVersion: v1
kind: Secret
metadata:
  name: tanzu-rabbitmq-registry-creds
  namespace: tanzu-rabbitmq  # replace with your namespace.
  annotations:
    secretgen.carvel.dev/image-pull-secret: "secrets-ns/tanzu-rabbitmq-registry-creds"
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: e30K

2. Edit upstream-cluster.yml and downstream-cluster.yml

Definition of the upstream cluster, upstream-cluster.yml:

vi upstream-cluster.yml

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: upstream-cluster
  namespace: vmware-rabbitmq
spec:
   replicas: 3
   imagePullSecrets:
   - name: tanzu-rabbitmq-registry-creds
   service:
     type: NodePort
   rabbitmq:
     additionalPlugins:
       - rabbitmq_stream
       - rabbitmq_schema_definition_sync
       - rabbitmq_standby_replication
       - rabbitmq_stream_management
     additionalConfig: |
       schema_definition_sync.operating_mode = upstream
       standby.replication.operating_mode = upstream
       standby.replication.retention.size_limit.messages = 5000000000
       # standby.replication.retention.time_limit.messages = 12h

Note:

a.) The upstream cluster's schema_definition_sync.operating_mode and standby.replication.operating_mode parameters must be set to "upstream".

b.) Once the DR configuration is complete, an internal Stream queue named "rabbitmq.internal.osr.messages" is created automatically in every vhost whose messages are to be replicated (i.e., vhosts with tags: ["standby_replication"]); messages destined for the downstream are buffered in this queue. Its length can be limited in two ways, by the disk space the messages occupy or by a message retention time:

  • standby.replication.retention.size_limit.messages limits the maximum disk space used by the internal Stream queue; the default is 5 GB.
  • standby.replication.retention.time_limit.messages limits how long messages are retained in the internal Stream queue; the default is 12h.
  • The YAML above uses the default values; adjust these parameters as needed.

c.) For ease of testing (exposing the ports), .service.type: NodePort is used here; VMware RabbitMQ also supports the ClusterIP and LoadBalancer types.

Definition of the downstream cluster, downstream-cluster.yml:

vim downstream-cluster.yml

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: downstream-cluster
  namespace: tanzu-rabbitmq
spec:
   replicas: 3
   imagePullSecrets:
   - name: tanzu-rabbitmq-registry-creds
   service:
     type: NodePort
   rabbitmq:
     additionalPlugins:
       - rabbitmq_stream
       - rabbitmq_schema_definition_sync
       - rabbitmq_standby_replication
       - rabbitmq_stream_management
     additionalConfig: |
       schema_definition_sync.operating_mode = downstream
       standby.replication.operating_mode = downstream
       schema_definition_sync.downstream.locals.users = ^default_user_
       schema_definition_sync.downstream.locals.global_parameters = ^standby

Note:

a.) The downstream cluster's schema_definition_sync.operating_mode and standby.replication.operating_mode parameters must be set to "downstream".

b.) By default the downstream cluster's schema definitions are identical to the upstream cluster's, but the schema_definition_sync.downstream.locals.* parameters can define exceptions that allow the downstream cluster to keep some schema definitions of its own. With the YAML above, the downstream cluster keeps its own users (names starting with "default_user_") and global parameters (names starting with "standby").

3. Create the upstream and downstream RabbitMQ clusters

Apply the YAML files to create the upstream and downstream clusters:

kubectl apply -f upstream-cluster.yml
kubectl apply -f downstream-cluster.yml

# confirm that both clusters are running (STATUS: Running)
root@client:~/dr# kubectl get pod -n vmware-rabbitmq

NAME                        READY   STATUS    RESTARTS   AGE
upstream-cluster-server-0   1/1     Running   0          4m1s
upstream-cluster-server-1   1/1     Running   0          4m1s
upstream-cluster-server-2   1/1     Running   0          4m1s

root@client:~/dr# kubectl get pod -n tanzu-rabbitmq

NAME                          READY   STATUS    RESTARTS   AGE
downstream-cluster-server-0   1/1     Running   0          2m21s
downstream-cluster-server-1   1/1     Running   0          2m21s
downstream-cluster-server-2   1/1     Running   0          2m21s

Use the following commands to check the node ports to which the RabbitMQ services of the upstream and downstream clusters are mapped (note the ports mapped from 5672 and 15672):

root@client:~/dr# kubectl get svc -n vmware-rabbitmq 

NAME                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                         AGE
upstream-cluster         NodePort    10.110.224.126   <none>        5672:31743/TCP,15672:30557/TCP,5552:31718/TCP,15692:32287/TCP   5m35s
upstream-cluster-nodes   ClusterIP   None             <none>        4369/TCP,25672/TCP                                              5m35s
root@client:~/dr# kubectl get svc -n tanzu-rabbitmq

NAME                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                                                         AGE
downstream-cluster         NodePort    10.108.8.138   <none>        15672:30631/TCP,5552:30093/TCP,15692:31189/TCP,5672:32672/TCP   3m49s
downstream-cluster-nodes   ClusterIP   None           <none>        4369/TCP,25672/TCP                                              3m49s

Use the following commands to obtain the default usernames and passwords of the two clusters:
# get the RabbitMQ default user secrets in both namespaces
root@client:~/dr# kubectl get secret -n vmware-rabbitmq | grep default-user

upstream-cluster-default-user         Opaque                                8      10m
root@client:~/dr# kubectl get secret -n tanzu-rabbitmq | grep default-user

downstream-cluster-default-user         Opaque                                8      8m43s
# get the upstream cluster's default user name
root@client:~/dr# kubectl get secret upstream-cluster-default-user -n vmware-rabbitmq -o jsonpath="{.data.username}" | base64 --decode

default_user_3O9yX-FfDA1GpOY2Aij
# get the upstream cluster's default user password
root@client:~/dr# kubectl get secret upstream-cluster-default-user -n vmware-rabbitmq -o jsonpath="{.data.password}" | base64 --decode

RpuIpDUCquO4pMfN7kw16--DUFjJV-QF
# get the downstream cluster's default user name
root@client:~/dr# kubectl get secret downstream-cluster-default-user -n tanzu-rabbitmq -o jsonpath="{.data.username}" | base64 --decode

default_user_reCbFdqSCIKNlG4vKSv
# get the downstream cluster's default user password
root@client:~/dr# kubectl get secret downstream-cluster-default-user -n tanzu-rabbitmq -o jsonpath="{.data.password}" | base64 --decode

u0geY8kJw-dKVpyPZeyKAEHZttIvYSKc

Open the Management UI (http://k8s_node_ip:port) to check the status of each cluster.
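
If the node ports are not reachable from your workstation, a kubectl port-forward is a simple alternative for reaching the Management UI; this is only a convenience sketch and not part of the DR configuration:

# forward the upstream Management UI to localhost:15672
kubectl port-forward svc/upstream-cluster 15672:15672 -n vmware-rabbitmq
# forward the downstream Management UI to localhost:15673
kubectl port-forward svc/downstream-cluster 15673:15672 -n tanzu-rabbitmq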

Configuring Schema Replication on the Upstream and Downstream Clusters

Schema Replication must be configured on both the upstream and the downstream cluster. Once configured, the upstream cluster's schema definitions are fully replicated to the downstream cluster and kept in near-real-time sync between the two clusters.

1. Configure Schema Replication on the upstream cluster

a.) Create the replicator user and grant permissions

The "replicator" user is used between the upstream and downstream clusters to synchronize schema definitions and message data; you can also choose your own username.

vi upstream-cluster-replicator-user-secret.yml

---
apiVersion: v1
kind: Secret
metadata:
  name: upstream-replicator-secret
  namespace: vmware-rabbitmq
type: Opaque
stringData:
  username: replicator
  password: YourPassWord
---
apiVersion: rabbitmq.com/v1beta1
kind: User
metadata:
  name: rabbitmq-replicator
  namespace: vmware-rabbitmq
spec:
  rabbitmqClusterReference:
    name:  upstream-cluster   # the upstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  importCredentialsSecret:
    name: upstream-replicator-secret

Note that after applying the YAML above, the username actually created in RabbitMQ is "replicator".

vi upstream-cluster-replicator-permission.yml

apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.rabbitmq-schema-definition-sync.all
  namespace: vmware-rabbitmq
spec:
  vhost: "rabbitmq_schema_definition_sync" # name of a vhost
  userReference:
    name: rabbitmq-replicator
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: upstream-cluster

Note: the vhost "rabbitmq_schema_definition_sync" and the queues in it are used to synchronize schema definitions. This vhost is created automatically when the Schema Replication configuration is created.

Apply the YAML files above:

kubectl apply -f upstream-cluster-replicator-user-secret.yml
kubectl apply -f upstream-cluster-replicator-permission.yml

Log in to the Management UI of upstream-cluster and confirm that the user has been created and granted the correct permissions.
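
As an alternative to the Management UI, a quick command-line check (a sketch using the upstream pod name from earlier) is:

# the list should contain the "replicator" user
kubectl exec -it upstream-cluster-server-0 -n vmware-rabbitmq -- rabbitmqctl list_users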

b.) Create the SchemaReplication configuration

vi upstream-cluster-schema-replication.yml

apiVersion: rabbitmq.com/v1beta1
kind: SchemaReplication
metadata:
  name: upstream-cluster-schema-replication
  namespace: vmware-rabbitmq
spec:
  endpoints: "10.110.224.126:5672"
  upstreamSecret:
    name: upstream-replicator-secret
  rabbitmqClusterReference:
    name: upstream-cluster

The YAML above must specify the IP address of the upstream cluster's service; use the following command to obtain the internal IP of the upstream service:

root@client:~# kubectl get svc -n vmware-rabbitmq

NAME                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                         AGE
upstream-cluster         NodePort    10.110.224.126   <none>        5672:31743/TCP,15672:30557/TCP,5552:31718/TCP,15692:32287/TCP   14h
upstream-cluster-nodes   ClusterIP   None             <none>        4369/TCP,25672/TCP                                              14h

Apply the YAML file to create the SchemaReplication configuration:

kubectl apply -f upstream-cluster-schema-replication.yml

2. Configure Schema Replication on the downstream cluster

a.) Create the replicator user on the downstream cluster

Create the "replicator" user on the downstream cluster. Its definition is almost identical to the upstream cluster's replicator (only the namespace differs), and the downstream cluster uses this user to connect to the upstream cluster, so the username and password in this secret should match the replicator credentials defined on the upstream cluster.

vi downstream-cluster-replicator-user-secret.yml

---
apiVersion: v1
kind: Secret
metadata:
  name: upstream-replicator-secret
  namespace: tanzu-rabbitmq
type: Opaque
stringData:
  username: replicator
  password: VMware1!
---
apiVersion: rabbitmq.com/v1beta1
kind: User
metadata:
  name: rabbitmq-replicator
  namespace: tanzu-rabbitmq
spec:
  rabbitmqClusterReference:
    name:  downstream-cluster # the downstream RabbitMQ cluster name. It must be in the same namespace and it is a mandatory value.
  importCredentialsSecret:
    name: upstream-replicator-secret

Apply the YAML file to create the user on the downstream cluster:

kubectl apply -f downstream-cluster-replicator-user-secret.yml

b.) Create the SchemaReplication configuration on the downstream cluster:

The downstream cluster's SchemaReplication YAML is almost identical to the upstream cluster's (only the namespace differs); endpoints must point to the internal IP of the upstream cluster's service.

vi downstream-cluster-schema-replication.yml

apiVersion: rabbitmq.com/v1beta1
kind: SchemaReplication
metadata:
  name: downstream-cluster-schema-replication
  namespace: tanzu-rabbitmq
spec:
  endpoints: "10.110.224.126:5672"
  upstreamSecret:
    name: upstream-replicator-secret
  rabbitmqClusterReference:
    name: downstream-cluster

Apply the YAML file to create the SchemaReplication configuration on the downstream cluster:

kubectl apply -f downstream-cluster-schema-replication.yml

3. Verify the Schema Replication configuration

a.) Check the connections in the Management UI

Log in to the upstream cluster's Management UI and check the connections ("Connections"). Because the upstream and downstream clusters each have three nodes, you should see 6 AMQP 1.0 connections created by the "replicator" user.
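
The same connections can also be listed from a pod; a minimal sketch (the schema sync connections opened by "replicator" should be included in the output):

# list connections on the upstream cluster together with the user that opened them
kubectl exec -it upstream-cluster-server-0 -n vmware-rabbitmq -- rabbitmqctl list_connections user state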

Open the "Queues" page and confirm that the internal queues (rabbitmq.internal.definition_sync.*) have been created under the vhost "rabbitmq_schema_definition_sync".
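
The internal queues can also be listed from the command line; a sketch:

# list the queues in the schema definition sync vhost
kubectl exec -it upstream-cluster-server-0 -n vmware-rabbitmq -- rabbitmqctl list_queues -p rabbitmq_schema_definition_sync name messages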

b.) Create a schema definition and verify that it is synchronized

You can create a queue of any type (or any other kind of object) through the upstream cluster's Management UI or from the command line, then log in to the downstream cluster's Management UI to confirm that the object has been synchronized to the downstream.
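
For example, a quick test from the command line could look like this (a sketch; the vhost name schema-sync-test is arbitrary):

# on the upstream cluster: create a throw-away vhost
kubectl exec -it upstream-cluster-server-0 -n vmware-rabbitmq -- rabbitmqctl add_vhost schema-sync-test
# on the downstream cluster: the new vhost should appear shortly afterwards
kubectl exec -it downstream-cluster-server-0 -n tanzu-rabbitmq -- rabbitmqctl list_vhosts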

Configuring Standby Replication on the Upstream and Downstream Clusters

Standby Replication must also be configured on both the upstream and the downstream cluster. Once configured, messages received by the upstream cluster are replicated to the downstream cluster in near real time; however, until the downstream cluster is promoted, no messages are visible in the downstream cluster's queues.

1. Configure Standby Replication on the upstream cluster

a.) Create a vhost: test-vhost-1

VMware RabbitMQ replicates message data per vhost. The YAML below defines a vhost "test-vhost-1" and grants the rabbitmq-replicator (replicator) user access to it.

vim upstream-cluster-test-vhost-1-permission.yml

---
apiVersion: rabbitmq.com/v1beta1
kind: Vhost
metadata:
  name: test-vhost-1
  namespace: vmware-rabbitmq
spec:
  name: "test-vhost-1" # vhost name
  tags: ["standby_replication"] # only vhosts with this tag have their queue messages replicated to the downstream
  rabbitmqClusterReference:
    name: upstream-cluster
---
apiVersion: rabbitmq.com/v1beta1
kind: Permission
metadata:
  name: rabbitmq-replicator.test-vhost-1.all
  namespace: vmware-rabbitmq
spec:
  vhost: "test-vhost-1" # name of a vhost
  userReference:
    name: rabbitmq-replicator
  permissions:
    write: ".*"
    configure: ".*"
    read: ".*"
  rabbitmqClusterReference:
    name: upstream-cluster

Note that tags: ["standby_replication"] is added in the definition of the vhost "test-vhost-1". If the messages of queues under a vhost are to be replicated to the downstream cluster, the vhost must carry tags: ["standby_replication"]. In the Standby Replication configuration you can define a pattern for queue names (see upstream-cluster-standby-replication.yml below); only queues whose names match the pattern have their messages replicated to the downstream cluster.

Apply the YAML file to create the vhost and grant the permissions:

kubectl apply -f upstream-cluster-test-vhost-1-permission.yml

Log in to the Management UI and confirm that the vhost "test-vhost-1" has been created. Note that two internal Stream queues are created automatically under "test-vhost-1"; they are used to replicate message data between the upstream and the downstream.
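
The internal Stream queues can also be inspected from the command line; a sketch:

# the internal replication queues of test-vhost-1 (e.g. rabbitmq.internal.osr.messages) should be listed
kubectl exec -it upstream-cluster-server-0 -n vmware-rabbitmq -- rabbitmqctl list_queues -p test-vhost-1 name type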

b.) Configure Standby Replication

The upstream cluster's Standby Replication YAML defines a RabbitMQ policy. The policy contains a regular expression, and only queues whose names match it have their messages replicated to the downstream cluster. The YAML below uses "^.*", which matches all queue names.

vim upstream-cluster-standby-replication.yml

apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
kind: StandbyReplication
metadata:
  name: upstream-cluster-standby-replication
  namespace: vmware-rabbitmq
spec:
  operatingMode: "upstream" # has to be "upstream" to configure an upstream RabbitMQ cluster; required value
  upstreamModeConfiguration: # list of policies that Operator will create
    replicationPolicies:
      - name: test-vhost-1-replica-policy # policy name; required value
        pattern: "^.*" # any regex expression that will be used to match quorum queues name; required value
        vhost: "test-vhost-1" # vhost name; must be an existing vhost;
  rabbitmqClusterReference:
    name: upstream-cluster

Apply the YAML file to create the StandbyReplication object:

kubectl apply -f upstream-cluster-standby-replication.yml

Log in to the Management UI and note that all queues under the vhost "test-vhost-1" have the policy "test-vhost-1-replica-policy" applied.

2. Configure Standby Replication on the downstream cluster

a.) Configure Standby Replication

The downstream cluster's Standby Replication YAML must specify the endpoint and port of the upstream cluster's RabbitMQ service. Note that Standby Replication uses Stream queues to replicate message data, so the port is "5552", not "5672".

vim downstream-cluster-standby-replication.yml

apiVersion: rabbitmq.tanzu.vmware.com/v1beta1
kind: StandbyReplication
metadata:
  name: downstream-cluster-standby-replication
  namespace: tanzu-rabbitmq
spec:
  operatingMode: "downstream" # has to be "downstream" to configure a downstream RabbitMQ cluster
  downstreamModeConfiguration:
    endpoints: "10.110.224.126:5552" # comma separated list of endpoints to the upstream RabbitMQ
    upstreamSecret:
      name: upstream-replicator-secret # an existing Kubernetes secret; required value
  rabbitmqClusterReference:
    name: downstream-cluster

Apply the YAML file to create the StandbyReplication object:

kubectl apply -f downstream-cluster-standby-replication.yml

b.) Verify the Standby Replication configuration:

Log in to the upstream cluster's Management UI and check the connections (Stream); you should see an additional Stream connection coming from the downstream cluster.
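
The stream connection can also be checked from a pod of the upstream cluster; a sketch (the command is provided by the rabbitmq_stream plugin, and depending on which upstream node the downstream connected to, you may need to run it on each node):

# list stream connections; the connection opened by the downstream cluster should appear here
kubectl exec -it upstream-cluster-server-0 -n vmware-rabbitmq -- rabbitmqctl list_stream_connections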

Verifying the DR Configuration and Status of the Upstream and Downstream Clusters

1. Create test queues and publish test messages:

In the vhost "test-vhost-1" on the upstream cluster, create an exchange (of type fanout) and a quorum queue, and bind the quorum queue to the exchange. The YAML below covers the whole process:

vi upstream-cluster-exchange-quorum-binding.yml

---
apiVersion: rabbitmq.com/v1beta1
kind: Queue
metadata:
  name: q1.quorum-queue
  namespace: vmware-rabbitmq
spec:
  name: q1.quorum # name of the queue
  vhost: "test-vhost-1" # default to '/' if not provided
  type: quorum # without providing a queue type, rabbitmq creates a classic queue
  autoDelete: false
  durable: true # setting 'durable' to false means this queue won't survive a server restart
  rabbitmqClusterReference:
    name: upstream-cluster
---
apiVersion: rabbitmq.com/v1beta1
kind: Exchange
metadata:
  name: test1.fanout-exchange
  namespace: vmware-rabbitmq
spec:
  name: test1.fanout # name of the exchange
  type: fanout # can be set to 'direct', 'fanout', 'headers', and 'topic'
  vhost: "test-vhost-1"
  autoDelete: false
  durable: true
  rabbitmqClusterReference:
    name: upstream-cluster
---
apiVersion: rabbitmq.com/v1beta1
kind: Binding
metadata:
  name: q1-test1-binding
  namespace: vmware-rabbitmq
spec:
  source: test1.fanout # an existing exchange
  destination: q1.quorum # an existing queue
  destinationType: queue # can be 'queue' or 'exchange'
  vhost: "test-vhost-1"
  rabbitmqClusterReference:
    name: upstream-cluster

Apply the YAML file to create the exchange, queue, and binding:

kubectl apply -f upstream-cluster-exchange-quorum-binding.yml

Log in to a pod and publish messages from the command line:

root@client:~/dr# kubectl get pod -n vmware-rabbitmq

NAME                        READY   STATUS    RESTARTS        AGE
upstream-cluster-server-0   1/1     Running   1 (17h ago)     18h
upstream-cluster-server-1   1/1     Running   1 (4h41m ago)   18h
upstream-cluster-server-2   1/1     Running   1 (17h ago)     18h

root@client:~/dr# kubectl exec -it upstream-cluster-server-0 -n vmware-rabbitmq -- /bin/bash

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)

rabbitmq [ ~ ]$ rabbitmqadmin --vhost=test-vhost-1 publish exchange=test1.fanout routing_key=test1 payload="hello world 1"

Message published

rabbitmq [ ~ ]$ rabbitmqadmin --vhost=test-vhost-1 publish exchange=test1.fanout routing_key=test2 payload="hello world 2"

Message published

rabbitmq [ ~ ]$ rabbitmqadmin --vhost=test-vhost-1 publish exchange=test1.fanout routing_key=test3 payload="hello world 3"

Message published
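
Optionally, while still inside the upstream pod, you can confirm that the three messages arrived in q1.quorum; a sketch:

# check queue depths without consuming any messages; q1.quorum in test-vhost-1 should show 3 messages
rabbitmqadmin list queues vhost name messages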

2. Verify the Schema Replication status

Log in to any pod of the upstream cluster and check the Schema Replication status:

root@client:~/dr# kubectl exec -it upstream-cluster-server-0 -n vmware-rabbitmq -- /bin/bash

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)

# check the schema replication status on the upstream cluster
rabbitmq [ ~ ]$ rabbitmqctl schema_replication_status

Schema replication status on node rabbit@upstream-cluster-server-0.upstream-cluster-nodes.vmware-rabbitmq
Operating mode: upstream
State: syncing
Upstream endpoint(s): 10.110.224.126:5672
Upstream username: replicator

Note that the command output shows "Operating mode: upstream".

Log in to any pod of the downstream cluster:

root@client:~# kubectl exec -it downstream-cluster-server-0 -n tanzu-rabbitmq -- /bin/bash

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
# check the schema replication status on the downstream cluster
rabbitmq [ ~ ]$ rabbitmqctl schema_replication_status

Schema replication status on node rabbit@downstream-cluster-server-0.downstream-cluster-nodes.tanzu-rabbitmq
Operating mode: downstream
State: syncing
Upstream endpoint(s): 10.110.224.126:5672
Upstream username: replicator

Note that the command output shows "Operating mode: downstream".

3. Verify the Standby Replication status

Log in to any pod of the upstream cluster and run rabbitmq-diagnostics inspect_standby_upstream_metrics; each node reports the replication status of the queues whose leaders are hosted locally.

For example, if the leader of q1.quorum is located on pod upstream-cluster-server-2, log in to pod upstream-cluster-server-2 and the output looks like this:

root@client:~/dr# kubectl exec -it upstream-cluster-server-2 -n vmware-rabbitmq -- /bin/bash

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)

rabbitmq [ ~ ]$ rabbitmq-diagnostics inspect_standby_upstream_metrics

Inspecting standby upstream metrics related to recovery...
queue   timestamp       vhost
q1.quorum       1679046634809   test-vhost-1

The "timestamp" in the output above is the timestamp of the queue's most recent completed message replication.
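
The timestamp is in milliseconds since the Unix epoch; a quick conversion sketch (assuming GNU date):

# 1679046634809 ms since the epoch, passed as seconds with a fractional part; prints the corresponding UTC time
date -u -d @1679046634.809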

Log in to any pod of the downstream cluster and run rabbitmqctl display_disk_space_used_by_standby_replication_data to see how much disk space the replicated message data buffered on the downstream cluster occupies. In this example there are only three messages, so the reported disk usage is 0.

rabbitmq [ ~ ]$ rabbitmqctl display_disk_space_used_by_standby_replication_data

Listing disk space (in gb) used by multi-DC replication
node    size    unit    vhost
rabbit@downstream-cluster-server-0.downstream-cluster-nodes.tanzu-rabbitmq      0.0     gb      test-vhost-1

Note that the node in the output above is "rabbit@downstream-cluster-server-0". Log in to pod "downstream-cluster-server-0" and run the following commands:

root@client:~/dr# kubectl exec -it downstream-cluster-server-0 -n tanzu-rabbitmq -- /bin/bash

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)

rabbitmq [ ~ ]$ rabbitmqctl list_vhosts_available_for_standby_replication_recovery

Listing virtual hosts available for multi-DC replication recovery on node rabbit@downstream-cluster-server-0.downstream-cluster-nodes.tanzu-rabbitmq
test-vhost-1

rabbitmq [ ~ ]$ rabbitmq-diagnostics inspect_standby_downstream_metrics

Inspecting standby downstream metrics related to recovery...
queue   timestamp       vhost
q1.quorum       1679046634809   test-vhost-1

rabbitmq [ ~ ]$ rabbitmq-diagnostics inspect_local_data_available_for_standby_replication_recovery

Inspecting local data replicated for multi-DC recovery
exchange        messages        routing_key     vhost
test1.fanout    1       test3   test-vhost-1
test1.fanout    1       test2   test-vhost-1
test1.fanout    1       test1   test-vhost-1

Promoting the Downstream Cluster

Run the promote command on the downstream cluster

Log in to any pod of the downstream cluster and run the rabbitmqctl promote_standby_replication_downstream_cluster command:

root@client:~# kubectl exec -it downstream-cluster-server-0 -n tanzu-rabbitmq -- /bin/bash

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
# run the promote command
rabbitmq [ ~ ]$ rabbitmqctl promote_standby_replication_downstream_cluster

Will promote cluster to upstream...
first_timestamp last_timestamp  message_count   virtual_host
2023-03-17 09:50:34     2023-03-17 09:50:44     3       test-vhost-1

After the command completes, the downstream cluster's Management UI shows that the replicated messages have been pushed to the corresponding queues.

Note that by default rabbitmqctl promote_standby_replication_downstream_cluster only pushes messages that were not consumed (or not acknowledged) on the upstream cluster into the corresponding queues. The command also has some commonly used options (see the sketch after this list):

  • --start-from-scratch: recovers messages from the earliest available data, regardless of whether they were already consumed on the upstream cluster.
  • --exclude-virtual-hosts: excludes the specified vhosts when recovering data.
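
For example, a sketch of promoting while recovering all locally available data, even messages already consumed upstream:

# run on a pod of the downstream cluster
rabbitmqctl promote_standby_replication_downstream_cluster --start-from-scratch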

After Promoting the Downstream Cluster

Once promoted, the downstream cluster can serve messaging traffic. Note, however, that although messages in the downstream queues can now be read and written normally, the downstream cluster's "schema_definition_sync.operating_mode" is still downstream: it does not allow new schema definitions to be created (operations such as creating vhosts, exchanges, or queues have no effect), and the AMQP/Stream connections used for synchronization with the upstream cluster still exist.

If you want the promoted downstream cluster to be completely detached from its old upstream cluster, there are two options:

Option 1 (requires a rolling restart of the RabbitMQ pods): modify the downstream cluster's YAML file and re-apply it, changing the parameters "schema_definition_sync.operating_mode" and "standby.replication.operating_mode" to upstream. This approach performs a rolling restart of the downstream cluster's pods (they restart one at a time, so two nodes always remain available to serve traffic).

a.) Delete the old schema/standby replication configurations:

# delete using the yaml files
kubectl delete -f downstream-cluster-standby-replication.yml
kubectl delete -f downstream-cluster-schema-replication.yml
# alternatively, delete by object name
kubectl delete standbyreplication.rabbitmq.tanzu.vmware.com downstream-cluster-standby-replication -n tanzu-rabbitmq
kubectl delete schemareplication.rabbitmq.com downstream-cluster-schema-replication -n tanzu-rabbitmq
# how to look up the schema/standby replication object names
kubectl get standbyreplication.rabbitmq.tanzu.vmware.com -n tanzu-rabbitmq
kubectl get schemareplication.rabbitmq.com -n tanzu-rabbitmq

b.) Delete the locally buffered data to free disk space (optional):

# log in to any node of the downstream cluster
root@client:~/dr# kubectl exec -it downstream-cluster-server-0 -n tanzu-rabbitmq -- /bin/bash

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
# delete the replicated data
rabbitmq [ ~ ]$ rabbitmqctl delete_all_data_on_standby_replication_cluster

Will delete all data on the standby message replication cluster
node    vhost
rabbit@downstream-cluster-server-0.downstream-cluster-nodes.tanzu-rabbitmq      test-vhost-1

c.) Edit and re-apply promoted-downstream-cluster.yml

Copy downstream-cluster.yml to promoted-downstream-cluster.yml and modify promoted-downstream-cluster.yml as shown below:

vim promoted-downstream-cluster.yml

...
     additionalConfig: |
       schema_definition_sync.operating_mode = upstream
       standby.replication.operating_mode = upstream
       # schema_definition_sync.downstream.locals.users = ^default_user_
       # schema_definition_sync.downstream.locals.global_parameters = ^standby
...

Apply the file with kubectl apply -f promoted-downstream-cluster.yml; the three pods are restarted one by one. After all pods have restarted, the cluster's operating mode has changed to upstream, and you can later configure a new downstream cluster for it so that this RabbitMQ cluster is protected by DR again.

After the restart completes, there is no longer any logical relationship between the former upstream and downstream clusters, and all AMQP/Stream synchronization connections between them are closed.

Option 2 (without restarting the RabbitMQ pods for now): choose this option if a rolling restart right after promotion is not convenient. It uses the command line to change the downstream cluster's "schema_definition_sync.operating_mode" and to disconnect the Standby Replication Stream connections between the upstream and downstream clusters. However, the command line only changes "schema_definition_sync.operating_mode" to upstream temporarily: as soon as the downstream cluster is restarted, it reverts to downstream. It is therefore recommended that, at a suitable time, you still follow Option 1 to modify downstream-cluster.yml and restart the pods.

a.) Delete the schema/standby replication configurations

# delete using the yaml files
kubectl delete -f downstream-cluster-standby-replication.yml
kubectl delete -f downstream-cluster-schema-replication.yml
# alternatively, delete by object name
kubectl delete standbyreplication.rabbitmq.tanzu.vmware.com downstream-cluster-standby-replication -n tanzu-rabbitmq
kubectl delete schemareplication.rabbitmq.com downstream-cluster-schema-replication -n tanzu-rabbitmq

b.) Disconnect the Stream connection to the upstream cluster:

After promotion, the Standby Replication Stream connections between the upstream and downstream clusters still exist. Run the rabbitmqctl disconnect_standby_replication_downstream command to disconnect them.

# log in to any pod of the downstream cluster and identify the node that initiated the Stream connection
# (you can also check in the upstream cluster's management UI which downstream node the stream connection comes from)
rabbitmq [ ~ ]$ rabbitmqctl display_disk_space_used_by_standby_replication_data

Listing disk space (in gb) used by multi-DC replication
node    size    unit    vhost
rabbit@downstream-cluster-server-0.downstream-cluster-nodes.tanzu-rabbitmq      0.0     gb      test-vhost-1
# the output above shows the node downstream-cluster-server-0; log in to that node and run the disconnect command
root@client:~/dr# kubectl exec -it downstream-cluster-server-0 -n tanzu-rabbitmq -- /bin/bash

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
# disconnect the standby replication stream connection to the upstream cluster
rabbitmq [ ~ ]$ rabbitmqctl disconnect_standby_replication_downstream

Will disconnect standby message replication on node rabbit@downstream-cluster-server-0.downstream-cluster-nodes.tanzu-rabbitmq from its upstream
Error:
true

c.) Log in to every pod (each one) and run rabbitmqctl set_schema_replication_mode upstream

root@client:~/dr# kubectl exec -it downstream-cluster-server-0 -n tanzu-rabbitmq -- /bin/bash
Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)

# run the set_schema_replication_mode command on every pod
rabbitmq [ ~ ]$ rabbitmqctl set_schema_replication_mode upstream

Will set schema replication mode on node rabbit@downstream-cluster-server-0.downstream-cluster-nodes.tanzu-rabbitmq to upstream
Done
# confirm that the operating mode has changed to upstream
rabbitmq [ ~ ]$  rabbitmqctl schema_replication_status

Schema replication status on node rabbit@downstream-cluster-server-0.downstream-cluster-nodes.tanzu-rabbitmq
Operating mode: upstream
State: recover
Upstream endpoint(s):
Upstream username: default_user_reCbFdqSCIKNlG4vKSv

After these operations, all schema synchronization connections (AMQP 1.0) between the upstream and downstream clusters are closed, and you can now create schema definitions on this cluster.
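
A quick sanity check that schema changes are accepted again (a sketch; the vhost name is arbitrary):

# creating a new vhost should now succeed on the promoted cluster
rabbitmqctl add_vhost post-promotion-test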

d.) Delete the locally buffered data to free disk space (optional)

Once this command has run, all buffered data on the downstream cluster is deleted and it is no longer possible to recover data with rabbitmqctl promote_standby_replication_downstream_cluster --start-from-scratch.

# log in to any node of the downstream cluster and run:
root@client:~/dr# kubectl exec -it downstream-cluster-server-0 -n tanzu-rabbitmq -- /bin/bash

Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
# delete the replicated data
rabbitmq [ ~ ]$ rabbitmqctl delete_all_data_on_standby_replication_cluster

Will delete all data on the standby message replication cluster
node    vhost
rabbit@downstream-cluster-server-0.downstream-cluster-nodes.tanzu-rabbitmq      test-vhost-1

Note that Option 2 is only intended for situations where an immediate rolling restart is not possible, for example during peak message traffic; once the peak has passed, you still need to modify the YAML file and perform the rolling restart during a quiet period.

After the promoted downstream cluster has gone through the cleanup steps above, an administrator can use it as an upstream and configure a new downstream cluster for it; the main steps are the same as those described earlier in this article.

Summary

Through the steps above, we configured Disaster Recovery between an upstream and a downstream RabbitMQ cluster, demonstrated how to promote the downstream cluster, and covered the cleanup work after promotion. We hope this article has given you a more detailed understanding of the DR solution for VMware RabbitMQ clusters; if you have questions or suggestions about this article, feel free to contact us. For the latest releases of VMware RabbitMQ for Kubernetes and more technical content, see the official documentation: VMware RabbitMQ for Kubernetes Documentation
