译文 | Apache Pulsar 集群如何确保消息不丢

2021年11月25日 阅读数:9
这篇文章主要向大家介绍译文 | Apache Pulsar 集群如何确保消息不丢,主要内容包括基础应用、实用技巧、原理机制等方面,希望对大家有所帮助。
原文做者 Jack Vanlightly,翻译 Sijia@StreamNative,如需转载,请后台留言。
英文连接: https://jack-vanlightly.com/b...
关于 Apache Pulsar
Apache Pulsar 是 Apache 软件基金会顶级项目,是下一代云原生分布式消息流平台,集消息、存储、轻量化函数式计算为一体,采用计算与存储分离架构设计,支持多租户、持久化存储、多机房跨区域数据复制,具备强一致性、高吞吐、低延时及高可扩展性等流数据存储特性。
GitHub 地址: http://github.com/apache/pulsar/

阅读本文须要大约 15 分钟。node

消息系统

Apache Pulsar 的工做原理介绍了 Pulsar 的协议和存储模型,主要强调了(1)Pulsar 计算与存储分离;(2)Pulsar 将 topic 分红 ledger 与分片,无需重平衡,便可向 bookie 自动写入新数据的特性。本文将会使用 Blockade 工具关闭节点、下降网速,并丢失网络数据包进行测试。在消息如何在 RabbitMQ 集群中丢失消息如何在 Kafka 集群中丢失文中,我使用了相同的自动化测试工具,本文的测试不只包含数据丢失,还包括消息排序和消息重复。python

我已经将测试使用的代码上传到 GitHub,你能够在 ChaosTesingCode 仓库中的 Pulsar 文件夹里查看。git

在每一个测试场景中,咱们都建立了新的 blockade 集群,而且进行了相同的配置:github

  • Apache Pulsar broker 数量
  • Apache BookKeeper 节点(Bookie)数量
  • Ensemble size (E)
  • Write quorum size (Qw)
  • Ack quorum size (Qa)

客户端经过 proxy 与 broker 进行通讯。算法

图片

在每一个测试中,咱们都发送消息并添加干扰操做,好比关闭 topic 所属的 Pulsar broker,关闭当前 ledger ensemble 中的 1 个 bookie,从 ZooKeeper 中隔离节点等。shell

消息发送后,启动 reader 读取并计算消息条数,确保接收到了全部 ack 的消息,而且确保消息是按照正确顺序被读取。apache

若是你想看测试方法,请阅读全文;若是你只想看测试结果,建议直接看结语部分。segmentfault

测试剖析

在测试中,producer 以最快速度发送消息,我设置的发送消息速率为 2 万 - 3 万条/秒。有些测试须要较长的时间,因此我增长了发送消息的总数,但不下降 producer 的发送速率。bash

禁用消息去重。网络

经过调用 pulsar-test.py python 脚本进行测试。测试须要如下参数:

  • 测试类型(no-fail、kill-broker、kill-bookie、kill-bookies[n]、isolate-broker-from-zk、isolate-bookie-from-zk、custom-isolation[partition|partition])。测试类型会在测试部分进行解释。
  • 测试名称
  • 测试运行次数
  • 发送消息数
  • 开始干扰操做时消息的位置。例如,50000 指在接收到第 5 万次 ack 时,干扰操做开始。启用干扰操做一般须要一段时间,通常会在启用后的第几十万条消息处开始。
  • Ledger 配置(E、Qw、Qa),格式为 E-Qw-Qa,如 2-2-1
  • 节点计数,格式为 brokers-bookies,如 3-3
  • 是否启用数据去重(是/否)

blockade.yml 文件都存储在 cluster/blockade-files 目录中。每一个文件中的 broker 和 bookie 数量各不相同。在调用 blockade 前,需将相应文件复制到 cluster 目录中。

调用 pulsar-test.py 执行如下操做:

  1. 若是 blockade 集群正在运行,使用 blockade ` destroy `命令终止集群。
  2. 复制相应的 blockade.yml
  3. blockade up
  4. 发布消息。消息内容为递增的整数,用于检查消息顺序。
  5. 开始干扰操做
  6. 消息发布一结束,马上开始读取数据
  7. 分析并记录结果
  8. 重复“测试运行次数“

在终端能够查看全部输出。也能够在 automated/test-output 目录下的 [test-name]-output.txt 文件中查看输出。能够在 [test-name]-duplicates.txt 和 [test-name]-out-of-order.txt 文件中分别查看重复消息和乱序消息。

运行结果的格式以下:


Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0

追踪消息:

  • 消息发出后,将消息对应的整数做为键添加到 messages_sent 词典,并添加空列表做为值。
  • Ack 消息接收成功后,添加消息对应的整数值到 messages_pos_acked 集合。
  • Ack 消息接收失败后,添加消息所对应的整数值到 messages_neg_acked 集合。

读取消息时,在 messages_sent 词典中查找,并添加消息 ID 到相应列表。经过这种方式便可检测到丢失的和重复的消息。

  • 丢失已 ack 的消息 = messages_sent 词典与 messages_pos_acked 集合中重合的全部值为空列表的键
  • 接收到未 ack 的消息 = messages_sent 词典中值为空列表,但不在 messages_pos_acked 集合中的全部
  • 重复消息 = messages_sent 词典中,值列表包含多个消息 ID 的键
  • 乱序消息 = 整数值小于上一条消息,词典中值为非空的消息 ID 列表(即未被消费过)

因为 producer 一次只容许处理 1 万条消息,当接收的消息比发送的消息滞后 1 万时,producer 会暂停并等待 ack 追赶。在测试中,咱们也的确见到过 1 万。

Pulsar 客户端在消息发送失败时自动重试,直到发送超时。所以,当消息成功写入 BookKeeper,但 Pulsar broker 在向客户端发送 ack 前出现故障时,会出现消息重复。测试中默认关闭消息去重。

默认 E、Qw、Qa

测试中出现了很奇怪的现象,在关闭 bookie 后,reader 和 consumer 不能进行消费。后来咱们发现这和 E、Qw、Qa 的默认值有关。游标存储在由上述默认值建立的 ledger 中,若是 bookie 数量不够,则没法建立订阅。

在 blockade.yml 文件中为全部 bookie 的 environment 进行设置:

environment: {
   "clusterName": "cluster-1",
   "zookeeperServers": "zk1:2181",
   "managedLedgerDefaultEnsembleSize": 4,
   "managedLedgerDefaultWriteQuorum": 3,
   "managedLedgerDefaultAckQuorum": 2
}

要保证集群正常工做,以上配置至少须要 4 个 bookie。所以,在使用少许 bookie 测试时,须要保证 bookie 数量与使用场景最少须要 bookie 的数量相同(还要考虑到 bookie 可能会出故障)。

在全部的测试场景中,E、Qw、Qa 默认值为 二、二、2,bookie 数量很多于 3。这样的设置保证了在关闭 bookie 时,不会丢失游标数据,而且能够建立新的 reader 和 consumer。

测试场景 0 - 对比实验组

在第一个测试场景中,发送 200 万条消息,无干扰,检查消息是否所有读取,而且读取顺序正确,消息不重复。对于这一场景,咱们测试了 5 次。

最后一次测试的终端输出以下。

$ python pulsar-test.py no-fail no-fail-test1 5 2000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #2 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #3 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #4 on topic no-fail-test1_5  ------------
... (omitted)
-----------------------------------------------
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #5 on topic no-fail-test1_5  ------------
2018-10-19 16:30:44.795 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.208.9:6650
2018-10-19 16:30:44.799 INFO  ClientConnection:285 | [192.168.208.1:34476 -> 192.168.208.9:6650] Connected to broker
2018-10-19 16:30:45.965 INFO  BatchMessageContainer:41 | { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://vanlightly/cluster-1/ns1/no-fail-test1_5] [producerName_ = ] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 0] [averageBatchSize = 0]} BatchMessageContainer constructed
2018-10-19 16:30:45.965 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, ] Getting connection from pool
2018-10-19 16:30:47.313 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-19 16:30:47.313 INFO  ClientConnection:287 | [192.168.208.1:34482 -> 192.168.208.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-19 16:30:47.894 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, ] Created producer on broker [192.168.208.1:34482 -> 192.168.208.9:6650] 
Send count: 56645 Ack count: 50000 Pos: 50000 Neg: 0
No chaos action to perform
Send count: 102275 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 151578 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 207610 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 255241 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 309300 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 354595 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 407730 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 459060 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 503590 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 557272 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601064 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 653045 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 701051 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 752087 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 800455 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 851194 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 904089 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 958704 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1007863 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1053645 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1102725 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1150921 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1201317 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1254143 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1303459 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1358449 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1408474 Ack count: 1400000 Pos: 1400000 Neg: 0
Send count: 1456386 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1505650 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1550343 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1601060 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1652008 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1701559 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1751662 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1801550 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1852799 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1901281 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1951079 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-19 16:31:51.177 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, cluster-1-2-0] Closed producer
2018-10-19 16:31:51.346 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.208.9:6650
2018-10-19 16:31:51.348 INFO  ClientConnection:285 | [192.168.208.1:34574 -> 192.168.208.9:6650] Connected to broker
2018-10-19 16:31:51.357 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, reader-5a64ed, 0] Getting connection from pool
2018-10-19 16:31:51.645 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-19 16:31:51.645 INFO  ClientConnection:287 | [192.168.208.1:34578 -> 192.168.208.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-19 16:31:51.716 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, reader-5a64ed, 0] Created consumer on broker [192.168.208.1:34578 -> 192.168.208.9:6650] 
Last confirmed entry: [0, 2557]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 58]
Received: 100000 Curr Entry: [0, 115]
Received: 150000 Curr Entry: [0, 177]
Received: 200000 Curr Entry: [0, 236]
Received: 250000 Curr Entry: [0, 300]
Received: 300000 Curr Entry: [0, 386]
Received: 350000 Curr Entry: [0, 446]
Received: 400000 Curr Entry: [0, 506]
Received: 450000 Curr Entry: [0, 569]
Received: 500000 Curr Entry: [0, 629]
Received: 550000 Curr Entry: [0, 695]
Received: 600000 Curr Entry: [0, 754]
Received: 650000 Curr Entry: [0, 816]
Received: 700000 Curr Entry: [0, 892]
Received: 750000 Curr Entry: [0, 955]
Received: 800000 Curr Entry: [0, 1021]
Received: 850000 Curr Entry: [0, 1086]
Received: 900000 Curr Entry: [0, 1152]
Received: 950000 Curr Entry: [0, 1211]
Received: 1000000 Curr Entry: [0, 1268]
Received: 1050000 Curr Entry: [0, 1339]
Received: 1100000 Curr Entry: [0, 1410]
Received: 1150000 Curr Entry: [0, 1472]
Received: 1200000 Curr Entry: [0, 1534]
Received: 1250000 Curr Entry: [0, 1595]
Received: 1300000 Curr Entry: [0, 1654]
Received: 1350000 Curr Entry: [0, 1713]
Received: 1400000 Curr Entry: [0, 1779]
Received: 1450000 Curr Entry: [0, 1836]
Received: 1500000 Curr Entry: [0, 1896]
Received: 1550000 Curr Entry: [0, 1971]
Received: 1600000 Curr Entry: [0, 2043]
Received: 1650000 Curr Entry: [0, 2101]
Received: 1700000 Curr Entry: [0, 2164]
Received: 1750000 Curr Entry: [0, 2226]
Received: 1800000 Curr Entry: [0, 2297]
Received: 1850000 Curr Entry: [0, 2367]
Received: 1900000 Curr Entry: [0, 2433]
Received: 1950000 Curr Entry: [0, 2496]
Received: 2000000 Curr Entry: [0, 2557]
Read phase complete with message (0,2557,378,-1)
2018-10-19 16:32:35.080 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/no-fail-test1_5, reader-5a64ed, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0

能够在 test-output/no-fail-test1-output.txt 文件中查看简要结果。

16:11:45 Start test
16:13:54: 
16:13:54: Test Run #1 on topic no-fail-test1_1  ------------
16:15:41: Results --------------------------------------------
16:15:41: Final send count: 2000000
16:15:41: Final ack count: 2000000
16:15:41: Final positive ack count: 2000000
16:15:41: Final negative ack count: 0
16:15:41: Messages received: 2000000
16:15:41: Acked messages missing: 0
16:15:41: Non-acked messages received: 0
16:15:41: Out-of-order: 0
16:15:41: Duplicates: 0
16:15:41: ----------------------------------------------------
16:17:57: 
16:17:57: Test Run #2 on topic no-fail-test1_2  ------------
16:19:44: Results --------------------------------------------
16:19:44: Final send count: 2000000
16:19:44: Final ack count: 2000000
16:19:44: Final positive ack count: 2000000
16:19:44: Final negative ack count: 0
16:19:44: Messages received: 2000000
16:19:44: Acked messages missing: 0
16:19:44: Non-acked messages received: 0
16:19:44: Out-of-order: 0
16:19:44: Duplicates: 0
16:19:44: ----------------------------------------------------
16:22:01: 
16:22:01: Test Run #3 on topic no-fail-test1_3  ------------
16:23:52: Results --------------------------------------------
16:23:52: Final send count: 2000000
16:23:52: Final ack count: 2000000
16:23:52: Final positive ack count: 2000000
16:23:52: Final negative ack count: 0
16:23:52: Messages received: 2000000
16:23:52: Acked messages missing: 0
16:23:52: Non-acked messages received: 0
16:23:52: Out-of-order: 0
16:23:52: Duplicates: 0
16:23:52: ----------------------------------------------------
16:26:20: 
16:26:20: Test Run #4 on topic no-fail-test1_4  ------------
16:28:18: Results --------------------------------------------
16:28:18: Final send count: 2000000
16:28:18: Final ack count: 2000000
16:28:18: Final positive ack count: 2000000
16:28:18: Final negative ack count: 0
16:28:18: Messages received: 2000000
16:28:18: Acked messages missing: 0
16:28:18: Non-acked messages received: 0
16:28:18: Out-of-order: 0
16:28:18: Duplicates: 0
16:28:18: ----------------------------------------------------
16:30:43: 
16:30:43: Test Run #5 on topic no-fail-test1_5  ------------
16:32:36: Results --------------------------------------------
16:32:36: Final send count: 2000000
16:32:36: Final ack count: 2000000
16:32:36: Final positive ack count: 2000000
16:32:36: Final negative ack count: 0
16:32:36: Messages received: 2000000
16:32:36: Acked messages missing: 0
16:32:36: Non-acked messages received: 0
16:32:36: Out-of-order: 0
16:32:36: Duplicates: 0
16:32:36: ----------------------------------------------------

从终端输出能够看出速度并非很快。五个测试共用了大约 20 分钟。主要问题在于建立 blockade 集群速度较慢。使用 RabbitMQ 和 Kafka,运行速度很快;但 Pulsar 集群占用了大量 CPU,启动缓慢。缘由可能和咱们运行了 8 个节点(1 ZK、1 Proxy、3 Broker、3 Bookie)有关,最多时节点数达到 11 个。

测试中没有出现消息丢失、重复或乱序的问题。

测试场景 1 - 关闭 topic broker owner

参数配置 2-2-1 3-3。

在测试中,pulsar-test.py 使用 shell 脚本检测 broker owner,并在消息传输时将其关闭。

参数配置为 E=二、Qw=二、Qa=1,3 个 broker,3 个 bookie,这种配置下冗余最小。每条消息有两个副本,但 broker 接收到任一 bookie 的一个 ack 后,就会将 ack 发送到客户端。

在消息传输时关闭 broker 不会形成数据丢失。只有在 Qa bookie 持久化消息到磁盘后,消息才被 ack,而且由 broker 发送 ack。若是关闭 broker,最差的状况是 reader 读取未被 ack 的消息。Bookie 可能已经持久化消息,但 broker 在向客户端发送 ack 前出现故障。在启用新 broker 后,恢复期间会检测到这些消息,并确保关闭 ledger 前生成消息副本,使这些消息对 reader 可用。

如下是某次测试的完整输出。在每次测试中,发送 100 万条消息,在发送到第 5 万条消息时关闭 owner。咱们进行了 5 次该测试。

$ python pulsar-test.py kill-broker kill-broker-test1 5 1000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic kill-broker-test1_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-19 20:31:48.389 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar2:6650
2018-10-19 20:31:48.389 INFO  ClientConnection:287 | [192.168.96.1:52202 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar2:6650
Send count: 51706 Ack count: 50000 Pos: 50000 Neg: 0
Send count: 101199 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 155831 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 208015 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 250923 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 303241 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 350970 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 400041 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 450612 Ack count: 450000 Pos: 450000 Neg: 0
-------------------------------------------------
pulsar2 is the topic owner, killing pulsar2!!!!!!
-------------------------------------------------
Send count: 501553 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 551276 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 602319 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 651554 Ack count: 650000 Pos: 650000 Neg: 0
2018-10-19 20:32:07.710 INFO  ClientConnection:1237 | [192.168.96.1:52202 -> 192.168.96.9:6650] Connection closed
2018-10-19 20:32:07.710 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Schedule reconnection in 0.1 s
2018-10-19 20:32:07.710 INFO  ClientConnection:195 | [192.168.96.1:52202 -> 192.168.96.9:6650] Destroyed connection
2018-10-19 20:32:07.810 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Getting connection from pool
2018-10-19 20:32:07.815 ERROR ClientConnection:792 | [192.168.96.1:52196 -> 192.168.96.9:6650] Failed lookup req_id: 3 error: 6
2018-10-19 20:32:07.815 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Schedule reconnection in 0.192 s
2018-10-19 20:32:08.009 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test1_1, cluster-1-1-0] Getting connection from pool
2018-10-19 20:32:08.500 INFO  ConnectionPool:53 | Deleting stale connection from pool for pulsar://pulsar2:6650 use_count: -1 @ 0
2018-10-19 20:32:08.500 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar2:6650
2018-10-19 20:32:08.501 INFO  ClientConnection:287 | [192.168.96.1:52276 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar2:6650
-------------------------------------------------
pulsar2 KILLED!
-------------------------------------------------
Send count: 702977 Ack count: 700000 Pos: 692977 Neg: 7023
2018-10-19 20:32:38.508 INFO  ClientConnection:1237 | [192.168.96.1:52276 -> 192.168.96.9:6650] Connection closed
2018-10-19 20:32:38.509 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Schedule reconnection in 0.1 s
2018-10-19 20:32:38.509 INFO  ClientConnection:195 | [192.168.96.1:52276 -> 192.168.96.9:6650] Destroyed connection
2018-10-19 20:32:38.610 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Getting connection from pool
2018-10-19 20:32:38.647 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-19 20:32:38.647 INFO  ClientConnection:287 | [192.168.96.1:52306 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-19 20:32:39.592 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Created producer on broker [192.168.96.1:52306 -> 192.168.96.9:6650] 
Send count: 750954 Ack count: 750000 Pos: 740000 Neg: 10000
Send count: 805568 Ack count: 800000 Pos: 790000 Neg: 10000
Send count: 853879 Ack count: 850000 Pos: 840000 Neg: 10000
Send count: 904020 Ack count: 900000 Pos: 890000 Neg: 10000
Send count: 953552 Ack count: 950000 Pos: 940000 Neg: 10000
Send count: 1000000 Ack count: 1000000 Pos: 990000 Neg: 10000
2018-10-19 20:32:47.320 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, cluster-1-1-0] Closed producer
2018-10-19 20:32:47.466 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.96.9:6650
2018-10-19 20:32:47.468 INFO  ClientConnection:285 | [192.168.96.1:52406 -> 192.168.96.9:6650] Connected to broker
2018-10-19 20:32:47.477 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, reader-63a5b2, 0] Getting connection from pool
2018-10-19 20:32:47.496 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-19 20:32:47.496 INFO  ClientConnection:287 | [192.168.96.1:52410 -> 192.168.96.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-19 20:32:47.562 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, reader-63a5b2, 0] Created consumer on broker [192.168.96.1:52410 -> 192.168.96.9:6650]
-------------------------------------------------
READ PHASE
-------------------------------------------------
Last confirmed entry: [1, 387]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 72]
Received: 100000 Curr Entry: [0, 148]
Received: 150000 Curr Entry: [0, 207]
Received: 200000 Curr Entry: [0, 272]
Received: 250000 Curr Entry: [0, 335]
Received: 300000 Curr Entry: [0, 400]
Received: 350000 Curr Entry: [0, 466]
Received: 400000 Curr Entry: [0, 534]
Received: 450000 Curr Entry: [0, 611]
Received: 500000 Curr Entry: [0, 688]
Received: 550000 Curr Entry: [0, 747]
Received: 600000 Curr Entry: [0, 815]
Received: 650000 Curr Entry: [0, 913]
Received: 700000 Curr Entry: [1, 7]
Received: 750000 Curr Entry: [1, 83]
Received: 800000 Curr Entry: [1, 157]
Received: 850000 Curr Entry: [1, 227]
Received: 900000 Curr Entry: [1, 287]
Received: 950000 Curr Entry: [1, 343]
Read phase complete with message (1,387,469,-1)
2018-10-19 20:33:11.943 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-broker-test2_10, reader-63a5b2, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 1000000
Final ack count: 1000000
Final positive ack count: 990000
Final negative ack count: 10000
Messages received: 990000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0

能够看到,在消息传输过程当中,识别 pulsar2 为 owner 后,很快将其关闭。客户端出现了链接失败的问题,并经过 proxy 从新链接到新的 owner pulsar1。重连成功后,继续发送消息。

因为 producer 受到限制,一般只会传输 1 万条消息,而这 1 万条消息所有接收失败(超时)。这样接收成功的消息就有 99 万条,无重复消息,均被有序读取。

咱们共进行了 5 次测试,测试结果以下:

19:59:08: 
19:59:08: Test Run #1 on topic kill-broker-test_1  ------------
20:00:41: Results --------------------------------------------
20:00:41: Final send count: 1000000
20:00:41: Final ack count: 1000000
20:00:41: Final positive ack count: 990000
20:00:41: Final negative ack count: 10000
20:00:41: Messages received: 990000
20:00:41: Acked messages missing: 0
20:00:41: Non-acked messages received: 0
20:00:41: Out-of-order: 0
20:00:41: Duplicates: 0
20:00:41: ----------------------------------------------------
20:02:54: 
20:02:54: Test Run #2 on topic kill-broker-test_2  ------------
20:03:54: Results --------------------------------------------
20:03:54: Final send count: 1000000
20:03:54: Final ack count: 1000000
20:03:54: Final positive ack count: 1000000
20:03:54: Final negative ack count: 0
20:03:54: Messages received: 1000000
20:03:54: Acked messages missing: 0
20:03:54: Non-acked messages received: 0
20:03:54: Out-of-order: 0
20:03:54: Duplicates: 0
20:03:54: ----------------------------------------------------
20:06:06: 
20:06:06: Test Run #3 on topic kill-broker-test_3  ------------
20:07:31: Results --------------------------------------------
20:07:31: Final send count: 1000000
20:07:31: Final ack count: 1000000
20:07:31: Final positive ack count: 990000
20:07:31: Final negative ack count: 10000
20:07:31: Messages received: 991065
20:07:31: Acked messages missing: 0
20:07:31: Non-acked messages received: 1065
20:07:31: Out-of-order: 0
20:07:31: Duplicates: 0
20:07:31: ----------------------------------------------------
20:09:42: 
20:09:42: Test Run #4 on topic kill-broker-test_4  ------------
20:11:18: Results --------------------------------------------
20:11:18: Final send count: 1000000
20:11:18: Final ack count: 1000000
20:11:18: Final positive ack count: 990000
20:11:18: Final negative ack count: 10000
20:11:18: Messages received: 990000
20:11:18: Acked messages missing: 0
20:11:18: Non-acked messages received: 0
20:11:18: Out-of-order: 0
20:11:18: Duplicates: 0
20:11:18: ----------------------------------------------------
20:13:36: 
20:13:36: Test Run #5 on topic kill-broker-test_5  ------------
20:15:09: Results --------------------------------------------
20:15:09: Final send count: 1000000
20:15:09: Final ack count: 1000000
20:15:09: Final positive ack count: 990000
20:15:09: Final negative ack count: 10000
20:15:09: Messages received: 990000
20:15:09: Acked messages missing: 0
20:15:09: Non-acked messages received: 0
20:15:09: Out-of-order: 0
20:15:09: Duplicates: 0
20:15:09: ----------------------------------------------------

能够看出没有消息丢失。第 3 次运行时,读取了 1065 条未 ack 的消息。若是重试发送消息引发了消息重复,这种状况就很正常。启用去重就能够解决这一问题,咱们会在后面的场景中进行测试。

正如预期,broker 故障不会致使消息丢失。

测试场景 2 - 关闭当前 Ledger Ensemble 中的 Bookie

参数配置 2-2-1 3-3。

咱们采用冗余最小的配置来测试 Pulsar 的灵活性,E=二、Qw=二、Qa=1,3 个 broker,3 个 bookie。在发送 100 万条消息期间,识别当前 ledger ensemble 中的一个 bookie,并将其关闭。而后,检查 reader 读取 100 万条消息的顺序是否正确。

在这里我作了一个特殊的假设,把当前 ledger 看做第一个 ledger。在进行过无数次测试后,我发现至少前 100 万条消息会落在第一个 ledger 的 entry 中。所以,我决定只关注第一个 ledger 中的 ensemble,它在 ZooKeeper 中的路径已知。

若是 Qw = 2,只关闭一个 bookie,不该该出现数据丢失。若是关闭了 ack 消息的 bookie,则在 AutoRecovery 复制数据前,仍有一个 entry 副本,而出现数据丢失的状况是两个副本同时丢失。

单次运行的完整输出以下。

$ python pulsar-test.py kill-bookie kill-bookie-test 5 1000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic kill-bookie-test_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-19 23:46:10.939 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.224.9:6650
2018-10-19 23:46:12.877 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-19 23:46:12.877 INFO  ClientConnection:287 | [192.168.224.1:50588 -> 192.168.224.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-19 23:46:13.397 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, ] Created producer on broker [192.168.224.1:50588 -> 192.168.224.9:6650] 
Send count: 52575 Ack count: 50000 Pos: 50000 Neg: 0
Send count: 100560 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 156455 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 203545 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 255199 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 305732 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 357709 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 407634 Ack count: 400000 Pos: 400000 Neg: 0
-------------------------------------------------
bookie1 is in the current ledger ensemble, killing bookie1!!!!!!
-------------------------------------------------
Send count: 451809 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 501234 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 551348 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601729 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 650566 Ack count: 650000 Pos: 650000 Neg: 0
-------------------------------------------------
bookie1 KILLED!
-------------------------------------------------
Send count: 701961 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 751294 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 802939 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 850390 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 903739 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 953286 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1000000 Ack count: 1000000 Pos: 1000000 Neg: 0
2018-10-19 23:46:47.281 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, cluster-1-2-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-19 23:46:57.499 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.224.9:6650
2018-10-19 23:46:57.500 INFO  ClientConnection:285 | [192.168.224.1:50756 -> 192.168.224.9:6650] Connected to broker
2018-10-19 23:46:57.508 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, reader-4d4064, 0] Getting connection from pool
2018-10-19 23:46:57.566 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar2:6650
2018-10-19 23:46:57.566 INFO  ClientConnection:287 | [192.168.224.1:50760 -> 192.168.224.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar2:6650
2018-10-19 23:46:57.750 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, reader-4d4064, 0] Created consumer on broker [192.168.224.1:50760 -> 192.168.224.9:6650] 
LCE. broker pulsar2 lac_line "1:475",
Last confirmed entry: [1, 475]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 70]
Received: 100000 Curr Entry: [0, 165]
Received: 150000 Curr Entry: [0, 270]
Received: 200000 Curr Entry: [0, 339]
Received: 250000 Curr Entry: [0, 403]
Received: 300000 Curr Entry: [0, 462]
Received: 350000 Curr Entry: [0, 519]
Received: 400000 Curr Entry: [0, 579]
Received: 450000 Curr Entry: [0, 658]
Received: 500000 Curr Entry: [0, 717]
Received: 550000 Curr Entry: [0, 782]
Received: 600000 Curr Entry: [0, 848]
Received: 650000 Curr Entry: [0, 942]
Received: 700000 Curr Entry: [1, 57]
Received: 750000 Curr Entry: [1, 119]
Received: 800000 Curr Entry: [1, 195]
Received: 850000 Curr Entry: [1, 267]
Received: 900000 Curr Entry: [1, 357]
Received: 950000 Curr Entry: [1, 418]
Received: 1000000 Curr Entry: [1, 475]
Read phase complete with message (1,475,324,-1)
2018-10-19 23:47:51.236 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-bookie-test_1, reader-4d4064, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 1000000
Final ack count: 1000000
Final positive ack count: 1000000
Final negative ack count: 0
Messages received: 1000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

在消息传输期间关闭了 1 个 bookie,但消息传输并未受到影响,reader 按顺序成功读取了 100 万条消息,而且没有出现消息重复。

5 次测试的运行结果以下。

23:52:20 Start test
23:54:49: 
23:54:49: Test Run #1 on topic kill-bookie-test_1  ------------
23:56:38: Results --------------------------------------------
23:56:38: Final send count: 1000000
23:56:38: Final ack count: 1000000
23:56:38: Final positive ack count: 1000000
23:56:38: Final negative ack count: 0
23:56:38: Messages received: 1000000
23:56:38: Acked messages missing: 0
23:56:38: Non-acked messages received: 0
23:56:38: Out-of-order: 0
23:56:38: Duplicates: 0
23:56:38: ----------------------------------------------------
23:58:54: 
23:58:54: Test Run #2 on topic kill-bookie-test_2  ------------
00:00:50: Results --------------------------------------------
00:00:50: Final send count: 1000000
00:00:50: Final ack count: 1000000
00:00:50: Final positive ack count: 1000000
00:00:50: Final negative ack count: 0
00:00:50: Messages received: 1000000
00:00:50: Acked messages missing: 0
00:00:50: Non-acked messages received: 0
00:00:50: Out-of-order: 0
00:00:50: Duplicates: 0
00:00:50: ----------------------------------------------------
00:03:12: 
00:03:12: Test Run #3 on topic kill-bookie-test_3  ------------
00:05:01: Results --------------------------------------------
00:05:01: Final send count: 1000000
00:05:01: Final ack count: 1000000
00:05:01: Final positive ack count: 1000000
00:05:01: Final negative ack count: 0
00:05:01: Messages received: 1000000
00:05:01: Acked messages missing: 0
00:05:01: Non-acked messages received: 0
00:05:01: Out-of-order: 0
00:05:01: Duplicates: 0
00:05:01: ----------------------------------------------------
00:07:15: 
00:07:15: Test Run #4 on topic kill-bookie-test_4  ------------
00:09:04: Results --------------------------------------------
00:09:04: Final send count: 1000000
00:09:04: Final ack count: 1000000
00:09:04: Final positive ack count: 1000000
00:09:04: Final negative ack count: 0
00:09:04: Messages received: 1000000
00:09:04: Acked messages missing: 0
00:09:04: Non-acked messages received: 0
00:09:04: Out-of-order: 0
00:09:04: Duplicates: 0
00:09:04: ----------------------------------------------------
00:11:23: 
00:11:23: Test Run #5 on topic kill-bookie-test_5  ------------
00:13:23: Results --------------------------------------------
00:13:23: Final send count: 1000000
00:13:23: Final ack count: 1000000
00:13:23: Final positive ack count: 1000000
00:13:23: Final negative ack count: 0
00:13:23: Messages received: 1000000
00:13:23: Acked messages missing: 0
00:13:23: Non-acked messages received: 0
00:13:23: Out-of-order: 0
00:13:23: Duplicates: 0
00:13:23: ----------------------------------------------------

没有出现消息丢失、重复和乱序。

测试场景 3 - 隔离 Topic Owner Broker 与 ZooKeeper

参数配置 2-2-1 3-3。

在此次测试中,咱们只隔离 topic owner broker 和 ZooKeeper。咱们预测在隔离 ZooKeeper 前,broker owner 会持续接收写入;隔离生效后,broker owner 自动重启,另外一个 broker 接管 broker owner 角色。

预计测试结果与关闭 bookie 的结果相似。

单次运行的完整输出以下。

$ python pulsar-test.py isolate-broker-from-zk iso-broker 1 2000000 50000 2-2-1 3-3 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic iso-broker_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-20 08:22:43.961 INFO  ConnectionPool:63 | Created connection for pulsar://172.22.0.9:6650
2018-10-20 08:22:43.963 INFO  ClientConnection:285 | [172.22.0.1:46968 -> 172.22.0.9:6650] Connected to broker
2018-10-20 08:22:44.664 INFO  BatchMessageContainer:41 | { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://vanlightly/cluster-1/ns1/iso-broker_1] [producerName_ = ] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 0] [averageBatchSize = 0]} BatchMessageContainer constructed
2018-10-20 08:22:44.664 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, ] Getting connection from pool
2018-10-20 08:22:45.254 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 08:22:45.254 INFO  ClientConnection:287 | [172.22.0.1:46972 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 08:22:46.329 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, ] Created producer on broker [172.22.0.1:46972 -> 172.22.0.9:6650] 
Send count: 57639 Ack count: 50000 Pos: 50000 Neg: 0
pulsar1 is the topic owner, isolating pulsar1 from zookeepr!!!!!!
Send count: 100918 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 151902 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 200658 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 251287 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 306076 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 353796 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 403469 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 455195 Ack count: 450000 Pos: 450000 Neg: 0
-------------------------------------------------
pulsar1 ISOLATED!
-------------------------------------------------
Send count: 506817 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 557051 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601668 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 652490 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 709134 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 757860 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 805959 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 858283 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 907911 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 951597 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1002732 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1050689 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1104743 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1157332 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1202361 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1256378 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1301586 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1352458 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1407455 Ack count: 1400000 Pos: 1400000 Neg: 0
2018-10-20 08:23:40.624 INFO  ClientConnection:1237 | [172.22.0.1:46972 -> 172.22.0.9:6650] Connection closed
2018-10-20 08:23:40.624 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Schedule reconnection in 0.1 s
2018-10-20 08:23:40.624 INFO  ClientConnection:195 | [172.22.0.1:46972 -> 172.22.0.9:6650] Destroyed connection
2018-10-20 08:23:40.724 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Getting connection from pool
2018-10-20 08:23:41.103 INFO  ConnectionPool:53 | Deleting stale connection from pool for pulsar://pulsar1:6650 use_count: -1 @ 0
2018-10-20 08:23:41.103 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 08:23:41.104 INFO  ClientConnection:287 | [172.22.0.1:47050 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 08:23:42.139 INFO  ClientConnection:1237 | [172.22.0.1:47050 -> 172.22.0.9:6650] Connection closed
2018-10-20 08:23:42.139 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Schedule reconnection in 0.186 s
2018-10-20 08:23:42.139 INFO  ClientConnection:195 | [172.22.0.1:47050 -> 172.22.0.9:6650] Destroyed connection
2018-10-20 08:23:42.325 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Getting connection from pool
2018-10-20 08:23:42.750 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-20 08:23:42.751 INFO  ClientConnection:287 | [172.22.0.1:47054 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-20 08:23:43.632 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Created producer on broker [172.22.0.1:47054 -> 172.22.0.9:6650] 
Send count: 1459178 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1505703 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1551367 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1601042 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1651310 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1707836 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1751159 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1802089 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1851082 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1906930 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1951920 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-20 08:23:57.505 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, cluster-1-0-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-20 08:24:07.620 INFO  ConnectionPool:63 | Created connection for pulsar://172.22.0.9:6650
2018-10-20 08:24:07.621 INFO  ClientConnection:285 | [172.22.0.1:47156 -> 172.22.0.9:6650] Connected to broker
2018-10-20 08:24:07.635 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, reader-c1ba0b, 0] Getting connection from pool
2018-10-20 08:24:07.665 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar3:6650
2018-10-20 08:24:07.665 INFO  ClientConnection:287 | [172.22.0.1:47160 -> 172.22.0.9:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar3:6650
2018-10-20 08:24:07.749 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, reader-c1ba0b, 0] Created consumer on broker [172.22.0.1:47160 -> 172.22.0.9:6650] 
Last confirmed entry: [1, 658]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 61]
Received: 100000 Curr Entry: [0, 142]
Received: 150000 Curr Entry: [0, 212]
Received: 200000 Curr Entry: [0, 288]
Received: 250000 Curr Entry: [0, 349]
Received: 300000 Curr Entry: [0, 420]
Received: 350000 Curr Entry: [0, 487]
Received: 400000 Curr Entry: [0, 554]
Received: 450000 Curr Entry: [0, 612]
Received: 500000 Curr Entry: [0, 675]
Received: 550000 Curr Entry: [0, 731]
Received: 600000 Curr Entry: [0, 790]
Received: 650000 Curr Entry: [0, 847]
Received: 700000 Curr Entry: [0, 904]
Received: 750000 Curr Entry: [0, 961]
Received: 800000 Curr Entry: [0, 1022]
Received: 850000 Curr Entry: [0, 1080]
Received: 900000 Curr Entry: [0, 1136]
Received: 950000 Curr Entry: [0, 1194]
Received: 1000000 Curr Entry: [0, 1252]
Received: 1050000 Curr Entry: [0, 1310]
Received: 1100000 Curr Entry: [0, 1370]
Received: 1150000 Curr Entry: [0, 1428]
Received: 1200000 Curr Entry: [0, 1487]
Received: 1250000 Curr Entry: [0, 1545]
Received: 1300000 Curr Entry: [0, 1602]
Received: 1350000 Curr Entry: [0, 1660]
Received: 1400000 Curr Entry: [0, 1721]
Received: 1450000 Curr Entry: [1, 1]
Received: 1500000 Curr Entry: [1, 60]
Received: 1550000 Curr Entry: [1, 124]
Received: 1600000 Curr Entry: [1, 186]
Received: 1650000 Curr Entry: [1, 247]
Received: 1700000 Curr Entry: [1, 303]
Received: 1750000 Curr Entry: [1, 361]
Received: 1800000 Curr Entry: [1, 420]
Received: 1850000 Curr Entry: [1, 478]
Received: 1900000 Curr Entry: [1, 539]
Received: 1950000 Curr Entry: [1, 598]
Received: 2000000 Curr Entry: [1, 658]
Read phase complete with message (1,658,138,-1)
2018-10-20 08:24:44.361 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/iso-broker_1, reader-c1ba0b, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

在本次测试中,咱们发送了 200 万条消息,由于隔离 owner broker 再看到效果须要一段时间。大概在发送 140 万条消息时出现故障并迅速从新链接。Reader 有序读取所有消息,而且没有消息重复。

5 次测试的运行结果以下。

08:33:38 Start test
08:35:55: 
08:35:55: Test Run #1 on topic iso-broker_1  ------------
08:38:10: Results --------------------------------------------
08:38:10: Final send count: 2000000
08:38:10: Final ack count: 2000000
08:38:10: Final positive ack count: 2000000
08:38:10: Final negative ack count: 0
08:38:10: Messages received: 2000000
08:38:10: Acked messages missing: 0
08:38:10: Non-acked messages received: 0
08:38:10: Out-of-order: 0
08:38:10: Duplicates: 0
08:38:10: ----------------------------------------------------
08:40:29: 
08:40:29: Test Run #2 on topic iso-broker_2  ------------
08:42:36: Results --------------------------------------------
08:42:36: Final send count: 2000000
08:42:36: Final ack count: 2000000
08:42:36: Final positive ack count: 2000000
08:42:36: Final negative ack count: 0
08:42:36: Messages received: 2000000
08:42:36: Acked messages missing: 0
08:42:36: Non-acked messages received: 0
08:42:36: Out-of-order: 0
08:42:36: Duplicates: 0
08:42:36: ----------------------------------------------------
08:44:55: 
08:44:55: Test Run #3 on topic iso-broker_3  ------------
08:47:08: Results --------------------------------------------
08:47:08: Final send count: 2000000
08:47:08: Final ack count: 2000000
08:47:08: Final positive ack count: 2000000
08:47:08: Final negative ack count: 0
08:47:08: Messages received: 2001000
08:47:08: Acked messages missing: 0
08:47:08: Non-acked messages received: 0
08:47:08: Out-of-order: 0
08:47:08: Duplicates: 1000
08:47:08: ----------------------------------------------------
08:49:33: 
08:49:33: Test Run #4 on topic iso-broker_4  ------------
08:51:52: Results --------------------------------------------
08:51:52: Final send count: 2000000
08:51:52: Final ack count: 2000000
08:51:52: Final positive ack count: 2000000
08:51:52: Final negative ack count: 0
08:51:52: Messages received: 2001090
08:51:52: Acked messages missing: 0
08:51:52: Non-acked messages received: 0
08:51:52: Out-of-order: 0
08:51:52: Duplicates: 1090
08:51:52: ----------------------------------------------------
08:54:18: 
08:54:18: Test Run #5 on topic iso-broker_5  ------------
08:56:41: Results --------------------------------------------
08:56:41: Final send count: 2000000
08:56:41: Final ack count: 2000000
08:56:41: Final positive ack count: 2000000
08:56:41: Final negative ack count: 0
08:56:41: Messages received: 2000000
08:56:41: Acked messages missing: 0
08:56:41: Non-acked messages received: 0
08:56:41: Out-of-order: 0
08:56:41: Duplicates: 0
08:56:41: ----------------------------------------------------

没有出现消息丢失和乱序问题,但出现了消息重复。在两次测试中,分别出现了 1000 条和 1090 条重复消息。

当已读取整数值小于等于读取过的消息时(即这是一条重复消息),reader 会在 [test-name]-duplicates.txt 中记录消息 ID、以前消费消息的整数值、当前消息的整数值。

test-output]$ cat iso-broker_duplicates.txt
run|last_msg_id|last_value|curr_msg_id|curr_value
3|(0,1643,999,-1)|1356634|(1,0,0,-1)|1355635
4|(0,1597,792,-1)|1220380|(1,0,0,-1)|1219588
4|(1,712,296,-1)|1811705|(2,0,0,-1)|1811409

第 3 次运行结果中只有一个 entry,也就是说重复消息位于连续的消息块中。整数值 1,355,635 第二次出现于 ledger 1,entry 0 的第一条消息中,消息 ID 为(1,0,0,-1)。在 ledger 0 中,共追加了 1000 条消息,在下一个 ledger 的开头一样写入了这 1000 条消息。

在第 4 次测试中也出现了相似的状况。

第一个 broker 重启后,客户端从新发送消息多是形成消息重复的缘由。Broker 向 pulsar1 的 BK ensemble 发送 1000 条消息,可是在向客户端发送 ack 前,因为与 ZooKeeper 断开,broker 出现故障。而后,客户端从新链接到新的 owner pulsar2,再次发送 1000 条消息。Pulsar2 恢复了上一个 ledger(即 pulsar1 拥有的 ledger),在关闭此 ledger 后建立新 ledger,准备接收新消息。Broker 接收到了客户端发来的 1000 条消息,并将这些消息写入新 ledger。所以出现了消息重复。

咱们能够启用消息去重测试上述可能的缘由。启用消息去重时,broker 将每一个 producer 最后的序列号存储在哈希表中。若是收到的序列号小于全部现有序列号,则这条消息为重复的消息,不对其作任何处理。Broker 将关于 topic 的数据(producer、序列号)存储在游标中,当 broker 进行故障转移时,新 broker 将从新建立哈希表。因为哈希表会按期刷新,若是 broker 出现故障,哈希表中的最新序列号可能会丢失。若是新 broker 仅依赖于哈希表更新快照,则在 broker 进行故障转移时,极可能再次出现消息重复。为了不这一状况的发生,Pulsar 中的新 broker owner 从 ledger 读取最新的 N 条 entry,并追加这些 entry 到哈希表,以确保不会由于故障转移致使哈希表中数据不完整。

测试场景 4 - 隔离 Topic Owner Broker 与 ZooKeeper,启用消息去重

参数配置为 2-2-1 3-3,启用消息去重。

除了启用消息去重外,测试场景 4 与测试场景 3 彻底相同。本次测试用于检测消息去重是否能够防止消息重复。

在 blockade.yml 文件中,为 broker 的 environment 添加 brokerDeduplicationEnabled 参数,启用消息去重。

5 次测试的运行结果以下。

09:24:34 Start test
09:26:56: 
09:26:56: Test Run #1 on topic iso-broker_1  ------------
09:29:27: Results --------------------------------------------
09:29:27: Final send count: 2000000
09:29:27: Final ack count: 2000000
09:29:27: Final positive ack count: 2000000
09:29:27: Final negative ack count: 0
09:29:27: Messages received: 2000000
09:29:27: Acked messages missing: 0
09:29:27: Non-acked messages received: 0
09:29:27: Out-of-order: 0
09:29:27: Duplicates: 0
09:29:27: ----------------------------------------------------
09:32:48: 
09:32:48: Test Run #2 on topic iso-broker_2  ------------
09:35:06: Results --------------------------------------------
09:35:06: Final send count: 2000000
09:35:06: Final ack count: 2000000
09:35:06: Final positive ack count: 2000000
09:35:06: Final negative ack count: 0
09:35:06: Messages received: 2000000
09:35:06: Acked messages missing: 0
09:35:06: Non-acked messages received: 0
09:35:06: Out-of-order: 0
09:35:06: Duplicates: 0
09:35:06: ----------------------------------------------------
09:37:21: 
09:37:21: Test Run #3 on topic iso-broker_3  ------------
09:39:21: Results --------------------------------------------
09:39:21: Final send count: 2000000
09:39:21: Final ack count: 2000000
09:39:21: Final positive ack count: 2000000
09:39:21: Final negative ack count: 0
09:39:21: Messages received: 2000000
09:39:21: Acked messages missing: 0
09:39:21: Non-acked messages received: 0
09:39:21: Out-of-order: 0
09:39:21: Duplicates: 0
09:39:21: ----------------------------------------------------
09:41:39: 
09:41:39: Test Run #4 on topic iso-broker_4  ------------
09:43:51: Results --------------------------------------------
09:43:51: Final send count: 2000000
09:43:51: Final ack count: 2000000
09:43:51: Final positive ack count: 2000000
09:43:51: Final negative ack count: 0
09:43:51: Messages received: 2000000
09:43:51: Acked messages missing: 0
09:43:51: Non-acked messages received: 0
09:43:51: Out-of-order: 0
09:43:51: Duplicates: 0
09:43:51: ----------------------------------------------------
09:46:17: 
09:46:17: Test Run #5 on topic iso-broker_5  ------------
09:48:53: Results --------------------------------------------
09:48:53: Final send count: 2000000
09:48:53: Final ack count: 2000000
09:48:53: Final positive ack count: 2000000
09:48:53: Final negative ack count: 0
09:48:53: Messages received: 2000000
09:48:53: Acked messages missing: 0
09:48:53: Non-acked messages received: 0
09:48:53: Out-of-order: 0
09:48:53: Duplicates: 0
09:48:53: ----------------------------------------------------

从测试结果能够看出,启用消息去重能够解决测试场景 3 中出现的消息重复问题。

测试场景 5 - 隔离 Bookie 与 ZooKeeper

参数配置 2-2-1 3-3。

强调一下,bookie 不须要 ZooKeeper 进行读写,只须要在 ZooKeeper 中进行注册、垃圾回收、AutoRecovery。因此,隔离 bookie 与 ZooKeeper 应该不会产生任何影响。

首先在当前 ledger 的 ensemble 中识别出一个 bookie,在消息传输期间将其与 ZooKeeper 隔离。5 次测试结果以下。

$ cat iso-bookie_output.txt
12:03:52 Start test
12:06:10: 
12:06:10: Test Run #1 on topic iso-bookie_1  ------------
12:08:18: Results --------------------------------------------
12:08:18: Final send count: 2000000
12:08:18: Final ack count: 2000000
12:08:18: Final positive ack count: 2000000
12:08:18: Final negative ack count: 0
12:08:18: Messages received: 2000000
12:08:18: Acked messages missing: 0
12:08:18: Non-acked messages received: 0
12:08:18: Out-of-order: 0
12:08:18: Duplicates: 0
12:08:18: ----------------------------------------------------
12:10:35: 
12:10:35: Test Run #2 on topic iso-bookie_2  ------------
12:12:40: Results --------------------------------------------
12:12:40: Final send count: 2000000
12:12:40: Final ack count: 2000000
12:12:40: Final positive ack count: 2000000
12:12:40: Final negative ack count: 0
12:12:40: Messages received: 2000000
12:12:40: Acked messages missing: 0
12:12:40: Non-acked messages received: 0
12:12:40: Out-of-order: 0
12:12:40: Duplicates: 0
12:12:40: ----------------------------------------------------
12:14:55: 
12:14:55: Test Run #3 on topic iso-bookie_3  ------------
12:16:54: Results --------------------------------------------
12:16:54: Final send count: 2000000
12:16:54: Final ack count: 2000000
12:16:54: Final positive ack count: 2000000
12:16:54: Final negative ack count: 0
12:16:54: Messages received: 2000000
12:16:54: Acked messages missing: 0
12:16:54: Non-acked messages received: 0
12:16:54: Out-of-order: 0
12:16:54: Duplicates: 0
12:16:54: ----------------------------------------------------
12:19:10: 
12:19:10: Test Run #4 on topic iso-bookie_4  ------------
12:21:16: Results --------------------------------------------
12:21:16: Final send count: 2000000
12:21:16: Final ack count: 2000000
12:21:16: Final positive ack count: 2000000
12:21:16: Final negative ack count: 0
12:21:16: Messages received: 2000000
12:21:16: Acked messages missing: 0
12:21:16: Non-acked messages received: 0
12:21:16: Out-of-order: 0
12:21:16: Duplicates: 0
12:21:16: ----------------------------------------------------
12:23:36: 
12:23:36: Test Run #5 on topic iso-bookie_5  ------------
12:25:39: Results --------------------------------------------
12:25:39: Final send count: 2000000
12:25:39: Final ack count: 2000000
12:25:39: Final positive ack count: 2000000
12:25:39: Final negative ack count: 0
12:25:39: Messages received: 2000000
12:25:39: Acked messages missing: 0
12:25:39: Non-acked messages received: 0
12:25:39: Out-of-order: 0
12:25:39: Duplicates: 0
12:25:39: ----------------------------------------------------

没有出现消息丢失、重复和乱序。

测试场景 6 - 关闭多个 Bookie

参数配置 3-3-1 3-5。

除了冗余数为 3,关闭 2 个 bookie 外,测试场景 6 与测试场景 2 相同。使用 Pulsar 时,bookie 故障次数不超过 Qw-1,就不会出现数据丢失。

可使用 kill-bookies[n] 命令关闭当前 ensemble 中任意数量的 bookie。

单次运行的完整输出以下。

$ python pulsar-test.py kill-bookies[2] kill-m-bookies 1 2000000 50000 3-3-1 3-5 false
Creating blockade cluster
Running test with config: E 3 Qw 3 Qa 1 with nodes bookie1 bookie2 bookie3 bookie4 bookie5 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 3-3-1
Test Run #1 on topic kill-m-bookies_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-20 22:41:21.932 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.176.11:6650
2018-10-20 22:41:21.932 INFO  ClientConnection:285 | [192.168.176.1:47178 -> 192.168.176.11:6650] Connected to broker
2018-10-20 22:41:22.599 INFO  BatchMessageContainer:41 | { BatchContainer [size = 0] [batchSizeInBytes_ = 0] [maxAllowedMessageBatchSizeInBytes_ = 131072] [maxAllowedNumMessagesInBatch_ = 1000] [topicName = persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1] [producerName_ = ] [batchSizeInBytes_ = 0] [numberOfBatchesSent = 0] [averageBatchSize = 0]} BatchMessageContainer constructed
2018-10-20 22:41:22.599 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, ] Getting connection from pool
2018-10-20 22:41:23.140 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 22:41:23.140 INFO  ClientConnection:287 | [192.168.176.1:47182 -> 192.168.176.11:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 22:41:23.631 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, ] Created producer on broker [192.168.176.1:47182 -> 192.168.176.11:6650] 
Send count: 53933 Ack count: 50000 Pos: 50000 Neg: 0
-------------------------------------------------
Identifing first 2 bookies in ensemble
-------------------------------------------------
Send count: 105567 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 150962 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 202992 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 251576 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 304059 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 354476 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 401795 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 451880 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 500607 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 550300 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 600474 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 650469 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 702927 Ack count: 700000 Pos: 700000 Neg: 0
Send count: 751328 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 805205 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 853940 Ack count: 850000 Pos: 850000 Neg: 0
-------------------------------------------------
2 BOOKIES KILLED!
-------------------------------------------------
Send count: 901111 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 950963 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1000561 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1053648 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1103723 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1155762 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1202638 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1253326 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1300821 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1355902 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1409752 Ack count: 1400000 Pos: 1400000 Neg: 0
Send count: 1451236 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1500961 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1554802 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1600702 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1652755 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1705403 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1757240 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1804954 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1850631 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1907237 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1951117 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-20 22:42:20.572 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, cluster-1-0-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-20 22:42:30.708 INFO  ConnectionPool:63 | Created connection for pulsar://192.168.176.11:6650
2018-10-20 22:42:30.710 INFO  ClientConnection:285 | [192.168.176.1:47374 -> 192.168.176.11:6650] Connected to broker
2018-10-20 22:42:30.715 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, reader-b23b69, 0] Getting connection from pool
2018-10-20 22:42:31.028 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 22:42:31.028 INFO  ClientConnection:287 | [192.168.176.1:47378 -> 192.168.176.11:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 22:42:31.110 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, reader-b23b69, 0] Created consumer on broker [192.168.176.1:47378 -> 192.168.176.11:6650] 
Last confirmed entry: [0, 2546]
Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 62]
Received: 100000 Curr Entry: [0, 135]
Received: 150000 Curr Entry: [0, 203]
Received: 200000 Curr Entry: [0, 276]
Received: 250000 Curr Entry: [0, 341]
Received: 300000 Curr Entry: [0, 409]
Received: 350000 Curr Entry: [0, 470]
Received: 400000 Curr Entry: [0, 532]
Received: 450000 Curr Entry: [0, 599]
Received: 500000 Curr Entry: [0, 661]
Received: 550000 Curr Entry: [0, 724]
Received: 600000 Curr Entry: [0, 799]
Received: 650000 Curr Entry: [0, 866]
Received: 700000 Curr Entry: [0, 943]
Received: 750000 Curr Entry: [0, 1005]
Received: 800000 Curr Entry: [0, 1073]
Received: 850000 Curr Entry: [0, 1133]
Received: 900000 Curr Entry: [0, 1205]
Received: 950000 Curr Entry: [0, 1268]
Received: 1000000 Curr Entry: [0, 1340]
Received: 1050000 Curr Entry: [0, 1402]
Received: 1100000 Curr Entry: [0, 1460]
Received: 1150000 Curr Entry: [0, 1523]
Received: 1200000 Curr Entry: [0, 1588]
Received: 1250000 Curr Entry: [0, 1647]
Received: 1300000 Curr Entry: [0, 1705]
Received: 1350000 Curr Entry: [0, 1764]
Received: 1400000 Curr Entry: [0, 1825]
Received: 1450000 Curr Entry: [0, 1884]
Received: 1500000 Curr Entry: [0, 1941]
Received: 1550000 Curr Entry: [0, 1998]
Received: 1600000 Curr Entry: [0, 2064]
Received: 1650000 Curr Entry: [0, 2122]
Received: 1700000 Curr Entry: [0, 2184]
Received: 1750000 Curr Entry: [0, 2241]
Received: 1800000 Curr Entry: [0, 2295]
Received: 1850000 Curr Entry: [0, 2364]
Received: 1900000 Curr Entry: [0, 2425]
Received: 1950000 Curr Entry: [0, 2481]
Received: 2000000 Curr Entry: [0, 2546]
Read phase complete with message (0,2546,561,-1)
2018-10-20 22:44:15.596 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_1, reader-b23b69, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 2000000
Acked messages missing: 0
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

5 次测试的运行结果以下。

$ cat kill-m-bookies_output.txt
22:59:01 Start test
23:01:37: 
23:01:37: Test Run #1 on topic kill-m-bookies_1  ------------
23:04:49: Results --------------------------------------------
23:04:49: Final send count: 2000000
23:04:49: Final ack count: 2000000
23:04:49: Final positive ack count: 2000000
23:04:49: Final negative ack count: 0
23:04:49: Messages received: 2000000
23:04:49: Acked messages missing: 0
23:04:49: Non-acked messages received: 0
23:04:49: Out-of-order: 0
23:04:49: Duplicates: 0
23:04:49: ----------------------------------------------------
23:07:26: 
23:07:26: Test Run #2 on topic kill-m-bookies_2  ------------
23:09:23: Results --------------------------------------------
23:09:23: Final send count: 2000000
23:09:23: Final ack count: 2000000
23:09:23: Final positive ack count: 2000000
23:09:23: Final negative ack count: 0
23:09:23: Messages received: 2005107
23:09:23: Acked messages missing: 0
23:09:23: Non-acked messages received: 0
23:09:23: Out-of-order: 0
23:09:23: Duplicates: 5107
23:09:23: ----------------------------------------------------
23:11:56: 
23:11:56: Test Run #3 on topic kill-m-bookies_3  ------------
23:14:48: Results --------------------------------------------
23:14:48: Final send count: 2000000
23:14:48: Final ack count: 2000000
23:14:48: Final positive ack count: 2000000
23:14:48: Final negative ack count: 0
23:14:48: Messages received: 2000000
23:14:48: Acked messages missing: 0
23:14:48: Non-acked messages received: 0
23:14:48: Out-of-order: 0
23:14:48: Duplicates: 0
23:14:48: ----------------------------------------------------
23:17:25: 
23:17:25: Test Run #4 on topic kill-m-bookies_4  ------------
23:20:24: Results --------------------------------------------
23:20:24: Final send count: 2000000
23:20:24: Final ack count: 2000000
23:20:24: Final positive ack count: 2000000
23:20:24: Final negative ack count: 0
23:20:24: Messages received: 2000000
23:20:24: Acked messages missing: 0
23:20:24: Non-acked messages received: 0
23:20:24: Out-of-order: 0
23:20:24: Duplicates: 0
23:20:24: ----------------------------------------------------
23:23:05: 
23:23:05: Test Run #5 on topic kill-m-bookies_5  ------------
23:25:03: Results --------------------------------------------
23:25:03: Final send count: 2000000
23:25:03: Final ack count: 2000000
23:25:03: Final positive ack count: 2000000
23:25:03: Final negative ack count: 0
23:25:03: Messages received: 2000000
23:25:03: Acked messages missing: 0
23:25:03: Non-acked messages received: 0
23:25:03: Out-of-order: 0
23:25:03: Duplicates: 0
23:25:03: ----------------------------------------------------

在一次测试中出现了 5000 多条重复消息。经过检查运行的完整输出,我发现了和 producer id 相关的链接错误。

Send count: 557331 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 601885 Ack count: 600000 Pos: 600000 Neg: 0
2018-10-20 23:07:49.283 INFO  HandlerBase:129 | [persistent://vanlightly/cluster-1/ns1/kill-m-bookies_2, cluster-1-2-0] Schedule reconnection in 0.1 s
2018-10-20 23:07:49.319 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 648684
2018-10-20 23:07:49.319 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 649684
2018-10-20 23:07:49.320 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 650684
2018-10-20 23:07:49.320 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 651684
2018-10-20 23:07:49.321 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 652684
2018-10-20 23:07:49.321 ERROR ClientConnection:638 | [192.168.208.1:58276 -> 192.168.208.11:6650] Got invalid producer Id in SendReceipt: 0 -- msg: 653047

具体缘由还不能肯定,可能与 bookie 故障相关。不管是否和 bookie 故障相关,启用消息去重均可以有效解决相似的消息重复问题。

没有出现消息丢失或乱序。

若是关闭 Qw 个 bookie,会出现怎样的结果呢?虽然这样的测试设定会损坏全部消息副本,咱们仍是来实际测试一下。

测试场景 7 - 关闭 Qw 个 Bookie

参数配置 2-2-1 3-5。

本次测试中,咱们关闭了组成当前 ledger 中 ensemble 的两个 bookie。

单次运行的完整输出以下。

$ python pulsar-test.py kill-bookies[2] kill-qw-bookies 1 2000000 50000 2-2-1 3-5 false
Destroying blockade cluster
Creating blockade cluster
Running test with config: E 2 Qw 2 Qa 1 with nodes bookie1 bookie2 bookie3 bookie4 bookie5 proxy pulsar1 pulsar2 pulsar3 zk1 
Creating tenant and namespace with configuration 2-2-1
Test Run #1 on topic kill-qw-bookies_1  ------------
-------------------------------------------------
WRITE PHASE
-------------------------------------------------
2018-10-20 23:46:24.587 INFO  ConnectionPool:63 | Created connection for pulsar://172.20.0.11:6650
2018-10-20 23:46:24.588 INFO  ClientConnection:285 | [172.20.0.1:40032 -> 172.20.0.11:6650] Connected to broker
2018-10-20 23:46:25.620 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 23:46:25.620 INFO  ClientConnection:287 | [172.20.0.1:40036 -> 172.20.0.11:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 23:46:26.098 INFO  ProducerImpl:155 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, ] Created producer on broker [172.20.0.1:40036 -> 172.20.0.11:6650] 
Send count: 56001 Ack count: 50000 Pos: 50000 Neg: 0
-------------------------------------------------
Identifing first 2 bookies in ensemble
-------------------------------------------------
Send count: 101422 Ack count: 100000 Pos: 100000 Neg: 0
Send count: 150870 Ack count: 150000 Pos: 150000 Neg: 0
Send count: 204342 Ack count: 200000 Pos: 200000 Neg: 0
Send count: 252210 Ack count: 250000 Pos: 250000 Neg: 0
Send count: 304805 Ack count: 300000 Pos: 300000 Neg: 0
Send count: 357891 Ack count: 350000 Pos: 350000 Neg: 0
Send count: 400852 Ack count: 400000 Pos: 400000 Neg: 0
Send count: 450256 Ack count: 450000 Pos: 450000 Neg: 0
Send count: 502910 Ack count: 500000 Pos: 500000 Neg: 0
Send count: 551386 Ack count: 550000 Pos: 550000 Neg: 0
Send count: 604204 Ack count: 600000 Pos: 600000 Neg: 0
Send count: 653056 Ack count: 650000 Pos: 650000 Neg: 0
Send count: 709835 Ack count: 700000 Pos: 700000 Neg: 0
-------------------------------------------------
2 BOOKIES KILLED!
-------------------------------------------------
Send count: 753606 Ack count: 750000 Pos: 750000 Neg: 0
Send count: 803318 Ack count: 800000 Pos: 800000 Neg: 0
Send count: 853127 Ack count: 850000 Pos: 850000 Neg: 0
Send count: 905068 Ack count: 900000 Pos: 900000 Neg: 0
Send count: 951923 Ack count: 950000 Pos: 950000 Neg: 0
Send count: 1004825 Ack count: 1000000 Pos: 1000000 Neg: 0
Send count: 1051416 Ack count: 1050000 Pos: 1050000 Neg: 0
Send count: 1102301 Ack count: 1100000 Pos: 1100000 Neg: 0
Send count: 1151166 Ack count: 1150000 Pos: 1150000 Neg: 0
Send count: 1205605 Ack count: 1200000 Pos: 1200000 Neg: 0
Send count: 1251999 Ack count: 1250000 Pos: 1250000 Neg: 0
Send count: 1306006 Ack count: 1300000 Pos: 1300000 Neg: 0
Send count: 1357391 Ack count: 1350000 Pos: 1350000 Neg: 0
Send count: 1406744 Ack count: 1400000 Pos: 1400000 Neg: 0
Send count: 1458207 Ack count: 1450000 Pos: 1450000 Neg: 0
Send count: 1509467 Ack count: 1500000 Pos: 1500000 Neg: 0
Send count: 1558339 Ack count: 1550000 Pos: 1550000 Neg: 0
Send count: 1603861 Ack count: 1600000 Pos: 1600000 Neg: 0
Send count: 1654170 Ack count: 1650000 Pos: 1650000 Neg: 0
Send count: 1702673 Ack count: 1700000 Pos: 1700000 Neg: 0
Send count: 1751781 Ack count: 1750000 Pos: 1750000 Neg: 0
Send count: 1803040 Ack count: 1800000 Pos: 1800000 Neg: 0
Send count: 1850616 Ack count: 1850000 Pos: 1850000 Neg: 0
Send count: 1906137 Ack count: 1900000 Pos: 1900000 Neg: 0
Send count: 1955752 Ack count: 1950000 Pos: 1950000 Neg: 0
Send count: 2000000 Ack count: 2000000 Pos: 2000000 Neg: 0
2018-10-20 23:47:28.728 INFO  ProducerImpl:467 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, cluster-1-0-0] Closed producer
-------------------------------------------------
READ PHASE
-------------------------------------------------
2018-10-20 23:47:38.923 INFO  ConnectionPool:63 | Created connection for pulsar://172.20.0.11:6650
2018-10-20 23:47:38.924 INFO  ClientConnection:285 | [172.20.0.1:40216 -> 172.20.0.11:6650] Connected to broker
2018-10-20 23:47:38.931 INFO  HandlerBase:53 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, reader-1b446f, 0] Getting connection from pool
2018-10-20 23:47:39.176 INFO  ConnectionPool:63 | Created connection for pulsar://pulsar1:6650
2018-10-20 23:47:39.176 INFO  ClientConnection:287 | [172.20.0.1:40220 -> 172.20.0.11:6650] Connected to broker through proxy. Logical broker: pulsar://pulsar1:6650
2018-10-20 23:47:39.276 INFO  ConsumerImpl:168 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, reader-1b446f, 0] Created consumer on broker [172.20.0.1:40220 -> 172.20.0.11:6650] 
Last confirmed entry: [0, 2506]

能够继续发送消息。同时关闭两个 bookie 后,broker 将会经过新的 ensemble 建立新分片。本来有 5 个 bookie,关闭 2 个后,剩下的 3 个 bookie 将会和 E、2 个 Qw 一块儿负责新分片上的操做。

可是,因为 reader 会尝试从没有数据的分片读取数据,reader 会阻塞在 read_next() 上(除非重启已关闭的 bookie,且数据完整)。

咱们能够在 ZooKeeper 中检查复制不足的 ledger。下面的 bash 脚本能够检测这些 ledger 的路径。

$ bash monitor-underrep.sh 
[0000]

也能够查看第一个 ledger 的详细信息。

$ bash show-ledger.sh 00/0000/L0000
...(omitted a bunch of uniteresting output)
quorumSize: 2
ensembleSize: 2
length: 0
lastEntryId: -1
state: OPEN
segment {
  ensembleMember: "172.20.0.3:3181"
  ensembleMember: "172.20.0.5:3181"
  firstEntryId: 0
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.4:3181"
  firstEntryId: 949
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.6:3181"
  firstEntryId: 959
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.6:3181"
  firstEntryId: 1340
}
digestType: CRC32C
password: ""
ackQuorumSize: 1

Reader 阻塞的第一个分片上有两个 IP 地址:172.20.0.3 和 172.20.0.5,分别对应于关闭的两个 bookie。

因为第一个分片上的数据已丢失,所以没法恢复 ledger。若是已关闭的两个 bookie 上的数据已永久丢失,咱们只能跳转至 ID 为 949 的 entry 继续读取数据。

可是若是能够找回关闭的两个 bookie 中任意一个 bookie 上的数据,就能够经过 AutoRecovery 彻底复制分片且数据可用。咱们来模拟一下,在集群目录中使用 blockade start bookie1 命令便可启动 bookie1。

咱们注意到,reader 复活,消费 65 万条消息后发生阻塞,直到达到咱们设置的 60 秒超时,测试结束。

Start reading from (0,0,0,-1)
Received: 50000 Curr Entry: [0, 66]
Received: 100000 Curr Entry: [0, 158]
Received: 150000 Curr Entry: [0, 246]
Received: 200000 Curr Entry: [0, 312]
Received: 250000 Curr Entry: [0, 382]
Received: 300000 Curr Entry: [0, 444]
Received: 350000 Curr Entry: [0, 501]
Received: 400000 Curr Entry: [0, 566]
Received: 450000 Curr Entry: [0, 635]
Received: 500000 Curr Entry: [0, 698]
Received: 550000 Curr Entry: [0, 761]
Received: 600000 Curr Entry: [0, 826]
Received: 650000 Curr Entry: [0, 895]
Read phase complete with message (0,946,478,-1)
2018-10-21 00:04:15.760 INFO  ConsumerImpl:761 | [persistent://vanlightly/cluster-1/ns1/kill-qw-bookies_1, reader-1b446f, 0] Closed consumer 0
Results --------------------------------------------
Final send count: 2000000
Final ack count: 2000000
Final positive ack count: 2000000
Final negative ack count: 0
Messages received: 691566
Acked messages missing: 1308433
Non-acked messages received: 0
Out-of-order: 0
Duplicates: 0
----------------------------------------------------

Reader 读取的最后一条 entry ID 为 946,继续数 3 条就到第二个分片了。为何 reader 在第一个分片的结尾处发生了阻塞?

当 Qa=1 时,同时关闭 ensemble 中的两个 bookie,最后 3 个 entry 持久化到 1 个 bookie 上。而这个 bookie 仍然处于关闭状态,所以最后 3 个 entry 不可读。

咱们使用 reader.py 从 topic 读取数据。在第 65 万条消息处出现了阻塞。如今启用 bookie3:blockade start bookie3

因为对读操做应用了指数退避算法,在读取失败并恢复读取时,指数退避算法会像 TCP 同样缓慢启动,逐渐增长从 bookie 读取的批大小。reader.py 在消费到阻塞处时,跳过这一条消息后消费速度有所降低,而后迅速消费了 200 万条消息。

在 ZooKeeper 中复制不足的路径下,没再出现 ledger 0000。理论上,在 ledger 完成复制后能够再次关闭 bookie3,由于 bookie1 中有第一个分片中的全部 entry,因此应该当即启用 AutoRecovery 并从新复制 ledger 到第二个 bookie。

$ blockade kill bookie3
$ bash monitor-underrep.sh 
[]
[]
[]
[]
[]
[]
[]
[]
[]
[0000]
[0000]
[0000]
[0000]
[]
[]

[] 表示不存在 ledger 复制不足。查看此路径须要 2 秒钟(运行 Docker、zookeeper-shell 等),确认 ledger 0000 复制不足一共须要 16 秒,但复制操做只需 8 秒钟。

再次检查 ledger,第一个分片上的 ensemble 应该有所变化:

$bash show-ledger.sh 00/0000/L0000
...
quorumSize: 2
ensembleSize: 2
length: 0
lastEntryId: -1
state: OPEN
segment {
  ensembleMember: "172.20.0.6:3181"
  ensembleMember: "172.20.0.4:3181"
  firstEntryId: 0
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.4:3181"
  firstEntryId: 949
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.6:3181"
  firstEntryId: 959
}
segment {
  ensembleMember: "172.20.0.7:3181"
  ensembleMember: "172.20.0.6:3181"
  firstEntryId: 1340
}
digestType: CRC32C
password: ""
ackQuorumSize: 1

Bookie3 (172.20.0.5)再也不属于第一个分片,AutoRecovery 也完成了相应的复制。

经过以上测试,咱们发现:

若是不想丢消息,最多能够关闭 Qw-1 个 bookie。若是关闭 Qw 个 bookie,要保证不出现数据丢失,则须要恢复这些 bookie 中的全部数据。

若是关闭 bookie 的数量等于 Qa 的数量,AutoRecovery 不能修复 ledger 复制不足的问题。当 Qa=1,关闭 1 个 bookie 时,AutoRecovery 不能修复 ledger。

这种状况下,BookKeeper 不能恢复 ledger ,Ivan Kelly 提出,“关闭 bookie 的操做有问题。关闭操做须要确认分片的最后一个 entry。当 Qa=1 时,若是 write quorum 中有 1 个 bookie 出现故障,则没法确认分片的最后一个 entry,由于该 bookie 可能已经 ack 了写操做(严格来讲,可能会 ack 写操做)。”

所以 Qa=1 的配置很是冒险,由于一旦有 1 个 bookie 出现故障,就没法修复 ledger 复制不足的问题。

测试总结

  • 确认丢失消息总数:0
  • 乱序消息总数:0
  • 全部测试中均未出现消息丢失和消息乱序。
  • 在关闭 Qw 个 bookie 的测试中,同时恢复 2 个 bookie 不会出现数据丢失。但这一场景不在上述总结范围内。

结语

经过测试能够发现,关闭 broker 或将 broker 与 ZooKeeper 隔离都不会形成消息丢失,但会读取未 ack 的消息(很常见,不可避免),若是未启用消息去重,还会出现消息重复。即便在故障转移场景中,也可使用消息去重。下次我会进一步测试消息去重。

Qw 大于等于 2 时,关闭 1 个 bookie 或将 bookie 与 ZooKeeper 隔离也不会形成消息丢失。

关闭 ensemble 中的所有 bookie 会阻塞 reader 直到恢复 bookie。若是不能成功恢复 bookie,则会丢失全部数据。丢失所有数据副本是咱们最不想看到的状况。

最小值:

  • Write Quorum(Qw)。1 个 bookie 宕机时,要确保数据不丢失,Qw 必须大于等于 2。
  • Ack Quorum(Qa)。1 个 bookie 宕机时,要经过 AutoRecovery 从新复制 ledger,Qa 必须大于等于 2。
  • 至少须要默认值 E 个 bookie 才能够建立 reader 和 consumer,由于须要使用这个值建立游标。

总之,Apache Pulsar 功能强大,在不一样配置状况下也不容易出现数据丢失、重复、乱序的问题。在后续测试中,仍有一些须要注意的地方,如在 broker 故障转移后,若是正在运行的消息数量接近内部等待消息队列大小,producer 的速度会迅速降低。但若是仅考虑是否出现消息丢失和乱序,Apache Pulsar 天衣无缝。

能够进一步测试的方向有如下几个:

  • 消息去重和 broker 故障转移
  • 使用速度较慢的 producer 运行较长时间(几个小时),运行中每 5 分钟随机关闭 1 个节点(broker 或 bookie)。运行结束后,检查是否出现消息丢失和乱序的问题。
  • 发送大量消息。在几个小时以内,每 5-10 分钟随机关闭 1 个 bookie。运行结束后,检查是否经过 AutoRecovery 保证复制了所有消息,没有出现消息丢失。
  • 分区 topic - 全部类型的测试
  • 在网络不稳定或网速较慢的条件下进行测试

相关阅读

点击连接 ,为 Apache Pulsar 点赞!