Flink restarts triggered by ZooKeeper

November 26, 2021

Background

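While running a job with Flink on k8s recently, I found that the program was frequently restarted at one particular moment. A scheduled task loads a cache once a day; the cache holds a large amount of data and takes roughly 60-90s to load. This scheduled task often caused k8s to restart the program, which made it very unstable, so I tried several rounds of tuning.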

Memory related

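  1. I suspected that loading the cache made some communication between the operators' senders and receivers unreachable. The default heartbeat timeout is 50s, so I changed the parameters: heartbeat.timeout: 180000, heartbeat.interval: 20000.
  2. The jobmanager and taskmanager communicate through akka, so I changed the parameter akka.ask.timeout: 240s.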
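
As a rough sanity check of these numbers, the sketch below compares the reported 60-90s load time with the default 50s heartbeat timeout and with the tuned values. The class and method names are hypothetical, and it assumes (as suspected above, not confirmed) that the cache load can stall heartbeats for its whole duration.

```java
import java.time.Duration;

/** Hypothetical helper: checks the tuned timeouts against the reported cache load time. */
final class TimeoutBudget {
    // Values taken from the text above.
    static final Duration HEARTBEAT_INTERVAL = Duration.ofMillis(20_000);
    static final Duration HEARTBEAT_TIMEOUT = Duration.ofMillis(180_000);
    static final Duration AKKA_ASK_TIMEOUT = Duration.ofSeconds(240);
    static final Duration DEFAULT_HEARTBEAT_TIMEOUT = Duration.ofSeconds(50);
    // Upper bound of the reported 60-90s load time, assumed to be the worst-case stall.
    static final Duration WORST_CASE_PAUSE = Duration.ofSeconds(90);

    /** Number of whole heartbeat intervals that fit into one timeout window. */
    static long intervalsPerTimeout(Duration interval, Duration timeout) {
        return timeout.toMillis() / interval.toMillis();
    }

    /** True if a stall of the given length is strictly shorter than the timeout. */
    static boolean survives(Duration pause, Duration timeout) {
        return pause.compareTo(timeout) < 0;
    }

    public static void main(String[] args) {
        System.out.println("heartbeats per timeout window: "
                + intervalsPerTimeout(HEARTBEAT_INTERVAL, HEARTBEAT_TIMEOUT));
        System.out.println("90s stall fits default 50s timeout: "
                + survives(WORST_CASE_PAUSE, DEFAULT_HEARTBEAT_TIMEOUT));
        System.out.println("90s stall fits tuned 180s timeout: "
                + survives(WORST_CASE_PAUSE, HEARTBEAT_TIMEOUT));
        System.out.println("90s stall fits akka.ask.timeout 240s: "
                + survives(WORST_CASE_PAUSE, AKKA_ASK_TIMEOUT));
    }
}
```

Under that assumption, a 90s stall exceeds the 50s default but fits inside both tuned timeouts, so these settings alone should rule out a heartbeat timeout as the cause.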

After these changes, an exception still occasionally showed up while the cache was loading. A log excerpt follows:

2020-10-16 17:05:05,939 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 29068ms for sessionid 0x30135fa8005449f
2020-10-16 17:05:05,948 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 29068ms for sessionid 0x30135fa8005449f, closing socket connection and attempting reconnect
2020-10-16 17:05:07,609 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2020-10-16 17:05:07,611 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2020-10-16 17:05:07,612 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - JobManager for job 1bb3b7bdcfbc39cf760064ed9736ea80 with leader id bed26e07640e5e79197e468c85354534 lost leadership.
2020-10-16 17:05:07,613 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2020-10-16 17:05:07,614 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 1bb3b7bdcfbc39cf760064ed9736ea80.
2020-10-16 17:05:07,615 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Flat Map -> Timestamps/Watermarks (15/15) (052a84a37a0647ab485baa54f149b762).
2020-10-16 17:05:07,615 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Flat Map -> Timestamps/Watermarks (15/15) (052a84a37a0647ab485baa54f149b762) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: JobManager responsible for 1bb3b7bdcfbc39cf760064ed9736ea80 lost the leadership.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1274)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:155)
at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1698)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Job leader for job id 1bb3b7bdcfbc39cf760064ed9736ea80 lost leadership.
... 22 more

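Further investigation showed that this is related to zk. When zk switches leader or hits something like a network fluctuation, the SUSPENDED state is triggered. That state leads to the lost the leadership error, and on this error k8s restarts the program right away, even though access to zk is in fact still fine. More digging showed that other people ran into this problem long ago and even changed the code, but the Flink project has not merged the change. I will skip the investigation itself; the useful links are: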

  1. https://www.cnblogs.com/029zz010buct/p/10946244.html

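The useful part here is upgrading the curator package. Flink uses 2.12.0, and I have not tried the upgrade yet. The SessionConnectionStateErrorPolicy mentioned there is in the 4.x versions, so part of the code would probably still have to be recompiled.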

  2. https://github.com/apache/flink/pull/9066 https://issues.apache.org/jira/browse/FLINK-10052

    This is someone else's solution, and it is the one I used as well: do not treat the SUSPENDED state as lost leadership, by modifying the handleStateChange method of LeaderLatch:

            case RECONNECTED:
            {
                try
                {
                    if (!hasLeadership.get())
                    {
                        reset();
                    }
                }
                catch ( Exception e )
                {
                    ThreadUtils.checkInterrupted(e);
                    log.error("Could not reset leader latch", e);
                    setLeadership(false);
                }
                break;
            }

            case LOST:
            {
                setLeadership(false);
                break;
            }
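
To make the effect of the change easier to see, here is a toy model of the state handling. It is not Curator's LeaderLatch: the enum, class, and method names are hypothetical, and the "stock" branch assumes the unpatched behavior is to drop leadership on SUSPENDED and to always reset on RECONNECTED.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the connection states discussed above. */
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** Toy model of a leader latch's leadership flag; not Curator code. */
final class LatchModel {
    private final AtomicBoolean hasLeadership = new AtomicBoolean(true);
    private final boolean stock; // true = assumed unpatched behavior
    private int resets = 0;

    LatchModel(boolean stock) {
        this.stock = stock;
    }

    void handleStateChange(ConnState newState) {
        switch (newState) {
            case RECONNECTED:
                // Patched: only re-enter the election if leadership was actually lost.
                if (stock || !hasLeadership.get()) {
                    reset();
                }
                break;
            case SUSPENDED:
                // Patched: a suspended connection no longer revokes leadership.
                if (stock) {
                    hasLeadership.set(false);
                }
                break;
            case LOST:
                hasLeadership.set(false);
                break;
            default:
                break;
        }
    }

    /** Models re-entering the election: leadership is given up and must be re-acquired. */
    private void reset() {
        hasLeadership.set(false);
        resets++;
    }

    boolean hasLeadership() {
        return hasLeadership.get();
    }

    int resets() {
        return resets;
    }

    public static void main(String[] args) {
        LatchModel stockLatch = new LatchModel(true);
        stockLatch.handleStateChange(ConnState.SUSPENDED);
        System.out.println("stock, after SUSPENDED: leader=" + stockLatch.hasLeadership());

        LatchModel patchedLatch = new LatchModel(false);
        patchedLatch.handleStateChange(ConnState.SUSPENDED);
        patchedLatch.handleStateChange(ConnState.RECONNECTED);
        System.out.println("patched, after SUSPENDED then RECONNECTED: leader="
                + patchedLatch.hasLeadership());
    }
}
```

In this model a short SUSPENDED/RECONNECTED blip leaves the patched latch as leader, while LOST still revokes leadership. The trade-off is that a leader whose connection is suspended keeps acting as leader until the session is actually lost.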

Compiling flink-shaded-hadoop-2-uber

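Having found this code, I naturally ended up at the flink-shaded-hadoop-2-uber-xxx.jar package. Flink 1.10 still supports this hadoop package; from 1.11 on it is no longer actively supported, and anyone who needs it has to download it themselves. Because this jar is deliberately added when the image is built, I targeted this package and recompiled it. Briefly, the build process was: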

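  1. Download the source of this version from https://github.com/apache/curator/tree/apache-curator-2.12.0 and modify src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java under curator-recipes as shown above. The resulting package is 2.12.0.
  2. Download the flink-shaded release-10.0 source from https://github.com/apache/flink-shaded/tree/release-10.0 and modify the pom file of flink-shaded-hadoop-2-parent: add an exclusion to drop the curator-recipes dependency, then add the self-built curator-recipes. Without the exclusion, the version defaults to 2.7.1; this part of the code has probably not been touched for years, so the version has stayed at 2.7.1.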
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-common</artifactId>
			<version>${hadoop.version}</version>
			<exclusions>
				<!-- ... several exclusions omitted ... -->
				<exclusion>
					<groupId>org.apache.curator</groupId>
					<artifactId>curator-recipes</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
		
		<dependency>
			<groupId>org.apache.curator</groupId>
			<artifactId>curator-recipes</artifactId>
			<version>2.12.0</version>
		</dependency>
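  3. Because we use the 2.8.3-10.0 version while the source is set to 2.4.1, change it to <hadoop.version>2.8.3</hadoop.version>.
  4. Following the readme.md in the root directory, run mvn package -Dshade-sources in the flink-shaded-release-10.0/flink-shaded-hadoop-2-parent directory to build the package. Once the build finishes, decompile the result with a tool to confirm that the SUSPENDED code is really gone, then rebuild the image and run the program.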
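
Before decompiling, it can help to confirm which LeaderLatch class files the rebuilt jar contains. The sketch below lists the matching entries using only the JDK; the class name JarClassFinder is hypothetical, and the jar path is whatever your build produced.

```java
import java.io.IOException;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

/** Hypothetical helper: lists jar entries whose name ends with a given suffix. */
final class JarClassFinder {

    /** Returns the sorted names of all entries in the jar that end with the suffix. */
    static List<String> findEntries(String jarPath, String suffix) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream()
                    .map(JarEntry::getName)
                    .filter(name -> name.endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JarClassFinder <path-to-jar>");
            return;
        }
        // Prints every LeaderLatch.class in the jar, including relocated (shaded) copies.
        for (String name : findEntries(args[0], "/LeaderLatch.class")) {
            System.out.println(name);
        }
    }
}
```

If the class appears under more than one package, or also in other jars on the classpath, the patched copy may not be the one that is loaded at runtime, so that is worth checking before relying on the fix.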

Finally

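After watching for a long time, the zk problem did not come back, but the program still had problems. The focus is now on OOM: the system log shows that the process was killed because of OOM.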