
Re: Serious stability issues when running on YARN (Flink 1.7.0)

Hi Gyula,

Your issue is possibly related to [1], where slots are released prematurely. I’ve raised a PR for it which is still pending review.
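Until the fix is merged, a workaround that has come up for the premature-release symptom is to raise the slot idle timeout so slots are not given up during recovery. A sketch for flink-conf.yaml (untested on my side; slot.idle.timeout is the key listed in the 1.7 docs, value in milliseconds):

    # Keep idle slots for 10 minutes instead of the default 50s,
    # so they are not released while the job is still recovering.
    slot.idle.timeout: 600000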


On Dec 20, 2018, at 9:33 PM, Gyula Fóra <gyula.fora@xxxxxxxxx> wrote:


Since moving to the new execution mode with Flink 1.7.0, we have observed some pretty bad stability issues with YARN execution.

It's pretty hard to understand what's going on, so sorry for the vague description, but here is what seems to happen:

In some cases, when a bigger job fails (let's say 30 TaskManagers with 10 slots each) and tries to recover, we can observe TaskManagers starting to fail.

The errors usually look like this:
20181220T141057.132+0100  INFO The heartbeat of TaskManager with id container_e15_1542798679751_0033_01_000021 timed out.  [org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$ @ 1137]
20181220T141057.133+0100  INFO Closing TaskExecutor connection container_e15_1542798679751_0033_01_000021 because: The heartbeat of TaskManager with id container_e15_1542798679751_0033_01_000021  timed out.  [org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection() @ 822]
20181220T141057.135+0100  INFO Execute processors -> (Filter config stream -> (Filter Failures, Flat Map), Filter BEA) (168/180) (3e9c164e4c0594f75c758624815265f1) switched from RUNNING to FAILED. org.apache.flink.util.FlinkException: The assigned slot container_e15_1542798679751_0033_01_000021_0 was removed.
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(
	at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(
	at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(
	at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(
	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(
 [org.apache.flink.runtime.executiongraph.Execution.transitionState() @ 1342]
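In case the heartbeats are timing out because of long GC pauses rather than real TaskManager failures, one mitigation we may try is giving the heartbeats more headroom in flink-conf.yaml, roughly like this (values in milliseconds; untested so far, key names as in the 1.7 docs):

    # Defaults are 10000 / 50000; allow longer pauses before
    # the ResourceManager declares a TaskManager dead.
    heartbeat.interval: 10000
    heartbeat.timeout: 120000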

The job then goes into a restart loop where TaskManagers come and go; the UI sometimes displays more than 30 TaskManagers and some extra slots. In some instances I have also seen "GC overhead limit exceeded" errors during recovery, which is very strange.
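To get more detail on the GC side we will probably enable GC logging on the TaskManagers, something like this in flink-conf.yaml (plain Java 8 GC flags; the log path is only an example and has to fit the container setup):

    env.java.opts.taskmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/taskmanager-gc.log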

I suspect something strange is happening, maybe some broken logic in the slot allocation or a memory leak.

Has anyone observed anything similar so far?
It seems to affect only some of our larger jobs. This wasn't a problem in previous Flink releases, where we always used the "legacy" execution mode.
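For now we can fall back with a one-line change in flink-conf.yaml (the mode option is still documented for 1.7), but we would obviously prefer to understand what goes wrong in the new mode:

    # Revert to the pre-FLIP-6 "legacy" execution mode.
    mode: legacy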

Thank you!