[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Running Flink on Yarn

I  have a setup for Flink(1.4.2) with YARN. I'm using Flink Yarn Client for
deploying my jobs to Yarn Cluster. 

In the current setup parallelism was directly mapped to the number of cores,
with each parallel instance of the job running in one container. So for a
parallelism of 9, there are 10 containers - 1 JM and 9 TM and each container
has 1 core. Each container(or each parallel instance) has one task manager
and each slot holds the entire pipeline for the job. 

Most of the jobs have a join with the window storing data for last ⅔ hours.
As per my understanding here, 
each container will save it's own copy of the this last 2/3 hours data and
this is not shared between two container. 

Since this window data will be same across each container, I feel if I could
have one task manager with  with multiple task slot that could share this
window data I could save a lot on my resources (each container won't need to
maintain it's own copy of window data). If I had 3 container each with one
TM and 3 Task Slot each, then I would need only 3 containers for my job to
achieve a parallelism of 9 (each task slot will hold the entire job
pipeline, so each container helps me achieve a parallelism of 3
individually). I'm assuming that this window data will be shared among all
parallel instance running in different task slot in each container. Please
correct me here. 

As per flink docs - 

Having multiple slots means more subtasks share the same JVM. Tasks in the
same JVM share TCP connections (via multiplexing) and heartbeat messages.
They may also share data sets and data structures, thus reducing the
per-task overhead. 

Sent from: