Recently, the community has been actively working on this. The JIRA to follow is: https://issues.apache.org/jira/browse/SPARK-25299. A group of companies, including Bloomberg and Palantir, is working on a WIP solution that implements a variant of Option #5 (which is elaborated upon in the Google doc linked in the JIRA summary).

On Wed, Dec 19, 2018 at 5:20 AM <marek-simunek@xxxxxxxxx> wrote:

Hi everyone,

we are facing the same problems Facebook had, where the shuffle service is a bottleneck. For now we have solved that with a large task size (2g) to reduce shuffle I/O.
I saw a very nice presentation from Brian Cho on optimizing shuffle I/O at large scale. It is an implementation of a white paper.
At the end of the lecture, Brian Cho kindly mentioned plans to contribute it back to Spark. I checked the mailing list and the Spark JIRA and didn't find any ticket on this topic.
Does anyone have a contact at Facebook who might know more about this? Or are there any plans to bring a similar optimization to Spark?