执行大数据量的shuffle过程时Executor注册shuffle service失败
问题
执行超过50T数据的shuffle过程时,出现部分Executor注册shuffle service超时然后丢失从而导致任务失败的问题。错误日志如下所示:
2016-10-19 01:33:34,030 | WARN | ContainersLauncher #14 | Exception from container-launch with container ID: container_e1452_1476801295027_2003_01_004512 and exit code: 1 | LinuxContainerExecutor.java:397 ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:561) at org.apache.hadoop.util.Shell.run(Shell.java:472) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:381) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:312) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:88) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Exception from container-launch. | ContainerExecutor.java:300 2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Container id: container_e1452_1476801295027_2003_01_004512 | ContainerExecutor.java:300 2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Exit code: 1 | ContainerExecutor.java:300 2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Stack trace: ExitCodeException exitCode=1: | ContainerExecutor.java:300
回答
由于当前数据量较大,有50T数据导入,超过了shuffle的规格,shuffle负载过高,shuffle service服务处于过载状态,可能无法及时响应Executor的注册请求,从而出现上面的问题。
Executor注册shuffle service的超时时间是5秒,最多重试3次,该参数目前不可配。
建议适当调大task retry次数和Executor失败次数。
在客户端的“spark-defaults.conf”配置文件中配置如下参数。“spark.yarn.max.executor.failures”如果不存在,则手动添加该参数项。
参数 |
描述 |
默认值 |
---|---|---|
spark.task.maxFailures |
task retry次数。 |
4 |
spark.yarn.max.executor.failures |
Executor失败次数。 关闭Executor个数动态分配功能的场景即“spark.dynamicAllocation.enabled”参数设为“false”时。 |
numExecutors * 2, with minimum of 3 |
Executor失败次数。 开启Executor个数动态分配功能的场景即“spark.dynamicAllocation.enabled”参数设为“true”时。 |
3 |