文档首页/ MapReduce服务 MRS/ 故障排除/ 使用Spark/ Yarn-client模式提交Spark任务时ApplicationMaster尝试启动两次失败
更新时间:2024-11-01 GMT+08:00
分享

Yarn-client模式提交Spark任务时ApplicationMaster尝试启动两次失败

问题背景与现象

Yarn-client模式提交任务AppMaster尝试启动两次失败。

原因分析

  1. Driver端异常:
    16/05/11 18:10:56 INFO Client: 
    client token: N/A
    diagnostics: Application application_1462441251516_0024 failed 2 times due to AM Container for appattempt_1462441251516_0024_000002 exited with  exitCode: 10
    For more detailed output, check the application tracking page:https://hdnode5:26001/cluster/app/application_1462441251516_0024 Then click on links to logs of each attempt.
    Diagnostics: Exception from container-launch.
    Container id: container_1462441251516_0024_02_000001
  2. 在ApplicationMaster日志中,异常如下:
    2016-05-12 10:21:23,715 | ERROR | [main] | Failed to connect to driver at 192.168.30.57:23867, retrying ... | org.apache.spark.Logging$class.logError(Logging.scala:75)
    2016-05-12 10:21:24,817 | ERROR | [main] | Failed to connect to driver at 192.168.30.57:23867, retrying ... | org.apache.spark.Logging$class.logError(Logging.scala:75)
    2016-05-12 10:21:24,918 | ERROR | [main] | Uncaught exception:  | org.apache.spark.Logging$class.logError(Logging.scala:96)
    org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:426)
    at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:292)
    …
    2016-05-12 10:21:24,925 | INFO  | [Thread-1] | Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

    Spark-client模式任务Driver运行在客户端节点上(通常是集群外的某个节点),启动时先在集群中启动AppMaster进程,进程启动后要向Driver进程注册信息,注册成功后,任务才能继续。从AppMaster日志中可以看出,无法连接至Driver,所以任务失败。

解决办法

  1. 请检查Driver进程所在的IP是否可以ping通。
  2. 启动一个Spark PI任务,会有类似如下打印信息。

    16/05/11 18:07:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.100:23662]
    16/05/11 18:07:20 INFO Utils: Successfully started service 'sparkDriver' on port 23662.

  3. 在该节点,也就是2中示例的192.168.1.100上执行netstat - anp | grep 23662看下此端口是否打开,如下打印标明,相关端口是打开的。

    tcp        0      0  ip:port    :::*                    LISTEN      107274/java        
    tcp        0      0  ip:port   ip:port                  ESTABLISHED 107274/java        

  4. 在AppMaster启动的节点执行telnet 192.168.1.100 23662看下是否可以连通该端口,请使用root用户和omm用户都执行一遍。 如果出现Escape character is '^]'类似打印则说明可以连通,如果出现connection refused则表示失败,无法连接到相关端口。

    如果相关端口打开,但是从别的节点无法连通到该端口,则需要排查下相关网络配置。

    23662这个端口每次都是随机的,所以要根据自己启动任务打开的端口来测试。

相关文档