ApplicationMaster Failed to Start Twice in Yarn-client Mode

Symptom

In Yarn-client mode, the ApplicationMaster fails to start on both attempts, and the application fails.

Cause Analysis

Driver exception:

16/05/11 18:10:56 INFO Client: 
client token: N/A
diagnostics: Application application_1462441251516_0024 failed 2 times due to AM Container for appattempt_1462441251516_0024_000002 exited with  exitCode: 10
For more detailed output, check the application tracking page:https://hdnode5:26001/cluster/app/application_1462441251516_0024 Then click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1462441251516_0024_02_000001

The ApplicationMaster log file contains the following error information:

2016-05-12 10:21:23,715 | ERROR | [main] | Failed to connect to driver at 192.168.30.57:23867, retrying ... | org.apache.spark.Logging$class.logError(Logging.scala:75)
2016-05-12 10:21:24,817 | ERROR | [main] | Failed to connect to driver at 192.168.30.57:23867, retrying ... | org.apache.spark.Logging$class.logError(Logging.scala:75)
2016-05-12 10:21:24,918 | ERROR | [main] | Uncaught exception:  | org.apache.spark.Logging$class.logError(Logging.scala:96)
org.apache.spark.SparkException: Failed to connect to driver!
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:426)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:292)
...
2016-05-12 10:21:24,925 | INFO  | [Thread-1] | Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!) | org.apache.spark.Logging$class.logInfo(Logging.scala:59)

In Yarn-client mode, the Driver runs on the client node, which is usually a node outside the cluster. When the application starts, the ApplicationMaster process is launched inside the cluster and must then register with the Driver process. The task can continue only after this registration succeeds. The ApplicationMaster log shows that the connection to the Driver failed, which caused the task to fail.

Solution

Check whether the IP address of the node where the Driver process runs can be pinged.

Start a SparkPi task. Information similar to the following is displayed on the console:

16/05/11 18:07:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.100:23662]
16/05/11 18:07:20 INFO Utils: Successfully started service 'sparkDriver' on port 23662.
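
The Driver address and port can also be pulled out of the console output with a small script instead of being read by eye. The following is a minimal sketch, assuming the console prints a line in the same "sparkDriver@host:port" form as the example above; the function name driver_listen_address is hypothetical and chosen only for illustration.

```python
import re

# Matches the "sparkDriver@<host>:<port>" part of the Remoting line shown above.
REMOTING_ADDR = re.compile(r"sparkDriver@([\w.\-]+):(\d+)")

def driver_listen_address(console_text):
    """Return (host, port) of the Driver listener, or None if no such line is found."""
    match = REMOTING_ADDR.search(console_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

# Console lines copied from the example above.
sample = (
    "16/05/11 18:07:20 INFO Remoting: Remoting started; listening on addresses "
    ":[akka.tcp://sparkDriver@192.168.1.100:23662]\n"
    "16/05/11 18:07:20 INFO Utils: Successfully started service 'sparkDriver' on port 23662.\n"
)

print(driver_listen_address(sample))  # ('192.168.1.100', 23662)
```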

Run the netstat -anp | grep 23662 command on the Driver node (192.168.1.100 in the preceding console output) to check whether the port is enabled. Information similar to the following indicates that the port is enabled.

tcp        0      0  ip:port    :::*                    LISTEN      107274/java        
tcp        0      0  ip:port   ip:port                  ESTABLISHED 107274/java
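
If the netstat output is long, the LISTEN entry for the port can be filtered out programmatically. The following is a rough sketch, assuming the usual Linux netstat -anp column order (protocol, Recv-Q, Send-Q, local address, foreign address, state, PID/program). The function name listening_entries is hypothetical, and the addresses in the sample are made up for illustration.

```python
def listening_entries(netstat_output, port):
    """Return the netstat lines in LISTEN state whose local address ends with :<port>."""
    matches = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # fields[3] is the local address and fields[5] is the connection state.
        if len(fields) >= 6 and fields[5] == "LISTEN" and fields[3].endswith(f":{port}"):
            matches.append(line)
    return matches

# Illustrative output only; the addresses are invented for this sketch.
sample = (
    "tcp        0      0 192.168.1.100:23662     :::*                    LISTEN      107274/java\n"
    "tcp        0      0 192.168.1.100:23662     192.168.1.57:41234      ESTABLISHED 107274/java\n"
)

for entry in listening_entries(sample, 23662):
    print(entry)
```

An empty result suggests that no process is listening on the port on that node.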

Run the telnet 192.168.1.100 23662 command on the node where the ApplicationMaster is started to check whether the port can be reached. Perform this operation as both the root user and the omm user. If information similar to "Escape character is '^]'" is displayed, the connection is normal. If "connection refused" is displayed, the connection failed and the port cannot be reached from this node.
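
Where telnet is not installed, the same reachability check can be approximated with a short script. The following is a minimal sketch, not a replacement for the telnet test; the function name probe and the 5-second default timeout are assumptions made for illustration.

```python
import socket

def probe(host, port, timeout=5.0):
    """Attempt a TCP connection and return 'connected', 'refused', 'timeout', or 'error: ...'."""
    try:
        # create_connection performs the TCP handshake, as telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on this port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout; packets may be dropped along the way.
        return "timeout"
    except OSError as exc:
        # Other failures, for example an unreachable host or network.
        return f"error: {exc}"
```

For example, calling probe("192.168.1.100", 23662) on the ApplicationMaster node returns "connected" in the case where telnet would print the escape-character message, and "refused" where telnet would report a refused connection.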

If the port is enabled but cannot be reached from other nodes, check the network configuration between the client node and the cluster.

The port (23662 in this example) is randomly selected each time a task starts. Therefore, always test the port enabled by the task that is currently running.