Updated on 2023-01-11 GMT+08:00

Failed to Connect to ResourceManager When a Spark Task Is Submitted

Symptom

The connection to ResourceManager is abnormal, and Spark tasks fail to be submitted.

Cause Analysis

  1. The following error message is displayed on the driver, indicating that connections to port 26004 on the active and standby ResourceManager nodes are refused:
    15/08/19 18:36:16 INFO RetryInvocationHandler: Exception while invoking getClusterMetrics of class ApplicationClientProtocolPBClientImpl over 33 after 1 fail over attempts. Trying to fail over after sleeping for 17448ms. 
     java.net.ConnectException: Call From ip0 to ip1:26004 failed on connection exception: java.net.ConnectException: Connection refused.
    INFO RetryInvocationHandler: Exception while invoking getClusterMetrics of class ApplicationClientProtocolPBClientImpl over 32 after 2 fail over attempts. Trying to fail over after sleeping for 16233ms. 
     java.net.ConnectException: Call From ip0 to ip2:26004 failed on connection exception: java.net.ConnectException: Connection refused;
  2. On MRS Manager, check whether ResourceManager is running properly, as shown in Figure 1. If Yarn is faulty or an unknown exception occurs on a Yarn service instance, the ResourceManager of the cluster may be abnormal.
    Figure 1 Service status
  3. Check whether the client in the cluster is of the latest version.

    Check whether a ResourceManager instance has been migrated in the cluster, that is, uninstalled from one node and then added back on a different node. A client installed before such a migration may still point to the old ResourceManager address.

  4. On MRS Manager, click Audit to view audit logs and check whether related operations are recorded.

    Run the ping command on the node where the task is submitted to check whether the IP addresses of the ResourceManager nodes are reachable.
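
The log inspection in step 1 and the reachability check in step 4 can be sketched in Python. This is a minimal sketch, not an MRS tool: the function names refused_endpoints and can_connect are hypothetical, and the regular expression assumes the "Call From ... to host:port failed on connection exception" wording shown in the driver log above. Because ping only confirms that a host answers ICMP requests, the sketch also attempts a TCP connection to the ResourceManager port itself.

```python
import re
import socket

# Matches the "Call From <src> to <host>:<port> failed on connection exception"
# wording seen in the driver log (assumed to be the same on every error line).
_REFUSED = re.compile(r"Call From \S+ to (\S+):(\d+) failed on connection exception")


def refused_endpoints(log_text):
    """Return the unique (host, port) pairs the driver failed to reach, in log order."""
    seen = []
    for host, port in _REFUSED.findall(log_text):
        endpoint = (host, int(port))
        if endpoint not in seen:
            seen.append(endpoint)
    return seen


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, looping over refused_endpoints(driver_log) and calling can_connect(host, port) for each pair reports every ResourceManager address from the log together with whether its port currently accepts connections. Here driver_log is a hypothetical string that holds the driver output.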

Solution

  • If ResourceManager is abnormal, see the Yarn-related sections to rectify the fault.
  • If the client is not the latest version, download and install the client again.
  • If the IP address cannot be pinged, contact the network administrator to check the network.
  • If HA is enabled for the cluster, set the Yarn parameter yarn.client.failover-sleep-base-ms to a smaller value so that the client waits less time between failover attempts.
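
To see why a smaller yarn.client.failover-sleep-base-ms shortens the "sleeping for ...ms" delays in the driver log, the following Python sketch models the wait between failover attempts. It is an illustrative model only: it assumes a capped exponential backoff with random jitter, and the function names, the max_ms cap, and the jitter range are assumptions made for this example, so the exact values computed by the Yarn client may differ.

```python
import random


def failover_sleep_ms(base_ms, attempt, max_ms, jitter=None):
    """Model the sleep before a failover attempt (assumed formula, not Yarn's code).

    base_ms plays the role of yarn.client.failover-sleep-base-ms.
    max_ms is a hypothetical upper bound on the un-jittered delay.
    jitter is a multiplier; when omitted, a random value in [0.5, 1.5] is used.
    """
    if jitter is None:
        jitter = random.uniform(0.5, 1.5)
    capped = min(base_ms * (2 ** attempt), max_ms)
    return int(capped * jitter)


def total_wait_ms(base_ms, max_ms, attempts, jitter=1.0):
    """Sum the modelled sleeps for failover attempts 1..attempts."""
    return sum(
        failover_sleep_ms(base_ms, n, max_ms, jitter)
        for n in range(1, attempts + 1)
    )
```

Under this model, with a cap of 15000 ms and three failover attempts, a base of 100 ms gives a total wait of 1400 ms, whereas a base of 30000 ms hits the cap on every attempt and waits 45000 ms in total.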