Updated on 2024-08-10 GMT+08:00

Spark REST APIs

Function

Spark REST APIs expose some of the web UI metrics in JSON format. This gives users a simpler way to build display and monitoring tools and lets them query information about running and completed applications. The open-source Spark REST API allows users to query information about Jobs, Stages, Storage, Environment, and Executors. The FusionInsight version adds REST APIs for querying SQL, JDBC/ODBC server, and Streaming information. For more information about the open-source REST API, see https://spark.apache.org/docs/3.1.1/monitoring.html#rest-api.

Preparing a Running Environment

Install the FusionInsight client on the node, for example, in the /opt/client directory.

REST API

You can run the following commands to skip the REST API filter and obtain application information.

  • For clusters with the security mode enabled, JobHistory only works with the HTTPS protocol. Therefore, when using the command, ensure that the URL includes the HTTPS protocol.
  • For clusters with the security mode enabled, you need to set spark.ui.customErrorPage to false and restart Spark2x. Modify the parameter value for JobHistory2x, JDBCServer2x, and SparkResource2x instances.

Accessing the Spark2x JobHistory over HTTPS differs from HTTP-based access because SSL encryption is used: the SSL protocol used by the curl command must also be supported by the cluster. If the cluster does not support that protocol, use either of the following methods:

  • Modify the SSL protocol configured for the cluster. For example, if the curl command supports only the TLSv1 protocol (TLSv1 has security vulnerabilities and must be used with caution), perform the following steps:
    1. Log in to FusionInsight Manager, click Cluster, click the name of the desired cluster, choose Service > Spark2x, and choose Configurations > All Configurations.
    2. Search for ssl in the search box. Check whether the value of spark.ssl.historyServer.protocol for JobHistory contains TLSv1. If it does not, add TLSv1 to the value.
    3. Clear the value of the spark.ssl.historyServer.enabledAlgorithms parameter for JobHistory.
    4. Click Save Configuration and then OK. Restart the Spark2x service or JobHistory instance.
  • Perform the following steps to upgrade the curl version on the node:
    1. Download the curl installation package at http://curl.haxx.se/download/.
    2. Decompress the installation package.

      tar -xzvf curl-x.x.x.tar.gz

    3. Overwrite the old curl version with the new one.

      cd curl-x.x.x

      ./configure

      make

      make install

    4. Update the dynamic link library of curl.

      ldconfig

    5. Log back into the node once the installation is complete and execute the following command to verify if the curl version has been updated:

      curl --version

  • Obtaining information about all JobHistory applications
    • Command:
      curl -k -i --negotiate -u: "https://192.168.227.16:4040/api/v1/applications"

      192.168.227.16 indicates the service IP address of the JobHistory node and 4040 indicates the port number of the JobHistory node.

    • Command output:
      [ {
        "id" : "application_1517290848707_0008",
        "name" : "Spark Pi",
        "attempts" : [ {
          "startTime" : "2018-01-30T15:05:37.433CST",
          "endTime" : "2018-01-30T15:06:04.625CST",
          "lastUpdated" : "2018-01-30T15:06:04.848CST",
          "duration" : 27192,
          "sparkUser" : "sparkuser",
          "completed" : true,
          "startTimeEpoch" : 1517295937433,
          "endTimeEpoch" : 1517295964625,
          "lastUpdatedEpoch" : 1517295964848
        } ]
      }, {
        "id" : "application_1517290848707_0145",
        "name" : "Spark shell",
        "attempts" : [ {
          "startTime" : "2018-01-31T15:20:31.286CST",
          "endTime" : "1970-01-01T07:59:59.999CST",
          "lastUpdated" : "2018-01-31T15:20:47.086CST",
          "duration" : 0,
          "sparkUser" : "admintest",
          "completed" : false,
          "startTimeEpoch" : 1517383231286,
          "endTimeEpoch" : -1,
          "lastUpdatedEpoch" : 1517383247086
        } ]
      }]
    • Result analysis:
      With this command, you can query information about all Spark applications in the current cluster, including both running and completed applications. Table 1 describes the information returned for each application.
      Table 1 Parameter description

      Parameter   Description
      id          Application ID.
      name        Application name.
      attempts    Details of each application attempt, including its start and end times, the user who launched it, and whether it has completed.
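As a rough illustration of how the fields in Table 1 might be consumed by a monitoring script, the following Python sketch parses a trimmed copy of the sample response above and reports which applications are still running. The function name summarize_applications is hypothetical, and reading only the first listed attempt is an assumption made for brevity.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the sample response shown above (only the fields used here).
SAMPLE_APPS = """
[
  {"id": "application_1517290848707_0008", "name": "Spark Pi",
   "attempts": [{"duration": 27192, "sparkUser": "sparkuser", "completed": true,
                 "startTimeEpoch": 1517295937433, "endTimeEpoch": 1517295964625}]},
  {"id": "application_1517290848707_0145", "name": "Spark shell",
   "attempts": [{"duration": 0, "sparkUser": "admintest", "completed": false,
                 "startTimeEpoch": 1517383231286, "endTimeEpoch": -1}]}
]
"""


def summarize_applications(body):
    """Return one summary dict per application (hypothetical helper).

    Assumption: the first entry in "attempts" is the one of interest.
    """
    rows = []
    for app in json.loads(body):
        attempt = app["attempts"][0]
        # Epoch fields are in milliseconds; convert to an aware UTC datetime.
        started = datetime.fromtimestamp(
            attempt["startTimeEpoch"] / 1000, tz=timezone.utc
        )
        rows.append({
            "id": app["id"],
            "name": app["name"],
            "user": attempt["sparkUser"],
            "running": not attempt["completed"],
            "started_utc": started.isoformat(timespec="seconds"),
            "duration_s": attempt["duration"] / 1000,
        })
    return rows


summary = summarize_applications(SAMPLE_APPS)
for row in summary:
    print(row)
```

In the sample, the running application reports a duration of 0 and an endTimeEpoch of -1, so the completed flag is the more reliable indicator of state.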

  • Obtaining information about a specific JobHistory application
    • Command:
      curl -k -i --negotiate -u: "https://192.168.227.16:4040/api/v1/applications/application_1517290848707_0008"

      192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1517290848707_0008 indicates the application ID.

    • Command output:
      {
        "id" : "application_1517290848707_0008",
        "name" : "Spark Pi",
        "attempts" : [ {
          "startTime" : "2018-01-30T15:05:37.433CST",
          "endTime" : "2018-01-30T15:06:04.625CST",
          "lastUpdated" : "2018-01-30T15:06:04.848CST",
          "duration" : 27192,
          "sparkUser" : "sparkuser",
          "completed" : true,
          "startTimeEpoch" : 1517295937433,
          "endTimeEpoch" : 1517295964625,
          "lastUpdatedEpoch" : 1517295964848
        } ]
      }
    • Result analysis:

      With this command, you can query information about a Spark application. Table 1 provides information about the application.

  • Obtaining information about the executor of a running application:
    • Command for active executors:
      curl -k -i --negotiate -u: "https://192.168.169.84:8090/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/executors"
    • Command for all executors, both active and dead:
      curl -k -i --negotiate -u: "https://192.168.169.84:8090/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/allexecutors"

      192.168.169.84 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1478570725074_0046 indicates the application ID in YARN.

    • Command output:
      [{
        "id" : "driver",
        "hostPort" : "192.168.169.84:23886",
        "isActive" : true,
        "rddBlocks" : 0,
        "memoryUsed" : 0,
        "diskUsed" : 0,
        "activeTasks" : 0,
        "failedTasks" : 0,
        "completedTasks" : 0,
        "totalTasks" : 0,
        "totalDuration" : 0,
        "totalInputBytes" : 0,
        "totalShuffleRead" : 0,
        "totalShuffleWrite" : 0,
        "maxMemory" : 278019440,
        "executorLogs" : { }
      }, {
        "id" : "1",
        "hostPort" : "192.168.169.84:23902",
        "isActive" : true,
        "rddBlocks" : 0,
        "memoryUsed" : 0,
        "diskUsed" : 0,
        "totalCores" : 1,
        "maxTasks" : 1,
        "activeTasks" : 0,
        "failedTasks" : 0,
        "completedTasks" : 0,
        "totalTasks" : 0,
        "totalDuration" : 0,
        "totalGCTime" : 139,
        "totalInputBytes" : 0,
        "totalShuffleRead" : 0,
        "totalShuffleWrite" : 0,
        "maxMemory" : 555755765,
        "executorLogs" : {
          "stdout" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stdout?start=-4096",
          "stderr" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stderr?start=-4096"
        }
      } ]
    • Result analysis:
      With this command, you can query information about all executors of the current application, including the driver. Table 2 describes the basic information returned for each executor.
      Table 2 Executor parameter description

      Parameter      Description
      id             Executor ID.
      hostPort       IP address and port number of the node where the executor resides.
      executorLogs   Path for viewing executor logs.
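The executor response mixes the driver entry with the worker executors, so a script usually has to separate them before aggregating. The sketch below does this on a trimmed copy of the sample output above. The helper name split_executors is hypothetical, and treating the entry whose id is "driver" as the driver is an assumption based on the sample.

```python
import json

# Trimmed copy of the sample executor response shown above.
SAMPLE_EXECUTORS = """
[
  {"id": "driver", "hostPort": "192.168.169.84:23886", "isActive": true,
   "memoryUsed": 0, "maxMemory": 278019440, "executorLogs": {}},
  {"id": "1", "hostPort": "192.168.169.84:23902", "isActive": true,
   "memoryUsed": 0, "maxMemory": 555755765, "totalCores": 1,
   "executorLogs": {
     "stdout": "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stdout?start=-4096",
     "stderr": "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stderr?start=-4096"}}
]
"""


def split_executors(body):
    """Separate the driver entry from the worker executors (hypothetical helper)."""
    entries = json.loads(body)
    driver = [e for e in entries if e["id"] == "driver"]
    workers = [e for e in entries if e["id"] != "driver"]
    return driver, workers


driver, workers = split_executors(SAMPLE_EXECUTORS)

# Aggregate a few values over the worker executors only.
total_worker_memory = sum(e["maxMemory"] for e in workers)
active_hosts = sorted(
    {e["hostPort"].rsplit(":", 1)[0] for e in workers if e["isActive"]}
)

print("worker executors:", len(workers))
print("total worker maxMemory (bytes):", total_worker_memory)
print("hosts with active executors:", active_hosts)
for e in workers:
    # executorLogs is empty for the driver entry in the sample.
    print(e["id"], "stderr log:", e["executorLogs"].get("stderr"))
```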

Enhanced REST API

  • SQL commands: Get all SQL statements and the one with the longest execution time.
    • Spark UI command:
      curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL"

      192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.

      You can append query parameters to the URL to filter the SQL statements that are returned.

      You can run the following command to view 100 SQL statements:

      curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL?limit=100"

      You can run the following command to view the SQL statements that have not completed:

      curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL?completed=false"
    • JobHistory command:
      curl -k -i --negotiate -u: "https://192.168.227.16:4040/api/v1/applications/application_1478570725074_0004/SQL"

      192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1478570725074_0004 indicates the application ID.

    • Command output:

      The query results of the Spark UI and JobHistory commands are as follows:

      {
        "longestDurationOfCompletedSQL" : [ {
          "id" : 0,
          "status" : "COMPLETED",
          "description" : "getCallSite at SQLExecution.scala:48",
          "submissionTime" : "2016/11/08 15:39:00",
          "duration" : "2 s",
          "runningJobs" : [ ],
          "successedJobs" : [ 0 ],
          "failedJobs" : [ ]
        } ],
        "sqls" : [ {
          "id" : 0,
          "status" : "COMPLETED",
          "description" : "getCallSite at SQLExecution.scala:48",
          "submissionTime" : "2016/11/08 15:39:00",
          "duration" : "2 s",
          "runningJobs" : [ ],
          "successedJobs" : [ 0 ],
          "failedJobs" : [ ]
        }]
      }
    • Result analysis:
      After running this command, you can obtain all the SQL statements executed by the current application (the sqls part of the command output) and the SQL statements with the longest execution time (the longestDurationOfCompletedSQL part of the command output). The information about each SQL statement is listed in Table 3.
      Table 3 SQL parameter description

      Parameter       Description
      id              SQL statement ID.
      status          Execution status of an SQL statement. The options are RUNNING, COMPLETED, and FAILED.
      runningJobs     Jobs that are being executed.
      successedJobs   Jobs that have been executed successfully.
      failedJobs      Jobs that failed to be executed.
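To show how the sqls and longestDurationOfCompletedSQL parts of the response relate to the fields in Table 3, here is a small sketch that counts statements by status and lists those with failed jobs. It works on a trimmed copy of the sample output above, and sql_overview is a hypothetical name. The duration strings are left unparsed because their exact format is not specified here.

```python
import json
from collections import Counter

# Trimmed copy of the sample SQL response shown above.
SAMPLE_SQL = """
{
  "longestDurationOfCompletedSQL": [
    {"id": 0, "status": "COMPLETED", "duration": "2 s",
     "runningJobs": [], "successedJobs": [0], "failedJobs": []}
  ],
  "sqls": [
    {"id": 0, "status": "COMPLETED", "duration": "2 s",
     "runningJobs": [], "successedJobs": [0], "failedJobs": []}
  ]
}
"""


def sql_overview(body):
    """Summarize the SQL response (hypothetical helper).

    Returns (status counts, IDs of statements with failed jobs,
    IDs of the longest-running completed statements).
    """
    data = json.loads(body)
    status_counts = Counter(s["status"] for s in data["sqls"])
    with_failed_jobs = [s["id"] for s in data["sqls"] if s["failedJobs"]]
    longest_ids = [s["id"] for s in data["longestDurationOfCompletedSQL"]]
    return dict(status_counts), with_failed_jobs, longest_ids


status_counts, with_failed_jobs, longest_ids = sql_overview(SAMPLE_SQL)
print("statements by status:", status_counts)
print("statements with failed jobs:", with_failed_jobs)
print("longest completed statement IDs:", longest_ids)
```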

  • JDBC Server commands: Obtain the number of connections, the number of running SQL statements, as well as information about all sessions and all SQL statements.
    • Command:
      curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/sqlserver"

      192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.

    • Command output:
      {
        "sessionNum" : 1,
        "runningSqlNum" : 0,
        "sessions" : [ {
          "user" : "spark",
          "ip" : "192.168.169.84",
          "sessionId" : "9dfec575-48b4-4187-876a-71711d3d7a97",
          "startTime" : "2016/10/29 15:21:10",
          "finishTime" : "",
          "duration" : "1 minute 50 seconds",
          "totalExecute" : 1
        } ],
        "sqls" : [ {
          "user" : "spark",
          "jobId" : [ ],
          "groupId" : "e49ff81a-230f-4892-a209-a48abea2d969",
          "startTime" : "2016/10/29 15:21:13",
          "finishTime" : "2016/10/29 15:21:14",
          "duration" : "555 ms",
          "statement" : "show tables",
          "state" : "FINISHED",
          "detail" : "== Parsed Logical Plan ==\nShowTablesCommand None\n\n== Analyzed Logical Plan ==\ntableName: string, isTemporary: boolean\nShowTablesCommand None\n\n== Cached Logical Plan ==\nShowTablesCommand None\n\n== Optimized Logical Plan ==\nShowTablesCommand None\n\n== Physical Plan ==\nExecutedCommand ShowTablesCommand None\n\nCode Generation: true"
        } ]
      }
    • Result analysis:
      After running this command, you can query the number of sessions in the current JDBC application, the number of SQL statements being executed, and information about all sessions and SQL statements. The information about each session is listed in Table 4, and the information about each SQL statement is listed in Table 5.
      Table 4 Session parameter description

      Parameter      Description
      user           User connected to the session.
      ip             IP address of the node where the session resides.
      sessionId      Session ID.
      startTime      Time when the session starts the connection.
      finishTime     Time when the session ends the connection.
      duration       Session connection duration.
      totalExecute   Number of SQL statements executed by the session.

      Table 5 SQL parameter description

      Parameter    Description
      user         User who executes the SQL statement.
      jobId        IDs of jobs contained in the SQL statement.
      groupId      ID of the group where the SQL statement resides.
      startTime    SQL start time.
      finishTime   SQL end time.
      duration     SQL execution duration.
      statement    SQL statement.
      detail       Logical plan and physical plan.
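The session and statement lists described in Tables 4 and 5 can be cross-checked against the sessionNum counter. The sketch below does so on a trimmed copy of the sample output above. The name jdbc_overview is hypothetical, and treating an empty finishTime as "session still open" is an assumption drawn only from the sample.

```python
import json

# Trimmed copy of the sample JDBC server response shown above.
SAMPLE_JDBC = """
{
  "sessionNum": 1,
  "runningSqlNum": 0,
  "sessions": [
    {"user": "spark", "ip": "192.168.169.84",
     "sessionId": "9dfec575-48b4-4187-876a-71711d3d7a97",
     "startTime": "2016/10/29 15:21:10", "finishTime": "",
     "duration": "1 minute 50 seconds", "totalExecute": 1}
  ],
  "sqls": [
    {"user": "spark", "jobId": [],
     "groupId": "e49ff81a-230f-4892-a209-a48abea2d969",
     "startTime": "2016/10/29 15:21:13", "finishTime": "2016/10/29 15:21:14",
     "duration": "555 ms", "statement": "show tables", "state": "FINISHED"}
  ]
}
"""


def jdbc_overview(body):
    """Summarize sessions and statements (hypothetical helper)."""
    data = json.loads(body)
    # Assumption: a session with an empty finishTime is still connected.
    open_sessions = [s["sessionId"] for s in data["sessions"] if not s["finishTime"]]
    statements_by_user = {}
    for sql in data["sqls"]:
        statements_by_user.setdefault(sql["user"], []).append(sql["statement"])
    return {
        "session_count_matches": data["sessionNum"] == len(data["sessions"]),
        "open_sessions": open_sessions,
        "running_sql": data["runningSqlNum"],
        "statements_by_user": statements_by_user,
    }


overview = jdbc_overview(SAMPLE_JDBC)
for key, value in overview.items():
    print(key, "=", value)
```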

  • JDBC API enhancement: Cancel an SQL statement that is being executed by using the execution ID obtained from Beeline.
    • Command:
      curl -k -i --negotiate -X PUT -u: "https://192.168.195.232:8090/proxy/application_1477722033672_0008/api/v1/applications/application_1477722033672_0008/cancel/execution?executionId=8"
    • Command output:

      The job whose execution ID is 8 is canceled.

    • Notes:

      Run the SQL statement in spark-beeline. If the SQL statement generates a Spark task, the execution ID of the SQL statement will be printed in Beeline. To cancel the execution of the SQL statement, run the preceding command.
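The cancel URL follows the same ResourceManager proxy pattern as the other commands, with the execution ID passed as a query parameter and PUT as the HTTP method. The sketch below only assembles such a request with the Python standard library and does not send it. The function name build_cancel_request is hypothetical, and sending the request on a security-mode cluster would additionally require Kerberos (SPNEGO) authentication, which urllib does not provide on its own.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_cancel_request(rm_host, rm_port, app_id, execution_id):
    """Build (but do not send) the PUT request that cancels an execution.

    Hypothetical helper: the URL layout mirrors the curl command above.
    """
    base = f"https://{rm_host}:{rm_port}/proxy/{app_id}/api/v1/applications/{app_id}"
    query = urlencode({"executionId": execution_id})
    return Request(f"{base}/cancel/execution?{query}", method="PUT")


# Values taken from the curl example above.
req = build_cancel_request(
    "192.168.195.232", 8090, "application_1477722033672_0008", 8
)
print(req.get_method(), req.full_url)
```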

  • Streaming commands: Obtain the average input frequency, scheduling delay, execution duration, and total delay.
    • Command:
      curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1477722033672_0008/api/v1/applications/application_1477722033672_0008/streaming/statistics"

      192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1477722033672_0008 indicates the application ID in YARN.

    • Command output:
      {
        "startTime" : "2018-12-25T08:58:10.836GMT",
        "batchDuration" : 1000,
        "numReceivers" : 1,
        "numActiveReceivers" : 1,
        "numInactiveReceivers" : 0,
        "numTotalCompletedBatches" : 373,
        "numRetainedCompletedBatches" : 373,
        "numActiveBatches" : 0,
        "numProcessedRecords" : 1,
        "numReceivedRecords" : 1,
        "avgInputRate" : 0.002680965147453083,
        "avgSchedulingDelay" : 14,
        "avgProcessingTime" : 47,
        "avgTotalDelay" : 62
      }
    • Result analysis:

      Once you execute this command, you can retrieve the average input frequency (measured in events per second), average scheduling delay (measured in milliseconds), average execution time (measured in milliseconds), and average total delay (measured in milliseconds) of the current Streaming application.
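One common use of these averages is a quick check of whether a Streaming application keeps up with its input. The sketch below compares the average processing time with the batch interval on a trimmed copy of the sample output above. The name streaming_health is hypothetical, and it assumes that batchDuration is expressed in milliseconds, like the delay fields.

```python
import json

# Trimmed copy of the sample streaming statistics shown above.
SAMPLE_STREAMING = """
{
  "batchDuration": 1000,
  "numReceivers": 1,
  "numActiveReceivers": 1,
  "numTotalCompletedBatches": 373,
  "numReceivedRecords": 1,
  "avgInputRate": 0.002680965147453083,
  "avgSchedulingDelay": 14,
  "avgProcessingTime": 47,
  "avgTotalDelay": 62
}
"""


def streaming_health(body):
    """Derive a few health indicators (hypothetical helper).

    Assumption: batchDuration and avgProcessingTime are both in milliseconds.
    """
    stats = json.loads(body)
    return {
        # Fraction of each batch interval spent processing, on average.
        "processing_share_of_batch": stats["avgProcessingTime"] / stats["batchDuration"],
        # If processing takes longer than the interval, batches start to queue up.
        "keeps_up": stats["avgProcessingTime"] < stats["batchDuration"],
        "all_receivers_active": stats["numActiveReceivers"] == stats["numReceivers"],
    }


health = streaming_health(SAMPLE_STREAMING)
for key, value in health.items():
    print(key, "=", value)
```

A growing avgSchedulingDelay across successive queries would be another hint that batches are queuing, although a single response cannot show that trend.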