Updated on 2024-08-10 GMT+08:00

Spark REST APIs

Function

Spark REST APIs display some metrics of the web UI in JSON format, provide users with a simpler method to create new display and monitoring tools, and enable users to query information about running apps and the completed apps. The open-source Spark REST API allows users to query information about Jobs, Stages, Storage, Environment, and Executors. In the FusionInsight version, the REST API used to query SQL, JDBC server, and Streaming information is added. For more information about the open-source REST API, see https://spark.apache.org/docs/3.1.1/monitoring.html#rest-api.

Preparing a Running Environment

Install a client. Install the client on the node. For example, install the client in the /opt/client directory.

REST APIs

You can run the given command to skip the REST API filter to obtain application information.

When using JobHistory on normal clusters, it only supports HTTP protocol. Therefore, ensure that the command includes the HTTP protocol in the URL.

  • Obtaining information about all applications on the JobHistory node:
    • Command:
      curl http://192.168.227.16:4040/api/v1/applications?mode=monitoring --insecure

      192.168.227.16 indicates the service IP address of the JobHistory node and 4040 indicates the port number of the JobHistory node.

    • Command output:
      [ {
        "id" : "application_1517290848707_0008",
        "name" : "Spark Pi",
        "attempts" : [ {
          "startTime" : "2018-01-30T15:05:37.433CST",
          "endTime" : "2018-01-30T15:06:04.625CST",
          "lastUpdated" : "2018-01-30T15:06:04.848CST",
          "duration" : 27192,
          "sparkUser" : "sparkuser",
          "completed" : true,
          "startTimeEpoch" : 1517295937433,
          "endTimeEpoch" : 1517295964625,
          "lastUpdatedEpoch" : 1517295964848
        } ]
      }, {
        "
      id" : "application_1517290848707_0145",
        "name" : "Spark shell",
        "attempts" : [ {
          "startTime" : "2018-01-31T15:20:31.286CST",
          "endTime" : "1970-01-01T07:59:59.999CST",
          "lastUpdated" : "2018-01-31T15:20:47.086CST",
          "duration" : 0,
          "sparkUser" : "admintest",
          "completed" : false,
          "startTimeEpoch" : 1517383231286,
          "endTimeEpoch" : -1,
          "lastUpdatedEpoch" : 1517383247086
        } ]
      }]
    • Result analysis:
      With this command, you can query information about all Spark applications in the current cluster, including running applications and the completed applications. Table 1 provides information about each application.
      Table 1 Parameter description

      Parameter

      Description

      id

      Application ID.

      name

      Application name.

      attempts

      Attempts executed by the application, including the attempt start time, attempt end time, user who initiates the attempts, and status indicating whether the attempts are completed.

  • Obtaining information about a specific application on the JobHistory node:
    • Command:
      curl http://192.168.227.16:4040/api/v1/applications/application_1517290848707_0008?mode=monitoring --insecure

      192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1517290848707_0008 indicates the application ID.

    • Command output:
      {
        "id" : "application_1517290848707_0008",
        "name" : "Spark Pi",
        "attempts" : [ {
          "startTime" : "2018-01-30T15:05:37.433CST",
          "endTime" : "2018-01-30T15:06:04.625CST",
          "lastUpdated" : "2018-01-30T15:06:04.848CST",
          "duration" : 27192,
          "sparkUser" : "sparkuser",
          "completed" : true,
          "startTimeEpoch" : 1517295937433,
          "endTimeEpoch" : 1517295964625,
          "lastUpdatedEpoch" : 1517295964848
        } ]
      }
    • Result analysis:

      With this command, you can query information about a Spark application. Table 1 provides information about the application.

  • Obtaining information about the executor of a running application:
    • Command for an alive executor:
      curl http://192.168.169.84:8088/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/executors?mode=monitoring --insecure
    • Command for all alive and dead executors:
      curl http://192.168.169.84:8088/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/allexecutors?mode=monitoring --insecure

      192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8088 indicates the port number of the ResourceManager, and application_1478570725074_0046 indicates the application ID in YARN.

    • Command output:
      [{
        "id" : "driver",
        "hostPort" : "192.168.169.84:23886",
        "isActive" : true,
        "rddBlocks" : 0,
        "memoryUsed" : 0,
        "diskUsed" : 0,
        "activeTasks" : 0,
        "failedTasks" : 0,
        "completedTasks" : 0,
        "totalTasks" : 0,
        "totalDuration" : 0,
        "totalInputBytes" : 0,
        "totalShuffleRead" : 0,
        "totalShuffleWrite" : 0,
        "maxMemory" : 278019440,
        "executorLogs" : { }
      }, {
        "id" : "1",
        "hostPort" : "192.168.169.84:23902",
        "isActive" : true,
        "rddBlocks" : 0,
        "memoryUsed" : 0,
        "diskUsed" : 0,
        "totalCores" : 1,
        "maxTasks" : 1,
        "activeTasks" : 0,
        "failedTasks" : 0,
        "completedTasks" : 0,
        "totalTasks" : 0,
        "totalDuration" : 0,
        "totalGCTime" : 139,
        "totalInputBytes" : 0,
        "totalShuffleRead" : 0,
        "totalShuffleWrite" : 0,
        "maxMemory" : 555755765,
        "executorLogs" : {
          "stdout" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stdout?start=-4096",
          "stderr" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stderr?start=-4096"
        }
      } ]
    • Result analysis:
      With this command, you can query information about all executors including drivers of the current application. Table 2 provides basic information about each executor.
      Table 2 Executor parameter description

      Parameter

      Description

      id

      Executor ID.

      hostPort

      IP address and port number of the node where the executor resides.

      executorLogs

      Path for viewing executor logs.

Enhanced REST APIs

  • SQL commands: Get all SQL statements and the one with the longest execution time.
    • Spark UI command:
      curl http://192.168.195.232:8088/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL?mode=monitoring --insecure

      192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8088 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.

    • JobHistory command:
      curl http://192.168.227.16:4040/api/v1/applications/application_1478570725074_0004/SQL?mode=monitoring --insecure

      192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1478570725074_0004 indicates the application ID.

    • Command output:

      The query results of the Spark UI and JobHistory commands are as follows:

      {
        "longestDurationOfCompletedSQL" : [ {
          "id" : 0,
          "status" : "COMPLETED",
          "description" : "getCallSite at SQLExecution.scala:48",
          "submissionTime" : "2016/11/08 15:39:00",
          "duration" : "2 s",
          "runningJobs" : [ ],
          "successedJobs" : [ 0 ],
          "failedJobs" : [ ]
        } ],
        "sqls" : [ {
          "id" : 0,
          "status" : "COMPLETED",
          "description" : "getCallSite at SQLExecution.scala:48",
          "submissionTime" : "2016/11/08 15:39:00",
          "duration" : "2 s",
          "runningJobs" : [ ],
          "successedJobs" : [ 0 ],
          "failedJobs" : [ ]
        }]
      }
    • Result analysis:
      After running this command, you can obtain all the SQL statements executed by the current application (the sqls part of the command output) and the SQL statements with the longest execution time (the longestDurationOfCompletedSQL part of the command output). The information about each SQL statement is listed in Table 3.
      Table 3 SQL parameter description

      Parameter

      Description

      id

      SQL statement ID.

      status

      Execution status of an SQL statement. The options are RUNNING, COMPLETED, and FAILED.

      runningJobs

      Jobs that are being executed.

      successedJobs

      Jobs that have been executed.

      failedJobs

      Jobs that fail to be executed.

  • JDBC Server commands: Obtain the number of connections, the number of running SQL statements, as well as information about all sessions and all SQL statements.
    • Command:
      curl http://192.168.195.232:8088/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/sqlserver?mode=monitoring --insecure

      192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8088 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.

    • Command output:
      {
        "sessionNum" : 1,
        "runningSqlNum" : 0,
        "sessions" : [ {
          "user" : "spark",
          "ip" : "192.168.169.84",
          "sessionId" : "9dfec575-48b4-4187-876a-71711d3d7a97",
          "startTime" : "2016/10/29 15:21:10",
          "finishTime" : "",
          "duration" : "1 minute 50 seconds",
          "totalExecute" : 1
        } ],
        "sqls" : [ {
          "user" : "spark",
          "jobId" : [ ],
          "groupId" : "e49ff81a-230f-4892-a209-a48abea2d969",
          "startTime" : "2016/10/29 15:21:13",
          "finishTime" : "2016/10/29 15:21:14",
          "duration" : "555 ms",
          "statement" : "show tables",
          "state" : "FINISHED",
          "detail" : "== Parsed Logical Plan ==\nShowTablesCommand None\n\n== Analyzed Logical Plan ==\ntableName: string, isTemporary: boolean\nShowTablesCommand None\n\n== Cached Logical Plan ==\nShowTablesCommand None\n\n== Optimized Logical Plan ==\nShowTablesCommand None\n\n== Physical Plan ==\nExecutedCommand ShowTablesCommand None\n\nCode Generation: true"
        } ]
      }
    • Result analysis:
      After running this command, you can query the number of sessions in the current JDBC application, number of being-executed SQL statements, and information about all sessions and SQL statements. The information about each session is listed in Table 4, and the information about each SQL statement is listed in Table 5.
      Table 4 Session parameter description

      Parameter

      Description

      user

      User connected to the session.

      ip

      IP address of the node where the session resides.

      sessionId

      Session ID.

      startTime

      Time when the session starts the connection.

      finishTime

      Time when the session ends the connection.

      duration

      Session connection duration.

      totalExecute

      Number of SQL statements executed by the session.

      Table 5 SQL parameter description

      Parameter

      Description

      user

      User who executes the SQL statement.

      jobId

      IDs of jobs contained in the SQL statement.

      groupId

      ID of the group where the SQL statement resides.

      startTime

      SQL start time.

      finishTime

      SQL end time.

      duration

      SQL statement execution duration.

      statement

      SQL statement.

      detail

      Logical/Physical plan.

  • Streaming commands: Obtain the average input frequency, average scheduling delay, average execution duration, and average total delay.
    • Command:
      curl http://192.168.195.232:8088/proxy/application_1477722033672_0008/api/v1/applications/application_1477722033672_0008/streaming/statistics?mode=monitoring --insecure

      192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8088 indicates the port number of the ResourceManager, and application_1477722033672_0008 indicates the application ID in YARN.

    • Command output:
      {
      "startTime" : "2018-12-25T08:58:10.836GMT",  
      "batchDuration" : 1000,  
      "numReceivers" : 1,  
      "numActiveReceivers" : 1,  
      "numInactiveReceivers" : 0,  
      "numTotalCompletedBatches" : 373,  
      "numRetainedCompletedBatches" : 373,  
      "numActiveBatches" : 0,  
      "numProcessedRecords" : 1,  
      "numReceivedRecords" : 1,  
      "avgInputRate" : 0.002680965147453083,  
      "avgSchedulingDelay" : 14,  
      "avgProcessingTime" : 47,  
      "avgTotalDelay" : 62
      }
    • Result analysis:

      Once you execute this command, you can retrieve the average input frequency (measured in events per second), average scheduling delay (measured in milliseconds), average execution time (measured in milliseconds), and average total delay (measured in milliseconds) of the current Streaming application.