Updated on 2024-08-10 GMT+08:00

Spark REST APIs

Function

Spark REST APIs expose some of the web UI metrics in JSON format. This makes it easier to build custom display and monitoring tools and lets users query information about both running and completed applications. The open-source Spark REST API supports queries on Jobs, Stages, Storage, Environment, and Executors. The FusionInsight version adds REST APIs for querying SQL, JDBC/ODBC server, and Streaming information. For more information about the open-source REST API, see https://spark.apache.org/docs/3.1.1/monitoring.html#rest-api.

Preparing a Running Environment

Install the FusionInsight client on the node, for example, in the /opt/client directory.

REST API

You can run the commands in this section to skip the REST API filter and obtain application information.

For clusters with the security mode enabled, JobHistory only works with the HTTPS protocol. Therefore, when using the command, ensure that the URL includes the HTTPS protocol.
For clusters with the security mode enabled, you need to set spark.ui.customErrorPage to false and restart Spark2x. Modify the parameter value for JobHistory2x, JDBCServer2x, and SparkResource2x instances.

Accessing the Spark2x JobHistory over HTTPS differs from HTTP-based access. Because SSL encryption is used, the SSL protocol supported by the curl command must also be supported by the cluster. If it is not, use either of the following methods:

- Modify the SSL protocol configured for the cluster. For example, if the curl command supports only the TLSv1 protocol (TLSv1 has security vulnerabilities and must be used with caution), perform the following steps:
  1. Log in to FusionInsight Manager, click Cluster, click the name of the desired cluster, choose Service > Spark2x, and choose Configurations > All Configurations.
  2. Search for ssl in the search box. Check whether the value of spark.ssl.historyServer.protocol for JobHistory contains TLSv1. If it does not, add TLSv1 to the value.
  3. Clear the value of the spark.ssl.historyServer.enabledAlgorithms parameter for JobHistory.
  4. Click Save Configuration and then OK. Restart the Spark2x service or the JobHistory instance.
- Upgrade the curl version on the node:
  1. Download the curl installation package from http://curl.haxx.se/download/.
  2. Decompress the installation package.
     tar -xzvf curl-x.x.x.tar.gz
  3. Overwrite the old curl version with the new one.
     cd curl-x.x.x
     ./configure
     make
     make install
  4. Update the dynamic link library of curl.
     ldconfig
  5. Log in to the node again after the installation is complete and run the following command to verify that the curl version has been updated:
     curl --version

Obtaining information about all JobHistory applications

Command:
```
curl -k -i --negotiate -u: "https://192.168.227.16:4040/api/v1/applications"
```
192.168.227.16 indicates the service IP address of the JobHistory node and 4040 indicates the port number of the JobHistory node.

Command output:

[ {
  "id" : "application_1517290848707_0008",
  "name" : "Spark Pi",
  "attempts" : [ {
    "startTime" : "2018-01-30T15:05:37.433CST",
    "endTime" : "2018-01-30T15:06:04.625CST",
    "lastUpdated" : "2018-01-30T15:06:04.848CST",
    "duration" : 27192,
    "sparkUser" : "sparkuser",
    "completed" : true,
    "startTimeEpoch" : 1517295937433,
    "endTimeEpoch" : 1517295964625,
    "lastUpdatedEpoch" : 1517295964848
  } ]
}, {
  "id" : "application_1517290848707_0145",
  "name" : "Spark shell",
  "attempts" : [ {
    "startTime" : "2018-01-31T15:20:31.286CST",
    "endTime" : "1970-01-01T07:59:59.999CST",
    "lastUpdated" : "2018-01-31T15:20:47.086CST",
    "duration" : 0,
    "sparkUser" : "admintest",
    "completed" : false,
    "startTimeEpoch" : 1517383231286,
    "endTimeEpoch" : -1,
    "lastUpdatedEpoch" : 1517383247086
  } ]
}]

Result analysis:

With this command, you can query information about all Spark applications in the current cluster, both running and completed. Table 1 describes the information returned for each application.

**Table 1** Parameter description
Parameter	Description
id	Application ID.
name	Application name.
attempts	Detailed information about the application, including its start and end times, the user who launched it, and whether it has completed.
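
A monitoring tool might consume this response as sketched below. This is an illustration only, not part of the product: the sample string is a trimmed copy of the output above, and `split_by_completion` is a hypothetical helper name. The sketch assumes that an application is finished only when every listed attempt reports `completed` as true.

```python
import json

# Trimmed copy of the sample response shown above (only the fields used here).
SAMPLE = """
[ {
  "id" : "application_1517290848707_0008",
  "name" : "Spark Pi",
  "attempts" : [ { "duration" : 27192, "sparkUser" : "sparkuser",
                   "completed" : true, "endTimeEpoch" : 1517295964625 } ]
}, {
  "id" : "application_1517290848707_0145",
  "name" : "Spark shell",
  "attempts" : [ { "duration" : 0, "sparkUser" : "admintest",
                   "completed" : false, "endTimeEpoch" : -1 } ]
} ]
"""


def split_by_completion(apps):
    """Return (completed_ids, running_ids).

    Assumption: an application counts as completed only if it has at least
    one attempt and every attempt has "completed" set to true.
    """
    completed, running = [], []
    for app in apps:
        attempts = app.get("attempts", [])
        done = bool(attempts) and all(a.get("completed", False) for a in attempts)
        (completed if done else running).append(app["id"])
    return completed, running


completed_ids, running_ids = split_by_completion(json.loads(SAMPLE))
print("completed:", completed_ids)
print("running:", running_ids)
```

In the sample, the running application also reports `endTimeEpoch` as -1 and `duration` as 0, so a tool should not treat those two fields as real timestamps or durations until `completed` is true.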

Obtaining information about a specific JobHistory application

Command:
```
curl -k -i --negotiate -u: "https://192.168.227.16:4040/api/v1/applications/application_1517290848707_0008"
```
192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1517290848707_0008 indicates the application ID.

Command output:

{
  "id" : "application_1517290848707_0008",
  "name" : "Spark Pi",
  "attempts" : [ {
    "startTime" : "2018-01-30T15:05:37.433CST",
    "endTime" : "2018-01-30T15:06:04.625CST",
    "lastUpdated" : "2018-01-30T15:06:04.848CST",
    "duration" : 27192,
    "sparkUser" : "sparkuser",
    "completed" : true,
    "startTimeEpoch" : 1517295937433,
    "endTimeEpoch" : 1517295964625,
    "lastUpdatedEpoch" : 1517295964848
  } ]
}

Result analysis:
With this command, you can query information about a Spark application. Table 1 provides information about the application.
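
When these commands are called from a script, the `-i` option is usually unwanted because it writes the HTTP response headers to the output ahead of the JSON body. The following sketch wraps curl without `-i`. The function names are hypothetical, `-s` is added only to hide the progress meter, and the sketch assumes that curl is on the PATH and that the caller already holds a valid Kerberos ticket for `--negotiate` to use.

```python
import json
import subprocess


def curl_argv(url):
    """Build the curl argument list used in this section, without -i."""
    # -k, --negotiate and -u: are taken from the commands above.
    # -s (silent) is an addition that suppresses the progress meter.
    return ["curl", "-k", "-s", "--negotiate", "-u:", url]


def fetch_json(url):
    """Run curl and parse the response body as JSON.

    Not called in this sketch because it needs a reachable cluster.
    """
    result = subprocess.run(
        curl_argv(url), check=True, capture_output=True, text=True
    )
    return json.loads(result.stdout)


# Show the command line that fetch_json() would execute.
print(" ".join(curl_argv("https://192.168.227.16:4040/api/v1/applications")))
```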

Obtaining information about the executors of a running application

Command for alive executors:

curl -k -i --negotiate -u: "https://192.168.169.84:8090/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/executors"

Command for all alive and dead executors:
```
curl -k -i --negotiate -u: "https://192.168.169.84:8090/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/allexecutors"
```
192.168.169.84 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1478570725074_0046 indicates the application ID in YARN.

Command output:

[{
  "id" : "driver",
  "hostPort" : "192.168.169.84:23886",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "maxMemory" : 278019440,
  "executorLogs" : { }
}, {
  "id" : "1",
  "hostPort" : "192.168.169.84:23902",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 1,
  "maxTasks" : 1,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalGCTime" : 139,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "maxMemory" : 555755765,
  "executorLogs" : {
    "stdout" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stdout?start=-4096",
    "stderr" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stderr?start=-4096"
  }
} ]

Result analysis:

With this command, you can query information about all executors of the current application, including the driver. Table 2 provides basic information about each executor.

**Table 2** Executor parameter description
Parameter	Description
id	Executor ID.
hostPort	IP address and port number of the node where the executor resides.
executorLogs	Path for viewing executor logs.
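
The executor list can be reduced to a few capacity figures, as in the sketch below. It is an illustration under stated assumptions: the sample is a trimmed copy of the output above, `summarize_executors` is a hypothetical helper, and the entry whose `id` is `driver` is excluded from the totals because it does not report `totalCores` in the sample.

```python
import json

# Trimmed copy of the sample response shown above.
SAMPLE = """
[ {
  "id" : "driver",
  "hostPort" : "192.168.169.84:23886",
  "isActive" : true,
  "failedTasks" : 0,
  "maxMemory" : 278019440,
  "executorLogs" : { }
}, {
  "id" : "1",
  "hostPort" : "192.168.169.84:23902",
  "isActive" : true,
  "totalCores" : 1,
  "failedTasks" : 0,
  "maxMemory" : 555755765,
  "executorLogs" : {
    "stdout" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stdout?start=-4096"
  }
} ]
"""


def summarize_executors(executors):
    """Aggregate active, non-driver executors."""
    workers = [e for e in executors if e["id"] != "driver"]
    active = [e for e in workers if e.get("isActive")]
    return {
        "active_executors": len(active),
        "total_cores": sum(e.get("totalCores", 0) for e in active),
        "max_memory": sum(e.get("maxMemory", 0) for e in active),
        # hostPort is "host:port"; keep only the host part.
        "hosts": sorted({e["hostPort"].rpartition(":")[0] for e in active}),
    }


summary = summarize_executors(json.loads(SAMPLE))
print(summary)
```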

Enhanced REST API

SQL commands: Obtain all SQL statements and the SQL statement with the longest execution time.

Spark UI command:

curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL"

192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.

You can append query parameters to the URL in the command to filter the SQL statements that are returned.

You can run the following command to view 100 SQL statements:

curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL?limit=100"

You can run the following command to view the SQL statements that have not completed:

curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL?completed=false"
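
The Spark UI URLs above follow one pattern: the YARN application ID appears twice, once in the ResourceManager proxy path and once in the REST path. The sketch below builds such URLs. `proxy_sql_url` is a hypothetical helper, and only the `limit` and `completed` parameters shown above are used.

```python
from urllib.parse import urlencode


def proxy_sql_url(rm_host, rm_port, app_id, **params):
    """Build the SQL endpoint URL behind the ResourceManager proxy."""
    base = (
        f"https://{rm_host}:{rm_port}/proxy/{app_id}"
        f"/api/v1/applications/{app_id}/SQL"
    )
    # Append a query string only when parameters are given.
    return f"{base}?{urlencode(params)}" if params else base


APP_ID = "application_1476947670799_0053"
print(proxy_sql_url("192.168.195.232", 8090, APP_ID))
print(proxy_sql_url("192.168.195.232", 8090, APP_ID, limit=100))
# Pass "false" as a string: urlencode would render Python's False as "False".
print(proxy_sql_url("192.168.195.232", 8090, APP_ID, completed="false"))
```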

JobHistory command:
```
curl -k -i --negotiate -u: "https://192.168.227.16:4040/api/v1/applications/application_1478570725074_0004/SQL"
```
192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1478570725074_0004 indicates the application ID.

Command output:

The query results of the Spark UI and JobHistory commands are as follows:

{
  "longestDurationOfCompletedSQL" : [ {
    "id" : 0,
    "status" : "COMPLETED",
    "description" : "getCallSite at SQLExecution.scala:48",
    "submissionTime" : "2016/11/08 15:39:00",
    "duration" : "2 s",
    "runningJobs" : [ ],
    "successedJobs" : [ 0 ],
    "failedJobs" : [ ]
  } ],
  "sqls" : [ {
    "id" : 0,
    "status" : "COMPLETED",
    "description" : "getCallSite at SQLExecution.scala:48",
    "submissionTime" : "2016/11/08 15:39:00",
    "duration" : "2 s",
    "runningJobs" : [ ],
    "successedJobs" : [ 0 ],
    "failedJobs" : [ ]
  }]
}

Result analysis:

After running this command, you can obtain all SQL statements executed by the current application (the sqls part of the command output) and the completed SQL statement with the longest execution time (the longestDurationOfCompletedSQL part of the command output). The information about each SQL statement is listed in Table 3.

**Table 3** SQL parameter description
Parameter	Description
id	SQL statement ID.
status	Execution status of an SQL statement. The options are RUNNING, COMPLETED, and FAILED.
runningJobs	Jobs that are being executed.
successedJobs	Jobs that have been executed successfully.
failedJobs	Jobs that failed to be executed.
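
The following sketch shows one way to group the returned SQL statements by status and to pick out the statements that have failed jobs. The sample is copied from the output above, and the helper names `ids_by_status` and `ids_with_failed_jobs` are hypothetical.

```python
import json

# Copy of the sample response shown above.
SAMPLE = """
{
  "longestDurationOfCompletedSQL" : [ {
    "id" : 0, "status" : "COMPLETED",
    "description" : "getCallSite at SQLExecution.scala:48",
    "submissionTime" : "2016/11/08 15:39:00", "duration" : "2 s",
    "runningJobs" : [ ], "successedJobs" : [ 0 ], "failedJobs" : [ ]
  } ],
  "sqls" : [ {
    "id" : 0, "status" : "COMPLETED",
    "description" : "getCallSite at SQLExecution.scala:48",
    "submissionTime" : "2016/11/08 15:39:00", "duration" : "2 s",
    "runningJobs" : [ ], "successedJobs" : [ 0 ], "failedJobs" : [ ]
  } ]
}
"""


def ids_by_status(response):
    """Map each status (RUNNING, COMPLETED, FAILED) to the SQL statement IDs."""
    grouped = {}
    for sql in response["sqls"]:
        grouped.setdefault(sql["status"], []).append(sql["id"])
    return grouped


def ids_with_failed_jobs(response):
    """Return the IDs of SQL statements whose failedJobs list is not empty."""
    return [sql["id"] for sql in response["sqls"] if sql["failedJobs"]]


data = json.loads(SAMPLE)
print(ids_by_status(data))
print(ids_with_failed_jobs(data))
print([sql["id"] for sql in data["longestDurationOfCompletedSQL"]])
```

The `duration` field is a display string such as "2 s" rather than a number, so this sketch does not attempt to parse or compare it.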

JDBC Server commands: Obtain the number of connections, the number of running SQL statements, as well as information about all sessions and all SQL statements.

Command:
```
curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/sqlserver"
```
192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.

Command output:

{
  "sessionNum" : 1,
  "runningSqlNum" : 0,
  "sessions" : [ {
    "user" : "spark",
    "ip" : "192.168.169.84",
    "sessionId" : "9dfec575-48b4-4187-876a-71711d3d7a97",
    "startTime" : "2016/10/29 15:21:10",
    "finishTime" : "",
    "duration" : "1 minute 50 seconds",
    "totalExecute" : 1
  } ],
  "sqls" : [ {
    "user" : "spark",
    "jobId" : [ ],
    "groupId" : "e49ff81a-230f-4892-a209-a48abea2d969",
    "startTime" : "2016/10/29 15:21:13",
    "finishTime" : "2016/10/29 15:21:14",
    "duration" : "555 ms",
    "statement" : "show tables",
    "state" : "FINISHED",
    "detail" : "== Parsed Logical Plan ==\nShowTablesCommand None\n\n== Analyzed Logical Plan ==\ntableName: string, isTemporary: boolean\nShowTablesCommand None\n\n== Cached Logical Plan ==\nShowTablesCommand None\n\n== Optimized Logical Plan ==\nShowTablesCommand None\n\n== Physical Plan ==\nExecutedCommand ShowTablesCommand None\n\nCode Generation: true"
  } ]
}

Result analysis:

After running this command, you can query the number of sessions in the current JDBC application, the number of SQL statements being executed, and information about all sessions and SQL statements. The information about each session is listed in Table 4, and the information about each SQL statement is listed in Table 5.

**Table 4** Session parameter description
Parameter	Description
user	User connected to the session.
ip	IP address of the node where the session resides.
sessionId	Session ID.
startTime	Time when the session starts the connection.
finishTime	Time when the session ends the connection.
duration	Session connection duration.
totalExecute	Number of SQL statements executed by the session.

**Table 5** SQL parameter description
Parameter	Description
user	User who executes the SQL statement.
jobId	IDs of jobs contained in the SQL statement.
groupId	ID of the group where the SQL statement resides.
startTime	SQL start time.
finishTime	SQL end time.
duration	SQL execution duration.
statement	SQL statement.
detail	Logical plan and physical plan.
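
As a sketch of how this response might be summarized, the code below lists the sessions that are still open and counts SQL statements by state. The sample is a trimmed copy of the output above with the long `detail` field left out. The helper names are hypothetical, and treating an empty `finishTime` as "session still open" is an assumption based on the sample, where the only session has an empty `finishTime` and `sessionNum` is 1.

```python
import json
from collections import Counter

# Trimmed copy of the sample response shown above ("detail" omitted).
SAMPLE = """
{
  "sessionNum" : 1,
  "runningSqlNum" : 0,
  "sessions" : [ {
    "user" : "spark",
    "ip" : "192.168.169.84",
    "sessionId" : "9dfec575-48b4-4187-876a-71711d3d7a97",
    "startTime" : "2016/10/29 15:21:10",
    "finishTime" : "",
    "duration" : "1 minute 50 seconds",
    "totalExecute" : 1
  } ],
  "sqls" : [ {
    "user" : "spark",
    "jobId" : [ ],
    "groupId" : "e49ff81a-230f-4892-a209-a48abea2d969",
    "startTime" : "2016/10/29 15:21:13",
    "finishTime" : "2016/10/29 15:21:14",
    "duration" : "555 ms",
    "statement" : "show tables",
    "state" : "FINISHED"
  } ]
}
"""


def open_session_ids(response):
    """Return session IDs with an empty finishTime (assumed to be still open)."""
    return [s["sessionId"] for s in response["sessions"] if not s["finishTime"]]


def statements_by_state(response):
    """Count SQL statements per value of the state field."""
    return dict(Counter(sql["state"] for sql in response["sqls"]))


data = json.loads(SAMPLE)
print(open_session_ids(data))
print(statements_by_state(data))
```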

Enhanced JDBC API: Cancel an SQL statement that is being executed by using the execution ID obtained from Beeline.
- Command:
```
curl -k -i --negotiate -X PUT -u: "https://192.168.195.232:8090/proxy/application_1477722033672_0008/api/v1/applications/application_1477722033672_0008/cancel/execution?executionId=8"
```
- Command output:
  The job whose execution ID is 8 is canceled.
- Notes:
  Run the SQL statement in spark-beeline. If the SQL statement generates a Spark task, its execution ID is printed in Beeline. To cancel the SQL statement, run the preceding command with that execution ID.
Streaming commands: Obtain the average input frequency, scheduling delay, execution duration, and total delay.
- Command:
```
curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1477722033672_0008/api/v1/applications/application_1477722033672_0008/streaming/statistics"
```
  192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1477722033672_0008 indicates the application ID in YARN.
- Command output:
```
{
  "startTime" : "2018-12-25T08:58:10.836GMT",
  "batchDuration" : 1000,
  "numReceivers" : 1,
  "numActiveReceivers" : 1,
  "numInactiveReceivers" : 0,
  "numTotalCompletedBatches" : 373,
  "numRetainedCompletedBatches" : 373,
  "numActiveBatches" : 0,
  "numProcessedRecords" : 1,
  "numReceivedRecords" : 1,
  "avgInputRate" : 0.002680965147453083,
  "avgSchedulingDelay" : 14,
  "avgProcessingTime" : 47,
  "avgTotalDelay" : 62
}
```
- Result analysis:
  Once you execute this command, you can retrieve the average input frequency (measured in events per second), average scheduling delay (measured in milliseconds), average execution time (measured in milliseconds), and average total delay (measured in milliseconds) of the current Streaming application.
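
A simple health check can be derived from these statistics, as sketched below. The sample is copied from the output above and `streaming_health` is a hypothetical helper. The sketch assumes that `batchDuration` is expressed in milliseconds, like the average delays, and uses the common rule of thumb that a streaming application keeps up with its input when the average processing time is shorter than the batch interval.

```python
import json

# Copy of the sample response shown above (only the fields used here).
SAMPLE = """
{
  "batchDuration" : 1000,
  "numReceivers" : 1,
  "numActiveReceivers" : 1,
  "numInactiveReceivers" : 0,
  "avgInputRate" : 0.002680965147453083,
  "avgSchedulingDelay" : 14,
  "avgProcessingTime" : 47,
  "avgTotalDelay" : 62
}
"""


def streaming_health(stats):
    """Derive a few indicators from the streaming statistics."""
    return {
        # Assumption: batchDuration and avgProcessingTime are both in ms.
        "keeping_up": stats["avgProcessingTime"] < stats["batchDuration"],
        # Share of each batch interval spent processing, on average.
        "utilization": stats["avgProcessingTime"] / stats["batchDuration"],
        "all_receivers_active": stats["numInactiveReceivers"] == 0,
    }


health = streaming_health(json.loads(SAMPLE))
print(health)
```

A persistently growing `avgSchedulingDelay` alongside `keeping_up` being false would suggest that batches are queuing faster than they are processed.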
