Spark REST APIs
Function
Spark REST APIs expose some web UI metrics in JSON format, giving users a simpler way to build new display and monitoring tools and to query information about running and completed applications. The open-source Spark REST API allows users to query information about Jobs, Stages, Storage, Environment, and Executors. The FusionInsight version adds REST APIs for querying SQL, JDBC server, and Streaming information. For more information about the open-source REST API, see https://spark.apache.org/docs/3.1.1/monitoring.html#rest-api.
Preparing a Running Environment
Install a client on the node, for example, in the /opt/client directory.
REST APIs
Run the following commands to skip the REST API filter and obtain application information.
In normal clusters, JobHistory supports only the HTTP protocol. Therefore, ensure that the URL in each command uses HTTP.
- Obtaining information about all applications on the JobHistory node:
- Command:
curl http://192.168.227.16:4040/api/v1/applications?mode=monitoring --insecure
192.168.227.16 indicates the service IP address of the JobHistory node and 4040 indicates the port number of the JobHistory node.
- Command output:
[ { "id" : "application_1517290848707_0008", "name" : "Spark Pi", "attempts" : [ { "startTime" : "2018-01-30T15:05:37.433CST", "endTime" : "2018-01-30T15:06:04.625CST", "lastUpdated" : "2018-01-30T15:06:04.848CST", "duration" : 27192, "sparkUser" : "sparkuser", "completed" : true, "startTimeEpoch" : 1517295937433, "endTimeEpoch" : 1517295964625, "lastUpdatedEpoch" : 1517295964848 } ] }, { "id" : "application_1517290848707_0145", "name" : "Spark shell", "attempts" : [ { "startTime" : "2018-01-31T15:20:31.286CST", "endTime" : "1970-01-01T07:59:59.999CST", "lastUpdated" : "2018-01-31T15:20:47.086CST", "duration" : 0, "sparkUser" : "admintest", "completed" : false, "startTimeEpoch" : 1517383231286, "endTimeEpoch" : -1, "lastUpdatedEpoch" : 1517383247086 } ] }]
- Result analysis:
With this command, you can query information about all Spark applications in the current cluster, including both running and completed applications. Table 1 provides information about each application.
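As a rough illustration of how this output might be consumed, the following Python sketch separates running applications from completed ones by using the completed flag of each attempt. The helpers list_applications_url and split_by_state are hypothetical names introduced here, not part of Spark or FusionInsight, and the sample data is trimmed from the command output above.

```python
import json


def list_applications_url(host, port):
    """Build the JobHistory URL used by the curl command (hypothetical helper)."""
    return f"http://{host}:{port}/api/v1/applications?mode=monitoring"


def split_by_state(apps):
    """Return (running_ids, completed_ids).

    Assumption: an application counts as running if any of its attempts
    reports "completed": false.
    """
    running, completed = [], []
    for app in apps:
        attempts = app.get("attempts", [])
        is_running = any(not a.get("completed", False) for a in attempts)
        (running if is_running else completed).append(app["id"])
    return running, completed


# Trimmed from the command output above.
SAMPLE = json.loads("""
[
  {"id": "application_1517290848707_0008", "name": "Spark Pi",
   "attempts": [{"duration": 27192, "sparkUser": "sparkuser", "completed": true}]},
  {"id": "application_1517290848707_0145", "name": "Spark shell",
   "attempts": [{"duration": 0, "sparkUser": "admintest", "completed": false}]}
]
""")

running, completed = split_by_state(SAMPLE)
print("running:", running)
print("completed:", completed)
```

In a real script, the JSON returned by the curl command (or by any HTTP client pointed at the URL from list_applications_url) would take the place of SAMPLE.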
- Obtaining information about a specific application on the JobHistory node:
- Command:
curl http://192.168.227.16:4040/api/v1/applications/application_1517290848707_0008?mode=monitoring --insecure
192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1517290848707_0008 indicates the application ID.
- Command output:
{ "id" : "application_1517290848707_0008", "name" : "Spark Pi", "attempts" : [ { "startTime" : "2018-01-30T15:05:37.433CST", "endTime" : "2018-01-30T15:06:04.625CST", "lastUpdated" : "2018-01-30T15:06:04.848CST", "duration" : 27192, "sparkUser" : "sparkuser", "completed" : true, "startTimeEpoch" : 1517295937433, "endTimeEpoch" : 1517295964625, "lastUpdatedEpoch" : 1517295964848 } ] }
- Result analysis:
With this command, you can query information about a Spark application. Table 1 provides information about the application.
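In the sample output, the epoch fields are in milliseconds and the duration field equals endTimeEpoch minus startTimeEpoch. The short Python sketch below checks this relationship; attempt_duration_ms is a hypothetical helper, and treating a negative endTimeEpoch as "still running" is an assumption based on the -1 value shown for the running application in the application list output.

```python
def attempt_duration_ms(attempt):
    """Return end minus start in milliseconds, or None if the attempt has not ended.

    Assumption: a negative endTimeEpoch (-1 in the sample output of a
    running application) means the attempt is still in progress.
    """
    if attempt["endTimeEpoch"] < 0:
        return None
    return attempt["endTimeEpoch"] - attempt["startTimeEpoch"]


# Values copied from the sample outputs in this section.
finished = {"startTimeEpoch": 1517295937433, "endTimeEpoch": 1517295964625, "duration": 27192}
in_progress = {"startTimeEpoch": 1517383231286, "endTimeEpoch": -1, "duration": 0}

print(attempt_duration_ms(finished))     # 27192, same as the reported duration
print(attempt_duration_ms(in_progress))  # None
```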
- Obtaining information about the executor of a running application:
- Command for alive executors:
curl http://192.168.169.84:8088/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/executors?mode=monitoring --insecure
- Command for all alive and dead executors:
curl http://192.168.169.84:8088/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/allexecutors?mode=monitoring --insecure
192.168.169.84 indicates the service IP address of the master node of the ResourceManager, 8088 indicates the port number of the ResourceManager, and application_1478570725074_0046 indicates the application ID in YARN.
- Command output:
[{ "id" : "driver", "hostPort" : "192.168.169.84:23886", "isActive" : true, "rddBlocks" : 0, "memoryUsed" : 0, "diskUsed" : 0, "activeTasks" : 0, "failedTasks" : 0, "completedTasks" : 0, "totalTasks" : 0, "totalDuration" : 0, "totalInputBytes" : 0, "totalShuffleRead" : 0, "totalShuffleWrite" : 0, "maxMemory" : 278019440, "executorLogs" : { } }, { "id" : "1", "hostPort" : "192.168.169.84:23902", "isActive" : true, "rddBlocks" : 0, "memoryUsed" : 0, "diskUsed" : 0, "totalCores" : 1, "maxTasks" : 1, "activeTasks" : 0, "failedTasks" : 0, "completedTasks" : 0, "totalTasks" : 0, "totalDuration" : 0, "totalGCTime" : 139, "totalInputBytes" : 0, "totalShuffleRead" : 0, "totalShuffleWrite" : 0, "maxMemory" : 555755765, "executorLogs" : { "stdout" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stdout?start=-4096", "stderr" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stderr?start=-4096" } } ]
- Result analysis:
With this command, you can query information about all executors of the current application, including the driver. Table 2 provides basic information about each executor.
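The two executor commands differ only in the final path segment, and the output is a flat list with one entry per executor plus the driver. A minimal Python sketch of both points follows; executors_url and summarize_executors are hypothetical helpers, and the sample is trimmed from the command output above.

```python
import json


def executors_url(rm_host, rm_port, app_id, include_dead=False):
    """Build the ResourceManager proxy URL used by the curl commands."""
    endpoint = "allexecutors" if include_dead else "executors"
    return (f"http://{rm_host}:{rm_port}/proxy/{app_id}"
            f"/api/v1/applications/{app_id}/{endpoint}?mode=monitoring")


def summarize_executors(entries):
    """Aggregate a few fields over all entries except the driver."""
    workers = [e for e in entries if e["id"] != "driver"]
    return {
        "executors": len(workers),
        "active": sum(1 for e in workers if e["isActive"]),
        # The driver entry in the sample has no totalCores field, so use .get().
        "total_cores": sum(e.get("totalCores", 0) for e in workers),
        "failed_tasks": sum(e["failedTasks"] for e in workers),
    }


# Trimmed from the command output above.
SAMPLE = json.loads("""
[
  {"id": "driver", "hostPort": "192.168.169.84:23886", "isActive": true,
   "failedTasks": 0, "maxMemory": 278019440},
  {"id": "1", "hostPort": "192.168.169.84:23902", "isActive": true,
   "totalCores": 1, "failedTasks": 0, "maxMemory": 555755765}
]
""")

summary = summarize_executors(SAMPLE)
print(summary)
```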
Enhanced REST APIs
- SQL commands: Get all SQL statements and the one with the longest execution time.
- Spark UI command:
curl http://192.168.195.232:8088/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL?mode=monitoring --insecure
192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8088 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.
- JobHistory command:
curl http://192.168.227.16:4040/api/v1/applications/application_1478570725074_0004/SQL?mode=monitoring --insecure
192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1478570725074_0004 indicates the application ID.
- Command output:
The query results of the Spark UI and JobHistory commands are as follows:
{ "longestDurationOfCompletedSQL" : [ { "id" : 0, "status" : "COMPLETED", "description" : "getCallSite at SQLExecution.scala:48", "submissionTime" : "2016/11/08 15:39:00", "duration" : "2 s", "runningJobs" : [ ], "successedJobs" : [ 0 ], "failedJobs" : [ ] } ], "sqls" : [ { "id" : 0, "status" : "COMPLETED", "description" : "getCallSite at SQLExecution.scala:48", "submissionTime" : "2016/11/08 15:39:00", "duration" : "2 s", "runningJobs" : [ ], "successedJobs" : [ 0 ], "failedJobs" : [ ] }] }
- Result analysis:
After running this command, you can obtain all the SQL statements executed by the current application (the sqls part of the command output) and the SQL statements with the longest execution time (the longestDurationOfCompletedSQL part of the command output). The information about each SQL statement is listed in Table 3.
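To show how the two parts of this output relate, the Python sketch below counts the statements in sqls by status, lists the statements that have failed jobs, and reads the IDs in longestDurationOfCompletedSQL. The helper name sql_overview is hypothetical, and the sample is copied from the command output above.

```python
import json
from collections import Counter


def sql_overview(payload):
    """Summarize the SQL endpoint output (hypothetical helper)."""
    sqls = payload["sqls"]
    return {
        "by_status": dict(Counter(s["status"] for s in sqls)),
        "with_failed_jobs": [s["id"] for s in sqls if s["failedJobs"]],
        "longest_completed_ids": [s["id"] for s in payload["longestDurationOfCompletedSQL"]],
    }


# Copied from the command output above.
SAMPLE = json.loads("""
{
  "longestDurationOfCompletedSQL": [
    {"id": 0, "status": "COMPLETED", "description": "getCallSite at SQLExecution.scala:48",
     "submissionTime": "2016/11/08 15:39:00", "duration": "2 s",
     "runningJobs": [], "successedJobs": [0], "failedJobs": []}
  ],
  "sqls": [
    {"id": 0, "status": "COMPLETED", "description": "getCallSite at SQLExecution.scala:48",
     "submissionTime": "2016/11/08 15:39:00", "duration": "2 s",
     "runningJobs": [], "successedJobs": [0], "failedJobs": []}
  ]
}
""")

overview = sql_overview(SAMPLE)
print(overview)
```

Note that duration is returned as a display string ("2 s"), so the sketch does not try to compare durations numerically.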
- JDBC Server commands: Obtain the number of connections, the number of running SQL statements, as well as information about all sessions and all SQL statements.
- Command:
curl http://192.168.195.232:8088/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/sqlserver?mode=monitoring --insecure
192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8088 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.
- Command output:
{ "sessionNum" : 1, "runningSqlNum" : 0, "sessions" : [ { "user" : "spark", "ip" : "192.168.169.84", "sessionId" : "9dfec575-48b4-4187-876a-71711d3d7a97", "startTime" : "2016/10/29 15:21:10", "finishTime" : "", "duration" : "1 minute 50 seconds", "totalExecute" : 1 } ], "sqls" : [ { "user" : "spark", "jobId" : [ ], "groupId" : "e49ff81a-230f-4892-a209-a48abea2d969", "startTime" : "2016/10/29 15:21:13", "finishTime" : "2016/10/29 15:21:14", "duration" : "555 ms", "statement" : "show tables", "state" : "FINISHED", "detail" : "== Parsed Logical Plan ==\nShowTablesCommand None\n\n== Analyzed Logical Plan ==\ntableName: string, isTemporary: boolean\nShowTablesCommand None\n\n== Cached Logical Plan ==\nShowTablesCommand None\n\n== Optimized Logical Plan ==\nShowTablesCommand None\n\n== Physical Plan ==\nExecutedCommand ShowTablesCommand None\n\nCode Generation: true" } ] }
- Result analysis:
After running this command, you can query the number of sessions in the current JDBC application, the number of SQL statements being executed, and information about all sessions and SQL statements. The information about each session is listed in Table 4, and the information about each SQL statement is listed in Table 5.
Table 4 Session parameter description

| Parameter | Description |
| --- | --- |
| user | User connected to the session. |
| ip | IP address of the node where the session resides. |
| sessionId | Session ID. |
| startTime | Time when the session starts the connection. |
| finishTime | Time when the session ends the connection. |
| duration | Session connection duration. |
| totalExecute | Number of SQL statements executed by the session. |
Table 5 SQL parameter description

| Parameter | Description |
| --- | --- |
| user | User who executes the SQL statement. |
| jobId | IDs of jobs contained in the SQL statement. |
| groupId | ID of the group where the SQL statement resides. |
| startTime | SQL start time. |
| finishTime | SQL end time. |
| duration | SQL statement execution duration. |
| statement | SQL statement. |
| detail | Logical/Physical plan. |
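The session and SQL fields described in Table 4 and Table 5 can be combined into a quick overview of a JDBC server. The Python sketch below is one possible way to do so; jdbc_overview is a hypothetical helper, and it assumes that an empty finishTime marks a session that is still connected and that "FINISHED" (the only state value shown in the sample output) marks a completed statement.

```python
def jdbc_overview(payload):
    """Summarize the sqlserver endpoint output (hypothetical helper)."""
    return {
        "session_num": payload["sessionNum"],
        "running_sql_num": payload["runningSqlNum"],
        # Assumption: an empty finishTime means the session is still open.
        "open_sessions": [s["sessionId"] for s in payload["sessions"]
                          if not s["finishTime"]],
        # Assumption: any state other than "FINISHED" is not yet complete.
        "unfinished_statements": [q["statement"] for q in payload["sqls"]
                                  if q["state"] != "FINISHED"],
    }


# Trimmed from the command output above (the long "detail" plan is omitted).
SAMPLE = {
    "sessionNum": 1,
    "runningSqlNum": 0,
    "sessions": [
        {"user": "spark", "ip": "192.168.169.84",
         "sessionId": "9dfec575-48b4-4187-876a-71711d3d7a97",
         "startTime": "2016/10/29 15:21:10", "finishTime": "",
         "duration": "1 minute 50 seconds", "totalExecute": 1},
    ],
    "sqls": [
        {"user": "spark", "jobId": [],
         "groupId": "e49ff81a-230f-4892-a209-a48abea2d969",
         "startTime": "2016/10/29 15:21:13", "finishTime": "2016/10/29 15:21:14",
         "duration": "555 ms", "statement": "show tables", "state": "FINISHED"},
    ],
}

overview = jdbc_overview(SAMPLE)
print(overview)
```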
- Streaming commands: Obtain the average input frequency, average scheduling delay, average execution duration, and average total delay.
- Command:
curl http://192.168.195.232:8088/proxy/application_1477722033672_0008/api/v1/applications/application_1477722033672_0008/streaming/statistics?mode=monitoring --insecure
192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8088 indicates the port number of the ResourceManager, and application_1477722033672_0008 indicates the application ID in YARN.
- Command output:
{ "startTime" : "2018-12-25T08:58:10.836GMT", "batchDuration" : 1000, "numReceivers" : 1, "numActiveReceivers" : 1, "numInactiveReceivers" : 0, "numTotalCompletedBatches" : 373, "numRetainedCompletedBatches" : 373, "numActiveBatches" : 0, "numProcessedRecords" : 1, "numReceivedRecords" : 1, "avgInputRate" : 0.002680965147453083, "avgSchedulingDelay" : 14, "avgProcessingTime" : 47, "avgTotalDelay" : 62 }
- Result analysis:
Once you execute this command, you can retrieve the average input frequency (measured in events per second), average scheduling delay (measured in milliseconds), average execution time (measured in milliseconds), and average total delay (measured in milliseconds) of the current Streaming application.
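One common use of these averages is to check whether a Streaming application keeps up with its input: if the average processing time stays below the batch interval, batches do not pile up. The Python sketch below applies that check to the sample output; streaming_health is a hypothetical helper, and it assumes that batchDuration is expressed in milliseconds like the delay fields.

```python
def streaming_health(stats):
    """Derive a few health indicators from the statistics output.

    Assumption: batchDuration and avgProcessingTime are both in milliseconds.
    """
    batch_ms = stats["batchDuration"]
    return {
        # True when an average batch is processed within one batch interval.
        "keeping_up": stats["avgProcessingTime"] < batch_ms,
        # Fraction of each batch interval spent on processing.
        "processing_ratio": stats["avgProcessingTime"] / batch_ms,
        # Batches currently queued or being processed.
        "active_batches": stats["numActiveBatches"],
    }


# Trimmed from the command output above.
SAMPLE = {
    "batchDuration": 1000,
    "numActiveBatches": 0,
    "avgInputRate": 0.002680965147453083,
    "avgSchedulingDelay": 14,
    "avgProcessingTime": 47,
    "avgTotalDelay": 62,
}

health = streaming_health(SAMPLE)
print(health)
```

For the sample, processing takes about 5% of each one-second batch interval, so the application has ample headroom.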