Spark REST APIs
Function
Spark REST APIs expose some web UI metrics in JSON format, which makes it easier to build custom display and monitoring tools and lets users query information about both running and completed applications. The open-source Spark REST API supports queries on Jobs, Stages, Storage, Environment, and Executors. The FusionInsight version adds REST APIs for querying SQL, JDBC/ODBC server, and Streaming information. For more information about the open-source REST API, see https://spark.apache.org/docs/3.1.1/monitoring.html#rest-api.
Preparing a Running Environment
Install the FusionInsight client on a node, for example, in the /opt/client directory.
REST API
You can run the following commands to skip the REST API filter and obtain application information.
- For clusters with the security mode enabled, JobHistory only works with the HTTPS protocol. Therefore, when using the command, ensure that the URL includes the HTTPS protocol.
- For clusters with the security mode enabled, set spark.ui.customErrorPage to false for the JobHistory2x, JDBCServer2x, and SparkResource2x instances, and then restart Spark2x.
Accessing JobHistory of Spark2x over HTTPS differs from HTTP-based access. Because SSL encryption is used, the SSL protocol supported by the curl command must also be supported by the cluster. If the cluster does not support the SSL protocol used by curl, use either of the following methods:
- Modify the SSL protocol configured for the cluster. For example, if the curl command supports only the TLSv1 protocol (TLSv1 has security vulnerabilities and must be used with caution), perform the following steps:
  - Log in to FusionInsight Manager, click Cluster, click the name of the desired cluster, choose Service > Spark2x, and choose Configurations > All Configurations.
  - Search for ssl in the search box. Check whether the value of spark.ssl.historyServer.protocol for JobHistory contains TLSv1. If it does not, add TLSv1 to the value.
  - Clear the value of the spark.ssl.historyServer.enabledAlgorithms parameter for JobHistory.
  - Click Save Configuration and then OK. Restart the Spark2x service or JobHistory instance.
- Upgrade the curl version on the node by performing the following steps:
  - Download the curl installation package at http://curl.haxx.se/download/.
  - Decompress the installation package.
  - Overwrite the old curl version with the new one.
    ./configure
    make
    make install
  - Update the dynamic link library of curl.
  - Log in to the node again after the installation is complete and verify that the curl version has been updated.
- Obtaining information about all JobHistory applications
- Command:
curl -k -i --negotiate -u: "https://192.168.227.16:4040/api/v1/applications"
192.168.227.16 indicates the service IP address of the JobHistory node and 4040 indicates the port number of the JobHistory node.
- Command output:
[ { "id" : "application_1517290848707_0008", "name" : "Spark Pi", "attempts" : [ { "startTime" : "2018-01-30T15:05:37.433CST", "endTime" : "2018-01-30T15:06:04.625CST", "lastUpdated" : "2018-01-30T15:06:04.848CST", "duration" : 27192, "sparkUser" : "sparkuser", "completed" : true, "startTimeEpoch" : 1517295937433, "endTimeEpoch" : 1517295964625, "lastUpdatedEpoch" : 1517295964848 } ] }, { "id" : "application_1517290848707_0145", "name" : "Spark shell", "attempts" : [ { "startTime" : "2018-01-31T15:20:31.286CST", "endTime" : "1970-01-01T07:59:59.999CST", "lastUpdated" : "2018-01-31T15:20:47.086CST", "duration" : 0, "sparkUser" : "admintest", "completed" : false, "startTimeEpoch" : 1517383231286, "endTimeEpoch" : -1, "lastUpdatedEpoch" : 1517383247086 } ] }]
- Result analysis:
With this command, you can query information about all Spark applications in the current cluster, both running and completed. Table 1 provides information about each application.
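As an illustration only, the following Python sketch separates running applications from completed ones in a response body like the one above. The SAMPLE string is trimmed from the command output, split_by_state is a hypothetical helper name, and treating an application as completed only when every listed attempt is completed is an assumption, not documented behavior.

```python
import json

# Response body trimmed from the command output above (curl -i also prints
# HTTP headers, which are not part of the JSON body).
SAMPLE = '''[
  {"id": "application_1517290848707_0008", "name": "Spark Pi",
   "attempts": [{"duration": 27192, "sparkUser": "sparkuser", "completed": true,
                 "startTimeEpoch": 1517295937433, "endTimeEpoch": 1517295964625}]},
  {"id": "application_1517290848707_0145", "name": "Spark shell",
   "attempts": [{"duration": 0, "sparkUser": "admintest", "completed": false,
                 "startTimeEpoch": 1517383231286, "endTimeEpoch": -1}]}
]'''


def split_by_state(apps):
    """Return (running_ids, completed_ids).

    Assumption: an application counts as completed only if it has at least
    one attempt and every attempt reports "completed": true.
    """
    running, completed = [], []
    for app in apps:
        attempts = app.get("attempts", [])
        done = bool(attempts) and all(a.get("completed", False) for a in attempts)
        (completed if done else running).append(app["id"])
    return running, completed


running, completed = split_by_state(json.loads(SAMPLE))
print("running:", running)
print("completed:", completed)
```

In the sample, the running application also reports endTimeEpoch as -1 and duration as 0, so those fields are not meaningful until the attempt finishes.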
- Obtaining information about a specific JobHistory application
- Command:
curl -k -i --negotiate -u: "https://192.168.227.16:4040/api/v1/applications/application_1517290848707_0008"
192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1517290848707_0008 indicates the application ID.
- Command output:
{ "id" : "application_1517290848707_0008", "name" : "Spark Pi", "attempts" : [ { "startTime" : "2018-01-30T15:05:37.433CST", "endTime" : "2018-01-30T15:06:04.625CST", "lastUpdated" : "2018-01-30T15:06:04.848CST", "duration" : 27192, "sparkUser" : "sparkuser", "completed" : true, "startTimeEpoch" : 1517295937433, "endTimeEpoch" : 1517295964625, "lastUpdatedEpoch" : 1517295964848 } ] }
- Result analysis:
With this command, you can query information about a Spark application. Table 1 provides information about the application.
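The timing fields in the sample output are consistent with each other, which the short Python sketch below checks. It uses only values copied from the output above; epoch_ms_to_utc is a hypothetical helper, and reading duration as milliseconds is inferred from this sample (it equals endTimeEpoch minus startTimeEpoch) rather than stated by the source.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the attempt in the command output above.
attempt = {
    "startTimeEpoch": 1517295937433,
    "endTimeEpoch": 1517295964625,
    "duration": 27192,
}


def epoch_ms_to_utc(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=ms)


elapsed_ms = attempt["endTimeEpoch"] - attempt["startTimeEpoch"]
start_utc = epoch_ms_to_utc(attempt["startTimeEpoch"])

print(elapsed_ms == attempt["duration"])
print(start_utc.isoformat())
```

The UTC start time is eight hours behind the CST string in the output (15:05:37.433), which matches a UTC+8 cluster time zone.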
- Obtaining information about the executor of a running application:
- Command for alive executors:
curl -k -i --negotiate -u: "https://192.168.169.84:8090/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/executors"
- Command for all alive and dead executors:
curl -k -i --negotiate -u: "https://192.168.169.84:8090/proxy/application_1478570725074_0046/api/v1/applications/application_1478570725074_0046/allexecutors"
192.168.169.84 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1478570725074_0046 indicates the application ID in YARN.
- Command output:
[{ "id" : "driver", "hostPort" : "192.168.169.84:23886", "isActive" : true, "rddBlocks" : 0, "memoryUsed" : 0, "diskUsed" : 0, "activeTasks" : 0, "failedTasks" : 0, "completedTasks" : 0, "totalTasks" : 0, "totalDuration" : 0, "totalInputBytes" : 0, "totalShuffleRead" : 0, "totalShuffleWrite" : 0, "maxMemory" : 278019440, "executorLogs" : { } }, { "id" : "1", "hostPort" : "192.168.169.84:23902", "isActive" : true, "rddBlocks" : 0, "memoryUsed" : 0, "diskUsed" : 0, "totalCores" : 1, "maxTasks" : 1, "activeTasks" : 0, "failedTasks" : 0, "completedTasks" : 0, "totalTasks" : 0, "totalDuration" : 0, "totalGCTime" : 139, "totalInputBytes" : 0, "totalShuffleRead" : 0, "totalShuffleWrite" : 0, "maxMemory" : 555755765, "executorLogs" : { "stdout" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stdout?start=-4096", "stderr" : "https://XTJ-224:8044/node/containerlogs/container_1478570725074_0049_01_000002/admin/stderr?start=-4096" } } ]
- Result analysis:
With this command, you can query information about all executors of the current application, including the driver. Table 2 provides basic information about each executor.
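A monitoring tool would typically aggregate this list. The Python sketch below is one possible way to do that, using a trimmed copy of the output above; summarize is a hypothetical helper, and excluding the entry whose id is "driver" from the executor counts is a design choice made for this example.

```python
import json

# Trimmed from the command output above; only the fields used below are kept.
SAMPLE = '''[
  {"id": "driver", "hostPort": "192.168.169.84:23886", "isActive": true,
   "activeTasks": 0, "failedTasks": 0, "completedTasks": 0, "maxMemory": 278019440},
  {"id": "1", "hostPort": "192.168.169.84:23902", "isActive": true, "totalCores": 1,
   "activeTasks": 0, "failedTasks": 0, "completedTasks": 0, "maxMemory": 555755765}
]'''


def summarize(executors):
    """Aggregate a few per-executor fields into one summary dict."""
    # The driver entry has no totalCores in the sample, so it is kept out of
    # the executor-only figures.
    workers = [e for e in executors if e["id"] != "driver"]
    return {
        "active_executors": sum(1 for e in workers if e["isActive"]),
        "total_cores": sum(e.get("totalCores", 0) for e in workers),
        "failed_tasks": sum(e["failedTasks"] for e in executors),
        "executor_max_memory": sum(e["maxMemory"] for e in workers),
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```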
Enhanced REST API
- SQL commands: Get all SQL statements and the one with the longest execution time.
- Spark UI command:
curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL"
192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.
You can append query parameters to the URL in the command to filter the SQL statements that are returned.
You can run the following command to view 100 SQL statements:
curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL?limit=100"
You can run the following command to view the SQL statements that have not completed:
curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/SQL?completed=false"
- JobHistory command:
curl -k -i --negotiate -u: "https://192.168.227.16:4040/api/v1/applications/application_1478570725074_0004/SQL"
192.168.227.16 indicates the service IP address of the JobHistory node, 4040 indicates the port number of the JobHistory node, and application_1478570725074_0004 indicates the application ID.
- Command output:
The query results of the Spark UI and JobHistory commands are as follows:
{ "longestDurationOfCompletedSQL" : [ { "id" : 0, "status" : "COMPLETED", "description" : "getCallSite at SQLExecution.scala:48", "submissionTime" : "2016/11/08 15:39:00", "duration" : "2 s", "runningJobs" : [ ], "successedJobs" : [ 0 ], "failedJobs" : [ ] } ], "sqls" : [ { "id" : 0, "status" : "COMPLETED", "description" : "getCallSite at SQLExecution.scala:48", "submissionTime" : "2016/11/08 15:39:00", "duration" : "2 s", "runningJobs" : [ ], "successedJobs" : [ 0 ], "failedJobs" : [ ] }] }
- Result analysis:
After running this command, you can obtain all the SQL statements executed by the current application (the sqls part of the command output) and the SQL statements with the longest execution time (the longestDurationOfCompletedSQL part of the command output). The information about each SQL statement is listed in Table 3.
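The Spark UI URLs above follow one pattern, with optional query parameters appended. The Python sketch below builds those URLs and reads a response body shaped like the sample output. sql_url is a hypothetical helper; only the limit and completed parameters shown above are used, and the request itself (Kerberos negotiation, certificate handling) is left out.

```python
import json
from urllib.parse import urlencode


def sql_url(host, port, app_id, **params):
    """Build the Spark UI (ResourceManager proxy) SQL endpoint URL."""
    base = f"https://{host}:{port}/proxy/{app_id}/api/v1/applications/{app_id}/SQL"
    return f"{base}?{urlencode(params)}" if params else base


APP = "application_1476947670799_0053"
limited = sql_url("192.168.195.232", 8090, APP, limit=100)
# Pass "false" as a string: urlencode would render the Python bool False as "False".
unfinished = sql_url("192.168.195.232", 8090, APP, completed="false")

# Response body trimmed from the command output above.
SAMPLE = '''{
  "longestDurationOfCompletedSQL": [
    {"id": 0, "status": "COMPLETED", "duration": "2 s",
     "runningJobs": [], "successedJobs": [0], "failedJobs": []}],
  "sqls": [
    {"id": 0, "status": "COMPLETED", "duration": "2 s",
     "runningJobs": [], "successedJobs": [0], "failedJobs": []}]
}'''

body = json.loads(SAMPLE)
longest = body["longestDurationOfCompletedSQL"][0]
# IDs of SQL statements that have at least one failed job.
failed_ids = [s["id"] for s in body["sqls"] if s["failedJobs"]]

print(limited)
print(unfinished)
print(longest["id"], longest["duration"])
print(failed_ids)
```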
- JDBC Server commands: Obtain the number of connections, the number of running SQL statements, as well as information about all sessions and all SQL statements.
- Command:
curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1476947670799_0053/api/v1/applications/application_1476947670799_0053/sqlserver"
192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1476947670799_0053 indicates the application ID in YARN.
- Command output:
{ "sessionNum" : 1, "runningSqlNum" : 0, "sessions" : [ { "user" : "spark", "ip" : "192.168.169.84", "sessionId" : "9dfec575-48b4-4187-876a-71711d3d7a97", "startTime" : "2016/10/29 15:21:10", "finishTime" : "", "duration" : "1 minute 50 seconds", "totalExecute" : 1 } ], "sqls" : [ { "user" : "spark", "jobId" : [ ], "groupId" : "e49ff81a-230f-4892-a209-a48abea2d969", "startTime" : "2016/10/29 15:21:13", "finishTime" : "2016/10/29 15:21:14", "duration" : "555 ms", "statement" : "show tables", "state" : "FINISHED", "detail" : "== Parsed Logical Plan ==\nShowTablesCommand None\n\n== Analyzed Logical Plan ==\ntableName: string, isTemporary: boolean\nShowTablesCommand None\n\n== Cached Logical Plan ==\nShowTablesCommand None\n\n== Optimized Logical Plan ==\nShowTablesCommand None\n\n== Physical Plan ==\nExecutedCommand ShowTablesCommand None\n\nCode Generation: true" } ] }
- Result analysis:
After running this command, you can query the number of sessions in the current JDBC application, the number of SQL statements being executed, and information about all sessions and SQL statements. The information about each session is listed in Table 4, and the information about each SQL statement is listed in Table 5:
Table 4 Session parameter description
| Parameter | Description |
| --- | --- |
| user | User connected to the session. |
| ip | IP address of the node where the session resides. |
| sessionId | Session ID. |
| startTime | Time when the session starts the connection. |
| finishTime | Time when the session ends the connection. |
| duration | Session connection duration. |
| totalExecute | Number of SQL statements executed by the session. |
Table 5 SQL parameter description
| Parameter | Description |
| --- | --- |
| user | User who executes the SQL statement. |
| jobId | IDs of jobs contained in the SQL statement. |
| groupId | ID of the group where the SQL statement resides. |
| startTime | SQL start time. |
| finishTime | SQL end time. |
| duration | SQL execution duration. |
| statement | SQL statement. |
| detail | Logical plan and physical plan. |
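As a rough sketch of how the session and SQL fields described in Table 4 and Table 5 might be consumed, the Python snippet below lists sessions that appear to be still open and counts SQL statements by state. The SAMPLE string is trimmed from the command output above. Treating an empty finishTime as "session still open" is an assumption based on that sample, not a documented rule.

```python
import json
from collections import Counter

# Trimmed from the sqlserver command output above.
SAMPLE = '''{
  "sessionNum": 1, "runningSqlNum": 0,
  "sessions": [
    {"user": "spark", "ip": "192.168.169.84",
     "sessionId": "9dfec575-48b4-4187-876a-71711d3d7a97",
     "startTime": "2016/10/29 15:21:10", "finishTime": "", "totalExecute": 1}],
  "sqls": [
    {"user": "spark", "statement": "show tables", "state": "FINISHED",
     "startTime": "2016/10/29 15:21:13", "finishTime": "2016/10/29 15:21:14",
     "duration": "555 ms"}]
}'''

body = json.loads(SAMPLE)

# Assumption: a session with an empty finishTime has not been closed yet.
open_sessions = [s["sessionId"] for s in body["sessions"] if not s["finishTime"]]

# Number of SQL statements per state value found in the response.
states = Counter(q["state"] for q in body["sqls"])

print(open_sessions)
print(dict(states))
```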
- JDBC API enhancement: Cancel a SQL statement that is being executed by using the execution ID obtained from Beeline.
- Command:
curl -k -i --negotiate -X PUT -u: "https://192.168.195.232:8090/proxy/application_1477722033672_0008/api/v1/applications/application_1477722033672_0008/cancel/execution?executionId=8"
- Command output:
- Notes:
Run the SQL statement in spark-beeline. If the SQL statement generates a Spark task, the execution ID of the SQL statement will be printed in Beeline. To cancel the execution of the SQL statement, run the preceding command.
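For scripted use, the cancel call can be prepared as an HTTP PUT request. The Python sketch below only constructs the request object from the URL pattern in the preceding command and does not send it; cancel_request is a hypothetical helper, and the Kerberos negotiation (curl --negotiate) and certificate handling (curl -k) needed in a security-mode cluster are outside what this standard-library example covers.

```python
from urllib.parse import urlencode
from urllib.request import Request


def cancel_request(host, port, app_id, execution_id):
    """Build (but do not send) the PUT request that cancels one SQL execution."""
    query = urlencode({"executionId": execution_id})
    url = (
        f"https://{host}:{port}/proxy/{app_id}/api/v1/applications/"
        f"{app_id}/cancel/execution?{query}"
    )
    return Request(url, method="PUT")


# Values taken from the command above; 8 is the execution ID printed by Beeline.
req = cancel_request("192.168.195.232", 8090, "application_1477722033672_0008", 8)
print(req.get_method(), req.full_url)
```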
- Streaming commands: Obtain the average input frequency, scheduling delay, execution duration, and total delay.
- Command:
curl -k -i --negotiate -u: "https://192.168.195.232:8090/proxy/application_1477722033672_0008/api/v1/applications/application_1477722033672_0008/streaming/statistics"
192.168.195.232 indicates the service IP address of the master node of the ResourceManager, 8090 indicates the port number of the ResourceManager, and application_1477722033672_0008 indicates the application ID in YARN.
- Command output:
{ "startTime" : "2018-12-25T08:58:10.836GMT", "batchDuration" : 1000, "numReceivers" : 1, "numActiveReceivers" : 1, "numInactiveReceivers" : 0, "numTotalCompletedBatches" : 373, "numRetainedCompletedBatches" : 373, "numActiveBatches" : 0, "numProcessedRecords" : 1, "numReceivedRecords" : 1, "avgInputRate" : 0.002680965147453083, "avgSchedulingDelay" : 14, "avgProcessingTime" : 47, "avgTotalDelay" : 62 }
- Result analysis:
Once you execute this command, you can retrieve the average input frequency (measured in events per second), average scheduling delay (measured in milliseconds), average execution time (measured in milliseconds), and average total delay (measured in milliseconds) of the current Streaming application.
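One common use of these figures is a health check on the Streaming application. The Python sketch below applies the general rule of thumb that the average processing time should stay below the batch duration. The sample values are copied from the command output above; the check itself and the names keeping_up and utilization are illustrative and not part of the REST API.

```python
import json

# Trimmed from the streaming/statistics command output above.
SAMPLE = '''{
  "batchDuration": 1000, "numTotalCompletedBatches": 373,
  "numReceivedRecords": 1, "avgInputRate": 0.002680965147453083,
  "avgSchedulingDelay": 14, "avgProcessingTime": 47, "avgTotalDelay": 62
}'''

stats = json.loads(SAMPLE)

# Rule of thumb: if batches take longer to process than the batch interval,
# scheduling delay keeps growing. Both values are in milliseconds.
keeping_up = stats["avgProcessingTime"] < stats["batchDuration"]

# Fraction of each batch interval spent processing, on average.
utilization = stats["avgProcessingTime"] / stats["batchDuration"]

print("keeping up:", keeping_up)
print("utilization:", utilization)
print("scheduling + processing (ms):",
      stats["avgSchedulingDelay"] + stats["avgProcessingTime"])
```

In this sample the scheduling delay plus the processing time is 61 ms, within 1 ms of the reported avgTotalDelay of 62 ms.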