Using Spark to Collect Statistics on Vehicle Startup Duration in Real Time
Scenario Overview
This practice helps you understand the basic functions of CS. In this practice, a user-defined Spark job of CS collects vehicle driving data that is ingested in real time and outputs information about each vehicle startup, including the vehicle startup time, stop time, and startup duration.
The original data is a simplified simulation of the real-time status information reported by vehicles connected to the Internet of Vehicles (IoV). Generally, the bus data of a connected vehicle includes information uploaded during driving, such as longitude and latitude, speed, direction, acc (ignition) status, throttle position, and brake position. In this practice, the simulation data is in CSV format, and only the vehicle ID, acc status, and data reporting time are used.
In this practice, the state management capability of Spark Streaming is used to collect statistics on the duration of each vehicle startup based on the real-time vehicle access data, and to output the result. Compared with writing the original data to disk and then analyzing it offline, the stream computing solution delivers better real-time performance and reduces data write operations, thereby reducing costs.
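Conceptually, the per-vehicle state that the Spark job maintains can be sketched in plain Python, independent of Spark. The record layout (vehicle ID, acc status, report time) follows the data description above; the two-value acc model, field names, and sample values below are illustrative assumptions, not part of the provided sample program:

```python
# Sketch of the per-vehicle startup-duration state machine.
# Assumed record format: (vehicle_id, acc_status, report_time),
# where acc_status is 1 (engine on) or 0 (engine off).

def track_startups(records):
    """Yield (vehicle_id, start_time, stop_time, duration) per startup."""
    state = {}  # vehicle_id -> start_time of the current startup
    for vehicle_id, acc, ts in records:
        if acc == 1 and vehicle_id not in state:
            state[vehicle_id] = ts         # engine turned on: remember when
        elif acc == 0 and vehicle_id in state:
            start = state.pop(vehicle_id)  # engine turned off: emit one startup
            yield (vehicle_id, start, ts, ts - start)

records = [
    ("car-001", 1, 100), ("car-001", 1, 110), ("car-001", 0, 160),
    ("car-002", 1, 105), ("car-002", 0, 125),
]
print(list(track_startups(records)))
# → [('car-001', 100, 160, 60), ('car-002', 105, 125, 20)]
```

In the real job, Spark Streaming keeps this per-key state across micro-batches, so a startup that spans several batch intervals is still counted as a single record.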
In this practice, two DIS streams are used to separately serve as the source and sink streams to simplify the process. CS supports source and sink streams of various types. For details, see the Cloud Stream Service Stream Ecosystem Development Guide.
The procedure is as follows:
Prepare a Spark Sample Program
- Create an OBS bucket for storing the CS sample programs, logs, and output results.
- Log in to the public cloud management console.
- Click Service List and choose Storage > Object Storage Service.
- Click Create Bucket to create a bucket named cs-test. Retain the default values for Storage Class and Bucket Policy.
- Click Create Now.
- Download the sample program SparkStreamingCountOnTime-assembly-1.0.jar to your local PC.
- Contact technical support to obtain the cs-sparkstreaming-CarAccTime.zip package.
- Decompress the package in your directory.
- Enter the cs-test bucket, click Upload File. In the displayed dialog box, select the SparkStreamingCountOnTime-assembly-1.0.jar file in the cs-sparkstreaming-CarAccTime directory.
Prepare Spark Sample Data
- Create a DIS stream for receiving vehicle data (that is, Spark sample data) in real time.
- Log in to the public cloud management console.
- Choose Service List > EI Enterprise Intelligence > Data Ingestion Service.
- On the DIS console, click Buy Stream.
- On the displayed Buy Stream page, configure the stream as follows:
Region: CN North-Beijing1
Stream Name: sparkstreaming-input
Stream Type: Common
Partitions: 1
Data Retention (days): 1
Source Data Type: BLOB
Figure 1 Creating a DIS stream for receiving real-time vehicle data
- Click Buy Now.
- Repeat the preceding steps to create a DIS stream named sparkstreaming-output for receiving the CS job output data.
- Create a DIS agency on the IAM management console.
When you create a DIS stream and dump data to OBS, you need to create an IAM agency to authorize DIS to access your OBS resources.
- Log in to the public cloud management console.
- Choose Service List > Management & Deployment > Identity and Access Management.
- Click Agencies. On the displayed page, click Create Agency.
- Configure a DIS agency named test with the configurations shown in Figure 2.
- Click OK.
- Configure a data dump task for the DIS stream.
- Log in to the public cloud management console.
- Choose Service List > EI Enterprise Intelligence > Data Ingestion Service.
- In the left navigation pane, click Stream Management. On the displayed page, click sparkstreaming-output in the Name/ID column of the stream list.
- On the displayed page, click Dump Management, and then click Add Dump Task.
- Configure the dump task by referring to the configurations shown in Figure 3 and then click Create Now.
- Run the JAR file in the application package to generate the simulation data, and send the data to the DIS stream.
- Obtain the SparkStreamingCountOnTime-assembly-1.0.jar file from the cs-sparkstreaming-CarAccTime directory.
- Run the following command in the directory where the SparkStreamingCountOnTime-assembly-1.0.jar file is stored:
java -cp SparkStreamingCountOnTime-assembly-1.0.jar util.GenerateData CarNumber ReportPeriod DisEndpoint AK SK ProjectId Region DisChannel
The following table describes the parameters in the preceding command.
Table 1 Parameter configurations involved in the preceding command
- CarNumber: Total number of vehicles contained in the simulation data. The value is an integer. The recommended value is 1000.
- ReportPeriod: Interval (unit: s) at which each vehicle in the simulation data reports data. The value is an integer. The recommended value is 10.
- DisEndpoint: Endpoint address of DIS. You can obtain the parameter value from Regions and Endpoints. For example, https://dis.cn-north-1.myhuaweicloud.com:443.
- AK/SK: Perform the following steps to obtain the parameter values:
  - Log in to the public cloud management console.
  - Move the cursor to the username in the upper right corner and select My Credentials from the drop-down list.
  - On the displayed My Credentials page, click Access Keys.
  - Click Add Access Key, enter the password and verification code as prompted, and click OK.
  - In the Download Access Key dialog box that is displayed, click OK to save the access keys to your browser's default download path.
  - Open the downloaded credentials.csv file to obtain the access keys (AK and SK).
- ProjectId: Perform the following steps to obtain the parameter value:
  - Log in to the public cloud management console.
  - Move the cursor to the username in the upper right corner and select My Credentials from the drop-down list.
  - On the Project List page, obtain the project ID corresponding to the CN North-Beijing1 region.
- Region: Set this parameter to cn-north-1.
- DisChannel: Set this parameter to sparkstreaming-input, which is the DIS stream used for receiving data.
- If "*** data sent" is displayed on the CLI console, the data is successfully sent.
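Outside the provided JAR, the shape of the simulated records can be sketched in Python. The column order (vehicle ID, acc status, report time) follows the data description in this practice; the ID format, timestamps, and the random on/off choice below are illustrative assumptions:

```python
import random
import time

def generate_records(car_number, report_period, cycles=3):
    """Simulate CSV lines of the form: vehicle ID, acc status, report time."""
    lines = []
    base = int(time.time())
    for cycle in range(cycles):
        ts = base + cycle * report_period  # each vehicle reports every period
        for car in range(car_number):
            acc = random.choice([0, 1])    # 1 = engine on, 0 = engine off
            lines.append(f"car-{car:04d},{acc},{ts}")
    return lines

for line in generate_records(car_number=3, report_period=10):
    print(line)
```

The real generator additionally signs each request with the AK/SK and sends the records to the sparkstreaming-input DIS stream via the DIS endpoint.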
Create a Cluster
- Log in to the public cloud management console.
- Click Service List and choose EI Enterprise Intelligence > Cloud Stream Service to access the CS console.
- In the left navigation pane, click Cluster Management to switch to the Cluster Management page.
- Click Create Cluster. On the displayed Create Cluster page, configure basic cluster information by referring to Table 2.
Table 2 Parameters related to cluster configuration
- Billing Mode: The pay-per-use billing mode is used.
- Region: In the upper left corner of the management console, choose CN North-Beijing1.
- Name: cs_demo
- Description: Description of the cluster. This parameter can be left unspecified.
- Tag: If you want to use the same tag to identify multiple cloud resources, you are advised to create predefined tags in TMS so that the same tag can be selected for all services. This parameter can be left unspecified.
- Enterprise Project: Select the enterprise project that you have created on the Enterprise Management console. For details about how to create an enterprise project, see Creating an Enterprise Project in the Enterprise Management User Guide. If you do not select an enterprise project for the cluster, the default project will be used.
- Management Node Specs: Specifications of the management nodes used by an exclusive cluster. The parameter value is positively correlated with the number of jobs running in the cluster. Select 2 SPUs.
- Max. SPUs in a Cluster: Maximum number of SPUs in a cluster, used for dynamic capacity expansion. The default value is 100 and can be changed after the cluster is created. Retain the default value.
- Advanced Configuration: You can adjust the VPC and subnet to which the cluster belongs based on your network plan. Retain the default values.
The following figure shows an example of the basic cluster information.
Figure 4 Creating a cluster
- Click OK. The system automatically switches to the Cluster Management page, where the status of the created cluster is Requesting resources.
It takes about 1 to 3 minutes to create a cluster. If the value of Status changes to Running, the cluster is successfully created.
Create a Job
- In the navigation tree on the left pane of the CS management console, choose Job Management to switch to the Job Management page.
- Click Create Job to switch to the Create Job dialog box.
- Specify job parameters as required.
- Select Spark Streaming Jar Job for Type.
- Set Name to SparkStreamingCountOnTime.
Figure 5 Creating a job
- Click OK to enter the Edit page.
- Upload the JAR file.
Figure 6 Uploading the JAR file
Table 3 Parameters for uploading a JAR file
- Upload Mode:
  - Select OBS.
  - Click Select a File from OBS and select the SparkStreamingCountOnTime-assembly-1.0.jar file uploaded to the cs-test bucket in Prepare a Spark Sample Program.
  - Click OK.
- Main Class:
  - Select Manually assign.
  - For Class Name, enter car.CountOnTime.
  - Enter the following parameters in the text box next to Class Arguments:
    DisEndpoint Region AK SK ProjectId InputStream Duration OutputStream
  NOTE:
  - For details about how to set DisEndpoint, Region, AK, SK, and ProjectId, see Table 1.
  - InputStream is the input channel of the CS job. In this practice, set this parameter to sparkstreaming-input.
  - OutputStream is the output channel of the CS job. In this practice, set this parameter to sparkstreaming-output.
  - Duration is the batch interval (unit: second) of the Spark Streaming application. The value is an integer. In this practice, set this parameter to 1.
- Configuration File: You do not need to set this parameter because there are no user-defined configuration files.
- Cluster: Select the cs_demo cluster created in Create a Cluster.
- Click Configure Parameters on the left to configure job parameters.
Figure 7 Configuring job parameters
Table 4 Configuring job parameters
- SPUs: Total number of SPUs used by the job. The value is automatically calculated from Driver SPUs, Executors, and SPUs per Executor.
- Driver SPUs: Number of SPUs used by the driver node. Use the default value 1.
- Executors: Number of executor nodes. Use the default value 1.
- SPUs per Executor: Number of SPUs per executor node. Use the default value 1.
- Save Job Log: Whether to save job logs.
  - Select this option to enable the job log saving function.
  - Click the text box next to OBS Bucket. In the dialog box that is displayed, select the cs-test bucket created in Prepare a Spark Sample Program.
  - Click Authorize OBS.
- Alarm Generation upon Job Exception: If you enable this function, CS sends alarm information over SMS or email when a job fails or your account is in arrears. In this practice, do not select this option.
- Auto Restart upon Exception: If you enable this function, CS automatically restarts and restores abnormal jobs. In this practice, do not select this option.
- Click Select the Target Cluster. From the Cluster drop-down menu, select the cs_demo cluster created in Create a Cluster.
Figure 8 Selecting the cluster
- Click Submit. On the page that is displayed, click OK.
After the job is submitted, the system automatically switches to the Job Management page, and the created job is displayed in the job list. You can check the Status column to track the job status as it changes after submission.
View the Job Execution Results
- In the navigation tree on the left pane of the CS management console, choose Job Management to switch to the Job Management page.
- In the Name column, click SparkStreamingCountOnTime to switch to the Job Details page.
- Click Job Monitoring. You can view information about the following four metrics: InputSize, SchedulingDelay, ProcessingTime, and TotalDelay. You can also log in to the ECS connected to the cluster to view more monitoring information.
Figure 9 Checking the job monitoring information
- Click Running Logs to view job logs. You can view the driver or executor logs by selecting the options in the drop-down list box in the upper right corner or by clicking StdOut or StdError.
Figure 10 Viewing job logs
- After a dump interval elapses (the interval is set when configuring the dump task; the default value is 300 seconds), log in to the OBS management console. Click cs-test in the Bucket Name column. Click Objects in the left navigation pane. On the displayed page, locate the row containing the dump file for the sparkstreaming-output stream and click Download.
Figure 11 Exporting the dump file
- After the file is downloaded, open it. You can view the statistics about vehicle startup durations collected by the CS job. The file is in CSV format. As shown in the following figure, the fields, from left to right, indicate the vehicle ID, startup time, stop time, and startup duration respectively.
Figure 12 Statistics about vehicle startup durations
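To sanity-check a downloaded dump file, the four-column CSV (vehicle ID, startup time, stop time, startup duration) can be parsed with a short script. The sample row values and field names below are illustrative assumptions; only the column order comes from the description above:

```python
import csv
import io

def parse_dump(text):
    """Parse result rows: vehicle ID, start time, stop time, duration (s)."""
    rows = []
    for vid, start, stop, duration in csv.reader(io.StringIO(text)):
        rows.append({"vehicle": vid, "start": start,
                     "stop": stop, "duration": int(duration)})
    return rows

sample = ("car-0001,1597040000,1597040060,60\n"
          "car-0002,1597040005,1597040025,20\n")
for row in parse_dump(sample):
    print(row)
```

A quick check such as verifying that every duration equals stop time minus start time is an easy way to confirm the dump file is well formed.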