Using Spark to Collect Statistics on Vehicle Startup Duration in Real Time
Scenario Overview
This practice helps you understand the basic functions of CS. In this practice, a user-defined Spark job of CS collects vehicle driving data that is ingested in real time and outputs information about each vehicle startup, including the vehicle startup time, stop time, and startup duration.
The original data is a simplified simulation of the real-time status information reported by vehicles connected to the Internet of Vehicles (IoV). Generally, the bus data of a connected vehicle includes information uploaded during driving, such as longitude and latitude, speed, direction, acc (ignition) status, throttle position, and brake position. In this practice, the simulation data is in CSV format, and only the vehicle ID, acc status, and data reporting time are used.
In this practice, the state management capability of Spark Streaming is used to collect statistics on the duration of each vehicle startup based on the real-time vehicle access data, and to output the result. Compared with writing the original data to disk and then analyzing it offline, the stream computing solution delivers better real-time performance and reduces data write operations, thereby reducing costs.
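Conceptually, the per-vehicle state that the Spark job maintains can be sketched in plain Python, independent of Spark. The record layout (vehicle ID, acc status, report time) follows the data description above; the two-value acc model, field names, and sample values below are illustrative assumptions, not part of the provided sample program:

```python
# Sketch of the per-vehicle startup-duration state machine.
# Assumed record format: (vehicle_id, acc_status, report_time),
# where acc_status is 1 (engine on) or 0 (engine off).

def track_startups(records):
    """Yield (vehicle_id, start_time, stop_time, duration) per startup."""
    state = {}  # vehicle_id -> start_time of the current startup
    for vehicle_id, acc, ts in records:
        if acc == 1 and vehicle_id not in state:
            state[vehicle_id] = ts         # engine turned on: remember when
        elif acc == 0 and vehicle_id in state:
            start = state.pop(vehicle_id)  # engine turned off: emit one startup
            yield (vehicle_id, start, ts, ts - start)

records = [
    ("car-001", 1, 100), ("car-001", 1, 110), ("car-001", 0, 160),
    ("car-002", 1, 105), ("car-002", 0, 125),
]
print(list(track_startups(records)))
# → [('car-001', 100, 160, 60), ('car-002', 105, 125, 20)]
```

In the real job, Spark Streaming keeps this per-key state across micro-batches, so a startup that spans several batch intervals is still counted as a single record.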
In this practice, two DIS streams are used to separately serve as the source and sink streams to simplify the process. CS supports source and sink streams of various types. For details, see the Cloud Stream Service Stream Ecosystem Development Guide.
The procedure is as follows:
Prepare a Spark Sample Program
- Create an OBS bucket for storing the CS sample programs, logs, and output results.
- Log in to the public cloud management console.
- Click Service List and choose Storage > Object Storage Service.
- Click Create Bucket to create a bucket named cs-test. Retain the default values for Storage Class and Bucket Policy.
- Click Create Now.
- Download the sample program SparkStreamingCountOnTime-assembly-1.0.jar to your local PC.
- Contact technical support to obtain the cs-sparkstreaming-CarAccTime.zip package.
- Decompress the package in your directory.
- Enter the cs-test bucket, click Upload File. In the displayed dialog box, select the SparkStreamingCountOnTime-assembly-1.0.jar file in the cs-sparkstreaming-CarAccTime directory.
Prepare Spark Sample Data
- Create a DIS stream for receiving vehicle data (that is, Spark sample data) in real time.
- Log in to the public cloud management console.
- Choose Service List > EI Enterprise Intelligence > Data Ingestion Service.
- On the DIS console, click Buy Stream.
- On the displayed Buy Stream page, configure the stream as follows:
Region: CN North-Beijing1
Stream Name: sparkstreaming-input
Stream Type: Common
Partitions: 1
Data Retention (days): 1
Source Data Type: BLOB
Figure 1 Creating a DIS stream for receiving real-time vehicle data
- Click Buy Now.
- Repeat the preceding steps to create a DIS stream named sparkstreaming-output for receiving the CS job output data.
- Create a DIS agency on the IAM management console.
When you create a DIS stream and dump data to OBS, you need to create an IAM agency to authorize DIS to access your OBS resources.
- Log in to the public cloud management console.
- Choose Service List > Management & Deployment > Identity and Access Management.
- Click Agencies. On the displayed page, click Create Agency.
- Configure a DIS agency named test with the configurations shown in Figure 2.
- Click OK.
- Configure a data dump task for the DIS stream.
- Log in to the public cloud management console.
- Choose Service List > EI Enterprise Intelligence > Data Ingestion Service.
- In the left navigation pane, click Stream Management. On the displayed page, click sparkstreaming-output in the Name/ID column of the stream list.
- On the displayed page, click Dump Management, and then click Add Dump Task.
- Configure the dump task by referring to the configurations shown in Figure 3 and then click Create Now.
- Run the JAR file in the application package to generate the simulation data, and send the data to the DIS stream.
- Obtain the SparkStreamingCountOnTime-assembly-1.0.jar file from the cs-sparkstreaming-CarAccTime directory.
- Run the following command in the directory where the SparkStreamingCountOnTime-assembly-1.0.jar file is stored:
java -cp SparkStreamingCountOnTime-assembly-1.0.jar util.GenerateData CarNumber ReportPeriod DisEndpoint AK SK ProjectId Region DisChannel
The following table describes the parameters in the preceding command.
Table 1 Parameter configurations involved in the preceding command
- CarNumber: Total number of vehicles contained in the simulation data. The value is an integer. The recommended value is 1000.
- ReportPeriod: Interval (unit: s) at which each vehicle in the simulation data reports data. The value is an integer. The recommended value is 10.
- DisEndpoint: Endpoint address of DIS. You can obtain the parameter value from Regions and Endpoints. For example, https://dis.cn-north-1.myhuaweicloud.com:443.
- AK/SK: Perform the following steps to obtain the parameter values:
  - Log in to the public cloud management console.
  - Move the cursor to the username in the upper right corner and select My Credentials from the drop-down list.
  - On the displayed My Credentials page, click Access Keys.
  - Click Add Access Key, enter the password and verification code as prompted, and click OK.
  - In the Download Access Key dialog box that is displayed, click OK to save the access keys to your browser's default download path.
  - Open the downloaded credentials.csv file to obtain the access keys (AK and SK).
- ProjectId: Perform the following steps to obtain the parameter value:
  - Log in to the public cloud management console.
  - Move the cursor to the username in the upper right corner and select My Credentials from the drop-down list.
  - On the Project List page, obtain the project ID corresponding to the CN North-Beijing1 region.
- Region: Set this parameter to cn-north-1.
- DisChannel: Set this parameter to sparkstreaming-input, which is the DIS stream used for receiving data.
- If "*** data sent" is displayed on the CLI console, the data is successfully sent.
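Outside the provided JAR, the shape of the simulated records can be sketched in Python. The column order (vehicle ID, acc status, report time) follows the data description in this practice; the ID format, timestamps, and the random on/off choice below are illustrative assumptions:

```python
import random
import time

def generate_records(car_number, report_period, cycles=3):
    """Simulate CSV lines of the form: vehicle ID, acc status, report time."""
    lines = []
    base = int(time.time())
    for cycle in range(cycles):
        ts = base + cycle * report_period  # each vehicle reports every period
        for car in range(car_number):
            acc = random.choice([0, 1])    # 1 = engine on, 0 = engine off
            lines.append(f"car-{car:04d},{acc},{ts}")
    return lines

for line in generate_records(car_number=3, report_period=10):
    print(line)
```

The real generator additionally signs each request with the AK/SK and sends the records to the sparkstreaming-input DIS stream via the DIS endpoint.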
Create a Cluster
- Log in to the public cloud management console.
- Click Service List and choose EI Enterprise Intelligence > Cloud Stream Service to access the CS console.
- In the left navigation pane, click Cluster Management to switch to the Cluster Management page.
- Click Create Cluster. On the displayed Create Cluster page, configure basic cluster information by referring to Table 2.
Table 2 Parameters related to cluster configuration
- Billing Mode: The pay-per-use billing mode is used.
- Region: In the upper left corner of the management console, choose CN North-Beijing1.
- Name: cs_demo
- Description: Description of the cluster. This parameter can be left unspecified.
- Tag: If you want to use the same tag to identify multiple cloud resources, you are advised to create predefined tags in TMS so that the same tag can be selected for all services. This parameter can be left unspecified.
- Enterprise Project: Select the enterprise project that you have created on the Enterprise Management console. For details about how to create an enterprise project, see Creating an Enterprise Project in the Enterprise Management User Guide. If you do not select an enterprise project for the cluster, the default project will be used.
- Management Node Specs: Specifications of the management nodes used by an exclusive cluster. The parameter value is positively correlated with the number of jobs running in the cluster. Select 2 SPUs.
- Max. SPUs in a Cluster: Maximum number of SPUs in a cluster, used for dynamic capacity expansion. The default value is 100 and can be changed after the cluster is created. Retain the default value.
- Advanced Configuration: You can adjust the VPC and subnet to which the cluster belongs based on your network plan. Retain the default values.
The following figure shows an example of the basic cluster information.
Figure 4 Creating a cluster
- Click OK. The system automatically switches to the Cluster Management page, where the status of the created cluster is Requesting resources.
It takes about 1 to 3 minutes to create a cluster. If the value of Status changes to Running, the cluster is successfully created.
Create a Job
- In the navigation tree on the left pane of the CS management console, choose Job Management to switch to the Job Management page.
- Click Create Job to switch to the Create Job dialog box.
- Specify job parameters as required.
- Select Spark Streaming Jar Job for Type.
- Set Name to SparkStreamingCountOnTime.
Figure 5 Creating a job
- Click OK to enter the Edit page.
- Upload the JAR file.
Figure 6 Uploading the JAR file
Table 3 Parameters for uploading a JAR file
- Upload Mode:
  - Select OBS.
  - Click Select a File from OBS and select the SparkStreamingCountOnTime-assembly-1.0.jar file uploaded to the cs-test bucket in Prepare a Spark Sample Program.
  - Click OK.
- Main Class:
  - Select Manually assign.
  - For Class Name, enter car.CountOnTime.
  - Enter the following parameters in the text box next to Class Arguments:
    DisEndpoint Region AK SK ProjectId InputStream Duration OutputStream
  NOTE:
  - For details about how to set DisEndpoint, Region, AK, SK, and ProjectId, see Table 1.
  - InputStream is the input channel of the CS job. In this practice, set this parameter to sparkstreaming-input.
  - OutputStream is the output channel of the CS job. In this practice, set this parameter to sparkstreaming-output.
  - Duration is the batch interval (unit: second) of the Spark Streaming application. The value is an integer. In this practice, set this parameter to 1.
- Configuration File: You do not need to set this parameter because there are no user-defined configuration files.
- Cluster: Select the cs_demo cluster created in Create a Cluster.
- Click Configure Parameters on the left to configure job parameters.
Figure 7 Configuring job parameters
Table 4 Configuring job parameters
- SPUs: Total number of SPUs used by the job. The value is automatically calculated from Driver SPUs, Executors, and SPUs per Executor.
- Driver SPUs: Number of SPUs used by the driver node. Use the default value 1.
- Executors: Number of executor nodes. Use the default value 1.
- SPUs per Executor: Number of SPUs per executor node. Use the default value 1.
- Save Job Log: Whether to save job logs.
  - Select this option to enable the job log saving function.
  - Click the text box next to OBS Bucket. In the dialog box that is displayed, select the cs-test bucket created in Prepare a Spark Sample Program.
  - Click Authorize OBS.
- Alarm Generation upon Job Exception: If you enable this function, CS sends alarm information over SMS or email when a job fails or your account is in arrears. In this practice, do not select this option.
- Auto Restart upon Exception: If you enable this function, CS automatically restarts and restores abnormal jobs. In this practice, do not select this option.
- Click Select the Target Cluster. From the Cluster drop-down menu, select the cs_demo cluster created in Create a Cluster.
Figure 8 Selecting the cluster
- Click Submit. On the page that is displayed, click OK.
After the job is submitted, the system automatically switches to the Job Management page, and the created job is displayed in the job list. You can check the Status column to track the job status as it changes after submission.
View the Job Execution Results
- In the navigation tree on the left pane of the CS management console, choose Job Management to switch to the Job Management page.
- In the Name column, click SparkStreamingCountOnTime to switch to the Job Details page.
- Click Job Monitoring. You can view information about the following four metrics: InputSize, SchedulingDelay, ProcessingTime, and TotalDelay. You can also log in to the ECS connected to the cluster to view more monitoring information.
Figure 9 Checking the job monitoring information
- Click Running Logs to view job logs. You can view the driver or executor logs by selecting the options in the drop-down list box in the upper right corner or by clicking StdOut or StdError.
Figure 10 Viewing job logs
- After a dump interval elapses (the interval is set when configuring the dump task; the default value is 300 seconds), log in to the OBS management console. Click cs-test in the Bucket Name column. Click Objects in the left navigation pane. On the displayed page, locate the row containing the dump file for the sparkstreaming-output stream and click Download.
Figure 11 Exporting the dump file
- After the file is downloaded, open it. You can view the statistics about vehicle startup durations collected by the CS job. The file is in CSV format. As shown in the following figure, the fields, from left to right, indicate the vehicle ID, startup time, stop time, and startup duration respectively.
Figure 12 Statistics about vehicle startup durations
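To sanity-check a downloaded dump file, the four-column CSV (vehicle ID, startup time, stop time, startup duration) can be parsed with a short script. The sample row values and field names below are illustrative assumptions; only the column order comes from the description above:

```python
import csv
import io

def parse_dump(text):
    """Parse result rows: vehicle ID, start time, stop time, duration (s)."""
    rows = []
    for vid, start, stop, duration in csv.reader(io.StringIO(text)):
        rows.append({"vehicle": vid, "start": start,
                     "stop": stop, "duration": int(duration)})
    return rows

sample = ("car-0001,1597040000,1597040060,60\n"
          "car-0002,1597040005,1597040025,20\n")
for row in parse_dump(sample):
    print(row)
```

A quick check such as verifying that every duration equals stop time minus start time is an easy way to confirm the dump file is well formed.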