Writing and Running the Spark Program in the Linux Environment

Scenario

After the program codes are developed, you can upload the codes to the Linux client for running. The running procedures of applications developed in Scala or Java are the same.

A Spark application developed using Python does not need to be packed into a JAR file. You only need to copy the sample project to the compiler.
It is needed to ensure that the version of Python installed on the worker and driver is consistent, otherwise the following error will be reported: "Python in worker has different version %s than that in driver %s."
Ensure that Maven image repository of the SDK in the Huawei image site has been configured in Maven. For details, see Configuring Huawei Open-Source Mirrors.

Procedure

In the IntelliJ IDEA, open the Maven tool window.

On the main page of the IDEA, choose View > Tool Windows > Maven to open the Maven tool window.
Figure 1 Opening the Maven tool window

If the project is not imported using Maven, perform the following operations:

Right-click the pom file in the sample code project and choose Add as Maven Project from the shortcut menu to add a Maven project.

Figure 2 Adding a Maven project
Use Maven to generate a JAR file.
1. In the Maven tool window, select clean from Lifecycle to execute the Maven building process.
  Figure 3 Selecting clean from Lifecycle and execute the Maven building process
2. In the Maven tool window, select package from Lifecycle and execute the Maven building process.
  Figure 4 Selecting package from Lifecycle and execute the Maven build process.
  
  If the following information is displayed in Run:, the packaging is successful.
  Figure 5 Packaging success message
3. You can obtain the JAR package from the target folder in the project directory.
  Figure 6 Obtaining the JAR Package
Copy the JAR file generated in 2 (for example, CollectFemaleInfo.jar) to the Spark running environment (that is, the Spark client), for example, /opt/female. Run the Spark application. For details about the sample project, see Developing a Spark Application.

Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.