Updated on 2024-08-16 GMT+08:00

Compiling and Running a Spark Application

Scenario

After application code development is complete, you can upload the JAR file to the Linux client to run applications. The procedures for running applications developed using Scala or Java are the same on the Spark client.

  • Spark applications can run only on Linux, but not on Windows.
  • A Spark application developed using Python does not need to be packaged into a JAR file. You only need to copy the sample project to the Spark client.

Running the Spark Core Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the Spark Core sample application (Scala and Java).

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    <inputPath> indicates the input path in HDFS.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode client /opt/female/FemaleInfoCollection.jar <inputPath>
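
    The preceding command submits the application in client mode. As a sketch of the cluster-mode alternative described above (the result then appears on the Yarn web UI instead of the console), only the --deploy-mode value changes:

    bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode cluster /opt/female/FemaleInfoCollection.jar <inputPath>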

Running the Spark SQL Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the Spark SQL sample application (Scala and Java).

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    <inputPath> indicates the input path in HDFS.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode client /opt/female/FemaleInfoCollection.jar <inputPath>

Running the Spark Streaming Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the Spark Streaming sample application (Scala and Java).

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    The path of the Spark Streaming Kafka dependency package on the client is different from that of other dependency packages. For example, the path of other dependency packages is $SPARK_HOME/jars, and the path of the Spark Streaming Kafka dependency package is $SPARK_HOME/jars/streamingClient. Therefore, when you run an application, you need to add a configuration item to the spark-submit command to specify the path of the Spark Streaming Kafka dependency package, for example, --jars $SPARK_HOME/jars/streamingClient/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient/kafka_*.jar,$SPARK_HOME/jars/streamingClient/spark-streaming-kafka-0-8_*.jar.

    • Spark Streaming Write To Print Sample Code

      bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient/kafka_*.jar,$SPARK_HOME/jars/streamingClient/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint /opt/female/FemaleInfoCollectionPrint.jar <checkPointDir> <batchTime> <topics> <brokers>

      • The JAR version name in --jars varies depending on the cluster.
      • The value of <brokers> is in brokerIp:9092 format.
      • <checkPointDir> indicates the path for backing up the application result to HDFS. <batchTime> indicates the interval for Streaming to process data in batches.
    • Spark Streaming Write To Kafka Sample Code

      bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient/kafka_*.jar,$SPARK_HOME/jars/streamingClient/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.DstreamKafkaWriter /opt/female/SparkStreamingExample-1.0.jar <groupId> <brokers> <topic>

Running the "Accessing Spark SQL Through JDBC" Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the "Accessing Spark SQL Through JDBC" sample application (Scala and Java).

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and run the java -cp command to run the code.

    java -cp ${SPARK_HOME}/jars/*:${SPARK_HOME}/conf:/opt/female/SparkThriftServerJavaExample-*.jar com.huawei.bigdata.spark.examples.ThriftServerQueriesTest ${SPARK_HOME}/conf/hive-site.xml ${SPARK_HOME}/conf/spark-defaults.conf

    For a normal cluster, comment out the security configuration code. For details, see the corresponding descriptions in the sample code.

    In the preceding command line, you can minimize the corresponding running dependency packages based on the sample project. For details about the dependency packages of the sample projects, see Related Information.

Running the Spark on HBase Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the Spark on HBase sample application (Scala and Java).

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code. The application running sequence is as follows: TableCreation, TableInputData, and TableOutputData.

    When running the TableInputData sample application, you need to specify <inputPath>, which indicates the input path in HDFS.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.TableInputData --master yarn --deploy-mode client /opt/female/TableInputData.jar <inputPath>
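
    For reference, the submissions for the other two steps in the sequence might look as follows (a sketch, assuming all three classes are packaged in the same JAR as the TableInputData step and follow the same package naming):

    bin/spark-submit --class com.huawei.bigdata.spark.examples.TableCreation --master yarn --deploy-mode client /opt/female/TableInputData.jar

    bin/spark-submit --class com.huawei.bigdata.spark.examples.TableOutputData --master yarn --deploy-mode client /opt/female/TableInputData.jar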

    If Kerberos authentication is enabled when Spark tasks connect to HBase to read and write data, set spark.yarn.security.credentials.hbase.enabled to true in the client configuration file spark-defaults.conf. This configuration must be set for all Spark tasks that connect to HBase to read and write data.
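
    For example, the corresponding entry in spark-defaults.conf (a minimal sketch; the file is in the client conf directory, for example, /opt/client/Spark/spark/conf):

    spark.yarn.security.credentials.hbase.enabled true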

Running the Spark HBase to HBase Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the Spark HBase to HBase sample application (Scala and Java).

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    When running the sample application, you need to specify <zkQuorum>, which indicates the IP address of ZooKeeper.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHbasetoHbase --master yarn --deploy-mode client /opt/female/FemaleInfoCollection.jar <zkQuorum>

Running the Spark Hive to HBase Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the Spark Hive to HBase sample application (Scala and Java).

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    When running the sample application, you need to specify <zkQuorum>, which indicates the IP address of ZooKeeper.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHivetoHbase --master yarn --deploy-mode client /opt/female/FemaleInfoCollection.jar <zkQuorum>

Running the Spark Streaming Kafka to HBase Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the Spark Streaming Kafka to HBase sample application (Scala and Java).

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    When running the sample application, you need to specify <checkPointDir> <topic> <brokerList>. <checkPointDir> indicates an HDFS path for storing the application result backup. <topic> indicates a topic name read from Kafka. <brokerList> indicates the IP address of the Kafka server.

    The path of the Spark Streaming Kafka dependency package on the client is different from that of other dependency packages. For example, the path of other dependency packages is $SPARK_HOME/jars, and the path of the Spark Streaming Kafka dependency package is $SPARK_HOME/jars/streamingClient010. Therefore, when you run an application, add a configuration item to the spark-submit command to specify the path of the Spark Streaming Kafka dependency package, for example, --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-*.jar.

    Spark Streaming To HBase Sample Code

    bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-0*.jar --class com.huawei.bigdata.spark.examples.streaming.SparkOnStreamingToHbase /opt/female/FemaleInfoCollectionPrint.jar <checkPointDir> <topic> <brokerList>

    • The JAR file name in --jars varies depending on the cluster.
    • The value of <brokerList> is in brokerIp:9092 format.

Running the "Connecting Spark Streaming to Kafka0-10" Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the sample application (Scala and Java) for connecting Spark Streaming to Kafka 0-10.

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    When running the sample application, you need to specify <checkpointDir> <brokers> <topic> <batchTime>. <checkpointDir> indicates an HDFS path for storing the application result backup. <brokers> indicates the Kafka address for obtaining metadata; the value is in brokerIp:21007 format for a security cluster and brokerIp:9092 format for a normal cluster. <topic> indicates the topic name read from Kafka. <batchTime> indicates the interval for Streaming to process data in batches.

    "Spark Streaming Reads Kafka 0-10" Sample Code

    • Run the following commands to submit a security cluster task:

      bin/spark-submit --master yarn --deploy-mode client --files ./conf/jaas.conf,./conf/user.keytab --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.SecurityKafkaWordCount /opt/SparkStreamingKafka010JavaExample-*.jar <checkpointDir> <brokers> <topic> <batchTime>

      The configuration example is as follows:

      --files ./jaas.conf,./user.keytab // Use --files to specify the jaas.conf and keytab files.
      --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" // Specify the path of the jaas.conf file on the driver. In yarn-client mode, use --driver-java-options "-Djava.security.auth.login.config" to specify it. In yarn-cluster mode, use --conf "spark.yarn.cluster.driver.extraJavaOptions" to specify it.
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" // Specify the path of the jaas.conf file on the executor.
    • Command for submitting tasks in the normal cluster:

      bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.SecurityKafkaWordCount /opt/SparkStreamingKafka010JavaExample-*.jar <checkpointDir> <brokers> <topic> <batchTime>

      Spark Streaming Write To Kafka 0-10 code example (this example exists only in mrs-sample-project-1.6.0.zip):

      bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.JavaDstreamKafkaWriter /opt/JavaDstreamKafkaWriter.jar <checkPointDir> <brokers> <topics>

Running the Spark Structured Streaming Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Run the Spark Structured Streaming sample application (Scala and Java).

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the conf directory of the Spark client and invoke the spark-submit script to run code.

    When running the sample application, you need to specify <brokers> <subscribe-type> <topic> <protocol> <service> <domain>. <brokers> indicates the Kafka address for obtaining metadata. <subscribe-type> indicates the Kafka subscription type (generally subscribe, indicating that the specified topic is subscribed to). <topic> indicates the topic name read from Kafka. <protocol> indicates the security access protocol. <service> indicates the Kerberos service name. <domain> indicates the Kerberos domain name.

    For a normal cluster, comment out some code for configuring the Kafka security protocol. For details, see the description in Java Sample Code and Scala Sample Code.

    The path of the Spark Structured Streaming Kafka dependency package on the client is different from that of other dependency packages. For example, the path of other dependency packages is $SPARK_HOME/jars, and the path of the Spark Structured Streaming Kafka dependency package is $SPARK_HOME/jars/streamingClient010. Therefore, when you run an application, add a configuration item to the spark-submit command to specify the path of the Spark Structured Streaming Kafka dependency package, for example, --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-sql-kafka-*.jar.

    Sample Code for Connecting Spark Structured Streaming to Kafka

    • Run the following commands to submit tasks in the security cluster:

      cd /opt/client/Spark/spark/conf

      spark-submit --master yarn --deploy-mode client --files ./jaas.conf,./user.keytab --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-sql-kafka-*.jar --class com.huawei.bigdata.spark.examples.SecurityKafkaWordCount /root/jars/SparkStructuredStreamingJavaExample-*.jar <brokers> <subscribe-type> <topic> <protocol> <service> <domain>

    • Command for submitting tasks in the normal cluster:

      spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-sql-kafka-*.jar --class com.huawei.bigdata.spark.examples.SecurityKafkaWordCount /root/jars/SparkStructuredStreamingJavaExample-*.jar <brokers> <subscribe-type> <topic> <protocol> <service> <domain>

      The configuration example is as follows:

      --files <local Path>/jaas.conf,<local Path>/user.keytab // Use --files to specify the jaas.conf and keytab files.
      --driver-java-options "-Djava.security.auth.login.config=<local Path>/jaas.conf" // Specify the path of the jaas.conf file on the driver. In yarn-client mode, use --driver-java-options "-Djava.security.auth.login.config" to specify it. In yarn-cluster mode, use --conf "spark.yarn.cluster.driver.extraJavaOptions" to specify it. If an error is reported indicating that you have no permission to read and write the local directory, specify spark.sql.streaming.checkpointLocation; you must have read and write permissions on the directory specified by this parameter.
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" // Specify the path of the jaas.conf file on the executor.
      The JAR file name in --jars varies depending on the cluster.
      For a security cluster, <brokers> is in brokerIp:21007 format. For the <protocol> <service> <domain> formats, see the $KAFKA_HOME/config/consumer.properties file.
      For a normal cluster, <brokers> is in brokerIp:9092 format, the value of <protocol> is replaced with null, and the value of <service> is kafka. For details about <domain>, see the $KAFKA_HOME/config/consumer.properties file.
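
      For example, in a security cluster the six positional arguments might look like the following (illustrative values only; take the actual protocol, service, and domain values from consumer.properties on your cluster):

      192.168.0.1:21007 subscribe example-topic SASL_PLAINTEXT kafka hadoop.hadoop.com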

Submitting a Python Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Submit a Python application.

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    <inputPath> indicates the input path in HDFS.

    The sample code does not provide authentication information. Therefore, you need to configure spark.yarn.keytab and spark.yarn.principal to specify authentication information.

    bin/spark-submit --master yarn --deploy-mode client --conf spark.yarn.keytab=/opt/FIclient/user.keytab --conf spark.yarn.principal=sparkuser /opt/female/SparkPythonExample/collectFemaleInfo.py <inputPath>
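
    For a cluster with Kerberos authentication disabled, the authentication settings can be omitted, for example (a sketch under that assumption):

    bin/spark-submit --master yarn --deploy-mode client /opt/female/SparkPythonExample/collectFemaleInfo.py <inputPath>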

Submitting the SparkLauncher Application

  1. Run the mvn package command in the project directory to generate a JAR file, and obtain it from the target directory of the project, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, FemaleInfoCollection.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. For a security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy these files.
  3. Submit the SparkLauncher application.

    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select either of the following deploy modes as required:
      • --deploy-mode client: The driver process runs on the client, and the result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in the ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    java -cp $SPARK_HOME/jars/*:{JAR_PATH} com.huawei.bigdata.spark.examples.SparkLauncherExample yarn-client {TARGET_JAR_PATH} {TARGET_JAR_MAIN_CLASS} {args}

    • JAR_PATH indicates the path of the SparkLauncher JAR package.
    • TARGET_JAR_PATH indicates the path of the JAR package of the Spark application to be submitted.
    • TARGET_JAR_MAIN_CLASS indicates the main class of the Spark application to be submitted.
    • args indicates the parameters of the Spark application to be submitted.
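
    The launcher class is provided with the sample project; for reference only, a minimal, hypothetical sketch of a launcher program built on the org.apache.spark.launcher.SparkLauncher API (the JAR path, main class, and argument below are placeholders, not the packaged sample itself):

    import org.apache.spark.launcher.SparkLauncher;

    public class SparkLauncherSketch {
        public static void main(String[] args) throws Exception {
            // Assemble the submission; each setter mirrors a spark-submit option.
            Process spark = new SparkLauncher()
                    .setMaster("yarn")
                    .setDeployMode("client")   // equivalent of the yarn-client argument above
                    .setAppResource("/opt/female/FemaleInfoCollection.jar")                 // {TARGET_JAR_PATH}, placeholder
                    .setMainClass("com.huawei.bigdata.spark.examples.FemaleInfoCollection") // {TARGET_JAR_MAIN_CLASS}, placeholder
                    .addAppArgs("<inputPath>") // {args}, placeholder
                    .launch();
            // Block until the launched spark-submit process exits and report the result.
            int exitCode = spark.waitFor();
            System.out.println("Spark application exited with code " + exitCode);
        }
    }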

Related Information

The running dependency packages of the "Accessing Spark SQL Through JDBC" sample application (Scala and Java) are as follows:

  • "Accessing Spark SQL Through JDBC" Sample Projects (Scala)
    • commons-collections-<version>.jar
    • commons-configuration-<version>.jar
    • commons-io-<version>.jar
    • commons-lang-<version>.jar
    • commons-logging-<version>.jar
    • guava-<version>.jar
    • hadoop-auth-<version>.jar
    • hadoop-common-<version>.jar
    • hadoop-mapreduce-client-core-<version>.jar
    • hive-exec-<version>.spark2.jar
    • hive-jdbc-<version>.spark2.jar
    • hive-metastore-<version>.spark2.jar
    • hive-service-<version>.spark2.jar
    • httpclient-<version>.jar
    • httpcore-<version>.jar
    • libthrift-<version>.jar
    • log4j-<version>.jar
    • slf4j-api-<version>.jar
    • zookeeper-<version>.jar
    • scala-library-<version>.jar
  • "Accessing Spark SQL Through JDBC" Sample Projects (Java)
    • commons-collections-<version>.jar
    • commons-configuration-<version>.jar
    • commons-io-<version>.jar
    • commons-lang-<version>.jar
    • commons-logging-<version>.jar
    • guava-<version>.jar
    • hadoop-auth-<version>.jar
    • hadoop-common-<version>.jar
    • hadoop-mapreduce-client-core-<version>.jar
    • hive-exec-<version>.spark2.jar
    • hive-jdbc-<version>.spark2.jar
    • hive-metastore-<version>.spark2.jar
    • hive-service-<version>.spark2.jar
    • httpclient-<version>.jar
    • httpcore-<version>.jar
    • libthrift-<version>.jar
    • log4j-<version>.jar
    • slf4j-api-<version>.jar
    • zookeeper-<version>.jar