Compiling and Running a Spark Application

Updated on 2024-08-16 GMT+08:00

Scenario

After application code development is complete, you can upload the JAR file to the Linux client to run applications. The procedures for running applications developed using Scala or Java are the same on the Spark client.

NOTE:
  • Spark applications can run only on Linux, but not on Windows.
  • A Spark application developed in Python does not need to be packaged into a JAR file. You only need to copy the sample project to the Spark operating environment (that is, the Spark client).

Running the Spark Core Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the Spark Core sample application (Scala and Java).

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    <inputPath> indicates the input path in HDFS.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode client /opt/female/FemaleInfoCollection.jar <inputPath>
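
    To run the same example in cluster mode instead, only the --deploy-mode value changes; the result is then viewed in the driver logs on the Yarn web UI rather than on the local console. A sketch based on the command above:

    bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode cluster /opt/female/FemaleInfoCollection.jar <inputPath>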

Running the Spark SQL Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the Spark SQL sample application (Scala and Java).

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    <inputPath> indicates the input path in HDFS.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode client /opt/female/FemaleInfoCollection.jar <inputPath>

Running the Spark Streaming Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the Spark Streaming sample application (Scala and Java).

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.
    Go to the Spark client directory and invoke the bin/spark-submit script to run code.
    NOTE:

    The path of the Spark Streaming Kafka dependency package on the client is different from that of other dependency packages. For example, the path of other dependency packages is $SPARK_HOME/jars, and the path of the Spark Streaming Kafka dependency package is $SPARK_HOME/jars/streamingClient. Therefore, when you run an application, you need to add a configuration item to the spark-submit command to specify the path of the Spark Streaming Kafka dependency package, for example, --jars $SPARK_HOME/jars/streamingClient/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient/kafka_*.jar,$SPARK_HOME/jars/streamingClient/spark-streaming-kafka-0-8_*.jar.
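
    The exact JAR file names matched by the wildcards above vary between client versions. If you are unsure which files are present, you can simply list the directory before filling in --jars (a quick check, not part of the sample project):

    ls $SPARK_HOME/jars/streamingClient/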

    • Spark Streaming Write To Print Sample Code

      bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient/kafka_*.jar,$SPARK_HOME/jars/streamingClient/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint /opt/female/FemaleInfoCollectionPrint.jar <checkPointDir> <batchTime> <topics> <brokers>

      NOTE:
      • The JAR version name in --jars varies depending on the cluster.
      • The value of brokers is in brokerIp:9092 format.
      • <checkPointDir> indicates the HDFS path for backing up the application result. <batchTime> indicates the interval at which Streaming processes data in batches.
    • Write To Kafka Sample Code for Spark Streaming

      bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient/kafka_*.jar,$SPARK_HOME/jars/streamingClient/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.DstreamKafkaWriter /opt/female/SparkStreamingExample-1.0.jar <groupId> <brokers> <topic>

Running the "Accessing Spark SQL Through JDBC" Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the "Accessing Spark SQL Through JDBC" sample application (Scala and Java).

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and run the java -cp command to run the code.

    java -cp ${SPARK_HOME}/jars/*:${SPARK_HOME}/conf:/opt/female/SparkThriftServerJavaExample-*.jar com.huawei.bigdata.spark.examples.ThriftServerQueriesTest ${SPARK_HOME}/conf/hive-site.xml ${SPARK_HOME}/conf/spark-defaults.conf
    NOTE:

    For a normal cluster, comment out the security configuration code. For details, see the corresponding descriptions in the Java and Scala sample code.

    In the preceding command, you can trim the classpath to only the runtime dependency packages required by the sample project. For the dependency packages of each sample project, see Related Information.
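
    As an illustration only, such a reduced classpath could be assembled from the JARs listed in Related Information instead of putting all of ${SPARK_HOME}/jars/* on the classpath. This is a sketch: JAR versions vary by cluster, and the sample JAR name below (SparkThriftServerJavaExample-1.0.jar) is hypothetical.

    # Sketch: build the classpath from only the JARs listed in Related Information
    # (ls picks the first match per package; adjust if several versions are present)
    DEPS=""
    for j in commons-collections commons-configuration commons-io commons-lang commons-logging guava hadoop-auth hadoop-common hadoop-mapreduce-client-core hive-exec hive-jdbc hive-metastore hive-service httpclient httpcore libthrift log4j slf4j-api zookeeper; do
      DEPS="${DEPS}$(ls ${SPARK_HOME}/jars/${j}-*.jar 2>/dev/null | head -1):"
    done
    java -cp ${DEPS}${SPARK_HOME}/conf:/opt/female/SparkThriftServerJavaExample-1.0.jar com.huawei.bigdata.spark.examples.ThriftServerQueriesTest ${SPARK_HOME}/conf/hive-site.xml ${SPARK_HOME}/conf/spark-defaults.conf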

Running the Spark on HBase Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the Spark on HBase sample application (Scala and Java).

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code. The application running sequence is as follows: TableCreation, TableInputData, and TableOutputData.

    When running the TableInputData sample application, you need to specify <inputPath>, which indicates the input path in HDFS.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.TableInputData --master yarn --deploy-mode client /opt/female/TableInputData.jar <inputPath>

    NOTE:

    If Kerberos authentication is enabled when Spark tasks connect to HBase to read and write data, set spark.yarn.security.credentials.hbase.enabled to true in the client configuration file spark-defaults.conf. This configuration is required for every Spark task that reads data from or writes data to HBase.
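
    For example, assuming the client installation path used earlier in this section (/opt/client/Spark/spark), the switch could be appended to spark-defaults.conf as follows (or edit the file directly with any editor):

    # Enable HBase credential fetching for Spark on Yarn (Kerberos clusters); path assumes the client directory above
    echo "spark.yarn.security.credentials.hbase.enabled true" >> /opt/client/Spark/spark/conf/spark-defaults.conf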

Running the Spark HBase to HBase Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the Spark HBase to HBase sample application (Scala and Java).

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    When running the sample application, you need to specify <zkQuorum>, which indicates the IP address of ZooKeeper.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHbasetoHbase --master yarn --deploy-mode client /opt/female/FemaleInfoCollection.jar <zkQuorum>
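
    For illustration, with hypothetical ZooKeeper instance IP addresses (replace them with the ZooKeeper addresses of your own cluster), the submission could look like this:

    bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHbasetoHbase --master yarn --deploy-mode client /opt/female/FemaleInfoCollection.jar 192.168.0.11,192.168.0.12,192.168.0.13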

Running the Spark Hive to HBase Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the Spark Hive to HBase sample application (Scala and Java).

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    When running the sample application, you need to specify <zkQuorum>, which indicates the IP address of ZooKeeper.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHivetoHbase --master yarn --deploy-mode client /opt/female/FemaleInfoCollection.jar <zkQuorum>

Running the Spark Streaming Kafka to HBase Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the Spark Streaming Kafka to HBase sample application (Scala and Java).

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    When running the sample application, you need to specify <checkPointDir> <topic> <brokerList>. <checkPointDir> indicates an HDFS path for storing the application result backup. <topic> indicates a topic name read from Kafka. <brokerList> indicates the IP address of the Kafka server.

    NOTE:

    The path of the Spark Streaming Kafka dependency package on the client is different from that of other dependency packages. For example, the path of other dependency packages is $SPARK_HOME/jars, and the path of the Spark Streaming Kafka dependency package is $SPARK_HOME/jars/streamingClient010. Therefore, when you run an application, you need to add a configuration item to the spark-submit command to specify the path of the Spark Streaming Kafka dependency package, for example, --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-*.jar.

    Spark Streaming To HBase Sample Code

    bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-0*.jar --class com.huawei.bigdata.spark.examples.streaming.SparkOnStreamingToHbase /opt/female/FemaleInfoCollectionPrint.jar <checkPointDir> <topic> <brokerList>

    NOTE:
    • The JAR file names in --jars vary depending on the cluster.
    • The value of <brokerList> is in brokerIp:9092 format.

Running the "Connecting Spark Streaming to Kafka0-10" Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the sample application (Scala and Java) for connecting Spark Streaming to Kafka 0-10.

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    When running the sample application, you need to specify <checkpointDir> <brokers> <topic> <batchTime>. <checkpointDir> indicates an HDFS path for storing the application result backup. <brokers> indicates the Kafka address for obtaining metadata; the value is in brokerIp:21007 format for a security cluster and brokerIp:9092 for a normal cluster. <topic> indicates a topic name read from Kafka. <batchTime> indicates the interval at which Streaming processes data in batches.

    "Spark Streaming Reads Kafka 0-10" Sample Code

    • Run the following commands to submit a security cluster task:

      bin/spark-submit --master yarn --deploy-mode client --files ./conf/jaas.conf,./conf/user.keytab --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.SecurityKafkaWordCount /opt/SparkStreamingKafka010JavaExample-*.jar <checkpointDir> <brokers> <topic> <batchTime>

      The configuration example is as follows:

      --files ./jaas.conf,./user.keytab // Use --files to specify the jaas.conf and keytab files.
      --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" // Specify the path of the jaas.conf file on the driver. In yarn-client mode, use --driver-java-options "-Djava.security.auth.login.config" to specify it. In yarn-cluster mode, use --conf "spark.yarn.cluster.driver.extraJavaOptions" to specify it.
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" // Specify the path of the jaas.conf file on the executor.
    • Command for submitting tasks in the normal cluster:

      bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.SecurityKafkaWordCount /opt/SparkStreamingKafka010JavaExample-*.jar <checkpointDir> <brokers> <topic> <batchTime>

      Spark Streaming Write To Kafka 0-10 code example (this example exists only in mrs-sample-project-1.6.0.zip):

      bin/spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-streaming-kafka-*.jar --class com.huawei.bigdata.spark.examples.JavaDstreamKafkaWriter /opt/JavaDstreamKafkaWriter.jar <checkPointDir> <brokers> <topics>

Running the Spark Structured Streaming Sample Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Run the Spark Structured Streaming sample application (Scala and Java).

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the conf directory of the Spark client and invoke the spark-submit script to run code.

    When running the sample application, you need to specify <brokers> <subscribe-type> <topic> <protocol> <service> <domain>. <brokers> indicates the Kafka address for obtaining metadata. <subscribe-type> indicates the Kafka subscription type (generally subscribe, which subscribes to the specified topic). <topic> indicates the name of the topic read from Kafka. <protocol> indicates the security access protocol, <service> the Kerberos service name, and <domain> the Kerberos domain name.

    NOTE:

    For a normal cluster, comment out some code for configuring the Kafka security protocol. For details, see the description in Java Sample Code and Scala Sample Code.

    The path of the Spark Structured Streaming Kafka dependency package on the client is different from that of other dependency packages. For example, the path of other dependency packages is $SPARK_HOME/jars, and the path of the Spark Structured Streaming Kafka dependency package is $SPARK_HOME/jars/streamingClient010. Therefore, when you run an application, you need to add a configuration item to the spark-submit command to specify the path of the Spark Structured Streaming Kafka dependency package, for example, --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-sql-kafka-*.jar.

    Sample Code for Connecting Spark Structured Streaming to Kafka

    • Run the following commands to submit tasks in the security cluster:

      cd /opt/client/Spark/spark/conf

      spark-submit --master yarn --deploy-mode client --files ./jaas.conf,./user.keytab --driver-java-options "-Djava.security.auth.login.config=./jaas.conf" --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf" --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-sql-kafka-*.jar --class com.huawei.bigdata.spark.examples.SecurityKafkaWordCount /root/jars/SparkStructuredStreamingJavaExample-*.jar <brokers> <subscribe-type> <topic> <protocol> <service> <domain>

    • Command for submitting tasks in the normal cluster:

      spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-sql-kafka-*.jar --class com.huawei.bigdata.spark.examples.SecurityKafkaWordCount /root/jars/SparkStructuredStreamingJavaExample-*.jar <brokers> <subscribe-type> <topic> <protocol> <service> <domain>

      The configuration example is as follows:

      --files <local Path>/jaas.conf,<local Path>/user.keytab // Use --files to specify the jaas.conf and keytab files.
      --driver-java-options "-Djava.security.auth.login.config=<local Path>/jaas.conf" // Specify the path of the jaas.conf file on the driver. In yarn-client mode, use --driver-java-options "-Djava.security.auth.login.config" to specify it. In yarn-cluster mode, use --conf "spark.yarn.cluster.driver.extraJavaOptions" to specify it. If an error is reported indicating that you have no permission to read and write the local directory, specify spark.sql.streaming.checkpointLocation, and ensure that you have read and write permissions on the directory it points to.
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"  // Specify the path of the jaas.conf file on the executor.
      The JAR file names in --jars vary depending on the cluster.
      In a security cluster, <brokers> is in brokerIp:21007 format. For the <protocol> <service> <domain> format, see the $KAFKA_HOME/config/consumer.properties file.
      In a normal cluster, <brokers> is in brokerIp:9092 format, <protocol> is set to null, and <service> is kafka. For details about <domain>, see the $KAFKA_HOME/config/consumer.properties file.
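
      Putting the notes above together, a normal-cluster submission might look like the following. Only null and kafka are prescribed above; the broker IP and topic name are hypothetical placeholders, and <domain> is still taken from the consumer.properties file:

      spark-submit --master yarn --deploy-mode client --jars $SPARK_HOME/jars/streamingClient010/kafka-clients-*.jar,$SPARK_HOME/jars/streamingClient010/kafka_*.jar,$SPARK_HOME/jars/streamingClient010/spark-sql-kafka-*.jar --class com.huawei.bigdata.spark.examples.SecurityKafkaWordCount /root/jars/SparkStructuredStreamingJavaExample-*.jar 192.168.0.30:9092 subscribe testtopic null kafka <domain>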

Submitting a Python Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Submit a Python application.

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    Go to the Spark client directory and invoke the bin/spark-submit script to run code.

    <inputPath> indicates the input path in HDFS.

    NOTE:

    The sample code does not provide authentication information. Therefore, you need to configure spark.yarn.keytab and spark.yarn.principal to specify authentication information.

    bin/spark-submit --master yarn --deploy-mode client --conf spark.yarn.keytab=/opt/FIclient/user.keytab --conf spark.yarn.principal=sparkuser /opt/female/SparkPythonExample/collectFemaleInfo.py <inputPath>

Submitting the SparkLauncher Application

  1. Run the mvn package command in the project directory to generate a JAR file and obtain it from the target directory in the project directory, for example, FemaleInfoCollection.jar.
  2. Copy the generated JAR file (for example, CollectFemaleInfo.jar) to the Spark operating environment (that is, the Spark client), for example, /opt/female. In the security cluster with Kerberos authentication enabled, copy the user.keytab and krb5.conf files obtained in Preparing a Spark Application Development User to the conf directory of the Spark client, for example, /opt/client/Spark/spark/conf. For a cluster with Kerberos authentication disabled, you do not need to copy the user.keytab and krb5.conf files.
  3. Submit the SparkLauncher application.

    NOTICE:
    • Do not restart the HDFS service or all DataNode instances during Spark job running. Otherwise, the job may fail and some JobHistory data may be lost.
    • When running the program, you can select the following running mode as required:
      • --deploy-mode client: The driver process runs on the client, and the running result is output directly after the program finishes running.
      • --deploy-mode cluster: The driver process runs in ApplicationMaster (AM) of Yarn. The running result and logs are displayed on the Yarn web UI.

    java -cp $SPARK_HOME/jars/*:{JAR_PATH} com.huawei.bigdata.spark.examples.SparkLauncherExample yarn-client {TARGET_JAR_PATH} {TARGET_JAR_MAIN_CLASS} {args}

    NOTE:
    • JAR_PATH indicates the path of the SparkLauncher JAR package.
    • TARGET_JAR_PATH indicates the path of the JAR package of the Spark application to be submitted.
    • TARGET_JAR_MAIN_CLASS indicates the main class of the Spark application to be submitted.
    • args indicates the parameters of the Spark application to be submitted.
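
    For illustration, with hypothetical JAR paths (the SparkLauncher example JAR name below is a placeholder; the target JAR, main class, and argument reuse the Spark Core example from this page), the command could look like:

    java -cp $SPARK_HOME/jars/*:/opt/female/SparkLauncherJavaExample-1.0.jar com.huawei.bigdata.spark.examples.SparkLauncherExample yarn-client /opt/female/FemaleInfoCollection.jar com.huawei.bigdata.spark.examples.FemaleInfoCollection <inputPath>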

Related Information

The running dependency packages of the "Accessing Spark SQL Through JDBC" sample application (Scala and Java) are as follows:

  • "Accessing Spark SQL Through JDBC" Sample Projects (Scala)
    • commons-collections-<version>.jar
    • commons-configuration-<version>.jar
    • commons-io-<version>.jar
    • commons-lang-<version>.jar
    • commons-logging-<version>.jar
    • guava-<version>.jar
    • hadoop-auth-<version>.jar
    • hadoop-common-<version>.jar
    • hadoop-mapreduce-client-core-<version>.jar
    • hive-exec-<version>.spark2.jar
    • hive-jdbc-<version>.spark2.jar
    • hive-metastore-<version>.spark2.jar
    • hive-service-<version>.spark2.jar
    • httpclient-<version>.jar
    • httpcore-<version>.jar
    • libthrift-<version>.jar
    • log4j-<version>.jar
    • slf4j-api-<version>.jar
    • zookeeper-<version>.jar
    • scala-library-<version>.jar
  • "Accessing Spark SQL Through JDBC" Sample Projects (Java)
    • commons-collections-<version>.jar
    • commons-configuration-<version>.jar
    • commons-io-<version>.jar
    • commons-lang-<version>.jar
    • commons-logging-<version>.jar
    • guava-<version>.jar
    • hadoop-auth-<version>.jar
    • hadoop-common-<version>.jar
    • hadoop-mapreduce-client-core-<version>.jar
    • hive-exec-<version>.spark2.jar
    • hive-jdbc-<version>.spark2.jar
    • hive-metastore-<version>.spark2.jar
    • hive-service-<version>.spark2.jar
    • httpclient-<version>.jar
    • httpcore-<version>.jar
    • libthrift-<version>.jar
    • log4j-<version>.jar
    • slf4j-api-<version>.jar
    • zookeeper-<version>.jar
