Updated on 2024-11-29 GMT+08:00

Getting Started

This section describes how to use Spark from scratch and submit Spark applications, covering both Spark Core and Spark SQL. Spark Core is the kernel module of Spark: it schedules and executes tasks and provides the base APIs used to develop Spark applications. Spark SQL is the module that executes SQL statements.

Scenario Description

Develop a Spark application to analyze logs of netizens' dwell time for online shopping over a weekend.

  • Collect statistics on female netizens whose total online shopping dwell time over the weekend exceeds 2 hours.
  • In each log file, the first column records names, the second column records genders, and the third column records dwell durations in minutes. The columns are separated by commas (,).

log1.txt: logs collected on Saturday

LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60

log2.txt: logs collected on Sunday

LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
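The statistics described above can be sketched in plain Python to make the expected result concrete. This is only an illustration of the computation; the sample Spark application implements the same logic with Spark APIs:

```python
from collections import defaultdict

# Weekend logs from the scenario description, one "name,gender,minutes" record per line.
LOG1 = """LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60"""

LOG2 = """LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60"""

def female_dwell_over_two_hours(*logs):
    """Sum dwell minutes per female netizen across all logs; keep totals above 2 hours (120 min)."""
    totals = defaultdict(int)
    for log in logs:
        for line in log.splitlines():
            name, gender, minutes = line.split(",")
            if gender == "female":
                totals[name] += int(minutes)
    return {name: total for name, total in totals.items() if total > 120}

print(female_dwell_over_two_hours(LOG1, LOG2))
# {'CaiXuyu': 300, 'FangBo': 320}
```

With the sample data, only CaiXuyu (300 minutes) and FangBo (320 minutes) exceed the 2-hour threshold over the weekend.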

Prerequisites

  • On Manager, you have created a user and granted the HDFS, Yarn, Kafka, and Hive permissions to the user.
  • You have installed and configured tools such as IntelliJ IDEA and JDK based on the development language.
  • The Spark client has been installed and the network connection of the client has been configured.
  • For Spark SQL programs, you have started Spark SQL or Beeline on the client to enter SQL statements.

Procedure

  1. Obtain the sample project and import it into IntelliJ IDEA. Import the JAR packages on which the sample project depends, and use IDEA to configure and build the application JAR file.
  2. Prepare the data required by the sample project.

    Save the original log files in the scenario description to the HDFS system.
    1. Create two text files (input_data1.txt and input_data2.txt) on the local host and copy the content of the log1.txt and log2.txt files into input_data1.txt and input_data2.txt, respectively.
    2. Create the /tmp/input directory in HDFS and upload input_data1.txt and input_data2.txt to it:

      hdfs dfs -mkdir -p /tmp/input

      hdfs dfs -put input_data1.txt input_data2.txt /tmp/input

  3. Upload the generated JAR file to the Spark running environment (Spark client), for example, /opt/female.
  4. Go to the client directory, configure the environment variables, and log in to the system. If multiple Spark instances or services are installed, run the following commands to load the environment variables of the target instance before using the client to connect to it:

    source bigdata_env

    source Spark/component_env

    kinit <Service user for authentication>

  5. Run the following script in the bin directory to submit the Spark application:

    spark-submit --class com.xxx.bigdata.spark.examples.FemaleInfoCollection --master yarn-client /opt/female/FemaleInfoCollection.jar <inputPath>

    • FemaleInfoCollection.jar is the JAR package generated in 1.
    • <inputPath> is the directory created in 2.b.

  6. (Optional) After calling the spark-sql or spark-beeline script in the bin directory, enter SQL statements directly to perform operations such as queries.

    For example, create a table, insert a piece of data, and then query the table.

    spark-sql> CREATE TABLE TEST(NAME STRING, AGE INT);
    Time taken: 0.348 seconds
    spark-sql> INSERT INTO TEST VALUES('Jack', 20);
    Time taken: 1.13 seconds
    spark-sql> SELECT * FROM TEST;
    Jack      20
    Time taken: 0.18 seconds, Fetched 1 row(s)
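Before running the statements against a cluster, the same CREATE/INSERT/SELECT sequence can be dry-run locally, for example with Python's built-in sqlite3 module. This is only a hypothetical stand-in for illustration: SQLite is not Spark SQL and its dialect differs (e.g. STRING becomes TEXT):

```python
import sqlite3

# In-memory database as a local stand-in for the spark-sql session.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE TEST(NAME TEXT, AGE INTEGER)")  # STRING -> TEXT in SQLite
cur.execute("INSERT INTO TEST VALUES('Jack', 20)")
cur.execute("SELECT * FROM TEST")
rows = cur.fetchall()
print(rows)  # [('Jack', 20)]
conn.close()
```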

  7. View the running result of the Spark application.

    • View the running result data in a specified file.

      The storage path and format of the result data are specified by the Spark application.

    • Check the running status on the web page.
      1. Log in to Manager. Select Spark from the Service drop-down list.
      2. Go to the Spark overview page and click any SparkWebUI instance, for example, JobHistory(host2).
      3. The History Server UI is displayed.

        The History Server UI displays the status of both completed and incomplete (running) Spark applications.

        Figure 1 History Server UI
      4. Click an application ID to go to the Spark UI of that application.

        Spark UI: used to display the status of running applications.

        Figure 2 Spark UI
    • View Spark logs to learn about the application's running status.

      Refer to the Spark log overview to locate the log files, and adjust the application based on the log information.