
Basic Concepts

  • RDD

    Resilient Distributed Dataset (RDD) is a core concept of Spark. It is a read-only, partitionable distributed dataset. Some or all of the data in an RDD can be cached in memory and reused between computations.

    RDD creation

    • An RDD can be created from the input of HDFS or other storage systems that are compatible with Hadoop.
    • A new RDD can be converted from a parent RDD.
    • An RDD can be created from a collection of data through coding (for example, by parallelizing a local collection), as in the sketch below.
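
    A minimal Scala sketch of these three creation paths, assuming sc is an initialized SparkContext and the HDFS path is a placeholder:

    // Create an RDD from HDFS input (or another Hadoop-compatible storage system).
    val lines = sc.textFile("hdfs://...")
    // Derive a new RDD from a parent RDD through a transformation.
    val lengths = lines.map(line => line.length)
    // Create an RDD from a local collection in code.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))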

    RDD storage

    • Users can select different storage levels to store an RDD for reuse. (There are 11 storage levels to store an RDD.)
    • By default, an RDD is stored in memory. When memory is insufficient, the RDD spills to disk (see the sketch below).
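
    A brief sketch of choosing a storage level; sc and the HDFS paths are placeholders:

    import org.apache.spark.storage.StorageLevel

    // cache() keeps the RDD at the default storage level, MEMORY_ONLY.
    val cached = sc.textFile("hdfs://...").cache()
    // persist() accepts an explicit storage level, for example memory with spill to disk.
    val persisted = sc.textFile("hdfs://...").persist(StorageLevel.MEMORY_AND_DISK)
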
  • RDD dependency

    The RDD dependency includes the narrow dependency and wide dependency.

    Figure 1 RDD dependency
    • Narrow dependency: Each partition of the parent RDD is used by at most one partition of the child RDD.
    • Wide dependency: Partitions of the child RDD depend on all partitions of the parent RDD.

    The narrow dependency facilitates optimization. Logically, each RDD operator is a fork/join (the join here is the barrier used to synchronize multiple concurrent tasks): the computation is forked to each partition, and after it finishes, the results are joined, after which the fork/join of the next RDD operator is performed. Translating an RDD graph directly into this physical implementation is uneconomical. First, every RDD (even an intermediate result) needs to be materialized into memory or storage, which is time-consuming and occupies much space. Second, as a global barrier, the join operation is very expensive, and the entire join process is slowed down by the slowest node. If the partitions of the child RDD narrowly depend on the partitions of the parent RDD, the two fork/join processes can be combined, which is a classic fusion optimization. If the relationship in a continuous operator sequence is narrow dependency, multiple fork/join processes can be combined, which reduces a large number of global barriers and eliminates the materialization of many intermediate RDD results. This greatly improves performance and is called pipeline optimization in Spark.
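
    As a rough illustration of this pipeline optimization, the following chain contains only narrow dependencies, so Spark can fuse it into a single stage without materializing the intermediate RDDs (sc and the path are assumptions):

    val lines = sc.textFile("hdfs://...")       // parent RDD
    val trimmed = lines.map(_.trim)             // narrow dependency: one-to-one per partition
    val nonEmpty = trimmed.filter(_.nonEmpty)   // still narrow: fused into the same stage
    // Only the action below triggers execution; map and filter run back to back
    // on each partition without a global barrier in between.
    println(nonEmpty.count())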

  • Transformation and action (RDD operations)

    Operations on an RDD include transformation (the return value is an RDD) and action (the return value is not an RDD). Figure 2 shows the RDD operation process. Transformations are lazy: the transformation from one RDD to another RDD is not executed immediately. Spark only records the transformation and does not execute it until an action is triggered. The action either returns results or writes RDD data into the storage system. The action is the driving force for Spark to start the computation.

    Figure 2 RDD operation

    The data and operation model of an RDD is quite different from that of Scala, as shown in the following example.

    val file = sc.textFile("hdfs://...")
    val errors = file.filter(_.contains("ERROR"))
    errors.cache()
    errors.count()
    1. The textFile operator reads log files from HDFS and returns file (an RDD).
    2. The filter operator filters rows that contain ERROR and assigns them to errors (a new RDD). The filter operator is a transformation.
    3. The cache operator caches errors for future use.
    4. The count operator returns the number of rows in errors. The count operator is an action.
    Transformations include the following types:
    • The RDD elements are regarded as simple elements.

      The input and output have a one-to-one relationship, and the partition structure of the result RDD remains unchanged, for example, map.

      The input and output have a one-to-many relationship, and the partition structure of the result RDD remains unchanged, for example, flatMap (each element is mapped to a sequence of multiple elements, which is then flattened into multiple output elements).

      The input and output have a one-to-one relationship, but the partition structure of the result RDD changes, for example, union (two RDDs are combined into one, and the number of partitions becomes the sum of the partitions of the two RDDs) and coalesce (the number of partitions is reduced).

      Operators that select some elements from the input, such as filter, distinct (duplicate elements are removed), subtract (elements that exist only in this RDD are retained), and sample (samples are taken).

    • The RDD elements are regarded as key-value pairs.

      Perform one-to-one calculation on a single RDD, such as mapValues (the partitioning of the source RDD is retained, which is different from map).

      Sort a single RDD, such as sort and partitionBy (consistent partitioning, which is important for local optimization).

      Restructure and reduce a single RDD based on the key, such as groupByKey and reduceByKey.

      Join and restructure two RDDs based on the key, such as join and cogroup.

      The latter three operations involve sorting and are called shuffle operations (see the sketch below).
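
    A short sketch of the key-value transformations above; the sample data is made up for illustration, and sc is an initialized SparkContext:

    val sales = sc.parallelize(Seq(("apple", 3), ("pear", 2), ("apple", 5)))
    val prices = sc.parallelize(Seq(("apple", 1.2), ("pear", 0.8)))

    val doubled = sales.mapValues(_ * 2)    // one-to-one on values, partitioning retained
    val totals = sales.reduceByKey(_ + _)   // restructure and reduce by key (shuffle)
    val joined = totals.join(prices)        // join two RDDs by key (shuffle)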

    Actions include the following types (see the sketch after this list):

    • Generate scalar values, such as count (returns the number of elements in the RDD), reduce and fold/aggregate (return a scalar aggregate), and take (returns the first several elements).
    • Generate a Scala collection, such as collect (imports all elements in the RDD into a Scala collection) and lookup (looks up all values corresponding to the key).
    • Write data to the storage, such as saveAsTextFile (which corresponds to the preceding textFile).
    • Checkpoints, such as checkpoint. When the lineage is quite long (which occurs frequently in graph computation), it takes a long time to re-execute the whole sequence when a fault occurs. In this case, checkpoint is used to write the current data to stable storage.
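
    A sketch of these action types; sc and the HDFS paths are placeholders:

    val nums = sc.parallelize(1 to 100)
    val total = nums.reduce(_ + _)              // scalar result
    val firstTen = nums.take(10)                // first elements
    val all = nums.collect()                    // Scala collection on the driver
    nums.saveAsTextFile("hdfs://.../output")    // write to storage

    // Checkpointing requires a checkpoint directory to be set beforehand.
    sc.setCheckpointDir("hdfs://.../checkpoints")
    nums.checkpoint()
    nums.count()                                // the action triggers the checkpoint
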
  • Shuffle

    Shuffle is a specific phase in the MapReduce framework, located between the Map phase and the Reduce phase. If the output results of Map are to be used by Reduce, the output results must be hashed based on a key and distributed to each Reducer. This process is called Shuffle. Shuffle involves disk reads and writes as well as network transmission, so the performance of Shuffle directly affects the operation efficiency of the entire program.

    The following figure shows the entire process of the MapReduce algorithm.

    Figure 3 Algorithm process

    Shuffle is the bridge that connects the Map output to the Reduce input. The following describes the implementation of shuffle in Spark.

    Shuffle divides a Spark job into multiple stages. The former stages contain one or more ShuffleMapTasks, and the last stage contains one or more ResultTasks, as in the sketch below.
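
    As a rough illustration, the wide dependency introduced by reduceByKey splits the following job into two stages, and toDebugString prints the resulting lineage; sc and the sample data are assumptions:

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val pairs = words.map(word => (word, 1))    // narrow: stays in the first stage (ShuffleMapTask)
    val counts = pairs.reduceByKey(_ + _)       // wide: the shuffle boundary starts a new stage (ResultTask)
    println(counts.toDebugString)               // shows the shuffle/stage structure of the lineage
    counts.collect()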

  • Spark application structure

    The Spark application structure includes the initialized SparkContext and the main program.

    • Initialized SparkContext: constructs the operating environment of the Spark application.

      Constructs the SparkContext object, for example,

      new SparkContext(master, appName, [SparkHome], [jars])

      Parameter description:

      master: indicates the link string. The link modes include local, Yarn-cluster, and Yarn-client.

      appName: indicates the application name.

      SparkHome: indicates the directory where Spark is installed in the cluster.

      jars: indicates the code and dependency package of the application.
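
      For illustration, a construction sketch along the lines above; the master URL, application name, and paths are placeholders (newer code usually passes a SparkConf instead):

      import org.apache.spark.SparkContext

      val sc = new SparkContext(
        "local[2]",                      // master: link string (local, YARN, and so on)
        "ExampleApp",                    // appName
        "/opt/spark",                    // SparkHome: Spark installation directory (placeholder)
        Seq("/path/to/app.jar"))         // jars: application code and dependency packages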

    • Main program: processes data.

    For details about how to submit an application, see https://spark.apache.org/docs/2.2.2/submitting-applications.html.

  • Spark shell commands

    The basic Spark shell command is used to submit Spark applications. The command is as follows:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      ... # other options
      <application-jar> \
      [application-arguments]

    Parameter description:

    --class: indicates the name of the class of the Spark application.

    --master: indicates the master to which the Spark application links, such as Yarn-client and Yarn-cluster.

    application-jar: indicates the path of the JAR file of the Spark application.

    application-arguments: indicates the parameter required to submit the Spark application. This parameter can be left blank.
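
    For example, a submission sketch with a hypothetical application class, JAR path, and arguments (in newer Spark versions, --master yarn combined with --deploy-mode replaces the yarn-client and yarn-cluster master values):

    ./bin/spark-submit \
      --class com.example.WordCount \
      --master yarn \
      --deploy-mode cluster \
      /path/to/wordcount.jar \
      hdfs://.../input hdfs://.../output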

  • Spark JobHistory Server

    It is used to monitor the details of each phase of a running or historical Spark job in the Spark framework and to display logs, which helps users develop, configure, and optimize jobs at a finer granularity.

Basic Concepts of Spark SQL

DataFrame

DataFrame is a structured and distributed dataset consisting of multiple columns. A DataFrame is equivalent to a table in a relational database or a DataFrame in R/Python. DataFrame is the most basic concept in Spark SQL and can be created from multiple sources, such as structured datasets, Hive tables, external databases, or RDDs.

The program entry of Spark SQL is the SQLContext class (or its subclasses). Creating an SQLContext requires a SparkContext object as a construction parameter. HiveContext is one subclass of SQLContext. Compared with its parent class, HiveContext adds the HiveQL parser, UDFs, and the ability to read existing Hive data. However, HiveContext does not depend on a running Hive, only on the Hive class libraries.

By using SQLContext and its subclasses, DataFrame, the basic dataset of Spark SQL, can be easily created. DataFrame provides various APIs to Spark SQL and is compatible with various data sources, such as Parquet, JSON, Hive data, databases, and HBase. These data sources can be read by using the same syntax, as in the sketch below.
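
A minimal sketch of creating a DataFrame through SQLContext; sc is assumed to be an initialized SparkContext and the JSON path is a placeholder:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Read a JSON data source into a DataFrame; other sources (Parquet files,
    // Hive tables, JDBC databases) are read through the same read API.
    val df = sqlContext.read.json("hdfs://.../people.json")
    df.show()
    df.filter(df("age") > 21).show()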

Basic Concepts of Spark Streaming

DStream

DStream (Discretized Stream) is an abstraction provided by Spark Streaming.

A DStream is a continuous data stream that is either obtained from a data source or generated by transforming an input stream. In essence, a DStream is a series of continuous RDDs. An RDD is a read-only, partitionable distributed dataset.

Each RDD in the DStream contains data in a range, as shown in Figure 4.

Figure 4 Relationship between DStream and RDD

All operators applied to a DStream are translated into operations on the underlying RDDs, as shown in Figure 5. The transformations of the underlying RDDs are computed by the Spark engine. Most operation details are concealed by the DStream operators, and high-level APIs are provided for developers.

Figure 5 DStream operator translation
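
A brief Spark Streaming sketch in which each DStream operator is translated into the corresponding operation on the RDD of every batch; sc, the host, port, and batch interval are assumptions:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // One RDD is generated for the data received in each 5-second batch.
    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    // Each DStream operator below maps to the corresponding RDD operation
    // applied to every batch's RDD.
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()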