
Introduction to Spark Application Development

Spark

Spark is a distributed batch processing framework. It provides analysis and mining capabilities as well as iterative in-memory computing, and it supports application development in multiple programming languages, including Scala, Java, and Python. Spark applies to the following scenarios:

  • Data processing: Spark can process data quickly, with fault tolerance and scalability (see the word count sketch after Figure 1).
  • Iterative computation: Spark supports iterative computation to meet the requirements of multi-step data processing logic.
  • Data mining: Spark can perform complex data mining and analysis on massive data and supports multiple data mining and machine learning algorithms.
  • Streaming processing: Spark supports stream processing with second-level latency and supports multiple external data sources.
  • Query analysis: Spark supports standard SQL query analysis, provides a DSL (DataFrame), and supports multiple external inputs (a DataFrame sketch follows Table 1).

Figure 1 shows the component architecture of Apache Spark. This section provides guidance for application development with Spark, Spark SQL, and Spark Streaming. For details about MLlib and GraphX, visit the Spark official website at http://spark.apache.org/docs/2.2.2/.
Figure 1 Spark architecture
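
As a minimal illustration of the data processing scenario above, the following Scala sketch counts word occurrences with the RDD API. It is a sketch only: the application name, object name, and HDFS input and output paths are placeholders to adapt to your environment.

  import org.apache.spark.{SparkConf, SparkContext}

  object WordCount {
    def main(args: Array[String]): Unit = {
      // Create the application configuration and context.
      val conf = new SparkConf().setAppName("WordCount")
      val sc = new SparkContext(conf)

      // Read a text file, split each line into words, and count each word.
      val counts = sc.textFile("hdfs:///tmp/input.txt")   // placeholder input path
        .flatMap(line => line.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      // Write the results back to HDFS and release resources.
      counts.saveAsTextFile("hdfs:///tmp/wordcount-output")   // placeholder output path
      sc.stop()
    }
  }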

Spark APIs

Spark supports application development in multiple programming languages, including Scala, Java, and Python. Since Spark is developed in Scala and Scala is easy to read, you are advised to develop Spark applications in Scala.

Table 1 describes Spark APIs in different languages.

Table 1 Spark APIs

  • Scala API: The API in Scala. Because Scala is easy to read, you are advised to use the Scala APIs to develop applications.
  • Java API: The API in Java.
  • Python API: The API in Python.
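
To show the recommended Scala API together with the DataFrame DSL mentioned in the query analysis scenario, here is a short sketch. The SparkSession entry point is standard in Spark 2.x; the JSON file path, column names, and view name are illustrative placeholders.

  import org.apache.spark.sql.SparkSession

  object SqlExample {
    def main(args: Array[String]): Unit = {
      // SparkSession is the entry point for Spark SQL in Spark 2.x.
      val spark = SparkSession.builder().appName("SqlExample").getOrCreate()

      // Load a JSON file into a DataFrame; the path is a placeholder.
      val df = spark.read.json("hdfs:///tmp/people.json")

      // DSL-style query: select and filter without writing SQL text.
      df.select("name", "age").filter(df("age") > 21).show()

      // The equivalent standard SQL query against a temporary view.
      df.createOrReplaceTempView("people")
      spark.sql("SELECT name, age FROM people WHERE age > 21").show()

      spark.stop()
    }
  }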

Spark Core and Spark Streaming applications are developed using the APIs listed in the preceding table. Spark SQL can be accessed through the CLI or ThriftServer, and there are two ways to access ThriftServer: Beeline and JDBC client code.
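
The JDBC client approach can be sketched as follows. This assumes the standard Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath; the host, port (10000 is the common default), database, and user are placeholders to replace with your cluster's values.

  import java.sql.DriverManager

  object ThriftServerClient {
    def main(args: Array[String]): Unit = {
      // Register the Hive JDBC driver used to talk to the Spark ThriftServer.
      Class.forName("org.apache.hive.jdbc.HiveDriver")

      // Placeholder connection URL and credentials.
      val url = "jdbc:hive2://<thriftserver-host>:10000/default"
      val connection = DriverManager.getConnection(url, "<user>", "")

      try {
        // Run a simple query and print the first column of each row.
        val statement = connection.createStatement()
        val results = statement.executeQuery("SHOW TABLES")
        while (results.next()) {
          println(results.getString(1))
        }
      } finally {
        connection.close()
      }
    }
  }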

For the spark-sql, spark-shell, and spark-submit scripts (when the running application contains SQL operations), do not use the proxy user parameter (--proxy-user) to submit a task.