Updated on 2022-09-14 GMT+08:00

Application Development Overview

Spark Introduction

Spark is a distributed batch processing system as well as an analysis and mining engine. It provides an iterative in-memory computation framework and supports development in multiple programming languages, including Scala, Java, and Python. The application scenarios of Spark include:

  • Data processing: Spark can process data quickly and is fault-tolerant and scalable.
  • Iterative computation: Spark supports iterative computation to meet the requirements of multi-step data processing logic.
  • Data mining: Spark can perform complex data mining and analysis on massive data and supports multiple data mining and machine learning algorithms.
  • Streaming processing: Spark supports stream processing at second-level latency and supports multiple external data sources.
  • Query analysis: Spark supports standard SQL query analysis, provides a DSL (DataFrame), and supports multiple external inputs, as shown in the sketch after this list.
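
The following Scala sketch illustrates the query analysis scenario with both standard SQL and the DataFrame DSL. It is a minimal example, not part of this guide; the input path, view name, and column names are placeholders.

    import org.apache.spark.sql.SparkSession

    object QueryExample {
      def main(args: Array[String]): Unit = {
        // SparkSession is the unified entry point for Spark SQL.
        val spark = SparkSession.builder()
          .appName("QueryExample")
          .getOrCreate()

        // The input path and schema are placeholders.
        val df = spark.read.json("/tmp/people.json")

        // Standard SQL query analysis through a temporary view.
        df.createOrReplaceTempView("people")
        spark.sql("SELECT name, age FROM people WHERE age >= 18").show()

        // The equivalent query expressed in the DataFrame DSL.
        df.filter(df("age") >= 18).select("name", "age").show()

        spark.stop()
      }
    }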

This section focuses on the application development guides for Spark, Spark SQL, and Spark Streaming.

Spark Development API Introduction

Spark supports development in multiple programming languages, including Scala, Java, and Python. Because Spark is developed in Scala and Scala code is easy to read, you are advised to develop Spark applications in Scala.

Table 1 lists the Spark APIs by programming language.

Table 1 Spark APIs

  • Scala API: Indicates the APIs in Scala. For common APIs of Spark Core, Spark SQL, and Spark Streaming, see Scala. Because Scala is easy to read, you are advised to use Scala APIs in program development.
  • Java API: Indicates the APIs in Java. For common APIs of Spark Core, Spark SQL, and Spark Streaming, see Java.
  • Python API: Indicates the APIs in Python. For common APIs of Spark Core, Spark SQL, and Spark Streaming, see Python.

By access mode, the APIs listed in the preceding table apply to the development of Spark Core and Spark Streaming applications. Spark SQL can be accessed through the CLI or through JDBCServer. JDBCServer can be accessed in two ways: Beeline and JDBC client code. For details, see JDBCServer Interface.
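
As an illustration of the JDBC client approach, the following Scala sketch connects to JDBCServer through the Hive JDBC driver. The host name, port, and database are assumptions that depend on the actual cluster; in a secure cluster, additional authentication parameters are required.

    import java.sql.DriverManager

    object JDBCServerClient {
      def main(args: Array[String]): Unit = {
        // JDBCServer is accessed through the Hive JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver")

        // Placeholder address; replace with the actual JDBCServer host and port.
        val url = "jdbc:hive2://jdbcserver-host:10000/default"
        val connection = DriverManager.getConnection(url, "", "")
        try {
          val statement = connection.createStatement()
          val results = statement.executeQuery("SHOW TABLES")
          while (results.next()) {
            println(results.getString(1))
          }
        } finally {
          connection.close()
        }
      }
    }

The same server can also be reached interactively with Beeline, for example: beeline -u "jdbc:hive2://jdbcserver-host:10000".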

When running spark-sql, spark-shell, or spark-submit for an application that contains SQL operations, do not use the proxy-user parameter to submit the task. This is partly because the spark-sql script does not support task submission with the proxy-user parameter, and partly because the sample programs in this document already include security authentication.
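
For example, such an application might be submitted as follows; the class name and JAR file are hypothetical, and note that no proxy-user parameter appears:

    spark-submit --master yarn --deploy-mode client \
      --class com.example.SparkSQLExample \
      SparkSQLExample.jar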