
Spark Client CLI

For details about how to use the Spark CLIs, see the official documentation at http://spark.apache.org/docs/3.1.1/quick-start.html.

Common CLIs

Common Spark CLIs are as follows:

  • spark-shell

    This command provides an easy way to become familiar with the Spark APIs, much like an interactive data analysis tool. The interactive shell is available in Scala (./bin/spark-shell) and Python (./bin/pyspark). In the Spark directory, run the ./bin/spark-shell command to start the interactive Scala shell, read data from HDFS, and perform RDD operations on it.

    Example: Count the occurrences of every word in a file with a single line of code.

    scala> sc.textFile("hdfs://10.96.1.57:9000/wordcount_data.txt").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_+_).collect()
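
    The same job written out step by step (the HDFS address and file name are taken from the example above and are placeholders for your own values):

    scala> val lines = sc.textFile("hdfs://10.96.1.57:9000/wordcount_data.txt") // read the file as an RDD of lines
    scala> val words = lines.flatMap(l => l.split(" "))                         // split each line into words
    scala> val counts = words.map(w => (w, 1)).reduceByKey(_ + _)               // sum a 1 for each occurrence of a word
    scala> counts.collect()                                                     // fetch the (word, count) pairs to the driver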

  • spark-submit

    This command submits a Spark application to a Spark cluster for execution and returns the result. The application's main class (--class), the master URL (--master), the application JAR file, and the input parameters must be specified.

    Example: Run the GroupByTest class in the examples JAR with four input parameters, with the master set to local mode using a single core (local[1]).

    ./bin/spark-submit --class org.apache.spark.examples.GroupByTest --master local[1] examples/jars/spark-examples_2.12-3.1.1-hw-ei-311001.jar 6 10 10 3
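
    A minimal application that could be packaged into a JAR and submitted the same way (a sketch only; the object name WordCount, the JAR name wordcount.jar, and the input path are placeholders, not artifacts shipped with Spark):

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // args(0) is the input path passed after the JAR on the spark-submit command line
        val spark = SparkSession.builder().appName("WordCount").getOrCreate()
        val counts = spark.sparkContext.textFile(args(0))
          .flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.collect().foreach(println) // print the (word, count) pairs on the driver
        spark.stop()
      }
    }

    Once packaged, it could be submitted as, for example:

    ./bin/spark-submit --class WordCount --master local[1] wordcount.jar hdfs://10.96.1.57:9000/wordcount_data.txt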

  • spark-sql

    This command starts the Spark SQL CLI, which runs the Hive metastore service in local mode and executes queries entered on the command line. To view the logical plan of a query, add EXPLAIN EXTENDED before the SQL statement.

    Example:

    SELECT key FROM src GROUP BY key;
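
    For example, to view the plan of the query above (a sketch; src is assumed to be an existing Hive table):

    spark-sql> EXPLAIN EXTENDED SELECT key FROM src GROUP BY key;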

  • run-example

    This command is used to run or debug the sample programs that ship with the open-source Spark distribution.

    Example: Run SparkPi.

    ./bin/run-example SparkPi 100
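
    For reference, SparkPi estimates Pi by Monte Carlo sampling; the following is a condensed sketch of what the example computes, where 100 is the number of slices passed on the command line above:

    // Sample random points in the unit square centered at the origin;
    // the fraction that lands inside the unit circle approximates Pi/4.
    val slices = 100
    val n = 100000L * slices
    val count = sc.parallelize(1L to n, slices).map { _ =>
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * count / n}")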