Updated on 2024-08-10 GMT+08:00

Spark Client CLI

For how to use the Spark CLIs, visit the official website http://spark.apache.org/docs/3.1.1/quick-start.html.

Common CLIs

Common Spark CLIs are as follows:

  • spark-shell

    This command provides an easy way to get familiar with the Spark APIs, similar to an interactive data analysis tool. It supports two languages: Scala and Python. In the Spark directory, run the ./bin/spark-shell command to open the interactive Scala interface, read data from HDFS, and then perform RDD operations on it.

    Example: A single line of code counts the occurrences of every word in a file.

    scala> sc.textFile("hdfs://10.96.1.57:9000/wordcount_data.txt").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_+_).collect()
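
    Note that collect() returns the full result set to the driver, which can be a problem for large inputs. As a sketch under the same assumptions (the HDFS address above is only an example), the counts can instead be written back to HDFS with saveAsTextFile; the output directory name here is hypothetical and must not already exist:

    scala> sc.textFile("hdfs://10.96.1.57:9000/wordcount_data.txt").flatMap(l => l.split(" ")).map(w => (w,1)).reduceByKey(_+_).saveAsTextFile("hdfs://10.96.1.57:9000/wordcount_result")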

    You can specify the keytab file and principal directly on the command line for authentication. The keytab credentials and delegation tokens are then renewed periodically so that the authentication does not expire. The following command is an example:

    spark-shell --principal spark2x/hadoop.<System domain name>@<System domain name> --keytab ${BIGDATA_HOME}/FusionInsight_Spark2x_8.1.0.1/install/FusionInsight-Spark2x-3.1.1/keytab/spark2x/SparkResource/spark2x.keytab --master yarn

  • spark-submit

    This command submits a Spark application to a Spark cluster for execution and returns the result. You must specify the class, the master, the JAR file, and the input parameters.

    Example: Run the GroupByTest class in the example JAR file with four input parameters, in local mode with a single core (local[1]).

    ./bin/spark-submit --class org.apache.spark.examples.GroupByTest --master local[1] examples/jars/spark-examples_2.12-3.1.1-hw-ei-311001-SNAPSHOT.jar 6 10 10 3

    As with spark-shell, you can specify the keytab file and principal directly on the command line for authentication, and the keytab credentials and delegation tokens are renewed periodically so that the authentication does not expire. The following command is an example:

    spark-submit --class org.apache.spark.examples.GroupByTest --master yarn --principal spark2x/hadoop.<System domain name>@<System domain name> --keytab ${BIGDATA_HOME}/FusionInsight_Spark2x_8.1.0.1/install/FusionInsight-Spark2x-3.1.1/keytab/spark2x/SparkResource/spark2x.keytab examples/jars/spark-examples_2.12-3.1.1-hw-ei-311001-SNAPSHOT.jar 6 10 10 3

  • spark-sql

    This command runs a SQL query command line backed by the Hive metastore, in local or cluster mode. If you need to view the logical plan of a statement, add "explain extended" before the SQL statement.

    The following is an example:

    SELECT key FROM src GROUP BY key;
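
    To inspect the logical plan of the same query (assuming the src table from the example above exists), prefix it with explain extended as described:

    explain extended select key from src group by key;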

    You can specify the keytab file and principal directly on the command line for authentication, and the keytab credentials and delegation tokens are renewed periodically so that the authentication does not expire. The following command is an example:

    spark-sql --principal spark2x/hadoop.<System domain name>@<System domain name> --keytab ${BIGDATA_HOME}/FusionInsight_Spark2x_8.1.0.1/install/FusionInsight-Spark2x-3.1.1/keytab/spark2x/SparkResource/spark2x.keytab --master yarn

  • run-example

    This command runs or debugs the built-in examples provided by the open-source Spark community.

    Example: Run SparkPi.

    ./bin/run-example SparkPi 100