Development Plan

Overview

This section describes how to use Spark to perform the following operations on Hudi tables: inserting data, querying data, updating data, running incremental queries, running point-in-time queries, and deleting data.

For details, see the sample code.
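The operations listed above differ mainly in the options passed to the Spark DataFrame reader or writer. The following is a minimal Python sketch of those options, not the code of the sample project. The option keys are Hudi DataSource keys as used in Hudi 0.x releases (verify them against the Hudi version on your cluster). The field names uuid, partitionpath, and ts, and both helper function names, are assumptions made for illustration.

```python
# Sketch only: the helper names and the field names (uuid, partitionpath, ts)
# are assumptions for illustration and are not taken from the sample project.


def hudi_write_options(table_name, operation="upsert"):
    """Return DataFrame writer options for one Hudi write operation."""
    if operation not in ("insert", "upsert", "delete"):
        raise ValueError("unsupported operation: " + operation)
    return {
        "hoodie.table.name": table_name,
        # Assumed column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": "uuid",
        # Assumed column that determines the partition path.
        "hoodie.datasource.write.partitionpath.field": "partitionpath",
        # Assumed column used to pick the latest version of a record.
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": operation,
    }


def hudi_incremental_read_options(begin_time, end_time=None):
    """Return reader options for an incremental query.

    Passing an end_time bounds the query, which is how a point-in-time
    query can be expressed (begin_time "000" means "from the first commit").
    """
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_time,
    }
    if end_time is not None:
        options["hoodie.datasource.read.end.instanttime"] = end_time
    return options


# Intended use, assuming a SparkSession `spark`, a DataFrame `df`, and a
# table path `base_path` already exist:
#   df.write.format("hudi") \
#       .options(**hudi_write_options("hudi_trips_cow", "upsert")) \
#       .mode("append").save(base_path)
#   spark.read.format("hudi") \
#       .options(**hudi_incremental_read_options("000", end_time)) \
#       .load(base_path)
```

Insert, update, and delete then share one write path and differ only in the operation value, while incremental and point-in-time queries differ only in whether an end instant is supplied.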

Packaging the Project

Upload the user.keytab and krb5.conf files to the server where the client is located.

Use the Maven tool provided by IDEA to package the project and generate the JAR file. For details, see Commissioning a Spark Application in a Linux Environment.
- Before compilation and packaging, change the paths of the user.keytab and krb5.conf files in the sample code to the actual paths on the client server.
- The Python sample code does not need to be packaged using Maven.
Upload the generated JAR file to any directory (for example, /opt/example/) on the server where the Spark client is located.

Running the Task

Log in to the Spark client node and run the following commands:
source <client_installation_directory>/bigdata_env

source <client_installation_directory>/Hudi/component_env

kinit <Hudi_development_user>
After compiling and building the sample code, you can use the spark-submit command to perform the write, update, query, and delete operations in sequence.
- Run the Java sample project.
  spark-submit --keytab <user_keytab_path> --principal=<principal_name> --class com.huawei.bigdata.hudi.examples.HoodieWriteClientExample /opt/example/hudi-java-security-examples-1.0.jar hdfs://hacluster/tmp/example/hoodie_java hoodie_java
  
  <user_keytab_path> indicates the authentication file path, <principal_name> indicates the authentication user name, /opt/example/hudi-java-security-examples-1.0.jar indicates the JAR file path, hdfs://hacluster/tmp/example/hoodie_java indicates the storage path of the Hudi table, and hoodie_java indicates the name of the Hudi table.
- Run the Scala sample project.
  spark-submit --keytab <user_keytab_path> --principal=<principal_name> --class com.huawei.bigdata.hudi.examples.HoodieDataSourceExample /opt/example/hudi-scala-security-examples-1.0.jar hdfs://hacluster/tmp/example/hoodie_scala hoodie_scala
  
  <user_keytab_path> indicates the authentication file path, <principal_name> indicates the authentication user name, /opt/example/hudi-scala-security-examples-1.0.jar indicates the JAR file path, hdfs://hacluster/tmp/example/hoodie_scala indicates the storage path of the Hudi table, and hoodie_scala indicates the name of the Hudi table.
- Run the Python sample project.
  spark-submit /opt/example/HudiPythonExample.py hdfs://hacluster/tmp/huditest/example/python hudi_trips_cow
  
  hdfs://hacluster/tmp/huditest/example/python indicates the storage path of the Hudi table, and hudi_trips_cow indicates the name of the Hudi table.
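
The three commands above share one shape: optional Kerberos arguments, an optional main class, the application file, and then the table path and table name as positional arguments. The following Python sketch assembles such a command line so that the argument order is explicit. The function name build_spark_submit and its parameters are hypothetical. Only the flags --keytab, --principal, and --class, which appear in the commands above, are used.

```python
import shlex


def build_spark_submit(app, table_path, table_name,
                       main_class=None, keytab=None, principal=None):
    """Assemble a spark-submit argument list in the order used above.

    This is a hypothetical helper for illustration. It only builds the
    command and does not run it.
    """
    cmd = ["spark-submit"]
    # Kerberos arguments are added only when both values are supplied.
    if keytab and principal:
        cmd += ["--keytab", keytab, "--principal=" + principal]
    # JAR-based samples (Java, Scala) need a main class; a .py file does not.
    if main_class:
        cmd += ["--class", main_class]
    # The application file is followed by the two positional arguments.
    cmd += [app, table_path, table_name]
    return cmd


python_cmd = build_spark_submit(
    "/opt/example/HudiPythonExample.py",
    "hdfs://hacluster/tmp/huditest/example/python",
    "hudi_trips_cow",
)
# shlex.join (Python 3.8+) renders the list as a single shell command line.
print(shlex.join(python_cmd))
```

Keeping the command as a list until the last moment avoids quoting mistakes if a path or principal ever contains characters that the shell treats specially.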