
Interconnecting Hudi with OBS Using an IAM Agency

After configuring decoupled storage and compute for a cluster by referring to Interconnecting an MRS Cluster with OBS Using an IAM Agency, you can create Hudi Copy-on-Write (COW) tables in spark-shell and store them in OBS.

Interconnecting Hudi with OBS

  1. Log in to the client installation node as the client installation user.
  2. Run the following commands to configure environment variables:

    Load the environment variables.

    source Client installation directory/bigdata_env

    Load the component environment variables.

    source Client installation directory/Hudi/component_env
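
    For example, if the client is installed in /opt/client (a hypothetical path; substitute your actual client installation directory):

    source /opt/client/bigdata_env
    source /opt/client/Hudi/component_env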

  3. Modify the configuration file:

    vim Client installation directory/Hudi/hudi/conf/hdfs-site.xml

    Modify the following property. The dfs.namenode.acls.enabled parameter specifies whether the HDFS ACL function is enabled; set it to false here.

    <property>
      <name>dfs.namenode.acls.enabled</name>
      <value>false</value>
    </property>

  4. For a cluster with Kerberos authentication enabled, authenticate the user. Skip this step if Kerberos authentication is disabled for the cluster.

    kinit Username

  5. Start spark-shell and run the following commands to create a COW table and save it in OBS:

    import org.apache.hudi.QuickstartUtils._
    import scala.collection.JavaConversions._
    import org.apache.spark.sql.SaveMode._
    import org.apache.hudi.DataSourceReadOptions._
    import org.apache.hudi.DataSourceWriteOptions._
    import org.apache.hudi.config.HoodieWriteConfig._

    // Table name and OBS storage path for the COW table.
    val tableName = "hudi_cow_table"
    val basePath = "obs://testhudi/cow_table/"

    // Generate 10 sample insert records and load them into a DataFrame.
    val dataGen = new DataGenerator
    val inserts = convertToStringList(dataGen.generateInserts(10))
    val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

    // Write the DataFrame to OBS as a Hudi COW table.
    df.write.format("org.apache.hudi").
      options(getQuickstartWriteConfigs).
      option(PRECOMBINE_FIELD_OPT_KEY, "ts").
      option(RECORDKEY_FIELD_OPT_KEY, "uuid").
      option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
      option(TABLE_NAME, tableName).
      mode(Overwrite).
      save(basePath)

    In the preceding commands, obs://testhudi/cow_table/ is the OBS storage path and testhudi is the name of the parallel file system. Change them as required.
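
    Optionally, you can verify that updates to the table also work. The following is a minimal sketch, assuming the same spark-shell session and the variables defined above; generateUpdates produces update records for keys that were inserted earlier, and Append mode makes Hudi perform an upsert instead of overwriting the table.

    // Generate updates for records inserted above (assumes the same session).
    val updates = convertToStringList(dataGen.generateUpdates(10))
    val updateDF = spark.read.json(spark.sparkContext.parallelize(updates, 2))
    // Append mode upserts into the existing Hudi table instead of overwriting it.
    updateDF.write.format("org.apache.hudi").
      options(getQuickstartWriteConfigs).
      option(PRECOMBINE_FIELD_OPT_KEY, "ts").
      option(RECORDKEY_FIELD_OPT_KEY, "uuid").
      option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
      option(TABLE_NAME, tableName).
      mode(Append).
      save(basePath)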

  6. Use the Spark DataSource API to check whether the table was created and whether the data was written correctly.

    val roViewDF = spark.
      read.
      format("org.apache.hudi").
      load(basePath + "/*/*/*/*")
    roViewDF.createOrReplaceTempView("hudi_ro_table")
    spark.sql("select * from hudi_ro_table").show()
    Figure 1 Viewing table data
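
    Hudi can also return only the records written after a given commit (an incremental query). The following is a minimal sketch, assuming the same spark-shell session and at least two commits on the table (for example, the initial insert plus the upsert above); the option keys come from DataSourceReadOptions, which is already imported.

    // Collect commit times from the Hudi metadata column, oldest first.
    val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_ro_table order by commitTime").map(k => k.getString(0)).take(50)
    // Start reading from the second-to-last commit (assumes >= 2 commits).
    val beginTime = commits(commits.length - 2)
    // Incremental query: only records committed after beginTime are returned.
    val incViewDF = spark.read.format("org.apache.hudi").
      option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
      option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
      load(basePath)
    incViewDF.createOrReplaceTempView("hudi_incr_table")
    spark.sql("select `_hoodie_commit_time`, uuid, ts from hudi_incr_table").show()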

  7. Exit the spark-shell CLI.

    :q