Last updated: 2025-12-18 GMT+08:00

Overview

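Fabric Data supports processing multimodal data such as images (PNG/JPG), audio (WAV/MP3), and video (MP4/AVI) through UDFs. Before a UDF can process the data, the multimodal data must be saved to a Fabric table. The following uses images as an example to show how to import data into Fabric: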

  1. Prepare the image data.
    1. Read the image data and write it to a parquet file.
      import pandas as pd
      import pyarrow as pa
      import pyarrow.parquet as pq  # the parquet submodule must be imported explicitly

      # Metadata for one image record.
      data = {"img": [
          {'filename': "image.png", 'format': 'png', 'height': 1, 'width': 2},
          ]
      }
      # Attach the raw image bytes to the record.
      with open("image.png", 'rb') as file:
          data["img"][0]["data"] = file.read()
      df = pd.DataFrame(data)
      # Struct schema for the img column: metadata fields plus the binary payload.
      schema = pa.schema([('img', pa.struct([('filename', pa.string()), ('format', pa.string()), ('height', pa.int64()), ('width', pa.int64()), ('data', pa.binary())]))])
      table = pa.Table.from_pandas(df, schema=schema)
      pq.write_table(table, "image_type.parquet")
    2. Upload the parquet file containing the data to OBS.
  2. Create a table with an image-type column, and set location to the OBS path from the previous step.
    import logging
    import os
    from fabric_data.multimodal import ai_lake
    from fabric_data.multimodal.types import image
    # Set the target database name
    target_database = "multimodal_lake"
    
    # Connect using settings read from environment variables
    con = ai_lake.connect(
        fabric_endpoint=os.getenv("fabric_endpoint"),
        fabric_endpoint_id=os.getenv("fabric_endpoint_id"),
        fabric_workspace_id=os.getenv("fabric_workspace_id"),
        lf_catalog_name=os.getenv("lf_catalog_name"),
        lf_instance_id=os.getenv("lf_instance_id"),
        access_key=os.getenv("access_key"),
        secret_key=os.getenv("secret_key"),
        default_database=target_database,
        use_single_cn_mode=True,
        logging_level=logging.WARN,
    )
    con.set_function_staging_workspace(
        obs_directory_base=os.getenv("obs_directory_base"),
        obs_bucket_name=os.getenv("obs_bucket_name"),
        obs_server=os.getenv("obs_server"),
        access_key=os.getenv("access_key"),
        secret_key=os.getenv("secret_key"))
    # Create an external table with an image-typed column; location is the OBS path from step 1
    con.create_table("image_table", schema={"img": image.Image}, external=True, location="obs://image_type")
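
The example in step 1 writes fixed `height` and `width` values into the record. As a minimal sketch, assuming the input is a standard PNG file, the dimensions could instead be read from the file's IHDR chunk using only the Python standard library. The helper names `png_dimensions` and `build_image_record` are hypothetical and are not part of Fabric Data:

```python
import struct
from typing import Tuple

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a valid PNG header")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def build_image_record(filename: str, data: bytes) -> dict:
    """Build a dict with the same fields as the img struct in step 1 (hypothetical helper)."""
    width, height = png_dimensions(data)
    return {
        "filename": filename,
        "format": "png",
        "height": height,
        "width": width,
        "data": data,
    }


# Usage (assumes image.png exists in the working directory):
# with open("image.png", "rb") as file:
#     record = build_image_record("image.png", file.read())
# data = {"img": [record]}
```

The returned dictionary has the same fields as the `img` struct in step 1, so it could take the place of the hand-written record there. JPG files use a different layout and would need a separate parser or an imaging library.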