Updated on 2025-08-25 GMT+08:00

Creating and Managing Tables

Creating a Table

The CREATE TABLE command is used to create a table. When creating a table, you can define the following content:

  • Columns and their data types.
  • Table distribution definition, that is, the distribution strategy that determines how the DataArts Fabric SQL database divides data among segments.
  • Table storage format.
  • Partitioned table definition.

Example: The following CREATE TABLE statement creates a table named web_returns_p1 that stores data in the ORC file format.

CREATE TABLE web_returns_p1
(
    wr_returned_date_sk       integer,
    wr_returned_time_sk       integer,
    wr_item_sk                integer,
    wr_refunded_customer_sk   integer
) store as ORC;

Table Types

DataArts Fabric SQL supports two table types: managed table and external table.

Managed table

Description: Both data and metadata are managed by DataArts Fabric SQL. Users can manage metadata and data files through DataArts Fabric SQL. Deleting a table removes both its data and metadata. The data path must be a parallel file system.

Use cases:

  • Service data requiring frequent updates.
  • Critical data needing protection against accidental deletion.

External table

Description: DataArts Fabric SQL reads data files from a specified location through metadata or table schemas. Users cannot modify the actual data files through DataArts Fabric SQL. Deleting a table only removes its metadata, leaving the data files unaffected.

Use cases:

  • Read-only analysis.
  • Fine-grained access control based on file paths.

For managed tables, data files are still stored in a conventional file system (for example, an OBS parallel file system), so users can alter these files without informing DataArts Fabric SQL. However, doing so violates the assumptions DataArts Fabric SQL makes about managed tables and may lead to undefined errors.

Examples of using managed tables and external tables
-- Create a schema.
CREATE SCHEMA sales_schema WITH LOCATION 'obs://bucket/catalog1/sales_schema/';

-- Create a managed table.
CREATE TABLE region_sales_info(id int, item varchar(128), sale_date date) partition by (region varchar(128)) store as parquet;

-- Create an external table to point to the managed table.
CREATE EXTERNAL TABLE readonly_region_sales_info(id int, item varchar(128), sale_date date) partition by (region varchar(128)) store as parquet location 'obs://bucket/catalog1/sales_schema/region_sales_info/';

-- Insert service data into the managed table.
INSERT INTO region_sales_info VALUES
(0, 'apple', '2001-01-01', 'Region A'),
(1, 'banana', '2001-01-02', 'Region B'),
(2, 'carrot', '2001-01-03', 'Region C'),
(3, 'desk', '2001-01-04', 'Region D');

SELECT * FROM region_sales_info order by 1;
 id |  item  | sale_date  |  region  
----+--------+------------+----------
  0 | apple  | 2001-01-01 | Region A
  1 | banana | 2001-01-02 | Region B
  2 | carrot | 2001-01-03 | Region C
  3 | desk   | 2001-01-04 | Region D
(4 rows)

-- No data can be queried as the external table initially does not contain partition information.
SELECT * FROM readonly_region_sales_info;
 id | item | sale_date | region 
----+------+-----------+--------
(0 rows)

-- Restore the partition information of the external table.
MSCK REPAIR TABLE readonly_region_sales_info;
SELECT * FROM readonly_region_sales_info order by 1;
 id |  item  | sale_date  |  region  
----+--------+------------+----------
  0 | apple  | 2001-01-01 | Region A
  1 | banana | 2001-01-02 | Region B
  2 | carrot | 2001-01-03 | Region C
  3 | desk   | 2001-01-04 | Region D
(4 rows)

-- Dropping the external table does not affect the data files.
DROP TABLE readonly_region_sales_info;
SELECT COUNT(1) from region_sales_info;
 count 
-------
     4
(1 row)

-- Dropping the managed table also deletes the data.
DROP TABLE region_sales_info;

Table Distribution Definition

DataArts Fabric SQL supports the following distribution methods: hash and round-robin.

Hash

Description: Data is organized into N buckets (files) using a hash algorithm.

Use case: Fact tables with large amounts of data.

Advantages and disadvantages:

  • Efficiently utilizes file hash information for pruning and filtering during data reads.
  • Speeds up computation in multi-table joins by sending files with the same hash value to the same node.

Round-robin

Description: Data is randomly organized across multiple files.

Use case: Large fact tables where no suitable distribution column exists for hash distribution.

Advantages and disadvantages:

  • Ensures no data skew, improving cluster space utilization.
  • Preferred when no appropriate distribution column is available for large tables; otherwise, hash distribution is recommended for better performance.
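As an illustration only, the two strategies might be declared as shown below. The DISTRIBUTE BY clause is an assumption here, based on common GaussDB-family syntax; check the CREATE TABLE reference for your version for the exact distribution clause.

-- Hash distribution: rows are bucketed by the hash of wr_item_sk (assumed syntax).
CREATE TABLE web_returns_hash
(
    wr_item_sk            integer,
    wr_returned_date_sk   integer
) DISTRIBUTE BY HASH (wr_item_sk) store as ORC;

-- Round-robin distribution: rows are spread evenly across files (assumed syntax).
CREATE TABLE web_returns_rr
(
    wr_item_sk            integer,
    wr_returned_date_sk   integer
) DISTRIBUTE BY ROUNDROBIN store as ORC;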

Selecting a distribution key

When using the hash distribution method, users must specify a distribution key for the table. Upon inserting a record, the system calculates a hash based on the distribution column's value and stores the data on the corresponding DataNode (DN).

Thus, selecting a hash distribution key is critical and must adhere to the following principles:

  1. The key values should be highly distinct. For example, consider choosing the table's primary key as the distribution key, such as the ID number column in a personnel information table.
  2. Avoid columns that appear in constant filter conditions. For example, if queries on the table dwcjk routinely apply a constant constraint on the column zqdh (for example, zqdh = '000001'), do not use zqdh as the distribution key.
  3. After satisfying the first two principles, prefer columns used in join conditions as the distribution key, so that join tasks can be pushed down to DNs for execution, minimizing inter-DN communication.

    Improper selection of a distribution key in a hash table may lead to data skew, causing I/O bottlenecks in some DNs and degrading overall query performance. Hence, after applying the hash distribution strategy, perform a data skew check to ensure even data distribution across all DNs.
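As a sketch of how these principles might be applied, consider the dwcjk table mentioned above. The column zqzh used here is a hypothetical account column introduced for illustration, and the DISTRIBUTE BY clause is assumed GaussDB-family syntax.

-- zqdh is routinely filtered with constant values, so it is a poor key.
-- The highly distinct account column zqzh, which also appears in join
-- conditions, is chosen as the distribution key instead (assumed syntax).
CREATE TABLE dwcjk
(
    cjrq  date,
    zqdh  varchar(6),
    zqzh  varchar(20),
    cjsl  integer
) DISTRIBUTE BY HASH (zqzh) store as ORC;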

Viewing Table Data

  • Query information about all tables in the current schema.
    SHOW TABLES;

  • Query the attributes of a table.
    DESCRIBE web_returns_p1;

  • Query the number of rows in the web_returns_p1 table.
    SELECT count(*) FROM web_returns_p1;

  • Query all data from the web_returns_p1 table.
    SELECT * FROM web_returns_p1;

  • Query only the data of the wr_refunded_customer_sk field.
    SELECT wr_refunded_customer_sk FROM web_returns_p1;

  • Query the deduplicated values of the wr_refunded_customer_sk field.
    SELECT DISTINCT wr_refunded_customer_sk FROM web_returns_p1;

  • Query all rows whose wr_refunded_customer_sk is 3869.
    SELECT * FROM web_returns_p1 WHERE wr_refunded_customer_sk = 3869;

  • Sort data by the wr_refunded_customer_sk field.
    SELECT * FROM web_returns_p1 ORDER BY wr_refunded_customer_sk;

Deleting Table Data

When using a table, you may need to delete the table data.

  • Currently, data cannot be deleted using the DELETE command.
  • Executing the following command will delete all rows in the table.
    TRUNCATE TABLE customer_t1;
    
  • Delete the created table.
    DROP TABLE customer_t1;