Updated on 2024-11-08 GMT+08:00

Step 4: Creating Another Table and Loading Data

After selecting a storage mode, compression level, distribution mode, and distribution key for each table, use these attributes to create tables and reload data. Compare the system performance before and after the table recreation.

  1. Delete the tables created before.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    DROP TABLE store_sales;
    DROP TABLE date_dim;
    DROP TABLE store;
    DROP TABLE item;
    DROP TABLE time_dim;
    DROP TABLE promotion;
    DROP TABLE customer_demographics;
    DROP TABLE customer_address;
    DROP TABLE household_demographics;
    DROP TABLE customer;
    DROP TABLE income_band;
    
    DROP FOREIGN TABLE obs_from_store_sales_001;
    DROP FOREIGN TABLE obs_from_date_dim_001;
    DROP FOREIGN TABLE obs_from_store_001;
    DROP FOREIGN TABLE obs_from_item_001;
    DROP FOREIGN TABLE obs_from_time_dim_001;
    DROP FOREIGN TABLE obs_from_promotion_001;
    DROP FOREIGN TABLE obs_from_customer_demographics_001;
    DROP FOREIGN TABLE obs_from_customer_address_001;
    DROP FOREIGN TABLE obs_from_household_demographics_001;
    DROP FOREIGN TABLE obs_from_customer_001;
    DROP FOREIGN TABLE obs_from_income_band_001;
    

  2. Create tables and specify storage and distribution modes for them.

    Only the syntax for recreating the store_sales table is provided for simplicity. To recreate all the other tables, copy the syntax in Creating a Another Table After Design Optimization.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    CREATE TABLE store_sales
    (
        ss_sold_date_sk           integer                       ,
        ss_sold_time_sk           integer                       ,
        ss_item_sk                integer               not null,
        ss_customer_sk            integer                       ,
        ss_cdemo_sk               integer                       ,
        ss_hdemo_sk               integer                       ,
        ss_addr_sk                integer                       ,
        ss_store_sk               integer                       ,
        ss_promo_sk               integer                       ,
        ss_ticket_number          bigint               not null,
        ss_quantity               integer                       ,
        ss_wholesale_cost         decimal(7,2)                  ,
        ss_list_price             decimal(7,2)                  ,
        ss_sales_price            decimal(7,2)                  ,
        ss_ext_discount_amt       decimal(7,2)                  ,
        ss_ext_sales_price        decimal(7,2)                  ,
        ss_ext_wholesale_cost     decimal(7,2)                  ,
        ss_ext_list_price         decimal(7,2)                  ,
        ss_ext_tax                decimal(7,2)                  ,
        ss_coupon_amt             decimal(7,2)                  ,
        ss_net_paid               decimal(7,2)                  ,
        ss_net_paid_inc_tax       decimal(7,2)                  ,
        ss_net_profit             decimal(7,2)                  
    ) 
    WITH (ORIENTATION = column,COMPRESSION=middle)
    DISTRIBUTE BY hash (ss_item_sk);
    

  3. Load sample data into these tables.
  4. Record the loading time in the benchmark tables.

    Benchmark

    Before

    After

    Loading time (11 tables)

    341584 ms

    257241 ms

    Occupied storage space

    Store_Sales

    42 GB

    -

    Date_Dim

    11 MB

    -

    Store

    232 KB

    -

    Item

    110 MB

    -

    Time_Dim

    11 MB

    -

    Promotion

    256 KB

    -

    Customer_Demographics

    171 MB

    -

    Customer_Address

    170 MB

    -

    Household_Demographics

    504 KB

    -

    Customer

    441 MB

    -

    Income_Band

    88 KB

    -

    Total storage space

    42 GB

    -

    Query execution time

    Query 1

    14552.05 ms

    -

    Query 2

    27952.36 ms

    -

    Query 3

    17721.15 ms

    -

    Total execution time

    60225.56 ms

    -

  5. Run the ANALYZE command to update statistics.

    1
    ANALYZE;
    

    If ANALYZE is returned, the execution is successful.

    1
    ANALYZE
    

  6. Check for data skew.

    For a hash table, an improper distribution key may cause data skew or poor I/O performance on certain DNs. Therefore, you need to check the table to ensure that data is evenly distributed on each DN. You can run the following SQL statements to check for data skew:

    1
    SELECT a.count,b.node_name FROM (SELECT count(*) AS count,xc_node_id FROM table_name GROUP BY xc_node_id) a, pgxc_node b WHERE a.xc_node_id=b.node_id ORDER BY a.count desc;
    

    xc_node_id corresponds to a DN. Generally, over 5% difference between the amount of data on different DNs is regarded as data skew. If the difference is over 10%, choose another distribution key. In GaussDB(DWS), you can select multiple distribution keys to distribute data evenly.