Updated on 2022-07-26 GMT+08:00

TPC-DS Data Construction

  1. Log in to the ECS and run the following commands to create the directories that store the TPC-DS tool and the generated data:

    mkdir -p /data1/script/tpcds-kit/tpcds1000X
    mkdir -p /data2/script/tpcds-kit/tpcds1000X
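
    Because the generated data set is large, it can help to check the free space on each data disk before continuing. The following is a minimal sketch, not part of the TPC-DS kit: the function names free_gb and has_space are hypothetical, and the sketch assumes a POSIX df and awk are available on the ECS.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (assumption: not provided by the TPC-DS kit).

# Print the free space, in whole GB, of the file system that holds the path.
free_gb() {
  # -P forces the one-line-per-file-system POSIX format, -k reports 1024-byte blocks.
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# Succeed if the path has at least the required number of free GB.
has_space() {
  local path=$1 required_gb=$2 available_gb
  available_gb=$(free_gb "$path")
  # df prints nothing usable if the path does not exist.
  [ -n "$available_gb" ] || { echo "cannot read free space for $path"; return 1; }
  if [ "$available_gb" -ge "$required_gb" ]; then
    echo "ok: $path has ${available_gb} GB free"
  else
    echo "insufficient: $path has ${available_gb} GB free, ${required_gb} GB required"
    return 1
  fi
}
```

    For example, has_space /data1 465 and has_space /data2 465 would check each disk for half of the roughly 930 GB that the 1000X data set occupies. The even split is an assumption that follows from generating five of the ten shards on each disk.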
    

  2. Obtain the latest TPC-DS tool package, which contains the source code of the data construction tool dsdgen, from the official website and use SFTP to upload the package to the /data1/script/tpcds-kit directory on the ECS.
  3. Run the following commands to decompress the TPC-DS package and compile it to generate the data construction tool dsdgen:

    • Replace tpcds_3.2.0.zip with the actual software package name.
    • Replace DSGen-software-code-3.2.0rc1 with the actual name of the decompressed folder.
    cd /data1/script/tpcds-kit && unzip tpcds_3.2.0.zip
    cd DSGen-software-code-3.2.0rc1/tools && make
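
    After make finishes, a quick check that the dsdgen binary exists and is executable can catch a failed build early. This is a minimal sketch; check_tool is a hypothetical helper name, not part of the TPC-DS kit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a file exists and is executable.
check_tool() {
  if [ -x "$1" ] && [ ! -d "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or not executable: $1" >&2
    return 1
  fi
}
```

    For example, run check_tool /data1/script/tpcds-kit/DSGen-software-code-3.2.0rc1/tools/dsdgen, replacing the folder name with the actual name of the decompressed folder.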
    

  4. Go to the /data1/script/tpcds-kit/DSGen-software-code-3.2.0rc1/tools directory and run the following commands to generate data:

    • TPC-DS data is large, and so are the individual tables. Therefore, data is generated in shards.
    • The TPC-DS 1000X data files total about 930 GB. Make sure that the ECS has sufficient disk space.
    • Because the generated data is large, importing it takes a long time if only one GDS is started. You are advised to split the generated data evenly across two data disks. In the following example, shards 1 to 5 are stored in /data1/script/tpcds-kit/tpcds1000X, and shards 6 to 10 are stored in /data2/script/tpcds-kit/tpcds1000X.
    for c in {1..5};do (./dsdgen -scale 1000 -dir /data1/script/tpcds-kit/tpcds1000X -TERMINATE N -parallel 10 -child ${c} -force Y > /dev/null 2>&1 &);done
    for c in {6..10};do (./dsdgen -scale 1000 -dir /data2/script/tpcds-kit/tpcds1000X -TERMINATE N -parallel 10 -child ${c} -force Y > /dev/null 2>&1 &);done
    

    Where:

    • -scale specifies the data scale. In this example, the value is 1000.
    • -dir specifies the directories where the generated data files are stored. In this example, the directories are /data1/script/tpcds-kit/tpcds1000X and /data2/script/tpcds-kit/tpcds1000X.
    • -TERMINATE indicates whether a separator is required at the end of each record. In this example, the value is N, so no trailing separator is added.
    • -parallel specifies the number of shards. In this example, the value is 10.
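    • -child specifies the sequence number of the shard that a dsdgen process generates. It is set by the loop variable and does not need to be changed.
    • -force specifies whether existing data files are overwritten. In this example, the value is Y.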
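
    The relationship between -parallel, -child, and the two output directories can be sketched as a dry run that only prints the commands instead of starting them. The helper names shard_dir and print_dsdgen_commands are hypothetical, and the rule "first half of the shards on /data1, second half on /data2" is an assumption that mirrors the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (assumption: not provided by the TPC-DS kit).

# Print the output directory for a shard: the first half of the shards
# go to /data1 and the remaining shards go to /data2.
shard_dir() {
  local child=$1 parallel=$2
  if [ "$child" -le $((parallel / 2)) ]; then
    echo /data1/script/tpcds-kit/tpcds1000X
  else
    echo /data2/script/tpcds-kit/tpcds1000X
  fi
}

# Print one dsdgen command per shard without running it.
print_dsdgen_commands() {
  local scale=$1 parallel=$2 c=1
  while [ "$c" -le "$parallel" ]; do
    echo "./dsdgen -scale $scale -dir $(shard_dir "$c" "$parallel") -TERMINATE N -parallel $parallel -child $c -force Y"
    c=$((c + 1))
  done
}
```

    Running print_dsdgen_commands 1000 10 prints ten commands, one per -child value from 1 to 10, so the shard layout can be reviewed before any data is written.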

  5. Run the following commands to check the data file generation progress. You can also run the ps ux|grep dsdgen command to check whether the data generation processes have exited.

    du -sh /data1/script/tpcds-kit/tpcds1000X/*.dat
    du -sh /data2/script/tpcds-kit/tpcds1000X/*.dat
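
    Instead of checking by hand, the wait can be scripted. The following is a minimal sketch under stated assumptions: wait_for_dsdgen and count_dat_files are hypothetical helper names, and the sketch assumes that pgrep and find are installed on the ECS and that the generation processes are named dsdgen.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (assumption: not provided by the TPC-DS kit).

# Print the number of .dat files directly under the given directory.
count_dat_files() {
  find "$1" -maxdepth 1 -type f -name '*.dat' | wc -l | tr -d ' '
}

# Block until no process named exactly "dsdgen" is running,
# polling every $1 seconds (default: 60).
wait_for_dsdgen() {
  local interval=${1:-60}
  while pgrep -x dsdgen > /dev/null 2>&1; do
    sleep "$interval"
  done
}
```

    For example, wait_for_dsdgen 60 returns once all dsdgen processes have exited, after which count_dat_files /data1/script/tpcds-kit/tpcds1000X and count_dat_files /data2/script/tpcds-kit/tpcds1000X report how many data files each directory holds.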