Suggestions for Creating CarbonData Tables

Scenario

This section provides suggestions based on more than 50 test cases to help you create CarbonData tables with higher query performance.

**Table 1** Columns in the CarbonData table

| Column name | Data type | Cardinality | Attribute |
| --- | --- | --- | --- |
| msname | String | 30 million | dimension |
| BEGIN_TIME | bigint | 10,000 | dimension |
| host | String | 1 million | dimension |
| dime_1 | String | 1,000 | dimension |
| dime_2 | String | 500 | dimension |
| dime_3 | String | 800 | dimension |
| counter_1 | numeric(20,0) | NA | measure |
| ... | ... | NA | measure |
| counter_100 | numeric(20,0) | NA | measure |

Procedure

If the table to be created contains a column that is used for filtering most of the time (for example, in more than 80% of filtering scenarios), optimize as follows:

Place this column first in sort_columns.

For example, if msname is the most frequently used filter criterion in queries, place it first in sort_columns. Run the following command to create the table. Queries that filter on msname then perform well.
```
create table carbondata_table(
    msname String,
    ...
    )STORED AS carbondata TBLPROPERTIES ('SORT_COLUMNS'='msname');
```
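As a rough illustration of the kind of query this layout is meant to serve, the sketch below filters on msname alone. The literal 'ms_0001' and the use of counter_1 in the select list are placeholders, not values from the test cases. DESCRIBE FORMATTED is standard Spark SQL; it is assumed here that its output includes the table's sort columns, so check this on your version.

```sql
-- Inspect the table definition (assumption: the output lists the sort columns).
DESCRIBE FORMATTED carbondata_table;

-- Equality filter on the leading sort column; 'ms_0001' is a placeholder value.
SELECT msname, counter_1
FROM carbondata_table
WHERE msname = 'ms_0001';
```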
If the table to be created has multiple columns that are frequently used for filtering, optimize as follows:

Create an index for these columns by adding them to sort_columns.

For example, if msname, host, and dime_1 are frequently used for filtering, order them in sort_columns by ascending cardinality: dime_1 -> host -> msname. Run the following command to create the table. It improves filtering performance on dime_1, host, and msname.
```
create table carbondata_table(
    dime_1 String,
    host String,
    msname String,
    dime_2 String,
    dime_3 String,
    ...
    )STORED AS carbondata
TBLPROPERTIES ('SORT_COLUMNS'='dime_1,host,msname');
```
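The queries below are a minimal sketch of filter shapes that involve these sort columns, assuming the table defined above. All literal values are placeholders, and counter_1 stands for any measure column from Table 1.

```sql
-- Filter on the leading sort column only.
SELECT count(*)
FROM carbondata_table
WHERE dime_1 = 'd1_42';

-- Filter on the first two sort columns.
SELECT msname, counter_1
FROM carbondata_table
WHERE dime_1 = 'd1_42' AND host = 'host_0007';

-- Filter on all three sort columns.
SELECT counter_1
FROM carbondata_table
WHERE dime_1 = 'd1_42' AND host = 'host_0007' AND msname = 'ms_0001';
```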

If the columns are used for filtering with similar frequency, optimize as follows:

Order the columns in sort_columns by ascending cardinality.

Run the following command to create a table:

```
create table carbondata_table(
    Dime_1 String,
    BEGIN_TIME bigint,
    HOST String,
    msname String,
    ...
    )STORED AS carbondata
TBLPROPERTIES ('SORT_COLUMNS'='dime_2,dime_3,dime_1,BEGIN_TIME,host,msname');
```

Create the table with sort_columns in ascending order of cardinality. Then create secondary indexes on the columns with higher cardinality. The statements for creating the indexes are as follows:

```
create index carbondata_table_index_msidn on table carbondata_table (
msname) as 'carbondata' PROPERTIES ('table_blocksize'='128');
create index carbondata_table_index_host on table carbondata_table (
host) as 'carbondata' PROPERTIES ('table_blocksize'='128');
```
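After the indexes are created, a quick check along the following lines may help. This is a sketch that assumes your CarbonData version provides the SHOW INDEXES statement; the filter value is a placeholder.

```sql
-- List the secondary indexes on the main table
-- (assumption: SHOW INDEXES is available in your CarbonData version).
SHOW INDEXES ON carbondata_table;

-- Filter on host alone, a high-cardinality column that is not first in sort_columns.
-- 'host_0007' is a placeholder value.
SELECT msname, BEGIN_TIME, counter_1
FROM carbondata_table
WHERE host = 'host_0007';
```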

For measure columns that do not require high precision, the numeric(20,0) data type is unnecessary. You are advised to use the double data type instead to improve query performance.
In the performance analysis of the test cases, query execution time dropped from 15 seconds to 3 seconds, an improvement of nearly 5 times. The command for creating a table is as follows:
```
create table carbondata_table(
    Dime_1 String,
    BEGIN_TIME bigint,
    HOST String,
    msname String,
    counter_1 double,
    counter_2 double,
    ...
    counter_100 double
    )STORED AS carbondata;
```
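The trade-off is precision: a double carries roughly 15 to 17 significant decimal digits, whereas numeric(20,0) holds 20 digits exactly. The sketch below, written for Spark SQL, shows a typical aggregation over double measures and a side-by-side cast of a placeholder 20-digit value that you can run to see whether the rounding is acceptable for your data.

```sql
-- Aggregation over measure columns stored as double.
SELECT dime_1,
       SUM(counter_1) AS total_counter_1,
       AVG(counter_2) AS avg_counter_2
FROM carbondata_table
GROUP BY dime_1;

-- Compare an exact 20-digit decimal with its double approximation
-- (the value is a placeholder chosen only for illustration).
SELECT CAST('12345678901234567890' AS DECIMAL(20,0)) AS exact_value,
       CAST('12345678901234567890' AS DOUBLE)        AS approx_value;
```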
If the values of a column (for example, a start time such as BEGIN_TIME) are incremental:
For example, if data is loaded to CarbonData every day, the start time increases with each load. In this case, put the start time column (BEGIN_TIME in this example) at the end of sort_columns, because incremental values make efficient use of the min/max index. The command for creating a table is as follows:
```
create table carbondata_table(
    Dime_1 String,
    HOST String,
    msname String,
    counter_1 double,
    counter_2 double,
    BEGIN_TIME bigint,
    ...
    counter_100 double
    )STORED AS carbondata
    TBLPROPERTIES ('SORT_COLUMNS'='dime_2,dime_3,dime_1,...,BEGIN_TIME');
```
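Because each load covers its own range of BEGIN_TIME values, a range filter on that column can be checked against the stored minimum and maximum of each data block, and blocks outside the range need not be read. A minimal sketch of such a query follows; the bounds are placeholders, and the unit of BEGIN_TIME is not stated in this section.

```sql
-- Range filter on the incremental column (placeholder bounds).
SELECT dime_1, SUM(counter_1) AS total_counter_1
FROM carbondata_table
WHERE BEGIN_TIME >= 1700000000
  AND BEGIN_TIME <  1700086400
GROUP BY dime_1;
```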