Updated on 2024-12-11 GMT+08:00

CarbonData Data Types

Overview

In CarbonData, data is stored in tables. Like RDBMS tables, CarbonData tables consist of rows and columns. They store structured data and have fixed columns and data types.

Supported Data Types

CarbonData tables support the following data types:

  • Int
  • String
  • BigInt
  • Smallint
  • Char
  • Varchar
  • Boolean
  • Decimal
  • Double
  • TimeStamp
  • Date
  • Array
  • Struct
  • Map

The following table describes the supported data types and their value ranges.

Table 1 CarbonData data types

| Data Type | Value Range |
|---|---|
| Int | 4-byte signed integer ranging from -2,147,483,648 to 2,147,483,647. NOTE: If a non-dictionary column is of the Int data type, it is internally stored as the BigInt type. |
| String | 100,000 characters. NOTE: If the CHAR or VARCHAR data type is used in CREATE TABLE, it is automatically converted to the String data type. If a column contains more than 32,000 characters, add the column to the LONG_STRING_COLUMNS property in TBLPROPERTIES during table creation. |
| BigInt | 64-bit value ranging from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 |
| SmallInt | -32,768 to 32,767 |
| Char | A to Z and a to z |
| Varchar | A to Z, a to z, and 0 to 9 |
| Boolean | true or false |
| Decimal | The default (precision, scale) is (10,0) and the maximum is (38,38). NOTE: When querying with filters, append BD to the numeric literal to get accurate results. For example, select * from carbon_table where num = 1234567890123456.22BD. |
| Double | 64-bit value ranging from 4.9E-324 to 1.7976931348623157E308 |
| TimeStamp | The default format is yyyy-MM-dd HH:mm:ss. |
| Date | The DATE data type is used to store calendar dates. The default format is yyyy-MM-dd. |
| Array<data_type> | N/A. NOTE: Currently, only two layers of complex types can be nested. |
| Struct<col_name: data_type COMMENT col_comment, ...> | N/A |
| Map<primitive_type, data_type> | N/A |
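The Decimal note in Table 1 recommends the BD suffix for filter literals. The following Python sketch illustrates one plausible reason, using the same literal as the example query: a 64-bit double cannot hold all of its digits, while an exact decimal can. Whether an unsuffixed literal is actually read as a double depends on the SQL engine and its version, so that part is an assumption here. The snippet only demonstrates the numeric effect and does not call CarbonData.

```python
from decimal import Decimal

# The literal used in the example query in Table 1.
literal = "1234567890123456.22"

# Read as a 64-bit double: the value snaps to the nearest representable number.
# Doubles of this magnitude are spaced 0.25 apart, so ".22" cannot be stored.
as_double = float(literal)

# Read as an exact decimal (the effect the BD suffix is meant to have).
as_decimal = Decimal(literal)

print(Decimal(as_double))                 # exact value held by the double: 1234567890123456.25
print(as_decimal)                         # 1234567890123456.22
print(Decimal(as_double) == as_decimal)   # False: an equality filter would not match
```

An equality filter compared against the double value would look for ...56.25 instead of ...56.22, which is why the exact decimal form is the safer choice.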

Main Specifications of CarbonData

Table 2 Main specifications of CarbonData

| Entity | Tested Value | Test Environment |
|---|---|---|
| Number of tables | 10000 | 3 nodes. 4 vCPUs and 20 GB memory for each executor. Driver memory: 5 GB, 3 executors. Total columns: 107 (Strings: 75, Int: 13, BigInt: 7, Timestamp: 6, Double: 6) |
| Number of table columns | 2000 | 3 nodes. 4 vCPUs and 20 GB memory for each executor. Driver memory: 5 GB, 3 executors. |
| Maximum size of a raw CSV file | 200 GB | 17 cluster nodes. 150 GB memory and 25 vCPUs for each executor. Driver memory: 10 GB, 17 executors. |
| Number of CSV files in each folder | 100 folders. Each folder has 10 files. The size of each file is 50 MB. | 3 nodes. 4 vCPUs and 20 GB memory for each executor. Driver memory: 5 GB, 3 executors. |
| Number of load folders | 10000 | 3 nodes. 4 vCPUs and 20 GB memory for each executor. Driver memory: 5 GB, 3 executors. |

The memory required for data loading depends on the following factors:

  • Number of columns
  • Column values
  • Concurrency (configured using carbon.number.of.cores.while.loading)
  • Sort size in memory (configured using carbon.sort.size)
  • Intermediate cache (configured using carbon.graph.rowset.size)

For example, loading an 8 GB CSV file that contains 10 million records and 300 columns, with each row about 0.8 KB in size, requires about 10 GB of executor memory when carbon.sort.size is set to 100000 and all other parameters retain their default values.
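The figures in the loading example can be cross-checked with simple arithmetic. The sketch below uses a hypothetical helper, raw_data_size_gb, and decimal units (1 GB = 1,000,000 KB). It reproduces only the raw data size and the memory-to-data ratio of this single tested case. It is not a sizing formula, because the required memory also depends on the factors listed above.

```python
def raw_data_size_gb(records: int, row_size_kb: float) -> float:
    """Approximate raw data size in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return records * row_size_kb / 1_000_000


# Values taken from the example: 10 million records, about 0.8 KB per row.
size_gb = raw_data_size_gb(records=10_000_000, row_size_kb=0.8)

# Executor memory reported for this one tested case.
executor_memory_gb = 10
memory_to_data_ratio = executor_memory_gb / size_gb

print(f"raw data: {size_gb:.1f} GB")
print(f"memory-to-data ratio in this case: {memory_to_data_ratio:.2f}")
```

For the documented case this gives about 8 GB of raw data, matching the stated file size, and a ratio of roughly 1.25.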

Specifications of Secondary Index Tables

Table 3 Table specifications

| Entity | Tested Value |
|---|---|
| Number of secondary index tables | 10 |
| Number of composite columns in a secondary index table | 5 |
| Length of column name in a secondary index table (unit: character) | 120 |
| Length of a secondary index table name (unit: character) | 120 |
| Cumulative length of all secondary index table names + column names in an index table (unit: character) | 3800** |

  • The number of characters in the column names of an index table is bounded by the upper limit allowed by Hive or by the available resources.
  • Secondary index tables are registered using Hive and stored in Hive SERDEPROPERTIES in JSON format. The value of SERDEPROPERTIES supported by Hive can contain a maximum of 4,000 characters and cannot be changed.
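Before creating secondary indexes, a planned set of index names can be compared against the tested values in Table 3. The following Python sketch is a hypothetical pre-check. The function check_index_layout and its input format are illustrative and not part of CarbonData. It treats the tested values as warning thresholds and assumes the cumulative length is the sum of all index table name lengths and their column name lengths.

```python
from typing import Dict, List

# Tested values from Table 3, treated here as warning thresholds (an assumption).
MAX_INDEX_TABLES = 10
MAX_COLUMNS_PER_INDEX = 5
MAX_NAME_LENGTH = 120
MAX_CUMULATIVE_LENGTH = 3800


def check_index_layout(indexes: Dict[str, List[str]]) -> List[str]:
    """Return warnings for a hypothetical {index_table_name: [column_names]} plan."""
    warnings: List[str] = []
    if len(indexes) > MAX_INDEX_TABLES:
        warnings.append(f"{len(indexes)} index tables exceed the tested value {MAX_INDEX_TABLES}")

    cumulative = 0
    for table, columns in indexes.items():
        if len(columns) > MAX_COLUMNS_PER_INDEX:
            warnings.append(f"{table}: {len(columns)} columns exceed the tested value {MAX_COLUMNS_PER_INDEX}")
        for name in [table, *columns]:
            if len(name) > MAX_NAME_LENGTH:
                warnings.append(f"name longer than {MAX_NAME_LENGTH} characters: {name[:20]}...")
            cumulative += len(name)

    if cumulative > MAX_CUMULATIVE_LENGTH:
        warnings.append(f"cumulative name length {cumulative} exceeds {MAX_CUMULATIVE_LENGTH}")
    return warnings


# Example plan with made-up index and column names.
plan = {"idx_city": ["city"], "idx_region_date": ["region", "sale_date"]}
print(check_index_layout(plan))  # [] when nothing exceeds the tested values
```

Note that the individual maximums cannot all be reached at once: ten index tables with five 120-character column names each would far exceed the cumulative value of 3800.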