Updated on 2024-12-13 GMT+08:00

HetuEngine Data Type

Currently, the following data types are supported during table creation: TINYINT, SMALLINT, BIGINT, INT, BOOLEAN, REAL, DECIMAL, DOUBLE, VARCHAR, STRING, BINARY, VARBINARY, TIMESTAMP, DATE, CHAR, ARRAY, ROW, MAP, and STRUCT. Other types are supported during data query and calculation.

Generally, most non-composite data types can be entered by literals and character strings. In the example, a string in the JSON format is added.

select json '{"name": "aa", "gender": "man"}';
              _col0              
---------------------------------
 {"name":"aa","gender":"man"} 
(1 row)

Boolean

The valid text values for "true" are TRUE, t, true, and 1.

The valid text values for the "false" value are FALSE, f, false, and 0.

TRUE and FALSE are standard usages (SQL-compatible).

The following is an example:

select BOOLEAN '0';
 _col0 
-------
 false 
(1 row)
 
select BOOLEAN 'TRUE';
 _col0 
-------
 true  
(1 row)
 
select BOOLEAN 't';
 _col0 
-------
 true  
(1 row)

Integer

Table 1 Integer

Parameter

Description

Storage Space

Value Range

Literal

TINYINT

Tiny integer

8 bits

-128~127

TINYINT

SMALLINT

Small integer

16 bits

-32,768 ~ +32,767

SMALLINT

INTEGER

Integer

32 bits

-2,147,483,648 ~ +2,147,483,647

INT

BIGINT

Big integer

64 bits

-9,223,372,036,854,775,808 ~ 9,223,372,036,854,775,807

BIGINT

The following is an example:

--Create a table containing TINYINT data:
CREATE TABLE int_type_t1  (IT_COL1 TINYINT) ;    
--Insert data of the TINYINT type:
insert into int_type_t1  values (TINYINT'10');    
--View data:
SELECT * FROM int_type_t1;  
 it_col1    
---------   
 10  
(1 row)    
--Drop a table:
DROP TABLE int_type_t1;  

Fixed Precision

Parameter

Description

Storage Space

Value Range

Literal

DECIMAL

Decimal number with fixed precision. The maximum precision is 38 bits, but the precision with less than 18 bits can ensure the best performance.

Decimal has two input parameters:

  • precision: total number of digits. The default value is 38.
  • scale: number of decimal places. The default value is 0.
    NOTE:

    If scale is 0 and precision is 38, the maximum precision is 19 bits.

64 bits

DECIMAL

NUMERIC

Same as DECIMAL

128 characters

NUMERIC

Table 2 Examples of literals

Example

Data Type

DECIMAL '0'

DECIMAL(1)

DECIMAL '12345'

DECIMAL(5)

DECIMAL '0000012345.1234500000'

DECIMAL(20, 10)

--Create a table containing DECIMAL data.
CREATE TABLE decimal_t1 (dec_col1 DECIMAL(10,3)) ;    

– Insert data of the DECIMAL type.
insert into decimal_t1 values (DECIMAL '5.325');     

--View data.
SELECT * FROM decimal_t1;  
 dec_col1  
---------   
 5.325 
(1 row)    

-- Negative example: The number of decimal places exceeds the defined length. As a result, the SQL statement fails to be executed.
insert into decimal_t1 values (DECIMAL '5.3253');
Query 20201126_034601_00053_tq98i@default@HetuEngine failed: Insert query has mismatched column types: Table: [decimal(10,3)], Query: [decimal(5,4)]

--Drop the table.
DROP TABLE decimal_t1;  

--Create a NUMERIC type table.
CREATE TABLE tb_numberic_hetu(col1 NUMERIC(9,7));
CREATE TABLE

--Insert data.
INSERT INTO tb_numberic_hetu values(9.12);
INSERT: 1 row

--View data.
SELECT * FROM tb_numberic_hetu;
col1    
------------
  9.1200000 
(1 row)

Float

Parameter

Description

Storage Space

Value Range

Literal

REAL

Real number

32 bits

1.40129846432481707e-45 to 3.40282346638528860e+38, positive or negative

REAL

DOUBLE

Double-precision floating point number with 15 to 17 valid digits, depending on the application scenario. The number of valid digits does not depend on the decimal point.

64 bits

4.94065645841246544e-324 to 1.79769313486231570e+308, positive or negative

DOUBLE

FLOAT

Single-precision floating point number with 6 to 9 valid digits, depending on the application scenario. The number of valid digits does not depend on the decimal point.

32 bits

1.40129846432481707e-45 to 3.40282346638528860e+38, positive or negative

FLOAT

Usage:
  • When a distributed query uses high-performance hardware instructions to perform single-precision or double-precision computing, the computing result may be slightly different because the execution sequence is different each time when an aggregate function, such as SUM() or AVG(), is invoked. Especially when the data volume is large (tens of millions or even billions of records), the computing result may be slightly different. In this case, you are advised to use the DECIMAL data type.
  • An alias can be used to specify the data type.

    The following is an example:

    --Create a table that contains float data.
    CREATE TABLE float_t1 (float_col1 FLOAT) ;
    --Insert data of the float type.
    insert into float_t1 values (float '3.50282346638528862e+38');
    --View data.
    SELECT * FROM float_t1;
    float_col1
    ------------
    Infinity
    (1 row)
    --Drop the table.
    DROP TABLE float_t1;
  • When the decimal part is 0, you can use cast() to convert the decimal part to an integer of the corresponding range. The decimal part is rounded off.

    The following is an example:

    select CAST(1000.0001 as INT);
    _col0
    -------
    1000
    (1 row)
    select CAST(122.5001 as TINYINT);
    _col0
    -------
    123
    (1 row)
  • When an exponential expression is used, the string can be converted to the corresponding type.
    The following is an example:
    select CAST(152e-3 as double);
    _col0
    -------
    0.152
    (1 row)

Character

Parameter

Description

VARCHAR(n)

Variable-length character string. n indicates the byte length.

CHAR(n)

Fixed-length character string. If the length is insufficient, spaces are added. n indicates the byte length. If the precision n is not specified, the default value 1 is used.

VARBINARY

Variable-length binary data. The value must be prefixed with X, for example, X'65683F'. Currently, a binary character string of a specified length is not supported.

JSON

The value can be a JSON object, a JSON array, a JSON number, a JSON string, true, false or null.

STRING

String compatible with impala. The bottom layer is varchar.

BINARY

Compatible with Hive BINARY. The underlying implementation is VARBINARY.

  • In SQL expressions, simple character expressions and unicodes are supported. A unicode character string uses U& as the fixed prefix. An escape character must be added before Unicode that is represented by a 4-digit value.
    -- Character expression
    select 'hello,winter!';  
          _col0         
    ------------------  
     hello,winter!   
    (1 row)  
    -- Unicode expression
    select U&'Hello winter \2603 !';  
          _col0         
    ------------------  
     Hello winter ☃ !   
    (1 row)    
    -- User-defined escape character
    select U&'Hello winter #2603 !' UESCAPE '#';  
          _col0         
    ------------------  
     Hello winter ☃ !   
    (1 row)  
  • VARBINARY and BINARY
    -- Create a VARBINARY or BINARY table.
     create table binary_tb(col1 BINARY);
    
    --Insert data.
     INSERT INTO binary_tb values (X'63683F');
    
    --Query data.
     select * from binary_tb ;  -- 63 68 3f 
  • When two CHARs with different numbers of spaces at the end are compared, the two CHARs are considered equal.
    SELECT CAST('FO' AS CHAR(4)) = CAST('FO     ' AS CHAR(5));
     _col0 
    -------
     true  
    (1 row)

Time and Date Type

The time and date are accurate to milliseconds.

Table 3 Time and date type

Parameter

Description

Storage Space

DATE

Date and time. Only the ISO 8601 format is supported, for example 2020-01-01.

32 bits

TIME

Time (hour, minute, second, millisecond) without a time zone

Example: TIME '01:02:03.456'

64 bits

TIME WITH TIMEZONE

Time with a time zone (hour, minute, second, millisecond). Time zones are expressed as the numeric UTC offset value.

Example: TIME '01:02:03.456 -08:00'

96 bits

TIMESTAMP

Timestamp

64 bits

TIMESTAMP WITH TIMEZONE

Timestamp with time zone

64 bits

INTERVAL YEAR TO MONTH

Time interval literal year and month, in the format of SY-M.

S: optional symbols (+/-)

Y: years

M: months

128 bits

INTERVAL DAY TO SECOND

The time interval literally indicates the day, hour, minute, and second, and is accurate to millisecond. The format is SD H:M:S.nnn.

S: optional symbols (+/-)

D: days

M: minutes

S: seconds

nnn: milliseconds

128 bits

The following is an example:

-- Query the date:
SELECT DATE '2020-07-08';
   _col0    
------------
 2020-07-08 
(1 row)

-- Query time:
SELECT TIME '23:10:15';
 _col0     
--------------
 23:10:15 
(1 row)

SELECT TIME '01:02:03.456 -08:00';
 _col0     
--------------
 01:02:03.456-08:00 
(1 row)

-- Time interval usage
SELECT TIMESTAMP '2015-10-18 23:00:15' + INTERVAL '3 12:15:4.111' DAY TO SECOND;      
 _col0          
------------------------- 
 2015-10-22 11:15:19.111 
(1 row)

SELECT TIMESTAMP '2015-10-18 23:00:15' + INTERVAL '3-1' YEAR TO MONTH;
          _col0          
-------------------------
 2018-11-18 23:00:15 
(1 row)

select INTERVAL '3' YEAR + INTERVAL '2' MONTH ;
 _col0 
-------
 3-2   
(1 row)

select INTERVAL '1' DAY+INTERVAL '2' HOUR +INTERVAL '3' MINUTE +INTERVAL '4' SECOND ;
     _col0      
----------------
 1 02:03:04.000 
(1 row)

ARRAY

Array

Example: ARRAY[1, 2, 3].

-- Create an ARRAY type table.
create table array_tb(col1 ARRAY<STRING>);

-- Insert a record of the ARRAY type.
insert into array_tb values(ARRAY['HetuEngine','Hive','Mppdb']);

--Query data.
select * from array_tb; --    [HetuEngine, Hive, Mppdb]

MAP

Data type of a key-value pair.

Example: MAP(ARRAY['foo', 'bar'], ARRAY[1, 2]).

-- Create a Map table.
create table map_tb(col1 MAP<STRING,INT>);

-- Insert a piece of data of the Map type.
insert into map_tb values(MAP(ARRAY['foo','bar'],ARRAY[1,2]));

--Query data.
select * from map_tb; --   {bar=2, foo=1} 

ROW

ROW fields can be any supported data type or a combination of different field data types.

-- Create a ROW table.
create table row_tb (id int,col1 row(a int,b varchar));

-- Insert data of the ROW type.
insert into row_tb values (1,ROW(1,'HetuEngine'));

--Query data.
select * from row_tb; 
 id |     col1     
----|--------------
  1 | {a=1, b=HetuEngine} 

--Fields can be named. By default, a Row field is not named.
select  row(1,2e0),CAST(ROW(1, 2e0) AS ROW(x BIGINT, y DOUBLE));
           _col0          |    _col1     
------------------------|--------------
 {1, 2.0} | {x=1, y=2.0} 
(1 row) 

--Named fields can be accessed using the domain operator ".".
select col1.b from row_tb; -- HetuEngine

--Both named and unnamed fields can be accessed through location indexes, which start from 1 and must be a constant.
select col1[1] from row_tb;   -- 1

IPADDRESS

IP address, which can be an IPv4 or IPv6 address. In the system, however, this type of address is a unified IPv6 address.

IPv4 is supported by mapping IPv4 addresses to the IPv6 address range (RFC 4291#section-2.5.5.2). When an IPv4 address is created, it is mapped to an IPv6 address. During formatting, if the data is IPv4, the data will be mapped to IPv4 again. Other addresses are formatted according to the format defined in RFC 5952.

The following is an example:

select IPADDRESS '10.0.0.1', IPADDRESS '2001:db8::1';
  _col0   |    _col1
----------|-------------
 10.0.0.1 | 2001:db8::1
(1 row)

UUID

A standard universally unique identifier (UUID) is also called a globally unique identifier (GUID).

It complies with the format defined in RFC 4122.

The following is an example:

select UUID '12151fd2-7586-11e9-8f9e-2a86e4085a59';
                _col0
--------------------------------------
 12151fd2-7586-11e9-8f9e-2a86e4085a59
(1 row)

HYPERLOGLOG

Base statistics.

The cost of using HyperLogLog to approximate the count value of a unique number is far less than that of using count.

For details, see HyperLogLog Functions.

  • HyperLogLog

    A HyperlogLog Sketch can be used to efficiently calculate the approximate value of distinct().

    It begins with a sparse representation, then becomes a dense representation. And at this time efficiency becomes higher.

  • P4HyperLogLog

    Similar to a HyperlogLog Sketch, but it starts with a dense representation.

QDIGEST

A quantile, also referred to as a quantile point, refers to that a probability distribution range of a random variable is divided into several equal numerical points. Commonly used quantile points include a median (that is, a 2-quantile), a 4-quartile, and a 100-percentile. Quantile digest is a set of quantiles. When the data to be queried is close to a quantile, the quantile can be used as the approximate value of the data to be queried. Its precision can be adjusted, but higher precision results in expensive space overhead.

STRUCT

The bottom layer is implemented using ROW. For details, see ROW.

The following is an example:

-- Creating a STRUCT table.
create table struct_tab (id int,col1 struct<col2: integer, col3: string>);

-- Insert data of the STRUCT type.
 insert into struct_tab VALUES(1, struct<2, 'HetuEngine'>);

--Query data.
select * from struct_tab; 
 id |        col1         
----|---------------------
  1 | {col2=2, col3=HetuEngine}