Importing Data
Function
The LOAD DATA function can be used to import data in CSV, Parquet, ORC, JSON, and Avro formats. The data is converted into the Parquet data format for storage.
Syntax
1 2 |
LOAD DATA INPATH 'folder_path' INTO TABLE [db_name.]table_name OPTIONS(property_name=property_value, ...); |
Keywords
- INPATH: path of data to be imported
- OPTIONS: list of properties
Parameters
Parameter |
Description |
---|---|
folder_path |
OBS path of the file or folder used for storing the raw data. |
db_name |
Enter the database name. If this parameter is not specified, the current database is used. |
table_name |
Name of the table to which data is to be imported. |
The following configuration options can be used during data import:
- DATA_TYPE: specifies the type of data to be imported. Currently, CSV, Parquet, ORC, JSON, and Avro are supported. The default value is CSV.
The configuration item is OPTIONS ('DATA_TYPE' = 'CSV').
When importing a CSV file or a JSON file, you can select one of the following modes:
- PERMISSIVE: When the PERMISSIVE mode is selected, the data of a column is set to null if its data type does not match that of the target table column.
- DROPMALFORMED: When the DROPMALFORMED mode is selected, the data of a column s not imported if its data type does not match that of the target table column.
- FAILFAST: When the FAILFAST mode is selected, exceptions might occur and the import may fail if a column type does not match.
You can set the mode by adding OPTIONS ('MODE' = 'PERMISSIVE') to the OPTIONS parameter.
- DELIMITER: You can specify a separator in the import statement. The default value is ,.
The configuration item is OPTIONS('DELIMITER'=',').
For CSV data, the following delimiters are supported:
- Tab character, for example, 'DELIMITER'='\t'.
- Any binary character, for example, 'DELIMITER'='\u0001(^A)'.
- Single quotation mark ('). A single quotation mark must be enclosed in double quotation marks (" "). For example, 'DELIMITER'= "'".
- \001(^A) and \017(^Q) are also supported, for example, 'DELIMITER'='\001(^A)' and 'DELIMITER'='\017(^Q)'.
- QUOTECHAR: You can specify quotation marks in the import statement. The default value is double quotation marks (").
The configuration item is OPTIONS('QUOTECHAR'='"').
- COMMENTCHAR: You can specify the comment character in the import statement. During the import operation, if a comment character is at the beginning of a row, the row is considered as a comment and will not be imported. The default value is a pound key (#).
The configuration item is OPTIONS('COMMENTCHAR'='#').
- HEADER: Indicates whether the source file contains a header. Possible values can be true and false. true indicates that the source file contains a header, and false indicates that the source file does not contain a header. The default value is false. If no header exists, specify the FILEHEADER parameter in the LOAD DATA statement to add a header.
The configuration item is OPTIONS('HEADER'='true').
- FILEHEADER: If the source file does not contain any header, add a header to the LOAD DATA statement.
OPTIONS('FILEHEADER'='column1,column2')
- ESCAPECHAR: Is used to perform strict verification of the escape character on CSV files. The default value is a slash (\\).
The configuration item is OPTIONS. (ESCAPECHAR?=?\\?)
Enter ESCAPECHAR in the CSV data. ESCAPECHAR must be enclosed in double quotation marks (" "). For example, "a\b".
- MAXCOLUMNS: This parameter is optional and specifies the maximum number of columns parsed by a CSV parser in a line.
The configuration item is OPTIONS('MAXCOLUMNS'='400').
Table 2 MAXCOLUMNS Name of the Optional Parameter
Default Value
Maximum Value
MAXCOLUMNS
2000
20000
After the value of MAXCOLUMNS Option is set, data import will require the memory of executor. As a result, data may fail to be imported due to insufficient executor memory.
- DATEFORMAT: Specifies the date format of a column.
OPTIONS('DATEFORMAT'='dateFormat')
- The default value is yyyy-MM-dd.
- The date format is specified by the date mode string of Java. For the Java strings describing date and time pattern, characters A to Z and a to z without single quotation marks (') are interpreted as pattern characters , which are used to represent date or time string elements. If the pattern character is quoted by single quotation marks ('), text matching rather than parsing is performed. For the definition of pattern characters in Java, see Table 3.
Table 3 Definition of characters involved in the date and time patterns Character
Date or Time Element
Example
G
Epoch ID
AD
y
Year
1996; 96
M
Month
July; Jul; 07
w
Number of the week in a year
27 (the twenty-seventh week of the year)
W
Number of the week in a month
2 (the second week of the month)
D
Number of the day in a year
189 (the 189th day of the year)
d
Number of the day in a month
10 (the tenth day of the month)
u
Number of the day in a week
1 (Monday), ..., 7 (Sunday)
a
am/pm flag
pm (12:00-24:00)
H
Hour time (0-23)
2
h
Hour time (1-12)
12
m
Number of minutes
30
s
Number of seconds
55
S
Number of milliseconds
978
z
Time zone
Pacific Standard Time; PST; GMT-08:00
- TIMESTAMPFORMAT: Specifies the timestamp format of a column.
OPTIONS('TIMESTAMPFORMAT'='timestampFormat')
- Default value: yyyy-MM-dd HH:mm:ss.
- The timestamp format is specified by the Java time pattern string. For details, see Table 3.
- Mode: Specifies the processing mode of error records while importing. The options are as follows: PERMISSIVE, DROPMALFORMED, and FAILFAST.
OPTIONS('MODE'='permissive')
- PERMISSIVE (default): Parse bad records as much as possible. If a field cannot be converted, the entire row is null.
- DROPMALFORMED: Ignore the bad records that cannot be parsed.
- FAILFAST: If a record cannot be parsed, an exception is thrown and the job fails.
- BADRECORDSPATH: Specifies the directory for storing error records during the import.
OPTIONS('BADRECORDSPATH'='obs://bucket/path')
It is recommended that this option be used together with the DROPMALFORMED pattern to import the records that can be successfully converted into the target table and store the records that fail to be converted to the specified error record storage directory.
Precautions
- When importing or creating an OBS table, you must specify a folder as the directory. If a file is specified, data import may be failed.
- Only the raw data stored in the OBS path can be imported.
- You are advised not to concurrently import data in to a table. If you concurrently import data into a table, there is a possibility that conflicts occur, leading to failed data import.
- Only one path can be specified during data import. The path cannot contain commas (,).
- If a folder and a file with the same name exist in the OBS bucket directory, the data is preferentially to be imported directed to the file rather than the folder.
- When importing data of the PARQUET, ORC, or JSON format, you must specify DATA_TYPE. Otherwise, the data is parsed into the default format CSV. In this case, the format of the imported data is incorrect.
- If the data to be imported is in the CSV or JSON format and contains the date and columns, you need to specify DATEFORMAT and TIMESTAMPFORMAT. Otherwise, the data will be parsed into the default date and timestamp formats.
Example
Before importing data, you must create a table. For details, see Creating an OBS Table or Creating a DLI Table.
- To import a CSV file to a DLI table named t, run the following statement:
1 2
LOAD DATA INPATH 'obs://dli/data.csv' INTO TABLE t OPTIONS('DELIMITER'=',' , 'QUOTECHAR'='"','COMMENTCHAR'='#','HEADER'='false');
- To import a JSON file to a DLI table named jsontb, run the following statement:
1 2
LOAD DATA INPATH 'obs://dli/alltype.json' into table jsontb OPTIONS('DATA_TYPE'='json','DATEFORMAT'='yyyy/MM/dd','TIMESTAMPFORMAT'='yyyy/MM/dd HH:mm:ss');
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.