Importing and Exporting Data by Running COPY
COPY is one of cqlsh commands. It includes COPY TO and COPY FROM. They are used to copy data to and from Cassandra.
- If the exported file is in CSV format, it needs to be written into the target file by row, and fields are separated by delimiters.
- If no field name is specified, all fields are exported.
- To skip some fields, specify a field list.
COPY FROM allows you to import data from a CSV file to an existing table.
- The source file is imported by row.
- All rows in the dataset must contain the same number of fields, and the PRIMARY KEY field must have a value. During the import, the PRIMARY KEY field will be verified and the existing records are updated.
- If HEADER is set to False and no field name is specified, fields are imported in a specified order. After field names are specified, the fields are imported in sequence. The missing and empty fields are set to null.
- The source file can only have fewer fields than the target table.
- When only COPY FROM is used to import data, the number of rows in a dataset cannot exceed 2 million.
Precautions
- Import and export data during off-peak hours to minimize the impacts on your services.
- Obtain the latest binary package by following Connecting to an Instance over a Private Network.
COPY Syntax
- COPY TO
COPY table_name [( column_list )] TO 'file_name' [, 'file2_name', ...] | STDOUT [WITH option = 'value' [ADN ...]]
- COPY FROM
COPY table_name [( column_list )] FROM 'file_name' [, 'file2_name', ...] | STDIN [WITH option = 'value' [ADN ...]]
COPY supports one or more comma-separated file names or a list of Python glob expressions.
For some common syntax symbols in the COPY command, see Table 1.
Symbol |
Description |
---|---|
Uppercase letters |
Text keyword. |
Lowercase letters |
A variable, which needs to be replaced with a user-defined value. |
Italic |
(Optional) Enclose optional command parameters in square brackets ([]). Do not enter only square brackets. |
( ) |
Group. Parentheses (()) indicate the group to be selected. Do not input only brackets. |
| |
Or. Use vertical bars (|) to separate elements. You can input any element. Do not enter only vertical bars. |
... |
Repeatable. The ellipsis (...) indicates that you can repeat syntax elements multiple times as required. |
'Literal string' |
The single quotation marks (') must contain the character string in the CQL statement. Use single quotation marks to keep uppercase letters. |
{ key : value } |
The map set. Include a map set or key-value pair in braces ({}). Separate keys and values with colons. |
<datatype1,datatype2> |
Set, list, map, or tuple of an ordered list. Angle brackets (< >) contain data types in collections, lists, maps, or tuples. Data types are separated by commas (,). |
cql_statement; |
End a CQL statement. Semicolons (;) end all CQL statements. |
[--] |
Use two hyphens (--) to separate command line options from command arguments. This syntax is useful when parameters may be mistaken for command arguments. |
' <schema> ... </schema> ' |
Search CQL only; single quotation marks (') enclose the entire XML schema declaration. |
@xml_entity='xml_entity_type' |
Search CQL only; identify entities and literal values to overwrite XML elements in schemas and solrConfig files. |
COPY Usage Suggestions
Command |
Parameter |
Description |
Default Value |
Applicability |
---|---|---|---|---|
TO/FROM |
DELIMITER |
A single character used to separate fields. |
English comma, |
- |
TO/FROM |
QUOTE |
A single character that contains a field value. |
" |
- |
TO/FROM |
ESCAPE |
Escapes a single character using the QUOTE character. |
\ |
- |
TO/FROM |
HEADER |
Boolean value (true | false), indicating the name of the column in the first row. True matches the field name with the imported column name and inserts the column name into the first row of the exported data. |
FALSE |
- |
TO/FROM |
NULL |
Filled value of the field whose query result is empty. You can set this parameter as required. |
Empty string () |
- |
TO/FROM |
DATETIMEFORMAT |
Time format for reading or writing CSV time data. The timestamp is in the strftime format. If this parameter is not set, the default value is the value of time_format in the cqlshrc file. Default format: %Y- %m- %d %H: %M: %S %z. |
%Y-%m-%d %H:%M:%S%z |
- |
TO/FROM |
MAXATTEMPTS |
Maximum number of retry times when an error occurs. |
5 |
- |
TO/FROM |
REPORTFREQUENCY |
Frequency of displaying the status, in seconds. |
0.25 |
- |
TO/FROM |
DECIMALSEP |
Delimiter character for decimal values. |
English full stop. |
- |
TO/FROM |
THOUSANDSSEP |
Separator of a thousand array. |
None |
- |
TO/FROM |
BOOLSTYLE |
Boolean values indicate True and False. The value is case-insensitive. For example, the values yes and no have the same effect as values YES and NO. |
True,False |
- |
TO/FROM |
NUMPROCESSES |
Number of working processes. |
16 |
The default value of this parameter is the number of kernels on the computer minus one. There is no maximum value for this parameter. You can run the dstat and dstat -lvrn 10 commands to check the CPU idle time. If the CPU idle time exists, use the default number of working processes. You can increase the number of processes while observing the CPU usage of the instance. It is recommended that the CPU usage be less than or equal to 60%. If the CPU usage of the executor is idle and the CPU usage of the instance exceeds the recommended value, expand the capacity to further improve the performance. |
TO/FROM |
CONFIGFILE |
Specifies a cqlshrc configuration file to set the WITH option.
NOTE:
Command line options always overwrite the cqlshrc file. |
None, user-defined |
- |
TO/FROM |
RATEFILE |
Prints the output statistics to this file. |
None, user-defined |
You are advised to add this parameter when exporting data to improve statistics efficiency. |
TO/FROM |
ORIGIN |
Check whether the database to be imported or exported is an open-source Cassandra database.
|
False |
- |
FROM |
CHUNKSIZE |
The block size is passed to the worker process. |
5000 |
This parameter specifies the number of rows sent from the Feeder process (reading data from files) to the worker process. Depending on the average row size of the dataset, it may be advantageous to increase the value of this parameter. |
FROM |
INGESTRATE |
Approximate import rate per second. |
100000 |
INGESTRATE indicates the rate (in rows) at which the feeder process sends data to the worker process per second. Generally, you do not need to change the value unless the rate is too high and needs to be limited. |
FROM |
MAXBATCHSIZE |
Maximum size of a batch file to be imported. |
20 |
The value of this parameter can be as large as possible but cannot exceed the upper limit.
|
FROM |
MINBATCHSIZE |
Minimum size of a batch import file. |
2 |
For each chunk, the worker process writes data in batches based on the minimum batch size. The value may need to be adjusted based on the block size, number of nodes in the cluster, and number of VNODEs on each node. If the chunk size is larger, increase the value accordingly. |
FROM |
MAXROWS |
Maximum number of rows. The value -1 indicates that there is no upper limit. |
-1 |
- |
FROM |
SKIPROWS |
Number of rows to skip. |
0 |
- |
FROM |
SKIPCOLS |
A comma-separated list of column names to skip. |
None, user-defined |
- |
FROM |
MAXPARSEERRORS |
Maximum number of global parsing errors. The value -1 indicates that there is no upper limit. |
-1 |
- |
FROM |
MAXINSERTERRORS |
Maximum number of global insertion errors. The value -1 indicates that there is no upper limit. |
-1 |
- |
FROM |
ERRFILE |
A file that stores all rows that are not imported. If no value is set, the information is stored in import_ ks _ table .err, where ks is the key space and table is the table name. |
import_ ks _ table .err |
- |
FROM |
TTL |
The time to live is in seconds. By default, data does not expire. |
3600 |
- |
TO |
ENCODING |
Output character string type. |
UTF-8 |
- |
TO |
PAGESIZE |
Size of the page for obtaining results. |
1000 |
Size of the result page. The value is an integer. The default value is 1000. The larger the page size, the longer the value of pagetimeout. If the data volume in a single row is large, set this parameter to a smaller value. If the data volume in a single row is small, set this parameter to a larger value. The best effect of this value depends on the local batch write capability of the executor. If the local batch write capability is strong (for example, Huawei Cloud obsfs is used), you can increase the value. |
TO |
PAGETIMEOUT |
The page times out to obtain the result. |
10 |
The value is an integer, indicating the timeout interval for obtaining each page. The unit is second. The default value is 10 seconds.
|
TO |
BEGINTOKEN |
Minimum token for exporting data. |
None, user-defined |
The value is a string, indicating the minimum token to be considered during data export. Records with smaller tokens will not be exported. The default value is empty, indicating that there is no minimum token. |
TO |
ENDTOKEN |
Maximum token used to export data. |
None, user-defined |
The value is a string, indicating the maximum number of tokens to be considered during data export. Records with larger tokens will not be exported. This parameter is left empty by default, indicating that there is no maximum token. |
TO |
MAXREQUESTS |
Maximum number of requests that can be processed concurrently by each worker. |
6 |
The value of this parameter is an integer, indicating the maximum number of running requests that can be processed by each working process. Total degree of parallelism during data export = Number of working processes x Value of this parameter. Default value: 6 Each request will export data for the entire token range. |
TO |
MAXOUTPUTSIZE |
Maximum size of an output file, in lines. After this parameter is set, the output file is split into multiple segments when the size of the output file exceeds the value of this parameter. The value -1 indicates that there is no upper limit. |
-1 |
The value of this parameter is an integer, indicating the maximum size of an output file in the unit of lines. If the value of this parameter is exceeded, the output file is split into multiple segments. The default value is -1, indicating that there is no limit on the maximum value. Therefore, the file is the only output file. This parameter can be used together with MAXFILESIZE. |
TO |
MAXFILESIZE |
Maximum size of an output file, in KB. After this parameter is set, the output file is split into multiple segments when the size of the output file exceeds the value of this parameter. |
None, user-defined |
The value of this parameter is an integer, indicating the maximum size of an output file in bytes. The final file size is close to the value of this parameter. If the file size exceeds this value, the output file is split into multiple segments. The default value is -1, indicating that there is no limit on the maximum value. Therefore, the file is the only output file. This parameter can be used together with MAXOUTPUTSIZE. |
TO |
dataformats |
Output file format. Currently, this parameter can only be set to json. |
None, user-defined |
- |
TO |
DATATYPE |
The file format can be Parquet or ORC. |
None, user-defined |
- |
TO |
RESULTFILE |
The exported file containing detailed results. |
None, user-defined |
You are advised to add this parameter when exporting data to improve statistics efficiency. |
TO |
wherecondition |
Export condition specified during the export. |
None, user-defined |
- |
Procedure
The following uses an example to describe how to preconfigure data, export data, and import data.
- Pre-configuring Data
- Create a keyspace.
CREATE KEYSPACE cycling WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
- Create a table.
CREATE TABLE cycling.cyclist_name ( id UUID PRIMARY KEY, lastname text, firstname text );
- Insert a data record.
INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (5b6962dd-3f90-4c93-8f61-eabfa4a803e2, 'VOS','Marianne'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (e7cd5752-bc0d-4157-a80f-7523add8dbcd, 'VAN DER BREGGEN','Anna'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (e7ae5cf3-d358-4d99-b900-85902fda9bb0, 'FRAME','Alex'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (220844bf-4860-49d6-9a4b-6b5d3a79cbfb, 'TIRALONGO','Paolo'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, 'KRUIKSWIJK','Steven'); INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (fb372533-eb95-4bb4-8685-6ef61e994caa, 'MATTHEWS', 'Michael');
- Create a keyspace.
- Exports data from and imports data to the cyclist_name table.
- Export the id and lastname columns from the cyclist_name table to a CSV file.
COPY cycling.cyclist_name (id,lastname) TO '../cyclist_lastname.csv' WITH HEADER = TRUE;
Figure 1 Exported successfully
After the preceding command is executed successfully, the cyclist_lastname.csv file is created in the upper-level directory of the current directory. If the file already exists, it will be overwritten.
- Export the id and first name columns from the cyclist_name table to another CSV file.
COPY cycling.cyclist_name (id,firstname) TO '../cyclist_firstname.csv' WITH HEADER = TRUE;
Figure 2 Exported successfully
After the preceding command is executed successfully, the cyclist_firstname.csv file is created in the upper-level directory of the current directory. If the file already exists, it will be overwritten.
- Delete data from the cyclist_name table. To ensure data security, the TRUNCATE command is not supported.
DELETE FROM cycling.cyclist_name WHERE id = 'fb372533-eb95-4bb4-8685-6ef61e994caa';
- No data exists in the table.
SELECT * FROM cycling.cyclist_name ;
Figure 3 Querying data
- Import the cyclist_firstname.csv file.
COPY cycling.cyclist_name (id,firstname) FROM '../cyclist_firstname.csv' WITH HEADER = TRUE;
Figure 4 Import succeeded
- Verify the imported data.
SELECT * FROM cycling.cyclist_name;
Figure 5 Import succeeded
- Import the cyclist_lastname.csv file.
COPY cycling.cyclist_name (id,lastname) FROM '../cyclist_lastname.csv' WITH HEADER = TRUE;
Figure 6 Importing data
- Check whether the data is updated.
SELECT * FROM cycling.cyclist_name;
The query result is displayed,
Figure 7 Import succeeded
- Export the id and lastname columns from the cyclist_name table to a CSV file.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot