Importing and Exporting Data by Running COPY

COPY is one of cqlsh commands. It includes COPY TO and COPY FROM. They are used to copy data to and from Cassandra.

COPY TO can export data from a table to a CSV, Parquet, or ORC file.

If the exported file is in CSV format, it needs to be written into the target file by row, and fields are separated by delimiters.
If no field name is specified, all fields are exported.
To skip some fields, specify a field list.

COPY FROM allows you to import data from a CSV file to an existing table.

The source file is imported by row.
All rows in the dataset must contain the same number of fields, and the PRIMARY KEY field must have a value. During the import, the PRIMARY KEY field will be verified and the existing records are updated.
If HEADER is set to False and no field name is specified, fields are imported in a specified order. After field names are specified, the fields are imported in sequence. The missing and empty fields are set to null.
The source file can only have fewer fields than the target table.
When only COPY FROM is used to import data, the number of rows in a dataset cannot exceed 2 million.

Precautions

Import and export data during off-peak hours to minimize the impacts on your services.
Obtain the latest binary package by following Connecting to a GeminiDB Cassandra Instance over a Private Network.

COPY Syntax

COPY TO
COPY table_name [( column_list )] TO 'file_name' [, 'file2_name', ...] | STDOUT [WITH option = 'value' [AND ...]]
COPY FROM
COPY table_name [( column_list )] FROM 'file_name' [, 'file2_name', ...] | STDIN [WITH option = 'value' [AND ...]]

COPY supports one or more comma-separated file names or a list of Python glob expressions.

For some common syntax symbols in the COPY command, see Table 1.

**Table 1** Symbol conventions
Symbol	Description
Uppercase letters	Text keyword.
Lowercase letters	A variable, which needs to be replaced with a user-defined value.
Italic	(Optional) Enclose optional command parameters in square brackets ([]). Do not enter only square brackets.
( )	Group. Parentheses (()) indicate the group to be selected. Do not input only brackets.
\|	Or. Use vertical bars (\|) to separate elements. You can input any element. Do not enter only vertical bars.
...	Repeatable. The ellipsis (...) indicates that you can repeat syntax elements multiple times as required.
'Literal string'	The single quotation marks (') must contain the character string in the CQL statement. Use single quotation marks to keep uppercase letters.
{ key : value }	The map set. Include a map set or key-value pair in braces ({}). Separate keys and values with colons.
<datatype1,datatype2>	Set, list, map, or tuple of an ordered list. Angle brackets (< >) contain data types in collections, lists, maps, or tuples. Data types are separated by commas (,).
cql_statement;	End a CQL statement. Semicolons (;) end all CQL statements.
[--]	Use two hyphens (--) to separate command line options from command arguments. This syntax is useful when parameters may be mistaken for command arguments.
' <schema> ... </schema> '	Search CQL only; single quotation marks (') enclose the entire XML schema declaration.
@xml_entity='xml_entity_type'	Search CQL only; identify entities and literal values to overwrite XML elements in schemas and solrConfig files.

COPY Usage Suggestions

**Table 2** Description
Command	Parameter	Description	Default Value	Applicability
TO/FROM	DELIMITER	A single character used to separate fields.	English comma,	-
TO/FROM	QUOTE	A single character that contains a field value.	"	-
TO/FROM	ESCAPE	Escapes a single character using the QUOTE character.	\	-
TO/FROM	HEADER	Boolean value (true \| false), indicating the name of the column in the first row. True matches the field name with the imported column name and inserts the column name into the first row of the exported data.	FALSE	-
TO/FROM	NULL	Filled value of the field whose query result is empty. You can set this parameter as required.	Empty string ()	-
TO/FROM	DATETIMEFORMAT	Time format for reading or writing CSV time data. The timestamp is in the strftime format. If this parameter is not set, the default value is the value of time_format in the cqlshrc file. Default format: %Y- %m- %d %H: %M: %S %z.	%Y-%m-%d %H:%M:%S%z	-
TO/FROM	MAXATTEMPTS	Maximum number of retry times when an error occurs.	5	-
TO/FROM	REPORTFREQUENCY	Frequency of displaying the status, in seconds.	0.25	-
TO/FROM	DECIMALSEP	Delimiter character for decimal values.	English full stop.	-
TO/FROM	THOUSANDSSEP	Separator of a thousand array.	None	-
TO/FROM	BOOLSTYLE	Boolean values indicate True and False. The value is case-insensitive. For example, the values yes and no have the same effect as values YES and NO.	True,False	-
TO/FROM	NUMPROCESSES	Number of working processes.	16	The default value of this parameter is the number of kernels on the computer minus one. There is no maximum value for this parameter. You can run the dstat and dstat -lvrn 10 commands to check the CPU idle time. If the CPU idle time exists, use the default number of working processes. You can increase the number of processes while observing the CPU usage of the instance. It is recommended that the CPU usage be less than or equal to 60%. If the CPU usage of the executor is idle and the CPU usage of the instance exceeds the recommended value, expand the capacity to further improve the performance.
TO/FROM	CONFIGFILE	Specifies a cqlshrc configuration file to set the WITH option. NOTE: Command line options always overwrite the cqlshrc file.	None, user-defined	-
TO/FROM	RATEFILE	Prints the output statistics to this file.	None, user-defined	You are advised to add this parameter when exporting data to improve statistics efficiency.
TO/FROM	ORIGIN	Check whether the database to be imported or exported is an open-source Cassandra database. If the open-source Cassandra is used, the value is True. If GeminiDB Cassandra is used, the value is False.	False	-
FROM	CHUNKSIZE	The block size is passed to the worker process.	5000	This parameter specifies the number of rows sent from the Feeder process (reading data from files) to the worker process. Depending on the average row size of the dataset, it may be advantageous to increase the value of this parameter.
FROM	INGESTRATE	Approximate import rate per second.	100000	INGESTRATE indicates the rate (in rows) at which the feeder process sends data to the worker process per second. Generally, you do not need to change the value unless the rate is too high and needs to be limited.
FROM	MAXBATCHSIZE	Maximum size of a batch file to be imported.	20	The value of this parameter can be as large as possible but cannot exceed the upper limit. MAXBATCHSIZE x The size of a single row < batch_size_fail_threshold_in_kb. If the batch size is too large, an alarm will be reported and rejected. Set the following parameters in cassandra.yaml: batch_size_warn_threshold_in_kb (The current value is 5.) batch_size_fail_threshold_in_kb (The current value is 50.)
FROM	MINBATCHSIZE	Minimum size of a batch import file.	2	For each chunk, the worker process writes data in batches based on the minimum batch size. The value may need to be adjusted based on the block size, number of nodes in the cluster, and number of vnodes on each node. The larger the chunk size is, the larger the value will be.
FROM	MAXROWS	Maximum number of rows. The value -1 indicates that there is no upper limit.	-1	-
FROM	SKIPROWS	Number of rows to skip.	0	-
FROM	SKIPCOLS	A comma-separated list of column names to skip.	None, user-defined	-
FROM	MAXPARSEERRORS	Maximum number of global parsing errors. The value -1 indicates that there is no upper limit.	-1	-
FROM	MAXINSERTERRORS	Maximum number of global insertion errors. The value -1 indicates that there is no upper limit.	-1	-
FROM	ERRFILE	A file that stores all rows that are not imported. If no value is set, the information is stored in import_ ks _ table .err, where ks is the key space and table is the table name.	import_ ks _ table .err	-
FROM	TTL	The time to live is in seconds. By default, data does not expire.	3600	-
TO	ENCODING	Output character string type.	UTF-8	-
TO	PAGESIZE	Size of the page for obtaining results.	1000	Size of the result page. The value is an integer. The default value is 1000. The larger the page size, the longer the value of pagetimeout. If the data volume in a single row is large, set this parameter to a smaller value. If the data volume in a single row is small, set this parameter to a larger value. The best effect of this value depends on the local batch write capability of the executor. If the local batch write capability is strong.
TO	PAGETIMEOUT	The page times out to obtain the result.	10	The value is an integer, indicating the timeout interval for obtaining each page. The unit is second. The default value is 10 seconds. For a large page size or a large partition, increase the value of this parameter. If a timeout occurs, increase the value of this parameter. If the server times out, an exponential backoff policy is automatically initiated to prevent the server from being further overloaded, so you may notice the delay. The driver also generates a timeout. In this case, the driver does not know whether the server discards the request or returns the result later. There is a low probability that data may be lost or duplicated. Increasing the value of this parameter is helpful in preventing driver build timeouts.
TO	BEGINTOKEN	Minimum token for exporting data.	None, user-defined	The value is a string, indicating the minimum token to be considered during data export. Records with smaller tokens will not be exported. The default value is empty, indicating that there is no minimum token.
TO	ENDTOKEN	Maximum token used to export data.	None, user-defined	The value is a string, indicating the maximum number of tokens to be considered during data export. Records with larger tokens will not be exported. This parameter is left empty by default, indicating that there is no maximum token.
TO	MAXREQUESTS	Maximum number of requests that can be processed concurrently by each worker.	6	The value of this parameter is an integer, indicating the maximum number of running requests that can be processed by each working process. Total degree of parallelism during data export = Number of working processes x Value of this parameter. Default value: 6 Each request will export data for the entire token range.
TO	MAXOUTPUTSIZE	Maximum size of an output file, in lines. After this parameter is set, the output file is split into multiple segments when the size of the output file exceeds the value of this parameter. The value -1 indicates that there is no upper limit.	-1	The value of this parameter is an integer, indicating the maximum size of an output file in the unit of lines. If the value of this parameter is exceeded, the output file is split into multiple segments. The default value is -1, indicating that there is no limit on the maximum value. Therefore, the file is the only output file. This parameter can be used together with MAXFILESIZE.
TO	MAXFILESIZE	Maximum size of an output file, in KB. After this parameter is set, the output file is split into multiple segments when the size of the output file exceeds the value of this parameter.	None, user-defined	The value of this parameter is an integer, indicating the maximum size of an output file in bytes. The final file size is close to the value of this parameter. If the file size exceeds this value, the output file is split into multiple segments. The default value is -1, indicating that there is no limit on the maximum value. Therefore, the file is the only output file. This parameter can be used together with MAXOUTPUTSIZE.
TO	dataformats	Output file format. Currently, this parameter can only be set to json.	None, user-defined	-
TO	DATATYPE	The file format can be Parquet or ORC.	None, user-defined	-
TO	RESULTFILE	The exported file containing detailed results.	None, user-defined	You are advised to add this parameter when exporting data to improve statistics efficiency.
TO	wherecondition	Export condition specified during the export.	None, user-defined	-

Procedure

The following uses an example to describe how to preconfigure data, export data, and import data.

Pre-configuring Data

Create a keyspace.

CREATE KEYSPACE cycling WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

Create a table.

CREATE TABLE cycling.cyclist_name (
   id UUID PRIMARY KEY,
   lastname text,
   firstname text
);

Insert a data record.

INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (5b6962dd-3f90-4c93-8f61-eabfa4a803e2, 'VOS','Marianne');
INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (e7cd5752-bc0d-4157-a80f-7523add8dbcd, 'VAN DER BREGGEN','Anna');
INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (e7ae5cf3-d358-4d99-b900-85902fda9bb0, 'FRAME','Alex');
INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (220844bf-4860-49d6-9a4b-6b5d3a79cbfb, 'TIRALONGO','Paolo');
INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, 'KRUIKSWIJK','Steven');
INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (fb372533-eb95-4bb4-8685-6ef61e994caa, 'MATTHEWS', 'Michael');

Exports data from and imports data to the cyclist_name table.
1. Export the id and lastname columns from the cyclist_name table to a CSV file.
```
COPY cycling.cyclist_name (id,lastname)  TO '../cyclist_lastname.csv' WITH HEADER = TRUE;
```
  Figure 1 Exported successfully
  
  After the preceding command is executed successfully, the cyclist_lastname.csv file is created in the upper-level directory of the current directory. If the file already exists, it will be overwritten.
2. Export the id and first name columns from the cyclist_name table to another CSV file.
```
COPY cycling.cyclist_name (id,firstname)  TO '../cyclist_firstname.csv' WITH HEADER = TRUE;
```
  Figure 2 Exported successfully
  
  After the preceding command is executed successfully, the cyclist_firstname.csv file is created in the upper-level directory of the current directory. If the file already exists, it will be overwritten.
3. Delete data from the cyclist_name table. To ensure data security, the TRUNCATE command is not supported.
```
DELETE FROM cycling.cyclist_name WHERE id = 'fb372533-eb95-4bb4-8685-6ef61e994caa';
```
4. No data exists in the table.
```
SELECT * FROM cycling.cyclist_name ;
```
  Figure 3 Querying data
5. Import the cyclist_firstname.csv file.
```
COPY cycling.cyclist_name (id,firstname) FROM '../cyclist_firstname.csv' WITH HEADER = TRUE;
```
  Figure 4 Import succeeded
6. Verify the imported data.
```
SELECT * FROM cycling.cyclist_name;
```
  Figure 5 Import succeeded
7. Import the cyclist_lastname.csv file.
```
COPY cycling.cyclist_name (id,lastname) FROM '../cyclist_lastname.csv' WITH HEADER = TRUE;
```
  Figure 6 Importing data
8. Check whether the data is updated.
```
SELECT * FROM cycling.cyclist_name;
```
  The query result is displayed,
  
  Figure 7 Import succeeded