Updated on 2023-11-21 GMT+08:00

Importing and Exporting Data by Running COPY

COPY is one of cqlsh commands. It includes COPY TO and COPY FROM. They are used to copy data to and from Cassandra.

COPY TO can export data from a table to a CSV, Parquet, or ORC file.
  • If the exported file is in CSV format, it needs to be written into the target file by row, and fields are separated by delimiters.
  • If no field name is specified, all fields are exported.
  • To skip some fields, specify a field list.

COPY FROM allows you to import data from a CSV file to an existing table.

  • The source file is imported by row.
  • All rows in the dataset must contain the same number of fields, and the PRIMARY KEY field must have a value. During the import, the PRIMARY KEY field will be verified and the existing records are updated.
  • If HEADER is set to False and no field name is specified, fields are imported in a specified order. After field names are specified, the fields are imported in sequence. The missing and empty fields are set to null.
  • The source file can only have fewer fields than the target table.
  • When only COPY FROM is used to import data, the number of rows in a dataset cannot exceed 2 million.

Precautions

COPY Syntax

  • COPY TO

    COPY table_name [( column_list )] TO 'file_name' [, 'file2_name', ...] | STDOUT [WITH option = 'value' [ADN ...]]

  • COPY FROM

    COPY table_name [( column_list )] FROM 'file_name' [, 'file2_name', ...] | STDIN [WITH option = 'value' [ADN ...]]

COPY supports one or more comma-separated file names or a list of Python glob expressions.

For some common syntax symbols in the COPY command, see Table 1.

Table 1 Symbol conventions

Symbol

Description

Uppercase letters

Text keyword.

Lowercase letters

A variable, which needs to be replaced with a user-defined value.

Italic

(Optional) Enclose optional command parameters in square brackets ([]). Do not enter only square brackets.

( )

Group. Parentheses (()) indicate the group to be selected. Do not input only brackets.

|

Or. Use vertical bars (|) to separate elements. You can input any element. Do not enter only vertical bars.

...

Repeatable. The ellipsis (...) indicates that you can repeat syntax elements multiple times as required.

'Literal string'

The single quotation marks (') must contain the character string in the CQL statement. Use single quotation marks to keep uppercase letters.

{ key : value }

The map set. Include a map set or key-value pair in braces ({}). Separate keys and values with colons.

<datatype1,datatype2>

Set, list, map, or tuple of an ordered list. Angle brackets (< >) contain data types in collections, lists, maps, or tuples. Data types are separated by commas (,).

cql_statement;

End a CQL statement. Semicolons (;) end all CQL statements.

[--]

Use two hyphens (--) to separate command line options from command arguments. This syntax is useful when parameters may be mistaken for command arguments.

' <schema> ... </schema> '

Search CQL only; single quotation marks (') enclose the entire XML schema declaration.

@xml_entity='xml_entity_type'

Search CQL only; identify entities and literal values to overwrite XML elements in schemas and solrConfig files.

COPY Usage Suggestions

Table 2 Description

Command

Parameter

Description

Default Value

Applicability

TO/FROM

DELIMITER

A single character used to separate fields.

English comma,

-

TO/FROM

QUOTE

A single character that contains a field value.

"

-

TO/FROM

ESCAPE

Escapes a single character using the QUOTE character.

\

-

TO/FROM

HEADER

Boolean value (true | false), indicating the name of the column in the first row.

True matches the field name with the imported column name and inserts the column name into the first row of the exported data.

FALSE

-

TO/FROM

NULL

Filled value of the field whose query result is empty. You can set this parameter as required.

Empty string ()

-

TO/FROM

DATETIMEFORMAT

Time format for reading or writing CSV time data.

The timestamp is in the strftime format. If this parameter is not set, the default value is the value of time_format in the cqlshrc file.

Default format: %Y- %m- %d %H: %M: %S %z.

%Y-%m-%d %H:%M:%S%z

-

TO/FROM

MAXATTEMPTS

Maximum number of retry times when an error occurs.

5

-

TO/FROM

REPORTFREQUENCY

Frequency of displaying the status, in seconds.

0.25

-

TO/FROM

DECIMALSEP

Delimiter character for decimal values.

English full stop.

-

TO/FROM

THOUSANDSSEP

Separator of a thousand array.

None

-

TO/FROM

BOOLSTYLE

Boolean values indicate True and False. The value is case-insensitive. For example, the values yes and no have the same effect as values YES and NO.

True,False

-

TO/FROM

NUMPROCESSES

Number of working processes.

16

The default value of this parameter is the number of kernels on the computer minus one. There is no maximum value for this parameter.

You can run the dstat and dstat -lvrn 10 commands to check the CPU idle time. If the CPU idle time exists, use the default number of working processes. You can increase the number of processes while observing the CPU usage of the instance. It is recommended that the CPU usage be less than or equal to 60%. If the CPU usage of the executor is idle and the CPU usage of the instance exceeds the recommended value, expand the capacity to further improve the performance.

TO/FROM

CONFIGFILE

Specifies a cqlshrc configuration file to set the WITH option.

NOTE:

Command line options always overwrite the cqlshrc file.

None, user-defined

-

TO/FROM

RATEFILE

Prints the output statistics to this file.

None, user-defined

You are advised to add this parameter when exporting data to improve statistics efficiency.

TO/FROM

ORIGIN

Check whether the database to be imported or exported is an open-source Cassandra database.

  • If the open-source Cassandra is used, the value is True.
  • If GeminiDB Cassandra is used, the value is False.

False

-

FROM

CHUNKSIZE

The block size is passed to the worker process.

5000

This parameter specifies the number of rows sent from the Feeder process (reading data from files) to the worker process. Depending on the average row size of the dataset, it may be advantageous to increase the value of this parameter.

FROM

INGESTRATE

Approximate import rate per second.

100000

INGESTRATE indicates the rate (in rows) at which the feeder process sends data to the worker process per second. Generally, you do not need to change the value unless the rate is too high and needs to be limited.

FROM

MAXBATCHSIZE

Maximum size of a batch file to be imported.

20

The value of this parameter can be as large as possible but cannot exceed the upper limit.

  • MAXBATCHSIZE x The size of a single row < batch_size_fail_threshold_in_kb.
  • If the batch size is too large, an alarm will be reported and rejected.
  • Set the following parameters in cassandra.yaml:

    batch_size_warn_threshold_in_kb (The current value is 5.)

    batch_size_fail_threshold_in_kb (The current value is 50.)

FROM

MINBATCHSIZE

Minimum size of a batch import file.

2

For each chunk, the worker process writes data in batches based on the minimum batch size. The value may need to be adjusted based on the block size, number of nodes in the cluster, and number of VNODEs on each node. If the chunk size is larger, increase the value accordingly.

FROM

MAXROWS

Maximum number of rows. The value -1 indicates that there is no upper limit.

-1

-

FROM

SKIPROWS

Number of rows to skip.

0

-

FROM

SKIPCOLS

A comma-separated list of column names to skip.

None, user-defined

-

FROM

MAXPARSEERRORS

Maximum number of global parsing errors. The value -1 indicates that there is no upper limit.

-1

-

FROM

MAXINSERTERRORS

Maximum number of global insertion errors. The value -1 indicates that there is no upper limit.

-1

-

FROM

ERRFILE

A file that stores all rows that are not imported.

If no value is set, the information is stored in import_ ks _ table .err, where ks is the key space and table is the table name.

import_ ks _ table .err

-

FROM

TTL

The time to live is in seconds. By default, data does not expire.

3600

-

TO

ENCODING

Output character string type.

UTF-8

-

TO

PAGESIZE

Size of the page for obtaining results.

1000

Size of the result page. The value is an integer. The default value is 1000.

The larger the page size, the longer the value of pagetimeout. If the data volume in a single row is large, set this parameter to a smaller value. If the data volume in a single row is small, set this parameter to a larger value. The best effect of this value depends on the local batch write capability of the executor. If the local batch write capability is strong.

TO

PAGETIMEOUT

The page times out to obtain the result.

10

The value is an integer, indicating the timeout interval for obtaining each page. The unit is second. The default value is 10 seconds.

  • For a large page size or a large partition, increase the value of this parameter.
  • If a timeout occurs, increase the value of this parameter.
  • If the server times out, an exponential backoff policy is automatically initiated to prevent the server from being further overloaded, so you may notice the delay. The driver also generates a timeout. In this case, the driver does not know whether the server discards the request or returns the result later. There is a low probability that data may be lost or duplicated. Increasing the value of this parameter is helpful in preventing driver build timeouts.

TO

BEGINTOKEN

Minimum token for exporting data.

None, user-defined

The value is a string, indicating the minimum token to be considered during data export.

Records with smaller tokens will not be exported.

The default value is empty, indicating that there is no minimum token.

TO

ENDTOKEN

Maximum token used to export data.

None, user-defined

The value is a string, indicating the maximum number of tokens to be considered during data export.

Records with larger tokens will not be exported.

This parameter is left empty by default, indicating that there is no maximum token.

TO

MAXREQUESTS

Maximum number of requests that can be processed concurrently by each worker.

6

The value of this parameter is an integer, indicating the maximum number of running requests that can be processed by each working process.

Total degree of parallelism during data export = Number of working processes x Value of this parameter.

Default value: 6 Each request will export data for the entire token range.

TO

MAXOUTPUTSIZE

Maximum size of an output file, in lines.

After this parameter is set, the output file is split into multiple segments when the size of the output file exceeds the value of this parameter. The value -1 indicates that there is no upper limit.

-1

The value of this parameter is an integer, indicating the maximum size of an output file in the unit of lines. If the value of this parameter is exceeded, the output file is split into multiple segments. The default value is -1, indicating that there is no limit on the maximum value. Therefore, the file is the only output file. This parameter can be used together with MAXFILESIZE.

TO

MAXFILESIZE

Maximum size of an output file, in KB.

After this parameter is set, the output file is split into multiple segments when the size of the output file exceeds the value of this parameter.

None, user-defined

The value of this parameter is an integer, indicating the maximum size of an output file in bytes. The final file size is close to the value of this parameter. If the file size exceeds this value, the output file is split into multiple segments. The default value is -1, indicating that there is no limit on the maximum value. Therefore, the file is the only output file. This parameter can be used together with MAXOUTPUTSIZE.

TO

dataformats

Output file format. Currently, this parameter can only be set to json.

None, user-defined

-

TO

DATATYPE

The file format can be Parquet or ORC.

None, user-defined

-

TO

RESULTFILE

The exported file containing detailed results.

None, user-defined

You are advised to add this parameter when exporting data to improve statistics efficiency.

TO

wherecondition

Export condition specified during the export.

None, user-defined

-

Procedure

The following uses an example to describe how to preconfigure data, export data, and import data.

  1. Pre-configuring Data

    1. Create a keyspace.
      CREATE KEYSPACE cycling WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
    2. Create a table.
      CREATE TABLE cycling.cyclist_name (
         id UUID PRIMARY KEY,
         lastname text,
         firstname text
      );
    3. Insert a data record.
      INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (5b6962dd-3f90-4c93-8f61-eabfa4a803e2, 'VOS','Marianne');
      INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (e7cd5752-bc0d-4157-a80f-7523add8dbcd, 'VAN DER BREGGEN','Anna');
      INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (e7ae5cf3-d358-4d99-b900-85902fda9bb0, 'FRAME','Alex');
      INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (220844bf-4860-49d6-9a4b-6b5d3a79cbfb, 'TIRALONGO','Paolo');
      INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47, 'KRUIKSWIJK','Steven');
      INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (fb372533-eb95-4bb4-8685-6ef61e994caa, 'MATTHEWS', 'Michael');

  2. Exports data from and imports data to the cyclist_name table.

    1. Export the id and lastname columns from the cyclist_name table to a CSV file.
      COPY cycling.cyclist_name (id,lastname)  TO '../cyclist_lastname.csv' WITH HEADER = TRUE;
      Figure 1 Exported successfully

      After the preceding command is executed successfully, the cyclist_lastname.csv file is created in the upper-level directory of the current directory. If the file already exists, it will be overwritten.

    2. Export the id and first name columns from the cyclist_name table to another CSV file.
      COPY cycling.cyclist_name (id,firstname)  TO '../cyclist_firstname.csv' WITH HEADER = TRUE;
      Figure 2 Exported successfully

      After the preceding command is executed successfully, the cyclist_firstname.csv file is created in the upper-level directory of the current directory. If the file already exists, it will be overwritten.

    3. Delete data from the cyclist_name table. To ensure data security, the TRUNCATE command is not supported.
      DELETE FROM cycling.cyclist_name WHERE id = 'fb372533-eb95-4bb4-8685-6ef61e994caa';
    4. No data exists in the table.
      SELECT * FROM cycling.cyclist_name ;
      Figure 3 Querying data
    5. Import the cyclist_firstname.csv file.
      COPY cycling.cyclist_name (id,firstname) FROM '../cyclist_firstname.csv' WITH HEADER = TRUE;
      Figure 4 Import succeeded
    6. Verify the imported data.
      SELECT * FROM cycling.cyclist_name;
      Figure 5 Import succeeded
    7. Import the cyclist_lastname.csv file.
      COPY cycling.cyclist_name (id,lastname) FROM '../cyclist_lastname.csv' WITH HEADER = TRUE;
      Figure 6 Importing data
    8. Check whether the data is updated.
      SELECT * FROM cycling.cyclist_name;

      The query result is displayed,

      Figure 7 Import succeeded