Help Center/ GeminiDB/ GeminiDB Cassandra API/ Best Practices/ Modeling Data of GeminiDB Cassandra Instances
Updated on 2024-11-06 GMT+08:00

Modeling Data of GeminiDB Cassandra Instances

This section describes concepts and suggestion of modeling data on GeminiDB Cassandra instances.

GeminiDB Cassandra API is a distributed, decentralized, and highly available wide-column store, a special type of NoSQL databases.

Data is evenly distributed to nodes in a GeminiDB Cassandra cluster using the consistent hashing algorithm. Each node functions as a proxy to receive requests from clients. Based on the cluster keyspace replica and snitch policies, GeminiDB Cassandra API replicates data within the specified primary key range of each node to other nodes in the cluster to improve data reliability and service availability in the distributed system.

An adjustable consistency level (such as ONE and QUORUM) is defined for each read and write, so GeminiDB Cassandra API ensures service availability and data consistency of a single request.

Concepts

Key

There are multiple keys, for example:

CREATE TABLE mytable1 ( name text PRIMARY KEY , age int , address text , person_id text );
CREATE TABLE mytable2 ( name text , age int , address text , person_id text, PRIMARY KEY (name, age) );
CREATE TABLE mytable3 ( name text , age int , address text , person_id text, PRIMARY KEY ((name, age), person_id) ) WITH CLUSTERING ORDER BY (person_id DESC );
  • PRIMARY KEY: A primary key is a unique identifier for each record in a table and consists of multiple data types. In the preceding example, name, (name, age), and ((name, age), person_id) indicate primary keys of mytable1, mytable2, and mytable3.
  • PARTITION KEY: A partition key is the first column of a primary key and determines which node will store hashed data of GeminiDB Cassandra instances. In the preceding example, name, name, and (name, age) indicate partition keys of mytable1, mytable2, and mytable3, respectively. Data that shares partition keys is distributed to the same partition.
  • CLUSTERING KEY: Each primary key column after a partition key is considered a clustering key, which is used to sort data within a partition. In the preceding example, mytable1 does not have a clustering key. age and person_id indicate clustering keys of mytable2 and mytable3, respectively.

To improve performance of a GeminiDB Cassandra cluster, ensure data is evenly distributed on each node in the cluster. Factors affecting the performance include the partition size, data redundancy, and disk space usage. No more than 100,000 records and no more than 100 MB of data are recommended in each partition.

Secondary index

Example:

CREATE INDEX mytable_idx_age ON mytable2 (age);

Create a native secondary index in the age column of mytable2. The native secondary index stores index data in a new table on a GeminiDB Cassandra instance. Values in the index column are used as keys while the original index keys are used as values. The final structure of the index table may be:

CREATE TABLE mytable_index_age (age int, name text , address text , person_id text, PRIMARY KEY(age, name));

PARTITION KEY cannot be used to find the node storing the index table based on age, because index and native data in the index table are stored on the same node and the local data placement policy is used.

You are advised to add a partition constraint of the original table when using the native secondary index. If the partition key is not specified, full table scan will be performed for a query. The following modes are recommended:

SELECT * FROM mytable2 WHERE age = 11 AND name = 'name';
SELECT * FROM mytable2 WHERE age >= 11 AND name IN ('name1', 'name2') ;
SELECT * FROM mytable2 WHERE age = 11 AND TOKEN (name)> xxxxx AND TOKEN(name) < yyyyy;

Suggestions and Principles for Modeling Data

Before performing operations on GeminiDB Cassandra instances, you need to create a service model, organize data (design primary keys), read and write data based on application features.

  • No JOIN: GeminiDB Cassandra API does not support a JOIN query, so you need to execute it on a client or create a table if necessary.
  • No referential integrity: Cross-table referential integrity is not supported. Data in a table cannot be referenced from another table using foreign keys.
  • Denormalization: Redundant data is added to an otherwise normalized relational database to improve read performance.
  • Query-first: Different from a relational database management system (RDBMS), the query-first approach focuses on how to search for information first and then set up a database based on those searches or queries.
  • Designing for optimal storage: How to store relational database tables is transparent to users. To model data of GeminiDB Cassandra instances, you need to consider data storage rules on disks and minimize data partitions.
  • Sorting is a design decision: The sort order available on queries is fixed during table creation.