Help Center/ GeminiDB/ GeminiDB Cassandra API/ Best Practices/ Designing Primary Keys for a Wide Table

Updated on 2024-10-08 GMT+08:00

View PDF

Designing Primary Keys for a Wide Table

GeminiDB Cassandra API is a distributed database engine in which data is distributed based on primary keys. If the primary key of a table contains multiple columns, GeminiDB Cassandra API uses the columns in a left-to-right order to query data. A primary key that is improperly designed cannot be effectively used in queries. In this case, a large number of queries may be performed on a small amount of hot spot data, which degrades the query performance. Therefore, the design of primary keys plays an important role in data queries. This topic describes the considerations for primary key design and provides examples.

Are Primary Key Values Unique?

Different versions of a row use the same primary key value. By default, the latest version is returned when a query is performed. In most cases, primary keys must be unique.

Best practice: A primary key is a column or a set of columns. Each primary key value corresponds to a record.

[userid]: Only one column is specified as the primary key. Only one record is generated for each user.
[userid][orderid]: Two columns are specified as the primary key. Multiple records are generated for each user.

How Do I Design the Primary Key in Different Scenarios?

The primary key design restricts data query methods. GeminiDB Cassandra API supports SELECT statements that use the following methods:

The primary key is used to query data, for example:

SELECT * FROM table WHERE userid='abc' AND orderid=123;

To use this method, you need to specify all primary key columns. The values in all primary key columns must be explicit.

Data is queried based on the primary key range, for example:
```
SELECT * FROM table WHERE userid='abc' AND 123<orderid<456;
```

To use this method, you need to specify the range that you want to scan in the first primary key column. If you do not specify the range, queries may time out or fail.

Best practice: How do I submit complex queries using the preceding query methods?

Create an index table.
Specify columns that you want to scan other than the primary key columns in the query conditions. Irrelevant data is automatically filtered out.
Use secondary indexes.
Execute the ORDER BY statement to sort data in descending order. This way, new records are sorted to top rows of the table. For example:
```
SELECT * FROM table WHERE userid='abc' AND 123<orderid<456 ORDER BY orderid DESC;
```

When most queries are submitted to retrieve the up-to-date data, you can design the primary key as [userid][orderid DESC] to sort the data in descending order.

Factors To Be Considered During Primary Key Design

The following factors need to be considered:

Length of values in primary key columns: Values in primary key columns should be short in length. Columns that store fixed-length values, such as long integers, are recommended as the primary key columns. If the length is not fixed, you are advised to limit it within 2 KB to reduce storage costs and improve write performance.
Number of primary key columns: Fewer primary key columns can improve write performance and reduce storage costs. One to three primary key columns are recommended.

What Should I Avoid When Designing Primary Keys?

GeminiDB Cassandra instance data is distributed based on primary keys. If the primary key of a table contains multiple columns, data is distributed based the columns in a left-to-right order. To avoid a large number of write operations from being performed on a small amount of hot spot data, note the following items:

Values in the first primary key column must be dispersed.
Do not specify a column that contains auto-incremental data or a column in which values have the same prefix, such as the timestamp column, as the first primary key column or the index column.
Do not specify a column that contains enumerated data, such as order types, or a column in which values have obvious prefixes as the first primary key column.

If you have to specify a column of the preceding type as the first primary key column, use the hash method to distribute data in the column.

For example, if you have to specify the column pk that contains auto-incremental strings as the first primary key column, you can create a column named pk1 based on the pk column using the following algorithm: pk1 = hash(pk).substring(0,4)+pk. The pk1 column is concatenated by the pk column and a prefix that is the first four digits of the result returned by the hash method based on the pk column.

Will Stacked Hot Spots Occur for Fully Distributed Data?

The hash method is used to distribute data to different partitions. This prevents a server from being terminated by hot spots and the other servers from being idle. This way, the distributed architecture and concurrent processing are utilized in an efficient manner.

Best practice:

Design a MD5 hash algorithm. The primary key is [md5(userid).subStr(0,4)][userId][orderid].
Design a reverse index. The primary key is [reverse(userid)][orderid].
Design the modulo operation. The primary key is [bucket][timestamp][hostname][log-event]; long bucket = timestamp % numBuckets.
Add random numbers. The primary key is [userId][orderid][random(100)].

Can a Primary Key Be Simplified?

You can reduce the number of primary key columns to decrease the amount of data that is scanned and improve the efficiency of queries and insert operations.

Best practice:

Replace the STRING data type with the LONG or INT data type, for example, '2015122410' => Long(2015122410).
Replace names with codes, for example,'mobile phone'=>'sj'.

Common Design Examples

Primary key designs for log data and time series data

To query the data of a metric that is generated over a period of time, design the primary key as [hostname][log-event][timestamp].
To query the most recent records of a metric, design the primary key as [hostname][log-event][timestamp DESC].
To query data that contains only the time dimension or query data whose volume is large in a specific dimension, design the primary key as long bucket = timestamp % numBuckets; [bucket][timestamp][hostname][log-event].

Primary key designs for transaction data

To query the transaction records of a seller within a specific period of time, design the primary key as [seller_id][timestamp][order_number].
To query the transaction records of a buyer within a specific period of time, design the primary key as [buyer_id][timestamp][order_number].
To query data based on order IDs, design the primary key as [order_number].
To join three tables to perform queries, design the primary key of the table that stores buyer data as [buyer_id][timestamp][order_number], primary key of the table that stores seller data as [seller_id][timestamp][order_number], and primary key of the table that stores order IDs as [order_number].

Parent topic: Best Practices

Previous topic: Scenarios

Next topic: Pre-partitioning Tables