Compute
Elastic Cloud Server
Huawei Cloud Flexus
Bare Metal Server
Auto Scaling
Image Management Service
Dedicated Host
FunctionGraph
Cloud Phone Host
Huawei Cloud EulerOS
Networking
Virtual Private Cloud
Elastic IP
Elastic Load Balance
NAT Gateway
Direct Connect
Virtual Private Network
VPC Endpoint
Cloud Connect
Enterprise Router
Enterprise Switch
Global Accelerator
Management & Governance
Cloud Eye
Identity and Access Management
Cloud Trace Service
Resource Formation Service
Tag Management Service
Log Tank Service
Config
OneAccess
Resource Access Manager
Simple Message Notification
Application Performance Management
Application Operations Management
Organizations
Optimization Advisor
IAM Identity Center
Cloud Operations Center
Resource Governance Center
Migration
Server Migration Service
Object Storage Migration Service
Cloud Data Migration
Migration Center
Cloud Ecosystem
KooGallery
Partner Center
User Support
My Account
Billing Center
Cost Center
Resource Center
Enterprise Management
Service Tickets
HUAWEI CLOUD (International) FAQs
ICP Filing
Support Plans
My Credentials
Customer Operation Capabilities
Partner Support Plans
Professional Services
Analytics
MapReduce Service
Data Lake Insight
CloudTable Service
Cloud Search Service
Data Lake Visualization
Data Ingestion Service
GaussDB(DWS)
DataArts Studio
Data Lake Factory
DataArts Lake Formation
IoT
IoT Device Access
Others
Product Pricing Details
System Permissions
Console Quick Start
Common FAQs
Instructions for Associating with a HUAWEI CLOUD Partner
Message Center
Security & Compliance
Security Technologies and Applications
Web Application Firewall
Host Security Service
Cloud Firewall
SecMaster
Anti-DDoS Service
Data Encryption Workshop
Database Security Service
Cloud Bastion Host
Data Security Center
Cloud Certificate Manager
Edge Security
Managed Threat Detection
Blockchain
Blockchain Service
Web3 Node Engine Service
Media Services
Media Processing Center
Video On Demand
Live
SparkRTC
MetaStudio
Storage
Object Storage Service
Elastic Volume Service
Cloud Backup and Recovery
Storage Disaster Recovery Service
Scalable File Service Turbo
Scalable File Service
Volume Backup Service
Cloud Server Backup Service
Data Express Service
Dedicated Distributed Storage Service
Containers
Cloud Container Engine
SoftWare Repository for Container
Application Service Mesh
Ubiquitous Cloud Native Service
Cloud Container Instance
Databases
Relational Database Service
Document Database Service
Data Admin Service
Data Replication Service
GeminiDB
GaussDB
Distributed Database Middleware
Database and Application Migration UGO
TaurusDB
Middleware
Distributed Cache Service
API Gateway
Distributed Message Service for Kafka
Distributed Message Service for RabbitMQ
Distributed Message Service for RocketMQ
Cloud Service Engine
Multi-Site High Availability Service
EventGrid
Dedicated Cloud
Dedicated Computing Cluster
Business Applications
Workspace
ROMA Connect
Message & SMS
Domain Name Service
Edge Data Center Management
Meeting
AI
Face Recognition Service
Graph Engine Service
Content Moderation
Image Recognition
Optical Character Recognition
ModelArts
ImageSearch
Conversational Bot Service
Speech Interaction Service
Huawei HiLens
Video Intelligent Analysis Service
Developer Tools
SDK Developer Guide
API Request Signing Guide
Terraform
Koo Command Line Interface
Content Delivery & Edge Computing
Content Delivery Network
Intelligent EdgeFabric
CloudPond
Intelligent EdgeCloud
Solutions
SAP Cloud
High Performance Computing
Developer Services
ServiceStage
CodeArts
CodeArts PerfTest
CodeArts Req
CodeArts Pipeline
CodeArts Build
CodeArts Deploy
CodeArts Artifact
CodeArts TestPlan
CodeArts Check
CodeArts Repo
Cloud Application Engine
MacroVerse aPaaS
KooMessage
KooPhone
KooDrive
Help Center/ GeminiDB/ GeminiDB Cassandra API/ Best Practices/ Designing Primary Keys for a Wide Table

Designing Primary Keys for a Wide Table

Updated on 2024-10-08 GMT+08:00

GeminiDB Cassandra API is a distributed database engine in which data is distributed based on primary keys. If the primary key of a table contains multiple columns, GeminiDB Cassandra API uses the columns in a left-to-right order to query data. A primary key that is improperly designed cannot be effectively used in queries. In this case, a large number of queries may be performed on a small amount of hot spot data, which degrades the query performance. Therefore, the design of primary keys plays an important role in data queries. This topic describes the considerations for primary key design and provides examples.

Are Primary Key Values Unique?

Different versions of a row use the same primary key value. By default, the latest version is returned when a query is performed. In most cases, primary keys must be unique.

Best practice: A primary key is a column or a set of columns. Each primary key value corresponds to a record.

  • [userid]: Only one column is specified as the primary key. Only one record is generated for each user.
  • [userid][orderid]: Two columns are specified as the primary key. Multiple records are generated for each user.

How Do I Design the Primary Key in Different Scenarios?

The primary key design restricts data query methods. GeminiDB Cassandra API supports SELECT statements that use the following methods:

  • The primary key is used to query data, for example:
    SELECT * FROM table WHERE userid='abc' AND orderid=123;
NOTE:

To use this method, you need to specify all primary key columns. The values in all primary key columns must be explicit.

  • Data is queried based on the primary key range, for example:
    SELECT * FROM table WHERE userid='abc' AND 123<orderid<456;
NOTE:

To use this method, you need to specify the range that you want to scan in the first primary key column. If you do not specify the range, queries may time out or fail.

Best practice: How do I submit complex queries using the preceding query methods?

  • Create an index table.
  • Specify columns that you want to scan other than the primary key columns in the query conditions. Irrelevant data is automatically filtered out.
  • Use secondary indexes.
  • Execute the ORDER BY statement to sort data in descending order. This way, new records are sorted to top rows of the table. For example:
    SELECT * FROM table WHERE userid='abc' AND 123<orderid<456 ORDER BY orderid DESC;
NOTE:

When most queries are submitted to retrieve the up-to-date data, you can design the primary key as [userid][orderid DESC] to sort the data in descending order.

Factors To Be Considered During Primary Key Design

The following factors need to be considered:

  • Length of values in primary key columns: Values in primary key columns should be short in length. Columns that store fixed-length values, such as long integers, are recommended as the primary key columns. If the length is not fixed, you are advised to limit it within 2 KB to reduce storage costs and improve write performance.
  • Number of primary key columns: Fewer primary key columns can improve write performance and reduce storage costs. One to three primary key columns are recommended.

What Should I Avoid When Designing Primary Keys?

GeminiDB Cassandra instance data is distributed based on primary keys. If the primary key of a table contains multiple columns, data is distributed based the columns in a left-to-right order. To avoid a large number of write operations from being performed on a small amount of hot spot data, note the following items:

  • Values in the first primary key column must be dispersed.
  • Do not specify a column that contains auto-incremental data or a column in which values have the same prefix, such as the timestamp column, as the first primary key column or the index column.
  • Do not specify a column that contains enumerated data, such as order types, or a column in which values have obvious prefixes as the first primary key column.

If you have to specify a column of the preceding type as the first primary key column, use the hash method to distribute data in the column.

For example, if you have to specify the column pk that contains auto-incremental strings as the first primary key column, you can create a column named pk1 based on the pk column using the following algorithm: pk1 = hash(pk).substring(0,4)+pk. The pk1 column is concatenated by the pk column and a prefix that is the first four digits of the result returned by the hash method based on the pk column.

Will Stacked Hot Spots Occur for Fully Distributed Data?

The hash method is used to distribute data to different partitions. This prevents a server from being terminated by hot spots and the other servers from being idle. This way, the distributed architecture and concurrent processing are utilized in an efficient manner.

Best practice:

  • Design a MD5 hash algorithm. The primary key is [md5(userid).subStr(0,4)][userId][orderid].
  • Design a reverse index. The primary key is [reverse(userid)][orderid].
  • Design the modulo operation. The primary key is [bucket][timestamp][hostname][log-event]; long bucket = timestamp % numBuckets.
  • Add random numbers. The primary key is [userId][orderid][random(100)].

Can a Primary Key Be Simplified?

You can reduce the number of primary key columns to decrease the amount of data that is scanned and improve the efficiency of queries and insert operations.

Best practice:

  • Replace the STRING data type with the LONG or INT data type, for example, '2015122410' => Long(2015122410).
  • Replace names with codes, for example,'mobile phone'=>'sj'.

Common Design Examples

Primary key designs for log data and time series data

  • To query the data of a metric that is generated over a period of time, design the primary key as [hostname][log-event][timestamp].
  • To query the most recent records of a metric, design the primary key as [hostname][log-event][timestamp DESC].
  • To query data that contains only the time dimension or query data whose volume is large in a specific dimension, design the primary key as long bucket = timestamp % numBuckets; [bucket][timestamp][hostname][log-event].

Primary key designs for transaction data

  • To query the transaction records of a seller within a specific period of time, design the primary key as [seller_id][timestamp][order_number].
  • To query the transaction records of a buyer within a specific period of time, design the primary key as [buyer_id][timestamp][order_number].
  • To query data based on order IDs, design the primary key as [order_number].
  • To join three tables to perform queries, design the primary key of the table that stores buyer data as [buyer_id][timestamp][order_number], primary key of the table that stores seller data as [seller_id][timestamp][order_number], and primary key of the table that stores order IDs as [order_number].

We use cookies to improve our site and your experience. By continuing to browse our site you accept our cookie policy. Find out more

Feedback

Feedback

Feedback

0/500

Selected Content

Submit selected content with the feedback