Updated on 2024-10-08 GMT+08:00

Typical CarbonData Configuration Parameters

This section provides the details of all the configurations required for the CarbonData System.

Parameters in the carbon.properties File

Configure CarbonData parameters on the server or client based on the actual application scenario.

  • Server: Log in to FusionInsight Manager and choose Cluster > Services > Spark2x. Click Configurations then All Configurations, click JDBCServer(Role), and select Customization. Then, add CarbonData parameters in spark.carbon.customized.configs.
  • Client: Log in to the client node and configure related parameters in the {Client installation directory}/Spark/spark/conf/carbon.properties file.
Table 1 System configurations in carbon.properties

Parameter

Default Value

Description

carbon.ddl.base.hdfs.url

hdfs://hacluster/opt/data

HDFS relative path from the HDFS base path, which is configured in fs.defaultFS. The path configured in carbon.ddl.base.hdfs.url will be appended to the HDFS path configured in fs.defaultFS. If this path is configured, you do not need to pass the complete path while dataload.

For example, if the absolute path of the CSV file is hdfs://10.18.101.155:54310/data/cnbc/2016/xyz.csv, the path hdfs://10.18.101.155:54310 will come from property fs.defaultFS and you can configure /data/cnbc/ as carbon.ddl.base.hdfs.url.

During data loading, you can specify the CSV path as /2016/xyz.csv.

carbon.badRecords.location

-

Storage path of bad records. This path is an HDFS path. The default value is Null. If bad records logging or bad records operation redirection is enabled, the path must be configured by the user.

carbon.bad.records.action

fail

The following are four types of actions for bad records:

FORCE: Data is automatically corrected by storing the bad records as NULL.

REDIRECT: Bad records are written to the CSV file in carbon.badRecords.location instead of being loaded.

IGNORE: Bad records are neither loaded nor written to the CSV file.

FAIL: Data loading fails if any bad records are found.

carbon.update.sync.folder

/tmp/carbondata

Specifies the modifiedTime.mdt file path. You can set it to an existing path or a new path.

NOTE:

If you set this parameter to an existing path, ensure that all users can access the path and the path has the 777 permission.

carbon.enable.badrecord.action.redirect

false

Specifies whether to enable the REDIRECT mode to handle bad records during data loading. When it is enabled, bad records in source files will be recorded in a CSV file generated in a specified storage location each time data is loaded. CSV injection may occur when such CSV files are opened in Windows.

Table 2 Performance configurations in carbon.properties

Parameter

Default Value

Description

Data Loading Configuration

carbon.sort.file.write.buffer.size

16384

CarbonData sorts data and writes it to a temporary file to limit memory usage. This parameter controls the size of the buffer used for reading and writing temporary files. The unit is bytes.

The value ranges from 10240 to 10485760.

carbon.graph.rowset.size

100,000

Rowset size exchanged in data loading graph steps.

The value ranges from 500 to 1,000,000.

carbon.number.of.cores.while.loading

6

Number of cores used during data loading. The greater the number of cores, the better the compaction performance. If the CPU resources are sufficient, you can increase the value of this parameter.

carbon.sort.size

500000

Number of records to be sorted

carbon.enableXXHash

true

Hashmap algorithm used for hashkey calculation

carbon.number.of.cores.block.sort

7

Number of cores used for sorting blocks during data loading

carbon.max.driver.lru.cache.size

-1

Maximum size of LRU caching for data loading at the driver side. The unit is MB. The default value is -1, indicating that there is no memory limit for the caching. Only integer values greater than 0 are accepted.

carbon.max.executor.lru.cache.size

-1

Maximum size of LRU caching for data loading at the executor side. The unit is MB. The default value is -1, indicating that there is no memory limit for the caching. Only integer values greater than 0 are accepted. If this parameter is not configured, the value of carbon.max.driver.lru.cache.size is used.

carbon.merge.sort.prefetch

true

Whether to enable prefetch of data during merge sort while reading data from sorted temp files in the process of data loading

carbon.update.persist.enable

true

Configuration to enable the dataset of RDD/dataframe to persist data. Enabling this will reduce the execution time of UPDATE operation.

enable.unsafe.sort

true

Whether to use unsafe sort during data loading. Unsafe sort reduces the garbage collection during data load operation, resulting in better performance. The default value is true, indicating that unsafe sort is enabled.

enable.offheap.sort

true

Whether to use off-heap memory for sorting of data during data loading

offheap.sort.chunk.size.inmb

64

Size of data chunks to be sorted, in MB. The value ranges from 1 to 1024.

carbon.unsafe.working.memory.in.mb

512

Size of the unsafe working memory. This will be used for sorting data and storing column pages. The unit is MB.

Memory required for data loading:

carbon.number.of.cores.while.loading [default value is 6] x Number of tables to load in parallel x offheap.sort.chunk.size.inmb [default value is 64 MB] + carbon.blockletgroup.size.in.mb [default value is 64 MB] + Current compaction ratio [64 MB/3.5]

= Around 900 MB per table

Memory required for data query:

(SPARK_EXECUTOR_INSTANCES. [default value is 2] x (carbon.blockletgroup.size.in.mb [default value: 64 MB] + carbon.blockletgroup.size.in.mb [default value = 64 MB x 3.5) x Number of cores per executor [default value: 1])

= ~ 600 MB

carbon.sort.inmemory.storage.size.in.mb

512

Size of the intermediate sort data to be kept in the memory. Once the specified value is reached, the system writes data to the disk. The unit is MB.

sort.inmemory.size.inmb

1024

Size of the intermediate sort data to be kept in the memory. Once the specified value is reached, the system writes data to the disk. The unit is MB.

If carbon.unsafe.working.memory.in.mb and carbon.sort.inmemory.storage.size.in.mb are configured, you do not need to set this parameter. If this parameter has been configured, 20% of the memory is used for working memory carbon.unsafe.working.memory.in.mb, and 80% is used for sort storage memory carbon.sort.inmemory.storage.size.in.mb.

NOTE:

The value of spark.yarn.executor.memoryOverhead configured for Spark must be greater than the value of sort.inmemory.size.inmb configured for CarbonData. Otherwise, Yarn might stop the executor if off-heap access exceeds the configured executor memory.

carbon.blockletgroup.size.in.mb

64

The data is read as a group of blocklets which are called blocklet groups. This parameter specifies the size of each blocklet group. Higher value results in better sequential I/O access.

The minimum value is 16 MB. Any value less than 16 MB will be reset to the default value (64 MB).

The unit is MB.

enable.inmemory.merge.sort

false

Whether to enable inmemorymerge sort.

use.offheap.in.query.processing

true

Whether to enable offheap in query processing.

carbon.load.sort.scope

local_sort

Sort scope for the load operation. There are two types of sort: batch_sort and local_sort. If batch_sort is selected, the loading performance is improved but the query performance is reduced.

NOTE:

local_sort conflicts with DDL operations on partitioned tables and they cannot be used at the same time. In addition, local_sort does not significantly improve the performance of partitioned tables. You are advised not to enable this feature on partitioned tables.

carbon.batch.sort.size.inmb

-

Size of data to be considered for batch sorting during data loading. The recommended value is less than 45% of the total sort data. The unit is MB.

NOTE:

If this parameter is not set, its value is about 45% of the value of sort.inmemory.size.inmb by default.

enable.unsafe.columnpage

true

Whether to keep page data in heap memory during data loading or query to prevent garbage collection bottleneck.

carbon.use.local.dir

false

Whether to use Yarn local directories for multi-disk data loading. If this parameter is set to true, Yarn local directories are used to load multi-disk data to improve data loading performance.

carbon.use.multiple.temp.dir

false

Whether to use multiple temporary directories for storing temporary files to improve data loading performance.

carbon.load.datamaps.parallel.db_name.table_name

N/A

The value can be true or false. You can set the database name and table name to improve the first query performance of the table.

Compaction Configuration

carbon.number.of.cores.while.compacting

2

Number of cores to be used while compacting data. The greater the number of cores, the better the compaction performance. If the CPU resources are sufficient, you can increase the value of this parameter.

carbon.compaction.level.threshold

4,3

This configuration is for minor compaction which decides how many segments to be merged.

For example, if this parameter is set to 2,3, minor compaction is triggered every two segments. 3 is the number of level 1 compacted segments which is further compacted to new segment.

The value ranges from 0 to 100.

carbon.major.compaction.size

1024

Major compaction size. Sum of the segments which is below this threshold will be merged.

The unit is MB.

carbon.horizontal.compaction.enable

true

Whether to enable/disable horizontal compaction. After every DELETE and UPDATE statement, horizontal compaction may occur in case the incremental (DELETE/ UPDATE) files becomes more than specified threshold. By default, this parameter is set to true. You can set this parameter to false to disable horizontal compaction.

carbon.horizontal.update.compaction.threshold

1

Threshold limit on number of UPDATE delta files within a segment. In case the number of delta files goes beyond the threshold, the UPDATE delta files within the segment becomes eligible for horizontal compaction and are compacted into single UPDATE delta file. By default, this parameter is set to 1. The value ranges from 1 to 10000.

carbon.horizontal.delete.compaction.threshold

1

Threshold limit on number of DELETE incremental files within a block of a segment. In case the number of incremental files goes beyond the threshold, the DELETE incremental files for the particular block of the segment becomes eligible for horizontal compaction and are compacted into single DELETE incremental file. By default, this parameter is set to 1. The value ranges from 1 to 10000.

Query Configuration

carbon.number.of.cores

4

Number of cores to be used during query

carbon.limit.block.distribution.enable

false

Whether to enable the CarbonData distribution for limit query. The default value is false, indicating that block distribution is disabled for query statements that contain the keyword limit. For details about how to optimize this parameter, see Typical CarbonData Performance Tuning Parameters.

carbon.custom.block.distribution

false

Whether to enable Spark or CarbonData block distribution. By default, the value is false, indicating that Spark block distribution is enabled. To enable CarbonData block distribution, change the value to true.

carbon.infilter.subquery.pushdown.enable

false

If this is set to true and a Select query is triggered in the filter with subquery, the subquery is executed and the output is broadcast as IN filter to the left table. Otherwise, SortMergeSemiJoin is executed. You are advised to set this to true when IN filter subquery does not return too many records. For example, when the IN sub-sentence query returns 10,000 or fewer records, enabling this parameter will give the query results faster.

Example: select * from flow_carbon_256b where cus_no in (select cus_no from flow_carbon_256b where dt>='20260101' and dt<='20260701' and txn_bk='tk_1' and txn_br='tr_1') limit 1000;

carbon.scheduler.minRegisteredResourcesRatio

0.8

Minimum resource (executor) ratio needed for starting the block distribution. The default value is 0.8, indicating that 80% of the requested resources are allocated for starting block distribution.

carbon.dynamicAllocation.schedulerTimeout

5

Maximum time that the scheduler waits for executors to be active. The default value is 5 seconds, and the maximum value is 15 seconds.

enable.unsafe.in.query.processing

true

Whether to use unsafe sort during query. Unsafe sort reduces the garbage collection during query, resulting in better performance. The default value is true, indicating that unsafe sort is enabled.

carbon.enable.vector.reader

true

Whether to enable vector processing for result collection to improve query performance

carbon.query.show.datamaps

true

SHOW TABLES lists all tables including the primary table and datamaps. To filter out the datamaps, set this parameter to false.

Secondary Index Configuration

carbon.secondary.index.creation.threads

1

Number of threads to concurrently process segments during secondary index creation. This property helps fine-tuning the system when there are a lot of segments in a table. The value ranges from 1 to 50.

carbon.si.lookup.partialstring

true

  • When the parameter value is true, it includes indexes started with, ended with, and contained.
  • When the parameter value is false, it includes only secondary indexes started with.

carbon.si.segment.merge

true

Enabling this property merges .carbondata files inside the secondary index segment. The merging will happen after the load operation. That is, at the end of the secondary index table load, small files are checked and merged.

NOTE:

Table Block Size is used as the size threshold for merging small files.

Table 3 Other configurations in carbon.properties

Parameter

Default Value

Description

Data Loading Configuration

carbon.lock.type

HDFSLOCK

Type of lock to be acquired during concurrent operations on a table.

There are following types of lock implementation:

  • LOCALLOCK: Lock is created on local file system as a file. This lock is useful when only one Spark driver (or JDBCServer) runs on a machine.
  • HDFSLOCK: Lock is created on HDFS file system as a file. This lock is useful when multiple Spark applications are running and no ZooKeeper is running on a cluster.

carbon.sort.intermediate.files.limit

20

Minimum number of intermediate files. After intermediate files are generated, sort and merge the files. For details about how to optimize this parameter, see Typical CarbonData Performance Tuning Parameters.

carbon.csv.read.buffersize.byte

1048576

Size of CSV reading buffer

carbon.merge.sort.reader.thread

3

Maximum number of threads used for reading intermediate files for final merging.

carbon.concurrent.lock.retries

100

Maximum number of retries used to obtain the concurrent operation lock. This parameter is used for concurrent loading.

carbon.concurrent.lock.retry.timeout.sec

1

Interval between the retries to obtain the lock for concurrent operations.

carbon.lock.retries

3

Maximum number of retries to obtain the lock for any operations other than import.

carbon.lock.retry.timeout.sec

5

Interval between the retries to obtain the lock for any operation other than import.

carbon.tempstore.location

/opt/Carbon/TempStoreLoc

Temporary storage location. By default, the System.getProperty("java.io.tmpdir") method is used to obtain the value. For details about how to optimize this parameter, see the description of carbon.use.local.dir in Typical CarbonData Performance Tuning Parameters.

carbon.load.log.counter

500000

Data loading records count in logs

SERIALIZATION_NULL_FORMAT

\N

Value to be replaced with NULL

carbon.skip.empty.line

false

Setting this property will ignore the empty lines in the CSV file during data loading.

carbon.load.datamaps.parallel

false

Whether to enable parallel datamap loading for all tables in all sessions. This property will improve the time to load datamaps into memory by distributing the job among executors, thus improving query performance.

Merging Configuration

carbon.numberof.preserve.segments

0

If you want to preserve some number of segments from being compacted, then you can set this configuration.

For example, if carbon.numberof.preserve.segments is set to 2, the latest two segments will always be excluded from the compaction.

No segments will be preserved by default.

carbon.allowed.compaction.days

0

This configuration is used to control on the number of recent segments that needs to be merged.

For example, if this parameter is set to 2, the segments which are loaded in the time frame of past 2 days only will get merged. Segments which are loaded earlier than 2 days will not be merged.

This configuration is disabled by default.

carbon.enable.auto.load.merge

false

Whether to enable compaction along with data loading.

carbon.merge.index.in.segment

true

This configuration merges all the CarbonIndex files (.carbonindex) in a segment into a single MergeIndex file (.carbonindexmerge) upon data loading completion. This significantly reduces the delay in serving the first query.

Query Configuration

max.query.execution.time

60

Maximum time allowed for one query to be executed.

The unit is minute.

carbon.enableMinMax

true

MinMax is used to improve query performance. You can set this to false to disable this function.

carbon.lease.recovery.retry.count

5

Maximum number of attempts that need to be made for recovering a lease on a file.

Minimum value: 1

Maximum value: 50

carbon.lease.recovery.retry.interval

1000 (ms)

Interval or pause time after a lease recovery attempt is made on a file.

Minimum value: 1000 (ms)

Maximum value: 10000 (ms)

Parameters in the spark-defaults.conf File

  • Log in to the client node and configure the parameters listed in Table 4 in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file.
    Table 4 Spark configuration reference in spark-defaults.conf

    Parameter

    Default Value

    Description

    spark.driver.memory

    4G

    Memory to be used for the driver process. SparkContext has been initialized.

    NOTE:

    In client mode, do not use SparkConf to set this parameter in the application because the driver JVM has been started. To configure this parameter, configure it in the --driver-memory command-line option or in the default property file.

    spark.executor.memory

    4 GB

    Memory to be used for each executor process.

    spark.sql.crossJoin.enabled

    true

    If the query contains a cross join, enable this property so that no error occurs. In this case, you can use a cross join instead of a join for better performance.

  • Configure the following parameters in the spark-defaults.conf file on the Spark driver.
    • In spark-sql mode: Log in to the Spark client node and configure the parameters listed in Table 5 in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file.
      Table 5 Parameter description

      Parameter

      Value

      Description

      spark.driver.extraJavaOptions

      -Dlog4j.configuration=file:/opt/client/Spark2x/spark/conf/log4j.properties -Djetty.version=x.y.z -Dzookeeper.server.principal=zookeeper/hadoop.<System domain name> -Djava.security.krb5.conf=/opt/client/KrbClient/kerberos/var/krb5kdc/krb5.conf -Djava.security.auth.login.config=/opt/client/Spark2x/spark/conf/jaas.conf -Dorg.xerial.snappy.tempdir=/opt/client/Spark2x/tmp -Dcarbon.properties.filepath=/opt/client/Spark2x/spark/conf/carbon.properties -Djava.io.tmpdir=/opt/client/Spark2x/tmp

      The default value /opt/client/Spark2x/spark indicates CLIENT_HOME of the client and is added to the end of the value of spark.driver.extraJavaOptions. This parameter is used to specify the path of the carbon.propertiesfile in Driver.

      NOTE:

      Spaces next to equal marks (=) are not allowed.

      spark.sql.session.state.builder

      org.apache.spark.sql.hive.FIHiveACLSessionStateBuilder

      Session state constructor.

      spark.carbon.sqlastbuilder.classname

      org.apache.spark.sql.hive.CarbonInternalSqlAstBuilder

      AST constructor.

      spark.sql.catalog.class

      org.apache.spark.sql.hive.HiveACLExternalCatalog

      Hive External catalog to be used. This parameter is mandatory if Spark ACL is enabled.

      spark.sql.hive.implementation

      org.apache.spark.sql.hive.HiveACLClientImpl

      How to call the Hive client. This parameter is mandatory if Spark ACL is enabled.

      spark.sql.hiveClient.isolation.enabled

      false

      This parameter is mandatory if Spark ACL is enabled.

    • In JDBCServer: Log in to the node where JDBCServer is installed and configure the parameters listed in Table 6 in the {BIGDATA_HOME}/FusionInsight_Spark_*/*_JDBCServer/etc/spark-defaults.conf file.
      Table 6 Parameter description

      Parameter

      Value

      Description

      spark.driver.extraJavaOptions

      -Xloggc:${SPARK_LOG_DIR}/indexserver-omm-%p-gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=512M -XX:MaxMetaspaceSize=512M -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -XX:OnOutOfMemoryError='kill -9 %p' -Djetty.version=x.y.z -Dorg.xerial.snappy.tempdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/snappy_tmp -Djava.io.tmpdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/io_tmp -Dcarbon.properties.filepath=${SPARK_CONF_DIR}/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048 -Dspark.ssl.keyStore=${SPARK_CONF_DIR}/child.keystore #{java_stack_prefer}

      The default value ${SPARK_CONF_DIR} depends on a specific cluster and is added to the end of the value of the spark.driver.extraJavaOptions parameter. This parameter is used to specify the path of the carbon.properties file in Driver.

      NOTE:

      Spaces next to equal marks (=) are not allowed.

      spark.sql.session.state.builder

      org.apache.spark.sql.hive.FIHiveACLSessionStateBuilder

      Session state constructor.

      spark.carbon.sqlastbuilder.classname

      org.apache.spark.sql.hive.CarbonInternalSqlAstBuilder

      AST constructor.

      spark.sql.catalog.class

      org.apache.spark.sql.hive.HiveACLExternalCatalog

      Hive External catalog to be used. This parameter is mandatory if Spark ACL is enabled.

      spark.sql.hive.implementation

      org.apache.spark.sql.hive.HiveACLClientImpl

      How to call the Hive client. This parameter is mandatory if Spark ACL is enabled.

      spark.sql.hiveClient.isolation.enabled

      false

      This parameter is mandatory if Spark ACL is enabled.