Common CarbonData Parameters
This section provides the details of all the configurations required for the CarbonData System.
Parameters in the carbon.properties File
Configure CarbonData parameters on the server or client based on the actual application scenario.
- Server: Log in to FusionInsight Manager and choose Cluster > Services > Spark2x. Click Configurations and then All Configurations, click JDBCServer(Role), and select Customization. Then add the CarbonData parameters to spark.carbon.customized.configs.
- Client: Log in to the client node and configure related parameters in the {Client installation directory}/Spark/spark/conf/carbon.properties file.
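For example, a client-side carbon.properties file contains plain key-value pairs such as the following (the values shown are illustrative, not recommendations):

```properties
# {Client installation directory}/Spark/spark/conf/carbon.properties
# Illustrative values only - tune them for your cluster.
carbon.ddl.base.hdfs.url=hdfs://hacluster/opt/data
carbon.update.sync.folder=/tmp/carbondata
carbon.number.of.cores.while.loading=6
```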
Parameter |
Default Value |
Description |
---|---|---|
carbon.ddl.base.hdfs.url |
hdfs://hacluster/opt/data |
HDFS path relative to the HDFS base path, which is configured in fs.defaultFS. The path configured in carbon.ddl.base.hdfs.url is appended to the HDFS path configured in fs.defaultFS. If this path is configured, you do not need to pass the complete path during data loading. For example, if the absolute path of the CSV file is hdfs://10.18.101.155:54310/data/cnbc/2016/xyz.csv, the prefix hdfs://10.18.101.155:54310 comes from fs.defaultFS, and you can configure /data/cnbc/ as carbon.ddl.base.hdfs.url. During data loading, you can then specify the CSV path as /2016/xyz.csv. |
carbon.badRecords.location |
- |
Storage path of bad records. This path is an HDFS path. The default value is Null. If bad records logging or bad records operation redirection is enabled, the path must be configured by the user. |
carbon.bad.records.action |
fail |
The following are four types of actions for bad records: FORCE: Data is automatically corrected by storing the bad records as NULL. REDIRECT: Bad records are written to the CSV file in carbon.badRecords.location instead of being loaded. IGNORE: Bad records are neither loaded nor written to the CSV file. FAIL: Data loading fails if any bad records are found. |
carbon.update.sync.folder |
/tmp/carbondata |
Specifies the modifiedTime.mdt file path. You can set it to an existing path or a new path.
NOTE:
If you set this parameter to an existing path, ensure that all users can access the path and the path has the 777 permission. |
carbon.enable.badrecord.action.redirect |
false |
Specifies whether to enable the REDIRECT mode to handle bad records during data loading. When it is enabled, bad records in source files will be recorded in a CSV file generated in a specified storage location each time data is loaded. CSV injection may occur when such CSV files are opened in Windows. |
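Taken together, the three bad-record parameters above can be combined to redirect bad records to a file instead of failing the load. A sketch (the path is illustrative):

```properties
# Redirect bad records to a CSV file instead of failing the load.
carbon.badRecords.location=hdfs://hacluster/tmp/carbon/badrecords
carbon.bad.records.action=REDIRECT
carbon.enable.badrecord.action.redirect=true
```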
Parameter |
Default Value |
Description |
---|---|---|
Data Loading Configuration |
||
carbon.sort.file.write.buffer.size |
16384 |
CarbonData sorts data and writes it to a temporary file to limit memory usage. This parameter controls the size of the buffer used for reading and writing temporary files. The unit is bytes. The value ranges from 10240 to 10485760. |
carbon.graph.rowset.size |
100,000 |
Rowset size exchanged in data loading graph steps. The value ranges from 500 to 1,000,000. |
carbon.number.of.cores.while.loading |
6 |
Number of cores used during data loading. The greater the number of cores, the better the data loading performance. If the CPU resources are sufficient, you can increase the value of this parameter. |
carbon.sort.size |
500000 |
Number of records to hold in memory for sorting before an intermediate temp file is written. |
carbon.enableXXHash |
true |
Whether to use the XXHash algorithm for hash key calculation. |
carbon.number.of.cores.block.sort |
7 |
Number of cores used for sorting blocks during data loading |
carbon.max.driver.lru.cache.size |
-1 |
Maximum size of LRU caching for data loading at the driver side. The unit is MB. The default value is -1, indicating that there is no memory limit for the caching. Only integer values greater than 0 are accepted. |
carbon.max.executor.lru.cache.size |
-1 |
Maximum size of LRU caching for data loading at the executor side. The unit is MB. The default value is -1, indicating that there is no memory limit for the caching. Only integer values greater than 0 are accepted. If this parameter is not configured, the value of carbon.max.driver.lru.cache.size is used. |
carbon.merge.sort.prefetch |
true |
Whether to enable prefetch of data during merge sort while reading data from sorted temp files in the process of data loading |
carbon.update.persist.enable |
true |
Whether to persist the RDD/DataFrame dataset involved in an UPDATE operation. Enabling this reduces the execution time of the UPDATE operation. |
enable.unsafe.sort |
true |
Whether to use unsafe sort during data loading. Unsafe sort reduces the garbage collection during data load operation, resulting in better performance. The default value is true, indicating that unsafe sort is enabled. |
enable.offheap.sort |
true |
Whether to use off-heap memory for sorting of data during data loading |
offheap.sort.chunk.size.inmb |
64 |
Size of data chunks to be sorted, in MB. The value ranges from 1 to 1024. |
carbon.unsafe.working.memory.in.mb |
512 |
Size of the unsafe working memory, which is used for sorting data and storing column pages. The unit is MB. Memory required for data loading: carbon.number.of.cores.while.loading (default: 6) x Number of tables to load in parallel x offheap.sort.chunk.size.inmb (default: 64 MB) + carbon.blockletgroup.size.in.mb (default: 64 MB) + Current compaction ratio (64 MB/3.5) = around 900 MB per table. Memory required for data query: SPARK_EXECUTOR_INSTANCES (default: 2) x (carbon.blockletgroup.size.in.mb (default: 64 MB) + carbon.blockletgroup.size.in.mb (default: 64 MB) x 3.5) x Number of cores per executor (default: 1) = around 600 MB. |
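The query-side sizing formula above can be sanity-checked with a few lines of Python (the function and parameter names here are ad hoc, not CarbonData APIs):

```python
def query_memory_mb(executor_instances=2, blocklet_group_mb=64,
                    cores_per_executor=1, compaction_ratio=3.5):
    """Approximate off-heap memory (MB) needed at query time, per the
    formula above: instances x (blocklet + blocklet x ratio) x cores."""
    return (executor_instances
            * (blocklet_group_mb + blocklet_group_mb * compaction_ratio)
            * cores_per_executor)

print(query_memory_mb())  # 576.0, i.e. roughly the ~600 MB quoted above
```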
carbon.sort.inmemory.storage.size.in.mb |
512 |
Size of the intermediate sort data to be kept in the memory. Once the specified value is reached, the system writes data to the disk. The unit is MB. |
sort.inmemory.size.inmb |
1024 |
Size of the intermediate sort data to be kept in the memory. Once the specified value is reached, the system writes data to the disk. The unit is MB. If carbon.unsafe.working.memory.in.mb and carbon.sort.inmemory.storage.size.in.mb are configured, you do not need to set this parameter. If this parameter has been configured, 20% of the memory is used for working memory carbon.unsafe.working.memory.in.mb, and 80% is used for sort storage memory carbon.sort.inmemory.storage.size.in.mb.
NOTE:
The value of spark.yarn.executor.memoryOverhead configured for Spark must be greater than the value of sort.inmemory.size.inmb configured for CarbonData. Otherwise, Yarn might stop the executor if off-heap access exceeds the configured executor memory. |
carbon.blockletgroup.size.in.mb |
64 |
The data is read as a group of blocklets which are called blocklet groups. This parameter specifies the size of each blocklet group. Higher value results in better sequential I/O access. The minimum value is 16 MB. Any value less than 16 MB will be reset to the default value (64 MB). The unit is MB. |
enable.inmemory.merge.sort |
false |
Whether to enable in-memory merge sort. |
use.offheap.in.query.processing |
true |
Whether to use off-heap memory in query processing. |
carbon.load.sort.scope |
local_sort |
Sort scope for the load operation. There are two types of sort: batch_sort and local_sort. If batch_sort is selected, the loading performance is improved but the query performance is reduced.
NOTE:
local_sort conflicts with DDL operations on partitioned tables and they cannot be used at the same time. In addition, local_sort does not significantly improve the performance of partitioned tables. You are advised not to enable this feature on partitioned tables. |
carbon.batch.sort.size.inmb |
- |
Size of data to be considered for batch sorting during data loading. The recommended value is less than 45% of the total sort data. The unit is MB.
NOTE:
If this parameter is not set, its value is about 45% of the value of sort.inmemory.size.inmb by default. |
enable.unsafe.columnpage |
true |
Whether to keep column page data in unsafe (off-heap) memory during data loading or query to avoid garbage collection bottlenecks. |
carbon.use.local.dir |
false |
Whether to use Yarn local directories for multi-disk data loading. If this parameter is set to true, Yarn local directories are used to load multi-disk data to improve data loading performance. |
carbon.use.multiple.temp.dir |
false |
Whether to use multiple temporary directories for storing temporary files to improve data loading performance. |
carbon.load.datamaps.parallel.db_name.table_name |
N/A |
The value can be true or false. Replace db_name and table_name in the parameter name with the actual database and table names to improve the first-query performance of that table. |
Compaction Configuration |
||
carbon.number.of.cores.while.compacting |
2 |
Number of cores to be used while compacting data. The greater the number of cores, the better the compaction performance. If the CPU resources are sufficient, you can increase the value of this parameter. |
carbon.compaction.level.threshold |
4,3 |
This configuration is for minor compaction and decides how many segments are merged. For example, if this parameter is set to 2,3, minor compaction is triggered every two segments: the two segments are merged into one level-1 compacted segment, and every three level-1 compacted segments are further merged into a new level-2 segment. Each value ranges from 0 to 100. |
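The two-level behavior of carbon.compaction.level.threshold can be illustrated with a small Python sketch (the function is illustrative, not part of CarbonData):

```python
def compaction_counts(loaded_segments, threshold=(4, 3)):
    """For a threshold of (m, n): every m loaded segments are merged into
    one level-1 compacted segment, and every n level-1 segments are merged
    into one level-2 compacted segment."""
    level1_merge, level2_merge = threshold
    level1 = loaded_segments // level1_merge   # level-1 segments produced
    level2 = level1 // level2_merge            # level-2 segments produced
    return level1, level2

print(compaction_counts(12, (4, 3)))  # (3, 1): 12 loads -> 3 level-1 -> 1 level-2
```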
carbon.major.compaction.size |
1024 |
Major compaction size. Segments whose total size is below this threshold will be merged. The unit is MB. |
carbon.horizontal.compaction.enable |
true |
Whether to enable horizontal compaction. After every DELETE and UPDATE statement, horizontal compaction may occur if the number of incremental (DELETE/UPDATE) files exceeds the specified threshold. By default, this parameter is set to true. Set it to false to disable horizontal compaction. |
carbon.horizontal.update.compaction.threshold |
1 |
Threshold limit on the number of UPDATE delta files within a segment. If the number of delta files exceeds the threshold, the UPDATE delta files within the segment become eligible for horizontal compaction and are compacted into a single UPDATE delta file. By default, this parameter is set to 1. The value ranges from 1 to 10000. |
carbon.horizontal.delete.compaction.threshold |
1 |
Threshold limit on the number of DELETE incremental files within a block of a segment. If the number of incremental files exceeds the threshold, the DELETE incremental files for that block of the segment become eligible for horizontal compaction and are compacted into a single DELETE incremental file. By default, this parameter is set to 1. The value ranges from 1 to 10000. |
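As a sketch, the horizontal compaction parameters above might be tuned together, for example to tolerate more delta files before compaction kicks in (the values are illustrative, not recommendations):

```properties
carbon.horizontal.compaction.enable=true
carbon.horizontal.update.compaction.threshold=3
carbon.horizontal.delete.compaction.threshold=3
```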
Query Configuration |
||
carbon.number.of.cores |
4 |
Number of cores to be used during query |
carbon.limit.block.distribution.enable |
false |
Whether to enable the CarbonData distribution for limit query. The default value is false, indicating that block distribution is disabled for query statements that contain the keyword limit. For details about how to optimize this parameter, see Configurations for Performance Tuning. |
carbon.custom.block.distribution |
false |
Whether to enable Spark or CarbonData block distribution. By default, the value is false, indicating that Spark block distribution is enabled. To enable CarbonData block distribution, change the value to true. |
carbon.infilter.subquery.pushdown.enable |
false |
If this is set to true and a SELECT query is triggered in the filter with a subquery, the subquery is executed and the output is broadcast as an IN filter to the left table. Otherwise, SortMergeSemiJoin is executed. You are advised to set this to true when the IN filter subquery does not return too many records. For example, when the IN subquery returns 10,000 or fewer records, enabling this parameter returns query results faster. Example: select * from flow_carbon_256b where cus_no in (select cus_no from flow_carbon_256b where dt>='20260101' and dt<='20260701' and txn_bk='tk_1' and txn_br='tr_1') limit 1000; |
carbon.scheduler.minRegisteredResourcesRatio |
0.8 |
Minimum resource (executor) ratio needed for starting the block distribution. The default value is 0.8, indicating that 80% of the requested resources are allocated for starting block distribution. |
carbon.dynamicAllocation.schedulerTimeout |
5 |
Maximum time that the scheduler waits for executors to be active. The default value is 5 seconds, and the maximum value is 15 seconds. |
enable.unsafe.in.query.processing |
true |
Whether to use unsafe sort during query. Unsafe sort reduces the garbage collection during query, resulting in better performance. The default value is true, indicating that unsafe sort is enabled. |
carbon.enable.vector.reader |
true |
Whether to enable vector processing for result collection to improve query performance |
carbon.query.show.datamaps |
true |
SHOW TABLES lists all tables including the primary table and datamaps. To filter out the datamaps, set this parameter to false. |
Secondary Index Configuration |
||
carbon.secondary.index.creation.threads |
1 |
Number of threads to concurrently process segments during secondary index creation. This property helps fine-tuning the system when there are a lot of segments in a table. The value ranges from 1 to 50. |
carbon.si.lookup.partialstring |
true |
|
carbon.si.segment.merge |
true |
Enabling this property merges .carbondata files inside the secondary index segment. The merging will happen after the load operation. That is, at the end of the secondary index table load, small files are checked and merged.
NOTE:
Table Block Size is used as the size threshold for merging small files. |
Parameter |
Default Value |
Description |
---|---|---|
Data Loading Configuration |
||
carbon.lock.type |
HDFSLOCK |
Type of lock to be acquired during concurrent operations on a table. The available lock implementations include HDFSLOCK (the lock file is created on HDFS) and LOCALLOCK (the lock file is created on the local file system of a single machine). |
carbon.sort.intermediate.files.limit |
20 |
Minimum number of intermediate files after which they are merge-sorted. For details about how to optimize this parameter, see Configurations for Performance Tuning. |
carbon.csv.read.buffersize.byte |
1048576 |
Size of the CSV reading buffer. The unit is bytes. |
carbon.merge.sort.reader.thread |
3 |
Maximum number of threads used for reading intermediate files for final merging. |
carbon.concurrent.lock.retries |
100 |
Maximum number of retries used to obtain the concurrent operation lock. This parameter is used for concurrent loading. |
carbon.concurrent.lock.retry.timeout.sec |
1 |
Interval between the retries to obtain the lock for concurrent operations. |
carbon.lock.retries |
3 |
Maximum number of retries to obtain the lock for any operations other than import. |
carbon.lock.retry.timeout.sec |
5 |
Interval between the retries to obtain the lock for any operation other than import. |
carbon.tempstore.location |
/opt/Carbon/TempStoreLoc |
Temporary storage location. By default, the System.getProperty("java.io.tmpdir") method is used to obtain the value. For details about how to optimize this parameter, see the description of carbon.use.local.dir in Configurations for Performance Tuning. |
carbon.load.log.counter |
500000 |
Number of records after which data loading progress is recorded in logs. |
SERIALIZATION_NULL_FORMAT |
\N |
String value in the source data that is treated as NULL during loading. |
carbon.skip.empty.line |
false |
Whether to ignore empty lines in the CSV file during data loading. |
carbon.load.datamaps.parallel |
false |
Whether to enable parallel datamap loading for all tables in all sessions. This property will improve the time to load datamaps into memory by distributing the job among executors, thus improving query performance. |
Merging Configuration |
||
carbon.numberof.preserve.segments |
0 |
Number of latest segments to preserve from compaction. For example, if carbon.numberof.preserve.segments is set to 2, the latest two segments are always excluded from compaction. No segments are preserved by default. |
carbon.allowed.compaction.days |
0 |
This configuration controls which recent segments can be merged. For example, if this parameter is set to 2, only segments loaded within the past two days are merged; segments loaded earlier than two days ago are not. This configuration is disabled by default. |
carbon.enable.auto.load.merge |
false |
Whether to enable compaction along with data loading. |
carbon.merge.index.in.segment |
true |
Whether to merge all CarbonIndex files (.carbonindex) of a segment into a single MergeIndex file (.carbonindexmerge) upon data loading completion. This significantly reduces the delay in serving the first query. |
Query Configuration |
||
max.query.execution.time |
60 |
Maximum time allowed for one query to execute. The unit is minutes. |
carbon.enableMinMax |
true |
MinMax is used to improve query performance. You can set this to false to disable this function. |
carbon.lease.recovery.retry.count |
5 |
Maximum number of attempts made to recover a lease on a file. Minimum value: 1. Maximum value: 50. |
carbon.lease.recovery.retry.interval |
1000 (ms) |
Interval, or pause time, after a lease recovery attempt is made on a file. Minimum value: 1000 ms. Maximum value: 10000 ms. |
Parameters in the spark-defaults.conf File
- Log in to the client node and configure the parameters listed in Table 4 in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file.
Table 4 Spark configuration reference in spark-defaults.conf Parameter
Default Value
Description
spark.driver.memory
4G
Memory to be used for the driver process, where SparkContext is initialized.
NOTE: In client mode, do not use SparkConf to set this parameter in the application, because the driver JVM has already started. Instead, configure it through the --driver-memory command-line option or in the default properties file.
spark.executor.memory
4 GB
Memory to be used for each executor process.
spark.sql.crossJoin.enabled
true
If the query contains a cross join, enable this property so that no error is thrown. In such cases, using a cross join instead of a join delivers better performance.
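In spark-defaults.conf, the settings in Table 4 are plain key-value pairs, for example (the values shown are illustrative):

```properties
# {Client installation directory}/Spark/spark/conf/spark-defaults.conf
spark.driver.memory          4G
spark.executor.memory        4G
spark.sql.crossJoin.enabled  true
```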
- Configure the following parameters in the spark-defaults.conf file on the Spark driver.
- In spark-sql mode: Log in to the Spark client node and configure the parameters listed in Table 5 in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file.
Table 5 Parameter description Parameter
Value
Description
spark.driver.extraJavaOptions
-Dlog4j.configuration=file:/opt/client/Spark2x/spark/conf/log4j.properties -Djetty.version=x.y.z -Dzookeeper.server.principal=zookeeper/hadoop.<System domain name> -Djava.security.krb5.conf=/opt/client/KrbClient/kerberos/var/krb5kdc/krb5.conf -Djava.security.auth.login.config=/opt/client/Spark2x/spark/conf/jaas.conf -Dorg.xerial.snappy.tempdir=/opt/client/Spark2x/tmp -Dcarbon.properties.filepath=/opt/client/Spark2x/spark/conf/carbon.properties -Djava.io.tmpdir=/opt/client/Spark2x/tmp
The default value /opt/client/Spark2x/spark indicates CLIENT_HOME of the client and is appended to the value of spark.driver.extraJavaOptions. This parameter is used to specify the path of the carbon.properties file in the driver.
NOTE: Spaces are not allowed next to equals signs (=).
spark.sql.session.state.builder
org.apache.spark.sql.hive.FIHiveACLSessionStateBuilder
Session state constructor.
spark.carbon.sqlastbuilder.classname
org.apache.spark.sql.hive.CarbonInternalSqlAstBuilder
AST constructor.
spark.sql.catalog.class
org.apache.spark.sql.hive.HiveACLExternalCatalog
Hive External catalog to be used. This parameter is mandatory if Spark ACL is enabled.
spark.sql.hive.implementation
org.apache.spark.sql.hive.HiveACLClientImpl
How to call the Hive client. This parameter is mandatory if Spark ACL is enabled.
spark.sql.hiveClient.isolation.enabled
false
This parameter is mandatory if Spark ACL is enabled.
- In JDBCServer: Log in to the node where JDBCServer is installed and configure the parameters listed in Table 6 in the {BIGDATA_HOME}/FusionInsight_Spark_*/*_JDBCServer/etc/spark-defaults.conf file.
Table 6 Parameter description Parameter
Value
Description
spark.driver.extraJavaOptions
-Xloggc:${SPARK_LOG_DIR}/indexserver-omm-%p-gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=512M -XX:MaxMetaspaceSize=512M -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -XX:OnOutOfMemoryError='kill -9 %p' -Djetty.version=x.y.z -Dorg.xerial.snappy.tempdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/snappy_tmp -Djava.io.tmpdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/io_tmp -Dcarbon.properties.filepath=${SPARK_CONF_DIR}/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048 -Dspark.ssl.keyStore=${SPARK_CONF_DIR}/child.keystore #{java_stack_prefer}
The default value ${SPARK_CONF_DIR} depends on the specific cluster and is appended to the value of spark.driver.extraJavaOptions. This parameter is used to specify the path of the carbon.properties file in the driver.
NOTE: Spaces are not allowed next to equals signs (=).
spark.sql.session.state.builder
org.apache.spark.sql.hive.FIHiveACLSessionStateBuilder
Session state constructor.
spark.carbon.sqlastbuilder.classname
org.apache.spark.sql.hive.CarbonInternalSqlAstBuilder
AST constructor.
spark.sql.catalog.class
org.apache.spark.sql.hive.HiveACLExternalCatalog
Hive External catalog to be used. This parameter is mandatory if Spark ACL is enabled.
spark.sql.hive.implementation
org.apache.spark.sql.hive.HiveACLClientImpl
How to call the Hive client. This parameter is mandatory if Spark ACL is enabled.
spark.sql.hiveClient.isolation.enabled
false
This parameter is mandatory if Spark ACL is enabled.