Configuring Enhanced Aggregation for an Elasticsearch Cluster

Scenario

In the case of data clustering, enhanced aggregation uses the vectorization technology to process data in batches, improving aggregation performance and facilitating aggregated analysis for faster time to insight.

In large-scale dataset aggregation and analysis scenarios, a major chunk of the time is spent on data grouping and aggregation.

Enhancing data grouping and aggregation relies on two things:

Sorting key: Data is sorted based on the sorting key.
Cluster key: It is contained in the sorting key. Data is clustered based on the clustering key.

Table 1 describes common scenarios that involve enhanced aggregation.

**Table 1** Common scenarios for enhanced aggregation
Scenario	Description	Details
Aggregation of low-cardinality fields	Aggregates fields that have a small number of unique values, for example, a field that indicates storage classes.	Grouping and Aggregation of Low-cardinality Fields
Aggregation of high-cardinality fields	Aggregates fields that have a large number of unique values, for example, a field that indicates storage time, which can aggregated by day.	High-cardinality Field Histogram Aggregation
Hybrid aggregation of low- and high-cardinality fields	Groups and aggregates low-cardinality fields first, and then creates a histogram using high-cardinality fields.	Low-cardinality and High-cardinality Field Mixing

Constraints

Only Elasticsearch 7.10.2 clusters support aggregation enhancement.

Grouping and Aggregation of Low-cardinality Fields

Generally, low-cardinality fields use grouping as a way of aggregation. With an appropriate sorting key, grouping prepares the data for batch vectorization.

For example, the query statement is as follows:

POST testindex/_search
{
  "size": 0,
  "aggs": {
    "groupby_region": {
      "terms": {
        "field": "region"
      },
      "aggs": {
        "groupby_host": {
          "terms": {
            "field": "host"
          },
          "aggs": {
            "avg_cpu_usage": {
              "avg": {
                "field": "cpu_usage"
              }
            }
          }
        }
      }
    }
  }
}

Assume that region and host are low-cardinality fields. To use enhanced aggregation, set parameters as follows:

//Configure an index
"settings" : {
    "index" : {
        "search" : {
            "turbo" : {
                "enabled" : "true" //Enable optimization
            }
        },
        "sort" : { //Specify a sorting key
            "field" : [
                "region",
                "host",
                "other"
            ]
        },
        "cluster" : {
            "field" : [ //Specify a clustering key
                "region",
                "host"
            ]
        }
    }
}

The clustering key must be a subset of the sorting key.

High-cardinality Field Histogram Aggregation

High-cardinality fields commonly use histogram aggregation, which facilitates data processing per range.

For example, the query statement is as follows: This query groups the field timestamp using a histogram and calculates the average score.

POST testindex/_search?pretty
{
  "size": 0,
  "aggs": {
    "avg_score": {
      "avg": {
        "field": "score"
      },
      "aggs": {
        "groupbytime": {
          "date_histogram": {
            "field": "timestamp",
            "calendar_interval": "day"
          }
        }
      }
    }
  }
}

timestamp is a typical high-cardinality field. To apply enhanced aggregation to such a field, set parameters as follows:

//Configure an index
"settings" : {
    "index" : {
        "search" : {
            "turbo" : {
                "enabled" : "true" //Enable optimization
            }
        },
        "sort" : { //Specify a sorting key
            "field" : [
                "timestamp"
            ]
        }
    }
}

Low-cardinality and High-cardinality Field Mixing

Where low-cardinality and high-cardinality fields are mixed, first groups and aggregates low-cardinality fields, and then aggregates high-cardinality fields using histograms.

For example, the query statement is as follows:

POST testindex/_search
{
  "size": 0,
  "aggs": {
    "groupby_region": {
      "terms": {
        "field": "region"
      },
      "aggs": {
        "groupby_host": {
          "terms": {
            "field": "host"
          },
          "aggs": {
            "groupby_timestamp": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "day"
              },
              "aggs": {
                "avg_score": {
                  "avg": {
                    "field": "score"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

To first group the low-cardinality field region, and then the low-cardinality field host, and then cluster the high-cardinality field timestamp using a histogram, set the parameters as follows:

//Configure an index
"settings" : {
    "index" : {
        "search" : {
            "turbo" : {
                "enabled" : "true" //Enable optimization
            }
        },
        "sort" : { //Specify a sorting key
            "field" : [
                "region",
                "host",
                "timestamp",
                "other"
            ]
        },
        "cluster" : {
            "field" : [ //Specify a clustering key
                "region",
                "host"
            ]
        }
    }
}

The clustering key must be a subset of the sorting key.
High-cardinality fields must be in the sorting key, and high-cardinality fields must follow the last low-cardinality field.

Performance Testing

Test environment

Dataset: esrally nyc_taxis
Cluster specifications: 4U16G, 100 GB, high I/O x 3 nodes

Test Procedure

Create an index template in the cluster, specify sorting keys, and disable enhanced aggregation.

PUT /_template/nyc_taxis
{
  "template": "nyc_taxis*",
  "settings": {
    "index.search.turbo.enabled": false,
    "index.sort.field": "dropoff_datetime",
    "number_of_shards": 3,
    "number_of_replicas": 0
  }
}

Use esrally to test the nyc_taxis dataset and obtain the result when enhanced aggregation is disabled.

Create another index template in the same cluster, specify sorting keys, and enable enhanced aggregation.

PUT /_template/nyc_taxis
{
  "template": "nyc_taxis*",
  "settings": {
    "index.search.turbo.enabled": true,
    "index.sort.field": "dropoff_datetime",
    "number_of_shards": 3,
    "number_of_replicas": 0
  }
}

Use esrally to test the nyc_taxis dataset and obtain the result when enhanced aggregation is enabled.

Test Result

This test focuses on the query result of dropoff_datetime aggregation, that is, the results of tasks autohisto_agg and date_histogram_agg. The following table compares the test results between when enhanced aggregation is disabled and when it is enabled.

Metric	Task	Unit	Enhanced Aggregation Disabled			Enhanced Aggregation Enabled			Enhanced Aggregation Disabled	Enhanced Aggregation Enabled	open/close	Conclusion
Metric	Task	Unit	Test Round 1	Test Round 2	Test Round 3	Test Round 1	Test Round 2	Test Round 3	Mean Value	Mean Value	open/close	Conclusion
Min Throughput	autohisto_agg	ops/s	4.42	4.44	4.43	11.66	11.94	11.96	4.43	11.85	2.68	Throughput improves more than 2.5 times.
Mean Throughput	autohisto_agg	ops/s	4.50	4.46	4.44	11.81	11.99	12.00	4.47	11.93	2.67
Median Throughput	autohisto_agg	ops/s	4.51	4.46	4.44	11.83	11.98	12.00	4.47	11.94	2.67
Max Throughput	autohisto_agg	ops/s	4.54	4.48	4.45	11.90	12.07	12.02	4.49	12.00	2.67
100th percentile latency	autohisto_agg	ms	216.30	-	-	-	84.56	80.38	216.30	82.47	0.38	Latency decreases by more than 60%.
100th percentile service time	autohisto_agg	ms	216.30	-	-	-	84.56	80.38	216.30	82.47	0.38	Latency decreases by more than 60%.
error rate	autohisto_agg	%	0	0	0	0	0	0	0	0	0	-
Min Throughput	date_histogram_agg	ops/s	4.72	4.67	4.65	12.57	12.40	12.59	4.68	12.52	2.68	Throughput improves more than 2.5 times.
Mean Throughput	date_histogram_agg	ops/s	4.73	4.67	4.67	12.61	12.46	12.61	4.69	12.56	2.68
Median Throughput	date_histogram_agg	ops/s	4.73	4.67	4.67	12.62	12.46	12.60	4.69	12.56	2.68
Max Throughput	date_histogram_agg	ops/s	4.74	4.67	4.67	12.64	12.49	12.63	4.69	12.59	2.68
50th percentile latency	date_histogram_agg	ms	202.61	218.09	213.43	77.64	76.02	82.63	211.38	78.77	0.37	Latency decreases by more than 60%.
100th percentile latency	date_histogram_agg	ms	207.35	223.88	246.63	77.99	-	-	225.95	77.99	0.35
50th percentile service time	date_histogram_agg	ms	202.61	218.09	213.43	77.64	76.02	82.63	211.38	78.77	0.37
100th percentile service time	date_histogram_agg	ms	207.35	223.88	246.63	77.99	-	-	225.95	77.99	0.35
error rate	date_histogram_agg	%	0	0	0	0	0	0	0	0	0	-

Test Conclusion

Given the same cluster configuration, aggregation performance improves significantly when enhanced aggregation is enabled. Query throughput improves by more than 2.5 times, and latency decreases by more than 60%.

Parent topic: Enhancing Search Capabilities for Elasticsearch Clusters

Previous topic: Configuring Large Query Isolation for an Elasticsearch Cluster

Next topic: Configuring Read/Write Splitting Between Two Elasticsearch Clusters