
Configuring Client Retry to Improve Service Availability

ELB High Availability

To improve ELB availability, you can purchase load balancers in multiple AZs and enable health checks for backend servers.

  • System HA deployment: Load balancers can be deployed in multiple AZs in active-active mode. If the load balancer in one AZ goes down, the load balancer in another AZ takes over and continues to distribute traffic. Session persistence is also maintained within each AZ, eliminating the impact of single points of failure (SPOFs) of servers in a single AZ of the ELB cluster and ensuring system stability.
  • Health check: ELB checks the health of backend servers based on your health check configuration, ensuring that new requests are forwarded only to healthy backend servers.

Client Retry Application Scenarios

Generally, ELB's high availability system can handle disaster recovery for essential services. However, in extreme scenarios, issues such as connection resets or timeouts can disrupt services and degrade user experience. To address such issues, you can configure client retry logic that establishes a new connection when a connection is reset or times out, improving the fault tolerance and stability of the system.

The client retry logic is recommended in the following scenarios:

  1. Backend server health check failures: If the health check on a backend server fails, traffic can still be routed to the unhealthy backend server over existing Layer 4 connections within the sticky session or deregistration delay duration.
  2. Cross-AZ switchover for high availability: In extreme scenarios, if the AZ where the ELB cluster is located becomes faulty, the load balancer deployed in another AZ diverts traffic from the faulty AZ to the healthy AZ. Persistent connections that were transmitting data in the faulty AZ cannot be restored, so the client needs to establish a new connection (see the sketch after this list).
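
The following is a minimal sketch, in Python, of such client retry logic: it sends an idempotent GET request and opens a new connection when the previous attempt is reset or times out. The URL, retry count, and timeout values are illustrative assumptions, not values prescribed by ELB.

  import logging
  import socket
  import time
  import urllib.error
  import urllib.request

  logging.basicConfig(level=logging.WARNING)

  # Hypothetical service endpoint behind an ELB load balancer.
  SERVICE_URL = "https://example.com/api/health"

  def get_with_retry(url, max_retries=3, timeout=5):
      """Send an idempotent GET request, retrying on resets and timeouts."""
      for attempt in range(1, max_retries + 1):
          try:
              # Each call opens a new connection, so a retry does not reuse
              # a connection that may still point at a faulty AZ or backend.
              with urllib.request.urlopen(url, timeout=timeout) as resp:
                  return resp.read()
          except (ConnectionResetError, socket.timeout, urllib.error.URLError) as exc:
              logging.warning("Attempt %d/%d failed: %s", attempt, max_retries, exc)
              if attempt == max_retries:
                  raise
              time.sleep(1)  # fixed 1-second interval between retries (illustrative)

  if __name__ == "__main__":
      get_with_retry(SERVICE_URL)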

Importance of Retry

Both the client and server may encounter temporary faults, such as transient network or disk jitter, service unavailability, or invocation timeouts, caused by the infrastructure or runtime environment. As a result, service access may time out.

You can design automated client retry logic to reduce the impact of such faults on services and ensure that operations complete successfully.

Backend Service Unavailable Scenarios

Table 1 Recommended retry scenarios

  • Backend server unavailable: The health check on a backend server (an ECS or a container) fails due to faults such as service process suspension, service process faults, hardware faults, virtualization migration failures, or network disruptions.
  • Complex network environment: Because the network environment among clients, load balancers, and backend servers is complex, network jitter, packet loss, and data retransmission may occur occasionally. In this case, client requests may fail temporarily.
  • Complex hardware issues: Client requests may fail temporarily due to occasional hardware faults, such as VM HA events and disk latency jitter.

Recommended Client Retry Rules

Table 2 Client retry rules

  • Define the conditions that trigger retries: Retry only in abnormal scenarios such as connection timeouts and connection resets.
  • Retry only idempotent operations: A retried operation may be executed more than once, so not all operations are suitable for retries. It is recommended that the client retry only idempotent operations.
  • Configure a proper number of retries and retry interval: Set the number of retries and the retry interval based on actual service requirements to prevent the following problems:
      • If the number of retries is too small or the interval is too long, the application may fail to complete the operation.
      • If the number of retries is too large or the interval is too short, the application may consume excessive resources, and the server may be overwhelmed by too many requests.
    Common retry interval policies include immediate retry, fixed-interval retry, exponential backoff retry, and random backoff retry (combined in the sketch after this table).
  • Avoid retry nesting: Nested retries may exponentially amplify the total number of retries and the overall retry duration.
  • Record retry exceptions and print failure reports: During retries, print retry error logs at the WARN level so that failures can be traced.
  • Reuse the retry mechanism of a mature open-source library: Mature open-source middleware provides rich client libraries. Based on the keepalive and detection mechanisms of the connection pool, set a proper retry interval, number of retries, and backoff policy.
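
The rules above can be combined into a single retry helper. The following Python sketch (function and parameter names are illustrative assumptions) retries only an operation the caller knows to be idempotent, caps the number of retries, applies exponential backoff with random jitter, and logs each failed attempt at the WARN level.

  import logging
  import random
  import time

  logging.basicConfig(level=logging.WARNING)

  def retry_idempotent(operation, max_retries=3, base_delay=0.5, max_delay=10.0,
                       retriable=(ConnectionResetError, TimeoutError)):
      """Run an idempotent operation, retrying on retriable exceptions.

      Uses exponential backoff with random jitter so that many clients
      retrying at the same time do not overload the server.
      """
      for attempt in range(1, max_retries + 1):
          try:
              return operation()
          except retriable as exc:
              logging.warning("Retry %d/%d after error: %s", attempt, max_retries, exc)
              if attempt == max_retries:
                  raise  # give up and surface the error to the caller
              # Exponential backoff: 0.5s, 1s, 2s, ... capped at max_delay,
              # with random jitter so clients do not retry in lockstep.
              delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
              time.sleep(random.uniform(0, delay))

Note that the helper itself does not retry inside the operation it wraps; wrapping code that already retries internally would be the retry nesting described above.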

For details about how to design a retry system, refer to the keepalive and detection mechanisms of open-source connection pool libraries.
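
As one example of reusing a mature open-source retry mechanism, the sketch below uses the Python requests and urllib3 libraries (assumed to be available; they are not part of ELB) so that the connection pool's built-in retry policy replaces hand-written retry loops. The endpoint, retry count, and backoff factor are illustrative assumptions.

  import requests
  from requests.adapters import HTTPAdapter
  from urllib3.util.retry import Retry  # allowed_methods requires urllib3 1.26 or later

  # Retry policy handled by the library: 3 retries, exponential backoff,
  # restricted to idempotent methods and retriable status codes.
  retry_policy = Retry(
      total=3,
      backoff_factor=0.5,              # exponential backoff between retries
      status_forcelist=[502, 503, 504],
      allowed_methods=["GET", "HEAD"],
  )

  session = requests.Session()
  session.mount("https://", HTTPAdapter(max_retries=retry_policy, pool_maxsize=10))

  # Hypothetical endpoint behind an ELB load balancer.
  response = session.get("https://example.com/api/health", timeout=5)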