Configuring Client Retry to Improve Service Availability
ELB High Availability
To improve the high availability of ELB, you can purchase load balancers in multiple AZs and enable the health check function for backend servers.
- System HA deployment: Load balancers can be deployed in multiple AZs in active-active mode. If the load balancer in one AZ goes down, the load balancer in another AZ takes over traffic distribution. Session persistence within an AZ is also implemented, which eliminates the impact of single points of failure (SPOFs) of servers in a single AZ of the ELB cluster and ensures system stability.
- Health check: ELB checks the health of backend servers based on the health check settings you configure, ensuring that new requests are forwarded only to healthy backend servers.
Client Retry Application Scenarios
Generally, ELB's high availability design can handle disaster recovery for essential services. However, in extreme scenarios, issues such as connection resets or timeouts can still disrupt services and degrade user experience. To address these cases, configure client retry logic that starts a new connection when a connection is reset or times out, which improves the fault tolerance and stability of the system.
The client retry logic is recommended in the following scenarios:
- Backend server health check failures: If the health check on a backend server fails, traffic may still be routed to the unhealthy backend server over existing Layer 4 connections during the sticky session timeout or deregistration delay period.
- Cross-AZ switchover for high availability: In extreme scenarios, if the AZ where the ELB cluster is located is faulty, the load balancer deployed in another AZ will divert the traffic from the faulty AZ to the healthy AZ. In this case, the persistent connection that is transmitting data in the faulty AZ cannot be restored, so the client needs to start a connection again.
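For example, the following Python sketch shows a minimal client-side retry that starts a new connection when a request fails with a connection reset or timeout. The URL, timeout, retry count, and backoff values are illustrative assumptions rather than values prescribed by this guide:

```python
import logging
import random
import socket
import time
import urllib.error
import urllib.request

# Hypothetical endpoint behind the load balancer, used only for illustration.
SERVICE_URL = "https://example.com/api/orders/123"

def get_with_retry(url, max_attempts=3, base_delay=0.5, timeout=2.0):
    """Send a GET request and retry on connection resets or timeouts."""
    for attempt in range(1, max_attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError:
            raise  # The server responded with an HTTP error; do not retry here.
        except (ConnectionResetError, socket.timeout, urllib.error.URLError) as exc:
            if attempt == max_attempts:
                raise  # Give up after the configured number of attempts.
            # Exponential backoff with jitter so that many clients retrying
            # at once do not hit the load balancer at the same moment.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            logging.warning("Attempt %d/%d failed (%s); retrying in %.2fs",
                            attempt, max_attempts, exc, delay)
            time.sleep(delay)

if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(get_with_retry(SERVICE_URL))
```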
Importance of Retry
Both the client and server may encounter temporary faults caused by the infrastructure or runtime environment, such as transient network or disk jitter, service unavailability, or invocation timeouts. As a result, service access may time out.
You can design an automated client retry mechanism to reduce the impact of such faults on services and help requests complete successfully.
Backend Service Unavailable Scenarios
| Scenario | Description |
|---|---|
| Backend server unavailable | The health check on a backend server (ECS or container) fails due to faults such as service process suspension, service process faults, hardware faults, virtualization migration failures, or network disruptions. |
| Complex network environment | Due to the complex network environment among the clients, load balancers, and backend servers, network jitter, packet loss, and data retransmission may occur occasionally. In this case, client requests may temporarily fail. |
| Complex hardware issues | Client requests may temporarily fail due to occasional hardware faults, such as VM HA and disk latency jitter. |
Recommended Client Retry Rules
| Retry Rule | Description |
|---|---|
| Define the conditions that trigger retries. | Retry only in abnormal scenarios such as connection timeouts and connection resets. |
| Retry only idempotent operations. | A retried operation may be executed more than once, so not all operations are suitable for retries. It is recommended that the client retry only idempotent operations. |
| Configure a proper retry count and interval. | Configure the number of retries and the retry interval based on actual service requirements so that retries do not overload a service that is already faulty. Common retry interval policies include immediate retry, fixed-interval retry, exponential backoff retry, and random backoff retry. |
| Avoid retry nesting. | Nested retries may exponentially amplify the number of retry attempts and the overall retry delay. |
| Record retry exceptions and print failure reports. | Print retry error logs at the WARN level during retries so that failures can be traced and analyzed. |
| Reuse the retry system of a mature open-source ecosystem library. | Mature open-source middleware usually provides client libraries with built-in retry support, including connection pool keepalive and health detection mechanisms. Reuse these mechanisms and configure a proper retry interval, retry count, and backoff policy instead of implementing retry logic from scratch. |
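As a sketch of the last rule, the example below reuses the retry support built into the widely used requests and urllib3 Python libraries instead of hand-rolling retry logic. The retry counts, backoff factor, status codes, and URL are illustrative assumptions; the allowed_methods parameter requires urllib3 1.26 or later:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry policy: up to 3 retries with exponential backoff (0.5s, 1s, 2s),
# limited to idempotent HTTP methods and retriable gateway errors.
retry_policy = Retry(
    total=3,
    connect=3,
    read=3,
    backoff_factor=0.5,
    status_forcelist=[502, 503, 504],
    allowed_methods=["GET", "HEAD", "PUT", "DELETE"],  # idempotent methods only
)

session = requests.Session()
adapter = HTTPAdapter(max_retries=retry_policy)
session.mount("http://", adapter)
session.mount("https://", adapter)

# Hypothetical endpoint behind the load balancer, used only for illustration.
response = session.get("https://example.com/api/orders", timeout=2.0)
response.raise_for_status()
print(response.status_code)
```

Delegating backoff, retry counting, and method filtering to the library's connection pool is generally safer than a custom implementation.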