Retries are an integral part of distributed and cloud architectures. In a distributed microservices environment, you often have components (services) talking to each other to fulfill a user request.
Although retry mechanisms are essential, they are unfortunately one of the least implemented features in real-world systems. There are several reasons for this.
First, retries mean additional code that developers have to write and integration tests that testers have to create. This overhead can easily be mitigated by introducing the related tasks during Scrum planning. Nevertheless, it is often overlooked or rushed in at the very end of the sprint, if it is considered at all.
The second reason is technical. While we can perform on-the-spot retries within the lifecycle of a single synchronous transaction, we also have to account for exponential, long-running retries. (These long-running retries cannot be applied to synchronous user-facing transactions, as they would cause connection timeouts on the client side and a poor user experience. For asynchronous processes, such as backend service-to-service communication and workflow-based use cases, though, they should be in place.)
On-the-spot retries solve transient problems such as network glitches or short JVM stop-the-world pauses. Still, if the remote server is restarting or suffering from severe issues such as memory leaks, on-the-spot retries are not an option, as it is not feasible to hold the request thread open for long periods. Some serverless components, such as Lambda functions, also have a maximum execution time limit that we need to consider.
As a consumer of the problematic service, the best strategy we can apply in these scenarios is exponential backoff. With this strategy, we try once; if the call fails, we wait for a configurable amount of time X and try again. If the service is still down, we try again after 2X, then 4X, and so on.
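As a reference point before we offload anything, here is a minimal in-process sketch of the strategy in Python. The function name, the delay values, and the jitter are illustrative choices and not part of the original article:

import random
import time


def with_exponential_backoff(operation, base_delay=1.0, backoff_rate=2.0, max_attempts=5):
    """Try once; on failure wait X, then 2X, then 4X, ... before retrying."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # A little jitter keeps many failing clients from retrying in lockstep.
            time.sleep(delay + random.uniform(0, 0.1 * delay))
            delay *= backoff_rate  # X, 2X, 4X, ...

# Usage (hypothetical helper): with_exponential_backoff(lambda: notify_external_service(order))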
The problem with exponential backoff is that you need to maintain a process or a thread to perform these retries. This may not be desirable for a microservice, as it adds complexity and hardware requirements to the service at hand. In a microservices environment, we are not bound to specific languages, and the capabilities of the chosen language can also be a roadblock when it comes to maintaining additional threads.
Because of these technical challenges, it is always a good idea to offload retries to another system for backend and asynchronous service-to-service communication. One way of offloading is to use a queue.
In one of my previous articles, Asynchronous Retries With SQS, I explained how a queue (specifically AWS SQS, but it applies to other queues as well) can be used to offload retries. Container-based environments can make use of service meshes and offload the non-functional communication requirements to their sidecars.
In this article, I will be offloading the exponential backoff retries to a serverless workflow engine: AWS Step Functions.
Before explaining more about AWS Step Functions, let's consider a sample serverless architecture. In this architecture, we receive a command from a user through a REST API exposed by Amazon API Gateway and backed by an AWS Lambda function. The command is asynchronous from the user's point of view, so the user is not waiting for an immediate response; an example could be a captured order that is sent for fulfillment. The Lambda function then passes this command to a backend process or service. (In the serverless world, anything can go wrong, including services, so we need to design every component for failure.) In our sample architecture, the backend process is an external SaaS application's web service living on the Internet. Because it lives on the Internet, it is not reliable all the time, which makes it a perfect candidate for retries. If the Lambda function fails to call the external service, it takes the original request payload (the context) and triggers a retry AWS Step Functions workflow with it. The workflow reuses the very same Lambda function, so the business logic is not duplicated. This time, though, the Lambda execution is scheduled by the workflow engine.
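A rough sketch of that hand-off in a Python Lambda handler is shown below. The environment variable names, the requests dependency, and the response shapes are assumptions made for illustration; only the idea of calling the external service and, on failure, starting the Step Functions workflow with the original payload comes from the architecture above.

import json
import os

import boto3
import requests  # assumed HTTP client for the external SaaS call

sfn = boto3.client("stepfunctions")

# Hypothetical configuration; supply your own state machine ARN and endpoint.
STATE_MACHINE_ARN = os.environ["RETRY_STATE_MACHINE_ARN"]
EXTERNAL_SERVICE_URL = os.environ["EXTERNAL_SERVICE_URL"]


def handler(event, context):
    """Forwards the command to the external service; on failure, hands the
    original payload to the retry workflow instead of retrying in-process."""
    try:
        response = requests.post(EXTERNAL_SERVICE_URL, json=event, timeout=10)
        response.raise_for_status()
        return {"status": "DELIVERED"}
    except requests.RequestException:
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps(event),  # the original request payload (context)
        )
        return {"status": "RETRY_SCHEDULED"}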
AWS Step Functions workflows are composed of one or more states. If you want to do a particular piece of work, set the state's type to Task. Each Task performs some work, typically an integration with another service or an AWS Lambda function. When the workflow orchestrator schedules and runs a Task and the Task fails, you have the option of asking the orchestrator to retry it a specified number of times, backing off between the attempts. A sample Task that calls a Lambda function is shown below.
"ExternalWebhookState": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:...:function:ExternalServiceLambda",
"Retry": [ {
"ErrorEquals": ["MyCustomError"],
"IntervalSeconds": 1,
"MaxAttempts": 10,
"BackoffRate": 2.0
} ],
"End": true
}
Here, the Task is configured to retry a maximum of ten times, with the first retry after 1 second and a backoff rate of 2.0. That means the first retry is executed after 1 second, the second 2 seconds after that, the third after another 4 seconds, and so on. As long as the underlying Lambda function returns a functional error to Step Functions (the Lambda function should throw MyCustomError in the above example), this retry loop continues.
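For that error matching to work with a Python Lambda function, one option (a sketch under the assumption that the handler uses the requests library and receives the endpoint in its input) is to raise an exception class named MyCustomError, since the Lambda runtime reports the exception class name as the error name that Step Functions compares against ErrorEquals:

import requests  # assumed HTTP client for the external SaaS call


class MyCustomError(Exception):
    """Reported to Step Functions as error name "MyCustomError", matching ErrorEquals above."""


def handler(event, context):
    try:
        response = requests.post(event["endpoint"], json=event, timeout=10)
        response.raise_for_status()
        return {"status": "DELIVERED"}
    except requests.RequestException as error:
        # Raising the functional error lets Step Functions apply IntervalSeconds and BackoffRate.
        raise MyCustomError(str(error)) from error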
If, after ten retries, the remote party still does not respond, AWS Step Functions will mark the Task and the workflow execution as unsuccessful. At that point, we can log it or take action by listening with a CloudWatch Events rule for Step Functions Execution Status Change events. This rule can point to another Lambda function for actions such as opening trouble tickets or notifying the operations team via SNS. Please note that every state transition and retry costs money, so we need to put some thought into the maximum number of retries we want for the service. On the other hand, serverless is cheap, and retries should rarely happen in a healthy operation.
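For illustration, such a rule could be created with boto3 roughly as follows; the rule name, the selected statuses, and the target Lambda ARN are placeholders, not values from the article:

import json

import boto3

events = boto3.client("events")

# Match Step Functions executions that end unsuccessfully.
events.put_rule(
    Name="retry-workflow-failed",
    EventPattern=json.dumps({
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {"status": ["FAILED", "TIMED_OUT"]},
    }),
)

# Point the rule at a notification Lambda (ARN is a placeholder).
# The target Lambda also needs a resource-based permission for events.amazonaws.com, omitted here.
events.put_targets(
    Rule="retry-workflow-failed",
    Targets=[{"Id": "notify-ops", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:NotifyOps"}],
)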
For more information about asynchronous communication mechanisms, please take a look at my other article here.
Further Reading
Understanding Retry Pattern With Exponential Back-off and Circuit Breaker Pattern
Asynchronous Retries With AWS SQS