Using AWS Step Functions to Offload Exponential Backoffs

Retries are an integral part of distributed and cloud-native architectures. In a distributed microservices environment, components (services) routinely talk to each other to fulfill a user request.

Although retry mechanisms are essential, they are unfortunately among the features least often implemented in real-world systems. There are several reasons for this.

First, retries mean additional code for the developer to write and additional integration tests for the testers to create. This factor can easily be mitigated by adding the related tasks during Scrum planning. Nevertheless, the work is often overlooked and then rushed at the very end of the sprint, if it is considered at all.

The second reason is technical. While we can do on-the-spot retries within the lifecycle of a single synchronous transaction, we also need to consider long-running retries with exponential backoff. (These long-running retries cannot be applied to synchronous user transactions, as they would cause connection timeouts on the client side and a poor user experience. For asynchronous processes, such as backend service-to-service communication and workflow-based use cases, they should be in place.)

On-the-spot retries solve problems such as network glitches or short JVM stop-the-world garbage collections. Still, if the remote server is restarting or has a severe issue such as a memory leak, on-the-spot retries will not help, as it is not feasible to hold the request thread open for long periods. Some serverless components, such as Lambda functions, also have a maximum execution time that we need to consider.
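
As an illustration of an on-the-spot retry, here is a minimal Python sketch. The helper name `call_with_immediate_retries`, its parameters, and the default values are hypothetical and not taken from any library; the point is only that the calling thread stays blocked for the whole loop, which is why the attempts and pauses must stay small.

```python
import time


def call_with_immediate_retries(operation, max_attempts=3, pause_seconds=0.1,
                                sleep=time.sleep):
    """Run `operation`, retrying a few times with a short pause in between.

    Hypothetical helper: the names and defaults are assumptions for this sketch.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:  # real code would catch only transient errors
            last_error = error
            if attempt < max_attempts:
                # The request thread is blocked while we wait, so keep this short.
                sleep(pause_seconds)
    # All attempts failed: surface the last error to the caller.
    raise last_error
```

A caller would wrap the remote call, for example `call_with_immediate_retries(lambda: client.send(order))`, where `client.send` stands in for whatever remote call is being protected.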

As a consumer of the problematic service, the best strategy we can apply in these scenarios is exponential backoff. In this strategy, we try once. If the call fails, we wait for a configurable amount of time X and try again. If the service is still down, we wait 2X before the next attempt, then 4X, and so on, so the wait grows by a constant factor after every failure.
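
To make the schedule concrete, the sketch below computes the wait before each retry, assuming every wait is the previous one multiplied by a constant factor (2 here). The function name `backoff_delays` and its parameters are illustrative only.

```python
def backoff_delays(base_seconds, max_retries, multiplier=2.0):
    """Return the wait before each retry: base, base*m, base*m**2, ...

    Illustrative helper; the multiplier of 2 is an assumption of this sketch.
    """
    return [base_seconds * multiplier ** attempt for attempt in range(max_retries)]


# Four retries starting from a 5-second wait.
print(backoff_delays(5, 4))  # [5.0, 10.0, 20.0, 40.0]
```

With these values the total waiting time is already 75 seconds, which shows why such a schedule does not fit inside a synchronous request.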

The problem with exponential backoff is that you need to maintain a process or thread to perform these retries. This extra process may not be desirable for a microservice, as it adds complexity and hardware requirements to the service at hand. In a microservices environment, we are also not bound to a specific language, and the capabilities of the chosen language can be a roadblock to maintaining additional threads.

Because of these technical challenges, it is always a good idea to offload retries to another system for backend and asynchronous service-to-service communications. One of the ways of offloading is using a queue.

In one of my previous articles, Asynchronous Retries With SQS, I explained how a queue (specifically AWS SQS, but it applies to other queues as well) can be used to offload retries. Container-based environments can make use of service meshes and offload the non-functional communication requirements to their sidecars.

In this article, I will be offloading the exponential backoff retries to a serverless workflow engine: AWS Step Functions.

Figure 1: Sample Architecture

Before explaining more about AWS Step Functions, let's consider a sample serverless architecture. In this architecture, we receive a command from a user through a REST API exposed by Amazon API Gateway and backed by an AWS Lambda function. The command is asynchronous, so the user is not waiting for an immediate response. An example could be a captured order that is sent for fulfillment. The Lambda function then passes this command to a backend process or service. (In the serverless world, everything can go wrong, including services, so we need to design for failure in every component.) In our sample architecture, the backend process is an external SaaS application's web service, living on the Internet. Because it lives on the Internet, it is not reliable all the time, which makes it a perfect candidate for retries. If the Lambda function fails to call the external service, it takes the original request payload (context) and triggers a retry workflow in AWS Step Functions with it. The workflow re-uses the very same Lambda function, so the business logic is not duplicated. This time, though, the Lambda execution is scheduled by the workflow engine.
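
The hand-off described above could look roughly like the following Python sketch. The names `handle_command`, `call_backend`, and `start_retry_workflow` are hypothetical; the two callables are passed in so the sketch stays self-contained. In a real Lambda function, `start_retry_workflow` would most likely wrap the boto3 Step Functions client's `start_execution(stateMachineArn=..., input=...)` call, with the state machine ARN supplied through configuration.

```python
import json


def handle_command(payload, call_backend, start_retry_workflow):
    """Try the backend once; on failure, pass the same payload to the retry workflow.

    Hypothetical sketch: `call_backend` stands for the external SaaS call and
    `start_retry_workflow` for whatever starts the Step Functions execution.
    """
    try:
        call_backend(payload)
        return {"status": "delivered"}
    except Exception:  # real code would catch only retryable errors
        # The original request payload becomes the workflow input unchanged,
        # so the workflow can call the same business logic again later.
        start_retry_workflow(json.dumps(payload))
        return {"status": "retry_scheduled"}
```

This covers only the first, user-triggered call. When the workflow invokes the function again, the failure would presumably be raised instead of starting another workflow, so that the workflow engine can apply its own retry policy.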

Figure 2: Workflow composed of one state

AWS Step Functions workflows are composed of one or more states. To perform a particular piece of work, set the state's type to Task. A Task does the actual work, mostly an integration with another service or an AWS Lambda function. When the workflow orchestrator schedules and runs a Task and the Task fails, you can ask the orchestrator to retry it a specified number of times, backing off between attempts. A sample Task that calls a Lambda function is below.

JSON