Monday, September 30, 2024

Coping with Failures during Long Running Actions

In this brief note I want to draw attention to some of the features in the LRA protocol that can help service writers manage failures. LRA is a transaction protocol that provides certain desirable properties for building reliable systems such as Atomicity, (eventual) Consistency and Durability. Providing this level of assurance is non trivial but the protocol provides a simple model that can help participants to easily play their part in enabling such systems.

LRA is not just for orchestrating services, it is as equally as important for managing failures. Apart from the specification I have not seen many posts, articles etc covering this important topic, and it is this deficit that I’d like to address in some posts. I had wanted to kick off with an article and demonstration of participant failover but I hit an issue while writing the demo which we need to release the fix for before I can showcase that. So instead, in this post I’ll just bring to the readers attention one or two, but by no means all, of the main features that service writers can use to help them to create more reliable microservices, a preview if you like, before going into more depth in a subsequent post.

Some remarkable items to consider include:

  1. Failing participants must be restarted. There is an option to change the callbacks on restart, any of the endpoints can be changed, even passing over responsibility for, say, the compensation to some other microservice. Likewise, failing coordinators must be restarted if progress of LRAs is to be made.
  2. There is an @Status annotation on participants that the coordinator can use to monitor participant progress and to enable participants to fully participate in the recovery protocol, in particular there is support for non-idempotent compensate endpoints; if there is an @Status endpoint and the compensate endpoint has previously returned a 202 Accepted HTTP status code, then it will periodically poll the status endpoint until the participant reports that it has reached an end state. The @Forget annotation is used by the coordinator to inform the participant that it is free to clean up.
  3. There are state transitions which participants use to notify the coordinator of failures (FailedToCompensate and FailedToComplete) and of transitory states (Compensating and Completing).
  4. Managing timeouts, although the actions supported by the protocol are long running careful choice of time limits for actions can bound failure windows and reduce the need for complicated recover procedures.
  5. And of course there is support for nested Long Running Actions which is a jewel in the toolkit for building reliable distributed systems.

That’s all for now - I’ve deliberately kept the ideas brief and high level so that they can be explored in greater depth later.