Monday, September 30, 2024

Coping with Failures during Long Running Actions

In this brief note I want to draw attention to some of the features in the LRA protocol that can help service writers manage failures. LRA is a transaction protocol that provides desirable properties for building reliable systems, such as Atomicity, (eventual) Consistency and Durability. Providing this level of assurance is non-trivial, but the protocol offers a simple model that lets participants easily play their part in enabling such systems.

LRA is not just for orchestrating services; it is equally important for managing failures. Apart from the specification I have not seen many posts or articles covering this important topic, and it is this deficit that I’d like to address in a series of posts. I had wanted to kick off with an article and demonstration of participant failover, but I hit an issue while writing the demo and we need to release the fix before I can showcase it. So instead, in this post I’ll just bring to the reader's attention one or two, but by no means all, of the main features that service writers can use to create more reliable microservices: a preview, if you like, before going into more depth in a subsequent post.

Some notable items to consider include:

  1. Failed participants must be restarted. On restart there is an option to change the callbacks: any of the endpoints can be changed, even handing over responsibility for, say, compensation to some other microservice. Likewise, failed coordinators must be restarted if LRAs are to make progress.
  2. There is an @Status annotation on participants that the coordinator can use to monitor participant progress and that enables participants to take a full part in the recovery protocol. In particular there is support for non-idempotent compensate endpoints: if the participant provides an @Status endpoint and its compensate endpoint has previously returned a 202 Accepted HTTP status code, then the coordinator will periodically poll the status endpoint until the participant reports that it has reached an end state. The @Forget annotation is used by the coordinator to inform the participant that it is free to clean up (see the first sketch after this list).
  3. There are state transitions which participants use to notify the coordinator of failures (FailedToCompensate and FailedToComplete) and of transitory states (Compensating and Completing).
  4. Managing timeouts: although the actions supported by the protocol are long running, careful choice of time limits can bound failure windows and reduce the need for complicated recovery procedures.
  5. And of course there is support for nested Long Running Actions, which is a jewel in the toolkit for building reliable distributed systems (see the second sketch after this list).
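
To make items 2 and 3 concrete, here is a minimal sketch of what such a participant could look like using the MicroProfile LRA annotations (the example assumes a Jakarta REST runtime). The resource paths, the BookingParticipant class and its in-memory status map are invented purely for illustration; a real service would persist its state. Only the annotations (@LRA, @Compensate, @Complete, @Status, @Forget) and the ParticipantStatus values come from the LRA API.

import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.eclipse.microprofile.lra.annotation.Compensate;
import org.eclipse.microprofile.lra.annotation.Complete;
import org.eclipse.microprofile.lra.annotation.Forget;
import org.eclipse.microprofile.lra.annotation.ParticipantStatus;
import org.eclipse.microprofile.lra.annotation.Status;
import org.eclipse.microprofile.lra.annotation.ws.rs.LRA;

import jakarta.ws.rs.DELETE;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.HeaderParam;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.PUT;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.core.Response;

@Path("/booking")
public class BookingParticipant {
    // Track per-LRA progress so the @Status endpoint can report it back to the
    // coordinator (a real service would persist this so it survives a restart).
    private final Map<URI, ParticipantStatus> statuses = new ConcurrentHashMap<>();

    @POST
    @Path("/book")
    @LRA(value = LRA.Type.REQUIRED)
    public Response book(@HeaderParam(LRA.LRA_HTTP_CONTEXT_HEADER) URI lraId) {
        statuses.put(lraId, ParticipantStatus.Active);
        // ... do the provisional work ...
        return Response.ok().build();
    }

    // Compensation may take a while (undoing work against a slow backend), so
    // return 202 Accepted and let the coordinator poll the @Status endpoint.
    @PUT
    @Path("/compensate")
    @Compensate
    public Response compensate(@HeaderParam(LRA.LRA_HTTP_CONTEXT_HEADER) URI lraId) {
        statuses.put(lraId, ParticipantStatus.Compensating);
        startUndoInBackground(lraId); // hypothetical asynchronous undo
        return Response.accepted().build(); // 202: "still working on it"
    }

    @PUT
    @Path("/complete")
    @Complete
    public Response complete(@HeaderParam(LRA.LRA_HTTP_CONTEXT_HEADER) URI lraId) {
        statuses.put(lraId, ParticipantStatus.Completed);
        return Response.ok().build();
    }

    // The coordinator polls this until an end state is reported (Compensated,
    // Completed, FailedToCompensate or FailedToComplete).
    @GET
    @Path("/status")
    @Status
    public Response status(@HeaderParam(LRA.LRA_HTTP_CONTEXT_HEADER) URI lraId) {
        ParticipantStatus status = statuses.getOrDefault(lraId, ParticipantStatus.Active);
        return Response.ok(status.name()).build();
    }

    // Once an end state has been reported, the coordinator calls @Forget and the
    // participant is free to discard anything it was keeping for this LRA.
    @DELETE
    @Path("/forget")
    @Forget
    public Response forget(@HeaderParam(LRA.LRA_HTTP_CONTEXT_HEADER) URI lraId) {
        statuses.remove(lraId);
        return Response.ok().build();
    }

    private void startUndoInBackground(URI lraId) {
        // Hypothetical: kick off the real undo work and record the outcome,
        // e.g. ParticipantStatus.Compensated or ParticipantStatus.FailedToCompensate.
    }
}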
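
A second sketch illustrates items 4 and 5: putting a time limit on an LRA and starting a nested LRA from within it. Again, the paths, class and method names are made up for the example; timeLimit, timeUnit, LRA.Type.NESTED and the parent context header are part of the @LRA API.

import java.net.URI;
import java.time.temporal.ChronoUnit;

import org.eclipse.microprofile.lra.annotation.ws.rs.LRA;

import jakarta.ws.rs.HeaderParam;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.core.Response;

@Path("/trip")
public class TripResource {

    // If this LRA has not been closed within 5 minutes the coordinator cancels it
    // and calls the participants' @Compensate endpoints, bounding the failure window.
    @POST
    @Path("/book")
    @LRA(value = LRA.Type.REQUIRED, timeLimit = 5, timeUnit = ChronoUnit.MINUTES)
    public Response bookTrip(@HeaderParam(LRA.LRA_HTTP_CONTEXT_HEADER) URI lraId) {
        // ... enlist participants, call other services ...
        return Response.ok().build();
    }

    // A nested LRA: it can complete or compensate on its own, but its final outcome
    // is only confirmed when the enclosing (parent) LRA closes or cancels.
    @POST
    @Path("/book-hotel")
    @LRA(value = LRA.Type.NESTED, timeLimit = 2, timeUnit = ChronoUnit.MINUTES)
    public Response bookHotel(@HeaderParam(LRA.LRA_HTTP_CONTEXT_HEADER) URI nestedLraId,
                              @HeaderParam(LRA.LRA_HTTP_PARENT_CONTEXT_HEADER) URI parentLraId) {
        // ... provisional hotel booking ...
        return Response.ok().build();
    }
}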

That’s all for now - I’ve deliberately kept the ideas brief and high level so that they can be explored in greater depth later.

5 comments:

sachin said...

Hi.. I am using the narayana lra coordinator image hosted on EKS. When my coordinator restarts during a transaction, it doesn't pick up the existing transaction, causing the existing transaction to finish without completing. On further analysis I found that the transaction manager is not starting when the image starts. Instead it starts during the first lra-coordinator/start call. Can you please suggest a solution?

Michael Musgrove said...

Hi Sachin, I am assuming the following scenario:

1) You deployed a single coordinator.
2) You started an application which sent a request to that coordinator to start an LRA which succeeded (ie the coordinator returned the id of the LRA back to your application).
3) The coordinator failed and you restarted it.
The coordinator will not reload the transaction until it has run an initial recovery scan, which by default happens every 120 seconds. When running with WildFly we recently resolved an RFE to run an initial scan on start-up (https://issues.redhat.com/browse/WFLY-20026), but WildFly is the only container where we added that feature. If you want it for whichever container you are using then please let us know and we'll look into adding it, which I think is necessary.
4) The LRA timed out and your application was told to compensate.

If my analysis of your use case and scenario is correct then I would suggest using a longer timeout for the LRA and/or manually asking for a recovery scan as soon as the coordinator is restarted (send an HTTP GET request to the recovery endpoint http://localhost:/lra-coordinator/coordinator).

Michael Musgrove said...

The HTTP GET request needs to be sent to the /recovery endpoint of the recovery coordinator url (typically that would be http://localhost:port/lra-coordinator/recovery). This request should hit the JAX-RS method: https://github.com/jbosstm/lra/blob/0.0.10.Final/coordinator/src/main/java/io/narayana/lra/coordinator/api/RecoveryCoordinator.java#L144
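
If you would rather trigger that scan programmatically than with curl, something along these lines should work (the host and port are placeholders for wherever your coordinator is deployed):

import jakarta.ws.rs.client.Client;
import jakarta.ws.rs.client.ClientBuilder;
import jakarta.ws.rs.core.Response;

public class TriggerRecoveryScan {
    public static void main(String[] args) {
        // Placeholder host/port: point this at your lra-coordinator deployment
        String recoveryUrl = "http://localhost:8080/lra-coordinator/recovery";

        Client client = ClientBuilder.newClient();
        try {
            // A GET on the recovery endpoint asks the coordinator to run a
            // recovery scan now rather than waiting for the next periodic scan
            Response response = client.target(recoveryUrl).request().get();
            System.out.println("Recovery scan returned HTTP " + response.getStatus());
        } finally {
            client.close();
        }
    }
}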

sachin said...

Hi Michael..Thanks for the reply.

Environment details: single container with the image from quay.io
The saga spans two REST APIs
The orchestrator service is written using Camel Saga
Persistence is enabled using the file system

With regard to your understanding:
Point 3: We are trying to simulate the behavior of what happens if the lra coordinator gets restarted in the middle of a transaction. We are using a sleep between the two REST APIs.
Point 4: The recovery module is not starting even after 120 seconds. It starts only after the first transaction.

Please suggest.

Michael Musgrove said...

I'm not sure, maybe we need a startup hook (something like an "@Observes StartupEvent event" possibly, or the JAX-RS equivalent). Could you let us know what podman/docker command you use to restart the coordinator, including the volume option, so that it gets the correct storage (it needs access to the same object store it had when you first created the LRA)?

From what you report, I think hitting any of the coordinator endpoints would be enough to start it, for example curl -X GET http://localhost:8080/lra-coordinator, i.e. you wouldn't need to start a new LRA.