Narayana team blog: WS-BA Participant Completion Race Condition: Part Two

The Details

In a previous post, I described a benign race condition that in unusual circumstances can cause some Business Activities to be cancelled that would have otherwise been able to close. In this post I'll go into the details of how this happens.

First consider the following client code:

UserBusinessActivity uba = UserBusinessActivityFactory.userBusinessActivity();
uba.begin();
myWebServiceClient.invoke();
uba.close();

The client code is very simple, it just begins a business activity, invokes a Web service and then closes the business activity. The Web service uses the Participant-Completion protocol and so notifies the coordinator of completion just before returning control to the client.

Here's a diagram showing the pertinent message exchanges that occur under a normal situation.

The messages are numbered to indicate the order in which they are sent.

1. request. This represents the application request made by the client.
2. completed. After the participant has completed its work, it notifies the coordinator that it has completed.
3. response. This represents the response to the client's application request.
4. close. The client notifies the coordinator that it wishes to close the activity. It then waits for a 'closed' or failure response from the coordinator.
5a. close/5b. closed. The coordinator has processed the '2.completed' message so can close the activity. It starts by sending the 'close' message to the participant and waits for the 'closed' response as confirmation. These two messages are asynchronous.
6. closed. The coordinator now has all 'closed' acknowledgements so notifies the client that the activity successfully closed.

Messages '2.completed' and '4.close' are asynchronous (or 'one way' in Web services parlance) so effectively, we have a race condition with the following competing parties:

Party 1. The completed message '2.completed'.
Party 2. The response '3.response' followed by '4.close'.

When running in the same VM, or on a low latency network, '3.response' will be sent very quickly. This is because it is simply travelling on the HTTP response over an already open socket. This just leaves messages '2.completed' and '4.close' which will take much longer relative to '3.response'. To understand this, lets take a look at what happens when an asynchronous Web service call is made:

The client sends the message to the Web service
The server-side SOAP stack uses an existing thread from a pool dedicated to receiving SOAP messages.
As the service is asynchronous, the message will be passed to another thread to be processed.
The receiving thread will now return the HTTP response.

The race condition occurs because steps 1-3 can happen relatively quickly in a single VM, and thus it's likely that both messages 2 and 4, will be waiting to be processed at the same time. The order in which they are processed is dependent on the implementation of the thread pool and is also at the mercy of thread scheduling in the VM, so it's possible that either could be processed first.

This race condition is much less likely to happen in a distributed environment as the network costs will be significantly higher. As a result message '3.response' will take long enough to send, so as to give message '2.completed' enough of a head start. But it is still possible so the client application must be coded defensively to catch and handle a TransactionRollbackException. Your code ought to be doing this anyway to deal with server crashes.

Here's a diagram showing what messages are exchanged when the race condition occurs. You will see that the activity ends in a consistent state.

I've omitted messages 1-3 from the following explanation as they are the same as in the success case.

4. close. This message is processed by the coordinator before message '2.completed'
5a. cancel. The coordinator has not yet processed the '2.completed' message so cannot close the activity. The coordinator then sends a 'cancel' message to the participant as it thinks it has not yet completed. This message and subsequent retires, are dropped by the participant as they are not valid for a completed participant.
5b. compensate/5c. compensated. After one or more unacknowledged 'cancel' messages, the coordinator switches to sending 'compensate' messages which will cause the participant to compensate the work. The participant acknowledges with a 'compensated' reply.
6. Transaction rolledback exception. The coordinator notifies the client that the activity failed to close.

As you can see from the steps above, when this race condition arises, any work done by participants is compensated and the client is notified of the outcome. Thus a consistent outcome is achieved.

Wednesday, January 30, 2013

WS-BA Participant Completion Race Condition: Part Two

The Details

1 comment: