[openstack-dev] [mistral] bugfix for "Fix concurrency issues by using READ_COMMITTED" unveils / creates a different bug
clint at fewbar.com
Thu Dec 10 20:27:04 UTC 2015
Excerpts from ELISHA, Moshe (Moshe)'s message of 2015-12-07 08:29:44 -0800:
> Hi all,
> The current bugfix I am working on has unveiled (or created) a bug.
> Test "WorkflowResumeTest.test_resume_different_task_states" sometimes fails because "task4" is executed twice instead of once (See unit test output and workflow below).
> This happens because task2 on-complete is running task4 as expected but also task3 executes task4 by mistake.
> The failure is not consistent, but it happens quite often: it occurs when the unit test resumes the WF, updates the action execution of task2, and finishes task2 before task3 is finished.
> 1. Task2 in method on_action_complete - changes task2 state to RUNNING.
> 2. Task3 in method on_action_complete - changes task2 state to RUNNING (before task2 calls _on_task_state_change).
> 3. Task3 in "_on_task_state_change" > "continue_workflow" > "DirectWorkflowController._find_next_commands" - it finds task2 because task2 is in SUCCESS and processed = False, and "_find_next_commands_for_task(task2)" returns task4.
> 4. Task3 executes command to RunTask task4.
> 5. Task2 in "_on_task_state_change" > "continue_workflow" > "DirectWorkflowController._find_next_commands" - it finds task2 because task2 is in SUCCESS and processed = False, and "_find_next_commands_for_task(task2)" returns task4.
> 6. Task2 executes command to RunTask task4.
>  - https://review.openstack.org/#/c/253819/
> If I am not mistaken, this issue also exists in the current code and my bugfix only made it happen much more often. Can you confirm?
> I don't have enough knowledge on how to fix this issue...
> For now - I have modified the test_resume_different_task_states unit test to wait for task3 to be processed before updating the action execution of task2.
> If you agree this bug exists today as well, we can proceed with my bugfix and open a different bug for that issue.
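[Editorial note: the race described in steps 1-6 above is a classic check-then-act problem. A minimal Python sketch of the pattern (names are hypothetical, not Mistral's actual code): both completion handlers check "task2 succeeded and is unprocessed" before either one marks it processed, so both may schedule task4.]

```python
import threading

# Shared state: task2 has succeeded but its on-complete clause
# (run task4) has not yet been handled ("processed").
task2 = {"state": "SUCCESS", "processed": False}
scheduled = []

def find_next_commands(caller):
    # Both task2's and task3's completion handlers run this check.
    if task2["state"] == "SUCCESS" and not task2["processed"]:  # check
        scheduled.append(("task4", caller))                     # act
        task2["processed"] = True  # too late: both threads may get here

t_a = threading.Thread(target=find_next_commands, args=("task2-handler",))
t_b = threading.Thread(target=find_next_commands, args=("task3-handler",))
t_a.start(); t_b.start()
t_a.join(); t_b.join()

# Depending on thread interleaving, task4 is scheduled once or twice.
print(len(scheduled))
```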
I'd agree that this is likely happening more reliably because READ
COMMITTED will just give you the state that causes the bug more often
than REPEATABLE READ: now, if two threads happen to be running at the
same time when the new vulnerable state is reached, they can both see
the new state and react to it. Before, they only had that problem if
they both started after the enabling state was in the db, thus sharing
the same db snapshot.
What you actually need is to atomically claim the rows. You have to do
UPDATE whatever_table SET executor = 'me' WHERE executor IS NULL;
And if you get 0 rows updated, that means somebody else claimed it and
you do nothing. Note that you also need some liveness testing in this
system, since if your executor dies, that row will be lost forever. In
Heat they solved it by having a queue for each executor and pinging on
oslo.messaging. Please don't do that.
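[Editorial note: a minimal sketch of that atomic claim, using sqlite3 for self-containedness. The table and column names are hypothetical; the point is that the WHERE clause makes the check and the claim a single atomic statement, and a rowcount of 0 tells the loser to back off.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, executor TEXT)")
conn.execute("INSERT INTO tasks (id, executor) VALUES (1, NULL)")
conn.commit()

def try_claim(conn, task_id, me):
    # Atomic compare-and-set: the UPDATE only matches if the row is
    # still unclaimed, so exactly one caller can win the race.
    cur = conn.execute(
        "UPDATE tasks SET executor = ? WHERE id = ? AND executor IS NULL",
        (me, task_id))
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated -> somebody else claimed it

print(try_claim(conn, 1, "executor-a"))  # True: first claim wins
print(try_claim(conn, 1, "executor-b"))  # False: row already claimed
```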
I suggest instead switching to tooz, and joining the distributed lock
revolution, where zookeeper will give you a nice atomic distributed lock
for this, and detect when to break it because of a dead executor. (or
consul or etcd once our fine community finishes landing those)