[openstack-dev] [oslo.messaging] State wrapping in the MessageHandlingServer

Matthew Booth mbooth at redhat.com
Wed Nov 11 10:00:01 UTC 2015


On Tue, Nov 10, 2015 at 6:46 PM, Joshua Harlow <harlowja at fastmail.com>
wrote:

> Matthew Booth wrote:
>
>> My patch to MessageHandlingServer is currently being reverted because it
>> broke Nova tests:
>>
>> https://review.openstack.org/#/c/235347/
>>
>> Specifically it causes a number of tests to take a very long time to
>> execute, which ultimately results in the total build time limit being
>> exceeded. This is very easy to replicate. The
>> test
>> nova.tests.functional.test_server_group.ServerGroupTest.test_boot_servers_with_affinity
>> is an example test which will always hit this issue. The problem is
>> that ServerGroupTest.setUp() does:
>>
>>          self.compute2 = self.start_service('compute', host='host2')
>>          self.addCleanup(self.compute2.kill)
>>
>> The problem with this is that start_service() adds a fixture which also
>> adds kill as a cleanup method. kill does stop(), wait(). This means that
>> the resulting call order is: start, stop, wait, stop, wait. The
>> redundant call to kill is obviously a wart, but I feel we should have
>> handled it anyway.
>>
>> The problem is that we decided it should be possible to restart a
>> server. There are some unit tests in oslo.messaging that do this. It's
>> not clear to me that there are any projects which do this, but after
>> this experience I feel like it would be good to check before changing it
>> :)
>>
>> The implication of that is that after wait() the state wraps, and we're
>> now waiting on start() again. Consequently, when the second cleanup call
>> hangs.
>>
>> We could fix Nova (at least the usage we have seen) by removing the
>> wrapping. After wait() if you want to start a server again you need to
>> create a new one.
>>
>> So, to be specific, lets consider the following 2 call sequences:
>>
>> 1. start stop wait stop wait
>> 2. start stop wait start stop wait
>>
>> What should they do? The behaviours with and without wrapping are:
>>
>> 1. start stop wait stop wait
>> WRAP: start stop wait HANG HANG
>> NO WRAP: start stop wait NO-OP NO-OP
>>
>> 2. start stop wait start stop wait
>> WRAP: start stop wait start stop wait
>> NO WRAP: start stop wait NO-OP NO-OP NO-OP
>>
>> I'll refresh my memory on what they did before my change in the morning.
>> Perhaps it might be simpler to codify the current behaviour, but iirc I
>> proposed this because it was previously undefined due to races.
>>
>
> I personally prefer not allowing restarting, its needless code complexity
> imho and a feature that people imho probably aren't using anyway (just
> create a new server object if u are doing this), so I'd be fine with doing
> the above NO WRAP and turning those into NO-OPs (and for example raising a
> runtime error in the case of start stop wait start ... to denote that
> restarting isn't recommended/possible). If we have a strong enough reason
> to really to start stop wait start ...
>
> I might be convinced the code complexity is worth it but for now I'm not
> convinced...
>

I agree, and in the hopefully unlikely event that we did break anybody, at
least they would get an obvious exception rather than a hang. A lesson from
breaking nova was that the log messages were generated and were available
in the failed test runs, but nobody noticed them.

Incidentally, I think I'd also merge my second patch into the first before
resubmitting, which adds timeouts and the option not to log.

Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20151111/704af828/attachment.html>


More information about the OpenStack-dev mailing list