[openstack-dev] [oslo] logging around oslo lockutils

Sean Dague sean at dague.net
Fri Sep 26 10:51:53 UTC 2014


On 09/26/2014 12:03 AM, Christopher Yeoh wrote:
> On Thu, 25 Sep 2014 08:49:12 -0400
> Sean Dague <sean at dague.net> wrote:
> 
>>
>> #1 - we tried to get a lock, but someone else has it. Then we know
>> we've got lock contention.
>> #2 - something is still holding a lock after some "long" amount of
>> time.
> 
> +1 to both.
> 
>> #2 turned out to be a critical bit in understanding one of the worst
>> recent gate-impacting issues.
>>
>> You can write a tool today that analyzes the logs and shows you these
>> things. However, I wonder if we could do something creative in the
>> code itself to surface this directly. I'm curious whether the creative
>> use of Timers might let us emit log messages under the conditions
>> above (someone with better understanding of python internals needs to
>> speak up here). Maybe it's too much overhead, but I think it's worth
>> at least asking the question.
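
To make that concrete, here's a rough sketch of what I mean
(WatchedLock is an invented name for illustration, not an oslo API,
and the threshold is arbitrary):

    import logging
    import threading

    LOG = logging.getLogger(__name__)

    class WatchedLock:
        """Lock wrapper that warns when held past a threshold."""

        def __init__(self, name, warn_after=10.0):
            self._name = name
            self._warn_after = warn_after
            self._lock = threading.Lock()
            self._timer = None

        def __enter__(self):
            self._lock.acquire()
            # The Timer fires on a helper thread only if we are
            # still holding the lock when the threshold expires.
            self._timer = threading.Timer(
                self._warn_after, LOG.warning,
                args=('lock "%s" still held after %.1fs',
                      self._name, self._warn_after))
            self._timer.daemon = True
            self._timer.start()
            return self

        def __exit__(self, exc_type, exc_val, exc_tb):
            self._timer.cancel()
            self._lock.release()

That would get us log message #2 at the moment the threshold passes,
instead of having to reconstruct it from the logs after the fact.
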
> 
> Even a simple log at the end, when it's finally released after being
> held for a long time, would be handy, as matching up acquire/release
> by eye is not easy.
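
Even without a Timer, logging the hold time at release is only a few
lines -- a sketch, again with invented names:

    import contextlib
    import logging
    import time

    LOG = logging.getLogger(__name__)

    @contextlib.contextmanager
    def timed_hold(lock, name, warn_after=10.0):
        """Acquire lock; at release, log if it was held a long time."""
        with lock:
            start = time.monotonic()
            try:
                yield
            finally:
                held = time.monotonic() - start
                if held > warn_after:
                    LOG.warning('lock "%s" released after being '
                                'held for %.1fs', name, held)
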
> 
> I don't think we get a log message when an acquire is attempted but
> fails. That might help get a measure of lock contention?
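
Something like this would be cheap to add -- a sketch (acquire_logged
is an invented helper, not the lockutils API):

    import logging

    LOG = logging.getLogger(__name__)

    def acquire_logged(lock, name):
        """Acquire lock, logging if someone else already holds it."""
        # A failed non-blocking attempt is exactly the contention
        # signal we want a record of.
        if not lock.acquire(False):
            LOG.info('lock "%s" is contended, waiting for it', name)
            lock.acquire()

Counting those messages per lock name over a run would give a rough
measure of contention.
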
> 
>> The same issue exists with processutils, I think: warning that a
>> command is still running after 10s might be really handy, because it
>> turns out that issue #2 was caused by this, and it took quite a bit
>> of decoding to figure that out.
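
The same Timer trick would cover that case as well -- a sketch, not
the actual processutils code:

    import logging
    import subprocess
    import threading

    LOG = logging.getLogger(__name__)

    def execute_with_warning(cmd, warn_after=10.0):
        """Run cmd, warning if still running after warn_after seconds."""
        proc = subprocess.Popen(cmd)
        timer = threading.Timer(
            warn_after, LOG.warning,
            args=('command %s still running after %.0fs',
                  cmd, warn_after))
        timer.daemon = True
        timer.start()
        try:
            return proc.wait()
        finally:
            timer.cancel()
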
> 
> Also I think that a log message when a periodic task takes longer
> than the interval at which it is meant to run would be a useful
> warning sign that something odd is going on.

Well, we do get log messages for some of the periodic tasks today when
they overrun, IIRC.
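
Roughly, that kind of check just times each run against the interval
-- a sketch, not the actual periodic task code:

    import logging
    import time

    LOG = logging.getLogger(__name__)

    def run_periodic(task, interval):
        """Call task every interval seconds, warning on overruns."""
        while True:
            start = time.monotonic()
            task()
            elapsed = time.monotonic() - start
            if elapsed > interval:
                LOG.warning('periodic task %s took %.1fs, longer '
                            'than its %.1fs interval',
                            task.__name__, elapsed, interval)
            else:
                time.sleep(interval - elapsed)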

However... the surprising thing about the Nova resource tracker lock
bug was that basically all the time was taken by blockdev failing
(which took almost 2 minutes), and that call happened to be inside a
lock it probably didn't need to be in.

I think if we ever find we're holding a lock for > 10s, we're going to
start impacting the rest of the system in very bad ways, especially
given the RPC timeouts at 60s. Getting that information as close to
the actual issue as possible would be super helpful.

	-Sean

-- 
Sean Dague
http://dague.net


