[openstack-dev] [oslo] logging around oslo lockutils

Rochelle.RochelleGrober rochelle.grober at huawei.com
Fri Sep 26 00:25:36 UTC 2014


+1

Exactly what I was thinking.  Semaphore races and deadlocks need to be traceable, but a normal production cloud doesn't want to see those messages.

What might be even better would be to also put a counter on the semaphores so that if the count ever goes above 1 or below 0, an error is reported at normal log levels.  I'm assuming it would be an error; I can't see why it would be just a warn or info, but I don't know the guts of the code here.
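
A purely illustrative sketch of that kind of counter (the class and the
log wording here are hypothetical, not anything that exists in lockutils):

    import logging
    import threading

    LOG = logging.getLogger(__name__)

    class CountingSemaphore(object):
        """Binary-semaphore wrapper that reports impossible holder counts."""

        def __init__(self, name):
            self.name = name
            self._sem = threading.Semaphore(1)
            self._holders = 0
            self._count_lock = threading.Lock()

        def acquire(self):
            self._sem.acquire()
            with self._count_lock:
                self._holders += 1
                if self._holders > 1:
                    # Should be impossible for a mutex-style semaphore, so
                    # report it at ERROR where production logs will show it.
                    LOG.error('Semaphore "%s" held %d times at once',
                              self.name, self._holders)

        def release(self):
            with self._count_lock:
                self._holders -= 1
                if self._holders < 0:
                    LOG.error('Semaphore "%s" released more times than it '
                              'was acquired', self.name)
            self._sem.release()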

--Rocky

-----Original Message-----
From: Joshua Harlow [mailto:harlowja at outlook.com] 
Sent: Thursday, September 25, 2014 12:23 PM
To: openstack at nemebean.com; OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [oslo] logging around oslo lockutils

Or how about we add in a new log level?

A few libraries I have come across support log level 5 (which is less than DEBUG (10) but greater than NOTSET (0))...

One usage of this is in the multiprocessing library in Python itself @

https://hg.python.org/releasing/3.4/file/8671f89107c8/Lib/multiprocessing/util.py#l34

Kazoo calls it the 'BLATHER' level @

https://github.com/python-zk/kazoo/blob/master/kazoo/loggingsupport.py

Since these messages can actually be useful for lockutils developers, it could be worth keeping them at a level like that[1]?
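
For reference, registering such a level with the stdlib logging module only
takes a couple of lines (the level name and the helper below are just
examples, not an existing oslo API):

    import logging

    BLATHER = 5  # below DEBUG (10), above NOTSET (0)
    logging.addLevelName(BLATHER, 'BLATHER')

    LOG = logging.getLogger('nova.openstack.common.lockutils')

    def blather(logger, msg, *args, **kwargs):
        if logger.isEnabledFor(BLATHER):
            logger.log(BLATHER, msg, *args, **kwargs)

    # lockutils internals could then emit their chatter below DEBUG:
    # blather(LOG, 'Created new semaphore "%s"', 'iptables')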

Just a thought...

[1] One man's DEBUG is another man's garbage, ha.

On Sep 25, 2014, at 12:06 PM, Ben Nemec <openstack at nemebean.com> wrote:

> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> Having spent a ton of time reading logs, I've found that oslo locking
>> basically creates a huge amount of output at DEBUG that you have to
>> mentally filter to find problems:
>> 
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Created new semaphore "iptables" internal_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:206
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Acquired semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Attempting to grab external lock "iptables" external_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:178
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got file lock "/opt/stack/data/nova/nova-iptables" acquire
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:93
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got semaphore / lock "_do_refresh_provider_fw_rules" inner
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:271
>> 2014-09-24 18:44:49.244 DEBUG nova.compute.manager
>> [req-b91cb1c1-f211-43ef-9714-651eeb3b2302
>> DeleteServersAdminTestXML-1408641898
>> DeleteServersAdminTestXML-469708524] [instance:
>> 98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
>> BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=<?>,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
>> _cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Released file lock "/opt/stack/data/nova/nova-iptables" release
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:115
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Releasing semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:238
>> 2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Semaphore / lock released "_do_refresh_provider_fw_rules" inner
>> 
>> Also readable here:
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>> 
>> (Yes, it's kind of ugly)
>> 
>> What occurred to me is that in debugging locking issues what we actually
>> care about are two things semantically:
>> 
>> #1 - we tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
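>> 
>> A sketch of what detecting #1 could look like (hypothetical helper, not
>> anything that exists in lockutils today):
>> 
>>     import logging
>>     import time
>> 
>>     LOG = logging.getLogger(__name__)
>> 
>>     def acquire_logging_contention(sem, name):
>>         # Only say something when we actually had to wait for the lock.
>>         if sem.acquire(blocking=False):
>>             return
>>         start = time.time()
>>         sem.acquire()
>>         LOG.debug('Lock "%s" was contended, waited %.3fs',
>>                   name, time.time() - start)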
> 
> We did just merge https://review.openstack.org/#/c/122166/ which adds
> some contention/timing information to the log messages and should at
> least be a step toward what you're talking about.
> 
> For context, we had some bad logging that resulted in
> https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 .  That has
> been cleaned up to at least be accurate, but it did add an extra log
> message (created and acquired).  The reason we cared about that is we
> thought there might be a bad interaction between our code and eventlet,
> so we wanted to know whether we were in fact locking the same semaphore
> twice or mistakenly creating two separate ones (as it turns out, neither
> - it was just the bad logging I mentioned earlier).
> 
> So, given that I think everyone involved agrees that the double-locking
> thing was a cosmetic issue and not a functional one, we could probably
> just remove the created/using messages here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L202
> which would eliminate one message per lock operation without
> significantly impacting debuggability of our code.  Those messages only
> exist because we didn't trust what was going on at a lower level.
> 
> It would also be nice to reduce the duplication of the messages.  For
> example, we log that we acquired a semaphore here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L229
> and then log essentially the same information here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L270
> 
> I don't think we should completely remove the logging in either place,
> because depending on how the code is called both can serve a useful,
> non-overlapping purpose.  We could, however, add a parameter to lock()
> that tells it whether it needs to log the acquire/release, and then when
> it's called from synchronized() we just set that to False, since we know
> that information is already going to be logged.
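> 
> A minimal sketch of that idea (the parameter name and helper functions
> here are hypothetical, not the real lockutils code):
> 
>     import contextlib
>     import functools
>     import logging
>     import threading
> 
>     LOG = logging.getLogger(__name__)
>     _semaphores = {}
> 
>     def _get_semaphore(name):
>         # Stand-in for however lockutils really tracks its semaphores.
>         return _semaphores.setdefault(name, threading.Semaphore(1))
> 
>     @contextlib.contextmanager
>     def lock(name, log_acquire=True):
>         sem = _get_semaphore(name)
>         with sem:
>             if log_acquire:
>                 LOG.debug('Acquired semaphore "%s"', name)
>             try:
>                 yield sem
>             finally:
>                 if log_acquire:
>                     LOG.debug('Releasing semaphore "%s"', name)
> 
>     def synchronized(name):
>         def wrap(f):
>             @functools.wraps(f)
>             def inner(*args, **kwargs):
>                 # synchronized() logs its own enter/exit message, so
>                 # tell lock() to stay quiet about the same event.
>                 with lock(name, log_acquire=False):
>                     LOG.debug('Got semaphore / lock "%s"', f.__name__)
>                     return f(*args, **kwargs)
>             return inner
>         return wrap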
> 
> Oh, I also see the logging is hosed for external_lock too.  We need to
> fix that to work like the internal locks do, possibly with additional
> log message de-duplication (although we're actually dealing with two
> separate locks there, so that might not be wise...).
> 
> In summary, my initial suggestions would be:
> 1) Remove the logging in internal_lock
> 2) Eliminate the duplicate logging when using the synchronized decorator
> 3) Fix the external_lock logging since it's useless and misleading right
> now.
> 
> I don't know that those cleanups would completely address your issues,
> but I think that, together with the already-merged timing patch I linked
> above, they would get us a long way toward less sucky lockutils logging,
> and would do so with minimal performance impact.  If that proves to still
> be too much log spam, we could look into more drastic measures later, of
> course.
> 
> Thoughts?
> 
> -Ben
> 
>> 
>> #2 turned out to be a critical bit in understanding one of the worst
>> recent gate-impacting issues.
>> 
>> You can write a tool today that analyzes the logs and shows you these
>> things. However, I wonder if we could do something creative in the code
>> itself to surface this directly. I'm curious whether creative use of
>> Timers might let us emit log messages under the conditions above
>> (someone with a better understanding of Python internals needs to speak
>> up here). Maybe it's too much overhead, but I think it's worth at least
>> asking the question.
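>> 
>> A hypothetical sketch of the Timer idea (names made up, purely
>> illustrative):
>> 
>>     import logging
>>     import threading
>> 
>>     LOG = logging.getLogger(__name__)
>> 
>>     class HeldTooLongWarner(object):
>>         """Warn if a lock is still held after `timeout` seconds."""
>> 
>>         def __init__(self, name, timeout=10):
>>             self.name = name
>>             self.timeout = timeout
>>             self._timer = None
>> 
>>         def acquired(self):
>>             self._timer = threading.Timer(
>>                 self.timeout, LOG.warning,
>>                 ['Lock "%s" still held after %ss', self.name, self.timeout])
>>             self._timer.daemon = True
>>             self._timer.start()
>> 
>>         def released(self):
>>             if self._timer is not None:
>>                 self._timer.cancel()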
>> 
>> The same issue exists for processutils, I think: a warning that a
>> command is still running after 10s might be really handy, because it
>> turns out that issue #2 was caused by exactly this, and it took quite a
>> bit of decoding to figure that out.
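>> 
>> The same Timer trick could cover that case too (again just a sketch, not
>> the real processutils.execute()):
>> 
>>     import logging
>>     import subprocess
>>     import threading
>> 
>>     LOG = logging.getLogger(__name__)
>> 
>>     def execute_with_warning(cmd, warn_after=10):
>>         proc = subprocess.Popen(cmd)
>> 
>>         def _warn():
>>             if proc.poll() is None:
>>                 LOG.warning('Command %s still running after %ss',
>>                             cmd, warn_after)
>> 
>>         timer = threading.Timer(warn_after, _warn)
>>         timer.daemon = True
>>         timer.start()
>>         try:
>>             return proc.wait()
>>         finally:
>>             timer.cancel()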
>> 
>> 	-Sean
>> 
> 
> 


_______________________________________________
OpenStack-dev mailing list
OpenStack-dev at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


