[openstack-dev] [oslo] logging around oslo lockutils

Joshua Harlow harlowja at outlook.com
Thu Sep 25 19:22:57 UTC 2014


Or how about we add a new log level?

A few libraries I have come across support log level 5 (which is less than DEBUG (10) but greater than NOTSET (0))...

One usage of this is in Python's own multiprocessing library @

https://hg.python.org/releasing/3.4/file/8671f89107c8/Lib/multiprocessing/util.py#l34

Kazoo calls it the 'BLATHER' level @

https://github.com/python-zk/kazoo/blob/master/kazoo/loggingsupport.py

Since these messages can actually be useful to lockutils developers, it might be worth keeping them at a level like that[1]?
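
A minimal sketch of what that looks like with the stdlib logging module
(the 'TRACE' name is just a placeholder, pick your own bikeshed color):

    import logging

    TRACE = 5  # below DEBUG (10), above NOTSET (0)
    logging.addLevelName(TRACE, 'TRACE')

    def trace(self, msg, *args, **kwargs):
        if self.isEnabledFor(TRACE):
            self._log(TRACE, msg, args, **kwargs)

    logging.Logger.trace = trace

    # Opt in explicitly only when debugging lockutils itself:
    # logging.getLogger('lockutils').setLevel(TRACE)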

Just a thought...

[1] One man's DEBUG is another man's garbage, ha.

On Sep 25, 2014, at 12:06 PM, Ben Nemec <openstack at nemebean.com> wrote:

> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> I've been spending a ton of time reading logs, and oslo locking ends up
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>> 
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Created new semaphore "iptables" internal_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:206
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Acquired semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Attempting to grab external lock "iptables" external_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:178
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got file lock "/opt/stack/data/nova/nova-iptables" acquire
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:93
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got semaphore / lock "_do_refresh_provider_fw_rules" inner
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:271
>> 2014-09-24 18:44:49.244 DEBUG nova.compute.manager
>> [req-b91cb1c1-f211-43ef-9714-651eeb3b2302
>> DeleteServersAdminTestXML-1408641898
>> DeleteServersAdminTestXML-469708524] [instance:
>> 98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
>> BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=<?>,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
>> _cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Released file lock "/opt/stack/data/nova/nova-iptables" release
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:115
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Releasing semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:238
>> 2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Semaphore / lock released "_do_refresh_provider_fw_rules" inner
>> 
>> Also readable here:
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>> 
>> (Yes, it's kind of ugly)
>> 
>> What occurred to me is that in debugging locking issues what we actually
>> care about is two things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
> 
> We did just merge https://review.openstack.org/#/c/122166/ which adds
> some contention/timing information to the log messages and should at
> least be a step toward what you're talking about.
> 
> For context, we had some bad logging that resulted in
> https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 .  That has
> been cleaned up to at least be accurate, but it did add an extra log
> message (created and acquired).  The reason we cared about that is we
> thought there might be a bad interaction between our code and eventlet,
> so we wanted to know whether we were in fact locking the same semaphore
> twice or mistakenly creating two separate ones (as it turns out, neither
> - it was just the bad logging I mentioned earlier).
> 
> So, given that I think everyone involved agrees that the double-locking
> thing was a cosmetic issue and not a functional one, we could probably
> just remove the created/using messages here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L202
> which would eliminate one message per lock operation without
> significantly impacting debuggability of our code.  Those messages only
> exist because we didn't trust what was going on at a lower level.
> 
> It would also be nice to reduce the duplication of the messages.  For
> example, we log that we acquired a semaphore here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L229
> and then log essentially the same information here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L270
> 
> I don't think we should completely remove the logging in either place
> because, depending on how the code is called, both can serve a useful,
> non-overlapping purpose.  But we could add a parameter to lock() that
> tells it whether it needs to log the acquire/release, and then when it's
> called from synchronized() we just set that to False since we know that
> information is already going to be logged.
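>
> Roughly what I have in mind, as a standalone sketch (the do_log knob and
> the simplified semaphore handling are illustrative, not the real
> lockutils code):
>
>     import contextlib
>     import functools
>     import logging
>     import threading
>
>     LOG = logging.getLogger(__name__)
>     _semaphores = {}
>
>     @contextlib.contextmanager
>     def lock(name, do_log=True):
>         # synchronized() passes do_log=False because it logs the same
>         # information itself.
>         sem = _semaphores.setdefault(name, threading.Semaphore())
>         with sem:
>             if do_log:
>                 LOG.debug('Acquired semaphore "%s"', name)
>             try:
>                 yield sem
>             finally:
>                 if do_log:
>                     LOG.debug('Releasing semaphore "%s"', name)
>
>     def synchronized(name):
>         def wrap(f):
>             @functools.wraps(f)
>             def inner(*args, **kwargs):
>                 with lock(name, do_log=False):
>                     LOG.debug('Got semaphore / lock "%s"', f.__name__)
>                     return f(*args, **kwargs)
>             return inner
>         return wrap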
> 
> Oh, I also see the logging is hosed for external_lock too.  We need to
> fix that to work like the internal locks do, possibly with additional
> log message de-duplication (although we're actually dealing with two
> separate locks there, so that might not be wise...).
> 
> In summary, my initial suggestions would be:
> 1) Remove the logging in internal_lock
> 2) Eliminate the duplicate logging when using the synchronized decorator
> 3) Fix the external_lock logging since it's useless and misleading right
> now.
> 
> I don't know that those cleanups would completely address your issues,
> but I think that along with the already-merged timing patch I linked
> above would get us a long way toward less sucky lockutils logging, and
> would do so with minimal performance impact.  If that proves to still be
> too much log spam, we could look into more drastic measures later of course.
> 
> Thoughts?
> 
> -Ben
> 
>> 
>> #2 turned out to be a critical bit in understanding one of the worst
>> recent gate-impacting issues.
>> 
>> You can write a tool today that analyzes the logs and shows you these
>> things. However, I wonder if we could actually do something creative in
>> the code itself to do this already. I'm curious if the creative use of
>> Timers might let us emit log messages under the conditions above
>> (someone with a better understanding of Python internals needs to speak
>> up here). Maybe it's too much overhead, but I think it's worth at least
>> asking the question.
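>>
>> Something like this, purely as a sketch of the Timer idea (the class
>> name and the 10s default are made up):
>>
>>     import logging
>>     import threading
>>
>>     LOG = logging.getLogger(__name__)
>>
>>     class LockWatchdog(object):
>>         """Warn if the wrapped block runs longer than timeout seconds."""
>>
>>         def __init__(self, name, timeout=10):
>>             self.timer = threading.Timer(
>>                 timeout, LOG.warning,
>>                 ['Lock "%s" still held after %ss', name, timeout])
>>
>>         def __enter__(self):
>>             self.timer.start()
>>             return self
>>
>>         def __exit__(self, *exc):
>>             self.timer.cancel()
>>
>> Wrapping the existing acquire/release in something like that would give
>> us #2 for free, at the cost of one extra thread per held lock.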
>> 
>> The same issue exists for processutils, I think: warning that a command
>> is still running after 10s might be really handy, because it turns out
>> that issue #2 was caused by this, and it took quite a bit of decoding to
>> figure that out.
>> 
>> 	-Sean
>> 