[openstack-dev] [oslo] logging around oslo lockutils

Ben Nemec openstack at nemebean.com
Thu Sep 25 19:06:43 UTC 2014


On 09/25/2014 07:49 AM, Sean Dague wrote:
> I've been spending a ton of time reading logs, and oslo locking ends up
> creating a ton of output at DEBUG that you have to mentally filter to
> find problems:
> 
> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
> Created new semaphore "iptables" internal_lock
> /opt/stack/new/nova/nova/openstack/common/lockutils.py:206
> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
> Acquired semaphore "iptables" lock
> /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
> Attempting to grab external lock "iptables" external_lock
> /opt/stack/new/nova/nova/openstack/common/lockutils.py:178
> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
> Got file lock "/opt/stack/data/nova/nova-iptables" acquire
> /opt/stack/new/nova/nova/openstack/common/lockutils.py:93
> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
> Got semaphore / lock "_do_refresh_provider_fw_rules" inner
> /opt/stack/new/nova/nova/openstack/common/lockutils.py:271
> 2014-09-24 18:44:49.244 DEBUG nova.compute.manager
> [req-b91cb1c1-f211-43ef-9714-651eeb3b2302
> DeleteServersAdminTestXML-1408641898
> DeleteServersAdminTestXML-469708524] [instance:
> 98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
> BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=<?>,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
> _cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
> Released file lock "/opt/stack/data/nova/nova-iptables" release
> /opt/stack/new/nova/nova/openstack/common/lockutils.py:115
> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
> Releasing semaphore "iptables" lock
> /opt/stack/new/nova/nova/openstack/common/lockutils.py:238
> 2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
> Semaphore / lock released "_do_refresh_provider_fw_rules" inner
> 
> Also readable here:
> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
> 
> (Yes, it's kind of ugly)
> 
> What occurred to me is that, when debugging locking issues, what we
> actually care about semantically is two things:
> 
> #1 - we tried to get a lock, but someone else has it. Then we know we've
> got lock contention.
> #2 - something is still holding a lock after some "long" amount of time.

We did just merge https://review.openstack.org/#/c/122166/, which adds
some contention/timing information to the log messages and should at
least be a step toward what you're talking about.
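
To make Sean's case #1 concrete, the check itself is cheap to do inline:
try a non-blocking acquire first, and if that fails you know the lock is
contended before you block on it.  A minimal sketch, assuming a plain
threading.Semaphore and a hypothetical helper name (this is not what the
patch above actually does):

    import logging
    import threading

    LOG = logging.getLogger(__name__)


    def acquire_logging_contention(sem, name):
        # A failed non-blocking acquire means someone else holds the
        # lock right now -- that's the contention signal we care about.
        if not sem.acquire(blocking=False):
            LOG.debug('Lock "%s" is contended, waiting...', name)
            sem.acquire()
        LOG.debug('Acquired lock "%s"', name)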

For context, we had some bad logging that resulted in
https://bugs.launchpad.net/oslo.concurrency/+bug/1367941.  That has
been cleaned up to at least be accurate, but it did add an extra log
message (created and acquired).  The reason we cared about that is that
we thought there might be a bad interaction between our code and
eventlet, so we wanted to know whether we were in fact locking the same
semaphore twice or mistakenly creating two separate ones (as it turns
out, neither - it was just the bad logging I mentioned earlier).

So, given that I think everyone involved agrees the double-locking thing
was a cosmetic issue and not a functional one, we could probably just
remove the created/using messages here:
https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L202
That would eliminate one message per lock operation without
significantly impacting the debuggability of our code.  Those messages
only exist because we didn't trust what was going on at a lower level.

It would also be nice to reduce the duplication of the messages.  For
example, we log that we acquired a semaphore here:
https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L229
and then log essentially the same information here:
https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L270

I don't think we should completely remove the logging in either place,
because depending on how the code is called each can serve a useful,
non-overlapping purpose.  However, we could add a parameter to lock()
that tells it whether it needs to log the acquire/release, and then have
synchronized() set that to False, since we know that information is
already going to be logged.
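
Roughly, the shape of that change might look like the following.  This
is just a sketch with hypothetical names (a 'log_acquire' flag and
heavily simplified lock()/synchronized() bodies), not the actual
oslo-incubator code:

    import contextlib
    import functools
    import logging
    import threading

    LOG = logging.getLogger(__name__)

    _semaphores = {}


    @contextlib.contextmanager
    def lock(name, log_acquire=True):
        # 'log_acquire' is the hypothetical new parameter: callers that
        # do their own logging can pass False so the same acquire and
        # release aren't reported twice.
        sem = _semaphores.setdefault(name, threading.Semaphore())
        with sem:
            if log_acquire:
                LOG.debug('Acquired semaphore "%s"', name)
            try:
                yield sem
            finally:
                if log_acquire:
                    LOG.debug('Releasing semaphore "%s"', name)


    def synchronized(name):
        def wrap(f):
            @functools.wraps(f)
            def inner(*args, **kwargs):
                # synchronized() logs once itself, so it tells lock()
                # to stay quiet.
                with lock(name, log_acquire=False):
                    LOG.debug('Got semaphore / lock "%s"', f.__name__)
                    try:
                        return f(*args, **kwargs)
                    finally:
                        LOG.debug('Semaphore / lock released "%s"',
                                  f.__name__)
            return inner
        return wrap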

Oh, I also see the logging is hosed for external_lock too.  We need to
fix that to work like the internal locks do, possibly with additional
log message de-duplication (although we're actually dealing with two
separate locks there, so that might not be wise...).

In summary, my initial suggestions would be:
1) Remove the logging in internal_lock.
2) Eliminate the duplicate logging when using the synchronized decorator.
3) Fix the external_lock logging, since it's useless and misleading right
now.

I don't know that those cleanups would completely address your issues,
but I think that, along with the already-merged timing patch I linked
above, they would get us a long way toward less sucky lockutils logging,
and would do so with minimal performance impact.  If that still proves
to be too much log spam, we could of course look into more drastic
measures later.

Thoughts?

-Ben

> 
> #2 turned out to be a critical bit in understanding one of the worst
> recent gate-impacting issues.
> 
> You can write a tool today that analyzes the logs and shows you these
> things. However, I wonder if we could actually do something creative in
> the code itself to do this already. I'm curious whether creative use of
> Timers might let us emit log messages under the conditions above
> (someone with a better understanding of Python internals needs to speak
> up here). Maybe it's too much overhead, but I think it's worth at least
> asking the question.
> 
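
A Timer-based watchdog along those lines doesn't need anything exotic:
arm a timer when the lock is acquired and cancel it on release, so the
warning only fires if the lock is still held past the threshold.  A
minimal sketch, assuming plain threading.Timer and an arbitrary 30s
threshold (under eventlet this would need a green-thread-friendly
equivalent):

    import contextlib
    import logging
    import threading
    import time

    LOG = logging.getLogger(__name__)

    HELD_TOO_LONG = 30  # seconds; illustrative threshold only


    @contextlib.contextmanager
    def watched_lock(sem, name):
        with sem:
            started = time.time()

            def _warn_still_held():
                # Only runs if cancel() below hasn't fired yet, i.e. the
                # lock has been held longer than the threshold.
                LOG.warning('Lock "%s" still held after %.0fs',
                            name, time.time() - started)

            timer = threading.Timer(HELD_TOO_LONG, _warn_still_held)
            timer.daemon = True
            timer.start()
            try:
                yield
            finally:
                timer.cancel()

The same pattern would cover the processutils case below: start a timer
before the subprocess call and cancel it when the command returns, so a
long-running command gets flagged without any after-the-fact log analysis.
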
> The same issue exists for processutils, I think: warning that a command
> is still running after 10s might be really handy, because it turns out
> that issue #2 was caused by this, and it took quite a bit of decoding
> to figure that out.
> 
> 	-Sean
> 



