[Openstack] Locking problems with NFS v3 and shared /var/lib/nova/instances
Antonio Messina
antonio.s.messina at gmail.com
Tue Feb 4 09:23:01 UTC 2014
Hi all,
I am trying to configure a few compute nodes (currently Folsom) with a
shared NFS filesystem to store the virtual machine images. The idea is
to enable live migration, but we also need the shared storage because
some of the compute nodes have very small local disks.
The NFS server we are using is an appliance, though it is actually
based on CentOS 6.2. However, I only have access to a rather limited
web interface, so I cannot directly edit /etc/exports or check the
logs, and it only supports NFS v3.
The problem I am facing is that when I try to spawn multiple instances
on different nodes, all accessing the same NFS filesystem, many of
them will fail with the following error::
    Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 756, in _spawn
        block_device_info)
      File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 117, in wrapped
        temp_level, payload)
      File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
        self.gen.next()
      File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 92, in wrapped
        return f(*args, **kw)
      File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1099, in spawn
        admin_pass=admin_password)
      File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 1365, in _create_image
        project_id=instance['project_id'])
      File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/imagebackend.py", line 131, in cache
        *args, **kwargs)
      File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/imagebackend.py", line 178, in create_image
        prepare_template(target=base, *args, **kwargs)
      File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 791, in inner
        with lock:
      File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 664, in __enter__
        self.trylock()
      File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 699, in trylock
        fcntl.lockf(self.lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
    IOError: [Errno 37] No locks available
It seems that before downloading a new base image from the Glance
store, nova-compute creates a file in
``/var/lib/nova/instances/locks`` named after the ID of the image, and
then calls ``fcntl()`` on it to ensure that only one nova-compute
process is actually downloading the image and writing to
``/var/lib/nova/instances/_base``.
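To make sure I understand it correctly, here is a simplified sketch of
that locking step (not the actual nova code; the lock file name and
paths are only illustrative)::

    import fcntl
    import os

    LOCK_DIR = '/var/lib/nova/instances/locks'   # shared over NFS in this setup

    def try_image_lock(image_id):
        """Take an exclusive, non-blocking fcntl() lock on a per-image lock file.

        Returns the open file object on success and raises IOError otherwise.
        On an NFS mount this goes through the client/server lock manager
        (lockd/statd), which is where errno 37 (ENOLCK) can come from.
        """
        path = os.path.join(LOCK_DIR, 'nova-%s' % image_id)
        lockfile = open(path, 'w')
        try:
            fcntl.lockf(lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except IOError:
            lockfile.close()
            raise
        return lockfile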
However, locking on NFS v3 is quite unreliable, especially when many
``fcntl()`` calls are issued at the same time, which is what seems to
happen in ``/usr/lib/python2.7/dist-packages/nova/utils.py:670``::
    while True:
        try:
            # Using non-blocking locks since green threads are not
            # patched to deal with blocking locking calls.
            # Also upon reading the MSDN docs for locking(), it seems
            # to have a laughable 10 attempts "blocking" mechanism.
            self.trylock()
            return self
        except IOError, e:
            if e.errno in (errno.EACCES, errno.EAGAIN):
                # external locks synchronise things like iptables
                # updates - give it some time to prevent busy spinning
                time.sleep(0.01)
            else:
                raise
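If I read this correctly, the error from the traceback is not even
retried: errno 37 is ``ENOLCK``, which is not in the ``(EACCES,
EAGAIN)`` tuple, so the loop re-raises it immediately instead of
sleeping and retrying. A quick check (on Linux)::

    import errno

    # Errno 37 is ENOLCK ("No locks available"); the NFS client typically
    # returns it when it cannot obtain the lock through the lock manager
    # (lockd/statd), e.g. because it is unreachable or overloaded.
    assert errno.ENOLCK == 37
    print(errno.ENOLCK in (errno.EACCES, errno.EAGAIN))  # -> False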
The only solutions I can think of are:
* use NFS v4 instead of v3 (but I would like to keep the storage
appliance we currently have, which doesn't support v4)
* mount an NFS v4 filesystem (or any other filesystem with decent
locking) on /var/lib/nova/instances/locks; a quick lock test is
sketched after this list
* use a different shared filesystem (GlusterFS?)
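For what it's worth, before committing to one of these options, a
quick way to check whether ``fcntl()`` locks work at all on a given
mount is a small test along these lines (the path is just a
placeholder)::

    import fcntl
    import sys

    # Usage: python locktest.py /var/lib/nova/instances/locks/locktest
    path = sys.argv[1]
    with open(path, 'w') as f:
        # Exclusive, non-blocking lock: raises IOError on NFS if the lock
        # manager cannot grant it (e.g. errno 37 / ENOLCK).
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print('lock acquired on %s' % path)
        fcntl.lockf(f, fcntl.LOCK_UN)
    print('lock released')

Running several of these at the same time from different compute nodes
against the same directory should show quickly whether the NFS lock
manager is the bottleneck.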
So I am asking those of you who have already deployed OpenStack with a
shared filesystem for /var/lib/nova/instances: did you face the same
problems, and how did you solve them?
Thank you
Antonio Messina
--
antonio.s.messina at gmail.com
antonio.messina at uzh.ch +41 (0)44 635 42 22
GC3: Grid Computing Competence Center http://www.gc3.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland