[openstack-dev] [nova][vmware] Locking: A can of worms

Vui Chiap Lam vuichiap at vmware.com
Fri Mar 7 21:03:36 UTC 2014

Thanks Matt, good points raised here.

I think at minimum we should address the single-node case by @synchronized-ing
all access to the image cache data on the datastore, in a way that is fairly
easy to reason about. Spawn failures are undesirable but acceptable in edge
cases, provided a subsequent retry on the same or a different node can succeed.
What we absolutely need to avoid is a partially deleted image cache directory
rendering future spawns impossible.
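For the single-node case, serializing on a per-image lock is enough to remove the ager/spawn race. A minimal sketch of the @synchronized pattern (the decorator, function names, and lock granularity below are all illustrative; Nova's own lockutils provides the real mechanism):

```python
import functools
import threading

# One process-wide lock per resource name; all operations on a node that
# share the name serialize. This mimics the @synchronized pattern that
# Nova's lockutils provides; everything here is illustrative.
_locks = {}
_locks_guard = threading.Lock()

def synchronized(name):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with _locks_guard:
                lock = _locks.setdefault(name, threading.Lock())
            with lock:  # only one thread works on this cache entry at a time
                return fn(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical cache operations: the ager and spawn() take the same lock,
# so "check then delete" can no longer interleave with "check then use".
@synchronized('image-cache-IMAGE_ID')
def age_cache_entry(log):
    log.append('aged')
    return log

@synchronized('image-cache-IMAGE_ID')
def use_cache_entry_for_spawn(log):
    log.append('spawned')
    return log
```

The key property is that both the ager and spawn() agree on the lock name per cache entry; holding the lock for the whole check-then-act sequence is what closes the race.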

There are some inelegant ways to work around the multi-node scenarios if one
really has to (e.g. dedicating a different cache location to each node), but
I think the multi-node case should be addressed properly at some point too.
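As an example of that workaround, a per-node cache location could be derived from the node's hostname; the folder naming below is purely hypothetical, not the driver's actual layout:

```python
import hashlib

# Illustrative workaround only: give each Nova node a private cache folder on
# the shared datastore, so nodes never touch each other's cache entries. This
# trades cache sharing for isolation; the naming scheme is an assumption.
def node_cache_folder(datastore_base, node_hostname):
    suffix = hashlib.sha1(node_hostname.encode('utf-8')).hexdigest()[:8]
    return '%s/vmware_base_%s' % (datastore_base, suffix)
```

Two nodes sharing a datastore would then age and populate disjoint directories, at the cost of duplicating cached images once per node.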

There are distributed locking provisions in a vSphere datastore for files that
are in use (you cannot delete a disk attached to a running VM, say). The image
cache as implemented, however, is a collection of files on a datastore, many of
which may not be in use by any process at a given time. Either that picture
changes or we need to adopt some other means of achieving distributed locking.
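One such means, sketched below, is advisory locking via atomic lock-file creation next to the resource being protected. Whether file creation on a vSphere datastore has the atomicity this relies on (as O_CREAT|O_EXCL does on a local filesystem) is an assumption that would need verifying; the function names are hypothetical:

```python
import errno
import os
import time

# Hedged sketch: mediate between nodes using only the shared datastore
# itself, by atomically creating a lock file beside the protected directory.
# Assumes exclusive-create is atomic on the underlying storage.
def acquire_dir_lock(path, timeout=30.0, interval=1.0):
    lock_path = path + '.lock'
    deadline = time.time() + timeout
    while True:
        try:
            # O_EXCL makes this fail if the lock file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return lock_path
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            if time.time() >= deadline:
                raise RuntimeError('timed out waiting for %s' % lock_path)
            time.sleep(interval)

def release_dir_lock(lock_path):
    os.unlink(lock_path)
```

A scheme like this also needs a story for stale locks left behind by a crashed node (e.g. lease timestamps in the lock file), which is where a purpose-built service such as ZooKeeper starts to look attractive.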


----- Original Message -----
| From: "Matthew Booth" <mbooth at redhat.com>
| To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
| Sent: Friday, March 7, 2014 8:53:26 AM
| Subject: [openstack-dev] [nova][vmware] Locking: A can of worms
| We need locking in the VMware driver. There are 2 questions:
| 1. How much locking do we need?
| 2. Do we need single-node or multi-node locking?
| I believe these are quite separate issues, so I'm going to try not to
| confuse them. I'm going to deal with the first question first.
| In reviewing the image cache ageing patch, I came across a race
| condition between cache ageing and spawn(). One example of the race is:
| Cache Ageing                spawn()
| * Check timestamps
|                             * Delete timestamp
|                             * Check for image cache directory
| * Delete directory
|                             * Use image cache directory
| This will cause spawn() to explode. There are various permutations of
| this. For example, the following are all possible:
| * A simple failure of spawn() with no additional consequences.
| * Calling volumeops.attach_disk_to_vm() with a vmdk_path that doesn't
| exist. It's not 100% clear to me that ReconfigVM_Task will throw an
| error in this case, btw, which would probably be bad.
| * Leaving a partially deleted image cache directory which doesn't
| contain the base image. This would be really bad.
| The last comes about because recursive directory delete isn't atomic,
| and may partially succeed, which is a tricky problem. However, in
| discussion, Gary also pointed out that directory moves are not atomic
| (see MoveDatastoreFile_Task). This is positively nasty. We already knew
| that spawn() races with itself to create an image cache directory, and
| we've hit this problem in practice. We haven't fixed the race, but we do
| manage it. The management relies on the atomicity of a directory move.
| Unfortunately it isn't atomic, which presents the potential problem of
| spawn() attempting to use an incomplete image cache directory. We also
| have the problem of 2 spawns using a linked clone image racing to create
| the same resized copy.
| We could go through all of the above very carefully to assure ourselves
| that we've found all the possible failure paths, and that in every case
| the problems are manageable and documented. However, I would place a
| good bet that the above is far from a complete list, and we would have
| to revisit it in its entirety every time we touched any affected code.
| And that would be a lot of code.
| We need something to manage concurrent access to resources. In all of
| the above cases, if we make the rule that everything which uses an image
| cache directory must hold a lock on it whilst using it, all of the above
| problems go away. Reasoning about their correctness becomes the
| comparatively simple matter of ensuring that the lock is used correctly.
| Note that we need locking in both the single and multi node cases,
| because even single node is multi-threaded.
| The next question is whether that locking needs to be single node or
| multi node. Specifically, do we currently, or do we plan to, allow an
| architecture where multiple Nova nodes access the same datastore
| concurrently. If we do, then we need to find a distributed locking
| solution. Ideally this would use the datastore itself for lock
| mediation. Failing that, apparently this tool is used elsewhere within
| the project:
| http://zookeeper.apache.org/doc/trunk/zookeeperOver.html
| That would be an added layer of architecture and deployment complexity,
| but if we need it, it's there.
| If we can confidently say that 2 Nova instances should never be
| accessing the same datastore (how about hot/warm/cold failover?), we can
| use Nova's internal synchronisation tools. This would simplify matters
| greatly!
| I think this is one of those areas which is going to improve both the
| quality of the driver, and the confidence of reviewers to merge changes.
| Right now it takes a lot of brain cycles to work through all the various
| paths of a race to work out if any of them are really bad, and it has to
| be repeated every time you touch the code. A little up-front effort will
| make a whole class of problems go away.
| Matt
| --
| Matthew Booth, RHCA, RHCSS
| Red Hat Engineering, Virtualisation Team
| GPG ID:  D33C3490
| GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
| _______________________________________________
| OpenStack-dev mailing list
| OpenStack-dev at lists.openstack.org
| http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev