[openstack-dev] Libvirt Resize/Cold Migrations and SSH

Solly Ross sross at redhat.com
Tue Feb 25 21:05:05 UTC 2014


Hi All,
I've been working on and thinking about a bug filed a while ago related to libvirt resize/cold migrations.  The bug was roughly as follows:

On a Packstack install, cold migrations and resizes fail under the default setup with an error about not being able to do an SSH `mkdir` operation.
The cause turned out to be that Nova could not perform the resize because the individual compute nodes didn't have passwordless (key-based) SSH access
to the other compute nodes.

The proposed temporary fix was to manually give the compute nodes SSH access to each other, with the medium-term
fix being to have Packstack distribute SSH keys among the compute nodes and set up permissions.

While these fixes work, they left me with a certain dirty taste in my mouth, since it doesn't seem quite elegant to have Nova SSH-ing around
between compute nodes, and the upstream community seemed to agree with this (there was a thread a while ago, but I got sidetracked with other
work).  Upon further investigation, I found four points at which the Nova libvirt driver uses SSH, all of which revolve around the method
`migrate_disk_and_power_off` (the main part of the resize/cold migration code):

1. to detect shared storage
2. to create the directory for the instance on the destination system
3. to copy the disk image from the source to the destination system (uses either rsync over ssh or scp)
4. to remove the directory created in (2) in case of an error during the process

Number 1 can be trivially eliminated by using the existing '_is_instance_storage_shared' method in the RPCAPI from the compute manager, and passing that value to the driver (with the other drivers
most likely ignoring it) instead of checking from within the driver code.  Numbers 2 and 4 can be eliminated by using a "pre_x, x, cleanup_x" flow, similar to how live migrations are handled (with
"pre_x" and "cleanup_x" being run on the destination machine via the RPCAPI).  That only leaves number 3.  Note that numbers 2-4 only come into play when we are going between machines without
shared storage; with shared storage, they are not needed at all.
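
To make the ordering concrete, here is a minimal sketch of the "pre_x, x, cleanup_x" flow. Everything in it is hypothetical: `DestinationAPI`, `pre_resize`, `cleanup_resize`, `migrate_disk` and `copy_disk` are illustrative names, not existing Nova methods, and the class only records calls where the real thing would go over the RPCAPI.

```python
class DestinationAPI:
    """Hypothetical stand-in for RPC calls that run on the destination host."""

    def __init__(self):
        self.calls = []

    def pre_resize(self, instance_dir):
        # Would take the place of the SSH `mkdir` (number 2).
        self.calls.append(("pre_resize", instance_dir))

    def cleanup_resize(self, instance_dir):
        # Would take the place of the SSH directory removal (number 4).
        self.calls.append(("cleanup_resize", instance_dir))


def migrate_disk(dest_api, instance_dir, shared_storage, copy_disk):
    """Sketch of the ordering only; `copy_disk` is whatever performs number 3.

    `shared_storage` is assumed to be computed by the caller (number 1)
    and passed in, rather than detected here.
    """
    if shared_storage:
        # Numbers 2-4 do not apply when the storage is shared.
        return
    dest_api.pre_resize(instance_dir)
    try:
        copy_disk(instance_dir)
    except Exception:
        # Undo the destination-side setup, then let the error propagate.
        dest_api.cleanup_resize(instance_dir)
        raise
```

The point of the sketch is that the driver itself never needs a shell on the destination for numbers 2 and 4; only the copy step is left to decide.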

So here's my question: can number 3 be "eliminated", so to speak?  Having to give full SSH permissions for a file copy seems like overkill (we could, for example, run an rsync daemon, in which case
rsync would connect via the daemon and not ssh).  Is it worth it?  Additionally, if we do not eliminate number 3, is it worth it to refactor the code to eliminate numbers 2 and 4?  (I already have code
to eliminate number 1 -- see https://gist.github.com/DirectXMan12/9217699.)
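
For illustration, the difference between the two rsync transports comes down to the destination syntax. The sketch below only builds the command line and does not run anything; the function name, the `instances` module name and the paths are assumptions for the example, and a daemon setup would need a matching module configured on the destination.

```python
def rsync_copy_command(src, dest_host, dest_path, module=None):
    """Build an rsync argv for copying `src` to `dest_host`.

    If `module` is given, target an rsync daemon (host::module/path);
    otherwise use rsync over ssh (host:path).
    """
    cmd = ["rsync", "--archive"]
    if module is not None:
        # A double colon tells rsync to talk to the daemon; no ssh involved.
        target = "%s::%s/%s" % (dest_host, module, dest_path.lstrip("/"))
    else:
        # A single colon uses a remote shell, here explicitly ssh.
        cmd.append("--rsh=ssh")
        target = "%s:%s" % (dest_host, dest_path)
    return cmd + [src, target]


# Hypothetical host, module and paths, for illustration only.
print(rsync_copy_command("/srv/inst/disk", "compute2", "/srv/inst/disk"))
print(rsync_copy_command("/srv/inst/disk", "compute2", "inst/disk", module="instances"))
```

With the daemon form, the destination only has to expose one module rather than a full login shell, which is the trade-off the question above is about.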

Best Regards,
Solly Ross


