Open Stack

Thu Apr 27 10:39:47 UTC 2017

We've encountered a bug in resize which resulted in data loss. The gist is
that a user was resizing a qcow2 instance whose image had been deleted from
glance. In driver.finish_migration on the destination host, an error
occurred while attempting to copy the image from the source host's image
cache, putting the instance into an error state. Note that instance.host had
already been set to the destination host before finish_migration ran. When
the image cache cleanup ran on the source host, the instance was no longer
in the list of expected instances on that host because instance.host ==
dest. The image cache manager expired the image from the cache, and there
was no other copy of the image.

Let's ignore the root cause of the side-loading error, because that's the
type of transient error which can always occur. I'm looking for a way to
avoid deleting the image from the image cache in future until the resize
operation has completed. The obvious way to do this is to update the
instance list generated in ComputeManager._run_image_cache_manager_pass to
consider not only instances where instance.host is in the node list, but
also any instance with a migration record where source/dest is in the node
list.
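To make the idea concrete, here is a rough sketch of the selection logic
using stand-in classes rather than the real Nova objects. The names
Instance, Migration and instances_to_protect are hypothetical, and the
source_compute/dest_compute fields are assumed here for illustration. Note
that it counts every migration record regardless of status:

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Stand-in for a Nova instance; only the fields this sketch needs.
    uuid: str
    host: str


@dataclass
class Migration:
    # Stand-in for a migration record (field names are assumptions).
    instance_uuid: str
    source_compute: str
    dest_compute: str
    status: str


def instances_to_protect(instances, migrations, nodes):
    """Return instances whose cached images must not be expired on `nodes`.

    An instance qualifies if instance.host is in `nodes`, or if any
    migration record for it names one of `nodes` as source or destination.
    """
    nodes = set(nodes)
    by_uuid = {inst.uuid: inst for inst in instances}

    # Current behaviour: only instances that live on these nodes.
    keep = {inst.uuid for inst in instances if inst.host in nodes}

    # Proposed addition: instances migrating to or from these nodes.
    for mig in migrations:
        if mig.instance_uuid not in by_uuid:
            continue
        if mig.source_compute in nodes or mig.dest_compute in nodes:
            keep.add(mig.instance_uuid)

    return [by_uuid[uuid] for uuid in sorted(keep)]
```

With an instance whose host is already "dest" and a migration from "src"
to "dest", a pass on "src" would still include the instance, so its image
would stay in the cache.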

The problem with this is that the data model doesn't seem to allow us to
fetch the currently active migration. Following the error above, the
errors_out_migration decorator on finish_resize has set the migration to an
error state. AFAICT this is never deleted, so the presence of a migration
in an error state only means that a migration involving this instance has
occurred in the past. It doesn't mean that it's currently relevant, so it's
basically meaningless.

Firstly, have I missed any semantics of the migration record which might
allow me to unambiguously identify currently relevant migrations,
whether in an error state or otherwise? That would be ideal, and I'd just
go with that.

If not, how about adding an active migration field to instance? I don't
think it would ever make sense to have more than 1 current migration for a
given instance. It would be set back to NULL when the migration was
complete, and we'd at least have an opportunity to do something explicit
with migrations in an error state.
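As a toy illustration of what I mean, and not a proposal for the actual
schema, the following models the field on stand-in classes. All names here
(active_migration, start_migration, fail_migration, hosts_needing_image) are
hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Migration:
    # Stand-in for a migration record (field names are assumptions).
    id: int
    source_compute: str
    dest_compute: str
    status: str = "migrating"


@dataclass
class Instance:
    uuid: str
    host: str
    # Hypothetical new field: None means no migration is in flight.
    active_migration: Optional[Migration] = None


def start_migration(instance, migration):
    # At most one current migration per instance.
    if instance.active_migration is not None:
        raise RuntimeError("instance already has an active migration")
    instance.active_migration = migration


def finish_migration(instance):
    # Success: mark the record finished and clear the pointer.
    instance.active_migration.status = "finished"
    instance.active_migration = None


def fail_migration(instance):
    # Failure: the record goes to error but stays attached, so it is
    # still identifiable as the currently relevant migration.
    instance.active_migration.status = "error"


def hosts_needing_image(instance):
    """Hosts whose image cache must keep this instance's image."""
    hosts = {instance.host}
    mig = instance.active_migration
    if mig is not None:
        hosts.update({mig.source_compute, mig.dest_compute})
    return hosts
```

In this model a failed resize keeps both source and destination in
hosts_needing_image until something explicitly resolves the migration,
whereas old error-state records for other migrations are never consulted.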

In the meantime I'm going to look for more backportable avenues to fix
this. Perhaps not updating instance.host until after finish_migration.

Matt
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)

[openstack-dev] Use data loss bug on error during resize (migration datamodel issue)