[openstack-dev] Libvirt snapshot process optimization

Rafi Khardalian rafi at metacloud.com
Fri Aug 31 18:58:07 UTC 2012


Any thoughts on this?


On Tue, Aug 28, 2012 at 2:09 AM, Rafi Khardalian <rafi at metacloud.com> wrote:

> I had a couple different ideas on how to approach this given the
> constraints:
>
> 1. Keep the snapshots around until the next operation which restarts the
> VM process.  This code would run prior to (re)starting a Qemu/KVM process
> such as hard reboot, resume on host boot, another snapshot, etc.  It would
> look at all existing snapshots in a given qcow2 file and remove them prior
> to executing said operation.  This would limit the number of snapshots in
> the image itself to 1, which I would take as a fair trade-off to make
> snapshots only minimally disruptive.
>
> 2. Suspend the VM twice during the snapshot process.  Once to take the
> snapshot and a second time to delete it.  This is a simple change and I
> already have the code done but think #1 would be a cleaner approach.
>
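
For what it's worth, here is a minimal sketch of the cleanup step in option 1,
not actual driver code. The helper names (parse_snapshot_tags, purge_commands,
purge_snapshots) are made up for illustration, and the parser assumes the
usual "qemu-img snapshot -l" table layout with the tag in the second column
(tags containing spaces are not handled). It must only run while no QEMU
process has the file open, i.e. right before the process is (re)started.

```python
import subprocess


def parse_snapshot_tags(listing):
    """Return snapshot tags found in `qemu-img snapshot -l` output.

    Assumption: data rows start with a numeric snapshot ID and carry the
    tag in the second whitespace-separated column.
    """
    tags = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip the "Snapshot list:" line and the "ID TAG ..." header row.
        if len(fields) >= 2 and fields[0].isdigit():
            tags.append(fields[1])
    return tags


def purge_commands(disk_path, listing):
    """Build one `qemu-img snapshot -d` command per leftover snapshot."""
    return [["qemu-img", "snapshot", "-d", tag, disk_path]
            for tag in parse_snapshot_tags(listing)]


def purge_snapshots(disk_path):
    """Delete every internal snapshot in disk_path.

    Hypothetical hook: call only when the guest's QEMU process is not
    running, just before hard reboot, resume on host boot, etc.
    """
    listing = subprocess.check_output(
        ["qemu-img", "snapshot", "-l", disk_path]).decode()
    for cmd in purge_commands(disk_path, listing):
        subprocess.check_call(cmd)
```

Splitting the parsing from the subprocess calls keeps the parsing testable
without a real qcow2 file.
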
> Or, we could support both via configuration options.  I looked at the
> possibility of executing snapshots via libvirt but realized it only
> recently became available and the documentation around the API hook is a
> bit sparse.  So it is probably not a feasible solution at this time, mainly
> due to the former.
>
> If anyone else has another idea, that would be great to hear as well.
>
> Rafi
>
>
>
> On Thursday, August 23, 2012, Vishvananda Ishaya wrote:
>
>> We discussed this in the mailing list in the past
>>
>> quoting daniel:
>>
>> > a) is it safe to use qemu-img to create/delete a snapshot in a disk
>> > file that libvirt is writing to.
>> > if not:
>> > b) is it safe to use qemu-img to delete a snapshot in a disk file that
>> > libvirt is writing to but not actively using.
>> > if not:
>> > c) is it safe to use qemu-img to create/delete a snapshot in a disk
>> > file that libvirt has an open file handle to.
>>
>> Sadly, the answer is no to all those questions. For Qcow2 files, using
>> internal snapshots, you cannot make *any* changes to the qcow2 file,
>> while QEMU has it open. The reasons are that QEMU may have metadata
>> changes pending to the file which have not yet flushed to disk, and
>> second, creating/deleting the snapshot with qemu-img may cause
>> metadata changes that QEMU won't be aware of. Either way you will likely
>> cause corruption of the qcow2 file.
>>
>> For these reasons, QEMU provides monitor commands for snapshotting,
>> that libvirt uses whenever the guest is running. Libvirt will only
>> use qemu-img if the guest is offline.
>>
>> Regards,
>> Daniel
>>
>>
>> So we unfortunately cannot delete the snapshot while the domain is
>> running. Unless we are willing to leave a bunch of old internal snapshots
>> in the file then we have to deal with this performance hit.
>> Vish
>>
>> On Aug 23, 2012, at 5:27 PM, Rafi Khardalian <rafi at metacloud.com> wrote:
>>
>> > Assuming there are reasons for keeping suspend part of the snapshot
>> > process, the flow can be optimized to reduce the impact to running VMs.
>> > This is done by resuming immediately after the "qemu-img snapshot"
>> > operation (libvirt_utils.create_snapshot), rather than waiting until the
>> > "qemu-img convert" process (libvirt_utils.extract_snapshot) also
>> > completes.  I've been unable to find a reason for waiting until the
>> > convert is done.
>> >
>> > Modified snapshot() snippet from the libvirt driver, representing
>> > the change I'm proposing:
>> >
>> >        # Make the snapshot
>> >        try:
>> >            libvirt_utils.create_snapshot(disk_path, snapshot_name)
>> >        finally:
>> >            if state == power_state.RUNNING:
>> >                self._create_new_domain(xml_desc)
>> >
>> >        # Export the snapshot to a raw image
>> >        with utils.tempdir() as tmpdir:
>> >            try:
>> >                out_path = os.path.join(tmpdir, snapshot_name)
>> >                libvirt_utils.extract_snapshot(disk_path, source_format,
>> >                                               snapshot_name, out_path,
>> >                                               image_format)
>> >            finally:
>> >                libvirt_utils.delete_snapshot(disk_path, snapshot_name)
>> >
>> > I agree it would be ideal if we could find a way to guarantee a
>> > consistent state in the guest VM, though I'm concerned about how users
>> > would respond to a full shutdown being forced upon them to take a
>> > snapshot.
>> >
>> > -----Original Message-----
>> > From: Joshua Harlow [mailto:harlowja at yahoo-inc.com]
>> > Sent: Thursday, August 23, 2012 5:13 PM
>> > To: OpenStack Development Mailing List; Rafi Khardalian
>> > Cc: openstack-dev
>> > Subject: Re: [openstack-dev] Libvirt snapshot process optimization
>> >
>> > I'd almost like to see the VM be shut down before snapshot, but that's
>> > just me.
>> >
>> > In fact just looking at the libvirt docs, 'suspend does not save a
>> > persistent image of the guest's memory. For this, save is used.' So that
>> > could leave guests in some weird state, so that sort of sucks. A
>> > shutdown could at least trigger ACPI shutdown to occur in the VM and
>> > would hopefully leave it in an OK state (emphasis on hopefully). I just
>> > think that reducing the amount of time is going to be hard unless there
>> > is hypervisor<->VM communication (i.e. signaling all the apps in the VM
>> > to stop) or libvirt (+others) persists the memory image.
>> >
>> > My guess is suspend is trying to do what it can, which won't be 100%
>> > right without memory state saving or some other communication
>> > happening... Perhaps a 'save' call (or shutdown sequence) should be
>> > used; this probably isn't any faster, but at least it would be
>> > 'correct' (shared storage state not included). There is also the
>> > question of uploading snapshots (but that's a different question).
>> >
>> > On 8/23/12 2:00 PM, "Rafi Khardalian" <rafi at metacloud.com> wrote:
>> >
>> >> Hi all,
>> >>
>> >> I'm looking at the libvirt snapshot code and was wondering about the
>> >> order and purpose of several operations.  At a high level, it looks
>> >> like the VM being snapshotted is first suspended (managedSave), the
>> >> actual qcow2 snapshot is taken, then extraction is done (qemu-img
>> >> convert) before returning the instance to its prior state.
>> >>
>> >> My question is, with snapshots being atomic, why suspend the VM?
>> >> Assuming there's a reason for this, why not do the qemu-img convert
>> >> call after the VM
>
>
>
> --
> ---
> Rafi Khardalian
> Vice President, Operations | Metacloud, Inc.
> Email: rafi at metacloud.com | Tel: 855-638-2256, Ext. 2662
>
>