[openstack-dev] [nova] ops meetup feedback

Sean Dague sean at dague.net
Tue Sep 20 15:36:29 UTC 2016


On 09/20/2016 11:20 AM, Daniel P. Berrange wrote:
> On Tue, Sep 20, 2016 at 11:01:23AM -0400, Sean Dague wrote:
>> On 09/20/2016 10:38 AM, Daniel P. Berrange wrote:
>>> On Tue, Sep 20, 2016 at 09:20:15AM -0400, Sean Dague wrote:
>>>> This is a bit delayed due to the release rush, finally getting back to
>>>> writing up my experiences at the Ops Meetup.
>>>>
>>>> Nova Feedback Session
>>>> =====================
>>>>
>>>> We had a double session for Feedback for Nova from Operators, raw
>>>> etherpad here - https://etherpad.openstack.org/p/NYC-ops-Nova.
>>>>
>>>> The median release people in the room were running was Kilo. Some were
>>>> upgrading to Liberty, and many had clouds older than Kilo. Remember that
>>>> these are the larger ops environments that are engaged enough with the
>>>> community to send people to the Ops Meetup.
>>>>
>>>>
>>>> Performance Bottlenecks
>>>> -----------------------
>>>>
>>>> * scheduling issues with Ironic - (this is a bug we got through during
>>>>   the week after the session)
>>>> * live snapshots actually end up being a performance issue for people
>>>>
>>>> The workarounds config group was not well known, and everyone in the
>>>> room wished we advertised that a bit more. The solution for snapshot
>>>> performance is in there.
>>>>
>>>> There were also general questions about what scale cells should be
>>>> considered at.
>>>>
>>>> ACTION: we should make sure workarounds are advertised better
>>>
>>> Workarounds ought to be something that admins are rarely, if
>>> ever, having to deal with.
>>>
>>> If the lack of live snapshot is such a major performance problem
>>> for ops, this tends to suggest that our default behaviour is wrong,
>>> rather than a need to publicise that operators should set this
>>> workaround.
>>>
>>> eg, instead of optimizing for the case of a broken live snapshot
>>> support by default, we should optimize for the case of working
>>> live snapshot by default. The broken live snapshot stuff was so
>>> rare that no one has ever reproduced it outside of the gate
>>> AFAIK.
>>>
>>> IOW, rather than hardcoding disable_libvirt_livesnapshot=True in
>>> nova, we should just set it in the gate CI configs, and leave it set
>>> to False in Nova, so operators get good performance out of the
>>> box.
>>>
>>> Also it has been a while since we added the workaround, and IIRC,
>>> we've got newer Ubuntu available on at least some of the gate
>>> hosts now, so we have the ability to test to see if it still
>>> hits newer Ubuntu. 
>>
>> Here is my reconstruction of the snapshot issue from what I can remember
>> of the conversation.
>>
>> Nova defaults to live snapshots. This uses the libvirt facility which
>> dumps both memory and disk. And then we throw away the memory. For large
>> memory guests (especially volume backed ones that might have a fast path
>> for the disk) this leads to a lot of overhead for no gain. The
>> workaround got them past it.
> 
> I think you've got it backwards there.
> 
> Nova defaults to *not* using live snapshots:
> 
>     cfg.BoolOpt(
>         'disable_libvirt_livesnapshot',
>         default=True,
>         help="""
> Disable live snapshots when using the libvirt driver.
> ...""")
> 
> 
> When live snapshot is disabled like this, the snapshot code is unable
> to guarantee a consistent disk state. So the libvirt nova driver will
> stop the guest by doing a managed save (this saves all memory to
> disk), then does the disk snapshot, then restores the managed save
> (which loads all memory from disk).
> 
> This is terrible for multiple reasons:
> 
>   1. the guest workload stops running while snapshot is taken
>   2. we churn disk I/O saving & loading VM memory
>   3. you can't do it at all if host PCI devices are attached to
>      the VM
> 
> Enabling live snapshots by default fixes all these problems, at the
> risk of hitting the live snapshot bug we saw in the gate CI but never
> anywhere else.

Ah, right. I'll propose inverting the default and we'll see if we can
get past the testing in the gate - https://review.openstack.org/#/c/373430/
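
For anyone following along, here is a rough sketch of what the option
controls, based only on Daniel's description above. It is not Nova code:
Nova reads its options through oslo.config, whereas this uses the stdlib
configparser as a stand-in, and the function names and the
DEFAULT_DISABLE_LIVESNAPSHOT constant are made up for illustration. Only
the option name and the workarounds group come from this thread.

```python
import configparser

# Hypothetical stand-in for the in-tree default quoted above
# (default=True); the proposal is to flip it to False.
DEFAULT_DISABLE_LIVESNAPSHOT = True


def live_snapshot_disabled(conf_text, default=DEFAULT_DISABLE_LIVESNAPSHOT):
    """Return the effective disable_libvirt_livesnapshot value.

    conf_text is the text of a nova.conf-style ini file. If the operator
    has not set the option, the supplied default wins.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getboolean(
        "workarounds", "disable_libvirt_livesnapshot", fallback=default)


def snapshot_steps(disabled):
    """List the steps described in the thread for each snapshot path."""
    if disabled:
        # Non-live path: the guest is stopped around the disk snapshot.
        return [
            "managed save (guest stops, memory written to disk)",
            "disk snapshot",
            "restore managed save (memory read back from disk)",
        ]
    # Live path: the guest keeps running.
    return ["live disk snapshot (guest keeps running)"]


if __name__ == "__main__":
    operator_conf = "[workarounds]\ndisable_libvirt_livesnapshot = False\n"
    for label, text in (("untouched config", ""),
                        ("operator override", operator_conf)):
        print(label, "->", snapshot_steps(live_snapshot_disabled(text)))
```

With the current default, an untouched config takes the three-step path;
an operator who sets the option to False in the workarounds group, or a
flipped default, takes the single-step path.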

	-Sean

-- 
Sean Dague
http://dague.net


