Open Stack

Tue Sep 20 16:16:52 UTC 2016

On Tue, Sep 20, 2016 at 11:36:29AM -0400, Sean Dague wrote:
> On 09/20/2016 11:20 AM, Daniel P. Berrange wrote:
> > On Tue, Sep 20, 2016 at 11:01:23AM -0400, Sean Dague wrote:
> >> On 09/20/2016 10:38 AM, Daniel P. Berrange wrote:
> >>> On Tue, Sep 20, 2016 at 09:20:15AM -0400, Sean Dague wrote:
> >>>> This is a bit delayed due to the release rush, finally getting back to
> >>>> writing up my experiences at the Ops Meetup.
> >>>>
> >>>> Nova Feedback Session
> >>>> =====================
> >>>>
> >>>> We had a double session for Feedback for Nova from Operators, raw
> >>>> etherpad here - https://etherpad.openstack.org/p/NYC-ops-Nova.
> >>>>
> >>>> The median release people were on in the room was Kilo. Some were
> >>>> upgrading to Liberty, many had older than Kilo clouds. Keep in mind
> >>>> that these are the larger ops environments, the ones engaged enough
> >>>> with the community to send people to the Ops Meetup.
> >>>>
> >>>>
> >>>> Performance Bottlenecks
> >>>> -----------------------
> >>>>
> >>>> * scheduling issues with Ironic - (this is a bug we got through during
> >>>>   the week after the session)
> >>>> * live snapshots actually end up being a performance issue for people
> >>>>
> >>>> The workarounds config group was not well known, and everyone in the
> >>>> room wished we advertised that a bit more. The solution for snapshot
> >>>> performance is in there.
> >>>>
> >>>> There were also general questions about what scale cells should be
> >>>> considered at.
> >>>>
> >>>> ACTION: we should make sure workarounds are advertised better
> >>>
> >>> Workarounds ought to be something that admins are rarely, if
> >>> ever, having to deal with.
> >>>
> >>> If the lack of live snapshot is such a major performance problem
> >>> for ops, this tends to suggest that our default behaviour is wrong,
> >>> rather than a need to publicise that operators should set this
> >>> workaround.
> >>>
> >>> eg, instead of optimizing for the case of a broken live snapshot
> >>> support by default, we should optimize for the case of working
> >>> live snapshot by default. The broken live snapshot stuff was so
> >>> rare that no one has ever reproduced it outside of the gate
> >>> AFAIK.
> >>>
> >>> IOW, rather than hardcoding disable_libvirt_livesnapshot=True in nova,
> >>> we should just set it in the gate CI configs, and leave it set
> >>> to False in Nova, so operators get good performance out of the
> >>> box.
> >>>
> >>> Also it has been a while since we added the workaround, and IIRC,
> >>> we've got newer Ubuntu available on at least some of the gate
> >>> hosts now, so we have the ability to test to see if it still
> >>> hits newer Ubuntu. 
> >>
> >> Here is my reconstruction of the snapshot issue from what I can remember
> >> of the conversation.
> >>
> >> Nova defaults to live snapshots. This uses the libvirt facility which
> >> dumps both memory and disk. And then we throw away the memory. For large
> >> memory guests (especially volume backed ones that might have a fast path
> >> for the disk) this leads to a lot of overhead for no gain. The
> >> workaround got them past it.
> > 
> > I think you've got it backwards there.
> > 
> > Nova defaults to *not* using live snapshots:
> > 
> >     cfg.BoolOpt(
> >         'disable_libvirt_livesnapshot',
> >         default=True,
> >         help="""
> > Disable live snapshots when using the libvirt driver.
> > ...""")
> > 
> > 
> > When live snapshot is disabled like this, the snapshot code is unable
> > to guarantee a consistent disk state. So the libvirt nova driver will
> > stop the guest by doing a managed save (this saves all memory to
> > disk), then do the disk snapshot, then restore the managed save
> > (which loads all memory from disk).
> > 
> > This is terrible for multiple reasons:
> > 
> >   1. the guest workload stops running while snapshot is taken
> >   2. we churn disk I/O saving & loading VM memory
> >   3. you can't do it at all if host PCI devices are attached to
> >      the VM
> > 
> > Enabling live snapshots by default fixes all these problems, at the
> > risk of hitting the live snapshot bug we saw in the gate CI but never
> > anywhere else.
> 
> Ah, right. I'll propose inverting the default and we'll see if we can
> get past the testing in the gate - https://review.openstack.org/#/c/373430/
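
Until the default is inverted, operators can override it themselves. A
minimal sketch of what that override means, assuming the option lives in
the [workarounds] group as discussed above. Nova itself reads its config
through oslo.config; the stdlib configparser and the helper name
live_snapshot_enabled are stand-ins used only for illustration.

```python
import configparser

# Hypothetical nova.conf fragment. The group and option names are the
# ones quoted in this thread; everything else is illustrative.
NOVA_CONF = """
[workarounds]
disable_libvirt_livesnapshot = False
"""


def live_snapshot_enabled(conf_text, default_disabled=True):
    """Return True if live snapshots would be used with this config.

    default_disabled mirrors the current default=True of the option;
    it applies when the group or the option is absent.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    disabled = parser.getboolean(
        "workarounds",
        "disable_libvirt_livesnapshot",
        fallback=default_disabled,
    )
    return not disabled


print(live_snapshot_enabled(NOVA_CONF))  # explicit override
print(live_snapshot_enabled(""))         # nothing set, default applies
```

With the default inverted, the second call would need
default_disabled=False, and only the gate CI configs would set the
option to True.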

NB the bug was non-deterministic and rare, even in the gate, so the
real test is whether it gets past the gate 20 times in a row :-)
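
To put a rough number on that: if the bug hits a given gate run with
probability p, and runs are independent, the chance of 20 clean runs in
a row is (1 - p)^20. A small sketch follows. The failure rates in it are
made-up values for illustration, not measurements from the gate, and
independence between runs is an assumption.

```python
def prob_clean_streak(p_fail, runs=20):
    """Probability that `runs` independent gate runs all avoid the bug."""
    return (1.0 - p_fail) ** runs


# Hypothetical per-run failure rates, chosen only to show the trend.
for p_fail in (0.01, 0.05, 0.10):
    print(f"p_fail={p_fail:.2f}: P(20 clean runs)={prob_clean_streak(p_fail):.3f}")
```

Under those assumptions a bug that hits 1% of runs would still pass 20
in a row about 82% of the time, so a clean streak is weak evidence that
the bug is gone.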

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

[openstack-dev] [nova] ops meetup feedback
