[Openstack-operators] Fwd: [openstack-dev] [nova] Boston Forum session recap - cellsv2
David Medberry
openstack at medberry.net
Fri May 19 22:28:31 UTC 2017
Thanks Dan and Matt!
On Fri, May 19, 2017 at 2:48 PM, Matt Riedemann <mriedemos at gmail.com> wrote:
> FYI
>
>
>
> -------- Forwarded Message --------
> Subject: [openstack-dev] [nova] Boston Forum session recap - cellsv2
> Date: Fri, 19 May 2017 08:13:24 -0700
> From: Dan Smith <dms at danplanet.com>
> Reply-To: OpenStack Development Mailing List (not for usage questions) <
> openstack-dev at lists.openstack.org>
> To: OpenStack Development Mailing List (not for usage questions) <
> openstack-dev at lists.openstack.org>
>
> The etherpad for this session is here [1]. The goal of the session was
> to get some questions answered that the developers had for operators
> around the topic of cellsv2.
>
> The bulk of the time was spent discussing ways to limit instance
> scheduling retries in a cellsv2 world where placement eliminates
> resource-reservation races. Reschedules would be upcalls from the cell,
> which we are trying to avoid.
>
> While placement should eliminate 95% (or more) of reschedules due to
> pre-claiming resources before booting, there will still be cases where
> we may want to reschedule due to unexpected transient failures. How many
> of those remain, and whether or not rescheduling for them is really
> useful is in question.
>
> The compromise that seemed popular in the room was to grab more than one
> host at the time of scheduling, claim against the selected one, and pass
> the rest to the cell as alternates. If the cell needs to reschedule, the
> cell conductor would try one of the alternates that came as part of the
> original boot request, instead of asking the scheduler again.
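
A minimal sketch of the alternates idea described above, in plain Python. It is not nova code: `BuildRequest`, `BuildFailed`, `build_in_cell` and the `build_on_host` callable are hypothetical names used only for illustration, and claiming resources on an alternate before trying it is left out.

```python
from dataclasses import dataclass, field


class BuildFailed(Exception):
    """Raised by the (hypothetical) build callable when a host cannot boot."""


@dataclass
class BuildRequest:
    # All fields are illustrative; they do not mirror nova's real objects.
    instance_id: str
    selected_host: str
    alternates: list = field(default_factory=list)


def build_in_cell(request, build_on_host):
    """Try the selected host, then each alternate, without a scheduler upcall.

    build_on_host(instance_id, host) is assumed to raise BuildFailed when
    the build does not succeed on that host.
    """
    for host in [request.selected_host, *request.alternates]:
        try:
            build_on_host(request.instance_id, host)
            return host  # first host that boots the instance wins
        except BuildFailed:
            continue  # fall through to the next alternate
    raise BuildFailed(f"no usable host for {request.instance_id}")
```

The point of the design is that the list of hosts travels with the boot request, so a retry stays inside the cell.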
>
> During the discussion of this, an operator raised the concern that
> without reschedules, a single compute that fails to boot 100% of the
> time ends up becoming a magnet for all future builds, looking like an
> excellent target for the scheduler, but failing anything that is sent to
> it. If we don't reschedule, that situation could be very problematic. An
> idea came out that we should really have each compute monitor itself and
> disable itself once the number of _consecutive_ build failures crosses a
> threshold. That would mitigate/eliminate the "fail magnet" behavior and
> further reduce the need for retries. A patch has been proposed for this,
> and so far enjoys wide support [2].
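
A rough sketch of the consecutive-failure idea above, again in plain Python rather than nova code. The class name, the default threshold of 10, and the choice to trip at "greater than or equal to" are all assumptions made for illustration; the proposed patch [2] may do this differently.

```python
class BuildFailureTracker:
    """Counts consecutive build failures on one compute host (illustrative)."""

    def __init__(self, threshold=10):
        # threshold is a hypothetical default, not taken from the patch.
        self.threshold = threshold
        self.consecutive_failures = 0
        self.disabled = False

    def record_success(self):
        # Any successful build breaks the streak.
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            # A real service would mark itself disabled so that the
            # scheduler stops selecting this host.
            self.disabled = True
```

Because only consecutive failures count, a host that fails occasionally stays enabled, while one that fails every build takes itself out of rotation after `threshold` attempts.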
>
> We also discussed the transition to counting quotas, and what that means
> for operators. The room seemed in favor of this, and discussion was brief.
>
> Finally, I made the call for people with reasonably-sized pre-prod
> environments to begin testing cellsv2 to help prove it out and find the
> gremlins. CERN and NeCTAR specifically volunteered for this effort.
>
> [1] https://etherpad.openstack.org/p/BOS-forum-cellsv2-developer-community-coordination
> [2] https://review.openstack.org/#/c/463597/
>
> --Dan