[Openstack-operators] Fwd: [openstack-dev] [nova] Boston Forum session recap - cellsv2
Matt Riedemann
mriedemos at gmail.com
Fri May 19 20:48:54 UTC 2017
FYI
-------- Forwarded Message --------
Subject: [openstack-dev] [nova] Boston Forum session recap - cellsv2
Date: Fri, 19 May 2017 08:13:24 -0700
From: Dan Smith <dms at danplanet.com>
Reply-To: OpenStack Development Mailing List (not for usage questions)
<openstack-dev at lists.openstack.org>
To: OpenStack Development Mailing List (not for usage questions)
<openstack-dev at lists.openstack.org>
The etherpad for this session is here [1]. The goal of the session was
to get some questions answered that the developers had for operators
around the topic of cellsv2.
The bulk of the time was spent discussing ways to limit instance
scheduling retries in a cellsv2 world where placement eliminates
resource-reservation races. Reschedules would be upcalls from the cell,
which we are trying to avoid.
While placement should eliminate 95% (or more) of reschedules due to
pre-claiming resources before booting, there will still be cases where
we may want to reschedule due to unexpected transient failures. How many
of those remain, and whether rescheduling for them is really useful, is
still an open question.
The compromise that seemed popular in the room was to grab more than one
host at the time of scheduling, claim resources for the chosen one, and
pass the rest to the cell as alternates. If the cell needs to reschedule,
the cell conductor would try one of the alternates that came as part of
the original boot request, instead of asking the scheduler again.
During the discussion of this, an operator raised the concern that
without reschedules, a single compute that fails to boot 100% of the
time ends up becoming a magnet for all future builds, looking like an
excellent target for the scheduler, but failing anything that is sent to
it. If we don't reschedule, that situation could be very problematic. An
idea came out that we should really have the compute service monitor
itself and disable itself once the number of _consecutive_ build failures
crosses a threshold. That would mitigate/eliminate the "fail magnet"
behavior and
further reduce the need for retries. A patch has been proposed for this,
and so far enjoys wide support [2].
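The self-disable idea can be sketched in a few lines of Python. This is
an illustration only and does not reflect the patch in [2]: the class
name, the threshold value, and the choice to disable when the count
reaches (rather than exceeds) the threshold are all assumptions made for
the example. The key point is that any successful build resets the
counter, so only an unbroken run of failures takes the host out of
scheduling.

```python
class BuildFailureTracker:
    """Hypothetical per-compute counter of consecutive build failures."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.consecutive_failures = 0
        self.enabled = True

    def record_success(self) -> None:
        # A single good build breaks the streak.
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        # Assumption: disable once the streak reaches the threshold.
        if self.consecutive_failures >= self.threshold:
            self.enabled = False


tracker = BuildFailureTracker(threshold=3)

# Two failures, then a success: the streak resets, host stays enabled.
tracker.record_failure()
tracker.record_failure()
tracker.record_success()
still_enabled = tracker.enabled

# Three failures in a row: the host takes itself out of scheduling.
for _ in range(3):
    tracker.record_failure()
disabled_after_streak = not tracker.enabled
```

In this sketch a disabled host stays disabled until someone re-enables
it by hand, which is one possible design choice and not something the
session recap specifies.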
We also discussed the transition to counting quotas, and what that means
for operators. The room seemed in favor of this, and discussion was brief.
Finally, I made the call for people with reasonably-sized pre-prod
environments to begin testing cellsv2 to help prove it out and find the
gremlins. CERN and NeCTAR specifically volunteered for this effort.
[1]
https://etherpad.openstack.org/p/BOS-forum-cellsv2-developer-community-coordination
[2] https://review.openstack.org/#/c/463597/
--Dan