[Openstack-operators] [nova][ironic][scheduler][placement][heat] IMPORTANT: Getting rid of the automated reschedule functionality
Jay Pipes
jaypipes at gmail.com
Mon May 22 18:43:30 UTC 2017
On 05/22/2017 02:24 PM, Fox, Kevin M wrote:
> So.... one gut reaction is that this is going to make more Heat stacks fail. If pushing the orchestration stuff out of Nova is the goal, you should probably involve Heat so that it knows the difference between a VM that failed because it was scheduled poorly and can just be resubmitted, and a VM that failed for other reasons?
Yes, that is a good point, thank you.
> To your "do one thing well" comment: I agree, but not necessarily with your conclusion. If Nova's API is to provide a way for a user to launch a VM, and it fails more often due to internal issues such as scheduling races, that's Nova's failure (or at least a machine's job to handle the failure), not the user's.
The races you refer to above will be handled by the claims being done in
the scheduler, with the exception of the last-minute affinity violation
case ((b) in my original post). Please see my note in the original post
about Nova handling this race condition when placement is made aware of
affinity/distance in Queens/Rocky. Until that happens, allowing
Heat/Tacker to retry a launch by understanding that the instance went
into ERROR due to a last-minute affinity violation would be a stop-gap
measure. Would you be OK with that stop-gap?
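As a concrete illustration of that stop-gap, Heat/Tacker could inspect the `fault` field that Nova exposes on servers in ERROR state and only resubmit when the failure was a last-minute affinity violation. A minimal sketch in plain Python; note that the "affinity" marker matched below is hypothetical, since the real message text would need to be agreed between Nova and its callers:

```python
def should_retry(server):
    """Return True if an ERRORed server is worth resubmitting.

    Nova exposes a `fault` dict (code, message) on servers in ERROR
    state. The "affinity" marker below is a hypothetical convention,
    not a documented Nova message format.
    """
    if server.get("status") != "ERROR":
        return False
    message = (server.get("fault") or {}).get("message", "")
    return "affinity" in message.lower()

# Toy usage with fake server records:
affinity_failure = {"status": "ERROR",
                    "fault": {"code": 409,
                              "message": "Anti-affinity constraint violated"}}
other_failure = {"status": "ERROR",
                 "fault": {"code": 500,
                           "message": "No valid host was found"}}

retry_affinity = should_retry(affinity_failure)   # True
retry_other = should_retry(other_failure)         # False
```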
> The same retry code has to get implemented either way: retry in Nova, or retry by the thing driving Nova. If there are a lot of different things driving Nova (and there already are a lot of things), then it means reimplementing the same solution over and over again in all of them. That makes OpenStack harder to use, and it's already hard to use. That's not a good path to continue on.
We aren't trying to make OpenStack harder to use. We are trying to make
what Nova does more clear-cut and simple.
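For what it's worth, the retry loop that would move out of Nova and into "the thing driving Nova" is genuinely small. A hedged sketch in plain Python; `launch_instance` and `get_status` are placeholders for whatever client calls a deployment actually uses, not real Nova API names:

```python
def launch_with_retry(launch_instance, get_status, max_attempts=3):
    """Retry a boot whenever the instance lands in ERROR.

    launch_instance -- callable() -> instance id
    get_status      -- callable(instance_id) -> "ACTIVE" or "ERROR"
    """
    for attempt in range(1, max_attempts + 1):
        instance_id = launch_instance()
        if get_status(instance_id) == "ACTIVE":
            return instance_id
        # Cleanup of the ERRORed instance is left to the caller.
    raise RuntimeError(
        "instance failed to become ACTIVE after %d attempts" % max_attempts)

# Toy usage: the first two launches fail, the third succeeds.
outcomes = iter(["ERROR", "ERROR", "ACTIVE"])
statuses = {}

def fake_launch():
    instance_id = "server-%d" % len(statuses)
    statuses[instance_id] = next(outcomes)
    return instance_id

booted = launch_with_retry(fake_launch, statuses.get)  # "server-2"
```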
> Maybe you're right that all retries should be external to the individual projects and maybe done by a central OpenStack retry daemon or something, and maybe Nova's API is a low-level API users shouldn't actually be talking to; instead they should be talking to an "openstack cli, but for REST" kind of API.
Yes, this.
In other words, how k8s works. In k8s, the user doesn't call the kubelet
directly to launch a pod. Instead, the user calls a porcelain REST API,
passing in a declarative definition of what the pod should look like.
The k8s API service writes that pod definition to etcd storage and then
the k8s scheduler constantly tries to make that pod definition a reality
by launching containers on resource nodes (ok, that's a bit simplified,
but I hope you get my drift).
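That reconcile-until-converged pattern can be sketched in a few lines. This is a toy illustration of the control-loop idea only, not real k8s or Nova code:

```python
import time

def reconcile(desired, observe, actuate, max_iterations=10, interval=0.0):
    """Drive observed state toward a declared spec, k8s-style.

    desired -- the declarative spec, e.g. {"replicas": 3}
    observe -- callable() -> current state
    actuate -- callable(desired, current) taking one corrective step
    """
    for _ in range(max_iterations):
        current = observe()
        if current == desired:
            return current  # converged
        actuate(desired, current)
        time.sleep(interval)
    raise RuntimeError("did not converge within the iteration budget")

# Toy usage: converge a replica count upward one "launch" at a time.
state = {"replicas": 0}
converged = reconcile(
    desired={"replicas": 3},
    observe=lambda: dict(state),
    actuate=lambda d, c: state.update(replicas=c["replicas"] + 1),
)
```

The point of the pattern is that the caller declares an end state and the loop keeps taking corrective steps until observation matches declaration, which is exactly the responsibility Nova is trying not to take on.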
Also note that k8s is an *orchestrator*. Nova isn't an orchestrator.
> That's something that openstack as a whole probably needs to talk
> about anyway. Each project keeps evolving their APIs separately without
> central discussion of these sorts of cross-cutting concerns that maybe
> are better solved the same way.
Sure, not disagreeing with you on this. In Nova-land, we are trying to
reduce the scope of what Nova is responsible for so that we can achieve
a better functional definition for the project and allow other projects
to perform work that is more relevant to their functional area.
> Also, for the record, I do use anti-affinity groups all the time, but haven't touched NFV.
How often do you see retries due to the last-minute anti-affinity violation?
Thanks for the feedback, Kevin!
-jay
> Thanks,
> Kevin
> ________________________________________
> From: Jay Pipes [jaypipes at gmail.com]
> Sent: Monday, May 22, 2017 10:54 AM
> To: openstack-operators at lists.openstack.org
> Subject: [Openstack-operators] [nova][ironic][scheduler][placement] IMPORTANT: Getting rid of the automated reschedule functionality
>
> Hi Ops,
>
> I need your feedback on a very important direction we would like to
> pursue. I realize that there were Forum sessions about this topic at the
> summit in Boston and that there were some decisions that were reached.
>
> I'd like to revisit that decision and explain why I'd like your support
> for getting rid of the automatic reschedule behaviour entirely in Nova
> for Pike.
>
> == The current situation and why it sucks ==
>
> Nova currently attempts to "reschedule" instances when any of the
> following events occur:
>
> a) the "claim resources" process that occurs on the nova-compute worker
> results in the chosen compute node exceeding its own capacity
>
> b) in between the time a compute node was chosen by the scheduler,
> another process launched an instance that would violate an affinity
> constraint
>
> c) an "unknown" exception occurs during the spawn process. In practice,
> this is really only seen when the Ironic baremetal node that was chosen
> by the scheduler turns out to be unreliable (IPMI issues, BMC failures,
> etc) and wasn't able to launch the instance. [1]
>
> The logic for handling these reschedules makes the Nova conductor,
> scheduler and compute worker code very complex. With the new cellsv2
> architecture in Nova, child cells are not able to communicate with the
> Nova scheduler (and thus "ask for a reschedule").
>
> We (the Nova team) would like to get rid of the automated rescheduling
> behaviour that Nova currently exposes because we could eliminate a large
> amount of complexity (which leads to bugs) from the already-complicated
> dance of communication that occurs between internal Nova components.
>
> == What we would like to do ==
>
> With the move of the resource claim to the Nova scheduler [2], we can
> entirely eliminate the a) class of Reschedule causes.
>
> This leaves class b) and c) causes of Rescheduling.
>
> For class b) causes, we should be able to solve this issue when the
> placement service understands affinity/anti-affinity (maybe
> Queens/Rocky). Until then, we propose that instead of raising a
> Reschedule when an affinity constraint is violated at the last minute
> due to a racing scheduler decision, we simply set the instance to an
> ERROR state.
>
> Personally, I have only ever seen anti-affinity/affinity use cases in
> relation to NFV deployments, and in every NFV deployment of OpenStack
> there is a VNFM or MANO solution that is responsible for the
> orchestration of instances belonging to various service function chains.
> I think it is reasonable to expect the MANO system to be responsible for
> attempting a re-launch of an instance that was set to ERROR due to a
> last-minute affinity violation.
>
> **Operators, do you agree with the above?**
>
> Finally, for class c) Reschedule causes, I do not believe that we should
> be attempting automated rescheduling when "unknown" errors occur. I just
> don't believe this is something Nova should be doing.
>
> I recognize that large Ironic users expressed their concerns about
> IPMI/BMC communication being unreliable and not wanting to have users
> manually retry a baremetal instance launch. But, on this particular
> point, I'm of the opinion that Nova should just do one thing and do it well.
> Nova isn't an orchestrator, nor is it intending to be a "just
> continually try to get me to this eventual state" system like Kubernetes.
>
> If we removed Reschedule for class c) failures entirely, large Ironic
> deployers would have to train users to manually retry a failed launch or
> would need to write a simple retry mechanism into whatever client/UI
> that they expose to their users.
>
> **Ironic operators, would the above decision force you to abandon Nova
> as the multi-tenant BMaaS facility?**
>
> Thanks in advance for your consideration and feedback.
>
> Best,
> -jay
>
> [1] This really does not occur with any frequency for hypervisor virt
> drivers, since the exceptions those hypervisors throw are caught by the
> nova-compute worker and handled without raising a Reschedule.
>
> [2]
> http://specs.openstack.org/openstack/nova-specs/specs/pike/approved/placement-claims.html
>
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>