[Openstack-operators] [nova][ironic][scheduler][placement] IMPORTANT: Getting rid of the automated reschedule functionality

Marc Heckmann marc.heckmann at ubisoft.com
Tue May 23 13:48:46 UTC 2017


For the anti-affinity use case, it's really useful for smaller or medium-sized operators who want to offer users some form of failure domain but don't have the resources to create AZs at DC scale, or even at rack or row scale. Don't forget that as soon as you introduce AZs, you need to grow them all at the same rate and offer the same flavors across all of them.
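For illustration, here's a minimal sketch of how a user asks for that with a server group (python-novaclient here; the auth session and the image/flavor IDs are placeholders, and the "affinity" policy works the same way):

    from novaclient import client

    # Assumes an existing keystoneauth session; adjust to your own auth setup.
    nova = client.Client("2", session=my_keystone_session)

    # An anti-affinity server group: the scheduler keeps its members on
    # different compute hosts.
    group = nova.server_groups.create(name="web-tier", policies=["anti-affinity"])

    # Boot an instance as a member of the group via the scheduler hint.
    nova.servers.create(
        name="web-01",
        image=image_id,
        flavor=flavor_id,
        scheduler_hints={"group": group.id},
    )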

For the retry thing, I think enough people have chimed in to echo the general sentiment.

-m


On Mon, 2017-05-22 at 16:30 -0600, David Medberry wrote:
I have to agree with James....

My affinity and anti-affinity rules have nothing to do with NFV. Anti-affinity is almost always a failure-domain solution. I'm not sure we have users actually choosing affinity (though if they did, it would likely be for network speed, or for some badly architected need, real or perceived, for coupling).

On Mon, May 22, 2017 at 12:45 PM, James Penick <jpenick at gmail.com> wrote:


On Mon, May 22, 2017 at 10:54 AM, Jay Pipes <jaypipes at gmail.com> wrote:
Hi Ops,

Hi!


For class b) causes, we should be able to solve this issue once the placement service understands affinity/anti-affinity (maybe Queens/Rocky). Until then, we propose that, instead of raising a Reschedule when an affinity constraint is violated at the last minute by a racing scheduler decision, we simply set the instance to an ERROR state.

Personally, I have only ever seen anti-affinity/affinity use cases in relation to NFV deployments, and in every NFV deployment of OpenStack there is a VNFM or MANO solution that is responsible for the orchestration of instances belonging to various service function chains. I think it is reasonable to expect the MANO system to be responsible for attempting a re-launch of an instance that was set to ERROR due to a last-minute affinity violation.
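To make that concrete, the re-launch logic being pushed out to a VNFM/MANO layer could look roughly like this sketch (python-novaclient; "nova", "group_id", and "boot_kwargs" are placeholders for the orchestrator's own client and boot parameters):

    def relaunch_errored_members(nova, group_id, boot_kwargs):
        """Re-create any group member that landed in ERROR."""
        group = nova.server_groups.get(group_id)
        for member_id in group.members:
            server = nova.servers.get(member_id)
            if server.status == "ERROR":
                # A racing scheduler decision violated the (anti-)affinity
                # constraint; delete the failed instance and request a fresh
                # placement under the same group policy.
                nova.servers.delete(server.id)
                nova.servers.create(scheduler_hints={"group": group_id}, **boot_kwargs)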

**Operators, do you agree with the above?**

I do not. My affinity and anti-affinity use cases reflect the need to build large applications across failure domains in a datacenter.

Anti-affinity: Most anti-affinity use cases are about guaranteeing that instances are scheduled across failure domains; others relate to security compliance.

Affinity: Hadoop/big data deployments have affinity use cases, where nodes processing data need to be in the same rack as the nodes that house the data. This is a common setup for large Hadoop deployers.

I recognize that large Ironic users expressed their concerns about IPMI/BMC communication being unreliable and not wanting to have users manually retry a baremetal instance launch. But, on this particular point, I'm of the opinion that Nova should just do one thing and do it well. Nova isn't an orchestrator, nor is it intended to be a "just continually try to get me to this eventual state" system like Kubernetes.

I agree that Nova should do one thing and do it really well, and in my mind that thing is reliable provisioning of compute resources. Kubernetes is a larger orchestration platform that provides autoscale, among other things; I'm not asking Nova to provide autoscale. I -AM- asking OpenStack's compute platform to provision a discrete compute resource reliably. This means overcoming common and simple error cases. As a deployer of OpenStack, I'm trying to build a cloud that wraps the chaos of infrastructure and presents a reliable facade. When my users issue a boot request, I want to see it fulfilled. I don't expect a 100% guarantee against every possible failure, but I expect (and my users demand) that my "infrastructure as a service" API make reasonable accommodations to overcome common failures.


If we removed Reschedule for class c) failures entirely, large Ironic deployers would have to train users to manually retry a failed launch, or would need to write a simple retry mechanism into whatever client/UI they expose to their users.
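For the sake of argument, such a client-side retry wrapper might look like the following sketch (python-novaclient; "nova" and "boot_kwargs" are placeholders, and the retry/poll values are arbitrary):

    import time

    def boot_with_retry(nova, boot_kwargs, retries=3, poll=10):
        """Boot an instance, retrying from scratch if it lands in ERROR."""
        for attempt in range(retries):
            server = nova.servers.create(**boot_kwargs)
            while server.status == "BUILD":
                time.sleep(poll)
                server = nova.servers.get(server.id)
            if server.status == "ACTIVE":
                return server
            # e.g. a transient IPMI/BMC failure left the node in ERROR:
            # clean up and ask the scheduler for another node.
            nova.servers.delete(server.id)
        raise RuntimeError("boot failed after %d attempts" % retries)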

**Ironic operators, would the above decision force you to abandon Nova as the multi-tenant BMaaS facility?**


I just glanced at one of my production clusters and found around 7K users defined, many of whom use OpenStack on a daily basis. When they issue a boot call, they expect that request to be honored. From their perspective, if they call AWS, they get what they ask for. If you remove reschedules, you're not just breaking the expectation of a single deployer; you're breaking it for the thousands of engineers who rely on OpenStack every day to manage their stack.

I don't have an "I'll take my football and go home" mentality. But if you remove the ability of the compute provisioning API to present a reliable facade over infrastructure, I have to either go write something else or patch it back in. Either way, it's now even harder for me to get and stay current with OpenStack.

During the summit the agreement was, if I recall, that reschedules would happen within a cell, and not between the parent and cell. That was completely acceptable to me.

-James





_______________________________________________
OpenStack-operators mailing list
OpenStack-operators at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
