[Openstack-operators] [nova][ironic][scheduler][placement] IMPORTANT: Getting rid of the automated reschedule functionality

David Medberry openstack at medberry.net
Mon May 22 22:30:56 UTC 2017


I have to agree with James....

My affinity and anti-affinity rules have nothing to do with NFV.
Anti-affinity is almost always a failure-domain solution. I'm not sure
we have users actually choosing affinity (though if they did, it would
likely be for network speed issues and/or some sort of badly
architected need, or perceived need, for coupling).

On Mon, May 22, 2017 at 12:45 PM, James Penick <jpenick at gmail.com> wrote:

>
>
> On Mon, May 22, 2017 at 10:54 AM, Jay Pipes <jaypipes at gmail.com> wrote:
>
>> Hi Ops,
>
> Hi!
>
>>
>> For class b) causes, we should be able to solve this issue once the
>> placement service understands affinity/anti-affinity (maybe Queens/Rocky).
>> Until then, we propose that instead of raising a Reschedule when an
>> affinity constraint is violated at the last minute due to a racing
>> scheduler decision, we simply set the instance to an ERROR state.
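>>
>> To illustrate the proposed control flow, here is a minimal Python
>> sketch; the names (handle_affinity_race, Instance) are hypothetical
>> and the real Nova code paths differ:
>>
>>     class RescheduledException(Exception):
>>         """Raised today to ask the conductor to retry on another host."""
>>
>>     class Instance:
>>         def __init__(self, uuid):
>>             self.uuid = uuid
>>             self.status = 'BUILD'
>>
>>     def handle_affinity_race(instance, violates_policy):
>>         if not violates_policy:
>>             instance.status = 'ACTIVE'
>>             return
>>         # Today: raise RescheduledException and let the conductor pick
>>         # another host. Proposed: fail fast and leave any retry to an
>>         # external system or the user.
>>         instance.status = 'ERROR'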
>>
>> Personally, I have only ever seen anti-affinity/affinity use cases in
>> relation to NFV deployments, and in every NFV deployment of OpenStack there
>> is a VNFM or MANO solution that is responsible for the orchestration of
>> instances belonging to various service function chains. I think it is
>> reasonable to expect the MANO system to be responsible for attempting a
>> re-launch of an instance that was set to ERROR due to a last-minute
>> affinity violation.
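>>
>> As a sketch only (assuming openstacksdk; the cloud name and the
>> image/flavor/network IDs are placeholders), such a MANO-side
>> re-launch loop could look like:
>>
>>     import time
>>
>>     import openstack
>>
>>     conn = openstack.connect(cloud='mycloud')  # placeholder cloud name
>>
>>     def relaunch_errored(image_id, flavor_id, network_id):
>>         # Replace instances that failed, e.g. on a late affinity check.
>>         # (A real VNFM would track which servers it owns and reuse
>>         # their original boot parameters.)
>>         for server in conn.compute.servers(status='ERROR'):
>>             name = server.name
>>             conn.compute.delete_server(server)
>>             conn.compute.create_server(
>>                 name=name,
>>                 image_id=image_id,
>>                 flavor_id=flavor_id,
>>                 networks=[{'uuid': network_id}])
>>
>>     while True:
>>         relaunch_errored('IMAGE_ID', 'FLAVOR_ID', 'NETWORK_ID')
>>         time.sleep(30)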
>>
>
>
>> **Operators, do you agree with the above?**
>>
>
> I do not. My affinity and anti-affinity use cases reflect the need to
> build large applications across failure domains in a datacenter.
>
> Anti-affinity: Most anti-affinity use cases relate to the ability to
> guarantee that instances are scheduled across failure domains; others
> relate to security compliance.
>
> Affinity: Hadoop/big data deployments have affinity use cases, where nodes
> processing data need to be in the same rack as the nodes that house the
> data. This is a common setup for large Hadoop deployers.
>
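> Both cases map onto Nova server groups today. A rough openstacksdk
> sketch (the cloud name and the image/flavor/network IDs are
> placeholders, and the scheduler_hints plumbing is an assumption about
> the SDK call):
>
>     import openstack
>
>     conn = openstack.connect(cloud='mycloud')  # placeholder
>
>     # Spread the web tier across failure domains.
>     group = conn.compute.create_server_group(
>         name='web-tier', policies=['anti-affinity'])
>
>     conn.compute.create_server(
>         name='web-1',
>         image_id='IMAGE_ID',
>         flavor_id='FLAVOR_ID',
>         networks=[{'uuid': 'NETWORK_ID'}],
>         scheduler_hints={'group': group.id})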
>
>> I recognize that large Ironic users have expressed concerns about
>> IPMI/BMC communication being unreliable and not wanting to have users
>> manually retry a baremetal instance launch. But on this particular point,
>> I'm of the opinion that Nova should just do one thing and do it well.
>> Nova isn't an orchestrator, nor is it intended to be a "just continually
>> try to get me to this eventual state" system like Kubernetes.
>>
>
> I agree that Nova should do one thing and do it really well, and in my
> mind that thing is reliable provisioning of compute resources. Kubernetes
> does autoscale among other things. I'm not asking for Nova to provide
> autoscale; I -AM- asking OpenStack's compute platform to provision a
> discrete compute resource reliably. This means overcoming common and
> simple error cases. As a deployer of OpenStack I'm trying to build a
> cloud that wraps the chaos of infrastructure and presents a reliable
> facade. When my users issue a boot request, I want to see it fulfilled.
> I don't expect a 100% guarantee across every possible failure, but I
> expect (and my users demand) that my "infrastructure as a service" API
> make reasonable accommodation to overcome common failures.
>
>
>
>> If we removed Reschedule for class c) failures entirely, large Ironic
>> deployers would have to train users to manually retry a failed launch, or
>> would need to write a simple retry mechanism into whatever client/UI they
>> expose to their users.
>>
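>> For illustration, such a client-side retry wrapper could be as small
>> as the following sketch (assuming openstacksdk; boot_with_retry and
>> the cloud name are placeholders):
>>
>>     import openstack
>>     from openstack import exceptions
>>
>>     conn = openstack.connect(cloud='mycloud')  # placeholder
>>
>>     def boot_with_retry(max_attempts=3, **server_kwargs):
>>         for attempt in range(max_attempts):
>>             server = conn.compute.create_server(**server_kwargs)
>>             try:
>>                 # wait_for_server raises ResourceFailure if the
>>                 # instance lands in ERROR.
>>                 return conn.compute.wait_for_server(server)
>>             except exceptions.ResourceFailure:
>>                 conn.compute.delete_server(server)
>>         raise RuntimeError('boot failed after %d attempts' % max_attempts)
>>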
>> **Ironic operators, would the above decision force you to abandon Nova as
>> the multi-tenant BMaaS facility?**
>>
>>
> I just glanced at one of my production clusters and found there are
> around 7K users defined, many of whom use OpenStack on a daily basis. When
> they issue a boot call, they expect that request to be honored. From their
> perspective, if they call AWS, they get what they ask for. If you remove
> reschedules, you're not just breaking the expectations of a single
> deployer, but those of my thousands of engineers who rely on OpenStack
> every day to manage their stacks.
>
> I don't have an "I'll take my football and go home" mentality. But if you
> remove the ability for the compute provisioning API to present a reliable
> facade over infrastructure, I have to go write something else or patch it
> back in. Now it's even harder for me to get and stay current with
> OpenStack.
>
> During the summit the agreement was, if I recall correctly, that
> reschedules would happen within a cell, and not between the parent and
> the cell. That was completely acceptable to me.
>
> -James