Open Stack

Thu Dec 17 06:45:40 UTC 2015

Excerpts from Adrian Otto's message of 2015-12-16 16:56:39 -0800:
> Clint,
> 
> I think you are categorically dismissing a very real ops challenge: how to set correct system limits, and how to adjust them in a running system. I have been stung by this challenge repeatedly over the years. As developers we *guess* at what a sensible default value for a limit will be, but we are sometimes wrong. When we are, that guess has a very real and very negative impact on users of production systems. The idea of using one limit for all users is idealistic. I’m convinced, based on my experience, that it’s not the best approach in practice. What we usually want to do is bump up a limit for a single user, or dynamically drop a limit for all users. The problem is that very few systems implement limits in a way that lets them be adjusted while the system is running, and very rarely on a per-tenant basis. So yes, I will assert that having a quota implementation, and the related complexity, is justified by the ability to adapt limit levels while the system is running.
> 
> Think for a moment about the pain that an ops team goes through when they have to take down a service that’s affecting thousands or tens of thousands of users. We have to send zillions of emails to customers, and we need to hold emergency change management meetings. We have to answer questions like “why didn’t you test for this?” when we did test for it, and it worked fine under simulation, but not in a real production environment under this particular stimulus. Or “why can’t you take the system down in sections to keep the service up?” The answer to all of this is “because the developers never put themselves in the shoes of the ops team when they designed it.”
> 
> Those who know me will attest to the fact that I care deeply about applying the KISS principle. The principle guides us to keep our designs as simple as possible unless it’s essential to make them more complex. In this case, the complexity is justified.
> 
> Now if there are production ops teams for large scale systems that argue that dynamic limits and per-user overrides are pointless, then I’ll certainly reconsider my position.
> 

Hm, I think we agree that ops need ways to enact policies smoothly,
that's for sure, and I am sorry I've failed to communicate that. My
experience has been somewhat different, and I tend to treat every single
request that comes in as the one that will crash your service and trigger
those emails. With thousands of users, weakening the limitations that
protect the service for any subset of them seems like a huge undertaking
and carries a large risk. Making the system resilient no matter who is
talking to it would be my focus.

However, I'm not directly involved and we're just going in circles,
so we'll just have to agree to disagree. I hope, sincerely, that I'm
wrong. :)


[openstack-dev] [openstack][magnum] Quota for Magnum Resources
