[openstack-dev] [nova] readout from Philly Operators Meetup

Sylvain Bauza sbauza at redhat.com
Wed Mar 11 16:53:28 UTC 2015

Thanks Sean for writing up this report, greatly appreciated.
Comments inline.

On 11/03/2015 13:59, Sean Dague wrote:
> The last couple of days I was at the Operators Meetup acting as Nova
> rep for the meeting. All the sessions were quite nicely recorded to
> etherpads here - https://etherpad.openstack.org/p/PHL-ops-meetup
> There was both a specific Nova session -
> https://etherpad.openstack.org/p/PHL-ops-nova-feedback as well as a
> bunch of relevant pieces of information in other sessions.
> This is an attempt at a summary. Anyone else who was in attendance,
> please feel free to correct me if I'm interpreting something
> incorrectly. There was a lot of content there, so this is in no way
> a comprehensive list, just the highlights that I think make the most
> sense for the Nova team.
> =========================
>   Nova Network -> Neutron
> =========================
> This remains listed as the #1 issue from the Operator Community on
> their burning issues list
> (https://etherpad.openstack.org/p/PHL-ops-burning-issues L18). During
> the tags conversation we straw polled the audience
> (https://etherpad.openstack.org/p/PHL-ops-tags L45) and about 75% of
> attendees were over on neutron already. However those on Nova Network
> were disproportionately the largest clusters and longest standing
> OpenStack users.
> Of those on nova-network about 1/2 had no interest in being on
> Neutron (https://etherpad.openstack.org/p/PHL-ops-nova-feedback
> L24). Some of the primary reasons were the following:
> - Complexity concerns - neutron has a lot more moving parts
> - Performance concerns - nova multihost means there is very little
>    between guests and the fabric, which is really important for the HPC
>    workload use case for OpenStack.
> - Don't want OVS - ovs adds additional complexity, and performance
>    concerns. Many large sites are moving off ovs back to linux bridge
>    with neutron because they are hitting OVS scaling limits (especially
>    if on UDP) - (https://etherpad.openstack.org/p/PHL-ops-OVS L142)
> The biggest disconnect in the model seems to be that Neutron assumes
> you want self service networking. Most of these deploys don't. Or even
> more importantly, they live in an organization where that is never
> going to be an option.
> Neutron provider networks are close, except they don't provide for
> floating IP / NAT.
> Going forward: I think the gap analysis probably needs to be revisited
> with some of the vocal large deployers. I think we assumed the
> functional parity gap was closed with DVR, but it's not clear that in
> its current form it actually meets the n-net multihost users' needs.
> ===================
>   EC2 going forward
> ===================
> Having a sustainable EC2 is of high interest to the operator
> community. Many large deploys have some users that were using AWS
> prior to using OpenStack, or currently are using both. They have
> preexisting tooling for that.
> There didn't seem to be any objection to the approach of an external
> proxy service for this function -
> (https://etherpad.openstack.org/p/PHL-ops-nova-feedback L111). Mostly
> the question is timing, and the fact that no one has validated the
> stackforge project. The fact that we landed everything people need to
> run this in Kilo is good, as these production deploys will be able to
> test it for their users when they upgrade.
> ============================
>   Burning Nova Features/Bugs
> ============================
> Hierarchical Projects Quotas
> ----------------------------
> Hugely desired feature by the operator community
> (https://etherpad.openstack.org/p/PHL-ops-nova-feedback L116). Missed
> Kilo. This made everyone sad.
> Action: we should queue this up as early Liberty priority item.
> Out of sync Quotas
> ------------------
> https://etherpad.openstack.org/p/PHL-ops-nova-feedback L63
> The quotas code is quite racy (this is kind of a known issue if you
> look at the bug tracker). It was actually marked as a top soft spot during
> last fall's bug triage -
> http://lists.openstack.org/pipermail/openstack-dev/2014-September/046517.html
> There is an operator proposed spec for an approach here -
> https://review.openstack.org/#/c/161782/
> Action: we should make a solution here a top priority for enhanced
> testing and fixing in Liberty. Addressing this would remove a lot of
> pain from ops.
> Reporting on Scheduler Fails
> ----------------------------
> Apparently, some time recently, we stopped logging scheduler fails
> above DEBUG, and that behavior also snuck back into Juno
> (https://etherpad.openstack.org/p/PHL-ops-nova-feedback L78). This
> has made tracking down root cause of failures far more difficult.
> Action: this should hopefully be a quick fix we can get in for Kilo
> and backport.
It's unfortunate that failed scheduling attempts are providing only an
INFO log. A quick fix could be at least to turn the verbosity up to WARN
so it would be noticed more easily (including the whole filters stack
with their results).
That said, I'm pretty much against any proposal which would expose those
specific details (i.e. the number of hosts which pass each filter) in an
API endpoint, because it would also expose the underlying infrastructure
capacity and make it easier to find DoS opportunities. A workaround
could be to include in the ERROR message only the name of the filter
which rejected the request, so the operators could very easily match
what the user is saying with what they're seeing in the scheduler logs.

Does that work for people? I can provide changes for both.
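
To make the two ideas concrete, here is a minimal standalone sketch. It
is not the actual Nova scheduler code: the names (filter_hosts,
NoValidHost, ram_filter, disk_filter, the host dict keys) are
hypothetical and only illustrate the split between what goes into the
operator log at WARNING and what ends up in the user-facing error.

```python
import logging

LOG = logging.getLogger("scheduler_sketch")


class NoValidHost(Exception):
    """User-facing error: carries only the name of the rejecting filter."""


# Hypothetical filters; real scheduler filters are more involved.
def ram_filter(host, request):
    return host["free_ram_mb"] >= request["ram_mb"]


def disk_filter(host, request):
    return host["free_disk_gb"] >= request["disk_gb"]


def filter_hosts(hosts, request, filters):
    """Run the filters in order and return the hosts that pass all of them."""
    remaining = list(hosts)
    results = []  # (filter name, hosts left after it), for operators only
    for name, passes in filters:
        remaining = [h for h in remaining if passes(h, request)]
        results.append((name, len(remaining)))
        if not remaining:
            # Operator side: the whole filter stack with per-filter counts,
            # logged at WARNING so it is visible without DEBUG.
            LOG.warning("Filtering removed all hosts; filter results: %s",
                        results)
            # User side: only the filter name, no capacity figures.
            raise NoValidHost(
                "No valid host found: request rejected by %s" % name)
    return remaining
```

The user would only ever see something like "request rejected by
DiskFilter", which the operator can then match against the WARNING line
carrying the per-filter host counts.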


> =============================
>   Additional Interesting Bits
> =============================
> Rabbit
> ------
> There was a whole session on Rabbit -
> https://etherpad.openstack.org/p/PHL-ops-rabbit-queue
> Rabbit is a top operational concern for most large sites. Almost all
> sites have a "restart everything that talks to rabbit" script because
> during Rabbit HA operations queues tend to blackhole.
> All other queue systems OpenStack supports are worse than Rabbit (from
> experience in that room).
> oslo.messaging < 1.6.0 was a significant regression in dependability
> from the incubator code. It now seems to be getting better but there
> are still a lot of issues. (L112)
> Operators *really* want the concept in
> https://review.openstack.org/#/c/146047/ landed. (I asked them to
> provide such feedback in gerrit).
> Nova Rolling Upgrades
> ---------------------
> Most people really like the concept, but I couldn't find anyone that
> had used it yet because Neutron doesn't support it, so they had to do
> big bang upgrades anyway.
> Galera Upstream Testing
> -----------------------
> The majority of deploys run with Galera MySQL. There was a question
> about whether or not we could get that into upstream testing pipeline
> as that's the common case.
> 	-Sean
