[openstack-dev] [nova] readout from Philly Operators Meetup
sean at dague.net
Wed Mar 11 12:59:10 UTC 2015
The last couple of days I was at the Operators Meetup acting as Nova
rep for the meeting. All the sessions were quite nicely recorded to
etherpads here - https://etherpad.openstack.org/p/PHL-ops-meetup
There was both a specific Nova session -
https://etherpad.openstack.org/p/PHL-ops-nova-feedback as well as a
bunch of relevant pieces of information in other sessions.
This is an attempt at a summary; anyone else who was in attendance,
please feel free to correct me if I'm interpreting something
incorrectly. There was a lot of content there, so this is in no way a
comprehensive list, just the highlights that I think make the most
sense for the Nova team.
Nova Network -> Neutron
This remains listed as the #1 issue from the Operator Community on
their burning issues list
(https://etherpad.openstack.org/p/PHL-ops-burning-issues L18). During
the tags conversation we straw polled the audience
(https://etherpad.openstack.org/p/PHL-ops-tags L45) and about 75% of
attendees were over on neutron already. However those on Nova Network
were disproportionately the largest clusters and longest standing
Of those on nova-network about 1/2 had no interest in being on
L24). Some of the primary reasons were the following:
- Complexity concerns - neutron has a lot more moving parts
- Performance concerns - nova multihost means there is very little
between guests and the fabric, which is really important for the HPC
workload use case for OpenStack.
- Don't want OVS - OVS adds additional complexity and performance
concerns. Many large sites are moving off OVS back to linux bridge
with neutron because they are hitting OVS scaling limits (especially
if on UDP) - (https://etherpad.openstack.org/p/PHL-ops-OVS L142)
The biggest disconnect in the model seems to be that Neutron assumes
you want self service networking. Most of these deploys don't. Or even
more importantly, they live in an organization where that is never
going to be an option.
Neutron provider networks are close, except they don't provide for
floating IP / NAT.
Going forward: I think the gap analysis probably needs to be revisited
with some of the vocal large deployers. I think we assumed the
functional parity gap was closed with DVR, but it's not clear that in its
current form it actually meets the n-net multihost users' needs.
EC2 going forward
Having a sustainable EC2 is of high interest to the operator
community. Many large deploys have some users that were using AWS
prior to using OpenStack, or currently are using both. They have
preexisting tooling for that.
There didn't seem to be any objection to the approach of an external
proxy service for this function -
(https://etherpad.openstack.org/p/PHL-ops-nova-feedback L111). Mostly
the question is timing, and the fact that no one has validated the
stackforge project. The fact that we landed everything people need to
run this in Kilo is good, as these production deploys will be able to
test it for their users when they upgrade.
Burning Nova Features/Bugs
Hierarchical Projects Quotas
Hugely desired feature by the operator community
(https://etherpad.openstack.org/p/PHL-ops-nova-feedback L116). Missed
Kilo. This made everyone sad.
Action: we should queue this up as an early Liberty priority item.
Out of sync Quotas
The quotas code is quite racy (this is kind of a known issue if you look at
the bug tracker). It was actually marked as a top soft spot during
last fall's bug triage -
There is an operator proposed spec for an approach here -
Action: we should make a solution here a top priority for enhanced
testing and fixing in Liberty. Addressing this would remove a lot of
pain from ops.
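
For anyone less familiar with why this kind of code drifts out of sync,
here is a minimal sketch of the general check-then-act race. It is not
Nova's quota code; QuotaTracker, reserve_unsafe, reserve and
run_concurrent_reservations are hypothetical names used only for
illustration, and an in-process lock stands in for whatever atomic
mechanism a real fix would use.

```python
import threading


class QuotaTracker:
    """Hypothetical in-memory quota counter; not Nova's implementation."""

    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0
        self._lock = threading.Lock()

    def reserve_unsafe(self, amount):
        # Check-then-act with no lock: two callers can both pass the check
        # before either updates in_use, so usage can overshoot the limit.
        if self.in_use + amount <= self.limit:
            self.in_use += amount
            return True
        return False

    def reserve(self, amount):
        # Holding the lock makes the check and the update one atomic step.
        with self._lock:
            if self.in_use + amount <= self.limit:
                self.in_use += amount
                return True
            return False


def run_concurrent_reservations(tracker, workers=8, attempts=500):
    """Issue one-unit reservations from several threads; return the number granted."""
    granted = []

    def worker():
        for _ in range(attempts):
            if tracker.reserve(1):
                granted.append(1)  # list.append is atomic in CPython

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(granted)
```

With the locked path the number of grants never exceeds the limit no
matter how many workers compete. In a multi-process, database-backed
service the same idea would need an atomic update in the database rather
than a Python lock.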
Reporting on Scheduler Fails
Apparently, some time recently, we stopped logging scheduler fails
above DEBUG, and that behavior also snuck back into Juno as well
(https://etherpad.openstack.org/p/PHL-ops-nova-feedback L78). This
has made tracking down the root cause of failures far more difficult.
Action: this should hopefully be a quick fix we can get in for Kilo
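
To make the log level point concrete, here is a small sketch using only
the Python standard library logging module. It is not the Nova scheduler
code; the logger name, report_scheduler_failure and the filter summary
format are assumptions made up for illustration.

```python
import logging

# Hypothetical logger name; the real scheduler uses its own module loggers.
LOG = logging.getLogger("scheduler_sketch")


def report_scheduler_failure(instance_id, filter_results, level=logging.WARNING):
    """Log a scheduling failure along with what each filter left over.

    filter_results is a list of (filter_name, hosts_remaining) tuples.
    """
    summary = ", ".join(
        "%s: %d hosts left" % (name, remaining)
        for name, remaining in filter_results
    )
    # Emitting at WARNING means the message survives a typical production
    # configuration that drops DEBUG records.
    LOG.log(level, "No valid host for instance %s (%s)", instance_id, summary)
```

If the same call is made with level=logging.DEBUG, a logger configured
at INFO discards the record, which is the situation operators described.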
Additional Interesting Bits
There was a whole session on Rabbit -
Rabbit is a top operational concern for most large sites. Almost all
sites have a "restart everything that talks to rabbit" script because
during rabbit HA operations queues tend to blackhole.
All other queue systems OpenStack supports are worse than Rabbit (from
experience in that room).
oslo.messaging < 1.6.0 was a significant regression in dependability
from the incubator code. It now seems to be getting better, but there are
still a lot of issues. (L112)
Operators *really* want the concept in
https://review.openstack.org/#/c/146047/ landed. (I asked them to
provide such feedback in gerrit).
Nova Rolling Upgrades
Most people really like the concept, but I couldn't find anyone who had
used it yet because Neutron doesn't support it, so they had to do big
bang upgrades anyway.
Galera Upstream Testing
The majority of deploys run with Galera MySQL. There was a question
about whether or not we could get that into the upstream testing pipeline
as that's the common case.