[openstack-dev] [nova] readout from Philly Operators Meetup

Sean Dague sean at dague.net
Wed Mar 11 12:59:10 UTC 2015


For the last couple of days I was at the Operators Meetup acting as the
Nova rep for the meeting. All the sessions were quite nicely recorded
in etherpads here - https://etherpad.openstack.org/p/PHL-ops-meetup

There was a specific Nova session -
https://etherpad.openstack.org/p/PHL-ops-nova-feedback - as well as a
bunch of relevant pieces of information in other sessions.

This is an attempt at a summary; anyone else who was in attendance,
please feel free to correct me if I'm interpreting something
incorrectly. There was a lot of content there, so this is in no way a
comprehensive list, just the highlights that I think make the most
sense for the Nova team.

=========================
 Nova Network -> Neutron
=========================

This remains listed as the #1 issue from the Operator Community on
their burning issues list
(https://etherpad.openstack.org/p/PHL-ops-burning-issues L18). During
the tags conversation we straw polled the audience
(https://etherpad.openstack.org/p/PHL-ops-tags L45) and about 75% of
attendees were already over on Neutron. However, those still on Nova
Network were disproportionately the largest clusters and
longest-standing OpenStack users.

Of those on nova-network, about half had no interest in moving to
Neutron (https://etherpad.openstack.org/p/PHL-ops-nova-feedback
L24). Some of the primary reasons were the following:

- Complexity concerns - Neutron has a lot more moving parts.
- Performance concerns - nova-network multihost means there is very
  little between guests and the fabric, which is really important for
  the HPC workload use case for OpenStack.
- Don't want OVS - OVS adds additional complexity and performance
  concerns. Many large sites are moving off OVS back to Linux bridge
  with Neutron because they are hitting OVS scaling limits (especially
  with UDP traffic) - (https://etherpad.openstack.org/p/PHL-ops-OVS L142)

The biggest disconnect in the model seems to be that Neutron assumes
you want self-service networking. Most of these deploys don't. Or,
even more importantly, they live in an organization where that is
never going to be an option.

Neutron provider networks are close, except they don't provide for
floating IPs / NAT.

Going forward: I think the gap analysis probably needs to be revisited
with some of the vocal large deployers. I think we assumed the
functional parity gap was closed with DVR, but it's not clear that in
its current form it actually meets the n-net multihost users' needs.

===================
 EC2 going forward
===================

Having a sustainable EC2 API is of high interest to the operator
community. Many large deploys have users who were using AWS prior to
using OpenStack, or are currently using both, and they have
preexisting tooling for that.

There didn't seem to be any objection to the approach of an external
proxy service for this function
(https://etherpad.openstack.org/p/PHL-ops-nova-feedback L111). Mostly
the question is timing, and the fact that no one has validated the
stackforge project yet. The fact that we landed everything people need
to run this in Kilo is good, as these production deploys will be able
to test it for their users when they upgrade.

============================
 Burning Nova Features/Bugs
============================

Hierarchical Projects Quotas
----------------------------

A hugely desired feature for the operator community
(https://etherpad.openstack.org/p/PHL-ops-nova-feedback L116). It
missed Kilo. This made everyone sad.

Action: we should queue this up as an early Liberty priority item.

Out of sync Quotas
------------------

https://etherpad.openstack.org/p/PHL-ops-nova-feedback L63

The quotas code is quite racy (this is a known issue if you look at
the bug tracker). It was actually marked as a top soft spot during
last fall's bug triage -
http://lists.openstack.org/pipermail/openstack-dev/2014-September/046517.html

There is an operator proposed spec for an approach here -
https://review.openstack.org/#/c/161782/

Action: we should make a solution here a top priority for enhanced
testing and fixing in Liberty. Addressing this would remove a lot of
pain from ops.
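
To make the failure mode concrete, here is a minimal sketch (not
Nova's actual quota code, the names are illustrative only) of why a
naive check-then-increment races: two concurrent requests can both
read the same usage, both pass the check, and leave the recorded
usage out of sync with what was actually consumed.

  import threading

  QUOTA_LIMIT = 10
  usage = {"instances": 9}
  lock = threading.Lock()

  def reserve_racy(count):
      # Check-then-increment with no atomicity: two threads can both
      # see usage == 9, both pass the check, and the recorded usage
      # ends up wrong.
      if usage["instances"] + count <= QUOTA_LIMIT:
          usage["instances"] += count
          return True
      return False

  def reserve_safe(count):
      # Holding a lock (or doing an atomic compare-and-update at the
      # database level) closes the window between check and increment.
      with lock:
          if usage["instances"] + count <= QUOTA_LIMIT:
              usage["instances"] += count
              return True
          return False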

Reporting on Scheduler Fails
----------------------------

Apparently, some time recently, we stopped logging scheduler failures
above DEBUG, and that behavior also snuck back into Juno
(https://etherpad.openstack.org/p/PHL-ops-nova-feedback L78). This
has made tracking down the root cause of failures far more difficult.

Action: this should hopefully be a quick fix we can get in for Kilo
and backport.
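
For illustration only (placeholder names, not Nova's actual scheduler
code): the operational difference is simply which log level the
failure is emitted at, since production deployments typically run
with an INFO or WARNING threshold and DEBUG messages never make it
into the logs.

  import logging

  logging.basicConfig(level=logging.INFO)  # typical production threshold
  LOG = logging.getLogger("scheduler-example")

  class NoValidHost(Exception):
      pass

  def schedule(request_id):
      try:
          raise NoValidHost("no host satisfies the request")
      except NoValidHost as exc:
          # Under an INFO threshold this line is silently dropped,
          # which is what made root-causing failures hard for operators.
          LOG.debug("request %s failed: %s", request_id, exc)
          # Emitting at WARNING keeps the failure visible in production.
          LOG.warning("request %s failed: %s", request_id, exc)

  schedule("req-1234")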

=============================
 Additional Interesting Bits
=============================

Rabbit
------

There was a whole session on Rabbit -
https://etherpad.openstack.org/p/PHL-ops-rabbit-queue

Rabbit is a top operational concern for most large sites. Almost all
sites have a "restart everything that talks to rabbit" script, because
during Rabbit HA operations queues tend to black-hole.
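
As a sketch of what those scripts tend to look like (the service
names and the use of systemctl here are assumptions and vary widely
per deploy), the workaround is usually just a blunt loop over every
agent that holds a rabbit connection:

  import subprocess

  # Illustrative list only; real deploys restart whatever they run
  # that talks to rabbit.
  RABBIT_CONSUMERS = [
      "nova-scheduler",
      "nova-conductor",
      "nova-compute",
      "neutron-server",
      "cinder-scheduler",
  ]

  def restart_rabbit_consumers():
      for service in RABBIT_CONSUMERS:
          # Keep going even if one restart fails; the goal is to force
          # every consumer to re-establish its queues after HA failover.
          subprocess.call(["systemctl", "restart", service])

  if __name__ == "__main__":
      restart_rabbit_consumers()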

All other queue systems OpenStack supports are worse than Rabbit (from
experience in that room).

oslo.messaging < 1.6.0 was a significant regression in dependability
from the incubator code. It now seems to be getting better, but there
are still a lot of issues. (L112)

Operators *really* want the concept in
https://review.openstack.org/#/c/146047/ to land. (I asked them to
provide that feedback in gerrit.)

Nova Rolling Upgrades
---------------------

Most people really like the concept, but I couldn't find anyone who
had used it yet because Neutron doesn't support it, so they had to do
big bang upgrades anyway.

Galera Upstream Testing
-----------------------

The majority of deploys run with Galera MySQL. There was a question
about whether or not we could get that into the upstream testing
pipeline, as that's the common case.


	-Sean

-- 
Sean Dague
http://dague.net


