[openstack-dev] [nova] readout from Philly Operators Meetup
Joe Gordon
joe.gordon0 at gmail.com
Wed Mar 11 18:48:31 UTC 2015
On Wed, Mar 11, 2015 at 5:59 AM, Sean Dague <sean at dague.net> wrote:
> The last couple of days I was at the Operators Meetup acting as Nova
> rep for the meeting. All the sessions were quite nicely recorded to
> etherpads here - https://etherpad.openstack.org/p/PHL-ops-meetup
>
> There was both a specific Nova session -
> https://etherpad.openstack.org/p/PHL-ops-nova-feedback as well as a
> bunch of relevant pieces of information in other sessions.
>
> This is an attempt at a summary; anyone else who was in
> attendance, please feel free to correct me if I'm interpreting something
> incorrectly. There was a lot of content there, so this is in no way a
> comprehensive list, just the highlights that I think make the most
> sense for the Nova team.
>
> =========================
> Nova Network -> Neutron
> =========================
>
> This remains listed as the #1 issue from the Operator Community on
> their burning issues list
> (https://etherpad.openstack.org/p/PHL-ops-burning-issues L18). During
> the tags conversation we straw polled the audience
> (https://etherpad.openstack.org/p/PHL-ops-tags L45) and about 75% of
> attendees were over on neutron already. However, those on Nova Network
> were disproportionately the largest clusters and longest standing
> OpenStack users.
>
> Of those on nova-network about 1/2 had no interest in being on
> Neutron (https://etherpad.openstack.org/p/PHL-ops-nova-feedback
> L24). Some of the primary reasons were the following:
>
> - Complexity concerns - neutron has a lot more moving parts
> - Performance concerns - nova multihost means there is very little
> between guests and the fabric, which is really important for the HPC
> workload use case for OpenStack.
> - Don't want OVS - OVS adds additional complexity and performance
> concerns. Many large sites are moving off OVS back to linux bridge
> with neutron because they are hitting OVS scaling limits (especially
> if on UDP) - (https://etherpad.openstack.org/p/PHL-ops-OVS L142)
>
> The biggest disconnect in the model seems to be that Neutron assumes
> you want self service networking. Most of these deploys don't. Or even
> more importantly, they live in an organization where that is never
> going to be an option.
>
> Neutron provider networks are close, except they don't provide for
> floating IP / NAT.
>
> Going forward: I think the gap analysis probably needs to be revisited
> with some of the vocal large deployers. I think we assumed the
> functional parity gap was closed with DVR, but it's not clear that in its
> current form it actually meets the needs of n-net multihost users.
>
> ===================
> EC2 going forward
> ===================
>
> Having a sustainable EC2 is of high interest to the operator
> community. Many large deploys have some users that were using AWS
> prior to using OpenStack, or currently are using both. They have
> preexisting tooling for that.
>
> There didn't seem to be any objection to the approach of an external
> proxy service for this function -
> (https://etherpad.openstack.org/p/PHL-ops-nova-feedback L111). Mostly
> the question is timing, and the fact that no one has validated the
> stackforge project. The fact that we landed everything people need to
> run this in Kilo is good, as these production deploys will be able to
> test it for their users when they upgrade.
>
> ============================
> Burning Nova Features/Bugs
> ============================
>
> Hierarchical Projects Quotas
> ----------------------------
>
> Hugely desired feature by the operator community
> (https://etherpad.openstack.org/p/PHL-ops-nova-feedback L116). Missed
> Kilo. This made everyone sad.
>
> Action: we should queue this up as an early Liberty priority item.
>
> Out of sync Quotas
> ------------------
>
> https://etherpad.openstack.org/p/PHL-ops-nova-feedback L63
>
> The quotas code is quite racy (this is a known issue if you look at
> the bug tracker). It was actually marked as a top soft spot during
> last fall's bug triage -
>
> http://lists.openstack.org/pipermail/openstack-dev/2014-September/046517.html
>
> There is an operator proposed spec for an approach here -
> https://review.openstack.org/#/c/161782/
>
> Action: we should make a solution here a top priority for enhanced
> testing and fixing in Liberty. Addressing this would remove a lot of
> pain from ops.
>
>
To help us better track quota bugs, I created a quotas tag:
https://bugs.launchpad.net/nova/+bugs?field.tag=quotas
The next step is to re-triage those bugs: mark fixed bugs as fixed,
deduplicate bugs, etc.
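
For anyone triaging these who wants a concrete picture of the failure mode,
here is a minimal sketch of a check-then-update race. It is not nova's quota
code; the class and function names (RacyQuota, AtomicQuota, demo_racy,
demo_atomic) are hypothetical and only illustrate how usage can drift past a
limit when the check and the update are separate steps, and how doing both
under one lock avoids it.

```python
import threading


class QuotaExceeded(Exception):
    """Raised when a reservation would exceed the limit."""


class RacyQuota:
    """Hypothetical quota where check and commit are separate steps."""

    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0

    def check(self, amount):
        # Step 1: read current usage and decide.
        return self.in_use + amount <= self.limit

    def commit(self, amount):
        # Step 2: record usage, trusting the earlier check.
        self.in_use += amount


class AtomicQuota:
    """Hypothetical quota where check and update happen under one lock."""

    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0
        self._lock = threading.Lock()

    def reserve(self, amount):
        with self._lock:
            if self.in_use + amount > self.limit:
                raise QuotaExceeded("limit %d reached" % self.limit)
            self.in_use += amount


def demo_racy():
    """Replay the bad interleaving by hand: both checks run before any commit."""
    quota = RacyQuota(limit=1)
    a_ok = quota.check(1)
    b_ok = quota.check(1)
    if a_ok:
        quota.commit(1)
    if b_ok:
        quota.commit(1)
    return quota.in_use


def demo_atomic():
    """Same two requests, but the second reservation is refused."""
    quota = AtomicQuota(limit=1)
    for _ in range(2):
        try:
            quota.reserve(1)
        except QuotaExceeded:
            pass
    return quota.in_use
```

In the racy version both requests are granted against a limit of 1, so the
recorded usage ends up over the limit; the atomic version refuses the second
request.
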
> Reporting on Scheduler Fails
> ----------------------------
>
> Apparently, some time recently, we stopped logging scheduler fails
> above DEBUG, and that behavior also snuck back into Juno
> (https://etherpad.openstack.org/p/PHL-ops-nova-feedback L78). This
> has made tracking down the root cause of failures far more difficult.
>
> Action: this should hopefully be a quick fix we can get in for Kilo
> and backport.
>
> =============================
> Additional Interesting Bits
> =============================
>
> Rabbit
> ------
>
> There was a whole session on Rabbit -
> https://etherpad.openstack.org/p/PHL-ops-rabbit-queue
>
> Rabbit is a top operational concern for most large sites. Almost all
> sites have a "restart everything that talks to rabbit" script because
> during Rabbit HA operations queues tend to blackhole.
>
> All other queue systems OpenStack supports are worse than Rabbit (from
> experience in that room).
>
> oslo.messaging < 1.6.0 was a significant regression in dependability
> from the incubator code. It now seems to be getting better, but there
> are still a lot of issues. (L112)
>
> Operators *really* want the concept in
> https://review.openstack.org/#/c/146047/ landed. (I asked them to
> provide such feedback in gerrit).
>
> Nova Rolling Upgrades
> ---------------------
>
> Most people really like the concept, but I couldn't find anyone who had
> used it yet: Neutron doesn't support it, so they had to do big
> bang upgrades anyway.
>
> Galera Upstream Testing
> -----------------------
>
> The majority of deploys run with Galera MySQL. There was a question
> about whether or not we could get that into the upstream testing pipeline,
> as that's the common case.
>
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>