<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Mar 11, 2015 at 5:59 AM, Sean Dague <span dir="ltr"><<a href="mailto:sean@dague.net" target="_blank">sean@dague.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">The last couple of days I was at the Operators Meetup acting as Nova<br>
rep for the meeting. All the sessions were quite nicely recorded to<br>
etherpads here - <a href="https://etherpad.openstack.org/p/PHL-ops-meetup" target="_blank">https://etherpad.openstack.org/p/PHL-ops-meetup</a><br>
<br>
There was both a specific Nova session -<br>
<a href="https://etherpad.openstack.org/p/PHL-ops-nova-feedback" target="_blank">https://etherpad.openstack.org/p/PHL-ops-nova-feedback</a> as well as a<br>
bunch of relevant pieces of information in other sessions.<br>
<br>
This is an attempt at a summary; anyone else who was in<br>
attendance, please feel free to correct me if I'm interpreting something<br>
incorrectly. There was a lot of content there, so this is in no way a<br>
comprehensive list, just the highlights that I think make the most<br>
sense for the Nova team.<br>
<br>
=========================<br>
Nova Network -> Neutron<br>
=========================<br>
<br>
This remains listed as the #1 issue from the Operator Community on<br>
their burning issues list<br>
(<a href="https://etherpad.openstack.org/p/PHL-ops-burning-issues" target="_blank">https://etherpad.openstack.org/p/PHL-ops-burning-issues</a> L18). During<br>
the tags conversation we straw polled the audience<br>
(<a href="https://etherpad.openstack.org/p/PHL-ops-tags" target="_blank">https://etherpad.openstack.org/p/PHL-ops-tags</a> L45) and about 75% of<br>
attendees were over on neutron already. However, those on Nova Network<br>
were disproportionately the largest clusters and longest standing<br>
OpenStack users.<br>
<br>
Of those on nova-network about 1/2 had no interest in being on<br>
Neutron (<a href="https://etherpad.openstack.org/p/PHL-ops-nova-feedback" target="_blank">https://etherpad.openstack.org/p/PHL-ops-nova-feedback</a><br>
L24). Some of the primary reasons were the following:<br>
<br>
- Complexity concerns - neutron has a lot more moving parts<br>
- Performance concerns - nova multihost means there is very little<br>
between guests and the fabric, which is really important for the HPC<br>
workload use case for OpenStack.<br>
- Don't want OVS - ovs adds additional complexity, and performance<br>
concerns. Many large sites are moving off ovs back to linux bridge<br>
with neutron because they are hitting OVS scaling limits (especially<br>
if on UDP) - (<a href="https://etherpad.openstack.org/p/PHL-ops-OVS" target="_blank">https://etherpad.openstack.org/p/PHL-ops-OVS</a> L142)<br>
<br>
The biggest disconnect in the model seems to be that Neutron assumes<br>
you want self service networking. Most of these deploys don't. Or even<br>
more importantly, they live in an organization where that is never<br>
going to be an option.<br>
<br>
Neutron provider networks are close, except they don't provide for<br>
floating IP / NAT.<br>
<br>
Going forward: I think the gap analysis probably needs to be revisited<br>
with some of the vocal large deployers. I think we assumed the<br>
functional parity gap was closed with DVR, but it's not clear that in its<br>
current format it actually meets the n-net multihost users' needs.<br>
<br>
===================<br>
EC2 going forward<br>
===================<br>
<br>
Having a sustainable EC2 is of high interest to the operator<br>
community. Many large deploys have some users that were using AWS<br>
prior to using OpenStack, or currently are using both. They have<br>
preexisting tooling for that.<br>
<br>
There didn't seem to be any objection to the approach of an external<br>
proxy service for this function -<br>
(<a href="https://etherpad.openstack.org/p/PHL-ops-nova-feedback" target="_blank">https://etherpad.openstack.org/p/PHL-ops-nova-feedback</a> L111). Mostly<br>
the question is timing, and the fact that no one has validated the<br>
stackforge project. The fact that we landed everything people need to<br>
run this in Kilo is good, as these production deploys will be able to<br>
test it for their users when they upgrade.<br>
<br>
============================<br>
Burning Nova Features/Bugs<br>
============================<br>
<br>
Hierarchical Projects Quotas<br>
----------------------------<br>
<br>
Hugely desired feature by the operator community<br>
(<a href="https://etherpad.openstack.org/p/PHL-ops-nova-feedback" target="_blank">https://etherpad.openstack.org/p/PHL-ops-nova-feedback</a> L116). Missed<br>
Kilo. This made everyone sad.<br>
<br>
Action: we should queue this up as an early Liberty priority item.<br>
<br>
Out of sync Quotas<br>
------------------<br>
<br>
<a href="https://etherpad.openstack.org/p/PHL-ops-nova-feedback" target="_blank">https://etherpad.openstack.org/p/PHL-ops-nova-feedback</a> L63<br>
<br>
The quotas code is quite racy (this is kind of a known issue if you look at<br>
the bug tracker). It was actually marked as a top soft spot during<br>
last fall's bug triage -<br>
<a href="http://lists.openstack.org/pipermail/openstack-dev/2014-September/046517.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2014-September/046517.html</a><br>
<br>
There is an operator proposed spec for an approach here -<br>
<a href="https://review.openstack.org/#/c/161782/" target="_blank">https://review.openstack.org/#/c/161782/</a><br>
<br>
Action: we should make a solution here a top priority for enhanced<br>
testing and fixing in Liberty. Addressing this would remove a lot of<br>
pain from ops.<br>
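<br>
To illustrate the kind of race in question, here is a toy sketch. It is not Nova's actual quota code: the QuotaTracker class and the run_concurrent_reservations helper are hypothetical names invented for this example. The point is only that the limit check and the usage update have to happen as one atomic step, otherwise two concurrent requests can both pass the check.<br>
<pre>
```python
import threading


class QuotaTracker:
    """Toy in-memory quota counter (hypothetical, for illustration only)."""

    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0
        self._lock = threading.Lock()

    def reserve(self, amount):
        """Atomically check the limit and record the usage.

        Returns True if the reservation fit within the limit, else False.
        """
        # The check and the update happen under one lock, so no other
        # thread can slip in between the comparison and the increment.
        with self._lock:
            if self.in_use + amount > self.limit:
                return False
            self.in_use += amount
            return True


def run_concurrent_reservations(tracker, requests):
    """Issue `requests` single-unit reservations from separate threads."""
    results = []
    results_lock = threading.Lock()

    def worker():
        ok = tracker.reserve(1)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```
</pre>
An in-process lock only works for this toy. Nova services run as separate processes, so a real fix would need the same atomicity at the database level.<br>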
<br></blockquote><div><br></div><div>To help us better track quota bugs I created a quotas tag:</div><div><br></div><div><a href="https://bugs.launchpad.net/nova/+bugs?field.tag=quotas">https://bugs.launchpad.net/nova/+bugs?field.tag=quotas</a><br></div><div><br></div><div>The next step is to re-triage those bugs: mark fixed bugs as fixed, deduplicate bugs, etc.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
Reporting on Scheduler Fails<br>
----------------------------<br>
<br>
Apparently, some time recently, we stopped logging scheduler fails<br>
above DEBUG, and that behavior snuck back into Juno as well<br>
(<a href="https://etherpad.openstack.org/p/PHL-ops-nova-feedback" target="_blank">https://etherpad.openstack.org/p/PHL-ops-nova-feedback</a> L78). This<br>
has made tracking down root cause of failures far more difficult.<br>
<br>
Action: this should hopefully be a quick fix we can get in for Kilo<br>
and backport.<br>
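<br>
As a rough illustration of the behavior operators are asking for, here is a sketch. It is not the Nova scheduler code: select_host, the filter callables, and the stand-in NoValidHost exception are all hypothetical names for this example. The idea is simply that the filter which eliminated the last candidate gets reported at WARNING, so the reason is visible without turning on DEBUG.<br>
<pre>
```python
import logging

LOG = logging.getLogger("scheduler_sketch")


class NoValidHost(Exception):
    """Raised when no host passes the filters (stand-in exception)."""


def select_host(hosts, filters, request):
    """Return the first host passing every filter, or raise NoValidHost."""
    candidates = list(hosts)
    if not candidates:
        LOG.warning("No hosts available for request %s", request)
        raise NoValidHost("no hosts available")
    for filter_fn in filters:
        remaining = [h for h in candidates if filter_fn(h, request)]
        if not remaining:
            # Log at WARNING so the failure reason shows up in default logs.
            LOG.warning(
                "Filter %s returned 0 hosts for request %s (had %d before)",
                filter_fn.__name__, request, len(candidates))
            raise NoValidHost("no host satisfied " + filter_fn.__name__)
        candidates = remaining
    return candidates[0]
```
</pre>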
<br>
=============================<br>
Additional Interesting Bits<br>
=============================<br>
<br>
Rabbit<br>
------<br>
<br>
There was a whole session on Rabbit -<br>
<a href="https://etherpad.openstack.org/p/PHL-ops-rabbit-queue" target="_blank">https://etherpad.openstack.org/p/PHL-ops-rabbit-queue</a><br>
<br>
Rabbit is a top operational concern for most large sites. Almost all<br>
sites have a "restart everything that talks to rabbit" script because<br>
during Rabbit HA operations queues tend to blackhole.<br>
<br>
All other queue systems OpenStack supports are worse than Rabbit (from<br>
experience in that room).<br>
<br>
oslo.messaging &lt; 1.6.0 was a significant regression in dependability<br>
from the incubator code. It now seems to be getting better, but there are<br>
still a lot of issues. (L112)<br>
<br>
Operators *really* want the concept in<br>
<a href="https://review.openstack.org/#/c/146047/" target="_blank">https://review.openstack.org/#/c/146047/</a> landed. (I asked them to<br>
provide such feedback in gerrit).<br>
<br>
Nova Rolling Upgrades<br>
---------------------<br>
<br>
Most people really like the concept, but I couldn't find anyone who had<br>
used it yet because Neutron doesn't support it, so they had to do big<br>
bang upgrades anyway.<br>
<br>
Galera Upstream Testing<br>
-----------------------<br>
<br>
The majority of deploys run with Galera MySQL. There was a question<br>
about whether or not we could get that into the upstream testing pipeline<br>
as that's the common case.<br>
<span class=""><font color="#888888"><br>
<br>
-Sean<br>
<br>
--<br>
Sean Dague<br>
<a href="http://dague.net" target="_blank">http://dague.net</a><br>
<br>
__________________________________________________________________________<br>
OpenStack Development Mailing List (not for usage questions)<br>
</font></span></blockquote></div><br></div></div>