<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 2, 2017 at 4:42 PM, Ben Nemec <span dir="ltr"><<a href="mailto:openstack@nemebean.com" target="_blank">openstack@nemebean.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="gmail-"><br>
<br>
On 03/28/2017 05:01 PM, Ben Nemec wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Final (hopefully) update:<br>
<br>
All active compute nodes have been rebooted and things seem to be stable<br>
again.  Jobs are even running a little faster, so I'm thinking this had<br>
a detrimental effect on performance too.  I've set a reminder for about<br>
two months from now to reboot again if we're still using this environment.<br>
</blockquote>
<br></span>
The reminder popped up this week, and I've rebooted all the compute nodes again.  It went pretty smoothly so I doubt anyone noticed that it happened (except that I forgot to restart the zuul-status webapp), but if you run across any problems let me know.</blockquote><div><br></div><div>Thanks Ben! <a href="http://zuul-status.tripleo.org/">http://zuul-status.tripleo.org/</a> is awesome, I missed it.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="gmail-HOEnZb"><div class="gmail-h5"><br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
On 03/24/2017 12:48 PM, Ben Nemec wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
To follow up on this, we've continued to hit this issue on other compute<br>
nodes.  Not surprising, of course.  They've all been up for about the<br>
same period of time and have had largely even workloads.<br>
<br>
It has caused problems, though, because it is cropping up faster than I<br>
can respond (it takes a few hours to cycle all the instances off a<br>
compute node, and I need to sleep sometime :-), so I've started<br>
pre-emptively rebooting compute nodes to get ahead of it.  Hopefully<br>
I'll be able to get all of the potentially broken nodes at least<br>
disabled by the end of the day so we'll have another 3 months before we<br>
have to worry about this again.<br>
<br>
On 03/24/2017 11:47 AM, Derek Higgins wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
On 22 March 2017 at 22:36, Ben Nemec <<a href="mailto:openstack@nemebean.com" target="_blank">openstack@nemebean.com</a>> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Hi all (owl?),<br>
<br>
You may have missed it in all the ci excitement the past couple of days,<br>
but we had a partial outage of rh1 last night.  It turns out the OVS port issue<br>
Derek discussed in<br>
<a href="http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html" rel="noreferrer" target="_blank">http://lists.openstack.org/pip<wbr>ermail/openstack-dev/2016-Dece<wbr>mber/109182.html</a><br>
reared its ugly head on a few of our compute nodes, which caused them to be<br>
unable to spawn new instances.  They kept getting scheduled since it looked<br>
like they were underutilized, which caused most of our testenvs to fail.<br>
<br>
I've rebooted the affected nodes, as well as a few more that looked like<br>
they might run into the same problem in the near future.  Everything looks<br>
to be working well again since sometime this morning (when I disabled the<br>
broken compute nodes), but there aren't many jobs passing due to the<br>
plethora of other issues we're hitting in ci.  There have been some stable<br>
job passes though so I believe things are working again.<br>
<br>
As far as preventing this in the future, the right thing to do would<br>
probably be to move to a later release of OpenStack (either point or major)<br>
where hopefully this problem would be fixed.  However, I'm hesitant to do<br>
that for a few reasons.  First is "the devil you know". Outside of this<br>
issue, we've gotten rh1 pretty rock solid lately.  It's been overworked, but<br>
has been cranking away for months with no major cloud-related outages.<br>
Second is that an upgrade would be a major process, probably involving some<br>
amount of downtime.  Since the long-term plan is to move everything to RDO<br>
cloud I'm not sure that's the best use of our time at this point.<br>
</blockquote>
<br>
+1 on keeping the status quo until moving to rdo-cloud.<br>
<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Instead, my plan for the near term is to keep a closer eye on the error<br>
notifications from the services.  We previously haven't had anything<br>
consuming those, but I've dropped a little tool on the controller that will<br>
dump out error notifications so we can watch for signs of this happening<br>
again.  I suspect the signs were there long before the actual breakage<br>
happened, but nobody was looking for them.  Now I will be.<br>
<br>
So that's where things stand with rh1.  Any comments or concerns welcome.<br>
<br>
Thanks.<br>
<br>
-Ben<br>
<br>
______________________________<wbr>______________________________<wbr>______________<br>
<br>
<br>
OpenStack Development Mailing List (not for usage questions)<br>
Unsubscribe:<br>
<a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" rel="noreferrer" target="_blank">OpenStack-dev-request@lists.op<wbr>enstack.org?subject:unsubscrib<wbr>e</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" rel="noreferrer" target="_blank">http://lists.openstack.org/cgi<wbr>-bin/mailman/listinfo/openstac<wbr>k-dev</a><br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
</div></div></blockquote></div><br></div></div>