[openstack-dev] Your next semi weekly gate status report
cboylan at sapwetik.org
Mon Mar 27 21:57:25 UTC 2017
Previously, Libvirt crashes, OOMs, and Tempest SSH banner failures were
problems. The SSH banner failures have since been sorted out; thank you
to everyone who helped with that. For details, please see
https://review.openstack.org/#/c/439638/. There is also a proposed fix
for a race that was causing ssh to fail in test_attach_detach_volume:
https://review.openstack.org/#/c/449661/ (this change is not merged yet,
so it would be great if Tempest cores could get it in).
To address the OOMs, we have also seen work to reduce the memory
overhead of running devstack. Changes to reduce Apache's memory use have
gone in. We also tried putting MySQL on a diet, but that had to be
reverted.
There is also a memory_tracker logging service; you will now find its
output in your job logs. These logs can be useful for determining where
memory went, which in turn can guide efforts to reduce memory use.
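As a rough illustration of how such logs can be mined, here is a minimal
sketch that picks out the peak RSS per process from a per-process memory
listing. The log format used here (process name followed by RSS in kB)
is an assumption for illustration, not the actual memory_tracker output
format, so a real script would need to be adapted to the service's logs.

```python
# Hypothetical sketch: find the peak RSS seen for each process in a
# memory_tracker-style log. The format below is an assumption, not the
# real service output.

sample_log = """\
mysqld 412000
apache2 256000
nova-api 198000
apache2 260000
"""

def peak_rss(log_text):
    """Return a dict mapping process name to the peak RSS (kB) seen."""
    peaks = {}
    for line in log_text.strip().splitlines():
        name, rss = line.split()
        rss = int(rss)
        if rss > peaks.get(name, 0):
            peaks[name] = rss
    return peaks

# Sort by peak usage so the biggest consumers are listed first.
for name, rss in sorted(peak_rss(sample_log).items(),
                        key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rss} kB")
```

Ranking processes by peak usage like this makes it easier to see which
services are worth trimming first.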
It is great to see people take an interest in addressing memory issues,
and according to elastic-recheck, OOMkiller is no longer a major
problem. That said, there is more we can do here.
There are also outstanding changes in review that may help. But we
really need individual projects to look at the memory consumption of
OpenStack itself and trim it where they are able.
Unfortunately the Libvirt crashes continue to be a problem.
Current top issues:
1. Libvirt crashes: http://status.openstack.org/elastic-recheck/#1643911
Libvirt is randomly crashing during jobs, which causes things to fail
(for obvious reasons). Addressing this will likely require someone with
experience debugging libvirt, since it is most likely a bug isolated to
libvirt itself. We are looking for someone familiar with libvirt
internals to drive the effort to fix this issue.
2. Network packet loss in OSIC
This has caused connectivity errors to external services. Various e-r
bugs like http://status.openstack.org/elastic-recheck/index.html#1282876
appear to have tripped on this. We expect that the problem has been
corrected, but we should keep an eye on these and make sure they fall
off the e-r list.
Our classification rate has also taken a nosedive lately. Something
that would help is if people start classifying these failures. While
the overall failure rate is lower than in previous weeks, a low
classification rate means there are race conditions (or other failures)
we are not yet tracking, which will only make them more difficult to
fix. Normally, if the classification rate is below 90%, we have at
least one big persistent failure condition we are not aware of.
mtreinish and clarkb