Time for a quick update on gate status.

* There were some shelve tests failing ssh pretty badly in the tempest-slow job due to a neutron issue: https://launchpad.net/bugs/1812552. It seems https://review.openstack.org/#/c/631944/ might have squashed that bug.

* Probably our biggest issue right now is test_subnet_details failing: http://status.openstack.org/elastic-recheck/#1813198. I suspect that is somehow related to devstack using cirros 0.4.0 as of Jan 20. I have a tempest patch up for review to help debug that when it fails, https://review.openstack.org/#/c/633225, since it seems we're not parsing nic names properly, which is how we get the mangled udhcpc..pid file name (see the sketch at the end of this mail [1]).

* Another nasty one affecting unit/functional tests (the bug is filed against nova but the query hits other projects as well) is http://status.openstack.org/elastic-recheck/#1813147, where subunit parsing fails. It seems cinder had to deal with something like this recently too, so the nova team needs to figure out what cinder did to resolve it. I'm not sure whether this is a recent regression, but the logstash trends start around Jan 17, so it could be.

* https://bugs.launchpad.net/cinder/+bug/1810526 is a cinder bug where etcd intermittently drops connections and the cinder services then hit ToozConnectionErrors, which cause other things to fail; for example, volume status updates are lost during a delete and tempest then times out waiting for the volume to be deleted. I have a fingerprint in the bug, but it shows up in successful jobs too, which is frustrating. I would expect that for grenade while services are being restarted (although do we restart etcd in grenade?), but it also shows up in non-grenade jobs. I believe cinder is just using tooz+etcd as a distributed lock manager, so I'm not sure how valid it would be to add retries to that locking code when the service is unavailable (see the sketch at the end of this mail [2]). One suggestion in IRC was to not use tooz/etcd for DLM in single-node jobs. That kind of side-steps the issue, but if etcd is lagging because lots of services are eating up resources on the single node, it might not be a bad option.

-- 

Thanks,

Matt
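[1] A hypothetical sketch of the udhcpc..pid failure mode, assuming the pid file name is built as udhcpc.<nic>.pid from whatever nic name we parsed from the guest. The parse_nics helper and the regex here are made up for illustration, not tempest's actual code:

    # Illustration only: an empty/unparsed nic name yields "udhcpc..pid".
    # parse_nics and the pid file template are assumptions, not tempest code.
    import re

    def parse_nics(ip_addr_output):
        # Expect "ip addr" lines like "2: eth0: <BROADCAST,MULTICAST,UP> ...".
        return re.findall(r'^\d+: ([^:@]+)[:@]', ip_addr_output, re.MULTILINE)

    def udhcpc_pid_file(nic):
        return '/var/run/udhcpc.%s.pid' % nic

    nics = parse_nics('')          # parsing fails -> no nic names found
    nic = nics[0] if nics else ''  # falls back to an empty string
    print(udhcpc_pid_file(nic))    # -> /var/run/udhcpc..pid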
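[2] The kind of retry I mean for the tooz locking code, as a minimal sketch; the backend URL, member id, lock name, and retry/backoff values are made up, and this is not cinder's actual code:

    # Minimal sketch: retry tooz lock acquisition when etcd drops connections.
    import time

    from tooz import coordination

    # Backend URL and member id are made-up examples.
    coordinator = coordination.get_coordinator(
        'etcd3+http://127.0.0.1:2379', b'cinder-volume-host1')
    coordinator.start()

    def locked_call(lock_name, func, retries=3, backoff=1.0):
        """Run func under a tooz lock, retrying on ToozConnectionError."""
        for attempt in range(retries):
            try:
                with coordinator.get_lock(lock_name):
                    return func()
            except coordination.ToozConnectionError:
                # etcd dropped the connection; back off and retry rather
                # than immediately failing the volume operation.
                if attempt == retries - 1:
                    raise
                time.sleep(backoff * (attempt + 1))

    # e.g. locked_call(b'volume-<uuid>', do_delete)

The open question from the bug still stands: whether retrying like this is the right call, or whether it just papers over etcd being starved for resources on the node.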