Hi,
On 26.01.2019, at 01:47, Matt Riedemann <mriedemos@gmail.com> wrote:
Time for a quick update on gate status.
* There were some shelve tests that were failing ssh pretty badly in the tempest-slow job due to a neutron issue: https://launchpad.net/bugs/1812552. It seems https://review.openstack.org/#/c/631944/ might have squashed that bug.
* Probably our biggest issue right now is test_subnet_details failing: http://status.openstack.org/elastic-recheck/#1813198. I suspect that is somehow related to using cirros 0.4.0 in devstack as of Jan 20. I have a tempest patch up for review to help debug that when it fails https://review.openstack.org/#/c/633225 since it seems we're not parsing nic names properly which is how we get the mangled udhcpc..pid file name.
I was looking at the logs from a failed job [1], and what I noticed in the tempest log [2] is that this command returned the proper "eth0" interface a couple of times and then once returned an empty string. Looking at the command in the tempest test, that means IMO that the IP address (10.1.0.3 in this example) wasn't configured on any interface at that moment. Maybe the interface is losing its IP address during the lease renewal process, and we should just make the tempest test more resilient to such a (temporary, I hope) issue.
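Something like this could do it (just a rough sketch, the helper and client names below are made up, it's not the real tempest code): retry the interface lookup a few times so a short window where the address is gone, e.g. during lease renewal, doesn't fail the test immediately:

    import time

    def get_nic_name_by_ip(ssh_client, ip_address, retries=5, delay=2):
        # Ask the guest which interface currently carries the IP and, on a
        # transient miss (e.g. in the middle of a DHCP lease renewal),
        # wait a bit and retry instead of failing right away.
        cmd = "ip -o addr show | grep ' %s/' | awk '{print $2}'" % ip_address
        for _ in range(retries):
            nic = ssh_client.exec_command(cmd).strip()
            if nic:
                return nic
            time.sleep(delay)
        raise RuntimeError("No interface found with IP %s" % ip_address)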
* Another nasty one that is affecting unit/functional tests (the bug is against nova but the query hits other projects as well) is http://status.openstack.org/elastic-recheck/#1813147 where subunit parsing fails. It seems cinder had to deal with something like this recently too so the nova team needs to figure out what cinder did to resolve this. I'm not sure if this is a recent regression or not, but the logstash trends start around Jan 17 so it could be recent.
We have the same issue in the neutron-functional job on Python 3. A fix is waiting for review in [3]. I recently talked about it with Matthew Treinish on IRC [4], and it looks like limiting the output on the pythonlogging stream did the trick, so we should finally be able to get it working. You will probably need to do something similar.
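The general idea (only a sketch, assuming a testtools/fixtures based base class; the class name is made up) is to cap what the per-test pythonlogging attachment captures, e.g. by raising the captured level so DEBUG noise never ends up in the subunit stream:

    import logging

    import fixtures
    import testtools

    class BaseFunctionalTestCase(testtools.TestCase):

        def setUp(self):
            super(BaseFunctionalTestCase, self).setUp()
            # Capture logging at INFO instead of DEBUG so the per-test
            # 'pythonlogging' attachment stays small enough for the
            # subunit parser to handle.
            self.useFixture(fixtures.FakeLogger(level=logging.INFO))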
* https://bugs.launchpad.net/cinder/+bug/1810526 is a cinder bug related to etcd intermittently dropping connections and then cinder services hit ToozConnectionErrors which cause other things to fail, like volume status updates are lost during delete and then tempest times out waiting for the volume to be deleted. I have a fingerprint in the bug but it shows up in successful jobs too which is frustrating. I would expect that for grenade while services are being restarted (although do we restart etcd in grenade?) but it also shows up in non-grenade jobs. I believe cinder is just using tooz+etcd as a distributed lock manager so I'm not sure how valid it would be to add retries on that locking code or not when the service is unavailable. One suggestion in IRC was to not use tooz/etcd for DLM in single-node jobs but that kind of side-steps the issue - but if etcd is lagging because of lots of services eating up resources on the single node, it might not be a bad option.
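If retries on that locking code would be acceptable, a minimal sketch of what it could look like around the tooz lock (the function and its parameters are made up for illustration, this is not what cinder does today):

    import time

    from tooz import coordination

    def acquire_lock_with_retry(coordinator, lock_name, attempts=3, wait=1.0):
        # Retry only on connection errors to the backend (e.g. etcd briefly
        # dropping connections); any other failure is re-raised immediately.
        for attempt in range(1, attempts + 1):
            try:
                lock = coordinator.get_lock(lock_name)
                lock.acquire(blocking=True)
                return lock
            except coordination.ToozConnectionError:
                if attempt == attempts:
                    raise
                time.sleep(wait)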
--
Thanks,
Matt
[1] http://logs.openstack.org/78/570078/17/check/tempest-slow/161ea32/job-output...
[2] http://logs.openstack.org/78/570078/17/check/tempest-slow/161ea32/controller...
[3] https://review.openstack.org/#/c/577383/
[4] http://eavesdrop.openstack.org/irclogs/%23openstack-qa/%23openstack-qa.2019-...

--
Slawek Kaplonski
Senior software engineer
Red Hat