[neutron][CI] How to reduce number of rechecks - brainstorming

Sean Mooney smooney at redhat.com
Thu Nov 18 20:57:55 UTC 2021


On Thu, 2021-11-18 at 16:24 +0000, Jeremy Stanley wrote:
> On 2021-11-18 08:15:06 -0800 (-0800), Dan Smith wrote:
> [...]
> > Absolutely agree, humans are not good at making these decisions.
> > Despite "trust" in the core team, and even using a less-loaded
> > word than "abuse," I really don't think that even allowing the
> > option to override flaky tests by force merge is the right
> > solution (at all).
> [...]
> 
> Just about any time we Gerrit admins have decided to bypass testing
> to merge some change (and to be clear, we really don't like to if we
> can avoid it), we introduce a new test-breaking bug we then need to
> troubleshoot and fix. It's a humbling reminder that even though you
> may feel absolutely sure something's safe to merge without passing
> tests, you're probably wrong.
Well, the example I gave is a failure in the interaction between nova and cinder, failing in the neutron gate.
There is no way the neutron patch under review could cause that failure to happen, and I chose
a specific example of the intermittent failure we have with the compute volume detach tests, where it looks like the bug
is actually in the tempest test.
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_67c/810915/2/gate/nova-live-migration/67c89da/testr_results.html

It appears that, for some reason, attaching a cinder volume and live migrating the VM while the kernel/OS in the VM is still booting up can result in a kernel panic.
This has been an ongoing battle to solve for many weeks.

There is no way that a change in a neutron, glance, or keystone patch could have caused the guest kernel to crash.
https://bugs.launchpad.net/nova/+bug/1950310 and https://bugs.launchpad.net/nova/+bug/1939108 are two of the related bugs.
If those projects run tempest.api.compute.admin.test_live_migration.LiveMigrationTest* in any of their jobs, however, they could have been impacted by this.

Lee Yarwood has started implementing a very old tempest spec, https://specs.openstack.org/openstack/qa-specs/specs/tempest/implemented/ssh-auth-strategy.html,
for this, and we think that will fix the test failure: https://review.opendev.org/c/openstack/tempest/+/817772/2
I suspect we have many other cases in tempest today where we have intermittent failures caused by the guest OS not being ready before
we do operations on the guest, beyond the current volume attach/detach issues.
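To make the "guest not ready" failure mode concrete: the idea behind the fix is to wait until the guest actually answers on its SSH port (proving the OS has finished booting) before attaching volumes or live migrating. The following is a minimal illustrative sketch of that kind of readiness poll, not the actual tempest implementation; the real change uses tempest's own SSH validation machinery, and the function name and parameters here are hypothetical.

```python
import socket
import time


def wait_for_guest_ssh(host, port=22, timeout=300.0, interval=5.0):
    """Poll until the guest's SSH port accepts a connection and sends an
    SSH banner, which is a reasonable signal the guest OS has booted.

    Hypothetical helper for illustration only; tempest's real validation
    additionally authenticates over SSH rather than just reading a banner.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection handles the TCP handshake with its own timeout
            with socket.create_connection((host, port), timeout=interval) as sock:
                banner = sock.recv(64)
                # An SSH server greets with a line starting "SSH-" (RFC 4253)
                if banner.startswith(b"SSH-"):
                    return True
        except OSError:
            # Port not open yet (or reset mid-boot); keep polling
            pass
        time.sleep(interval)
    return False
```

A test would call something like `wait_for_guest_ssh(server_ip)` and only proceed to the volume attach / live migration steps once it returns True, instead of racing the guest's boot.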

I did not suggest allowing the CI to be overridden because I think that is generally a good idea;
it's not, but sometimes there are failures that we are actively trying to fix but have not found a solution for in months.

I'm pretty sure this live migration test prevented patches to the ironic virt driver from landing not so long ago, requiring several retries.
The ironic virt driver obviously does not support live migration, and the change was not touching any other part of nova, so the failure was unrelated.
https://review.opendev.org/c/openstack/nova/+/799327 is the change I was thinking of; the master version needed 3 rechecks and the backport needed 6 more:
https://review.opendev.org/c/openstack/nova/+/799772. That may actually have been caused by https://bugs.launchpad.net/nova/+bug/1931702, which is another bug for
a similar kernel panic, but I would not be surprised if it was actually the same root cause.


I think that point was lost in my original message.
The point I was trying to make is that sometimes the failure is not about the code under review; it's because the test is wrong.
We should fix the test, but it can be very frustrating if you recheck something 3-4 times, where it passes in check and fails in gate, for something you know is unrelated,
but you don't want to disable the test because you don't want to lose coverage for something that typically fails only a small percentage of the time.


regards
sean.




