On 12/10/2018 1:21 AM, Ghanshyam Mann wrote:
I am getting a few failures on tempest-slow (tempest-multinode-full) for stable branches, which might take time to fix, so until then let's keep nova-multiattach on stable branches and remove it only from master.
Bug https://bugs.launchpad.net/cinder/+bug/1807723/ is blocking removal of the nova-multiattach job from master.

Something is going on with TestMultiAttachVolumeSwap when there are two hosts. That test is marked slow, but it runs in nova-multiattach, which also runs slow tests and is a single-node job. With tempest change https://review.openstack.org/#/c/606978/ TestMultiAttachVolumeSwap gets run in the tempest-slow job, which is multi-node, and as a result I'm seeing race failures in that test.

I've put my notes into the bug, but I need some help from Cinder at this point. I initially thought I had identified a very obvious problem in nova, but now I think nova is working as designed (although it is very confusing), and we're hitting a race during the swap: deleting the attachment record for the volume/server we swapped *from* fails saying the target is still active. The fact that we used to run this test on a single-node job likely masked the race.

As far as next steps, we could:

1. Move forward with removing nova-multiattach but skip TestMultiAttachVolumeSwap until bug 1807723 is fixed (a skip sketch follows below).

2. Try to work around bug 1807723 in Tempest by creating the multiattach volume and servers on the same host, by pinning them to an AZ (also sketched below).

3. Add some retry logic to Cinder and hope it is just a race failure when the volume is connected to servers across different hosts (rough sketch below). Ultimately this is the best scenario, but I'm just not sure yet whether that is really the issue, or whether something is really messed up in the volume backend when this fails, in which case retries wouldn't help.
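For option 1, the skip could use Tempest's standard skip_because decorator, something like this (the base classes and test body are elided, and treat the exact placement as illustrative):

    from tempest.lib import decorators

    class TestMultiAttachVolumeSwap(object):  # real base classes elided

        @decorators.skip_because(bug='1807723')
        def test_volume_swap_with_multiattach(self):
            pass  # existing test body unchanged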
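For option 2, the Tempest-side workaround would boot both servers into the same AZ so the volume connections stay on one compute host. The AZ name below is a placeholder, it would have to come from test config, and we'd need the job set up so that AZ maps to a single host:

    # inside the test method; 'same-host-az' is hypothetical
    az = 'same-host-az'
    server1 = self.create_test_server(availability_zone=az,
                                      wait_until='ACTIVE')
    server2 = self.create_test_server(availability_zone=az,
                                      wait_until='ACTIVE')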
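For option 3, the retry would be the usual shape below. To be clear, this is only a sketch and not actual Cinder code; the exception type and the delete callable are placeholders, since I'm not yet sure which backend error we'd key off:

    import time

    class TargetStillActive(Exception):
        """Placeholder for the backend 'target is still active' error."""

    def delete_attachment_with_retry(delete_fn, attachment_id,
                                     attempts=5, interval=2):
        """Call delete_fn(attachment_id), retrying the race where the
        target is still reported active right after the swap."""
        for attempt in range(1, attempts + 1):
            try:
                return delete_fn(attachment_id)
            except TargetStillActive:
                if attempt == attempts:
                    raise  # not a transient race; surface the failure
                time.sleep(interval)

If it's really just a race, a few short retries should paper over it; if the backend is actually wedged, we'd still fail after the last attempt, which is what we'd want.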
-- 

Thanks,

Matt