On Tue, 23 Mar 2021 at 17:46, Gorka Eguileor <geguileo@redhat.com> wrote:
On 09/03, Lee Yarwood wrote:
Hello all,
I reported the following bug last week, but I've yet to get any real feedback after asking a few times in IRC.
Running parallel iSCSI/LVM c-vol backends is causing random failures in CI: https://bugs.launchpad.net/cinder/+bug/1917750
AFAICT tgtadm is causing this behaviour. As I've stated in the bug with Fedora 32 and lioadm I don't see the WWN conflict between the two backends. Does anyone know if using lioadm is an option on Focal?
Thanks in advance,
Lee
Hi Lee,
Sorry for the late reply.
I started looking at the case some time ago but got "distracted" with some other issue.
I am no expert on STGT, since I always work with LIO, but from what I could gather this seems to be caused by the combination of:
- Using the tgtadm helper.
- Running 2 different cinder-volume services on 2 different hosts (one on the compute node and another on the controller).
- Using the same volume_backend_name for both LVM backends.
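A minimal sketch of the problematic setup, assuming a stock devstack-style LVM backend (the section name, volume group and backend name below are illustrative, not taken from the actual jobs):

```ini
# /etc/cinder/cinder.conf fragment, identical on BOTH hosts
# (controller and compute). Illustrative values only.
[lvmdriver-1]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
volume_group = stack-volumes-lvmdriver-1
target_helper = tgtadm
volume_backend_name = LVM_iSCSI   ; same name on both hosts
```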
If we were running a single cinder-volume service with 2 backends this issue wouldn't happen (I checked).
If we used a different volume_backend_name for each of the 2 services and used a volume type picking one of them for the operations, this wouldn't happen either.
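That second workaround could look roughly like this (the backend and type names here are made up for illustration; each service would set its own volume_backend_name in cinder.conf first):

```
# e.g. LVM_controller on one host, LVM_compute on the other, then:
openstack volume type create lvm-controller
openstack volume type set --property volume_backend_name=LVM_controller lvm-controller
openstack volume create --type lvm-controller --size 1 test-vol
```

With the type set, the scheduler only ever places these volumes on the one named backend, so the two hosts never expose colliding first targets to the same consumer.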
If we used LIO instead, this wouldn't happen.
The cause is STGT's automatic generation of the serial/WWN for volumes, which appears to be deterministic. The first target created on a host will have a 60000000000000000e0000000001 prefix followed by the LUN number (the leading 3 we see in the connection_info just indicates that the WWN is of NAA type).
This means that the first volume exposed by STGT on any host will ALWAYS have the same WWN, and that will mess things up if we attach both to the same host, because the premise of a WWN is its uniqueness; everything in Cinder and OS-Brick assumes this, and that assumption will not be changed.
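To make the collision concrete, here is a small illustrative sketch. The WWN construction below is an assumption based purely on the observed prefix described above, not on STGT source code; the point is only that the value does not vary per host:

```python
# Illustrative only: models the deterministic WWN behaviour described above.
# The real STGT logic may differ; what matters is it is the same on every host.
STGT_PREFIX = "60000000000000000e0000000001"

def first_target_wwn(lun_number: int) -> str:
    """WWN the first target on ANY host would get for a given LUN (assumed)."""
    return STGT_PREFIX + str(lun_number)

# connection_info prepends "3" to mark the identifier as NAA type.
wwn_controller = "3" + first_target_wwn(1)
wwn_compute = "3" + first_target_wwn(1)

# Two different hosts, identical WWN -> os-brick's uniqueness premise breaks.
print(wwn_controller == wwn_compute)  # True
```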
For LIO it seems that the generation of the serial/WWN is non-deterministic (or at least not the same on all hosts), so the issue won't happen in this specific deployment configuration.
So the options to prevent this issue are: run both backends on the controller node, use a different volume_backend_name for each service together with a volume type, or use LIO.
Thanks Gorka,

Just to copy my reply from the bug here: I'm not entirely sure how using a different volume_backend_name would help? As you say above, the first target on both hosts would still have the 60000000000000000e0000000001 prefix regardless of the name, right?

Moving to a single-service multibackend approach would be best, but given the required job changes etc. it isn't something I think we can do in the short term.

Moving to lioadm is still my preferred short-term solution to this, with the following devstack change awaiting reviews:

cinder: Default CINDER_ISCSI_HELPER to lioadm on Ubuntu
https://review.opendev.org/c/openstack/devstack/+/779624

Cheers,

Lee
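For anyone wanting to try this ahead of that review merging, the same switch can be made per job with a local.conf fragment along these lines (a sketch; the review itself changes the devstack default rather than requiring this):

```ini
# local.conf fragment: use the LIO helper instead of tgtadm.
[[local|localrc]]
CINDER_ISCSI_HELPER=lioadm
```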