On Tue, 23 Mar 2021 at 17:46, Gorka Eguileor <geguileo@redhat.com> wrote:
On 09/03, Lee Yarwood wrote:
Hello all,
I reported the following bug last week, but I've yet to get any real feedback after asking a few times in IRC.
Running parallel iSCSI/LVM c-vol backends is causing random failures in CI: https://bugs.launchpad.net/cinder/+bug/1917750
AFAICT tgtadm is causing this behaviour. As I've stated in the bug with Fedora 32 and lioadm I don't see the WWN conflict between the two backends. Does anyone know if using lioadm is an option on Focal?
Thanks in advance,
Lee
Hi Lee,
Sorry for the late reply.
I started looking at the case some time ago but got "distracted" with some other issue.
I am no expert on STGT, since I always work with LIO, but from what I could gather this seems to be caused by the combination of:
- Using the tgtadm helper.
- Running 2 different cinder-volume services on 2 different hosts (one on the compute node and another on the controller).
- Using the same volume_backend_name for both LVM backends.
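A minimal sketch of the problematic setup, assuming a stock devstack-style LVM backend (the section name, volume group and backend name below are illustrative, not taken from the actual jobs):

```ini
# /etc/cinder/cinder.conf fragment, identical on BOTH hosts
# (controller and compute). Illustrative values only.
[lvmdriver-1]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
volume_group = stack-volumes-lvmdriver-1
target_helper = tgtadm
volume_backend_name = LVM_iSCSI   ; same name on both hosts
```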
If we were running a single cinder-volume service with 2 backends this issue wouldn't happen (I checked).
If we used a different volume_backend_name for each of the 2 services and used a volume type picking one of them for the operations, this wouldn't happen either.
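That second workaround could look roughly like this (the backend and type names here are made up for illustration; each service would set its own volume_backend_name in cinder.conf first):

```
# e.g. LVM_controller on one host, LVM_compute on the other, then:
openstack volume type create lvm-controller
openstack volume type set --property volume_backend_name=LVM_controller lvm-controller
openstack volume create --type lvm-controller --size 1 test-vol
```

With the type set, the scheduler only ever places these volumes on the one named backend, so the two hosts never expose colliding first targets to the same consumer.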
If we used LIO instead, this wouldn't happen.
The cause is STGT's automatic generation of the serial/WWN for volumes, which appears to be deterministic. The first target created on a host will have a 60000000000000000e0000000001 prefix followed by the LUN number (the leading 3 we see in the connection_info just indicates that the WWN is of NAA type).
This means that the first volume exposed by STGT on any host will ALWAYS have the same WWN, and that will mess things up if we attach both to the same host, because the premise of a WWN is its uniqueness; everything in Cinder and OS-Brick assumes this, and that assumption will not be changed.
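To make the collision concrete, here is a small illustrative sketch. The WWN construction below is an assumption based purely on the observed prefix described above, not on STGT source code; the point is only that the value does not vary per host:

```python
# Illustrative only: models the deterministic WWN behaviour described above.
# The real STGT logic may differ; what matters is it is the same on every host.
STGT_PREFIX = "60000000000000000e0000000001"

def first_target_wwn(lun_number: int) -> str:
    """WWN the first target on ANY host would get for a given LUN (assumed)."""
    return STGT_PREFIX + str(lun_number)

# connection_info prepends "3" to mark the identifier as NAA type.
wwn_controller = "3" + first_target_wwn(1)
wwn_compute = "3" + first_target_wwn(1)

# Two different hosts, identical WWN -> os-brick's uniqueness premise breaks.
print(wwn_controller == wwn_compute)  # True
```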
For LIO it seems that the generation of the serial/WWN is non-deterministic (or at least not the same on all hosts), so the issue won't happen in this specific deployment configuration.
So the options to prevent this issue are: run both backends on the controller node, use a different volume_backend_name for each service together with a volume type, or use LIO.
Thanks Gorka,

Just to copy my reply from the bug here: I'm not entirely sure how using a different volume_backend_name would help? As you say above, the first target on both hosts would still have the 60000000000000000e0000000001 prefix regardless of the name, right?

Moving to a single-service multibackend approach would be best, but given the required job changes etc. it isn't something I think we can do in the short term.

Moving to lioadm is still my preferred short-term solution to this, with the following devstack change awaiting reviews:

cinder: Default CINDER_ISCSI_HELPER to lioadm on Ubuntu
https://review.opendev.org/c/openstack/devstack/+/779624

Cheers,

Lee
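For anyone wanting to try this ahead of that review merging, the same switch can be made per job with a local.conf fragment along these lines (a sketch; the review itself changes the devstack default rather than requiring this):

```ini
# local.conf fragment: use the LIO helper instead of tgtadm.
[[local|localrc]]
CINDER_ISCSI_HELPER=lioadm
```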