[cinder][nova] Running parallel iSCSI/LVM c-vol backends is causing random failures in CI
geguileo at redhat.com
Tue Mar 23 17:46:21 UTC 2021
On 09/03, Lee Yarwood wrote:
> Hello all,
> I reported the following bug last week but I've yet to get any real
> feedback after asking a few times in irc.
> Running parallel iSCSI/LVM c-vol backends is causing random failures in CI
> AFAICT tgtadm is causing this behaviour. As I've stated in the bug
> with Fedora 32 and lioadm I don't see the WWN conflict between the two
> backends. Does anyone know if using lioadm is an option on Focal?
> Thanks in advance,
Sorry for the late reply.
I started looking at the case some time ago but got "distracted" with
some other issue.
I am no expert on STGT, since I always work with LIO, but from what I
could gather this seems to be caused by the combination of:
- Using the tgtadm helper
- Having 2 different cinder-volume services running on 2 different hosts
  (one on the compute node and another on the controller).
- Using the same volume_backend_name for both LVM backends.
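For illustration, a minimal sketch of the problematic setup (the option
names are real cinder.conf options; the section name lvmdriver-1 and the
shared backend name are just examples):

```ini
# controller node's cinder.conf
[lvmdriver-1]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
target_helper = tgtadm
volume_backend_name = lvmdriver-1

# compute node's cinder.conf (a second, independent cinder-volume service)
[lvmdriver-1]
volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
target_helper = tgtadm
volume_backend_name = lvmdriver-1  ; same backend name on both hosts
```

With the same volume_backend_name, the scheduler treats the two services
as interchangeable, so two "first volumes" (one per host) can end up
attached to the same compute node.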
If we were running a single cinder-volume service with 2 backends this
issue wouldn't happen (I checked).
If we used a different volume_backend_name for each of the 2 services
and used a volume type picking one of them for the operations, this
wouldn't happen either.
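As an illustration of that second workaround, something along these
lines (standard openstack CLI; the type and backend names here are made
up for the example):

```shell
# Create one volume type per backend and pin each type to a distinct
# volume_backend_name, so every operation lands on a known service.
openstack volume type create lvm-controller
openstack volume type set --property volume_backend_name=lvm-ctrl lvm-controller
openstack volume type create lvm-compute
openstack volume type set --property volume_backend_name=lvm-cmp lvm-compute

# Then always create volumes with an explicit type:
openstack volume create --type lvm-controller --size 1 vol1
```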
If we used LIO instead, this wouldn't happen.
The cause is the automatic generation of the serial/wwn for volumes by
STGT, which seems to be deterministic. The first target created on a
host will have a 60000000000000000e0000000001 prefix followed by the LUN
number (the 3 before it that we see in the connection_info just states
that the WWN is of NAA type).
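To make the collision concrete, here is a small sketch of that
generation scheme as I understand it from the observed behaviour (the
prefix is the value seen in the bug, not something taken from tgtd's
code, so treat this as an assumption):

```python
# Deterministic prefix that tgtd appears to use for the first target
# created on a host (observed in the CI failures, assumed here).
STGT_PREFIX = "60000000000000000e0000000001"


def stgt_style_wwn(lun: int) -> str:
    """Build the WWN as it shows up in connection_info.

    tgtd appends the LUN number to the fixed prefix; the leading '3'
    only marks the identifier as NAA type.
    """
    return "3" + STGT_PREFIX + str(lun)


# The first LUN exported on host A and the first LUN exported on host B
# produce the *same* identifier, even though they are different volumes:
wwn_host_a = stgt_style_wwn(1)
wwn_host_b = stgt_style_wwn(1)
assert wwn_host_a == wwn_host_b
```

Attach both of those volumes to the same compute node and os-brick sees
two devices claiming to be the same one.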
This means that the first volume exposed by STGT on any host will ALWAYS
have the same WWN, and things will break if we attach both to the same
host: the whole premise of a WWN is its uniqueness, everything in Cinder
and OS-Brick assumes it, and that assumption will not be changed.
For LIO it seems that the generation of the serial/wwn is
non-deterministic (or at least not the same on all hosts), so the issue
won't happen in this specific deployment configuration.
So the options to prevent this issue are to run both backends on the
controller node, use a different volume_backend_name for each service
together with a volume type, or use LIO instead.
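For the LIO route, the change is a single option in each backend section
(target_helper is the real Cinder option; lioadm needs the targetcli
tooling installed on the host, which is the open question for Focal):

```ini
[lvmdriver-1]
# Use the LIO helper instead of tgtadm; LIO's serial/wwn generation is
# not identical across hosts, so the first volumes no longer collide.
target_helper = lioadm
```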