VM boot volume disappears from the compute node's multipath daemon when a NetApp controller is placed offline
I have a setup where each VM gets assigned two vDisks: an encrypted boot volume and a storage volume. The storage backend is NetApp (tripleo-netapp), with two controllers on the NetApp side working in active/active mode.

My test case goes as follows:
- I stop one of the active NetApp controllers.
- I stop one of my VMs using "openstack server stop".
- I then start the VM again using "openstack server start".
- The VM fails to start.

Here are my findings; I hope someone can explain the behaviour seen below.

My VM is vel1bgw01-MCM2, running on compute overcloud-sriovperformancecompute-3.localdomain:

[root@overcloud-controller-0 (vel1asbc01) cbis-admin]# openstack server show vel1bgw01-MCM2
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
| Field                                | Value                                                                                                      |
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | MANUAL                                                                                                     |
| OS-EXT-AZ:availability_zone          | zone1                                                                                                      |
| OS-EXT-SRV-ATTR:host                 | overcloud-sriovperformancecompute-3.localdomain                                                            |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | overcloud-sriovperformancecompute-3.localdomain                                                            |
| OS-EXT-SRV-ATTR:instance_name        | instance-00000e93                                                                                          |
| OS-EXT-STS:power_state               | Running                                                                                                    |
| OS-EXT-STS:task_state                | None                                                                                                       |
| OS-EXT-STS:vm_state                  | active                                                                                                     |
| OS-SRV-USG:launched_at               | 2019-12-18T15:49:37.000000                                                                                 |
| OS-SRV-USG:terminated_at             | None                                                                                                       |
| accessIPv4                           |                                                                                                            |
| accessIPv6                           |                                                                                                            |
| addresses                            | SBC01_MGW01_TIPC=192.168.48.22; SBC01_MGW01_DATAPATH_MATE=192.168.16.11; SBC01_MGW01_DATAPATH=192.168.32.8 |
| config_drive                         | True                                                                                                       |
| created                              | 2019-12-18T15:49:16Z                                                                                       |
| flavor                               | SBC_MCM (asbc_mcm)                                                                                         |
| hostId                               | 7886df0f7a3d4e131304a8eb860e6a704c5fda2a7ed751b544ff2bf5                                                   |
| id                                   | 5c70a984-89a9-44ce-876d-9e2e568eb819                                                                       |
| image                                |                                                                                                            |
| key_name                             | CBAM-b5fd59a066e8450ca9f104a69da5a043-Keypair                                                              |
| name                                 | vel1bgw01-MCM2                                                                                             |
| os-extended-volumes:volumes_attached | [{u'id': u'717e5744-4786-42dc-9e3e-3c5e6994c482'}, {u'id': u'd6cf0cf9-36d1-4b62-86b4-faa4a6642166'}]       |
| progress                             | 0                                                                                                          |
| project_id                           | 41777c6f1e7b4f8d8fd76b5e0f67e5e8                                                                           |
| properties                           |                                                                                                            |
| security_groups                      | [{u'name': u'vel1bgw01-TIPC-Security-Group'}]                                                              |
| status                               | ACTIVE                                                                                                     |
| updated                              | 2020-01-07T17:18:32Z                                                                                       |
| user_id                              | be13deba85794016a00fec9d18c5d7cf                                                                           |
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
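(Side note, in case it helps anyone follow along: the disk mapping shown in the next section can be read straight from libvirt on the compute, using the instance name from OS-EXT-SRV-ATTR:instance_name above. The grep pattern is only illustrative, nothing CBIS-specific, and the output is omitted here.)

  virsh domblklist instance-00000e93 --details
  virsh dumpxml instance-00000e93 | grep -E "disk type|source dev|serial|target dev"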
It is mapped to the following vDisks (taken from the instance's libvirt XML on compute-3):
- dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714 -> boot volume
- dm-uuid-mpath-3600a098000d9818b000018565dfa069e -> storage volume

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
  <serial>717e5744-4786-42dc-9e3e-3c5e6994c482</serial>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b000018565dfa069e'/>
  <backingStore/>
  <target dev='vdb' bus='virtio'/>
  <serial>d6cf0cf9-36d1-4b62-86b4-faa4a6642166</serial>
  <alias name='virtio-disk1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</disk>

On the compute, dmsetup info shows the corresponding device-mapper devices:

Name:              crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 5
Number of targets: 1
UUID: CRYPT-LUKS1-769cc20bc5af469c8c9075a2a6fc4aa0-crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714

Name:              mpathpy
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      32
Major, minor:      253, 4
Number of targets: 1
UUID: mpath-3600a098000d9818b0000185c5dfa0714

Name:              crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      0
Major, minor:      253, 7
Number of targets: 1
UUID: CRYPT-LUKS1-4015c585a0df4074821ca312c4caacca-crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e

Name:              mpathpz
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      28
Major, minor:      253, 6
Number of targets: 1
UUID: mpath-3600a098000d9818b000018565dfa069e

This means the boot volume is represented by dm-4 (mpathpy) while the storage volume is represented by dm-6 (mpathpz).
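(If anyone wants to double-check that correlation between dm numbers, WWIDs and the dm-crypt devices, the standard device-mapper tools are enough; I have left their output out for brevity.)

  dmsetup ls --tree
  lsblk /dev/dm-4 /dev/dm-6
  ls -l /dev/mapper/ | grep -E 'mpathpy|mpathpz'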
Dumping the multipath daemon on the compute shows that, at a steady running state, both DMs are accounted for (see below).

multipathd> show maps
name    sysfs uuid
mpathpy dm-4  3600a098000d9818b0000185c5dfa0714
mpathpz dm-6  3600a098000d9818b000018565dfa069e
mpathqi dm-12 3600a098000d9818b000018df5dfafd40
mpathqj dm-13 3600a098000d9818b000018de5dfafd10
mpathpw dm-0  3600a098000d9818b000018425dfa059f
mpathpx dm-1  3600a098000d9818b0000184c5dfa05fc
mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
mpathql dm-17 3600a098000d9818b000018e95dfafe26
mpathqh dm-9  3600a098000d9818b000018c65dfafa91

These vDisks are mapped to the following multipaths:

multipathd> show topology
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 30:0:0:82 sdm 8:192 active ready running
| `- 32:0:0:82 sdk 8:160 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 33:0:0:82 sdn 8:208 failed faulty running
  `- 31:0:0:82 sdl 8:176 failed faulty running
mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 30:0:0:229 sdr 65:16 active ready running
| `- 32:0:0:229 sdp 8:240 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 31:0:0:229 sdo 8:224 failed faulty running
  `- 33:0:0:229 sdq 65:0 failed faulty running

Now it starts getting very interesting. If I shut down controller-A from the NetApp side, dm-4 disappears, while dm-6 keeps running: its active path group now points at controller-B, and the path group towards controller-A is shown as failed.

multipathd> show topology
mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 30:0:0:229 sdr 65:16 failed faulty running
| `- 32:0:0:229 sdp 8:240 failed faulty running
`-+- policy='service-time 0' prio=11 status=active
  |- 31:0:0:229 sdo 8:224 active ready running
  `- 33:0:0:229 sdq 65:0 active ready running

multipathd> show maps
name    sysfs uuid
mpathpz dm-6  3600a098000d9818b000018565dfa069e
mpathqi dm-12 3600a098000d9818b000018df5dfafd40
mpathqj dm-13 3600a098000d9818b000018de5dfafd10
mpathpw dm-0  3600a098000d9818b000018425dfa059f
mpathpx dm-1  3600a098000d9818b0000184c5dfa05fc
mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
mpathql dm-17 3600a098000d9818b000018e95dfafe26
mpathqg dm-8  3600a098000d9818b000018c75dfafac0
mpathqh dm-9  3600a098000d9818b000018c65dfafa91
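(My suspicion, and it is only a suspicion, is that the mpathpy map is being flushed once its last path is deleted, which would explain why dm-4 vanishes when the controller is halted but survives when I only fail the paths. The settings I plan to compare on the compute, and the kind of log check I have in mind, are below; the list of options is generic multipath tuning, not something confirmed from this deployment, and I have not pasted the output.)

  multipathd -k"show config" | grep -E 'flush_on_last_del|deferred_remove|dev_loss_tmo|fast_io_fail_tmo'
  grep multipathd /var/log/messages | grep -iE 'remov|flush' | tail -50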
On the other hand, if I restore controller-A into service from the NetApp side and instead fail only the paths towards controller-A from multipathd, everything works fine: dm-4 stays present and the VM can be put back into service.

multipathd> fail path sdk
ok
multipathd> fail path sdm
ok
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 32:0:0:82 sdk 8:160 failed faulty running
| `- 30:0:0:82 sdm 8:192 failed faulty running
`-+- policy='service-time 0' prio=9 status=active
  |- 31:0:0:82 sdl 8:176 active ready running
  `- 33:0:0:82 sdn 8:208 active ready running

multipathd> reinstate path sdk
ok
multipathd> reinstate path sdm
ok
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 32:0:0:82 sdk 8:160 active ready running
| `- 30:0:0:82 sdm 8:192 active ready running
`-+- policy='service-time 0' prio=9 status=enabled
  |- 31:0:0:82 sdl 8:176 active ready running
  `- 33:0:0:82 sdn 8:208 active ready running

It is also observed that in the working case, when the VM is stopped, the storage volume disappears from multipathd (which seems normal), the instance vanishes completely from virsh list, and no trace of it can be found at the KVM level if we run ps -def | grep fd | grep <instance_ID>. However, the boot volume always remains present in the multipathd records when we stop the VM under normal conditions, i.e. without stopping a NetApp controller.

Any ideas?

Kind regards,
Ahmed
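PS: in case anyone wants to reproduce this, the sequence is just the one described at the top (controller-A halted from the array side first), for example:

  openstack server stop vel1bgw01-MCM2
  openstack server start vel1bgw01-MCM2

while watching the two maps on the compute from a second shell with something like (interval and grep pattern are arbitrary examples):

  watch -n 5 "multipath -ll | grep -A1 -E '0714|069e'"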