I have a setup where each VM is assigned two vDisks: an encrypted boot volume and a separate storage volume.
The storage backend is NetApp (tripleo-netapp), with two controllers on the NetApp side working in active/active mode.
My test case goes as follows (a rough command sketch follows the list):
- I stop one of the active NetApp controllers.
- I stop one of my VMs using openstack server stop.
- I then start the VM again using openstack server start.
- The VM fails to start.
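For reference, the OpenStack side of the test is roughly the following (a sketch; the NetApp controller itself is stopped from the array's own management interface, which is not shown here):

openstack server stop vel1bgw01-MCM2
openstack server show vel1bgw01-MCM2 -c status -c OS-EXT-STS:task_state
openstack server start vel1bgw01-MCM2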
Here are my findings; I hope someone can help explain the behaviour seen below.
My VM is vel1bgw01-MCM2, running on compute node overcloud-sriovperformancecompute-3.localdomain:
[root@overcloud-controller-0 (vel1asbc01) cbis-admin]# openstack server show vel1bgw01-MCM2
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | zone1 |
| OS-EXT-SRV-ATTR:host | overcloud-sriovperformancecompute-3.localdomain |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-sriovperformancecompute-3.localdomain |
| OS-EXT-SRV-ATTR:instance_name | instance-00000e93 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-12-18T15:49:37.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | SBC01_MGW01_TIPC=192.168.48.22; SBC01_MGW01_DATAPATH_MATE=192.168.16.11; SBC01_MGW01_DATAPATH=192.168.32.8 |
| config_drive | True |
| created | 2019-12-18T15:49:16Z |
| flavor | SBC_MCM (asbc_mcm) |
| hostId | 7886df0f7a3d4e131304a8eb860e6a704c5fda2a7ed751b544ff2bf5 |
| id | 5c70a984-89a9-44ce-876d-9e2e568eb819 |
| image | |
| key_name | CBAM-b5fd59a066e8450ca9f104a69da5a043-Keypair |
| name | vel1bgw01-MCM2 |
| os-extended-volumes:volumes_attached | [{u'id': u'717e5744-4786-42dc-9e3e-3c5e6994c482'}, {u'id': u'd6cf0cf9-36d1-4b62-86b4-faa4a6642166'}] |
| progress | 0 |
| project_id | 41777c6f1e7b4f8d8fd76b5e0f67e5e8 |
| properties | |
| security_groups | [{u'name': u'vel1bgw01-TIPC-Security-Group'}] |
| status | ACTIVE |
| updated | 2020-01-07T17:18:32Z |
| user_id | be13deba85794016a00fec9d18c5d7cf |
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
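The two IDs under os-extended-volumes:volumes_attached are the boot and storage volumes; they can be cross-checked against Cinder with something like the following (a sketch, using the same credentials as the server show above):

openstack volume show 717e5744-4786-42dc-9e3e-3c5e6994c482 -c status -c bootable -c attachments
openstack volume show d6cf0cf9-36d1-4b62-86b4-faa4a6642166 -c status -c bootable -c attachments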
It is mapped to the following vDisks (seen in the virsh dumpxml of the domain on compute-3):
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714'/>
<backingStore/>
<target dev='vda' bus='virtio'/>
<serial>717e5744-4786-42dc-9e3e-3c5e6994c482</serial>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none' io='native'/>
<source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b000018565dfa069e'/>
<backingStore/>
<target dev='vdb' bus='virtio'/>
<serial>d6cf0cf9-36d1-4b62-86b4-faa4a6642166</serial>
<alias name='virtio-disk1'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</disk>
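The device-mapper entries behind those two devices are shown next. The listing matches dmsetup info output, presumably gathered on compute-3 with something like the following (the exact invocation is my assumption):

dmsetup info crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
dmsetup info mpathpy
dmsetup info crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
dmsetup info mpathpz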
Name: crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 5
Number of targets: 1
UUID: CRYPT-LUKS1-769cc20bc5af469c8c9075a2a6fc4aa0-crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
Name: mpathpy
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 32
Major, minor: 253, 4
Number of targets: 1
UUID: mpath-3600a098000d9818b0000185c5dfa0714
Name: crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 7
Number of targets: 1
UUID: CRYPT-LUKS1-4015c585a0df4074821ca312c4caacca-crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
Name: mpathpz
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 28
Major, minor: 253, 6
Number of targets: 1
UUID: mpath-3600a098000d9818b000018565dfa069e
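The stacking (LUKS crypt target on top of the multipath map) can be confirmed from the device-mapper dependency tree, for example (a sketch; device names are taken from the output above):

dmsetup ls --tree
lsblk /dev/dm-4 /dev/dm-6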
This means the boot volume is backed by dm-4 (mpathpy) while the storage volume is backed by dm-6 (mpathpz).
Dumping the multipath daemon state on the compute node shows that, at a steady running state, both DMs are accounted for (see below).
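The multipathd> prompts below come from the daemon's interactive shell; assuming a standard multipath-tools installation, it is entered with:

multipathd -k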
multipathd> show maps
name sysfs uuid
mpathpy dm-4 3600a098000d9818b0000185c5dfa0714
mpathpz dm-6 3600a098000d9818b000018565dfa069e
mpathqi dm-12 3600a098000d9818b000018df5dfafd40
mpathqj dm-13 3600a098000d9818b000018de5dfafd10
mpathpw dm-0 3600a098000d9818b000018425dfa059f
mpathpx dm-1 3600a098000d9818b0000184c5dfa05fc
mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
mpathql dm-17 3600a098000d9818b000018e95dfafe26
mpathqh dm-9 3600a098000d9818b000018c65dfafa91
These vDisks are mapped to the following multipath devices:
multipathd> show topology
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 30:0:0:82 sdm 8:192 active ready running
| `- 32:0:0:82 sdk 8:160 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
|- 33:0:0:82 sdn 8:208 failed faulty running
`- 31:0:0:82 sdl 8:176 failed faulty running
mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 30:0:0:229 sdr 65:16 active ready running
| `- 32:0:0:229 sdp 8:240 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
|- 31:0:0:229 sdo 8:224 failed faulty running
`- 33:0:0:229 sdq 65:0 failed faulty running
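To confirm which NetApp controller each of the four paths per map actually traverses, the SCSI transport address of every sd device can be listed, for example (a sketch; lsscsi comes from the lsscsi package and may need to be installed):

lsscsi -t
# and, inside the multipathd interactive shell, per-path state and priority:
# multipathd> show paths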
Now it starts getting very interesting: if I shut down controller A on the NetApp side, dm-4 disappears entirely, but dm-6 keeps running and fails over correctly, with the active path group now going through controller B while the standby path group through controller A is displayed as failed.
multipathd> show topology
mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 30:0:0:229 sdr 65:16 failed faulty running
| `- 32:0:0:229 sdp 8:240 failed faulty running
`-+- policy='service-time 0' prio=11 status=active
|- 31:0:0:229 sdo 8:224 active ready running
`- 33:0:0:229 sdq 65:0 active ready running
multipathd> show maps
name sysfs uuid
mpathpz dm-6 3600a098000d9818b000018565dfa069e
mpathqi dm-12 3600a098000d9818b000018df5dfafd40
mpathqj dm-13 3600a098000d9818b000018de5dfafd10
mpathpw dm-0 3600a098000d9818b000018425dfa059f
mpathpx dm-1 3600a098000d9818b0000184c5dfa05fc
mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
mpathql dm-17 3600a098000d9818b000018e95dfafe26
mpathqg dm-8 3600a098000d9818b000018c75dfafac0
mpathqh dm-9 3600a098000d9818b000018c65dfafa91
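When dm-4 disappears like this, it is worth checking whether only the map was flushed while the WWID is still known to the host (a sketch; commands run on the compute node, WWID taken from the earlier output):

multipath -ll 3600a098000d9818b0000185c5dfa0714
ls -l /dev/disk/by-id/ | grep 3600a098000d9818b0000185c5dfa0714
journalctl -u multipathd | grep -i -e mpathpy -e 0714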
If instead I restore controller A into service on the NetApp side and then fail only the paths towards controller A from within multipathd, everything works fine: dm-4 is still present and the VM can be put back into service.
multipathd> fail path sdk
ok
multipathd>
multipathd> fail path sdm
ok
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 32:0:0:82 sdk 8:160 failed faulty running
| `- 30:0:0:82 sdm 8:192 failed faulty running
`-+- policy='service-time 0' prio=9 status=active
|- 31:0:0:82 sdl 8:176 active ready running
`- 33:0:0:82 sdn 8:208 active ready running
multipathd> reinstate path sdk
ok
multipathd>
multipathd> reinstate path sdm
ok
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 32:0:0:82 sdk 8:160 active ready running
| `- 30:0:0:82 sdm 8:192 active ready running
`-+- policy='service-time 0' prio=9 status=enabled
|- 31:0:0:82 sdl 8:176 active ready running
`- 33:0:0:82 sdn 8:208 active ready running
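For completeness, after controller A is back in service I would expect the vanished map to be re-creatable by rescanning the SCSI hosts and reloading the multipath maps, roughly as follows (a sketch only, not verified on this setup; host numbers 30-33 are taken from the topology output):

for h in 30 31 32 33; do echo "- - -" > /sys/class/scsi_host/host${h}/scan; done
multipath -r
multipath -ll 3600a098000d9818b0000185c5dfa0714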
In the working case it is observed that, when the VM is stopped, the storage volume disappears (which seems normal), the instance vanishes completely from virsh list, and no trace of it can be found at the KVM level with ps -ef | grep <instance_ID>. However, the boot volume is always still present in the multipathd records when we stop the VM under normal conditions, i.e. without stopping a NetApp controller.
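For the failing start itself, the concrete error should be visible in the nova-compute log on compute-3 (the log path below is an assumption; it depends on whether nova-compute runs containerized):

grep -i -e 5c70a984-89a9-44ce-876d-9e2e568eb819 -e multipath /var/log/containers/nova/nova-compute.log
# on non-containerized nodes the log is typically /var/log/nova/nova-compute.log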
Any ideas?