I have a setup where each VM is assigned two vDisks: an encrypted boot
volume and a separate storage volume.
The storage backend is NetApp (tripleo-netapp), with two controllers on the
NetApp side working in active/active mode.
My test case goes as follows:
- I stop one of the active NetApp controllers.
- I stop one of my VMs using openstack server stop.
- I then start the VM again using openstack server start.
- The VM fails to start.
Here are my findings. I hope someone can help explain the behaviour seen
below:
My VM is vel1bgw01-MCM2, running on compute node
overcloud-sriovperformancecompute-3.localdomain:
[root@overcloud-controller-0 (vel1asbc01) cbis-admin]# openstack server show vel1bgw01-MCM2
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
| Field | Value |
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | zone1 |
| OS-EXT-SRV-ATTR:host | overcloud-sriovperformancecompute-3.localdomain |
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-sriovperformancecompute-3.localdomain |
| OS-EXT-SRV-ATTR:instance_name | instance-00000e93 |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-12-18T15:49:37.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | SBC01_MGW01_TIPC=192.168.48.22; SBC01_MGW01_DATAPATH_MATE=192.168.16.11; SBC01_MGW01_DATAPATH=192.168.32.8 |
| config_drive | True |
| created | 2019-12-18T15:49:16Z |
| flavor | SBC_MCM (asbc_mcm) |
| hostId | 7886df0f7a3d4e131304a8eb860e6a704c5fda2a7ed751b544ff2bf5 |
| id | 5c70a984-89a9-44ce-876d-9e2e568eb819 |
| image | |
| key_name | CBAM-b5fd59a066e8450ca9f104a69da5a043-Keypair |
| name | vel1bgw01-MCM2 |
| os-extended-volumes:volumes_attached | [{u'id': u'717e5744-4786-42dc-9e3e-3c5e6994c482'}, {u'id': u'd6cf0cf9-36d1-4b62-86b4-faa4a6642166'}] |
| progress | 0 |
| project_id | 41777c6f1e7b4f8d8fd76b5e0f67e5e8 |
| properties | |
| security_groups | [{u'name': u'vel1bgw01-TIPC-Security-Group'}] |
| status | ACTIVE |
| updated | 2020-01-07T17:18:32Z |
| user_id | be13deba85794016a00fec9d18c5d7cf |
+--------------------------------------+------------------------------------------------------------------------------------------------------------+
It is mapped to the following vDisks (seen using virsh list on compute-3):
- dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714 -> boot volume
- dm-uuid-mpath-3600a098000d9818b000018565dfa069e -> storage volume
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
  <serial>717e5744-4786-42dc-9e3e-3c5e6994c482</serial>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>
  <source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b000018565dfa069e'/>
  <backingStore/>
  <target dev='vdb' bus='virtio'/>
  <serial>d6cf0cf9-36d1-4b62-86b4-faa4a6642166</serial>
  <alias name='virtio-disk1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</disk>
Name: crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 5
Number of targets: 1
UUID: CRYPT-LUKS1-769cc20bc5af469c8c9075a2a6fc4aa0-crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
Name: mpathpy
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 32
Major, minor: 253, 4
Number of targets: 1
UUID: mpath-3600a098000d9818b0000185c5dfa0714
Name: crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 7
Number of targets: 1
UUID: CRYPT-LUKS1-4015c585a0df4074821ca312c4caacca-crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
Name: mpathpz
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 28
Major, minor: 253, 6
Number of targets: 1
UUID: mpath-3600a098000d9818b000018565dfa069e
This means the boot volume is backed by dm-4 (mpathpy), while the storage
volume is backed by dm-6 (mpathpz).
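The mapping from device-mapper name to dm-N can be read off the "Major, minor" field, since the kernel node of a device-mapper device is dm-<minor>. A minimal Python sketch of that lookup follows; the function name dm_nodes and the trimmed SAMPLE text are illustrative only, and the parser assumes input shaped like the "Name:" / "Major, minor:" listing above:

```python
def dm_nodes(dmsetup_info: str) -> dict:
    """Map each device-mapper name to its kernel node (dm-<minor>)."""
    nodes = {}
    name = None
    for line in dmsetup_info.splitlines():
        line = line.strip()
        if line.startswith("Name:"):
            # Remember the device this stanza describes.
            name = line.split(":", 1)[1].strip()
        elif line.startswith("Major, minor:") and name is not None:
            # "Major, minor: 253, 4" -> the last comma-separated field is the minor.
            minor = line.split(",")[-1].strip()
            nodes[name] = f"dm-{minor}"
    return nodes


# Trimmed copy of the listing above (only the fields the parser needs).
SAMPLE = """\
Name: crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
Major, minor: 253, 5
Name: mpathpy
Major, minor: 253, 4
Name: crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
Major, minor: 253, 7
Name: mpathpz
Major, minor: 253, 6
"""

if __name__ == "__main__":
    for name, node in dm_nodes(SAMPLE).items():
        print(f"{node}  {name}")
```

With the sample above, mpathpy resolves to dm-4 and mpathpz to dm-6, and the two LUKS layers on top of them resolve to dm-5 and dm-7.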
Dumping the multipath daemon state on the controller shows that, in the
steady running state, both DM devices are accounted for (see below).
multipathd> show maps
name    sysfs uuid
mpathpy dm-4  3600a098000d9818b0000185c5dfa0714
mpathpz dm-6  3600a098000d9818b000018565dfa069e
mpathqi dm-12 3600a098000d9818b000018df5dfafd40
mpathqj dm-13 3600a098000d9818b000018de5dfafd10
mpathpw dm-0  3600a098000d9818b000018425dfa059f
mpathpx dm-1  3600a098000d9818b0000184c5dfa05fc
mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
mpathql dm-17 3600a098000d9818b000018e95dfafe26
mpathqh dm-9  3600a098000d9818b000018c65dfafa91
These vDisks are mapped to the following multipaths:
multipathd> show topology
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 30:0:0:82 sdm 8:192 active ready running
| `- 32:0:0:82 sdk 8:160 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 33:0:0:82 sdn 8:208 failed faulty running
  `- 31:0:0:82 sdl 8:176 failed faulty running
mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 30:0:0:229 sdr 65:16 active ready running
| `- 32:0:0:229 sdp 8:240 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 31:0:0:229 sdo 8:224 failed faulty running
  `- 33:0:0:229 sdq 65:0 failed faulty running
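To compare path health across the different scenarios, the topology output can be reduced to a per-map table of path states. The Python sketch below does that with two regular expressions; path_states, MAP_RE and PATH_RE are hypothetical helper names, and the patterns assume the exact line layout shown above (map header, then "H:C:T:L sdX major:minor dm_state check_state online_state" path rows):

```python
import re

# "mpathpy (3600a0...) dm-4 NETAPP ,INF-01-00" -> map name, WWID, dm node
MAP_RE = re.compile(r"^(\S+) \((\w+)\) (dm-\d+)")
# "| |- 30:0:0:82 sdm 8:192 active ready running" -> HCTL, device, three state words
PATH_RE = re.compile(
    r"(\d+:\d+:\d+:\d+)\s+(sd\w+)\s+\d+:\d+\s+(\w+)\s+(\w+)\s+(\w+)"
)


def path_states(topology: str) -> dict:
    """Return {map_name: {path_device: dm_state}} from `show topology` text."""
    summary = {}
    current = None
    for line in topology.splitlines():
        m = MAP_RE.match(line)
        if m:
            current = m.group(1)
            summary[current] = {}
            continue
        p = PATH_RE.search(line)
        if p and current is not None:
            summary[current][p.group(2)] = p.group(3)
    return summary


# Copy of the mpathpy stanza from the steady-state output above.
SAMPLE = """\
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 30:0:0:82 sdm 8:192 active ready running
| `- 32:0:0:82 sdk 8:160 active ready running
`-+- policy='service-time 0' prio=0 status=enabled
  |- 33:0:0:82 sdn 8:208 failed faulty running
  `- 31:0:0:82 sdl 8:176 failed faulty running
"""

if __name__ == "__main__":
    for name, paths in path_states(SAMPLE).items():
        failed = sorted(dev for dev, state in paths.items() if state == "failed")
        print(f"{name}: {len(paths)} paths, failed: {', '.join(failed) or 'none'}")
```

Running the same function over output captured before and after the controller shutdown makes it easy to see which path group each map has switched to.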
Now it gets interesting. If I shut down controller A on the NetApp side,
dm-4 disappears, but dm-6 keeps running: multipathd detects controller B as
the active path, while the standby path through controller A is now shown
as failed.
multipathd> show topology
mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 30:0:0:229 sdr 65:16 failed faulty running
| `- 32:0:0:229 sdp 8:240 failed faulty running
`-+- policy='service-time 0' prio=11 status=active
  |- 31:0:0:229 sdo 8:224 active ready running
  `- 33:0:0:229 sdq 65:0 active ready running
multipathd> show maps
name    sysfs uuid
mpathpz dm-6  3600a098000d9818b000018565dfa069e
mpathqi dm-12 3600a098000d9818b000018df5dfafd40
mpathqj dm-13 3600a098000d9818b000018de5dfafd10
mpathpw dm-0  3600a098000d9818b000018425dfa059f
mpathpx dm-1  3600a098000d9818b0000184c5dfa05fc
mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
mpathql dm-17 3600a098000d9818b000018e95dfafe26
mpathqg dm-8  3600a098000d9818b000018c75dfafac0
mpathqh dm-9  3600a098000d9818b000018c65dfafa91
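Spotting which maps came or went between two "show maps" dumps is easier with a small diff keyed on the WWID. The following Python sketch is one way to do it; parse_maps and diff_maps are illustrative names, and the BEFORE/AFTER strings are shortened excerpts of the two listings in this post rather than the full output:

```python
def parse_maps(show_maps: str) -> dict:
    """Return {wwid: (map_name, dm_node)} from `show maps` output."""
    maps = {}
    for line in show_maps.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a three-column entry.
        if len(fields) != 3 or fields[0] == "name":
            continue
        name, node, wwid = fields
        maps[wwid] = (name, node)
    return maps


def diff_maps(before: str, after: str):
    """Return (removed, added) lists of (map_name, dm_node) tuples."""
    old, new = parse_maps(before), parse_maps(after)
    removed = sorted(old[w] for w in old.keys() - new.keys())
    added = sorted(new[w] for w in new.keys() - old.keys())
    return removed, added


# Shortened excerpts of the two listings (steady state vs. controller A down).
BEFORE = """\
name sysfs uuid
mpathpy dm-4 3600a098000d9818b0000185c5dfa0714
mpathpz dm-6 3600a098000d9818b000018565dfa069e
mpathqh dm-9 3600a098000d9818b000018c65dfafa91
"""

AFTER = """\
name sysfs uuid
mpathpz dm-6 3600a098000d9818b000018565dfa069e
mpathqg dm-8 3600a098000d9818b000018c75dfafac0
mpathqh dm-9 3600a098000d9818b000018c65dfafa91
"""

if __name__ == "__main__":
    removed, added = diff_maps(BEFORE, AFTER)
    print("removed:", removed)
    print("added:  ", added)
```

Applied to the two listings in this post, the diff reports mpathpy (dm-4) as removed; it also shows mpathqg (dm-8) present only in the second listing, which may or may not be related.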
If I bring controller A back into service on the NetApp side and instead
fail only the paths to controller A from multipathd, everything works fine:
dm-4 is still present and the VM can be put back into service.
multipathd> fail path sdk
ok
multipathd>
multipathd> fail path sdm
ok
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 32:0:0:82 sdk 8:160 failed faulty running
| `- 30:0:0:82 sdm 8:192 failed faulty running
`-+- policy='service-time 0' prio=9 status=active
  |- 31:0:0:82 sdl 8:176 active ready running
  `- 33:0:0:82 sdn 8:208 active ready running
multipathd> reinstate path sdk
ok
multipathd>
multipathd> reinstate path sdm
ok
mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
|-+- policy='service-time 0' prio=14 status=active
| |- 32:0:0:82 sdk 8:160 active ready running
| `- 30:0:0:82 sdm 8:192 active ready running
`-+- policy='service-time 0' prio=9 status=enabled
  |- 31:0:0:82 sdl 8:176 active ready running
  `- 33:0:0:82 sdn 8:208 active ready running
In the working case, the storage volume disappears (which seems normal),
and the instance also vanishes completely from the virsh list; no trace of
it can be found at the KVM level when running ps -def | grep fd |
grep <instance_ID>. However, the boot volume is always still present in the
multipathd records when we stop the VM under normal conditions, without
stopping a NetApp controller.
Any ideas?
Kind regards,
Ahmed