VM boot volume disappears from compute's multipath Daemon when NetApp controller is placed offline
I have a setup where each VM is assigned two vDisks: an encrypted boot volume and a storage volume. The storage backend is NetApp (tripleo-netapp), with two controllers on the NetApp side working in active/active mode.

My test case goes as follows:

- I stop one of the active controllers.
- I stop one of my VMs using openstack server stop.
- I then start the VM again using openstack server start.
- The VM fails to start.

Here are my findings; I hope someone can help explain the behaviour seen below.

My VM is vel1bgw01-MCM2, and it is running on compute overcloud-sriovperformancecompute-3.localdomain:

    [root@overcloud-controller-0 (vel1asbc01) cbis-admin]# openstack server show vel1bgw01-MCM2
    | Field                                | Value |
    | OS-DCF:diskConfig                    | MANUAL |
    | OS-EXT-AZ:availability_zone          | zone1 |
    | OS-EXT-SRV-ATTR:host                 | overcloud-sriovperformancecompute-3.localdomain |
    | OS-EXT-SRV-ATTR:hypervisor_hostname  | overcloud-sriovperformancecompute-3.localdomain |
    | OS-EXT-SRV-ATTR:instance_name        | instance-00000e93 |
    | OS-EXT-STS:power_state               | Running |
    | OS-EXT-STS:task_state                | None |
    | OS-EXT-STS:vm_state                  | active |
    | OS-SRV-USG:launched_at               | 2019-12-18T15:49:37.000000 |
    | OS-SRV-USG:terminated_at             | None |
    | accessIPv4                           | |
    | accessIPv6                           | |
    | addresses                            | SBC01_MGW01_TIPC=192.168.48.22; SBC01_MGW01_DATAPATH_MATE=192.168.16.11; SBC01_MGW01_DATAPATH=192.168.32.8 |
    | config_drive                         | True |
    | created                              | 2019-12-18T15:49:16Z |
    | flavor                               | SBC_MCM (asbc_mcm) |
    | hostId                               | 7886df0f7a3d4e131304a8eb860e6a704c5fda2a7ed751b544ff2bf5 |
    | id                                   | 5c70a984-89a9-44ce-876d-9e2e568eb819 |
    | image                                | |
    | key_name                             | CBAM-b5fd59a066e8450ca9f104a69da5a043-Keypair |
    | name                                 | vel1bgw01-MCM2 |
    | os-extended-volumes:volumes_attached | [{u'id': u'717e5744-4786-42dc-9e3e-3c5e6994c482'}, {u'id': u'd6cf0cf9-36d1-4b62-86b4-faa4a6642166'}] |
    | progress                             | 0 |
    | project_id                           | 41777c6f1e7b4f8d8fd76b5e0f67e5e8 |
    | properties                           | |
    | security_groups                      | [{u'name': u'vel1bgw01-TIPC-Security-Group'}] |
    | status                               | ACTIVE |
    | updated                              | 2020-01-07T17:18:32Z |
    | user_id                              | be13deba85794016a00fec9d18c5d7cf |

It is mapped to the following vDisks (seen using virsh list on compute-3):

- dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714 -> boot volume
- dm-uuid-mpath-3600a098000d9818b000018565dfa069e -> storage volume

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <serial>717e5744-4786-42dc-9e3e-3c5e6994c482</serial>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source dev='/dev/disk/by-id/dm-uuid-mpath-3600a098000d9818b000018565dfa069e'/>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <serial>d6cf0cf9-36d1-4b62-86b4-faa4a6642166</serial>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </disk>

    Name:              crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714
    State:             ACTIVE
    Read Ahead:        256
    Tables present:    LIVE
    Open count:        1
    Event number:      0
    Major, minor:      253, 5
    Number of targets: 1
    UUID: CRYPT-LUKS1-769cc20bc5af469c8c9075a2a6fc4aa0-crypt-dm-uuid-mpath-3600a098000d9818b0000185c5dfa0714

    Name:              mpathpy
    State:             ACTIVE
    Read Ahead:        256
    Tables present:    LIVE
    Open count:        1
    Event number:      32
    Major, minor:      253, 4
    Number of targets: 1
    UUID: mpath-3600a098000d9818b0000185c5dfa0714

    Name:              crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e
    State:             ACTIVE
    Read Ahead:        256
    Tables present:    LIVE
    Open count:        1
    Event number:      0
    Major, minor:      253, 7
    Number of targets: 1
    UUID: CRYPT-LUKS1-4015c585a0df4074821ca312c4caacca-crypt-dm-uuid-mpath-3600a098000d9818b000018565dfa069e

    Name:              mpathpz
    State:             ACTIVE
    Read Ahead:        256
    Tables present:    LIVE
    Open count:        1
    Event number:      28
    Major, minor:      253, 6
    Number of targets: 1
    UUID: mpath-3600a098000d9818b000018565dfa069e

This means the boot volume is represented by dm-4, while the storage volume is represented by dm-6.

Dumping the multipath daemon on the controller shows that, in a steady running state, both DMs are accounted for (see below).

    multipathd> show maps
    name    sysfs uuid
    mpathpy dm-4  3600a098000d9818b0000185c5dfa0714
    mpathpz dm-6  3600a098000d9818b000018565dfa069e
    mpathqi dm-12 3600a098000d9818b000018df5dfafd40
    mpathqj dm-13 3600a098000d9818b000018de5dfafd10
    mpathpw dm-0  3600a098000d9818b000018425dfa059f
    mpathpx dm-1  3600a098000d9818b0000184c5dfa05fc
    mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
    mpathql dm-17 3600a098000d9818b000018e95dfafe26
    mpathqh dm-9  3600a098000d9818b000018c65dfafa91

These vDisks are mapped to the following multipaths:

    multipathd> show topology
    mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
    size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
    |-+- policy='service-time 0' prio=14 status=active
    | |- 30:0:0:82 sdm 8:192 active ready running
    | `- 32:0:0:82 sdk 8:160 active ready running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 33:0:0:82 sdn 8:208 failed faulty running
      `- 31:0:0:82 sdl 8:176 failed faulty running
    mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
    size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
    |-+- policy='service-time 0' prio=14 status=active
    | |- 30:0:0:229 sdr 65:16 active ready running
    | `- 32:0:0:229 sdp 8:240 active ready running
    `-+- policy='service-time 0' prio=0 status=enabled
      |- 31:0:0:229 sdo 8:224 failed faulty running
      `- 33:0:0:229 sdq 65:0 failed faulty running

Now it gets interesting. If I shut down controller A on the NetApp side, dm-4 disappears, but dm-6 keeps running: its active path is now via controller B, and its standby path via controller A is displayed as failed.

    multipathd> show topology
    mpathpz (3600a098000d9818b000018565dfa069e) dm-6 NETAPP ,INF-01-00
    size=10G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
    |-+- policy='service-time 0' prio=0 status=enabled
    | |- 30:0:0:229 sdr 65:16 failed faulty running
    | `- 32:0:0:229 sdp 8:240 failed faulty running
    `-+- policy='service-time 0' prio=11 status=active
      |- 31:0:0:229 sdo 8:224 active ready running
      `- 33:0:0:229 sdq 65:0 active ready running

    multipathd> show maps
    name    sysfs uuid
    mpathpz dm-6  3600a098000d9818b000018565dfa069e
    mpathqi dm-12 3600a098000d9818b000018df5dfafd40
    mpathqj dm-13 3600a098000d9818b000018de5dfafd10
    mpathpw dm-0  3600a098000d9818b000018425dfa059f
    mpathpx dm-1  3600a098000d9818b0000184c5dfa05fc
    mpathqk dm-16 3600a098000d9818b000018eb5dfafe80
    mpathql dm-17 3600a098000d9818b000018e95dfafe26
    mpathqg dm-8  3600a098000d9818b000018c75dfafac0
    mpathqh dm-9  3600a098000d9818b000018c65dfafa91

If I restore controller A to service on the NetApp side and instead fail only the paths to controller A from multipathd, everything works fine: dm-4 is still present and the VM can be put into service.

    multipathd> fail path sdk
    ok
    multipathd> fail path sdm
    ok

    mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
    size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
    |-+- policy='service-time 0' prio=0 status=enabled
    | |- 32:0:0:82 sdk 8:160 failed faulty running
    | `- 30:0:0:82 sdm 8:192 failed faulty running
    `-+- policy='service-time 0' prio=9 status=active
      |- 31:0:0:82 sdl 8:176 active ready running
      `- 33:0:0:82 sdn 8:208 active ready running

    multipathd> reinstate path sdk
    ok
    multipathd> reinstate path sdm
    ok

    mpathpy (3600a098000d9818b0000185c5dfa0714) dm-4 NETAPP ,INF-01-00
    size=21G features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_handle' hwhandler='1 rdac' wp=rw
    |-+- policy='service-time 0' prio=14 status=active
    | |- 32:0:0:82 sdk 8:160 active ready running
    | `- 30:0:0:82 sdm 8:192 active ready running
    `-+- policy='service-time 0' prio=9 status=enabled
      |- 31:0:0:82 sdl 8:176 active ready running
      `- 33:0:0:82 sdn 8:208 active ready running

In the working case, I observe that the storage volume disappears (which seems normal) and that the instance vanishes completely from the virsh list; no trace of it can be found at the KVM level if we run ps -def | grep fd | grep <instance_ID>. However, the boot volume is always present in the multipathd records when we stop the VM under normal conditions, without stopping a NetApp controller.

Any ideas?

Kind regards,
Ahmed
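The before/after comparison of the show maps listings could be made repeatable with a small script. The following is only a minimal sketch in Python: the function names parse_show_maps and missing_maps are illustrative, and it assumes the listing has been saved as plain text in the three-column "name sysfs uuid" layout shown in this message.

```python
def parse_show_maps(text):
    """Parse a saved 'name sysfs uuid' listing into {name: (sysfs, uuid)}."""
    maps = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not
        # a three-column data row (prompts, "ok" replies, and so on).
        if len(fields) != 3 or fields[0] == "name":
            continue
        name, sysfs, uuid = fields
        maps[name] = (sysfs, uuid)
    return maps


def missing_maps(before, after):
    """Return the maps listed in 'before' that are absent from 'after'."""
    old = parse_show_maps(before)
    new = parse_show_maps(after)
    return {name: old[name] for name in old.keys() - new.keys()}


if __name__ == "__main__":
    # Two rows taken from the listings in this message.
    before = (
        "name sysfs uuid\n"
        "mpathpy dm-4 3600a098000d9818b0000185c5dfa0714\n"
        "mpathpz dm-6 3600a098000d9818b000018565dfa069e\n"
    )
    after = (
        "name sysfs uuid\n"
        "mpathpz dm-6 3600a098000d9818b000018565dfa069e\n"
    )
    for name, (sysfs, uuid) in missing_maps(before, after).items():
        print(name, sysfs, uuid)
```

Keying the comparison on the map name keeps the sketch short; comparing on the uuid column instead would be more robust if map aliases can change between snapshots.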
participants (1): Ahmed ZAKY