[Openstack] VMs not terminated correctly and undeletable, with cinder and CEPH

Aviolat Romain Romain.Aviolat at nagra.com
Fri Jun 5 08:02:17 UTC 2015


Dear Openstack community

I have a strange, intermittent problem with cinder when VMs with attached volumes are terminated. Sometimes the process fails partway and the VMs are left stuck in an undeletable state.

Some info about my setup:
*****************
* Openstack Juno from ubuntu-cloud repo
* Ubuntu 14.04 LTS on every machine
* CEPH (giant) as storage backend on the storage nodes
* 3x controllers in HA (HAProxy, Galera, CEPH monitors, ...)
* RabbitMQ as messaging system
*****************

I put my whole setup in debug mode and here's what happened for a problematic VM: 

Info about the VM and its volume:
*****************
VM: suricata-bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc
Volume attached: 821006f1-c655-4589-ba3e-7c6804c8e120 on /dev/vda
Volume size: 500 GB
Running on hypervisor: cloudcompute2
Libvirt instance ID: instance-000002e3
*****************
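
For anyone who wants to reproduce the checks, this is roughly how the information above can be pulled with the standard clients (a sketch, Juno-era nova/cinder CLIs):
*****************
# Hypervisor, libvirt instance name and attached volumes of the VM:
nova show bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc

# Status and attachment details of the volume:
cinder show 821006f1-c655-4589-ba3e-7c6804c8e120
*****************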

To begin, here's the VM action log from Horizon; we can see that the user deleted the VM at 1:23 p.m.:
*****************
Request ID:  req-22f33ed5-a231-4ea2-95fa-0022bf731079
Action:      delete
Start Time:  June 4, 2015, 1:23 p.m.
User ID:     27cf3aaa0d7942009b03eabf7f686849
Message:     Error
*****************
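
The same action record can also be inspected from the CLI (a sketch, assuming the Juno novaclient):
*****************
# List all recorded actions for the instance:
nova instance-action-list bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc

# Show the events behind the failed delete and where it stopped:
nova instance-action bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc req-22f33ed5-a231-4ea2-95fa-0022bf731079
*****************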

Here's the corresponding action on the controller that received the request:
*****************
2015-06-04 13:23:16.693 32302 DEBUG nova.compute.api [req-22f33ed5-a231-4ea2-95fa-0022bf731079 None] [instance: bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc] Going to try to terminate instance delete /usr/lib/python2.7/dist-packages/nova/compute/api.py:1802
2015-06-04 13:23:16.774 32302 DEBUG nova.quota [req-22f33ed5-a231-4ea2-95fa-0022bf731079 None] Created reservations ['2221d555-1db0-4c36-9bd6-6dc815c22fc9', '3b70b444-7633-4a6b-b8b2-16128ee1469c', '6b8b2421-cfc1-4568-8bd8-430a982d977f'] reserve /usr/lib/python2.7/dist-packages/nova/quota.py:1310
2015-06-04 13:23:16.799 32302 INFO nova.osapi_compute.wsgi.server [req-22f33ed5-a231-4ea2-95fa-0022bf731079 None] 172.24.1.17 "DELETE /v2/c9324c924a5049e7922882aff55c3813/servers/bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc HTTP/1.1" status: 204 len: 179 time: 0.1563590
2015-06-04 13:23:16.883 32296 DEBUG nova.api.openstack.wsgi [req-903c3ea2-151a-4f8c-a48a-6991721c8c18 None] Calling method '<bound method Controller.detail of <nova.api.openstack.compute.servers.Controller object at 0x7f869ec56790>>' _process_stack /usr/lib/python2.7/dist-packages/nova/api/openstack/wsgi.py:937
2015-06-04 13:23:16.884 32296 DEBUG nova.api.openstack.compute.servers [req-903c3ea2-151a-4f8c-a48a-6991721c8c18 None] Removing options 'project_id, limit' from query remove_invalid_options /usr/lib/python2.7/dist-packages/nova/api/openstack/compute/servers.py:1533
*****************
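
To follow a single request across services, grepping for its request ID works well (a sketch, assuming the default Ubuntu log locations):
*****************
# Trace the delete request through nova and cinder on each node:
grep -r 'req-22f33ed5-a231-4ea2-95fa-0022bf731079' /var/log/nova/ /var/log/cinder/
*****************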

Then nova-compute on the compute node hosting the VM receives the delete request:
*****************
2015-06-04 13:23:16.847 1971 AUDIT nova.compute.manager [req-22f33ed5-a231-4ea2-95fa-0022bf731079 None] [instance: bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc] Terminating instance
2015-06-04 13:23:16.854 1971 INFO nova.virt.libvirt.driver [-] [instance: bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc] Instance destroyed successfully.
*****************

Then the network device is removed:
*****************
2015-06-04 13:23:17.558 1971 DEBUG nova.network.linux_net [req-22f33ed5-a231-4ea2-95fa-0022bf731079 None] Net device removed: 'qvo20393401-b2' delete_net_dev /usr/lib/python2.7/dist-packages/nova/network/linux_net.py:1381
*****************

Then the instance files are removed:
*****************
2015-06-04 13:23:18.048 1971 INFO nova.virt.libvirt.driver [req-22f33ed5-a231-4ea2-95fa-0022bf731079 None] [instance: bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc] Deleting instance files /var/lib/nova/instances/bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc_del
*****************

The next entries are strange:
*****************
2015-06-04 13:23:19.303 1971 DEBUG nova.virt.libvirt.driver [req-22f33ed5-a231-4ea2-95fa-0022bf731079 None] [instance: bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc] Could not determine fibre channel world wide node names get_volume_connector /usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py:1280
2015-06-04 13:23:19.358 1971 DEBUG nova.volume.cinder [req-22f33ed5-a231-4ea2-95fa-0022bf731079 None] Cinderclient connection created using URL: http://haproxy-api:8776/v2/c9324c924a5049e7922882aff55c3813 get_cinder_client_version /usr/lib/python2.7/dist-packages/nova/volume/cinder.py:255
*****************
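
As far as I can tell, the fibre channel line is routine DEBUG noise: get_volume_connector probes for FC HBAs on every call and logs this when none are found. A quick way to confirm a host simply has no FC hardware (sketch):
*****************
# Empty or missing on hosts without Fibre Channel HBAs, in which case the
# "Could not determine fibre channel world wide node names" message is expected:
ls /sys/class/fc_host
*****************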

We can see that the instance was correctly removed from the compute node (instance-000002e3 no longer appears in the list):
*****************
[CloudCompute2:~] 1 $ sudo virsh list --all
  Id    Name                           State
----------------------------------------------------
  31    instance-0000043b              running
  32    instance-000002d4              running
  42    instance-00000288              running
  47    instance-0000028f              running
  48    instance-00000465              running
  52    instance-00000482              running
*****************
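
On the nova side, by contrast, the instance is still tracked; a quick cross-check (sketch):
*****************
# The domain is gone from libvirt, but nova still shows the instance,
# stuck in a deleting/error state:
nova show bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc | grep -E 'status|task_state|vm_state'
*****************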

Then on controller2 we have some strange errors; it looks like cinder-api sent a terminate_connection RPC to cinder-volume and the reply timed out (cinder-api.log).

*****************
2015-06-04 13:23:19.512 17252 INFO cinder.api.openstack.wsgi [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] POST http://haproxy-api:8776/v2/c9324c924a5049e7922882aff55c3813/volumes/821006f1-c655-4589-ba3e-7c6804c8e120/action
2015-06-04 13:23:19.513 17252 DEBUG cinder.api.openstack.wsgi [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] Action body: {"os-terminate_connection": {"connector": {"ip": "172.24.1.27", "host": "cloudcompute2.ksnet.nagra.com", "initiator": "iqn.1993-08.org.debian:01:73625b6efe77"}}} get_method /usr/lib/python2.7/dist-packages/cinder/api/openstack/wsgi.py:1010
2015-06-04 13:23:19.546 17252 INFO oslo.messaging._drivers.impl_rabbit [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] Connecting to AMQP server on cloudcontroller1-api:5672
2015-06-04 13:23:19.556 17252 INFO oslo.messaging._drivers.impl_rabbit [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] Connected to AMQP server on cloudcontroller1-api:5672
2015-06-04 13:23:19.567 17252 INFO oslo.messaging._drivers.impl_rabbit [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] Connecting to AMQP server on cloudcontroller0-api:5672
2015-06-04 13:23:19.581 17252 INFO oslo.messaging._drivers.impl_rabbit [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] Connected to AMQP server on cloudcontroller0-api:5672
2015-06-04 13:23:20.170 17247 INFO eventlet.wsgi.server [-] (17247) accepted ('172.24.1.19', 36820)
2015-06-04 13:23:20.171 17247 INFO cinder.api.openstack.wsgi [-] OPTIONS http://172.24.1.19:8776/
2015-06-04 13:23:20.171 17247 DEBUG cinder.api.openstack.wsgi [-] Empty body provided in request get_body /usr/lib/python2.7/dist-packages/cinder/api/openstack/wsgi.py:789
2015-06-04 13:23:20.172 17247 INFO cinder.api.openstack.wsgi [-] http://172.24.1.19:8776/ returned with HTTP 200
2015-06-04 13:23:20.172 17247 INFO eventlet.wsgi.server [-] 172.24.1.19 - - [04/Jun/2015 13:23:20] "OPTIONS / HTTP/1.0" 200 281 0.001354
2015-06-04 13:23:21.157 17245 INFO eventlet.wsgi.server [-] (17245) accepted ('172.24.1.17', 34260)
2015-06-04 13:23:21.159 17245 INFO cinder.api.openstack.wsgi [-] OPTIONS http://172.24.1.19:8776/
2015-06-04 13:23:21.159 17245 DEBUG cinder.api.openstack.wsgi [-] Empty body provided in request get_body /usr/lib/python2.7/dist-packages/cinder/api/openstack/wsgi.py:789
2015-06-04 13:23:21.160 17245 INFO cinder.api.openstack.wsgi [-] http://172.24.1.19:8776/ returned with HTTP 200
2015-06-04 13:23:21.161 17245 INFO eventlet.wsgi.server [-] 172.24.1.17 - - [04/Jun/2015 13:23:21] "OPTIONS / HTTP/1.0" 200 281 0.002750
2015-06-04 13:23:21.173 17251 INFO eventlet.wsgi.server [-] (17251) accepted ('172.24.1.18', 49259)
2015-06-04 13:23:21.174 17251 INFO cinder.api.openstack.wsgi [-] OPTIONS http://172.24.1.19:8776/
2015-06-04 13:23:21.175 17251 DEBUG cinder.api.openstack.wsgi [-] Empty body provided in request get_body /usr/lib/python2.7/dist-packages/cinder/api/openstack/wsgi.py:789
2015-06-04 13:23:21.176 17251 INFO cinder.api.openstack.wsgi [-] http://172.24.1.19:8776/ returned with HTTP 200 .....
2015-06-04 13:24:19.338 17254 INFO eventlet.wsgi.server [-] 172.24.1.18 - - [04/Jun/2015 13:24:19] "OPTIONS / HTTP/1.0" 200 281 0.001299
2015-06-04 13:24:19.585 17252 ERROR cinder.api.middleware.fault [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] Caught error: Timed out waiting for a reply to message ID 27d7e0cc6f444391823d5d4d1c1c85c1
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault Traceback (most recent call last):
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/cinder/api/middleware/fault.py", line 76, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return req.get_response(self.application)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1320, in send
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     application, catch_exc_info=False)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1284, in call_application
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     app_iter = application(self.environ, start_response)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return resp(environ, start_response)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     resp = self.call_func(req, *args, **self.kwargs)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return self.func(req, *args, **kwargs)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/osprofiler/web.py", line 99, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return request.get_response(self.application)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1320, in send
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     application, catch_exc_info=False)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1284, in call_application
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     app_iter = application(self.environ, start_response)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/keystonemiddleware/auth_token.py", line 661, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return self._app(env, start_response)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return resp(environ, start_response)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return resp(environ, start_response)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/routes/middleware.py", line 131, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     response = self.app(environ, start_response)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 144, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return resp(environ, start_response)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 130, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     resp = self.call_func(req, *args, **self.kwargs)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/webob/dec.py", line 195, in call_func
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return self.func(req, *args, **kwargs)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/cinder/api/openstack/wsgi.py", line 897, in __call__
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     content_type, body, accept)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/cinder/api/openstack/wsgi.py", line 945, in _process_stack
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     action_result = self.dispatch(meth, request, action_args)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/cinder/api/openstack/wsgi.py", line 1021, in dispatch
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return method(req=request, **action_args)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/cinder/api/contrib/volume_actions.py", line 218, in _terminate_connection
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     self.volume_api.terminate_connection(context, volume, connector)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/cinder/volume/api.py", line 87, in wrapped
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     return func(self, context, target_obj, *args, **kwargs)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/cinder/volume/api.py", line 519, in terminate_connection
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     force)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/cinder/volume/rpcapi.py", line 170, in terminate_connection
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     connector=connector, force=force)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 152, in call
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     retry=self.retry)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     timeout=timeout, retry=retry)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 408, in send
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     retry=retry)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 397, in _send
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     result = self._waiter.wait(msg_id, timeout)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 285, in wait
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     reply, ending = self._poll_connection(msg_id, timeout)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault   File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 235, in _poll_connection
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault     % msg_id)
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault MessagingTimeout: Timed out waiting for a reply to message ID 27d7e0cc6f444391823d5d4d1c1c85c1
2015-06-04 13:24:19.585 17252 TRACE cinder.api.middleware.fault
2015-06-04 13:24:19.596 17252 INFO cinder.api.middleware.fault [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] http://haproxy-api:8776/v2/c9324c924a5049e7922882aff55c3813/volumes/821006f1-c655-4589-ba3e-7c6804c8e120/action returned with HTTP 500
2015-06-04 13:24:19.598 17252 INFO eventlet.wsgi.server [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 405, in handle_one_response
     write(''.join(towrite))
   File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 349, in write
     _writelines(towrite)
   File "/usr/lib/python2.7/socket.py", line 334, in writelines
     self.flush()
   File "/usr/lib/python2.7/socket.py", line 303, in flush
     self._sock.sendall(view[write_offset:write_offset+buffer_size])
   File "/usr/lib/python2.7/dist-packages/eventlet/greenio.py", line 307, in sendall
     tail = self.send(data, flags)
   File "/usr/lib/python2.7/dist-packages/eventlet/greenio.py", line 293, in send
     total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer
2015-06-04 13:24:19.598 17252 INFO eventlet.wsgi.server [req-ce2dcb7f-4bfd-4345-a199-165f13f7f11d 27cf3aaa0d7942009b03eabf7f686849 c9324c924a5049e7922882aff55c3813 - - -] 172.24.1.17 - - [04/Jun/2015 13:24:19] "POST /v2/c9324c924a5049e7922882aff55c3813/volumes/821006f1-c655-4589-ba3e-7c6804c8e120/action HTTP/1.1" 500 0 60.230434
2015-06-04 13:24:20.321 17254 INFO eventlet.wsgi.server [-] (17254) accepted ('172.24.1.19', 37634)
2015-06-04 13:24:20.321 17254 INFO cinder.api.openstack.wsgi [-] OPTIONS http://172.24.1.19:8776/
*****************
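
One thing I notice: the POST fails after exactly 60 seconds (13:23:19 -> 13:24:19), which matches oslo.messaging's default rpc_response_timeout, so cinder-volume apparently never answered the terminate_connection call in time. Raising the timeout only helps with slow replies, not lost ones, but it might narrow down which of the two this is (a sketch):
*****************
# /etc/cinder/cinder.conf
[DEFAULT]
# Default is 60; raise it to see whether cinder-volume replies late or never:
rpc_response_timeout = 180
*****************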

I checked the CEPH cluster and the volume hadn't been deleted. I also checked the state of the RabbitMQ cluster, and everything there appears to be working as expected.
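
For completeness, these are the kinds of checks I mean on the storage and messaging side (a sketch; I'm assuming the volume lives in a pool named 'volumes', adjust to your setup):
*****************
# Confirm the RBD image still exists and note its id:
rbd -p volumes info volume-821006f1-c655-4589-ba3e-7c6804c8e120

# See whether a client still holds a watch on the image header
# (a leftover watcher can also block deletion on the CEPH side):
rados -p volumes listwatchers rbd_header.<image-id-from-rbd-info>

# Check that the cinder-volume queue has a consumer and isn't backing up:
rabbitmqctl list_queues name messages consumers | grep cinder
*****************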

I'm not able to reproduce the problem; it happens occasionally but I haven't found the trigger... When it happens, the volume is undeletable because it is marked as in-use by the VM, and the VM is undeletable because a volume is still attached. The only fix I have found is to edit the databases (nova, cinder, ...) and delete the volume in the CEPH cluster manually.
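
For the record, I'd rather avoid the DB surgery. In theory something like this should reset the states so the delete can be retried, though I'm not sure it is enough in this case (a sketch; admin credentials required, and reset-state only changes the recorded state, not the actual attachment):
*****************
# Mark the volume available again in cinder's view:
cinder reset-state --state available 821006f1-c655-4589-ba3e-7c6804c8e120

# Reset the instance state and retry the delete:
nova reset-state --active bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc
nova delete bbd3c3f5-85fb-4c90-8e17-e7f083a4a0bc
*****************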

Your help will be very much appreciated!

Thanks,

Romain

Romain Aviolat
Senior System Administrator - R&D and ops Infrastructure 
Kudelski Security - Kudelski Group 
rte de Genève 22-24, 1033 Cheseaux, SWITZERLAND
+41 21 732 03 79





