A compute service hang issue

Sean Mooney smooney at redhat.com
Mon Apr 26 14:41:08 UTC 2021


On Mon, 2021-04-26 at 13:56 +0000, Yingji Sun wrote:
> Buddies,
> 
> Have you ever seen an issue where the nova-compute service appears to hang and is unable to boot instances?
> 
> When creating an instance, we can only see logs like the ones below, and no further information appears.
> 
> 
> 2021-04-23 02:39:00.252 1 DEBUG nova.compute.manager [req-a2ed90b6-792f-48e1-ba7f-e1e35d9537c2 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: e0e77edc-99b4-473e-b318-8e6f04428cda] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202
> 
> 2021-04-23 03:03:04.927 1 DEBUG oslo_concurrency.lockutils [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "a3619b50-3704-4f3e-b908-d525a41756eb" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327
> 
> There are NO other messages, not even the periodic tasks.
> 
> If I send a request to create another instance, I can only see another _do_build_and_run_instance log.
> 
> 
> 
> 
> 2021-04-23 03:03:05.061 1 DEBUG nova.compute.manager [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: a3619b50-3704-4f3e-b908-d525a41756eb] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202
> 
> 2021-04-23 03:32:44.718 1 DEBUG oslo_concurrency.lockutils [req-c7857cb3-02ae-4c7b-92d7-b2ec178c1b13 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "3b07a6eb-12bb-4d19-9302-25ea3e746944" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327
> 
> I am sure the compute service is still running, as I can see the heartbeat time updating correctly.
> 
> From the rabbitmq messages, it looks like the code reaches this point:
> 
> nova/compute/manager.py
> 
> def _build_and_run_instance(self, context, instance, image, injected_files,
>         admin_password, requested_networks, security_groups,
>         block_device_mapping, node, limits, filter_properties,
>         request_spec=None):
> 
>     image_name = image.get('name')
>     self._notify_about_instance_usage(context, instance, 'create.start',
>             extra_usage_info={'image_name': image_name})
>     compute_utils.notify_about_instance_create(
>         context, instance, self.host,
>         phase=fields.NotificationPhase.START,
>         bdms=block_device_mapping)
> 
> as I can see a message with "event_type": "compute.instance.create.start"
> 
> 
> INFO:root:Body: {'oslo.version': '2.0', 'oslo.message': '{"message_id": "0ef11509-7a65-46b8-ac7e-ed6482fc527d", "publisher_id": "compute.compute01", "event_type": "compute.instance.create.start", "priority": "INFO", "payload": {"tenant_id": "1ee7955a2eaf4c86bcc3e650f2a8e2a7", "user_id": "373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e", "instance_id": "3b07a6eb-12bb-4d19-9302-25ea3e746944", "display_name": "yingji-06", "reservation_id": "r-f8iphdcf", "hostname": "yingji-06", "instance_type": "m1.tiny", "instance_type_id": 15, "instance_flavor_id": "4d5644e1-c561-4b6c-952c-f4dd93c87948", "architecture": null, "memory_mb": 512, "disk_gb": 1, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "host": null, "node": null, "availability_zone": "nova", "cell_name": "", "created_at": "2021-04-23 03:32:39+00:00", "terminated_at": "", "deleted_at": "", "launched_at": "", "image_ref_url": "https://192.168.111.160:9292//images/303a325f-048
> 
> 
> This issue is not always reproducible, and restarting the compute service works around it.
> 
> Could you please give any suggestions on how to resolve this issue, or how I can investigate it?
i assume this is with the vmware driver?
it kind of sounds like eventlet is not monkey patching properly and it's blocking on a call or something like that.
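to illustrate that failure mode: the sketch below uses asyncio from the stdlib rather than eventlet (so it runs anywhere), but the mechanics are the same — in a cooperative scheduler, one call that blocks the OS thread stalls every other task, periodic jobs included.

```python
import asyncio
import time

# not eventlet, but the same failure mode: in a cooperative scheduler,
# one call that blocks the OS thread (plain time.sleep here) stalls
# every other task, periodic jobs included.
async def periodic(ticks):
    for _ in range(5):
        await asyncio.sleep(0.01)
        ticks.append(time.monotonic())

async def bad_rpc_call():
    # a blocking call that was never monkey patched / proxied
    time.sleep(0.2)

async def main():
    ticks = []
    task = asyncio.create_task(periodic(ticks))
    await asyncio.sleep(0)      # let the periodic task start
    await bad_rpc_call()        # blocks the whole event loop for 0.2s
    blocked_ticks = len(ticks)  # ticks that fired while we were blocked
    await task
    return blocked_ticks, len(ticks)

during, total = asyncio.run(main())
print(during, total)  # 0 5 -- nothing ran while the loop was blocked
```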

we have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib,
and as a result we would end up blocking in the compute agent when making some external calls to libvirt.
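the proxying idea looks roughly like this — a stdlib sketch using concurrent.futures, with a hypothetical ThreadProxy class and SlowLib stand-in of my own; eventlet's real tpool.Proxy does the same dispatch but yields to the eventlet hub while waiting, so other green threads keep running.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# hypothetical stand-in for eventlet.tpool.Proxy: method calls on the
# wrapped object are pushed into a native worker thread. note that
# .result() below blocks the caller; the real tpool instead yields to
# the eventlet hub while the worker runs.
class ThreadProxy:
    _pool = ThreadPoolExecutor(max_workers=4)

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getattr__(self, name):
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def call(*args, **kwargs):
            # run the blocking call in a worker thread and wait for it
            return self._pool.submit(attr, *args, **kwargs).result()

        return call

class SlowLib:
    """stands in for a C library (libvirt etc.) that makes blocking calls."""
    def lookup(self, x):
        time.sleep(0.01)
        return x * 2

proxied = ThreadProxy(SlowLib())
print(proxied.lookup(21))  # 42
```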

i wonder if you are seeing something similar?
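one generic way to find out where the process is stuck is to dump every thread's python stack; the stdlib faulthandler module can do that (iirc nova's guru meditation reports expose something similar, but the sketch below is plain stdlib):

```python
import faulthandler
import signal
import tempfile

# register a signal handler so `kill -USR1 <pid>` dumps every thread's
# python stack to stderr -- useful to see where a "hung" service sits
faulthandler.register(signal.SIGUSR1, all_threads=True)

# the same dump can also be taken programmatically:
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    dump = f.read()

print("File" in dump)  # True -- the dump lists file/line for each frame
```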
> 
> Yingji.
> 
> 





More information about the openstack-discuss mailing list