A compute service hang issue
Buddies,

Have you ever seen an issue where the nova-compute service appears to hang and cannot boot instances? When creating an instance, we can only see logs like the ones below, and there is no other information.

2021-04-23 02:39:00.252 1 DEBUG nova.compute.manager [req-a2ed90b6-792f-48e1-ba7f-e1e35d9537c2 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: e0e77edc-99b4-473e-b318-8e6f04428cda] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202

2021-04-23 03:03:04.927 1 DEBUG oslo_concurrency.lockutils [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "a3619b50-3704-4f3e-b908-d525a41756eb" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327

There are NO other messages, not even from the periodic tasks. If I send a request to create another instance, I only see another _do_build_and_run_instance log.

2021-04-23 03:03:05.061 1 DEBUG nova.compute.manager [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: a3619b50-3704-4f3e-b908-d525a41756eb] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202

2021-04-23 03:32:44.718 1 DEBUG oslo_concurrency.lockutils [req-c7857cb3-02ae-4c7b-92d7-b2ec178c1b13 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "3b07a6eb-12bb-4d19-9302-25ea3e746944" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327

I am sure the compute service is still running, as I can see the heartbeat time changing correctly.
From the rabbitmq messages, it looks like the code reaches this point:

nova/compute/manager.py

    def _build_and_run_instance(self, context, instance, image, injected_files,
                                admin_password, requested_networks,
                                security_groups, block_device_mapping, node,
                                limits, filter_properties, request_spec=None):
        image_name = image.get('name')

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)

as I can see a message with "event_type": "compute.instance.create.start":

INFO:root:Body: {'oslo.version': '2.0', 'oslo.message': '{"message_id": "0ef11509-7a65-46b8-ac7e-ed6482fc527d", "publisher_id": "compute.compute01", "event_type": "compute.instance.create.start", "priority": "INFO", "payload": {"tenant_id": "1ee7955a2eaf4c86bcc3e650f2a8e2a7", "user_id": "373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e", "instance_id": "3b07a6eb-12bb-4d19-9302-25ea3e746944", "display_name": "yingji-06", "reservation_id": "r-f8iphdcf", "hostname": "yingji-06", "instance_type": "m1.tiny", "instance_type_id": 15, "instance_flavor_id": "4d5644e1-c561-4b6c-952c-f4dd93c87948", "architecture": null, "memory_mb": 512, "disk_gb": 1, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "host": null, "node": null, "availability_zone": "nova", "cell_name": "", "created_at": "2021-04-23 03:32:39+00:00", "terminated_at": "", "deleted_at": "", "launched_at": "", "image_ref_url": "https://192.168.111.160:9292//images/303a325f-048

This issue is not always reproducible and restarting the compute service can work around this.

Could you please give any suggestions on how to resolve this issue, or on how I can investigate?

Yingji.
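(For reference, a trace like the "INFO:root:Body:" dump above can be captured with a small standalone oslo.messaging listener. This is only a sketch: the broker URL, the exchange/topic names and the pool name are assumptions based on common nova defaults and may need adjusting for your deployment.)

    # Hypothetical standalone helper (not part of nova): print notifications
    # such as compute.instance.create.start as they arrive on the message bus.
    import time

    from oslo_config import cfg
    import oslo_messaging


    class DumpEndpoint(object):
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Called for every INFO-priority notification we receive.
            print('%s from %s: %s' % (event_type, publisher_id, payload))


    def main():
        # Broker URL is an assumption; point it at your RabbitMQ instance.
        transport = oslo_messaging.get_notification_transport(
            cfg.CONF, url='rabbit://guest:guest@rabbit-host:5672/')
        # Exchange/topic names are assumptions (nova's defaults).
        targets = [oslo_messaging.Target(exchange='nova', topic='notifications')]
        # A dedicated pool gives this listener its own queue so it does not
        # steal messages from ceilometer or other notification consumers.
        listener = oslo_messaging.get_notification_listener(
            transport, targets, [DumpEndpoint()], executor='threading',
            pool='debug-dump')
        listener.start()
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            listener.stop()
            listener.wait()


    if __name__ == '__main__':
        main()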
On Mon, 2021-04-26 at 13:56 +0000, Yingji Sun wrote:
Buddies,
Have you ever seen an issue where the nova-compute service appears to hang and cannot boot instances?
When creating an instance, we can only see logs like the ones below, and there is no other information.
2021-04-23 02:39:00.252 1 DEBUG nova.compute.manager [req-a2ed90b6-792f-48e1-ba7f-e1e35d9537c2 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: e0e77edc-99b4-473e-b318-8e6f04428cda] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202
2021-04-23 03:03:04.927 1 DEBUG oslo_concurrency.lockutils [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "a3619b50-3704-4f3e-b908-d525a41756eb" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327
There are NO other messages, not even from the periodic tasks.
If I send a request to create another instance, I only see another _do_build_and_run_instance log.
2021-04-23 03:03:05.061 1 DEBUG nova.compute.manager [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: a3619b50-3704-4f3e-b908-d525a41756eb] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202
2021-04-23 03:32:44.718 1 DEBUG oslo_concurrency.lockutils [req-c7857cb3-02ae-4c7b-92d7-b2ec178c1b13 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "3b07a6eb-12bb-4d19-9302-25ea3e746944" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327
I am sure the compute service is still running, as I can see the heartbeat time changing correctly.
From the rabbitmq messages, it looks like the code reaches this point:
nova/compute/manager.py
    def _build_and_run_instance(self, context, instance, image, injected_files,
                                admin_password, requested_networks,
                                security_groups, block_device_mapping, node,
                                limits, filter_properties, request_spec=None):
        image_name = image.get('name')

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)
as I can see a message with "event_type": "compute.instance.create.start"
INFO:root:Body: {'oslo.version': '2.0', 'oslo.message': '{"message_id": "0ef11509-7a65-46b8-ac7e-ed6482fc527d", "publisher_id": "compute.compute01", "event_type": "compute.instance.create.start", "priority": "INFO", "payload": {"tenant_id": "1ee7955a2eaf4c86bcc3e650f2a8e2a7", "user_id": "373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e", "instance_id": "3b07a6eb-12bb-4d19-9302-25ea3e746944", "display_name": "yingji-06", "reservation_id": "r-f8iphdcf", "hostname": "yingji-06", "instance_type": "m1.tiny", "instance_type_id": 15, "instance_flavor_id": "4d5644e1-c561-4b6c-952c-f4dd93c87948", "architecture": null, "memory_mb": 512, "disk_gb": 1, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "host": null, "node": null, "availability_zone": "nova", "cell_name": "", "created_at": "2021-04-23 03:32:39+00:00", "terminated_at": "", "deleted_at": "", "launched_at": "", "image_ref_url": "https://192.168.111.160:9292//images/303a325f-048
This issue is not always reproducible and restarting the compute service can work around this.
Could you please give any suggestions on how to resolve this issue, or on how I can investigate?
I assume this is with the vmware driver? It kind of sounds like eventlet is not monkey patching properly and it is blocking on a call, or something like that.
We have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib and, as a result, would end up blocking in the compute agent when making some external calls to libvirt. I wonder if you are seeing something similar?
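(For what it's worth, the usual cure for that class of bug is to push the blocking native call onto eventlet's OS-thread pool. A minimal sketch, assuming a hypothetical SlowSdkClient that stands in for whatever native or remote SDK the driver talks to; this is not nova code:)

    # Minimal sketch, not nova code: SlowSdkClient is a made-up stand-in for a
    # native binding (e.g. a C-extension SDK) whose calls never yield to the
    # eventlet hub.
    import eventlet
    eventlet.monkey_patch()

    from eventlet import tpool


    class SlowSdkClient(object):
        def long_blocking_call(self):
            # Imagine a C-level or network call here that blocks the whole
            # OS thread without yielding to eventlet.
            return 'done'


    # Called directly, long_blocking_call() would stall the OS thread running
    # the eventlet hub, and with it every greenthread in the service. Wrapping
    # the client in tpool.Proxy dispatches each method call to a worker OS
    # thread instead, so the rest of the service keeps running while it blocks.
    client = tpool.Proxy(SlowSdkClient())
    print(client.long_blocking_call())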
Yingji.
Sean,

You are right. I am working with the vmware driver. Is it possible that you could share some sample code fixes so that I can try them in my environment?

Below is my investigation. Would you please give any suggestions?

With my instance, vm_state is building and task_state is NULL.

I have a suspect here:

    ######################################################################
    try:
        scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                    request_spec)
        with self.rt.instance_claim(context, instance, node, allocs,
                                    limits):
            # See my comments in instance_claim below.
    ######################################################################

Here is my investigation, marked with <yingji></yingji> tags:

    def _build_and_run_instance(self, context, instance, image, injected_files,
                                admin_password, requested_networks,
                                security_groups, block_device_mapping, node,
                                limits, filter_properties, request_spec=None):

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)
        <yingji> I see the rabbitmq message sent here. </yingji>

        # NOTE(mikal): cache the keystone roles associated with the instance
        # at boot time for later reference
        instance.system_metadata.update(
            {'boot_roles': ','.join(context.roles)})

        self._check_device_tagging(requested_networks, block_device_mapping)
        self._check_trusted_certs(instance)

        request_group_resource_providers_mapping = \
            self._get_request_group_mapping(request_spec)

        if request_group_resource_providers_mapping:
            self._update_pci_request_spec_with_allocated_interface_name(
                context, instance, request_group_resource_providers_mapping)

        # TODO(Luyao) cut over to get_allocs_for_consumer
        allocs = self.reportclient.get_allocations_for_consumer(
            context, instance.uuid)
        <yingji> I see "GET /allocations/<my-instance-uuid>" in
        placement-api.log, so it looks like it reaches here. </yingji>

        # My suspect code snippet.
        <yingji>
        ######################################################################
        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
            with self.rt.instance_claim(context, instance, node, allocs,
                                        limits):
                # See my comments in instance_claim below.
        ######################################################################
        </yingji>

        ........

            with self._build_resources(context, instance,
                    requested_networks, security_groups, image_meta,
                    block_device_mapping,
                    request_group_resource_providers_mapping) as resources:
                instance.vm_state = vm_states.BUILDING
                instance.task_state = task_states.SPAWNING
                # NOTE(JoshNang) This also saves the changes to the
                # instance from _allocate_network_async, as they aren't
                # saved in that function to prevent races.
                instance.save(expected_task_state=
                        task_states.BLOCK_DEVICE_MAPPING)
                block_device_info = resources['block_device_info']
                network_info = resources['network_info']
                LOG.debug('Start spawning the instance on the hypervisor.',
                          instance=instance)
                with timeutils.StopWatch() as timer:
                    <yingji> The driver code starts here. However, in my case
                    it does not look like it reaches here. </yingji>
                    self.driver.spawn(context, instance, image_meta,
                                      injected_files, admin_password,
                                      allocs, network_info=network_info,
                                      block_device_info=block_device_info)

    @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
    def instance_claim(self, context, instance, nodename, allocations,
                       limits=None):
        ......

        if self.disabled(nodename):
            # instance_claim() was called before update_available_resource()
            # (which ensures that a compute node exists for nodename). We
            # shouldn't get here but in case we do, just set the instance's
            # host and nodename attribute (probably incorrect) and return a
            # NoopClaim.
            # TODO(jaypipes): Remove all the disabled junk from the resource
            # tracker. Servicegroup API-level active-checking belongs in the
            # nova-compute manager.
            self._set_instance_host_and_node(instance, nodename)
            return claims.NopClaim()

        # sanity checks:
        if instance.host:
            LOG.warning("Host field should not be set on the instance "
                        "until resources have been claimed.",
                        instance=instance)

        if instance.node:
            LOG.warning("Node field should not be set on the instance "
                        "until resources have been claimed.",
                        instance=instance)

        cn = self.compute_nodes[nodename]
        <yingji> I did not see the rabbitmq message that should be sent here. </yingji>
        pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
            context, instance.uuid)

Yingji.

> On 4/26/21, 10:46 PM, "Sean Mooney" <smooney@redhat.com> wrote:
This issue is not always reproducible and restarting the compute service can work around this.
Could you please give any suggestions on how to resolve this issue, or on how I can investigate?
I assume this is with the vmware driver? It kind of sounds like eventlet is not monkey patching properly and it is blocking on a call, or something like that.
We have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib
and, as a result, would end up blocking in the compute agent when making some external calls to libvirt.
I wonder if you are seeing something similar?
Yingji.
On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
Sean,
You are right. I am working with the vmware driver. Is it possible that you could share some sample code fixes so that I can try them in my environment?
In the libvirt case we had service-wide hangs https://bugs.launchpad.net/nova/+bug/1840912 that were resolved by https://github.com/openstack/nova/commit/36ee9c1913a449defd3b35f5ee5fb4afcd4...
Below is my investigation. Would you please give any suggestions?
With my instance, vm_state is building and task_state is NULL.
I have a suspect here:

    ######################################################################
    try:
        scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                    request_spec)
        with self.rt.instance_claim(context, instance, node, allocs,
                                    limits):
            # See my comments in instance_claim below.
    ######################################################################

Here is my investigation, marked with <yingji></yingji> tags:

    def _build_and_run_instance(self, context, instance, image, injected_files,
                                admin_password, requested_networks,
                                security_groups, block_device_mapping, node,
                                limits, filter_properties, request_spec=None):

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)
        <yingji> I see the rabbitmq message sent here. </yingji>

        # NOTE(mikal): cache the keystone roles associated with the instance
        # at boot time for later reference
        instance.system_metadata.update(
            {'boot_roles': ','.join(context.roles)})

        self._check_device_tagging(requested_networks, block_device_mapping)
        self._check_trusted_certs(instance)

        request_group_resource_providers_mapping = \
            self._get_request_group_mapping(request_spec)

        if request_group_resource_providers_mapping:
            self._update_pci_request_spec_with_allocated_interface_name(
                context, instance, request_group_resource_providers_mapping)

        # TODO(Luyao) cut over to get_allocs_for_consumer
        allocs = self.reportclient.get_allocations_for_consumer(
            context, instance.uuid)
        <yingji> I see "GET /allocations/<my-instance-uuid>" in
        placement-api.log, so it looks like it reaches here. </yingji>

        # My suspect code snippet.
        <yingji>
        ######################################################################
        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
            with self.rt.instance_claim(context, instance, node, allocs,
                                        limits):
                # See my comments in instance_claim below.
        ######################################################################
        </yingji>

        ........

            with self._build_resources(context, instance,
                    requested_networks, security_groups, image_meta,
                    block_device_mapping,
                    request_group_resource_providers_mapping) as resources:
                instance.vm_state = vm_states.BUILDING
                instance.task_state = task_states.SPAWNING
                # NOTE(JoshNang) This also saves the changes to the
                # instance from _allocate_network_async, as they aren't
                # saved in that function to prevent races.
                instance.save(expected_task_state=
                        task_states.BLOCK_DEVICE_MAPPING)
                block_device_info = resources['block_device_info']
                network_info = resources['network_info']
                LOG.debug('Start spawning the instance on the hypervisor.',
                          instance=instance)
                with timeutils.StopWatch() as timer:
                    <yingji> The driver code starts here. However, in my case
                    it does not look like it reaches here. </yingji>
                    self.driver.spawn(context, instance, image_meta,
                                      injected_files, admin_password,
                                      allocs, network_info=network_info,
                                      block_device_info=block_device_info)
So this synchronized decorator prints a log message when it is acquired and released (https://github.com/openstack/oslo.concurrency/blob/4da91987d6ce7de2bb61c6ed7...); you should see that in the logs. I notice also that in your code you do not have the fair=true argument. On master, and for a few releases now, we have enabled the use of fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3 to resolve long delays in the ironic driver (https://bugs.launchpad.net/nova/+bug/1864122), but the same issue would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes. It is very possible that that is what is causing your current issues.
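(In code, the change Sean is pointing at boils down to passing fair=True through to the oslo.concurrency lock decorator. A rough sketch of the idea follows; the decorator is applied here directly to a toy class, whereas nova goes through its own utils.synchronized wrapper in the resource tracker, so this is not the actual upstream patch:)

    # Sketch only: shows the fair=True option of oslo.concurrency's lock
    # decorator on a toy class; it is not the real nova change.
    from oslo_concurrency import lockutils

    COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'


    class ToyResourceTracker(object):

        # With fair=True, waiters are woken in FIFO order, so a steady stream
        # of periodic-task acquisitions cannot indefinitely starve a pending
        # instance_claim() issued by a build request.
        @lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
        def instance_claim(self, context, instance, nodename, allocations,
                           limits=None):
            pass  # claim resources while holding the lock

        @lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
        def _safe_update_available_resource(self, context, compute_node):
            pass  # periodic resource update, guarded by the same lock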
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
def instance_claim(self, context, instance, nodename, allocations,
                   limits=None):
    ......

    if self.disabled(nodename):
        # instance_claim() was called before update_available_resource()
        # (which ensures that a compute node exists for nodename). We
        # shouldn't get here but in case we do, just set the instance's
        # host and nodename attribute (probably incorrect) and return a
        # NoopClaim.
        # TODO(jaypipes): Remove all the disabled junk from the resource
        # tracker. Servicegroup API-level active-checking belongs in the
        # nova-compute manager.
        self._set_instance_host_and_node(instance, nodename)
        return claims.NopClaim()

    # sanity checks:
    if instance.host:
        LOG.warning("Host field should not be set on the instance "
                    "until resources have been claimed.",
                    instance=instance)

    if instance.node:
        LOG.warning("Node field should not be set on the instance "
                    "until resources have been claimed.",
                    instance=instance)

    cn = self.compute_nodes[nodename]
    <yingji> I did not see the rabbitmq message that should be sent here. </yingji>
    pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
        context, instance.uuid)
Yingji.
> On 4/26/21, 10:46 PM, "Sean Mooney" <smooney@redhat.com> wrote:
This issue is not always reproducible and restarting the compute service can work around this.
Could you please give any suggestions on how to resolve this issue, or on how I can investigate?
I assume this is with the vmware driver? It kind of sounds like eventlet is not monkey patching properly and it is blocking on a call, or something like that.
We have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib
and, as a result, would end up blocking in the compute agent when making some external calls to libvirt.
I wonder if you are seeing something similar?
Yingji.
Sean,

I think your comments on the synchronized lock point to the root cause of my issue.

In my logs, I see that after a log of 'Lock "compute_resources" acquired by', my compute node gets "stuck".

Mar 12 01:13:58 controller-mpltc45f7n nova-compute[756]: 2021-03-12 01:13:58.044 1 DEBUG oslo_concurrency.lockutils [req-7f57447c-7aae-48fe-addd-46f80e80246a - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._safe_update_available_resource" :: waited 0.000s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327

I think the issue is caused by the lock "compute_resources" never being released. At that time there was some mysql issue when calling _safe_update_available_resource, so I think the exception was not handled and the lock was not released.

Yingji

On 4/27/21, 4:11 PM, "Sean Mooney" <smooney@redhat.com> wrote:

On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
Sean,
You are right. I am working with the vmware driver. Is it possible that you could share some sample code fixes so that I can try them in my environment?
In the libvirt case we had service-wide hangs https://bugs.launchpad.net/nova/+bug/1840912 that were resolved by https://github.com/openstack/nova/commit/36ee9c1913a449defd3b35f5ee5fb4afcd4...
So this synchronized decorator prints a log message when it is acquired and released (https://github.com/openstack/oslo.concurrency/blob/4da91987d6ce7de2bb61c6ed7...); you should see that in the logs.
I notice also that in your code you do not have the fair=true argument. On master, and for a few releases now, we have enabled the use of fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3 to resolve long delays in the ironic driver (https://bugs.launchpad.net/nova/+bug/1864122), but the same issue would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes.
It is very possible that that is what is causing your current issues.
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
if instance.node:
    LOG.warning("Node field should not be set on the instance "
                "until resources have been claimed.",
                instance=instance)

cn = self.compute_nodes[nodename]
<yingji> I did not see the rabbitmq message that should be sent here. </yingji>
pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
    context, instance.uuid)
On Thu, 2021-04-29 at 00:54 +0000, Yingji Sun wrote:
Sean,
I think your comments on the synchronized lock point to the root cause of my issue.
In my logs, I see that after a log of 'Lock "compute_resources" acquired by', my compute node gets "stuck".
Mar 12 01:13:58 controller-mpltc45f7n nova-compute[756]: 2021-03-12 01:13:58.044 1 DEBUG oslo_concurrency.lockutils [req-7f57447c-7aae-48fe-addd-46f80e80246a - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._safe_update_available_resource" :: waited 0.000s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327
I think the issue is caused by the lock "compute_resources" never being released. At that time there was some mysql issue when calling _safe_update_available_resource, so I think the exception was not handled and the lock was not released.

We acquire the locks with decorators, so they cannot be leaked in that way if there is an exception. But without the use of "fair" locks there is no guarantee which greenthread will be resumed, so individual requests can end up waiting a very long time. I think it is more likely that the lock is being acquired and released properly, but some operations, like the periodics, might be starving the other operations and the API requests are just not getting processed.
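(A tiny self-contained check of the first point, using oslo.concurrency directly rather than nova: the decorator releases the semaphore even when the decorated function raises, so an unhandled mysql error by itself cannot leak the lock.)

    # Standalone demo, not nova code: lockutils.synchronized releases the lock
    # on exception, so a failure inside the critical section cannot leave
    # "compute_resources" held forever.
    from oslo_concurrency import lockutils


    @lockutils.synchronized('compute_resources')
    def failing_periodic_update():
        # Stands in for _safe_update_available_resource hitting a DB error.
        raise RuntimeError('simulated mysql failure')


    @lockutils.synchronized('compute_resources')
    def build_request():
        return 'lock acquired again, the build can proceed'


    try:
        failing_periodic_update()
    except RuntimeError:
        pass

    # If the exception had leaked the lock, this call would hang; it returns
    # because the decorator released the semaphore on the way out.
    print(build_request())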
Yingji
On 4/27/21, 4:11 PM, "Sean Mooney" <smooney@redhat.com> wrote:
On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
Sean,
You are right. I am working with the vmware driver. Is it possible that you could share some sample code fixes so that I can try them in my environment?
In the libvirt case we had service-wide hangs https://bugs.launchpad.net/nova/+bug/1840912 that were resolved by https://github.com/openstack/nova/commit/36ee9c1913a449defd3b35f5ee5fb4afcd4...
So this synchronized decorator prints a log message when it is acquired and released (https://github.com/openstack/oslo.concurrency/blob/4da91987d6ce7de2bb61c6ed7...); you should see that in the logs.
I notice also that in your code you do not have the fair=true argument. On master, and for a few releases now, we have enabled the use of fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3 to resolve long delays in the ironic driver (https://bugs.launchpad.net/nova/+bug/1864122), but the same issue would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes.
It is very possible that that is what is causing your current issues.
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
if instance.node:
    LOG.warning("Node field should not be set on the instance "
                "until resources have been claimed.",
                instance=instance)

cn = self.compute_nodes[nodename]
<yingji> I did not see the rabbitmq message that should be sent here. </yingji>
pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
    context, instance.uuid)
participants (2)
- Sean Mooney
- Yingji Sun