A compute service hang issue
Buddies,

Have you ever seen an issue where the nova-compute service appears to hang and cannot boot instances? When creating an instance, we can only see logs like the ones below, and there is no other information.

2021-04-23 02:39:00.252 1 DEBUG nova.compute.manager [req-a2ed90b6-792f-48e1-ba7f-e1e35d9537c2 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: e0e77edc-99b4-473e-b318-8e6f04428cda] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202

2021-04-23 03:03:04.927 1 DEBUG oslo_concurrency.lockutils [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "a3619b50-3704-4f3e-b908-d525a41756eb" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327

There are NO other messages, not even from the periodic tasks. If I send a request to create another instance, I only see another _do_build_and_run_instance log.

2021-04-23 03:03:05.061 1 DEBUG nova.compute.manager [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: a3619b50-3704-4f3e-b908-d525a41756eb] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202

2021-04-23 03:32:44.718 1 DEBUG oslo_concurrency.lockutils [req-c7857cb3-02ae-4c7b-92d7-b2ec178c1b13 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "3b07a6eb-12bb-4d19-9302-25ea3e746944" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327

I am sure the compute service is still running, as I can see the heartbeat time changing correctly.
From the rabbitmq messages, it looks like the code reaches this point:

nova/compute/manager.py

    def _build_and_run_instance(self, context, instance, image, injected_files,
                                admin_password, requested_networks,
                                security_groups, block_device_mapping, node,
                                limits, filter_properties, request_spec=None):
        image_name = image.get('name')

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)

as I can see a message with "event_type": "compute.instance.create.start":

INFO:root:Body: {'oslo.version': '2.0', 'oslo.message': '{"message_id": "0ef11509-7a65-46b8-ac7e-ed6482fc527d", "publisher_id": "compute.compute01", "event_type": "compute.instance.create.start", "priority": "INFO", "payload": {"tenant_id": "1ee7955a2eaf4c86bcc3e650f2a8e2a7", "user_id": "373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e", "instance_id": "3b07a6eb-12bb-4d19-9302-25ea3e746944", "display_name": "yingji-06", "reservation_id": "r-f8iphdcf", "hostname": "yingji-06", "instance_type": "m1.tiny", "instance_type_id": 15, "instance_flavor_id": "4d5644e1-c561-4b6c-952c-f4dd93c87948", "architecture": null, "memory_mb": 512, "disk_gb": 1, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "host": null, "node": null, "availability_zone": "nova", "cell_name": "", "created_at": "2021-04-23 03:32:39+00:00", "terminated_at": "", "deleted_at": "", "launched_at": "", "image_ref_url": "https://192.168.111.160:9292//images/303a325f-048

This issue is not always reproducible and restarting the compute service can work around this.

Could you please give any suggestions on how to resolve this issue, or on how I can investigate?

Yingji.
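(For reference, a trace like the "INFO:root:Body:" dump above can be captured with a small standalone oslo.messaging listener. This is only a sketch: the broker URL, the exchange/topic names and the pool name are assumptions based on common nova defaults and may need adjusting for your deployment.)

    # Hypothetical standalone helper (not part of nova): print notifications
    # such as compute.instance.create.start as they arrive on the message bus.
    import time

    from oslo_config import cfg
    import oslo_messaging


    class DumpEndpoint(object):
        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            # Called for every INFO-priority notification we receive.
            print('%s from %s: %s' % (event_type, publisher_id, payload))


    def main():
        # Broker URL is an assumption; point it at your RabbitMQ instance.
        transport = oslo_messaging.get_notification_transport(
            cfg.CONF, url='rabbit://guest:guest@rabbit-host:5672/')
        # Exchange/topic names are assumptions (nova's defaults).
        targets = [oslo_messaging.Target(exchange='nova', topic='notifications')]
        # A dedicated pool gives this listener its own queue so it does not
        # steal messages from ceilometer or other notification consumers.
        listener = oslo_messaging.get_notification_listener(
            transport, targets, [DumpEndpoint()], executor='threading',
            pool='debug-dump')
        listener.start()
        try:
            while True:
                time.sleep(1)
        except KeyboardInterrupt:
            listener.stop()
            listener.wait()


    if __name__ == '__main__':
        main()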
On Mon, 2021-04-26 at 13:56 +0000, Yingji Sun wrote:
Buddies,
Have you ever seen an issue where the nova-compute service appears to hang and cannot boot instances?
When creating an instance, we can only see logs like the ones below, and there is no other information.
2021-04-23 02:39:00.252 1 DEBUG nova.compute.manager [req-a2ed90b6-792f-48e1-ba7f-e1e35d9537c2 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: e0e77edc-99b4-473e-b318-8e6f04428cda] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202
2021-04-23 03:03:04.927 1 DEBUG oslo_concurrency.lockutils [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "a3619b50-3704-4f3e-b908-d525a41756eb" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327
There are NO other messages, not even from the periodic tasks.
If I send a request to create another instance, I only see another _do_build_and_run_instance log.
2021-04-23 03:03:05.061 1 DEBUG nova.compute.manager [req-50659582-ad9b-4c17-bbb2-254d5e1141f9 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] [instance: a3619b50-3704-4f3e-b908-d525a41756eb] Starting instance... _do_build_and_run_instance /usr/lib/python3.7/site-packages/nova/compute/manager.py:2202
2021-04-23 03:32:44.718 1 DEBUG oslo_concurrency.lockutils [req-c7857cb3-02ae-4c7b-92d7-b2ec178c1b13 373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e 1ee7955a2eaf4c86bcc3e650f2a8e2a7 - fc86abb50a684911a30f7955d386a3ea fc86abb50a684911a30f7955d386a3ea] Lock "3b07a6eb-12bb-4d19-9302-25ea3e746944" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.001s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327
I am sure the compute service is still running, as I can see the heartbeat time changing correctly.
From the rabbitmq messages, it looks like the code reaches this point:
nova/compute/manager.py
    def _build_and_run_instance(self, context, instance, image, injected_files,
                                admin_password, requested_networks,
                                security_groups, block_device_mapping, node,
                                limits, filter_properties, request_spec=None):
        image_name = image.get('name')

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)
as I can see a message with "event_type": "compute.instance.create.start"
INFO:root:Body: {'oslo.version': '2.0', 'oslo.message': '{"message_id": "0ef11509-7a65-46b8-ac7e-ed6482fc527d", "publisher_id": "compute.compute01", "event_type": "compute.instance.create.start", "priority": "INFO", "payload": {"tenant_id": "1ee7955a2eaf4c86bcc3e650f2a8e2a7", "user_id": "373cb863547407cf3b99034b3b66395e76c137b40f905e7a61e25b1f97df4f3e", "instance_id": "3b07a6eb-12bb-4d19-9302-25ea3e746944", "display_name": "yingji-06", "reservation_id": "r-f8iphdcf", "hostname": "yingji-06", "instance_type": "m1.tiny", "instance_type_id": 15, "instance_flavor_id": "4d5644e1-c561-4b6c-952c-f4dd93c87948", "architecture": null, "memory_mb": 512, "disk_gb": 1, "vcpus": 1, "root_gb": 1, "ephemeral_gb": 0, "host": null, "node": null, "availability_zone": "nova", "cell_name": "", "created_at": "2021-04-23 03:32:39+00:00", "terminated_at": "", "deleted_at": "", "launched_at": "", "image_ref_url": "https://192.168.111.160:9292//images/303a325f-048
This issue is not always reproducible and restarting the compute service can work around this.
Could you please give any suggestions on how to resolve this issue, or on how I can investigate?
I assume this is with the vmware driver? It kind of sounds like eventlet is not monkey patching properly and it is blocking on a call, or something like that.
We have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib and, as a result, would end up blocking in the compute agent when making some external calls to libvirt. I wonder if you are seeing something similar?
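(For what it's worth, the usual cure for that class of bug is to push the blocking native call onto eventlet's OS-thread pool. A minimal sketch, assuming a hypothetical SlowSdkClient that stands in for whatever native or remote SDK the driver talks to; this is not nova code:)

    # Minimal sketch, not nova code: SlowSdkClient is a made-up stand-in for a
    # native binding (e.g. a C-extension SDK) whose calls never yield to the
    # eventlet hub.
    import eventlet
    eventlet.monkey_patch()

    from eventlet import tpool


    class SlowSdkClient(object):
        def long_blocking_call(self):
            # Imagine a C-level or network call here that blocks the whole
            # OS thread without yielding to eventlet.
            return 'done'


    # Called directly, long_blocking_call() would stall the OS thread running
    # the eventlet hub, and with it every greenthread in the service. Wrapping
    # the client in tpool.Proxy dispatches each method call to a worker OS
    # thread instead, so the rest of the service keeps running while it blocks.
    client = tpool.Proxy(SlowSdkClient())
    print(client.long_blocking_call())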
Yingji.
Sean,

You are right. I am working with the vmware driver. Is it possible that you could share some sample code fixes so that I can try them in my environment?

Below is my investigation. Would you please give any suggestions?

With my instance, vm_state is building and task_state is NULL.

I have a suspect here:

    ######################################################################
    try:
        scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                    request_spec)
        with self.rt.instance_claim(context, instance, node, allocs,
                                    limits):
            # See my comments in instance_claim below.
    ######################################################################

Here is my investigation, marked with <yingji></yingji> tags:

    def _build_and_run_instance(self, context, instance, image, injected_files,
                                admin_password, requested_networks,
                                security_groups, block_device_mapping, node,
                                limits, filter_properties, request_spec=None):

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)
        <yingji> I see the rabbitmq message sent here. </yingji>

        # NOTE(mikal): cache the keystone roles associated with the instance
        # at boot time for later reference
        instance.system_metadata.update(
            {'boot_roles': ','.join(context.roles)})

        self._check_device_tagging(requested_networks, block_device_mapping)
        self._check_trusted_certs(instance)

        request_group_resource_providers_mapping = \
            self._get_request_group_mapping(request_spec)

        if request_group_resource_providers_mapping:
            self._update_pci_request_spec_with_allocated_interface_name(
                context, instance, request_group_resource_providers_mapping)

        # TODO(Luyao) cut over to get_allocs_for_consumer
        allocs = self.reportclient.get_allocations_for_consumer(
            context, instance.uuid)
        <yingji> I see "GET /allocations/<my-instance-uuid>" in
        placement-api.log, so it looks like it reaches here. </yingji>

        # My suspect code snippet.
        <yingji>
        ######################################################################
        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
            with self.rt.instance_claim(context, instance, node, allocs,
                                        limits):
                # See my comments in instance_claim below.
        ######################################################################
        </yingji>

        ........

            with self._build_resources(context, instance,
                    requested_networks, security_groups, image_meta,
                    block_device_mapping,
                    request_group_resource_providers_mapping) as resources:
                instance.vm_state = vm_states.BUILDING
                instance.task_state = task_states.SPAWNING
                # NOTE(JoshNang) This also saves the changes to the
                # instance from _allocate_network_async, as they aren't
                # saved in that function to prevent races.
                instance.save(expected_task_state=
                        task_states.BLOCK_DEVICE_MAPPING)
                block_device_info = resources['block_device_info']
                network_info = resources['network_info']
                LOG.debug('Start spawning the instance on the hypervisor.',
                          instance=instance)
                with timeutils.StopWatch() as timer:
                    <yingji> The driver code starts here. However, in my case
                    it does not look like it reaches here. </yingji>
                    self.driver.spawn(context, instance, image_meta,
                                      injected_files, admin_password,
                                      allocs, network_info=network_info,
                                      block_device_info=block_device_info)

    @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
    def instance_claim(self, context, instance, nodename, allocations,
                       limits=None):
        ......

        if self.disabled(nodename):
            # instance_claim() was called before update_available_resource()
            # (which ensures that a compute node exists for nodename). We
            # shouldn't get here but in case we do, just set the instance's
            # host and nodename attribute (probably incorrect) and return a
            # NoopClaim.
            # TODO(jaypipes): Remove all the disabled junk from the resource
            # tracker. Servicegroup API-level active-checking belongs in the
            # nova-compute manager.
            self._set_instance_host_and_node(instance, nodename)
            return claims.NopClaim()

        # sanity checks:
        if instance.host:
            LOG.warning("Host field should not be set on the instance "
                        "until resources have been claimed.",
                        instance=instance)

        if instance.node:
            LOG.warning("Node field should not be set on the instance "
                        "until resources have been claimed.",
                        instance=instance)

        cn = self.compute_nodes[nodename]
        <yingji> I did not see the rabbitmq message that should be sent here. </yingji>
        pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
            context, instance.uuid)

Yingji.

> On 4/26/21, 10:46 PM, "Sean Mooney" <smooney@redhat.com> wrote:
This issue is not always reproducible and restarting the compute service can work around this.
Could you please give any suggestions on how to resolve this issue, or on how I can investigate?
I assume this is with the vmware driver? It kind of sounds like eventlet is not monkey patching properly and it is blocking on a call, or something like that.
We have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib
and, as a result, would end up blocking in the compute agent when making some external calls to libvirt.
I wonder if you are seeing something similar?
Yingji.
On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
Sean,
You are right. I am working with the vmware driver. Is it possible that you could share some sample code fixes so that I can try them in my environment?
In the libvirt case we had service-wide hangs https://bugs.launchpad.net/nova/+bug/1840912 that were resolved by https://github.com/openstack/nova/commit/36ee9c1913a449defd3b35f5ee5fb4afcd4...
Below is my investigation. Would you please give any suggestions?
With my instance, vm_state is building and task_state is NULL.
I have a suspect here:

    ######################################################################
    try:
        scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                    request_spec)
        with self.rt.instance_claim(context, instance, node, allocs,
                                    limits):
            # See my comments in instance_claim below.
    ######################################################################

Here is my investigation, marked with <yingji></yingji> tags:

    def _build_and_run_instance(self, context, instance, image, injected_files,
                                admin_password, requested_networks,
                                security_groups, block_device_mapping, node,
                                limits, filter_properties, request_spec=None):

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)
        <yingji> I see the rabbitmq message sent here. </yingji>

        # NOTE(mikal): cache the keystone roles associated with the instance
        # at boot time for later reference
        instance.system_metadata.update(
            {'boot_roles': ','.join(context.roles)})

        self._check_device_tagging(requested_networks, block_device_mapping)
        self._check_trusted_certs(instance)

        request_group_resource_providers_mapping = \
            self._get_request_group_mapping(request_spec)

        if request_group_resource_providers_mapping:
            self._update_pci_request_spec_with_allocated_interface_name(
                context, instance, request_group_resource_providers_mapping)

        # TODO(Luyao) cut over to get_allocs_for_consumer
        allocs = self.reportclient.get_allocations_for_consumer(
            context, instance.uuid)
        <yingji> I see "GET /allocations/<my-instance-uuid>" in
        placement-api.log, so it looks like it reaches here. </yingji>

        # My suspect code snippet.
        <yingji>
        ######################################################################
        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
            with self.rt.instance_claim(context, instance, node, allocs,
                                        limits):
                # See my comments in instance_claim below.
        ######################################################################
        </yingji>

        ........

            with self._build_resources(context, instance,
                    requested_networks, security_groups, image_meta,
                    block_device_mapping,
                    request_group_resource_providers_mapping) as resources:
                instance.vm_state = vm_states.BUILDING
                instance.task_state = task_states.SPAWNING
                # NOTE(JoshNang) This also saves the changes to the
                # instance from _allocate_network_async, as they aren't
                # saved in that function to prevent races.
                instance.save(expected_task_state=
                        task_states.BLOCK_DEVICE_MAPPING)
                block_device_info = resources['block_device_info']
                network_info = resources['network_info']
                LOG.debug('Start spawning the instance on the hypervisor.',
                          instance=instance)
                with timeutils.StopWatch() as timer:
                    <yingji> The driver code starts here. However, in my case
                    it does not look like it reaches here. </yingji>
                    self.driver.spawn(context, instance, image_meta,
                                      injected_files, admin_password,
                                      allocs, network_info=network_info,
                                      block_device_info=block_device_info)
So this synchronized decorator prints a log message when it is acquired and released (https://github.com/openstack/oslo.concurrency/blob/4da91987d6ce7de2bb61c6ed7...); you should see that in the logs. I notice also that in your code you do not have the fair=true argument. On master, and for a few releases now, we have enabled the use of fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3 to resolve long delays in the ironic driver (https://bugs.launchpad.net/nova/+bug/1864122), but the same issue would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes. It is very possible that that is what is causing your current issues.
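(In code, the change Sean is pointing at boils down to passing fair=True through to the oslo.concurrency lock decorator. A rough sketch of the idea follows; the decorator is applied here directly to a toy class, whereas nova goes through its own utils.synchronized wrapper in the resource tracker, so this is not the actual upstream patch:)

    # Sketch only: shows the fair=True option of oslo.concurrency's lock
    # decorator on a toy class; it is not the real nova change.
    from oslo_concurrency import lockutils

    COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'


    class ToyResourceTracker(object):

        # With fair=True, waiters are woken in FIFO order, so a steady stream
        # of periodic-task acquisitions cannot indefinitely starve a pending
        # instance_claim() issued by a build request.
        @lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
        def instance_claim(self, context, instance, nodename, allocations,
                           limits=None):
            pass  # claim resources while holding the lock

        @lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
        def _safe_update_available_resource(self, context, compute_node):
            pass  # periodic resource update, guarded by the same lock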
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
def instance_claim(self, context, instance, nodename, allocations,
                   limits=None):
    ......

    if self.disabled(nodename):
        # instance_claim() was called before update_available_resource()
        # (which ensures that a compute node exists for nodename). We
        # shouldn't get here but in case we do, just set the instance's
        # host and nodename attribute (probably incorrect) and return a
        # NoopClaim.
        # TODO(jaypipes): Remove all the disabled junk from the resource
        # tracker. Servicegroup API-level active-checking belongs in the
        # nova-compute manager.
        self._set_instance_host_and_node(instance, nodename)
        return claims.NopClaim()

    # sanity checks:
    if instance.host:
        LOG.warning("Host field should not be set on the instance "
                    "until resources have been claimed.",
                    instance=instance)

    if instance.node:
        LOG.warning("Node field should not be set on the instance "
                    "until resources have been claimed.",
                    instance=instance)

    cn = self.compute_nodes[nodename]
    <yingji> I did not see the rabbitmq message that should be sent here. </yingji>
    pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
        context, instance.uuid)
Yingji.
> On 4/26/21, 10:46 PM, "Sean Mooney" <smooney@redhat.com> wrote:
This issue is not always reproducible and restarting the compute service can work around this.
Could you please give any suggestions on how to resolve this issue, or on how I can investigate?
I assume this is with the vmware driver? It kind of sounds like eventlet is not monkey patching properly and it is blocking on a call, or something like that.
We have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib
and, as a result, would end up blocking in the compute agent when making some external calls to libvirt.
I wonder if you are seeing something similar?
Yingji.
Sean,

I think your comments on the synchronized lock point to the root cause of my issue.

In my logs, I see that after a log of 'Lock "compute_resources" acquired by', my compute node gets "stuck".

Mar 12 01:13:58 controller-mpltc45f7n nova-compute[756]: 2021-03-12 01:13:58.044 1 DEBUG oslo_concurrency.lockutils [req-7f57447c-7aae-48fe-addd-46f80e80246a - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._safe_update_available_resource" :: waited 0.000s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327

I think the issue is caused by the lock "compute_resources" never being released. At that time there was some mysql issue when calling _safe_update_available_resource, so I think the exception was not handled and the lock was not released.

Yingji

On 4/27/21, 4:11 PM, "Sean Mooney" <smooney@redhat.com> wrote:

On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
Sean,
You are right. I am working with the vmware driver. Is it possible that you could share some sample code fixes so that I can try them in my environment?
In the libvirt case we had service-wide hangs https://bugs.launchpad.net/nova/+bug/1840912 that were resolved by https://github.com/openstack/nova/commit/36ee9c1913a449defd3b35f5ee5fb4afcd4...
So this synchronized decorator prints a log message when it is acquired and released (https://github.com/openstack/oslo.concurrency/blob/4da91987d6ce7de2bb61c6ed7...); you should see that in the logs.
I notice also that in your code you do not have the fair=true argument. On master, and for a few releases now, we have enabled the use of fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3 to resolve long delays in the ironic driver (https://bugs.launchpad.net/nova/+bug/1864122), but the same issue would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes.
It is very possible that that is what is causing your current issues.
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
if instance.node:
    LOG.warning("Node field should not be set on the instance "
                "until resources have been claimed.",
                instance=instance)

cn = self.compute_nodes[nodename]
<yingji> I did not see the rabbitmq message that should be sent here. </yingji>
pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
    context, instance.uuid)
On Thu, 2021-04-29 at 00:54 +0000, Yingji Sun wrote:
Sean,
I think your comments on the synchronized lock point to the root cause of my issue.
In my logs, I see that after a log of 'Lock "compute_resources" acquired by', my compute node gets "stuck".
Mar 12 01:13:58 controller-mpltc45f7n nova-compute[756]: 2021-03-12 01:13:58.044 1 DEBUG oslo_concurrency.lockutils [req-7f57447c-7aae-48fe-addd-46f80e80246a - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._safe_update_available_resource" :: waited 0.000s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327
I think the issue is caused by the lock "compute_resources" never being released. At that time there was some mysql issue when calling _safe_update_available_resource, so I think the exception was not handled and the lock was not released.

We acquire the locks with decorators, so they cannot be leaked in that way if there is an exception. But without the use of "fair" locks there is no guarantee which greenthread will be resumed, so individual requests can end up waiting a very long time. I think it is more likely that the lock is being acquired and released properly, but some operations, like the periodics, might be starving the other operations and the API requests are just not getting processed.
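(A tiny self-contained check of the first point, using oslo.concurrency directly rather than nova: the decorator releases the semaphore even when the decorated function raises, so an unhandled mysql error by itself cannot leak the lock.)

    # Standalone demo, not nova code: lockutils.synchronized releases the lock
    # on exception, so a failure inside the critical section cannot leave
    # "compute_resources" held forever.
    from oslo_concurrency import lockutils


    @lockutils.synchronized('compute_resources')
    def failing_periodic_update():
        # Stands in for _safe_update_available_resource hitting a DB error.
        raise RuntimeError('simulated mysql failure')


    @lockutils.synchronized('compute_resources')
    def build_request():
        return 'lock acquired again, the build can proceed'


    try:
        failing_periodic_update()
    except RuntimeError:
        pass

    # If the exception had leaked the lock, this call would hang; it returns
    # because the decorator released the semaphore on the way out.
    print(build_request())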
Yingji
On 4/27/21, 4:11 PM, "Sean Mooney" <smooney@redhat.com> wrote:
On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
Sean,
You are right. I am working with the vmware driver. Is it possible that you could share some sample code fixes so that I can try them in my environment?
In the libvirt case we had service-wide hangs https://bugs.launchpad.net/nova/+bug/1840912 that were resolved by https://github.com/openstack/nova/commit/36ee9c1913a449defd3b35f5ee5fb4afcd4...
So this synchronized decorator prints a log message when it is acquired and released (https://github.com/openstack/oslo.concurrency/blob/4da91987d6ce7de2bb61c6ed7...); you should see that in the logs.
I notice also that in your code you do not have the fair=true argument. On master, and for a few releases now, we have enabled the use of fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3 to resolve long delays in the ironic driver (https://bugs.launchpad.net/nova/+bug/1864122), but the same issue would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes.
It is very possible that that is what is causing your current issues.
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
if instance.node:
    LOG.warning("Node field should not be set on the instance "
                "until resources have been claimed.",
                instance=instance)

cn = self.compute_nodes[nodename]
<yingji> I did not see the rabbitmq message that should be sent here. </yingji>
pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
    context, instance.uuid)
participants (2)
- Sean Mooney
- Yingji Sun