Sean,

I think your comments on the synchronized lock point to the root cause of my issue. In my logs, after a message of the form 'Lock "compute_resources" acquired by', my compute node gets stuck:

Mar 12 01:13:58 controller-mpltc45f7n nova-compute[756]: 2021-03-12 01:13:58.044 1 DEBUG oslo_concurrency.lockutils [req-7f57447c-7aae-48fe-addd-46f80e80246a - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._safe_update_available_resource" :: waited 0.000s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327

I think the problem is that the lock "compute_resources" is never released. At that time there was a MySQL issue while _safe_update_available_resource was being called, so I suspect the exception was not handled and the lock was not released.

Yingji

On 4/27/21, 4:11 PM, "Sean Mooney" <smooney@redhat.com> wrote:

On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
Sean,
You are right. I am working with the vmware driver. Is it possible for you to share some sample code fixes so that I can try them in my environment?
In the libvirt case we had service-wide hangs (https://bugs.launchpad.net/...) that were resolved by https://github.com/...
So this synchronized decorator prints a log message when the lock is acquired and when it is released (https://github.com/...); you should see both messages in the logs.
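To illustrate, here is a minimal sketch of my own (not nova code, assuming the default in-process lock): the decorator emits its acquire/release DEBUG lines around the decorated function, so a body that never returns shows up in the logs as an acquire with no matching release, and every other caller queues on the same semaphore.

# minimal sketch (not nova code): two threads contending on the same
# oslo.concurrency lock name. while the first caller blocks inside the
# critical section (a stand-in for a hung DB call), the second waits
# on the semaphore, and no release line is logged for the first.
import logging
import threading
import time

from oslo_concurrency import lockutils

logging.basicConfig(level=logging.DEBUG)

@lockutils.synchronized('compute_resources')
def update_resources(delay):
    # stand-in for the work done under the lock; while this sleeps,
    # the lock stays held and all other callers queue behind it
    time.sleep(delay)

threading.Thread(target=update_resources, args=(600,)).start()
time.sleep(1)
update_resources(0)  # blocks here until the first thread releases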
I notice also that in your code you do not have the fair=True argument. On master, and for a few releases now, we have enabled fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3 to resolve long delays in the ironic driver (https://bugs.launchpad.net/...), but the same issue would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes.
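That change essentially comes down to passing fair=True where the lock is taken. A simplified sketch of the idea (names abridged; this is not the literal nova patch):

# sketch of the fair-locking change (simplified from the commit above):
# fair=True hands the lock to waiters in FIFO order, so in a clustered
# driver one node's update cannot repeatedly jump the queue ahead of
# the periodic tasks for the other nodes.
from oslo_concurrency import lockutils

COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'

@lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
def update_available_resource(nodename):
    ...  # resource tracker work for one node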
It's very possible that that is what is causing your current issues.
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
def instance_claim(self, context, instance, nodename, allocations,
                   limits=None):
    ...
    if instance.node:
        LOG.warning("Node field should not be set on the instance "
                    "until resources have been claimed.",
                    instance=instance)

    cn = self.compute_nodes[nodename]

    <yingji>
    I did not see the rabbitmq message that should be sent here.
    </yingji>

    pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
        context, instance.uuid)
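One way to check the "lock never released" theory directly from the logs is to verify that every acquire of "compute_resources" has a matching release. A rough hypothetical helper (my own script, not nova tooling; it assumes oslo's DEBUG acquire/release lines are present in the log file):

# hypothetical helper (mine, not nova tooling): scan a nova-compute log
# for "compute_resources" lock messages and report an acquire that has
# no matching release, which is what a stuck critical section looks
# like in the logs.
import re
import sys

ACQUIRED = re.compile(r'Lock "compute_resources" acquired by "([^"]+)"')
RELEASED = re.compile(r'Lock "compute_resources".*released by "([^"]+)"')

def find_unreleased(path):
    held = 0
    last_acquire = None
    with open(path) as f:
        for line in f:
            if ACQUIRED.search(line):
                held += 1
                last_acquire = line.strip()
            elif RELEASED.search(line):
                held = max(0, held - 1)
    if held and last_acquire:
        print("possibly stuck acquire:", last_acquire)

if __name__ == '__main__':
    find_unreleased(sys.argv[1])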