A compute service hang issue

Yingji Sun yingjisun at vmware.com
Thu Apr 29 00:54:22 UTC 2021


Sean,

I think your comments on the synchronized lock point to the root cause of my issue.

In my logs, I see that after a log line of " Lock "compute_resources" acquired by ", my compute node gets "stuck".

Mar 12 01:13:58 controller-mpltc45f7n nova-compute[756]: 2021-03-12 01:13:58.044 1 DEBUG oslo_concurrency.lockutils [req-7f57447c-7aae-48fe-addd-46f80e80246a - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._safe_update_available_resource" :: waited 0.000s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327 

I think this issue is caused by the lock "compute_resources" never being released. At that time there was a mysql issue while calling _safe_update_available_resource, so I think the exception is not handled and the lock is not released.
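
Below is a rough, self-contained sketch of that failure mode (the function names are stand-ins, not the real resource tracker code): once one caller of the "compute_resources" lock gets stuck, every other caller of the same lock name simply queues behind it.

    import threading
    import time

    from oslo_concurrency import lockutils

    # nova's utils.synchronized is lockutils.synchronized_with_prefix('nova-')
    synchronized = lockutils.synchronized_with_prefix('nova-')

    @synchronized('compute_resources')
    def update_available_resource():
        # Stand-in for _safe_update_available_resource: while this call is
        # stuck (e.g. waiting on an unresponsive database), the
        # "compute_resources" lock stays held.
        time.sleep(60)

    @synchronized('compute_resources')
    def instance_claim():
        # Stand-in for any other method guarded by the same lock: it cannot
        # even start until the holder above returns, so the compute service
        # looks hung.
        print("claimed")

    threading.Thread(target=update_available_resource, daemon=True).start()
    time.sleep(0.1)
    instance_claim()  # blocks behind the stuck holder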

Yingji

On 4/27/21, 4:11 PM, "Sean Mooney" <smooney at redhat.com> wrote:

On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
> > Sean,
> > 
> > You are right. I am working with the vmware driver. Is it possible that you share some code fix samples so that I can have a try in my environment?

> in the libvirt case we had service wide hangs https://bugs.launchpad.net/nova/+bug/1840912 that were resolved by
https://github.com/openstack/nova/commit/36ee9c1913a449defd3b35f5ee5fb4afcd44169e
> > 
> > 
> so this synchronized decorator prints a log message which you should see
> when it is acquired and released
> https://github.com/openstack/oslo.concurrency/blob/4da91987d6ce7de2bb61c6ed760a019961a0a344/oslo_concurrency/lockutils.py#L355-L371
> you should see that in the logs.
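
A minimal way to see those messages outside of nova (assuming only that oslo.concurrency is installed and debug logging is enabled):

    import logging

    from oslo_concurrency import lockutils

    # DEBUG logging makes lockutils emit the acquire/release messages
    # referenced above; a lone "acquired by" line with no matching
    # release line points at the stuck holder.
    logging.basicConfig(level=logging.DEBUG)

    @lockutils.synchronized('compute_resources')
    def guarded_section():
        pass

    guarded_section()
    # Expected: a 'Lock "compute_resources" acquired by ...' debug line,
    # followed by a matching release line once guarded_section() returns.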
> 
> i notice also that in your code you do not have the fair=true argument; on master, and for a few releases now,
> we have enabled the use of fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3
> to resolve long delays in the ironic driver https://bugs.launchpad.net/nova/+bug/1864122 but the same issues
would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes.
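
The change amounts to passing fair=True through to the lock decorator; a minimal sketch of that pattern (the method body and signature are abbreviated here, not copied from nova):

    from oslo_concurrency import lockutils

    # In nova, utils.synchronized is lockutils.synchronized_with_prefix('nova-');
    # either form accepts the fair=True keyword.
    synchronized = lockutils.synchronized_with_prefix('nova-')

    COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'

    class ResourceTracker(object):

        @synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)
        def _safe_update_available_resource(self, context, *args, **kwargs):
            # With fair=True waiters are woken in FIFO order, so a
            # long-running periodic task cannot indefinitely starve other
            # callers waiting on "compute_resources".
            pass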

> it's very possible that that is what is causing your current issues.


> >     @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
> > 
> >         if instance.node:
> >             LOG.warning("Node field should not be set on the instance "
> >                         "until resources have been claimed.",
> >                         instance=instance)
> > 
> >         cn = self.compute_nodes[nodename]
> >         <yingji>
> >           I did not see the rabbitmq message that should be sent here.
> >         </yingji>
> >         pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
> >             context, instance.uuid)
> > 




