A compute service hang issue

Sean Mooney smooney at redhat.com
Thu Apr 29 09:27:05 UTC 2021


On Thu, 2021-04-29 at 00:54 +0000, Yingji Sun wrote:
> Sean,
> 
> I think your comment about the synchronized lock points to the root cause of my issue.
> 
> In my logs, I see that after a line of 'Lock "compute_resources" acquired by', my compute node gets "stuck".
> 
> Mar 12 01:13:58 controller-mpltc45f7n nova-compute[756]: 2021-03-12 01:13:58.044 1 DEBUG oslo_concurrency.lockutils [req-7f57447c-7aae-48fe-addd-46f80e80246a - - - - -] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker._safe_update_available_resource" :: waited 0.000s inner /usr/lib/python3.7/site-packages/oslo_concurrency/lockutils.py:327 
> 
> I think this issue is caused by the lock "compute_resources" never being released. At that time there was a MySQL issue while calling _safe_update_available_resource, so I think the exception is not handled and the lock is not released.
We acquire the locks with decorators, so they can't be leaked in that way if there is an exception.
But without the use of "fair" locks there is no guarantee which greenthread will be resumed, so individual requests can end up waiting a very long time.
I think it's more likely that the lock is being acquired and released properly, but some operations, like the periodic tasks, might be starving the other
operations and the API requests are just not getting processed.
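
To make both points concrete, here is a minimal sketch (plain oslo.concurrency, not nova code), assuming a release where synchronized() accepts the fair argument. The decorator releases the lock on the way out even when the decorated function raises, so an unhandled DB error cannot leak the "compute_resources" lock; fair=True only changes which waiter is woken next:

    from oslo_concurrency import lockutils

    @lockutils.synchronized("compute_resources", fair=True)
    def update_available_resource():
        # anything raised here propagates to the caller, but the decorator
        # still releases the lock on the way out, so it is not leaked
        raise RuntimeError("simulated mysql failure")

    @lockutils.synchronized("compute_resources", fair=True)
    def api_request():
        return "processed"

    try:
        update_available_resource()
    except RuntimeError:
        pass

    # the second caller is not blocked forever because the lock was released
    print(api_request())

With fair=True waiters are woken in FIFO order, so a steady stream of periodic-task acquisitions can no longer keep jumping ahead of an API request that is already waiting.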
> 
> Yingji
> 
> On 4/27/21, 4:11 PM, "Sean Mooney" <smooney at redhat.com> wrote:
> 
> On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
> > > Sean,
> > > 
> > > You are right. I am working with the vmware driver. Is it possible for you to share some code fix samples so that I can try them in my environment?
> 
> > in the libvirt case we had service-wide hangs (https://bugs.launchpad.net/nova/+bug/1840912) that were resolved by
> > https://github.com/openstack/nova/commit/36ee9c1913a449defd3b35f5ee5fb4afcd44169e
> > > 
> > > 
> > so this synchronized decorator prints a log message which you should see
> > when it is acquired and released:
> > https://github.com/openstack/oslo.concurrency/blob/4da91987d6ce7de2bb61c6ed760a019961a0a344/oslo_concurrency/lockutils.py#L355-L371
> > you should see both messages in the logs.
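
A quick way to check that every acquisition of the lock has a matching release is to pair those two messages up in the nova-compute log. The exact wording can differ between oslo.concurrency releases, so the substrings below are assumptions modelled on the "acquired by" line quoted earlier in this thread:

    import sys

    depth = 0
    for line in open(sys.argv[1]):
        if 'Lock "compute_resources"' not in line:
            continue
        if "acquired by" in line:
            depth += 1
        elif "released" in line:
            depth -= 1

    # a count that keeps growing would point at leaked acquisitions;
    # 0 with long "waited" times points at lock starvation instead
    print("unreleased acquisitions:", depth)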
> > 
> > I notice also that in your code you do not have the fair=True argument. On master, and for a few releases now,
> > we have enabled the use of fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3ebbc3
> > to resolve long delays in the ironic driver (https://bugs.launchpad.net/nova/+bug/1864122), but the same issue
> > would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes.
> 
> > It's very possible that this is what is causing your current issues.
> 
> 
> > >     @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
> > > 
> > >         if instance.node:
> > >             LOG.warning("Node field should not be set on the instance "
> > >                         "until resources have been claimed.",
> > >                         instance=instance)
> > > 
> > >         cn = self.compute_nodes[nodename]
> > >         <yingji>
> > >           I did not see the rabbitmq message that should be sent here.
> > >         </yingji>
> > >         pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
> > >             context, instance.uuid)
> > > 
> 
> 
> 




