A compute service hang issue
Yingji Sun
yingjisun at vmware.com
Tue Apr 27 01:19:45 UTC 2021
Sean,
You are right. I am working with the vmware driver. Could you share some sample code fixes so that I can try them in my environment?
Below is my investigation. Would you please give any suggestions?
With my instance, vm_state is 'building' and task_state is NULL.
Here is the code I suspect:
######################################################################
    try:
        scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                    request_spec)
        with self.rt.instance_claim(context, instance, node, allocs,
                                    limits):
            # See my comments in instance_claim below.
######################################################################
Here is my investigation; my comments are wrapped in <yingji></yingji> markers.
def _build_and_run_instance(self, context, instance, image, injected_files,
        admin_password, requested_networks, security_groups,
        block_device_mapping, node, limits, filter_properties,
        request_spec=None):

    self._notify_about_instance_usage(context, instance, 'create.start',
            extra_usage_info={'image_name': image_name})
    compute_utils.notify_about_instance_create(
        context, instance, self.host,
        phase=fields.NotificationPhase.START,
        bdms=block_device_mapping)
    <yingji>
    I see the rabbitmq message sent here.
    </yingji>
    # NOTE(mikal): cache the keystone roles associated with the instance
    # at boot time for later reference
    instance.system_metadata.update(
        {'boot_roles': ','.join(context.roles)})

    self._check_device_tagging(requested_networks, block_device_mapping)
    self._check_trusted_certs(instance)

    request_group_resource_providers_mapping = \
        self._get_request_group_mapping(request_spec)

    if request_group_resource_providers_mapping:
        self._update_pci_request_spec_with_allocated_interface_name(
            context, instance, request_group_resource_providers_mapping)

    # TODO(Luyao) cut over to get_allocs_for_consumer
    allocs = self.reportclient.get_allocations_for_consumer(
        context, instance.uuid)
    <yingji>
    I see "GET /allocations/<my-instance-uuid>" in placement-api.log,
    so it appears to reach here.
    </yingji>
    # The code snippet I suspect.
    <yingji>
    ######################################################################
    try:
        scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                    request_spec)
        with self.rt.instance_claim(context, instance, node, allocs,
                                    limits):
            # See my comments in instance_claim below.
    ######################################################################
    </yingji>
........
        with self._build_resources(context, instance,
                requested_networks, security_groups, image_meta,
                block_device_mapping,
                request_group_resource_providers_mapping) as resources:
            instance.vm_state = vm_states.BUILDING
            instance.task_state = task_states.SPAWNING
            # NOTE(JoshNang) This also saves the changes to the
            # instance from _allocate_network_async, as they aren't
            # saved in that function to prevent races.
            instance.save(expected_task_state=
                          task_states.BLOCK_DEVICE_MAPPING)
            block_device_info = resources['block_device_info']
            network_info = resources['network_info']
            LOG.debug('Start spawning the instance on the hypervisor.',
                      instance=instance)
            with timeutils.StopWatch() as timer:
                <yingji>
                The driver code starts here.
                However, in my case it does not seem to reach here.
                </yingji>
                self.driver.spawn(context, instance, image_meta,
                                  injected_files, admin_password,
                                  allocs, network_info=network_info,
                                  block_device_info=block_device_info)
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
def instance_claim(self, context, instance, nodename, allocations,
                   limits=None):
    ......
    if self.disabled(nodename):
        # instance_claim() was called before update_available_resource()
        # (which ensures that a compute node exists for nodename). We
        # shouldn't get here but in case we do, just set the instance's
        # host and nodename attribute (probably incorrect) and return a
        # NoopClaim.
        # TODO(jaypipes): Remove all the disabled junk from the resource
        # tracker. Servicegroup API-level active-checking belongs in the
        # nova-compute manager.
        self._set_instance_host_and_node(instance, nodename)
        return claims.NopClaim()
    # sanity checks:
    if instance.host:
        LOG.warning("Host field should not be set on the instance "
                    "until resources have been claimed.",
                    instance=instance)

    if instance.node:
        LOG.warning("Node field should not be set on the instance "
                    "until resources have been claimed.",
                    instance=instance)

    cn = self.compute_nodes[nodename]
    <yingji>
    I did not see the rabbitmq message that should be sent here.
    </yingji>
    pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
        context, instance.uuid)
Yingji.
> On 4/26/21 10:46 PM, "Sean Mooney" <smooney at redhat.com> wrote:
>
> >
> >
> > This issue is not always reproducible, and restarting the compute service can work around it.
> >
> > Could you please give any suggestions on how to resolve this issue, or how I can investigate it?
> i assume this is with the vmware driver?
> it kind of sounds like eventlet is not monkey patching properly and it's blocking on a call or something like that.
>
> we have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib,
> and as a result we would end up blocking in the compute agent when making some external calls to libvirt.
> i wonder if you are seeing something similar?
> >
> > Yingji.
> >
More information about the openstack-discuss
mailing list