Sean,

You are right, I am working with the vmware driver. Is it possible for you to share some sample code fixes so that I can try them in my environment? (I have also put a rough sketch of what I understood from your suggestion at the bottom of this mail.)

Below is my investigation; would you please give any suggestions? With my instance, vm_state is "building" and task_state is NULL. My suspicion is here:

######################################################################
        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
            with self.rt.instance_claim(context, instance, node, allocs,
                                        limits):
                # See my comments in instance_claim below.
######################################################################

Here is my investigation, with my notes marked in <yingji></yingji>:

    def _build_and_run_instance(self, context, instance, image, injected_files,
            admin_password, requested_networks, security_groups,
            block_device_mapping, node, limits, filter_properties,
            request_spec=None):

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)

<yingji> I see the rabbitmq message sent here. </yingji>

        # NOTE(mikal): cache the keystone roles associated with the instance
        # at boot time for later reference
        instance.system_metadata.update(
            {'boot_roles': ','.join(context.roles)})

        self._check_device_tagging(requested_networks, block_device_mapping)
        self._check_trusted_certs(instance)

        request_group_resource_providers_mapping = \
            self._get_request_group_mapping(request_spec)

        if request_group_resource_providers_mapping:
            self._update_pci_request_spec_with_allocated_interface_name(
                context, instance, request_group_resource_providers_mapping)

        # TODO(Luyao) cut over to get_allocs_for_consumer
        allocs = self.reportclient.get_allocations_for_consumer(
            context, instance.uuid)

<yingji> I see "GET /allocations/<my-instance-uuid>" in placement-api.log,
so it looks like execution reaches here. </yingji>

        # My suspect code snippet.
<yingji>
######################################################################
        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
            with self.rt.instance_claim(context, instance, node, allocs,
                                        limits):
                # See my comments in instance_claim below.
######################################################################
</yingji>
        ........
            with self._build_resources(context, instance,
                    requested_networks, security_groups, image_meta,
                    block_device_mapping,
                    request_group_resource_providers_mapping) as resources:
                instance.vm_state = vm_states.BUILDING
                instance.task_state = task_states.SPAWNING
                # NOTE(JoshNang) This also saves the changes to the
                # instance from _allocate_network_async, as they aren't
                # saved in that function to prevent races.
                instance.save(expected_task_state=
                        task_states.BLOCK_DEVICE_MAPPING)
                block_device_info = resources['block_device_info']
                network_info = resources['network_info']
                LOG.debug('Start spawning the instance on the hypervisor.',
                          instance=instance)
                with timeutils.StopWatch() as timer:

<yingji> The driver code starts here; however, in my case it does not seem
to reach here. </yingji>

                    self.driver.spawn(context, instance, image_meta,
                                      injected_files, admin_password,
                                      allocs, network_info=network_info,
                                      block_device_info=block_device_info)


    @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
    def instance_claim(self, context, instance, nodename, allocations,
                       limits=None):
        ......
        if self.disabled(nodename):
            # instance_claim() was called before update_available_resource()
            # (which ensures that a compute node exists for nodename). We
            # shouldn't get here but in case we do, just set the instance's
            # host and nodename attribute (probably incorrect) and return a
            # NoopClaim.
            # TODO(jaypipes): Remove all the disabled junk from the resource
            # tracker. Servicegroup API-level active-checking belongs in the
            # nova-compute manager.
            self._set_instance_host_and_node(instance, nodename)
            return claims.NopClaim()

        # sanity checks:
        if instance.host:
            LOG.warning("Host field should not be set on the instance "
                        "until resources have been claimed.",
                        instance=instance)

        if instance.node:
            LOG.warning("Node field should not be set on the instance "
                        "until resources have been claimed.",
                        instance=instance)

        cn = self.compute_nodes[nodename]

<yingji> I did not see the rabbitmq message that should be sent here. </yingji>

        pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
            context, instance.uuid)
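If instance_claim() is really where it hangs (for example, another greenthread
is holding COMPUTE_RESOURCE_SEMAPHORE while blocked on an external call),
dumping the stacks of all greenthreads in the running nova-compute process
should show the exact blocking frame. Below is a minimal sketch of what I plan
to try; it assumes greenlet is importable in the process, and it could be
pasted into the eventlet backdoor shell if backdoor_port is set in nova.conf:

    import gc
    import traceback

    import greenlet

    def dump_greenthreads():
        # Walk every live greenlet and print its current stack. A
        # greenthread stuck while holding COMPUTE_RESOURCE_SEMAPHORE
        # should show the blocking call at the top of its stack.
        for ob in gc.get_objects():
            if not isinstance(ob, greenlet.greenlet):
                continue
            if not ob or ob.gr_frame is None:
                continue  # skip dead or not-yet-started greenlets
            print('--- greenthread %r ---' % (ob,))
            traceback.print_stack(ob.gr_frame)

    dump_greenthreads()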
Yingji.

> On 4/26/21, 10:46 PM, "Sean Mooney" <smooney@redhat.com> wrote:
>> This issue is not always reproducible, and restarting the compute service can work around it.
>> Could you please give any suggestions on how to resolve this issue, or on how I can investigate?
> I assume this is with the vmware driver? It kind of sounds like eventlet is not monkey patching properly and it is blocking on a call, or something like that.
> We have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib,
> and as a result we would end up blocking in the compute agent when making some external calls to libvirt.
> I wonder if you are seeing something similar?
>> Yingji.
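PS: as a first experiment with the proxying you describe above, this is the
kind of pattern I plan to try. It is only a minimal sketch under my own
assumptions: slow_native_call() is an illustrative stand-in for whatever
blocking SDK call the driver makes, not a real nova or vmware driver name.

    import time

    import eventlet
    from eventlet import patcher, tpool

    # 1) Verify that monkey patching actually happened in this process;
    #    if 'thread' is not patched, a blocking native call can stall
    #    every greenthread, including instance_claim().
    for mod in ('os', 'select', 'socket', 'thread', 'time'):
        print('%s monkey patched: %s' % (mod, patcher.is_monkey_patched(mod)))

    # 2) Dispatch a blocking call to a real OS thread via eventlet.tpool,
    #    so the hub keeps scheduling other greenthreads meanwhile.
    def slow_native_call():
        time.sleep(5)  # stand-in for a blocking SDK/driver call
        return 'done'

    def heartbeat():
        for _ in range(7):
            print('hub is still alive')
            eventlet.sleep(1)

    beat = eventlet.spawn(heartbeat)
    print(tpool.execute(slow_native_call))  # the hub stays responsive
    beat.wait()

If this is the right direction, I understand a long-lived session/client
object could also be wrapped once with tpool.Proxy(...) so that every method
call is dispatched to the thread pool, similar to how the libvirt driver
proxies its connection.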