Sean,

You are right, I am working with the vmware driver. Is it possible for you to share some sample code fixes so that I can try them in my environment? (I have also put a rough sketch of what I understood from your suggestion at the bottom of this mail.)

Below is my investigation; would you please give any suggestions? With my instance, vm_state is "building" and task_state is NULL. My suspicion is here:

######################################################################
        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
            with self.rt.instance_claim(context, instance, node, allocs,
                                        limits):
                # See my comments in instance_claim below.
######################################################################

Here is my investigation, with my notes marked in <yingji></yingji>:

    def _build_and_run_instance(self, context, instance, image, injected_files,
            admin_password, requested_networks, security_groups,
            block_device_mapping, node, limits, filter_properties,
            request_spec=None):

        self._notify_about_instance_usage(context, instance, 'create.start',
                extra_usage_info={'image_name': image_name})
        compute_utils.notify_about_instance_create(
            context, instance, self.host,
            phase=fields.NotificationPhase.START,
            bdms=block_device_mapping)

<yingji> I see the rabbitmq message sent here. </yingji>

        # NOTE(mikal): cache the keystone roles associated with the instance
        # at boot time for later reference
        instance.system_metadata.update(
            {'boot_roles': ','.join(context.roles)})

        self._check_device_tagging(requested_networks, block_device_mapping)
        self._check_trusted_certs(instance)

        request_group_resource_providers_mapping = \
            self._get_request_group_mapping(request_spec)

        if request_group_resource_providers_mapping:
            self._update_pci_request_spec_with_allocated_interface_name(
                context, instance, request_group_resource_providers_mapping)

        # TODO(Luyao) cut over to get_allocs_for_consumer
        allocs = self.reportclient.get_allocations_for_consumer(
            context, instance.uuid)

<yingji> I see "GET /allocations/<my-instance-uuid>" in placement-api.log,
so it looks like execution reaches here. </yingji>

        # My suspect code snippet.
<yingji>
######################################################################
        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
            with self.rt.instance_claim(context, instance, node, allocs,
                                        limits):
                # See my comments in instance_claim below.
######################################################################
</yingji>
        ........
            with self._build_resources(context, instance,
                    requested_networks, security_groups, image_meta,
                    block_device_mapping,
                    request_group_resource_providers_mapping) as resources:
                instance.vm_state = vm_states.BUILDING
                instance.task_state = task_states.SPAWNING
                # NOTE(JoshNang) This also saves the changes to the
                # instance from _allocate_network_async, as they aren't
                # saved in that function to prevent races.
                instance.save(expected_task_state=
                        task_states.BLOCK_DEVICE_MAPPING)
                block_device_info = resources['block_device_info']
                network_info = resources['network_info']
                LOG.debug('Start spawning the instance on the hypervisor.',
                          instance=instance)
                with timeutils.StopWatch() as timer:

<yingji> The driver code starts here; however, in my case it does not seem
to reach here. </yingji>

                    self.driver.spawn(context, instance, image_meta,
                                      injected_files, admin_password,
                                      allocs, network_info=network_info,
                                      block_device_info=block_device_info)


    @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
    def instance_claim(self, context, instance, nodename, allocations,
                       limits=None):
        ......
        if self.disabled(nodename):
            # instance_claim() was called before update_available_resource()
            # (which ensures that a compute node exists for nodename). We
            # shouldn't get here but in case we do, just set the instance's
            # host and nodename attribute (probably incorrect) and return a
            # NoopClaim.
            # TODO(jaypipes): Remove all the disabled junk from the resource
            # tracker. Servicegroup API-level active-checking belongs in the
            # nova-compute manager.
            self._set_instance_host_and_node(instance, nodename)
            return claims.NopClaim()

        # sanity checks:
        if instance.host:
            LOG.warning("Host field should not be set on the instance "
                        "until resources have been claimed.",
                        instance=instance)

        if instance.node:
            LOG.warning("Node field should not be set on the instance "
                        "until resources have been claimed.",
                        instance=instance)

        cn = self.compute_nodes[nodename]

<yingji> I did not see the rabbitmq message that should be sent here. </yingji>

        pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
            context, instance.uuid)
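If instance_claim() is really where it hangs (for example, another greenthread
is holding COMPUTE_RESOURCE_SEMAPHORE while blocked on an external call),
dumping the stacks of all greenthreads in the running nova-compute process
should show the exact blocking frame. Below is a minimal sketch of what I plan
to try; it assumes greenlet is importable in the process, and it could be
pasted into the eventlet backdoor shell if backdoor_port is set in nova.conf:

    import gc
    import traceback

    import greenlet

    def dump_greenthreads():
        # Walk every live greenlet and print its current stack. A
        # greenthread stuck while holding COMPUTE_RESOURCE_SEMAPHORE
        # should show the blocking call at the top of its stack.
        for ob in gc.get_objects():
            if not isinstance(ob, greenlet.greenlet):
                continue
            if not ob or ob.gr_frame is None:
                continue  # skip dead or not-yet-started greenlets
            print('--- greenthread %r ---' % (ob,))
            traceback.print_stack(ob.gr_frame)

    dump_greenthreads()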
Yingji.

> On 4/26/21, 10:46 PM, "Sean Mooney" <smooney@redhat.com> wrote:
>> This issue is not always reproducible, and restarting the compute service can work around it.
>> Could you please give any suggestions on how to resolve this issue, or on how I can investigate?
> I assume this is with the vmware driver? It kind of sounds like eventlet is not monkey patching properly and it is blocking on a call, or something like that.
> We have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib,
> and as a result we would end up blocking in the compute agent when making some external calls to libvirt.
> I wonder if you are seeing something similar?
>> Yingji.
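PS: as a first experiment with the proxying you describe above, this is the
kind of pattern I plan to try. It is only a minimal sketch under my own
assumptions: slow_native_call() is an illustrative stand-in for whatever
blocking SDK call the driver makes, not a real nova or vmware driver name.

    import time

    import eventlet
    from eventlet import patcher, tpool

    # 1) Verify that monkey patching actually happened in this process;
    #    if 'thread' is not patched, a blocking native call can stall
    #    every greenthread, including instance_claim().
    for mod in ('os', 'select', 'socket', 'thread', 'time'):
        print('%s monkey patched: %s' % (mod, patcher.is_monkey_patched(mod)))

    # 2) Dispatch a blocking call to a real OS thread via eventlet.tpool,
    #    so the hub keeps scheduling other greenthreads meanwhile.
    def slow_native_call():
        time.sleep(5)  # stand-in for a blocking SDK/driver call
        return 'done'

    def heartbeat():
        for _ in range(7):
            print('hub is still alive')
            eventlet.sleep(1)

    beat = eventlet.spawn(heartbeat)
    print(tpool.execute(slow_native_call))  # the hub stays responsive
    beat.wait()

If this is the right direction, I understand a long-lived session/client
object could also be wrapped once with tpool.Proxy(...) so that every method
call is dispatched to the thread pool, similar to how the libvirt driver
proxies its connection.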