On Tue, 2021-04-27 at 01:19 +0000, Yingji Sun wrote:
Sean,
You are right. I am working with the vmware driver. Would it be possible for you to share some sample code fixes so that I can try them in my environment?
In the libvirt case we had service-wide hangs (https://bugs.launchpad.net/nova/+bug/1840912) that were resolved by https://github.com/openstack/nova/commit/36ee9c1913a449defd3b35f5ee5fb4afcd4...
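For reference, the general shape of that kind of fix is to proxy the blocking client calls through eventlet's native thread pool, so a slow C-level call only blocks the calling greenthread instead of the whole service. Below is a minimal sketch of that pattern; BlockingHypervisorClient is a made-up stand-in, not actual nova or libvirt code.

import eventlet
eventlet.monkey_patch()

from eventlet import tpool


class BlockingHypervisorClient(object):
    """Stand-in for a client whose methods can block inside C code."""

    def get_vm_state(self, vm_ref):
        # Imagine a long, non-cooperative round trip to the hypervisor
        # happening here.
        return 'poweredOn'


# Calls made through the proxy run in eventlet's native thread pool, so
# only the calling greenthread waits instead of every greenthread in the
# nova-compute process.
client = tpool.Proxy(BlockingHypervisorClient())
print(client.get_vm_state('vm-123'))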
Below is my investigation. Would you please give any suggestions?
With my instance, vm_state is building and task_state is NULL.
Here is the code I suspect:

######################################################################
    try:
        scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                    request_spec)
        with self.rt.instance_claim(context, instance, node, allocs,
                                    limits):
            # See my comments in instance_claim below.
######################################################################
Here is my investigation; my notes are marked with <yingji></yingji> tags.
def _build_and_run_instance(self, context, instance, image, injected_files,
        admin_password, requested_networks, security_groups,
        block_device_mapping, node, limits, filter_properties,
        request_spec=None):

    self._notify_about_instance_usage(context, instance, 'create.start',
            extra_usage_info={'image_name': image_name})
    compute_utils.notify_about_instance_create(
        context, instance, self.host,
        phase=fields.NotificationPhase.START,
        bdms=block_device_mapping)
    <yingji> I see the rabbitmq message sent here. </yingji>

    # NOTE(mikal): cache the keystone roles associated with the instance
    # at boot time for later reference
    instance.system_metadata.update(
        {'boot_roles': ','.join(context.roles)})

    self._check_device_tagging(requested_networks, block_device_mapping)
    self._check_trusted_certs(instance)

    request_group_resource_providers_mapping = \
        self._get_request_group_mapping(request_spec)

    if request_group_resource_providers_mapping:
        self._update_pci_request_spec_with_allocated_interface_name(
            context, instance, request_group_resource_providers_mapping)

    # TODO(Luyao) cut over to get_allocs_for_consumer
    allocs = self.reportclient.get_allocations_for_consumer(
        context, instance.uuid)
    <yingji> I see "GET /allocations/<my-instance-uuid>" in placement-api.log,
    so it looks like it reaches here. </yingji>

    # My suspect code snippet.
    <yingji>
    ######################################################################
    try:
        scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                    request_spec)
        with self.rt.instance_claim(context, instance, node, allocs,
                                    limits):
            # See my comments in instance_claim below.
    ######################################################################
    </yingji>

            ........

            with self._build_resources(context, instance,
                    requested_networks, security_groups, image_meta,
                    block_device_mapping,
                    request_group_resource_providers_mapping) as resources:
                instance.vm_state = vm_states.BUILDING
                instance.task_state = task_states.SPAWNING
                # NOTE(JoshNang) This also saves the changes to the
                # instance from _allocate_network_async, as they aren't
                # saved in that function to prevent races.
                instance.save(expected_task_state=
                              task_states.BLOCK_DEVICE_MAPPING)
                block_device_info = resources['block_device_info']
                network_info = resources['network_info']
                LOG.debug('Start spawning the instance on the hypervisor.',
                          instance=instance)
                with timeutils.StopWatch() as timer:
                    <yingji> The driver code starts here. However, in my case
                    it does not look like it reaches here. </yingji>
                    self.driver.spawn(context, instance, image_meta,
                                      injected_files, admin_password,
                                      allocs, network_info=network_info,
                                      block_device_info=block_device_info)
This synchronized decorator prints a log message when the lock is acquired and released (https://github.com/openstack/oslo.concurrency/blob/4da91987d6ce7de2bb61c6ed7...), so you should see that in the logs. I also notice that your code does not have the fair=True argument. On master, and for a few releases now, we have enabled fair locking with https://github.com/openstack/nova/commit/1ed9f9dac59c36cdda54a9852a1f93939b3... to resolve long delays in the ironic driver (https://bugs.launchpad.net/nova/+bug/1864122), but the same issue would also affect vmware or any other clustered hypervisor where the resource tracker is managing multiple nodes. It's very possible that that is what is causing your current issues.
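To make that concrete, here is a minimal sketch of what fair locking looks like with oslo.concurrency; the lock name and functions below are illustrative only, not the actual nova resource tracker code.

import time

from oslo_concurrency import lockutils


# Without fair=True, waiting greenthreads can be woken in arbitrary order,
# so on a clustered driver one busy node can keep re-acquiring the lock and
# starve the others for a long time.
@lockutils.synchronized('compute_resources')
def claim_unfair():
    time.sleep(0.1)


# With fair=True, waiters acquire the lock in FIFO order; this is the
# behaviour the nova commit referenced above enables for the resource
# tracker semaphore.
@lockutils.synchronized('compute_resources', fair=True)
def claim_fair():
    time.sleep(0.1)

With debug logging enabled you should be able to grep for the lock name and see whether instance_claim is still waiting to acquire the semaphore or has acquired it and is blocked inside the claim.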
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
def instance_claim(self, context, instance, nodename, allocations,
                   limits=None):
    ......
    if self.disabled(nodename):
        # instance_claim() was called before update_available_resource()
        # (which ensures that a compute node exists for nodename). We
        # shouldn't get here but in case we do, just set the instance's
        # host and nodename attribute (probably incorrect) and return a
        # NoopClaim.
        # TODO(jaypipes): Remove all the disabled junk from the resource
        # tracker. Servicegroup API-level active-checking belongs in the
        # nova-compute manager.
        self._set_instance_host_and_node(instance, nodename)
        return claims.NopClaim()

    # sanity checks:
    if instance.host:
        LOG.warning("Host field should not be set on the instance "
                    "until resources have been claimed.",
                    instance=instance)

    if instance.node:
        LOG.warning("Node field should not be set on the instance "
                    "until resources have been claimed.",
                    instance=instance)

    cn = self.compute_nodes[nodename]
    <yingji> I did not see the rabbitmq message that should be sent here.
    </yingji>
    pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
        context, instance.uuid)
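If it helps to narrow this down, here is a debug-only sketch (my own suggestion, not upstream code; the lock name is a placeholder) of timing how long a claim waits for a semaphore, so you can tell "stuck waiting for the lock" apart from "stuck inside the claim".

import time

from oslo_concurrency import lockutils
from oslo_log import log as logging

LOG = logging.getLogger(__name__)

# Placeholder name; the real resource tracker lock is taken by the
# utils.synchronized decorator shown above.
CLAIM_LOCK = 'compute_resources_debug'


def timed_claim(instance_uuid):
    start = time.monotonic()
    with lockutils.lock(CLAIM_LOCK, fair=True):
        waited = time.monotonic() - start
        LOG.debug('Acquired %s for %s after waiting %.3fs',
                  CLAIM_LOCK, instance_uuid, waited)
        # ... the actual claim work would go here ...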
Yingji.
> On 4/26/21, 10:46 PM, "Sean Mooney" <smooney@redhat.com> wrote:
This issue is not always reproducible and restarting the compute service can work around this.
Could you please give any suggestions on how to resolve this issue, or on how I can investigate it?
I assume this is with the vmware driver? It kind of sounds like eventlet is not monkey patching properly and it's blocking on a call, or something like that.
We have seen this in the past when talking to libvirt, where we were not properly proxying calls into the libvirt lib,
and as a result we would end up blocking in the compute agent when making some external calls to libvirt.
I wonder if you are seeing something similar?
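One quick way to check the monkey-patching theory is to ask eventlet directly which stdlib modules it has patched. This is just an illustrative check, run from a throwaway script or dropped temporarily into the service's entry point.

import eventlet
eventlet.monkey_patch()

from eventlet import patcher

# If 'thread' or 'socket' report False inside the running service, any
# blocking call made through them stalls every greenthread, which would
# match a compute service that hangs until it is restarted.
for mod in ('os', 'select', 'socket', 'thread', 'time'):
    print(mod, patcher.is_monkey_patched(mod))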
Yingji.