On Tue, Jun 25, 2019 at 10:14 PM Matt Riedemann <mriedemos@gmail.com> wrote:
There are still quite a few TODOs in the code [1][2][3] from a Kilo-era
blueprint [4]. At this point I'm pretty sure you can't start up the
nova-compute service without ending up with a ComputeNode record that has
the host and hypervisor_hostname fields set (we don't set
ComputeNode.service_id anywhere anymore as far as I can tell, except in
some ComputeNode RPC compat code [5]).

I've stumbled across all of this code before, but was looking at it
again today because I have a very simple change I need to make: starting
from a ComputeNode object, get the related nova-compute Service object
for that node.

Looking at the code one might think this is reasonable:

service = objects.Service.get_by_id(ctxt, compute_node.service_id)

But compute_node.service_id is likely None. Or how about:

service = objects.Service.get_by_compute_host(ctxt, compute_node.host)

But ComputeNode.host is also nullable (though likely should have a value
as noted above).
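
Just to illustrate how awkward this is, the most defensive thing I can
come up with right now is something like the following (untested sketch,
and it really shouldn't be necessary):

if 'host' in compute_node and compute_node.host is not None:
    service = objects.Service.get_by_compute_host(ctxt, compute_node.host)
elif compute_node.service_id is not None:
    # Pre-Kilo style record that still points at the Service by FK.
    service = objects.Service.get_by_id(ctxt, compute_node.service_id)
else:
    service = None  # probably an orphaned compute_nodes record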


Yeah, basically: before this blueprint, a ComputeNode record only had a hypervisor_hostname value and a service_id, which was a FK to the Service record.
Given we preferred to have a (host, hypervisor_hostname) tuple as the key for the CN record, we deprecated service_id and wanted to add a new field named 'host'.

To fill that in, we looked at the existing Service record to know what the value should be. After that, the 'host' field value was provided directly.
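
Roughly, the compat path does something like this when loading a ComputeNode (simplified sketch, not the exact code in [5]):

if db_compute['host'] is None and db_compute['service_id'] is not None:
    # Old-style record: derive the new 'host' field from the linked Service.
    service = objects.Service.get_by_id(context, db_compute['service_id'])
    compute.host = service.host
else:
    compute.host = db_compute['host']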

That said, since compute and conductor services could be running different release versions, we wanted to keep being able to fall back on the backward compatibility.
Now, we're waaaaay after Kilo, so I think we no longer need to support the compatibility ;-)


This is a long way of me saying this code is all gross and we should
clean it up. That means making sure all of this Kilo-era compat code
for old records is no longer necessary, which in turn means all of those
records should be migrated by now. But how should we check?

I *think* this might just be as simple as a "nova-status upgrade check"
check which scans the cells looking for (non-deleted) compute_nodes
records where host is NULL and reports an error if any are found. I
believe the recovery action for an operator that hits this is to delete
the busted compute_nodes record and restart the nova-compute service so
a new compute node record is created. I would really think that anything
this scan would find would be orphaned compute_nodes records that could
just be deleted since another compute_nodes record probably already
exists for the same hypervisor_hostname value. IOW, I don't think we
need an online data migration routine for this.


Yeah, agreed with the above. I don't think we need an online data migration for this, and I'm pretty sure a nova-status upgrade check should be enough.
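
Something along these lines should do it. This is just a rough sketch modeled on the existing checks in nova/cmd/status.py; the method name and messages are made up:

from oslo_upgradecheck import upgradecheck

from nova import context as nova_context
from nova import objects


def _check_compute_node_host(self):
    # Hypothetical method on the existing UpgradeCommands class in
    # nova/cmd/status.py: scan every cell for compute_nodes records
    # that still have host=NULL.
    ctxt = nova_context.get_admin_context()
    bad_nodes = []
    for cell in objects.CellMappingList.get_all(ctxt):
        with nova_context.target_cell(ctxt, cell) as cctxt:
            for node in objects.ComputeNodeList.get_all(cctxt):
                if 'host' not in node or node.host is None:
                    bad_nodes.append(node.hypervisor_hostname)
    if bad_nodes:
        return upgradecheck.Result(
            upgradecheck.Code.FAILURE,
            'Found compute_nodes records without a host value: %s. '
            'Delete those records and restart the nova-compute service '
            'so new records get created.' % ', '.join(bad_nodes))
    return upgradecheck.Result(upgradecheck.Code.SUCCESS)

The real check would probably want to query the cell DB directly so it isn't fooled by the very compat code we want to remove, but that's a detail for the review.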

-Sylvain

 
Hopefully at least one person (Sylvain) can agree with me here and with
the plan of action I've put forth.

[1]
https://github.com/openstack/nova/blob/91647a9b711a8102c79bb17c6b4dff24ad6f8f58/nova/db/sqlalchemy/models.py#L123
[2]
https://github.com/openstack/nova/blob/91647a9b711a8102c79bb17c6b4dff24ad6f8f58/nova/objects/compute_node.py#L150
[3]
https://github.com/openstack/nova/blob/91647a9b711a8102c79bb17c6b4dff24ad6f8f58/nova/objects/compute_node.py#L263
[4]
https://blueprints.launchpad.net/nova/+spec/detach-service-from-computenode
[5]
https://github.com/openstack/nova/blob/91647a9b711a8102c79bb17c6b4dff24ad6f8f58/nova/objects/compute_node.py#L118

--

Thanks,

Matt