[nova] Can we drop the Kilo-era ComputeNode host/service_id compat code now?
There are still quite a few TODOs in the code [1][2][3] from a Kilo-era blueprint [4]. At this point I'm pretty sure you can't start up the nova-compute service without having a ComputeNode record with a host and hypervisor_hostname field set (we don't set ComputeNode.service_id anywhere anymore as far as I can tell, except in some ComputeNode RPC compat code [5]).

I've stumbled across all of this code before, but was looking at it again today because I have a very simple change I need to make, which is going from a ComputeNode object to the related nova-compute Service object for that node.

Looking at the code, one might think this is reasonable:

service = objects.Service.get_by_id(ctxt, compute_node.service_id)

But compute_node.service_id is likely None. Or how about:

service = objects.Service.get_by_compute_host(ctxt, compute_node.host)

But ComputeNode.host is also nullable (though it likely should have a value, as noted above).

This is a long way of me saying this code is all gross and we should clean it up, which means making sure all of this Kilo-era compat code for old records is no longer necessary, which means all of those records should be migrated by now, but how should we check?

I *think* this might just be as simple as a "nova-status upgrade check" check which scans the cells looking for (non-deleted) compute_nodes records where host is NULL and reports an error if any are found. I believe the recovery action for an operator that hits this is to delete the busted compute_nodes record and restart the nova-compute service so a new compute node record is created. I would really think that anything this scan finds would be orphaned compute_nodes records that could just be deleted, since another compute_nodes record probably already exists for the same hypervisor_hostname value. IOW, I don't think we need an online data migration routine for this.

Hopefully at least one person (Sylvain) can agree with me here and the plan of action I've put forth.

[1] https://github.com/openstack/nova/blob/91647a9b711a8102c79bb17c6b4dff24ad6f8...
[2] https://github.com/openstack/nova/blob/91647a9b711a8102c79bb17c6b4dff24ad6f8...
[3] https://github.com/openstack/nova/blob/91647a9b711a8102c79bb17c6b4dff24ad6f8...
[4] https://blueprints.launchpad.net/nova/+spec/detach-service-from-computenode
[5] https://github.com/openstack/nova/blob/91647a9b711a8102c79bb17c6b4dff24ad6f8...

--
Thanks,
Matt
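For concreteness, the proposed scan might look roughly like the untested sketch below. The cell-connection plumbing is hand-waved (find_hostless_compute_nodes and the (cell_name, db_url) pairs are illustrative, not how nova-status actually iterates cells):

    # Rough sketch of the proposed check (for discussion only). Assumes
    # you can enumerate (cell_name, db_url) pairs somehow; that plumbing
    # is hand-waved here and is not nova's real cell-iteration code.
    from sqlalchemy import create_engine, text

    def find_hostless_compute_nodes(cell_db_urls):
        """Return (cell_name, compute_node_id) for every non-deleted
        compute_nodes record whose host column is NULL."""
        offenders = []
        query = text("SELECT id FROM compute_nodes "
                     "WHERE host IS NULL AND deleted = 0")
        for cell_name, db_url in cell_db_urls:
            engine = create_engine(db_url)
            with engine.connect() as conn:
                offenders.extend(
                    (cell_name, row.id) for row in conn.execute(query))
        return offenders

    # A nova-status check would then report an error (with the recovery
    # instructions) if this returns anything.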
On Tue, Jun 25, 2019 at 10:05 PM, Matt Riedemann <mriedemos@gmail.com> wrote:
[snip]
I *think* this might just be as simple as a "nova-status upgrade check" check which scans the cells looking for (non-deleted) compute_nodes records where host is NULL and reports an error if any are found. I believe the recovery action for an operator that hits this is to delete the busted compute_nodes record and restart the nova-compute service so a new compute node record is created. I would really think that anything this scan finds would be orphaned compute_nodes records that could just be deleted, since another compute_nodes record probably already exists for the same hypervisor_hostname value. IOW, I don't think we need an online data migration routine for this.
Hopefully at least one person (Sylvain) can agree with me here and the plan of action I've put forth.
Your plan makes sense to me too.

gibi
On Tue, Jun 25, 2019 at 10:14 PM Matt Riedemann <mriedemos@gmail.com> wrote:
There are still quite a few TODOs in the code [1][2][3] from a Kilo-era blueprint [4]. At this point I'm pretty sure you can't start up the nova-compute service without having a ComputeNode record with a host and hypervisor_hostname field set (we don't set ComputeNode.service_id anywhere anymore as far as I can tell, except in some ComputeNode RPC compat code [5]).
I've stumbled across all of this code before, but was looking at it again today because I have a very simple change I need to make, which is going from a ComputeNode object to the related nova-compute Service object for that node.
Looking at the code one might think this is reasonable:
service = objects.Service.get_by_id(ctxt, compute_node.service_id)
But compute_node.service_id is likely None. Or how about:
service = objects.Service.get_by_compute_host(ctxt, compute_node.host)
But ComputeNode.host is also nullable (though it likely should have a value, as noted above).
Yeah, basically: before this blueprint, a ComputeNode record only had a hypervisor_hostname value and a service_id, which was an FK to the Service record. Given we preferred to have a (host, hypervisor_hostname) tuple key for the CN record, we deprecated service_id and added a new field named 'host'. To populate it, we looked at the existing Service record to get the value; after the migration, we provided the 'host' field value directly. That said, since compute and conductor services could be running different release versions, we wanted to keep the backward compatibility. Now we're waaaaay past Kilo, so I think we no longer need to support it ;-)
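(Roughly, the fallback pattern those TODOs guard looks like the sketch below. This is a hypothetical paraphrase for illustration, not the literal nova code; get_node_host is an invented name.)

    from nova import objects

    # Hypothetical paraphrase of the Kilo-era fallback: if a ComputeNode
    # record predates the 'host' field, derive the host from the
    # deprecated service_id FK instead. This is the kind of branch the
    # TODOs mark for removal once no such records can exist.
    def get_node_host(ctxt, compute_node):
        if compute_node.host is not None:
            return compute_node.host
        if compute_node.service_id is not None:
            # Pre-Kilo record: fall back to the linked Service row.
            service = objects.Service.get_by_id(
                ctxt, compute_node.service_id)
            return service.host
        return None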
This is a long way of me saying this code is all gross and we should clean it up, which means making sure all of this Kilo-era compat code for old records is no longer necessary, which means all of those records should be migrated by now, but how should we check?
I *think* this might just be as simple as a "nova-status upgrade check" check which scans the cells looking for (non-deleted) compute_nodes records where host is NULL and reports an error if any are found. I believe the recovery action for an operator that hits this is to delete the busted compute_nodes record and restart the nova-compute service so a new compute node record is created. I would really think that anything this scan finds would be orphaned compute_nodes records that could just be deleted, since another compute_nodes record probably already exists for the same hypervisor_hostname value. IOW, I don't think we need an online data migration routine for this.
Yeah, agreed with the above. I don't think we need an online data migration for this, and I'm pretty sure a nova-status upgrade check should be enough.

-Sylvain
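For reference, the operator recovery described above might boil down to something like this against the affected cell database. This is an illustrative sketch under assumptions, not an official procedure (delete_orphan_node is an invented name); take a DB backup first, and restart the nova-compute service afterwards so a fresh record is created:

    from sqlalchemy import create_engine, text

    # Illustrative orphan cleanup (a sketch, not official tooling):
    # soft-delete a hostless compute_nodes record, but only if a live
    # record already exists for the same hypervisor_hostname. Nova's
    # soft-delete convention sets deleted = id.
    def delete_orphan_node(db_url, node_id):
        engine = create_engine(db_url)
        with engine.begin() as conn:
            row = conn.execute(
                text("SELECT hypervisor_hostname FROM compute_nodes "
                     "WHERE id = :id AND host IS NULL AND deleted = 0"),
                {"id": node_id}).one_or_none()
            if row is None:
                return False  # no such busted record
            replacement = conn.execute(
                text("SELECT COUNT(*) FROM compute_nodes "
                     "WHERE hypervisor_hostname = :hh "
                     "AND host IS NOT NULL AND deleted = 0"),
                {"hh": row.hypervisor_hostname}).scalar()
            if not replacement:
                return False  # no replacement record; investigate first
            conn.execute(
                text("UPDATE compute_nodes "
                     "SET deleted = id, deleted_at = NOW() "
                     "WHERE id = :id"), {"id": node_id})
            return True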