Cleanup database(s)

Eugen Block eblock at nde.ag
Tue Mar 9 14:47:06 UTC 2021


Hi,

> I think the problem is that you reinstalled the cloud with existing
> instances and changed the hostnames of the compute nodes, which is not
> a supported operation (specifically, changing the hostname of a
> compute node with VMs on it is not supported).
> Doing so would cause all the compute services to be recreated for the
> new compute nodes and create new RPs in placement.
> The existing instances, however, would still have their allocations on
> the old RPs, and the old hostnames would be set in instance.host.
> Can you confirm that?

This environment grew from being just an experiment into our
production cloud, so there might be a couple of unsupported things, but
it still works fine, so that's something. ;-)

I'll try to explain and hopefully clarify some things.
We upgraded the databases on a virtual machine prior to the actual  
cloud upgrade. Since the most important services started successfully,
we went ahead and installed two control nodes with pacemaker and
imported the already upgraded databases.
Then we started to evacuate the compute nodes one by one and added  
them to the new cloud environment while the old one was still up and  
running.

To launch existing instances in the new cloud we had to experiment a  
little, but from previous troubleshooting sessions we knew which  
tables we had to change in order to bring the instances up on the new  
compute nodes.
Basically, we changed instances.host and instances.node to reflect one
of the new compute nodes. So the answer to your question would probably
be "no", the instances.host entries don't contain the old hostnames.

> Can you clarify those points, e.g. were all the workloads removed
> before the reinstall? If not, did the hostnames change?
> That is a harder problem to fix unless you can restore the old
> hostnames, but I suspect you have likely booted new VMs if this
> environment has been running for a year.

I understand, it seems as if I'll have to go through the resource
allocations one by one and update them in order to be able to remove
the old RPs.
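
Just to spell out what I have in mind per instance, roughly (a sketch
with placeholder UUIDs; the resource values have to match the
instance's flavor, and project/user IDs are given explicitly because
the whole allocation gets replaced):

---snip---
# current allocations, still pointing at the old RP
openstack resource provider allocation show <instance_uuid>

# overwrite them so they point at the new compute node's RP
# (on Train there is no "unset", so the full allocation is rewritten)
openstack --os-placement-api-version 1.12 \
   resource provider allocation set <instance_uuid> \
   --project-id <project_uuid> --user-id <user_uuid> \
   --allocation rp=<new_rp_uuid>,VCPU=2,MEMORY_MB=4096,DISK_GB=20

# once an old RP has no allocations left, it should be deletable
openstack resource provider delete <old_rp_uuid>
---snip---
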
One final question though: is there anything risky about updating the
allocations to match the actual RP? I tested that for an uncritical
instance, shut it down and booted it again, all without an issue, it
seems. If I do that for the rest, is there anything I should be aware
of? From what I saw so far, all new instances are allocated properly,
so placement itself seems to be working well, right?
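
I guess the obvious sanity check before and after touching each
instance would be something like this (placeholder UUIDs again; if I'm
not mistaken, "--allocations" needs a newer placement API microversion,
hence the explicit option):

---snip---
# the new RP should end up carrying the instance's allocation
openstack --os-placement-api-version 1.11 \
   resource provider show <new_rp_uuid> --allocations

# and the old RP should be empty before it gets deleted
openstack --os-placement-api-version 1.11 \
   resource provider show <old_rp_uuid> --allocations
---snip---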

Thanks!
Eugen

Quoting Sean Mooney <smooney at redhat.com>:

> On Tue, 2021-03-09 at 09:20 +0000, Eugen Block wrote:
>> Hi again,
>>
>> I just wanted to get some clarification on how to proceed.
>>
>> > What you probably need to do in this case is check if the RPs still
>> > have allocations, and if so,
>> > verify that the allocations are owned by VMs that no longer exist.
>> > If that is the case, you should be able to delete the allocations
>> > and then the RP.
>> > If the allocations are related to active VMs that are now on the
>> > rebuilt nodes, then you will have to try and
>> > heal the allocations.
>>
>> I checked all allocations for the old compute nodes; those are all
>> existing VMs. So simply deleting the allocations won't do any good, I
>> guess. From [1] I understand that I should overwrite all allocations
>> (we're on Train, so there's no "unset" available yet) for those VMs to
>> point to the new compute nodes (resource_providers). After that I
>> should delete the resource providers, correct?
>> I ran "heal_allocations" for one uncritical instance, but it didn't
>> have any visible effect, the allocations still show one of the old
>> compute nodes.
>> What I haven't tried yet is to delete allocations for an instance and
>> then try to heal it as the docs also mention.
>>
>> Do I understand that correctly or am I still missing something?
>
> I think the problem is that you reinstalled the cloud with existing
> instances and changed the hostnames of the compute nodes, which is not
> a supported operation (specifically, changing the hostname of a
> compute node with VMs on it is not supported).
> Doing so would cause all the compute services to be recreated for the
> new compute nodes and create new RPs in placement.
> The existing instances, however, would still have their allocations on
> the old RPs, and the old hostnames would be set in instance.host.
> Can you confirm that?
>
> In this case you don't actually have orphaned allocations exactly; you
> have allocations against the incorrect RP. But if the instance.host
> does not match the hypervisor hostname that it is on, then
> heal_allocations will not be able to fix that.
>
> Just looking at your original message, you said "last year we migrated
> our OpenStack to a highly available environment through a reinstall
> of all nodes".
>
> I had assumed you had no instances from the original environment with
> the old names. If you had existing instances with the old names, then
> you would have had to ensure the hostnames did not change in order to
> do that correctly without breaking the resource tracking in nova.
>
> Can you clarify those points, e.g. were all the workloads removed
> before the reinstall? If not, did the hostnames change?
> That is a harder problem to fix unless you can restore the old
> hostnames, but I suspect you have likely booted new VMs if this
> environment has been running for a year.
>>
>> Regards,
>> Eugen
>>
>>
>> [1]
>> https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html
>>
>> Quoting Sean Mooney <smooney at redhat.com>:
>>
>> > On Mon, 2021-03-08 at 14:18 +0000, Eugen Block wrote:
>> > > Thank you, Sean.
>> > >
>> > > > So you need to run
>> > > > "openstack compute service list" to get the compute service IDs,
>> > > > then run
>> > > > "openstack compute service delete <id-1> <id-2> ..."
>> > > >
>> > > > You need to make sure that you only remove the unused old
>> > > > services, but I think that would fix your issue.
>> > >
>> > > That's the thing, they don't show up in the compute service list.
>> > > But I also found them in the resource_providers table; only the
>> > > old compute nodes appear there:
>> > >
>> > > MariaDB [nova]> select name from nova_api.resource_providers;
>> > > +--------------------------+
>> > > | name                     |
>> > > +--------------------------+
>> > > | compute1.fqdn            |
>> > > | compute2.fqdn            |
>> > > | compute3.fqdn            |
>> > > | compute4.fqdn            |
>> > > +--------------------------+
>> > Ah, in that case the compute service delete is meant to remove the
>> > RPs too, but if the RP had stale allocations at the time of the
>> > delete, the RP delete will fail.
>> >
>> > What you probably need to do in this case is check if the RPs still
>> > have allocations, and if so,
>> > verify that the allocations are owned by VMs that no longer exist.
>> > If that is the case, you should be able to delete the allocations
>> > and then the RP.
>> > If the allocations are related to active VMs that are now on the
>> > rebuilt nodes, then you will have to try and
>> > heal the allocations.
>> >
>> > There is an openstack client extension called osc-placement that
>> > you can install to help.
>> > We also have a heal_allocations command in nova-manage that may
>> > help, but the next step would be to validate
>> > whether the old RPs are still in use or not. From there you can
>> > then work to align nova's and placement's view with
>> > the real topology.
>> >
>> > That could involve removing the old compute nodes from the
>> > compute_nodes table or marking them as deleted, but
>> > both the nova DB and placement need to be kept in sync to correct
>> > your current issue.
>> >
>> > >
>> > >
>> > > Quoting Sean Mooney <smooney at redhat.com>:
>> > >
>> > > > On Mon, 2021-03-08 at 13:18 +0000, Eugen Block wrote:
>> > > > > Hi *,
>> > > > >
>> > > > > I have a quick question, last year we migrated our OpenStack to a
>> > > > > highly available environment through a reinstall of all nodes. The
>> > > > > migration went quite well, we're working happily in the new cloud but
>> > > > > the databases still contain deprecated data. For example, the
>> > > > > nova-scheduler logs lines like these on a regular basis:
>> > > > >
>> > > > > /var/log/nova/nova-scheduler.log:2021-02-19 12:02:46.439 23540 WARNING
>> > > > > nova.scheduler.host_manager [...] No compute service record found for
>> > > > > host compute1
>> > > > >
>> > > > > This is one of the old compute nodes that has been reinstalled and is
>> > > > > now compute01. I tried to find the right spot to delete some lines in
>> > > > > the DB but there are a couple of places so I wanted to check and ask
>> > > > > you for some insights.
>> > > > >
>> > > > > The scheduler messages seem to originate in
>> > > > >
>> > > > > /usr/lib/python3.6/site-packages/nova/scheduler/host_manager.py
>> > > > >
>> > > > > ---snip---
>> > > > >          for cell_uuid, computes in compute_nodes.items():
>> > > > >              for compute in computes:
>> > > > >                  service = services.get(compute.host)
>> > > > >
>> > > > >                  if not service:
>> > > > >                      LOG.warning(
>> > > > >                          "No compute service record found for host
>> > > > > %(host)s",
>> > > > >                          {'host': compute.host})
>> > > > >                      continue
>> > > > > ---snip---
>> > > > >
>> > > > > So I figured it could be this table in the nova DB:
>> > > > >
>> > > > > ---snip---
>> > > > > MariaDB [nova]> select host,deleted from compute_nodes;
>> > > > > +-----------+---------+
>> > > > > | host      | deleted |
>> > > > > +-----------+---------+
>> > > > > | compute01 |       0 |
>> > > > > | compute02 |       0 |
>> > > > > | compute03 |       0 |
>> > > > > | compute04 |       0 |
>> > > > > | compute05 |       0 |
>> > > > > | compute1  |       0 |
>> > > > > | compute2  |       0 |
>> > > > > | compute3  |       0 |
>> > > > > | compute4  |       0 |
>> > > > > +-----------+---------+
>> > > > > ---snip---
>> > > > >
>> > > > > What would be the best approach here to clean up a little? I believe
>> > > > > it would be safe to simply purge those lines containing the old
>> > > > > compute node, but there might be a smoother way. Or maybe there are
>> > > > > more places to purge old data from?
>> > > > So the step you probably missed was deleting the old compute
>> > > > service records.
>> > > >
>> > > > So you need to run
>> > > > "openstack compute service list" to get the compute service IDs,
>> > > > then run
>> > > > "openstack compute service delete <id-1> <id-2> ..."
>> > > >
>> > > > You need to make sure that you only remove the unused old
>> > > > services, but I think that would fix your issue.
>> > > >
>> > > > >
>> > > > > I'd appreciate any ideas.
>> > > > >
>> > > > > Regards,
>> > > > > Eugen
>> > > > >
>> > > > >
>> > >
>> > >
>> > >
>>
>>
>>





