Cleanup database(s)
Sean Mooney
smooney at redhat.com
Tue Mar 9 13:43:41 UTC 2021
On Tue, 2021-03-09 at 09:20 +0000, Eugen Block wrote:
> Hi again,
>
> I just wanted to get some clarification on how to proceed.
>
> > What you probably need to do in this case is check if the RPs still
> > have allocations and, if so, verify that the allocations are owned by
> > VMs that no longer exist. If that is the case you should be able to
> > delete the allocations and then the RP.
> > If the allocations are related to active VMs that are now on the
> > rebuilt nodes then you will have to try and heal the allocations.
>
> I checked all allocations for the old compute nodes, those are all
> existing VMs. So simply deleting the allocations won't do any good, I
> guess. From [1] I understand that I should overwrite all allocations
> (we're on Train so there's no "unset" available yet) for those VMs to
> point to the new compute nodes (resource_providers). After that I
> should delete the resource providers, correct?
> I ran "heal_allocations" for one uncritical instance, but it didn't
> have any visible effect, the allocations still show one of the old
> compute nodes.
> What I haven't tried yet is to delete allocations for an instance and
> then try to heal it as the docs also mention.
>
> Do I understand that correctly or am I still missing something?
I think the problem is that you reinstalled the cloud with existing instances and changed the hostnames of the
compute nodes, which is not a supported operation (specifically, changing the hostname of a compute node with VMs on it is not supported).
Doing so would cause all the compute services to be recreated for the new compute nodes and new RPs to be created in placement.
The existing instances, however, would still have their allocations on the old RPs, and the old hostnames would still be set in instance.host.
Can you confirm that?
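Something along these lines should confirm it (just a sketch; it assumes admin credentials and the osc-placement plugin are available, and <instance-uuid> is a placeholder):

  # which compute services and resource providers nova/placement currently know about
  openstack compute service list --service nova-compute
  openstack resource provider list
  # the host/node nova has recorded for a given instance
  openstack server show <instance-uuid> -c OS-EXT-SRV-ATTR:host -c OS-EXT-SRV-ATTR:hypervisor_hostname

If the old hostnames show up in the resource provider list and in the instance records, but not in the compute service list, and the guests are actually running on the renamed nodes (e.g. virsh list on the compute host), that matches the situation described above.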
In this case you don't exactly have orphaned allocations; you have allocations against the incorrect RPs. But if instance.host does not
match the hostname of the hypervisor the instance is actually on, then heal_allocations will not be able to fix that.
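For a single VM you can see that with osc-placement (the server UUID doubles as the placement consumer UUID; double-check the exact commands against the osc-placement docs for your version):

  # the allocations placement holds for one instance; rp= will point at the old RP
  openstack resource provider allocation show <instance-uuid>
  # the total usage recorded against one of the old resource providers
  openstack resource provider usage show <old-rp-uuid>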
Just looking at your original message, you said "last year we migrated our OpenStack to a highly available environment through a reinstall of all nodes".
I had assumed you had no instances from the original environment with the old names. If you had existing instances with the old names, then you would
have had to ensure the hostnames did not change in order to do that correctly without breaking the resource tracking in nova.
Can you clarify those points? E.g. were all the workloads removed before the reinstall? If not, did the hostnames change?
That is a harder problem to fix unless you can restore the old hostnames, but I suspect you have likely booted new VMs if this environment has been
running for a year.
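If restoring the old hostnames is not possible, the manual route from [1] is roughly: rewrite each instance's allocations so they point at the new RP (on Train there is no "unset", so "allocation set" replaces the consumer's allocations wholesale), and once an old RP has no allocations left, delete it. This is a sketch only; all UUIDs and resource amounts are placeholders and have to match the instance's flavor and its current allocation (check the osc-placement docs for the flags your version supports):

  # replace the instance's allocations so they land on the new resource provider
  openstack resource provider allocation set \
      --project-id <project-uuid> --user-id <user-uuid> \
      --allocation rp=<new-rp-uuid>,VCPU=2 \
      --allocation rp=<new-rp-uuid>,MEMORY_MB=4096 \
      --allocation rp=<new-rp-uuid>,DISK_GB=40 \
      <instance-uuid>
  # once the old RP reports no usage it can be removed
  openstack resource provider delete <old-rp-uuid>

Keep in mind that this only fixes placement; instance.host and the node recorded in the nova DB would still carry the old hostnames, so nova's side has to be brought back in line as well, as noted further down.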
>
> Regards,
> Eugen
>
>
> [1]
> https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html
>
> Quoting Sean Mooney <smooney at redhat.com>:
>
> > On Mon, 2021-03-08 at 14:18 +0000, Eugen Block wrote:
> > > Thank you, Sean.
> > >
> > > > so you need to do
> > > > openstack compute service list to get the compute service ids
> > > > then do
> > > > openstack compute service delete <id-1> <id-2> ...
> > > >
> > > > you need to make sure that you only remove the unused old services
> > > > but i think that would fix your issue.
> > >
> > > That's the thing, they don't show up in the compute service list. But
> > > I also found them in the resource_providers table, only the old
> > > compute nodes appear here:
> > >
> > > MariaDB [nova]> select name from nova_api.resource_providers;
> > > +--------------------------+
> > > | name                     |
> > > +--------------------------+
> > > | compute1.fqdn            |
> > > | compute2.fqdn            |
> > > | compute3.fqdn            |
> > > | compute4.fqdn            |
> > > +--------------------------+
> > Ah, in that case the compute service delete is meant to remove the
> > RPs too, but if the RP had stale allocations at the time of the
> > delete, the RP delete will fail.
> >
> > What you probably need to do in this case is check if the RPs still
> > have allocations and, if so, verify that the allocations are owned by
> > VMs that no longer exist. If that is the case you should be able to
> > delete the allocations and then the RP.
> > If the allocations are related to active VMs that are now on the
> > rebuilt nodes then you will have to try and heal the allocations.
> >
> > There is an openstack client extension called osc-placement that you
> > can install to help.
> > We also have a heal_allocations command in nova-manage that may help,
> > but the next step would be to validate whether the old RPs are still
> > in use or not. From there you can then work to align nova's and
> > placement's view with the real topology.
> >
> > That could involve removing the old compute nodes from the
> > compute_nodes table or marking them as deleted, but both the nova DB
> > and placement need to be kept in sync to correct your current issue.
> >
> > >
> > >
> > > Quoting Sean Mooney <smooney at redhat.com>:
> > >
> > > > On Mon, 2021-03-08 at 13:18 +0000, Eugen Block wrote:
> > > > > Hi *,
> > > > >
> > > > > I have a quick question, last year we migrated our OpenStack to a
> > > > > highly available environment through a reinstall of all nodes. The
> > > > > migration went quite well, we're working happily in the new cloud but
> > > > > the databases still contain deprecated data. For example, the
> > > > > nova-scheduler logs lines like these on a regular basis:
> > > > >
> > > > > /var/log/nova/nova-scheduler.log:2021-02-19 12:02:46.439 23540 WARNING
> > > > > nova.scheduler.host_manager [...] No compute service record found for
> > > > > host compute1
> > > > >
> > > > > This is one of the old compute nodes that has been reinstalled and is
> > > > > now compute01. I tried to find the right spot to delete some lines in
> > > > > the DB but there are a couple of places so I wanted to check and ask
> > > > > you for some insights.
> > > > >
> > > > > The scheduler messages seem to originate in
> > > > >
> > > > > /usr/lib/python3.6/site-packages/nova/scheduler/host_manager.py
> > > > >
> > > > > ---snip---
> > > > > for cell_uuid, computes in compute_nodes.items():
> > > > >     for compute in computes:
> > > > >         service = services.get(compute.host)
> > > > >
> > > > >         if not service:
> > > > >             LOG.warning(
> > > > >                 "No compute service record found for host %(host)s",
> > > > >                 {'host': compute.host})
> > > > >             continue
> > > > > ---snip---
> > > > >
> > > > > So I figured it could be this table in the nova DB:
> > > > >
> > > > > ---snip---
> > > > > MariaDB [nova]> select host,deleted from compute_nodes;
> > > > > +-----------+---------+
> > > > > | host      | deleted |
> > > > > +-----------+---------+
> > > > > | compute01 | 0       |
> > > > > | compute02 | 0       |
> > > > > | compute03 | 0       |
> > > > > | compute04 | 0       |
> > > > > | compute05 | 0       |
> > > > > | compute1  | 0       |
> > > > > | compute2  | 0       |
> > > > > | compute3  | 0       |
> > > > > | compute4  | 0       |
> > > > > +-----------+---------+
> > > > > ---snip---
> > > > >
> > > > > What would be the best approach here to clean up a little? I believe
> > > > > it would be safe to simply purge those lines containing the old
> > > > > compute node, but there might be a smoother way. Or maybe there are
> > > > > more places to purge old data from?
> > > > so the step you probably missed was deleting the old compute
> > > > service records
> > > >
> > > > so you need to do
> > > > openstack compute service list to get the compute service ids
> > > > then do
> > > > openstack compute service delete <id-1> <id-2> ...
> > > >
> > > > you need to make sure that you only remove the unused old services
> > > > but i think that would fix your issue.
> > > >
> > > > >
> > > > > I'd appreciate any ideas.
> > > > >
> > > > > Regards,
> > > > > Eugen
> > > > >
> > > > >
> > >
> > >
> > >
>
>
>