Thank you Michael, that's a fantastic write-up. It's really handy to know that all the flows have terminal states; I need to review our timeout settings.

regards,
Dale Smith

On 14/02/24 13:17, Michael Johnson wrote:
Hi there,
We highly recommend that you do not make changes to the state of load balancers in the database.
Here are some comments about state in Octavia, what these mean, and how to resolve the issue properly.
The provisioning_status in Octavia is used by the controllers to provide concurrency locking on the objects. So, when a load balancer has a provisioning_status of PENDING_* it means that one of the controllers has ownership of that load balancer and is actively working on it.
You can check on this by viewing the controller worker, health manager, and/or housekeeping logs. Typically you will see scrolling "WARNING" level log messages indicating that the controller is retrying some activity that is failing, such as trying to reach an amphora management endpoint over the load balancer management network or waiting for the VM to fully boot.
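For example, something along these lines (log file locations and service unit names differ between deployments, so treat the paths and unit names below as assumptions):

$ grep '<LOADBALANCER ID>' /var/log/octavia/worker.log | grep WARNING
$ journalctl -u octavia-worker -u octavia-health-manager | grep -i retry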
The default timeouts for some PENDING_* states can be as long as 12 hours. This comes down to the debate about whether a controller management system should "try forever" (similar to typical kubernetes clusters) to recover from outside failures, or quickly return a terminal status to the user and unlock the object. We highly recommend that you tune the timeouts in octavia.conf to align with your cloud's objectives. Personally, I lean towards being responsive to users rather than banging our heads against failed cloud services (e.g. running out of compute host space) with endless retries.
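For illustration, these are the kinds of retry/timeout knobs involved; the exact option names and defaults vary by release, so check the configuration reference for your version before copying anything:

[haproxy_amphora]
# How many times (and how often, in seconds) the worker retries connecting
# to a newly booted amphora before giving up.
connection_max_retries = 120
connection_retry_interval = 5

[controller_worker]
# How long to wait for the amphora compute instance to go ACTIVE.
amp_active_retries = 30
amp_active_wait_sec = 10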
Also note that just because you have seen the resource in a PENDING_* state for "a long period of time(tm)" does not mean the resource has been in that state the whole time. It may have moved out and then back in, such as when automatic failovers occur. It's important to check your logs.
All of the flows in Octavia have terminal states for requests that will unlock the object with either an "ACTIVE" or "ERROR" state that is mutable (i.e. failover or delete are possible).
If you change the state in the database, you could unlock an object that one of the controllers is actively retrying, allowing another controller to take action on the object in parallel. This can lead to database corruption (looking at the person claiming a load balancer has five amphora attached to it, grin) of the load balancer and/or resources abandoned in the cloud (neutron ports, VM instances, IP addresses, etc.). This is why we do not recommend doing this. I have seen people spend hours cleaning up after they blindly changed states in the database.
Now, aside from bugs that have been fixed over the years (I don't know what version you are running, but you can check the release notes), there are a few scenarios that can lead to "stuck" PENDING_* states:

1. Someone has killed the controller process with SIGKILL (-9), thus not allowing the flow to complete and safely unlock the objects. We typically see this when people implement systemd or container (looking at you k8s) process management tools but neglect to configure proper graceful shutdown timeouts and/or health monitoring. Some of these process monitoring tools default to waiting only a few seconds for a graceful shutdown before issuing a SIGKILL. As with all OpenStack controllers, SIGKILL will not allow a graceful shutdown and preservation of state, which typically requires manual cleanup. (See the systemd sketch after this list.)

2. Someone pulled the power cords out of the controller host without a graceful shutdown. You probably have cloud-wide issues when this happens.

3. There have been bugs discovered, but rarely. Look through your controller logs to see which controller had ownership of the load balancer and why the flow was not allowed to complete. If in doubt, file a launchpad bug with the logs and we will help you take a look. Help us help you.

4. You are not running HA messaging queues and one of the messages from the API tier (where the lock was initiated) never gets delivered to a controller worker process. We recommend running with HA and durable queues to avoid this.
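To illustrate scenario 1, here is a minimal sketch of a systemd drop-in that gives a controller process time to finish its flows before being killed; the unit name, path, and timeout value are assumptions to adapt to your deployment:

# /etc/systemd/system/octavia-worker.service.d/graceful-stop.conf  (hypothetical path)
[Service]
# Allow up to 5 minutes for a graceful shutdown before systemd escalates to SIGKILL.
TimeoutStopSec=300

Container platforms have equivalent settings (e.g. terminationGracePeriodSeconds in Kubernetes pod specs).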
That said, the Octavia team has addressed scenarios 1-3 (3 to some degree, as there is always a possibility of a bug that circumvents this) by adding support for Jobboard in the TaskFlow engine we use in Octavia. When enabled, it saves the flow state in Redis or Zookeeper and uses worker threads on the controllers to identify when a flow is not progressing properly and automatically resume the flow, on a new controller, at the task it was previously running. This feature was introduced in the Ussuri release.
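If you want to enable it, the configuration lives in the [task_flow] section of octavia.conf. Roughly something like the following, with a Redis backend; the option names and values here are from memory, so verify them against the configuration reference for your release:

[task_flow]
jobboard_enabled = True
# Redis shown here; a Zookeeper driver is also available.
jobboard_backend_driver = redis_taskflow_driver
jobboard_backend_hosts = 192.0.2.10
jobboard_backend_password = <redis password>
# Separate database used to persist flow state.
persistence_connection = mysql+pymysql://octavia:<password>@<db host>/octavia_persistence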
Ok, so let's say you don't have jobboard enabled in your deployment and the intern hit the "big shiny red button" near the door killing the power to your datacenter. You probably have many services in a bad state due to a non-graceful shutdown. The first step is to run the recovery steps starting at keystone and working your way down to glance, neutron, nova and barbican. Once your cloud is stabilized again and basic services are functioning (how many drives and power supplies did you just replace????) we can look at Octavia.
1. Do an "openstack loadbalancer list" as an admin and identify all load balancers that are in PENDING_* states.

2. Search your controller logs (yes, all of them) for those load balancer IDs and make sure a controller is not actively working on that load balancer, i.e. check for scrolling "retrying" log messages and trace them back to the load balancer ID. You don't want to mess up the load balancer your web tier colleagues are frantically trying to start up or update to recover from the "big shiny red button" incident. This is a key step to make sure you don't abandon cloud resources or corrupt the load balancer in worse ways.

3. Then, AND ONLY THEN, would you enter the database and update the load balancer provisioning_status to ERROR.

4. Close the database terminal window, wipe the sweat from your brow, and either trigger a load balancer failover or load balancer delete as needed. (A condensed command sketch follows this list.)
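Condensed into commands, and borrowing Dale's SQL from below, the sequence looks roughly like this (the log path is an assumption for your deployment):

# 1. Find the stuck load balancers (as admin)
$ openstack loadbalancer list -f value | grep PENDING

# 2. Confirm no controller is still retrying work on that ID
$ grep '<LOADBALANCER ID>' /var/log/octavia/*.log | grep -i retry

# 3. Only then, unlock it in the database
SQL> UPDATE load_balancer SET provisioning_status = 'ERROR' WHERE id = '<LOADBALANCER ID>' LIMIT 1;

# 4. Fail over or delete via the API as needed
$ openstack loadbalancer failover <LOADBALANCER ID>
$ openstack loadbalancer delete --cascade <LOADBALANCER ID>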
This process is really a "only if I know why it happened" last resort.
Michael
On Tue, Feb 13, 2024 at 3:05 PM Dale Smith <dale@catalystcloud.nz> wrote:
In the case where it's stuck in PENDING_UPDATE for a long time, we force the loadbalancer to ERROR status and then perform a failover to let Octavia re-create the amphorae.
SQL> UPDATE load_balancer SET provisioning_status = 'ERROR' WHERE id = '<LOADBALANCER ID>' LIMIT 1;
$ openstack loadbalancer failover <LOADBALANCER ID>
That *may* also help the PENDING_DELETE case, if it can recover and then allow a subsequent delete.
regards,
Dale Smith
On 14/02/24 01:59, Paul Browne wrote:
Hello Octavia users,
For a while I've been trying to find a systematic way to remove Octavia LBs+Amphora that have become stuck in ERROR or transitional PENDING_* states;
$ openstack loadbalancer amphora list -f value | grep ERROR
a2323f14-3aaa-418a-8249-111aaa9c21fe 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe ERROR MASTER 10.8.0.242 192.168.3.247
e5b236ba-e7ee-4ed7-9f58-57ce7a408489 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe ERROR BACKUP 10.8.0.190 192.168.3.247
6b556f28-93c9-49dd-b6ee-4379288e7957 d5e402fe-2c4b-49af-a700-532cb408cee5 ERROR MASTER 10.8.0.39 192.168.3.126
c669db5d-8686-4d5c-9e95-e02030b34301 d5e402fe-2c4b-49af-a700-532cb408cee5 ERROR BACKUP 10.8.0.174 192.168.3.126
$ openstack loadbalancer list -f value | grep -vi active
1fa7bd54-f60c-420c-94f3-d4c02f03d4fe k8s-clusterapi-cluster-default-ci-6386871107-kube-upgrade-kubeapi 3a06571936a0424bb40bc5c672c4ccb1 192.168.3.247 PENDING_UPDATE amphora
d5e402fe-2c4b-49af-a700-532cb408cee5 k8s-clusterapi-cluster-default-ci-6386871107-latest-kubeapi 3a06571936a0424bb40bc5c672c4ccb1 192.168.3.126 PENDING_DELETE amphora
These resources are marked immutable and so cannot be failed over or deleted;
$ openstack loadbalancer amphora failover a2323f14-3aaa-418a-8249-111aaa9c21fe
Load Balancer 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe is immutable and cannot be updated. (HTTP 409) (Request-ID: req-6e66c4e8-c3d3-4549-a03c-367017c8c8b3)

$ openstack loadbalancer failover 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe
Invalid state PENDING_UPDATE of loadbalancer resource 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe (HTTP 409) (Request-ID: req-84b44212-e7a8-4101-a16f-18c774c0577e)
The backing Nova instances for these Amphora do seem to exist and be in good working order.
Is there any API way to purge these out of Octavia's service state, or would (very careful) DB hackery be required here?
Many thanks, Paul Browne
*******************
Paul Browne
Research Computing Platforms
University Information Services
Roger Needham Building
JJ Thompson Avenue
University of Cambridge
Cambridge
United Kingdom
E-Mail: pfb29@cam.ac.uk
Tel: 0044-1223-746548
*******************