Many thanks also for the great write-up, Michael; it's been very helpful indeed in understanding the state flows.

Unfortunately, we did recently fall into scenario #2, where power to our entire data centre (not just our part of it) was completely disrupted. We were somewhat fortunate in that not much hard recovery graft was needed; most things Just Worked(tm), coming back up in the right order, barring some dead switches to replace! There was some OpenStack service disruption to clean up, as could be expected.

My last remaining issue is some Octavia amphorae in odd states. They seem to have been not-quite-created: they have no project_id or loadbalancer_id associated, nor any extant Nova instances backing them:

$ openstack loadbalancer amphora list -f value | grep -vi ALLOCATED
68ef09d2-014d-40b5-8700-c3304fca4925 None ERROR MASTER 10.8.0.157 10.0.0.123
82f535b6-ce7e-4824-9f3b-13f56e4ab3b7 None ERROR BACKUP 10.8.0.184 10.0.0.237
ba2ab344-917e-4880-8036-ccb510ab1781 None ERROR MASTER 10.8.0.148 10.0.0.237
bc80451e-bb12-4661-bbbf-7f7a4b359035 None ERROR BACKUP 10.8.0.191 10.0.0.123

There doesn't seem to be any active logging trace of the Octavia daemons working on these (we're running Yoga currently). However, since we're running Kolla-Ansible (via Kayobe), I'm unsure at the moment whether those external state-tracking mechanisms are yet supported natively in K-A's deployment of Octavia. I can look into that, though.

Thanks,
Paul Browne

*******************
Paul Browne
Research Computing Platforms
University Information Services
Roger Needham Building
JJ Thompson Avenue
University of Cambridge
Cambridge
United Kingdom
E-Mail: pfb29@cam.ac.uk
Tel: 0044-1223-746548
*******************

________________________________
From: Dale Smith
Sent: Wednesday, February 14, 2024 01:30
To: Michael Johnson
Cc: openstack-discuss@lists.openstack.org
Subject: Re: [octavia] Purging ERROR'ed amphora and load balancers stuck in transitional states

Thank you Michael, that's a fantastic write-up. It's really handy to know all the flows have terminal states; I need to review our timeout settings.

regards,
Dale Smith

On 14/02/24 13:17, Michael Johnson wrote:
Hi there,
We highly recommend that you do not make changes to the state of load balancers in the database.
Here are some comments about state in Octavia, what the states mean, and how to resolve these issues properly.
The provisioning_status in Octavia is used by the controllers to provide concurrency locking on the objects. So, when a load balancer has a provisioning_status of PENDING_* it means that one of the controllers has ownership of that load balancer and is actively working on it.
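For example (substitute your own load balancer ID), you can see the current lock state of a single load balancer with:

$ openstack loadbalancer show <LOADBALANCER ID> -c provisioning_status -c operating_status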
You can check on this by viewing the controller worker, health manager, and/or housekeeping logs. Typically you will see scrolling "WARNING" level log messages indicating that the controller is retrying some activity that is failing, such as trying to reach an amphora management endpoint over the load balancer management network or waiting for the VM to fully boot.
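As a rough illustration only (the log location is an assumption and varies by deployment method; Kolla-Ansible, for instance, typically writes controller logs under /var/log/kolla/octavia/ on the controller hosts), something like:

$ grep -i WARNING /var/log/kolla/octavia/octavia-worker.log | grep <LOADBALANCER ID>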
The default timeouts for some PENDING_* states can be up to 12 hours in some cases. This comes down to the debate about whether a controller management system should "try forever" (similar to typical kubernetes clusters) to recover from outside failures, versus quickly returning a terminal status to the user and unlocking the object. We highly recommend that you tune the timeouts in octavia.conf to align with your cloud's objectives. Personally, I lean towards being responsive to users rather than banging our heads against failed cloud services (e.g. running out of compute host space) with endless retries.
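As an illustrative sketch only (the values shown are roughly the upstream defaults and should be tuned to your own cloud; check the Octavia configuration reference for your release), the kind of knobs involved look like this in octavia.conf:

[haproxy_amphora]
# how long the worker retries connecting to a newly booted amphora's agent
connection_max_retries = 120
connection_retry_interval = 5

[controller_worker]
# how long the worker waits for the amphora VM to become ACTIVE in Nova
amp_active_retries = 30
amp_active_wait_sec = 10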
Also note: just because you have seen the resource in a PENDING_* state for "a long period of time(tm)" does not mean the resource has been in that state the whole time. It may have moved out and then back in, such as when automatic failovers occur. It's important to check your logs.
All of the flows in Octavia have terminal states for requests that will unlock the object with either an "ACTIVE" or "ERROR" state that is mutable (i.e. failover or delete are possible).
If you change the state in the database, you could unlock an object that one of the controllers is actively retrying, allowing another controller to take action on the object in parallel. This can lead to database corruption of the load balancer (looking at the person claiming a load balancer has five amphorae attached to it, grin) and/or resources abandoned in the cloud (neutron ports, VM instances, IP addresses, etc.). This is why we do not recommend doing this. I have seen people spend hours cleaning up after they blindly changed states in the database.
Now, aside from bugs that have been fixed over the years (I don't know what version you are running, but you can check the release notes), there are a few scenarios that can lead to "stuck" PENDING_* states:

1. Someone has killed the controller process with SIGKILL (-9), thus not allowing the flow to complete and safely unlock the objects. We typically see this when people implement systemd or container (looking at you k8s) process management tools but neglect to configure the proper graceful shutdown timeouts and/or health monitoring. Some of these process monitoring tools default to only waiting a few seconds for a graceful shutdown before issuing a SIGKILL. As with all OpenStack controllers, SIGKILL will not allow a graceful shutdown and preservation of state, which typically requires manual cleanup. (See the example systemd override after this list.)

2. Someone pulled the power cords out of the controller host without a graceful shutdown. Well, you probably have cloud-wide issues when this happens.

3. There have been bugs discovered, but rarely. Look through your controller logs to see which controller had ownership of the load balancer and why the flow was not allowed to complete. If in doubt, file a launchpad bug with the logs and we will help you take a look. Help us help you.

4. You are not running HA messaging queues and one of the messages from the API tier (where a lock was initiated) never gets delivered to a controller worker process. We recommend running with HA and durable queues to avoid this.
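On scenario 1, a minimal sketch of the sort of graceful-shutdown setting involved, assuming a systemd-managed octavia-worker service (the unit name and timeout value are assumptions; pick a timeout that comfortably exceeds your longest-running tasks):

# /etc/systemd/system/octavia-worker.service.d/graceful-stop.conf
[Service]
# give in-flight TaskFlow flows time to finish and release their locks
# before systemd escalates from SIGTERM to SIGKILL
KillSignal=SIGTERM
TimeoutStopSec=300

$ systemctl daemon-reload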
That said, the Octavia team has addressed scenarios 1-3 (3 to some degree, as there is always a possibility of a bug that circumvents this) by enabling Jobboard in the TaskFlow engine we use in Octavia. When enabled, this saves the flow state in Redis or Zookeeper and uses worker threads on the controllers to identify when a flow is not progressing properly, automatically resuming the flow on a new controller at the task it was last running. This feature was introduced in the Ussuri release.
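For reference, turning it on amounts to something like the following in the [task_flow] section of octavia.conf (check the configuration reference for your release; the Redis endpoint below is only a placeholder):

[task_flow]
jobboard_enabled = True
jobboard_backend_driver = redis_taskflow_driver
jobboard_backend_hosts = 192.0.2.10
jobboard_backend_port = 6379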
Ok, so let's say you don't have jobboard enabled in your deployment and the intern hit the "big shiny red button" near the door, killing the power to your datacenter. You probably have many services in a bad state due to a non-graceful shutdown. The first step is to run the recovery steps, starting at keystone and working your way down to glance, neutron, nova and barbican. Once your cloud is stabilized again and basic services are functioning (how many drives and power supplies did you just replace????), we can look at Octavia.
1. Do an openstack loadbalancer list as an admin and identify all load balancers that are in PENDING_* states.

2. Search your controller logs (yes, all of them) for those load balancer IDs and make sure a controller is not actively working on that load balancer, i.e. check for scrolling "retrying" log messages and trace them back to the load balancer ID (see the example grep after this list). You don't want to mess up the load balancer your web tier colleagues are frantically trying to start up or update to recover from the "big shiny red button" incident. This is a key step to make sure you don't abandon cloud resources or corrupt the load balancer in worse ways.

3. Then, AND ONLY THEN, would you enter the database and update the load balancer provisioning_status to ERROR.

4. Close the database terminal window, wipe the sweat from your brow, and either trigger a load balancer failover or load balancer delete as needed.
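As a rough sketch of steps 1 and 2 together (the log path is an assumption; adjust for wherever your deployment writes the Octavia controller logs):

$ openstack loadbalancer list -f value | grep -i PENDING
$ grep -ri <LOADBALANCER ID> /var/log/octavia/ | grep -i retry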
This process is really an "only if I know why it happened" last resort.
Michael
On Tue, Feb 13, 2024 at 3:05 PM Dale Smith <dale@catalystcloud.nz> wrote:
In the case where it's stuck in PENDING_UPDATE for a long time, we force the loadbalancer to ERROR status and then perform a failover to let Octavia re-create the amphorae.
SQL> UPDATE load_balancer SET provisioning_status = 'ERROR' WHERE id = '<LOADBALANCER ID>' LIMIT 1;
$ openstack loadbalancer failover <LOADBALANCER ID>
That *may* also help the PENDING_DELETE case, if it can recover and then allow a subsequent delete.
regards,
Dale Smith
On 14/02/24 01:59, Paul Browne wrote:
Hello Octavia users,
For a while I've been trying to find a systematic way to remove Octavia LBs + amphorae that have become stuck in ERROR or transitional PENDING_* states:
$ openstack loadbalancer amphora list -f value | grep ERROR
a2323f14-3aaa-418a-8249-111aaa9c21fe 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe ERROR MASTER 10.8.0.242 192.168.3.247
e5b236ba-e7ee-4ed7-9f58-57ce7a408489 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe ERROR BACKUP 10.8.0.190 192.168.3.247
6b556f28-93c9-49dd-b6ee-4379288e7957 d5e402fe-2c4b-49af-a700-532cb408cee5 ERROR MASTER 10.8.0.39 192.168.3.126
c669db5d-8686-4d5c-9e95-e02030b34301 d5e402fe-2c4b-49af-a700-532cb408cee5 ERROR BACKUP 10.8.0.174 192.168.3.126
$ openstack loadbalancer list -f value | grep -vi active
1fa7bd54-f60c-420c-94f3-d4c02f03d4fe k8s-clusterapi-cluster-default-ci-6386871107-kube-upgrade-kubeapi 3a06571936a0424bb40bc5c672c4ccb1 192.168.3.247 PENDING_UPDATE amphora
d5e402fe-2c4b-49af-a700-532cb408cee5 k8s-clusterapi-cluster-default-ci-6386871107-latest-kubeapi 3a06571936a0424bb40bc5c672c4ccb1 192.168.3.126 PENDING_DELETE amphora
These resources are marked immutable and so cannot be failed over or deleted:

$ openstack loadbalancer amphora failover a2323f14-3aaa-418a-8249-111aaa9c21fe
Load Balancer 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe is immutable and cannot be updated. (HTTP 409) (Request-ID: req-6e66c4e8-c3d3-4549-a03c-367017c8c8b3)

$ openstack loadbalancer failover 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe
Invalid state PENDING_UPDATE of loadbalancer resource 1fa7bd54-f60c-420c-94f3-d4c02f03d4fe (HTTP 409) (Request-ID: req-84b44212-e7a8-4101-a16f-18c774c0577e)
The backing Nova instances for these Amphora do seem to exist and be in good working order.
Is there any API way to purge these out of Octavia's service state, or would (very careful) DB hackery be required here?
Many thanks, Paul Browne
*******************
Paul Browne
Research Computing Platforms
University Information Services
Roger Needham Building
JJ Thompson Avenue
University of Cambridge
Cambridge
United Kingdom
E-Mail: pfb29@cam.ac.uk
Tel: 0044-1223-746548
*******************