[OCTAVIA][ROCKY] - MASTER & BACKUP instances unexpectedly deleted by octavia

Lingxian Kong anlin.kong at gmail.com
Tue Jun 4 11:38:27 UTC 2019


Hi Gaël,

We ran into this issue before; it happened during the failover
process, but I'm not sure your situation is the same as ours. I'm pasting
my previous investigation here in the hope that it helps.

"With the Octavia version we have deployed in the production, the amphora
record in the `amphora_health` table is deleted at the beginning of the
failover process in order to disable the amphora health monitoring, while
the amphora record in the `amphora` table is marked as DELETED.

On the other hand, the octavia-housekeeper service deletes the amphora
record in the `amphora` table if it doesn’t find the related record in the
`amphora_health` table, which is always the case during the current failover
process. As a result, if the failover process fails, there will be no
amphora records relating to the load balancer in the database."
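
To make the sequence above concrete, here is a minimal sketch of it. This is
not Octavia's actual code: the two-column schema, the function names and the
IDs (`amp-master`, `amp-backup`, `lb-1`) are all hypothetical and only model
the behaviour described in the investigation, using an in-memory SQLite
database.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a simplified, hypothetical schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE amphora "
        "(id TEXT PRIMARY KEY, load_balancer_id TEXT, status TEXT)"
    )
    db.execute("CREATE TABLE amphora_health (amphora_id TEXT PRIMARY KEY)")
    for amp_id in ("amp-master", "amp-backup"):
        db.execute(
            "INSERT INTO amphora VALUES (?, 'lb-1', 'ALLOCATED')", (amp_id,)
        )
        db.execute("INSERT INTO amphora_health VALUES (?)", (amp_id,))
    return db


def start_failover(db, amp_id):
    """Start of failover as described: drop the health row, mark DELETED."""
    db.execute("DELETE FROM amphora_health WHERE amphora_id = ?", (amp_id,))
    db.execute("UPDATE amphora SET status = 'DELETED' WHERE id = ?", (amp_id,))


def housekeeper_cleanup(db):
    """Housekeeper pass as described: purge amphorae without a health row."""
    db.execute(
        "DELETE FROM amphora "
        "WHERE id NOT IN (SELECT amphora_id FROM amphora_health)"
    )


def amphora_count(db, lb_id):
    """Number of amphora records still attached to the load balancer."""
    row = db.execute(
        "SELECT COUNT(*) FROM amphora WHERE load_balancer_id = ?", (lb_id,)
    ).fetchone()
    return row[0]


db = make_db()
for amp_id in ("amp-master", "amp-backup"):
    # Assumption: the failover fails after this step, so no replacement
    # amphora is ever recorded for the load balancer.
    start_failover(db, amp_id)
housekeeper_cleanup(db)
print(amphora_count(db, "lb-1"))  # prints 0
```

In this model, a housekeeper pass on a healthy load balancer removes nothing,
while a pass that runs after a failed failover leaves the load balancer with
no amphora records at all.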

The patch for this is here:
https://review.opendev.org/#/q/Ief97ddda8261b5bbc54c6824f90ae9c7a2d81701
Unfortunately, it has not been backported to Rocky.


Best regards,
Lingxian Kong
Catalyst Cloud


On Tue, Jun 4, 2019 at 9:13 PM Gaël THEROND <gael.therond at gmail.com> wrote:

> Hi Felix,
>
> « Glad » you had the same issue before, and yes, of course I looked at the
> HM logs, which is where I actually found out that this event was triggered
> by octavia (besides the DB data that confirmed it). Here is my log trace
> related to this event; it doesn't really show any major issue IMHO.
>
> Here is the stack trace that our octavia service logged on both of our
> controller servers, with the initial loadbalancer creation trace
> (Worker.log) and the task triggered on both controllers (Health-Manager.log).
>
> http://paste.openstack.org/show/7z5aZYu12Ttoae3AOhwF/
>
> I may well have missed something in it, but I don't see anything strange
> from my point of view.
> Feel free to tell me if you spot something weird.
>
>
> On Tue, Jun 4, 2019 at 10:38 AM, Felix Hüttner <felix.huettner at mail.schwarz>
> wrote:
>
>> Hi Gael,
>>
>>
>>
>> We had a similar issue in the past.
>>
>> You could check the Octavia health manager log (it should be on the same node
>> where the worker is running).
>>
>> This component monitors the status of the Amphorae and restarts them if
>> they don’t trigger a callback within a specific time. This might also happen
>> if there is some connection issue between the two components.
>>
>>
>>
>> But normally it should at least restart the LB with new Amphorae…
>>
>>
>>
>> Hope that helps
>>
>>
>>
>> Felix
>>
>>
>>
>> *From:* Gaël THEROND <gael.therond at gmail.com>
>> *Sent:* Tuesday, June 4, 2019 9:44 AM
>> *To:* Openstack <openstack at lists.openstack.org>
>> *Subject:* [OCTAVIA][ROCKY] - MASTER & BACKUP instances unexpectedly
>> deleted by octavia
>>
>>
>>
>> Hi guys,
>>
>>
>>
>> I’ve a weird situation here.
>>
>>
>>
>> I smoothly operate a large-scale multi-region Octavia service using the
>> default amphora driver, which implies the use of nova instances as
>> loadbalancers.
>>
>>
>>
>> Everything is running really well and our customers (K8s and traditional
>> users) are really happy with the solution so far.
>>
>>
>>
>> However, yesterday one of those customers, who uses the loadbalancer in front
>> of their ElasticSearch cluster, poked me because this loadbalancer suddenly
>> went from ONLINE/OK to ONLINE/ERROR, meaning the amphoras were no longer
>> available, yet the anchor/member/pool and listener settings still
>> existed.
>>
>>
>>
>> So I investigated and found out that the loadbalancer amphoras had been
>> destroyed by the octavia user.
>>
>>
>>
>> The weird part is that both the master and the backup instance were
>> destroyed at the same moment by the octavia service user.
>>
>>
>>
>> Are there specific circumstances where the octavia service could decide to
>> delete the instances but not the anchor/members/pool?
>>
>>
>>
>> It’s worrying me a bit as there is no clear way to trace why Octavia
>> took this action.
>>
>>
>>
>> I dug into the nova and Octavia DBs in order to correlate the action,
>> but apart from validating my investigation it doesn’t really help, as there
>> is no clue as to why the octavia service triggered the deletion.
>>
>>
>>
>> If anyone has any clue or tips to give me, I’ll be more than happy to
>> discuss this situation.
>>
>>
>>
>> Cheers guys!
>>
>