[OCTAVIA][ROCKY] - MASTER & BACKUP instances unexpectedly deleted by octavia
Hi guys,

I have a weird situation here.

I operate a large-scale, multi-region Octavia service using the default amphora driver, which implies the use of Nova instances as load balancers. Everything is running really well and our customers (K8s and traditional users) are really happy with the solution so far.

However, yesterday one of those customers, who runs a load balancer in front of their ElasticSearch cluster, poked me because this load balancer suddenly went from ONLINE/OK to ONLINE/ERROR, meaning the amphorae were no longer available, yet the anchor/member/pool and listener settings still existed.

So I investigated and found out that the load balancer amphorae had been destroyed by the octavia user. The weird part is that both the MASTER and the BACKUP instance were destroyed at the same moment by the octavia service user.

Are there specific circumstances where the Octavia service could decide to delete the instances but not the anchor/members/pool? It worries me a bit, as there is no clear way to trace why Octavia took this action. I dug through the Nova and Octavia DBs in order to correlate the action, but apart from validating my investigation it doesn't really help, as there is no clue as to why the Octavia service triggered the deletion.

If anyone has any clues or tips, I'll be more than happy to discuss this situation.

Cheers guys!
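For reference, the kind of cross-check described above (correlating Nova delete actions with Octavia amphora records) can be sketched roughly as follows. This is only an illustration: it assumes direct read access to both databases, the connection parameters are placeholders, and the table and column names (nova.instance_actions, octavia.amphora with compute_id, role, status, load_balancer_id) are recalled from memory and may differ between releases.

```python
# Illustrative only: cross-check Nova delete actions against Octavia amphora
# records. Table and column names are assumptions and may vary by release.
import pymysql

nova = pymysql.connect(host="db-host", user="nova", password="***", database="nova")
octavia = pymysql.connect(host="db-host", user="octavia", password="***", database="octavia")

with octavia.cursor() as cur:
    # Amphorae (including deleted ones) that belonged to the affected load balancer.
    cur.execute(
        "SELECT id, compute_id, role, status FROM amphora WHERE load_balancer_id = %s",
        ("<lb-uuid>",),
    )
    amphorae = cur.fetchall()

for amp_id, compute_id, role, status in amphorae:
    with nova.cursor() as cur:
        # Who asked Nova to delete the backing instance, and when?
        cur.execute(
            "SELECT action, user_id, project_id, start_time "
            "FROM instance_actions WHERE instance_uuid = %s AND action = 'delete'",
            (compute_id,),
        )
        for action, user_id, project_id, start_time in cur.fetchall():
            print(f"{role} amphora {amp_id} ({status}): {action} by {user_id} at {start_time}")
```

In practice the interesting fields are the user/project that issued the delete and the timestamps, which is what points back at the octavia service user here.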
Hi Gael,

we had a similar issue in the past. You could check the octavia health manager log (it should be on the same node where the worker is running). This component monitors the status of the amphorae and restarts them if they don't trigger a callback after a specific time. This might also happen if there is some connection issue between the two components. But normally it should at least restart the LB with new amphorae…

Hope that helps

Felix
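To make the behaviour described above a bit more concrete, here is a minimal, self-contained sketch of the heartbeat/timeout idea. It is not Octavia's actual implementation; the names and the 60-second timeout are illustrative assumptions only.

```python
# Toy model of heartbeat-based health monitoring (illustrative, not Octavia code).
import time

HEARTBEAT_TIMEOUT = 60  # seconds without a callback before an amphora is considered failed

# amphora_id -> timestamp of the last heartbeat received
last_heartbeat = {
    "amphora-master": time.time(),        # just reported in
    "amphora-backup": time.time() - 300,  # silent for 5 minutes
}

def record_heartbeat(amphora_id):
    """Called whenever an amphora sends its periodic status callback."""
    last_heartbeat[amphora_id] = time.time()

def stale_amphorae(now=None):
    """Return amphorae whose last heartbeat is older than the timeout."""
    now = time.time() if now is None else now
    return [aid for aid, seen in last_heartbeat.items() if now - seen > HEARTBEAT_TIMEOUT]

for amphora_id in stale_amphorae():
    # In real Octavia the health manager would trigger a failover here,
    # replacing the silent amphora with a freshly built one.
    print(f"{amphora_id} missed its heartbeat window; failover would be triggered")
```

The relevant point for this thread is that if neither amphora's callback reaches the health manager (for example during a network problem on the management network), both the MASTER and the BACKUP can be flagged and failed over at the same time.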
Hi Felix,

« Glad » you had the same issue before, and yes, of course I looked at the HM logs, which is where I actually found out that this event was triggered by Octavia (besides the DB data that validated it). Here is my log trace related to this event; it doesn't really show any major issue IMHO.

Here is the trace that our Octavia service archived for both of our controller servers, with the initial load balancer creation trace (worker.log) and the tasks triggered on both controllers (health-manager.log):

http://paste.openstack.org/show/7z5aZYu12Ttoae3AOhwF/

I may well have missed something in it, but I don't see anything strange from my point of view. Feel free to tell me if you spot something weird.
Hi Gaël,

We also hit this issue before; it happened during the failover process, but I'm not sure your situation is the same as ours. I'll just paste my previous investigation here, hope it helps:

"With the Octavia version we have deployed in production, the amphora record in the `amphora_health` table is deleted at the beginning of the failover process in order to disable health monitoring for that amphora, while the amphora record in the `amphora` table is marked as DELETED. On the other hand, the octavia-housekeeping service will delete the amphora record in the `amphora` table if it doesn't find a related record in the `amphora_health` table, which is always true during the failover process. As a result, if the failover process fails, there will be no amphora records related to the load balancer left in the database."

The patch is here: https://review.opendev.org/#/q/Ief97ddda8261b5bbc54c6824f90ae9c7a2d81701 — unfortunately, it has not been backported to Rocky.

Best regards,
Lingxian Kong
Catalyst Cloud
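A toy model of the race described above may help. The tables are deliberately simplified and this is not the real Octavia schema or housekeeping logic; it only shows how a failover that disables health monitoring first and then fails can leave the load balancer with no amphora rows at all.

```python
# Toy reproduction of the failover/housekeeping race described above.
# The schema is deliberately simplified and is NOT the real Octavia schema.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE amphora (id TEXT PRIMARY KEY, load_balancer_id TEXT, role TEXT, status TEXT);
    CREATE TABLE amphora_health (amphora_id TEXT PRIMARY KEY, last_update TEXT);
""")

# A healthy ACTIVE/STANDBY pair for one load balancer.
for amp_id, role in [("amp-1", "MASTER"), ("amp-2", "BACKUP")]:
    db.execute("INSERT INTO amphora VALUES (?, 'lb-1', ?, 'ALLOCATED')", (amp_id, role))
    db.execute("INSERT INTO amphora_health VALUES (?, datetime('now'))", (amp_id,))

# Failover starts: health monitoring is disabled (health row deleted) and the
# amphora is marked DELETED before any replacement exists.
for amp_id in ("amp-1", "amp-2"):
    db.execute("DELETE FROM amphora_health WHERE amphora_id = ?", (amp_id,))
    db.execute("UPDATE amphora SET status = 'DELETED' WHERE id = ?", (amp_id,))

# ... the failover then fails for some reason, so no replacement amphora is created ...

# Housekeeping pass: purge amphora rows that no longer have a health record.
db.execute("DELETE FROM amphora WHERE id NOT IN (SELECT amphora_id FROM amphora_health)")

remaining = db.execute("SELECT COUNT(*) FROM amphora WHERE load_balancer_id = 'lb-1'").fetchone()[0]
print(f"amphora rows left for lb-1: {remaining}")  # -> 0: nothing left to recover from
```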
Hi Lingxian Kong,

That's actually very interesting, as I came to the same conclusion this morning during my investigation and was starting to think about a fix, which it seems you already made!

Is there a reason why it wasn't backported to Rocky?

Very helpful, many thanks; you clearly spared me hours of work! I'll review your patch and test it on our lab.
On Tue, Jun 4, 2019 at 3:06 PM Gaël THEROND <gael.therond@gmail.com> wrote:

> Is there a reason why it wasn't backported to Rocky?

The patch was merged in the master branch during the Rocky development cycle, hence it is included in stable/rocky as well.
Oh, that's perfect then; I'll just update my image and my platform, as we're using kolla-ansible and that's super easy.

You guys rock!! (Pun intended ;-)).

Many thanks to all of you; that really reassures me regarding Octavia's solidity and Kolla's flexibility ^^.
Hi guys,

Just a quick question regarding this bug: someone told me that it has been patched within stable/rocky, BUT were you talking about the openstack/octavia repository or the openstack/kolla repository?

Many thanks!
On Mon, Jun 10, 2019 at 3:14 PM Gaël THEROND <gael.therond@gmail.com> wrote:

> were you talking about the openstack/octavia repository or the openstack/kolla repository?

Octavia: https://review.opendev.org/#/q/Ief97ddda8261b5bbc54c6824f90ae9c7a2d81701
Ok, nice. Do you have the commit hash? I would look at it and validate that it has been committed to Stein too, so I could bump my service to Stein using Kolla.

Thanks!
You can find the commit hash from the link I provided. The patch is available from Queens, so it is also available in Stein.
Oh, really sorry, I was looking at your answer from my mobile mail app and it didn't show it. Sorry ^^

Many thanks for your help!
participants (4)

- Carlos Goncalves
- Felix Hüttner
- Gaël THEROND
- Lingxian Kong