[largescale-sig] RPC ping
Hey all,

TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC; this is useful to monitor liveness of agents.

A few weeks ago, I proposed a patch to oslo_messaging [1] which adds a ping endpoint to the RPC dispatcher. It means that every OpenStack service which is using oslo_messaging RPC endpoints (almost all OpenStack services and agents - e.g. neutron server + agents, nova + computes, etc.) will then be able to answer a specific "ping" call over RPC.

I decided to propose this patch in my company mainly for 2 reasons:

1 - we are struggling to monitor our nova-compute and neutron agents in a correct way:
1.1 - sometimes our agents are disconnected from RPC, but the python process is still running.
1.2 - sometimes the agent is still connected, but the queue / binding on the rabbit cluster is not working anymore (after a rabbit split, for example). This one is very hard to debug, because the agent is still reporting health correctly on the neutron server, but it's not able to receive messages anymore.

2 - we are trying to monitor agents running in k8s pods: when running a python agent (neutron l3-agent for example) in a k8s pod, we wanted to find a way to monitor if it is still alive or not.

Adding an RPC ping endpoint could help us solve both these issues. Note that we still need an external mechanism (out of OpenStack) to do this ping. We also think it could be nice for other OpenStackers, and especially large scale ops.

Feel free to comment.

[1] https://review.opendev.org/#/c/735385/

-- Arnaud Morin
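For illustration, a minimal sketch (not taken from the patch itself) of how an external probe could drive such a ping over RPC with oslo.messaging - the method name, topic/server and transport URL below are assumptions, the real interface is whatever [1] ends up defining:

import oslo_messaging
from oslo_config import cfg

def rpc_ping(transport_url, topic, server, timeout=10):
    # Build a one-off RPC client pointed at the agent's topic queue.
    transport = oslo_messaging.get_rpc_transport(cfg.CONF, url=transport_url)
    target = oslo_messaging.Target(topic=topic, server=server)
    client = oslo_messaging.RPCClient(transport, target, timeout=timeout)
    try:
        # 'ping' is the hypothetical endpoint name added by the patch.
        return client.call({}, 'ping')
    except oslo_messaging.MessagingTimeout:
        return None  # agent did not answer within the timeout

# e.g. rpc_ping('rabbit://user:pass@rabbit-host:5672/', 'l3_agent', 'network-node-1')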
Tagging with Nova and Neutron as they are mentioned and I thought some people from those teams had opinions on this. Can you refresh my memory on why we dropped this before? I recall talking about it in Denver, but I can't for the life of me remember what the conclusion was. Did we intend to use something else for this that has since fallen through? On 7/27/20 4:57 AM, Arnaud Morin wrote:
Hey all,
TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC; this is useful to monitor liveness of agents.
A few weeks ago, I proposed a patch to oslo_messaging [1] which adds a ping endpoint to the RPC dispatcher. It means that every OpenStack service which is using oslo_messaging RPC endpoints (almost all OpenStack services and agents - e.g. neutron server + agents, nova + computes, etc.) will then be able to answer a specific "ping" call over RPC.
I decided to propose this patch in my company mainly for 2 reasons:
1 - we are struggling to monitor our nova-compute and neutron agents in a correct way:
1.1 - sometimes our agents are disconnected from RPC, but the python process is still running.
1.2 - sometimes the agent is still connected, but the queue / binding on the rabbit cluster is not working anymore (after a rabbit split, for example). This one is very hard to debug, because the agent is still reporting health correctly on the neutron server, but it's not able to receive messages anymore.
2 - we are trying to monitor agents running in k8s pods: when running a python agent (neutron l3-agent for example) in a k8s pod, we wanted to find a way to monitor if it is still alive or not.
Adding a RPC ping endpoint could help us solve both these issues. Note that we still need an external mechanism (out of OpenStack) to do this ping. We also think it could be nice for other OpenStackers, and especially large scale ops.
Feel free to comment.
Tagging with Nova and Neutron as they are mentioned and I thought some people from those teams had opinions on this.
Nova already implements ping() on the compute RPC interface, which we use to make sure compute waits to start up until conductor is available to do its bidding. So if a new obligatory RPC server method is actually added called ping(), it will break us.
Can you refresh my memory on why we dropped this before? I recall talking about it in Denver, but I can't for the life of me remember what the conclusion was. Did we intend to use something else for this that has since fallen through?
The prior conversation I recall was about helm sitting on our bus to (ab)use our ping method for health checks: https://opendev.org/openstack/openstack-helm/commit/baf5356a4fb61590a95f64a6... I believe that has since been reverted. The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API. --Dan
Hi, On 7/27/20 7:08 PM, Dan Smith wrote:
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
--Dan
While I get this concern, we have seen the problem described by the original poster in production multiple times: nova-compute reports to be healthy, is seen as up through the API, but doesn't work on any messages anymore. A health-check going through rabbitmq would really help spotting those situations, while having an additional HTTP server doesn't.

Have a nice day, Johannes

-- Johannes Kulik IT Architecture Senior Specialist SAP SE | Rosenthaler Str. 30 | 10178 Berlin | Germany
On 7/28/20 3:02 AM, Johannes Kulik wrote:
Hi,
On 7/27/20 7:08 PM, Dan Smith wrote:
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
--Dan
While I get this concern, we have seen the problem described by the original poster in production multiple times: nova-compute reports to be healthy, is seen as up through the API, but doesn't work on any messages anymore. A health-check going through rabbitmq would really help spotting those situations, while having an additional HTTP server doesn't.
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected. To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
Have a nice day, Johannes
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
On 7/28/20 3:02 AM, Johannes Kulik wrote:
Hi,
On 7/27/20 7:08 PM, Dan Smith wrote:
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
--Dan
While I get this concern, we have seen the problem described by the original poster in production multiple times: nova-compute reports to be healthy, is seen as up through the API, but doesn't work on any messages anymore. A health-check going through rabbitmq would really help spotting those situations, while having an additional HTTP server doesn't.
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status.
Do we understand why that is happening?

It kind of is a bug, this one to be precise: https://bugs.launchpad.net/nova/+bug/1854992. Assuming it is that bug, then the reason the compute status is still up is that the compute service is running fine and sending heartbeats; the issue is that under certain failure modes the topic queue used to receive RPC topic sends can disappear. One way this can happen is if the rabbitmq server restarts, in which case the resend code in oslo will reconnect to the exchange but it will not necessarily recreate the topic queue.
If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
Maybe to say that a little more clearly: https://bugs.launchpad.net/nova/+bug/1854992 has other causes beyond the rabbitmq server crashing, but the underlying effect is the same: the queue that the compute service uses to receive RPC calls is destroyed and not recreated. A related oslo bug, https://bugs.launchpad.net/oslo.messaging/+bug/1661510, was "fixed" by adding the mandatory transport flag feature. (You can probably mark that as fix released, by the way.)

From a nova perspective, the intended way to fix the nova bug was to use the new mandatory flag, catch the MessageUndeliverable exception, and have the conductor/api recreate the compute service's topic queue and resend the amqp message. An open question is whether the compute service will detect that and start processing the queue again. If that does not fix the problem, plan b was to add a self ping to the compute service, where the compute service, on a long timeout (once an hour, maybe once every 15 mins at the most), would try to send a message to its own receive queue. If it got the MessageUndeliverable exception, then the compute service would recreate its own queue.

Adding an inter-service ping, or triggering the ping externally, is unlikely to help with the nova bug. Ideally we would prefer to have the conductor/api recreate the queue and resend the message if it detects the queue is missing, rather than have a self ping, as that does not add additional load to the message bus and only recreates the queue if it's needed.

I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by the sender recreating the queue if the delivery fails, and if that is not viable then prototype the fix in nova. If the self ping fixes this missing queue error then we could extract the code into oslo.
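For concreteness, a rough sketch of that "plan b" self ping - the helper that recreates the queue is hypothetical, and the MessageUndeliverable exception is the one added by the mandatory-flag work:

import oslo_messaging
from oslo_messaging import exceptions as om_exc

def self_ping(transport, host, recreate_topic_queue):
    # Ping our own compute.<host> topic queue with the mandatory flag set,
    # so delivery fails loudly if the queue/binding has disappeared.
    topts = oslo_messaging.TransportOptions(at_least_once=True)
    target = oslo_messaging.Target(topic='compute', server=host)
    client = oslo_messaging.RPCClient(transport, target, timeout=10,
                                      transport_options=topts)
    try:
        client.call({}, 'ping')  # endpoint name is illustrative
    except om_exc.MessageUndeliverable:
        # Queue/binding is gone: recreate it and resubscribe.
        # recreate_topic_queue is a hypothetical callback supplied by the service.
        recreate_topic_queue(host)

This would be run periodically on a long interval (e.g. the once per 15-60 minutes mentioned above).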
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
Have a nice day, Johannes
Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.

[...] I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by the sender recreating the queue if the delivery fails, and if that is not viable then prototype the fix in nova. If the self ping fixes this missing queue error then we could extract the code into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix. If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitor mechanisms fail to detect. That sounds like a worthwhile addition? Alternatively, if we can monitor the exact same class of failures using our existing systems (or by improving them rather than adding a new door), that works too. -- Thierry Carrez (ttx)
On Wed, 2020-08-12 at 12:32 +0200, Thierry Carrez wrote:
Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
[...] I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by the sender recreating the queue if the delivery fails, and if that is not viable then prototype the fix in nova. If the self ping fixes this missing queue error then we could extract the code into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix.
Right, but we are concerned that there will be a negative performance impact to adding it, and it won't detect the one bug of this type we are aware of in a way that we could not also detect by using the mandatory flag.

Nova already has a heartbeat that the agents send to the conductor to report they are still alive. This ping would work in the opposite direction by reaching out to the compute node over the rpc bus, but that would only detect the failure mode if the ping uses the topic queue, and it could only fix it if recreating the queue via the conductor is a viable solution. If it is, using the mandatory flag and just recreating the queue is a better solution, since we don't need to ping constantly in the background: if we get the exception, we recreate the queue and retransmit. If the compute manager does not resubscribe to the topic when the queue is recreated automatically, then the new ping feature won't really help; we would need the compute service, or any other service that subscribes to the topic queue, to try to ping its own topic queue and, if that fails, recreate the subscription/queue. As far as I am aware that is not what the feature is proposing.
If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitor mechanisms fail to detect. That sounds like a worthwhile addition?
I'm not sure what failure mode it will detect. If they can define that, it would help with understanding whether this is worthwhile or not.
Alternatively, if we can monitor the exact same class of failures using our existing systems (or by improving them rather than adding a new door), that works too.
We can monitor the existence of the queue at least from the rabbitmq API (it's disabled by default, but just enable the rabbitmq-management plugin), but I'm not sure what the current issue they are trying to solve is.
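For example, a minimal sketch of that kind of external check against the rabbitmq management HTTP API (plugin enabled; host, credentials and queue name below are illustrative):

import requests

def queue_exists(mgmt_host, queue, vhost='/', user='guest', password='guest'):
    # GET /api/queues/<vhost>/<name> returns 200 if the queue exists, 404 otherwise.
    vhost_enc = requests.utils.quote(vhost, safe='')
    url = f'http://{mgmt_host}:15672/api/queues/{vhost_enc}/{queue}'
    return requests.get(url, auth=(user, password)).status_code == 200

# e.g. queue_exists('rabbit-1', 'compute.compute-01.example.org')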
On 8/12/20 5:32 AM, Thierry Carrez wrote:
Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
[...] I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by the sender recreating the queue if the delivery fails, and if that is not viable then prototype the fix in nova. If the self ping fixes this missing queue error then we could extract the code into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix.
If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitor mechanisms fail to detect. That sounds like a worthwhile addition?
Okay, I don't think I was aware that this was already being used. If someone already finds it useful and it's opt-in then I'm not inclined to block it. My main concern was that we were adding a feature that didn't actually address the problem at hand. I _would_ feel better about it if someone could give an example of a type of failure this is detecting that is missed by other monitoring methods though. Both because having a concrete example of a use case for the feature is good, and because if it turns out that the problems this is detecting are things like the Nova bug Sean is talking about (which I don't think this would catch anyway, since the topic is missing and there's nothing to ping) then there may be other changes we can/should make to improve things.
Alternatively, if we can monitor the exact same class of failures using our existing systems (or by improving them rather than adding a new door), that works too.
Ben Nemec wrote:
On 8/12/20 5:32 AM, Thierry Carrez wrote:
Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
[...] I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by the sender recreating the queue if the delivery fails, and if that is not viable then prototype the fix in nova. If the self ping fixes this missing queue error then we could extract the code into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix.
If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitor mechanisms fail to detect. That sounds like a worthwhile addition?
Okay, I don't think I was aware that this was already being used. If someone already finds it useful and it's opt-in then I'm not inclined to block it. My main concern was that we were adding a feature that didn't actually address the problem at hand.
I _would_ feel better about it if someone could give an example of a type of failure this is detecting that is missed by other monitoring methods though. Both because having a concrete example of a use case for the feature is good, and because if it turns out that the problems this is detecting are things like the Nova bug Sean is talking about (which I don't think this would catch anyway, since the topic is missing and there's nothing to ping) then there may be other changes we can/should make to improve things.
Right. Let's wait for Arnaud to come back from vacation and confirm that (1) that patch is not a shot in the dark: it allows them to expose a class of issues in production (2) they fail to expose that same class of issues using other existing mechanisms, including those just suggested in this thread I just wanted to avoid early rejection of this health check ability on the grounds that the situation it exposes should just not happen. Or that, if enabled and heavily used, it would have a performance impact. -- Thierry Carrez (ttx)
On Thu, 2020-08-13 at 10:24 +0200, Thierry Carrez wrote:
Ben Nemec wrote:
On 8/12/20 5:32 AM, Thierry Carrez wrote:
Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
[...] I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by the sender recreating the queue if the delivery fails, and if that is not viable then prototype the fix in nova. If the self ping fixes this missing queue error then we could extract the code into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix.
If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitor mechanisms fail to detect. That sounds like a worthwhile addition?
Okay, I don't think I was aware that this was already being used. If someone already finds it useful and it's opt-in then I'm not inclined to block it. My main concern was that we were adding a feature that didn't actually address the problem at hand.
I _would_ feel better about it if someone could give an example of a type of failure this is detecting that is missed by other monitoring methods though. Both because having a concrete example of a use case for the feature is good, and because if it turns out that the problems this is detecting are things like the Nova bug Sean is talking about (which I don't think this would catch anyway, since the topic is missing and there's nothing to ping) then there may be other changes we can/should make to improve things.
Right. Let's wait for Arnaud to come back from vacation and confirm that
(1) that patch is not a shot in the dark: it allows them to expose a class of issues in production
(2) they fail to expose that same class of issues using other existing mechanisms, including those just suggested in this thread
I just wanted to avoid early rejection of this health check ability on the grounds that the situation it exposes should just not happen. Or that, if enabled and heavily used, it would have a performance impact.

I think the initial push back from nova is that we already have a ping rpc function, https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0..., so if a generic method called ping is added it will break nova.
The rest of the push back is related to not having a concrete use case, including concern over performance considerations and external services potentially accessing the rpc bus, which is considered an internal api. E.g. we would not want an external monitoring solution connecting to the rpc bus and invoking arbitrary RPC calls; ping is, well, pretty safe, but from a design point of view, while listening to notifications is fine, we don't want anything outside of the openstack services actually sending messages on the rpc bus.

So if this does actually detect something we cannot otherwise detect, and the use case involves using it within the openstack services, not from an external source, then I think that is fine, but we probably need to use another name (alive? status?) or otherwise modify nova so that there is no conflict.
On 8/13/20 7:14 AM, Sean Mooney wrote:
On Thu, 2020-08-13 at 10:24 +0200, Thierry Carrez wrote:
Ben Nemec wrote:
On 8/12/20 5:32 AM, Thierry Carrez wrote:
Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
[...] I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by the sender recreating the queue if the delivery fails, and if that is not viable then prototype the fix in nova. If the self ping fixes this missing queue error then we could extract the code into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix.
If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitor mechanisms fail to detect. That sounds like a worthwhile addition?
Okay, I don't think I was aware that this was already being used. If someone already finds it useful and it's opt-in then I'm not inclined to block it. My main concern was that we were adding a feature that didn't actually address the problem at hand.
I _would_ feel better about it if someone could give an example of a type of failure this is detecting that is missed by other monitoring methods though. Both because having a concrete example of a use case for the feature is good, and because if it turns out that the problems this is detecting are things like the Nova bug Sean is talking about (which I don't think this would catch anyway, since the topic is missing and there's nothing to ping) then there may be other changes we can/should make to improve things.
Right. Let's wait for Arnaud to come back from vacation and confirm that
(1) that patch is not a shot in the dark: it allows them to expose a class of issues in production
(2) they fail to expose that same class of issues using other existing mechanisms, including those just suggested in this thread
I just wanted to avoid early rejection of this health check ability on the grounds that the situation it exposes should just not happen. Or that, if enabled and heavily used, it would have a performance impact.

I think the initial push back from nova is that we already have a ping rpc function, https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0..., so if a generic method called ping is added it will break nova.
It occurred to me after I commented on the review that we have tempest running on oslo.messaging changes and it passed on the patch for this. I suppose it's possible that it broke some error handling in Nova that just isn't tested, but maybe the new ping could function as a cross-project replacement for the Nova ping? Anyway, it'd still be best to deduplicate the name, but I felt kind of dumb about having asked if it was tested when the test results were right in front of me. ;-)
The rest of the push back is related to not having a concrete use case, including concern over performance considerations and external services potentially accessing the rpc bus, which is considered an internal api. E.g. we would not want an external monitoring solution connecting to the rpc bus and invoking arbitrary RPC calls; ping is, well, pretty safe, but from a design point of view, while listening to notifications is fine, we don't want anything outside of the openstack services actually sending messages on the rpc bus.
I'm not concerned about the performance impact here. It's an optional feature, so anyone using it is choosing to take that hit. Having external stuff on the RPC bus is more of a gray area, but it's not like we can stop operators from doing that. I think it's probably better to provide a well-defined endpoint for them to talk to rather than have everyone implement their own slightly different RPC ping mechanism. The docs for this feature should be very explicit that this is the only thing external code should be calling.
So if this does actually detect something we cannot otherwise detect, and the use case involves using it within the openstack services, not from an external source, then I think that is fine, but we probably need to use another name (alive? status?) or otherwise modify nova so that there is no conflict.
If I understand your analysis of the bug correctly, this would have caught that type of outage after all since the failure was asymmetric. The compute node was still able to send its status updates to Nova, but wasn't receiving any messages. A ping would have detected that situation.
On Thu, 2020-08-13 at 10:28 -0500, Ben Nemec wrote:
On 8/13/20 7:14 AM, Sean Mooney wrote:
On Thu, 2020-08-13 at 10:24 +0200, Thierry Carrez wrote:
Ben Nemec wrote:
On 8/12/20 5:32 AM, Thierry Carrez wrote:
Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote: I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected. To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
[...] I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by the sender recreating the queue if the delivery fails, and if that is not viable then prototype the fix in nova. If the self ping fixes this missing queue error then we could extract the code into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix.
If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitor mechanisms fail to detect. That sounds like a worthwhile addition?
Okay, I don't think I was aware that this was already being used. If someone already finds it useful and it's opt-in then I'm not inclined to block it. My main concern was that we were adding a feature that didn't actually address the problem at hand.
I _would_ feel better about it if someone could give an example of a type of failure this is detecting that is missed by other monitoring methods though. Both because having a concrete example of a use case for the feature is good, and because if it turns out that the problems this is detecting are things like the Nova bug Sean is talking about (which I don't think this would catch anyway, since the topic is missing and there's nothing to ping) then there may be other changes we can/should make to improve things.
Right. Let's wait for Arnaud to come back from vacation and confirm that
(1) that patch is not a shot in the dark: it allows them to expose a class of issues in production
(2) they fail to expose that same class of issues using other existing mechanisms, including those just suggested in this thread
I just wanted to avoid early rejection of this health check ability on the grounds that the situation it exposes should just not happen. Or that, if enabled and heavily used, it would have a performance impact.
I think the initial push back from nova is that we already have a ping rpc function, https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0..., so if a generic method called ping is added it will break nova.
It occurred to me after I commented on the review that we have tempest running on oslo.messaging changes and it passed on the patch for this. I suppose it's possible that it broke some error handling in Nova that just isn't tested, but maybe the new ping could function as a cross-project replacement for the Nova ping?
Probably, yes. It's only used in one place, https://opendev.org/openstack/nova/src/branch/master/nova/conductor/api.py#L..., which in turn is only used in the nova service base class: https://github.com/openstack/nova/blob/0b613729ff975f69587a17cc7818c09f7683e... So worst case I think it's just going to cause the service to start before the conductor is ready; however, they have to tolerate the conductor restarting etc. anyway, so I don't think it will break anything too badly. I don't see why we could not use a generic version instead.
Anyway, it'd still be best to deduplicate the name, but I felt kind of dumb about having asked if it was tested when the test results were right in front of me. ;-)
The rest of the push back is related to not having a concrete use case, including concern over performance considerations and external services potentially accessing the rpc bus, which is considered an internal api. E.g. we would not want an external monitoring solution connecting to the rpc bus and invoking arbitrary RPC calls; ping is, well, pretty safe, but from a design point of view, while listening to notifications is fine, we don't want anything outside of the openstack services actually sending messages on the rpc bus.
I'm not concerned about the performance impact here. It's an optional feature, so anyone using it is choosing to take that hit.
Having external stuff on the RPC bus is more of a gray area, but it's not like we can stop operators from doing that.
Well, upstream we certainly can't really stop them. Downstream, on the other hand, without going through the certification process to have your product certified to work with our downstream distribution, directly invoking RPC endpoints would invalidate your support. So from a downstream perspective we do have ways to prevent that, via docs and making it clear that it's not supported. We can technically do that upstream but can't really enforce it; it's open source software, after all - if you break it then you get to keep the broken pieces.
I think it's probably better to provide a well-defined endpoint for them to talk to rather than have everyone implement their own slightly different RPC ping mechanism. The docs for this feature should be very explicit that this is the only thing external code should be calling.

Ya, I think that is a good approach. I would still prefer if people used, say, middleware to add a service ping admin api endpoint instead of directly calling the rpc endpoint, to avoid exposing rabbitmq, but that is out of scope of this discussion.
So if this does actually detect something we cannot otherwise detect, and the use case involves using it within the openstack services, not from an external source, then I think that is fine, but we probably need to use another name (alive? status?) or otherwise modify nova so that there is no conflict.
If I understand your analysis of the bug correctly, this would have caught that type of outage after all since the failure was asymmetric.
Hmm, I'm not sure - it might, yes. Looking at https://review.opendev.org/#/c/735385/6 it's not clear to me how the endpoint is invoked. Is it doing a topic send or a direct send? To detect the failure you would need to invoke a ping on the compute service, and that ping would have to be enqueued on the nova topic exchange with a routing key of compute.<compute node hostname>.

If the compute topic queue was broken, either because it was no longer bound to the correct topic or due to some other rabbitmq error, then you would either get a message undeliverable error of some kind with the mandatory flag, or likely a timeout without the mandatory flag. So if the ping would be routed using a topic to compute.<compute node hostname>, then yes, it would find this.

Although we can also detect this ourselves and fix it using the mandatory flag, I think, by just recreating the queue when it's missing and we get an undeliverable message - at least I think we can; rabbit is not my main area of expertise, so it would be nice if someone who knows more about it could weigh in on that.
The compute node was still able to send its status updates to Nova, but wasn't receiving any messages. A ping would have detected that situation.
On 8/13/20 11:07 AM, Sean Mooney wrote:
I think it's probably better to provide a well-defined endpoint for them to talk to rather than have everyone implement their own slightly different RPC ping mechanism. The docs for this feature should be very explicit that this is the only thing external code should be calling.

Ya, I think that is a good approach. I would still prefer if people used, say, middleware to add a service ping admin api endpoint instead of directly calling the rpc endpoint, to avoid exposing rabbitmq, but that is out of scope of this discussion.
Completely agree. In the long run I would like to see this replaced with better integrated healthchecking in OpenStack, but we've been talking about that for years and have made minimal progress.
So if this does actually detect something we cannot otherwise detect, and the use case involves using it within the openstack services, not from an external source, then I think that is fine, but we probably need to use another name (alive? status?) or otherwise modify nova so that there is no conflict.
If I understand your analysis of the bug correctly, this would have caught that type of outage after all since the failure was asymmetric.
Hmm, I'm not sure - it might, yes. Looking at https://review.opendev.org/#/c/735385/6 it's not clear to me how the endpoint is invoked. Is it doing a topic send or a direct send? To detect the failure you would need to invoke a ping on the compute service, and that ping would have to be enqueued on the nova topic exchange with a routing key of compute.<compute node hostname>.

If the compute topic queue was broken, either because it was no longer bound to the correct topic or due to some other rabbitmq error, then you would either get a message undeliverable error of some kind with the mandatory flag, or likely a timeout without the mandatory flag. So if the ping would be routed using a topic to compute.<compute node hostname>, then yes, it would find this.

Although we can also detect this ourselves and fix it using the mandatory flag, I think, by just recreating the queue when it's missing and we get an undeliverable message - at least I think we can; rabbit is not my main area of expertise, so it would be nice if someone who knows more about it could weigh in on that.
I pinged Ken this morning to take a look at that. He should be able to tell us whether it's a good idea or crazy talk. :-)
On Thu, Aug 13, 2020 at 12:30 PM Ben Nemec <openstack@nemebean.com> wrote:
On 8/13/20 11:07 AM, Sean Mooney wrote:
I think it's probably better to provide a well-defined endpoint for them to talk to rather than have everyone implement their own slightly different RPC ping mechanism. The docs for this feature should be very explicit that this is the only thing external code should be calling.

Ya, I think that is a good approach. I would still prefer if people used, say, middleware to add a service ping admin api endpoint instead of directly calling the rpc endpoint, to avoid exposing rabbitmq, but that is out of scope of this discussion.
Completely agree. In the long run I would like to see this replaced with better integrated healthchecking in OpenStack, but we've been talking about that for years and have made minimal progress.
So if this does actually detect something we cannot otherwise detect, and the use case involves using it within the openstack services, not from an external source, then I think that is fine, but we probably need to use another name (alive? status?) or otherwise modify nova so that there is no conflict.
If I understand your analysis of the bug correctly, this would have caught that type of outage after all since the failure was asymmetric.

Hmm, I'm not sure - it might, yes. Looking at https://review.opendev.org/#/c/735385/6 it's not clear to me how the endpoint is invoked. Is it doing a topic send or a direct send? To detect the failure you would need to invoke a ping on the compute service, and that ping would have to be enqueued on the nova topic exchange with a routing key of compute.<compute node hostname>.

If the compute topic queue was broken, either because it was no longer bound to the correct topic or due to some other rabbitmq error, then you would either get a message undeliverable error of some kind with the mandatory flag, or likely a timeout without the mandatory flag. So if the ping would be routed using a topic to compute.<compute node hostname>, then yes, it would find this.

Although we can also detect this ourselves and fix it using the mandatory flag, I think, by just recreating the queue when it's missing and we get an undeliverable message - at least I think we can; rabbit is not my main area of expertise, so it would be nice if someone who knows more about it could weigh in on that.
I pinged Ken this morning to take a look at that. He should be able to tell us whether it's a good idea or crazy talk. :-)
Like I can tell the difference between crazy and good ideas. Ben, I thought you knew me better. ;)

As discussed, you can enable the mandatory flag on a per-RPCClient instance, for example:

    _topts = oslo_messaging.TransportOptions(at_least_once=True)
    client = oslo_messaging.RPCClient(self.transport,
                                      self.target,
                                      timeout=conf.timeout,
                                      version_cap=conf.target_version,
                                      transport_options=_topts).prepare()

This will cause an rpc call/cast to fail if rabbitmq cannot find a queue for the rpc request message [note the difference between 'queuing the message' and 'having the message consumed' - the mandatory flag has nothing to do with whether or not the message is eventually consumed].

Keep in mind that there may be some cases where having no active consumers is ok and you do not want to get a delivery failure exception - specifically fanout or perhaps cast. Depends on the use case. If there are fanout use cases that fail or degrade if all present services don't get a message, then the mandatory flag will not detect an error if a subset of the bindings are lost.

My biggest concern with this type of failure (lost binding) is that apparently the consumer is none the wiser when it happens. Without some sort of event issued by rabbitmq, the RPC server cannot detect this problem and take corrective actions (or at least I cannot think of any ATM).

-- Ken Giusti (kgiusti@gmail.com)
Hey all,

TLDR:
- Patch in [1] updated
- Example of usage in [3]
- Agree with fixing nova/rabbit/oslo but would like to keep this ping endpoint also
- Totally agree with documentation needed

Long:

Thank you all for your review and for the great information you bring to that topic!

First thing, we are not yet using that patch in production, but in testing/dev only for now (at OVH). But the plan is to use it in production ASAP.

Also, we initially pushed that for the neutron agents; that's why I missed the fact that nova already used the "ping" endpoint, sorry for that. Anyway, I don't care about the naming, so in the latest patchset of [1] you will see that I changed the name of the endpoint following Ken Giusti's suggestions.

The bug reported in [2] looks very similar to what we saw. Thank you Sean for bringing that to attention in this thread. To detect this error, using the above "ping" endpoint in oslo, we can use a script like the one in [3] (sorry about it, I can write better python :p). As mentioned by Sean in a previous mail, I am effectively calling the topic "compute.host123456.sbg5.cloud.ovh.net" in the "nova" exchange. My initial plan would be to identify topics related to a compute and do pings on all topics, to make sure that all of them are answering. I am not yet sure about how often, and whether this is a good plan btw.

Anyway, the compute is reporting its status as UP, but the ping is timing out, which is exactly what I wanted to detect!

I mostly agree with all your comments about the fact that this is a trick that we do as operators, and using the RPC bus is maybe not the best approach, but this is pragmatic and quite simple IMHO. What I also like in this solution is the fact that this is partially outside of OpenStack: the endpoint is inside, but doing the ping is external. Monitoring OpenStack is not always easy, and sometimes we struggle to find the root cause of some issues. Having such an endpoint allows us to monitor OpenStack from an external point of view, but still in a deeper way. It's like a probe in your car telling you that even if you are still running, your engine is off :)

Still, making sure that this bug is fixed by doing some work on (rabbit|oslo.messaging|nova|whatever) is the best thing to do. However, IMO, this does not prevent this rpc ping endpoint from existing.

Last, but not least, I totally agree about documenting this, but also adding some documentation on how to configure rabbit and OpenStack services in a way that fits operator needs. There are plenty of parameters which could be tweaked on both the OpenStack and rabbit side. IMO, we need to explain a little bit more what the impact of setting a specific parameter to a given value is. For example, in another discussion ([4]), we were talking about "durable" queues in rabbit. We managed to find that if we enable HA, we should also enable durability of queues. Anyway that's another topic, and this is also something we discuss in the large-scale group.

Thank you all,

[1] https://review.opendev.org/#/c/735385/
[2] https://bugs.launchpad.net/nova/+bug/1854992
[3] http://paste.openstack.org/show/796990/
[4] http://lists.openstack.org/pipermail/openstack-discuss/2020-August/016362.ht...

-- Arnaud Morin

On 13.08.20 - 17:17, Ken Giusti wrote:
On Thu, Aug 13, 2020 at 12:30 PM Ben Nemec <openstack@nemebean.com> wrote:
On 8/13/20 11:07 AM, Sean Mooney wrote:
I think it's probably better to provide a well-defined endpoint for them to talk to rather than have everyone implement their own slightly different RPC ping mechanism. The docs for this feature should be very explicit that this is the only thing external code should be calling.

Ya, I think that is a good approach. I would still prefer if people used, say, middleware to add a service ping admin api endpoint instead of directly calling the rpc endpoint, to avoid exposing rabbitmq, but that is out of scope of this discussion.
Completely agree. In the long run I would like to see this replaced with better integrated healthchecking in OpenStack, but we've been talking about that for years and have made minimal progress.
So if this does actually detect something we cannot otherwise detect, and the use case involves using it within the openstack services, not from an external source, then I think that is fine, but we probably need to use another name (alive? status?) or otherwise modify nova so that there is no conflict.
If I understand your analysis of the bug correctly, this would have caught that type of outage after all since the failure was asymmetric.

Hmm, I'm not sure - it might, yes. Looking at https://review.opendev.org/#/c/735385/6 it's not clear to me how the endpoint is invoked. Is it doing a topic send or a direct send? To detect the failure you would need to invoke a ping on the compute service, and that ping would have to be enqueued on the nova topic exchange with a routing key of compute.<compute node hostname>.

If the compute topic queue was broken, either because it was no longer bound to the correct topic or due to some other rabbitmq error, then you would either get a message undeliverable error of some kind with the mandatory flag, or likely a timeout without the mandatory flag. So if the ping would be routed using a topic to compute.<compute node hostname>, then yes, it would find this.

Although we can also detect this ourselves and fix it using the mandatory flag, I think, by just recreating the queue when it's missing and we get an undeliverable message - at least I think we can; rabbit is not my main area of expertise, so it would be nice if someone who knows more about it could weigh in on that.
I pinged Ken this morning to take a look at that. He should be able to tell us whether it's a good idea or crazy talk. :-)
Like I can tell the difference between crazy and good ideas. Ben I thought you knew me better. ;)
As discussed you can enable the mandatory flag on a per RPCClient instance, for example:
    _topts = oslo_messaging.TransportOptions(at_least_once=True)
    client = oslo_messaging.RPCClient(self.transport,
                                      self.target,
                                      timeout=conf.timeout,
                                      version_cap=conf.target_version,
                                      transport_options=_topts).prepare()
This will cause an rpc call/cast to fail if rabbitmq cannot find a queue for the rpc request message [note the difference between 'queuing the message' and 'having the message consumed' - the mandatory flag has nothing to do with whether or not the message is eventually consumed].
Keep in mind that there may be some cases where having no active consumers is ok and you do not want to get a delivery failure exception - specifically fanout or perhaps cast. Depends on the use case. If there are fanout use cases that fail or degrade if all present services don't get a message then the mandatory flag will not detect an error if a subset of the bindings are lost.
My biggest concern with this type of failure (lost binding) is that apparently the consumer is none the wiser when it happens. Without some sort of event issued by rabbitmq the RPC server cannot detect this problem and take corrective actions (or at least I cannot think of any ATM).
-- Ken Giusti (kgiusti@gmail.com)
Thanks for your patience with this! In the last Oslo meeting we had discussed possibly adding some sort of ping client to oslo.messaging to provide a common interface to use this. That would mitigate some of the concerns about everyone having to write their own ping test and potentially sending incorrect messages on the rabbit bus. Obviously that would be done as a followup to this, but I thought I'd mention it in case anyone wants to take a crack at writing something up. On 8/20/20 10:35 AM, Arnaud Morin wrote:
Hey all,
TLDR: - Patch in [1] updated - Example of usage in [3] - Agree with fixing nova/rabbit/oslo but would like to keep this ping endpoint also - Totally agree with documentation needed
Long:
Thank you all for your review and for the great information you bring to that topic!
First thing, we are not yet using that patch in production, but in testing/dev only for now (at OVH). But the plan is to use it in production ASAP.
Also, we initially pushed that for neutron agent, that's why I missed the fact that nova already used the "ping" endpoint, sorry for that.
Anyway, I dont care about the naming, so in latest patchset of [1], you will see that I changed the name of the endpoint following Ken Giusti suggestions.
The bug reported in [2] looks very similar to what we saw. Thank you Sean for bringing that to attention in this thread.
To detect this error, using the above "ping" endpoint in oslo, we can use a script like the one in [3] (sorry about it, I can write better python :p). As mentioned by Sean in a previous mail, I am effectively calling the topic "compute.host123456.sbg5.cloud.ovh.net" in the "nova" exchange. My initial plan would be to identify topics related to a compute and do pings on all topics, to make sure that all of them are answering. I am not yet sure about how often, and whether this is a good plan btw.

Anyway, the compute is reporting its status as UP, but the ping is timing out, which is exactly what I wanted to detect!

I mostly agree with all your comments about the fact that this is a trick that we do as operators, and using the RPC bus is maybe not the best approach, but this is pragmatic and quite simple IMHO. What I also like in this solution is the fact that this is partially outside of OpenStack: the endpoint is inside, but doing the ping is external. Monitoring OpenStack is not always easy, and sometimes we struggle to find the root cause of some issues. Having such an endpoint allows us to monitor OpenStack from an external point of view, but still in a deeper way. It's like a probe in your car telling you that even if you are still running, your engine is off :)
Still, making sure that this bug is fixed by doing some work on (rabbit|oslo.messaging|nova|whatever) is the best thing to do.
However, IMO, this does not prevent this rpc ping endpoint from existing.
Last, but not least, I totally agree about documenting this, but also adding some documentation on how to configure rabbit and OpenStack services in a way that fits operator needs. There are plenty of parameters which could be tweaked on both the OpenStack and rabbit side. IMO, we need to explain a little bit more what the impact of setting a specific parameter to a given value is. For example, in another discussion ([4]), we were talking about "durable" queues in rabbit. We managed to find that if we enable HA, we should also enable durability of queues.
Anyway that's another topic, and this is also something we discuss in large-scale group.
Thank you all,
[1] https://review.opendev.org/#/c/735385/ [2] https://bugs.launchpad.net/nova/+bug/1854992 [3] http://paste.openstack.org/show/796990/ [4] http://lists.openstack.org/pipermail/openstack-discuss/2020-August/016362.ht...
On 7/27/20 7:08 PM, Dan Smith wrote:
Tagging with Nova and Neutron as they are mentioned and I thought some people from those teams had opinions on this.
Nova already implements ping() on the compute RPC interface, which we use to make sure compute waits to start up until conductor is available to do its bidding. So if a new obligatory RPC server method is actually added called ping(), it will break us.
Can you refresh my memory on why we dropped this before? I recall talking about it in Denver, but I can't for the life of me remember what the conclusion was. Did we intend to use something else for this that has since fallen through?
The prior conversation I recall was about helm sitting on our bus to (ab)use our ping method for health checks:
https://opendev.org/openstack/openstack-helm/commit/baf5356a4fb61590a95f64a6...
I believe that has since been reverted.
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
Having an RPC ping in the common messaging library could improve liveness handling for long-running APIs, like listing many Neutron ports or Heat objects with full details, or maybe running some longish Mistral workflow. It should of course be done without breaking the things that already exist in Nova.
--Dan
-- Best regards, Bogdan Dobrelya, Irc #bogdando
On Tue, Jul 28, 2020 at 4:48 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
On 7/27/20 7:08 PM, Dan Smith wrote:
Tagging with Nova and Neutron as they are mentioned and I thought some people from those teams had opinions on this.
Nova already implements ping() on the compute RPC interface, which we use to make sure compute waits to start up until conductor is available to do its bidding. So if a new obligatory RPC server method is actually added called ping(), it will break us.
Can you refresh my memory on why we dropped this before? I recall talking about it in Denver, but I can't for the life of me remember what the conclusion was. Did we intend to use something else for this that has since fallen through?
The prior conversation I recall was about helm sitting on our bus to (ab)use our ping method for health checks:
https://opendev.org/openstack/openstack-helm/commit/baf5356a4fb61590a95f64a6...
I believe that has since been reverted.
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
Having RPC ping in the common messaging library could improve aliveness handling of long-running APIs, like listing multiple Neutron ports or Heat objects with full details, or running some longish Mistral workflow maybe. Indeed it should be made not breaking things already existing in Nova ofc.
Not sure whether this is related to your concern about long-running APIs, but O.M. has an optional RPC call heartbeat monitor that verifies connectivity to the server while the call is in progress. See the description of call_monitor_timeout in the RPC client docs [0]. 0: https://docs.openstack.org/oslo.messaging/latest/reference/rpcclient.html
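(A small sketch of what enabling that looks like on the client side; the transport URL, topic and method name are placeholders, and the exact semantics are the ones described in [0]:)

    import oslo_messaging
    from oslo_config import cfg

    transport = oslo_messaging.get_rpc_transport(
        cfg.CONF, url='rabbit://user:password@rabbit-host:5672/')  # placeholder
    target = oslo_messaging.Target(topic='some_long_running_service')  # placeholder

    client = oslo_messaging.RPCClient(
        transport, target,
        timeout=3600,              # overall deadline for the call
        call_monitor_timeout=30)   # fail early if the server stops heartbeating

    # The server side has to keep heartbeating the in-progress call for
    # this to detect a dead peer before the full timeout expires.
    result = client.call({}, 'do_long_running_thing')  # placeholder method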
--Dan
-- Best regards, Bogdan Dobrelya, Irc #bogdando
-- Ken Giusti (kgiusti@gmail.com)
On 7/28/20 4:11 PM, Ken Giusti wrote:
On Tue, Jul 28, 2020 at 4:48 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
Having RPC ping in the common messaging library could improve aliveness handling of long-running APIs, like listing multiple Neutron ports or Heat objects with full details, or running some longish Mistral workflow maybe. Indeed it should be made not breaking things already existing in Nova ofc.
Not sure this is related to your concern about long running API's but O.M. has an optional RPC call heartbeat monitor that verifies the connectivity to the server while the call is in progress. See the description of call_monitor_timeout in the RPC client docs [0].
Correct, but heartbeats didn't turn out to be a reliable solution. There were WSGI- and eventlet-related issues [1] with running heartbeats. I can't recall what the final outcome of that discussion was, nor what the fix was. So relying on explicit pings sent by clients could perhaps work better. [1] https://bugs.launchpad.net/tripleo/+bug/1829062
0: https://docs.openstack.org/oslo.messaging/latest/reference/rpcclient.html
-- Best regards, Bogdan Dobrelya, Irc #bogdando
-- Ken Giusti (kgiusti@gmail.com)
-- Best regards, Bogdan Dobrelya, Irc #bogdando
On 7/28/20 9:25 AM, Bogdan Dobrelya wrote:
On 7/28/20 4:11 PM, Ken Giusti wrote:
On Tue, Jul 28, 2020 at 4:48 AM Bogdan Dobrelya <bdobreli@redhat.com> wrote:
Having RPC ping in the common messaging library could improve aliveness handling of long-running APIs, like listing multiple Neutron ports or Heat objects with full details, or running some longish Mistral workflow maybe. Indeed it should be made not breaking things already existing in Nova ofc.
Not sure this is related to your concern about long running API's but O.M. has an optional RPC call heartbeat monitor that verifies the connectivity to the server while the call is in progress. See the description of call_monitor_timeout in the RPC client docs [0].
Correct, but heartbeats didn't show off as a reliable solution. There were WSGI & eventlet related issues [1] with running heartbeats. I can't recall that was the final outcome of that discussion and what was the fix. So relying on explicit pings sent by clients could work better perhaps.
How so? The client is going to do the exact same thing as oslo.messaging heartbeats - start a separate thread to send pings, then make the long-running RPC call. It would hit the same eventlet/wsgi bug that oslo.messaging does.

Also, there's a workaround for that bug in oslo.messaging: https://github.com/openstack/oslo.messaging/commit/1541b0c7f965b9defb02b9e63... If you re-implemented heartbeating you would have to also re-implement the workaround.

On a related note, I've added a topic to our next meeting to discuss turning that workaround on by default, since it's been there for a year and no one has complained that it broke them.
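(For reference, the workaround being discussed is, as far as I can tell, exposed as the heartbeat_in_pthread option of the rabbit driver; a minimal sketch of enabling it, to be verified against the oslo.messaging release in use:)

    [oslo_messaging_rabbit]
    # Run the AMQP heartbeat in a native thread instead of an eventlet
    # greenthread, so heartbeats keep flowing when the service runs under
    # a wsgi server. Option name assumed; check your release notes.
    heartbeat_in_pthread = true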
[1] https://bugs.launchpad.net/tripleo/+bug/1829062
0: https://docs.openstack.org/oslo.messaging/latest/reference/rpcclient.html
-- Best regards, Bogdan Dobrelya, Irc #bogdando
-- Ken Giusti (kgiusti@gmail.com)
Correct, but heartbeats didn't show off as a reliable solution. There were WSGI & eventlet related issues [1] with running heartbeats. I can't recall that was the final outcome of that discussion and what was the fix. So relying on explicit pings sent by clients could work better perhaps.
There are two types of heartbeats in and around oslo.messaging, which is why call_monitor was used for the long-running RPC thing. The bug you're referencing is, I believe, talking about heartbeating the api->rabbit connection, and has nothing to do with service-to-service pinging, which this thread is about.

The call_monitor stuff Ken mentioned requires the *server* side to do the heartbeating, so something like nova-compute or nova-conductor. Those things aren't running under uwsgi and don't have any problems with threading to accomplish those goals.

So, if we're talking about a generic ping() to provide a robust long-running RPC call, oslo.messaging already does this (if you ask for it). Otherwise, a generic service-to-service ping() doesn't, as was mentioned, really mean anything at all about the ability to do meaningful work (other than further saturate the message bus).

--Dan
On 7/29/20 12:26 AM, Dan Smith wrote:
Correct, but heartbeats didn't show off as a reliable solution. There were WSGI & eventlet related issues [1] with running heartbeats. I can't recall that was the final outcome of that discussion and what was the fix. So relying on explicit pings sent by clients could work better perhaps.
There are two types of heartbeats in and around oslo.messaging, which is why call_monitor was used for the long-running RPC thing. The bug you're referencing is, I believe, talking about heartbeating the api->rabbit connection, and has nothing to do with service-to-service pinging, which this thread is about.
The call_monitor stuff Ken mentioned requires the *server* side to do the heartbeating, so something like nova-compute or nova-conductor. Those things aren't running under uwsgi and don't have any problems with threading to accomplish those goals.
So, if we're talking about generic ping() to provide a robust long-running RPC call, oslo.messaging already does this (if you ask for it). Otherwise, a generic service-to-service ping() doesn't, as was mentioned, really mean anything at all about the ability to do meaningful work (other than further saturate the message bus).
Thank you for that great information, Dan, Ken. Then please disregard that mistakenly highlighted aspect; I didn't want to derail the thread with that apparently unrelated side case. I believe the original intention for the RPC ping was to have something initiated by clients, not server-side? That may be useful when running in a Kubernetes pod with liveness/readiness probes set up. While the latter may not be the best fit for RPC ping indeed, the former seems like a much better way to check liveness than just checking the TCP connection to the rabbit port?
--Dan
-- Best regards, Bogdan Dobrelya, Irc #bogdando
On Mon, Jul 27, 2020 at 1:18 PM Dan Smith <dms@danplanet.com> wrote:
Tagging with Nova and Neutron as they are mentioned and I thought some people from those teams had opinions on this.
Nova already implements ping() on the compute RPC interface, which we use to make sure compute waits to start up until conductor is available to do its bidding. So if a new obligatory RPC server method is actually added called ping(), it will break us.
Can you refresh my memory on why we dropped this before? I recall talking about it in Denver, but I can't for the life of me remember what the conclusion was. Did we intend to use something else for this that has since fallen through?
The prior conversation I recall was about helm sitting on our bus to (ab)use our ping method for health checks:
https://opendev.org/openstack/openstack-helm/commit/baf5356a4fb61590a95f64a6...
I believe that has since been reverted.
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
--Dan
While initially in favor of this feature, Dan's concern has me reconsidering this.

Now I believe that if the purpose of this feature is to check the operational health of a service _using_ oslo.messaging, then I'm against it. A naked ping to a generic service point in an application doesn't prove the operating health of that application beyond its connection to rabbit.

Connectivity monitoring between an application and rabbit is done using the keepalive connection heartbeat mechanism built into the rabbit protocol, which O.M. supports today.

-- Ken Giusti (kgiusti@gmail.com)
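(For context, the AMQP-level keepalive Ken refers to is driven by the rabbit driver options below; the values shown are the usual defaults and should be checked against the oslo.messaging release in use:)

    [oslo_messaging_rabbit]
    # Consider the connection to rabbit dead if no heartbeat has been
    # seen for this many seconds...
    heartbeat_timeout_threshold = 60
    # ...and check/send heartbeats this many times per timeout interval.
    heartbeat_rate = 2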
Ken Giusti wrote:
On Mon, Jul 27, 2020 at 1:18 PM Dan Smith <dms@danplanet.com> wrote:
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
While initially in favor of this feature Dan's concern has me reconsidering this.
Now I believe that if the purpose of this feature is to check the operational health of a service _using_ oslo.messaging, then I'm against it. A naked ping to a generic service point in an application doesn't prove the operating health of that application beyond its connection to rabbit.
While I understand the need to avoid further loading down Rabbit, I like the universality of this solution, solving a real operational issue. Obviously that creates a trade-off (further loading rabbit to get more operational insight), but nobody forces you to run those ping calls; they would be opt-in. So the proposed code in itself does not weigh down Rabbit, or make anything sit on the bus.
Connectivity monitoring between an application and rabbit is done using the keepalive connection heartbeat mechanism built into the rabbit protocol, which O.M. supports today.
I'll let Arnaud answer, but I suspect the operational need is code-external checking of the rabbit->agent chain, not code-internal checking of the agent->rabbit chain. The heartbeat mechanism is used by the agent to keep the Rabbit connection alive, ensuring it works in most cases. The check described above is there to catch the corner cases where it still doesn't. -- Thierry Carrez (ttx)
Hey all,

Thanks for your replies.

About the fact that nova already implements this: I will try again on my side, but maybe it was not yet implemented in newton (I only tried nova on the newton version). Thank you for bringing that to my attention.

About the healthcheck already done on the nova side (and also on neutron): as far as I understand, it's done using a specific rabbit queue, which can keep working while other queues are not. The purpose of adding a ping endpoint here is to be able to ping on all topics, not only those used for healthcheck reports.

Also, as mentioned by Thierry, what we need is a way to externally do pings toward neutron agents and nova computes. The patch itself is not going to add any load on rabbit; it really depends on the way the operator will use it. On my side, I built a small external oslo.messaging script which I can use to do such pings.

Cheers,

-- Arnaud Morin

On 03.08.20 - 12:15, Thierry Carrez wrote:
Ken Giusti wrote:
On Mon, Jul 27, 2020 at 1:18 PM Dan Smith <dms@danplanet.com> wrote:
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
While initially in favor of this feature Dan's concern has me reconsidering this.
Now I believe that if the purpose of this feature is to check the operational health of a service _using_ oslo.messaging, then I'm against it. A naked ping to a generic service point in an application doesn't prove the operating health of that application beyond its connection to rabbit.
While I understand the need to further avoid loading down Rabbit, I like the universality of this solution, solving a real operational issue.
Obviously that creates a trade-off (further loading rabbit to get more operational insights), but nobody forces you to run those ping calls, they would be opt-in. So the proposed code in itself does not weigh down Rabbit, or make anything sit on the bus.
Connectivity monitoring between an application and rabbit is done using the keepalive connection heartbeat mechanism built into the rabbit protocol, which O.M. supports today.
I'll let Arnaud answer, but I suspect the operational need is code-external checking of the rabbit->agent chain, not code-internal checking of the agent->rabbit chain. The heartbeat mechanism is used by the agent to keep the Rabbit connection alive, ensuring it works in most of the cases. The check described above is to catch the corner cases where it still doesn't.
-- Thierry Carrez (ttx)
Hi, it would be great if you could share your script. Fabian

Arnaud Morin <arnaud.morin@gmail.com> wrote on Thu, 6 Aug 2020, 16:11:
Hey all,
Thanks for your replies. About the fact that nova already implement this, I will try again on my side, but maybe it was not yet implemented in newton (I only tried nova on newton version). Thank you for bringing that to me.
About the healhcheck already done on nova side (and also on neutron). As far as I understand, it's done using a specific rabbit queue, which can work while others queues are not working. The purpose of adding ping endpoint here is to be able to ping in all topics, not only those used for healthcheck reports.
Also, as mentionned by Thierry, what we need is a way to externally do pings toward neutron agents and nova computes. The patch itself is not going to add any load on rabbit. It really depends on the way the operator will use it. On my side, I built a small external oslo.messaging script which I can use to do such pings.
Cheers,
-- Arnaud Morin
I have a few operational suggestions on how I think we could do this best:

1. I think exposing a healthcheck endpoint that _actually_ runs the ping and responds with a 200 OK makes a lot more sense in terms of being able to run it inside something like Kubernetes. You end up with a "who makes the ping and who responds to it" type of scenario, which can be tricky, though I'm sure we can figure that out.

2. I've found that newer releases of RabbitMQ really help with those un-usable queues after a split. I haven't had any issues at all with newer releases, so that could be something to make your life a lot easier.

3. You mentioned you're moving towards Kubernetes; we're doing the same and building an operator: https://opendev.org/vexxhost/openstack-operator -- Because the operator manages the whole thing and Kubernetes does its thing too, we started moving towards 1 (single) rabbitmq per service, which reaaaaaaally helped a lot in stabilizing things. Oslo messaging is a lot better at recovering when a single service IP is pointing towards it, because it doesn't do weird things like have threads trying to connect to other Rabbit ports. Just a thought.

4. In terms of telemetry and making sure you avoid that issue, we track the consumption rates of queues inside OpenStack. The OpenStack consumption rate should be constant and never growing; any time it grows, we instantly detect that something is fishy. However, the other issue is that when you restart any openstack service, it 'forgets' all its existing queues, and then you have a set of queues building up until they automatically expire (which happens around 30 minutes-ish), so it makes that "things are not being consumed" alarm a little noisy if you're restarting services.

Sorry for the wall of super unorganized text, all over the place here, but thought I'd chime in with my 2 cents :)

On Mon, Jul 27, 2020 at 6:04 AM Arnaud Morin <arnaud.morin@gmail.com> wrote:
Hey all,
TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC, this is useful to monitor liveness of agents.
Few weeks ago, I proposed a patch to oslo_messaging [1], which is adding a ping endpoint to RPC dispatcher. It means that every openstack service which is using oslo_messaging RPC endpoints (almosts all OpenStack services and agents - e.g. neutron server + agents, nova + computes, etc.) will then be able to answer to a specific "ping" call over RPC.
I decided to propose this patch in my company mainly for 2 reasons: 1 - we are struggling monitoring our nova compute and neutron agents in a correct way:
1.1 - sometimes our agents are disconnected from RPC, but the python process is still running. 1.2 - sometimes the agent is still connected, but the queue / binding on rabbit cluster is not working anymore (after a rabbit split for example). This one is very hard to debug, because the agent is still reporting health correctly on neutron server, but it's not able to receive messages anymore.
2 - we are trying to monitor agents running in k8s pods: when running a python agent (neutron l3-agent for example) in a k8s pod, we wanted to find a way to monitor if it is still live of not.
Adding a RPC ping endpoint could help us solve both these issues. Note that we still need an external mechanism (out of OpenStack) to do this ping. We also think it could be nice for other OpenStackers, and especially large scale ops.
Feel free to comment.
[1] https://review.opendev.org/#/c/735385/
-- Arnaud Morin
-- Mohammed Naser VEXXHOST, Inc.
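(To illustrate Mohammed's first suggestion above, a very rough sketch of a sidecar that answers HTTP 200 only when an RPC ping to the agent's topic succeeds, so a Kubernetes httpGet probe could consume it directly. Every name here - URL, topic, server, port - is a placeholder, and the ping method name is only the one proposed in [1]:)

    from http.server import BaseHTTPRequestHandler, HTTPServer

    import oslo_messaging
    from oslo_config import cfg

    RABBIT_URL = 'rabbit://user:password@rabbit-host:5672/'   # placeholder
    TOPIC, SERVER = 'l3_agent', 'net-node-1.example.com'      # placeholders

    def rpc_ping():
        transport = oslo_messaging.get_rpc_transport(cfg.CONF, url=RABBIT_URL)
        target = oslo_messaging.Target(topic=TOPIC, server=SERVER)
        client = oslo_messaging.RPCClient(transport, target, timeout=5)
        try:
            client.call({}, 'oslo_rpc_server_ping')  # name proposed in [1]
            return True
        except oslo_messaging.MessagingTimeout:
            return False

    class Health(BaseHTTPRequestHandler):
        def do_GET(self):
            # 200 only when the agent actually answered on its RPC topic.
            self.send_response(200 if rpc_ping() else 503)
            self.end_headers()

    if __name__ == '__main__':
        HTTPServer(('0.0.0.0', 8080), Health).serve_forever()

A real implementation would obviously reuse the transport between probes instead of rebuilding it on every request.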
Hi Mohammed,

1 - That's something we would also like, but it's beyond the patch I propose. I need this patch not only for kubernetes, but also for monitoring my legacy openstack agents running outside of k8s.

2 - Yes, the latest version of rabbitmq is better on that point, but we still see some weird issues (I will ask the community about it in another topic).

3 - Thanks for this operator, we'll take a look! By saying 1 rabbit per service, I understand 1 server, not 1 cluster, right? That sounds risky if you lose the server. I suppose you don't do that for the database?

4 - Nice, how do you monitor those consumptions? Using the rabbit management API?

Cheers,

-- Arnaud Morin

On 03.08.20 - 10:21, Mohammed Naser wrote:
I have a few operational suggestions on how I think we could do this best:
1. I think exposing a healthcheck endpoint that _actually_ runs the ping and responds with a 200 OK makes a lot more sense in terms of being able to run it inside something like Kubernetes, you end up with a "who makes the ping and who responds to it" type of scenario which can be tricky though I'm sure we can figure that out 2. I've found that newer releases of RabbitMQ really help with those un-usable queues after a split, I haven't had any issues at all with newer releases, so that could be something to help your life be a lot easier. 3. You mentioned you're moving towards Kubernetes, we're doing the same and building an operator: https://opendev.org/vexxhost/openstack-operator -- Because the operator manages the whole thing and Kubernetes does it's thing too, we started moving towards 1 (single) rabbitmq per service, which reaaaaaaally helped a lot in stabilizing things. Oslo messaging is a lot better at recovering when a single service IP is pointing towards it because it doesn't do weird things like have threads trying to connect to other Rabbit ports. Just a thought. 4. In terms of telemetry and making sure you avoid that issue, we track the consumption rates of queues inside OpenStack. OpenStack consumption rate should be constant and never growing, anytime it grows, we instantly detect that something is fishy. However, the other issue comes in that when you restart any openstack service, it 'forgets' all it's existing queues and then you have a set of building up queues until they automatically expire which happens around 30 minutes-ish, so it makes that alarm of "things are not being consumed" a little noisy if you're restarting services
Sorry for the wall of super unorganized text, all over the place here but thought I'd chime in with my 2 cents :)
On Thu, Aug 6, 2020 at 10:11 AM Arnaud Morin <arnaud.morin@gmail.com> wrote:
Hi Mohammed,
1 - That's something we would also like, but it's beyond the patch I propose. I need this patch not only for kubernetes, but also for monitoring my legacy openstack agents running outside of k8s.
2 - Yes, latest version of rabbitmq is better on that point, but we still see some weird issue (I will ask the community about it in another topic).
3 - Thanks for this operator, we'll take a look! By saying 1 rabbit per service, I understand 1 server, not 1 cluster, right? That sounds risky if you lose the server.
The controllers are pretty stable and if a controller dies, Kubernetes will take care of restarting the pod somewhere else and everything will reconnect and things will be happy again.
I suppose you dont do that for the database?
One database cluster per service, with 'old-school' replication because no one really does true multimaster in Galera with OpenStack anyways.
4 - Nice, how do you monitor those consumptions? Using the rabbit management API?
Prometheus RabbitMQ exporter, now migrating to the native one shipping in the new RabbitMQ releases.
Cheers,
-- Arnaud Morin
-- Mohammed Naser VEXXHOST, Inc.
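(A rough sketch of the kind of "queues must keep being consumed" check discussed in point 4, polling the RabbitMQ management HTTP API; the host, credentials and threshold are placeholders, and the exact JSON field names should be verified against the management plugin version in use:)

    import requests

    API_URL = 'http://rabbit-host:15672/api/queues'   # placeholder host
    AUTH = ('monitoring_user', 'secret')              # placeholder credentials

    def backed_up_queues(max_ready=100):
        # List queues whose ready-message count is piling up instead of
        # being consumed; field names per the management plugin JSON.
        queues = requests.get(API_URL, auth=AUTH, timeout=10).json()
        return [q['name'] for q in queues
                if q.get('messages_ready', 0) > max_ready]

    if __name__ == '__main__':
        for name in backed_up_queues():
            print('queue not being consumed fast enough:', name)

The same numbers can of course come from the Prometheus RabbitMQ exporter Mohammed mentions instead of polling the API directly.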
Hi, On Wed, 12 Aug 2020 at 16:30, Mohammed Naser <mnaser@vexxhost.com> wrote:
On Thu, Aug 6, 2020 at 10:11 AM Arnaud Morin <arnaud.morin@gmail.com> wrote: The controllers are pretty stable and if a controller dies, Kubernetes will take care of restarting the pod somewhere else and everything will reconnect and things will be happy again.
Sounds really interesting. Do you have any docs on how to use / do a PoC of this setup? Fabian
On 8/3/20 9:21 AM, Mohammed Naser wrote:
3. You mentioned you're moving towards Kubernetes, we're doing the same and building an operator: https://opendev.org/vexxhost/openstack-operator -- Because the operator manages the whole thing and Kubernetes does it's thing too, we started moving towards 1 (single) rabbitmq per service, which reaaaaaaally helped a lot in stabilizing things. Oslo messaging is a lot better at recovering when a single service IP is pointing towards it because it doesn't do weird things like have threads trying to connect to other Rabbit ports. Just a thought.
On a related note, LINE actually broke it down even further than that. There are details of their design in [0], but essentially they have downstream changes where they can specify a transport per notification topic to further separate out rabbit traffic. The spec hasn't been implemented yet upstream, but I thought I'd mention it since it seems relevant to this discussion. 0: https://specs.openstack.org/openstack/oslo-specs/specs/victoria/support-tran...
participants (10)
- Arnaud Morin
- Ben Nemec
- Bogdan Dobrelya
- Dan Smith
- Fabian Zimmermann
- Johannes Kulik
- Ken Giusti
- Mohammed Naser
- Sean Mooney
- Thierry Carrez