[all] [oslo.messaging] Interest in collaboration on a NATS driver

30 Aug 2022

Hello,

Quick reply: I agree with you on all points. Hopefully this design and collaboration discussion
can go on and reach some consensus on a path forward.

I should also point out clearly that I’m with you: design matters, and there are a lot of things about this
design that have not been discussed or even scratched yet, but it’s compelling and I’m here trying to get the ball rolling :)

Best regards
Tobias

On 30 Aug 2022, at 11:43, Radosław Piliszek <radoslaw.piliszek@gmail.com> wrote:

Hi Tobias,

Thank you for the detailed response. My query was to gather more
insight on what your views/goals are and the responses do not
disappoint.

More queries inline below.

On Tue, 30 Aug 2022 at 11:15, Tobias Urdin <tobias.urdin@binero.com> wrote:
I would like the OpenStack design to embrace more of the distributed, cloud-native approach that Ceph and Kubernetes bring, and the resiliency of Ceph (and yes, I’m a major Ceph enthusiast),
and there I’m seeing messaging and database as potential blockers to continuing on that path.

We both definitely agree that resiliency needs to be improved.

On Mon, 29 Aug 2022 at 15:47, Tobias Urdin <tobias.urdin@binero.com> wrote:

• Do retries and acknowledgements in the library (since NATS does NOT persist messages like RabbitMQ could)

What do you mean? Is NATS only a router? (I have not used this technology yet.)

It does not persist messages. If there is no backend to respond, the message will be dropped without any action, which is why I
want the RPC layer in oslo.messaging (that already does acknowledge calls in the driver) to notify the client side that the call is being processed
before the client side waits for a reply.

Ack, that makes sense. To let the client know whether there is any
consumer that accepted that message.
That said, bear in mind the consumer might accept and then die. If
NATS does not keep track of this message further, then the resilience
is handicapped.
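
To make the acknowledge-then-wait idea above concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: `publish` and `inbox` are hypothetical stand-ins for the transport, not oslo.messaging or NATS APIs, and the timeouts and retry count are arbitrary. It republishes until some consumer acknowledges, then waits for the reply with a separate timeout, since the consumer might accept and then die.

```python
import queue
import threading
import uuid


class NoConsumerError(Exception):
    """No consumer acknowledged the request within the allowed attempts."""


def call_with_ack(publish, inbox, payload,
                  ack_timeout=0.2, reply_timeout=5.0, retries=3):
    """Publish a request, wait for an ack, then wait for the reply.

    `publish(msg)` and `inbox` (a queue.Queue of dicts) are hypothetical
    stand-ins for the transport, not real oslo.messaging or NATS calls.
    """
    msg_id = uuid.uuid4().hex
    for _attempt in range(retries):
        publish({"id": msg_id, "payload": payload})
        try:
            ack = inbox.get(timeout=ack_timeout)
        except queue.Empty:
            continue  # nobody accepted the message: publish it again
        if ack.get("type") == "ack" and ack.get("id") == msg_id:
            break
    else:
        raise NoConsumerError(msg_id)
    # A consumer accepted the request. It may still die before replying,
    # so this wait has its own timeout and raises queue.Empty on expiry.
    reply = inbox.get(timeout=reply_timeout)
    return reply["result"]


def demo():
    """Simulate one dropped publish followed by a successful one."""
    inbox = queue.Queue()
    attempts = []

    def publish(msg):
        attempts.append(msg["id"])
        if len(attempts) == 1:
            return  # simulated drop: no consumer was listening

        def consume():
            inbox.put({"type": "ack", "id": msg["id"]})
            inbox.put({"type": "reply", "id": msg["id"],
                       "result": msg["payload"].upper()})

        threading.Thread(target=consume).start()

    return call_with_ack(publish, inbox, "ping"), len(attempts)
```

Because the same message id is reused on every retry, a consumer could deduplicate on it. The sketch does not address what should happen when the reply timeout expires, which is exactly the open resilience question raised above.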

• Find or maintain a NATS python library that doesn't use async like the official one does

Why is async a bad thing? For messaging it's the right thing.

This is actually just me. I would love to be able to use the official library, which is async-based; it’s just
that I don’t understand how that would be implemented.

That is, https://github.com/nats-io/nats.py instead of the one in the POC, https://github.com/Gr1N/nats-python, which has a lot of shortcomings and issues. My
idea was just to investigate whether it was even possible to implement in a feasible way.
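
One common way to call an async-only client from blocking code is to run a private asyncio event loop in a background thread and submit coroutines to it. The sketch below uses only the standard library. `SyncBridge` and `fake_request` are hypothetical names for illustration, and `fake_request` merely stands in for an async client call; it is not the nats.py API. Whether this pattern coexists with eventlet monkey-patching is not considered here.

```python
import asyncio
import threading


class SyncBridge:
    """Run coroutines on a private event loop from blocking code."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(
            target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro, timeout=None):
        # Schedule the coroutine on the background loop and block the
        # calling thread until it finishes or the timeout expires.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self):
        # Stop the loop from its own thread, then clean up.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_request(subject, payload):
    # Hypothetical stand-in for an async request-reply call.
    await asyncio.sleep(0.01)
    return f"{subject}:{payload.decode()}"
```

A blocking caller would then do something like `bridge = SyncBridge()` followed by `bridge.run(fake_request("rpc.demo", b"hi"), timeout=5)`, and call `bridge.close()` on shutdown.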

Ack, I see.

Finally, have you considered just trying out ZeroMQ?

Does not exist anymore.

I think I might have been misunderstood as ZeroMQ still exists. ;-)
You probably mean that the oslo.messaging backend is gone.
I meant that *maybe* it would be good to discuss a reimplementation of
that which considers the current OpenStack needs.
I would also emphasise that I imagine RPC and notification messaging
layers to have different needs and likely requiring different
approaches.

I mean, NATS is probably an overkill for OpenStack services since the
majority of them stay static on the hosts they control (think
nova-compute, neutron agents - and these are also the pain points that
operators want to ease).

I don’t think it is, and even if it is, why not use a better solution or a more stable approach than RabbitMQ?

This is also the whole point: I don’t want OpenStack to become or be static, I want it to be more dynamic and
cloud-native in its approach and to support viable integrations that take it there. We cannot live in the past forever; let’s envision and dream of the future as we want it! :)

Ack, you want it more dynamic and that's ok now that I understand your view.
That said, my whole point regarding this boils down to the usual
design principles that remind us that there are, more often than not,
some tradeoffs that have been made to build some tech - NATS is likely
no different: if it promises features A, B, C, D, and we need only A
and B, then *maybe* it has some constraints on the A and B we want or
we might miss that it lacks feature E or C/D add useless overhead. The
point is to have that in mind before going too deep, try to spot and
tackle such issues early on.

Finally, have you considered just trying out ZeroMQ?

ZeroMQ used to be supported in the past but then it was removed.
If I understand correctly, it only supported notifications or RPC but not both.
I don't recall which, but perhaps I'm misremembering on that point.

I believe it would be better suited for RPC than notifications, at
least in the simplest form.

As it’s advertised as scalable and performant, I would argue: why not use it for notifications as well? If anything, according to
your observations above, it’s more suited for that than RPC, even though request-reply (that we can use for RPC) is a strong first-class implementation in NATS as well.

Well, that was about ZMQ. I mostly meant that synchronous RPC (that
happens in OpenStack a lot) adapts very well to what can be achieved
with ZeroMQ without a lot of fuss.

I mean, NATS is probably an overkill for OpenStack services since the
majority of them stay static on the hosts they control (think
nova-compute, neutron agents - and these are also the pain points that
operators want to ease).

It's not any more overkill than RabbitMQ is.

True that. Probably.

I agree with that. Also, if you think about it, how many issues related to stability, performance and outages are related to RabbitMQ? It’s quite a few if you ask me.
Just the resource utilization and clustering in RabbitMQ makes me feel bad.

Here we definitely agree.
As we used to discuss this before in this community, we are not sure
if this is RabbitMQ's fault of course or if we just don't know how to
utilise it properly. ;-)
Anyhow, RMQ being in Erlang does not help as it's more like a black
box to most of us here I believe (please raise your hands if you can
debug an EVM failure).

It’s here that I mean that the cloud-native and scalable implementation would shine: you should be able to rely on it. If something dies, so what, things should just
continue to work. That’s not my experience with RabbitMQ, but it is my experience with Ceph, because in the end the design really matters.

"Design really matters" is something that I remind myself and others
almost every day. Hence why this discussion is taking place now. :D

I also don't know what you mean when you say
"majority of them stay static on the hosts they control"

NATS is intended as a cloud-native, horizontally scalable message bus,
which is exactly what OpenStack needs IMO.

NATS seems to be tweaked for "come and go" situations which is an
exception in the OpenStack world, not the rule (at least in my view).
I mean, one normally expects to have a preset number of hypervisors
and not them coming and going (which, I agree, is a nice vision; it could
be that a proper NATS driver, with more awareness in the client projects I
believe, would be an enabler for more dynamic clouds).

It could, but it also doesn’t have to be that. Why not strive for more dynamic? I don’t think anybody would argue that more dynamic is a bad thing
even if you were to have a more static approach to your cloud.

This has been discussed already above - tradeoffs. One cannot just
make up a hypervisor and need to spin up a nova-compute for it. It's a
different story for non-resource-bound services that NATS is
advertised for. You need more processing power? Sure, you spin another
worker and connect it with NATS. That scalability might be coming at a
price that we don't need to pay because OpenStack services are never
going to scale with this level of dynamism.

Finally, don't get me wrong. I love the fact that you are doing what
you are doing. I just want to make sure that it goes in the right
direction.

Cheers,
Radek
-yoctozepto