Hello,

Quick reply: I agree with you on all points.

Hopefully this design and collaboration discussion can go on and reach some consensus on a path forward.

I should also be clear and point out that I’m with you: design matters, and there are a lot of things in this design that have not been discussed or even scratched yet, but it’s compelling and I’m here trying to get the ball rolling :)

Best regards
Tobias

On 30 Aug 2022, at 11:43, Radosław Piliszek <radoslaw.piliszek@gmail.com> wrote:

Hi Tobias,

Thank you for the detailed response. My query was to gather more insight on what your views/goals are and the responses do not disappoint. More queries inline below.

On Tue, 30 Aug 2022 at 11:15, Tobias Urdin <tobias.urdin@binero.com> wrote:

I would like OpenStack design to more embrace the distributed, cloud-native approach that Ceph and Kubernetes bring, and the resiliency of Ceph (and yes, I’m a major Ceph enthusiast), and there I’m seeing messaging and database as potential blockers to continuing on that path.

We both definitely agree that resiliency needs to be improved.

On Mon, 29 Aug 2022 at 15:47, Tobias Urdin <tobias.urdin@binero.com> wrote:

• Do retries and acknowledgements in the library (since NATS does NOT persist messages like RabbitMQ could)

What do you mean? Is NATS only a router? (I have not used this technology yet.)

It does not persist messages; if there is no backend to respond, the message will be dropped without any action. That is why I want the RPC layer in oslo.messaging (which already acknowledges calls in the driver) to notify the client side that the message is being processed before the client side waits for the reply.

Ack, that makes sense. To let the client know whether there is any consumer that accepted that message. That said, bear in mind the consumer might accept and then die.
If NATS does not keep track of this message further, then the resilience is handicapped.

• Find or maintain a NATS Python library that doesn't use async like the official one does

Why is async a bad thing? For messaging it's the right thing.

This is actually just me. I would love to be able to use the official, async-based library, https://github.com/nats-io/nats.py, instead of the one in the POC, https://github.com/Gr1N/nats-python, which has a lot of shortcomings and issues; I just don’t understand how that would be implemented. My idea was just to investigate whether it was even possible to implement in a feasible way.

Ack, I see.

Finally, have you considered just trying out ZeroMQ?

Does not exist anymore.

I think I might have been misunderstood, as ZeroMQ still exists. ;-) You probably mean the oslo.messaging backend, which is gone. I meant that *maybe* it would be good to discuss a reimplementation of that which considers the current OpenStack needs. I would also emphasise that I imagine the RPC and notification messaging layers to have different needs, likely requiring different approaches.

I mean, NATS is probably an overkill for OpenStack services since the majority of them stay static on the hosts they control (think nova-compute, neutron agents - and these are also the pain points that operators want to ease).

I don’t think it is, or even if it is, why not use a better solution or a more stable approach than RabbitMQ? This is also the whole point: I don’t want OpenStack to become or be static, I want it to be more dynamic and cloud-native in its approach and to support viable integrations that take it there. We cannot live in the past forever; let’s envision and dream of the future as we want it! :)

Ack, you want it more dynamic and that's ok now that I understand your view.
That said, my whole point regarding this boils down to the usual design principles that remind us that there are, more often than not, some tradeoffs that have been made to build some tech. NATS is likely no different: if it promises features A, B, C, D, and we need only A and B, then *maybe* it has some constraints on the A and B we want, or we might miss that it lacks feature E, or C/D add useless overhead. The point is to have that in mind before going too deep and to try to spot and tackle such issues early on.

Finally, have you considered just trying out ZeroMQ?

ZeroMQ used to be supported in the past but then it was removed. If I understand correctly, it only supported notifications or RPC but not both; I don't recall which, but perhaps I'm misremembering on that point.

I believe it would be better suited for RPC than notifications, at least in the simplest form.

As it’s advertised as scalable and performant, I would argue: why not use it for notifications as well? If anything, according to your observations above, it’s more suited for that than RPC, even though request-reply (which we can use for RPC) is a strong first-class implementation in NATS as well.

Well, that was about ZMQ. I mostly meant that synchronous RPC (which happens in OpenStack a lot) adapts very well to what can be achieved with ZeroMQ without a lot of fuss.

I mean, NATS is probably an overkill for OpenStack services since the majority of them stay static on the hosts they control (think nova-compute, neutron agents - and these are also the pain points that operators want to ease).

It's not any more overkill than RabbitMQ is.

True that. Probably.

I agree with that. Also, if you think about it, how many issues related to stability, performance and outages are related to RabbitMQ? It’s quite a few if you ask me. Just the resource utilization and clustering in RabbitMQ make me feel bad.

Here we definitely agree.
As we used to discuss this before in this community, we are not sure, of course, whether this is RabbitMQ's fault or whether we just don't know how to utilise it properly. ;-) Anyhow, RMQ being in Erlang does not help, as it's more like a black box to most of us here, I believe (please raise your hands if you can debug an EVM failure).

It’s here that I mean the cloud-native and scalable implementation would shine: you should be able to rely on it. If something dies, so what? Things should just continue to work. That’s not my experience with RabbitMQ, but it is my experience with Ceph, because in the end the design really matters.

"Design really matters" is something that I remind myself and others almost every day. Hence why this discussion is taking place now. :D

I also don't know what you mean when you say "majority of them stay static on the hosts they control". NATS is intended as a cloud-native, horizontally scalable message bus, which is exactly what OpenStack needs IMO.

NATS seems to be tweaked for "come and go" situations, which are an exception in the OpenStack world, not the rule (at least in my view). I mean, one normally expects to have a preset number of hypervisors and not them coming and going (which, I agree, is a nice vision; a proper NATS driver, with more awareness in the client projects I believe, could be an enabler for more dynamic clouds).

It could, but it also doesn’t have to be that. Why not strive for more dynamic? I don’t think anybody would argue that more dynamic is a bad thing, even if you were to have a more static approach to your cloud.

This has been discussed already above - tradeoffs. One cannot just make up a hypervisor and need to spin up a nova-compute for it. It's a different story for non-resource-bound services that NATS is advertised for. You need more processing power? Sure, you spin another worker and connect it with NATS.
That scalability might be coming at a price that we don't need to pay, because OpenStack services are never going to scale with this level of dynamism.

Finally, don't get me wrong. I love the fact that you are doing what you are doing. I just want to make sure that it goes in the right direction.

Cheers,
Radek
-yoctozepto
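
For illustration only, here is a rough sketch of the "acknowledge before waiting for the reply" idea discussed earlier in this thread, including the caveat that a consumer might accept a message and then die. It does not use NATS or oslo.messaging at all: `LossyBus`, `Delivery`, `call`, `NoAck` and `NoReply` are hypothetical names made up for this example, and the in-memory bus only imitates the "no subscriber means the message is dropped" behaviour described above. Timeouts are modelled as `None` results rather than real waits.

```python
class NoAck(Exception):
    """No consumer acknowledged the request after all attempts."""


class NoReply(Exception):
    """A consumer acknowledged the request but never sent a reply."""


class Delivery:
    """Hypothetical result of an accepted message; reply=None means
    the consumer accepted the request and then died."""

    def __init__(self, reply=None):
        self.reply = reply


class LossyBus:
    """Toy stand-in for a non-persistent bus (an assumption, not NATS):
    a message published with no subscriber is silently dropped."""

    def __init__(self):
        self._handlers = {}
        self.published = 0  # counts every publish attempt

    def subscribe(self, subject, handler):
        self._handlers[subject] = handler

    def publish(self, subject, payload):
        self.published += 1
        handler = self._handlers.get(subject)
        if handler is None:
            return None  # dropped: nobody was listening
        return handler(payload)


def call(bus, subject, payload, max_attempts=3):
    """Send a request, retrying only while nobody has acknowledged it."""
    for _ in range(max_attempts):
        delivery = bus.publish(subject, payload)
        if delivery is None:
            # No acknowledgement: the message was dropped, so resending
            # cannot cause the work to be done twice.
            continue
        if delivery.reply is None:
            # Acknowledged but no reply: the consumer may have done part
            # of the work, so leave the decision to the caller.
            raise NoReply(subject)
        return delivery.reply
    raise NoAck(subject)


if __name__ == "__main__":
    bus = LossyBus()
    bus.subscribe("compute.host1", lambda payload: Delivery(reply={"echo": payload}))
    print(call(bus, "compute.host1", {"op": "ping"}))
```

The design choice worth noting is that the sketch retries only in the unacknowledged case; once a consumer has accepted the message, a blind resend could repeat work, which is the resilience gap raised above.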