[openstack-dev] [Fuel][FFE] Disabling HA for RPC queues in RabbitMQ

Dmitry Mescheryakov dmescheryakov at mirantis.com
Wed Dec 2 14:47:36 UTC 2015


2015-12-02 13:11 GMT+03:00 Bogdan Dobrelya <bdobrelia at mirantis.com>:

> On 01.12.2015 23:34, Peter Lemenkov wrote:
> > Hello All!
> >
> > Well, side-effects (or any other effects) are quite obvious and
> > predictable - this will decrease availability of RPC queues a bit.
> > That's for sure.
>
> And consistency. Without messages and queues being synced between all of
> the rabbit_hosts, how exactly would dispatching RPC calls work when
> workers are connected to different AMQP URLs?
>

There will be no problem with consistency here. Since we will disable HA,
queues will not be mirrored across the cluster, and each queue will be
hosted on exactly one node.
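
For context on the mechanics: mirroring is controlled by an HA policy, so
the change boils down to scoping that policy so it no longer matches RPC
queues. A minimal sketch via the RabbitMQ management HTTP API (the
endpoint, credentials, and queue-name pattern here are assumptions for
illustration, not the actual Fuel patch):

import requests

# Redefine the "ha-all" policy so it skips RPC queues; the pattern is
# hypothetical - real RPC queue names depend on oslo.messaging topics.
policy = {
    "pattern": "^(?!reply_|compute|conductor).*",
    "definition": {"ha-mode": "all"},
    "apply-to": "queues",
}
resp = requests.put(
    "http://localhost:15672/api/policies/%2F/ha-all",
    json=policy,
    auth=("guest", "guest"),
)
resp.raise_for_status()
# Queues the policy no longer matches stop being mirrored: each one is
# hosted only on the node where it was declared, and the other cluster
# nodes route clients to that node.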


> Perhaps that change would only raise partition tolerance to a very high
> degree? But this should be clearly shown by load tests - under network
> partitions with mirroring versus network partitions without mirroring.
> Rally could help a lot here.


Nope, the change will not increase partition tolerance at all. What I
expect is that it will not get worse. Regarding tests, we are indeed going
to perform destructive testing to verify that there is no regression in
recovery time.


>
> >
> > However, Dmitry's guess is that the overall messaging backplane
> > stability increase (RabbitMQ won't fail too often in some cases) would
> > compensate for this change. This issue is very much real - speaking of
>
> Agree, that should be proven by (Rally) tests for the specific case I
> described in the spec [0]. Please correct me as I may understand things
> wrong, but here it is:
> - client 1 submits RPC call request R to server 1, connected to AMQP host X
> - worker A listens on the jobs topic at AMQP host X
> - worker B listens on the jobs topic at AMQP host Y
> - the job from R was dispatched to worker B
> Q: would B never receive its job message because it just cannot see
> messages at X?
> Q: a timeout failure as the result?
>
> And things may get even weirder in more complex scenarios.
>

Yes, in the described scenario B will receive the job: node Y will proxy
B's subscription to node X, so we will not experience a timeout. Also, I
have replied in the review.
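
To illustrate the proxying (a minimal sketch, assuming a cluster whose
nodes are reachable as "node-x" and "node-y" and the pika client; the
hostnames and queue name are made up for illustration):

import pika

# Declare the queue through node X: with mirroring disabled, its only
# copy (the queue master) lives on node X.
conn_x = pika.BlockingConnection(pika.ConnectionParameters(host="node-x"))
conn_x.channel().queue_declare(queue="jobs")

# Worker B connects to node Y. The cluster routes basic.consume to the
# queue master on node X, so B still receives messages; Y merely relays
# the traffic instead of holding a replica.
conn_y = pika.BlockingConnection(pika.ConnectionParameters(host="node-y"))
channel_b = conn_y.channel()

def on_job(ch, method, properties, body):
    print("worker B got:", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel_b.basic_consume(queue="jobs", on_message_callback=on_job)
channel_b.start_consuming()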


>
> [0] https://review.openstack.org/247517
>
> > me, I've seen awful cluster performance degradation when a failing
> > RabbitMQ node was killed by some watchdog application (or, even worse,
> > wasn't killed at all). One of these issues happened quite recently, and
> > I'd love to see them less frequently.
> >
> > That said, I'm uncertain about the stability impact of this change, yet
> > I see reasoning worth discussing behind it.
>
> I would support this for 8.0 only if proven by load tests within the
> scenario I described, plus standard destructive tests.


As I said in my initial email, I've run the boot_and_delete_server_with_secgroups
Rally scenario to verify my change. I think I should provide more details:

The scale team considers this test to be the worst case we have for RabbitMQ.
I ran the test on a 200-node lab, and what I saw is that when I disable HA,
the test time is cut in half. That clearly shows that there is a test where
our current messaging system is the bottleneck, and just tuning it
considerably improves the performance of OpenStack as a whole. Also, while
there was a small failure rate in HA mode (around 1-2%), in non-HA mode all
tests completed successfully.

Overall, I think the current results are already enough to consider the
change useful. What is left is to confirm that it does not make failover worse.


> >
> > 2015-12-01 20:53 GMT+01:00 Sergii Golovatiuk <sgolovatiuk at mirantis.com>:
> >> Hi,
> >>
> >> -1 for the FFE for disabling HA for RPC queues, as we do not know all
> >> side effects in HA scenarios.
> >>
> >> On Tue, Dec 1, 2015 at 7:34 PM, Dmitry Mescheryakov
> >> <dmescheryakov at mirantis.com> wrote:
> >>>
> >>> Folks,
> >>>
> >>> I would like to request a feature freeze exception for disabling HA
> >>> for RPC queues in RabbitMQ [1].
> >>>
> >>> As I already wrote in another thread [2], I've conducted tests which
> >>> clearly show the benefit we will get from that change. The change
> >>> itself is a very small patch [3]. The only thing I want to do before
> >>> proposing to merge this change is to run destructive tests against it
> >>> in order to make sure that we do not have a regression here. That
> >>> should take just several days, so if there are no other objections,
> >>> we will be able to merge the change within a week or two.
> >>>
> >>> Thanks,
> >>>
> >>> Dmitry
> >>>
> >>> [1] https://review.openstack.org/247517
> >>> [2] http://lists.openstack.org/pipermail/openstack-dev/2015-December/081006.html
> >>> [3] https://review.openstack.org/249180
> --
> Best regards,
> Bogdan Dobrelya,
> Irc #bogdando
>