Open Stack

Thu Jan 5 18:26:47 UTC 2017

Mike,

I did a bunch of research and experiments on this last fall. We are running
Rabbit 3.5.6 on our main cluster and 3.6.5 on our Trove cluster which has
significantly less load (and criticality). We were going to upgrade to
3.6.5 everywhere but in the end decided not to, mainly because there was
little perceived benefit at the time. Our main issue is unchecked memory
growth at random times. I ended up making several config changes to the
stats collector, and we also restart it after every deploy. Together that
has solved it (so far).

I'd say these were my main reasons for not going to 3.6 for our control
nodes:

   - In 3.6.x they re-wrote the stats processor to make it parallel. In
   every 3.6 release since then, Pivotal has fixed bugs in this code. Then
   finally they threw up their hands and said "we're going to make a complete
   rewrite in 3.7/4.x" (you need to look through the issues on GitHub to find
   this discussion).
   - Out of the box with the same configs, 3.6.5 used more memory than
   3.5.6. Since memory was our main issue, I consider this a negative.
   - Another issue is the ancient version of Erlang we have with Ubuntu
   Trusty (which we are working on), which made upgrades more
   complex or impossible, depending on the version.

Given those negatives, the main one being that I didn't think there would
be too many more fixes to the parallel statsdb collector in 3.6, we decided
to stick with 3.5.6. In the end the devil we know is better than the devil
we don't and I had no evidence that 3.6.5 would be an improvement.

I did decide to leave Trove on 3.6.5 because this would give us some
bake-in time: if 3.5.x became untenable, we'd at least have had 3.6.5 up
and running in production, with some data on it.

If statsdb is not a concern for you, I think this changes the math and
maybe you should use 3.6.x. I would, however, recommend at least going to
3.5.6; it's been better than 3.3/3.4 was.

No matter what you do, definitely read all the release notes. There are some
upgrades which require an entire cluster shutdown. The upgrade to 3.5.6 did
not require this IIRC.

Here's the hiera for our rabbit settings which I assume you can translate:

rabbitmq::cluster_partition_handling: 'autoheal'
rabbitmq::config_variables:
  'vm_memory_high_watermark': '0.6'
  'collect_statistics_interval': 30000
rabbitmq::config_management_variables:
  'rates_mode': 'none'
rabbitmq::file_limit: '65535'
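
For anyone not using Puppet, here is a rough sketch of how the settings
above might look in a classic Erlang-terms rabbitmq.config. The placement
is an assumption: it supposes that config_variables land in the rabbit
application section and config_management_variables in the
rabbitmq_management section, so check it against what your own tooling
generates.

```erlang
%% Sketch of /etc/rabbitmq/rabbitmq.config (classic Erlang-terms format).
%% Section placement is assumed from the hiera keys above, not copied
%% from a generated file.
[
  {rabbit, [
    %% let the cluster pick a winning partition and restart the losers
    {cluster_partition_handling, autoheal},
    %% block publishers once the node uses 60% of system RAM
    {vm_memory_high_watermark, 0.6},
    %% emit stats every 30 s (value is in milliseconds)
    {collect_statistics_interval, 30000}
  ]},
  {rabbitmq_management, [
    %% do not compute message rates in the management stats database
    {rates_mode, none}
  ]}
].
```

The file_limit setting is not part of rabbitmq.config. It is the open file
descriptor limit for the RabbitMQ process, so it has to be set through
whatever mechanism your init system or packaging uses for ulimits.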

Finally, if you do upgrade to 3.6.x please report back here with your
results at scale!

On Thu, Jan 5, 2017 at 8:49 AM, Mike Dorman <mdorman at godaddy.com> wrote:

> We are looking at upgrading to the latest RabbitMQ in an effort to ease
> some cluster failover issues we’ve been seeing.  (Currently on 3.4.0)
>
>
>
> Anyone been running 3.6.x?  And what has been your experience?  Any
> gotchas to watch out for?
>
>
>
> Thanks,
>
> Mike
>
>
>
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>

[Openstack-operators] RabbitMQ 3.6.x experience?