[kolla] All services stats DOWN after re-launch whole cluster.

Eddie Yen missile0407 at gmail.com
Thu Feb 6 00:22:06 UTC 2020


Hi Albert, thanks for sharing your process.

I'll note these steps down and think about how to apply them to Kolla if a
similar issue happens in the future.

-Eddie

On Thu, Feb 6, 2020 at 1:41 AM, Albert Braden <Albert.Braden at synopsys.com> wrote:

> Hi Eddie,
>
>
>
> This is the process that I use to reset RMQ when it fails. RMQ messages
> are ephemeral; losing your old RMQ messages doesn’t ruin the cluster.
>
>
>
> On master:
>
> service rabbitmq-server stop
>
> ps auxw|grep rabbit
>
> (kill any rabbit processes)
>
> rm -rf /var/lib/rabbitmq/mnesia/*
>
> service rabbitmq-server start
>
> rabbitmqctl add_user admin <RMQ pwd from transport_url in nova.conf>
>
> rabbitmqctl set_user_tags admin administrator
>
> rabbitmqctl set_permissions -p / admin ".*" ".*" ".*"
>
> rabbitmqctl add_user openstack <RMQ pwd from transport_url in nova.conf>
>
> rabbitmqctl set_permissions -p / openstack ".*" ".*" ".*"
>
> rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}'
>
> rabbitmqctl list_policies
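
The master-node steps above could be collected into a small script. This is
only a sketch, not Albert's own tooling: the `run` helper, the `reset_master`
function, and the `RMQ_PASS` and `DRY_RUN` variables are hypothetical names,
and it assumes a package-installed (non-containerized) RabbitMQ managed through
`service`, as in the commands above. By default it only prints the commands.
The manual "kill any rabbit processes" step is left as a printed reminder
rather than automated.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints each command instead of running it.
# Note that the printed commands include the password.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper; echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reset_master: wipe RabbitMQ state on the master, then recreate users and policy.
# RMQ_PASS (hypothetical) must hold the password from transport_url in nova.conf.
reset_master() {
    : "${RMQ_PASS:?set RMQ_PASS to the password from transport_url in nova.conf}"
    run service rabbitmq-server stop
    echo "# manual step: ps auxw | grep rabbit, then kill any leftover rabbit processes"
    run sh -c 'rm -rf /var/lib/rabbitmq/mnesia/*'
    run service rabbitmq-server start
    run rabbitmqctl add_user admin "$RMQ_PASS"
    run rabbitmqctl set_user_tags admin administrator
    run rabbitmqctl set_permissions -p / admin ".*" ".*" ".*"
    run rabbitmqctl add_user openstack "$RMQ_PASS"
    run rabbitmqctl set_permissions -p / openstack ".*" ".*" ".*"
    run rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}'
    run rabbitmqctl list_policies
}
```

Calling `reset_master` with `RMQ_PASS` set prints the sequence for review;
running it with `DRY_RUN=0` executes it and deletes all queued messages.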
>
>
>
> on slaves:
>
> rabbitmqctl stop_app
>
> If RMQ fails to reset on a slave, or fails to start after resetting, then:
>
> service rabbitmq-server stop
>
> ps auxw|grep rabbit
>
> (kill any rabbit processes)
>
> rm -rf /var/lib/rabbitmq/mnesia/*
>
> service rabbitmq-server start
>
> rabbitmqctl stop_app
>
> rabbitmqctl reset
>
> rabbitmqctl start_app
>
> rabbitmqctl stop_app
>
> rabbitmqctl join_cluster rabbit@<master>
>
> rabbitmqctl start_app
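
The reset-and-join sequence for a slave could be wrapped the same way. Again
this is a hedged sketch: `run` and `rejoin_cluster` are hypothetical names, the
master hostname is passed in by the caller, and the final `cluster_status` call
is an addition for verification that is not part of the steps above. It prints
the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# run: hypothetical helper; echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# rejoin_cluster MASTER: reset this node and join it to rabbit@MASTER,
# keeping the same command order as the steps above.
rejoin_cluster() {
    master="${1:?usage: rejoin_cluster <master-hostname>}"
    run rabbitmqctl stop_app
    run rabbitmqctl reset
    run rabbitmqctl start_app
    run rabbitmqctl stop_app
    run rabbitmqctl join_cluster "rabbit@$master"
    run rabbitmqctl start_app
    # Not in the original steps: show cluster membership afterwards.
    run rabbitmqctl cluster_status
}
```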
>
>
>
> *From:* Eddie Yen <missile0407 at gmail.com>
> *Sent:* Wednesday, February 5, 2020 3:33 AM
> *To:* openstack-discuss <openstack-discuss at lists.openstack.org>
> *Subject:* Re: [kolla] All services stats DOWN after re-launch whole
> cluster.
>
>
>
> Today I tried to recover RabbitMQ, but nothing helped, not even deleting all
> RabbitMQ data and configs and then re-deploying (without destroy).
>
>
>
> I also found that /etc/hosts on every node had been flushed: the hostname
> resolution entries created by kolla-ansible were gone. I checked and found
> that MAAS had enabled the manage_etc_hosts option in /etc/cloud/cloud.cfg.d/,
> which causes /etc/hosts to be reset on every boot.
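
To check a node for this setting before rebooting, something like the
following could help. It is a minimal sketch: `find_manage_etc_hosts` is a
hypothetical helper name, and it only lists the lines that set the cloud-init
`manage_etc_hosts` key under a directory (default `/etc/cloud`). Judging
whether the value found is a problem is left to the reader.

```shell
#!/bin/sh
# find_manage_etc_hosts [DIR]: print file:line:text for every line that sets
# the cloud-init manage_etc_hosts key under DIR (default /etc/cloud).
# Exits non-zero when no file sets the key.
find_manage_etc_hosts() {
    dir="${1:-/etc/cloud}"
    grep -rnE '^[[:space:]]*manage_etc_hosts[[:space:]]*:' "$dir"
}

# Example: find_manage_etc_hosts || echo "manage_etc_hosts is not set"
```

If the key is set to anything other than `false`, cloud-init may rewrite
/etc/hosts, which would explain the missing kolla-ansible entries.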
>
>
>
> I'm not sure whether this was the root cause, but unfortunately I had already
> reset all the RabbitMQ data, so all I can do is destroy and deploy again.
> Fortunately this cluster was just getting started, so no VMs had been
> launched and no complex setup had been done yet.
>
>
>
> I think the issue may be solved, although I still need some time to
> investigate. Based on this experience, be aware that this can happen when
> using MAAS to deploy the OS.
>
>
>
> -Eddie
>
>
>
> On Tue, Feb 4, 2020 at 9:45 PM, Eddie Yen <missile0407 at gmail.com> wrote:
>
> Hi Erik,
>
>
>
> I already checked the NIC links and found no issues. Pinging between the
> nodes on each interface works.
>
> I haven't checked the docker logs for rabbitmq yet because it was working
> normally. I'll check them later.
>
>
>
> -Eddie
>
>
>
> On Tue, Feb 4, 2020 at 9:19 PM, Erik McCormick <emccormick at cirrusseven.com> wrote:
>
> On Tue, Feb 4, 2020, 7:20 AM Eddie Yen <missile0407 at gmail.com> wrote:
>
> Hi everyone,
>
>
>
> We have a Kolla OpenStack site without internet access, consisting of 3 HCI
> (Controller+Compute) nodes and 3 Storage (Ceph OSD) nodes. We shut it down a
> few days ago for the CNY holidays.
>
>
>
> Today we brought the whole cluster back up. First we hit an issue where the
> MariaDB containers kept restarting, which we fixed with the mariadb_recovery
> command.
>
> After that we checked the status of each service and found that all services
> shown at Admin > System > System Information were DOWN. Strangely, no
> MariaDB, AMQP connection, or other errors appeared in the logs of the downed
> services.
>
>
>
> We tried rebooting each server, but the situation stayed the same. Then we
> found that the RabbitMQ log was not updating; the last entry was from the
> date we shut down. Running "rabbitmqctl status" inside the RabbitMQ container
> returned connection refused, and accessing the web management UI at
> <VIP>:15672 in a browser just gave a "503 Service unavailable" message.
> Nothing was listening on port 5672 either.
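
The port check described above can be scripted. A minimal sketch, assuming the
`ss` tool from iproute2 is available on the host; `has_listener` is a
hypothetical helper that reads `ss -ltn` output on stdin and succeeds if any
local address ends in the given port.

```shell
#!/bin/sh
# has_listener PORT: read `ss -ltn` output on stdin; exit 0 if some local
# address (4th column) ends in :PORT, exit 1 otherwise.
has_listener() {
    port="$1"
    awk -v p="$port" '
        { n = split($4, a, ":"); if (a[n] == p) found = 1 }
        END { exit !found }
    '
}

# Example, run on a controller node:
#   ss -ltn | has_listener 5672  || echo "nothing is listening on 5672 (AMQP)"
#   ss -ltn | has_listener 15672 || echo "nothing is listening on 15672 (management UI)"
```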
>
>
>
>
>
> Any chance you have a NIC that didn't come up? What is in the log of the
> container itself (i.e. docker logs rabbitmq)?
>
>
>
>
>
> I searched for this issue online but found little information. One suggested
> solution is to delete some files in the mnesia folder; another is to remove
> the rabbitmq container and its volume and then re-deploy. I'm not sure about
> either. Does anyone know how to solve this?
>
>
>
>
>
> Many thanks,
>
> Eddie.
>
>
>
> -Erik
>
>