[kolla] All services stats DOWN after re-launch whole cluster.
Hi everyone,
We have a Kolla OpenStack site without internet access, consisting of 3 HCI (Controller+Compute) nodes and 3 Storage (Ceph OSD) nodes. We shut it down a few days ago for the CNY holidays.
Today we brought the whole cluster back up. First we hit an issue where the MariaDB containers kept restarting, which we fixed with the mariadb_recovery command. After that we checked the status of each service and found that all services shown under Admin > System > System Information are DOWN. Strangely, there are no MariaDB, AMQP connection, or other errors in the logs of the down services.
We tried rebooting each server, but the situation stayed the same. Then we found that the RabbitMQ log was not updating; the last entry was from the date we shut down. Logging in to the RabbitMQ container and running "rabbitmqctl status" shows connection refused, and opening its web management UI at <VIP>:15672 in a browser just gives a "503 Service unavailable" message. Also, nothing is listening on port 5672.
I searched for this issue on the internet but found little information. One suggested solution is to delete some files in the mnesia folder; another is to remove the rabbitmq container and its volume and then re-deploy. I'm not sure about either. Does anyone know how to solve it?
Many thanks,
Eddie.
(In reply to Eddie Yen <missile0407@gmail.com>, Tue, Feb 4, 2020, 7:20 AM)
Any chance you have a NIC that didn't come up? What is in the log of the container itself (i.e. docker logs rabbitmq)?
-Erik
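The two checks raised so far, whether anything is listening on the AMQP port and what the container itself logged, can be sketched in shell. This is only a minimal sketch, not a Kolla tool: `port_listening` is a hypothetical helper that scans `ss -ltn` output for a port, and the container name `rabbitmq` is taken from the messages above.

```shell
# Hypothetical helper: succeed if the given TCP port shows up as a local
# listening socket in `ss -ltn` output read from stdin.
port_listening() {
    awk -v p="$1" '
        # Column 4 is "Local Address:Port"; the port is the last ":"-separated piece.
        { n = split($4, parts, ":"); if (parts[n] == p) found = 1 }
        END { exit !found }
    '
}

# Usage on a controller node (not run here):
#   ss -ltn | port_listening 5672 || echo "nothing is listening on 5672"
#   docker logs --tail 50 rabbitmq
```

Reading the socket table from stdin keeps the helper independent of the host it runs on, so the same function can be pointed at saved output from each node.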
(In reply to Erik McCormick <emccormick@cirrusseven.com>, Tue, Feb 4, 2020, 9:19 PM)
Hi Erik,
I've already checked the NIC links and found no issue. Pinging between the nodes on each interface is OK. I haven't checked the docker logs for rabbitmq yet because it seemed to be working normally. I'll check them later.
-Eddie
(In reply to Eddie Yen <missile0407@gmail.com>, Tue, Feb 4, 2020, 9:45 PM)
Today I tried to recover RabbitMQ, but nothing helped, not even deleting all RabbitMQ data and configs and then re-deploying (without destroy).
I also found that /etc/hosts on every node had been flushed: the hostname resolution entries created by kolla-ansible are gone. I checked and found that MAAS had enabled the manage_etc_hosts setting in /etc/cloud/cloud.cfg.d/, which causes /etc/hosts to be reset on every boot.
I'm not sure whether this was the root cause, but unfortunately I had already reset all the RabbitMQ data, so all I can do is destroy and deploy again. Fortunately this cluster was brand new, so no VMs had been launched and no complex setup had been done yet.
I think the issue may be solved, although I still need time to investigate. Based on this experience, be aware that this can happen when MAAS is used to deploy the OS.
-Eddie
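The /etc/hosts reset described above can be checked for before a planned shutdown. The following is a rough sketch under stated assumptions: both function names are hypothetical, the default directory is the one named in the message, and the node names passed to `missing_hosts` would be whatever hostnames kolla-ansible wrote for the site.

```shell
# Hypothetical helper: list cloud-init config files that set manage_etc_hosts.
# Defaults to the directory mentioned in the thread.
find_manage_etc_hosts() {
    dir="${1:-/etc/cloud/cloud.cfg.d}"
    grep -rlE '^[[:space:]]*manage_etc_hosts:' "$dir" 2>/dev/null
}

# Hypothetical helper: print each given hostname that has no entry
# in the given hosts file (comment lines are ignored).
missing_hosts() {
    hosts_file="$1"
    shift
    for name in "$@"; do
        awk -v n="$name" '
            $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$hosts_file" || echo "$name"
    done
}

# Usage on a node (not run here; node names are placeholders):
#   find_manage_etc_hosts
#   missing_hosts /etc/hosts <node1> <node2> <node3>
```

Any file reported by the first function is a candidate for the reset behaviour; any name printed by the second is one the RabbitMQ nodes can no longer resolve through /etc/hosts.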
(In reply to Eddie Yen <missile0407@gmail.com>, Wednesday, February 5, 2020 3:33 AM)
Hi Eddie,
This is the process that I use to reset RMQ when it fails. RMQ messages are ephemeral; losing your old RMQ messages doesn’t ruin the cluster.
On master:
service rabbitmq-server stop
ps auxw|grep rabbit
(kill any rabbit processes)
rm -rf /var/lib/rabbitmq/mnesia/*
service rabbitmq-server start
rabbitmqctl add_user admin <RMQ pwd from transport_url in nova.conf>
rabbitmqctl set_user_tags admin administrator
rabbitmqctl set_permissions -p / admin ".*" ".*" ".*"
rabbitmqctl add_user openstack <RMQ pwd from transport_url in nova.conf>
rabbitmqctl set_permissions -p / openstack ".*" ".*" ".*"
rabbitmqctl set_policy ha-all "" '{"ha-mode":"all"}'
rabbitmqctl list_policies
On slaves:
rabbitmqctl stop_app
If RMQ fails to reset on a slave, or fails to start after resetting, then:
service rabbitmq-server stop
ps auxw|grep rabbit
(kill any rabbit processes)
rm -rf /var/lib/rabbitmq/mnesia/*
service rabbitmq-server start
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@<master>
rabbitmqctl start_app
(In reply to Albert Braden <Albert.Braden@synopsys.com>, Thu, Feb 6, 2020, 1:41 AM)
Hi Albert,
Thanks for your process. I'll note these steps and think about how to do this on Kolla if a similar issue happens in the future.
-Eddie
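Albert's steps assume rabbitmq-server installed directly on the host, whereas on Kolla the broker runs in a container. One possible starting point, offered as a sketch rather than a tested Kolla procedure, is to route the rabbitmqctl calls through `docker exec`. The names `rmq`, `rejoin_cluster`, and the `DOCKER` override are hypothetical; the container name `rabbitmq` comes from earlier in the thread. The sequence is a condensed form of the stop_app / reset / join_cluster / start_app steps and does not cover the `service` or `rm -rf` steps.

```shell
# Set DOCKER=echo to print the commands instead of running them (dry run).
DOCKER="${DOCKER:-docker}"

# Hypothetical wrapper: run rabbitmqctl inside the "rabbitmq" container.
rmq() {
    "$DOCKER" exec rabbitmq rabbitmqctl "$@"
}

# Re-join a reset node to the cluster; stops at the first failing step.
# $1 is assumed to be the host part of rabbit@<master>.
rejoin_cluster() {
    master="$1"
    rmq stop_app &&
    rmq reset &&
    rmq join_cluster "rabbit@${master}" &&
    rmq start_app
}

# Usage (not run here; <master> is a placeholder):
#   DOCKER=echo; rejoin_cluster <master>   # dry run, prints the four commands
```

Running it first with `DOCKER=echo` shows exactly which commands would be issued before anything touches the broker.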
(In reply to Eddie Yen <missile0407@gmail.com>, 5 Feb 2020, 14:33)
Hi Eddie,
This seems like an issue [1] that has been fixed previously. Could you please let me know which version you are using?
-osmanlicilegi
[1] https://bugs.launchpad.net/kolla-ansible/+bug/1837699
(In reply to Dincer Celik <hello@dincercelik.com>, Thu, Feb 6, 2020, 3:13 PM)
Hi Dincer,
I'm using Rocky, and it seems this fix was not merged to stable/rocky. It also matches what you wrote about the hosts table being flushed in a MAAS deployment.
-Eddie
(In reply to Eddie Yen <missile0407@gmail.com>, Thu, Feb 6, 2020, 3:57 PM)
(Clicked the "Send" button too fast...)
Thanks to Dincer for the information. It looks like the issue was already resolved, but the fix was not merged to the branch we're using. I'll cherry-pick it to stable/rocky later.
-Eddie
participants (4)
- Albert Braden
- Dincer Celik
- Eddie Yen
- Erik McCormick