Alan, Kerem,
Thank you kindly. Indeed, the issue was clocks being “not synchronised”. I appreciate your time in looking into the issue.
Regards, Abhijit Anand
On 23-Dec-2023, at 10:41 AM, Abhijit Anand <contact@abhijitanand.com> wrote:
Thank you kindly for your responses.
Here’s how I stumbled upon the fix:
- Noticed that the output of “openstack volume service list” shows a two-minute gap between the “Updated At” timestamps of the good and bad nodes.
- Checked the clock on the bad node, and indeed it was off by two minutes (behind).
- NTP was NOT installed on any of the nodes. Installing and configuring the NTP package fixed the issue instantly.
- It seems the reboot “somehow” caused compute2’s clock to drift by two minutes.
- What’s scary is that no log or debug message hinted that this could be the issue.
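For anyone hitting something similar: the drift is visible directly in the “Updated At” column of “openstack volume service list”, without logging into the nodes. A minimal sketch that quantifies it (the helper name is mine, not part of any OpenStack tooling; timestamps are taken from Output #1 in my original mail):

```python
# Hypothetical helper: measure how far one service's last heartbeat lags
# another's, using the "Updated At" values that
# `openstack volume service list` prints.
from datetime import datetime

def heartbeat_skew_seconds(good_updated_at: str, bad_updated_at: str) -> float:
    """Return the gap in seconds between two 'Updated At' timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S.%f"
    good = datetime.strptime(good_updated_at, fmt)
    bad = datetime.strptime(bad_updated_at, fmt)
    return (good - bad).total_seconds()

# compute1 vs compute2 values from Output #1:
skew = heartbeat_skew_seconds("2023-12-22T03:44:21.000000",
                              "2023-12-22T03:42:04.000000")
print(skew)  # 137.0 -- roughly the two-minute drift seen on compute2
```

Anything much larger than the service heartbeat interval is a red flag worth checking with NTP/chrony.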
Regards, Abhijit Anand
On 22-Dec-2023, at 9:42 AM, Abhijit Anand <contact@abhijitanand.com> wrote:
Hi,
Problem Statement:
- “openstack volume service list” shows the volume service down on 1 node.
- The cinder-scheduler logs show this:
  2023-12-22 09:00:19.779 7 WARNING cinder.scheduler.host_manager [None req-40e35d10-9441-43c2-9647-6630cba21e1c - - - - - -] volume service is down. (host: compute2@rbd-1)
- Why is the host_manager of cinder-scheduler showing “compute2” as down?
Versions:
- OpenStack 2023.1, installed using kolla-ansible on Ubuntu 22.04
- Ceph 18.2.0
More info:
- I’ve got a 3-node OpenStack deployment, with one node doubling as both compute and controller.
- Backend storage for Cinder volumes and backups is a 3-node Ceph cluster deployed using cephadm.
- A power outage resulted in all nodes shutting down.
- Once back up, “openstack volume service list” shows that the cinder-volume and cinder-backup services are down on “compute2”. Refer to Output #1 below.
- I enabled “debug=True” on both the “good” (compute1) and the “bad” (compute2) nodes. I don’t see any errors or issues in the cinder-volume or cinder-backup logs of either. Refer to Outputs #2 and #3 below.
- I don’t see any errors related to compute2 in the Ceph or cephadm logs.
- No config has been changed. The only “change” was the forced shutdown.
- I’ve restarted the Docker containers (cinder-volume, cinder-backup, and cinder-scheduler on the controller). I’ve even rebooted the controller and the “bad” node. All containers come back as healthy.
- Please see Output #4: it shows what appear to be “All Good!” updates from the “bad” node. Then why is the scheduler considering this node as down?
Outputs:
#1: openstack volume service list
abanand@Abhijits-MacBook-Air ~ % openstack volume service list
+------------------+------------------+------+---------+-------+----------------------------+
| Binary           | Host             | Zone | Status  | State | Updated At                 |
+------------------+------------------+------+---------+-------+----------------------------+
| cinder-scheduler | controller       | nova | enabled | up    | 2023-12-22T03:44:22.000000 |
| cinder-volume    | compute1@rbd-1   | nova | enabled | up    | 2023-12-22T03:44:21.000000 |
| cinder-volume    | compute2@rbd-1   | nova | enabled | down  | 2023-12-22T03:42:04.000000 |
| cinder-volume    | controller@rbd-1 | nova | enabled | up    | 2023-12-22T03:44:20.000000 |
| cinder-backup    | compute2         | nova | enabled | down  | 2023-12-22T03:42:02.000000 |
| cinder-backup    | compute1         | nova | enabled | up    | 2023-12-22T03:44:20.000000 |
| cinder-backup    | controller       | nova | enabled | up    | 2023-12-22T03:44:19.000000 |
+------------------+------------------+------+---------+-------+----------------------------+
#2: compute2 - “Bad” node logs after enabling debug=true
2023-12-22 09:25:28.041 23 DEBUG oslo_service.periodic_task [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] Running periodic task VolumeManager.publish_service_capabilities run_periodic_tasks /var/lib/kolla/venv/lib/python3.10/site-packages/oslo_service/periodic_task.py:210
2023-12-22 09:25:28.043 23 DEBUG cinder.volume.drivers.rbd [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] connecting to cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=15). _do_conn /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py:597
2023-12-22 09:25:28.074 23 DEBUG cinder.volume.drivers.rbd [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] connecting to cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=15). _do_conn /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py:597
2023-12-22 09:25:28.103 23 DEBUG cinder.manager [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] Notifying Schedulers of capabilities ... _publish_service_capabilities /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/manager.py:202
#3: compute1 - “Good” node logs after enabling debug=true
2023-12-22 09:25:28.041 23 DEBUG oslo_service.periodic_task [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] Running periodic task VolumeManager.publish_service_capabilities run_periodic_tasks /var/lib/kolla/venv/lib/python3.10/site-packages/oslo_service/periodic_task.py:210
2023-12-22 09:25:28.043 23 DEBUG cinder.volume.drivers.rbd [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] connecting to cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=15). _do_conn /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py:597
2023-12-22 09:25:28.074 23 DEBUG cinder.volume.drivers.rbd [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] connecting to cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=15). _do_conn /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py:597
2023-12-22 09:25:28.103 23 DEBUG cinder.manager [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] Notifying Schedulers of capabilities ... _publish_service_capabilities /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/manager.py:202
#4: Output from cinder-scheduler logs
2023-12-22 09:35:18.861 7 DEBUG cinder.scheduler.host_manager [None req-09c91b17-3861-4c95-b85e-e4d39c17923c - - - - - -] Received backup service update from compute2: {'backend_state': True, 'driver_name': 'cinder.backup.drivers.ceph.CephBackupDriver', 'availability_zone': 'nova'} update_service_capabilities /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/scheduler/host_manager.py:597
2023-12-22 09:35:29.659 7 DEBUG cinder.scheduler.host_manager [None req-ba4bd917-52a8-4b02-92a3-5f989c5d70ae - - - - - -] Received volume service update from compute2@rbd-1: {'vendor_name': 'Open Source', 'driver_version': '1.3.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 14649.52, 'free_capacity_gb': 14216.63, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True, 'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:d524a89e-60eb-11ee-a0c3-7db08f4d9091:cinder:volumes', 'backend_state': 'up', 'qos_support': True, 'volume_backend_name': 'rbd-1', 'replication_enabled': False, 'allocated_capacity_gb': 1031, 'filter_function': None, 'goodness_function': None} update_service_capabilities /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/scheduler/host_manager.py:628
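One note on the apparent contradiction above: the capability updates in Output #4 arrive over RPC, but up/down status is decided from each service's periodic heartbeat timestamp in the database. If I understand correctly, the scheduler treats a service as up only while its last heartbeat is newer than the service_down_time option (60 seconds by default, I believe). The decision is roughly equivalent to this simplified sketch (not Cinder's actual code):

```python
from datetime import datetime, timedelta

# Assumption: service_down_time left at its default of 60 seconds.
SERVICE_DOWN_TIME = 60

def service_is_up(updated_at: datetime, now: datetime,
                  down_time: int = SERVICE_DOWN_TIME) -> bool:
    """A service counts as 'up' only if its last heartbeat is recent enough.
    If the node writing the heartbeat has a clock two minutes behind, its
    updated_at looks stale the moment it is written."""
    return (now - updated_at) <= timedelta(seconds=down_time)

# Timestamps from Output #1: compute2's clock is ~2 minutes behind.
now = datetime(2023, 12, 22, 3, 44, 22)
compute1_beat = datetime(2023, 12, 22, 3, 44, 21)  # clock in sync
compute2_beat = datetime(2023, 12, 22, 3, 42, 4)   # written with a slow clock
print(service_is_up(compute1_beat, now))  # True
print(service_is_up(compute2_beat, now))  # False -> reported as "down"
```

That would explain why a node whose heartbeats are written with a slow clock can look perfectly healthy in its own logs yet be marked down by the scheduler.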