Thank you kindly for your responses. 

Here’s how I stumbled upon the fix:
- Noticed the output of “openstack volume service list” shows a time variation of 2 minutes between the good and bad nodes.
- Checked the timestamp on the bad node, and indeed it was off by two minutes (behind).
- NTP was NOT installed on any of the nodes. Installing and configuring the NTP package fixed the issue instantly.
- Seems like the reboot “somehow” caused the clock of compute2 to drift by 2 minutes; since the scheduler judges liveness from the heartbeat timestamps, a two-minute skew is enough to make a perfectly healthy node look down (rough sketch of that check after this list).
- What’s scary is that no log or debug message hinted that this could be the issue.
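
For anyone who hits the same symptom later, here is a rough sketch of how I understand the scheduler’s “down” decision (plain Python, not Cinder’s actual code): the scheduler compares each service’s last heartbeat (the “Updated At” column) with its own clock and the service_down_time option, which I believe defaults to 60 seconds. The timestamps below are taken from output #1.

# Rough sketch of the liveness check as I understand it - not Cinder's actual code.
from datetime import datetime, timedelta

# cinder.conf [DEFAULT] service_down_time (60s by default, as far as I know)
SERVICE_DOWN_TIME = timedelta(seconds=60)

def service_is_up(last_heartbeat, now):
    # A service counts as "up" only if its heartbeat is fresher than the window.
    return abs(now - last_heartbeat) <= SERVICE_DOWN_TIME

controller_now = datetime(2023, 12, 22, 3, 44, 22)      # controller's clock
compute2_heartbeat = datetime(2023, 12, 22, 3, 42, 4)   # compute2's "Updated At" from output #1
print(service_is_up(compute2_heartbeat, controller_now))  # False - heartbeat is ~138s "old"

With compute2’s clock two minutes behind, every heartbeat it writes is already about 120 seconds stale from the controller’s point of view, so it can never pass a 60-second freshness check, even though it keeps reporting happily.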

Regards,
Abhijit Anand 


On 22-Dec-2023, at 9:42 AM, Abhijit Anand <contact@abhijitanand.com> wrote:

Hi, 

Problem Statement:
- “openstack volume service list” shows “volume service” down on 1 node. 
- Cinder scheduler logs show this:
2023-12-22 09:00:19.779 7 WARNING cinder.scheduler.host_manager [None req-40e35d10-9441-43c2-9647-6630cba21e1c - - - - - -] volume service is down. (host: compute2@rbd-1)
- Why is the host_manager of Cinder-Scheduler showing “compute2” as down?

Versions:
- OpenStack version 2023.1, installed using kolla-ansible on Ubuntu 22.04
- Ceph version 18.2.0

More info:
- I’ve got a 3-node OpenStack deployment, with one node doubling up as both a compute and the controller node.
- Backend storage for Cinder volumes and backups is a 3-node Ceph cluster deployed using cephadm.
- A power outage resulted in all nodes shutting down.
- Once back up, the “openstack volume service list” command shows that the cinder-volume and cinder-backup services are down on “compute2”. Refer to “Output #1” below.
- I enabled “debug=True” on both the “good” (compute1) and the “bad” (compute2) nodes. I don’t see any errors or issues in the cinder-volume and cinder-backup logs of either. Refer to outputs #2 and #3 below.
- I don’t see any errors related to compute2 in ceph or cephadm logs.
- No config has been changed. The only “change” was the forced shutdown.
- I’ve restarted the Docker containers (cinder-volume, cinder-backup, cinder-scheduler on the controller). I’ve even rebooted the controller and the “bad” node.
- All containers come back as healthy.
- Please see output #4; it shows what appear to be “All Good!” updates from the “bad” node. So why is the scheduler considering this node down? (A quick way to compare each service’s “Updated At” with the local clock is sketched right after this list.)
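
A rough way to quantify how stale each service’s heartbeat looks from the controller (Python sketch; the “-f json” output format and the field names are my assumptions based on the table headers in output #1, so please double-check):

import json
import subprocess
from datetime import datetime, timezone

# Compare each service's "Updated At" heartbeat with this machine's clock.
# Assumes "openstack volume service list -f json" returns the same columns
# as the table output ("Binary", "Host", "Updated At") - please verify.
raw = subprocess.check_output(
    ["openstack", "volume", "service", "list", "-f", "json"], text=True)
now = datetime.now(timezone.utc)

for svc in json.loads(raw):
    updated = datetime.fromisoformat(svc["Updated At"]).replace(tzinfo=timezone.utc)
    age = (now - updated).total_seconds()
    print(f"{svc['Binary']:<18} {svc['Host']:<18} heartbeat {age:6.0f}s old")

A service whose heartbeat age keeps sitting well above the scheduler’s window (service_down_time, 60 seconds by default as far as I know) will be reported as down even if its own logs look perfectly healthy.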



Outputs:
#1: openstack volume service list
abanand@Abhijits-MacBook-Air ~ % openstack volume service list
+------------------+------------------+------+---------+-------+----------------------------+
| Binary           | Host             | Zone | Status  | State | Updated At                 |
+------------------+------------------+------+---------+-------+----------------------------+
| cinder-scheduler | controller       | nova | enabled | up    | 2023-12-22T03:44:22.000000 |
| cinder-volume    | compute1@rbd-1   | nova | enabled | up    | 2023-12-22T03:44:21.000000 |
| cinder-volume    | compute2@rbd-1   | nova | enabled | down  | 2023-12-22T03:42:04.000000 |
| cinder-volume    | controller@rbd-1 | nova | enabled | up    | 2023-12-22T03:44:20.000000 |
| cinder-backup    | compute2         | nova | enabled | down  | 2023-12-22T03:42:02.000000 |
| cinder-backup    | compute1         | nova | enabled | up    | 2023-12-22T03:44:20.000000 |
| cinder-backup    | controller       | nova | enabled | up    | 2023-12-22T03:44:19.000000 |
+------------------+------------------+------+---------+-------+----------------------------+

#2: compute2 - “Bad” node logs after enabling debug=true
2023-12-22 09:25:28.041 23 DEBUG oslo_service.periodic_task [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] Running periodic task VolumeManager.publish_service_capabilities run_periodic_tasks /var/lib/kolla/venv/lib/python3.10/site-packages/oslo_service/periodic_task.py:210
2023-12-22 09:25:28.043 23 DEBUG cinder.volume.drivers.rbd [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] connecting to cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=15). _do_conn /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py:597
2023-12-22 09:25:28.074 23 DEBUG cinder.volume.drivers.rbd [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] connecting to cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=15). _do_conn /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py:597
2023-12-22 09:25:28.103 23 DEBUG cinder.manager [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] Notifying Schedulers of capabilities ... _publish_service_capabilities /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/manager.py:202

#3: compute1 - “Good” node logs after enabling debug=true 
2023-12-22 09:25:28.041 23 DEBUG oslo_service.periodic_task [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] Running periodic task VolumeManager.publish_service_capabilities run_periodic_tasks /var/lib/kolla/venv/lib/python3.10/site-packages/oslo_service/periodic_task.py:210
2023-12-22 09:25:28.043 23 DEBUG cinder.volume.drivers.rbd [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] connecting to cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=15). _do_conn /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py:597
2023-12-22 09:25:28.074 23 DEBUG cinder.volume.drivers.rbd [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] connecting to cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=15). _do_conn /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/volume/drivers/rbd.py:597
2023-12-22 09:25:28.103 23 DEBUG cinder.manager [None req-8388d960-ae3a-46d3-838a-24c112929683 - - - - - -] Notifying Schedulers of capabilities ... _publish_service_capabilities /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/manager.py:202

#4: Output from Cinder Scheduler logs
2023-12-22 09:35:18.861 7 DEBUG cinder.scheduler.host_manager [None req-09c91b17-3861-4c95-b85e-e4d39c17923c - - - - - -] Received backup service update from compute2: {'backend_state': True, 'driver_name': 'cinder.backup.drivers.ceph.CephBackupDriver', 'availability_zone': 'nova'} update_service_capabilities /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/scheduler/host_manager.py:597
2023-12-22 09:35:29.659 7 DEBUG cinder.scheduler.host_manager [None req-ba4bd917-52a8-4b02-92a3-5f989c5d70ae - - - - - -] Received volume service update from  compute2@rbd-1: {'vendor_name': 'Open Source', 'driver_version': '1.3.0', 'storage_protocol': 'ceph', 'total_capacity_gb': 14649.52, 'free_capacity_gb': 14216.63, 'reserved_percentage': 0, 'multiattach': True, 'thin_provisioning_support': True, 'max_over_subscription_ratio': '20.0', 'location_info': 'ceph:/etc/ceph/ceph.conf:d524a89e-60eb-11ee-a0c3-7db08f4d9091:cinder:volumes', 'backend_state': 'up', 'qos_support': True, 'volume_backend_name': 'rbd-1', 'replication_enabled': False, 'allocated_capacity_gb': 1031, 'filter_function': None, 'goodness_function': None} update_service_capabilities /var/lib/kolla/venv/lib/python3.10/site-packages/cinder/scheduler/host_manager.py:628