Restart cinder-volume with Ceph RBD

Sebastian Luna Valero sebastian.luna.valero at gmail.com
Mon May 17 06:04:09 UTC 2021


Thanks, Laurent.

Long story short, we have been able to bring the "cinder-volume" service
back up.

We restarted the "cinder-volume" and "cinder-scheduler" services with
"debug=True" and got back the same debug message as before:

2021-05-15 23:15:27.091 31 DEBUG cinder.volume.drivers.rbd
[req-f43e30ae-2bdc-4690-9c1b-3e58081fdc9e - - - - -] connecting to
cinder@ceph (conf=/etc/ceph/ceph.conf, timeout=-1). _do_conn
/usr/lib/python3.6/site-packages/cinder/volume/drivers/rbd.py:431
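
For reference, the debug output comes from setting the following in the
[DEFAULT] section of "cinder.conf" for both services and restarting them:

[DEFAULT]
# enable DEBUG-level logging for the service
debug = True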

Then I had a look at the docs for "timeout" configuration options:

https://docs.openstack.org/cinder/train/configuration/block-storage/drivers/ceph-rbd-volume-driver.html#driver-options

"rados_connect_timeout = -1; (Integer) Timeout value (in seconds) used when
connecting to ceph cluster. If value < 0, no timeout is set and default
librados value is used."

I added it to the "cinder.conf" file for the "cinder-volume" service with:
"rados_connect_timeout=15".

Before this change the "cinder-volume" logs ended with this message:

2021-05-15 23:02:48.821 31 INFO cinder.volume.manager
[req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Starting volume driver
RBDDriver (1.2.0)

After the change:

2021-05-15 23:02:48.821 31 INFO cinder.volume.manager
[req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Starting volume driver
RBDDriver (1.2.0)
2021-05-15 23:04:23.180 31 INFO cinder.volume.manager
[req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Driver initialization
completed successfully.
2021-05-15 23:04:23.190 31 INFO cinder.manager
[req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Initiating service 12
cleanup
2021-05-15 23:04:23.196 31 INFO cinder.manager
[req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Service 12 cleanup
completed.
2021-05-15 23:04:23.315 31 INFO cinder.volume.manager
[req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Initializing RPC
dependent components of volume driver RBDDriver (1.2.0)
2021-05-15 23:05:10.381 31 INFO cinder.volume.manager
[req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Driver post RPC
initialization completed successfully.

And now the service is reported as "up" in "openstack volume service list"
and we can successfully create Ceph volumes again. Manu will do more
validation tests today to confirm.
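
For anyone following along, the quick verification looks like this
(illustrative commands; "testmv" is just a throwaway volume name):

openstack volume service list
openstack volume create --size 20 testmv
openstack volume show testmv -c status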

So it looks like the "cinder-volume" service didn't start up properly in
the first place, and that's why the service was reported as "down".

Why did adding "rados_connect_timeout=15" to cinder.conf solve the issue? I
honestly don't know; it was a matter of luck to try it out. If anyone knows
the reason, we would love to know more.

Thank you very much again for your kind help!

Best regards,
Sebastian

On Sat, 15 May 2021 at 19:40, Laurent Dumont <laurentfdumont at gmail.com>
wrote:

> That is a bit strange. I don't use the Ceph backend so I don't know any
> magic tricks.
>
>    - I'm surprised that the Debug logging level doesn't add anything
>    else. Are there any other lines besides the "connecting" one?
>    - Can we narrow down the port/IP destination for the Ceph RBD traffic?
>    (a quick reachability check is sketched after this list)
>    - Can we failover the cinder-volume service to another controller and
>    check the status of the volume service?
>    - Did the power outage impact the Ceph cluster + network gear + all
>    the controllers?
>    - Does the content of /etc/ceph/ceph.conf appear to be valid inside
>    the container?
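>
> For the port/IP check, something like this from inside the cinder-volume
> container would show whether the monitors are reachable (3300 and 6789
> are the standard Ceph monitor ports; replace MON_IP with a monitor
> address from ceph.conf):
>
> nc -zv MON_IP 3300
> nc -zv MON_IP 6789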
>
> Looking at the code -
> https://github.com/openstack/cinder/blob/stable/train/cinder/volume/drivers/rbd.py#L432
>
> It should raise an exception if there is a timeout when the connection
> client is built.
>
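> # Note: this handler only fires if librados raises an error; with
> # timeout=-1 (no rados_connect_timeout set) a stalled connection
> # attempt can block indefinitely and never reach it.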
> except self.rados.Error:
>     msg = _("Error connecting to ceph cluster.")
>     LOG.exception(msg)
>     client.shutdown()
>     raise exception.VolumeBackendAPIException(data=msg)
>
> On Sat, May 15, 2021 at 4:16 AM Sebastian Luna Valero <
> sebastian.luna.valero at gmail.com> wrote:
>
>>
>> Hi All,
>>
>> Thanks for your inputs so far. I am also trying to help Manu with this
>> issue.
>>
>> The "cinder-volume" service was working properly with the existing
>> configuration. However, after a power outage the service is no longer
>> reported as "up".
>>
>> Looking at the source code, the service status is reported as "down" by
>> "cinder-scheduler" here:
>>
>>
>> https://github.com/openstack/cinder/blob/stable/train/cinder/scheduler/host_manager.py#L618
>>
>> With message: "WARNING cinder.scheduler.host_manager [req-<> - default
>> default] volume service is down. (host: rbd:volumes@ceph-rbd)"
>>
>> I printed out the "service" tuple
>> https://github.com/openstack/cinder/blob/stable/train/cinder/scheduler/host_manager.py#L615
>> and we get:
>>
>> "2021-05-15 09:57:24.918 7 WARNING cinder.scheduler.host_manager [<> -
>> default default]
>> Service(active_backend_id=None,availability_zone='nova',binary='cinder-volume',cluster=<?>,cluster_name=None,created_at=2020-06-12T07:53:42Z,deleted=False,deleted_at=None,disabled=False,disabled_reason=None,frozen=False,host='rbd:volumes at ceph-rbd
>> ',id=12,modified_at=None,object_current_version='1.38',replication_status='disabled',report_count=8067424,rpc_current_version='3.16',topic='cinder-volume',updated_at=2021-05-12T15:37:52Z,uuid='604668e8-c2e7-46ed-a2b8-086e588079ac')"
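>>
>> For context, the "down" decision is essentially a heartbeat check: the
>> scheduler considers a service up only if its updated_at heartbeat is
>> newer than the service_down_time option (60 seconds by default). A
>> rough sketch of the logic, not the exact code:
>>
>> from datetime import datetime, timedelta
>>
>> def service_is_up(updated_at, service_down_time=60):
>>     # service_down_time mirrors the [DEFAULT] option (default 60s)
>>     return datetime.utcnow() - updated_at <= timedelta(seconds=service_down_time)
>>
>> Note that updated_at in the dump above is 2021-05-12, days before this
>> warning, so the scheduler reports the service as down.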
>>
>> Cinder is configured with a Ceph RBD backend, as explained in
>> https://github.com/openstack/kolla-ansible/blob/stable/train/doc/source/reference/storage/external-ceph-guide.rst#cinder
>>
>> That's where the "backend_host=rbd:volumes" configuration is coming from.
>>
>> We are using 3 controller nodes for OpenStack and 3 monitor nodes for
>> Ceph.
>>
>> The Ceph cluster doesn't report any error. The "cinder-volume" containers
>> don't report any error. Moreover, when we go inside the "cinder-volume"
>> container we are able to list existing volumes with:
>>
>> rbd -p cinder.volumes --id cinder -k /etc/ceph/ceph.client.cinder.keyring ls
>>
>> So the connection to the Ceph cluster works.
>>
>> Why is "cinder-scheduler" reporting that the backend Ceph cluster is
>> down?
>>
>> Many thanks,
>> Sebastian
>>
>>
>> On Thu, 13 May 2021 at 13:12, Tobias Urdin <tobias.urdin at binero.com>
>> wrote:
>>
>>> Hello,
>>>
>>> I just saw that you are running Ceph Octopus with the Train release and
>>> wanted to let you know that we saw issues with the os-brick version shipped
>>> with Train not supporting the Ceph Octopus client version.
>>>
>>> So for our Ceph cluster running Octopus we had to keep the client
>>> version on Nautilus until upgrading to Victoria, which included a newer
>>> version of os-brick.
>>>
>>> Maybe this is unrelated to your issue but just wanted to put it out
>>> there.
>>>
>>> Best regards
>>> Tobias
>>>
>>> > On 13 May 2021, at 12:55, ManuParra <mparra at iaa.es> wrote:
>>> >
>>> > Hello Gorka, not yet; let me update the cinder configuration, add the
>>> option, restart cinder, and I'll update the status.
>>> > Do you recommend other things to try for this cycle?
>>> > Regards.
>>> >
>>> >> On 13 May 2021, at 09:37, Gorka Eguileor <geguileo at redhat.com> wrote:
>>> >>
>>> >>> On 13/05, ManuParra wrote:
>>> >>> Hi Gorka again, yes, the first thing is to know why we can't
>>> connect to that host (Ceph is actually set up for HA), so that's the
>>> way to do it. I mention this because it has been like that, with that
>>> hostname, since the beginning of our setup, and there has been no
>>> problem before.
>>> >>>
>>> >>> As for the errors, the strangest thing is that in Monasca I have not
>>> found any error log, only a warning on “volume service is down. (host:
>>> rbd:volumes@ceph-rbd)" and info messages, which is even stranger.
>>> >>
>>> >> Have you tried the configuration change I recommended?
>>> >>
>>> >>
>>> >>>
>>> >>> Regards.
>>> >>>
>>> >>>> On 12 May 2021, at 23:34, Gorka Eguileor <geguileo at redhat.com>
>>> wrote:
>>> >>>>
>>> >>>> On 12/05, ManuParra wrote:
>>> >>>>> Hi Gorka, let me show the cinder config:
>>> >>>>>
>>> >>>>> [ceph-rbd]
>>> >>>>> rbd_ceph_conf = /etc/ceph/ceph.conf
>>> >>>>> rbd_user = cinder
>>> >>>>> backend_host = rbd:volumes
>>> >>>>> rbd_pool = cinder.volumes
>>> >>>>> volume_backend_name = ceph-rbd
>>> >>>>> volume_driver = cinder.volume.drivers.rbd.RBDDriver
>>> >>>>> …
>>> >>>>>
>>> >>>>> So, with rbd_exclusive_cinder_pool=True the pool will be used just
>>> for volumes? But the log says there is no connection to the backend_host.
>>> >>>>
>>> >>>> Hi,
>>> >>>>
>>> >>>> Your backend_host doesn't have a valid hostname, please set a proper
>>> >>>> hostname in that configuration option.
>>> >>>>
>>> >>>> Then the next thing you need to have is the cinder-volume service
>>> >>>> running correctly before making any requests.
>>> >>>>
>>> >>>> I would try adding rbd_exclusive_cinder_pool=true then tailing the
>>> >>>> volume logs, and restarting the service.
>>> >>>>
>>> >>>> See if the logs show any ERROR level entries.
>>> >>>>
>>> >>>> I would also check the service-list output right after the service
>>> >>>> is restarted; if it's up, then I would check it again after 2 minutes.
>>> >>>>
>>> >>>> Cheers,
>>> >>>> Gorka.
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>> Regards.
>>> >>>>>
>>> >>>>>
>>> >>>>>> On 12 May 2021, at 11:49, Gorka Eguileor <geguileo at redhat.com>
>>> wrote:
>>> >>>>>>
>>> >>>>>> On 12/05, ManuParra wrote:
>>> >>>>>>> Thanks, I have restarted the service and I see that after a few
>>> minutes the cinder-volume service goes down again when I check it with
>>> the command openstack volume service list.
>>> >>>>>>> The host/service that contains the cinder volumes is
>>> rbd:volumes@ceph-rbd, which is RBD in Ceph, so the problem does not come
>>> from Cinder, but rather from Ceph or from the RBD (Ceph) pools that store
>>> the volumes. I have checked Ceph and the status of everything is correct,
>>> no errors or warnings.
>>> >>>>>>> The error I have is that cinder can’t connect to
>>> rbd:volumes@ceph-rbd. Any further suggestions? Thanks in advance.
>>> >>>>>>> Kind regards.
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> You are most likely using an older release, have a high number of
>>> >>>>>> cinder RBD volumes, and have not changed the configuration option
>>> >>>>>> "rbd_exclusive_cinder_pool" from its default "false" value.
>>> >>>>>>
>>> >>>>>> Please add to your driver's section in cinder.conf the following:
>>> >>>>>>
>>> >>>>>> rbd_exclusive_cinder_pool = true
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> And restart the service.
>>> >>>>>>
>>> >>>>>> Cheers,
>>> >>>>>> Gorka.
>>> >>>>>>
>>> >>>>>>>> On 11 May 2021, at 22:30, Eugen Block <eblock at nde.ag> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Hi,
>>> >>>>>>>>
>>> >>>>>>>> so restart the volume service ;-)
>>> >>>>>>>>
>>> >>>>>>>> systemctl restart openstack-cinder-volume.service
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> Quoting ManuParra <mparra at iaa.es>:
>>> >>>>>>>>
>>> >>>>>>>>> Dear OpenStack community,
>>> >>>>>>>>>
>>> >>>>>>>>> I encountered a problem a few days ago, which is that when
>>> creating new volumes with:
>>> >>>>>>>>>
>>> >>>>>>>>> "openstack volume create --size 20 testmv"
>>> >>>>>>>>>
>>> >>>>>>>>> the volume creation status shows an error. If I go to the
>>> error log detail, it indicates:
>>> >>>>>>>>>
>>> >>>>>>>>> "Schedule allocate volume: Could not find any available
>>> weighted backend".
>>> >>>>>>>>>
>>> >>>>>>>>> Then I go to the cinder log and indeed it indicates:
>>> >>>>>>>>>
>>> >>>>>>>>> "volume service is down - host: rbd:volumes at ceph-rbd”.
>>> >>>>>>>>>
>>> >>>>>>>>> I check with:
>>> >>>>>>>>>
>>> >>>>>>>>> "openstack volume service list”  in which state are the
>>> services and I see that indeed this happens:
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>> | cinder-volume | rbd:volumes@ceph-rbd | nova | enabled |
>>> down | 2021-04-29T09:48:42.000000 |
>>> >>>>>>>>>
>>> >>>>>>>>> And it has been down since 2021-04-29!
>>> >>>>>>>>>
>>> >>>>>>>>> I have checked Ceph (monitors, managers, OSDs, etc.) and there
>>> are no problems with the Ceph backend; everything is apparently working.
>>> >>>>>>>>>
>>> >>>>>>>>> This happened after an uncontrolled outage. So my question is
>>> how to restart only cinder-volume (I also have cinder-backup and
>>> cinder-scheduler, but they are OK).
>>> >>>>>>>>>
>>> >>>>>>>>> Thank you very much in advance. Regards.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>
>>> >>>>
>>> >>>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>