Restart cinder-volume with Ceph RBD

Laurent Dumont laurentfdumont at gmail.com
Mon May 17 16:10:06 UTC 2021


Glad to know it was resolved! It's a bit weird that explicitly setting the
parameter works, but good to know!

On Mon, May 17, 2021 at 2:11 AM Sebastian Luna Valero <
sebastian.luna.valero at gmail.com> wrote:

>
> Thanks, Laurent.
>
> Long story short, we have been able to bring the "cinder-volume" service
> back up.
>
> We restarted the "cinder-volume" and "cinder-scheduler" services with
> "debug=True", got back the same debug message:
>
> 2021-05-15 23:15:27.091 31 DEBUG cinder.volume.drivers.rbd
> [req-f43e30ae-2bdc-4690-9c1b-3e58081fdc9e - - - - -] connecting to
> cinder at ceph (conf=/etc/ceph/ceph.conf, timeout=-1). _do_conn
> /usr/lib/python3.6/site-packages/cinder/volume/drivers/rbd.py:431
>
> Then I looked through the docs for "timeout" configuration options:
>
>
> https://docs.openstack.org/cinder/train/configuration/block-storage/drivers/ceph-rbd-volume-driver.html#driver-options
>
> "rados_connect_timeout = -1; (Integer) Timeout value (in seconds) used
> when connecting to ceph cluster. If value < 0, no timeout is set and
> default librados value is used."
>
> I added it to the "cinder.conf" file for the "cinder-volume" service with:
> "rados_connect_timeout=15".
>
> Before this change the "cinder-volume" logs ended with this message:
>
> 2021-05-15 23:02:48.821 31 INFO cinder.volume.manager
> [req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Starting volume driver
> RBDDriver (1.2.0)
>
> After the change:
>
> 2021-05-15 23:02:48.821 31 INFO cinder.volume.manager
> [req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Starting volume driver
> RBDDriver (1.2.0)
> 2021-05-15 23:04:23.180 31 INFO cinder.volume.manager
> [req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Driver initialization
> completed successfully.
> 2021-05-15 23:04:23.190 31 INFO cinder.manager
> [req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Initiating service 12
> cleanup
> 2021-05-15 23:04:23.196 31 INFO cinder.manager
> [req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Service 12 cleanup
> completed.
> 2021-05-15 23:04:23.315 31 INFO cinder.volume.manager
> [req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Initializing RPC
> dependent components of volume driver RBDDriver (1.2.0)
> 2021-05-15 23:05:10.381 31 INFO cinder.volume.manager
> [req-6e8f9f46-ee34-4925-9fc8-dea8729d0d93 - - - - -] Driver post RPC
> initialization completed successfully.
>
> And now the service is reported as "up" in "openstack volume service list",
> and we can successfully create Ceph volumes again. Manu will do more
> validation tests today to confirm.
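>
> (For example, something along these lines; the volume name and size are just
> placeholders:)
>
> openstack volume create --size 1 smoke-test
> openstack volume show smoke-test -c status
> openstack volume delete smoke-test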
>
> So it looks like the "cinder-volume" service didn't start up properly in
> the first place and that's why the service was "down".
>
> Why did adding "rados_connect_timeout=15" to cinder.conf solve the issue? I
> honestly don't know; it was a matter of luck to try it out. If anyone knows
> the reason, we would love to hear more.
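>
> My only (unverified) guess, from a quick read of the Train driver, is that the
> option just changes how the librados client is connected, roughly like this
> simplified sketch (not the exact cinder code; the helper name is made up):
>
> import rados
>
> def connect_to_ceph(conf_file, user, timeout):
>     # Build the librados client the way the driver's _do_conn does.
>     client = rados.Rados(conffile=conf_file, rados_id=user)
>     if timeout >= 0:
>         # A non-negative rados_connect_timeout is handed to librados, so a
>         # stuck connection attempt raises instead of blocking start-up.
>         client.connect(timeout=timeout)
>     else:
>         # timeout = -1 (the default) means no client-side timeout at all.
>         client.connect()
>     return client
>
> With timeout=-1 a hanging connect() would never return and nothing more would
> be logged, which at least matches our log stopping right after "Starting
> volume driver RBDDriver". But again, this is only a guess.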
>
> Thank you very much again for your kind help!
>
> Best regards,
> Sebastian
>
> On Sat, 15 May 2021 at 19:40, Laurent Dumont <laurentfdumont at gmail.com>
> wrote:
>
>> That is a bit strange. I don't use the Ceph backend so I don't know any
>> magic tricks.
>>
>>    - I'm surprised that the Debug logging level doesn't add anything
>>    else. Are there any other lines besides the "connecting" one?
>>    - Can we narrow down the port/IP destination for the Ceph RBD traffic?
>>    (See the quick sketch below this list.)
>>    - Can we failover the cinder-volume service to another controller and
>>    check the status of the volume service?
>>    - Did the power outage impact the Ceph cluster + network gear + all
>>    the controllers?
>>    - Does the content of /etc/ceph/ceph.conf appear to be valid inside
>>    the container?
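>>
>> For the port/IP point above, a rough sketch from inside the cinder-volume
>> container (Ceph monitors normally listen on 6789 for the v1 protocol and
>> 3300 for v2; <mon-ip> is a placeholder for the addresses in ceph.conf):
>>
>> # confirm which mon endpoints the client would use
>> grep -Ei 'mon[_ ]host' /etc/ceph/ceph.conf
>> # check basic TCP reachability of a monitor
>> nc -vz <mon-ip> 6789
>> nc -vz <mon-ip> 3300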
>>
>> Looking at the code -
>> https://github.com/openstack/cinder/blob/stable/train/cinder/volume/drivers/rbd.py#L432
>>
>> It should raise an exception if there is a timeout when the connection
>> client is built.
>>
>> except self.rados.Error:
>>     msg = _("Error connecting to ceph cluster.")
>>     LOG.exception(msg)
>>     client.shutdown()
>>     raise exception.VolumeBackendAPIException(data=msg)
>>
>> On Sat, May 15, 2021 at 4:16 AM Sebastian Luna Valero <
>> sebastian.luna.valero at gmail.com> wrote:
>>
>>>
>>> Hi All,
>>>
>>> Thanks for your inputs so far. I am also trying to help Manu with this
>>> issue.
>>>
>>> The "cinder-volume" service was working properly with the existing
>>> configuration. However, after a power outage the service is no longer
>>> reported as "up".
>>>
>>> Looking at the source code, the service status is reported as "down" by
>>> "cinder-scheduler" in here:
>>>
>>>
>>> https://github.com/openstack/cinder/blob/stable/train/cinder/scheduler/host_manager.py#L618
>>>
>>> With message: "WARNING cinder.scheduler.host_manager [req-<>- default
>>> default] volume service is down. (host: rbd:volumes at ceph-rbd)"
>>>
>>> I printed out the "service" tuple
>>> https://github.com/openstack/cinder/blob/stable/train/cinder/scheduler/host_manager.py#L615
>>> and we get:
>>>
>>> "2021-05-15 09:57:24.918 7 WARNING cinder.scheduler.host_manager [<> -
>>> default default]
>>> Service(active_backend_id=None,availability_zone='nova',binary='cinder-volume',cluster=<?>,cluster_name=None,created_at=2020-06-12T07:53:42Z,deleted=False,deleted_at=None,disabled=False,disabled_reason=None,frozen=False,host='rbd:volumes at ceph-rbd
>>> ',id=12,modified_at=None,object_current_version='1.38',replication_status='disabled',report_count=8067424,rpc_current_version='3.16',topic='cinder-volume',updated_at=2021-05-12T15:37:52Z,uuid='604668e8-c2e7-46ed-a2b8-086e588079ac')"
>>>
>>> Cinder is configured with a Ceph RBD backend, as explained in
>>> https://github.com/openstack/kolla-ansible/blob/stable/train/doc/source/reference/storage/external-ceph-guide.rst#cinder
>>>
>>> That's where the "backend_host=rbd:volumes" configuration is coming from.
>>>
>>> We are using 3 controller nodes for OpenStack and 3 monitor nodes for
>>> Ceph.
>>>
>>> The Ceph cluster doesn't report any error. The "cinder-volume"
>>> containers don't report any error. Moreover, when we go inside the
>>> "cinder-volume" container we are able to list existing volumes with:
>>>
>>> rbd -p cinder.volumes --id cinder -k
>>> /etc/ceph/ceph.client.cinder.keyring ls
>>>
>>> So the connection to the Ceph cluster works.
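>>>
>>> (For what it's worth, a cluster status check with the same credentials and a
>>> short client-side timeout should also return quickly rather than hang; just a
>>> sketch, please double-check the flags on your version:)
>>>
>>> ceph -s --id cinder -k /etc/ceph/ceph.client.cinder.keyring --connect-timeout 10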
>>>
>>> Why is "cinder-scheduler" reporting the that the backend Ceph cluster is
>>> down?
>>>
>>> Many thanks,
>>> Sebastian
>>>
>>>
>>> On Thu, 13 May 2021 at 13:12, Tobias Urdin <tobias.urdin at binero.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I just saw that you are running Ceph Octopus with the Train release and
>>>> wanted to let you know that we saw issues with the os-brick version shipped
>>>> with Train not supporting the Ceph Octopus client version.
>>>>
>>>> So for our Ceph cluster running Octopus we had to keep the client
>>>> version on Nautilus until upgrading to Victoria, which included a newer
>>>> version of os-brick.
>>>>
>>>> Maybe this is unrelated to your issue but just wanted to put it out
>>>> there.
>>>>
>>>> Best regards
>>>> Tobias
>>>>
>>>> > On 13 May 2021, at 12:55, ManuParra <mparra at iaa.es> wrote:
>>>> >
>>>> > Hello Gorka, not yet. Let me update the cinder configuration, add the
>>>> > option, restart cinder, and then I’ll update the status.
>>>> > Do you recommend anything else to try in this cycle?
>>>> > Regards.
>>>> >
>>>> >> On 13 May 2021, at 09:37, Gorka Eguileor <geguileo at redhat.com>
>>>> wrote:
>>>> >>
>>>> >>> On 13/05, ManuParra wrote:
>>>> >>> Hi Gorka again. Yes, the first thing is to understand why we can't
>>>> connect to that host (Ceph is actually set up for HA), so that's the way to
>>>> proceed. I mention this because, from the very beginning of our setup, it has
>>>> always been like that, with that hostname, and there has been no problem.
>>>> >>>
>>>> >>> As for the errors, the strangest thing is that in Monasca I have
>>>> not found any error-level logs, only the warning “volume service is down. (host:
>>>> rbd:volumes at ceph-rbd)" and info messages, which makes it even stranger.
>>>> >>
>>>> >> Have you tried the configuration change I recommended?
>>>> >>
>>>> >>
>>>> >>>
>>>> >>> Regards.
>>>> >>>
>>>> >>>> On 12 May 2021, at 23:34, Gorka Eguileor <geguileo at redhat.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> On 12/05, ManuParra wrote:
>>>> >>>>> Hi Gorka, let me show the cinder config:
>>>> >>>>>
>>>> >>>>> [ceph-rbd]
>>>> >>>>> rbd_ceph_conf = /etc/ceph/ceph.conf
>>>> >>>>> rbd_user = cinder
>>>> >>>>> backend_host = rbd:volumes
>>>> >>>>> rbd_pool = cinder.volumes
>>>> >>>>> volume_backend_name = ceph-rbd
>>>> >>>>> volume_driver = cinder.volume.drivers.rbd.RBDDriver
>>>> >>>>> …
>>>> >>>>>
>>>> >>>>> So, with rbd_exclusive_cinder_pool=True the pool will be used just for
>>>> volumes? But the log is saying there is no connection to the backend_host.
>>>> >>>>
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> Your backend_host doesn't have a valid hostname; please set a proper
>>>> >>>> hostname in that configuration option.
>>>> >>>>
>>>> >>>> Then the next thing you need to have is the cinder-volume service
>>>> >>>> running correctly before making any requests.
>>>> >>>>
>>>> >>>> I would try adding rbd_exclusive_cinder_pool=true then tailing the
>>>> >>>> volume logs, and restarting the service.
>>>> >>>>
>>>> >>>> See if the logs show any ERROR level entries.
>>>> >>>>
>>>> >>>> I would also check the service-list output right after the service is
>>>> >>>> restarted; if it's up, I would check it again after 2 minutes.
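>>>> >>>>
>>>> >>>> Roughly like this (a sketch, assuming a kolla-style containerized
>>>> >>>> deployment; adapt the container/service names to your setup):
>>>> >>>>
>>>> >>>> # restart the volume service container and watch its log for ERROR lines
>>>> >>>> docker restart cinder_volume
>>>> >>>> docker logs -f --tail 100 cinder_volume
>>>> >>>>
>>>> >>>> # in another shell: right after the restart, then again ~2 minutes later
>>>> >>>> openstack volume service list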
>>>> >>>>
>>>> >>>> Cheers,
>>>> >>>> Gorka.
>>>> >>>>
>>>> >>>>
>>>> >>>>>
>>>> >>>>> Regards.
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>> On 12 May 2021, at 11:49, Gorka Eguileor <geguileo at redhat.com>
>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> On 12/05, ManuParra wrote:
>>>> >>>>>>> Thanks, I have restarted the service and I see that after a few
>>>> minutes the cinder-volume service goes down again when I check it with the
>>>> command openstack volume service list.
>>>> >>>>>>> The host/service that contains the cinder-volumes is
>>>> rbd:volumes at ceph-rbd, which is RBD in Ceph, so the problem does not come
>>>> from Cinder but rather from Ceph or from the RBD (Ceph) pools that store the
>>>> volumes. I have checked Ceph and the status of everything is correct, no
>>>> errors or warnings.
>>>> >>>>>>> The error I have is that cinder can’t connect to
>>>> rbd:volumes at ceph-rbd. Any further suggestions? Thanks in advance.
>>>> >>>>>>> Kind regards.
>>>> >>>>>>>
>>>> >>>>>>
>>>> >>>>>> Hi,
>>>> >>>>>>
>>>> >>>>>> You are most likely using an older release, have a high number of
>>>> >>>>>> cinder RBD volumes, and have not changed the configuration option
>>>> >>>>>> "rbd_exclusive_cinder_pool" from its default "false" value.
>>>> >>>>>>
>>>> >>>>>> Please add to your driver's section in cinder.conf the following:
>>>> >>>>>>
>>>> >>>>>> rbd_exclusive_cinder_pool = true
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> And restart the service.
>>>> >>>>>>
>>>> >>>>>> Cheers,
>>>> >>>>>> Gorka.
>>>> >>>>>>
>>>> >>>>>>>> On 11 May 2021, at 22:30, Eugen Block <eblock at nde.ag> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> Hi,
>>>> >>>>>>>>
>>>> >>>>>>>> so restart the volume service ;-)
>>>> >>>>>>>>
>>>> >>>>>>>> systemctl restart openstack-cinder-volume.service
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>> Zitat von ManuParra <mparra at iaa.es>:
>>>> >>>>>>>>
>>>> >>>>>>>>> Dear OpenStack community,
>>>> >>>>>>>>>
>>>> >>>>>>>>> I encountered a problem a few days ago, which is that when
>>>> >>>>>>>>> creating new volumes with:
>>>> >>>>>>>>>
>>>> >>>>>>>>> "openstack volume create --size 20 testmv"
>>>> >>>>>>>>>
>>>> >>>>>>>>> the volume creation status shows an error. If I go to the error
>>>> >>>>>>>>> log detail, it indicates:
>>>> >>>>>>>>>
>>>> >>>>>>>>> "Schedule allocate volume: Could not find any available
>>>> weighted backend".
>>>> >>>>>>>>>
>>>> >>>>>>>>> Then I go to the cinder log, and indeed it indicates:
>>>> >>>>>>>>>
>>>> >>>>>>>>> "volume service is down - host: rbd:volumes at ceph-rbd”.
>>>> >>>>>>>>>
>>>> >>>>>>>>> I check which state the services are in with:
>>>> >>>>>>>>>
>>>> >>>>>>>>> "openstack volume service list"
>>>> >>>>>>>>>
>>>> >>>>>>>>> and I see that indeed this happens:
>>>> >>>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> | cinder-volume | rbd:volumes at ceph-rbd | nova | enabled |
>>>> down | 2021-04-29T09:48:42.000000 |
>>>> >>>>>>>>>
>>>> >>>>>>>>> And it has been down since 2021-04-29!
>>>> >>>>>>>>>
>>>> >>>>>>>>> I have checked Ceph (monitors, managers, OSDs, etc.) and there
>>>> >>>>>>>>> are no problems with the Ceph backend; everything is apparently working.
>>>> >>>>>>>>>
>>>> >>>>>>>>> This happened after an uncontrolled outage. So my question is:
>>>> >>>>>>>>> how do I restart only cinder-volume? (I also have cinder-backup
>>>> >>>>>>>>> and cinder-scheduler, but they are OK.)
>>>> >>>>>>>>>
>>>> >>>>>>>>> Thank you very much in advance. Regards.
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>
>>>> >>>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>>
>>>