[Openstack-operators] [nova][cinder] Is there interest in an admin-api to refresh volume connection info?

Arne Wiebalck Arne.Wiebalck at cern.ch
Wed Sep 13 09:24:02 UTC 2017


Matt, all,

I’m reviving this thread to check whether the suggestion to address potentially stale connection
data with an admin command (or a scheduled task) has made it into the planning for one of the
upcoming releases.

Thanks!
 Arne


On 16 Jun 2017, at 09:37, Saverio Proto <zioproto at gmail.com> wrote:

Hello Matt,

It is true that we are refreshing something that rarely changes. But
if you deliver a cloud service for several years, at some point you
will have to make these parameter changes.

One thing that should not change only rarely is the secret the Ceph
clients use to talk to the Ceph cluster. Good security practice
suggests periodic secret rotation, but today this is not really feasible.

I know part of the problem is that you cannot change this in libvirt
while the VMs are running. Maybe it is time for a discussion with the
libvirt developers to make our voice heard about the features we need?

The goal would be to change the Ceph/RBD secret that a VM uses to
access a volume on the fly, while the VM is running. I think this is
very important.

thank you

Saverio


2017-06-09 6:15 GMT+02:00 Matt Riedemann <mriedemos at gmail.com>:
On 6/8/2017 1:39 PM, melanie witt wrote:

On Thu, 8 Jun 2017 08:58:20 -0500, Matt Riedemann wrote:

Nova stores the output of the Cinder os-initialize_connection API in
the Nova block_device_mappings table, and uses that connection info later for
making volume connections.

This data can get out of whack or need to be refreshed, like if your ceph
server IP changes, or you need to recycle some secret uuid for your ceph
cluster.
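
For illustration, the connection_info Nova stores for an RBD volume looks
roughly like the dict below. The exact fields depend on the Cinder driver and
all values here are placeholders, so treat this as a sketch rather than an
exact record:

    # Illustrative shape of RBD connection_info as returned by Cinder's
    # os-initialize_connection and persisted in block_device_mappings.
    # All values are placeholders.
    connection_info = {
        'driver_volume_type': 'rbd',
        'data': {
            'name': 'volumes/volume-11111111-2222-3333-4444-555555555555',
            'hosts': ['192.0.2.10', '192.0.2.11', '192.0.2.12'],  # Ceph monitor IPs
            'ports': ['6789', '6789', '6789'],
            'auth_enabled': True,
            'auth_username': 'cinder',
            'secret_type': 'ceph',
            'secret_uuid': 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee',  # libvirt secret
        },
    }

If the monitor IPs or the secret change on the Ceph side, this copy in the BDM
record is what goes stale.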

I think the only ways to do this on the nova side today are via volume
detach/re-attach, reboot, migrations, etc - all of which, except live
migration, are disruptive to the running guest.


I believe the only way to work around this currently is by doing a 'nova
shelve' followed by a 'nova unshelve'. That will end up querying the
connection_info from Cinder and updating the block device mapping record for
the instance. Maybe detach/re-attach would work too but I can't remember
trying it.
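
For what it's worth, the shelve/unshelve workaround described above would look
roughly like the following with python-novaclient. This is a minimal sketch,
assuming admin credentials; the auth URL, credentials and instance UUID are
placeholders:

    # Minimal sketch of the shelve/unshelve workaround; placeholders throughout.
    from keystoneauth1 import loading, session
    from novaclient import client as nova_client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://keystone.example.com:5000/v3',  # placeholder
        username='admin', password='secret', project_name='admin',
        user_domain_name='Default', project_domain_name='Default')
    sess = session.Session(auth=auth)
    nova = nova_client.Client('2.1', session=sess)

    server = nova.servers.get('INSTANCE_UUID')  # placeholder UUID
    nova.servers.shelve(server)
    # Wait until the instance reaches SHELVED_OFFLOADED, then unshelve;
    # unshelving re-attaches the volumes and refreshes connection_info.
    nova.servers.unshelve(server)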


Shelve has its own fun set of problems, like the fact that it doesn't terminate
the connection to the volume backend on shelve. Maybe that's not a problem
for Ceph, I don't know. You do end up on another host though potentially,
and it's a full delete and spawn of the guest on that other host. Definitely
disruptive.


I've kicked around the idea of adding some sort of admin API interface
for refreshing the BDM.connection_info on-demand if needed by an operator.
Does anyone see value in this? Are operators doing stuff like this already,
but maybe via direct DB updates?

We could have something in the compute API which calls down to the
compute for an instance and has it refresh the connection_info from Cinder
and updates the BDM table in the nova DB. It could be an admin action API,
or part of the os-server-external-events API, like what we have for the
'network-changed' event sent from Neutron which nova uses to refresh the
network info cache.
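
Purely to illustrate the idea, if this were modelled on
os-server-external-events, an operator-facing call might look something like
the sketch below. Note that 'volume-connection-changed' is not an existing
event name, it is a made-up placeholder for whatever event we would add; only
the 'network-changed' precedent and the external-events client call exist
today:

    # Hypothetical sketch only: this event name does not exist today.
    # 'nova' is a novaclient Client built as in the shelve/unshelve sketch above.
    events = [{
        'server_uuid': 'INSTANCE_UUID',        # placeholder instance UUID
        'name': 'volume-connection-changed',   # hypothetical event name
        'tag': 'VOLUME_UUID',                  # placeholder volume UUID
    }]
    nova.server_external_events.create(events)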

Other ideas or feedback here?


We've discussed this a few times before and we were thinking it might be
best to handle this transparently and just do a connection_info refresh +
record update inline with the request flows that will end up reading
connection_info from the block device mapping records. That way, operators
won't have to intervene when connection_info changes.


The thing that sucks about this is that we'd be refreshing something
that rarely changes on every volume-related operation on the
instance. That seems like a lot of overhead to me (nova/cinder API
interactions, Cinder interactions to the volume backend, nova-compute round
trips to conductor and the DB to update the BDM table, etc).


At least in the case of Ceph, as long as a guest is running, it will
continue to work fine if the monitor IPs or secrets change because it will
continue to use its existing connection to the Ceph cluster. Things go wrong
when an instance action such as resize, stop/start, or reboot is done
because when the instance is taken offline and being brought back up, the
stale connection_info is read from the block_device_mapping table and
injected into the instance, and so it loses contact with the cluster. If we
query Cinder and update the block_device_mapping record at the beginning of
those actions, the instance will get the new connection_info.
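
The Cinder side of such a refresh is essentially a new
os-initialize_connection call. As a rough sketch only (not the in-tree code
path), re-querying the connection_info with python-cinderclient might look
like this; the connector dict is a simplified placeholder, since nova-compute
builds the real one via os-brick, and writing the result back into the
block_device_mappings record has no public API today, which is exactly the gap
being discussed:

    # Rough sketch of re-querying connection_info from Cinder; placeholders only.
    # 'sess' is a keystoneauth1 session, as in the shelve/unshelve sketch above.
    from cinderclient import client as cinder_client

    cinder = cinder_client.Client('3', session=sess)

    # Simplified connector; the real one comes from os-brick on the compute host.
    connector = {'host': 'compute-01', 'ip': '192.0.2.50', 'multipath': False}
    fresh_info = cinder.volumes.initialize_connection('VOLUME_UUID', connector)

    # For RBD this would contain the current monitor IPs and secret_uuid;
    # persisting it into the BDM record is the missing piece.
    print(fresh_info['data']['hosts'], fresh_info['data'].get('secret_uuid'))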

-melanie




--

Thanks,

Matt


_______________________________________________
OpenStack-operators mailing list
OpenStack-operators at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


--
Arne Wiebalck
CERN IT


