[openstack-dev] [TripleO] [Ironic] [Cinder] "Baremetal volumes" -- how to model direct attached storage

Chris Jones cmsj at tenshu.net
Fri Nov 14 08:42:48 UTC 2014


Hi

My thoughts:

Shoe-horning the ephemeral partition into Cinder seems like a lot of pain for almost no gain[1]. The only gain I can think of would be that we could bring a node down, boot it into a special ramdisk that exposes the volume to the network, so cindery operations (e.g. migration) could be performed, but I'm not even sure if anyone is asking for that?

Forcing Cinder to understand and track something it can never normally do anything with seems like we're just trying to squeeze ourselves into an ever-shrinking VM costume!

Having said that, "preserve ephemeral" is a terrible oxymoron, so if we can do something about it, we probably should.

How about instead we teach Nova/Ironic a concept of "no ephemeral"? They make a partition on the first disk for the first image they deploy, and then never touch the other part(s) of the disk(s) until the instance is destroyed. This creates one additional burden for operators, which is to create and format the state partition the first time a node is deployed, but since that is a very small number of commands, and something we could trivially bake into our (root?) elements, I'm not sure it's a huge problem.
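
Purely to illustrate how small that burden is, a first-boot script along
these lines would do (this is a hypothetical sketch, not an existing
element; the device names, partition number and filesystem are
assumptions):

    # Hypothetical first-boot sketch: create and format a "state"
    # partition in the space left after the image partition, then never
    # touch it again on later deploys. Assumes we own /dev/sda and that
    # sgdisk, partprobe and mkfs.ext4 are available in the image.
    import os
    import subprocess

    DISK = "/dev/sda"
    STATE_PART = "/dev/sda2"
    STATE_LABEL = "state"

    def create_state_partition():
        # Partition 2 takes all remaining free space on the disk.
        subprocess.check_call(["sgdisk", "--new=2:0:0",
                               "--change-name=2:%s" % STATE_LABEL, DISK])
        subprocess.check_call(["partprobe", DISK])
        subprocess.check_call(["mkfs.ext4", "-L", STATE_LABEL, STATE_PART])

    if __name__ == "__main__":
        if not os.path.exists(STATE_PART):
            create_state_partition()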

This gets rid of the cognitive dissonance of preserving something that is described as ephemeral, and (IMO) makes it extremely clear that OpenStack isn't going to touch anything but the first partition of the first disk. If this were baked into the flavour rather than something we tack onto a nova rebuild command, it would also offer greater safety for operators against the risk of accidentally wiping a vital state partition with a misconstructed rebuild command.
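
To make the flavour idea concrete: a hypothetical extra spec (the key
name is invented here; nothing in Nova or Ironic defines it today) might
be all that is needed:

    # Hypothetical flavor extra spec -- the key is made up for
    # illustration and nothing reads it today.
    flavor_extra_specs = {
        "baremetal:preserve_state_partition": "true",
    }

A rebuild against a flavour carrying that key would then preserve the
state partition by default, with no extra flag for an operator to forget
or mistype.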


[1] for local disk, I mean. I still think it'd be nice for operators to be able to use a networked Cinder volume for /mnt/state/, but that presents a whole different set of challenges :)

Cheers,
--
Chris Jones

> On 13 Nov 2014, at 09:25, Robert Collins <robertc at robertcollins.net> wrote:
> 
> Back in the day, before the ephemeral hack (though that was something
> folk have said they would like for libvirt too - so it's not such a
> hack per se), this was (broadly) sketched out. We spoke with the Cinder
> PTL at the time in Portland, from memory.
> 
> There was no spec, so here is my brain-dumpy-recollection...
> 
> - actual volumes are a poor match because we wouldn't be running
> cinder-volume on an ongoing basis and service records would accumulate
> etc.
> - we'd need cross-service scheduler support to make cinder operations
> line up with allocated bare metal nodes (and to e.g. make sure both
> our data volume and golden image volume are scheduled to the same
> machine).
> 
> - folk want to be able to do fairly arbitrary RAID (& JBOD) setups, and
> that affects scheduling as well. One way to work it is to have Ironic
> export capabilities and specify actual RAID setups via matching
> flavors - this is the direction the ephemeral work took us, and it
> extends conceptually straightforwardly to RAID. We did talk about
> doing a little JSON schema to describe RAID / volume layouts, which
> cinder could potentially use for user-defined volume flavors too.
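> 
> Something like this strawman (never written down as an actual schema;
> the field names here are invented purely for illustration):
> 
>     # Strawman layout description -- field names invented for
>     # illustration, no such schema exists.
>     layout = {
>         "raid": [
>             {"level": "1", "member_disks": 2},
>         ],
>         "volumes": [
>             {"name": "image", "size_gb": 20},
>             {"name": "state", "size_gb": "remaining", "preserve": True},
>         ],
>     }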
> 
> One thing I think is missing from your description is this: "
> 
> To be clear, in TripleO, we need a way to keep the data on a local
> direct attached storage device while deploying a new image to the box."
> 
> I think we need to be able to do this with a single drive shared
> between image and data - using one disk for the image and one for data
> would add substantial waste given the size of disks these days (and for
> some form factors like Moonshot it would rule out using them at all).
> 
> Of course, being able to do entirely network stored golden images
> might be something some deployments want, but we can't require them
> all to do that ;)
> 
> -Rob
> 
> 
> 
>> On 13 November 2014 11:30, Clint Byrum <clint at fewbar.com> wrote:
>> Each summit since we created "preserve ephemeral" mode in Nova, I have
>> conversations in which at least one person's brain breaks for a
>> second. There isn't always alcohol involved before; there is almost
>> certainly always a drink needed after. The very term is vexing, and I
>> think we have done ourselves a disservice by having it, even if it was
>> the best option at the time.
>> 
>> To be clear, in TripleO, we need a way to keep the data on a local
>> direct attached storage device while deploying a new image to the box.
>> If we were on VMs, we'd attach volumes, and just deploy new VMs and move
>> the volume over. If we had a SAN, we'd just move the LUNs. But at some
>> point when you deploy a cloud you're holding data that is expensive to
>> replicate all at once, and so you'd rather just keep using the same
>> server instead of trying to move the data.
>> 
>> Since we don't have baremetal Cinder, we had to come up with a way to
>> do this, so we used Nova rebuild and slipped it a special command that
>> said "don't overwrite the partition you'd normally make the 'ephemeral'
>> partition". This works fine, but it is confusing and limiting. We'd like
>> something better.
>> 
>> I had an interesting discussion with Devananda in which he suggested an
>> alternative approach. If we were to bring up cinder-volume on our deploy
>> ramdisks, and configure it in such a way that it claimed ownership of
>> the section of disk we'd like to preserve, then we could allocate that
>> storage as a volume. From there, we could boot from volume, or "attach"
>> the volume to the instance (which would really just tell us how to find
>> the volume). When we want to write a new image, we can just delete the old
>> instance and create a new one, scheduled to wherever that volume already
>> is. This would require the nova scheduler to have a filter available
>> where we could select a host by the volumes it has, so we can make sure to
>> send the instance request back to the box that still has all of the data.
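>> 
>> As a sketch of what that filter could look like (hypothetical code; the
>> scheduler hint name and the idea of tracking which host holds the
>> volume are assumptions, not anything that exists in Nova today):
>> 
>>     # Hypothetical scheduler filter -- a sketch only. It assumes the
>>     # boot request carries a scheduler hint naming the host that
>>     # already holds the direct-attached volume.
>>     from nova.scheduler import filters
>> 
>>     class LocalVolumeFilter(filters.BaseHostFilter):
>>         def host_passes(self, host_state, filter_properties):
>>             hints = filter_properties.get('scheduler_hints') or {}
>>             wanted_host = hints.get('local_volume_host')
>>             if not wanted_host:
>>                 # No local-volume constraint was requested.
>>                 return True
>>             return host_state.host == wanted_host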
>> 
>> Alternatively we can keep on using rebuild, but let the volume model the
>> preservation rather than our special case.
>> 
>> Thoughts? Suggestions? I feel like this might take some time, but it is
>> necessary to consider it now so we can drive any work we need to get it
>> done soon.
>> 
> 
> 
> 
> -- 
> Robert Collins <rbtcollins at hp.com>
> Distinguished Technologist
> HP Converged Cloud
> 
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


