[Openstack-operators] RAID / stripe block storage volumes

Joe Topjian joe at topjian.net
Mon Mar 7 15:52:12 UTC 2016


On Mon, Mar 7, 2016 at 12:33 AM, Tim Bell <Tim.Bell at cern.ch> wrote:

> From: joe <joe at topjian.net>
> Date: Monday 7 March 2016 at 07:53
> To: openstack-operators <openstack-operators at lists.openstack.org>
> Subject: Re: [Openstack-operators] RAID / stripe block storage volumes
>
> We ($work) have been researching this topic for the past few weeks and I
> wanted to give an update on what we've found.
>
> First, we've found that both Rackspace and Azure advocate RAID'ing block
> storage volumes from within an instance for both performance
> and resilience [1][2][3]. I only mention this to add to the earlier Amazon
> AWS information and not to imply that more people should share this view.
>
> Second, we discovered virtio-scsi [4]. With the following properties added
> to an image, disks attached to the instance will appear as SCSI devices,
> including the more common /dev/sdX naming:
>
> hw_disk_bus_model=virtio-scsi
> hw_scsi_model=virtio-scsi
> hw_disk_bus=scsi
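>
> For reference, a minimal sketch of applying those properties to an existing
> image with the OpenStack CLI (the image name below is just a placeholder)
> would be something like:
>
>     openstack image set \
>       --property hw_disk_bus_model=virtio-scsi \
>       --property hw_scsi_model=virtio-scsi \
>       --property hw_disk_bus=scsi \
>       my-image
>
> Instances booted from the image afterwards should see their disks on the
> SCSI bus.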
>
> What's notable is that, in our testing, ZFS pools and Gluster replicas are
> more likely to notice when a volume disconnects or fails under virtio-scsi.
> mdadm has always been fairly dependable, so there hasn't been a change
> there. We're still testing, but virtio-scsi looks promising.
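>
> For anyone wanting to run a similar test, a minimal in-instance ZFS mirror
> across two attached volumes (the device names are assumptions and depend on
> attach order) is simply:
>
>     zpool create -f datapool mirror /dev/sdb /dev/sdc
>     zpool status datapool
>
> Detaching one of the volumes from the instance is then an easy way to watch
> how the pool reports and handles the failure.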
>
>
> We found virtio-scsi to be significantly slower (~20%) on bonnie++. I had
> been thinking it would be better.
>
> What were your performance experiences ?
>
> Tim
>

That's one area we're still testing. We're seeing a 15% increase in reads
for 4k - 1m blocks, but anywhere from a 3-20% decrease in all types of write
activity. Something seems off... or at least there should be an explanation
for it.
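
For anyone wanting to reproduce the comparison, a plain bonnie++ pass against
a filesystem on the attached volume (the mount point and sizes below are just
examples) would be something like:

    bonnie++ -d /mnt/volume -s 16384 -n 0 -u nobody

run once with a stock virtio-blk image and once with the virtio-scsi
properties set, on otherwise identical instances.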


>
> 1:
> https://support.rackspace.com/how-to/configuring-a-software-raid-on-a-linux-general-purpose-cloud-server/
> 2: https://support.rackspace.com/how-to/cloud-block-storage-faq/
> 3:
> https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-configure-raid/
> 4: https://wiki.openstack.org/wiki/LibvirtVirtioScsi
>
> On Mon, Feb 8, 2016 at 7:18 PM, Joe Topjian <joe at topjian.net> wrote:
>
>> Yep. Don't get me wrong -- I agree 100% with everything you've said
>> throughout this thread. Applications that have native replication are
>> awesome. Swift is crazy awesome. :)
>>
>> I understand that some may see the use of mdadm, Cinder-assisted
>> replication, etc. as supporting "pet" environments, and I agree to some
>> extent. But I do think there are applicable use-cases where those services
>> could be very helpful.
>>
>> As one example, I know of large cloud-based environments which handle
>> very large data sets and are entirely stood up through configuration
>> management systems. However, due to the sheer size of data being handled,
>> rebuilding or resyncing a portion of the environment could take hours.
>> Failing over to a replicated volume is instant. In addition, being able to
>> both stripe and replicate goes a very long way in making the most out of
>> commodity block storage environments (for example, avoiding packing
>> problems and such).
>>
>> Should these types of applications be reading / writing directly to
>> Swift, HDFS, or handling replication themselves? Sure, in a perfect world.
>> Does Gluster fill all gaps I've mentioned? Kind of.
>>
>> I guess I'm just trying to survey the options available for applications
>> and environments that would otherwise be very flexible and resilient if it
>> wasn't for their awkward use of storage. :)
>>
>> On Mon, Feb 8, 2016 at 6:18 PM, Robert Starmer <robert at kumul.us> wrote:
>>
>>> Besides, wouldn't it be better to actually do application-layer backup and
>>> restore, or application-level distribution for replication? That
>>> architecture at least lets the application detect and deal with corrupt
>>> data in transit, rather than the DRBD-like model where if you corrupt one
>>> data-set, you corrupt them all...
>>>
>>> Hence my comment about having some form of object storage (SWIFT is
>>> perhaps even a good example of this architecture: the proxy replicates,
>>> checks MD5, etc. to verify good data, rather than just replicating blocks
>>> of data).
>>>
>>>
>>>
>>> On Mon, Feb 8, 2016 at 7:15 PM, Robert Starmer <robert at kumul.us> wrote:
>>>
>>>> I have not run into anyone replicating volumes or creating redundancy
>>>> at the VM level (beyond, as you point out, HDFS, etc.).
>>>>
>>>> R
>>>>
>>>> On Mon, Feb 8, 2016 at 6:54 PM, Joe Topjian <joe at topjian.net> wrote:
>>>>
>>>>> This is a great conversation and I really appreciate everyone's input.
>>>>> Though, I agree, we wandered off the original question and that's my fault
>>>>> for mentioning various storage backends.
>>>>>
>>>>> For the sake of conversation, let's just say the user has no knowledge
>>>>> of the underlying storage technology. They're presented with a Block
>>>>> Storage service and the rest is up to them. What known, working options
>>>>> does the user have to build their own block storage resilience? (Ignoring
>>>>> "obvious" solutions where the application has native replication, such as
>>>>> Galera, elasticsearch, etc)
>>>>>
>>>>> I have seen references to Cinder supporting replication, but I'm not
>>>>> able to find a lot of information about it. The support matrix[1] lists
>>>>> very few drivers that actually implement replication -- is this true or is
>>>>> there a trove of replication docs that I just haven't been able to find?
>>>>>
>>>>> Amazon AWS publishes instructions on how to use mdadm with EBS[2]. One
>>>>> might interpret that to mean mdadm is a supported solution within EC2-based
>>>>> instances.
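>>>>>
>>>>> As a rough sketch of what that looks like from inside an instance with,
>>>>> say, four attached volumes (the device names are illustrative and depend
>>>>> on the bus and attach order), a striped and mirrored array would be:
>>>>>
>>>>>     mdadm --create /dev/md0 --level=10 --raid-devices=4 \
>>>>>       /dev/vdb /dev/vdc /dev/vdd /dev/vde
>>>>>     mkfs.ext4 /dev/md0
>>>>>     mount /dev/md0 /mnt/data
>>>>>
>>>>> plus saving the output of "mdadm --detail --scan" into mdadm.conf so the
>>>>> array reassembles on reboot.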
>>>>>
>>>>> There are also references to DRBD and EC2, though I could not find
>>>>> anything as "official" as mdadm and EC2.
>>>>>
>>>>> Does anyone have experience (or know users) doing either?
>>>>> (specifically with libvirt/KVM, but I'd be curious to know in general)
>>>>>
>>>>> Or is it more advisable to create multiple instances where data is
>>>>> replicated instance-to-instance rather than a single instance with multiple
>>>>> volumes and have data replicated volume-to-volume (by way of a single
>>>>> instance)? And if so, why? Is a lack of stable volume-to-volume replication
>>>>> a limitation of certain hypervisors?
>>>>>
>>>>> Or has this area just not been explored in depth within OpenStack
>>>>> environments yet?
>>>>>
>>>>> 1: https://wiki.openstack.org/wiki/CinderSupportMatrix
>>>>> 2: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/raid-config.html
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 4:10 PM, Robert Starmer <robert at kumul.us>
>>>>> wrote:
>>>>>
>>>>>> I'm not against Ceph, but even 2 machines (and really 2 machines with
>>>>>> enough storage to be meaningful, e.g. not the all blade environments I've
>>>>>> built some o7k  systems on) may not be available for storage, so there are
>>>>>> cases where that's not necessarily the solution. I built resiliency in one
>>>>>> environment with a 2 node controller/Glance/db system with Gluster, which
>>>>>> enabled enough middleware resiliency to meet the customer's recovery
>>>>>> expectations. Regardless, even with a cattle application model, the
>>>>>> infrastructure middleware still needs to be able to provide some level of
>>>>>> resiliency.
>>>>>>
>>>>>> But we've kind of wandered off the original question. To bring this back
>>>>>> on topic: I think users can build resilience into their own storage
>>>>>> construction, but there are still use cases where the middleware either
>>>>>> needs its own resiliency layer and/or may end up providing it for the end
>>>>>> user.
>>>>>>
>>>>>> R
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 3:51 PM, Fox, Kevin M <Kevin.Fox at pnnl.gov>
>>>>>> wrote:
>>>>>>
>>>>>>> We've used Ceph to address the storage requirement in small clouds
>>>>>>> pretty well. It works well with only two storage nodes and replication
>>>>>>> set to 2, and because of the radosgw, you can share your small amount of
>>>>>>> storage between the object store and the block store, avoiding the need
>>>>>>> to overprovision Swift-only or Cinder-only capacity to handle usage
>>>>>>> unknowns. It's just one pool of storage.
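>>>>>>>
>>>>>>> For the curious, dropping the replication factor is just a per-pool
>>>>>>> setting (the pool name below is the conventional one for Cinder and is
>>>>>>> only an example):
>>>>>>>
>>>>>>>     ceph osd pool set volumes size 2
>>>>>>>     ceph osd pool set volumes min_size 1
>>>>>>>
>>>>>>> with the usual caveat that size=2/min_size=1 trades some protection
>>>>>>> against simultaneous failures for capacity.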
>>>>>>>
>>>>>>> You're right, using LVM is like telling your users "don't do pets" but
>>>>>>> then having pets at the heart of your system: when you lose one, you lose
>>>>>>> a lot. With a small Ceph cluster, you can take out one of the nodes, burn
>>>>>>> it to the ground and put it back, and it just works. No pets.
>>>>>>>
>>>>>>> Do consider ceph for the small use case.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kevin
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> From: Robert Starmer [robert at kumul.us]
>>>>>>> Sent: Monday, February 08, 2016 1:30 PM
>>>>>>> To: Ned Rhudy
>>>>>>> Cc: OpenStack Operators
>>>>>>>
>>>>>>> Subject: Re: [Openstack-operators] RAID / stripe block storage
>>>>>>> volumes
>>>>>>>
>>>>>>> Ned's model is the model I meant by "multiple underlying storage
>>>>>>> services".  Most of the systems I've built are LV/LVM only,  a few added
>>>>>>> Ceph as an alternative/live-migration option, and one where we used Gluster
>>>>>>> due to size.  Note that the environments I have worked with in general are
>>>>>>> small (~20 compute), so huge Ceph environments aren't common.  I am also
>>>>>>> working on a project where the storage backend is entirely NFS...
>>>>>>>
>>>>>>> And I think users are more and more educated to assume that there is
>>>>>>> nothing guaranteed.  There is the realization, at least for a good set of
>>>>>>> the customers I've worked with (and I try to educate the non-believers),
>>>>>>> that the way you get the best effect from a system like OpenStack is to
>>>>>>> consider everything disposable. The one gap I've seen is that there are
>>>>>>> plenty of folks who don't deploy SWIFT, and without some form of object
>>>>>>> store, there's still the question of where you place your datasets so that
>>>>>>> they can be quickly recovered (and, if you do have an object store, how
>>>>>>> you keep them up to date).  With VMs, there's the notion that you can
>>>>>>> recover quickly because the "dataset", i.e. your OS, is already there for
>>>>>>> you, and in plenty
>>>>>>> of small environments, that's only as true as the glance repository (guess
>>>>>>> what's usually backing that when there's no SWIFT around...).
>>>>>>>
>>>>>>> So I see the issue as a holistic one. How do you show operators/users
>>>>>>> that they should consider everything disposable if we only look at the
>>>>>>> current running instance as the "thing"? Somewhere you still likely need
>>>>>>> some form of distributed resilience (and yes, I can see using the
>>>>>>> distributed Canonical, CentOS, Red Hat, Fedora, Debian, etc. mirrors as
>>>>>>> your distributed image backup, but what about the database content, etc.).
>>>>>>>
>>>>>>> Robert
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 1:44 PM, Ned Rhudy (BLOOMBERG/ 731 LEX) <
>>>>>>> erhudy at bloomberg.net> wrote:
>>>>>>>
>>>>>>>> In our environments, we offer two types of storage. Tenants can
>>>>>>>> either use Ceph/RBD and trade speed/latency for reliability and protection
>>>>>>>> against physical disk failures, or they can launch instances that are
>>>>>>>> realized as LVs on an LVM VG that we create on top of a RAID 0 spanning all
>>>>>>>> but the OS disk on the hypervisor. This lets the users elect to go all-in
>>>>>>>> on speed and sacrifice reliability for applications where replication/HA is
>>>>>>>> handled at the app level, if the data on the instance is sourced from
>>>>>>>> elsewhere, or if they just don't care much about the data.
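>>>>>>>>
>>>>>>>> For illustration, the building blocks of a setup along those lines are
>>>>>>>> just stock mdadm and LVM (the device and VG names below are examples
>>>>>>>> only):
>>>>>>>>
>>>>>>>>     mdadm --create /dev/md0 --level=0 --raid-devices=3 \
>>>>>>>>       /dev/sdb /dev/sdc /dev/sdd
>>>>>>>>     pvcreate /dev/md0
>>>>>>>>     vgcreate instance-vg /dev/md0
>>>>>>>>
>>>>>>>> with nova's LVM image backend (images_type=lvm plus
>>>>>>>> images_volume_group) pointed at the resulting VG.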
>>>>>>>>
>>>>>>>> There are some further changes to our approach that we would like
>>>>>>>> to make down the road, but in general our users seem to like the current
>>>>>>>> system and being able to forgo reliability or speed as their circumstances
>>>>>>>> demand.
>>>>>>>>
>>>>>>>> From: joe at topjian.net
>>>>>>>> Subject: Re: [Openstack-operators] RAID / stripe block storage
>>>>>>>> volumes
>>>>>>>>
>>>>>>>> Hi Robert,
>>>>>>>>
>>>>>>>> Can you elaborate on "multiple underlying storage services"?
>>>>>>>>
>>>>>>>> The reason I asked the initial question is because historically
>>>>>>>> we've made our block storage service resilient to failure. We
>>>>>>>> historically made our compute environment resilient to failure as well,
>>>>>>>> but over time we've seen users become better educated about coping with
>>>>>>>> compute failure. As a result, we've been able to become more lenient with
>>>>>>>> regard to building resilient compute environments.
>>>>>>>>
>>>>>>>> We've been discussing how possible it would be to translate that
>>>>>>>> same idea to block storage. Rather than have a large HA storage cluster
>>>>>>>> (whether Ceph, Gluster, NetApp, etc), is it possible to offer simple single
>>>>>>>> LVM volume servers and push the failure handling on to the user?
>>>>>>>>
>>>>>>>> Of course, this doesn't work for all types of use cases and
>>>>>>>> environments. We still have projects which require the cloud to own more
>>>>>>>> of the responsibility for failure than the users do.
>>>>>>>>
>>>>>>>> But for environments where we offer general purpose / best effort
>>>>>>>> compute and storage, what methods are available to help the user be
>>>>>>>> resilient to block storage failures?
>>>>>>>>
>>>>>>>> Joe
>>>>>>>>
>>>>>>>> On Mon, Feb 8, 2016 at 12:09 PM, Robert Starmer <robert at kumul.us>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I've always recommended providing multiple underlying storage
>>>>>>>>> services to provide this rather than adding the overhead to the VM.  So,
>>>>>>>>> not in any of my systems or any I've worked with.
>>>>>>>>>
>>>>>>>>> R
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2016 at 5:56 PM, Joe Topjian <joe at topjian.net>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Does anyone have users RAID'ing or striping multiple block
>>>>>>>>>> storage volumes from within an instance?
>>>>>>>>>>
>>>>>>>>>> If so, what was the experience? Good, bad, possible but with
>>>>>>>>>> caveats?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Joe
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>