[Openstack-operators] Cinder block storage HA

Juan José Pavlik Salles jjpavlik at gmail.com
Tue Sep 16 18:27:08 UTC 2014


Hi Abel, I thought about trying it, but we had MANY performance problems
with the EMC because of running too many LUNs; that's why we'd like to avoid
that scenario. It might seem like the best solution, but we don't want to go
that way again.

2014-09-16 15:20 GMT-03:00 Abel Lopez <alopgeek at gmail.com>:

> Have you tried using the native EMC drivers? That way cinder only acts as
> a broker between your instances and the storage back end, and you don't
> need to worry (as much) about your cinder-volume service being HA.
>
>
> On Tuesday, September 16, 2014, Juan José Pavlik Salles <
> jjpavlik at gmail.com> wrote:
>
>> Hi guys, I'm trying to add some HA to our cinder service. We have the
>> following scenario:
>>
>> -Real backends: EMC CLARiiON (SATA drives) and HP StoreVirtual P4000 (SAS
>> drives). These two backends export 2 big LUNs to our (one and only right
>> now) cinder server.
>> -Once these big LUNs are imported into the cinder server, two different VGs
>> are created for two different cinder LVM drivers (cinder-volumes-1 and
>> cinder-volumes-2). This way I have two different storage resources to offer
>> my tenants.
>>
>> What I want is to deploy a second cinder server to act as a failover for
>> the first one. Both servers are identical. So far I'm running a few tests
>> with isolated VMs:
>>
>> -I installed corosync+pacemaker on 2 VMs and added a virtual IP.
>> -Imported a LUN into the VMs over iSCSI and created a VG on it.
>> -Exported an LV with tgt. More or less the same scenario we have in
>> production.
>>
>> If one of the VMs dies, the second one picks up the virtual IP through which
>> tgt is exporting the LUN, and the iSCSI session doesn't die. Here you can see
>> part of the logs from the machine where the LUN is being imported:
>>
>> Sep 16 14:29:50 borrar-nfs kernel: [86630.416160]  connection1:0: ping
>> timeout of 5 secs expired, recv timeout 5, last rx 4316547395, last ping
>> 4316548646, now 4316549900
>> Sep 16 14:29:50 borrar-nfs kernel: [86630.418938]  connection1:0:
>> detected conn error (1011)
>> Sep 16 14:29:51 borrar-nfs iscsid: Kernel reported iSCSI connection 1:0
>> error (1011) state (3)
>> Sep 16 14:29:53 borrar-nfs iscsid: connection1:0 is operational after
>> recovery (1 attempts)
>>
>> This test was really simple, just one 1GB LUN, but it worked OK, even when
>> the failover was triggered during a write operation.
>>
>> So it seems to be a good-so-far solution, but there are a few things that
>> worry me a bit:
>>
>> -Timeouts: how much time do I have to detect the problem and move the IP
>> to the new node before the iSCSI connections die? I think I could play a
>> little bit with node.conn[0].timeo.noop_out_timeout in iscsid.conf.
>> -What if a write operation was in progress when a node failed and it
>> never reached the real backends? Could I come across some
>> inconsistencies in the volume FS? Any recommendations?
>> -If I create a volume in cinder, the proper target file is created
>> in /var/lib/cinder/volumes/volume-*, but I need the file to be created on
>> both cinder nodes in case one of them fails. What would be a proper solution
>> for this? Shared storage for the directory? SVN?
>> -Should both servers be running tgt at the same time, or should I start
>> tgt on the failover server only once the virtual IP has moved?
>>
>> Any comments or suggestions will be more than appreciated. Thanks!
>>
>> --
>> Pavlik Salles Juan José
>> Blog - http://viviendolared.blogspot.com
>>
>
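On the timeout question quoted above, here is a minimal sketch of how the
failover window could be estimated from the initiator settings. It assumes
the open-iscsi setting names noop_out_interval, noop_out_timeout and
replacement_timeout as they appear in iscsid.conf, and it assumes that I/O
errors only surface once replacement_timeout has expired after the
connection error is detected. The sample values and the function names are
made up for illustration; read the real values from your own iscsid.conf.

```python
# Rough estimate of the failover window for an open-iscsi initiator.
# SAMPLE_CONF holds assumed example values, not values taken from a real host.

SAMPLE_CONF = """
# seconds between NOP-Out pings sent to the target
node.conn[0].timeo.noop_out_interval = 5
# seconds to wait for the ping reply before declaring a connection error
node.conn[0].timeo.noop_out_timeout = 5
# seconds to wait for the session to recover before failing I/O upwards
node.session.timeo.replacement_timeout = 120
"""


def parse_iscsid_conf(text):
    """Return {key: value} for 'key = value' lines, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def failover_budget(settings):
    """Estimate worst-case detection time and the time until I/O errors."""
    interval = int(settings["node.conn[0].timeo.noop_out_interval"])
    timeout = int(settings["node.conn[0].timeo.noop_out_timeout"])
    replacement = int(settings["node.session.timeo.replacement_timeout"])
    # Worst case: the target dies right after a successful ping, so the
    # initiator waits a full interval plus the ping timeout to notice.
    detect = interval + timeout
    return {
        "detect_worst_case_s": detect,
        "io_errors_after_s": detect + replacement,
    }


if __name__ == "__main__":
    print(failover_budget(parse_iscsid_conf(SAMPLE_CONF)))
```

Under these assumptions, the virtual IP and tgt have to be back on the
surviving node before the second figure runs out; lowering the ping settings
only makes the initiator notice the failure sooner.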


-- 
Pavlik Salles Juan José
Blog - http://viviendolared.blogspot.com