[openstack-dev] [Cinder] A possible solution for HA Active-Active

Gorka Eguileor geguileo at redhat.com
Wed Aug 5 08:58:58 UTC 2015


On Tue, Aug 04, 2015 at 08:30:17AM -0700, Joshua Harlow wrote:
> Duncan Thomas wrote:
> >On 3 August 2015 at 20:53, Clint Byrum <clint at fewbar.com
> ><mailto:clint at fewbar.com>> wrote:
> >
> >    Excerpts from Devananda van der Veen's message of 2015-08-03
> >    08:53:21 -0700:
> >    Also on a side note, I think Cinder's need for this is really subtle,
> >    and one could just accept that sometimes it's going to break when it
> >    does
> >    two things to one resource from two hosts. The error rate there might
> >    even be lower than the false-error rate that would be caused by a
> >    twitchy
> >    DLM with timeouts a little low. So there's a core cinder discussion that
> >    keeps losing to the shiny DLM discussion, and I'd like to see it played
> >    out fully: Could Cinder just not do anything, and let the few drivers
> >    that react _really_ badly, implement their own concurrency controls?
> >
> >
> >
> >So the problem here is data corruption. Lots of our races can cause data
> >corruption. Not 'my instance didn't come up', not 'my network is screwed
> >and I need to tear everything down and do it again', but 'My 1tb of
> >customer database is now missing the second half'. This means that we
> >*really* need some confidence and understanding in whatever we do. The
> >idea of locks timing out and being stolen without fencing is frankly
> >scary and begging for data corruption unless we're very careful. I'd
> >rather use a persistent lock (e.g. a db record change) and manual
> >recovery than a lock timeout that might cause corruption.
> 
> So perhaps start off using persistent locks, gain confidence that we have
> all the right fixes in to prevent that data corruption, and then slowly
> remove persistent locks as needed. Sounds like an iterative solution to me,
> and one that will build confidence (hopefully that confidence building can
> be automated via a chaos-monkey like test-suite) as we go :)
> 
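For illustration, the kind of persistent lock Duncan mentions (a db
record change) can be as simple as an atomic compare-and-swap on the
volume's status column; the sketch below assumes a hypothetical
SQLAlchemy table, not Cinder's actual models or session code:

    from sqlalchemy import (Column, MetaData, String, Table,
                            create_engine, update)

    metadata = MetaData()
    # Hypothetical, minimal stand-in for the real volumes table.
    volumes = Table('volumes', metadata,
                    Column('id', String(36), primary_key=True),
                    Column('status', String(64)))

    # Placeholder connection URL.
    engine = create_engine('mysql+pymysql://cinder:secret@db/cinder')

    def claim_volume(volume_id, expected='available', new='deleting'):
        """Atomically move a volume from `expected` to `new` status.

        Only one caller can win the race: a concurrent caller sees
        rowcount == 0 and backs off instead of touching the volume.
        """
        with engine.begin() as conn:
            result = conn.execute(
                update(volumes)
                .where(volumes.c.id == volume_id)
                .where(volumes.c.status == expected)
                .values(status=new))
            return result.rowcount == 1

The nice property is that the "lock" is the state transition itself, so
there is nothing to time out and nothing to steal; recovery after a
crash is a manual (or scripted) status reset rather than a lock expiry.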

An iterative approach like that was my suggestion as well. It is not
that we cannot do without locks; it's that we already have confidence
in them and in the current code that uses them. So we can start with an
initial solution based on distributed locks (a rough sketch follows
below), confirm that the rest of the code is running properly
(distributed locks are not the only change needed), then, on a second
iteration, remove the locks in the Volume Manager, and finally, on the
next iteration, remove them from the drivers wherever possible, looking
for alternative solutions in the places where it isn't.
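
To make that first iteration concrete, this is roughly what taking a
per-volume distributed lock through the tooz library could look like;
the backend URL, member id and lock name are placeholders, not actual
Cinder configuration options:

    from tooz import coordination

    # Placeholder backend and member id; in practice these would come
    # from cinder.conf and the service's host name.
    coordinator = coordination.get_coordinator(
        'zookeeper://controller:2181', b'cinder-volume-host-1')
    coordinator.start()

    def delete_volume(volume_id):
        # One lock per volume id, so operations on different volumes
        # do not serialize behind each other.
        lock = coordinator.get_lock(('volume-' + volume_id).encode())
        with lock:
            # Existing delete path goes here, unchanged for now.
            pass

The point of that first iteration is that this is close to a drop-in
replacement for the current local locks, so the rest of the code keeps
its assumptions while we verify everything else needed for
Active-Active.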

This way we can get to a solution faster and avoid the potential delays
that may arise if we try to do everything at once.

That said, I can see the point of those who ask why we should put
operators through the DLM configuration process if we are probably
going to remove the DLM in a couple of releases. But since we don't
know how difficult it will be to remove all the other locks, I think a
bird in the hand is worth two in the bush: we should still go with the
distributed locks and at least make sure we have a working solution.

Cheers,
Gorka.


