[openstack-dev] [Cinder] A possible solution for HA Active-Active

Clint Byrum clint at fewbar.com
Tue Aug 4 17:48:28 UTC 2015


Excerpts from Duncan Thomas's message of 2015-08-04 00:32:40 -0700:
> On 3 August 2015 at 20:53, Clint Byrum <clint at fewbar.com> wrote:
> 
> > Excerpts from Devananda van der Veen's message of 2015-08-03 08:53:21 -0700:
> > Also on a side note, I think Cinder's need for this is really subtle,
> > and one could just accept that sometimes it's going to break when it does
> > two things to one resource from two hosts. The error rate there might
> > even be lower than the false-error rate caused by a twitchy DLM with
> > timeouts set a little too low. So there's a core Cinder discussion that
> > keeps losing to the shiny DLM discussion, and I'd like to see it played
> > out fully: could Cinder just not do anything, and let the few drivers
> > that react _really_ badly implement their own concurrency controls?
> >
> 
> 
> So the problem here is data corruption. Lots of our races can cause data
> corruption. Not 'my instance didn't come up', not 'my network is screwed
> and I need to tear everything down and do it again', but 'my 1 TB
> customer database is now missing the second half'. This means that we
> *really* need confidence in, and understanding of, whatever we do. The idea
> of locks timing out and being stolen without fencing is frankly scary and
> begging for data corruption unless we're very careful. I'd rather use a
> persistent lock (e.g. a DB record change) and manual recovery than a lock
> timeout that might cause corruption.
> 

Thanks Duncan. Can you be more specific about a known data-corrupting
race that a) isn't handled simply by serialization in the database,
and b) isn't specific to a single driver?
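
To make "serialization in the database" concrete, here is a minimal
sketch of the kind of thing I mean: a conditional UPDATE that only
succeeds if the row is still in the expected state, so two hosts cannot
both start work on one volume. The table, column, state names and the
try_transition() helper are all made up for illustration and are not
Cinder's actual schema or API. It uses sqlite3 only so that it runs
anywhere.

```python
import sqlite3


def make_db():
    # Hypothetical one-table schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE volumes (id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO volumes VALUES ('vol-1', 'available')")
    conn.commit()
    return conn


def try_transition(conn, volume_id, expected, new):
    """Move a volume from `expected` to `new` only if nobody got there first.

    The WHERE clause makes the check and the write a single statement,
    so the database serializes competing callers. There is no timeout:
    the new state persists until something explicitly changes it back.
    """
    cur = conn.execute(
        "UPDATE volumes SET status = ? WHERE id = ? AND status = ?",
        (new, volume_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1


conn = make_db()
# Two hosts race to operate on the same volume.
host_a = try_transition(conn, "vol-1", "available", "extending")
host_b = try_transition(conn, "vol-1", "available", "deleting")
print(host_a, host_b)  # True False
```

The second caller sees that the row is no longer 'available' and backs
off. If the winner dies mid-operation, the row stays in 'extending'
until an operator resets it, which is the manual recovery trade-off
Duncan describes.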


