[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

Sahid Orentino Ferdjaoui sahid.ferdjaoui at redhat.com
Wed Feb 4 17:05:05 UTC 2015


On Wed, Feb 04, 2015 at 04:30:32PM +0000, Matthew Booth wrote:
> I've spent a few hours today reading about Galera, a clustering solution
> for MySQL. Galera provides multi-master 'virtually synchronous'
> replication between multiple mysql nodes. i.e. I can create a cluster of
> 3 mysql dbs and read and write from any of them with certain consistency
> guarantees.
> 
> I am no expert[1], but this is a TL;DR of a couple of things which I
> didn't know, but feel I should have done. The semantics are important to
> application design, which is why we should all be aware of them.
> 
> 
> * Commit will fail if there is a replication conflict
> 
> foo is a table with a single field, which is its primary key.
> 
> A: start transaction;
> B: start transaction;
> A: insert into foo values(1);
> B: insert into foo values(1); <-- 'regular' DB would block here, and
>                                   report an error on A's commit
> A: commit; <-- success
> B: commit; <-- KABOOM
> 
> Confusingly, Galera will report a 'deadlock' to node B, despite this not
> being a deadlock by any definition I'm familiar with.

Yes! And to add some information (I hope I am not mistaken): I believe
this is a known issue which comes from MySQL, and that is why we have
a decorator to retry the operation and so handle this case here:

  http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177
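
For illustration, a minimal sketch of such a retry decorator (the
names, the exception class, and the retry policy here are illustrative,
not nova's actual code; nova catches its own wrapped deadlock error):

```python
import functools


class DBDeadlock(Exception):
    """Stand-in for the 'deadlock' error Galera reports on a commit conflict."""


def retry_on_deadlock(max_retries=3):
    """Re-run the decorated transaction when the DB reports a deadlock.

    The whole function is retried, so it must contain the complete
    transaction (begin through commit) and be safe to re-execute.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            attempt = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except DBDeadlock:
                    attempt += 1
                    if attempt >= max_retries:
                        raise  # give up and propagate the error
        return wrapper
    return decorator
```

The key point is that the retry wraps the entire transaction function,
not just the commit: after a replication conflict the whole unit of
work has to be redone from the start.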

> Essentially, anywhere that a regular DB would block, Galera will not
> block transactions on different nodes. Instead, it will cause one of the
> transactions to fail on commit. This is still ACID, but the semantics
> are quite different.
> 
> The impact of this is that code which makes correct use of locking may
> still fail with a 'deadlock'. The solution to this is to either fail the
> entire operation, or to re-execute the transaction and all its
> associated code in the expectation that it won't fail next time.
> 
> As I understand it, these can be eliminated by sending all writes to a
> single node, although that obviously makes less efficient use of your
> cluster.
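
A sketch of one common way to do the single-writer setup, assuming an
haproxy load balancer in front of the cluster (the listen block, node
names and addresses are made up for illustration): the `backup` keyword
keeps node2 and node3 as standbys, so all traffic goes to node1 unless
it fails.

```
# Hypothetical haproxy fragment: route all MySQL traffic to one
# Galera node; the others only take over if it goes down.
listen galera
    bind 0.0.0.0:3306
    mode tcp
    server node1 10.0.0.1:3306 check
    server node2 10.0.0.2:3306 check backup
    server node3 10.0.0.3:3306 check backup
```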
> 
> 
> * Write followed by read on a different node can return stale data
> 
> During a commit, Galera replicates a transaction out to all other db
> nodes. Due to its design, Galera knows these transactions will be
> successfully committed to the remote node eventually[2], but it doesn't
> commit them straight away. The remote node will check these outstanding
> replication transactions for write conflicts on commit, but not for
> read. This means that you can do:
> 
> A: start transaction;
> A: insert into foo values(1)
> A: commit;
> B: select * from foo; <-- May not contain the value we inserted above[3]
> 
> This means that even for 'synchronous' slaves, if a client makes an RPC
> call which writes a row to write master A, then another RPC call which
> expects to read that row from synchronous slave node B, there's no
> default guarantee that it'll be there.
> 
> Galera exposes a session variable which will fix this: wsrep_sync_wait
> (or wsrep_causal_reads on older mysql). However, this isn't the default.
> It presumably has a performance cost, but I don't know what it is, or
> how it scales with various workloads.
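
Continuing the stale-read example above, enabling it for B's session
before the read would look something like this (a sketch; the value 1
is, as I understand it, the setting that covers READ statements on
recent Galera versions):

```
B: set session wsrep_sync_wait = 1;
B: select * from foo; <-- now waits until pending replicated
                          transactions have been applied, so it
                          sees the row inserted on A
```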
> 
> 
> Because these are semantic issues, they aren't things which can be
> easily guarded with an if statement. We can't say:
> 
> if galera:
>   try:
>     commit
>   except:
>     rewind time
> 
> If we are to support this DB at all, we have to structure code in the
> first place to allow for its semantics.
> 
> Matt
> 
> [1] No, really: I just read a bunch of docs and blogs today. If anybody
> who is an expert would like to validate/correct that would be great.
> 
> [2]
> http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/
> 
> [3]
> http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
> -- 
> Matthew Booth
> Red Hat Engineering, Virtualisation Team
> 
> Phone: +442070094448 (UK)
> GPG ID:  D33C3490
> GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
> 


