[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Sahid Orentino Ferdjaoui
sahid.ferdjaoui at redhat.com
Wed Feb 4 17:05:05 UTC 2015
On Wed, Feb 04, 2015 at 04:30:32PM +0000, Matthew Booth wrote:
> I've spent a few hours today reading about Galera, a clustering solution
> for MySQL. Galera provides multi-master 'virtually synchronous'
> replication between multiple mysql nodes. i.e. I can create a cluster of
> 3 mysql dbs and read and write from any of them with certain consistency
> guarantees.
>
> I am no expert[1], but this is a TL;DR of a couple of things which I
> didn't know, but feel I should have done. The semantics are important to
> application design, which is why we should all be aware of them.
>
>
> * Commit will fail if there is a replication conflict
>
> foo is a table with a single field, which is its primary key.
>
> A: start transaction;
> B: start transaction;
> A: insert into foo values(1);
> B: insert into foo values(1); <-- a 'regular' DB would block here, then
> report a duplicate-key error to B once A commits
> A: commit; <-- success
> B: commit; <-- KABOOM
>
> Confusingly, Galera will report a 'deadlock' to node B, despite this not
> being a deadlock by any definition I'm familiar with.
Yes! And if I can add more information (I hope I do not make a mistake
here): I think it's a known issue which comes from MySQL, and that is
why we have a decorator in nova to retry and handle this case here:
http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177
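
The idea behind it is roughly the following (a minimal sketch of the
pattern, not the actual nova code; the retry count and sleep interval
are assumptions on my side):

    import functools
    import time

    from oslo_db import exception as db_exc


    def retry_on_deadlock(func):
        """Retry a DB API call when Galera reports a (pseudo) deadlock."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            attempts = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except db_exc.DBDeadlock:
                    attempts += 1
                    if attempts >= 5:
                        raise
                    # Back off a little before re-running the whole
                    # transaction, as Matthew describes below.
                    time.sleep(0.5)
        return wrapper

Any code decorated like this must of course be safe to re-execute from
the start of the transaction.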
> Essentially, anywhere that a regular DB would block, Galera will not
> block transactions on different nodes. Instead, it will cause one of the
> transactions to fail on commit. This is still ACID, but the semantics
> are quite different.
>
> The impact of this is that code which makes correct use of locking may
> still fail with a 'deadlock'. The solution to this is to either fail the
> entire operation, or to re-execute the transaction and all its
> associated code in the expectation that it won't fail next time.
>
> As I understand it, these can be eliminated by sending all writes to a
> single node, although that obviously makes less efficient use of your
> cluster.
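
At the load-balancer level this is usually done by declaring a single
"active" Galera node for writes. At the application level the same idea
could look roughly like this (a sketch only, assuming SQLAlchemy;
hostnames and credentials are placeholders):

    import sqlalchemy
    from sqlalchemy import text

    # All writes go through one designated node; reads may be spread
    # over the rest of the cluster (or a load-balancer VIP).
    writer = sqlalchemy.create_engine('mysql://nova:secret@galera-node1/nova')
    reader = sqlalchemy.create_engine('mysql://nova:secret@galera-vip/nova')

    with writer.begin() as conn:
        conn.execute(text("INSERT INTO foo VALUES (1)"))

    # Note the next point below: without wsrep_sync_wait this read can
    # still be stale if it lands on a different node.
    rows = reader.execute(text("SELECT * FROM foo")).fetchall()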
>
>
> * Write followed by read on a different node can return stale data
>
> During a commit, Galera replicates a transaction out to all other db
> nodes. Due to its design, Galera knows these transactions will be
> successfully committed to the remote node eventually[2], but it doesn't
> commit them straight away. The remote node will check these outstanding
> replication transactions for write conflicts on commit, but not for
> read. This means that you can do:
>
> A: start transaction;
> A: insert into foo values(1)
> A: commit;
> B: select * from foo; <-- May not contain the value we inserted above[3]
>
> This means that even for 'synchronous' slaves, if a client makes an RPC
> call which writes a row to write master A, then another RPC call which
> expects to read that row from synchronous slave node B, there's no
> default guarantee that it'll be there.
>
> Galera exposes a session variable which will fix this: wsrep_sync_wait
> (or wsrep_causal_reads on older mysql). However, this isn't the default.
> It presumably has a performance cost, but I don't know what it is, or
> how it scales with various workloads.
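
For what it's worth, setting it per session just before a read that
must observe a write made on another node looks roughly like this (a
sketch, assuming SQLAlchemy; on older Galera releases the variable is
wsrep_causal_reads, as Matthew says):

    import sqlalchemy
    from sqlalchemy import text

    engine = sqlalchemy.create_engine('mysql://nova:secret@galera-node2/nova')

    with engine.connect() as conn:
        # Ask Galera to apply all already-replicated write-sets before
        # serving the next read, so a row committed on another node is
        # guaranteed to be visible here.
        conn.execute(text("SET SESSION wsrep_sync_wait = 1"))
        rows = conn.execute(text("SELECT * FROM foo")).fetchall()

The cost is that the read blocks until the node has caught up, which is
presumably where the performance impact Matthew mentions comes from.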
>
>
> Because these are semantic issues, they aren't things which can be
> easily guarded with an if statement. We can't say:
>
> if galera:
> try:
> commit
> except:
> rewind time
>
> If we are to support this DB at all, we have to structure code in the
> first place to allow for its semantics.
>
> Matt
>
> [1] No, really: I just read a bunch of docs and blogs today. If anybody
> who is an expert would like to validate/correct that would be great.
>
> [2]
> http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/
>
> [3]
> http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
> --
> Matthew Booth
> Red Hat Engineering, Virtualisation Team
>
> Phone: +442070094448 (UK)
> GPG ID: D33C3490
> GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
>