[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

Matthew Booth mbooth at redhat.com
Wed Feb 4 16:30:32 UTC 2015


I've spent a few hours today reading about Galera, a clustering solution
for MySQL. Galera provides multi-master 'virtually synchronous'
replication between multiple mysql nodes. i.e. I can create a cluster of
3 mysql dbs and read and write from any of them with certain consistency
guarantees.

I am no expert[1], but this is a TL;DR of a couple of things which I
didn't know, but feel I should have done. The semantics are important to
application design, which is why we should all be aware of them.


* Commit will fail if there is a replication conflict

foo is a table with a single field, which is its primary key.

A: start transaction;
B: start transaction;
A: insert into foo values(1);
B: insert into foo values(1); <-- 'regular' DB would block here, and
                                  report an error on A's commit
A: commit; <-- success
B: commit; <-- KABOOM

Confusingly, Galera will report a 'deadlock' to node B, despite this not
being a deadlock by any definition I'm familiar with.

Essentially, anywhere that a regular DB would block, Galera will not
block transactions on different nodes. Instead, it will cause one of the
transactions to fail on commit. This is still ACID, but the semantics
are quite different.

The impact of this is that code which makes correct use of locking may
still fail with a 'deadlock'. The solution to this is to either fail the
entire operation, or to re-execute the transaction and all its
associated code in the expectation that it won't fail next time.

As I understand it, these can be eliminated by sending all writes to a
single node, although that obviously makes less efficient use of your
cluster.


* Write followed by read on a different node can return stale data

During a commit, Galera replicates a transaction out to all other db
nodes. Due to its design, Galera knows these transactions will be
successfully committed to the remote node eventually[2], but it doesn't
commit them straight away. The remote node will check these outstanding
replication transactions for write conflicts on commit, but not for
read. This means that you can do:

A: start transaction;
A: insert into foo values(1)
A: commit;
B: select * from foo; <-- May not contain the value we inserted above[3]

This means that even for 'synchronous' slaves, if a client makes an RPC
call which writes a row to write master A, then another RPC call which
expects to read that row from synchronous slave node B, there's no
default guarantee that it'll be there.

Galera exposes a session variable which will fix this: wsrep_sync_wait
(or wsrep_causal_reads on older mysql). However, this isn't the default.
It presumably has a performance cost, but I don't know what it is, or
how it scales with various workloads.


Because these are semantic issues, they aren't things which can be
easily guarded with an if statement. We can't say:

if galera:
  try:
    commit
  except:
    rewind time

If we are to support this DB at all, we have to structure code in the
first place to allow for its semantics.

Matt

[1] No, really: I just read a bunch of docs and blogs today. If anybody
who is an expert would like to validate/correct that would be great.

[2]
http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/

[3]
http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
-- 
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490



More information about the OpenStack-dev mailing list