[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
Jay Pipes
jaypipes at gmail.com
Wed Feb 4 19:04:10 UTC 2015
On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote:
> On Wed, Feb 04, 2015 at 04:30:32PM +0000, Matthew Booth wrote:
>> I've spent a few hours today reading about Galera, a clustering solution
>> for MySQL. Galera provides multi-master 'virtually synchronous'
>> replication between multiple mysql nodes. i.e. I can create a cluster of
>> 3 mysql dbs and read and write from any of them with certain consistency
>> guarantees.
>>
>> I am no expert[1], but this is a TL;DR of a couple of things which I
>> didn't know, but feel I should have done. The semantics are important to
>> application design, which is why we should all be aware of them.
>>
>>
>> * Commit will fail if there is a replication conflict
>>
>> foo is a table with a single field, which is its primary key.
>>
>> A: start transaction;
>> B: start transaction;
>> A: insert into foo values(1);
>> B: insert into foo values(1); <-- a 'regular' DB would block here, and
>> report a duplicate-key error to B once A commits
>> A: commit; <-- success
>> B: commit; <-- KABOOM
>>
>> Confusingly, Galera will report a 'deadlock' to node B, despite this not
>> being a deadlock by any definition I'm familiar with.
It is a failure to certify the writeset, which bubbles up as an InnoDB
deadlock error. See my article here, which explains this:
http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/
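For the curious, here is a minimal sketch (not from the thread) that
reproduces the scenario above with SQLAlchemy and two cluster nodes; the
connection URLs, credentials and exception handling are illustrative
assumptions:

    from sqlalchemy import create_engine, text
    from sqlalchemy.exc import DBAPIError

    # Node addresses and database name are placeholders; 'foo' is the
    # single-column table from the example above.
    engine_a = create_engine("mysql+pymysql://user:secret@node-a/test")
    engine_b = create_engine("mysql+pymysql://user:secret@node-b/test")

    conn_a = engine_a.connect()
    conn_b = engine_b.connect()
    txn_a = conn_a.begin()
    txn_b = conn_b.begin()

    conn_a.execute(text("INSERT INTO foo VALUES (1)"))
    # On a single server this second INSERT would block on A's row lock;
    # Galera nodes only take local locks, so it returns immediately.
    conn_b.execute(text("INSERT INTO foo VALUES (1)"))

    txn_a.commit()        # A's writeset certifies and replicates
    try:
        txn_b.commit()    # B's writeset fails certification
    except DBAPIError as exc:
        # Surfaces as MySQL's deadlock error (1213), even though no
        # classic deadlock has occurred.
        print("node B commit failed: %s" % exc)
    finally:
        conn_a.close()
        conn_b.close()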
> Yes! And if I can add more information (I hope I'm not making a
> mistake), I think it's a known issue which comes from MySQL; that is
> why we have a decorator to do a retry and so handle this case here:
>
> http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177
It's not an issue with MySQL. It's an issue for any database code that
runs under high contention.
Almost all highly distributed or concurrent applications need to handle
deadlocks, and the most common way to handle deadlocks on database
records is a retry technique. There's nothing new about that with
Galera.
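As a rough illustration of that retry technique (not Nova's actual
decorator; that lives at the api.py link above), a sketch along these
lines, assuming oslo.db's DBDeadlock exception and a hypothetical DB API
function. The retry cap and backoff are arbitrary choices:

    import functools
    import time

    # Import path varies by oslo.db release; older releases shipped it
    # under the 'oslo.db' namespace package instead.
    from oslo_db import exception as db_exc

    def retry_on_deadlock(max_retries=5, interval=0.5):
        """Re-execute the wrapped DB API call when a deadlock is reported.

        On Galera a certification failure bubbles up as the same
        deadlock error, so the whole unit of work is simply re-run.
        """
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(1, max_retries + 1):
                    try:
                        return func(*args, **kwargs)
                    except db_exc.DBDeadlock:
                        if attempt == max_retries:
                            raise
                        time.sleep(interval * attempt)
            return wrapper
        return decorator

    @retry_on_deadlock()
    def quota_reserve(context, project_id, deltas):
        """Hypothetical DB API call that runs one transaction."""
        ...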
The issue with our use of the @_retry_on_deadlock decorator is *not*
that the retry decorator is not needed, but rather that it is used too
frequently. The compare-and-swap technique I describe in the article
above dramatically* reduces the number of deadlocks that occur (and need
to be handled by the @_retry_on_deadlock decorator) and dramatically
reduces the contention over critical database sections.
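For readers who haven't seen the article, the compare-and-swap pattern
boils down to something like the sketch below; the quota_usages table,
column names and connection URL are simplified stand-ins rather than
Nova's actual schema or code:

    from sqlalchemy import create_engine, text

    engine = create_engine("mysql+pymysql://user:secret@node-a/nova")

    def consume_quota(project_id, amount):
        with engine.begin() as conn:
            # 1. Read the current value. Note: no SELECT ... FOR UPDATE,
            #    so no row lock is taken.
            current = conn.execute(
                text("SELECT in_use FROM quota_usages"
                     " WHERE project_id = :pid"),
                {"pid": project_id},
            ).scalar()

            # 2. Write it back only if it is still the value we read
            #    (the compare-and-swap). If another transaction updated
            #    the row in the meantime, zero rows match and nothing
            #    is written.
            updated = conn.execute(
                text("UPDATE quota_usages SET in_use = :new"
                     " WHERE project_id = :pid AND in_use = :old"),
                {"new": current + amount,
                 "pid": project_id,
                 "old": current},
            ).rowcount

        # 3. Zero rows means we lost the race: the caller re-reads and
        #    retries at the application level instead of waiting on row
        #    locks or relying on a deadlock exception.
        return updated == 1

In practice the read-and-swap is wrapped in a short retry loop, so a lost
race costs one extra round trip rather than a lock wait or a rolled-back
transaction.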
Best,
-jay
* My colleague Pavel Kholkin is putting together the results of a
benchmark run that compares the compare-and-swap method with the raw
@_retry_on_deadlock decorator method. Spoiler: the compare-and-swap
method cuts the runtime of the benchmark by almost *half*.
>> Essentially, anywhere that a regular DB would block, Galera will not
>> block transactions on different nodes. Instead, it will cause one of the
>> transactions to fail on commit. This is still ACID, but the semantics
>> are quite different.
>>
>> The impact of this is that code which makes correct use of locking may
>> still fail with a 'deadlock'. The solution to this is to either fail the
>> entire operation, or to re-execute the transaction and all its
>> associated code in the expectation that it won't fail next time.
>>
>> As I understand it, these can be eliminated by sending all writes to a
>> single node, although that obviously makes less efficient use of your
>> cluster.
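A bare-bones sketch of that single-writer arrangement at the application
level (in real deployments it is usually done in a load balancer such as
HAProxy rather than in code); node addresses are placeholders:

    import random

    from sqlalchemy import create_engine, text

    # Every write goes to one designated node, so write-write conflicts
    # are resolved by ordinary row locks on that node rather than by
    # failed certification; reads can still be spread across the cluster.
    writer = create_engine("mysql+pymysql://user:secret@node-a/test")
    readers = [
        create_engine("mysql+pymysql://user:secret@node-%s/test" % name)
        for name in ("a", "b", "c")
    ]

    def db_write(sql, params=None):
        with writer.begin() as conn:
            return conn.execute(text(sql), params or {}).rowcount

    def db_read(sql, params=None):
        with random.choice(readers).connect() as conn:
            return conn.execute(text(sql), params or {}).fetchall()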
>>
>>
>> * Write followed by read on a different node can return stale data
>>
>> During a commit, Galera replicates a transaction out to all other db
>> nodes. Due to its design, Galera knows these transactions will be
>> successfully committed to the remote node eventually[2], but it doesn't
>> commit them straight away. The remote node will check these outstanding
>> replication transactions for write conflicts on commit, but not for
>> read. This means that you can do:
>>
>> A: start transaction;
>> A: insert into foo values(1)
>> A: commit;
>> B: select * from foo; <-- May not contain the value we inserted above[3]
>>
>> This means that even for 'synchronous' slaves, if a client makes an RPC
>> call which writes a row to write master A, then another RPC call which
>> expects to read that row from synchronous slave node B, there's no
>> default guarantee that it'll be there.
>>
>> Galera exposes a session variable which will fix this: wsrep_sync_wait
>> (or wsrep_causal_reads on older mysql). However, this isn't the default.
>> It presumably has a performance cost, but I don't know what it is, or
>> how it scales with various workloads.
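A minimal sketch of turning that on for a single session, assuming a
reasonably recent Galera (the variable name differs on older releases, as
noted above); the node URL is a placeholder:

    from sqlalchemy import create_engine, text

    node_b = create_engine("mysql+pymysql://user:secret@node-b/test")

    with node_b.connect() as conn:
        # Make SELECTs on this session wait until the node has applied
        # every writeset replicated before the read was issued.
        conn.execute(text("SET SESSION wsrep_sync_wait = 1"))
        # Older variant:
        # conn.execute(text("SET SESSION wsrep_causal_reads = ON"))

        rows = conn.execute(text("SELECT * FROM foo")).fetchall()

Conceptually, the cost is that each such read blocks until the node's
apply queue has caught up, so it grows with how far the node lags behind
the writer.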
>>
>>
>> Because these are semantic issues, they aren't things which can be
>> easily guarded with an if statement. We can't say:
>>
>>   if galera:
>>       try:
>>           commit
>>       except:
>>           rewind time
>>
>> If we are to support this DB at all, we have to structure code in the
>> first place to allow for its semantics.
>>
>> Matt
>>
>> [1] No, really: I just read a bunch of docs and blogs today. If anybody
>> who is an expert would like to validate/correct, that would be great.
>>
>> [2]
>> http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/
>>
>> [3]
>> http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
>> --
>> Matthew Booth
>> Red Hat Engineering, Virtualisation Team
>>
>> Phone: +442070094448 (UK)
>> GPG ID: D33C3490
>> GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490