[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

Attila Fazekas afazekas at redhat.com
Mon Feb 9 18:02:55 UTC 2015





----- Original Message -----
> From: "Jay Pipes" <jaypipes at gmail.com>
> To: openstack-dev at lists.openstack.org, "Pavel Kholkin" <pkholkin at mirantis.com>
> Sent: Wednesday, February 4, 2015 8:04:10 PM
> Subject: Re: [openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera
> 
> On 02/04/2015 12:05 PM, Sahid Orentino Ferdjaoui wrote:
> > On Wed, Feb 04, 2015 at 04:30:32PM +0000, Matthew Booth wrote:
> >> I've spent a few hours today reading about Galera, a clustering solution
> >> for MySQL. Galera provides multi-master 'virtually synchronous'
> >> replication between multiple mysql nodes. i.e. I can create a cluster of
> >> 3 mysql dbs and read and write from any of them with certain consistency
> >> guarantees.
> >>
> >> I am no expert[1], but this is a TL;DR of a couple of things which I
> >> didn't know, but feel I should have done. The semantics are important to
> >> application design, which is why we should all be aware of them.
> >>
> >>
> >> * Commit will fail if there is a replication conflict
> >>
> >> foo is a table with a single field, which is its primary key.
> >>
> >> A: start transaction;
> >> B: start transaction;
> >> A: insert into foo values(1);
> >> B: insert into foo values(1); <-- 'regular' DB would block here, and
> >>                                    report an error on A's commit
> >> A: commit; <-- success
> >> B: commit; <-- KABOOM
> >>
> >> Confusingly, Galera will report a 'deadlock' to node B, despite this not
> >> being a deadlock by any definition I'm familiar with.
> 
> It is a failure to certify the writeset, which bubbles up as an InnoDB
> deadlock error. See my article here:
> 
> http://www.joinfu.com/2015/01/understanding-reservations-concurrency-locking-in-nova/
> 
> Which explains this.

I do not see why not to use `FOR UPDATE` even with multi-writer or
Is the retry/swap way really solves anything here.

Using 'FOR UPDATE' in with 'repeatable read' isolation level, seams still more efficient
and has several advantages.

* The SELECT with 'FOR UPDATE' will read the committed version, so you do not really need to
  worry about when the transaction actually started. You will get fresh data before you reaching the
  actual UPDATE.

* In the article the example query will not return 
  new version of data in the same transaction even if you are retrying, so
  you need to restart the transaction anyway.

  When you are using the 'FOR UPDATE' way if any other transaction successfully commits conflicting
  row on any other galera writer, your pending transaction will be rolled back at your next statement,
  WITHOUT spending any time in certificating that transaction.
  In this perspective the checking the number after the update `Compare and swap` or
  handling an exception does not makes any difference.

* Using FOR UPDATE in a galera transaction (multi-writer) is not more evil than using UPDATE, 
  concurrent commit invalidates both of them in the same way (DBDeadlock).  

* The 'FOR UPDATE' if you are using just a `single writer` does not lets other threads to do useless work
  while wasting resources.

* The swap way also can be rolled back by galera almost anywhere (DBDeadLock).
  At the end the swap way looks like it just replaced  the exception handling,
  with a return code check + manual transaction restart.

Am I missed something ?

> > Yes ! and if I can add more information and I hope I do not make
> > mistake I think it's a know issue which comes from MySQL, that is why
> > we have a decorator to do a retry and so handle this case here:
> >
> >    http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177
> 
> It's not an issue with MySQL. It's an issue with any database code that
> is highly contentious.
> 
> Almost all highly distributed or concurrent applications need to handle
> deadlock issues, and the most common way to handle deadlock issues on
> database records is using a retry technique. There's nothing new about
> that with Galera.
> 
> The issue with our use of the @_retry_on_deadlock decorator is *not*
> that the retry decorator is not needed, but rather it is used too
> frequently. The compare-and-swap technique I describe in the article
> above dramatically* reduces the number of deadlocks that occur (and need
> to be handled by the @_retry_on_deadlock decorator) and dramatically
> reduces the contention over critical database sections.
> 
> Best,
> -jay
> 
> * My colleague Pavel Kholkin is putting together the results of a
> benchmark run that compares the compare-and-swap method with the raw
> @_retry_on_deadlock decorator method. Spoiler: the compare-and-swap
> method cuts the runtime of the benchmark by almost *half*.
> 
> >> Essentially, anywhere that a regular DB would block, Galera will not
> >> block transactions on different nodes. Instead, it will cause one of the
> >> transactions to fail on commit. This is still ACID, but the semantics
> >> are quite different.
> >>
> >> The impact of this is that code which makes correct use of locking may
> >> still fail with a 'deadlock'. The solution to this is to either fail the
> >> entire operation, or to re-execute the transaction and all its
> >> associated code in the expectation that it won't fail next time.
> >>
> >> As I understand it, these can be eliminated by sending all writes to a
> >> single node, although that obviously makes less efficient use of your
> >> cluster.
> >>
> >>
> >> * Write followed by read on a different node can return stale data
> >>
> >> During a commit, Galera replicates a transaction out to all other db
> >> nodes. Due to its design, Galera knows these transactions will be
> >> successfully committed to the remote node eventually[2], but it doesn't
> >> commit them straight away. The remote node will check these outstanding
> >> replication transactions for write conflicts on commit, but not for
> >> read. This means that you can do:
> >>
> >> A: start transaction;
> >> A: insert into foo values(1)
> >> A: commit;
> >> B: select * from foo; <-- May not contain the value we inserted above[3]
> >>
> >> This means that even for 'synchronous' slaves, if a client makes an RPC
> >> call which writes a row to write master A, then another RPC call which
> >> expects to read that row from synchronous slave node B, there's no
> >> default guarantee that it'll be there.
> >>
> >> Galera exposes a session variable which will fix this: wsrep_sync_wait
> >> (or wsrep_causal_reads on older mysql). However, this isn't the default.
> >> It presumably has a performance cost, but I don't know what it is, or
> >> how it scales with various workloads.
> >>
> >>
> >> Because these are semantic issues, they aren't things which can be
> >> easily guarded with an if statement. We can't say:
> >>
> >> if galera:
> >>    try:
> >>      commit
> >>    except:
> >>      rewind time
> >>
> >> If we are to support this DB at all, we have to structure code in the
> >> first place to allow for its semantics.
> >>
> >> Matt
> >>
> >> [1] No, really: I just read a bunch of docs and blogs today. If anybody
> >> who is an expert would like to validate/correct that would be great.
> >>
> >> [2]
> >> http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/
> >>
> >> [3]
> >> http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/
> >> --
> >> Matthew Booth
> >> Red Hat Engineering, Virtualisation Team
> >>
> >> Phone: +442070094448 (UK)
> >> GPG ID:  D33C3490
> >> GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490
> >>
> >> __________________________________________________________________________
> >> OpenStack Development Mailing List (not for usage questions)
> >> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> > __________________________________________________________________________
> > OpenStack Development Mailing List (not for usage questions)
> > Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> 
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 



More information about the OpenStack-dev mailing list