[openstack-dev] [all][oslo.db][nova] TL; DR Things everybody should know about Galera

Peter Boros peter.boros at percona.com
Fri Feb 6 15:22:44 UTC 2015


Hi Angus and everyone,

I would like to reply to a couple of things:
- The behavior of overlapping transactions depends on the
transaction isolation level, even on a single server, and
for any database. Others pointed this out earlier as well.

- The deadlock error from Galera can be confusing, but the point is
that the application can actually treat this as a deadlock (or apply
whatever retry logic it would apply to any failed transaction). I
don't know whether it would be even more confusing from the
developer's point of view if it reported a brute force error instead.
Transactions can fail in a database; in the initial example the
transaction will fail with a duplicate key error. The result is pretty
much the same from the application's perspective: the transaction was
not successful (it failed as a block), and the application should
handle the failure. There are many more reasons for a transaction to
fail regardless of the database engine. Some of these failures are
persistent (for example, the disk is full underneath the database), and
some are intermittent in nature, like the case above. A good retry
mechanism handles the intermittent failures well, depending on the
application logic.

- As many others said before me, consistent reads can be achieved
by turning wsrep_causal_reads on in the session. I can shed some light
on how this works. Nodes in Galera participate in group
communication, and a global order of transactions is established as
part of this. Since the global order of transactions is known, a
session with wsrep_causal_reads on puts a "marker" in the local
replication queue. Because transaction ordering is global, the session
simply blocks until all the transactions ahead of that marker in the
replication queue have been processed. So setting
wsrep_causal_reads imposes additional latency only on the given
select we use it for (it literally just waits for the queue to be
processed up to the current transaction). Because of this, manually
checking the global transaction ids is not necessary.

- On synchronous replication: Galera only transmits the data
synchronously; it doesn't apply it synchronously. A transaction is sent
in parallel to the rest of the cluster nodes (to be accurate, it's
only sent to the nodes that are in the same group segment, but it
waits until all the group segments get the data). Once the other nodes
have received it, the transaction commits locally; the others apply it
later. The cluster can do this because of certification, and because
certification is deterministic (the result of certification is the
same on all nodes; if it is not, the nodes have a different state,
for example because one of them was written to locally). Replication
uses write sets, which are practically row-based MySQL binary log
events plus some metadata. The metadata is good for two things: you can
look at two write sets and tell whether they conflict, and you can
decide whether a write set is applicable to a database. Because this is
checked at certification time, the apply step can run in parallel
(certification guarantees that the transactions do not conflict).
When it comes to consistency and replication speed, there are no
miracles, only tradeoffs. Two-phase commit is relatively slow and
distributed locking is relatively slow; this is a lot faster, but the
application has to handle transaction failures (which it should
probably handle anyway).
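
To make the retry point above concrete, here is a minimal sketch in
Python. TransientDBError and run_with_retry are hypothetical names used
only for illustration; a real application would catch whatever
exception its driver or oslo.db raises for a deadlock. Only the
intermittent failure is retried, anything else propagates to the caller
right away.

```python
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver's deadlock / certification failure."""


def run_with_retry(txn, attempts=3, delay=0.0):
    """Call txn(); re-run the whole transaction on an intermittent failure."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except TransientDBError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            time.sleep(delay * attempt)  # simple linear backoff
```

The transaction is passed in as a callable so that the whole block is
re-run from the start, since it failed as a block.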
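
For the wsrep_causal_reads point, a sketch of how a single read could
opt in. It assumes a Python DB-API 2.0 cursor that is already connected
to a Galera node; causal_select is a hypothetical helper name, not part
of any library.

```python
def causal_select(cursor, query, params=()):
    """Run one SELECT with wsrep_causal_reads enabled on this session."""
    cursor.execute("SET SESSION wsrep_causal_reads = ON")
    try:
        # This statement waits until the local replication queue has been
        # processed up to the point where the read was issued.
        cursor.execute(query, params)
        return cursor.fetchall()
    finally:
        # Switch it back off so later reads on this session skip the wait.
        cursor.execute("SET SESSION wsrep_causal_reads = OFF")
```

Only the reads that really need to see an earlier write pay the extra
latency; everything else on the connection reads locally as before.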
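
The certification idea can be illustrated with a toy model. This is not
Galera's actual algorithm or data layout, only a sketch under
simplifying assumptions: a write set is reduced to the set of row keys
it touches plus the last global sequence number its transaction had
seen, and every node feeds the same globally ordered stream through the
same check. Certifier and its fields are hypothetical names.

```python
class Certifier:
    """Toy certification: same ordered input, same verdicts on every node."""

    def __init__(self):
        self.seqno = 0         # sequence number of the last ordered write set
        self.last_writer = {}  # row key -> seqno of the last certified write

    def certify(self, keys, seen_seqno):
        """keys: rows the write set touches.
        seen_seqno: last seqno the originating transaction had seen."""
        self.seqno += 1
        for key in keys:
            if self.last_writer.get(key, 0) > seen_seqno:
                return False   # a concurrent transaction changed this row
        for key in keys:
            self.last_writer[key] = self.seqno
        return True


# Two "nodes" process the same ordered stream independently.
stream = [({"row1"}, 0), ({"row1"}, 0), ({"row2"}, 0), ({"row1"}, 1)]
node_a, node_b = Certifier(), Certifier()
verdicts_a = [node_a.certify(keys, seen) for keys, seen in stream]
verdicts_b = [node_b.certify(keys, seen) for keys, seen in stream]
```

Both nodes reject the second write set (it conflicts with the first on
row1) without exchanging any further messages, which is why no
distributed locking or two-phase commit is needed in this scheme.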

Here is the Percona XtraDB Cluster documentation (Percona Server with Galera):
http://www.percona.com/doc/percona-xtradb-cluster/5.6/#user-s-manual

Here is the multi-master replication part of the documentation:
http://www.percona.com/doc/percona-xtradb-cluster/5.6/features/multimaster-replication.html


On Fri, Feb 6, 2015 at 3:36 AM, Angus Lees <gus at inodes.org> wrote:
> On Fri Feb 06 2015 at 12:59:13 PM Gregory Haynes <greg at greghaynes.net>
> wrote:
>>
>> Excerpts from Joshua Harlow's message of 2015-02-06 01:26:25 +0000:
>> > Angus Lees wrote:
>> > > On Fri Feb 06 2015 at 4:25:43 AM Clint Byrum <clint at fewbar.com>
>> > > wrote:
>> > >     I'd also like to see consideration given to systems that handle
>> > >     distributed consistency in a more active manner. etcd and
>> > >     Zookeeper are both such systems, and might serve as efficient
>> > >     guards for critical sections without raising latency.
>> > >
>> > >
>> > > +1 for moving to such systems.  Then we can have a repeat of the above
>> > > conversation without the added complications of SQL semantics ;)
>> > >
>> >
>> > So just an fyi:
>> >
>> > http://docs.openstack.org/developer/tooz/ exists.
>> >
>> > Specifically:
>> >
>> >
>> > http://docs.openstack.org/developer/tooz/developers.html#tooz.coordination.CoordinationDriver.get_lock
>> >
>> > It has a locking api that it provides (that plugs into the various
>> > backends); there is also a WIP https://review.openstack.org/#/c/151463/
>> > driver that is being worked on for etcd.
>> >
>>
>> An interesting note about the etcd implementation is that you can
>> select per-request whether you want to wait for quorum on a read or not.
>> This means that in theory you could obtain higher throughput for most
>> operations which do not require this and then only gain quorum for
>> operations which require it (e.g. locks).
>
>
> Along those lines and in an effort to be a bit less doom-and-gloom, I spent
> my lunch break trying to find non-marketing documentation on the Galera
> replication protocol and how it is exposed. (It was surprisingly difficult
> to find such information *)
>
> It's easy to get the transaction ID of the last commit
> (wsrep_last_committed), but I can't find a way to wait until at least a
> particular transaction ID has been synced.  If we can find that latter
> functionality, then we can expose that sequencer all the way through (HTTP
> header?) and then any follow-on commands can mention the sequencer of the
> previous write command that they really need to see the effects of.
>
> In practice, this should lead to zero additional wait time, since the Galera
> replication has almost certainly already caught up by the time the second
> command comes in - and we can just read from the local server with no
> additional delay.
>
> See the various *Index variables in the etcd API, for how the same idea gets
> used there.
>
>  - Gus
>
> (*) In case you're also curious, the only doc I found with any details was
> http://galeracluster.com/documentation-webpages/certificationbasedreplication.html
> and its sibling pages.



-- 
Peter Boros, Principal Architect, Percona
Telephone: +1 888 401 3401 ext 546
Emergency: +1 888 401 3401 ext 911
Skype: percona.pboros


