[openstack-dev] [Fuel][MySQL][DLM][Oslo][DB][Trove][Galera][operators] Multi-master writes look OK, OCF RA and more things

Clint Byrum clint at fewbar.com
Tue May 17 18:55:07 UTC 2016

I missed your reply originally, so sorry for the 2 week lag...

Excerpts from Mike Bayer's message of 2016-04-30 15:14:05 -0500:
> On 04/30/2016 10:50 AM, Clint Byrum wrote:
> > Excerpts from Roman Podoliaka's message of 2016-04-29 12:04:49 -0700:
> >>
> >
> > I'm curious why you think setting wsrep_sync_wait=1 wouldn't help.
> >
> > The exact example appears in the Galera documentation:
> >
> > http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sync-wait
> >
> > The moment you say 'SET SESSION wsrep_sync_wait=1', the behavior should
> > prevent the list problem you see, and it should not matter that it is
> > a separate session, as that is the entire point of the variable:
> we prefer to keep it off and just point applications at a single node 
> using master/passive/passive in HAProxy, so that we don't have the 
> unnecessary performance hit of waiting for all transactions to 
> propagate; we just stick on one node at a time.   We've fixed a lot of 
> issues in our config in ensuring that HAProxy definitely keeps all 
> clients on exactly one Galera node at a time.

Indeed, haproxy does a good job at shifting over rapidly. But it's not
atomic, so you will likely have a few seconds where commits still land
on the newly demoted backup.

> >
> > "When you enable this parameter, the node triggers causality checks in
> > response to certain types of queries. During the check, the node blocks
> > new queries while the database server catches up with all updates made
> > in the cluster to the point where the check was begun. Once it reaches
> > this point, the node executes the original query."
> >
> > In the active/passive case where you never use the passive node as a
> > read slave, one could actually set wsrep_sync_wait=1 globally. This will
> > cause a ton of lag while new queries happen on the new active and old
> > transactions are still being applied, but that's exactly what you want,
> > so that when you fail over, nothing proceeds until all writes from the
> > original active node are applied and available on the new active node.
> > It would help if your failover technology actually _breaks_ connections
> > to a presumed dead node, so writes stop happening on the old one.
> If HAProxy is failing over from the master, which is no longer 
> reachable, to another passive node, which is reachable, that means that 
> master is partitioned and will leave the Galera primary component.   It 
> also means all current database connections are going to be bounced off, 
> which will cause errors for those clients either in the middle of an 
> operation, or if a pooled connection is reused before it is known that 
> the connection has been reset.  So failover is usually not an error-free 
> situation in any case from a database client perspective and retry 
> schemes are always going to be needed.

There are some really big assumptions above, so I want to enumerate them:

1. You assume that a partition between haproxy and a node is a partition
   between that node and the other galera nodes.
2. You assume that I never want to failover on purpose, smoothly.

In the case of (1), there are absolutely times where the load balancer
thinks a node is dead while it is quite happily chugging along doing its
job. In this scenario there will be transactions already committed on
that node that have not yet propagated, and there may be more than one
load balancer, with only one of them thinking that node is dead.

For the limited partition problem, having wsrep_sync_wait turned on
would result in consistency, and the lag would only be minimal as the
transactions propagate onto the new primary server.
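
To make the session-scoped use of wsrep_sync_wait concrete, here is a
minimal sketch. The helper name causal_reads and the cursor argument are
hypothetical; it assumes any DB-API cursor connected to a Galera node,
and it assumes the session default is 0, which is what it resets to.

```python
from contextlib import contextmanager


@contextmanager
def causal_reads(cursor, mask=1):
    """Turn on Galera causality checks for the enclosed block only.

    `cursor` is assumed to be a DB-API cursor on a Galera node.
    mask=1 asks for checks on READ statements.
    """
    # int() ensures only a number is ever interpolated into the SQL.
    cursor.execute("SET SESSION wsrep_sync_wait = %d" % int(mask))
    try:
        yield cursor
    finally:
        # Assumption: the normal session value is 0 (no checks), so a
        # pooled connection does not keep paying the wait afterwards.
        cursor.execute("SET SESSION wsrep_sync_wait = 0")
```

Wrapping only the read that must see a prior write, e.g.
`with causal_reads(cur): cur.execute("SELECT ...")`, keeps the wait off
every other query on that connection.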

For the multiple haproxy problem, lag would be _horrible_ on all nodes
that are getting reads as long as there's another one getting writes,
so a solution for making sure only one is specified would need to be
developed using a leader election strategy. If haproxy is able to query
wsrep status, that might be ideal, as galera will in fact elect leaders
for you (assuming all of your wsrep nodes are also mysql nodes, which
is not the case if you're using 2 nodes + garbd for example).
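
As a rough illustration of what "query wsrep status" could mean, the
sketch below decides whether a node should receive traffic from the
output of SHOW STATUS LIKE 'wsrep_%'. The function name and the idea of
feeding it from an external health check script are assumptions, not
something haproxy does by itself, and it does not perform the election.

```python
def node_is_usable(status):
    """Return True if wsrep status says the node can safely take traffic.

    `status` is assumed to be a dict built from the rows of
    SHOW STATUS LIKE 'wsrep_%' on that node.
    """
    return (
        # Node belongs to the primary component (not partitioned away).
        status.get("wsrep_cluster_status") == "Primary"
        # Node accepts queries.
        and status.get("wsrep_ready") == "ON"
        # Node has caught up with the cluster.
        and status.get("wsrep_local_state_comment") == "Synced"
    )
```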

This is, however, a bit of a strawman, as most people don't need
active/active haproxy nodes, so the simplest solution is to go
active/passive on your haproxy nodes with something like UCARP handling
the failover there. As long as they all use the same primary/backup
ordering, then a new UCARP target should just result in using the same
node, and a very tiny window for inconsistency and connection errors.
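
The "same primary/backup ordering" point can be sketched as follows:
if every balancer walks the same ordered list and takes the first node
it considers up, balancers with the same view pick the same node. The
names pick_backend, ordered_nodes and is_up are hypothetical.

```python
def pick_backend(ordered_nodes, is_up):
    """Return the first node in the shared ordering that is up, else None.

    `ordered_nodes` is the primary/backup list configured identically on
    every balancer; `is_up` is that balancer's own health check.
    """
    for node in ordered_nodes:
        if is_up(node):
            return node
    return None
```

Two balancers only disagree when their health views differ, which is
the limited partition case described earlier.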

The second assumption is handled by leader election as well. If there's
always one leader node that load balancers send traffic to, then one
should be able to force promotion of a different node as the leader,
and all new transactions and queries go to the new leader. The window
for that would be pretty small, and so wsrep_sync_wait time should
be able to be very low, if not 0. I'm not super familiar with the way
haproxy gracefully reloads configuration, but if you can just change
the preferred server and poke it with a signal that sends new stuff to
the new master, then you only have a window the size of however long
the last transaction takes to worry about inconsistency.

> Additionally, the purpose of the enginefacade [1] is to allow Openstack 
> applications to fix their often incorrectly written database access 
> logic such that in many (most?) cases, a single logical operation is no 
> longer unnecessarily split among multiple transactions when possible. 
> I know that this is not always feasible in the case where multiple web 
> requests are coordinating, however.

Yeah, that's really the problem. You can't be in control of all of the
ways the data is expected to be consistent. IMO, we should do a better
job in our API contracts to specify whether data is consistent or
not. A lot of this angst over whether we even need to deal with races
with Galera would go away if we could just make clear guarantees about
reads after writes.

> That leaves only the very infrequent scenario of, the master has 
> finished sending a write set off, the passives haven't finished 
> committing that write set, the master goes down and HAProxy fails over 
> to one of the passives, and the application that just happens to also be 
> connecting fresh onto that new passive node in order to perform the next 
> operation that relies upon the previously committed data so it does not 
> see a database error, and instead runs straight onto the node where the 
> committed data it's expecting hasn't arrived yet.   I can't make the 
> judgment for all applications if this scenario can't be handled like any 
> other transient error that occurs during a failover situation, however 
> if there is such a case, then IMO the wsrep_sync_wait (formerly known as 
> wsrep_causal_reads) may be used on a per-transaction basis for that very 
> critical, not-retryable-even-during-failover operation.  Allowing this 
> variable to be set for the scope of a transaction and reset afterwards, 
> and only when talking to Galera, is something we've planned to work into 
> the enginefacade as well as a declarative transaction attribute that 
> would be a pass-through on other systems.

It's not infrequent if you're failing over so you can update the current
master without interrupting service. Thinking through the common case
so that it doesn't erupt in database errors (a small percentage is ok,
a large percentage is not) or inconsistencies in the data seems like a
prudent thing to do.
