[openstack-dev] [Fuel][MySQL][DLM][Oslo][DB][Trove][Galera][operators] Multi-master writes look OK, OCF RA and more things

Bogdan Dobrelya bdobrelia at mirantis.com
Wed May 18 08:02:38 UTC 2016

On 05/17/2016 08:55 PM, Clint Byrum wrote:
> I missed your reply originally, so sorry for the 2 week lag...
> Excerpts from Mike Bayer's message of 2016-04-30 15:14:05 -0500:
>> On 04/30/2016 10:50 AM, Clint Byrum wrote:
>>> Excerpts from Roman Podoliaka's message of 2016-04-29 12:04:49 -0700:
>>> I'm curious why you think setting wsrep_sync_wait=1 wouldn't help.
>>> The exact example appears in the Galera documentation:
>>> http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html#wsrep-sync-wait
>>> The moment you say 'SET SESSION wsrep_sync_wait=1', the behavior should
>>> prevent the list problem you see, and it should not matter that it is
>>> a separate session, as that is the entire point of the variable:
>> we prefer to keep it off and just point applications at a single node 
>> using master/passive/passive in HAProxy, so that we don't have the 
>> unnecessary performance hit of waiting for all transactions to 
>> propagate; we just stick on one node at a time.   We've fixed a lot of 
>> issues in our config in ensuring that HAProxy definitely keeps all 
>> clients on exactly one Galera node at a time.
> Indeed, haproxy does a good job at shifting over rapidly. But it's not
> atomic, so you will likely have a few seconds during which commits land
> on the newly demoted backup.
>>> "When you enable this parameter, the node triggers causality checks in
>>> response to certain types of queries. During the check, the node blocks
>>> new queries while the database server catches up with all updates made
>>> in the cluster to the point where the check was begun. Once it reaches
>>> this point, the node executes the original query."
>>> In the active/passive case where you never use the passive node as a
>>> read slave, one could actually set wsrep_sync_wait=1 globally. This will
>>> cause a ton of lag while new queries happen on the new active and old
>>> transactions are still being applied, but that's exactly what you want,
>>> so that when you fail over, nothing proceeds until all writes from the
>>> original active node are applied and available on the new active node.
>>> It would help if your failover technology actually _breaks_ connections
>>> to a presumed dead node, so writes stop happening on the old one.
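
For reference, the global versus per-session forms of this setting look
like this (a sketch only; values other than 0/1 select different causality
check types, per the Galera docs linked above):

```sql
-- Global: every new session pays the causality check
-- (the active/passive case described above).
SET GLOBAL wsrep_sync_wait = 1;

-- Per session: only this connection waits for the cluster to catch up.
SET SESSION wsrep_sync_wait = 1;
-- ... run the read that must observe all prior cluster writes ...
SET SESSION wsrep_sync_wait = 0;
```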
>> If HAProxy is failing over from the master, which is no longer 
>> reachable, to another passive node, which is reachable, that means that 
>> master is partitioned and will leave the Galera primary component.   It 
>> also means all current database connections are going to be bounced off, 
>> which will cause errors for those clients either in the middle of an 
>> operation, or if a pooled connection is reused before it is known that 
>> the connection has been reset.  So failover is usually not an error-free 
>> situation in any case from a database client perspective and retry 
>> schemes are always going to be needed.
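
Such a retry scheme can be sketched in a few lines of Python (an
illustration only: `TransientDBError` is a stand-in for the DBAPI
disconnect/deadlock exceptions a real driver raises, not oslo.db's
actual API):

```python
import time

class TransientDBError(Exception):
    """Stands in for a disconnect/deadlock error raised during failover."""

def run_with_retry(operation, attempts=3, delay=0.1):
    """Retry a whole logical transaction on transient failover errors."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientDBError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)  # back off while haproxy finishes the switch

# Example: the first attempt hits the failover window, the retry succeeds.
calls = {"count": 0}

def flaky_commit():
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientDBError("MySQL server has gone away")
    return "committed"

result = run_with_retry(flaky_commit)
```

The key point is that the retry must wrap the whole logical transaction,
not an individual statement, since the new node has no trace of the old
connection's in-flight work.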
> There are some really big assumptions above, so I want to enumerate
> them:
> 1. You assume that a partition between haproxy and a node is a partition
>    between that node and the other galera nodes.
> 2. You assume that I never want to failover on purpose, smoothly.
> In the case of (1), there are absolutely times where the load balancer
> thinks a node is dead, and it is quite happily chugging along doing its
> job. Transactions will be already committed in this scenario that have
> not propagated, and there may be more than one load balancer, and only
> one of them thinks that node is dead.
> For the limited partition problem, having wsrep_sync_wait turned on
> would result in consistency, and the lag would only be minimal as the
> transactions propagate onto the new primary server.
> For the multiple haproxy problem, lag would be _horrible_ on all nodes
> that are getting reads as long as there's another one getting writes,
> so a solution for making sure only one is specified would need to be
> developed using a leader election strategy. If haproxy is able to query
> wsrep status, that might be ideal, as galera will in fact elect leaders
> for you (assuming all of your wsrep nodes are also mysql nodes, which
> is not the case if you're using 2 nodes + garbd for example).
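
FWIW, the wsrep-status check such a strategy needs is small; a sketch
(the function name is mine, and the status values are those reported by
`SHOW GLOBAL STATUS LIKE 'wsrep%'`):

```python
def is_writable_galera_node(status):
    """Return True if a node should receive writes, given its wsrep status.

    `status` is a dict of variables as reported by
    SHOW GLOBAL STATUS LIKE 'wsrep%'. A node is safe to write to only
    when it belongs to the Primary component and is Synced (state 4).
    """
    return (status.get("wsrep_cluster_status") == "Primary"
            and status.get("wsrep_local_state") == "4")

# A synced node in the primary component qualifies; a donor (state 2) or a
# node stuck in a non-primary partition does not.
synced = {"wsrep_cluster_status": "Primary", "wsrep_local_state": "4"}
donor = {"wsrep_cluster_status": "Primary", "wsrep_local_state": "2"}
partitioned = {"wsrep_cluster_status": "non-Primary",
               "wsrep_local_state": "4"}
```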
> This is, however, a bit of a strawman, as most people don't need
> active/active haproxy nodes, so the simplest solution is to go
> active/passive on your haproxy nodes with something like UCARP handling
> the failover there. As long as they all use the same primary/backup
> ordering, then a new UCARP target should just result in using the same
> node, and a very tiny window for inconsistency and connection errors.
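
The consistent primary/backup ordering mentioned above is just the usual
active/passive backend shape; an illustrative haproxy fragment (server
names and addresses are made up, and many deployments use a
clustercheck-style HTTP check rather than the plain TCP check shown):

```
backend galera
    mode tcp
    option tcpka
    # Keep this ordering identical on every haproxy instance, so a
    # failover (or a new UCARP target) lands on the same node.
    server galera1 192.0.2.11:3306 check
    server galera2 192.0.2.12:3306 check backup
    server galera3 192.0.2.13:3306 check backup
```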
> The second assumption is handled by leader election as well. If there's
> always one leader node that load balancers send traffic to, then one
> should be able to force promotion of a different node as the leader,
> and all new transactions and queries go to the new leader. The window
> for that would be pretty small, and so wsrep_sync_wait time should
> be able to be very low, if not 0. I'm not super familiar with the way
> haproxy gracefully reloads configuration, but if you can just change
> the preferred server and poke it with a signal that sends new stuff to
> the new master, then you only have a window the size of however long
> the last transaction takes to worry about inconsistency.
>> Additionally, the purpose of the enginefacade [1] is to allow Openstack 
>> applications to fix their often incorrectly written database access 
>> logic such that in many (most?) cases, a single logical operation is no 
>> longer unnecessarily split among multiple transactions when possible. 
>> I know that this is not always feasible in the case where multiple web 
>> requests are coordinating, however.
> Yeah, that's really the problem. You can't be in control of all of the
> ways the data is expected to be consistent. IMO, we should do a better
> job in our API contracts to specify whether data is consistent or
> not. A lot of this angst over whether we even need to deal with races
> with Galera would go away if we could just make clear guarantees about
> reads after writes.
>> That leaves only the very infrequent scenario of, the master has 
>> finished sending a write set off, the passives haven't finished 
>> committing that write set, the master goes down and HAProxy fails over 
>> to one of the passives, and the application that just happens to also be 
>> connecting fresh onto that new passive node in order to perform the next 
>> operation that relies upon the previously committed data so it does not 
>> see a database error, and instead runs straight onto the node where the 
>> committed data it's expecting hasn't arrived yet.   I can't make the 
>> judgment for all applications if this scenario can't be handled like any 
>> other transient error that occurs during a failover situation, however 
>> if there is such a case, then IMO the wsrep_sync_wait (formerly known as 
>> wsrep_causal_reads) may be used on a per-transaction basis for that very 
>> critical, not-retryable-even-during-failover operation.  Allowing this 
>> variable to be set for the scope of a transaction and reset afterwards, 
>> and only when talking to Galera, is something we've planned to work into 
>> the enginefacade as well as a declarative transaction attribute that 
>> would be a pass-through on other systems.
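
That per-transaction scoping could look roughly like the following sketch
(not the actual enginefacade design; the connection here is a stub that
merely records the SQL it is given):

```python
from contextlib import contextmanager

@contextmanager
def causal_reads(connection):
    """Enable wsrep_sync_wait for one critical transaction, then reset it.

    On non-Galera backends this would be a pass-through; here we always
    emit the SET statements for illustration.
    """
    connection.execute("SET SESSION wsrep_sync_wait = 1")
    try:
        yield connection
    finally:
        connection.execute("SET SESSION wsrep_sync_wait = 0")

class RecordingConnection:
    """Stub standing in for a DBAPI/SQLAlchemy connection."""
    def __init__(self):
        self.statements = []
    def execute(self, sql):
        self.statements.append(sql)

conn = RecordingConnection()
with causal_reads(conn) as c:
    c.execute("SELECT 1")  # the read that must see prior cluster writes
```

Only the one critical read pays the synchronization cost; everything else
on the session runs with the variable off.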
> It's not infrequent if you're failing over so you can update the current
> master without interrupting service. Thinking through the common case
> so that it doesn't erupt in database errors (a small percentage is ok,
> a large percentage is not) or inconsistencies in the data seems like a
> prudent thing to do.

Please note that one of the main purposes of the subject work *is* to
demonstrate that there are no remaining blockers to using A/A Galera
without backup/standby nodes, which would allow all of the complex
things above to be skipped.

I believe the additional test cases added in Appendix B also cover the
most generic case of an OpenStack service running a DB transaction via
the SQLAlchemy ORM (or oslo.db's enginefacade?): one that rolls back
transactions on deadlocks, uses the default REPEATABLE READ isolation
level, does not use with_lockmode() / with_for_update(), and, it seems,
does not require wsrep_sync_wait/wsrep_causal_reads to be enabled.

Please correct me if I'm missing something important. I suggest quickly
reviewing the DB-related code in OpenStack projects (perhaps looking for
things like *.query or *.filter_by?) and then recommending that all
operators switch to A/A writes and reads where a Galera cluster is used.

I'm not sure, though, how to address/cover with Jepsen test cases the
issues Roman P. described above, or how to compose a test demonstrating
a benefit from wsrep_sync_wait=1 or other values >0. Any help is
appreciated.


Best regards,
Bogdan Dobrelya,
Irc #bogdando
