[Openstack-operators] Openstack and mysql galera with haproxy

Sławek Kapłoński slawek at kaplonski.pl
Tue Sep 23 20:05:20 UTC 2014


Hello,


On Monday, 22 September 2014 at 22:02:26, Sławek Kapłoński wrote:
> Hello,
> 
> Answers below
> 
> ---
> Best regards
> Sławek Kapłoński
> slawek at kaplonski.pl
> 
> On Monday, 22 September 2014 at 13:41:51, Jay Pipes wrote:
> > Hi Peter, Sławek, answers inline...
> > 
> > On 09/22/2014 08:12 AM, Peter Boros wrote:
> > > Hi,
> > > 
> > > StaleDataError is not raised by MySQL, but rather by SQLAlchemy. After a
> > > quick look, it seems like SQLAlchemy raises this if the UPDATE affected a
> > > different number of rows than it expected. I am not sure what the
> > > expectation is based on; perhaps somebody can chime in and we can put
> > > this together. What is the transaction isolation level you are running
> > > on?
> > 
> > The transaction isolation level is REPEATABLE_READ, unless Sławek has
> > changed the defaults (unlikely).
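> > 
> > For reference, a quick way to confirm the effective level, assuming you
> > can query a Galera node directly, is:
> > 
> >     SELECT @@global.tx_isolation, @@session.tx_isolation;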
> 
> I definitely didn't change it.
> 
> > > For the timeout setting in neutron: that's a good way to approach it
> > > too, you can even be more aggressive and set it to a few seconds. In
> > > MySQL making connections is very cheap (at least compared to other
> > > databases), an idle timeout of a few seconds for a connection is
> > > typical.
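> > > 
> > > As an untested illustration only (option names and sections vary between
> > > releases, so check your own neutron.conf/nova.conf; the setting is
> > > referred to as sql_idle_timeout / idle_connection_timeout in this
> > > thread), it would look roughly like this:
> > > 
> > >     [database]
> > >     connection = mysql://neutron:secret@10.0.0.10/neutron
> > >     # recycle idle connections quickly so they do not outlive a failover
> > >     idle_timeout = 30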
> > > 
> > > On Mon, Sep 22, 2014 at 12:35 PM, Sławek Kapłoński <slawek at kaplonski.pl>
> 
> wrote:
> > >> Hello,
> > >> 
> > >> Thanks for your explanations. I thought so, and I have now decreased
> > >> "idle_connection_timeout" in neutron and nova. Now, when the master
> > >> server comes back to the cluster, within less than one minute all
> > >> connections are made to this master node again, because the old
> > >> connections which were made to the backup node are closed. So for now it
> > >> looks almost perfect, but when I now test the cluster (with the master
> > >> node active and all connections established to this node), in neutron I
> > >> still sometimes see errors like:
> > >> StaleDataError: UPDATE statement on table 'ports' expected to update 1
> > >> row(s); 0 were matched.
> > >> 
> > >> and also today I found errors like:
> > >> 2014-09-22 11:38:05.715 11474 INFO sqlalchemy.engine.base.Engine [-]
> > >> ROLLBACK 2014-09-22 11:38:05.784 11474 ERROR
> > >> neutron.openstack.common.db.sqlalchemy.session [-] DB exception
> > >> wrapped.
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session Traceback (most recent
> > >> call
> > >> last):
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-
> > >> packages/neutron/openstack/common/db/sqlalchemy/session.py", line 524,
> > >> in
> > >> _wrap
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     return f(*args,
> > >> **kwargs) 2014-09-22 11:38:05.784 11474 TRACE
> > 
> > From looking up the code, it looks like you are using Havana [1]. The
> > 
> > code in the master branch of Neutron now uses oslo.db, not
> > neutron.openstack.common.db, so this issue may have been resolved in
> > later versions of Neutron.
> 
> Yes, I'm using Havana and I have no possibility to upgrade it quickly to
> Icehouse (I don't even want to think about the master branch :)). Do you
> mean that this problem will still exist in Havana and can't be fixed in
> that release?
> 
> > [1]
> > https://github.com/openstack/neutron/blob/stable/havana/neutron/openstack/common/db/sqlalchemy/session.py#L524
> > 
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-
> > >> packages/neutron/openstack/common/db/sqlalchemy/session.py", line 718,
> > >> in
> > >> flush 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     return
> > >> super(Session,
> > >> self).flush(*args, **kwargs)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line
> > >> 1818,
> > >> in
> > >> flush
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     self._flush(objects)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line
> > >> 1936,
> > >> in
> > >> _flush
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session
> > >> transaction.rollback(_capture_exception=True)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/util/langhelpers.py", line
> > >> 58, in __exit__
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session
> > >> compat.reraise(exc_type,
> > >> exc_value, exc_tb)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line
> > >> 1900,
> > >> in
> > >> _flush
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session
> > >> flush_context.execute()
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/unitofwork.py", line
> > >> 372, in execute
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     rec.execute(self)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/unitofwork.py", line
> > >> 525, in execute
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     uow
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/persistence.py", line
> > >> 64, in save_obj
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     table, insert)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/persistence.py", line
> > >> 541, in _emit_insert_statements
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     execute(statement,
> > >> multiparams)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 662,
> > >> in
> > >> execute
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     params)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 761,
> > >> in
> > >> _execute_clauseelement
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     compiled_sql,
> > >> distilled_params
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 874,
> > >> in
> > >> _execute_context
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     context)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line
> > >> 1027,
> > >> in
> > >> _handle_dbapi_exception
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session
> > >> util.reraise(*exc_info)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 856,
> > >> in
> > >> _execute_context
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     context)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/sqlalchemy/connectors/mysqldb.py",
> > >> line
> > >> 60, in do_executemany
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     rowcount =
> > >> cursor.executemany(statement, parameters)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session   File
> > >> "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 206, in
> > >> executemany
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session     r = r +
> > >> self.execute(query, a)
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session TypeError: unsupported
> > >> operand type(s) for +: 'int' and 'NoneType'
> > 
> > Hmm, this is just bad coding in the MySQLdb driver, frankly. It is
> > assuming a call to Cursor.execute() will return an integer, but it can
> > return None in some circumstances. See code here:
> > 
> > http://sourceforge.net/p/mysql-python/code/ci/8041cc6df636b9c42d52e01b727aa98b43f3632c/tree/MySQLdb/MySQLdb/cursors.py
> > 
> > Note that MySQLdb1 (the above, which is what is packaged in your Linux
> > distro I believe) is old and buggy. The maintainer has moved on to
> > MySQLdb2, which has a different call interface in this part of the code.
> 
> OK, I checked that code and found that in fact the "execute" method sometimes
> returns None, and when it is called from executemany this causes the problem.
> A simple change to the executemany method such as:
> 
>     if not m:
>         r = 0
>         for a in args:
>             iter_r = self.execute(query, a)
>             if iter_r:
>                 r = r + iter_r
>         return r
> 
> works for me in the tests which I made, but I don't know what the result of
> that change will be over a longer time period, or why the execute method
> returns that None value :/
> 
> You wrote that MySQLdb2 has a different call interface in this part of the
> code - would a change to MySQLdb2 fix that problem or not? If yes, can you
> explain how I can switch to it? Should I just install another package, or
> maybe set something somewhere in the config? (I'm using Ubuntu 12.04)
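> 
> For anyone who wants to exercise that fallback path outside of neutron, a
> minimal check could look roughly like this (DELETE has no VALUES clause, so
> executemany() goes through the per-row execute() loop); the connection
> parameters below are only placeholders for a throwaway test database:
> 
>     import MySQLdb
> 
>     conn = MySQLdb.connect(host="127.0.0.1", user="test", passwd="secret",
>                            db="test")
>     cur = conn.cursor()
>     cur.execute("CREATE TABLE IF NOT EXISTS t (id INT PRIMARY KEY)")
>     # no VALUES clause here, so MySQLdb falls back to one execute() per row
>     cur.executemany("DELETE FROM t WHERE id = %s", [(1,), (2,), (3,)])
>     conn.commit()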
> 
> > >> 2014-09-22 11:38:05.784 11474 TRACE
> > >> neutron.openstack.common.db.sqlalchemy.session
> > >> 
> > >> And I have to investigate why it is happening, because with a single
> > >> mysql
> > >> server I have no such errors :/
> > 
> > Not sure, frankly. The code is executing many INSERT or UPDATE
> > statements in a single block. The MySQL connection is clearly getting
> > borked on one of those attempts and results in the traceback you see
> > above.
> 
> I'm not 100% sure, but on a much bigger cluster with one database server
> there was no such problem even once, so I supposed that it is somehow
> related to Galera, which I now want to use.

Today I checked that the problem always occurs when neutron tries to delete a
row from the ipavailabilityranges table, and when I moved the whole database to
the old (single) mysql server the problem was the same. So it is not related to
Galera or haproxy. This problem is strange to me, but I think that the small
patch for the python-mysqldb library (as I wrote in the previous email) solves
it :)

> 
> > best,
> > -jay
> > 
> > >> ---
> > >> Best regards
> > >> Sławek Kapłoński
> > >> slawek at kaplonski.pl
> > >> 
> > >> On Monday, 22 September 2014 at 11:18:27, Peter Boros wrote:
> > >>> Hi,
> > >>> 
> > >>> Let me answer this and one of your previous questions in one because
> > >>> they are related.
> > >>> 
> > >>> Earlier you wrote:
> > >>>> I made such modifications today in my infra and generally it looks
> > >>>> better now. I don't see deadlocks. But I have one more problem with
> > >>>> that: generally it works fine when the main node is active, but when
> > >>>> this node is down, haproxy connects to one of the backup nodes. All is
> > >>>> still OK, but the problem is when the main node is up again - all new
> > >>>> connections are made to the main node, while the connections which
> > >>>> were made to the backup node are still active, so neutron (or nova)
> > >>>> is using connections to two servers, and then there are problems with
> > >>>> deadlocks again.
> > >>>> Do you know how to prevent such a situation?
> > >>> 
> > >>> This is because of how haproxy works. Haproxy's load balancing is TCP
> > >>> level, once the TCP connection is established, haproxy has nothing to
> > >>> do with it. If the MySQL application (neutron in this case) uses
> > >>> persistent connections, haproxy doesn't make an extra decision upon
> > >>> failover, because the connection is already established. This can be
> > >>> mitigated by using haproxy 1.5 and defining
> > >>> the backend with on-marked-down shutdown-sessions, this will kill the
> > >>> connections at the TCP level on the formerly active node. Or in case
> > >>> of graceful failover, include killing connections in the failover
> > >>> script on the formerly active node. The application in this case will
> > >>> get error 1 or 2 you described.
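> > >>> 
> > >>> As a rough sketch only (names and addresses are placeholders, adjust to
> > >>> your own topology), an haproxy 1.5 backend of that kind would look
> > >>> something like:
> > >>> 
> > >>>     listen mysql-galera
> > >>>         bind 10.0.0.10:3306
> > >>>         mode tcp
> > >>>         # mysql-check assumes a dedicated check user exists on the nodes
> > >>>         option mysql-check user haproxy_check
> > >>>         server db1 10.0.0.11:3306 check on-marked-down shutdown-sessions
> > >>>         server db2 10.0.0.12:3306 check backup on-marked-down shutdown-sessions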
> > >>> 
> > >>> From your description, errors 1 and 2 are related to killed
> > >>> connections. Case 1 (MySQL server has gone away) happens when the
> > >>> connection was killed (but not at the MySQL protocol level) while it
> > >>> was idle, and the application is attempting to re-use it. In this case
> > >>> the correct behaviour would be re-establishing the connection. Error 2
> > >>> is the same thing, but while the connection was actually doing
> > >>> something, reconnecting and retrying is the correct behaviour. These
> > >>> errors are not avoidable if the node dies non-gracefully. A server
> > >>> can for example lose power while doing the transaction, in this case
> > >>> the transaction will be aborted, and the application will get one of
> > >>> the errors described above. The application has to know that the data
> > >>> is not written, since it didn't do commit or database didn't
> > >>> acknowledge the commit.
> > >>> 
> > >>> On Sat, Sep 20, 2014 at 12:11 AM, Sławek Kapłoński
> > >>> <slawek at kaplonski.pl>
> > >> 
> > >> wrote:
> > >>>> Hello,
> > >>>> 
> > >>>> New questions below
> > >>>> 
> > >>>> ---
> > >>>> Best regards
> > >>>> Sławek Kapłoński
> > >>>> slawek at kaplonski.pl
> > >>>> 
> > >>>> On Thursday, 18 September 2014 at 09:45:21, Clint Byrum wrote:
> > >>>>> Excerpts from Sławek Kapłoński's message of 2014-09-18 09:29:27 -0700:
> > >>>>>> Hello,
> > >>>>>> 
> > >>>>>> Is anyone here using OpenStack with MySQL Galera and haproxy? Have
> > >>>>>> you had any problems with that?
> > >>>>>> Today I installed such an HA infrastructure for the database (two
> > >>>>>> mysql servers in a Galera cluster and haproxy on the controller and
> > >>>>>> neutron node; this haproxy connects to one of the Galera servers
> > >>>>>> with a round-robin algorithm).
> > >>>>>> Generally everything works fine, but I have a few problems:
> > >>>>>> 1. I have a lot of messages like:
> > >>>>>> WARNING neutron.openstack.common.db.sqlalchemy.session [-] Got
> > >>>>>> mysql
> > >>>>>> server
> > >>>>>> has gone away: (2006, 'MySQL server has gone away')
> > >>>>>> 2. I have (mostly in neutron) many errors like:
> > >>>>>> OperationalError: (OperationalError) (2013, 'Lost connection to
> > >>>>>> MySQL
> > >>>>>> server during query') 'UPDATE ml2_port_bindings SET vif_type=%s,
> > >>>>>> driver=%s, segment=%s WHERE ml2_port_bindings.port_id =
> > >>>>> 
> > >>>>> 1 and 2 look like timeout issues. Check haproxy's timeouts. They
> > >>>>> need
> > >>>>> to be just a little longer than MySQL's connection timeouts.
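> > >>>>> 
> > >>>>> Purely as an illustration (the numbers are examples, tune them to
> > >>>>> your own wait_timeout), the idea in haproxy terms is roughly:
> > >>>>> 
> > >>>>>     defaults
> > >>>>>         mode tcp
> > >>>>>         timeout connect 10s
> > >>>>>         # keep these a bit above MySQL's wait_timeout (e.g. 3600s)
> > >>>>>         timeout client  3650s
> > >>>>>         timeout server  3650s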
> > >>>> 
> > >>>> After I made an ACTIVE/PASSIVE cluster and changed sql_idle_timeout in
> > >>>> neutron and nova, problem 1 looks solved. Unfortunately I found that
> > >>>> when I'm deleting a port from neutron I still sometimes get errors
> > >>>> like in 2. I haven't checked the nova logs thoroughly yet, so I'm not
> > >>>> sure if it is only in neutron or in both.
> > >>>> Do you maybe know why it happens in neutron? It did not happen when I
> > >>>> had a single mysql node without haproxy and Galera, so I suppose that
> > >>>> haproxy or Galera is responsible for that problem :/
> > >>>> 
> > >>>>>> 3. Also errors:
> > >>>>>> StaleDataError: UPDATE statement on table 'ports' expected to
> > >>>>>> update
> > >>>>>> 1
> > >>>>>> row(s); 0 were matched.
> > >>>>>> 4. and errors:
> > >>>>>> DBDeadlock: (OperationalError) (1213, 'Deadlock found when trying
> > >>>>>> to
> > >>>>>> get
> > >>>>>> lock; try restarting transaction') 'UPDATE ipavailabilityranges SET
> > >>>>>> first_ip=%s WHERE ipavailabilityranges.allocation_pool_id =
> > >>>>> 
> > >>>>> 3 and 4 are a known issue. Our code doesn't always retry
> > >>>>> transactions,
> > >>>>> which is required to use Galera ACTIVE/ACTIVE. Basically, that
> > >>>>> doesn't
> > >>>>> work.
> > >>>>> 
> > >>>>> You can use ACTIVE/PASSIVE, and even do vertical partitioning where
> > >>>>> one of the servers is ACTIVE for Nova, but another one is ACTIVE for
> > >>>>> Neutron. But AFAIK, ACTIVE/ACTIVE isn't being tested and the work
> > >>>>> hasn't
> > >>>>> been done to make the concurrent transactions work properly.
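> > >>>>> 
> > >>>>> A rough sketch of that kind of per-service ACTIVE/PASSIVE split in
> > >>>>> haproxy (addresses and names are placeholders): each service talks to
> > >>>>> one Galera node, with the other node only marked as "backup":
> > >>>>> 
> > >>>>>     listen mysql-nova
> > >>>>>         bind 10.0.0.10:3306
> > >>>>>         mode tcp
> > >>>>>         server db1 10.0.0.11:3306 check
> > >>>>>         server db2 10.0.0.12:3306 check backup
> > >>>>> 
> > >>>>>     listen mysql-neutron
> > >>>>>         bind 10.0.0.10:3307
> > >>>>>         mode tcp
> > >>>>>         server db2 10.0.0.12:3306 check
> > >>>>>         server db1 10.0.0.11:3306 check backup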
> > >>>>> 


---
Best regards
Sławek Kapłoński
slawek at kaplonski.pl

