[openstack-dev] [Nova][Oslo-incubator] Automatic retry db.api query if database connection lost

Victor Sergeyev vsergeyev at mirantis.com
Mon Jul 29 15:42:28 UTC 2013


Any suggestions, please?

On Mon, Jul 22, 2013 at 11:39 AM, Victor Sergeyev <vsergeyev at mirantis.com> wrote:

> Hi All.
> There is a blueprint (
> https://blueprints.launchpad.net/nova/+spec/db-reconnect) by Devananda
> van der Veen, whose goal is to reconnect to the database and retry the
> last operation if a db connection fails. I’m working on the
> implementation of this BP in oslo-incubator (
> https://review.openstack.org/#/c/33831/).
> The function _raise_if_db_connection_lost() was added to the
> _wrap_db_error() decorator defined in
> openstack/common/db/sqlalchemy/session.py. This function catches
> sqlalchemy.exc.OperationalError and extracts the database error code from
> the exception. If the code is in the list of `database has gone away`
> error codes, the function raises a DBConnectionError exception.
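
The error-code check described above could look roughly like the sketch below. It is not the code under review: FakeOperationalError is a hypothetical stand-in for sqlalchemy.exc.OperationalError (which keeps the driver-level exception in its .orig attribute), and GONE_AWAY_CODES is an assumed, MySQL-only list (2002, 2003, 2006, 2013 are MySQL client error codes for failed or lost connections). The right list for PostgreSQL is still an open question.

```python
class DBConnectionError(Exception):
    """Raised when the database connection appears to be lost."""


class FakeOperationalError(Exception):
    """Hypothetical stand-in for sqlalchemy.exc.OperationalError.

    Like the real exception, it keeps the driver-level error in ``orig``.
    """

    def __init__(self, orig):
        super().__init__(str(orig))
        self.orig = orig


# Assumed list of MySQL client error codes that mean "connection lost".
GONE_AWAY_CODES = frozenset({2002, 2003, 2006, 2013})


def raise_if_db_connection_lost(error):
    """Raise DBConnectionError if the wrapped driver error has a listed code."""
    args = getattr(error.orig, "args", ())
    code = args[0] if args else None
    if code in GONE_AWAY_CODES:
        raise DBConnectionError(str(error)) from error
```

Any other error code (for example 1062, duplicate entry) falls through, so the caller can keep handling it as an ordinary database error.
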
> A decorator for db.api methods was added to openstack/common/db/api.py.
> We can apply this decorator to methods in db.sqlalchemy.api (not to
> individual queries).
> It catches the DBConnectionError exception and retries the last query in a
> loop until it succeeds or until the timeout is reached. The timeout value
> is configurable with min, max, and increment options.
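
A minimal sketch of such a retry decorator follows. The names retry_on_connection_lost, min_interval, max_interval and increment are hypothetical, and so is the interpretation of the three options: here the wait between attempts starts at min, grows by increment after each failure, and the decorator gives up once the wait would exceed max. The patch under review may define them differently.

```python
import functools
import time


class DBConnectionError(Exception):
    """Raised when the database connection appears to be lost."""


def retry_on_connection_lost(min_interval=1, max_interval=10, increment=1,
                             sleep=time.sleep):
    """Retry the decorated function while it raises DBConnectionError.

    ``sleep`` is injectable so the behaviour can be tested without waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            interval = min_interval
            while True:
                try:
                    return func(*args, **kwargs)
                except DBConnectionError:
                    if interval > max_interval:
                        raise  # waited long enough, give up
                    sleep(interval)
                    interval += increment
        return wrapper
    return decorator
```

Applied as @retry_on_connection_lost(min_interval=1, max_interval=3) to a db.api-style function, this makes up to four attempts with waits of 1, 2 and 3 seconds between them before re-raising.
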
> We assume that all db.api methods are executed inside a single
> transaction, so retrying the whole method when a connection is lost
> should be safe.
> I would really appreciate comments on the following points:
> 1. I can’t imagine a situation in which we would lose the connection to an
> SQLite DB. Also, as far as I know, SQLite is not used in production at the
> moment, so we don't handle this case.
> 2. Please leave some comments on the list of `database has gone away`
> error codes for MySQL and PostgreSQL.
> 3. Johannes Erdfelt suggested that “retrying the whole method, even if
> it's in a transaction, is only safe if the entire method is idempotent. A
> method could execute successfully in the database, but the connection could
> be dropped before the final status is sent to the client.”
>  I agree that this situation can cause data corruption in a database (e.g.,
> if we try to insert something into the database), but I’m not sure how an
> RDBMS handles this. Also, I haven't succeeded in creating a functional
> test case that would make the described situation easy to reproduce.
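
The scenario from point 3 can at least be simulated, if not reproduced against a real server. The sketch below uses an in-memory SQLite database only as a convenient stand-in. The instances table and the insert_then_lose_ack() and naive_retry() helpers are hypothetical. The "dropped connection" is faked by raising DBConnectionError after the commit has already succeeded.

```python
import sqlite3


class DBConnectionError(Exception):
    """Raised when the database connection appears to be lost."""


def insert_then_lose_ack(conn, uuid, name, fail):
    """Commit an INSERT, then optionally pretend the reply was lost."""
    with conn:  # commits the transaction on successful exit
        conn.execute("INSERT INTO instances VALUES (?, ?)", (uuid, name))
    if fail:
        # The row is already committed, but the caller never learns that.
        raise DBConnectionError("connection dropped after commit")


def naive_retry(conn, uuid, name):
    """Retry the whole method once, as a retry decorator would."""
    try:
        insert_then_lose_ack(conn, uuid, name, fail=True)
    except DBConnectionError:
        insert_then_lose_ack(conn, uuid, name, fail=False)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (uuid TEXT, name TEXT)")
naive_retry(conn, "abc", "vm1")
count = conn.execute("SELECT COUNT(*) FROM instances").fetchone()[0]
print(count)  # 2: the same logical insert was applied twice
```

With a UNIQUE constraint on uuid the second attempt would fail with an integrity error instead of silently duplicating the row, which is one way a non-idempotent method could detect that its first attempt had in fact been committed.
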
> Thanks, Victor
