[openstack-dev] Race in FixedIP.associate_pool

melanie witt melwittt at gmail.com
Thu Dec 21 18:52:25 UTC 2017

On Fri, 15 Dec 2017 18:38:00 -0800, Arun Sag wrote:
> Here is the sequence of actions that happens in nova-network:
> 1. allocate_for_instance calls allocate_fixed_ips
> 2. FixedIPs are successfully associated (we can see this in the log)
> 3. allocate_for_instance calls get_instance_nw_info, which in turn
> fetches the fixed IPs associated in step 2 using
> objects.FixedIPList.get_by_instance_uuid. This raises a FixedIPNotFound
> exception.
> We removed the slave and ran with just a single master, and the errors
> went away. We also switched to semi-synchronous replication between
> master and slave, and the errors went away too. All of this points to a
> race between write and read to the DB.
> Does OpenStack expect synchronous replication to read-only slaves?

No, synchronous replication to read-only slaves is not expected.

The way this is handled is that oslo.db has the notion of an "async 
reader" which is safe to use on an asynchronously updated slave database 
and a regular "reader" which is only safe to use on a synchronously 
updated slave database, else the master database will be used [1].
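
As a rough sketch of the rule described above: the following is a toy 
model only, not oslo.db code. The names Mode and choose_database are 
made up for illustration, and the real library decides this internally 
from its own configuration.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical labels for the three kinds of database access."""
    WRITER = "writer"
    READER = "reader"
    ASYNC_READER = "async_reader"


def choose_database(mode: Mode, slave_is_synchronous: bool) -> str:
    """Return which database a query would go to under the rule above."""
    # An "async reader" tolerates stale data, so any slave is acceptable.
    if mode is Mode.ASYNC_READER:
        return "slave"
    # A regular "reader" may use the slave only if it is kept in sync.
    if mode is Mode.READER and slave_is_synchronous:
        return "slave"
    # Writers, and readers facing an asynchronous slave, use the master.
    return "master"
```

With asynchronous replication, choose_database(Mode.READER, False) 
gives "master", which is the behavior expected for the query in 
question.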

In nova, we indicate to oslo.db whether a database API method is safe 
for use on an asynchronously updated slave database using decorators 
[2][3]. Only a few methods are decorated this way.

The method you're seeing the race with, fixed_ip_get_by_instance [4], is 
decorated with the "reader" decorator, indicating that it's only safe 
for a synchronously updated slave database, else it will use the master.

So, this query should *not* be going to an asynchronously updated slave 
database. If you're using asynchronous replication, it should be going 
to the master.
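
To make the decorator idea concrete, here is a minimal sketch. The 
decorators reader and async_reader, the helper database_for, and the 
function some_stale_tolerant_query are hypothetical stand-ins written 
for this example. They are not nova's or oslo.db's actual decorators, 
and the function bodies are placeholders.

```python
def reader(func):
    """Hypothetical marker: the query needs up-to-date data."""
    func.db_mode = "reader"
    return func


def async_reader(func):
    """Hypothetical marker: the query tolerates a lagging slave."""
    func.db_mode = "async_reader"
    return func


@reader
def fixed_ip_get_by_instance(instance_uuid):
    return []  # placeholder body; the real method queries the database


@async_reader
def some_stale_tolerant_query():
    return []  # placeholder body for an illustrative async-safe method


def database_for(func, replication):
    """Pick a target from the function's marker and the replication type."""
    if func.db_mode == "async_reader":
        return "slave"
    if func.db_mode == "reader" and replication == "synchronous":
        return "slave"
    return "master"
```

Under this model, database_for(fixed_ip_get_by_instance, 
"asynchronous") returns "master", so a read issued right after the 
write in step 2 would see the newly associated fixed IPs.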

Have you patched any nova/db/sqlalchemy/api method decorators or patched 
oslo.db at all to use the "async reader" for more methods? If not, then 
it's possible there is a bug in oslo.db or nova related to "async 
reader" state leaking across green threads.
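
For readers unfamiliar with this class of bug, the sketch below shows 
how mode state could leak between cooperatively scheduled tasks when 
they share one mutable object. It is a hypothetical illustration only: 
generators stand in for green threads, and SharedContext, 
async_safe_query, strict_query, run_interleaved and demo are invented 
names. It is not a diagnosis of the oslo.db or nova code.

```python
class SharedContext:
    """Hypothetical per-request DB state that ends up shared by mistake."""

    def __init__(self):
        self.mode = "reader"


def async_safe_query(ctx, log):
    previous = ctx.mode
    ctx.mode = "async_reader"  # temporarily allow the lagging slave
    yield                      # cooperative switch, e.g. waiting on I/O
    log.append(("async_safe_query", ctx.mode))
    ctx.mode = previous        # restored too late for the other task


def strict_query(ctx, log):
    # Runs while the other task is paused and records the mode it sees.
    log.append(("strict_query", ctx.mode))
    yield


def run_interleaved(tasks):
    """Round-robin scheduler: advance each generator one step per pass."""
    pending = list(tasks)
    while pending:
        for task in list(pending):
            try:
                next(task)
            except StopIteration:
                pending.remove(task)


def demo(shared):
    """Return the mode each query observed, with or without sharing."""
    log = []
    if shared:
        ctx = SharedContext()
        tasks = [async_safe_query(ctx, log), strict_query(ctx, log)]
    else:
        tasks = [async_safe_query(SharedContext(), log),
                 strict_query(SharedContext(), log)]
    run_interleaved(tasks)
    return dict(log)
```

In this model, demo(True) shows strict_query observing "async_reader" 
even though it never asked for it, while demo(False) keeps it at 
"reader" because each task owns its own context.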

Which reminds me of a fairly recent bug [5] we ran into when doing a 
concurrent scatter-gather to multiple cell databases. You might try the 
patch [6] locally to see if it changes the behavior when you have 
asynchronous replication enabled. We had thought only scatter-gather 
(which was introduced in Pike) was affected, but it's possible the async 
slave database read is affected too.

If you could try that patch, please let me know whether it helps. If it 
does, we will backport it.


[5] https://bugs.launchpad.net/nova/+bug/1722404
[6] https://review.openstack.org/#/c/511651
