[openstack-dev] Race in FixedIP.associate_pool
melwittt at gmail.com
Thu Dec 21 18:52:25 UTC 2017
On Fri, 15 Dec 2017 18:38:00 -0800, Arun Sag wrote:
> Here is the sequence of actions that happens in nova-network:
> 1. allocate_for_instance calls -> allocate_fixed_ips
> 2. FixedIPs are successfully associated (we can see this in the log)
> 3. allocate_for_instance calls get_instance_nw_info, which in turn
> gets the fixedip's associated in step 2 using
> objects.FixedIPList.get_by_instance_uuid. This raises FixedIPNotFound.
> We removed the slave and ran with just a single master, and the errors
> went away. We also switched to using semi-synchronous replication
> between master and slave, and the errors went away too. All of this
> points to a race between write and read to the DB.
> Does OpenStack expect synchronous replication to read-only slaves?
No, synchronous replication to read-only slaves is not expected.
The way this is handled is that oslo.db has the notion of an "async
reader" which is safe to use on an asynchronously updated slave database
and a regular "reader" which is only safe to use on a synchronously
updated slave database; otherwise the master database will be used.
In nova, we indicate to oslo.db whether a database API method is safe
for use on an asynchronously updated slave database using decorators.
There are few methods decorated this way.
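As a rough illustration of the routing rule described above, here is a toy model in Python. It is not oslo.db or nova code: the names Mode and pick_database are hypothetical, and it assumes a slave connection is configured.

```python
import enum


class Mode(enum.Enum):
    """Hypothetical labels for how a DB API method is decorated."""
    WRITER = "writer"
    READER = "reader"
    ASYNC_READER = "async_reader"


def pick_database(mode, slave_is_synchronous):
    """Return which database a query may use under the rule above.

    Toy model only; assumes a slave connection is configured.
    """
    if mode is Mode.WRITER:
        # Writes always go to the master.
        return "master"
    if mode is Mode.ASYNC_READER:
        # Declared safe on an asynchronously updated slave.
        return "slave"
    # Regular reader: the slave is only safe if it is updated
    # synchronously; otherwise fall back to the master.
    return "slave" if slave_is_synchronous else "master"
```

With asynchronous replication, pick_database(Mode.READER, slave_is_synchronous=False) returns "master", which is the behavior a "reader"-decorated method is expected to have.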
The method you're seeing the race with, fixed_ip_get_by_instance, is
decorated with the "reader" decorator, indicating that it's only safe
for a synchronously updated slave database, else it will use the master.
So, this query should *not* be going to an asynchronously updated slave
database. If you're using asynchronous replication, it should be going
to the master.
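To make the failure mode concrete, here is a small simulation of a read that lands on a lagging slave right after a write. It is only a sketch: ToyReplicatedDB and demo are hypothetical names, the "databases" are plain dicts, and the key and IP address are made up for illustration.

```python
class ToyReplicatedDB:
    """Toy master/slave pair with explicit, delayed replication."""

    def __init__(self):
        self.master = {}
        self.slave = {}
        self._pending = []  # writes not yet applied to the slave

    def write(self, key, value):
        # Writes hit the master immediately and are queued for the slave.
        self.master[key] = value
        self._pending.append((key, value))

    def replicate(self):
        # Apply all queued writes to the slave (asynchronous catch-up).
        for key, value in self._pending:
            self.slave[key] = value
        self._pending.clear()

    def read(self, key, use_slave):
        source = self.slave if use_slave else self.master
        return source.get(key)


def demo():
    db = ToyReplicatedDB()
    db.write("instance-uuid", "10.0.0.5")            # associate fixed IP
    stale = db.read("instance-uuid", use_slave=True)   # slave has not caught up
    fresh = db.read("instance-uuid", use_slave=False)  # master sees the write
    db.replicate()
    later = db.read("instance-uuid", use_slave=True)   # slave has caught up
    return stale, fresh, later
```

In this model the first slave read returns None, which is the analogue of FixedIPNotFound, while the master read returns the value that was just written.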
Have you patched any nova/db/sqlalchemy/api method decorators or patched
oslo.db at all to use the "async reader" for more methods? If not, then
it's possible there is a bug in oslo.db or nova related to "async
reader" state leaking across green threads.
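For what "state leaking across green threads" could look like in general, here is a purely hypothetical sketch. It does not reflect how oslo.db or nova actually store this state; generators stand in for cooperatively scheduled green threads, and the shared dict and task names are invented for illustration.

```python
# Hypothetical shared state; NOT how oslo.db or nova store this.
shared = {"allow_async": False}


def async_read_task(log):
    shared["allow_async"] = True   # enter "async reader" mode
    yield                          # cooperative switch to another task
    log.append(("async_read", "slave" if shared["allow_async"] else "master"))
    shared["allow_async"] = False  # leave "async reader" mode


def regular_read_task(log):
    # A regular reader never asks for the slave itself, but it consults
    # the shared flag that another task may have left set.
    log.append(("regular_read", "slave" if shared["allow_async"] else "master"))
    yield


def run_interleaved():
    log = []
    a = async_read_task(log)
    b = regular_read_task(log)
    next(a)  # a sets the flag, then yields
    next(b)  # b runs while the flag is still set
    for task in (a, b):
        for _ in task:  # run each task to completion
            pass
    return log
```

Here the regular read is routed to the slave only because another task's flag was still set when it ran. Keeping such state per green thread or per request context avoids this class of problem.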
This reminds me of a fairly recent bug we ran into when doing a
concurrent scatter-gather to multiple cell databases. You might try the
patch locally to see if it changes the behavior when you have
asynchronous replication enabled. We had thought only scatter-gather was
affected (which was introduced in Pike), but it's possible the async
slave database read might also be affected.
If you could try that patch, please let me know whether it helps and we
will backport it.