[openstack-dev] [nova][neutron] massive overhead processing "network-changed" events during live migration
Chris Friesen
chris.friesen at windriver.com
Fri May 19 18:40:25 UTC 2017
Recently we noticed failures in Newton when we attempted to live-migrate an
instance with 16 vifs. We tracked them down to an RPC timeout in nova caused
by waiting too long for the 'refresh_cache-%s' lock in get_instance_nw_info().
This led to a few other discoveries.
First, we have no fair locking in OpenStack. The live migration code path was
blocked waiting for the lock, but the code processing the incoming
"network-changed" events kept acquiring the lock first, even though those
events arrived after the live migration code had already started waiting.
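For clarity, by "fair" I mean first-come-first-served: the waiter that has
been blocked the longest gets the lock next. A minimal sketch of that
behaviour, purely illustrative and not oslo.concurrency or nova code, could
look like this:

import collections
import threading

class FairLock:
    """Hand the lock to waiters in strict arrival (FIFO) order."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._waiters = collections.deque()

    def acquire(self):
        ticket = threading.Event()
        with self._mutex:
            self._waiters.append(ticket)
            if len(self._waiters) == 1:
                ticket.set()            # nobody ahead of us, take the lock now
        ticket.wait()                   # otherwise block until we reach the head

    def release(self):
        with self._mutex:
            self._waiters.popleft()     # drop our own ticket from the head
            if self._waiters:
                self._waiters[0].set()  # wake the next waiter in arrival order

With the non-fair locks we have today, the "network-changed" handlers can keep
jumping the queue ahead of the already-blocked live-migration thread.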
Second, it turns out the cost of processing the "network-changed" events is
astronomical.
1) In Newton nova commit 5de902a was merged to fix evacuate bugs, but it meant
both source and dest compute nodes got the "network-changed" events. This
doubled the number of neutron API calls during a live migration.
2) Neutron sends a "network-changed" event each time something changes, and
there are several such events per vif during a live migration. In the current
upstream code the only information passed with the event is the instance ID,
so nova loops over all the ports on the instance and rebuilds all of the
subnet/floating IP/fixed IP/etc. information for that instance. This results
in O(N^2) neutron API calls, where N is the number of vifs on the instance
(see the sketch after this list).
3) mriedem has proposed a patch series (https://review.openstack.org/#/c/465783
and https://review.openstack.org/#/c/465787) that would change neutron to
include the port ID, and allow nova to update just that port. This reduces the
cost to O(N), but it's still significant.
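To make the scaling concrete, here is a rough back-of-the-envelope model of
the neutron API call counts for the two behaviours described above (the
function names are made up for illustration, this is not the actual nova
code):

def calls_full_refresh(num_vifs, events_per_vif=1):
    # Today: each "network-changed" event rebuilds the whole info cache,
    # costing roughly one neutron call per port on the instance.
    events = num_vifs * events_per_vif
    return events * num_vifs          # O(N^2)

def calls_per_port_refresh(num_vifs, events_per_vif=1):
    # Proposed: the event names the changed port, so each event refreshes
    # only that one port.
    events = num_vifs * events_per_vif
    return events                     # O(N)

for n in (1, 4, 16):
    print(n, calls_full_refresh(n), calls_per_port_refresh(n))
# e.g. with 16 vifs: 256 calls versus 16 calls per round of events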
In a hardware lab with 4 compute nodes I created 4 boot-from-volume instances,
each with 16 vifs. I then live-migrated them all in parallel. (The one on
compute-0 was migrated to compute-1, the one on compute-1 was migrated to
compute-2, etc.) The aggregate CPU usage for a few critical components on the
controller node is shown below. Note in particular the CPU usage for
neutron--it's using most of 10 CPUs for ~10 seconds, spiking to 13 CPUs. This
seems like an absurd amount of work to do just to update the cache in nova.
Labels:
L0: neutron-server
L1: nova-conductor
L2: beam.smp
L3: postgres
date         time            dt     L0 occ  L1 occ  L2 occ  L3 occ
(yyyy-mm-dd) (hh:mm:ss.dec)  (s)    (%)     (%)     (%)     (%)
2017-05-19 17:51:38.710 2.173 19.75 1.28 2.85 1.96
2017-05-19 17:51:40.012 1.302 1.02 1.75 3.80 5.07
2017-05-19 17:51:41.334 1.322 2.34 2.66 5.25 1.76
2017-05-19 17:51:42.681 1.347 91.79 3.31 5.27 5.64
2017-05-19 17:51:44.035 1.354 40.78 7.27 3.48 7.34
2017-05-19 17:51:45.406 1.371 7.12 21.35 8.66 19.58
2017-05-19 17:51:46.784 1.378 16.71 196.29 6.87 15.93
2017-05-19 17:51:48.133 1.349 18.51 362.46 8.57 25.70
2017-05-19 17:51:49.508 1.375 284.16 199.30 4.58 18.49
2017-05-19 17:51:50.919 1.411 512.88 17.61 7.47 42.88
2017-05-19 17:51:52.322 1.403 412.34 8.90 9.15 19.24
2017-05-19 17:51:53.734 1.411 320.24 5.20 10.59 9.08
2017-05-19 17:51:55.129 1.396 304.92 2.27 10.65 10.29
2017-05-19 17:51:56.551 1.422 556.09 14.56 10.74 18.85
2017-05-19 17:51:57.977 1.426 979.63 43.41 14.17 21.32
2017-05-19 17:51:59.382 1.405 902.56 48.31 13.69 18.59
2017-05-19 17:52:00.808 1.425 1140.99 74.28 15.12 17.18
2017-05-19 17:52:02.238 1.430 1013.91 69.77 16.46 21.19
2017-05-19 17:52:03.647 1.409 964.94 175.09 15.81 27.23
2017-05-19 17:52:05.077 1.430 838.15 109.13 15.70 34.12
2017-05-19 17:52:06.502 1.425 525.88 79.09 14.42 11.09
2017-05-19 17:52:07.954 1.452 614.58 38.38 12.20 17.89
2017-05-19 17:52:09.380 1.426 763.25 68.40 12.36 16.08
2017-05-19 17:52:10.825 1.445 901.57 73.59 15.90 41.12
2017-05-19 17:52:12.252 1.427 966.15 42.97 16.76 23.07
2017-05-19 17:52:13.702 1.450 902.40 70.98 19.66 17.50
2017-05-19 17:52:15.173 1.471 1023.33 59.71 19.78 18.91
2017-05-19 17:52:16.605 1.432 1127.04 64.19 16.41 26.80
2017-05-19 17:52:18.046 1.442 1300.56 68.22 16.29 24.39
2017-05-19 17:52:19.517 1.471 1055.60 71.74 14.39 17.09
2017-05-19 17:52:20.983 1.465 845.30 61.48 15.24 22.86
2017-05-19 17:52:22.447 1.464 1027.33 65.53 15.94 26.85
2017-05-19 17:52:23.919 1.472 1003.08 56.97 14.39 28.93
2017-05-19 17:52:25.367 1.448 702.50 45.42 11.78 20.53
2017-05-19 17:52:26.814 1.448 558.63 66.48 13.22 29.64
2017-05-19 17:52:28.276 1.462 620.34 206.63 14.58 17.17
2017-05-19 17:52:29.749 1.473 555.62 110.37 10.95 13.27
2017-05-19 17:52:31.228 1.479 436.66 33.65 9.00 21.55
2017-05-19 17:52:32.685 1.456 417.12 87.44 13.44 12.27
2017-05-19 17:52:34.128 1.443 368.31 87.08 11.95 14.70
2017-05-19 17:52:35.558 1.430 171.66 11.67 9.28 13.36
2017-05-19 17:52:36.976 1.417 231.82 10.57 7.03 14.08
2017-05-19 17:52:38.413 1.438 241.14 77.78 6.86 15.34
2017-05-19 17:52:39.827 1.413 85.01 63.72 5.85 14.01
2017-05-19 17:52:41.200 1.373 3.31 3.43 7.18 1.78
2017-05-19 17:52:42.556 1.357 60.68 2.94 6.51 6.16
2017-05-19 17:52:44.019 1.463 24.23 5.94 3.45 3.15
2017-05-19 17:52:45.376 1.356 0.93 3.91 5.13 0.83
2017-05-19 17:52:46.699 1.323 7.68 4.12 5.43 0.45
2017-05-19 17:52:48.033 1.334 5.85 1.70 6.00 1.91
2017-05-19 17:52:49.373 1.341 66.28 2.37 4.49 16.40
2017-05-19 17:52:50.715 1.342 31.67 3.03 3.66 6.91
2017-05-19 17:52:52.023 1.308 2.80 2.35 3.30 10.76
2017-05-19 17:52:53.330 1.307 6.94 5.78 3.25 2.30
2017-05-19 17:52:54.699 1.368 3.11 2.67 8.34 1.01
2017-05-19 17:52:56.049 1.351 23.14 2.28 2.83 5.30
2017-05-19 17:52:57.434 1.384 46.86 5.02 6.27 11.93
2017-05-19 17:52:58.803 1.370 3.78 10.26 3.08 2.08
2017-05-19 17:53:00.206 1.403 66.09 8.20 4.07 1.27
2017-05-19 17:53:01.542 1.336 63.71 9.70 3.17 4.89
2017-05-19 17:53:02.855 1.312 21.53 3.99 4.33 5.03
It seems like it should be possible to reduce the amount of work involved here.
One possibility would be to partially revert nova commit 5de902a for the
live-migration case, since it seemed to work fine in Mitaka and earlier.
Another possibility would be to include additional information about what
changed in the "network-changed" event, which would reduce the number of
queries that nova needs to make.
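For example, the event could name the changed port so nova only has to refresh
that one entry in the info cache. Something along these lines (the exact field
names are my assumption, not an agreed neutron/nova contract):

event = {
    "name": "network-changed",
    "server_uuid": "<instance-uuid>",   # all the event carries today
    "tag": "<changed-port-uuid>",       # the proposed addition
}

# If "tag" is present, nova can refresh just that port's cache entry
# instead of looping over every port on the instance.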
In a larger cloud this is going to cause major issues for NFV-type workloads,
where instances with many vifs will be relatively common.
Chris