[openstack-dev] [nova][neutron] Migration from nova-network to Neutron for large production clouds

Tim Bell Tim.Bell at cern.ch
Thu Aug 21 12:33:27 UTC 2014


On 21 Aug 2014, at 12:38, Thierry Carrez <thierry at openstack.org> wrote:

> Tim Bell wrote:
>> Michael has been posting very informative blogs on the summary of the
>> mid-cycle meetups for Nova. The one on the Nova Network to Neutron
>> migration was of particular interest to me as it raises a number of
>> potential impacts for the CERN production cloud. The blog itself is at
>> http://www.stillhq.com/openstack/juno/000014.html
>> 
>> I would welcome suggestions from the community on the approach to take
>> and areas that the nova/neutron team could review to limit the impact on
>> the cloud users.
>> 
>> For some background, CERN has been running nova-network in flat DHCP
>> mode since our first Diablo deployment. We moved to production for our
>> users in July last year and are currently supporting around 70,000
>> cores, 6 cells, 100s of projects and thousands of VMs. Upgrades
>> generally involve disabling the API layer while allowing running VMs to
>> carry on without disruption. Within the time scale of the migration to
>> Neutron (M release at the latest), these numbers are expected to double.
> 
> Thanks for bringing your concerns here. To start this discussion, it's
> worth adding some context on the currently-proposed "cold" migration
> path. During the Icehouse and Juno cycles the TC reviewed the gaps
> between the integration requirements we now place on new entrants and
> the currently-integrated projects. That resulted in a number of
> identified gaps that we asked projects to address ASAP, ideally within
> the Juno cycle.
> 
> Most of the Neutron gaps revolved around its failure to be a full
> nova-network replacement -- some gaps around supporting basic modes of
> operation, and a gap in providing a basic migration path. Neutron devs
> promised to close that in Juno, but after a bit of discussion we
> considered that a cold migration path was all we'd require them to
> provide in Juno.
> 
> That doesn't mean a "hot" or "warm" migration path can't be worked on.
> There are two questions to solve: how can we technically perform that
> migration with a minimal amount of downtime, and is it reasonable to
> mark nova-network deprecated until we've solved that issue.
> 
> On the first question, migration is typically an operational problem,
> and operators could really help to design a migration path that would
> be acceptable to them. They may require developers to add features in
> the code to support that process, but we do not even seem to be at that
> stage yet. Ideally I would like ops and devs to join forces to solve
> that technical challenge.
> 
> The answer to the second question lies in the multiple dimensions of
> "deprecated".
> 
> On one side it means "is no longer in our future plans, new usage is now
> discouraged, new development is stopped, explore your options to migrate
> out of it". I think it's extremely important that we do that as early as
> possible, to reduce duplication of effort and set expectations correctly.
> 
> On the other side it means "will be removed in release X" (not
> necessarily the next release, but you set a countdown). To do that, you
> need to be pretty confident that you'll have your ducks in a row by the
> removal date, and that you won't set operators up for a nightmare
> migration.
> 
>> For us, the concerns with the ‘cold’ approach are the user impact and
>> operational risk of such a change. Specifically,
>> 
>> 1.      A big bang approach of shutting down the cloud, upgrading and then
>> resuming the cloud would cause significant user disruption
>> 
>> 2.      The risks involved with a cloud of this size and the open source
>> network drivers would be difficult to mitigate through testing and could
>> lead to site-wide downtime
>> 
>> 3.      It may be possible to schedule VM reboots in batches, but they
>> would need to be staggered to maintain availability levels
> 
> What minimal level of "hot" would be acceptable to you ?
> 

I am wary of using phrases like "not acceptable" as they tend to lead to very binary discussions :-)

We could consider rebooting VMs. We would much rather not have to. Rebooting all at once would cause major difficulties.

Staggering the VM migrations would allow us to significantly reduce the risk, as we could pause in the event of an operational issue. My assumption is that rollback would be a major development effort, so I would prefer a way to progress with caution.

Renumbering the IP addresses of VMs would also be painful.

I think, as you say, a small team of developers and operators with this need can sit down to find the right balance between a simple migration and an implementation which does not require infinite development effort.

Since there is an Ops meetup next week in San Antonio (Michael S thought he would attend), I can suggest to Tom that he find some volunteers there, and we can then discuss further in Paris.

I'm all in favour of early announcements of deprecations so that we can start to work this through with the community. I'd also prefer not to leave it too late, as we are adding new VMs and hypervisors all the time and so the scale challenges will increase.

Tim

> -- 
> Thierry Carrez (ttx)
> 
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



