[openstack-dev] [nova] Austin summit performance VMs CI and technical debt session recap

Matt Riedemann mriedem at linux.vnet.ibm.com
Mon May 2 02:56:53 UTC 2016


On Wednesday morning we discussed the state of performance VMs CI and 
technical debt. Performance VMs are more commonly known as those taking 
advantage of network function virtualization (NFV) features, like 
SR-IOV, PCI passthrough, NUMA, CPU pinning and huge pages. The full 
etherpad is here [1].

The session started out with a recap of the existing CI testing we have 
in Nova today for NFV:

1. Intel PCI CI - pretty basic custom test(s) of booting an instance 
with a PCI device flavor and then SSHing into the guest to ensure the 
device shows up.

2. Mellanox SR-IOV CI for macvtap - networking scenario tests in Tempest 
using an SR-IOV port of type 'macvtap'.

3. Mellanox SR-IOV CI for direct - networking scenario tests in Tempest 
using an SR-IOV port of type 'direct'.

4. Intel NFV CI - custom API tests in a Tempest plugin using flavors 
that have NUMA, CPU pinning and huge pages extra specs (a rough sketch 
of this kind of flavor and port setup follows this list).
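
For anyone not familiar with what these CIs exercise, here's the rough 
shape of the flavor extra specs and SR-IOV port setup involved. The 
names (m1.nfv, private, cirros) are placeholders and the exact commands 
vary by release, so treat this as illustrative only:

  # NUMA, CPU pinning and huge pages are requested via flavor extra specs
  nova flavor-create m1.nfv auto 4096 20 4
  nova flavor-key m1.nfv set hw:numa_nodes=1 hw:cpu_policy=dedicated \
      hw:mem_page_size=large

  # SR-IOV: create a neutron port with the desired vnic_type ('direct',
  # 'macvtap', etc.) and boot the instance with that port
  neutron port-create private --name sriov-port --binding:vnic_type direct
  nova boot --flavor m1.nfv --image cirros \
      --nic port-id=<uuid of sriov-port> nfv-vm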

We then talked about gaps in testing of NFV features, the major ones being:

1. Intel NFV CI is single-node so we don't expose bugs with scheduling 
to multiple computes (we had a major bug in Nova where we'd only ever 
schedule to a single compute when using NUMA). We could potentially test 
some of this with an in-tree functional test.

2. We don't have any testing for SR-IOV ports of type 
'direct-physical', which was recently added but is buggy.

3. We don't have any testing for resize/migrate with a different PCI 
device flavor, and according to Moshe Levi from Mellanox it has never 
worked, or at least he doesn't see how it could have. Testing this 
properly would require a multinode devstack job, which we don't have 
for any of the NFV third party CIs today. Moshe has a patch up to fix 
the bug [2], but long-term we really need CI testing for this so we 
don't regress it (the scenario is sketched at the end of this list).

4. ovs-dpdk has limited testing in Nova today. The Intel Networking CI 
job runs it on any changes to nova/virt/libvirt/vif.py and on Neutron 
changes. I've asked that the module whitelist be expanded for Nova 
changes to run these tests. It also sounds like it's going to be run on 
os-vif changes, so once we integrate os-vif for ovs-dpdk we'll have some 
coverage there.

5. In general we have issues with the NFV CI systems:

a) There are different teams running the different Intel CI jobs, so 
communication and status reporting can be difficult. Sean Mooney said 
that his team might be consolidating and owning some of the various 
jobs, so that should help.

b) The Mellanox CI jobs are running on dedicated hardware and doing 
cleanups of the host between runs, but this can miss things. The Intel 
CI guys said that they use privileged containers to get around this type 
of issue. It would be great if the various teams running these CIs could 
share what they are doing and best practices, tooling, etc.

c) We might be able to run some of the Intel NFV CI testing in the 
community infra since some of the public cloud providers being used 
allow nested virt. However, Clark Boylan reported that they have 
noticed very strange and abrupt crashes when running with nested virt, 
so right now its stability is in question. Sean Mooney from Intel said 
that they could look into upstreaming some of their CI to community 
infra. We could also get an experimental job set up to see how stable 
it is and tease out the issues.
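
To make the resize gap in item 3 above concrete, the scenario with no 
CI coverage today is roughly the following (the flavors and the 
'niantic_vf' alias are made up and assume matching pci_alias and 
whitelist entries in nova.conf on the computes; what matters is that 
the PCI device request changes between the old and new flavor):

  # old flavor requests one VF via a PCI alias, new flavor requests two
  nova flavor-key pci.small set "pci_passthrough:alias"="niantic_vf:1"
  nova flavor-key pci.large set "pci_passthrough:alias"="niantic_vf:2"

  nova boot --flavor pci.small --image cirros pci-vm

  # resize to a flavor with a different PCI request and confirm; the PCI
  # claims on the source and destination hosts have to be updated
  # correctly here, which is where the bugs have been
  nova resize pci-vm pci.large
  nova resize-confirm pci-vm

Testing that end to end is exactly what needs the multinode devstack 
job mentioned above.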

--

Beyond CI testing we also talked about the gap in upstream 
documentation. The good news is there is more documentation upstream 
than I was aware of. The neutron networking guide has information on 
configuring nova/neutron for SR-IOV. The admin guide has some good 
information on CPU pinning and huge pages, and documents some of the 
more widely used flavor extra specs, but it is by no means exhaustive, 
nor is it always clear whether a flavor extra spec or an image metadata 
property should be used.
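
As an example of that ambiguity, most of these properties can be 
expressed either on the flavor or on the image, with slightly different 
spellings, and the docs don't always make it obvious which one applies 
(the flavor name and image uuid below are placeholders):

  # as a flavor extra spec ...
  nova flavor-key m1.nfv set hw:cpu_policy=dedicated hw:mem_page_size=large
  # ... or as image metadata (note the ':' becomes '_' in the property name)
  openstack image set --property hw_cpu_policy=dedicated \
      --property hw_mem_page_size=large <image uuid>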

Stephen Finucane and Ludovic Beliveau volunteered to help work on the 
documentation.

--

One of the takeaways from this session was the clear lack of NFV users 
and people from the OPNFV community in the room. At one point someone 
asked for anyone from those groups to raise their hand and maybe one 
person did. There are surely developers involved, like Moshe, Sean, 
Stephen and Ludovic, but we still have a gap between the companies 
pushing for these features and the developers doing the work. That's one 
of the reasons why the core team consistently makes NFV support a lower 
priority. Part of the issue might simply be that those stakeholders are 
in different track sessions at the same time as the design summit. 
However, some others from the core team and I were at an NFV luncheon 
on Monday to talk about what the NFV community can do to be more 
involved; we went over some of the above and specifically pointed 
people at this session, and it didn't seem to make a difference, since 
the NFV stakeholders at that luncheon didn't attend the design session.

--

On Friday during the meetup session we briefly discussed FPGAs and 
similar acceleration-type resources. There were a lot of questions 
around not only what to do about modeling these resources, but what to 
do with an instance if/when the function it needs is re-programmed. As 
an initial step, Jay Pipes, Ed Leafe and some others agreed to talk 
about how generic resource pools can model these types of resource 
classes, but this is all still a very early-stage conversation.

--

Looking ahead:

1. Moshe is taking over the SR-IOV/PCI bi-weekly IRC meeting [3]. We can 
continue some of the discussions in that meeting.

2. Sean Mooney and the Intel CI teams sound like they have some work to 
do with consolidation and potentially upstreaming some of their CI to 
community infra.

3. There are some volunteers to help dig into documentation gaps. I 
expect we can start to get an idea of concrete action items for this in 
the SR-IOV meeting.

4. Jay Pipes is working on refactoring the PCI resource tracker code as 
part of the overall scheduler effort, and Moshe is working on the 
resize/migrate bugs with respect to PCI devices. It would also be great 
if we could get away from hard-coding a PCI whitelist in nova.conf (an 
example of what that looks like today follows this list), but there 
isn't a clear picture, at least in my mind, of what this entails or who 
would drive the work. This is probably another item for the SR-IOV/PCI 
meeting.

5. We're going to document the current list of gaps (code issues, 
testing, documentation) in the Nova devref so we have something to point 
to when new features are requested. Basically, this is our list of debt, 
and we want to see that paid off before taking on new features and debt 
for NFV.
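
For reference, the hard-coded nova.conf configuration item 4 refers to 
looks roughly like this on a compute node (the vendor/product IDs and 
the alias name are examples only, and the exact option names vary 
between releases):

  [DEFAULT]
  # which PCI devices on this host may be exposed to guests; entries
  # for SR-IOV NICs also carry a "physical_network" tag
  pci_passthrough_whitelist = {"vendor_id": "8086", "product_id": "10ed"}
  # the alias referenced by the pci_passthrough:alias flavor extra spec
  pci_alias = {"vendor_id": "8086", "product_id": "10ed", "name": "niantic_vf"}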

[1] https://etherpad.openstack.org/p/newton-nova-performance-vms
[2] https://review.openstack.org/#/c/307124/
[3] http://lists.openstack.org/pipermail/openstack-dev/2016-April/093541.html

-- 

Thanks,

Matt Riedemann