[openstack-dev] [nova] Austin summit performance VMs CI and technical debt session recap
Matt Riedemann
mriedem at linux.vnet.ibm.com
Mon May 2 02:56:53 UTC 2016
On Wednesday morning we discussed the state of performance VMs CI and
technical debt. Performance VMs are more commonly known as instances
taking advantage of network function virtualization (NFV) features like
SR-IOV, PCI passthrough, NUMA, CPU pinning and huge pages. The full
etherpad is here
[1].
The session started out with a recap of the existing CI testing we have
in Nova today for NFV:
1. Intel PCI CI - pretty basic custom test(s) of booting an instance
with a PCI device flavor and then SSHing into the guest to ensure the
device shows up.
2. Mellanox SR-IOV CI for macvtap - networking scenario tests in Tempest
using an SR-IOV port of type 'macvtap'.
3. Mellanox SR-IOV CI for direct - networking scenario tests in Tempest
using an SR-IOV port of type 'direct'.
4. Intel NFV CI - custom API tests in a Tempest plugin using flavors
that have NUMA, CPU pinning and huge pages extra specs (there's a rough
sketch of what these look like right after this list).
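As a rough sketch of what these jobs exercise (the flavor name, sizes
and network name below are made up, and the exact values differ per CI
system), the NUMA/pinning/huge pages cases come down to flavor extra
specs, and the SR-IOV cases boot an instance with a Neutron port of a
given vnic_type:

  # NFV flavor extra specs (illustrative name and sizes)
  nova flavor-create nfv.small auto 2048 20 2
  nova flavor-key nfv.small set hw:numa_nodes=1 hw:cpu_policy=dedicated \
      hw:mem_page_size=large

  # SR-IOV port of a given type, then boot an instance with it
  neutron port-create private --binding:vnic_type macvtap
  nova boot --flavor nfv.small --image <image> --nic port-id=<port-uuid> \
      sriov-test-vm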
We then talked about gaps in testing of NFV features, the major ones being:
1. Intel NFV CI is single-node so we don't expose bugs with scheduling
to multiple computes (we had a major bug in Nova where we'd only ever
schedule to a single compute when using NUMA). We could potentially test
some of this with an in-tree functional test.
2. We don't have any testing for SR-IOV ports of type 'direct-physical',
which was recently added but is buggy (there's a sketch of that port
creation after this list).
3. We don't have any testing for resize/migrate with a different PCI
device flavor, and according to Moshe Levi from Mellanox it's never
worked, or he doesn't see how it could have. Testing this properly would
require a multinode devstack job, which we don't have for any of the NFV
third party CI today. Moshe has a patch up to fix the bug [2] but
long-term we really need CI testing for this so we don't regress it.
4. ovs-dpdk has limited testing in Nova today. The Intel Networking CI
job runs it on any changes to nova/virt/libvirt/vif.py and on Neutron
changes. I've asked that the list of modules which trigger the job be
expanded so that more Nova changes run these tests. It also sounds like
it's going to be run on
os-vif changes, so once we integrate os-vif for ovs-dpdk we'll have some
coverage there.
5. In general we have issues with the NFV CI systems:
a) There are different teams running the different Intel CI jobs, so
communication and status reporting can be difficult. Sean Mooney said
that his team might be consolidating and owning some of the various
jobs, so that should help.
b) The Mellanox CI jobs are running on dedicated hardware and doing
cleanups of the host between runs, but this can miss things. The Intel
CI guys said that they use privileged containers to get around this type
of issue. It would be great if the various teams running these CIs could
share what they are doing and best practices, tooling, etc.
c) We might be able to run some of the Intel NFV CI testing in the
community infra since some of the public cloud providers being used
allow nested virt. However, Clark Boylan reported that they have noticed
very strange and abrupt crashes when running with nested virt, so right
now its stability is in question. Sean Mooney from Intel said that they
could look into upstreaming some of their CI to community infra. We
could also get an experimental job set up to see how stable it is and
tease out the issues.
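For reference, the untested 'direct-physical' case in gap 2 above is
basically the same flow as the SR-IOV sketch earlier, just with the new
vnic_type so the guest gets the whole physical function rather than a
VF; a minimal sketch (network and instance names are made up):

  neutron port-create private --binding:vnic_type direct-physical
  nova boot --flavor nfv.small --image <image> --nic port-id=<port-uuid> \
      pf-test-vm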
--
Beyond CI testing we also talked about the gap in upstream
documentation. The good news is there is more documentation upstream
than I was aware of. The neutron networking guide has information on
configuring nova/neutron for using SR-IOV. The admin guide has good
information on CPU pinning and huge pages, and documents some of the
more widely used flavor extra specs, but it is by no means exhaustive,
nor is it always clear when a flavor extra spec versus an image metadata
property applies.
Stephen Finucane and Ludovic Beliveau volunteered to help work on the
documentation.
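To illustrate the flavor extra spec vs. image metadata point, CPU
pinning can be requested either way, with the extra spec using an 'hw:'
prefix and the image property using an 'hw_' prefix; a minimal sketch
(the flavor name and image UUID are placeholders):

  # as a flavor extra spec
  nova flavor-key nfv.small set hw:cpu_policy=dedicated
  # or as an image property
  glance image-update --property hw_cpu_policy=dedicated <image-uuid>

The docs should spell out which one applies when and what happens if the
two conflict.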
--
One of the takeaways from this session was the clear lack of NFV users
and people from the OPNFV community in the room. At one point someone
asked for anyone from those groups to raise their hand and maybe one
person did. There are surely developers involved, like Moshe, Sean,
Stephen and Ludovic, but we still have a gap between the companies
pushing for these features and the developers doing the work. That's one
of the reasons why the core team consistently makes NFV support a lower
priority. Part of the issue might simply be that those stakeholders are
in conference track sessions that run at the same time as the design
summit. But some others from the core team and I were at an NFV luncheon
on Monday to talk about what the NFV community can do to be more
involved; we went over some of the above and specifically pointed people
at this session, yet it didn't seem to make a difference, since the NFV
stakeholders at that luncheon didn't attend the design session.
--
On Friday during the meetup session we briefly discussed FPGAs and
similar acceleration-type resources. There were a lot of questions
around not only what to do about modeling these resources, but what to
do with an instance if/when the function it needs is re-programmed. As
an initial step, Jay Pipes, Ed Leafe and some others agreed to talk
about how generic resource pools can model these types of resource
classes, but this is all very early stage conversation.
--
Looking ahead:
1. Moshe is taking over the SR-IOV/PCI bi-weekly IRC meeting [3]. We can
continue some of the discussions in that meeting.
2. Sean Mooney and the Intel CI teams sound like they have some work to
do with consolidation and potentially upstreaming some of their CI to
community infra.
3. There are some volunteers to help dig into documentation gaps. I
expect we can start to get an idea of concrete action items for this in
the SR-IOV meeting.
4. Jay Pipes is working on refactoring the PCI resource tracker code as
part of the overall scheduler effort, and Moshe is working on the
resize/migrate bugs with respect to PCI devices. It would also be great
if we could get away from hard-coding a PCI whitelist in nova.conf
(there's a sketch of what that looks like after this list), but there
isn't a clear picture, at least in my mind, of what this entails and who
would drive the work. This is probably another item for the SR-IOV/PCI
meeting.
5. We're going to document the current list of gaps (code issues,
testing, documentation) in the Nova devref so we have something to point
to when new features are requested. Basically, this is our list of debt,
and we want to see that paid off before taking on new features and debt
for NFV.
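For anyone who hasn't dealt with it, the hard-coded whitelist in item 4
is the pci_passthrough_whitelist (and the related pci_alias) in
nova.conf on the compute nodes; a minimal sketch, with made-up
vendor/product IDs and alias name:

  [DEFAULT]
  # devices nova-compute is allowed to pass through / assign
  pci_passthrough_whitelist = {"vendor_id": "8086", "product_id": "154d", "physical_network": "physnet1"}
  # alias so a flavor can request the device via pci_passthrough:alias=a1:1
  pci_alias = {"vendor_id": "8086", "product_id": "154d", "name": "a1"}

Any change to that JSON means editing nova.conf and restarting
nova-compute, which is part of why people want to move away from it.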
[1] https://etherpad.openstack.org/p/newton-nova-performance-vms
[2] https://review.openstack.org/#/c/307124/
[3]
http://lists.openstack.org/pipermail/openstack-dev/2016-April/093541.html
--
Thanks,
Matt Riedemann