[Openstack-operators] [neutron] [os-vif] VF overcommitting and performance in SR-IOV

Blair Bethwaite blair.bethwaite at gmail.com
Tue Jan 23 02:39:20 UTC 2018


This is starting to veer into magic territory for my level of
understanding, so beware... but I believe there are (or could be,
depending on your exact hardware) PCI config space considerations.
IIUC each SR-IOV VF will have its own PCI BAR. Depending on the window
size required (which may be determined by other hardware features such
as flow-steering), you can potentially hit compatibility issues with
your server BIOS not supporting mapping of addresses above 4GB. This
can then result in the device hanging on initialisation (at server
boot) and effectively bricking the box until the device is removed.
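
For anyone wanting to see where their BARs currently sit, here is a
minimal read-only sketch. It assumes a Linux host, where each line of
/sys/bus/pci/devices/<address>/resource holds three hex fields (start,
end, flags). The names parse_pci_resource and Region are mine, and the
sample addresses in the usage note are made up; which rows correspond
to VF BARs depends on the kernel and device, so treat the row index as
informational only.

```python
from typing import List, NamedTuple

FOUR_GIB = 1 << 32
IORESOURCE_MEM = 0x00000200  # memory-resource flag bit from the Linux kernel


class Region(NamedTuple):
    index: int      # row number in the sysfs resource file
    start: int
    end: int
    size: int       # bytes
    above_4g: bool  # True if any part of the region lies at or above 4GB


def parse_pci_resource(text: str) -> List[Region]:
    """Parse the contents of a sysfs PCI 'resource' file.

    Unused rows (all zeros) and non-memory resources are skipped.
    """
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused slot
        if not flags & IORESOURCE_MEM:
            continue  # I/O port resource, not a memory BAR
        regions.append(Region(index, start, end, end - start + 1, end >= FOUR_GIB))
    return regions


def read_regions(pci_address: str) -> List[Region]:
    """Read and parse the resource file for one device, e.g. '0000:03:00.0'."""
    path = f"/sys/bus/pci/devices/{pci_address}/resource"
    with open(path) as handle:
        return parse_pci_resource(handle.read())
```

Calling read_regions("0000:03:00.0") (a hypothetical address) on a
host and printing each region's size and above_4g flag gives a quick
picture of what the BIOS has mapped above 4GB. It only reads sysfs and
changes nothing.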

We have seen this first hand on a Dell R730 with a Mellanox ConnectX-4
card (there are several other Dell 13G platforms with the same BIOS
chipsets). We were explicitly increasing the PCI BAR size for the
device (not upping the number of VFs) in relation to a memory
exhaustion issue when running MPI collective communications on hosts
with 28+ cores. We only had 16 (or maybe 32, I forget) VFs configured
in the firmware.

At the end of that support case (which resulted in a replacement NIC),
the support engineer's summary included:
"""
-When a BIOS limits the BAR to be contained in the 4GB address space -
it is a BIOS limitation.
Unfortunately, there is no way to tell - Some BIOS implementations use
proprietary heuristics to decide when to map a specific BAR below 4GB.

-When SR-IOV is enabled, and num-vfs is high, the corresponding VF BAR
can be huge.
In this case, the BIOS may exhaust the ~2GB address space that it has
available below 4GB.
In this case, the BIOS may hang – and the server won’t boot.
"""
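
To put rough numbers on the second point in that summary, here is a
back-of-the-envelope sketch. It assumes the total VF BAR space is
simply the number of VFs times the per-VF BAR size, and it uses the
~2GB figure quoted above as the window. The function names and the
per-VF sizes in the usage note are hypothetical. Real BIOS allocation
also involves alignment and other devices, so this is only a lower
bound on what is needed.

```python
MIB = 1 << 20
GIB = 1 << 30


def total_vf_bar_bytes(num_vfs: int, per_vf_bar_bytes: int) -> int:
    """Estimate the address space needed for all VF BARs of one PF."""
    return num_vfs * per_vf_bar_bytes


def fits_below_4g(
    num_vfs: int,
    per_vf_bar_bytes: int,
    window_bytes: int = 2 * GIB,  # "~2GB" from the support summary
    other_bytes: int = 0,         # space already used by other devices
) -> bool:
    """Return True if the estimated VF BAR space fits in the window."""
    needed = total_vf_bar_bytes(num_vfs, per_vf_bar_bytes) + other_bytes
    return needed <= window_bytes


# Hypothetical examples; the per-VF BAR sizes are not from any datasheet.
small = total_vf_bar_bytes(16, 8 * MIB)    # 128 MiB in total
large = total_vf_bar_bytes(127, 32 * MIB)  # 4064 MiB in total
```

With those made-up sizes, 16 VFs fit comfortably in a 2GB window
while 127 VFs at 32 MiB each would need roughly twice the window,
which is the situation the support engineer describes.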

At the very least you should ask your hardware vendors some very
specific questions before doing anything that might change your PCI
BAR sizes.

Cheers,

On 23 January 2018 at 11:44, Pedro Sousa <pgsousa at gmail.com> wrote:
> Hi,
>
> I have SR-IOV in production at some customers with the maximum number of VFs and
> didn't notice any performance issues.
>
> My understanding is that of course you will have a performance penalty if you
> consume all those VFs, because you're dividing the bandwidth across them,
> but other than that, if they are there doing nothing you won't notice anything.
>
> But I'm just talking from my experience :)
>
> Regards,
> Pedro Sousa
>
> On Mon, Jan 22, 2018 at 11:47 PM, Maciej Kucia <maciej at kucia.net> wrote:
>>
>> Thank you for the reply. I am interested in SR-IOV and pci whitelisting is
>> certainly involved.
>> I suspect that OpenStack itself can handle those numbers of devices,
>> especially in telco applications where not much scheduling is being done.
>> The feedback I am getting is from sysadmins who work on network
>> virtualization but I think this is just a rumor without any proof.
>>
>> The question is whether the performance penalty from SR-IOV drivers or PCI
>> itself is negligible. Should a cloud admin configure the maximum number of VFs
>> for flexibility, or should it be manually managed and balanced depending on
>> the application?
>>
>> Regards,
>> Maciej
>>
>>>
>>>
>>> 2018-01-22 18:38 GMT+01:00 Jay Pipes <jaypipes at gmail.com>:
>>>>
>>>> On 01/22/2018 11:36 AM, Maciej Kucia wrote:
>>>>>
>>>>> Hi!
>>>>>
>>>>> Is there any noticeable performance penalty when using multiple virtual
>>>>> functions?
>>>>>
>>>>> For simplicity I am enabling all available virtual functions in my
>>>>> NICs.
>>>>
>>>>
>>>> I presume by the above you are referring to setting your
>>>> pci_passthrough_whitelist on your compute nodes to whitelist all VFs on a
>>>> particular PF's PCI address domain/bus?
>>>>
>>>>> Sometimes application is using only few of them. I am using Intel and
>>>>> Mellanox.
>>>>>
>>>>> I do not see any performance drop but I am getting feedback that this
>>>>> might not be the best approach.
>>>>
>>>>
>>>> Who is giving you this feedback?
>>>>
>>>> The only issue with enabling (potentially 254 or more) VFs for each PF
>>>> is that each VF will end up as a record in the pci_devices table in the Nova
>>>> cell database. Multiply 254 or more times the number of PFs times the number
>>>> of compute nodes in your deployment and you can get a large number of
>>>> records that need to be stored. That said, the pci_devices table is well
>>>> indexed and even if you had 1M or more records in the table, the access of a
>>>> few hundred of those records when the resource tracker does a
>>>> PciDeviceList.get_by_compute_node() [1] will still be quite fast.
>>>>
>>>> Best,
>>>> -jay
>>>>
>>>> [1]
>>>> https://github.com/openstack/nova/blob/stable/pike/nova/compute/resource_tracker.py#L572
>>>> and then
>>>>
>>>> https://github.com/openstack/nova/blob/stable/pike/nova/pci/manager.py#L71
>>>>
>>>>> Any recommendations?
>>>>>
>>>>> Thanks,
>>>>> Maciej
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> OpenStack-operators mailing list
>>>>> OpenStack-operators at lists.openstack.org
>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>>>



-- 
Cheers,
~Blairo


