Hello, community,

   I would like to ask about best practices for using InfiniBand and RoCEv2 with OpenStack-managed virtual machines. I am new to this area, so any information is appreciated. The current use case is training and inference for deep learning applications, for example connecting to a parallel filesystem over RDMA (IB/RoCEv2).

   I did a basic search online and found that the mainline code provides sriov-agent, which can do basic VF passthrough. There is also a project named mellanox-networking that looks like it could handle IB, but it does not seem to have been updated since the Train release. None of the code mentioned above seems to handle switches/routers, which in my opinion makes it incomplete (at least for RoCEv2, where PFC/ECN etc. apparently need to be configured on the switch).

   Is there an implementation available for using IB/RoCEv2 in production? Thank you very much for sharing your insights.

--

Best Regards,

Jiatong Shen