Re: [openstack-hpc] What's the state of openstack-hpc now?
Hi,

Apologies for top-posting but I don't intend to answer all the historical project points you've raised. Regarding old things floating around on github, your mileage may vary, but I doubt at this point you want to be looking at any of that in great detail. You haven't really explained what you mean by or want from HPC in this context, so I'm guessing a little based on your other questions...

OpenStack is many things to different people and organisations, but at the software core is a very flexible infrastructure provisioning framework. HPC requires infrastructure (compute, network, storage), and OpenStack can certainly deliver it - make your deployment choices to suit your use-cases. A major choice would be whether you will use full system virtualisation or bare-metal or containers or <insert next trend> - that choice largely depends on your typical workloads and what style of cluster you want. Beyond that, compared to "typical" cloud hardware - faster CPUs, faster memory, faster network (probably with much greater east-west capacity), integration of a suitable parallel file-system.

However, OpenStack is not a HPC management / scheduling / queuing / middleware system - there are lots of those already and you should pick one that fits your requirements and then (if it helps) run it atop an OpenStack cloud (it might help, e.g., if you want to run multiple logical clusters on the same physical infrastructure, if you want to mix other more traditional cloud workloads in, if you're just doing everything with OpenStack like the other cool kids). There are lots of nuances here, e.g., where one scheduler might lend itself better to more dynamic infrastructure (adding/removing instances), another might be lighter-weight for use with a Cluster-as-a-Service deployment model, whilst another suits a multi-user managed service style cluster. I'm sure there is good experience and opinion hidden on this list if you want to interrogate those sorts of choices more specifically.

Most of the relevant choices you need to make with respect to running HPC workloads on infrastructure that is provisioned through OpenStack will come down to your hypervisor choices. My preference for now is to stick with the OpenStack community's most popular free OS and hypervisor (Ubuntu and KVM+Libvirt) - when I facilitated the hypervisor-tuning ops session at the Vancouver summit (with a bunch of folks interested in HPC on OpenStack) there was no-one in the room running a different hypervisor, though several were using RHEL. With the right tuning KVM can get you to within a hair's breadth of bare-metal performance for a wide range of CPU, memory and inter-process comms benchmarks, plus you can easily make use of PCI passthrough for latency sensitive or "difficult" devices like NICs/HCAs and GPGPUs. And the "right tuning" is not really some arcane knowledge, it's mainly about exposing host CPU capabilities, pinning vCPUs to pCPUs, and tuning or pinning and exposing NUMA topology - most of this is supported directly through OpenStack-native features now.
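To make that a little more concrete, a minimal sketch of the OpenStack-native knobs involved might look like the following (the flavor name is illustrative and exact option names vary between releases, so treat this as a pointer rather than a recipe):

    # nova.conf on compute nodes: expose the host CPU model to guests
    [libvirt]
    cpu_mode = host-passthrough

    # nova.conf on the scheduler: make sure the NUMA-aware filter is enabled
    [DEFAULT]
    scheduler_default_filters = ...,NUMATopologyFilter

    # an "HPC" flavor with pinned vCPUs, one guest NUMA node and hugepage-backed RAM
    nova flavor-create hpc.16c64m auto 65536 40 16
    nova flavor-key hpc.16c64m set hw:cpu_policy=dedicated
    nova flavor-key hpc.16c64m set hw:numa_nodes=1
    nova flavor-key hpc.16c64m set hw:mem_page_size=large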
To answer the GPU question more explicitly - yes you can do this. Mainly you need to ensure you're getting compatible hardware (GPU and relevant motherboard components) - most of the typical GPGPU choices (e.g. K80, K40, M60) will work, and you should probably be wary of PCIe switches unless you know exactly what you're doing (recommend trying before buying).

At the OpenStack level you just define the PCI devices you want OpenStack Nova to provision and you can then define custom instance-types/flavors that will get a GPU passed through. Similar things go for networking.
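As a rough sketch of that Nova configuration (the vendor/product IDs, alias and flavor names below are placeholders - check your own with lspci -nn - and the option names have moved around between releases):

    # nova.conf on GPU compute nodes: which PCI devices Nova may hand out
    pci_passthrough_whitelist = { "vendor_id": "10de", "product_id": "102d" }

    # nova.conf on controller/API nodes: give that device class an alias and
    # enable the PCI-aware scheduler filter
    pci_alias = { "vendor_id": "10de", "product_id": "102d", "name": "gpu" }
    scheduler_default_filters = ...,PciPassthroughFilter

    # a flavor that requests one such device per instance
    nova flavor-key hpc.gpu set "pci_passthrough:alias"="gpu:1"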
Lastly, just because you can do this doesn't make it a good idea... OpenStack is complex, HPC systems are complex, layering one complicated thing on another is a good way to create tricky problems that hide in the interface between the two layers. So make sure you're gaining something from having OpenStack in the mix here.

HTH,
Blair

On 15 March 2016 at 23:00, <openstack-hpc-request@lists.openstack.org> wrote:

Message: 1
Date: Tue, 15 Mar 2016 19:05:38 +0800
From: "me,apporc" <appleorchard2000@gmail.com>
To: openstack-hpc@lists.openstack.org
Subject: [openstack-hpc] What's the state of openstack-hpc now?
Hi, all
I found this etherpad[1], which was created a long time ago and lists some blueprints: support-heterogeneous-archs[2], heterogeneous-instance-types[3] and schedule-instances-on-heterogeneous-architectures[4]. But those blueprints have been obsolete since 2014, and some of their patches were abandoned. There is, however, a forked branch on github[5] / launchpad[6], which has diverged far from nova/trunk and has also not been updated since 2014.
Were those blueprints simply abandoned in OpenStack, or is something else going on?
Besides that, there is a CaaS[7] project called Senlin[8], whose wiki refers to "HPC", but it does not seem really related. "Cluster" can mean many things, and HPC is something rather different.
I cannot work out the status of GPU support in Nova. For networking, SR-IOV[9] seems OK. For storage, I don't know what the word "mi2" means in the etherpad[1].
From what I found above, it seems we cannot use OpenStack for HPC right now. But there are some videos here[10], here[11] and here[12]. Since we cannot get a GPU into a Nova instance, are they just building traditional HPC clusters without GPUs?
I need more information, thanks in advance.
1. https://etherpad.openstack.org/p/HVHsTqOQGc
2. https://blueprints.launchpad.net/nova/+spec/support-heterogeneous-archs
3. https://blueprints.launchpad.net/nova/+spec/heterogeneous-instance-types
4. https://blueprints.launchpad.net/nova/+spec/schedule-instances-on-heterogene...
5. https://github.com/usc-isi/nova
6. https://code.launchpad.net/~usc-isi/nova/hpc-trunk
7. https://wiki.openstack.org/wiki/CaaS
8. https://wiki.openstack.org/wiki/Senlin
9. https://wiki.openstack.org/wiki/SR-IOV-Passthrough-For-Networking
10. https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/o...
11. https://www.openstack.org/summit/tokyo-2015/videos/presentation/hpc-on-opens...
12. https://www.openstack.org/summit/tokyo-2015/videos/presentation/canonical-hp...

--
Regards,
apporc
If I may add, on the networking side OpenStack with KVM can get very close to bare-metal performance and functionality. With native SR-IOV support, supported NICs can provide near bare-metal latency and throughput.

Another very popular network capability for HPC is RDMA. With SR-IOV, capable NIC devices expose RDMA interfaces to the guest, allowing it to run verbs-based applications (like MPI) with efficiency similar to bare metal. RDMA is supported natively over RoCE (RDMA over Converged Ethernet) and is also supported over InfiniBand.

Erez
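For anyone wanting to reproduce this, the era-appropriate flow is roughly the following sketch (it assumes Neutron's sriovnicswitch mechanism driver is enabled, and 'eth3'/'physnet2'/'sriov-net' are illustrative names for the VF-capable interface, provider network mapping and Neutron network):

    # nova.conf on compute nodes: map the VF-capable NIC to a physical network
    pci_passthrough_whitelist = { "devname": "eth3", "physical_network": "physnet2" }

    # create a direct (SR-IOV VF) port on a network backed by physnet2
    neutron port-create sriov-net --name sriov-port0 --binding:vnic_type direct

    # boot an instance with that port attached
    nova boot --flavor hpc.16c64m --image centos7 --nic port-id=<PORT_UUID> node0

    # inside the guest, the VF should appear as an RDMA-capable device
    ibv_devinfo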
Hi,

In regard to GPU and InfiniBand performance within KVM, I'd like to point to a recent publication of ours at VEE15 titled "Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect." In the paper we show that using Nvidia GPUs and SR-IOV Mellanox CX3 InfiniBand adapters, we can support two MPI based HPC MD simulation applications, LAMMPS and HOOMD, running on a small cluster. We found overhead to be under 2% when compared to bare metal (no virtualization) for our HPC applications, which we consider to be very good. I'll leave the details for the paper itself, but if anybody has any specific questions, feel free to send me and/or my co-authors an email.

http://dl.acm.org/citation.cfm?id=2731194

Best,
Andrew

Andrew J. Younge
School of Informatics & Computing
Indiana University / Bloomington, IN USA
ajyounge@indiana.edu / http://ajyounge.com
Hi Andrew, On 16 March 2016 at 05:28, Andrew J Younge <ajyounge@indiana.edu> wrote:
point to a recent publication of ours at VEE15 titled "Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect." In the paper we show that using Nvidia GPUs ... http://dl.acm.org/citation.cfm?id=2731194
Oooh interesting - GPUDirect too. That's something I've been wanting to try out in our environment. Will take a look at your paper...

--
Cheers,
~Blairo
Hi Andrew, hi John -

I've just started trying to get CUDA P2P working in our virtualized HPC environment. I figure this must be something you solved already in order to produce the aforementioned paper, but having read it a couple of times I don't think it provides enough detail about the guest config, so I'm hoping you can shed some light...

The issue I'm grappling with: despite using a qemu-kvm machine type (q35) with an emulated PCIe bus, and seeing that the P2P-capable GPUs (NVIDIA K80s) are indeed attached to that bus and that nvidia-smi sees them as sharing a PHB, the simpleP2P CUDA sample fails when checking their ability to communicate with each other. Is there some magic config I might be missing? Did you need to make any PCI-ACS changes?

Best regards,
Blair
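For context, the guest-side setup being described boils down to something like this sketch (image and sample names are illustrative; hw_machine_type is the Glance image property the libvirt driver uses to select the QEMU machine type):

    # ask Nova/libvirt for a q35 (PCIe) machine type when booting from this image
    glance image-update --property hw_machine_type=q35 cuda-guest

    # inside the guest: check how the passed-through K80s appear
    nvidia-smi topo -m

    # then exercise peer-to-peer using the CUDA samples
    ./simpleP2P
    ./p2pBandwidthLatencyTest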
Hi Blair,

I'm very interested to hear more about your project using virtualized GPUs, and hopefully JP and/or myself can be of help here.

In the past we've struggled with the use of PCI bridges as a connector between multiple GPUs. We first saw this with Xen and S2070 servers (which have 4 older GPUs behind Nvidia PCI bridges) and found that ACS was prohibiting successful passthrough of the GPU. We just decided to use discrete independent adapters moving forward and have never gone back and tried this with KVM; with that, I'd expect the same issues, as ACS cannot guarantee proper isolation of the device. Looking at the K80 GPUs, I'm seeing that there are 3 PLX bridges for each GPU pair (see my output below for a native system without KVM), and I'd estimate these would likely be in the same IOMMU group. This could be the problem.

I have heard that a patch exists for KVM that lets you override the IOMMU groups and ACS protections, however I don't have any experience with it directly [1]. In our experiments we used an updated SeaBIOS, whereas the link below details a UEFI BIOS, which may have different implications that I don't have experience with. Furthermore, I assume this patch will likely just ignore ACS entirely, which is an obvious and potentially severe security risk. In a purely academic environment such a risk may not matter, but it should be noted nonetheless.

So, let's take a few steps back to confirm things. Are you able to actually pass both K80 GPUs through to a running KVM instance and have the Nvidia drivers loaded? Any dmesg output errors here may go a long way. Are you also passing through the PCI bridge device (lspci should show one)? If you're actually making it that far, it may next be worth simply running a regular CUDA application set before trying any GPUDirect methods. For our GPUDirect usage we were specifically leveraging the RDMA support with an InfiniBand adapter rather than CUDA P2P, so your mileage may vary there as well.

Hopefully this is helpful in finding your problem. With this, I'd be interested to hear whether the ACS override mechanism, or any other option, works for enabling passthrough with K80 GPUs (we have a few dozen non-virtualized for another project). If you have any other non-bridged GPU cards (like a K20 or C2075) lying around, it may be worth giving one of those a try first to rule out other potential issues.

[1] https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#Bypassing_the_...

[root@r-001 ~]# lspci | grep -i -e PLX -e nvidia
02:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
06:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
08:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
82:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
84:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
86:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
88:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

[root@r-001 ~]# nvidia-smi topo --matrix
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   mlx4_0  CPU Affinity
GPU0    X      PIX    PHB    PHB    SOC    SOC    SOC    SOC    SOC     0-11,24-35
GPU1    PIX    X      PHB    PHB    SOC    SOC    SOC    SOC    SOC     0-11,24-35
GPU2    PHB    PHB    X      PIX    SOC    SOC    SOC    SOC    SOC     0-11,24-35
GPU3    PHB    PHB    PIX    X      SOC    SOC    SOC    SOC    SOC     0-11,24-35
GPU4    SOC    SOC    SOC    SOC    X      PIX    PHB    PHB    PHB     12-23,36-47
GPU5    SOC    SOC    SOC    SOC    PIX    X      PHB    PHB    PHB     12-23,36-47
GPU6    SOC    SOC    SOC    SOC    PHB    PHB    X      PIX    PHB     12-23,36-47
GPU7    SOC    SOC    SOC    SOC    PHB    PHB    PIX    X      PHB     12-23,36-47
mlx4_0  SOC    SOC    SOC    SOC    PHB    PHB    PHB    PHB    X

Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

Cheers,
Andrew

Andrew J. Younge
School of Informatics & Computing
Indiana University / Bloomington, IN USA
ajyounge@indiana.edu / http://ajyounge.com
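A quick way to inspect the grouping and ACS state being discussed here, on the host (the bus addresses are taken from the lspci output above and will differ on other systems):

    # list IOMMU groups and their member devices
    find /sys/kernel/iommu_groups/ -type l | sort

    # check whether the PLX downstream ports advertise and enable ACS
    lspci -s 03:08.0 -vvv | grep -i ACS
    lspci -s 03:10.0 -vvv | grep -i ACS

    # with the out-of-tree ACS override patch referenced in [1] above, grouping can
    # be relaxed via a kernel boot parameter (with the security caveats Andrew notes):
    #   pcie_acs_override=downstream,multifunction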
Hi Andrew,

Just wanted to quickly say that I really appreciate your prompt reply and hope you'll be happy to assist further if possible. I've just gotten slightly sidetracked by some other issues but will come back to this in the next week and provide more background info and results of workaround attempts.

Cheers,
Blair
Thank you all. From what you've posted I can now see how HPC fits into OpenStack. As I understand it, thanks to https://wiki.openstack.org/wiki/SR-IOV-Passthrough-For-Networking and https://wiki.openstack.org/wiki/Pci_passthrough it is possible to create instances that form an HPC cluster, and the performance is very good too. As for managing the HPC clusters (if we have many of them), we can use Heat, or Senlin later.
participants (4)
- Andrew J Younge
- Blair Bethwaite
- Erez Cohen
- me,apporc