Looking for practical Openstack + Ceph guidance for shared HPC
Hello. A contingent of my organization, the Kno.e.sis Center @ Wright State University <http://www.knoesis.org/>, recently received a grant award which we intend to use to support a handful of mid-size HPC-style workloads (MPI definitely, GPGPU if possible/plausible) in addition to many mid-size IaaS-style workloads (MongoDB, Storm, Hadoop, many others). As a third layer, I'm playing with the idea of evaluating an elastic OpenShift Origin atop the same infrastructure. Approximately $400k to $500k will hopefully be available for this deployment, though exact numbers are not yet available to me.

While I'm prepared to build a home-grown, small-to-mid-size "classical" HPC cluster using modern hardware, plus a smaller silo of home-grown OpenStack for the minority stakeholders, I am hoping to find a way to make proponents of both workloads simultaneously happy, or close to it. That is, I would like to give my computer-scientist users a friendly method of running their HPC-style jobs on a single performance-tuned OpenStack silo. Doing so would balance the procured hardware and infrastructure between them and the users who want a Tomcat or a Virtuoso instance.

I see a number of serious issues in realizing such a goal. For example, the state of InfiniBand with OpenStack seems not quite ready/available/documented/accessible for production use, unless I'm just blind to the right blogs. The myriad added abstractions, the latency that virtualization might impose on an HPC task, and cloud software-defined networking (Quantum, especially without hardware acceleration) all seem likely to get in the way of practicality, economics and efficiency. That said, most of what we do here isn't HPC, so I believe such trade-offs can be accepted, provided a reasonable job scheduling and workload management mechanism can be found and agreed upon by all stakeholders, grant-proposal majority (HPC) and minority (IaaS) alike.

I get the impression from my reading that HPC-style deployment (separate from job queuing) against the EC2 API should work. I don't have a good feeling that the experience would be particularly friendly, however, without paying for closed-source applications. I'm thinking a high-performance Ceph install would help bring up the storage end of things in a modern, open-source, COTS way. I've not done specific research on Lustre + OpenStack, but no reports of such a setup have presented themselves to me, either.

These blue-sky ideas matter little, it seems, if a sufficiently large, high-performance, production-quality OpenStack deployment is beyond the funds to be allotted, which is something else I'm working on. I've built smallish but useful virt-manager, oVirt and OpenStack environments here already, but none of them are enough for the very important HPC job proposed for this grant. The scientist running the proposed computation gave me the following information to clarify what would give him parity (for his job only) with his experience running the computation with an external HPC service provider:

- MPI
- 20 Gbps InfiniBand compute interconnect
- 600 cores (those currently used are probably G4 Opteron, 2.4 GHz)
- 4 GB RAM per core
- at least 2 TB shared storage, though I'm thinking we need much more for use by our general community
- unsure of the storage networking topology

We're in the shopping phase for this grant award and are still playing with ideas. It seems likely to snap back into an old-school HPC cluster, at this time.
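Regarding the EC2 API point above, the sort of thing I have in mind is roughly the following untested sketch with boto; it assumes nova's EC2 endpoint on the default port 8773, EC2 credentials issued by Keystone, and placeholder image, flavor and key names:

    # Untested sketch: boto pointed at an OpenStack EC2 endpoint rather than AWS.
    import boto
    from boto.ec2.regioninfo import RegionInfo

    region = RegionInfo(name='nova', endpoint='cloud.example.org')
    conn = boto.connect_ec2(aws_access_key_id='EC2_ACCESS_KEY',      # placeholder
                            aws_secret_access_key='EC2_SECRET_KEY',  # placeholder
                            is_secure=False, region=region,
                            port=8773, path='/services/Cloud')

    # Boot a batch of identical "compute node" instances from a cluster image.
    conn.run_instances('ami-00000001', min_count=8, max_count=8,
                       instance_type='m1.xlarge', key_name='hpc-key')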
I've fielded questions about our needs to a number of OpenStack and hardware providers, in the hope that they can bring something helpful to the table. Please let me know if you can point me in the right direction(s). I'm up for reading whatever text is thrown at me on this topic. :-) Thanks, Joshua -- Joshua M. Dotson Systems Administrator Kno.e.sis Center Wright State University - Dayton, OH
Joshua,

This is something those of us working in the gap between HPC and cloud have struggled with. We lack a strong project team within OpenStack, IMHO, although there are several small groups doing HPC with OpenStack. Hopefully others will chime in on the other topics, such as InfiniBand support, but we did some testing with a GRID K2 card for GPU pass-through with different hypervisors. A talk I gave at the OpenStack DC meetup is here: http://www.slideshare.net/bfschott/nimbis-schott-openstackgpustatus20130618

The short GPU answer is that it is possible to pass GPUs through with Xen, XenCenter, and XCP today, but OpenStack doesn't support it by default in Nova. This is still in roll-your-own mode for deployment.

Brian

------------------------------------------------- Brian Schott, CTO Nimbis Services, Inc. brian.schott@nimbisservices.com ph: 443-274-6064 fx: 443-274-6060
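For reference, outside of Nova the raw device pass-through step on the KVM/libvirt side (not the Xen/XCP path described above) looks roughly like the following sketch; the domain name and PCI address are placeholders:

    import libvirt

    # Placeholder PCI address of the GPU; find the real one with lspci.
    HOSTDEV_XML = """
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    """

    conn = libvirt.open('qemu:///system')
    dom = conn.lookupByName('gpu-guest')          # placeholder domain name
    # Hot-attach the PCI device to the running guest.
    dom.attachDeviceFlags(HOSTDEV_XML, libvirt.VIR_DOMAIN_AFFECT_LIVE)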
Joshua: You may want to ask on the openstack-operators list as well. I know there are some folks there who are (trying to) deploy OpenStack on top of InfiniBand, with varying degrees of success while fighting with drivers. You'll also get some insights into storage solutions (mostly Ceph vs. Gluster, I suspect). I haven't come across anybody who's tried to deploy OpenStack with Lustre: that way lies madness, I think. Take care, Lorin
-- Lorin Hochstein Lead Architect - Cloud Services Nimbis Services, Inc. www.nimbisservices.com
Brian's right. You will end up doing a lot of work; Nova isn't ready for this out of the box. The key problems are:

- drivers for virtualization (either via SR-IOV or device passthrough) for networking and GPUs
- I/O architecture

There is apparently SR-IOV support for IB, provided that you have the right hardware, firmware, and driver, though I haven't managed to make it work. This provides a pkey-isolated multi-tenant environment. It is basically a Mellanox-only solution. Like Brian said, Xen is the only way to go for GPU virtualization.

You can do some really interesting things with the I/O architecture. We've been experimenting with both GlusterFS and Ceph. Both seem to perform decently well; we've managed to get the GlusterFS setup going at 60 GB/s in aggregate with a pile of clients. There isn't good integration of all of those capabilities in mainline OpenStack yet, but this looks promising. Ceph's mainline integration looks better, but we haven't tried those pieces out yet.

At the end of the day, you need to ask yourself why you want to do this. If you're running a workload that is well suited to an HPC cluster, you should probably use one. If you need multi-tenancy, user control of system software, or need to run a workload poorly suited to a traditional cluster, then it is worth thinking seriously about. You'll end up needing to do a bunch of work, though. In our case, the reason we pursued this course is that we have workloads and developers that benefit from the cloud control plane.

hth.
-nld
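A quick way to sanity-check whether SR-IOV is even wired up on a node is to look for virtual-function entries in sysfs; a rough sketch, assuming a Mellanox HCA that registers as mlx4_0 (adjust the device name for your hardware):

    import glob, os

    # Assumes a Mellanox HCA registered as mlx4_0; adjust for your device.
    pci_dev = os.path.realpath('/sys/class/infiniband/mlx4_0/device')
    vfs = sorted(glob.glob(os.path.join(pci_dev, 'virtfn*')))
    print('%d SR-IOV virtual functions configured' % len(vfs))
    for vf in vfs:
        # Each virtfn* entry is a symlink to the VF's PCI device.
        print('%s -> %s' % (os.path.basename(vf),
                            os.path.basename(os.path.realpath(vf))))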
All,

One issue Joshua touched on was options for workload management. While IB and GPUs seem to be discussed frequently when it comes to OpenStack and HPC, they are not so relevant in our HPC environment (roughly 3000 cores, each node connected via 1G, scale-out NAS storage, biomedical research: genomics, proteomics, statistics); many other mid-size shops may have similar setups. We are just starting to look at OpenStack for a potential deployment with Ubuntu 14.04. We have good experience using KVM for some of our resources. Some of the things we are hoping to get from OpenStack in the future are:

- flexible partitioning of resources for special-sauce software (Hadoop, interactive HPC software)
- self-service for developers and scientists
- allowing a research group that spans multiple research organizations (internal/external) controlled access to an isolated (virtual) datacenter (potentially with FISMA compliance)
- saving images that researchers built for later use (reproducible in case someone asks how they got to a result)
- chargeback for HPC resources for internal and external users
- use of idle resources for testing in enterprise IT, VDI, etc.
- compute fencing (as we head to 24 cores per box, most of our multi-threaded code can still only take advantage of 4-6 cores; this either leaves resources idle or users step on each other on shared nodes, and cgroups is a bit of a pain to maintain by hand -- a sketch of that bookkeeping follows below)
- checkpointing and restarting long-running jobs (for prioritization and better protection against job failures), perhaps with LXC containers as an alternative to KVM (we use BLCR today, but that community is quite small)
- standardization of our infrastructure
- potential participation in FutureGrid, XSEDE, etc.

That's perhaps a lot to ask, but we would be looking at a 2-3 year time frame. What I don't quite understand is how one would handle workload management. Currently we see people using SGE, Moab, LSF and some Slurm in classic HPC. Concepts like backfill, preemption, fair share and such things are probably unknown to OpenStack? If so, it would perhaps be acceptable to run a workload manager on a subset of always-on KVM systems, or even bare metal, for classic HPC. But how does one consolidate the reporting, billing and chargeback of two separate systems? Are there any efforts to integrate workload managers directly into Nova? SGE and Slurm are both open source and would support everything we require. Or are folks thinking about writing something from scratch in Python?

Thanks,
dipe
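The by-hand cpuset bookkeeping mentioned in the compute-fencing item above looks roughly like this sketch (cgroup v1 cpuset hierarchy assumed, mounted at /sys/fs/cgroup/cpuset; the job ID, core range and PID are placeholders):

    import os

    def fence_job(job_id, cpus, mems='0', pid=None):
        """Confine a job's processes to a fixed set of cores via a cpuset cgroup."""
        path = '/sys/fs/cgroup/cpuset/job-%s' % job_id
        if not os.path.isdir(path):
            os.makedirs(path)
        with open(os.path.join(path, 'cpuset.cpus'), 'w') as f:
            f.write(cpus)              # e.g. '0-5' gives the job six cores
        with open(os.path.join(path, 'cpuset.mems'), 'w') as f:
            f.write(mems)              # NUMA node(s) the job may allocate memory from
        if pid is not None:
            with open(os.path.join(path, 'tasks'), 'w') as f:
                f.write(str(pid))      # move the job's process into the cpuset

    # Example: pin job 12345 (pid 4242) to cores 0-5 on NUMA node 0.
    # fence_job('12345', '0-5', mems='0', pid=4242)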
I think that workload management is perhaps too vague a term here. There are a few different processes in play.

One is system-level resource management (nova-scheduler, etc.). This component needs to figure out how to allocate resources to tenants, and pull them back when needed.

Inside a single tenant's allocation, there is a need for workload management. This is a good place for traditional HPC resource managers; we've run Torque in this capacity, for example.

I think that the model today for nova-scheduler is wrong; it only supports open-ended leases on compute resources. Another component that makes this tricky is the need to support interactive workloads; batch is good for, well, batch, but interactive is always problematic in these environments.

There is some research targeting this problem, in particular Mesos and Omega. Mesos looks OK for serial workloads, but has a real problem with parallel (non-resizable) ones. Omega looks a lot better, but that is an internal Google thing. I don't think there is an off-the-shelf solution today.
-nld
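As an illustration of the "resource manager inside a tenant" pattern, a Torque node list could be generated from a tenant's instances along these lines (a sketch only, using python-novaclient; the credentials, instance-name prefix, network label and np count are placeholders, and a real setup still needs pbs_mom configuration, cloud-init, etc.):

    from novaclient.v1_1 import client

    # Placeholder credentials and auth URL for the tenant running the cluster.
    nova = client.Client('USER', 'PASSWORD', 'TENANT',
                         'http://keystone.example.org:5000/v2.0')

    # Build a Torque server_priv/nodes file from the tenant's compute instances.
    lines = []
    for server in nova.servers.list():
        if not server.name.startswith('compute-'):   # naming convention is an example
            continue
        ip = server.networks['private'][0]            # assumes a network labelled 'private'
        lines.append('%s np=8' % ip)                  # np= should match the flavor's vCPU count

    with open('/var/spool/torque/server_priv/nodes', 'w') as f:
        f.write('\n'.join(lines) + '\n')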
All,

Thanks for the great responses. I'm finding this discussion very enlightening. I have some mostly idle and uninformed musings about HPC node scheduling at the nova-scheduler level.

Efficient handling of typical IaaS loads is much different from what I suppose one might call HPCaaS loads. For example, a typical IaaS instance is expected to be sometimes idle, awaiting clients and work, unless developers really tune their SaaS application stacks, if that's what your cloud is running. A good bit of virtualization is rarely-elastic "infrastructure": things like OwnCloud or a mirrors server. These tasks usually benefit from sharing a physical node. Beyond that, a lot of SaaS developers (who run on IaaS) stress horizontal scaling before any large vertical scaling, because resiliency (especially geo-replication) and cost are very important there. Though there are definitely limits, instance granularity (smaller instances) and more horizontal than vertical footprints, for SaaS and PaaS stacks running in clouds which themselves have a horizontal footprint, are industry-standard methods of achieving SLAs. I seldom hear of HPC jobs having such performance profiles, because performance, not resiliency, comes first.

What you end up with is nova-scheduler being tuned out of the box to overcommit on I/O, memory, disk and CPU, with little or no motive to change course and honor things like pre-planning the number of hours instances will run, unlike HPC resource schedulers -- unless I'm severely mistaken. The API itself would need extending to do that, which maybe isn't altogether impossible to bring about. Thus, my experience has been that cloud-centric load balancing is essentially sacrilegious to a lot of HPC tasks (and also, importantly, to "Big Data" tasks that favor throughput over resiliency/storage). Along these lines, in my view, local-to-node disk space, physical or otherwise, has little place in HPC at scale (e.g. swap off). I'd really like to see things like CephFS become battle-hardened enough to move us toward unified storage, of the cloud-API variety, which can service HPC-level throughput. HPC jobs which are I/O, memory and/or CPU bound might benefit from overcommitting, but only because backfilling with jobs bound in the opposite way(s) could make sense in such cases -- and only then if we're comfortable guessing that the job in question isn't going to shift gears and start chomping on memory, etc.

So then, it's almost as if we need to put pop-up OpenStack environments into an old-school PBS or LSF, rather than put HPC into an OpenStack cloud. I guess what I'm saying is, maybe it would be easier to have a "job script" which bootstraps a single-tenant(?) OpenStack environment at job run time, for those who need IaaS. But no, for many reasons, not the least of which is that keeping a perpetual "job" running would be a nightmare in old-school HPC land. Beyond that, I don't get the feeling old-school HPC job schedulers have any "elastic" or modern REST API abilities.

It seems that the Nova scheduler needs some form of tiered scheduling if HPCaaS is ever to be really efficient. Nova needs to know how often I plan to peg I/O, memory, and other resources. While I suppose some machine learning could be introduced to profile each tenant and each of their deployments, I do not see us getting much traction with these things until OpenStack has a place in its API for a "cluster" primitive.
Now, I've not been keeping up with details on the Heat project <http://www.openstack.org/summit/san-diego-2012/openstack-summit-sessions/presentation/heat-a-template-based-orchestration-engine-for-openstack>, but maybe that's just what the doctor ordered as a base for HPCaaS on OpenStack. Since Heat is a clone of AWS CloudFormation, this seems relevant: https://aws.amazon.com/hpc-applications/

- "Cluster instances can be launched within a *Placement Group*. All instances launched within a Placement Group have low latency, full bisection 10 Gbps bandwidth between instances. Like many other Amazon EC2 resources, *Placement Groups* are dynamic and can be resized if required. You can also connect multiple *Placement Groups* to create very large clusters for massively parallel processing."

Thanks again,
Joshua
-- Joshua M. Dotson Systems Administrator Kno.e.sis Center Wright State University - Dayton, OH 937-985-3246
Hi everybody,

Alright, I'm going to jump in on this conversation. First I want to say this is a great discussion, and I'm excited to see others as interested in HPC and OpenStack as I am. The questions raised regarding the state of the art in OpenStack related to high-performance capabilities are spot-on, and I have no doubt many people on this list are working hard to address them today.

Now, I'd actually like to vote against the thought of shoehorning OpenStack into a typical HPC environment, as suggested in Josh's last email. As alluded to, doing so would be impractical when considering HPC scheduling, job reservation times, etc.; beyond that, it would be a downright nightmare for supporting the software stack necessary for running a cloud environment. I'd be impressed to find a single HPC system administrator who'd allow you to install a type-1 or even type-2 hypervisor on a traditional compute node, let alone the proper higher-order APIs and software packages necessary to make it a reality. This is where cloud IaaS solutions provide a true advantage over conventional HPC offerings when running scientific applications: users have the freedom to design their environment as they see fit within a VM, a luxury that's continually a limitation within normal HPC offerings today. There have been many efforts over the years to layer higher-order solutions atop traditional HPC and cluster LRMS solutions, many of which were great but have lost a lot of traction in recent years. I'll simply reference the term Grid Computing and say little more on the subject.

I would strongly argue that bridging the gap between HPC and IaaS should (must?) be accomplished by providing HPC-oriented tools, services, and hardware within a user-centric IaaS solution like OpenStack, rather than the other way around. The effort necessary to bring HPC to a cloud IaaS is much more tractable this way, as the mechanisms, the capabilities, and recently the interest are all there. Actually, this is a core concept of my tentative Ph.D. dissertation. There will always be added overhead with the virtualization technologies inherent in cloud IaaS; however, the total overhead is continually decreasing and capabilities are ever increasing, making IaaS a potential solution for many mid-tier scientific research projects for the first time.

Regarding Ceph in OpenStack, I do think there is already an effort for providing Ceph within OpenStack: http://ceph.com/docs/next/rbd/rbd-openstack/. Please note, though, that this is distributed block storage for booting and running VMs in OpenStack, and not distributed shared block storage within a VM itself. Ideally, it would be great to have a many-to-one relationship between VMs and volumes (Cinder, nova-volume, EBS), representing distributed shared storage among many running VMs simultaneously. There is a new blueprint for this now, https://blueprints.launchpad.net/cinder/+spec/read-only-volumes, however this uses read-only volumes. My eye is currently focused on a solution revolving around Lustre, but there's no theoretical reason similar work couldn't be done, or isn't already completed, using Ceph.

For scheduling in OpenStack, this seems like a rather large problem with many potential solutions. If the HPC community has taught us anything about scheduling, it's that it's an NP-complete problem with an unlimited number of potential solutions.
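To make the RBD point concrete, here is a minimal sketch using the librbd/librados Python bindings of the kind of thing Cinder and Glance drive under the hood when they carve images out of a Ceph pool. The pool name 'volumes' and the image name are just examples, not anything OpenStack requires:

    import rados
    import rbd

    # Connect to the cluster described by the usual ceph.conf.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # 'volumes' is only an example pool name (one the Cinder RBD driver
    # is commonly pointed at); use whatever pool you actually created.
    ioctx = cluster.open_ioctx('volumes')
    try:
        rbd.RBD().create(ioctx, 'example-image', 10 * 1024 ** 3)  # 10 GiB image
        print('images in pool: %s' % rbd.RBD().list(ioctx))
    finally:
        ioctx.close()
        cluster.shutdown()

Cinder's RBD driver does essentially this (plus snapshots and clones) on your behalf; distributed shared storage within the VMs themselves is the part that is still missing.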
There was an original proposal for a proximity-based scheduler, https://wiki.openstack.org/wiki/ProximityScheduler, but this is just one theoretical option among many. I'd be ecstatic to get involved in a community effort to push for high-performance scheduling in OpenStack using whatever mechanisms people collectively see fit. I've hinted at many different things here, so let me know if you'd like to discuss anything in more detail. Also, if you want more information on a solution for GPUs and/or InfiniBand in OpenStack, as alluded to by Brian and Narayan, just let me know. I may have one.

Regards, Andrew

P.S. Is anybody planning on attending the next OpenStack summit and want to participate in a discussion on some of these topics? http://www.openstack.org/summit/openstack-summit-hong-kong-2013/become-a-spe... The speaker proposal deadline is in 5 days, I believe...

On 7/26/13 12:42 PM, Joshua Dotson wrote:
All,
Thanks for the great responses. I'm finding this discussion very enlightening. I have some mostly idle and uninformed musings about HPC node scheduling at the nova-scheduler level:
Efficient handling of typical IaaS loads is much different from what I suppose one might call HPCaaS loads. For example, a typical IaaS instance is expected to be sometimes idle, awaiting clients and work, unless developers really tune their SaaS application stacks, if that's what your cloud is running. A good bit of virtualization is rarely-elastic "infrastructure": things like OwnCloud or a mirror server. These tasks usually benefit from sharing a physical node.
Beyond that, a lot of SaaS developers (who run on IaaS) stress horizontal scaling before any large vertical scaling, because resiliency (especially geo-replication) and cost are very important there. Though there are definitely limits, smaller instance granularity and footprints that are more horizontal than vertical are industry-standard methods of achieving SLAs for SaaS and PaaS stacks running in clouds that are themselves horizontally built.
I seldom hear of HPC jobs having such performance profiles, because performance, not resiliency, comes first. What you end up with is nova-scheduler being tuned out of the box to overcommit on I/O, memory, disk and CPU, with little or no motive to change course and honor things like pre-planning the number of hours instances will run, unlike HPC resource schedulers -- unless I'm severely mistaken. The API itself would need extending to do that, which maybe isn't altogether impossible to bring about.
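To illustrate what such an extension might look like, here is a purely hypothetical sketch of a Grizzly/Havana-era Nova host filter that only passes requests declaring a planned wall time via a scheduler hint. The "walltime_hours" hint and the filter itself are invented for illustration; nothing in Nova sets or honors such a hint today:

    # Hypothetical example: the 'walltime_hours' hint is made up here,
    # and nothing in Nova acts on it out of the box.
    from nova.scheduler import filters

    class WalltimeFilter(filters.BaseHostFilter):
        """Pass only requests that declare a planned run time,
        e.g.  nova boot --hint walltime_hours=48 ..."""

        def host_passes(self, host_state, filter_properties):
            hints = filter_properties.get('scheduler_hints') or {}
            return 'walltime_hours' in hints

Assuming the filter were packaged and appended to scheduler_default_filters in nova.conf, you would still need a weigher or an external reservation system to actually plan around the declared hours -- which is exactly the piece that doesn't exist.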
Thus, my experience has been that cloud-centric load balancing is essentially sacrilegious to a lot of HPC tasks (and also, importantly, "Big Data" tasks, where throughput matters more than resiliency/storage). Along these lines, in my view, local-to-node disk space, physical or otherwise, has little place in HPC at scale (e.g. swap off). I'd really like to see things like CephFS become battle-hardened enough to move us toward unified storage, of the cloud-API variety, which can service HPC-level throughput. HPC jobs which are I/O-, memory- and/or CPU-bound might benefit from overcommitting, but only because backfilling with jobs bound on the opposite resource(s) could make sense in such cases -- and only then if we're comfortable guessing that the job in question isn't going to shift gears and start chomping on memory, etc.
So then, it's almost as if we need to put pop-up Openstack environments into an old-school PBS or LSF, rather than put HPC(s) into an Openstack cloud. I guess what I'm saying is, maybe it would be easier to have a "job script" which bootstraps a single-tenant(?) Openstack environment at job run time, for those who need IaaS. But no, for many reasons, not the least of which is that keeping a perpetual "job" running would be a nightmare in old-school HPC land. Beyond that, I don't get the feeling old-school HPC job schedulers have any "elastic" or modern REST API abilities.
It seems that the Nova scheduler needs some form of tiered scheduling if HPCaaS is ever to be really efficient. Nova needs to know how often I plan to peg I/O, memory, and other resources. While I suppose some machine learning could be introduced to profile each tenant and each of their deployments, I do not see us getting much traction with these things until Openstack has a place in its API for a "cluster" primitive. Now, I've not been keeping up with details on the Heat project <http://www.openstack.org/summit/san-diego-2012/openstack-summit-sessions/presentation/heat-a-template-based-orchestration-engine-for-openstack>, but maybe that's just what the doctor ordered as a base for HPCaaS on Openstack...
Since Heat is a clone of AWS CloudFormation, this seems relevant: https://aws.amazon.com/hpc-applications/
* "Cluster instances can be launched within a /Placement Group/. All instances launched within a Placement Group have low latency, full bisection 10 Gbps bandwidth between instances. Like many other Amazon EC2 resources, /Placement Groups/ are dynamic and can be resized if required. You can also connect multiple /Placement Groups/ to create very large clusters for massively parallel processing."
Thanks again, Joshua
On Fri, Jul 26, 2013 at 11:24 AM, Narayan Desai <narayan.desai@gmail.com> wrote:
I think that workload management is perhaps too vague of a term here. There are a few different processes in play.
One is the system-level resource management (nova-scheduler, etc.). This component needs to figure out how to allocate the resources to tenants, and pull them back when needed.
Inside of a single tenant's allocation, there is a need for workload management. This is a good place for traditional HPC resource managers; we've run torque in this capacity, for example.
I think that the model today for the nova scheduler is wrong; it only supports open-ended leases on compute resources. Another component that makes this tricky is the need to support interactive workloads; batch is good for, well, batch, but interactive is always problematic in these environments.
There is some research targeting this problem, in particular Mesos and Omega. Mesos looks OK for serial workloads, but has a real problem with parallel (non-resizable) ones. Omega looks a lot better, but that is an internal Google thing. I don't think there is an off-the-shelf solution today. -nld
On Fri, Jul 26, 2013 at 1:48 AM, Di Pe <dipeit@gmail.com> wrote:
All,
one issue Joshua touched on was options for workload management. While IB and GPU seem to be frequently discussed when it comes to openstack and HPC, they are not so relevant in our HPC environment (3000-ish cores, each node connected via 1G, scale-out NAS storage, biomedical research, genomics, proteomics, statistics) ..... many other midsize shops may have similar setups. We are just starting to look at openstack for a potential deployment with Ubuntu 14.04. We have good experience using KVM for some of our resources. Some of the things we are hoping to get from openstack in the future are:
* flexible partitioning of resources for special-sauce software (hadoop, interactive HPC software)
* self-service for developers and scientists
* allow a research group that spans multiple research organizations (internal / external) controlled access to an isolated (virtual) datacenter (potentially with FISMA compliance)
* save images that researchers built for later use (reproducible in case someone asks how they got to this result)
* chargeback for HPC resources for internal and external users
* usage of idle resources for testing in Enterprise IT, VDI, etc.
* compute fencing (as we are heading to 24 cores per box, most of our multi-threaded code can still only take advantage of 4-6 cores; this either leaves stuff idle or users step on each other on shared nodes, and cgroups is a bit of a pain to maintain)
* checkpointing and restarting long-running jobs (for prioritization and better protection against job failures); perhaps LXC containers as an alternative to KVM (we use BLCR today but that community is quite small)
* standardization of our infrastructure
* potential participation in FutureGrid, XSEDE, etc.
That's perhaps a lot to ask, but we would be looking at a 2-3 year time frame. What I don't quite understand is how one would handle workload management. Currently we see people using SGE, Moab, LSF and some Slurm in classic HPC. Concepts like backfill, preemption, fair share and such things are probably unknown to openstack? If so, it would perhaps be acceptable to run a workload manager on a subset of always-on KVM systems or even bare metal for classic HPC. But how does one consolidate the reporting, billing and chargeback of 2 separate systems?
Are there any efforts to integrate workload managers directly into nova? SGE and Slurm are both open source and would support everything we require. Or are folks thinking about writing something from scratch in python?
Thanks dipe
On Thu, Jul 25, 2013 at 6:22 PM, Narayan Desai <narayan.desai@gmail.com> wrote:
Brian's right. You will end up doing a lot of work; nova isn't ready for this out of the box.
The key problems are:
- drivers for virtualization (either via SR-IOV or device passthrough) for net + gpu
- I/O architecture
There is apparently SR-IOV support for IB, provided that you have the right hardware, firmware, and driver, though I haven't managed to make it work. This provides a pkey-isolated multi-tenant environment. That is basically a Mellanox-only solution.
Like Brian said, Xen is the only way to go for GPU virtualization.
You can do some really interesting things with the I/O architecture. We've been experimenting with both glusterfs and ceph. Both seem to perform decently well; we've managed to get the glusterfs setup going at 60 GB/s in aggregate with a pile of clients. There isn't good integration of all of the capabilities yet in mainline openstack, but this looks promising. Ceph looks like the mainline integration is better, but we haven't tried those things out yet.
At the end of the day, you need to ask yourself why you want to accomplish this. If you're running a workload that is well suited to an HPC cluster, you should probably use one. If you need multi-tenancy, user control of system software, or need to run a workload poorly suited for a traditional cluster, then it is worth thinking strongly about. You'll end up needing to do a bunch of work though.
In our case, the reason that we pursued this course is because we have workloads and developers that benefit from the cloud control plane.
hth. -nld
On Thu, Jul 25, 2013 at 3:30 PM, Brian Schott <brian.schott@nimbisservices.com> wrote:
Joshua,
It is something those of us working the gap between HPC and cloud have struggled with. We lack a strong project team within OpenStack IMHO, although there are several small groups doing HPC with OpenStack. Hopefully others will chime in on some other topics, such as Infiniband support, but we did some testing with a GRID K2 card for GPU pass-through with different hypervisors. A talk I gave at the OpenStack DC meetup is here: http://www.slideshare.net/bfschott/nimbis-schott-openstackgpustatus20130618
The short GPU answer is that it is possible with Xen, XenCenter, and XCP to pass GPUs through today, but OpenStack doesn't have support by default in Nova. This is still in a roll-your-own mode for deployment.
Brian
------------------------------------------------- Brian Schott, CTO Nimbis Services, Inc. brian.schott@nimbisservices.com ph: 443-274-6064 fx: 443-274-6060
-- Andrew J. Younge Pervasive Tech Institute / School of Informatics & Computing Indiana University / Bloomington, IN USA ajyounge@indiana.edu / http://ajyounge.com
CERN will be sending 3 people to the summit. It would be great to have an HPC/HTC get-together, even if we can't get a speaker slot. I'm not sure of the best format, but something where we each do a 5-minute flash update on our current status and areas for investigation, to identify common solutions, would be very useful. Tim
We usually hold an HPC-themed design summit session, although last time it was one of the unconferences. We're planning to send at least someone from ISI to the summit, so we'd be happy to coordinate a shared design summit/speaker session. JP
I think there would be enough user stories now for a session outside the unconference. Given the experiences with Infiniband, grid-to-cloud approaches, and the challenges of scheduling and data locality, there should be enough interest outside of the core HPC/HTC teams. If there is not sufficient time in the schedule, we can do it in the unconference. Are you OK to submit a talk proposal? Tim
Hi Tim, I'm happy to submit a proposal. To allow me to frame the proposal a little, who outside of ISI is interested in participating and giving a short overview of their HPC/HTC work? This would be for the speaker sessions. We can coordinate a design summit session separately if folks are interested. I don't think the deadline for design summit sessions is approaching quite yet. JP
You're welcome to include CERN... we could talk on some subset of our current HTC/HPC topics such as batch to cloud migrations, scheduling, scaling, academic federation and data locality. Tim
Narayan, I just read about Mesos and Omega. Wow. This stuff is awesome. Thanks, Joshua
participants (8)
- Andrew J Younge
- Brian Schott
- Di Pe
- John Paul Walters
- Joshua Dotson
- Lorin Hochstein
- Narayan Desai
- Tim Bell