Re: [openstack-hpc] Looking for OpenStack workload management
At CERN, we have similar interests and aims to those you describe. We're running LSF on top of the cloud so that legacy batch applications can use cloud resources alongside the users who talk directly to the cloud APIs. To track resources, we are setting up a dedicated batch project in OpenStack; these resources are accounted to the IT department. We then use the batch system accounting records to work out who has used what inside the 'virtual batch' system. We've also been experimenting with submitting batch accounting records into Ceilometer in order to consolidate the reporting into a single technology.

Where we are still trying to find a good solution is how to use spot-market resources in our cloud. If we allocate a quota to a project but they do not use the full resources, we want to be able to offer that quota to others on a low SLA (killed with little warning). This allows us to get our resource utilisation up while ensuring that projects can get the resources they're entitled to when there is a need.

This is an interesting area for a few sites to get together and share solutions.

Tim

From: Di Pe [mailto:dipeit@gmail.com]
Sent: 26 July 2013 08:48
To: openstack-hpc@lists.openstack.org
Subject: Re: [openstack-hpc] Looking for practical Openstack + Ceph guidance for shared HPC

All,

one issue Joshua touched on was options for workload management. While IB and GPU seem to be discussed frequently when it comes to OpenStack and HPC, they are not so relevant in our HPC environment (roughly 3000 cores, each node connected via 1G, scale-out NAS storage, biomedical research, genomics, proteomics, statistics); many other mid-size shops may have similar setups. We are just starting to look at OpenStack for a potential deployment with Ubuntu 14.04, and we have good experience using KVM for some of our resources. Some of the things we are hoping to get from OpenStack in the future are:

* flexible partitioning of resources for special-sauce software (Hadoop, interactive HPC software)
* self service for developers and scientists
* controlled access to an isolated (virtual) datacenter for research groups that span multiple organizations (internal / external), potentially with FISMA compliance
* saving the images that researchers built for later use (reproducible in case someone asks how they arrived at a result)
* chargeback for HPC resources for internal and external users
* use of idle resources for testing in enterprise IT, VDI, etc.
* compute fencing: as we head to 24 cores per box, most of our multi-threaded code can still only take advantage of 4-6 cores. This either leaves resources idle or users step on each other on shared nodes, and cgroups is a bit of a pain to maintain.
* checkpointing and restarting of long-running jobs (for prioritization and better protection against job failures), perhaps with LXC containers as an alternative to KVM (we use BLCR today, but that community is quite small)
* standardization of our infrastructure
* potential participation in FutureGrid, XSEDE, etc.

That's perhaps a lot to ask, but we would be looking at a 2-3 year time frame. What I don't quite understand is how one would handle workload management. Currently we see people using SGE, Moab, LSF and some Slurm in classic HPC. Concepts like backfill, preemption, fair share and such are probably unknown to OpenStack? If so, it would perhaps be acceptable to run a workload manager on a subset of always-on KVM systems, or even on bare metal, for classic HPC.
But how does one consolidate the reporting, billing and chargeback of two separate systems? Are there any efforts to integrate workload managers directly into Nova? SGE and Slurm are both open source and would support everything we require. Or are folks thinking about writing something from scratch in Python?

Thanks
dipe

On Thu, Jul 25, 2013 at 6:22 PM, Narayan Desai <narayan.desai@gmail.com> wrote:

Brian's right. You will end up doing a lot of work; Nova isn't ready for this out of the box. The key problems are:
- drivers for virtualization (either via SR-IOV or device passthrough) for net + GPU
- I/O architecture

There is apparently SR-IOV support for IB, provided that you have the right hardware, firmware and driver, though I haven't managed to make it work. This provides a pkey-isolated multi-tenant environment. That is basically a Mellanox-only solution. Like Brian said, Xen is the only way to go for GPU virtualization.

You can do some really interesting things with the I/O architecture. We've been experimenting with both GlusterFS and Ceph. Both seem to perform decently well; we've managed to get the GlusterFS setup going at 60 GB/s in aggregate with a pile of clients. There isn't good integration of all of these capabilities in mainline OpenStack yet, but this looks promising. Ceph's mainline integration looks better, but we haven't tried those pieces out yet.

At the end of the day, you need to ask yourself why you want to do this. If you're running a workload that is well suited to an HPC cluster, you should probably use one. If you need multi-tenancy or user control of system software, or need to run a workload poorly suited to a traditional cluster, then it is worth thinking strongly about; you'll end up needing to do a bunch of work, though. In our case, the reason we pursued this course is that we have workloads and developers that benefit from the cloud control plane.

hth.
-nld

On Thu, Jul 25, 2013 at 3:30 PM, Brian Schott <brian.schott@nimbisservices.com> wrote:

Joshua,

This is something those of us working in the gap between HPC and cloud have struggled with. We lack a strong project team within OpenStack IMHO, although there are several small groups doing HPC with OpenStack. Hopefully others will chime in on other topics, such as Infiniband support, but we did some testing with a GRID K2 card for GPU pass-through with different hypervisors. A talk I gave at the OpenStack DC meetup is here:
http://www.slideshare.net/bfschott/nimbis-schott-openstackgpustatus20130618

The short GPU answer is that it is possible with Xen, XenCenter, and XCP to pass through GPUs today, but OpenStack doesn't support this by default in Nova. This is still in roll-your-own mode for deployment.

Brian

-------------------------------------------------
Brian Schott, CTO
Nimbis Services, Inc.
brian.schott@nimbisservices.com
ph: 443-274-6064
fx: 443-274-6060

On Jul 25, 2013, at 4:06 PM, Joshua Dotson <josh@knoesis.org> wrote:

Hello.

A contingent of my organization, the Kno.e.sis Center at Wright State University (http://www.knoesis.org/), recently received a grant award which we intend to use to support a handful of mid-size HPC-style workloads (MPI <-- definitely, GPGPU <-- if possible/plausible) in addition to many mid-size IaaS-style workloads (MongoDB, Storm, Hadoop, many others).
As a third layer, I'm playing with the idea of evaluating an elastic OpenShift Origin atop the same infrastructure. Approximately $400k to $500k will hopefully be available for this deployment, though exact numbers are not yet available to me.

While I'm prepared to build a home-grown, small-to-mid-size "classical" HPC cluster using modern hardware, plus a smaller home-grown OpenStack silo for the minority stakeholders, I am hoping to find ways of making proponents of both workloads simultaneously happy, or close to it. That is, I would like to give my computer-scientist users a friendly way of running their HPC-style jobs on a combined, performance-tuned OpenStack silo. Doing so would balance the procured hardware and infrastructure between those users and the users who want a Tomcat or a Virtuoso instance.

I see a number of serious issues in realizing such a goal. For example, the state of Infiniband with OpenStack seems not quite ready/available/documented/accessible for such use in production, unless I'm just blind to the right blogs. The myriad added abstractions and the latency that virtualization might impose on an HPC task, not to mention cloud software-defined networking (Quantum, especially without hardware acceleration), seem likely to really get in the way of practicality, economics and efficiency. That said, most of what we do here isn't HPC, so I believe such trade-offs can be agreed upon, if a reasonable job scheduling and workload management mechanism can be found and accepted by all stakeholders, grant-proposal majority (HPC) and minority (IaaS) alike.

I get the impression from my reading that HPC-style deployment (separate from job queuing) against the EC2 API should work. I don't have a good feeling that the experience would be particularly friendly, however, without paying for closed-source applications. I'm thinking a high-performance Ceph install would help bring up the storage end of things in a modern, open-source, CoTS way. I've not done specific research on Lustre + OpenStack, but no reports of such a setup have presented themselves to me, either.

These blue-sky ideas matter nil, it seems, if a sufficiently large, high-performance, production-quality OpenStack deployment is beyond the funds to be allotted, which is something else I'm working on. I've built smallish but useful virt-manager, oVirt and OpenStack environments here already, but none of them are enough for the very important HPC job proposed for this grant. The scientist running the proposed computation gave me the following information to clarify what would give him parity (for his job only) with his experience running the computation at an external HPC service provider:

* MPI
* 20 Gbps Infiniband compute interconnect
* 600 cores (those currently used are probably G4 Opteron 2.4 GHz)
* 4 GB RAM per core
* at least 2 TB shared storage, though I'm thinking we need much more for use by our general community
* unsure of the storage networking topology

We're in the shopping phase for this grant award and are still playing with ideas; it seems likely to snap back into an old-school HPC cluster at this time. I've fielded communication about our needs to a number of OpenStack and hardware providers, in the hope that they can bring something helpful to the table.

Please let me know if you can point me in the right direction(s). I'm up for reading whatever text is thrown at me on this topic. :-)

Thanks,
Joshua

--
Joshua M. Dotson
Systems Administrator
Kno.e.sis Center
Wright State University - Dayton, OH

_______________________________________________
OpenStack-HPC mailing list
OpenStack-HPC@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-hpc
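A rough illustration of the Ceilometer approach Tim describes: finished batch jobs can be pushed in as samples with python-ceilometerclient, so batch and native cloud usage land in the same reporting store. This is only a sketch under assumptions: the meter name, resource id, credentials and metadata are invented, and the samples API field names may differ between Ceilometer releases.

    # Sketch only: publish one finished batch job as a Ceilometer sample.
    # Assumes the python-ceilometerclient v2 API; all names/values are invented.
    from ceilometerclient import client

    cclient = client.get_client(
        2,
        os_username='batch-accounting',        # hypothetical service user
        os_password='secret',
        os_tenant_name='batch',                # the dedicated batch project
        os_auth_url='http://keystone:5000/v2.0',
    )

    # One completed LSF/SLURM job becomes one sample; volume = core-hours used.
    cclient.samples.create(
        counter_name='batch.job.corehours',    # invented meter name
        counter_type='delta',
        counter_unit='core-hour',
        counter_volume=12.5,
        resource_id='lsf-job-1234567',         # batch job id as the resource
        resource_metadata={'queue': 'long', 'user': 'alice'},
    )

Reporting could then query the batch meter alongside the built-in compute meters, which is one possible answer to the question above about consolidating chargeback across two systems.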
On Fri, Jul 26, 2013 at 5:33 AM, Tim Bell <Tim.Bell@cern.ch> wrote:
Where we are trying to find a good solution is regarding how to use spot market resources in our cloud. If we allocate out a quota to a project but they do not use the full resources, we want to be able to offer that quota to others on a low SLA (killed with little warning). This allows us to get our resource utilisation up while ensuring that projects can get the resources they're entitled to when there is a need.

This is an interesting area to be getting a few sites together to share solutions.
This is an area I'm looking at as well. My short-term desire is to provide a way for the same hardware to sometimes be used to run virtualized instances and sometimes be provisioned as a bare-metal system (either through TripleO or manually reprovisioned). My short-term solution seems to be evolving around host aggregates and special flavors. That communicates the "these can disappear at any time" SLA, but doesn't address the quota issue, which I think is a more general need.

In many deployments resource cost is expressed directly as cost; in my research environment we don't have any direct internal billing for compute resources, and I don't pretend to understand the labyrinth of grants that fund the place, so "cost" here is expressed in quotas. I want people to have large (possibly infinite) quotas for cheap resources like "spot instances" and small quotas for "expensive" resources, for example aggregates with 1:1 virtual-to-physical resource allocation.

I thought I remembered talk of abstracting the quota system into its own project at the Portland summit, but can't seem to find it in my notes or online; if I didn't dream that, I'd love it if someone could refresh my memory.

-Jon

Jonathan Proulx
Sr. Technical Architect
MIT CSAIL
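For anyone who wants to experiment with the host-aggregate/special-flavor approach Jon describes, a minimal sketch with python-novaclient could look like the following. The aggregate name, flavor name, metadata key, host name and credentials are invented, and the scheduler needs AggregateInstanceExtraSpecsFilter enabled for the extra_specs match to have any effect.

    # Sketch only: tag a set of hosts as a preemptible 'spot' pool via a host
    # aggregate, and expose it through a flavor that only matches that pool.
    # All names and credentials below are hypothetical.
    from novaclient.v1_1 import client

    nova = client.Client('admin', 'secret', 'admin',
                         'http://keystone:5000/v2.0')

    # 1. Create the aggregate, tag it, and add a compute host to it.
    agg = nova.aggregates.create('spot-pool', None)
    nova.aggregates.set_metadata(agg.id, {'preemptible': 'true'})
    nova.aggregates.add_host(agg.id, 'compute-17')

    # 2. Create a flavor whose extra_specs only match hosts in that aggregate.
    flavor = nova.flavors.create('m1.spot', ram=4096, vcpus=2, disk=20)
    flavor.set_keys({'preemptible': 'true'})

With AggregateInstanceExtraSpecsFilter in scheduler_default_filters, instances booted with the spot flavor are confined to the tagged hosts; the "killed with little warning" policy itself still has to be enforced by something outside Nova.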
There is work we're doing with HP regarding quota management (see the blueprint https://wiki.openstack.org/wiki/KeystoneCentralizedQuotaManagement). The aim is to be able to define quotas at the domain level and then delegate administration down to the domain managers within their area, along with being able to allocate quotas to different cells/regions. It won't help for the spot market/utilisation scenarios, though; it's more about providing a single pane of glass for quota management and delegation.

Tim

From: jonathan.proulx@gmail.com [mailto:jonathan.proulx@gmail.com] On Behalf Of Jonathan Proulx
Sent: 26 July 2013 17:34
To: Tim Bell
Cc: Di Pe; openstack-hpc@lists.openstack.org
Subject: Re: [openstack-hpc] Looking for OpenStack workload management
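In the meantime, one knob that does exist today is the per-project quota API; it is project-wide rather than per-flavor or per-aggregate, which is exactly the gap being discussed. A rough sketch with python-novaclient, with the tenant id and numbers invented:

    # Sketch only: raise the compute quota for a project that is allowed to
    # consume lots of cheap/preemptible capacity. Values are made up.
    from novaclient.v1_1 import client

    nova = client.Client('admin', 'secret', 'admin',
                         'http://keystone:5000/v2.0')

    tenant_id = 'c3b1f0d2e4a548c983210ffa87654321'   # hypothetical project id
    nova.quotas.update(tenant_id, cores=2000, instances=500,
                       ram=4096000)                  # RAM quota is in MB

Nothing in this distinguishes spot capacity from the 1:1 aggregates within a single project, so the large-quota-for-cheap / small-quota-for-expensive split still has to be approximated with separate projects until something like the quota work above lands.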