[openstack-hpc] Looking for practical Openstack + Ceph guidance for shared HPC

Joshua Dotson josh at knoesis.org
Fri Jul 26 17:19:21 UTC 2013


Narayan,

I just read about Mesos and Omega.  Wow.  This stuff is awesome.

Thanks,
Joshua


On Fri, Jul 26, 2013 at 11:24 AM, Narayan Desai <narayan.desai at gmail.com> wrote:

> I think that workload management is perhaps too vague of a term here.
> There are a few different processes in play.
>
> One is the system level resource management (nova-scheduler, etc). This
> component needs to figure out how to allocate the resources to tenants, and
> pull them back when needed.
>
> Inside of a single tenant's allocation, there is a need for workload
> management. This is a good place for traditional HPC resource managers;
> we've run torque in this capacity, for example.
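>
> A rough sketch of that pattern (the image, flavor, and script names below
> are made up; the nova flags are the standard ones): the tenant boots a pile
> of identical workers inside its allocation and points them at its own
> pbs_server via cloud-init user-data, e.g.
>
>   # boot 8 worker nodes inside the tenant's quota; join-torque.sh is a
>   # hypothetical script that registers each node with the tenant's pbs_server
>   nova boot --flavor m1.xlarge --image torque-worker-img \
>     --user-data join-torque.sh --min-count 8 --max-count 8 torque-mom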
>
> I think that the model today for nova scheduler is wrong; it only supports
> open-ended leases on compute resources. Another component that makes this
> tricky is the need to support interactive workloads; batch is good for,
> well, batch, but interactive is always problematic in these environments.
>
> There is some research targeting this problem, in particular Mesos and
> Omega. Mesos looks ok for serial workloads, but has a real problem with
> parallel (non-resizable) ones. Omega looks a lot better, but that is an
> internal google thing. I don't think that there is an off the shelf
> solution today.
> -nld
>
>
> On Fri, Jul 26, 2013 at 1:48 AM, Di Pe <dipeit at gmail.com> wrote:
>
>> All,
>>
>> one issue Joshua touched on was options for workload management. While IB
>> and GPU seem to be frequently discussed when it comes to openstack and HPC,
>> they are not so relevant in our HPC environment (3000-ish cores, each node
>> connected via 1G, scale-out NAS storage, biomedical research, genomics,
>> proteomics, statistics); many other midsize shops may have similar
>> setups. We are just starting to look at openstack for a potential
>> deployment with Ubuntu 14.04. We have had good experiences using KVM for
>> some of our resources. Some of the things we are hoping to get from
>> openstack in the future are:
>>
>>
>>    - flexible partitioning of resources for special sauce software
>>    (hadoop, interactive HPC software)
>>    - self service for developers and scientists
>>    - allow a research group that spans multiple research organizations
>>    (internal/external) controlled access to an isolated (virtual) datacenter
>>    (potentially with FISMA compliance)
>>    - save images that researchers built for later use (reproducible in
>>    case someone asks how they got to this result)
>>    - chargeback for HPC resources for internal and external users
>>    - usage of idle resources for testing in Enterprise IT, VDI, etc.
>>    - compute fencing (as we are heading to 24 cores per box, most of our
>>    multi-threaded code can still only take advantage of 4-6 cores; this either
>>    leaves resources idle or users step on each other on shared nodes, and
>>    cgroups are a bit of a pain to maintain)
>>    - checkpointing and restarting long-running jobs (for prioritization
>>    and better protection against job failures), perhaps with LXC containers
>>    as an alternative to KVM (we use BLCR today but that community is quite small)
>>    - standardization of our infrastructure
>>    - potential participation in futuregrid, XSEDE, etc
>>
>>
>> That's perhaps a lot to ask, but we would be looking at a 2-3 year time
>> frame. What I don't quite understand is how one would handle workload
>> management. Currently we see people using SGE, Moab, LSF, and some Slurm in
>> classic HPC. Concepts like backfill, preemption, fair share, and such things
>> are probably unknown to openstack? If so, it would perhaps be acceptable to
>> run a workload manager on a subset of always-on KVM systems or even
>> bare metal for classic HPC. But how does one consolidate the reporting,
>> billing, and chargeback of two separate systems?
>>
>> Are there any efforts to integrate workload managers directly into nova?
>> SGE and Slurm are both open source and would support everything we require.
>> Or are folks thinking about writing something from scratch in python?
>>
>> Thanks
>> dipe
>>
>>
>>
>>
>>
>> On Thu, Jul 25, 2013 at 6:22 PM, Narayan Desai <narayan.desai at gmail.com> wrote:
>>
>>> Brian's right. You will end up doing a lot of work; nova isn't ready for
>>> this out of the box.
>>>
>>> The key problems are:
>>>  - drivers for virtualization (either via SR-IOV or device passthrough)
>>> for net + gpu
>>>  - IO architecture
>>>
>>> There is apparently SR-IOV support for IB, provided that you have the
>>> right hardware, firmware, and driver, though I haven't managed to make it
>>> work. This provides a pkey-isolated multi-tenant environment. That is
>>> basically a Mellanox-only solution.
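>>>
>>> For reference, the nova-side wiring would presumably look something like
>>> this sketch -- it assumes the PCI passthrough options landing in Havana and
>>> Mellanox ConnectX-3 virtual functions; the vendor/product IDs are examples,
>>> so check lspci -nn on your own nodes:
>>>
>>>   # nova.conf on the compute nodes
>>>   pci_passthrough_whitelist = {"vendor_id":"15b3","product_id":"1004"}
>>>   # nova.conf on the controller
>>>   pci_alias = {"vendor_id":"15b3","product_id":"1004","name":"ib_vf"}
>>>   # add PciPassthroughFilter to whatever filter list you already use
>>>   scheduler_default_filters = RamFilter,ComputeFilter,PciPassthroughFilter
>>>   # flavor that requests one VF per instance
>>>   nova flavor-key hpc.ib set "pci_passthrough:alias"="ib_vf:1"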
>>>
>>> Like Brian said, Xen is the only way to go for GPU virtualization.
>>>
>>> You can do some really interesting things with the I/O architecture.
>>> We've been experimenting with both glusterfs and ceph. Both seem to perform
>>> decently well; we've managed to get the glusterfs setup going at 60 GB/s in
>>> aggregate with a pile of clients. There isn't good integration of all of
>>> the capabilities yet in mainline openstack, but this looks promising. Ceph's
>>> mainline integration looks better, but we haven't tried those
>>> things out yet.
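>>>
>>> For what it's worth, the mainline Ceph integration roughly amounts to
>>> pointing cinder and glance at RBD pools; a minimal sketch (pool and user
>>> names here are just the conventional examples) would be:
>>>
>>>   # cinder.conf
>>>   volume_driver = cinder.volume.drivers.rbd.RBDDriver
>>>   rbd_pool = volumes
>>>   rbd_user = cinder
>>>   rbd_secret_uuid = <uuid of the libvirt secret holding the cephx key>
>>>
>>>   # glance-api.conf
>>>   default_store = rbd
>>>   rbd_store_pool = images
>>>   rbd_store_user = glance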
>>>
>>> At the end of the day, you need to ask yourself why you want to
>>> accomplish this. If you're running a workload that is well suited to an HPC
>>> cluster, you should probably use one. If you need multi-tenancy, user
>>> control of system software, or need to run a workload poorly suited for a
>>> traditional cluster, then it is worth thinking strongly about. You'll end
>>> up needing to do a bunch of work though.
>>>
>>> In our case, the reason that we pursued this course is because we have
>>> workloads and developers that benefit from the cloud control plane.
>>>
>>> hth.
>>>  -nld
>>>
>>>
>>>
>>> On Thu, Jul 25, 2013 at 3:30 PM, Brian Schott <
>>> brian.schott at nimbisservices.com> wrote:
>>>
>>>> Joshua,
>>>>
>>>> It is something those of us working the gap between HPC and cloud have
>>>> struggled with.  We lack a strong project team within OpenStack IMHO,
>>>> although there are several small groups doing HPC with OpenStack.
>>>>  Hopefully others will chime in on some other topics, such as Infiniband
>>>> support, but we did some testing with a GRID K2 card for GPU pass-through
>>>> with different hypervisors.  A talk I gave at the OpenStack DC meetup is
>>>> here:
>>>>
>>>> http://www.slideshare.net/bfschott/nimbis-schott-openstackgpustatus20130618
>>>>
>>>> The short GPU answer is that it is possible with Xen, XenCenter, and
>>>> XCP to pass through GPUs today, but OpenStack doesn't have support by
>>>> default in Nova.  This is still in a roll-your-own mode for deployment.
>>>>
>>>> Brian
>>>>
>>>> -------------------------------------------------
>>>> Brian Schott, CTO
>>>> Nimbis Services, Inc.
>>>> brian.schott at nimbisservices.com
>>>> ph: 443-274-6064  fx: 443-274-6060
>>>>
>>>>
>>>>
>>>> On Jul 25, 2013, at 4:06 PM, Joshua Dotson <josh at knoesis.org> wrote:
>>>>
>>>> Hello.
>>>>
>>>> A contingent of my organization, the Kno.e.sis Center @ Wright State
>>>> University <http://www.knoesis.org/>, recently received a grant award
>>>> which we intend to use to support a handful of mid-size HPC-style workloads
>>>> (MPI <-- definitely, GPGPU <-- if possible/plausible) in addition to many
>>>> mid-size IaaS-style workloads (MongoDB, Storm, Hadoop, many others).  As a
>>>> third layer, I'm playing with the idea of evaluating an elastic OpenShift
>>>> Origin atop the same infrastructure.  Approximately $400k to $500k will
>>>> hopefully be available for this deployment, though exact numbers are not
>>>> yet available to me.
>>>>
>>>> While I'm prepared to build a home-grown small-to-mid-size "classical"
>>>> HPC, using modern hardware, and a smaller silo for home-grown Openstack for
>>>> the minority stakeholders, I am hoping to find ways of making proponents of
>>>> both workloads simultaneously happy, or close to it.  That is, I would like
>>>> to give my computer scientist users a friendly method of running their
>>>> HPC-style jobs on a combined performance-tuned silo of Openstack.  Doing so
>>>> would load-balance the procured hardware and infrastructure with the users
>>>> who want a Tomcat or a Virtuoso instance.
>>>>
>>>> I see a number of serious issues realizing such a goal.  For example,
>>>> the state of Infiniband vs. Openstack seems not quite
>>>> ready/available/documented/accessible for such use in production, unless
>>>> I'm just blind to the right blogs.  The myriad added abstractions and the
>>>> latency that virtualization might impose on an HPC task, not to mention cloud
>>>> software-defined networking (Quantum, especially when sans hardware
>>>> acceleration), seem likely to really get in the way of practicality,
>>>> economics, and efficiency.  That said, most of what we do here isn't HPC, so
>>>> I believe such trade-offs can be accepted, if a reasonable job
>>>> scheduling and workload management mechanism can be found and agreed upon
>>>> by all stakeholders, grant-proposal majority (HPC) and minority (IaaS)
>>>> alike.
>>>>
>>>> I get the impression from my readings that HPC-style deployment
>>>> (separate from job queuing) against the EC2 API should work.  I don't have
>>>> a good feeling that the experience would be particularly friendly, however,
>>>> without paying for closed source applications.  I'm thinking a
>>>> high-performance Ceph install would help bring up the storage end of things
>>>> in a modern open-source CoTS way.  I've not done specific research on
>>>> Lustre + Openstack, but no reports of such a setup have presented
>>>> themselves to me, either.
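>>>>
>>>> (As a concrete example of the EC2-API route, the standard euca2ools pointed
>>>> at nova's EC2 endpoint would look something like the following; the URL,
>>>> credentials, and image id are placeholders.  Whether that experience is
>>>> friendly enough for non-admin users is exactly my worry.)
>>>>
>>>>   export EC2_URL=http://controller:8773/services/Cloud
>>>>   export EC2_ACCESS_KEY=<access key> EC2_SECRET_KEY=<secret key>
>>>>   # launch 16 identically configured workers from one image
>>>>   euca-run-instances ami-00000001 -n 16 -t m1.large -k mykey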
>>>>
>>>> These blue sky ideas matter nil, it seems, if a sufficiently-large
>>>> high-performance production-quality Openstack deployment is beyond the
>>>> funds to be allotted, which is something else I'm working on.  I've built
>>>> smallish but useful virt-manager, oVirt and Openstack environments here
>>>> already, but none of them are enough for the very-important HPC job
>>>> proposed for this grant.  The scientist running the proposed computation
>>>> gave me the following information to clarify what would match (for his job
>>>> only) his experience running the computation with an external HPC service
>>>> provider:
>>>>
>>>>    - MPI
>>>>    - 20 Gbps Infiniband compute interconnect
>>>>    - 600 cores (those currently used are probably G4 Opteron 2.4 GHz)
>>>>    - 4 GB RAM per core
>>>>    - at least 2 TB shared storage, though I'm thinking we need much
>>>>    more for use by our general community
>>>>    - unsure of the storage networking topology
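>>>>
>>>> (Back-of-the-envelope, that works out to 600 cores x 4 GB = 2.4 TB of
>>>> aggregate RAM; at, say, 24 cores and 96 GB per node that would be roughly
>>>> 25 compute nodes, plus the IB fabric and the shared storage.)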
>>>>
>>>> We're in the shopping phase for this grant award and are still playing
>>>> with ideas.  It seems likely to snap back into an old-school HPC at this
>>>> time.  I've fielded communication about our needs to a number of Openstack
>>>> and hardware providers, in the hope that they can bring something helpful
>>>> to the table.
>>>>
>>>> Please let me know if you can point me in the right direction(s). I'm up
>>>> for reading whatever text is thrown at me on this topic.  :-)
>>>>
>>>> Thanks,
>>>> Joshua
>>>> --
>>>> Joshua M. Dotson
>>>> Systems Administrator
>>>> Kno.e.sis Center
>>>> Wright State University - Dayton, OH
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>


-- 
Joshua M. Dotson
Systems Administrator
Kno.e.sis Center
Wright State University - Dayton, OH
937-985-3246

