[Openstack-operators] [scientific-wg] Lustre war stories

Blair Bethwaite blair.bethwaite at gmail.com
Wed Jul 6 12:57:11 UTC 2016


Hi Álvaro, hi David -

NB: adding os-ops.

David, we have some real-time Lustre war stories we can share, and
hopefully some positive conclusions to go with them come Barcelona.
I've given an overview of what we're doing below. Are there any
specifics you were interested in when you raised Lustre in the
meeting?

Our present approach leans on SR-IOV and has worked with both
nova-network and now Neutron. Although this could be orchestrated
through the Mellanox Neutron ML2 driver, we are not yet using
OpenStack Networking for that. The approach achieves the plumbing
required to get a parallel filesystem integrated into a typical
virtualised OpenStack deployment, but it does not "cloudify" the
parallel filesystem in any way (for that you really need the
filesystem to have some concept of multi-tenancy and/or strong client
isolation), so it is only really applicable to one or a small number
of trusted users/projects.

Currently we have a single high-performance data network per cluster:
a high-bandwidth, RDMA-capable Ethernet fabric. Our Mellanox NICs
(not sure whether other vendors have similar features?) allow us to
restrict SR-IOV virtual functions (VFs) to specific VLANs, so we tie
them to the data VLAN and then use PCI passthrough (care of private
Nova flavors) to give guests a VF plugged straight into that network.
Guests need to load the appropriate drivers and configure their own
L3. The Lustre servers are traditional bare-metal affairs sitting at
the bottom of that subnet.
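
In case it's useful, here's a rough sketch of the plumbing involved.
The interface name, VLAN ID, PCI vendor/product IDs and flavor/alias
names below are illustrative rather than our actual config; the shape
is standard iproute2 VF configuration plus Nova's PCI whitelist/alias
options and a flavor extra spec:

  # On each hypervisor: pin a VF to the data VLAN
  ip link set dev enp5s0 vf 0 vlan 100

  # nova.conf on the compute node (Mitaka-era option names):
  #   pci_passthrough_whitelist = {"vendor_id": "15b3", "product_id": "1004"}
  #   pci_alias = {"vendor_id": "15b3", "product_id": "1004", "name": "mlx_vf"}

  # Private flavor that hands one VF to the guest
  nova flavor-create --is-public false lustre.client auto 16384 40 8
  nova flavor-key lustre.client set "pci_passthrough:alias"="mlx_vf:1"
  nova flavor-access-add lustre.client <trusted-project-id>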

We have one deployment like this which has been running Lustre over
TCP for about 12 months. That seems to work pretty well, except that
we are in the midst of investigating high rx_errors on the servers
and discards on the switches, which look like they might be causing
(or at least related to) the Lustre write checksum errors we see a
lot of. Those don't appear to be fatal or data-corrupting - they look
like Lustre transport-level errors - but they might cause write
errors to propagate to clients; we're unsure. That particular problem
does not seem to be inherently related to our host configs or use of
SR-IOV though, more likely a fabric config issue.
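
The counters we're watching while chasing this are just the standard
ones; e.g. (interface name illustrative):

  # NIC / kernel level counters on the Lustre servers
  ip -s link show dev eth2
  ethtool -S eth2 | grep -Ei 'err|disc|drop'

  # Lustre-side symptoms show up in the kernel log on clients and servers
  dmesg | grep -i checksum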

We have a second, slightly larger deployment with a similar
configuration; the most notable difference is that it is using
o2iblnd (the o2ib Lustre network driver), i.e., Lustre is configured
as for IB but is really running over RoCE. We plan to extract some
performance comparisons from this over the coming weeks (there I
would like to compare against both TCP over SR-IOV and TCP over
linux-bridge). Probably the main issue with this setup so far is the
need to build the Lustre modules against both the kernel and Mellanox
OFED - normally compute nodes like this stay very static, but now
that they are cloud instances there is a natural push towards more
frequent updating, and there is not a great deal of clarity about
which combinations are currently supported.
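
For the curious, the client-side difference between the two
deployments is essentially just the LNet network type; a rough sketch
(NIDs, filesystem and interface names are illustrative, and the MOFED
path varies by version):

  # /etc/modprobe.d/lustre.conf on a client
  options lnet networks="tcp0(eth1)"      # first deployment: Lustre over TCP
  # options lnet networks="o2ib0(eth1)"   # second deployment: o2iblnd over RoCE

  # Mount accordingly
  mount -t lustre 10.1.0.10@tcp0:/lfs /mnt/lustre
  # mount -t lustre 10.1.0.10@o2ib0:/lfs /mnt/lustre

  # Building client modules against Mellanox OFED rather than in-kernel OFED
  ./configure --with-o2ib=/usr/src/ofa_kernel/default && make rpms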

Cheers,

On 6 July 2016 at 21:31, Álvaro López García <aloga at ifca.unican.es> wrote:
> On 06 Jul 2016 (11:58), Stig Telfer wrote:
>> Hi Dave -
>
> Hi all,
>
>> I’d like to introduce Wojciech and Matt from across town at the
>> University.  Wojciech and Matt work on managing and developing the
>> Lustre storage here at Cambridge University.  Right now we are just
>> getting started on integrating Lustre into OpenStack but Blair (also
>> copied) has a similar setup up and running already at Monash
>> University in Melbourne.
>>
>> Parallel filesystems in general are an activity area for the
>> Scientific Working Group, so we try to keep people in touch about what
>> works and what is possible.
>
> Yes, it would be awesome to get some other user stories about parallel
> filesystems and share our approaches, concerns and hacks.
>
>> I’m aware that Lustre is also used in this way in CSC in Finland and
>> (from today’s discussion) Álvaro has a similar configuration using
>> GPFS at his university in Spain.
>
> A bit of context. We at CSIC are operating two separate computing
> infrastructures: one HPC node that is part of the Spanish supercomputing
> network, and one HTC cluster plugged into the European Grid
> Infrastructure. Access policies for the two systems are completely
> different, and they are used by a variety of disciplines (high energy
> physics, astrophysics, cosmology, engineering, bio, etc.). Both systems
> rely on GPFS: the HPC node leverages InfiniBand for the storage network,
> whereas the HTC one uses 10GbE or 1GbE.
>
> In the IRC meeting I said that these filesystems were not shared, but I
> was wrong. All the filesystems are shared across both infrastructures,
> with the particularity that the HPC filesystems are only exported
> read-only outside the HPC node (they are shared over Ethernet).
>
> The interesting part is that we are running a complete SGE cluster (HTC)
> on top of OpenStack, with access to GPFS. The HTC cluster is subject to
> periodic updates due to middleware upgrades, so we were already running a
> completely virtualized cluster in order to perform easy upgrades and
> rollbacks when needed; the natural next step was moving it on top of our
> cloud and managing the nodes using OpenStack.
>
> We are still using nova-network in multi-host mode. We have defined a
> floating IP pool within our storage network, completely reserved to the
> tenant used for spawning the cluster instances. We then use some
> contextualization scripts that allocate an IP from that pool on boot and
> set up the node as part of the GPFS cluster.
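>
> Roughly, the per-node contextualization boils down to something like
> the following (instance/node names, pool name and filesystem name are
> illustrative, and the GPFS steps are simplified):
>
>   # from the tenant: grab an address from the reserved storage pool
>   nova floating-ip-create storage-pool
>   nova floating-ip-associate htc-node-042 10.10.1.42
>
>   # from a GPFS admin node: add the new node and bring it up
>   mmaddnode -N htc-node-042
>   mmchlicense client --accept -N htc-node-042
>   mmstartup -N htc-node-042
>   mmmount gpfs_fs -N htc-node-042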
>
> There are some more hacks going on in the background due to the GPFS
> characteristics, but the overall setup is as described.
>
> Cheers,
> --
> Álvaro López García                              aloga at ifca.unican.es
> Instituto de Física de Cantabria         http://alvarolopez.github.io
> Ed. Juan Jordá, Campus UC                      tel: (+34) 942 200 969
> Avda. de los Castros s/n                            skype: aloga.csic
> 39005 Santander (SPAIN)



-- 
Cheers,
~Blairo


