[Openstack-operators] Perform MPI program on a cluster of OpenStack instances

Reza Bakhshayeshi reza.b2008 at gmail.com
Fri May 17 15:35:45 UTC 2013


Hi all

For example, when I'm using m1.tiny instances with two cores in total
(a power of 2): with P and Q set to 1 and 2 (and np 2) it stops with the
error below, and with P and Q set to 2 and 2 (and np 4 or higher) it
hangs right at MPIRandomAccess, which is the first stage of the test.
At worst it takes a couple of minutes for two medium nodes to pass this
stage; here I waited seven hours and still saw four hpcc processes at
100% CPU, with nothing going on.
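
For reference, a rough sketch of how the runs are set up: the
process-grid lines below are from my hpccinf.txt (all other input lines
omitted), and the launch command is the same one shown further down, so
the P x Q grid matches the -np count:

    # hpccinf.txt excerpt (HPL.dat format): a 1 x 2 grid for the 2-rank run
    1            # of process grids (P x Q)
    1            Ps
    2            Qs

    # the hostfile lists the two single-core instances, one MPI rank per core
    mpirun -np 2 --hostfile hosts2 hpcc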

For MPICH I receive this error:

=====================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
=====================================================================================
[proxy:0:1 at ubuntu-benchmark02] HYD_pmcd_pmip_control_cmd_cb
(./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
[proxy:0:1 at ubuntu-benchmark02] HYDT_dmxu_poll_wait_for_event
(./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1 at ubuntu-benchmark02] main (./pm/pmiserv/pmip.c:226): demux engine
error waiting for event
[mpiexec at ubuntu-benchmark01] HYDT_bscu_wait_for_completion
(./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated
badly; aborting
[mpiexec at ubuntu-benchmark01] HYDT_bsci_wait_for_completion
(./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec at ubuntu-benchmark01] HYD_pmci_wait_for_completion
(./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for
completion
[mpiexec at ubuntu-benchmark01] main (./ui/mpich/mpiexec.c:405): process
manager error waiting for completion

and their developers' answer:
>BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
Here's the key part -- it sounds like your application process died
badly.  Sounds like a problem with your benchmark.
The rest is just cleanup of the remaining processes.

and in further discussion one of them said:
If cpi works then I suspect MPICH is not your problem.
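
(cpi is the small pi-calculation example shipped with MPICH; a minimal
sketch of that sanity check, assuming the built example is available at
the same path on both instances and that a hostfile like hosts2 lists
them:

    # run MPICH's bundled cpi example across both instances
    mpiexec -f hosts2 -n 2 ./examples/cpi

i.e. if this simple job runs cleanly across the instances, the MPI
library itself is probably fine.)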

and here is the Open MPI error:

localadmin at ubuntu-benchmark:~/hpcc-1.4.2$ mpirun -np 2 --hostfile hosts2
hpcc
localadmin at 192.168.100.3's password:
[ubuntu-benchmark:01828] *** Process received signal ***
[ubuntu-benchmark:02164] *** Process received signal ***
[ubuntu-benchmark:02164] Signal: Segmentation fault (11)
[ubuntu-benchmark:02164] Signal code: Address not mapped (1)
[ubuntu-benchmark:02164] Failing at address: 0xda3000
[ubuntu-benchmark:01828] Signal: Segmentation fault (11)
[ubuntu-benchmark:01828] Signal code: Address not mapped (1)
[ubuntu-benchmark:01828] Failing at address: 0x2039000
[ubuntu-benchmark:02164] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0)
[0x2b791e2d74a0]
[ubuntu-benchmark:02164] [ 1]
hpcc(HPCC_Power2NodesMPIRandomAccessCheck+0xa31) [0x423961]
[ubuntu-benchmark:02164] [ 2] hpcc(HPCC_MPIRandomAccess+0x87a) [0x41e53a]
[ubuntu-benchmark:02164] [ 3] hpcc(main+0xfbf) [0x40a2bf]
[ubuntu-benchmark:02164] [ 4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x2b791e2c276d]
[ubuntu-benchmark:02164] [ 5] hpcc() [0x40abfd]
[ubuntu-benchmark:02164] *** End of error message ***
[ubuntu-benchmark:01828] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0)
[0x2acd639c34a0]
[ubuntu-benchmark:01828] [ 1]
hpcc(HPCC_Power2NodesMPIRandomAccessCheck+0x9c8) [0x4238f8]
[ubuntu-benchmark:01828] [ 2] hpcc(HPCC_MPIRandomAccess+0x87a) [0x41e53a]
[ubuntu-benchmark:01828] [ 3] hpcc(main+0xfbf) [0x40a2bf]
[ubuntu-benchmark:01828] [ 4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x2acd639ae76d]
[ubuntu-benchmark:01828] [ 5] hpcc() [0x40abfd]
[ubuntu-benchmark:01828] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 2164 on node 192.168.100.3
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@Lorin
One more thing to add: I asked the Open MPI developers and they said you
don't need a distributed file system as long as Open MPI is installed in
the same location on every machine.
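
A quick way to double-check that on the two benchmark instances
(hostnames as they appear in the logs above):

    # verify Open MPI sits at the same prefix and version on every node
    for h in ubuntu-benchmark01 ubuntu-benchmark02; do
        ssh "$h" 'which mpirun && mpirun --version | head -1'
    done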

@Dave
I have used Open MPI in a virtualized environment this way before. I
didn't understand which incompatibility you mean. By the way, it's not
important to me whether I use Open MPI, MPICH or anything else; my goal
is just to run the test.

@Brian
Unfortunately --mca btl_openib_free_list_max didn't help; I got the same
error or the same hang.
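
For reference, this is roughly how the parameter was passed (the value
shown is only an example), together with a way to see which BTL
components are available at all, since the instances presumably only
have Ethernet/virtio NICs and the openib BTL may not even be in use:

    mpirun -np 2 --hostfile hosts2 --mca btl_openib_free_list_max 128 hpcc

    # list the BTL components Open MPI actually has available
    ompi_info | grep btl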

@Jacob

I set the memory overcommit ratio to 1 on both the server and the
instances in /proc/sys/vm/overcommit_ratio and disabled the hyperthread
siblings in /sys/devices/system/node/node0/cpu{1,3,5,7}/online, but
unfortunately neither helped and I still got the same error or the same
hang.
I was able to run the test within a single instance, though.
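
A sketch of those changes (the CPU numbers depend on the host's
topology; sibling pairs can be read from
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list):

    # memory overcommit settings, checked on the host and inside the guests
    sysctl vm.overcommit_memory vm.overcommit_ratio

    # take the hyperthread siblings offline
    echo 0 | sudo tee /sys/devices/system/node/node0/cpu{1,3,5,7}/online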




On 16 May 2013 20:56, Jacob Liberman <jliberma at redhat.com> wrote:

>  On 05/15/2013 08:08 AM, Reza Bakhshayeshi wrote:
>
>  Hi
>
>  I want to run an MPI program across the instances. I've already done
> it on a traditional and a virtual cluster, so I'm pretty sure my
> installation is healthy.
>  Unfortunately I can't run it on a cluster of OpenStack instances.
>  My MPI program is HPCC; it stops at the beginning of MPIRandomAccess.
>
>  I would be so grateful if anyone had a similar experience or can guess
> some possibilities and solutions.
>
>  Regards,
>  Reza
>
>
>
>  Does it hang or does it fail with an error? Please send along any errors.
>
> The HPCC random access test will size the problem to half the available
> RAM in the whole system.
>
> I would make sure your memory over commitment ratio is set to 1.
>
> I would also disable hyperthreading and make sure you are running on a
> power of 2 processor count.
>
> You can start by running the MPI test within a single instance on a single
> host.
>