Hello @all,
I recently observed some troublesome MPI behaviour with Elmer, both with the version shipped with Ubuntu 9.10 and with one freshly compiled from the SVN trunk.
With just 1 or 2 cores, the (otherwise identical) jobs of a dynamic simulation run as expected. With 3 or even 4 cores, however, the simulator reaches a state where it seems to keep calculating (judging from the load indicator) and shows no error message, but there is no observable progress any more, and the jobs that print their progress reports in the Solver Log window fall silent.
The stagnation seems non-deterministic, as the iteration step at which it occurs changes from run to run. The solver(s) can be killed from ElmerGUI, so they are obviously alive and receive signals even while stalled. When starting 4 solvers simultaneously, the first iterations (if they get through at all) appear markedly slower than with fewer cores, and even for the attached simple problem definition the deadlock is reached after just a few iteration rounds.
While for a more elaborate problem the 3-core version always dies during its execution in the described way, the attached example problem sometimes succeeds and sometimes dies.
The solver log file of a soon-stopping 4-core run of the attached example problem is attached as well.
Kind regards,
Peter
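Because the hang does not occur on every run, it can help to put a number on how often it happens. The following is only a minimal sketch under stated assumptions: `count_hangs` is a hypothetical helper name, it relies on the `timeout` utility from GNU coreutils, and the launcher and solver names in the example comment (`mpirun`, `ElmerSolver_mpi`) are assumptions that are not taken from this post.

```shell
#!/bin/sh
# Sketch: run a command several times under a time limit and count the runs
# that had to be killed. count_hangs and its arguments are hypothetical names.

# usage: count_hangs RUNS LIMIT_SECONDS COMMAND [ARGS...]
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # GNU timeout exits with status 124 when the time limit was hit
        timeout "$limit" "$@" >/dev/null 2>&1
        if [ "$?" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    # print the number of runs that exceeded the limit
    echo "$hangs"
}

# Example (assumed launcher and solver names, not taken from the thread):
#   count_hangs 10 600 mpirun -np 4 ElmerSolver_mpi
```

The time limit has to be chosen well above the duration of a successful run; otherwise slow but healthy runs are counted as hangs.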
Non-deterministic MPI deadlocking on Quadcores?
Attachments:
- SolverLog.txt: Log output during a soon-stopping 4-core run (5.91 KiB)
- MPI-Test.zip: Simple heat transport test case (41.55 KiB)
Re: Non-deterministic MPI deadlocking on Quadcores?
Hi Peter,
unfortunately I have to confirm what you say; I have exactly the same problem. I was wondering whether this might be due to the version or implementation of MPI used. I have installed Elmer from SVN trunk on two machines. The first runs Debian with OpenMPI installed by the package manager and works fine (the machine has 8 cores; I tested different runs using up to 6 cores and never noticed any problem). The second is a quad-core machine and uses OpenMPI 1.4, compiled by me with the following configure options:
Code: Select all
--enable-shared --enable-static --with-threads=posix --with-mpi-f90-size=medium F77=gfortran FC=gfortran --prefix=$opt_mpi_dir
where $opt_mpi_dir is the MPI install directory. On this machine, my computation always hangs after a while in exactly the manner you described (CPU running at 100%, but no more iterations). I tried different (older) releases of OpenMPI, without success. Installing a pre-compiled OpenMPI is not an option for me here because of some other software I'm using on this machine.
I'm wondering which OpenMPI you are using and how you installed it.
Best regards,
Martin Vymazal
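With both a packaged and a self-compiled OpenMPI in play, one thing that could be checked is whether the solver binary resolves to the same MPI installation as the `mpirun` used to launch it. This is only a possible line of investigation, not something established in this thread. The sketch below filters `ldd` output for MPI libraries; `mpi_libs` is a hypothetical helper name and `ElmerSolver_mpi` in the example comment is an assumed binary name.

```shell
#!/bin/sh
# Sketch: print the MPI shared libraries mentioned in `ldd` output.
# mpi_libs is a hypothetical helper name; it reads ldd output on stdin.
mpi_libs() {
    # ldd lines look like: "libmpi.so.0 => /usr/lib/libmpi.so.0 (0x...)"
    # Print the resolved path of every library whose name starts with libmpi,
    # or a "not found" note if the dynamic linker could not resolve it.
    awk '$1 ~ /^libmpi/ && $2 == "=>" {
        if ($3 == "not") print $1 ": not found"
        else print $3
    }'
}

# Example (assumed binary name):
#   ldd "$(command -v ElmerSolver_mpi)" | mpi_libs
#   mpirun --version
```

If the printed library path and the `mpirun` on the PATH belong to different installation prefixes, that mismatch would be worth ruling out before looking further.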
Re: Non-deterministic MPI deadlocking on Quadcores?
Hi,
Could you please post your compilation script and/or instructions for reproducing the problem?
I'm using the following script on my 32-bit Ubuntu 9.10 system (the relevant OpenMPI packages are libopenmpi-dev and openmpi-bin):
Code: Select all
#!/bin/sh -f
export CC=mpicc.openmpi
export CXX=mpic++.openmpi
export FC=mpif90.openmpi
export F77=mpif90.openmpi
export ELMER_HOME=/usr/local
modules="matc umfpack mathlibs elmergrid meshgen2d eio hutiter fem"
for m in $modules; do
    cd $m
    ./configure --with-mpi=yes --with-mpi-dir=/usr/lib/openmpi --prefix=$ELMER_HOME
    make clean
    make
    sudo make install
    cd ..
done
Re: Non-deterministic MPI deadlocking on Quadcores?
Hi mal,
Sorry for replying so late. I used the following script to compile Elmer:
Code: Select all
export COMPILER_PATH=$HOME/local/x86_64/bin
export CC="$COMPILER_PATH/mpicc"
export CXX="$COMPILER_PATH/mpicxx"
export FC="$COMPILER_PATH/mpif90"
export F77="$COMPILER_PATH/mpif90"
# Not sure the following line is really necessary:
export LIBS=-lpthread
# This is the folder with compiled Elmer binaries:
export ELMER_HOME="/data/software/elmerfem/elmer"
###################################
# options for the configure script:
###################################
export OPTIONS="--prefix=$ELMER_HOME --with-64bits=yes --with-mpi-lib-dir=$HOME/local/x86_64/lib --with-mpi-inc-dir=$HOME/local/x86_64/include --with-mpi-bin-dir=$HOME/local/x86_64/bin"
modules="matc umfpack mathlibs elmergrid meshgen2d eio hutiter fem post"
##### configure, build and install #########
for m in $modules; do
    echo "module $m"
    echo "###############"
    ##### parallel #######
    cd $m
    ./configure $OPTIONS
    make -j5
    make install
    cd ..
done
I can run Elmer on 2 cores. I noticed the problem while running on 3 cores: Elmer writes the initial VTK file and freezes after 5, maybe 10 minutes. Three cores show 100% load, but judging by the Elmer output, no more iterations are performed no matter how long I wait.
I tried several (self-compiled) versions of OpenMPI (1.3.1, 1.3.3, 1.4.1), but the problem remains the same. My gcc version is 4.4.3.
Best regards,
Martin Vymazal
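The symptom described above (full CPU load, but no new output) can be detected automatically by watching whether the solver log still grows. The following is a minimal sketch, not a tested diagnostic for Elmer: `log_stalled` is a hypothetical helper name, and the log file name and process name in the example comment are assumptions.

```shell
#!/bin/sh
# Sketch: report whether a log file stopped growing over a given interval.
# log_stalled is a hypothetical helper name.
# usage: log_stalled FILE SECONDS  -> exit status 0 if the size did not change
log_stalled() {
    before=$(wc -c < "$1")
    sleep "$2"
    after=$(wc -c < "$1")
    [ "$before" -eq "$after" ]
}

# Example (assumed log file name and process name):
#   if log_stalled solver.log 300; then
#       # dump the stack of every thread of one stuck rank (needs gdb)
#       gdb -p "$(pgrep -n ElmerSolver_mpi)" -batch -ex "thread apply all bt"
#   fi
```

A backtrace taken from a stalled rank would show which call each process is spinning in, which is usually more informative for a report than the log alone.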
Re: Non-deterministic MPI deadlocking on Quadcores?
Hi @all,
just a small positive update on this topic: Using the out-of-the-box Elmer version distributed with Ubuntu 10.04 eliminated the described problems. Both on 32-bit and 64-bit machines everything works as expected.
I have no clue what went wrong with the former versions, though.
Kind regards,
Peter