Non-deterministic MPI deadlocking on Quadcores?

Discussion about building and installing Elmer
petroo
Posts: 148
Joined: 13 Jan 2010, 19:07
Location: Aachen, Germany

Non-deterministic MPI deadlocking on Quadcores?

Post by petroo »

Hello @all,

I recently observed some troublesome MPI behaviour in Elmer, both with the version shipped with Ubuntu 9.10 and with one freshly compiled from the SVN trunk.

Using just one or two cores, the (otherwise identical) jobs of a dynamic simulation run as expected. With three or four cores, however, the solver appears to keep calculating (judging from the load indicator) and shows no error message, but there is no observable progress any more: the jobs that print their progress reports in the Solver Log window fall silent.

The stagnation seems non-deterministic, as the iteration step at which it occurs changes from run to run. The solvers can be killed from ElmerGUI, so they are evidently still alive and receiving signals while stalled. When four solvers are started simultaneously, the first iterations (if they get through at all) appear markedly slower than with fewer cores, and even for the attached simple problem definition the deadlock is reached after just a few iteration rounds.

While the 3-core run of a more elaborate problem always stalls in the described way, the attached example problem sometimes succeeds and sometimes stalls.
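Because the failure is intermittent, it may help to put a number on it by repeating the same job under a wall-clock limit. The following is only a sketch: `count_hangs` is a hypothetical helper, it relies on the GNU coreutils `timeout` command (exit status 124 when the limit is hit), and the `mpirun` command line in the comment is an assumption about how the job would be started outside ElmerGUI. A run that is merely slow would also be counted, so the limit has to be chosen generously.

```shell
#!/bin/sh
# count_hangs RUNS LIMIT CMD [ARGS...]
# Runs CMD RUNS times, each capped at LIMIT seconds, and prints how many
# runs had to be cut off by `timeout` (which then exits with status 124).
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        status=0
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs"
}

# Hypothetical usage (command line is an assumption, run in the case directory):
#   count_hangs 10 600 mpirun -np 4 ElmerSolver_mpi
```

Comparing the count for 2, 3 and 4 processes would show whether the stall rate really depends on the number of cores.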

The solver log of a 4-core run of the attached example problem that stalled early is attached as well.

Kind regards,

Peter
Attachments
SolverLog.txt
Log output of a 4-core run that stalled early
MPI-Test.zip
Simple heat transport test case
Martin_
Posts: 11
Joined: 08 Feb 2010, 16:27

Re: Non-deterministic MPI deadlocking on Quadcores?

Post by Martin_ »

Hi Peter,

Unfortunately I have to confirm what you describe: I have exactly the same problem. I wonder whether it might be due to the MPI version or implementation used. I have installed Elmer from the SVN trunk on two machines. The first runs Debian with Open MPI installed through the package manager and works fine (the machine has 8 cores; I tested runs using up to 6 cores and never noticed any problem). The second is a quad-core machine and uses Open MPI 1.4, which I compiled myself with the following configure options:

Code:

--enable-shared --enable-static --with-threads=posix --with-mpi-f90-size=medium  F77=gfortran FC=gfortran --prefix=$opt_mpi_dir
where $opt_mpi_dir is the MPI install directory. On this machine, my computation always hangs after a while in exactly the manner you described (CPU at 100%, but no further iterations). I tried different (older) releases of Open MPI, without success. Installing a pre-compiled Open MPI is not an option for me here because of other software I use on this machine.

I'm wondering which Open MPI version you are using and how you installed it.
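For comparing installations between machines, the version reported by the launcher can be extracted with a small helper. This is a sketch under assumptions: `ompi_version` is a hypothetical name, and it expects `mpirun --version` to print a line of the form `mpirun (Open MPI) 1.4.1`, which should be verified against the release actually installed.

```shell
#!/bin/sh
# ompi_version: reads the output of `mpirun --version` on stdin and prints
# the first Open MPI version number found (prints nothing if none is found).
ompi_version() {
    sed -n 's/.*(Open MPI) \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical usage:
#   mpirun --version 2>&1 | ompi_version
#   command -v mpirun mpif90    # shows which install each tool comes from
```

Checking the paths printed by `command -v` as well would show whether the compiler wrapper used for building and the `mpirun` used for launching belong to the same installation.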

Best regards,

Martin Vymazal
mal
Site Admin
Posts: 54
Joined: 21 Aug 2009, 14:21

Re: Non-deterministic MPI deadlocking on Quadcores?

Post by mal »

Hi,

Could you please post your compilation script and/or instructions for reproducing the problem?

I'm using the following script on my 32-bit Ubuntu 9.10 system (relevant OpenMPI packages are libopenmpi-dev and openmpi-bin):

Code:

#!/bin/sh -f

export CC=mpicc.openmpi
export CXX=mpic++.openmpi
export FC=mpif90.openmpi
export F77=mpif90.openmpi

export ELMER_HOME=/usr/local

modules="matc umfpack mathlibs elmergrid meshgen2d eio hutiter fem"
for m in $modules; do
    cd $m
    ./configure --with-mpi=yes --with-mpi-dir=/usr/lib/openmpi --prefix=$ELMER_HOME
    make clean
    make
    sudo make install
    cd ..
done
Martin_
Posts: 11
Joined: 08 Feb 2010, 16:27

Re: Non-deterministic MPI deadlocking on Quadcores?

Post by Martin_ »

Hi mal,

Sorry for replying so late. I used the following script to compile Elmer:

Code:

export COMPILER_PATH=$HOME/local/x86_64/bin

export CC="$COMPILER_PATH/mpicc"
export CXX="$COMPILER_PATH/mpicxx"
export FC="$COMPILER_PATH/mpif90"
export F77="$COMPILER_PATH/mpif90"

# Not sure the following line is really necessary:
export LIBS=-lpthread 

#This is the folder with compiled Elmer binaries:
export ELMER_HOME="/data/software/elmerfem/elmer"

###################################
#options for the configure script:
###################################
export OPTIONS="--prefix=$ELMER_HOME --with-64bits=yes --with-mpi-lib-dir=$HOME/local/x86_64/lib --with-mpi-inc-dir=$HOME/local/x86_64/include --with-mpi-bin-dir=$HOME/local/x86_64/bin"

modules="matc umfpack mathlibs elmergrid meshgen2d eio hutiter fem post"

##### configure, build and install #########
for m in $modules; do
    echo "module $m"
    echo "###############"
    ##### parallel #######
    cd $m
    ./configure $OPTIONS
    make -j5
    make install
    cd ..
done
I can run Elmer on 2 cores. I noticed the problem when running on 3 cores: Elmer writes the initial VTK file and freezes after 5, maybe 10 minutes. Three cores show 100% load, but judging from Elmer's output, no more iterations are performed no matter how long I wait.

I tried several self-compiled versions of Open MPI (1.3.1, 1.3.3, 1.4.1), but the problem remains the same. My gcc version is 4.4.3.
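The "100% load but no more iterations" condition could be detected automatically by watching the modification time of the solver output. A minimal sketch, assuming the output is redirected to a file (the name `solver.log` is hypothetical) and that GNU `stat` and `date` are available, as on Ubuntu and Debian:

```shell
#!/bin/sh
# log_is_stale FILE SECONDS
# Succeeds (exit 0) if FILE was last modified more than SECONDS ago.
# Returns 2 if FILE cannot be inspected.
log_is_stale() {
    file=$1
    limit=$2
    now=$(date +%s)
    mtime=$(stat -c %Y "$file") || return 2
    [ $((now - mtime)) -gt "$limit" ]
}

# Hypothetical usage while a job writes to solver.log:
#   while sleep 60; do
#       if log_is_stale solver.log 300; then
#           echo "no solver output for 5 minutes"
#           break
#       fi
#   done
```

The threshold would have to be longer than the slowest legitimate iteration, otherwise a slow step is reported as a stall.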

Best regards,

Martin Vymazal
petroo
Posts: 148
Joined: 13 Jan 2010, 19:07
Location: Aachen, Germany

Re: Non-deterministic MPI deadlocking on Quadcores?

Post by petroo »

Hi @all,

Just a small positive update on this topic: using the out-of-the-box Elmer version distributed with Ubuntu 10.04 eliminated the described problems. On both 32-bit and 64-bit machines everything works as expected.

I have no clue what went wrong with the earlier versions, though.

Kind regards,

Peter