Elmer 8.1 using MUMPS 5.0.1 direct solve: SIGSEGV (11)


Elmer 8.1 using MUMPS 5.0.1 direct solve: SIGSEGV (11)

Post by SpicyBroseph »

Hello,

I've compiled Elmer against a self-compiled MUMPS on CentOS 6.7, both built with mpicc and the libraries from OpenMPI 1.6.5. (I compiled MUMPS with -fPIC to produce shared libraries, and it passes all the tests it ships with, so I don't think that's the problem.) I plan on submitting a tutorial on how I build the whole shebang once I get it working.

So, Elmer compiles and passes all but 4 of its tests, which from what I can gather is expected. However, if I set the direct solver to MUMPS, it takes a dump with a SIGSEGV:

Code:

ELMER SOLVER (v 8.1) STARTED AT: 2016/01/15 12:15:11
ParCommInit:  Initialize #PEs:            1
MAIN: 
MAIN: =============================================================
MAIN: ElmerSolver finite element software, Welcome!
MAIN: This program is free software licensed under (L)GPL
MAIN: Copyright 1st April 1995 - , CSC - IT Center for Science Ltd.
MAIN: Webpage http://www.csc.fi/elmer, Email elmeradm@csc.fi
MAIN: Version: 8.1 (Rev: 549ce2a, Compiled: 2016-01-15)
MAIN:  HYPRE library linked in.
MAIN:  Trilinos library linked in.
MAIN:  MUMPS library linked in.
MAIN: =============================================================
MAIN: 
MAIN: 
MAIN: -------------------------------------
MAIN: Reading Model: unitstest.sif
Loading user function library: [StressSolve]...[StressSolver_Init0]
Loading user function library: [SaveData]...[SaveScalars_Init0]
LoadMesh: Base mesh name: .//unitstest
LoadMesh: Elapsed time (CPU,REAL):     0.1080    0.1110 (s)
MAIN: -------------------------------------
AddVtuOutputSolverHack: Adding ResultOutputSolver to write VTU output in file: unitstest
Loading user function library: [StressSolve]...[StressSolver_Init]
Loading user function library: [StressSolve]...[StressSolver]
OptimizeBandwidth: ---------------------------------------------------------
OptimizeBandwidth: Computing matrix structure for: linear elasticity...done.
OptimizeBandwidth: Half bandwidth without optimization: 1414
OptimizeBandwidth: 
OptimizeBandwidth: Bandwidth Optimization ...done.
OptimizeBandwidth: Half bandwidth after optimization: 133
OptimizeBandwidth: ---------------------------------------------------------
Loading user function library: [SaveData]...[SaveScalars_Init]
Loading user function library: [SaveData]...[SaveScalars]
Loading user function library: [ResultOutputSolve]...[ResultOutputSolver_Init]
Loading user function library: [ResultOutputSolve]...[ResultOutputSolver]
MAIN: 
MAIN: -------------------------------------
MAIN:  Steady state iteration:            1
MAIN: -------------------------------------
MAIN: 
SingleSolver: Attempting to call solver
SingleSolver: Solver Equation string is: linear elasticity
StressSolve: 
StressSolve: --------------------------------------------------
StressSolve: Solving displacements from linear elasticity model
StressSolve: --------------------------------------------------
StressSolve: Starting assembly...
StressSolve: Assembly:
: .Bulk assembly done
DefUtils::DefaultDirichletBCs: Setting Dirichlet boundary conditions
DefUtils::DefaultDirichletBCs: Dirichlet boundary conditions set
StressSolve: Set boundaries done

Program received signal 11 (SIGSEGV): Segmentation fault.
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          login (PID 13019)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------


What is also concerning is the MPI warning about the fork, but I can disable that with a command-line MCA argument (even though it gives me heartburn to do so). But to test whether it's an issue with Elmer, I set the direct solver to umfpack, and it then works fine, which leads me to believe there is some issue with the way Elmer is interfacing with MUMPS that is causing it to go off into the weeds. Has Elmer been tested with the latest MUMPS version, 5.0.1, or only with 4.10 and below?
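
For reference, this is the workaround I mean; the mpirun syntax below assumes OpenMPI 1.6.x, and the .sif change is just the one line from my solver section switched to umfpack:

Code:

# Silences the fork() warning only -- it does not fix the SIGSEGV itself
mpirun -np 1 --mca mpi_warn_on_fork 0 ElmerSolver unitstest.sif

# Working fallback in the .sif:
#   Linear System Direct Method = umfpack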

It's not spitting out a stack trace even when I build Elmer with CMAKE_BUILD_TYPE=Debug, so I can't be sure where the error occurs. Has anybody else run into this?
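
In case it matters, this is roughly how I reconfigure for the debug build (the build directory below is a placeholder for my own; keep whatever other cmake options your original configure used):

Code:

# Reconfigure the existing build tree with debug symbols, then rebuild
cd /gpfs/admin/setup/elmer/build      # placeholder path
cmake -DCMAKE_BUILD_TYPE=Debug /gpfs/admin/setup/elmer/elmerfem
make -j8 && make install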

Here are the pertinent lines from my .sif file; I am using gmsh to create the mesh that I feed into ElmerGrid.

Code:

Solver 1
  Equation = Linear elasticity
  Procedure = "StressSolve" "StressSolver"
  Variable = -dofs 3 Displacement
  Exec Solver = Always
  Stabilize = True
  Bubbles = False
  Lumped Mass Matrix = False
  Optimize Bandwidth = True
  Steady State Convergence Tolerance = 1.0e-5
  Nonlinear System Convergence Tolerance = 1.0e-7
  Nonlinear System Max Iterations = 1
  Nonlinear System Newton After Iterations = 3
  Nonlinear System Newton After Tolerance = 1.0e-3
  Nonlinear System Relaxation Factor = 1
  Linear System Solver = Direct
  Linear System Direct Method = MUMPS
End


If anybody has any suggestions or things to try, I'm all ears. Thanks. In the meantime I'm going to add some debug statements to the Fortran to see where it's dying, and to try an older version of MUMPS (if I can get it to compile). I'll let you know if anything changes.
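
For the debug statements, I'm thinking of something along these lines around the MUMPS call in DirectSolve.F90 (the variable name mumps_par below is just a placeholder, not the actual name used in the Elmer source; I'll adapt it once I'm in the code):

Code:

! Illustrative prints around the MUMPS analysis call (placeholder names)
WRITE(*,*) 'MUMPS: starting analysis, n = ', mumps_par % N, ' nz = ', mumps_par % NZ
mumps_par % JOB = 1      ! 1 = analysis phase, which is where the crash seems to happen
CALL DMUMPS( mumps_par )
WRITE(*,*) 'MUMPS: analysis returned, INFOG(1) = ', mumps_par % INFOG(1)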

Nick

Re: Elmer 8.1 using MUMPS 5.0.1 direct solve: SIGSEGV (11)

Post by SpicyBroseph »

Hello,

An update. Compiling under normal conditions and running under gdb, I get the following stack trace, which seems to point to an error in DirectSolve.F90. The relevant snippet is below, and the entire stack trace follows it. I'm still wondering whether this is a version issue, but it looks like a legitimate bug. I've attached an archive of my files if you'd like to reproduce it. To do so, unpack it, go into "unitstest", and run the following (in parallel using 2 processes):

mpirun -np 2 ElmerSolver unitstest.sif

or to run gdb,

mpirun -np 2 gdb --command=gdb.cmd ElmerSolver unitstest.sif
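
(gdb.cmd is just a small batch command file so gdb runs non-interactively under mpirun; something like the following would do, though the one I used may differ slightly:)

Code:

# gdb.cmd -- illustrative batch commands for gdb
run
bt
quit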

Relevant snippet:

#8 0x00002aaaaaedc857 in directsolve::mumps_solvesystem (solver=..., a=..., x=..., b=..., free_fact=) at /gpfs/admin/setup/elmer/elmerfem/fem/src/DirectSolve.F90:889
Cannot access memory at address 0x0
#9 0x00002aaaaaed85a7 in directsolve::directsolver (a=..., x=..., b=..., solver=..., free_fact=) at /gpfs/admin/setup/elmer/elmerfem/fem/src/DirectSolve.F90:2352
Cannot access memory at address 0x0
Cannot resolve DW_OP_push_object_address for a missing object
#10 0x00002aaaaaf64b25 in solverutils::solvelinearsystem (a=0xa01d60, b=..., x=..., norm=0, dofs=3, solver=..., bulkmatrix=) at /gpfs/admin/setup/elmer/elmerfem/fem/src/SolverUtils.F90:10409
#11 0x00002aaaaaf601bc in solverutils::solvesystem (a=0xa01d60, para=0x0, b=..., x=..., norm=0, dofs=3, solver=...) at /gpfs/admin/setup/elmer/elmerfem/fem/src/SolverUtils.F90:10664
#12 0x00002aaaab1a4b85 in defutils::defaultsolve (usolver=) at /gpfs/admin/setup/elmer/elmerfem/fem/src/DefUtils.F90:2613
Cannot resolve DW_OP_push_object_address for a missing object

Code:

[root@login unitstest]#  mpirun -np 2 gdb --command=gdb.cmd ElmerSolver unitstest.sif
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /gpfs/apps/elmer/bin/ElmerSolver...GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
done.
"/gpfs/admin/setup/elmer/elmer-test/unitstest/unitstest.sif" is not a core dump: File format not recognized
Reading symbols from /gpfs/apps/elmer/bin/ElmerSolver...done.
"/gpfs/admin/setup/elmer/elmer-test/unitstest/unitstest.sif" is not a core dump: File format not recognized
[Thread debugging using libthread_db enabled]
[Thread debugging using libthread_db enabled]
ELMER SOLVER (v 8.1) STARTED AT: 2016/01/18 10:57:40
ELMER SOLVER (v 8.1) STARTED AT: 2016/01/18 10:57:40
[New Thread 0x2aaabb210700 (LWP 26899)]
[New Thread 0x2aaabb210700 (LWP 26900)]
[New Thread 0x2aaabc059700 (LWP 26901)]
[New Thread 0x2aaabc059700 (LWP 26902)]
ParCommInit:  Initialize #PEs:            2
MAIN: 
MAIN: =============================================================
MAIN: ElmerSolver finite element software, Welcome!
MAIN: This program is free software licensed under (L)GPL
MAIN: Copyright 1st April 1995 - , CSC - IT Center for Science Ltd.
MAIN: Webpage http://www.csc.fi/elmer, Email elmeradm@csc.fi
MAIN: Version: 8.1 (Rev: 549ce2a, Compiled: 2016-01-18)
MAIN:  Running in parallel using 2 tasks.
MAIN:  HYPRE library linked in.
MAIN:  Trilinos library linked in.
MAIN:  MUMPS library linked in.
MAIN: =============================================================
ParCommInit:  Initialize #PEs:            2
MAIN: 
MAIN: 
MAIN: -------------------------------------
MAIN: Reading Model: unitstest.sif
Loading user function library: [StressSolve]...[StressSolver_Init0]
Loading user function library: [SaveData]...[SaveScalars_Init0]
LoadMesh: Base mesh name: ./unitstest
LoadMesh: Elapsed time (CPU,REAL):     0.0580    0.0622 (s)
MAIN: -------------------------------------
AddVtuOutputSolverHack: Adding ResultOutputSolver to write VTU output in file: unitstest
Loading user function library: [StressSolve]...[StressSolver_Init]
Loading user function library: [StressSolve]...[StressSolver]
OptimizeBandwidth: ---------------------------------------------------------
OptimizeBandwidth: Computing matrix structure for: linear elasticity...done.
OptimizeBandwidth: Half bandwidth without optimization: 694
OptimizeBandwidth: 
OptimizeBandwidth: Bandwidth Optimization ...done.
OptimizeBandwidth: Half bandwidth after optimization: 133
OptimizeBandwidth: ---------------------------------------------------------
Loading user function library: [SaveData]...[SaveScalars_Init]
MAIN: 
MAIN: -------------------------------------
MAIN:  Steady state iteration:            1
MAIN: -------------------------------------
MAIN: 
ListToCRSMatrix: Matrix format changed from CRS to List
SingleSolver: Attempting to call solver
SingleSolver: Solver Equation string is: linear elasticity
StressSolve: 
StressSolve: --------------------------------------------------
StressSolve: Solving displacements from linear elasticity model
StressSolve: --------------------------------------------------
StressSolve: Starting assembly...
StressSolve: Assembly:
StressSolve: Bulk assembly done
DefUtils::DefaultDirichletBCs: Setting Dirichlet boundary conditions
DefUtils::DefaultDirichletBCs: Dirichlet boundary conditions set
StressSolve: Set boundaries done

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaebcebb8 in opal_memory_ptmalloc2_int_malloc () from /cm/shared/apps/openmpi/gcc/64/1.6.5/lib64/libmpi.so.1
#0  0x00002aaaaebcebb8 in opal_memory_ptmalloc2_int_malloc () from /cm/shared/apps/openmpi/gcc/64/1.6.5/lib64/libmpi.so.1
#1  0x00002aaaaebcf297 in opal_memory_ptmalloc2_int_memalign () from /cm/shared/apps/openmpi/gcc/64/1.6.5/lib64/libmpi.so.1
#2  0x00002aaaaebcfea3 in opal_memory_ptmalloc2_memalign () from /cm/shared/apps/openmpi/gcc/64/1.6.5/lib64/libmpi.so.1
#3  0x00002aaaad3a211e in gk_malloc () from /cm/shared/apps/parmetis/4.0.3/lib/libmetis.so
#4  0x00002aaaad3d126b in METIS_NodeND () from /cm/shared/apps/parmetis/4.0.3/lib/libmetis.so
#5  0x00002aaaab3b539b in dmumps_ana_f_ () from /gpfs/apps/elmer/bin/../lib/elmersolver/libelmersolver.so
#6  0x00002aaaab37ec93 in dmumps_ana_driver_ () from /gpfs/apps/elmer/bin/../lib/elmersolver/libelmersolver.so
#7  0x00002aaaab3497e7 in dmumps_ () from /gpfs/apps/elmer/bin/../lib/elmersolver/libelmersolver.so
#8  0x00002aaaaaedc857 in directsolve::mumps_solvesystem (solver=..., a=..., x=..., b=..., free_fact=) at /gpfs/admin/setup/elmer/elmerfem/fem/src/DirectSolve.F90:889
Cannot access memory at address 0x0
#9  0x00002aaaaaed85a7 in directsolve::directsolver (a=..., x=..., b=..., solver=..., free_fact=) at /gpfs/admin/setup/elmer/elmerfem/fem/src/DirectSolve.F90:2352
Cannot access memory at address 0x0
Cannot resolve DW_OP_push_object_address for a missing object
#10 0x00002aaaaaf64b25 in solverutils::solvelinearsystem (a=0xa01d60, b=..., x=..., norm=0, dofs=3, solver=..., bulkmatrix=) at /gpfs/admin/setup/elmer/elmerfem/fem/src/SolverUtils.F90:10409
#11 0x00002aaaaaf601bc in solverutils::solvesystem (a=0xa01d60, para=0x0, b=..., x=..., norm=0, dofs=3, solver=...) at /gpfs/admin/setup/elmer/elmerfem/fem/src/SolverUtils.F90:10664
#12 0x00002aaaab1a4b85 in defutils::defaultsolve (usolver=) at /gpfs/admin/setup/elmer/elmerfem/fem/src/DefUtils.F90:2613
Cannot resolve DW_OP_push_object_address for a missing object
#13 0x00002aaac0927bc3 in stresssolver () at /gpfs/admin/setup/elmer/elmerfem/fem/src/modules/StressSolve.F90:626
#14 0x00002aaaaada940a in loadmod::execsolver (fptr=46912863606402, model=..., solver=..., dt=1, transient=.FALSE.) at /gpfs/admin/setup/elmer/elmerfem/fem/src/LoadMod.F90:448
#15 0x00002aaaaafcbbf3 in mainutils::singlesolver (model=..., solver=0x8d9640, dt=1, transientsimulation=.FALSE.) at /gpfs/admin/setup/elmer/elmerfem/fem/src/MainUtils.F90:3884
#16 0x00002aaaaafca09b in mainutils::solveractivate (model=..., solver=0x8d9640, dt=1, transientsimulation=.FALSE.) at /gpfs/admin/setup/elmer/elmerfem/fem/src/MainUtils.F90:4051
#17 0x00002aaaaafda06c in mainutils::solvecoupled () at /gpfs/admin/setup/elmer/elmerfem/fem/src/MainUtils.F90:2021
#18 0x00002aaaaafdaeee in mainutils::solveequations (coupledminiter=0, coupledmaxiter=1, steadystatereached=.FALSE., realtimestep=1) at /gpfs/admin/setup/elmer/elmerfem/fem/src/MainUtils.F90:1781
#19 0x00002aaaab2f0fa7 in execsimulation (timeintervals=1, coupledminiter=0, coupledmaxiter=1, outputintervals=..., transient=.FALSE., scanning=.FALSE.) at /gpfs/admin/setup/elmer/elmerfem/fem/src/ElmerSolver.F90:1702
#20 0x00002aaaab2ec57b in elmersolver (initialize=0) at /gpfs/admin/setup/elmer/elmerfem/fem/src/ElmerSolver.F90:563
#21 0x0000000000401604 in solver () at /gpfs/admin/setup/elmer/elmerfem/fem/src/Solver.F90:69
#22 0x000000000040190d in main (argc=1, argv=0x7fffffffd82d '/gpfs/apps/elmer/bin/ElmerSolver\000') at /gpfs/admin/setup/elmer/elmerfem/fem/src/Solver.F90:34
#23 0x00002aaab4a9ad5d in __libc_start_main () from /lib64/libc.so.6
#24 0x0000000000401279 in _start ()
A debugging session is active.

	Inferior 1 [process 26894] will be killed.

Quit anyway? (y or n) [answered Y; input not from terminal]
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 26889 on
node login exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Attachments: unitstest.zip (452.4 KiB)

Re: Elmer 8.1 using MUMPS 5.0.1 direct solve: SIGSEGV (11)

Post by SpicyBroseph »

Hello,

An update. I initially suspected this to be an issue with OpenMPI given the backtrace. However, this is most definitely a ParMETIS issue (from within MUMPS). This is what I did:

I recompiled everything from source using OpenMPI 1.10.1 (including LAPACK, OpenBLAS, ScaLAPACK, ParMETIS, HYPRE, and MUMPS) and rebuilt Elmer: same error. Grr.

So I started going up the call stack and tried rebuilding MUMPS without ParMETIS, because of this:

Code:

#3  0x00002aaaad5e311e in gk_malloc () from /cm/shared/apps/parmetis/4.0.3/lib/libmetis.so
#4  0x00002aaaad61226b in METIS_NodeND () from /cm/shared/apps/parmetis/4.0.3/lib/libmetis.so
#5  0x00002aaaab14873b in dmumps_ana_f_ () from /gpfs/apps/elmer/bin/../lib/elmersolver/libelmersolver.so
#6  0x00002aaaab112033 in dmumps_ana_driver_ () from /gpfs/apps/elmer/bin/../lib/elmersolver/libelmersolver.so
#7  0x00002aaaab0dcb87 in dmumps_ () from /gpfs/apps/elmer/bin/../lib/elmersolver/libelmersolver.so
#8  0x00002aaaaae2cb98 in directsolve::mumps_solvesystem (solver=) at /gpfs/admin/setup/elmer/elmerfem/fem/src/DirectSolve.F90:889
...and lo and behold, it works! After extensive internet searching, it seems that ParMETIS is pretty buggy with MUMPS, so I suppose this isn't a huge surprise. Next, I am going to try to build MUMPS against METIS 5.1 to see if that works, and I'll report back; then I will also try to rebuild ParMETIS with METIS 5.1 (the METIS it comes with is 5.0) and again report back.
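
For anyone wanting to do the same, "rebuilding MUMPS without ParMETIS" just means dropping the ParMETIS/METIS entries from the orderings section of Makefile.inc so MUMPS falls back to its bundled PORD ordering. Roughly (paths and exact values here are from my setup and the stock examples, so adjust as needed):

Code:

# Orderings section of the MUMPS Makefile.inc, PORD only
LPORDDIR    = $(topdir)/PORD/lib/
IPORD       = -I$(topdir)/PORD/include/
LPORD       = -L$(LPORDDIR) -lpord

# ParMETIS lines removed/commented out -- this is what made the SIGSEGV go away
#LMETISDIR  = /cm/shared/apps/parmetis/4.0.3
#IMETIS     = -I$(LMETISDIR)/include
#LMETIS     = -L$(LMETISDIR)/lib -lparmetis -lmetis

ORDERINGSF  = -Dpord                 # was: -Dpord -Dparmetis
ORDERINGSC  = $(ORDERINGSF)
LORDERINGS  = $(LPORD)
IORDERINGSF =
IORDERINGSC = $(IPORD)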

Then I'll release my 'how to' compendium for building Elmer/MUMPS from source on CentOS 6.7. Thanks.