ElmerSolver_mpi/VectorHelmholtz module crash on HPC

General discussion about Elmer
Post Reply
Vowa
Posts: 4
Joined: 07 Feb 2017, 04:10
Antispam: Yes

ElmerSolver_mpi/VectorHelmholtz module crash on HPC

Post by Vowa »

Hi ElmerFEM community,

I experienced a problem with the ElmerSolver_mpi/VectorHelmholtz module on a high-performance computer with the SLURM workload manager. About a third of the submissions of an identical simulation crash, with the typical error message below.

I've also attached the mesh and .sif file of the standard bent waveguide example from the ElmerGUI tutorial, which causes this problem.

Does anyone have an idea what is going wrong and how it could be solved?

Best regards

Code: Select all

ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45
ELMER SOLVER (v 8.2) STARTED AT: 2017/03/06 14:43:45

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2AF060D04367
#1  0x2AF060D0497E
#2  0x2AF06272491F
#3  0x2AF063726BA0
#4  0x2AF063724609
#5  0x2AF061F73D2C
#6  0x2AF061DDC4A2
#7  0x2AF061E4E28A
#8  0x2AF061E4039B
#9  0x2AF061E356F2
#10  0x2AF061F35E79
#11  0x2AF061DDA038
#12  0x2AF061DF9D79
#13  0x2AF061B4C707
#14  0x2AF05F40883E
#15  0x2AF05F4FC553
#16  0x401252 in MAIN__ at Solver.F90:69

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
srun: error: compute-a1-017: task 4: Segmentation fault (core dumped)
srun: Terminating job step 60962479.0
slurmstepd: error: *** STEP 60962479.0 ON compute-a1-017 CANCELLED AT 2017-03-06T14:43:47 ***
srun: Job step aborted: Waiting up to 122 seconds for job step to finish.
srun: error: compute-a1-017: tasks 0-3: Killed
srun: error: compute-a1-029: tasks 9-11: Killed
srun: error: compute-a1-019: tasks 5-8: Killed




Attachments
files.zip
.sif and ELMERSOLVER_STARTINFO file
(1.47 KiB) Downloaded 301 times
mesh.zip
the mesh
(530.19 KiB) Downloaded 319 times
kataja
Posts: 74
Joined: 09 May 2014, 16:06
Antispam: Yes

Re: ElmerSolver_mpi/VectorHelmholtz module crash on HPC

Post by kataja »

Hi

do you run into problems with other solvers? It seems strange that Elmer doesn't even print the initialization of the parallel environment structures, so you might be encountering issues with MPI linking.

Also, it looks like the PEC boundary is not defined everywhere in "Boundary Condition 3", which of course means that a PMC boundary condition (n x curl E = 0) is imposed there instead. This shouldn't, however, cause any crashes.
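For reference, a minimal sketch of how a PEC boundary is typically set for the VectorHelmholtz solver in the .sif file, assuming the edge-element field is named E with separate real and imaginary parts; the boundary numbering here is only illustrative, not taken from the attached case:

Code: Select all

```
! Hypothetical sketch: enforce PEC (n x E = 0) on the metal walls.
! Target boundary numbers are assumptions, not from the attached mesh.
Boundary Condition 3
  Target Boundaries(2) = 2 3
  E re {e} = Real 0
  E im {e} = Real 0
End
```

Any boundary left without such a constraint gets the natural (PMC) condition by default.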

Other things that should speed up convergence in your case:

Code: Select all

optimize bandwidth = false
linear system preconditioning = vanka
linear system iterative method = bicgstabl
linear system max iterations = 5000
Cheers,
Juhani
Vowa
Posts: 4
Joined: 07 Feb 2017, 04:10
Antispam: Yes

Re: ElmerSolver_mpi/VectorHelmholtz module crash on HPC

Post by Vowa »

Hi Juhani,

thanks for your swift reply. It did indeed turn out to be an MPI issue. Adding this line to my SLURM submission script mysteriously solved the issue:

Code: Select all

#SBATCH --ntasks-per-node=XX

(with XX an arbitrary integer)
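For anyone hitting the same problem, here is a hedged sketch of what such a submission script might look like with the fix in place; the task counts, time limit, and the assumption that ElmerSolver_mpi is on the PATH are all placeholders for your own setup:

Code: Select all

```shell
#!/bin/bash
# Hypothetical SLURM submission script sketch; adjust counts to your cluster.
#SBATCH --ntasks=12
#SBATCH --ntasks-per-node=4    # the line that resolved the crashes here
#SBATCH --time=00:30:00

# Launch the parallel Elmer solver; assumes ELMERSOLVER_STARTINFO is in
# the working directory and ElmerSolver_mpi is on the PATH.
srun ElmerSolver_mpi
```

Without --ntasks-per-node, SLURM is free to pack tasks onto nodes however it likes, which can interact badly with how the MPI library was launched or linked.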

Cheers