Hi there,
I did a performance and convergence test with the WhitneyAVSolver. My machine is an HP Z820 workstation with 16 CPU cores (32 with hyperthreading) and a total of 128 GB of RAM.
Problem description: a sphere with uniform magnetization lies in a cylinder filled with air. The magnetic flux density and the magnetic field strength have to be computed. (The analytical solution of this problem is known: the field strength inside the sphere is H = -1/3 * (M_x, M_y, M_z).)
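For reference, the analytic interior field used as the benchmark can be written directly (demagnetizing factor 1/3 along each axis for a sphere); a minimal sketch, with an arbitrary magnetization vector:

```python
# Interior H-field of a uniformly magnetized sphere: H = -M/3
# (demagnetizing factor 1/3 along each axis). The magnetization
# vector below is an arbitrary example, in A/m.
def sphere_interior_field(m):
    return tuple(-c / 3.0 for c in m)

print(sphere_interior_field((0.0, 0.0, 1.0e6)))
```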
This is done with meshes with an increasing number of nodes, from 5012 up to 1831866 (the mesh size doubles in every computed case). The results are attached in a LibreOffice Calc file as well as in a Python script using matplotlib.
There are a few things I wanted to ask:
1) Do you know the theoretical rate of convergence the solver should have? Although my solution is quite accurate everywhere except on the sphere's surface (where there is a delta peak), the convergence rate is approximately number_of_nodes^(-1/3), which is linear with respect to one spatial dimension but not that good...
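Since the mesh spacing h of a 3D mesh scales like N^(-1/3) in the number of nodes N, a rate of N^(-1/3) corresponds to first-order convergence in h. A minimal sketch of how the observed rate can be extracted from a refinement study (the error values below are invented purely for illustration; each refinement octuples the node count and halves the error):

```python
import math

# Node counts of a hypothetical refinement study (x8 per step) and
# made-up error values that halve per step, for illustration only.
nodes = [5012, 40096, 320768, 2566144]
errors = [0.080, 0.040, 0.020, 0.010]

# Least-squares slope of log(error) vs. log(nodes) gives the rate p
# in error ~ C * N^p. For h-linear convergence in 3D, p ~ -1/3.
lx = [math.log(n) for n in nodes]
ly = [math.log(e) for e in errors]
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
p = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) \
    / sum((x - mx) ** 2 for x in lx)
print(f"observed rate: N^({p:.3f})")
```

With real data, plugging the measured errors into `errors` gives the observed rate the same way.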
2) Do you know how the memory is organized? I had problems running the tests with the finest grids on a single, double, or quad core, although I could run them on many cores, where the solution is correct. It seems like I have to double the number of cores whenever I double the number of nodes. In my case that would mean I'm restricted to approximately 4 million nodes unless I physically upgrade my workstation. Is that really the case?
3) The speedup seems to be bounded above by about 6. Is that realistic?
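A hard ceiling like that is characteristic of a serial fraction in the run (Amdahl's law). As a hedged illustration only, assuming a hypothetical serial fraction of 15% (assembly, I/O, or other non-parallel parts), the speedup saturates just below 7 no matter how many cores are used:

```python
# Amdahl's law: with serial fraction s, the speedup on p cores is
# 1 / (s + (1 - s) / p), which approaches 1 / s as p grows.
# The serial fraction here is a hypothetical value, not measured
# from the solver.
def amdahl_speedup(cores: int, serial_fraction: float = 0.15) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for cores in (1, 2, 4, 8, 16, 32):
    print(cores, round(amdahl_speedup(cores), 2))
# With s = 0.15 the speedup can never exceed 1 / 0.15, about 6.7.
```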
Thanks in advance for your thoughts.
Best regards, Stefan
Whitney Solver Parallel: Performance and Memory Organisation
- Site Admin
- Posts: 4828
- Joined: 22 Aug 2009, 11:57
- Antispam: Yes
- Location: Espoo, Finland
- Contact:
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi Stefan
Nice tests you have done! I'll comment briefly; maybe somebody else can answer on the theoretical convergence rate.
2) If you run with MPI, the memory needed is distributed equally among the processes. As each CPU has its own memory, the more CPUs you have, the bigger jobs you can run.
3) The solver has shown good scalability up to hundreds of cores. Below you find a case that scales wonderfully up to 256 cores. Unfortunately I don't remember the size of the problem, but I think it had around a few million dofs. Typically you need roughly ~10,000 dofs per core to obtain good scaling. Obviously your case does not scale that well. Of course, the attached scalability results were obtained on a supercomputer, but within one CPU that should not have much effect. Perhaps there is something non-ideal in your case.
-Peter
- Attachments
- EndWindingsScalabilityOnSisu.png
- AV solver scalability
- (713.61 KiB)
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi
These solver settings give at least marginally better speed. However, I found nothing really wrong here.
-Peter
Code: Select all
Solver 1
  Equation = "MGDynamics"
  Variable = "A"
  Procedure = "MagnetoDynamics" "WhitneyAVSolver"
  Fix Input Current Density = Logical False
  ! Static linear problem: one nonlinear iteration is enough
  Newton-Raphson Iteration = Logical False
  Nonlinear System Max Iterations = 1
  Nonlinear System Convergence Tolerance = 1e-6
  ! Declaring the matrix symmetric allows a cheaper iterative solution
  Linear System Symmetric = Logical True
  Linear System Solver = "Iterative"
  Linear System Preconditioning = None
  Linear System Convergence Tolerance = 1e-8
  Linear System Residual Output = 100
  Linear System Max Iterations = 5000
  Linear System Iterative Method = BiCGstabl
  Steady State Convergence Tolerance = 1e-6
End
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi,
thanks for your reply. I don't think I fully understood how the memory sharing works...
I have two CPUs, each of which has 8 physical cores. I always thought these 8 cores share 128 GB / 2 CPUs = 64 GB of memory. From my little test it seems that the maximum amount of memory a single core can address is far below 64 GB. In other words: do I need to do an mpirun with many cores to be able to use all of the 128 GB? Does that mean that if I want to handle bigger problems, the only way is to increase the number of CPUs and the memory per CPU, and that an increase of memory alone won't work?
So, do you think that with my type of problem and my workstation I'm restricted to approximately 4 million nodes (approx. 20 million tetrahedra)?
Thanks a lot for your help.
Best regards, Stefan
- Posts: 27
- Joined: 13 Aug 2013, 16:50
- Antispam: Yes
Re: Whitney Solver Parallel: Performance and Memory Organisation
Could you upload your sif file?
Maybe your mesh too, if it is not too big?
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi there,
I made a little investigation and tried to find out what is going on.
1) By monitoring the system I can now tell that each core can address the whole RAM space. Still, the solver succeeds only with a multi-core MPI run. In the attached logfile ( ) you can see the solver output before aborting. I also noted the maximum RAM usage before aborting in the log file. After all this I still don't understand why a single-, double-, or quad-core run doesn't work, because in my opinion it should.
2) I also tried to track how much RAM the problem takes up. This is what I found:
Number of nodes - problem size in RAM
121964 - 10.2 GB
238476 - 18.1 GB
471018 - 33.0 GB
911006 - 61.0 GB
1831866 - 119.5 GB
That is a bit of a setback, because it means the biggest mesh I can handle on my machine has ~1.9M nodes. This estimate is confirmed by a test run with a 2M-node mesh, which didn't work even with all cores.
I really expected the maximum problem size on my machine to be much bigger; after all, it has 128 GB of RAM.
Is there anything I could do about that besides upgrading my machine, or am I lost?
Is it possible that memory is not being deallocated properly? (I heard Fortran programs sometimes mess up the allocation/deallocation process.)
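The measurements above are close to linear in the node count, so the ceiling can be extrapolated with a simple least-squares fit; a back-of-the-envelope sketch using only the numbers from the table:

```python
# Least-squares linear fit of RAM usage (GB) against node count, using
# the measurements reported above, to extrapolate the largest mesh that
# fits into 128 GB. A rough estimate, not a model of the solver.
nodes = [121964, 238476, 471018, 911006, 1831866]
ram_gb = [10.2, 18.1, 33.0, 61.0, 119.5]

n = len(nodes)
mean_x = sum(nodes) / n
mean_y = sum(ram_gb) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(nodes, ram_gb)) \
    / sum((x - mean_x) ** 2 for x in nodes)
offset = mean_y - slope * mean_x  # fixed cost independent of mesh size

max_nodes = (128.0 - offset) / slope
print(f"memory per million nodes: {slope * 1e6:.1f} GB")
print(f"estimated ceiling at 128 GB: {max_nodes / 1e6:.2f} million nodes")
```

With these numbers the fit gives roughly 64 GB per million nodes plus a small fixed offset, and a ceiling just below 2M nodes, consistent with the failed 2M-node run.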
3) For akirahinoshiro: here is the sif file: I can't upload the mesh file here; it is 500 MB. If you really need it, I could provide a Dropbox link.
Best regards, Stefan
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi there,
I tried to find out whether there are memory leaks or memory allocation errors. To do so I recompiled fem with debug flags and investigated the code with Intel Inspector, a memory profiler.
Surprisingly, Intel Inspector found a list of memory issues, shown below:
Code: Select all
P1 Missing allocation Lists.f90 libelmersolver-7.0.so New
P2 Memory leak ElmerSolver.f90 libelmersolver-7.0.so 784 New
P3 Memory leak ElmerSolver.f90 libelmersolver-7.0.so 8667 New
P4 Memory leak [Unknown] ElmerSolver 16 New
P5 Memory leak GeneralUtils.f90 libelmersolver-7.0.so 1598188 New
P6 Memory leak Load.c libelmersolver-7.0.so 8672 New
P7 Memory leak Load.c libelmersolver-7.0.so 1896 New
P8 Memory leak MagnetoDynamics.f90 MagnetoDynamics.so 3440 New
P9 Memory leak MeshUtils.f90 libelmersolver-7.0.so 784 New
P10 Memory leak MeshUtils.f90 libelmersolver-7.0.so 2120 New
P11 Memory leak MeshUtils.f90 libelmersolver-7.0.so 107712 New
P12 Memory leak MeshUtils.f90 libelmersolver-7.0.so 13008 New
P13 Memory leak MeshUtils.f90 libelmersolver-7.0.so 32 New
P14 Memory leak MeshUtils.f90 libelmersolver-7.0.so 4128 New
P15 Memory leak MeshUtils.f90 libelmersolver-7.0.so 95104 New
P16 Memory leak ModelDescription.f90 libelmersolver-7.0.so 3584 New
P17 Memory leak ModelDescription.f90 libelmersolver-7.0.so 784 New
P18 Memory leak Solver.f90 ElmerSolver 8531 New
P19 Memory leak [Unknown] libelmersolver-7.0.so 404 New
P20 Invalid partial memory access ModelDescription.f90 libelmersolver-7.0.so New
P21 Invalid partial memory access ModelDescription.f90 libelmersolver-7.0.so New
P22 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P23 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P24 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P25 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P26 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P27 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P28 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P29 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P30 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P31 Memory not deallocated DefUtils.f90; ElementDescription.f90; ElmerSolver.f90; GeneralUtils.f90; HashTable.f90; Integration.f90; Lists.f90; Load.c; MagnetoDynamics.f90; MainUtils.f90; MeshUtils.f90; ModelDescription.f90; Solver.f90 ElmerSolver; MagnetoDynamics.so; libelmersolver-7.0.so 1384876 New
Additionally, the solver now aborts due to segmentation faults with the following error message:
Code: Select all
Image PC Routine Line Source
libelmersolver-7. 00007FF1A21FE704 lists_mp_listgetr 2616 Lists.f90
MagnetoDynamics.s 00007FF19F1BF858 magnetodynamicsca 4663 MagnetoDynamics.f90
libelmersolver-7. 00007FF1A23BC310 Unknown Unknown Unknown
libelmersolver-7. 00007FF1A23BC369 execsolver_ 532 Load.c
libelmersolver-7. 00007FF1A247B707 mainutils_mp_sing 3614 MainUtils.f90
libelmersolver-7. 00007FF1A247C655 mainutils_mp_solv 3776 MainUtils.f90
libelmersolver-7. 00007FF1A2466BE3 Unknown Unknown Unknown
libelmersolver-7. 00007FF1A246242C mainutils_mp_solv 1483 MainUtils.f90
libelmersolver-7. 00007FF1A27416B8 Unknown Unknown Unknown
libelmersolver-7. 00007FF1A2734716 elmersolver_ 628 ElmerSolver.f90
ElmerSolver 000000000040AE53 MAIN__ 271 Solver.f90
ElmerSolver 000000000040AA56 Unknown Unknown Unknown
libc.so.6 0000003A18C1ED1D Unknown Unknown Unknown
ElmerSolver 000000000040A949 Unknown Unknown Unknown
These errors are consistent with Intel Inspector's memory issues (it allows getting more detailed information than what I posted above), so I guess there has to be some problem.
I am planning to debug the whole thing and try to get rid of these errors, if they are errors at all. (I still believe this could be due to Intel Fortran compiler specifics.)
How do I best proceed? Shall I adjust the source files and then send them in if I can successfully remove the memory errors?
Or shall I post detailed reports of every single error?
Maybe I should start a new thread for that...
Please tell me what you think.
Thanks, Stefan