Hi there,
I did a performance and convergence test with the WhitneyAVSolver. My machine is an HP Z820 workstation with 16 CPU cores (32 with hyperthreading) and a total of 128 GB of RAM.
Problem description: a sphere with uniform magnetization lies in a cylinder filled with air. The magnetic flux density and the magnetic field strength have to be computed. (The analytical solution of this problem is known: the field strength inside the sphere is H = -1/3 * (M_x, M_y, M_z).)
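For reference, the analytic interior field used as the benchmark can be written directly (demagnetizing factor 1/3 along each axis for a sphere); a minimal sketch, with an arbitrary magnetization vector:

```python
# Interior H-field of a uniformly magnetized sphere: H = -M/3
# (demagnetizing factor 1/3 along each axis). The magnetization
# vector below is an arbitrary example, in A/m.
def sphere_interior_field(m):
    return tuple(-c / 3.0 for c in m)

print(sphere_interior_field((0.0, 0.0, 1.0e6)))
```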
This is done with meshes with an increasing number of nodes, from 5012 up to 1831866 (the mesh size doubles in every computed case). The results are attached in a LibreOffice Calc file as well as in a Python script using matplotlib.
There are a few things I wanted to ask:
1) Do you know the theoretical rate of convergence the solver should have? Although my solution is quite accurate everywhere except on the sphere's surface (where there is a delta peak), the convergence rate is approximately number_of_nodes^(-1/3), which is linear with respect to one spatial dimension but not that good...
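Since the mesh spacing h of a 3D mesh scales like N^(-1/3) in the number of nodes N, a rate of N^(-1/3) corresponds to first-order convergence in h. A minimal sketch of how the observed rate can be extracted from a refinement study (the error values below are invented purely for illustration; each refinement octuples the node count and halves the error):

```python
import math

# Node counts of a hypothetical refinement study (x8 per step) and
# made-up error values that halve per step, for illustration only.
nodes = [5012, 40096, 320768, 2566144]
errors = [0.080, 0.040, 0.020, 0.010]

# Least-squares slope of log(error) vs. log(nodes) gives the rate p
# in error ~ C * N^p. For h-linear convergence in 3D, p ~ -1/3.
lx = [math.log(n) for n in nodes]
ly = [math.log(e) for e in errors]
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
p = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) \
    / sum((x - mx) ** 2 for x in lx)
print(f"observed rate: N^({p:.3f})")
```

With real data, plugging the measured errors into `errors` gives the observed rate the same way.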
2) Do you know how the memory is organized? I had problems running the tests with the finest grids on a single, double, or quad core, although I could run them on many cores, where the solution is correct. It seems like I have to double the number of cores whenever I double the number of nodes. In my case that would mean I'm restricted to approximately 4 million nodes unless I physically upgrade my workstation. Is that really the case?
3) The speedup seems to be bounded above by about 6. Is that realistic?
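A hard ceiling like that is characteristic of a serial fraction in the run (Amdahl's law). As a hedged illustration only, assuming a hypothetical serial fraction of 15% (assembly, I/O, or other non-parallel parts), the speedup saturates just below 7 no matter how many cores are used:

```python
# Amdahl's law: with serial fraction s, the speedup on p cores is
# 1 / (s + (1 - s) / p), which approaches 1 / s as p grows.
# The serial fraction here is a hypothetical value, not measured
# from the solver.
def amdahl_speedup(cores: int, serial_fraction: float = 0.15) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for cores in (1, 2, 4, 8, 16, 32):
    print(cores, round(amdahl_speedup(cores), 2))
# With s = 0.15 the speedup can never exceed 1 / 0.15, about 6.7.
```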
Thanks in advance for your thoughts.
Best regards, Stefan
Whitney Solver Parallel: Performance and Memory Organisation
- Site Admin
- Posts: 4828
- Joined: 22 Aug 2009, 11:57
- Antispam: Yes
- Location: Espoo, Finland
- Contact:
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi Stefan
Nice tests you have done! I'll comment briefly; maybe somebody else can answer on the theoretical convergence rate.
2) If you run with MPI, the memory needed is distributed equally among the processes. As each CPU has its own memory, the more CPUs you have, the bigger jobs you can run.
3) The solver has shown good scalability up to hundreds of cores. Below you find a case that scales wonderfully up to 256 cores. Unfortunately I don't remember the size of the problem, but I think it had around a few million dofs. Typically you need roughly ~10,000 dofs per core to obtain good scaling. Obviously your case does not scale that well. Of course, the attached scalability results were obtained on a supercomputer, but within one CPU that should not have much effect. Perhaps there is something non-ideal in your case.
-Peter
- Attachments
- EndWindingsScalabilityOnSisu.png
- AV solver scalability
- (713.61 KiB)
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi
These solver settings give at least marginally better speed. However, I found nothing really wrong here.
-Peter
Code: Select all
Solver 1
  Equation = "MGDynamics"
  Variable = "A"
  Procedure = "MagnetoDynamics" "WhitneyAVSolver"
  Fix Input Current Density = Logical False
  ! Static linear problem: one nonlinear iteration is enough
  Newton-Raphson Iteration = Logical False
  Nonlinear System Max Iterations = 1
  Nonlinear System Convergence Tolerance = 1e-6
  ! Declaring the matrix symmetric allows a cheaper iterative solution
  Linear System Symmetric = Logical True
  Linear System Solver = "Iterative"
  Linear System Preconditioning = None
  Linear System Convergence Tolerance = 1e-8
  Linear System Residual Output = 100
  Linear System Max Iterations = 5000
  Linear System Iterative Method = BiCGstabl
  Steady State Convergence Tolerance = 1e-6
End
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi,
thanks for your reply. I don't think I fully understood how the memory sharing works...
I have two CPUs, each of which has 8 physical cores. I always thought these 8 cores share 128 GB / 2 CPUs = 64 GB of memory. From my little test it seems that the maximum amount of memory a single core can address is far below 64 GB. In other words: do I need to do an mpirun with many cores to be able to use all of the 128 GB? Does that mean that if I want to handle bigger problems, the only way is to increase the number of CPUs and the memory per CPU, and that an increase of memory alone won't work?
So, do you think that with my type of problem and my workstation I'm restricted to approximately 4 million nodes (approx. 20 million tetrahedra)?
Thanks a lot for your help.
Best regards, Stefan
- Posts: 27
- Joined: 13 Aug 2013, 16:50
- Antispam: Yes
Re: Whitney Solver Parallel: Performance and Memory Organisation
Could you upload your sif file?
Maybe your mesh too, if it is not too big?
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi there,
I made a little investigation and tried to find out what is going on.
1) By monitoring the system I can now tell that each core can address the whole RAM space. Still, the solver succeeds only with a multi-core MPI run. In the attached logfile ( ) you can see the solver output before aborting. I also noted the maximum RAM usage before aborting in the log file. After all this I still don't understand why a single-, double-, or quad-core run doesn't work, because in my opinion it should.
2) I also tried to track how much RAM the problem takes up. This is what I found:
Number of nodes - problem size in RAM
121964 - 10.2 GB
238476 - 18.1 GB
471018 - 33.0 GB
911006 - 61.0 GB
1831866 - 119.5 GB
That is a bit of a setback, because it means the biggest mesh I can handle on my machine has ~1.9M nodes. This estimate is confirmed by a test run with a 2M-node mesh, which didn't work even with all cores.
I really expected the maximum problem size on my machine to be much bigger; after all, it has 128 GB of RAM.
Is there anything I could do about that besides upgrading my machine, or am I lost?
Is it possible that memory is not being deallocated properly? (I heard Fortran programs sometimes mess up the allocation/deallocation process.)
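The measurements above are close to linear in the node count, so the ceiling can be extrapolated with a simple least-squares fit; a back-of-the-envelope sketch using only the numbers from the table:

```python
# Least-squares linear fit of RAM usage (GB) against node count, using
# the measurements reported above, to extrapolate the largest mesh that
# fits into 128 GB. A rough estimate, not a model of the solver.
nodes = [121964, 238476, 471018, 911006, 1831866]
ram_gb = [10.2, 18.1, 33.0, 61.0, 119.5]

n = len(nodes)
mean_x = sum(nodes) / n
mean_y = sum(ram_gb) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(nodes, ram_gb)) \
    / sum((x - mean_x) ** 2 for x in nodes)
offset = mean_y - slope * mean_x  # fixed cost independent of mesh size

max_nodes = (128.0 - offset) / slope
print(f"memory per million nodes: {slope * 1e6:.1f} GB")
print(f"estimated ceiling at 128 GB: {max_nodes / 1e6:.2f} million nodes")
```

With these numbers the fit gives roughly 64 GB per million nodes plus a small fixed offset, and a ceiling just below 2M nodes, consistent with the failed 2M-node run.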
3) For akirahinoshiro: here is the sif file: I can't upload the mesh file here; it is 500 MB. If you really need it, I could provide a Dropbox link.
Best regards, Stefan
Re: Whitney Solver Parallel: Performance and Memory Organisation
Hi there,
I tried to find out whether there are memory leaks or memory allocation errors. To do so I recompiled fem with debug flags and investigated the code with Intel Inspector, a memory profiler.
Surprisingly, Intel Inspector found a list of memory issues, shown below:
Code: Select all
P1 Missing allocation Lists.f90 libelmersolver-7.0.so New
P2 Memory leak ElmerSolver.f90 libelmersolver-7.0.so 784 New
P3 Memory leak ElmerSolver.f90 libelmersolver-7.0.so 8667 New
P4 Memory leak [Unknown] ElmerSolver 16 New
P5 Memory leak GeneralUtils.f90 libelmersolver-7.0.so 1598188 New
P6 Memory leak Load.c libelmersolver-7.0.so 8672 New
P7 Memory leak Load.c libelmersolver-7.0.so 1896 New
P8 Memory leak MagnetoDynamics.f90 MagnetoDynamics.so 3440 New
P9 Memory leak MeshUtils.f90 libelmersolver-7.0.so 784 New
P10 Memory leak MeshUtils.f90 libelmersolver-7.0.so 2120 New
P11 Memory leak MeshUtils.f90 libelmersolver-7.0.so 107712 New
P12 Memory leak MeshUtils.f90 libelmersolver-7.0.so 13008 New
P13 Memory leak MeshUtils.f90 libelmersolver-7.0.so 32 New
P14 Memory leak MeshUtils.f90 libelmersolver-7.0.so 4128 New
P15 Memory leak MeshUtils.f90 libelmersolver-7.0.so 95104 New
P16 Memory leak ModelDescription.f90 libelmersolver-7.0.so 3584 New
P17 Memory leak ModelDescription.f90 libelmersolver-7.0.so 784 New
P18 Memory leak Solver.f90 ElmerSolver 8531 New
P19 Memory leak [Unknown] libelmersolver-7.0.so 404 New
P20 Invalid partial memory access ModelDescription.f90 libelmersolver-7.0.so New
P21 Invalid partial memory access ModelDescription.f90 libelmersolver-7.0.so New
P22 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P23 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P24 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P25 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P26 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P27 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P28 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P29 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P30 Uninitialized memory access ElementDescription.f90 libelmersolver-7.0.so New
P31 Memory not deallocated DefUtils.f90; ElementDescription.f90; ElmerSolver.f90; GeneralUtils.f90; HashTable.f90; Integration.f90; Lists.f90; Load.c; MagnetoDynamics.f90; MainUtils.f90; MeshUtils.f90; ModelDescription.f90; Solver.f90 ElmerSolver; MagnetoDynamics.so; libelmersolver-7.0.so 1384876 New
Additionally, the solver now aborts due to segmentation faults with the following error message:
Code: Select all
Image PC Routine Line Source
libelmersolver-7. 00007FF1A21FE704 lists_mp_listgetr 2616 Lists.f90
MagnetoDynamics.s 00007FF19F1BF858 magnetodynamicsca 4663 MagnetoDynamics.f90
libelmersolver-7. 00007FF1A23BC310 Unknown Unknown Unknown
libelmersolver-7. 00007FF1A23BC369 execsolver_ 532 Load.c
libelmersolver-7. 00007FF1A247B707 mainutils_mp_sing 3614 MainUtils.f90
libelmersolver-7. 00007FF1A247C655 mainutils_mp_solv 3776 MainUtils.f90
libelmersolver-7. 00007FF1A2466BE3 Unknown Unknown Unknown
libelmersolver-7. 00007FF1A246242C mainutils_mp_solv 1483 MainUtils.f90
libelmersolver-7. 00007FF1A27416B8 Unknown Unknown Unknown
libelmersolver-7. 00007FF1A2734716 elmersolver_ 628 ElmerSolver.f90
ElmerSolver 000000000040AE53 MAIN__ 271 Solver.f90
ElmerSolver 000000000040AA56 Unknown Unknown Unknown
libc.so.6 0000003A18C1ED1D Unknown Unknown Unknown
ElmerSolver 000000000040A949 Unknown Unknown Unknown
These errors are consistent with Intel Inspector's memory issues (it allows getting more detailed information than what I posted above), so I guess there has to be some problem.
I am planning to debug the whole thing and try to get rid of these errors, if they are errors at all. (I still believe this could be due to Intel Fortran compiler specifics.)
How do I best proceed? Shall I adjust the source files and then send them in if I can successfully remove the memory errors?
Or shall I post detailed reports of every single error?
Maybe I should start a new thread for that...
Please tell me what you think.
Thanks, Stefan